This project is a 4-lane SIMD (Single Instruction, Multiple Data) vector core processor built for dense, low-latency parallel matrix math, such as edge AI inference loops and multi-channel digital signal filtering. Designed from scratch in SystemVerilog, the architecture uses a hybrid-precision scheme: each packed 32-bit input word carries four 8-bit elements to maximize bus efficiency, while each lane accumulates into a sign-extended 32-bit register so that products are summed at full precision.
In high-throughput vector processing, compute paths often stall or starve for data because of rigid timing dependencies and mismatched memory bandwidth. Streams that carry one element per bus word also waste bandwidth that parallel lanes could use. At the arithmetic level, long accumulation loops are vulnerable to integer wrap-around, which can corrupt machine learning outputs or introduce severe noise spikes in audio. This processor addresses these problems by combining spatial hardware parallelism with strict flow control and saturating arithmetic.
The top-level core accepts operands A and B as packed 32-bit vector buses. Using generate loops, which are resolved at elaboration time, the design splits each bus into four independent 8-bit slices and routes one slice to each of the four side-by-side processing lanes.
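The lane split described above can be mirrored in a software reference model. The following is a minimal sketch, not the project's actual code: the helper names `extract_lane` and `unpack_lanes` are hypothetical, and it assumes that the 8-bit elements are signed and that lane 0 occupies the least-significant byte. The source does not state the byte order.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kLanes = 4;  // four 8-bit elements per packed 32-bit word

// Extract one signed 8-bit element from a packed word.
// Assumption: lane 0 sits in bits [7:0] and lane 3 in bits [31:24].
inline int8_t extract_lane(uint32_t packed, int lane) {
    assert(lane >= 0 && lane < kLanes);
    const uint8_t raw = static_cast<uint8_t>((packed >> (8 * lane)) & 0xFFu);
    return static_cast<int8_t>(raw);  // reinterpret the byte as two's complement
}

// Split a packed word into its four lane elements, like the generate loop.
inline std::array<int8_t, kLanes> unpack_lanes(uint32_t packed) {
    std::array<int8_t, kLanes> lanes{};
    for (int lane = 0; lane < kLanes; ++lane) {
        lanes[lane] = extract_lane(packed, lane);
    }
    return lanes;
}
```

Under these assumptions, calling `unpack_lanes(0x80FF7F01u)` yields 1, 127, -1 and -128 for lanes 0 through 3.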
To safely isolate data ingestion from execution logic, the architecture implements an array of eight parallel synchronous ring-buffer FIFOs alongside a rigid Valid/Ready Handshake Protocol.
Backpressure comes from a negated reduction-OR over the per-lane full flags, !(|lane_fifo_full), that monitors buffer capacities. If any single lane queue fills, the core immediately deasserts its r_ready handshake line, preventing upstream writes from overflowing the buffers.

The computational datapath separates the deep combinational multiplier from the feedback accumulation loop with a register stage, shortening the critical path and raising the achievable clock frequency (Fmax).
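The backpressure rule can be modeled in a few lines of C++. This is only a sketch under stated assumptions: the full flags are packed into a bitmask with one bit per FIFO, the default of eight bits follows the eight FIFOs mentioned in the text, and the names `compute_r_ready`, `write_accepted` and `v_in` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Model of r_ready = !(|lane_fifo_full): ready only when no FIFO is full.
// Assumption: bit i of lane_fifo_full is the full flag of FIFO i.
inline bool compute_r_ready(uint32_t lane_fifo_full, unsigned num_fifos = 8) {
    assert(num_fifos >= 1 && num_fifos <= 32);
    const uint32_t mask =
        (num_fifos == 32) ? 0xFFFFFFFFu : ((1u << num_fifos) - 1u);
    return (lane_fifo_full & mask) == 0u;  // reduction-OR, then invert
}

// A valid/ready transfer happens only when both sides agree.
// v_in is a hypothetical name for the upstream valid signal.
inline bool write_accepted(bool v_in, bool r_ready) {
    return v_in && r_ready;
}
```

With a single full flag set, for example `compute_r_ready(0x10u)`, the function returns false, so `write_accepted` rejects the word regardless of the upstream valid signal.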
When an accumulation would overflow, the result saturates at the positive (32'h7FFFFFFF) or negative (32'h80000000) boundary instead of wrapping around, keeping the output bounded and predictable.

The top-level controller uses a synchronous 3-stage shift register as a valid pipeline. By tracking data from the initial FIFO pop through to the final accumulation step, it asserts the downstream v_out flag in step with the 3-cycle retirement latency.
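Both behaviors, the saturating accumulation and the valid shift register, are easy to express in a software model. The sketch below is illustrative rather than a transcription of the RTL: `sat_add32`, `mac_step` and `ValidPipeline` are hypothetical names, the operands are assumed to be signed 8-bit values, and the exact assignment of work to pipeline stages is not taken from the source.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Signed 32-bit add that clamps to 0x7FFFFFFF / 0x80000000 instead of wrapping.
inline int32_t sat_add32(int32_t acc, int32_t addend) {
    const int64_t wide = static_cast<int64_t>(acc) + static_cast<int64_t>(addend);
    if (wide > std::numeric_limits<int32_t>::max()) return std::numeric_limits<int32_t>::max();
    if (wide < std::numeric_limits<int32_t>::min()) return std::numeric_limits<int32_t>::min();
    return static_cast<int32_t>(wide);
}

// One multiply-accumulate step for a single lane (assumed signed 8-bit inputs).
inline int32_t mac_step(int32_t acc, int8_t a, int8_t b) {
    const int32_t product = static_cast<int32_t>(a) * static_cast<int32_t>(b);
    return sat_add32(acc, product);
}

// Shift register that delays a valid bit by a fixed number of clock edges.
class ValidPipeline {
public:
    explicit ValidPipeline(unsigned stages = 3) : stages_(stages) {
        assert(stages >= 1 && stages <= 32);
    }
    // Advance one clock edge: shift in fifo_pop and return the last stage,
    // which plays the role of v_out.
    bool tick(bool fifo_pop) {
        bits_ = (bits_ << 1) | (fifo_pop ? 1u : 0u);
        return ((bits_ >> (stages_ - 1)) & 1u) != 0u;
    }
private:
    unsigned stages_;
    uint32_t bits_ = 0;
};
```

In this model a pop shifted in on one call to `tick` appears at the output on the third call, matching a three-register delay. Widening the sum to 64 bits before clamping avoids undefined behavior from signed overflow in C++.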
To establish full sign-off without commercial EDA licensing overhead, I constructed an automated, object-oriented C++ Co-Simulation Framework using Verilator. The testbench instantiates a high-level mathematical reference model that runs in parallel with the compiled SystemVerilog hardware.
To account for the 3-cycle pipeline latency of the RTL, I built a software delay line (a std::queue) in the C++ testbench. The queue holds back the reference model's expected values so that they line up with the retirement of results on the 128-bit hardware output bus. Randomized regressions run against this setup confirm bit-accurate agreement with the reference model across all four lanes.
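A delay line of this kind might look like the following sketch. It is not the project's testbench code: `LaneVec`, `Expected`, `DelayLine` and `matches` are hypothetical names, and it assumes one expectation is pushed per clock cycle, with idle cycles marked invalid, so that a value pushed on one tick is returned on the third tick when the latency is 3.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>

using LaneVec = std::array<int32_t, 4>;  // 4 x 32 bits = the 128-bit output bus

// Expected output for one cycle; valid is false for idle (bubble) cycles.
struct Expected {
    bool valid;
    LaneVec data;
};

class DelayLine {
public:
    explicit DelayLine(std::size_t latency) : latency_(latency) {
        assert(latency >= 1);
    }
    // Call once per clock with this cycle's expectation.
    // Returns the expectation that is due to retire on this cycle.
    Expected tick(const Expected& now) {
        q_.push(now);
        if (q_.size() < latency_) {
            return Expected{false, LaneVec{}};  // pipeline still filling
        }
        const Expected due = q_.front();
        q_.pop();
        return due;
    }
private:
    std::size_t latency_;
    std::queue<Expected> q_;
};

// Compare a delayed expectation with what the hardware drove this cycle.
inline bool matches(const Expected& e, bool v_out, const LaneVec& hw) {
    return e.valid == v_out && (!v_out || e.data == hw);
}
```

In a Verilator loop, the testbench would call `tick` once per simulated clock and pass the result to `matches` together with the sampled v_out flag and output bus.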
SIMD Vector Throughput: four lanes processed in parallel
Pipelined Retirement Latency: 3 cycles
Bit-Accurate Saturation Guarded: clamps at 32'h7FFFFFFF and 32'h80000000