4-Lane Parallel SIMD Vector Processor

Project Overview

This project features a 4-lane SIMD (Single Instruction, Multiple Data) vector core processor built for dense, low-latency parallel matrix math, such as edge AI inference loops and multi-channel digital signal filtering. Designed from scratch in SystemVerilog, the architecture uses a hybrid-precision scheme: each packed 32-bit input word carries four 8-bit components to make full use of the bus, while results accumulate in sign-extended 32-bit registers so that individual products are summed without truncation.

The Engineering Bottleneck

In high-throughput vector processing, computational paths frequently face "data starvation" or stalls caused by rigid timing dependencies and mismatched producer and consumer rates. Streaming one narrow element per bus transfer also wastes bus bandwidth and leaves parallel execution units idle. On the arithmetic layer, deep accumulation loops are vulnerable to integer wrap-around errors, which can corrupt machine learning computations or introduce severe audio noise spikes. This processor addresses these problems by combining spatial hardware parallelism with strict flow control and saturation clipping.

Architectural Deep Dive

Spatial Vector Parallelism & Slicing

The top-level core accepts packed 32-bit vector buses for operands A and B. SystemVerilog generate loops slice each bus at elaboration time into four independent 8-bit data tracks, one per processing lane, so that all four lanes operate side by side in the same cycle.
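The slicing step can be illustrated with a small C++ model of the kind a reference testbench might use. This is a minimal sketch, not the project's RTL: the helper name unpack_lanes, the constant kLanes, and the assumption that lane 0 occupies bits [7:0] are all hypothetical, and the components are assumed to be signed INT8 values.

```cpp
#include <array>
#include <cstdint>

constexpr int kLanes = 4;  // four 8-bit components per 32-bit word

// Hypothetical helper: split one packed 32-bit word into four signed
// 8-bit lanes. Lane 0 is ASSUMED to sit in bits [7:0]; the real RTL
// ordering may differ.
inline std::array<int8_t, kLanes> unpack_lanes(uint32_t packed) {
    std::array<int8_t, kLanes> lanes{};
    for (int i = 0; i < kLanes; ++i) {
        // Software analogue of a generate-loop slice like packed[8*i +: 8].
        lanes[i] = static_cast<int8_t>((packed >> (8 * i)) & 0xFFu);
    }
    return lanes;
}
```

In hardware the equivalent slice costs no logic at all, since it is only wiring; the loop here exists purely to mirror that fixed wiring in software.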

Elastic Lane Sourcing & Structural Backpressure

To decouple data ingestion from execution logic, the architecture implements an array of eight parallel synchronous ring-buffer FIFOs alongside a Valid/Ready handshake protocol.

Dynamic Flow Throttling: Centralized reduction logic (!(|lane_fifo_full)) monitors buffer occupancy. If any single lane queue fills, the core immediately deasserts its r_ready handshake line, preventing upstream writes from overflowing the buffer.
Pipeline Bubble Injection: If an input buffer runs dry or the system stalls, the control logic injects execution bubbles by dropping the clock-enable signals, so the internal multiplier and accumulator registers hold their values until valid data streams resume.
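The two flow-control rules above can be sketched as a behavioral C++ model. Everything here is illustrative rather than taken from the project: the LaneFifo type, the depth of 4 entries, the reading of "eight FIFOs" as four lanes times two operands, and the pipeline_enable rule (advance only when every FIFO holds data) are assumptions.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical software model of one synchronous ring-buffer FIFO.
template <std::size_t Depth>
struct LaneFifo {
    std::array<int8_t, Depth> mem{};
    std::size_t head = 0, tail = 0, count = 0;

    bool full() const { return count == Depth; }
    bool empty() const { return count == 0; }

    bool push(int8_t v) {              // write side; refused when full
        if (full()) return false;
        mem[tail] = v;
        tail = (tail + 1) % Depth;     // wrap the write pointer
        ++count;
        return true;
    }
    bool pop(int8_t& v) {              // read side; refused when empty
        if (empty()) return false;
        v = mem[head];
        head = (head + 1) % Depth;     // wrap the read pointer
        --count;
        return true;
    }
};

constexpr std::size_t kFifos = 8;  // ASSUMED: 4 lanes x 2 operands
constexpr std::size_t kDepth = 4;  // ASSUMED depth, for illustration only
using FifoBank = std::array<LaneFifo<kDepth>, kFifos>;

// Models r_ready = !(|lane_fifo_full): ready only if no FIFO is full.
inline bool r_ready(const FifoBank& bank) {
    for (const auto& f : bank) {
        if (f.full()) return false;
    }
    return true;
}

// ASSUMED bubble rule: the datapath clock enable is high only when every
// FIFO has data; otherwise the pipeline registers hold (a bubble).
inline bool pipeline_enable(const FifoBank& bank) {
    for (const auto& f : bank) {
        if (f.empty()) return false;
    }
    return true;
}
```

A producer in this model would check r_ready before pushing a new vector, and the datapath would pop all FIFOs together only in cycles where pipeline_enable is true.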

3-Cycle Pipelined Math Core

The core computational datapath places flip-flop boundaries between the deep combinational multiplier path and the accumulation feedback loop, shortening the critical path and raising the maximum clock frequency (F_max).

Multi-Precision Execution: An INT8 multiplier stage ($8\text{-bit} \times 8\text{-bit} \rightarrow 16\text{-bit}$ product) feeds a sign-extension step that widens each product to 32 bits before it reaches the accumulation register.
Anti-Wrapping Saturation Blocks: Custom hardware clamping circuitry guards the accumulator. If a sum would exceed the 32-bit signed range, the adder clamps the result to the maximum positive (32'h7FFFFFFF) or most negative (32'h80000000) value instead of wrapping, which keeps the error bounded and preserves the sign of the result.
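The arithmetic described above can be expressed as a short C++ reference model. This is a sketch under stated assumptions, not the project's actual reference model: the function names mul_int8, sat_add32, and mac_step are hypothetical, and the overflow check uses a 64-bit intermediate sum, whereas hardware would more likely compare sign bits.

```cpp
#include <cstdint>
#include <limits>

// INT8 x INT8 -> 16-bit product, then sign-extended to 32 bits.
// The largest magnitude, (-128) * (-128) = 16384, fits in 16 signed bits.
inline int32_t mul_int8(int8_t a, int8_t b) {
    int16_t product =
        static_cast<int16_t>(static_cast<int>(a) * static_cast<int>(b));
    return static_cast<int32_t>(product);  // sign extension to 32 bits
}

// Saturating 32-bit add: clamp to 0x7FFFFFFF or 0x80000000 instead of
// wrapping. The 64-bit intermediate is a software convenience (ASSUMED),
// not a claim about how the RTL detects overflow.
inline int32_t sat_add32(int32_t acc, int32_t addend) {
    const int64_t wide =
        static_cast<int64_t>(acc) + static_cast<int64_t>(addend);
    if (wide > std::numeric_limits<int32_t>::max()) {
        return std::numeric_limits<int32_t>::max();
    }
    if (wide < std::numeric_limits<int32_t>::min()) {
        return std::numeric_limits<int32_t>::min();
    }
    return static_cast<int32_t>(wide);
}

// One multiply-accumulate step for a single lane.
inline int32_t mac_step(int32_t acc, int8_t a, int8_t b) {
    return sat_add32(acc, mul_int8(a, b));
}
```

Because each product has a magnitude of at most 16384, an accumulator starting at zero needs more than 131,000 worst-case accumulations before the clamp can engage, so saturation acts as a guard for very deep loops rather than a routine event.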

Control-Path Synchronous Realignment

The top-level controller uses a synchronous 3-stage shift register as a validation pipeline. By tracking each data word from the initial FIFO pop to the final accumulation step, the core asserts its downstream v_out flag in the same cycle that the corresponding result retires from the 3-cycle pipeline.
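A behavioral C++ sketch of such a valid shift register follows. The ValidPipe type and its tick method are hypothetical names, and the model assumes the simplest case in which a bubble is just a cycle with pop_valid low that travels down the pipe like any other entry.

```cpp
#include <array>

// Hypothetical model of a 3-stage valid shift register.
// stage[0] is the newest entry; stage[2] drives v_out.
struct ValidPipe {
    std::array<bool, 3> stage{};  // all stages start invalid

    // Models one rising clock edge. pop_valid is true when a FIFO pop
    // launched real data this cycle. Returns v_out after the edge.
    bool tick(bool pop_valid) {
        stage[2] = stage[1];
        stage[1] = stage[0];
        stage[0] = pop_valid;
        return stage[2];
    }
};
```

With this model, a pop on one edge produces v_out on the third edge counting from that pop, matching the 3-cycle latency described above.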

UVM-Inspired Co-Simulation Verification

To verify the design without commercial EDA licensing overhead, I built an automated, object-oriented C++ co-simulation framework using Verilator. The testbench instantiates a high-level mathematical reference model that runs in parallel with the compiled SystemVerilog hardware.

To handle the 3-cycle pipeline latency of the RTL, I implemented a software delay line as a C++ queue (std::queue). The queue holds back the reference model's expected values so that they line up with the cycle in which results retire on the 128-bit hardware output bus. Randomized regressions confirm bit-accurate agreement with the reference model across all four lanes simultaneously.
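The delay-line idea can be sketched as a small scoreboard class. This is an illustration only: the names DelayLineScoreboard, LaneVec, push_expected, check, and in_flight are hypothetical, the 128-bit output is assumed to be four 32-bit lane accumulators, and no Verilator calls are shown because the project's model interface is not given in the text.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <queue>

// ASSUMED layout: the 128-bit output bus as four 32-bit lane results.
using LaneVec = std::array<int32_t, 4>;

// Hypothetical scoreboard: expectations enter when an input is accepted
// and leave when the hardware asserts v_out, so the queue absorbs the
// pipeline latency between the two events.
class DelayLineScoreboard {
public:
    // Call when the reference model has computed the result for an
    // input the hardware accepted this cycle.
    void push_expected(const LaneVec& expected) { pending_.push(expected); }

    // Call in a cycle where the hardware asserts v_out.
    // Returns true only if the oldest expectation matches on all lanes.
    bool check(const LaneVec& dut_out) {
        if (pending_.empty()) return false;  // output with no matching input
        const LaneVec expected = pending_.front();
        pending_.pop();
        return expected == dut_out;
    }

    // Number of results still inside the modeled pipeline.
    std::size_t in_flight() const { return pending_.size(); }

private:
    std::queue<LaneVec> pending_;
};
```

A queue keyed on v_out rather than a fixed cycle count has one practical benefit: the comparison stays aligned even when bubbles stretch the time between an input being accepted and its result retiring.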

Performance Metrics

4 Ops/Cycle: SIMD Vector Throughput
3-Cycle: Pipelined Retirement Latency
100%: Bit-Accurate Saturation Guarded

Roadmap & Evolution

OpenLane ASIC Synthesis: Passing the verified core through the Yosys/OpenLane toolchain using the SkyWater 130nm PDK to extract post-layout static timing parameters, Worst Negative Slack (WNS), and true gate-area footprints.
AMBA AXI4-Stream Integration: Wrapping the existing Valid/Ready vector interfaces into a production-ready AXI4-Stream IP block for drop-in interoperability with embedded processors and crossbars.
