Hardware Accelerator

RTL Architecture • SystemVerilog • ASIC Design • 2025

SystemVerilog • Datapath Design • Elastic Buffers • Digital Filtering

Project Overview

This project is a basic hardware accelerator for Multiply-Accumulate (MAC) operations, the core arithmetic of many DSP and AI workloads. Architected from scratch in SystemVerilog, the module operates independently of the main CPU. It uses an autonomous pipeline and elastic data buffering to sustain one MAC operation per clock cycle.

The Engineering Bottleneck

In digital signal processing, compute units are often starved of data when memory fetches cannot keep pace with execution. In addition, long accumulation loops in a standard ALU can overflow the integer range and wrap around, which flips the sign of the result: phase inversion in audio, or corrupted weights in AI models. This project addresses both problems at once: the rate mismatch between memory and compute, and arithmetic instability.

Architectural Deep Dive

Elastic Operand Buffering (Sync FIFO)

Designed a dual-buffer system to decouple data ingestion from the processing core.

  • Optimized State Logic: Used n+1 bit read/write pointers for a buffer of 2^n entries. The extra most significant bit records wrap-around, so the hardware derives "Empty" (all pointer bits equal) and "Full" (address bits equal, wrap bits different) purely from a bitwise comparison, with no separate occupancy counter and its added area and delay.
  • Synchronized Dispatch: Operands A and B stream into their buffers independently and may arrive at different times; the top-level controller aligns them before dispatching each pair into the MAC datapath.
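The pointer scheme above can be sketched as a small behavioral model. This is a Python illustration, not the project's SystemVerilog; the class name `SyncFifo`, the `addr_bits` parameter, and the default depth are assumptions made for the example.

```python
class SyncFifo:
    """Behavioral model of a synchronous FIFO with (n+1)-bit pointers.

    Hypothetical sketch: names and the default depth are assumptions,
    not taken from the project's RTL.
    """

    def __init__(self, addr_bits: int = 3):
        self.depth = 1 << addr_bits                  # 2**n storage slots
        self.addr_mask = self.depth - 1              # low n bits: slot address
        self.ptr_mask = (1 << (addr_bits + 1)) - 1   # pointers are n+1 bits wide
        self.mem = [0] * self.depth
        self.wr_ptr = 0
        self.rd_ptr = 0

    @property
    def empty(self) -> bool:
        # All n+1 bits match: same slot and same lap.
        return self.wr_ptr == self.rd_ptr

    @property
    def full(self) -> bool:
        # Only the wrap bit (MSB) differs: same slot, writer one lap ahead.
        return (self.wr_ptr ^ self.rd_ptr) == self.depth

    def push(self, value: int) -> bool:
        """Write one word; returns False (and drops nothing) when full."""
        if self.full:
            return False
        self.mem[self.wr_ptr & self.addr_mask] = value
        self.wr_ptr = (self.wr_ptr + 1) & self.ptr_mask
        return True

    def pop(self):
        """Read one word; returns None when empty."""
        if self.empty:
            return None
        value = self.mem[self.rd_ptr & self.addr_mask]
        self.rd_ptr = (self.rd_ptr + 1) & self.ptr_mask
        return value
```

Both flags come from comparing the two pointers, so the model never tracks how many words are stored.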

2-Stage Saturated MAC Engine

The core computational unit, kept minimal to maximize the achievable clock frequency.

  • Pipeline Segregation: Divided the math into two distinct stages (Cycle 1: Multiplier Array, Cycle 2: Accumulator) to shorten the critical path and boost the maximum achievable clock frequency (Fmax).
  • Hardware-Level Clipping: Integrated custom saturation logic. Instead of rolling over, values that exceed the 32-bit signed limit are clamped to MAX_POS or MAX_NEG, ensuring mathematically stable output streams.
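A cycle-level sketch of the two stages and the clamp, written in Python as a behavioral stand-in for the RTL. The operand width is not stated above, so 16-bit signed operands are assumed here (their product fits in 32 bits); `SaturatingMac`, `clock`, and `saturate32` are hypothetical names.

```python
MAX_POS = 0x7FFFFFFF     # largest 32-bit signed value
MAX_NEG = -0x80000000    # smallest 32-bit signed value


def saturate32(value: int) -> int:
    """Clamp an unbounded integer into the 32-bit signed range."""
    return max(MAX_NEG, min(MAX_POS, value))


class SaturatingMac:
    """Two-stage MAC model: stage 1 multiplies, stage 2 accumulates.

    Assumption: 16-bit signed operands, so a * b fits in 32 bits.
    """

    def __init__(self):
        self.product_reg = 0       # pipeline register between the stages
        self.stage1_valid = False  # marks product_reg as holding real data
        self.acc = 0               # accumulator register

    def clock(self, a: int = 0, b: int = 0, enable: bool = False) -> int:
        """Advance one clock edge and return the accumulator afterwards."""
        # Stage 2 consumes the product captured on the previous edge.
        if self.stage1_valid:
            self.acc = saturate32(self.acc + self.product_reg)
        # Stage 1 captures the new product for use on the next edge.
        self.product_reg = a * b
        self.stage1_valid = enable
        return self.acc
```

Because stage 2 reads the register that stage 1 wrote one edge earlier, a new operand pair can enter on every cycle while each result still takes two edges to reach the accumulator.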

Autonomous Pipeline Control

The top-level wrapper acts as a micro-sequencer. It autonomously drives the datapath enable signals based purely on buffer occupancy. A synchronous shift-register tracks data through the pipeline, asserting the output valid flag exactly aligned with the 2-cycle algorithmic latency.
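The sequencing described above can be modeled in a few lines of Python. This is a sketch under assumptions: the function name `simulate`, the arrival sets, and the rule that a pair is dispatched whenever both buffers hold data are illustrative and not taken from the RTL.

```python
from collections import deque

PIPELINE_LATENCY = 2  # cycles, one per MAC stage


def simulate(arrivals_a, arrivals_b, cycles):
    """Return (cycles where enable is high, cycles where out_valid is high).

    arrivals_a / arrivals_b: sets of cycle numbers at which an operand
    is written into buffer A / B (hypothetical stimulus).
    """
    fifo_a, fifo_b = deque(), deque()
    valid_sr = [False] * PIPELINE_LATENCY   # shift register tracking data
    enable_cycles, valid_cycles = [], []

    for cycle in range(cycles):
        # Signals as seen during this cycle.
        enable = bool(fifo_a) and bool(fifo_b)  # both buffers hold data
        out_valid = valid_sr[-1]                # last shift-register stage
        if enable:
            enable_cycles.append(cycle)
        if out_valid:
            valid_cycles.append(cycle)

        # Clock edge: pop a pair, shift the valid bit, accept new writes.
        if enable:
            fifo_a.popleft()
            fifo_b.popleft()
        valid_sr = [enable] + valid_sr[:-1]
        if cycle in arrivals_a:
            fifo_a.append(cycle)
        if cycle in arrivals_b:
            fifo_b.append(cycle)

    return enable_cycles, valid_cycles


# Operand A arrives early, operand B late; dispatch waits for both.
enables, valids = simulate({0, 1, 2}, {2, 3, 4}, cycles=10)
```

In this run the first pair is dispatched only once both buffers hold data, and each entry in `valids` trails its `enables` entry by exactly `PIPELINE_LATENCY` cycles.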

Testbench Verification

Developed a simulation environment to verify the logic. The test suite stressed the saturation boundaries, deliberately driving the accumulator with edge-case values (e.g., 0x7FFFFFFF + 0x7FFFFFFF) to verify that the clamping logic triggered correctly without stalling the pipeline.
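The same style of boundary check can be illustrated in Python by comparing a hardware-style saturating adder against a reference computed with unbounded integers. This is an analogue of the approach, not the project's SystemVerilog testbench; the function names, the sign-bit overflow test, and the stimulus lists are assumptions made for the example.

```python
import random

MASK32 = 0xFFFFFFFF
MAX_POS = 0x7FFFFFFF
MAX_NEG = -0x80000000


def to_signed32(bits: int) -> int:
    """Interpret the low 32 bits of an integer as two's complement."""
    bits &= MASK32
    return bits - (1 << 32) if bits & 0x80000000 else bits


def sat_add_hw(a: int, b: int) -> int:
    """Saturating add using sign-bit overflow detection (assumed scheme)."""
    raw = to_signed32(a + b)  # the sum as a 32-bit adder would wrap it
    # Overflow only when both inputs share a sign and the sum's sign differs.
    if a >= 0 and b >= 0 and raw < 0:
        return MAX_POS
    if a < 0 and b < 0 and raw >= 0:
        return MAX_NEG
    return raw


def golden(a: int, b: int) -> int:
    """Reference model: exact sum, then clamp."""
    return max(MAX_NEG, min(MAX_POS, a + b))


def run_stress(trials: int = 10_000, seed: int = 1) -> int:
    """Return the number of mismatches over directed and random stimulus."""
    rng = random.Random(seed)
    edges = [MAX_POS, MAX_NEG, MAX_POS - 1, MAX_NEG + 1, 0, 1, -1]
    directed = [(a, b) for a in edges for b in edges]
    randomized = [(rng.randint(MAX_NEG, MAX_POS), rng.randint(MAX_NEG, MAX_POS))
                  for _ in range(trials)]
    return sum(sat_add_hw(a, b) != golden(a, b) for a, b in directed + randomized)
```

The directed pairs include the 0x7FFFFFFF + 0x7FFFFFFF case mentioned above, while the random pairs cover the rest of the input space.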

Performance Metrics

  • Sustained Throughput: 1 Op/Cycle
  • CPU Overhead: Zero
  • Overflow Immune: 100%

Roadmap & Evolution

  • AMBA AXI4 Integration: Transitioning the raw FIFO inputs to standard AXI4-Stream interfaces for plug-and-play capability with ARM Cortex cores.
  • UVM Verification: Upgrading the standard SystemVerilog testbench to a full Universal Verification Methodology (UVM) environment with constrained random testing.