Apex Compute

Contributors: Hasan Unlu, Siqin Liu, Tin Nguyen, Rohit Rao, Dave Wei, Hiruna Vishwamith, Yinuo Zhao

Contact: hunlu@apexcompute.com, siqin.liu@apexcompute.com, tin.nguyen@apexcompute.com, rohit@apexcompute.com, dave.wei@apexcompute.com, hiruna@apexcompute.com, yinuo.zhao@apexcompute.com

⚙️ Hardware Architecture Update v1.1 (3fa1735.bin)

FPGA Board
🛒 Purchase FPGA Board with Unified Engine IP Block for $49.99
Includes ongoing hardware design updates so you always have the latest architecture.

Discord

XDMA Driver Setup and Usage Guide

This guide covers installation and usage of the Xilinx XDMA driver for PCIe-based FPGA communication.
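Once the driver is loaded (steps below), host software talks to the FPGA through the character devices the XDMA driver creates; by default these include /dev/xdma0_h2c_0 (host-to-card) and /dev/xdma0_c2h_0 (card-to-host). As a rough sketch of how a transfer looks from Python — device names are the driver's defaults and the helper names are our own, not part of this repo:

```python
import os

def dma_write(data, card_addr, dev="/dev/xdma0_h2c_0"):
    """Copy `data` into card memory at `card_addr` via a host-to-card channel.
    The file offset maps to the card-side address."""
    fd = os.open(dev, os.O_WRONLY)
    try:
        os.pwrite(fd, data, card_addr)
    finally:
        os.close(fd)

def dma_read(length, card_addr, dev="/dev/xdma0_c2h_0"):
    """Copy `length` bytes of card memory at `card_addr` back to the host
    via a card-to-host channel."""
    fd = os.open(dev, os.O_RDONLY)
    try:
        return os.pread(fd, length, card_addr)
    finally:
        os.close(fd)
```

For example, `dma_write(b"\x01\x02", 0x1000)` followed by `dma_read(2, 0x1000)` round-trips two bytes through card memory at address 0x1000.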

Prerequisites

  • Kernel headers installed: sudo apt install linux-headers-$(uname -r)

Installation

1. Install XDMA Driver from Xilinx Repository

Clone the official Xilinx DMA driver repository:

git clone https://github.com/Xilinx/dma_ip_drivers.git
cd dma_ip_drivers/XDMA/linux-kernel/xdma
sudo make install

Tip: If sudo make install fails, you may need to disable Secure Boot in your BIOS settings.

2. Load the Driver

Load the XDMA driver with interrupt mode 0 (auto-detect):

sudo insmod /lib/modules/$(uname -r)/xdma/xdma.ko interrupt_mode=0

3. Load the Driver Every Boot Automatically (Recommended)

Run the following script to install a systemd service that loads the driver on every boot:

# 1. Remove any conflicting configs
sudo rm -f /etc/modprobe.d/blacklist-xdma.conf \
           /etc/modprobe.d/xdma.conf \
           /etc/modules-load.d/xdma.conf

# 2. Create systemd service
sudo tee /etc/systemd/system/xdma.service << 'EOF'
[Unit]
Description=Xilinx XDMA Driver
After=local-fs.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c '/sbin/insmod /lib/modules/$(uname -r)/xdma/xdma.ko || true'
ExecStartPost=/bin/sh -c 'chmod 666 /dev/xdma*'
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF

# 3. Enable and start
sudo systemctl daemon-reload
sudo systemctl enable xdma
sudo systemctl restart xdma

# 4. Verify
sudo systemctl status xdma
ls -la /dev/xdma* | head -5

4. Set Up Python Environment

python3 -m venv ~/my_torch_env
source ~/my_torch_env/bin/activate
pip install -r requirements.txt

5. Run Hardware Tests

python3 user_hw_test.py

6. Run Gemma3 Inference (requires Hugging Face)

The Gemma3 test downloads the gated google/gemma-3-1b-it model from Hugging Face. You need to:

  1. Create a Hugging Face account at https://huggingface.co
  2. Accept the Gemma license at https://huggingface.co/google/gemma-3-1b-it
  3. Create an access token at https://huggingface.co/settings/tokens
  4. Log in from the command line:
pip install huggingface-hub
huggingface-cli login

Then run:

python3 models/gemma3/gemma3_test.py --prompt "your prompt"

7. Updating HW bin file

python3 update_flash.py update_xxxxxxxx.bin

Then cold-reboot the PC so the new image takes effect.


Apex Compute Unified Engine v1.1 — Benchmark Results

All benchmarks were collected on RTL running on a Kintex UltraScale+ FPGA in real time.

Benchmark Datasheet

📄 Download Benchmark Datasheet (PDF)

| Specification | Value |
| --- | --- |
| Engine frequency | 333 MHz |
| Theoretical peak (BF16) | 42 GFLOPS/s |
| Memory interface | DDR4 @ 1333 MHz, 32-bit |
| AXI master data width | 256 bits |
| On-chip SRAM | 1.05 MB |
| Total power | 4.5 W |
| BF16 MatMul | 40.17 GFLOPS/s (95.6% utilization) |
| BF16 MatMul + Bias + Activation | 40.03 GFLOPS/s (95.3% utilization) |
| BF16 Softmax MatMul | 37.76 GFLOPS/s (89.9% utilization) |
| Memory-Efficient Attention | ~90% utilization |
| Quantized MatMul (BF16 × INT4/FP4) | 40.03 GFLOPS/s (95.3% utilization) |
| Quantized MatVec, streaming matrix, decode-friendly (BF16 × INT4/FP4) | 31.33 GFLOPS/s (74.6% utilization) |
| RMSNorm | 4.81 GFLOPS/s |
| LayerNorm | 5.90 GFLOPS/s |
| Quantize (BF16 → INT4/FP4) | 5.72 GFLOPS/s |
| Dequantize (INT4/FP4 → BF16) | 3.31 GFLOPS/s |
| Hardware trace buffer | 8,192 timestamps |
| Multi-engine tensor parallelism | Supported via Synchronization Flag instructions |

FPGA Presilicon Prototype Setup

System Parameters

| Parameter | Value |
| --- | --- |
| Memory interface | DDR4 at 1333 MHz, 32-bit data path |
| Engine frequency | 333 MHz |
| Memory interface clock | Synchronized 1:1 with engine clock |
| Data width | 256 bits |
| Total power consumption | 4.5 W |
| Total on-chip SRAM | 1.05 MB |

Peak Operation Rate

At 333 MHz, the engine delivers approximately 42 GFLOPS/s of peak floating-point throughput.
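As a back-of-the-envelope sanity check (arithmetic on the quoted figures only, not a statement about the actual datapath), dividing the peak rate by the clock gives the implied per-cycle throughput:

```python
peak_flops = 42e9   # quoted theoretical peak, FLOPs per second
clock_hz = 333e6    # quoted engine frequency

flops_per_cycle = peak_flops / clock_hz  # ~126 FLOPs per cycle
fma_per_cycle = flops_per_cycle / 2      # ~63 FMAs per cycle at 2 FLOPs/FMA
```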

FPGA Resource Utilization

| Name | CLB LUTs | CLB Registers | Block RAM Tile | URAM | DSPs |
| --- | --- | --- | --- | --- | --- |
| unified_engine_top | 78,348 | 50,045 | 16 | 30 | 197 |

FLOPS Definitions

| Operation | FLOPS |
| --- | --- |
| FMA (Fused Multiply-Add) | 2 |
| Addition / Multiplication | 1 |
| Exponent | 1 |
| Division | 1 |
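These per-element costs produce the FLOP formulas used in the benchmark tables. For example, a fused GELU(A Bᵀ + C) at M = K = N = 1024, with GELU costed at 4 FLOPs per output element as in the benchmark rows:

```python
M = K = N = 1024

matmul = 2 * M * K * N  # MKN FMAs at 2 FLOPs each
bias = M * N            # one addition per output element
gelu = 4 * M * N        # GELU costed at 4 FLOPs per output element
total = matmul + bias + gelu  # 2MKN + MN + 4MN
```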

BF16 Operation Benchmarks

Engine speed: 333 MHz; theoretical peak: 42 GFLOPS/s. Metrics are based on M=1024, K=1024, N=1024; O denotes the output tensor. All matrix-matrix operations reach up to 95% FLOPS utilization.

| Op | Operands | FLOPS | Cycles (latency) | Achieved GFLOPS/s |
| --- | --- | --- | --- | --- |
| A Bᵀ | A[M,K], B[N,K] → O[M,N] | 2MKN | 17,820,455 (53.3 ms) | 40.17 |
| A Bᵀ + C | A[M,K], B[N,K], C[M,N] → O[M,N] | 2MKN + MN | 17,858,564 (53.5 ms) | 40.10 |
| GELU(A Bᵀ) | A[M,K], B[N,K] → O[M,N] | 2MKN + 4MN | 17,923,045 (53.7 ms) | 40.02 |
| GELU(A Bᵀ + C) | A[M,K], B[N,K], C[M,N] → O[M,N] | 2MKN + MN + 4MN | 17,927,850 (53.7 ms) | 40.03 |
| SiLU(A Bᵀ) | A[M,K], B[N,K] → O[M,N] | 2MKN + 4MN | 17,921,594 (53.7 ms) | 40.02 |
| SiLU(A Bᵀ + C) | A[M,K], B[N,K], C[M,N] → O[M,N] | 2MKN + MN + 4MN | 17,926,623 (53.7 ms) | 40.03 |
| softmax(A Bᵀ) | A[M,K], B[N,K] → O[M,N] | 2MKN + 5MN | 19,004,997 (57.01 ms) | 37.76 |
| softmax(A Bᵀ + C) | A[M,K], B[N,K], C[M,N] → O[M,N] | 2MKN + MN + 5MN | 19,051,310 (57.15 ms) | 37.68 |
| Aᵀ | A[M,N] → O[N,M] | 0 | 1,648,647 (4.9 ms) | N/A |
| A · scalar | A[M,N] → O[M,N] | MN | 180,500 (541 µs) | 1.94 |
| A + scalar | A[M,N] → O[M,N] | MN | 181,005 (543 µs) | 1.93 |
| A · B | A[M,N], B[M,N] → O[M,N] | MN | 263,580 (790 µs) | 1.33 |
| A + B | A[M,N], B[M,N] → O[M,N] | MN | 263,871 (791 µs) | 1.33 |
| RMSNorm(A) · γ | A[M,N], γ[N] → O[M,N] | 4MN | 290,945 (872 µs) | 4.81 |
| LayerNorm(A) · γ + β | A[M,N], γ[N], β[N] → O[M,N] | 7MN | 414,679 (1.24 ms) | 5.90 |
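The achieved-GFLOPS/s and utilization figures can be reproduced from the cycle counts. For the plain A Bᵀ row, using the 333 MHz clock (small differences from the table are rounding in the quoted frequency):

```python
clock_hz = 333e6
flops = 2 * 1024**3   # 2MKN for M = K = N = 1024
cycles = 17_820_455   # measured latency of A Bᵀ

seconds = cycles / clock_hz
achieved = flops / seconds      # ~40.1e9 FLOPs per second
utilization = achieved / 42e9   # ~0.955 of theoretical peak
```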

Memory-Efficient Attention

The following kernel computes the attention block for given query/key/value tensors and an optional mask or bias. It reaches almost 90% utilization of theoretical FLOPS.

memory_efficient_attention(q, k, v, mask_or_bias)

Equivalent PyTorch reference:

import math
import torch

def memory_efficient_attention(q, k, v, attn_bias=None):
    # Scale by 1/sqrt(head_dim); head_dim is the last dimension of q
    scale = 1.0 / math.sqrt(q.shape[-1])
    attn_weights = (q @ k.transpose(-2, -1)) * scale
    if attn_bias is not None:
        attn_weights = attn_weights + attn_bias
    scores = torch.softmax(attn_weights, dim=-1)
    return scores @ v

Flash attention benchmark — bias off

Flash attention benchmark — bias on

Quantized Operation Benchmarks

Engine speed: 333 MHz; theoretical peak: 42 GFLOPS/s. In quantized mode, achieved FLOPS are the same for any M; in contrast, for tiled matrix-matrix multiplication, smaller M reduces FLOPS utilization. fp4 refers to nvfp4 (NVIDIA FP4).

Metrics are based on M=1024, K=1024, N=1024.

| Op | Precision | Operands | FLOPS | Cycles (latency) | Achieved GFLOPS/s |
| --- | --- | --- | --- | --- | --- |
| A Bᵀ | A(bf16) B(int4/fp4) O(bf16) | A[M,K], B[N,K] → O[M,N] | 2MKN | 22,849,177 (68.5 ms) | 31.33 |
| A Bᵀ + C | A(bf16) B(int4/fp4) C(bf16) O(bf16) | A[M,K], B[N,K], C[M,N] → O[M,N] | 2MKN + MN | 23,073,635 (69.2 ms) | 31.04 |
| GELU(A Bᵀ) | A(bf16) B(int4/fp4) O(bf16) | A[M,K], B[N,K] → O[M,N] | 2MKN + 4MN | 22,850,336 (68.5 ms) | 31.39 |
| GELU(A Bᵀ + C) | A(bf16) B(int4/fp4) C(bf16) O(bf16) | A[M,K], B[N,K], C[M,N] → O[M,N] | 2MKN + MN + 4MN | 23,100,231 (69.3 ms) | 31.06 |
| SiLU(A Bᵀ) | A(bf16) B(int4/fp4) O(bf16) | A[M,K], B[N,K] → O[M,N] | 2MKN + 4MN | 22,850,243 (68.5 ms) | 31.39 |
| SiLU(A Bᵀ + C) | A(bf16) B(int4/fp4) C(bf16) O(bf16) | A[M,K], B[N,K], C[M,N] → O[M,N] | 2MKN + MN + 4MN | 23,104,094 (69.3 ms) | 31.06 |

Quantization / Dequantization (N=131,072)

| Op | Precision | Operands | FLOPS | Cycles (latency) | Achieved GFLOPS/s |
| --- | --- | --- | --- | --- | --- |
| Quantize(A) | A(bf16) O(int4/fp4) | A[N] → O[N] | 2N | 15,266 (45.8 µs) | 5.72 |
| Dequantize(A) | A(int4/fp4) O(bf16) | A[N] → O[N] | N | 13,193 (39.5 µs) | 3.31 |
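For orientation, the shape of a symmetric per-tensor INT4 scheme in plain Python — illustrative only; the engine's actual quantization format (and nvfp4 in particular) may differ:

```python
def quantize_int4(values):
    """Map floats to signed 4-bit integers [-8, 7] with one per-tensor scale.
    Illustrative sketch, not the engine's exact scheme."""
    scale = max(abs(v) for v in values) / 7.0
    q = [max(-8, min(7, round(v / scale))) for v in values]
    return q, scale

def dequantize_int4(q, scale):
    """Recover approximate floats from the 4-bit codes and the scale."""
    return [v * scale for v in q]
```

A round trip loses at most half a quantization step per element, which is what the Dequantize row's bf16 output then carries forward.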

Trace Buffer and Tensor Parallelism

The engine includes a hardware trace buffer capable of recording 8,192 timestamps, allowing cycle-accurate profiling of kernel execution. This is useful for experimenting with tensor parallelism across multiple engines.

The example below demonstrates splitting a 256×2048 @ 2048×1024 matrix multiplication across two engines:

  • Engine 0: 192×2048 @ 2048×1024 (larger partition)
  • Engine 1: 64×2048 @ 2048×1024 (smaller partition)

Because the two partitions have unequal workloads, the smaller partition finishes before the larger one. A hardware synchronization flag is used to hold the faster engine until both are complete before proceeding to the next stage. The trace visualization below shows this synchronization in action — the idle gap on Engine 1 is where it waits for Engine 0 to finish.
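The imbalance is easy to quantify with a sketch that assumes per-engine time scales linearly with its share of M (variable names are ours, for illustration):

```python
M, K, N = 256, 2048, 1024
partitions = [192, 64]  # rows of M assigned to Engine 0 and Engine 1

work = [2 * m * K * N for m in partitions]  # FLOPs per engine (2mKN each)

# With equal per-engine throughput, Engine 1 finishes in 64/192 = 1/3 of
# Engine 0's time and waits at the synchronization flag for the remainder.
idle_fraction = 1 - min(work) / max(work)
```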

Trace buffer visualization — 256×2048 @ 2048×1024 split across two engines with hardware synchronization
