Contributors: Hasan Unlu, Siqin Liu, Tin Nguyen, Rohit Rao, Dave Wei, Hiruna Vishwamith, Yinuo Zhao
Contact: hunlu@apexcompute.com, siqin.liu@apexcompute.com, tin.nguyen@apexcompute.com, rohit@apexcompute.com, dave.wei@apexcompute.com, hiruna@apexcompute.com, yinuo.zhao@apexcompute.com
⚙️ Hardware Architecture Update v1.1 (3fa1735.bin)

🛒 Purchase FPGA Board with Unified Engine IP Block for $49.99
Includes ongoing hardware design updates so you always have the latest architecture.
This guide covers installation and usage of the Xilinx XDMA driver for PCIe-based FPGA communication.

Prerequisites:
- Kernel headers installed:

```shell
sudo apt install linux-headers-$(uname -r)
```
Clone the official Xilinx DMA driver repository, then build and install the kernel module:

```shell
git clone https://github.com/Xilinx/dma_ip_drivers.git
cd dma_ip_drivers/XDMA/linux-kernel/xdma
sudo make install
```

Tip: If `sudo make install` fails, you may need to disable Secure Boot in your BIOS settings.
Load the XDMA driver with interrupt mode 0 (auto-detect):

```shell
sudo insmod /lib/modules/$(uname -r)/xdma/xdma.ko interrupt_mode=0
```

To load the driver automatically at boot and set device permissions, apply the following script:
```shell
# 1. Remove any conflicting configs
sudo rm -f /etc/modprobe.d/blacklist-xdma.conf \
           /etc/modprobe.d/xdma.conf \
           /etc/modules-load.d/xdma.conf

# 2. Create systemd service
sudo tee /etc/systemd/system/xdma.service << 'EOF'
[Unit]
Description=Xilinx XDMA Driver
After=local-fs.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c '/sbin/insmod /lib/modules/$(uname -r)/xdma/xdma.ko || true'
ExecStartPost=/bin/sh -c 'chmod 666 /dev/xdma*'
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF

# 3. Enable and start
sudo systemctl daemon-reload
sudo systemctl enable xdma
sudo systemctl restart xdma

# 4. Verify
sudo systemctl status xdma
```
```shell
ls -la /dev/xdma* | head -5
```

Set up a Python virtual environment and install the test dependencies:

```shell
python3 -m venv ~/my_torch_env
source ~/my_torch_env/bin/activate
pip install -r requirements.txt
```

Run the hardware test:

```shell
python3 user_hw_test.py
```

The Gemma3 test downloads the gated google/gemma-3-1b-it model from Hugging Face. You need to:
- Create a Hugging Face account at https://huggingface.co
- Accept the Gemma license at https://huggingface.co/google/gemma-3-1b-it
- Create an access token at https://huggingface.co/settings/tokens
- Log in from the command line:
```shell
pip install huggingface-hub
huggingface-cli login
```

Then run:

```shell
python3 models/gemma3/gemma3_test.py --prompt "your prompt"
```

To flash a hardware architecture update, run the update script with the downloaded binary:

```shell
python3 update_flash.py update_xxxxxxxx.bin
```
Cold reboot the PC.
All benchmarks were collected on RTL running on a Kintex UltraScale+ FPGA in real time.
📄 Download Benchmark Datasheet (PDF)
| Specification | Value |
|---|---|
| Engine frequency | 333 MHz |
| Theoretical peak (BF16) | 42 GFLOPS/s |
| Memory interface | DDR4 @ 1333 MHz, 32-bit |
| AXI Master Data Width | 256 bits |
| On-chip SRAM | 1.05 MB |
| Total power | 4.5 W |
| BF16 MatMul | 40.17 GFLOPS/s (95.6% utilization) |
| BF16 MatMul + Bias + Activation | 40.03 GFLOPS/s (95.3% utilization) |
| BF16 Softmax MatMul | 37.76 GFLOPS/s (89.9% utilization) |
| Memory-Efficient Attention | ~90% utilization |
| Quantized MatMul (BF16 × INT4/FP4) | 40.03 GFLOPS/s (95.3% utilization) |
| Quantized MatVec (Streaming matrix, decoding mode friendly) (BF16 × INT4/FP4) | 31.33 GFLOPS/s (74.6% utilization) |
| RMSNorm | 4.81 GFLOPS/s |
| LayerNorm | 5.90 GFLOPS/s |
| Quantize (BF16 → INT4/FP4) | 5.72 GFLOPS/s |
| Dequantize (INT4/FP4 → BF16) | 3.31 GFLOPS/s |
| Hardware trace buffer | 8,192 timestamps |
| Multi-engine tensor parallelism | Supported with Synchronization Flag instructions |
| Parameter | Value |
|---|---|
| Memory interface | DDR4 at 1333 MHz, 32-bit data path |
| Engine frequency | 333 MHz |
| Memory interface clock | Synchronized 1:1 with engine clock |
| Data width | 256 bits |
| Total power consumption | 4.5 W |
| Total on-chip SRAM | 1.05 MB |
At 333 MHz, the engine's total floating-point throughput is approximately 42 GFLOPS/s.
| Name | CLB LUTs | CLB Registers | Block RAM Tile | URAM | DSPs |
|---|---|---|---|---|---|
| unified_engine_top | 78,348 | 50,045 | 16 | 30 | 197 |
| Operation | FLOPS |
|---|---|
| FMA (Fused Multiply-Add) | 2 |
| Addition / Multiplication | 1 |
| Exponent | 1 |
| Division | 1 |
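Using these per-op costs, the FLOP totals in the benchmark tables can be reproduced. For example, softmax(A Bᵀ) at M=K=N=1024 (the exact breakdown of the 5 FLOPs per softmax output element is not stated here; max, subtract, exp, sum, and divide at 1 FLOP each is one plausible accounting):

```python
# FLOP counting for softmax(A @ B.T), following the per-op costs above
# (FMA = 2 FLOPs; add, multiply, exponent, division = 1 FLOP each).
M = K = N = 1024

matmul_flops = 2 * M * K * N   # one FMA per multiply-accumulate
softmax_flops = 5 * M * N      # 5 FLOPs per output element
total = matmul_flops + softmax_flops
print(total)  # → 2152726528
```

This matches the 2MKN + 5MN entry in the softmax(A Bᵀ) row below.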
Engine speed: 333 MHz; theoretical peak: 42 GFLOPS/s. Metrics are based on M=1024, K=1024, N=1024. O denotes the output tensor. All matrix-matrix operations reach up to ~95% FLOPS utilization.
| Op | Operands | FLOPS | Cycles (latency) | Achieved GFLOPS/s |
|---|---|---|---|---|
| A Bᵀ | A[M,K], B[N,K] → O[M,N] | 2MKN | 17,820,455 (53.3 ms) | 40.17 |
| A Bᵀ + C | A[M,K], B[N,K], C[M,N] → O[M,N] | 2MKN + MN | 17,858,564 (53.5 ms) | 40.10 |
| GELU(A Bᵀ) | A[M,K], B[N,K] → O[M,N] | 2MKN + 4MN | 17,923,045 (53.7 ms) | 40.02 |
| GELU(A Bᵀ + C) | A[M,K], B[N,K], C[M,N] → O[M,N] | 2MKN + MN + 4MN | 17,927,850 (53.7 ms) | 40.03 |
| SiLU(A Bᵀ) | A[M,K], B[N,K] → O[M,N] | 2MKN + 4MN | 17,921,594 (53.7 ms) | 40.02 |
| SiLU(A Bᵀ + C) | A[M,K], B[N,K], C[M,N] → O[M,N] | 2MKN + MN + 4MN | 17,926,623 (53.7 ms) | 40.03 |
| softmax(A Bᵀ) | A[M,K], B[N,K] → O[M,N] | 2MKN + 5MN | 19,004,997 (57.01 ms) | 37.76 |
| softmax(A Bᵀ + C) | A[M,K], B[N,K], C[M,N] → O[M,N] | 2MKN + MN + 5MN | 19,051,310 (57.15 ms) | 37.68 |
| Aᵀ | A[M,N] → O[N,M] | 0 | 1,648,647 (4.9 ms) | N/A |
| A · scalar | A[M,N] → O[M,N] | MN | 180,500 (541 µs) | 1.94 |
| A + scalar | A[M,N] → O[M,N] | MN | 181,005 (543 µs) | 1.93 |
| A · B | A[M,N], B[M,N] → O[M,N] | MN | 263,580 (790 µs) | 1.33 |
| A + B | A[M,N], B[M,N] → O[M,N] | MN | 263,871 (791 µs) | 1.33 |
| RMSNorm(A) · γ | A[M,N], γ[N] → O[M,N] | 4MN | 290,945 (872 µs) | 4.81 |
| LayerNorm(A) · γ + β | A[M,N], γ[N], β[N] → O[M,N] | 7MN | 414,679 (1.24 ms) | 5.90 |
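The "Achieved GFLOPS/s" column follows directly from the FLOP and cycle counts; the published numbers match when the 333 MHz engine clock is taken as exactly 1/3 GHz (an assumption on our part, not stated in the tables):

```python
def achieved_gflops(flops, cycles, freq_hz=1e9 / 3):
    # Wall-clock time at the engine frequency, then throughput in GFLOPS/s.
    seconds = cycles / freq_hz
    return flops / seconds / 1e9

M = K = N = 1024
# First table row: A @ B.T with 2*M*K*N FLOPs in 17,820,455 cycles.
g = achieved_gflops(2 * M * K * N, 17_820_455)
print(round(g, 2))               # → 40.17
print(round(100 * g / 42.0, 1))  # → 95.6 (utilization, matching the summary table)
```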
The following kernel computes the attention block for given query/key/value tensors and an optional mask or bias. It reaches almost 90% utilization of theoretical FLOPS.
`memory_efficient_attention(q, k, v, mask_or_bias)`

Equivalent PyTorch reference:

```python
import math

import torch

def memory_efficient_attention(q, k, v, attn_bias=None):
    # q, k, v: [seq_len, head_dim] tensors.
    head_dim = q.shape[-1]
    scale = 1.0 / math.sqrt(head_dim)
    attn_weights = (q @ k.T) * scale
    if attn_bias is not None:
        attn_weights = attn_weights + attn_bias
    scores = torch.softmax(attn_weights, dim=-1)
    return scores @ v
```

Flash attention benchmark — bias off

Flash attention benchmark — bias on
Engine speed: 333 MHz; theoretical peak: 42 GFLOPS/s. In quantized mode, achieved throughput is independent of M; in contrast, for tiled matrix-matrix multiplication, smaller M reduces FLOPS utilization. fp4 refers to nvfp4 (NVIDIA FP4).
Metrics based on M=1024, K=1024, N=1024.
| Op | Precision | Operands | FLOPS | Cycles (latency) | Achieved GFLOPS/s |
|---|---|---|---|---|---|
| A Bᵀ | A(bf16) B(int4/fp4) O(bf16) | A[M,K], B[N,K] → O[M,N] | 2MKN | 22,849,177 (68.5 ms) | 31.33 |
| A Bᵀ + C | A(bf16) B(int4/fp4) C(bf16) O(bf16) | A[M,K], B[N,K], C[M,N] → O[M,N] | 2MKN + MN | 23,073,635 (69.2 ms) | 31.04 |
| GELU(A Bᵀ) | A(bf16) B(int4/fp4) O(bf16) | A[M,K], B[N,K] → O[M,N] | 2MKN + 4MN | 22,850,336 (68.5 ms) | 31.39 |
| GELU(A Bᵀ + C) | A(bf16) B(int4/fp4) C(bf16) O(bf16) | A[M,K], B[N,K], C[M,N] → O[M,N] | 2MKN + MN + 4MN | 23,100,231 (69.3 ms) | 31.06 |
| SiLU(A Bᵀ) | A(bf16) B(int4/fp4) O(bf16) | A[M,K], B[N,K] → O[M,N] | 2MKN + 4MN | 22,850,243 (68.5 ms) | 31.39 |
| SiLU(A Bᵀ + C) | A(bf16) B(int4/fp4) C(bf16) O(bf16) | A[M,K], B[N,K], C[M,N] → O[M,N] | 2MKN + MN + 4MN | 23,104,094 (69.3 ms) | 31.06 |
| Op | Precision | Operands | FLOPS | Cycles (latency) | Achieved GFLOPS/s |
|---|---|---|---|---|---|
| Quantize(A) | A(bf16) O(int4/fp4) | A[N] → O[N] | 2N | 15,266 (45.8 µs) | 5.72 |
| Dequantize(A) | A(int4/fp4) O(bf16) | A[N] → O[N] | N | 13,193 (39.5 µs) | 3.31 |
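As an illustrative sketch of the quantize/dequantize round trip, here is a per-tensor symmetric INT4 scheme in plain Python. The scaling strategy is an assumption for illustration; the engine's actual scale handling (and the nvfp4 format) is not specified in this document:

```python
def quantize_int4(values):
    # Per-tensor symmetric scale: map the largest magnitude onto the
    # INT4 positive limit (+7), then round and clamp to [-8, 7].
    scale = max(abs(v) for v in values) / 7.0
    quantized = [max(-8, min(7, round(v / scale))) for v in values]
    return quantized, scale

def dequantize_int4(quantized, scale):
    # Reverse mapping back to floating point.
    return [q * scale for q in quantized]

activations = [0.5, -1.0, 2.0, -3.5]
q, s = quantize_int4(activations)
print(q)                      # → [1, -2, 4, -7]
print(dequantize_int4(q, s))  # → [0.5, -1.0, 2.0, -3.5]
```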
The engine includes a hardware trace buffer capable of recording 8,192 timestamps, allowing cycle-accurate profiling of kernel execution. This is useful for experimenting with tensor parallelism across multiple engines.
The example below demonstrates splitting a 256×2048 @ 2048×1024 matrix multiplication across two engines:
- Engine 0: 192×2048 @ 2048×1024 (larger partition)
- Engine 1: 64×2048 @ 2048×1024 (smaller partition)
Because the two partitions have unequal workloads, the smaller partition finishes before the larger one. A hardware synchronization flag is used to hold the faster engine until both are complete before proceeding to the next stage. The trace visualization below shows this synchronization in action — the idle gap on Engine 1 is where it waits for Engine 0 to finish.
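Functionally, the split is a row partition of the left operand. A NumPy sketch of the math (the synchronization flag itself is a hardware mechanism and is not modeled here):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((256, 2048)).astype(np.float32)
B = rng.standard_normal((2048, 1024)).astype(np.float32)

# Engine 0 computes the first 192 rows, engine 1 the remaining 64.
part0 = A[:192] @ B   # larger partition, finishes last
part1 = A[192:] @ B   # smaller partition, waits on the sync flag

# Once both engines signal completion, the row blocks form the full result.
O = np.concatenate([part0, part1], axis=0)
assert O.shape == (256, 1024)
assert np.allclose(O, A @ B, atol=1e-3)
```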

Trace buffer visualization — 256×2048 @ 2048×1024 split across two engines with hardware synchronization
