Contributors: Hasan Unlu, Siqin Liu, Tin Nguyen, Rohit Rao, Dave Wei, Hiruna Vishwamith, Yinuo Zhao
Contact: hunlu@apexcompute.com, siqin.liu@apexcompute.com, tin.nguyen@apexcompute.com, rohit@apexcompute.com, dave.wei@apexcompute.com, hiruna@apexcompute.com, yinuo.zhao@apexcompute.com
⚙️ Hardware Architecture Update v1.1 (3fa1735.bin)

🛒 Purchase FPGA Board with Unified Engine IP Block for $49.99
Includes ongoing hardware design updates so you always have the latest architecture.
This guide covers installation and usage of the Xilinx XDMA driver for PCIe-based FPGA communication.

Prerequisites:
- Kernel headers installed:

```shell
sudo apt install linux-headers-$(uname -r)
```
Clone the official Xilinx DMA driver repository, then build and install the kernel module:

```shell
git clone https://github.com/Xilinx/dma_ip_drivers.git
cd dma_ip_drivers/XDMA/linux-kernel/xdma
sudo make install
```

Tip: If `sudo make install` fails, you may need to disable Secure Boot in your BIOS settings.
Load the XDMA driver with interrupt mode 0 (auto-detect):

```shell
sudo insmod /lib/modules/$(uname -r)/xdma/xdma.ko interrupt_mode=0
```

To load the driver automatically at boot and set device permissions, apply the following script:
```shell
# 1. Remove any conflicting configs
sudo rm -f /etc/modprobe.d/blacklist-xdma.conf \
           /etc/modprobe.d/xdma.conf \
           /etc/modules-load.d/xdma.conf

# 2. Create systemd service
sudo tee /etc/systemd/system/xdma.service << 'EOF'
[Unit]
Description=Xilinx XDMA Driver
After=local-fs.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c '/sbin/insmod /lib/modules/$(uname -r)/xdma/xdma.ko || true'
ExecStartPost=/bin/sh -c 'chmod 666 /dev/xdma*'
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF

# 3. Enable and start
sudo systemctl daemon-reload
sudo systemctl enable xdma
sudo systemctl restart xdma

# 4. Verify
sudo systemctl status xdma
```
```shell
ls -la /dev/xdma* | head -5
```

Set up a Python virtual environment and install the test dependencies:

```shell
python3 -m venv ~/my_torch_env
source ~/my_torch_env/bin/activate
pip install -r requirements.txt
```

Run the hardware test:

```shell
python3 user_hw_test.py
```

The Gemma3 test downloads the gated google/gemma-3-1b-it model from Hugging Face. You need to:
- Create a Hugging Face account at https://huggingface.co
- Accept the Gemma license at https://huggingface.co/google/gemma-3-1b-it
- Create an access token at https://huggingface.co/settings/tokens
- Log in from the command line:
```shell
pip install huggingface-hub
huggingface-cli login
```

Then run:

```shell
python3 models/gemma3/gemma3_test.py --prompt "your prompt"
```

To flash a hardware architecture update, run the update script with the downloaded binary:

```shell
python3 update_flash.py update_xxxxxxxx.bin
```
Cold reboot the PC.
All benchmarks were collected on RTL running on a Kintex UltraScale+ FPGA in real time.
📄 Download Benchmark Datasheet (PDF)
| Specification | Value |
|---|---|
| Engine frequency | 333 MHz |
| Theoretical peak (BF16) | 42 GFLOPS/s |
| Memory interface | DDR4 @ 1333 MHz, 32-bit |
| AXI Master Data Width | 256 bits |
| On-chip SRAM | 1.05 MB |
| Total power | 4.5 W |
| BF16 MatMul | 40.17 GFLOPS/s (95.6% utilization) |
| BF16 MatMul + Bias + Activation | 40.03 GFLOPS/s (95.3% utilization) |
| BF16 Softmax MatMul | 37.76 GFLOPS/s (89.9% utilization) |
| Memory-Efficient Attention | ~90% utilization |
| Quantized MatMul (BF16 × INT4/FP4) | 40.03 GFLOPS/s (95.3% utilization) |
| Quantized MatVec (Streaming matrix, decoding mode friendly) (BF16 × INT4/FP4) | 31.33 GFLOPS/s (74.6% utilization) |
| RMSNorm | 4.81 GFLOPS/s |
| LayerNorm | 5.90 GFLOPS/s |
| Quantize (BF16 → INT4/FP4) | 5.72 GFLOPS/s |
| Dequantize (INT4/FP4 → BF16) | 3.31 GFLOPS/s |
| Hardware trace buffer | 8,192 timestamps |
| Multi-engine tensor parallelism | Supported with Synchronization Flag instructions |
| Parameter | Value |
|---|---|
| Memory interface | DDR4 at 1333 MHz, 32-bit data path |
| Engine frequency | 333 MHz |
| Memory interface clock | Synchronized 1:1 with engine clock |
| Data width | 256 bits |
| Total power consumption | 4.5 W |
| Total on-chip SRAM | 1.05 MB |
At 333 MHz, the engine's total floating-point throughput is approximately 42 GFLOPS/s.
| Name | CLB LUTs | CLB Registers | Block RAM Tile | URAM | DSPs |
|---|---|---|---|---|---|
| unified_engine_top | 78,348 | 50,045 | 16 | 30 | 197 |
| Operation | FLOPS |
|---|---|
| FMA (Fused Multiply-Add) | 2 |
| Addition / Multiplication | 1 |
| Exponent | 1 |
| Division | 1 |
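Using these per-op costs, the FLOP totals in the benchmark tables can be reproduced. For example, softmax(A Bᵀ) at M=K=N=1024 (the exact breakdown of the 5 FLOPs per softmax output element is not stated here; max, subtract, exp, sum, and divide at 1 FLOP each is one plausible accounting):

```python
# FLOP counting for softmax(A @ B.T), following the per-op costs above
# (FMA = 2 FLOPs; add, multiply, exponent, division = 1 FLOP each).
M = K = N = 1024

matmul_flops = 2 * M * K * N   # one FMA per multiply-accumulate
softmax_flops = 5 * M * N      # 5 FLOPs per output element
total = matmul_flops + softmax_flops
print(total)  # → 2152726528
```

This matches the 2MKN + 5MN entry in the softmax(A Bᵀ) row below.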
Engine speed: 333 MHz; theoretical peak: 42 GFLOPS/s. Metrics are based on M=1024, K=1024, N=1024. O denotes the output tensor. All matrix-matrix operations reach up to ~95% FLOPS utilization.
| Op | Operands | FLOPS | Cycles (latency) | Achieved GFLOPS/s |
|---|---|---|---|---|
| A Bᵀ | A[M,K], B[N,K] → O[M,N] | 2MKN | 17,820,455 (53.3 ms) | 40.17 |
| A Bᵀ + C | A[M,K], B[N,K], C[M,N] → O[M,N] | 2MKN + MN | 17,858,564 (53.5 ms) | 40.10 |
| GELU(A Bᵀ) | A[M,K], B[N,K] → O[M,N] | 2MKN + 4MN | 17,923,045 (53.7 ms) | 40.02 |
| GELU(A Bᵀ + C) | A[M,K], B[N,K], C[M,N] → O[M,N] | 2MKN + MN + 4MN | 17,927,850 (53.7 ms) | 40.03 |
| SiLU(A Bᵀ) | A[M,K], B[N,K] → O[M,N] | 2MKN + 4MN | 17,921,594 (53.7 ms) | 40.02 |
| SiLU(A Bᵀ + C) | A[M,K], B[N,K], C[M,N] → O[M,N] | 2MKN + MN + 4MN | 17,926,623 (53.7 ms) | 40.03 |
| softmax(A Bᵀ) | A[M,K], B[N,K] → O[M,N] | 2MKN + 5MN | 19,004,997 (57.01 ms) | 37.76 |
| softmax(A Bᵀ + C) | A[M,K], B[N,K], C[M,N] → O[M,N] | 2MKN + MN + 5MN | 19,051,310 (57.15 ms) | 37.68 |
| Aᵀ | A[M,N] → O[N,M] | 0 | 1,648,647 (4.9 ms) | N/A |
| A · scalar | A[M,N] → O[M,N] | MN | 180,500 (541 µs) | 1.94 |
| A + scalar | A[M,N] → O[M,N] | MN | 181,005 (543 µs) | 1.93 |
| A · B | A[M,N], B[M,N] → O[M,N] | MN | 263,580 (790 µs) | 1.33 |
| A + B | A[M,N], B[M,N] → O[M,N] | MN | 263,871 (791 µs) | 1.33 |
| RMSNorm(A) · γ | A[M,N], γ[N] → O[M,N] | 4MN | 290,945 (872 µs) | 4.81 |
| LayerNorm(A) · γ + β | A[M,N], γ[N], β[N] → O[M,N] | 7MN | 414,679 (1.24 ms) | 5.90 |
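The "Achieved GFLOPS/s" column follows directly from the FLOP and cycle counts; the published numbers match when the 333 MHz engine clock is taken as exactly 1/3 GHz (an assumption on our part, not stated in the tables):

```python
def achieved_gflops(flops, cycles, freq_hz=1e9 / 3):
    # Wall-clock time at the engine frequency, then throughput in GFLOPS/s.
    seconds = cycles / freq_hz
    return flops / seconds / 1e9

M = K = N = 1024
# First table row: A @ B.T with 2*M*K*N FLOPs in 17,820,455 cycles.
g = achieved_gflops(2 * M * K * N, 17_820_455)
print(round(g, 2))               # → 40.17
print(round(100 * g / 42.0, 1))  # → 95.6 (utilization, matching the summary table)
```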
The following kernel computes the attention block for given query/key/value tensors and an optional mask or bias. It reaches almost 90% utilization of theoretical FLOPS.
`memory_efficient_attention(q, k, v, mask_or_bias)`

Equivalent PyTorch reference:

```python
import math

import torch

def memory_efficient_attention(q, k, v, attn_bias=None):
    # q, k, v: [seq_len, head_dim] tensors.
    head_dim = q.shape[-1]
    scale = 1.0 / math.sqrt(head_dim)
    attn_weights = (q @ k.T) * scale
    if attn_bias is not None:
        attn_weights = attn_weights + attn_bias
    scores = torch.softmax(attn_weights, dim=-1)
    return scores @ v
```

Flash attention benchmark — bias off

Flash attention benchmark — bias on
Engine speed: 333 MHz; theoretical peak: 42 GFLOPS/s. In quantized mode, achieved throughput is independent of M; in contrast, for tiled matrix-matrix multiplication, smaller M reduces FLOPS utilization. fp4 refers to nvfp4 (NVIDIA FP4).
Metrics based on M=1024, K=1024, N=1024.
| Op | Precision | Operands | FLOPS | Cycles (latency) | Achieved GFLOPS/s |
|---|---|---|---|---|---|
| A Bᵀ | A(bf16) B(int4/fp4) O(bf16) | A[M,K], B[N,K] → O[M,N] | 2MKN | 22,849,177 (68.5 ms) | 31.33 |
| A Bᵀ + C | A(bf16) B(int4/fp4) C(bf16) O(bf16) | A[M,K], B[N,K], C[M,N] → O[M,N] | 2MKN + MN | 23,073,635 (69.2 ms) | 31.04 |
| GELU(A Bᵀ) | A(bf16) B(int4/fp4) O(bf16) | A[M,K], B[N,K] → O[M,N] | 2MKN + 4MN | 22,850,336 (68.5 ms) | 31.39 |
| GELU(A Bᵀ + C) | A(bf16) B(int4/fp4) C(bf16) O(bf16) | A[M,K], B[N,K], C[M,N] → O[M,N] | 2MKN + MN + 4MN | 23,100,231 (69.3 ms) | 31.06 |
| SiLU(A Bᵀ) | A(bf16) B(int4/fp4) O(bf16) | A[M,K], B[N,K] → O[M,N] | 2MKN + 4MN | 22,850,243 (68.5 ms) | 31.39 |
| SiLU(A Bᵀ + C) | A(bf16) B(int4/fp4) C(bf16) O(bf16) | A[M,K], B[N,K], C[M,N] → O[M,N] | 2MKN + MN + 4MN | 23,104,094 (69.3 ms) | 31.06 |
| Op | Precision | Operands | FLOPS | Cycles (latency) | Achieved GFLOPS/s |
|---|---|---|---|---|---|
| Quantize(A) | A(bf16) O(int4/fp4) | A[N] → O[N] | 2N | 15,266 (45.8 µs) | 5.72 |
| Dequantize(A) | A(int4/fp4) O(bf16) | A[N] → O[N] | N | 13,193 (39.5 µs) | 3.31 |
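As an illustrative sketch of the quantize/dequantize round trip, here is a per-tensor symmetric INT4 scheme in plain Python. The scaling strategy is an assumption for illustration; the engine's actual scale handling (and the nvfp4 format) is not specified in this document:

```python
def quantize_int4(values):
    # Per-tensor symmetric scale: map the largest magnitude onto the
    # INT4 positive limit (+7), then round and clamp to [-8, 7].
    scale = max(abs(v) for v in values) / 7.0
    quantized = [max(-8, min(7, round(v / scale))) for v in values]
    return quantized, scale

def dequantize_int4(quantized, scale):
    # Reverse mapping back to floating point.
    return [q * scale for q in quantized]

activations = [0.5, -1.0, 2.0, -3.5]
q, s = quantize_int4(activations)
print(q)                      # → [1, -2, 4, -7]
print(dequantize_int4(q, s))  # → [0.5, -1.0, 2.0, -3.5]
```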
The engine includes a hardware trace buffer capable of recording 8,192 timestamps, allowing cycle-accurate profiling of kernel execution. This is useful for experimenting with tensor parallelism across multiple engines.
The example below demonstrates splitting a 256×2048 @ 2048×1024 matrix multiplication across two engines:
- Engine 0: 192×2048 @ 2048×1024 (larger partition)
- Engine 1: 64×2048 @ 2048×1024 (smaller partition)
Because the two partitions have unequal workloads, the smaller partition finishes before the larger one. A hardware synchronization flag is used to hold the faster engine until both are complete before proceeding to the next stage. The trace visualization below shows this synchronization in action — the idle gap on Engine 1 is where it waits for Engine 0 to finish.
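Functionally, the split is a row partition of the left operand. A NumPy sketch of the math (the synchronization flag itself is a hardware mechanism and is not modeled here):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((256, 2048)).astype(np.float32)
B = rng.standard_normal((2048, 1024)).astype(np.float32)

# Engine 0 computes the first 192 rows, engine 1 the remaining 64.
part0 = A[:192] @ B   # larger partition, finishes last
part1 = A[192:] @ B   # smaller partition, waits on the sync flag

# Once both engines signal completion, the row blocks form the full result.
O = np.concatenate([part0, part1], axis=0)
assert O.shape == (256, 1024)
assert np.allclose(O, A @ B, atol=1e-3)
```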

Trace buffer visualization — 256×2048 @ 2048×1024 split across two engines with hardware synchronization
