PTX Backend by WillTrojak · Pull Request #18 · PyFR/GiMMiK

WillTrojak · 2026-05-15T12:23:17Z

This adds a PTX backend to GiMMiK. The key features are:

Mild optimisation of existing CUDA algorithms.
Optional async loads for some sparse kernels.
Dense generation for Hopper and above.

Optimisations have focused on FP64; FP32 is future work.

FreddieWitherden · 2026-05-15T13:44:24Z

+            yield (tpl, args, meta)
+
+        # Warp-specialised dense DMMA
+        if cc >= (10, 0):


Does this gate consumer cards with less shared memory?

Not sure what the best way to handle this is. I've added a DENSE_SMEM_MAX but we could set this via the ini or driver?

If consumer cards can pass the check they need to work. Not sure if there is a clear mapping from CC to max smem. Otherwise, have the caller pass in additional info about max shared memory.
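
A minimal sketch of the last suggestion, where the caller passes in the shared-memory limit and the gate checks both the compute capability and the kernel's footprint. The names `dense_kernel_ok`, `smem_required` and `smem_max` are hypothetical and do not come from the PR; the byte counts in the usage lines are purely illustrative.

```python
def dense_kernel_ok(cc, smem_required, smem_max):
    """Decide whether the warp-specialised dense kernel may be offered.

    cc            -- compute capability as a (major, minor) tuple
    smem_required -- shared memory the kernel needs, in bytes (hypothetical)
    smem_max      -- per-block shared-memory limit supplied by the caller
    """
    # Tuples compare lexicographically, so (12, 0) >= (10, 0) holds
    if cc < (10, 0):
        return False

    # A card can pass the CC check yet still lack the shared memory
    return smem_required <= smem_max


# Illustrative values only: same CC gate, different memory limits
print(dense_kernel_ok((10, 0), 128 * 1024, 200 * 1024))
print(dense_kernel_ok((12, 0), 128 * 1024, 96 * 1024))
```

With this shape the kernel is simply not yielded when the limit is too small, rather than failing at launch.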

FreddieWitherden · 2026-05-15T13:54:35Z

@@ -0,0 +1,276 @@
+# -*- coding: utf-8 -*-
+
+import struct


FreddieWitherden · 2026-05-15T18:25:33Z

+                    i = m_tile * 8 + lane // 4
+                    j = k_iter * 4 + lane % 4
+                    v = float(a[i, j]) if (i < m and j < k) else 0.0
+                    u = struct.unpack('<Q', struct.pack('<d', v))[0]


Can you unpick this for me?
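
As a reading of the quoted lines (a sketch, not the PR's code): the first two lines map each of the 32 lanes in a warp to one entry of a row-major 8x4 tile of A, and the `struct` round trip reinterprets the bytes of a double as an unsigned 64-bit integer so the value can be emitted as a raw bit pattern. The helper names `lane_to_ij` and `f64_bits` are made up for illustration.

```python
import struct


def lane_to_ij(m_tile, k_iter, lane):
    """Row and column of A held by a given lane, as in the quoted diff."""
    i = m_tile * 8 + lane // 4   # 8 rows per tile, 4 lanes per row
    j = k_iter * 4 + lane % 4    # 4 columns per tile
    return i, j


def f64_bits(v):
    """IEEE 754 binary64 bit pattern of v as an unsigned 64-bit integer."""
    # '<d' packs v into 8 little-endian bytes; '<Q' reads the same bytes
    # back as an unsigned 64-bit integer. unpack returns a 1-tuple.
    return struct.unpack('<Q', struct.pack('<d', v))[0]


print(lane_to_ij(0, 0, 31))   # (7, 3)
print(hex(f64_bits(1.0)))     # 0x3ff0000000000000
```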

FreddieWitherden · 2026-05-15T18:25:58Z

+
+        # A in fragment layout: lane l -> A[m_tile*8 + l/4][k_iter*4 + l%4]
+        a_u64 = []
+        for m_tile in range(m_tiles):


Can 3 arg range work here?
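
One possible reading of this, as an assumption about what is being asked: step over the tile origins directly with `range(start, stop, step)` instead of looping over tile indices and multiplying inside the body. The helper name `tile_origins` and the sizes are illustrative only.

```python
def tile_origins(n, step):
    """Starting offsets of each tile of width `step` covering n entries."""
    return list(range(0, n, step))


m = 20                  # illustrative row count
m_tiles = -(-m // 8)    # ceiling division, giving 3 tiles

# Index-based form: loop over tile indices and scale by the tile height
by_index = [m_tile * 8 for m_tile in range(m_tiles)]

# Three-argument form: the offsets come straight out of range
by_step = tile_origins(m, 8)

print(by_index == by_step)   # True
```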

FreddieWitherden · 2026-05-15T18:31:49Z

I know this is an utter pain, but for FP32/FP64 can you confirm correctness for all relevant PyFR matrices at a suite of N values, for all instances where a kernel is expected to work (on A100/H100/B100)?

FreddieWitherden · 2026-05-15T18:33:25Z

+                         .param .u64 _c)
+{
+% endif
+    .reg .u32 n, id, tid_x, tid_y;


Ensure we throw higher up if n is too big.

FreddieWitherden · 2026-05-15T18:34:40Z

+## Async fill of chunk 0
+%   for idx, kx in enumerate(bchunks[0]):
+%     if idx % msplit == cid:
+% if n is None:


See if we can come up with some consistent indentation for Mako. Am open to ideas.

FreddieWitherden · 2026-05-15T18:35:02Z

+<%
+        buf_cur = bb % 2
+        buf_next = (bb + 1) % 2
+        is_last = (bb == len(bchunks) - 1)


There is a Mako var for this.

FreddieWitherden · 2026-05-15T18:36:01Z

+%       if afix[row_j] == -1:
+% if beta == 0:
+    {
+    .reg .${pftype} _tmp;


Can this be factored up, as it appears in both branches?

FreddieWitherden · 2026-05-15T18:39:06Z

+    fma.rn.${pftype} _ctmp, _ctmp, ${float(beta)}, dotp;
+    st.global.${pftype} [_cptr], _ctmp;
+% else:
+    ld.global.${pftype} _ctmp, [c_base + ${ldc*j*dwidth_i}];


Is there scope for lifting these loads up, or does the assembler handle this?

FreddieWitherden · 2026-05-19T18:46:56Z

+                    i = mt * 8 + lane // 4
+                    j = kt * 4 + lane % 4
+                    v = float(a[i, j]) if (i < m and j < k) else 0.0
+                    u, = struct.unpack('<Q', struct.pack('<d', v))


I thought Python f-strings/format could do this for getting hex representation of floating point?
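
For reference, a short sketch of what the standard library offers here. `float.hex()` gives a hexadecimal floating-point literal rather than the raw 64-bit pattern, and the integer format code `x` is rejected for floats, so a reinterpretation step such as the `struct` round trip still seems to be needed before formatting. The helper name `bits_hex` is hypothetical.

```python
import struct


def bits_hex(v):
    """Raw binary64 bit pattern of v as a zero-padded hex string."""
    u, = struct.unpack('<Q', struct.pack('<d', v))
    # Width 18 covers the '0x' prefix plus 16 hex digits
    return f'{u:#018x}'


v = 1.0

# Hex float literal: mantissa and exponent, not the bit pattern
print(v.hex())       # 0x1.0000000000000p+0

# Bit pattern, formatted from the reinterpreted integer
print(bits_hex(v))   # 0x3ff0000000000000
```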

FreddieWitherden · 2026-05-21T13:29:40Z

+        nnz = np.count_nonzero(arr)
+        nuq = len(np.unique(np.abs(arr)))
+        density = nnz / arr.size
+        return (nuq <= 28) or (density <= 0.15)


Check if these could do with tuning
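
To make tuning easier, the two thresholds could be lifted into parameters. This is only a sketch: the function name `passes_heuristic` and the keyword names are hypothetical, and the defaults are the values from the quoted diff.

```python
import numpy as np


def passes_heuristic(arr, nuq_max=28, density_max=0.15):
    """Same test as the quoted diff, with the thresholds exposed."""
    nnz = np.count_nonzero(arr)
    # Number of distinct magnitudes; zero counts as one of them if present
    nuq = len(np.unique(np.abs(arr)))
    density = nnz / arr.size
    return bool((nuq <= nuq_max) or (density <= density_max))


a = np.eye(4)   # 2 distinct magnitudes, density 0.25

print(passes_heuristic(a))
print(passes_heuristic(a, nuq_max=1, density_max=0.1))
```

A sweep over `nuq_max` and `density_max` against measured kernel timings would then show whether 28 and 0.15 are sensible cut-offs.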

FreddieWitherden · 2026-05-21T13:30:21Z

                continue
            setup = self._dense_mma_setup(nn=nn, warps_per_cta=w)
+            blkx = 32 * w
            args = (base_args | {'warps_per_cta': w, 'nn': nn,


Can we reorder the | args so things are a bit cleaner?

FreddieWitherden · 2026-05-21T13:32:13Z

+        # A in DMMA-fragment layout: lane l -> A[mt*8 + l//4][kt*4 + l%4]
+        # i.e. an (m_tiles, k_tiles) grid of row-major 8x4 tiles, packed as
+        # uint64
+        a_pad = np.zeros((m_tiles*8, k_tiles*4), dtype=np.float64)


Float64 is default
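
A quick check of this point, using illustrative tile counts: `np.zeros` returns `float64` when no `dtype` is given, so the explicit argument in the quoted line is redundant.

```python
import numpy as np

m_tiles, k_tiles = 2, 3   # illustrative sizes

# No dtype argument: NumPy falls back to float64
a_pad = np.zeros((m_tiles*8, k_tiles*4))

print(a_pad.dtype)   # float64
```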

FreddieWitherden · 2026-05-21T13:32:39Z

+        # uint64
+        a_pad = np.zeros((m_tiles*8, k_tiles*4), dtype=np.float64)
+        a_pad[:m, :k] = a
+        tiles = a_pad.reshape(m_tiles, 8, k_tiles, 4).transpose(0, 2, 1, 3)


swapaxes(1, 2)
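
A small check that the two spellings agree for this layout, using illustrative tile counts. Both produce an `(m_tiles, k_tiles, 8, 4)` view in which entry `[mt, kt]` is the corresponding 8x4 block of the padded matrix.

```python
import numpy as np

m_tiles, k_tiles = 2, 3   # illustrative sizes

# Padded matrix filled with distinct values so blocks are identifiable
a_pad = np.arange(m_tiles*8 * k_tiles*4, dtype=float)
a_pad = a_pad.reshape(m_tiles*8, k_tiles*4)

blocks = a_pad.reshape(m_tiles, 8, k_tiles, 4)

via_transpose = blocks.transpose(0, 2, 1, 3)
via_swapaxes = blocks.swapaxes(1, 2)

print(np.array_equal(via_transpose, via_swapaxes))   # True
print(via_swapaxes.shape)                            # (2, 3, 8, 4)
```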

FreddieWitherden · 2026-05-21T13:35:25Z

@@ -1,6 +1,5 @@
 # -*- coding: utf-8 -*-


Avoid these for new code (they have not been needed for years)

Will Trojak and others added 6 commits December 2, 2025 22:13

[wip] added ptx generator for bstream

0cd7485

Addtional sparse and dense work

626c2f5

Dense and sparse optimisation

bbbb8ef

Added warp specialised dense kernel

393b409

Performance tuning and cleanup

67d1beb

Whitespace

e2a818b

WillTrojak mentioned this pull request May 15, 2026

Support for GiMMiK PTX Provider PyFR/PyFR#556

Open