PTX Backend #18
yield (tpl, args, meta)

# Warp-specialised dense DMMA
if cc >= (10, 0):
Does this gate consumer cards with less shared memory?
Not sure what the best way to handle this is. I've added a DENSE_SMEM_MAX but we could set this via the ini or driver?
If consumer cards can pass the check they need to work. Not sure if there is a clear mapping from CC to max smem. Otherwise, have the caller pass in additional info about max shared memory.
@@ -0,0 +1,276 @@
# -*- coding: utf-8 -*-

import struct
i = m_tile * 8 + lane // 4
j = k_iter * 4 + lane % 4
v = float(a[i, j]) if (i < m and j < k) else 0.0
u = struct.unpack('<Q', struct.pack('<d', v))[0]
Can you unpick this for me?
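For reference, a minimal sketch of what the `struct` round trip does, assuming the intent is to obtain the raw IEEE 754 bit pattern of a double as an integer. The helper name `f64_bits` is hypothetical and not part of the PR:

```python
import struct


def f64_bits(v: float) -> int:
    """Return the IEEE 754 binary64 bit pattern of v as an unsigned int.

    Hypothetical helper for illustration only.
    """
    # '<d' serialises v as 8 little-endian bytes (binary64)
    raw = struct.pack('<d', v)
    # '<Q' reinterprets those same 8 bytes as an unsigned 64-bit integer
    return struct.unpack('<Q', raw)[0]


print(hex(f64_bits(1.0)))   # 0x3ff0000000000000
print(hex(f64_bits(-2.0)))  # 0xc000000000000000
```

No numeric conversion takes place; the bytes are only reinterpreted, which is why the result can be emitted as an exact constant.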
# A in fragment layout: lane l -> A[m_tile*8 + l/4][k_iter*4 + l%4]
a_u64 = []
for m_tile in range(m_tiles):
Can a three-argument range work here?
I know this is an utter pain, but for FP32/FP64 can you confirm correctness for all relevant PyFR matrices at a suite of N values, for all instances where a kernel is expected to work (on A100/H100/B100)?
.param .u64 _c)
{
% endif
.reg .u32 n, id, tid_x, tid_y;
Ensure we throw higher up if n is too big.
## Async fill of chunk 0
% for idx, kx in enumerate(bchunks[0]):
% if idx % msplit == cid:
% if n is None:
See if we can come up with some consistent indentation for Mako. Am open to ideas.
<%
buf_cur = bb % 2
buf_next = (bb + 1) % 2
is_last = (bb == len(bchunks) - 1)
There is a Mako var for this.
% if afix[row_j] == -1:
% if beta == 0:
{
.reg .${pftype} _tmp;
Can this be factored up, as it appears in both branches?
fma.rn.${pftype} _ctmp, _ctmp, ${float(beta)}, dotp;
st.global.${pftype} [_cptr], _ctmp;
% else:
ld.global.${pftype} _ctmp, [c_base + ${ldc*j*dwidth_i}];
Is there scope for lifting these loads up, or does the assembler handle this?
i = mt * 8 + lane // 4
j = kt * 4 + lane % 4
v = float(a[i, j]) if (i < m and j < k) else 0.0
u, = struct.unpack('<Q', struct.pack('<d', v))
I thought Python f-strings/format could do this for getting hex representation of floating point?
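A short sketch of the difference, as far as I am aware: `float.hex()` returns a C99-style hexadecimal float string rather than the raw bit pattern, and the format spec for floats has no code that exposes the bits. PTX double constants take the form `0d` followed by 16 hex digits, so an integer bit pattern is still needed before an f-string can format it. The helper name `ptx_f64_literal` is hypothetical:

```python
import struct


def ptx_f64_literal(v: float) -> str:
    """Format v as a PTX-style double constant (0d + 16 hex digits).

    Hypothetical helper for illustration only.
    """
    # Reinterpret the binary64 bytes as a uint64, then let the f-string
    # handle zero padding and hex formatting
    bits, = struct.unpack('<Q', struct.pack('<d', v))
    return f'0d{bits:016X}'


print((1.0).hex())           # 0x1.0000000000000p+0
print(ptx_f64_literal(1.0))  # 0d3FF0000000000000
```

So an f-string can tidy up the final formatting step, but it does not replace the `struct` reinterpretation.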
nnz = np.count_nonzero(arr)
nuq = len(np.unique(np.abs(arr)))
density = nnz / arr.size
return (nuq <= 28) or (density <= 0.15)
Check whether these thresholds could do with tuning.
continue
setup = self._dense_mma_setup(nn=nn, warps_per_cta=w)
blkx = 32 * w
args = (base_args | {'warps_per_cta': w, 'nn': nn,
Can we reorder the | args so things are a bit cleaner?
# A in DMMA-fragment layout: lane l -> A[mt*8 + l//4][kt*4 + l%4]
# i.e. an (m_tiles, k_tiles) grid of row-major 8x4 tiles, packed as
# uint64
a_pad = np.zeros((m_tiles*8, k_tiles*4), dtype=np.float64)
float64 is the default dtype.
# uint64
a_pad = np.zeros((m_tiles*8, k_tiles*4), dtype=np.float64)
a_pad[:m, :k] = a
tiles = a_pad.reshape(m_tiles, 8, k_tiles, 4).transpose(0, 2, 1, 3)
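As a cross-check of the vectorised packing, here is a pure-Python sketch of the lane mapping stated in the comment (lane l reads A[mt*8 + l//4][kt*4 + l%4], with zero padding outside the matrix). The function name `pack_fragments` and the ceiling-division tile counts are assumptions for illustration, not taken from the PR:

```python
def pack_fragments(a, m, k):
    """Flatten an m x k matrix (list of lists) into 8x4 fragment order.

    Hypothetical reference routine: tiles are visited as an
    (m_tiles, k_tiles) grid, and each tile is emitted row-major so that
    entry l of a tile corresponds to lane l.
    """
    m_tiles = (m + 7) // 8   # assumed: round rows up to a multiple of 8
    k_tiles = (k + 3) // 4   # assumed: round columns up to a multiple of 4
    out = []
    for mt in range(m_tiles):
        for kt in range(k_tiles):
            for lane in range(32):
                i = mt * 8 + lane // 4
                j = kt * 4 + lane % 4
                # Entries that fall outside the matrix are zero padded
                out.append(float(a[i][j]) if (i < m and j < k) else 0.0)
    return out


# 9 x 5 test matrix whose entries encode their own position
a = [[10 * i + j for j in range(5)] for i in range(9)]
frags = pack_fragments(a, 9, 5)
print(len(frags))  # 128
```

Comparing this against the flattened `tiles` array for a few awkward shapes would be one way to confirm the reshape and transpose order.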
@@ -1,6 +1,5 @@
# -*- coding: utf-8 -*-
Avoid these for new code (they have not been needed for years)
This adds a PTX backend to GiMMiK. The key features are:
Optimisations have focused on FP64; FP32 is future work.