bench: CUDA host-to-device copy modes #7815

Merged
0ax1 merged 1 commit into develop from ad/cuda-copy-benchmarks
May 6, 2026
@0ax1 0ax1 commented May 6, 2026

Compare pageable host memory with cuMemHostAlloc pinned allocations using default flags and WRITECOMBINED.

Benchmark results on a GH200:

cuda/load_to_device/memcpy_htod/pageable/1GiB
                        time:   [10.717 ms 10.754 ms 10.793 ms]
                        thrpt:  [92.649 GiB/s 92.989 GiB/s 93.306 GiB/s]

cuda/load_to_device/memcpy_htod/pinned_default/1GiB
                        time:   [10.085 ms 10.265 ms 10.527 ms]
                        thrpt:  [94.992 GiB/s 97.423 GiB/s 99.159 GiB/s]

cuda/load_to_device/memcpy_htod/pinned_write_combined/1GiB
                        time:   [21.043 ms 21.127 ms 21.204 ms]
                        thrpt:  [47.161 GiB/s 47.333 GiB/s 47.522 GiB/s]

cuda/load_to_device/device_alloc_memcpy_htod/pageable/1GiB
                        time:   [42.625 ms 42.704 ms 42.781 ms]
                        thrpt:  [23.375 GiB/s 23.417 GiB/s 23.460 GiB/s]

cuda/load_to_device/device_alloc_memcpy_htod/pinned_default/1GiB
                        time:   [41.864 ms 42.186 ms 42.592 ms]
                        thrpt:  [23.478 GiB/s 23.704 GiB/s 23.887 GiB/s]
                 change:
                        time:   [+1.7580% +2.5859% +3.6570%] (p = 0.00 < 0.05)
                        thrpt:  [-3.5280% -2.5207% -1.7276%]

cuda/load_to_device/device_alloc_memcpy_htod/pinned_write_combined/1GiB
                        time:   [51.986 ms 52.077 ms 52.166 ms]
                        thrpt:  [19.170 GiB/s 19.202 GiB/s 19.236 GiB/s]
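
As a sanity check on the numbers above, the reported throughput can be recomputed from the payload size and the median time. The sketch below assumes each benchmark moves exactly 1 GiB (as the `1GiB` suffix suggests) and uses only the median values copied from the output; the helper name `throughput_gib_per_s` is made up for illustration and is not part of the benchmark code.

```python
GIB = 1024 ** 3  # bytes in one GiB


def throughput_gib_per_s(num_bytes, time_ms):
    """Convert a payload size and a wall time in milliseconds to GiB/s."""
    seconds = time_ms / 1000.0
    return (num_bytes / GIB) / seconds


# Median times (ms) and median throughputs (GiB/s) copied from the results above.
reported = {
    "memcpy_htod/pageable": (10.754, 92.989),
    "memcpy_htod/pinned_default": (10.265, 97.423),
    "memcpy_htod/pinned_write_combined": (21.127, 47.333),
    "device_alloc_memcpy_htod/pageable": (42.704, 23.417),
    "device_alloc_memcpy_htod/pinned_default": (42.186, 23.704),
    "device_alloc_memcpy_htod/pinned_write_combined": (52.077, 19.202),
}

for name, (time_ms, thrpt) in reported.items():
    computed = throughput_gib_per_s(GIB, time_ms)
    print(f"{name}: computed {computed:.3f} GiB/s, reported {thrpt:.3f} GiB/s")
```

The recomputed values should match the reported medians up to the rounding of the printed times.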

The takeaway: WRITECOMBINED pinned memory copies significantly slower (roughly half the throughput of the other two modes in the plain memcpy_htod case), while pageable host memory is roughly on par with default pinned host memory.
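
To put rough numbers on that comparison, the sketch below derives slowdown ratios from the median times reported above. It also subtracts the plain copy time from the `device_alloc_memcpy_htod` time for each mode; reading that difference as allocation overhead is an assumption based on the benchmark names, since the two benchmarks are measured separately and need not be strictly additive. The helper names are hypothetical.

```python
# Median times in milliseconds, copied from the benchmark results above.
memcpy_ms = {
    "pageable": 10.754,
    "pinned_default": 10.265,
    "pinned_write_combined": 21.127,
}
alloc_memcpy_ms = {
    "pageable": 42.704,
    "pinned_default": 42.186,
    "pinned_write_combined": 52.077,
}


def slowdown(times, mode, baseline="pinned_default"):
    """Ratio of a mode's median time to the baseline mode's median time."""
    return times[mode] / times[baseline]


def extra_ms(mode):
    """Median time added when the benchmark also includes device allocation.

    Assumption: the difference between the two benchmark groups is dominated
    by the allocation step; this is an estimate, not a measured value.
    """
    return alloc_memcpy_ms[mode] - memcpy_ms[mode]


for mode in memcpy_ms:
    print(
        f"{mode}: {slowdown(memcpy_ms, mode):.2f}x copy time vs pinned_default, "
        f"{extra_ms(mode):.2f} ms extra with device alloc"
    )
```

Under these assumptions, WRITECOMBINED takes about twice as long as default pinned memory for the plain copy, pageable memory is within about 5%, and the extra time in the `device_alloc_memcpy_htod` group is of similar size (around 31 to 32 ms) for all three modes.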

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
@0ax1 0ax1 added the changelog/chore A trivial change label May 6, 2026
@0ax1 0ax1 requested review from a10y and robert3005 May 6, 2026 14:39
@0ax1 0ax1 enabled auto-merge (squash) May 6, 2026 14:58
@0ax1 0ax1 merged commit f307edc into develop May 6, 2026
92 of 95 checks passed
@0ax1 0ax1 deleted the ad/cuda-copy-benchmarks branch May 6, 2026 15:02