Context
So far we've only spent time optimizing the GPU kernels themselves, not how to get the data to the GPU quickly. Prior art here is Onur's exploratory work. As a first step, I'll capture the current GPU data-loading state in Vortex, which is expected to be slow.
Benchmarking copy modes
Insight => Copy performance does not differ between pageable host memory and pinned host memory. The main difference is that copy calls can return immediately (C-style async) when the memory is pinned to physical addresses. WRITECOMBINED yields significantly slower copy performance for cuMemHostAlloc (see #7815 (comment)).
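A minimal sketch of the three host-allocation modes, using the CUDA runtime API rather than the driver-level cuMemHostAlloc; the buffer size is an arbitrary example:

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

int main() {
    const size_t n = 256 << 20; // 256 MiB, arbitrary example size

    // Pageable: cudaMemcpyAsync silently stages through an internal pinned
    // buffer and behaves synchronously with respect to the host.
    void* pageable = malloc(n);

    // Pinned: DMA-able; async copies can return immediately and overlap compute.
    void* pinned;
    cudaMallocHost(&pinned, n);

    // Write-combined: fast CPU writes, very slow CPU reads.
    void* wc;
    cudaHostAlloc(&wc, n, cudaHostAllocWriteCombined);

    void* dev;
    cudaMalloc(&dev, n);

    cudaStream_t s;
    cudaStreamCreate(&s);
    cudaMemcpyAsync(dev, pinned, n, cudaMemcpyHostToDevice, s); // returns immediately
    cudaStreamSynchronize(s);

    cudaFree(dev);
    cudaFreeHost(pinned);
    cudaFreeHost(wc);
    free(pageable);
    cudaStreamDestroy(s);
    return 0;
}
```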
Benchmark perf between (see the timing sketch after this list):
- copies from host memory to the GPU
- NVMe to CPU
- object storage to the GPU
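For the host-to-GPU leg, CUDA events are the usual way to time stream-ordered copies; `time_h2d_ms` below is a hypothetical helper, not existing Vortex code:

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: time one host->device copy on a stream using CUDA
// events and return milliseconds. Divide bytes by this to get throughput.
float time_h2d_ms(void* dst, const void* src, size_t n, cudaStream_t s) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, s);
    cudaMemcpyAsync(dst, src, n, cudaMemcpyHostToDevice, s);
    cudaEventRecord(stop, s);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```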
GPU memory pooling
- GPU allocation is very expensive => buffer pool
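One option, assuming CUDA 11.2+, is the built-in stream-ordered pool rather than a hand-rolled one; raising its release threshold keeps freed memory cached in the pool instead of returning it to the OS:

```cuda
#include <cuda_runtime.h>

// Configure the device's default stream-ordered pool to cache freed memory
// (example threshold: 1 GiB) so repeated allocations skip the expensive
// cudaMalloc path.
void configure_pool(int device) {
    cudaMemPool_t pool;
    cudaDeviceGetDefaultMemPool(&pool, device);
    unsigned long long threshold = 1ull << 30; // 1 GiB, placeholder
    cudaMemPoolSetAttribute(pool, cudaMemPoolAttrReleaseThreshold, &threshold);
}

// Allocations then go through the pool instead of raw cudaMalloc:
//   cudaMallocAsync(&ptr, n, stream);
//   ...
//   cudaFreeAsync(ptr, stream);
```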
GPUDirect - NVMe
- DMA copy from NVMe to the GPU
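A hedged sketch of what the DMA path could look like with GPUDirect Storage (cuFile); the file path and the omitted error handling are placeholders:

```cuda
#include <cufile.h>
#include <fcntl.h>
#include <unistd.h>

// Read n bytes from an NVMe-backed file straight into device memory,
// bypassing the host bounce buffer. "/mnt/nvme/data.vortex" is a placeholder.
ssize_t read_direct(void* dev_buf, size_t n) {
    cuFileDriverOpen();

    int fd = open("/mnt/nvme/data.vortex", O_RDONLY | O_DIRECT);
    CUfileDescr_t descr = {};
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;

    CUfileHandle_t handle;
    cuFileHandleRegister(&handle, &descr);
    cuFileBufRegister(dev_buf, n, 0); // optional, avoids per-call registration

    ssize_t got = cuFileRead(handle, dev_buf, n, /*file_offset=*/0, /*buf_offset=*/0);

    cuFileBufDeregister(dev_buf);
    cuFileHandleDeregister(handle);
    close(fd);
    cuFileDriverClose();
    return got;
}
```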
GPUDirect - Object Storage
- RDMA copy from object storage to the GPU
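The object-storage path is much less settled. A very rough sketch of just the GPUDirect RDMA building block (registering GPU memory with the NIC via ibverbs, assuming the nvidia-peermem module is loaded); the actual protocol to an object store would go through vendor SDKs:

```cuda
#include <infiniband/verbs.h>
#include <cuda_runtime.h>

// Register a cudaMalloc'd buffer with the NIC so a remote peer can RDMA-write
// into GPU memory directly; this works via the peer-memory interface when
// nvidia-peermem is loaded. Illustrative only.
struct ibv_mr* register_gpu_buffer(struct ibv_pd* pd, size_t n, void** out_buf) {
    void* dev_buf = nullptr;
    cudaMalloc(&dev_buf, n);
    *out_buf = dev_buf;
    return ibv_reg_mr(pd, dev_buf, n,
                      IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
}
```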
Vortex Scan Integration
- run the best-performing copy mode for each source as part of the scan, bypassing host memory where possible
- come up with a decision mechanism for whether, and which, parts of the scan should run on the GPU (a hypothetical shape is sketched below)
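A hypothetical shape of that decision mechanism; all names and thresholds below are invented for illustration:

```cuda
enum class CopyPath { HostStaged, GpuDirectNvme, GpuDirectRdma };

CopyPath choose_path(bool on_local_nvme, bool on_object_store,
                     bool gds_available, bool rdma_available,
                     size_t segment_bytes) {
    // Tiny segments rarely amortize DMA setup cost; stage via pinned host memory.
    const size_t kDirectThreshold = 1 << 20; // 1 MiB, placeholder
    if (segment_bytes < kDirectThreshold) return CopyPath::HostStaged;
    if (on_object_store && rdma_available) return CopyPath::GpuDirectRdma;
    if (on_local_nvme && gds_available) return CopyPath::GpuDirectNvme;
    return CopyPath::HostStaged;
}
```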