NanoVDB PointsToGrid::countNodes: Use segmented radix sort for higher tile counts #2170
Conversation
… key computation. Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Pull request overview
This PR optimizes nanovdb::tools::cuda::PointsToGrid::countNodes by adding a bulk segmented radix-sort path for voxel-key sorting when the number of tiles is high, while keeping the existing per-tile sorting path for low tile counts to avoid overhead regressions.
Changes:
- Added a bulk voxel-key generation kernel (`BulkVoxelKeyFunctor`) to compute voxel keys for all points in one launch when tile counts are high.
- Switched sorting to `cub::DeviceSegmentedRadixSort::SortPairs` for high tile counts using computed per-tile segment offsets, with a threshold-based fallback to the original per-tile loop.
- Simplified `setVerbose` to only update the local verbosity level.
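As background on the keys being sorted, each 64-bit voxel key packs the tile ID in the high bits and intra-tile offsets below it, with the voxel offset within an 8x8x8 leaf occupying the lowest 9 bits (8^3 = 2^9, per the comment in the diff). A minimal sketch of this kind of packing; the widths of the upper fields here are illustrative assumptions, not NanoVDB's actual layout:

```cpp
#include <cassert>
#include <cstdint>

// Pack a sort key: tile ID in the high bits, leaf-node offset in the middle,
// and the voxel offset within an 8^3 leaf in the lowest 9 bits.
// Only the 9-bit voxel field is taken from the source; the 32/23-bit split
// of the upper fields is a hypothetical example layout.
inline uint64_t makeVoxelKey(uint64_t tileID, uint64_t leafOffset, uint64_t voxelOffset)
{
    assert(voxelOffset < (1u << 9)); // 8^3 = 512 voxels per leaf
    return (tileID << 32) | (leafOffset << 9) | voxelOffset;
}
```

Because the tile ID occupies the most significant bits, a single radix sort of these keys groups points by tile first, then by leaf, then by voxel within the leaf; for example, `makeVoxelKey(1, 0, 511) < makeVoxelKey(1, 1, 0)` holds.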
harrism
left a comment
Looks like a great optimization. Well done.
uint64_t(NanoLeaf<BuildT>::CoordToOffset(ijk)); // voxel offset: 8^3 = 2^9, i.e. first 9 bits
}; // voxelKey lambda functor
// Find tile index for this point via upper_bound in prefix-sum offsets
const uint64_t tileID = thrust::upper_bound(thrust::seq, d_tile_offsets, d_tile_offsets + numTiles + 1, uint32_t(tid)) - d_tile_offsets - 1;
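For readers unfamiliar with the trick in the snippet above: given prefix-sum tile offsets, the tile owning a flat point index `tid` is found by an `upper_bound` search followed by a step back. A host-side sketch of the same logic, with `std::upper_bound` standing in for `thrust::upper_bound` and made-up example offsets:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Map a flat point index to its owning tile using prefix-sum offsets.
// offsets has numTiles + 1 entries; tile t owns indices [offsets[t], offsets[t+1]).
uint64_t tileIndex(const std::vector<uint32_t>& offsets, uint32_t tid)
{
    // upper_bound returns the first offset strictly greater than tid;
    // stepping back one gives the tile whose half-open range contains tid.
    return std::upper_bound(offsets.begin(), offsets.end(), tid) - offsets.begin() - 1;
}
```

For example, with offsets `{0, 4, 6, 11}` (three tiles holding 4, 2, and 5 points), `tileIndex(offsets, 4)` returns 1 because index 4 is the first point of the second tile.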
Is the difference in absolute time for small numbers of tiles large enough to warrant keeping the older path?
I was unsure whether we should keep the older path too. For small numbers of tiles, the overhead of the segmented sort makes it more expensive: for 100k points and 4 tiles, the new segmented sort takes 0.371 ms versus 0.278 ms for the old serial per-tile sort, a 34% regression (running on my Ada RTX 6000). For small tile counts, do you think that regression is a reasonable trade-off for the reduced code complexity?
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Yeah, that's quite a big relative difference. I'd say keep the older path then.
kmuseth
left a comment
Looks good, but let's be sure we have unit tests that validate both the new and the old code paths. That is, unit tests with fewer and with more than 32 tiles.
// For each tile in parallel, we construct another set of keys for the lower nodes, leaf nodes, and voxels within that tile, followed by a radix sort of these keys.
for (int deviceId = 0, id = 0; deviceId < static_cast<int>(mDeviceMesh.deviceCount()); ++deviceId) {
static constexpr uint32_t SEGMENTED_SORT_TILE_THRESHOLD = 32;
Is the 32 threshold based on measurements?
Yes, I ran some tests on the Stanford dragon, changing the voxel size to generate different numbers of tiles, and 32 is around where the crossover was. I measured on my Ada RTX 6000, so different devices might have different characteristics, but 32 was near the performance crossover for my data and device combination.
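The crossover-based dispatch being discussed can be sketched as follows. This is a simplified host-side illustration: `sortSegmented` and `sortPerTile` are hypothetical stand-ins for the CUB segmented-sort path and the original per-tile loop, and the threshold value is the one measured above:

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-ins for the two real sorting paths.
std::string sortSegmented() { return "segmented"; } // cub::DeviceSegmentedRadixSort path
std::string sortPerTile()   { return "per-tile"; }  // original serial per-tile loop

// Measured crossover point (Ada RTX 6000, Stanford dragon); other
// device/data combinations may favor a different value.
static constexpr uint32_t SEGMENTED_SORT_TILE_THRESHOLD = 32;

std::string sortVoxelKeys(uint32_t numTiles)
{
    // Bulk segmented sort amortizes launch overhead across many tiles,
    // but its fixed setup cost loses to the simple loop at low tile counts.
    return numTiles >= SEGMENTED_SORT_TILE_THRESHOLD ? sortSegmented() : sortPerTile();
}
```

Keeping the threshold a named `constexpr` makes it easy to re-tune if profiling on other devices moves the crossover.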
- Implemented unit tests for the serial per-tile sort path with fewer than 32 tiles.
- Added tests for the segmented sort path handling 32 or more tiles.

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Good call. I have added tests for those two scenarios to both the GPU and MultiGPU tests.
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
This pull request introduces a performance optimization to the voxel key sorting process in the implementation of `PointsToGrid`. The main improvement is the addition of a bulk segmented sort path for cases with many tiles, replacing a serial loop of per-tile kernel launches, which significantly speeds up sorting on large datasets. Creating a grid for the Stanford dragon at a voxel size that produced 200 tiles, the speedup to sorting was 19x; at a voxel size producing 6,000 tiles, the speedup was 73x. The end-to-end `PointsToGrid` improvement was 17% for the 6,000-tile case. For low tile counts, the segmented radix sort performed worse than the original, so a fallback to the per-tile path is included when tile counts are low.

Performance and algorithm improvements:
- Added a `BulkVoxelKeyFunctor` struct and associated kernel launch to efficiently compute voxel keys for all points in a single pass (instead of multiple kernel launches) when the number of tiles exceeds a threshold. This enables a bulk segmented sort path for large tile counts, improving performance for large datasets.
- Selected between the two sort paths with a tile-count threshold (`SEGMENTED_SORT_TILE_THRESHOLD`). Bulk sort is used for large tile counts, while the original per-tile sort is retained for small tile counts where it was faster.

Minor fixes:
- Simplified the `setVerbose` method to only set the local verbosity variable, removing the flag manipulation for clarity and correctness.
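As a point of reference for the bulk path: `cub::DeviceSegmentedRadixSort::SortPairs` sorts the key/value pairs of each segment `[offsets[i], offsets[i+1])` independently in one launch, which is what the computed per-tile segment offsets feed into. A host-side sketch of the same semantics, using `std::stable_sort` per segment on made-up example data rather than the actual GPU radix sort:

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Sort (key, value) pairs independently within each segment, mimicking the
// semantics of cub::DeviceSegmentedRadixSort::SortPairs on the host.
// segmentOffsets has numSegments + 1 entries delimiting half-open ranges.
void segmentedSortPairs(std::vector<uint64_t>& keys,
                        std::vector<uint32_t>& values,
                        const std::vector<uint32_t>& segmentOffsets)
{
    for (size_t s = 0; s + 1 < segmentOffsets.size(); ++s) {
        const uint32_t begin = segmentOffsets[s], end = segmentOffsets[s + 1];
        // Sort the indices of this segment by key, then apply the
        // resulting permutation to both keys and values.
        std::vector<uint32_t> order(end - begin);
        std::iota(order.begin(), order.end(), begin);
        std::stable_sort(order.begin(), order.end(),
                         [&](uint32_t a, uint32_t b) { return keys[a] < keys[b]; });
        std::vector<uint64_t> k(end - begin);
        std::vector<uint32_t> v(end - begin);
        for (size_t i = 0; i < order.size(); ++i) {
            k[i] = keys[order[i]];
            v[i] = values[order[i]];
        }
        std::copy(k.begin(), k.end(), keys.begin() + begin);
        std::copy(v.begin(), v.end(), values.begin() + begin);
    }
}
```

Because each segment is sorted independently, keys never move across tile boundaries, which is exactly why per-tile prefix-sum offsets are sufficient to drive the bulk sort.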