
NanoVDB PointsToGrid::countNodes: Use segmented radix sort for higher tile counts#2170

Merged
kmuseth merged 6 commits into AcademySoftwareFoundation:master from swahtz:pointstogrid_segmented_sort
Mar 4, 2026

Conversation

@swahtz
Contributor

@swahtz swahtz commented Feb 17, 2026

This pull request introduces a performance optimization to the voxel-key sorting step in the implementation of PointsToGrid. The main improvement is the addition of a bulk segmented sort path for cases with many tiles, replacing the serial loop of per-tile kernel launches, which significantly speeds up sorting on large datasets. Creating a grid for the Stanford dragon at a voxel size that produced 200 tiles, the sorting speedup was 19x; at a voxel size producing 6,000 tiles, the speedup was 73x. The end-to-end PointsToGrid improvement was 17% for the 6,000-tile case. For low tile counts, segmented radix sort performed worse than the original approach, so I include a fallback for that case.

Performance and algorithm improvements:

  • Added a new BulkVoxelKeyFunctor struct and associated kernel launch to efficiently compute voxel keys for all points in a single pass (instead of multiple kernel launches) when the number of tiles exceeds a threshold. This enables a bulk segmented sort path for large tile counts, improving performance for large datasets.
  • Modified the sorting logic to choose between bulk segmented sort and serial per-tile sort based on the number of tiles (SEGMENTED_SORT_TILE_THRESHOLD). Bulk sort is used for large tile counts, while the original per-tile sort is retained for small tile counts, where it was faster.

Minor fixes:

  • Fixed the setVerbose method to only set the local verbosity variable, removing the flag manipulation for clarity and correctness.

… key computation.

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
@swahtz swahtz requested a review from kmuseth as a code owner February 17, 2026 03:37
@swahtz swahtz added the nanovdb label Feb 17, 2026
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

Copilot AI left a comment


Pull request overview

This PR optimizes nanovdb::tools::cuda::PointsToGrid::countNodes by adding a bulk segmented radix-sort path for voxel-key sorting when the number of tiles is high, while keeping the existing per-tile sorting path for low tile counts to avoid overhead regressions.

Changes:

  • Added a bulk voxel-key generation kernel (BulkVoxelKeyFunctor) to compute voxel keys for all points in one launch when tile counts are high.
  • Switched sorting to cub::DeviceSegmentedRadixSort::SortPairs for high tile counts using computed per-tile segment offsets, with a threshold-based fallback to the original per-tile loop.
  • Simplified setVerbose to only update the local verbosity level.


Contributor

@harrism harrism left a comment


Looks like a great optimization. Well done.

uint64_t(NanoLeaf<BuildT>::CoordToOffset(ijk)); // voxel offset: 8^3 = 2^9, i.e. first 9 bits
}; // voxelKey lambda functor
// Find tile index for this point via upper_bound in prefix-sum offsets
const uint64_t tileID = thrust::upper_bound(thrust::seq, d_tile_offsets, d_tile_offsets + numTiles + 1, uint32_t(tid)) - d_tile_offsets - 1;
Contributor


👏 praise: Nice!

@matthewdcong
Contributor

Is the difference in absolute time for small numbers of tiles large enough to warrant keeping the older path?

@swahtz
Contributor Author

swahtz commented Feb 18, 2026

> Is the difference in absolute time for small numbers of tiles large enough to warrant keeping the older path?

I was unsure if we should keep the older path too. For small numbers of tiles, the overhead of the segmented sort makes it more expensive. For 100k points and 4 tiles, the new segmented sort takes 0.371 ms and the old serial per-tile sort takes 0.278 ms, a 34% regression (running on my Ada RTX 6000). For small tile counts, do you think it's reasonable to trade that off for code complexity?

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
@matthewdcong
Contributor

> Is the difference in absolute time for small numbers of tiles large enough to warrant keeping the older path?

> I was unsure if we should keep the older path too. For small numbers of tiles, the overhead of the segmented sort makes it more expensive. For 100k points and 4 tiles, the new segmented sort takes 0.371 ms and the old serial per-tile sort takes 0.278 ms, a 34% regression (running on my Ada RTX 6000). For small tile counts, do you think it's reasonable to trade that off for code complexity?

Yeah, that's quite a big relative difference. I'd say keep the older path then.

Contributor

@kmuseth kmuseth left a comment


Looks good, but let's be sure we have unit tests that validate both the new and the old code paths. That is, unit tests with fewer and more than 32 tiles.


// For each tile in parallel, we construct another set of keys for the lower nodes, leaf nodes, and voxels within that tile followed by a radix sort of these keys.
for (int deviceId = 0, id = 0; deviceId < static_cast<int>(mDeviceMesh.deviceCount()); ++deviceId) {
static constexpr uint32_t SEGMENTED_SORT_TILE_THRESHOLD = 32;
Contributor


Is the 32 threshold based on measurements?

Contributor Author


Yes, I ran some tests on the Stanford dragon, changing the voxel size to generate different numbers of tiles, and 32 is around where the crossover was. I was measuring on my Ada RTX 6000, so different devices might have different characteristics, but 32 was near the performance crossover for my data and device combination.

- Implemented unit tests for the serial per-tile sort path with fewer than 32 tiles.
- Added tests for the segmented sort path handling 32 or more tiles.

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
@swahtz
Contributor Author

swahtz commented Feb 26, 2026

> Looks good, but let's be sure we have unit tests that validate both the new and the old code paths. That is, unit tests with fewer and more than 32 tiles.

Good call, I have added tests for those two scenarios to both the GPU and MultiGPU tests.

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
@kmuseth kmuseth merged commit fec5977 into AcademySoftwareFoundation:master Mar 4, 2026
14 checks passed
@swahtz swahtz deleted the pointstogrid_segmented_sort branch March 4, 2026 00:11
