NanoVDB PointsToGrid::countNodes: Use segmented radix sort for higher tile counts #2170
Conversation
… key computation. Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Pull request overview
This PR optimizes nanovdb::tools::cuda::PointsToGrid::countNodes by adding a bulk segmented radix-sort path for voxel-key sorting when the number of tiles is high, while keeping the existing per-tile sorting path for low tile counts to avoid overhead regressions.
Changes:
- Added a bulk voxel-key generation kernel (`BulkVoxelKeyFunctor`) to compute voxel keys for all points in one launch when tile counts are high.
- Switched sorting to `cub::DeviceSegmentedRadixSort::SortPairs` for high tile counts using computed per-tile segment offsets, with a threshold-based fallback to the original per-tile loop.
- Simplified `setVerbose` to only update the local verbosity level.
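As background on the keys being sorted, each 64-bit voxel key packs the tile ID in the high bits and intra-tile offsets below it, with the voxel offset within an 8x8x8 leaf occupying the lowest 9 bits (8^3 = 2^9, per the comment in the diff). A minimal sketch of this kind of packing; the widths of the upper fields here are illustrative assumptions, not NanoVDB's actual layout:

```cpp
#include <cassert>
#include <cstdint>

// Pack a sort key: tile ID in the high bits, leaf-node offset in the middle,
// and the voxel offset within an 8^3 leaf in the lowest 9 bits.
// Only the 9-bit voxel field is taken from the source; the 32/23-bit split
// of the upper fields is a hypothetical example layout.
inline uint64_t makeVoxelKey(uint64_t tileID, uint64_t leafOffset, uint64_t voxelOffset)
{
    assert(voxelOffset < (1u << 9)); // 8^3 = 512 voxels per leaf
    return (tileID << 32) | (leafOffset << 9) | voxelOffset;
}
```

Because the tile ID occupies the most significant bits, a single radix sort of these keys groups points by tile first, then by leaf, then by voxel within the leaf; for example, `makeVoxelKey(1, 0, 511) < makeVoxelKey(1, 1, 0)` holds.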
harrism
left a comment
Looks like a great optimization. Well done.
uint64_t(NanoLeaf<BuildT>::CoordToOffset(ijk)); // voxel offset: 8^3 = 2^9, i.e. first 9 bits
}; // voxelKey lambda functor
// Find tile index for this point via upper_bound in prefix-sum offsets
const uint64_t tileID = thrust::upper_bound(thrust::seq, d_tile_offsets, d_tile_offsets + numTiles + 1, uint32_t(tid)) - d_tile_offsets - 1;
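For readers unfamiliar with the trick in the snippet above: given prefix-sum tile offsets, the tile owning a flat point index `tid` is found by an `upper_bound` search followed by a step back. A host-side sketch of the same logic, with `std::upper_bound` standing in for `thrust::upper_bound` and made-up example offsets:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Map a flat point index to its owning tile using prefix-sum offsets.
// offsets has numTiles + 1 entries; tile t owns indices [offsets[t], offsets[t+1]).
uint64_t tileIndex(const std::vector<uint32_t>& offsets, uint32_t tid)
{
    // upper_bound returns the first offset strictly greater than tid;
    // stepping back one gives the tile whose half-open range contains tid.
    return std::upper_bound(offsets.begin(), offsets.end(), tid) - offsets.begin() - 1;
}
```

For example, with offsets `{0, 4, 6, 11}` (three tiles holding 4, 2, and 5 points), `tileIndex(offsets, 4)` returns 1 because index 4 is the first point of the second tile.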
Is the difference in absolute time for small numbers of tiles large enough to warrant keeping the older path?
I was unsure whether we should keep the older path too. For small numbers of tiles, the overhead of the segmented sort makes it more expensive: for 100k points and 4 tiles, the new segmented sort takes 0.371 ms versus 0.278 ms for the old serial per-tile sort, a 34% regression (running on my Ada RTX 6000). For small tile counts, do you think that regression is a reasonable trade-off for the reduced code complexity?
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Yeah, that's quite a big relative difference. I'd say keep the older path then.
kmuseth
left a comment
Looks good, but let's be sure we have unit tests that validate both the new and the old code paths. That is, unit tests with fewer and with more than 32 tiles.
// For each tile in parallel, we construct another set of keys for the lower nodes, leaf nodes, and voxels within that tile, followed by a radix sort of these keys.
for (int deviceId = 0, id = 0; deviceId < static_cast<int>(mDeviceMesh.deviceCount()); ++deviceId) {
static constexpr uint32_t SEGMENTED_SORT_TILE_THRESHOLD = 32;
Is the 32 threshold based on measurements?
Yes, I ran some tests on the Stanford dragon, changing the voxel size to generate different numbers of tiles, and 32 is around where the crossover was. I measured on my Ada RTX 6000, so different devices might have different characteristics, but 32 was near the performance crossover for my data and device combination.
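The crossover-based dispatch being discussed can be sketched as follows. This is a simplified host-side illustration: `sortSegmented` and `sortPerTile` are hypothetical stand-ins for the CUB segmented-sort path and the original per-tile loop, and the threshold value is the one measured above:

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-ins for the two real sorting paths.
std::string sortSegmented() { return "segmented"; } // cub::DeviceSegmentedRadixSort path
std::string sortPerTile()   { return "per-tile"; }  // original serial per-tile loop

// Measured crossover point (Ada RTX 6000, Stanford dragon); other
// device/data combinations may favor a different value.
static constexpr uint32_t SEGMENTED_SORT_TILE_THRESHOLD = 32;

std::string sortVoxelKeys(uint32_t numTiles)
{
    // Bulk segmented sort amortizes launch overhead across many tiles,
    // but its fixed setup cost loses to the simple loop at low tile counts.
    return numTiles >= SEGMENTED_SORT_TILE_THRESHOLD ? sortSegmented() : sortPerTile();
}
```

Keeping the threshold a named `constexpr` makes it easy to re-tune if profiling on other devices moves the crossover.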
- Implemented unit tests for the serial per-tile sort path with fewer than 32 tiles.
- Added tests for the segmented sort path handling 32 or more tiles.

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Good call. I have added tests for those two scenarios to both the GPU and MultiGPU tests.
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
This pull request introduces a performance optimization to the voxel key sorting process in the implementation of `PointsToGrid`. The main improvement is the addition of a bulk segmented sort path for cases with many tiles, replacing a serial loop of per-tile kernel launches, which significantly speeds up sorting on large datasets. Creating a grid for the Stanford dragon at a voxel size that produced 200 tiles, the speedup to sorting was 19x; at a voxel size producing 6,000 tiles, the speedup was 73x. The end-to-end `PointsToGrid` improvement was 17% for the 6,000-tile case. For low tile counts, the segmented radix sort performed worse than the original, so a fallback to the per-tile path is included when tile counts are low.

Performance and algorithm improvements:
- Added a `BulkVoxelKeyFunctor` struct and associated kernel launch to efficiently compute voxel keys for all points in a single pass (instead of multiple kernel launches) when the number of tiles exceeds a threshold. This enables a bulk segmented sort path for large tile counts, improving performance for large datasets.
- Selected between the two sort paths with a tile-count threshold (`SEGMENTED_SORT_TILE_THRESHOLD`). Bulk sort is used for large tile counts, while the original per-tile sort is retained for small tile counts where it was faster.

Minor fixes:
- Simplified the `setVerbose` method to only set the local verbosity variable, removing the flag manipulation for clarity and correctness.
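As a point of reference for the bulk path: `cub::DeviceSegmentedRadixSort::SortPairs` sorts the key/value pairs of each segment `[offsets[i], offsets[i+1])` independently in one launch, which is what the computed per-tile segment offsets feed into. A host-side sketch of the same semantics, using `std::stable_sort` per segment on made-up example data rather than the actual GPU radix sort:

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Sort (key, value) pairs independently within each segment, mimicking the
// semantics of cub::DeviceSegmentedRadixSort::SortPairs on the host.
// segmentOffsets has numSegments + 1 entries delimiting half-open ranges.
void segmentedSortPairs(std::vector<uint64_t>& keys,
                        std::vector<uint32_t>& values,
                        const std::vector<uint32_t>& segmentOffsets)
{
    for (size_t s = 0; s + 1 < segmentOffsets.size(); ++s) {
        const uint32_t begin = segmentOffsets[s], end = segmentOffsets[s + 1];
        // Sort the indices of this segment by key, then apply the
        // resulting permutation to both keys and values.
        std::vector<uint32_t> order(end - begin);
        std::iota(order.begin(), order.end(), begin);
        std::stable_sort(order.begin(), order.end(),
                         [&](uint32_t a, uint32_t b) { return keys[a] < keys[b]; });
        std::vector<uint64_t> k(end - begin);
        std::vector<uint32_t> v(end - begin);
        for (size_t i = 0; i < order.size(); ++i) {
            k[i] = keys[order[i]];
            v[i] = values[order[i]];
        }
        std::copy(k.begin(), k.end(), keys.begin() + begin);
        std::copy(v.begin(), v.end(), values.begin() + begin);
    }
}
```

Because each segment is sorted independently, keys never move across tile boundaries, which is exactly why per-tile prefix-sum offsets are sufficient to drive the bulk sort.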