feat[fastlanes]: add optimized 1024-bit transpose implementations#6135
feat[fastlanes]: add optimized 1024-bit transpose implementations#6135joseph-isaacs wants to merge 10 commits intodevelopfrom
Conversation
Add a new transpose module implementing the FastLanes 1024-bit transpose algorithm with multiple optimized implementations: - Baseline: bit-by-bit reference implementation using fastlanes transpose_index - Scalar: optimized byte-level implementation (~2.6x faster than baseline) - AVX2: SIMD-accelerated version for x86_64 with AVX2 support - AVX2+GFNI: uses Galois Field instructions when available - AVX-512+GFNI: full 512-bit vector implementation Also includes inverse (untranspose) operations and comprehensive unit tests comparing all implementations against the baseline. Divan benchmarks are provided to measure performance. Signed-off-by: Claude <noreply@anthropic.com>
Add examples for inspecting assembly and measuring cycle counts: - check_asm.rs: Helper to generate assembly for each implementation - perf_transpose.rs: rdtsc-based cycle counter for performance analysis Also change SIMD functions to #[inline(never)] to ensure they appear as separate symbols for assembly inspection. Signed-off-by: Claude <noreply@anthropic.com>
…tions Add BMI2 implementation using PEXT/PDEP for efficient bit extraction/deposit, achieving 32x speedup over baseline (~48ns vs ~1.5µs per 1024-bit transpose). Fix AVX2+GFNI and AVX-512+GFNI implementations to use the classic 8x8 bit matrix transpose algorithm with XOR/shift operations, since GFNI's gf2p8affineqb operates per-byte and cannot shuffle bits between bytes. Performance summary (median times, 1024-bit transpose): - baseline: 1.562 µs (bit-by-bit) - scalar: 641.6 ns (2.4x faster) - avx2: 218.8 ns (7x faster) - avx2_gfni: 71.98 ns (22x faster) - bmi2: 47.92 ns (33x faster) - avx512_gfni: 44.38 ns (35x faster) Add BMI2 benchmarks for both transpose and untranspose operations. Signed-off-by: Claude <noreply@anthropic.com>
Merging this PR will degrade performance by 44.35%
|
| Mode | Benchmark | BASE |
HEAD |
Efficiency | |
|---|---|---|---|---|---|
| ❌ | WallTime | u8_FoR[10M] |
5.7 µs | 10.2 µs | -44.35% |
| ❌ | WallTime | u16_FoR[10M] |
7.7 µs | 10.5 µs | -26.71% |
| ❌ | Simulation | canonical_into_non_nullable[(10000, 100, 0.1)] |
3.7 ms | 4.5 ms | -18.26% |
| 🆕 | Simulation | transpose_baseline_throughput |
N/A | 2.5 ms | N/A |
| 🆕 | Simulation | transpose_best_throughput |
N/A | 92.8 µs | N/A |
| 🆕 | Simulation | transpose_baseline |
N/A | 10.9 µs | N/A |
| 🆕 | Simulation | untranspose_best |
N/A | 2.8 µs | N/A |
| 🆕 | Simulation | transpose_scalar_throughput |
N/A | 661 µs | N/A |
| 🆕 | Simulation | transpose_scalar |
N/A | 3.4 µs | N/A |
| 🆕 | Simulation | transpose_best |
N/A | 2 µs | N/A |
| 🆕 | Simulation | untranspose_scalar |
N/A | 3.2 µs | N/A |
| 🆕 | Simulation | transpose_scalar_fast_throughput |
N/A | 64.2 µs | N/A |
| ❌ | Simulation | canonical_into_non_nullable[(10000, 100, 0.0)] |
1.9 ms | 2.7 ms | -29.9% |
| 🆕 | Simulation | untranspose_baseline |
N/A | 10.9 µs | N/A |
| ⚡ | Simulation | canonical_into_nullable[(10000, 10, 0.0)] |
528.5 µs | 445.6 µs | +18.61% |
| 🆕 | Simulation | transpose_avx2 |
N/A | 2.8 µs | N/A |
| 🆕 | Simulation | untranspose_bmi2 |
N/A | 2.7 µs | N/A |
| 🆕 | Simulation | transpose_avx2_throughput |
N/A | 314.3 µs | N/A |
| ⚡ | Simulation | canonical_into_nullable[(10000, 100, 0.0)] |
4.9 ms | 4.1 ms | +19.6% |
| ❌ | Simulation | canonical_into_non_nullable[(10000, 100, 0.01)] |
2.1 ms | 3 ms | -27.53% |
| ... | ... | ... | ... | ... | ... |
ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.
Comparing claude/bitpacking-transpose-optimization-tM1U4 (17c7783) with develop (1a6ece1)
Footnotes
-
1290 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports. ↩
…ions Add highly optimized transpose implementations: 1. scalar_fast: Uses 8x8 bit matrix transpose algorithm with XOR/shift operations, achieving ~59 ns per 1024-bit transpose (25x faster than baseline). This is portable and works on all platforms. 2. ARM64 NEON: Uses NEON intrinsics for parallel bit transpose on AArch64, processing 2 groups at a time with 128-bit vector registers. Performance results (median times, 1024-bit transpose on x86-64): - baseline: 1.512 µs (bit-by-bit reference) - scalar: 641.2 ns (2.4x faster) - scalar_fast: 58.92 ns (25.7x faster) - NEW - avx2: 212.7 ns (7.1x faster) - avx2_gfni: 72.54 ns (20.8x faster) - bmi2: 60.56 ns (25.0x faster) - avx512_gfni: 44.38 ns (34.1x faster) The scalar_fast implementation achieves near-SIMD performance through: - Gather 8 bytes at stride 16 into u64 - Apply 8x8 bit transpose using 3 XOR/shift steps - Fully unrolled loops for all 16 base patterns Assembly verified to use: - BMI2: PEXT instructions for bit extraction - AVX-512: vpxord/vpsrlq/vpsllq for parallel bit transpose Signed-off-by: Claude <noreply@anthropic.com>
Testing showed that fully unrolling the BMI2 PEXT operations yields approximately 12% better performance compared to the looped version. The compiler doesn't fully optimize nested loops with PEXT intrinsics. Signed-off-by: Claude <noreply@anthropic.com>
Add test to verify our transpose_index implementation exactly matches the fastlanes crate's transpose function for all 1024 indices. Signed-off-by: Claude <noreply@anthropic.com>
|
claude says SummaryFast implementations of the FastLanes 1024-bit transpose operation with multiple SIMD backends. Performance Results (cycles/call, lower is better)
Implementations
Runtime Dispatch
Verification
Test plan
|
Add AVX-512 VBMI optimized transpose implementation using vpermi2b/vpermb for vectorized gather and scatter operations. Performance improvements: - VBMI: 13.6 cycles/call (7.5x faster than avx512_gfni at 102.6 cycles) - VBMI: 240x faster than baseline (3276 cycles) Key optimizations: - Use vpermi2b to gather 8 bytes at stride-16 in parallel - Use vpermb for 8x8 byte transpose during scatter phase - Static permutation tables to avoid stack allocation Also adds: - Dual-block transpose_1024x2_avx512 for batch processing - VBMI detection via has_vbmi() function - Updated dispatch to prefer VBMI when available Signed-off-by: Claude <noreply@anthropic.com>
Add transpose_1024x2_vbmi and untranspose_1024x2_vbmi for batch processing of two 128-byte blocks simultaneously using interleaved VBMI operations. Performance: - vbmi_dual: 11.9 cycles/block (10.5% faster than single-block at 13.3) - Useful for bulk transpose operations The dual-block version achieves better throughput by: - Loading 4 input ZMM registers upfront (2 per block) - Interleaving gather/transpose/scatter operations - Better instruction-level parallelism hides latencies Signed-off-by: Claude <noreply@anthropic.com>
Add transpose_1024x4_vbmi that processes 4 independent 128-byte blocks simultaneously using fully interleaved operations for maximum ILP. Performance: 12.4 cycles/block (vs 13.3 for dual-block, 300x faster than baseline) Signed-off-by: Claude <noreply@anthropic.com>
Documents the 287x speedup achieved using AVX-512 VBMI for the FastLanes 1024-bit transpose operation. Signed-off-by: Claude <noreply@anthropic.com>
This would enable delta encoding.
Add a new transpose module implementing the FastLanes 1024-bit transpose
algorithm with multiple optimized implementations:
Also includes inverse (untranspose) operations and comprehensive unit tests
comparing all implementations against the baseline. Divan benchmarks are
provided to measure performance.
Signed-off-by: Claude noreply@anthropic.com