keccak: convert ARMv8 ASM into intrinsics by tarcieri · Pull Request #112 · RustCrypto/sponges

tarcieri · 2026-02-27T06:06:10Z

Rewrites the inline assembly implementation as an equivalent (but not identical) intrinsics implementation. Also exposes support for computing two Keccak states in parallel: a comment in the ASM implementation noted this was possible, but it was never actually exposed. It is now available as `p1600_armv8_sha3_times2` (though not yet in the public API, see #110).
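To illustrate what "two Keccak states in parallel" means at the data-layout level, here is a minimal portable sketch. It assumes each 128-bit vector holds lane `i` of both states; the type alias `Lanes2` is a plain-array stand-in for NEON's `uint64x2_t`, and `pack`, `unpack`, and `iota2` are hypothetical helper names, not the signature of `p1600_armv8_sha3_times2`.

```rust
/// Stand-in for NEON's `uint64x2_t`: two 64-bit lanes (assumption for portability).
type Lanes2 = [u64; 2];

/// Interleave two independent Keccak states so that lane `i` of each state
/// shares one two-lane vector.
fn pack(a: &[u64; 25], b: &[u64; 25]) -> [Lanes2; 25] {
    core::array::from_fn(|i| [a[i], b[i]])
}

/// Split the interleaved representation back into two separate states.
fn unpack(s: &[Lanes2; 25]) -> ([u64; 25], [u64; 25]) {
    (
        core::array::from_fn(|i| s[i][0]),
        core::array::from_fn(|i| s[i][1]),
    )
}

/// Example of a lane-wise step: XOR one round constant into lane 0 of both
/// states at once, which is what a single vector XOR would do.
fn iota2(s: &mut [Lanes2; 25], rc: u64) {
    s[0][0] ^= rc;
    s[0][1] ^= rc;
}
```

Because every Keccak step is lane-wise, any operation applied to the packed array updates both states at the same time, which is why the vector implementation can process two states for roughly the cost of one.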

This is a little tricky due to high register pressure: this implementation uses every vector register.

I started by rewriting the round loop and iterating over the round constants, then broke the body apart into theta and everything else (rho/pi/chi/iota), mapping the NEON registers onto a `[uint64x2_t; 25]` state.
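The structure described above (a loop over round constants, with the body split into theta and rho/pi/chi/iota) can be sketched in portable scalar Rust. This is a plain `u64` reference version of Keccak-f[1600] written for illustration, not the intrinsics code from this PR; the function names are hypothetical.

```rust
/// Round constants for Keccak-f[1600] (FIPS 202), one per round.
const RC: [u64; 24] = [
    0x0000000000000001, 0x0000000000008082, 0x800000000000808A, 0x8000000080008000,
    0x000000000000808B, 0x0000000080000001, 0x8000000080008081, 0x8000000000008009,
    0x000000000000008A, 0x0000000000000088, 0x0000000080008009, 0x000000008000000A,
    0x000000008000808B, 0x800000000000008B, 0x8000000000008089, 0x8000000000008003,
    0x8000000000008002, 0x8000000000000080, 0x000000000000800A, 0x800000008000000A,
    0x8000000080008081, 0x8000000000008080, 0x0000000080000001, 0x8000000080008008,
];

/// Theta: XOR each lane with the parities of two neighbouring columns.
fn theta(a: &mut [u64; 25]) {
    let mut c = [0u64; 5];
    for x in 0..5 {
        c[x] = a[x] ^ a[x + 5] ^ a[x + 10] ^ a[x + 15] ^ a[x + 20];
    }
    for x in 0..5 {
        let d = c[(x + 4) % 5] ^ c[(x + 1) % 5].rotate_left(1);
        for y in 0..5 {
            a[5 * y + x] ^= d;
        }
    }
}

/// Everything else: rho and pi (rotate and move lanes), then chi, then iota.
fn rho_pi_chi_iota(a: &mut [u64; 25], rc: u64) {
    // rho + pi: walk the 24 non-origin lanes along (x, y) -> (y, 2x + 3y),
    // rotating the lane carried from the previous position by (t+1)(t+2)/2.
    let (mut x, mut y) = (1usize, 0usize);
    let mut cur = a[1];
    for t in 0..24u32 {
        (x, y) = (y, (2 * x + 3 * y) % 5);
        let next = a[x + 5 * y];
        a[x + 5 * y] = cur.rotate_left(((t + 1) * (t + 2) / 2) % 64);
        cur = next;
    }
    // chi: combine each lane with the next two lanes of the same row.
    for y in 0..5 {
        let mut row = [0u64; 5];
        row.copy_from_slice(&a[5 * y..5 * y + 5]);
        for x in 0..5 {
            a[5 * y + x] = row[x] ^ (!row[(x + 1) % 5] & row[(x + 2) % 5]);
        }
    }
    // iota: inject the round constant into lane 0.
    a[0] ^= rc;
}

/// Full permutation: 24 rounds, one round constant each.
fn keccak_f1600(a: &mut [u64; 25]) {
    for rc in RC {
        theta(a);
        rho_pi_chi_iota(a, rc);
    }
}
```

Applying `keccak_f1600` to an all-zero state should leave `0xF1258F7940E1DDE7` in lane 0, which is a convenient sanity check when comparing backends.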

Theta was translated by hand, but manually mapping registers to slots in the state array was too tedious for the remaining steps. So I wrote a small program that operates over a representation of the original assembly, doing all the bookkeeping for which registers map to which slots in the state array, and outputs the equivalent intrinsics code.
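The translator itself is not reproduced in this text, so the following is only a minimal sketch of the bookkeeping idea under stated assumptions: a hypothetical `Insn` enum models two SHA3-extension instructions, a map tracks which Rust expression currently lives in each vector register, and each instruction is emitted as a `let` binding calling the corresponding `core::arch::aarch64` intrinsic (`vbcaxq_u64`, `vxarq_u64`). The names `Insn`, `translate`, `s[..]`, and `tN` are illustrative.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified model of an instruction from the original assembly.
enum Insn {
    /// `bcax vd, va, vb, vc`: vd = va ^ (vb & !vc)
    Bcax { d: u8, a: u8, b: u8, c: u8 },
    /// `xar vd, va, vb, #imm`: vd = (va ^ vb) rotated right by `imm`
    Xar { d: u8, a: u8, b: u8, imm: u8 },
}

/// Emit one `let` binding per instruction.
///
/// `initial` lists (register, state slot) pairs that hold at entry; the map
/// is updated after every instruction so later operands refer to the most
/// recent value written to each register.
fn translate(initial: &[(u8, usize)], insns: &[Insn]) -> Vec<String> {
    let mut regs: HashMap<u8, String> = initial
        .iter()
        .map(|&(reg, slot)| (reg, format!("s[{}]", slot)))
        .collect();
    let mut out = Vec::new();
    for (n, insn) in insns.iter().enumerate() {
        let (d, expr) = match *insn {
            Insn::Bcax { d, a, b, c } => (
                d,
                format!("vbcaxq_u64({}, {}, {})", regs[&a], regs[&b], regs[&c]),
            ),
            Insn::Xar { d, a, b, imm } => (
                d,
                format!("vxarq_u64::<{}>({}, {})", imm, regs[&a], regs[&b]),
            ),
        };
        out.push(format!("let t{} = {};", n, expr));
        // The destination register now holds the freshly bound temporary.
        regs.insert(d, format!("t{}", n));
    }
    out
}
```

Feeding it `bcax v3, v0, v2, v1` followed by `xar v0, v3, v1, #63`, with `v0..v2` initially mapped to `s[0]..s[2]`, yields two lines in which the second operand list refers to `t0` rather than a register number, which is the kind of tracking that is tedious to do by hand.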

Godbolt links to the original `asm!` versus this translation:

original: https://godbolt.org/z/G8Mf5vboE
translated: https://godbolt.org/z/sszzbdexK

It uses nearly the same number of instructions, but the two versions differ: it isn't an identical recreation of the original assembly (I'm not sure that's possible or even preferable), but it should be functionally equivalent.

Since we're no longer using `asm!`, `cfg(armv8_asm)` has been removed and this is now enabled by default on aarch64 targets.
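As a rough illustration of what "enabled by default on aarch64 targets" can look like, here is a minimal sketch assuming compile-time gating on `target_arch` plus a runtime check for the `sha3` CPU feature via the standard library's `is_aarch64_feature_detected!` macro. The function name `backend_name` and the returned labels are hypothetical; the crate's actual dispatch code is not shown in this description.

```rust
/// Report which backend this sketch would pick on the current machine.
/// (Hypothetical helper; labels are illustrative only.)
fn backend_name() -> &'static str {
    // Only compiled on aarch64: no opt-in `cfg` flag is needed any more.
    #[cfg(target_arch = "aarch64")]
    {
        // The SHA3 extension is optional in ARMv8, so check for it at runtime.
        if std::arch::is_aarch64_feature_detected!("sha3") {
            return "armv8-sha3-intrinsics";
        }
    }
    // Every other target, and aarch64 CPUs without the extension.
    "software"
}
```

On non-aarch64 targets the gated block disappears entirely, so the function always falls through to the portable implementation.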

Closes #95

Benchmarks (`sha3` crate on M1 Max)

Pure software implementation

test sha3_224_10    ... bench:          17.97 ns/iter (+/- 0.32) = 588 MB/s
test sha3_224_100   ... bench:         164.15 ns/iter (+/- 5.14) = 609 MB/s
test sha3_224_1000  ... bench:       1,646.07 ns/iter (+/- 139.45) = 607 MB/s
test sha3_224_10000 ... bench:      16,585.52 ns/iter (+/- 1,168.57) = 602 MB/s
test sha3_256_10    ... bench:          19.12 ns/iter (+/- 0.77) = 526 MB/s
test sha3_256_1000  ... bench:       1,694.21 ns/iter (+/- 41.20) = 590 MB/s
test sha3_256_10000 ... bench:      16,807.40 ns/iter (+/- 556.17) = 594 MB/s
test sha3_265_100   ... bench:         173.41 ns/iter (+/- 4.98) = 578 MB/s
test sha3_384_10    ... bench:          24.32 ns/iter (+/- 1.16) = 416 MB/s
test sha3_384_100   ... bench:         225.00 ns/iter (+/- 5.50) = 444 MB/s
test sha3_384_1000  ... bench:       2,224.49 ns/iter (+/- 47.86) = 449 MB/s
test sha3_384_10000 ... bench:      22,181.02 ns/iter (+/- 971.37) = 450 MB/s
test sha3_512_10    ... bench:          33.78 ns/iter (+/- 0.32) = 303 MB/s
test sha3_512_100   ... bench:         320.54 ns/iter (+/- 10.77) = 312 MB/s
test sha3_512_1000  ... bench:       3,174.62 ns/iter (+/- 80.98) = 315 MB/s
test sha3_512_10000 ... bench:      31,629.97 ns/iter (+/- 871.85) = 316 MB/s
test shake128_10    ... bench:          15.97 ns/iter (+/- 0.44) = 666 MB/s
test shake128_100   ... bench:         142.19 ns/iter (+/- 6.58) = 704 MB/s
test shake128_1000  ... bench:       1,390.27 ns/iter (+/- 56.14) = 719 MB/s
test shake128_10000 ... bench:      13,813.13 ns/iter (+/- 677.65) = 723 MB/s
test shake256_10    ... bench:          19.06 ns/iter (+/- 0.44) = 526 MB/s
test shake256_100   ... bench:         173.50 ns/iter (+/- 4.26) = 578 MB/s
test shake256_1000  ... bench:       1,695.05 ns/iter (+/- 87.19) = 589 MB/s
test shake256_10000 ... bench:      16,882.98 ns/iter (+/- 683.56) = 592 MB/s

This new intrinsics implementation

test sha3_224_10    ... bench:          13.07 ns/iter (+/- 0.55) = 769 MB/s
test sha3_224_100   ... bench:         111.29 ns/iter (+/- 6.62) = 900 MB/s
test sha3_224_1000  ... bench:       1,113.87 ns/iter (+/- 29.88) = 898 MB/s
test sha3_224_10000 ... bench:      11,095.95 ns/iter (+/- 302.99) = 901 MB/s
test sha3_256_10    ... bench:          13.53 ns/iter (+/- 0.51) = 769 MB/s
test sha3_256_1000  ... bench:       1,173.40 ns/iter (+/- 33.72) = 852 MB/s
test sha3_256_10000 ... bench:      12,305.99 ns/iter (+/- 623.31) = 812 MB/s
test sha3_265_100   ... bench:         118.16 ns/iter (+/- 2.85) = 847 MB/s
test sha3_384_10    ... bench:          17.27 ns/iter (+/- 0.78) = 588 MB/s
test sha3_384_100   ... bench:         153.80 ns/iter (+/- 5.42) = 653 MB/s
test sha3_384_1000  ... bench:       1,529.35 ns/iter (+/- 18.99) = 654 MB/s
test sha3_384_10000 ... bench:      15,239.19 ns/iter (+/- 189.19) = 656 MB/s
test sha3_512_10    ... bench:          23.43 ns/iter (+/- 0.95) = 434 MB/s
test sha3_512_100   ... bench:         218.97 ns/iter (+/- 4.01) = 458 MB/s
test sha3_512_1000  ... bench:       2,193.58 ns/iter (+/- 37.98) = 455 MB/s
test sha3_512_10000 ... bench:      21,968.75 ns/iter (+/- 385.75) = 455 MB/s
test shake128_10    ... bench:          11.47 ns/iter (+/- 0.32) = 909 MB/s
test shake128_100   ... bench:          95.51 ns/iter (+/- 1.32) = 1052 MB/s
test shake128_1000  ... bench:         960.08 ns/iter (+/- 34.57) = 1041 MB/s
test shake128_10000 ... bench:       9,564.39 ns/iter (+/- 255.34) = 1045 MB/s
test shake256_10    ... bench:          13.61 ns/iter (+/- 0.53) = 769 MB/s
test shake256_100   ... bench:         116.77 ns/iter (+/- 1.94) = 862 MB/s
test shake256_1000  ... bench:       1,163.09 ns/iter (+/- 27.17) = 859 MB/s
test shake256_10000 ... bench:      11,750.47 ns/iter (+/- 250.38) = 851 MB/s

Original assembly

test sha3_224_10    ... bench:          12.54 ns/iter (+/- 0.43) = 833 MB/s
test sha3_224_100   ... bench:         109.49 ns/iter (+/- 2.54) = 917 MB/s
test sha3_224_1000  ... bench:       1,095.79 ns/iter (+/- 32.04) = 913 MB/s
test sha3_224_10000 ... bench:      10,953.02 ns/iter (+/- 157.49) = 912 MB/s
test sha3_256_10    ... bench:          13.05 ns/iter (+/- 0.25) = 769 MB/s
test sha3_256_1000  ... bench:       1,161.46 ns/iter (+/- 28.09) = 861 MB/s
test sha3_256_10000 ... bench:      11,609.98 ns/iter (+/- 148.88) = 861 MB/s
test sha3_265_100   ... bench:         118.17 ns/iter (+/- 7.42) = 847 MB/s
test sha3_384_10    ... bench:          17.07 ns/iter (+/- 2.80) = 588 MB/s
test sha3_384_100   ... bench:         151.93 ns/iter (+/- 4.39) = 662 MB/s
test sha3_384_1000  ... bench:       1,506.50 ns/iter (+/- 40.71) = 664 MB/s
test sha3_384_10000 ... bench:      15,119.04 ns/iter (+/- 495.59) = 661 MB/s
test sha3_512_10    ... bench:          22.93 ns/iter (+/- 0.53) = 454 MB/s
test sha3_512_100   ... bench:         216.77 ns/iter (+/- 7.42) = 462 MB/s
test sha3_512_1000  ... bench:       2,165.67 ns/iter (+/- 49.04) = 461 MB/s
test sha3_512_10000 ... bench:      21,666.71 ns/iter (+/- 651.02) = 461 MB/s
test shake128_10    ... bench:          11.30 ns/iter (+/- 0.14) = 909 MB/s
test shake128_100   ... bench:          94.75 ns/iter (+/- 3.86) = 1063 MB/s
test shake128_1000  ... bench:         961.72 ns/iter (+/- 81.88) = 1040 MB/s
test shake128_10000 ... bench:       9,573.39 ns/iter (+/- 311.05) = 1044 MB/s
test shake256_10    ... bench:          13.17 ns/iter (+/- 0.54) = 769 MB/s
test shake256_100   ... bench:         117.39 ns/iter (+/- 3.22) = 854 MB/s
test shake256_1000  ... bench:       1,174.65 ns/iter (+/- 45.62) = 851 MB/s
test shake256_10000 ... bench:      11,659.19 ns/iter (+/- 330.23) = 857 MB/s

The performance seems pretty close to the original assembly: within a few percent on most inputs, maybe just slightly slower.
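To make the comparison concrete, the figures above can be checked with a little arithmetic. The sketch below assumes the MB/s column is computed as bytes × 1000 divided by whole nanoseconds using integer division, which reproduces the listed values; `mb_per_s` and `slowdown_pct` are hypothetical helper names.

```rust
/// Throughput in MB/s as it appears in the listings: bytes * 1000 divided by
/// the truncated nanosecond count, using integer division (assumed formula).
fn mb_per_s(bytes: u64, ns_per_iter: f64) -> u64 {
    bytes * 1000 / (ns_per_iter as u64).max(1)
}

/// Relative slowdown of `new_ns` versus `baseline_ns`, in percent.
/// A negative result means the new measurement is faster.
fn slowdown_pct(new_ns: f64, baseline_ns: f64) -> f64 {
    (new_ns / baseline_ns - 1.0) * 100.0
}
```

For example, `mb_per_s(10_000, 11_095.95)` gives 901, matching the `sha3_224_10000` intrinsics row. `slowdown_pct(12_305.99, 11_609.98)` is about 6% for `sha3_256_10000` (the largest gap among the 10,000-byte runs), while `slowdown_pct(9_564.39, 9_573.39)` is slightly negative for `shake128_10000`.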

Here's the program I used to translate rho/pi/chi/iota:

main.rs


tarcieri · 2026-02-27T20:09:36Z

Note: I think there's some possible refactoring/cleanup to be had here, but I'd like to land an initial working implementation first to iterate on. In particular, since I have an ASM -> Rust translator and the Godbolt links to match, I'd prefer to keep any cosmetic changes for a follow-up.

newpavlov

Do not forget to add a changelog entry.

newpavlov · 2026-02-27T21:13:05Z

keccak/src/armv8.rs

    use super::*;

    #[test]
    fn test_keccak_f1600() {


This test duplicates the crate doc example, so I think we can remove it (maybe in a separate PR).

It'd be nice to have some common testing for different backends somehow. Maybe a macro that writes tests.

I think it's nice for the backends to be individually unit tested.

I don't think it makes much sense. Why duplicate the same tests in each backend instead of forcing backend selection in CI, as we do, for example, in `aes`?

tarcieri force-pushed the keccak/convert-armv8-asm-to-intrinsics branch 2 times, most recently from 969e422 to e48f822 on February 27, 2026 06:09

tarcieri mentioned this pull request Feb 27, 2026

keccak: add ParKeccakP1600 struct #110

Open

tarcieri force-pushed the keccak/convert-armv8-asm-to-intrinsics branch 4 times, most recently from 8eb9579 to 9e7f994 on February 27, 2026 19:55

tarcieri force-pushed the keccak/convert-armv8-asm-to-intrinsics branch from 9e7f994 to a25bf1d on February 27, 2026 19:56

tarcieri changed the title ~~[WIP] keccak: convert ARMv8 ASM into intrinsics~~ keccak: convert ARMv8 ASM into intrinsics Feb 27, 2026

tarcieri requested a review from newpavlov February 27, 2026 20:04

tarcieri marked this pull request as ready for review February 27, 2026 20:04

newpavlov approved these changes Feb 27, 2026

Add CHANGELOG entry and TODO for MSRV bump

70463c1

tarcieri merged commit 806d446 into master Feb 28, 2026
17 checks passed

tarcieri deleted the keccak/convert-armv8-asm-to-intrinsics branch February 28, 2026 00:36

tarcieri merged 2 commits into master from keccak/convert-armv8-asm-to-intrinsics
