keccak: convert ARMv8 ASM into intrinsics #112

Merged: tarcieri merged 2 commits into master from keccak/convert-armv8-asm-to-intrinsics on Feb 28, 2026

Conversation

@tarcieri (Member) commented Feb 27, 2026

Rewrites the inline assembly implementation as an equivalent (but not identical) intrinsics implementation. Also adds support for computing two Keccak states in parallel, which a comment in the ASM implementation noted was possible but which was never actually exposed. It is now available as `p1600_armv8_sha3_times2` (though not yet in the public API, see #110).
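To illustrate what "two states in parallel" can mean for a state built from two-lane vectors, here is a minimal portable sketch. It assumes lane 0 of each vector belongs to the first state and lane 1 to the second; the `Lanes2` type and the `interleave`, `deinterleave` and `xor2` helpers are hypothetical stand-ins and are not taken from this PR.

```rust
/// Portable stand-in for a two-lane 64-bit vector such as `uint64x2_t`
/// (hypothetical; the real code uses the NEON type directly).
type Lanes2 = [u64; 2];

/// Pack two Keccak states so that lane 0 of every vector holds a word of `a`
/// and lane 1 holds the matching word of `b` (assumed layout).
fn interleave(a: &[u64; 25], b: &[u64; 25]) -> [Lanes2; 25] {
    core::array::from_fn(|i| [a[i], b[i]])
}

/// Split the packed representation back into two independent states.
fn deinterleave(v: &[Lanes2; 25]) -> ([u64; 25], [u64; 25]) {
    (
        core::array::from_fn(|i| v[i][0]),
        core::array::from_fn(|i| v[i][1]),
    )
}

/// A lane-wise operation touches both states at once; XOR is shown here
/// because every Keccak step is built from such lane-wise operations.
fn xor2(x: Lanes2, y: Lanes2) -> Lanes2 {
    [x[0] ^ y[0], x[1] ^ y[1]]
}
```

Because each step of the permutation acts on whole 64-bit words, running it on the packed form processes both states with the same instruction stream.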

This is a little tricky due to high register pressure: this implementation uses every vector register.

I started by rewriting the round loop to iterate over the round constants, then split the loop body into theta and everything else (rho/pi/chi/iota), mapping the NEON registers onto a `[uint64x2_t; 25]` state.
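For readers unfamiliar with that structure, the following is a portable scalar sketch of the same split: a loop over the round constants whose body is theta followed by rho/pi/chi/iota. It uses a plain `[u64; 25]` state rather than `[uint64x2_t; 25]`, and the function names are illustrative assumptions, not the names used in this PR.

```rust
/// Round constants XORed into lane 0 by iota, one per round.
const RC: [u64; 24] = [
    0x0000000000000001, 0x0000000000008082, 0x800000000000808a, 0x8000000080008000,
    0x000000000000808b, 0x0000000080000001, 0x8000000080008081, 0x8000000000008009,
    0x000000000000008a, 0x0000000000000088, 0x0000000080008009, 0x000000008000000a,
    0x000000008000808b, 0x800000000000008b, 0x8000000000008089, 0x8000000000008003,
    0x8000000000008002, 0x8000000000000080, 0x000000000000800a, 0x800000008000000a,
    0x8000000080008081, 0x8000000000008080, 0x0000000080000001, 0x8000000080008008,
];

/// Rho rotation amounts, in the order the lanes are visited by `PI`.
const RHO: [u32; 24] = [
    1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 2, 14, 27, 41, 56, 8, 25, 43, 62, 18, 39, 61, 20, 44,
];

/// Pi destination lane for each step of the combined rho/pi walk.
const PI: [usize; 24] = [
    10, 7, 11, 17, 18, 3, 5, 16, 8, 21, 24, 4, 15, 23, 19, 13, 12, 2, 20, 14, 22, 9, 6, 1,
];

/// Theta: XOR each lane with the parities of two neighbouring columns.
fn theta(s: &mut [u64; 25]) {
    let mut c = [0u64; 5];
    for x in 0..5 {
        c[x] = s[x] ^ s[x + 5] ^ s[x + 10] ^ s[x + 15] ^ s[x + 20];
    }
    for x in 0..5 {
        let d = c[(x + 4) % 5] ^ c[(x + 1) % 5].rotate_left(1);
        for y in 0..5 {
            s[5 * y + x] ^= d;
        }
    }
}

/// Everything after theta: rho and pi (rotate and permute lanes),
/// chi (non-linear row mixing), iota (round constant).
fn rho_pi_chi_iota(s: &mut [u64; 25], rc: u64) {
    let mut carry = s[1];
    for i in 0..24 {
        let next = s[PI[i]];
        s[PI[i]] = carry.rotate_left(RHO[i]);
        carry = next;
    }
    for y in 0..5 {
        let mut row = [0u64; 5];
        row.copy_from_slice(&s[5 * y..5 * y + 5]);
        for x in 0..5 {
            s[5 * y + x] = row[x] ^ (!row[(x + 1) % 5] & row[(x + 2) % 5]);
        }
    }
    s[0] ^= rc;
}

/// The round loop: one iteration per round constant.
fn keccak_f1600(s: &mut [u64; 25]) {
    for &rc in RC.iter() {
        theta(s);
        rho_pi_chi_iota(s, rc);
    }
}
```

The intrinsics version follows the same outline, but each `u64` operation becomes a vector operation on one element of the state array.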

Theta was translated by hand, but for the remaining steps, manually mapping the registers to slots in the state array was too tedious. So I wrote a small program that operates over a representation of the original assembly, does all the bookkeeping for which registers map to which slots in the state array, and outputs the equivalent intrinsics code.
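The kind of bookkeeping described above can be sketched roughly as follows. This is not the actual translator: the `Insn` enum, the `Translator` type, and the `s[..]`/`out[..]` naming are all hypothetical, and only two instructions are modelled. The emitted strings use the `vbcaxq_u64` and `vxarq_u64` intrinsic names from `core::arch::aarch64`.

```rust
use std::collections::HashMap;

/// Hypothetical mini-IR for two SHA-3 extension instructions.
enum Insn {
    /// `bcax vd, vn, vm, va`: d = n ^ (m & !a)
    Bcax { d: u8, n: u8, m: u8, a: u8 },
    /// `xar vd, vn, vm, #imm`: d = rotate_right(n ^ m, imm)
    Xar { d: u8, n: u8, m: u8, imm: u8 },
}

/// Tracks which Rust expression each vector register currently holds.
struct Translator {
    regs: HashMap<u8, String>,
}

impl Translator {
    fn new() -> Self {
        Self { regs: HashMap::new() }
    }

    /// Record that register `v{reg}` holds input state slot `slot`.
    fn bind_input(&mut self, reg: u8, slot: usize) {
        self.regs.insert(reg, format!("s[{slot}]"));
    }

    /// Expression for a register; unknown registers fall back to `v{reg}`.
    fn operand(&self, reg: u8) -> String {
        self.regs
            .get(&reg)
            .cloned()
            .unwrap_or_else(|| format!("v{reg}"))
    }

    /// Emit one line of intrinsics code whose result lands in output slot
    /// `dst_slot`, and rebind the destination register to that slot.
    fn emit(&mut self, insn: Insn, dst_slot: usize) -> String {
        let (d, rhs) = match insn {
            Insn::Bcax { d, n, m, a } => (
                d,
                format!(
                    "vbcaxq_u64({}, {}, {})",
                    self.operand(n),
                    self.operand(m),
                    self.operand(a)
                ),
            ),
            Insn::Xar { d, n, m, imm } => (
                d,
                format!("vxarq_u64::<{imm}>({}, {})", self.operand(n), self.operand(m)),
            ),
        };
        let dst = format!("out[{dst_slot}]");
        self.regs.insert(d, dst.clone());
        format!("{dst} = {rhs};")
    }
}
```

Feeding the instruction stream through such a table in program order means each later instruction automatically reads whichever slot its source register was last bound to, which is the part that is error-prone to do by hand.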

Godbolt links to the original `asm!` versus this translation:
- original: https://godbolt.org/z/G8Mf5vboE
- translated: https://godbolt.org/z/sszzbdexK

It uses nearly the same number of instructions, but the two versions do differ: this isn't an identical recreation of the original assembly (I'm not sure that is possible or preferable), but it should be functionally equivalent.

Since we're no longer using `asm!`, `cfg(armv8_asm)` has been removed and this is now enabled by default on `aarch64` targets.

Closes #95

Benchmarks (sha3 crate on M1 Max)

Pure software implementation

test sha3_224_10    ... bench:          17.97 ns/iter (+/- 0.32) = 588 MB/s
test sha3_224_100   ... bench:         164.15 ns/iter (+/- 5.14) = 609 MB/s
test sha3_224_1000  ... bench:       1,646.07 ns/iter (+/- 139.45) = 607 MB/s
test sha3_224_10000 ... bench:      16,585.52 ns/iter (+/- 1,168.57) = 602 MB/s
test sha3_256_10    ... bench:          19.12 ns/iter (+/- 0.77) = 526 MB/s
test sha3_256_1000  ... bench:       1,694.21 ns/iter (+/- 41.20) = 590 MB/s
test sha3_256_10000 ... bench:      16,807.40 ns/iter (+/- 556.17) = 594 MB/s
test sha3_265_100   ... bench:         173.41 ns/iter (+/- 4.98) = 578 MB/s
test sha3_384_10    ... bench:          24.32 ns/iter (+/- 1.16) = 416 MB/s
test sha3_384_100   ... bench:         225.00 ns/iter (+/- 5.50) = 444 MB/s
test sha3_384_1000  ... bench:       2,224.49 ns/iter (+/- 47.86) = 449 MB/s
test sha3_384_10000 ... bench:      22,181.02 ns/iter (+/- 971.37) = 450 MB/s
test sha3_512_10    ... bench:          33.78 ns/iter (+/- 0.32) = 303 MB/s
test sha3_512_100   ... bench:         320.54 ns/iter (+/- 10.77) = 312 MB/s
test sha3_512_1000  ... bench:       3,174.62 ns/iter (+/- 80.98) = 315 MB/s
test sha3_512_10000 ... bench:      31,629.97 ns/iter (+/- 871.85) = 316 MB/s
test shake128_10    ... bench:          15.97 ns/iter (+/- 0.44) = 666 MB/s
test shake128_100   ... bench:         142.19 ns/iter (+/- 6.58) = 704 MB/s
test shake128_1000  ... bench:       1,390.27 ns/iter (+/- 56.14) = 719 MB/s
test shake128_10000 ... bench:      13,813.13 ns/iter (+/- 677.65) = 723 MB/s
test shake256_10    ... bench:          19.06 ns/iter (+/- 0.44) = 526 MB/s
test shake256_100   ... bench:         173.50 ns/iter (+/- 4.26) = 578 MB/s
test shake256_1000  ... bench:       1,695.05 ns/iter (+/- 87.19) = 589 MB/s
test shake256_10000 ... bench:      16,882.98 ns/iter (+/- 683.56) = 592 MB/s

This new intrinsics implementation

test sha3_224_10    ... bench:          13.07 ns/iter (+/- 0.55) = 769 MB/s
test sha3_224_100   ... bench:         111.29 ns/iter (+/- 6.62) = 900 MB/s
test sha3_224_1000  ... bench:       1,113.87 ns/iter (+/- 29.88) = 898 MB/s
test sha3_224_10000 ... bench:      11,095.95 ns/iter (+/- 302.99) = 901 MB/s
test sha3_256_10    ... bench:          13.53 ns/iter (+/- 0.51) = 769 MB/s
test sha3_256_1000  ... bench:       1,173.40 ns/iter (+/- 33.72) = 852 MB/s
test sha3_256_10000 ... bench:      12,305.99 ns/iter (+/- 623.31) = 812 MB/s
test sha3_265_100   ... bench:         118.16 ns/iter (+/- 2.85) = 847 MB/s
test sha3_384_10    ... bench:          17.27 ns/iter (+/- 0.78) = 588 MB/s
test sha3_384_100   ... bench:         153.80 ns/iter (+/- 5.42) = 653 MB/s
test sha3_384_1000  ... bench:       1,529.35 ns/iter (+/- 18.99) = 654 MB/s
test sha3_384_10000 ... bench:      15,239.19 ns/iter (+/- 189.19) = 656 MB/s
test sha3_512_10    ... bench:          23.43 ns/iter (+/- 0.95) = 434 MB/s
test sha3_512_100   ... bench:         218.97 ns/iter (+/- 4.01) = 458 MB/s
test sha3_512_1000  ... bench:       2,193.58 ns/iter (+/- 37.98) = 455 MB/s
test sha3_512_10000 ... bench:      21,968.75 ns/iter (+/- 385.75) = 455 MB/s
test shake128_10    ... bench:          11.47 ns/iter (+/- 0.32) = 909 MB/s
test shake128_100   ... bench:          95.51 ns/iter (+/- 1.32) = 1052 MB/s
test shake128_1000  ... bench:         960.08 ns/iter (+/- 34.57) = 1041 MB/s
test shake128_10000 ... bench:       9,564.39 ns/iter (+/- 255.34) = 1045 MB/s
test shake256_10    ... bench:          13.61 ns/iter (+/- 0.53) = 769 MB/s
test shake256_100   ... bench:         116.77 ns/iter (+/- 1.94) = 862 MB/s
test shake256_1000  ... bench:       1,163.09 ns/iter (+/- 27.17) = 859 MB/s
test shake256_10000 ... bench:      11,750.47 ns/iter (+/- 250.38) = 851 MB/s

Original assembly

test sha3_224_10    ... bench:          12.54 ns/iter (+/- 0.43) = 833 MB/s
test sha3_224_100   ... bench:         109.49 ns/iter (+/- 2.54) = 917 MB/s
test sha3_224_1000  ... bench:       1,095.79 ns/iter (+/- 32.04) = 913 MB/s
test sha3_224_10000 ... bench:      10,953.02 ns/iter (+/- 157.49) = 912 MB/s
test sha3_256_10    ... bench:          13.05 ns/iter (+/- 0.25) = 769 MB/s
test sha3_256_1000  ... bench:       1,161.46 ns/iter (+/- 28.09) = 861 MB/s
test sha3_256_10000 ... bench:      11,609.98 ns/iter (+/- 148.88) = 861 MB/s
test sha3_265_100   ... bench:         118.17 ns/iter (+/- 7.42) = 847 MB/s
test sha3_384_10    ... bench:          17.07 ns/iter (+/- 2.80) = 588 MB/s
test sha3_384_100   ... bench:         151.93 ns/iter (+/- 4.39) = 662 MB/s
test sha3_384_1000  ... bench:       1,506.50 ns/iter (+/- 40.71) = 664 MB/s
test sha3_384_10000 ... bench:      15,119.04 ns/iter (+/- 495.59) = 661 MB/s
test sha3_512_10    ... bench:          22.93 ns/iter (+/- 0.53) = 454 MB/s
test sha3_512_100   ... bench:         216.77 ns/iter (+/- 7.42) = 462 MB/s
test sha3_512_1000  ... bench:       2,165.67 ns/iter (+/- 49.04) = 461 MB/s
test sha3_512_10000 ... bench:      21,666.71 ns/iter (+/- 651.02) = 461 MB/s
test shake128_10    ... bench:          11.30 ns/iter (+/- 0.14) = 909 MB/s
test shake128_100   ... bench:          94.75 ns/iter (+/- 3.86) = 1063 MB/s
test shake128_1000  ... bench:         961.72 ns/iter (+/- 81.88) = 1040 MB/s
test shake128_10000 ... bench:       9,573.39 ns/iter (+/- 311.05) = 1044 MB/s
test shake256_10    ... bench:          13.17 ns/iter (+/- 0.54) = 769 MB/s
test shake256_100   ... bench:         117.39 ns/iter (+/- 3.22) = 854 MB/s
test shake256_1000  ... bench:       1,174.65 ns/iter (+/- 45.62) = 851 MB/s
test shake256_10000 ... bench:      11,659.19 ns/iter (+/- 330.23) = 857 MB/s

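The performance seems pretty close to the original assembly, maybe just slightly slower: in the runs above, inputs of 100 bytes and up are mostly within about 2% of the assembly, the 10-byte inputs are up to about 4% slower, and sha3_256_10000 shows the largest gap at about 6%.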

Here's the program I used to translate rho/pi/chi/iota:

main.rs

@tarcieri tarcieri force-pushed the keccak/convert-armv8-asm-to-intrinsics branch 2 times, most recently from 969e422 to e48f822 Compare February 27, 2026 06:09
@tarcieri tarcieri force-pushed the keccak/convert-armv8-asm-to-intrinsics branch 4 times, most recently from 8eb9579 to 9e7f994 Compare February 27, 2026 19:55
Godbolt links to the original `asm!` versus this translation:
- original: https://godbolt.org/z/G8Mf5vboE
- translated: https://godbolt.org/z/sszzbdexK
@tarcieri tarcieri force-pushed the keccak/convert-armv8-asm-to-intrinsics branch from 9e7f994 to a25bf1d Compare February 27, 2026 19:56
@tarcieri tarcieri changed the title [WIP] keccak: convert ARMv8 ASM into intrinsics keccak: convert ARMv8 ASM into intrinsics Feb 27, 2026
@tarcieri tarcieri requested a review from newpavlov February 27, 2026 20:04
@tarcieri tarcieri marked this pull request as ready for review February 27, 2026 20:04
@tarcieri (Member, Author) commented:

Note: I think there's some refactoring/cleanup to be had here, but I'd like to land an initial working implementation first to iterate on. In particular, since I have an ASM -> Rust translator and the Godbolt links to match, I'd prefer to keep any cosmetic changes for a followup.

@newpavlov (Member) left a comment:

Do not forget to add a changelog entry.

use super::*;

#[test]
fn test_keccak_f1600() {
@newpavlov (Member) commented Feb 27, 2026:

This test duplicates the crate doc example, so I think we can remove it (maybe in a separate PR).

@tarcieri (Member, Author) replied:

It'd be nice to have some common testing for different backends somehow. Maybe a macro that writes tests.

I think it's nice for the backends to be individually unit tested.
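As a minimal sketch of what a test-writing macro could look like: the `backend_tests!` macro below generates one check per backend from a shared body. The two backend functions are deliberately trivial stand-ins (they are not Keccak and not from this crate), and every name here is hypothetical.

```rust
/// Stand-in backend sharing the `fn(&mut [u64; 25])` signature.
/// NOT Keccak; it only gives the macro something to call.
fn backend_loop(state: &mut [u64; 25]) {
    for i in 0..25 {
        state[i] ^= i as u64;
    }
}

/// Second stand-in backend computing the same result another way.
fn backend_iter(state: &mut [u64; 25]) {
    for (i, lane) in state.iter_mut().enumerate() {
        *lane ^= i as u64;
    }
}

/// Expected output of either stand-in backend on the all-zero state.
fn expected() -> [u64; 25] {
    core::array::from_fn(|i| i as u64)
}

/// Generates one check function per `name => backend` pair.
/// `cfg_attr(test, test)` makes each one a `#[test]` under `cargo test`
/// while leaving it an ordinary function in other builds.
macro_rules! backend_tests {
    ($($name:ident => $backend:ident),+ $(,)?) => {
        $(
            #[cfg_attr(test, test)]
            fn $name() {
                let mut state = [0u64; 25];
                $backend(&mut state);
                assert_eq!(state, expected());
            }
        )+
    };
}

backend_tests! {
    loop_backend_matches => backend_loop,
    iter_backend_matches => backend_iter,
}
```

In a real crate the shared body would presumably compare against a known Keccak test vector, and each backend entry would sit behind the `cfg` gate that enables it.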

@newpavlov (Member) replied:

I don't think it makes much sense. Why duplicate the same tests for each backend instead of forcing backend selection in CI, as we do in `aes`, for example?

@tarcieri tarcieri merged commit 806d446 into master Feb 28, 2026
17 checks passed
@tarcieri tarcieri deleted the keccak/convert-armv8-asm-to-intrinsics branch February 28, 2026 00:36

Development

Successfully merging this pull request may close these issues.

keccak: migrate the asm backend to AArch64 intrinsics