keccak: convert ARMv8 ASM into intrinsics #112
Rewrites the inline assembly implementation using an equivalent (but not identical) intrinsics implementation. Also exposes support for computing two Keccak states in parallel: a previous comment in the ASM implementation noted this was possible, but it wasn't actually exposed. It is now available as `p1600_armv8_sha3_times2` (though not yet in the public API, see #110).

This is a little tricky due to high register pressure: this implementation uses every vector register.

I started by rewriting the round loop and iterating over the round constants, then breaking apart the body into theta and everything else (rho/pi/chi/iota), mapping the NEON registers onto a `[uint64x2_t; 25]` state.

Theta was translated by hand, but the remaining steps were too tedious to translate manually because every register has to be mapped to a slot in the state array. So I wrote a small program that operates over a representation of the original assembly, doing all the bookkeeping for which registers map to which slots in the state array, and outputs the equivalent intrinsics code.

Godbolt links to the original `asm!` versus this translation:

- original: https://godbolt.org/z/G8Mf5vboE
- translated: https://godbolt.org/z/sszzbdexK

It's using nearly the same number of instructions, but there are differences between the two versions: it isn't an identical recreation of the original assembly (I'm not sure that is possible or preferable), but it should be functionally equivalent.

Benchmarks (`sha3` crate):

- Pure software implementation:

      test sha3_224_10     ... bench:      17.97 ns/iter (+/- 0.32) = 588 MB/s
      test sha3_224_100    ... bench:     164.15 ns/iter (+/- 5.14) = 609 MB/s
      test sha3_224_1000   ... bench:   1,646.07 ns/iter (+/- 139.45) = 607 MB/s
      test sha3_224_10000  ... bench:  16,585.52 ns/iter (+/- 1,168.57) = 602 MB/s
      test sha3_256_10     ... bench:      19.12 ns/iter (+/- 0.77) = 526 MB/s
      test sha3_256_1000   ... bench:   1,694.21 ns/iter (+/- 41.20) = 590 MB/s
      test sha3_256_10000  ... bench:  16,807.40 ns/iter (+/- 556.17) = 594 MB/s
      test sha3_265_100    ... bench:     173.41 ns/iter (+/- 4.98) = 578 MB/s
      test sha3_384_10     ... bench:      24.32 ns/iter (+/- 1.16) = 416 MB/s
      test sha3_384_100    ... bench:     225.00 ns/iter (+/- 5.50) = 444 MB/s
      test sha3_384_1000   ... bench:   2,224.49 ns/iter (+/- 47.86) = 449 MB/s
      test sha3_384_10000  ... bench:  22,181.02 ns/iter (+/- 971.37) = 450 MB/s
      test sha3_512_10     ... bench:      33.78 ns/iter (+/- 0.32) = 303 MB/s
      test sha3_512_100    ... bench:     320.54 ns/iter (+/- 10.77) = 312 MB/s
      test sha3_512_1000   ... bench:   3,174.62 ns/iter (+/- 80.98) = 315 MB/s
      test sha3_512_10000  ... bench:  31,629.97 ns/iter (+/- 871.85) = 316 MB/s
      test shake128_10     ... bench:      15.97 ns/iter (+/- 0.44) = 666 MB/s
      test shake128_100    ... bench:     142.19 ns/iter (+/- 6.58) = 704 MB/s
      test shake128_1000   ... bench:   1,390.27 ns/iter (+/- 56.14) = 719 MB/s
      test shake128_10000  ... bench:  13,813.13 ns/iter (+/- 677.65) = 723 MB/s
      test shake256_10     ... bench:      19.06 ns/iter (+/- 0.44) = 526 MB/s
      test shake256_100    ... bench:     173.50 ns/iter (+/- 4.26) = 578 MB/s
      test shake256_1000   ... bench:   1,695.05 ns/iter (+/- 87.19) = 589 MB/s
      test shake256_10000  ... bench:  16,882.98 ns/iter (+/- 683.56) = 592 MB/s

- New intrinsics implementation:

      test sha3_224_10     ... bench:      13.07 ns/iter (+/- 0.55) = 769 MB/s
      test sha3_224_100    ... bench:     111.29 ns/iter (+/- 6.62) = 900 MB/s
      test sha3_224_1000   ... bench:   1,113.87 ns/iter (+/- 29.88) = 898 MB/s
      test sha3_224_10000  ... bench:  11,095.95 ns/iter (+/- 302.99) = 901 MB/s
      test sha3_256_10     ... bench:      13.53 ns/iter (+/- 0.51) = 769 MB/s
      test sha3_256_1000   ... bench:   1,173.40 ns/iter (+/- 33.72) = 852 MB/s
      test sha3_256_10000  ... bench:  12,305.99 ns/iter (+/- 623.31) = 812 MB/s
      test sha3_265_100    ... bench:     118.16 ns/iter (+/- 2.85) = 847 MB/s
      test sha3_384_10     ... bench:      17.27 ns/iter (+/- 0.78) = 588 MB/s
      test sha3_384_100    ... bench:     153.80 ns/iter (+/- 5.42) = 653 MB/s
      test sha3_384_1000   ... bench:   1,529.35 ns/iter (+/- 18.99) = 654 MB/s
      test sha3_384_10000  ... bench:  15,239.19 ns/iter (+/- 189.19) = 656 MB/s
      test sha3_512_10     ... bench:      23.43 ns/iter (+/- 0.95) = 434 MB/s
      test sha3_512_100    ... bench:     218.97 ns/iter (+/- 4.01) = 458 MB/s
      test sha3_512_1000   ... bench:   2,193.58 ns/iter (+/- 37.98) = 455 MB/s
      test sha3_512_10000  ... bench:  21,968.75 ns/iter (+/- 385.75) = 455 MB/s
      test shake128_10     ... bench:      11.47 ns/iter (+/- 0.32) = 909 MB/s
      test shake128_100    ... bench:      95.51 ns/iter (+/- 1.32) = 1052 MB/s
      test shake128_1000   ... bench:     960.08 ns/iter (+/- 34.57) = 1041 MB/s
      test shake128_10000  ... bench:   9,564.39 ns/iter (+/- 255.34) = 1045 MB/s
      test shake256_10     ... bench:      13.61 ns/iter (+/- 0.53) = 769 MB/s
      test shake256_100    ... bench:     116.77 ns/iter (+/- 1.94) = 862 MB/s
      test shake256_1000   ... bench:   1,163.09 ns/iter (+/- 27.17) = 859 MB/s
      test shake256_10000  ... bench:  11,750.47 ns/iter (+/- 250.38) = 851 MB/s

- Original assembly:

      test sha3_224_10     ... bench:      12.54 ns/iter (+/- 0.43) = 833 MB/s
      test sha3_224_100    ... bench:     109.49 ns/iter (+/- 2.54) = 917 MB/s
      test sha3_224_1000   ... bench:   1,095.79 ns/iter (+/- 32.04) = 913 MB/s
      test sha3_224_10000  ... bench:  10,953.02 ns/iter (+/- 157.49) = 912 MB/s
      test sha3_256_10     ... bench:      13.05 ns/iter (+/- 0.25) = 769 MB/s
      test sha3_256_1000   ... bench:   1,161.46 ns/iter (+/- 28.09) = 861 MB/s
      test sha3_256_10000  ... bench:  11,609.98 ns/iter (+/- 148.88) = 861 MB/s
      test sha3_265_100    ... bench:     118.17 ns/iter (+/- 7.42) = 847 MB/s
      test sha3_384_10     ... bench:      17.07 ns/iter (+/- 2.80) = 588 MB/s
      test sha3_384_100    ... bench:     151.93 ns/iter (+/- 4.39) = 662 MB/s
      test sha3_384_1000   ... bench:   1,506.50 ns/iter (+/- 40.71) = 664 MB/s
      test sha3_384_10000  ... bench:  15,119.04 ns/iter (+/- 495.59) = 661 MB/s
      test sha3_512_10     ... bench:      22.93 ns/iter (+/- 0.53) = 454 MB/s
      test sha3_512_100    ... bench:     216.77 ns/iter (+/- 7.42) = 462 MB/s
      test sha3_512_1000   ... bench:   2,165.67 ns/iter (+/- 49.04) = 461 MB/s
      test sha3_512_10000  ... bench:  21,666.71 ns/iter (+/- 651.02) = 461 MB/s
      test shake128_10     ... bench:      11.30 ns/iter (+/- 0.14) = 909 MB/s
      test shake128_100    ... bench:      94.75 ns/iter (+/- 3.86) = 1063 MB/s
      test shake128_1000   ... bench:     961.72 ns/iter (+/- 81.88) = 1040 MB/s
      test shake128_10000  ... bench:   9,573.39 ns/iter (+/- 311.05) = 1044 MB/s
      test shake256_10     ... bench:      13.17 ns/iter (+/- 0.54) = 769 MB/s
      test shake256_100    ... bench:     117.39 ns/iter (+/- 3.22) = 854 MB/s
      test shake256_1000   ... bench:   1,174.65 ns/iter (+/- 45.62) = 851 MB/s
      test shake256_10000  ... bench:  11,659.19 ns/iter (+/- 330.23) = 857 MB/s

The performance seems pretty close to the original assembly, maybe just slightly slower.
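To put "slightly slower" into numbers, here is a small sketch that compares three of the 10000-byte results quoted above. The helper name `pct_slower` and the choice of rows are made up for this illustration; the timings are copied from the benchmark output, and since their reported variance is a few percent, the differences should be read loosely.

```rust
/// Percentage by which `new_ns` is slower than `baseline_ns`
/// (a negative value means it is faster). Hypothetical helper.
fn pct_slower(new_ns: f64, baseline_ns: f64) -> f64 {
    (new_ns / baseline_ns - 1.0) * 100.0
}

fn main() {
    // (benchmark, pure software, intrinsics, original asm), all in ns/iter.
    let rows = [
        ("sha3_256_10000", 16_807.40, 12_305.99, 11_609.98),
        ("sha3_512_10000", 31_629.97, 21_968.75, 21_666.71),
        ("shake128_10000", 13_813.13, 9_564.39, 9_573.39),
    ];
    for (name, soft, intrinsics, asm) in rows {
        println!(
            "{name}: {:+.1}% vs asm, {:.2}x the speed of pure software",
            pct_slower(intrinsics, asm),
            soft / intrinsics
        );
    }
}
```

On these rows the intrinsics version lands within roughly 6% of the assembly (and marginally ahead on `shake128_10000`), while running about 1.4x as fast as the pure software implementation.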
Note: I think there's some possible refactoring and cleanup to be had here, but I'd like to land an initial working implementation first and iterate on it. In particular, since I have an ASM -> Rust translator and the Godbolt links to match, I'd prefer to keep any cosmetic changes for a followup.
    use super::*;

    #[test]
    fn test_keccak_f1600() {
This test duplicates the crate doc example, so I think we can remove it (maybe in a separate PR).
It'd be nice to have some common testing for different backends somehow. Maybe a macro that writes tests.
I think it's nice for the backends to be individually unit tested.
I don't think it makes much sense. Why duplicate the same tests in each backend instead of forcing backend selection in CI, as we do, for example, in `aes`?
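For reference, the "macro that writes tests" idea floated in this thread could look roughly like the sketch below. Everything here is hypothetical: `backend_tests!`, `check_backend`, and the two toy "backends" are stand-ins invented for illustration (they rotate each lane by one bit rather than run Keccak), and nothing in it is taken from the crate.

```rust
// Two toy "backends" that must agree; neither is real Keccak code.
mod soft {
    pub fn permute(state: &mut [u64; 25]) {
        for lane in state.iter_mut() {
            *lane = lane.rotate_left(1);
        }
    }
}

mod shifted {
    pub fn permute(state: &mut [u64; 25]) {
        for lane in state.iter_mut() {
            *lane = (*lane << 1) | (*lane >> 63);
        }
    }
}

/// Reference result every backend is expected to reproduce.
fn expected(input: [u64; 25]) -> [u64; 25] {
    let mut out = input;
    for lane in out.iter_mut() {
        *lane = lane.rotate_left(1);
    }
    out
}

/// Shared check: runs `permute` on a fixed input and compares to the reference.
fn check_backend(permute: fn(&mut [u64; 25])) -> bool {
    let mut state = [0u64; 25];
    for (i, lane) in state.iter_mut().enumerate() {
        *lane = 0x8000_0000_0000_0001 ^ i as u64;
    }
    let want = expected(state);
    permute(&mut state);
    state == want
}

/// Expands to one `#[test]` function per listed backend.
macro_rules! backend_tests {
    ($($name:ident => $permute:expr),+ $(,)?) => {
        $(
            #[test]
            fn $name() {
                assert!(check_backend($permute));
            }
        )+
    };
}

backend_tests! {
    soft_matches_reference => soft::permute,
    shifted_matches_reference => shifted::permute,
}

fn main() {
    // Outside of `cargo test`, run the same shared check directly.
    assert!(check_backend(soft::permute));
    assert!(check_backend(shifted::permute));
    println!("all backends agree with the reference");
}
```

The alternative raised in the last comment keeps a single test suite and selects the backend through build configuration in CI, which avoids generating per-backend copies at all.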
Rewrites the inline assembly implementation using an equivalent (but not identical) intrinsics implementation. Also exposes support for computing two Keccak states in parallel: a previous comment in the ASM implementation noted this was possible, but it wasn't actually exposed. It is now available as `p1600_armv8_sha3_times2` (though not yet in the public API, see #110).

This is a little tricky due to high register pressure: this implementation uses every vector register.

I started by rewriting the round loop and iterating over the round constants, then breaking apart the body into theta and everything else (rho/pi/chi/iota), mapping the NEON registers onto a `[uint64x2_t; 25]` state.

Theta was translated by hand, but the remaining steps were too tedious to translate manually because every register has to be mapped to a slot in the state array. So I wrote a small program that operates over a representation of the original assembly, doing all the bookkeeping for which registers map to which slots in the state array, and outputs the equivalent intrinsics code.

Godbolt links to the original `asm!` versus this translation:

- original: https://godbolt.org/z/G8Mf5vboE
- translated: https://godbolt.org/z/sszzbdexK

It's using nearly the same number of instructions, but there are differences between the two versions: it isn't an identical recreation of the original assembly (I'm not sure that is possible or preferable), but it should be functionally equivalent.
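To illustrate the `[uint64x2_t; 25]` layout described above: each array slot holds the same lane from two independent Keccak states, so a single vector operation advances both states at once. The sketch below models `uint64x2_t` as a plain `[u64; 2]` so that it runs on any target. The type alias and function names are assumptions for illustration and not the PR's code; the real implementation would use the NEON/SHA3 intrinsics instead of these scalar helpers.

```rust
/// Portable stand-in for NEON's `uint64x2_t`: one lane from each of two states.
type U64x2 = [u64; 2];

fn xor(a: U64x2, b: U64x2) -> U64x2 {
    [a[0] ^ b[0], a[1] ^ b[1]]
}

fn rol1(a: U64x2) -> U64x2 {
    [a[0].rotate_left(1), a[1].rotate_left(1)]
}

/// Theta over two interleaved states; lane (x, y) lives at index x + 5 * y.
fn theta_times2(s: &mut [U64x2; 25]) {
    // Column parities: C[x] = XOR of the five lanes in column x.
    let mut c = [[0u64; 2]; 5];
    for x in 0..5 {
        for y in 0..5 {
            c[x] = xor(c[x], s[x + 5 * y]);
        }
    }
    // D[x] = C[x - 1] ^ rol(C[x + 1], 1), XORed into every lane of column x.
    for x in 0..5 {
        let d = xor(c[(x + 4) % 5], rol1(c[(x + 1) % 5]));
        for y in 0..5 {
            s[x + 5 * y] = xor(s[x + 5 * y], d);
        }
    }
}

/// Scalar theta on a single state, used as the reference.
fn theta(s: &mut [u64; 25]) {
    let mut c = [0u64; 5];
    for x in 0..5 {
        for y in 0..5 {
            c[x] ^= s[x + 5 * y];
        }
    }
    for x in 0..5 {
        let d = c[(x + 4) % 5] ^ c[(x + 1) % 5].rotate_left(1);
        for y in 0..5 {
            s[x + 5 * y] ^= d;
        }
    }
}

/// Pack two scalar states into the two-lane layout.
fn interleave(a: &[u64; 25], b: &[u64; 25]) -> [U64x2; 25] {
    core::array::from_fn(|i| [a[i], b[i]])
}

fn main() {
    let mut a: [u64; 25] = core::array::from_fn(|i| (i as u64).wrapping_mul(0x9E37_79B9_7F4A_7C15));
    let mut b: [u64; 25] = core::array::from_fn(|i| !(i as u64).rotate_left(17));

    let mut both = interleave(&a, &b);
    theta_times2(&mut both);

    theta(&mut a);
    theta(&mut b);
    assert_eq!(both, interleave(&a, &b));
    println!("two-lane theta matches two independent scalar runs");
}
```

Because every operation is lane-wise, the two states never interact, which is why the parallel variant comes essentially for free once the state is stored as vectors.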
Since we're no longer using `asm!`, `cfg(armv8_asm)` has been removed and this is now enabled by default on `aarch64` targets.

Closes #95

Benchmarks (`sha3` crate on M1 Max):

- Pure software implementation
- This new intrinsics implementation
- Original assembly

The performance seems pretty close to the original assembly, maybe just slightly slower.
Here's the program I used to translate rho/pi/chi/iota:
main.rs
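As a rough idea of the kind of bookkeeping such a translator does, here is a minimal sketch. It assumes a made-up one-instruction-per-line input format and a fixed register-to-slot table; `operand`, `translate_line`, and the identity mapping of `v0..v24` onto state slots are hypothetical and not taken from the actual program, which also has to track how registers move between slots. Only the emitted intrinsic names (`veor3q_u64`, `vbcaxq_u64`, `vrax1q_u64` from `core::arch::aarch64`) are real.

```rust
use std::collections::HashMap;

/// Render a register operand such as "v3.16b" as a Rust expression:
/// a state slot `s[i]` if the register is mapped, otherwise a temporary `tN`.
fn operand(reg: &str, slots: &HashMap<u8, usize>) -> Option<String> {
    let num: u8 = reg.trim_start_matches('v').split('.').next()?.parse().ok()?;
    Some(match slots.get(&num) {
        Some(slot) => format!("s[{slot}]"),
        None => format!("t{num}"),
    })
}

/// Translate one assembly line into one line of intrinsics code.
/// Returns `None` for anything this sketch does not understand.
fn translate_line(line: &str, slots: &HashMap<u8, usize>) -> Option<String> {
    let (mnemonic, rest) = line.trim().split_once(' ')?;
    let ops: Vec<&str> = rest.split(',').map(str::trim).collect();
    let intrinsic = match (mnemonic, ops.len()) {
        ("eor3", 4) => "veor3q_u64",
        ("bcax", 4) => "vbcaxq_u64",
        ("rax1", 3) => "vrax1q_u64",
        _ => return None,
    };
    let dst = operand(ops[0], slots)?;
    let args = ops[1..]
        .iter()
        .map(|r| operand(r, slots))
        .collect::<Option<Vec<_>>>()?;
    // Temporaries are introduced with `let`; state slots are assigned in place.
    let lhs = if dst.starts_with('t') { format!("let {dst}") } else { dst };
    Some(format!("{lhs} = {intrinsic}({});", args.join(", ")))
}

fn main() {
    // Assumption for this sketch: registers v0..v24 hold state slots 0..24.
    let slots: HashMap<u8, usize> = (0u8..25).map(|i| (i, i as usize)).collect();
    let asm = [
        "eor3 v25.16b, v20.16b, v15.16b, v10.16b",
        "rax1 v30.2d, v25.2d, v27.2d",
        "bcax v0.16b, v0.16b, v2.16b, v1.16b",
    ];
    for line in asm {
        match translate_line(line, &slots) {
            Some(rust) => println!("{rust}"),
            None => println!("// untranslated: {line}"),
        }
    }
}
```

Generating the code this way keeps the operand order of each instruction intact, so the output can be diffed against the original assembly line by line.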