-
Notifications
You must be signed in to change notification settings - Fork 356
[generator.c] optimize copy_remaining_bytes #924
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
…n copy_remaining_bytes to avoid a branch to MEMCPY. Additionally use a space as padding byte instead of an 'X' so it can be represented diretly on AArch64 with a single instruction.
Have you tried the same with |
ext/json/ext/generator/generator.c
Outdated
| #if defined(__has_builtin) | ||
| #if __has_builtin(__builtin_memcpy) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| #if defined(__has_builtin) | |
| #if __has_builtin(__builtin_memcpy) | |
| #if defined(__has_builtin) && __has_builtin(__builtin_memcpy) |
|
Ok, I yet to finish my coffee, but just to see if I get it:
Have you considered always copying 16B regardless? |
I have, both The first is a conditional select of For completeness, with The |
…SUME to address PR feedback
Yes, correct. I haven't considered that, as I don't know if it's safe to potentially read up to 10 bytes past the end of the string. Even if it is safe, there would likely need to be code to either mask off a portion of the |
Ah right sorry, again didn't yet finish my coffee, I was only considering the buffer we write into, not the string we read from. |
No worries, all good. I'm currently in my drinking coffee / boot up phase. I get it. |
| if (len >= 8) { | ||
| __builtin_memcpy(s, search->ptr, 8); | ||
| __builtin_memcpy(s + len - 8, search->ptr + len - 8, 8); | ||
| } else { | ||
| __builtin_memcpy(s, search->ptr, 4); | ||
| __builtin_memcpy(s + len - 4, search->ptr + len - 4, 4); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this makes sense, but I'd extract it to some sort of fast_memcpy16 sort of static inline helper to make it easier to grasp, and with the fallback in it for when __builtin_memcpy isn't available.
Could even be in simd.h.
This PR focuses on optimizing
copy_remaining_bytes. TheMEMCPY(s, search->ptr, char, len);generates a function call aslenis not constant. However, we know thatlenis between 6 (now 4) andvec_len-1bytes.Instead of the
MEMCPY, if available, we use__builtin_memcpywith a constant length which ends up emitting directloadandstoreinstructions. The copies are structured to copy between 4 and 15 bytes by utilizing copying overlapping byte ranges to copy the correct number of bytes. The__builtin_memcpyis important, at least forclangon MacOS. Attempting to usememcpy, the compiler is smart enough to recognize the only difference is either a8or an4then uses a conditional select to choose the right value then loads and stores. This is quite a bit slower than the__builtin_memcpy.Additionally, I noticed that the
memset(s, 'X', vec_len);generates three instructions:This is because
X(0x5858585858585858) cannot be represented as an immediate in Aarch64/ARM64 assembly. However, a space (0x20) can be. It doesn't really matter what filler character is used as long as it doesn't need to be escaped. Using a space,clangnow generates this:I realize this only save a single instruction and doesn't really make much difference but I'll take it.
The
__builtin_memcpycertainly introduces a level of complexity I wouldn't normally entertain but the performance improvements were quite surprising. Here are the results of running a benchmark on my M1 Macbook Air. The percentages are similar on my M4 Macbook Pro. As always the percentages vary a bit between runs but this one is fairly typical.The numbers were shocking enough that I thought I broke something. I added a few more tests in addition to running this shell script to verify the output between the default
jsongem that comes with Ruby and the current version.I excluded
canada.jsonas some of the numbers output slightly different precision.I did lower the
SIMD_MINIMUM_THRESHOLDto4as the copy is now almost free and that seems to change the math a bit for when it makes sense to fall back to the lookup table. Additionally, since the "else" copies overlapping 4 byte chunks, 4 seemed like the logical minimum threshold.I have thought about how to clean this up a little and have this idea:
in
simd.h:Then in the
generator.c: