[generator.c] optimize copy_remaining_bytes #924

samyron · 2026-01-16T04:03:06Z

This PR focuses on optimizing copy_remaining_bytes. The MEMCPY(s, search->ptr, char, len); call compiles to a function call because len is not a compile-time constant. However, we know that len is between 6 (now 4) and vec_len-1 bytes.

Instead of the MEMCPY, we use __builtin_memcpy (if available) with a constant length, which ends up emitting direct load and store instructions. The copies are structured to handle between 4 and 15 bytes by copying overlapping byte ranges so that exactly the right number of bytes is copied. The __builtin_memcpy is important, at least for clang on macOS. With plain memcpy, the compiler is smart enough to recognize that the only difference between the two branches is either an 8 or a 4, so it uses a conditional select to choose the right value and then loads and stores. This is quite a bit slower than the __builtin_memcpy.
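To illustrate the overlapping-range idea, here is a minimal sketch. It assumes a GCC- or Clang-compatible compiler that provides __builtin_memcpy; the names copy_overlapping and check_all_lengths are hypothetical and are not taken from the PR. Each branch issues two fixed-size copies, one anchored at the start and one anchored at the end, so any length from 4 to 16 is covered without a length-dependent call.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not the PR's function): copies len bytes,
 * assuming 4 <= len <= 16, with two fixed-size copies that may overlap. */
static inline void copy_overlapping(char *dst, const char *src, size_t len)
{
    if (len >= 8) {
        /* First 8 bytes, then the last 8 bytes; they overlap when len < 16. */
        __builtin_memcpy(dst, src, 8);
        __builtin_memcpy(dst + len - 8, src + len - 8, 8);
    } else {
        /* First 4 bytes, then the last 4 bytes; they overlap when len < 8. */
        __builtin_memcpy(dst, src, 4);
        __builtin_memcpy(dst + len - 4, src + len - 4, 4);
    }
}

/* Returns 1 if every length from 4 to 16 copies exactly len bytes. */
static int check_all_lengths(void)
{
    char src[16];
    for (size_t i = 0; i < sizeof src; i++) src[i] = (char)('a' + i);

    for (size_t len = 4; len <= 16; len++) {
        char dst[17];
        memset(dst, ' ', sizeof dst);
        copy_overlapping(dst, src, len);
        if (memcmp(dst, src, len) != 0) return 0; /* copied bytes must match */
        if (dst[len] != ' ') return 0;            /* nothing written past len */
    }
    return 1;
}
```

For example, with len = 11 the first copy covers bytes 0 to 7 and the second covers bytes 3 to 10; the overlapping bytes are simply written twice with the same values.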

Additionally, I noticed that the memset(s, 'X', vec_len); generates three instructions:

mov x8, #0x1818181818181818
orr x8, x8, #0x4040404040404040
stp x8, x8, [x23]

This is because 'X' repeated across the register (0x5858585858585858) cannot be represented as an immediate in AArch64/ARM64 assembly. However, a space (0x20) can be. It doesn't really matter what filler character is used as long as it doesn't need to be escaped. Using a space, clang now generates this:

mov x10, #0x2020202020202020
stp x10, x10, [x8]

I realize this only saves a single instruction and doesn't really make much difference but I'll take it.
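As a rough illustration of why one filler byte needs an extra instruction, the sketch below checks whether a byte replicated across a 64-bit register fits my understanding of the AArch64 bitmask-immediate rule: the value must be a repeated 2-, 4- or 8-bit element (for a replicated byte) whose set bits form one contiguous run, allowing rotation, and which is neither all zeros nor all ones. The function names are hypothetical, and the rule as stated here is an assumption that should be checked against the Arm architecture reference.

```c
#include <stdint.h>

/* Counts set bits; a portable stand-in for a popcount builtin. */
static int count_bits(unsigned v)
{
    int n = 0;
    while (v) { n += (int)(v & 1u); v >>= 1; }
    return n;
}

/* True when the low `bits` bits of v hold exactly one contiguous run of
 * ones, allowing wrap-around; all-zeros and all-ones are rejected. */
static int is_rotated_run(unsigned v, unsigned bits)
{
    unsigned mask = (1u << bits) - 1u;
    unsigned rot = ((v >> 1) | (v << (bits - 1))) & mask;
    /* A single cyclic run has exactly two 0/1 boundaries. */
    return count_bits((v ^ rot) & mask) == 2;
}

/* True when byte b, replicated across a register, matches the assumed
 * bitmask-immediate pattern for some element size of 2, 4 or 8 bits. */
static int byte_fill_is_single_immediate(uint8_t b)
{
    for (unsigned bits = 2; bits <= 8; bits *= 2) {
        unsigned mask = (1u << bits) - 1u;
        unsigned elem = b & mask;
        unsigned rep = 0;
        for (unsigned shift = 0; shift < 8; shift += bits) rep |= elem << shift;
        if (rep == b && is_rotated_run(elem, bits)) return 1;
    }
    return 0;
}
```

Under this check, 0x20 (space) passes while 0x58 ('X') does not, and both 0x18 and 0x40 pass, which is consistent with the mov/orr pair in the first listing.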

The __builtin_memcpy certainly introduces a level of complexity I wouldn't normally entertain, but the performance improvements were quite surprising. Here are the results of running a benchmark on my M1 MacBook Air. The percentages are similar on my M4 MacBook Pro. As always, the percentages vary a bit between runs, but this one is fairly typical.

== Encoding activitypub.json (52595 bytes)
ruby 3.4.8 (2025-12-17 revision 995b59f666) +YJIT +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after     2.570k i/100ms
Calculating -------------------------------------
               after     26.661k (± 1.4%) i/s   (37.51 μs/i) -    133.640k in   5.013636s

Comparison:
              before:    25175.5 i/s
               after:    26660.6 i/s - 1.06x  faster


== Encoding citm_catalog.json (500298 bytes)
ruby 3.4.8 (2025-12-17 revision 995b59f666) +YJIT +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   133.000 i/100ms
Calculating -------------------------------------
               after      1.338k (± 0.7%) i/s  (747.18 μs/i) -      6.783k in   5.068387s

Comparison:
              before:     1273.7 i/s
               after:     1338.4 i/s - 1.05x  faster


== Encoding twitter.json (466906 bytes)
ruby 3.4.8 (2025-12-17 revision 995b59f666) +YJIT +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   269.000 i/100ms
Calculating -------------------------------------
               after      2.694k (± 2.1%) i/s  (371.21 μs/i) -     13.719k in   5.094830s

Comparison:
              before:     2509.0 i/s
               after:     2693.9 i/s - 1.07x  faster


== Encoding ohai.json (20145 bytes)
ruby 3.4.8 (2025-12-17 revision 995b59f666) +YJIT +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after     3.418k i/100ms
Calculating -------------------------------------
               after     34.115k (± 0.8%) i/s   (29.31 μs/i) -    170.900k in   5.009885s

Comparison:
              before:    31162.9 i/s
               after:    34114.9 i/s - 1.09x  faster

The numbers were shocking enough that I thought I broke something. I added a few more tests and also ran this shell script, which compares the output of the default json gem that comes with Ruby against the output of the current version.

#!/bin/bash

all_match=true

for file in $(ls ./benchmark/data/*.json | grep -v canada.json); do
    echo "Testing: $(basename $file)"
    
    dev_hash=$(ruby -Ilib:ext -e "require 'json'; puts JSON.generate(JSON.parse(File.read('$file')));" | sha1sum | cut -d' ' -f1)
    rel_hash=$(ruby -e "require 'json'; puts JSON.generate(JSON.parse(File.read('$file')));" | sha1sum | cut -d' ' -f1)
    
    if [ "$dev_hash" = "$rel_hash" ]; then
        echo "  ✓ $dev_hash"
    else
        echo "  ✗ MISMATCH: dev=$dev_hash rel=$rel_hash"
        all_match=false
    fi
done

$all_match && echo -e "\n✓ All tests passed!" || echo -e "\n✗ Some tests failed!"

% bash ./verify-changes.sh 
Testing: activitypub-pretty.json
  ✓ 421ab39e9ee7eed1392a24ddde4deee218abeae2
Testing: activitypub.json
  ✓ 421ab39e9ee7eed1392a24ddde4deee218abeae2
Testing: citm_catalog.json
  ✓ 09b74151f3e4310e9339be0ff1c0d8b316dbdabd
Testing: github_events.json
  ✓ d5248cad8f4e573145009842e9f0116919e2b4f0
Testing: integers-pretty.json
  ✓ 6ffad71396cc4207b53e34eaf02c5bf1a6f65d92
Testing: integers.json
  ✓ 7274a0b8c6c55fa0478b5a5f80433ea31b51c123
Testing: ohai.json
  ✓ 8dde8bd01ef7baac0330da5a094ab1781bfbe142
Testing: semanticscholar-corpus.json
  ✓ 52e8b47615c1cbcc538d47d55d999eadfaffdf84
Testing: twitter.json
  ✓ ccab6079fd5cf8408f7bf4ff832729d4fc59f9e8
Testing: twitterescaped.json
  ✓ ccab6079fd5cf8408f7bf4ff832729d4fc59f9e8
Testing: update-center.json
  ✓ 12d1ba8473163a50c6632821c7228b4f067642a2

✓ All tests passed!

I excluded canada.json as some of its numbers are output with slightly different precision.

I did lower the SIMD_MINIMUM_THRESHOLD to 4, as the copy is now almost free and that seems to change the math a bit for when it makes sense to fall back to the lookup table. Additionally, since the "else" branch copies overlapping 4-byte chunks, 4 seemed like the logical minimum threshold.

I have thought about how to clean this up a little and have this idea:

In simd.h:

#if defined(__has_builtin)
#if __has_builtin(__builtin_memcpy)
#define EXPLICIT_MEMCPY(dst, src, n) __builtin_memcpy(dst, src, n)
#define SIMD_MINIMUM_THRESHOLD 4
#endif
#endif

/* Keep the previous threshold whenever the builtin is unavailable. */
#ifndef SIMD_MINIMUM_THRESHOLD
#define SIMD_MINIMUM_THRESHOLD 6
#endif

Then in generator.c:

#ifdef EXPLICIT_MEMCPY
<new optimized code>
#else
MEMCPY(s, search->ptr, char, len);
#endif


byroot · 2026-01-16T07:21:36Z

> if available, we use __builtin_memcpy with a constant length which ends up emitting direct load and store instructions.

Have you tried the same with MEMCPY or memcpy? I know that compilers can do similar optimizations when Ruby's MEMCMP has a convenient constant size too (just to get rid of the ifdefs etc.).

byroot · 2026-01-16T07:26:16Z

ext/json/ext/generator/generator.c

+#if defined(__has_builtin)
+#if __has_builtin(__builtin_memcpy)


Suggested change

-#if defined(__has_builtin)
-#if __has_builtin(__builtin_memcpy)
+#if defined(__has_builtin) && __has_builtin(__builtin_memcpy)
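One caveat worth keeping in mind with the single-line form: as far as I know, on a compiler that does not provide __has_builtin at all, the preprocessor still has to parse the right-hand side of the &&, and an undefined __has_builtin followed by parentheses is not a valid expression there. A common way around this is a small compatibility shim, sketched below. The macro HYPOTHETICAL_HAVE_BUILTIN_MEMCPY and the helper copy8 are made-up names for illustration and are not part of the PR.

```c
#include <stddef.h>
#include <string.h>

/* Compatibility shim: compilers without __has_builtin get a macro
 * that always answers "no", so later #if tests stay well-formed. */
#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

#if __has_builtin(__builtin_memcpy)
#define HYPOTHETICAL_HAVE_BUILTIN_MEMCPY 1
#else
#define HYPOTHETICAL_HAVE_BUILTIN_MEMCPY 0
#endif

/* Hypothetical helper: copies exactly 8 bytes, using the builtin when
 * the compiler reports it and plain memcpy otherwise. */
static inline void copy8(void *dst, const void *src)
{
#if HYPOTHETICAL_HAVE_BUILTIN_MEMCPY
    __builtin_memcpy(dst, src, 8);
#else
    memcpy(dst, src, 8);
#endif
}
```

With the shim in place, a single-line `#if __has_builtin(...)` test is enough at each use site, which keeps the number of nested directives down.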


byroot · 2026-01-16T07:36:12Z

Ok, I have yet to finish my coffee, but just to see if I get it:

copy_remaining_bytes only ever copies up to 16B.

Have you considered always copying 16B regardless?

samyron · 2026-01-16T14:23:30Z

> > if available, we use __builtin_memcpy with a constant length which ends up emitting direct load and store instructions.
>
> Have you tried the same with MEMCPY or memcpy? I know that compilers can do similar optimizations when Ruby's MEMCMP has a convenient constant size too (just to get rid of the ifdefs etc.).

I have. Both MEMCPY and memcpy emit:

mov w8, #0x4
mov w9, #0x8
csel x24, x9, x8, hi
mov x8, #0xfffffffffffffffc
mov x9, #0xfffffffffffffff8
csel x21, x9, x8, hi
mov x0, x23
mov x1, x20
mov x2, x24
bl $+0x27f8

The first is a conditional select of 4 or 8 followed by a conditional select of the -4 or the -8 followed by a call to memcpy (I believe).

For completeness, with __builtin_memcpy we get:

cmp x21, #0x8
b.lo $+0xd0
ldr x11, [x20]
str x11, [x8]
ldur x10, [x10, #-0x8]
stur x10, [x9, #-0x8]
b $+0xcc
<snip>
ldr w11, [x20]
str w11, [x8]
ldur w10, [x10, #-0x4]
stur w10, [x9, #-0x4]

The <snip> varies in length depending on compiler. It seems like clang has added a bunch of code between the if and the else of the (len >= 8) case. gcc has them very close together.


samyron · 2026-01-16T14:44:18Z

> Ok, I have yet to finish my coffee, but just to see if I get it:
>
> copy_remaining_bytes only ever copies up to 16B.
>
> Have you considered always copying 16B regardless?

Yes, correct. I haven't considered that, as I don't know if it's safe to potentially read up to 10 bytes past the end of the string. Even if it is safe, there would likely need to be code to either mask off a portion of the match_mask or memset those bytes with a character that doesn't need to be escaped. I think it's trading complexity either way.

byroot · 2026-01-16T14:49:53Z

> read up to 10 bytes past the end of the string.

Ah right sorry, again didn't yet finish my coffee, I was only considering the buffer we write into, not the string we read from.

samyron · 2026-01-16T14:54:58Z

> > read up to 10 bytes past the end of the string.
>
> Ah right sorry, again didn't yet finish my coffee, I was only considering the buffer we write into, not the string we read from.

No worries, all good. I'm currently in my drinking coffee / boot up phase. I get it.

byroot · 2026-01-16T14:58:25Z

ext/json/ext/generator/generator.c

+        if (len >= 8) {
+            __builtin_memcpy(s, search->ptr, 8);
+            __builtin_memcpy(s + len - 8, search->ptr + len - 8, 8);
+        } else {
+            __builtin_memcpy(s, search->ptr, 4);
+            __builtin_memcpy(s + len - 4, search->ptr + len - 4, 4);
+        }


I think this makes sense, but I'd extract it into some sort of fast_memcpy16 static inline helper to make it easier to grasp, with the fallback inside it for when __builtin_memcpy isn't available.

Could even be in simd.h.

Use __builtin_memcpy, if available, to copy overlapping byte ranges i…

0383eb8

…n copy_remaining_bytes to avoid a branch to MEMCPY. Additionally use a space as padding byte instead of an 'X' so it can be represented directly on AArch64 with a single instruction.

byroot reviewed Jan 16, 2026


samyron added 2 commits January 16, 2026 08:38

combined two preprocessor directives and added an RBIMPL_ASSERT_OR_AS…

2fa9417

…SUME to address PR feedback

wrap RBIMPL_ASSERT_OR_ASSUME in an ifdef as it isn't available on Rub…

1319922

…y 2.7
