Skip to content

Conversation

@ChALkeR
Copy link
Member

@ChALkeR ChALkeR commented Dec 17, 2025

This fixes both performance and correctness for all single-byte encodings in the TextDecoder spec.

Tracking issue: #61041

ICU is both slower and can't be trusted, it returns invalid results on many encodings (see #61041)
Fast path from #60893 is not faster than ICU and is instead just a perf regression

This is the first of the changes, aiming to fix 1-byte decoders only
It's based on code in https://github.com/ExodusOSS/bytes and essentially just copy-pastes the 1-byte encodings path for Node.js with some simplifications
I authored that code

Fixes

  1. Fixes ibm866, koi8-u, windows-874, windows-1253, windows-1255 encodings which were returning incorrect results
  2. Adds support for iso-8859-16 and x-user-defined encodings which are present in the spec
  3. Adds support for single-byte encodings for Node.js built without ICU (further PRs will improve on this)
  4. Fixes windows1252 incorrect handling of ignoreBOM:
 > new TextDecoder('latin1', { ignoreBOM: true }).decode(Uint8Array.of(0xff)).length
 0

Performance

  1. improves windows-1252 (aka latin1 / asci) perf on ASCII -- 108x
  2. improves windows-1252 (aka latin1 / asci) perf on non-ASCII -- 25x
  3. improves windows-1253 perf on ASCII -- 27x
  4. improves windows-1253 perf on non-ASCII -- ~7%
  5. ... and so on for all single-byte encodings, they are similar to windows-1253 or better

Benchmarks were made with https://github.com/lemire/jstextdecoderbench but changed the encodings:


windows-1252, main:

Test Size Throughput Mean Time
Latin lipsum (ASCII) 84.902 KiB 0.31 GiB/s 0.272 ms
Complex 1 79.771 KiB 0.06 GiB/s 1.292 ms

windows-1252, PR:

Test Size Throughput Mean Time
Latin lipsum (ASCII) 84.902 KiB 33.41 GiB/s 0.003 ms
Complex 1 79.771 KiB 1.48 GiB/s 0.056 ms

windows-1253, main:

Test Size Throughput Mean Time
Latin lipsum (ASCII) 84.902 KiB 1.27 GiB/s 0.065 ms
Complex 1 79.771 KiB 1.41 GiB/s 0.056 ms

windows-1253, PR:

Test Size Throughput Mean Time
Latin lipsum (ASCII) 84.902 KiB 34.46 GiB/s 0.003 ms
Complex 1 79.771 KiB 1.53 GiB/s 0.053 ms

@nodejs-github-bot nodejs-github-bot added c++ Issues and PRs that require attention from people who are familiar with C++. lib / src Issues and PRs related to general changes in the lib or src directory. needs-ci PRs that need a full CI run. labels Dec 17, 2025
@ChALkeR ChALkeR force-pushed the chalker/decoder/single-byte/0 branch from d49cc98 to 5c80725 Compare December 17, 2025 11:53
@ChALkeR ChALkeR changed the title fix: implement all 1-byte encodings in js lib: implement all 1-byte encodings in js Dec 17, 2025
@ChALkeR
Copy link
Member Author

ChALkeR commented Dec 17, 2025

Perhaps native path from #60893 could be reverted separately and should not be blocked on this, as that causes much of the regression (see perf difference between windows-1252 and windows-1253 above)

@ChALkeR ChALkeR force-pushed the chalker/decoder/single-byte/0 branch from 5c80725 to 9e8e785 Compare December 17, 2025 12:00
Copy link
Member

@mcollina mcollina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@ChALkeR ChALkeR force-pushed the chalker/decoder/single-byte/0 branch 6 times, most recently from 0f4a5ca to e047a11 Compare December 17, 2025 12:51
@ChALkeR ChALkeR marked this pull request as ready for review December 17, 2025 13:29
@ChALkeR
Copy link
Member Author

ChALkeR commented Dec 17, 2025

This should be ready except for tests which I'll add

@ChALkeR ChALkeR force-pushed the chalker/decoder/single-byte/0 branch from e047a11 to 1920216 Compare December 17, 2025 14:43
Copy link
Member

@Renegade334 Renegade334 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice! Just a few minor suggestions.

@ChALkeR ChALkeR force-pushed the chalker/decoder/single-byte/0 branch from 1920216 to 7c615e2 Compare December 17, 2025 15:55
const w3 = [...w8, -8060, 8330, -8328, 8096, -8094];
const m0 = [8558, -8328, 8374, -66, -8539, 16, 8043, -8070];

const encodings = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without documentation this is really not maintainable. At the very least, this needs some comments about what these numbers are, how they are determined, etc.

Copy link
Member Author

@ChALkeR ChALkeR Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jasnell A test will cover that explanation, which is why the method to get those is exported
I'll add that test

It will also be able to reproduce those from the spec data (albeit without the common chunks extraction, but that is trivially maintainable e.g. by just omitting that)

Copy link
Member Author

@ChALkeR ChALkeR Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just the tables for 0x80+ ranges from the spec, as e.g. https://encoding.spec.whatwg.org/index-iso-8859-6.txt, with two adjustments for compactness and ease-of-use:

  1. 0xfffd designates a hole (unmapped offset), those tables are non-continuous
  2. All other values are deltas from the previous seen entry, the first one is delta from 0x7f
    This is because values 0x00-0x7f are always mapped as identity and are not even present in spec tables

E.g. if the table in the spec says

0 0x0080
1 0x0081
5 0x0085
6 0x0086
7 0x0087

The vector is [1, 1, f, f, f, 4, 1, 1] (with f being 0xfffd)

I'll document that in the testcase/generator

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I understand what's happening here but really the comments/explanation needs to be in the source here. I don't think placing the explanation in the test is correct.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.
Also I removed all the common ranges logic to make this easier, size doesn't matter that much here

@codecov
Copy link

codecov bot commented Dec 17, 2025

Codecov Report

❌ Patch coverage is 97.94872% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.53%. Comparing base (4f24aff) to head (3557542).
⚠️ Report is 9 commits behind head on main.

Files with missing lines Patch % Lines
lib/internal/encoding.js 90.00% 3 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##             main   #61093    +/-   ##
========================================
  Coverage   88.53%   88.53%            
========================================
  Files         703      704     +1     
  Lines      208546   208663   +117     
  Branches    40217    40238    +21     
========================================
+ Hits       184634   184747   +113     
- Misses      15926    15953    +27     
+ Partials     7986     7963    -23     
Files with missing lines Coverage Δ
lib/internal/encoding/single-byte.js 100.00% <100.00%> (ø)
src/encoding_binding.cc 51.80% <ø> (-2.81%) ⬇️
src/encoding_binding.h 100.00% <ø> (ø)
lib/internal/encoding.js 99.36% <90.00%> (-0.16%) ⬇️

... and 39 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@anonrig anonrig added the needs-benchmark-ci PR that need a benchmark CI run. label Dec 18, 2025
@mertcanaltin
Copy link
Member

mertcanaltin commented Dec 18, 2025

Thanks for repair!

fails because internal/encoding/single-byte and internalBinding('os') are loaded before pre-execution.
lazy loading both can fix this. https://github.com/nodejs/node/actions/runs/20308888356/job/58334026159?pr=61093

@ChALkeR ChALkeR force-pushed the chalker/decoder/single-byte/0 branch 4 times, most recently from 74fed43 to 2cb2499 Compare December 18, 2025 12:14
@ChALkeR
Copy link
Member Author

ChALkeR commented Dec 18, 2025

@Renegade334 @jasnell I decided to remove all common ranges logic and made the arrays literal except repeats of holes and ones.

  1. Using too much safe iterators there defeats the purpose of compacting them and makes code more complex
  2. They should more maintainable this way, now they can be reproduced from the spec tables without manual steps
  3. Size doesn't matter that much to save of a few bytes here

That simplifies the structure

@ChALkeR ChALkeR force-pushed the chalker/decoder/single-byte/0 branch from 2cb2499 to 35df300 Compare December 18, 2025 12:22
@ChALkeR ChALkeR force-pushed the chalker/decoder/single-byte/0 branch 3 times, most recently from eb9f467 to eb074bd Compare December 18, 2025 16:42
@ChALkeR
Copy link
Member Author

ChALkeR commented Dec 18, 2025

Moving to a proper native path, we can additionally beat this almost 2x on non-ascii and get:

| Latin lipsum (ASCII) | 84.902 KiB | 34.33 GiB/s | 0.003 ms |
| Complex 1 | 79.771 KiB | 2.92 GiB/s | 0.030 ms |

With a cleaner impl that does not do branching and does not convert result to and from utf8

I can try filing an alternative PR with only native impl for all encodings

@ChALkeR
Copy link
Member Author

ChALkeR commented Dec 18, 2025

I removed os binding usage in favor of a quick js endianness detection as in upstream lib

I'm planning to replace this with fast native implementation later anyway
We should not block on waiting for that and land this though

This is a fix + improves perf 100x-25x, move to native will improve perf additionally ~2x on non-ascii.

@ChALkeR
Copy link
Member Author

ChALkeR commented Dec 18, 2025

Added tests

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

c++ Issues and PRs that require attention from people who are familiar with C++. lib / src Issues and PRs related to general changes in the lib or src directory. needs-benchmark-ci PR that need a benchmark CI run. needs-ci PRs that need a full CI run.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants