lib: implement all 1-byte encodings in js #61093
Conversation
Force-pushed from d49cc98 to 5c80725.
Perhaps the native path from #60893 could be reverted separately and should not be blocked on this, as it causes much of the regression (see the perf difference between …).
Force-pushed from 5c80725 to 9e8e785.
mcollina left a comment:

lgtm
Force-pushed from 0f4a5ca to e047a11.
This should be ready, except for the tests, which I'll add.
Force-pushed from e047a11 to 1920216.
Renegade334 left a comment:

Very nice! Just a few minor suggestions.
Force-pushed from 1920216 to 7c615e2.
const w3 = [...w8, -8060, 8330, -8328, 8096, -8094];
const m0 = [8558, -8328, 8374, -66, -8539, 16, 8043, -8070];
const encodings = {
Without documentation this is really not maintainable. At the very least, this needs some comments about what these numbers are, how they are determined, etc.
@jasnell A test will cover that explanation, which is why the method to generate those tables is exported.
I'll add that test.
It will also be able to reproduce the tables from the spec data (albeit without the common-chunks extraction, but that is trivially maintainable, e.g. by just omitting it).
This is just the tables for the 0x80+ ranges from the spec (e.g. https://encoding.spec.whatwg.org/index-iso-8859-6.txt), with two adjustments for compactness and ease of use:
- 0xfffd designates a hole (an unmapped offset); those tables are non-contiguous.
- All other values are deltas from the previously seen entry; the first one is a delta from 0x7f. This is because values 0x00-0x7f are always mapped as identity and are not even present in the spec tables.
E.g. if the table in the spec says:
0 0x0080
1 0x0081
5 0x0085
6 0x0086
7 0x0087
then the vector is [1, 1, f, f, f, 4, 1, 1] (with f being 0xfffd).
I'll document that in the testcase/generator
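For illustration, a small sketch (not the PR's actual code; the name `expandTable` is hypothetical) of how such a delta-encoded vector can be expanded back into absolute code points:

```javascript
// Hypothetical sketch: expand a delta-encoded table (as described
// above) into absolute code points. 0xfffd marks a hole; all other
// entries are deltas from the previously seen value, starting at 0x7f.
const HOLE = 0xfffd;

function expandTable(deltas) {
  const codePoints = [];
  let prev = 0x7f; // values 0x00-0x7f are identity-mapped; the table starts after them
  for (const d of deltas) {
    if (d === HOLE) {
      codePoints.push(HOLE); // unmapped offset
    } else {
      prev += d;
      codePoints.push(prev);
    }
  }
  return codePoints;
}

const f = HOLE;
console.log(expandTable([1, 1, f, f, f, 4, 1, 1]).map((c) => c.toString(16)));
// → ['80', '81', 'fffd', 'fffd', 'fffd', '85', '86', '87']
```

This matches the worked example above: deltas accumulate from 0x7f, and holes pass through unchanged.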
Yeah, I understand what's happening here, but really the comments/explanation need to be in the source itself. I don't think placing the explanation in the test is correct.
Done.
I also removed all the common-ranges logic to make this easier; size doesn't matter that much here.
Codecov Report: ❌ Patch coverage is …
Additional details and impacted files:
@@            Coverage Diff            @@
##              main   #61093    +/-   ##
========================================
   Coverage   88.53%   88.53%
========================================
   Files         703      704      +1
   Lines      208546   208663    +117
   Branches    40217    40238     +21
========================================
+  Hits       184634   184747    +113
-  Misses      15926    15953     +27
+  Partials     7986     7963     -23
Thanks for the repair! … fails because …
Force-pushed from 74fed43 to 2cb2499.
@Renegade334 @jasnell I decided to remove all the common-ranges logic and made the arrays literal, except for repeats of holes and ones.
That simplifies the structure.
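A sketch of what "repeats of holes and ones" could look like (the `rep` helper and the fragment are illustrative assumptions, not the PR's actual code):

```javascript
// Hypothetical run-length helper: keep table entries literal, but
// compress long runs of holes (0xfffd) and ones into repeat calls.
const HOLE = 0xfffd;
const rep = (value, count) => new Array(count).fill(value);

// Illustrative fragment: two literal deltas, three holes, a literal 4, then ones.
const fragment = [1, 1, ...rep(HOLE, 3), 4, ...rep(1, 2)];
console.log(fragment); // [ 1, 1, 65533, 65533, 65533, 4, 1, 1 ]
```

Keeping everything else literal means a reader can diff a table directly against the spec's index file.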
Force-pushed from 2cb2499 to 35df300.
Force-pushed from eb9f467 to eb074bd.
Moving to a proper native path, we can additionally beat this almost 2x on non-ASCII and get: …
With a cleaner impl that does not do branching and does not convert the result to and from UTF-8, I can try filing an alternative PR with only a native impl for all encodings.
Force-pushed from eb074bd to e29d03b.
I removed the os binding usage in favor of a quick JS endianness detection, as in the upstream lib. I'm planning to replace this with a fast native implementation later anyway.
This is a fix that also improves perf 100x-25x; the move to native will additionally improve perf ~2x on non-ASCII.
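One common pure-JS endianness check (a sketch of the general technique, not necessarily the exact code in the PR) writes a known 16-bit value and inspects which byte is stored first:

```javascript
// Little-endian platforms store the low byte of 0x0102 first, so
// byte 0 of the backing buffer is 0x02 there, and 0x01 on big-endian.
const isLittleEndian =
  new Uint8Array(new Uint16Array([0x0102]).buffer)[0] === 0x02;
console.log(isLittleEndian); // true on x86/ARM, false on big-endian
```

This avoids any native binding: typed-array views over the same buffer expose the platform's byte order directly.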
Force-pushed from e29d03b to efa5be8.
Added tests.
This fixes both performance and correctness for all single-byte encodings in the TextDecoder spec.
Tracking issue: #61041
ICU is both slower and can't be trusted: it returns invalid results for many encodings (see #61041).
The fast path from #60893 is not faster than ICU and is instead just a perf regression.
This is the first of the changes, aiming to fix 1-byte decoders only.
It's based on code in https://github.com/ExodusOSS/bytes and essentially just copy-pastes the 1-byte encodings path into Node.js with some simplifications.
I authored that code.
Fixes
- ibm866, koi8-u, windows-874, windows-1253, windows-1255 encodings, which were returning incorrect results
- iso-8859-16 and x-user-defined encodings, which are present in the spec
- windows-1252 incorrect handling of ignoreBOM

Performance
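The fixed encodings can be spot-checked against the WHATWG index tables; for example, windows-1253 maps bytes 0xE1 and 0xE2 to Greek α and β per https://encoding.spec.whatwg.org/index-windows-1253.txt:

```javascript
// Spot-check single-byte decoding via the standard TextDecoder API.
const dec = new TextDecoder('windows-1253');
console.log(dec.decode(new Uint8Array([0xe1, 0xe2]))); // → 'αβ'
```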
- windows-1252 (aka latin1 / ascii) perf on ASCII: ~108x
- windows-1252 (aka latin1 / ascii) perf on non-ASCII: ~25x
- windows-1253 perf on ASCII: ~27x
- windows-1253 perf on non-ASCII: ~7%
- other encodings: like windows-1253 or better

Benchmarks were made with https://github.com/lemire/jstextdecoderbench, but with the encodings changed:
windows-1252, main: …
windows-1252, PR: …
windows-1253, main: …
windows-1253, PR: …
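For a quick local sanity check of decode throughput (a minimal sketch, not the lemire benchmark itself), one can time repeated decodes of a fixed buffer:

```javascript
// Minimal timing sketch: decode the same 1 MiB ASCII buffer ten times
// and report the elapsed wall-clock time.
const buf = new Uint8Array(1 << 20).fill(0x41); // 1 MiB of 'A'
const dec = new TextDecoder('windows-1252');
let out = '';
const t0 = process.hrtime.bigint();
for (let i = 0; i < 10; i++) out = dec.decode(buf);
const t1 = process.hrtime.bigint();
console.log(`decoded 10 MiB in ${Number(t1 - t0) / 1e6} ms`);
```

Numbers from a one-off loop like this are noisy; the linked benchmark harness is the right tool for comparable figures.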