lib: implement all 1-byte encodings in js #61093
Conversation
Force-pushed from d49cc98 to 5c80725.
Perhaps the native path from #60893 could be reverted separately and should not be blocked on this, as it causes much of the regression (see the perf difference between …).
Force-pushed from 5c80725 to 9e8e785.
mcollina left a comment:

lgtm
Force-pushed from 0f4a5ca to e047a11.
This should be ready, except for the tests, which I'll add.
Force-pushed from e047a11 to 1920216.
Renegade334 left a comment:

Very nice! Just a few minor suggestions.
Force-pushed from 1920216 to 7c615e2.
const w3 = [...w8, -8060, 8330, -8328, 8096, -8094];
const m0 = [8558, -8328, 8374, -66, -8539, 16, 8043, -8070];
const encodings = {
Without documentation this is really not maintainable. At the very least, this needs some comments about what these numbers are, how they are determined, etc.
@jasnell A test will cover that explanation, which is why the method to generate those tables is exported.
I'll add that test.
It will also be able to reproduce the tables from the spec data (albeit without the common-chunks extraction, but that is trivially maintainable, e.g. by just omitting it).
This is just the tables for the 0x80+ ranges from the spec (e.g. https://encoding.spec.whatwg.org/index-iso-8859-6.txt), with two adjustments for compactness and ease of use:
- 0xfffd designates a hole (an unmapped offset); those tables are non-contiguous.
- All other values are deltas from the previously seen entry; the first one is a delta from 0x7f. This is because values 0x00-0x7f are always mapped as identity and are not even present in the spec tables.
E.g. if the table in the spec says:
0 0x0080
1 0x0081
5 0x0085
6 0x0086
7 0x0087
then the vector is [1, 1, f, f, f, 4, 1, 1] (with f being 0xfffd).
I'll document that in the testcase/generator
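For illustration, a small sketch (not the PR's actual code; the name `expandTable` is hypothetical) of how such a delta-encoded vector can be expanded back into absolute code points:

```javascript
// Hypothetical sketch: expand a delta-encoded table (as described
// above) into absolute code points. 0xfffd marks a hole; all other
// entries are deltas from the previously seen value, starting at 0x7f.
const HOLE = 0xfffd;

function expandTable(deltas) {
  const codePoints = [];
  let prev = 0x7f; // values 0x00-0x7f are identity-mapped; the table starts after them
  for (const d of deltas) {
    if (d === HOLE) {
      codePoints.push(HOLE); // unmapped offset
    } else {
      prev += d;
      codePoints.push(prev);
    }
  }
  return codePoints;
}

const f = HOLE;
console.log(expandTable([1, 1, f, f, f, 4, 1, 1]).map((c) => c.toString(16)));
// → ['80', '81', 'fffd', 'fffd', 'fffd', '85', '86', '87']
```

This matches the worked example above: deltas accumulate from 0x7f, and holes pass through unchanged.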
Yeah, I understand what's happening here, but really the comments/explanation need to be in the source itself. I don't think placing the explanation in the test is correct.
Done.
I also removed all the common-ranges logic to make this easier; size doesn't matter that much here.
Codecov Report: ❌ Patch coverage is …
Additional details and impacted files:
@@            Coverage Diff            @@
##              main   #61093    +/-   ##
========================================
   Coverage   88.53%   88.53%
========================================
   Files         703      704      +1
   Lines      208546   208663    +117
   Branches    40217    40238     +21
========================================
+  Hits       184634   184747    +113
-  Misses      15926    15953     +27
+  Partials     7986     7963     -23
Thanks for the repair! … fails because …
Force-pushed from 74fed43 to 2cb2499.
@Renegade334 @jasnell I decided to remove all the common-ranges logic and made the arrays literal, except for repeats of holes and ones.
That simplifies the structure.
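A sketch of what "repeats of holes and ones" could look like (the `rep` helper and the fragment are illustrative assumptions, not the PR's actual code):

```javascript
// Hypothetical run-length helper: keep table entries literal, but
// compress long runs of holes (0xfffd) and ones into repeat calls.
const HOLE = 0xfffd;
const rep = (value, count) => new Array(count).fill(value);

// Illustrative fragment: two literal deltas, three holes, a literal 4, then ones.
const fragment = [1, 1, ...rep(HOLE, 3), 4, ...rep(1, 2)];
console.log(fragment); // [ 1, 1, 65533, 65533, 65533, 4, 1, 1 ]
```

Keeping everything else literal means a reader can diff a table directly against the spec's index file.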
Force-pushed from 2cb2499 to 35df300.
Force-pushed from eb9f467 to eb074bd.
Moving to a proper native path, we can additionally beat this almost 2x on non-ASCII and get: …
With a cleaner impl that does not do branching and does not convert the result to and from UTF-8, I can try filing an alternative PR with only a native impl for all encodings.
Force-pushed from eb074bd to e29d03b.
I removed the os binding usage in favor of a quick JS endianness detection, as in the upstream lib. I'm planning to replace this with a fast native implementation later anyway.
This is a fix that also improves perf 100x-25x; the move to native will additionally improve perf ~2x on non-ASCII.
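One common pure-JS endianness check (a sketch of the general technique, not necessarily the exact code in the PR) writes a known 16-bit value and inspects which byte is stored first:

```javascript
// Little-endian platforms store the low byte of 0x0102 first, so
// byte 0 of the backing buffer is 0x02 there, and 0x01 on big-endian.
const isLittleEndian =
  new Uint8Array(new Uint16Array([0x0102]).buffer)[0] === 0x02;
console.log(isLittleEndian); // true on x86/ARM, false on big-endian
```

This avoids any native binding: typed-array views over the same buffer expose the platform's byte order directly.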
Force-pushed from e29d03b to efa5be8.
Added tests.
This fixes both performance and correctness for all single-byte encodings in the TextDecoder spec.
Tracking issue: #61041
ICU is both slower and can't be trusted: it returns invalid results for many encodings (see #61041).
The fast path from #60893 is not faster than ICU and is instead just a perf regression.
This is the first of the changes, aiming to fix 1-byte decoders only.
It's based on code in https://github.com/ExodusOSS/bytes and essentially just copy-pastes the 1-byte encodings path into Node.js with some simplifications.
I authored that code.
Fixes
- ibm866, koi8-u, windows-874, windows-1253, windows-1255 encodings, which were returning incorrect results
- iso-8859-16 and x-user-defined encodings, which are present in the spec
- windows-1252 incorrect handling of ignoreBOM

Performance
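The fixed encodings can be spot-checked against the WHATWG index tables; for example, windows-1253 maps bytes 0xE1 and 0xE2 to Greek α and β per https://encoding.spec.whatwg.org/index-windows-1253.txt:

```javascript
// Spot-check single-byte decoding via the standard TextDecoder API.
const dec = new TextDecoder('windows-1253');
console.log(dec.decode(new Uint8Array([0xe1, 0xe2]))); // → 'αβ'
```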
- windows-1252 (aka latin1 / ascii) perf on ASCII: ~108x
- windows-1252 (aka latin1 / ascii) perf on non-ASCII: ~25x
- windows-1253 perf on ASCII: ~27x
- windows-1253 perf on non-ASCII: ~7%
- other encodings: like windows-1253 or better

Benchmarks were made with https://github.com/lemire/jstextdecoderbench, but with the encodings changed:
windows-1252, main: …
windows-1252, PR: …
windows-1253, main: …
windows-1253, PR: …
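For a quick local sanity check of decode throughput (a minimal sketch, not the lemire benchmark itself), one can time repeated decodes of a fixed buffer:

```javascript
// Minimal timing sketch: decode the same 1 MiB ASCII buffer ten times
// and report the elapsed wall-clock time.
const buf = new Uint8Array(1 << 20).fill(0x41); // 1 MiB of 'A'
const dec = new TextDecoder('windows-1252');
let out = '';
const t0 = process.hrtime.bigint();
for (let i = 0; i < 10; i++) out = dec.decode(buf);
const t1 = process.hrtime.bigint();
console.log(`decoded 10 MiB in ${Number(t1 - t0) / 1e6} ms`);
```

Numbers from a one-off loop like this are noisy; the linked benchmark harness is the right tool for comparable figures.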