TextDecoder is wrong and very slow

## Correctness

Encodings that return invalid results:
 * Single-byte:
    * `ibm866` (fails at even ascii input)
    * `koi8-u`
    * `windows-874`
    * `windows-1252`
    * `windows-1253`
    * `windows-1255`
 * Multi-byte (all except `gb18030`):
   * `gbk` (should be identical to `gb18030` but it is instead broken)
   * `big5`
   * `euc-jp`
   * `iso-2022-jp`
   * `shift_jis` (fails at even ascii input)
   * `euc-kr`

Unimplemented encodings that throw:
   * `iso-8859-16`
   * `x-user-defined`


If built without `icu`, `utf-16le` encoding also returns invalid results:
```js
> new TextDecoder('utf-16le').decode(Uint16Array.of(0xd800))
'�' // correct
'\ud800' // no ICU
```

## Performance

* `utf-8` (aka default) `TextDecoder` is much slower on ascii input than it can and should be
   `1.3x` on 4096 bytes, `~3x` on 1 MiB input
*  The above applies to `buffer.toString()` too
   It's much slower on ASCII input than a checked js impl (same `1.3x`-`3x`)
*  `windows-1252` aka `new TextDecoder('ascii')` aka `new TextDecoder('latin1')`
    is ~`2x-4x` slower than an optimized impl on ascii input
* `windows-1252` aka `new TextDecoder('latin1')`
   is ~`6x-12x` slower than an optimized impl on latin1 input
* `windows-1252` is ~`7x-12x` slower than an optimized js impl
* Other single-byte encodings that are significantly slower than js impl even on non-ascii input:
   `iso-8859-3`, `iso-8859-6`, `iso-8859-7`, `iso-8859-8`, `iso-8859-8-i`, `windows-1253`, `windows-1255`, `windows-1257`
* None of the single-byte encodings are _faster_ than the js impl even on non-ascii input
* All of the single-byte encodings except `windows-1252` are `>=10x` slower than the js impl on ascii input
  (`windows-1252` is _only_ ~`2-4x` slower)

## References

Nothing of the above requires any changes on the native side, I compared to a somewhat optimized JS implementation

See https://docs.google.com/spreadsheets/d/1pdEefRG6r9fZy61WHGz0TKSt8cO4ISWqlpBN5KntIvQ/edit

See tests in https://github.com/ExodusOSS/bytes/blob/master/tests/encoding/mistakes.test.js (comment out the import and it can be run on Node.js without deps with only that file)

## Suggestions

1. Add a proper ASCII fast path to `buffer.toString()`
   https://github.com/nodejs/node/pull/61119
2. Add a proper ASCII fast path to `new TextDecoder().decode(arg)`
   https://github.com/nodejs/node/pull/61119
3. Perhaps replace single-byte decoders with a js impl, remove native paths and lib usage.   They all are just mappers, the implementation for all of them is identical
   Or at least replace the slow, unsupported, or invalid ones.
   #61093 
4. Remove `gbk` decoder path and make it do the same as `gb18030` as the spec says
    https://github.com/nodejs/node/pull/61099
5. Fix or replace implementations for `big5`, `euc-jp`, `iso-2022-jp`, `shift_jis`, `euc-kr`
6. For utf16 decode optimistically using existing fast apis, then check the string for validity
7. Fix bugs in the non-ICU codepath
8. To fix legacy multi-byte decoders, attempt to re-use what Chromium has



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

TextDecoder is wrong and very slow #61041

Correctness

Performance

References

Suggestions

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

TextDecoder is wrong and very slow #61041

Description

Correctness

Performance

References

Suggestions

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions