Skip to content

TextDecoder is wrong and very slow #61041

@ChALkeR

Description

@ChALkeR

Correctness

Encodings that return invalid results:

  • Single-byte:
    • ibm866 (fails at even ascii input)
    • koi8-u
    • windows-874
    • windows-1252
    • windows-1253
    • windows-1255
  • Multi-byte (all except gb18030):
    • gbk (should be identical to gb18030 but it is instead broken)
    • big5
    • euc-jp
    • iso-2022-jp
    • shift_jis (fails at even ascii input)
    • euc-kr

Unimplemented encodings that throw:

  • iso-8859-16
  • x-user-defined

If built without icu, utf-16le encoding also returns invalid results:

> new TextDecoder('utf-16le').decode(Uint16Array.of(0xd800))
'�' // correct
'\ud800' // no ICU

Performance

  • utf-8 (aka default) TextDecoder is much slower on ascii input than it can and should be
    1.3x on 4096 bytes, ~3x on 1 MiB input
  • The above applies to buffer.toString() too
    It's much slower on ASCII input than a checked js impl (same 1.3x-3x)
  • windows-1252 aka new TextDecoder('ascii') aka new TextDecoder('latin1')
    is ~2x-4x slower than an optimized impl on ascii input
  • windows-1252 aka new TextDecoder('latin1')
    is ~6x-12x slower than an optimized impl on latin1 input
  • windows-1252 is ~7x-12x slower than an optimized js impl
  • Other single-byte encodings that are significantly slower than js impl even on non-ascii input:
    iso-8859-3, iso-8859-6, iso-8859-7, iso-8859-8, iso-8859-8-i, windows-1253, windows-1255, windows-1257
  • None of the single-byte encodings are faster than the js impl even on non-ascii input
  • All of the single-byte encodings except windows-1252 are >=10x slower than the js impl on ascii input
    (windows-1252 is only ~2-4x slower)

References

Nothing of the above requires any changes on the native side, I compared to a somewhat optimized JS implementation

See https://docs.google.com/spreadsheets/d/1pdEefRG6r9fZy61WHGz0TKSt8cO4ISWqlpBN5KntIvQ/edit

See tests in https://github.com/ExodusOSS/bytes/blob/master/tests/encoding/mistakes.test.js (comment out the import and it can be run on Node.js without deps with only that file)

Suggestions

  1. Add a proper ASCII fast path to buffer.toString()
    src: improve StringBytes::Encode perf on ASCII #61119
  2. Add a proper ASCII fast path to new TextDecoder().decode(arg)
    src: improve StringBytes::Encode perf on ASCII #61119
  3. Perhaps replace single-byte decoders with a js impl, remove native paths and lib usage. They all are just mappers, the implementation for all of them is identical
    Or at least replace the slow, unsupported, or invalid ones.
    lib: implement all 1-byte encodings in js #61093
  4. Remove gbk decoder path and make it do the same as gb18030 as the spec says
    lib: gbk decoder is gb18030 decoder per spec #61099
  5. Fix or replace implementations for big5, euc-jp, iso-2022-jp, shift_jis, euc-kr
  6. For utf16 decode optimistically using existing fast apis, then check the string for validity
  7. Fix bugs in the non-ICU codepath
  8. To fix legacy multi-byte decoders, attempt to re-use what Chromium has

Metadata

Metadata

Assignees

No one assigned

    Labels

    performanceIssues and PRs related to the performance of Node.js.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions