Add TarsParser: RegularParser's output at RegexParser's speed #123

Open
rhukster wants to merge 2 commits into thunderer:master from rhukster:add-tars-parser
@rhukster

Contributor

Summary

This adds TarsParser, a fourth parser that produces exactly the same result as RegularParser, including proper nesting and invalid syntax detection, but does the work in a single PCRE pass plus a flat stack instead of a recursive token parser. The goal was RegularParser's correctness at close to RegexParser's speed and memory.

How it works

One preg_match_all lexes every individual tag, opening and closing, in a single C-level pass. The regex understands quoted values and escapes, so a broken tag like [a k="v] fails to lex instead of inventing a parameter. A linear stack pass then resolves nesting, mismatched closing tags, and open-only shortcodes. There is no full token array, no recursion, and no content backreference.
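
A minimal sketch of the two phases described above, assuming the default syntax ([, ], /, =, ") and assuming only outermost shortcodes are returned while nested ones stay inside the parent's content. The function names lexTags and resolveNesting, the regex, and the array shapes are hypothetical and are not taken from TarsParser; the handling of mismatched closing tags in particular is a simplification and may differ from RegularParser.

```php
<?php
// Hypothetical sketch: names, regex, and array shapes are illustrative only.

/** Lex every tag (opening, closing, self-closing) with one preg_match_all call. */
function lexTags(string $text): array
{
    // Possessive quantifiers (++, *+) stop the engine from re-reading a tag
    // another way once a quoted value or a name has been consumed.
    $pattern = '~\[
        (?<close>/)?
        (?<name>[a-zA-Z0-9_-]++)
        (?<params>(?:\s++[a-zA-Z0-9_-]++(?:=(?:"(?:[^"\\\\]|\\\\.)*+"|[^\s\]"/]++))?)*+)
        \s*+(?<self>/)?
    \]~x';

    if (preg_match_all($pattern, $text, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE) === false) {
        throw new RuntimeException('PCRE failure, error code ' . preg_last_error());
    }

    $tags = [];
    foreach ($matches as $m) {
        $tags[] = [
            'name' => $m['name'][0],
            'close' => ($m['close'][0] ?? '') !== '',
            'self' => ($m['self'][0] ?? '') !== '',
            'params' => $m['params'][0], // kept raw: parsing is deferred
            'start' => $m[0][1],         // byte offsets
            'end' => $m[0][1] + strlen($m[0][0]),
        ];
    }
    return $tags;
}

/** Resolve nesting with one pass over the tags and a flat stack. */
function resolveNesting(string $text, array $tags): array
{
    $nodes = [];     // emitted nodes in document order
    $stack = [];     // indexes into $nodes for tags that are still open
    $openCount = []; // name => number of still-open tags with that name

    foreach ($tags as $tag) {
        $name = $tag['name'];
        if (!$tag['close']) {
            $nodes[] = ['name' => $name, 'params' => $tag['params'], 'content' => null,
                        'start' => $tag['start'], 'end' => $tag['end']];
            if (!$tag['self']) {
                $stack[] = count($nodes) - 1;
                $openCount[$name] = ($openCount[$name] ?? 0) + 1;
            }
            continue;
        }
        if (($openCount[$name] ?? 0) === 0) {
            continue; // stray closing tag: nothing to close, so it is ignored
        }
        // Pop until the matching opener; each stack entry is popped at most once.
        do {
            $index = array_pop($stack);
            $openCount[$nodes[$index]['name']]--;
        } while ($nodes[$index]['name'] !== $name);

        $contentStart = $nodes[$index]['end'];
        $nodes[$index]['content'] = substr($text, $contentStart, $tag['start'] - $contentStart);
        $nodes[$index]['end'] = $tag['end'];
        array_splice($nodes, $index + 1); // later nodes are absorbed into the content
    }
    return $nodes;
}
```

In this sketch, 'x [a k="v"]hello [b] there[/a] [c /]' yields two nodes: a with the content 'hello [b] there', and c with no content. Both [a k="v] and [foo.bar] produce no tags at all.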

A few implementation notes:

  • Quoted bodies use possessive quantifiers, so the lexer commits greedily like a deterministic tokenizer instead of backtracking into a different reading of the same text. That is what keeps [a k="v] and friends correct.
  • A shortcode name must end on a token boundary, so [foo.bar] is rejected wholesale rather than read as foo plus a stray parameter.
  • Parameter and bbCode parsing is deferred until a node is known to be emitted, so shortcodes absorbed into a closed ancestor's content never pay for it.
  • Nesting resolution is a single O(n) pass, not a per-node ancestor walk.
  • Pure-ASCII input skips the multibyte offset bookkeeping entirely.
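
To illustrate the last note: PREG_OFFSET_CAPTURE reports byte offsets, so multibyte input needs a byte-to-character conversion that pure-ASCII input can skip. The sketch below is an assumption about how such a fast path could look, not the code in this PR. The helper name toCharOffsets is hypothetical, and it assumes valid UTF-8 and offsets given in ascending order.

```php
<?php
// Hypothetical helper, not code from the PR.

/**
 * Convert ascending byte offsets into character offsets.
 *
 * @param int[] $byteOffsets
 * @return int[]
 */
function toCharOffsets(string $text, array $byteOffsets): array
{
    // Fast path: without any byte >= 0x80, bytes and characters coincide.
    if (preg_match('/[\x80-\xFF]/', $text) !== 1) {
        return $byteOffsets;
    }

    $result = [];
    $lastByte = 0;
    $lastChar = 0;
    foreach ($byteOffsets as $byte) {
        // Only the slice since the previous offset is scanned, so the whole
        // conversion is one linear pass over the text.
        $slice = substr($text, $lastByte, $byte - $lastByte);
        // In valid UTF-8, every byte outside 0x80-0xBF starts a character.
        $lastChar += (int) preg_match_all('/[^\x80-\xBF]/', $slice);
        $lastByte = $byte;
        $result[] = $lastChar;
    }
    return $result;
}
```

Counting non-continuation bytes avoids a dependency on the mbstring extension; mb_strlen on each slice would be an equivalent choice where that extension is available.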

Comparison to the existing parsers

  • RegularParser is the correctness reference, and TarsParser matches it byte for byte: same names, parameters, content, text, offsets, and bbCode. TarsParser is several times faster and much lighter on memory because it never builds a token array for the whole input.
  • RegexParser is quick but not robust. Its content backreference cannot cover the backtracking cases in testIssue77 and testIssue119, and an unterminated quote like [a k="v] makes it emit a bogus parameter. TarsParser is as fast or faster on prose and gets those cases right.
  • WordpressParser is the lightest and quickest, but it cannot do configurable syntax, numeric names, escaped tokens, or spaced tags. TarsParser handles all of them.

The honest tradeoff: on parameter-heavy or deeply nested input, RegexParser can still be faster, because its backreference swallows nested blocks as opaque content and never looks inside. TarsParser lexes every tag because correct nesting needs it.

Benchmarks

Measured on PHP 8.5. Time is the mean of many parses (3000 for the small corpora, fewer for the two large ones). Peak memory is for a single parse, captured with memory_reset_peak_usage(). Corpora:

  • plain: 11.4 KB of prose, no shortcodes
  • light: 7.8 KB of prose with 200 simple shortcodes
  • heavy: 7.8 KB, 100 parameter-dense shortcodes (nested gallery/img)
  • nested: 3.8 KB, 100 shortcodes nested a few levels
  • utf8: 4.2 KB of multibyte content, 100 shortcodes
  • nested-deep: 10 KB, a single 25-level-deep tree repeated 40 times
  • large-document: ~1 MB of prose with 26,000 shortcodes
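
For anyone who wants to reproduce the methodology, a harness along these lines would measure the same two quantities. It is a sketch under assumptions, not the script behind the numbers below: the function name benchmark is hypothetical, the warm-up run and the baseline subtraction are choices made here, hrtime() needs PHP 7.3+, and memory_reset_peak_usage() needs PHP 8.2+.

```php
<?php
// Hypothetical harness, not the script used for the numbers in this PR.

/**
 * @param callable $parse function (string $text) that runs one parse
 * @return array mean time in microseconds and peak bytes of a single parse
 */
function benchmark(callable $parse, string $corpus, int $iterations): array
{
    $parse($corpus); // warm-up, so one-time costs such as regex compilation are excluded

    $start = hrtime(true); // monotonic clock, nanoseconds
    for ($i = 0; $i < $iterations; $i++) {
        $parse($corpus);
    }
    $meanMicroseconds = (hrtime(true) - $start) / $iterations / 1000;

    // Peak memory of one parse, measured relative to usage just before it.
    if (function_exists('memory_reset_peak_usage')) {
        memory_reset_peak_usage(); // PHP 8.2+
    }
    $before = memory_get_usage();
    $result = $parse($corpus);
    $peakBytes = memory_get_peak_usage() - $before;
    unset($result);

    return ['meanMicroseconds' => $meanMicroseconds, 'peakBytes' => $peakBytes];
}
```

The callable would wrap one parser's parse call, so the same corpus string can be passed to each of the four parsers in turn.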

Time per parse (microseconds, lower is better)

| corpus | RegexParser | WordpressParser | RegularParser | TarsParser |
| --- | ---: | ---: | ---: | ---: |
| plain | 0.4 | 0.4 | 758 | 0.4 |
| light | 226.0 | 132.1 | 1,412 | 188.7 |
| heavy | 160.7 | 98.3 | 1,783 | 196.2 |
| nested | 103.4 | 52.1 | 1,760 | 243.6 |
| utf8 | 131.3 | 73.6 | 757 | 110.4 |
| nested-deep | 79.4 | 30.8 | 6,435 | 512.2 |
| large-document | 59,758 | 46,616 | 295,333 | 56,852 |

Peak memory per parse (lower is better)

| corpus | RegexParser | WordpressParser | RegularParser | TarsParser |
| --- | ---: | ---: | ---: | ---: |
| plain | 2.7 KB | 0.9 KB | 2.10 MB | 1.4 KB |
| light | 519 KB | 418 KB | 2.79 MB | 499 KB |
| heavy | 314 KB | 239 KB | 3.42 MB | 606 KB |
| nested | 221 KB | 175 KB | 2.67 MB | 893 KB |
| utf8 | 269 KB | 219 KB | 1.41 MB | 281 KB |
| nested-deep | 139 KB | 94 KB | 5.66 MB | 1.79 MB |
| large-document | 59.0 MB | 48.8 MB | 361.4 MB | 58.5 MB |

TarsParser vs RegularParser (the parser it matches byte for byte)

| corpus | faster by | less memory by |
| --- | ---: | ---: |
| plain | ~1900x | ~1500x |
| light | 7.5x | 5.7x |
| heavy | 9.1x | 5.8x |
| nested | 7.2x | 3.1x |
| utf8 | 6.9x | 5.2x |
| nested-deep | 12.6x | 3.2x |
| large-document | 5.2x | 6.2x |

On the 1 MB document, RegularParser needs 361 MB and 295 ms; TarsParser does the same work, with identical output, in 58 MB and 57 ms, landing right on top of RegexParser for both. The result-object cost is shared (both RegexParser and TarsParser sit near 59 MB there because 26,000 ParsedShortcode objects dominate), so RegularParser's extra 300 MB is purely the retained token array. TarsParser uses more memory than RegexParser on heavy, nested, and nested-deep (and marginally on utf8). The gap is widest on the two nested corpora, which is the cost of lexing every tag rather than swallowing inner blocks as opaque content. It is still about 3x leaner than RegularParser there.

Robustness

  • Passes the existing test suite, and I wired TarsParser into the ParserTest data provider and testInstances so it runs against every existing case alongside the other parsers.
  • Extended testIssue77 and testIssue119 to assert TarsParser matches RegularParser on the backtracking cases.
  • Differentially fuzzed against RegularParser across more than 2 million random inputs (multiple seeds) with zero divergence on names, parameters, content, text, offsets, and bbCode. It even reproduces the behaviour where a [/0] closing tag is ignored, since the closing name passes through your if(!$closingName = ...) check and '0' is falsy in PHP. I matched that on purpose rather than "fixing" it, so the output stays identical.
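
The differential check can be sketched roughly as follows. This is not the fuzzer used for the PR: randomInput, findDivergence, and closingNameIsSkipped are hypothetical names, the fragment list is an assumption, and the two callables are assumed to return plain arrays or scalars (real ParsedShortcode objects would first need to be normalized, because !== compares objects by identity).

```php
<?php
// Hypothetical differential fuzzer, not the harness used for this PR.

/** Build a random input from fragments likely to form valid or broken shortcodes. */
function randomInput(int $maxPieces): string
{
    $pieces = ['[', ']', '/', '=', '"', ' ', 'a', 'b', '0', '\\',
               '[a]', '[/a]', '[b k="v"]', '[/0]', 'text'];
    $out = '';
    $count = mt_rand(0, $maxPieces);
    for ($i = 0; $i < $count; $i++) {
        $out .= $pieces[mt_rand(0, count($pieces) - 1)];
    }
    return $out;
}

/**
 * Run both parsers on random inputs and return the first input on which
 * their outputs differ, or null if no divergence is found.
 */
function findDivergence(callable $reference, callable $candidate, int $seed, int $runs): ?string
{
    mt_srand($seed); // a fixed seed makes any failure reproducible
    for ($i = 0; $i < $runs; $i++) {
        $input = randomInput(12);
        if ($reference($input) !== $candidate($input)) {
            return $input;
        }
    }
    return null;
}

/** The '0' quirk: an assignment inside a negated condition treats '0' as "no name". */
function closingNameIsSkipped(string $name): bool
{
    if (!$closingName = $name) {
        return true; // reached for '' and for '0', both falsy strings in PHP
    }
    return false;
}
```

Running findDivergence with several seeds and a large run count, and treating any non-null return as a failing test case, gives the kind of parity check described above.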

Safety and compatibility

  • Possessive quantifiers mean no catastrophic backtracking. 50k stray brackets parse in well under a millisecond.
  • Throws on PCRE failure (preg_last_error()) instead of silently returning no shortcodes, same as RegularParser.
  • Psalm-clean at errorLevel 1.
  • One new file, no new dependencies, runs on PHP 7.1+, and supports configurable SyntaxInterface like RegularParser and RegexParser.
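
The failure handling in the second bullet could look roughly like this. It is a sketch, not the PR's code: the wrapper name matchAllOrThrow and the choice of RuntimeException are assumptions. preg_match_all() returns false when PCRE fails, and preg_last_error() then reports which PREG_*_ERROR occurred.

```php
<?php
// Hypothetical wrapper, not code from the PR.

/** Run preg_match_all and turn a PCRE failure into an exception. */
function matchAllOrThrow(string $pattern, string $text): array
{
    $count = preg_match_all($pattern, $text, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);
    if ($count === false) {
        // Without this check a failure would look the same as "no shortcodes found".
        throw new RuntimeException('PCRE failure, error code ' . preg_last_error());
    }
    return $matches;
}
```

For example, a pattern with the u modifier applied to a subject that is not valid UTF-8 makes preg_match_all() return false with PREG_BAD_UTF8_ERROR, which this wrapper reports instead of returning an empty list.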

What's in this PR

  • src/Parser/TarsParser.php: the parser.
  • tests/ParserTest.php: TarsParser added to the data provider, testInstances, and the issue77 / issue119 parity checks.
  • README.md: one factual bullet in the existing parser list, and "three" to "four".

I left benchmark numbers out of the README on purpose. They live here so you can decide whether any of it belongs in the docs. Happy to adjust naming, wording, or scope however you prefer.

rhukster added 2 commits June 18, 2026 08:52
TarsParser lexes every shortcode tag (opening and closing) in a single
PCRE pass, then resolves nesting with a linear stack pass. This pairs
RegexParser-class scanning speed with RegularParser-grade robustness:

- the lexer understands quoted values and escapes, so an unterminated
  quote like [a k="v] correctly fails to lex instead of inventing a
  bogus parameter
- nesting, mismatched closing tags and open-only shortcodes resolve
  exactly like the default RegularParser
- pure-ASCII fast path for offsets, deferred parameter parsing for
  absorbed nodes, and an O(n) absorption pass (no O(n^2) ancestor walk)

Verified byte-identical to RegularParser across 2M+ differential fuzz
inputs, and 6.5-9.1x faster than RegularParser (2.7-6.1x faster than
FastParser) on representative content. Throws on PCRE failure rather
than silently returning no shortcodes. Psalm-clean at errorLevel 1.