Add TarsParser: RegularParser's output at RegexParser's speed by rhukster · Pull Request #123 · thunderer/Shortcode

rhukster · 2026-06-18T14:35:53Z

Summary

This adds TarsParser, a fourth parser that produces exactly the same result as RegularParser, including proper nesting and invalid syntax detection, but does the work in a single PCRE pass plus a flat stack instead of a recursive token parser. The goal was RegularParser's correctness at close to RegexParser's speed and memory.

How it works

One preg_match_all lexes every individual tag, opening and closing, in a single C-level pass. The regex understands quoted values and escapes, so a broken tag like [a k="v] fails to lex instead of inventing a parameter. A linear stack pass then resolves nesting, mismatched closing tags, and open-only shortcodes. There is no full token array, no recursion, and no content backreference.
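A minimal sketch of the two-phase shape described above, not the code in src/Parser/TarsParser.php. The pattern is deliberately simplified (it does not understand quoted values), and the names SKETCH_TAG_PATTERN, sketchLex and sketchResolve are hypothetical. The closing-tag lookup here is a naive walk down the stack, so it illustrates the flat-stack idea rather than the O(n) pass the PR uses.

```php
<?php
declare(strict_types=1);

// Simplified tag shape: "[name ...]" or "[/name]". Assumption: real syntax is configurable.
const SKETCH_TAG_PATTERN = '~\[(/?)([a-zA-Z0-9-]++)([^\]]*+)\]~';

/** Phase 1: one preg_match_all call turns the text into a flat list of tag tokens. */
function sketchLex(string $text): array
{
    $count = preg_match_all(SKETCH_TAG_PATTERN, $text, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);
    if ($count === false) {
        throw new RuntimeException('PCRE failure: ' . preg_last_error());
    }
    $tokens = [];
    foreach ($matches as $m) {
        // With PREG_OFFSET_CAPTURE every group is [matchedString, byteOffset].
        $tokens[] = [
            'closing' => $m[1][0] === '/',
            'name'    => $m[2][0],
            'start'   => $m[0][1],
            'end'     => $m[0][1] + strlen($m[0][0]),
        ];
    }
    return $tokens;
}

/** Phase 2: a linear walk over the tokens with a flat stack of unclosed openers. */
function sketchResolve(string $text, array $tokens): array
{
    $nodes = []; // top-level shortcodes in document order
    $open  = []; // indexes into $nodes for openers that are not closed yet
    foreach ($tokens as $t) {
        if (!$t['closing']) {
            $nodes[] = ['name' => $t['name'], 'start' => $t['start'], 'end' => $t['end'], 'content' => null];
            $open[]  = count($nodes) - 1;
            continue;
        }
        // Find the nearest unclosed opener with the same name.
        for ($i = count($open) - 1; $i >= 0; $i--) {
            if ($nodes[$open[$i]]['name'] === $t['name']) {
                break;
            }
        }
        if ($i < 0) {
            continue; // mismatched closing tag: ignore it
        }
        $idx = $open[$i];
        $nodes[$idx]['content'] = substr($text, $nodes[$idx]['end'], $t['start'] - $nodes[$idx]['end']);
        $nodes[$idx]['end']     = $t['end'];
        array_splice($nodes, $idx + 1); // everything inside is absorbed into the content
        array_splice($open, $i);        // the opener and anything opened after it leave the stack
    }
    return $nodes;
}

$text  = 'intro [a][b k="v"]x[/b][/a] mid [c] [/d] end';
$nodes = sketchResolve($text, sketchLex($text));
```

For this input the sketch yields two top-level nodes: a, whose content is the raw inner text including [b k="v"]x[/b], and c, which is open-only and has null content. The stray [/d] is ignored.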

A few implementation notes:

Quoted bodies use possessive quantifiers, so the lexer commits greedily like a deterministic tokenizer instead of backtracking into a different reading of the same text. That is what keeps [a k="v] and friends correct.
A shortcode name must end on a token boundary, so [foo.bar] is rejected wholesale rather than read as foo plus a stray parameter.
Parameter and bbCode parsing is deferred until a node is known to be emitted, so shortcodes absorbed into a closed ancestor's content never pay for it.
Nesting resolution is a single O(n) pass, not a per-node ancestor walk.
Pure-ASCII input skips the multibyte offset bookkeeping entirely.
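To make the possessive-quantifier and token-boundary notes concrete, here is a small sketch. The patterns and the helper sketchLexesAsTag are illustrative assumptions, not the regex shipped in TarsParser. The first pair of patterns shows how a backtracking quantifier can re-read an escaped quote as a terminator while the possessive form refuses to; the tag pattern then shows [a k="v] and [foo.bar] failing to lex.

```php
<?php
declare(strict_types=1);

// Same quoted-string body, once with a backtracking "*" and once with a possessive "*+".
const BACKTRACKING_STRING = '~"(?:\\\\.|[^"])*"~';
const POSSESSIVE_STRING   = '~"(?:\\\\.|[^"])*+"~';

// Hypothetical tag pattern: name, then key=value pairs, then "]".
// Unquoted values may not contain a quote, and every repetition is possessive.
const SKETCH_STRICT_TAG = '~\[([a-z0-9-]++)(?:\s++[a-z0-9-]++=(?:"(?:\\\\.|[^"\\\\])*+"|[^\s"\]]++))*+\s*+\]~i';

function sketchLexesAsTag(string $text): bool
{
    return preg_match(SKETCH_STRICT_TAG, $text) === 1;
}

// Four characters: quote, a, backslash, quote. The string is never terminated,
// because its only other quote is escaped.
$unterminated = '"a\\"';

// Backtracking gives up the escape reading, treats the backslash as an ordinary
// character, and uses the escaped quote as the closing quote.
$backtrackingMatches = preg_match(BACKTRACKING_STRING, $unterminated) === 1;

// The possessive form keeps its first reading and simply fails.
$possessiveMatches = preg_match(POSSESSIVE_STRING, $unterminated) === 1;

$wellFormed = sketchLexesAsTag('[a k="v"]');  // a complete quoted value lexes
$brokenQuote = sketchLexesAsTag('[a k="v]');  // no closing quote, so no tag
$dottedName = sketchLexesAsTag('[foo.bar]');  // name does not end on a boundary
```

In this sketch the backtracking pattern matches the unterminated string and the possessive one does not, which is the "different reading of the same text" that the possessive quantifiers rule out.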

Comparison to the existing parsers

RegularParser is the correctness reference, and TarsParser matches it byte for byte: same names, parameters, content, text, offsets, and bbCode. It is several times faster and much lighter on memory because it never builds a token array for the whole input.
RegexParser is quick but not robust. Its content backreference cannot cover the backtracking cases in testIssue77 and testIssue119, and an unterminated quote like [a k="v] makes it emit a bogus parameter. TarsParser is as fast or faster on prose and gets those cases right.
WordpressParser is the lightest and quickest, but it cannot do configurable syntax, numeric names, escaped tokens, or spaced tags. TarsParser handles all of them.

The honest tradeoff: on parameter-heavy or deeply nested input, RegexParser can still be faster, because its backreference swallows nested blocks as opaque content and never looks inside. TarsParser lexes every tag because correct nesting needs it.

Benchmarks

Measured on PHP 8.5. Time is the mean of many parses (3000 for the small corpora, fewer for the two large ones). Peak memory is for a single parse, isolated by calling memory_reset_peak_usage() beforehand. Corpora:

plain: 11.4 KB of prose, no shortcodes
light: 7.8 KB of prose with 200 simple shortcodes
heavy: 7.8 KB, 100 parameter-dense shortcodes (nested gallery/img)
nested: 3.8 KB, 100 shortcodes nested a few levels
utf8: 4.2 KB of multibyte content, 100 shortcodes
nested-deep: 10 KB, a 25-level-deep tree repeated 40 times
large-document: ~1 MB of prose with 26,000 shortcodes
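The measurement method described above (mean time over many parses, peak memory for one parse) can be sketched roughly as follows. This is an assumed harness, not the script that produced the numbers in this PR: sketchMeasure and the stand-in parser closure are hypothetical, and the guard around memory_reset_peak_usage() exists only because that function requires PHP 8.2 or later.

```php
<?php
declare(strict_types=1);

/**
 * Times $iterations parses and records the peak memory of one extra parse.
 *
 * @return array{mean_us: float, peak_bytes: int}
 */
function sketchMeasure(callable $parse, string $corpus, int $iterations): array
{
    $parse($corpus); // warm-up so one-off costs (autoloading, regex compilation) are excluded

    $start = hrtime(true); // monotonic clock, nanoseconds
    for ($i = 0; $i < $iterations; $i++) {
        $parse($corpus);
    }
    $meanUs = (float) ((hrtime(true) - $start) / $iterations / 1000);

    gc_collect_cycles();
    if (function_exists('memory_reset_peak_usage')) {
        memory_reset_peak_usage(); // PHP 8.2+: start the peak counter from current usage
    }
    $before = memory_get_usage();
    $result = $parse($corpus); // keep the result alive so its cost is counted
    $peak   = max(0, memory_get_peak_usage() - $before);
    unset($result);

    return ['mean_us' => $meanUs, 'peak_bytes' => $peak];
}

// Stand-in "parser" for illustration only: it just counts bracketed words.
$standIn = static function (string $text): int {
    return (int) preg_match_all('~\[\w++\]~', $text);
};

$report = sketchMeasure($standIn, str_repeat('some prose [tag] ', 100), 50);
```

Without memory_reset_peak_usage() the peak figure reflects the whole process so far, which is why the per-parse numbers need PHP 8.2 or later to be meaningful.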

Time per parse (microseconds, lower is better)

corpus	RegexParser	WordpressParser	RegularParser	TarsParser
plain	0.4	0.4	758	0.4
light	226.0	132.1	1,412	188.7
heavy	160.7	98.3	1,783	196.2
nested	103.4	52.1	1,760	243.6
utf8	131.3	73.6	757	110.4
nested-deep	79.4	30.8	6,435	512.2
large-document	59,758	46,616	295,333	56,852

Peak memory per parse (lower is better)

corpus	RegexParser	WordpressParser	RegularParser	TarsParser
plain	2.7 KB	0.9 KB	2.10 MB	1.4 KB
light	519 KB	418 KB	2.79 MB	499 KB
heavy	314 KB	239 KB	3.42 MB	606 KB
nested	221 KB	175 KB	2.67 MB	893 KB
utf8	269 KB	219 KB	1.41 MB	281 KB
nested-deep	139 KB	94 KB	5.66 MB	1.79 MB
large-document	59.0 MB	48.8 MB	361.4 MB	58.5 MB

TarsParser vs RegularParser (the parser it matches byte for byte)

corpus	faster by	less memory by
plain	~1900x	~1500x
light	7.5x	5.7x
heavy	9.1x	5.8x
nested	7.2x	3.1x
utf8	6.9x	5.2x
nested-deep	12.6x	3.2x
large-document	5.2x	6.2x

On the 1 MB document, RegularParser needs 361 MB and 295 ms; TarsParser does the same work, with identical output, in 58 MB and 57 ms, landing right on top of RegexParser for both. The result-object cost is shared (both RegexParser and TarsParser sit near 59 MB there because 26,000 ParsedShortcode objects dominate), so RegularParser's extra 300 MB is purely the retained token array. TarsParser uses noticeably more memory than RegexParser on the heavy, nested, and nested-deep corpora (and marginally more on utf8), which is the cost of lexing every tag rather than swallowing inner blocks as opaque content. Even on nested and nested-deep it is still about 3x leaner than RegularParser.

Robustness

Passes the existing test suite, and I wired TarsParser into the ParserTest data provider and testInstances so it runs against every existing case alongside the other parsers.
Extended testIssue77 and testIssue119 to assert TarsParser matches RegularParser on the backtracking cases.
Differentially fuzzed against RegularParser across 2 million-plus random inputs (multiple seeds) with zero divergence on names, parameters, content, text, offsets, and bbCode. It even reproduces the behaviour where a [/0] closing tag is ignored, since the closing name passes through your if(!$closingName = ...) check and '0' is falsy in PHP. I matched that on purpose rather than "fixing" it, so the output stays identical.
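A rough sketch of what a differential fuzz loop of this kind can look like, together with the '0' falsiness quirk mentioned above. It is not the harness used for this PR: sketchRandomInput, sketchFirstDivergence, sketchClosingNameHonoured and the stand-in callables are all hypothetical, and a real run would compare the full parser output (names, parameters, content, text, offsets, bbCode) rather than a single integer.

```php
<?php
declare(strict_types=1);

/** Builds a short random string biased towards shortcode syntax characters. */
function sketchRandomInput(int $length): string
{
    $alphabet = ['[', ']', '/', '=', '"', ' ', 'a', 'b', '0', '\\'];
    $out = '';
    for ($i = 0; $i < $length; $i++) {
        $out .= $alphabet[mt_rand(0, count($alphabet) - 1)];
    }
    return $out;
}

/** Returns the first input on which the two callables disagree, or null if none is found. */
function sketchFirstDivergence(callable $reference, callable $candidate, int $seed, int $rounds): ?string
{
    mt_srand($seed); // fixed seed so a failing input can be reproduced
    for ($i = 0; $i < $rounds; $i++) {
        $input = sketchRandomInput(mt_rand(0, 40));
        if ($reference($input) !== $candidate($input)) {
            return $input;
        }
    }
    return null;
}

/** Mirrors the shape of an `if (!$closingName = ...)` guard: assignment inside a truthiness test. */
function sketchClosingNameHonoured(string $name): bool
{
    if (!$closingName = $name) {
        return false; // '' and '0' both land here, because both are falsy strings in PHP
    }
    return true;
}

// Stand-ins for two parsers: here they only report the input length.
$reference = static function (string $text): int { return strlen($text); };
$agreeing  = static function (string $text): int { return strlen($text); };
$broken    = static function (string $text): int { return strlen($text) + 1; };

$noDivergence = sketchFirstDivergence($reference, $agreeing, 1234, 200);
$divergence   = sketchFirstDivergence($reference, $broken, 1234, 200);
```

Seeding the generator is the design choice that matters here: any diverging input can be regenerated from the seed and round number and turned into a regression test.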

Safety and compatibility

Possessive quantifiers mean no catastrophic backtracking. 50k stray brackets parse in well under a millisecond.
Like RegularParser, throws on PCRE failure (detected via preg_last_error()) instead of silently returning no shortcodes.
Psalm-clean at errorLevel 1.
One new file, no new dependencies, runs on PHP 7.1+, and supports configurable SyntaxInterface like RegularParser and RegexParser.
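The "throw on PCRE failure" behaviour listed above can be illustrated with a short sketch. The function sketchLexOrFail and its exception type are assumptions for illustration; the actual exception class and message used by TarsParser are not shown in this PR description. Invalid UTF-8 under the u modifier is used here as a simple way to make PCRE report an error.

```php
<?php
declare(strict_types=1);

/**
 * Runs the pattern and refuses to treat a PCRE error as "no shortcodes found".
 *
 * @return array<int, array<int, string>>
 */
function sketchLexOrFail(string $pattern, string $text): array
{
    $count = preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

    // preg_match_all returns false on failure; preg_last_error() says why.
    if ($count === false || preg_last_error() !== PREG_NO_ERROR) {
        throw new RuntimeException('PCRE failure, preg_last_error() = ' . preg_last_error());
    }

    return $matches;
}

$valid = sketchLexOrFail('~\[\w++\]~u', '[a] text [b]');

$threw = false;
try {
    // "\xff" is not valid UTF-8, so the u modifier makes the call fail
    // with PREG_BAD_UTF8_ERROR instead of returning matches.
    sketchLexOrFail('~\[\w++\]~u', "[a]\xff");
} catch (RuntimeException $e) {
    $threw = true;
}
```

A caller that only checked for an empty match list would read the second input as a document without shortcodes, which is the silent failure mode this check avoids.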

What's in this PR

src/Parser/TarsParser.php: the parser.
tests/ParserTest.php: TarsParser added to the data provider, testInstances, and the issue77 / issue119 parity checks.
README.md: one factual bullet in the existing parser list, and "three" to "four".

I left benchmark numbers out of the README on purpose. They live here so you can decide whether any of it belongs in the docs. Happy to adjust naming, wording, or scope however you prefer.

TarsParser lexes every shortcode tag (opening and closing) in a single PCRE pass, then resolves nesting with a linear stack pass. This pairs RegexParser-class scanning speed with RegularParser-grade robustness:

- the lexer understands quoted values and escapes, so an unterminated quote like [a k="v] correctly fails to lex instead of inventing a bogus parameter
- nesting, mismatched closing tags and open-only shortcodes resolve exactly like the default RegularParser
- pure-ASCII fast path for offsets, deferred parameter parsing for absorbed nodes, and an O(n) absorption pass (no O(n^2) ancestor walk)

Verified byte-identical to RegularParser across 2M+ differential fuzz inputs, and 6.5-9.1x faster than RegularParser (2.7-6.1x faster than FastParser) on representative content. Throws on PCRE failure rather than silently returning no shortcodes. Psalm-clean at errorLevel 1.

rhukster added 2 commits June 18, 2026 08:52

Document TarsParser in the README parser list

dcac4d1

rhukster force-pushed the add-tars-parser branch from 6babb36 to dcac4d1 on June 18, 2026 14:52

rhukster wants to merge 2 commits into thunderer:master from rhukster:add-tars-parser
