Add TarsParser: RegularParser's output at RegexParser's speed #123
Open
rhukster wants to merge 2 commits into
TarsParser lexes every shortcode tag (opening and closing) in a single PCRE pass, then resolves nesting with a linear stack pass. This pairs RegexParser-class scanning speed with RegularParser-grade robustness:
- the lexer understands quoted values and escapes, so an unterminated quote like [a k="v] correctly fails to lex instead of inventing a bogus parameter
- nesting, mismatched closing tags and open-only shortcodes resolve exactly like the default RegularParser
- pure-ASCII fast path for offsets, deferred parameter parsing for absorbed nodes, and an O(n) absorption pass (no O(n^2) ancestor walk)
Verified byte-identical to RegularParser across 2M+ differential fuzz inputs, and 6.5-9.1x faster than RegularParser (2.7-6.1x faster than FastParser) on representative content. Throws on PCRE failure rather than silently returning no shortcodes. Psalm-clean at errorLevel 1.
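For readers unfamiliar with how PCRE failures surface in PHP, here is a minimal sketch of the "throw instead of silently returning nothing" behavior described above. The helper name matchAllOrThrow and the exception message are assumptions made for illustration; this is not the code from the PR.

```php
<?php
// Hypothetical helper, for illustration only.
// preg_match_all() returns false when PCRE itself fails (for example a
// backtrack limit or malformed UTF-8 under the /u modifier). Treating that
// false as "zero matches" would hide the failure, so it is turned into an
// exception here.
function matchAllOrThrow(string $pattern, string $subject): array
{
    $count = preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER);
    if ($count === false) {
        throw new RuntimeException('PCRE failure, error code ' . preg_last_error());
    }

    return $matches;
}
```

With this shape, a caller can tell "the input contains no shortcodes" (an empty array) apart from "the regex engine gave up" (an exception).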
Summary
This adds TarsParser, a fourth parser that produces exactly the same result as RegularParser, including proper nesting and invalid syntax detection, but does the work in a single PCRE pass plus a flat stack instead of a recursive token parser. The goal was RegularParser's correctness at close to RegexParser's speed and memory.
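The "single PCRE pass plus a flat stack" idea can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the implementation in src/Parser/TarsParser.php: the function name sketchParse, the regex, the result shape, and the policy for stray closing tags are all made up for the example. Offsets are plain byte offsets, and the naive stack search does not reproduce the PR's O(n) absorption pass.

```php
<?php
// Illustrative only: regex, names and mismatch policy are assumptions.
function sketchParse(string $text): array
{
    $name = '[a-zA-Z0-9_-]+';
    // A value is either a double-quoted string with backslash escapes or a
    // bare word, so an unterminated quote such as [a k="v] cannot match.
    $value = '"(?:[^"\\\\]|\\\\.)*"|[^\s"\]]+';
    $pattern = '~\[(?:/(?<close>' . $name . ')|(?<open>' . $name . ')'
        . '(?:\s+' . $name . '(?:=(?:' . $value . '))?)*)\s*\]~';

    // Pass 1: lex every opening and closing tag in one call.
    $count = preg_match_all($pattern, $text, $tags, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);
    if ($count === false) {
        throw new RuntimeException('PCRE failure, error code ' . preg_last_error());
    }

    // Pass 2: resolve nesting with a flat stack of still-open tags.
    $stack = [];
    $result = [];
    foreach ($tags as $tag) {
        [$raw, $offset] = $tag[0];
        $closeName = $tag['close'][0] ?? '';
        if ($closeName === '') {
            $stack[] = [
                'name' => $tag['open'][0],
                'offset' => $offset,
                'end' => $offset + strlen($raw),
                'children' => [],
            ];
            continue;
        }
        // Find the nearest open tag with the same name.
        $i = count($stack) - 1;
        while ($i >= 0 && $stack[$i]['name'] !== $closeName) {
            $i--;
        }
        if ($i < 0) {
            continue; // stray closing tag: ignored in this sketch
        }
        $open = $stack[$i];
        // Everything between the pair becomes opaque content of the outer tag.
        array_splice($stack, $i);
        $node = [
            'name' => $open['name'],
            'content' => substr($text, $open['end'], $offset - $open['end']),
            'offset' => $open['offset'],
        ];
        if ($i === 0) {
            $result[] = $node;
        } else {
            $stack[$i - 1]['children'][] = $node;
        }
    }
    // Tags that never closed are open-only; the pairs found after them follow.
    foreach ($stack as $open) {
        $result[] = ['name' => $open['name'], 'content' => null, 'offset' => $open['offset']];
        foreach ($open['children'] as $child) {
            $result[] = $child;
        }
    }

    return $result;
}
```

For example, sketchParse('[b]x [i]y[/i] z[/b] [hr]') yields a b node whose content still contains the raw [i]y[/i] text, followed by an open-only hr node with null content.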
How it works
One preg_match_all lexes every individual tag, opening and closing, in a single C-level pass. The regex understands quoted values and escapes, so a broken tag like [a k="v] fails to lex instead of inventing a parameter. A linear stack pass then resolves nesting, mismatched closing tags, and open-only shortcodes. There is no full token array, no recursion, and no content backreference.
A few implementation notes:
- [a k="v] and friends correct.
- [foo.bar] is rejected wholesale rather than read as foo plus a stray parameter.
Comparison to the existing parsers
testIssue77 and testIssue119, and an unterminated quote like [a k="v] makes it emit a bogus parameter. TarsParser is as fast or faster on prose and gets those cases right.
The honest tradeoff: on parameter-heavy or deeply nested input, RegexParser can still be faster, because its backreference swallows nested blocks as opaque content and never looks inside. TarsParser lexes every tag because correct nesting needs it.
Benchmarks
Measured on PHP 8.5. Time is the mean of many parses (3000 for the small corpora, fewer for the two large ones). Peak memory is for a single parse, captured with memory_reset_peak_usage().
Corpora:
gallery/img)
Time per parse (microseconds, lower is better)
Peak memory per parse (lower is better)
TarsParser vs RegularParser (the parser it matches byte for byte)
On the 1 MB document, RegularParser needs 361 MB and 295 ms; TarsParser does the same work, with identical output, in 58 MB and 57 ms, landing right on top of RegexParser for both. The result-object cost is shared (both RegexParser and TarsParser sit near 59 MB there because 26,000 ParsedShortcode objects dominate), so RegularParser's extra 300 MB is purely the retained token array. The two spots where TarsParser uses more memory than RegexParser are the deeply nested ones, which is the cost of lexing every tag rather than swallowing inner blocks as opaque content. It is still about 3x leaner than RegularParser there.
Robustness
- ParserTest data provider and testInstances so it runs against every existing case alongside the other parsers.
- testIssue77 and testIssue119 to assert TarsParser matches RegularParser on the backtracking cases.
- [/0] closing tag is ignored, since the closing name passes through your if(!$closingName = ...) check and '0' is falsy in PHP. I matched that on purpose rather than "fixing" it, so the output stays identical.
Safety and compatibility
- preg_last_error()) instead of silently returning no shortcodes, same as RegularParser.
- SyntaxInterface like RegularParser and RegexParser.
What's in this PR
- src/Parser/TarsParser.php: the parser.
- tests/ParserTest.php: TarsParser added to the data provider, testInstances, and the issue77 / issue119 parity checks.
- README.md: one factual bullet in the existing parser list, and "three" to "four".
I left benchmark numbers out of the README on purpose. They live here so you can decide whether any of it belongs in the docs. Happy to adjust naming, wording, or scope however you prefer.
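As a footnote to the [/0] point under Robustness, the behavior follows from PHP's truthiness rules and can be reproduced in isolation. The sketch below is an assumption-laden stand-in: readClosingName and isClosingTagHonored are hypothetical helpers, and only the shape of the if(!$closingName = ...) check is taken from the description above.

```php
<?php
// Hypothetical stand-in for the closing-name lookup; not RegularParser's code.
function readClosingName(string $tag): ?string
{
    return preg_match('~^\[/([^\]]+)\]$~', $tag, $m) === 1 ? $m[1] : null;
}

function isClosingTagHonored(string $tag): bool
{
    // Assignment inside the condition: the test is on the assigned value's
    // truthiness, so null (no name) and the string '0' both fail it.
    if (!$closingName = readClosingName($tag)) {
        return false;
    }

    return true;
}
```

In this sketch [/a] and [/00] are honored while [/0] is not, because '0' is the only non-empty string PHP treats as false.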