
[Repo Assist] Precompile markdown parser regexes to avoid per-call construction #1070

Draft
github-actions[bot] wants to merge 2 commits into main from
repo-assist/perf-precompile-markdown-regexes-2026-03-07-9f11514c53596463

github-actions bot commented Mar 7, 2026

🤖 This PR was created by Repo Assist, an automated AI assistant.

Summary

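Three regex patterns in the Markdown parser were rebuilt on every call to their containing active patterns: the pattern string was constructed and passed to the static Regex.Match each time. One of these active patterns sits on a hot path, so the repeated work was unnecessarily expensive.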

Before

MarkdownInlineParser.fs, Punctuation pattern (called for every character during inline parsing):

let (|Punctuation|_|) input =
    match input with
    | EscapedChar _ -> None
    | _ ->
        let re = """^[!"#$%&'()*+,...huge ~1.5 KB string...]"""
        let match' = Regex.Match(Array.ofList input |> String, re)  // entire remaining input converted each call
        ...

MarkdownInlineParser.fs (HtmlEntity pattern) and MarkdownBlockParser.fs (BlockquoteStart pattern): similar per-call string construction and Regex.Match calls.

After

  • All three regex patterns are extracted to private module-level let bindings and compiled with RegexOptions.Compiled, so each Regex is constructed and compiled once, when the module is initialized, instead of on every call.
  • Punctuation: passes only the first 2 chars of remaining input (sufficient for all cases: 1 char for BMP punctuation, 2 chars for surrogate-pair cases). Avoids converting potentially thousands of characters on every call.
  • HtmlEntity: passes only the first 34 chars (the maximum possible entity length: & + 32 name/digit chars + ;).
  • BlockquoteStart: removes per-call string concatenation to build the pattern.
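
As an illustration of the approach described in the list above, here is a minimal sketch of a module-level compiled regex combined with a bounded input slice. It is not the code from this PR: the module name InlineSketch, the helper prefix, the shortened ASCII-only character class, and the (matched text, remaining input) return shape are all assumptions made for the example, and the EscapedChar check from the original pattern is omitted.

```fsharp
open System
open System.Text.RegularExpressions

module InlineSketch =

    // Built once when the module is initialized, not on every call.
    // Hypothetical ASCII-only subset; the real character class is far longer.
    let private punctuationRegex =
        Regex("""^[!"#$%&'()*+,./:;<=>?@_~-]""", RegexOptions.Compiled)

    // Hypothetical helper: copies at most n leading chars of the list into a string.
    let private prefix (n: int) (input: char list) : string =
        String(input |> List.truncate n |> Array.ofList)

    // Matches against the first 2 chars only, instead of converting the
    // whole remaining input to a string on every call.
    // The return shape (matched text, remaining input) is an assumption.
    let (|Punctuation|_|) (input: char list) =
        let m = punctuationRegex.Match(prefix 2 input)

        if m.Success then
            Some(m.Value, List.skip m.Length input)
        else
            None

open InlineSketch

// Usage: classify the head of a char list.
let describe (text: string) =
    match List.ofSeq text with
    | Punctuation(p, rest) -> sprintf "punctuation %s, %d chars left" p rest.Length
    | _ -> "no punctuation"
```

The same shape would apply to HtmlEntity with a 34-character prefix. Because the pattern is anchored with ^ and can never match more than the prefix length, truncating the input should not change which inputs match.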

Impact

The Punctuation active pattern is in the inner loop of parseChars, which processes every character in the inline content, so for documents with long prose sections it is called thousands of times per parse. Precompiling the regex and shortening the input slice should meaningfully reduce allocation and CPU time in the markdown parser hot path.

Test Status

✅ Build: dotnet build FSharp.Formatting.sln --configuration Release succeeded (0 errors)
✅ Tests: dotnet test tests/FSharp.Markdown.Tests: 257/257 passed
✅ Formatting: dotnet fantomas — no changes needed after manual format

Generated by Repo Assist

To install this agentic workflow, run

gh aw add githubnext/agentics/workflows/repo-assist.md@8e6d7c86bba37371d2d0eee1a23563db3e561eb5

The Punctuation and HtmlEntity active patterns in MarkdownInlineParser.fs
were constructing regex pattern strings and calling Regex.Match on each
invocation. The Punctuation pattern is called for every character during
inline parsing, making the repeated regex construction a hot path.

Changes:
- Extract punctuationRegex and htmlEntityRegex as module-level compiled
  Regex values (RegexOptions.Compiled) in MarkdownInlineParser.fs
- Limit the input string passed to each regex to its theoretical maximum
  match length (2 chars for punctuation, 34 chars for entities), avoiding
  conversion of the entire remaining input list on every call
- Extract blockquoteRegex as a module-level compiled Regex in
  MarkdownBlockParser.fs, removing the per-call string concatenation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>