Skip to content

chore(deps): bump form-data from 4.0.5 to 4.0.6#7340

Merged
jdstrand merged 1 commit into
masterfrom
dependabot/npm_and_yarn/form-data-4.0.6
Jun 16, 2026
Merged

chore(deps): bump form-data from 4.0.5 to 4.0.6#7340
jdstrand merged 1 commit into
masterfrom
dependabot/npm_and_yarn/form-data-4.0.6

Conversation

@dependabot

@dependabot dependabot Bot commented on behalf of github Jun 16, 2026

Copy link
Copy Markdown
Contributor

Bumps form-data from 4.0.5 to 4.0.6.

Changelog

Sourced from form-data's changelog.

v4.0.6 - 2026-06-12

Commits

  • [Fix] escape CR, LF, and " in field names and filenames 8dff42c
  • [Dev Deps] update @ljharb/eslint-config, auto-changelog, tape f31d21e
  • [Deps] update hasown, mime-types 92ae0eb
  • [Dev Deps] update js-randomness-predictor 67b0f65
Commits
  • 64190db v4.0.6
  • 92ae0eb [Deps] update hasown, mime-types
  • f31d21e [Dev Deps] update @ljharb/eslint-config, auto-changelog, tape
  • 8dff42c [Fix] escape CR, LF, and " in field names and filenames
  • 67b0f65 [Dev Deps] update js-randomness-predictor
  • See full diff in compare view

@dependabot dependabot Bot added dependencies Pull requests that update a dependency file javascript Pull requests that update javascript code labels Jun 16, 2026
@dependabot dependabot Bot requested a review from a team as a code owner June 16, 2026 16:30
@dependabot dependabot Bot requested review from sanderson and removed request for a team June 16, 2026 16:30
@dependabot dependabot Bot added dependencies Pull requests that update a dependency file javascript Pull requests that update javascript code labels Jun 16, 2026
@dependabot dependabot Bot force-pushed the dependabot/npm_and_yarn/form-data-4.0.6 branch from f122b6c to 6cd7248 Compare June 16, 2026 21:15
@jdstrand

Copy link
Copy Markdown
Contributor

@dependabot rebase

Bumps [form-data](https://github.com/form-data/form-data) from 4.0.5 to 4.0.6.
- [Release notes](https://github.com/form-data/form-data/releases)
- [Changelog](https://github.com/form-data/form-data/blob/master/CHANGELOG.md)
- [Commits](form-data/form-data@v4.0.5...v4.0.6)

---
updated-dependencies:
- dependency-name: form-data
  dependency-version: 4.0.6
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
@dependabot dependabot Bot force-pushed the dependabot/npm_and_yarn/form-data-4.0.6 branch from 6cd7248 to 1bfa6a2 Compare June 16, 2026 21:21
@jdstrand jdstrand merged commit ced5a8a into master Jun 16, 2026
11 checks passed
@dependabot dependabot Bot deleted the dependabot/npm_and_yarn/form-data-4.0.6 branch June 16, 2026 21:25
@github-actions

Copy link
Copy Markdown
Contributor

Development Plan (preserved by cleanup workflow)

Full plan details from PLAN.md, preserved before automated cleanup:


Design: Rust becomes the production Markdown converter; JS removed

Status: design approved, ready for implementation planning.
Branch artifact only — CI scrubs PLAN.md before merge to master.

Problem

The repo ships two HTML→Markdown converters that generate the .md "twins" of
every docs page:

  • A JavaScript converter (scripts/lib/markdown-converter.cjs, turndown + jsdom).
  • A Rust converter (scripts/rust-markdown-converter/, html2md + scraper,
    ~10x faster, exposed to Node via napi-rs).

markdown-converter.cjs tries to require() the compiled Rust module and, on
failure, silently falls back to JS with only a console.log. The Rust build
outputs (*.node, index.js) are gitignored, and nothing in CI, the staging
deploy, or package.json ever runs napi build. So:

  • Local clones where someone once ran napi build load Rust.
  • A fresh worktree, CI (cimg/node:24.5.0, no Rust toolchain), and staging all
    silently use JS.

The result: the 10x-faster converter has never run in CI or production. The
silent fallback hid the gap. This design wires Rust in as the one production
converter, deletes the JS conversion path, and guarantees no regression in the
generated Markdown.

Goal

CI and staging generate the .md twins with the Rust converter. The JS
conversion path is deleted. No semantic regressions in output. All e2e and
integration tests pass.

Constraints and decisions

Decision Choice
Direction Rust is the production converter; remove JS conversion entirely
Parity bar Semantic parity — diff the full corpus, fix all semantic diffs, document accepted cosmetic diffs
Binary delivery Build from source in CI (lockstep source/binary), plus a guarded local postinstall
Fallback No silent fallback — hard-fail loudly if the Rust binary can't load
markdown-converter.cjs Delete it; repoint consumers and package.json exports at the Rust module
publisher/canonical JS post-step, not the converter. #7294 added scripts/lib/provenance.js; injectPageProvenance (per page) and combineMarkdown (sections) stamp them after conversion. It is converter-agnostic by design. Rust does not emit them.
date/lastmod JS post-step too — fold into injectPageProvenance, sourced from sitemap-md.xml <lastmod> (per page only). Keeps Rust an exact drop-in; near-zero build cost (one sitemap parse at startup, O(1)/page). Rust does not emit them.
Rust converter scope Exact drop-in. Emits only title, description, url, estimated_tokens, product, version — matching the post-#7294 JS converter field-for-field. API stays (html, url_path, base_url); no HTML-head extraction.
Section bundling ownership Stays in JS (build-llm-markdown.js combineMarkdown/findSections, untouched). It concatenates generated .md files, not HTML. The Rust convert_section_to_markdown binding is unused → removed.
pages semantics Preserve children.length + 1 (parent + children), the current active contract. Preserved automatically since section bundling is untouched.
Legacy html-to-markdown.js Retire it (and build:md:legacy / build:md:verbose). It is the only caller of convertSectionToMarkdown, not used in CI, and superseded by build-llm-markdown.js.
docs-tooling Does not consume the converter — removal is fully safe, no cross-repo coordination needed
Acceptance All markdown e2e + integration tests pass

Current state (verified)

  • Runtime consumers of the converter in this repo: scripts/build-llm-markdown.js
    (active build:md) and scripts/html-to-markdown.js (legacy build:md:legacy).
    Both require('./lib/markdown-converter.cjs').
  • package.json exports: "./markdown-converter" points at the .cjs. No
    external repo consumes it (docs-tooling does not use the converter).
  • publisher/canonical are added by a JS post-step, not the converter.
    feat(jsonld): carry InfluxData provenance into LLM-native artifacts (#7290) #7294 added scripts/lib/provenance.js. build-llm-markdown.js:154 calls
    injectPageProvenance(markdown, { publisher, canonical }) after conversion
    (canonical = sitemap origin + urlPath); combineMarkdown does the same
    for sections. injectPageProvenance is documented converter-agnostic. feat(jsonld): carry InfluxData provenance into LLM-native artifacts (#7290) #7294
    also added test:build-md (build-llm-markdown.test.mjs, provenance.test.mjs)
    — part of the acceptance gate; both stay green untouched.
  • The post-feat(jsonld): carry InfluxData provenance into LLM-native artifacts (#7290) #7294 JS converter base frontmatter
    (markdown-converter.cjs:501-513) emits exactly title, description,
    url, estimated_tokens, product, version. This is the Rust drop-in
    target.
  • The Rust lib.rs is a full reimplementation: it emits frontmatter and
    product detection, not just the HTML→MD body. It currently emits
    product_version (not version) and build-time date/lastmod — both must
    change (rename, and remove timestamps) to become a drop-in.
  • Product mappings are already configuration on the Rust side. build.rs
    reads data/products.yml at build time (cargo:rerun-if-changed) and
    generates the URL→product map. The JS converter hardcodes its own copy
    (URL_PATTERN_MAP / PRODUCT_NAME_MAP). The dying JS maps are not ported.
  • turndown, @types/turndown, and jsdom are used only by the JS conversion
    path (the jsdom mention in build-llm-markdown.js is a comment). All
    removable.
  • The abandoned Lambda@Edge markdown-generator references have already been
    removed from .gitignore, DOCS-TESTING.md, and scripts/html-to-markdown.js.
    The live DocsOriginRequestRewriteLambda (deploy/edge.js,
    deploy/docs-website.yml) is a separate URL-rewrite function and stays.

Two converters, two scopes (clarified)

The active CI path (build:mdscripts/build-llm-markdown.js) splits work:

  • Per-page conversion (Phase 1): calls convertToMarkdown(html, urlPath)
    the only converter binding the active path uses. This is what Rust replaces.
  • Section bundles (Phase 2): combineMarkdown reads the already-generated
    per-page .md files and concatenates them into index.section.md, emitting
    section frontmatter (type: section, pages: children + 1, child_pages,
    summed estimated_tokens). This is pure JS string/IO work over .md files,
    not HTML. It stays in JS, untouched.

So the Rust convert_section_to_markdown binding is never used by the active
path — it is reachable only from the legacy html-to-markdown.js, which is
retired. The migration's parity scope is therefore per-page conversion only;
section frontmatter is preserved by construction.

The real parity work: make Rust an exact drop-in

The Rust converter must emit the same base frontmatter as the post-#7294 JS
converter
, field-for-field, so the JS post-steps (provenance, timestamps)
layer on identically. The converter's only job is the six base fields;
everything else is JS.

Per-page base field Post-#7294 JS Current Rust Action
title unchanged
description unchanged
url base_url + path base_url + path unchanged
estimated_tokens unchanged
product ✓ (optional) ✓ (optional) unchanged
version version: product_version: rename in Rust
date/lastmod added later by JS post-step build-time now() remove from Rust
publisher/canonical added later by JS post-step not emitted keep not-emitted
conversion engine turndown/jsdom html2md/scraper html2md/scraper

Required Rust changes before cutover:

  1. Rename frontmatter field product_versionversion.
  2. Remove date/lastmod from the Rust frontmatter (the JS post-step owns
    them now). Remove chrono::Utc::now(); drop the chrono dependency if it
    becomes unused.
  3. Match the post-feat(jsonld): carry InfluxData provenance into LLM-native artifacts (#7290) #7294 base field order exactly: title, description,
    url, estimated_tokens, product, version.

The build script gains one responsibility: resolve base_url (port
detectBaseUrl from the deleted .cjs) and pass it to the binding, which it
currently does not (build-llm-markdown.js:144 calls it with two args).

Converter API contract

The binding signature is unchanged, and Rust does no HTML-head extraction:

convert_to_markdown(
  html_content: String,
  url_path: String,
  base_url: String,
) -> Option<String>   // frontmatter: title, description, url,
                      //   estimated_tokens, product?, version?

publisher, canonical, date, and lastmod are added afterward in JS:

  • injectPageProvenance (scripts/lib/provenance.js) already stamps
    publisher and canonical (= sitemap origin + urlPath). It is
    converter-agnostic.
  • Extend the same step to also stamp date/lastmod from sitemap-md.xml
    <lastmod> (per page). One parse at startup builds a urlPath → lastmod
    Map; O(1) per page. Sections keep provenance only (no per-page timestamp).

The caller's only new responsibility is supplying base_url, which the deleted
.cjs resolved internally (detectBaseUrl, ported into the build script).

The convert_section_to_markdown binding is removed.

Design

1. Wire Rust into CI (build from source, lockstep)

  • Add a step to the CircleCI build job: install a cached rustup toolchain
    (cache ~/.cargo, ~/.rustup, and scripts/rust-markdown-converter/target/)
    and run napi build --release in scripts/rust-markdown-converter/
    (target x86_64-unknown-linux-gnu) before build:md.
  • Add a guarded postinstall that builds the binary locally when cargo is
    present; when absent, print install instructions rather than failing
    yarn install.
  • Build outputs stay gitignored (binaries never committed).

2. Remove the silent fallback

Replace catch { console.log('using JS') } with a loud hard-fail that explains
how to build the binary. A converter that silently degrades is the root cause
this work exists.

3. Delete the JS converter and the legacy entrypoint

  • Delete scripts/lib/markdown-converter.cjs.
  • Repoint build-llm-markdown.js and package.json exports directly at the
    Rust napi module.
  • Retire the legacy scripts/html-to-markdown.js and its npm scripts
    (build:md:legacy, build:md:verbose). It is the only convertSectionToMarkdown
    caller, is not used in CI, and is superseded by build-llm-markdown.js.
  • Remove the unused Rust convert_section_to_markdown binding.
  • Remove turndown, @types/turndown, jsdom from package.json.

4. Pin the frontmatter contract as one definition

Product mappings already have one source (data/products.yml via build.rs) —
keep that model. Make the Rust frontmatter struct the single source for the
frontmatter field set, and align DOCS-TESTING.md to it. This converts an
implicit, drift-prone contract (duplicated across JS code, Rust, and doc prose)
into one enforced definition.

5. Golden-snapshot regression test

Once JS is gone, parity can no longer be JS-vs-Rust. Add a golden-snapshot test:
a small fixture corpus of representative HTML pages with checked-in expected
.md, run via cargo test and a Node integration test. Locks the frontmatter
and conversion contract so it cannot silently drift again, and satisfies the
"every PR with testable code ships a test" rule.

6. Migration parity gate (one-time)

Prerequisite: rebase the branch onto post-#7294 master so the baseline
includes the current frontmatter contract (publisher, canonical).

Scope: per-page index.md output (section bundles are unchanged JS, so
they need only the coherence check, not a converter diff). A harness that:

  1. Captures post-feat(jsonld): carry InfluxData provenance into LLM-native artifacts (#7290) #7294 JS per-page output as a baseline over the built public/
    corpus.
  2. Generates Rust per-page output over the same corpus.
  3. Diffs every page, classifying each diff:
    • Semantic (lost content, broken code fence, malformed frontmatter,
      dropped/altered link) → must fix in Rust.
    • Cosmetic (whitespace, escaping) → accept and record.
  4. Passes at zero semantic diffs. Accepted cosmetic diffs are documented in the
    PR.

7. Acceptance gate

The reimplementation is done only when all pass:

  • cypress/e2e/content/markdown-content-validation.cy.js
  • cypress/e2e/content/markdown-autodiscovery.cy.js
  • scripts/__tests__/build-llms-full-txt.test.mjs
  • scripts/__tests__/corpus-paths.test.mjs
  • yarn check:md-coherence
  • the new golden-snapshot test (cargo test + Node integration)

Out of scope (follow-up issue)

The CLI (scripts/docs-cli/lib/product-resolver.js) and TypeScript
(assets/js/utils/product-mappings.ts) also hardcode the URL→product map.
Unifying all three onto data/products.yml is a real content-as-data cleanup,
but it is a different feature area from "Rust converter in CI." File it as a
separate issue rather than expanding this migration's scope.

Frontmatter version semantics → #7299.
The emitted version field is the docs edition slug (core, enterprise, …),
not a software release; the real release (latest_patch, e.g. 3.9.3) is never
surfaced. Clarifying the key (e.g. edition) and optionally adding a real
release field is a frontmatter-contract change affecting all twins + consumers,
so it stays out of this migration. This migration keeps version: <slug> for
drop-in parity with the post-#7294 JS baseline.


Rust Markdown Converter Migration — Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Make the Rust converter the one converter that runs in CI and staging, delete the JS conversion path, and ship the .md twins with no semantic regression.

Architecture: The active build (build:mdscripts/build-llm-markdown.js) keeps its two phases — per-page conversion (Phase 1) and JS section bundling (Phase 2, untouched). Phase 1 switches from the JS .cjs to the Rust napi module. Rust is built from source in CI and via a local postinstall. A diff harness gates parity against the post-#7294 JS baseline; a golden-snapshot test locks the contract going forward.

Tech Stack: Rust + napi-rs (@napi-rs/cli), Node ESM build scripts, CircleCI (cimg/node), Cypress, node --test.


Sequencing note

Task 1 (rebase) is a hard prerequisite for the parity harness (Task 9) and the
Rust frontmatter task (Task 5), because the baseline field set (publisher,
canonical, exact order) lives only in post-#7294 master. Tasks 2–4 (build
wiring) and Task 6 (section/legacy removal) are independent of the rebase and
can be done first.

Execution status / handoff (read this first)

Branch: worktree-fix-rust-markdown-conversion, rebased onto origin/master
(includes #7294). Work in this worktree; use relative paths.

Working preference (directive)

Continue autonomously through the remaining tasks. Surface only genuine
decisions
— choices the plan/code/sensible-defaults can't resolve and that
change what gets built (e.g. an output-contract change affecting all twins, a
naming/schema choice, deleting something you didn't create). Do not pause
for routine progress check-ins or "should I proceed?" Make and state reversible
calls; keep moving. Commit per task; let lefthook run.

Done and committed

  • Task 1 — rebase onto post-feat(jsonld): carry InfluxData provenance into LLM-native artifacts (#7290) #7294 master + JS baseline at .parity-baseline/
    (4,684 per-page index.md, gitignored, on disk here). Pre-migration reference
    for Task 9/10b — do not regenerate after cutover. A fresh clone must
    rebuild it from pre-cutover state before Task 7.
  • Task 2 — postinstall Rust build (scripts/build-rust-converter.js) + napi
    CLI v2 pin.
  • Task 3 — CI builds Rust before build:md (.circleci/config.yml).
  • Task 4 — removed unused convert_section_to_markdown binding.
  • Task 5 — drop-in frontmatter: version (not product_version), no
    build-time timestamps, chrono dropped.
  • Task 6detectBaseUrlscripts/lib/base-url.js.
  • Task 7build:md cut over to the Rust module (hard-fail load guard).
  • Task 8 — deleted JS converter + legacy CLI; dropped turndown/jsdom; added
    build:md --path/--limit; migrated Cypress fixture generation. (plan gap
    fixed: the grep must include cypress/.)
  • Task 9 — converter parity fixes + scan. format-selector leak fixed
    (reliable same-parse clean_article_html); body h1 omitted; headings
    normalized to open-ATX (normalize_headings); Note-callout faithfulness
    accepted. Parity scan: 0 semantic regressions over 4,684 pages (only 2
    pre-existing EMPTY_BODY landing pages). scripts/parity-scan.mjs in-tree.
  • Task 8bdate/lastmod stamped from sitemap-md.xml in the JS
    provenance post-step (readSitemapLastmods + extended injectPageProvenance).
  • Task 10 — golden snapshot (markdown-parity.test.mjs); fixture stored as
    .txt so markdown formatters don't rewrite it.
  • Task 10b — completeness guards (pages + sections) and the Section markdown truncated in clipboard  #6792 Cypress
    clipboard test (section-clipboard-copy.cy.js).
  • Task 11 — acceptance (mostly green). Passing: cargo test (7),
    test:build-md (13), test:corpus-paths (8), test:build-llms-full (8),
    test:markdown-parity (2), test:markdown-completeness (2),
    check:md-coherence, build:llms-full. Cypress
    markdown-content-validation: 46/48 pass, 1 skipped, 1 dev-only artifact
    — "no localhost URLs in frontmatter" fails because detectBaseUrl returns
    http://localhost:1313 under NODE_ENV=development; the JS baseline has the
    same localhost URLs, so it is pre-existing and passes in CI/production.
    Cypress markdown-autodiscovery 10/10 and section-clipboard-copy (Section markdown truncated in clipboard  #6792
    guard) 1/1 — the clipboard receives the full index.section.md.
  • Task 12 — docs: rewrote scripts/README.md, updated DOCS-TESTING.md
    (architecture, frontmatter, --path/--limit).

Decisions on record (don't re-litigate)

  • Frontmatter key = version (edition slug), matching the JS baseline.
    Renaming/adding a real release version is out of scope → Generated .md frontmatter: clarify edition slug vs. release version #7299.
  • Body h1 = omitted (title in frontmatter; consistent with API-ref twins).
    Accepted parity diff vs the pre-migration baseline; update Cypress :240.
  • Note callouts rendered as > [!Note] are a Rust improvement over JS —
    accept callout-vs-paragraph diffs in the scan.

Environment

cargo 1.95, hugo 0.157 extended, node 26, yarn 1.22. After any lib.rs edit,
rebuild the napi module: node scripts/build-rust-converter.js.

Start here

Task 9 remaining → 8b → 10 → 10b → 11 → 12. Immediate next converter fix:
normalize headings to open-ATX (html2md emits setext Title\n--- for h2 and
closed-ATX #### X ####; the baseline is consistent open-ATX ## X / #### X).
Then build the parity scan (Task 9 Step 1) to surface anything else, then the
remaining verification/doc tasks.

Truncation (Task 10b): tracked by #6792 (section markdown truncated in
clipboard; example /influxdb3/enterprise/admin/last-value-cache/). Diagnosed
during planning: generation is complete (both index.md and index.section.md
match their source/parts), so #6792 lives in the runtime fetch/clipboard path in
format-selector.ts — a UI subsystem outside this migration's scope. Task
10b ships a build-time completeness test (pages + sections) and a Cypress
clipboard test that reproduces/locks #6792; the underlying clipboard fix, if
needed, is a separate follow-up under #6792.

File structure

  • Modify: scripts/rust-markdown-converter/src/lib.rs — drop-in frontmatter (rename version, remove timestamps); remove section binding.
  • Create: scripts/build-rust-converter.js — postinstall build (cargo-guarded).
  • Create: scripts/lib/base-url.js — ported detectBaseUrl (ESM).
  • Modify: scripts/lib/provenance.js — add readSitemapLastmods; stamp date/lastmod in injectPageProvenance.
  • Modify: scripts/build-llm-markdown.js — require Rust module, resolve+pass base_url, build lastmod Map, hard-fail on load, add --path/--limit targeting (Task 8).
  • Delete: scripts/lib/markdown-converter.cjs, scripts/html-to-markdown.js.
  • Modify: cypress/e2e/content/markdown-content-validation.cy.js, cypress/e2e/content/llm-format-selector.cy.js — generate fixtures via build:md --path instead of the deleted legacy script; update product_version:version: assertions (Task 8/11).
  • Modify: package.jsonexports, postinstall, remove deps + legacy scripts.
  • Modify: .circleci/config.yml — Rust toolchain + napi build step.
  • Rewrite: scripts/README.md — drop legacy html-to-markdown.js docs, document Rust + build-llm-markdown.js (Task 12).
  • Create: scripts/parity-scan.mjs — corpus structural + content-loss scan (Task 9).
  • Create: scripts/__tests__/markdown-parity.test.mjs — golden-snapshot test + fixtures.
  • Create: scripts/__tests__/markdown-completeness.test.mjs — truncation guard, pages + sections (Task 10b).
  • Create: cypress/e2e/content/section-clipboard-copy.cy.jsSection markdown truncated in clipboard  #6792 clipboard guard (Task 10b).
  • Modify: DOCS-TESTING.md — frontmatter schema + architecture.

Task 1: Rebase onto post-#7294 master and capture the JS baseline ✅ DONE

Files: none (git + build artifacts)

  • Step 1: Rebase the branch onto current master
git fetch origin
git rebase origin/master

Expected: clean rebase. Resolve conflicts if any (most likely in package.json scripts or DOCS-TESTING.md).

grep -nE "publisher|canonical" scripts/lib/markdown-converter.cjs

Expected: matches showing how publisher (constant) and canonical (from the page) are assembled. Note the exact field order in the frontmatter object — this is the contract Rust must mirror.

  • Step 3: Build the site and the JS baseline .md
yarn build:ts
npx hugo --quiet --destination public
yarn build:md --public-dir public

Expected: public/**/index.md files exist (JS-generated).

  • Step 4: Snapshot the JS baseline for later diffing
mkdir -p .parity-baseline
rsync -a --prune-empty-dirs --include='*/' --include='index.md' --exclude='*' public/ .parity-baseline/
echo ".parity-baseline/" >> .gitignore

Expected: .parity-baseline/ mirrors every per-page index.md. This is the parity target. Do not commit it.

  • Step 5: Commit the rebase resolution only
git add .gitignore
git commit -m "chore: ignore .parity-baseline snapshot dir"

Task 2: Add a cargo-guarded local build (postinstall) ✅ DONE

Toolchain prerequisite (found during execution): the crate is napi v2
(napi/napi-derive 2.16), but scripts/rust-markdown-converter/package.json
pinned @napi-rs/cli@^3.4.1 (v3), which builds the .node binary but does
not emit the index.js JS wrapper for a v2 crate — so the module can't be
required. Pin the CLI to ^2.18.4 (and regenerate the subpackage
yarn.lock) so napi build emits index.js/index.d.ts. Done in commit
064b28d7a.

Files:

  • Create: scripts/build-rust-converter.js

  • Modify: package.json:68 (postinstall)

  • Step 1: Write the build script

Create scripts/build-rust-converter.js:

#!/usr/bin/env node
/**
 * Build the Rust markdown converter napi module when a Rust toolchain is
 * present. Skips with a clear message otherwise so `yarn install` never fails
 * on machines without Rust. CI builds it explicitly (see .circleci/config.yml).
 */
import { execFileSync } from 'node:child_process';
import { existsSync } from 'node:fs';
import path from 'node:path';

const pkgDir = path.resolve('scripts/rust-markdown-converter');

function has(cmd) {
  try {
    execFileSync(cmd, ['--version'], { stdio: 'ignore' });
    return true;
  } catch {
    return false;
  }
}

if (!has('cargo')) {
  console.log(
    'ℹ Skipping Rust converter build: cargo not found. ' +
      'Install Rust (https://rustup.rs) to build it locally, or rely on CI.'
  );
  process.exit(0);
}

console.log('🦀 Building Rust markdown converter...');
execFileSync('yarn', ['install', '--frozen-lockfile'], {
  cwd: pkgDir,
  stdio: 'inherit',
});
execFileSync('yarn', ['build'], { cwd: pkgDir, stdio: 'inherit' });

if (!existsSync(path.join(pkgDir, 'index.js'))) {
  console.error('✗ Rust build did not produce index.js');
  process.exit(1);
}
console.log('✓ Rust markdown converter built');
  • Step 2: Chain it into postinstall

In package.json, change line 68:

"postinstall": "node scripts/setup-local-bin.js && node scripts/build-rust-converter.js",
  • Step 3: Run it on a machine with cargo

Run: node scripts/build-rust-converter.js
Expected: ✓ Rust markdown converter built, and scripts/rust-markdown-converter/index.js + a *.node file exist.

  • Step 4: Verify the no-cargo path degrades gracefully

Run: PATH=/usr/bin node scripts/build-rust-converter.js (a PATH without cargo)
Expected: prints the "Skipping Rust converter build" message and exits 0.

  • Step 5: Commit
git add scripts/build-rust-converter.js package.json
git commit -m "build: build Rust converter on postinstall when cargo is present"

Task 3: Build Rust in CI before build:md

Files:

  • Modify: .circleci/config.yml (build job, before the "Generate LLM-friendly Markdown" step at ~line 47)

  • Step 1: Add a Rust toolchain + build step

In .circleci/config.yml, in the build job steps:, insert immediately before the - run: name: Hugo Build step:

      - restore_cache:
          keys:
            - rust-{{ checksum "scripts/rust-markdown-converter/Cargo.toml" }}
            - rust-
      - run:
          name: Install Rust toolchain
          command: |
            curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs \
              | sh -s -- -y --profile minimal --default-toolchain stable
            echo 'source "$HOME/.cargo/env"' >> "$BASH_ENV"
      - run:
          name: Build Rust markdown converter
          command: |
            source "$HOME/.cargo/env"
            node scripts/build-rust-converter.js
      - save_cache:
          key: rust-{{ checksum "scripts/rust-markdown-converter/Cargo.toml" }}
          paths:
            - /home/circleci/.cargo
            - /home/circleci/.rustup
            - scripts/rust-markdown-converter/target
  • Step 2: Validate the CircleCI config

Run: npx --yes @circleci/circleci-config-validate .circleci/config.yml (or circleci config validate if the CLI is installed)
Expected: config is valid. If neither validator is available, at minimum run npx yaml-lint .circleci/config.yml to confirm YAML parses.

  • Step 3: Commit
git add .circleci/config.yml
git commit -m "ci: build Rust markdown converter before generating Markdown"

Task 4: Remove the unused Rust section binding

Files:

  • Modify: scripts/rust-markdown-converter/src/lib.rs:606-669

  • Step 1: Delete convert_section_to_markdown and ChildPageInput

In scripts/rust-markdown-converter/src/lib.rs, remove the entire block from the doc comment above ChildPageInput (line ~606) through the end of convert_section_to_markdown (line ~669), including the #[napi(object)] pub struct ChildPageInput { ... } and both #[napi] items. Leave convert_to_markdown (above) and detect_product_from_path (below) intact.

  • Step 2: Build to confirm nothing else referenced it

Run: cd scripts/rust-markdown-converter && cargo build --release
Expected: compiles with no errors (no remaining references to ChildPageInput).

  • Step 3: Commit
git add scripts/rust-markdown-converter/src/lib.rs
git commit -m "refactor(rust): remove unused convert_section_to_markdown binding"

Task 5: Make the Rust per-page frontmatter an exact drop-in

Files:

  • Modify: scripts/rust-markdown-converter/src/lib.rs (the Frontmatter struct, generate_frontmatter, and #[cfg(test)] mod tests)
  • Modify: scripts/rust-markdown-converter/Cargo.toml (drop chrono)

Target field set/order = the post-#7294 JS converter base output (markdown-converter.cjs:501-513): title, description, url, estimated_tokens, product, version. Rust emits exactly these — no publisher/canonical/date/lastmod (those are JS post-steps; see Task 8b). convert_to_markdown's signature is unchanged and does no HTML-head extraction.

  • Step 1: Write/adjust unit tests for the drop-in frontmatter

In the #[cfg(test)] mod tests block in lib.rs, append:

    #[test]
    fn test_frontmatter_uses_version_not_product_version() {
        let html = r#"<html><head></head><body>
            <article class="article--content"><h1>Get started</h1><p>Body.</p></article>
          </body></html>"#;
        let out = convert_to_markdown(
            html.to_string(),
            "/influxdb3/core/get-started/".to_string(),
            "https://docs.influxdata.com".to_string(),
        )
        .unwrap()
        .unwrap();
        assert!(out.contains("\nversion: core\n"));
        assert!(!out.contains("product_version:"));
    }

    #[test]
    fn test_frontmatter_omits_provenance_and_timestamps() {
        // publisher/canonical/date/lastmod are added later by the JS post-step,
        // never by the converter.
        let html = r#"<html><head>
            <meta name="last-modified" content="2025-01-15T00:00:00Z">
          </head><body>
            <article class="article--content"><h1>X</h1><p>Body.</p></article>
          </body></html>"#;
        let out = convert_to_markdown(
            html.to_string(),
            "/influxdb3/core/x/".to_string(),
            "https://docs.influxdata.com".to_string(),
        )
        .unwrap()
        .unwrap();
        assert!(!out.contains("publisher:"));
        assert!(!out.contains("canonical:"));
        assert!(!out.contains("date:"));
        assert!(!out.contains("lastmod:"));
    }
  • Step 2: Run the tests to confirm they fail

Run: cd scripts/rust-markdown-converter && cargo test
Expected: FAIL — current struct emits product_version and build-time date/lastmod.

  • Step 3: Update the Frontmatter struct to the drop-in field set/order

Replace the struct with:

#[derive(Debug, Serialize)]
struct Frontmatter {
    title: String,
    description: String,
    url: String,
    estimated_tokens: usize,
    #[serde(skip_serializing_if = "Option::is_none")]
    product: Option<String>,
    #[serde(skip_serializing_if = "Option::is_none")]
    version: Option<String>,
}
  • Step 4: Simplify generate_frontmatter (rename field, drop chrono)

Remove the chrono::Utc::now() block and the date/lastmod/product_version fields; build the struct with version. The signature is unchanged (5 args), so convert_to_markdown needs no edit:

fn generate_frontmatter(
    title: &str,
    description: &str,
    url_path: &str,
    content_length: usize,
    base_url: &str,
) -> String {
    let product = detect_product(url_path);

    let description = description
        .chars()
        .filter(|c| !c.is_control() || *c == '\n')
        .collect::<String>()
        .split_whitespace()
        .collect::<Vec<_>>()
        .join(" ")
        .chars()
        .take(500)
        .collect::<String>();

    let estimated_tokens = (content_length + 3) / 4;
    let full_url = format!("{}{}", base_url, url_path);

    let frontmatter = Frontmatter {
        title: title.to_string(),
        description,
        url: full_url,
        estimated_tokens,
        product: product.as_ref().map(|p| p.name.clone()),
        version: product.as_ref().map(|p| p.version.clone()),
    };

    match serde_yaml::to_string(&frontmatter) {
        Ok(yaml) => format!("---\n{}---", yaml),
        Err(_) => "---\n---".to_string(),
    }
}
  • Step 5: Remove the now-unused chrono dependency

Run grep -n chrono src/lib.rs. If no matches remain, delete the chrono = "0.4" line from scripts/rust-markdown-converter/Cargo.toml.

  • Step 6: Run the tests to confirm they pass

Run: cd scripts/rust-markdown-converter && cargo test
Expected: PASS, including the two new tests and the existing test_product_detection / test_html_to_markdown.

  • Step 7: Commit
git add scripts/rust-markdown-converter/src/lib.rs scripts/rust-markdown-converter/Cargo.toml
git commit -m "feat(rust): emit drop-in per-page frontmatter (version, no timestamps)"

Task 6: Port detectBaseUrl into the build script

Files:

  • Create: scripts/lib/base-url.js

  • Step 1: Write the helper (ported verbatim from the .cjs)

Create scripts/lib/base-url.js:

/**
 * Detect the base URL for the current build environment.
 * Ported from the removed markdown-converter.cjs so the build script can pass
 * base_url to the Rust converter (which no longer resolves it internally).
 * @returns {string}
 */
export function detectBaseUrl() {
  if (process.env.BASE_URL) {
    return process.env.BASE_URL;
  }
  if (
    process.env.HUGO_ENV === 'development' ||
    process.env.NODE_ENV === 'development'
  ) {
    return 'http://localhost:1313';
  }
  if (
    process.env.HUGO_ENV === 'staging' ||
    process.env.DEPLOY_ENV === 'staging'
  ) {
    return process.env.STAGING_URL || 'https://test2.docs.influxdata.com';
  }
  return 'https://docs.influxdata.com';
}
  • Step 2: Sanity-check the env mapping

Run: node -e "import('./scripts/lib/base-url.js').then(m=>{process.env.HUGO_ENV='staging';console.log(m.detectBaseUrl())})"
Expected: https://test2.docs.influxdata.com

  • Step 3: Commit
git add scripts/lib/base-url.js
git commit -m "refactor: extract detectBaseUrl helper for the build script"

Task 7: Cut build-llm-markdown.js over to the Rust module

Files:

  • Modify: scripts/build-llm-markdown.js:22 and :62-134

  • Step 1: Replace the converter import and add a hard-fail load guard

In scripts/build-llm-markdown.js, replace line 22:

const { convertToMarkdown } = require('./lib/markdown-converter.cjs');

with:

let convertToMarkdown;
try {
  ({ convertToMarkdown } = require('./rust-markdown-converter'));
} catch (err) {
  console.error(
    '✗ Rust markdown converter is not built.\n' +
      '  Build it with: node scripts/build-rust-converter.js\n' +
      `  (load error: ${err.message})`
  );
  process.exit(1);
}

Add the base-url import near the other imports (top of file, ESM):

import { detectBaseUrl } from './lib/base-url.js';
  • Step 2: Resolve base_url once and pass it to the converter

In buildPageMarkdown, compute the base URL once after const startTime = Date.now(); (around line 66):

  const baseUrl = detectBaseUrl();

Then change the call at line 134 from:

        const markdown = await convertToMarkdown(html, urlPath);

to:

        const markdown = await convertToMarkdown(html, urlPath, baseUrl);
  • Step 3: Smoke-test conversion on a small subset

Run:

npx hugo --quiet --destination public
yarn build:md --public-dir public --only-changed --base-branch HEAD~1
head -20 public/influxdb3/core/get-started/index.md

Expected: frontmatter shows version:, publisher: InfluxData, canonical:, and date/lastmod. No product_version. Process exits 0 (converter loaded).

  • Step 4: Commit
git add scripts/build-llm-markdown.js
git commit -m "feat: generate per-page Markdown with the Rust converter"

Task 8: Delete the JS converter, retire the legacy script, drop deps ✅ DONE

Plan gap found during execution: Task 8's original consumer grep scoped
only scripts/ deploy/ and declared html-to-markdown.js safe to delete. But
the Cypress acceptance specs invoke it in before() to generate fixtures
markdown-content-validation.cy.js (5×) and llm-format-selector.cy.js
(2×) call node scripts/html-to-markdown.js --path <p> [--limit N]. Deleting
it would have broken the Task 11 acceptance gate. Fix (chosen with the user,
option A): give the surviving generator the --path/--limit targeting the
legacy script provided, migrate the specs onto it, then delete. The specs
now exercise the real Rust path.

Files:

  • Delete: scripts/lib/markdown-converter.cjs, scripts/html-to-markdown.js

  • Modify: package.json (exports, scripts, deps), scripts/build-llm-markdown.js
    (add --path/--limit), cypress/e2e/content/markdown-content-validation.cy.js,
    cypress/e2e/content/llm-format-selector.cy.js

  • Step 1: Add --path/--limit targeting to build-llm-markdown.js

parseArgs gains --path <site-relative-path> and --limit <n>;
buildPageMarkdown filters the index.html glob to the path subtree and caps
the count (path filter first, then limit). This is the per-path generation the
Cypress specs used the legacy script for. (Phase 2 section bundling still scans
the whole public/, which is fine — the targeted page's children were just
generated.)

  • Step 2: Migrate the Cypress fixture generation off the legacy script

In both specs, replace every
node scripts/html-to-markdown.js --path … with
node scripts/build-llm-markdown.js --public-dir public --path … (path + any
--limit preserved verbatim). 7 invocations total.

  • Step 3: Delete the JS converter and the legacy CLI
git rm scripts/lib/markdown-converter.cjs scripts/html-to-markdown.js
  • Step 4: Repoint exports and remove legacy/dep entries in package.json

  • exports./markdown-converter./scripts/rust-markdown-converter/index.js

  • Remove the build:md:legacy and build:md:verbose script lines.

  • Remove turndown, @types/turndown, and jsdom from dependencies.

  • Step 5: Confirm nothing else imports the removed modules (grep MUST include cypress/)

grep -rnE "markdown-converter\.cjs|html-to-markdown|turndown|jsdom" \
  scripts deploy cypress | grep -v node_modules | grep -vE "\.md:|README"

Expected: only the intentional doc comment in scripts/lib/base-url.js. Note:
scripts/README.md still documents the deleted html-to-markdown.js end to end
— its rewrite is folded into Task 12 (docs).

  • Step 6: Reinstall and verify the active build still runs
CYPRESS_INSTALL_BINARY=0 yarn install
node scripts/build-llm-markdown.js --public-dir public --path influxdb3/core/get-started

Expected: install succeeds without turndown/jsdom and rebuilds Rust;
targeted build:md exits 0.

  • Step 7: Commit (stage explicit paths — git add -A grabs the worktree
    .cache symlink, which .gitignore's .cache/ rule does not match; add
    /.cache to .gitignore)
git add scripts package.json cypress yarn.lock .gitignore
git commit -m "chore: remove JS markdown converter and legacy CLI; add build:md --path/--limit"

Task 8b: Stamp date/lastmod in the JS provenance step

The Rust converter no longer emits timestamps. Add them as a JS post-step,
folded into the existing provenance injection so there is no extra pass over
each .md. Per page only; sections keep provenance (publisher/canonical) alone.

Files:

  • Modify: scripts/lib/provenance.js (add readSitemapLastmods, extend injectPageProvenance)

  • Modify: scripts/build-llm-markdown.js (build the lastmod Map once, pass per page)

  • Test: scripts/__tests__/provenance.test.mjs

  • Step 1: Write failing tests for sitemap lastmod parsing + stamping

Append to scripts/__tests__/provenance.test.mjs (it already imports injectPageProvenance and the fs/tmp helpers):

import { readSitemapLastmods } from '../lib/provenance.js';

test('readSitemapLastmods maps urlPath to lastmod from sitemap-md.xml', async () => {
  const dir = mkdtempSync(join(tmpdir(), 'prov-'));
  try {
    writeFileSync(
      join(dir, 'sitemap-md.xml'),
      '<urlset><url><loc>https://docs.influxdata.com/influxdb3/core/index.md</loc>' +
        '<lastmod>2025-01-15T00:00:00Z</lastmod></url></urlset>'
    );
    const map = await readSitemapLastmods(dir);
    assert.equal(map.get('/influxdb3/core/'), '2025-01-15T00:00:00Z');
  } finally {
    rmSync(dir, { recursive: true, force: true });
  }
});

test('injectPageProvenance stamps date/lastmod when lastmod is provided', () => {
  const md = '---\ntitle: X\nurl: https://docs.influxdata.com/x/\n---\n\nBody.\n';
  const out = injectPageProvenance(md, {
    publisher: 'InfluxData',
    canonical: 'https://docs.influxdata.com/x/',
    lastmod: '2025-01-15T00:00:00Z',
  });
  const fm = out.match(/^---\n([\s\S]+?)\n---/)[1];
  assert.match(fm, /date: 2025-01-15T00:00:00Z/);
  assert.match(fm, /lastmod: 2025-01-15T00:00:00Z/);
});

test('injectPageProvenance omits timestamps when lastmod is absent', () => {
  const md = '---\ntitle: X\nurl: https://docs.influxdata.com/x/\n---\n\nBody.\n';
  const out = injectPageProvenance(md, {
    publisher: 'InfluxData',
    canonical: 'https://docs.influxdata.com/x/',
  });
  assert.doesNotMatch(out, /date:/);
  assert.doesNotMatch(out, /lastmod:/);
});
  • Step 2: Run the tests to confirm they fail

Run: node --test scripts/__tests__/provenance.test.mjs
Expected: FAIL — readSitemapLastmods is not exported and injectPageProvenance ignores lastmod.

  • Step 3: Add readSitemapLastmods to scripts/lib/provenance.js
/**
 * Build a urlPath -> lastmod map from sitemap-md.xml. Keys are site-relative
 * paths (e.g. "/influxdb3/core/"), matching the build script's urlPath.
 * @param {string} publicDir
 * @returns {Promise<Map<string,string>>}
 */
export async function readSitemapLastmods(publicDir = 'public') {
  const map = new Map();
  try {
    const xml = await fs.readFile(
      path.join(publicDir, 'sitemap-md.xml'),
      'utf-8'
    );
    const re = /<url>\s*<loc>([^<]+)<\/loc>\s*<lastmod>([^<]+)<\/lastmod>/g;
    let m;
    while ((m = re.exec(xml)) !== null) {
      const urlPath = new URL(m[1]).pathname.replace(/index\.md$/, '');
      map.set(urlPath, m[2]);
    }
  } catch {
    /* no sitemap -> empty map -> timestamps omitted */
  }
  return map;
}
  • Step 4: Extend injectPageProvenance to stamp date/lastmod

Update its destructured arg and body (omit timestamps when lastmod is falsy):

export function injectPageProvenance(markdown, { publisher, canonical, lastmod }) {
  const match = markdown.match(/^---\r?\n([\s\S]+?)\r?\n---\r?\n+([\s\S]+)$/);
  if (!match) return markdown;
  let fm;
  try {
    fm = yaml.load(match[1]);
  } catch {
    return markdown;
  }
  if (!fm || typeof fm !== 'object') return markdown;
  fm.publisher = publisher;
  fm.canonical = canonical;
  if (lastmod) {
    fm.date = lastmod;
    fm.lastmod = lastmod;
  }
  const body = match[2];
  const serialized = yaml.dump(fm, { lineWidth: -1, noRefs: true }).trim();
  return `---\n${serialized}\n---\n\n${body}`;
}
  • Step 5: Build the map once and pass lastmod per page in build-llm-markdown.js

Add the import (~line 52), build the map in the provenance setup (~line 491), and pass the per-page value at the call site (~line 154):

// imports (~line 52)
import {
  loadOrgIdentity,
  readSitemapOrigin,
  readSitemapLastmods,
  injectPageProvenance,
} from './lib/provenance.js';

// provenance setup (~line 491)
const [org, origin, lastmods] = await Promise.all([
  loadOrgIdentity(),
  readSitemapOrigin(cliOptions.publicDir),
  readSitemapLastmods(cliOptions.publicDir),
]);
const provenance = { publisher: org.name, origin, lastmods };

// call site in buildPageMarkdown (~line 154)
const output = provenance
  ? injectPageProvenance(markdown, {
      publisher: provenance.publisher,
      canonical: `${provenance.origin}${urlPath}`,
      lastmod: provenance.lastmods.get(urlPath),
    })
  : markdown;

provenance is already threaded into buildPageMarkdown via options, so lastmods rides along. combineMarkdown (sections) is unchanged.

  • Step 6: Run the tests to confirm they pass

Run: node --test scripts/__tests__/provenance.test.mjs scripts/__tests__/build-llm-markdown.test.mjs
Expected: PASS (new timestamp tests + the existing #7294 provenance/section tests).

  • Step 7: Commit
git add scripts/lib/provenance.js scripts/build-llm-markdown.js scripts/__tests__/provenance.test.mjs
git commit -m "feat: stamp date/lastmod from sitemap in the provenance step"

Task 9: Migration parity gate (fixture diff + structural scan)

Body diffs found during cutover (Task 7), vs the .parity-baseline/ JS
output, and how each was resolved:

  1. for AI widget leak on ~2,690/4,684 pages — FIXED (commit after Task 8).
    Root cause was not a missing selector: clean_html re-parsed
    article.html(), and html5ever restructuring (e.g. <tbody> insertion)
    meant the re-serialized element needles no longer matched the haystack, so
    .replace() silently removed nothing for complex subtrees like the
    format-selector. Fix: compute haystack + needles from the same parse
    (clean_article_html(article: ElementRef)), strip longest-first. Locked by
    test_omits_body_h1_and_strips_format_selector.
  2. Body h1 — DECIDED: omit (keep html_to_markdown(&content, true)).
    Earlier note said "keep"; reversed by the user. The title is in frontmatter,
    and the API-reference markdown twins also omit the body h1, so omitting
    keeps one consistent convention across all LLM-facing twins. This is an
    accepted, documented parity diff vs the pre-migration JS baseline (which
    kept the h1), not a regression. Task 11 must update
    markdown-content-validation.cy.js:240
    (/^# /m) — the body no longer
    has an h1; assert it has headings (e.g. /^## /m) instead.
  3. Note-callout vs paragraph — NOT a regression (Rust is more faithful).
    Authored > [!Note] callouts render as <div class="block note">; Rust
    round-trips them to > [!Note] (which markdown-content-validation.cy.js:455
    expects), while JS flattened them to plain paragraphs, losing the callout.
    Treat callout-vs-paragraph diffs as an accepted improvement in the scan.
    (The "support section leak" worry was wrong — the baseline keeps the
    get-started "Find support" Tip callout too; the Cypress no-Discord assertion
    runs on the leaf page, which has none.)

Still open for the scan/fixtures: heading style (html2md emits setext
Title\n=== / --- for h1/h2; the JS baseline is ATX #/##) — likely a
broad diff to normalize in postprocess_markdown (setext→ATX) or accept. The
scan below will surface its true scope. See Task 11's note on the
product_version:version: assertion updates (also a contract change, not
a converter bug).

Do not byte-diff all ~4,700 pages — that is dominated by the intentional
date/lastmod additions plus cosmetic whitespace, and engine regressions are
classes of bugs that a curated sample exercises just as well. Instead:

  1. a feature-coverage fixture set — diff a dozen real pages chosen to
    exercise every structural feature (the actual semantic check);
  2. a cheap corpus-wide structural scan — an O(n) pass that flags only
    objective breakage (empty body, missing frontmatter, leftover raw HTML,
    unbalanced code fences) and gross content loss vs the baseline (body-length
    ratio). Surfaces a short outlier list, no human per-page diffing.

Corpus-wide content validation is already covered by Cypress (Task 11) and exact
output is locked by the golden snapshot (Task 10). The .parity-baseline/
snapshot from Task 1 is the reference for both checks.

Files:

  • Modify: scripts/rust-markdown-converter/src/lib.rs (heading normalization — Step 0)

  • Create: scripts/parity-scan.mjs (kept in-tree for re-runs)

  • Step 0: Normalize headings to open-ATX in postprocess_markdown

html2md emits setext for h2 (Data model\n----------) and closed-ATX
elsewhere (#### Related ####); the JS baseline is consistent open-ATX
(## Data model, #### Related). Rust output is also internally mixed, so
normalize regardless of parity. Add to postprocess_markdown:

  • setext → ATX: a non-blank text line immediately followed (no blank line
    between) by a line of only =# , or only -## . Guard against
    false positives: skip if the text line already starts with #, |, >, or a
    list marker, and require the underline to be the whole line. Crucially this
    must run before convert_tables so a table separator (--- under a
    | … | row) is never seen as setext — or exclude lines containing |.
  • closed-ATX → open: strip a trailing run of spaces + # from ATX heading
    lines (^(#{1,6} .*?)\s+#+\s*$$1).

Add a unit test (mixed setext h2 + closed-ATX h4 + a GFM table → all headings
become open-ATX, table separators untouched). Rebuild
(node scripts/build-rust-converter.js) before the scan.

  • Step 1: Write the structural + content-loss scan

Create scripts/parity-scan.mjs:

#!/usr/bin/env node
/**
 * Corpus-wide parity scan: flags OBJECTIVE breakage in Rust-generated per-page
 * Markdown, and gross content loss vs the JS baseline. Not a byte-diff — it
 * surfaces a short outlier list to inspect by hand. Exit 1 if any page is
 * flagged so CI/the runner notices.
 *
 * Usage: node scripts/parity-scan.mjs [currentDir] [baselineDir]
 */
import { glob } from 'glob';
import fs from 'node:fs/promises';
import path from 'node:path';

const curDir = process.argv[2] || 'public';
const baseDir = process.argv[3] || '.parity-baseline';

function splitFrontmatter(md) {
  const m = md.match(/^---\n([\s\S]+?)\n---\n([\s\S]*)$/);
  return m ? { fm: m[1], body: m[2] } : null;
}

const files = await glob(`${curDir}/**/index.md`, {
  ignore: ['**/node_modules/**'],
});
const flags = [];
for (const file of files) {
  const rel = path.relative(curDir, file);
  const md = await fs.readFile(file, 'utf-8');
  const parts = splitFrontmatter(md);
  if (!parts) {
    flags.push(`NO_FRONTMATTER  ${rel}`);
    continue;
  }
  const { fm, body } = parts;
  if (!/\btitle:/.test(fm) || !/\burl:/.test(fm)) {
    flags.push(`MISSING_FIELD   ${rel}`);
  }
  if (body.trim().length === 0) {
    flags.push(`EMPTY_BODY      ${rel}`);
  }
  // NOTE: unbalanced-fence and raw-HTML-at-line-start were tried and rejected
  // as noisy — docs legitimately contain literal ``` (grammar/spec pages) and
  // HTML examples in code blocks, producing false positives. CONTENT_LOSS below
  // is the high-signal, low-noise truncation/regression detector.
  // Gross content loss vs baseline body length.
  try {
    const base = await fs.readFile(path.join(baseDir, rel), 'utf-8');
    const baseParts = splitFrontmatter(base);
    if (baseParts) {
      const b = baseParts.body.trim().length;
      const c = body.trim().length;
      if (b > 200 && c < b * 0.5) {
        flags.push(`CONTENT_LOSS    ${rel} (rust ${c} vs js ${b} chars)`);
      }
    }
  } catch {
    /* page not in baseline (new) — skip ratio check */
  }
}

for (const f of flags) console.log(f);
console.log(`\n${flags.length} flag(s) across ${files.length} pages.`);
process.exit(flags.length === 0 ? 0 : 1);
  • Step 2: Regenerate the corpus with Rust and run the scan
npx hugo --quiet --destination public
yarn build:md --public-dir public
node scripts/parity-scan.mjs public .parity-baseline | tee parity-report.txt

Expected: ideally 0 flag(s). Any flag is a short, concrete list to inspect — fix the cause in lib.rs (html_to_markdown, callout/table/code handling, or the strip-list) and re-run. CONTENT_LOSS and RAW_HTML are the high-signal ones.

  • Step 3: Diff the feature-coverage fixture set

Pick ~12 real pages from public/ that, together, exercise: fenced code blocks with language identifiers, GFM tables, all five GitHub callout types (note/warning/important/tip/caution), tabbed content, nested + ordered lists, inline links, and a reference page dense with these. Spread across 2–3 products (e.g. influxdb3/core, telegraf, influxdb/v2). Verify each page actually contains the feature before choosing it (grep the baseline .md).

For each, compare baseline vs Rust and confirm the only differences are the expected date/lastmod additions plus cosmetic whitespace/escaping:

# example — repeat per fixture page
diff <(sed '/^date:/d;/^lastmod:/d' .parity-baseline/influxdb3/core/get-started/index.md) \
     <(sed '/^date:/d;/^lastmod:/d' public/influxdb3/core/get-started/index.md)

Any semantic difference (lost content, broken/changed code fence, malformed table, dropped/altered link, missing callout) → fix in lib.rs and re-run Steps 2–3. Stop when the fixture set shows only cosmetic diffs and the scan is clean.

  • Step 4: Record accepted cosmetic diffs

Scan result (4,684 pages): 0 semantic regressions. Only 2 EMPTY_BODY
flags — influxdb/v2/index.md and influxdb/cloud/index.md — both pre-existing
(the JS baseline also produced 0 body chars; these landing pages have an
article--content of nav cards with no prose). Not regressions.

Accepted differences vs the JS baseline (for the PR description), all
semantically neutral:

  • Body h1 omitted (intentional decision; title in frontmatter).
  • Note/callout faithfulness — Rust renders authored > [!Note] callouts
    that JS flattened to paragraphs (improvement).
  • Hard line-wrapping — html2md wraps paragraph text at ~80 cols; the baseline
    kept one line per paragraph. Renders identically (single \n = space); links
    are not split.
  • List marker * vs - and list indent (* x vs - x). Both valid.
  • Heading style normalized to open-ATX — now matches the baseline (was
    setext/closed-ATX before the Step 0 fix).

Do not commit parity-report.txt or .parity-baseline/ (gitignored).

  • Step 5: Commit the scan tool + Rust fixes (fixes already committed:
    format-selector strip, h1-omit, heading normalization). Commit the scan tool.
git add scripts/parity-scan.mjs
git commit -m "test: add corpus parity scan (0 semantic regressions over 4,684 pages)"

Task 10: Golden-snapshot regression test

Files:

  • Create: scripts/__tests__/fixtures/get-started.html, scripts/__tests__/fixtures/get-started.expected.md

  • Create: scripts/__tests__/markdown-parity.test.mjs

  • Step 1: Capture a real page as the fixture input

mkdir -p scripts/__tests__/fixtures
cp public/influxdb3/core/get-started/index.html scripts/__tests__/fixtures/get-started.html
  • Step 2: Generate the expected output and eyeball it
node -e "const {convertToMarkdown}=require('./scripts/rust-markdown-converter'); const fs=require('fs'); fs.writeFileSync('scripts/__tests__/fixtures/get-started.expected.md', convertToMarkdown(fs.readFileSync('scripts/__tests__/fixtures/get-started.html','utf-8'), '/influxdb3/core/get-started/', 'https://docs.influxdata.com'));"
head -15 scripts/__tests__/fixtures/get-started.expected.md

Expected: frontmatter with exactly the six base fields (title, description, url, estimated_tokens, product, version) and a sane body. No publisher/canonical/date/lastmod — those are added later by the JS post-step, not the converter. Confirm by eye before locking it as the golden file.

  • Step 3: Write the golden test

Create scripts/__tests__/markdown-parity.test.mjs:

import { test } from 'node:test';
import assert from 'node:assert/strict';
import { readFileSync } from 'node:fs';
import { createRequire } from 'node:module';

const require = createRequire(import.meta.url);
const { convertToMarkdown } = require('../rust-markdown-converter');

const FIXTURE = new URL('./fixtures/get-started.html', import.meta.url);
const EXPECTED = new URL('./fixtures/get-started.expected.md', import.meta.url);

test('Rust converter output matches the golden snapshot', () => {
  const html = readFileSync(FIXTURE, 'utf-8');
  const out = convertToMarkdown(
    html,
    '/influxdb3/core/get-started/',
    'https://docs.influxdata.com'
  );
  assert.equal(out, readFileSync(EXPECTED, 'utf-8'));
});

test('converter emits the drop-in base frontmatter only', () => {
  const html = readFileSync(FIXTURE, 'utf-8');
  const out = convertToMarkdown(
    html,
    '/influxdb3/core/get-started/',
    'https://docs.influxdata.com'
  );
  assert.match(out, /\nversion: core\n/);
  assert.doesNotMatch(out, /product_version:/);
  // provenance + timestamps are JS post-steps, not the converter:
  assert.doesNotMatch(out, /publisher:/);
  assert.doesNotMatch(out, /canonical:/);
  assert.doesNotMatch(out, /date:/);
  assert.doesNotMatch(out, /lastmod:/);
});
  • Step 4: Add a test script and run it

In package.json scripts, add:

"test:markdown-parity": "node --test scripts/__tests__/markdown-parity.test.mjs",

Run: yarn test:markdown-parity
Expected: both tests PASS.

  • Step 5: Commit
git add scripts/__tests__/fixtures scripts/__tests__/markdown-parity.test.mjs package.json
git commit -m "test: lock the Rust converter frontmatter/body with a golden snapshot"

Task 10b: Truncation regression test

Tracking issue: #6792 — "Section markdown truncated in
clipboard."
The reported example is the section page
/influxdb3/enterprise/admin/last-value-cache/ ("Copy section for AI").

Diagnosis from planning (verified against the current build):

  • The generated index.md files are complete (no gross content loss vs source
    HTML).
  • The generated index.section.md is also complete — the LVC section body
    (19,402 chars) equals the sum of its parent + four child bodies (19,373). So
    the truncation is not in generation.
  • The clipboard path is assets/js/components/format-selector.ts:
    handleCopySectionfetch(index.section.md)clipboard.writeText. The
    code does no slicing, so Section markdown truncated in clipboard  #6792 is in the fetch/clipboard runtime path
    (e.g. how .section.md is served, or a clipboard limit) — a UI subsystem
    separate from the Rust converter migration, which does not touch section
    generation or format-selector.ts.

So this task ships two complementary guards: a build-time completeness test
(pages and sections — guards generation, and the html2md large-page risk) and
a Cypress clipboard test on the real #6792 page (guards the runtime path).
The Cypress test is what actually reproduces/locks #6792; fixing the underlying
clipboard/serving bug, if the test fails, is a follow-up outside this migration.

Files:

  • Create: scripts/__tests__/markdown-completeness.test.mjs

  • Modify: package.json (add test:markdown-completeness)

  • Step 1: Write the completeness test

Create scripts/__tests__/markdown-completeness.test.mjs:

import { test } from 'node:test';
import assert from 'node:assert/strict';
import fs from 'node:fs/promises';
import path from 'node:path';

const PUBLIC = process.env.PUBLIC_DIR || 'public';

async function* walk(dir) {
  for (const e of await fs.readdir(dir, { withFileTypes: true })) {
    if (e.name === 'node_modules') continue;
    const p = path.join(dir, e.name);
    if (e.isDirectory()) yield* walk(p);
    else if (e.name === 'index.md') yield p;
  }
}

function htmlArticleText(html) {
  const m = html.match(
    /<article[^>]*article--content[^>]*>([\s\S]*?)<\/article>/i
  );
  return (m ? m[1] : '')
    .replace(/<[^>]+>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
}

function mdBody(md) {
  const m = md.match(/^---\n[\s\S]+?\n---\n([\s\S]*)$/);
  return (m ? m[1] : md).trim();
}

test('generated .md pages are not truncated vs their source HTML', async () => {
  const truncated = [];
  let checked = 0;
  for await (const mdPath of walk(PUBLIC)) {
    const htmlPath = mdPath.replace(/index\.md$/, 'index.html');
    let html;
    try {
      html = await fs.readFile(htmlPath, 'utf-8');
    } catch {
      continue;
    }
    const text = htmlArticleText(html);
    if (text.length < 3000) continue; // only meaningful for substantial pages
    const body = mdBody(await fs.readFile(mdPath, 'utf-8'));
    checked++;
    // The .md body should retain most of the article's visible text. 0.5 is a
    // conservative floor (the baseline build had 0 pages below it); tighten
    // once a real truncated example sets the bar.
    if (body.length < text.length * 0.5) {
      truncated.push(
        `${path.relative(PUBLIC, mdPath)} (md ${body.length} vs html ${text.length})`
      );
    }
  }
  assert.ok(checked > 0, 'no pages checked — build public/ first');
  assert.equal(
    truncated.length,
    0,
    `Truncated pages:\n${truncated.join('\n')}`
  );
});

test('section bundles contain their child pages (not truncated)', async () => {
  // A section's body should be roughly the sum of its parent + child page
  // bodies. Far less means combineMarkdown dropped children. Guards #6792 at
  // the generation layer (the runtime clipboard path is covered by Cypress).
  const short = [];
  let checked = 0;
  async function* walkSections(dir) {
    for (const e of await fs.readdir(dir, { withFileTypes: true })) {
      if (e.name === 'node_modules') continue;
      const p = path.join(dir, e.name);
      if (e.isDirectory()) yield* walkSections(p);
      else if (e.name === 'index.section.md') yield p;
    }
  }
  for await (const secPath of walkSections(PUBLIC)) {
    const dir = path.dirname(secPath);
    const parts = [path.join(dir, 'index.md')];
    for (const e of await fs.readdir(dir, { withFileTypes: true })) {
      if (e.isDirectory()) parts.push(path.join(dir, e.name, 'index.md'));
    }
    let sum = 0;
    for (const p of parts) {
      try {
        sum += mdBody(await fs.readFile(p, 'utf-8')).length;
      } catch {
        /* missing child md — skip */
      }
    }
    if (sum < 3000) continue;
    const secBody = mdBody(await fs.readFile(secPath, 'utf-8'));
    checked++;
    if (secBody.length < sum * 0.6) {
      short.push(
        `${path.relative(PUBLIC, secPath)} (section ${secBody.length} vs parts ${sum})`
      );
    }
  }
  assert.ok(checked > 0, 'no section bundles checked — build public/ first');
  assert.equal(short.length, 0, `Truncated sections:\n${short.join('\n')}`);
});
  • Step 2: Run it against the current build

Run: PUBLIC_DIR=public node --test scripts/__tests__/markdown-completeness.test.mjs
Expected: PASS (the baseline build had 0 pages under the 0.5 ratio). A failure means real truncation — inspect the listed pages; the cause is either the converter (fix lib.rs) or the source content.

  • Step 3: Wire into package.json

Add to scripts: "test:markdown-completeness": "node --test scripts/__tests__/markdown-completeness.test.mjs",

This reproduces/locks the actual reported bug. On the #6792 example section page
/influxdb3/enterprise/admin/last-value-cache/, the "Copy section for AI"
option ([data-option="copy-section"]) fetches index.section.md and writes it
to the clipboard. Add a Cypress test that:

  1. visits the page (Hugo server running);
  2. grants clipboard permission and stubs/reads navigator.clipboard.writeText;
  3. opens the format selector, clicks [data-option="copy-section"];
  4. fetches /influxdb3/enterprise/admin/last-value-cache/index.section.md
    directly and asserts the clipboard text equals the fetched file (same
    length and content) — i.e. not truncated.

Put it under cypress/e2e/content/ next to the other content specs. If the test
fails (clipboard shorter than the file), that confirms #6792 is in the
runtime fetch/clipboard path; fixing it (e.g. how .section.md is served, or a
writeText size issue) is a follow-up tracked by #6792, separate from this
migration. If it passes, #6792 is already resolved by the current build and
the test prevents regression.

  • Step 5: Commit
git add scripts/__tests__/markdown-completeness.test.mjs package.json
git commit -m "test: guard against truncated markdown twins"

Task 11: Run the full acceptance suite

Files: none (verification)

  • Step 1: Full build with Rust
yarn build:ts
npx hugo --quiet --destination public
yarn build:md --public-dir public
yarn build:llms-full --public-dir public

Expected: all succeed; public/**/index.md, index.section.md, and llms-full.txt present.

  • Step 2: Coherence + corpus tests
yarn check:md-coherence --public-dir public
yarn test:build-llms-full
yarn test:corpus-paths
yarn test:build-md

Expected: all PASS. test:build-md runs the #7294 provenance/section tests plus the new date/lastmod stamping tests (Task 8b).

  • Step 3: Cypress markdown validation

Test-contract updates required (do these, they are not "weakening to
pass"):
markdown-content-validation.cy.js predates #7294 and this
migration. Two assertions encode the OLD contract and must be updated to the
shipped one:

  • :121 and :433 assert product_version: — the converter now emits
    version: (matches the post-feat(jsonld): carry InfluxData provenance into LLM-native artifacts (#7290) #7294 JS baseline). Update both to version:.
  • :130-131 assert date:/lastmod: — now added by the JS provenance
    post-step (Task 8b), so they pass once 8b is in. Keep them.
  • :240 asserts /^# /m (a body h1) — the body h1 is now intentionally
    omitted
    (title in frontmatter; matches the API-ref twins). Update to assert
    a section heading instead, e.g. /^## /m. This is a contract change, not
    test-weakening (see Task 9 note 2).

The rest is a real Rust converter requirement — satisfy by fixing lib.rs
(Task 9), not by editing the test: no Copy page :209/588/585-595 (FIXED by
the format-selector strip), callouts as > [!Note] :455 (Rust already
renders these). The :599-604 no-Discord check runs on the leaf page,
which has no support section in either converter — it already passes.

node cypress/support/run-e2e-specs.js --spec "cypress/e2e/content/markdown-content-validation.cy.js"
node cypress/support/run-e2e-specs.js --spec "cypress/e2e/content/markdown-autodiscovery.cy.js"

Expected: both specs PASS (no raw shortcodes, valid frontmatter, callouts/tables/links intact, autodiscovery coherent).

  • Step 4: Markdown-parity + Rust unit tests
yarn test:markdown-parity
PUBLIC_DIR=public yarn test:markdown-completeness
cd scripts/rust-markdown-converter && cargo test && cd -

Expected: all PASS. test:markdown-completeness guards against truncated twins
(Task 10b).

  • Step 5: If anything fails, fix at the source and re-run

Any Cypress failure about frontmatter or stripped UI is a Rust converter fix (lib.rs), then re-run Steps 1–4. Do not weaken the tests to pass.


Task 12: Update documentation

Files:

  • Modify: DOCS-TESTING.md (frontmatter schema + architecture sections)

  • Rewrite: scripts/README.md — currently documents only the deleted
    html-to-markdown.js (purpose, turndown config, --path/--limit/--verbose
    usage). Replace with the current architecture: build-llm-markdown.js
    (Phase 1 = Rust per-page conversion, Phase 2 = JS section bundling), the Rust
    napi converter, and the surviving --path/--limit flags. Remove all
    turndown/legacy references.

  • Step 1: Update the architecture diagram and "Related Files"

In DOCS-TESTING.md, replace references to scripts/lib/markdown-converter.js/.cjs as the conversion engine with scripts/rust-markdown-converter/ (Rust + napi). State that build-llm-markdown.js Phase 1 calls the Rust module and Phase 2 (combineMarkdown) builds section bundles in JS. Remove the legacy html-to-markdown.js / build:md:legacy mentions.

  • Step 2: Align the documented frontmatter schema to the Rust contract

Update the frontmatter example fields to the actual shipped set/order: the converter emits title, description, url, estimated_tokens, product, version; the JS provenance post-step then adds publisher, canonical, date, lastmod. For sections: type, pages (= parent + children), child_pages, plus provenance. Note which fields come from the converter vs. the post-step.

  • Step 3: Lint the docs change

Run: .ci/vale/vale.sh DOCS-TESTING.md
Expected: no new errors.

  • Step 4: Commit
git add DOCS-TESTING.md
git commit -m "docs: document the Rust converter and its frontmatter contract"

Spec coverage map

Spec section Task(s)
Wire Rust into CI (§1) 2, 3
No silent fallback (§2) 7 (load guard)
Delete JS + legacy + deps (§3) 4, 8
Pin frontmatter contract (§4) 5, 10, 12
Golden-snapshot test (§5) 10
Migration parity gate (§6) 1, 9
Truncation regression test 10b
Acceptance gate (§7) 11
Rust = exact drop-in (6 base fields) 5, 10
publisher/canonical (JS post-step) (untouched #7294; in 8b/11)
date/lastmod (JS post-step, sitemap) 8b
base_url resolution 6, 7
Section bundling stays in JS (untouched; verified in 11)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dependencies Pull requests that update a dependency file javascript Pull requests that update javascript code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant