release 0.1.3: validate warns-not-raises; schema_hash prefix match#12

Merged
mprammer merged 1 commit into develop from mp/release-0.1.3
May 10, 2026

@mprammer
Contributor

Summary

Two paired changes to the validate stage so rebuilds aren't bricked by benign upstream drift, while keeping CI-grade strictness one flag away.

1. expect.schema_hash is now prefix-matched

All 37 slugs in sources.json with expect.schema_hash set use the 12-char short form (matching the [validate] schema_hash= print convention, the same idea as git short SHAs). The previous full-string equality made every one of them fail validation on rebuild, because _schema_hash returns 64 hex chars. validate.py now uses strict equality when expected and actual are the same length, and prefix match when expected is shorter.
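
A minimal sketch of the comparison rule described above, not the actual code in validate.py. The function name `schema_hash_matches` is hypothetical, and rejecting an expected value that is longer than the actual hash is an assumption the description does not spell out.

```python
def schema_hash_matches(expected: str, actual: str) -> bool:
    """Compare a manifest hash against a computed hash (illustrative only).

    expected: value from expect.schema_hash (short or full form).
    actual:   full hex digest computed for the built file.
    """
    if len(expected) == len(actual):
        # Same length: strict equality, so full hashes stay enforceable.
        return expected == actual
    if len(expected) < len(actual):
        # Shorter manifest value: treat it as a prefix, like a git short SHA.
        return actual.startswith(expected)
    # Assumption: an expected value longer than the digest can never match.
    return False
```

With a 64-char digest, a 12-char manifest value passes only if it equals the first 12 characters; under the old full-string equality the same pair could never compare equal.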

2. Validate-stage drift warns by default; --strict to fail

A row count or schema hash mismatch now emits a [WARN] line to stderr and the build continues. Users invoking python -m scripts.pipeline.build <slug> have already opted into "fetch whatever is upstream now"; an HF Arrow-conversion bump or a grow-only row count change shouldn't turn that into a failed build.

The new --strict flag on scripts.pipeline.build upgrades those warnings to hard errors, which is recommended for CI / pre-release gates. The previous --loose flag (which was the inverse) is removed; its behaviour is now the default.
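
The warn-by-default behaviour could look roughly like the sketch below. It is an illustration under stated assumptions rather than the code in this PR: `report_drift`, `check_rows`, and `build_parser` are hypothetical names, and raising AssertionError in strict mode is assumed from the previous behaviour.

```python
import argparse
import sys


def report_drift(message: str, strict: bool = False) -> None:
    """Warn on stderr by default; raise when strict is requested."""
    if strict:
        # Assumption: strict mode raises AssertionError, as the old default did.
        raise AssertionError(message)
    print(f"[WARN] {message}", file=sys.stderr)


def check_rows(expected: int, actual: int, strict: bool = False) -> bool:
    """Return True when row counts agree; otherwise report drift."""
    if expected == actual:
        return True
    report_drift(f"rows: expected {expected}, got {actual}", strict=strict)
    return False


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI surface: a slug plus the opt-in --strict flag."""
    parser = argparse.ArgumentParser(prog="scripts.pipeline.build")
    parser.add_argument("slug")
    parser.add_argument(
        "--strict",
        action="store_true",
        help="treat validate-stage drift as a hard error",
    )
    return parser
```

Because `--strict` uses `store_true`, omitting the flag gives `strict=False`, which matches the new default of warning and continuing.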

How this surfaced

The wikipedia-en rebuild produced a parquet that matched expect.rows exactly (6,407,814) and matched the 12-char schema_hash prefix exactly, yet failed validate because full-string equality between 12 and 64 chars can never hold. Diagnosing that revealed the convention/code mismatch affected all 37 slugs with schema_hash set.

Files

  • scripts/pipeline/validate.py: prefix-match helper, warn-vs-strict semantics, strict=False default.
  • scripts/pipeline/build.py: --loose → --strict flag flip, updated docstring + example.
  • sources.schema.md: documents both the prefix-match rule and the new warn-default semantics for the expect block.
  • pyproject.toml + CHANGELOG.md: version bump to 0.1.3.

🤖 Generated with Claude Code

…h prefix match

Two paired changes to make rebuilds resilient against benign upstream
drift while keeping CI-grade strictness one flag away:

1. expect.schema_hash now matches as a prefix when the manifest value is
   shorter than the full 64-char SHA-256. All 37 slugs with schema_hash
   set use the 12-char short form (matching the [validate] print line
   convention); the previous full-string equality silently broke every
   one of them on rebuild. Equal-length values still compare strictly,
   so full hashes remain enforceable for callers that prefer them.

2. The validate stage now treats row/schema_hash drift as a [WARN] line
   on stderr and continues by default. The previous behaviour (raise
   AssertionError) turned every HF Arrow-conversion bump or grow-only
   row change into a failed build for users who had already opted into
   "fetch whatever is upstream now". The new --strict flag on
   scripts.pipeline.build re-enables hard failures for CI / pre-release
   gates. The old --loose flag is removed (its behaviour is now the
   default).

Discovered via the wikipedia-en rebuild, which produced a parquet that
matched expect.rows exactly and matched the 12-char schema_hash prefix
exactly, yet failed validate because full-string equality could never
hold between 12 and 64 chars.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
@mprammer mprammer merged commit b6a3e34 into develop May 10, 2026
5 checks passed