Skip to content

GravityKit MCP 2.4.0: field-model correctness pass#6

Merged
zackkatz merged 37 commits into
mainfrom
develop
Jun 17, 2026
Merged

GravityKit MCP 2.4.0: field-model correctness pass#6
zackkatz merged 37 commits into
mainfrom
develop

Conversation

@zackkatz

@zackkatz zackkatz commented Jun 17, 2026

Copy link
Copy Markdown
Member

Brings main up to 2.4.0 with the 28 commits from this development cycle.

Summary

Gravity Forms entries created or read through the MCP now use the correct storage shape for every field type, so an AI assistant gets the entry format right the first time. Adds support for the password field type (now 46 field types), plus dev-only benchmark suites that grade the whole tool surface with a small model before each release.

Highlights

🚀 Added

  • password field type in the registry (now 46 field types, up from 45).
  • Expanded entry_input hints in gf_list_field_types summary mode for every field type whose entry format is not an obvious plain string.
  • Nested Form (form) field config in the registry and gf_list_field_types --detail (GP Nested Forms gpnfForm + Summary Fields).

✨ Improved

  • gf_create_form auto-assigns field ids when omitted (GF max+1), so callers can describe fields without hand-numbering them.
  • gf_delete_form / gf_delete_entry descriptions now state the safe Trash default explicitly.
  • Server instructions point the search-bar flow at the one-call gv_search_bar_add route.
  • One-shot clients can wait for the full tool catalog via GRAVITYKIT_MCP_LIST_TIMEOUT_MS.

🐛 Fixed

  • Corrected entry-storage shapes to match real Gravity Forms (number/quantity, signature, creditcard, post_content, post_image, fileupload, product, form).
  • chainedselect sub-inputs are now generated (one per dropdown level).
  • Corrected entry_input hints for address, chainedselect, survey_rank, date, and the survey/quiz/poll tokens.
  • Logger no longer corrupts the JSON-RPC transport (NODE_ENV=development was wrongly routed to stdout).
  • Ability /run calls always send an object input (WP Abilities API rejects null with a 400).
  • sorting.is_numeric is interpreted strictly (a JSON string "false" no longer forces numeric ordering).
  • Stdio server now exits on client disconnect (stdin EOF), eliminating orphaned node src/index.js processes.

💻 Developer Notes

  • Dev-only benchmark suites (not shipped in the npm package): npm run bench, bench:field-output, bench:field-storage, bench:nested-forms.

Full detail in CHANGELOG.md.

Summary by CodeRabbit

  • New Features

    • Expanded support to 46 Gravity Forms field types (including password) with richer gf_list_field_types metadata and entry_input hints
    • Added developer “AI release gate” benchmark suite for form, entry, and view correctness checks
    • Improved nested form handling and registry configuration for deeper gf_list_field_types --detail output
  • Bug Fixes

    • Improved form creation when field IDs are missing; fixed complex field storage and compound/chained select generation
    • Improved server stability (stdio shutdown, logging stream safety, and safer ability list timing)
  • Documentation

    • Updated tool docs and security notes (delete-to-Trash defaults, permanent delete requires force=true and ALLOW_DELETE=true)
    • Expanded benchmark setup/gating guidance

zackkatz added 28 commits June 17, 2026 09:41
The MCP server speaks JSON-RPC over stdout, so any log byte there
corrupts the transport and breaks the handshake (client sees a
malformed message; server hangs with 0 tools). The old detection
(`!NODE_ENV || NODE_ENV === 'production'`) was inverted: it routed the
common `NODE_ENV=development` to stdout and silently broke the handshake.

Resolve the mode per call (not memoized at import) so it always reflects
the current env and stays unit-testable: errors and all server-mode logs
go to stderr; only an explicit test context (NODE_ENV=test or the
GRAVITYKIT_MCP_TEST_MODE flags) may use stdout. Add a child-process test
that asserts stdout stays clean for development/unset/production and that
test mode logs to stdout.
Every Foundation ability declares an object `input_schema`, and the WP
abilities-api validates the `input` arg against `type:object` — so a
missing/null arg 400s ("input is not of type object"). Make the loader
always send an object input, never null/omitted:

- GET/DELETE empty input → `input=''` (WP's rest_is_object('') === true,
  the wire form of an empty object on a query string).
- POST empty input → body `{ input: {} }` (a real empty object).

Non-empty input is unchanged (bracketed query params / nested body).
Extract a `hasInputKeys` check and document the method-per-annotation
invariant. Add executeAbility request-shape tests across GET/POST and
both schema'd and schemaless abilities (the loader does not inspect the
schema; the server is the authority on what an ability accepts).
tools/list waits a default 2s for the abilities catalog before shipping;
gv_* tools that arrive later stream in via tools/list_changed. A one-shot
client (e.g. `claude -p`) reads tools/list once and never sees the
list_changed update, so it boots without gv_* if the catalog is slower
than the wait.

Add resolveAbilitiesListTimeoutMs(env) — reads GRAVITYKIT_MCP_LIST_TIMEOUT_MS,
falling back to 2000 on non-positive/non-numeric values — and wire it into
the ListTools handler so such clients can raise the first-list wait to
return the complete catalog. Covered by server-runtime unit tests.
gf_delete_form / gf_delete_entry default to Trash (recoverable), but the
terse "(vs trash)" description led a small model to pause and ask the user
trash-vs-permanent on every delete. Rewrite both tool descriptions and the
shared `force` param to state the default explicitly and instruct the model
to proceed with Trash unless the user explicitly asks to permanently delete.
…ions

The server `instructions` string described only the low-level
gv_search_field_* tools for building a search bar. Add the one-call
gv_search_bar_add path (View id + fields[] of {field_id, input?}) as the
preferred route, reserving the surgical gv_search_field_* tools for slot
edits after gv_view_config_get. Prose only — no behavior change.
The summary-mode input hints covered only a handful of types and were
imprecise on choices. Extend and correct them for the formats a small model
is most likely to populate wrong:

- choice fields (select/radio/checkbox/multiselect): store the choice VALUE
  when set or "Show Values"/enableChoiceValue is on, else the choice LABEL.
- time: a combined "HH:MM am/pm" string, NOT .1/.2/.3 sub-inputs.
- add compound (date/fileupload/signature), pricing (product/option/
  quantity/shipping/total, "Label|amount" encoding), survey/quiz/poll add-on
  choice codes, and repeater/form nested shapes.

Plain string fields (text/email/number/phone/website/textarea/hidden) stay
omitted on purpose. Hints mirror the field-registry storage model.
A stdio MCP server must exit when its stdin reaches EOF (client
disconnects or is killed). Without that, every crashed/SIGKILL'd client —
including each benchmark run — leaves an orphaned `node src/index.js`
running forever; the orphans accumulate and starve the next server's
startup, surfacing as agents booting with 0 MCP tools.

Add a child-process test that boots the server, closes its stdin, and
asserts it exits within 4s. Wire both new node:test suites
(logger-stdout, server-lifecycle) into the test:node script.
A behavioral release gate (not a unit test): it drives the MCP with the
smallest current model (claude-haiku-4-5) through realistic end-to-end
tasks across every surface and CRUD verb (discovery/forms/entries/
authoring/views/fields/widgets/search/grid), grading the actual
GravityView/Gravity Forms state — never the agent's self-report. Each
task runs BENCH_RUNS times; the gate passes only if every task's success
rate clears the threshold. The premise: a well-designed tool surface
shouldn't need a frontier model — a small model failing means the
descriptions/schemas/errors are the fix, not the model.

Includes the runner (run.mjs), config, task suites (tasks/), and the
harness lib (siteminter mint/teardown, the claude-CLI agent driver, the
target client, scoring, reporting). The agent driver pins the CLI via a
CLAUDE_BIN override so npm-run's node_modules/.bin PATH prepend can't
shadow the real `claude`. Adds the `bench` script and documents the gate
in AGENTS.md. Dev-only — not in the package files allowlist; reports land
in the gitignored bench/reports/.
A separate, NON-AI suite that renders every displayable Gravity Forms
field type's View-display cell through the gk-gravityview/view-field-render
ability (staged_slot — nothing persisted) against one seeded entry, then
asserts the output is error-free and, for content-pinnable types, contains
the expected display value. Output testing only: no search, CRUD, agent,
or scoring. Reuses the bench target plumbing (siteminter / resolveTarget /
makeClient) but talks REST directly. Adds the `bench:field-output` script.
Exit 0 all pass / 1 a field failed to render / 2 harness error.
The registry's storage blocks disagreed with how GF actually persists these
values, so gf_list_field_types --detail reported the wrong shape to callers.
Validated live against GF 2.10.3 + the real add-ons (entry round-trips) and
against GF source.

- number/quantity: storage.type number → string. GF stores numbers as TEXT
  in the entry; numeric is the field's purpose, not its storage shape
  (class-gf-field-number.php get_value_submission).
- signature: storage.format base64 → filename. The Signature add-on saves the
  drawn image to disk and stores its filename ("<hash>.png"); the URL is
  derived at display time (class-gf-field-signature.php get_value_save_entry
  → maybe_save_signature).
A chained-select field is compound, so addField asked generateSubInputs for its
inputs — but there was no chainedselect branch, so it got an empty inputs array
and none of its per-level dot-notation sub-inputs (fieldId.N) were created.

Add a branch that emits one sub-input per configured level, mirroring the GF
Chained Selects add-on (class-gf-field-chainedselect.php): ids count
1,2,…,9,11,12,… SKIPPING multiples of 10 (the .10/.20 slots are reserved), each
labelled by its level; a field with no configured levels defaults to two
(Parents/Children) per get_default_inputs(). TDD: failing tests first, including
the .10-skip edge. Confirmed live — the real add-on accepted the generated
inputs and the entry stored 1.1/1.2 per level.
These hints steer a small model toward the right entry shape; three were wrong
or incomplete.

- address: was hardcoded to field id 2 and omitted .2 (Line 2) and .6 (Country).
  Now lists all six sub-inputs N.1–N.6 with N = field id.
- chainedselect: now notes the sub-input count is dynamic (one per level).
- survey_rank: clarified — all choice values in ranked order, comma-separated,
  is CORRECT (the audit's "single value" claim was wrong; class-gf-field-rank.php
  get_ordered_choices explode(',')). Values are comma-free "gsurvey<id><hex>"
  tokens, so the delimiter is unambiguous (GenerateSurveyDefaultChoiceValue).
A deterministic suite (npm run bench:field-storage --mint) that builds a form
per fixed field, seeds an entry over GF REST, reads it back, and asserts the
STORED shape matches the registry — closing the loop the AI gate and the
field-output suite don't cover (output ≠ storage).

The add-on-gated types (chainedselect, signature, survey_rank) run against a
mint with the ACTUAL Chained Selects / Signature / Survey plugins symlinked +
activated; a preflight prints each active plugin's version and skips (never
fakes) a case whose add-on isn't really loaded, so a green run can't be a stub.

- config: add SITEMINTER.addons (the three real add-on paths, env-overridable).
- siteminter: provisionSite takes name/plugins overrides; add activePlugins().
An adversarial audit of all 45 field types against Gravity Forms + add-on
source (each finding refuted, then re-confirmed by a live entry round-trip on
GF 2.10.3) surfaced field-model entries that disagreed with how GF actually
stores values. A small model trusting them would write the wrong entry shape.

- creditcard: storage reflects the two persisted sub-inputs — .1 (masked
  number) and .4 (card type); generateSubInputs labels .4 Card Type / .5
  Cardholder Name (were swapped). .2/.3/.5 are not persisted.
- post_content: real GF type is 'post_content' (was the non-existent
  'post_body', which gf_add_field could never resolve).
- post_image: stored as one "url|:|title|:|caption|:|description|:|alt"
  composite, not a bare URL; added an entry_input hint.
- fileupload: JSON array when storageType is 'json' OR multipleFiles (not
  multipleFiles alone).
- product 'price' (User Defined): single money string, not dot-notation.
- form (Nested Forms): comma-separated string of child entry ids, not a JSON array.
- password: added as its own field type (not a text variant); GF does not
  persist it (stored value is empty).
- date / survey_likert / survey_rating / quiz / poll / survey: corrected the
  entry_input hints (date is always ISO; add-on choice values are
  "g<kind><fieldId><hex>" tokens; multi-row likert keys by sub-input id).
- Trimmed hint/comment phrasing to state the correct shape, not the old bug.
Grows the storage suite from the 6 fixed fields to every storable field type
plus variant-aware cases (fileupload single/JSON, select value/label-as-value,
product singleproduct/user-defined-price, name simple/advanced, list
single/multi-column) and the layout/no-value types. 47 run cases + 1 skipped
(nested forms, add-on not on the mint). Cases live in field-storage-cases.mjs.

Compound fields are built through the MCP's own FieldManager.generateSubInputs,
so the round-trip proves the structure our code emits is one GF accepts. Add-on
cases run against the REAL Chained Selects / Signature / Survey / Polls / Quiz
plugins (now minted); a case whose add-on isn't active is skipped, never faked.
… add-on

The form (Nested Form) case was the one skip — its add-on wasn't on the mint.
GP Nested Forms is in the plugins dir (gp-nested-forms, field type 'form',
standalone), so wire it into the addon mint set and run it for real.

Adds setup/teardown hooks to the runner so a case can build prerequisites: the
form case creates a child form + two real child entries, points the field at it
via gpnfForm, stores their ids, and asserts the comma-separated string round-
trips. All 48 cases now run (0 skipped) — stored child entry ids "42,43".
Field-model correctness pass (entry-storage shapes + entry_input hints validated against real GF 2.10.3 + add-ons), new password field type (registry now 46), and dev-only bench suites. See CHANGELOG for the full 2.4.0 entry.
…i skip]

- Note all 46 field types + entry-storage/entry_input validated against real GF + add-ons
- gf_delete_form/entry: document Trash-by-default, force=true to delete permanently
- gf_list_field_types: 46 types, summary/detail modes
- GravityView: mention one-call gv_search_bar_add
- Add dev-only bench / bench:field-output / bench:field-storage commands
- Security: replace the inaccurate 'Rate Limiting: automatic retry/backoff' claim (no such logic in source) with credential masking + Trash-default note
The `form` (Nested Form) field type now describes its GP Nested Forms config so
callers know how to set it up, and gf_list_field_types --detail surfaces it:

- gpnfForm — the child form id whose entries are nested.
- gpnfFields — "Summary Fields": which child-form field ids show in the nested
  summary table (a directory View lists child entry ids; the single-entry view
  renders these fields as columns).
- requiresAddon — the field comes from GP Nested Forms (needs the Spellbook
  framework, formerly Gravity Perks).

--detail output now includes `requiresAddon` + `settings` (undefined for other
types, stripped by compact).
npm run bench:nested-forms mints a site with the real GP Nested Forms + Spellbook
(and GF/GravityView), builds a parent form with a nested-form field, child form +
entries, a published View embedding it, then fetches the FRONT-END and asserts:

- directory (Multiple Entries) shows the child entry ids;
- single entry renders the nested child table with the chosen Summary Fields
  (Name + Amount) and EXCLUDES a non-summary field (Notes), proving gpnfFields
  controls which child fields display.

Supporting bench changes: config adds the Spellbook + nested-forms add-ons to
the mint; createView takes an optional status (publish, so the View renders for
anonymous front-end requests); siteminter exports wpCli (set GPNF parent meta /
clear _gpnf_expiration / approve via gform_update_meta, NOT a value-wiping
REST PUT). Fixtures linkage mirrors a real GPNF submission.
Companion to the scored gate: hands the small model ONE plain-English goal
(build a job-applications form + entries + a GravityView View + a role-aware
search bar) and only the MCP under test, then prints the transcript of which
tools it CHOSE to call and the report it wrote. Nothing is hard-coded to
specific tools — the model discovers and sequences the surface itself, which is
the real "is this usable via natural language?" test.

Reuses the existing agent harness (runAgent) + writeMcpConfig; mints with the
full add-on set.
The most common first-try failure when a natural-language agent creates a form:
it describes the fields but doesn't hand-number them, and validation rejects
"Field must have an id". gf_create_form now fills in missing field ids using
GF's max+1 convention BEFORE validation, preserving any explicit ids and
re-basing provided compound sub-input ids onto the assigned id (e.g. "2.3" →
"10.3"). Surfaced by bench:demo, where the small model hit this and had to
retry; now it's one-shot.

TDD: assignFieldIds unit tests (sequential, gap-fill-without-collision,
sub-input re-base, passthrough) added to field-registry.test.js.
`node bench/demo.mjs --runs N` loops the natural-language goal N times against
the same site and aggregates the signal that matters for a surface change:
how often gf_create_form is one-shot, the form gets created, and the run is
error-free. Single-run still prints the full transcript + agent report.

Used to confirm the gf_create_form auto-id fix: 4/4 one-shot, 0 tool errors.
…(soft)

Replaces the single global turn knob with per-task budgets:

- maxTurns: a HARD per-task ceiling threaded through runAgent → runOnce →
  --max-turns, falling back to the global CONFIG.maxTurns. A task hitting it is
  killed mid-run and grades as incomplete, so it's a right-sized runaway/fail-
  fast backstop — keep it generous (~2-3x the realistic worst case).
- expectedTurns: a SOFT efficiency budget. score.mjs now reports medianTurns and
  flags turnsOverBudget when median > expectedTurns x tolerance (default 2, env
  BENCH_TURNS_TOLERANCE). It is REPORTED, NOT gated — turn counts are too
  stochastic to hard-fail on (a correct run that verifies its own work spikes
  turns). The report shows median/expected with a ⚠ and a soft over-budget line;
  decideGate stays correctness-only.
Gives all 32 bench tasks a soft efficiency budget (expectedTurns) and a hard
ceiling (maxTurns), sized to each task's realistic complexity — simple
discovery/read tasks ~2 turns / cap 8, single-action CRUD ~3 / 8-10, multi-step
authoring/search ~4 / 12. Initial estimates; the report prints actual median
turns so they can be tuned from telemetry.
@coderabbitai

coderabbitai Bot commented Jun 17, 2026

Copy link
Copy Markdown

Review Change Stack

Walkthrough

Bumps the package to v2.4.0, correcting field-registry storage metadata for 9+ field types, fixing the abilities loader empty-input request shape, routing all server logs to stderr, adding a configurable abilities-list timeout, and auto-assigning field IDs in createForm. Introduces a complete bench/ AI release gate benchmark suite with infrastructure, 30+ graded task definitions, and deterministic field-storage/output/nested-forms harnesses.

Changes

MCP Server v2.4.0 Improvements

Layer / File(s) Summary
Field registry corrections and assignFieldIds
src/field-definitions/field-registry.js, src/field-operations/field-manager.js, test/field-registry.test.js, test/field-manager.test.js
Adds password field definition; corrects storage types for number, quantity, fileupload, post_content, post_image, product, creditcard, signature, and nested-form fields; swaps creditcard sub-input .4/.5 labels; adds chainedselect sub-input generation; adds exported assignFieldIds with full test coverage.
gf_list_field_types expansion and createForm auto-assign IDs
src/field-operations/index.js, src/gravity-forms-client.js
Expands detail=true to return validation/requiresAddon/settings; replaces the entry_input hint map with a comprehensive 40+ field-type catalog; wires assignFieldIds into createForm before validation.
Abilities loader empty-input fix and tests
src/abilities/loader.js, test/abilities-loader.test.js
Adds hasInputKeys so GET/DELETE sends input: '' for empty input and input[...] params for non-empty; POST always sends { input: {} } or { input }; tests cover schema'd and schemaless abilities with edge cases.
Logger stderr routing, configurable timeout, server schema updates
src/utils/logger.js, src/server-runtime.js, src/index.js, test/logger-stdout.test.js, test/server-lifecycle.test.js, test/server-runtime.test.js
Replaces isMCPMode with isTestContext() so server-mode logs go to stderr; adds resolveAbilitiesListTimeoutMs reading GRAVITYKIT_MCP_LIST_TIMEOUT_MS; updates instructions string and gf_delete_form/gf_delete_entry force descriptions; lifecycle test asserts stdin-EOF exits the server.
Documentation
README.md, CHANGELOG.md, AGENTS.md
Updates field counts to 46, adds v2.4.0 changelog section, reorganizes earlier changelog entries, clarifies delete/field-type tool docs, references npm run bench in release checklist.

AI Release Gate Benchmark Suite

Layer / File(s) Summary
Bench infrastructure
bench/config.mjs, bench/lib/target.mjs, bench/lib/siteminter.mjs, bench/lib/agent.mjs
Defines all env-driven config constants and resolveTarget; writeMcpConfig/makeClient for ephemeral MCP configs and REST grader clients; siteminter lifecycle for disposable WordPress sites; headless claude CLI runner with stream-json telemetry, retry, and timeout.
Scoring and reporting
bench/lib/score.mjs, bench/lib/report.mjs
Pure-function scoreRun/aggregateTask/decideGate/delta for per-run and per-task metrics, flakiness, turn-budget overage, and before/after comparison; report() writes console output and JSON artifact.
Task helper utilities
bench/tasks/helpers.mjs
Provides uniqueLabel, telemetry predicates (noToolErrors, calledOk, sawErrorCode), GravityView tree traversal, and search-bar field matching helpers.
Task definitions
bench/tasks/discovery.mjs, bench/tasks/forms.mjs, bench/tasks/entries.mjs, bench/tasks/entries-crud.mjs, bench/tasks/authoring.mjs, bench/tasks/views.mjs, bench/tasks/view-fields.mjs, bench/tasks/view-widgets.mjs, bench/tasks/search.mjs, bench/tasks/grid.mjs, bench/tasks/index.mjs
30+ graded tasks across discovery, GF forms, entries CRUD, GravityView authoring, view settings/status/clone/delete, view-field add/label/reorder/remove, widget add/remove, search-bar configuration, and grid layout; all wired into TASKS registry.
Gate runner and npm scripts
bench/run.mjs, package.json
Iterates tasks with configurable run counts, writes partial.json checkpoints, computes gate decision, reports, destroys/retains minted site, exits 0/1/2; registers bench, bench:field-output, bench:field-storage, bench:nested-forms, bench:demo as npm scripts.
Deterministic harnesses
bench/field-storage-cases.mjs, bench/field-storage.mjs, bench/field-output-cases.mjs, bench/field-output.mjs, bench/nested-forms.mjs, bench/demo.mjs
Field-storage harness asserts stored entry shape for all 46 GF field types; field-output harness renders each via gk-gravityview/view-field-render and asserts HTML; nested-forms E2E validates GravityView + GP Nested Forms front-end rendering; demo runs a natural-language agent goal with aggregate metrics.
Bench documentation
bench/README.md, test/AGENTS.md
Documents the release gate contract, setup modes, task taxonomy, adding tasks, and cost/delta reporting; test/AGENTS.md documents test conventions and benchmark environment setup.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

  • GravityKit/MCP#5: Directly related — this PR modifies the same executeAbility() request wiring in src/abilities/loader.js and extends test/abilities-loader.test.js to cover the empty-input behavior introduced or adjusted there.
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 68.97% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'GravityKit MCP 2.4.0: field-model correctness pass' accurately summarizes the main objective of the changeset—a version bump to 2.4.0 with corrections to field storage models and expanded field-type support.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch develop

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 19

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/abilities/loader.js (1)

443-463: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use flattened params (not top-level keys) to decide “empty input” for GET/DELETE.

Line 443 can classify inputs as non-empty even when they serialize to no query params (e.g. { nested: {} }, { a: null }). In that case, input is omitted and WP may reject with type:object validation errors instead of treating it as {}.

Suggested fix
-  const hasInputKeys = !!input && Object.keys(input).length > 0;
+  const hasInputKeys = !!input && typeof input === 'object' && Object.keys(input).length > 0;

   if (method === 'GET' || method === 'DELETE') {
@@
-    if (hasInputKeys) {
-      const params = {};
-      walkInputToBracketedParams(input, 'input', params);
-      config.params = params;
-    } else {
-      // Empty input. Send `input=` (empty string): WP's
-      // rest_is_object('') === true, so it validates as an empty object —
-      // whereas omitting `input` entirely sends `null`, which fails the
-      // type:object check with a 400.
-      config.params = { input: '' };
-    }
+    const params = {};
+    if (hasInputKeys) {
+      walkInputToBracketedParams(input, 'input', params);
+    }
+    // If flattening yields nothing, preserve the empty-object wire contract.
+    config.params = Object.keys(params).length > 0 ? params : { input: '' };
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/abilities/loader.js` around lines 443 - 463, The hasInputKeys check on
line 443 doesn't account for inputs with keys that serialize to empty params
(e.g., { nested: {} }, { a: null }). Instead of checking hasInputKeys before
calling walkInputToBracketedParams, always call walkInputToBracketedParams first
to flatten the input into params, then check if the resulting params object is
empty. If it's empty, set config.params = { input: '' } to ensure WordPress
properly validates it as an empty object instead of receiving null.
🧹 Nitpick comments (8)
README.md (1)

295-307: 💤 Low value

Document npm run bench:nested-forms in the benchmark section.

The README lists three benchmark commands (bench, bench:field-output, bench:field-storage) but CHANGELOG.md line 34 documents a fourth dev-only benchmark, npm run bench:nested-forms, for testing front-end nested forms rendering. For completeness, add it to the benchmark section:

- `npm run bench:nested-forms`: a front-end render test confirming a parent View shows a nested form's Summary Fields.

All four are marked dev-only (not shipped in npm), so they belong together in the documentation.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@README.md` around lines 295 - 307, The benchmark section in README.md
documents three dev-only benchmark commands (bench, bench:field-output,
bench:field-storage) but is missing the fourth command npm run
bench:nested-forms, which is already documented in CHANGELOG.md. Add a fourth
bullet point to the benchmark section that describes npm run bench:nested-forms
as a front-end render test confirming that a parent View shows a nested form's
Summary Fields, placing it alongside the other dev-only benchmark commands to
keep all benchmark documentation together.
CHANGELOG.md (1)

26-26: 💤 Low value

Adverb repetition: rephrase line 26 for clarity.

"Sorting only is now valid only when..." repeats "only" awkwardly. Rephrase to one of:

  • "Sorting is now valid only when it genuinely means true"
  • "Sorting is now interpreted strictly — valid only when it genuinely means true"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@CHANGELOG.md` at line 26, In the bullet point about `sorting.is_numeric`, the
phrase "Sorting only is now valid only when..." contains an awkward repetition
of the word "only" that makes the sentence unclear. Rephrase this line to remove
the repetition by using one of the suggested alternatives: either "Sorting is
now valid only when it genuinely means true" or "Sorting is now interpreted
strictly — valid only when it genuinely means true". Choose whichever option
best fits the context and maintain the rest of the explanation about how the
sorting behavior has changed.
bench/README.md (1)

94-106: ⚡ Quick win

Document optional task properties expectedTurns and maxTurns in the task-structure template.

The template shows the required shape (id, category, setup, prompt, grade, teardown), but the runner code (bench/run.mjs:66–67) also references task.expectedTurns and task.maxTurns for budget tracking and warnings. These are not required (absent tasks skip the check), but developers adding tasks should know about them. Update the template to include them as optional fields.

✏️ Proposed addition
 {
   id: 'category.short-name',
   category: 'category',
+  expectedTurns: 5,      // optional: expected turn count for this task; triggers ⚠ if exceeded
+  maxTurns: 20,          // optional: hard limit on turns (enforced by runner)
   async setup(client) { /* create fixtures */ return { /* state */ }; },
   prompt: (state) => `natural-language task referencing ${state.…}`,
   async grade({ client, state, telemetry }) {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@bench/README.md` around lines 94 - 106, The task-structure template in the
README documentation is incomplete and does not include the optional properties
`expectedTurns` and `maxTurns` that are referenced in the runner code
(bench/run.mjs) for budget tracking and warnings. Add these optional fields to
the template documentation shown in the diff, making it clear that they are
optional and documenting their purpose for turn budgeting and warning thresholds
when developers define new tasks.
test/abilities-loader.test.js (1)

550-625: ⚡ Quick win

Add DELETE and “flattened-empty input” regression tests.

The new coverage is strong for GET/POST, but it misses two high-value cases in the same changed path: DELETE with empty input, and objects that become empty after bracket-flattening (e.g. { nested: {} }).

Suggested test additions
+suite.test('executeAbility: empty input on DELETE ability → params {input:\'\'}', async () => {
+  const stub = buildCatalogStubGvClient([[
+    {
+      name: 'gk-gravityview/view-delete',
+      description: 'DELETE ability',
+      input_schema: { type: 'object', properties: {} },
+      annotations: { destructive: true, idempotent: true },
+      enabled: true,
+      mcp_tool_name: 'gv_schema_delete',
+    },
+  ]]);
+  const { handlers } = await loadAbilitiesAsTools(stub);
+  await handlers.gv_schema_delete({});
+  const run = findRun(stub, 'gk-gravityview/view-delete');
+  TestAssert.equal(run.method, 'DELETE');
+  TestAssert.deepEqual(run.params, { input: '' });
+});
+
+suite.test('executeAbility: effectively-empty nested input on GET → params {input:\'\'}', async () => {
+  const stub = buildCatalogStubGvClient([inputSchemaFenceCatalog()]);
+  const { handlers } = await loadAbilitiesAsTools(stub);
+  await handlers.gv_schema_get({ nested: {} });
+  const run = findRun(stub, 'gk-gravityview/views-list');
+  TestAssert.equal(run.method, 'GET');
+  TestAssert.deepEqual(run.params, { input: '' });
+});
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/abilities-loader.test.js` around lines 550 - 625, Add two new regression
test cases to this test suite following the existing pattern. First, add a test
for a DELETE ability with empty input to ensure it handles the empty input case
consistently with the GET and POST tests already present. Second, add a test
case for an object that becomes empty after bracket-flattening, such as passing
{ nested: {} } to verify how the loader handles flattened inputs that resolve to
empty values. Both tests should use the same suite.test structure and verify the
correct method, params/data structure, and expected flattened output behavior.
src/index.js (1)

390-397: ⚡ Quick win

Keep delete tool descriptions terse in tools/list.

These new descriptions are clear, but they’re too long for a hot metadata path and increase token overhead on every list response.

✂️ Suggested concise wording
- description: 'Delete a form. Defaults to Trash (recoverable) — proceed with the default; do NOT pause to ask the user trash-vs-permanent. Set force=true only if the user explicitly asks to permanently delete. (Requires ALLOW_DELETE=true.)',
+ description: 'Delete a form. Default (force=false) moves it to Trash. Use force=true only for explicit permanent-delete requests. Requires ALLOW_DELETE=true.',
...
- force: { type: 'boolean', description: 'Permanently delete instead of moving to Trash. Default false (Trash, recoverable).' }
+ force: { type: 'boolean', description: 'Permanent delete. Default false (move to Trash).' }

- description: 'Delete an entry. Defaults to Trash (recoverable) — proceed with the default; do NOT pause to ask the user trash-vs-permanent. Set force=true only if the user explicitly asks to permanently delete. (Requires ALLOW_DELETE=true.)',
+ description: 'Delete an entry. Default (force=false) moves it to Trash. Use force=true only for explicit permanent-delete requests. Requires ALLOW_DELETE=true.',
...
- force: { type: 'boolean', description: 'Permanently delete instead of moving to Trash. Default false (Trash, recoverable).' }
+ force: { type: 'boolean', description: 'Permanent delete. Default false (move to Trash).' }

As per coding guidelines, src/index.js: Keep tool descriptions in src/index.js terse to reduce tools/list token overhead.

Also applies to: 544-551

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/index.js` around lines 390 - 397, The delete form tool description in
src/index.js is too verbose and increases token overhead for the tools/list
response. Shorten the description field in the delete tool definition (around
the id and force properties) to be more concise while retaining only the most
essential information about what the tool does. Apply the same brevity principle
to the other tool description mentioned around line 544-551 to maintain
consistency and reduce overall token usage in metadata responses.

Source: Coding guidelines

test/logger-stdout.test.js (1)

31-52: ⚡ Quick win

Add tests for GRAVITYKIT_MCP_TEST_MODE and GRAVITYMCP_TEST_MODE.

isTestContext() supports both flags, but this suite currently validates only NODE_ENV=test.

🧪 Suggested additions
 test('explicit test mode (NODE_ENV=test) may use stdout', () => {
   const { stdout } = runLogger({ NODE_ENV: 'test' });
   assert.ok(stdout.includes('LOG_MARKER'), 'test mode logs to stdout for visibility');
 });
+
+test('explicit test mode (GRAVITYKIT_MCP_TEST_MODE=true) may use stdout', () => {
+  const { stdout } = runLogger({ NODE_ENV: 'development', GRAVITYKIT_MCP_TEST_MODE: 'true' });
+  assert.ok(stdout.includes('LOG_MARKER'), 'GRAVITYKIT_MCP_TEST_MODE should route logs to stdout');
+});
+
+test('explicit test mode (GRAVITYMCP_TEST_MODE=true) may use stdout', () => {
+  const { stdout } = runLogger({ NODE_ENV: 'development', GRAVITYMCP_TEST_MODE: 'true' });
+  assert.ok(stdout.includes('LOG_MARKER'), 'GRAVITYMCP_TEST_MODE should route logs to stdout');
+});
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/logger-stdout.test.js` around lines 31 - 52, The test suite validates
logger behavior for NODE_ENV settings but lacks coverage for the alternative
test context detection methods supported by isTestContext(). Add two new test
cases following the same pattern as the existing test cases in this file: one
that sets the GRAVITYKIT_MCP_TEST_MODE environment variable and verifies logs
appear on stdout, and another that sets GRAVITYMCP_TEST_MODE and performs the
same verification. Both tests should use the runLogger() function and assert
that LOG_MARKER is present in stdout when these environment variables are set,
ensuring parity with the NODE_ENV=test behavior.
test/server-lifecycle.test.js (1)

22-22: ⚡ Quick win

Spawn Node via process.execPath for test portability.

Hardcoding 'node' can fail in CI/runtime setups where PATH differs from the test process binary.

🔧 Suggested change
-  const child = spawn('node', [SERVER], { stdio: ['pipe', 'pipe', 'pipe'] });
+  const child = spawn(process.execPath, [SERVER], { stdio: ['pipe', 'pipe', 'pipe'] });
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/server-lifecycle.test.js` at line 22, The spawn call in the test file
hardcodes the string 'node' as the executable, which fails in CI environments
with different PATH configurations. Replace the hardcoded 'node' string with
process.execPath, which provides the absolute path to the Node.js executable
that is currently running the test process. This ensures the spawned child
process uses the same Node binary as the test runner, making the test portable
across different runtime environments.
bench/lib/target.mjs (1)

16-16: 🏗️ Heavy lift

Align local ESM imports to the .js extension convention.

Line 16 imports ../config.mjs; this diverges from the project-wide import-extension rule and repeats across new bench modules.

As per coding guidelines, **/*.{js,mjs}: Use ESM throughout the project with .js extensions on all imports.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@bench/lib/target.mjs` at line 16, The import statement for CONFIG and
MCP_ENTRY uses the .mjs extension which diverges from the project convention of
using .js extensions for ESM imports. Change the import path from ../config.mjs
to ../config.js to align with the project-wide import-extension rule established
for all ESM modules.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@bench/demo.mjs`:
- Around line 46-53: The main function skips the destroySite cleanup when
runAgent() or any subsequent step throws an error, leaving the minted
environment running. Wrap the runAgent() call and any other steps that follow
the provisionSite() call in a try-finally block, moving the
destroySite(prov.name) call to the finally block to ensure cleanup always
executes regardless of whether an error occurs.

In `@bench/field-storage.mjs`:
- Around line 62-83: The main function does not guarantee cleanup of minted
sites when errors occur during execution. When args.mint is true, the site is
provisioned via provisionSite() and mintedName is set, but if an exception is
thrown in the subsequent code (after the provisioning block and before any
cleanup), the minted site is never destroyed. Wrap the code that executes after
the site provisioning (starting from where makeClient is called through the end
of the function logic) in a try-finally block that ensures destroySite() is
called with mintedName only when mintedName is not null, guaranteeing cleanup
regardless of whether an error occurs during execution.

In `@bench/lib/agent.mjs`:
- Around line 197-236: The spawn call for CLAUDE_BIN lacks an error event
handler, so if the process fails to spawn (e.g., missing or non-executable
binary), the error will go unhandled and either crash the process or leave the
Promise unresolved. Add a child.on('error') handler before or alongside the
existing child.on('close') handler that captures the spawn error and calls
resolvePromise with appropriate error details, ensuring the returned object
includes a hardError field set to a descriptive error message based on the error
type.

In `@bench/lib/target.mjs`:
- Around line 64-78: The axios instances gf and wp use validateStatus: () =>
true which treats all HTTP status codes as successful, allowing the read helper
functions (countEntries, getEntries, getForm, findFormByTitle) to misinterpret
HTTP errors as empty results and produce incorrect benchmark scores. Add a
shared assertion in each helper function to verify the response status is in the
2xx range before parsing values, returning an empty/fallback state only for
genuinely successful empty responses. Additionally, update the import statement
on line 16 from config.mjs to config.js to comply with ESM guidelines.

In `@bench/nested-forms.mjs`:
- Around line 239-240: When exceptions are caught in the catch block at lines
239-240 (and the similar catch block at 259-260), the code is converting them to
a normal fail verdict instead of signaling a harness error with the documented
exit code 2. Instead of assigning the verdict object with pass: false when an
exception occurs, call process.exit(2) to properly distinguish harness/runtime
faults from assertion failures in CI. This ensures thrown errors exit with code
2 as documented, while normal assertion failures continue to exit with code 1.
- Around line 74-77: The fetchHtml function reconstructs URLs by joining baseUrl
with path, but this fails when page URLs (pageLink/singleUrl) exist on different
hosts than target.baseUrl, causing it to fetch from the wrong site. Modify the
fetchHtml function signature to accept the complete published page URL directly
instead of baseUrl and path parameters, and update all calls to fetchHtml
(including the ones at lines 209-211 marked as "Also applies to") to pass the
full URL (pageLink or singleUrl) directly rather than reconstructing it from
separate base and path components.
- Around line 74-76: The fetchHtml function's fetch request lacks a timeout
mechanism, which can cause the process to hang indefinitely if a response
stalls. Add a timeout to the fetch call in fetchHtml by using an AbortController
to set a reasonable timeout duration (similar to the axios clients already in
the bench stack). When the timeout is exceeded, the abort controller should
terminate the request and the error should be properly handled to allow the
process to continue.

In `@bench/README.md`:
- Around line 113-115: The cost caveat example at lines 113-115 incorrectly
states "A 7-task suite at 3 runs is ~21 sessions," but the Coverage table at
lines 73-84 enumerates approximately 28 tasks across all categories (discovery,
forms, entries, authoring, views, view-fields, view-widgets, search, and grid).
Update the example to use the correct task count of 28 instead of 7, and
recalculate the sessions estimate accordingly to reflect the actual suite size
and help users accurately budget their time and tokens.

In `@bench/run.mjs`:
- Around line 21-27: The import statements in the run.mjs file use .mjs file
extensions (in the imports from config.mjs, lib/target.mjs, lib/siteminter.mjs,
lib/agent.mjs, lib/score.mjs, lib/report.mjs, and tasks/index.mjs) but the
project convention requires .js extensions on all ESM imports. Change all the
.mjs extension specifiers in these import statements to .js extensions to align
with the repository's ESM extension rule that requires .js extensions across all
imports.
- Around line 82-128: The minted environment cleanup code (checking mintedName
and calling destroySite) only runs on the normal completion path, causing
resource leaks if an exception occurs after provisionSite() succeeds. Use a
try-finally structure wrapping the main logic from where mintedName is
determined through the gate reporting, and move the cleanup code that handles
the mintedName variable (the if/else block that destroys or logs the site) into
a finally block to ensure it executes whether the code completes normally or
throws an exception.

In `@bench/tasks/discovery.mjs`:
- Around line 47-53: The grade method in the discovery task only validates that
views_list or views_scan calls completed successfully, but it does not verify
that the actual response was filtered to the publish status. To fix this, add
additional validation logic to the grade method that inspects the telemetry data
to confirm the response was actually filtered to the publish status (not just
that the call succeeded). This requires checking the response content or status
filtering metadata in the telemetry to ensure the filtering behavior was
correctly applied, preventing false positives where the call succeeds but the
filtering is incorrect.

In `@bench/tasks/entries-crud.mjs`:
- Around line 45-48: The grade function in the entries-crud.mjs file currently
passes the test if the output text contains "ada@example.com" and doesn't
contain "charles@example.com", but this can succeed through prompt echoing alone
without any actual tool invocation. Add a guard check to ensure that the
entries.search tool was actually called in the telemetry before accepting the
test as passing. Verify that telemetry contains evidence of actual tool calls
being made, not just that the final text matches the expected pattern.

In `@bench/tasks/grid.mjs`:
- Around line 39-45: The grade function for the grid.add-row-and-place-field
task only verifies that the Email field (id '2') exists somewhere in the
configuration after adding a row, but does not check that it is specifically
placed in the left column as required by the prompt. Modify the grading logic by
replacing or supplementing the hasEmail check to verify not only that the Email
field exists in the grid, but also that it is positioned in the left column of
the newly added row. This will require inspecting the grid structure to
understand how column positions are represented in cfg.fields and validating the
Email field's column position accordingly.

In `@bench/tasks/index.mjs`:
- Around line 17-26: The import statements in bench/tasks/index.mjs are using
`.mjs` file extensions which deviates from the project's ESM convention of using
`.js` extensions. Change all import statements for the task modules (discovery,
forms, entries, entriesCrud, authoring, views, viewFields, viewWidgets, search,
grid) from importing with `.mjs` extensions to `.js` extensions, and ensure all
corresponding files in the bench directory are renamed from `.mjs` to `.js` to
maintain consistency with the project's ESM import guideline.

In `@bench/tasks/view-fields.mjs`:
- Around line 33-37: The grade method in the view-fields task is checking the
entire field tree using fieldIdsInTree(cfg.fields) instead of restricting the
check to the column area only. For column-specific tasks like view-fields.add
and view-fields.remove, update the grading logic to check only the
column-specific area (likely cfg.fields.columns or a similar column-scoped
property) instead of the entire cfg.fields tree. This applies to both instances
of the grading logic mentioned in the comment.

In `@bench/tasks/views.mjs`:
- Around line 29-30: The teardown method in the task definition chains two
sequential delete operations where a failure in deleteView will prevent
deleteForm from executing, causing resource leaks. Modify the teardown method to
use error handling (such as try-finally blocks or Promise.allSettled) to ensure
both client.deleteView and client.deleteForm are always executed regardless of
whether either operation throws an error. This same pattern needs to be applied
to all other teardown methods in the file that have similar sequential delete
chains (at the other locations mentioned in the comment).

In `@src/field-definitions/field-registry.js`:
- Around line 991-1017: The assignFieldIds function lacks a guard clause to
validate that each field element is a valid object before attempting to spread
it or access its properties. When fields array contains null or non-object
elements, the code throws a runtime error when executing { ...field, id: newId
}. Add an early validation check at the beginning of the map callback in
assignFieldIds to ensure the field is a valid object, and handle invalid items
appropriately to prevent unhandled runtime errors and ensure validation errors
are returned instead.
- Around line 622-630: The generateCompoundInputs() function for creditcard
fields is generating sub-inputs that do not match the
creditcard.storage.subInputs contract. The subInputs object now defines only .1
for card_number_masked and .4 for card_type, but generateCompoundInputs() is
still emitting .4 as Cardholder Name and .5 as Card Type. Update the
generateCompoundInputs() function to only emit sub-inputs .1 and .4 with the
correct mappings to align with the creditcard.storage.subInputs definition,
removing the Cardholder Name and Card Type sub-inputs that are no longer part of
the contract.

In `@src/utils/logger.js`:
- Around line 28-40: The sendLogMessage function is outputting raw, unsanitized
log messages to the console, which can expose sensitive credentials. Modify the
function to sanitize the message parameter using the sanitize utility from
utils/sanitize.js before passing it to any console methods (console.error and
console.log). Apply this sanitization at the entry point of the function so that
all console output throughout sendLogMessage uses sanitized data, protecting
against credential leakage in all code paths including error logs, test context
logs, and server mode logs.

---

Outside diff comments:
In `@src/abilities/loader.js`:
- Around line 443-463: The hasInputKeys check on line 443 doesn't account for
inputs with keys that serialize to empty params (e.g., { nested: {} }, { a: null
}). Instead of checking hasInputKeys before calling walkInputToBracketedParams,
always call walkInputToBracketedParams first to flatten the input into params,
then check if the resulting params object is empty. If it's empty, set
config.params = { input: '' } to ensure WordPress properly validates it as an
empty object instead of receiving null.

---

Nitpick comments:
In `@bench/lib/target.mjs`:
- Line 16: The import statement for CONFIG and MCP_ENTRY uses the .mjs extension
which diverges from the project convention of using .js extensions for ESM
imports. Change the import path from ../config.mjs to ../config.js to align with
the project-wide import-extension rule established for all ESM modules.

In `@bench/README.md`:
- Around line 94-106: The task-structure template in the README documentation is
incomplete and does not include the optional properties `expectedTurns` and
`maxTurns` that are referenced in the runner code (bench/run.mjs) for budget
tracking and warnings. Add these optional fields to the template documentation
shown in the diff, making it clear that they are optional and documenting their
purpose for turn budgeting and warning thresholds when developers define new
tasks.

In `@CHANGELOG.md`:
- Line 26: In the bullet point about `sorting.is_numeric`, the phrase "Sorting
only is now valid only when..." contains an awkward repetition of the word
"only" that makes the sentence unclear. Rephrase this line to remove the
repetition by using one of the suggested alternatives: either "Sorting is now
valid only when it genuinely means true" or "Sorting is now interpreted strictly
— valid only when it genuinely means true". Choose whichever option best fits
the context and maintain the rest of the explanation about how the sorting
behavior has changed.

In `@README.md`:
- Around line 295-307: The benchmark section in README.md documents three
dev-only benchmark commands (bench, bench:field-output, bench:field-storage) but
is missing the fourth command npm run bench:nested-forms, which is already
documented in CHANGELOG.md. Add a fourth bullet point to the benchmark section
that describes npm run bench:nested-forms as a front-end render test confirming
that a parent View shows a nested form's Summary Fields, placing it alongside
the other dev-only benchmark commands to keep all benchmark documentation
together.

In `@src/index.js`:
- Around line 390-397: The delete form tool description in src/index.js is too
verbose and increases token overhead for the tools/list response. Shorten the
description field in the delete tool definition (around the id and force
properties) to be more concise while retaining only the most essential
information about what the tool does. Apply the same brevity principle to the
other tool description mentioned around line 544-551 to maintain consistency and
reduce overall token usage in metadata responses.

In `@test/abilities-loader.test.js`:
- Around line 550-625: Add two new regression test cases to this test suite
following the existing pattern. First, add a test for a DELETE ability with
empty input to ensure it handles the empty input case consistently with the GET
and POST tests already present. Second, add a test case for an object that
becomes empty after bracket-flattening, such as passing { nested: {} } to verify
how the loader handles flattened inputs that resolve to empty values. Both tests
should use the same suite.test structure and verify the correct method,
params/data structure, and expected flattened output behavior.

In `@test/logger-stdout.test.js`:
- Around line 31-52: The test suite validates logger behavior for NODE_ENV
settings but lacks coverage for the alternative test context detection methods
supported by isTestContext(). Add two new test cases following the same pattern
as the existing test cases in this file: one that sets the
GRAVITYKIT_MCP_TEST_MODE environment variable and verifies logs appear on
stdout, and another that sets GRAVITYMCP_TEST_MODE and performs the same
verification. Both tests should use the runLogger() function and assert that
LOG_MARKER is present in stdout when these environment variables are set,
ensuring parity with the NODE_ENV=test behavior.

In `@test/server-lifecycle.test.js`:
- Line 22: The spawn call in the test file hardcodes the string 'node' as the
executable, which fails in CI environments with different PATH configurations.
Replace the hardcoded 'node' string with process.execPath, which provides the
absolute path to the Node.js executable that is currently running the test
process. This ensures the spawned child process uses the same Node binary as the
test runner, making the test portable across different runtime environments.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 173d7293-f9e3-43f3-bb53-1fb3951fecbb

📥 Commits

Reviewing files that changed from the base of the PR and between 306a541 and ef5b0e3.

📒 Files selected for processing (44)
  • AGENTS.md
  • CHANGELOG.md
  • README.md
  • bench/README.md
  • bench/config.mjs
  • bench/demo.mjs
  • bench/field-output-cases.mjs
  • bench/field-output.mjs
  • bench/field-storage-cases.mjs
  • bench/field-storage.mjs
  • bench/lib/agent.mjs
  • bench/lib/report.mjs
  • bench/lib/score.mjs
  • bench/lib/siteminter.mjs
  • bench/lib/target.mjs
  • bench/nested-forms.mjs
  • bench/run.mjs
  • bench/tasks/authoring.mjs
  • bench/tasks/discovery.mjs
  • bench/tasks/entries-crud.mjs
  • bench/tasks/entries.mjs
  • bench/tasks/forms.mjs
  • bench/tasks/grid.mjs
  • bench/tasks/helpers.mjs
  • bench/tasks/index.mjs
  • bench/tasks/search.mjs
  • bench/tasks/view-fields.mjs
  • bench/tasks/view-widgets.mjs
  • bench/tasks/views.mjs
  • package.json
  • src/abilities/loader.js
  • src/field-definitions/field-registry.js
  • src/field-operations/field-manager.js
  • src/field-operations/index.js
  • src/gravity-forms-client.js
  • src/index.js
  • src/server-runtime.js
  • src/utils/logger.js
  • test/abilities-loader.test.js
  • test/field-manager.test.js
  • test/field-registry.test.js
  • test/logger-stdout.test.js
  • test/server-lifecycle.test.js
  • test/server-runtime.test.js

Comment thread bench/demo.mjs
Comment on lines +46 to +53
async function main() {
console.log('\nMinting (or reusing) a GravityView 3.0 site for the natural-language demo…');
const prov = await provisionSite({
fresh: args.fresh,
name: SITE,
plugins: [...SITEMINTER.plugins, ...SITEMINTER.addons],
log: (m) => console.log(`[siteminter] ${m}`),
});

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Demo teardown is skipped when the run loop throws.

destroySite(prov.name) executes only on the happy path. If runAgent() (or any later step) throws, Line 114 exits and leaves the minted environment running.

Suggested fix
 async function main() {
   console.log('\nMinting (or reusing) a GravityView 3.0 site for the natural-language demo…');
   const prov = await provisionSite({
@@
   const { target } = prov;
-  const plugins = activePlugins(prov.path);
-  console.log(`\nSite: ${target.baseUrl}`);
-  for (const slug of ['gravityforms', 'GravityView', 'spellbook', 'gp-nested-forms']) {
-    if (plugins[slug]) console.log(`  • ${slug} ${plugins[slug]}`);
-  }
+  try {
+    const plugins = activePlugins(prov.path);
+    console.log(`\nSite: ${target.baseUrl}`);
+    for (const slug of ['gravityforms', 'GravityView', 'spellbook', 'gp-nested-forms']) {
+      if (plugins[slug]) console.log(`  • ${slug} ${plugins[slug]}`);
+    }

-  const mcpConfigPath = writeMcpConfig(target);
-  const traceDir = join(CONFIG.outDir, 'demo');
-  mkdirSync(traceDir, { recursive: true });
+    const mcpConfigPath = writeMcpConfig(target);
+    const traceDir = join(CONFIG.outDir, 'demo');
+    mkdirSync(traceDir, { recursive: true });

-  // ... run loop + reporting ...
+    // ... run loop + reporting ...
+  } finally {
+    if (!args.keep) { try { destroySite(prov.name); } catch { /* best effort */ } console.log(`\nTore down "${prov.name}".`); }
+    else console.log(`\n[siteminter] keeping "${prov.name}" at ${target.baseUrl} (admin/admin) so you can inspect what it built.`);
+  }

-  if (!args.keep) { try { destroySite(prov.name); } catch { /* best effort */ } console.log(`\nTore down "${prov.name}".`); }
-  else console.log(`\n[siteminter] keeping "${prov.name}" at ${target.baseUrl} (admin/admin) so you can inspect what it built.`);
 }

Also applies to: 73-112, 114-114

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@bench/demo.mjs` around lines 46 - 53, The main function skips the destroySite
cleanup when runAgent() or any subsequent step throws an error, leaving the
minted environment running. Wrap the runAgent() call and any other steps that
follow the provisionSite() call in a try-finally block, moving the
destroySite(prov.name) call to the finally block to ensure cleanup always
executes regardless of whether an error occurs.

Comment thread bench/field-storage.mjs
Comment on lines +62 to +83
async function main() {
const args = parseArgs(process.argv);
const cases = args.types ? CASES.filter((cs) => args.types.includes(cs.type)) : CASES;
if (!cases.length) { console.error(`No storage cases match --type ${args.types}`); process.exit(2); }

let target;
let mintedName = null;
let plugins = null; // active-plugins map; null when not minting
if (args.mint) {
const prov = await provisionSite({
fresh: args.fresh,
name: STORAGE_SITE,
plugins: [...SITEMINTER.plugins, ...SITEMINTER.addons],
log: (m) => console.log(`[siteminter] ${m}`),
});
target = prov.target;
mintedName = prov.name;
plugins = activePlugins(prov.path);
} else {
target = resolveTarget();
}
const client = makeClient(target);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Minted site teardown is not guaranteed on early failures.

If an exception is thrown before Line 139 (for example, preflight/plugin validation), the .catch() at Line 155 exits without calling destroySite(), leaving minted resources behind.

Suggested fix
 async function main() {
   const args = parseArgs(process.argv);
+  let mintedName = null;

-  let target;
-  let mintedName = null;
-  let plugins = null; // active-plugins map; null when not minting
-  if (args.mint) {
-    const prov = await provisionSite({
-      fresh: args.fresh,
-      name: STORAGE_SITE,
-      plugins: [...SITEMINTER.plugins, ...SITEMINTER.addons],
-      log: (m) => console.log(`[siteminter] ${m}`),
-    });
-    target = prov.target;
-    mintedName = prov.name;
-    plugins = activePlugins(prov.path);
-  } else {
-    target = resolveTarget();
-  }
-  const client = makeClient(target);
+  try {
+    let target;
+    let plugins = null; // active-plugins map; null when not minting
+    if (args.mint) {
+      const prov = await provisionSite({
+        fresh: args.fresh,
+        name: STORAGE_SITE,
+        plugins: [...SITEMINTER.plugins, ...SITEMINTER.addons],
+        log: (m) => console.log(`[siteminter] ${m}`),
+      });
+      target = prov.target;
+      mintedName = prov.name;
+      plugins = activePlugins(prov.path);
+    } else {
+      target = resolveTarget();
+    }
+    const client = makeClient(target);

     // ... existing suite logic ...
-
-  if (mintedName && !args.keep) { try { destroySite(mintedName); } catch { /* best effort */ } }
-  else if (mintedName) console.log(`\n[siteminter] keeping "${mintedName}".`);
+  } finally {
+    if (mintedName && !args.keep) { try { destroySite(mintedName); } catch { /* best effort */ } }
+    else if (mintedName) console.log(`\n[siteminter] keeping "${mintedName}".`);
+  }
 }

Also applies to: 139-156

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@bench/field-storage.mjs` around lines 62 - 83, The main function does not
guarantee cleanup of minted sites when errors occur during execution. When
args.mint is true, the site is provisioned via provisionSite() and mintedName is
set, but if an exception is thrown in the subsequent code (after the
provisioning block and before any cleanup), the minted site is never destroyed.
Wrap the code that executes after the site provisioning (starting from where
makeClient is called through the end of the function logic) in a try-finally
block that ensures destroySite() is called with mintedName only when mintedName
is not null, guaranteeing cleanup regardless of whether an error occurs during
execution.

Comment thread bench/lib/agent.mjs Outdated
Comment thread bench/lib/target.mjs
Comment on lines +64 to +78
const gf = axios.create({
baseURL: `${target.baseUrl}/wp-json/gf/v2`,
auth: { username: target.key, password: target.secret },
httpsAgent: agent,
timeout: 30000,
validateStatus: () => true,
});

const wp = axios.create({
baseURL: `${target.wpUrl}/wp-json`,
auth: { username: target.wpUser, password: target.wpPass },
httpsAgent: agent,
timeout: 30000,
validateStatus: () => true,
});

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify current non-2xx handling paths in bench/lib/target.mjs
rg -n "validateStatus:\s*\(\)\s*=>\s*true" bench/lib/target.mjs
rg -n "async (countEntries|getEntries|getForm|findFormByTitle)\(" bench/lib/target.mjs
rg -n "return Number\.isFinite|return Array\.isArray|return res\.data \|\| \{\}|\\|\\| null" bench/lib/target.mjs

Repository: GravityKit/MCP

Length of output: 704


🏁 Script executed:

head -25 bench/lib/target.mjs

Repository: GravityKit/MCP

Length of output: 1026


🏁 Script executed:

rg -n "^import|^export" bench/lib/target.mjs | head -20

Repository: GravityKit/MCP

Length of output: 392


🏁 Script executed:

fd -t f config '\.(mjs|js)$' bench/

Repository: GravityKit/MCP

Length of output: 151


🏁 Script executed:

fd -t f config '\.(mjs|js)$' . --max-depth 2

Repository: GravityKit/MCP

Length of output: 153


Fail fast on non-2xx grader reads to prevent false gate results.

With validateStatus: () => true (lines 69, 77), read helpers (countEntries, getEntries, getForm, findFormByTitle at lines 123–145) can treat HTTP errors as valid "empty" state, mis-scoring benchmark outcomes (e.g., auth/server failure interpreted as "no entries/forms found"). Add a shared 2xx assertion before parsing fallback values.

Also, line 16 imports config.mjs; update to config.js per ESM guidelines.

Suggested patch
-import { CONFIG, MCP_ENTRY } from '../config.mjs';
+import { CONFIG, MCP_ENTRY } from '../config.js';

@@
   const gf = axios.create({
@@
     validateStatus: () => true,
   });
@@
   const wp = axios.create({
@@
     validateStatus: () => true,
   });
+
+  const expect2xx = (res, op) => {
+    if (res.status < 200 || res.status >= 300) {
+      throw new Error(`${op} failed (${res.status}): ${JSON.stringify(res.data).slice(0, 200)}`);
+    }
+  };
@@
     async countEntries(formId) {
       const res = await gf.get(`/forms/${formId}/entries`, { params: { paging: { page_size: 1 } } });
+      expect2xx(res, `countEntries(${formId})`);
       const total = res.data?.total_count;
@@
     async getEntries(formId, params = {}) {
       const res = await gf.get(`/forms/${formId}/entries`, { params });
+      expect2xx(res, `getEntries(${formId})`);
       return Array.isArray(res.data?.entries) ? res.data.entries : [];
     },
@@
     async getForm(formId) {
       const res = await gf.get(`/forms/${formId}`);
+      expect2xx(res, `getForm(${formId})`);
       return res.data || {};
     },
@@
     async findFormByTitle(title) {
       const res = await gf.get('/forms');
+      expect2xx(res, 'findFormByTitle');
       const forms = res.data && typeof res.data === 'object' ? Object.values(res.data) : [];
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@bench/lib/target.mjs` around lines 64 - 78, The axios instances gf and wp use
validateStatus: () => true which treats all HTTP status codes as successful,
allowing the read helper functions (countEntries, getEntries, getForm,
findFormByTitle) to misinterpret HTTP errors as empty results and produce
incorrect benchmark scores. Add a shared assertion in each helper function to
verify the response status is in the 2xx range before parsing values, returning
an empty/fallback state only for genuinely successful empty responses.
Additionally, update the import statement on line 16 from config.mjs to
config.js to comply with ESM guidelines.

Comment thread bench/nested-forms.mjs
Comment on lines +74 to +76
async function fetchHtml(baseUrl, path) {
const res = await fetch(`${baseUrl}${path}`, { redirect: 'follow' });
return { status: res.status, html: await res.text() };

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Front-end fetches need a timeout to keep the suite from hanging.

Unlike the axios clients in this bench stack, this raw fetch path has no timeout. A stalled response can block the process indefinitely.

Suggested fix
 async function fetchHtml(url) {
-  const res = await fetch(url, { redirect: 'follow' });
-  return { status: res.status, html: await res.text() };
+  const controller = new AbortController();
+  const timer = setTimeout(() => controller.abort(), 30_000);
+  try {
+    const res = await fetch(url, { redirect: 'follow', signal: controller.signal });
+    return { status: res.status, html: await res.text() };
+  } finally {
+    clearTimeout(timer);
+  }
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@bench/nested-forms.mjs` around lines 74 - 76, The fetchHtml function's fetch
request lacks a timeout mechanism, which can cause the process to hang
indefinitely if a response stalls. Add a timeout to the fetch call in fetchHtml
by using an AbortController to set a reasonable timeout duration (similar to the
axios clients already in the bench stack). When the timeout is exceeded, the
abort controller should terminate the request and the error should be properly
handled to allow the process to continue.

Comment thread bench/tasks/view-fields.mjs
Comment thread bench/tasks/views.mjs
Comment thread src/field-definitions/field-registry.js
Comment thread src/field-definitions/field-registry.js
Comment thread src/utils/logger.js
zackkatz added 9 commits June 17, 2026 19:09
… empty

Inputs that have keys but flatten to no query params (e.g. { nested: {} } or
{ a: null }) were sent with an empty params object, so the WP Abilities API
got no `input` arg and rejected GET/DELETE with a 400 against type:object.

- Flatten the input first, then fall back to input="" when the result is empty
- Covers truly-empty input and keys that serialize away
- Add loader regression tests (DELETE-empty parity, {nested:{}}, {a:null})
generateCompoundInputs emitted .4 as "Cardholder Name" and .5 as "Card Type",
the swapped order. Verified against class-gf-field-creditcard.php: .4 is Card
Type and .5 is Cardholder Name (only .1 and .4 are persisted), realigning the
createForm path with generateSubInputs and storage.subInputs.

- Swap the .4/.5 labels in generateCompoundInputs
- Update the field-registry test that had locked in the swapped order
The release gate's value is its grading; several graders passed on the wrong
signal and gave false confidence.

- entries.read/search require a real read/search tool call, not a prompt echo
- view-fields.add/remove scope the check to the table-columns area (columnIds)
- grid.add-row-and-place-field requires Email in a newly added area
- Add bench-grading unit tests covering each contract
If the claude binary is missing or not executable, the ChildProcess 'error'
event was unhandled: it crashed the gate or left the run's Promise pending.

- Add child.on('error') that resolves a spawn_failed hardError so the run
  scores as a failure and the suite continues
- Add a bin parameter to runOnce for testability and export it
- Add a bench-agent regression test
destroySite ran only on the normal completion path, so a throw between
provisioning and cleanup leaked a running Docker site.

- Add a tested cleanupMintedSite helper (best-effort, never throws)
- Wrap run.mjs main in try/finally so cleanup runs on every exit
- Add a bench-cleanup unit test
Hardcoding 'node' fails in CI whose PATH resolves a different (or no) node.
process.execPath uses the same binary running the test.
The new bench grader/runner tests run only when listed in test:node, which
also gates prepublishOnly.
mcp.json was left at 2.3.0 when package.json bumped to 2.4.0; the server-tools
test asserts the two match.
- Add test/AGENTS.md covering the two test harnesses, registration, and the
  bench AI gate; update the AGENTS.md testing section to match
- README: add the npm run bench:nested-forms command
- bench/README: correct the suite size (32 tasks) and document
  expectedTurns/maxTurns in the task template
- CHANGELOG: remove the doubled "only" in the 2.4.0 sorting note
Comment thread bench/lib/target.mjs Dismissed

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/bench-grading.test.js`:
- Around line 42-52: The test for 'entries.read' is passing the wrong input
format to task.grade. The grader expects a client and state-based input
(specifically calling client.getEntry with state.entryId) as defined in
bench/tasks/entries-crud.mjs, but the tests on lines 45 and 49 are currently
passing telemetry with toolCalls and finalText. Update both task.grade
invocations to pass an object with the correct shape: a client mock/object and
state object containing entryId, rather than telemetry data. This will ensure
the tests validate the actual grading contract instead of failing for the wrong
reason (missing client parameter).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f875a7a9-b975-47d4-8373-aeaf8e65a302

📥 Commits

Reviewing files that changed from the base of the PR and between ef5b0e3 and 110dd9c.

📒 Files selected for processing (21)
  • AGENTS.md
  • CHANGELOG.md
  • README.md
  • bench/README.md
  • bench/lib/agent.mjs
  • bench/lib/siteminter.mjs
  • bench/run.mjs
  • bench/tasks/entries-crud.mjs
  • bench/tasks/grid.mjs
  • bench/tasks/view-fields.mjs
  • mcp.json
  • package.json
  • src/abilities/loader.js
  • src/field-definitions/field-registry.js
  • test/AGENTS.md
  • test/abilities-loader.test.js
  • test/bench-agent.test.js
  • test/bench-cleanup.test.js
  • test/bench-grading.test.js
  • test/field-registry.test.js
  • test/server-lifecycle.test.js
✅ Files skipped from review due to trivial changes (4)
  • mcp.json
  • AGENTS.md
  • README.md
  • CHANGELOG.md
🚧 Files skipped from review as they are similar to previous changes (10)
  • bench/README.md
  • bench/tasks/entries-crud.mjs
  • bench/tasks/view-fields.mjs
  • package.json
  • bench/lib/siteminter.mjs
  • src/abilities/loader.js
  • bench/run.mjs
  • bench/tasks/grid.mjs
  • bench/lib/agent.mjs
  • src/field-definitions/field-registry.js

Comment thread test/bench-grading.test.js
@zackkatz zackkatz merged commit 62e8e56 into main Jun 17, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants