Conversation
The MCP server speaks JSON-RPC over stdout, so any log byte there corrupts the transport and breaks the handshake (client sees a malformed message; server hangs with 0 tools). The old detection (`!NODE_ENV || NODE_ENV === 'production'`) was inverted: it routed the common `NODE_ENV=development` to stdout and silently broke the handshake. Resolve the mode per call (not memoized at import) so it always reflects the current env and stays unit-testable: errors and all server-mode logs go to stderr; only an explicit test context (NODE_ENV=test or the GRAVITYKIT_MCP_TEST_MODE flags) may use stdout. Add a child-process test that asserts stdout stays clean for development/unset/production and that test mode logs to stdout.
Every Foundation ability declares an object `input_schema`, and the WP
abilities-api validates the `input` arg against `type:object` — so a
missing/null arg 400s ("input is not of type object"). Make the loader
always send an object input, never null/omitted:
- GET/DELETE empty input → `input=''` (WP's rest_is_object('') === true,
the wire form of an empty object on a query string).
- POST empty input → body `{ input: {} }` (a real empty object).
Non-empty input is unchanged (bracketed query params / nested body).
Extract a `hasInputKeys` check and document the method-per-annotation
invariant. Add executeAbility request-shape tests across GET/POST and
both schema'd and schemaless abilities (the loader does not inspect the
schema; the server is the authority on what an ability accepts).
tools/list waits a default 2s for the abilities catalog before shipping; gv_* tools that arrive later stream in via tools/list_changed. A one-shot client (e.g. `claude -p`) reads tools/list once and never sees the list_changed update, so it boots without gv_* if the catalog is slower than the wait. Add resolveAbilitiesListTimeoutMs(env) — reads GRAVITYKIT_MCP_LIST_TIMEOUT_MS, falling back to 2000 on non-positive/non-numeric values — and wire it into the ListTools handler so such clients can raise the first-list wait to return the complete catalog. Covered by server-runtime unit tests.
gf_delete_form / gf_delete_entry default to Trash (recoverable), but the terse "(vs trash)" description led a small model to pause and ask the user trash-vs-permanent on every delete. Rewrite both tool descriptions and the shared `force` param to state the default explicitly and instruct the model to proceed with Trash unless the user explicitly asks to permanently delete.
…ions
The server `instructions` string described only the low-level
gv_search_field_* tools for building a search bar. Add the one-call
gv_search_bar_add path (View id + fields[] of {field_id, input?}) as the
preferred route, reserving the surgical gv_search_field_* tools for slot
edits after gv_view_config_get. Prose only — no behavior change.
The summary-mode input hints covered only a handful of types and were imprecise on choices. Extend and correct them for the formats a small model is most likely to populate wrong: - choice fields (select/radio/checkbox/multiselect): store the choice VALUE when set or "Show Values"/enableChoiceValue is on, else the choice LABEL. - time: a combined "HH:MM am/pm" string, NOT .1/.2/.3 sub-inputs. - add compound (date/fileupload/signature), pricing (product/option/ quantity/shipping/total, "Label|amount" encoding), survey/quiz/poll add-on choice codes, and repeater/form nested shapes. Plain string fields (text/email/number/phone/website/textarea/hidden) stay omitted on purpose. Hints mirror the field-registry storage model.
A stdio MCP server must exit when its stdin reaches EOF (client disconnects or is killed). Without that, every crashed/SIGKILL'd client — including each benchmark run — leaves an orphaned `node src/index.js` running forever; the orphans accumulate and starve the next server's startup, surfacing as agents booting with 0 MCP tools. Add a child-process test that boots the server, closes its stdin, and asserts it exits within 4s. Wire both new node:test suites (logger-stdout, server-lifecycle) into the test:node script.
A behavioral release gate (not a unit test): it drives the MCP with the smallest current model (claude-haiku-4-5) through realistic end-to-end tasks across every surface and CRUD verb (discovery/forms/entries/ authoring/views/fields/widgets/search/grid), grading the actual GravityView/Gravity Forms state — never the agent's self-report. Each task runs BENCH_RUNS times; the gate passes only if every task's success rate clears the threshold. The premise: a well-designed tool surface shouldn't need a frontier model — a small model failing means the descriptions/schemas/errors are the fix, not the model. Includes the runner (run.mjs), config, task suites (tasks/), and the harness lib (siteminter mint/teardown, the claude-CLI agent driver, the target client, scoring, reporting). The agent driver pins the CLI via a CLAUDE_BIN override so npm-run's node_modules/.bin PATH prepend can't shadow the real `claude`. Adds the `bench` script and documents the gate in AGENTS.md. Dev-only — not in the package files allowlist; reports land in the gitignored bench/reports/.
A separate, NON-AI suite that renders every displayable Gravity Forms field type's View-display cell through the gk-gravityview/view-field-render ability (staged_slot — nothing persisted) against one seeded entry, then asserts the output is error-free and, for content-pinnable types, contains the expected display value. Output testing only: no search, CRUD, agent, or scoring. Reuses the bench target plumbing (siteminter / resolveTarget / makeClient) but talks REST directly. Adds the `bench:field-output` script. Exit 0 all pass / 1 a field failed to render / 2 harness error.
The registry's storage blocks disagreed with how GF actually persists these
values, so gf_list_field_types --detail reported the wrong shape to callers.
Validated live against GF 2.10.3 + the real add-ons (entry round-trips) and
against GF source.
- number/quantity: storage.type number → string. GF stores numbers as TEXT
in the entry; numeric is the field's purpose, not its storage shape
(class-gf-field-number.php get_value_submission).
- signature: storage.format base64 → filename. The Signature add-on saves the
drawn image to disk and stores its filename ("<hash>.png"); the URL is
derived at display time (class-gf-field-signature.php get_value_save_entry
→ maybe_save_signature).
A chained-select field is compound, so addField asked generateSubInputs for its inputs — but there was no chainedselect branch, so it got an empty inputs array and none of its per-level dot-notation sub-inputs (fieldId.N) were created. Add a branch that emits one sub-input per configured level, mirroring the GF Chained Selects add-on (class-gf-field-chainedselect.php): ids count 1,2,…,9,11,12,… SKIPPING multiples of 10 (the .10/.20 slots are reserved), each labelled by its level; a field with no configured levels defaults to two (Parents/Children) per get_default_inputs(). TDD: failing tests first, including the .10-skip edge. Confirmed live — the real add-on accepted the generated inputs and the entry stored 1.1/1.2 per level.
These hints steer a small model toward the right entry shape; three were wrong
or incomplete.
- address: was hardcoded to field id 2 and omitted .2 (Line 2) and .6 (Country).
Now lists all six sub-inputs N.1–N.6 with N = field id.
- chainedselect: now notes the sub-input count is dynamic (one per level).
- survey_rank: clarified — all choice values in ranked order, comma-separated,
is CORRECT (the audit's "single value" claim was wrong; class-gf-field-rank.php
get_ordered_choices explode(',')). Values are comma-free "gsurvey<id><hex>"
tokens, so the delimiter is unambiguous (GenerateSurveyDefaultChoiceValue).
A deterministic suite (npm run bench:field-storage --mint) that builds a form per fixed field, seeds an entry over GF REST, reads it back, and asserts the STORED shape matches the registry — closing the loop the AI gate and the field-output suite don't cover (output ≠ storage). The add-on-gated types (chainedselect, signature, survey_rank) run against a mint with the ACTUAL Chained Selects / Signature / Survey plugins symlinked + activated; a preflight prints each active plugin's version and skips (never fakes) a case whose add-on isn't really loaded, so a green run can't be a stub. - config: add SITEMINTER.addons (the three real add-on paths, env-overridable). - siteminter: provisionSite takes name/plugins overrides; add activePlugins().
An adversarial audit of all 45 field types against Gravity Forms + add-on source (each finding refuted, then re-confirmed by a live entry round-trip on GF 2.10.3) surfaced field-model entries that disagreed with how GF actually stores values. A small model trusting them would write the wrong entry shape. - creditcard: storage reflects the two persisted sub-inputs — .1 (masked number) and .4 (card type); generateSubInputs labels .4 Card Type / .5 Cardholder Name (were swapped). .2/.3/.5 are not persisted. - post_content: real GF type is 'post_content' (was the non-existent 'post_body', which gf_add_field could never resolve). - post_image: stored as one "url|:|title|:|caption|:|description|:|alt" composite, not a bare URL; added an entry_input hint. - fileupload: JSON array when storageType is 'json' OR multipleFiles (not multipleFiles alone). - product 'price' (User Defined): single money string, not dot-notation. - form (Nested Forms): comma-separated string of child entry ids, not a JSON array. - password: added as its own field type (not a text variant); GF does not persist it (stored value is empty). - date / survey_likert / survey_rating / quiz / poll / survey: corrected the entry_input hints (date is always ISO; add-on choice values are "g<kind><fieldId><hex>" tokens; multi-row likert keys by sub-input id). - Trimmed hint/comment phrasing to state the correct shape, not the old bug.
Grows the storage suite from the 6 fixed fields to every storable field type plus variant-aware cases (fileupload single/JSON, select value/label-as-value, product singleproduct/user-defined-price, name simple/advanced, list single/multi-column) and the layout/no-value types. 47 run cases + 1 skipped (nested forms, add-on not on the mint). Cases live in field-storage-cases.mjs. Compound fields are built through the MCP's own FieldManager.generateSubInputs, so the round-trip proves the structure our code emits is one GF accepts. Add-on cases run against the REAL Chained Selects / Signature / Survey / Polls / Quiz plugins (now minted); a case whose add-on isn't active is skipped, never faked.
… add-on The form (Nested Form) case was the one skip — its add-on wasn't on the mint. GP Nested Forms is in the plugins dir (gp-nested-forms, field type 'form', standalone), so wire it into the addon mint set and run it for real. Adds setup/teardown hooks to the runner so a case can build prerequisites: the form case creates a child form + two real child entries, points the field at it via gpnfForm, stores their ids, and asserts the comma-separated string round- trips. All 48 cases now run (0 skipped) — stored child entry ids "42,43".
Field-model correctness pass (entry-storage shapes + entry_input hints validated against real GF 2.10.3 + add-ons), new password field type (registry now 46), and dev-only bench suites. See CHANGELOG for the full 2.4.0 entry.
…i skip] - Note all 46 field types + entry-storage/entry_input validated against real GF + add-ons - gf_delete_form/entry: document Trash-by-default, force=true to delete permanently - gf_list_field_types: 46 types, summary/detail modes - GravityView: mention one-call gv_search_bar_add - Add dev-only bench / bench:field-output / bench:field-storage commands - Security: replace the inaccurate 'Rate Limiting: automatic retry/backoff' claim (no such logic in source) with credential masking + Trash-default note
The `form` (Nested Form) field type now describes its GP Nested Forms config so callers know how to set it up, and gf_list_field_types --detail surfaces it: - gpnfForm — the child form id whose entries are nested. - gpnfFields — "Summary Fields": which child-form field ids show in the nested summary table (a directory View lists child entry ids; the single-entry view renders these fields as columns). - requiresAddon — the field comes from GP Nested Forms (needs the Spellbook framework, formerly Gravity Perks). --detail output now includes `requiresAddon` + `settings` (undefined for other types, stripped by compact).
npm run bench:nested-forms mints a site with the real GP Nested Forms + Spellbook (and GF/GravityView), builds a parent form with a nested-form field, child form + entries, a published View embedding it, then fetches the FRONT-END and asserts: - directory (Multiple Entries) shows the child entry ids; - single entry renders the nested child table with the chosen Summary Fields (Name + Amount) and EXCLUDES a non-summary field (Notes), proving gpnfFields controls which child fields display. Supporting bench changes: config adds the Spellbook + nested-forms add-ons to the mint; createView takes an optional status (publish, so the View renders for anonymous front-end requests); siteminter exports wpCli (set GPNF parent meta / clear _gpnf_expiration / approve via gform_update_meta, NOT a value-wiping REST PUT). Fixtures linkage mirrors a real GPNF submission.
Companion to the scored gate: hands the small model ONE plain-English goal (build a job-applications form + entries + a GravityView View + a role-aware search bar) and only the MCP under test, then prints the transcript of which tools it CHOSE to call and the report it wrote. Nothing is hard-coded to specific tools — the model discovers and sequences the surface itself, which is the real "is this usable via natural language?" test. Reuses the existing agent harness (runAgent) + writeMcpConfig; mints with the full add-on set.
The most common first-try failure when a natural-language agent creates a form: it describes the fields but doesn't hand-number them, and validation rejects "Field must have an id". gf_create_form now fills in missing field ids using GF's max+1 convention BEFORE validation, preserving any explicit ids and re-basing provided compound sub-input ids onto the assigned id (e.g. "2.3" → "10.3"). Surfaced by bench:demo, where the small model hit this and had to retry; now it's one-shot. TDD: assignFieldIds unit tests (sequential, gap-fill-without-collision, sub-input re-base, passthrough) added to field-registry.test.js.
`node bench/demo.mjs --runs N` loops the natural-language goal N times against the same site and aggregates the signal that matters for a surface change: how often gf_create_form is one-shot, the form gets created, and the run is error-free. Single-run still prints the full transcript + agent report. Used to confirm the gf_create_form auto-id fix: 4/4 one-shot, 0 tool errors.
…(soft) Replaces the single global turn knob with per-task budgets: - maxTurns: a HARD per-task ceiling threaded through runAgent → runOnce → --max-turns, falling back to the global CONFIG.maxTurns. A task hitting it is killed mid-run and grades as incomplete, so it's a right-sized runaway/fail- fast backstop — keep it generous (~2-3x the realistic worst case). - expectedTurns: a SOFT efficiency budget. score.mjs now reports medianTurns and flags turnsOverBudget when median > expectedTurns x tolerance (default 2, env BENCH_TURNS_TOLERANCE). It is REPORTED, NOT gated — turn counts are too stochastic to hard-fail on (a correct run that verifies its own work spikes turns). The report shows median/expected with a ⚠ and a soft over-budget line; decideGate stays correctness-only.
Gives all 32 bench tasks a soft efficiency budget (expectedTurns) and a hard ceiling (maxTurns), sized to each task's realistic complexity — simple discovery/read tasks ~2 turns / cap 8, single-action CRUD ~3 / 8-10, multi-step authoring/search ~4 / 12. Initial estimates; the report prints actual median turns so they can be tuned from telemetry.
WalkthroughBumps the package to v2.4.0, correcting field-registry storage metadata for 9+ field types, fixing the abilities loader empty-input request shape, routing all server logs to stderr, adding a configurable abilities-list timeout, and auto-assigning field IDs in ChangesMCP Server v2.4.0 Improvements
AI Release Gate Benchmark Suite
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 19
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/abilities/loader.js (1)
443-463:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUse flattened params (not top-level keys) to decide “empty input” for GET/DELETE.
Line 443 can classify inputs as non-empty even when they serialize to no query params (e.g.
{ nested: {} },{ a: null }). In that case,inputis omitted and WP may reject withtype:objectvalidation errors instead of treating it as{}.Suggested fix
- const hasInputKeys = !!input && Object.keys(input).length > 0; + const hasInputKeys = !!input && typeof input === 'object' && Object.keys(input).length > 0; if (method === 'GET' || method === 'DELETE') { @@ - if (hasInputKeys) { - const params = {}; - walkInputToBracketedParams(input, 'input', params); - config.params = params; - } else { - // Empty input. Send `input=` (empty string): WP's - // rest_is_object('') === true, so it validates as an empty object — - // whereas omitting `input` entirely sends `null`, which fails the - // type:object check with a 400. - config.params = { input: '' }; - } + const params = {}; + if (hasInputKeys) { + walkInputToBracketedParams(input, 'input', params); + } + // If flattening yields nothing, preserve the empty-object wire contract. + config.params = Object.keys(params).length > 0 ? params : { input: '' };🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/abilities/loader.js` around lines 443 - 463, The hasInputKeys check on line 443 doesn't account for inputs with keys that serialize to empty params (e.g., { nested: {} }, { a: null }). Instead of checking hasInputKeys before calling walkInputToBracketedParams, always call walkInputToBracketedParams first to flatten the input into params, then check if the resulting params object is empty. If it's empty, set config.params = { input: '' } to ensure WordPress properly validates it as an empty object instead of receiving null.
🧹 Nitpick comments (8)
README.md (1)
295-307: 💤 Low valueDocument
npm run bench:nested-formsin the benchmark section.The README lists three benchmark commands (
bench,bench:field-output,bench:field-storage) but CHANGELOG.md line 34 documents a fourth dev-only benchmark,npm run bench:nested-forms, for testing front-end nested forms rendering. For completeness, add it to the benchmark section:- `npm run bench:nested-forms`: a front-end render test confirming a parent View shows a nested form's Summary Fields.All four are marked dev-only (not shipped in npm), so they belong together in the documentation.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@README.md` around lines 295 - 307, The benchmark section in README.md documents three dev-only benchmark commands (bench, bench:field-output, bench:field-storage) but is missing the fourth command npm run bench:nested-forms, which is already documented in CHANGELOG.md. Add a fourth bullet point to the benchmark section that describes npm run bench:nested-forms as a front-end render test confirming that a parent View shows a nested form's Summary Fields, placing it alongside the other dev-only benchmark commands to keep all benchmark documentation together.CHANGELOG.md (1)
26-26: 💤 Low valueAdverb repetition: rephrase line 26 for clarity.
"Sorting only is now valid only when..." repeats "only" awkwardly. Rephrase to one of:
- "Sorting is now valid only when it genuinely means true"
- "Sorting is now interpreted strictly — valid only when it genuinely means true"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@CHANGELOG.md` at line 26, In the bullet point about `sorting.is_numeric`, the phrase "Sorting only is now valid only when..." contains an awkward repetition of the word "only" that makes the sentence unclear. Rephrase this line to remove the repetition by using one of the suggested alternatives: either "Sorting is now valid only when it genuinely means true" or "Sorting is now interpreted strictly — valid only when it genuinely means true". Choose whichever option best fits the context and maintain the rest of the explanation about how the sorting behavior has changed.bench/README.md (1)
94-106: ⚡ Quick winDocument optional task properties
expectedTurnsandmaxTurnsin the task-structure template.The template shows the required shape (
id,category,setup,prompt,grade,teardown), but the runner code (bench/run.mjs:66–67) also referencestask.expectedTurnsandtask.maxTurnsfor budget tracking and warnings. These are not required (absent tasks skip the check), but developers adding tasks should know about them. Update the template to include them as optional fields.✏️ Proposed addition
{ id: 'category.short-name', category: 'category', + expectedTurns: 5, // optional: expected turn count for this task; triggers ⚠ if exceeded + maxTurns: 20, // optional: hard limit on turns (enforced by runner) async setup(client) { /* create fixtures */ return { /* state */ }; }, prompt: (state) => `natural-language task referencing ${state.…}`, async grade({ client, state, telemetry }) {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@bench/README.md` around lines 94 - 106, The task-structure template in the README documentation is incomplete and does not include the optional properties `expectedTurns` and `maxTurns` that are referenced in the runner code (bench/run.mjs) for budget tracking and warnings. Add these optional fields to the template documentation shown in the diff, making it clear that they are optional and documenting their purpose for turn budgeting and warning thresholds when developers define new tasks.test/abilities-loader.test.js (1)
550-625: ⚡ Quick winAdd DELETE and “flattened-empty input” regression tests.
The new coverage is strong for GET/POST, but it misses two high-value cases in the same changed path: DELETE with empty input, and objects that become empty after bracket-flattening (e.g.
{ nested: {} }).Suggested test additions
+suite.test('executeAbility: empty input on DELETE ability → params {input:\'\'}', async () => { + const stub = buildCatalogStubGvClient([[ + { + name: 'gk-gravityview/view-delete', + description: 'DELETE ability', + input_schema: { type: 'object', properties: {} }, + annotations: { destructive: true, idempotent: true }, + enabled: true, + mcp_tool_name: 'gv_schema_delete', + }, + ]]); + const { handlers } = await loadAbilitiesAsTools(stub); + await handlers.gv_schema_delete({}); + const run = findRun(stub, 'gk-gravityview/view-delete'); + TestAssert.equal(run.method, 'DELETE'); + TestAssert.deepEqual(run.params, { input: '' }); +}); + +suite.test('executeAbility: effectively-empty nested input on GET → params {input:\'\'}', async () => { + const stub = buildCatalogStubGvClient([inputSchemaFenceCatalog()]); + const { handlers } = await loadAbilitiesAsTools(stub); + await handlers.gv_schema_get({ nested: {} }); + const run = findRun(stub, 'gk-gravityview/views-list'); + TestAssert.equal(run.method, 'GET'); + TestAssert.deepEqual(run.params, { input: '' }); +});🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/abilities-loader.test.js` around lines 550 - 625, Add two new regression test cases to this test suite following the existing pattern. First, add a test for a DELETE ability with empty input to ensure it handles the empty input case consistently with the GET and POST tests already present. Second, add a test case for an object that becomes empty after bracket-flattening, such as passing { nested: {} } to verify how the loader handles flattened inputs that resolve to empty values. Both tests should use the same suite.test structure and verify the correct method, params/data structure, and expected flattened output behavior.src/index.js (1)
390-397: ⚡ Quick winKeep delete tool descriptions terse in
tools/list.These new descriptions are clear, but they’re too long for a hot metadata path and increase token overhead on every list response.
✂️ Suggested concise wording
- description: 'Delete a form. Defaults to Trash (recoverable) — proceed with the default; do NOT pause to ask the user trash-vs-permanent. Set force=true only if the user explicitly asks to permanently delete. (Requires ALLOW_DELETE=true.)', + description: 'Delete a form. Default (force=false) moves it to Trash. Use force=true only for explicit permanent-delete requests. Requires ALLOW_DELETE=true.', ... - force: { type: 'boolean', description: 'Permanently delete instead of moving to Trash. Default false (Trash, recoverable).' } + force: { type: 'boolean', description: 'Permanent delete. Default false (move to Trash).' } - description: 'Delete an entry. Defaults to Trash (recoverable) — proceed with the default; do NOT pause to ask the user trash-vs-permanent. Set force=true only if the user explicitly asks to permanently delete. (Requires ALLOW_DELETE=true.)', + description: 'Delete an entry. Default (force=false) moves it to Trash. Use force=true only for explicit permanent-delete requests. Requires ALLOW_DELETE=true.', ... - force: { type: 'boolean', description: 'Permanently delete instead of moving to Trash. Default false (Trash, recoverable).' } + force: { type: 'boolean', description: 'Permanent delete. Default false (move to Trash).' }As per coding guidelines,
src/index.js: Keep tool descriptions insrc/index.jsterse to reducetools/listtoken overhead.Also applies to: 544-551
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/index.js` around lines 390 - 397, The delete form tool description in src/index.js is too verbose and increases token overhead for the tools/list response. Shorten the description field in the delete tool definition (around the id and force properties) to be more concise while retaining only the most essential information about what the tool does. Apply the same brevity principle to the other tool description mentioned around line 544-551 to maintain consistency and reduce overall token usage in metadata responses.Source: Coding guidelines
test/logger-stdout.test.js (1)
31-52: ⚡ Quick winAdd tests for
GRAVITYKIT_MCP_TEST_MODEandGRAVITYMCP_TEST_MODE.
isTestContext()supports both flags, but this suite currently validates onlyNODE_ENV=test.🧪 Suggested additions
test('explicit test mode (NODE_ENV=test) may use stdout', () => { const { stdout } = runLogger({ NODE_ENV: 'test' }); assert.ok(stdout.includes('LOG_MARKER'), 'test mode logs to stdout for visibility'); }); + +test('explicit test mode (GRAVITYKIT_MCP_TEST_MODE=true) may use stdout', () => { + const { stdout } = runLogger({ NODE_ENV: 'development', GRAVITYKIT_MCP_TEST_MODE: 'true' }); + assert.ok(stdout.includes('LOG_MARKER'), 'GRAVITYKIT_MCP_TEST_MODE should route logs to stdout'); +}); + +test('explicit test mode (GRAVITYMCP_TEST_MODE=true) may use stdout', () => { + const { stdout } = runLogger({ NODE_ENV: 'development', GRAVITYMCP_TEST_MODE: 'true' }); + assert.ok(stdout.includes('LOG_MARKER'), 'GRAVITYMCP_TEST_MODE should route logs to stdout'); +});🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/logger-stdout.test.js` around lines 31 - 52, The test suite validates logger behavior for NODE_ENV settings but lacks coverage for the alternative test context detection methods supported by isTestContext(). Add two new test cases following the same pattern as the existing test cases in this file: one that sets the GRAVITYKIT_MCP_TEST_MODE environment variable and verifies logs appear on stdout, and another that sets GRAVITYMCP_TEST_MODE and performs the same verification. Both tests should use the runLogger() function and assert that LOG_MARKER is present in stdout when these environment variables are set, ensuring parity with the NODE_ENV=test behavior.test/server-lifecycle.test.js (1)
22-22: ⚡ Quick winSpawn Node via
process.execPathfor test portability.Hardcoding
'node'can fail in CI/runtime setups where PATH differs from the test process binary.🔧 Suggested change
- const child = spawn('node', [SERVER], { stdio: ['pipe', 'pipe', 'pipe'] }); + const child = spawn(process.execPath, [SERVER], { stdio: ['pipe', 'pipe', 'pipe'] });🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/server-lifecycle.test.js` at line 22, The spawn call in the test file hardcodes the string 'node' as the executable, which fails in CI environments with different PATH configurations. Replace the hardcoded 'node' string with process.execPath, which provides the absolute path to the Node.js executable that is currently running the test process. This ensures the spawned child process uses the same Node binary as the test runner, making the test portable across different runtime environments.bench/lib/target.mjs (1)
16-16: 🏗️ Heavy liftAlign local ESM imports to the
.jsextension convention.Line 16 imports
../config.mjs; this diverges from the project-wide import-extension rule and repeats across new bench modules.As per coding guidelines,
**/*.{js,mjs}: Use ESM throughout the project with.jsextensions on all imports.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@bench/lib/target.mjs` at line 16, The import statement for CONFIG and MCP_ENTRY uses the .mjs extension which diverges from the project convention of using .js extensions for ESM imports. Change the import path from ../config.mjs to ../config.js to align with the project-wide import-extension rule established for all ESM modules.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@bench/demo.mjs`:
- Around line 46-53: The main function skips the destroySite cleanup when
runAgent() or any subsequent step throws an error, leaving the minted
environment running. Wrap the runAgent() call and any other steps that follow
the provisionSite() call in a try-finally block, moving the
destroySite(prov.name) call to the finally block to ensure cleanup always
executes regardless of whether an error occurs.
In `@bench/field-storage.mjs`:
- Around line 62-83: The main function does not guarantee cleanup of minted
sites when errors occur during execution. When args.mint is true, the site is
provisioned via provisionSite() and mintedName is set, but if an exception is
thrown in the subsequent code (after the provisioning block and before any
cleanup), the minted site is never destroyed. Wrap the code that executes after
the site provisioning (starting from where makeClient is called through the end
of the function logic) in a try-finally block that ensures destroySite() is
called with mintedName only when mintedName is not null, guaranteeing cleanup
regardless of whether an error occurs during execution.
In `@bench/lib/agent.mjs`:
- Around line 197-236: The spawn call for CLAUDE_BIN lacks an error event
handler, so if the process fails to spawn (e.g., missing or non-executable
binary), the error will go unhandled and either crash the process or leave the
Promise unresolved. Add a child.on('error') handler before or alongside the
existing child.on('close') handler that captures the spawn error and calls
resolvePromise with appropriate error details, ensuring the returned object
includes a hardError field set to a descriptive error message based on the error
type.
In `@bench/lib/target.mjs`:
- Around line 64-78: The axios instances gf and wp use validateStatus: () =>
true which treats all HTTP status codes as successful, allowing the read helper
functions (countEntries, getEntries, getForm, findFormByTitle) to misinterpret
HTTP errors as empty results and produce incorrect benchmark scores. Add a
shared assertion in each helper function to verify the response status is in the
2xx range before parsing values, returning an empty/fallback state only for
genuinely successful empty responses. Additionally, update the import statement
on line 16 from config.mjs to config.js to comply with ESM guidelines.
In `@bench/nested-forms.mjs`:
- Around line 239-240: When exceptions are caught in the catch block at lines
239-240 (and the similar catch block at 259-260), the code is converting them to
a normal fail verdict instead of signaling a harness error with the documented
exit code 2. Instead of assigning the verdict object with pass: false when an
exception occurs, call process.exit(2) to properly distinguish harness/runtime
faults from assertion failures in CI. This ensures thrown errors exit with code
2 as documented, while normal assertion failures continue to exit with code 1.
- Around line 74-77: The fetchHtml function reconstructs URLs by joining baseUrl
with path, but this fails when page URLs (pageLink/singleUrl) exist on different
hosts than target.baseUrl, causing it to fetch from the wrong site. Modify the
fetchHtml function signature to accept the complete published page URL directly
instead of baseUrl and path parameters, and update all calls to fetchHtml
(including the ones at lines 209-211 marked as "Also applies to") to pass the
full URL (pageLink or singleUrl) directly rather than reconstructing it from
separate base and path components.
- Around line 74-76: The fetchHtml function's fetch request lacks a timeout
mechanism, which can cause the process to hang indefinitely if a response
stalls. Add a timeout to the fetch call in fetchHtml by using an AbortController
to set a reasonable timeout duration (similar to the axios clients already in
the bench stack). When the timeout is exceeded, the abort controller should
terminate the request and the error should be properly handled to allow the
process to continue.
In `@bench/README.md`:
- Around line 113-115: The cost caveat example at lines 113-115 incorrectly
states "A 7-task suite at 3 runs is ~21 sessions," but the Coverage table at
lines 73-84 enumerates approximately 28 tasks across all categories (discovery,
forms, entries, authoring, views, view-fields, view-widgets, search, and grid).
Update the example to use the correct task count of 28 instead of 7, and
recalculate the sessions estimate accordingly to reflect the actual suite size
and help users accurately budget their time and tokens.
In `@bench/run.mjs`:
- Around line 21-27: The import statements in the run.mjs file use .mjs file
extensions (in the imports from config.mjs, lib/target.mjs, lib/siteminter.mjs,
lib/agent.mjs, lib/score.mjs, lib/report.mjs, and tasks/index.mjs) but the
project convention requires .js extensions on all ESM imports. Change all the
.mjs extension specifiers in these import statements to .js extensions to align
with the repository's ESM extension rule that requires .js extensions across all
imports.
- Around line 82-128: The minted environment cleanup code (checking mintedName
and calling destroySite) only runs on the normal completion path, causing
resource leaks if an exception occurs after provisionSite() succeeds. Use a
try-finally structure wrapping the main logic from where mintedName is
determined through the gate reporting, and move the cleanup code that handles
the mintedName variable (the if/else block that destroys or logs the site) into
a finally block to ensure it executes whether the code completes normally or
throws an exception.
In `@bench/tasks/discovery.mjs`:
- Around line 47-53: The grade method in the discovery task only validates that
views_list or views_scan calls completed successfully, but it does not verify
that the actual response was filtered to the publish status. To fix this, add
additional validation logic to the grade method that inspects the telemetry data
to confirm the response was actually filtered to the publish status (not just
that the call succeeded). This requires checking the response content or status
filtering metadata in the telemetry to ensure the filtering behavior was
correctly applied, preventing false positives where the call succeeds but the
filtering is incorrect.
In `@bench/tasks/entries-crud.mjs`:
- Around line 45-48: The grade function in the entries-crud.mjs file currently
passes the test if the output text contains "ada@example.com" and doesn't
contain "charles@example.com", but this can succeed through prompt echoing alone
without any actual tool invocation. Add a guard check to ensure that the
entries.search tool was actually called in the telemetry before accepting the
test as passing. Verify that telemetry contains evidence of actual tool calls
being made, not just that the final text matches the expected pattern.
In `@bench/tasks/grid.mjs`:
- Around line 39-45: The grade function for the grid.add-row-and-place-field
task only verifies that the Email field (id '2') exists somewhere in the
configuration after adding a row, but does not check that it is specifically
placed in the left column as required by the prompt. Modify the grading logic by
replacing or supplementing the hasEmail check to verify not only that the Email
field exists in the grid, but also that it is positioned in the left column of
the newly added row. This will require inspecting the grid structure to
understand how column positions are represented in cfg.fields and validating the
Email field's column position accordingly.
In `@bench/tasks/index.mjs`:
- Around line 17-26: The import statements in bench/tasks/index.mjs are using
`.mjs` file extensions which deviates from the project's ESM convention of using
`.js` extensions. Change all import statements for the task modules (discovery,
forms, entries, entriesCrud, authoring, views, viewFields, viewWidgets, search,
grid) from importing with `.mjs` extensions to `.js` extensions, and ensure all
corresponding files in the bench directory are renamed from `.mjs` to `.js` to
maintain consistency with the project's ESM import guideline.
In `@bench/tasks/view-fields.mjs`:
- Around line 33-37: The grade method in the view-fields task is checking the
entire field tree using fieldIdsInTree(cfg.fields) instead of restricting the
check to the column area only. For column-specific tasks like view-fields.add
and view-fields.remove, update the grading logic to check only the
column-specific area (likely cfg.fields.columns or a similar column-scoped
property) instead of the entire cfg.fields tree. This applies to both instances
of the grading logic mentioned in the comment.
In `@bench/tasks/views.mjs`:
- Around line 29-30: The teardown method in the task definition chains two
sequential delete operations where a failure in deleteView will prevent
deleteForm from executing, causing resource leaks. Modify the teardown method to
use error handling (such as try-finally blocks or Promise.allSettled) to ensure
both client.deleteView and client.deleteForm are always executed regardless of
whether either operation throws an error. This same pattern needs to be applied
to all other teardown methods in the file that have similar sequential delete
chains (at the other locations mentioned in the comment).
In `@src/field-definitions/field-registry.js`:
- Around line 991-1017: The assignFieldIds function lacks a guard clause to
validate that each field element is a valid object before attempting to spread
it or access its properties. When fields array contains null or non-object
elements, the code throws a runtime error when executing { ...field, id: newId
}. Add an early validation check at the beginning of the map callback in
assignFieldIds to ensure the field is a valid object, and handle invalid items
appropriately to prevent unhandled runtime errors and ensure validation errors
are returned instead.
- Around line 622-630: The generateCompoundInputs() function for creditcard
fields is generating sub-inputs that do not match the
creditcard.storage.subInputs contract. The subInputs object now defines only .1
for card_number_masked and .4 for card_type, but generateCompoundInputs() is
still emitting .4 as Cardholder Name and .5 as Card Type. Update the
generateCompoundInputs() function to only emit sub-inputs .1 and .4 with the
correct mappings to align with the creditcard.storage.subInputs definition,
removing the Cardholder Name and Card Type sub-inputs that are no longer part of
the contract.
In `@src/utils/logger.js`:
- Around line 28-40: The sendLogMessage function is outputting raw, unsanitized
log messages to the console, which can expose sensitive credentials. Modify the
function to sanitize the message parameter using the sanitize utility from
utils/sanitize.js before passing it to any console methods (console.error and
console.log). Apply this sanitization at the entry point of the function so that
all console output throughout sendLogMessage uses sanitized data, protecting
against credential leakage in all code paths including error logs, test context
logs, and server mode logs.
---
Outside diff comments:
In `@src/abilities/loader.js`:
- Around line 443-463: The hasInputKeys check on line 443 doesn't account for
inputs with keys that serialize to empty params (e.g., { nested: {} }, { a: null
}). Instead of checking hasInputKeys before calling walkInputToBracketedParams,
always call walkInputToBracketedParams first to flatten the input into params,
then check if the resulting params object is empty. If it's empty, set
config.params = { input: '' } to ensure WordPress properly validates it as an
empty object instead of receiving null.
---
Nitpick comments:
In `@bench/lib/target.mjs`:
- Line 16: The import statement for CONFIG and MCP_ENTRY uses the .mjs extension
which diverges from the project convention of using .js extensions for ESM
imports. Change the import path from ../config.mjs to ../config.js to align with
the project-wide import-extension rule established for all ESM modules.
In `@bench/README.md`:
- Around line 94-106: The task-structure template in the README documentation is
incomplete and does not include the optional properties `expectedTurns` and
`maxTurns` that are referenced in the runner code (bench/run.mjs) for budget
tracking and warnings. Add these optional fields to the template documentation
shown in the diff, making it clear that they are optional and documenting their
purpose for turn budgeting and warning thresholds when developers define new
tasks.
In `@CHANGELOG.md`:
- Line 26: In the bullet point about `sorting.is_numeric`, the phrase "Sorting
only is now valid only when..." contains an awkward repetition of the word
"only" that makes the sentence unclear. Rephrase this line to remove the
repetition by using one of the suggested alternatives: either "Sorting is now
valid only when it genuinely means true" or "Sorting is now interpreted strictly
— valid only when it genuinely means true". Choose whichever option best fits
the context and maintain the rest of the explanation about how the sorting
behavior has changed.
In `@README.md`:
- Around line 295-307: The benchmark section in README.md documents three
dev-only benchmark commands (bench, bench:field-output, bench:field-storage) but
is missing the fourth command npm run bench:nested-forms, which is already
documented in CHANGELOG.md. Add a fourth bullet point to the benchmark section
that describes npm run bench:nested-forms as a front-end render test confirming
that a parent View shows a nested form's Summary Fields, placing it alongside
the other dev-only benchmark commands to keep all benchmark documentation
together.
In `@src/index.js`:
- Around line 390-397: The delete form tool description in src/index.js is too
verbose and increases token overhead for the tools/list response. Shorten the
description field in the delete tool definition (around the id and force
properties) to be more concise while retaining only the most essential
information about what the tool does. Apply the same brevity principle to the
other tool description mentioned around line 544-551 to maintain consistency and
reduce overall token usage in metadata responses.
In `@test/abilities-loader.test.js`:
- Around line 550-625: Add two new regression test cases to this test suite
following the existing pattern. First, add a test for a DELETE ability with
empty input to ensure it handles the empty input case consistently with the GET
and POST tests already present. Second, add a test case for an object that
becomes empty after bracket-flattening, such as passing { nested: {} } to verify
how the loader handles flattened inputs that resolve to empty values. Both tests
should use the same suite.test structure and verify the correct method,
params/data structure, and expected flattened output behavior.
In `@test/logger-stdout.test.js`:
- Around line 31-52: The test suite validates logger behavior for NODE_ENV
settings but lacks coverage for the alternative test context detection methods
supported by isTestContext(). Add two new test cases following the same pattern
as the existing test cases in this file: one that sets the
GRAVITYKIT_MCP_TEST_MODE environment variable and verifies logs appear on
stdout, and another that sets GRAVITYMCP_TEST_MODE and performs the same
verification. Both tests should use the runLogger() function and assert that
LOG_MARKER is present in stdout when these environment variables are set,
ensuring parity with the NODE_ENV=test behavior.
In `@test/server-lifecycle.test.js`:
- Line 22: The spawn call in the test file hardcodes the string 'node' as the
executable, which fails in CI environments with different PATH configurations.
Replace the hardcoded 'node' string with process.execPath, which provides the
absolute path to the Node.js executable that is currently running the test
process. This ensures the spawned child process uses the same Node binary as the
test runner, making the test portable across different runtime environments.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 173d7293-f9e3-43f3-bb53-1fb3951fecbb
📒 Files selected for processing (44)
AGENTS.mdCHANGELOG.mdREADME.mdbench/README.mdbench/config.mjsbench/demo.mjsbench/field-output-cases.mjsbench/field-output.mjsbench/field-storage-cases.mjsbench/field-storage.mjsbench/lib/agent.mjsbench/lib/report.mjsbench/lib/score.mjsbench/lib/siteminter.mjsbench/lib/target.mjsbench/nested-forms.mjsbench/run.mjsbench/tasks/authoring.mjsbench/tasks/discovery.mjsbench/tasks/entries-crud.mjsbench/tasks/entries.mjsbench/tasks/forms.mjsbench/tasks/grid.mjsbench/tasks/helpers.mjsbench/tasks/index.mjsbench/tasks/search.mjsbench/tasks/view-fields.mjsbench/tasks/view-widgets.mjsbench/tasks/views.mjspackage.jsonsrc/abilities/loader.jssrc/field-definitions/field-registry.jssrc/field-operations/field-manager.jssrc/field-operations/index.jssrc/gravity-forms-client.jssrc/index.jssrc/server-runtime.jssrc/utils/logger.jstest/abilities-loader.test.jstest/field-manager.test.jstest/field-registry.test.jstest/logger-stdout.test.jstest/server-lifecycle.test.jstest/server-runtime.test.js
| async function main() { | ||
| console.log('\nMinting (or reusing) a GravityView 3.0 site for the natural-language demo…'); | ||
| const prov = await provisionSite({ | ||
| fresh: args.fresh, | ||
| name: SITE, | ||
| plugins: [...SITEMINTER.plugins, ...SITEMINTER.addons], | ||
| log: (m) => console.log(`[siteminter] ${m}`), | ||
| }); |
There was a problem hiding this comment.
Demo teardown is skipped when the run loop throws.
destroySite(prov.name) executes only on the happy path. If runAgent() (or any later step) throws, Line 114 exits and leaves the minted environment running.
Suggested fix
async function main() {
console.log('\nMinting (or reusing) a GravityView 3.0 site for the natural-language demo…');
const prov = await provisionSite({
@@
const { target } = prov;
- const plugins = activePlugins(prov.path);
- console.log(`\nSite: ${target.baseUrl}`);
- for (const slug of ['gravityforms', 'GravityView', 'spellbook', 'gp-nested-forms']) {
- if (plugins[slug]) console.log(` • ${slug} ${plugins[slug]}`);
- }
+ try {
+ const plugins = activePlugins(prov.path);
+ console.log(`\nSite: ${target.baseUrl}`);
+ for (const slug of ['gravityforms', 'GravityView', 'spellbook', 'gp-nested-forms']) {
+ if (plugins[slug]) console.log(` • ${slug} ${plugins[slug]}`);
+ }
- const mcpConfigPath = writeMcpConfig(target);
- const traceDir = join(CONFIG.outDir, 'demo');
- mkdirSync(traceDir, { recursive: true });
+ const mcpConfigPath = writeMcpConfig(target);
+ const traceDir = join(CONFIG.outDir, 'demo');
+ mkdirSync(traceDir, { recursive: true });
- // ... run loop + reporting ...
+ // ... run loop + reporting ...
+ } finally {
+ if (!args.keep) { try { destroySite(prov.name); } catch { /* best effort */ } console.log(`\nTore down "${prov.name}".`); }
+ else console.log(`\n[siteminter] keeping "${prov.name}" at ${target.baseUrl} (admin/admin) so you can inspect what it built.`);
+ }
- if (!args.keep) { try { destroySite(prov.name); } catch { /* best effort */ } console.log(`\nTore down "${prov.name}".`); }
- else console.log(`\n[siteminter] keeping "${prov.name}" at ${target.baseUrl} (admin/admin) so you can inspect what it built.`);
}Also applies to: 73-112, 114-114
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@bench/demo.mjs` around lines 46 - 53, The main function skips the destroySite
cleanup when runAgent() or any subsequent step throws an error, leaving the
minted environment running. Wrap the runAgent() call and any other steps that
follow the provisionSite() call in a try-finally block, moving the
destroySite(prov.name) call to the finally block to ensure cleanup always
executes regardless of whether an error occurs.
| async function main() { | ||
| const args = parseArgs(process.argv); | ||
| const cases = args.types ? CASES.filter((cs) => args.types.includes(cs.type)) : CASES; | ||
| if (!cases.length) { console.error(`No storage cases match --type ${args.types}`); process.exit(2); } | ||
|
|
||
| let target; | ||
| let mintedName = null; | ||
| let plugins = null; // active-plugins map; null when not minting | ||
| if (args.mint) { | ||
| const prov = await provisionSite({ | ||
| fresh: args.fresh, | ||
| name: STORAGE_SITE, | ||
| plugins: [...SITEMINTER.plugins, ...SITEMINTER.addons], | ||
| log: (m) => console.log(`[siteminter] ${m}`), | ||
| }); | ||
| target = prov.target; | ||
| mintedName = prov.name; | ||
| plugins = activePlugins(prov.path); | ||
| } else { | ||
| target = resolveTarget(); | ||
| } | ||
| const client = makeClient(target); |
There was a problem hiding this comment.
Minted site teardown is not guaranteed on early failures.
If an exception is thrown before Line 139 (for example, preflight/plugin validation), the .catch() at Line 155 exits without calling destroySite(), leaving minted resources behind.
Suggested fix
async function main() {
const args = parseArgs(process.argv);
+ let mintedName = null;
- let target;
- let mintedName = null;
- let plugins = null; // active-plugins map; null when not minting
- if (args.mint) {
- const prov = await provisionSite({
- fresh: args.fresh,
- name: STORAGE_SITE,
- plugins: [...SITEMINTER.plugins, ...SITEMINTER.addons],
- log: (m) => console.log(`[siteminter] ${m}`),
- });
- target = prov.target;
- mintedName = prov.name;
- plugins = activePlugins(prov.path);
- } else {
- target = resolveTarget();
- }
- const client = makeClient(target);
+ try {
+ let target;
+ let plugins = null; // active-plugins map; null when not minting
+ if (args.mint) {
+ const prov = await provisionSite({
+ fresh: args.fresh,
+ name: STORAGE_SITE,
+ plugins: [...SITEMINTER.plugins, ...SITEMINTER.addons],
+ log: (m) => console.log(`[siteminter] ${m}`),
+ });
+ target = prov.target;
+ mintedName = prov.name;
+ plugins = activePlugins(prov.path);
+ } else {
+ target = resolveTarget();
+ }
+ const client = makeClient(target);
// ... existing suite logic ...
-
- if (mintedName && !args.keep) { try { destroySite(mintedName); } catch { /* best effort */ } }
- else if (mintedName) console.log(`\n[siteminter] keeping "${mintedName}".`);
+ } finally {
+ if (mintedName && !args.keep) { try { destroySite(mintedName); } catch { /* best effort */ } }
+ else if (mintedName) console.log(`\n[siteminter] keeping "${mintedName}".`);
+ }
}Also applies to: 139-156
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@bench/field-storage.mjs` around lines 62 - 83, The main function does not
guarantee cleanup of minted sites when errors occur during execution. When
args.mint is true, the site is provisioned via provisionSite() and mintedName is
set, but if an exception is thrown in the subsequent code (after the
provisioning block and before any cleanup), the minted site is never destroyed.
Wrap the code that executes after the site provisioning (starting from where
makeClient is called through the end of the function logic) in a try-finally
block that ensures destroySite() is called with mintedName only when mintedName
is not null, guaranteeing cleanup regardless of whether an error occurs during
execution.
| const gf = axios.create({ | ||
| baseURL: `${target.baseUrl}/wp-json/gf/v2`, | ||
| auth: { username: target.key, password: target.secret }, | ||
| httpsAgent: agent, | ||
| timeout: 30000, | ||
| validateStatus: () => true, | ||
| }); | ||
|
|
||
| const wp = axios.create({ | ||
| baseURL: `${target.wpUrl}/wp-json`, | ||
| auth: { username: target.wpUser, password: target.wpPass }, | ||
| httpsAgent: agent, | ||
| timeout: 30000, | ||
| validateStatus: () => true, | ||
| }); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify current non-2xx handling paths in bench/lib/target.mjs
rg -n "validateStatus:\s*\(\)\s*=>\s*true" bench/lib/target.mjs
rg -n "async (countEntries|getEntries|getForm|findFormByTitle)\(" bench/lib/target.mjs
rg -n "return Number\.isFinite|return Array\.isArray|return res\.data \|\| \{\}|\\|\\| null" bench/lib/target.mjsRepository: GravityKit/MCP
Length of output: 704
🏁 Script executed:
head -25 bench/lib/target.mjsRepository: GravityKit/MCP
Length of output: 1026
🏁 Script executed:
rg -n "^import|^export" bench/lib/target.mjs | head -20Repository: GravityKit/MCP
Length of output: 392
🏁 Script executed:
fd -t f config '\.(mjs|js)$' bench/Repository: GravityKit/MCP
Length of output: 151
🏁 Script executed:
fd -t f config '\.(mjs|js)$' . --max-depth 2Repository: GravityKit/MCP
Length of output: 153
Fail fast on non-2xx grader reads to prevent false gate results.
With validateStatus: () => true (lines 69, 77), read helpers (countEntries, getEntries, getForm, findFormByTitle at lines 123–145) can treat HTTP errors as valid "empty" state, mis-scoring benchmark outcomes (e.g., auth/server failure interpreted as "no entries/forms found"). Add a shared 2xx assertion before parsing fallback values.
Also, line 16 imports config.mjs; update to config.js per ESM guidelines.
Suggested patch
-import { CONFIG, MCP_ENTRY } from '../config.mjs';
+import { CONFIG, MCP_ENTRY } from '../config.js';
@@
const gf = axios.create({
@@
validateStatus: () => true,
});
@@
const wp = axios.create({
@@
validateStatus: () => true,
});
+
+ const expect2xx = (res, op) => {
+ if (res.status < 200 || res.status >= 300) {
+ throw new Error(`${op} failed (${res.status}): ${JSON.stringify(res.data).slice(0, 200)}`);
+ }
+ };
@@
async countEntries(formId) {
const res = await gf.get(`/forms/${formId}/entries`, { params: { paging: { page_size: 1 } } });
+ expect2xx(res, `countEntries(${formId})`);
const total = res.data?.total_count;
@@
async getEntries(formId, params = {}) {
const res = await gf.get(`/forms/${formId}/entries`, { params });
+ expect2xx(res, `getEntries(${formId})`);
return Array.isArray(res.data?.entries) ? res.data.entries : [];
},
@@
async getForm(formId) {
const res = await gf.get(`/forms/${formId}`);
+ expect2xx(res, `getForm(${formId})`);
return res.data || {};
},
@@
async findFormByTitle(title) {
const res = await gf.get('/forms');
+ expect2xx(res, 'findFormByTitle');
const forms = res.data && typeof res.data === 'object' ? Object.values(res.data) : [];🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@bench/lib/target.mjs` around lines 64 - 78, The axios instances gf and wp use
validateStatus: () => true which treats all HTTP status codes as successful,
allowing the read helper functions (countEntries, getEntries, getForm,
findFormByTitle) to misinterpret HTTP errors as empty results and produce
incorrect benchmark scores. Add a shared assertion in each helper function to
verify the response status is in the 2xx range before parsing values, returning
an empty/fallback state only for genuinely successful empty responses.
Additionally, update the import statement on line 16 from config.mjs to
config.js to comply with ESM guidelines.
| async function fetchHtml(baseUrl, path) { | ||
| const res = await fetch(`${baseUrl}${path}`, { redirect: 'follow' }); | ||
| return { status: res.status, html: await res.text() }; |
There was a problem hiding this comment.
Front-end fetches need a timeout to keep the suite from hanging.
Unlike the axios clients in this bench stack, this raw fetch path has no timeout. A stalled response can block the process indefinitely.
Suggested fix
async function fetchHtml(url) {
- const res = await fetch(url, { redirect: 'follow' });
- return { status: res.status, html: await res.text() };
+ const controller = new AbortController();
+ const timer = setTimeout(() => controller.abort(), 30_000);
+ try {
+ const res = await fetch(url, { redirect: 'follow', signal: controller.signal });
+ return { status: res.status, html: await res.text() };
+ } finally {
+ clearTimeout(timer);
+ }
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@bench/nested-forms.mjs` around lines 74 - 76, The fetchHtml function's fetch
request lacks a timeout mechanism, which can cause the process to hang
indefinitely if a response stalls. Add a timeout to the fetch call in fetchHtml
by using an AbortController to set a reasonable timeout duration (similar to the
axios clients already in the bench stack). When the timeout is exceeded, the
abort controller should terminate the request and the error should be properly
handled to allow the process to continue.
… empty
Inputs that have keys but flatten to no query params (e.g. { nested: {} } or
{ a: null }) were sent with an empty params object, so the WP Abilities API
got no `input` arg and rejected GET/DELETE with a 400 against type:object.
- Flatten the input first, then fall back to input="" when the result is empty
- Covers truly-empty input and keys that serialize away
- Add loader regression tests (DELETE-empty parity, {nested:{}}, {a:null})
generateCompoundInputs emitted .4 as "Cardholder Name" and .5 as "Card Type", the swapped order. Verified against class-gf-field-creditcard.php: .4 is Card Type and .5 is Cardholder Name (only .1 and .4 are persisted), realigning the createForm path with generateSubInputs and storage.subInputs. - Swap the .4/.5 labels in generateCompoundInputs - Update the field-registry test that had locked in the swapped order
The release gate's value is its grading; several graders passed on the wrong signal and gave false confidence. - entries.read/search require a real read/search tool call, not a prompt echo - view-fields.add/remove scope the check to the table-columns area (columnIds) - grid.add-row-and-place-field requires Email in a newly added area - Add bench-grading unit tests covering each contract
If the claude binary is missing or not executable, the ChildProcess 'error'
event was unhandled: it crashed the gate or left the run's Promise pending.
- Add child.on('error') that resolves a spawn_failed hardError so the run
scores as a failure and the suite continues
- Add a bin parameter to runOnce for testability and export it
- Add a bench-agent regression test
destroySite ran only on the normal completion path, so a throw between provisioning and cleanup leaked a running Docker site. - Add a tested cleanupMintedSite helper (best-effort, never throws) - Wrap run.mjs main in try/finally so cleanup runs on every exit - Add a bench-cleanup unit test
Hardcoding 'node' fails in CI whose PATH resolves a different (or no) node. process.execPath uses the same binary running the test.
The new bench grader/runner tests run only when listed in test:node, which also gates prepublishOnly.
mcp.json was left at 2.3.0 when package.json bumped to 2.4.0; the server-tools test asserts the two match.
- Add test/AGENTS.md covering the two test harnesses, registration, and the bench AI gate; update the AGENTS.md testing section to match - README: add the npm run bench:nested-forms command - bench/README: correct the suite size (32 tasks) and document expectedTurns/maxTurns in the task template - CHANGELOG: remove the doubled "only" in the 2.4.0 sorting note
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/bench-grading.test.js`:
- Around line 42-52: The test for 'entries.read' is passing the wrong input
format to task.grade. The grader expects a client and state-based input
(specifically calling client.getEntry with state.entryId) as defined in
bench/tasks/entries-crud.mjs, but the tests on lines 45 and 49 are currently
passing telemetry with toolCalls and finalText. Update both task.grade
invocations to pass an object with the correct shape: a client mock/object and
state object containing entryId, rather than telemetry data. This will ensure
the tests validate the actual grading contract instead of failing for the wrong
reason (missing client parameter).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: f875a7a9-b975-47d4-8373-aeaf8e65a302
📒 Files selected for processing (21)
AGENTS.mdCHANGELOG.mdREADME.mdbench/README.mdbench/lib/agent.mjsbench/lib/siteminter.mjsbench/run.mjsbench/tasks/entries-crud.mjsbench/tasks/grid.mjsbench/tasks/view-fields.mjsmcp.jsonpackage.jsonsrc/abilities/loader.jssrc/field-definitions/field-registry.jstest/AGENTS.mdtest/abilities-loader.test.jstest/bench-agent.test.jstest/bench-cleanup.test.jstest/bench-grading.test.jstest/field-registry.test.jstest/server-lifecycle.test.js
✅ Files skipped from review due to trivial changes (4)
- mcp.json
- AGENTS.md
- README.md
- CHANGELOG.md
🚧 Files skipped from review as they are similar to previous changes (10)
- bench/README.md
- bench/tasks/entries-crud.mjs
- bench/tasks/view-fields.mjs
- package.json
- bench/lib/siteminter.mjs
- src/abilities/loader.js
- bench/run.mjs
- bench/tasks/grid.mjs
- bench/lib/agent.mjs
- src/field-definitions/field-registry.js
Brings
mainup to 2.4.0 with the 28 commits from this development cycle.Summary
Gravity Forms entries created or read through the MCP now use the correct storage shape for every field type, so an AI assistant gets the entry format right the first time. Adds support for the
passwordfield type (now 46 field types), plus dev-only benchmark suites that grade the whole tool surface with a small model before each release.Highlights
🚀 Added
passwordfield type in the registry (now 46 field types, up from 45).entry_inputhints ingf_list_field_typessummary mode for every field type whose entry format is not an obvious plain string.form) field config in the registry andgf_list_field_types --detail(GP Nested FormsgpnfForm+ Summary Fields).✨ Improved
gf_create_formauto-assigns field ids when omitted (GF max+1), so callers can describe fields without hand-numbering them.gf_delete_form/gf_delete_entrydescriptions now state the safe Trash default explicitly.instructionspoint the search-bar flow at the one-callgv_search_bar_addroute.GRAVITYKIT_MCP_LIST_TIMEOUT_MS.🐛 Fixed
number/quantity,signature,creditcard,post_content,post_image,fileupload,product,form).chainedselectsub-inputs are now generated (one per dropdown level).entry_inputhints foraddress,chainedselect,survey_rank,date, and the survey/quiz/poll tokens.NODE_ENV=developmentwas wrongly routed to stdout)./runcalls always send an objectinput(WP Abilities API rejects null with a 400).sorting.is_numericis interpreted strictly (a JSON string"false"no longer forces numeric ordering).node src/index.jsprocesses.💻 Developer Notes
npm run bench,bench:field-output,bench:field-storage,bench:nested-forms.Full detail in
CHANGELOG.md.Summary by CodeRabbit
New Features
gf_list_field_typesmetadata andentry_inputhintsgf_list_field_types --detailoutputBug Fixes
Documentation
force=trueandALLOW_DELETE=true)