eforge eval

End-to-end evaluation harness for eforge. It runs eforge against fixture projects and validates that the output compiles and its tests pass.

Prerequisites

Node.js >= 22.6.0 (for native SQLite support)
eforge on PATH (or set EFORGE_BIN)
pnpm (for dependency installation)
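
The prerequisites above can be checked up front with a small shell helper. This is a sketch, not part of the harness: `version_ge` and `check_prereqs` are hypothetical names, and the check assumes `node --version` prints a plain `vMAJOR.MINOR.PATCH` string.

```shell
# Succeed if version $1 >= version $2 (numeric MAJOR.MINOR.PATCH only).
version_ge() {
  awk -v have="$1" -v want="$2" 'BEGIN {
    split(have, h, "."); split(want, w, ".")
    for (i = 1; i <= 3; i++) {
      if (h[i] + 0 > w[i] + 0) exit 0
      if (h[i] + 0 < w[i] + 0) exit 1
    }
    exit 0
  }'
}

# Report every missing prerequisite on stderr; fail if any is missing.
check_prereqs() {
  local missing=0 node_version
  if command -v node >/dev/null 2>&1; then
    node_version=$(node --version)          # e.g. v22.6.0
    if ! version_ge "${node_version#v}" 22.6.0; then
      echo "Node.js >= 22.6.0 required, found $node_version" >&2
      missing=1
    fi
  else
    echo "node not found on PATH" >&2
    missing=1
  fi
  # EFORGE_BIN overrides the default binary name; it must be executable.
  command -v "${EFORGE_BIN:-eforge}" >/dev/null 2>&1 || {
    echo "eforge not found or not executable (set EFORGE_BIN)" >&2; missing=1; }
  command -v pnpm >/dev/null 2>&1 || {
    echo "pnpm not found on PATH" >&2; missing=1; }
  return "$missing"
}
```

Calling `check_prereqs && ./run.sh` would then stop early with a readable message instead of failing partway through a scenario.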

Setup

pnpm install

Usage

./run.sh                              # Run all scenarios
./run.sh todo-health-check            # Run one scenario by ID
./run.sh pi-codex-todo-api-errand-health-check  # Run one Pi/Codex OAuth scenario
./open-monitor.sh                     # Open monitor for eval results DB
./run.sh --dry-run                    # Set up workspaces without running eforge
./run.sh --env-file .env              # Source env vars (e.g. Langfuse credentials)
./run.sh --cleanup                    # Remove all results
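
The --env-file option is described as sourcing env vars. The usual shell pattern for that is sketched below, assuming the file holds plain KEY=value lines that are valid shell assignments. `load_env_file` is a hypothetical helper; run.sh may implement the option differently.

```shell
# Source a KEY=value file and export every variable it defines.
load_env_file() {
  local env_file="$1"
  [ -f "$env_file" ] || { echo "env file not found: $env_file" >&2; return 1; }
  # A bare name such as ".env" would be looked up on PATH by ".", so anchor it.
  case $env_file in
    */*) ;;
    *) env_file="./$env_file" ;;
  esac
  set -a            # auto-export variables assigned while sourcing
  . "$env_file"
  set +a
}
```

Because the variables are exported, child processes started afterwards (such as eforge) inherit them.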

Environment variables

| Variable | Default | Description |
| --- | --- | --- |
| `EFORGE_BIN` | `eforge` | Path to eforge binary. Use this to test a local build (e.g. `EFORGE_BIN=~/projects/eforge/dist/cli.js`) |
| `EFORGE_MONITOR_DB` | (auto-set) | Shared SQLite DB for metrics. Set automatically by the harness. |
| `EFORGE_TRACE_TAGS` | (auto-set) | Langfuse trace tags. Set automatically per scenario. |

Pi provider auth

Pi-backed scenarios can authenticate in two ways:

API key env vars via a scenario envFile such as env/pi.env
OAuth or cached credentials from ~/.pi/agent/auth.json

This repo now includes Pi scenarios for both:

OpenRouter API-key-based Pi runs
OpenAI Codex OAuth runs using pi.provider: openai-codex and agents.models.max: gpt-5.4

If you are testing Codex through Pi, make sure you have already logged in with Pi in your user environment before running the evals.
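
Before starting a Pi scenario, it can be useful to see which of the two auth sources is actually present. The sketch below only checks that the files exist and are non-empty; it does not validate credentials. `pi_auth_sources` is a hypothetical helper, and the paths are the ones named above.

```shell
# Print the Pi auth sources that are available, one per line.
# $1: path to a scenario env file (e.g. env/pi.env); it may not exist.
pi_auth_sources() {
  local env_file="$1"
  local auth_json="$HOME/.pi/agent/auth.json"
  [ -s "$env_file" ] && echo "api-key: $env_file"
  [ -s "$auth_json" ] && echo "oauth: $auth_json"
  return 0
}
```

An empty result for an OAuth scenario suggests logging in with Pi first, as described above.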

How it works

Each scenario copies a fixture to a temp directory under /tmp/ and initializes a fresh git repo there.
The harness runs eforge run <prd> --auto --verbose --foreground --no-monitor from the temp workspace.
Events are recorded to a shared SQLite DB (results/monitor.db) via EFORGE_MONITOR_DB.
Validation commands (type-check, tests, etc.) run against the workspace.
Results are aggregated into results/<timestamp>/summary.json.

A monitor server starts from the eval repo root, providing a stable web UI for observing runs. Individual eforge runs use --no-monitor (foreground mode, writing directly to the shared DB).
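
The per-scenario flow described above can be sketched in shell as follows. This is an illustration under assumptions, not the harness's actual code: the function names and the temp directory prefix are hypothetical, and stopping at the first failed validation command is an assumed behavior. The eforge flags are the ones listed above.

```shell
# Copy a fixture into a fresh temp workspace and initialize a git repo there.
# Prints the workspace path.
setup_workspace() {
  local fixture_dir="$1" workspace
  workspace=$(mktemp -d /tmp/eforge-eval.XXXXXX)   # hypothetical prefix
  cp -R "$fixture_dir"/. "$workspace"/
  git -C "$workspace" init --quiet
  printf '%s\n' "$workspace"
}

# Run eforge on the PRD from inside the workspace, recording events
# to the shared monitor DB.
run_eforge() {
  local workspace="$1" prd="$2" monitor_db="$3"
  ( cd "$workspace" && EFORGE_MONITOR_DB="$monitor_db" \
      "${EFORGE_BIN:-eforge}" run "$prd" --auto --verbose --foreground --no-monitor )
}

# Run each validation command in the workspace, in order.
# Assumption: the first failing command ends validation.
run_validations() {
  local workspace="$1" cmd
  shift
  for cmd in "$@"; do
    ( cd "$workspace" && sh -c "$cmd" ) || {
      echo "validation failed: $cmd" >&2
      return 1
    }
  done
}
```

A single scenario would then look like `ws=$(setup_workspace fixtures/my-fixture)`, followed by `run_eforge "$ws" docs/my-prd.md "$PWD/results/monitor.db"` and `run_validations "$ws" "pnpm install" "pnpm type-check" "pnpm test"`.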

Adding scenarios

Edit scenarios.yaml:

scenarios:
  - id: my-scenario
    fixture: my-fixture        # Directory under fixtures/
    prd: docs/my-prd.md        # PRD path within the fixture
    description: "What this tests"
    validate:
      - pnpm install
      - pnpm type-check
      - pnpm test
    expect:                    # Optional
      mode: errand
      buildStagesContain: [implement]

Create the fixture under fixtures/my-fixture/ with source code and the PRD file.
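
A quick way to catch a mistyped fixture or prd value is to confirm that both paths exist before running the scenario. The helper below is a hypothetical sketch; it takes the values by hand rather than parsing scenarios.yaml.

```shell
# Check that a scenario's fixture directory and PRD file exist.
# $1: fixture name, $2: PRD path within the fixture, $3: fixtures root (default: fixtures)
check_scenario_files() {
  local fixture="$1" prd="$2" fixtures_root="${3:-fixtures}"
  if [ ! -d "$fixtures_root/$fixture" ]; then
    echo "missing fixture directory: $fixtures_root/$fixture" >&2
    return 1
  fi
  if [ ! -f "$fixtures_root/$fixture/$prd" ]; then
    echo "missing PRD file: $fixtures_root/$fixture/$prd" >&2
    return 1
  fi
}
```

For the example scenario above, the call would be `check_scenario_files my-fixture docs/my-prd.md`.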

For Pi scenarios, configure the provider under pi.provider and the model under agents.model or agents.models.*. Do not use pi.model; that is no longer part of eforge's Pi config schema.

Results

Results are stored in results/<timestamp>/ (gitignored) with:

summary.json - aggregate metrics across all scenarios
<scenario>/result.json - per-scenario metrics, validation, expectations
<scenario>/eforge.log - full eforge output
<scenario>/orchestration.yaml - preserved plan metadata
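
To inspect the most recent run, a helper like the one below can locate the newest results directory. It is a sketch that assumes the timestamp directory names sort chronologically when sorted as text; `latest_results_dir` is a hypothetical name. Plain files in results/, such as monitor.db, are skipped because the glob matches directories only.

```shell
# Print the newest results/<timestamp> directory, or fail if there is none.
# $1: results root (default: results)
latest_results_dir() {
  local results_root="${1:-results}" dir latest=
  # Globs expand in sorted order, so the last match is the newest
  # under the sortable-timestamp assumption.
  for dir in "$results_root"/*/; do
    [ -d "$dir" ] || continue
    latest=${dir%/}
  done
  [ -n "$latest" ] && printf '%s\n' "$latest"
}
```

For example, `cat "$(latest_results_dir)/summary.json"` prints the aggregate metrics of the latest run.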
