Most voice agent benchmarks evaluate either what the agent does or how it sounds — EVA evaluates both.
EVA is an open-source evaluation framework for conversational voice agents that scores complete, multi-turn spoken conversations across two fundamental dimensions:
- 🎯 EVA-A (Accuracy) — Did the agent complete the task correctly and faithfully?
- ✨ EVA-X (Experience) — Was the interaction natural, concise, and appropriate for spoken dialogue?
Using a realistic bot-to-bot architecture, EVA runs fully automated evaluations without human listeners — end to end, from speech in to judgment out.
- Metrics for both EVA-A and EVA-X, fully documented and validated, with judge prompts and implementation code included
- 50 airline scenarios spanning flight rebooking, cancellations, vouchers, and more
- Results for 20 cascade and audio-native systems (speech-to-speech models, large audio language models) — see Experiment Setup for model configurations.
Agents that score well on task completion tend to score worse on conversational experience — and vice versa. The accuracy–experience tradeoff is real, consistent, and previously unmeasured.
If you're only interested in running the latest stable version of EVA, you can clone with `--branch latest`, and optionally speed things up with `--depth 1 --no-tags --single-branch`:

```bash
git clone https://github.com/ServiceNow/eva.git --branch latest --depth 1 --no-tags --single-branch
```

Otherwise, for development, you can clone the default branch, `main`:

```bash
git clone https://github.com/ServiceNow/eva.git
```

We recommend using uv for fast, reliable dependency management. If you don't have uv installed, see the uv installation guide.
This project requires Python 3.11–3.13 (set via `requires-python` in `pyproject.toml`). uv will automatically select a compatible version. If you're using pip, make sure you're running a supported Python version.
```bash
cd eva

# Install all dependencies (uv automatically creates a virtual environment)
uv sync --all-extras

# Copy the environment template
cp .env.example .env

# Edit .env with your API keys (ELEVENLABS_API_KEY and OPENAI_API_KEY are required)
```

After installation, you can run EVA using either:
- `eva` — CLI entry point (e.g., `eva --help`)
- `python main.py` — script at the repo root (e.g., `python main.py --help`)
If using an IDE, point your Python interpreter to `.venv/bin/python` so commands run in the virtual environment automatically. Otherwise, prefix commands with `uv run` or activate the environment with `source .venv/bin/activate`.
Alternative: using pip
This project requires Python 3.11–3.13. If you need to manage multiple Python versions, consider using pyenv.
```bash
# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install --upgrade pip
pip install -e ".[dev]"
```

Required:
- `OPENAI_API_KEY` (or another LLM provider): Powers the assistant LLM and text judge metrics
- `EVA_MODEL_LIST`: Model deployments that reference your API key (see `.env.example`). Also configurable via the `--model-list` CLI flag. Only used for regular LLMs.
- `ELEVENLABS_API_KEY` + agent IDs: For user simulation
- STT/TTS API key and model: Passed via `EVA_MODEL__STT_PARAMS` / `EVA_MODEL__TTS_PARAMS` (the default provider is Cartesia)
For all metrics:
- `OPENAI_API_KEY`: GPT-5.2 for text judge metrics (task completion, conciseness, turn taking, etc.)
- `GOOGLE_APPLICATION_CREDENTIALS`: Gemini via Vertex AI (audio judge metrics)
- `AWS_ACCESS_KEY_ID` + `AWS_SECRET_ACCESS_KEY`: Claude via Bedrock (faithfulness metric)
Key Environment Variables:
```bash
# Framework Configuration
EVA_DOMAIN=airline                    # Domain-based path conventions
EVA_MAX_CONCURRENT_CONVERSATIONS=5    # Max parallel conversations
EVA_DEBUG=false                       # Run only 1 record for testing when enabled
EVA_RECORD_IDS=1.2.1,1.2.2            # Run specific records only (remove to run all records)

# Pipeline Model Configuration (nested under EVA_MODEL__)
EVA_MODEL__LLM=gpt-5-mini             # LLM model name (must match EVA_MODEL_LIST)
EVA_MODEL__STT=deepgram               # deepgram | openai_whisper
EVA_MODEL__TTS=cartesia               # cartesia | elevenlabs
EVA_MODEL__STT_PARAMS={"api_key":"", "alias": "deepgram-nova-3", "model": "nova-3"}
EVA_MODEL__TTS_PARAMS={"api_key":"", "alias": "cartesia-sonic-3", "model": "sonic-3"}

# Or speech-to-speech model (mutually exclusive with LLM)
# EVA_MODEL__S2S=gpt-realtime-mini    # Audio-native model name (S2S, S2T+TTS)

# Logging
EVA_LOG_LEVEL=INFO                    # DEBUG | INFO | WARNING | ERROR
```

See `.env.example` for the complete list of configuration options.
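The `EVA_MODEL__*` variables use a double-underscore nesting convention, with JSON values for structured parameters. The sketch below illustrates how such variables could be parsed into a nested config; it is purely illustrative and is not EVA's actual settings loader:

```python
import json
import os

def load_nested_env(prefix: str = "EVA_", delimiter: str = "__") -> dict:
    """Collect prefixed environment variables into a nested dict.

    Keys containing the delimiter (e.g. EVA_MODEL__STT) become nested
    entries; values that parse as JSON are decoded, others stay strings.
    """
    config: dict = {}
    for key, raw in os.environ.items():
        if not key.startswith(prefix):
            continue
        parts = key[len(prefix):].lower().split(delimiter)
        node = config
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        try:
            node[parts[-1]] = json.loads(raw)  # JSON values become dicts/numbers
        except json.JSONDecodeError:
            node[parts[-1]] = raw  # plain strings stay as-is
    return config

os.environ["EVA_MODEL__STT"] = "deepgram"
os.environ["EVA_MODEL__STT_PARAMS"] = '{"model": "nova-3"}'
cfg = load_nested_env()
print(cfg["model"]["stt"])                  # deepgram
print(cfg["model"]["stt_params"]["model"])  # nova-3
```

This mirrors the behavior of nested-delimiter settings loaders such as pydantic-settings' `env_nested_delimiter`, which EVA may or may not use internally.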
```bash
# Run with domain-based conventions (easiest):
EVA_DOMAIN=airline python main.py
# Automatically uses:
#   data/airline_dataset.jsonl
#   configs/agents/airline_agent.yaml
#   data/airline_scenarios/

# Run with CLI overrides
python main.py --llm-model gpt-5-mini --max-concurrent 10

# Re-run specific metrics on an existing run
python main.py \
  --run-id <existing_run_id> \
  --metrics task_completion,faithfulness,conciseness
```

```bash
# Build the image
docker compose build

# Run a benchmark
docker compose run --rm benchmark
```

Install pre-commit hooks to lint and format code:
```bash
pre-commit install
```

Install the `[dev]` extra dependencies as shown in the Installation section.
```bash
# Run all tests
pytest tests/ -v

# Run a specific test file
pytest tests/test_postprocessor_transcript.py -v

# Run with coverage
pytest tests/ --cov=eva

# Run metrics tests
pytest tests/integration/test_metrics.py -v
```

Existing benchmarks evaluate voice agent components in isolation — speech understanding, TTS quality, or conversational dynamics — but none assess the full pipeline end to end. In real deployed systems, errors compound across modules, and failure modes interact in ways that component-level evaluation cannot capture. EVA addresses this by treating voice agent quality as an integrated whole, evaluating accuracy and experience jointly across complete multi-turn spoken conversations.
| Framework | Interaction Mode | Multi-turn | Tool Calling | Goal Completion | Experience Metrics | Pass@k / Pass^k | Supported Systems |
|---|---|---|---|---|---|---|---|
| EVA | Live bot-to-bot | ✅ | ✅ | ✅ Task Completion, Speech Fidelity, Faithfulness | ✅ Conciseness, Turn-taking, Latency, Progression | ✅ | Audio-native, Cascade |
| VoiceAgentBench | Static, TTS-synthesized | ✅ | ✅ | ❌ | ❌ | | Audio-native, Cascade |
| CAVA | Partial simulation | ✅ | ✅ | | Latency, Tone-awareness | ❌ | Audio-native, Cascade |
| FDB-v2 | Live, automated examiner | ✅ | ❌ | ❌ | ✅ Turn-taking fluency, Correction handling, Safety | ❌ | Audio-native |
| FDB-v1 | Static, pre-recorded | ❌ | ❌ | ❌ | ✅ Turn-taking, Backchanneling, Interruption | ❌ | Audio-native |
| FD-Bench | Live, simulated | ❌ | ❌ | ❌ | ✅ Interruption, Delay, Robustness | ❌ | Audio-native |
| Talking Turns | Static, curated | ❌ | ❌ | ❌ | ✅ Turn change, Backchannel, Interruption | ❌ | Audio-native, Cascade |
EVA evaluates agents using a bot-to-bot audio architecture — no human listeners, no text replays. Two conversational AIs speak to each other over a live WebSocket connection, producing realistic speech-to-speech interactions that capture real STT behavior and turn-taking dynamics.
| Component | Role |
|---|---|
| 🎭 User Simulator (ElevenAgent) | Plays the role of a caller with a defined goal and persona |
| 🤖 Voice Agent (Pipecat) | The system under evaluation — supports cascade (STT→LLM→TTS) and speech-to-speech models |
| 🔧 Tool Executor | Provides deterministic, reproducible tool responses via custom Python functions, dynamically querying and modifying a predefined per-scenario database |
| ✅ Validators | Automated checks that verify conversations are complete and that the user simulator faithfully reproduced its intended goal — no human annotation required. Conversations that fail validation are automatically regenerated, ensuring only clean, correctly executed runs enter evaluation. |
| 📊 Metrics Engine | Scores each conversation using the audio recording, transcripts, and tool call logs. |
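To make the Tool Executor's role concrete, here is a minimal sketch of deterministic tooling over a per-scenario database. The class, method names, and database layout are hypothetical, not EVA's actual implementation:

```python
import copy

class ToolExecutor:
    """Serves tool calls from a fixed per-scenario database so that
    identical conversations always see identical tool responses."""

    def __init__(self, scenario_db: dict):
        # Deep-copy so every conversation starts from the same initial state.
        self.db = copy.deepcopy(scenario_db)

    def lookup_reservation(self, confirmation_code: str) -> dict:
        reservation = self.db["reservations"].get(confirmation_code)
        if reservation is None:
            return {"error": f"No reservation found for {confirmation_code}"}
        return reservation

    def rebook_flight(self, confirmation_code: str, new_flight: str) -> dict:
        reservation = self.db["reservations"].get(confirmation_code)
        if reservation is None:
            return {"error": f"No reservation found for {confirmation_code}"}
        reservation["flight"] = new_flight  # mutate scenario state in place
        return {"status": "rebooked", "flight": new_flight}

# Hypothetical scenario database for one record.
scenario = {"reservations": {"ABC123": {"flight": "SN101", "passenger": "J. Doe"}}}
executor = ToolExecutor(scenario)
print(executor.rebook_flight("ABC123", "SN205"))        # {'status': 'rebooked', 'flight': 'SN205'}
print(executor.lookup_reservation("ABC123")["flight"])  # SN205
```

Because the end state of the database is fully determined by the tool calls made, it can be compared against a ground-truth end state for deterministic task-completion scoring.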
```
output/<run_id>/
├── config.json                  # Run configuration snapshot
├── results.csv                  # Quick results table
├── metrics_summary.json         # Aggregate metrics (after metrics run)
├── metrics_summary.csv          # Per-category metrics breakdown
└── records/<record_id>/
    ├── result.json              # Conversation result
    ├── audio_assistant.wav      # Assistant audio channel
    ├── audio_user.wav           # User audio channel
    ├── audio_mixed.wav          # Mixed stereo audio
    ├── transcript.jsonl         # Turn-by-turn transcript
    ├── audit_log.json           # Complete interaction log
    ├── pipecat_logs.jsonl       # Pipecat framework events
    ├── elevenlabs_events.jsonl  # ElevenLabs events
    └── metrics.json             # Per-record metric scores and details
```
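As a sketch of how these artifacts can be consumed downstream, the snippet below averages numeric scores across each record's `metrics.json`. The flat `{metric: score}` layout is an assumption for illustration; EVA's own aggregates are in `metrics_summary.json`:

```python
import json
import tempfile
from pathlib import Path
from statistics import mean

def summarize_run(run_dir: Path) -> dict:
    """Average every numeric metric across records/<record_id>/metrics.json files."""
    scores: dict[str, list[float]] = {}
    for metrics_file in sorted(run_dir.glob("records/*/metrics.json")):
        for name, value in json.loads(metrics_file.read_text()).items():
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                scores.setdefault(name, []).append(float(value))
    return {name: mean(values) for name, values in scores.items()}

# Build a toy run directory to demonstrate (flat {metric: score} layout assumed).
run_dir = Path(tempfile.mkdtemp())
for record_id, score in [("1.2.1", 1.0), ("1.2.2", 0.0)]:
    record_dir = run_dir / "records" / record_id
    record_dir.mkdir(parents=True)
    (record_dir / "metrics.json").write_text(json.dumps({"task_completion": score}))

print(summarize_run(run_dir))  # {'task_completion': 0.5}
```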
| 🎯 EVA-A · Accuracy | ✨ EVA-X · Experience |
|---|---|
| Did the agent complete the task correctly? | Was the conversational experience high quality? |
| Task Completion · Deterministic | Turn Taking · LLM Judge BETA |
| Agent Speech Fidelity · Audio LLM Judge BETA | Conciseness · LLM Judge |
| Faithfulness · LLM Judge | Conversation Progression · LLM Judge |
See the Metrics documentation for detailed scoring rubrics and judge prompts. For the data structures that metrics operate on, see MetricContext documentation.
EVA includes 50 airline scenarios, each specifying a user goal, persona, scenario database, and ground truth end state — making evaluations fully reproducible and directly comparable across agents and model versions. See the Data documentation for a detailed breakdown of the data structure and scenario design, and the Database & Tool Schema for the airline scenario database format.
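For a sense of what such a scenario specifies, a record in `data/airline_dataset.jsonl` might look like the following. All field names here are hypothetical; consult the Data documentation for the actual schema:

```json
{
  "record_id": "1.2.1",
  "category": "voluntary_change",
  "user_goal": "Move the return leg of reservation ABC123 to the next day",
  "persona": "Hurried business traveler who speaks quickly and interrupts often",
  "scenario_db": "data/airline_scenarios/1.2.1.json",
  "ground_truth_end_state": {"reservations": {"ABC123": {"return_flight": "SN205"}}}
}
```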
Flight rebooking is a strong initial domain: it is high-stakes, time-pressured, and demands temporal reasoning, policy following, constraint satisfaction, and accurate transcription of named entities (confirmation codes, flight numbers, passenger names, dates).
| Category | Description |
|---|---|
| | Airline-initiated disruptions — user is entitled to rebooking at no cost |
| 🔄 Voluntary Changes | User-initiated changes subject to fare differences and change fees |
| 🔗 Missed Connections | Cascading disruptions across multiple legs |
| ⏱️ Same-Day Changes | Time-sensitive standby and same-day change requests |
| | Users seeking compensation they are not entitled to under policy |
```
eva/
├── main.py                        # Main entry point
├── pyproject.toml                 # Python project configuration
├── Dockerfile                     # Docker configuration
├── compose.yaml                   # Docker Compose configuration
├── src/eva/
│   ├── cli.py                     # CLI interface
│   ├── run_benchmark.py           # Benchmark runner
│   ├── models/                    # Pydantic data models
│   ├── orchestrator/              # Framework execution
│   │   ├── runner.py              # Main orchestrator
│   │   ├── worker.py              # Per-conversation worker
│   │   ├── validation_runner.py   # Validation runner
│   │   └── port_pool.py           # Port management
│   ├── assistant/                 # Pipecat-based assistant
│   │   ├── agentic/               # Agent orchestration
│   │   ├── tools/                 # Python-based tool implementations
│   │   ├── pipeline/              # Audio/LLM processing pipeline
│   │   └── services/              # STT/TTS/LLM factories
│   ├── user_simulator/            # ElevenLabs user simulator
│   ├── metrics/                   # Evaluation metrics
│   │   ├── base.py                # Base metric classes
│   │   ├── processor.py           # Metrics context processor
│   │   ├── runner.py              # Metrics execution
│   │   ├── registry.py            # Metric registry
│   │   ├── aggregation.py         # Metric aggregation
│   │   ├── accuracy/              # Task completion metrics
│   │   ├── experience/            # Responsiveness, progression, turn-taking
│   │   ├── diagnostic/            # Diagnostic metrics (not in final scores)
│   │   └── validation/            # Quality control metrics
│   └── utils/                     # Utilities (LLM client, log processing)
├── scripts/                       # Utility scripts
│   ├── run_text_only.py           # Text-only evaluation runner
│   ├── docker_entrypoint.py       # Docker entry point
│   ├── check_version_bump.py      # Version checking
│   └── push_to_hf.py              # Hugging Face push script
├── configs/                       # Configuration files
│   ├── prompts/                   # Judge and simulation prompts
│   │   ├── judge.yaml             # Judge metric prompts
│   │   └── simulation.yaml        # User simulator prompts
│   └── agents/                    # Agent configurations
│       └── airline_agent.yaml
├── docs/                          # Documentation
│   ├── metrics/                   # Per-metric documentation
│   ├── data.md                    # Data documentation
│   ├── experiment_setup.md        # Experiment setup guide
│   ├── llm_configuration.md       # LLM provider setup guide
│   ├── metric_context.md          # Metric context documentation
│   ├── limitations.md             # Known limitations
│   └── demo/                      # Demo audio files
├── data/                          # Data files
│   ├── airline_dataset.jsonl      # Evaluation dataset
│   └── airline_scenarios/         # Per-record scenario databases
├── tests/                         # Test suite
│   ├── unit/                      # Unit tests
│   ├── integration/               # Integration tests
│   ├── artifacts/                 # Test artifacts and fixtures
│   └── fixtures/                  # Shared test fixtures
└── website/                       # Project website (React/TypeScript)
```
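The layout above separates metric definitions (`base.py`) from discovery (`registry.py`). A common pattern for this split is a decorator-based registry; the sketch below is an assumed illustration of that pattern, not EVA's actual API:

```python
from abc import ABC, abstractmethod

# Maps metric names to their classes; populated by the decorator below.
METRIC_REGISTRY: dict[str, type] = {}

def register_metric(name: str):
    """Class decorator that adds a Metric subclass to the registry by name."""
    def decorator(cls: type) -> type:
        METRIC_REGISTRY[name] = cls
        return cls
    return decorator

class Metric(ABC):
    @abstractmethod
    def score(self, context: dict) -> float:
        """Score one conversation given its evaluation context."""

@register_metric("task_completion")
class TaskCompletion(Metric):
    """Deterministic check: did the final database state match ground truth?"""
    def score(self, context: dict) -> float:
        return 1.0 if context["final_state"] == context["expected_state"] else 0.0

# Runners can then look metrics up by the names passed via --metrics.
metric = METRIC_REGISTRY["task_completion"]()
print(metric.score({"final_state": {"flight": "SN205"},
                    "expected_state": {"flight": "SN205"}}))  # 1.0
```

A registry like this keeps the runner generic: adding a metric is just defining a class in `accuracy/` or `experience/` and decorating it.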
We welcome contributions! Please read our Contributing Guidelines before submitting a pull request. For larger features, we recommend reaching out first to ensure alignment with our roadmap.
See Limitations for known limitations of the framework and metrics.