Conversation
Pull request overview
This PR adds an evaluation harness for running experiments and trace-based evaluations on Langfuse datasets. It provides a wrapper around Langfuse's `dataset.run_experiment` with additional trace evaluation capabilities.
Changes:
- Added `run_experiment` wrapper to fetch datasets by name and run experiments
- Added `run_trace_evaluations` for second-pass trace-based evaluation
- Added `run_coroutine_sync` utility to execute async code synchronously
- Introduced type definitions for trace evaluations, metrics, and results
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| aieng-eval-agents/aieng/agent_evals/evaluation/types.py | Defines types for evaluation harness including protocols, dataclasses, and enums |
| aieng-eval-agents/aieng/agent_evals/evaluation/trace.py | Implements trace-based evaluation with retry logic and metric extraction |
| aieng-eval-agents/aieng/agent_evals/evaluation/experiment.py | Provides experiment wrapper functions for running evaluations |
| aieng-eval-agents/aieng/agent_evals/evaluation/__init__.py | Package initialization exporting public API |
| aieng-eval-agents/aieng/agent_evals/async_utils.py | Added run_coroutine_sync utility and refactored progress display |
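The retry logic mentioned for `trace.py` matters because Langfuse ingests traces asynchronously, so a trace may not be queryable immediately after an experiment finishes. The helper name and signature below are illustrative, not the module's actual API; a sketch of the pattern:

```python
import time
from collections.abc import Callable
from typing import TypeVar

T = TypeVar("T")


def fetch_with_retry(
    fetch: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
) -> T:
    """Call `fetch` with exponential backoff between attempts.

    Retrying with increasing delays papers over the lag between an
    experiment finishing and its traces becoming queryable.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # exhausted all attempts; surface the error
            time.sleep(base_delay * (2**attempt))
    raise RuntimeError("unreachable")
```

In practice `fetch` would wrap a Langfuse trace lookup; the final attempt re-raises so genuine failures are not silently swallowed.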
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
amrit110
left a comment
Code looks great. Perhaps we can add an integration test with a real Langfuse trace in a follow-up?
I'd wait for @lotif to review this PR before merging, since it's quite an important one.
lotif
left a comment
Good addition, thanks. Just one question.
Summary
Add evaluation harness. This includes a wrapper around Langfuse's `dataset.run_experiment` to run evaluators/graders on Langfuse datasets, as well as a method to run trace-based evaluations.

Clickup Ticket(s): N/A
Type of Change
Changes Made
- `run_experiment`, a thin wrapper around Langfuse's `dataset.run_experiment` that includes fetching the dataset by name before running an experiment.
- `run_trace_evaluations`, a second-pass evaluation that takes the output of `run_experiment`, fetches traces, and runs trace evaluators.
- `run_experiment_with_trace_evals`, which combines `run_experiment` and `run_trace_evaluations` in a workflow.
- `run_coroutine_sync`, a utility function to run an async coroutine synchronously from a Python script or a notebook.

Testing
- `uv run pytest tests/`
- `uv run mypy <src_dir>`
- `uv run ruff check src_dir/`

Manual testing details:
N/A
Screenshots/Recordings
N/A
Related Issues
N/A
Deployment Notes
N/A
Checklist