Conversation
Pull request overview
This PR adds an evaluation harness for running experiments and trace-based evaluations on Langfuse datasets. It provides a wrapper around Langfuse's `dataset.run_experiment` with additional trace evaluation capabilities.
Changes:
- Added `run_experiment` wrapper to fetch datasets by name and run experiments
- Added `run_trace_evaluations` for second-pass trace-based evaluation
- Added `run_coroutine_sync` utility to execute async code synchronously
- Introduced type definitions for trace evaluations, metrics, and results
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| aieng-eval-agents/aieng/agent_evals/evaluation/types.py | Defines types for evaluation harness including protocols, dataclasses, and enums |
| aieng-eval-agents/aieng/agent_evals/evaluation/trace.py | Implements trace-based evaluation with retry logic and metric extraction |
| aieng-eval-agents/aieng/agent_evals/evaluation/experiment.py | Provides experiment wrapper functions for running evaluations |
| aieng-eval-agents/aieng/agent_evals/evaluation/__init__.py | Package initialization exporting public API |
| aieng-eval-agents/aieng/agent_evals/async_utils.py | Added run_coroutine_sync utility and refactored progress display |
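The retry logic mentioned for `trace.py` matters because Langfuse ingests traces asynchronously, so a trace may not be queryable immediately after an experiment finishes. The helper name and signature below are illustrative, not the module's actual API; a sketch of the pattern:

```python
import time
from collections.abc import Callable
from typing import TypeVar

T = TypeVar("T")


def fetch_with_retry(
    fetch: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
) -> T:
    """Call `fetch` with exponential backoff between attempts.

    Retrying with increasing delays papers over the lag between an
    experiment finishing and its traces becoming queryable.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # exhausted all attempts; surface the error
            time.sleep(base_delay * (2**attempt))
    raise RuntimeError("unreachable")
```

In practice `fetch` would wrap a Langfuse trace lookup; the final attempt re-raises so genuine failures are not silently swallowed.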
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
amrit110
left a comment
Code looks great. Perhaps we can add an integration test with a real Langfuse trace in a follow-up?
I'd wait for @lotif to review this PR before merging, since it's quite an important one.
lotif
left a comment
Good addition, thanks. Just one question.
Summary
Add evaluation harness. This includes a wrapper around Langfuse's `dataset.run_experiment` to run evaluators/graders on Langfuse datasets, as well as a method to run trace-based evaluations.

Clickup Ticket(s): N/A
Type of Change
Changes Made
- `run_experiment`, a thin wrapper around Langfuse's `dataset.run_experiment` that includes fetching the dataset by name before running an experiment.
- `run_trace_evaluations`, a second-pass evaluation that takes the output of `run_experiment`, fetches traces, and runs trace evaluators.
- `run_experiment_with_trace_evals`, which combines `run_experiment` and `run_trace_evaluations` in a workflow.
- `run_coroutine_sync`, a utility function to run an async coroutine synchronously from a Python script or a notebook.

Testing
- `uv run pytest tests/`
- `uv run mypy <src_dir>`
- `uv run ruff check src_dir/`

Manual testing details:
N/A
Screenshots/Recordings
N/A
Related Issues
N/A
Deployment Notes
N/A
Checklist