Add evaluation harness#37

Merged
fcogidi merged 7 commits into main from fc/add_eval_harness
Feb 5, 2026

Conversation

@fcogidi
Collaborator

@fcogidi fcogidi commented Feb 4, 2026

Summary

Add an evaluation harness. This includes a wrapper around Langfuse's dataset.run_experiment to run evaluators/graders on Langfuse datasets, as well as a method to run trace-based evaluations.
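
As a rough illustration of the wrapper idea, the sketch below fetches a dataset by name and then delegates to its run_experiment method. It is not the code from this PR: `run_experiment_sketch`, `FakeClient`, and `FakeDataset` are hypothetical names, and the in-memory stand-ins only assume that the client exposes `get_dataset(name)` and that the dataset exposes `run_experiment(name=..., task=...)`.

```python
from typing import Any, Callable, Protocol


class DatasetLike(Protocol):
    """Assumed shape of a dataset object: it can run an experiment on itself."""

    def run_experiment(self, *, name: str, task: Callable[..., Any], **kwargs: Any) -> Any: ...


class ClientLike(Protocol):
    """Assumed shape of a client: it can look a dataset up by name."""

    def get_dataset(self, name: str) -> DatasetLike: ...


def run_experiment_sketch(
    client: ClientLike,
    dataset_name: str,
    *,
    name: str,
    task: Callable[..., Any],
    **kwargs: Any,
) -> Any:
    """Fetch the dataset by name, then delegate to its run_experiment method."""
    dataset = client.get_dataset(dataset_name)
    return dataset.run_experiment(name=name, task=task, **kwargs)


# Hypothetical in-memory stand-ins so the sketch runs without a server.
class FakeDataset:
    def __init__(self, items: list) -> None:
        self.items = items

    def run_experiment(self, *, name: str, task: Callable[..., Any], **kwargs: Any) -> Any:
        # Apply the task to every item and collect the outputs.
        return {"name": name, "outputs": [task(item=item) for item in self.items]}


class FakeClient:
    def __init__(self, datasets: dict) -> None:
        self._datasets = datasets

    def get_dataset(self, name: str) -> FakeDataset:
        return self._datasets[name]


client = FakeClient({"qa": FakeDataset(["a", "b"])})
result = run_experiment_sketch(client, "qa", name="demo", task=lambda *, item: item.upper())
```

Passing the client in explicitly keeps the sketch testable with a fake; the real harness may obtain its client differently.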

Clickup Ticket(s): N/A

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📝 Documentation update
  • 🔧 Refactoring (no functional changes)
  • ⚡ Performance improvement
  • 🧪 Test improvements
  • 🔒 Security fix

Changes Made

  • Added run_experiment, a thin wrapper around Langfuse's dataset.run_experiment that fetches the dataset by name before running the experiment.
  • Added run_trace_evaluations, a second-pass evaluation that takes the output of run_experiment, fetches the corresponding traces, and runs trace evaluators on them.
  • Added run_experiment_with_trace_evals, which combines run_experiment and run_trace_evaluations in a single workflow.
  • Added a run_coroutine_sync utility function to run an async coroutine synchronously from a Python script or a notebook.
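
One common way to run a coroutine synchronously from both a plain script and a notebook is sketched below. This is an assumption about the general technique, not the implementation merged here: `run_sync` and `_double` are hypothetical names. When no event loop is running, `asyncio.run` is enough; when one is already running (as in a notebook), the sketch runs the coroutine on a fresh loop in a worker thread and blocks until it finishes.

```python
import asyncio
import threading
from typing import Any, Coroutine, TypeVar

T = TypeVar("T")


def run_sync(coro: Coroutine[Any, Any, T]) -> T:
    """Run a coroutine to completion and return its result (illustrative sketch)."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread (plain script): asyncio.run suffices.
        return asyncio.run(coro)

    # A loop is already running (e.g. a notebook cell), so asyncio.run would
    # raise here. Run the coroutine on its own loop in a worker thread instead.
    outcome: dict = {}

    def _worker() -> None:
        try:
            outcome["value"] = asyncio.run(coro)
        except BaseException as exc:  # re-raised in the calling thread below
            outcome["error"] = exc

    thread = threading.Thread(target=_worker, daemon=True)
    thread.start()
    thread.join()  # blocks the caller until the coroutine has finished

    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]


async def _double(x: int) -> int:
    """Tiny demo coroutine (hypothetical)."""
    await asyncio.sleep(0)
    return 2 * x


async def _call_from_running_loop() -> int:
    # Simulates the notebook case: run_sync is called while a loop is running.
    return run_sync(_double(5))


from_script = run_sync(_double(21))
from_loop = asyncio.run(_call_from_running_loop())
```

Note that the thread-based branch blocks the already-running loop while it waits, which is usually acceptable in a notebook cell but not inside a latency-sensitive async service.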

Testing

  • Tests pass locally (uv run pytest tests/)
  • Type checking passes (uv run mypy <src_dir>)
  • Linting passes (uv run ruff check src_dir/)
  • Manual testing performed (describe below)

Manual testing details:
N/A

Screenshots/Recordings

N/A

Related Issues

N/A

Deployment Notes

N/A

Checklist

  • Code follows the project's style guidelines
  • Self-review of code completed
  • Documentation updated (if applicable)
  • No sensitive information (API keys, credentials) exposed

@fcogidi fcogidi requested a review from Copilot February 4, 2026 22:01
@fcogidi fcogidi self-assigned this Feb 4, 2026
@fcogidi fcogidi added the enhancement New feature or request label Feb 4, 2026

Copilot AI left a comment

Pull request overview

This PR adds an evaluation harness for running experiments and trace-based evaluations on Langfuse datasets. It provides a wrapper around Langfuse's dataset.run_experiment with additional trace evaluation capabilities.

Changes:

  • Added run_experiment wrapper to fetch datasets by name and run experiments
  • Added run_trace_evaluations for second-pass trace-based evaluation
  • Added run_coroutine_sync utility to execute async code synchronously
  • Introduced type definitions for trace evaluations, metrics, and results

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Show a summary per file
| File | Description |
| --- | --- |
| aieng-eval-agents/aieng/agent_evals/evaluation/types.py | Defines types for evaluation harness including protocols, dataclasses, and enums |
| aieng-eval-agents/aieng/agent_evals/evaluation/trace.py | Implements trace-based evaluation with retry logic and metric extraction |
| aieng-eval-agents/aieng/agent_evals/evaluation/experiment.py | Provides experiment wrapper functions for running evaluations |
| aieng-eval-agents/aieng/agent_evals/evaluation/__init__.py | Package initialization exporting public API |
| aieng-eval-agents/aieng/agent_evals/async_utils.py | Added run_coroutine_sync utility and refactored progress display |

@fcogidi fcogidi requested a review from Copilot February 5, 2026 16:58

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.


@fcogidi fcogidi marked this pull request as ready for review February 5, 2026 17:06
@fcogidi fcogidi requested review from amrit110 and lotif February 5, 2026 17:06
Member

@amrit110 amrit110 left a comment

Code looks great. Perhaps we can add an integration test with a real Langfuse trace in a follow-up?

@amrit110
Member

amrit110 commented Feb 5, 2026

I'd wait for @lotif to review this PR before merging, since it's quite an important one.

Collaborator

@lotif lotif left a comment

Good addition, thanks. Just one question.

@fcogidi fcogidi merged commit 08ca627 into main Feb 5, 2026
3 checks passed
@fcogidi fcogidi deleted the fc/add_eval_harness branch February 5, 2026 22:31

Labels

enhancement New feature or request

3 participants