# Telemetry-Driven Inference Troubleshooting and Optimization Loop
Deploy a model as an inference service via dstack, continuously fetch profiling telemetry from Graphsignal, analyze performance, and redeploy with improved configuration — all autonomously.
Inspired by karpathy/autoresearch. Where autoresearch lets an AI agent iterate on training code overnight, autodebug lets an AI agent iterate on inference deployment configuration: tuning batch sizes, caching strategies, parallelism, and engine parameters to minimize latency and maximize throughput.
Read more in the blog post.
## How it works

The agent follows an optimization loop defined in `program.md`:
- Deploy an inference service (e.g. SGLang, vLLM) on dstack with Graphsignal telemetry enabled.
- Benchmark the endpoint with targeted request patterns (parallel, sequential, long prompts, etc.).
- Fetch telemetry from Graphsignal — profiling data, traces, metrics, and errors.
- Analyze performance: compute prefill throughput, decode throughput, token throughput, and identify bottlenecks.
- Redeploy with an optimized dstack configuration reflecting the improvements.
- Repeat indefinitely.
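The analysis step above boils down to a few ratios over traced requests. A minimal sketch in Python (the function and metric names are illustrative, not Graphsignal's actual schema; it assumes time-to-first-token roughly corresponds to the prefill phase):

```python
def throughput_metrics(prompt_tokens, completion_tokens, ttft_s, total_s):
    """Derive throughput numbers from one traced request.

    ttft_s:  time to first token in seconds (approximates the prefill phase)
    total_s: total request latency in seconds
    """
    prefill_tps = prompt_tokens / ttft_s                       # prefill throughput
    decode_tps = completion_tokens / (total_s - ttft_s)        # decode throughput
    token_tps = (prompt_tokens + completion_tokens) / total_s  # overall token throughput
    return {
        "prefill_tps": prefill_tps,
        "decode_tps": decode_tps,
        "token_tps": token_tps,
    }

# Example: 1000-token prompt, 200 generated tokens, 0.5 s TTFT, 4.5 s total
m = throughput_metrics(1000, 200, 0.5, 4.5)
print(m)  # prefill = 2000 tok/s, decode = 50 tok/s
```

Comparing these numbers across iterations is what tells the agent whether a configuration change (batch size, parallelism, engine flags) actually helped.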
Each iteration is logged to a separate `sessions/debug-<ISO>.md` file (findings and rationale), building a complete record of the optimization journey.
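For illustration, `<ISO>` stands for an ISO-8601 UTC timestamp; a hypothetical sketch of how per-session filenames could be derived (the exact format the agent uses is not specified here):

```python
from datetime import datetime, timezone

def session_paths(now=None):
    """Build per-session file paths from a compact ISO-8601 UTC timestamp."""
    now = now or datetime.now(timezone.utc)
    # Colons are awkward in filenames, so use the compact basic format.
    iso = now.strftime("%Y%m%dT%H%M%SZ")
    return f"sessions/debug-{iso}.md", f"sessions/dstack-{iso}.yml"

log_path, cfg_path = session_paths()
```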
## Prerequisites

Graphsignal provides inference observability — profiling, tracing, and metrics. Sign up and obtain an API key.
Install the debug CLI and log in:

```shell
uv tool install graphsignal-debug
graphsignal-debug login
```

dstack manages cloud infrastructure for inference services. Follow the dstack installation docs to set up your dstack project:
```shell
uv tool install dstack
```

Download the skill files so the agent has full context:
```shell
mkdir -p ~/.claude/skills/graphsignal-python ~/.claude/skills/graphsignal-debug ~/.claude/skills/dstack
curl -sL https://raw.githubusercontent.com/graphsignal/graphsignal-python/main/SKILL.md -o ~/.claude/skills/graphsignal-python/SKILL.md
curl -sL https://raw.githubusercontent.com/graphsignal/graphsignal-debug/main/SKILL.md -o ~/.claude/skills/graphsignal-debug/SKILL.md
curl -sL https://raw.githubusercontent.com/dstackai/dstack/master/.skills/dstack/SKILL.md -o ~/.claude/skills/dstack/SKILL.md
```

## Repository structure

- `program.md` — agent instructions (the optimization loop)
- `dstack-baseline.yml` — baseline dstack service configuration
- `sessions/`
  - `.gitkeep`
  - `debug-<ISO>.md` — per-session findings and planned changes (created by the agent)
  - `dstack-<ISO>.yml` — per-session optimized configurations (created by the agent)
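A baseline service configuration like `dstack-baseline.yml` could look roughly like the following sketch. It is not the repo's actual file: the image, model, GPU size, and the Graphsignal wiring via the `GRAPHSIGNAL_API_KEY` environment variable are all assumptions for illustration.

```yaml
type: service
name: llm-baseline

image: vllm/vllm-openai:latest
env:
  - GRAPHSIGNAL_API_KEY               # assumed: forwarded from your local environment
  - MODEL=Qwen/Qwen2.5-7B-Instruct    # assumed model for illustration
commands:
  - vllm serve $MODEL --port 8000
port: 8000

resources:
  gpu: 24GB
```

Each iteration, the agent copies a configuration like this into `sessions/dstack-<ISO>.yml`, adjusting engine parameters before redeploying.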
## Usage

Point your AI agent at this repo and prompt:

```
Read program.md and let's set up an optimization session.
```
The agent will verify prerequisites, deploy the initial service, and begin the autonomous optimization loop. You can walk away — it will keep iterating, logging results, and redeploying improvements until you stop it.
## License

MIT