# Telemetry-Driven Inference Troubleshooting and Optimization Loop
Deploy a model as an inference service via dstack, continuously fetch profiling telemetry from Graphsignal, analyze performance, and redeploy with improved configuration — all autonomously.
Inspired by karpathy/autoresearch. Where autoresearch lets an AI agent iterate on training code overnight, autodebug lets an AI agent iterate on inference deployment configuration: tuning batch sizes, caching strategies, parallelism, and engine parameters to minimize latency and maximize throughput.
Read more in the blog post.
## How it works

The agent follows an optimization loop defined in `program.md`:
- Deploy an inference service (e.g. SGLang, vLLM) on dstack with Graphsignal telemetry enabled.
- Benchmark the endpoint with targeted request patterns (parallel, sequential, long prompts, etc.).
- Fetch telemetry from Graphsignal — profiling data, traces, metrics, and errors.
- Analyze performance: compute prefill throughput, decode throughput, token throughput, and identify bottlenecks.
- Redeploy with an optimized dstack configuration reflecting the improvements.
- Repeat indefinitely.
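The analysis step above boils down to a few ratios over traced requests. A minimal sketch in Python (the function and metric names are illustrative, not Graphsignal's actual schema; it assumes time-to-first-token roughly corresponds to the prefill phase):

```python
def throughput_metrics(prompt_tokens, completion_tokens, ttft_s, total_s):
    """Derive throughput numbers from one traced request.

    ttft_s:  time to first token in seconds (approximates the prefill phase)
    total_s: total request latency in seconds
    """
    prefill_tps = prompt_tokens / ttft_s                       # prefill throughput
    decode_tps = completion_tokens / (total_s - ttft_s)        # decode throughput
    token_tps = (prompt_tokens + completion_tokens) / total_s  # overall token throughput
    return {
        "prefill_tps": prefill_tps,
        "decode_tps": decode_tps,
        "token_tps": token_tps,
    }

# Example: 1000-token prompt, 200 generated tokens, 0.5 s TTFT, 4.5 s total
m = throughput_metrics(1000, 200, 0.5, 4.5)
print(m)  # prefill = 2000 tok/s, decode = 50 tok/s
```

Comparing these numbers across iterations is what tells the agent whether a configuration change (batch size, parallelism, engine flags) actually helped.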
Each iteration is logged to a separate `sessions/debug-<ISO>.md` file (findings and rationale), building a complete record of the optimization journey.
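For illustration, `<ISO>` stands for an ISO-8601 UTC timestamp; a hypothetical sketch of how per-session filenames could be derived (the exact format the agent uses is not specified here):

```python
from datetime import datetime, timezone

def session_paths(now=None):
    """Build per-session file paths from a compact ISO-8601 UTC timestamp."""
    now = now or datetime.now(timezone.utc)
    # Colons are awkward in filenames, so use the compact basic format.
    iso = now.strftime("%Y%m%dT%H%M%SZ")
    return f"sessions/debug-{iso}.md", f"sessions/dstack-{iso}.yml"

log_path, cfg_path = session_paths()
```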
## Prerequisites

Graphsignal provides inference observability — profiling, tracing, and metrics. Sign up and obtain an API key.
Install the debug CLI and log in:

```shell
uv tool install graphsignal-debug
graphsignal-debug login
```

dstack manages cloud infrastructure for inference services. Follow the dstack installation docs to set up your dstack project:
```shell
uv tool install dstack
```

Download the skill files so the agent has full context:
```shell
mkdir -p ~/.claude/skills/graphsignal-python ~/.claude/skills/graphsignal-debug ~/.claude/skills/dstack
curl -sL https://raw.githubusercontent.com/graphsignal/graphsignal-python/main/SKILL.md -o ~/.claude/skills/graphsignal-python/SKILL.md
curl -sL https://raw.githubusercontent.com/graphsignal/graphsignal-debug/main/SKILL.md -o ~/.claude/skills/graphsignal-debug/SKILL.md
curl -sL https://raw.githubusercontent.com/dstackai/dstack/master/.skills/dstack/SKILL.md -o ~/.claude/skills/dstack/SKILL.md
```

## Repository structure

- `program.md` — agent instructions (the optimization loop)
- `dstack-baseline.yml` — baseline dstack service configuration
- `sessions/`
  - `.gitkeep`
  - `debug-<ISO>.md` — per-session findings and planned changes (created by the agent)
  - `dstack-<ISO>.yml` — per-session optimized configurations (created by the agent)
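A baseline service configuration like `dstack-baseline.yml` could look roughly like the following sketch. It is not the repo's actual file: the image, model, GPU size, and the Graphsignal wiring via the `GRAPHSIGNAL_API_KEY` environment variable are all assumptions for illustration.

```yaml
type: service
name: llm-baseline

image: vllm/vllm-openai:latest
env:
  - GRAPHSIGNAL_API_KEY               # assumed: forwarded from your local environment
  - MODEL=Qwen/Qwen2.5-7B-Instruct    # assumed model for illustration
commands:
  - vllm serve $MODEL --port 8000
port: 8000

resources:
  gpu: 24GB
```

Each iteration, the agent copies a configuration like this into `sessions/dstack-<ISO>.yml`, adjusting engine parameters before redeploying.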
## Usage

Point your AI agent at this repo and prompt:

```
Read program.md and let's set up an optimization session.
```
The agent will verify prerequisites, deploy the initial service, and begin the autonomous optimization loop. You can walk away — it will keep iterating, logging results, and redeploying improvements until you stop it.
## License

MIT