PR Description: Low-Overhead Instrumentation Support
Summary
This PR introduces dynamic instrumentation policies (sampling and warm-up) to significantly reduce overhead for large-scale training runs. It refactors the instrumentation control logic to be injected directly into training and evaluation loops, ensuring robust state management and correct application of policies across different execution stages.
Key Changes
1. Loop-Based Instrumentation Control
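As a rough illustration of the control functions this section introduces, here is a minimal sketch of how a sampling-plus-warm-up policy might toggle instrumentation. It is not the project's implementation: `PolicyState`, `should_instrument`, and `instrumented_steps` are hypothetical names, the state is passed explicitly rather than held at module level, and the exact rule (1-based steps, warm-up steps always instrumented, then every Nth step) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class PolicyState:
    """Hypothetical stand-in for the state the real control module keeps."""
    sampling_interval: int = 1     # instrument every Nth step
    warm_up_steps: int = 0         # always instrument the first K steps
    step: int = 0                  # global training step counter
    eval_step: int = 0             # separate evaluation step counter
    disable_wrapper: bool = False  # True means instrumentation is off


def should_instrument(step: int, state: PolicyState) -> bool:
    """Assumed policy: warm-up steps always count, then every Nth step."""
    if step <= state.warm_up_steps:
        return True
    return step % state.sampling_interval == 0


def start_step(state: PolicyState) -> None:
    """Advance the training counter and apply the policy."""
    state.step += 1
    state.disable_wrapper = not should_instrument(state.step, state)


def start_eval_step(state: PolicyState) -> None:
    """Advance the evaluation counter, reusing the same policy."""
    state.eval_step += 1
    state.disable_wrapper = not should_instrument(state.eval_step, state)


def instrumented_steps(n_steps: int, state: PolicyState) -> list[int]:
    """Run n_steps training steps and return the ones left instrumented."""
    hits = []
    for _ in range(n_steps):
        start_step(state)
        if not state.disable_wrapper:
            hits.append(state.step)
    return hits
```

Under these assumptions, `instrumented_steps(7, PolicyState(sampling_interval=3, warm_up_steps=2))` returns steps 1, 2, 3, and 6: the first two come from warm-up, the rest from the interval.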
- `traincheck.instrumentor.control` now provides `start_step()` and `start_eval_step()` functions.
  - `start_step()`: Increments the global training step counter and applies the configured policy (interval/warm-up) to toggle instrumentation.
  - `start_eval_step()`: Manages a separate `eval_step` counter for evaluation loops, reusing the global policy.
- Policy control no longer goes through the `optimizer.step()` wrapper. This prevents issues where instrumentation state could become desynchronized or incorrectly applied outside of loop contexts.

2. Smart AST Injection (Source Instrumentation)
- `InsertTracerVisitor` in `traincheck/instrumentor/source_file.py` now detects loop contexts:
  - Training loops: identified by calls to `optimizer.step()` or `loss.backward()`. The visitor injects `start_step()`.
  - Evaluation loops: identified by the `test`, `eval`, and `valid` patterns. The visitor injects `start_eval_step()`.

3. CLI & Configuration Updates
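To make the two options in this section concrete, here is a minimal `argparse` sketch. Only the flag names come from this PR; the `build_parser` helper, the integer types, and the defaults (`1` meaning every step, `0` meaning no warm-up) are assumptions, and the real `traincheck-collect` parser has other arguments not shown here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing only the two policy flags."""
    parser = argparse.ArgumentParser(prog="traincheck-collect")
    parser.add_argument(
        "--sampling-interval",
        type=int,
        default=1,  # assumed default: instrument every step
        help="Instrument every Nth step.",
    )
    parser.add_argument(
        "--warm-up-steps",
        type=int,
        default=0,  # assumed default: no warm-up period
        help="Number of initial steps to always instrument.",
    )
    return parser
```

`argparse` maps the dashed flags to `args.sampling_interval` and `args.warm_up_steps`, which could then feed whatever policy object the collector configures.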
- `traincheck-collect` arguments:
  - `--sampling-interval`: Controls how frequently steps are instrumented (e.g., every Nth step).
  - `--warm-up-steps`: Specifies the number of initial steps to always instrument, regardless of the sampling interval.

4. Robustness Improvements
- `annotate_stage` now resets `DISABLE_WRAPPER` to `False` upon entering a new stage. This ensures instrumentation is re-enabled by default when switching contexts (e.g., from Training to Validation), preventing state leakage.

Verification
- `tests/test_loop_injection.py`: verifies that AST transformations correctly identify loop types and inject the appropriate control calls.
- `tests/test_dynamic_policy.py`: verifies the runtime logic of `start_step` and policy application.
- `tests/test_policy_injection.py`: covers CLI argument integration.
- `mnist.py` example: confirmed that trace logs show the expected "Interval step" (instrumented) and "Skipping step" (skipped) behavior for both training and testing loops.