fix: resolve multi-iteration tensor file overwrite and simplify precision checker #104
Merged
kilinchange merged 10 commits into master, Feb 2, 2026
…sion checker

Counter mechanism:
- Add ResetCounters() to clear tensor counter at iteration boundaries
- Move counter management to PrecisionCheckEnv with thread_local storage
- Call ResetCounters() at start of each training step in gpt2/llama3

Precision checker refactoring:
- Remove baseline comparison functionality (use separate script instead)
- Remove table format output, keep only simple and md5 formats
- Add TensorStats struct with min/max/mean/nan_count/inf_count
- Add SaveNpy() function for NPY file saving with rank subdirectories
- Simplify log output format with dtype, shape, stats, and first 6 values
- Change stage names from "Module Forward/Backward Output" to "Forward/Backward Output"
- Use std::filesystem instead of sys/stat.h for directory creation

Documentation and scripts:
- Update docs/precision_checker_guide.md with current implementation
- Add precision_compare.py for offline NPY comparison
- Add run_precision_check_gpt2.sh and run_precision_check_llama3.sh

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
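The TensorStats fields listed in this commit could be computed roughly as follows. This is a minimal sketch, not the PR's implementation: the field types, the `ComputeStats` helper, and the choice to exclude NaN/Inf entries from min/max/mean are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Field names follow the commit message; the types are assumed.
struct TensorStats {
    float min = 0.0f;
    float max = 0.0f;
    float mean = 0.0f;
    std::size_t nan_count = 0;
    std::size_t inf_count = 0;
};

// Hypothetical helper: one pass over the data. NaN and Inf entries are counted
// but left out of min/max/mean (an assumption; the PR may treat them differently).
TensorStats ComputeStats(const std::vector<float> &values) {
    TensorStats stats;
    float lo = std::numeric_limits<float>::infinity();
    float hi = -std::numeric_limits<float>::infinity();
    double sum = 0.0;
    std::size_t finite_count = 0;

    for (float v : values) {
        if (std::isnan(v)) {
            ++stats.nan_count;
            continue;
        }
        if (std::isinf(v)) {
            ++stats.inf_count;
            continue;
        }
        lo = std::min(lo, v);
        hi = std::max(hi, v);
        sum += v;
        ++finite_count;
    }

    // Leave min/max/mean at zero when no finite value was seen.
    if (finite_count > 0) {
        stats.min = lo;
        stats.max = hi;
        stats.mean = static_cast<float>(sum / static_cast<double>(finite_count));
    }
    return stats;
}
```

Accumulating the sum in double keeps the mean stable for large tensors; whether the checker does the same is not stated in the PR.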
- Add GlobalModuleHookRegistry singleton to decouple PrecisionChecker from Module::operator(), allowing any hook to be registered globally
- Add md5_tolerance config option for PrecisionChecker to handle BF16 precision differences (e.g., md5_tolerance=1e-3 makes 4.0003 and 4.0004 produce the same MD5 hash)
- Update gpt2 and llama3 examples to use the new hook registration API

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
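One plausible way to get the md5_tolerance behavior described above is to snap each value to the nearest multiple of the tolerance before hashing. The sketch below shows only that quantization step; `QuantizeForHash` is a hypothetical name, and the PR does not say whether the checker quantizes this way.

```cpp
#include <cmath>
#include <vector>

// Hypothetical pre-hash step: map each value to the index of its nearest
// tolerance-sized bucket. Assumes tolerance > 0. The resulting integers,
// rather than the raw floats, would then be fed to the MD5 routine.
std::vector<long long> QuantizeForHash(const std::vector<float> &values, double tolerance) {
    std::vector<long long> buckets;
    buckets.reserve(values.size());
    for (float v : values) {
        buckets.push_back(std::llround(static_cast<double>(v) / tolerance));
    }
    return buckets;
}
```

With tolerance 1e-3, both 4.0003 and 4.0004 fall into bucket 4000 and so hash identically. Bucketing is not a true distance check: 4.0004 and 4.0006 differ by less than the tolerance but land in buckets 4000 and 4001, so their hashes would still differ.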
Replace all Chinese comments with English translations in global_module_hook_registry.h for better international accessibility.
kilinchange (Collaborator) requested changes, Jan 28, 2026
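Also, please add a step at the end of the test script scripts/run_models_and_profile.bash that runs compare_loss.py to compare precision. (You could add a config option that passes in the path of the log to compare against; if it is not provided, skip the loss comparison script by default.)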
- Rename compare_loss.py/compare_tps.py from tools/ to scripts/
- Add --verbose flag to comparison scripts for detailed output
- Show full paths in "Files only in..." messages
- Only print comparison details for mismatches (quiet by default)
- Add precision_check_config.json and run_precision_check.sh unified runner
- Delete old run_precision_check_gpt2.sh/llama3.sh scripts
- Add COMPARE_LOG_DIR support to run_models_and_profile.bash
- Add tls_ prefix to thread_local variables for consistency
- Add error handling with log tail output in run_models_and_profile.bash
- Fix timestamped_path_ default initialization

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Author (Contributor): done
- Remove ambiguous run_precision_check.sh and precision_check_config.json
- Add PrecisionChecker::Init() for automatic module hook registration

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
kilinchange requested changes, Jan 29, 2026
…erHook

- RegisterHook now returns std::unique_ptr<HookHandle> for hook removal
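The handle-based removal named in this commit might look like the sketch below. Only `RegisterHook` and `HookHandle` appear in the PR; the `HookRegistry` class, the hook signature, the `Remove()`, `Run()`, and `HookCount()` members, and the decision to remove explicitly instead of on handle destruction are assumptions for illustration.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <memory>
#include <utility>

// Assumed shape of a removal handle: it owns a callback that erases the hook.
class HookHandle {
public:
    explicit HookHandle(std::function<void()> remover) : remover_(std::move(remover)) {}

    // Removes the hook once; later calls are no-ops.
    void Remove() {
        if (remover_) {
            remover_();
            remover_ = nullptr;
        }
    }

private:
    std::function<void()> remover_;
};

// Hypothetical registry; the hook signature void(int) is a placeholder.
class HookRegistry {
public:
    using Hook = std::function<void(int)>;

    // Stores the hook under a fresh id and returns a handle that can erase it.
    // The registry must outlive every handle it hands out.
    std::unique_ptr<HookHandle> RegisterHook(Hook hook) {
        const int id = next_id_++;
        hooks_[id] = std::move(hook);
        return std::make_unique<HookHandle>([this, id] { hooks_.erase(id); });
    }

    // Invokes every registered hook in registration order.
    void Run(int value) const {
        for (const auto &entry : hooks_) {
            entry.second(value);
        }
    }

    std::size_t HookCount() const { return hooks_.size(); }

private:
    int next_id_ = 0;
    std::map<int, Hook> hooks_;
};
```

Returning a handle keeps removal tied to the caller that registered the hook, so callers do not need to track ids or compare function objects.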
kilinchange requested changes, Jan 30, 2026
- Update precision checker to use new hook registration APIs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Use Tensor::output_idx() instead of TryMarkModuleBackwardHookRegistered to ensure hooks are only registered for the first output tensor.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
kilinchange requested changes, Feb 2, 2026
kilinchange approved these changes, Feb 2, 2026
Summary
Fix precision checker file accumulation issue during multi-iteration runs and simplify the overall implementation.
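The counter behavior behind this fix can be sketched as follows, assuming the tensor counter feeds into the output file name. `PrecisionCheckEnv` and `ResetCounters()` are named in the PR, but the members shown here, the `NextIndex` and `MakeFileStem` helpers, and the naming scheme are hypothetical.

```cpp
#include <string>
#include <unordered_map>

// Stand-in for the PR's PrecisionCheckEnv; everything except the name
// ResetCounters() is assumed for illustration.
class PrecisionCheckEnv {
public:
    // Returns the next index for `key` on the calling thread (0, 1, 2, ...).
    static int NextIndex(const std::string &key) { return tls_counters_[key]++; }

    // Clears the calling thread's counters; intended for the start of a training step.
    static void ResetCounters() { tls_counters_.clear(); }

private:
    // thread_local storage gives each thread its own counters, so no locking is needed.
    static thread_local std::unordered_map<std::string, int> tls_counters_;
};

thread_local std::unordered_map<std::string, int> PrecisionCheckEnv::tls_counters_;

// Builds a file stem such as "linear_0_forward" (naming scheme assumed).
std::string MakeFileStem(const std::string &module, const std::string &stage) {
    const int idx = PrecisionCheckEnv::NextIndex(module + "_" + stage);
    return module + "_" + std::to_string(idx) + "_" + stage;
}
```

Without a reset, the index keeps growing across iterations, so each step emits a new set of file names. Resetting at the step boundary makes the same tensor map to the same stem in every iteration.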
Changes
Counter Mechanism Fix
- Add ResetCounters() method to reset tensor counter at iteration boundaries
- Move counter management to PrecisionCheckEnv with thread_local storage for thread safety
- Call ResetCounters() at the start of each training step in gpt2/llama3
Precision Checker Refactoring
- Add SaveNpy() function with rank subdirectory support
- Simplify log output format: [GAS-X] [L-Y] name_idx_stage tensor[i]: dtype=... shape=... min=... max=... mean=... [values] [NaN:X Inf:Y]
New Scripts
- scripts/precision_check/precision_compare.py - Offline NPY comparison tool
- scripts/precision_check/run_precision_check_gpt2.sh - GPT2 verification script
- scripts/precision_check/run_precision_check_llama3.sh - LLaMA3 verification script
Documentation
- Update docs/precision_checker_guide.md to reflect current implementation
Usage Example
Testing Example
Run verification script: