fix: resolve multi-iteration tensor file overwrite and simplify precision checker by chen2021673 · Pull Request #104 · InfiniTensor/InfiniTrain

chen2021673 · 2026-01-22T07:55:38Z

Summary

Fix precision checker file accumulation issue during multi-iteration runs and simplify the overall implementation.

Changes

Counter Mechanism Fix

- Add ResetCounters() method to reset tensor counter at iteration boundaries
- Move counter management to PrecisionCheckEnv with thread_local storage for thread safety
- Call ResetCounters() at the start of each training step in gpt2/llama3
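
The counter mechanism described above can be sketched roughly as follows. This is a Python stand-in, not the C++ code from this PR: the class layout, `next_index`, `reset_counters`, and `run_step` are hypothetical names, and the assumption that the counter supplies the `idx` part of `name_idx_stage` is inferred from the log format rather than confirmed by the source.

```python
import threading


class PrecisionCheckEnv:
    """Hypothetical Python stand-in for the C++ PrecisionCheckEnv."""

    # Each thread sees its own attributes on this object, similar in spirit
    # to thread_local storage in C++.
    _tls = threading.local()

    @classmethod
    def _counters(cls):
        # Lazily create the per-thread counter table.
        if not hasattr(cls._tls, "counters"):
            cls._tls.counters = {}
        return cls._tls.counters

    @classmethod
    def next_index(cls, name):
        # Return the current index for `name`, then increment it.
        counters = cls._counters()
        idx = counters.get(name, 0)
        counters[name] = idx + 1
        return idx

    @classmethod
    def reset_counters(cls):
        # Called at an iteration boundary so indices start from zero again.
        cls._counters().clear()


def run_step(module_names):
    """Simulate one training step and return the tensor names it would log."""
    PrecisionCheckEnv.reset_counters()
    return [f"{n}_{PrecisionCheckEnv.next_index(n)}_forward" for n in module_names]
```

With the reset in place, two consecutive steps over the same modules yield identical names. Without it, the indices would keep growing from one iteration to the next.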

Precision Checker Refactoring

- Remove baseline comparison functionality (use separate script instead)
- Remove table format output, keep only simple and md5 formats
- Add SaveNpy() function with rank subdirectory support
- Simplify log format: [GAS-X] [L-Y] name_idx_stage tensor[i]: dtype=... shape=... min=... max=... mean=... [values] [NaN:X Inf:Y]
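
To make the simplified log format concrete, here is a small Python sketch that computes the listed statistics and renders one line in that shape. It is an illustration only: `compute_stats` and `format_line` are hypothetical helpers, computing min/max/mean over finite values only is an assumption, and the number formatting is not taken from the actual implementation. Printing the first 6 values follows the commit message for this PR.

```python
import math
from dataclasses import dataclass


@dataclass
class TensorStats:
    # Field names follow the TensorStats struct named in the commit message.
    min: float
    max: float
    mean: float
    nan_count: int
    inf_count: int


def compute_stats(values):
    """Collect min/max/mean over finite values, plus NaN and Inf counts."""
    finite = [v for v in values if math.isfinite(v)]
    nan_count = sum(1 for v in values if math.isnan(v))
    inf_count = sum(1 for v in values if math.isinf(v))
    if not finite:
        # Assumption: report NaN statistics when nothing finite is present.
        return TensorStats(math.nan, math.nan, math.nan, nan_count, inf_count)
    mean = sum(finite) / len(finite)
    return TensorStats(min(finite), max(finite), mean, nan_count, inf_count)


def format_line(gas, layer, name, idx, stage, i, dtype, shape, values):
    """Render one log line in the [GAS-X] [L-Y] name_idx_stage layout."""
    s = compute_stats(values)
    head = ", ".join(f"{v:.4g}" for v in values[:6])  # first 6 values only
    return (
        f"[GAS-{gas}] [L-{layer}] {name}_{idx}_{stage} tensor[{i}]: "
        f"dtype={dtype} shape={shape} "
        f"min={s.min:.4g} max={s.max:.4g} mean={s.mean:.4g} "
        f"[{head}] [NaN:{s.nan_count} Inf:{s.inf_count}]"
    )
```

Calling `format_line(0, 1, "linear", 0, "forward", 0, "float32", (5,), data)` produces a single line that can be compared across runs with ordinary text tools.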

New Scripts

- scripts/precision_check/precision_compare.py - Offline NPY comparison tool
- scripts/precision_check/run_precision_check_gpt2.sh - GPT2 verification script
- scripts/precision_check/run_precision_check_llama3.sh - LLaMA3 verification script

Documentation

- Update docs/precision_checker_guide.md to reflect current implementation

Usage Example

# Basic check
./build/gpt2 --precision_check "level=1" --num_iteration 1

# Save NPY files
./build/gpt2 --precision_check "level=1,save_tensors=true" --num_iteration 1

# MD5 format
./build/gpt2 --precision_check "level=1,format=md5" --num_iteration 1

# Compare two runs
python scripts/precision_check/precision_compare.py \
    --dir1 ./precision_check/run1 \
    --dir2 ./precision_check/run2

Testing Example

Run verification script:

bash scripts/precision_check/run_precision_check_gpt2.sh

…sion checker

Counter mechanism:
- Add ResetCounters() to clear tensor counter at iteration boundaries
- Move counter management to PrecisionCheckEnv with thread_local storage
- Call ResetCounters() at start of each training step in gpt2/llama3

Precision checker refactoring:
- Remove baseline comparison functionality (use separate script instead)
- Remove table format output, keep only simple and md5 formats
- Add TensorStats struct with min/max/mean/nan_count/inf_count
- Add SaveNpy() function for NPY file saving with rank subdirectories
- Simplify log output format with dtype, shape, stats, and first 6 values
- Change stage names from "Module Forward/Backward Output" to "Forward/Backward Output"
- Use std::filesystem instead of sys/stat.h for directory creation

Documentation and scripts:
- Update docs/precision_checker_guide.md with current implementation
- Add precision_compare.py for offline NPY comparison
- Add run_precision_check_gpt2.sh and run_precision_check_llama3.sh

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add GlobalModuleHookRegistry singleton to decouple PrecisionChecker from Module::operator(), allowing any hook to be registered globally
- Add md5_tolerance config option for PrecisionChecker to handle BF16 precision differences (e.g., md5_tolerance=1e-3 makes 4.0003 and 4.0004 produce the same MD5 hash)
- Update gpt2 and llama3 examples to use the new hook registration API

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
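
One way to picture the md5_tolerance behavior is to quantize values before hashing. The Python sketch below assumes each value is snapped to the nearest multiple of the tolerance. This page does not show how InfiniTrain actually applies the tolerance, so treat `md5_with_tolerance` as a hypothetical helper rather than the project's algorithm.

```python
import hashlib
import struct


def md5_with_tolerance(values, tolerance=0.0):
    """Hash a list of finite floats, optionally after quantizing them.

    Assumption: quantization rounds each value to the nearest multiple of
    `tolerance`. Values must be finite, since round() rejects NaN and Inf.
    """
    if tolerance > 0:
        # Map each value to an integer bucket index, then hash the indices.
        buckets = [round(v / tolerance) for v in values]
        payload = struct.pack(f"<{len(buckets)}q", *buckets)
    else:
        # No tolerance: hash the exact 64-bit representation of each value.
        payload = struct.pack(f"<{len(values)}d", *values)
    return hashlib.md5(payload).hexdigest()
```

Under this scheme 4.0003 and 4.0004 both land in bucket 4000 for a tolerance of 1e-3 and hash identically. Two values that straddle a bucket boundary (for example 4.0004 and 4.0006) would still hash differently even though they are closer than the tolerance, which is a known limitation of rounding-based quantization.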

Replace all Chinese comments with English translations in global_module_hook_registry.h for better international accessibility.

infini_train/include/utils/precision_check_config.h

infini_train/src/utils/precision_check_config.cc

scripts/precision_check/run_precision_check_gpt2.sh

scripts/precision_check/run_precision_check_llama3.sh

kilinchange · 2026-01-28T03:38:17Z

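Also, could you add a step at the end of the test script scripts/run_models_and_profile.bash that runs compare_loss.py to compare precision? (You could add a config option for passing in the log path to compare against; if it is not provided, the loss comparison script is skipped by default.)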

- Rename compare_loss.py/compare_tps.py from tools/ to scripts/
- Add --verbose flag to comparison scripts for detailed output
- Show full paths in "Files only in..." messages
- Only print comparison details for mismatches (quiet by default)
- Add precision_check_config.json and run_precision_check.sh unified runner
- Delete old run_precision_check_gpt2.sh/llama3.sh scripts
- Add COMPARE_LOG_DIR support to run_models_and_profile.bash
- Add tls_ prefix to thread_local variables for consistency
- Add error handling with log tail output in run_models_and_profile.bash
- Fix timestamped_path_ default initialization

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

chen2021673 · 2026-01-29T06:34:49Z

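> Also, could you add a step at the end of the test script scripts/run_models_and_profile.bash that runs compare_loss.py to compare precision? (You could add a config option for passing in the log path to compare against; if it is not provided, the loss comparison script is skipped by default.)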

done

- Remove ambiguous run_precision_check.sh and precision_check_config.json
- Add PrecisionChecker::Init() for automatic module hook registration

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

infini_train/src/utils/global_module_hook_registry.cc

…erHook

- RegisterHook now returns std::unique_ptr<HookHandle> for hook removal
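
The handle-based removal pattern mentioned here can be illustrated with a short Python sketch. It is not the C++ API from this PR: only the names GlobalModuleHookRegistry, RegisterHook, and HookHandle come from the source, while `instance`, `apply_hooks`, `remove`, and the internal dictionary are hypothetical. In C++ the handle is returned as std::unique_ptr<HookHandle>; whether the hook is removed on destruction or through an explicit call is not stated on this page.

```python
class HookHandle:
    """Token returned by register_hook; lets the caller unregister later."""

    def __init__(self, registry, hook_id):
        self._registry = registry
        self._hook_id = hook_id

    def remove(self):
        # Removing twice is harmless: pop() ignores a missing id.
        self._registry._hooks.pop(self._hook_id, None)


class GlobalModuleHookRegistry:
    """Process-wide registry that any module call can consult."""

    _instance = None

    def __init__(self):
        self._hooks = {}   # hook id -> callable
        self._next_id = 0

    @classmethod
    def instance(cls):
        # Lazily create the single shared registry.
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def register_hook(self, hook):
        hook_id = self._next_id
        self._next_id += 1
        self._hooks[hook_id] = hook
        return HookHandle(self, hook_id)

    def apply_hooks(self, module):
        # Iterate over a copy so a hook may remove itself while running.
        for hook in list(self._hooks.values()):
            hook(module)
```

Returning a handle keeps ownership of the registration with the caller, so a component such as a precision checker can detach its hooks without the registry needing to know who registered what.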

infini_train/src/utils/global_module_hook_registry.cc

- Update precision checker to use new hook registration APIs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Use Tensor::output_idx() instead of TryMarkModuleBackwardHookRegistered to ensure hooks are only registered for the first output tensor.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

infini_train/include/autograd/function.h

infini_train/include/utils/global_module_hook_registry.h

chen2021673 and others added 3 commits January 22, 2026 07:49

docs: translate Chinese comments to English in GlobalModuleHookRegistry

2d8abfe

kilinchange requested review from Chamberlain0w0 and kilinchange January 26, 2026 07:25

kilinchange requested changes Jan 28, 2026

chen2021673 requested review from JYMiracle305 and kilinchange January 29, 2026 06:35

chen2021673 and others added 2 commits January 29, 2026 09:10

refactor: simplify precision checker and improve test coverage

8b55238

style: apply clang-format-16

8a7796d

kilinchange requested changes Jan 29, 2026

infini_train/src/utils/global_module_hook_registry.cc (outdated, resolved)

feat: add HookHandle return value to GlobalModuleHookRegistry::Regist…

88f5e30

…erHook

- RegisterHook now returns std::unique_ptr<HookHandle> for hook removal

chen2021673 requested a review from kilinchange January 29, 2026 15:20

kilinchange requested changes Jan 30, 2026

infini_train/src/utils/global_module_hook_registry.cc (outdated, resolved)

refactor: refactor GlobalModuleHookRegistry

c014d0f

chen2021673 requested a review from kilinchange January 30, 2026 09:18

refactor: simplify backward hook registration using output_idx

1ff0fdb

kilinchange requested changes Feb 2, 2026

infini_train/include/autograd/function.h (resolved)

infini_train/include/utils/global_module_hook_registry.h (outdated, resolved)

refactor: clean up hook registry and remove unused atomic flag

7e13c93

kilinchange approved these changes Feb 2, 2026

kilinchange merged commit 49d14ab into master Feb 2, 2026
2 checks passed

kilinchange deleted the fix_precision_checker branch February 2, 2026 09:00

kilinchange merged 10 commits into master from fix_precision_checker
