fix: resolve multi-iteration tensor file overwrite and simplify precision checker #104

Merged: kilinchange merged 10 commits into master from fix_precision_checker (Feb 2, 2026)
@chen2021673 (Contributor) commented Jan 22, 2026

Summary

Fix precision checker file accumulation issue during multi-iteration runs and simplify the overall implementation.

Changes

Counter Mechanism Fix

  • Add ResetCounters() method to reset tensor counter at iteration boundaries
  • Move counter management to PrecisionCheckEnv with thread_local storage for thread safety
  • Call ResetCounters() at the start of each training step in gpt2/llama3
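
The counter behaviour listed above can be illustrated with a minimal Python sketch. The PR's implementation is C++ and lives in PrecisionCheckEnv; the function names below (`next_index`, `reset_counters`, `run_step`) and the `name_idx` label shape are hypothetical stand-ins, chosen only to show why a per-thread counter that is reset at each step yields the same indices in every iteration.

```python
import threading
from collections import defaultdict

# Hypothetical stand-in for the thread_local counter state (the real code is C++).
_tls = threading.local()


def _counters():
    # Lazily create one counter table per thread.
    if not hasattr(_tls, "counters"):
        _tls.counters = defaultdict(int)
    return _tls.counters


def next_index(name):
    """Return the next occurrence index for `name` on the calling thread."""
    counters = _counters()
    idx = counters[name]
    counters[name] += 1
    return idx


def reset_counters():
    """Clear the calling thread's counters; meant to run at the start of a step."""
    _counters().clear()


def run_step(module_names):
    """Simulate one training step and return the labels it would emit."""
    reset_counters()
    return [f"{name}_{next_index(name)}" for name in module_names]


calls = ["attn", "mlp", "attn"]
first = run_step(calls)
second = run_step(calls)
print(first)   # ['attn_0', 'mlp_0', 'attn_1']
print(second)  # ['attn_0', 'mlp_0', 'attn_1']
```

Without the reset in `run_step`, the second call would continue from the indices left by the first, so labels would drift from one iteration to the next.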

Precision Checker Refactoring

  • Remove baseline comparison functionality (use separate script instead)
  • Remove table format output, keep only simple and md5 formats
  • Add SaveNpy() function with rank subdirectory support
  • Simplify log format: [GAS-X] [L-Y] name_idx_stage tensor[i]: dtype=... shape=... min=... max=... mean=... [values] [NaN:X Inf:Y]
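
As a rough illustration of the statistics that appear in the simplified log line (min, max, mean, a short value preview, and NaN/Inf counts), here is a small Python sketch. It is not the checker's C++ code: the function name `tensor_stats`, the dictionary layout, and the choice to compute min/max/mean over finite values only are assumptions made for the example.

```python
import math


def tensor_stats(values, preview=6):
    """Summarise a flat list of floats (hypothetical helper, not the PR's API)."""
    # Assumption: non-finite entries are counted separately and excluded
    # from min/max/mean. The actual checker may treat them differently.
    finite = [v for v in values if math.isfinite(v)]
    return {
        "min": min(finite) if finite else math.nan,
        "max": max(finite) if finite else math.nan,
        "mean": sum(finite) / len(finite) if finite else math.nan,
        "nan_count": sum(1 for v in values if math.isnan(v)),
        "inf_count": sum(1 for v in values if math.isinf(v)),
        # Only the first few values are kept for the log preview.
        "first": list(values[:preview]),
    }


sample = [1.0, -2.0, math.nan, math.inf, 4.0, 0.5, 9.0]
stats = tensor_stats(sample)
print(stats["min"], stats["max"], stats["mean"])  # -2.0 9.0 2.5
print(stats["nan_count"], stats["inf_count"])     # 1 1
```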

New Scripts

  • scripts/precision_check/precision_compare.py - Offline NPY comparison tool
  • scripts/precision_check/run_precision_check_gpt2.sh - GPT2 verification script
  • scripts/precision_check/run_precision_check_llama3.sh - LLaMA3 verification script

Documentation

  • Update docs/precision_checker_guide.md to reflect current implementation

Usage Example

# Basic check
./build/gpt2 --precision_check "level=1" --num_iteration 1

# Save NPY files
./build/gpt2 --precision_check "level=1,save_tensors=true" --num_iteration 1

# MD5 format
./build/gpt2 --precision_check "level=1,format=md5" --num_iteration 1

# Compare two runs
python scripts/precision_check/precision_compare.py \
    --dir1 ./precision_check/run1 \
    --dir2 ./precision_check/run2
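
To give a feel for what an offline comparison of two run directories involves, the following Python sketch walks both trees, reports files present on only one side, and flags shared files whose bytes differ. It is a simplified stand-in, not the contents of precision_compare.py: the real tool presumably compares array values rather than raw bytes, and the `compare_npy_dirs` name and the `rank_0` subdirectory used in the demo are hypothetical.

```python
import filecmp
import tempfile
from pathlib import Path


def compare_npy_dirs(dir1, dir2):
    """Return (only_in_dir1, only_in_dir2, mismatched) as relative paths."""
    root1, root2 = Path(dir1), Path(dir2)
    files1 = {p.relative_to(root1) for p in root1.rglob("*.npy")}
    files2 = {p.relative_to(root2) for p in root2.rglob("*.npy")}
    only1 = sorted(files1 - files2)
    only2 = sorted(files2 - files1)
    # Assumption: a bytewise check stands in for a numeric comparison here.
    mismatched = sorted(
        rel
        for rel in files1 & files2
        if not filecmp.cmp(root1 / rel, root2 / rel, shallow=False)
    )
    return only1, only2, mismatched


# Demo on throwaway directories with placeholder file contents.
with tempfile.TemporaryDirectory() as tmp:
    run1, run2 = Path(tmp, "run1"), Path(tmp, "run2")
    for run in (run1, run2):
        (run / "rank_0").mkdir(parents=True)
    (run1 / "rank_0" / "a.npy").write_bytes(b"same")
    (run2 / "rank_0" / "a.npy").write_bytes(b"same")
    (run1 / "rank_0" / "b.npy").write_bytes(b"left")
    (run2 / "rank_0" / "b.npy").write_bytes(b"right")
    (run1 / "rank_0" / "c.npy").write_bytes(b"extra")
    only1, only2, mismatched = compare_npy_dirs(run1, run2)

print(only1, only2, mismatched)
```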

Testing Example

Run verification script:

bash scripts/precision_check/run_precision_check_gpt2.sh

chen2021673 and others added 3 commits January 22, 2026 07:49
…sion checker

Counter mechanism:
- Add ResetCounters() to clear tensor counter at iteration boundaries
- Move counter management to PrecisionCheckEnv with thread_local storage
- Call ResetCounters() at start of each training step in gpt2/llama3

Precision checker refactoring:
- Remove baseline comparison functionality (use separate script instead)
- Remove table format output, keep only simple and md5 formats
- Add TensorStats struct with min/max/mean/nan_count/inf_count
- Add SaveNpy() function for NPY file saving with rank subdirectories
- Simplify log output format with dtype, shape, stats, and first 6 values
- Change stage names from "Module Forward/Backward Output" to "Forward/Backward Output"
- Use std::filesystem instead of sys/stat.h for directory creation

Documentation and scripts:
- Update docs/precision_checker_guide.md with current implementation
- Add precision_compare.py for offline NPY comparison
- Add run_precision_check_gpt2.sh and run_precision_check_llama3.sh

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add GlobalModuleHookRegistry singleton to decouple PrecisionChecker
  from Module::operator(), allowing any hook to be registered globally
- Add md5_tolerance config option for PrecisionChecker to handle BF16
  precision differences (e.g., md5_tolerance=1e-3 makes 4.0003 and
  4.0004 produce the same MD5 hash)
- Update gpt2 and llama3 examples to use the new hook registration API

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
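
The md5_tolerance example in the commit message above (4.0003 and 4.0004 hashing identically at 1e-3) is consistent with snapping each value to a multiple of the tolerance before hashing. The Python sketch below shows that idea under this assumption; the commit does not spell out the actual C++ quantization rule, and `tolerant_md5` is a hypothetical name.

```python
import hashlib
import struct


def tolerant_md5(values, tolerance):
    """Hash finite floats after snapping each to a multiple of `tolerance`."""
    # Assumption: values are finite and tolerance > 0.
    # round() maps each value to an integer bucket index.
    buckets = [round(v / tolerance) for v in values]
    # Pack the bucket indices as little-endian 64-bit integers.
    payload = struct.pack(f"<{len(buckets)}q", *buckets)
    return hashlib.md5(payload).hexdigest()


h_a = tolerant_md5([4.0003], 1e-3)
h_b = tolerant_md5([4.0004], 1e-3)
h_c = tolerant_md5([4.0020], 1e-3)
h_d = tolerant_md5([4.0006], 1e-3)

print(h_a == h_b)  # True: both fall in bucket 4000
print(h_a == h_c)  # False: 4.0020 falls in bucket 4002
print(h_b == h_d)  # False: 4.0006 rounds up to bucket 4001
```

The last line shows a limit of any bucket-based scheme: two values closer than the tolerance can still land on opposite sides of a bucket boundary and hash differently.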
Replace all Chinese comments with English translations in
global_module_hook_registry.h for better international accessibility.
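@kilinchange (Collaborator) commented:

Also, could you add a step at the end of the scripts/run_models_and_profile.bash test script that runs compare_loss.py to compare precision? (You could add a config option that passes in the log path to compare against; if it is not provided, the loss comparison script is skipped by default.)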

- Rename compare_loss.py/compare_tps.py from tools/ to scripts/
- Add --verbose flag to comparison scripts for detailed output
- Show full paths in "Files only in..." messages
- Only print comparison details for mismatches (quiet by default)
- Add precision_check_config.json and run_precision_check.sh unified runner
- Delete old run_precision_check_gpt2.sh/llama3.sh scripts
- Add COMPARE_LOG_DIR support to run_models_and_profile.bash
- Add tls_ prefix to thread_local variables for consistency
- Add error handling with log tail output in run_models_and_profile.bash
- Fix timestamped_path_ default initialization

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
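@chen2021673 (Contributor, Author) commented:

> Also, could you add a step at the end of the scripts/run_models_and_profile.bash test script that runs compare_loss.py to compare precision? (You could add a config option that passes in the log path to compare against; if it is not provided, the loss comparison script is skipped by default.)

done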

chen2021673 and others added 2 commits January 29, 2026 09:10
- Remove ambiguous run_precision_check.sh and precision_check_config.json
- Add PrecisionChecker::Init() for automatic module hook registration

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…erHook

- RegisterHook now returns std::unique_ptr<HookHandle> for hook removal
- Update precision checker to use new hook registration APIs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
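
The registration pattern described in the commits above (a global registry whose RegisterHook returns a handle used for removal) can be sketched in Python as follows. This is an illustration only: the class and method names are hypothetical, and the source does not say whether the C++ std::unique_ptr<HookHandle> removes the hook through an explicit call or on destruction, so the sketch uses an explicit `remove()`.

```python
class HookHandle:
    """Token returned by register_hook; remove() unregisters the hook."""

    def __init__(self, hooks, hook_id):
        self._hooks = hooks
        self._hook_id = hook_id

    def remove(self):
        # pop() with a default makes repeated removal harmless.
        self._hooks.pop(self._hook_id, None)


class GlobalHookRegistry:
    """Hypothetical singleton that applies every registered hook to a module."""

    _instance = None

    def __init__(self):
        self._hooks = {}
        self._next_id = 0

    @classmethod
    def instance(cls):
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def register_hook(self, hook):
        hook_id = self._next_id
        self._next_id += 1
        self._hooks[hook_id] = hook
        return HookHandle(self._hooks, hook_id)

    def apply_hooks(self, module_name):
        # Iterate over a copy so a hook may unregister itself while running.
        for hook in list(self._hooks.values()):
            hook(module_name)


seen = []
registry = GlobalHookRegistry.instance()
handle = registry.register_hook(seen.append)
registry.apply_hooks("layer0")
handle.remove()
registry.apply_hooks("layer1")
print(seen)  # ['layer0']
```

Returning a handle keeps the registry decoupled from whoever registered the hook: the caller holds the only thing needed to undo the registration.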
Use Tensor::output_idx() instead of TryMarkModuleBackwardHookRegistered
to ensure hooks are only registered for the first output tensor.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@kilinchange kilinchange merged commit 49d14ab into master Feb 2, 2026
2 checks passed
@kilinchange kilinchange deleted the fix_precision_checker branch February 2, 2026 09:00