Problem
Debugging "why does dataset X miss calibration target Y" currently means manual H5 spelunking: open the dataset, map the target name to a PolicyEngine variable, and check whether that variable is absent or all-zero. This came up diagnosing why a candidate dataset scored −100% on business net losses (self_employment_income had no negative values) and why a comparison baseline scored −100% on social_security_retirement (the column was absent). In both cases the dataset structurally cannot reproduce the target regardless of reweighting, but that only became clear after hand analysis.
Proposed
A target_coverage report: given a dataset H5, list every national calibration target whose underlying variable is absent or all-zero in that dataset — the coverage holes a refit can never close.
Reuse the existing target→variable mapping in policyengine_us_data/utils/national_target_parity.py (classify_national_target, _direct_census_variable, load_national_target_records), then check the dataset's columns.
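The check itself could look roughly like the sketch below. Everything in it is illustrative: `TargetSpec`, `find_coverage_holes`, and the `negative_only` flag are hypothetical names, the mapping is a plain dict standing in for whatever the existing helpers in national_target_parity.py produce (their real signatures are not assumed here), and the columns are a toy dict rather than a loaded H5. The `negative_only` flag is an assumption about how a loss-type target such as business net losses would be checked, since a column with no negative values contributes nothing to it.

```python
from typing import Dict, Iterable, Mapping, NamedTuple


class TargetSpec(NamedTuple):
    """Hypothetical stand-in for one target -> variable mapping entry."""

    variable: str
    # Assumption: loss-type targets only draw on the negative part of a column.
    negative_only: bool = False


def find_coverage_holes(
    columns: Mapping[str, Iterable[float]],
    targets: Mapping[str, TargetSpec],
) -> Dict[str, str]:
    """Return {target_name: reason} for targets the dataset cannot reproduce."""
    holes: Dict[str, str] = {}
    for name, spec in targets.items():
        if spec.variable not in columns:
            holes[name] = "absent"
            continue
        values = columns[spec.variable]
        if spec.negative_only:
            contributes = any(v < 0 for v in values)
        else:
            contributes = any(v != 0 for v in values)
        if not contributes:
            holes[name] = "no negative values" if spec.negative_only else "all-zero"
    return holes


# Toy data, made up for illustration only.
columns = {
    "self_employment_income": [0.0, 1200.0, 53000.0],
    "employment_income": [41000.0, 0.0, 0.0],
}
targets = {
    "business net losses": TargetSpec("self_employment_income", negative_only=True),
    "social_security_retirement": TargetSpec("social_security_retirement"),
    "employment_income": TargetSpec("employment_income"),
}
holes = find_coverage_holes(columns, targets)
```

Because the function only iterates over each column, 1-D arrays read from an H5 file should work in place of the lists used here.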
CLI: python -m policyengine_us_data.utils.target_coverage --dataset PATH, which prints the absent / all-zero targets. This turns the manual eCPS-vs-candidate coverage analysis into a one-command step and would have surfaced both the SS-retirement and business-loss holes immediately.
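A possible shape for the entry point, as a sketch only: the `--dataset` flag comes from the proposal above, while `build_parser`, `format_report`, and the two-column output layout are assumptions. Loading the H5 and running the coverage check are deliberately left out.

```python
import argparse
from typing import Dict, List


def build_parser() -> argparse.ArgumentParser:
    """Argument parser for the proposed target_coverage module."""
    parser = argparse.ArgumentParser(
        prog="python -m policyengine_us_data.utils.target_coverage",
        description="List national calibration targets a dataset cannot reproduce.",
    )
    parser.add_argument(
        "--dataset", required=True, metavar="PATH", help="Dataset H5 to inspect."
    )
    return parser


def format_report(holes: Dict[str, str]) -> List[str]:
    """Render {target_name: reason} as aligned lines (layout is an assumption)."""
    if not holes:
        return ["No coverage holes found."]
    width = max(len(name) for name in holes)
    return [f"{name.ljust(width)}  {reason}" for name, reason in sorted(holes.items())]


# Parse an example command line; the file name is a placeholder.
args = build_parser().parse_args(["--dataset", "candidate.h5"])

# In the real module, `holes` would come from checking the file at args.dataset.
lines = format_report(
    {"social_security_retirement": "absent", "business net losses": "all-zero"}
)
for line in lines:
    print(line)
```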
Follow-up to #1163.