Add a target-coverage report (which calibration targets a dataset cannot reproduce) #1165

@MaxGhenis

Problem

Debugging "why does dataset X miss calibration target Y" currently means manual H5 spelunking: open the dataset, map the target name to a PolicyEngine variable, and check whether that variable is absent or all-zero. This came up diagnosing why a candidate dataset scored −100% on business net losses (self_employment_income had no negative values) and why a comparison baseline scored −100% on social_security_retirement (the column was absent). In both cases the dataset structurally cannot reproduce the target regardless of reweighting, but that only became clear after hand analysis.

Proposed

A target_coverage report: given a dataset H5, list every national calibration target whose underlying variable is absent or all-zero in that dataset — the coverage holes a refit can never close.

Reuse the existing target→variable mapping in policyengine_us_data/utils/national_target_parity.py (classify_national_target, _direct_census_variable, load_national_target_records), then check the dataset's columns.
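A minimal sketch of the check itself, under stated assumptions: the target→variable mapping is passed in as a plain dict (a stand-in for whatever the helpers in national_target_parity.py return, whose signatures are not shown here), and the dataset columns are already loaded into memory as a mapping from variable name to values. CoverageHole and find_coverage_holes are hypothetical names used only for illustration.

```python
from typing import List, Mapping, NamedTuple, Sequence


class CoverageHole(NamedTuple):
    """One target the dataset cannot reproduce (hypothetical record type)."""
    target: str
    variable: str
    reason: str  # "absent" or "all_zero"


def find_coverage_holes(
    target_to_variable: Mapping[str, str],
    columns: Mapping[str, Sequence[float]],
) -> List[CoverageHole]:
    """Return targets whose underlying variable is absent or all-zero."""
    holes = []
    for target, variable in sorted(target_to_variable.items()):
        if variable not in columns:
            holes.append(CoverageHole(target, variable, "absent"))
        elif not any(value != 0 for value in columns[variable]):
            # An empty column is also reported as all-zero.
            holes.append(CoverageHole(target, variable, "all_zero"))
    return holes


# Illustrative data only; the target names and values are made up.
columns = {
    "employment_income": [50_000.0, 0.0, 32_000.0],
    "self_employment_income": [0.0, 0.0, 0.0],
}
targets = {
    "wages": "employment_income",
    "business net income": "self_employment_income",
    "social security retirement": "social_security_retirement",
}
holes = find_coverage_holes(targets, columns)
for hole in holes:
    print(f"{hole.reason:<8}  {hole.target}  ({hole.variable})")
```

Note that an absent/all-zero test flags a missing or empty column but not a column that merely lacks negative values, so the business-loss case may need either a mapping to a loss-specific variable or an additional sign-aware check. That extension is outside this sketch.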

CLI: python -m policyengine_us_data.utils.target_coverage --dataset PATH, which prints the absent and all-zero targets. This turns the manual eCPS-vs-candidate coverage analysis into a one-command step and would have surfaced both the SS-retirement and business-loss holes immediately.
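One possible shape for the command-line wrapper, as a sketch rather than a spec. Only the --dataset flag comes from the proposal; the output format, the function names (build_parser, format_report, main), and the injected find_holes callable are assumptions made for illustration. The callable stands in for "load the H5, apply the target→variable mapping, run the check".

```python
import argparse
from typing import Callable, Iterable, List, Optional, Sequence, Tuple

Hole = Tuple[str, str, str]  # (target, variable, reason); hypothetical shape


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="python -m policyengine_us_data.utils.target_coverage",
        description="List calibration targets a dataset cannot reproduce.",
    )
    parser.add_argument("--dataset", required=True, help="Path to the dataset H5.")
    return parser


def format_report(dataset: str, holes: Iterable[Hole]) -> str:
    """Render one summary line followed by one line per uncovered target."""
    holes = list(holes)
    lines = [f"{dataset}: {len(holes)} target(s) with no coverage"]
    for target, variable, reason in holes:
        lines.append(f"  {reason:<8}  {target}  ({variable})")
    return "\n".join(lines)


def main(
    argv: Optional[Sequence[str]],
    find_holes: Callable[[str], List[Hole]],
) -> int:
    args = build_parser().parse_args(argv)
    print(format_report(args.dataset, find_holes(args.dataset)))
    return 0


# Stand-in for the real loader and check; returns made-up results.
def fake_find_holes(path: str) -> List[Hole]:
    return [("social security retirement", "social_security_retirement", "absent")]


exit_code = main(["--dataset", "candidate.h5"], fake_find_holes)
```

In a real module, main would be called under the usual `if __name__ == "__main__":` guard with `sys.argv[1:]`, so that `python -m` invokes it directly.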

Follow-up to #1163.
