Add structured logging and feature output diagnostics to FeatureBuilder by sundy1994 · Pull Request #370 · Watts-Lab/team_comm_tools

sundy1994 · 2026-05-07T18:32:59Z

Summary

Introduces a reusable setup_logger utility in src/team_comm_tools/utils/preprocess.py that writes timestamped logs to ./<output_file_base>/logs/, auto-creates the directory, and guards against duplicate handlers / propagation to root.
Wires two loggers into FeatureBuilder — feature_builder.log for top-level run info and summary_details.log for verbose per-column output — and threads the logger through ChatLevelFeaturesCalculator, UserLevelFeaturesCalculator, ConversationLevelFeaturesCalculator, and check_embeddings. print and warnings.warn calls for errors / invalid configs are now mirrored to the log.
Adds perf_counter-based timings around each feature method (chat / user / conversation level) and around sentence-vector and BERT generation in check_embeddings.py, so the log captures per-step durations.
Adds a post-featurization diagnostics step (generate_summary_stats) that, for each output level, reports columns with high NA ratios, high zero ratios, and groups of highly correlated columns (Spearman, configurable threshold). New constructor params: corr_thresh, min_na_ratio, min_zero_ratio, min_group_size, treat_zero_as_na, drop_redundant_columns. With drop_redundant_columns=True, columns whose NA or zero ratios exceed the thresholds are dropped. In addition, only one representative per correlated group is kept (chosen by valid-data count and variance); the other columns in the group are dropped.
Logs the run header (timestamp, dataset shape — lines / unique speakers / unique conversations) at the start of featurize.
Cleans up imports in check_embeddings.py (drops top-level torch / unused util in favor of narrower from torch import cuda, no_grad).
Updates and reorganizes docstrings in feature_builder.py.
Adds *.csv and *.log to .gitignore.
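
For reviewers who want a concrete picture of the logging and timing items above, here is a minimal sketch using only the standard library. It is not the code in this PR: the setup_logger signature, the timed_call helper, the demo function, the logger name, and the record format are all assumptions made for illustration. The sketch also assumes "timestamped" means a timestamp on each record rather than in the file name.

```python
import logging
import os
import tempfile
from time import perf_counter


def setup_logger(name, output_file_base, filename):
    """Hypothetical sketch; the PR's real setup_logger signature may differ."""
    log_dir = os.path.join(output_file_base, "logs")
    os.makedirs(log_dir, exist_ok=True)  # auto-create <output_file_base>/logs/
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # do not forward records to the root logger
    if not logger.handlers:  # repeated calls must not stack handlers
        handler = logging.FileHandler(
            os.path.join(log_dir, filename), encoding="utf-8"
        )
        # Assumed format: one timestamp per record.
        handler.setFormatter(
            logging.Formatter("%(asctime)s | %(levelname)s | %(message)s")
        )
        logger.addHandler(handler)
    return logger


def timed_call(logger, label, func, *args, **kwargs):
    """Run func, log its wall-clock duration, and return its result."""
    start = perf_counter()
    result = func(*args, **kwargs)
    logger.info("%s finished in %.4f s", label, perf_counter() - start)
    return result


def demo():
    """Write one timed entry to a throwaway directory and return the log text."""
    base = tempfile.mkdtemp()
    logger = setup_logger("feature_builder_demo", base, "feature_builder.log")
    # A second call with the same name reuses the existing handler.
    setup_logger("feature_builder_demo", base, "feature_builder.log")
    total = timed_call(logger, "toy_feature", sum, range(1000))
    for handler in logger.handlers:
        handler.flush()
    log_path = os.path.join(base, "logs", "feature_builder.log")
    with open(log_path, encoding="utf-8") as fh:
        return total, len(logger.handlers), fh.read()
```

Calling demo() leaves exactly one handler on the logger even though setup_logger runs twice, which is the duplicate-handler guard described above. Setting propagate to False keeps the same records from also reaching any handlers on the root logger.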

Behavior change to flag for review

As discussed, the "first x percent" feature is deprecated. analyze_first_pct and get_first_pct_of_chat are commented out in feature_builder.py, which removes the multi-truncation loop in featurize.

sundy1994 added 3 commits January 7, 2026 11:15

Enhance logging functionality

2f79bf8

Enhance logging and feature processing in FeatureBuilder and utility …

e4ded18

…scripts

update docstrings

c944c74

sundy1994 requested a review from xehu May 7, 2026 18:33

sundy1994 wants to merge 3 commits into dev from log_file
