Add structured logging and feature output diagnostics to FeatureBuilder #370

Open
sundy1994 wants to merge 3 commits into dev from log_file

@sundy1994
Collaborator

Summary

  • Introduces a reusable setup_logger utility in src/team_comm_tools/utils/preprocess.py that writes timestamped logs to ./<output_file_base>/logs/, auto-creates the directory, and guards against duplicate handlers / propagation to root.
  • Wires two loggers into FeatureBuilder (feature_builder.log for top-level run info and summary_details.log for verbose per-column output) and threads the logger through ChatLevelFeaturesCalculator, UserLevelFeaturesCalculator, ConversationLevelFeaturesCalculator, and check_embeddings. print and warnings.warn calls for errors / invalid configs are now mirrored to the log.
  • Adds perf_counter-based timings around each feature method (chat / user / conversation level) and around sentence-vector and BERT generation in check_embeddings.py, so the log captures per-step durations.
  • Adds a post-featurization diagnostics step (generate_summary_stats) that, for each output level, reports columns with high NA ratios, columns with high zero ratios, and groups of highly correlated columns (Spearman, configurable threshold). New constructor params: corr_thresh, min_na_ratio, min_zero_ratio, min_group_size, treat_zero_as_na, drop_redundant_columns. With drop_redundant_columns=True, columns whose NA or zero ratios exceed the thresholds are dropped, and only one representative per correlated group is kept (chosen by valid-data count and variance); the other columns in the group are dropped.
  • Logs the run header (timestamp, dataset shape — lines / unique speakers / unique conversations) at the start of featurize.
  • Cleans up imports in check_embeddings.py (drops top-level torch / unused util in favor of narrower from torch import cuda, no_grad).
  • Updates and reorganizes docstrings in feature_builder.py.
  • Adds *.csv and *.log to .gitignore.
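
For reviewers who want a feel for the logging and timing pattern described above, here is a minimal sketch. It is not the code in this PR: setup_logger_sketch and timed_call are hypothetical names, the signature and log format are assumptions, and "timestamped" is read here as timestamped log lines (it could equally mean timestamped file names).

```python
import logging
import os
import tempfile
from time import perf_counter


def setup_logger_sketch(name, output_file_base, log_filename):
    """Hypothetical stand-in for setup_logger; the real signature may differ."""
    log_dir = os.path.join(output_file_base, "logs")
    os.makedirs(log_dir, exist_ok=True)  # auto-create ./<output_file_base>/logs/

    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records from reaching the root logger

    # Attach the file handler only once, even if this is called repeatedly.
    if not logger.handlers:
        handler = logging.FileHandler(os.path.join(log_dir, log_filename))
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


def timed_call(logger, label, func, *args, **kwargs):
    """Run func, log its wall-clock duration, and return its result."""
    start = perf_counter()
    result = func(*args, **kwargs)
    logger.info("%s took %.3f s", label, perf_counter() - start)
    return result


if __name__ == "__main__":
    base = tempfile.mkdtemp()  # throwaway output directory for the demo
    log = setup_logger_sketch("demo", base, "feature_builder.log")
    timed_call(log, "sum over range", sum, range(1000))
```

Checking logger.handlers before adding a handler is one common way to avoid duplicate log lines when a builder is constructed more than once in the same process; the PR may guard against this differently.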

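The diagnostics in the summary bullets can be illustrated with a dependency-free sketch. This is not generate_summary_stats itself: the helper names are hypothetical, the default thresholds are made up for illustration, the use of ">=" and of absolute correlation is an assumption, and it reports correlated pairs rather than the groups the PR builds. The actual implementation may work on DataFrames rather than plain lists.

```python
import math


def is_missing(v):
    """True for None or NaN."""
    return v is None or (isinstance(v, float) and math.isnan(v))


def na_ratio(values):
    """Fraction of entries that are missing."""
    return sum(1 for v in values if is_missing(v)) / len(values)


def zero_ratio(values):
    """Fraction of entries equal to zero."""
    return sum(1 for v in values if v == 0) / len(values)


def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of ranks, pairwise-complete."""
    pairs = [(a, b) for a, b in zip(x, y) if not (is_missing(a) or is_missing(b))]
    if len(pairs) < 2:
        return float("nan")
    rx = average_ranks([a for a, _ in pairs])
    ry = average_ranks([b for _, b in pairs])
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy) if sx and sy else float("nan")


def flag_columns(columns, corr_thresh=0.9, min_na_ratio=0.5, min_zero_ratio=0.5):
    """Return high-NA columns, high-zero columns, and highly correlated pairs."""
    high_na = [c for c, v in columns.items() if na_ratio(v) >= min_na_ratio]
    high_zero = [c for c, v in columns.items() if zero_ratio(v) >= min_zero_ratio]
    names = [c for c in columns if c not in high_na]
    pairs = [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if abs(spearman(columns[a], columns[b])) >= corr_thresh
    ]
    return high_na, high_zero, pairs
```

Here columns is a plain dict mapping column names to equal-length lists. Grouping the flagged pairs and picking one representative per group (by valid-data count and variance, per the description above) is left out of the sketch.
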
Behavior change to flag for review

As discussed, the "first x percent" feature is deprecated. analyze_first_pct and get_first_pct_of_chat are commented out in feature_builder.py, which removes the multi-truncation loop in featurize.

sundy1994 requested a review from xehu on May 7, 2026, 18:33