Summary
- `setup_logger` utility in `src/team_comm_tools/utils/preprocess.py` that writes timestamped logs to `./<output_file_base>/logs/`, auto-creates the directory, and guards against duplicate handlers and propagation to the root logger.
- [...] `FeatureBuilder`: `feature_builder.log` for top-level run info and `summary_details.log` for verbose per-column output. The logger is also threaded through `ChatLevelFeaturesCalculator`, `UserLevelFeaturesCalculator`, `ConversationLevelFeaturesCalculator`, and `check_embeddings`.
- `print` and `warnings.warn` calls for errors / invalid configs are now mirrored to the log.
- `perf_counter`-based timings around each feature method (chat / user / conversation level) and around sentence-vector and BERT generation in `check_embeddings.py`, so the log captures per-step durations.
- [...] (`generate_summary_stats`) that, for each output level, reports columns with high NA ratios, high zero ratios, and groups of highly correlated columns (Spearman, configurable threshold). New constructor params: `corr_thresh`, `min_na_ratio`, `min_zero_ratio`, `min_group_size`, `treat_zero_as_na`, `drop_redundant_columns`. With `drop_redundant_columns=True`, columns whose NA or zero ratios exceed the thresholds are dropped, and only one representative per correlated group is kept (chosen by valid-data count and variance); the others in the group are dropped.
- [...] `featurize`.
- [...] `check_embeddings.py` (drops top-level `torch` / unused `util` in favor of narrower `from torch import cuda, no_grad`).
- [...] `feature_builder.py`.
- [...] `*.csv` and `*.log` to `.gitignore`.

Behavior change to flag for review
As discussed, the first x percent feature is deprecated.
`analyze_first_pct` and `get_first_pct_of_chat` are commented out in `feature_builder.py`, removing the multi-truncation loop in `featurize`.
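
For reviewers who want a concrete picture of the logging behavior listed in the Summary, here is a minimal sketch of a logger that auto-creates its `logs/` directory, avoids duplicate handlers, and disables propagation to root, plus a `perf_counter` timing wrapper. The names `setup_logger_sketch` and `timed`, their signatures, and the log format are assumptions made for illustration; they are not the actual `setup_logger` in `preprocess.py`, whose signature may differ.

```python
import logging
import os
from time import perf_counter


def setup_logger_sketch(name, output_file_base, filename):
    """Hypothetical stand-in for setup_logger; not the PR's actual code."""
    log_dir = os.path.join(output_file_base, "logs")
    os.makedirs(log_dir, exist_ok=True)  # auto-create ./<output_file_base>/logs/

    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records from reaching the root logger

    log_path = os.path.abspath(os.path.join(log_dir, filename))
    # Guard against duplicate handlers: attach a file handler only if
    # this logger does not already write to the same file.
    already_attached = any(
        isinstance(h, logging.FileHandler) and h.baseFilename == log_path
        for h in logger.handlers
    )
    if not already_attached:
        handler = logging.FileHandler(log_path)
        # %(asctime)s gives each record a timestamp (assumed format).
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


def timed(logger, label, func, *args, **kwargs):
    """Run func, log how long it took, and return its result."""
    start = perf_counter()
    result = func(*args, **kwargs)
    logger.info("%s took %.4f s", label, perf_counter() - start)
    return result
```

Calling `setup_logger_sketch` twice with the same arguments returns the same logger with a single file handler, so repeated `FeatureBuilder` construction in one process would not duplicate log lines under this approach.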