Ensemble fix: decouple ID extraction from drop operation#124
Conversation
|
|
||
| # Dataset specific information used for processing in this example | ||
| data_processing_config: | ||
| midst_data_path: /projects/midst-experiments/all_tabddpms/ # Used to collect the data (input) |
There was a problem hiding this comment.
One step towards config modularization.
📝 WalkthroughWalkthroughThis pull request consolidates configuration restructuring with API and function signature updates. The midst_data_path configuration is moved from the top-level data_paths section into data_processing_config, along with new dataset metadata including split specifications and file naming configurations. The data collection call in run_attack.py is updated to reference the new config path and passes population_splits and challenge_splits parameters. The test_attack_model.py file refactors the ID extraction function—renamed from extract_and_drop_id_column to extract_the_main_id_column with a changed return type from tuple to Series—and updates all usage sites accordingly. Additionally, the dataset partitioning for shadow model training is adjusted by changing split_folders from ["test"] to ["final"]. The TprAtFpr API parameter names are updated across two files, replacing max_fpr with fpr_threshold and predictions with predicted_membership. Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/ensemble_attack/test_attack_model.py (1)
55-84:⚠️ Potential issue | 🟡 MinorDocstring describes wrong return type.
The docstring (lines 69-72) still describes the old return type as a tuple, but the function now returns only a
pd.Series. This inconsistency could confuse future maintainers.📝 Proposed fix for the docstring
""" - Extracts and returns the main IDs from the dataframe. The main ID column is identified based on - the data types JSON file with "id_column_name" key. - Main IDs are not repeated in the dataset. - For example, in the Berka dataset, "trans_id" is the main ID column, and "account_id" is not the main ID column. + Extracts and returns the main ID column values from the dataframe. The main ID column is + identified based on the data types JSON file with "id_column_name" key. + Main IDs are not repeated in the dataset. + For example, in the Berka dataset, "trans_id" is the main ID column, and "account_id" is not + the main ID column. Args: data_frame: Input dataframe. data_types_file_path: Path to the data types JSON file. Returns: - A tuple containing: - - The modified dataframe with ID columns dropped. - - A Series containing the extracted data of ID columns. + A Series containing the main ID column values. """
|
|
||
|
|
||
| def extract_and_drop_id_column( | ||
| def extract_the_main_id_column( |
There was a problem hiding this comment.
Super minor, but maybe "primary_id" instead of "main_id"? Primary keys are the unique keys in databases. So it might be more suggestive to a reader 🙂
There was a problem hiding this comment.
Yeah, I knew there was a better name, but I couldn't remember at the time lol. Thanks!
|
|
||
|
|
||
| def extract_and_drop_id_column( | ||
| def extract_the_main_id_column( |
There was a problem hiding this comment.
Suuuper nit-picky, but can we rename the function name to extract_main_id_column (drop the "the")? Sounds better
PR Type
Fix
Short Description
The Berka dataset has two id columns,
"account_id"and"trans_id".transaction IDs should be extracted and passed to
blending_attacker.predict. Therefore, we should first extract the main ID column, then remove all ID columns. Previously, the code was performing the extraction and drop operation just for the main ID column.