fix: redacting user retirement data in lms#37886
fix: redacting user retirement data in lms#37886ktyagiapphelix2u wants to merge 11 commits intoopenedx:masterfrom
Conversation
70f2265 to
7a7ef1b
Compare
|
Maybe this is out of scope for this PR, but I'm concerned about the explicit reference to Jenkins. I understand the desirability of recording the job id, but not everyone uses Jenkins as their runner. |
99b47d4 to
fa6aa32
Compare
|
@pdpinch: @ktyagiapphelix2u is out until Friday, but we will be removing references to 2U and Jenkins. Thanks. |
|
@pdpinch I agreed I am removing jenkins name reference. Thanks. |
63da9fe to
a7cdaa6
Compare
19bc46e to
eb5b4df
Compare
openedx/core/djangoapps/user_api/accounts/tests/test_retirement_views.py
Outdated
Show resolved
Hide resolved
|
@ktyagiapphelix2u: Also, I think we should change this to a breaking change PR. That means updating the title to We should note that switching from deleting records to redacting records is not a breaking change from the point of view of user retirement, because in either case the sensitive data has been safely taken care of. The breaking change is that these records will no longer be deleted, so any operator scripts that run against the table after retirement, that also relied on the fact that data was being deleted, would need to be updated. |
0c174cf to
536a3bd
Compare
|
@robrap I was thinking of updating archive cleanup script for Open edX. Is it expected to update it there as well, or should it be left as is? as they have moved to CI workflows rather than jenkins |
536a3bd to
5ac51b6
Compare
|
Hi all, can we get a description of the problem this is trying to address since we can't see the internal ticket? I have some (very small) amount of concern about this table growing given how often it's called, but more want to understand why this change is necessary. Since this is a breaking change in the API I think it should be versioned to a V2 endpoint as part of the depr. @fghaas I know you've been involved in |
|
@bmtcril: In thinking through explaining the issue to you, I may have come up with a backward-compatible version of this change, so that is what I am going to propose we do. The reason we wanted this change is because in Snowflake, we've used different technologies for sync, but all of them treat the status record deletes as soft-deletes, which then requires an additional custom job to clear out the soft deleted data to fully remove the sensitive data. We decided that redacting here, just like we do everywhere else, would avoid the custom jobs and ongoing maintenance. I just realized that we could have the best of both worlds, and have this api first redact, and then delete. That way it will still be backward-compatible with the deletes, but we wouldn't have to clean up the soft-deletes, because those records would already be redacted. @ktyagiapphelix2u: Can you please implement this. You'll need to ensure that the redacted data gets saved to the DB before the delete. I'm not certain exactly how to ensure this, but maybe working with Dave W. to ensure that the soft-deleted records contain the redacted data? Hopefully neither mysql nor Django nor the sync code tries to be too smart. |
|
@ktyagiapphelix2u: Once we ensure we can get everything working with redact + delete, we'd be able to close the DEPR: |
robrap
left a comment
There was a problem hiding this comment.
I'd start with either assertNumQueries (see other comment) or possibly the Django debug toolbar to ensure you are getting the update query before the delete query.
Once confirmed, you could also consider using https://www.mysqltutorial.org/mysql-administration/mysql-binary-logs/ locally to confirm the update is seen before the delete.
openedx/core/djangoapps/user_api/accounts/tests/test_retirement_views.py
Outdated
Show resolved
Hide resolved
openedx/core/djangoapps/user_api/accounts/tests/test_retirement_views.py
Show resolved
Hide resolved
robrap
left a comment
There was a problem hiding this comment.
Looks great. Minor comments. Thank you.
|
|
||
| for update_query in update_queries: | ||
| sql = update_query['sql'] | ||
| assert custom_username in sql, f"UPDATE query missing custom username '{custom_username}': {sql}" |
There was a problem hiding this comment.
I'm not sure of the exact syntax, but can you update custom_username to something like f"original_username='{custom_username}'" so we verify that the correct field is getting the correct value?
There was a problem hiding this comment.
@ktyagiapphelix2u: You either missed this comment, or didn't fully understand it. If you updated the code to:
for retirement in retirements:
retirement.original_username = redacted_email
retirement.original_email = redacted_name
retirement.original_name = redacted_username
Where we are setting the wrong field with each input, I think your test would still pass, right? I was trying to get you to assert that the correct field was being set.
There was a problem hiding this comment.
@robrap I've updated the test to check that each specific field original_username, original_email, original_name is being set with its correct corresponding value. If someone swaps the assignments as you suggested, the test will now fail immediately. Thanks.
openedx/core/djangoapps/user_api/accounts/tests/test_retirement_views.py
Outdated
Show resolved
Hide resolved
| assert 'original_username' in sql_lower or 'original_email' in sql_lower, ( | ||
| f"UPDATE query doesn't appear to update retirement fields: {sql_lower}" | ||
| ) | ||
|
|
||
| def test_simple_success(self): |
There was a problem hiding this comment.
I don't think there is any difference between what is being tested here, and what is tested in test_redaction_before_deletion, right? It's the same test, but you are just using two different names and asserting on different things. I think I would just delete test_simple_success and rename test_redaction_before_deletion to test_default_redacted_values. I think test_default_redacted_values already has all the assertions you need. Thoughts?
There was a problem hiding this comment.
Updated thanks.
robrap
left a comment
There was a problem hiding this comment.
Also, I remove the breaking change "!" from the title, because this is no longer a breaking change. Thanks.
| # Redact PII fields first, then delete. In case an ETL tool is syncing data | ||
| # to a downstream data warehouse, and treats the deletes as soft-deletes, | ||
| # the data will have first been redacted, protecting the sensitive PII. | ||
| for retirement in retirements: | ||
| retirement.original_username = redacted_username | ||
| retirement.original_email = redacted_email | ||
| retirement.original_name = redacted_name | ||
| retirement.save() | ||
| retirement.delete() |
There was a problem hiding this comment.
@ktyagiapphelix2u: I thought you had said this work was complete. I guess I misunderstood. Either way, please take care of this and all copilot comments. Thank you.
There was a problem hiding this comment.
Oh. Is this a very outdated duplicate PR for openedx? We should discuss our process, but I would have imagined that this would be the PR we start with.
Description
Instead of deleting user retirement records completely, this PR updates the system to replace personal information (name, email, username) with safe placeholder values while keeping the records in the database.
Private Ticket
https://2u-internal.atlassian.net/browse/BOMS-293