-
Notifications
You must be signed in to change notification settings - Fork 1.2k
[python] support table scan with row range #6944
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
9ab6fee to
12df60e
Compare
| (self.idx_of_this_subtask - remainder) * base_rows_per_shard) | ||
|
|
||
| end_row = start_row + num_row | ||
| def with_row_range(self, start_row, end_row) -> 'FullStartingScanner': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we raise a exception for primary key table?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use with_row_range and with_shard at the same time or not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not, It is difficult for all users to have a completely consistent behavior when using at the same time. Added an exception to avoid this situation.
| self.starting_scanner.with_shard(idx_of_this_subtask, number_of_para_subtasks) | ||
| return self | ||
|
|
||
| def with_row_range(self, start_row, end_row) -> 'TableScan': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we returning this exact number of lines, or can it be approximate? This needs to be specified clearly in the comments.
|
|
||
| if self.idx_of_this_subtask is not None: | ||
| print("self.start_row_of_this_subtask:{}".format(self.start_row_of_this_subtask)) | ||
| if self.start_row_of_this_subtask is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logger may be better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for debugging purposes, I will delete it soon.
12df60e to
f1973d6
Compare
f1973d6 to
59a42d9
Compare
JingsongLi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
| """ | ||
| Filter file entries by row range. The row_id corresponds to the row position of the | ||
| file in all file entries in table scan's partitioned_files. | ||
| """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be better to clearly describe the inclusion and exclusion of idx within the range in the comments.
|
+1 |
* upstream/master: (35 commits) [spark] Spark support vector search (apache#6950) [doc] update Apache Doris document with DLF 3.0 (apache#6954) [variant] Fix reading empty shredded variant via variantAccess (apache#6953) [python] support alterTable (apache#6952) [python] support ray data sink to paimon (apache#6883) [python] Rename to TableScan.withSlice to specific start_pos and end_pos [python] sync to_ray method args with ray data api (apache#6948) [python] light refactor for stats collect (apache#6941) [doc] Update cdc ingestion related docs [rest] Add tagNamePrefix definition for listTagsPaged (apache#6947) [python] support table scan with row range (apache#6944) [spark] Fix EqualNullSafe is not correct when column has null value. (apache#6943) [python] fix value_stats containing system fields for primary key tables (apache#6945) [test][rest] add test case for two sessions with cache for rest commitTable (apache#6438) [python] do not retry for connect exception in rest (apache#6942) [spark] Fix read shredded and unshredded variant both (apache#6936) [python] Let Python write file without value stats by default (apache#6940) [python] ray version compatible (apache#6937) [core] Unify conflict detect in FileStoreCommitImpl (apache#6932) [test] Fix unstable case in CompactActionITCase ...
Purpose
In model training and inference, it is common to record unread row intervals to checkpoints for recovery after worker failure. Currently, the with_stard method cannot support this feature and needs to support row_range for table scanning.
Linked issue: close #xxx
Tests
test_data_blob_writer_with_row_range in blob_table_test.py
API and Format
Documentation