Conversation

@universe-hcy (Contributor)

Purpose

In model training and inference, it is common to record the interval of unread rows in a checkpoint so that a worker can resume where it left off after a failure. Currently, the with_shard method cannot support this pattern, so table scanning needs to support row_range.

Linked issue: close #xxx
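As a sketch of the recovery pattern the Purpose describes (all names here are illustrative, not the pypaimon API): a worker records the next unread row id in its checkpoint after each row, so a restarted worker re-reads only the remaining interval.

```python
def scan_with_recovery(rows, checkpoint):
    """Consume `rows` (a list standing in for a table scan), resuming from
    the position saved in `checkpoint` and advancing it after each row, so
    a restarted worker reads only the rows it has not yet seen."""
    start = checkpoint.get("next_row", 0)
    consumed = []
    for i in range(start, len(rows)):
        consumed.append(rows[i])
        # A real worker would persist the checkpoint durably at this point.
        checkpoint["next_row"] = i + 1
    return consumed
```

With row_range support, the resumed worker could translate the saved position directly into a scan that starts at that row instead of re-reading the table.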

Tests

test_data_blob_writer_with_row_range in blob_table_test.py

API and Format

Documentation

@universe-hcy universe-hcy force-pushed the paimon_ali branch 2 times, most recently from 9ab6fee to 12df60e Compare January 3, 2026 02:00
@universe-hcy universe-hcy reopened this Jan 4, 2026
(self.idx_of_this_subtask - remainder) * base_rows_per_shard)

end_row = start_row + num_row
def with_row_range(self, start_row, end_row) -> 'FullStartingScanner':
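One way the sharding arithmetic hinted at in the fragment above could work (a sketch reconstructed from the visible variable names, not the actual implementation): split total_rows evenly across subtasks, giving the first `remainder` shards one extra row each, and return a half-open row interval per shard.

```python
def shard_row_range(total_rows, idx_of_this_subtask, number_of_para_subtasks):
    """Return the half-open [start_row, end_row) interval for one subtask.
    The first (total_rows % number_of_para_subtasks) shards each take one
    extra row so the whole table is covered without gaps or overlaps."""
    base_rows_per_shard = total_rows // number_of_para_subtasks
    remainder = total_rows % number_of_para_subtasks
    if idx_of_this_subtask < remainder:
        start_row = idx_of_this_subtask * (base_rows_per_shard + 1)
        num_row = base_rows_per_shard + 1
    else:
        start_row = (remainder * (base_rows_per_shard + 1)
                     + (idx_of_this_subtask - remainder) * base_rows_per_shard)
        num_row = base_rows_per_shard
    end_row = start_row + num_row
    return start_row, end_row
```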
Contributor

Can we raise an exception for primary key tables?

Contributor


Can we use with_row_range and with_shard at the same time or not?

Contributor Author


No. It is difficult to guarantee consistent behavior for all users when both are used at the same time, so an exception was added to prevent this situation.
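A minimal sketch of that mutual exclusion, with attribute and method names assumed from the diff (not the actual pypaimon code): whichever of the two methods is called second raises.

```python
class FullStartingScanner:
    """Sketch only: enforces that with_shard and with_row_range are
    mutually exclusive."""

    def __init__(self):
        self.idx_of_this_subtask = None
        self.start_row = None

    def with_shard(self, idx_of_this_subtask, number_of_para_subtasks):
        if self.start_row is not None:
            raise ValueError("with_shard cannot be combined with with_row_range")
        self.idx_of_this_subtask = idx_of_this_subtask
        self.number_of_para_subtasks = number_of_para_subtasks
        return self

    def with_row_range(self, start_row, end_row):
        if self.idx_of_this_subtask is not None:
            raise ValueError("with_row_range cannot be combined with with_shard")
        self.start_row = start_row
        self.end_row = end_row
        return self
```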

self.starting_scanner.with_shard(idx_of_this_subtask, number_of_para_subtasks)
return self

def with_row_range(self, start_row, end_row) -> 'TableScan':
Contributor


Are we returning this exact number of rows, or can it be approximate? This needs to be specified clearly in the comments.


if self.idx_of_this_subtask is not None:
print("self.start_row_of_this_subtask:{}".format(self.start_row_of_this_subtask))
if self.start_row_of_this_subtask is not None:
Contributor


A logger may be better.

Contributor Author


This is for debugging purposes; I will delete it soon.

@JingsongLi (Contributor) left a comment

+1

"""
Filter file entries by row range. The row_id corresponds to the row position of the
file in all file entries in table scan's partitioned_files.
"""
@discivigour (Contributor) Jan 4, 2026

It might be better for the comments to clearly state whether the start and end row ids are inclusive or exclusive of the range.
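One common convention that would answer this is a half-open interval: start_row inclusive, end_row exclusive. A sketch (hypothetical helper, not the PR's code) of selecting the files that overlap such a range, given each file's row count in scan order:

```python
def filter_by_row_range(file_row_counts, start_row, end_row):
    """Return the indices of files overlapping the half-open interval
    [start_row, end_row).  Row ids are assigned by each file's position
    in scan order, as the docstring above describes."""
    selected = []
    first_row = 0  # row id of this file's first row within the scan
    for i, count in enumerate(file_row_counts):
        next_first_row = first_row + count
        # Overlap test for half-open intervals [first_row, next_first_row)
        # and [start_row, end_row).
        if first_row < end_row and next_first_row > start_row:
            selected.append(i)
        first_row = next_first_row
    return selected
```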

@discivigour (Contributor)

+1

@JingsongLi JingsongLi merged commit 9ed304b into apache:master Jan 4, 2026
4 checks passed
jerry-024 added a commit to jerry-024/paimon that referenced this pull request Jan 6, 2026
* upstream/master: (35 commits)
  [spark] Spark support vector search (apache#6950)
  [doc] update Apache Doris document with DLF 3.0 (apache#6954)
  [variant] Fix reading empty shredded variant via variantAccess (apache#6953)
  [python] support alterTable (apache#6952)
  [python] support ray data sink to paimon (apache#6883)
  [python] Rename to TableScan.withSlice to specific start_pos and end_pos
  [python] sync to_ray method args with ray data api (apache#6948)
  [python] light refactor for stats collect (apache#6941)
  [doc] Update cdc ingestion related docs
  [rest] Add tagNamePrefix definition for listTagsPaged (apache#6947)
  [python] support table scan with row range (apache#6944)
  [spark] Fix EqualNullSafe is not correct when column has null value. (apache#6943)
  [python] fix value_stats containing system fields for primary key tables (apache#6945)
  [test][rest] add test case for two sessions with cache for rest commitTable (apache#6438)
  [python] do not retry for connect exception in rest (apache#6942)
  [spark] Fix read shredded and unshredded variant both (apache#6936)
  [python] Let Python write file without value stats by default (apache#6940)
  [python] ray version compatible (apache#6937)
  [core] Unify conflict detect in FileStoreCommitImpl (apache#6932)
  [test] Fix unstable case in CompactActionITCase
  ...
4 participants