Conversation

@universe-hcy (Contributor)

Purpose

In model training and inference, it is common to record the interval of unread rows in a checkpoint so that a worker can resume where it left off after a failure. Currently, the with_shard method cannot support this pattern, so table scanning needs to support row_range.

Linked issue: close #xxx
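As a sketch of the recovery pattern the Purpose describes (all names here are illustrative, not the pypaimon API): a worker records the next unread row id in its checkpoint after each row, so a restarted worker re-reads only the remaining interval.

```python
def scan_with_recovery(rows, checkpoint):
    """Consume `rows` (a list standing in for a table scan), resuming from
    the position saved in `checkpoint` and advancing it after each row, so
    a restarted worker reads only the rows it has not yet seen."""
    start = checkpoint.get("next_row", 0)
    consumed = []
    for i in range(start, len(rows)):
        consumed.append(rows[i])
        # A real worker would persist the checkpoint durably at this point.
        checkpoint["next_row"] = i + 1
    return consumed
```

With row_range support, the resumed worker could translate the saved position directly into a scan that starts at that row instead of re-reading the table.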

Tests

test_data_blob_writer_with_row_range in blob_table_test.py

API and Format

Documentation

@universe-hcy universe-hcy force-pushed the paimon_ali branch 2 times, most recently from 9ab6fee to 12df60e Compare January 3, 2026 02:00
@universe-hcy universe-hcy reopened this Jan 4, 2026
(self.idx_of_this_subtask - remainder) * base_rows_per_shard)

end_row = start_row + num_row
def with_row_range(self, start_row, end_row) -> 'FullStartingScanner':
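One way the sharding arithmetic hinted at in the fragment above could work (a sketch reconstructed from the visible variable names, not the actual implementation): split total_rows evenly across subtasks, giving the first `remainder` shards one extra row each, and return a half-open row interval per shard.

```python
def shard_row_range(total_rows, idx_of_this_subtask, number_of_para_subtasks):
    """Return the half-open [start_row, end_row) interval for one subtask.
    The first (total_rows % number_of_para_subtasks) shards each take one
    extra row so the whole table is covered without gaps or overlaps."""
    base_rows_per_shard = total_rows // number_of_para_subtasks
    remainder = total_rows % number_of_para_subtasks
    if idx_of_this_subtask < remainder:
        start_row = idx_of_this_subtask * (base_rows_per_shard + 1)
        num_row = base_rows_per_shard + 1
    else:
        start_row = (remainder * (base_rows_per_shard + 1)
                     + (idx_of_this_subtask - remainder) * base_rows_per_shard)
        num_row = base_rows_per_shard
    end_row = start_row + num_row
    return start_row, end_row
```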
Contributor

Can we raise an exception for primary key tables?

Contributor


Can we use with_row_range and with_shard at the same time or not?

Contributor Author


No. It is difficult to guarantee consistent behavior for all users when both are used at the same time, so an exception was added to prevent this situation.
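A minimal sketch of that mutual exclusion, with attribute and method names assumed from the diff (not the actual pypaimon code): whichever of the two methods is called second raises.

```python
class FullStartingScanner:
    """Sketch only: enforces that with_shard and with_row_range are
    mutually exclusive."""

    def __init__(self):
        self.idx_of_this_subtask = None
        self.start_row = None

    def with_shard(self, idx_of_this_subtask, number_of_para_subtasks):
        if self.start_row is not None:
            raise ValueError("with_shard cannot be combined with with_row_range")
        self.idx_of_this_subtask = idx_of_this_subtask
        self.number_of_para_subtasks = number_of_para_subtasks
        return self

    def with_row_range(self, start_row, end_row):
        if self.idx_of_this_subtask is not None:
            raise ValueError("with_row_range cannot be combined with with_shard")
        self.start_row = start_row
        self.end_row = end_row
        return self
```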

self.starting_scanner.with_shard(idx_of_this_subtask, number_of_para_subtasks)
return self

def with_row_range(self, start_row, end_row) -> 'TableScan':
Contributor


Are we returning this exact number of rows, or can it be approximate? This needs to be specified clearly in the comments.


if self.idx_of_this_subtask is not None:
print("self.start_row_of_this_subtask:{}".format(self.start_row_of_this_subtask))
if self.start_row_of_this_subtask is not None:
Contributor


A logger may be better.

Contributor Author


This is for debugging purposes; I will delete it soon.

@JingsongLi (Contributor) left a comment

+1

"""
Filter file entries by row range. The row_id corresponds to the row position of the
file in all file entries in table scan's partitioned_files.
"""
@discivigour (Contributor) Jan 4, 2026

It might be better for the comments to clearly state whether the start and end row ids are inclusive or exclusive of the range.
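One common convention that would answer this is a half-open interval: start_row inclusive, end_row exclusive. A sketch (hypothetical helper, not the PR's code) of selecting the files that overlap such a range, given each file's row count in scan order:

```python
def filter_by_row_range(file_row_counts, start_row, end_row):
    """Return the indices of files overlapping the half-open interval
    [start_row, end_row).  Row ids are assigned by each file's position
    in scan order, as the docstring above describes."""
    selected = []
    first_row = 0  # row id of this file's first row within the scan
    for i, count in enumerate(file_row_counts):
        next_first_row = first_row + count
        # Overlap test for half-open intervals [first_row, next_first_row)
        # and [start_row, end_row).
        if first_row < end_row and next_first_row > start_row:
            selected.append(i)
        first_row = next_first_row
    return selected
```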

@discivigour (Contributor)

+1

@JingsongLi JingsongLi merged commit 9ed304b into apache:master Jan 4, 2026
4 checks passed
jerry-024 added a commit to jerry-024/paimon that referenced this pull request Jan 6, 2026
* upstream/master: (35 commits)
  [spark] Spark support vector search (apache#6950)
  [doc] update Apache Doris document with DLF 3.0 (apache#6954)
  [variant] Fix reading empty shredded variant via variantAccess (apache#6953)
  [python] support alterTable (apache#6952)
  [python] support ray data sink to paimon (apache#6883)
  [python] Rename to TableScan.withSlice to specific start_pos and end_pos
  [python] sync to_ray method args with ray data api (apache#6948)
  [python] light refactor for stats collect (apache#6941)
  [doc] Update cdc ingestion related docs
  [rest] Add tagNamePrefix definition for listTagsPaged (apache#6947)
  [python] support table scan with row range (apache#6944)
  [spark] Fix EqualNullSafe is not correct when column has null value. (apache#6943)
  [python] fix value_stats containing system fields for primary key tables (apache#6945)
  [test][rest] add test case for two sessions with cache for rest commitTable (apache#6438)
  [python] do not retry for connect exception in rest (apache#6942)
  [spark] Fix read shredded and unshredded variant both (apache#6936)
  [python] Let Python write file without value stats by default (apache#6940)
  [python] ray version compatible (apache#6937)
  [core] Unify conflict detect in FileStoreCommitImpl (apache#6932)
  [test] Fix unstable case in CompactActionITCase
  ...
4 participants