
MDT Test framework without writing data files #17796

Open

vamsikarnika wants to merge 20 commits into apache:master from vamsikarnika:mdt_stats_tool

Conversation

@vamsikarnika (Collaborator):

Describe the issue this Pull Request addresses

Summary and Changelog

Impact

Risk Level

Documentation Update

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

github-actions bot added the size:XL (PR with lines of changes > 1000) label on Jan 7, 2026.
initializeFilegroupsAndCommit(partitionType, relativePartitionPath, fileGroupCountAndRecordsPair, instantTimeForPartition);
break;
case PARTITION_STATS:
// For PARTITION_STATS, COLUMN_STATS should also be enabled
Contributor:

Let's get this taken care of. Let's not comment out any code; instead, introduce a config and disable it in our benchmarking script.
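A rough sketch of what the benchmarking script could set instead, assuming the standard Hudi metadata config keys (the partition-stats key in particular should be verified against HoodieMetadataConfig; the helper class is hypothetical):

import java.util.HashMap;
import java.util.Map;

public class BenchmarkWriteOptions { // hypothetical helper, not part of this PR
  // Disable partition stats via config in the benchmarking script instead of
  // commenting out the PARTITION_STATS branch in core code.
  public static Map<String, String> metadataOptions() {
    Map<String, String> opts = new HashMap<>();
    opts.put("hoodie.metadata.enable", "true");
    opts.put("hoodie.metadata.index.column.stats.enable", "true");
    // Key assumed from HoodieMetadataConfig; verify before relying on it.
    opts.put("hoodie.metadata.index.partition.stats.enable", "false");
    return opts;
  }
}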

Contributor:

we can try reverting this commit if that helps

#14165

Contributor:

done

}

if (enabledPartitionTypes.contains(MetadataPartitionType.PARTITION_STATS.getPartitionPath())) {
checkState(MetadataPartitionType.COLUMN_STATS.isMetadataPartitionAvailable(dataMetaClient),
Contributor:

Same here. Let's revert these changes.

Contributor:

reverted

} catch (IOException ioe) {
throw new HoodieIOException(ioe.getMessage(), ioe);
}
log.warn("Skipping reconcile markers for instant: {}", instantTs);
Contributor:

were we not able to understand why this was deleting the data or log files?

Contributor:

This is P1, and it's on me (Siva).


import scala.collection.JavaConverters;

public class HoodieMDTStats implements Closeable {
Contributor:

Let's rename this to MetadataBenchmarkingTool.

Contributor:

done

@Parameter(names = {"--num-partitions", "-np"}, description = "Target Base path for the table", required = true)
public Integer numPartitions = 1;

@Parameter(names = {"--files-per-commit", "-fpc"}, description = "Number of files to create per commit. If not specified or >= num-files, all files will be in one commit", required = false)
Contributor:

Let's do incremental commits differently. For instance, to bootstrap, we can offer a top-level config for the number of files, and for incremental batches we can offer a different config.

Something like: add 1M files in the first batch, and then 5000 files in each incremental batch.

We don't need this right now, but we should pick it up in a few days, once we get the benchmarking tool into a ready-to-use state.
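A rough sketch of how the two configs could look; the class, parameter names, and defaults are assumptions for illustration, not part of this PR:

import com.beust.jcommander.Parameter;
import java.io.Serializable;

public class IncrementalFilesConfig implements Serializable {
  // Files laid out by the initial bootstrap commit (e.g. 1M files in the first batch).
  @Parameter(names = {"--num-initial-files"},
      description = "Number of files to create in the initial bootstrap commit")
  public Integer numInitialFiles = 1_000_000;

  // Files added by each incremental commit after bootstrap (e.g. 5000 per batch).
  @Parameter(names = {"--num-files-per-incremental-batch"},
      description = "Number of files to add in each incremental commit after bootstrap")
  public Integer numFilesPerIncrementalBatch = 5000;
}

In practice these would presumably live as fields on the tool's existing nested Config class alongside the other @Parameter fields.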

Contributor:

For our first deliverable, 1 commit would suffice, but we need to ensure that all files in the MDT are HFiles (i.e. base files) and not log files.

We can enhance this to support more commits next week.

Contributor:

Let's try to get this working.

Initial setup: just create the data table without any metadata table.

From the tests: generate the data and ingest it into the MDT, which will also initialize it. That way, we get HFiles in the MDT directly.
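A rough sketch of that flow under assumptions, using the standard Hudi config builders (the helper class and table name are hypothetical, and the builder methods should be double-checked against the Hudi version in use):

import org.apache.hudi.common.config.HoodieMetadataConfig;
import org.apache.hudi.config.HoodieWriteConfig;

public class MdtBootstrapConfigs { // hypothetical helper, not part of this PR
  // Step 1: lay out the data table with the metadata table disabled.
  public static HoodieWriteConfig dataTableOnly(String basePath) {
    return HoodieWriteConfig.newBuilder()
        .withPath(basePath)
        .forTable("mdt_benchmark") // table name assumed
        .withMetadataConfig(HoodieMetadataConfig.newBuilder().enable(false).build())
        .build();
  }

  // Step 2: the first ingestion with metadata enabled initializes FILES/COLUMN_STATS,
  // so the MDT gets HFile base files directly.
  public static HoodieWriteConfig withMetadata(String basePath) {
    return HoodieWriteConfig.newBuilder()
        .withPath(basePath)
        .forTable("mdt_benchmark") // table name assumed
        .withMetadataConfig(HoodieMetadataConfig.newBuilder()
            .enable(true)
            .withMetadataIndexColumnStats(true)
            .build())
        .build();
  }
}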

Contributor:

Let's keep this as P1.

spark.sqlContext().conf().setConfString("hoodie.fileIndex.dataSkippingFailureMode", "strict");

// Create schema with the columns used for data skipping
StructType dataSchema = new StructType()
Contributor:

I see we have declared the schema in 3 places. Can we declare it once and reuse it wherever required?
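For example, a single shared constant could replace the three copies; the holder class name and the exact column set here are assumptions based on the columns discussed in this thread:

import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public final class BenchmarkSchema { // hypothetical holder, not part of this PR
  // Columns used for data skipping in this thread: dt (partition column), tenantId, age.
  public static final StructType DATA_SCHEMA = new StructType()
      .add("dt", DataTypes.StringType)
      .add("tenantId", DataTypes.LongType)
      .add("age", DataTypes.IntegerType);

  private BenchmarkSchema() {
  }
}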

LOG.info("DEBUG: Resolved filter tree:\n{}", filter1.treeString());

dataFilters.add(filter1);
// Expression filter2 = org.apache.spark.sql.HoodieCatalystExpressionUtils.resolveExpr(
Contributor:

Let's clean this up.

scala.collection.Seq<Expression> partitionFiltersSeq = partitionFiltersList;

// Call filterFileSlices
scala.collection.Seq<scala.Tuple2<scala.Option<org.apache.hudi.BaseHoodieTableFileIndex.PartitionPath>,
Contributor:

where is the timer here?
are we not interested in measuring the read latency (just the planning) in this case?
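One way to get both numbers would be a small timing helper around the planning call; this is only a sketch with assumed names:

import java.util.function.Supplier;
import org.slf4j.Logger;

public final class BenchmarkTimer { // hypothetical helper, not part of this PR
  // Runs a step, logs how long it took, and returns its result.
  public static <T> T timed(String label, Supplier<T> step, Logger log) {
    long startNs = System.nanoTime();
    T result = step.get();
    log.info("{} took {} ms", label, (System.nanoTime() - startNs) / 1_000_000);
    return result;
  }

  private BenchmarkTimer() {
  }
}

The existing filterFileSlices call, and separately the actual read of the pruned file slices, could then be wrapped in timed(...) so that both planning and read latency get reported.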

LOG.info(String.join("", Collections.nCopies(100, "-")));

int totalFileSlices = 0;
for (int j = 0; j < filteredSlices.size(); j++) {
Contributor:

Let's just print the total file slices we get after filtering. For verbose output, we should add an additional top-level config.
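A rough sketch of that reporting, as a helper method the tool could add; the verbose flag is an assumed top-level config, not something in this PR:

// Summary by default, per-entry detail only when verbose.
static void reportFilteredSlices(org.slf4j.Logger log, scala.collection.Seq<?> filteredSlices,
                                 int totalFileSlices, boolean verbose) {
  log.info("Total file slices after filtering: {}", totalFileSlices);
  if (verbose) {
    for (int j = 0; j < filteredSlices.size(); j++) {
      log.info("Filtered entry {}: {}", j, filteredSlices.apply(j));
    }
  }
}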

}

public static class Config implements Serializable {
@Parameter(names = {"--table-base-path", "-tbp"}, description = "Number of columns to index", required = true)
Contributor:

Let's fix the description to be in line with the config, for all configs.
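For instance, the two parameters quoted above might read as below; the field name for the base path is assumed, and the exact wording is of course up to the author:

// Sketch only: descriptions that match what each config actually controls.
@Parameter(names = {"--table-base-path", "-tbp"}, description = "Target base path for the table", required = true)
public String tableBasePath;

@Parameter(names = {"--num-partitions", "-np"}, description = "Number of partitions to create in the table", required = true)
public Integer numPartitions = 1;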

@nsivabalan (Contributor) left a comment:

On the query side, let's see if we can support queries like the one below.

numColumnsToIndex = 1
numPartitions = 100

query: select count(*) from tbl where dt >= '2025-01-01' and dt <= '2025-01-31' and tenantId = '100000000'

We can keep this P1.

Ideally, tenantId within each partition will be clustered, but the spread of each tenant could be different. For example, for 2025-01-01 with 10k file groups:

  • t1: 2 fgs
  • t2 ... t10: fg3
  • t11: fg3 ... fg10
  • ...

For Friday, let's just focus on the general benchmarking script deliverable.
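A rough sketch of how the tool might build those predicates for pruning, reusing the resolveExpr pattern that already appears (commented out) in this PR; spark, dataSchema, and dataFilters are assumed from the surrounding code, the resolveExpr signature is an assumption, and dt may instead belong in the partition filters depending on how the table is partitioned:

// Sketch only: resolve the date-range and tenantId predicates against the table schema.
Expression dateRangeFilter = org.apache.spark.sql.HoodieCatalystExpressionUtils.resolveExpr(
    spark, "dt >= '2025-01-01' and dt <= '2025-01-31'", dataSchema);
Expression tenantFilter = org.apache.spark.sql.HoodieCatalystExpressionUtils.resolveExpr(
    spark, "tenantId = '100000000'", dataSchema);
dataFilters.add(dateRangeFilter);
dataFilters.add(tenantFilter);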


filesPartitionExists,
metadataMetaClient.getTableConfig().getMetadataPartitions());

if (!filesPartitionExists) {
Contributor:

OK. If this helps with initializing FILES directly with the first commit from the benchmarking tool, we can leave it as is.

// Generate column stats records
@SuppressWarnings("rawtypes")
Map<String, Map<String, HoodieColumnRangeMetadata<Comparable>>> expectedStats = new HashMap<>();
List<HoodieRecord<HoodieMetadataPayload>> columnStatsRecords = generateColumnStatsRecordsForCommitMetadata(
Contributor:

Make this P1. Let's focus on the other feedback and come back to this later.

Comparable minValue;
Comparable maxValue;

if (colIdx == 0) {
Contributor:

The cardinality of tenantId is 25 to 30k, so let's rename salary -> tenantId and generate a random long within 30k values.

And from the top-level config, let's accept 1 or 2 as numColumnsToIndex:
if 1 -> tenantId
if 2 -> tenantId & (either salary or age)
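A rough sketch of what that could look like; the cardinality cap, helper names, and the use of age as the second column are taken from this thread, but the code itself is only illustrative:

// Sketch only, not part of this PR.
private static final int TENANT_ID_CARDINALITY = 30_000;

static long randomTenantId() {
  // tenantId cardinality is roughly 25-30k, so draw a random long below the cap.
  return java.util.concurrent.ThreadLocalRandom.current().nextLong(TENANT_ID_CARDINALITY);
}

static java.util.List<String> columnsToIndex(int numColumnsToIndex) {
  // 1 -> tenantId only; 2 -> tenantId plus a second column (age is suggested just below).
  return numColumnsToIndex == 1
      ? java.util.Collections.singletonList("tenantId")
      : java.util.Arrays.asList("tenantId", "age");
}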

Contributor:

Vamsi's suggestion: use age (so that we have one column with high cardinality and one with low cardinality).

@nsivabalan (Contributor):

Deliverable by Friday:

  • Focus on just 1 commit to the MDT. We need HFiles in the latest file slices of the MDT (FILES and col stats) so that we can measure the best possible read latencies for query pruning.
  • Ensure we can support date and tenantId predicates in queries.
  • Generate col stats records using the Spark engine context.
  • The benchmarking script should be able to run either the write or the read benchmarks.
  • Let's validate 1M files and 360 partitions. If we run into scale issues, at least try to find the inflection point; for example, can we do 100k files?
  • Resources: driver: 6 or gb; executors: 4 core 8gb; if not, 3 core 9gb.
  • Disabling partition stats and the other feedback comments: Pavithran and Vamsi to sync up.

@apache apache deleted a comment from hudi-bot Feb 10, 2026
@hudi-bot (Collaborator):

CI report:

Bot commands: @hudi-bot supports the following:
  • @hudi-bot run azure (re-run the last Azure build)
