Skip to content

HIVE-29464: Rethink MapWork.aliasToPartnInfo#6344

Merged
abstractdog merged 3 commits intoapache:masterfrom
hemanthumashankar0511:mapwork-get-tabledescs
Mar 9, 2026
Merged

HIVE-29464: Rethink MapWork.aliasToPartnInfo#6344
abstractdog merged 3 commits intoapache:masterfrom
hemanthumashankar0511:mapwork-get-tabledescs

Conversation

@hemanthumashankar0511
Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Added a new method getDistinctTableDescs() in MapWork that returns the unique TableDesc objects used by the map task, and updated configureJobConf to use it.

Before this change, the deduplication logic was sitting inside configureJobConf:

Set<String> processedTables = new HashSet<>();
for (PartitionDesc partition : aliasToPartnInfo.values()) {
    TableDesc tableDesc = partition.getTableDesc();
    if (tableDesc != null && !processedTables.contains(tableDesc.getTableName())) {
        processedTables.add(tableDesc.getTableName());
        PlanUtils.configureJobConf(tableDesc, job);
    }
}

After this change, that logic lives in getDistinctTableDescs() and configureJobConf just calls it cleanly:

for (TableDesc tableDesc : getDistinctTableDescs()) {
    PlanUtils.configureJobConf(tableDesc, job);
}

Why are the changes needed?

Callers like KafkaDagCredentialSupplier that only care about tables are currently forced to loop through all partitions in aliasToPartnInfo just to get the TableDesc objects. A table can have thousands of partitions but only one TableDesc, so everyone ends up writing the same boilerplate deduplication loop.

This method gives callers a clean way to get unique tables directly from MapWork without reinventing the wheel every time.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

I tested this locally by attaching a debugger to the test run and checking two scenarios:

Self-join — I wanted to make sure deduplication wouldn't accidentally skip anything:

SELECT * FROM test t1 JOIN test t2 USING(a);

Confirmed that both aliases point to the exact same TableDesc instance in memory, so the table only gets configured once as expected.

Cross-database join — I wanted to make sure tables with the same name from different databases don't collide:

SELECT * FROM db1.test_cross t1 JOIN db2.test_cross t2 USING(a);

Confirmed that getTableName() returns fully qualified names like db1.test_cross and db2.test_cross as distinct strings, so both tables get configured correctly.

…unique TableDesc objects without iterating partitions
@hemanthumashankar0511 hemanthumashankar0511 marked this pull request as ready for review March 3, 2026 05:19
@abstractdog
Copy link
Copy Markdown
Contributor

abstractdog commented Mar 3, 2026

nice work so far, thanks @hemanthumashankar0511 for taking care of this!
I believe one of the main concerns of this ticket was the getAliasToPartnInfo() method, which simply returns a mutable collection, leaving us totally unsure where this collection is actually touched, so an ideal solutio removes this method altogether or at least attempts to limit the usage of it, can you please take care of that also?

aliasToPartnInfo.put(alias, partitionDesc);
}

public void putAllPartitionDescs(Map<String, PartitionDesc> partitionDescs) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fortunately, we don't need this method, not used at all

*/
public Map<String, PartitionDesc> getAliasToPartnInfo() {
return aliasToPartnInfo;
public Collection<PartitionDesc> getPartitionDescs() {
Copy link
Copy Markdown
Contributor

@abstractdog abstractdog Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this collection returned by this method is mainly used for iterating: is it possible to return an Iterator instead of copying the whole collection? unfortunately, copying it can be costly, and we could never now how heavily use that now or in the future?

return;
}
if (aliasToPartnInfo == null) {
aliasToPartnInfo = new LinkedHashMap<>();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we rely on the current instance, like:

  private Map<String, PartitionDesc> aliasToPartnInfo = new LinkedHashMap<String, PartitionDesc>();

this ensures that we have an instance and don't need the extra null checks

}

public void removeAlias(String alias) {
if (aliasToPartnInfo != null) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe remove null-check

}

public void putPartitionDesc(String alias, PartitionDesc partitionDesc) {
if (aliasToPartnInfo == null) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe remove null-check

}

public boolean hasPartitionDesc(String alias) {
return aliasToPartnInfo != null && aliasToPartnInfo.containsKey(alias);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe remove null-check

}

public int getPartitionCount() {
return aliasToPartnInfo == null ? 0 : aliasToPartnInfo.size();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe remove null-check

LinkedHashMap<String, PartitionDesc> aliasToPartnInfo) {
this.aliasToPartnInfo = aliasToPartnInfo;
public PartitionDesc getPartitionDesc(String alias) {
return aliasToPartnInfo == null ? null : aliasToPartnInfo.get(alias);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe remove null-check

PartitionDesc partition = partitions.values().stream().findFirst().orElse(null);
if (partition != null) {
TableDesc tableDesc = partition.getTableDesc();
TableDesc mapWorkTableDesc = ((MapWork) work).getDistinctTableDescs().stream().findFirst().orElse(null);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can eliminate one of these two variables (tableDesc, mapWorkTableDesc)

}

plan.getAliasToPartnInfo().put(alias_id, aliasPartnDesc);
plan.putPartitionDesc(alias_id, aliasPartnDesc);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would you be so kind to fix alias_id to aliasId as you're already touching this area?

…ecks by initializing aliasToPartnInfo at field declaration
@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud bot commented Mar 5, 2026

@hemanthumashankar0511 hemanthumashankar0511 marked this pull request as ready for review March 5, 2026 09:33
@hemanthumashankar0511 hemanthumashankar0511 changed the title HIVE-29464: Rethink MapWork.aliasToPartnInfo - add getDistinctTableDescs() for callers that only need TableDesc objects HIVE-29464: Rethink MapWork.aliasToPartnInfo Mar 6, 2026
@abstractdog
Copy link
Copy Markdown
Contributor

LGTM, merging this soon

@abstractdog abstractdog merged commit 64c0055 into apache:master Mar 9, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants