Skip to content

Fix information content values for conflation#366

Open
gaurav wants to merge 9 commits intomasterfrom
fix-ic-for-conflation
Open

Fix information content values for conflation#366
gaurav wants to merge 9 commits intomasterfrom
fix-ic-for-conflation

Conversation

@gaurav
Copy link
Collaborator

@gaurav gaurav commented Feb 26, 2026

I'm not sure if information content (IC) values are being calculated correct for conflations -- I think only the IC value of the conflated clique leader is being used. I'm going to use this PR to interrogate that and fix it if needed. It also adds information content values to every clique leader, although these might be nulls if Ubergraph doesn't have an IC for that value.

@gaurav gaurav marked this pull request as ready for review February 26, 2026 22:00
@gaurav gaurav requested a review from Copilot February 26, 2026 22:00
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes the information content (IC) calculation for conflations in the node normalizer. Previously, only the IC value of the canonical clique leader was used. The PR now:

  • Collects IC values from all conflated identifiers and uses the minimum value for the canonical ID
  • Adds IC values to each equivalent identifier in the output
  • Moves the IC retrieval logic to be path-specific (conflation vs non-conflation)

Changes:

  • Modified IC value calculation to use minimum IC from all conflated identifiers instead of just the canonical ID
  • Added info_contents_all parameter to create_node() to provide IC values for all equivalent identifiers
  • Added IC values to individual equivalent identifiers in the node output

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

# output the final result
normal_nodes = {
input_curie: await create_node(app, canonical_id, dereference_ids, dereference_types, info_contents,
input_curie: await create_node(app, canonical_id, dereference_ids, dereference_types, info_contents, info_contents_all,
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable info_contents_all is not defined in the non-conflation path (lines 640-645) or when canonical_nonan is empty (lines 646-648). This will cause a NameError when create_node is called in these cases. The variable needs to be initialized in all code paths before this call.

Copilot uses AI. Check for mistakes.
Comment on lines +622 to +635
ic_vals = []

for other in dereference_others[canonical_id]:
# logging.debug(f"e = {e}, other = {other}, deref_others_eqs = {deref_others_eqs}")
e += deref_others_eqs[other]
t += deref_others_typ[other]
if other in info_contents_all and info_contents_all[other]:
ic_vals.append(info_contents_all[other])

final_eqids.append(e)
final_types.append(uniquify_list(t))

# What's the smallest IC value for this canonical ID?
info_contents[canonical_id] = min(ic_vals) if ic_vals else None
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable ic_vals is only initialized inside the conditional block at line 619-622, but it is referenced at line 635 which is outside that block. If len(dereference_others[canonical_id]) is 0 for any canonical_id in the loop, ic_vals will be undefined when accessed at line 635, causing a NameError. The initialization of ic_vals should be moved outside the conditional block to line 617 or 618.

Copilot uses AI. Check for mistakes.
@gaurav
Copy link
Collaborator Author

gaurav commented Feb 27, 2026

I think there is a bug here, which means that we don't calculate the minimum IC for across all the clique leaders in a conflation. But I can't prove it with tests, I think because we sort the conflation IDs by IC from smallest to largest, and since the only ICs we have for chemicals are from CHEBI and the only ICs for genes are from NCBIGene, that means it should be impossible to have a situation where this is obvious. So either:

  1. I can try to fix the issue without testing (awkward, possibly incorrect).
  2. I write a test for this (depends on haveing a NodeNorm test suite).
  3. l can load deliberately messed up NodeNorm files (or mess with a Redis database) to test this.
  4. I can wait until something changes with the IC situation.

I don't think this is urgent, so I'll open a ticket (#368) and leave this for now.

@gaurav gaurav moved this from Backlog to In progress in NodeNorm sprints Feb 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In progress

Development

Successfully merging this pull request may close these issues.

2 participants