Conversation
There was a problem hiding this comment.
Pull request overview
This PR fixes the information content (IC) calculation for conflations in the node normalizer. Previously, only the IC value of the canonical clique leader was used. The PR now:
- Collects IC values from all conflated identifiers and uses the minimum value for the canonical ID
- Adds IC values to each equivalent identifier in the output
- Moves the IC retrieval logic to be path-specific (conflation vs non-conflation)
Changes:
- Modified IC value calculation to use minimum IC from all conflated identifiers instead of just the canonical ID
- Added
info_contents_allparameter tocreate_node()to provide IC values for all equivalent identifiers - Added IC values to individual equivalent identifiers in the node output
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # output the final result | ||
| normal_nodes = { | ||
| input_curie: await create_node(app, canonical_id, dereference_ids, dereference_types, info_contents, | ||
| input_curie: await create_node(app, canonical_id, dereference_ids, dereference_types, info_contents, info_contents_all, |
There was a problem hiding this comment.
The variable info_contents_all is not defined in the non-conflation path (lines 640-645) or when canonical_nonan is empty (lines 646-648). This will cause a NameError when create_node is called in these cases. The variable needs to be initialized in all code paths before this call.
| ic_vals = [] | ||
|
|
||
| for other in dereference_others[canonical_id]: | ||
| # logging.debug(f"e = {e}, other = {other}, deref_others_eqs = {deref_others_eqs}") | ||
| e += deref_others_eqs[other] | ||
| t += deref_others_typ[other] | ||
| if other in info_contents_all and info_contents_all[other]: | ||
| ic_vals.append(info_contents_all[other]) | ||
|
|
||
| final_eqids.append(e) | ||
| final_types.append(uniquify_list(t)) | ||
|
|
||
| # What's the smallest IC value for this canonical ID? | ||
| info_contents[canonical_id] = min(ic_vals) if ic_vals else None |
There was a problem hiding this comment.
The variable ic_vals is only initialized inside the conditional block at line 619-622, but it is referenced at line 635 which is outside that block. If len(dereference_others[canonical_id]) is 0 for any canonical_id in the loop, ic_vals will be undefined when accessed at line 635, causing a NameError. The initialization of ic_vals should be moved outside the conditional block to line 617 or 618.
|
I think there is a bug here, which means that we don't calculate the minimum IC for across all the clique leaders in a conflation. But I can't prove it with tests, I think because we sort the conflation IDs by IC from smallest to largest, and since the only ICs we have for chemicals are from CHEBI and the only ICs for genes are from NCBIGene, that means it should be impossible to have a situation where this is obvious. So either:
I don't think this is urgent, so I'll open a ticket (#368) and leave this for now. |
I'm not sure if information content (IC) values are being calculated correct for conflations -- I think only the IC value of the conflated clique leader is being used. I'm going to use this PR to interrogate that and fix it if needed. It also adds information content values to every clique leader, although these might be nulls if Ubergraph doesn't have an IC for that value.