Add python-chebi-utils library: downloader, OBO/SDF extractors, stratified splits, tests, CI #1
… and CI workflow Co-authored-by: sfluegel05 <43573433+sfluegel05@users.noreply.github.com>
@copilot Use the following download locations for chebi.obo and chebi.sdf. Accept a version parameter as input: chebi.obo chebi.sdf
…URLs Co-authored-by: sfluegel05 <43573433+sfluegel05@users.noreply.github.com>
Done in eb6a7fe. Both
@copilot for processing the chebi ontology, use the Here are some code blocks working with fastobo as orientation:
Co-authored-by: sfluegel05 <43573433+sfluegel05@users.noreply.github.com>
Done in 1df6eee.
@copilot When reading the sdf file, also parse the molecule into an rdkit Mol object. Take this as an example:
@copilot Make sure that chebi IDs are always stored as strings (even though the id is a number). E.g., "123" should still have the string datatype
Co-authored-by: sfluegel05 <43573433+sfluegel05@users.noreply.github.com>
Done in b2e747a.
Each row in the returned DataFrame now has a |
Co-authored-by: sfluegel05 <43573433+sfluegel05@users.noreply.github.com>
Done in 361ebd7. Renamed |
Implements a Python library for generating data products from ChEBI — downloading source files, extracting ontology classes/relations and molecules, and producing stratified dataset splits.
Package: `chebi_utils/`
- `downloader.py`: `download_chebi_obo(version, ...)` / `download_chebi_sdf(version, ...)` fetch versioned releases from the EBI FTP archive via `urllib`. Versions below 245 are resolved to the legacy archive path (`chebi_legacy/archive/rel{version}/...`); versions 245 and above use the modern path (`archive/rel{version}/...`).
- `obo_extractor.py`: uses the `fastobo` library to parse ChEBI OBO files and exposes `build_chebi_graph()`, which returns a `networkx.DiGraph`. `xref:` lines are stripped before parsing. Only non-obsolete CHEBI-prefixed terms are included. Nodes are string CHEBI IDs (e.g. `"123"` for `CHEBI:123`) with attributes `name`, `smiles` (extracted from `PropertyValueClause` using both the legacy `chebi/smiles` and modern `chemrof:smiles_string` properties, with fallback to SMILES synonym entries), and `subset`. Edges carry a `relation` attribute and represent `is_a` (child → parent) and `has_part` (whole → part) relationships.
- `sdf_extractor.py`: `extract_molecules()` reads plain `.sdf` or gzip `.sdf.gz` and returns a DataFrame with `chebi_id`, `name`, `smiles`, `inchi`, `inchikey`, `formula`, `charge`, `mass`, `mol`. Each molecule's connection table is parsed into an RDKit `Mol` object stored in the `mol` column. Parsing uses `MolFromMolBlock(sanitize=False, removeHs=False)`, followed by `_update_mol_valences()` (sets `NoImplicit=True` on all atoms) and `Chem.SanitizeMol` with flags `FINDRADICALS | KEKULIZE | SETAROMATICITY | SETCONJUGATION | SETHYBRIDIZATION | SYMMRINGS`. Molecules that fail to parse result in `None` with a warning.
- `splitter.py`: `create_splits()` produces reproducible random or stratified `train`/`val`/`test` splits from any DataFrame.

Tests (`tests/`, 58 tests)
- `tests/fixtures/sample.obo` and `tests/fixtures/sample.sdf` are used throughout.
- Covers the downloader (`urlretrieve` mock), OBO graph construction (node/edge count, string IDs, node attributes, obsolete term exclusion, xref robustness), SDF plain+gzip parsing, RDKit `Mol` object presence and atom counts, `None`-on-failure behaviour, split ratio validation, reproducibility, no-overlap guarantees, and stratification proportions.

CI/CD (`.github/workflows/ci.yml`)
Two jobs on every push/PR:
- `ruff format --check` + `ruff check`
- `pytest` matrix across Python 3.10, 3.11, 3.12

Configuration (`pyproject.toml`)
Hatchling build backend; runtime deps: `fastobo>=0.14`, `networkx>=3.0`, `numpy>=1.24`, `pandas>=2.0`, `rdkit>=2022.09`; dev extras add `pytest` and `ruff`.
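The version-dependent archive path described for the downloader can be sketched as follows. This is a minimal illustration, not the code from this PR: `resolve_archive_dir` and `LEGACY_CUTOFF` are hypothetical names, and the function returns only the directory portion of the path because the base URL and the file names under `rel{version}/` are not shown in the description.

```python
# Hypothetical constant: releases below this number live in the legacy archive,
# as stated in the PR description.
LEGACY_CUTOFF = 245


def resolve_archive_dir(version: int) -> str:
    """Return the archive directory for a ChEBI release, relative to the archive base.

    Only the directory is built here; the base URL and file name are left to the caller.
    """
    if version < 1:
        raise ValueError(f"version must be a positive integer, got {version}")
    if version < LEGACY_CUTOFF:
        return f"chebi_legacy/archive/rel{version}/"
    return f"archive/rel{version}/"
```

Keeping the cutoff in one named constant means the boundary between the two layouts is stated in exactly one place, which is also the natural value to test on both sides.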
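Two of the OBO preprocessing rules above, dropping `xref:` lines before parsing and storing CHEBI IDs as strings without the prefix, can be illustrated with the standard library alone. The helper names `strip_xref_lines` and `chebi_id_to_str` are hypothetical, and matching on a literal `xref:` line prefix is an assumption about how the stripping is done; the actual implementation hands the cleaned text to `fastobo`.

```python
from typing import Optional


def strip_xref_lines(obo_text: str) -> str:
    """Drop every line that starts with 'xref:' (assumed matching rule)."""
    kept = [line for line in obo_text.splitlines() if not line.startswith("xref:")]
    return "\n".join(kept)


def chebi_id_to_str(term_id: str) -> Optional[str]:
    """Map 'CHEBI:123' to the string '123'; return None for non-CHEBI terms."""
    prefix, sep, number = term_id.partition(":")
    if prefix != "CHEBI" or not sep or not number:
        return None
    # The numeric part is deliberately kept as a string, never cast to int.
    return number
```

Returning `None` for other prefixes gives the caller a simple way to skip terms that are not CHEBI-prefixed.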
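Reading plain and gzip-compressed SDF files through one code path might look like the sketch below. It uses only the standard library and stops before the RDKit step; `iter_sdf_records` and `parse_data_fields` are hypothetical helpers, and the data field name `ChEBI ID` used in the usage note is an assumption about the SDF contents.

```python
import gzip
from pathlib import Path


def iter_sdf_records(path):
    """Yield each SDF record as a list of lines, for .sdf or .sdf.gz input."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        block = []
        for line in handle:
            line = line.rstrip("\n")
            if line.strip() == "$$$$":  # SDF record delimiter
                if block:
                    yield block
                block = []
            else:
                block.append(line)
        # Tolerate a final record that lacks the closing delimiter.
        if any(item.strip() for item in block):
            yield block


def parse_data_fields(block):
    """Collect '> <NAME>' data items of one record into a dict of strings."""
    fields = {}
    key = None
    for line in block:
        if line.startswith(">") and "<" in line and ">" in line[1:]:
            key = line[line.index("<") + 1 : line.rindex(">")]
            fields[key] = []
        elif key is not None:
            if line.strip() == "":
                key = None  # a blank line ends the current data item
            else:
                fields[key].append(line)
    return {name: "\n".join(values) for name, values in fields.items()}
```

Under the field-name assumption, `parse_data_fields(block)["ChEBI ID"].removeprefix("CHEBI:")` would give the ID as a string, and the lines up to `M  END` form the connection table that is parsed into a `Mol` object.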
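The idea behind a reproducible stratified `train`/`val`/`test` split can be sketched without pandas. This is not `create_splits()` itself: `stratified_split` is a hypothetical function that works on a plain list of labels and returns row indices, and the default ratios and seed are assumptions chosen for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, ratios=(0.8, 0.1, 0.1), seed=42):
    """Split row indices into train/val/test, preserving label proportions."""
    if len(ratios) != 3 or any(r < 0 for r in ratios) or abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must be three non-negative numbers summing to 1")

    rng = random.Random(seed)  # a local RNG makes the split reproducible
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    splits = {"train": [], "val": [], "test": []}
    for idxs in by_label.values():
        rng.shuffle(idxs)
        n_train = round(len(idxs) * ratios[0])
        n_val = round(len(idxs) * ratios[1])
        # Slicing keeps the three parts disjoint within each label group.
        splits["train"] += idxs[:n_train]
        splits["val"] += idxs[n_train : n_train + n_val]
        splits["test"] += idxs[n_train + n_val :]
    return splits
```

Because each label group is shuffled and sliced separately, every index lands in exactly one split, and calling the function again with the same seed returns the same assignment.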