@trxvorr
Contributor

@trxvorr trxvorr commented Dec 13, 2025

Description

This PR introduces the NoSQL Parser, a core feature that allows users to transform nested, semi-structured NoSQL data (like JSON or MongoDB collections) into normalized, relational tables suitable for analytical storage (Parquet).

It addresses the need for a reusable, standard way to flatten complex document structures while preserving parent-child relationships through automatically generated foreign keys.

Type of Change

  • 🐛 Bug fix (non-breaking change which fixes an issue)

  • ✨ New feature (non-breaking change which adds functionality)

  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)

  • 📝 Documentation update

  • 🎨 Code style update (formatting, renaming)

  • ♻️ Refactoring (no functional changes)

  • ⚡ Performance improvement

  • ✅ Test update

  • 🔧 Configuration change

  • 🏗️ Infrastructure/build change

Related Issue(s)

Fixes #107

Changes Made

  • Core Parser (src/intugle/nosql/parser.py): Implemented recursive logic to split nested lists into separate child tables and generate foreign keys (<parent_table>_id, e.g. root_id) to maintain relationships.

  • Schema Inference (src/intugle/nosql/inference.py): Added logic to scan sample data, resolve type conflicts (e.g., unifying int and str), and auto-detect primary keys (_id, uuid).

  • Parquet Writer (src/intugle/nosql/writer.py): Implemented ParquetTarget to persist in-memory DataFrames to disk using pyarrow.

  • Configuration: Added support for custom table renaming and Primary Key overrides via a config dictionary.

  • Dependencies: Added pandas and pyarrow as optional dependencies under the nosql extra in pyproject.toml.
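
To illustrate the schema inference step described above, here is a minimal sketch of type unification and primary-key detection over a sample of documents. The function names, the "widen int/float to float, otherwise fall back to str" rule, and the uniqueness check are assumptions for illustration; they are not taken from src/intugle/nosql/inference.py.

```python
from collections import defaultdict

# Assumed candidate names, based on the keys mentioned in this PR (_id, uuid).
PK_CANDIDATES = ("_id", "uuid")


def infer_schema(sample):
    """Collect the type names seen per field, then unify conflicts (hypothetical helper)."""
    seen = defaultdict(set)
    for doc in sample:
        for key, value in doc.items():
            if value is not None:  # nulls do not vote on the type
                seen[key].add(type(value).__name__)

    schema = {}
    for key, types in seen.items():
        if len(types) == 1:
            schema[key] = next(iter(types))
        elif types <= {"int", "float"}:
            schema[key] = "float"  # numeric conflict: widen to float
        else:
            schema[key] = "str"    # any other conflict: fall back to string
    return schema


def detect_primary_key(sample, candidates=PK_CANDIDATES):
    """Return the first candidate that is present and unique in every sampled document."""
    for name in candidates:
        values = [doc.get(name) for doc in sample]
        # repr() keeps the uniqueness check safe for unhashable values
        if all(v is not None for v in values) and len({repr(v) for v in values}) == len(values):
            return name
    return None


sample = [
    {"_id": 1, "age": 30, "zip": 12345},
    {"_id": 2, "age": 30.5, "zip": "12345-678"},
]
schema = infer_schema(sample)
primary_key = detect_primary_key(sample)
```

In this sketch an int/str conflict on zip resolves to str, which matches the "unifying int and str" behaviour described in the bullet; the real module may apply different rules.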

Testing

Test Configuration

  • Python Version: 3.13.2

  • OS: Windows 11

  • LLM Provider: N/A

Test Cases

  • Unit tests pass locally

  • Manual testing completed

  • Tested with sample datasets

Test Commands

# Run the NoSQL specific test suite
pytest tests/nosql/

Screenshots/Examples

from intugle.nosql.parser import NoSQLParser
from intugle.nosql.writer import ParquetTarget

data = [
    {"id": 1, "name": "Trevor", "orders": [{"order_id": 101, "total": 50}]}
]

# 1. Parse (Splits into 'root' and 'root_orders')
parser = NoSQLParser()
tables = parser.parse(data)

# 2. Result is a dict of DataFrames
# tables['root'] -> id, name
# tables['root_orders'] -> order_id, total, root_id (FK)

# 3. Write to Parquet
target = ParquetTarget("output_dir")
target.write(tables)
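
For reviewers who want to see how the split into 'root' and 'root_orders' could work, the following is a rough, dependency-free sketch of the recursive idea. It is not the code in parser.py: the flatten function is hypothetical, it returns lists of row dicts instead of DataFrames, and it assumes each parent document carries an "id" field to use as the foreign-key value.

```python
def flatten(records, table="root", parent=None, tables=None):
    """Split nested lists of dicts into child tables linked by a foreign key.

    parent is an optional (fk_column, fk_value) pair inherited from the parent row.
    """
    if tables is None:
        tables = {}
    rows = tables.setdefault(table, [])

    for record in records:
        row = {}
        children = []
        for key, value in record.items():
            is_child_list = (
                isinstance(value, list)
                and value
                and all(isinstance(item, dict) for item in value)
            )
            if is_child_list:
                children.append((key, value))  # becomes its own table
            else:
                row[key] = value               # scalar (or nested dict) stays on this row
        if parent is not None:
            row[parent[0]] = parent[1]         # attach the foreign key
        rows.append(row)

        # Assumption: the parent key lives in an "id" field.
        fk = (f"{table}_id", record.get("id"))
        for key, child_records in children:
            flatten(child_records, f"{table}_{key}", fk, tables)
    return tables


data = [
    {"id": 1, "name": "Trevor", "orders": [{"order_id": 101, "total": 50}]}
]
tables = flatten(data)
```

With the sample document above, tables["root"] holds id and name, and tables["root_orders"] holds order_id, total, and root_id. Deeper nesting would need a configurable key lookup, since child rows such as orders do not have an "id" field in this example.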

Checklist

  • My code follows the code style of this project

  • I have performed a self-review of my own code

  • I have commented my code, particularly in hard-to-understand areas

  • I have made corresponding changes to the documentation

  • My changes generate no new warnings or linter errors

  • I have added tests that prove my fix is effective or that my feature works

  • New and existing unit tests pass locally with my changes

  • Any dependent changes have been merged and published

  • I have updated the relevant notebooks (if applicable)

  • I have checked my code and corrected any misspellings

Documentation Updates

  • README.md updated

  • Docstrings added/updated

  • Documentation site updated (if needed)

  • Notebook examples updated (if applicable)

  • CHANGELOG updated (if applicable)

Breaking Changes

  • This PR introduces breaking changes

  • Migration guide provided (if applicable)

Performance Impact

  • Performance benchmarks run

  • No significant performance impact

  • Performance improvement:

  • Performance regression:

Additional Context

The parser uses recursive processing. Memory usage scales with the chunk size of the input data. Dependencies (pandas, pyarrow) are optional and must be installed via pip install intugle[nosql].
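
Since memory usage follows the chunk size, one way to bound it is to feed the parser fixed-size batches. This is only a sketch of that idea: iter_chunks is a hypothetical helper, not part of this PR, and how the per-chunk results would be merged or appended to Parquet is left open.

```python
from itertools import islice


def iter_chunks(documents, chunk_size):
    """Yield lists of at most chunk_size documents from any iterable (hypothetical helper)."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    iterator = iter(documents)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# Five documents in batches of two: only one batch is held in memory at a time.
docs = ({"id": i} for i in range(5))
chunk_sizes = [len(chunk) for chunk in iter_chunks(docs, 2)]
```

Each chunk could then be passed to parser.parse(chunk) in turn, so peak memory depends on chunk_size and not on the size of the whole collection.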

Deployment Notes

Users must install the optional extras to use this feature:

pip install ".[nosql]"

@raphael-intugle
Collaborator

Hey @trxvorr, I'm facing an issue while running the example snippet. Can you double-check?

@trxvorr
Contributor Author

trxvorr commented Dec 18, 2025

@raphael-intugle I've resolved the merge conflict in pyproject.toml, so the dependencies should install correctly now.

I also realized the example snippet in the description was slightly outdated compared to the final API implementation. I've updated the PR description with the correct usage:

  1. NoSQLParser is initialized without arguments.
  2. data is passed directly to .parse(data).
  3. Writing is handled via target.write(tables).

It should run perfectly now!

@raphael-intugle
Collaborator

Great start on the NoSQL parser! The core logic for flattening nested documents and generating foreign keys looks solid, and the Parquet export is working well as verified by the tests.

However, comparing this implementation against the original feature requirements, there are several key components missing to fully complete the scope:

  1. Connectors: The MongoSource (and the pluggable adapter interface) is not implemented yet; currently, the parser only accepts a raw list of dictionaries.
  2. CLI & High-Level API: The CLI command (intugle nosql-to-relational) and the top-level NoSQLToRelationalParser class (for the from intugle import ... pattern) are missing.
  3. Metadata Export: The infer_model() capability to emit a structured relational schema/relationship graph is not fully realized.
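
To make the metadata export point concrete, here is one possible shape for a relationship graph derived from already-flattened tables. This is a sketch under assumptions, not the infer_model() contract: infer_relationships is a hypothetical name, tables are plain lists of row dicts, and foreign keys are assumed to follow the <parent_table>_id naming shown in the PR example.

```python
def infer_relationships(tables):
    """List child -> parent links by looking for <parent_table>_id columns (hypothetical)."""
    relationships = []
    for child, rows in tables.items():
        columns = {column for row in rows for column in row}
        for parent in tables:
            fk = f"{parent}_id"
            if parent != child and fk in columns:
                relationships.append(
                    {"child": child, "parent": parent, "foreign_key": fk}
                )
    return relationships


tables = {
    "root": [{"id": 1, "name": "Trevor"}],
    "root_orders": [{"order_id": 101, "total": 50, "root_id": 1}],
}
relationships = infer_relationships(tables)
```

A structure like this could be serialized next to the Parquet files so downstream tools can rebuild the joins without re-reading the source documents.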

@trxvorr
Contributor Author

trxvorr commented Dec 19, 2025

@raphael-intugle I've implemented the missing components you requested:

  • Connectors: Added MongoSource using pymongo with a pluggable NoSQLSource interface
  • High-Level API: Implemented NoSQLToRelationalParser in api.py as the main orchestrator
  • CLI: Added the intugle nosql-to-relational command
  • Metadata Export: infer_model() capability is available via the orchestrator

I also added unit tests for the new components (32 tests passing) and verified the CLI works locally against a MongoDB instance. Ready for another look.
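
For context on what a pluggable source interface can look like, here is a minimal sketch. The class and method names (DocumentSource, InMemorySource, fetch, limit) are illustrative assumptions and do not claim to match the NoSQLSource or MongoSource signatures in this PR.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable, Iterator, Optional


class DocumentSource(ABC):
    """Hypothetical adapter interface: anything that can yield documents as dicts."""

    @abstractmethod
    def fetch(self, limit: Optional[int] = None) -> Iterator[Dict[str, Any]]:
        """Yield documents, stopping after `limit` if one is given."""


class InMemorySource(DocumentSource):
    """Trivial adapter over a list, handy for tests that have no database available."""

    def __init__(self, documents: Iterable[Dict[str, Any]]):
        self._documents = list(documents)

    def fetch(self, limit: Optional[int] = None) -> Iterator[Dict[str, Any]]:
        docs = self._documents if limit is None else self._documents[:limit]
        yield from docs


source = InMemorySource([{"id": 1}, {"id": 2}, {"id": 3}])
sampled = list(source.fetch(limit=2))
```

A database-backed adapter would implement the same fetch method, which keeps the parser independent of where the documents come from.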

@trxvorr
Contributor Author

trxvorr commented Dec 19, 2025

To test the CLI yourself:

# Install all optional extras (including nosql)
uv sync --all-extras

# View available commands
uv run intugle --help

# Test against a MongoDB instance (replace with your connection details)
uv run intugle nosql-to-relational --uri "mongodb://localhost:27017" --db your_database --collection your_collection --output ./output_parquet

# Or with sampling (fetch only 100 documents)
uv run intugle nosql-to-relational --uri "mongodb://localhost:27017" --db your_database --collection your_collection --output ./output_parquet --sample 100

Run the unit tests:

uv run pytest tests/nosql/ -v
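
As a reference for the flags used in the commands above (--uri, --db, --collection, --output, --sample), this is how such a subcommand could be declared with argparse. It is an illustrative sketch only: the actual CLI in this PR may be built with a different framework, and which flags are required, along with the default for --sample, are assumptions.

```python
import argparse


def build_parser():
    """Build a stand-in parser that mirrors the flags shown in the commands above."""
    parser = argparse.ArgumentParser(prog="intugle")
    subcommands = parser.add_subparsers(dest="command", required=True)

    cmd = subcommands.add_parser(
        "nosql-to-relational",
        help="Flatten a NoSQL collection into relational Parquet tables",
    )
    cmd.add_argument("--uri", required=True, help="Connection URI")
    cmd.add_argument("--db", required=True, help="Database name")
    cmd.add_argument("--collection", required=True, help="Collection name")
    cmd.add_argument("--output", required=True, help="Output directory for Parquet files")
    # Assumption: omitting --sample means "read the whole collection".
    cmd.add_argument("--sample", type=int, default=None, help="Fetch only N documents")
    return parser


args = build_parser().parse_args([
    "nosql-to-relational",
    "--uri", "mongodb://localhost:27017",
    "--db", "your_database",
    "--collection", "your_collection",
    "--output", "./output_parquet",
    "--sample", "100",
])
```

Here args.sample is parsed as the integer 100 and stays None when the flag is omitted.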

