Description
name: Good First Issue
about: A beginner-friendly task perfect for first-time contributors
title: '[GOOD FIRST ISSUE] Bug: _is_yaml_stale does not correctly identify missing source_last_modified as stale'
labels: 'good first issue'
assignees: ''
Welcome! 👋
This is a beginner-friendly issue perfect for first-time contributors to the Intugle project. We've designed this task to help you get familiar with our codebase while making a meaningful contribution.
Task Description
The _is_yaml_stale method in src/intugle/analysis/models.py is intended to determine if a cached YAML file is outdated compared to its source data. However, the current implementation fails to correctly identify the YAML as stale if the source_last_modified timestamp is missing from the YAML metadata. Instead of returning True (stale), it currently returns False (not stale).
Why This Matters
If the source_last_modified timestamp is absent from the YAML, there's no way to verify its freshness against the original data source. Treating such a cache as "not stale" can lead to the system using potentially outdated or incorrect data, undermining data integrity and leading to unexpected behavior in downstream processes that rely on the semantic model.
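To make the failure mode concrete, here is a minimal, self-contained sketch of the behavior described above. The function is_stale_current and the table dict are hypothetical stand-ins for the real method and the YAML metadata; they are not the project's actual code and only mirror the logic this issue describes.

```python
import os
import tempfile


def is_stale_current(table: dict, source_path: str) -> bool:
    """Hypothetical stand-in that mirrors the buggy logic described in this issue."""
    source_last_modified = table.get("source_last_modified")
    if source_last_modified:
        # Stale only when the source file is newer than the recorded timestamp.
        return os.path.getmtime(source_path) > source_last_modified
    # Bug: a missing timestamp falls through and is reported as "not stale".
    return False


# Create a throwaway source file so the example runs anywhere.
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as handle:
    source_path = handle.name

try:
    # Metadata without source_last_modified: freshness cannot be verified.
    missing_timestamp_result = is_stale_current({"name": "orders"}, source_path)
finally:
    os.remove(source_path)

print(missing_timestamp_result)  # False, although the cache should count as stale
```

Because the result is False, a caller would keep using the cached YAML even though nothing confirms it still matches the source file.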
What You'll Learn
- Python object-oriented programming
- Caching logic and cache invalidation principles
- Importance of data integrity in data processing pipelines
- Working with file system metadata (modification times)
Step-by-Step Guide
Prerequisites
- Python 3.10+ installed
- Git basics (clone, commit, push, pull request)
- Read our CONTRIBUTING.md guide
Setup Instructions
- Fork and clone the repository
  git clone https://github.com/YOUR_USERNAME/data-tools.git
  cd data-tools
- Create a virtual environment
  python -m venv .venv
  source .venv/bin/activate  # On Windows: .venv\Scripts\activate
- Install dependencies
  pip install -e ".[dev]"
- Create a new branch
  git checkout -b fix/issue-NUMBER-missing-yaml-timestamp
Implementation Steps
- Locate the _is_yaml_stale method in src/intugle/analysis/models.py.
- Modify the logic within the try block to explicitly return True (stale) if source_last_modified is None or missing. The current "if source_last_modified:" check implicitly treats a missing timestamp as "not stale," which is incorrect.
Files to Modify
- File: src/intugle/analysis/models.py
  - Change: Adjust the conditional logic in _is_yaml_stale to correctly mark YAML as stale when source_last_modified is missing.
  - Line(s): Approximately lines 122-132.
Testing Your Changes
You should add a new test case to tests/analysis/test_dataset_analysis.py that specifically simulates a scenario where a YAML file is loaded but lacks the source_last_modified field. Verify that _is_yaml_stale correctly returns True for this case.
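A test along these lines could cover the missing-timestamp case. This is only a sketch under assumptions: is_yaml_stale_fixed is a hypothetical stand-in for the fixed logic, because the class that owns _is_yaml_stale and its constructor are not shown in this issue. In the real test, build the project's dataset object and call the method on it instead. tmp_path is pytest's built-in temporary-directory fixture.

```python
import os
from pathlib import Path


def is_yaml_stale_fixed(table: dict, source_path: str) -> bool:
    """Hypothetical stand-in for the fixed logic; replace with the real method."""
    source_last_modified = table.get("source_last_modified")
    if source_last_modified:
        # Stale only when the source file is newer than the recorded timestamp.
        return os.path.getmtime(source_path) > source_last_modified
    # Missing or None timestamp: freshness cannot be verified, so report stale.
    return True


def test_missing_source_last_modified_is_stale(tmp_path: Path) -> None:
    source = tmp_path / "orders.csv"
    source.write_text("id\n1\n")
    # YAML metadata loaded without a source_last_modified field.
    assert is_yaml_stale_fixed({"name": "orders"}, str(source)) is True


def test_none_source_last_modified_is_stale(tmp_path: Path) -> None:
    source = tmp_path / "orders.csv"
    source.write_text("id\n1\n")
    # The field is present but explicitly set to None.
    assert is_yaml_stale_fixed({"source_last_modified": None}, str(source)) is True


def test_up_to_date_timestamp_is_not_stale(tmp_path: Path) -> None:
    source = tmp_path / "orders.csv"
    source.write_text("id\n1\n")
    recorded = os.path.getmtime(source)
    # Recorded timestamp equals the file's mtime, so the cache is still fresh.
    assert is_yaml_stale_fixed({"source_last_modified": recorded}, str(source)) is False
```

The third test guards against an over-correction in which every cache is reported as stale.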
# Run tests
pytest tests/
# Or run specific test (example, adjust as needed)
pytest tests/analysis/test_dataset_analysis.py
Submitting Your Work
- Commit your changes
  git add .
  git commit -m "Fix(analysis): _is_yaml_stale treats missing source_last_modified as stale"
- Push to your fork
  git push origin fix/issue-NUMBER-missing-yaml-timestamp
- Create a Pull Request
  - Go to the original repository
  - Click "Pull Requests" → "New Pull Request"
  - Select your branch
  - Fill out the PR template
  - Reference this issue with "Fixes #ISSUE_NUMBER"
Example Code
# Before (simplified)
source_last_modified = table.get("source_last_modified")
if source_last_modified:
    # ... staleness check based on mtime ...
    return True  # if stale
return False  # if not stale, or if source_last_modified is missing

# After (simplified)
source_last_modified = table.get("source_last_modified")
if source_last_modified:
    current_mtime = os.path.getmtime(self.data["path"])
    if current_mtime > source_last_modified:
        console.print(
            f"Warning: Source file for '{self.name}' has been modified since the last analysis.",
            style=warning_style,
        )
        return True
    return False  # Explicitly not stale only if check passes
# If we are here, source_last_modified is missing/None, so it's stale
return True
Expected Outcome
The _is_yaml_stale method will accurately detect when a YAML cache is stale due to a missing source_last_modified timestamp, ensuring that the system reliably re-profiles data when cache validity cannot be determined.
Definition of Done
- Code changes implemented in src/intugle/analysis/models.py.
- A new unit test added to tests/analysis/test_dataset_analysis.py covering the missing source_last_modified scenario.
- All existing tests pass locally.
- Code follows project style guidelines.
- No new linter warnings.
- Pull request submitted, referencing this issue.
Resources
- Project Documentation
- CONTRIBUTING.md
- Python Style Guide (PEP 8)
- Related documentation: Caching strategies and data integrity principles.
Need Help?
Don't hesitate to ask questions! We're here to help you succeed.
- Comment below with your questions
- Join our Discord for real-time support
- Tag maintainers: @raphael-intugle
Skills You'll Use
- Python basics
- Git and GitHub
- Testing with pytest
Estimated Time
This task should take approximately: 1-3 hours
Thank you for contributing to Intugle!
Tips for Success:
- Take your time and read through everything carefully
- Don't be afraid to ask questions
- Test your changes before submitting
- Have fun! 🎉