[GOOD FIRST ISSUE] Bug: _is_yaml_stale does not correctly identify missing source_last_modified as stale #152

@raphael-intugle

Welcome! 👋

This is a beginner-friendly issue perfect for first-time contributors to the Intugle project. We've designed this task to help you get familiar with our codebase while making a meaningful contribution.

Task Description

The _is_yaml_stale method in src/intugle/analysis/models.py is intended to determine if a cached YAML file is outdated compared to its source data. However, the current implementation fails to correctly identify the YAML as stale if the source_last_modified timestamp is missing from the YAML metadata. Instead of returning True (stale), it currently returns False (not stale).

Why This Matters

If the source_last_modified timestamp is absent from the YAML, there's no way to verify its freshness against the original data source. Treating such a cache as "not stale" can lead to the system using potentially outdated or incorrect data, undermining data integrity and leading to unexpected behavior in downstream processes that rely on the semantic model.

What You'll Learn

  • Python object-oriented programming
  • Caching logic and cache invalidation principles
  • Importance of data integrity in data processing pipelines
  • Working with file system metadata (modification times)

Step-by-Step Guide

Prerequisites

  • Python 3.10+ installed
  • Git basics (clone, commit, push, pull request)
  • Read our CONTRIBUTING.md guide

Setup Instructions

  1. Fork and clone the repository

    git clone https://github.com/YOUR_USERNAME/data-tools.git
    cd data-tools
  2. Create a virtual environment

    python -m venv .venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  3. Install dependencies

    pip install -e ".[dev]"
  4. Create a new branch

    git checkout -b fix/issue-NUMBER-missing-yaml-timestamp

Implementation Steps

  1. Locate the _is_yaml_stale method in src/intugle/analysis/models.py.
  2. Modify the logic within the try block to explicitly return True (stale) if source_last_modified is None or missing. The current if source_last_modified: check implicitly treats a missing timestamp as "not stale," which is incorrect.
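To see why the existing check misbehaves, here is a minimal, self-contained sketch of the pattern. The function name `is_stale_buggy`, its parameters, and the sample dictionaries are assumptions made for illustration and are not the actual code in models.py; only the `source_last_modified` key and the truthiness check come from this issue.

```python
def is_stale_buggy(table: dict, current_mtime: float) -> bool:
    """Hypothetical stand-in mirroring the truthiness check described above."""
    source_last_modified = table.get("source_last_modified")
    if source_last_modified:
        # Timestamp present: compare it against the source file's mtime.
        return current_mtime > source_last_modified
    # A missing or None timestamp falls through to "not stale": the bug.
    return False


# Cached metadata with and without the timestamp (illustrative values).
with_timestamp = {"source_last_modified": 100.0}
without_timestamp = {}

print(is_stale_buggy(with_timestamp, current_mtime=200.0))     # True: source changed
print(is_stale_buggy(with_timestamp, current_mtime=100.0))     # False: unchanged
print(is_stale_buggy(without_timestamp, current_mtime=200.0))  # False: freshness unknown
```

The last call is the case this issue is about: the cache cannot be verified, yet it is reported as fresh.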

Files to Modify

  • File: src/intugle/analysis/models.py
    • Change: Adjust the conditional logic in _is_yaml_stale to correctly mark YAML as stale when source_last_modified is missing.
    • Line(s): Approximately lines 122-132.

Testing Your Changes

You should add a new test case to tests/analysis/test_dataset_analysis.py that specifically simulates a scenario where a YAML file is loaded but lacks the source_last_modified field. Verify that _is_yaml_stale correctly returns True for this case.

    # Run the full test suite
    pytest tests/

    # Or run only the relevant test file (adjust as needed)
    pytest tests/analysis/test_dataset_analysis.py
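As a rough starting point, the new test might look like the sketch below. Because the way the existing tests construct the dataset object is not shown in this issue, the sketch tests a hypothetical stand-in function, `is_yaml_stale`, that assumes the fixed behavior; the function name, its parameters, and the helper `_make_source` are all assumptions, not the Intugle API.

```python
import os
import tempfile
from pathlib import Path


def is_yaml_stale(table: dict, source_path: str) -> bool:
    """Hypothetical stand-in for the fixed method; not the real Intugle API."""
    source_last_modified = table.get("source_last_modified")
    if source_last_modified is None:
        # No recorded timestamp: freshness cannot be verified, so report stale.
        return True
    return os.path.getmtime(source_path) > source_last_modified


def _make_source(directory: str) -> str:
    """Create a small source file and return its path."""
    path = Path(directory) / "source.csv"
    path.write_text("id\n1\n")
    return str(path)


def test_missing_timestamp_is_stale():
    with tempfile.TemporaryDirectory() as tmp:
        source = _make_source(tmp)
        # Key absent entirely, and key present but set to None.
        assert is_yaml_stale({}, source) is True
        assert is_yaml_stale({"source_last_modified": None}, source) is True


def test_current_timestamp_is_not_stale():
    with tempfile.TemporaryDirectory() as tmp:
        source = _make_source(tmp)
        table = {"source_last_modified": os.path.getmtime(source)}
        assert is_yaml_stale(table, source) is False


def test_older_timestamp_is_stale():
    with tempfile.TemporaryDirectory() as tmp:
        source = _make_source(tmp)
        table = {"source_last_modified": os.path.getmtime(source) - 10}
        assert is_yaml_stale(table, source) is True
```

In the real test, replace the stand-in with a call to `_is_yaml_stale` on an object built the same way the existing tests in tests/analysis/test_dataset_analysis.py build theirs, and keep the two regression cases so the fix does not break the normal path.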

Submitting Your Work

  1. Commit your changes

    git add .
    git commit -m "Fix(analysis): _is_yaml_stale treats missing source_last_modified as stale"
  2. Push to your fork

    git push origin fix/issue-NUMBER-missing-yaml-timestamp
  3. Create a Pull Request

    • Go to the original repository
    • Click "Pull Requests" → "New Pull Request"
    • Select your branch
    • Fill out the PR template
    • Reference this issue with "Fixes #ISSUE_NUMBER"

Example Code

    # Before (simplified fragment of the method body)
    source_last_modified = table.get("source_last_modified")

    if source_last_modified:
        # ... staleness check based on mtime ...
        return True  # if stale
    return False  # if not stale, or if source_last_modified is missing

    # After (simplified fragment of the method body)
    source_last_modified = table.get("source_last_modified")

    if source_last_modified:
        current_mtime = os.path.getmtime(self.data["path"])
        if current_mtime > source_last_modified:
            console.print(
                f"Warning: Source file for '{self.name}' has been modified since the last analysis.",
                style=warning_style,
            )
            return True
        return False  # Explicitly not stale only if the check passes

    # If we are here, source_last_modified is missing/None, so it's stale
    return True

Expected Outcome

The _is_yaml_stale method will accurately detect when a YAML cache is stale due to a missing source_last_modified timestamp, ensuring that the system reliably re-profiles data when cache validity cannot be determined.

Definition of Done

  • Code changes implemented in src/intugle/analysis/models.py
  • A new unit test added to tests/analysis/test_dataset_analysis.py covering the missing source_last_modified scenario.
  • All existing tests pass locally.
  • Code follows project style guidelines.
  • No new linter warnings.
  • Pull request submitted, referencing this issue.

Need Help?

Don't hesitate to ask questions! We're here to help you succeed.

Skills You'll Use

  • Python basics
  • Git and GitHub
  • Testing with pytest

Estimated Time

This task should take approximately: 1-3 hours


Thank you for contributing to Intugle!

Tips for Success:

  • Take your time and read through everything carefully
  • Don't be afraid to ask questions
  • Test your changes before submitting
  • Have fun! 🎉
