@razvanMiu razvanMiu commented Oct 16, 2025

Description

This update enhances the web connector with two key improvements for sitemap processing:

  1. Lastmod Value Integration

    • Incorporates lastmod values from sitemap.xml for precise content update tracking
    • Enables more accurate change detection for web documents
  2. Intelligent Sitemap Filtering

    • Adds a skip_unchanged_documents flag to WebConnector
    • Implements _filter_urls_by_timestamp() to drop sitemap URLs whose lastmod shows no change
    • Applies filtering only during incremental updates (from_beginning=False)
    • Handles edge cases (empty sitemaps, missing timestamps)
    • Maintains full reindex capability with from_beginning=True
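
For reviewers who want a picture of the filtering idea, here is a minimal, self-contained sketch. It is not the code in this PR: the function names, the `indexed_at` mapping, and the rule that URLs without a usable lastmod are always fetched are assumptions made for illustration. Only `from_beginning` and the sitemap `lastmod` element come from the description above.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Namespace used by standard sitemap.xml files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_lastmod(raw: str) -> datetime | None:
    """Parse a sitemap lastmod string; return None if missing or unparseable."""
    if not raw:
        return None
    try:
        # Older Python versions do not accept a trailing "Z" in fromisoformat.
        parsed = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    except ValueError:
        return None
    # Assumption: timestamps without an offset are treated as UTC.
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def parse_sitemap_lastmod(xml_text: str) -> dict[str, datetime | None]:
    """Map each <loc> in a sitemap to its parsed <lastmod> (or None)."""
    root = ET.fromstring(xml_text)
    result: dict[str, datetime | None] = {}
    for url_el in root.findall("sm:url", SITEMAP_NS):
        loc = url_el.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        if not loc:
            continue
        raw = url_el.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS).strip()
        result[loc] = parse_lastmod(raw)
    return result


def filter_urls_by_timestamp(
    sitemap_lastmod: dict[str, datetime | None],
    indexed_at: dict[str, datetime],  # hypothetical: URL -> time of last successful index
    from_beginning: bool,
) -> list[str]:
    """Return the URLs that still need to be fetched."""
    if from_beginning:
        # Full reindex: never filter.
        return list(sitemap_lastmod)
    kept = []
    for url, lastmod in sitemap_lastmod.items():
        last_indexed = indexed_at.get(url)
        # Assumption: fetch when either timestamp is unknown, so a missing
        # lastmod can never cause a real change to be skipped.
        if lastmod is None or last_indexed is None or lastmod > last_indexed:
            kept.append(url)
    return kept
```

The sketch fails open: anything it cannot prove unchanged is fetched again, which trades a few extra requests for not missing updates.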

Benefits

  • Performance: Reduces redundant HTTP requests by skipping unchanged documents
  • Efficiency: Improves indexing speed during incremental updates
  • Reliability: Ensures data consistency through timestamp-based change detection
  • Flexibility: Preserves full reindex behavior when needed
  • Seamless Integration: Works alongside existing sitemap processing logic

Testing

Test Cases

  1. Basic Functionality

    • Index documents using a sitemap with lastmod values
    • Verify document updates when lastmod changes
  2. Filtering Behavior

    • Incremental update with unchanged documents
    • Full reindex with from_beginning=True
    • Handling of missing/empty lastmod values
  3. Edge Cases

    • Empty sitemap handling
    • Mixed content (with/without timestamps)
    • Large sitemap processing

Backporting

  • This PR should be backported (make sure to check that the backport attempt succeeds)
  • [Optional] Override Linear Check

razvanMiu and others added 14 commits July 30, 2025 16:05
…iltering

- Add skip_unchanged_documents flag to WebConnector
- Implement _filter_urls_by_timestamp() for sitemap URL filtering
- Only apply filtering during incremental updates (from_beginning=False)
- Track URL counts to distinguish empty sitemaps from fully filtered ones
- Update connector validation to handle filtered-out URLs gracefully
- Add batch timestamp lookup for efficient database queries
- Preserve full reindex behavior with from_beginning=True

This optimization reduces redundant HTTP requests and processing by skipping unchanged documents during incremental updates, while maintaining safety by only activating during non-full reindexes.
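
The two bookkeeping points in this commit message (batch timestamp lookup, and counting URLs so an empty sitemap can be told apart from a fully filtered one) could look roughly like the sketch below. This is an illustration under assumptions, not the PR's implementation: `FilterOutcome`, `filter_in_batches`, the `lookup_indexed_at` callable, and the batch size of 500 are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime
from typing import Callable


@dataclass(frozen=True)
class FilterOutcome:
    """Hypothetical result type that keeps the pre-filter URL count."""

    total_urls: int
    urls_to_fetch: list[str]

    @property
    def sitemap_was_empty(self) -> bool:
        # No URLs at all: likely a configuration problem worth reporting.
        return self.total_urls == 0

    @property
    def all_filtered_out(self) -> bool:
        # URLs existed but none changed: a valid, successful no-op run.
        return self.total_urls > 0 and not self.urls_to_fetch


def filter_in_batches(
    sitemap_lastmod: dict[str, datetime | None],
    # Hypothetical callable: takes a batch of URLs and returns the last
    # indexed time for those it knows, using one database query per batch.
    lookup_indexed_at: Callable[[list[str]], dict[str, datetime]],
    batch_size: int = 500,
) -> FilterOutcome:
    urls = list(sitemap_lastmod)
    urls_to_fetch: list[str] = []
    for start in range(0, len(urls), batch_size):
        batch = urls[start : start + batch_size]
        indexed_at = lookup_indexed_at(batch)
        for url in batch:
            lastmod = sitemap_lastmod[url]
            last_indexed = indexed_at.get(url)
            # Assumption: unknown timestamps on either side mean "fetch".
            if lastmod is None or last_indexed is None or lastmod > last_indexed:
                urls_to_fetch.append(url)
    return FilterOutcome(total_urls=len(urls), urls_to_fetch=urls_to_fetch)
```

With a result shaped like this, connector validation can raise only when `sitemap_was_empty` is true and treat `all_filtered_out` as a normal incremental run with nothing to do.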
feat(pdf): use pdf metadata title as semantic identifier