Use parallel curl downloads for website link checks #311

Open
glenn-jocher wants to merge 1 commit into main from curl-brotli-parallel-website-downloads

@glenn-jocher glenn-jocher commented May 25, 2026

Member

Summary

  • replace serial Wget page mirroring in the website link checker with one generated curl config and a single parallel curl invocation
  • use curl --compressed so GitHub runner curl can negotiate Brotli when supported and still save decompressed HTML locally
  • keep concurrency bounded with --parallel-max 16 for the 600-1100 page sites

Validation

  • parsed updated workflow YAML with PyYAML
  • actionlint -ignore SC2004 -ignore SC2086 -ignore SC2129 .github/workflows/links.yml .github/workflows/download_websites.yml
  • git diff --check
  • local curl config smoke test confirmed url/output pairs create expected files

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

This PR speeds up website link checks by replacing sequential wget downloads with parallel curl downloads in the GitHub Actions link-check workflow. ⚡

📊 Key Changes

  • Switches sitemap and sub-sitemap fetching from wget to curl --compressed -fsSL for simpler, more reliable compressed downloads.
  • Removes the sequential wget website download step and introduces a small Python helper to generate a curl config file from urls.txt.
  • Maps each URL to a local decompressed HTML output path, including directory-style URLs to index.html and extensionless paths to .html.
  • Skips rejected asset types such as image and text files, and avoids re-downloading files that already exist.
  • Enables parallel page downloads with curl --parallel --parallel-max 16, while preserving retries, redirect handling, compression, and directory creation.
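
The URL-to-path mapping and config generation described above can be sketched as follows. This is a minimal illustration, not the helper from the PR diff: the names `local_path`, `curl_config`, and `REJECTED_SUFFIXES`, and the exact reject list, are assumptions (the summary only says image and text files are skipped). The `or '/'` fallback follows the reviewer's suggestion for bare-origin URLs. The `url = "..."` / `output = "..."` pairs are standard curl config-file syntax.

```python
from pathlib import Path
from typing import Optional
from urllib.parse import urlsplit

# Assumed reject list; the PR summary only mentions "image and text files".
REJECTED_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".svg", ".webp", ".txt"}


def local_path(url: str) -> Optional[str]:
    """Map a URL to a local output path, or None if the URL should be skipped."""
    parsed = urlsplit(url)
    path = f"{parsed.netloc}{parsed.path or '/'}"  # bare origin becomes "host/"
    if path.endswith("/"):
        return path + "index.html"  # directory-style URL
    suffix = Path(path).suffix.lower()
    if suffix in REJECTED_SUFFIXES:
        return None  # rejected asset type
    if not suffix:
        return path + ".html"  # extensionless page
    return path


def curl_config(urls, root="."):
    """Return curl config text with one url/output pair per page to fetch."""
    lines = []
    for url in (u.strip() for u in urls):
        if not url:
            continue
        out = local_path(url)
        if out is None or (Path(root) / out).exists():
            continue  # rejected asset, or file already downloaded
        lines += [f'url = "{url}"', f'output = "{out}"']
    return "\n".join(lines) + "\n" if lines else ""
```

The generated text could be written to a file and passed to curl via `--config`, alongside the `--parallel --parallel-max 16` and `--compressed` flags named in this PR.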

🎯 Purpose & Impact

  • Improves CI efficiency by downloading many pages concurrently instead of one at a time. 🚀
  • Reduces total runtime for link validation workflows, especially on large sitemaps.
  • Keeps local downloaded content structured and compatible with downstream checks by saving pages as decompressed HTML.
  • Maintains resilience through retry behavior and graceful failure handling for individual sitemap or page fetches.
  • Helps make link-check automation faster and more scalable for ultralytics/docs. ✅

@UltralyticsAssistant UltralyticsAssistant added devops GitHub Devops or MLops enhancement New feature or request priority: medium Normal priority; valuable but not urgent. labels May 25, 2026
@UltralyticsAssistant

Member

👋 Hello @glenn-jocher, thank you for submitting an ultralytics/docs 🚀 PR! This is an automated message to help with a smooth review process, and an engineer will assist you shortly. Please review the checklist below:

  • Define a Purpose: Clearly explain the purpose of your fix or feature in your PR description, and link to any relevant issues. Ensure your commit messages are clear, concise, and adhere to the project's conventions.
  • Synchronize with Source: Confirm your PR is synchronized with the ultralytics/docs main branch. If it's behind, update it by clicking the 'Update branch' button or by running git pull and git merge main locally.
  • Ensure CI Checks Pass: Verify all Ultralytics Continuous Integration (CI) checks are passing. If any checks fail, please address the issues.
  • Update Documentation: Update the relevant documentation for any new or modified features.
  • Add Tests: If applicable, include or update tests to cover your changes, and confirm that all tests are passing.
  • Sign the CLA: Please ensure you have signed our Contributor License Agreement if this is your first Ultralytics PR by writing "I have read the CLA Document and I sign the CLA" in a new message.
  • Minimize Changes: Limit your changes to the minimum necessary for your bug fix or feature addition. "It is not daily increase but daily decrease, hack away the unessential. The closer to the source, the less wastage there is." — Bruce Lee

For more guidance, please refer to our Contributing Guide. Don't hesitate to leave a comment if you have any questions. Thank you for contributing to Ultralytics! 🚀

@UltralyticsAssistant UltralyticsAssistant left a comment

🔍 PR Review

Made with ❤️ by Ultralytics Actions

Good simplification overall, and the parallel curl approach should speed this workflow up. I found two regressions to address before merging: bare-root URLs can be written to the wrong local path, and the sitemap fetches lost retry protection.

💬 Posted 2 inline comments

# Download initial sitemap and process
echo "Downloading sitemap..."
- SITEMAP=$(wget --compression=auto -qO- "https://${{ matrix.website }}/sitemap.xml") || { echo "Failed to download sitemap"; exit 1; }
+ SITEMAP=$(curl --compressed -fsSL "https://${{ matrix.website }}/sitemap.xml") || { echo "Failed to download sitemap"; exit 1; }

💡 MEDIUM: Switching the sitemap fetches from wget to plain curl removes retry behavior here, so a transient network/TLS/5xx failure now aborts the whole job before the parallel download step even starts. Please add retries to this fetch, and mirror the same change for the sub-sitemap download below.

Suggested change:

- SITEMAP=$(curl --compressed -fsSL "https://${{ matrix.website }}/sitemap.xml") || { echo "Failed to download sitemap"; exit 1; }
+ SITEMAP=$(curl --compressed --retry 3 --retry-all-errors -fsSL "https://${{ matrix.website }}/sitemap.xml") || { echo "Failed to download sitemap"; exit 1; }
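
For anyone who wants to try the retry flags outside the workflow shell step, here is a minimal Python sketch that builds the same curl invocation. The helper names `sitemap_command` and `fetch_sitemap` are hypothetical and not part of the PR. It assumes a curl on PATH that supports `--retry-all-errors` (added in curl 7.71.0).

```python
import subprocess


def sitemap_command(website, retries=3):
    """Build the curl argv for a sitemap fetch with retries (hypothetical helper)."""
    return [
        "curl", "--compressed",
        "--retry", str(retries), "--retry-all-errors",
        "-fsSL", f"https://{website}/sitemap.xml",
    ]


def fetch_sitemap(website):
    """Run curl and return the sitemap body; raises CalledProcessError on failure."""
    result = subprocess.run(
        sitemap_command(website), check=True, capture_output=True, text=True
    )
    return result.stdout
```

The same flag pair would apply unchanged to the sub-sitemap fetch, with only the URL differing.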

if not url:
    continue
parsed = urlsplit(url)
path = f"{parsed.netloc}{parsed.path}"

💡 MEDIUM: This path builder misclassifies bare-origin URLs like https://docs.ultralytics.com because path becomes just the hostname, and the '.' in Path(path).name check then treats the hostname dots as a file extension. That saves the root page as docs.ultralytics.com instead of docs.ultralytics.com/index.html, which changes the local base path and can throw off relative-link resolution for the homepage.

Suggested change:

- path = f"{parsed.netloc}{parsed.path}"
+ path = f"{parsed.netloc}{parsed.path or '/'}"
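
A short reproduction of the misclassification, as a sketch: `build_path` is a hypothetical stand-in for the single line under review, and the rest of the helper is not shown here. It uses only `urlsplit` and `Path.name`, the two calls the comment refers to.

```python
from pathlib import Path
from urllib.parse import urlsplit


def build_path(url, fixed):
    """Reproduce the path expression before and after the suggested change."""
    parsed = urlsplit(url)
    if fixed:
        return f"{parsed.netloc}{parsed.path or '/'}"
    return f"{parsed.netloc}{parsed.path}"


url = "https://docs.ultralytics.com"
before = build_path(url, fixed=False)
after = build_path(url, fixed=True)

# Before: the hostname itself is the last path component, so its dots look like an extension.
print(before, "." in Path(before).name)  # docs.ultralytics.com True
# After: the path ends with "/", so a directory-style rule can map it to index.html.
print(after, after.endswith("/"))  # docs.ultralytics.com/ True
```

URLs that already carry a path, such as `https://docs.ultralytics.com/guides/`, are unaffected because `parsed.path` is non-empty for them.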
