Use parallel curl downloads for website link checks by glenn-jocher · Pull Request #311 · ultralytics/docs

glenn-jocher · 2026-05-25T12:14:46Z

Summary

Replace serial wget page mirroring in the website link checker with one generated curl config and a single parallel curl invocation.
Use curl --compressed so the curl on GitHub runners can negotiate Brotli when supported and still save decompressed HTML locally.
Keep concurrency bounded with --parallel-max 16 for the 600-1100 page sites.

Validation

parsed updated workflow YAML with PyYAML
actionlint -ignore SC2004 -ignore SC2086 -ignore SC2129 .github/workflows/links.yml .github/workflows/download_websites.yml
git diff --check
local curl config smoke test confirmed url/output pairs create expected files

🛠️ PR Summary

🌟 Summary

This PR speeds up website link checks by replacing sequential wget downloads with parallel curl downloads in the GitHub Actions link-check workflow. ⚡

📊 Key Changes

Switches sitemap and sub-sitemap fetching from wget to curl --compressed -fsSL for simpler, more reliable compressed downloads.
Removes the sequential wget website download step and introduces a small Python helper to generate a curl config file from urls.txt.
Maps each URL to a local decompressed HTML output path, including directory-style URLs to index.html and extensionless paths to .html.
Skips rejected asset types such as image and text files, and avoids re-downloading files that already exist.
Enables parallel page downloads with curl --parallel --parallel-max 16, while preserving retries, redirect handling, compression, and directory creation.
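
The URL-to-path mapping and config generation described above could look roughly like the sketch below. This is not the helper from the PR: the function names, the `existing` parameter, and the reject list are hypothetical (the summary only says "image and text files"), and the root-path handling already includes the `or '/'` fix suggested in review.

```python
from pathlib import PurePosixPath
from typing import Iterable, Optional
from urllib.parse import urlsplit

# Hypothetical reject list; the PR does not enumerate the skipped extensions.
REJECTED_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".svg", ".webp", ".ico", ".txt"}


def url_to_output(url: str) -> Optional[str]:
    """Map a page URL to a local file path, or None if the URL should be skipped."""
    parsed = urlsplit(url)
    # Treat a bare origin (empty path) as the site root.
    path = f"{parsed.netloc}{parsed.path or '/'}"
    if path.endswith("/"):
        # Directory-style URL: save as index.html inside that directory.
        return path + "index.html"
    suffix = PurePosixPath(path).suffix.lower()
    if suffix in REJECTED_SUFFIXES:
        return None
    if not suffix:
        # Extensionless page: save with an .html extension.
        return path + ".html"
    return path


def build_curl_config(urls: Iterable[str], existing: frozenset = frozenset()) -> str:
    """Render url/output pairs in curl's config-file syntax."""
    lines = []
    for url in urls:
        url = url.strip()
        if not url:
            continue
        output = url_to_output(url)
        # Skip rejected assets and files that were already downloaded.
        if output is None or output in existing:
            continue
        lines.append(f'url = "{url}"')
        lines.append(f'output = "{output}"')
    return "\n".join(lines) + "\n"
```

The generated text would presumably be written to a file and passed to curl via `--config`, alongside the `--parallel --parallel-max 16` flags named in the PR.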

🎯 Purpose & Impact

Improves CI efficiency by downloading many pages concurrently instead of one at a time. 🚀
Reduces total runtime for link validation workflows, especially on large sitemaps.
Keeps local downloaded content structured and compatible with downstream checks by saving pages as decompressed HTML.
Maintains resilience through retry behavior and graceful failure handling for individual sitemap or page fetches.
Helps make link-check automation faster and more scalable for ultralytics/docs. ✅

UltralyticsAssistant · 2026-05-25T12:15:24Z

👋 Hello @glenn-jocher, thank you for submitting an ultralytics/docs 🚀 PR! This is an automated message to help with a smooth review process, and an engineer will assist you shortly. Please review the checklist below:

✅ Define a Purpose: Clearly explain the purpose of your fix or feature in your PR description, and link to any relevant issues. Ensure your commit messages are clear, concise, and adhere to the project's conventions.
✅ Synchronize with Source: Confirm your PR is synchronized with the ultralytics/docs main branch. If it's behind, update it by clicking the 'Update branch' button or by running git pull and git merge main locally.
✅ Ensure CI Checks Pass: Verify all Ultralytics Continuous Integration (CI) checks are passing. If any checks fail, please address the issues.
✅ Update Documentation: Update the relevant documentation for any new or modified features.
✅ Add Tests: If applicable, include or update tests to cover your changes, and confirm that all tests are passing.
✅ Sign the CLA: Please ensure you have signed our Contributor License Agreement if this is your first Ultralytics PR by writing "I have read the CLA Document and I sign the CLA" in a new message.
✅ Minimize Changes: Limit your changes to the minimum necessary for your bug fix or feature addition. "It is not daily increase but daily decrease, hack away the unessential. The closer to the source, the less wastage there is." — Bruce Lee

For more guidance, please refer to our Contributing Guide. Don't hesitate to leave a comment if you have any questions. Thank you for contributing to Ultralytics! 🚀

UltralyticsAssistant

🔍 PR Review

Made with ❤️ by Ultralytics Actions

Good simplification overall, and the parallel curl approach should speed this workflow up. I found two regressions to address before merging: bare-root URLs can be written to the wrong local path, and the sitemap fetches lost retry protection.

💬 Posted 2 inline comments

UltralyticsAssistant · 2026-05-25T12:17:07Z

          # Download initial sitemap and process
          echo "Downloading sitemap..."
-          SITEMAP=$(wget --compression=auto -qO- "https://${{ matrix.website }}/sitemap.xml") || { echo "Failed to download sitemap"; exit 1; }
+          SITEMAP=$(curl --compressed -fsSL "https://${{ matrix.website }}/sitemap.xml") || { echo "Failed to download sitemap"; exit 1; }


💡 MEDIUM: Switching the sitemap fetches from wget to plain curl removes retry behavior here, so a transient network/TLS/5xx failure now aborts the whole job before the parallel download step even starts. Please add retries to this fetch, and mirror the same change for the sub-sitemap download below.

Suggested change:

-          SITEMAP=$(curl --compressed -fsSL "https://${{ matrix.website }}/sitemap.xml") || { echo "Failed to download sitemap"; exit 1; }
+          SITEMAP=$(curl --compressed --retry 3 --retry-all-errors -fsSL "https://${{ matrix.website }}/sitemap.xml") || { echo "Failed to download sitemap"; exit 1; }

UltralyticsAssistant · 2026-05-25T12:17:07Z

+                  if not url:
+                      continue
+                  parsed = urlsplit(url)
+                  path = f"{parsed.netloc}{parsed.path}"


💡 MEDIUM: This path builder misclassifies bare-origin URLs like https://docs.ultralytics.com because path becomes just the hostname, and the '.' in Path(path).name check then treats the hostname dots as a file extension. That saves the root page as docs.ultralytics.com instead of docs.ultralytics.com/index.html, which changes the local base path and can throw off relative-link resolution for the homepage.

Suggested change:

-                  path = f"{parsed.netloc}{parsed.path}"
+                  path = f"{parsed.netloc}{parsed.path or '/'}"
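
To make the failure mode concrete, here is a small sketch of how the two path expressions differ for a bare-origin URL. It assumes, based on the PR summary, that the helper maps paths ending in `/` to `index.html` before it applies the extension check; `looks_like_file` is a hypothetical stand-in for the `'.' in Path(path).name` test.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def looks_like_file(path: str) -> bool:
    """Stand-in for the helper's extension check."""
    return "." in PurePosixPath(path).name


parsed = urlsplit("https://docs.ultralytics.com")

# Original expression: the path component is empty, so only the hostname remains.
before = f"{parsed.netloc}{parsed.path}"
# Suggested expression: an empty path falls back to "/".
after = f"{parsed.netloc}{parsed.path or '/'}"

print(before, looks_like_file(before))  # hostname dots are mistaken for an extension
print(after, after.endswith("/"))       # trailing slash marks a directory-style URL
```

With the fallback in place, the root page takes the directory-style branch and lands at docs.ultralytics.com/index.html, provided the trailing-slash check runs first.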

Use parallel curl downloads for website link checks

fea8b64

UltralyticsAssistant added labels May 25, 2026: devops (GitHub Devops or MLops), enhancement (New feature or request), priority: medium (Normal priority; valuable but not urgent).

UltralyticsAssistant reviewed May 25, 2026

glenn-jocher wants to merge 1 commit into main from curl-brotli-parallel-website-downloads
