⚡️ Speed up method DownloadManager._download_single by 54% #114

Open
codeflash-ai[bot] wants to merge 1 commit into main from codeflash/optimize-DownloadManager._download_single-mlcedeqy

@codeflash-ai codeflash-ai bot commented Feb 7, 2026

📄 54% (0.54x) speedup for DownloadManager._download_single in src/datasets/download/download_manager.py

⏱️ Runtime: 35.8 milliseconds → 23.2 milliseconds (best of 41 runs)

📝 Explanation and details

This optimization achieves a 54% speedup (35.8ms → 23.2ms) by replacing expensive URL parsing operations with string-based checks in hot-path functions that are called frequently during dataset downloads.

Key Optimizations

1. Optimized is_relative_path() - 76% faster (30.5ms → 7.2ms)

The original implementation called urlparse() on every invocation, which constructs a full ParseResult object. The optimized version uses a fast string search for : and validates scheme characters inline, avoiding object allocation entirely:

```python
# Original: Always calls urlparse
urlparse(url_or_filename).scheme == ""

# Optimized: Fast string operations
idx = s.find(":")  # O(n) string search, no object creation
# Validates scheme chars only if ':' is found
```

This function is called 1,375 times per test run, making the per-call savings significant.
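
As a rough illustration of the fast path described above, the sketch below compares a `urlparse`-based check with a string-only check. The function names and the `_SCHEME_CHARS` set are assumptions made for this example, not the code in the PR. The two checks are only shown to agree on ordinary inputs; edge cases such as leading whitespace or a colon followed only by digits may be treated differently by `urlparse` depending on the Python version.

```python
import string
from urllib.parse import urlparse

# Characters allowed in a URL scheme: letters, digits, "+", "-", ".".
_SCHEME_CHARS = frozenset(string.ascii_letters + string.digits + "+-.")


def is_relative_path_urlparse(s: str) -> bool:
    """Baseline: builds a full ParseResult just to read the scheme."""
    return urlparse(s).scheme == ""


def is_relative_path_fast(s: str) -> bool:
    """Hypothetical fast path: inspect only the text before the first ':'."""
    idx = s.find(":")
    if idx <= 0:
        # No colon, or a leading colon: there is no scheme.
        return True
    if s[0] not in string.ascii_letters:
        # A scheme must start with a letter.
        return True
    # Relative if any character before ':' is not a valid scheme character.
    return any(c not in _SCHEME_CHARS for c in s[:idx])


samples = [
    "https://example.com/a.csv",
    "s3://bucket/key",
    "hf://datasets/user/repo",
    "data/train.csv",
    "./data/a:b.csv",
    "train.csv",
    "",
]
# Maps each sample to whether the two checks return the same answer.
agreement = {
    s: is_relative_path_fast(s) == is_relative_path_urlparse(s) for s in samples
}
```

Comparing a fast path against the `urlparse` baseline on a set of representative inputs, as `agreement` does here, is one way to check that a string-based shortcut keeps the original behavior.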

2. Optimized url_or_path_join() - 34% faster (15.4ms → 10.2ms)

Replaced is_remote_url() (which calls urlparse) with a simple substring check:

```python
# Original: Calls urlparse internally
if is_remote_url(base_name):
    ...  # branch body not shown

# Optimized: Direct string check
if isinstance(base_name, str) and "://" in base_name:
    ...  # branch body not shown
```

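The "://" pattern is treated as a sufficient indicator for remote URLs in this context, avoiding the overhead of full URL parsing. Note that the check matches any string containing "://", so it is looser than scheme parsing.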
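
To make the trade-off concrete, here is a small sketch comparing a scheme-based check with the substring check. `has_scheme` is a stand-in written for this example (it is not the library's `is_remote_url`), and `contains_scheme_separator` is a hypothetical name for the optimized condition.

```python
from urllib.parse import urlparse


def has_scheme(s: str) -> bool:
    """Scheme-based check, standing in for a urlparse-backed helper."""
    return urlparse(s).scheme != ""


def contains_scheme_separator(s: object) -> bool:
    """Substring check in the style of the optimized condition."""
    return isinstance(s, str) and "://" in s


cases = ["https://example.com/data", "data/train", "dir/a://b"]
# Each value is (scheme-based result, substring result).
results = {c: (has_scheme(c), contains_scheme_separator(c)) for c in cases}
```

The two checks agree on the first two inputs and differ on `"dir/a://b"`, where the separator appears after a path component. Whether such inputs can reach `url_or_path_join()` in practice is not established by the PR description.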

3. Optimized cached_path() - 24% faster (84.9ms → 64.6ms)

Replaced can_be_local() and strip_protocol() calls with targeted file:// prefix checks:

```python
# Original: Generic fsspec check
if can_be_local(url_or_filename):
    url_or_filename = strip_protocol(url_or_filename)

# Optimized: Fast startswith check
if url_or_filename.startswith("file://") or url_or_filename.startswith("file:"):
    url_or_filename = strip_protocol(url_or_filename)
```

This reduces unnecessary fsspec introspection for the common case where URLs are not file:// URLs.
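
One detail in the optimized condition: any string that starts with `file://` also starts with `file:`, so the two `startswith` calls reduce to one. A minimal sketch with hypothetical helper names (not part of the PR) that checks this on a few inputs:

```python
def matches_two_prefixes(s: str) -> bool:
    """The condition as written in the optimized snippet."""
    return s.startswith("file://") or s.startswith("file:")


def matches_file_scheme(s: str) -> bool:
    """Single-prefix form; "file:" already covers "file://"."""
    return s.startswith("file:")


samples = [
    "file:///tmp/a.csv",
    "file:/tmp/a.csv",
    "file:a.csv",
    "/tmp/a.csv",
    "https://example.com/a.csv",
    "files/a.csv",
    "",
]
# True when both forms give the same answer for every sample.
same = all(matches_two_prefixes(s) == matches_file_scheme(s) for s in samples)
```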

4. Optimized tracked_str.set_origin() - 3% faster (2.99ms → 2.89ms)

Cached the super().__repr__() result to avoid calling it twice:

```python
# Original: Calls __repr__ twice
if super().__repr__() not in self.origins:
    self.origins[super().__repr__()] = origin

# Optimized: Single call, cached
rep = super().__repr__()
if rep not in self.origins:
    self.origins[rep] = origin
```

Why This Matters

The _download_single() method is called during dataset loading workflows, potentially thousands of times when processing dataset manifests or multi-file datasets. The line profiler shows this function spending 22.7% of its time in is_relative_path() and 60.1% in cached_path() in the original version. By optimizing these path-checking operations with fast string operations instead of full URL parsing, we reduce the cumulative overhead significantly.

The generated regression tests show the speedup on the error-handling path as well (FileNotFoundError case: 47.7μs → 32.7μs, 45.8% faster).
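
The percentages in this description appear to use two different metrics: the headline 54% matches a speedup ratio (old / new - 1), while the per-function figures match the fraction of time saved (1 - new / old). A short sketch, with function names chosen for illustration, that reproduces both from the reported timings:

```python
def speedup(old: float, new: float) -> float:
    """Relative speedup: how much more work fits in the same time."""
    return old / new - 1.0


def time_reduction(old: float, new: float) -> float:
    """Fraction of the original runtime that was saved."""
    return 1.0 - new / old


# (old, new) timings in milliseconds, as reported in this PR description.
timings_ms = {
    "_download_single": (35.8, 23.2),
    "is_relative_path": (30.5, 7.2),
    "url_or_path_join": (15.4, 10.2),
    "cached_path": (84.9, 64.6),
}

for name, (old, new) in timings_ms.items():
    print(
        f"{name}: speedup {speedup(old, new):.1%}, "
        f"time reduction {time_reduction(old, new):.1%}"
    )
```

Read this way, the overall change corresponds to roughly 35% less runtime for `_download_single`, which is the figure comparable to the per-function percentages.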

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 32 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 71.4% |

🌀 Generated Regression Tests

```python
import os
from pathlib import Path

import pytest  # used for our unit tests
from src.datasets.download.download_config import DownloadConfig
from src.datasets.download.download_manager import DownloadManager
from src.datasets.utils.track import tracked_str

def test_download_single_nonexistent_local_path_raises_file_not_found(tmp_path):
    # Construct a path that does not exist
    nonexist = tmp_path / "does_not_exist.txt"

    dm = DownloadManager(download_config=DownloadConfig())

    # Passing a non-existent local path should raise FileNotFoundError from cached_path
    with pytest.raises(FileNotFoundError):
        dm._download_single(str(nonexist), DownloadConfig()) # 47.7μs -> 32.7μs (45.8% faster)
```

```python
import json
import os
import tempfile
from pathlib import Path
from unittest.mock import MagicMock, Mock, patch

import pytest
from src.datasets.download.download_config import DownloadConfig
from src.datasets.download.download_manager import DownloadManager
from src.datasets.utils.track import tracked_str

class TestDownloadManagerDownloadSingle:
    """Test suite for DownloadManager._download_single method."""

    @pytest.fixture
    def temp_cache_dir(self):
        """Create a temporary cache directory for tests."""
        with tempfile.TemporaryDirectory() as tmpdir:
            yield tmpdir

    @pytest.fixture
    def download_manager(self, temp_cache_dir):
        """Create a DownloadManager instance with a temporary cache directory."""
        config = DownloadConfig(cache_dir=temp_cache_dir)
        manager = DownloadManager(download_config=config)
        return manager

    @pytest.fixture
    def download_config(self, temp_cache_dir):
        """Create a DownloadConfig instance with a temporary cache directory."""
        return DownloadConfig(cache_dir=temp_cache_dir)

    # ==================== BASIC TEST CASES ====================
```

To edit these changes, `git checkout codeflash/optimize-DownloadManager._download_single-mlcedeqy` and push.

@codeflash-ai codeflash-ai bot requested a review from aseembits93 February 7, 2026 14:16
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash (Optimization PR opened by Codeflash AI) and 🎯 Quality: High (Optimization Quality according to Codeflash) labels Feb 7, 2026