⚡️ Speed up function contains_wildcards by 106%#122

Open
codeflash-ai[bot] wants to merge 1 commit into main from codeflash/optimize-contains_wildcards-mlclg43u

@codeflash-ai codeflash-ai bot commented Feb 7, 2026

📄 106% (1.06x) speedup for contains_wildcards in src/datasets/data_files.py

⏱️ Runtime : 217 microseconds → 105 microseconds (best of 68 runs)

📝 Explanation and details

The optimization replaces the any() generator expression with an explicit for-loop that performs early-exit when a wildcard is found. This achieves a 106% speedup (runtime reduced from 217μs to 105μs).
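As a quick check on the headline figure, the percentage appears to be computed relative to the optimized runtime. The formula below is inferred from the reported numbers (it also reproduces the per-test annotations, e.g. 2.43μs -> 1.14μs giving 113%); it is an assumption, not taken from Codeflash documentation, and the helper names are hypothetical.

```python
def percent_faster(original_time: float, optimized_time: float) -> float:
    """Time saved, expressed as a percentage of the optimized runtime."""
    return (original_time - optimized_time) / optimized_time * 100


def speed_ratio(original_time: float, optimized_time: float) -> float:
    """How many times faster the optimized run is than the original."""
    return original_time / optimized_time


if __name__ == "__main__":
    # Headline numbers from this PR: 217 microseconds -> 105 microseconds.
    print(f"{percent_faster(217, 105):.1f}% faster")  # 106.7% faster
    print(f"{speed_ratio(217, 105):.2f}x as fast")    # 2.07x as fast
```

Read this way, "106% faster" means the optimized function takes roughly half the original time.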

What Changed:

  • Original: return any(wildcard_character in pattern for wildcard_character in WILDCARD_CHARACTERS)
  • Optimized: Explicit for-loop with early return True when any wildcard is found
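For reference, the two forms described above can be sketched as follows. The first body is the expression quoted in this PR; the second is a reconstruction from the description, since the actual diff is not shown here. The function names are made up so that both versions can sit side by side.

```python
WILDCARD_CHARACTERS = "*[]"


def contains_wildcards_original(pattern: str) -> bool:
    # Expression quoted in the PR description.
    return any(wildcard_character in pattern for wildcard_character in WILDCARD_CHARACTERS)


def contains_wildcards_optimized(pattern: str) -> bool:
    # Reconstruction (assumption): explicit loop that returns on the first hit.
    for wildcard_character in WILDCARD_CHARACTERS:
        if wildcard_character in pattern:
            return True
    return False


if __name__ == "__main__":
    for pattern in ("file*name", "prefix[rest", "end]", "filename.txt", ""):
        print(repr(pattern), contains_wildcards_original(pattern), contains_wildcards_optimized(pattern))
```

Both versions test the same three characters in the same order, so they should return the same result for any input that supports the `in` operator.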

Why It's Faster:

  1. Reduced Function Call Overhead: The original code calls the built-in any() and creates a generator object on every invocation, and each membership check resumes that generator. The optimized version eliminates this overhead by using direct control flow.

  2. Cheaper Short-Circuiting: Both implementations stop at the first wildcard found, so the explicit loop does not exit any earlier; it exits more cheaply, because returning from the loop involves no generator to resume and discard. The line profiler shows the optimized version returning early in 64 out of 90 iterations (71%).

  3. Lower Per-Check Cost: The line profiler reports a per-hit time of ~898ns for the loop line in the optimized code, compared to 5062ns per hit for the entire any() expression in the original. The two figures cover different amounts of work (one loop step versus the whole expression), so they show where the time goes rather than a like-for-like ratio.
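The points above can be checked locally with a small micro-benchmark. This is a minimal sketch using the standard-library `timeit` module, not the harness Codeflash used; `with_any`, `with_loop`, and `best_time` are hypothetical names, and absolute numbers will vary by machine and Python version.

```python
import timeit

WILDCARD_CHARACTERS = "*[]"


def with_any(pattern):
    # Original form: any() over a generator expression.
    return any(c in pattern for c in WILDCARD_CHARACTERS)


def with_loop(pattern):
    # Optimized form as described: explicit loop with early return.
    for c in WILDCARD_CHARACTERS:
        if c in pattern:
            return True
    return False


def best_time(func, pattern, repeat=5, number=20_000):
    """Best per-call time in seconds over several timing runs."""
    timer = timeit.Timer(lambda: func(pattern))
    return min(timer.repeat(repeat=repeat, number=number)) / number


if __name__ == "__main__":
    # One pattern per category discussed in the PR description.
    for pattern in ("*start", "file*name", "a" * 1000):
        t_any = best_time(with_any, pattern)
        t_loop = best_time(with_loop, pattern)
        print(f"{pattern[:12]!r:16} any: {t_any * 1e9:7.0f} ns  loop: {t_loop * 1e9:7.0f} ns")
```

Taking the minimum over several repeats mirrors the "best of N runs" convention in the report and reduces noise from other processes.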

Test Results:
The optimization is particularly effective for:

  • Patterns with wildcards at the start (e.g., "*start"): 127-150% faster due to immediate detection
  • Patterns with wildcards anywhere (e.g., "file*name"): 113-140% faster
  • Long strings without wildcards (e.g., 1000-char strings): 64-80% faster. The gain is smaller than in the wildcard cases because all three membership checks must run, but the loop still avoids the generator overhead for the full iteration

Even in the worst case (checking all 3 wildcard characters on long strings without wildcards), the optimized version is 64-80% faster because the generator and any() overhead is eliminated.

This optimization is valuable for any code that frequently checks patterns for wildcards, especially in file system operations, glob pattern validation, or path processing hot paths where this function might be called repeatedly.

Correctness verification report:

Test | Status
⚙️ Existing Unit Tests | 🔘 None Found
🌀 Generated Regression Tests | 90 Passed
⏪ Replay Tests | 🔘 None Found
🔎 Concolic Coverage Tests | 🔘 None Found
📊 Tests Coverage | 100.0%
🌀 Generated Regression Tests
import pytest  # used for our unit tests
from src.datasets.data_files import contains_wildcards

# function to test
# file: src/datasets/data_files.py
WILDCARD_CHARACTERS = "*[]"

def test_basic_star_detection():
    # Simple case: a '*' in the middle should be detected
    codeflash_output = contains_wildcards("file*name") # 2.43μs -> 1.14μs (113% faster)

def test_basic_left_bracket_detection():
    # A single '[' in the string should be detected
    codeflash_output = contains_wildcards("prefix[rest") # 2.65μs -> 1.18μs (124% faster)

def test_basic_right_bracket_detection():
    # A single ']' in the string should be detected
    codeflash_output = contains_wildcards("end]") # 2.65μs -> 1.26μs (111% faster)

@pytest.mark.parametrize(
    "pattern,expected",
    [
        # No wildcard characters at all
        ("filename.txt", False),
        # Empty string: no wildcards
        ("", False),
        # Only wildcard characters
        ("*[]", True),
        # Wildcard at start
        ("*start", True),
        # Wildcard at end
        ("end*", True),
        # Multiple different wildcards present
        ("a[b]c*", True),
    ],
)
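def test_various_simple_patterns(pattern, expected):
    # Parameterized basic functionality and simple edge patterns
    codeflash_output = contains_wildcards(pattern) # 14.0μs -> 6.46μs (117% faster)
    # Use the parametrized expectation, which was otherwise never checked
    assert codeflash_output == expected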

def test_unicode_similar_star_not_recognized():
    # A Unicode star-like character should NOT be recognized (only ASCII '*' is in WILDCARD_CHARACTERS)
    unicode_star = "file✱name"  # U+2731 HEAVY ASTERISK, not the ASCII asterisk (U+002A)
    codeflash_output = contains_wildcards(unicode_star) # 2.37μs -> 1.41μs (68.3% faster)

def test_escaped_backslash_followed_by_star_still_detects_star():
    # The pattern contains a literal backslash and a star; the star is still present and should be detected
    pattern = "\\*"
    codeflash_output = contains_wildcards(pattern) # 2.38μs -> 1.02μs (133% faster)

def test_non_string_none_raises_typeerror():
    # Passing None should raise a TypeError because 'in' checks will attempt to iterate None
    with pytest.raises(TypeError):
        contains_wildcards(None) # 3.64μs -> 2.60μs (39.7% faster)

def test_non_string_int_raises_typeerror():
    # Passing an integer should raise TypeError (integers are not iterable in this context)
    with pytest.raises(TypeError):
        contains_wildcards(123) # 3.80μs -> 2.58μs (47.3% faster)

def test_bytes_input_raises_typeerror():
    # Passing bytes causes a TypeError when the function checks for str characters in a bytes object
    with pytest.raises(TypeError):
        contains_wildcards(b"*") # 3.96μs -> 2.98μs (32.8% faster)

def test_iterable_list_behavior_with_wildcard_element():
    # The implementation does not enforce string types; if given a list that contains the exact wildcard
    # character as an element, membership checks will succeed -> function returns True.
    codeflash_output = contains_wildcards(["["]) # 2.51μs -> 1.14μs (120% faster)

def test_iterable_list_behavior_without_wildcard_element():
    # A list without any exact wildcard-element should return False
    codeflash_output = contains_wildcards(["abc", "def"]) # 2.32μs -> 1.29μs (79.4% faster)

def test_iterable_set_behavior_with_wildcard_element():
    # Sets also support membership checks; this verifies current behavior for non-string iterables
    codeflash_output = contains_wildcards({"*"}) # 2.41μs -> 1.03μs (135% faster)

def test_large_scale_no_wildcards():
    # Large-scale test (within the 1000-character guidance): long string with no wildcards should be False
    long_no_wild = "a" * 1000
    codeflash_output = contains_wildcards(long_no_wild) # 2.45μs -> 1.46μs (68.4% faster)

def test_large_scale_with_wildcard_in_middle():
    # Large-scale test: long string with wildcard in the middle should be True
    long_with_wild = "a" * 500 + "*" + "b" * 499  # total length 1000
    codeflash_output = contains_wildcards(long_with_wild) # 2.59μs -> 1.17μs (122% faster)

@pytest.mark.parametrize("char", list(WILDCARD_CHARACTERS))
def test_each_wildcard_character_detected_when_present(char):
    # Ensure each individual wildcard character from WILDCARD_CHARACTERS is detected
    codeflash_output = contains_wildcards(f"pre{char}post") # 7.53μs -> 3.38μs (123% faster)

def test_substring_elements_do_not_confuse_detection():
    # If a wildcard-like sequence appears only as part of an element inside a non-str iterable,
    # the implementation checks membership against elements, not substrings of elements.
    # For example, '*' in ['ab*c'] is False because the list element is 'ab*c', not '*' itself.
    codeflash_output = contains_wildcards(["ab*c"]) # 2.21μs -> 1.18μs (87.4% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import pytest
from src.datasets.data_files import contains_wildcards

def test_contains_single_asterisk():
    """Test that a single asterisk wildcard is correctly detected."""
    codeflash_output = contains_wildcards("*.txt") # 2.40μs -> 961ns (150% faster)

def test_contains_single_square_bracket():
    """Test that a single square bracket wildcard is correctly detected."""
    codeflash_output = contains_wildcards("file[0].txt") # 2.51μs -> 1.13μs (123% faster)

def test_contains_single_closing_square_bracket():
    """Test that a closing square bracket wildcard is correctly detected."""
    codeflash_output = contains_wildcards("file].txt") # 2.74μs -> 1.14μs (140% faster)

def test_contains_multiple_wildcards():
    """Test that multiple different wildcard characters are detected."""
    codeflash_output = contains_wildcards("*.txt[abc]") # 2.41μs -> 1.01μs (139% faster)

def test_no_wildcards_simple_string():
    """Test that a simple string without wildcards returns False."""
    codeflash_output = contains_wildcards("hello.txt") # 2.26μs -> 1.18μs (91.3% faster)

def test_no_wildcards_with_numbers():
    """Test that strings with numbers but no wildcards return False."""
    codeflash_output = contains_wildcards("file123.txt") # 2.18μs -> 1.31μs (66.3% faster)

def test_no_wildcards_with_special_chars():
    """Test that strings with other special characters (not wildcards) return False."""
    codeflash_output = contains_wildcards("file-name_123.txt") # 2.24μs -> 1.24μs (80.5% faster)

def test_empty_string():
    """Test that an empty string contains no wildcards."""
    codeflash_output = contains_wildcards("") # 2.15μs -> 1.15μs (86.5% faster)

def test_contains_asterisk_at_start():
    """Test wildcard asterisk at the beginning of pattern."""
    codeflash_output = contains_wildcards("*") # 2.37μs -> 1.00μs (136% faster)

def test_contains_asterisk_in_middle():
    """Test wildcard asterisk in the middle of pattern."""
    codeflash_output = contains_wildcards("file*name.txt") # 2.41μs -> 1.04μs (133% faster)

def test_contains_asterisk_at_end():
    """Test wildcard asterisk at the end of pattern."""
    codeflash_output = contains_wildcards("file.*") # 2.34μs -> 1.03μs (127% faster)

def test_contains_bracket_at_start():
    """Test opening bracket at the beginning of pattern."""
    codeflash_output = contains_wildcards("[abc]file.txt") # 2.56μs -> 1.14μs (124% faster)

def test_contains_bracket_in_middle():
    """Test bracket in the middle of pattern."""
    codeflash_output = contains_wildcards("file[0-9].txt") # 2.57μs -> 1.15μs (125% faster)

def test_contains_bracket_at_end():
    """Test bracket at the end of pattern."""
    codeflash_output = contains_wildcards("file.]") # 2.63μs -> 1.25μs (110% faster)

def test_multiple_asterisks():
    """Test multiple asterisk wildcards in one pattern."""
    codeflash_output = contains_wildcards("*.txt.*.backup") # 2.30μs -> 1.01μs (127% faster)

def test_multiple_brackets():
    """Test multiple bracket wildcards in one pattern."""
    codeflash_output = contains_wildcards("[a][b][c]file.txt") # 2.64μs -> 1.17μs (125% faster)

def test_asterisk_and_bracket_mixed():
    """Test both asterisk and bracket wildcards in one pattern."""
    codeflash_output = contains_wildcards("*[0-9].txt") # 2.38μs -> 1.05μs (126% faster)

def test_single_asterisk_only():
    """Test pattern that is just an asterisk."""
    codeflash_output = contains_wildcards("*") # 2.28μs -> 1.01μs (127% faster)

def test_single_opening_bracket_only():
    """Test pattern that is just an opening bracket."""
    codeflash_output = contains_wildcards("[") # 2.52μs -> 1.08μs (133% faster)

def test_single_closing_bracket_only():
    """Test pattern that is just a closing bracket."""
    codeflash_output = contains_wildcards("]") # 2.61μs -> 1.07μs (145% faster)

def test_only_asterisks():
    """Test pattern containing only asterisks."""
    codeflash_output = contains_wildcards("***") # 2.24μs -> 1.05μs (114% faster)

def test_only_opening_brackets():
    """Test pattern containing only opening brackets."""
    codeflash_output = contains_wildcards("[[[") # 2.52μs -> 1.13μs (124% faster)

def test_only_closing_brackets():
    """Test pattern containing only closing brackets."""
    codeflash_output = contains_wildcards("]]]") # 2.67μs -> 1.17μs (128% faster)

def test_only_wildcard_characters():
    """Test pattern containing only wildcard characters mixed."""
    codeflash_output = contains_wildcards("*[]") # 2.26μs -> 1.00μs (125% faster)

def test_very_long_string_without_wildcards():
    """Test very long string that contains no wildcards."""
    long_string = "a" * 500 + "file.txt"
    codeflash_output = contains_wildcards(long_string) # 2.56μs -> 1.50μs (71.6% faster)

def test_very_long_string_with_wildcard_at_start():
    """Test very long string with wildcard at the start."""
    long_string = "*" + "a" * 500 + "file.txt"
    codeflash_output = contains_wildcards(long_string) # 2.53μs -> 1.06μs (139% faster)

def test_very_long_string_with_wildcard_at_end():
    """Test very long string with wildcard at the end."""
    long_string = "a" * 500 + "file.txt" + "*"
    codeflash_output = contains_wildcards(long_string) # 2.55μs -> 1.22μs (108% faster)

def test_very_long_string_with_wildcard_in_middle():
    """Test very long string with wildcard in the middle."""
    long_string = "a" * 250 + "*" + "a" * 250 + "file.txt"
    codeflash_output = contains_wildcards(long_string) # 2.56μs -> 1.13μs (127% faster)

def test_wildcard_with_whitespace():
    """Test pattern with wildcard and whitespace characters."""
    codeflash_output = contains_wildcards("*.txt ") # 2.42μs -> 1.02μs (137% faster)

def test_wildcard_with_tabs():
    """Test pattern with wildcard and tab characters."""
    codeflash_output = contains_wildcards("*.txt\t") # 2.38μs -> 1.02μs (133% faster)

def test_wildcard_with_newlines():
    """Test pattern with wildcard and newline characters."""
    codeflash_output = contains_wildcards("*.txt\n") # 2.36μs -> 1.03μs (129% faster)

def test_path_like_string_no_wildcards():
    """Test Unix-style path without wildcards."""
    codeflash_output = contains_wildcards("/home/user/documents/file.txt") # 2.33μs -> 1.27μs (84.4% faster)

def test_path_like_string_with_asterisk():
    """Test Unix-style path with asterisk wildcard."""
    codeflash_output = contains_wildcards("/home/user/*/file.txt") # 2.41μs -> 1.05μs (130% faster)

def test_path_like_string_with_bracket():
    """Test Unix-style path with bracket wildcard."""
    codeflash_output = contains_wildcards("/home/user/[abc]/file.txt") # 2.60μs -> 1.20μs (116% faster)

def test_windows_path_no_wildcards():
    """Test Windows-style path without wildcards."""
    codeflash_output = contains_wildcards("C:\\Users\\Documents\\file.txt") # 2.29μs -> 1.19μs (92.5% faster)

def test_windows_path_with_wildcard():
    """Test Windows-style path with wildcard."""
    codeflash_output = contains_wildcards("C:\\Users\\*\\file.txt") # 2.41μs -> 1.09μs (121% faster)

def test_unicode_characters_no_wildcards():
    """Test pattern with unicode characters but no wildcards."""
    codeflash_output = contains_wildcards("fichier_\u00e9.txt") # 2.23μs -> 1.22μs (82.8% faster)

def test_unicode_characters_with_wildcard():
    """Test pattern with unicode characters and wildcard."""
    codeflash_output = contains_wildcards("fichier_\u00e9_*.txt") # 2.30μs -> 997ns (130% faster)

def test_repeated_character_no_wildcard():
    """Test pattern with repeated non-wildcard characters."""
    codeflash_output = contains_wildcards("aaaaaaa") # 2.25μs -> 1.22μs (84.1% faster)

def test_special_regex_chars_no_wildcard():
    """Test that regex special chars (not in WILDCARD_CHARACTERS) are not detected."""
    codeflash_output = contains_wildcards("file.+txt") # 2.20μs -> 1.21μs (81.7% faster)

def test_special_regex_chars_with_wildcard():
    """Test regex special chars combined with actual wildcard."""
    codeflash_output = contains_wildcards("file.+*txt") # 2.34μs -> 1.02μs (130% faster)

def test_glob_pattern_not_containing_wildcards():
    """Test glob-like pattern without actual wildcard characters."""
    codeflash_output = contains_wildcards("file{1,2,3}.txt") # 2.20μs -> 1.15μs (90.4% faster)

def test_glob_pattern_with_wildcard():
    """Test glob-like pattern with wildcard character."""
    codeflash_output = contains_wildcards("file{1,2,*}.txt") # 2.42μs -> 1.02μs (136% faster)

def test_escaped_wildcard_in_string():
    """Test that escaped wildcards (backslash) are still detected as wildcards."""
    # The function doesn't interpret escape sequences, just looks for characters
    codeflash_output = contains_wildcards("file\\*.txt") # 2.36μs -> 947ns (149% faster)

def test_bracket_at_exact_boundaries():
    """Test bracket character at exact string boundaries."""
    codeflash_output = contains_wildcards("[") # 2.44μs -> 1.12μs (118% faster)
    codeflash_output = contains_wildcards("]") # 925ns -> 406ns (128% faster)

def test_asterisk_only_string_length_100():
    """Test string of 100 asterisks."""
    codeflash_output = contains_wildcards("*" * 100) # 2.32μs -> 1.06μs (119% faster)

def test_mixed_wildcards_at_extremes():
    """Test mixed wildcard types at both ends of string."""
    codeflash_output = contains_wildcards("[*") # 2.37μs -> 1.01μs (133% faster)
    codeflash_output = contains_wildcards("*]") # 773ns -> 333ns (132% faster)
    codeflash_output = contains_wildcards("[*]") # 650ns -> 251ns (159% faster)

def test_large_pattern_no_wildcards():
    """Test very large pattern (1000+ chars) without wildcards."""
    large_pattern = "file_" + "x" * 1000 + ".txt"
    codeflash_output = contains_wildcards(large_pattern) # 2.40μs -> 1.46μs (64.4% faster)

def test_large_pattern_with_single_wildcard():
    """Test very large pattern (1000+ chars) with one wildcard."""
    large_pattern = "file_" + "x" * 500 + "*" + "x" * 500 + ".txt"
    codeflash_output = contains_wildcards(large_pattern) # 2.54μs -> 1.17μs (118% faster)

def test_large_pattern_with_asterisk_early():
    """Test large pattern with asterisk at the very beginning."""
    large_pattern = "*" + "x" * 1000 + ".txt"
    codeflash_output = contains_wildcards(large_pattern) # 2.37μs -> 1.05μs (126% faster)

def test_large_pattern_with_bracket_late():
    """Test large pattern with bracket near the end."""
    large_pattern = "file_" + "x" * 1000 + "[abc]"
    codeflash_output = contains_wildcards(large_pattern) # 2.71μs -> 1.34μs (102% faster)

def test_pattern_with_many_non_wildcard_special_chars():
    """Test pattern with many special characters but no wildcards."""
    pattern = "!@#$%^&()_+-={}:;'\"<>,.?/\\" * 10
    codeflash_output = contains_wildcards(pattern) # 2.38μs -> 1.33μs (79.4% faster)

def test_pattern_with_many_special_chars_and_wildcard():
    """Test pattern with many special characters and a wildcard."""
    pattern = "!@#$%^&()_+-={}:;'\"<>,*?.?/\\"
    codeflash_output = contains_wildcards(pattern) # 2.36μs -> 1.07μs (120% faster)

def test_repeated_non_wildcard_sequence():
    """Test pattern with a very long repeated sequence without wildcards."""
    pattern = ("abcdefghij" * 100)  # 1000 characters
    codeflash_output = contains_wildcards(pattern) # 2.49μs -> 1.50μs (65.8% faster)

def test_repeated_non_wildcard_sequence_with_wildcard():
    """Test repeated sequence with wildcard at the end."""
    pattern = ("abcdefghij" * 100) + "*"
    codeflash_output = contains_wildcards(pattern) # 2.57μs -> 1.20μs (115% faster)

def test_alternating_pattern_no_wildcards():
    """Test alternating character pattern without wildcards."""
    pattern = "ab" * 500  # 1000 characters
    codeflash_output = contains_wildcards(pattern) # 2.38μs -> 1.35μs (75.3% faster)

def test_alternating_pattern_with_wildcard():
    """Test alternating pattern with wildcard inserted."""
    pattern = "ab" * 250 + "[" + "ab" * 250
    codeflash_output = contains_wildcards(pattern) # 2.75μs -> 1.34μs (105% faster)

def test_performance_check_early_detection():
    """Test that function detects wildcard early in large string."""
    # Even with 1000 chars after wildcard, should find it at position 0
    pattern = "*" + "x" * 1000
    codeflash_output = contains_wildcards(pattern); result = codeflash_output # 2.51μs -> 1.04μs (140% faster)

def test_performance_check_late_detection():
    """Test wildcard detection later in string doesn't iterate forever."""
    # Wildcard sits at index 499 of a 1000-character string; it should still be found
    pattern = "x" * 499 + "*" + "x" * 500
    codeflash_output = contains_wildcards(pattern); result = codeflash_output # 2.57μs -> 1.19μs (116% faster)

def test_no_wildcard_complete_iteration():
    """Test that function completely iterates when no wildcard is present."""
    # Should check all 3 wildcard characters and return False
    pattern = "y" * 800
    codeflash_output = contains_wildcards(pattern) # 2.33μs -> 1.39μs (67.6% faster)

def test_multiple_wildcards_in_large_pattern():
    """Test large pattern with multiple wildcard occurrences."""
    pattern = "*" * 50 + "file" + "[" * 50 + "name" + "]" * 50 + ".txt"
    codeflash_output = contains_wildcards(pattern) # 2.40μs -> 1.08μs (121% faster)

def test_realistic_glob_pattern_large():
    """Test realistic large glob pattern with multiple wildcards."""
    pattern = "/path/to/data/**/[a-z]*/file_*.txt"
    codeflash_output = contains_wildcards(pattern) # 2.31μs -> 994ns (133% faster)

def test_realistic_glob_pattern_large_no_wildcards():
    """Test realistic large path without wildcards."""
    pattern = "/path/to/deeply/nested/directory/structure/with/many/components/file.txt"
    codeflash_output = contains_wildcards(pattern) # 2.31μs -> 1.28μs (80.4% faster)

def test_edge_case_many_brackets_and_asterisks():
    """Test pattern with many interspersed wildcards."""
    pattern = "*[*[*[" + "x" * 500 + "]*]*]*"
    codeflash_output = contains_wildcards(pattern) # 2.45μs -> 1.01μs (142% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
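
# The comment above says outputs of the original and optimized code are compared.
# A minimal sketch of such a comparison follows. It is not Codeflash's own harness;
# original(), optimized(), outcome(), and SAMPLE_INPUTS are hypothetical names, and
# optimized() is reconstructed from the PR description.

```python
WILDCARD_CHARACTERS = "*[]"


def original(pattern):
    return any(c in pattern for c in WILDCARD_CHARACTERS)


def optimized(pattern):
    for c in WILDCARD_CHARACTERS:
        if c in pattern:
            return True
    return False


def outcome(func, arg):
    """Return ('ok', value) or ('raised', exception type) for a single call."""
    try:
        return ("ok", func(arg))
    except Exception as exc:
        return ("raised", type(exc))


# Inputs drawn from the generated tests, including the non-string cases.
SAMPLE_INPUTS = ["file*name", "", "filename.txt", "\\*", ["["], ["ab*c"], {"*"}, None, 123, b"*"]


def behaviours_match(inputs):
    """True if both versions return the same value or raise the same exception type."""
    return all(outcome(original, x) == outcome(optimized, x) for x in inputs)


if __name__ == "__main__":
    print(behaviours_match(SAMPLE_INPUTS))  # True
```

# Comparing exception types as well as return values covers the None, int, and bytes
# cases, where both versions are expected to raise TypeError.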

To edit these changes git checkout codeflash/optimize-contains_wildcards-mlclg43u and push.

@codeflash-ai codeflash-ai bot requested a review from aseembits93 February 7, 2026 17:34
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Feb 7, 2026