Skip to content

[#38059] Fix GCS glob matching to support ** and / in object names#38099

Open
hilaryRope wants to merge 4 commits intoapache:masterfrom
hilaryRope:fix/gcs-glob-matching-slash-separator
Open

[#38059] Fix GCS glob matching to support ** and / in object names#38099
hilaryRope wants to merge 4 commits intoapache:masterfrom
hilaryRope:fix/gcs-glob-matching-slash-separator

Conversation

@hilaryRope
Copy link
Copy Markdown

Fix GCS filesystem glob matching to handle / in object names and support **

Fixes #38059

Summary

The GCS filesystem's List() method was using filepath.Match() to filter objects, which incorrectly treats / as a path separator. Since GCS object names are flat (with / being just another character), patterns like gs://bucket/** failed to match objects containing /.

Problem

  • filepath.Match treats / as a path separator, so * cannot match across /
  • filepath.Match does not support ** for recursive matching
  • fileio.MatchFiles(scope, "gs://my-bucket/**") silently excluded objects like dir/subdir/file.txt

Solution

Replaced filepath.Match with a custom globToRegex() function:

  • * → matches any characters except / (single path segment)
  • ** → matches any characters including / (recursive)
  • **/ → matches zero or more path segments

This aligns the Go SDK with the Python and Java SDKs.

Testing

Added unit tests for globToRegex() and List() with nested object names.


  • Mention the appropriate issue in your description (for example: addresses #123), if applicable.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses an issue where GCS filesystem glob patterns were incorrectly filtered due to the standard library's path-aware matching logic. By implementing a custom regex-based matching system, the GCS connector now properly handles object names containing slashes and supports recursive globbing, ensuring consistent behavior across different Apache Beam SDKs.

Highlights

  • Custom Glob Matching: Replaced the standard library's filepath.Match with a custom globToRegex function to correctly handle GCS object names, which are flat and treat '/' as a standard character.
  • Recursive Matching Support: Added support for the '**' glob pattern to enable recursive matching across GCS object paths, aligning the Go SDK with Python and Java implementations.
  • Comprehensive Testing: Introduced new unit tests for the glob-to-regex conversion logic and verified GCS list operations with nested object structures.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown

@utafrali utafrali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The core approach of replacing filepath.Match with a custom globToRegex is correct and well-motivated. The main issue is that ? is translated to . (matches /) while * is explicitly restricted to [^/]*, creating an inconsistency that will surprise callers. There are also minor concerns around the hand-maintained isRegexSpecial list (use regexp.QuoteMeta instead) and the silent swallowing of malformed [ patterns.

case '[':
// Character class - find the closing bracket
j := i + 1
if j < len(pattern) && pattern[j] == '!' {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ? wildcard is translated to . (matches any character including /), but * is explicitly translated to [^/]* (excludes /). This is inconsistent: a user writing dir?file.txt would unexpectedly match dir/file.txt.

Since the PR's design treats * as a single-segment matcher and ** as the cross-segment matcher, ? should follow the same rule and not cross /:

case '?':
    result.WriteString("[^/]")

The comment above the function even documents this asymmetry as intentional (? matches a single character (any character including /)), but no other major glob implementation (Python pathlib, Java PathMatcher) lets ? cross path separators when * does not.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, resolved the inconsistency and updated the comment

content := pattern[i+1 : j]
if len(content) > 0 && content[0] == '!' {
result.WriteByte('^')
content = content[1:]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isRegexSpecial maintains a hand-written list of regex metacharacters. The list is currently missing ] (harmless in Go RE2 outside [...], but still surprising) and is a maintenance risk if the code is adapted later. The default case in globToRegex can just call regexp.QuoteMeta directly, which is already imported and handles all metacharacters correctly:

default:
    result.WriteString(regexp.QuoteMeta(string(c)))

This removes the need for isRegexSpecial entirely.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, makes a lot of sense. Done. Replaced with regexp.QuoteMeta() and removed isRegexSpecial altogether

if j < len(pattern) && pattern[j] == ']' {
j++
}
for j < len(pattern) && pattern[j] != ']' {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When there is no closing ], the code silently treats [ as a literal character. Previously, filepath.Match would return filepath.ErrBadPattern for the same input. This is a silent behavior change: a caller who was relying on the error to detect bad patterns will no longer get one. Consider returning an error instead:

if j >= len(pattern) {
    return nil, fmt.Errorf("syntax error: unclosed '[' in pattern %q", pattern)
}

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Now returns fmt.Errorf("syntax error: unclosed '[' in pattern %q", pattern)

// Character classes
{"file[0-9].txt", "file1.txt", true},
{"file[0-9].txt", "filea.txt", false},
{"file[!0-9].txt", "filea.txt", true},
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ? test cases only verify matching a digit (file1.txt) and rejecting two characters (file12.txt). There is no test confirming whether ? matches or rejects /. Given the inconsistency with *, add an explicit case:

{"file?.txt", "file/.txt", false}, // ? should not cross /

(Or true if crossing is intentional, but document it clearly.)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Added {"file?.txt", "file/.txt", false}, // ? should not cross /

want []string
}{
// Single * should only match top-level files
{dirPath + "/*.txt", []string{dirPath + "/file.txt", dirPath + "/other.txt"}},
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The integration test for /** lists all four objects including file.txt and other.txt. The /** pattern is compiled to ^.*$ which matches any string, including an empty one. This is correct for GCS, but it is also worth adding a case like dirPath + "/dir/subdir/**" to verify deeply nested ** matching, since the nested case (dir/subdir/file.txt) is the core scenario from issue #38059.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added dirPath + "/dir/subdir/**" test case to verify deeply nested matching, which is the core scenario from issue #38059

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 7, 2026

Assigning reviewers:

R: @jrmccluskey for label go.

Note: If you would like to opt out of this review, comment assign to next reviewer.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Go SDK] fileio.MatchAll() cannot match all files on GCS

2 participants