fix(grep): stream ripgrep output to prevent memory exhaustion #5432
base: dev
Conversation
Pull request overview
This PR improves the memory efficiency of the grep tool by streaming ripgrep output instead of loading all results into memory before processing. Previously, the tool would wait for ripgrep to complete, load all output into memory, split it into lines (creating a second copy), and only then limit to 100 results. The new implementation streams output line-by-line and stops processing once 100 matches are collected, preventing memory exhaustion on large search spaces.
Key changes:
- Implements streaming processing of ripgrep stdout using ReadableStream API
- Adds early termination by killing the ripgrep process once MATCH_LIMIT (100) matches are collected
- Updates exit code handling to account for killed processes and partial results
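As a rough illustration of the streaming approach described above, here is a minimal sketch, not the PR's exact code. It assumes the ripgrep invocation (`rgArgs`) is built elsewhere to emit one `path|lineNumber|text` record per line, and uses the same `MATCH_LIMIT` constant as the PR.

```ts
const MATCH_LIMIT = 100

// Assumption: the ripgrep command line is built elsewhere and emits
// one "path|lineNumber|text" record per output line.
declare const rgArgs: string[]

const proc = Bun.spawn(rgArgs, { stdout: "pipe", stderr: "pipe" })

const matches: { path: string; lineNum: number; lineText: string }[] = []
let hitCollectLimit = false
let buffer = ""
const decoder = new TextDecoder()
const reader = proc.stdout.getReader()

try {
  outer: while (true) {
    const { done, value } = await reader.read()
    if (done) break
    buffer += decoder.decode(value, { stream: true })
    const lines = buffer.split("\n")
    buffer = lines.pop() ?? "" // keep the trailing partial line for the next chunk
    for (const line of lines) {
      if (!line) continue
      if (matches.length >= MATCH_LIMIT) {
        hitCollectLimit = true
        break outer
      }
      const [path, lineNumStr, ...rest] = line.split("|")
      if (!path || !lineNumStr || rest.length === 0) continue
      matches.push({ path, lineNum: parseInt(lineNumStr, 10), lineText: rest.join("|") })
    }
  }
} finally {
  reader.releaseLock()
  // Once enough matches are collected, stop ripgrep instead of draining it.
  if (hitCollectLimit) proc.kill()
}
```

The diff hunks under review, and Copilot's comments on them, follow.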
```ts
for (const line of lines) {
  if (!line) continue
  if (matches.length >= MATCH_LIMIT) {
    hitCollectLimit = true
    break
  }

  const [filePath, lineNumStr, ...lineTextParts] = line.split("|")
  if (!filePath || !lineNumStr || lineTextParts.length === 0) continue

  const lineNum = parseInt(lineNumStr, 10)
  const lineText = lineTextParts.join("|")

  const file = Bun.file(filePath)
  const stats = await file.stat().catch(() => null)
  if (!stats) continue

  matches.push({
    path: filePath,
    modTime: stats.mtime.getTime(),
    lineNum,
    lineText,
  })
}
```
Copilot AI · Dec 12, 2025
The streaming implementation processes matches asynchronously with file.stat() calls inside the read loop. If file.stat() is slow, this could cause backpressure issues where the stdout buffer fills up because the reader isn't consuming chunks fast enough. Consider processing file stats after collecting all line data, or implement a queue-based approach to handle I/O more efficiently.
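One way to keep the read loop free of filesystem I/O, sketched here under the assumption that the loop first collects raw records into an illustrative `rawMatches` array and only resolves mtimes after the stream is drained or the limit is hit:

```ts
type RawMatch = { path: string; lineNum: number; lineText: string }

// Assumption: filled synchronously inside the read loop; the only await there is reader.read().
declare const rawMatches: RawMatch[]

// After the loop: resolve modification times in one batch.
const matches = (
  await Promise.all(
    rawMatches.map(async (m) => {
      const stats = await Bun.file(m.path).stat().catch(() => null)
      return stats ? { ...m, modTime: stats.mtime.getTime() } : null
    }),
  )
).filter((m) => m !== null)
```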
packages/opencode/src/tool/grep.ts (Outdated)
```ts
const truncated = matches.length > MATCH_LIMIT
const finalMatches = truncated ? matches.slice(0, MATCH_LIMIT) : matches
```
Copilot AI · Dec 12, 2025
The variable truncated is computed based on whether matches.length exceeds MATCH_LIMIT, but since the collection loop stops at MATCH_LIMIT items, matches.length will never actually be greater than MATCH_LIMIT. This means truncated will always be false, making the variable redundant. The truncation logic should rely solely on hitCollectLimit to determine if results were truncated.
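A minimal sketch of that cleanup, assuming `hitCollectLimit` is set by the streaming loop as in the diff above:

```ts
// matches.length can never exceed MATCH_LIMIT, so the flag is the only signal needed.
const truncated = hitCollectLimit
const finalMatches = matches
```

The next reviewed hunk follows.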
```ts
  reader.releaseLock()
}

const errorOutput = await new Response(proc.stderr).text()
```
Copilot AI · Dec 12, 2025
When the collection limit is hit and the process is killed, the stderr might not be fully captured yet since proc.stderr is being read after the kill. This could lead to incomplete error messages. Consider reading stderr before killing the process, or at minimum be aware that error output may be incomplete when the limit is hit.
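A hedged sketch of the "read stderr early" option: start consuming stderr right after spawning, so a later kill cannot lose output that ripgrep already wrote. `rgArgs` is an assumed placeholder for the command line built elsewhere.

```ts
// Assumption: the ripgrep invocation is built elsewhere.
declare const rgArgs: string[]

const proc = Bun.spawn(rgArgs, { stdout: "pipe", stderr: "pipe" })

// Begin draining stderr immediately; do not await it yet.
const stderrPromise = new Response(proc.stderr).text()

// ...stream stdout here; proc.kill() may happen when the match limit is hit...

// Whatever stderr ripgrep produced before the kill has already been captured.
const errorOutput = await stderrPromise
```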
packages/opencode/src/tool/grep.ts (Outdated)
```ts
if (!hitCollectLimit && buffer) {
  const [filePath, lineNumStr, ...lineTextParts] = buffer.split("|")
  if (filePath && lineNumStr && lineTextParts.length > 0) {
    const lineNum = parseInt(lineNumStr, 10)
    const lineText = lineTextParts.join("|")
    const file = Bun.file(filePath)
    const stats = await file.stat().catch(() => null)
    if (stats) {
      matches.push({
        path: filePath,
        modTime: stats.mtime.getTime(),
        lineNum,
        lineText,
      })
    }
  }
}
```
Copilot AI · Dec 12, 2025
The handling of the last buffered line (lines 85-101) has duplication with the main loop logic (lines 64-79). This duplicated parsing and file stat logic reduces maintainability. Consider extracting the line processing into a separate function to avoid code duplication.
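A sketch of the suggested extraction; `processLine` and the `Match` shape are illustrative names that mirror the fields used in the diff.

```ts
type Match = { path: string; modTime: number; lineNum: number; lineText: string }

// Shared by the main read loop and the trailing-buffer handling.
async function processLine(line: string): Promise<Match | null> {
  const [filePath, lineNumStr, ...lineTextParts] = line.split("|")
  if (!filePath || !lineNumStr || lineTextParts.length === 0) return null
  const stats = await Bun.file(filePath).stat().catch(() => null)
  if (!stats) return null
  return {
    path: filePath,
    modTime: stats.mtime.getTime(),
    lineNum: parseInt(lineNumStr, 10),
    lineText: lineTextParts.join("|"),
  }
}

// Usage in either place:
// const match = await processLine(line)
// if (match) matches.push(match)
```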
packages/opencode/src/tool/grep.ts (Outdated)
```ts
const file = Bun.file(filePath)
const stats = await file.stat().catch(() => null)
if (!stats) continue
```
Copilot AI · Dec 12, 2025
Calling file.stat() for every single match result can be a performance bottleneck, especially when processing many results. Each stat call is an I/O operation that waits for file system access. Since ripgrep already provides the file path, consider whether the modification time is truly necessary for sorting, or if there's a way to batch or optimize these stat calls.
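One possible mitigation, sketched under the assumption that parsed matches are available as an illustrative `rawMatches` array: stat each distinct file once rather than once per match, since many matches typically come from the same files.

```ts
type RawMatch = { path: string; lineNum: number; lineText: string }
declare const rawMatches: RawMatch[] // assumed to be collected by the streaming loop

// One stat per distinct file path instead of one per match.
const uniquePaths = [...new Set(rawMatches.map((m) => m.path))]
const entries = await Promise.all(
  uniquePaths.map(async (p) => {
    const stats = await Bun.file(p).stat().catch(() => null)
    return [p, stats ? stats.mtime.getTime() : null] as const
  }),
)
const modTimeByPath = new Map(entries)

const matches = rawMatches.flatMap((m) => {
  const modTime = modTimeByPath.get(m.path)
  return modTime == null ? [] : [{ ...m, modTime }]
})
```

The next reviewed hunk follows.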
```diff
  const exitCode = await proc.exited

- if (exitCode === 1) {
+ if (exitCode === 1 && matches.length === 0) {
```
Copilot AI · Dec 12, 2025
When the process is killed due to hitting the collection limit (line 103), the exit code will likely be non-zero (e.g., killed by signal). The error handling on line 118 accounts for this with !hitCollectLimit, but there's a potential edge case: if ripgrep legitimately fails with exit code 1 but we already collected some matches, we'll now return those matches instead of reporting the error. The condition on line 110 should verify that exit code 1 specifically means "no matches found" rather than another error condition.
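For reference, ripgrep documents exit code 0 as "match found", 1 as "no match found", and 2 as an error. A tightened check along those lines might look like the sketch below; the surrounding variables are assumed to exist as in the diff.

```ts
declare const exitCode: number
declare const matches: unknown[]
declare const hitCollectLimit: boolean
declare const errorOutput: string

// ripgrep exit codes: 0 = match found, 1 = no match found, 2 = error.
const noMatches = exitCode === 1 && matches.length === 0
const realError = exitCode !== 0 && exitCode !== 1 && !hitCollectLimit

if (realError) {
  throw new Error(`ripgrep failed: ${errorOutput}`)
}
```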
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.
```ts
matches.push({
  path: filePath,
  lineNum: parseInt(lineNumStr, 10),
  lineText: lineTextParts.join("|"),
})
```
Copilot AI · Dec 12, 2025
The final buffer processing logic doesn't check if the match count has already reached the MATCH_LIMIT. If exactly MATCH_LIMIT matches were found in the streaming loop, this code could push a 101st match, exceeding the intended limit.
Suggested change:
```diff
- matches.push({
-   path: filePath,
-   lineNum: parseInt(lineNumStr, 10),
-   lineText: lineTextParts.join("|"),
- })
+ if (matches.length < MATCH_LIMIT) {
+   matches.push({
+     path: filePath,
+     lineNum: parseInt(lineNumStr, 10),
+     lineText: lineTextParts.join("|"),
+   })
+ } else {
+   truncated = true
+ }
```
```diff
  }

- if (exitCode !== 0) {
+ if (exitCode !== 0 && exitCode !== 1 && !truncated) {
```
Copilot AI · Dec 12, 2025
When the process is killed due to truncation, the exit code will likely be non-zero (typically 143 for SIGTERM or 137 for SIGKILL). The current condition checks if exitCode is not 0, not 1, and not truncated, but this doesn't account for the case where the process was killed and exits with a signal-related code. This could cause the error handler to throw incorrectly after a successful truncation. Consider checking if the process was killed before evaluating the exit code, or adjust the condition to handle signal-related exit codes.
Suggested change:
```diff
- if (exitCode !== 0 && exitCode !== 1 && !truncated) {
+ // Allow exit codes 137 (SIGKILL) and 143 (SIGTERM) if process was intentionally killed due to truncation
+ if (
+   exitCode !== 0 &&
+   exitCode !== 1 &&
+   !(truncated && (exitCode === 137 || exitCode === 143))
+ ) {
```
```ts
    }
  }
} finally {
  if (truncated) proc.kill()
```
Copilot AI · Dec 12, 2025
When the process is killed due to truncation, stderr and the exit code are accessed immediately after without waiting for the process to fully terminate. The proc.kill() call is asynchronous, and accessing stderr before the process has fully exited could result in incomplete error output. Consider awaiting proc.exited in the finally block after killing the process to ensure proper cleanup.
Suggested change:
```diff
- if (truncated) proc.kill()
+ if (truncated) {
+   proc.kill();
+   await proc.exited;
+ }
```
When rg runs on some really common string over a large search space, e.g. `cd / && rg /`, it waits for ALL the results to finish.
It used to load all the results (imagine 1 GB of text) into memory.
Then it would string-split them (allocating another 1+ GB of memory).
It would also do stat calls for every result.
Then, finally, it would keep only the first 100.
Now such a sinful rg query returns instantly and saves your memory/disk for better things.
Note: there is a behavior regression here: it used to load all results, and when truncating to 100 the results were sorted by most recently modified.
We could do a hybrid approach here: stream in up to 2000(?) results and return the most recent 100 of those (killing rg at the cap) to bound the worst case, while still providing a good experience for small result sets. A rough sketch of that idea follows.
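This sketch uses illustrative constants (`STREAM_CAP` for the 2000-ish cap, `MATCH_LIMIT` for the existing 100) and assumes each collected match already carries a `modTime`; it is an idea, not the PR's code.

```ts
const STREAM_CAP = 2000 // stop streaming (and kill rg) after this many raw matches
const MATCH_LIMIT = 100 // number of results actually returned

type Match = { path: string; modTime: number; lineNum: number; lineText: string }
declare const matches: Match[] // assumed: collected by the streaming loop, capped at STREAM_CAP
declare const hitCollectLimit: boolean // assumed: true if STREAM_CAP was hit and rg was killed

// Keep the most recently modified matches, as the old all-in-memory path effectively did.
matches.sort((a, b) => b.modTime - a.modTime)
const finalMatches = matches.slice(0, MATCH_LIMIT)
const truncated = hitCollectLimit || matches.length > MATCH_LIMIT
```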