feat: add fuzzy text matching fallbacks for edit_file tool #2145
trungutt wants to merge 1 commit into docker:main
Conversation
Inspired by the resilient edit matching in OpenCode, add fuzzy text matching to edit_file. When an exact oldText match fails, try progressively looser normalization strategies before giving up:

1. Line-trimmed (strip trailing whitespace per line)
2. Indentation-flexible (strip leading whitespace per line)
3. Line-continuation normalization (collapse \ + newline + spaces)
4. Escape-normalized (\" ↔ ")
5. Whitespace-collapsed (collapse runs of whitespace to a single space)

These handle common LLM mistakes such as adding spurious indentation, collapsing shell line continuations into single lines, or unescaping quotes. The replacement is always applied to the original content so surrounding text is preserved. Index mapping between normalized and original strings uses binary search with rune-aware boundary snapping to preserve UTF-8 integrity. Both the builtin and ACP filesystem handlers now use the shared FindAndReplace function.
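The fallback chain described above can be sketched roughly as follows. All function names, regexes, and the return shape here are illustrative stand-ins under stated assumptions, not the PR's actual Go implementation:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	wsRun    = regexp.MustCompile(`\s+`)       // runs of whitespace
	lineCont = regexp.MustCompile(`\\\n[ \t]*`) // backslash + newline + spaces
)

func trimTrailingPerLine(s string) string {
	lines := strings.Split(s, "\n")
	for i, l := range lines {
		lines[i] = strings.TrimRight(l, " \t")
	}
	return strings.Join(lines, "\n")
}

func trimLeadingPerLine(s string) string {
	lines := strings.Split(s, "\n")
	for i, l := range lines {
		lines[i] = strings.TrimLeft(l, " \t")
	}
	return strings.Join(lines, "\n")
}

func collapseLineContinuations(s string) string { return lineCont.ReplaceAllString(s, " ") }
func normalizeEscapedQuotes(s string) string    { return strings.ReplaceAll(s, `\"`, `"`) }
func collapseWhitespace(s string) string        { return wsRun.ReplaceAllString(s, " ") }

// findFuzzy tries an exact match first, then each looser strategy in order,
// returning the match index in the *normalized* content plus the strategy used.
func findFuzzy(content, oldText string) (normIdx int, normalize func(string) string, ok bool) {
	identity := func(s string) string { return s }
	for _, n := range []func(string) string{
		identity,
		trimTrailingPerLine,
		trimLeadingPerLine,
		collapseLineContinuations,
		normalizeEscapedQuotes,
		collapseWhitespace,
	} {
		if i := strings.Index(n(content), n(oldText)); i >= 0 {
			return i, n, true
		}
	}
	return 0, nil, false
}

func main() {
	// Spurious indentation in oldText: the indentation-flexible strategy matches.
	idx, _, ok := findFuzzy("RUN apt-get update\n", "    RUN apt-get update\n")
	fmt.Println(idx, ok) // prints: 0 true
}
```

Note that a match found against normalized content still has to be mapped back to offsets in the original string before replacing — that mapping is exactly what the review below takes issue with.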
Review Summary
Assessment: 🔴 CRITICAL
This PR adds fuzzy text matching fallbacks to the edit_file tool with 6 normalization strategies. The implementation has critical bugs in the boundary mapping logic that can cause data corruption when fuzzy matching is triggered.
Critical Issues Found
2 HIGH severity bugs in the mapNormToOrigEnd function that handles boundary detection for character-consuming normalizations (whitespace collapsing, line continuations, etc.). These bugs can lead to incorrect replacement ranges and file corruption.
1 MEDIUM severity issue in the binary search logic that may cause off-by-one errors with collapsing normalizations.
Recommendation
The boundary mapping algorithm needs to be redesigned or thoroughly tested with edge cases involving:
- Multi-byte UTF-8 characters combined with character-consuming normalizations
- Whitespace collapsing at UTF-8 boundaries
- Line continuation normalization with non-ASCII content
The existing test suite does not cover these critical edge cases.
Detailed Findings
See inline comments below.
pos := mapNormToOrig(original, normIdx, normalize)
for pos < len(original) {
    _, size := utf8.DecodeRuneInString(original[pos:])
    if len(normalize(original[:pos+size])) != len(normalize(original[:pos])) {
🔴 HIGH: Incorrect termination condition for consumed character detection
The condition len(normalize(original[:pos+size])) != len(normalize(original[:pos])) is fundamentally flawed for character-consuming normalizations.
The Problem:
When normalization collapses multiple characters to fewer (e.g., collapseWhitespace, collapseLineContinuations), advancing by one rune will almost always change the normalized length UNLESS that rune is completely consumed. This logic will fail to correctly identify the end boundary.
Impact:
- Wrong replacement ranges
- Potential file corruption when fuzzy matching triggers
- Especially dangerous with multi-byte UTF-8 characters
Example:
Consider normalizing "a  b" (two spaces) to "a b" (one space):
- At the position after 'a', the normalized length is 1
- Advancing one rune (the first space), the normalized length becomes 2
- Advancing another rune (the second space), the normalized length is still 2 (the space is collapsed)
- The loop will incorrectly continue past the intended boundary
The test suite doesn't cover these edge cases with UTF-8 and character-consuming normalizations.
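The length behavior the example describes can be reproduced directly. Here collapseWhitespace is a hypothetical stand-in for the PR's normalization, not its actual code:

```go
package main

import (
	"fmt"
	"regexp"
)

var ws = regexp.MustCompile(`[ \t]+`)

// collapseWhitespace: illustrative stand-in for a character-consuming normalization.
func collapseWhitespace(s string) string { return ws.ReplaceAllString(s, " ") }

func main() {
	original := "a  b" // two spaces between 'a' and 'b'
	for pos := 0; pos <= len(original); pos++ {
		prefix := original[:pos]
		norm := collapseWhitespace(prefix)
		fmt.Printf("prefix %q -> normalized %q (len %d)\n", prefix, norm, len(norm))
	}
}
```

Consuming the second space leaves the normalized prefix length at 2, the same as before it was consumed, so a pure length-change check cannot by itself decide where the matched region ends.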
// quote in \" → "). This ensures the entire original text corresponding
// to the normalized match is included. Advances by whole runes to
// preserve UTF-8 integrity.
func mapNormToOrigEnd(original string, normIdx int, normalize func(string) string) int {
🔴 HIGH: mapNormToOrigEnd boundary detection logic is incorrect
This function attempts to advance past trailing characters consumed by normalization, but the implementation is fundamentally incorrect for normalizations that map multiple characters to fewer.
Root Cause:
The function uses mapNormToOrig (binary search) to find the starting position, then advances by runes while checking if the normalized length changes. However, this approach doesn't correctly handle character-consuming normalizations like:
- collapseWhitespace (multiple spaces → one space)
- collapseLineContinuations (\ + newline → space)
- normalizeEscapedQuotes (\" → ")
Impact:
Incorrect end boundary detection leads to wrong replacement ranges and potentially corrupted file content.
Recommendation:
Consider a different approach:
- Track character-to-character mapping during normalization
- Use a forward scan with explicit state tracking
- Add comprehensive tests with UTF-8 edge cases
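One way to realize the first suggestion is to build the byte-offset map while normalizing, so no inverse search or rune probing is needed afterwards. The sketch below is illustrative only, covers just the whitespace-collapsing case, and is not the PR's code:

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeWithMap collapses runs of spaces/tabs to a single space while
// recording, for each byte of the normalized output, the byte offset of the
// originating rune in the original string.
func normalizeWithMap(original string) (string, []int) {
	var out strings.Builder
	var origAt []int // origAt[i] = original byte offset producing normalized byte i
	inRun := false
	for i, r := range original {
		if r == ' ' || r == '\t' {
			if !inRun { // first character of a run: emit one space
				out.WriteByte(' ')
				origAt = append(origAt, i)
				inRun = true
			}
			continue // remaining characters of the run are consumed
		}
		inRun = false
		start := out.Len()
		out.WriteRune(r) // may emit multiple bytes for a multi-byte rune
		for j := start; j < out.Len(); j++ { // one map entry per output byte
			origAt = append(origAt, i)
		}
	}
	return out.String(), origAt
}

func main() {
	norm, origAt := normalizeWithMap("a  b")
	fmt.Println(norm, origAt) // prints: a b [0 1 3]
}
```

With such a map, a normalized match [lo, hi) converts to original offsets without re-normalizing prefixes: the start is origAt[lo], and the end is origAt[hi-1] plus the byte length of the rune at that offset. Whether trailing consumed run characters should also be swallowed then becomes an explicit policy decision rather than a heuristic.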
// mapNormToOrig finds the smallest rune-aligned position in the original
// string whose normalized prefix has at least normIdx characters.
func mapNormToOrig(original string, normIdx int, normalize func(string) string) int {
🟡 MEDIUM: Binary search may overshoot with collapsing normalizations
The binary search finds "the smallest position where the normalized prefix has at least normIdx characters", but this can cause off-by-one errors when normalization collapses characters.
Example:
Normalizing "a  b" (two spaces) to "a b" (one space):
- Looking for position 2 in the normalized string (just after "a ")
- The binary search might land at position 3 in the original (after "a  ")
- Because normalize("a  ") = "a " also has length 2
Why LIKELY not CONFIRMED:
The binary search invariant appears correct in principle, but edge cases with collapsing normalizations could expose boundary issues. The real problem is that this function feeds into mapNormToOrigEnd, which has confirmed bugs.
Recommendation:
Add test cases specifically for boundary detection with:
- Multiple consecutive collapsible characters
- UTF-8 multi-byte characters adjacent to collapsible sequences
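A table-driven sketch of those cases (again with a hypothetical collapseWhitespace stand-in; byte vs. rune counts are printed because the mapping functions index by byte while advancing by rune):

```go
package main

import (
	"fmt"
	"regexp"
	"unicode/utf8"
)

var ws = regexp.MustCompile(`[ \t]+`)

func collapseWhitespace(s string) string { return ws.ReplaceAllString(s, " ") }

func main() {
	cases := []string{
		"a  b",     // multiple consecutive collapsible characters
		"é  b",     // multi-byte rune directly before a collapsible run
		"界 \t 界", // multi-byte runes on both sides of a mixed run
	}
	for _, c := range cases {
		n := collapseWhitespace(c)
		fmt.Printf("%q -> %q (%d bytes / %d runes -> %d bytes)\n",
			c, n, len(c), utf8.RuneCountInString(c), len(n))
	}
}
```

Any byte-vs-rune confusion in the boundary code would surface on the last two cases, where byte and rune counts diverge.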
Closed, reopen when needed
Summary
Inspired by the resilient edit matching in OpenCode (which uses 8+ fallback strategies in its edit tool), this adds fuzzy text matching to edit_file.

When an exact oldText match fails, the tool now tries progressively looser normalization strategies before returning an error:

- Line-trimmed: strip trailing whitespace per line
- Indentation-flexible: strip leading whitespace per line
- Line-continuation: collapse \ + newline + spaces into a single space
- Escape-normalized: treat \" and " as equivalent
- Whitespace-collapsed: collapse runs of whitespace to a single space

These handle real-world LLM mistakes observed in session logs:

- Spurious indentation added to oldText (e.g., 4 leading spaces on a Dockerfile RUN line)
- Shell line continuations (\ + newline) collapsed into single lines
- \" unescaped to " when reproducing file content

When a fuzzy strategy matches, the replacement is applied to the original (un-normalized) content so surrounding text is preserved exactly. Index mapping between normalized and original strings uses binary search with rune-aware boundary snapping to preserve UTF-8 integrity.

Both the builtin and ACP filesystem handlers use the shared FindAndReplace function.

This PR is independent of #2144 (double-serialized edits fix).