Skip to content

fix(db): ASCII case fold in upper/lower/ilike#1574

Open
sravan27 wants to merge 1 commit into
TanStack:mainfrom
sravan27:ascii-case-fold-for-upper-lower-ilike
Open

fix(db): ASCII case fold in upper/lower/ilike#1574
sravan27 wants to merge 1 commit into
TanStack:mainfrom
sravan27:ascii-case-fold-for-upper-lower-ilike

Conversation

@sravan27
Copy link
Copy Markdown

@sravan27 sravan27 commented Jun 7, 2026

Summary

upper(), lower(), and ilike currently call String.prototype.toUpperCase() / toLowerCase() directly. Those methods are locale-aware and length-changing:

input .toUpperCase() .toLowerCase()
straße STRASSE (6 -> 7) straße
STRA+ßE STRASSE straße
İ İ (1 -> 2, adds U+0307)
file FILE (3 -> 4) file
I (tr-TR locale) I ı

When a TanStack DB collection is fed by a server-side filter (e.g. a query collection backed by PostgreSQL or a sync engine over SQLite), the client side and server side disagree on the case-fold of the bucket key any time the data contains ß, ligatures, Turkish dotted/dotless i, or runs under a Turkish locale. The row silently drops out of the matching set - no error, no log line.

This PR uses ASCII-only case folding for upper(), lower(), and ilike, matching SQLites default behavior and making the result deterministic across JS locales and length-preserving.

What changed

  • packages/db/src/query/compiler/evaluators.ts: added asciiToUpper / asciiToLower helpers (32-codepoint shift for a-z / A-Z, no-op for everything else). Used by the upper, lower cases and by evaluateLike when caseInsensitive=true.
  • packages/db/tests/query/compiler/evaluators.test.ts: added regression tests covering ß, the ligature, Turkish dotted-i for upper/lower/ilike.
  • changeset added (patch).

Out of scope

This PR does not change length() or concat(). length() still returns String.prototype.length (UTF-16 code units) which diverges from SQLites code-point count for non-BMP characters - if maintainers want, Im happy to ship a follow-up. The case-fold issue has the highest visible impact because it affects the common case (Western-European data with ß / , Turkish İ/I) where users do not expect length changes.

Context

This is the same fix class behind powersync-ja/powersync-service#663 (open) and the four merged sync-rules fixes #644-#647. Reference open-source checker for this bug shape across JS DB query layers: https://github.com/sravan27/silentdrop

All 99 tests in evaluators.test.ts pass locally.

Summary by CodeRabbit

  • Bug Fixes

    • upper(), lower(), and ilike functions now use ASCII-only case folding instead of locale-aware case folding, aligning with SQLite behavior and resolving inconsistencies that previously caused silent row drops when non-ASCII characters were involved.
  • Tests

    • Added comprehensive test coverage for ASCII-only case-folding behavior with various non-ASCII character scenarios.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 7, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

This PR updates TanStack DB's upper(), lower(), and ilike() string functions to use ASCII-only case folding instead of JavaScript's locale-aware variants, ensuring results match SQLite's behavior and preventing silent data loss when non-ASCII characters differ in how they fold across JS and SQL.

Changes

ASCII-only case folding for string functions

Layer / File(s) Summary
ASCII case folding helper functions
packages/db/src/query/compiler/evaluators.ts
New asciiToUpper and asciiToLower internal helpers provide ASCII-only, length-preserving case folding as a foundation for all updated string functions.
Upper and lower function evaluators
packages/db/src/query/compiler/evaluators.ts
The compiled upper() and lower() evaluators now use asciiToUpper/asciiToLower instead of toUpperCase()/toLowerCase() for string inputs.
Case-insensitive LIKE/ILIKE pattern matching
packages/db/src/query/compiler/evaluators.ts
The LIKE/ILIKE case-insensitive path converts both search values and patterns using asciiToLower for deterministic, locale-independent matching.
ASCII case folding test coverage
packages/db/tests/query/compiler/evaluators.test.ts
New test cases validate ASCII-only, non-length-changing folding for upper(), lower(), and ilike() with characters like German sharp s, Turkish dotted i, and fi ligature.
Changelog entry
.changeset/ascii-case-fold-upper-lower-ilike.md
Documents the behavior change, examples of the new ASCII-only folding, and notes that users can opt into Unicode-aware folding via JavaScript string methods before queries enter the evaluator.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 In queries where letters must fold,
ASCII reigns, steady and bold,
No ß grows long, no dots disappear,
Matching's now clear, consistent and dear!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Description check ❓ Inconclusive The description comprehensively covers the problem, solution, and changes, but does not complete the required template checklist items. Complete the checklist by confirming local testing with 'pnpm test' and indicating whether a changeset has been generated.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: implementing ASCII case folding for upper/lower/ilike functions.
Docstring Coverage ✅ Passed Docstring coverage is 83.33% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
packages/db/src/query/compiler/evaluators.ts (1)

17-45: ⚡ Quick win

Deduplicate the ASCII fold loop to prevent divergence.

asciiToUpper and asciiToLower are near-identical; extracting one shared helper will reduce maintenance drift risk.

♻️ Proposed refactor
+function asciiFold(
+  s: string,
+  minCode: number,
+  maxCode: number,
+  delta: number,
+): string {
+  let out = ``
+  for (let i = 0; i < s.length; i++) {
+    const code = s.charCodeAt(i)
+    out +=
+      code >= minCode && code <= maxCode
+        ? String.fromCharCode(code + delta)
+        : s.charAt(i)
+  }
+  return out
+}
+
 function asciiToUpper(s: string): string {
-  let out = ``
-  for (let i = 0; i < s.length; i++) {
-    const c = s.charCodeAt(i)
-    out += c >= 0x61 && c <= 0x7a ? String.fromCharCode(c - 32) : s.charAt(i)
-  }
-  return out
+  return asciiFold(s, 0x61, 0x7a, -32)
 }
@@
 function asciiToLower(s: string): string {
-  let out = ``
-  for (let i = 0; i < s.length; i++) {
-    const c = s.charCodeAt(i)
-    out += c >= 0x41 && c <= 0x5a ? String.fromCharCode(c + 32) : s.charAt(i)
-  }
-  return out
+  return asciiFold(s, 0x41, 0x5a, 32)
 }

As per coding guidelines, “Extract common logic into utility functions when identical or near-identical code blocks appear in multiple places.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/db/src/query/compiler/evaluators.ts` around lines 17 - 45,
asciiToUpper and asciiToLower duplicate the same per-character loop; extract a
single helper (e.g., asciiFold or asciiCaseFold) that iterates the string and
applies a transform based on per-character checks and an offset or mapping, then
reimplement asciiToUpper and asciiToLower to call that helper with the
appropriate parameters (upper offset -32, lower offset +32 or a
predicate/mapper). Update calls to use the new helper and keep original function
names and signatures so external callers are unaffected, and ensure behavior for
non-ASCII characters remains unchanged by returning the original character when
outside the A-Z/a-z ranges.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@packages/db/src/query/compiler/evaluators.ts`:
- Around line 17-45: asciiToUpper and asciiToLower duplicate the same
per-character loop; extract a single helper (e.g., asciiFold or asciiCaseFold)
that iterates the string and applies a transform based on per-character checks
and an offset or mapping, then reimplement asciiToUpper and asciiToLower to call
that helper with the appropriate parameters (upper offset -32, lower offset +32
or a predicate/mapper). Update calls to use the new helper and keep original
function names and signatures so external callers are unaffected, and ensure
behavior for non-ASCII characters remains unchanged by returning the original
character when outside the A-Z/a-z ranges.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: cf6731c0-df7b-498e-b7f7-5a1df560cc1a

📥 Commits

Reviewing files that changed from the base of the PR and between 66210a5 and d9a2753.

📒 Files selected for processing (3)
  • .changeset/ascii-case-fold-upper-lower-ilike.md
  • packages/db/src/query/compiler/evaluators.ts
  • packages/db/tests/query/compiler/evaluators.test.ts

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant