- `ConnectorFileTooLargeError`: thrown mid-download when the listing under-reported size
- Large workflow value payloads
- prefer durable references/manifests over inlining large arrays or files
- materialize refs only behind an explicit byte budget
## KB Connector File Size Handling
The connector size pattern in `apps/sim/connectors/utils.ts` (`CONNECTOR_MAX_FILE_BYTES` + `readBodyWithLimit` + `stubOrSkipBySize`/`markSkipped`) exists for one risk: a knowledge-base connector downloading **arbitrary, user-controlled file bytes** that the source does not hard-cap. Apply it by that risk, not by the connector's name.

Use the pattern when the connector downloads file content via a stream/`download_url` where the user controls the size:
- any connector that fetches a file via a download URL, even if it is not a "storage" service (e.g. the Zoom transcript `.vtt`)

For those, require all three:
- stream the body with `readBodyWithLimit(resp, CONNECTOR_MAX_FILE_BYTES)`; never call raw `response.text()`/`response.arrayBuffer()`, which buffer the whole body before any size check can run
- skip oversize at listing (`stubOrSkipBySize` with the reported size) and again at fetch time (overflow -> `markSkipped`), since the listing size can be missing or under-reported
- never drop/truncate silently: oversized files become content-less failed rows carrying `skippedReason`, so they stay visible in the KB UI instead of vanishing from the index

Skip the pattern when the source already bounds the payload:
- pure API/structured-data connectors (Jira, Linear, Notion, Confluence, Sentry, Slack, Zendesk, Gmail, ...): responses are paginated JSON/text, so apply normal pagination + concurrency bounds instead of a per-file byte cap
- native-document connectors capped by the platform (Google Docs ~50 MB, Google Sheets via `MAX_ROWS`, Evernote ~25 MB/note): a 100 MB cap can never fire, and wrapping a `response.json()`/Thrift parse in `readBodyWithLimit` is cargo-culting

Litmus test: "Can a user make this one fetch arbitrarily large, with nothing upstream stopping it?" Yes -> use the pattern. No (platform hard-cap, or already paginated) -> a per-file byte cap adds noise, not safety. Borderline: a user-configured/self-hosted endpoint with no platform cap (e.g. Obsidian); bound it only if the content is genuinely unbounded.
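The three requirements listed earlier in this section (bounded streaming, a size check at listing and again at fetch, and a visible failed row instead of a silent drop) can be sketched as follows. This is a minimal illustration, not the code in `apps/sim/connectors/utils.ts`: every name ending in `Sketch`, the `RowSketch` shape, and the 100 MB constant are assumptions made for the example, and the real `readBodyWithLimit`, `stubOrSkipBySize`, and `markSkipped` may have different signatures.

```typescript
// Hypothetical stand-ins for illustration only; not the real connector helpers.
const MAX_FILE_BYTES_SKETCH = 100 * 1024 * 1024 // assumed cap

class FileTooLargeSketchError extends Error {
  constructor(limit: number) {
    super(`body exceeded ${limit} bytes`)
    this.name = 'FileTooLargeSketchError'
  }
}

/** Reads chunks one by one and aborts as soon as the running total passes `limit`. */
async function readWithLimitSketch(
  chunks: AsyncIterable<Uint8Array>,
  limit: number
): Promise<Uint8Array> {
  const parts: Uint8Array[] = []
  let total = 0
  for await (const chunk of chunks) {
    total += chunk.byteLength
    // The check happens before the chunk is kept, so memory stays near `limit`.
    if (total > limit) throw new FileTooLargeSketchError(limit)
    parts.push(chunk)
  }
  const out = new Uint8Array(total)
  let offset = 0
  for (const part of parts) {
    out.set(part, offset)
    offset += part.byteLength
  }
  return out
}

type RowSketch =
  | { status: 'ok'; name: string; content: Uint8Array }
  | { status: 'failed'; name: string; skippedReason: string }

async function fetchFileSketch(
  name: string,
  reportedSize: number | undefined, // listing size may be missing or wrong
  open: () => AsyncIterable<Uint8Array>,
  limit: number = MAX_FILE_BYTES_SKETCH
): Promise<RowSketch> {
  // Check 1: skip at listing time when the reported size is already too large.
  if (reportedSize !== undefined && reportedSize > limit) {
    return { status: 'failed', name, skippedReason: `reported size ${reportedSize} exceeds ${limit}` }
  }
  try {
    // Check 2: enforce the cap again while streaming, since the listing can under-report.
    const content = await readWithLimitSketch(open(), limit)
    return { status: 'ok', name, content }
  } catch (err) {
    if (err instanceof Error && err.name === 'FileTooLargeSketchError') {
      // Never drop silently: return a content-less failed row with a reason.
      return { status: 'failed', name, skippedReason: err.message }
    }
    throw err
  }
}
```

The sketch takes an `AsyncIterable<Uint8Array>` rather than a `Response` so it stays independent of any runtime; in a real connector the chunks would come from the response body stream.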
## Review Workflow
1. Identify every changed data source:
@@ -96,6 +121,7 @@ Read these when doing a deeper pass:
- fetches all pages from an external API before processing
- reads an entire file, HTTP response, or stream without a max byte budget
- checks size only after `Buffer.concat`, `arrayBuffer`, `text`, `JSON.parse`, or parse expansion
- a KB connector silently drops or truncates an oversized file instead of recording it as a failed (skipped) row
- chunks only after loading the complete dataset
- paginates with unbounded/deep `OFFSET` on a mutable or large table
- creates one queue job per row without batching or a queue-level concurrency key
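As a counterpart to the one-job-per-row smell, a minimal batching sketch is shown here. The helper name `chunkedSketch` is hypothetical and not part of the codebase; the point is only that rows are grouped into bounded batches lazily, so neither the full dataset nor one job per row has to exist at once.

```typescript
/** Lazily groups rows into arrays of at most `batchSize` items (hypothetical helper). */
function* chunkedSketch<T>(rows: Iterable<T>, batchSize: number): Generator<T[]> {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError('batchSize must be a positive integer')
  }
  let batch: T[] = []
  for (const row of rows) {
    batch.push(row)
    if (batch.length === batchSize) {
      yield batch
      batch = []
    }
  }
  // Emit the final partial batch, if any.
  if (batch.length > 0) yield batch
}
```

Each yielded batch would then become one queue job; a queue-level concurrency key would still need to be configured separately.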