Parameter limit #7803
Follow-up to #6784. This is probably a candidate for using the "run plugin CI against pulpcore main branch" automation.
    values = list(values)
    if len(values) < POSTGRES_MAX_QUERY_PARAMS:
        return Q(**{f"{field_name}__in": values})
    return Q(**{f"{field_name}__any_array": values})
Would there be a downside if we always used this array method?
This was something I was wanting to investigate a bit more before undrafting. This is draft because it's pretty much just "what Claude said" and I wanted to at least look at some query plans and compare before going with it.
I was trying to make a reliable unit test as well but, unfortunately, the unit test in #7801 is not reliable for reasons that are not entirely clear to me.
The two SQL strategies are:
IN ($1, $2, ..., $N): N separate bind parameters, one per value
= ANY($1): one bind parameter containing a PostgreSQL array
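To make the difference concrete, here is a rough stdlib-only sketch of the two statement forms. The helper names `in_clause` and `any_clause` and the column name `content_id` are made up for illustration; this is not the code in the PR and it does not go through Django or a database driver.

```python
def in_clause(column, values):
    """Build `column IN ($1, ..., $N)`: one bind parameter per value.

    Hypothetical helper; an empty list would yield invalid SQL and is not handled.
    """
    values = list(values)
    placeholders = ", ".join(f"${i}" for i in range(1, len(values) + 1))
    return f"{column} IN ({placeholders})", values


def any_clause(column, values):
    """Build `column = ANY($1)`: the whole list travels as one array parameter."""
    return f"{column} = ANY($1)", [list(values)]


sql_in, params_in = in_clause("content_id", [10, 20, 30])
sql_any, params_any = any_clause("content_id", [10, 20, 30])

# The IN form carries three parameters, the ANY form carries one (a list).
print(sql_in, len(params_in))
print(sql_any, len(params_any))
```

With 100,000 values the first form would need 100,000 placeholders, while the second still has exactly one.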
Parsing/planning: With IN, PostgreSQL must parse N parameter placeholders, and the resulting expression grows with N (for plain value lists PostgreSQL typically rewrites IN into a ScalarArrayOpExpr over an N-element array constructor rather than a literal OR-tree). At 100K values, that's significant parse time and planner overhead. With = ANY(array), the query structure is always the same single ScalarArrayOpExpr node over one parameter, regardless of list size.
Prepared statement caching: IN produces a different query shape for each different N, so the plan can't be reused across different list sizes. = ANY(array) always has the same shape (one parameter), so the plan is reusable.
Index usage: Both use btree indexes equally well.
Small lists: For tiny lists (1-10 items), IN is marginally cheaper because there's no array construction. The difference is negligible in practice.
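The "query shape" point can be checked without a database. A small sketch, using made-up helpers `in_sql` and `any_sql` and an assumed column name `pk`, that counts how many distinct statement texts each strategy produces across a few list sizes:

```python
def in_sql(n):
    """Statement text for an IN list of n bind parameters (hypothetical helper)."""
    return "pk IN (" + ", ".join(f"${i}" for i in range(1, n + 1)) + ")"


def any_sql(n):
    """Statement text for = ANY; n is accepted only to mirror in_sql and is unused."""
    return "pk = ANY($1)"


sizes = [1, 5, 50, 500]
in_shapes = {in_sql(n) for n in sizes}
any_shapes = {any_sql(n) for n in sizes}

# One distinct statement per list size for IN, a single statement for = ANY.
print(len(in_shapes), len(any_shapes))
```

Anything keyed on statement text, such as a prepared statement cache, would see a new entry per list size in the first case and one entry in the second.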
So why not always use = ANY?
There's no strong PostgreSQL-level reason not to. The threshold in safe_in() is mostly conservatism:
__in is Django's standard, battle-tested lookup — it handles querysets (subqueries), empty lists, None values, and all the edge cases Django has polished over years. A custom lookup is more code to maintain.
__in works across all database backends. = ANY(array) is PostgreSQL-specific.
For Pulp specifically (always PostgreSQL), you could always use = ANY for Python lists and it'd be fine.
If you want to simplify, you could drop the threshold and always use any_array for concrete lists. The threshold just avoids the custom path when there's no benefit.
PostgreSQL's wire protocol limits bind parameters to 65,535 per statement. When Django ORM's filter(field__in=python_list) generates WHERE field IN ($1, $2, ..., $65536+), it exceeds this limit when using server-side cursors (.iterator()).

This introduces a safe_in() utility that uses a custom Django lookup (= ANY(%s)) for large lists, passing the entire list as a single PostgreSQL array parameter regardless of size. For small lists, the standard __in lookup is used unchanged.

Applied safe_in() to all vulnerable code paths in pulpcore:
- RepositoryVersion.get_content(), added(), removed()
- import_repository_version() content mapping

Also updated the test to use .iterator() so it reliably exercises the server-side cursor path that triggers the parameter limit.

Assisted-By: claude-opus-4.6
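For reviewers who want to poke at the threshold behaviour without a Django setup, here is a minimal sketch of the dispatch. It mirrors the branch shown in the diff but returns the filter keyword arguments as a plain dict instead of a Q object. The name `choose_lookup` is hypothetical, and the constant's value is assumed from the 65,535 figure above rather than taken from the PR.

```python
# Assumed value, based on the per-statement limit quoted in the description.
POSTGRES_MAX_QUERY_PARAMS = 65535


def choose_lookup(field_name, values):
    """Return the filter kwargs that the threshold logic would select.

    Hypothetical stand-in for safe_in(): below the limit it keeps Django's
    standard __in lookup, otherwise it switches to the custom any_array lookup.
    """
    values = list(values)
    if len(values) < POSTGRES_MAX_QUERY_PARAMS:
        return {f"{field_name}__in": values}
    return {f"{field_name}__any_array": values}


small = choose_lookup("pk", range(10))
large = choose_lookup("pk", range(70000))
print(list(small))  # lookup key chosen for a short list
print(list(large))  # lookup key chosen for a list over the limit
```

Note that the comparison is strict, so a list of exactly 65,535 values already takes the array path.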