Parameter limit #7803

Draft
dralley wants to merge 1 commit into pulp:main from dralley:parameter-limit

dralley commented Jun 16, 2026

Contributor

📜 Checklist

  • Commits are cleanly separated with meaningful messages (simple features and bug fixes should be squashed to one commit)
  • A changelog entry or entries has been added for any significant changes
  • Follows the Pulp policy on AI Usage
  • (For new features) - User documentation and test coverage has been added

See: Pull Request Walkthrough

dralley commented Jun 16, 2026

Contributor Author

Follow-up to #6784

This is probably a candidate for using the "run plugin CI against pulpcore main branch" automation.

Comment thread pulpcore/app/util.py

    values = list(values)
    if len(values) < POSTGRES_MAX_QUERY_PARAMS:
        return Q(**{f"{field_name}__in": values})
    return Q(**{f"{field_name}__any_array": values})

Member

Would there be a downside if we always used this array method?

Contributor Author

This is something I wanted to investigate a bit more before undrafting. It's a draft because it's pretty much just "what Claude said", and I wanted to at least look at some query plans and compare before going with it.

I was trying to make a reliable unit test as well but, unfortunately, the unit test in #7801 is not reliable for reasons that are not entirely clear to me.

The two SQL strategies are:

  • IN ($1, $2, ..., $N): N separate bind parameters, one per value
  • = ANY($1): one bind parameter containing a PostgreSQL array

Parsing/planning: With IN, PostgreSQL must parse N parameter placeholders and the planner builds an OR-tree of comparison nodes. At 100K values, that's significant parse time and planner overhead. With = ANY(array), the query structure is always the same single ScalarArrayOpExpr node regardless of list size.

Prepared statement caching: IN produces a different query shape for each different N, so the plan can't be reused across different list sizes. = ANY(array) always has the same shape (one parameter), so the plan is reusable.

Index usage: Both use btree indexes equally well.

Small lists: For tiny lists (1-10 items), IN is marginally cheaper because there's no array construction. The difference is negligible in practice.
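
To make the difference in query shape concrete, here is a minimal Python sketch that builds both forms as (sql, params) pairs, roughly as a DB-API driver would receive them. The helper names in_clause and any_clause are hypothetical and are not part of pulpcore or Django; the %s placeholders assume psycopg-style parameter markers.

```python
def in_clause(column, values):
    """Build `column IN (%s, ..., %s)`: one placeholder per value.

    Note: an empty list would yield `IN ()`, which is not valid SQL;
    this sketch does not handle that case.
    """
    values = list(values)
    placeholders = ", ".join(["%s"] * len(values))
    return f"{column} IN ({placeholders})", values


def any_clause(column, values):
    """Build `column = ANY(%s)`: the whole list travels as one array parameter."""
    return f"{column} = ANY(%s)", [list(values)]


if __name__ == "__main__":
    for n in (3, 5):
        ids = list(range(n))
        # The IN text and parameter count change with n ...
        print(in_clause("content_id", ids))
        # ... while the ANY text stays the same, with exactly one parameter.
        print(any_clause("content_id", ids))
```

The SQL text from in_clause grows with the list, so each list size yields a distinct statement. any_clause always yields the same statement with a single parameter, which is what keeps it under the bind-parameter limit.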

So why not always use = ANY?

There's no strong PostgreSQL-level reason not to. The threshold in safe_in() is mostly conservatism:

  • __in is Django's standard, battle-tested lookup. It handles querysets (subqueries), empty lists, None values, and all the edge cases Django has polished over years. A custom lookup is more code to maintain.
  • __in works across all database backends. = ANY(array) is PostgreSQL-specific.

For Pulp specifically (always PostgreSQL), you could always use = ANY for Python lists and it'd be fine.

If you want to simplify, you could drop the threshold and always use any_array for concrete lists. The threshold just avoids the custom path when there's no benefit.

PostgreSQL's wire protocol limits bind parameters to 65,535 per
statement.  When Django ORM's filter(field__in=python_list) generates
WHERE field IN ($1, $2, ..., $65536+), it exceeds this limit when
using server-side cursors (.iterator()).
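
The 65,535 figure is consistent with the parameter count being carried in a 16-bit field. The following is only a sketch of that arithmetic: it uses struct to mimic a 16-bit unsigned counter, the name fits_bind_count is hypothetical, and it is not how any driver actually encodes the message.

```python
import struct

MAX_BIND_PARAMS = 0xFFFF  # 65,535: largest value a 16-bit unsigned count can hold


def fits_bind_count(n_params):
    """Return True if n_params can be encoded as a 16-bit unsigned integer."""
    try:
        struct.pack("!H", n_params)  # "!H" = network byte order, unsigned 16-bit
    except struct.error:
        return False
    return True


if __name__ == "__main__":
    print(fits_bind_count(MAX_BIND_PARAMS))      # list of exactly 65,535 values
    print(fits_bind_count(MAX_BIND_PARAMS + 1))  # one value too many
```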

This introduces a safe_in() utility that uses a custom Django lookup
(= ANY(%s)) for large lists, passing the entire list as a single
PostgreSQL array parameter regardless of size.  For small lists, the
standard __in lookup is used unchanged.

Applied safe_in() to all vulnerable code paths in pulpcore:
- RepositoryVersion.get_content(), added(), removed()
- import_repository_version() content mapping

Also updated the test to use .iterator() so it reliably exercises the
server-side cursor path that triggers the parameter limit.

Assisted-By: claude-opus-4.6