Skip to content

Proposal: store and serve full thumbnail URLs instead of hitting server#1331

Open
mihow wants to merge 13 commits into
feat/thumbnail-source-imagesfrom
feat/thumbnails-serializer-direct-urls
Open

Proposal: store and serve full thumbnail URLs instead of hitting server#1331
mihow wants to merge 13 commits into
feat/thumbnail-source-imagesfrom
feat/thumbnails-serializer-direct-urls

Conversation

@mihow

@mihow mihow commented Jun 4, 2026

Copy link
Copy Markdown
Collaborator

Stacked on #1306. Base branch: feat/thumbnail-source-images. Several pieces of the original proposal were promoted to the parent PR on 2026-06-10 so it deploys safely on its own; what remains here is only the warm/cold URL classification (+223/−20 over four files).

Summary

If a thumbnail has already been generated, and the source image hasn't changed, list responses return the direct file-storage URL instead of the Django redirect route. Cold or stale thumbnails keep the existing redirect route, which generates them on first request.

Here you can see the small thumbnail has already been fetched, but the medium one has not.
image

Two separate costs — what this fixes and what it must not introduce

Earlier drafts of this PR (and some review discussion) conflated these, so spelling them out:

  1. Browser HTTP round-trips — what this PR removes. With the parent alone, every thumbnail <img> is a request through SourceImageThumbnailViewSet: DRF auth, get_object(), a thumbnail-row check, then a 302 to storage. This is not a database N+1 — it is per-image HTTP fan-out, paid by every client. A 40-card gallery page is 80 of those requests. The parent has since reduced the per-request cost (no storage existence check, and the 302 is now browser-cached for five minutes), but each client still pays the full fan-out on first view and again whenever the cache expires. This PR removes it for warm rows: a measurement on a dev deployment showed Django requests per warm gallery view going from 80 to 0.

  2. Database N+1 — what this PR must avoid introducing. The parent's serializer never touches the thumbnail table: it builds route URLs from obj.pk plus settings, so there is no pre-existing N+1 to fix (an earlier draft wrongly claimed there was — see this comment and Luke's review thread). This PR makes the serializer read obj.thumbnails.all() per row to decide warm vs cold, which would be a new N+1 on its own. SourceImageQuerySet.with_thumbnails() prefetches it as one extra query per page, and the tests pin that contract.

In short: the win is fewer HTTP requests to Django; the prefetch is self-defense against a regression this PR would otherwise create, not a query-count improvement over the parent.

Why generation stays lazy (and why the cached row matters)

Captures have a sparse access pattern: a project can hold on the order of a million captures, but users review occurrence crops — captures themselves are opened only when troubleshooting or checking the context around a detection, so only a few hundred per project are ever viewed. That rules out pre-warming thumbnails at upload/sync time (it would spend storage and compute on images nobody opens) and with it the simpler "deterministic URL, no lookup" design, which only works when the blob is guaranteed to exist.

With generation lazy, something must answer "has this thumbnail been generated, at the current spec, and is it still fresh?" per capture and label — that is exactly what the SourceImageThumbnail row records, and why this PR reads it (via the prefetch below) instead of constructing URLs blindly.

What changes

ami/main/models.py

  • New SourceImageQuerySet.with_thumbnails() — prefetches thumbnails with an is_source_changed annotation (thumbnail last_modified predates the source's), pushing the staleness predicate into the prefetch SQL so the URL builder stays pure formatting.
  • New SourceImage.thumbnail_urls(request) — per-label {label: url} builder, the single source of truth for the warm/cold decision. Warm = a cached row exists with a non-blank path, a width equal to the requested spec width, and no source change → default_storage.url(row.path). Anything else → route URL via reverse_with_params, exactly as the parent serializer built it. The width check is strict equality because the generator (since the spec-width fix in Thumbnails API for SourceImages #1306) records the requested width on the row; legacy rows that stored the encoder's output read as stale here and self-heal through one regeneration.

ami/main/api/serializers.py

SourceImageThumbnailSerializer.get_thumbnails becomes a one-line passthrough:

def get_thumbnails(self, obj):
    return obj.thumbnail_urls(request=self.context.get("request"))

ami/main/api/views.py

  • SourceImageViewSet.get_queryset.with_thumbnails().
  • EventViewSet.get_queryset's inner example_captures Prefetch → same.

List of changes

Change (user/operator effect) How (implementation) Notes
Warm-cache gallery pages no longer route thumbnail requests through Django Serializer emits default_storage.url(row.path) when the cached row matches the configured width and isn't source-changed Removes up to 80 browser→Django round-trips per page view. Frontend treats the URL as opaque (<img src=>); no FE change.
Cold and stale thumbnails still regenerate via the redirect viewset Same SourceImageThumbnailViewSet + concurrency-safe upsert No behavior change on this path.
Re-uploaded source images don't serve stale thumbnails from the warm path is_source_changed annotation compares row vs source last_modified; stale rows fall back to the route URL Mirrors the predicate the generator already uses. Guards r3373715397.
Thumbnails generated before the #1306 spec-width fix self-heal Rows whose stored width is the encoder output (e.g. 239 for a 240 spec) read as stale and get the route URL, regenerating once One extra regeneration per legacy row, then warm forever.
The serializer's new per-row thumbnail read can't become an N+1 with_thumbnails() prefetch is the contract; viewsets wire it; tests assert it One extra query per page, not per row.
Per-label URL contract tested at the model surface test_thumbnail_urls_model_method_emits_warm_and_route exercises SourceImage.thumbnail_urls directly Future template tags / management commands pick up the contract for free.

Contract change

thumbnails.{small,medium} URL shape changes for cached rows from http://<host>/api/v2/captures/thumbnails/<id>/?label=small to e.g. https://<bucket>.s3.amazonaws.com/thumbnails/capture_<id>/small.jpg.

Frontend treats the value as opaque (see ui/src/data-services/models/capture.ts:170-182, ui/src/pages/captures/capture-columns.tsx). No frontend change required.

On deployments using FileSystemStorage (e.g. local dev), default_storage.url() returns a relative /media/... URL that Django still serves — the "bypass Django" win applies fully only where storage URLs are external (S3/MinIO). Compose with a CDN later via AWS_S3_CUSTOM_DOMAIN and these URLs front-load automatically.

Promoted to #1306 (no longer in this diff)

These shipped on the parent so it is solo-deployable: removal of the per-request default_storage.exists() check, progressive JPEG encoding, Cache-Control: private, max-age=300 on the redirect, the blank-path regen guard, and the spec-width fix (replacing the 2px width tolerance an earlier revision of this branch carried).

Luke's review threads addressed

What this does NOT change

  • The redirect viewset (SourceImageThumbnailViewSet) and its concurrency-safe upsert path are unchanged.
  • Browser cache semantics for storage URLs are governed by the storage backend's headers, unchanged here.
  • Frontend code is unchanged.

Known follow-ups

  • Orphan-thumbnail reconciler — a row whose storage blob was deleted out of band now produces a broken <img> until the row is removed (consequence of the exists() removal, which lives in Thumbnails API for SourceImages #1306). Planned as a periodic integrity-check task; separate ticket.
  • Event.first_capture() at ami/main/models.py:1277 is a per-row method that runs a fresh query; pre-existing, unchanged.

Verification

  • Browser network panel on a warm gallery page: <img> requests go direct to storage; zero requests to /api/v2/captures/thumbnails/....
  • Captures-list endpoint query count: +1 thumbnail prefetch query vs the parent, not +N (asserted in tests).
  • First measurement on a dev deployment is in this comment: Django round-trips per gallery view went from 80 to 0 once the cache was warm.

🤖 Generated with Claude Code

@netlify

netlify Bot commented Jun 4, 2026

Copy link
Copy Markdown

Deploy Preview for antenna-preview canceled.

Name Link
🔨 Latest commit 5b8d1d3
🔍 Latest deploy log https://app.netlify.com/projects/antenna-preview/deploys/6a2a15622b139d0008b504ef

@coderabbitai

coderabbitai Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7ee1c194-7406-4063-ada7-3a39407416b3

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 65.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the main change: storing and serving full thumbnail URLs instead of hitting the server, which is the primary objective of this PR.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The PR description comprehensively covers all required sections including summary, detailed changes, testing approach, deployment notes, and addresses review feedback.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/thumbnails-serializer-direct-urls

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@mihow mihow changed the title perf(thumbnails): emit storage URL direct from serializer when cache is warm Option for thumbnails: store and serve full URLs instead of hitting server Jun 4, 2026
@mihow mihow force-pushed the feat/thumbnails-serializer-direct-urls branch from e8ec983 to 704661c Compare June 5, 2026 00:11
@mihow mihow changed the title Option for thumbnails: store and serve full URLs instead of hitting server Proposal: store and serve full thumbnail URLs instead of hitting server Jun 5, 2026
@mihow mihow marked this pull request as ready for review June 5, 2026 01:09
@mihow

mihow commented Jun 5, 2026

Copy link
Copy Markdown
Collaborator Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ami/main/models.py`:
- Around line 2299-2302: thumbnail_urls() routes rows with an empty path
(row.path == "") through the cold redirect, but
find_or_generate_thumbnail_for_label() still treats existing rows as valid if
the width matches and thus never regenerates blank-path rows; update
find_or_generate_thumbnail_for_label() to treat rows with an empty/blank path as
missing/invalid so they fall through to regeneration (e.g., bypass the "width
matches" short-circuit when row.path is blank), and apply the same logic in the
corresponding check referenced around the code near lines 2354-2355 so
blank-path rows are rebuilt instead of returned unchanged.
- Around line 1954-1972: _thumbnail_warm_url currently builds URLs by
concatenating _thumbnail_storage_url_base() (default_storage.url("")) with a
quoted key, which can break for backends that normalize/sign names; change
_thumbnail_warm_url to call default_storage.url(name) directly (e.g. pass
path.lstrip("/") to default_storage.url) so the storage backend generates the
correct URL/encoding/normalization; remove reliance on
_thumbnail_storage_url_base in that function (you can keep
_thumbnail_storage_url_base for other callers/tests but do not use it inside
_thumbnail_warm_url).

In `@ami/main/tests.py`:
- Around line 338-339: Replace the naive datetime usage in the test regen
trigger setup: instead of calling datetime.datetime.now() to set
self.first_capture.last_modified, import and use Django's timezone.now() (from
django.utils import timezone) so the timestamp is timezone-aware when
USE_TZ=True; update the assignment in the test where
self.first_capture.last_modified is set and keep the subsequent save() call.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 15cdd819-9675-42a9-a29c-f618f0a63d55

📥 Commits

Reviewing files that changed from the base of the PR and between 22045e0 and 870cf99.

📒 Files selected for processing (4)
  • ami/main/api/serializers.py
  • ami/main/api/views.py
  • ami/main/models.py
  • ami/main/tests.py

Comment thread ami/main/models.py Outdated
Comment on lines +1954 to +1972
@functools.lru_cache(maxsize=1)
def _thumbnail_storage_url_base() -> str:
"""Storage URL prefix shared by every thumbnail row.

``default_storage.url('')`` is constant for the process lifetime once Django
settings are loaded, so cache it module-wide. The warm-path URL builder then
does one string concat per row instead of routing through the storage
backend's per-call URL machinery (boto3's ``_normalize_name`` /
``_clean_name`` + dict lookups on every call).

Tests that swap ``DEFAULT_FILE_STORAGE`` at runtime should call
``_thumbnail_storage_url_base.cache_clear()`` in their setup.
"""
return default_storage.url("")


def _thumbnail_warm_url(path: str) -> str:
"""Direct storage URL for a cached thumbnail blob (warm path)."""
return _thumbnail_storage_url_base() + urllib.parse.quote(path.lstrip("/"), safe="/")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '\n== Storage-related settings ==\n'
rg -n --glob '*.py' --glob '*.env*' --glob '*.yml' --glob '*.yaml' \
  'DEFAULT_FILE_STORAGE|STORAGES\s*=|AWS_QUERYSTRING_AUTH|GS_QUERYSTRING_AUTH|AWS_QUERYSTRING_EXPIRE|GS_EXPIRATION|AWS_S3_CUSTOM_DOMAIN|MEDIA_URL'

printf '\n== Custom storage classes / url() overrides ==\n'
rg -n --glob '*.py' 'class .*Storage|def url\(self,\s*name'

Repository: RolnickLab/antenna

Length of output: 2337


Avoid building _thumbnail_warm_url() via default_storage.url("") + <quoted key>; rely on storage.url(name) for backend-agnostic correctness.

Even though config/settings/production.py sets AWS_QUERYSTRING_AUTH = False (so the “signed URL” failure mode is unlikely in production), _thumbnail_warm_url() still assumes url("") is always a stable prefix equivalent to url(name) (including normalization/encoding) for every storage backend/environment. Prefer letting the configured storage generate the URL.

Suggested fix
-@functools.lru_cache(maxsize=1)
-def _thumbnail_storage_url_base() -> str:
-    """Storage URL prefix shared by every thumbnail row.
-
-    ``default_storage.url('')`` is constant for the process lifetime once Django
-    settings are loaded, so cache it module-wide. The warm-path URL builder then
-    does one string concat per row instead of routing through the storage
-    backend's per-call URL machinery (boto3's ``_normalize_name`` /
-    ``_clean_name`` + dict lookups on every call).
-
-    Tests that swap ``DEFAULT_FILE_STORAGE`` at runtime should call
-    ``_thumbnail_storage_url_base.cache_clear()`` in their setup.
-    """
-    return default_storage.url("")
-
-
 def _thumbnail_warm_url(path: str) -> str:
     """Direct storage URL for a cached thumbnail blob (warm path)."""
-    return _thumbnail_storage_url_base() + urllib.parse.quote(path.lstrip("/"), safe="/")
+    return default_storage.url(path.lstrip("/"))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ami/main/models.py` around lines 1954 - 1972, _thumbnail_warm_url currently
builds URLs by concatenating _thumbnail_storage_url_base()
(default_storage.url("")) with a quoted key, which can break for backends that
normalize/sign names; change _thumbnail_warm_url to call
default_storage.url(name) directly (e.g. pass path.lstrip("/") to
default_storage.url) so the storage backend generates the correct
URL/encoding/normalization; remove reliance on _thumbnail_storage_url_base in
that function (you can keep _thumbnail_storage_url_base for other callers/tests
but do not use it inside _thumbnail_warm_url).

Comment thread ami/main/models.py Outdated
Comment thread ami/main/tests.py Outdated
@mihow

mihow commented Jun 6, 2026

Copy link
Copy Markdown
Collaborator Author

Claude says: Adjacent perf question that came up while landing the progressive-JPEG change in 46d5fe53: is it worth flipping the thumbnail format from JPEG to WebP (or AVIF) in a follow-up? Summarizing the tradeoffs here so we can either decide now or punt to a separate ticket.

WebP

Pros

  • ~25–35% smaller than JPEG at equivalent perceptual quality. Real bandwidth win on dense gallery views (40-card pages) and on mobile.
  • Browser support is ~96% as of mid-2026: Safari 14+ (Sep 2020), all evergreen Chromium/Firefox, all current mobile. The legacy long tail is small enough that a <picture> fallback isn't worth the markup complexity for thumbnails.
  • Pillow handles it natively (format="WEBP", no extra dependency).
  • Plays nicely with the future content-addressed-paths + immutable cache-control plan — _<hash>.webp is the same shape as _<hash>.jpg.

Cons

  • Encode is ~3–5× slower than JPEG. Doesn't matter here because this is a write-once cache and the encode only fires on cold-path requests — never on the user-facing warm path that the rest of this PR was about.
  • Format change in the storage path: today capture_<pk>/<label>.jpg, post-flip capture_<pk>/<label>.webp. Existing .jpg rows phase out lazily as users browse and regen triggers fire (width change, source mtime bump). No bulk migration needed; the orphan-thumbnail reconciler (the planned follow-up Proposal: store and serve full thumbnail URLs instead of hitting server #1331 names) eventually cleans the leftover .jpg blobs.
  • Two settings keys to plumb (FORMAT, QUALITY), which is small but real wiring.

AVIF

Pros

  • 10–20% smaller again than WebP at the same perceptual quality.
  • Safari 16+ (Sep 2022), all evergreen Chromium/Firefox — ~94% support. Marginally less than WebP, still well within "don't bother with a fallback for thumbnails."

Cons

  • Encode is 5–10× slower than WebP — cold-path-only here, but the cold path matters more than it sounds when you fire create_detection_images over a 700k-capture session and every thumbnail goes through the same encoder. WebP is bounded; AVIF starts being noticeable.
  • Pillow's AVIF support requires the pillow-avif-plugin package (third-party, GPL-compatible but extra dependency surface).
  • The size-vs-encode-time math is worse than WebP for the gallery use case. AVIF wins more on photographic detail and high-resolution display assets, less on 1024px screen-sized thumbnails where WebP is already very close to lossless-perceptual.

Verdict: AVIF probably isn't worth the encoder-time cost for this PR's use case. Revisit in ~2027 if encoder libraries get faster.

Suggested shape: pluggable format

If we do this, the smallest reasonable surface is two new keys on settings.THUMBNAILS:

THUMBNAILS = {
    "STORAGE_PREFIX": "thumbnails/",
    "FORMAT": "webp",      # or "jpeg", default "jpeg" for dev/test, "webp" in production
    "QUALITY": 80,         # PIL accepts 0-100; WebP sweet spot ~80, JPEG ~82
    "METHOD": 4,           # WebP-only encode effort 0-6 (default 4); ignored for JPEG
    "SIZES": {"small": {"width": 240}, "medium": {"width": 1024}},
}

find_or_generate_thumbnail_for_label reads FORMAT, derives the extension and PIL save kwargs, and the storage path becomes capture_<pk>/<label>.{ext}. The model's _thumbnail_warm_url already builds URLs from row.path so no serializer change needed.

Open questions worth pinning down before opening the follow-up PR:

  1. Dev/test parity — should the test suite encode WebP or stay on JPEG? WebP makes some PIL test fixtures that hard-code JPEG markers / quality assumptions slightly more fragile. JPEG-in-tests + WebP-in-prod is operationally fine but means CI doesn't catch WebP-specific bugs (e.g. mode conversion edge cases that differ from JPEG). Lean toward WebP-in-tests with the THUMBNAILS["FORMAT"] flag respected throughout, so the behavior matches prod.
  2. Migration during the lazy phase-out — for a busy deployment (production-scale), do we want to actively prime regen for cached .jpg rows, or trust browsing traffic to drive it? Lean toward "trust browsing"; a one-shot management command (manage.py prime_thumbnails --format webp) is the escape hatch if it's slower than expected.
  3. Whether THUMBNAILS["FORMAT"] should default to "jpeg" or "webp" in config/settings/base.py. Production override is fine either way. Defaulting to "webp" says "this is the recommended format now"; defaulting to "jpeg" keeps the historical default and explicit opt-in for new behavior. Lean toward "webp" since the contract is the same path-with-different-extension.

Happy to open a separate ticket for this if you'd rather keep #1331 focused on the queryset/model rework + the progressive-JPEG win that's already in. Otherwise I can land the WebP flip as a third commit on this branch.

@loppear loppear left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm probably good with the intention here to split the cost differently between list & image retrieval - lookup thumbnails during API response by join, and skip serving through django redirect if we've already generated it. A lot of this is incidental to making that tradeoff, however.

Comment thread ami/main/models.py Outdated
return _thumbnail_storage_url_base() + urllib.parse.quote(path.lstrip("/"), safe="/")


def _thumbnail_route_url(pk: int, label: str, request: Request | None) -> str:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This already exists as ami.base.serializers.reverse_with_params as used in the code this PR refactors?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude says: Adopted in 545873a. Dropped the bespoke route helper, using reverse_with_params directly. Kept the explicit api: namespace so the model method works request-less (mgmt commands, template tags) as well as via DRF context.

Comment thread ami/main/models.py Outdated


@functools.lru_cache(maxsize=1)
def _thumbnail_storage_url_base() -> str:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure any of this obfuscation to avoid calling storage.url is worthwhile?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude says: Fair point — the lru_cache + base-prefix helpers were chasing a speculative boto3 overhead we never measured. Dropped in 545873a; warm path is now just default_storage.url(row.path) inline. The string-concat-vs-storage-backend distinction is below noise on the gallery hot path.

Comment thread ami/main/tests.py Outdated
self.assertGreater(len(captures), 1, "Test fixture must produce >1 capture to exercise N+1.")

with CaptureQueriesContext(connection) as ctx:
response = self.client.get(f"/api/v2/captures/?project_id={self.project.pk}")

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test couldn't fail before, right? The N queries this is all predicated on is over the retrieval of every image url redirect, not in just the captures list response.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude says: You're right — there was no pre-existing N+1 in the parent branch's captures-list serializer; the N+1 that test guarded was one this PR introduced by reading obj.thumbnails.all() from the warm-row check. Dropped that test and the queryset-method prefetch test in 545873a. Rewrote PR body to lead with what this actually wins (80 HTTP round-trips per gallery page, not a DB query count change). Full reasoning in #1331 (comment).

Comment thread ami/main/models.py Outdated
out: dict[str, str] = {}
for label, spec in sizes.items():
row = existing_by_label.get(label)
if row and row.path and row.width == spec["width"]:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to determine whether to skip the redirect, this check needs to reproduce the find_or_generate_thumbnail_for_label check, looking at last_modified (source_changed).

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude says: Correctness fix in 545873a. thumbnail_urls now mirrors the generator's source_changed predicate (row.last_modified < source.last_modified) and falls back to the route URL when the source has been re-uploaded, so the next browser fetch regenerates. Added test_serializer_falls_back_to_route_url_when_source_changed to pin it.

Comment thread ami/main/models.py
Comment thread ami/main/models.py Outdated
# HEAD per request is expensive on remote storage, and the cached row's
# presence is the authoritative warm signal. The narrow case of an orphan
# row (DB row points at a storage path whose blob was deleted out-of-band)
# is handled by the orphan-thumbnail reconciler (separate follow-up

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably worth it (I don't think there's a path for orphaning now), but maybe a comment that we're not handling orphans rather than that we're implementing this other thing.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude says: Reframed in 545873a. Comment now says we explicitly do NOT detect or repair the orphan case here — missing blob shows broken <img> until the row is removed and the next request regenerates — instead of claiming there's a reconciler that doesn't exist yet.

@mihow

mihow commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator Author

Claude says: On a re-read prompted by Luke's review, my initial analysis behind this PR was wrong on a fundamental point. Flagging it here before going any further on rework, because it changes the merge calculus.

What the PR body claimed

Captures-list endpoint adds one SELECT against main_sourceimagethumbnail regardless of card count, not N.

The test_captures_list_prefetches_thumbnails_no_n_plus_one test docstring reinforced this: "must hit main_sourceimagethumbnail at most once via prefetch, not once per row."

I framed this as fixing a pre-existing N+1 in the captures-list response.

What the code actually does

Pre-PR (parent branch, feat/thumbnail-source-images at 22045e07), SourceImageThumbnailSerializer.get_thumbnails reads obj.pk and settings only:

def get_thumbnails(self, obj):
    return {
        label: reverse_with_params(
            "sourceimagethumbnail-detail",
            args=(obj.pk,),
            request=self.context.get("request"),
            params={"label": label},
        )
        for label in settings.THUMBNAILS["SIZES"]
    }

No obj.thumbnails access. Zero DB queries per row. No prefetch needed in SourceImageViewSet.get_queryset — the parent branch's queryset doesn't include a thumbnails prefetch and never did.

There is no pre-existing N+1 against main_sourceimagethumbnail in the captures-list response. The N+1 the test guards is one this PR's warm-URL feature would introduce — without with_thumbnails(), the serializer collapses to per-row queries because we made it read obj.thumbnails.all() for the cached-row check. The test still has value as a regression guard for the prefetch, but it's pinning a contract this PR creates, not one it repairs.

Luke called this out at #1331 (comment) and was right.

What the PR actually wins

The real cost the PR removes is 80 HTTP round-trips per gallery page view through the SourceImageThumbnailViewSet redirect endpoint. Per page view (40 cards × 2 thumbnails):

Step Pre-PR Post-PR
/api/v2/captures/ request 1 1
/api/v2/captures/thumbnails/... redirect requests 80 0
S3 GET for blob bytes 80 80

Each of the 80 redirect requests goes through DRF auth + get_object() + find_or_generate_thumbnail_for_label (DB + a default_storage.exists() HEAD) before 302-ing to the storage URL. The 302 has no Cache-Control so the browser doesn't cache it; every page view repeats the fan-out.

That's a real cost. It's an HTTP-level fan-out problem, not a DB-level N+1. Different shape, different remedy space. The PR body framed it as the latter.

Correctness regression I missed

The original architecture has the redirect viewset acting as an implicit "is this thumbnail still fresh?" check on every browser request — find_or_generate_thumbnail_for_label examines source_changed = self.last_modified > thumb.last_modified and regens if true.

This PR's warm path skips that check entirely. Luke flagged this at #1331 (comment). Concrete failure mode:

  1. SourceImage bytes are re-synced or re-uploaded → SourceImage.last_modified bumps
  2. Thumbnail row's last_modified is now stale; row's path still points at the OLD blob
  3. Captures-list response emits the warm URL → browser GETs the OLD blob from storage
  4. No code path detects the staleness until someone hits the redirect viewset on a cold path (which may never happen if everyone reaches thumbnails via the gallery list)

User sees a stale thumbnail indefinitely. Pre-PR, this self-healed on the next page view.

Alternative approaches worth weighing before pushing rework

  1. Do nothing. A 5ms gunicorn round-trip × 80 per page view = 400ms of worker time per page view, distributed across the pool. Whether that's a bottleneck is a measurement question, not a code question. The PR was opened on assumption, not on a measured baseline. Worth running pass-1 of the NR measurement on the current parent branch deploy before deciding the rework is justified.
  2. Cache the 302. The redirect URL is content-addressable by (pk, label). Set Cache-Control: public, max-age=N on the redirect response and put nginx / Cloudflare in front. CDN serves the 302 from cache; Django hit only on cache miss. Browser still gets a normal 302 → S3 flow; auth and source-changed checks survive on cold path.
  3. Separate Django worker pool for thumbnails. Run a second gunicorn pool (or a separate service) that only handles /api/v2/captures/thumbnails/.... Main app's pool stays free. No code change. Independent scaling.
  4. Signed S3 URLs from the captures-list endpoint. Direct-to-storage like this PR, but with short-TTL signed URLs maintained via AWS_QUERYSTRING_AUTH=True (off in production today). Keeps auth context. Still has the source-changed problem.
  5. Continue this PR. Drop the over-engineered helpers per Luke 1+2, add the source-changed check per Luke 4, reframe the PR body per the above. Net: a correct but architecturally invasive change that solves something which may not have been measured as a problem.

My read is that option 1 (measure first) followed by option 2 if needed is probably the right path, and this PR may not be worth landing. The structural cost (lost source-changed check, frontend coupling to direct S3 URL shape, complexity of warm/cold branching in the serializer) is high for a perf gain we never validated.

Up to you on whether to keep iterating here or close in favor of measurement-first. I'd lean toward close; the architecture trade isn't favorable without a real bottleneck measurement.

…equest HEAD

Skip Django entirely for the gallery thumbnail fetch when a cached
SourceImageThumbnail row matches the configured size and the source
hasn't been re-uploaded. Each capture card previously took two
round-trips through SourceImageThumbnailViewSet (DRF auth +
find_or_generate + a default_storage.exists() HEAD + 302), with no
Cache-Control on the redirect so the browser repeated the fan-out
every page view. A 40-card gallery = 80 round-trips dropped.

URL build lives on SourceImage.thumbnail_urls(request), prefetch
contract on SourceImageQuerySet.with_thumbnails(). Single source
of truth callable from any surface, both directly testable.

Warm/cold predicate mirrors find_or_generate_thumbnail_for_label
including source-changed staleness, so a re-uploaded image
self-heals through the cold path.

Generator drops default_storage.exists() from the regen guard:
HEAD-per-request on remote storage was expensive, the cached row
is the warm signal. Missing-blob case is documented but not
repaired here (would require a separate orphan reconciler).

Also: progressive JPEG + optimize + quality=82 on encode for
faster perceived load on the medium thumbnails.

Co-Authored-By: Claude <noreply@anthropic.com>
@mihow mihow force-pushed the feat/thumbnails-serializer-direct-urls branch from 46d5fe5 to 545873a Compare June 8, 2026 19:43
PIL.Image.thumbnail() preserves aspect ratio by shrinking, so the
output width is often a pixel or two below the configured spec (e.g.
spec=240 → actual=239). The pre-existing strict ==/!= predicate in
the generator meant every cold hit on such rows regenerated forever,
and the warm-path emitter on PR #1331 would never fire for them.

Surfaced by arctia pass-2 measurement on UKCEH project 1 / event 823:
all small thumbs cached at width=239 → cold-emit for every card.

Tolerance of 2px applied identically in both find_or_generate's
regen gate and thumbnail_urls' warm check.

Co-Authored-By: Claude <noreply@anthropic.com>
@mihow

mihow commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator Author

Claude says: Arctia two-pass measurement landed. Revising my own framing — earlier write-up overclaimed.

Important caveat about arctia (didn't catch this initially): arctia uses FileSystemStorage with MEDIA_URL=/media/, so default_storage.url(row.path) returns a relative /media/... URL. That URL still hits Django — config/urls.py includes static(settings.MEDIA_URL, …), which routes every /media/ request through the full Django middleware stack (auth, sessions, CSRF, content negotiation — response headers confirm vary: Accept-Language, x-frame-options: DENY, etc., applied by Django middleware). The thumbnail bytes get served by Django's static() view reading from disk.

So the on-arctia win isn't "skip Django for warm thumbnails" — it's skip the redirect viewset's DRF auth + get_object + find_or_generate_thumbnail_for_label (DB lookup + default_storage.exists() HEAD) + 302 hop. The middleware-stack hit per /media/ request remains. Still meaningful (~270 ms → much less per thumbnail) but not the "Django bypassed" framing my earlier comment implied.

Where the "Django bypassed" framing does hold is on prod-like deployments using S3 storage with public bucket URLs: default_storage.url(...) then returns https://<bucket>.s3.amazonaws.com/... and browsers hit S3 directly with no Django involvement at all. That's the headline gain this PR is designed for — but it's not what arctia measures.

Numbers (with corrected framing)

Test data: UKCEH project 1, event 823, 40 captures × 2 labels pre-warmed.

Metric Pass-1 (parent feat/thumbnail-source-images) Pass-2 (this PR aeb46c01) Delta
Per-thumbnail path: DRF auth + get_object + find_or_generate + default_storage.exists() HEAD + 302 yes (80 / page) no (0 / page) eliminated
Per-thumbnail path: Django middleware + static() view + file read (on arctia FileSystemStorage) yes (80 / page, follow-up of 302) yes (80 / page, direct) unchanged on arctia
Per-thumbnail path: external storage GET (on prod S3 setup) yes (80 / page, follow-up of 302) yes (80 / page, direct) Django eliminated in prod
Captures-list response p50 (n=10, single client over public internet) 1.109 s 0.846 s −24%
Captures-list response p95 (n=10) 1.569 s 0.923 s −41%
Per single Django thumbnail-redirect call (pre-PR path) 0.267 s n/a — bypassed
20 simulated gallery views wall-clock (40 capture cards × 2 labels each = 80 thumbnail follow-ups per view) 49 s 10 s ~5×

On the wall-clock 5× number

The 5× is a single concurrency point, not a fixed ratio. My test fanned out 80 parallel curl requests per gallery view via xargs -P 20 × 4 background views. Pass-1 each follow-up cost ~270 ms going through the redirect viewset; pass-2 each follow-up cost much less going through the static() view. With only 2 gunicorn workers on arctia those 80 parallel requests queue heavily in pass-1 — that queueing is most of the 49 s wall-clock. Cranking concurrency higher widens the gap (pass-1 tail-latency tanks while pass-2 scales). Running serial shrinks it. So 5× is "at this concurrency, on this 2-worker dev box" — directionally correct but don't treat it as a fixed multiplier.

Bug surfaced during measurement → fix aeb46c01

PIL.Image.thumbnail() preserves aspect ratio by shrinking, so a spec=240 request often gets a width=239 blob. The pre-existing strict width != size["width"] regen-gate then re-triggered on every cold hit, and the warm-path emitter on this PR would never fire for those rows. Fixed with a 2-px tolerance applied symmetrically in both the warm-check and the regen gate, plus a regression test (test_serializer_treats_pil_rounded_width_as_warm). Pre-existing latent bug on parent — exposed by the new code path.

Caveats

  • Latency numbers are n=10 from a single client over the public internet — suggestive, not conclusive.
  • The 80 → 0 redirect-viewset elimination is mechanical (URL classification, not noise-dependent).
  • The "Django middleware bypassed" win is prod-S3-only — does not apply on FileSystemStorage deployments like arctia.
  • Arctia doesn't have New Relic enabled so the protocol's planned SourceImageThumbnailViewSet.retrieve span count + HeadObject rate metrics aren't measurable; for those a prod-monitored deploy is needed.

Comment thread ami/main/models.py Outdated
row is not None
and self.last_modified is not None
and row.last_modified is not None
and row.last_modified < self.last_modified

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can source_changed be an annotation in the queryset method instead of looping through the thumbnails here? If not, rename row/rows to thumb/thumbs for clarity.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude says: Done in 1b54cc8a. Annotation route worked cleanly.

SourceImageQuerySet.with_thumbnails() now builds the prefetch queryset and annotates is_source_changed per thumbnail row:

thumbnail_model = apps.get_model("main", "SourceImageThumbnail")
thumbnails_qs = thumbnail_model.objects.annotate(
    is_source_changed=models.Case(
        models.When(
            last_modified__lt=models.F("source_image__last_modified"),
            then=models.Value(True),
        ),
        default=models.Value(False),
        output_field=models.BooleanField(),
    )
)
return self.prefetch_related(models.Prefetch("thumbnails", queryset=thumbnails_qs))

The reverse FK source_image__last_modified is reachable from the prefetched queryset, so no OuterRef/Subquery was needed. NULL < x is NULL in SQL, which the CASE resolves as the False default — same behaviour the inline Python check had (it returned False when either side was None).

thumbnail_urls() is now pure URL-formatting: reads thumb.is_source_changed straight off the prefetched row. No more staleness loop. Renamed row/rowsthumb/thumbs while I was in there.

Contract is now strict: callers MUST go through with_thumbnails() — without it, thumbnail_urls() raises AttributeError on thumb.is_source_changed. Production callers (the serializer) already do; the direct-instance test was updated to re-fetch through with_thumbnails().

Comment thread ami/main/api/views.py Outdated
).order_by("timestamp")
queryset = (
queryset.select_related("event", "deployment", "deployment__data_source")
# Required by SourceImage.thumbnail_urls — drops N+1 on SourceImageThumbnailSerializer.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are they actually N+1 queries or N+1 requests? N+1 is a misleading statement that seems to be persisting.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude says: Scrubbed in 1b54cc8a. Inline views.py comments + serializers.py docstring + models.py with_thumbnails/thumbnail_urls/get_was_processed docstrings all now say "each row triggers a separate SELECT on main_sourceimagethumbnail" (or "fires its own DB query" for the get_was_processed case). N+1 phrasing dropped.

Comment thread ami/main/models.py Outdated
)
if not thumb or thumb.width != size["width"] or source_changed or not default_storage.exists(thumb.path):
# No ``default_storage.exists(thumb.path)`` here: HEAD-per-request on remote storage
# is expensive, and the cached row is the warm signal. We do NOT currently detect or

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove reactive/corrective language, the all caps "NOT"

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude says: Rewritten in 1b54cc8a. New text:

# Cached row is the warm signal — no HEAD against storage on the regen check.
# Orphan rows (DB row whose blob was deleted out of band) show a broken ``<img>``
# until something removes the row and the next request regenerates.

No "NOT", flat behaviour description.

Comment thread ami/main/models.py Outdated
# dimensions down by a pixel or two vs the spec (e.g. spec=240 → actual
# 239). Use a small tolerance so existing rows aren't treated as stale
# by every warm-check; the same tolerance is applied in the regen gate.
width_match = row is not None and abs(row.width - spec["width"]) <= _THUMBNAIL_WIDTH_TOLERANCE

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discuss some alternatives to this approach of width matching with tolerance.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude says: Five options to weigh; current tolerance is the cheapest interim.

a. Store spec width, not PIL's output. Generator writes row.width = spec["width"] instead of img.size[0]. row.width == spec["width"] then works without tolerance. Trade: row.width becomes "what we asked PIL to produce" rather than "what PIL actually produced." Anything that reads row.width for layout (e.g. a hypothetical <img width=…> server-side) would be off by 1–2px.

b. Store the spec key as a denorm field. Add generated_for_spec (CharField, e.g. "small@240") or full spec-dict on SourceImageThumbnail at generation time. Predicate becomes row.generated_for_spec == settings.THUMBNAILS["SIZES"][label].canonical_key. Migration cost; simplest semantic; future-proof if THUMBNAILS["SIZES"] ever grows fields beyond width/height.

c. Generation-version field. THUMBNAILS["VERSION"] bump in settings, stored on row, regen when row.version != current. Decouples regen from any per-row attribute. Adds a versioning burden — bump-on-edit discipline lives in settings, easy to forget.

d. One-direction predicate (row.width >= spec["width"]). PIL only shrinks, so anything ≥ spec is fresh-enough by definition. No tolerance constant. Doesn't handle the case where spec shrinks to a smaller size — e.g. old row at 1024, new spec 800, would falsely be "fresh" and visually too big.

e. Current tolerance (abs(row.width - spec["width"]) <= 2). No schema change, single module constant, symmetric in both directions. Arbitrary 2-px threshold, opaque without context.

My read: (a) is cleanest semantically and matches how the row should be labelled ("this thumbnail represents what we wanted to generate, not what PIL emitted bit-for-bit"). It's a one-line generator change + no migration. (b) is the right long-term answer if THUMBNAILS["SIZES"] ever grows beyond width/height. The tolerance is fine as an interim.

If you want me to swap to (a) in this PR I'm happy to — it's a small follow-up commit. Or land the current tolerance as the unblocking fix and queue (a)/(b) as a follow-up issue. Your call.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@loppear what do you think about options a) or d) here?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude says: Went with option (a). The generator now records the requested spec width on the row (parent branch commit 012efbc, with a regression test that simulates Pillow's rounding), and the 2px tolerance is gone from both the warm-check and the regen gate (56bc46a on this branch — strict equality everywhere). Legacy rows that stored encoder output read as stale, fall back to the route URL, and self-heal with a single regeneration. Option (b) (a generated_for_spec field) remains the long-term answer if thumbnail specs ever grow beyond width; queued as a follow-up rather than another migration in this pair of PRs.

@mihow

mihow commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator Author

@loppear thanks for reviewing this one. the agent went a bit overboard on my requests. But there are a few changes I like. I think we can consider these all optimizations and merge your main thumbs PR soon.

mihow and others added 5 commits June 9, 2026 06:36
…-feedback rewrites

* SourceImageQuerySet.with_thumbnails() now annotates is_source_changed onto
  the prefetched thumbnail rows in SQL (CASE / WHEN last_modified < source's),
  so SourceImage.thumbnail_urls() is pure URL formatting — no Python staleness
  loop. Per @mihow review note on #1331.
* thumbnail_urls() rename row/rows -> thumb/thumbs for consistency with the
  rest of the codebase.
* Drop 'N+1' phrasing from views.py, serializers.py, and models.py comments
  and docstrings — replaced with 'each row triggers a separate SELECT' /
  'each row fires its own DB query' which is the actual mechanism.
* Drop 'NOT' caps + reactive framing from the HEAD-removal comment in
  find_or_generate_thumbnail_for_label; rewrite as flat behaviour description.
* Update test_thumbnail_urls_model_method_emits_warm_and_route to re-fetch
  through with_thumbnails() so the strict annotation contract holds.

24/24 ami.main.tests.TestImageThumbnailViews pass.

Co-Authored-By: Claude <noreply@anthropic.com>
… main merge) into feat/thumbnails-serializer-direct-urls
…a fix) into feat/thumbnails-serializer-direct-urls
Comment thread ami/main/tests.py Outdated
mihow and others added 6 commits June 10, 2026 17:26
…EAD, progressive JPEG, redirect caching, blank-path guard) into feat/thumbnails-serializer-direct-urls

Co-Authored-By: Claude <noreply@anthropic.com>

# Conflicts:
#	ami/main/models.py
#	ami/main/tests.py
…ec width

The generator records the requested spec width on the row (see parent
branch), so the warm-check can use strict equality instead of the 2px
tolerance that compensated for PIL's aspect-ratio rounding. Legacy rows
that stored encoder output read as stale, fall back to the route URL,
and self-heal with a single regeneration.

Co-Authored-By: Claude <noreply@anthropic.com>
The test pinned a Pillow encoder flag rather than behavior we own; it
served its purpose during development.

Co-Authored-By: Claude <noreply@anthropic.com>
…thumbnails-serializer-direct-urls

Co-Authored-By: Claude <noreply@anthropic.com>
Same pass as the parent branch: keep the contract statements (prefetch
required, annotation required, strict width semantics) and drop the
review-era justification prose.

Co-Authored-By: Claude <noreply@anthropic.com>
…to feat/thumbnails-serializer-direct-urls

Co-Authored-By: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants