feat(exports): add taxa_list_csv export format by mihow · Pull Request #1293 · RolnickLab/antenna

mihow · 2026-05-05T17:57:33Z

Summary

New export format taxa_list_csv — emits one CSV row per unique Taxon observed in a user-selected SourceImageCollection, with full Linnaean hierarchy columns and per-taxon aggregations (occurrence count, score, session-day / session-time / session-month, per-occurrence duration) plus external-DB IDs/URLs (GBIF, iNat, BOLD, Fieldguide). Collection-scoped via the existing OccurrenceCollectionFilter; project default filters (score threshold + include/exclude taxa) are applied so output matches what users see in the taxa list view.

The aggregator and Taxon-fetch path are designed to also feed a future Darwin Core Taxon-Core archive variant (taxa_list_dwca) and a future "absence rows" mode driven by a project-level taxonomic scope.

Carried over a small generic optimization from #1131: a filename_label class attr on BaseExporter so each format can inject a slug (taxa_list) into its filename.
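A minimal sketch of how a `filename_label` class attribute could feed into a filename, for readers who have not seen #1131. The class names, the `generate_filename` signature, and the `-` separator are all assumptions made for illustration; the real `DataExport.generate_filename` may build its filenames differently.

```python
class BaseExporterSketch:
    # Hypothetical stand-in for BaseExporter. An empty label keeps the
    # filename identical to what existing formats already produce.
    filename_label = ""


class TaxaListCSVExporterSketch(BaseExporterSketch):
    # Slug injected into the filename for this format.
    filename_label = "taxa_list"


def generate_filename(exporter_cls, stem, extension):
    """Join the stem and the optional label (assumed '-' separator)."""
    parts = [stem]
    if exporter_cls.filename_label:
        parts.append(exporter_cls.filename_label)
    return "-".join(parts) + "." + extension
```

With an empty label the stem passes through unchanged, which is the property that keeps existing export filenames stable.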

Reusable helpers added to models

A handful of helpers that started life inside the exporter were moved to the natural model so other code (API serializers, future DwC-A export, UI) can use them without duplicating logic:

Taxon.gbif_url / inat_url / bold_url / fieldguide_url — properties returning the public URL for the taxon on each external database, or None when the underlying ID isn't set.
Taxon.linnaean_hierarchy(ranks=None) — returns {lowercase-rank: taxon-name} dict from parents_json + own rank. All requested rank keys present; unfilled ranks are empty string. Defaults to DEFAULT_RANKS (Kingdom..Species). Useful for any flat-rank rendering — CSV columns, DwC Taxon-Core, GBIF-style hierarchy fields.
Project.get_taxonomic_scope() — stub returning None today. The real impl will return a queryset from a per-project default TaxaList (coming soon). Reference impl + a DwC-A "single root via lowest common ancestor" alternative are sketched in the docstring. The exporter consumes it now: when scope becomes non-None, absence rows light up automatically (direct_occurrences_count = 0 placeholder rows for in-scope-but-unobserved taxa).
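The behaviour described for the first two helpers can be sketched as plain functions. This is an illustration under assumptions, not the model code: `parents` is assumed to be a list of `{"rank", "name"}` dicts standing in for `parents_json`, and `SKETCH_RANKS` is a hypothetical rank list that may not match the real `DEFAULT_RANKS`. The GBIF URL pattern is the one shown in the output preview.

```python
# Hypothetical rank list for this sketch; the real DEFAULT_RANKS may differ.
SKETCH_RANKS = ("KINGDOM", "PHYLUM", "CLASS", "ORDER", "FAMILY", "GENUS", "SPECIES")


def linnaean_hierarchy(parents, own_rank, own_name, ranks=None):
    """Return {lowercase-rank: name}; every requested rank is present."""
    ranks = SKETCH_RANKS if ranks is None else ranks
    # Start with every requested rank set to an empty string.
    hierarchy = {rank.lower(): "" for rank in ranks}
    for parent in parents:
        key = parent["rank"].lower()
        if key in hierarchy:
            hierarchy[key] = parent["name"]
    # The taxon's own rank is filled from its own name.
    own_key = own_rank.lower()
    if own_key in hierarchy:
        hierarchy[own_key] = own_name
    return hierarchy


def gbif_url(gbif_taxon_key):
    """Public GBIF URL, or None when the ID is not set."""
    if gbif_taxon_key is None:
        return None
    return f"https://www.gbif.org/species/{gbif_taxon_key}"
```

Because every requested key is always present, a CSV writer can use the dict directly without per-row checks for missing ranks.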

Phenology-friendly temporal columns

The temporal aggregations are session-anchored (a session = the monitoring night the occurrence was first seen on, via first_appearance_timestamp). The shape is built around the typical phenology question "when does this taxon usually fly?":

Column group	Unit	Aggregation
`session_day_{min,max,median}`	day of year, 1-366	median (robust to outlier nights)
`session_time_{min,max,median}`	clock `HH:MM:SS`, noon-anchored so midnight-spanning windows work	median
`session_month_{min,max,mean}`	calendar month, 1-12	mean (fractional `6.3` reads better than a stepped median)
`min/max/avg_duration_seconds`	seconds	mean

Time-of-night aggregations are computed in a noon-anchored axis so midnight-spanning monitoring nights work correctly (median of 22:00 + 02:00 = 00:00, not 12:00 noon).
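The noon-anchored axis can be sketched as follows. This is a minimal illustration of the idea, not the exporter's code; the function names are hypothetical. Each clock time is converted to seconds elapsed since the preceding noon, the statistic is taken on that axis, and the result is converted back to a clock time.

```python
from datetime import time
from statistics import median

SECONDS_PER_DAY = 24 * 60 * 60
NOON = 12 * 60 * 60


def to_noon_anchored(t):
    """Seconds elapsed since the most recent noon (0..86399)."""
    seconds = t.hour * 3600 + t.minute * 60 + t.second
    return (seconds - NOON) % SECONDS_PER_DAY


def from_noon_anchored(offset):
    """Convert a noon-anchored offset back to a clock time."""
    seconds = (int(round(offset)) + NOON) % SECONDS_PER_DAY
    return time(seconds // 3600, (seconds % 3600) // 60, seconds % 60)


def session_time_stats(times):
    """min/max/median on the noon-anchored axis; needs a non-empty list."""
    offsets = [to_noon_anchored(t) for t in times]
    return {
        "min": from_noon_anchored(min(offsets)),
        "max": from_noon_anchored(max(offsets)),
        "median": from_noon_anchored(median(offsets)),
    }
```

For the inputs 22:00 and 02:00 the offsets are 36000 and 50400 seconds, so the median offset is 43200, which maps back to 00:00. On the noon-anchored axis 22:00 also sorts before 02:00, so min and max come out in the order a monitoring night actually runs.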

Output preview

E2E run against project 18 (Vermont Atlas of Life), collection 176 (363 images, full event 5297, 2022-06-18 to 2022-06-19). After apply_default_filters: 92 occurrences → 13 unique taxa → 13 CSV rows.

Columns (in CSV order):

Column	Sample value	Notes
`id`	`3839`	Antenna `Taxon.pk`
`name`	`Xestia c-nigrum`	scientificName
`display_name`	`Xestia c-nigrum`	UI-formatted
`rank`	`SPECIES`	`TaxonRank` enum
`common_name_en`	(empty)	when set
`kingdom` ... `species`	`Lepidoptera` / `Noctuidae` / `Noctuinae` / `Noctuini` / `Xestia` / `Xestia c-nigrum`	from `Taxon.linnaean_hierarchy()`
`direct_occurrences_count`	`17`	clearly named to leave room for a future `recursive_occurrences_count`
`min_score` / `max_score` / `avg_score`	`0.5025` / `0.9073` / `0.6516`	classification confidence
`session_day_min` / `_max` / `_median`	`169` / `170` / `169`	day of year (June 18 = 169, June 19 = 170)
`session_time_min` / `_max` / `_median`	`21:12:00` / `01:25:16` / `23:48:06`	noon-anchored median → spans midnight correctly
`session_month_min` / `_max` / `_mean`	`6` / `6` / `6.0`	June
`min_duration_seconds` / `max_duration_seconds` / `avg_duration_seconds`	(empty for single-detection occurrences); `2` / `6` / `4` for `Scoparia biplagialis`	becomes meaningful once detection tracking is wired up
`gbif_taxon_key` / `gbif_url`	`5714973` / `https://www.gbif.org/species/5714973`	from `Taxon.gbif_url`, empty when ID missing
`inat_taxon_id` / `inat_url`	(empty)	from `Taxon.inat_url`
`bold_taxon_bin` / `bold_url`	(empty)	from `Taxon.bold_url`
`fieldguide_id` / `fieldguide_url`	(empty)	from `Taxon.fieldguide_url`
`cover_image_url`	(empty)	populated when set on Taxon

Sample rows (lightly trimmed):

id,name,rank,order,family,genus,species,direct_occurrences_count,min_score,max_score,avg_score,session_day_min,session_day_max,session_day_median,session_time_min,session_time_max,session_time_median,session_month_min,session_month_max,session_month_mean,min_duration_seconds,max_duration_seconds,avg_duration_seconds,gbif_taxon_key
6135,Agriphila vulgivagellus,SPECIES,Lepidoptera,Crambidae,Agriphila,Agriphila vulgivagellus,1,0.5545,0.5545,0.5545,169,169,169,22:42:20,22:42:20,22:42:20,6,6,6.0,372,372,372,1878937
6,Campaea perlata,SPECIES,Lepidoptera,Geometridae,Campaea,Campaea perlata,2,0.7378,0.8981,0.8180,169,169,169,23:51:41,23:51:43,23:51:42,6,6,6.0,,,,
11613,Not Lepidoptera,ORDER,Not Lepidoptera,,,,34,0.5002,0.9580,0.8986,169,170,169,22:10:48,04:37:58,22:46:21,6,6,6.0,,,,
6935,Scoparia biplagialis,SPECIES,Lepidoptera,Crambidae,Scoparia,Scoparia biplagialis,15,0.6337,0.9183,0.7751,169,170,169,21:45:57,00:44:38,23:36:03,6,6,6.0,2,6,4,5126963
3839,Xestia c-nigrum,SPECIES,Lepidoptera,Noctuidae,Xestia,Xestia c-nigrum,17,0.5025,0.9073,0.6516,169,170,169,21:12:00,01:25:16,23:48:06,6,6,6.0,,,,5714973

Wraparound check: Xestia c-nigrum first seen at 21:12, last at 01:25 (next morning), median 23:48 — straddles midnight cleanly thanks to noon-anchored time aggregation. Not Lepidoptera (34 occurrences across two nights, days 169-170): median day 169, median time 22:46.

Duration is empty for single-detection occurrences (most current rows). When it shows up — Agriphila vulgivagellus 372s, Scoparia biplagialis 2-6s avg 4s — it's the (last_appearance - first_appearance) window already exposed via with_timestamps(). Once detection tracking lands, more occurrences will have multi-detection durations and these columns will populate.
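A sketch of the duration rule described above, assuming each occurrence is represented as a `(first_appearance, last_appearance)` pair of datetimes. The function names are hypothetical; in the exporter the two timestamps come from the `with_timestamps()` annotation.

```python
from statistics import mean


def duration_seconds(first, last):
    """Window length in seconds, or None for single-detection occurrences."""
    if first is None or last is None:
        return None
    seconds = (last - first).total_seconds()
    # first == last means a single detection: skip it rather than count 0.
    return seconds if seconds > 0 else None


def aggregate_durations(windows):
    """min/max/avg over occurrences that have a non-zero window."""
    durations = [
        d for d in (duration_seconds(f, l) for f, l in windows) if d is not None
    ]
    if not durations:
        # All occurrences were single-detection: leave the CSV cells empty.
        return {
            "min_duration_seconds": "",
            "max_duration_seconds": "",
            "avg_duration_seconds": "",
        }
    return {
        "min_duration_seconds": min(durations),
        "max_duration_seconds": max(durations),
        "avg_duration_seconds": mean(durations),
    }
```

Skipping zero-length windows is what keeps the average from being dragged toward zero while most occurrences still have a single detection.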

Test plan

4 unit tests in ami/exports/tests.py::TaxaListExportTest — registry/scope/filename, score aggregation + threshold, external links + hierarchy, time-of-night wraparound (asserts on session_time_*).
Full ami.exports.tests suite (23/23) and ami.main.tests.TestTaxonomy + TestTaxonomyViews (13/13) green after the model-migration refactor.
E2E export against real project 18 data (collection 176): 13 rows, columns + values verified per preview above.
UI smoke: trigger taxa_list_csv from the export dropdown and confirm download (UI label already wired in ui/src/data-services/models/export.ts).

Future hooks (stubbed only)

Absence rows. When Project.get_taxonomic_scope() returns a non-None queryset (per-project default TaxaList, landing soon), the writer emits direct_occurrences_count = 0 rows for unobserved in-scope taxa. v1 returns None so column shape is stable when this turns on.
DwC Taxon-Core archive variant. Columns are intentionally a superset of DwC Taxon-Core terms so a sibling taxa_list_dwca format can reuse the aggregator + Taxon fetch and emit taxon.txt + meta.xml + eml.xml zip. See PR feat: add Darwin Core Archive (DwC-A) export format #1131 for the surrounding DwC-A export. A "single-root via lowest common ancestor" scope shape is sketched in Project.get_taxonomic_scope()'s docstring as a sibling method (get_taxonomic_scope_root()) for that variant — different consumers want different shapes (enumerated set vs single root).
Tracking-derived duration. min/max/avg_duration_seconds use the existing first/last-appearance window. When detection tracking is wired up these fields stay valid and become more meaningful (multi-detection, post-tracking) without schema churn.
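The absence-row hook can be sketched with plain Python containers. This assumes observed taxa are held as a `{taxon_id: count}` dict and that the scope is either `None` or an iterable of taxon IDs; `build_rows` is a hypothetical name, and the real writer works with querysets and emits many more columns.

```python
def build_rows(observed_counts, scope_taxon_ids=None):
    """One row per observed taxon, plus zero-count rows for unobserved
    in-scope taxa when a scope is provided."""
    rows = [
        {"id": taxon_id, "direct_occurrences_count": count}
        for taxon_id, count in observed_counts.items()
    ]
    if scope_taxon_ids is not None:
        for taxon_id in scope_taxon_ids:
            if taxon_id not in observed_counts:
                # Placeholder row: in scope but never observed.
                rows.append({"id": taxon_id, "direct_occurrences_count": 0})
    return rows
```

With `scope_taxon_ids=None` the output is identical to the v1 behaviour, so the column shape does not change when a scope becomes available.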

Design doc

docs/claude/planning/2026-05-05-taxa-list-export-design.md

Adds a new export format that emits one CSV row per unique taxon observed in a SourceImageCollection (with project default filters applied), instead of one row per occurrence. Columns cover taxon identity, the Linnaean rank hierarchy from parents_json, occurrence/score/date/time-of-night aggregations, and external IDs/links (GBIF, iNaturalist, BOLD, Fieldguide). Time-of-night aggregations use a noon-anchored axis so that nightly windows straddling midnight aggregate correctly (avg of 22:00 and 02:00 is 00:00, not 12:00).

Includes two stubbed-but-inert hooks for follow-up work:

- _get_expected_taxa() returns an empty queryset today; once a project declares a taxonomic scope, this method will return the expected taxa set and the writer will emit zero-count rows for absent species (presence/absence checklist).
- Column shape is intentionally a superset of DwC Taxon-Core, so a sibling taxa_list_dwca format can ship later by reusing the same accumulator + Taxon fetch with an alternate writer.

Generic carry-over from PR #1131: BaseExporter gains a filename_label class attribute that DataExport.generate_filename inserts into the output filename so users can tell formats apart in their downloads folder. Existing exporters keep filename_label="" and produce the same filenames as before.

Co-Authored-By: Claude <noreply@anthropic.com>

…t_csv

`BaseExporter.update_export_stats()` resets `record_count` from the source queryset's count after the file is written. For `taxa_list_csv`, that value is the filtered occurrence count (e.g. 92), not the unique-taxon row count that actually landed in the CSV (e.g. 13). The mismatch is misleading on the export listing UI and breaks the progress denominator.

Override `update_export_stats` in `TaxaListCSVExporter` to use the row counter we already track during the write loop. E2E against project 18 collection 176 now reports 13 rows for 13 unique taxa; existing 4 unit tests still pass.

Co-Authored-By: Claude <noreply@anthropic.com>

…e stats

If `export()` raises before the write loop completes (or before the final assignment), the overridden `update_export_stats` would raise an AttributeError when it reads `self._rows_written`. Set it to 0 in `__init__` for safety.

Co-Authored-By: Claude <noreply@anthropic.com>
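The two fixes above can be illustrated together with stand-in classes. This is a sketch only: the class names, constructor arguments, and `export` body are hypothetical, and the real `BaseExporter` derives its count from a queryset rather than an integer.

```python
class BaseExporterSketch:
    """Stand-in base: record_count comes from the source row count."""

    def __init__(self, source_count):
        self.source_count = source_count
        self.record_count = 0

    def update_export_stats(self):
        self.record_count = self.source_count


class TaxaListExporterSketch(BaseExporterSketch):
    def __init__(self, source_count):
        super().__init__(source_count)
        # Initialised up front so update_export_stats is safe even if
        # export() fails before writing anything.
        self._rows_written = 0

    def export(self, occurrence_taxon_ids):
        # One output row per unique taxon, not per occurrence.
        for _taxon_id in sorted(set(occurrence_taxon_ids)):
            self._rows_written += 1

    def update_export_stats(self):
        # Report rows actually written instead of the occurrence count.
        self.record_count = self._rows_written
```

Calling `update_export_stats()` on an instance whose `export()` never ran reports 0 rows instead of raising.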

netlify · 2026-05-05T17:57:39Z

✅ Deploy Preview for antenna-ssec canceled.

Name	Link
🔨 Latest commit	`514d3eb`
🔍 Latest deploy log	https://app.netlify.com/projects/antenna-ssec/deploys/69fa37735d95ec0007bfcf18

netlify · 2026-05-05T17:57:39Z

✅ Deploy Preview for antenna-preview canceled.

Name	Link
🔨 Latest commit	`514d3eb`
🔍 Latest deploy log	https://app.netlify.com/projects/antenna-preview/deploys/69fa3773b5405200081365af

coderabbitai · 2026-05-05T17:57:41Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 88bf0132-6cac-44f4-b2a3-4fb25f6b06e4

Per-occurrence duration in seconds, computed from (last_appearance_timestamp - first_appearance_timestamp) on the existing with_timestamps annotation. Aggregated per-taxon as min/max/avg seconds. Single-detection occurrences (first == last) are skipped so the columns stay empty rather than dragging averages to zero. Will become more meaningful once detection tracking is wired up; today many occurrences are single-detection and end up blank, which matches expectations. Co-Authored-By: Claude <noreply@anthropic.com>

…time,month}

Reshape the temporal aggregation columns to be more useful for phenology analysis.

Out: first/last/avg occurrence date and min/max/avg time of night.

In:
  session_day_{min,max,median}    # day of year, 1..366
  session_time_{min,max,median}   # noon-anchored clock seconds, midnight-OK
  session_month_{min,max,mean}    # calendar month, 1..12

Switching to median for day and time-of-night because the typical ecological question is "when does this taxon usually fly?", and a median is more robust to a single outlier night than a mean. Month uses mean because the bucket is coarse and a fractional answer ("6.3 → mostly mid-June, drifting to early July") reads better than a stepped median.

The accumulator now keeps the per-occurrence first_appearance lists for day/time/month rather than running min/max/sum stats; medians need the full set. Memory: O(occurrences-per-taxon) instead of O(1) per taxon, which is fine: it is bounded by what we'd already pull into RAM via the streaming queryset.

Time-of-night still operates in noon-anchored seconds so midnight-spanning windows survive for min/max/median (median of 22:00 + 02:00 = 00:00). Updated the wraparound test to assert on `session_time_*`.

Co-Authored-By: Claude <noreply@anthropic.com>
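A sketch of a list-keeping accumulator for the day and month columns, to show why the full per-taxon lists are needed. The class name is hypothetical, and the sketch assumes the day of year is taken from the calendar date of `first_appearance_timestamp`; the time-of-night columns are left out.

```python
from collections import defaultdict
from statistics import mean, median


class SessionAccumulator:
    """Keeps every per-occurrence value so medians can be computed."""

    def __init__(self):
        self._days = defaultdict(list)
        self._months = defaultdict(list)

    def add(self, taxon_id, first_appearance):
        self._days[taxon_id].append(first_appearance.timetuple().tm_yday)
        self._months[taxon_id].append(first_appearance.month)

    def summarize(self, taxon_id):
        days = self._days.get(taxon_id, [])
        months = self._months.get(taxon_id, [])
        if not days:
            return None
        return {
            "session_day_min": min(days),
            "session_day_max": max(days),
            "session_day_median": median(days),
            "session_month_min": min(months),
            "session_month_max": max(months),
            # Mean gives a fractional month, rounded for display.
            "session_month_mean": round(float(mean(months)), 1),
        }
```

A running min/max/sum would be enough for the mean, but not for the median, which is why memory grows with the number of occurrences per taxon.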

mihow · 2026-06-04T20:37:31Z

How about we add the number of verified occurrences per taxon to this export? That information, combined with the min/max/avg confidence score from the classifier, should make it clearer which taxon summaries are more reliable than others.

mihow and others added 3 commits May 5, 2026 10:46

mihow and others added 2 commits May 5, 2026 11:08

feat(exports): add taxa_list_csv export format #1293
mihow wants to merge 5 commits into main from feat/taxa-export-format