feat(exports): add taxa_list_csv export format #1293
Adds a new export format that emits one CSV row per unique taxon observed in a SourceImageCollection (with project default filters applied), instead of one row per occurrence. Columns cover taxon identity, the Linnaean rank hierarchy from parents_json, occurrence/score/date/time-of-night aggregations, and external IDs/links (GBIF, iNaturalist, BOLD, Fieldguide).

Time-of-night aggregations use a noon-anchored axis so that nightly windows straddling midnight aggregate correctly (avg of 22:00 and 02:00 is 00:00, not 12:00).

Includes two stubbed-but-inert hooks for follow-up work:
- _get_expected_taxa() returns an empty queryset today; once a project declares a taxonomic scope, this method will return the expected taxa set and the writer will emit zero-count rows for absent species (presence/absence checklist).
- Column shape is intentionally a superset of DwC Taxon-Core, so a sibling taxa_list_dwca format can ship later by reusing the same accumulator + Taxon fetch with an alternate writer.

Generic carry-over from PR #1131: BaseExporter gains a filename_label class attribute that DataExport.generate_filename inserts into the output filename so users can tell formats apart in their downloads folder. Existing exporters keep filename_label="" and produce the same filenames as before.

Co-Authored-By: Claude <noreply@anthropic.com>
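For readers who have not seen #1131, the `filename_label` carry-over can be pictured roughly as follows. This is only a sketch: the `generate_filename` function, its arguments, and the filename pattern are hypothetical stand-ins, not the actual `DataExport.generate_filename` implementation.

```python
from datetime import datetime


class BaseExporter:
    # An empty label means "no extra slug", so existing formats keep their filenames.
    filename_label = ""
    file_format = "csv"


class TaxaListCSVExporter(BaseExporter):
    filename_label = "taxa_list"


def generate_filename(project_slug: str, exporter: type, when: datetime) -> str:
    """Hypothetical stand-in for DataExport.generate_filename (pattern is assumed)."""
    parts = [project_slug, "export"]
    if exporter.filename_label:
        # Only formats that declare a label get the extra slug.
        parts.append(exporter.filename_label)
    parts.append(when.strftime("%Y%m%d_%H%M%S"))
    return "_".join(parts) + "." + exporter.file_format
```

The point of the empty-string default is that exporters which never set a label go through the same code path and still produce their old names.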
…t_csv

`BaseExporter.update_export_stats()` resets `record_count` from the source queryset's count after the file is written. For `taxa_list_csv`, that value is the filtered occurrence count (e.g. 92), not the unique-taxon row count that actually landed in the CSV (e.g. 13). The mismatch is misleading on the export listing UI and breaks the progress denominator.

Override `update_export_stats` in `TaxaListCSVExporter` to use the row counter we already track during the write loop. E2E against project 18 collection 176 now reports 13 rows for 13 unique taxa; existing 4 unit tests still pass.

Co-Authored-By: Claude <noreply@anthropic.com>
…e stats

If `export()` raises before the write loop completes (or before the final assignment), the overridden `update_export_stats` would raise an `AttributeError` when it reads `self._rows_written`. Set it to 0 in `__init__` for safety.

Co-Authored-By: Claude <noreply@anthropic.com>
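The two commits above boil down to a pattern like the one below. It is a minimal sketch with a hypothetical in-memory `BaseExporter` (the real one works on a Django queryset and a `DataExport` record); only the names `update_export_stats`, `record_count`, and `_rows_written` come from the commit messages.

```python
class BaseExporter:
    """Hypothetical stand-in: counts source rows, as the real base class is described to do."""

    def __init__(self, occurrences):
        self.occurrences = occurrences
        self.record_count = 0

    def update_export_stats(self):
        # Default behaviour: record_count follows the source (occurrence) count.
        self.record_count = len(self.occurrences)


class TaxaListCSVExporter(BaseExporter):
    def __init__(self, occurrences):
        super().__init__(occurrences)
        # Initialised up front so update_export_stats() is safe even if export() fails early.
        self._rows_written = 0

    def export(self):
        # One output row per unique taxon, not per occurrence.
        rows = [[taxon] for taxon in sorted({occ["taxon"] for occ in self.occurrences})]
        self._rows_written = len(rows)
        return rows

    def update_export_stats(self):
        # Report what actually landed in the file.
        self.record_count = self._rows_written
```

With three occurrences of two taxa, the base class would report 3 while the override reports 2, which mirrors the 92 vs 13 mismatch described above.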
Per-occurrence duration in seconds, computed from (last_appearance_timestamp - first_appearance_timestamp) on the existing with_timestamps annotation. Aggregated per-taxon as min/max/avg seconds. Single-detection occurrences (first == last) are skipped so the columns stay empty rather than dragging averages to zero. Will become more meaningful once detection tracking is wired up; today many occurrences are single-detection and end up blank, which matches expectations. Co-Authored-By: Claude <noreply@anthropic.com>
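A rough illustration of the skip-single-detection rule, assuming each occurrence is reduced to a `(first, last)` pair of datetimes. The function names and the dict shape are made up for this sketch; the exporter's real accumulator may look different.

```python
from datetime import datetime
from statistics import mean


def duration_seconds(first, last):
    """Return the appearance window in seconds, or None for a single detection."""
    if first is None or last is None or first == last:
        return None
    return (last - first).total_seconds()


def summarize_durations(windows):
    """Aggregate min/max/avg over occurrences that have a non-zero window."""
    durations = []
    for first, last in windows:
        seconds = duration_seconds(first, last)
        if seconds is not None:
            durations.append(seconds)
    if not durations:
        # Leave the CSV cells blank instead of reporting 0.
        return {"min_duration_seconds": "", "max_duration_seconds": "", "avg_duration_seconds": ""}
    return {
        "min_duration_seconds": min(durations),
        "max_duration_seconds": max(durations),
        "avg_duration_seconds": mean(durations),
    }
```

Dropping the zero-length windows before aggregating is what keeps a taxon with mostly single-detection occurrences from showing an average close to zero.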
…time,month}
Reshape the temporal aggregation columns to be more useful for phenology
analysis. Out: first/last/avg occurrence date and min/max/avg time of
night. In:
session_day_{min,max,median} # day of year, 1..366
session_time_{min,max,median} # noon-anchored clock seconds, midnight-OK
session_month_{min,max,mean} # calendar month, 1..12
Switching to median for day and time-of-night because the typical
ecological question is "when does this taxon usually fly?" — a median is
more robust to a single outlier night than a mean. Month uses mean
because the bucket is coarse and a fractional answer ("6.3 → mostly
mid-June, drifting to early July") reads better than a stepped median.
The accumulator now keeps the per-occurrence first_appearance lists for
day/time/month rather than running min/max/sum stats; medians need the
full set. Memory: O(occurrences-per-taxon) instead of O(1) per taxon,
which is fine: the total is bounded by what we'd already pull into
RAM via the streaming queryset.
Time-of-night still operates in noon-anchored seconds so midnight-spanning
windows survive for min/max/median (median of 22:00 + 02:00 = 00:00).
Updated the wraparound test to assert on `session_time_*`.
Co-Authored-By: Claude <noreply@anthropic.com>
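To make the list-based accumulator concrete, here is a small sketch of the idea. The class name, method names, and the one-decimal rounding are hypothetical, and it assumes the session day is simply the calendar day of year of the first-appearance timestamp, which may not match how the exporter assigns nights.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, median


class SessionAccumulator:
    """Keeps per-taxon lists so medians can be computed at write time."""

    def __init__(self):
        self.days = defaultdict(list)    # taxon id -> day-of-year values
        self.months = defaultdict(list)  # taxon id -> calendar months

    def add(self, taxon_id, first_appearance: datetime):
        self.days[taxon_id].append(first_appearance.timetuple().tm_yday)
        self.months[taxon_id].append(first_appearance.month)

    def summary(self, taxon_id):
        # Assumes at least one occurrence was added for this taxon.
        days = self.days[taxon_id]
        months = self.months[taxon_id]
        return {
            "session_day_min": min(days),
            "session_day_max": max(days),
            "session_day_median": median(days),
            "session_month_min": min(months),
            "session_month_max": max(months),
            "session_month_mean": round(mean(months), 1),
        }
```

Two June nights plus one early-July night give a median day that stays in June and a month mean of 6.3, which is the kind of fractional answer the commit message has in mind.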
How about we add the number of verified occurrences per taxon to this export? That information, combined with the min/max/avg confidence score from the classifier, should make it clearer which taxon summaries are more reliable than others.
Summary
New export format `taxa_list_csv`: emits one CSV row per unique `Taxon` observed in a user-selected `SourceImageCollection`, with full Linnaean hierarchy columns and per-taxon aggregations (occurrence count, score, session-day / session-time / session-month, per-occurrence duration) plus external-DB IDs/URLs (GBIF, iNat, BOLD, Fieldguide). Collection-scoped via the existing `OccurrenceCollectionFilter`; project default filters (score threshold + include/exclude taxa) are applied so output matches what users see in the taxa list view.

The aggregator and Taxon-fetch path are designed to also feed a future Darwin Core Taxon-Core archive variant (`taxa_list_dwca`) and a future "absence rows" mode driven by a project-level taxonomic scope.

Carried over a small generic optimization from #1131: a `filename_label` class attr on `BaseExporter` so each format can inject a slug (`taxa_list`) into its filename.

Reusable helpers added to models
A handful of helpers that started life inside the exporter were moved to the natural model so other code (API serializers, future DwC-A export, UI) can use them without duplicating logic:
- `Taxon.gbif_url` / `inat_url` / `bold_url` / `fieldguide_url`: properties returning the public URL for the taxon on each external database, or `None` when the underlying ID isn't set.
- `Taxon.linnaean_hierarchy(ranks=None)`: returns a `{lowercase-rank: taxon-name}` dict from `parents_json` + own rank. All requested rank keys are present; unfilled ranks are empty strings. Defaults to `DEFAULT_RANKS` (Kingdom..Species). Useful for any flat-rank rendering: CSV columns, DwC Taxon-Core, GBIF-style hierarchy fields.
- `Project.get_taxonomic_scope()`: stub returning `None` today. The real impl will return a queryset from a per-project default `TaxaList` (coming soon). Reference impl + a DwC-A "single root via lowest common ancestor" alternative are sketched in the docstring. The exporter consumes it now: when scope becomes non-`None`, absence rows light up automatically (`direct_occurrences_count = 0` placeholder rows for in-scope-but-unobserved taxa).

Phenology-friendly temporal columns
The temporal aggregations are session-anchored (a session = the monitoring night the occurrence was first seen on, via `first_appearance_timestamp`). The shape is built around the typical phenology question "when does this taxon usually fly?":

| Columns | Notes |
| --- | --- |
| `session_day_{min,max,median}` | day of year, 1..366 |
| `session_time_{min,max,median}` | `HH:MM:SS`, noon-anchored so midnight-spanning windows work |
| `session_month_{min,max,mean}` | calendar month, 1..12 (a mean such as `6.3` reads better than a stepped median) |
| `min/max/avg_duration_seconds` | per-occurrence duration in seconds |

Time-of-night aggregations are computed in a noon-anchored axis so midnight-spanning monitoring nights work correctly (median of 22:00 + 02:00 = 00:00, not 12:00 noon).
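The noon-anchored trick is easy to get wrong, so here is a minimal sketch of it. The function names are illustrative and not taken from the exporter; the idea is just to measure each time as seconds since the preceding noon, aggregate on that axis, then shift back to clock time.

```python
from datetime import time
from statistics import median

DAY = 24 * 3600
NOON = 12 * 3600


def to_noon_anchored(t: time) -> int:
    """Seconds elapsed since the most recent noon (12:00 -> 0, 00:00 -> 43200)."""
    seconds = t.hour * 3600 + t.minute * 60 + t.second
    return (seconds - NOON) % DAY


def from_noon_anchored(offset: float) -> time:
    """Convert a noon-anchored offset back to a wall-clock time."""
    seconds = (int(round(offset)) + NOON) % DAY
    return time(seconds // 3600, (seconds % 3600) // 60, seconds % 60)


def session_time_stats(times):
    """min/max/median on the noon-anchored axis, formatted as HH:MM:SS."""
    offsets = [to_noon_anchored(t) for t in times]
    return {
        "session_time_min": from_noon_anchored(min(offsets)).isoformat(),
        "session_time_max": from_noon_anchored(max(offsets)).isoformat(),
        "session_time_median": from_noon_anchored(median(offsets)).isoformat(),
    }
```

For the pair 22:00 and 02:00 this yields a minimum of 22:00:00, a maximum of 02:00:00, and a median of 00:00:00, whereas aggregating raw clock seconds would put the midpoint at noon.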
Output preview
E2E run against project 18 (Vermont Atlas of Life), collection 176 (363 images, full event 5297, 2022-06-18 to 2022-06-19). After `apply_default_filters`: 92 occurrences → 13 unique taxa → 13 CSV rows.

Columns (in CSV order):

| Column | Example | Source / notes |
| --- | --- | --- |
| `id` | 3839 | `Taxon.pk` |
| `name` | Xestia c-nigrum | |
| `display_name` | Xestia c-nigrum | |
| `rank` | SPECIES | `TaxonRank` enum |
| `common_name_en` | | |
| `kingdom` ... `species` | Lepidoptera/Noctuidae/Noctuinae/Noctuini/Xestia/Xestia c-nigrum | `Taxon.linnaean_hierarchy()` |
| `direct_occurrences_count` | 17 | |
| `recursive_occurrences_count` | | |
| `min_score` / `max_score` / `avg_score` | 0.5025 / 0.9073 / 0.6516 | |
| `session_day_min` / `_max` / `_median` | 169 / 170 / 169 | |
| `session_time_min` / `_max` / `_median` | 21:12:00 / 01:25:16 / 23:48:06 | |
| `session_month_min` / `_max` / `_mean` | 6 / 6 / 6.0 | |
| `min_duration_seconds` / `_max` / `_avg` | 2 / 6 / 4 (for `Scoparia biplagialis`) | |
| `gbif_taxon_key` / `gbif_url` | 5714973 / https://www.gbif.org/species/5714973 | `Taxon.gbif_url`, empty when ID missing |
| `inat_taxon_id` / `inat_url` | | `Taxon.inat_url` |
| `bold_taxon_bin` / `bold_url` | | `Taxon.bold_url` |
| `fieldguide_id` / `fieldguide_url` | | `Taxon.fieldguide_url` |
| `cover_image_url` | | |

Sample rows (lightly trimmed):
Wraparound check:
- `Xestia c-nigrum`: first seen at 21:12, last at 01:25 (next morning), median 23:48. Straddles midnight cleanly thanks to noon-anchored time aggregation.
- `Not Lepidoptera` (34 occurrences across two nights, days 169-170): median day 169, median time 22:46.

Duration is empty for single-detection occurrences (most current rows). When it shows up (`Agriphila vulgivagellus` 372s, `Scoparia biplagialis` 2-6s avg 4s), it's the `(last_appearance - first_appearance)` window already exposed via `with_timestamps()`. Once detection tracking lands, more occurrences will have multi-detection durations and these columns will populate.

Test plan
- `ami/exports/tests.py::TaxaListExportTest`: registry/scope/filename, score aggregation + threshold, external links + hierarchy, time-of-night wraparound (asserts on `session_time_*`).
- `ami.exports.tests` suite (23/23) and `ami.main.tests.TestTaxonomy` + `TestTaxonomyViews` (13/13) green after the model-migration refactor.
- Select `taxa_list_csv` from the export dropdown and confirm download (UI label already wired in `ui/src/data-services/models/export.ts`).

Future hooks (stubbed only)

- Once `Project.get_taxonomic_scope()` returns a non-`None` queryset (per-project default `TaxaList`, landing soon), the writer emits `direct_occurrences_count = 0` rows for unobserved in-scope taxa. v1 returns `None` so column shape is stable when this turns on.
- A `taxa_list_dwca` format can reuse the aggregator + Taxon fetch and emit a zip of `taxon.txt` + `meta.xml` + `eml.xml`. See PR #1131 (feat: add Darwin Core Archive (DwC-A) export format) for the surrounding DwC-A export. A "single-root via lowest common ancestor" scope shape is sketched in `Project.get_taxonomic_scope()`'s docstring as a sibling method (`get_taxonomic_scope_root()`) for that variant; different consumers want different shapes (enumerated set vs single root).
- `min/max/avg_duration_seconds` use the existing first/last-appearance window. When detection tracking is wired up these fields stay valid and become more meaningful (multi-detection, post-tracking) without schema churn.

Design doc
docs/claude/planning/2026-05-05-taxa-list-export-design.md
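As a footnote to the absence-row hook listed under future hooks, the intended behaviour can be sketched like this. Everything here is a simplification under stated assumptions: taxa are plain name strings, `scope` plays the role of whatever `Project.get_taxonomic_scope()` eventually returns, and `build_rows` is a hypothetical helper, not the exporter's writer.

```python
def build_rows(observed_counts, scope=None):
    """Build one row per observed taxon, plus zero-count rows for in-scope absentees.

    observed_counts: dict mapping taxon name -> direct occurrence count.
    scope: iterable of in-scope taxon names, or None when no scope is declared.
    """
    rows = [
        {"name": name, "direct_occurrences_count": count}
        for name, count in observed_counts.items()
    ]
    if scope is not None:
        for name in scope:
            if name not in observed_counts:
                # In scope but never observed: emit a placeholder row.
                rows.append({"name": name, "direct_occurrences_count": 0})
    return sorted(rows, key=lambda row: row["name"])
```

With `scope=None` the output is identical to today's, and the row dictionaries have the same keys either way, which is the "stable column shape" property the hook relies on.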