perf(occurrence-stats): scope agreement subqueries to verified set
Replace the .aggregate() over the full filtered queryset with a two-step
approach:
1. SQL Count('pk') for total_occurrences (no joins, no subqueries).
2. Fetch the verified set (occurrences with at least one non-withdrawn
   identification) with both best_user_taxon_id and
   best_machine_prediction_taxon_id annotated, then compute the bucket
   counts and the LCA in Python.
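
A minimal sketch of the Python-side bucketing in step 2, assuming the verified set has already been fetched as (best_user_taxon_id, best_machine_prediction_taxon_id) pairs. The bucket names, the `parents` child-to-parent map, and all three helper functions are hypothetical illustrations, not the code in this commit, and LCA is assumed here to mean the lowest common ancestor in the taxon tree.

```python
from collections import Counter


def ancestors(taxon_id, parents):
    """Return the chain from taxon_id up to the root, inclusive."""
    chain = []
    while taxon_id is not None:
        chain.append(taxon_id)
        taxon_id = parents.get(taxon_id)  # None once we pass the root
    return chain


def lowest_common_ancestor(a, b, parents):
    """Return the deepest taxon shared by both ancestor chains, or None."""
    seen = set(ancestors(a, parents))
    for taxon_id in ancestors(b, parents):
        if taxon_id in seen:
            return taxon_id
    return None


def bucket_verified(rows, parents):
    """Count agreement buckets over (user_taxon_id, machine_taxon_id) pairs."""
    counts = Counter()
    for user_taxon, machine_taxon in rows:
        if machine_taxon is None:
            counts["no_prediction"] += 1
        elif user_taxon == machine_taxon:
            counts["exact_match"] += 1
        elif lowest_common_ancestor(user_taxon, machine_taxon, parents) is not None:
            counts["shared_ancestor"] += 1
        else:
            counts["no_overlap"] += 1
    return counts
```

Because the loop only ever sees the verified rows, its cost scales with the number of identified occurrences rather than with the size of the filtered queryset.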
Why: the previous version evaluated two correlated subqueries (best user
identification + best machine prediction) on every row of the filtered
queryset. For typical projects, >95% of occurrences have no identification.
Those rows ran the user-ident subquery only to discover NULL, then ran
the (much more expensive) machine-prediction subquery on detections that
cannot contribute to any agreement bucket. Scoping the subqueries to the
verified set avoids that waste.
Bench (cold, cache invalidated):
Project                     Total    Verified  Pre     Post
P#85 SEC-SEQ                36,253   13,140    -       1.18s
P#20 BCI                    40,958    1,351    -       0.92s
P#84 Pennsylvania           18,407      251    -       0.56s
P#24 Atlantic Forestry       2,797      274    -       0.50s
P#18 Vermont                43,149       45    ~928ms  0.35s
P#23 Insectarium Montreal   20,393       74    -       0.43s
Warm via django-cachalot: 122–343ms across all projects.
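
To relate the bench figures to the scoping argument, the following sketch computes, per project, the share of rows in the verified set and the number of rows on which the two subqueries are no longer evaluated. The totals and verified counts are copied from the table above; the `BENCH` name and the `verified_share` helper are illustrative only.

```python
# (total, verified) per project, copied from the bench table in this message.
BENCH = {
    "P#85 SEC-SEQ": (36_253, 13_140),
    "P#20 BCI": (40_958, 1_351),
    "P#84 Pennsylvania": (18_407, 251),
    "P#24 Atlantic Forestry": (2_797, 274),
    "P#18 Vermont": (43_149, 45),
    "P#23 Insectarium Montreal": (20_393, 74),
}


def verified_share(total, verified):
    """Fraction of the filtered queryset that belongs to the verified set."""
    return verified / total


for name, (total, verified) in BENCH.items():
    skipped = total - verified  # rows that no longer run either subquery
    print(f"{name}: {verified_share(total, verified):.2%} verified, "
          f"{skipped:,} rows outside the verified set")
```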
For P#85 (highest absolute identification count in the system), the cost
is dominated by apply_default_filters' score-threshold join, not the
subqueries. apply_defaults=false actually runs faster (0.69s cold,
179,466 total / 13,140 verified) because the classification join is
skipped.
Co-Authored-By: Claude <noreply@anthropic.com>