Skip to content

Conversation

@bettercallok
Copy link

Closes #11572
The Solr updater has been experiencing reliability issues in recent weeks:

  • Missing many changes (updates not being indexed)
  • Updates taking longer than expected when manually triggered
  • Increased database load and potential timeouts

Root Cause

A cache clearing bug introduced in June 2021 (commit c24b3e7) was clearing the data provider cache after every 100-key batch instead of after each iteration. This caused:

  • Redundant database queries for the same documents
  • Redundant Archive.org API calls
  • 70-90% unnecessary database load
  • Slower processing leading to timeouts and missed updates

Solution

Moved data_provider.clear_cache() from inside the batch loop to after each iteration in the main event loop:

  • Cache now persists across all batches within a single [update_keys()]
  • Cache is cleared after processing all keys from one iteration
  • Prevents unbounded cache growth while maximizing cache reuse

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Solr updater unreliable in recent weeks

2 participants