Description
It was noticed during a test suite run that the backfill procedure would occasionally fail with confusing results, including chunks losing all entries, the uniqueness constraint of a chunk/hypertable being broken, or a compression job being rescheduled for -infinity.
What seems to happen is that a brand new database, created by our test runner, has not yet registered the Compression Policy worker.
The worker registration looks like this in the logs. There is a generic TimescaleDB scheduler worker for each database, which is then responsible for launching jobs.
2022-05-12 14:33:46.027 UTC [1] DEBUG: registering background worker "TimescaleDB Background Worker Scheduler"
2022-05-12 14:33:46.027 UTC [1] DEBUG: starting background worker process "TimescaleDB Background Worker Scheduler"
2022-05-12 14:33:46.048 UTC [627] DEBUG: database scheduler starting for database 18419
2022-05-12 14:33:46.049 UTC [627] DEBUG: launching job 1000 "Compression Policy [1000]"
2022-05-12 14:33:46.049 UTC [1] DEBUG: registering background worker "Compression Policy [1000]"
2022-05-12 14:33:46.049 UTC [1] DEBUG: starting background worker process "Compression Policy [1000]"
backfill.sql reschedules the compression job for given chunks before doing any operations.
But, when the compression policy job has not been created, the rescheduling does not take any action.
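One way to see whether a compression policy job is registered at all is to query the jobs view. This is just a sketch against the TimescaleDB 2.x `timescaledb_information.jobs` view; the hypertable name is a placeholder:

```sql
-- Sketch (TimescaleDB 2.x): check whether a compression policy job exists
-- for a given hypertable. 'my_hypertable' is a placeholder name.
SELECT job_id, proc_name
FROM timescaledb_information.jobs
WHERE proc_name = 'policy_compression'
  AND hypertable_name = 'my_hypertable';
-- No rows here means compression_job_id will be NULL in the code below,
-- and the reschedule is silently skipped.
```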
timescaledb-extras/backfill.sql
Lines 98 to 108 in 2358d75
IF compression_job_id IS NULL THEN
    old_time = NULL::timestamptz;
ELSE
    SELECT next_start INTO old_time FROM _timescaledb_internal.bgw_job_stat WHERE job_id = compression_job_id FOR UPDATE;
    IF version = 1 THEN
        PERFORM alter_job_schedule(compression_job_id, next_start=> new_time);
    ELSE
        PERFORM alter_job(compression_job_id, next_start=> new_time);
    END IF;
END IF;
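A defensive variant (a sketch only, not the repository's code) could detect the missing job-stat row instead of proceeding, so the caller learns that the scheduler hasn't picked the job up yet:

```sql
-- Hedged sketch: if the scheduler has not yet created a bgw_job_stat row
-- for this job, the SELECT ... FOR UPDATE locks nothing, and the reschedule
-- cannot serialize against the worker's startup. Raising makes that visible.
SELECT next_start INTO old_time
FROM _timescaledb_internal.bgw_job_stat
WHERE job_id = compression_job_id
FOR UPDATE;
IF NOT FOUND THEN
    RAISE EXCEPTION 'compression job % has no bgw_job_stat row yet; the background worker scheduler may not have started', compression_job_id;
END IF;
```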
The race condition occurs if the registration of the background worker happens after the attempt to reschedule. The worker can then run at the same time as the main part of decompress_backfill(), causing data corruption.
Annoyingly, we've been unable to reproduce this outside of our test suite, and I'm not sure whether it can arise outside of freshly created databases that don't have background workers yet.
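For test setups specifically, one workaround sketch is to wait for the scheduler to materialize the job's stat row before invoking decompress_backfill(). The job id 1000 is taken from the logs above and is an assumption; it is not verified against the repository:

```sql
-- Hedged workaround sketch: block until the scheduler has created the
-- bgw_job_stat row for the compression job (id 1000 in the logs above),
-- so the subsequent reschedule in decompress_backfill() actually applies.
DO $$
BEGIN
    WHILE NOT EXISTS (
        SELECT 1 FROM _timescaledb_internal.bgw_job_stat WHERE job_id = 1000
    ) LOOP
        PERFORM pg_sleep(0.1);  -- poll every 100 ms
    END LOOP;
END
$$;
```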