Description
It was noticed during a test suite run that the backfill procedure would occasionally fail with confusing results, including chunks losing all entries, the uniqueness constraint of a chunk/hypertable being broken, or a compression job being rescheduled for -infinity.
What seems to happen is that a brand new database, created by our test runner, has not yet registered the Compression Policy worker.
The worker registration looks like this in the logs. There is a generic TimescaleDB scheduler worker for each database, which is then responsible for launching jobs.
2022-05-12 14:33:46.027 UTC [1] DEBUG: registering background worker "TimescaleDB Background Worker Scheduler"
2022-05-12 14:33:46.027 UTC [1] DEBUG: starting background worker process "TimescaleDB Background Worker Scheduler"
2022-05-12 14:33:46.048 UTC [627] DEBUG: database scheduler starting for database 18419
2022-05-12 14:33:46.049 UTC [627] DEBUG: launching job 1000 "Compression Policy [1000]"
2022-05-12 14:33:46.049 UTC [1] DEBUG: registering background worker "Compression Policy [1000]"
2022-05-12 14:33:46.049 UTC [1] DEBUG: starting background worker process "Compression Policy [1000]"
backfill.sql reschedules the compression job for given chunks before doing any operations.
But, when the compression policy job has not been created, the rescheduling does not take any action.
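One way to see whether a compression policy job is registered at all is to query the jobs view. This is just a sketch against the TimescaleDB 2.x `timescaledb_information.jobs` view; the hypertable name is a placeholder:

```sql
-- Sketch (TimescaleDB 2.x): check whether a compression policy job exists
-- for a given hypertable. 'my_hypertable' is a placeholder name.
SELECT job_id, proc_name
FROM timescaledb_information.jobs
WHERE proc_name = 'policy_compression'
  AND hypertable_name = 'my_hypertable';
-- No rows here means compression_job_id will be NULL in the code below,
-- and the reschedule is silently skipped.
```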
timescaledb-extras/backfill.sql
Lines 98 to 108 in 2358d75
IF compression_job_id IS NULL THEN
    old_time = NULL::timestamptz;
ELSE
    SELECT next_start INTO old_time FROM _timescaledb_internal.bgw_job_stat WHERE job_id = compression_job_id FOR UPDATE;
    IF version = 1 THEN
        PERFORM alter_job_schedule(compression_job_id, next_start=> new_time);
    ELSE
        PERFORM alter_job(compression_job_id, next_start=> new_time);
    END IF;
END IF;
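A defensive variant (a sketch only, not the repository's code) could detect the missing job-stat row instead of proceeding, so the caller learns that the scheduler hasn't picked the job up yet:

```sql
-- Hedged sketch: if the scheduler has not yet created a bgw_job_stat row
-- for this job, the SELECT ... FOR UPDATE locks nothing, and the reschedule
-- cannot serialize against the worker's startup. Raising makes that visible.
SELECT next_start INTO old_time
FROM _timescaledb_internal.bgw_job_stat
WHERE job_id = compression_job_id
FOR UPDATE;
IF NOT FOUND THEN
    RAISE EXCEPTION 'compression job % has no bgw_job_stat row yet; the background worker scheduler may not have started', compression_job_id;
END IF;
```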
The race condition occurs if the registration of the background worker happens after the attempt to reschedule. The worker can then run at the same time as the main part of decompress_backfill(), causing data corruption.
Annoyingly, we've been unable to reproduce this outside of our test suite, and I'm not sure whether it can arise outside of freshly created databases that don't have background workers yet.
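For test setups specifically, one workaround sketch is to wait for the scheduler to materialize the job's stat row before invoking decompress_backfill(). The job id 1000 is taken from the logs above and is an assumption; it is not verified against the repository:

```sql
-- Hedged workaround sketch: block until the scheduler has created the
-- bgw_job_stat row for the compression job (id 1000 in the logs above),
-- so the subsequent reschedule in decompress_backfill() actually applies.
DO $$
BEGIN
    WHILE NOT EXISTS (
        SELECT 1 FROM _timescaledb_internal.bgw_job_stat WHERE job_id = 1000
    ) LOOP
        PERFORM pg_sleep(0.1);  -- poll every 100 ms
    END LOOP;
END
$$;
```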