Description
Discussed in #1243
Originally posted by ttngu207 June 12, 2025
Problem Statement:
The current datajoint-python approach to job reservation, orchestration, and execution (i.e., autopopulate) faces scalability limitations. While its original design effectively handled job reservation and distribution for parallelization, it falls short when building a comprehensive data platform.
Limitations of the jobs table
The existing jobs table functions more as an error/reserve table than a true jobs queue.
- Limited Job Statuses: It primarily records error (failed jobs) and reserved (jobs in progress) states. It lacks crucial statuses such as:
  - pending/scheduled (jobs not yet started)
  - success (record of successfully completed jobs and their duration)
- Inefficient Job Queue: It doesn't operate as a true jobs queue where workers can efficiently pull tasks.
  - Each worker must individually call key_source to get a list of jobs, which, while ensuring up-to-date information, strains the database.
- Non-Queryable for Status: The table is not easily queryable for overall job status, hindering the development of dashboards, monitoring tools, and reporting.
Limitations of key_source Behavior/Usage
The default key_source (an inner-join of parent tables) is intended to represent all possible jobs for a given table.
- Frequent Modification Needed: In practice, the actual set of jobs of interest is often a subset of this, requiring frequent modifications to key_source (e.g., restricting by paramset or other tables).
- Local Visibility Only: Modified key_source settings are only visible to the local code executing the pipeline, not globally at the database level. This leads to:
  - Out-of-sync code and key_source definitions.
  - Lack of visibility and accessibility via "virtual modules."
  - The need to install the entire pipeline/codebase to run specific parts, increasing complexity for microservices in a platform like Works.
- Performance Bottleneck: (Table.key_source - Table).fetch('KEY') is DataJoint's method for retrieving the job queue and can be an expensive operation, especially when called frequently by multiple workers. This significantly strains the database server, as observed by other users.
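To illustrate the bottleneck described above, here is a minimal toy model in which plain Python sets stand in for database tables. It is not DataJoint code: `fetch_pending_keys`, `query_count`, and the example keys are hypothetical names introduced only for this sketch. It shows that when each worker recomputes the anti-join itself, the number of expensive queries grows with the number of workers, even though every worker gets the same answer.

```python
# Toy model: sets stand in for tables; none of these names are DataJoint API.
key_source = {("subj1", 1), ("subj1", 2), ("subj2", 1)}  # all possible jobs
computed = {("subj1", 1)}                                # keys already in the target table

query_count = 0  # counts how many times the anti-join is evaluated


def fetch_pending_keys():
    """Stand-in for (Table.key_source - Table).fetch('KEY'): an anti-join."""
    global query_count
    query_count += 1
    return sorted(key_source - computed)


# Each worker independently asks the "database" for the same pending list.
n_workers = 4
pending_per_worker = [fetch_pending_keys() for _ in range(n_workers)]
```

In a real deployment each of these calls would be a join plus anti-join on the database server, which is the cost the proposal aims to avoid by letting workers read a precomputed jobs table instead.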
Proposed implementation:
- The jobs tables will now be implemented as separate hidden tables for each computed table. By comparison, the jobs table is currently implemented at the schema level.
- The jobs tables will be accessed as schema.Table.jobs (as opposed to the current schema.jobs)
- The jobs tables will have the same primary key as their computed table. We will no longer rely on hashes to address the jobs.
- The jobs tables will form the same foreign keys as their computed tables form from their primary key, with cascaded delete.
- populate() will, by default, invoke self.jobs.refresh(), which (1) deletes all existing entries other than those with status in ('reserved', 'error') and (2) fills the key_source into the jobs table with (status=''), except for the keys that are already present in the
- populate() will accept the argument refresh_jobs=True. To skip the refresh step, set it to refresh_jobs=False.
- populate() will then use the jobs table for job reservations, changing the status to reserved and then removing the entry when the job is computed successfully, or setting the status to error if it fails. This part is the same as before.
- The jobs table will provide a new uint8 priority field, which can default to 3, for example.
- The default ordering may be (priority, job_date), but other orderings will now become possible.
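The proposed flow can be sketched with an in-memory SQLite table standing in for a per-table jobs table. This is only an illustration under stated assumptions, not the DataJoint implementation: the column names `subject_id` and `session_id`, and the helpers `refresh`, `reserve_next`, and `finish`, are hypothetical. The sketch also assumes that refresh skips keys that are already computed or already present in the jobs table, and that a lower priority value is served first.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
# Assumption: the jobs table shares the primary key of its computed table.
conn.execute("""
    CREATE TABLE jobs (
        subject_id INTEGER,
        session_id INTEGER,
        status     TEXT    NOT NULL DEFAULT '',
        priority   INTEGER NOT NULL DEFAULT 3,
        job_date   REAL    NOT NULL,
        PRIMARY KEY (subject_id, session_id)
    )
""")


def refresh(conn, key_source, computed):
    """(1) Drop entries that are neither reserved nor error; (2) add keys still to compute."""
    conn.execute("DELETE FROM jobs WHERE status NOT IN ('reserved', 'error')")
    for key in sorted(set(key_source) - set(computed)):
        # INSERT OR IGNORE leaves rows that survived step (1) untouched.
        conn.execute(
            "INSERT OR IGNORE INTO jobs (subject_id, session_id, job_date) VALUES (?, ?, ?)",
            (*key, time.time()),
        )


def reserve_next(conn):
    """Pick the next unreserved job ordered by (priority, job_date) and mark it reserved."""
    row = conn.execute(
        "SELECT subject_id, session_id FROM jobs WHERE status = '' "
        "ORDER BY priority, job_date, subject_id, session_id LIMIT 1"
    ).fetchone()
    if row is not None:
        conn.execute(
            "UPDATE jobs SET status = 'reserved' WHERE subject_id = ? AND session_id = ?", row
        )
    return row


def finish(conn, key, error=False):
    """Remove the entry on success; keep it with status 'error' on failure."""
    if error:
        conn.execute(
            "UPDATE jobs SET status = 'error' WHERE subject_id = ? AND session_id = ?", key
        )
    else:
        conn.execute("DELETE FROM jobs WHERE subject_id = ? AND session_id = ?", key)


key_source = [(1, 1), (1, 2), (2, 1)]
computed = [(1, 1)]
refresh(conn, key_source, computed)
conn.execute("UPDATE jobs SET priority = 1 WHERE subject_id = 2 AND session_id = 1")

first = reserve_next(conn)         # the job whose priority was raised
finish(conn, first)                # success: entry is removed
second = reserve_next(conn)
finish(conn, second, error=True)   # failure: entry stays with status 'error'

refresh(conn, key_source, computed + [first])
remaining = conn.execute("SELECT subject_id, session_id, status FROM jobs").fetchall()
```

The select-then-update in `reserve_next` is not atomic, so a real multi-worker implementation would need a transactional or otherwise race-free reservation step. The error entry survives the second refresh, which matches the rule that only entries outside ('reserved', 'error') are cleared.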
Notes
We have considered adopting and integrating with other industry standards for workflow orchestration such as Airflow, Flyte or Prefect, and have produced and evaluated multiple working prototypes.
However, we think that the additional deployment and maintenance burden of those tools is too much for a Python open-source project such as DataJoint: the enhanced features come with significant DevOps requirements.