
FEAT: Autopopulate 2.0 #1258

@dimitri-yatsenko

Description

Discussed in #1243

Originally posted by ttngu207 June 12, 2025

Problem Statement:

The current datajoint-python approach to job reservation, orchestration, and execution (i.e., autopopulate) faces scalability limitations. While the original design handled job reservation and distribution for parallelization effectively, it falls short when building a comprehensive data platform.

Limitations of the jobs table

The existing jobs table functions more as an error/reservation table than a true job queue (a sketch of its current usage follows the list below).

  • Limited Job Statuses: It primarily records error (failed jobs) and reserved (jobs in progress) states. It lacks crucial statuses such as:
    • pending/scheduled (jobs not yet started)
    • success (record of successfully completed jobs and their duration).
  • Inefficient Job Queue: It doesn't operate as a true jobs queue where workers can efficiently pull tasks.
    • Each worker must individually call key_source to get a list of jobs, which, while ensuring up-to-date information, strains the database.
  • Non-Queryable for Status: The table is not easily queryable for overall job status, hindering the development of dashboards, monitoring tools, and reporting.
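
For concreteness, here is a minimal sketch, assuming a hypothetical schema named my_pipeline, of how the current schema-level jobs table is queried. Only reserved and error entries ever appear in it:

```python
import datajoint as dj

schema = dj.schema('my_pipeline')  # hypothetical schema name

# The schema-level jobs table records only two statuses:
errors = schema.jobs & 'status = "error"'        # failed jobs
in_flight = schema.jobs & 'status = "reserved"'  # jobs in progress

# There are no pending/scheduled or success entries, so overall
# pipeline progress cannot be reported from this table alone.
print(errors.fetch('table_name', 'error_message'))
```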

Limitations of key_source Behavior/Usage

The default key_source (an inner-join of parent tables) is intended to represent all possible jobs for a given table.

  • Frequent Modification Needed: In practice, the actual set of jobs of interest is often a subset of this, requiring frequent modifications to key_source (e.g., restricting by paramset or other tables).
  • Local Visibility Only: Modified key_source settings are only visible to the local code executing the pipeline, not globally at the database level. This leads to:
    • Out-of-sync code and key_source definitions.
    • Lack of visibility and accessibility via "virtual modules."
    • The need to install the entire pipeline/codebase to run specific parts, increasing complexity for microservices in a platform like Works.
  • Performance Bottleneck: (Table.key_source - Table).fetch('KEY') is DataJoint's method for retrieving the job queue, and it can be an expensive operation, especially when called frequently by multiple workers. This significantly strains the database server, as other users have observed. A sketch of this pattern follows below.
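
As a minimal sketch of both issues, assuming hypothetical tables Recording, ParamSet, and Analysis in the hypothetical my_pipeline schema, a locally overridden key_source and the polling pattern look like this:

```python
import datajoint as dj

schema = dj.schema('my_pipeline')  # hypothetical schema name

@schema
class Recording(dj.Manual):
    definition = """
    recording_id : int
    """

@schema
class ParamSet(dj.Manual):
    definition = """
    paramset_id : int
    """

@schema
class Analysis(dj.Computed):
    definition = """
    -> Recording
    -> ParamSet
    ---
    result : longblob
    """

    # Local-only customization: restrict the default inner join of the
    # parents to the subset of interest. The database has no record of
    # this override; only code that imports this class sees it.
    @property
    def key_source(self):
        return (Recording * ParamSet) & 'paramset_id = 1'

    def make(self, key):
        self.insert1(dict(key, result=[]))  # placeholder computation

# Each worker repeats this potentially expensive query to find
# remaining work, straining the database as worker count grows:
pending_keys = (Analysis.key_source - Analysis).fetch('KEY')
```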

Proposed implementation:

  1. The jobs tables will be implemented as separate hidden tables, one per computed table. (Currently, a single jobs table is implemented at the schema level.)
  2. The jobs tables will be accessed as schema.Table.jobs (as opposed to the current schema.jobs).
  3. The jobs tables will have the same primary key as their computed tables; jobs will no longer be addressed by key hashes.
  4. The jobs tables will form the same primary-key foreign keys as their computed tables, with cascaded deletes.
  5. populate() will, by default, invoke self.jobs.refresh(), which (1) deletes all existing entries except those with status in ('reserved', 'error') and (2) fills the key_source into the jobs table with status='', skipping keys already present in the jobs table.
  6. populate() will accept a refresh_jobs argument, True by default; set refresh_jobs=False to skip the refresh step.
  7. populate() will then use the jobs table for job reservation: setting the status to reserved, removing the entry when the computation succeeds, or setting the status to error on failure. This part is the same as before.
  8. The jobs table will provide a new uint8 priority field, which could default to 3, for example.
  9. The default ordering may be (priority, job_date), but other orderings will now become possible. A sketch of the proposed interface follows this list.
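
Under the proposed design, usage might look like the following sketch, reusing the hypothetical Analysis table from above. All names here (jobs, refresh(), refresh_jobs, status, priority) are proposals from the list, not existing APIs:

```python
# Hypothetical usage of the proposed per-table jobs queue.

Analysis.jobs.refresh()                # prune completed/stale entries and
                                       # insert new key_source keys as pending
Analysis.populate()                    # refresh_jobs=True by default
Analysis.populate(refresh_jobs=False)  # skip the refresh step

# The jobs table shares the computed table's primary key and is directly
# queryable, e.g. for dashboards, monitoring, and reporting:
Analysis.jobs & 'status = "error"'
Analysis.jobs & 'priority <= 3'        # uint8 priority field, default 3
```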

Notes

We have considered adopting and integrating with other industry standards for workflow orchestration, such as Airflow, Flyte, or Prefect, and have produced and evaluated multiple working prototypes.

However, we think that the burden of deploying and maintaining those tools is too great for a Python open-source project such as DataJoint: the enhanced features come with significant DevOps requirements.
