
FEAT: Autopopulate 2.0 #1258

@dimitri-yatsenko

Description

Discussed in #1243

Originally posted by ttngu207 June 12, 2025

Problem Statement:

The current datajoint-python approach to job reservation, orchestration, and execution (i.e., autopopulate) faces scalability limitations. While the original design handled job reservation and distribution for parallelization effectively, it falls short when building a comprehensive data platform.

Limitations of the jobs table

The existing jobs table functions more as an error/reservation table than a true job queue (a sketch of its current usage follows the list below).

  • Limited Job Statuses: It primarily records error (failed jobs) and reserved (jobs in progress) states. It lacks crucial statuses such as:
    • pending/scheduled (jobs not yet started)
    • success (record of successfully completed jobs and their duration).
  • Inefficient Job Queue: It doesn't operate as a true jobs queue where workers can efficiently pull tasks.
    • Each worker must individually call key_source to get a list of jobs, which, while ensuring up-to-date information, strains the database.
  • Non-Queryable for Status: The table is not easily queryable for overall job status, hindering the development of dashboards, monitoring tools, and reporting.
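
For concreteness, here is a minimal sketch, assuming a hypothetical schema named my_pipeline, of how the current schema-level jobs table is queried. Only reserved and error entries ever appear in it:

```python
import datajoint as dj

schema = dj.schema('my_pipeline')  # hypothetical schema name

# The schema-level jobs table records only two statuses:
errors = schema.jobs & 'status = "error"'        # failed jobs
in_flight = schema.jobs & 'status = "reserved"'  # jobs in progress

# There are no pending/scheduled or success entries, so overall
# pipeline progress cannot be reported from this table alone.
print(errors.fetch('table_name', 'error_message'))
```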

Limitations of key_source Behavior/Usage

The default key_source (an inner-join of parent tables) is intended to represent all possible jobs for a given table.

  • Frequent Modification Needed: In practice, the actual set of jobs of interest is often a subset of this, requiring frequent modifications to key_source (e.g., restricting by paramset or other tables).
  • Local Visibility Only: Modified key_source settings are only visible to the local code executing the pipeline, not globally at the database level. This leads to:
    • Out-of-sync code and key_source definitions.
    • Lack of visibility and accessibility via "virtual modules."
    • The need to install the entire pipeline/codebase to run specific parts, increasing complexity for microservices in a platform like Works.
  • Performance Bottleneck: (Table.key_source - Table).fetch('KEY') is DataJoint's method for retrieving the job queue, and it can be an expensive operation, especially when called frequently by multiple workers. This significantly strains the database server, as other users have observed. A sketch of this pattern follows below.
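
As a minimal sketch of both issues, assuming hypothetical tables Recording, ParamSet, and Analysis in the hypothetical my_pipeline schema, a locally overridden key_source and the polling pattern look like this:

```python
import datajoint as dj

schema = dj.schema('my_pipeline')  # hypothetical schema name

@schema
class Recording(dj.Manual):
    definition = """
    recording_id : int
    """

@schema
class ParamSet(dj.Manual):
    definition = """
    paramset_id : int
    """

@schema
class Analysis(dj.Computed):
    definition = """
    -> Recording
    -> ParamSet
    ---
    result : longblob
    """

    # Local-only customization: restrict the default inner join of the
    # parents to the subset of interest. The database has no record of
    # this override; only code that imports this class sees it.
    @property
    def key_source(self):
        return (Recording * ParamSet) & 'paramset_id = 1'

    def make(self, key):
        self.insert1(dict(key, result=[]))  # placeholder computation

# Each worker repeats this potentially expensive query to find
# remaining work, straining the database as worker count grows:
pending_keys = (Analysis.key_source - Analysis).fetch('KEY')
```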

Proposed implementation:

  1. The jobs tables will be implemented as separate hidden tables, one per computed table. (Currently, a single jobs table is implemented at the schema level.)
  2. The jobs tables will be accessed as schema.Table.jobs (as opposed to the current schema.jobs).
  3. The jobs tables will have the same primary key as their computed tables; jobs will no longer be addressed by key hashes.
  4. The jobs tables will form the same primary-key foreign keys as their computed tables, with cascaded deletes.
  5. populate() will, by default, invoke self.jobs.refresh(), which (1) deletes all existing entries except those with status in ('reserved', 'error') and (2) fills the key_source into the jobs table with status='', skipping keys already present in the jobs table.
  6. populate() will accept a refresh_jobs argument, True by default; set refresh_jobs=False to skip the refresh step.
  7. populate() will then use the jobs table for job reservation: setting the status to reserved, removing the entry when the computation succeeds, or setting the status to error on failure. This part is the same as before.
  8. The jobs table will provide a new uint8 priority field, which could default to 3, for example.
  9. The default ordering may be (priority, job_date), but other orderings will now become possible. A sketch of the proposed interface follows this list.
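
Under the proposed design, usage might look like the following sketch, reusing the hypothetical Analysis table from above. All names here (jobs, refresh(), refresh_jobs, status, priority) are proposals from the list, not existing APIs:

```python
# Hypothetical usage of the proposed per-table jobs queue.

Analysis.jobs.refresh()                # prune completed/stale entries and
                                       # insert new key_source keys as pending
Analysis.populate()                    # refresh_jobs=True by default
Analysis.populate(refresh_jobs=False)  # skip the refresh step

# The jobs table shares the computed table's primary key and is directly
# queryable, e.g. for dashboards, monitoring, and reporting:
Analysis.jobs & 'status = "error"'
Analysis.jobs & 'priority <= 3'        # uint8 priority field, default 3
```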

Notes

We have considered adopting and integrating with other industry standards for workflow orchestration, such as Airflow, Flyte, or Prefect, and have produced and evaluated multiple working prototypes.

However, we think that the burden of deploying and maintaining those tools is too great for a Python open-source project such as DataJoint: the enhanced features come with significant DevOps requirements.
