Description
Feature Request
Problem
Scientific research, particularly in fields like neuroscience and imaging, generates massive n-dimensional arrays that are too large to be stored efficiently in traditional blob fields. The Zarr format is an industry standard for storing chunked, compressed array data, enabling parallel I/O and efficient partial access (slicing). Without native support for Zarr, users are forced to manage these datasets manually by storing file paths, which disconnects the data from the DataJoint pipeline and forfeits all the benefits of integrated data management and integrity checks.
Requirements
A successful implementation of this improvement should provide a built-in Custom Type Adaptor for handling Zarr arrays, leveraging the object type for external storage. This implementation must adhere to the DataJoint 2.0 Specification.
The core requirements are:
Create a dj.CustomType Adaptor:
- A new class must be implemented, as a plugin, that inherits from dj.CustomType.
Implement the Standard Interface:
- The type_name property must return the string <dj_zarr>.
- The stored_type property must return object (the object type used for external storage).
- The put method must accept a Zarr-compatible array object (e.g., a NumPy array) and write it to the configured external object store as a Zarr dataset.
- The get method must read the Zarr dataset from the external store and return it as a lazy-loading Zarr array object.
Default Registration:
- The <dj_zarr> adaptor should be registered by default with the DataJoint client, making it available out-of-the-box.
Justification
Providing native support for Zarr arrays via a type adaptor will be a transformative feature for DataJoint. It will enable the seamless management of petabyte-scale array data in a format that is optimized for high-performance, parallel computing and cloud environments. This directly addresses the needs of modern data-intensive science and solidifies DataJoint's position as a cutting-edge platform for scientific data management.