Embedding STAC Directly in Zarr Store #1344
Replies: 5 comments 20 replies
---
Thanks @emmanuelmathot for mentioning the proposed approaches and explaining them so clearly. Separating STAC metadata into dedicated `.zstac` (or `.stac`) objects within a Zarr store, rather than embedding it in `.zattrs`, allows STAC-only tools to parse discovery metadata without requiring Zarr libraries, and Zarr-only tools to function without handling STAC structures. This separation is likely beneficial, as many tools already exist that serve either domain independently. It also facilitates linking to external product metadata documents via URL, and avoids the need to load the entire consolidated Zarr metadata.
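As a sketch of that separation, here is how a STAC-only client might read a hypothetical `.zstac` object with generic HTTP and JSON tooling, never touching any Zarr metadata. The store URL, the `.zstac` key name, and the helper names are illustrative assumptions, not an agreed layout.

```python
# Sketch only: assumes a hypothetical ".zstac" object at the store root.
import json
import urllib.request

def parse_stac_document(raw: bytes) -> dict:
    """Parse bytes as STAC JSON, with a minimal sanity check."""
    doc = json.loads(raw)
    if doc.get("type") not in {"Feature", "Catalog", "Collection"}:
        raise ValueError("object is not a STAC Item/Catalog/Collection")
    return doc

def read_zstac(store_url: str) -> dict:
    """Fetch <store>/.zstac as a plain object -- no Zarr library involved."""
    with urllib.request.urlopen(f"{store_url}/.zstac") as resp:
        return parse_stac_document(resp.read())

# A Zarr-only reader, conversely, can list the same prefix and simply
# ignore the unknown ".zstac" key.
```

A STAC-aware client could then hand the parsed dict to something like `pystac.Item.from_dict` for validation, while Zarr readers remain entirely unaffected.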
---
My 2 cents, as a zarr/xarray person who recently learned more about STAC, has had some discussions with @rabernat about this problem recently, and has also talked with other people at CNG.

One big Zarr could do Level 2 too

The "many smaller Zarrs" proposal is premised on the idea that you cannot put Level 2 data into Zarr, but this is false. Everyone just thinks it can't be done because most people access Zarr through Xarray, and it's awkward to represent unaligned Level 2 data in a single Xarray object (Dataset or DataTree). But Xarray and Zarr are totally separate, with data models that are subtly but importantly different. This belief about Level 2 ultimately stems from the idea that 1 COG == 1 Zarr store, and therefore 1 STAC item == 1 Zarr store. But actually 1 COG == 1 Zarr group (and that group could be nested further, which could be useful for storing overviews). This mistake is made in the linked blog post, which says that each store represents one scene.

Having a zillion small Zarr stores loses some of the advantages of Zarr (e.g. you now have a zillion URLs to catalog, instead of just one), so it should be avoided. Instead, I think you should store Level 2 data by placing the contents that would normally go in each COG into separate groups, e.g. 1 scene (and, I think, 1 STAC item) per group. (Notice also that this approach could be used with a Virtual Zarr store, in which case those groups would actually contain COGs!) Then you store the STAC collection data in a special root group of the single big Zarr store. This might require some minor tooling changes, but the result is very powerful: every Level 1/2/3/4 dataset can be distributed as a single big Zarr store, with the STAC collection data consistent with the Zarr array data.

Store STAC data scalably as tabular, in Zarr
Don't put all the STAC metadata into the Zarr metadata files. Instead, for the special root group containing STAC collection metadata, put it into the Zarr arrays themselves. The insight here is that STAC collection metadata is currently often put into a tabular database (e.g. PostGIS) or tabular data format (e.g. STAC-GeoParquet), but column-oriented tabular data is basically just a special case of Zarr's multi-dimensional array model. This metadata can include the names of the groups that contain the actual data for each item, similar to how STAC + COG references COGs today. This means we no longer need a separate Parquet file or database to store the STAC collection data: it can all be in Zarr! Combined with Icechunk, this can even be made transactionally safe as new data is added, always appearing in a consistent state to the user. A single, self-describing, consistent, serverless database for the entire collection.

The advantage of putting the data into the Zarr arrays themselves is that, like Parquet, they can be chunked. So this approach should be able to scale to massive STAC collections of millions of scenes, whereas your suggestion of putting the STAC collection data in the Zarr JSON metadata would not scale well.
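A minimal sketch of this column-oriented idea, with NumPy standing in for the store so the example is self-contained: each dict entry below would be a chunked 1-D Zarr array in a hypothetical root group (e.g. `stac_index/`), and all field names and values are made up for illustration.

```python
# Column-oriented "STAC collection as arrays": one 1-D array per field,
# one row per item/scene. Values and names are illustrative only.
import numpy as np

index = {
    "item_id":   np.array(["S2A_0001", "S2A_0002", "S2B_0003"]),
    # Paths of the Zarr groups that hold each scene's actual array data.
    "group":     np.array(["scenes/0001", "scenes/0002", "scenes/0003"]),
    "datetime":  np.array(["2024-01-01", "2024-01-11", "2024-01-21"],
                          dtype="datetime64[D]"),
    "cloud_pct": np.array([12.5, 80.0, 3.1]),
}

def search(index, max_cloud):
    """Return the Zarr group paths of items matching a scalar predicate."""
    mask = index["cloud_pct"] <= max_cloud
    return list(index["group"][mask])

# A reader would then open each returned path as an ordinary Zarr group.
```

Because each column is just an array, this query pattern works the same whether the columns live in memory, in Parquet, or as chunked Zarr arrays read lazily from object storage.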
To make sure this scales, any of this data that grows with the number of groups should go into the chunks, whilst anything that doesn't can go in the actual Zarr metadata of that special root group.

STAC collections aren't special enough to special case

The tabular suggestion might make more sense if we remove STAC from the picture for a second. Your fundamental problem is that you want to distribute a set of groups of related arrays, which may or may not be mutually aligned, but you also want to distribute per-group metadata about those arrays. This is not a problem specific to geospatial data. For example, in my old field of plasma physics, they have array data for each plasma experimental run (a "shot"), and run perhaps one shot per day. The shots represent plasmas that existed for different lengths of time, so they do not mutually align. Users want to access all the data for any given shot, but they also want to search by shot-level scalar metadata, such as total fusion power output, or whether or not a particular sensor instrument was turned on. This non-geospatial use case is served by a similar Zarr store schema: the various array data for each shot could live in a separate group (which again could be individually nested), with per-shot metadata in a tabular-like form in a dedicated group at the root of the store. Here we have a "collection" of data that is not STAC (a "Data Collection"?). Solving this with a domain-agnostic pattern opens the door to powerful domain-agnostic tools: a tabular tool (e.g. pandas, DuckDB) that could read the root group of the STAC-like store could also read the root group of the plasma data store.

Stay inside the Spec!

By now it should be obvious that I don't think you should stray outside of the existing Zarr format specification to make any of this work. It should be possible to fully support STAC with the Zarr format as it exists today.
Don't do this if at all possible. Straying from the spec (e.g. by adding
This seems like a weak reason to me. It's much easier to change tooling than it is to deal with incompatible formats, especially at a non-user-facing level like this. In this case you only have control over your own geospatial tooling, not everyone else's, so make the choice that only requires changing your own tooling slightly.
It's most important to avoid conflict with the main Zarr specification! In other words, instead of a domain-specific Zarr extension that somehow extends the Zarr format (making it no longer vanilla Zarr), you should be trying to make a domain-specific standard for a Zarr schema of metadata that nevertheless fully fits within the established Zarr data model. (Note how this is much more like GeoTIFF's relationship to TIFF: all GeoTIFFs are TIFFs. That was the correct call.)

tl;dr: I don't have a strong opinion on how exactly you choose to lay out the STAC data in Zarr, but I do strongly think you should keep it all in vanilla Zarr, and make use of the scalability of Zarr chunk data rather than putting it in the metadata.
---
Thank you, @TomNicholas, for contributing to the discussion. That aligns perfectly with my expectations. My aim here is to collect experiences like yours to evaluate the advantages and disadvantages of the various options for a possible recommendation or specification.
---
@TomNicholas @jsignell @emmanuelmathot Would the approaches being discussed here work with a hierarchy of STAC catalogues? The Met Office and other National Met Services are exploring hierarchical STACs as a way of exposing all the |
---
Thanks @emmanuelmathot. Dropping an initial response, and then I'll try to get caught up on the other discussion threads. I've only skimmed https://cpm.pages.eopf.copernicus.eu/eopf-cpm/main/developer-guide/store-developer-guide/mapping_file_samples_g.html and nothing else from the EOPF docs, so my understanding of it is limited.

On the big question of putting the STAC metadata inside vs. outside

For the Planetary Computer, our data usage primarily came through our STAC API. That was our entrypoint for everything. Given this entrypoint, embedding STAC metadata inside the Zarr metadata at

And it's worth stating explicitly that STAC metadata embedded inside Zarr metadata would not be searchable, at least not for a static Zarr dataset on Blob Storage; users wouldn't be able to do some kind of query over the STAC metadata of many STAC items (either through a STAC API endpoint or through something like stac-geoparquet).

If your users' entrypoint is instead some sort of catalog on top of Zarr, like a root Zarr Group with many nested Zarr (sub)groups and eventually Arrays inside those, then maybe this makes sense. At that point, you'd need to ask "what are my users doing with this STAC metadata once they've loaded it?" For the PC, users primarily used STAC metadata to load assets (maybe some people used the assets themselves, but 99% of the value at this point would come from the asset HREFs), either by loading a single asset using something like rioxarray or zarr-python, or many with something like stackstac or odc-stac. I might be missing something, but it sounds to me like STAC metadata embedded in a Zarr metadata document wouldn't be helpful here; you don't need any help loading assets since Zarr already does that. So my initial reaction is that this doesn't sound too valuable for how I use STAC.

That's not to say it's a bad idea, just that (my understanding of) the proposal doesn't align with the use cases I'm familiar with. The biggest benefit I see is that STAC has some established conventions for cataloging metadata about geospatial assets, so following those conventions could be a good idea. But we already have CF Conventions, and IIUC GeoZarr is exploring some things in this area, so just putting STAC in Zarr again doesn't seem the most useful.

One last thing to call out: the assets in the example are relative. STAC does allow for relative HREFs (for both items and assets), but in my experience (which is especially biased on this detail) they're not that useful. Our approach was all "cloud native", so a relative asset HREF is bad, since it implies you're downloading the asset to some working directory before you can use it.

FYI the link https://eopf-cpm.eu/ gives a 404 (not sure if that ever worked).
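A small sketch of the access pattern described above, where the practical value of a STAC item is mostly its asset HREFs, which are handed off to a loader (rioxarray, stackstac, odc-stac, zarr-python). The item contents and helper name below are made up for illustration.

```python
# Sketch: the typical "STAC item -> asset HREFs -> loader" hand-off.
# Item contents are illustrative; a real item would come from a STAC API.
def asset_hrefs(item, role=None):
    """Map asset key -> href, optionally filtered by asset role."""
    out = {}
    for key, asset in item.get("assets", {}).items():
        if role is None or role in asset.get("roles", []):
            out[key] = asset["href"]
    return out

item = {
    "type": "Feature",
    "id": "S2A_0001",
    "assets": {
        "B04": {"href": "https://example.com/S2A_0001/B04.tif",
                "roles": ["data"]},
        "thumbnail": {"href": "https://example.com/S2A_0001/thumb.png",
                      "roles": ["thumbnail"]},
    },
}

# A loader would then open asset_hrefs(item, role="data") values directly.
# A Zarr store that already bundles its arrays has no equivalent need for
# this step, which is the point being made above.
```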
---
Background
I'd like to initiate a discussion about the potential for storing STAC Collections and Items directly within a Zarr store. This approach has been implemented by ESA's Earth Observation Processing Framework (EOPF) for new Sentinel satellite data processing, and I believe it warrants broader community consideration given the growing adoption of Zarr in the geospatial community.
The EOPF Sentinel CPM processors currently embed STAC-compliant metadata using a `stac_discovery` field within the root `.zattrs` of their Zarr store consolidated metadata (`.zmetadata`). This implementation allows the Zarr store to be self-describing from a STAC perspective, containing complete Item metadata including basic STAC Item properties (id, bbox, geometry, properties), asset definitions with proper roles and access information, STAC extension metadata (eo, sat, processing, etc.), and collection-level information.

Conceptual Rationale
This approach fundamentally aligns with @rabernat's observation from discussion #1222 that "Zarr is more akin to a STAC Catalog or Collection" rather than a simple asset. The conceptual foundation rests on the idea that Zarr stores, particularly those containing multidimensional Earth observation data, often represent complete spatiotemporal datasets that naturally align with STAC's organizational concepts.
From a practical standpoint, embedding STAC metadata directly in Zarr stores creates truly self-describing data containers. This supports offline and distributed use cases where catalog connectivity isn't guaranteed. The approach also significantly reduces metadata duplication by establishing a single source of truth for spatiotemporal and technical metadata, ensuring automatic synchronization between data and metadata while simplifying maintenance workflows for data producers.
Perhaps most importantly, this pattern enhances discoverability by enabling Zarr stores to be cataloged and searched without requiring separate STAC Item creation processes. This supports federated discovery across distributed Zarr collections and enables direct integration with STAC-aware tools and workflows, potentially streamlining the entire data discovery and access pipeline.
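To make the embedding concrete, here is a sketch of pulling the `stac_discovery` field back out of Zarr v2 consolidated metadata with nothing but the standard library. The `.zmetadata` document below is a minimal, made-up example of the EOPF-style layout described above, not actual EOPF output.

```python
# Sketch: extracting embedded STAC from consolidated Zarr v2 metadata.
# The document content is a minimal illustrative example of the layout.
import json

zmetadata = json.loads("""
{
  "zarr_consolidated_format": 1,
  "metadata": {
    ".zgroup": {"zarr_format": 2},
    ".zattrs": {
      "stac_discovery": {
        "type": "Feature",
        "id": "S2A_MSIL2A_20240101",
        "bbox": [4.0, 50.0, 5.0, 51.0],
        "properties": {"datetime": "2024-01-01T10:30:00Z"}
      }
    }
  }
}
""")

def stac_from_consolidated(doc: dict) -> dict:
    """Extract the embedded STAC item without touching any chunk data."""
    return doc["metadata"][".zattrs"]["stac_discovery"]

# One GET of ".zmetadata" yields both the Zarr layout and the STAC item,
# which is what makes the store self-describing.
```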
Questions for Community Discussion
The implementation of this pattern raises several important questions that would benefit from community input. At the standards level, I'm curious whether there are existing recommendations at the Zarr specification level for embedding STAC or similar metadata in Zarr stores, and whether we should consider developing a formal STAC extension for this pattern. The question of the appropriate location also seems important to establish early (see "Storage Location and Implementation Approaches" below).
The implementation patterns raise questions about when this approach is most appropriate. The "One Big Zarr" versus "Many Smaller Zarrs" scenarios described in @jsignell's recent blog post seem to have different optimal approaches. Integration with existing STAC API patterns is another consideration: could APIs dynamically extract STAC metadata from Zarr stores, and what would be the performance implications of including STAC Collections and Items in metadata files?
Storage Location and Implementation Approaches
A critical question for this pattern concerns the exact location and format for storing STAC metadata within Zarr stores. The current EOPF implementation embeds STAC metadata in the `stac_discovery` field within the root `.zattrs` of the consolidated metadata. However, @christophenoel has proposed an alternative approach in GeoZarr issue #32 that introduces a dedicated `.zstac` object at each level of the Zarr hierarchy, alongside the existing `.zgroup` object.

This `.zstac` approach offers several potential advantages. It would provide a clear separation between access metadata (handled by standard Zarr mechanisms and specifications) and discovery metadata (handled by STAC structures). The hierarchical nature would allow different levels of the Zarr store to have their own STAC catalog information, potentially supporting more complex organizational patterns. Most importantly, this approach would maintain compatibility with traditional STAC catalog linking patterns, where individual catalog files can be linked together to form larger catalog hierarchies.

The choice between embedding STAC metadata in `.zattrs` versus using dedicated `.zstac` files has significant implications for tooling, performance, and interoperability. The `.zattrs` approach keeps all metadata consolidated and immediately accessible through existing Zarr metadata access patterns. The `.zstac` approach provides cleaner separation of concerns and potentially better alignment with existing STAC tooling that expects discrete catalog files.

Seeking Community Input
I'm particularly interested in learning about historical discussions or implementations of similar approaches within the STAC community. Understanding potential conflicts with existing Zarr specifications would be valuable, as would hearing about implementation experiences from other data providers who may have explored similar patterns.
The broader question of community appetite for formalizing this pattern is also important. If there's sufficient interest, we could consider developing formal recommendations or extensions to support this approach consistently across implementations.
Has anyone else explored or implemented similar approaches? What challenges or benefits have you encountered, and what considerations might I be overlooking in this discussion?
cc @maxrjones @vdumoul @TomAugspurger @m-mohr