Embedding STAC Directly in Zarr Store #1344
Replies: 5 comments 20 replies
---
Thanks @emmanuelmathot for mentioning the proposed approaches and explaining them so clearly. Separating STAC metadata into dedicated `.zstac` (or `.stac`) objects within a Zarr store, rather than embedding it in `.zattrs`, allows STAC-only tools to parse discovery metadata without requiring Zarr libraries, and Zarr-only tools to function without handling STAC structures. This separation is likely beneficial, as many tools already exist that serve either domain independently. It also facilitates linking to external product metadata documents via URL, and avoids the need to load the entire consolidated Zarr metadata.
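As a sketch of that separation, here is how a STAC-only client might read a hypothetical `.zstac` object with generic HTTP and JSON tooling, never touching any Zarr metadata. The store URL, the `.zstac` key name, and the helper names are illustrative assumptions, not an agreed layout.

```python
# Sketch only: assumes a hypothetical ".zstac" object at the store root.
import json
import urllib.request

def parse_stac_document(raw: bytes) -> dict:
    """Parse bytes as STAC JSON, with a minimal sanity check."""
    doc = json.loads(raw)
    if doc.get("type") not in {"Feature", "Catalog", "Collection"}:
        raise ValueError("object is not a STAC Item/Catalog/Collection")
    return doc

def read_zstac(store_url: str) -> dict:
    """Fetch <store>/.zstac as a plain object -- no Zarr library involved."""
    with urllib.request.urlopen(f"{store_url}/.zstac") as resp:
        return parse_stac_document(resp.read())

# A Zarr-only reader, conversely, can list the same prefix and simply
# ignore the unknown ".zstac" key.
```

A STAC-aware client could then hand the parsed dict to something like `pystac.Item.from_dict` for validation, while Zarr readers remain entirely unaffected.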
---
My 2 cents, as a zarr/xarray person who recently learned more about STAC, has had some discussions with @rabernat about this problem recently, and has also talked with other people at CNG.

One big Zarr could do Level 2 too

The "many smaller Zarrs" proposal is premised on the idea that you cannot put Level 2 data into Zarr, but this is false. Everyone just thinks it can't be done because most people access Zarr through Xarray, and it's awkward to represent unaligned Level 2 data in a single Xarray object (Dataset or DataTree). But Xarray and Zarr are totally separate, with data models that are subtly but importantly different. This belief about Level 2 ultimately stems from the idea that 1 COG == 1 Zarr store, and therefore 1 STAC item == 1 Zarr store. But actually 1 COG == 1 Zarr group (and that group could be nested further, which could be useful for storing overviews). This mistake is made in the linked blog post, which says that each store represents one scene.

Having a zillion small Zarr stores loses some of the advantages of Zarr (e.g. you now have a zillion URLs to catalog, instead of just one), so it should be avoided. Instead, I think you should store Level 2 data by placing the contents that would normally go in each COG into separate groups, e.g. 1 scene (and, I think, 1 STAC item) per group. (Notice also that this approach could be used with a Virtual Zarr store, in which case those groups would actually contain COGs!) Then you store the STAC collection data in a special root group of the single big Zarr store. This might require some minor tooling changes, but the result is very powerful: every Level 1/2/3/4 dataset can be distributed as a single big Zarr store, with the STAC collection data consistent with the Zarr array data.

Store STAC data scalably as tabular, in Zarr
Don't put all the STAC metadata into the Zarr metadata files. Instead, for the special root group containing STAC collection metadata, put it into the Zarr arrays themselves. The insight here is that STAC collection metadata is currently often put into a tabular database (e.g. PostGIS) or tabular data format (e.g. STAC-GeoParquet), but column-oriented tabular data is basically just a special case of Zarr's multi-dimensional array model. This metadata can include the names of the groups that contain the actual data for each item, similar to how STAC + COG references COGs today. This means we no longer need a separate Parquet file or database to store the STAC collection data: it can all be in Zarr! Combined with Icechunk, this can even be made transactionally safe as new data is added, always appearing in a consistent state to the user. A single, self-describing, consistent, serverless database for the entire collection.

The advantage of putting the data into the Zarr arrays themselves is that, like Parquet, they can be chunked. So this approach should be able to scale to massive STAC collections of millions of scenes, whereas your suggestion of putting the STAC collection data in the Zarr JSON metadata would not scale well.
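A minimal sketch of this column-oriented idea, with NumPy standing in for the store so the example is self-contained: each dict entry below would be a chunked 1-D Zarr array in a hypothetical root group (e.g. `stac_index/`), and all field names and values are made up for illustration.

```python
# Column-oriented "STAC collection as arrays": one 1-D array per field,
# one row per item/scene. Values and names are illustrative only.
import numpy as np

index = {
    "item_id":   np.array(["S2A_0001", "S2A_0002", "S2B_0003"]),
    # Paths of the Zarr groups that hold each scene's actual array data.
    "group":     np.array(["scenes/0001", "scenes/0002", "scenes/0003"]),
    "datetime":  np.array(["2024-01-01", "2024-01-11", "2024-01-21"],
                          dtype="datetime64[D]"),
    "cloud_pct": np.array([12.5, 80.0, 3.1]),
}

def search(index, max_cloud):
    """Return the Zarr group paths of items matching a scalar predicate."""
    mask = index["cloud_pct"] <= max_cloud
    return list(index["group"][mask])

# A reader would then open each returned path as an ordinary Zarr group.
```

Because each column is just an array, this query pattern works the same whether the columns live in memory, in Parquet, or as chunked Zarr arrays read lazily from object storage.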
To make sure this scales, any of this data that grows with the number of groups should go into the chunks, whilst anything that doesn't can go in the actual Zarr metadata of that special root group.

STAC collections aren't special enough to special case

The tabular suggestion might make more sense if we remove STAC from the picture for a second. Your fundamental problem is that you want to distribute a set of groups of related arrays, which may or may not be mutually aligned, but you also want to distribute per-group metadata about those arrays. This is not a problem specific to geospatial data. For example, in my old field of plasma physics, they have array data for each plasma experimental run (a "shot"), and run perhaps one shot per day. The shots represent plasmas that existed for different lengths of time, so they do not mutually align. Users want to access all the data for any given shot, but they also want to search by shot-level scalar metadata, such as total fusion power output, or whether or not a particular sensor instrument was turned on. This non-geospatial use case is served by a similar Zarr store schema: the various array data for each shot could live in a separate group (which again could be individually nested), with per-shot metadata in a tabular-like form in a dedicated group at the root of the store. Here we have a "collection" of data that is not STAC (a "Data Collection"?). Solving this with a domain-agnostic pattern opens the door to powerful domain-agnostic tools: a tabular tool (e.g. pandas, DuckDB) that could read the root group of the STAC-like store could also read the root group of the plasma data store.

Stay inside the Spec!

By now it should be obvious that I don't think you should stray outside of the existing Zarr format specification to make any of this work. It should be possible to fully support STAC with the Zarr format as it exists today.
Don't do this if at all possible. Straying from the spec (e.g. by adding
This seems like a weak reason to me. It's much easier to change tooling than it is to deal with incompatible formats, especially at a non-user-facing level like this. In this case you only have control over your own geospatial tooling, not everyone else's, so make the choice that only requires changing your own tooling slightly.
It's most important to avoid conflict with the main Zarr specification! In other words, instead of a domain-specific Zarr extension that somehow extends the Zarr format (making it no longer vanilla Zarr), you should be trying to make a domain-specific standard for a Zarr schema of metadata that nevertheless fully fits within the established Zarr data model. (Note how this is much more like GeoTIFF's relationship to TIFF: all GeoTIFFs are TIFFs. That was the correct call.)

tl;dr: I don't have a strong opinion on how exactly you choose to lay out the STAC data in Zarr, but I do strongly think you should keep it all in vanilla Zarr, and make use of the scalability of Zarr chunk data rather than putting it in the metadata.
---
Thank you, @TomNicholas, for contributing to the discussion. That aligns perfectly with my expectations. My aim here is to collect experiences like yours to evaluate the advantages and disadvantages of the various options for a possible recommendation or specification.
---
@TomNicholas @jsignell @emmanuelmathot Would the approaches being discussed here work with a hierarchy of STAC catalogues? The Met Office and other National Met Services are exploring hierarchical STACs as a way of exposing all the |
---
Thanks @emmanuelmathot. Dropping an initial response, and then I'll try to get caught up on the other discussion threads. I've only skimmed https://cpm.pages.eopf.copernicus.eu/eopf-cpm/main/developer-guide/store-developer-guide/mapping_file_samples_g.html and nothing else from the EOPF docs, so my understanding of it is limited.

On the big question of putting the STAC metadata inside vs. outside

For the Planetary Computer, our data usage primarily came through our STAC API. That was our entrypoint for everything. Given this entrypoint, embedding STAC metadata inside the Zarr metadata at

And it's worth stating explicitly that STAC metadata embedded inside Zarr metadata would not be searchable, at least not for a static Zarr dataset on Blob Storage; users wouldn't be able to do some kind of query over the STAC metadata of many STAC items (either through a STAC API endpoint or through something like stac-geoparquet).

If your users' entrypoint is instead some sort of catalog on top of Zarr, like a root Zarr Group with many nested Zarr (sub)groups and eventually Arrays inside those, then maybe this makes sense. At that point, you'd need to ask "what are my users doing with this STAC metadata once they've loaded it?" For the PC, users primarily used STAC metadata to load assets (maybe some people used the assets themselves, but 99% of the value at this point would come from the asset HREFs), either by loading a single asset using something like rioxarray or zarr-python, or many with something like stackstac or odc-stac. I might be missing something, but it sounds to me like STAC metadata embedded in a Zarr metadata document wouldn't be helpful here; you don't need any help loading assets since Zarr already does that. So my initial reaction is that this doesn't sound too valuable for how I use STAC.

That's not to say it's a bad idea, just that (my understanding of) the proposal doesn't align with the use cases I'm familiar with. The biggest benefit I see is that STAC has some established conventions for cataloging metadata about geospatial assets, so following those conventions could be a good idea. But we already have CF Conventions, and IIUC GeoZarr is exploring some things in this area, so just putting STAC in Zarr again doesn't seem the most useful.

One last thing to call out: the assets in the example are relative. STAC does allow for relative HREFs (for both items and assets), but in my experience (which is especially biased on this detail) they're not that useful. Our approach was all "cloud native", so a relative asset HREF is bad, since it implies you're downloading the asset to some working directory before you can use it.

FYI the link https://eopf-cpm.eu/ gives a 404 (not sure if that ever worked).
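A small sketch of the access pattern described above, where the practical value of a STAC item is mostly its asset HREFs, which are handed off to a loader (rioxarray, stackstac, odc-stac, zarr-python). The item contents and helper name below are made up for illustration.

```python
# Sketch: the typical "STAC item -> asset HREFs -> loader" hand-off.
# Item contents are illustrative; a real item would come from a STAC API.
def asset_hrefs(item, role=None):
    """Map asset key -> href, optionally filtered by asset role."""
    out = {}
    for key, asset in item.get("assets", {}).items():
        if role is None or role in asset.get("roles", []):
            out[key] = asset["href"]
    return out

item = {
    "type": "Feature",
    "id": "S2A_0001",
    "assets": {
        "B04": {"href": "https://example.com/S2A_0001/B04.tif",
                "roles": ["data"]},
        "thumbnail": {"href": "https://example.com/S2A_0001/thumb.png",
                      "roles": ["thumbnail"]},
    },
}

# A loader would then open asset_hrefs(item, role="data") values directly.
# A Zarr store that already bundles its arrays has no equivalent need for
# this step, which is the point being made above.
```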
---
Background
I'd like to initiate a discussion about the potential for storing STAC Collections and Items directly within a Zarr store. This approach has been implemented by ESA's Earth Observation Processing Framework (EOPF) for new Sentinel satellite data processing, and I believe it warrants broader community consideration given the growing adoption of Zarr in the geospatial community.
The EOPF Sentinel CPM processors currently embed STAC-compliant metadata using a `stac_discovery` field within the root `.zattrs` of their Zarr store consolidated metadata (`.zmetadata`). This implementation allows the Zarr store to be self-describing from a STAC perspective, containing complete Item metadata including basic STAC Item properties (id, bbox, geometry, properties), asset definitions with proper roles and access information, STAC extension metadata (eo, sat, processing, etc.), and collection-level information.

Conceptual Rationale
This approach fundamentally aligns with @rabernat's observation from discussion #1222 that "Zarr is more akin to a STAC Catalog or Collection" rather than a simple asset. The conceptual foundation rests on the idea that Zarr stores, particularly those containing multidimensional Earth observation data, often represent complete spatiotemporal datasets that naturally align with STAC's organizational concepts.
From a practical standpoint, embedding STAC metadata directly in Zarr stores creates truly self-describing data containers. This supports offline and distributed use cases where catalog connectivity isn't guaranteed. The approach also significantly reduces metadata duplication by establishing a single source of truth for spatiotemporal and technical metadata, ensuring automatic synchronization between data and metadata while simplifying maintenance workflows for data producers.
Perhaps most importantly, this pattern enhances discoverability by enabling Zarr stores to be cataloged and searched without requiring separate STAC Item creation processes. This supports federated discovery across distributed Zarr collections and enables direct integration with STAC-aware tools and workflows, potentially streamlining the entire data discovery and access pipeline.
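To make the embedding concrete, here is a sketch of pulling the `stac_discovery` field back out of Zarr v2 consolidated metadata with nothing but the standard library. The `.zmetadata` document below is a minimal, made-up example of the EOPF-style layout described above, not actual EOPF output.

```python
# Sketch: extracting embedded STAC from consolidated Zarr v2 metadata.
# The document content is a minimal illustrative example of the layout.
import json

zmetadata = json.loads("""
{
  "zarr_consolidated_format": 1,
  "metadata": {
    ".zgroup": {"zarr_format": 2},
    ".zattrs": {
      "stac_discovery": {
        "type": "Feature",
        "id": "S2A_MSIL2A_20240101",
        "bbox": [4.0, 50.0, 5.0, 51.0],
        "properties": {"datetime": "2024-01-01T10:30:00Z"}
      }
    }
  }
}
""")

def stac_from_consolidated(doc: dict) -> dict:
    """Extract the embedded STAC item without touching any chunk data."""
    return doc["metadata"][".zattrs"]["stac_discovery"]

# One GET of ".zmetadata" yields both the Zarr layout and the STAC item,
# which is what makes the store self-describing.
```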
Questions for Community Discussion
The implementation of this pattern raises several important questions that would benefit from community input. At the standards level, I'm curious whether there are existing recommendations at the Zarr specification level for embedding STAC or similar metadata in Zarr stores, and whether we should consider developing a formal STAC extension for this pattern. The question of the appropriate location also seems important to establish early (see "Storage Location and Implementation Approaches" below).
The implementation patterns raise questions about when this approach is most appropriate. The "One Big Zarr" versus "Many Smaller Zarrs" scenarios described in @jsignell's recent blog post seem to have different optimal approaches. Integration with existing STAC API patterns is another consideration: could APIs dynamically extract STAC metadata from Zarr stores, and what would be the performance implications of including STAC Collections and Items in metadata files?
Storage Location and Implementation Approaches
A critical question for this pattern concerns the exact location and format for storing STAC metadata within Zarr stores. The current EOPF implementation embeds STAC metadata in the `stac_discovery` field within the root `.zattrs` of the consolidated metadata. However, @christophenoel has proposed an alternative approach in GeoZarr issue #32 that introduces a dedicated `.zstac` object at each level of the Zarr hierarchy, alongside the existing `.zgroup` object.

This `.zstac` approach offers several potential advantages. It would provide a clear separation between access metadata (handled by standard Zarr mechanisms and specifications) and discovery metadata (handled by STAC structures). The hierarchical nature would allow different levels of the Zarr store to have their own STAC catalog information, potentially supporting more complex organizational patterns. Most importantly, this approach would maintain compatibility with traditional STAC catalog linking patterns, where individual catalog files can be linked together to form larger catalog hierarchies.

The choice between embedding STAC metadata in `.zattrs` versus using dedicated `.zstac` files has significant implications for tooling, performance, and interoperability. The `.zattrs` approach keeps all metadata consolidated and immediately accessible through existing Zarr metadata access patterns. The `.zstac` approach provides cleaner separation of concerns and potentially better alignment with existing STAC tooling that expects discrete catalog files.

Seeking Community Input
I'm particularly interested in learning about historical discussions or implementations of similar approaches within the STAC community. Understanding potential conflicts with existing Zarr specifications would be valuable, as would hearing about implementation experiences from other data providers who may have explored similar patterns.
The broader question of community appetite for formalizing this pattern is also important. If there's sufficient interest, we could consider developing formal recommendations or extensions to support this approach consistently across implementations.
Has anyone else explored or implemented similar approaches? What challenges or benefits have you encountered, and what considerations might I be overlooking in this discussion?
cc @maxrjones @vdumoul @TomAugspurger @m-mohr