Best practice for large numbers of images per item (>1000) #1357

DFEvans · 2025-08-06T09:12:21Z

Hi all,

I'm currently considering the problem of how best to arrange STAC metadata for a data product with a large number (25-1500) of images. It's a low-level data product providing the consumer with the individual frames from a frame camera system during a single acquisition, and so it feels like they belong together as an "item" (this is also how we intend to distribute them, not as individual frames).

The challenging part is how (if at all) to represent them as assets. Things which make that a bit difficult are:

- Needing to provide a schema that lists 1500 identical asset definitions, each with a unique name (which assets are required to have) - it could be auto-generated, but that feels like a workaround for a bad decision!
- It's not trivial to constrain the schema so that the frames must be in sequence - having frame1 present, frame2 missing, then frame3 present wouldn't be very friendly, but frame1, frame2, and no further frames is OK.
- The size of a STAC item with 1500 assets starts getting troublesome. Individually it's not terribly bad, but e.g. an API returning 100 items each with 1500 assets would start hitting response/message size limits in many systems.
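
As a rough illustration of the last two points, here is a minimal Python sketch that auto-generates the asset entries, measures their serialized size, and checks that the frame numbers have no gaps. The key pattern (`frame_0001`), the hrefs, and the media type are invented for the example, and the actual byte count would depend on what each real asset carries.

```python
import json
import re


def make_frame_assets(n_frames):
    """Auto-generate one asset entry per frame (hypothetical key/href pattern)."""
    return {
        f"frame_{i:04d}": {
            "href": f"./frames/frame_{i:04d}.tif",
            "type": "image/tiff; application=geotiff",
            "roles": ["data"],
        }
        for i in range(1, n_frames + 1)
    }


def frames_are_contiguous(assets):
    """Return True if the frame_NNNN keys run 1..N with no gaps."""
    numbers = []
    for key in assets:
        match = re.fullmatch(r"frame_(\d+)", key)
        if match:
            numbers.append(int(match.group(1)))
    numbers.sort()
    return numbers == list(range(1, len(numbers) + 1))


assets = make_frame_assets(1500)
asset_bytes = len(json.dumps({"assets": assets}).encode("utf-8"))
print(f"{len(assets)} assets, about {asset_bytes / 1024:.0f} KiB for the assets alone")

# Dropping frame 2 leaves frame 1 and frame 3 present, which is the unfriendly case.
gappy = {key: value for key, value in assets.items() if key != "frame_0002"}
print(frames_are_contiguous(assets), frames_are_contiguous(gappy))
```

A check like this lives outside the JSON Schema, which is part of why the sequencing constraint feels awkward to express in the schema itself.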

We do already have an "index" file giving further metadata on a per-frame basis, and one option could be to push everything to do with the frames down to that level, but a data consumer wouldn't then be able to rely on the STAC metadata.

Are there any examples of how others have approached this sort of problem?

jsignell · 2025-08-06T16:57:35Z

Maintainer

That's a really interesting question and I haven't heard of it before, so this is my speculation. I think you are right that all the frames belong under a single item.

> We do already have an "index" file giving further metadata on a per-frame basis, and one option could be to push everything to do with the frames down to that level

That seems like the right approach to me, but it depends on how you expect people to use the data:

Do people need to access a particular frame?
- If so how will they figure out which frame they need? In the frame-per-asset scenario would they still need the "index" file to figure out which frame they need?
- If not then do they load all the frames up in a stack?

> but a data consumer wouldn't then be able to rely on the STAC metadata.

I am wondering what you currently think would be in the STAC asset metadata that the user would be interested in. Is it something they would want when searching the catalog, or only something they need in order to access the data or once they have loaded it?

1 reply

DFEvans Aug 7, 2025
Author

Thanks for your thoughts and questions - perfectly worded to make me look at why I'm asking the question to begin with!

> If so how will they figure out which frame they need? In the frame-per-asset scenario would they still need the "index" file to figure out which frame they need?

The ways we envision consumers wanting to use the data are:

- Combining the frames themselves (rather than using one of our higher-level data products), in which case selecting a specific frame isn't as important, but the metadata needed to combine them usefully, like per-frame imaging angles, is in the "index" file
- Performing some form of stereo imaging analysis, where again, the per-frame imaging angles are in the "index" file
- Creating video products, where the frame number could be taken from the asset name, but if I were a data consumer I'd probably prefer to read it out of the index...

So, in all cases, it seems like putting every asset in the STAC is of no real use to them, or at best is more cumbersome than the index file.

> I am wondering what you currently think would be in the STAC asset metadata that the user would be interested in. Is it something they would want when searching the catalog, or only something they need in order to access the data or once they have loaded it?

Part of where this question comes from is that we're actually using STAC for several different purposes:

- Storing the information on what assets make up each data product we can serve, so we know what to ship out when a user orders a data product ABC123 (it's a commercial data product, so the assets aren't directly accessible to a user)
- Providing the search/exploration facility that STAC/STAC APIs do well at
- The metadata that we ship to the user when they order a data product (but then augmented with the index file for this specific data product)

And so for the first one it (currently) matters to us internally that we list every asset, but it's not so useful for a user - I agree with your suspicion that a data consumer would rarely care about the individual frames until they've got the data in their hands already.

jsignell · 2025-08-07T17:00:26Z

Maintainer

This is super interesting to think through! I really appreciate you taking the time to write it up! I should also mention that people are more than happy to talk about questions like this at the biweekly STAC Community Meetings.

> Storing the information on what assets make up each data product we can serve, so we know what to ship out when a user orders a data product ABC123 (it's a commercial data product, so the assets aren't directly accessible to a user)

Ok, so is your current need to list every asset individually driven by some of your data products using certain frames and others using other frames? Are there multiple products derived from any given frame?

Zooming out: if you are worried about response size and you already have an index file, then this is starting to sound a bit like stac-geoparquet (if you squint). I am wondering if you could look to that work for inspiration? From what I've seen, stac-geoparquet has been pretty focused on item collections, and what you are talking about is more of a per-item asset collection. In this setup you would have one parquet file per item with one row per asset, and each row would hold the information from the index file as well as all the STAC asset information.
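
A minimal sketch of that one-row-per-asset layout, using only the Python standard library. The index columns (`frame`, `acquired`, `off_nadir_deg`), the asset key pattern, and the href template are all assumptions made for illustration; this is not the stac-geoparquet schema, just the shape of the idea.

```python
import csv
import io

# Hypothetical index file: one row per frame (column names are assumptions).
INDEX_CSV = """frame,acquired,off_nadir_deg
1,2025-08-06T09:00:00Z,12.5
2,2025-08-06T09:00:01Z,12.1
3,2025-08-06T09:00:02Z,11.7
"""


def build_asset_rows(index_file, href_template, media_type):
    """Join each index row with the STAC asset fields for the same frame."""
    rows = []
    for record in csv.DictReader(index_file):
        frame = int(record["frame"])
        rows.append(
            {
                # STAC asset fields
                "key": f"frame_{frame:04d}",
                "href": href_template.format(frame=frame),
                "type": media_type,
                "roles": ["data"],
                # Per-frame metadata carried over from the index file
                "frame": frame,
                "acquired": record["acquired"],
                "off_nadir_deg": float(record["off_nadir_deg"]),
            }
        )
    return rows


rows = build_asset_rows(
    io.StringIO(INDEX_CSV),
    "./frames/frame_{frame:04d}.tif",
    "image/tiff; application=geotiff",
)
for row in rows:
    print(row["key"], row["href"], row["off_nadir_deg"])
```

A list of dicts like this could then be written out as a Parquet file with a library such as pyarrow; that step is left out here to keep the sketch dependency-free.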

DFEvans · 2025-08-07T17:50:17Z

Author

> Ok, so is your current need to list every asset individually driven by some of your data products using certain frames and others using other frames? Are there multiple products derived from any given frame?

Not so much that - it's that this is a low-level data product intended to give a user something close to the raw data, and in some cases, that's a lot of frames! On the processing side of things, there's a separate database with its own structure to handle the linking between the various processing outputs; at that level, we've not found a need to be as verbose as STAC is.

The STAC comes in when it's packaged up as a data product to go into the customer-facing catalog. The data processing side packages up the data at a few different processing levels into STAC items, and hands these over to the customer-facing API team. Each STAC item is then used in the STAC API that faces customers for data exploration and ordering, to feed our own frontend (via the API), to ship data to customers when they order it (order item ABC123, we send you all of the assets linked to that item), etc.

Having had a chat internally today, with this discussion in mind, the approach we're planning to take is to zip up all the frames for a specific item into a single asset on that data processing side, and to then ship that zip to customers. The thoughts are:

- We already ship data to customers via a zip file when they place an order, and a zip in a zip isn't too unusual.
- It solves the "many assets" problem when handing data over between teams and presenting it in API responses.
- There isn't a need to have each frame individually accessible in the places we're using STAC internally - the lower level data processing DB/pipeline/data storage will still have them accessible, but between "package it up once it's processed" and "customer receives the data", it's a single bundle of data that doesn't make sense to split.
- Similarly, the metadata for each frame is unlikely to be of interest in the "search, explore, and order" phase. The item level metadata should be sufficient (and is far more informative anyway, partly by design of how STAC works).
- For data consumers, a reasonable number of processing tools (e.g. GDAL) can work directly with rasters inside zips, and if not, just unzip it.

Zips only come in as a convenient data container that everything under the sun can deal with - the compression isn't going to do anything to the frames. It's more .tar than .tar.gz.
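
A minimal sketch of that packaging step, assuming Python's standard `zipfile` module with `ZIP_STORED` (no compression) so the zip acts purely as a container. The file names, the `frames` asset key, and the item fragment are hypothetical, and the placeholder bytes stand in for real rasters.

```python
import json
import tempfile
import zipfile
from pathlib import Path


def bundle_frames(frame_paths, zip_path):
    """Pack frames into a zip without compression, so it is only a container."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_STORED) as bundle:
        for path in frame_paths:
            bundle.write(path, arcname=Path(path).name)
    return zip_path


def single_asset_item(item_id, zip_href):
    """Partial STAC item fragment with one asset for the whole bundle."""
    return {
        "id": item_id,
        "assets": {
            # "frames" is an assumed asset key, not a STAC convention.
            "frames": {"href": zip_href, "type": "application/zip", "roles": ["data"]},
        },
    }


with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_dir = Path(tmp_dir)
    frames = []
    for i in range(1, 4):
        frame_path = tmp_dir / f"frame_{i:04d}.tif"
        frame_path.write_bytes(b"placeholder")  # stand-in for real raster data
        frames.append(frame_path)

    zip_path = bundle_frames(frames, tmp_dir / "ABC123_frames.zip")
    with zipfile.ZipFile(zip_path) as bundle:
        names = bundle.namelist()
        methods = {info.compress_type for info in bundle.infolist()}

item = single_asset_item("ABC123", "./ABC123_frames.zip")
print(names)
print(json.dumps(item, indent=2))
```

On the consumer side, GDAL's `/vsizip/` virtual file system is the mechanism that lets tools open a raster inside the archive without unpacking it first.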

On GeoParquet, that might be something to consider if we were using our STAC API to provide direct data access as well. In that hypothetical world, I think my main concern would be the barrier to entry for some of our data consumers, particularly if they're using existing third-party software. The index file is currently just a CSV, one row per frame, to keep things nice and simple. That concept came from how other satellite data providers have presented similar products - just roll a die to choose whether it's CSV, XML, or JSON.

1 reply

jsignell Aug 7, 2025
Maintainer

Gotcha, I think that seems like a sane solution!
