What's actually making this faster? #703

Closed
richardrl opened this issue May 28, 2025 · 12 comments
Comments

@richardrl commented May 28, 2025

My understanding is the fundamental issue with decoding from MP4s is the decoding gets extremely slow deeper into the MP4, because you have to start from the beginning and sequentially decode to your target frame.

I did an experiment using the torchcodec indexing API and it is no faster than a very naive decoding with AV library.

How do you get speedups in a setting where you need to sample random clips across multiple videos?

Is it because you give up on diversity within a minibatch? For example, I could imagine that for 100 subsequences in a minibatch, instead of having 1 subsequence each from 100 videos, you take 10 subsequences each from 10 videos; in the latter case torchcodec could sweep over each video once to collect its 10 clips, which could be much faster.

@NicolasHug (Member) commented May 29, 2025

Hi @richardrl

There are a bunch of optimizations we try to do within torchcodec. In the random clip sampling scenario you mention, the bulk of the speedup over a naive implementation comes from the fact that torchcodec avoids backwards seeks, which we have observed to be very slow in these scenarios. That is, if the frames we need to decode are at indices 50, 10, 1, 30, torchcodec decodes frames 1, 10, 30, 50 and re-orders them to match the expected output ordering.
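The seek-reordering idea can be sketched independently of torchcodec. The `decode_in_sorted_order` helper and `fake_decode` stub below are illustrative, not torchcodec APIs; the point is only the sort-then-invert-permutation logic:

```python
def decode_in_sorted_order(decode_one, indices):
    # Decode in ascending frame order (monotonic seeks only), then
    # restore the caller's requested ordering.
    order = sorted(range(len(indices)), key=lambda i: indices[i])
    decoded = {}
    for i in order:
        decoded[i] = decode_one(indices[i])  # seeks only move forward
    return [decoded[i] for i in range(len(indices))]

# Toy "decoder" that just records the order in which frames are decoded:
seek_log = []
def fake_decode(idx):
    seek_log.append(idx)
    return f"frame{idx}"

frames = decode_in_sorted_order(fake_decode, [50, 10, 1, 30])
print(seek_log)  # [1, 10, 30, 50] -- decoded in ascending order
print(frames)    # ['frame50', 'frame10', 'frame1', 'frame30'] -- caller's order
```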

There are other optimizations, like the choice of color conversion library (libswscale vs filtergraph).

There is no trade-off, i.e. we don't compromise on anything (like diversity) unless explicitly stated, e.g. with seek_mode="approximate".

> the decoding gets extremely slow deeper into the MP4, because you have to start from the beginning and sequentially decode to your target frame

Just to note that this is usually not the case: decoding a given frame requires decoding the previous (and sometimes the next) key frame, but there is no need to decode from the beginning of the file.
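The key-frame arithmetic can be made concrete with a small sketch (the `frames_to_decode` helper is hypothetical; real decoders do this internally): decoding starts at the nearest key frame at or before the target, not at frame 0.

```python
import bisect

def frames_to_decode(key_frame_indices, target):
    # Seek to the nearest key frame at or before the target, then decode
    # forward from there; nothing before that key frame is touched.
    k = key_frame_indices[bisect.bisect_right(key_frame_indices, target) - 1]
    return list(range(k, target + 1))

# With a key frame every 30 frames, reaching frame 95 decodes only 90..95:
print(frames_to_decode([0, 30, 60, 90], 95))  # [90, 91, 92, 93, 94, 95]
```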

@richardrl (Author)

@NicolasHug is there a specific example / API you'd recommend for this use case of random clip sampling (like in an imitation learning setting)?

I think I wasn’t seeing great results from the indexing API.

I'd like to compare its speed against my existing dataloader, which first decodes the video and then uses ffcv to sample subclips.

@NicolasHug (Member)

Sure, all of our clip samplers are detailed in this tutorial: https://docs.pytorch.org/torchcodec/stable/generated_examples/sampling.html

@VimalMollyn

@NicolasHug do you have an example where you can use torchcodec to sample a batch of video clips from multiple videos? Say, 30 frames each from 10 videos, to get a batch of size 10 x 30 x channels x height x width? I couldn't find such an example in the docs.

@NicolasHug (Member)

@VimalMollyn I think you'd just need to call the samplers (like clips_at_random_indices) individually on each of the 10 videos? Let me know if I'm missing something.
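A minimal sketch of the stacking step, assuming a per-video `decode_clip` callable (hypothetical; in practice it could wrap a torchcodec sampler such as clips_at_random_indices). The `dummy_decode` stand-in just makes the shape logic visible:

```python
import torch

def sample_clip_batch(video_paths, decode_clip, num_frames=30):
    # decode_clip(path, num_frames) -> tensor of shape (num_frames, C, H, W)
    clips = [decode_clip(path, num_frames) for path in video_paths]
    return torch.stack(clips)  # (num_videos, num_frames, C, H, W)

def dummy_decode(path, num_frames):
    # Stand-in for real video decoding.
    return torch.zeros(num_frames, 3, 4, 4)

batch = sample_clip_batch([f"video_{i}.mp4" for i in range(10)], dummy_decode)
print(batch.shape)  # torch.Size([10, 30, 3, 4, 4])
```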

@richardrl (Author)

@NicolasHug Are the sampling APIs supposed to be faster than the indexing API?

I did a test and it seems to be the same.

@NicolasHug (Member)

They're not supposed to be faster than the batch APIs like get_frames_*, because they rely on those under the hood. But they'll be faster than individually calling the single-frame APIs like get_frame_*.

@richardrl (Author)

Suppose we want to sample 1 frame from each of 100 videos in our dataloader. How would we structure the PyTorch dataset? If we use TorchCodec inside __getitem__, it would be very slow (decoding one frame at a time).

@NicolasHug (Member)

To decode one frame for each video you will need one VideoDecoder instance per video.

@richardrl (Author)

Yes, but how do we structure __getitem__? Is a VideoDecoder created per __getitem__ call that retrieves one frame with the indexing API? @NicolasHug
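One possible shape for such a dataset, sketched with a dummy decoder: the `make_decoder` factory and `DummyDecoder` below are illustrative, and the assumed decoder interface (len() for frame count, get_frame_at(index) returning an object with a .data attribute) mirrors torchcodec's VideoDecoder.

```python
import random
from torch.utils.data import Dataset

class RandomFrameDataset(Dataset):
    # One decoder instance per __getitem__ call; each dataloader worker
    # process then gets its own decoder handles.
    def __init__(self, video_paths, make_decoder):
        # make_decoder(path) -> decoder with len() and get_frame_at(index)
        self.video_paths = video_paths
        self.make_decoder = make_decoder

    def __len__(self):
        return len(self.video_paths)

    def __getitem__(self, i):
        decoder = self.make_decoder(self.video_paths[i])
        frame_index = random.randrange(len(decoder))
        return decoder.get_frame_at(frame_index).data

# Dummy decoder standing in for torchcodec.decoders.VideoDecoder:
class DummyDecoder:
    def __init__(self, path):
        self.path = path
    def __len__(self):
        return 100  # pretend frame count
    def get_frame_at(self, index):
        class Frame:  # minimal stand-in with a .data attribute
            data = index
        return Frame()

ds = RandomFrameDataset(["a.mp4", "b.mp4"], DummyDecoder)
print(len(ds))  # 2
```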

@richardrl (Author)

I continued this in #715

@NicolasHug (Member)

Sounds good, I'll close this issue as I think the original questions were addressed.
