Skip to content

Allow num_frames and duration to be absent in C++ decoder #708

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jun 3, 2025

Conversation

scotts
Copy link
Contributor

@scotts scotts commented May 31, 2025

This is necessary to decode live streaming data (#695), but not sufficient. It's necessary because we need the C++ layer to be robust to the number of frames and duration being missing from the metadata, as that will not be available when decoding a live stream. But it's also not sufficient because the Python layer for VideoDecoder also asserts that metadata is present.

In fact, because VideoDecoder uses the number frames from the metadata in its __len__() method, we will need to expose a different public Python API to support live streaming. We should, however, be able to reuse the C++ layer.

I don't particularly love the way I'm currently achieving this, so I'm open to alternatives. I'm also concerned that when the number of frames or duration are missing, we might actually be open to segfaults. I think we should just get the exceptions from the inner decoding loop about not being able to decode frames, but I'm not sure.

Discussion points:

  1. Are we comfortable relaxing these requirements at the C++ level? We can (and do) still enforce some of this at the Python level if the Python APIs require this.
  2. If yes to 1, is this the best way to do it?
  3. We don't have any explicit tests for what happens when this data is not present. We don't have any files which have it missing. We could write some C++ tests that go in an just remove the values from the C++ VideoDecoder. That's ugly - is it worth it?

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label May 31, 2025
@scotts scotts marked this pull request as ready for review June 2, 2025 12:55
const StreamMetadata& streamMetadata) {
switch (seekMode_) {
case SeekMode::exact:
return streamMetadata.numFramesFromScan.value();
case SeekMode::approximate: {
TORCH_CHECK(
streamMetadata.numFrames.has_value(),
"Cannot use approximate mode since we couldn't find the number of frames from the metadata.");
return streamMetadata.numFrames.value();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we mean to return streamMetadata.numFrames instead of streamMetadata.numFrames.value()?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm missing something but I'm surprised this compiles. This function seems to always return int64_t, never a std::optional<int64_t> ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's overly broad, but still true. That is, the following should compile just fine:

std::optional<int> alwaysOne() {
  return 1;
}

It's just that the return value is always an optional with the value 1.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, it reminds me of this great random number generator:

def get_random_int():
    return 4

@NicolasHug
Copy link
Member

I think it's reasonable to relax the existing requirements when the video doesn't have enough metadata to enforce them. The good news is we haven't had any user report indicating that our hard TORCH_CHECK were limiting, so such videos hopefully aren't common in the wild (except for live streams).

On point 2. I'm a bit puzzled about what the PR is doing (see my question above) but overall I don't see a much better alternative.

And re tests / potential segfaults: we definitely want to test this, but I agree that we want to avoid writing some C++ tests for that, it's hard to maintain. We could create an issue to remind us to generate such bad video once we have our own video encoder implemented, it should be easier at this point. Until then, to make sure we don't have segfaults, maybe we could just modify the code locally to e.g. force numFrames to be nullops and see how our tests behave?

@scotts
Copy link
Contributor Author

scotts commented Jun 3, 2025

@NicolasHug, I hope the purpose of PR is more clear now that I've corrected my code. :)

And on how to quickly check we don't segfault: good thinking, I just tried that. Specifically, I commented out the lines:

int64_t frameCount = avStream->nb_frames;
if (frameCount > 0) {
streamMetadata.numFrames = frameCount;
}
if (avStream->duration > 0 && avStream->time_base.den > 0) {
streamMetadata.durationSeconds =
av_q2d(avStream->time_base) * avStream->duration;
}

That means that even if the metadata is present in the file, we just ignore it. As expected, we get tons of test failures - in particular, constructing a Python VideoDecoder object asserts those values are not None in the metadata. But no segfaults!

Copy link
Member

@NicolasHug NicolasHug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks for checking. I also tried the following, to bypass the instantiation failures related to missing metadata:

diff --git a/src/torchcodec/_core/SingleStreamDecoder.cpp b/src/torchcodec/_core/SingleStreamDecoder.cpp
index f4a285e..2947c64 100644
--- a/src/torchcodec/_core/SingleStreamDecoder.cpp
+++ b/src/torchcodec/_core/SingleStreamDecoder.cpp
@@ -1487,7 +1487,8 @@ std::optional<int64_t> SingleStreamDecoder::getNumFrames(
     case SeekMode::exact:
       return streamMetadata.numFramesFromScan.value();
     case SeekMode::approximate: {
-      return streamMetadata.numFrames;
+      return std::nullopt;
     }
     default:
       throw std::runtime_error("Unknown SeekMode");
@@ -1512,7 +1513,8 @@ std::optional<double> SingleStreamDecoder::getMaxSeconds(
     case SeekMode::exact:
       return streamMetadata.maxPtsSecondsFromScan.value();
     case SeekMode::approximate: {
-      return streamMetadata.durationSeconds;
+      return std::nullopt;
     }
     default:
       throw std::runtime_error("Unknown SeekMode");

And all the tests are passing except for 2 tests with the following error, which doesn't seem to be problematic:

E       AssertionError: Regex pattern did not match.
E        Regex: 'must be less than'
E        Input: 'Requested next frame while there are no more frames left to decode.'

Copy link
Member

@NicolasHug NicolasHug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks for checking. I also tried the following, to bypass the instantiation failures related to missing metadata:

diff --git a/src/torchcodec/_core/SingleStreamDecoder.cpp b/src/torchcodec/_core/SingleStreamDecoder.cpp
index f4a285e..2947c64 100644
--- a/src/torchcodec/_core/SingleStreamDecoder.cpp
+++ b/src/torchcodec/_core/SingleStreamDecoder.cpp
@@ -1487,7 +1487,8 @@ std::optional<int64_t> SingleStreamDecoder::getNumFrames(
     case SeekMode::exact:
       return streamMetadata.numFramesFromScan.value();
     case SeekMode::approximate: {
-      return streamMetadata.numFrames;
+      return std::nullopt;
     }
     default:
       throw std::runtime_error("Unknown SeekMode");
@@ -1512,7 +1513,8 @@ std::optional<double> SingleStreamDecoder::getMaxSeconds(
     case SeekMode::exact:
       return streamMetadata.maxPtsSecondsFromScan.value();
     case SeekMode::approximate: {
-      return streamMetadata.durationSeconds;
+      return std::nullopt;
     }
     default:
       throw std::runtime_error("Unknown SeekMode");

And all the tests are passing except for 2 tests with the following error, which doesn't seem to be problematic:

E       AssertionError: Regex pattern did not match.
E        Regex: 'must be less than'
E        Input: 'Requested next frame while there are no more frames left to decode.'

@scotts scotts merged commit 0f22b2b into pytorch:main Jun 3, 2025
47 checks passed
@scotts scotts deleted the relax_duration_frames branch June 3, 2025 13:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Meta Open Source bot.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants