Verify decoder outputs #728


Merged — 10 commits merged into pytorch:main on Jun 17, 2025

Conversation

@Dan-Flores (Contributor) commented Jun 16, 2025

Overview

While adding the OpenCV decoder in #711, we realized that OpenCV was decoding a different stream than TorchCodec, resulting in unexpected benchmark results. This PR implements a function to compare the outputs of decoders.

Example usage

By adding --verify-outputs to the benchmark command, the script outputs a warning indicating that the output frames are different for TorchCodec in approximate mode compared to OpenCV on nasa_13013.mp4:

python benchmarks/decoders/benchmark_decoders.py --decoders torchaudio,torchcodec_public:seek_mode=approximate,opencv   --verify-outputs 
(screenshot of the warning output)
Error while comparing TorchCodecPublic:seek_mode=approximate and OpenCV[backend=FFMPEG]: The values for attribute 'shape' do not match: torch.Size([3, 270, 480]) != torch.Size([3, 180, 320]).

The dimensions match when stream_index is utilized, but the tensors are unequal:

python benchmarks/decoders/benchmark_decoders.py --decoders torchaudio:stream_index=0,torchcodec_public:seek_mode=approximate+stream_index=0,opencv   --verify-outputs --video-paths output.mp4

AssertionError: Tensor-likes are not equal!

Mismatched elements: 38759 / 172800 (22.4%)
Greatest absolute difference: 16 at index (0, 98, 96)
Greatest relative difference: inf at index (0, 44, 210)

For further testing, I generated a video using ffmpeg, on which all results matched:

ffmpeg -f lavfi -i testsrc=duration=10:size=1280x720:rate=30 -vf "drawtext=fontfile=/path/to/font.ttf: text='%{pts\:hms}': x=(w-text_w)/2: y=(h-text_h)/2: fontsize=48: fontcolor=white: box=1: boxcolor=black@0.5: boxborderw=5, hue=H=2*t" -pix_fmt yuv420p output.mp4
python benchmarks/decoders/benchmark_decoders.py --decoders torchaudio:stream_index=0,torchcodec_public:seek_mode=approximate+stream_index=0,opencv   --verify-outputs --video-paths output.mp4
video=output.mp4, decoder=OpenCV[backend=FFMPEG]
Results of baseline TorchCodecPublic and OpenCV[backend=FFMPEG] match!
video=output.mp4, decoder=TorchAudio:stream_index=0
Results of baseline TorchCodecPublic and TorchAudio:stream_index=0 match!
video=output.mp4, decoder=TorchCodecPublic:seek_mode=approximate+stream_index=0
Results of baseline TorchCodecPublic and TorchCodecPublic:seek_mode=approximate+stream_index=0 match!
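
Conceptually, the new check decodes the same evenly spaced timestamps with every decoder and asserts that the frames match a TorchCodecPublic baseline. Below is a minimal sketch of that flow; the decoder-wrapper interface (decode_frames, get_duration) and helper names are assumptions about the benchmark harness rather than the exact code in this PR, and the review comments below refine several details.

import torch

def verify_outputs(decoders_to_run, video_paths, num_samples):
    # decoders_to_run maps display names to benchmark decoder wrappers that
    # expose decode_frames(path, pts_list) -- an assumed interface.
    for video_path in video_paths:
        baseline_name, baseline = next(iter(decoders_to_run.items()))
        duration = baseline.get_duration(video_path)  # hypothetical metadata helper
        # Evenly spaced PTS values across the video.
        pts_list = [i * duration / num_samples for i in range(num_samples)]
        baseline_frames = baseline.decode_frames(video_path, pts_list)

        for name, decoder in decoders_to_run.items():
            if name == baseline_name:
                continue
            frames = decoder.decode_frames(video_path, pts_list)
            for expected, actual in zip(baseline_frames, frames):
                # A mismatch raises AssertionError, as in the outputs above.
                torch.testing.assert_close(actual, expected)
            print(f"video={video_path}, decoder={name}")
            print(f"Results of baseline {baseline_name} and {name} match!")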

@facebook-github-bot added the "CLA Signed" label on Jun 16, 2025 (this label is managed by the Meta Open Source bot).
try:
    torch.testing.assert_close(f1, f2)
except Exception as e:
    tensorcat(f1)
Dan-Flores (Contributor Author):

This library is useful for visually comparing frames, but I am open to removing it since it is not necessary.

Contributor:

Yeah, I think we should remove it. Even though most users won't run the benchmarks, we still want to keep dependencies down.

I also think we can simplify this part and just do the plain assertion. For this kind of validation, it's generally better to get a failure than to just print the result to stdout. Our CI, and a lot of other monitoring infra, uses the process exit code to determine whether there was a problem. With the current approach, we'd need to parse stdout to figure out the status if we ever wanted to hook this up to CI.
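
A minimal sketch of the simplified check being suggested (assert_frames_match is a hypothetical helper name): just let torch.testing.assert_close raise, so a mismatch surfaces as an AssertionError and a non-zero exit code that CI can detect without parsing stdout.

import torch

def assert_frames_match(baseline_frames, candidate_frames):
    # No try/except and no tensorcat visualization: on mismatch, assert_close
    # raises AssertionError, the script exits non-zero, and CI can detect the
    # failure from the exit code alone.
    for f1, f2 in zip(baseline_frames, candidate_frames):
        torch.testing.assert_close(f1, f2)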

Dan-Flores (Contributor Author):

Thanks for the feedback, I'll remove the library and keep that in mind going forward.

# Import library to show frames that don't match
from tensorcat import tensorcat

# Reuse TorchCodecPublic decoder with options, if provided, as the baseline
Dan-Flores (Contributor Author):

Is TorchCodecPublic the correct decoder to use as the baseline?

Contributor:

Good question - yes, I think so. We have a large battery of unit tests testing for bit-for-bit correctness on TorchCodec, and it's easier to test on the public API.

None,
):
torchcodec_public_decoder = decoders_to_run[torchcodec_display_name]
# Create default TorchCodecPublic decoder to use as a baseline
Contributor:

This means that the reference decoder will be subject to the options that the user provides, such as seek_mode. I think we shouldn't try to use the options the user provided, but instead decide what the reference decoder is, and always use that. That means that we probably shouldn't use the default options, but decide what options we want to use. Regarding seek_mode, I think we should probably use exact as the reference.

Dan-Flores (Contributor Author):

One catch: given the new 'stream_index' option, we rely on the user to indicate which stream the benchmarks should compare. Without it, the outputs of OpenCV and TorchCodecPublic would not match.

I agree that exact should be used as the reference. I've updated the code to only reuse the stream_index argument.
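
A hedged sketch of that baseline construction, assuming TorchCodec's public VideoDecoder and its stream_index/seek_mode parameters (the helper name and option plumbing are illustrative, not the PR's exact code):

from torchcodec.decoders import VideoDecoder

def make_baseline_decoder(video_path, user_stream_index=None):
    # The reference is always TorchCodecPublic with exact seeking, regardless
    # of the seek_mode being benchmarked. Only the user-supplied stream_index
    # is reused, so the baseline reads the same stream from multi-stream files.
    return VideoDecoder(
        video_path,
        stream_index=user_stream_index,
        seek_mode="exact",
    )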

Contributor:

Oh! But we want to use the stream_index that the user provided. Good call. :)

@@ -177,6 +135,9 @@ def main() -> None:
        if entry.is_file() and entry.name.endswith(".mp4"):
            video_paths.append(entry.path)

    if args.verify_outputs:
        verify_outputs(decoders_to_run, video_paths, num_uniform_samples)

Contributor:

I think this option is most useful if it's mutually exclusive with running the actual benchmarks. That way someone can specify it to quickly test the benchmark's correctness. So I think we should make running the benchmarks and printing the results the else branch here.
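
A small sketch of the suggested control flow in main(); parse_and_setup and run_benchmarks are placeholders for whatever the script actually does on the benchmark path:

def main() -> None:
    # parse_and_setup is a stand-in for the existing argument parsing and
    # video discovery earlier in the script.
    args, decoders_to_run, video_paths, num_uniform_samples = parse_and_setup()

    if args.verify_outputs:
        # Quick correctness check of the decoder setup; skip the timing runs.
        verify_outputs(decoders_to_run, video_paths, num_uniform_samples)
    else:
        # Normal path: run the benchmarks and print the results.
        run_benchmarks(decoders_to_run, video_paths, num_uniform_samples)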

# Generate uniformly random PTS
duration = metadata.duration_seconds
pts_list = [i * duration / num_samples for i in range(num_samples)]

Contributor:

Technically this is uniformly-spaced PTS values, or just evenly-spaced PTS values. Uniformly random would look like pts_list = (torch.rand(num_samples) * duration).tolist().

# Decode non-sequential frames using decode_frames function
random_frames = decoder.decode_frames(video_file_path, pts_list)
# Extract the frames from the FrameBatch if necessary
if isinstance(random_frames, FrameBatch):
Contributor:

On comments: often the what is something a reader can easily deduce. It's the why that usually needs a comment. In this case, the why is that TorchCodec's batch APIs return a FrameBatch. We sometimes use these APIs in our experiments, and we just return that directly. But for all other decoders, we just return a list of frames.
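
A hedged sketch of that normalization, assuming FrameBatch exposes its stacked frames through a .data tensor (as TorchCodec's batch APIs do), so every decoder's output can be treated as a plain list of frame tensors:

from torchcodec import FrameBatch

def to_frame_list(decoded):
    # TorchCodec's batch APIs return a FrameBatch with frames stacked along
    # the first dimension of .data; the other decoders in the benchmark return
    # a plain list of frame tensors. Normalize so the comparison loop can
    # treat every decoder uniformly.
    if isinstance(decoded, FrameBatch):
        return list(decoded.data.unbind(dim=0))
    return decoded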

pts_list=pts_list,
)
decoders_and_frames.append((decoder_name, frames))

Contributor:

Since we just want to assert that the frames are close (see comment below), I think we can simplify even further and just do that assertion here. Then we don't need to do a separate loop over decoders_and_frames. In fact, I think we don't even need to record decoders_and_frames, as we don't need to remember the frames after the assertion.

Dan-Flores (Contributor Author):

Good point, I was able to significantly simplify this portion.

@scotts (Contributor) left a review comment:

Thanks for this! I think addressing the remaining comments will simplify the verification and make it easier for us to hook this up to CI (eventually, if we want to).

@scotts (Contributor) commented Jun 17, 2025:

This is great! We should have been doing this all along. :) Let's go ahead and merge - I'm confident that the testing itself is correct, and I suspect we have a bug in how we're decoding with OpenCV. Let's create a new issue to investigate the problem with how we're using OpenCV, and label it as a "bug."

@Dan-Flores merged commit d88cf1a into pytorch:main on Jun 17, 2025
42 of 44 checks passed