Add SkyReels V2: Infinite-Length Film Generative Model #11518

Open: tolgacangoz wants to merge 256 commits into main

tolgacangoz (Contributor) commented May 7, 2025

Thanks for the opportunity to fix #11374!

Original Work

Original repo: https://github.com/SkyworkAI/SkyReels-V2
Paper: https://huggingface.co/papers/2504.13074

SkyReels V2's main contributions are summarized as follows:
• A comprehensive video captioner that understands shot language while capturing the general description of the video, which dramatically improves prompt adherence.
• Motion-specific preference optimization that enhances motion dynamics with a semi-automatic data collection pipeline.
• An effective diffusion-forcing adaptation that enables ultra-long video generation and story generation, providing a robust framework for extending temporal coherence and narrative depth.
• SkyCaptioner-V1 and the SkyReels-V2 series of models, including diffusion-forcing, text2video, image2video, camera director, and elements2video models in various sizes (1.3B, 5B, 14B), are open-sourced.

[Figure: main_pipeline]

TODOs:
• FlowMatchUniPCMultistepScheduler: just copy-pasted from the original repo
• SkyReelsV2Transformer3DModel: 90% WanTransformer3DModel
• SkyReelsV2DiffusionForcingPipeline
• SkyReelsV2DiffusionForcingImageToVideoPipeline: Includes FLF2V.
• SkyReelsV2DiffusionForcingVideoToVideoPipeline: Extends a given video.
• SkyReelsV2Pipeline
• SkyReelsV2ImageToVideoPipeline
• scripts/convert_skyreelsv2_to_diffusers.py

⏳ Did you make sure to update the documentation with your changes? Did you write any new necessary tests?: We will construct these during review.

T2V with Diffusion Forcing (OLD)

Model: Skywork/SkyReels-V2-DF-1.3B-540P

| Settings | Original repo | diffusers integration |
|---|---|---|
| seed 0 and num_frames 97 | original_0_short.mp4 | diffusers_0_short.mp4 |
| seed 37 and num_frames 97 | original_37_short.mp4 | diffusers_37_short.mp4 |
| seed 0 and num_frames 257 | original_0_long.mp4 | diffusers_0_long.mp4 |
| seed 37 and num_frames 257 | original_37_long.mp4 | diffusers_37_long.mp4 |

!pip install git+https://github.com/tolgacangoz/diffusers.git@skyreels-v2 ftfy -q
import torch
from diffusers import AutoencoderKLWan, SkyReelsV2DiffusionForcingPipeline
from diffusers.utils import export_to_video

vae = AutoencoderKLWan.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers",
    subfolder="vae",
    torch_dtype=torch.float32,
)
pipe = SkyReelsV2DiffusionForcingPipeline.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers",
    vae=vae,
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")
pipe.transformer.set_ar_attention(causal_block_size=5)

prompt = "A cat and a dog baking a cake together in a kitchen. The cat is carefully measuring flour, while the dog is stirring the batter with a wooden spoon. The kitchen is cozy, with sunlight streaming through the window."

output = pipe(
    prompt=prompt,
    num_inference_steps=30,
    height=544,
    width=960,
    num_frames=97,
    ar_step=5,  # Controls asynchronous inference (0 for synchronous mode)
    generator=torch.Generator(device="cpu").manual_seed(0),
    overlap_history=None,  # Number of frames to overlap for smooth transitions in long videos; 17 for long
    addnoise_condition=20,  # Improves consistency in long video generation
).frames[0]
export_to_video(output, "T2V.mp4", fps=24, quality=8)

"""
You can set `ar_step=5` to enable asynchronous inference. For asynchronous inference,
`causal_block_size=5` is recommended; it should not be set for synchronous
generation. Asynchronous inference takes more steps to diffuse the whole sequence,
which means it is SLOWER than synchronous mode. In our experiments, asynchronous
inference may improve instruction following and visual consistency.
"""
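
The speed trade-off described in the note above can be illustrated with a toy schedule. This is only a simplified sketch, not the pipeline's actual scheduling code: it assumes that latent frames are grouped into blocks of `causal_block_size`, that each block starts denoising `ar_step` iterations after the previous one, and that every block runs for `num_inference_steps` steps. The function name `total_iterations` is hypothetical, and the 25 latent frames correspond to `num_frames=97` under an assumed 4x temporal compression.

```python
import math


def total_iterations(num_latent_frames, num_inference_steps, ar_step, causal_block_size=1):
    """Toy model: count denoising iterations for a staggered schedule.

    Assumption (not taken from the pipeline source): block i starts
    `ar_step` iterations after block i - 1, so the last block finishes
    after `num_inference_steps + (num_blocks - 1) * ar_step` iterations.
    """
    num_blocks = math.ceil(num_latent_frames / causal_block_size)
    # With ar_step=0 every block starts together, i.e. synchronous mode.
    return num_inference_steps + (num_blocks - 1) * ar_step


sync_iters = total_iterations(25, 30, ar_step=0)
async_iters = total_iterations(25, 30, ar_step=5, causal_block_size=5)
print(sync_iters, async_iters)  # 30 50
```

Under these assumptions the asynchronous run needs 50 iterations instead of 30, which matches the qualitative claim that asynchronous inference is slower.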

I2V with Diffusion Forcing (OLD)

prompt="A penguin dances." diffusers integration
i2v-short.mp4
#!pip uninstall diffusers -yq
#!pip install git+https://github.com/tolgacangoz/diffusers.git@skyreels-v2 ftfy -q
import torch
from diffusers import AutoencoderKLWan, SkyReelsV2DiffusionForcingImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

vae = AutoencoderKLWan.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers",
    subfolder="vae",
    torch_dtype=torch.float32,
)
pipe = SkyReelsV2DiffusionForcingImageToVideoPipeline.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers",
    vae=vae,
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")
#pipe.transformer.set_ar_attention(causal_block_size=5)

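# Placeholder: replace the string below with a local path or URL for the penguin image
# from https://huggingface.co/tasks/image-to-video; as written it is not loadable.
image = load_image("Penguin from https://huggingface.co/tasks/image-to-video")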
prompt = "A penguin dances."

output = pipe(
    image=image,
    prompt=prompt,
    num_inference_steps=50,
    height=544,
    width=960,
    num_frames=97,
    #ar_step=5,  # Controls asynchronous inference (0 for synchronous mode)
    generator=torch.Generator(device="cpu").manual_seed(0),
    overlap_history=None,  # Number of frames to overlap for smooth transitions in long videos; 17 for long
    addnoise_condition=20,  # Improves consistency in long video generation
).frames[0]
export_to_video(output, "I2V.mp4", fps=24, quality=8)

"""
When I set `ar_step=5` and `causal_block_size=5`, the results look really bad.
"""

FLF2V with Diffusion Forcing (OLD)

Now, Houston, we have a problem.
I have been unable to produce good results with this task, even after trying many hyperparameter combinations with the original code.
The first frame's latent (torch.Size([1, 16, 1, 68, 120])) overwrites the first of the 25 frame latents in `latents` (torch.Size([1, 16, 25, 68, 120])). The last frame's latent is then concatenated, so `latents` becomes torch.Size([1, 16, 26, 68, 120]). After the denoising process, the appended last-frame latent is discarded and the remaining latents are decoded by the VAE. I also tried overwriting the final frame latent of `latents` with the last frame's latent instead of concatenating it, and keeping that latent at the end instead of discarding it, but still got bad results. Here are some results:
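
The shapes quoted above can be reproduced with a little arithmetic. The sketch below is a shape-only illustration, assuming a VAE that compresses time by 4x (with the first frame kept) and space by 8x; these factors are inferred from the numbers in this description (97 frames to 25 latents, 544x960 to 68x120), and the helper name `latent_shape` is hypothetical rather than part of the pipeline.

```python
def latent_shape(num_frames, height, width, latent_channels=16,
                 temporal_factor=4, spatial_factor=8, batch_size=1):
    """Return the (B, C, T, H, W) latent shape for a video of the given size.

    Assumed compression factors (4x temporal, 8x spatial) are inferred
    from the shapes quoted in the PR description, not read from the VAE config.
    """
    if (num_frames - 1) % temporal_factor != 0:
        raise ValueError(f"num_frames should be {temporal_factor} * k + 1, got {num_frames}")
    latent_frames = (num_frames - 1) // temporal_factor + 1
    return (batch_size, latent_channels, latent_frames,
            height // spatial_factor, width // spatial_factor)


latents = latent_shape(97, 544, 960)      # noise latents for the whole clip
first_frame = latent_shape(1, 544, 960)   # latent of a single conditioning frame

# FLF2V bookkeeping as described: the first-frame latent overwrites slot 0
# (shape unchanged), then the last-frame latent is appended on the frame axis.
with_last = latents[:2] + (latents[2] + 1,) + latents[3:]
# After denoising, the appended latent is dropped again before VAE decoding.
decoded_input = with_last[:2] + (with_last[2] - 1,) + with_last[3:]

print(latents)        # (1, 16, 25, 68, 120)
print(first_frame)    # (1, 16, 1, 68, 120)
print(with_last)      # (1, 16, 26, 68, 120)
print(decoded_input)  # (1, 16, 25, 68, 120)
```

The same helper also shows why the `num_frames` values used in these examples (97 and 257) both have the form `4 * k + 1`.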

First Frame Last Frame
0.mp4
1.mp4
2.mp4
3.mp4
4.mp4
5.mp4
6.mp4
7.mp4
#!pip uninstall diffusers -yq
#!pip install git+https://github.com/tolgacangoz/diffusers.git@skyreels-v2 ftfy -q
import torch
from diffusers import AutoencoderKLWan, SkyReelsV2DiffusionForcingImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

vae = AutoencoderKLWan.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers",
    subfolder="vae",
    torch_dtype=torch.float32,
)
pipe = SkyReelsV2DiffusionForcingImageToVideoPipeline.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers",
    vae=vae,
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")
#pipe.transformer.set_ar_attention(causal_block_size=5)

prompt = "CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird's feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, low-angle perspective."
negative_prompt = "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"
first_frame = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_first_frame.png")
last_frame = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_last_frame.png")

output = pipe(
    image=first_frame,
    last_image=last_frame,
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=50,
    height=544,
    width=960,
    num_frames=97,
    #ar_step=5,  # Controls asynchronous inference (0 for synchronous mode)
    generator=torch.Generator(device="cpu").manual_seed(0),
    overlap_history=None,  # Number of frames to overlap for smooth transitions in long videos; 17 for long
    addnoise_condition=20,  # Improves consistency in long video generation
).frames[0]
export_to_video(output, "FLF2V.mp4", fps=24, quality=8)

V2V with Diffusion Forcing (OLD)

This pipeline extends a given video.

| Input Video | diffusers integration |
|---|---|
| video1.mp4 | v2v.mp4 |

#!pip uninstall diffusers -yq
#!pip install git+https://github.com/tolgacangoz/diffusers.git@skyreels-v2 ftfy -q
import torch
from diffusers import AutoencoderKLWan, SkyReelsV2DiffusionForcingVideoToVideoPipeline
from diffusers.utils import export_to_video, load_video

vae = AutoencoderKLWan.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers",
    subfolder="vae",
    torch_dtype=torch.float32,
)
pipe = SkyReelsV2DiffusionForcingVideoToVideoPipeline.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers",
    vae=vae,
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")
#pipe.transformer.set_ar_attention(causal_block_size=5)

prompt = "CG animation style, a small blue bird flaps its wings. The bird's feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its continuing flight and the vastness of the sky from a close-up, low-angle perspective."
negative_prompt = "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"
video = load_video("Input video.mp4")

output = pipe(
    video=video,
    prompt=prompt,
    num_inference_steps=50,
    height=544,
    width=960,
    num_frames=120,
    base_num_frames=97,
    ar_step=0,  # Controls asynchronous inference (0 for synchronous mode)
    generator=torch.Generator(device="cpu").manual_seed(0),
    overlap_history=17,  # Number of frames to overlap for smooth transitions in long videos
    addnoise_condition=20,  # Improves consistency in long video generation
).frames[0]
export_to_video(output, "V2V.mp4", fps=24, quality=8)

Firstly, I want to congratulate you on this great work, and thanks for open-sourcing it, SkyReels Team! This PR proposes an integration of your model.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@yiyixuxu @a-r-r-o-w @linoytsaban @yjp999 @Howe2018 @RoseRollZhu @pftq @Langdx @guibinchen @qiudi0127 @nitinmukesh @tin2tin @ukaprch @okaris

ukaprch commented May 8, 2025

It's about time. Thanks.

tolgacangoz added 28 commits May 8, 2025 20:01
Replaces custom attention implementations with `SkyReelsV2AttnProcessor2_0` and the standard `Attention` module.
Updates `WanAttentionBlock` to use `FP32LayerNorm` and `FeedForward`.
Removes the `model_type` parameter, simplifying model architecture and attention block initialization.
Introduces new classes `SkyReelsV2ImageEmbedding` and `SkyReelsV2TimeTextImageEmbedding` for enhanced image and time-text processing. Refactors the `SkyReelsV2Transformer3DModel` to integrate these embeddings, updating the constructor parameters for better clarity and functionality. Removes unused classes and methods to streamline the codebase.
…ds and begin reorganizing the forward pass.
…hod, integrating rotary embeddings and improving attention handling. Removes the deprecated `rope_apply` function and streamlines the attention mechanism for better integration and clarity.
…ethod by updating parameter names for clarity, integrating attention masks, and improving the handling of encoder hidden states.
…ethod by enhancing the handling of time embeddings and encoder hidden states. Updates parameter names for clarity and integrates rotary embeddings, ensuring better compatibility with the model's architecture.
…ing components and streamline the text-to-video generation process. Updates class documentation and adjusts parameter handling for improved clarity and functionality.
…parameter handling and improving integration.
… model types by dynamically adjusting zero padding.
…substring matching for model directory checks
… scheduler for SkyReels pipelines, enhancing model integration
… Film Generative model, enhancing text-to-video generation examples, and updating model references throughout the API documentation.
… documentation, updating TOC and introducing new model and scheduler files.
…t flow matching scheduler parameter for I2V from 3.0 to 5.0, ensuring clarity in usage examples.
…elines, clarifying its role in asynchronous inference.
tolgacangoz marked this pull request as ready for review June 8, 2025 18:01
DN6 (Collaborator) commented Jun 9, 2025

Thank you, @tolgacangoz! @a-r-r-o-w, could you take a look, please?

tolgacangoz (Contributor, Author) commented Jun 10, 2025

Hi @nitinmukesh @tin2tin. Feel free to test and review this PR, just as you have done for other PRs.

nitinmukesh commented Jun 10, 2025

Thank you @tolgacangoz for making the feature available in diffusers.

I will test it now.

yiyixuxu added the roadmap label (Add to current release roadmap) Jun 11, 2025
yiyixuxu requested a review from a-r-r-o-w June 11, 2025 22:09
yiyixuxu and others added 5 commits June 11, 2025 12:10

Introduced a new markdown file detailing the SkyReelsV2Transformer3DModel, including usage instructions and model output specifications.

- Adjusted `in_channels` from 36 to 16 in `test_skyreels_v2_df_image_to_video.py`.
- Added new parameters: `overlap_history`, `num_frames`, and `base_num_frames` in `test_skyreels_v2_df_video_to_video.py`.
- Updated expected output shape in video tests from (17, 3, 16, 16) to (41, 3, 16, 16).
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

a-r-r-o-w (Member) left a comment

Thank you for the awesome work here, @tolgacangoz! The PR looks great. I just have some nits and changes that will help keep implementations consistent across our other models/processors, and clean up the pipelines a bit.

It is a massive PR to review, but that is not why it took me so long. I have to admit that the idea of diffusion forcing is new to me, and I couldn't fully wrap my head around it until I went through some different implementations. I don't know how you did it so fast :)

Also great work on figuring out the numerical precision matching!

Regarding hosting the models, we will try to establish contact with the SkyReels team (if we haven't already) and see if they can host the weights.

Comment on lines +32 to +39
- [SkyReels-V2 DF 1.3B - 540P](https://huggingface.co/Skywork/SkyReels-V2-DF-1.3B-540P-Diffusers)
- [SkyReels-V2 DF 14B - 540P](https://huggingface.co/Skywork/SkyReels-V2-DF-14B-540P-Diffusers)
- [SkyReels-V2 DF 14B - 720P](https://huggingface.co/Skywork/SkyReels-V2-DF-14B-720P-Diffusers)
- [SkyReels-V2 T2V 14B - 540P](https://huggingface.co/Skywork/SkyReels-V2-T2V-14B-540P-Diffusers)
- [SkyReels-V2 T2V 14B - 720P](https://huggingface.co/Skywork/SkyReels-V2-T2V-14B-720P-Diffusers)
- [SkyReels-V2 I2V 1.3B - 540P](https://huggingface.co/Skywork/SkyReels-V2-I2V-1.3B-540P-Diffusers)
- [SkyReels-V2 I2V 14B - 540P](https://huggingface.co/Skywork/SkyReels-V2-I2V-14B-540P-Diffusers)
- [SkyReels-V2 I2V 14B - 720P](https://huggingface.co/Skywork/SkyReels-V2-I2V-14B-720P-Diffusers)
Member

cc @yiyixuxu Do we have contact with the SkyReels team, and do we know if they would be okay with hosting the weights? If that's not possible, we could maintain a skyreels-community org, similar to hunyuan.


The example below demonstrates how to generate a video from text optimized for memory or inference speed.

<hfoptions id="T2V usage">
Member

Nice examples, great work!

Comment on lines +236 to +247
### Any-to-Video Controllable Generation

SkyReels-V2 supports various generation techniques which achieve controllable video generation. Some of the capabilities include:
- Control to Video (Depth, Pose, Sketch, Flow, Grayscale, Scribble, Layout, Boundary Box, etc.). Recommended library for preprocessing videos to obtain control videos: [huggingface/controlnet_aux]()
- Image/Video to Video (first frame, last frame, starting clip, ending clip, random clips)
- Inpainting and Outpainting
- Subject to Video (faces, object, characters, etc.)
- Composition to Video (reference anything, animate anything, swap anything, expand anything, move anything, etc.)

The general rule of thumb to keep in mind when preparing inputs for the SkyReels-V2 pipeline is that the input images, or frames of a video that you want to use for conditioning, should have a corresponding mask that is black in color. The black mask signifies that the model will not generate new content for that area, and only use those parts for conditioning the generation process. For parts/frames that should be generated by the model, the mask should be white in color.

The code snippets available in [this](https://github.com/huggingface/diffusers/pull/11582) pull request demonstrate some examples of how videos can be generated with controllability signals.
Member

I think this part is adapted from the Wan VACE documentation, no? IIUC SkyReels does not support all these tasks though, so let's remove this


- The number of frames per second (fps) or `k` should be calculated by `4 * k + 1`.

- Try lower `shift` values (`2.0` to `5.0`) for lower resolution videos and higher `shift` values (`7.0` to `12.0`) for higher resolution images.
Member

Suggested change (replace "images" with "videos" at the end of the sentence):
- Try lower `shift` values (`2.0` to `5.0`) for lower resolution videos and higher `shift` values (`7.0` to `12.0`) for higher resolution videos.


## Notes

- SkyReels-V2 supports LoRAs with [`~loaders.WanLoraLoaderMixin.load_lora_weights`].
Member

Since we have a completely new transformer implementation (in the sense that we have new file, but similar as Wan), let's create a new lora loader mixin

video = pipe(**inputs).frames
generated_video = video[0]

self.assertEqual(generated_video.shape, (21, 3, 16, 16))
Member

I don't fully understand the total num_frames logic (we set 9 above but expect 21 here). Could you explain it a bit and provide a small example?

pass


# TODO: Is this FLF2V test necessary, because the original repo doesn't seem to have this functionality for this pipeline?
Member

From reading through the code paths, I don't think there is something that could potentially break easily when handling last image. If you think there is, we can keep the test. Otherwise, to the above test suite, let's just add a simple extension test_inference_with_last_image for minimal testing

hidden_states=latent_model_input,
timestep=timestep,
encoder_hidden_states=negative_prompt_embeds,
flag_df=True,
Member

(nit): I would give this a more descriptive name, like `enable_diffusion_forcing`

@@ -0,0 +1,1109 @@
# Copyright 2025 The SkyReels-V2 Team, The Wan Team and The HuggingFace Team. All rights reserved.
Member

Same comments as for the diffusion forcing pipeline above. We could probably wrap the repeated logic into a helper function and call that directly.

@@ -0,0 +1,962 @@
# Copyright 2025 The SkyReels-V2 Team, The Wan Team and The HuggingFace Team. All rights reserved.
Member

Same comments as above about `shift` and a helper function for the repeated logic.

Labels
roadmap Add to current release roadmap
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

[Feature request] Integrate SkyReels-V2 support in diffusers
7 participants