Add SkyReels V2: Infinite-Length Film Generative Model #11518
Conversation
It's about time. Thanks.
Replaces custom attention implementations with `SkyReelsV2AttnProcessor2_0` and the standard `Attention` module. Updates `WanAttentionBlock` to use `FP32LayerNorm` and `FeedForward`. Removes the `model_type` parameter, simplifying model architecture and attention block initialization.
Introduces new classes `SkyReelsV2ImageEmbedding` and `SkyReelsV2TimeTextImageEmbedding` for enhanced image and time-text processing. Refactors the `SkyReelsV2Transformer3DModel` to integrate these embeddings, updating the constructor parameters for better clarity and functionality. Removes unused classes and methods to streamline the codebase.
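As a rough picture of what that refactor implies (not the PR's actual code), a block assembled from diffusers' stock `Attention`, `FP32LayerNorm`, and `FeedForward` modules might look like the sketch below; the dimensions and defaults are illustrative only.

```python
import torch.nn as nn
from diffusers.models.attention import FeedForward
from diffusers.models.attention_processor import Attention
from diffusers.models.normalization import FP32LayerNorm


class SkyReelsV2TransformerBlockSketch(nn.Module):
    def __init__(self, dim=1536, num_heads=12, ffn_dim=8960, eps=1e-6):
        super().__init__()
        # Self-attention goes through the shared Attention module; the PR plugs in
        # SkyReelsV2AttnProcessor2_0 as its processor instead of a hand-rolled implementation.
        self.norm1 = FP32LayerNorm(dim, eps, elementwise_affine=False)
        self.attn1 = Attention(query_dim=dim, heads=num_heads, dim_head=dim // num_heads, eps=eps)
        self.norm2 = FP32LayerNorm(dim, eps, elementwise_affine=False)
        self.ffn = FeedForward(dim, inner_dim=ffn_dim, activation_fn="gelu-approximate")

    def forward(self, hidden_states):
        # Pre-norm residual attention followed by a pre-norm residual feed-forward.
        hidden_states = hidden_states + self.attn1(self.norm1(hidden_states))
        hidden_states = hidden_states + self.ffn(self.norm2(hidden_states))
        return hidden_states
```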
…ds and begin reorganizing the forward pass.
…hod, integrating rotary embeddings and improving attention handling. Removes the deprecated `rope_apply` function and streamlines the attention mechanism for better integration and clarity.
…ethod by updating parameter names for clarity, integrating attention masks, and improving the handling of encoder hidden states.
…ethod by enhancing the handling of time embeddings and encoder hidden states. Updates parameter names for clarity and integrates rotary embeddings, ensuring better compatibility with the model's architecture.
…ing components and streamline the text-to-video generation process. Updates class documentation and adjusts parameter handling for improved clarity and functionality.
…parameter handling and improving integration.
…to streamline video generation process.
… model types by dynamically adjusting zero padding.
…substring matching for model directory checks
… across image and video pipelines
… scheduler for SkyReels pipelines, enhancing model integration
… Film Generative model, enhancing text-to-video generation examples, and updating model references throughout the API documentation.
… documentation, updating TOC and introducing new model and scheduler files.
…t flow matching scheduler parameter for I2V from 3.0 to 5.0, ensuring clarity in usage examples.
…elines, clarifying its role in asynchronous inference.
Thank you @tolgacangoz @a-r-r-o-w. Could you take a look, please?
Hi @nitinmukesh @tin2tin. You can test and review this PR just as you have done in other PRs, if you want.
Thank you @tolgacangoz for making the feature available in diffusers. I will test it now.
Introduced a new markdown file detailing the SkyReelsV2Transformer3DModel, including usage instructions and model output specifications.
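For reference, such model pages usually open with a small loading snippet along these lines; the checkpoint id is taken from the list further down, and the final doc's wording may differ.

```python
import torch
from diffusers import SkyReelsV2Transformer3DModel

# Load only the transformer from a converted checkpoint (id taken from the list below).
transformer = SkyReelsV2Transformer3DModel.from_pretrained(
    "Skywork/SkyReels-V2-DF-1.3B-540P-Diffusers",
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
)
```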
- Adjusted `in_channels` from 36 to 16 in `test_skyreels_v2_df_image_to_video.py`.
- Added new parameters: `overlap_history`, `num_frames`, and `base_num_frames` in `test_skyreels_v2_df_video_to_video.py`.
- Updated expected output shape in video tests from (17, 3, 16, 16) to (41, 3, 16, 16).
Thank you for the awesome work here @tolgacangoz! The PR looks great. I just have some nits and changes that will help keep consistent implementations across our other model/processors, and cleanup the pipelines a bit.
It is a massive PR to review, but that's not why it took me so long. I'll admit the idea of diffusion forcing was new to me, and I couldn't fully wrap my head around it until going through a few different implementations. Don't know how you did it so fast :)
Also great work on figuring out the numerical precision matching!
Regarding hosting the models, we will try to establish contact with the SkyReels team (if not already) and see if they can host the weights.
- [SkyReels-V2 DF 1.3B - 540P](https://huggingface.co/Skywork/SkyReels-V2-DF-1.3B-540P-Diffusers)
- [SkyReels-V2 DF 14B - 540P](https://huggingface.co/Skywork/SkyReels-V2-DF-14B-540P-Diffusers)
- [SkyReels-V2 DF 14B - 720P](https://huggingface.co/Skywork/SkyReels-V2-DF-14B-720P-Diffusers)
- [SkyReels-V2 T2V 14B - 540P](https://huggingface.co/Skywork/SkyReels-V2-T2V-14B-540P-Diffusers)
- [SkyReels-V2 T2V 14B - 720P](https://huggingface.co/Skywork/SkyReels-V2-T2V-14B-720P-Diffusers)
- [SkyReels-V2 I2V 1.3B - 540P](https://huggingface.co/Skywork/SkyReels-V2-I2V-1.3B-540P-Diffusers)
- [SkyReels-V2 I2V 14B - 540P](https://huggingface.co/Skywork/SkyReels-V2-I2V-14B-540P-Diffusers)
- [SkyReels-V2 I2V 14B - 720P](https://huggingface.co/Skywork/SkyReels-V2-I2V-14B-720P-Diffusers)
cc @yiyixuxu Do we have contact with the SkyReels team, and do we know if they would be okay with hosting the weights? If it's not possible, we could maintain a skyreels-community org similar to hunyuan.
The example below demonstrates how to generate a video from text optimized for memory or inference speed.
<hfoptions id="T2V usage">
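Only the opening `<hfoptions>` tag of that example is quoted here, so below is a rough sketch of what a memory-friendly T2V example might look like; the checkpoint id comes from the list above, while the `shift` value, frame count, and exact call signature are assumptions rather than the PR's final example.

```python
import torch
from diffusers import SkyReelsV2Pipeline, FlowMatchUniPCMultistepScheduler
from diffusers.utils import export_to_video

# Checkpoint id taken from the model list above; other values are illustrative.
pipe = SkyReelsV2Pipeline.from_pretrained(
    "Skywork/SkyReels-V2-T2V-14B-540P-Diffusers", torch_dtype=torch.bfloat16
)
# The tips below recommend tuning `shift` by resolution; 8.0 is only a guess for 540P.
pipe.scheduler = FlowMatchUniPCMultistepScheduler.from_config(pipe.scheduler.config, shift=8.0)
pipe.enable_model_cpu_offload()  # trade speed for lower VRAM usage

prompt = "A cat and a dog baking a cake together in a kitchen."
video = pipe(prompt=prompt, num_frames=97, guidance_scale=6.0).frames[0]
export_to_video(video, "output.mp4", fps=24)
```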
Nice examples, great work!
### Any-to-Video Controllable Generation

SkyReels-V2 supports various generation techniques which achieve controllable video generation. Some of the capabilities include:
- Control to Video (Depth, Pose, Sketch, Flow, Grayscale, Scribble, Layout, Boundary Box, etc.). Recommended library for preprocessing videos to obtain control videos: [huggingface/controlnet_aux]()
- Image/Video to Video (first frame, last frame, starting clip, ending clip, random clips)
- Inpainting and Outpainting
- Subject to Video (faces, object, characters, etc.)
- Composition to Video (reference anything, animate anything, swap anything, expand anything, move anything, etc.)

The general rule of thumb to keep in mind when preparing inputs for the SkyReels-V2 pipeline is that the input images, or frames of a video that you want to use for conditioning, should have a corresponding mask that is black in color. The black mask signifies that the model will not generate new content for that area, and only use those parts for conditioning the generation process. For parts/frames that should be generated by the model, the mask should be white in color.

The code snippets available in [this](https://github.com/huggingface/diffusers/pull/11582) pull request demonstrate some examples of how videos can be generated with controllability signals.
I think this part is adapted from the Wan VACE documentation, no? IIUC SkyReels does not support all these tasks though, so let's remove this
- The number of frames should be calculated as `4 * k + 1`, where `k` is the frames per second (fps).
- Try lower `shift` values (`2.0` to `5.0`) for lower resolution videos and higher `shift` values (`7.0` to `12.0`) for higher resolution images.
Suggested change:
- Try lower `shift` values (`2.0` to `5.0`) for lower resolution videos and higher `shift` values (`7.0` to `12.0`) for higher resolution videos.
## Notes
- SkyReels-V2 supports LoRAs with [`~loaders.WanLoraLoaderMixin.load_lora_weights`]. |
Since we have a completely new transformer implementation (in the sense that we have a new file, though it is similar to Wan), let's create a new LoRA loader mixin.
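Something like the sketch below is presumably what's being suggested; whether the new mixin subclasses `WanLoraLoaderMixin` or is duplicated wholesale (as diffusers usually does, with `# Copied from` markers) is an open design choice, and the names here are illustrative only.

```python
from diffusers.loaders import WanLoraLoaderMixin


class SkyReelsV2LoraLoaderMixin(WanLoraLoaderMixin):
    """LoRA loading/saving for pipelines built on SkyReelsV2Transformer3DModel.

    In practice diffusers tends to copy the Wan implementation into a new class
    rather than subclass it, so treat this inheritance as shorthand for a sketch.
    """

    # Only the transformer carries LoRA layers for this model family.
    _lora_loadable_modules = ["transformer"]
    transformer_name = "transformer"
```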
video = pipe(**inputs).frames
generated_video = video[0]

self.assertEqual(generated_video.shape, (21, 3, 16, 16))
I don't fully understand the total num_frames logic (we set 9 above but expect 21 here). Could you explain it a bit and provide a small example?
pass

# TODO: Is this FLF2V test necessary, because the original repo doesn't seem to have this functionality for this pipeline?
From reading through the code paths, I don't think there is anything that could easily break when handling the last image. If you think there is, we can keep the test. Otherwise, let's just add a simple extension test, `test_inference_with_last_image`, to the above test suite for minimal testing.
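If it helps, a minimal version of that extension test could look roughly like this; it leans on the usual `PipelineTesterMixin` helpers (`get_dummy_components`, `get_dummy_inputs`) and a `last_image` input name, all of which are assumptions about the final test suite rather than its actual contents.

```python
def test_inference_with_last_image(self):
    # Minimal smoke test: run the pipeline with a last-frame condition and check the output shape.
    device = "cpu"
    components = self.get_dummy_components()
    pipe = self.pipeline_class(**components)
    pipe.to(device)
    pipe.set_progress_bar_config(disable=None)

    inputs = self.get_dummy_inputs(device)
    inputs["last_image"] = inputs["image"]  # reuse the dummy first frame as the last frame
    video = pipe(**inputs).frames[0]

    # Shape check only; expected-slice comparisons could be added later if needed.
    self.assertEqual(video.shape[1:], (3, 16, 16))
```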
hidden_states=latent_model_input,
timestep=timestep,
encoder_hidden_states=negative_prompt_embeds,
flag_df=True,
(nit): I would give this a better name, like `enable_diffusion_forcing`.
@@ -0,0 +1,1109 @@
# Copyright 2025 The SkyReels-V2 Team, The Wan Team and The HuggingFace Team. All rights reserved. |
Same comments as for the diffusion forcing pipeline above. We could probably wrap the repeated logic into a helper function and call that directly.
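For illustration, the repeated conditional/unconditional forward passes could be pulled into something like the helper below; the argument names follow the Wan-style transformer call signature and are placeholders, not the PR's actual code.

```python
def _predict_noise(
    transformer,
    latents,
    timestep,
    prompt_embeds,
    negative_prompt_embeds,
    guidance_scale,
    attention_kwargs=None,
):
    # Conditional prediction.
    noise_pred = transformer(
        hidden_states=latents,
        timestep=timestep,
        encoder_hidden_states=prompt_embeds,
        attention_kwargs=attention_kwargs,
        return_dict=False,
    )[0]
    # Classifier-free guidance with the negative prompt, if enabled.
    if guidance_scale > 1.0:
        noise_uncond = transformer(
            hidden_states=latents,
            timestep=timestep,
            encoder_hidden_states=negative_prompt_embeds,
            attention_kwargs=attention_kwargs,
            return_dict=False,
        )[0]
        noise_pred = noise_uncond + guidance_scale * (noise_pred - noise_uncond)
    return noise_pred
```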
@@ -0,0 +1,962 @@
# Copyright 2025 The SkyReels-V2 Team, The Wan Team and The HuggingFace Team. All rights reserved. |
Same comments as above about `shift` and a helper function for the repeated logic.
Thanks for the opportunity to fix #11374!
Original Work
Original repo: https://github.com/SkyworkAI/SkyReels-V2
Paper: https://huggingface.co/papers/2504.13074
TODOs:
- ✅ `FlowMatchUniPCMultistepScheduler`: just copy-pasted from the original repo
- ✅ `SkyReelsV2Transformer3DModel`: 90% `WanTransformer3DModel`
- ✅ `SkyReelsV2DiffusionForcingPipeline`
- ✅ `SkyReelsV2DiffusionForcingImageToVideoPipeline`: includes FLF2V
- ✅ `SkyReelsV2DiffusionForcingVideoToVideoPipeline`: extends a given video
- ✅ `SkyReelsV2Pipeline`
- ✅ `SkyReelsV2ImageToVideoPipeline`
- ✅ `scripts/convert_skyreelsv2_to_diffusers.py`: `tolgacangoz/SkyReels-V2-Diffusers`
- ⏳ Did you make sure to update the documentation with your changes? Did you write any new necessary tests?: We will construct these during review.
T2V with Diffusion Forcing (OLD)

| original | `diffusers` integration |
| --- | --- |
| original_0_short.mp4 | diffusers_0_short.mp4 |
| original_37_short.mp4 | diffusers_37_short.mp4 |
| original_0_long.mp4 | diffusers_0_long.mp4 |
| original_37_long.mp4 | diffusers_37_long.mp4 |
I2V with Diffusion Forcing (OLD)

`prompt = "A penguin dances."`

`diffusers` integration: i2v-short.mp4
FLF2V with Diffusion Forcing (OLD)
Now, Houston, we have a problem.
I have been unable to produce good results with this task. I tried many hyperparameter combinations with the original code.
The first frame's latent (`torch.Size([1, 16, 1, 68, 120])`) is overwritten onto the first of the 25 frame latents in `latents` (`torch.Size([1, 16, 25, 68, 120])`). Then the last frame's latent is concatenated, so `latents` becomes `torch.Size([1, 16, 26, 68, 120])`. After the denoising process, the extra last-frame latent is discarded and the result is decoded by the VAE. I also tried overwriting the last frame's latent onto the final frame of `latents` instead of concatenating it (and not discarding the final latent at the end), but still got bad results. Here are some results:

0.mp4
1.mp4
2.mp4
3.mp4
4.mp4
5.mp4
6.mp4
7.mp4
V2V with Diffusion Forcing (OLD)
This pipeline extends a given video.

`diffusers` integration:
- video1.mp4
- v2v.mp4
Firstly, I want to congratulate you on this great work, and thanks for open-sourcing it, SkyReels Team! This PR proposes an integration of your model.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@yiyixuxu @a-r-r-o-w @linoytsaban @yjp999 @Howe2018 @RoseRollZhu @pftq @Langdx @guibinchen @qiudi0127 @nitinmukesh @tin2tin @ukaprch @okaris