Speedup model loading by 4-5x ⚡ #11904
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks a lot for quickly getting this up 🔥
My comments are mostly minor, the major one being adding hf_quantizer to the allocator function.
Additionally, for a potentially better user experience, it would be helpful if we could rethink the to() method of DiffusionPipeline. I mean the following: currently, from what I understand, we have to first initialize the denoiser using device_map and then the rest of the components. If a user is calling .to() on a DiffusionPipeline, we could consider using device_map="cuda" to dispatch the model-level components to CUDA. I don't immediately see a downside to it.
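For context, here is a minimal sketch of the pattern being discussed. The checkpoint name and dtype are illustrative, and it assumes the model-level from_pretrained accepts device_map="cuda" (the fast path this PR targets); it is not the exact code from the PR.

```python
import torch
from diffusers import DiffusionPipeline, FluxTransformer2DModel

repo_id = "black-forest-labs/FLUX.1-dev"  # example checkpoint, for illustration only

# Current pattern: load the denoiser with device_map so its weights are
# allocated directly on the accelerator, then hand it to the pipeline.
transformer = FluxTransformer2DModel.from_pretrained(
    repo_id, subfolder="transformer", torch_dtype=torch.bfloat16, device_map="cuda"
)
pipe = DiffusionPipeline.from_pretrained(
    repo_id, transformer=transformer, torch_dtype=torch.bfloat16
)
pipe.to("cuda")  # the remaining components still go through a regular .to() move

# Suggested behaviour: DiffusionPipeline.to("cuda") could internally use
# device_map="cuda" for each model-level component, so users get the fast
# direct-to-device allocation without loading components manually.
```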
@@ -520,3 +526,64 @@ def load_gguf_checkpoint(gguf_checkpoint_path, return_tensors=False):
        parsed_parameters[name] = GGUFParameter(weights, quant_type=quant_type) if is_gguf_quant else weights

    return parsed_parameters


def _find_mismatched_keys(
Taken out of here:
diffusers/src/diffusers/models/modeling_utils.py, line 1509 (at 9f4d997):
def _find_mismatched_keys(
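For readers unfamiliar with the helper, here is a rough sketch of what a shape-mismatch check of this kind typically does. This is not the exact diffusers implementation; the signature and names are illustrative.

```python
def find_mismatched_keys(state_dict, model_state_dict, loaded_keys, ignore_mismatched_sizes):
    """Collect checkpoint keys whose tensor shapes disagree with the model's parameters."""
    mismatched_keys = []
    if not ignore_mismatched_sizes:
        return mismatched_keys
    for key in loaded_keys:
        if key in model_state_dict and state_dict[key].shape != model_state_dict[key].shape:
            # Record the offending key with both shapes and drop it from the state
            # dict, so loading can proceed with the model's own (re)initialized tensor.
            mismatched_keys.append((key, state_dict[key].shape, model_state_dict[key].shape))
            del state_dict[key]
    return mismatched_keys
```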
if device_type is None:
    device_type = get_device()
device_mod = getattr(torch, device_type, torch.cuda)
device_mod.synchronize()
I guess all different backends ought to have this method. Just flagging.
AFAIK, synchronize should be available on all devices. Only the empty_cache function required a special check, because it would fail if the device was CPU.
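A minimal sketch of that special-casing, with illustrative helper names (not the actual diffusers utilities):

```python
import torch

def sync_device(device_type: str) -> None:
    # synchronize() is available on every torch device module (cuda, xpu, mps, ...),
    # so no special handling is needed here.
    device_mod = getattr(torch, device_type, torch.cuda)
    device_mod.synchronize()

def clear_cache(device_type: str) -> None:
    # empty_cache() would fail on CPU, so only accelerator backends call it.
    if device_type == "cpu":
        return
    device_mod = getattr(torch, device_type, torch.cuda)
    device_mod.empty_cache()
```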
Ship
Thanks for this! Just a nit.
All thanks to @Cyrilvallez's PR: huggingface/transformers#36380
The accelerate PR is required because we end up calling clear_device_cache in a loop (over the sharded files), which is bad; without it, you'll see no speedup. Another small optimization is using non_blocking everywhere and syncing just before returning control to the user, which is slightly faster.
Sister PR in accelerate required to obtain speedup: huggingface/accelerate#3674
Load time before: 16.765s
Load time after: 4.521s
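For illustration, a hedged sketch of the "non_blocking everywhere, sync once at the end" pattern described above. The function and variable names are illustrative, not the actual diffusers loading code.

```python
import torch

def load_shards_to_device(shard_state_dicts, model, device="cuda"):
    for state_dict in shard_state_dicts:  # one entry per sharded checkpoint file
        for name, cpu_tensor in state_dict.items():
            param = model.get_parameter(name)
            # non_blocking=True lets the host-to-device copy overlap with reading
            # the next shard (effective when the source tensor is in pinned memory).
            param.data = cpu_tensor.to(device, non_blocking=True)
    # Synchronize exactly once, just before returning control to the user,
    # instead of syncing or clearing the cache per shard.
    getattr(torch, torch.device(device).type, torch.cuda).synchronize()
    return model
```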