Clarification on Fine-tuning PE-Core Models using OpenCLIP Framework #32

Open
rakomar opened this issue May 7, 2025 · 9 comments
Labels: PE Perception Encoder

@rakomar

rakomar commented May 7, 2025

Issue: Fine-tuning PE-Core Models with OpenCLIP

Thank you for your excellent work on the PE-Core models and for open-sourcing them!

The documentation refers only to using the OpenCLIP framework for training and evaluating the PE-Core encoder models. We're attempting contrastive fine-tuning of these models (e.g., PE-Core-L14-336) using OpenCLIP with custom image-caption datasets, but encountered a few challenges:

1. Model Registration in OpenCLIP

PE-Core models (e.g., PE-Core-L14-336) aren't registered directly in OpenCLIP:

python -m open_clip_train.main --model PE-Core-L14-336

This results in RuntimeError(f'Model config for {model_name} not found.')
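
A quick sanity check (a sketch assuming a stock open_clip install, not part of the original run) confirms that no PE-Core configs are registered, which is what triggers the error:

import open_clip

# On a stock open_clip install, no PE-Core model configs are registered,
# so create_model raises the RuntimeError shown above.
print([name for name in open_clip.list_models() if name.startswith('PE-Core')])  # -> []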

2. JSON Configuration Challenges

We attempted to register the model using a custom configuration as suggested in OpenCLIP Discussion #1022:

open_clip.add_model_config(custom_config_path)

With a configuration like:

{
  "embed_dim": 1024,
  "vision_cfg": {
    "image_size": 336,
    "layers": 24,
    "width": 1024,
    "patch_size": 14,
    "mlp_ratio": 4
  },
  "text_cfg": {
    "context_length": 32,
    "layers": 24,
    "width": 1024,
    "mlp_ratio": 4,
    "heads": 16,
    "vocab_size": 49408
  }
}

However, this approach led to issues (see the sketch after this list):

  • Unexpected key errors (e.g., visual.attn_pool.*)
  • Dimension mismatches due to differing pooling methods
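
As a rough way to surface the first problem, one can diff the released checkpoint's keys against what the registered config defines. The paths and file names below are hypothetical placeholders, not an official workflow:

import torch
import open_clip

# Register the custom JSON config shown above, then build the model without weights.
open_clip.add_model_config('configs/PE-Core-L14-336.json')  # hypothetical path
model = open_clip.create_model('PE-Core-L14-336')

# Compare checkpoint keys with what the stock OpenCLIP ViT defines;
# keys such as visual.attn_pool.* exist only on the checkpoint side.
checkpoint = torch.load('PE-Core-L14-336.pt', map_location='cpu')  # hypothetical path
model_keys = set(model.state_dict().keys())
ckpt_keys = set(checkpoint.keys())
print('unexpected in checkpoint:', sorted(ckpt_keys - model_keys)[:5])
print('not in checkpoint:', sorted(model_keys - ckpt_keys)[:5])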

Question

Could you clarify:

  • Is there a more straightforward approach or recommended workflow to fine-tune PE-Core models with OpenCLIP than the manual approach described above?
  • If not, could you provide additional documentation or examples detailing the correct configuration and fine-tuning steps using OpenCLIP for PE-Core encoder models?

Your support on this would be greatly appreciated!

Thanks!

@berniebear
Contributor

berniebear commented May 7, 2025

Hello @rakomar, thank you for your interest in PE. We are working with HF to integrate PE into timm and open_clip. We will post progress updates here; thanks for your patience.

@rakomar
Author

rakomar commented May 7, 2025

Sounds great. Thank you for the update.
We'll be waiting for it.

@berniebear
Contributor

berniebear commented May 7, 2025

Hello @rakomar, before the official (Hugging Face) timm and open_clip integration, a quick PE + open_clip integration (draft) is here: https://github.com/berniebear/open_clip (experimental usage only; no TorchScript support for now). It uses the original CLIP class in open_clip, featuring a customized PE vision transformer (with attention pooling, absolute + RoPE positional embeddings, etc.). The text transformer is identical to the one used in open_clip. The PE model configs are under src/model_configs/{PE-Core-B16-224, PE-Core-L14-336, PE-Core-G14-448}.json.

You can use the standard open_clip API like this:

import torch
from PIL import Image
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms('PE-Core-L14-336', pretrained=True)
tokenizer = open_clip.get_tokenizer('PE-Core-L14-336')
image = preprocess(Image.open("docs/cat.png")).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat"])
with torch.no_grad(), torch.autocast("cuda"):
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", probs)  # prints: [[0., 0., 1.]]

Some open_clip functionalities may not be supported before the official open_clip integration by HF. Hope this helps! Cheers!

@rakomar
Author

rakomar commented May 8, 2025

Thanks a lot for the clarification.
Do you have a rough estimate of when the changes will be officially integrated into open_clip?

@mmaaz60 added the PE Perception Encoder label May 9, 2025
@kailih

kailih commented May 15, 2025

Hi @berniebear, thanks for sharing the repo (https://github.com/berniebear/open_clip) earlier. It has been very helpful. I encountered an issue with the embed_dim setting while using PE-Core-B16-224 and submitted a PR to address it. Just wanted to let you know in case it's useful.

@berniebear
Contributor

Merged! Thank you

@berniebear
Contributor

Also, PE is now integrated into the latest timm (you need to clone and install the latest version), e.g.:

import timm
model = timm.create_model('vit_pe_core_gigantic_patch14_448', pretrained=True)
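
A rough usage sketch building on that (the preprocessing calls and image path are assumptions, not part of the comment above):

import torch
import timm
from PIL import Image

model = timm.create_model('vit_pe_core_gigantic_patch14_448', pretrained=True).eval()

# Build preprocessing that matches the model's pretrained config.
data_cfg = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**data_cfg, is_training=False)

image = transform(Image.open('docs/cat.png')).unsqueeze(0)  # hypothetical image path
with torch.no_grad():
    features = model(image)  # pooled output (assumes the PE checkpoint ships without a classification head)
print(features.shape)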

Hope this helps!

@rakomar
Author

rakomar commented May 16, 2025

Thank you very much! I'll be testing it out.

With this, what is the easiest way to get contrastive fine-tuning running?
Is it possible to run training via timm like here or here?
Or do I need to write the training script from scratch? If so, could you maybe provide a minimal example?

@kailih

kailih commented May 22, 2025

Hi @rakomar, I have been using the pe+open_clip integration draft repo that Bernie shared earlier (https://github.com/berniebear/open_clip) to fine-tune PE-base on my own datasets, and have seen some gains. However, when it comes to more flexible operations like freezing some text/image layers, the draft hasn't incorporated functions like visual.freeze(), so it still needs some extra work, but that should be much easier now that PE has been integrated into timm.
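
For reference, a minimal sketch of one way such a contrastive fine-tuning loop can look with that fork (assuming it keeps open_clip's standard CLIP forward and ClipLoss; the dataloader below is a hypothetical placeholder yielding image batches and caption strings):

import torch
import open_clip
from open_clip.loss import ClipLoss

model, _, preprocess = open_clip.create_model_and_transforms('PE-Core-L14-336', pretrained=True)
tokenizer = open_clip.get_tokenizer('PE-Core-L14-336')
model = model.cuda().train()

loss_fn = ClipLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.2)

for images, captions in dataloader:  # hypothetical image-caption DataLoader
    images = images.cuda(non_blocking=True)
    texts = tokenizer(captions).cuda(non_blocking=True)

    # CLIP.forward returns (image_features, text_features, logit_scale) by default.
    image_features, text_features, logit_scale = model(images, texts)
    loss = loss_fn(image_features, text_features, logit_scale)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()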

I look forward to seeing PE fully integrated into open_clip to make full use of the various features that open_clip supports. @berniebear, is there any ongoing work toward this? Thank you, and I appreciate your wonderful work and follow-up.
