Clarification on Fine-tuning PE-Core Models using OpenCLIP Framework #32

Open
rakomar opened this issue May 7, 2025 · 9 comments
Labels: PE Perception Encoder

@rakomar

rakomar commented May 7, 2025

Issue: Fine-tuning PE-Core Models with OpenCLIP

Thank you for your excellent work on the PE-Core models and for open-sourcing them!

The documentation refers only to using the OpenCLIP framework for training and evaluating the PE-Core encoder models. We're attempting contrastive fine-tuning of these models (e.g., PE-Core-L14-336) using OpenCLIP with custom image-caption datasets, but encountered a few challenges:

1. Model Registration in OpenCLIP

PE-Core models (e.g., PE-Core-L14-336) aren't registered directly in OpenCLIP:

python -m open_clip_train.main --model PE-Core-L14-336

This results in RuntimeError(f'Model config for {model_name} not found.')
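
A quick sanity check (a sketch assuming a stock open_clip install, not part of the original run) confirms that no PE-Core configs are registered, which is what triggers the error:

import open_clip

# On a stock open_clip install, no PE-Core model configs are registered,
# so create_model raises the RuntimeError shown above.
print([name for name in open_clip.list_models() if name.startswith('PE-Core')])  # -> []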

2. JSON Configuration Challenges

We attempted to register the model using a custom configuration as suggested in OpenCLIP Discussion #1022:

open_clip.add_model_config(custom_config_path)

With a configuration like:

{
  "embed_dim": 1024,
  "vision_cfg": {
    "image_size": 336,
    "layers": 24,
    "width": 1024,
    "patch_size": 14,
    "mlp_ratio": 4
  },
  "text_cfg": {
    "context_length": 32,
    "layers": 24,
    "width": 1024,
    "mlp_ratio": 4,
    "heads": 16,
    "vocab_size": 49408
  }
}

However, this approach led to issues (see the sketch after this list):

  • Unexpected key errors (e.g., visual.attn_pool.*)
  • Dimension mismatches due to differing pooling methods
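
As a rough way to surface the first problem, one can diff the released checkpoint's keys against what the registered config defines. The paths and file names below are hypothetical placeholders, not an official workflow:

import torch
import open_clip

# Register the custom JSON config shown above, then build the model without weights.
open_clip.add_model_config('configs/PE-Core-L14-336.json')  # hypothetical path
model = open_clip.create_model('PE-Core-L14-336')

# Compare checkpoint keys with what the stock OpenCLIP ViT defines;
# keys such as visual.attn_pool.* exist only on the checkpoint side.
checkpoint = torch.load('PE-Core-L14-336.pt', map_location='cpu')  # hypothetical path
model_keys = set(model.state_dict().keys())
ckpt_keys = set(checkpoint.keys())
print('unexpected in checkpoint:', sorted(ckpt_keys - model_keys)[:5])
print('not in checkpoint:', sorted(model_keys - ckpt_keys)[:5])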

Question

Could you clarify:

  • Is there a more straightforward approach or recommended workflow to fine-tune PE-Core models with OpenCLIP than the manual approach described above?
  • If not, could you provide additional documentation or examples detailing the correct configuration and fine-tuning steps using OpenCLIP for PE-Core encoder models?

Your support on this would be greatly appreciated!

Thanks!

@berniebear
Contributor

berniebear commented May 7, 2025

Hello @rakomar, thank you for your interest in PE. We are working with HF to integrate PE into timm and open_clip. We will post progress updates here; thanks for your patience.

@rakomar
Author

rakomar commented May 7, 2025

Sounds great. Thank you for the update.
We'll be waiting for it.

@berniebear
Contributor

berniebear commented May 7, 2025

Hello @rakomar, before the official (Hugging Face) timm and open_clip integration, a quick PE + open_clip integration (draft) is here: https://github.com/berniebear/open_clip (experimental usage only; no TorchScript support for now). It uses the original CLIP class in open_clip, featuring a customized PE vision transformer (with attention pooling, absolute + RoPE positional embeddings, etc.). The text transformer is identical to the one used in open_clip. The PE model configs are under src/model_configs/{PE-Core-B16-224, PE-Core-L14-336, PE-Core-G14-448}.json.

You can use the standard open_clip API like this:

import torch
from PIL import Image
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms('PE-Core-L14-336', pretrained=True)
tokenizer = open_clip.get_tokenizer('PE-Core-L14-336')
image = preprocess(Image.open("docs/cat.png")).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat"])
with torch.no_grad(), torch.autocast("cuda"):
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", probs)  # prints: [[0., 0., 1.]]

Some open_clip functionalities may not be supported before the official open_clip integration by HF. Hope this helps! Cheers!

@rakomar
Author

rakomar commented May 8, 2025

Thanks a lot for the clarification.
Do you have a rough estimate of when the changes will be officially integrated into open_clip?

@mmaaz60 added the PE Perception Encoder label May 9, 2025
@kailih

kailih commented May 15, 2025

Hi @berniebear, thanks for sharing the repo (https://github.com/berniebear/open_clip) earlier. It has been very helpful. I encountered an issue with the embed_dim setting while using PE-Core-B16-224 and submitted a PR to address it. Just wanted to let you know in case it's useful.

@berniebear
Contributor

Merged! Thank you

@berniebear
Contributor

Also, PE is now integrated into the latest timm (you need to clone and install the latest version), e.g.:

import timm
model = timm.create_model('vit_pe_core_gigantic_patch14_448', pretrained=True)
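
A rough usage sketch building on that (the preprocessing calls and image path are assumptions, not part of the comment above):

import torch
import timm
from PIL import Image

model = timm.create_model('vit_pe_core_gigantic_patch14_448', pretrained=True).eval()

# Build preprocessing that matches the model's pretrained config.
data_cfg = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**data_cfg, is_training=False)

image = transform(Image.open('docs/cat.png')).unsqueeze(0)  # hypothetical image path
with torch.no_grad():
    features = model(image)  # pooled output (assumes the PE checkpoint ships without a classification head)
print(features.shape)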

Hope this helps!

@rakomar
Author

rakomar commented May 16, 2025

Thank you very much! I'll be testing it out.

With this, what is the easiest way to get contrastive fine-tuning running?
Is it possible to run training via timm like here or here?
Or do I need to write the training script from scratch? If so, could you maybe provide a minimal example?

@kailih

kailih commented May 22, 2025

Hi @rakomar, I have been using the pe+open_clip integration draft repo that Bernie shared earlier (https://github.com/berniebear/open_clip) to fine-tune PE-base on my own datasets, and have seen some gains. However, when it comes to more flexible operations like freezing some text/image layers, the draft hasn't incorporated functions like visual.freeze(), so it still needs some extra work, but that should be much easier now that PE has been integrated into timm.
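
For reference, a minimal sketch of one way such a contrastive fine-tuning loop can look with that fork (assuming it keeps open_clip's standard CLIP forward and ClipLoss; the dataloader below is a hypothetical placeholder yielding image batches and caption strings):

import torch
import open_clip
from open_clip.loss import ClipLoss

model, _, preprocess = open_clip.create_model_and_transforms('PE-Core-L14-336', pretrained=True)
tokenizer = open_clip.get_tokenizer('PE-Core-L14-336')
model = model.cuda().train()

loss_fn = ClipLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.2)

for images, captions in dataloader:  # hypothetical image-caption DataLoader
    images = images.cuda(non_blocking=True)
    texts = tokenizer(captions).cuda(non_blocking=True)

    # CLIP.forward returns (image_features, text_features, logit_scale) by default.
    image_features, text_features, logit_scale = model(images, texts)
    loss = loss_fn(image_features, text_features, logit_scale)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()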

I look forward to seeing PE fully integrated into open_clip to make full use of the various features that open_clip supports. @berniebear, is there any ongoing work toward this? Thank you, and I appreciate your wonderful work and follow-up.
