Conversation
I also think it will probably fix the broadcast error reported here. I tried adding print statements to idefics3.swift so I could see the tensors, but I can't see them in the CLI when I run it using mlx_run. Here is the execution command I tried. By the way, the vision patch embed sanitization is not working well, though the code checks out to me.
@Blaizzy Unfortunately this PR didn't solve that issue. It crashes here before entering:
Full stacktrace before crash:
```diff
  // From the Python code and default config, we know image_token_id is usually 49153.
  // Hardcode this since we can't pass it in or rely on it from the processor config.
- private let imageTokenId = 49153
+ private let imageTokenId = 49190
```
This no longer matches the comment. Also, you might want to change this line:

```swift
self.imageTokenId = (try? container.decode(Int.self, forKey: .imageTokenId)) ?? 49153
```
That said, perhaps we should address the "Hardcode this since we can't pass it in or rely on it from the processor config" part of it -- if UserInputProcessor needs it then it should get passed in (VLMModelFactory).
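For what it's worth, a minimal sketch of how that could look, assuming the processor configuration is decoded with Codable as the quoted line suggests; the struct name and surrounding shape are hypothetical, not the repo's current API:

```swift
import Foundation

// Hypothetical sketch: decode image_token_id from the processor config,
// falling back to the SmolVLM default only when the key is absent.
struct ProcessorConfiguration: Codable {
    let imageTokenId: Int

    enum CodingKeys: String, CodingKey {
        case imageTokenId = "image_token_id"
    }

    init(from decoder: Decoder) throws {
        let container = try decoder.container(keyedBy: CodingKeys.self)
        // Prefer the configured value; 49190 is the assumed default.
        imageTokenId = (try? container.decode(Int.self, forKey: .imageTokenId)) ?? 49190
    }
}
```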
Comparing this to the other VLMs, it looks like the difference is here:

```swift
// Encode only the text part of the prompt, without <image>
var promptTokens = try tokenizer.encode(text: prompt)
// Insert the image token id directly into the token sequence, midway through
let imageTokenIndex = promptTokens.count / 2
promptTokens.insert(imageTokenId, at: imageTokenIndex)
```

This deals in numeric token ids after tokenization, while the others inject text.
I think we can keep this hard-coded for this specific issue, but it would be good to decide which approach to take here:

- inject text into the prompt (no token id needed; see the sketch below)
- we need the model config
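A minimal sketch of the first option, assuming the tokenizer knows SmolVLM's `<image>` special token and that a `tokenizer` is in scope as in the snippet above (the prompt text is illustrative):

```swift
// Hypothetical sketch of the text-injection approach: place the literal
// <image> placeholder in the prompt and let the tokenizer emit the id,
// so no hard-coded imageTokenId is needed.
let prompt = "<image>What is in this picture?"
let promptTokens = try tokenizer.encode(text: prompt)
```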
Looking at the pixel_values coming in to the Swift code vs the Python code, it looks like:

```python
if do_image_splitting:
    # We first resize both height and width of each image to the nearest max_image_size multiple, disregarding the aspect ratio
    # for size=(10, max_image_size) -> rescaled_size=(max_image_size, max_image_size)
    # for size=(11, max_image_size+1) -> rescaled_size=(max_image_size, max_image_size*2)
    ...
```
As far as I can tell we are passing a non-split image in where a split image is expected. I don't know if that is the entirety of it, but that seems to be the difference between the Swift and Python code before it hits the assertions.
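To make that transformers comment concrete, here is a small sketch of the resize rule it describes (each side rounded up to the nearest non-zero multiple of max_image_size); the helper name and the 512 default are assumptions, not the transformers API:

```swift
// Hypothetical helper matching the transformers comment above.
func rescaledSide(_ side: Int, maxImageSize: Int = 512) -> Int {
    max(1, Int((Double(side) / Double(maxImageSize)).rounded(.up))) * maxImageSize
}

print(rescaledSide(10))   // 512:  size=(10, max_image_size) -> max_image_size
print(rescaledSide(513))  // 1024: max_image_size+1 -> max_image_size*2
```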
In Python it works with do_image_splitting both True and False, because we are flattening the first 3 of the 4 dimensions of the image_feature tensor.
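A minimal sketch of what that flattening buys, using mlx-swift; the shapes and hidden size here are illustrative assumptions, not values taken from the model:

```swift
import MLX

// Hypothetical shapes: (images, tiles, patches, hidden). With
// do_image_splitting the tiles dim is 13; without it, 1.
let split = MLXArray.zeros([1, 13, 64, 1152])
let nonSplit = MLXArray.zeros([1, 1, 64, 1152])

// Flattening everything but the hidden dim makes both cases look the same
// to the downstream merge with the text embeddings.
let a = split.reshaped([-1, split.dim(-1)])       // (832, 1152)
let b = nonSplit.reshaped([-1, nonSplit.dim(-1)]) // (64, 1152)
print(a.shape, b.shape)
```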
@davidkoski thanks a lot for the tips!
My next question, so I can keep moving on this, is how to set everything up using Xcode? I struggled for 30 minutes and ended up doing it via the terminal. To reiterate, I'm just getting started with Swift and Xcode :)
Without knowing specifically what you are running into, it is hard to say. You might look here:
Once you have it installed, you should be able to open the project. These are the schemes: you pick one and a build/run destination. The simplest will be macOS.
You can press cmd-opt-r to edit the "run" configuration for the scheme. In the first tab you can select between Debug/Release builds (Debug if you want to run in the debugger, Release to run full speed). The arguments tab is useful for setting command-line arguments. If you run into specific problems, just ask!
Thanks! I'm running macOS :)
Hi @Blaizzy, welcome to Swift-land :) Incidentally, I started working on this before seeing your PR. I have a preliminary question: did you manage to get any SmolVLM running, even before your patch? I tested the default model in this repo (mlx-community/SmolVLM-Instruct-4bit) and I'm getting gibberish (after changing the image token id to 49190).
Description
This PR matches the fix for Idefics 2 and 3 in mlx-vlm.
Source: Blaizzy/mlx-vlm#191
@davidkoski @awni I'm not a Swift expert and would love to start contributing. This is my first attempt.
Although I couldn't manage to run this example successfully, this change should allow the model to process images at multiple resolutions, boosting OCR and finer-grained detail. For this, Idefics 3 models have a preprocessor argument called do_image_splitting, which is True by default. I had previously set it to False on all MLX models, but now I have pinpointed the bug and made a fix.
Input pixel values:

- `(1, 3, 512, 512)` without do_image_splitting
- `(13, 3, 512, 512)` with do_image_splitting
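For reference, assuming the usual Idefics3 splitting scheme (a grid of max_image_size crops plus one downscaled global image), the 13 frames would break down as below; the 2048x1536 input size is a hypothetical example, not the actual test image:

```swift
// Hypothetical arithmetic: a 2048x1536 image rescaled to multiples of 512
// splits into a 4x3 grid of crops, plus one global image.
let (cols, rows) = (2048 / 512, 1536 / 512)  // (4, 3)
print(cols * rows + 1)                       // 12 crops + 1 global = 13
```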
Python MLX-VLM example
Input Image:
Before
After (matches transformers ✅)