Skip to content

fix: merge config.json eos_token_id with tokenizer-derived value instead of replacing it#1414

Open
EmreCelenli wants to merge 1 commit into
ml-explore:mainfrom
EmreCelenli:fix/config-eos-override-silent-replacement
Open

fix: merge config.json eos_token_id with tokenizer-derived value instead of replacing it#1414
EmreCelenli wants to merge 1 commit into
ml-explore:mainfrom
EmreCelenli:fix/config-eos-override-silent-replacement

Conversation

@EmreCelenli

Copy link
Copy Markdown

Summary

TokenizerWrapper.__init__ treats eos_token_ids passed in from config.json
as a hard replacement for whatever the tokenizer itself correctly derives.
When config.json has a stale value (common in third-party MLX-quantized repos),
the correct stop token is silently discarded, causing generation to never stop
at the right boundary. The real stop token (e.g. <|im_end|>) leaks into output
as plain text and the model keeps generating until max tokens.

Root Cause

In utils.py, both load() and sharded_load() do:

tokenizer = load_tokenizer(
    model_path, tokenizer_config, eos_token_ids=config.get("eos_token_id", None)
)

And TokenizerWrapper.__init__ treats this as a strict replacement:

self._eos_token_ids = (
    set(eos_token_ids)
    if eos_token_ids is not None
    else {tokenizer.eos_token_id}
)

So if config.json has a stale value, the tokenizer's own correct derivation
is silently thrown away with no warning.

Fix

if eos_token_ids is not None:
    self._eos_token_ids = set(
        [eos_token_ids] if isinstance(eos_token_ids, int) else eos_token_ids
    )
else:
    self._eos_token_ids = set()
if tokenizer.eos_token_id is not None:
    self._eos_token_ids.add(tokenizer.eos_token_id)

Verified

Model: mlx-community/Qwen2.5-Coder-7B-Instruct-4bit, 16GB M1 Mac.
config.json has eos_token_id: 151643 (<|endoftext|>, stale pretraining EOS).
Tokenizer correctly derives 151645 (<|im_end|>).

# Before fix
Via load(): {151643}          # correct value silently replaced, generation never stops

# After fix
Via load(): {151643, 151645}  # both present, generation stops correctly

Real-world symptom: <|im_end|> no longer leaks into output when using
mlx_lm.server with this model. Generation speed returns to normal since
the model stops at the correct boundary instead of running to max tokens.

Tests

203 tests pass. Black formatting check passed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant