Releases: ml-explore/mlx-lm
v0.31.3
Highlights
- Lots of bugfixes
- Thread local generation stream to accompany MLX v0.31.2
What's Changed
- Bump the patch version by @angeloskath in #1124
- Fix batch dimension mismatch in BatchKVCache and BatchRotatingKVCache extend() by @razorback16 in #1141
- Fix parallel tool call handling in server by @kernelpool in #1170
- Fix MiniMax M2 parallel tool calling by @kernelpool in #1171
- Fix missing tree_reduce import in models/cache.py by @siiea-ai in #1165
- Apertus tie_word_embeddings fix by @BlackSamorez in #1143
- Fix batch dimension mismatch in ArraysCache extend() by @techtoboggan in #1169
- Fix dwq: check for actual safetensors in target_dir by @micuentadecasa in #1173
- fix: handle NoneType check for think tokens in TokenizerWrapper by @yuetyeelo2855 in #1167
- Fix Gemma4 tool parser: support hyphenated function names and braces in string args by @AkashKhamkar in #1150
- Fix empty tool_call_end breaking Mistral tool calls by @eyupcanakman in #1151
- Fix ArraysCache extend by @angeloskath in #1177
- Fix Gemma 4 KV-shared layers creating unused projections by @glyphVault in #1158
- Thread local generation stream by @angeloskath in #1090
New Contributors
- @razorback16 made their first contribution in #1141
- @siiea-ai made their first contribution in #1165
- @BlackSamorez made their first contribution in #1143
- @techtoboggan made their first contribution in #1169
- @micuentadecasa made their first contribution in #1173
- @yuetyeelo2855 made their first contribution in #1167
- @AkashKhamkar made their first contribution in #1150
- @glyphVault made their first contribution in #1158
Full Changelog: v0.31.2...v0.31.3
v0.31.2
Highlights
- Caching system prompt and user messages for non-trimmable caches
- Batch generator refactoring
What's Changed
- Bump the patch version by @angeloskath in #959
- Presence and frequency penalties by @angeloskath in #971
- Eval self.left_padding whenever it is updated in BatchRotatingKVCache by @rltakashige in #960
- Late binding caused incorrect cache checkpoint by @angeloskath in #976
- Move to metal agnostic device_info by @angeloskath in #979
- Fix CompletionsDataset mask_prompt crash by @eyupcanakman in #967
- Bump the patch version by @angeloskath in #981
- Fix test after latest MLX update by @angeloskath in #996
- Clear cache trainer memory by @N8python in #986
- feat(server): add --allowed-origins by @nwtgck in #987
- Delta net precision by @angeloskath in #997
- avoid mutating input in SuScaledRoPE and YarnRoPE by @mm65x in #1003
- handle missing content-length header in server by @mm65x in #1001
- fall back to ast.literal_eval for malformed JSON in qwen3_coder tool parser by @mm65x in #1004
- Nemotron super support by @angeloskath in #992
- Supporting delay in mlx_lm benchmark by @AndreasPlt in #1010
- Fix flaky test by @angeloskath in #1020
- Fix missing cache advance from qwen 3.5 by @angeloskath in #1024
- Refactor LRUPromptCache by @angeloskath in #1019
- Fix SSM dt clamp default for Nemotron-H by @kernelpool in #1026
- Inserting logits processors into BatchGenerator in batch_generate by @arthurhjorth in #1008
- fix: break shared-buffer memory leak in GatedDeltaNet cache by @adurham in #1077
- Fix PromptTrie.pop_prefixes() off-by-one when pruning immediate prefixes by @LxYuan0420 in #1078
- Batch generation refactoring and various fixes by @angeloskath in #1072
- perf: use max instead of argsort in apply_min_p sampling by @matteocelani in #1083
- Add gemma 4 by @Blaizzy in #1093
- Bring back max-kv-size to the batch generator by @angeloskath in #1106
- Add Gemma 4 tool call parser by @nicdavidson in #1105
- Fix Gemma 4 quantized per-layer projection loading by @spicyneuron in #1112
- Fix output corruption in speculative decoding by @kernelpool in #1109
- Gemma4 final fixes and multi-token think/tool start/end by @angeloskath in #1114
- Align batch logits processor token contract by @neilmehta24 in #1115
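The presence and frequency penalties added in #971 follow the common OpenAI-style formulation: a flat penalty for any token that has already appeared, plus a penalty scaled by its repetition count. A minimal framework-agnostic sketch (plain Python, illustrative only; not mlx-lm's actual implementation):

```python
# Sketch of presence/frequency logit penalties (illustrative only).
from collections import Counter

def apply_penalties(logits, generated_tokens,
                    presence_penalty=0.0, frequency_penalty=0.0):
    """Subtract penalties from the logits of already-generated tokens.

    presence_penalty: flat penalty for any token that appeared at least once.
    frequency_penalty: penalty scaled by how often the token appeared.
    """
    counts = Counter(generated_tokens)
    out = list(logits)
    for tok, n in counts.items():
        out[tok] -= presence_penalty + frequency_penalty * n
    return out

# Token 2 appeared twice, token 0 once; unseen tokens are untouched.
logits = [1.0, 1.0, 1.0, 1.0]
adjusted = apply_penalties(logits, [2, 0, 2],
                           presence_penalty=0.5, frequency_penalty=0.1)
```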
New Contributors
- @rltakashige made their first contribution in #960
- @eyupcanakman made their first contribution in #967
- @nwtgck made their first contribution in #987
- @mm65x made their first contribution in #1003
- @AndreasPlt made their first contribution in #1010
- @arthurhjorth made their first contribution in #1008
- @adurham made their first contribution in #1077
- @LxYuan0420 made their first contribution in #1078
- @matteocelani made their first contribution in #1083
- @nicdavidson made their first contribution in #1105
Full Changelog: v0.31.0...v0.31.2
v0.31.0
What's Changed
- Fix save/load of CacheList by @angeloskath in #886
- Share model by @angeloskath in #871
- Fix mixed quant predicates for MLA models by @spicyneuron in #892
- Add JoyAI LLM Flash by @kernelpool in #894
- perplexity: add --trust-remote-code option by @ivanfioravanti in #896
- server: add usage.prompt_tokens_details.cached_tokens to json response by @percontation in #849
- Fix qwen3.5 casting to fp32 by @awni in #902
- Fix sharded rms norm in MiniMax M2.5 by @angeloskath in #898
- Bump for next version by @awni in #904
- Add tie_word_embeddings modulars in mistral and qwen3 moe by @Goekdeniz-Guelmez in #889
- Allow reading LFM2 models nested rope params by @ykhrustalev in #908
- Improve the cache size limits by @angeloskath in #906
- Make the cache limits more friendly by @angeloskath in #910
- Add 'mx.clear_cache()' to piecewise prompt processing in server. by @N8python in #917
- Add filter guard to ArraysCache.nbytes property by @f1yn in #918
- Add tokens to eval to avoid large graphs when they are not used by @awni in #924
- Clear the cache during batch generation by @awni in #926
- Fix qwen3.5 sanitize by @awni in #928
- step3p5: use rotating cache for sliding attention layers by @lyonsno in #949
- Proposal: --prefill-step-size as cmd line argument for speed/memory usage trade-off by @Abioy in #943
- fix: convert() uses incorrect defaults for quantization mode by @spicyneuron in #935
- Bump minor by @angeloskath in #954
- Ensure normalization does not promote to fp32 by @angeloskath in #951
- Better caching in the server by @angeloskath in #911
- Adds tensor parallelism for Qwen 3.5 by @angeloskath in #957
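The cached-token accounting from #849 surfaces in the OpenAI-compatible usage object returned by the server; a hypothetical response fragment (field path taken from the PR title, token counts illustrative):

```json
{
  "usage": {
    "prompt_tokens": 128,
    "completion_tokens": 40,
    "total_tokens": 168,
    "prompt_tokens_details": {
      "cached_tokens": 96
    }
  }
}
```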
New Contributors
- @spicyneuron made their first contribution in #892
- @ykhrustalev made their first contribution in #908
- @f1yn made their first contribution in #918
- @lyonsno made their first contribution in #949
- @Abioy made their first contribution in #943
Full Changelog: v0.30.7...v0.31.0
v0.30.7
What's Changed
- Fix Kimi Linear by @kernelpool in #853
- Bump version for next release by @awni in #865
- Pythonic tool calling for LFM2 models by @viktike in #864
- Fix DeepSeek V3.2 indexer and weight loading by @kernelpool in #866
- Make validation set optional in training process by @Goekdeniz-Guelmez in #857
- Mistral tool parser by @awni in #874
- LongCat MLA by @kernelpool in #868
- [MODEL] support qwen3.5 series w/o vision by @JJJYmmm in #869
- Faster DSV32 generation by @kernelpool in #885
- Add GLM5 by @Goekdeniz-Guelmez in #867
Full Changelog: v0.30.6...v0.30.7
v0.30.6
What's Changed
- Transformers v5 by @awni in #811
- Add LongCat Flash tool parser by @kernelpool in #810
- Add Kimi-K2.5 by @kernelpool in #813
- Bump mlx version and version by @awni in #816
- Fix NemotronH config compatibility with HuggingFace format by @LuqDaMan in #820
- Fix for Exception - MultiLinear.to_quantized() missing 'mode' by @inferencers in #809
- Fix Kimi K2.5 tool call handling by @kernelpool in #821
- Actually add cli by @awni in #823
- Add LongCat Flash Lite by @kernelpool in #819
- Fix mixed quant by @awni in #825
- Support distributed inference in the server by @angeloskath in #741
- fix cli by @solarpunkin in #827
- Enable loading custom models by @awni in #830
- Allow default creation of BatchRotatingKVCache instead of BatchKVCache in batch mode by @christian-lms in #834
- Add Step 3.5 Flash by @kernelpool in #836
- server: support chat_template_kwargs and top_logprobs by @percontation in #829
- fix: handle GLM 4.7 tool call fallbacks by @jalehman in #792
- Deepseek V3.2 implementation fixes by @sjug in #838
- Fix Step 3.5 Flash model conversion by @kernelpool in #840
- Fix batch mamba by @awni in #842
- Fix sliding window mask during generation by @kernelpool in #843
- DSV3 MLA by @awni in #839
Full Changelog: v0.30.5...v0.30.6
v0.30.5
What's Changed
- import logging as it throws no logging error in place of actual error by @Maanas-Verma in #778
- server: use OpenAI compatible finish_reason by @percontation in #782
- move Xielu Activation in Apertus to activations.py by @Goekdeniz-Guelmez in #772
- bump transformers by @awni in #746
- Update glm4_moe_lite to store KV latent in cache by @N8python in #780
- Adding TeleChat3 by @Goekdeniz-Guelmez in #773
- add kimi tool parser by @Evanev7 in #791
- Allow qq ops with activation quantization by @awni in #749
- fix: use correct variable for logprobs in batch generation by @LuqDaMan in #800
- Sync random seed across ranks in distributed chat by @kernelpool in #801
- Fix ArraysCache.from_state not initializing left_padding and lengths by @lpalbou in #807
New Contributors
- @Maanas-Verma made their first contribution in #778
- @percontation made their first contribution in #782
- @LuqDaMan made their first contribution in #800
- @lpalbou made their first contribution in #807
Full Changelog: v0.30.4...v0.30.5
v0.30.4
What's Changed
- Add AWQ/GPTQ weight transformation utilities by @ericcurtin in #730
- Add IQuest Coder V1 Loop variant by @kernelpool in #716
- Fix sliding window batching by @awni in #738
- Fix Batch Generation: Add extract method to ArraysCache for item retrieval by @Goekdeniz-Guelmez in #740
- Make MambaCache compatible with batch generation for nemotron-h by @nikhilmitrax in #690
- Add a server benchmark for continuous batching by @awni in #728
- Fix tools parameter in apply_chat_template call by @kernelpool in #747
- Refactor tokenizer error handling to use warnings instead of exceptio… by @cubist38 in #744
- Make cache list batchable by @awni in #743
- Fix batch generation for IQuestLoopCoder model by @kernelpool in #748
- Fix type hint and pydoc for batch_generate by @tibbes in #745
- Handle empty caches during batch merge by @ivanfioravanti in #755
- Update for latest mlx by @awni in #759
- Use compiled Swiglu by @awni in #753
- Adds support for Nemotron Super 49b v1.5 by @lazarust in #756
- fix(falcon_h1): support tied embeddings and correct muP scaling by @solarpunkin in #764
- Fix swiglu parameter order by @kernelpool in #767
- Fix CacheList batching by @kernelpool in #769
- fix: unused batch_size parameter for mlx_lm.evaluate by @AndrewTan517 in #762
- Add gpt-oss sharding by @Evanev7 in #761
- Fix LongCat Flash extended context support by @kernelpool in #768
- Add minimax tensor sharding by @Evanev7 in #760
- Shard LongCat Flash by @kernelpool in #771
- Add glm4 moe lite model by @ivanfioravanti in #776
New Contributors
- @ericcurtin made their first contribution in #730
- @nikhilmitrax made their first contribution in #690
- @tibbes made their first contribution in #745
- @solarpunkin made their first contribution in #764
- @AndrewTan517 made their first contribution in #762
- @Evanev7 made their first contribution in #761
Full Changelog: v0.30.2...v0.30.4
v0.30.2
v0.30.1
What's Changed
- custom dsv32 chat template by @awni in #693
- shard glm by @awni in #698
- support minimax m2 by @awni in #700
- Enhance load_config function to check for config file existence and i… by @cubist38 in #701
- batch_generate fails with Phi3 (LongRoPE) when prompts have different lengths by @vyaivanove in #707
- Fix GIL starvation in _generate thread when batch is idle by @sjug in #706
- Ignore generation_config decode errors by @will-lms in #708
- Allow mxfp8 and nvfp4 by @awni in #709
- Fix chat template detection for models with custom tokenizers by @kernelpool in #712
- chore: add model-path param flag for convert API for better clarity by @jaycoolslm in #702
- Add RWKV7 by @MollySophia in #580
- Fix empty /v1/models response for locally loaded models by @cxl-git-hub in #713
- Add IQuest Coder V1 by @kernelpool in #714
- Add YoutuLLM by @johnmai-dev in #720
- Add logits_processors support to batch_generate by @lazarust in #635
- Add Solar Open by @kernelpool in #721
- Add K-EXAONE MoE by @kernelpool in #719
- Improve reasoning and tool call parsing in server by @awni in #711
- Patch bump by @awni in #731
New Contributors
- @cubist38 made their first contribution in #701
- @vyaivanove made their first contribution in #707
- @sjug made their first contribution in #706
- @jaycoolslm made their first contribution in #702
- @MollySophia made their first contribution in #580
- @cxl-git-hub made their first contribution in #713
- @lazarust made their first contribution in #635
Full Changelog: v0.30.0...v0.30.1
v0.30.0
What's Changed
- fix: server busy-waiting during idle request polling by @zenyr in #674
- Fixes for transformers v5 by @awni in #684
- Add mimo v2 flash by @awni in #685
- More useful error message for unsupported batching by @awni in #687
- Model parallel generation by @angeloskath in #676
- Bump to transformer v5 by @awni in #689
- Revert return dict and wrap apply_chat_template by @awni in #691
- Bump the version by @angeloskath in #692
Full Changelog: v0.29.0...v0.30.0