Releases: microsoft/onnxruntime-genai
v0.8.0
What's Changed
New Features
- Add Chat Template API Changes by @sayanshaw24 in #1398
- Add Python and C# bindings for Chat Template API by @sayanshaw24 in #1411 (see the sketch after this list)
- Support for gemma3 model by @baijumeswani in #1374
- Support more QNN models with different model structures by @baijumeswani in #1322
- Add ability to load audio from bytes, to match images API by @RyanUnderhill in #1304
- Add support for DML Graph Capture to improve speed by @aciddelgado in #1305
- Added OnnxRuntimeGenAIChatClient ctor with Config. by @azchohfi in #1364
- Extensible AppendExecutionProvider and expose OrtSessionOptions::AddConfigEntry directly by @RyanUnderhill in #1384
- OpenVINO: Model Managed KVCache by @RyanMetcalfeInt8 in #1399
- Changes how the device OrtAllocators work, use a global OrtSession instead by @RyanUnderhill in #1378
- Remove audio attention mask processing and update ort-extensions by @baijumeswani in #1319
- Simplify the C API definitions and prevent any type mismatches going forward by @RyanUnderhill in #1365
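The Chat Template API lets callers render a structured conversation into the model's own prompt format instead of hand-building prompt strings. A minimal sketch of how the Python binding might be used; the model path and the exact keyword names are illustrative assumptions, not confirmed by these notes:

```python
import json

import onnxruntime_genai as og

model = og.Model("path/to/model")  # hypothetical model folder
tokenizer = og.Tokenizer(model)

# Conversation encoded as a JSON string of role/content pairs.
messages = json.dumps([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What does ONNX Runtime GenAI do?"},
])

# Render the conversation with the model's chat template and append the
# generation prompt for the assistant turn; keyword names are assumptions.
prompt = tokenizer.apply_chat_template(messages=messages, add_generation_prompt=True)
input_tokens = tokenizer.encode(prompt)
```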
Model builder updates
- Quark Quantizer Support by @shobrienDMA in #1207
- Add Gemma 3 to model builder by @kunal-vaishnavi in #1359
- Initial support for VitisAI EP by @AnanyaA-9 in #1370
- [OVEP] feat: Adding OpenVINO EP in ORT-GenAI by @ankitm3k in #1389
- Initial support for NV EP by @BLSharda in #1404
- Adapt to MatMulNBitsQuantizer in ort by @jiafatom in #1426
- Fix LM head for Gemma-2 by @kunal-vaishnavi in #1420
Bug Fixes
- Fix mismatch in Java bindings by @CaptainIRS in #1307
- Fix type mismatch in Java bindings by @CaptainIRS in #1313
- Update ort-extensions to fix tokenizer bug for phi4 by @baijumeswani in #1331
- Windows: Show more useful DLL load errors to say exactly what DLL is missing by @RyanUnderhill in #1345
- Deprecate graph capture by @aciddelgado in #1338
- Support load/unload of models to avoid QNN errors on deepseek r1 1.5B by @baijumeswani in #1346
- Add missing 'value_stats' to logging API, and fix wrong default by @RyanUnderhill in #1353
- Convert tokens to list for concat by @ajindal1 in #1358
- Improve and Fix TopKTopP by @jiafatom in #1363
- Switch the order of softmax on CPU Top K by @aciddelgado in #1354
- Update pybind and fix rpath for macos and check for nullptr by @baijumeswani in #1367
- Iterate over the providers by @baijumeswani in #1486
- Correctly iterate over the providers to check if graph capture is enabled by @baijumeswani in #1487
Examples and Documentation
- Update README.md by @RyanUnderhill in #1372
- Add slm engine example by @avijit-chakroborty in #1242
- Added cancellation to the streaming method of OnnxRuntimeGenAIChatClient. by @azchohfi in #1289
- Update nuget README with latest API by @natke in #1326
- Update C examples downloads by @ajindal1 in #1332
- Add Q&A Test Example in Nightly by @ajindal1 in #1277
- docs: update the doc of slm_engine to ensure consistency with the code by @dennis2030 in #1386
- C++ and python samples: follow_config support by @RyanMetcalfeInt8 in #1413
- Fix Do Sample example by @ajindal1 in #1337
- Make phi3 example Q&A rather than chat by @ajindal1 in #1392
- Fix broken link in package description by @rogerbarreto in #1360
Packaging and Testing
- Remove DirectML.dll dependency by @baijumeswani in #1342
- Add support to creating a custom nuget in the packaging pipeline by @baijumeswani in #1315
- Remove onnxruntime-genai-static library (non-trivial change) by @RyanUnderhill in #1264
- Add macosx to custom nuget package by @baijumeswani in #1419
- Update the C++ clang-format lint workflow to use clang 20 by @snnn in #1418
- Add model_benchmark options to specify prompt to use. by @edgchen1 in #1328
- Add value_stats logging option to show statistical information about … by @RyanUnderhill in #1352
- Fixed the MacOS build and updated the test script. by @avijit-chakroborty in #1310
- Fix iOS packaging pipeline after static library removal by @RyanUnderhill in #1316
- Fix bug in Python benchmark script by @thevishalagarwal in #1206
- Fix macos package by @baijumeswani in #1347
- Missing *.dylib in package_data, so Mac would not package our shared libraries by @RyanUnderhill in #1341
Dependency Updates
- Update upload Artifact version by @ajindal1 in #1274
- Update to M.E.AI 9.3.0-preview.1.25161.3 by @stephentoub in #1317
- Update android min sdk version to 24 by @baijumeswani in #1324
- Update torch to 2.5.1 by @baijumeswani in #1343
- Update Pipelines for S360 by @ajindal1 in #1323
- Update Nuget pkg name by @ajindal1 in #1351
- Update version to 0.8.0 by @baijumeswani in #1376
- Update custom nuget packaging logic by @baijumeswani in #1377
- Update Microsoft.Extensions.AI.Abstractions to 9.4.0-preview.1.25207.5 by @stephentoub in #1388
- Bump torch from 2.5.1 to 2.6.0 in /test/python/macos/torch by @dependabot in #1408
- Bump torch from 2.5.1+cu124 to 2.6.0+cu124 in /test/python/cuda/torch by @dependabot in #1409
- Bump torch from 2.5.1+cpu to 2.7.0 in /test/python/cpu/torch by @dependabot in #1422
- Pin CMake version by @snnn in #1424
New Contributors
- @avijit-chakroborty made their first contribution in #1242
- @CaptainIRS made their first contribution in #1307
- @AnanyaA-9 made their first contribution in #1370
- @dennis2030 made their first contribution in #1386
- @ankitm3k made their first contribution in #1389
- @RyanMetcalfeInt8 made their first contribution in #1399
Full Changelog: v0.7.1...v0.8.0
v0.7.1
Release Notes
- Add AMD Quark Quantizer Support #1207
- Added Gemma 3 to model builder #1359
- Updated Phi-3 Python Q&A example to be consistent with C++ example #1392
- Updated Microsoft.Extensions.AI.Abstractions to 9.4.0-preview.1.25207.5 #1388
- Added OnnxRuntimeGenAIChatClient constructor with Config #1364
- Improve and Fix TopKTopP #1363
- Switch the order of softmax on CPU Top K #1354
- Updated custom nuget packaging logic #1377
- Updated pybind and fix rpath for macos and check for nullptr #1367
- Convert tokens to list for concat to accommodate breaking API change in tokenizer #1358
v0.7.0
Release Notes
We are excited to announce the release of onnxruntime-genai version 0.7.0. Below are the key updates included in this release:
- Support for a wider variety of QNN NPU models (such as Deepseek R1)
- Remove onnxruntime-genai-static library. All language bindings now interface with onnxruntime-genai through the onnxruntime-genai shared library.
- All return types from the onnxruntime-genai Python package are now numpy arrays. Previously, tokenizer.encode returned a Python list; this broke examples/python/model-qa.py, which used '+' to concatenate two lists. Use np.concatenate instead in such cases (see the sketch below).
- Abstract execution-provider-specific code into shared libraries of their own (for example, onnxruntime-genai-cuda for CUDA and onnxruntime-genai-dml for DML). This allows, for example, the onnxruntime-genai-cuda package to also work on machines without CUDA.
- Support for multi-modal models (text, speech, and vision) such as Phi-4 multimodal.
- Add an IChatClient implementation to the onnxruntime-genai C# bindings.
- Expose the model type through the Python bindings.
- Code and performance improvements for DML EP.
This release also includes several bug fixes that resolve issues reported by users.
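As a concrete illustration of the tokenizer return-type change above, a minimal sketch (the model path is hypothetical):

```python
import numpy as np
import onnxruntime_genai as og

model = og.Model("path/to/model")  # hypothetical model folder
tokenizer = og.Tokenizer(model)

first = tokenizer.encode("Hello")   # now a numpy array, previously a Python list
second = tokenizer.encode(" world")

# tokens = first + second          # '+' concatenated lists; on arrays it adds element-wise
tokens = np.concatenate([first, second])  # join token sequences instead
```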
v0.6.0
Release Notes
We are excited to announce the release of onnxruntime-genai version 0.6.0. Below are the key updates included in this release:
- Support for contextual or continuous decoding, which allows users to carry out multi-turn, conversation-style generation (see the sketch below).
- Support for new models such as Deepseek R1, AMD OLMo, IBM Granite and others.
- Python 3.13 wheels have been introduced
- Support for generation with models sourced from Qualcomm's AI Hub. This work also includes publishing a nuget package, Microsoft.ML.OnnxRuntimeGenAI.QNN, for the QNN EP.
- Support for the WebGPU EP.
This release also includes performance improvements to optimize memory usage and speed. In addition, there are several bug fixes that resolve issues reported by users.
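A minimal sketch of continuous decoding: tokens from each new user turn are appended to the same generator, so the KV cache carries the conversation forward. The model path and the exact loop shape are illustrative assumptions rather than an API reference:

```python
import onnxruntime_genai as og

model = og.Model("path/to/model")        # hypothetical model folder
tokenizer = og.Tokenizer(model)
params = og.GeneratorParams(model)
generator = og.Generator(model, params)  # one generator spans every turn

for user_text in ["Hi, who are you?", "Now say that in five words."]:
    # Append only the new turn; earlier turns remain in the KV cache.
    generator.append_tokens(tokenizer.encode(user_text))
    while not generator.is_done():
        generator.generate_next_token()
    print(tokenizer.decode(generator.get_sequence(0)))
```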
v0.5.2
Release Notes
Patch release 0.5.2 adds:
- Fixed bugs #1074 and #1092 via PRs #1065 and #1070
- Fixed the NuGet sample in the package README to show correct disposal of objects
- Added extra validation via PRs #1050 and #1066
Features in 0.5.0:
- Support for MultiLoRA
- Support for multi-frame for Phi-3 vision and Phi-3.5 vision models
- Support for the Phi-3 MoE model
- Support for NVIDIA Nemotron model
- Support for the Qwen model
- Addition of the Set Terminate feature, which allows users to cancel mid-generation
- Soft capping support for Group Query Attention
- Extend quantization support to embedding and LM head layers
- Mac support in published packages
Known issues
- Models running with DirectML do not support batching
- Python 3.13 is not supported in this release
v0.5.1
Release Notes
In addition to the features in the 0.5.0 release, this release adds:
- Add ability to choose provider and modify options at runtime
- Fixed data leakage bug with KV caches
Features in 0.5.0:
- Support for MultiLoRA
- Support for multi-frame for Phi-3 vision and Phi-3.5 vision models
- Support for the Phi-3 MoE model
- Support for NVIDIA Nemotron model
- Support for the Qwen model
- Addition of the Set Terminate feature, which allows users to cancel mid-generation
- Soft capping support for Group Query Attention
- Extend quantization support to embedding and LM head layers
- Mac support in published packages
Known issues
- Models running with DirectML do not support batching
- Python 3.13 is not supported in this release
v0.5.0
Release Notes
- Support for MultiLoRA
- Support for multi-frame for Phi-3 vision and Phi-3.5 vision models
- Support for the Phi-3 MoE model
- Support for NVIDIA Nemotron model
- Support for the Qwen model
- Addition of the Set Terminate feature, which allows users to cancel mid-generation (see the sketch at the end of this section)
- Soft capping support for Group Query Attention
- Extend quantization support to embedding and LM head layers
- Mac support in published packages
Known issues
- Models running with DirectML do not support batching
- Python 3.13 is not supported in this release
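The Set Terminate feature lets a caller cancel a running generation from another thread. A hedged sketch of what such usage could look like; the cancellation call shown (set_runtime_option) is an assumption for illustration and may not match the binding's real name or signature:

```python
import threading

import onnxruntime_genai as og

model = og.Model("path/to/model")        # hypothetical model folder
tokenizer = og.Tokenizer(model)
params = og.GeneratorParams(model)
generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("Write a very long story."))

# Assumed cancellation entry point; the real binding may expose Set
# Terminate under a different name or signature.
threading.Timer(2.0, lambda: generator.set_runtime_option("terminate_session", "1")).start()

while not generator.is_done():
    generator.generate_next_token()      # stops once terminate takes effect
```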
v0.4.0
Release Notes
- Support for new models such as Qwen 2, LLaMA 3.1, Gemma 2, Phi-3 small on CPU
- Support for building models that have already been quantized with AWQ or GPTQ (see the builder sketch at the end of this section)
- Performance improvements for Intel and Arm CPU
- Packaging and language bindings
  - Added Java bindings (build from source)
  - Separated OnnxRuntime.dll and DirectML.dll out of the GenAI package to improve usability
  - Published packages for Windows Arm
  - Support for Android (build from source)
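A hedged sketch of converting an already-quantized AWQ or GPTQ model with the model builder. The flag names follow the builder's usual -i/-o/-p/-e convention but should be checked against `python -m onnxruntime_genai.models.builder --help`; paths are hypothetical:

```python
import subprocess

# Convert a checkpoint that was already quantized with AWQ or GPTQ into
# an ONNX Runtime GenAI model, keeping the int4 precision.
subprocess.run(
    [
        "python", "-m", "onnxruntime_genai.models.builder",
        "-i", "path/to/awq_or_gptq_model",  # hypothetical quantized checkpoint
        "-o", "path/to/onnx_output",
        "-p", "int4",  # keep the quantized precision
        "-e", "cpu",
    ],
    check=True,
)
```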
v0.3.0
Release Notes
- Phi-3 Vision model support for DML EP.
- Addressed DML memory leak issue and crashes on long prompts.
- Addressed crashes and slowness on CPU EP GQA on long prompts due to integer overflow issues.
- Added the import lib for windows C API package.
- Addressed a bug with get_output('logits') so that it returns the logits for the entire prompt and not just for the last generated token (see the sketch below).
- Addressed a bug with querying the device type of the model so that it won't crash.
- Added NetStandard 2.0 compatibility.
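For context on the get_output fix above, a minimal sketch of reading raw logits from a generator, written against the current Python binding shapes (the model path is hypothetical):

```python
import onnxruntime_genai as og

model = og.Model("path/to/model")        # hypothetical model folder
tokenizer = og.Tokenizer(model)
params = og.GeneratorParams(model)
generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("The capital of France is"))
generator.generate_next_token()

# After the fix, this holds logits for the entire prompt (roughly
# [batch, sequence_length, vocab_size]) rather than only the last token.
logits = generator.get_output("logits")
print(logits.shape)
```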
ONNX Runtime GenAI v0.3.0-rc2
Release Notes
- Added support for the Phi-3-Vision model.
- Added support for the Phi-3-Small model.
- Removed usage of std::filesystem to avoid runtime issues when loading incompatible symbols from stdc++ and stdc++fs.