Releases: microsoft/onnxruntime-genai
v0.11.2
What's Changed
- Revert removal of eps_without_if_support by @xiaofeihan1 in #1878
- Fix condition for NPU by @apsonawane in #1880
- Set version as 0.11.2 by @kunal-vaishnavi in #1881
Full Changelog: v0.11.1...v0.11.2
v0.11.1
What's Changed
- Cherry pick guidance fix into 0.11.1 release by @kunal-vaishnavi in #1872
- Set version as 0.11.1 by @kunal-vaishnavi in #1873
- Fix regex by @apsonawane in #1876
Full Changelog: v0.11.0...v0.11.1
v0.11.0
What's Changed
- ADO - Update WinML build pipeline by @chrisdMSFT in #1768
- Fix CMakeLists.txt auto-detection of library directory by @anujj in #1774
- Fix new/delete override and Enable cuda kernel test in Windows by @tianleiwu in #1772
- Use abbreviation for TensorRT RTX EP by @kunal-vaishnavi in #1763
- Add trust remote code option to model builder by @kunal-vaishnavi in #1766
- Support block-wise quant in qmoe op by @apsonawane in #1746
- Change the status for TRT-RTX EP by @gaugarg-nv in #1780
- Cherry-Pick changes from rel 0.10.0 back to main. by @chrisdMSFT in #1782
- Fix /CETCOMPAT Usage for Cross-Compiling by @sayanshaw24 in #1779
- Provide distributed version of improved TopK kernel by @hariharans29 in #1710
- [TRT-RTX] Disable KV cache re-computation for Phi models by @gaugarg-nv in #1787
- [CUDA] Add high-performance Top-K kernels and online benchmarking by @tianleiwu in #1748
- Change shared indices array type from float to int by @hariharans29 in #1789
- Enable bfloat16 multi-modal models by @kunal-vaishnavi in #1786
- Disable lmhead while prompt processing by @qti-ashimaj in #1762
- Introduce support for dynamic batching by @baijumeswani in #1662
- Generate pyd type info by @chemwolf6922 in #1742
- Add trt-rtx c packages in c example by @anujj in #1794
- [CUDA] Fix build with CUDA >= 12.9 by @tianleiwu in #1802
- [CUDA] topk kernels v2 by @tianleiwu in #1798
- Add prefill Chunking Support for NvTensorRtRtx and Cuda Providers by @anujj in #1765
- Add TRT-RTX EP support, keep NvTensorRtRtx as user facing name, and force QDQ by @anujj in #1791
- [CUDA] Add static assert to suppress windows build warnings by @tianleiwu in #1804
- Revert "Generate pyd type info" by @baijumeswani in #1805
- [QNN] Support continuous decoding by @baijumeswani in #1808
- ADO Pipeline - nuget_winml_package_reference_version is configured at build time. by @chrisdMSFT in #1811
- Update version to 0.11.0-dev by @baijumeswani in #1815
- Add Support For Tokenizer Options by @sayanshaw24 in #1785
- Fix exit call in README example by @justinchuby in #1823
- Add tokenizer APIs for accessing important ids by @kunal-vaishnavi in #1822
- Use correct classes for config-only usage in model builder by @kunal-vaishnavi in #1828
- Fix packaging pipeline by @baijumeswani in #1829
- Add missing tokenizer methods in java by @baijumeswani in #1833
- Add run options to ONNX Runtime GenAI by @kunal-vaishnavi in #1795
- Avoid Processing EOS Token During Continuous Decoding by @baijumeswani in #1814
- Fix nuget packaging pipeline for dev builds by @baijumeswani in #1837
- Add tool normalization for tool calling by @kunal-vaishnavi in #1838
- Refactor past_present_share_buffer logic into reusable function by @anujj in #1839
- Fix nuget packaging pipeline by @baijumeswani in #1841
- Add enable_webgpu_graph in extra_options by @qjia7 in #1788
- Update tool normalization in ORT GenAI by @kunal-vaishnavi in #1842
- Support RotaryEmbedding in GQA for webgpu ep by @xiaofeihan1 in #1847
- Enable guidance ff tokens for faster inference by @JC1DA in #1803
- Support pre-registered plug-in cuda execution provider library by @baijumeswani in #1850
- ADO: Update pipeline to publish onnxruntime-genai for relwithdebinfo builds. by @chrisdMSFT in #1855
- Layer-wise KV Cache Allocation for Models with Alternating Attention Patterns by @anujj in #1832
- Mpasumarthi/nvtrt test suite by @mpasumarthi-git in #1756
- bugfix: fix a memory issue in Whisper by @fs-eire in #1859
- Add disable cuda graph when num_beams > 1 and fix set_provider_option bug by @anujj in #1846
- Mixed precision export support for gptq quantized model by @rM-planet in #1853
- Enable If Node Support for TRT-RTX in Phi-3.5/Phi-4 LongRoPE Models by @anujj in #1851
- Fix handling EOS token id detection by @kunal-vaishnavi in #1849
- Ensure Consistent Tool Calling JSON Serialization and Deserialization by @sayanshaw24 in #1863
- Add C# binding for GetNextTokens by @kunal-vaishnavi in #1865
- Set version as 0.11.0 by @kunal-vaishnavi in #1866
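Among the changes above, prefill chunking (#1765) splits a long prompt into fixed-size pieces that are fed to the engine one at a time instead of in a single large prefill call. A minimal sketch of the idea, not the actual implementation; `process_chunk` and `chunk_size` are hypothetical stand-ins for the engine's prompt-processing step and its configured chunk length:

```python
def prefill_in_chunks(token_ids, process_chunk, chunk_size=256):
    """Feed a long prompt to the engine in fixed-size chunks so each
    prompt-processing step stays within a bounded token budget.

    `process_chunk` stands in for the engine's real prefill call."""
    for start in range(0, len(token_ids), chunk_size):
        process_chunk(token_ids[start:start + chunk_size])
```

Every chunk except possibly the last has exactly `chunk_size` tokens, and concatenating the chunks reproduces the original prompt.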
New Contributors
- @hariharans29 made their first contribution in #1710
- @qti-ashimaj made their first contribution in #1762
- @chemwolf6922 made their first contribution in #1742
- @qjia7 made their first contribution in #1788
- @xiaofeihan1 made their first contribution in #1847
- @JC1DA made their first contribution in #1803
- @mpasumarthi-git made their first contribution in #1756
- @rM-planet made their first contribution in #1853
Full Changelog: v0.10.0...v0.11.0
v0.10.0
What's Changed
- Enable continuous decoding for NvTensorRtRtx EP by @anujj in #1697
- Use updated Decoder API with skip_special_tokens by @sayanshaw24 in #1722
- Update extensions to include memleak fix by @baijumeswani in #1724
- Support batch processing for whisper example by @jiafatom in #1723
- Update onnxruntime_extensions dependency version by @baijumeswani in #1725
- Include C++ header in native nuget and fix compiler warnings by @baijumeswani in #1727
- Update Microsoft.Extensions.AI to 9.8.0 by @rogerbarreto in #1689
- Update Extensions commit for Qwen 2.5 Chat Template Tools Fix by @sayanshaw24 in #1730
- Whisper Truncation Extensions Commit Update by @sayanshaw24 in #1735
- Enable Cuda Graph for TensorRtRtx by default by @anujj in #1734
- Update sampling benchmark by @tianleiwu in #1729
- Add Windows WinML x64 build workflow by @chrisdMSFT in #1740
- Fix CUDA synchronization issue between ORT-GenAI and TRT-RTX inference by @anujj in #1733
- Hello WindowsML by @chrisdMSFT in #1711
- [CUDA] sampling kernel improvements by @tianleiwu in #1732
- Update GitHub Actions to latest versions by @snnn in #1749
- Update WinML version to 1.8.2091 by @nieubank in #1750
- Address macos packaging pipeline issues by @baijumeswani in #1747
- ProviderOptions level device filtering and APIs to configure model level device filtering by @vortex-captain in #1744
- Fix string indexing bug with Phi-4 mm tokenization by @kunal-vaishnavi in #1751
- Fix TRT-RTX EP regression by @gaugarg-nv in #1754
- Fix typo in C API header by @kunal-vaishnavi in #1753
- Enable WinML by default in ADO pipelines by @chrisdMSFT in #1755
- Change default build configuration to 'relwithdebinfo' by @baijumeswani in #1757
- Pin cmake and vcpkg versions in macOS workflows by @snnn in #1760
- Add TRT_RTX support for onnxruntime-genai-trt-rtx wheel by @anujj in #1736
- rel-0.10.0 by @chrisdMSFT in #1767
- Microsoft.ML.OnnxRuntimeGenAI.WinML.props by @chrisdMSFT in #1776
- Warning fix - ort_genai.h by @chrisdMSFT in #1778
- Microsoft.ML.OnnxRuntimeGenAI.targets by @chrisdMSFT in #1781
Full Changelog: v0.9.2...v0.10.0
v0.9.2
This release fixes a pre-processing bug with Phi-4 multimodal.
Full Changelog: v0.9.1...v0.9.2
v0.9.1
🚀 Features
Support for Continuous Batching (#1580) by @baijumeswani
RegisterExecutionProviderLibrary (#1628) by @vortex-captain
Enable CUDA graph for LLMs for NvTensorRtRtx EP (#1645) by @anujj
Add support for smollm3 (#1666) by @xenova
Add OpenAI's gpt-oss to ONNX Runtime GenAI (#1678) by @kunal-vaishnavi
Add custom ops library path resolution using EP metadata (#1707) by @psakhamoori
Use OnnxRuntime API wrapper for EP device operations (#1719) by @psakhamoori
🛠 Improvements
Update Extensions Commit to Support Strft Custom Function for Chat Template (#1670) by @sayanshaw24
Add parameters to chat template in chat example (#1673) by @kunal-vaishnavi
Update how Hugging Face's config files are processed (#1693) by @kunal-vaishnavi
Tie embedding weight sharing (#1690) by @jiafatom
Improve top-k sampling CUDA kernel (#1708) by @gaugarg-nv
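The top-k sampling improvement above (#1708) concerns a CUDA kernel; as a point of reference, the technique itself can be sketched in a few lines of NumPy. This is an illustrative sketch only, with `top_k_sample` a hypothetical name, not the library's API:

```python
import numpy as np

def top_k_sample(logits, k, rng):
    """Keep only the k highest logits, softmax over them, and sample one token id."""
    top = np.argpartition(logits, -k)[-k:]   # indices of the k largest logits (unordered)
    z = logits[top] - logits[top].max()      # shift by the max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()      # softmax restricted to the surviving logits
    return int(rng.choice(top, p=probs))

rng = np.random.default_rng(0)
token = top_k_sample(np.array([0.1, 3.0, 2.0, -1.0]), k=2, rng=rng)
```

With `k=2` the sampled token is always one of the two highest-scoring ids (here 1 or 2); the GPU kernel's job is to do the selection and normalization efficiently over large vocabularies.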
🐛 Bug Fixes
Fix accessing final norm for Gemma-3 models (#1687) by @kunal-vaishnavi
Fix runtime bugs with multi-modal models (#1701) by @kunal-vaishnavi
Fix BF16 CUDA version of OpenAI's gpt-oss (#1706) by @kunal-vaishnavi
Fix benchmark_e2e (#1702) by @jiafatom
Fix benchmark_multimodal (#1714) by @jiafatom
Fix pad vs. eos token misidentification (#1694) by @aciddelgado
⚡ Performance & EP Enhancements
NvTensorRtRtx: Support num_beam > 1 (#1688) by @anujj
NvTensorRtRtx: Skip if node of Phi4 models (#1696) by @anujj
Remove QDQ and Opset Coupling for TRT RTX EP (#1692) by @xiaoyu-work
🔒 Build & CI
Enable Security Protocols in MSVC for BinSkim (#1672) by @sayanshaw24
Explicitly specify setup-java architecture in win-cpu-arm64-build.yml (#1685) by @edgchen1
Use dotnet instead of nuget in mac build (#1717) by @natke
📦 Versioning & Release
Update version to 0.10.0 (#1676) by @ajindal1
Cherrypick 0: Forgot to change versions (#1721) by @aciddelgado
Cherrypick 1... Becomes RC1 (#1726) by @aciddelgado
Cherrypick 2 (#1743) by @aciddelgado
🙌 New Contributors
@xiaoyu-work (#1692)
@psakhamoori (#1707)
✅ Full Changelog: v0.9.0...v0.9.1
v0.9.0
What's Changed
New Features
- Constrained decoding integration by @ajindal1 in #1381
- Update constrained decoding by @ajindal1 in #1477
- Enable TRT multi profile option through provider option by @anujj in #1493
- Add support for Machine Translation model by @apsonawane in #1482
- Overlap prompt processing KV cache update for WindowedKeyValueCache in DecoderOnlyPipelineState by @edgchen1 in #1526
- Add basic support for tracing by @edgchen1 in #1524
- Logging SetLogCallback + Debugging cleanup by @RyanUnderhill in #1471
- Support loading models from memory by @baijumeswani in #1571
- Add SLM Engine support function calling by @kinfey in #1582
- Pass the batch_size through the Overlay by @anujj in #1627
- Enable GPU based sampling for TRT-RTX by @gaugarg-nv in #1650
Model Builder Changes
- Whisper Redesigned Solution by @kunal-vaishnavi in #1229
- [Builder] Add support for Olive quantized models by @jambayk in #1647
- Add Qwen3 to model builder by @xenova in #1428
- Model builder: Add ability to exclude a node from quantization by @sushraja-msft in #1436
- Support k_quant in model builder by @jiafatom in #1444
- Add final norm for LoRA models by @kunal-vaishnavi in #1446
- Add bfloat16 support in model builder by @kunal-vaishnavi in #1447
- Fix accuracy issues with Gemma models by @kunal-vaishnavi in #1448
- Always cast bf16 logits to fp32 by @nenad1002 in #1479
- NvTensorRtRtx EP option in GenAI - model builder by @BLSharda in #1453
- Add Gemma3 Model support for NvTensorRtRtx execution provider by @anujj in #1520
- Use IRv10 in the model builder by @justinchuby in #1547
- [Builder] Rename methods make_value and make_initializer by @justinchuby in #1554
- Always use opset21 in builder by @justinchuby in #1548
- Clamp KV Cache Size to Sliding Window for NvTensorRtRtx EP by @BLSharda in #1523
- [Builder] Fix output name in make_rotary_embedding_multi_cache by @justinchuby in #1562
- [Builder] Use lazy tensor by @justinchuby in #1556
- [Builder] Fix KeyError for torch.uint8 in dtype mapping for MoE quantization by @Copilot in #1561
- [Builder] Fix 1d constant creation by @justinchuby in #1568
- [Builder] Create progress bar by @justinchuby in #1559
- [Builder] Use packed 4bit tensors directly by @justinchuby in #1566
- [Builder] Simplify constant creation by @justinchuby in #1569
- [Builder] Add cuda-bfloat16 entry to valid_gqa_configurations by @justinchuby in #1585
- [Builder] use dtype conversion helpers from onnx_ir by @justinchuby in #1587
- [Model builder] Add support for Ernie 4.5 models by @xenova in #1608
- whisper: Allow session options to be used for encoder by @RyanMetcalfeInt8 in #1622
- Make default top_k=50 in model builder by @jiafatom in #1642
- Update builder.py by @lnigam in #1665
- Change IO dtype for INT4 CUDA models by @kunal-vaishnavi in #1629
Bug fixes
- CUDA Top K / Top P Fixes by @aciddelgado in #1371
- Persist provider options across ClearProviders, AppendProvider where possible by @baijumeswani in #1454
- Add enable_skip_layer_norm_strict_mode flag by @nenad1002 in #1462
- Avoid adding providers if not requested by @baijumeswani in #1464
- Fix array eos_token_id handling by @RyanUnderhill in #1463
- Remove BF16 CPU from valid GQA configuration by @nenad1002 in #1469
- Address QNN specific regressions by @baijumeswani in #1470
- Fix how torch tensors are saved by @kunal-vaishnavi in #1476
- Fix model chat example for rewind by @ajindal1 in #1480
- Correctly iterate over the providers to check if graph capture is enabled by @baijumeswani in #1497
- Fix missing parameter name by @xadupre in #1502
- Fix from pretrained method for quantized models by @kunal-vaishnavi in #1503
- Remove position_id and fix context phase KV shapes for in-place cache buffer support by @anujj in #1505
- Fix last layer generation for text-only models by @nenad1002 in #1513
- [Fix] Remove references to TensorProto by @justinchuby in #1549
- Fix make_layernorm_casts usage of value infos by @justinchuby in #1551
- Fix DML Memory Leak by @aciddelgado in #1578
- [DML] Bind the dml global objects to the Model by @baijumeswani in #1590
- NvTensorRTRTx: Enable CUDA graph via config and fix attention_mask shape handling by @anujj in #1594
- Append eos token to the end of input sequence for marian models by @apsonawane in #1630
- Use two-step Softmax to do cuda sampling by @jiafatom in #1617
- Use two-step softmax for CPU sampling by @jiafatom in #1631
- Use last windowed input ids to update logits by @baijumeswani in #1636
- Fix attention‑mask stride bug for static masking (batch > 1) by @anujj in #1639
- Add open bytes functionality for C# by @ajindal1 in #1634
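Two of the fixes above (#1617, #1631) move sampling to a two-step softmax. The exact kernels are not shown here; a minimal sketch of the commonly used two-pass formulation, assuming the "two steps" are an online max-and-sum pass followed by a normalization pass:

```python
import math

def softmax_two_pass(x):
    """Numerically stable softmax in two passes over the data:
    pass 1 maintains a running max and a rescaled running exp-sum,
    pass 2 divides each shifted exponential by that sum."""
    m = float("-inf")  # running maximum
    s = 0.0            # running sum of exp(x_i - m)
    for v in x:        # pass 1
        if v > m:
            s *= math.exp(m - v)  # rescale the old sum to the new max
            m = v
        s += math.exp(v - m)
    return [math.exp(v - m) / s for v in x]  # pass 2
```

Subtracting the running maximum keeps every exponential in [0, 1], which avoids the overflow that a naive `exp(x)/sum(exp(x))` can hit for large logits.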
Packaging/Testing/Pipelines
- Sign macos binaries by @baijumeswani in #1439
- Add chat template tests by @sayanshaw24 in #1457
- Update triggers by @snnn in #1490
- Add support for building a cuda + dml package by @baijumeswani in #1600
- NvTensorRtRtx: Pass the dynamic shapes (ISL and batch_size) to the ep at runtime as nv profile. by @anujj in #1614
- Update docker image by @snnn in #1633
- Sign all genai dlls, in both onnxruntime-genai and python targets by @vortex-captain in #1635
- Fixes all packaging pipelines by @baijumeswani in #1641
- Update the benchmark scripts to account for the time spent in sampling by @gaugarg-nv in #1646
- Add date for nightly packages by @ajindal1 in #1668
Compliance
- Enable policheck in packaging pipeline by @apsonawane in #1449
- Add third party notices in file exclusion by @apsonawane in #1459
- Enable tsa options in packaging pipelines by @apsonawane in #1460
- Update windows packaging pipelines to use build.py by @aciddelgado in #1468
Documentation and Examples
- Update OnnxRuntimeGenAIChatClient with chat template and guidance by @stephentoub in #1533
- Update SimpleGenAI.java docs by @edgc...
v0.8.3
v0.8.2
What's changed
New features
- Use Accuracy level 4 for webgpu by default by @guschmue (#1474)
- Enable guidance by default on macos by @ajindal1 (#1514)
Bug fixes
- Remove position_id and fix context phase KV shapes for in-place cache buffer support by @anujj (#1505)
- Update Extensions Commit for 0.8.2 by @sayanshaw24 (#1519)
- Update Extensions Commit for another DeepSeek Fix by @sayanshaw24 (#1521)
Full Changelog: v0.8.1...v0.8.2
v0.8.1
What's changed
New features
- Integrate tools input into Chat Template API by @sayanshaw24 (#1472)
- NvTensorRtRtx EP option in GenAI - model builder by @BLSharda (#1453)
- Enable TRT multi profile option through provider option by @anujj (#1493)
Bug fixes
- Always cast bf16 logits to fp32 by @nenad1002 (#1479)
Examples and documentation
- Update Chat Template Examples for Tools API change by @sayanshaw24 (#1506)
- Fix model chat example for rewind by @ajindal1 (#1480)
Model builder changes
- Fix from pretrained method for quantized models by @kunal-vaishnavi (#1503)
- Fix missing parameter name by @xadupre (#1502)
- minor change to support qwen3 by @guschmue (#1499)
- Fix how torch tensors are saved by @kunal-vaishnavi (#1476)
- Support k_quant in model builder by @jiafatom (#1444)
Dependency updates
- Update to stable release of Microsoft.Extensions.AI.Abstractions by @stephentoub (#1489)
- Update to M.E.AI 9.4.3-preview.1.25230.7 by @stephentoub (#1443)
Full Changelog: v0.8.0...v0.8.1