Releases: microsoft/onnxruntime-genai
v0.11.2
What's Changed
- Revert removal of eps_without_if_support by @xiaofeihan1 in #1878
- Fix condition for NPU by @apsonawane in #1880
- Set version as 0.11.2 by @kunal-vaishnavi in #1881
Full Changelog: v0.11.1...v0.11.2
v0.11.1
What's Changed
- Cherry pick guidance fix into 0.11.1 release by @kunal-vaishnavi in #1872
- Set version as 0.11.1 by @kunal-vaishnavi in #1873
- Fix regex by @apsonawane in #1876
Full Changelog: v0.11.0...v0.11.1
v0.11.0
What's Changed
- ADO - Update WinML build pipeline by @chrisdMSFT in #1768
- Fix CMakeLists.txt auto-detection of library directory by @anujj in #1774
- Fix new/delete override and Enable cuda kernel test in Windows by @tianleiwu in #1772
- Use abbreviation for TensorRT RTX EP by @kunal-vaishnavi in #1763
- Add trust remote code option to model builder by @kunal-vaishnavi in #1766
- Support block-wise quant in qmoe op by @apsonawane in #1746
- Change the status for TRT-RTX EP by @gaugarg-nv in #1780
- Cherry-Pick changes from rel 0.10.0 back to main. by @chrisdMSFT in #1782
- Fix /CETCOMPAT Usage for Cross-Compiling by @sayanshaw24 in #1779
- Provide distributed version of improved TopK kernel by @hariharans29 in #1710
- [TRT-RTX] Disable KV cache re-computation for Phi models by @gaugarg-nv in #1787
- [CUDA] Add high-performance Top-K kernels and online benchmarking by @tianleiwu in #1748
- Change shared indices array type from float to int by @hariharans29 in #1789
- Enable bfloat16 multi-modal models by @kunal-vaishnavi in #1786
- Disable lmhead while prompt processing by @qti-ashimaj in #1762
- Introduce support for dynamic batching by @baijumeswani in #1662
- Generate pyd type info by @chemwolf6922 in #1742
- Add trt-rtx c packages in c example by @anujj in #1794
- [CUDA] Fix build with CUDA >= 12.9 by @tianleiwu in #1802
- [CUDA] topk kernels v2 by @tianleiwu in #1798
- Add prefill Chunking Support for NvTensorRtRtx and Cuda Providers by @anujj in #1765
- Add TRT-RTX EP support, keep NvTensorRtRtx as user facing name, and force QDQ by @anujj in #1791
- [CUDA] Add static assert to suppress windows build warnings by @tianleiwu in #1804
- Revert "Generate pyd type info" by @baijumeswani in #1805
- [QNN] Support continuous decoding by @baijumeswani in #1808
- ADO Pipeline - nuget_winml_package_reference_version is configured at build time. by @chrisdMSFT in #1811
- Update version to 0.11.0-dev by @baijumeswani in #1815
- Add Support For Tokenizer Options by @sayanshaw24 in #1785
- Fix exit call in README example by @justinchuby in #1823
- Add tokenizer APIs for accessing important ids by @kunal-vaishnavi in #1822
- Use correct classes for config-only usage in model builder by @kunal-vaishnavi in #1828
- Fix packaging pipeline by @baijumeswani in #1829
- Add missing tokenizer methods in java by @baijumeswani in #1833
- Add run options to ONNX Runtime GenAI by @kunal-vaishnavi in #1795
- Avoid Processing EOS Token During Continuous Decoding by @baijumeswani in #1814
- Fix nuget packaging pipeline for dev builds by @baijumeswani in #1837
- Add tool normalization for tool calling by @kunal-vaishnavi in #1838
- Refactor past_present_share_buffer logic into reusable function by @anujj in #1839
- Fix nuget packaging pipeline by @baijumeswani in #1841
- Add enable_webgpu_graph in extra_options by @qjia7 in #1788
- Update tool normalization in ORT GenAI by @kunal-vaishnavi in #1842
- Support RotaryEmbedding in GQA for webgpu ep by @xiaofeihan1 in #1847
- Enable guidance ff tokens for faster inference by @JC1DA in #1803
- Support pre-registered plug-in cuda execution provider library by @baijumeswani in #1850
- ADO: Update pipeline to publish onnxruntime-genai for relwithdebinfo builds. by @chrisdMSFT in #1855
- Layer-wise KV Cache Allocation for Models with Alternating Attention Patterns by @anujj in #1832
- Mpasumarthi/nvtrt test suite by @mpasumarthi-git in #1756
- bugfix: fix a memory issue in Whisper by @fs-eire in #1859
- Add disable cuda graph when num_beams > 1 and fix set_provider_option bug by @anujj in #1846
- Mixed precision export support for gptq quantized model by @rM-planet in #1853
- Enable If Node Support for TRT-RTX in Phi-3.5/Phi-4 LongRoPE Models by @anujj in #1851
- Fix handling EOS token id detection by @kunal-vaishnavi in #1849
- Ensure Consistent Tool Calling JSON Serialization and Deserialization by @sayanshaw24 in #1863
- Add C# binding for GetNextTokens by @kunal-vaishnavi in #1865
- Set version as 0.11.0 by @kunal-vaishnavi in #1866
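Among the changes above, prefill chunking (#1765) splits a long prompt into fixed-size pieces that are fed to the engine one at a time instead of in a single large prefill call. A minimal sketch of the idea, not the actual implementation; `process_chunk` and `chunk_size` are hypothetical stand-ins for the engine's prompt-processing step and its configured chunk length:

```python
def prefill_in_chunks(token_ids, process_chunk, chunk_size=256):
    """Feed a long prompt to the engine in fixed-size chunks so each
    prompt-processing step stays within a bounded token budget.

    `process_chunk` stands in for the engine's real prefill call."""
    for start in range(0, len(token_ids), chunk_size):
        process_chunk(token_ids[start:start + chunk_size])
```

Every chunk except possibly the last has exactly `chunk_size` tokens, and concatenating the chunks reproduces the original prompt.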
New Contributors
- @hariharans29 made their first contribution in #1710
- @qti-ashimaj made their first contribution in #1762
- @chemwolf6922 made their first contribution in #1742
- @qjia7 made their first contribution in #1788
- @xiaofeihan1 made their first contribution in #1847
- @JC1DA made their first contribution in #1803
- @mpasumarthi-git made their first contribution in #1756
- @rM-planet made their first contribution in #1853
Full Changelog: v0.10.0...v0.11.0
v0.10.0
What's Changed
- Enable continuous decoding for NvTensorRtRtx EP by @anujj in #1697
- Use updated Decoder API with skip_special_tokens by @sayanshaw24 in #1722
- Update extensions to include memleak fix by @baijumeswani in #1724
- Support batch processing for whisper example by @jiafatom in #1723
- Update onnxruntime_extensions dependency version by @baijumeswani in #1725
- Include C++ header in native nuget and fix compiler warnings by @baijumeswani in #1727
- Update Microsoft.Extensions.AI to 9.8.0 by @rogerbarreto in #1689
- Update Extensions commit for Qwen 2.5 Chat Template Tools Fix by @sayanshaw24 in #1730
- Whisper Truncation Extensions Commit Update by @sayanshaw24 in #1735
- Enable Cuda Graph for TensorRtRtx by default by @anujj in #1734
- Update sampling benchmark by @tianleiwu in #1729
- Add Windows WinML x64 build workflow by @chrisdMSFT in #1740
- Fix CUDA synchronization issue between ORT-GenAI and TRT-RTX inference by @anujj in #1733
- Hello WindowsML by @chrisdMSFT in #1711
- [CUDA] sampling kernel improvements by @tianleiwu in #1732
- Update GitHub Actions to latest versions by @snnn in #1749
- Update WinML version to 1.8.2091 by @nieubank in #1750
- Address macos packaging pipeline issues by @baijumeswani in #1747
- ProviderOptions level device filtering and APIs to configure model level device filtering by @vortex-captain in #1744
- Fix string indexing bug with Phi-4 mm tokenization by @kunal-vaishnavi in #1751
- Fix TRT-RTX EP regression by @gaugarg-nv in #1754
- Fix typo in C API header by @kunal-vaishnavi in #1753
- Enable WinML by default in ADO pipelines by @chrisdMSFT in #1755
- Change default build configuration to 'relwithdebinfo' by @baijumeswani in #1757
- Pin cmake and vcpkg versions in macOS workflows by @snnn in #1760
- Add TRT_RTX support for onnxruntime-genai-trt-rtx wheel by @anujj in #1736
- rel-0.10.0 by @chrisdMSFT in #1767
- Microsoft.ML.OnnxRuntimeGenAI.WinML.props by @chrisdMSFT in #1776
- Warning fix - ort_genai.h by @chrisdMSFT in #1778
- Microsoft.ML.OnnxRuntimeGenAI.targets by @chrisdMSFT in #1781
Full Changelog: v0.9.2...v0.10.0
v0.9.2
This release fixes a pre-processing bug with Phi-4 multimodal.
Full Changelog: v0.9.1...v0.9.2
v0.9.1
🚀 Features
Support for Continuous Batching (#1580) by @baijumeswani
RegisterExecutionProviderLibrary (#1628) by @vortex-captain
Enable CUDA graph for LLMs for NvTensorRtRtx EP (#1645) by @anujj
Add support for smollm3 (#1666) by @xenova
Add OpenAI's gpt-oss to ONNX Runtime GenAI (#1678) by @kunal-vaishnavi
Add custom ops library path resolution using EP metadata (#1707) by @psakhamoori
Use OnnxRuntime API wrapper for EP device operations (#1719) by @psakhamoori
🛠 Improvements
Update Extensions Commit to Support Strft Custom Function for Chat Template (#1670) by @sayanshaw24
Add parameters to chat template in chat example (#1673) by @kunal-vaishnavi
Update how Hugging Face's config files are processed (#1693) by @kunal-vaishnavi
Tie embedding weight sharing (#1690) by @jiafatom
Improve top-k sampling CUDA kernel (#1708) by @gaugarg-nv
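The top-k sampling improvement above (#1708) concerns a CUDA kernel; as a point of reference, the technique itself can be sketched in a few lines of NumPy. This is an illustrative sketch only, with `top_k_sample` a hypothetical name, not the library's API:

```python
import numpy as np

def top_k_sample(logits, k, rng):
    """Keep only the k highest logits, softmax over them, and sample one token id."""
    top = np.argpartition(logits, -k)[-k:]   # indices of the k largest logits (unordered)
    z = logits[top] - logits[top].max()      # shift by the max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()      # softmax restricted to the surviving logits
    return int(rng.choice(top, p=probs))

rng = np.random.default_rng(0)
token = top_k_sample(np.array([0.1, 3.0, 2.0, -1.0]), k=2, rng=rng)
```

With `k=2` the sampled token is always one of the two highest-scoring ids (here 1 or 2); the GPU kernel's job is to do the selection and normalization efficiently over large vocabularies.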
🐛 Bug Fixes
Fix accessing final norm for Gemma-3 models (#1687) by @kunal-vaishnavi
Fix runtime bugs with multi-modal models (#1701) by @kunal-vaishnavi
Fix BF16 CUDA version of OpenAI's gpt-oss (#1706) by @kunal-vaishnavi
Fix benchmark_e2e (#1702) by @jiafatom
Fix benchmark_multimodal (#1714) by @jiafatom
Fix pad vs. eos token misidentification (#1694) by @aciddelgado
⚡ Performance & EP Enhancements
NvTensorRtRtx: Support num_beam > 1 (#1688) by @anujj
NvTensorRtRtx: Skip if node of Phi4 models (#1696) by @anujj
Remove QDQ and Opset Coupling for TRT RTX EP (#1692) by @xiaoyu-work
🔒 Build & CI
Enable Security Protocols in MSVC for BinSkim (#1672) by @sayanshaw24
Explicitly specify setup-java architecture in win-cpu-arm64-build.yml (#1685) by @edgchen1
Use dotnet instead of nuget in mac build (#1717) by @natke
📦 Versioning & Release
Update version to 0.10.0 (#1676) by @ajindal1
Cherrypick 0: Forgot to change versions (#1721) by @aciddelgado
Cherrypick 1... Becomes RC1 (#1726) by @aciddelgado
Cherrypick 2 (#1743) by @aciddelgado
🙌 New Contributors
@xiaoyu-work (#1692)
@psakhamoori (#1707)
✅ Full Changelog: v0.9.0...v0.9.1
v0.9.0
What's Changed
New Features
- Constrained decoding integration by @ajindal1 in #1381
- Update constrained decoding by @ajindal1 in #1477
- Enable TRT multi profile option through provider option by @anujj in #1493
- Add support for Machine Translation model by @apsonawane in #1482
- Overlap prompt processing KV cache update for WindowedKeyValueCache in DecoderOnlyPipelineState by @edgchen1 in #1526
- Add basic support for tracing by @edgchen1 in #1524
- Logging SetLogCallback + Debugging cleanup by @RyanUnderhill in #1471
- Support loading models from memory by @baijumeswani in #1571
- Add SLM Engine support function calling by @kinfey in #1582
- Pass the batch_size through the Overlay by @anujj in #1627
- Enable GPU based sampling for TRT-RTX by @gaugarg-nv in #1650
Model Builder Changes
- Whisper Redesigned Solution by @kunal-vaishnavi in #1229
- [Builder] Add support for Olive quantized models by @jambayk in #1647
- Add Qwen3 to model builder by @xenova in #1428
- Model builder: Add ability to exclude a node from quantization by @sushraja-msft in #1436
- Support k_quant in model builder by @jiafatom in #1444
- Add final norm for LoRA models by @kunal-vaishnavi in #1446
- Add bfloat16 support in model builder by @kunal-vaishnavi in #1447
- Fix accuracy issues with Gemma models by @kunal-vaishnavi in #1448
- Always cast bf16 logits to fp32 by @nenad1002 in #1479
- NvTensorRtRtx EP option in GenAI - model builder by @BLSharda in #1453
- Add Gemma3 Model support for NvTensorRtRtx execution provider by @anujj in #1520
- Use IRv10 in the model builder by @justinchuby in #1547
- [Builder] Rename methods make_value and make_initializer by @justinchuby in #1554
- Always use opset21 in builder by @justinchuby in #1548
- Clamp KV Cache Size to Sliding Window for NvTensorRtRtx EP by @BLSharda in #1523
- [Builder] Fix output name in make_rotary_embedding_multi_cache by @justinchuby in #1562
- [Builder] Use lazy tensor by @justinchuby in #1556
- [Builder] Fix KeyError for torch.uint8 in dtype mapping for MoE quantization by @Copilot in #1561
- [Builder] Fix 1d constant creation by @justinchuby in #1568
- [Builder] Create progress bar by @justinchuby in #1559
- [Builder] Use packed 4bit tensors directly by @justinchuby in #1566
- [Builder] Simplify constant creation by @justinchuby in #1569
- [Builder] Add cuda-bfloat16 entry to valid_gqa_configurations by @justinchuby in #1585
- [Builder] use dtype conversion helpers from onnx_ir by @justinchuby in #1587
- [Model builder] Add support for Ernie 4.5 models by @xenova in #1608
- whisper: Allow session options to be used for encoder by @RyanMetcalfeInt8 in #1622
- Make default top_k=50 in model builder by @jiafatom in #1642
- Update builder.py by @lnigam in #1665
- Change IO dtype for INT4 CUDA models by @kunal-vaishnavi in #1629
Bug fixes
- CUDA Top K / Top P Fixes by @aciddelgado in #1371
- Persist provider options across ClearProviders, AppendProvider where possible by @baijumeswani in #1454
- Add enable_skip_layer_norm_strict_mode flag by @nenad1002 in #1462
- Avoid adding providers if not requested by @baijumeswani in #1464
- Fix array eos_token_id handling by @RyanUnderhill in #1463
- Remove BF16 CPU from valid GQA configuration by @nenad1002 in #1469
- Address QNN specific regressions by @baijumeswani in #1470
- Fix how torch tensors are saved by @kunal-vaishnavi in #1476
- Fix model chat example for rewind by @ajindal1 in #1480
- Correctly iterate over the providers to check if graph capture is enabled by @baijumeswani in #1497
- Fix missing parameter name by @xadupre in #1502
- Fix from pretrained method for quantized models by @kunal-vaishnavi in #1503
- Remove position_id and fix context phase KV shapes for in-place cache buffer support by @anujj in #1505
- Fix last layer generation for text-only models by @nenad1002 in #1513
- [Fix] Remove references to TensorProto by @justinchuby in #1549
- Fix make_layernorm_casts usage of value infos by @justinchuby in #1551
- Fix DML Memory Leak by @aciddelgado in #1578
- [DML] Bind the dml global objects to the Model by @baijumeswani in #1590
- NvTensorRTRTx: Enable CUDA graph via config and fix attention_mask shape handling by @anujj in #1594
- Append eos token to the end of input sequence for marian models by @apsonawane in #1630
- Use two-step Softmax to do cuda sampling by @jiafatom in #1617
- Use two-step softmax for CPU sampling by @jiafatom in #1631
- Use last windowed input ids to update logits by @baijumeswani in #1636
- Fix attention‑mask stride bug for static masking (batch > 1) by @anujj in #1639
- Add open bytes functionality for C# by @ajindal1 in #1634
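Two of the fixes above (#1617, #1631) move sampling to a two-step softmax. The exact kernels are not shown here; a minimal sketch of the commonly used two-pass formulation, assuming the "two steps" are an online max-and-sum pass followed by a normalization pass:

```python
import math

def softmax_two_pass(x):
    """Numerically stable softmax in two passes over the data:
    pass 1 maintains a running max and a rescaled running exp-sum,
    pass 2 divides each shifted exponential by that sum."""
    m = float("-inf")  # running maximum
    s = 0.0            # running sum of exp(x_i - m)
    for v in x:        # pass 1
        if v > m:
            s *= math.exp(m - v)  # rescale the old sum to the new max
            m = v
        s += math.exp(v - m)
    return [math.exp(v - m) / s for v in x]  # pass 2
```

Subtracting the running maximum keeps every exponential in [0, 1], which avoids the overflow that a naive `exp(x)/sum(exp(x))` can hit for large logits.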
Packaging/Testing/Pipelines
- Sign macos binaries by @baijumeswani in #1439
- Add chat template tests by @sayanshaw24 in #1457
- Update triggers by @snnn in #1490
- Add support for building a cuda + dml package by @baijumeswani in #1600
- NvTensorRtRtx: Pass the dynamic shapes (ISL and batch_size) to the ep at runtime as nv profile. by @anujj in #1614
- Update docker image by @snnn in #1633
- Sign all genai dlls, in both onnxruntime-genai and python targets by @vortex-captain in #1635
- Fixes all packaging pipelines by @baijumeswani in #1641
- Update the benchmark scripts to account for the time spent in sampling by @gaugarg-nv in #1646
- Add date for nightly packages by @ajindal1 in #1668
Compliance
- Enable policheck in packaging pipeline by @apsonawane in #1449
- Add third party notices in file exclusion by @apsonawane in #1459
- Enable tsa options in packaging pipelines by @apsonawane in #1460
- Update windows packaging pipelines to use build.py by @aciddelgado in #1468
Documentation and Examples
- Update OnnxRuntimeGenAIChatClient with chat template and guidance by @stephentoub in #1533
- Update SimpleGenAI.java docs by @edgc...
v0.8.3
v0.8.2
What's changed
New features
- Use Accuracy level 4 for webgpu by default by @guschmue (#1474)
- Enable guidance by default on macos by @ajindal1 (#1514)
Bug fixes
- Remove position_id and fix context phase KV shapes for in-place cache buffer support by @anujj (#1505)
- Update Extensions Commit for 0.8.2 by @sayanshaw24 (#1519)
- Update Extensions Commit for another DeepSeek Fix by @sayanshaw24 (#1521)
Full Changelog: v0.8.1...v0.8.2
v0.8.1
What's changed
New features
- Integrate tools input into Chat Template API by @sayanshaw24 (#1472)
- NvTensorRtRtx EP option in GenAI - model builder by @BLSharda (#1453)
- Enable TRT multi profile option through provider option by @anujj (#1493)
Bug fixes
- Always cast bf16 logits to fp32 by @nenad1002 (#1479)
Examples and documentation
- Update Chat Template Examples for Tools API change by @sayanshaw24 (#1506)
- Fix model chat example for rewind by @ajindal1 (#1480)
Model builder changes
- Fix from pretrained method for quantized models by @kunal-vaishnavi (#1503)
- Fix missing parameter name by @xadupre (#1502)
- minor change to support qwen3 by @guschmue (#1499)
- Fix how torch tensors are saved by @kunal-vaishnavi (#1476)
- Support k_quant in model builder by @jiafatom (#1444)
Dependency updates
- Update to stable release of Microsoft.Extensions.AI.Abstractions by @stephentoub (#1489)
- Update to M.E.AI 9.4.3-preview.1.25230.7 by @stephentoub (#1443)
Full Changelog: v0.8.0...v0.8.1