v0.9.0
What's Changed
New Features
- Constrained decoding integration by @ajindal1 in #1381
- Update constrained decoding by @ajindal1 in #1477
- Enable TRT multi profile option through provider option by @anujj in #1493
- Add support for Machine Translation model by @apsonawane in #1482
- Overlap prompt processing KV cache update for WindowedKeyValueCache in DecoderOnlyPipelineState by @edgchen1 in #1526
- Add basic support for tracing by @edgchen1 in #1524
- Logging SetLogCallback + Debugging cleanup by @RyanUnderhill in #1471
- Support loading models from memory by @baijumeswani in #1571
- Add SLM Engine support for function calling by @kinfey in #1582
- Pass the batch_size through the Overlay by @anujj in #1627
- Enable GPU-based sampling for TRT-RTX by @gaugarg-nv in #1650 (see the sampling sketch after this list)
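With #1650, the sampling step itself can run on the GPU for the TRT-RTX EP. A minimal sketch of how sampling is driven through the Python API's search options, following the repository's example scripts; the model path is a placeholder:

```python
import onnxruntime_genai as og

# Placeholder path: a folder produced by the model builder,
# containing the ONNX model and genai_config.json.
model = og.Model("path/to/model")
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
# Sampling is controlled through search options; with #1650 the
# sampling step can execute on the GPU for the TRT-RTX EP.
params.set_search_options(do_sample=True, top_k=50, top_p=0.95, max_length=256)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("What is ONNX Runtime?"))
while not generator.is_done():
    generator.generate_next_token()
print(tokenizer.decode(generator.get_sequence(0)))
```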
Model Builder Changes
- Whisper Redesigned Solution by @kunal-vaishnavi in #1229
- [Builder] Add support for Olive quantized models by @jambayk in #1647
- Add Qwen3 to model builder by @xenova in #1428
- Model builder: Add ability to exclude a node from quantization by @sushraja-msft in #1436
- Support k_quant in model builder by @jiafatom in #1444
- Add final norm for LoRA models by @kunal-vaishnavi in #1446
- Add bfloat16 support in model builder by @kunal-vaishnavi in #1447
- Fix accuracy issues with Gemma models by @kunal-vaishnavi in #1448
- Always cast bf16 logits to fp32 by @nenad1002 in #1479
- Add NvTensorRtRtx EP option to GenAI model builder by @BLSharda in #1453
- Add Gemma3 Model support for NvTensorRtRtx execution provider by @anujj in #1520
- Use IRv10 in the model builder by @justinchuby in #1547
- [Builder] Rename methods make_value and make_initializer by @justinchuby in #1554
- Always use opset21 in builder by @justinchuby in #1548
- Clamp KV Cache Size to Sliding Window for NvTensorRtRtx EP by @BLSharda in #1523
- [Builder] Fix output name in make_rotary_embedding_multi_cache by @justinchuby in #1562
- [Builder] Use lazy tensor by @justinchuby in #1556
- [Builder] Fix KeyError for torch.uint8 in dtype mapping for MoE quantization by @Copilot in #1561
- [Builder] Fix 1d constant creation by @justinchuby in #1568
- [Builder] Create progress bar by @justinchuby in #1559
- [Builder] Use packed 4bit tensors directly by @justinchuby in #1566
- [Builder] Simplify constant creation by @justinchuby in #1569
- [Builder] Add cuda-bfloat16 entry to valid_gqa_configurations by @justinchuby in #1585
- [Builder] Use dtype conversion helpers from onnx_ir by @justinchuby in #1587
- [Model builder] Add support for Ernie 4.5 models by @xenova in #1608
- Whisper: Allow session options to be used for encoder by @RyanMetcalfeInt8 in #1622
- Make default top_k=50 in model builder by @jiafatom in #1642
- Update builder.py by @lnigam in #1665
- Change IO dtype for INT4 CUDA models by @kunal-vaishnavi in #1629 (see the builder invocation sketch after this list)
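The changes above land in the `onnxruntime_genai.models.builder` module. A minimal sketch of invoking it from Python to export an int4 CUDA model; the model id and output folder are placeholders, and the flags follow the builder's CLI:

```python
import subprocess
import sys

# Export an int4 CUDA model with the GenAI model builder.
# Flags: -m model id, -o output dir, -p precision, -e execution provider.
subprocess.run(
    [
        sys.executable, "-m", "onnxruntime_genai.models.builder",
        "-m", "Qwen/Qwen3-0.6B",    # Qwen3 support landed in #1428
        "-o", "./qwen3-int4-cuda",  # receives the ONNX model + genai_config.json
        "-p", "int4",               # bf16 is also accepted since #1447
        "-e", "cuda",               # NvTensorRtRtx became selectable in #1453
    ],
    check=True,
)
```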
Bug Fixes
- CUDA Top K / Top P Fixes by @aciddelgado in #1371
- Persist provider options across ClearProviders, AppendProvider where possible by @baijumeswani in #1454
- Add enable_skip_layer_norm_strict_mode flag by @nenad1002 in #1462
- Avoid adding providers if not requested by @baijumeswani in #1464
- Fix array eos_token_id handling by @RyanUnderhill in #1463
- Remove BF16 CPU from valid GQA configuration by @nenad1002 in #1469
- Address QNN specific regressions by @baijumeswani in #1470
- Fix how torch tensors are saved by @kunal-vaishnavi in #1476
- Fix model chat example for rewind by @ajindal1 in #1480
- Correctly iterate over the providers to check if graph capture is enabled by @baijumeswani in #1497
- Fix missing parameter name by @xadupre in #1502
- Fix from pretrained method for quantized models by @kunal-vaishnavi in #1503
- Remove position_id and fix context phase KV shapes for in-place cache buffer support by @anujj in #1505
- Fix last layer generation for text-only models by @nenad1002 in #1513
- [Fix] Remove references to TensorProto by @justinchuby in #1549
- Fix make_layernorm_casts usage of value infos by @justinchuby in #1551
- Fix DML Memory Leak by @aciddelgado in #1578
- [DML] Bind the dml global objects to the Model by @baijumeswani in #1590
- NvTensorRTRTx: Enable CUDA graph via config and fix attention_mask shape handling by @anujj in #1594
- Append eos token to the end of input sequence for marian models by @apsonawane in #1630
- Use two-step softmax for CUDA sampling by @jiafatom in #1617
- Use two-step softmax for CPU sampling by @jiafatom in #1631 (see the softmax sketch after this list)
- Use last windowed input ids to update logits by @baijumeswani in #1636
- Fix attention-mask stride bug for static masking (batch > 1) by @anujj in #1639
- Add open bytes functionality for C# by @ajindal1 in #1634
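#1617 and #1631 move CUDA and CPU sampling to a two-step softmax. Assuming "two-step" refers to the standard numerically stable formulation (one pass for the maximum, one for the shifted exponentials), a minimal NumPy sketch of the idea:

```python
import numpy as np

def two_step_softmax(logits: np.ndarray) -> np.ndarray:
    # Step 1: find the row maximum so the exponent never overflows.
    m = logits.max(axis=-1, keepdims=True)
    # Step 2: exponentiate the shifted logits and normalize.
    e = np.exp(logits - m)
    return e / e.sum(axis=-1, keepdims=True)

# np.exp(1000.0) alone overflows to inf; the shifted form stays finite.
print(two_step_softmax(np.array([1000.0, 1001.0, 1002.0])))
```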
Packaging/Testing/Pipelines
- Sign macos binaries by @baijumeswani in #1439
- Add chat template tests by @sayanshaw24 in #1457
- Update triggers by @snnn in #1490
- Add support for building a cuda + dml package by @baijumeswani in #1600
- NvTensorRtRtx: Pass the dynamic shapes (ISL and batch_size) to the EP at runtime as an NV profile by @anujj in #1614
- Update docker image by @snnn in #1633
- Sign all GenAI DLLs in both onnxruntime-genai and Python targets by @vortex-captain in #1635
- Fix all packaging pipelines by @baijumeswani in #1641
- Update the benchmark scripts to account for the time spent in sampling by @gaugarg-nv in #1646
- Add date for nightly packages by @ajindal1 in #1668
Compliance
- Enable policheck in packaging pipeline by @apsonawane in #1449
- Add third party notices in file exclusion by @apsonawane in #1459
- Enable tsa options in packaging pipelines by @apsonawane in #1460
- Update windows packaging pipelines to use build.py by @aciddelgado in #1468
Documentation and Examples
- Update OnnxRuntimeGenAIChatClient with chat template and guidance by @stephentoub in #1533
- Update SimpleGenAI.java docs by @edgchen1 in #1532
- Make OnnxRuntime GenAI Examples Simpler by @baijumeswani in #1615
- Update extensions commit and update example script for translation model by @apsonawane in #1623
- Add instructions for macOS by @baijumeswani in #1625
- Add nightly build badge to README by @natke in #1653
- Add NvTensorRtRtx EP in example file by @anujj in #1656
- Update main README for 0.9.0 release by @kunal-vaishnavi in #1660
- Update C++, C# and Python Examples by @sayanshaw24 in #1664
Tokenizer/Templating Changes
- Set `add_special_tokens` to false by default in Encode by @sayanshaw24 in #1442
- Remove prompt templates from GenAI config by @kunal-vaishnavi in #1445
- Update Extensions Commit to Support Chat Template Override for Unsupported Models by @sayanshaw24 in #1452
- Integrate `tools` input into Chat Template API by @sayanshaw24 in #1472 (see the chat-template sketch after this list)
- Update Chat Template Examples for Tools API change by @sayanshaw24 in #1506
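For #1472, a sketch of passing tools through the chat template before encoding. The keyword names (`messages`, `tools`, `add_generation_prompt`) are assumptions inferred from the PR titles, not a confirmed signature, and the model path and tool schema are placeholders; consult the Python API reference for the exact form:

```python
import json
import onnxruntime_genai as og

model = og.Model("path/to/model")  # placeholder path
tokenizer = og.Tokenizer(model)

messages = json.dumps([
    {"role": "user", "content": "What's the weather in Seattle?"}
])
tools = json.dumps([{
    "name": "get_weather",  # hypothetical tool for illustration
    "description": "Look up the current weather for a city",
    "parameters": {"city": {"type": "string"}},
}])

# Keyword names below are assumptions inferred from the PR titles,
# not a confirmed signature.
prompt = tokenizer.apply_chat_template(
    messages=messages, tools=tools, add_generation_prompt=True
)
# Since #1442, Encode no longer adds special tokens by default,
# so the templated prompt is tokenized exactly as rendered.
tokens = tokenizer.encode(prompt)
```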
Dependency Updates
- Update to M.E.AI 9.4.3-preview.1.25230.7 by @stephentoub in #1443
- Update to stable release of Microsoft.Extensions.AI.Abstractions by @stephentoub in #1489
- Use ONNX IR for model builder by @justinchuby in #1416
- Automatically install java maven artifact in the local maven repository by @asoldano in #1570
- Bump onnx-ir to 0.1.2 by @jiafatom in #1579
- Update OnnxRuntimeGenAIChatClient to M.E.AI.Abstractions 9.7.0 by @stephentoub in #1612
- Update ORT Extensions Commit by @sayanshaw24 in #1667
New Contributors
- @xenova made their first contribution in #1428
- @sushraja-msft made their first contribution in #1436
- @anujj made their first contribution in #1493
- @xadupre made their first contribution in #1502
- @satreysa made their first contribution in #1483
- @Copilot made their first contribution in #1516
- @asoldano made their first contribution in #1510
- @justinchuby made their first contribution in #1416
- @microsoft-github-policy-service[bot] made their first contribution in #1552
- @mattleibow made their first contribution in #1572
- @kinfey made their first contribution in #1582
- @gaugarg-nv made their first contribution in #1646
- @lnigam made their first contribution in #1665
Full Changelog: v0.8.3...v0.9.0