Releases · NVIDIA/TensorRT-LLM · GitHub

v1.3.0rc20 Pre-release

mikeiovine released this 30 Jun 01:11

c25c23f

This RC is the last version to support the TensorRT backend. The TensorRT backend will be removed in the next version.

Known Issues
- DeepSeek V3/V3.2 can crash with an illegal memory access or hang during warm up.
- Autotuning for Qwen3-family models can crash with "Assertion failed: Failed to initialize cutlass TMA WS grouped gemm."
API
- Add API to configure TeaCache coefficients (#13170)
- BREAKING CHANGE: Make request chat_template opt-in (#14646)
Feature
- Add DeepSeek V4 preparation (#15378, #15379, #15381, #15394, #15402, #15222)
- Add MXFP8 weight format plus CUTLASS W8A8 Linear and MoE (#14962)
- Add Marlin NVFP4 backend for MoE and Linear on Hopper (#13476)
- Add CUDA graph wrapper for multimodal encoders (#14829)
- Support cross-attention with FlashInfer TRT-LLM Gen kernels on Blackwell (#15429)
- Support post-norm and per-aux fc_norm for Eagle3 draft models (Eagle 3.1) (#14988)
- Add EPLB support for Qwen3.5 (#15543)
- Optimize CuteDSL NVFP4 MoE grouped/SwiGLU GEMM accumulation pipeline (#15258)
- Add CuTe DSL GVR-TopK load-balance optimization (#15304)
- Enable split-KV heuristic for low-occupancy cross-attention in LTX-2 FA4 (#15399)
- Fuse MLP up-GEMM + bias + GELU(tanh) + NVFP4-quant into the CuteDSL epilogue for LTX2 and WAN (#15299)
- Add async mp4 encode and configurable noise latent via env vars in VisualGen (#15229)
Fix
- Harden disagg cache transceiver teardown (#15422)
- Fix encoder-decoder beam search corruption via per-slot fragmentPointerDevice (#15461)
- Fix overallocation of draft KV cache (#15017)
- Disable NCCL window buffers on GB10 (#15559)
- Fix wrong NCCL fallback in nemotron-h (#15294)
- Fix CuteDSL NVFP4 EPLB weight layout (#15538)
- Enable CuTe DSL BF16 kernels for SM100 PP (#14993)
- Fix Gemma4 multimodal vision TP and xgrammar startup crashes (#15566)
- Add necessary methods for guided decoding in Kimi K2.5 (#15180)
- Re-enable Ulysses for LTX-2 v2a cross-attention (#15303)
- Fix passing scaled timestep to time_embedder in Cosmos3 (#15545)
- Clarify and align trtllm-bench runtime logging (#15254)
Documentation
- Add deploy guide for Minimax M3 (#15587)
- Add Qwen Image visual generation examples (#15235)
Benchmark
- Add Qwen-Image-Bench evaluator (#14837)
- Add modularized perf tests for attention and MoE (discrete/continuous) (#15541)
- Add Qwen3.5-397B-A17B-NVFP4 B200 aggregated perf-sanity tests (#15650)
- Add DeepSeek R1 0528 FP4 performance test to llm_perf_core.yml (#15453)
Test & Infra
- Move more test cases to post-merge (#15568)
- Stabilize perf-sanity tests (#15440)
- Avoid type checking failures due to pip dependency resolution (#15517)
- Gate GPT-OSS TRT-LLM Gen MoE tests to SM100/SM103 (#15128)
- Add GPT-OSS disagg test for transceiver v2 (#15301)
- Fix Cosmos3 tests after VisualGen config split (#15170)
- Fix visual gen test leaked issue (#15236)
- Fix Qwen3-Next bf16 4gpu test (#15206)
- Clean up Nemotron test cases (#15586)
- Fix and unwaive step3p7 test cases (#15583)
- Add test coverage for MiniMax model with multi-node M2.5 checkpoints eval (#15361)
- Add GLM NVFP4 stress test (#15437)
- Remove unreferenced accuracy tests and orphaned entries (#15593)
- Update .gitattributes (#15606)

What's Changed

[None][fix] AutoDeploy: Fixed wrong dist_backend AUTO detection when using trtllm-llmapi-launch by @MrGeva in #15423
[None][test] Waive 2 failed cases for main in QA CI by @tensorrt-cicd in #15341
[TRTLLMINF-81][feat] Avoid failed runners on infra retry by @dpitman-nvda in #15237
[https://nvbugs/6179661][fix] Harden disagg cache transceiver teardown by @chienchunhung in #15422
[https://nvbugs/6273846][test] gate GPT-OSS TRTLLM Gen MoE tests to SM100/SM103 by @dongfengy in #15128
[None][fix] avoid type checking failures due to pip dependency resolution by @ixlmar in #15517
[None][feat] VisualGen: async mp4 encode + fixed noise latent via env vars by @wu6u3tw in #15229
[https://nvbugs/6337235][test] Fix MX/GMS model loader fixtures by @chienchunhung in #15471
[None][test] Un-waive K2.5 Thinking FP4 disagg-NIXL e2e/gen_only tests by @chenfeiz0326 in #15443
[None][test] Waive 3 failed cases for main in QA CI by @tensorrt-cicd in #15509
[None][test] Waive 11 failed cases for main in QA CI by @tensorrt-cicd in #15506
[None][test] Waive 4 failed cases for main in QA CI by @tensorrt-cicd in #15505
[TRTLLM-13550][feat] WideEP FT: add MPI signal handler replacement (1d.0) by @chienchunhung in #14160
[None][test] Remove 60 closed-bug waive entries for main by @tensorrt-cicd in #15511
[#3237][fix] Support negative numbers in MajorityVote digit validation by @nikJ13 in #12294
[None][test] Waive 10 failed cases for main in post-merge by @tensorrt-cicd in #15535
[None][test] Waive 9 failed cases for main in QA CI by @tensorrt-cicd in #15504
[None][test] Waive 1 failed cases for main in QA CI by @tensorrt-cicd in #15499
[None][test] Waive 4 failed cases for main in QA CI by @tensorrt-cicd in #15510
[None][fix] AutoDeploy: handle torch dist all_gather in multi_stream MLA transform by @MrGeva in #15456
[None][feat] Add Gemma-4 NVFP4 quantized models to AutoDeploy registry by @marinayanov in #15382
[None][fix] Fix encoder-decoder beam search corruption via per-slot fragmentPointerDevice by @achartier in #15461
[https://nvbugs/6306936][test] Re-enable AutoDeploy disagg tests by @govind-ramnarayan in #15325
[None][infra] split single-node perf sanity GB200 by @tburt-nv in #15548
[None][chore] Bump version to 1.3.0rc20 by @yuanjingx87 in #15551
[#10710][fix] clarify and align trtllm-bench runtime logging by @marinayanov in #15254
[https://nvbugs/6290345][fix] Fix allreduce benchmark input setup by @nv-lschneider in #15427
[None][feat] DSv4 prep: IndexerTopK and TopK primitives by @lfr-0531 in #15381
[None][perf] Cutedsl NVF4 MOE: grouped/swiglu GEMM: Fix acc pipeline release arrive threads + FC2 meta stage code clean by @liyuhannnnn in #15258
[https://nvbugs/6271740][test] Update llm_perf_core.yml to include new performance test for DeepSeek R1 0528 FP4 model by @yufeiwu-nv in #15453
[None][fix] Stabilize perf-sanity tests by @chenfeiz0326 in #15440
[None][test] fix Cosmos3 tests after VisualGen config split by @bobboli in #15170
[None][feat] DSv4 prep: compressor and mHC primitives by @lfr-0531 in #15379
[None][infra] Waive 3 failed cases for main in post-merge 2802 by @ZhanruiSunCh in #15571
[https://nvbugs/6264844][fix] Fix wrong NCCL fallback in nemotron-h by @Wanli-Jiang in #15294
[None][test] Waive 6 failed cases for main in QA CI by @tensorrt-cicd in #15570
[https://nvbugs/6344108][fix] skip TestNemotron3Super120B on pre-blackwell by @bo-nv in #15539
[None][fix] Fix passing scaled timestep to time_embedder in Cosmos3 by @bastefaniak in #15545
[None][chore] Remove nv-internal-release guardword comments in mega_moe_nvfp4 by @xxi-nv in #15575
[None][ci] move more test cases to post merge by @QiJune in #15568
[https://nvbugs/6185146][fix] Use mat_a.new_empty([m, n_out//2]) / input_scale.new_empty([sf_size]) in the by @tensorrt-cicd in #14710
[TRTLLM-35882][feat] cute dsl gvr-topk load-balance optimization by @limin2021 in #15304
[None][test] Waive 2 failed cases for main in QA CI by @tensorrt-cicd in #15579
[None][test] waive hang issues by @xinhe-nv in #15576
[None][test] waive hang issues by @xinhe-nv in #15581
[#14874][feat] AutoDeploy : Perf optimization for gpt-oss-120b for low conc by @taylor-yb-lee in #15531
[TRTLLM-12982][perf] reuse multi-item scoring position_ids and params by @ixlmar in #15413
[TRTLLM-13599][test] Refine Qwen3.5 test cases by @nv-guomingz in #15544
[TRTLLMINF-111][inf...

Contributors

Thachnh, achartier, and 61 other contributors

v1.3.0rc19 Pre-release

mikeiovine released this 23 Jun 16:49

a8c5955

Known Issues
- Llama 3.1 8B FP8 can hang during the autotuner warmup on GB200.
Model Support
- Support NVIDIA Wan2.2-T2V quantized checkpoints (#15093)
- Enable MTP for Step-3.7 NVFP4 and port Step-3.7VL vision tower to TRT-LLM modules (#14926)
- Support T5 and BART in the PyTorch backend (#13919)
- Support MiniMax-M3 in the PyTorch backend (#15292)
API
- Align VisualGen serve request schema with VisualGenParams (#14733)
- Support multi-item scoring in LLM.encode (#14693)
- Drop legacy --extra_visual_gen_options CLI alias (#15262)
Feature
- Enable TRTLLM MoE backend for Nemotron-H BF16 checkpoint (#14944)
- Add async Ulysses pipeline (enabled for LTX-2 and WAN) (#13978)
- Make TrtllmGenAttention the default decode backend on Blackwell+ (#14618)
- Skip redundant data expand in DeepGemmFusedMoE via fused expand+quant Triton kernel (#14591)
- Add Prometheus metrics for prompt cache, speculative decoding, perplexity, and batch occupancy (#12636)
- Add Indexer TopK single-block / multi-pass radix implementation (#14268)
- Enable gen-only speculative decoding for disagg setups (#14546)
- Support EAGLE3 dynamic trees on Blackwell (#12958)
- Add CUDA graph support for per-expert LoRA in Cutlass backend (#14881)
- Add support for beam search in disaggregated serving (#14876)
- Add maximal LLMAPI capture in usage telemetry (#14398)
- Optimize Qwen2.5/3/3.5-VL performance (#11943)
- Add skip-softmax TMA-load + sync-MMA warp-specialized context FMHA for sm_120/sm_121 (#15163)
- Enable TRTLLM cross attention backend (#15345)
- Support per-request mm_processor_kwargs for Qwen3-VL (#14702)
- Add prefetch_reuse_blocks and configurable prefetch count (#15149)
- Add MegaMoECuteDsl NVFP4 MoE backend (#14608)
- Make EAGLE3 honor sampling params by default (#14745)
- Add multiple FMHA library support to TRTLLM attention backend (#15204)
- Add checkpointing variant of replay for MTP for mamba models (#14203)
Fix
- Remove redundant TikTokenTokenizer shim from Kimi-K2.5 input processor (#14741)
- Rename misnamed tunable_fp4_quantize kwarg and add real SF-swizzle control (#15002)
- Gate FlashInfer GDN kernels to supported configurations (#15094)
- Count DSA indexer K-cache correctly as UINT8 in KV cache size estimate (#15088)
- Select CUTLASS MoE backend on non-Blackwell SMs for Qwen3.5-35B-A3B FP8 (#15081)
- Fix SageAttention kernel regression by using static scheduler (#15047)
- Fall back to local cache when loading tokenizer for gated models (#12998)
- Fix PyExecutor FPM iteration timing (#14922)
- Register multimodal placeholders for Qwen3.5 MoE VLM serving (#15079)
- Fix and unwaive Nemotron-related bugs (#15085)
- Guard DSA DSL atom-split against MTP draft next (#14891)
- Scope disagg-ctx cache-transfer quorum vote to TP instead of WORLD (#15136)
- Clear workspace in run_mla_generation to avoid illegal memory access (#15173)
- Fix MAX_UTILIZATION reuse token budget (#15066)
- Add kv_transfer_timeout_ms to avoid timeout (#15152)
- Preserve ip:port for trtllm-serve visual-gen (#14355)
- Fix guided decoding (xgrammar) + EAGLE-3 + draft_len_schedule crash during CUDA graph capture (#15023)
- Stabilize Mamba replay state update (#14841)
- Fix max_context_length value for attention workspace sizing (#15156)
- Fix issue where host KV cache usage would double when speculative decoding is used (#14373)
- Disable NCCL_SYMMETRIC tactic on GB10 (DGX Spark) (#12902)
- Fix attentionOp FP8 MLA KV-reuse workspace calculation (#14852)
- Fix beam search log_probs non-determinism with batch_size > 1 (#15125)
- Forward secondary_offload_min_priority to KVCacheManager in PyTorch executor (#13768)
- Enable multi-block mode for XQA HMMA spec-dec (#15312)
- Fix TinyGEMM barrier bug (#15338)
- Fix stale sparse attention kwargs (#15460)
- Fix CppMambaHybridCacheManager to handle dp dummy request (#15054)
- Fix embedding vocab mask for rejection sampling in Kimi-K2.5 (#15233)
Documentation
- Add FLUX visual generation examples (#14987)
- Add Qwen3.5 deployment guide doc (#15111)
- Fix stale --disable_xqa reference in legacy docs (#13395)
- Add Cache-DiT documentation (#15268)
Benchmark
- Weight trtllm-bench AR/AL averages by output length (#14998)
Test & Infra
- Add accuracy tests for nemotron-v3-ultra (#14808)
- Remove TestLlama4ScoutInstruct tests (#15144)
- Require minimum of 4 GPUs in llm_perf_core.yml and add new performance tests (#15090)
- Add DFlash coverage for Qwen3.5 MoE variant (#15132)
- Add e2e example tests for flux1/2, ltx2, wan_i2v, and cosmos3 (#15126)
- Enable disagg cancellation stress test (#15174)
- Fix periodic-junit in unittest pytest (#14075)
- Update K2.5 and GLM-5 into CI perf test (#14960)
- Add Qwen3-32B FP8 disagg stress test (#14278)
- Sunset old disagg test cases for the QA side (#15290)
- Add e2e Tensor Parallel LPIPS tests for VisualGen (#15208)
- Remove TensorRT performance baseline and update to PyTorch only (#15256)
- Add integration tests for MoE LoRA and bugfixes (#15271)
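
The Benchmark entry above (#14998) weights the trtllm-bench AR/AL averages by output length. As a rough illustration of what that means, the sketch below compares a plain mean with a length-weighted mean. It is not the trtllm-bench implementation: the function name `length_weighted_mean`, the sample values, and the reading of AL as a per-request acceptance length are all assumptions made for this example.

```python
from typing import Sequence


def length_weighted_mean(values: Sequence[float], output_lengths: Sequence[int]) -> float:
    """Average per-request values, weighting each by its output token count.

    Hypothetical helper for illustration; not part of trtllm-bench.
    """
    if len(values) != len(output_lengths):
        raise ValueError("values and output_lengths must have the same length")
    total_tokens = sum(output_lengths)
    if total_tokens == 0:
        raise ValueError("at least one request must produce output tokens")
    return sum(v * n for v, n in zip(values, output_lengths)) / total_tokens


# Made-up per-request acceptance lengths and output token counts.
acceptance_lengths = [2.0, 4.0]
output_lengths = [100, 300]

# The plain mean treats both requests equally.
unweighted = sum(acceptance_lengths) / len(acceptance_lengths)
# The weighted mean lets the longer request count three times as much.
weighted = length_weighted_mean(acceptance_lengths, output_lengths)

print(f"unweighted={unweighted:.2f} weighted={weighted:.2f}")
```

With these sample numbers the long request pulls the weighted figure above the plain mean, which is the effect a length-weighted report is meant to capture.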

What's Changed

[None][infra] Waive TestQwen3NextInstruct nvfp4 cases by @mzweilz in #15086
[https://nvbugs/6248757][fix] Avoid running all reduce in aux stream by @tensorrt-cicd in #14917
[https://nvbugs/6221483][fix] AutoDeploy: Fix Eagle metadata host syncs by @govind-ramnarayan in #14714
[None][feat] add FLUX visual generation examples by @karljang in #14987
[https://nvbugs/6261164][fix] AutoDeploy: Don't allocate speculative caches when speculation is off by @tensorrt-cicd in #15020
[https://nvbugs/6211189][fix] Lower the reference to 46.5 (matching cross-GPU empirical mean) and remove the t by @tensorrt-cicd in #14799
[None][refactor] split VisualGen pipeline and model configs by @bobboli in #14956
[TRTLLM-11457][feat] Async Ulysses pipeline (Enabled for LTX-2 + WAN) by @luyiyun1021 in #13978
[TRTLLM-11548][doc] Add Qwen3.5 deployment guide doc by @nv-guomingz in #15111
[https://nvbugs/6181383][fix] Build inner text/vision/audio sub-configs as empty PretrainedConfig() then setat by @tensorrt-cicd in #14399
[https://nvbugs/6273850][chore] waive TestQwen3_5_4B::test_bf16 for all GPUs by @tburt-nv in #15112
[None][doc] Add docs for AutoDeploy transforms by @bmarimuthu-nv in #15122
[None][infra] Waive 4 failed cases for main in post-merge 2769 by @ZhanruiSunCh in #15140
[https://nvbugs/6227203][fix] Remove redundant TikTokenTokenizer shim from KimiK25InputProcessor by @tianyuxbear in #14741
[None][fix] tunable_fp4_quantize: rename misnamed kwarg + add real SF-swizzle control by @luyiyun1021 in #15002
[None][test] Fix gen_only missing prev_device_step_time race in perf sanity by @tensorrt-cicd in #15108
[None][test] Fix disagg test result dir by @fredricz-20070104 in #14864
[TRTLLM-13332][test] Remove TestLlama4ScoutInstruct tests by @QiJune in #15144
[https://nvbugs/6266705][fix] Gate FlashInfer GDN kernels to supporte… by @nv-guomingz in #15094
[https://nvbugs/6255037][fix] Count DSA indexer K-cache correctly as UINT8 in KV cache size estimate by @eopXD in #15088
[https://nvbugs/6194812][test] Update llm_perf_core.yml to require a minimum of 4 GPUs and add new performance tests by @yufeiwu-nv in #15090
[TRTLLMINF-112][infra] Reduce the waiting time between check node is online or not by @EmmaQiaoCh in #14819
[None][infra] Waive 1 failed cases for main in pre-merge 41821 by @ZhanruiSunCh in #15135
[None][infra] CBTS Layer 3: pass test-db via Artifactory instead of env var by @crazydemo in #15142
[TRTLLM-13264][feat] Add native bias epilogue to NVFP4 GEMM by @luyiyun1021 in #15053
[https://nvbugs/6278380][unwaive] unwaive ad cases by @crazydemo in #15148
[https://nvbugs/6244474][fix] AutoDeploy: Remove llama perf test from CI by @MrGeva in #15107
[https://nvbugs/6212252][fix] Select CUTLASS MoE backend on non-Blackwell SMs in TestQwen3_5_35B_A3B::test_fp8 by @xxi-nv in #15081
[TRTLLM-13302][feat] Register NVIDIA Wan2.2-T2V quantized checkpoints by @zhenhuaw-me in #15093
[None][chore] add VisualGen team as the codeowner of the VisualGen Attention by @zhenhuaw-me in #15150
[None][feat] Default on FlashInferTrtllmGenAttention by @yihwang-nv in #14618
[None][infra] Test DFW with BSL branch by @yuanjingx87 in #14597
[TRTLLM-12214][perf] customMoeRoutingKernel: lower BLOCK_SIZE to 128, raise m...

Contributors

karljang, dcampora, and 84 other contributors

v1.3.0rc18 Pre-release

mikeiovine released this 10 Jun 00:10

15d06c0

Known Issues
- DeepSeek V3.2 crashes with an illegal memory access (IMA) in various long-running perf tests on GB200/GB300 when the CuteDSL MoE backend is used. To work around this issue, use another MoE backend.
Model Support
- Support Nemotron-H NVFP4 checkpoint on Hopper (#14775)
- Add Qwen image support (#13449)
- Support Step-3.7-Flash model (#14711)
- Add Cosmos3-Nano and Cosmos3-Super support (#14824)
- Add AFMoE Trinity support (#13148)
API
- Add logprobs_simple_format option to return logprobs as a flat list[float] (#13972)
- trtllm-serve, trtllm-eval, trtllm-bench: Make CLI flags take precedence over --config / --extra_llm_api_options YAML (#14812)
Feature
- Upgrade NIXL to v1.0.1 and UCX to 1.21 (#14436)
- Refactor DWDP from CUDA IPC to CUDA VMM + MNNVL (#14453)
- Enable FlashInfer GDN decoding kernel for Qwen3.5 (#13645)
- Add per-expert LoRA support with Cutlass backend (#14801)
- Reduce OpenAI stream postprocess overhead (#14708)
- Add encoder CUDA graph support to llm.encode() (#14326)
- Use a Triton kernel for C++ mamba hybrid state update (#14869)
- Fuse masked gather + finalize-scale into one Triton kernel in DeepGemmFusedMoE (#14592)
- Support KVCacheManagerV2 adjust() in single GPU + agg PyExecutor loop (#14578)
- Add disk cache config for KVCacheManagerV2 (#14845)
- Add Wan I2V generation example (#14981)
- Add LTX-2 visual generation example (#14976)
- Update flashinfer-python from 0.6.12rc2 to 0.6.12 (#14805)
Fix
- Fix mamba-out-of-block error with ADP + BS=1 + disagg (#14853)
- Fix XQA IMA for invalid pages with sliding window (#14459)
- Propagate event loop errors to await_responses callers (#12735)
- Fix Mamba replay mode accuracy issues (#14509)
- Fix PyExecutor hang in disagg TP prefill (#14020)
- Fix stale runtime metadata issues during MLA fallback transitions (#14049)
- Fix KVCacheManagerV2 block counting correctness issues (#14725)
- Canonicalize multimodal cache-key serialization to prevent hash collisions (#14800)
- Fix LTX-2 audio PE padding issues (#14818)
- Release KVCacheManagerV1 blocks on MAX_UTILIZATION pause (#14723)
- Fix config sharing issue for Qwen3-VL (#14766)
- Enforce request and buffer index lifecycle integrity (#14768)
- Add nemotron-v3 as the proper nemotron-h reasoning parser (#14900)
- Clamp KV pool window sizes to max_seq_len (#14905)
- Fix mamba block calculation (#14524)
- Add trust_remote_code=True to the LLM(...) constructor to fix various model loading issues (#14892)
- Fix deep EP partial warp sync for GPT-OSS shapes (#14977)
- Add warmup for trtllm-gen fmha JIT kernels (#14851)
Documentation
- Add VisualGen API walkthrough example and docs page (#14685)
- Add Nemotron 3 Ultra doc (#14964, #15113)
Test & Infra
- Pipe stderr separately in subprocess calls to improve error reporting in Allure (#14750)
- Remove obsolete tests (#14995, #14660, #14992, #14952, #14749)
- Parallelize post stages: Rerun Report, Test Coverage, and AI Failure Analysis (#14528)
- Relocate tests to right-sized stages (#14684)
- Move non-default-feature tests to post merge (#15038)
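
The API entry above (#14812) makes CLI flags take precedence over values from the `--config` / `--extra_llm_api_options` YAML. The sketch below shows one common way to express that kind of layering, assuming three sources merged in order: defaults, then YAML, then flags the user actually passed. It is a minimal illustration rather than the code in `trtllm-serve`; the function `resolve_options` and the option values are hypothetical, and using `None` to mean "flag not passed" is an assumption of this sketch.

```python
from typing import Any, Dict, Mapping


def resolve_options(
    defaults: Mapping[str, Any],
    yaml_options: Mapping[str, Any],
    cli_flags: Mapping[str, Any],
) -> Dict[str, Any]:
    """Merge option sources so that later sources win: defaults < YAML < CLI.

    Hypothetical helper for illustration only.
    """
    resolved = dict(defaults)
    resolved.update(yaml_options)
    # Only flags that were explicitly passed (non-None here) override the YAML.
    resolved.update({key: value for key, value in cli_flags.items() if value is not None})
    return resolved


# Illustrative values; the real option set is much larger.
defaults = {"max_batch_size": 8, "tp_size": 1}
yaml_options = {"max_batch_size": 64, "tp_size": 2}
cli_flags = {"max_batch_size": 16, "tp_size": None}  # only max_batch_size was passed

resolved = resolve_options(defaults, yaml_options, cli_flags)
print(resolved)
```

Here `max_batch_size` comes from the command line while `tp_size` falls back to the YAML value, because that flag was not passed.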

What's Changed

[None][test] Update datasets path by @JennyLiu-nv in #14671
[None][infra] Update new .test_durations by @EmmaQiaoCh in #14661
[TRTLLM-13015][feat] drop complex visual_gen CLI example scripts by @zhenhuaw-me in #14632
[https://nvbugs/6117811][fix] Fix XQA IMA for invalid pages with sliding window by @pengbowang-nv in #14459
[None][feat] Tune mamba config by env variables by @Wanli-Jiang in #14730
[None][test] Update moe backend for ctx and acceptance length env by @fredricz-20070104 in #14803
[None][test] Update precision of previous device step time by @fredricz-20070104 in #14809
[None][infra] Waive 12 failed cases for main in post-merge 2749 by @ZhanruiSunCh in #14802
[TRTLLM-12971][infra] Fix parse classname logic in timeout result by @yiqingy0 in #14559
[https://nvbugs/6038228][fix] Propagate event loop errors to await_responses callers by @JunyiXu-nv in #12735
[TRTLLM-12288][feat] Support Nemotron-H nvfp4 ckpt on Hopper by @JadoTu in #14775
[TRTLLM-12596][feat] Support simple logprob format by @tongyuantongyu in #13972
[None][fix] Stabilize Mamba replay state update by @sunnyqgg in #14509
[None][feat] Upgrade NIXL to v1.0.1 and UCX to 1.21 by @chuangz0 in #14436
[None][feat] Refactor DWDP from CUDA IPC to CUDA VMM + MNNVL composite VA by @tianyuz-nv in #14453
[TRTLLM-10947][perf] eagle3: use cudaMemcpy2DAsync custom op for hidden-state capture by @pcicotti in #14479
[None][fix] PyExecutor Hang in Disagg TP Prefill by @jthomson04 in #14020
[https://nvbugs/6240561][fix] Autodeploy fix the deepseek accuracy drop by @nvchenghaoz in #14774
[#12702][feat] Autodeploy deprecate the legacy triton attention by @nvchenghaoz in #14194
[None][test] Waive 5 failed cases for main in QA CI by @tensorrt-cicd in #14789
[None][test] Waive 7 failed cases for main in QA CI by @tensorrt-cicd in #14791
[https://nvbugs/6240561][fix] Fix AutoDeploy DeepSeek-R1 accuracy drop by @taylor-yb-lee in #14793
[#14588][fix] [AutoDeploy] Fix OOM of DeepSeek-R1 NVFP4 for tp=4 by @taylor-yb-lee in #14477
[https://nvbugs/6179761][fix] Save LTX-2 BF16 weights to speed up perf by @yibinl-nvidia in #14639
[TRTLLM-13028][doc] Add VisualGen API walkthrough example and docs page by @zhenhuaw-me in #14685
[None][chore] Update flashinfer-python from 0.6.12rc2 to 0.6.12 by @yihwang-nv in #14805
[None][fix] AutoDeploy: Unwaive llmc standalone tests by @bmarimuthu-nv in #14700
[TRTLLM-35882][feat] Add cute dsl gvr top-k decode kernel by @limin2021 in #14602
[https://nvbugs/6222480][test] fix stress test issue on H100 by @xinhe-nv in #14721
[None][test] Waive 6 failed cases for main in QA CI by @tensorrt-cicd in #14787
[None][test] Waive 1 failed cases for main in QA CI by @tensorrt-cicd in #14783
[None][fix] synchronize MLA cache reuse fallback metadata by @DhineshPonnarasan in #14049
[None][feat] Add KV cache prefetch by @lowsfer in #14748
[https://nvbugs/6191524][fix] In MLA.forward_context, also call the warmup when has_cached_kv_for_mla_context by @tensorrt-cicd in #14536
[None][test] Waive 2 failed cases for main in QA CI by @tensorrt-cicd in #14839
[None][fix] Cherry-pick kv_cache_manager_v2 fixes to main by @lowsfer in #14725
[None][test] Waive 11 failed cases for main in post-merge by @tensorrt-cicd in #14854
[None][feat] Enable flashinfer gdn decoding kernel for qwen3.5 by @nv-guomingz in #13645
[https://nvbugs/5940460][fix] Harden FP8 quant fusion matching after PyTorch 26.02 update by @pcicotti in #14697
[https://nvbugs/6221450][fix] AutoDeploy: Qwen3.5 400B NVFP4 accuracy regression fix by @taylor-yb-lee in #14667
[TRTLLM-12648][test] implement disagg cancel stress metrics_thread by @chienchunhung in #14807
[None][chore] Update AD model list by @tcherckez-nvidia in #14686
[https://nvbugs/6226933][fix] canonicalize multimodal cache-key serialization to prevent hash collisions by @venkywonka in #14800
[https://nvbugs/6240561][fix] Unwaive DeepSeek R1 accuracy test by @taylor-yb-lee in #14870
[None][feat] Add Qwen image support by @pst2154 in #13449
[TRTLLM-12507][feat] Per-expert lora support with Cutlass backend by @brb-nv in #14801
[None][chore] Make submit.py able to run single GPU tests and accept a customized config file by @HuiGao-NV in #14630
[None][test] Waive 9 failed cases for main in QA CI by @tensorrt-cicd in #14792
[None][test] Update DSV32 32k4k config to avoid timeout issue by @chenfeiz0326 in #14856
[None][chore] Bump version to 1.3.0rc18 by @yuanjingx87 in #14872
[None][infra]...

Contributors

karljang, reasonsolo, and 68 other contributors

v1.3.0rc17 Pre-release

mikeiovine released this 02 Jun 18:50

a422420

Highlights

Known Issues
- DeepSeek V3.2 will crash with an illegal memory access during long-running performance tests under various agg/disagg configurations.
Model Support
- Add MoT World Model support (#14012)
- Enable multi-node tensor parallelism for MiniMax-M2 (#14314)
- Restore Mistral Large 3 text-only processor (#14248)
- Support Gemma4 multi-head_dim pools and host-side slicing for SWA Triton kernels (#13745)
- Add a reasoning parser for Qwen3.5 (#14659)
- Add LTX-2 Ulysses cross-attention for v2a with audio padding (#14044)
- Add Poolside Laguna tool parser (#14638)
- Replace Parakeet audio encoder with native TensorRT-LLM layers (#14474)
- Set Mamba SSM cache to fp32 for NemotronV2 (#14448)
API
- Allow content: null in CustomChatCompletionMessageParam (#14368)
- Enforce trust_remote_code flag (#13527)
- Add thinking token budget control (#14665)
- Expose host/GPU per-iter time and clarify iter labeling in /metrics (#14127)
- Make attention backend case-insensitive (#14635)
Feature
- Add FlashInfer NVFP4 MoE backend (SM120/SM121) for Nemotron (#13773)
- Integrate the FlashInfer GDN prefill kernel for Qwen3.5 (#13644)
- Add LoRA support to LLMAPI Triton backend (#14079)
- Log KV cache utilization and context tokens per iteration (#14206)
- Remove one-warp-per-token policy from MoE A2A kernels (#14550)
- Support non-divisible expert parallelism in MoE all-to-all and Slurm benchmark (#13888)
- Add CuTe DSL attention via exported binaries in VisualGen (#13721)
- Enable NVFP4 KV cache support in trtllm-gen attention (#12544)
- Add GMS-only weight sharing support (#13926)
- Add VisualGen tensor parallelism support (#13614)
- Enable NCCL symmetric zero-copy by default (#14472)
- Improve disaggregated TTFT (#14719)
Fix
- Restore K2.5 multimodal dep8 accuracy test on Transformers 5.5.x (#14392)
- Remove sync after FlashInfer attention plan() (#14634)
- Add a compatibility shim in load_hf_tokenizer for bytes_to_unicode (#14090)
- Route trtllm-bench and trtllm-serve tokenizer load through TransformersTokenizer (#14452)
- Fix crash in deep_ep.py by falling back to the pre-quant dispatch path when hidden_states_sf is missing (#14404)
- Fix gpt-oss accuracy issue by moving TinyGEMM PDL release after reduction (#14537)
- Fix Mistral-Large-3 weight loading crash (#14033)
- Bypass FlashInfer SSD prefill to fix state dtype precision (#14600)
- Fix qwen3 hang on SM120/121 (#14424)
- Fix NVFP4 engine size estimation and attention DP batch size in trtllm-bench (#13498)
- Catch OSError in config_file_lock for NFS compatibility (#11960)
- Fix MoE DeepGEMM workspace size with attention DP (#13310)
- Fix inf/NaN issues in Triton Mamba softplus (#14652)
- Cap per-rank max_num_active_requests by max_num_tokens under attention DP (#14481)
- Propagate external SWA window to FMHA kernel in V2 KV cache (#13719)
- Resolve NVML device index mismatch in get_numa_aware_cpu_affinity when CUDA_VISIBLE_DEVICES is set (#12985)
- Replace fixed disagg fill throttle with slow-start ramp (#14475)
- Reuse batch_indices_cuda across CUDA graph captures in EAGLE3 (#14381)
- Make FA4 a proper pip dependency (#13788)
- Fix GSM8K accuracy tests for LagunaXS on B200/GB200/B300 (#14580)
Documentation
- Add CUTLASS DSL uninstall step to installation guide (#14621)
- Add deprecation notice to legacy support-matrix.md (#14495)
- Fix incorrect auto sampler behavior description for beam search (#14487)
- Add VisualGen context to AGENTS.md (#14732)
Test & Infra
- Update flashinfer-python from 0.6.11.post1 to 0.6.12rc2 (#14512, #14607)
- Add disagg local one-step run script for CI submit (#14557)
- Update model path definitions in test_perf.py and clean up waives.txt (#14393)
- Dedup executor unit tests on H100/B200 (#14556)
- Add disagg cancellation stress-test harness skeleton (#14375)
- Add UCX TLS env in disagg-related tests (#14626)
- Replace ONNX spec with onnx>=1.21.0 in requirements.txt (#14577)
- Add test lists with multi-GPU tests to CI multi-GPU test trigger files (#14087)
- Add offline equivalence test for sharding IR (#13963)
- Enable kv_cache_manager_v2 test for A10 (#12885)
- Remove two-model EAGLE3 spec-decoding tests (#14735)
- Add TLLM_SPEC_DECODE_FORCE_NUM_ACCEPTED_TOKENS in spec decoding perf test (#14438)
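
The Fix entry above for `get_numa_aware_cpu_affinity` (#12985) concerns an index mismatch when `CUDA_VISIBLE_DEVICES` is set. The general cause is that CUDA renumbers the visible devices from 0, while NVML indexes every physical GPU on the host. The sketch below shows one way to translate a CUDA ordinal back to the index listed in the environment variable. It does not reproduce the actual fix: the function `cuda_to_physical_index` is hypothetical, and the sketch assumes integer entries (UUID-style entries are rejected) and that CUDA and NVML enumerate devices in the same order.

```python
from typing import Optional


def cuda_to_physical_index(cuda_index: int, visible_devices: Optional[str]) -> int:
    """Map a CUDA-visible ordinal to the index listed in CUDA_VISIBLE_DEVICES.

    Hypothetical helper for illustration only. Pass the raw value of the
    environment variable, or None when it is unset.
    """
    if visible_devices is None:
        # Variable unset: every GPU is visible and the ordinals are unchanged.
        return cuda_index
    entries = [entry.strip() for entry in visible_devices.split(",") if entry.strip()]
    if not 0 <= cuda_index < len(entries):
        raise IndexError(f"CUDA device {cuda_index} is not visible")
    entry = entries[cuda_index]
    if not entry.isdigit():
        # Entries such as "GPU-<uuid>" need a UUID lookup, which this sketch omits.
        raise ValueError(f"unsupported CUDA_VISIBLE_DEVICES entry: {entry!r}")
    return int(entry)


# With CUDA_VISIBLE_DEVICES="2,3", CUDA device 0 is physical GPU 2.
first = cuda_to_physical_index(0, "2,3")
second = cuda_to_physical_index(1, "2,3")
print(first, second)
```

In real code the second argument would typically come from `os.environ.get("CUDA_VISIBLE_DEVICES")`, and the resulting index would be the one handed to NVML queries.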

What's Changed

[https://nvbugs/6182617][fix] Restore K2.5 multimodal dep8 accuracy test on transformers 5.5.x by @tianyuxbear in #14392
[None][feat] FlashInfer NVFP4 MoE backend (SM120/SM121) for Nemotron … by @farazkh80 in #13773
[None][perf] Integrate the flashinfer gdn prefill kernel for qwen3.5 by @nv-guomingz in #13644
[None][chore] Update flashinfer-python from 0.6.11.post1 to 0.6.12rc1 by @yihwang-nv in #14512
[https://nvbugs/6162328][fix] Add a tiny compat shim in load_hf_tokenizer that, when bytes_to_unicode is m by @tensorrt-cicd in #14090
[https://nvbugs/6114610][test] unwaive disagg tests fixed by UCX_TLS setter by @xwang233 in #14440
[None][fix] Route trtllm-bench and trtllm-serve tokenizer load through TransformersTokenizer by @dc3671 in #14452
[https://nvbugs/6184914][test] Unwaive related tests by @yuxianq in #14523
[https://nvbugs/6186880][fix] In deep_ep.py, fall back to the pre-quant dispatch path when hidden_states_sf is by @tensorrt-cicd in #14404
[None][infra] Waive 2 failed cases for main in post-merge 2734 by @ZhanruiSunCh in #14526
[None][infra] Waive 1 failed cases for main in post-merge 2735 by @ZhanruiSunCh in #14542
[#11257][feat] Add LoRA support to llmapi triton backend by @karljang in #14079
[None][chore] Include layer_idx in MoE backend fallback warnings by @dc3671 in #13409
[None][chore] Add disagg local one-step run script for CI submit by @fredricz-20070104 in #14557
[https://nvbugs/5974335][refactor] Update model path definitions in test_perf.py and clean up waives.txt by @yufeiwu-nv in #14393
[TRTLLM-12968][ci] Dedup executor unit tests on H100/B200 by @YihuiLu512 in #14556
[TRTLLM-12949][refactor] visual_gen: unify fused QK-norm+rope dispatch by @luyiyun1021 in #14529
[https://nvbugs/6143579][fix] Allow content: null in CustomChatCompletionMessageParam by @tijyojwad in #14368
[None][chore] log KV cache utilization and context tokens per iter by @pcicotti in #14206
[https://nvbugs/6168859][fix] move tinygemm PDL release after reduction by @dongfengy in #14537
[None][chore] Unwaive test_cp_tp_broadcast_object by @brb-nv in #14328
[https://nvbugs/6211185][fix] Fix failed GSM8K accuracy tests for LagunaXS on B200/GB200/B300 by @DomBrown in #14580
[TRTLLMINF-106][infra] Use B300 frontend platforms by @mlefeb01 in #14581
[None][refactor] Unify compressed-tensors quant config parsing by @DomBrown in #14468
[None][feat] AutoDeploy push the rope buffer to later stage by @nvchenghaoz in #13859
[https://nvbugs/6215736][infra] Unwaive test_fp8_blockscale[throughput_mtp] by @bobboli in #14541
[https://nvbugs/6175923][test] Revert gpt_oss_20b perf MoE-backend pin by @ruodil in #14612
[https://nvbugs/6221621][test] Update trust_remote to nemotron and phi4 models by @yufeiwu-nv in #14570
[None][chore] update VisualGen codeowner settings by @zhenhuaw-me in #14530
[None][infra] Waive 8 failed cases for main in post-merge 2738 by @ZhanruiSunCh in #14615
[None][perf] Fuse FlashInfer GDN prefill state I/O into Triton kernels by @nv-guomingz in #14548
[https://nvbugs/6164924][fix] Lower free_gpu_memory_fraction for Exaone tests by @tensorrt-cicd in #14486
[https://nvbugs/6163033][fix] Guard q_a_proj.weight dict access behind nvfp4_fused_a; update test to `chec by @tensorrt-cicd in #14033
[None][fix] Bypass FlashInfer SSD prefill to fix state dtype precision by @tijyojwad in #14600
[None][fix] Exclude Qwen3 VL vision model from quantization by @2ez4bz in #12851
[https://nvbugs/6162860][fix] Set free_gpu_memory_fraction=0.6 only when torch_compile=True for test_bfloat16_ by @tensorrt-cicd in #14109
[None][chore] Remove one-warp-per-token policy from MoE A2A kernels by @bobboli in #14550
[None][test] Waive 7 failed cases for main in QA CI by @xinhe-nv in https://github.c...

Contributors

karljang, tijyojwad, and 62 other contributors

v1.3.0rc16 Pre-release

VALLIS-NERIA released this 26 May 08:08

4517988

Highlights

Model Support
- Add Gemma4 multimodal support with native vision and audio towers (#14300)
- Add Qwen3.5 MTP and Qwen3.6-27B-FP8 model support (#12646, #14359)
- Add EXAONE-4.5 and Laguna model support (#12873, #13559)
- Switch DeepSeek, NemotronH, Qwen3, and Qwen3.5-MoE to sharding-IR canonical models (#13478)
API
- Refactor the VisualGenArgs API and registry (#14175)
- Drop sink_token_length from the PyTorch attention surface (#14275)
- Add OpenAI chat logit bias validation (#13518)
- Reject incompatible KV connector configurations at construction time (#13577)
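
As an illustration of what request-side logit bias validation can look like, here is a minimal sketch. It is not the code from #13518: the helper name `validate_logit_bias` and the `vocab_size` check are assumptions made for this example. The -100 to 100 range is the one documented for the OpenAI Chat Completions `logit_bias` field.

```python
from typing import Dict, Mapping

# Bounds documented for the OpenAI Chat Completions `logit_bias` field.
MIN_BIAS = -100.0
MAX_BIAS = 100.0


def validate_logit_bias(logit_bias: Mapping[str, float], vocab_size: int) -> Dict[int, float]:
    """Hypothetical helper: return {token_id: bias} or raise ValueError."""
    validated: Dict[int, float] = {}
    for key, bias in logit_bias.items():
        # JSON object keys arrive as strings, so convert them to token IDs.
        try:
            token_id = int(key)
        except (TypeError, ValueError):
            raise ValueError(f"logit_bias key {key!r} is not an integer token ID") from None
        # Assumption: the server knows the vocabulary size and rejects unknown IDs.
        if not 0 <= token_id < vocab_size:
            raise ValueError(f"token ID {token_id} is outside [0, {vocab_size})")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(bias, bool) or not isinstance(bias, (int, float)):
            raise ValueError(f"bias for token {token_id} must be a number")
        # NaN fails this comparison as well, so it is rejected here.
        if not MIN_BIAS <= bias <= MAX_BIAS:
            raise ValueError(f"bias for token {token_id} must be within [-100, 100]")
        validated[token_id] = float(bias)
    return validated
```

Rejecting a malformed map before the request reaches the sampler lets the server return a client error instead of failing during generation.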
Feature
- Add exact multimodal KV block hashing and KV cache reuse probing (#13815, #14333)
- Add KV cache manager v2 with Python transceiver updates (#12928)
- Add disaggregated serving support with block reuse enabled for hybrid models (#14060)
- Add FlashInfer MLA attention backend support and SkipSoftmax sparse attention support for visual generation (#13428, #12947)
- Add Ring Attention and unified context parallelism for VisualGen (#13821)
- Add legacy and TensorRT-LLM 1.x modelopt quantization config support (#14088)
- Add debugging environment variables for mamba modules (#14170)
- Add single-rank MPI sleep/wakeup support and a rank-0 collective_rpc shim (#14052)
- Add opentelemetry metrics for disaggregated serving with multiple postprocessing workers (#12637)
- Support SWA scratch reuse rewind (#14412)
- Improve FMHA, FlashInfer TRTLLM-Gen, and KV cache buffer calculation paths (#14291, #12525)
- Improve fused-kernel and attention performance with shared-expert combine fusion, paged MQA logits decode tuning, LTX2 fused RMSNorm/RoPE, EAGLE3 dynamic tree kernel optimizations, and cu_seqlens conversion updates (#14306, #14133, #13985, #13426, #13566)
- Optimize beam search candidate reconstruction by skipping prompt-prefix copies (#14197)
- Update cubins to resolve the FMHA PDL issue (#14462)
- Use CUDA 13 CUTLASS DSL package (#14354)
Fix
- Fix disaggregated benchmark, usage propagation, and worker registration stability issues (#13347, #14177, #14289)
- Fix DeepSeek-V3 OOM handling and artifacts paths (#14232)
- Fix missing get_draft_token_length import in py_executor (#14366)
- Fix LoRA load failure handling (#13517)
- Fix Kimi K2.5 speculative decoding behavior (#14379)
- Fix Qwen3HybridConfig layer_types derivation and route load_hf_model_config through AutoConfig (#13832, #14410)
- Fix CppMambaHybridCacheManager functional and performance issues (#14003)
- Fix MTP disaggregated speculative_config coverage (#14391)
- Fix KVCacheTransfer divide-by-zero and KV cache grain slot refinement issues (#13618, #14442)
- Fix memory usage during refit and EPLB config model loading (#14331, #11962)
- Fix MPI worker allocator configuration and GB300 cluster environment setup (#14152, #14460)
- Fix profiler runner exception handling with synchronized CUDA cleanup (#13469)
- Disable mamba replay by default (#14471)
Documentation
- Add a Claude skill for multimodal model onboarding (#13842)
- Update Gemma 4 entries in supported-models.md (#14463)
- Fix invalid documentation and deployment guide links (#14337, #14522)
Benchmark
- Add LPIPS scoring for visual generation model regression tests (#13567)
- Add a bench_moe microbenchmark (#14507)
- Update visual generation and accuracy thresholds for Wan 2.2, Qwen3.5-4B DFlash, and Nano V3 (#14372, #14411, #14078)
- Disable ignore-eos when using speculative decoding in performance tests (#14347)
Test & Infra
- Split verl tests into fine-grained per-case wrappers (#14037)
- Add new stress cases (#14390)
- Clean outdated test duration entries and remove deprecated disaggregated sampler and spark test cases (#14340, #14335, #14380)
- Isolate ray tests to avoid GCS timeout in a single pytest session (#14342)
- Improve L0 retry timeout budgeting and cap infra retry attempts (#14323, #14415)
- Handle sacct errors when checking Slurm job status (#14367)
- Fix B300 MegaMoE and MoE test selection (#14362, #14401)
- Fix container scanning according to the latest security team guidance (#14430)
- Deduplicate miscellaneous unit tests on B200 (#14525)

What's Changed

[None][chore] Update Claude Code agents and skills by @kaiyux in #14344
[None][perf] Fuse sigmoid+mul+add shared-expert combine into one Trit… by @nv-guomingz in #14306
[None][infra] Waive 1 failed cases for main in pre-merge 38925 by @ZhanruiSunCh in #14346
[None][infra] Revert Mingyang back to mingyangHao in allowlist by @ZhanruiSunCh in #14349
[None][cleanup] MistralSmall related cleanups by @2ez4bz in #14271
[None][chore] Clean test_durations file by removing outdated items. by @nv-guomingz in #14340
[None][infra] Waive 2 failed cases for main in post-merge 2725 by @ZhanruiSunCh in #14357
[None][feat] Exact multimodal KV blockhashing by @venkywonka in #13815
[None][infra] Waive 1 failed cases for main in pre-merge 38987 by @ZhanruiSunCh in #14350
[None][feat] Update the logic of FMHA JIT path by @heyuhhh in #14291
[None][feat] opentelemetry metrics for num_postproc_workers > 0 disagg by @karen-sy in #12637
[TRTLLM-12385][feat] Use LPIPS score for visual gen model regression test by @yibinl-nvidia in #13567
[None][chore] Remove closed bugs by @xinhe-nv in #14217
[https://nvbugs/6133201][fix] Bump GEN max_num_tokens in disagg perf YAMLs by @xwang233 in #14191
[None][feat] add single-rank MPI sleep/wakeup and rank-0 collective_rpc shim by @hhzhang16 in #14052
[https://nvbugs/6093911][fix] Fix disagg gen-only benchmark hang under ADP router imbalance by @chienchunhung in #13347
[None][fix] Import missing get_draft_token_length in py_executor by @nv-guomingz in #14366
[TRTLLM-12342][feat] Ring Attention, Unified Context Parallel for VisualGen by @NVShreyas in #13821
[None][test] Split verl tests into 19 fine-grained per-case wrappers by @Superjomn in #14037
[TRTLLM-11547][feat] Add Qwen3.5 MTP support. by @nv-guomingz in #12646
[https://nvbugs/6143599][fix] DeepSeek-V3 OOM and artifacts path by @dominicshanshan in #14232
[https://nvbugs/6114141][test] Remove deprecated disagg trtllm_sampler test by @Shixiaowei02 in #14335
[None][doc] Add Claude skill for multimodal model onboarding by @yechank-nvidia in #13842
[https://nvbugs/6141803][fix] Skip Qwen3.5-4B tests pre-hopper by @amukkara in #14055
[None][fix] ADP router crashes on serve when scheduling_params.attent… by @nv-guomingz in #14267
[https://nvbugs/6185190][doc] fix invalid links in doc by @nv-guomingz in #14337
[None][feat] Refactor to support legacy and 1.x modelopt quant config format by @Wanli-Jiang in #14088
[None][feature] Add env variables to help debugging mamba modules. by @Wanli-Jiang in #14170
[None][infra] Handle sacct error when checking slurm job status by @yuanjingx87 in #14367
[https://nvbugs/6027594][fix] Unwaive testcase by @YihuiLu512 in #14383
[None][chore] Remove unnecessary buffer to save memory during refit by @shuyixiong in #14331
[https://nvbugs/6153638][fix] unwaive tests for testing the flaky issue by @JunyiXu-nv in #14284
[https://nvbugs/6171743][fix] Set PYTORCH_ALLOC_CONF=expandable_segments:True on MPI workers via `patch_mpi_ by @tensorrt-cicd in #14152
[None][test] Add new stress cases by @fredricz-20070104 in #14390
[None][feat] Gemma4 MM: native vision + audio towers by @Hudayday in #14300
[TRTLLM-12719][cbts] Add core code related rule by @crazydemo in #14266
[None][test] Update bug ID for test_all_optimizations_combined waiver by @mzweilz in #14402
[None][infra] Waive 6 failed cases for main in post-merge 2726 by @ZhanruiSunCh in #14405
[None][test] Disable ignore-eos when Spec Decoding in Perf Test by @chenfeiz0326 in #14347
[None][fix] Isolate ray tests to avoid GCS timeout in one pytest session by @shuyixiong in #14342
[https://nvbugs/6110638][fix] Mark AutoDeploy attention DP world sizes by GPU count by @galagam in #14148
[None][feat] EXAONE-4.5 Support by @yechank-nvidia in https://github.com/NVIDIA/Tens...

Contributors

Superjomn, karljang, and 61 other contributors

v1.3.0rc15 Pre-release

VALLIS-NERIA released this 21 May 14:27

c72d43d

Highlights

Model Support
- Add Gemma4 multimodal model support with text, vision, audio, and chunked prefill capabilities (#12932, #14134)
- Add Kimi K2.5 multimodal vision support and reasoning parser integration (#12788, #13801)
- Add GPT-OSS, Ministral3, Nemotron-H, Nemotron Nano, and DeepSeek model enablement and compatibility updates (#12743, #12884, #13844, #13977)
- Improve DeepSeek V4 and DeepSeek V3.2 support with new attention kernels, routing updates, tokenizer loading, and AutoConfig registration (#13652, #13186, #14261, #14293)
API
- Add a typed exception hierarchy, shared classifier, retry-consumer migration, and typed Slurm infra failures (#13732, #13780, #13863, #13809, #14147)
- Add VisualGen public output APIs, serving batch inference, and benchmark timing decomposition (#13635, #12350)
- Add per-request media_io_kwargs support for chat completions (#13779)
- Add per-rank iteration statistics and Attention-DP metrics to serving endpoints (#13221, #13649)
- Add cache_salt_id support to the KV cache v2 manager (#13793)
- Limit requested sampling logprobs as a breaking API change (#13520)
Feature
- Improve MoE and fused-kernel performance with MegaMoE DeepGEMM, CUTEDSL MoE, shared-expert SwiGLU quantization, GDN fusion, bf16 FlashInfer MoE, and refreshed MoE cubins (#13384, #12884, #11897, #12966, #13689, #12440)
- Add FP4 and FP8 decode kernels, FP4 DSA indexing, DeepSeek V4 attention kernels, FMHA head_dim 80 cubins, and multi-K and multi-dtype GVR Top-K support (#13929, #13219, #13340, #13652, #13808, #13948)
- Improve VisualGen and diffusion pipelines with SageAttention for Wan/FLUX, fused cross-head QK Norm plus RoPE for WAN, LTX2 refactoring, and parallel VAE scaling (#13570, #13052, #13285, #13873)
- Improve KV reuse, disaggregated serving, and transfer paths with transceiver v2 KV reuse, multi-threaded KV transfer, internal TRTLLM-Gen routing, additional conversation headers, and LoRA request-broadcast reduction (#13115, #13075, #13997, #13656, #12959)
- Improve speculative decoding and hybrid-model execution with fractional synthetic acceptance rates, MTP block reuse, EAGLE3 rejection sampling, MTP max_draft_len decoupling, and mamba SSD prefill optimizations (#13569, #12896, #12588, #12341, #12731)
- Improve performance tooling and runtime throughput with DFlash optimizations, host-profiler utilities, batch-full benchmark metrics, model-init NVLink caching, scheduling overhead reductions, beam-search overlap scheduling, and FC2 DenseGEMM autotuning (#13996, #11741, #13638, #14070, #13843, #14061, #13833)
- Add CMake third-party cache support for clean builds (#13942)
Fix
- Fix CUDA graph, profiling, and scheduling correctness issues including YAML CudaGraphConfig validation, profiler scoping, piecewise capture, Eagle3 hidden-state reuse, and guided decoding GIL handling (#13397, #12432, #13574, #13920, #13251)
- Fix KV cache and scheduler behavior for FlashMLA token block overrides, mamba slot memory, delayed batching page release, adaptive ratio sampling, zero-layer mamba ranks, stale Scheduler V2 state, stale attention metadata, and chunked prefill EVS merging (#13752, #13489, #13805, #13857, #13999, #13592, #13696, #13754)
- Fix model loading and quantization issues for GPT-OSS MXFP4, dummy weights, Mixtral modelopt export, DeepSeek V3 Lite FP8 MTP weights, composite HF configs, GLM-5 router GEMM, INT4 AWQ on SM120/121, and Qwen3 FP4 CUTLASS MoE OOM (#13708, #13879, #14179, #12530, #14068, #13740, #11561, #13349)
- Fix serving and benchmark clients with hardened media URL loading, split SSE chunk parsing, aiohttp 3.13 streaming handling, /metrics tee-buffer serving, bounded gRPC payloads, router tokenizer skipping, unset attention_dp_relax handling, and clear GPT-OSS backend errors (#12748, #13686, #13952, #13405, #13519, #14030, #14276, #13166)
- Fix distributed and disaggregated runtime stability for mamba disaggregation, worker preparation, PP executor shutdown, SM120 all-reduce launch, guided-decoding PP warmup barriers, Torch process-group teardown, Triton MoE memory freeing, and GB300 UCX settings (#13274, #13755, #13267, #13169, #13132, #12993, #14069, #14168)
- Fix accuracy and memory regressions in DeepSeek, Nemotron, Qwen3, MTP, beam search, FMHA workspace sizing, and FP8 block-scaling autotuner cache growth (#13924, #13968, #13782, #14063, #13799, #13880, #14165)
- Fix package, license, and compliance issues in llm-c standalone generation, SPDX headers, OSS headers, diffusers pinning, and broken documentation URLs (#14011, #14106, #14193, #14281, #13242, #13422)
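
The split SSE chunk parsing fix listed above concerns a general streaming pitfall: a network chunk can end in the middle of a server-sent event. The sketch below shows one common way to handle it by buffering bytes until the blank-line terminator arrives. It is not the TensorRT-LLM client code. The class name `SSEBuffer` is hypothetical, and the sketch assumes LF line endings and only `data:` fields.

```python
from typing import List


class SSEBuffer:
    """Hypothetical incremental parser for the `data:` fields of an SSE stream."""

    def __init__(self) -> None:
        # Bytes received so far that do not yet form a complete event.
        self._pending = b""

    def feed(self, chunk: bytes) -> List[str]:
        """Add a network chunk and return the payloads of all completed events."""
        self._pending += chunk
        events: List[str] = []
        # An event is complete only once its terminating blank line has arrived.
        while b"\n\n" in self._pending:
            raw, self._pending = self._pending.split(b"\n\n", 1)
            data_lines = [
                line[len(b"data:"):].strip()
                for line in raw.split(b"\n")
                if line.startswith(b"data:")
            ]
            if data_lines:
                # Decode only complete events so multi-byte characters are never split.
                events.append(b"\n".join(data_lines).decode("utf-8"))
        return events
```

A short usage example, where one JSON payload and the final `[DONE]` marker each arrive split across two chunks:

```python
buf = SSEBuffer()
first = buf.feed(b'data: {"a"')                   # incomplete event, nothing emitted
second = buf.feed(b': 1}\n\ndata: [DO')           # completes the first event
third = buf.feed(b'NE]\n\n')                      # completes the second event
```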
Documentation
- Add and update technical blogs for Helix Parallelism, Scaffolding, Gemma4, MoE as Dense GEMM on Blackwell, and VisualGen-related content (#13547, #11841, #13947, #13834, #14171)
- Add DFlash quickstart updates, custom PyTorch backend kernel integration guidance, Gemma4 usage examples, spec-decoding support matrices, and layer-wise benchmark doc fixes (#13545, #13917, #14303, #14195, #13979)
- Refresh image links and broken URLs in documentation and blog content (#13838, #13422)
Test & Infra
- Add model and multimodal coverage for Wan 2.2 TI2V, nano v3 omni audio and video, Nemotron Ultra V3, Gemma4 CUDA graph registration, and W4A8_MXFP4_FP8 MoE unit tests (#13739, #13616, #13750, #13883, #13658, #14082, #13401)
- Add and refresh performance coverage for VisualGen sanity, GB300 disaggregated NIXL, DSR1 disaggregated tests, trtllm-bench metrics, and Kimi K2.5 FP4 RCCA tests (#13144, #13594, #13882, #14178, #14172)
- Improve change-based testing, CI triggers, GitHub checks, stage splitting, rerun handling, and LFS synchronization (#13382, #13899, #13993, #14022, #14064, #14035, #12406, #13826)
- Improve build, dependency, and package infrastructure with FlashInfer updates, Transformers 5.x upgrades, compressed cubin archives, SBSA wheel image support, license scanning, and llm-c artifact cleanup (#13746, #13992, #14076, #12829, #13994, #13542, #12635, #13921, #13272)
- Improve CI coverage organization by moving chunked-prefill cases, splitting long hardware-agnostic tests, adding feature-contract keys, and promoting DeepSeek-V4-Flash to the MoE CI subset (#14083, #13751, #13756, #13933, #13964)
- Improve developer and CI operations with blossom-ci allowlist updates, skills naming enforcement, pre-commit validation, source-scan cleanup, and NFS temporary-file ignores (#13951, #14132, #14295, #14304, #14285, #13778, #14211)

What's Changed

[https://nvbugs/6001694][fix] Add CUDA profiler API scoping for visual gen nsys profiling by @chang-l in #12432
[https://nvbugs/6080024][fix] Fix CudaGraphConfig validation conflict from YAML deep merge by @nvchenghaoz in #13397
[None][perf] AutoDeploy: reduce C++ dispatch overhead in decode scheduling loop by @nvchenghaoz in #13012
[None][doc] Blogpost for Helix Parallelism by @brb-nv in #13547
[None][chore] Fix indexing conflict in blogposts by @brb-nv in #13772
[#12713][feat] AutoDeploy Model Onboarding Sprint 03/19 - Part 1 (Remove Patches) by @govind-ramnarayan in #13247
[https://nvbugs/5911304][fix] Add URL validation and request hardening for media input loading by @yibinl-nvidia in #12748
[None][infra] Remove PULSE_REPO_BRANCH when running source code scanning by @yuanjingx87 in #13778
[TRTLLMINF-54][feat] Add typed exception hierarchy + unified classifier by @dpitman-nvda in #13732
[https://nvbugs/6094072][fix] swizzle GPT-OSS dummy MXFP4 weights by @dongfengy in #13708
[https://nvbugs/6094224][fix] Fix mamba disagg issues when conc > mbs by @bo-nv in #13274
Add log for raw model weights memory consumption by @HuiGao-NV in #13760
[None][perf] Drop cubin and Eliminate ~6s FMHA JIT recompile in eager generation by aligning kernel selection with CUDA graph warmup by @yunruis in #13505
[https://nvbugs/5615248][fix] Reduce beam-search prefill->decode handoff cost by @brb-nv in #13748
[None][chore] Update flashinfer-python from 0.6.9 to 0.6.10 by @yihwang-nv in #13746
[None][feat] Fuse GDN elementwise ops and split/transpose kernels by @Wong4j in #12966
[None][infra] Waive 3 failed cases for main in post-merge by @xinhe-nv in #13797
[None][chore] Update nvidia-cutlass-dsl version in visual_gen pyproject.toml by @yihwang-nv in #13642
[None][infra] Waive 3 failed cases for main in post-merge by @xinhe-nv in #13789
[None][feat] Update TRTLLM MoE cubins by @rosenrodt in #12440
[None][fix] Fix Autodeploy standalone package builder script tests by @bmarimuthu-nv in #13794
[#13320][fix] Propagate FlashMLA tokens_per_block override onto kv_cache_config by @eopXD in #13752
[None][test] Unset MPI related Env in local Perf Test Script by @chenfeiz0326 in #13795
[https://nvbugs/5615248][fix] Broader capture of piecewise cudagraph by @brb-nv in #13574
[...

Contributors

Superjomn, janbernloehr, and 110 other contributors

v1.3.0rc14 Pre-release

VALLIS-NERIA released this 07 May 05:55

93cb651

Highlights

Model Support
- Add prefix caching for Mamba hybrid models including Qwen3.5 and Nemotron Super V3 (#12185)
- Improve Qwen3.5 support with custom MoE routing and dense and NVFP4 weight loading fixes (#13433, #13090, #13716)
- Improve Nemotron and Nemotron Nano support with GEMM tuning and multimodal placeholder expansion (#13160, #13069)
- Add Wan 2.2 5B TI2V support and refine LTX-2 FP4 stage handling (#13256, #13244)
API
- Embed VisualGenParams in DiffusionRequest and simplify generate() inputs (#13313)
- Add llm.encode() fast path support for encoder-only models (#12801)
- Add per-iteration request-aggregate counters to InflightBatchingStats (#13199)
- Add ASGI middleware support for Serve (#13378)
- Introduce cancellation support in transceiver v2 (#12734)
- Fix Triton backend generation parameter handling for promptIgnoreLength, lengthPenalty, earlyStopping, and early_stopping (#13633, #13692)
Feature
- Improve VisualGen serving with fast PNG compression, multi-node diffusion workers, non-contiguous multimodal chunked prefill, and Attention2D sequence parallelism (#13074, #13140, #12944, #12943)
- Improve disaggregated serving and routing with gen-first ADP serving, KV-aware hit-rate gates and fair-share caps, and consolidated aiohttp session handling (#13112, #13198, #13408)
- Expand kernel and runtime performance with GEMM-to-allreduce registered buffers, CuteDSL bf16 dense GEMMs, sparse-attention GVR Top-K dispatchers, fused add-norm-FP8 quantization, TF32 DSA GEMMs, sampler optimizations, and leaner MPI collectives (#11589, #12074, #13477, #12674, #13452, #13480, #13380, #13089)
- Improve speculative decoding with DFlash one-model support, Mamba-2 rollback replay, radix-based SWA cleanup, and trtllm-gen routing refactoring (#12794, #13453, #13346, #13328)
- Support NVFP4 weight updates (#12320)
- Add per-rank torch profile traces for distributed profiling (#13536)
Fix
- Fix KV cache and scheduler correctness issues, including WindowBlockManager statistics, Mamba cache handling under MTP with CUDA graph padding, free-block counter corruption, V2 extra_tokens accounting, PEFT page accumulation, and temporary attention-window cleanup (#12448, #13151, #12834, #13619, #13709, #13528, #12450)
- Fix disaggregated serving and worker reliability by resolving aggregate PP4 hangs, preventing zombie worker pods, and correcting cached-token usage accounting (#12888, #12718, #13620)
- Fix OpenAI and Triton generation flows for None tokenizers, prompt ignore lengths, early stopping, and terminateRequest handling from background logits threads (#13184, #13633, #13692, #13059)
- Fix attention and VisualGen runtime issues, including UlyssesAttention sequence lengths, Ulysses plus Sage execution, TRTLLM-Gen GmemReduction illegal memory access, and low-memory Qwen3 skip-softmax behavior (#13486, #13440, #13541, #13581)
- Fix distributed runtime stability with corrected pipeline-parallel layer distribution, reduced host-memory regression in speculative decoding, and MoE communication fallback after init exceptions (#13066, #13130, #13331)
- Fix cache memory estimation for Qwen3 hybrid models in trtllm-bench and lower Eagle3 one-model acceptance thresholds for H20 (#13268, #13565)
Documentation
- Add batch-size tuning guidance for CUDA graph padding and a GVR Top-K technical blog (#13393, #13714)
- Remove outdated news items and clean up llmc licensing documentation (#13603, #13700)
Test & Infra
- Add and refresh coverage for disaggregated post-merge performance, GPT-OSS 20B MHA, prefix-aware scheduling, cascade-prune repros, and issue-specific regressions (#13343, #12796, #13578, #13572, #13553)
- Improve CI triage and failure analysis with Perf Triage Bot integration, rendered HTML failure reports, K8s infrastructure retry, PR base freshness checks, static test validation, and clearer Slurm pending logs (#12429, #13526, #13530, #13430, #13423, #13586)
- Improve CI and build stability with lower test memory pressure, adjusted DeepEP token limits, CUDA line info defaults, Debug CUDA flag fixes, module-level skips, and longer FMHA timeouts (#13402, #13484, #13334, #13598, #13223, #12860)
- Refresh test organization and dependencies with post-merge test moves, updated constraints, FlashInfer Python updates, B200 multimodal unit-test deduplication, and sorted waive enforcement (#13376, #13482, #13064, #13631, #13584, #12672)
- Improve distributed and QA infrastructure with free-port FLUX/WAN test initialization, multinode fallback handling, NIXL-based perf sanity tests, QA popen workarounds, and KVCacheManager connector helper fixes (#13364, #13537, #13654, #13634, #13749)
- Improve package and release infrastructure with llmc standalone package cleanup, release-scanning PLC nightly adjustments, devel-stage apt cache mounts, and pip cache reuse (#13466, #13694, #13245, #13510)

What's Changed

[https://nvbugs/6093714][fix] Reduce batch size and add memory guard for test by @govind-ramnarayan in #13402
[TRTLLM-11373][refactor] Embed VisualGenParams in DiffusionRequest and simplify generate() inputs by @zhenhuaw-me in #13313
[None][test] Update CI Post-Merge Disagg Perf Tests by @chenfeiz0326 in #13343
[None][chore] AutoDeploy: Refactor finegrained FP8 scale sharding helpers by @galagam in #12999
[https://nvbugs/6076564][fix] unwaive TestNemotronH::test_auto_dtype[trtllm-flashinfer_ssm-False] by @tcherckez-nvidia in #13187
[TRTLLM-10061][feat] Prefix caching support for mamba hybrid models by @VALLIS-NERIA in #12185
[None][cleanup] remove legacy addSequence path by @liji-nv in #13280
[None][infra] Waive 1 failed cases for main in pre-merge 35790 by @ZhanruiSunCh in #13483
[None][fix] Fix bugs in WindowBlockManager destructor statistics by @eopXD in #12448
[None][chore] Update CI allowlist 2026-04-23 by @ZhanruiSunCh in #13381
[None][fix] Consolidate aiohttp session management in disagg router by @reasonsolo in #13408
[None][test] Remove SLACK Bot and Modify Update Perf Data into CI Pipeline by @chenfeiz0326 in #12429
[None][infra] Waive 1 failed cases for main in post-merge 2694 by @ZhanruiSunCh in #13485
[None][infra] Waive 1 failed cases for main in post-merge 2695 by @ZhanruiSunCh in #13502
[https://nvbugs/6064029][perf] Use fast PNG compression for visual gen serving by @karljang in #13074
[None][chore] Update skills by @kaiyux in #13507
[None][feat] Add llm.encode() fast path for encoder-only models by @tingyangk in #12801
[TRTLLM-12123][feat] Add per-iteration request-aggregate counters to InflightBatchingStats by @nv-yna in #13199
[None][fix] Fix Mamba cache correctness under MTP + CUDA-graph padding by @Wanli-Jiang in #13151
[TRTLLM-10004][feat] Enable GEMM -> AR with GEMM output in registered buffers by @nv-lschneider in #11589
[https://nvbugs/6043291][fix] Add fatal error detection to prevent zombie worker pods by @chienchunhung in #12718
[TRTLLM-11228][feat] Support DFlash in one-model spec dec by @ziyixiong-nv in #12794
[None][doc] Add blog post for tuning batch sizes for CUDA graph padding and increasing the default batch size granularity for it by @yijingl-nvidia in #13393
[None][feat] Assert attention DP disabled when KV connector is in use by @jthomson04 in #13448
[https://nvbugs/6050489][fix] fix agg pp4 hang issue by @bo-nv in #12888
[https://nvbugs/6095953][fix] Fix cache memory estimation for Qwen3 hybrid models in trtllm-bench by @hyukn in #13268
[None][test] add unit test and e2e test for gpt_oss_20b MHA kernel by @ruodil in #12796
[https://nvbugs/6037654][fix] Set DeepEP low-latency token limit for qwen3 CI to prevent OOM by @byshiue in #13484
[None][infra] Move some tests to post-merge by @EmmaQiaoCh in #13376
[TRTLLM-10491][test] unwaive DeepSeekV3Lite nvfp4 4gpus test (flaky, self-healed) by @tianyuxbear in #13196
[None][chore] Waive accuracy/test_disaggregated_serving.py::TestDeepSeekV32Exp::test_auto_dtype[False] by @yihwang-nv in #13539
[None][feat] Reduce sampler overhead with min_tokens by @galagam in #13480
[None][infra] enable CUDA line info by default for Debug/RelWithDebInfo by @bobboli in #13334
[None][test] Waive failed cases for main in QA CI by @crazydemo in #13504
[None][test] Waive 2 failed cases for main in QA CI by @xinhe-nv in #13508
[None][chore] Introduce flash...

Contributors

ttyio, venmugil, and 72 other contributors

v1.3.0rc13 Pre-release

VALLIS-NERIA released this 29 Apr 05:47

b9ce4b6

Highlights

Model Support
- Add support and initial optimizations for Nemotron 3 Nano Omni; known issues with audio-from-video and chunked prefill for video are being actively worked on
- Add audio extraction from video, optimize ViT attention, and reduce initialization memory for Nemotron and Nemotron Nano VL models (#12921, #12911, #13283)
- Add per-model VisualGen example scripts, shared configs, per-model defaults, and metadata updates (#12992, #12862)
- Add GLM-4.7 and GLM-5 tool parser support (#13150)
- Optimize Nemotron-H execution from the Python layer and preserve Nemotron HF mamba cache dtype during bench tuning (#13032, #12826)
- Improve DeepSeek-V3.2 and DeepSeek-V3-Lite support with targeted perf and chunked-prefill fixes on Blackwell and SM100-class GPUs (#13142, #13257)
API
- Fix the chunked prefill API contract for Nemotron Nano VL (#13025)
- Add abort and resume support for Async RL in verl (#12272)
- Add a modular logger with automatic module detection and per-module filtering (#13202)
- Improve prompt handling by accounting for existing multimodal placeholder tokens in text prompts (#12827)
- Propagate real server-side failures to disaggregated serving clients and improve empty-file handling in trtllm-bench (#13119, #12552)
Feature
- Add VisualGen Cache-DiT and a unified cache accelerator (#12548)
- Expand kernel support with broader RMSNorm coverage, optimized causal-conv1d prefill and decode, FP4 residual quantization, and refreshed SageAttention kernels (#13033, #13103, #13117, #12937)
- Add batched addSequence with two-phase claim and unified VSWA and non-reuse support (#13029)
- Add sparse MQA and GQA attention support and introduce new sharding infrastructure (#12470, #12419)
- Improve serving performance with async media loading, faster video frame decoding, cached text computation reuse, lower custom-op overhead, padding-aware CUDA graph tuning, and reduced single-rank broadcast overhead (#13034, #12677, #13149, #12895, #13412, #13259, #11640)
- Optimize runtime internals with Minimax RMSNorm tuning, consolidated prefix-reuse analysis, gen-only sync transfer v2, DWDP contention config cleanup, and round-robin CP cache transmission (#12163, #13095, #12882, #12974, #13180)
- Restore EAGLE3 dynamic-tree speculative decoding support and centralize perfect-router integration and validation (#13081, #13250)
Fix
- Fix KV cache and scheduler correctness issues, including SWA compatibility, token accounting with context chunking, over-allocation in VSWA plus EAGLE flows, KVCacheManagerV2 bugs, and multimodal and disaggregated cache reuse problems (#12968, #12976, #12855, #12306, #13104, #12472)
- Fix runtime stability issues by preventing benchmark fill-loop hangs, tightening warmup reservation behavior, and making host-memory-based prefetch decisions consistent across ranks (#13065, #13078, #13161)
- Fix EAGLE3 LoRA speculative decoding and preserve speculative layer weights to avoid MTP plus PP hangs (#13005, #12555)
- Fix FMHA and attention runtime issues, including SM90 full-mask skip-softmax dispatch, misleading generation warnings, stale CUDA graphs on beam-width changes, and FlashInfer KV layout handling (#13120, #13157, #13255, #13190)
- Fix vision and multimodal correctness issues, including KV-cache quantization leaks into the vision encoder, FLUX high-resolution scheduler off-by-one behavior, and Super V3 multi-stream MoE instability (#13181, #13091, #13122)
- Fix packaging and environment issues by restoring the missing aarch64 library, enforcing NCCL >= 2.28 at configure time, and using weights_only=True in LoRA manager loads (#13206, #13108, #13391)
- Fix operational reliability issues in CI and perf pipelines, including OpenSearch upload failures, hanging AIPerf metrics, SLURM host name propagation, and SLURM submission retry behavior (#13215, #13314, #13367, #12778)
- Fix additional model and runtime issues for Qwen3 mrope cache handling, DSA illegal memory access with CUDA graph plus host KV offload, stale tokenizer alias imports, and WAN example timing conflicts (#13269, #13124, #13086, #13193, #12128)
Documentation
- Restructure installation documentation and refresh verbose comments (#12402, #13387)
- Update invalid Dynamo documentation URLs (#13038)
Test & Infra
- Add Dynamo API compatibility tests, VisualGen regression coverage, and refactor MoE communication tests (#12970, #13372, #12841)
- Expand CI coverage for disaggregated serving and weekly performance suites, including K2.5 EPLB coverage, refreshed Nemotron datasets, and additional weekly perf models (#13185, #12982, #13325)
- Improve CI signal quality by splitting multimodal DGX_B200 jobs, removing obsolete or low-priority cases, dropping non-key-model L0 coverage, and moving bf16 and auto precision variants to post-merge (#12978, #13262, #13374, #13315, #13366)
- Improve CI tooling with PR-aware failure analysis, SwiftStack upload support, wildcard bot stage commands, a sync_qa_tests Jenkins script, doc tests, and markdown-only doc-build rules (#12849, #13291, #12881, #13028, #13152, #13358, #13441)
- Refresh repository ownership and security plumbing with CODEOWNERS updates, HMAC key enforcement, and container vulnerability fixes (#13110, #13213, #9850, #13447)

What's Changed

[https://nvbugs/5997092][fix] Remove waives for DS-V3.2/R1 FP4 Blackwell perf tests by @peihu-nv in #13042
[None][infra] Waive 2 failed cases for main in post-merge by @xinhe-nv in #13105
[TRTLLM-9132][infra] Update to ignore failure for release check and building images by @EmmaQiaoCh in #9871
[https://nvbugs/5626259][fix] Enable nemotron-h chunk prefill test by @Wanli-Jiang in #12980
[None][feat] Add the invocation path for mamba2 mtp custom op by @JadoTu in #12787
[None][infra] Waive 4 failed cases for main in post-merge 2654 by @ZhanruiSunCh in #13113
[None][infra] Waive 3 failed cases for main in post-merge 2658 by @ZhanruiSunCh in #13141
[None][chore] Add CODEOWNERS mappings for @NVIDIA/trt-llm-multimodal-devs by @venkywonka in #13110
[None][chore] Add disaggregated tests that timeout to waives.txt by @2ez4bz in #13136
[https://nvbugs/5844149][fix] Fix issues with DSV3.2 perf tests by @chenfeiz0326 in #13142
[None][fix] Fix a capacity issue in KVCacheManagerV2 for SWA compatibility by @heyuhhh in #12968
[https://nvbugs/6044213][chore] unwaive and reduce free mem ratio in AutoDeploy's perf test: deepseek_r1_distill_qwen_32b by @MrGeva in #12965
[None][fix] Fix chunked prefill API contract for nemotron nano VL by @2ez4bz in #13025
[TRTLLM-11794][feat] Optimize ViT Attention kernel on Nemotron by @yechank-nvidia in #12911
[TRTLLMINF-38][feat] Pass PR number to CI failure analysis agent by @dpitman-nvda in #12849
[https://nvbugs/6074784][chore] Temp waive dis-agg transformers failed tests by @Shixiaowei02 in #13145
[None][fix] Fix scheduler off-by-one in FLUX pipelines at high resolutions by @karljang in #13091
[None][infra] Add 5 users to blossom-ci allowlist by @yuanjingx87 in #13146
[TRTLLM-11403][feat] VisualGen Cache-DiT + unified cache accelerator by @o-stoner in #12548
[None][fix] Enable LoRA in EAGLE3 speculative decoding by @Funatiq in #13005
[TRTLLM-11903][test] Add API compatibility tests for dynamo by @brb-nv in #12970
[None][feat] Update rms_norm + fp4_quant kernel supporting more dim by @Wanli-Jiang in #13033
[None][chore] Bump version to 1.3.0rc13 by @VALLIS-NERIA in #13159
[None][fix] Fix compute token accounting for KV cache reuse with context chunking by @lancelly in #12976
[None][feat] Batch addSequence with two-phase claim and unified VSWA/non-reuse support by @liji-nv in #13029
[None][bug] fix SM90 full-mask skip-softmax dispatch by @bobboli in #13120
[None][test] Refactor MoE comm tests: unified dispatch+combine pipeline by @xxi-nv in #12841
[https://nvbugs/5983320][fix] Use encoder_max_batch_size of 1 for LLaVa in test_multi_request_batch_chat by @moraxu in #12647
[TRTLLM-11771][feat] Add audio extraction from video for Nemotron Nano VL by @2ez4bz in #12921
[None][fix] Update stale TOKENIZER_ALIASES import path in serve and bench modules by @cascade812 in #13086
[TRTLLM-11695][feat] Add per-model VisualGen example scripts, shared configs, and per-model defaults by @zhenhuaw-me in #12992
[https://nvbugs/6060119][chore] Unwaive DSR1 FP4 128k8k disagg perf tests by @peihu-nv in #13088
[None][feat] Support sparse mqa/gqa attention by @heyuhh...

Contributors

karljang, reasonsolo, and 73 other contributors

v1.3.0rc12 Pre-release

VALLIS-NERIA released this 17 Apr 15:49

61cef21

Highlights

Model Support
- Add LTX-2 two-stage pipeline support (#12361)
- Add CUDA graph support for LTX-2 with torch.compile compatibility (#12653)
- Add video temporal compression for Nemotron Nano and RADIO (#12649)
- Extend the Python cache transceiver to support Qwen-Next (#12772)
- Add CuteDSL MoE backend support for Qwen3.5 (#12799)
- Fix LoRA support for Qwen3 models (#12785)
- Support loading FP8 LoRA weight files (#12848)
- Add support for speculative decoding with LoRA (#12661)
- Fix OOM with large numbers of LoRA adapters (#12815)
- Partially fix LoRA overallocation for Nemotron NAS (#12817)
- Skip inference_mode() when torch.compile=True for Gemma3 FP8 (#12367)
- Skip NVFP4 fused norm when the dimension does not meet requirements (#12901)
- Update MoE hidden_size in the communicator for Nemotron-H (#12890)
- Unify image-as-tensor handling to avoid repeated conversions for nano models (#12994)
API
- Refine the VisualGen API structure (#12807)
- Convert VisualGenParams to Pydantic with request validation, per-model defaults, and extra_params support (#12922)
- Align AttentionPlugin with the EdgeLLM interface (#12233)
- Add shorthand KVConnector paths for lmcache and kvbm (#12626)
- Add the missing allow_partial_loading parameter to CuteDSL and ConfigurableMoE load_weights (#12761)
- Improve KV cache statistics monitoring (#12413)
Feature
- Add NvTelemetry/GXT-compliant usage telemetry (#12384)
- Add production-level Prometheus metrics for iteration stats, config info, token counters, and phase histograms (#12545)
- Add conversation-affinity routing for disaggregated serving (#12526)
- Enable block reuse with the overlap scheduler (#12816)
- Unify VisualGen parallelism (#12509)
- Consolidate piecewise CUDA graph VLM updates (#12852)
- Add tunable NVFP4 quantization with an additional FlashInfer backend (#12126)
- Optimize GDN prefill with indexed in-kernel state updates (#12791)
Fix
- Propagate disaggregated_params through PostprocWorker (#12513)
- Prebuild disaggregated context responses to avoid ctx_request_id races (#12466)
- Generate HMAC keys for MGMN IPC servers in disaggregated serving (#12670)
- Enable HMAC authentication in VisualGen ZMQ IPC channels (#12680)
- Fix disaggregated gen-only hangs caused by blocking KV transfers (#12640)
- Replace busy-poll sleep in get_async_noblock with the ZMQ async poller (#12189)
- Make trust_remote_code opt-in in MultimodalModelRunner (#12669)
- Fix VLM guided decoding startup crashes caused by missing vocab_size_padded (#12284)
- Eliminate double PNG encoding in visual generation serving (#12903)
- Treat whitespace-only content correctly in nano-v3 reasoning swap (#12912)
- Clamp usedNumBlocks to non-negative values in KV cache statistics (#11922)
- Fix moe_chunking_tokens handling during MoE A2A (#12929)
- Guard CUDA event elapsed_time in perf_metrics_manager to prevent executor crashes (#12868)
- Remove leftover onboardBlocks parameters in kvCacheManagerTest (#13107)
- Add CUDA device setup before load_remote_agent (#12619)
- Fix Mooncake transfer agent binding (#12723)
- Fix multi_stream_moe accuracy with MLIR and piecewise CUDA graphs (#12847)
- Fix Nano chunked prefill (#12782)
- Fix constrained decoding for GLM5 (#12869)
- Fix benchmark disaggregated deadlocks by removing a blocking fill loop (#12208)
- Update CUTLASS C++ to 4.4.2 (#12897)
- Pin Ray to 2.54.1 (#13071)
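
The two HMAC items in the Fix list above (#12670, #12680) concern authenticating IPC messages with a shared key. The following is a minimal sketch of that general technique using only the Python standard library. It is not TensorRT-LLM's implementation: the function names `generate_key`, `sign`, and `verify`, the tag-prefix framing, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
import hmac
import secrets

# Size in bytes of an HMAC-SHA256 tag (assumed digest for this sketch).
TAG_SIZE = hashlib.sha256().digest_size


def generate_key(num_bytes: int = 32) -> bytes:
    """Create a random key to be shared by both ends of the channel."""
    return secrets.token_bytes(num_bytes)


def sign(key: bytes, payload: bytes) -> bytes:
    """Return the payload prefixed with its HMAC-SHA256 tag."""
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return tag + payload


def verify(key: bytes, message: bytes) -> bytes:
    """Return the payload if the tag matches; raise ValueError otherwise."""
    if len(message) < TAG_SIZE:
        raise ValueError("message too short to contain an HMAC tag")
    tag, payload = message[:TAG_SIZE], message[TAG_SIZE:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    # compare_digest compares in constant time to avoid timing side channels.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("HMAC verification failed")
    return payload
```

In a setup like this, the sender would call `sign` before writing to the socket and the receiver would call `verify` before deserializing, so that messages from a process without the key are rejected.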
Documentation
- Add the attention developer guide (#12693)
- Add a README for custom Claude Code skills and agents (#12920)
- Update coding guidelines to require Python >= 3.10 (#13094)
Benchmark
- Optimize the Qwen3.5 decode delta kernel (#12740)
- Reduce host overhead in DSA MLA attention (#12631)
- Add a host performance regression test suite for PyExecutor (#12148)
- Add benchmark coverage for allreduce backends (#12887)
- Restore DSR1/DSV32/K2 disaggregated performance tests (#12688)
- Support NV SA benchmarks in CI performance testing (#13004)
- Add K2.5 performance tests into CI (#12931)
Test & Infra
- Update Perf Sanity System code paths (#12430)
- Bump etcd to 3.6.9 to pick up the gRPC fix (#12594)
- Fix the PLC nightly pipeline and expose more pipeline data (#12940)
- Exclude QA nodes when running TRTLLM CI (#13102)
- Add a unit test for lifecycle race condition errors in disaggregated serving (#12803)
- Add an end-to-end test for PP + disagg + block reuse + chunked prefill hangs (#12913)
- Add Nemotron-3-Super-120B-A12B-NVFP4 functional and performance cases on DGX Spark (#12830)
- Remove obsolete RTX-6000 OOM tests (#12800)
- Remove unused tests (#12625)
- Check unused fixtures (#12730)
- Fix Qwen3 skip-softmax attention CI tests (#12789)
- Fix failing KV cache transceiver tests from the perf sanity changes (#12554)
- Fix Wan unit tests (#13026)
- Remove obsolete waivers (#12979)
- Move the PY312-UB2404 sanity check test to A100X nodes (#13077)
- Pin Ray to 2.54.1 in the Slurm CI stage (#13085)

What's Changed

[None][test] Unwaive Nemotron H flaky case by @nv-guomingz in #11236
[https://nvbugs/5997543][fix] unwaive test_disaggregated_overlap_transceiver_runtime_python by @chuangz0 in #12580
[TRTLLM-11574][feat] Some updates on Perf Sanity System codes by @chenfeiz0326 in #12430
[None][doc] add attention developer guide by @QiJune in #12693
[https://nvbugs/5991957][fix] Propagate disaggregated_params through PostprocWorker by @peihu-nv in #12513
[https://nvbugs/5883590][fix] Generate HMAC key for MGMN IPC server in disaggregated serving by @yibinl-nvidia in #12670
[https://nvbugs/5941242][fix] Fix SigLIP test failure by @tijyojwad in #12717
[None][feat] Optimize qwen3.5 decode delta kernel by @nv-guomingz in #12740
[https://nvbugs/5961736][fix] Prebuild disagg ctx response to avoid ctx_request_id race by @peihu-nv in #12466
[https://nvbugs/5922880][fix] Enable HMAC authentication in VisualGen ZMQ IPC channels by @yibinl-nvidia in #12680
[None][fix] Add missing allow_partial_loading param to CuteDSL and ConfigurableMoE load_weights by @qiaoxj07 in #12761
[None][chore] Waive hanging Nemotron Super test by @brb-nv in #12821
[None][fix] add cuda set device before load_remote_agent by @chuangz0 in #12619
[None][chore] Remove closed bugs by @xinhe-nv in #12766
[None][test] Remove RTX-6000 OOM test cases by @yufeiwu-nv in #12800
[None][fix] Fix LoRA support for Qwen3 models by @achartier in #12785
[TRTLLM-11343][feat] LTX-2 Two Stage pipeline support by @yibinl-nvidia in #12361
[#12808][feat] AutoDeploy: Add Gemma4 Support by @bmarimuthu-nv in #12710
[None][feat] Add Claude Code agents and skills for kernel dev, perf analysis, and compilation by @kaiyux in #12831
[#11879][fix] Clamp usedNumBlocks to non-negative in KV cache stats by @wojciech-wais in #11922
[https://nvbugs/6029864][fix] Fix flaky ray test failure by @brb-nv in #12697
[https://nvbugs/5813192][fix] Make trust_remote_code opt-in in MultimodalModelRunner by @yibinl-nvidia in #12669
[None][infra] Bump etcd to 3.6.9 to include the gRPC fix by @yuanjingx87 in #12594
[https://nvbugs/5658258][fix] Fix OOM with large number of LoRA adapters by @brb-nv in #12815
[None][feat] AutoDeploy: Add the Triton kernel for MLA by @nvchenghaoz in #12664
[None][fix] replace busy-poll sleep in get_async_noblock with zmq async poller by @edenfunf in #12189
[https://nvbugs/6018647][test] Add unit test for Lifecycle Race Condition error in disagg server by @yingguo-trt in #12803
[None][infra] Add DSR1 DSV32 K2 Disagg Perf Tests Back by @chenfeiz0326 in #12688
[None][chore] Add failed cases into waives.txt by @xinhe-nv in #12765
[None][chore] Add failed cases into waives.txt by @xinhe-nv in #12814
[None][fix] Fix VLM guided decoding startup crash due to missing vocab_size_padded property by @stefanpantic in #12284
[None][fix] Fix Nano chunked prefill by @2ez4bz in #12782
[https://nvbugs/6029220][fix] Disable multi-stream in maybe_execute_i… by @liji-nv in #12659
[None][test] remove unused tests by @xinhe-nv in #12625
[https://nvbugs/6000658][fix] Fix disagg gen-only hang where 10s sleep in can_forward blocks KV transfers and overflows CTX memory by @peihu-nv in https://github.com/NVIDI...

Contributors

karljang, tijyojwad, and 60 other contributors

v1.2.1 Latest

VALLIS-NERIA released this 20 Apr 11:51

376f7e1

Highlights

Fixed Issue
- Fixed an issue that caused KV cache corruption (#12770)
Infrastructure Changes
- Upgraded xgrammar and flashinfer (#12811)
