Releases: sgl-project/sglang
Release v0.5.4
Highlights
- AMD AI Dev Day 2025 SGLang (slide), PyTorch Conference 2025 SGLang (slide)
- Model gateway v0.2 release: https://docs.sglang.ai/advanced_features/router.html
- [beta] Overlap scheduler for speculative decoding: #11762
- [beta] Piecewise CUDA graph for prefill: #11490
- Prefix cache for Qwen3-Next and GDN/Mamba models: #11214
- Full set of optimizations for DeepSeek-V3.2 (MTP, PD-Disagg, Function Calling) (https://docs.sglang.ai/basic_usage/deepseek_v32.html, #11989)
- Various Blackwell kernel optimizations
- DGX Spark Support: https://lmsys.org/blog/2025-10-13-nvidia-dgx-spark/
- KTransformers integration: https://lmsys.org/blog/2025-10-22-KTransformers/
- New model support: Nemotron, DeepSeek OCR, Qwen3-Omni, Olmo 3
- Native ModelOpt quantization support
What's Changed
- [router] add ipv6 support across all components by @slin1237 in #11219
- Remove env var warnings for release by @merrymercy in #11262
- Enable native ModelOpt quantization support (1/3) by @Edwardf0t1 in #7149
- [router][tool call] Clean up redundant `detect_format` and `has_tool_markers` by @CatherineSue in #11270
- disable sm100 for FlashMLA and fast-hadamard-transform in cuda12.6.1 by @gongwei-130 in #11274
- docker: add manifest to versioned docker releases by @ishandhanani in #11268
- [Bug] Fix incorrect assertion in FA4 and add UT. by @lifuhuang in #11182
- [router][grpc] Refine streaming processes by @CatherineSue in #11277
- Fix code sync scripts by @merrymercy in #11276
- [Auto Sync] Update test_utils.py (20251006) by @merrymercy in #11280
- Rename max_micro_batch_size -> pp_max_micro_batch_size by @merrymercy in #11279
- Reverse the AMD CI test back to 1200s and split the 8-gpu deepseek job into two. by @sunxxuns in #11238
- Fix LoRA support for multimodal models (VLMs) by implementing a consistent pattern for skipping vision components by @ConnorLi96 in #11261
- fix: correct scale parameter remapping logic in Llama4ForConditionalGeneration by @JustinTong0323 in #11282
- docs: update sgl-kernel README by @zhyncs in #11286
- chore: bump sgl-kernel version to 0.3.15 by @sglang-bot in #11281
- [router][grpc] Fix proto3 default value mismatches and cleanup unused fields by @CatherineSue in #11283
- convert test_deterministic into unit tests by @skyzh in #11095
- Feature/longbench v2 evaluation utils by @alhridoy in #10949
- [ci] fix pp test by @hnyls2002 in #11294
- EAGLE cache fix for SWARadixCache by @ispobock in #11231
- Remove overlap thread by @hnyls2002 in #11210
- [router] add reasoning and tool parser argument in router by @slin1237 in #11290
- Remove sampling info events and overlap thread file by @hnyls2002 in #11300
- Introduce future indices by @hnyls2002 in #11301
- [sgl-kernel] Support float64 moe_sum_reduce cuda kernel by @yuan-luo in #11068
- [Docs] [Router] Update Observability and Common Issues Section by @xuwenyihust in #11302
- [router] add get server info and get model info in grpc server by @slin1237 in #11303
- [router][grpc] Refactor chat template content format detection by @CatherineSue in #11288
- [Doc] HiCache Design Documents by @ykwd in #11027
- [Doc]: Best Practice for HICache by @hzh0425 in #11001
- [router] fix grpc connection conversion and add optimization by @slin1237 in #11305
- [router][grpc] Fix sampling_params.stop_strs is None by @CatherineSue in #11306
- Update tool parser and related documentation by @JustinTong0323 in #11223
- [router][grpc] Fix error message format in grpc chat handler by @CatherineSue in #11307
- [quantization] Properly ignore quantization for layers excluded in quant_config by @BowenBao in #11205
- [router] support Openai router conversation API CRUD by @key4ng in #11297
- [router][grpc] Fix request_id extraction when n > 1 by @CatherineSue in #11311
- [router] cleanup worker health check to return early by @slin1237 in #11310
- [oai serving chat] Add argument `--sampling-defaults` and fix `ChatCompletionRequest` defaults by @CatherineSue in #11304
- Clean match_prefix and prepare_for_extend for mem cache V2 by @cctry in #11200
- ci: unify the model launch method of nightly ci by @mickqian in #11230
- [Chore] Update xgrammar 0.1.24 -> 0.1.25 by @DarkSharpness in #10710
- update sampling_params documentation with defaults by @JustinTong0323 in #11315
- Optimize copy_kv_cache for spec decoding by @YAMY1234 in #11126
- Rename `ngram_utils` -> `ngram_info` by @hnyls2002 in #11316
- [router][grpc] Refactor chat handler in grpc/ to use centralized orchestrator by @CatherineSue in #11314
- [Feature] Add /tokenize and /detokenize OpenAI compatible endpoints by @adarshxs in #9545
- [8/N] MoE Refactor: deprecate `EPMoE` by @ch-wan in #11211
- Skip weight loading in deepgemm compilation by @ch-wan in #11312
- [2/2] Support MHA prefill with FlashAttention 4. by @lifuhuang in #10937
- [Doc] Update mooncake nvlink transport doc for PD disaggregation by @ShangmingCai in #11321
- fix(decode): adjust ServerArgs import to explicit module path by @xiaguan in #11007
- Support LoRA in bench_serving oai interface by @lifuhuang in #11318
- benchmark: enhance configurable multimodal benchmarking in bench_serving by @AlienKevin in #9812
- [CI] improve disaggregation CI. by @hnyls2002 in #11264
- model: Support Hybrid Mamba2 NemotronHForCausalLM (nvidia/NVIDIA-Nemotron-Nano-9B-v2) by @netanel-haber in #10909
- [router] refactor generate to use new pipeline arch by @slin1237 in #11323
- [router] improve reasoning parser lock and reduce req cloning by @slin1237 in #11336
- [router][grpc] Cleanup debug logs in grpc_server and grpc_router by @CatherineSue in #11340
- [router] Fix all unused_qualifications by @CatherineSue in #11341
- [router] Support history management using conversation by @key4ng in #11339
- [router][grpc] Add dependencies in Cargo.toml to support chat template rendering by @CatherineSue in #11342
- fix: fix revision for sgl-flash-attn in sgl-kernel by @mickqian in #11327
- [Auto Sync] Update scheduler.py (20251009) by @zhyncs in #11350
- [Generative Score API] Multi-Item scoring with custom attention mask. by @sundar24295s in #10979
- [router][grpc] disable health check generation and increase timeout by @slin1237 in #11353
- [router] Refactor OpenAI router: split monolithic file and move location by @key4ng in #11359
- [router][lint] Add unused_qualifications to cargo lint warnings by @CatherineSue in #11366
- [DeepSeek-V3.2] Include indexer kv cache when estimating kv cache size by @trevor-m in #11309
- [router][grpc] Fix tool call streaming bugs: empty tool names, state pollution, and panics by @CatherineSue in https://github.c...
Release v0.5.3
Highlights
- Day 0 Support for DeepSeek-V3.2 with Sparse Attention: https://lmsys.org/blog/2025-09-29-deepseek-V32/
- Deterministic inference on multiple attention backends: https://lmsys.org/blog/2025-09-22-sglang-deterministic/
- Integration of FlashAttention 4 prefill kernels
- Enhancing support of Qwen3-Next with MTP, DP, optimized kernels and multiple hardware platforms
- Support models including Qwen3-VL series, dots.vlm1, Ling-V2, Apertus, SOLAR
What's Changed
- [Auto Sync] Update server_args.py (20250912) by @merrymercy in #10347
- [CPU][doc] add torch.compile param in example commands by @ZailiWang in #10349
- [router][ci] Add gpu utilization analyze with nvml by @key4ng in #10345
- [NVIDIA] [3/N] Nvfp4 Masked Gemm: Add flashinfer grouped_gemm_nt_masked by @wenscarl in #9199
- fix: flashinfer_cutlass_moe: Use max of global expert scales instead of local for input scale by @trevor-m in #10296
- model: support Apertus by @EduardDurech in #9774
- fix dual stream bug by @yizhang2077 in #10352
- [router] Basic OAI Response api by @key4ng in #10346
- Implement Standalone gRPC Server for SGLang Python Scheduler by @CatherineSue in #10283
- support memory_pool_host page first direct layout by @huangtingwei9988 in #10031
- fix the break in FlashInferFusedMoE by @chenqianfzh in #10356
- fix: resolve transfer_kv_all_layer_direct_lf_pf import error by @zhyncs in #10360
- Support LingV2 model by @strgrb in #10359
- Fix Bailing MoE model bugs by @yuan-luo in #10362
- Revert add mainprocess's proctitle by @whybeyoung in #10351
- model: support dots.vlm1 model by @yonghenglh6 in #8778
- Support loading weights from remote instance by @amysaq2023 in #8215
- add qwen3-next ut by @yizhang2077 in #10355
- Fix chunked prefix cache for nvfp4 by @wenscarl in #10180
- Fix FA4 import cause moe_fused_gate output be illegal memory by @fzyzcjy in #10368
- Fix global input scale incompatible with CuTe DSL moe by @fzyzcjy in #10370
- [router] Add Rerank Routing Logic in Regular Router by @fangjian601 in #10219
- [router] enable sccache in ci and local build by @slin1237 in #10099
- fix: add fast path for function call by @yizhang2077 in #9023
- [Auto Sync] Update base_grammar_backend.py, llguidance_back... (20250911) by @merrymercy in #10333
- fix: resolve gb200 image link by @zhyncs in #10343
- fix: exclude protobuf generated code by @zhyncs in #10388
- [bug] fix ci syntax by @slin1237 in #10390
- Fix GPU fault issue when run dsv3 with dp mode and enable torch-compile by @kkHuang-amd in #10361
- feat: add deepseek v3 fp4 ut by @zhyncs in #10391
- Add sentencepiece to project dependencies by @mmangkad in #10386
- [router] allow one router to support different model families and serving mode by @slin1237 in #10244
- [router] Add get and cancel method for response api by @key4ng in #10387
- Benchmark: Support API_KEY without 'bearer' by @Muqi1029 in #10380
- Support Qwen3-Next on Ascend NPU by @iforgetmyname in #10379
- [HiCache] fix mooncake config in different tp size by @stmatengss in #10377
- [HiCache] doc: update deployment in readme by @stmatengss in #10332
- [router] add not implemented functions for multi model trait by @slin1237 in #10394
- [Auto Sync] Update xgrammar_backend.py (20250913) by @merrymercy in #10395
- fix probs name which without temp scaling name by @narutolhy in #9984
- Fix the style of sgl kernel by @merrymercy in #10398
- fix: tool parse in large streaming chunk beginning with normal content by @JustinTong0323 in #10397
- [Fix] Init mamba related memory pools with torch.zeros by @byjiang1996 in #10400
- support qwen3_next blackwell by @yizhang2077 in #10403
- [Fix] Support qwen3-next MTP+DP by @byjiang1996 in #10392
- Update ROCm docker image to add sgl-router support by @kkHuang-amd in #10406
- [Performance] Dynamic Batch Tokenizer by @sundar24295s in #9382
- [Generative Score API] Scoring(Prefill-only) optimizations. by @sundar24295s in #9748
- Remove repeatedly lists adding in `init_incremental_detokenization` by @hnyls2002 in #10412
- [Hack] Add pd-disaggregation decode polling interval by @hnyls2002 in #10411
- fix duplicated logger in eager_utils by @lj970926 in #10410
- Fix cutlass moe accuracy drop caused by attention UB from DP padding mode by @fzyzcjy in #10414
- Add self.capture_aux_hidden_states For GLM-4.5V by @zRzRzRzRzRzRzR in #10228
- Add h200 fused moe config for Qwen3-Next by @Ximingwang-09 in #10404
- Auto determine sgl kernel version in blackwell CI by @fzyzcjy in #10318
- Fix the global scale fix does not support EPLB and improve enabling condition by @fzyzcjy in #10369
- Let sgl-kernel changes be tested on srt by @fzyzcjy in #10313
- [2/2] Speed up prefill mla attention concat by @fzyzcjy in #10157
- Support offloading in fp8 by @fzyzcjy in #9948
- Support global scale in addition to per expert scale for cutedsl moe by @fzyzcjy in #10270
- Support profile args in Engine API by @fzyzcjy in #6539
- Fix sgl-kernel + srt CI by @fzyzcjy in #10419
- [PD metrics] Fix some uncompleted PD related metrics by @acelyc111 in #8627
- Typo: in `--enable-custom-logit-processor`: agree with cli arg by @thalahors in #10076
- [Feature] Sglang Tracing: Fine-Grained Tracking for Request Latency - Part 1 by @sufeng-buaa in #9962
- fix: use latest flashinfer by @zhyncs in #10428
- fix: enable cu124 and cu128 build on main push by @zhyncs in #10431
- [Fix] MoE: fix w8a8_fp8 MoE and add tests to cover this code path by @ch-wan in #10429
- Add split tile size for Triton attention by @ispobock in #10425
- Fix correction bias undefined behavior for nvfp4 models by @fzyzcjy in #10426
- feat: add dsv3 fp4 cutlass moe etp ut by @zhyncs in #10433
- router: Add Embedding routing logic by @tao12345666333 in #10129
- Revert "Fix FA4 import cause moe_fused_gate output be illegal memory" by @fzyzcjy in #10432
- [4/N] DP refactor: support watching mode `get_load` and shortest queue strategy by @hnyls2002 in #10201
- automatically label pr for ci by @merrymercy in #10435
- Refactor TopK to ensure readability and extensibility by @ch-wan in #9338
- Tiny fix wrong naming by @fzyzcjy in #10437
- Fix label pr for ci by @merrymercy in #10441
- metrics: support customer labels specified in request header by @acelyc111 in #10143
- [docs / oneliner] update mmmu docs instruction by @vincentzed in #9768
- Add reasoning examples for GPT-OSS in Markdown examples by @vincentzed in #9626
- Fix label PR by @merrymercy in #10445
- Update permissions in label-...
Release v0.5.2
Highlights
- SGLang HiCache: Fast Hierarchical KV Caching with Your Favorite Storage Backends: https://lmsys.org/blog/2025-09-10-sglang-hicache/
What's Changed
- feat: allow use local branch to build image by @gongwei-130 in #9546
- [readme] Include additional resources for the SGLang x AMD SF Meetup event by @wisclmy0611 in #9547
- [doc] deepseekv31 support by @XiaotongJiang in #9544
- fix(grok): remove duplicate replicate_lm_head configuration by @vincentzed in #9549
- chore: update configurer by @zhyncs in #9557
- chore: bump v0.5.1.post1 by @zhyncs in #9558
- [router] add right rustls dependency in sgl-router cargo.toml by @Bruce-x-1997 in #9498
- fix: use sgl-kernel 0.3.5 by @zhyncs in #9565
- Add target module validation for init adapters by @Beichen-Ma in #9429
- fix: Update OpenAI client base URL in documentation by @JustinTong0323 in #9576
- [PD] Improve disaggregation metrics output: update the metrics to keep reflecting real stats by @SCDESPERTATE in #7317
- remove redundant rank0_log function. by @miter6 in #9560
- Update CUTLASS 4.2 & Enable K-Major Scale Factor for SM90 FP8 Blockwise Group GEMM by @HydraQYH in #9559
- Reintroduce memory usage fix by @fzyzcjy in #9535
- Offload tensors by sharding on GPU by @fzyzcjy in #9536
- bugfix for undefined logging functions in HarmonyBrowserTool & HarmonyPythonTool by @CiaranZhou in #9229
- chore: upgrade flashinfer 0.2.14.post1 by @zhyncs in #9578
- fix: revert #8593 by @zhyncs in #9581
- fix: resolve tuning fused moe issue by @zhyncs in #9587
- Tiny fix wrong comments by @fzyzcjy in #9589
- chore: update config by @zhyncs in #9591
- chore: bump v0.5.1.post2 by @zhyncs in #9592
- [Doc] add LWS(LeaderWorkerSet) use case in sgl-router README by @Bruce-x-1997 in #9568
- [Performance] Batch Send from Tokenizer Manager. by @sundar24295s in #9436
- Fix GLM45 tool call multi-turn bug by @byjiang1996 in #9500
- Fix GLM45v launch server cuda torch compile bug by @byjiang1996 in #9554
- Fix Harmony reasoning parser for and auto-separation for gpt-oss models by @jonaslsaa in #9190
- [docs] Refactor, remove compiled results and add gpt-oss by @zhaochenyang20 in #9613
- [Fix] HiCache Bugfix & Mooncake Error Handling Enhance by @ykwd in #8901
- Improve bench_one_batch_server script by @hnyls2002 in #9608
- [router] add mistral tool parser by @slin1237 in #9622
- [router] add qwen tool parser by @slin1237 in #9623
- [router] add pythonic parser by @slin1237 in #9628
- [router] add llama tool parser by @slin1237 in #9629
- [router] add ut for mistral, llama, pythonic, and streaming tool parser by @slin1237 in #9632
- [new feat] ascend backend support fia fusion kernel by @ZhengdQin in #8328
- model: Support nvidia/Llama-3_1-Nemotron-Ultra-253B-v1 by @netanel-haber in #9301
- Fix lint for router by @hebiao064 in #9636
- [docs] Update README with additional highlights and resources for SGLang x AMD SF Meetup by @wisclmy0611 in #9640
- Add reasoning_effort param in TiktokenTokenizer.apply_chat_template by @lshmouse in #9630
- fix: allow user to specify function as role by @GavinZhu-GMI in #9635
- Fix kimi k2 function calling format by @XiaotongJiang in #9606
- [router] address worker load tracking consistency by @slin1237 in #9523
- [router] add token bucket rate limiter by @CatherineSue in #9656
- [doc] add kimik2 --tool-call-parser by @XiaotongJiang in #9647
- Install py-spy by default for containers for easier debugging by @fzyzcjy in #9649
- BugFix(hicache): Fix host indices out of bound error by @hzh0425 in #9637
- HiCache Storage fix host memory leak by @xiezhq-hermann in #9648
- add `response_format` support for `completion` API by @cicirori in #9665
- Fix FA3 swa spec verify topk>1 by @ispobock in #9658
- [RL] fix register the same ops multiple times by @hebiao064 in #9564
- chore: enhance bench_serving for vlms with a new dataset of configurable image count and resolution by @mickqian in #9583
- refactor(hicache): Introduce generic HiCacheStorageConfig for improved configuration management by @hzh0425 in #9555
- feat: (chat-template matching) enhance multimodal model detection with config.json by @KEVINTUAN12 in #9597
- [docs] Instructions for bench_serving.py by @yhyang201 in #9071
- Support DeepSeek-V3.1 tool call by @Xu-Wenqing in #9446
- Add A100 fused MoE kernel configs for Dpsk by @ehuaa in #9677
- support cuda 13.0 and trtllm kernel by @rainj-me in #9495
- fix: HiRadixCache: fix prefetch completion race by @pabloiyu in #9397
- fix mooncake store mla zero copy meta by @huangtingwei9988 in #9678
- move is_sm90_supported/is_sm100_supported to python/sglang/srt/utils.py by @merrymercy in #9679
- [router] restructure tool parser module folder by @slin1237 in #9693
- [router] add deepseek tool parser by @slin1237 in #9694
- Quick fix for loading processor for supporting internvl3_5 series by @yilian49 in #9676
- Fix get_ip when no external network by @whybeyoung in #9700
- Sets default model name in request classes by @JustinTong0323 in #9683
- [router] add step3 tool parser by @slin1237 in #9695
- [router] add kimi-k2 tool parser by @slin1237 in #9702
- [router] add gpt-oss and glm4 tool parser by @slin1237 in #9703
- [sgl-kernel] misc: update deepgemm version for sgl-kernel by @FlamingoPg in #9340
- chore: upgrade sgl-kernel 0.3.7 by @zhyncs in #9708
- chore: bump v0.5.1.post3 by @zhyncs in #9716
- [router] upgrade kernel version in pd ci by @CatherineSue in #9720
- [Sync] Update mxfp4.py (20250827) by @merrymercy in #9724
- [router] fix error response in pd_router by @Bruce-x-1997 in #9505
- [router] Add MCP Tool Handler by @key4ng in #9615
- gpt-oss blog reproduction document by @hnyls2002 in #9728
- [router] additional pythonic parser unit test by @slin1237 in #9730
- [router] additional llama32 parser unit test and multi json support by @slin1237 in #9732
- support mooncake store dp attention by @huangtingwei9988 in #9684
- add support for nvidia/gpt-oss-120b-Eagle3 by @zyksir in #9739
- Move git clone command up from README by @JustinTong0323 in #9740
- [feat] Reduce GPU memory overhead by using weakref by @yhyang201 in #9673
- Support speculative decoding in hybrid attention backend by @Qiaolin-Yu in #9573
- [router] add llama3.2 multi json streaming parser by @slin1237 in #9735
- Support compile sgl-kernel on cuda 13.0 by @rainj-me in https://github.co...
Release v0.5.1
What's Changed
- [PD] Use batch transfer for rdma transport and add notes for mnnvl usage by @ShangmingCai in #8595
- [bugfix] QWen-1M context support[2/3] using current cuda stream in the DCA's kernel for bugfix. by @sighingnow in #8611
- Fix hf3fs_fuse import error by @ispobock in #8623
- Update step3v default config by @ispobock in #8626
- [ci] fix genai-bench execution cmd by @slin1237 in #8629
- [router] update router pypi version by @slin1237 in #8628
- [Optimization][Perf] Disable the GC during CUDA graph capture to speed up by up to 3x by @b8zhong in #8577
- Fix typos in py_test/test_launch_server.py by @windsonsea in #6227
- misc: Remove debug print to logger.info by @CatherineSue in #8633
- SGLang HiCache NIXL Connector by @vvenkates27 in #8488
- [bug] remove pdlb from minilb since it's no longer available by @slin1237 in #8634
- [bugfix] Fix flashinfer cutlass EP moe after MoE refactor by @trevor-m in #8630
- Conditionally import HiCacheHF3FS by @pansicheng in #8598
- TRTLLM Gen MLA Decode Kernel Integration (same as #7938) by @farazkh80 in #8632
- Fix nan value generated after custom all reduce by @kkHuang-amd in #8532
- Revert "Fix nan value generated after custom all reduce (#8532)" by @zhyncs in #8642
- Feature/modelscope model download by @yrk111222 in #8083
- chore: speedup NPU CI by cache by @pkking in #8270
- [Bugfix] fix w8a8_int8 load issue by @iforgetmyname in #8308
- [bugfix] fix router python parser for pd urls by @slin1237 in #8644
- [router] add basic usage doc by @slin1237 in #8640
- [router] upgrade router version to 0.1.8 by @slin1237 in #8645
- [NVIDIA] Enable Flashinfer MoE blockscale fp8 backend for TP MoE by @kaixih in #8450
- HiCache, fixing hash value indexing by @xiezhq-hermann in #8636
- Interface change for kvcache io to support page first layout by @xiezhq-hermann in #8318
- Update batch size limitation of dsv3_router_gemm kernel to 16 by @Fridge003 in #8051
- chore: bump v0.4.10.post1 by @ispobock in #8652
- Add hf3fs_utils.cpp to package-data by @pansicheng in #8653
- Fix chat template handling for OpenAI serving by @JustinTong0323 in #8635
- Bug: apply final_hidden_states*=self.routed_scaling_factor at MoE lay… by @byjiang1996 in #8511
- [5/N] MoE Refactor: Update MoE parallelism arguments by @ch-wan in #8658
- Increase tolerance to address CI failures by @lifuhuang in #8643
- [Kimi K2] dsv3_router_gemm supports NUM_EXPERTS == 384 by @panpan0000 in #8013
- [Doc] fix: Update README for cu126 sgl-kernel compile problem by @Hongbosherlock in #8665
- fix per token cuda kernel hidden dim cannot divide by 16 by @hebiao064 in #8543
- fix arg typo for --disaggregation-transfer-backend by @ZacWang in #8664
- [fix] fix pd disagg error of vlms by @ccw1996 in #8094
- Disable tp for shared experts under expert parallelism for GLM4.5 model (#8647) by @zminglei in #8647
- [bugfix] Fix page size for create_flashmla_kv_indices_triton() for cutlass mla by @trevor-m in #8685
- [bug] limit bootstrap room to [0, 2^63 - 1] by @slin1237 in #8684
- Update CODEOWNERS by @merrymercy in #8686
- Fix deepgemm masked grouped gemm jit compile by @ispobock in #8679
- Fix FP8 block quantization when N or K is not multiples of 128 by @yanbing-j in #8648
- bugfix(hicache): Fix 'MooncakeStore' not defined error. by @hzh0425 in #8668
- upgrade xgrammar 0.1.22 by @Swipe4057 in #8522
- [bugfix] Add 'disaggregation_mode' parameter to warmup function when compile deep_gemm manually by @lbh2001 in #8618
- Add support for NCCL symmetric memory for TP allreduces by @nvcastet in #8238
- [1/2] sgl-kernel: Fuse routed scaling factor into select_experts by @trevor-m in #8364
- chore(gb200): update dockerfile to handle fp4 disaggregation by @ishandhanani in #8694
- [bugfix] Apply routed scaling factor to cutlass_fused_experts_fp8 by @trevor-m in #8688
- Fix: resolve prefill of retracted request out-of-memory issue when ignore_eos is enabled by @GaoYusong in #7434
- model: adapt mllama4 to VisionAttention by @wenchen76 in #8512
- Add tensor.detach() back to update weight util by @hebiao064 in #8691
- [Doc] Polish sgl-kernel readme for cu126 build error by @FlamingoPg in #8704
- Revert "[1/2] sgl-kernel: Fuse routed scaling factor into select_experts" by @hnyls2002 in #8706
- [router] minor code clean up and refactoring by @slin1237 in #8711
- [Bug] fix green context's incompatibility with `cuda < 12.4` by @hnyls2002 in #8701
- chore: bump sgl-kernel v0.2.9 by @zhyncs in #8713
- Remove assertions about per group quant fp8 by @fzyzcjy in #8717
- [FIX] Fix the nightly CI by disabling swa mem pool for gemma2 by @merrymercy in #8693
- Fix triton moe error caused by TopK refactor by @fzyzcjy in #8705
- [router] Implement HTTP Dependency Injection Pattern for Router System by @slin1237 in #8714
- [Feature] Radix Tree in C++ by @DarkSharpness in #7369
- [Perf]Use Cooperative Schedule for H100 & H200 & H800 in fp8_blockwise_scaled_grouped_mm by @HydraQYH in #8722
- Fix fused MoE when `routed_scaling_factor is None` by @hnyls2002 in #8709
- Tiny fix CI pytest error by @fzyzcjy in #8524
- [hotfix] fix mixtral with tensor-level compressed-tensor quantization by @ch-wan in #8721
- Support limiting max loaded loras in CPU. by @lifuhuang in #8650
- Reduce memory accumulation in long-running server by @Edenzzzz in #8306
- HiCache storage, style change and bug fix by @xiezhq-hermann in #8719
- [feat] support minimum token load balance in dp attention by @WANG-GH in #7379
- Do layernorm before allgather for DP attention by @trevor-m in #8631
- [fix] Fix divide by zero error for llama4. by @shenoyvvarun in #8683
- feat: Add new moe triton for NVIDIA RTX 6000 Ada by @17Reset in #8547
- [Improvements] Merge health check route by @whybeyoung in #8444
- chore: bump sgl-kernel 0.3.0 with torch 2.8.0 by @zhyncs in #8718
- Save cuda graph memory for fa3 by @ch-wan in #8567
- [CUDA Graph] save cuda graph memory by using next_token_logits_buffer by @ch-wan in #8579
- [DP] fix the compatibility issue between DP attention and `--attention-backend triton` by @ch-wan in #8723
- chore: bump v0.4.10.post2 by @zhyncs in #8727
- feat: Support DP Attention for step3_vl by @yhyang201 in #8699
- [RL] fix update weight for FusedMoE with EP by @zhuzilin in #8676
- use fp32 for e_score_correction_bias in GLM-4.5 by @zRzRzRzRzRzRzR in #8729
- Fix triton kernels topk with keyword arguments by @ispobock in https://github.com/sgl-project/sglang/pull/...
Release v0.4.10
Highlights
This is a regular release with many new optimizations, features, and fixes. Please check out the following exciting roadmaps and blogs:
- Please check the 2025 H2 roadmap #7736
- GLM-4.5 Meets SGLang: Reasoning, Coding, and Agentic Abilities https://lmsys.org/blog/2025-07-31-glm4-5/
- SpecForge: Accelerating Speculative Decoding Training for SGLang https://lmsys.org/blog/2025-07-25-spec-forge/
- Deploying Kimi K2 with PD Disaggregation and Large-Scale Expert Parallelism on 128 H200 GPUs https://lmsys.org/blog/2025-07-20-k2-large-scale-ep/
- Accelerating SGLang with Multiple Token Prediction https://lmsys.org/blog/2025-07-17-mtp/
- How to support new VLMs into SGLang: A Case Study with NVILA https://lmsys.org/blog/2025-07-16-nvila/
- Cost Effective Deployment of DeepSeek R1 with Intel® Xeon® 6 CPU on SGLang https://lmsys.org/blog/2025-07-14-intel-xeon-optimization/
- slime: An SGLang-Native Post-Training Framework for RL Scaling https://lmsys.org/blog/2025-07-09-slime/
What's Changed
- [AMD] add aiter fused moe in DeepEP path by @alexsun07 in #7268
- enable aiter_biased_grouped_topk kernel by @valarLip in #7423
- [PD Disaggregation] replace transfer with batch transfer for better performance by @ssssnow in #7236
- Remove cumsum_buffer initialization by @ispobock in #7439
- [benchmark] fbgemm benchmark support bandwidth report and support fbgemm_cutlass_gmm by @BBuf in #7422
- Support multi-thread model weight loading by @xianzhiT in #7277
- [PD] NIXL: Register kv args in advance and cleanup finished requests by @trevor-m in #6717
- fix: Add `--model` as an alias for `--model-path` in server_args by @CatherineSue in #7505
- misc: Improvement to serving_chat.py and add more ut by @CatherineSue in #7489
- Fuse sorted_token_ids padding to moe_align_block_size kernel by @ispobock in #7437
- [OAI] patch origin request_id logic by @whybeyoung in #7508
- [PD][Spec] Fix hidden state transfer for spec decode by @ShangmingCai in #7516
- EPLB support for MTP by @yilian49 in #7510
- clean duplicate code by @habaohaba in #7512
- [ci] add router benchmark script and CI by @slin1237 in #7498
- fix: force synchronization between TP workers when update_weights by @dangkai4u in #6626
- [CPU] [BF16] Call fused_experts_cpu, weight_packed_linear and bmm_cpu kernel in DeepSeek model by @chunyuan-w in #6641
- [CI] Upgrade mooncake to v0.3.4.post2 to fix potential slice failed bug by @ShangmingCai in #7522
- npu fused op by @ll819214 in #7386
- feat: send kvmetrics from sglang scheduler by @zixuanzhang226 in #6721
- [PD] Add different TP sizes support for no-MLA models by @Hongbosherlock in #6793
- enable aiter fp8 blockscale quant by @valarLip in #7520
- take aiter get_rope back by @valarLip in #7521
- Fix typo of flash_cache by @hebiao064 in #7513
- feat: add return hidden_states at async generation by @yyihuang in #7507
- minor: 'role' must be system/assistant/tool, but case insensitive for now by @minleminzui in #7499
- Fix FP8 KV Cache Support in FA3 Backend by @guoyuhong in #7148
- Fix gathered_buffer issues in tbo by @Qiaolin-Yu in #7531
- [PD] Raise error for incompatible mooncake version and some minor fixes by @ShangmingCai in #7527
- [CMake] Fix sgl-kernel CMakeLists for Blackwell by @MasterJH5574 in #7543
- Add Tencent HunYuanMoEV1 model support by @mpjlu in #7549
- Update seed in CPU UTs to avoid flaky failure with single test by @yanbing-j in #7544
- chore: improve ci bug reporting by @mickqian in #7542
- chore: remove vlm unnecessary import by @JustinTong0323 in #7541
- chore: bump v0.4.8.post1 by @zhyncs in #7559
- [PD][NIXL] Set is_sorted=False to fix NIXL_ERR_NOT_FOUND by @trevor-m in #7330
- [Fix] incorrect assert in EPLB by @ch-wan in #7575
- Updates Gemma3n MLP layer to adapt latest transformers version by @JustinTong0323 in #7573
- Fix MTP error when enabling two-batch overlap by @fzyzcjy in #7569
- Add e2e test for multi instance multi stage memory release/resume occupation by @MrAta in #7208
- [CI] Add CI Testing for Prefill-Decode Disaggregation with Router by @key4ng in #7540
- Updates transformers and timm dependencies by @JustinTong0323 in #7577
- feat: support compatibility between MTP and two-batch-overlap by @Qiaolin-Yu in #7225
- Move multimodal processors into a separate folder by @merrymercy in #7581
- Fix broken CI TestVILAServer by @lifuhuang in #7610
- [router] add centralized configuration module for sgl-router by @slin1237 in #7588
- Fix: Minicpm by @JustinTong0323 in #7612
- Hybrid kv cache for LLaMA4 by @tarinkk in #6563
- [CPU] add optimizations for INT8 and FP8 DeepSeek by @chunyuan-w in #6769
- Tiny add logs for expert location updater by @fzyzcjy in #7308
- Fix flakiness in LoRA batch test. by @lifuhuang in #7552
- [BUG] fix local_rank in initialize_dp_attention by @TomQuartz in #7584
- Support dynamic LoRA loading / unloading in engine/server API by @lifuhuang in #7446
- [PD] Respect sampling_params.max_new_tokens when PD disaggregation is activated by @ShangmingCai in #7598
- fix unit tests by @zhyncs in #7618
- Let ep_scatter support arbitrary strides / ue8m0 format by @fzyzcjy in #7309
- Let EP prefill support new DeepGEMM by @fzyzcjy in #7310
- docs: add gb200 nvl72 and a16z grant by @zhyncs in #7620
- Adds support for OpenAI chat completions API in bench_serving by @JustinTong0323 in #7036
- [bugfix] Remove PR comment posting from Rust benchmark workflow by @slin1237 in #7625
- [Minor] clean up multimodal processor and tokenizer manager by @merrymercy in #7624
- Add dsv3 fused a gemm to sgl-kernel by @ispobock in #7630
- Add @mickqian as the CODEOWNERS of multimodal by @merrymercy in #7636
- Fix stream reasoning parser and Adds Kimi reasoning parser by @JustinTong0323 in #7432
- Fix sgl-router startup crash by @finetunej in #7619
- [bugfix] fix runtime dropping panic in editable by @slin1237 in #7628
- Move files related to EPLB by @fzyzcjy in #7580
- [misc] reduce weird rope_scaling_factor warning by @Alcanderian in #7176
- [AMD] Add unit-test-sgl-kernel-amd to AMD CI by @hubertlu-tw in #7539
- Update CODEOWNERS by @merrymercy in #7640
- [EAGLE] remove a wrong adjustment for page_size > 1 & topk > 1 in server_args.py by @merrymercy in #7643
- [CPU] add c++ kernel to bind CPU cores and memory node by @chunyuan-w in #7524
- Improve streaming, log_level, memory report, weight loading, and benchmark script by @merrymercy in #7632
- Add dsv3 router gemm kernel by @Fridge003 in #7627
- chore: upgrade flashinfer v0.2.7 jit by @zhyncs in #7663
- [doc] update lws doc for pd by @whybeyoung in #7318
- Fix: sync prepare_fp8_layer_for_marlin with latest vllm changes by @narutolhy in https://github.com/sgl-project...
Release v0.4.8
Highlights
OpenAI-Compatible Server Refactor
Re-structured the OpenAI-compatible server to support production and enterprise environments. Key improvements include:
- Consistent metrics and logging for better observability and debugging.
- Unified error handling, request validation, and processing logic for improved reliability and maintainability.
- Improved request tracking across sessions and components.
- Fixed bugs in embedding requests and reasoning parsers.
This work was a collaborative effort involving engineers from academic and industry institutions. Special thanks to the Oracle Cloud team and the SGLang team and community — including @slin1237, @CatherineSue, @key4ng, @JustinTong0323, @jhinpan, @yhyang201, @woodx9 and @whybeyoung — for their invaluable contributions.
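Because the refactor keeps the OpenAI-compatible surface, any standard OpenAI client can talk to the server unchanged. A minimal client sketch (assuming a server already running locally on SGLang's default port 30000; the model name is a placeholder for whatever the server was launched with):

```python
# Minimal OpenAI-compatible client sketch; the base URL, port, and model name
# are assumptions for illustration, not part of the release itself.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="default",  # placeholder: use the model the server was launched with
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```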
DeepSeek R1 FP4 on Blackwell GPU
Added support for DeepSeek R1 with FP4 and MTP on NVIDIA Blackwell GPU.
- Integrated FlashInfer NVFP4 MoE, supporting TP, EP, and DP.
- Supported 2-stream shared expert execution.
- Achieved up to 90 TPS per user at isl/osl/bs = 1k/1k/16 on B200.
Further optimization in progress. Special thanks to the FlashInfer, NVIDIA Enterprise Products, Novita AI, DataCrunch, Google Cloud, and SGLang teams — especially @Alcanderian and @pyc96 — for their critical contributions.
Breaking Change: OpenAI-Compatible API Module Moved
The sglang/srt/openai_api directory has been removed and replaced with sglang/srt/entrypoints/openai.
Update your imports to the new module path. For example:
- from sglang.srt.openai_api.protocol import Tool
+ from sglang.srt.entrypoints.openai.protocol import Tool
What's Changed
- Update README.md by @merrymercy in #7040
- [Docker] Upgrading base image from 24.04 to 24.12 by @Swipe4057 in #7043
- fix 24.12 docker by @zhyncs in #7045
- Minor cleanup of fa3 backend by @merrymercy in #6999
- Fix eagle on AMD by @merrymercy in #7051
- Clean up server_args.py by @merrymercy in #7037
- Minor style fix in cuda_graph_runner.py by @merrymercy in #7053
- [WA] fix output data is nan in CI test "test_moe_eval_accuracy_large.py" by @kkHuang-amd in #7021
- [fix] libmlx5.so already in base image by @HanHan009527 in #7060
- Fix test_lora.py CI by @Fridge003 in #7061
- Tiny fix cutlass_mla_get_workspace_size stub incorrect signature by @fzyzcjy in #7057
- Add sanity checks when a test file is not added to CI by @fzyzcjy in #6947
- Revert "Add sanity checks when a test file is not added to CI (#6947)" by @zhyncs in #7063
- Fix missing tool call id if tool call index >0 in streaming tool call output. by @Xu-Wenqing in #7049
- chore: update dev docker by @zhyncs in #7064
- Open AI API hidden states by @kyle-pena-kuzco in #6716
- fix arm sgl-kernel link issue by @zhyncs in #7066
- [Feature] Add Logit Bias by @b8zhong in #6579
- Improve perf tuning docs by @merrymercy in #7071
- Frontend language separate reasoning support by @binarycrayon in #6031
- Do not run frontend_reasoning.ipynb to reduce the CI load by @merrymercy in #7073
- Simplify the heuristics for setting --mem-fraction-static by @merrymercy in #7054
- update doc by @Ximingwang-09 in #7046
- Clean up docs for server args and sampling parameters (generated by grok) by @merrymercy in #7076
- Fix GGuf and add back test_gguf.py by @Fridge003 in #7067
- vlm: adapt internvl to VisionAttention by @mickqian in #6870
- Fix circular import in test_prefix_chunk_info.py by @Fridge003 in #7097
- Fix misusing the "_is_cuda". by @sogalin in #7091
- Support VILA models by @futrime in #6106
- [FIX]remove redundant code in logits_processor.py by @pc-neo in #7079
- [feat]: Emit fixed-size KV blocks events by @faradawn in #6824
- [Perf] Refactor LoRAManager to eliminate stream syncs and redundant computations by @lifuhuang in #6994
- Fix positional argument by @liquanfeng in #7093
- [sgl-kernel] Add cuda kernel for moe_ep_silu_and_mul by @yuan-luo in #6919
- Improve log status by @hnyls2002 in #7115
- feat: update blackwell setup by @zhyncs in #7119
- Update CODEOWNERS by @merrymercy in #7126
- Add gfx950 support for sgl-kernel. by @sogalin in #7092
- [Fix] Reduce busy polling when scheduler is idle by @p12tic in #6026
- Minor add utility to read expert distribution recorder output by @fzyzcjy in #7134
- Remove unnecessary metadata_expand.max_seq_len_k operations in fa3 to… by @byjiang1996 in #7140
- Minor speedup topk postprocessing by @fzyzcjy in #7058
- filter by num_hidden_layers by @pansicheng in #7056
- Remove 200us slow concat kernel (part 1: kernel) by @fzyzcjy in #7145
- Support new DeepGEMM format in per token group quant by @fzyzcjy in #7146
- chore: bump v0.1.8.post1 by @zhyncs in #7152
- Support new DeepGEMM format in per token group quant (part 2: srt) by @fzyzcjy in #7155
- Fix DeepEP error in some environments by @fzyzcjy in #7154
- Minor speed up block_quant_dequant by @fzyzcjy in #6814
- Tiny add sanity checks for DeepGEMM inputs by @fzyzcjy in #7157
- Remove 200us slow concat kernel (part 2: srt) by @fzyzcjy in #7020
- Re-quantize DeepSeek model weights to support DeepGEMM new input format by @fzyzcjy in #7156
- Minor style change of triton backend by @merrymercy in #7165
- Split the eagle test into two files by @merrymercy in #7170
- Support new DeepGEMM input format in silu_and_mul_masked_post_quant_fwd by @fzyzcjy in #7153
- Refactor DeepGEMM integration by @fzyzcjy in #7150
- Add test for refactored openai server by @jhinpan in #7161
- Improve test cases for eagle infer by @merrymercy in #7173
- Support new DeepGEMM by @fzyzcjy in #7172
- Increase timeout in test/srt/test_disaggregation.py by @merrymercy in #7175
- Add Phi-4-mm to supported VLM supported model list. by @lifuhuang in #7178
- Fix shared experts fusion + weight requant by @fzyzcjy in #7177
- [fix] fix dsv3 weight loader tqdm and simplify shared experts fusion by @Alcanderian in #7181
- [fix] fix cutlass_mla_backend with cuda_graph and add sm_scale for sgl-kernel cutlass_mla by @Alcanderian in #7184
- [PD] Update prefill.py by @ByronHsu in #7190
- Fix a minor bug related to DeepGEMM upgrade by @zhijian-liu in #7191
- chore: bump v0.1.8.post2 by @zhyncs in #7189
- [fix] fix determine_num_fused_shared_experts by @Alcanderian in #7180
- chore: upgrade sgl-kernel v0.1.8.post2 by @Alcanderian in #7186
- Fix NCCL 2.27.3 not in docker image by @fzyzcjy in #7195
- Fix error when disabling new DeepGEMM by @fzyzcjy in #7198
- [PD] Support decode retract and update decode.py by @ByronHsu in #7196
- Move host memory pools into a separate file by @merrymercy in #7200
- Lianmin/simplify memory pool by @merrymercy in #7202
- Fix grammar abort & Minor style fixes by @merrymercy in https://github.com/sg...
Release v0.4.7
Highlights
- The PD disaggregation and large-scale EP functionalities from the blog post have now been fully merged into the latest release.
- The blog has been successfully reproduced by over six industry teams, including the TensorRT LLM team.
- SGLang’s large-scale EP is now actively used by leading organizations such as Cursor, Qwen, Alimama, Alibaba Cloud, iFlytek, and more. It has been deployed and validated at large scale, running on GPU clusters with thousands of devices.
- PD disaggregation and large-scale EP, in addition to supporting DeepSeek V3/R1, now also support Qwen 3 in the latest release.
- Full Blackwell support for DeepSeek V3/R1, Llama 4, and Qwen 3. Further optimizations are underway.
- SGLang's DeepSeek V3/R1 now achieves 190 TPS on a single H200, outperforming other frameworks by over 50%.
We extend our sincere thanks to the following contributors, listed in alphabetical order: Alibaba Cloud, AMD Team, Ant Group, Baseten Team, Cursor Team, Dynamo Team, EAGLE Team, FlashInfer Team, Google Vertex AI Team, iFlytek MaaS Team, Intel Team, LinkedIn Team, Meituan Team, Microsoft Copilot Team, Mooncake Team, NVIDIA Team, Oracle Team, Qwen Team, Voltage Park Team and open source community users. Your support and collaboration are deeply appreciated!
What's Changed
- Update nightly-test.yml by @merrymercy in #5797
- [CI] Improve github summary & enable fa3 for more models by @merrymercy in #5796
- [Docs] update grafana setup guide in production metrics by @PopSoda2002 in #5643
- [Misc] add structure logging, write to file and log tracing for SGL R… by @slin1237 in #5741
- Improve overlap scheduling by @hnyls2002 in #5788
- Add Cutlass MLA attention backend by @trevor-m in #5390
- chore: upgrade sgl-kernel 0.1.0 by @zhyncs in #5690
- Dockerfile.dev pip scikit_build_core by @BBuf in #5807
- Add a doc to fix sgl-kernel build link error in py39 with ccache by @BBuf in #5809
- Turn on overlap scheduler for multimodal models by @merrymercy in #5771
- Tiny refactor DefaultModelLoader.Source by @fzyzcjy in #5482
- [Docs] Replace lists with tables for cleanup and readability in server_arguments by @windsonsea in #5276
- Revert "Tiny refactor DefaultModelLoader.Source" by @merrymercy in #5825
- Feat: add support for thinking mode via chat_template_kwargs.enable_t… by @minleminzui in #5551
- fix: fix the error where the content is None when reasoning and tool … by @minleminzui in #5838
- feat: Add fused moe triton config for qwen3 moe on h100 by @JustinTong0323 in #5833
- fused moe triton tuning script support qwen3 by @BBuf in #5842
- feat: Add fused moe triton config for qwen3bf16 moe on h20 by @yhyang201 in #5839
- [PD] support pd fake transfer for warmup by @whybeyoung in #5726
- [qwen3] qwen3moe_tune_h20 fp8 tp4 by @whybeyoung in #5846
- [Doc] Recover history of server_arguments.md by @Fridge003 in #5851
- feat: Add fused moe triton config for qwen3-30b-fp8 moe on h20 by @GeLee-Q in #5850
- [CI] test chunked prefill more by @merrymercy in #5798
- ROCm: update AITER by @HaiShaw in #5816
- [Feat] QWen-1M context support[1/2]: Update block sparse attention backend utils kernel by @yinfan98 in #5847
- [Fix] Missing bootstrap_port field by @xutianyi1999 in #5823
- feat: update is_fa3_default_architecture by @zhyncs in #5854
- add fused moe config for qwen3moe fp8/bf16 by @yizhang2077 in #5849
- chore: bump v0.4.6.post1 by @zhyncs in #5845
- Support `max_completion_tokens` for OpenAI ChatCompletions by @CatherineSue in #5857
- simplify fused_moe config logging by @BBuf in #5801
- [CI] tune the test order to warmup the server by @merrymercy in #5860
- Cutlass MLA decode - fix dtype error by @trevor-m in #5868
- cutlass 3.9 supported to improve fp8_blockwise_gemm by @BBuf in #5820
- [Feature] support auto chat template by @woodx9 in #4949
- Feat: support cuda graph for LoRA by @Qiaolin-Yu in #4115
- Add qwen3 30b fused moe config by @JustinTong0323 in #5859
- [Fix] Fix a bug for flashmla to run R1 model by @pengcuo in #5875
- Add A800 fused moe config for qwen3 30b by @lambert0312 in #5880
- [Misc] add service discovery for sgl router by @slin1237 in #5865
- [fix]: PyO3 macOS linking and consolidate on tracing for logging by @slin1237 in #5856
- chore: update Dockerfile by @zhyncs in #5894
- [Docs] Update docs for Qwen3 and Qwen3MoE by @adarshxs in #5836
- Tables instead of bulletpoints for sampling doc by @simveit in #5841
- chore: update CODEOWNERS by @zhyncs in #5895
- [FEATURE] Enhance platform compatibility for ARM by @johnnynunez in #5746
- [CI] Add test_function_calling.py to run_suite.py by @CatherineSue in #5896
- Auto set draft model path for MTP by @ispobock in #5793
- [fix] relax mem_fraction_static for h200 by @Alcanderian in #5893
- feat: support pythonic tool call and index in tool call streaming by @CatherineSue in #5725
- [Bugfix]: fix missing queue_time_start for requests from grammar_queue by @CatherineSue in #5696
- Add AMD MI300x Nightly Testing. by @saienduri in #5861
- chore: use torch 2.6 for sgl-kernel build by @zhyncs in #5898
- Fix check_env script by @lambert0312 in #5901
- [PD] Fix Assertion failed: /DeepEP/csrc/kernels/internode.cu:483, condition: ibgda_get_state()->num_rc_per_pe >= num_channels #134 by @whybeyoung in #5830
- Bump Flashinfer to 0.2.5 by @Fridge003 in #5870
- [Fix] Unload lora in HF_Runner if needed by @Qiaolin-Yu in #5899
- Add A800 fused moe config for qwen3 235b by @lambert0312 in #5900
- Add sm_120 for blackwell by @zhjunqin in #5903
- [Feature] add support kimi vl model by @liwenju0 in #5383
- support vlm benchmark profile by @yizhang2077 in #5905
- [fix] kimi-vl test in test_vision_openai_server.py by @Alcanderian in #5910
- [Misc] use parallel build for cmake in sgl-kernel by @yinfan98 in #5919
- [qwen3] support qwen3 ep moe by @laixinn in #5917
- Add TP2 MOE benchmarks for AMD. by @saienduri in #5909
- [Feat] Scale up fa3 kernel to sm8x arch by @yinfan98 in #5912
- chore: bump sgl-kernel 0.1.1 by @zhyncs in #5932
- chore: upgrade sgl-kernel 0.1.1 by @zhyncs in #5933
- Remove unused method `calculate_num_image_tokens` from qwen2_vl.py by @JustinTong0323 in #5783
- [PP] Add pipeline parallelism by @Ying1123 in #5724
- Fix lora batch processing when input lora_path contains None by @Qiaolin-Yu in #5930
- add Thor & Spark by @johnnynunez in #5915
- fix: correct stream response when enable_thinking is set to false by @minleminzui in #5881
- fix: update model runner by @zhyncs in #5934
- chore: bump v0.4.6.post2 by @zhyncs in #5939
- Support XiaomiMiMo/MiMo model inference by @ryang-max in #5921
- [P...
Release v0.4.6
Highlights
- Use FlashAttention3 as the default attention backend for mainstream models (DeepSeek, Qwen, Llama, etc.). #4709 (comment)
- PD disaggregation with mooncake and NIXL transfer backends #4880 #5477 #4655
- DeepSeek performance improvements: turn on DeepGEMM by default and add some kernel fusions. #5580 #5628
- Update torch to 2.6.0. Fix torch.compile cache. #5417 #5213
- Preliminary support for blackwell #5303
Thanks very much to LinkedIn team, Alibaba Cloud, Mooncake team, NVIDIA Team, AMD Team, Pytorch Team, Ant Group, Baseten Team, Oracle Team, Meituan Team, iFlytek MaaS team and the open source community users for their contributions!
We’re thrilled about these advancements and eager to hear your feedback! Join us on our Slack channel at slack.sglang.ai to connect and share your thoughts. Cheers!
Coming Soon
- Large scale expert parallelism + PD disaggregation #4734 #5524
- Pipeline Parallelism #5724
- MLA Cutlass Backend #5390
What's Changed
- [ci] fix llama4 ci error by @BBuf in #5126
- Refactor and Optimize FA3 Code by @hebiao064 in #5090
- Add Llama4 user guide by @ispobock in #5133
- [Misc] Use pytest.mark.skipif in sgl-kernel test by @yinfan98 in #5137
- feat: disable grammar restrictions within reasoning sections by @minleminzui in #4984
- [modelopt] automatically inspect if model is ModelOpt quantized and set quantization method by @yundai424 in #5145
- [AMD] Fix missing per_token_group_quant_fp8 for ROCm by @hubertlu-tw in #5140
- fix multimodal hash feature by @huangtingwei9988 in #5083
- Fix run time error in ROCm platform by @kkHuang-amd in #5147
- [FA3 Feature] Support multi modal Llama-3.2-11B-Vision-Instruct by @zcnrex in #5103
- Add unit test on page_size > 1 and mla and integration test for Flash Attention 3 by @yubofredwang in #4760
- Use public model for FA3 speculative decode testing by @yubofredwang in #5152
- Add dummy grok test to amd CI. by @saienduri in #5115
- fix empty_cache error in pt_weights_iterator by @dangkai4u in #5151
- Fix torch compile errors by @kkHuang-amd in #5158
- Fix loading KV quantization scale; Enable modelopt kv cache by @yundai424 in #4686
- [PD] Fix unclosed prefill connection warning of mini_lb by @ShangmingCai in #5155
- Add optimized native kernels in sgl-kernel by @mingfeima in #5150
- [PD] Simplify mini LB by @ByronHsu in #4911
- Small improvement of native api docs by @simveit in #5139
- [feat&refactor] Enhance multimodal input support with refactor io_struct by @JustinTong0323 in #4938
- Support 2x8xH100 for Llama 4 by @fzyzcjy in #5159
- FP4 weight loading and inference (2/2) by @trevor-m in #3972
- Fix multimodal hashing error by @fzyzcjy in #5174
- Tiny disable model that does not work by @fzyzcjy in #5175
- [Bugfix] Fix index out of bounds in local attention with large sequences by @CatherineSue in #5173
- [Fix] DeepEP Compatibility with Low Latency by @liz-badada in #5068
- docs: remove the use of Downward API for LWS_WORKER_INDEX by @yankay in #5110
- feat: add DeepGEMM build warning by @zhyncs in #5176
- fix: use DeepEPDispatcher on CUDA by @zhyncs in #5180
- [DeepEP] fix: import buffer error by @ch-wan in #5179
- Let `bench_one_batch` support `enable_dp_attention` by @fzyzcjy in #4058
- [Misc] clean up vllm in sgl-kernel test by @yinfan98 in #5189
- Fix ci test "test_eval_fp8_accuracy" failed by @kkHuang-amd in #5185
- Optimize topk operation in llama4 by @fzyzcjy in #5128
- Support Llama4 fp8 inference by @HandH1998 in #5194
- [ci] fix ci test fused_moe op by @BBuf in #5102
- model: support mllama4 by @mickqian in #5144
- Rework grok test. by @saienduri in #5171
- sgl-kernel use cutlass latest version for fp8 blockwise gemm by @yizhang2077 in #5207
- Add H20 dtype fp8_w8a8 fused MoE kernel tuning configs for DeepSeek V3/R1 by @Muuuchen in #5196
- fix: log warning when disable cuda graph by @zhyncs in #5209
- [metrics] Add in queue metrics by @hebiao064 in #4444
- Fix DeepSeek error when using DeepEP mode by @fzyzcjy in #5190
- reduce moe_align_block_size_kernel small batch mode overhead by @BBuf in #5086
- [PD] Support KV transfer with mooncake by @stmatengss in #4880
- [PD] Add get_contiguous_buf_infos interface for MLATokenToKVPool by @stmatengss in #5204
- Update deps for mllama4 by @ispobock in #5215
- Fix deepseek-v3 with torch.compile in PyTorch 2.6. by @zou3519 in #5213
- ROCm sgl-kernel: compatible to later torch by @HaiShaw in #5167
- [Misc] Clean sgl-kernel test by @yinfan98 in #5216
- Update Makefile / build script to avoid installing incompatible torch dependency by @elfiegg in #5245
- Fix torch.compile cacheing by @zou3519 in #5259
- ROCm/AITER CK_MoE: update 2-stage kernels & support both Activations by @HaiShaw in #5228
- Optimize attention in llama4 by @fzyzcjy in #5127
- Optimize GPU memory usage in FlashAttentionBackend's strided indexing by @CatherineSue in #5262
- Support `--enable-llama4-multimodal` by @ch-wan in #5254
- [fix] fix mrope positions not picked up by @mickqian in #5265
- doc: nested loop code for offline engine by @minleminzui in #5244
- fix: examples for token_in_token_out_vlm by @JustinTong0323 in #5193
- Fix a 404 link in send_request.ipynb by @windsonsea in #5280
- fix: enable fp4 compilation on cu128 by @zhyncs in #5286
- feat: add cu128 identifier for sgl-kernel by @zhyncs in #5287
- chore: relax the torch version restriction for sgl-kernel compilation by @zhyncs in #5288
- chore: bump sgl-kernel v0.0.8.post1 by @zhyncs in #5289
- [PD] fix: skip warmup request in disaggregation mode to prevent crash on timeout by @GaoYusong in #5292
- [Docs] Supported Model Docs - Major restructuring by @adarshxs in #5290
- fix: update update_wheel_index for cu128 by @zhyncs in #5300
- [Docs] Remove the older supported docs section by @adarshxs in #5301
- remove moe_align_block_size torch.zeros in small batch/expert mode by @BBuf in #5298
- feat: add blackwell Dockerfile by @zhyncs in #5302
- feat: add blackwell workflow by @zhyncs in #5303
- fix: use fa3 unit test on hopper only by @zhyncs in #5304
- misc: update blackwell Dockerfile by @zhyncs in #5306
- fix: remove cublas_grouped_gemm by @zhyncs in #5307
- fix: update flash attn by @zhyncs in #5308
- fix: use deepgemm only on hopper by @zhyncs in #5310
- [VLM] Adopt fast image processor by default by @mickqian in #5065
- Adjust ci test threshold by @ispobock in #5271
- Blackwell Cutlass MLA kernel by @trevor-m in #5142
- misc: cleanup 3rdparty by @zhyncs in https:/...
Release v0.4.5
Highlights
The SGLang team is excited to announce the release of v0.4.5! This version introduces several significant features, including Llama 4 support, FlashAttention 3 backend, EAGLE3 speculative decoding, DeepEP integration, and disaggregated prefill and decoding.
New Features
- Llama 4 Support: We supported the Llama 4 models with accuracy matching official benchmark numbers, achieving a zero-shot score of 75.2 on the MMLU Pro dataset for the `Llama-4-Scout-17B-16E-Instruct` model and 80.7 for the `Llama-4-Maverick-17B-128E-Instruct` model. #5092
- FlashAttention 3 Backend: Our implementation of the FlashAttention 3 backend delivers significant acceleration for long-context tasks. #4709
- EAGLE3 Speculative Decoding: We’re proud to be the first to support EAGLE3 speculative decoding, offering substantial gains in decoding throughput. Learn more in our documentation and the EAGLE3 paper. #4247
- DeepEP Integration: By incorporating DeepEP, we enhanced performance for MoE inference.
- Disaggregated Prefill and Decoding: We introduced a prototype for disaggregated prefill and decoding, with plans for further optimizations.
Thanks very much to the NVIDIA team, LinkedIn team, EAGLE team, Oracle team, Meituan team, and our incredible open-source community for their invaluable contributions!
Coming Soon
- Disaggregated Prefill and Decoding: #4655
- Llama 4 Optimization: #5118
- EP Enhancement: #4734
- FA3 Enhancement: #4709
We’re thrilled about these advancements and eager to hear your feedback! Join us on our Slack channel at slack.sglang.ai to connect and share your thoughts. Cheers!
What's Changed
- Fix a regression introduced by overlapping KV cache writing by @merrymercy in #4375
- Update ci_install_dependency.sh to use accelerate 1.4.0 by @merrymercy in #4392
- Improve DP attention by @merrymercy in #4390
- Fix auto merge & add back get_flat_data_by_layer by @merrymercy in #4393
- Add some fused elementwise kernels for grok-1 by @merrymercy in #4398
- Fix Llama3.3 tool call support by @CatherineSue in #4320
- Fix the output of hidden states after HTTP requests by @Qiaolin-Yu in #4269
- Add a dummy grok test case by @merrymercy in #4399
- Hot fix for hicache with new page aligned radixtree by @xiezhq-hermann in #4397
- bump v0.4.4.post1 by @zhyncs in #4402
- Update CODEOWNERS by @merrymercy in #4403
- Hierarchical Caching supports MLA by @zeroorhero in #4009
- cleanup deps 1/n by @zhyncs in #4400
- feat(remote_model): support variable remote backend for model loader by @DellCurry in #3964
- [bug] fix duplicate variable MAX_PIXELS in qwen_vl.py by @qibaoyuan in #4419
- [Doc] fix wrong flag in deepseek documentation by @lausannel in #4427
- Add moe topk softmax templated from vllm by @qingquansong in #4302
- bump v0.0.5.post1 by @zhyncs in #4437
- Fix maximum recursion depth triggered on exception exit by @merrymercy in #4438
- use topk_softmax with sgl-kernel by @zhyncs in #4439
- docs: hot fix torch compile cache by @zhaochenyang20 in #4442
- ci: update transformers==4.48.3 by @mickqian in #4451
- Fix test_create_kvindices unit test by @sleepcoo in #4452
- [Fix] Fix errors when using the device except cuda. by @cboss6 in #4455
- docs: Add Llama 3.3 to supported models by @JiangJiaWei1103 in #4453
- Update bench_serving.py by @xu-song in #4454
- bugfix: Update sampling_params.py by @WrRan in #4413
- typos: Update sampling_params.md by @WrRan in #4391
- Auto-detect device if not specified in server arguments. by @vshekhawat-hlab in #4423
- Add support for upcoming QwenMoe by @michaelfeil in #4447
- perf: update fused moe config by @mickqian in #4459
- typos by @WrRan in #4368
- Fix minor style by @merrymercy in #4460
- cleanup deps 2/n by @zhyncs in #4464
- feat: Add FlashMLA submodule by @shuaills in #4449
- [Fix] use `torch.cat` instead of `torch.concat` to prevent entering the `Autograd` backends. by @Alcanderian in #4466
- Fix finish step for pr tests and notebook tests by @merrymercy in #4467
- Remove filter for pr-tests by @merrymercy in #4468
- Add greedy verification kernel by @Ying1123 in #4383
- Release sgl-kernel v0.0.5.post2 by @merrymercy in #4469
- Revert "feat: Add FlashMLA submodule (#4449)" by @zhyncs in #4470
- [Eagle] Remove the greedy branch and some redundant code by @Ying1123 in #4363
- Support FlashMLA backend by @sleepcoo in #4472
- fix custom allreduce performance/accuracy problem by @yizhang2077 in #4477
- 400 on empty input_ids by @yinghai in #4481
- Update CODEOWNERS by @merrymercy in #4484
- Statistical Analysis of the Output Stability of the Deepseek Model by @tanzelin430 in #4202
- model: support gemma-3-it by @mickqian in #4424
- Initialize image processor for skip-tokenizer-init codepath by @yinghai in #4479
- Fix: modelscope env comment by @huiwq1990 in #4474
- Fix: Complete int32 to int64 conversion by @xiezhq-hermann in #4465
- [ROCm] enable moe topk softmax in amd by @yiakwy-xpu-ml-framework-team in #4448
- Feat/support code completion by @woodx9 in #3612
- Add endpoint for file support, purely to speed up processing of input_embeds. by @RinRin-32 in #2797
- Set xgrammar as the default grammar backend by @minleminzui in #4386
- Fix router test by @ByronHsu in #4483
- [Fix] use `torch.inference_mode()` instead of `torch.no_grad()` by @Alcanderian in #4372
- [Feature] Support Deepseek-VL2 by @ccw1996 in #2798
- config: Update fused moe config by @mickqian in #4493
- Support serving DeepSeek-R1-Channel-INT8 with 32 L40S. by @solrex in #4418
- Support Online Quantization for W8A8 by @hebiao064 in #4485
- Tool call with text by @xihuai18 in #4067
- Nicer standalone engine inferface by @yinghai in #4480
- [Fix] Resolve GPU Memory Leak in update_weights_from_tensor by @U-rara in #4446
- [Doc] add doc for quantization w8a8_fp8 or w8a8_int8 by @HandH1998 in #4495
- Fix data parallel + tensor parallel by @merrymercy in #4499
- [ROCm] fix dtype by @yiakwy-xpu-ml-framework-team in #4510
- Remove redundant type conversion by @merrymercy in #4513
- Update readme by @merrymercy in #4517
- [sgl-router] improvement to avoid hang by @yinghai in #4482
- Revert "feat: update grouped_topk to support softmax and sigmoid" by @ispobock in #4505
- bump v0.0.5.post3 by @zhyncs in #4520
- upgrade sgl-kernel 0.0.5.post3 by @zhyncs in https://github.com/sgl-project/sg...
Release v0.4.4
Highlights
The SGLang team is excited to announce the release of v0.4.4. We will keep improving DeepSeek V3/R1 performance. With the combination of FlashInfer, MTP, DeepGEMM, and Torch Compile optimizations on H200, it can achieve nearly 100 tokens/s, which is currently the fastest open-source implementation. Look out for new optimizations coming soon!
Thanks very much to xAI Team, NVIDIA Team, AMD Team, LinkedIn team, Baseten Team, Oracle Team, Meituan Team and the open source community users for their contributions!
Regarding the use of SGLang for DeepSeek R1 inference acceleration: in addition to the users mentioned in the announcement, teams such as Tencent and Ant Group are also running it in production. We are very happy to have received recognition and adoption from these teams!
There will surely be bugs that we'll discover and quickly patch in the coming days, including today :) Let's build and ship. Please feel free to join our Slack channel at https://slack.sglang.ai/. Cheers!
Optimizations
- AMD Performance Leadership: SGLang is now the fastest LLM engine for DeepSeek V3/R1 inference on AMD hardware, as confirmed by AMD's technical blog
- Enhanced FlashInfer MLA Support: Now fully compatible with radix cache, chunked prefill, and MTP optimizations - enable with `--enable-flashinfer-mla`
- Advanced MTP Capabilities: Both Triton and FlashInfer backends now offer comprehensive Multi-Token Prediction support, easily tunable via the bench_speculative script, compatible with radix cache and chunked prefill.
- DeepGEMM Integration: Full integration of DeepGEMM for NVIDIA Hopper architectures - enable with `export SGL_ENABLE_JIT_DEEPGEMM=1` (a combined launch sketch follows this list)
- Pioneering INT8 Quantization: First industry implementation of INT8 support for DeepSeek R1 models
- Other Optimizations:
  - Blackwell architecture Block Scale FP8 GEMM support
  - Support page size greater than 1 #4356
  - Optimized W8A8 FP8 implementation with performance gains across all architectures (sm80, sm89, sm90), featuring 15%+ improvement specifically on sm89
  - Enhanced distributed parallelism capabilities (e.g., two-node configurations with DP 2, TP 16) #4390
- Coming soon
  - Integrate Flash Attention #4385
  - Integrate FlashMLA #4384
  - EAGLE 2 optimization #4383
  - EAGLE 3 day one support #4247
  - Integrate DeepEP #4232
  - Prefill and Decoding Disaggregation
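For reference, here is a minimal launch sketch combining the two toggles called out above (the FlashInfer MLA flag and the DeepGEMM environment variable). The model path and tensor-parallel size are illustrative placeholders, and exact flag availability may vary across versions:

```bash
# Sketch: enable DeepGEMM JIT kernels (Hopper) and the FlashInfer MLA backend.
# Model path and --tp value are placeholders; adjust to your deployment.
export SGL_ENABLE_JIT_DEEPGEMM=1

python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1 \
  --tp 8 \
  --enable-flashinfer-mla \
  --trust-remote-code
```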
What's Changed
- update flashinfer-python by @zhyncs in #3557
- fix doc by @zhyncs in #3558
- Add support for OpenAI API o1 model by @ChuyueSun in #3363
- fix sgl-kernel codestyle by @BBuf in #3563
- docs: update install by @zhyncs in #3581
- Copy config files for MI300X to support in virtualized environments by @yosoyjay in #3505
- ROCm docker: triton update by @HaiShaw in #3584
- [fix] added support for vlm in offline inference by @FrankLeeeee in #3548
- Support NextN (MTP) speculative decoding for DeepSeek-V3/R1 by @ispobock in #3582
- [CI] Improve Docs CI Efficiency by @shuaills in #3587
- doc: emphasize and notify the usage of chat_template by @mickqian in #3589
- fix eagle unit test by @zhyncs in #3591
- fix high qps crash when enable mtp by @zhyncs in #3592
- fix apply_token_bitmask_inplace_cuda by @zhyncs in #3594
- [docs] added favicon to sphinx html by @FrankLeeeee in #3564
- fix lockfile and port_registry file permission error by @Jiadalee in #3598
- feat: Support Qwen 2.5 vl by @mickqian in #3258
- [ROCm] Use `tl.range()` in block GEMM kernels with `num_stages` set by host. by @whchung in #3535
- Update to latest amd image. by @saienduri in #3597
- Benchmark for reasoning models by @simveit in #3532
- Draft of updated doc for sampling params. by @simveit in #3260
- [docs] Update sampling_params.md by @shuaills in #3617
- [docker] added rdma support by @FrankLeeeee in #3619
- Revert "[ROCm] Use `tl.range()` in block GEMM kernels with `num_stage… by @zhyncs in #3632
- add mtp unit test by @zhyncs in #3634
- update unit test by @zhyncs in #3636
- chore: bump v0.4.3.post1 by @zhyncs in #3638
- h800 deepseek r1 config and support multi-gpu block-gemm tuning by @BBuf in #3639
- feat: support flashinfer mla with prefix cache by @zhyncs in #3643
- chore: update flashinfer v0.2.1.post2 by @zhyncs in #3644
- chore: bump v0.4.3.post2 by @zhyncs in #3645
- use transformers 4.48.3 by @zhyncs in #3650
- [ROCm] Add additional block quant GEMM tuning configs for AMD GPUs. by @whchung in #3616
- [ROCm] Optimal MOE Tuning for AMD Radeon Graphics by @BruceXcluding in #3567
- Deploy multi-node inference (LWS method) using sglang in a K8s cluster by @whybeyoung in #3624
- Update amd docker image. by @saienduri in #3654
- [Feature] Apply Cublas Grouped Gemm kernel by @Fridge003 in #3629
- update pr-test by @zhyncs in #3663
- Fix draft decode max batch size by @ispobock in #3676
- fix: remove dependency on latest transformers impl by @mickqian in #3635
- AMD Prefill optimize by @fsx950223 in #3665
- fix: apply cache size limit of attention mask for VisionAttention by @mickqian in #3657
- set NCCL_IB_GID_INDEX=3 for multi node NVIDIA InfiniBand if needed by @zhyncs in #3698
- use warp shuffle style reduce and flashinfer vectorize by @BBuf in #3628
- [Docs] Add SkyPilot DeepSeek example by @Michaelvll in #3706
- [k8s] remove unnecessary hostIPC for security concern by @panpan0000 in #3700
- [moe] optim: reduce memory consumption in fused_moe by @ch-wan in #3692
- [Improve] Fix Multi-User Port Allocation Conflicts by @shuaills in #3601
- Variance measure for reasoning benchmark by @simveit in #3677
- Docs: Fix layout with sub-section by @zhaochenyang20 in #3710
- add control for cutlass fp8 blockwise gemm by @yizhang2077 in #3727
- revert BLOCK and num_warps on HIP by @HaiShaw in #3722
- Optimize triton attention custom mask by @ispobock in #3731
- [Bugfix] Fix scores mask for moe topk by @Chen-XiaoBing in #3705
- [Docs] Modify ep related server args and remove cublas part of deepseek by @Fridge003 in #3732
- [Fix] Fix bugs and refactor codes in lora for better scalability. by @aoshen524 in #3652
- docs: fix 404 link by @trayvonpan in #3588
- [docs] added torch.compile cache to dpsk manual by @FrankLeeeee in #3737
- AMD/ROCm: update AITER repo to ROCm/aiter by @HaiShaw in #3747
- feat: update grouped_topk to support softmax and sigmoid by @zixuanzhang226 in #3680
- feat: Add SageMaker support by @andjsmi in #3740
- Change description of nvidia jetson docs by @shahizat in https://github.com/sgl-proj...