Commit c65512b

amogkam, hongxiayang, tjtanaa, maxdebayser, DarkLight1337 authored
Upgrade to 0.9.1 & support for cached_tokens info in completions request (#12)
* [Bugfix][ROCm] fix the power of 2 exception from triton_unified_attention.py when running llama4 models and unit test fix (vllm-project#18100) Signed-off-by: Hongxia Yang <[email protected]> Signed-off-by: tjtanaa <[email protected]> Co-authored-by: tjtanaa <[email protected]> * Prevent the cross-encoder logic from being applied to classification tasks (vllm-project#18838) Signed-off-by: Max de Bayser <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> * Add ability to use CUDAGraphs with use_inductor=False (vllm-project#17345) Signed-off-by: rzou <[email protected]> * [Bugfix][TPU] fix moe custom kernel import (vllm-project#18853) Signed-off-by: Chengji Yao <[email protected]> * [Doc][Neuron] Update documentation for Neuron (vllm-project#18868) Signed-off-by: Elaine Zhao <[email protected]> * Skip device and quant Pydantic validation to make plugin device work (vllm-project#18843) Signed-off-by: Yikun Jiang <[email protected]> * Fixes a dead link in nightly benchmark readme (vllm-project#18856) Signed-off-by: Brent Salisbury <[email protected]> * [Neuron] Add multi-LoRA support for Neuron. 
(vllm-project#18284) Signed-off-by: Satyajith Chilappagari <[email protected]> * [LoRA] Add LoRA support for InternVL (vllm-project#18842) Signed-off-by: Jee Jee Li <[email protected]> * [Doc] Remove redundant spaces from compatibility_matrix.md (vllm-project#18891) Signed-off-by: windsonsea <[email protected]> * [doc] add CLI doc (vllm-project#18871) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Bugfix] Fix misleading information in the documentation (vllm-project#18845) Signed-off-by: Jee Jee Li <[email protected]> * [Misc] Replace TODO in serving transcription (vllm-project#18895) Signed-off-by: NickLucche <[email protected]> * [Bugfix] Ensure tensors are contiguous during serialisation (vllm-project#18860) Signed-off-by: Lukas Geiger <[email protected]> * [BugFix] Update pydantic to fix error on python 3.10 (vllm-project#18852) Signed-off-by: luka <[email protected]> * Fix an error in dummy weight loading for quantization models (vllm-project#18855) Signed-off-by: Chenyaaang <[email protected]> * [Misc][Tools][Benchmark] Add benchmark_serving supports for llama.cpp. 
(vllm-project#18692) Signed-off-by: Duyi-Wang <[email protected]> * [Doc] Fix codeblocks formatting in LoRA adapters documentation (vllm-project#18907) Signed-off-by: Zerohertz <[email protected]> * [Bugfix] Fix the failing gte embedding test (vllm-project#18720) Signed-off-by: Isotr0py <[email protected]> * [Attention][V1] Toggle for v1 attention backend (vllm-project#18275) Signed-off-by: Gregory Shtrasberg <[email protected]> * [ROCm][V0][Attention] Revert to the previous FA triton kernel (vllm-project#18226) Signed-off-by: Gregory Shtrasberg <[email protected]> * [Deprecation] Disallow pos-args other than `model` when initializing `LLM` (vllm-project#18802) Signed-off-by: DarkLight1337 <[email protected]> * [Misc] Remove duplicate init for self.vllm_config (vllm-project#18896) Signed-off-by: googs1025 <[email protected]> * [V1] Allocate kv_cache with stride order for V1 (vllm-project#18775) Signed-off-by: nicklucche <[email protected]> * [BugFix] Make DP work with connector-delayed new requests (vllm-project#18559) Signed-off-by: Nick Hill <[email protected]> Co-authored-by: Will Eaton <[email protected]> * [P/D] NixlConnector DP fixes (vllm-project#18903) Signed-off-by: Will Eaton <[email protected]> * Use standalone_compile by default in torch >= 2.8.0 (vllm-project#18846) Signed-off-by: rzou <[email protected]> * [TPU] remove transpose ops in moe kernel (vllm-project#18923) Signed-off-by: Chengji Yao <[email protected]> * [Bugfix] Fix PP default fallback behavior for V1 (vllm-project#18915) Signed-off-by: mgoin <[email protected]> * [Misc] Update type annotation for rotary embedding `base` (vllm-project#18914) Signed-off-by: DarkLight1337 <[email protected]> * [TPU][CI/CD] Clean up docker for TPU tests. 
(vllm-project#18926) Signed-off-by: Carol Zheng <[email protected]> * improve the robustness of parsing vlms config in AutoRound (vllm-project#18894) Signed-off-by: wenhuach21 <[email protected]> * [Bugfix] Consistent ascii handling in tool parsers (vllm-project#18883) Signed-off-by: chaunceyjiang <[email protected]> * [Model] Use AutoWeightsLoader for mamba2 (vllm-project#18918) Signed-off-by: iLeGend <[email protected]> * [docs] fix: fix markdown syntax (vllm-project#18927) * [ROCm] Remove unnecessary assertion of max_model_len in ROCM_AITER_MLA attention backend. (vllm-project#18938) Signed-off-by: vllmellm <[email protected]> * [Bugfix] Remove NVFP4 scales assertions to fix load_format=dummy (vllm-project#18861) Signed-off-by: mgoin <[email protected]> * [Deprecation] Remove mean pooling default for `Qwen2EmbeddingModel` (vllm-project#18913) Signed-off-by: DarkLight1337 <[email protected]> * [Misc]Fix benchmarks/README.md for speculative decoding (vllm-project#18897) Signed-off-by: rabi <[email protected]> * [doc] add mkdocs doc (vllm-project#18930) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Model] Use in-place adds in SigLIP (vllm-project#18922) Signed-off-by: Lukas Geiger <[email protected]> * [Bugfix][Failing Test] Fix test_vllm_port.py (vllm-project#18618) Signed-off-by: rabi <[email protected]> * [Misc]Fix typo (vllm-project#18947) * [Bugfix][TPU] Fix tpu model runner testcase failure (vllm-project#18810) Signed-off-by: Carol Zheng <[email protected]> * [CI/Build] remove regex from build dependencies (vllm-project#18945) Signed-off-by: Daniele Trifirò <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> * [Feature] minicpm eagle support (vllm-project#18943) Signed-off-by: huangyuxiang03 <[email protected]> Co-authored-by: huangyuxiang03 <[email protected]> * [doc] show the count for fork and watch (vllm-project#18950) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 
<[email protected]> * [Docs] Update SECURITY.md with link to our security guide (vllm-project#18961) Signed-off-by: Russell Bryant <[email protected]> * Improve "failed to get the hash of the compiled graph" error (vllm-project#18956) Signed-off-by: rzou <[email protected]> * [Perf] API-server scaleout with many-to-many server-engine comms (vllm-project#17546) * Benchmark script for fp8 vs bf16 gemm (vllm-project#17126) Signed-off-by: mgoin <[email protected]> * [VLM] Add PP support and fix GPTQ inference for Ovis models (vllm-project#18958) Signed-off-by: isotr0py <[email protected]> Signed-off-by: Isotr0py <[email protected]> * [Misc] add group_size is -1 in awq quantization (vllm-project#18910) Signed-off-by: rongfu.leng <[email protected]> * Tool parser regex timeout handling (vllm-project#18960) Signed-off-by: Will Eaton <[email protected]> * [Docs] Correct multiprocessing design doc (vllm-project#18964) Signed-off-by: Lukas Geiger <[email protected]> * create util function for batched arange (vllm-project#18937) * [Frontend] Add rerank support to run_batch endpoint (vllm-project#16278) Signed-off-by: Pooya Davoodi <[email protected]> * [Misc] Fix estimated max model len msg (vllm-project#18966) Signed-off-by: Yong Hoon Shin <[email protected]> * [Bugfix]: Fix the incompatibility issue with Structured Outputs when Thinking is disabled (vllm-project#18879) Signed-off-by: chaunceyjiang <[email protected]> * fix security issue of logging llm output (vllm-project#18980) Signed-off-by: Lu Fang <[email protected]> Co-authored-by: Lucia (Lu) Fang <[email protected]> * [Neuron] Add Multi-Modal model support for Neuron (vllm-project#18921) Signed-off-by: Satyajith Chilappagari <[email protected]> Co-authored-by: Ashraf Mahgoub <[email protected]> Co-authored-by: Rohith Nallamaddi <[email protected]> Co-authored-by: FeliciaLuo <[email protected]> Co-authored-by: Elaine Zhao <[email protected]> * [doc] fix the list rendering issue - security.md (vllm-project#18982) 
Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [BugFix] Pydantic part 2 (vllm-project#18911) Signed-off-by: luka <[email protected]> * [FEAT][ROCm] Add AITER grouped topk for DeepSeekV2 (vllm-project#18825) Signed-off-by: vllmellm <[email protected]> * [Bugfix] Fix for issue 17396 (vllm-project#18773) Signed-off-by: Fred Reiss <[email protected]> * [ROCm][Kernel] Add gfx950 support for skinny gemms (vllm-project#18010) Signed-off-by: charlifu <[email protected]> * [P/D] NixlConnector use cache device index for memory registration (vllm-project#18969) Signed-off-by: Piotr Tarasiewicz <[email protected]> * [BugFix] Fix multi-node offline data-parallel (vllm-project#18981) Signed-off-by: Nick Hill <[email protected]> Co-authored-by: Yizhou Liu <[email protected]> * [Misc] add return token strs for tokenize (vllm-project#18941) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Misc][Benchmark] Add support for CustomDataset (vllm-project#18511) * [Bugfix] Fix EAGLE3 broken logits (vllm-project#18909) Signed-off-by: Benjamin Chislett <[email protected]> * [Core] Rework dtype resolution (vllm-project#18751) Signed-off-by: DarkLight1337 <[email protected]> * [LoRA] Support dynamically initialize `packed_modules_mapping` for VLM with arbitrary components (vllm-project#18987) Signed-off-by: isotr0py <[email protected]> Signed-off-by: Isotr0py <[email protected]> * [doc] small fix - mkdocs (vllm-project#18996) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * Let max_num_batched_tokens use human_readable_int for large numbers (vllm-project#18968) Signed-off-by: mgoin <[email protected]> * [BugFix] fix data parallel construct ipv6 url addres (vllm-project#18991) Signed-off-by: rongfu.leng <[email protected]> * [BugFix] Fix incorrect metrics shutdown error log message (vllm-project#18992) Signed-off-by: Nick Hill <[email protected]> * [doc] wrong 
output (vllm-project#19000) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Misc] reuse num_tokens_across_dp of get_dp_padding to avoid unnecessary dp all reduce in set_forward_context (vllm-project#18935) Signed-off-by: Tyler Michael Smith <[email protected]> Co-authored-by: zhuhaoran <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]> * [Bugfix][Nixl] Fix DP Metadata Handshake (vllm-project#19008) Signed-off-by: [email protected] <[email protected]> * [Core] Support inplace model weights loading (vllm-project#18745) Signed-off-by: 22quinn <[email protected]> * [doc] add pytest tips (vllm-project#19010) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Model] enable data parallel for Llama4 vision encoder (vllm-project#18368) Signed-off-by: yzhen <[email protected]> Co-authored-by: yZhen <[email protected]> Co-authored-by: yzhen <[email protected]> * [Frontend] enable custom logging for the uvicorn server (OpenAI API server) (vllm-project#18403) Signed-off-by: François Paupier <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> * [Bugfix][Model] Attempt to fix eagle in V0. 
(vllm-project#18978) Signed-off-by: Gregory Shtrasberg <[email protected]> * add an absolute path for run.sh (vllm-project#18258) Signed-off-by: calvin chen <[email protected]> * [Hardware][TPU] Initial support of model parallelism with single worker using SPMD (vllm-project#18011) Signed-off-by: Siyuan Liu <[email protected]> Co-authored-by: Hossein Sarshar <[email protected]> Co-authored-by: Chengji Yao <[email protected]> * [Doc] Remove duplicate TOCs during MkDocs migration (vllm-project#19021) Signed-off-by: Zerohertz <[email protected]> * [Bugfix][EP+DP] Use pplx-kernel internode instead of intranode (vllm-project#19034) Signed-off-by: Tyler Michael Smith <[email protected]> Signed-off-by: Tyler Michael Smith <[email protected]> * Adding "LoRA Test %N" to AMD production tests (vllm-project#18929) Signed-off-by: Yida Wu <[email protected]> * [CPU][CI] Re-enable the CPU CI tests (vllm-project#19046) Signed-off-by: jiang.li <[email protected]> * [ROCm][Build] Clean up the ROCm build (vllm-project#19040) Signed-off-by: Gregory Shtrasberg <[email protected]> * [V1] Support DP with Ray (vllm-project#18779) * Add tarsier model support (vllm-project#18985) Signed-off-by: 汪志鹏 <[email protected]> * [bugfix] small fix logic issue (vllm-project#18999) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * Reduce logs in CLI scripts and plugin loader (vllm-project#18970) Signed-off-by: mgoin <[email protected]> * [Bugfix] Use cmake 3.26.1 instead of 3.26 to avoid build failure (vllm-project#19019) Signed-off-by: Lu Fang <[email protected]> * [v1][KVCacheManager] Rename BlockHashType to BlockHash (vllm-project#19015) Signed-off-by: Chen Zhang <[email protected]> * Update docker docs with ARM CUDA cross-compile (vllm-project#19037) Signed-off-by: mgoin <[email protected]> * [Doc] Add InternVL LoRA support (vllm-project#19055) Signed-off-by: Jee Jee Li <[email protected]> * [Misc] Update `WeightsMapper` for qwen2-vl/qwen2.5-vl 
(vllm-project#19054) Signed-off-by: Isotr0py <[email protected]> * [Doc] Update V1 user guide for embedding and enc-dec models (vllm-project#19060) Signed-off-by: DarkLight1337 <[email protected]> * [doc] clarify windows support (vllm-project#19088) Signed-off-by: youkaichao <[email protected]> * [CI/Build] Remove V0 LoRA test (vllm-project#19066) Signed-off-by: Jee Jee Li <[email protected]> * Fix underscores in dict keys passed via CLI (vllm-project#19030) Signed-off-by: Harry Mellor <[email protected]> * [Bugfix] disable processor cache (vllm-project#19068) Signed-off-by: raushan <[email protected]> * [Doc] Improve the Pull Request template with key components (vllm-project#19086) Signed-off-by: Lu Fang <[email protected]> * [Misc] Add missing `_Backend` enums (vllm-project#19081) Signed-off-by: nicklucche <[email protected]> * [Misc] fix: add miss best_of param validation (vllm-project#18555) Signed-off-by: googs1025 <[email protected]> * [Misc] Add SPDX-FileCopyrightText (vllm-project#19100) Signed-off-by: simon-mo <[email protected]> * [Doc] Readme standardization (vllm-project#18695) Co-authored-by: Soren Dreano <[email protected]> * [doc] update docker version (vllm-project#19074) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Kernel] DeepEP dispatch-combine kernel integration (vllm-project#18434) Signed-off-by: Varun <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> * [V1] Support cross-layer KV sharing (vllm-project#18212) Signed-off-by: Yong Hoon Shin <[email protected]> * [Perf] Tune `scaled_fp8_quant` by increasing vectorization (vllm-project#18844) Signed-off-by: mgoin <[email protected]> * Fix interaction between `Optional` and `Annotated` in CLI typing (vllm-project#19093) Signed-off-by: Harry Mellor <[email protected]> Co-authored-by: Yikun Jiang <[email protected]> * [v1] Re-init input batch for multiple kv cache groups (vllm-project#18654) Signed-off-by: Chen Zhang 
<[email protected]> * [V1][Spec Decode][Ngram] 1.35x gain -> 1.95x gain on InstructCoder with prompt fix (vllm-project#18971) * [Bugfix] get_num_blocks_to_allocate with null_block (vllm-project#19031) Signed-off-by: Chen Zhang <[email protected]> * [Bugfix]: Fix the incompatibility issue with tool_choice 'required' when Thinking is enabled (vllm-project#19075) Signed-off-by: chaunceyjiang <[email protected]> * [Bugfix][P/D] Fix Prefix Cache Bug (vllm-project#18411) Signed-off-by: nicklucche <[email protected]> Co-authored-by: Robert Shaw <[email protected]> * [Bugfix] Max concurrency estimation and check_enough_kv_cache_memory for models with sliding window layers (vllm-project#19029) Signed-off-by: Chen Zhang <[email protected]> * feat: add data parallel rank to KVEventBatch (vllm-project#18925) * [Misc] Fix path and python alias errors in disagg_prefill exmaples (vllm-project#18919) * [Docs] Add developer doc about CI failures (vllm-project#18782) Signed-off-by: Russell Bryant <[email protected]> Co-authored-by: Mark McLoughlin <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> * [CPU] V1 support for the CPU backend (vllm-project#16441) * [Core] Cast multimodal input in hf processor (vllm-project#18862) Signed-off-by: Lukas Geiger <[email protected]> * [KERNEL] Sampler. CUDA kernel for applying repetition penalty (vllm-project#18437) * [Cleanup][v1]:remote guided-decoding-backend for example (vllm-project#19059) Signed-off-by: calvin chen <[email protected]> * [NVIDIA] Add Cutlass MLA backend (vllm-project#17625) * [Bugfix] Fix FA3 full cuda graph correctness (vllm-project#19106) Signed-off-by: Woosuk Kwon <[email protected]> * Fix vllm-project#19130 (vllm-project#19132) Signed-off-by: 汪志鹏 <[email protected]> * [TPU] Skip hanging tests (vllm-project#19115) Signed-off-by: Siyuan Liu <[email protected]> * Fix ValueError: Missing value for tag key(s): model_name,engine. 
(vllm-project#19113) Signed-off-by: Seiji Eicher <[email protected]> * [Misc] Add packages for benchmark as extra dependency (vllm-project#19089) Signed-off-by: Isotr0py <[email protected]> * Improve the output precision of embedding models (vllm-project#19092) * [CI/Build][Bugfix] Ensure compatibility with transformers 4.52 (vllm-project#18678) Signed-off-by: DarkLight1337 <[email protected]> * Add DeepSeek-R1-0528 function call chat template (vllm-project#18874) Signed-off-by: 许文卿 <[email protected]> * Sm100 blockwise fp8 swap ab (vllm-project#18564) * [Doc] Update V1 Guide for embedding models (vllm-project#19141) Signed-off-by: DarkLight1337 <[email protected]> * Allow AsyncLLMEngine.generate to target a specific DP rank (vllm-project#19102) Signed-off-by: Jon Swenson <[email protected]> * [Bugfix][EP+DP] Fix internode check (vllm-project#19112) Signed-off-by: Tyler Michael Smith <[email protected]> * [Perf] Tunings for SM100 FP8 CUTLASS kernel (vllm-project#18778) Signed-off-by: mgoin <[email protected]> * [TPU] Update dynamo dump file name in compilation test (vllm-project#19108) Signed-off-by: Siyuan Liu <[email protected]> * [Bugfix] fix v1 cpu worker fails on macOS (vllm-project#19121) * [Kernel] Integrate batched/masked deepgemm kernel (vllm-project#19111) Signed-off-by: Varun <[email protected]> Co-authored-by: Varun <[email protected]> * [Misc] refactor: simplify EngineCoreClient.make_async_mp_client in AsyncLLM (vllm-project#18817) Signed-off-by: googs1025 <[email protected]> * [P/D] Heterogeneous TP (vllm-project#18833) Signed-off-by: nicklucche <[email protected]> * [doc] small fix (vllm-project#19167) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Bugfix][Nixl] Fix full prefix cache hit bug (vllm-project#18632) Signed-off-by: [email protected] <[email protected]> Signed-off-by: Nick Hill <[email protected]> Co-authored-by: Nick Hill <[email protected]> * [Bugfix] Fix port handling in make_zmq_path 
(vllm-project#19117) * [Torch Nightly]add missing dependency (vllm-project#18770) Signed-off-by: Yang Wang <[email protected]> * Handle non-serializable objects when dumping benchmark results (vllm-project#19114) * [BugFix][Minor] Fix full cuda graph bug when max_num_seqs < 512 (vllm-project#19171) Signed-off-by: Woosuk Kwon <[email protected]> * [Bugfix]: Fix the incompatibility issue with stream when Thinking is disabled (vllm-project#19135) Signed-off-by: chaunceyjiang <[email protected]> * [Build] Annotate wheel and container path for release workflow (vllm-project#19162) Signed-off-by: simon-mo <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [Misc] Remove unnecessary fallback to prefill-decode attention (vllm-project#19138) Signed-off-by: vllmellm <[email protected]> * [Misc] Do not override NCCL_CUMEM_ENABLE if set explicitly (vllm-project#19105) Signed-off-by: 22quinn <[email protected]> * [Frontend] improve vllm run-batch --help display (vllm-project#19187) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Bugfix] properly catch PIL-related errors for vision models when incorrect data urls are provided (vllm-project#19202) Signed-off-by: Guillaume Calmettes <[email protected]> * [mistral_common] Add v11 tokenizer (vllm-project#19193) Signed-off-by: Patrick von Platen <[email protected]> * Add H20-3e fused MoE kernel tuning configs for DeepSeek-R1/V3 (vllm-project#19205) * [Hardware][NVIDIA] FP4 MoE kernel optimization (vllm-project#19110) Signed-off-by: Chiyue Wei <[email protected]> Co-authored-by: Chiyue Wei <[email protected]> * [MISC][Bugfix] Use less CPU when message queue has been empty for some time (vllm-project#16226) Signed-off-by: Povilas Kanapickas <[email protected]> * [P/D][NixlConnector] Enable FlashInfer backend (vllm-project#19090) * [Quantization] Skip Fp4 Test for `compressed-tensors` (vllm-project#19217) * [V1] Use 
FlashInfer by default on Blackwell GPUs (vllm-project#19118) * [Model] NemotronH support (vllm-project#18863) Signed-off-by: Luis Vega <[email protected]> Co-authored-by: Luis Vega <[email protected]> * Fix AOPerModuleConfig name changes (vllm-project#18869) Signed-off-by: Jerry Zhang <[email protected]> * [Bugfix] Fix EAGLE vocab embedding construction for Llama 70B (vllm-project#19033) Signed-off-by: Benjamin Chislett <[email protected]> * [v1] Hybrid Memory Allocator (vllm-project#17996) Signed-off-by: Chen Zhang <[email protected]> * [TPU] update torch_xla pin (vllm-project#19231) Signed-off-by: Chengji Yao <[email protected]> * Support allowed_token_ids in ChatCompletionRequest (vllm-project#19143) Signed-off-by: Xu Song <[email protected]> * [Chore] update CODEOWNERS (vllm-project#19247) Signed-off-by: Aaron Pham <[email protected]> * [v1][P/D] Fix a edge case in kv cache schedule (vllm-project#19182) Co-authored-by: jinghui <[email protected]> * [TPU] fix kv cache dtype in model runner (vllm-project#19244) Signed-off-by: Chengji Yao <[email protected]> * [Quantization] Bump compressed-tensors version; update NVFP4A16 test model (vllm-project#19224) Signed-off-by: Dipika Sikka <[email protected]> * [Docs] Improve V1 KVConnector interface documentation (vllm-project#19172) Signed-off-by: Nick Hill <[email protected]> * Fix CompilationConfig repr (vllm-project#19091) Signed-off-by: rzou <[email protected]> * Unit Test for run_dp_sharded_vision_model (vllm-project#19103) Signed-off-by: Siqi Yan <[email protected]> Co-authored-by: Siqi Yan <[email protected]> * [Model] Optimize nemotron_h implementation (vllm-project#19249) Signed-off-by: Jee Jee Li <[email protected]> * [Core] Raise when non-multi-instance DP clients target a DP rank (vllm-project#19227) Signed-off-by: Jon Swenson <[email protected]> * improve logits bias (vllm-project#19041) * Fixed ppc build when it runs on non-RHEL based linux distros (vllm-project#18422) Signed-off-by: Nishidha Panpaliya 
<[email protected]> Signed-off-by: Md. Shafi Hussain <[email protected]> Signed-off-by: npanpaliya <[email protected]> Co-authored-by: Md. Shafi Hussain <[email protected]> * [BugFix] Fix MultiConnector test after HMA changes (vllm-project#19291) Signed-off-by: Nick Hill <[email protected]> * [Bugfix][Core] Update cancellation logic in `generate()` to handle Generator exits (vllm-project#19225) Co-authored-by: Adolfo Victoria <[email protected]> * [Core] Fix abrupt request abort (vllm-project#18485) Signed-off-by: nicklucche <[email protected]> Signed-off-by: Nick Hill <[email protected]> Co-authored-by: Nick Hill <[email protected]> * [BugFix] Fix tpu_model_runner block_id concatenation (vllm-project#19228) Signed-off-by: Nick Hill <[email protected]> * [Misc][Tools][Benchmark] Fix and improve auto tune script (vllm-project#19163) Signed-off-by: Chenyaaang <[email protected]> * [Build][ROCm] Update Dockerfile.rocm (vllm-project#19296) Signed-off-by: Alexei V. Ivanov <[email protected]> * [Easy][Test] Simplify test_function_tool_use with multiple parametrizes (vllm-project#19269) Signed-off-by: Lu Fang <[email protected]> * [Kernel] Integrate CUTLASS MoE kernel with PPLX (vllm-project#18762) Signed-off-by: ElizaWszola <[email protected]> Signed-off-by: Tyler Michael Smith <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]> * [TPU][Test] Add script to run benchmark on TPU for buildkite (vllm-project#19039) Signed-off-by: Qiliang Cui <[email protected]> * [CI][PowerPC] Use a more appropriate way to select testcase in tests/models/language/pooling/test_embedding.py (vllm-project#19253) Signed-off-by: Aaruni Aggarwal <[email protected]> * Add FlexAttention to V1 (vllm-project#16078) Signed-off-by: drisspg <[email protected]> * [Misc] refactor context extension (vllm-project#19246) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [CI/Build] Improve Llama GGUF test robustness (vllm-project#19287) 
Signed-off-by: Isotr0py <[email protected]> * [Nit][Benchmark]Fix example in benchmark_serving_structured_output.py (vllm-project#19311) Signed-off-by: Lifan Shen <[email protected]> * [AMD] Update compatible packaging version (vllm-project#19309) Signed-off-by: pramkuma <[email protected]> * [BugFix][V1] Fix memory profiling bug (vllm-project#18974) Signed-off-by: luka <[email protected]> * [Bugfix]: Fix TypeError: 'float' object cannot be interpreted as an integer (vllm-project#19283) Signed-off-by: chaunceyjiang <[email protected]> * [Bugfix] Re-enable use_cudagraph in vLLM v1 (vllm-project#19299) Signed-off-by: Richard Zou <[email protected]> * [Misc] Change tests/compile to use VLLM_V1 by default (vllm-project#19302) Signed-off-by: rzou <[email protected]> * Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B (vllm-project#19315) Signed-off-by: Xu Wenqing <[email protected]> * [Hardware][POWER] Add IBM POWER11 Support to CPU Extension Detection (vllm-project#19082) Signed-off-by: Akash Kaothalkar <[email protected]> Co-authored-by: Akash Kaothalkar <[email protected]> * [Quantization] Add compressed-tensors NVFP4 support (vllm-project#18312) * [Multi Modal] Add an env var for message queue max chunk bytes (vllm-project#19242) Signed-off-by: yZhen <[email protected]> Co-authored-by: yZhen <[email protected]> * [Bugfix] model_max_length should consider max_model_len in tokenizer_config (vllm-project#19201) * [Deprecation] Remove `inputs` arg fallback in Engine classes (vllm-project#18799) Signed-off-by: DarkLight1337 <[email protected]> * [Misc] Add documentation update reminder to PR template (vllm-project#19289) Signed-off-by: Isotr0py <[email protected]> * [Frontend] Remove unreachable code from llm.py (vllm-project#19288) Signed-off-by: KsuParkhamchuk <[email protected]> * [Misc] Cleanup compilation tests (vllm-project#19343) Signed-off-by: rzou <[email protected]> * [doc] improve ci doc (vllm-project#19307) Signed-off-by: reidliu41 <[email 
protected]> Co-authored-by: reidliu41 <[email protected]> * [Doc] Fix description in the Automatic Prefix Caching design doc (vllm-project#19333) Signed-off-by: cr7258 <[email protected]> * [CI/Build] Fix LoRA test (vllm-project#19350) Signed-off-by: Jee Jee Li <[email protected]> * [Fix] Allow kernel compilation for CUDA capability 8.7 (vllm-project#19328) Signed-off-by: Conroy Cheers <[email protected]> * [CI] Introduce rules for llama auto-label (vllm-project#19323) Signed-off-by: Lu Fang <[email protected]> * [Docs] Fix a bullet list in usage/security.md (vllm-project#19358) Signed-off-by: windsonsea <[email protected]> * [full_graph] Fix query_start_loc padding (vllm-project#19321) Signed-off-by: Yinghai Lu <[email protected]> * [v1] Add fp32 support to v1 engine through flex attn (vllm-project#19319) Signed-off-by: Isotr0py <[email protected]> Signed-off-by: Isotr0py <[email protected]> * [Misc] Fixes and Optimizations for DeepEP + DeepGEMM combination. (vllm-project#19298) Signed-off-by: Varun <[email protected]> Co-authored-by: Varun <[email protected]> * [Bugfix][Core] Prevent token lengths exceeding `max_model_len` in V0 (vllm-project#19348) Signed-off-by: 22quinn <[email protected]> * [Quantization] Bump compressed-tensors version (vllm-project#19295) Signed-off-by: Kyle Sayers <[email protected]> * [Frontend] Make TIMEOUT_KEEP_ALIVE configurable through env var (vllm-project#18472) Signed-off-by: liusiqian <[email protected]> * [TPU]Fix KV cache sharing tests (vllm-project#19371) * [HOT-FIX] Add `kv_sharing_target_layer_name` argument to cutlass_mla backend (vllm-project#19374) Signed-off-by: Pavani Majety <[email protected]> * [Misc] Fix a config typo in disable_hybrid_kv_cache_manager configuration (vllm-project#19383) Signed-off-by: Siyuan Liu <[email protected]> * [V1] Reuse V0's memory_profiling util for gpu worker memory profiling (vllm-project#19312) Signed-off-by: Ye (Charlotte) Qi <[email protected]> * [Bugfix] Fix benchmark_moe.py 
(vllm-project#19016) Signed-off-by: Tianyu Guo <[email protected]> * Use xla flag to improve the quantized model performance (vllm-project#19303) Signed-off-by: Xiongfei Wei <[email protected]> * Fix docs/mkdocs/hooks/remove_announcement.py (vllm-project#19382) * [Frontend] Add tqdm_leave_pbar to control progress bar visibility (vllm-project#19357) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Core] Use tuple for kv cache group block ids (vllm-project#19175) Signed-off-by: Nick Hill <[email protected]> * [Bugfix] Fix modelscope token passed in (vllm-project#19389) Signed-off-by: wangli <[email protected]> Signed-off-by: Jee Jee Li <[email protected]> Co-authored-by: Jee Jee Li <[email protected]> * [Core] Batch multi modal input using pinned memory (vllm-project#19169) Signed-off-by: Lukas Geiger <[email protected]> * Add security warning to bug report template (vllm-project#19365) Signed-off-by: Russell Bryant <[email protected]> Co-authored-by: Copilot <[email protected]> * [Misc] refactor neuron_multimodal and profiling (vllm-project#19397) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * Add clear documentation around the impact of debugging flag (vllm-project#19369) Signed-off-by: Anna Pendleton <[email protected]> * Automatically bind CPU OMP Threads of a rank to CPU ids of a NUMA node. 
(vllm-project#17930) Signed-off-by: Tsai, Louie <[email protected]> Co-authored-by: Li, Jiang <[email protected]> * Revert "[v1] Add fp32 support to v1 engine through flex attn" (vllm-project#19404) * [BugFix][FlashInfer] Fix attention backend interface mismatch with unexpected keyword `use_irope` (vllm-project#19134) Signed-off-by: Yunqiu Guo <[email protected]> * [BugFix][CPU] Fix CPU CI by ignore collecting test_pixtral (vllm-project#19411) Signed-off-by: jiang.li <[email protected]> * Simplify ep kernels installation (vllm-project#19412) Signed-off-by: youkaichao <[email protected]> * [Misc] Slight improvement of the BNB (vllm-project#19418) Signed-off-by: Jee Jee Li <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * fix Signed-off-by: Amog Kamsetty <[email protected]> * fix Signed-off-by: Amog Kamsetty <[email protected]> * update config Signed-off-by: Amog Kamsetty <[email protected]> * add Signed-off-by: Amog Kamsetty <[email protected]> --------- Signed-off-by: Hongxia Yang <[email protected]> Signed-off-by: tjtanaa <[email protected]> Signed-off-by: Max de Bayser <[email protected]> Signed-off-by: rzou <[email protected]> Signed-off-by: Chengji Yao <[email protected]> Signed-off-by: Elaine Zhao <[email protected]> Signed-off-by: Yikun Jiang <[email protected]> Signed-off-by: Brent Salisbury <[email protected]> Signed-off-by: Satyajith Chilappagari <[email protected]> Signed-off-by: Jee Jee Li <[email protected]> Signed-off-by: windsonsea <[email protected]> Signed-off-by: reidliu41 <[email protected]> Signed-off-by: NickLucche <[email protected]> Signed-off-by: Lukas Geiger <[email protected]> Signed-off-by: luka <[email protected]> Signed-off-by: Chenyaaang <[email protected]> Signed-off-by: Duyi-Wang <[email protected]> Signed-off-by: Zerohertz <[email protected]> Signed-off-by: Isotr0py <[email protected]> Signed-off-by: Gregory 
Shtrasberg <[email protected]> Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: googs1025 <[email protected]> Signed-off-by: nicklucche <[email protected]> Signed-off-by: Nick Hill <[email protected]> Signed-off-by: Will Eaton <[email protected]> Signed-off-by: mgoin <[email protected]> Signed-off-by: Carol Zheng <[email protected]> Signed-off-by: wenhuach21 <[email protected]> Signed-off-by: chaunceyjiang <[email protected]> Signed-off-by: iLeGend <[email protected]> Signed-off-by: vllmellm <[email protected]> Signed-off-by: rabi <[email protected]> Signed-off-by: Daniele Trifirò <[email protected]> Signed-off-by: huangyuxiang03 <[email protected]> Signed-off-by: Russell Bryant <[email protected]> Signed-off-by: isotr0py <[email protected]> Signed-off-by: rongfu.leng <[email protected]> Signed-off-by: Pooya Davoodi <[email protected]> Signed-off-by: Yong Hoon Shin <[email protected]> Signed-off-by: Lu Fang <[email protected]> Signed-off-by: Fred Reiss <[email protected]> Signed-off-by: charlifu <[email protected]> Signed-off-by: Piotr Tarasiewicz <[email protected]> Signed-off-by: Benjamin Chislett <[email protected]> Signed-off-by: Tyler Michael Smith <[email protected]> Signed-off-by: [email protected] <[email protected]> Signed-off-by: 22quinn <[email protected]> Signed-off-by: yzhen <[email protected]> Signed-off-by: François Paupier <[email protected]> Signed-off-by: calvin chen <[email protected]> Signed-off-by: Siyuan Liu <[email protected]> Signed-off-by: Tyler Michael Smith <[email protected]> Signed-off-by: Yida Wu <[email protected]> Signed-off-by: jiang.li <[email protected]> Signed-off-by: 汪志鹏 <[email protected]> Signed-off-by: Lu Fang <[email protected]> Signed-off-by: Chen Zhang <[email protected]> Signed-off-by: mgoin <[email protected]> Signed-off-by: youkaichao <[email protected]> Signed-off-by: Harry Mellor <[email protected]> Signed-off-by: raushan <[email protected]> Signed-off-by: simon-mo <[email protected]> Signed-off-by: 
Varun <[email protected]> Signed-off-by: Woosuk Kwon <[email protected]> Signed-off-by: Seiji Eicher <[email protected]> Signed-off-by: 许文卿 <[email protected]> Signed-off-by: Jon Swenson <[email protected]> Signed-off-by: Yang Wang <[email protected]> Signed-off-by: Guillaume Calmettes <[email protected]> Signed-off-by: Patrick von Platen <[email protected]> Signed-off-by: Chiyue Wei <[email protected]> Signed-off-by: Povilas Kanapickas <[email protected]> Signed-off-by: Luis Vega <[email protected]> Signed-off-by: Jerry Zhang <[email protected]> Signed-off-by: Xu Song <[email protected]> Signed-off-by: Aaron Pham <[email protected]> Signed-off-by: Dipika Sikka <[email protected]> Signed-off-by: Siqi Yan <[email protected]> Signed-off-by: Nishidha Panpaliya <[email protected]> Signed-off-by: Md. Shafi Hussain <[email protected]> Signed-off-by: npanpaliya <[email protected]> Signed-off-by: Alexei V. Ivanov <[email protected]> Signed-off-by: ElizaWszola <[email protected]> Signed-off-by: Qiliang Cui <[email protected]> Signed-off-by: Aaruni Aggarwal <[email protected]> Signed-off-by: drisspg <[email protected]> Signed-off-by: Lifan Shen <[email protected]> Signed-off-by: pramkuma <[email protected]> Signed-off-by: Richard Zou <[email protected]> Signed-off-by: Xu Wenqing <[email protected]> Signed-off-by: Akash Kaothalkar <[email protected]> Signed-off-by: yZhen <[email protected]> Signed-off-by: KsuParkhamchuk <[email protected]> Signed-off-by: cr7258 <[email protected]> Signed-off-by: Conroy Cheers <[email protected]> Signed-off-by: Yinghai Lu <[email protected]> Signed-off-by: Isotr0py <[email protected]> Signed-off-by: Kyle Sayers <[email protected]> Signed-off-by: liusiqian <[email protected]> Signed-off-by: Pavani Majety <[email protected]> Signed-off-by: Ye (Charlotte) Qi <[email protected]> Signed-off-by: Tianyu Guo <[email protected]> Signed-off-by: Xiongfei Wei <[email protected]> Signed-off-by: wangli <[email protected]> Signed-off-by: Anna Pendleton 
<[email protected]> Signed-off-by: Tsai, Louie <[email protected]> Signed-off-by: Yunqiu Guo <[email protected]> Signed-off-by: Amog Kamsetty <[email protected]> Co-authored-by: Hongxia Yang <[email protected]> Co-authored-by: tjtanaa <[email protected]> Co-authored-by: Maximilien de Bayser <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Richard Zou <[email protected]> Co-authored-by: Chengji Yao <[email protected]> Co-authored-by: aws-elaineyz <[email protected]> Co-authored-by: Yikun Jiang <[email protected]> Co-authored-by: Brent Salisbury <[email protected]> Co-authored-by: Satyajith Chilappagari <[email protected]> Co-authored-by: Jee Jee Li <[email protected]> Co-authored-by: Michael Yao <[email protected]> Co-authored-by: Reid <[email protected]> Co-authored-by: reidliu41 <[email protected]> Co-authored-by: Nicolò Lucchesi <[email protected]> Co-authored-by: Lukas Geiger <[email protected]> Co-authored-by: Luka Govedič <[email protected]> Co-authored-by: Chenyaaang <[email protected]> Co-authored-by: Duyi-Wang <[email protected]> Co-authored-by: Hyogeun Oh (오효근) <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: Gregory Shtrasberg <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: CYJiang <[email protected]> Co-authored-by: Nick Hill <[email protected]> Co-authored-by: Will Eaton <[email protected]> Co-authored-by: Will Eaton <[email protected]> Co-authored-by: Michael Goin <[email protected]> Co-authored-by: Carol Zheng <[email protected]> Co-authored-by: Wenhua Cheng <[email protected]> Co-authored-by: Chauncey <[email protected]> Co-authored-by: iLeGend <[email protected]> Co-authored-by: H <[email protected]> Co-authored-by: vllmellm <[email protected]> Co-authored-by: Rabi Mishra <[email protected]> Co-authored-by: Always-Naive <[email protected]> Co-authored-by: Daniele <[email protected]> Co-authored-by: Shawn Huang <[email protected]> Co-authored-by: 
huangyuxiang03 <[email protected]> Co-authored-by: Russell Bryant <[email protected]> Co-authored-by: rongfu.leng <[email protected]> Co-authored-by: Yu Guo <[email protected]> Co-authored-by: Pooya Davoodi <[email protected]> Co-authored-by: Yong Hoon Shin <[email protected]> Co-authored-by: Lucia Fang <[email protected]> Co-authored-by: Lucia (Lu) Fang <[email protected]> Co-authored-by: Ashraf Mahgoub <[email protected]> Co-authored-by: Rohith Nallamaddi <[email protected]> Co-authored-by: FeliciaLuo <[email protected]> Co-authored-by: Fred Reiss <[email protected]> Co-authored-by: Charlie Fu <[email protected]> Co-authored-by: ptarasiewiczNV <[email protected]> Co-authored-by: Yizhou Liu <[email protected]> Co-authored-by: Ekagra Ranjan <[email protected]> Co-authored-by: Benjamin Chislett <[email protected]> Co-authored-by: zhrrr <[email protected]> Co-authored-by: zhuhaoran <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]> Co-authored-by: Robert Shaw <[email protected]> Co-authored-by: 22quinn <[email protected]> Co-authored-by: jennyyyyzhen <[email protected]> Co-authored-by: yZhen <[email protected]> Co-authored-by: yzhen <[email protected]> Co-authored-by: Frαnçois <[email protected]> Co-authored-by: Calvin Chen <[email protected]> Co-authored-by: Siyuan Liu <[email protected]> Co-authored-by: Hossein Sarshar <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]> Co-authored-by: Concurrensee <[email protected]> Co-authored-by: Li, Jiang <[email protected]> Co-authored-by: Rui Qiao <[email protected]> Co-authored-by: 汪志鹏 <[email protected]> Co-authored-by: Lu Fang <[email protected]> Co-authored-by: Chen Zhang <[email protected]> Co-authored-by: youkaichao <[email protected]> Co-authored-by: Harry Mellor <[email protected]> Co-authored-by: Raushan Turganbay <[email protected]> Co-authored-by: Simon Mo <[email protected]> Co-authored-by: SorenDreano <[email protected]> Co-authored-by: Soren Dreano <[email 
protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Yikun Jiang <[email protected]> Co-authored-by: Yan Ru Pei <[email protected]> Co-authored-by: Jiaxin Shan <[email protected]> Co-authored-by: Mark McLoughlin <[email protected]> Co-authored-by: Vadim Gimpelson <[email protected]> Co-authored-by: Kaixi Hou <[email protected]> Co-authored-by: Woosuk Kwon <[email protected]> Co-authored-by: Seiji Eicher <[email protected]> Co-authored-by: wang.yuqi <[email protected]> Co-authored-by: Xu Wenqing <[email protected]> Co-authored-by: Lain <[email protected]> Co-authored-by: jmswen <[email protected]> Co-authored-by: Kebe <[email protected]> Co-authored-by: Yang Wang <[email protected]> Co-authored-by: Huy Do <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Guillaume Calmettes <[email protected]> Co-authored-by: Patrick von Platen <[email protected]> Co-authored-by: Chiyue Wei <[email protected]> Co-authored-by: Chiyue Wei <[email protected]> Co-authored-by: Povilas Kanapickas <[email protected]> Co-authored-by: Dipika Sikka <[email protected]> Co-authored-by: Luis Vega <[email protected]> Co-authored-by: Luis Vega <[email protected]> Co-authored-by: Jerry Zhang <[email protected]> Co-authored-by: Xu Song <[email protected]> Co-authored-by: Aaron Pham <[email protected]> Co-authored-by: Jinghui Zhang <[email protected]> Co-authored-by: jinghui <[email protected]> Co-authored-by: Siqi Yan <[email protected]> Co-authored-by: Siqi Yan <[email protected]> Co-authored-by: Nishidha <[email protected]> Co-authored-by: Md. 
Shafi Hussain <[email protected]> Co-authored-by: Adolfo Victoria <[email protected]> Co-authored-by: Adolfo Victoria <[email protected]> Co-authored-by: Alexei-V-Ivanov-AMD <[email protected]> Co-authored-by: ElizaWszola <[email protected]> Co-authored-by: QiliangCui <[email protected]> Co-authored-by: Aaruni Aggarwal <[email protected]> Co-authored-by: Driss Guessous <[email protected]> Co-authored-by: Lifans <[email protected]> Co-authored-by: pramenku <[email protected]> Co-authored-by: Akash kaothalkar <[email protected]> Co-authored-by: Akash Kaothalkar <[email protected]> Co-authored-by: Kseniya Parkhamchuk <[email protected]> Co-authored-by: Se7en <[email protected]> Co-authored-by: Conroy Cheers <[email protected]> Co-authored-by: Yinghai Lu <[email protected]> Co-authored-by: Kyle Sayers <[email protected]> Co-authored-by: liusiqian-tal <[email protected]> Co-authored-by: Pavani Majety <[email protected]> Co-authored-by: Ye (Charlotte) Qi <[email protected]> Co-authored-by: Tianyu Guo <[email protected]> Co-authored-by: XiongfeiWei <[email protected]> Co-authored-by: Li Wang <[email protected]> Co-authored-by: Copilot <[email protected]> Co-authored-by: Anna Pendleton <[email protected]> Co-authored-by: Louie Tsai <[email protected]> Co-authored-by: Li, Jiang <[email protected]> Co-authored-by: Rachel Guo <[email protected]> Co-authored-by: Isotr0py <[email protected]>
1 parent 6c7451c commit c65512b

1,863 files changed, +83095 -35029 lines changed

.buildkite/check-wheel-size.py

Lines changed: 13 additions & 8 deletions
@@ -1,4 +1,5 @@
 # SPDX-License-Identifier: Apache-2.0
+# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

 import os
 import sys
@@ -8,12 +9,12 @@
 # Note that we have 400 MiB quota, please use it wisely.
 # See https://github.com/pypi/support/issues/3792 .
 # Please also sync the value with the one in Dockerfile.
-VLLM_MAX_SIZE_MB = int(os.environ.get('VLLM_MAX_SIZE_MB', 400))
+VLLM_MAX_SIZE_MB = int(os.environ.get("VLLM_MAX_SIZE_MB", 400))


 def print_top_10_largest_files(zip_file):
     """Print the top 10 largest files in the given zip file."""
-    with zipfile.ZipFile(zip_file, 'r') as z:
+    with zipfile.ZipFile(zip_file, "r") as z:
         file_sizes = [(f, z.getinfo(f).file_size) for f in z.namelist()]
         file_sizes.sort(key=lambda x: x[1], reverse=True)
         for f, size in file_sizes[:10]:
@@ -28,14 +29,18 @@ def check_wheel_size(directory):
                 wheel_path = os.path.join(root, file_name)
                 wheel_size_mb = os.path.getsize(wheel_path) / (1024 * 1024)
                 if wheel_size_mb > VLLM_MAX_SIZE_MB:
-                    print(f"Not allowed: Wheel {wheel_path} is larger "
-                          f"({wheel_size_mb:.2f} MB) than the limit "
-                          f"({VLLM_MAX_SIZE_MB} MB).")
+                    print(
+                        f"Not allowed: Wheel {wheel_path} is larger "
+                        f"({wheel_size_mb:.2f} MB) than the limit "
+                        f"({VLLM_MAX_SIZE_MB} MB)."
+                    )
                     print_top_10_largest_files(wheel_path)
                     return 1
                 else:
-                    print(f"Wheel {wheel_path} is within the allowed size "
-                          f"({wheel_size_mb:.2f} MB).")
+                    print(
+                        f"Wheel {wheel_path} is within the allowed size "
+                        f"({wheel_size_mb:.2f} MB)."
+                    )
     return 0


@@ -45,4 +50,4 @@ def check_wheel_size(directory):
         sys.exit(1)

     directory = sys.argv[1]
-    sys.exit(check_wheel_size(directory))
+    sys.exit(check_wheel_size(directory))
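
For readers who want to try the "largest files" step from this diff in isolation, the following is a minimal sketch, not the script itself. It returns the entries instead of printing them, and the names `largest_files`, `top_n`, and `demo` are assumptions made for illustration. It relies only on the standard-library `zipfile` calls that already appear in the diff (`namelist`, `getinfo`, `file_size`).

```python
import io
import zipfile


def largest_files(zip_file, top_n=10):
    """Return (name, uncompressed_size) pairs, largest first.

    Hypothetical helper: same ranking logic as the diff's
    print_top_10_largest_files, but returning data instead of printing.
    """
    with zipfile.ZipFile(zip_file, "r") as z:
        sizes = [(name, z.getinfo(name).file_size) for name in z.namelist()]
    sizes.sort(key=lambda item: item[1], reverse=True)
    return sizes[:top_n]


def demo():
    """Build a tiny in-memory archive and rank its members."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("small.txt", b"a" * 10)
        z.writestr("big.bin", b"b" * 1000)
        z.writestr("medium.py", b"c" * 100)
    buf.seek(0)
    return largest_files(buf, top_n=2)
```

A wheel is a zip archive, so the same function should accept a path to a `.whl` file in place of the in-memory buffer.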

.buildkite/generate_index.py

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,5 @@
11
# SPDX-License-Identifier: Apache-2.0
2+
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
23

34
import argparse
45
import os
@@ -22,5 +23,5 @@
2223
print(f"Generated index.html for {args.wheel}")
2324
# cloudfront requires escaping the '+' character
2425
f.write(
25-
template.format(wheel=filename,
26-
wheel_html_escaped=filename.replace("+", "%2B")))
26+
template.format(wheel=filename, wheel_html_escaped=filename.replace("+", "%2B"))
27+
)
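
The `'+'` escaping in this hunk can be illustrated with a short sketch. The wheel filename below is hypothetical and `escape_plus` is a name introduced here, not part of the script. The comparison with `urllib.parse.quote` is only a cross-check for this particular name, not a claim about how the script works.

```python
from urllib.parse import quote


def escape_plus(filename: str) -> str:
    """Percent-encode '+' with str.replace, as the diff does."""
    return filename.replace("+", "%2B")


# Hypothetical wheel name, used only for illustration.
wheel = "vllm-0.9.1+cu126-cp38-abi3-manylinux1_x86_64.whl"
escaped = escape_plus(wheel)

# For a name made of letters, digits, '-', '.', '_' and '+',
# the standard-library quote() yields the same string.
escaped_with_quote = quote(wheel)
```

The local-version separator `+` is the only character in such a name that needs encoding, which is presumably why a plain `replace` is enough here.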
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m RedHatAI/Llama-3.2-1B-Instruct-FP8 -b "auto" -l 1319 -f 5 -t 1
+model_name: "RedHatAI/Llama-3.2-1B-Instruct-FP8"
+tasks:
+- name: "gsm8k"
+  metrics:
+  - name: "exact_match,strict-match"
+    value: 0.335
+  - name: "exact_match,flexible-extract"
+    value: 0.323
+limit: 1319
+num_fewshot: 5
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m Qwen/Qwen2.5-1.5B-Instruct -b auto -l 1319 -f 5 -t 1
+model_name: "Qwen/Qwen2.5-1.5B-Instruct"
+tasks:
+- name: "gsm8k"
+  metrics:
+  - name: "exact_match,strict-match"
+    value: 0.54
+  - name: "exact_match,flexible-extract"
+    value: 0.59
+limit: 1319
+num_fewshot: 5
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m RedHatAI/Qwen2.5-VL-3B-Instruct-FP8-Dynamic -b auto -l 1319 -f 5 -t 1
+model_name: "RedHatAI/Qwen2.5-VL-3B-Instruct-FP8-Dynamic"
+tasks:
+- name: "gsm8k"
+  metrics:
+  - name: "exact_match,strict-match"
+    value: 0.47
+  - name: "exact_match,flexible-extract"
+    value: 0.64
+limit: 1319
+num_fewshot: 5

.buildkite/lm-eval-harness/configs/models-large.txt

Lines changed: 1 addition & 0 deletions
@@ -3,3 +3,4 @@ Meta-Llama-3-70B-Instruct.yaml
 Mixtral-8x7B-Instruct-v0.1.yaml
 Qwen2-57B-A14-Instruct.yaml
 DeepSeek-V2-Lite-Chat.yaml
+Meta-Llama-3-8B-QQQ.yaml
Lines changed: 2 additions & 6 deletions
@@ -1,10 +1,6 @@
-Meta-Llama-3-8B-Instruct.yaml
-Meta-Llama-3-8B-Instruct-FP8-compressed-tensors.yaml
+Qwen2.5-1.5B-Instruct.yaml
 Meta-Llama-3.2-1B-Instruct-INT8-compressed-tensors.yaml
 Meta-Llama-3-8B-Instruct-INT8-compressed-tensors-asym.yaml
 Meta-Llama-3-8B-Instruct-nonuniform-compressed-tensors.yaml
-Meta-Llama-3-8B-Instruct-Channelwise-compressed-tensors.yaml
+Qwen2.5-VL-3B-Instruct-FP8-dynamic.yaml
 Qwen1.5-MoE-W4A16-compressed-tensors.yaml
-Qwen2-1.5B-Instruct-INT8-compressed-tensors.yaml
-Qwen2-1.5B-Instruct-FP8W8.yaml
-Meta-Llama-3-8B-QQQ.yaml
Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
+# SPDX-License-Identifier: Apache-2.0
+# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
+from pathlib import Path
+
+import pytest
+
+
+def pytest_addoption(parser):
+    parser.addoption(
+        "--config-list-file",
+        action="store",
+        help="Path to the file listing model config YAMLs (one per line)",
+    )
+    parser.addoption(
+        "--tp-size",
+        action="store",
+        default="1",
+        help="Tensor parallel size to use for evaluation",
+    )
+
+
+@pytest.fixture(scope="session")
+def config_list_file(pytestconfig, config_dir):
+    rel_path = pytestconfig.getoption("--config-list-file")
+    return config_dir / rel_path
+
+
+@pytest.fixture(scope="session")
+def tp_size(pytestconfig):
+    return pytestconfig.getoption("--tp-size")
+
+
+def pytest_generate_tests(metafunc):
+    if "config_filename" in metafunc.fixturenames:
+        rel_path = metafunc.config.getoption("--config-list-file")
+        config_list_file = Path(rel_path).resolve()
+        config_dir = config_list_file.parent
+        with open(config_list_file, encoding="utf-8") as f:
+            configs = [
+                config_dir / line.strip()
+                for line in f
+                if line.strip() and not line.startswith("#")
+            ]
+        metafunc.parametrize("config_filename", configs)
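
The list-file parsing inside `pytest_generate_tests` above can be exercised without pytest. The sketch below copies the same list comprehension into a standalone function; `read_config_list` and `demo` are hypothetical names introduced here, and the file contents are made up from config names that appear elsewhere in this commit.

```python
import tempfile
from pathlib import Path


def read_config_list(list_file: Path) -> list[Path]:
    """Return one config path per non-blank line that does not start with '#'.

    Paths are resolved relative to the directory holding the list file,
    mirroring the comprehension in the diff.
    """
    config_dir = list_file.parent
    with open(list_file, encoding="utf-8") as f:
        return [
            config_dir / line.strip()
            for line in f
            if line.strip() and not line.startswith("#")
        ]


def demo() -> list[str]:
    """Write a small list file and return the config file names it yields."""
    with tempfile.TemporaryDirectory() as tmp:
        list_file = Path(tmp) / "models-small.txt"
        list_file.write_text(
            "# small models\n"
            "Qwen2.5-1.5B-Instruct.yaml\n"
            "\n"
            "Meta-Llama-3-8B-QQQ.yaml\n",
            encoding="utf-8",
        )
        return [p.name for p in read_config_list(list_file)]
```

Note that the comment check runs on the unstripped line, so a `#` preceded by whitespace would not be treated as a comment; the sketch keeps that behavior rather than changing it.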

.buildkite/lm-eval-harness/run-tests.sh

Lines changed: 0 additions & 59 deletions
This file was deleted.
Lines changed: 24 additions & 38 deletions
@@ -1,69 +1,55 @@
 # SPDX-License-Identifier: Apache-2.0
+# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
 """
 LM eval harness on model to compare vs HF baseline computed offline.
 Configs are found in configs/$MODEL.yaml

-* export LM_EVAL_TEST_DATA_FILE=configs/Meta-Llama-3-70B-Instruct.yaml
-* export LM_EVAL_TP_SIZE=4
-* pytest -s test_lm_eval_correctness.py
+pytest -s -v test_lm_eval_correctness.py \
+    --config-list-file=configs/models-small.txt \
+    --tp-size=1
 """

-import os
-from pathlib import Path
-
 import lm_eval
-import numpy
-import pytest
+import numpy as np
 import yaml

 RTOL = 0.08
-TEST_DATA_FILE = os.environ.get(
-    "LM_EVAL_TEST_DATA_FILE",
-    ".buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct.yaml")
-
-TP_SIZE = os.environ.get("LM_EVAL_TP_SIZE", 1)
-

-def launch_lm_eval(eval_config):
-    trust_remote_code = eval_config.get('trust_remote_code', False)
-
-    model_args = f"pretrained={eval_config['model_name']}," \
-                 f"tensor_parallel_size={TP_SIZE}," \
-                 f"add_bos_token=true," \
-                 f"trust_remote_code={trust_remote_code}"

+def launch_lm_eval(eval_config, tp_size):
+    trust_remote_code = eval_config.get("trust_remote_code", False)
+    model_args = (
+        f"pretrained={eval_config['model_name']},"
+        f"tensor_parallel_size={tp_size},"
+        f"enforce_eager=true,"
+        f"add_bos_token=true,"
+        f"trust_remote_code={trust_remote_code}"
+    )
     results = lm_eval.simple_evaluate(
         model="vllm",
         model_args=model_args,
         tasks=[task["name"] for task in eval_config["tasks"]],
         num_fewshot=eval_config["num_fewshot"],
         limit=eval_config["limit"],
-        batch_size="auto")
-
+        batch_size="auto",
+    )
     return results


-def test_lm_eval_correctness():
-    eval_config = yaml.safe_load(
-        Path(TEST_DATA_FILE).read_text(encoding="utf-8"))
-
-    if eval_config[
-            "model_name"] == "nm-testing/Meta-Llama-3-70B-Instruct-FBGEMM-nonuniform":  #noqa: E501
-        pytest.skip("FBGEMM is currently failing on main.")
+def test_lm_eval_correctness_param(config_filename, tp_size):
+    eval_config = yaml.safe_load(config_filename.read_text(encoding="utf-8"))

-    # Launch eval requests.
-    results = launch_lm_eval(eval_config)
+    results = launch_lm_eval(eval_config, tp_size)

-    # Confirm scores match ground truth.
     success = True
     for task in eval_config["tasks"]:
         for metric in task["metrics"]:
             ground_truth = metric["value"]
             measured_value = results["results"][task["name"]][metric["name"]]
-            print(f'{task["name"]} | {metric["name"]}: '
-                  f'ground_truth={ground_truth} | measured={measured_value}')
-            success = success and numpy.isclose(
-                ground_truth, measured_value, rtol=RTOL)
+            print(
+                f"{task['name']} | {metric['name']}: "
+                f"ground_truth={ground_truth} | measured={measured_value}"
+            )
+            success = success and np.isclose(ground_truth, measured_value, rtol=RTOL)

-    # Assert at the end, print all scores even on failure for debugging.
     assert success
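
To make the `RTOL = 0.08` comparison in this test concrete, here is a dependency-free sketch of the check. It assumes NumPy's documented `isclose` rule, `|a - b| <= atol + rtol * |b|` with a default `atol` of `1e-8`. The helper names `within_tolerance` and `all_within` are hypothetical, and the measured values are invented for illustration; only the ground-truth value 0.54 comes from a config in this commit.

```python
RTOL = 0.08
ATOL = 1e-8  # assumed to match np.isclose's documented default


def within_tolerance(ground_truth, measured, rtol=RTOL, atol=ATOL):
    """Pure-Python stand-in for np.isclose(ground_truth, measured, rtol=rtol).

    The relative term is scaled by the second argument, i.e. the measured value.
    """
    return abs(ground_truth - measured) <= atol + rtol * abs(measured)


def all_within(pairs):
    """Accumulate a single pass/fail flag over every pair, like the test loop."""
    success = True
    for ground_truth, measured in pairs:
        success = success and within_tolerance(ground_truth, measured)
    return success


# Ground truth 0.54 with made-up measurements.
close_enough = within_tolerance(0.54, 0.52)  # |diff| = 0.02, bound is about 0.0416
too_far = within_tolerance(0.54, 0.45)       # |diff| = 0.09, bound is about 0.036
```

Because the bound scales with the measured value rather than the ground truth, a low measurement tightens the allowed gap slightly; this follows from the argument order used in the diff.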
