add Kimi K2 model documentation and benchmarks #64
Conversation
**Summary of Changes**

Hello @ilopezluna, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces comprehensive documentation and performance benchmarks for the Kimi K2 Thinking model. It provides detailed insights into the model's architecture, key features, and its performance across a wide array of reasoning, general, agentic search, and coding tasks, facilitating a better understanding of its capabilities and usage. Separate documentation is provided for both the vLLM and GGUF versions of the model.
Hey there - I've reviewed your changes - here's some feedback:
- The two new docs are almost identical; consider extracting the shared description/benchmark sections into a single canonical Kimi K2 page and keeping only the runner-specific bits (GGUF vs vLLM) separate to avoid duplication and future drift.
- It may be helpful to make the distinction between the GGUF and vLLM variants more obvious in the titles and introductory text (e.g., explicitly label one as 'GGUF (Unsloth)' and the other as 'vLLM server' near the top) so users immediately see which page applies to their setup.
## Individual Comments
### Comment 1
<location> `ai/kimi-k2.md:6-7` </location>
<code_context>
+
+
+## Description
+Kimi K2 Thinking is the latest, most capable version of open-source thinking model. Starting with Kimi K2, we built it as a thinking agent that reasons step-by-step while dynamically invoking tools. It sets a new state-of-the-art on Humanity's Last Exam (HLE), BrowseComp, and other benchmarks by dramatically scaling multi-step reasoning depth and maintaining stable tool-use across 200–300 sequential calls. At the same time, K2 Thinking is a native INT4 quantization model with 256k context window, achieving lossless reductions in inference latency and GPU memory usage.
+
+
</code_context>
<issue_to_address>
**issue (typo):** Consider fixing the article usage and the 'k/K' inconsistency in this description sentence.
For example: "Kimi K2 Thinking is the latest, most capable version of an open-source thinking model ... with a 256K context window," which adds the missing article before "open-source thinking model" and matches the "256K" capitalization used later in the table.
```suggestion
## Description
Kimi K2 Thinking is the latest, most capable version of an open-source thinking model. Starting with Kimi K2, we built it as a thinking agent that reasons step-by-step while dynamically invoking tools. It sets a new state-of-the-art on Humanity's Last Exam (HLE), BrowseComp, and other benchmarks by dramatically scaling multi-step reasoning depth and maintaining stable tool-use across 200–300 sequential calls. At the same time, K2 Thinking is a native INT4 quantization model with a 256K context window, achieving lossless reductions in inference latency and GPU memory usage.
```
</issue_to_address>
### Comment 2
<location> `ai/kimi-k2.md:12` </location>
<code_context>
+
+## Key Features
+- **Deep Thinking & Tool Orchestration:** End-to-end trained to interleave chain-of-thought reasoning with function calls, enabling autonomous research, coding, and writing workflows that last hundreds of steps without drift.
+- **Native INT4 Quantization:** Quantization-Aware Training (QAT) is employed in post-training stage to achieve lossless 2x speed-up in low-latency mode.
+- **Stable Long-Horizon Agency:** Maintains coherent goal-directed behavior across up to 200–300 consecutive tool invocations, surpassing prior models that degrade after 30–50 steps.
+
</code_context>
<issue_to_address>
**nitpick (typo):** Add a definite article in 'in post-training stage' for smoother grammar.
For example: "is employed in the post-training stage" or "during the post-training stage".
```suggestion
- **Native INT4 Quantization:** Quantization-Aware Training (QAT) is employed in the post-training stage to achieve lossless 2x speed-up in low-latency mode.
```
</issue_to_address>
### Comment 3
<location> `ai/kimi-k2-vllm.md:6` </location>
<code_context>
+
+
+## Description
+Kimi K2 Thinking is the latest, most capable version of open-source thinking model. Starting with Kimi K2, we built it as a thinking agent that reasons step-by-step while dynamically invoking tools. It sets a new state-of-the-art on Humanity's Last Exam (HLE), BrowseComp, and other benchmarks by dramatically scaling multi-step reasoning depth and maintaining stable tool-use across 200–300 sequential calls. At the same time, K2 Thinking is a native INT4 quantization model with 256k context window, achieving lossless reductions in inference latency and GPU memory usage.
+
+
</code_context>
<issue_to_address>
**issue (typo):** Same as the other file: fix the missing article and '256k' capitalization in this sentence.
Use "version of an open-source thinking model" and "a 256K context window" to align grammar and capitalization with the table.
```suggestion
Kimi K2 Thinking is the latest, most capable version of an open-source thinking model. Starting with Kimi K2, we built it as a thinking agent that reasons step-by-step while dynamically invoking tools. It sets a new state-of-the-art on Humanity's Last Exam (HLE), BrowseComp, and other benchmarks by dramatically scaling multi-step reasoning depth and maintaining stable tool-use across 200–300 sequential calls. At the same time, K2 Thinking is a native INT4 quantization model with a 256K context window, achieving lossless reductions in inference latency and GPU memory usage.
```
</issue_to_address>
### Comment 4
<location> `ai/kimi-k2-vllm.md:11` </location>
<code_context>
+
+## Key Features
+- **Deep Thinking & Tool Orchestration:** End-to-end trained to interleave chain-of-thought reasoning with function calls, enabling autonomous research, coding, and writing workflows that last hundreds of steps without drift.
+- **Native INT4 Quantization:** Quantization-Aware Training (QAT) is employed in post-training stage to achieve lossless 2x speed-up in low-latency mode.
+- **Stable Long-Horizon Agency:** Maintains coherent goal-directed behavior across up to 200–300 consecutive tool invocations, surpassing prior models that degrade after 30–50 steps.
+
</code_context>
<issue_to_address>
**nitpick (typo):** As in the other file, 'in post-training stage' reads more naturally with 'the'.
```suggestion
- **Native INT4 Quantization:** Quantization-Aware Training (QAT) is employed in the post-training stage to achieve lossless 2x speed-up in low-latency mode.
```
</issue_to_address>
Code Review
This pull request adds documentation for the Kimi K2 model. While the information is valuable, there are a few major issues. The two new files, ai/kimi-k2-vllm.md and ai/kimi-k2.md, have almost identical content, which will create a maintenance burden. I recommend consolidating them into a single file with sections for each variant (vllm and GGUF) if needed, perhaps using the Available model variants section from the template. More importantly, neither file follows the repository's template.md for model documentation. I've left specific comments on how to align with the template. Additionally, the benchmark tables contain unexplained asterisks, which makes the data difficult to interpret. Please address these points to ensure consistency and clarity.
# Kimi K2

## Description
Kimi K2 Thinking is the latest, most capable version of open-source thinking model. Starting with Kimi K2, we built it as a thinking agent that reasons step-by-step while dynamically invoking tools. It sets a new state-of-the-art on Humanity's Last Exam (HLE), BrowseComp, and other benchmarks by dramatically scaling multi-step reasoning depth and maintaining stable tool-use across 200–300 sequential calls. At the same time, K2 Thinking is a native INT4 quantization model with 256k context window, achieving lossless reductions in inference latency and GPU memory usage.

## Key Features
- **Deep Thinking & Tool Orchestration:** End-to-end trained to interleave chain-of-thought reasoning with function calls, enabling autonomous research, coding, and writing workflows that last hundreds of steps without drift (see the request sketch after this list).
- **Native INT4 Quantization:** Quantization-Aware Training (QAT) is employed in post-training stage to achieve lossless 2x speed-up in low-latency mode.
- **Stable Long-Horizon Agency:** Maintains coherent goal-directed behavior across up to 200–300 consecutive tool invocations, surpassing prior models that degrade after 30–50 steps.
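To make the tool-orchestration flow concrete, here is a minimal request sketch. It assumes Docker Model Runner's OpenAI-compatible endpoint is reachable on its default TCP port (12434) and uses a hypothetical `get_weather` tool; neither the endpoint configuration nor the tool is part of this PR.

```bash
# Illustrative OpenAI-style function-calling request (sketch, not part of the PR).
# Assumes host TCP access to Docker Model Runner is enabled, e.g.:
#   docker desktop enable model-runner --tcp 12434
curl http://localhost:12434/engines/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kimi-k2-vllm",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
```

A tool-capable model should answer with a `tool_calls` entry naming `get_weather` rather than a plain text reply.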
| **Field**                               | **Value**                |
|-----------------------------------------|--------------------------|
| Architecture                            | Mixture-of-Experts (MoE) |
| Total Parameters                        | 1T                       |
| Activated Parameters                    | 32B                      |
| Number of Layers (Dense layer included) | 61                       |
| Number of Dense Layers                  | 1                        |
| Attention Hidden Dimension              | 7168                     |
| MoE Hidden Dimension (per Expert)       | 2048                     |
| Number of Attention Heads               | 64                       |
| Number of Experts                       | 384                      |
| Selected Experts per Token              | 8                        |
| Number of Shared Experts                | 1                        |
| Vocabulary Size                         | 160K                     |
| Context Length                          | 256K                     |
| Attention Mechanism                     | MLA                      |
| Activation Function                     | SwiGLU                   |

## Use this AI model with Docker Model Runner

```bash
docker model run kimi-k2-vllm
```
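Once the model is running, a quick smoke test can go through the OpenAI-compatible API. This is a sketch under the same assumptions as above (TCP host access enabled on port 12434; the model reference may differ in your setup):

```bash
# Minimal chat-completion smoke test against Docker Model Runner (sketch).
curl http://localhost:12434/engines/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kimi-k2-vllm",
    "messages": [{"role": "user", "content": "Briefly explain what a Mixture-of-Experts model is."}]
  }'
```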
## Benchmarks

### Reasoning Tasks

| Benchmark       | Setting   | K2 Thinking | GPT-5 (High) | Claude Sonnet 4.5 | K2 0905 (Thinking) | DeepSeek-V3.2 | Grok-4 |
|-----------------|-----------|-------------|--------------|-------------------|--------------------|---------------|--------|
| HLE             | no tools  | 23.9        | 26.3         | 19.8*             | 7.9                | 19.8          | 25.4   |
| HLE             | w/ tools  | 44.9        | 41.7*        | 32.0*             | 21.7               | 20.3*         | 41.0   |
| HLE             | heavy     | 51.0        | 42.0         | -                 | -                  | -             | 50.7   |
| AIME25          | no tools  | 94.5        | 94.6         | 87.0              | 51.0               | 89.3          | 91.7   |
| AIME25          | w/ python | 99.1        | 99.6         | 100.0             | 75.2               | 58.1*         | 98.8   |
| AIME25          | heavy     | 100.0       | 100.0        | -                 | -                  | -             | 100.0  |
| HMMT25          | no tools  | 89.4        | 93.3         | 74.6*             | 38.8               | 83.6          | 90.0   |
| HMMT25          | w/ python | 95.1        | 96.7         | 88.8*             | 70.4               | 49.5*         | 93.9   |
| HMMT25          | heavy     | 97.5        | 100.0        | -                 | -                  | -             | 96.7   |
| IMO-AnswerBench | no tools  | 78.6        | 76.0*        | 65.9*             | 45.8               | 76.0*         | 73.1   |
| GPQA            | no tools  | 84.5        | 85.7         | 83.4              | 74.2               | 79.9          | 87.5   |

### General Tasks

| Benchmark        | Setting  | K2 Thinking | GPT-5 (High) | Claude Sonnet 4.5 | K2 0905 (Thinking) | DeepSeek-V3.2 |
|------------------|----------|-------------|--------------|-------------------|--------------------|---------------|
| MMLU-Pro         | no tools | 84.6        | 87.1         | 87.5              | 81.9               | 85.0          |
| MMLU-Redux       | no tools | 94.4        | 95.3         | 95.6              | 92.7               | 93.7          |
| Longform Writing | no tools | 73.8        | 71.4         | 79.8              | 62.8               | 72.5          |
| HealthBench      | no tools | 58.0        | 67.2         | 44.2              | 43.8               | 46.9          |

### Agentic Search Tasks

| Benchmark        | Setting  | K2 Thinking | GPT-5 (High) | Claude Sonnet 4.5 | K2 0905 (Thinking) | DeepSeek-V3.2 |
|------------------|----------|-------------|--------------|-------------------|--------------------|---------------|
| BrowseComp       | w/ tools | 60.2        | 54.9         | 24.1              | 7.4                | 40.1          |
| BrowseComp-ZH    | w/ tools | 62.3        | 63.0*        | 42.4*             | 22.2               | 47.9          |
| Seal-0           | w/ tools | 56.3        | 51.4*        | 53.4*             | 25.2               | 38.5*         |
| FinSearchComp-T3 | w/ tools | 47.4        | 48.5*        | 44.0*             | 10.4               | 27.0*         |
| Frames           | w/ tools | 87.0        | 86.0*        | 85.0*             | 58.1               | 80.2*         |

### Coding Tasks

| Benchmark              | Setting                   | K2 Thinking | GPT-5 (High) | Claude Sonnet 4.5 | K2 0905 (Thinking) | DeepSeek-V3.2 |
|------------------------|---------------------------|-------------|--------------|-------------------|--------------------|---------------|
| SWE-bench Verified     | w/ tools                  | 71.3        | 74.9         | 77.2              | 69.2               | 67.8          |
| SWE-bench Multilingual | w/ tools                  | 61.1        | 55.3*        | 68.0              | 55.9               | 57.9          |
| Multi-SWE-bench        | w/ tools                  | 41.9        | 39.3*        | 44.3              | 33.5               | 30.6          |
| SciCode                | no tools                  | 44.8        | 42.9         | 44.7              | 30.7               | 37.7          |
| LiveCodeBenchV6        | no tools                  | 83.1        | 87.0*        | 64.0*             | 56.1*              | 74.1          |
| OJ-Bench (cpp)         | no tools                  | 48.7        | 56.2*        | 30.4*             | 25.5*              | 38.2*         |
| Terminal-Bench         | w/ simulated tools (JSON) | 47.1        | 43.8         | 51.0              | 44.5               | 37.7          |

## Links
- https://moonshotai.github.io/Kimi-K2/thinking.html
- https://huggingface.co/moonshotai/Kimi-K2-Thinking
The structure of this document does not follow the established template.md for model pages. For consistency across the project, please restructure this file to match the template. Key missing or mismatched sections include:

- **Characteristics:** The provided table is different. Please use the format from the template and include fields like `Provider`, `Cutoff date`, `Languages`, `Tool calling`, `License`, etc.
- **Available model variants:** This section is missing.
- **Considerations:** This section is missing.
- **Benchmark performance:** The format of benchmark reporting is different from the template.

Adhering to the template is important for maintainability and user experience.
# Kimi K2
*GGUF version by Unsloth*

## Description
Kimi K2 Thinking is the latest, most capable version of open-source thinking model. Starting with Kimi K2, we built it as a thinking agent that reasons step-by-step while dynamically invoking tools. It sets a new state-of-the-art on Humanity's Last Exam (HLE), BrowseComp, and other benchmarks by dramatically scaling multi-step reasoning depth and maintaining stable tool-use across 200–300 sequential calls. At the same time, K2 Thinking is a native INT4 quantization model with 256k context window, achieving lossless reductions in inference latency and GPU memory usage.

## Key Features
- **Deep Thinking & Tool Orchestration:** End-to-end trained to interleave chain-of-thought reasoning with function calls, enabling autonomous research, coding, and writing workflows that last hundreds of steps without drift.
- **Native INT4 Quantization:** Quantization-Aware Training (QAT) is employed in post-training stage to achieve lossless 2x speed-up in low-latency mode.
- **Stable Long-Horizon Agency:** Maintains coherent goal-directed behavior across up to 200–300 consecutive tool invocations, surpassing prior models that degrade after 30–50 steps.

| **Field**                               | **Value**                |
|-----------------------------------------|--------------------------|
| Architecture                            | Mixture-of-Experts (MoE) |
| Total Parameters                        | 1T                       |
| Activated Parameters                    | 32B                      |
| Number of Layers (Dense layer included) | 61                       |
| Number of Dense Layers                  | 1                        |
| Attention Hidden Dimension              | 7168                     |
| MoE Hidden Dimension (per Expert)       | 2048                     |
| Number of Attention Heads               | 64                       |
| Number of Experts                       | 384                      |
| Selected Experts per Token              | 8                        |
| Number of Shared Experts                | 1                        |
| Vocabulary Size                         | 160K                     |
| Context Length                          | 256K                     |
| Attention Mechanism                     | MLA                      |
| Activation Function                     | SwiGLU                   |

## Use this AI model with Docker Model Runner

```bash
docker model run kimi-k2
```
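For the GGUF build, a typical pull/verify/run workflow might look like the sketch below. Subcommand names are assumed from the Docker Model Runner CLI, and the exact model reference may differ in your registry.

```bash
# Sketch: fetch the GGUF build, confirm it is available locally, then run a one-shot prompt.
docker model pull kimi-k2
docker model list
docker model run kimi-k2 "Summarize Kimi K2 Thinking's key features in one sentence."
```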
## Benchmarks

### Reasoning Tasks

| Benchmark       | Setting   | K2 Thinking | GPT-5 (High) | Claude Sonnet 4.5 | K2 0905 (Thinking) | DeepSeek-V3.2 | Grok-4 |
|-----------------|-----------|-------------|--------------|-------------------|--------------------|---------------|--------|
| HLE             | no tools  | 23.9        | 26.3         | 19.8*             | 7.9                | 19.8          | 25.4   |
| HLE             | w/ tools  | 44.9        | 41.7*        | 32.0*             | 21.7               | 20.3*         | 41.0   |
| HLE             | heavy     | 51.0        | 42.0         | -                 | -                  | -             | 50.7   |
| AIME25          | no tools  | 94.5        | 94.6         | 87.0              | 51.0               | 89.3          | 91.7   |
| AIME25          | w/ python | 99.1        | 99.6         | 100.0             | 75.2               | 58.1*         | 98.8   |
| AIME25          | heavy     | 100.0       | 100.0        | -                 | -                  | -             | 100.0  |
| HMMT25          | no tools  | 89.4        | 93.3         | 74.6*             | 38.8               | 83.6          | 90.0   |
| HMMT25          | w/ python | 95.1        | 96.7         | 88.8*             | 70.4               | 49.5*         | 93.9   |
| HMMT25          | heavy     | 97.5        | 100.0        | -                 | -                  | -             | 96.7   |
| IMO-AnswerBench | no tools  | 78.6        | 76.0*        | 65.9*             | 45.8               | 76.0*         | 73.1   |
| GPQA            | no tools  | 84.5        | 85.7         | 83.4              | 74.2               | 79.9          | 87.5   |

### General Tasks

| Benchmark        | Setting  | K2 Thinking | GPT-5 (High) | Claude Sonnet 4.5 | K2 0905 (Thinking) | DeepSeek-V3.2 |
|------------------|----------|-------------|--------------|-------------------|--------------------|---------------|
| MMLU-Pro         | no tools | 84.6        | 87.1         | 87.5              | 81.9               | 85.0          |
| MMLU-Redux       | no tools | 94.4        | 95.3         | 95.6              | 92.7               | 93.7          |
| Longform Writing | no tools | 73.8        | 71.4         | 79.8              | 62.8               | 72.5          |
| HealthBench      | no tools | 58.0        | 67.2         | 44.2              | 43.8               | 46.9          |

### Agentic Search Tasks

| Benchmark        | Setting  | K2 Thinking | GPT-5 (High) | Claude Sonnet 4.5 | K2 0905 (Thinking) | DeepSeek-V3.2 |
|------------------|----------|-------------|--------------|-------------------|--------------------|---------------|
| BrowseComp       | w/ tools | 60.2        | 54.9         | 24.1              | 7.4                | 40.1          |
| BrowseComp-ZH    | w/ tools | 62.3        | 63.0*        | 42.4*             | 22.2               | 47.9          |
| Seal-0           | w/ tools | 56.3        | 51.4*        | 53.4*             | 25.2               | 38.5*         |
| FinSearchComp-T3 | w/ tools | 47.4        | 48.5*        | 44.0*             | 10.4               | 27.0*         |
| Frames           | w/ tools | 87.0        | 86.0*        | 85.0*             | 58.1               | 80.2*         |

### Coding Tasks

| Benchmark              | Setting                   | K2 Thinking | GPT-5 (High) | Claude Sonnet 4.5 | K2 0905 (Thinking) | DeepSeek-V3.2 |
|------------------------|---------------------------|-------------|--------------|-------------------|--------------------|---------------|
| SWE-bench Verified     | w/ tools                  | 71.3        | 74.9         | 77.2              | 69.2               | 67.8          |
| SWE-bench Multilingual | w/ tools                  | 61.1        | 55.3*        | 68.0              | 55.9               | 57.9          |
| Multi-SWE-bench        | w/ tools                  | 41.9        | 39.3*        | 44.3              | 33.5               | 30.6          |
| SciCode                | no tools                  | 44.8        | 42.9         | 44.7              | 30.7               | 37.7          |
| LiveCodeBenchV6        | no tools                  | 83.1        | 87.0*        | 64.0*             | 56.1*              | 74.1          |
| OJ-Bench (cpp)         | no tools                  | 48.7        | 56.2*        | 30.4*             | 25.5*              | 38.2*         |
| Terminal-Bench         | w/ simulated tools (JSON) | 47.1        | 43.8         | 51.0              | 44.5               | 37.7          |

## Links
- https://moonshotai.github.io/Kimi-K2/thinking.html
- https://huggingface.co/moonshotai/Kimi-K2-Thinking
- [Hugging Face (Unsloth GGUF)](https://huggingface.co/unsloth/Kimi-K2-Thinking-GGUF)
- [Unsloth Dynamic 2.0 GGUF](https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs)
The structure of this document does not follow the established template.md for model pages. For consistency across the project, please restructure this file to match the template. Key missing or mismatched sections include:

- **Characteristics:** The provided table is different. Please use the format from the template and include fields like `Provider`, `Cutoff date`, `Languages`, `Tool calling`, `License`, etc.
- **Available model variants:** This section is missing. The subtitle on line 2 (*GGUF version by Unsloth*) should likely be part of this section.
- **Considerations:** This section is missing.
- **Benchmark performance:** The format of benchmark reporting is different from the template.

Adhering to the template is important for maintainability and user experience.