
Conversation

@ilopezluna
Contributor

No description provided.

@gemini-code-assist
Contributor

Summary of Changes

Hello @ilopezluna, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces comprehensive documentation and performance benchmarks for the Kimi K2 Thinking model. It provides detailed insights into the model's architecture, key features, and its performance across a wide array of reasoning, general, agentic search, and coding tasks, facilitating a better understanding of its capabilities and usage. Separate documentation is provided for both the vLLM and GGUF versions of the model.

Highlights

  • New Model Documentation: Comprehensive documentation for the Kimi K2 Thinking model has been added, detailing its architecture, key features, and usage.
  • Detailed Benchmarks: Extensive benchmark results are included, showcasing the Kimi K2 Thinking model's performance across various reasoning, general, agentic search, and coding tasks, comparing it against other leading models.
  • Model Specifications: Detailed specifications of the Kimi K2 model, such as its Mixture-of-Experts (MoE) architecture, parameter count, context length, and activation function, are now documented.
  • Docker Integration: Instructions for running the Kimi K2 model using the Docker Model Runner have been provided for both vLLM and GGUF versions; the exact commands are reproduced after this list.
  • GGUF Version Information: Specific documentation for the Kimi K2 GGUF version by Unsloth has been added, including relevant Hugging Face and Unsloth documentation links.
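
For quick reference, the run commands from the two pages under review are shown below (assuming, as the file names in the review comments suggest, that ai/kimi-k2.md is the GGUF page and ai/kimi-k2-vllm.md is the vLLM page):

```bash
# GGUF (Unsloth) version, from ai/kimi-k2.md
docker model run kimi-k2

# vLLM version, from ai/kimi-k2-vllm.md
docker model run kimi-k2-vllm
```
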
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
|---------|---------|-------------|
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |
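
Any of these commands can simply be posted as a pull request comment; the snippet below is a hypothetical alternative using the GitHub CLI (the PR number placeholder is illustrative):

```bash
# Illustrative only: post "/gemini review" as a PR comment via the GitHub CLI.
# Replace <PR_NUMBER> with the pull request number.
gh pr comment <PR_NUMBER> --body "/gemini review"
```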

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
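
A minimal sketch of what that setup might look like is below (the styleguide.md file name and its contents are assumptions, not confirmed in this thread; see the linked instructions for the exact files and schema):

```bash
# Hypothetical setup: add a custom review style guide under .gemini/ at the repo root.
# File name and format are assumptions; consult the Gemini Code Assist docs.
mkdir -p .gemini
cat > .gemini/styleguide.md <<'EOF'
# Review style guide
- Python code should follow PEP-8.
- Model documentation pages should follow template.md.
EOF
git add .gemini && git commit -m "Add Gemini Code Assist style guide"
```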

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

Contributor

@sourcery-ai sourcery-ai bot left a comment

Hey there - I've reviewed your changes - here's some feedback:

  • The two new docs are almost identical; consider extracting the shared description/benchmark sections into a single canonical Kimi K2 page and keeping only the runner-specific bits (GGUF vs vLLM) separate to avoid duplication and future drift.
  • It may be helpful to make the distinction between the GGUF and vLLM variants more obvious in the titles and introductory text (e.g., explicitly label one as 'GGUF (Unsloth)' and the other as 'vLLM server' near the top) so users immediately see which page applies to their setup.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The two new docs are almost identical; consider extracting the shared description/benchmark sections into a single canonical Kimi K2 page and keeping only the runner-specific bits (GGUF vs vLLM) separate to avoid duplication and future drift.
- It may be helpful to make the distinction between the GGUF and vLLM variants more obvious in the titles and introductory text (e.g., explicitly label one as 'GGUF (Unsloth)' and the other as 'vLLM server' near the top) so users immediately see which page applies to their setup.

## Individual Comments

### Comment 1
<location> `ai/kimi-k2.md:6-7` </location>
<code_context>
+![logo](https://statics.moonshot.cn/kimi-blog/assets/logo-CvjirWOb.svg)
+
+## Description
+Kimi K2 Thinking is the latest, most capable version of open-source thinking model. Starting with Kimi K2, we built it as a thinking agent that reasons step-by-step while dynamically invoking tools. It sets a new state-of-the-art on Humanity's Last Exam (HLE), BrowseComp, and other benchmarks by dramatically scaling multi-step reasoning depth and maintaining stable tool-use across 200–300 sequential calls. At the same time, K2 Thinking is a native INT4 quantization model with 256k context window, achieving lossless reductions in inference latency and GPU memory usage.
+
+
</code_context>

<issue_to_address>
**issue (typo):** Consider fixing the article usage and the 'k/K' inconsistency in this description sentence.

For example: "Kimi K2 Thinking is the latest, most capable version of an open-source thinking model ... with a 256K context window," which adds the missing article before "open-source thinking model" and matches the "256K" capitalization used later in the table.

```suggestion
## Description
Kimi K2 Thinking is the latest, most capable version of an open-source thinking model. Starting with Kimi K2, we built it as a thinking agent that reasons step-by-step while dynamically invoking tools. It sets a new state-of-the-art on Humanity's Last Exam (HLE), BrowseComp, and other benchmarks by dramatically scaling multi-step reasoning depth and maintaining stable tool-use across 200–300 sequential calls. At the same time, K2 Thinking is a native INT4 quantization model with a 256K context window, achieving lossless reductions in inference latency and GPU memory usage.
```
</issue_to_address>

### Comment 2
<location> `ai/kimi-k2.md:12` </location>
<code_context>
+
+## Key Features
+- **Deep Thinking & Tool Orchestration:** End-to-end trained to interleave chain-of-thought reasoning with function calls, enabling autonomous research, coding, and writing workflows that last hundreds of steps without drift.
+- **Native INT4 Quantization:** Quantization-Aware Training (QAT) is employed in post-training stage to achieve lossless 2x speed-up in low-latency mode.
+- **Stable Long-Horizon Agency:** Maintains coherent goal-directed behavior across up to 200–300 consecutive tool invocations, surpassing prior models that degrade after 30–50 steps.
+
</code_context>

<issue_to_address>
**nitpick (typo):** Add a definite article in 'in post-training stage' for smoother grammar.

For example: "is employed in the post-training stage" or "during the post-training stage".

```suggestion
- **Native INT4 Quantization:** Quantization-Aware Training (QAT) is employed in the post-training stage to achieve lossless 2x speed-up in low-latency mode.
```
</issue_to_address>

### Comment 3
<location> `ai/kimi-k2-vllm.md:6` </location>
<code_context>
+![logo](https://statics.moonshot.cn/kimi-blog/assets/logo-CvjirWOb.svg)
+
+## Description
+Kimi K2 Thinking is the latest, most capable version of open-source thinking model. Starting with Kimi K2, we built it as a thinking agent that reasons step-by-step while dynamically invoking tools. It sets a new state-of-the-art on Humanity's Last Exam (HLE), BrowseComp, and other benchmarks by dramatically scaling multi-step reasoning depth and maintaining stable tool-use across 200–300 sequential calls. At the same time, K2 Thinking is a native INT4 quantization model with 256k context window, achieving lossless reductions in inference latency and GPU memory usage.
+
+
</code_context>

<issue_to_address>
**issue (typo):** Same as the other file: fix the missing article and '256k' capitalization in this sentence.

Use "version of an open-source thinking model" and "a 256K context window" to align grammar and capitalization with the table.

```suggestion
Kimi K2 Thinking is the latest, most capable version of an open-source thinking model. Starting with Kimi K2, we built it as a thinking agent that reasons step-by-step while dynamically invoking tools. It sets a new state-of-the-art on Humanity's Last Exam (HLE), BrowseComp, and other benchmarks by dramatically scaling multi-step reasoning depth and maintaining stable tool-use across 200–300 sequential calls. At the same time, K2 Thinking is a native INT4 quantization model with a 256K context window, achieving lossless reductions in inference latency and GPU memory usage.
```
</issue_to_address>

### Comment 4
<location> `ai/kimi-k2-vllm.md:11` </location>
<code_context>
+
+## Key Features
+- **Deep Thinking & Tool Orchestration:** End-to-end trained to interleave chain-of-thought reasoning with function calls, enabling autonomous research, coding, and writing workflows that last hundreds of steps without drift.
+- **Native INT4 Quantization:** Quantization-Aware Training (QAT) is employed in post-training stage to achieve lossless 2x speed-up in low-latency mode.
+- **Stable Long-Horizon Agency:** Maintains coherent goal-directed behavior across up to 200–300 consecutive tool invocations, surpassing prior models that degrade after 30–50 steps.
+
</code_context>

<issue_to_address>
**nitpick (typo):** As in the other file, 'in post-training stage' reads more naturally with 'the'.

```suggestion
- **Native INT4 Quantization:** Quantization-Aware Training (QAT) is employed in the post-training stage to achieve lossless 2x speed-up in low-latency mode.
```
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment on lines +6 to +7
## Description
Kimi K2 Thinking is the latest, most capable version of open-source thinking model. Starting with Kimi K2, we built it as a thinking agent that reasons step-by-step while dynamically invoking tools. It sets a new state-of-the-art on Humanity's Last Exam (HLE), BrowseComp, and other benchmarks by dramatically scaling multi-step reasoning depth and maintaining stable tool-use across 200–300 sequential calls. At the same time, K2 Thinking is a native INT4 quantization model with 256k context window, achieving lossless reductions in inference latency and GPU memory usage.
Contributor

issue (typo): Consider fixing the article usage and the 'k/K' inconsistency in this description sentence.

For example: "Kimi K2 Thinking is the latest, most capable version of an open-source thinking model ... with a 256K context window," which adds the missing article before "open-source thinking model" and matches the "256K" capitalization used later in the table.

Suggested change
## Description
Kimi K2 Thinking is the latest, most capable version of open-source thinking model. Starting with Kimi K2, we built it as a thinking agent that reasons step-by-step while dynamically invoking tools. It sets a new state-of-the-art on Humanity's Last Exam (HLE), BrowseComp, and other benchmarks by dramatically scaling multi-step reasoning depth and maintaining stable tool-use across 200–300 sequential calls. At the same time, K2 Thinking is a native INT4 quantization model with 256k context window, achieving lossless reductions in inference latency and GPU memory usage.
## Description
Kimi K2 Thinking is the latest, most capable version of an open-source thinking model. Starting with Kimi K2, we built it as a thinking agent that reasons step-by-step while dynamically invoking tools. It sets a new state-of-the-art on Humanity's Last Exam (HLE), BrowseComp, and other benchmarks by dramatically scaling multi-step reasoning depth and maintaining stable tool-use across 200–300 sequential calls. At the same time, K2 Thinking is a native INT4 quantization model with a 256K context window, achieving lossless reductions in inference latency and GPU memory usage.


## Key Features
- **Deep Thinking & Tool Orchestration:** End-to-end trained to interleave chain-of-thought reasoning with function calls, enabling autonomous research, coding, and writing workflows that last hundreds of steps without drift.
- **Native INT4 Quantization:** Quantization-Aware Training (QAT) is employed in post-training stage to achieve lossless 2x speed-up in low-latency mode.
Contributor

nitpick (typo): Add a definite article in 'in post-training stage' for smoother grammar.

For example: "is employed in the post-training stage" or "during the post-training stage".

Suggested change
- **Native INT4 Quantization:** Quantization-Aware Training (QAT) is employed in post-training stage to achieve lossless 2x speed-up in low-latency mode.
- **Native INT4 Quantization:** Quantization-Aware Training (QAT) is employed in the post-training stage to achieve lossless 2x speed-up in low-latency mode.

![logo](https://statics.moonshot.cn/kimi-blog/assets/logo-CvjirWOb.svg)

## Description
Kimi K2 Thinking is the latest, most capable version of open-source thinking model. Starting with Kimi K2, we built it as a thinking agent that reasons step-by-step while dynamically invoking tools. It sets a new state-of-the-art on Humanity's Last Exam (HLE), BrowseComp, and other benchmarks by dramatically scaling multi-step reasoning depth and maintaining stable tool-use across 200–300 sequential calls. At the same time, K2 Thinking is a native INT4 quantization model with 256k context window, achieving lossless reductions in inference latency and GPU memory usage.
Contributor

issue (typo): Same as the other file: fix the missing article and '256k' capitalization in this sentence.

Use "version of an open-source thinking model" and "a 256K context window" to align grammar and capitalization with the table.

Suggested change
Kimi K2 Thinking is the latest, most capable version of open-source thinking model. Starting with Kimi K2, we built it as a thinking agent that reasons step-by-step while dynamically invoking tools. It sets a new state-of-the-art on Humanity's Last Exam (HLE), BrowseComp, and other benchmarks by dramatically scaling multi-step reasoning depth and maintaining stable tool-use across 200–300 sequential calls. At the same time, K2 Thinking is a native INT4 quantization model with 256k context window, achieving lossless reductions in inference latency and GPU memory usage.
Kimi K2 Thinking is the latest, most capable version of an open-source thinking model. Starting with Kimi K2, we built it as a thinking agent that reasons step-by-step while dynamically invoking tools. It sets a new state-of-the-art on Humanity's Last Exam (HLE), BrowseComp, and other benchmarks by dramatically scaling multi-step reasoning depth and maintaining stable tool-use across 200–300 sequential calls. At the same time, K2 Thinking is a native INT4 quantization model with a 256K context window, achieving lossless reductions in inference latency and GPU memory usage.


## Key Features
- **Deep Thinking & Tool Orchestration:** End-to-end trained to interleave chain-of-thought reasoning with function calls, enabling autonomous research, coding, and writing workflows that last hundreds of steps without drift.
- **Native INT4 Quantization:** Quantization-Aware Training (QAT) is employed in post-training stage to achieve lossless 2x speed-up in low-latency mode.
Contributor

nitpick (typo): As in the other file, 'in post-training stage' reads more naturally with 'the'.

Suggested change
- **Native INT4 Quantization:** Quantization-Aware Training (QAT) is employed in post-training stage to achieve lossless 2x speed-up in low-latency mode.
- **Native INT4 Quantization:** Quantization-Aware Training (QAT) is employed in the post-training stage to achieve lossless 2x speed-up in low-latency mode.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds documentation for the Kimi K2 model. While the information is valuable, there are a few major issues. The two new files, ai/kimi-k2-vllm.md and ai/kimi-k2.md, have almost identical content, which will create a maintenance burden. I recommend consolidating them into a single file with sections for each variant (vLLM and GGUF) if needed, perhaps using the Available model variants section from the template. More importantly, neither file follows the repository's template.md for model documentation. I've left specific comments on how to align with the template. Additionally, the benchmark tables contain unexplained asterisks, which makes the data difficult to interpret. Please address these points to ensure consistency and clarity.

Comment on lines +1 to +92
# Kimi K2

![logo](https://statics.moonshot.cn/kimi-blog/assets/logo-CvjirWOb.svg)

## Description
Kimi K2 Thinking is the latest, most capable version of open-source thinking model. Starting with Kimi K2, we built it as a thinking agent that reasons step-by-step while dynamically invoking tools. It sets a new state-of-the-art on Humanity's Last Exam (HLE), BrowseComp, and other benchmarks by dramatically scaling multi-step reasoning depth and maintaining stable tool-use across 200–300 sequential calls. At the same time, K2 Thinking is a native INT4 quantization model with 256k context window, achieving lossless reductions in inference latency and GPU memory usage.


## Key Features
- **Deep Thinking & Tool Orchestration:** End-to-end trained to interleave chain-of-thought reasoning with function calls, enabling autonomous research, coding, and writing workflows that last hundreds of steps without drift.
- **Native INT4 Quantization:** Quantization-Aware Training (QAT) is employed in post-training stage to achieve lossless 2x speed-up in low-latency mode.
- **Stable Long-Horizon Agency:** Maintains coherent goal-directed behavior across up to 200–300 consecutive tool invocations, surpassing prior models that degrade after 30–50 steps.

| **Field** | **Value** |
|-----------------------------------------|--------------------------|
| Architecture | Mixture-of-Experts (MoE) |
| Total Parameters | 1T |
| Activated Parameters | 32B |
| Number of Layers (Dense layer included) | 61 |
| Number of Dense Layers | 1 |
| Attention Hidden Dimension | 7168 |
| MoE Hidden Dimension (per Expert) | 2048 |
| Number of Attention Heads | 64 |
| Number of Experts | 384 |
| Selected Experts per Token | 8 |
| Number of Shared Experts | 1 |
| Vocabulary Size | 160K |
| Context Length | 256K |
| Attention Mechanism | MLA |
| Activation Function | SwiGLU |


## Use this AI model with Docker Model Runner

```bash
docker model run kimi-k2-vllm
```

## Benchmarks

### Reasoning Tasks
| Benchmark | Setting | K2 Thinking | GPT-5 (High) | Claude Sonnet 4.5 | K2 0905 (Thinking) | DeepSeek-V3.2 | Grok-4 |
|-----------------|-----------|-------------|--------------|-------------------|--------------------|---------------|--------|
| HLE | no tools | 23.9 | 26.3 | 19.8* | 7.9 | 19.8 | 25.4 |
| HLE | w/ tools | 44.9 | 41.7* | 32.0* | 21.7 | 20.3* | 41.0 |
| HLE | heavy | 51.0 | 42.0 | - | - | - | 50.7 |
| AIME25 | no tools | 94.5 | 94.6 | 87.0 | 51.0 | 89.3 | 91.7 |
| AIME25 | w/ python | 99.1 | 99.6 | 100.0 | 75.2 | 58.1* | 98.8 |
| AIME25 | heavy | 100.0 | 100.0 | - | - | - | 100.0 |
| HMMT25 | no tools | 89.4 | 93.3 | 74.6* | 38.8 | 83.6 | 90.0 |
| HMMT25 | w/ python | 95.1 | 96.7 | 88.8* | 70.4 | 49.5* | 93.9 |
| HMMT25 | heavy | 97.5 | 100.0 | - | - | - | 96.7 |
| IMO-AnswerBench | no tools | 78.6 | 76.0* | 65.9* | 45.8 | 76.0* | 73.1 |
| GPQA | no tools | 84.5 | 85.7 | 83.4 | 74.2 | 79.9 | 87.5 |


### General Tasks

| Benchmark | Setting | K2 Thinking | GPT-5 (High) | Claude Sonnet 4.5 | K2 0905 (Thinking) | DeepSeek-V3.2 |
|------------------|----------|-------------|--------------|-------------------|--------------------|---------------|
| MMLU-Pro | no tools | 84.6 | 87.1 | 87.5 | 81.9 | 85.0 |
| MMLU-Redux | no tools | 94.4 | 95.3 | 95.6 | 92.7 | 93.7 |
| Longform Writing | no tools | 73.8 | 71.4 | 79.8 | 62.8 | 72.5 |
| HealthBench | no tools | 58.0 | 67.2 | 44.2 | 43.8 | 46.9 |


### Agentic Search Tasks

| Benchmark | Setting | K2 Thinking | GPT-5 (High) | Claude Sonnet 4.5 | K2 0905 (Thinking) | DeepSeek-V3.2 |
|------------------|----------|-------------|--------------|-------------------|--------------------|---------------|
| BrowseComp | w/ tools | 60.2 | 54.9 | 24.1 | 7.4 | 40.1 |
| BrowseComp-ZH | w/ tools | 62.3 | 63.0* | 42.4* | 22.2 | 47.9 |
| Seal-0 | w/ tools | 56.3 | 51.4* | 53.4* | 25.2 | 38.5* |
| FinSearchComp-T3 | w/ tools | 47.4 | 48.5* | 44.0* | 10.4 | 27.0* |
| Frames | w/ tools | 87.0 | 86.0* | 85.0* | 58.1 | 80.2* |


### Coding Tasks

| Benchmark | Setting | K2 Thinking | GPT-5 (High) | Claude Sonnet 4.5 | K2 0905 (Thinking) | DeepSeek-V3.2 |
|------------------------|---------------------------|-------------|--------------|-------------------|--------------------|---------------|
| SWE-bench Verified | w/ tools | 71.3 | 74.9 | 77.2 | 69.2 | 67.8 |
| SWE-bench Multilingual | w/ tools | 61.1 | 55.3* | 68.0 | 55.9 | 57.9 |
| Multi-SWE-bench | w/ tools | 41.9 | 39.3* | 44.3 | 33.5 | 30.6 |
| SciCode | no tools | 44.8 | 42.9 | 44.7 | 30.7 | 37.7 |
| LiveCodeBenchV6 | no tools | 83.1 | 87.0* | 64.0* | 56.1* | 74.1 |
| OJ-Bench (cpp) | no tools | 48.7 | 56.2* | 30.4* | 25.5* | 38.2* |
| Terminal-Bench | w/ simulated tools (JSON) | 47.1 | 43.8 | 51.0 | 44.5 | 37.7 |

## Links
- https://moonshotai.github.io/Kimi-K2/thinking.html
- https://huggingface.co/moonshotai/Kimi-K2-Thinking
Contributor

high

The structure of this document does not follow the established template.md for model pages. For consistency across the project, please restructure this file to match the template. Key missing or mismatched sections include:

  • Characteristics: The provided table is different. Please use the format from the template and include fields like Provider, Cutoff date, Languages, Tool calling, License, etc.
  • Available model variants: This section is missing.
  • Considerations: This section is missing.
  • Benchmark performance: The format of benchmark reporting is different from the template.

Adhering to the template is important for maintainability and user experience.

Comment on lines +1 to +95
# Kimi K2
*GGUF version by Unsloth*

![logo](https://statics.moonshot.cn/kimi-blog/assets/logo-CvjirWOb.svg)

## Description
Kimi K2 Thinking is the latest, most capable version of open-source thinking model. Starting with Kimi K2, we built it as a thinking agent that reasons step-by-step while dynamically invoking tools. It sets a new state-of-the-art on Humanity's Last Exam (HLE), BrowseComp, and other benchmarks by dramatically scaling multi-step reasoning depth and maintaining stable tool-use across 200–300 sequential calls. At the same time, K2 Thinking is a native INT4 quantization model with 256k context window, achieving lossless reductions in inference latency and GPU memory usage.


## Key Features
- **Deep Thinking & Tool Orchestration:** End-to-end trained to interleave chain-of-thought reasoning with function calls, enabling autonomous research, coding, and writing workflows that last hundreds of steps without drift.
- **Native INT4 Quantization:** Quantization-Aware Training (QAT) is employed in post-training stage to achieve lossless 2x speed-up in low-latency mode.
- **Stable Long-Horizon Agency:** Maintains coherent goal-directed behavior across up to 200–300 consecutive tool invocations, surpassing prior models that degrade after 30–50 steps.

| **Field** | **Value** |
|-----------------------------------------|--------------------------|
| Architecture | Mixture-of-Experts (MoE) |
| Total Parameters | 1T |
| Activated Parameters | 32B |
| Number of Layers (Dense layer included) | 61 |
| Number of Dense Layers | 1 |
| Attention Hidden Dimension | 7168 |
| MoE Hidden Dimension (per Expert) | 2048 |
| Number of Attention Heads | 64 |
| Number of Experts | 384 |
| Selected Experts per Token | 8 |
| Number of Shared Experts | 1 |
| Vocabulary Size | 160K |
| Context Length | 256K |
| Attention Mechanism | MLA |
| Activation Function | SwiGLU |


## Use this AI model with Docker Model Runner

```bash
docker model run kimi-k2
```

## Benchmarks

### Reasoning Tasks
| Benchmark | Setting | K2 Thinking | GPT-5 (High) | Claude Sonnet 4.5 | K2 0905 (Thinking) | DeepSeek-V3.2 | Grok-4 |
|-----------------|-----------|-------------|--------------|-------------------|--------------------|---------------|--------|
| HLE | no tools | 23.9 | 26.3 | 19.8* | 7.9 | 19.8 | 25.4 |
| HLE | w/ tools | 44.9 | 41.7* | 32.0* | 21.7 | 20.3* | 41.0 |
| HLE | heavy | 51.0 | 42.0 | - | - | - | 50.7 |
| AIME25 | no tools | 94.5 | 94.6 | 87.0 | 51.0 | 89.3 | 91.7 |
| AIME25 | w/ python | 99.1 | 99.6 | 100.0 | 75.2 | 58.1* | 98.8 |
| AIME25 | heavy | 100.0 | 100.0 | - | - | - | 100.0 |
| HMMT25 | no tools | 89.4 | 93.3 | 74.6* | 38.8 | 83.6 | 90.0 |
| HMMT25 | w/ python | 95.1 | 96.7 | 88.8* | 70.4 | 49.5* | 93.9 |
| HMMT25 | heavy | 97.5 | 100.0 | - | - | - | 96.7 |
| IMO-AnswerBench | no tools | 78.6 | 76.0* | 65.9* | 45.8 | 76.0* | 73.1 |
| GPQA | no tools | 84.5 | 85.7 | 83.4 | 74.2 | 79.9 | 87.5 |


### General Tasks

| Benchmark | Setting | K2 Thinking | GPT-5 (High) | Claude Sonnet 4.5 | K2 0905 (Thinking) | DeepSeek-V3.2 |
|------------------|----------|-------------|--------------|-------------------|--------------------|---------------|
| MMLU-Pro | no tools | 84.6 | 87.1 | 87.5 | 81.9 | 85.0 |
| MMLU-Redux | no tools | 94.4 | 95.3 | 95.6 | 92.7 | 93.7 |
| Longform Writing | no tools | 73.8 | 71.4 | 79.8 | 62.8 | 72.5 |
| HealthBench | no tools | 58.0 | 67.2 | 44.2 | 43.8 | 46.9 |


### Agentic Search Tasks

| Benchmark | Setting | K2 Thinking | GPT-5 (High) | Claude Sonnet 4.5 | K2 0905 (Thinking) | DeepSeek-V3.2 |
|------------------|----------|-------------|--------------|-------------------|--------------------|---------------|
| BrowseComp | w/ tools | 60.2 | 54.9 | 24.1 | 7.4 | 40.1 |
| BrowseComp-ZH | w/ tools | 62.3 | 63.0* | 42.4* | 22.2 | 47.9 |
| Seal-0 | w/ tools | 56.3 | 51.4* | 53.4* | 25.2 | 38.5* |
| FinSearchComp-T3 | w/ tools | 47.4 | 48.5* | 44.0* | 10.4 | 27.0* |
| Frames | w/ tools | 87.0 | 86.0* | 85.0* | 58.1 | 80.2* |


### Coding Tasks

| Benchmark | Setting | K2 Thinking | GPT-5 (High) | Claude Sonnet 4.5 | K2 0905 (Thinking) | DeepSeek-V3.2 |
|------------------------|---------------------------|-------------|--------------|-------------------|--------------------|---------------|
| SWE-bench Verified | w/ tools | 71.3 | 74.9 | 77.2 | 69.2 | 67.8 |
| SWE-bench Multilingual | w/ tools | 61.1 | 55.3* | 68.0 | 55.9 | 57.9 |
| Multi-SWE-bench | w/ tools | 41.9 | 39.3* | 44.3 | 33.5 | 30.6 |
| SciCode | no tools | 44.8 | 42.9 | 44.7 | 30.7 | 37.7 |
| LiveCodeBenchV6 | no tools | 83.1 | 87.0* | 64.0* | 56.1* | 74.1 |
| OJ-Bench (cpp) | no tools | 48.7 | 56.2* | 30.4* | 25.5* | 38.2* |
| Terminal-Bench | w/ simulated tools (JSON) | 47.1 | 43.8 | 51.0 | 44.5 | 37.7 |

## Links
- https://moonshotai.github.io/Kimi-K2/thinking.html
- https://huggingface.co/moonshotai/Kimi-K2-Thinking
- [Hugging Face (Unsloth GGUF)](https://huggingface.co/unsloth/Kimi-K2-Thinking-GGUF)
- [Unsloth Dynamic 2.0 GGUF](https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs)
Contributor

high

The structure of this document does not follow the established template.md for model pages. For consistency across the project, please restructure this file to match the template. Key missing or mismatched sections include:

  • Characteristics: The provided table is different. Please use the format from the template and include fields like Provider, Cutoff date, Languages, Tool calling, License, etc.
  • Available model variants: This section is missing. The subtitle on line 2 (*GGUF version by Unsloth*) should likely be part of this section.
  • Considerations: This section is missing.
  • Benchmark performance: The format of benchmark reporting is different from the template.

Adhering to the template is important for maintainability and user experience.

## Benchmarks

### Reasoning Tasks
| Benchmark | Setting | K2 Thinking | GPT-5 (High) | Claude Sonnet 4.5 | K2 0905 (Thinking) | DeepSeek-V3.2 | Grok-4 |
Contributor

medium

The benchmark tables use asterisks (*) next to some values (e.g., 19.8* on line 44), but there's no explanation for what the asterisk signifies. Please add a footnote or a note to clarify its meaning. Without this, the benchmark data is ambiguous.

## Benchmarks

### Reasoning Tasks
| Benchmark | Setting | K2 Thinking | GPT-5 (High) | Claude Sonnet 4.5 | K2 0905 (Thinking) | DeepSeek-V3.2 | Grok-4 |
Contributor

medium

The benchmark tables use asterisks (*) next to some values (e.g., 19.8* on line 45), but there's no explanation for what the asterisk signifies. Please add a footnote or a note to clarify its meaning. Without this, the benchmark data is ambiguous.
