add Kimi K2 model documentation and benchmarks #64

# Kimi K2

## Description
Kimi K2 Thinking is the latest, most capable version of an open-source thinking model. Starting with Kimi K2, we built it as a thinking agent that reasons step by step while dynamically invoking tools. It sets a new state of the art on Humanity's Last Exam (HLE), BrowseComp, and other benchmarks by dramatically scaling multi-step reasoning depth and maintaining stable tool use across 200–300 sequential calls. At the same time, K2 Thinking is a natively INT4-quantized model with a 256K context window, cutting inference latency and GPU memory usage without loss of quality.

## Key Features
- **Deep Thinking & Tool Orchestration:** End-to-end trained to interleave chain-of-thought reasoning with function calls, enabling autonomous research, coding, and writing workflows that run for hundreds of steps without drift (see the request sketch after this list).
- **Native INT4 Quantization:** Quantization-Aware Training (QAT) is employed in the post-training stage to achieve a lossless 2x speed-up in low-latency mode.
- **Stable Long-Horizon Agency:** Maintains coherent, goal-directed behavior across up to 200–300 consecutive tool invocations, surpassing prior models that degrade after 30–50 steps.
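
To make the tool-orchestration loop concrete, here is a minimal request sketch against the OpenAI-compatible API that Docker Model Runner exposes once the model is running (see the run command further down). The localhost port 12434, the endpoint path, and the `get_weather` function are illustrative assumptions, not part of the model's documentation:

```bash
# Hedged sketch: advertise one function and let K2 Thinking decide to call it.
# Endpoint/port reflect Docker Model Runner's default TCP setup and may differ.
curl -s http://localhost:12434/engines/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kimi-k2-vllm",
    "messages": [{"role": "user", "content": "What is the weather in Paris right now?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Illustrative tool: look up current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
```

An agent loop would execute any returned `tool_calls`, append each result as a `tool` message, and re-send; K2 Thinking is trained to keep this loop coherent for hundreds of iterations.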

| **Field**                               | **Value**                |
|-----------------------------------------|--------------------------|
| Architecture                            | Mixture-of-Experts (MoE) |
| Total Parameters                        | 1T                       |
| Activated Parameters                    | 32B                      |
| Number of Layers (Dense layer included) | 61                       |
| Number of Dense Layers                  | 1                        |
| Attention Hidden Dimension              | 7168                     |
| MoE Hidden Dimension (per Expert)       | 2048                     |
| Number of Attention Heads               | 64                       |
| Number of Experts                       | 384                      |
| Selected Experts per Token              | 8                        |
| Number of Shared Experts                | 1                        |
| Vocabulary Size                         | 160K                     |
| Context Length                          | 256K                     |
| Attention Mechanism                     | MLA                      |
| Activation Function                     | SwiGLU                   |
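
As a rough, hedged sanity check of how these numbers relate to the 32B activated-parameter figure (the exact per-layer breakdown is not published here), assume a SwiGLU expert FFN uses three 7168x2048 projections and that 8 routed plus 1 shared expert fire in each of the 60 MoE layers:

```bash
# Back-of-envelope estimate only; MLA attention, embeddings (160K x 7168),
# and the single dense layer account for the parameters not counted here.
PER_EXPERT=$((3 * 7168 * 2048))   # three SwiGLU projections per expert (~44M)
PER_LAYER=$((9 * PER_EXPERT))     # 8 selected + 1 shared expert (~396M)
MOE_ACTIVE=$((60 * PER_LAYER))    # 60 MoE layers out of 61 total
echo "Approximate active expert parameters: $MOE_ACTIVE"   # ~23.8B of ~32B
```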

## Use this AI model with Docker Model Runner

```bash
docker model run kimi-k2-vllm
```
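
Once the model is serving, requests can be sent through Docker Model Runner's OpenAI-compatible API. A minimal sketch, assuming the default TCP endpoint on localhost port 12434 (the port and path depend on how Model Runner is configured):

```bash
# Hedged example: basic chat completion against the locally served model.
curl -s http://localhost:12434/engines/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kimi-k2-vllm",
    "messages": [
      {"role": "user", "content": "Summarize the Kimi K2 Thinking architecture in two sentences."}
    ]
  }'
```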

## Benchmarks

### Reasoning Tasks
| Benchmark       | Setting   | K2 Thinking | GPT-5 (High) | Claude Sonnet 4.5 | K2 0905 (Thinking) | DeepSeek-V3.2 | Grok-4 |
|-----------------|-----------|-------------|--------------|-------------------|--------------------|---------------|--------|
| HLE             | no tools  | 23.9        | 26.3         | 19.8*             | 7.9                | 19.8          | 25.4   |
| HLE             | w/ tools  | 44.9        | 41.7*        | 32.0*             | 21.7               | 20.3*         | 41.0   |
| HLE             | heavy     | 51.0        | 42.0         | -                 | -                  | -             | 50.7   |
| AIME25          | no tools  | 94.5        | 94.6         | 87.0              | 51.0               | 89.3          | 91.7   |
| AIME25          | w/ python | 99.1        | 99.6         | 100.0             | 75.2               | 58.1*         | 98.8   |
| AIME25          | heavy     | 100.0       | 100.0        | -                 | -                  | -             | 100.0  |
| HMMT25          | no tools  | 89.4        | 93.3         | 74.6*             | 38.8               | 83.6          | 90.0   |
| HMMT25          | w/ python | 95.1        | 96.7         | 88.8*             | 70.4               | 49.5*         | 93.9   |
| HMMT25          | heavy     | 97.5        | 100.0        | -                 | -                  | -             | 96.7   |
| IMO-AnswerBench | no tools  | 78.6        | 76.0*        | 65.9*             | 45.8               | 76.0*         | 73.1   |
| GPQA            | no tools  | 84.5        | 85.7         | 83.4              | 74.2               | 79.9          | 87.5   |

### General Tasks

| Benchmark        | Setting  | K2 Thinking | GPT-5 (High) | Claude Sonnet 4.5 | K2 0905 (Thinking) | DeepSeek-V3.2 |
|------------------|----------|-------------|--------------|-------------------|--------------------|---------------|
| MMLU-Pro         | no tools | 84.6        | 87.1         | 87.5              | 81.9               | 85.0          |
| MMLU-Redux       | no tools | 94.4        | 95.3         | 95.6              | 92.7               | 93.7          |
| Longform Writing | no tools | 73.8        | 71.4         | 79.8              | 62.8               | 72.5          |
| HealthBench      | no tools | 58.0        | 67.2         | 44.2              | 43.8               | 46.9          |

### Agentic Search Tasks

| Benchmark        | Setting  | K2 Thinking | GPT-5 (High) | Claude Sonnet 4.5 | K2 0905 (Thinking) | DeepSeek-V3.2 |
|------------------|----------|-------------|--------------|-------------------|--------------------|---------------|
| BrowseComp       | w/ tools | 60.2        | 54.9         | 24.1              | 7.4                | 40.1          |
| BrowseComp-ZH    | w/ tools | 62.3        | 63.0*        | 42.4*             | 22.2               | 47.9          |
| Seal-0           | w/ tools | 56.3        | 51.4*        | 53.4*             | 25.2               | 38.5*         |
| FinSearchComp-T3 | w/ tools | 47.4        | 48.5*        | 44.0*             | 10.4               | 27.0*         |
| Frames           | w/ tools | 87.0        | 86.0*        | 85.0*             | 58.1               | 80.2*         |

### Coding Tasks

| Benchmark              | Setting                   | K2 Thinking | GPT-5 (High) | Claude Sonnet 4.5 | K2 0905 (Thinking) | DeepSeek-V3.2 |
|------------------------|---------------------------|-------------|--------------|-------------------|--------------------|---------------|
| SWE-bench Verified     | w/ tools                  | 71.3        | 74.9         | 77.2              | 69.2               | 67.8          |
| SWE-bench Multilingual | w/ tools                  | 61.1        | 55.3*        | 68.0              | 55.9               | 57.9          |
| Multi-SWE-bench        | w/ tools                  | 41.9        | 39.3*        | 44.3              | 33.5               | 30.6          |
| SciCode                | no tools                  | 44.8        | 42.9         | 44.7              | 30.7               | 37.7          |
| LiveCodeBenchV6        | no tools                  | 83.1        | 87.0*        | 64.0*             | 56.1*              | 74.1          |
| OJ-Bench (cpp)         | no tools                  | 48.7        | 56.2*        | 30.4*             | 25.5*              | 38.2*         |
| Terminal-Bench         | w/ simulated tools (JSON) | 47.1        | 43.8         | 51.0              | 44.5               | 37.7          |

## Links
- https://moonshotai.github.io/Kimi-K2/thinking.html
- https://huggingface.co/moonshotai/Kimi-K2-Thinking

---

# Kimi K2
*GGUF version by Unsloth*

## Description
Kimi K2 Thinking is the latest, most capable version of an open-source thinking model. Starting with Kimi K2, we built it as a thinking agent that reasons step by step while dynamically invoking tools. It sets a new state of the art on Humanity's Last Exam (HLE), BrowseComp, and other benchmarks by dramatically scaling multi-step reasoning depth and maintaining stable tool use across 200–300 sequential calls. At the same time, K2 Thinking is a natively INT4-quantized model with a 256K context window, cutting inference latency and GPU memory usage without loss of quality.

## Key Features
- **Deep Thinking & Tool Orchestration:** End-to-end trained to interleave chain-of-thought reasoning with function calls, enabling autonomous research, coding, and writing workflows that run for hundreds of steps without drift.
- **Native INT4 Quantization:** Quantization-Aware Training (QAT) is employed in the post-training stage to achieve a lossless 2x speed-up in low-latency mode.
- **Stable Long-Horizon Agency:** Maintains coherent, goal-directed behavior across up to 200–300 consecutive tool invocations, surpassing prior models that degrade after 30–50 steps.

| **Field**                               | **Value**                |
|-----------------------------------------|--------------------------|
| Architecture                            | Mixture-of-Experts (MoE) |
| Total Parameters                        | 1T                       |
| Activated Parameters                    | 32B                      |
| Number of Layers (Dense layer included) | 61                       |
| Number of Dense Layers                  | 1                        |
| Attention Hidden Dimension              | 7168                     |
| MoE Hidden Dimension (per Expert)       | 2048                     |
| Number of Attention Heads               | 64                       |
| Number of Experts                       | 384                      |
| Selected Experts per Token              | 8                        |
| Number of Shared Experts                | 1                        |
| Vocabulary Size                         | 160K                     |
| Context Length                          | 256K                     |
| Attention Mechanism                     | MLA                      |
| Activation Function                     | SwiGLU                   |

## Use this AI model with Docker Model Runner

```bash
docker model run kimi-k2
```
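
If the `kimi-k2` name is not available in your configured registry, Docker Model Runner can also pull GGUF weights directly from Hugging Face. A hedged sketch using the Unsloth repository linked below (confirm the exact repo reference and quantization tag on the Hugging Face page first):

```bash
# Hedged sketch: fetch the Unsloth GGUF build via Model Runner's hf.co
# reference support, then serve it; tag/quantization naming may vary by release.
docker model pull hf.co/unsloth/Kimi-K2-Thinking-GGUF
docker model run hf.co/unsloth/Kimi-K2-Thinking-GGUF
```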

## Benchmarks

### Reasoning Tasks
| Benchmark       | Setting   | K2 Thinking | GPT-5 (High) | Claude Sonnet 4.5 | K2 0905 (Thinking) | DeepSeek-V3.2 | Grok-4 |
|-----------------|-----------|-------------|--------------|-------------------|--------------------|---------------|--------|
| HLE             | no tools  | 23.9        | 26.3         | 19.8*             | 7.9                | 19.8          | 25.4   |
| HLE             | w/ tools  | 44.9        | 41.7*        | 32.0*             | 21.7               | 20.3*         | 41.0   |
| HLE             | heavy     | 51.0        | 42.0         | -                 | -                  | -             | 50.7   |
| AIME25          | no tools  | 94.5        | 94.6         | 87.0              | 51.0               | 89.3          | 91.7   |
| AIME25          | w/ python | 99.1        | 99.6         | 100.0             | 75.2               | 58.1*         | 98.8   |
| AIME25          | heavy     | 100.0       | 100.0        | -                 | -                  | -             | 100.0  |
| HMMT25          | no tools  | 89.4        | 93.3         | 74.6*             | 38.8               | 83.6          | 90.0   |
| HMMT25          | w/ python | 95.1        | 96.7         | 88.8*             | 70.4               | 49.5*         | 93.9   |
| HMMT25          | heavy     | 97.5        | 100.0        | -                 | -                  | -             | 96.7   |
| IMO-AnswerBench | no tools  | 78.6        | 76.0*        | 65.9*             | 45.8               | 76.0*         | 73.1   |
| GPQA            | no tools  | 84.5        | 85.7         | 83.4              | 74.2               | 79.9          | 87.5   |

### General Tasks

| Benchmark        | Setting  | K2 Thinking | GPT-5 (High) | Claude Sonnet 4.5 | K2 0905 (Thinking) | DeepSeek-V3.2 |
|------------------|----------|-------------|--------------|-------------------|--------------------|---------------|
| MMLU-Pro         | no tools | 84.6        | 87.1         | 87.5              | 81.9               | 85.0          |
| MMLU-Redux       | no tools | 94.4        | 95.3         | 95.6              | 92.7               | 93.7          |
| Longform Writing | no tools | 73.8        | 71.4         | 79.8              | 62.8               | 72.5          |
| HealthBench      | no tools | 58.0        | 67.2         | 44.2              | 43.8               | 46.9          |

### Agentic Search Tasks

| Benchmark        | Setting  | K2 Thinking | GPT-5 (High) | Claude Sonnet 4.5 | K2 0905 (Thinking) | DeepSeek-V3.2 |
|------------------|----------|-------------|--------------|-------------------|--------------------|---------------|
| BrowseComp       | w/ tools | 60.2        | 54.9         | 24.1              | 7.4                | 40.1          |
| BrowseComp-ZH    | w/ tools | 62.3        | 63.0*        | 42.4*             | 22.2               | 47.9          |
| Seal-0           | w/ tools | 56.3        | 51.4*        | 53.4*             | 25.2               | 38.5*         |
| FinSearchComp-T3 | w/ tools | 47.4        | 48.5*        | 44.0*             | 10.4               | 27.0*         |
| Frames           | w/ tools | 87.0        | 86.0*        | 85.0*             | 58.1               | 80.2*         |

### Coding Tasks

| Benchmark              | Setting                   | K2 Thinking | GPT-5 (High) | Claude Sonnet 4.5 | K2 0905 (Thinking) | DeepSeek-V3.2 |
|------------------------|---------------------------|-------------|--------------|-------------------|--------------------|---------------|
| SWE-bench Verified     | w/ tools                  | 71.3        | 74.9         | 77.2              | 69.2               | 67.8          |
| SWE-bench Multilingual | w/ tools                  | 61.1        | 55.3*        | 68.0              | 55.9               | 57.9          |
| Multi-SWE-bench        | w/ tools                  | 41.9        | 39.3*        | 44.3              | 33.5               | 30.6          |
| SciCode                | no tools                  | 44.8        | 42.9         | 44.7              | 30.7               | 37.7          |
| LiveCodeBenchV6        | no tools                  | 83.1        | 87.0*        | 64.0*             | 56.1*              | 74.1          |
| OJ-Bench (cpp)         | no tools                  | 48.7        | 56.2*        | 30.4*             | 25.5*              | 38.2*         |
| Terminal-Bench         | w/ simulated tools (JSON) | 47.1        | 43.8         | 51.0              | 44.5               | 37.7          |

## Links
- https://moonshotai.github.io/Kimi-K2/thinking.html
- https://huggingface.co/moonshotai/Kimi-K2-Thinking
- [Hugging Face (Unsloth GGUF)](https://huggingface.co/unsloth/Kimi-K2-Thinking-GGUF)
- [Unsloth Dynamic 2.0 GGUF](https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs)