Update README.md
README.md
CHANGED
@@ -12,7 +12,6 @@ library_name: transformers
 
 <div align="center" style="line-height:1">
 <a href="https://www.kimi.com" target="_blank"><img alt="Chat" src="https://img.shields.io/badge/🤖%20Chat-Kimi%20K2-ff6b6b?color=1783ff&logoColor=white"/></a>
-<a href="https://github.com/moonshotai/Kimi-K2"><img alt="github" src="https://img.shields.io/badge/🤖%20Github-Kimi%20K2-ff6b6b?color=1783ff&logoColor=white"/></a>
 <a href="https://www.moonshot.ai" target="_blank"><img alt="Homepage" src="https://img.shields.io/badge/Homepage-Moonshot%20AI-white?logo=Kimi&logoColor=white"/></a>
 </div>
 
@@ -68,7 +67,7 @@ Kimi K2 Thinking is the latest, most capable version of open-source thinking mod
 
 **Reasoning Tasks**
 | Benchmark | Setting | K2 Thinking | GPT-5 | Claude Sonnet 4.5<br> (Thinking) | K2 0905 | DeepSeek-V3.2 | Grok-4 |
-
+|:----------:|:--------:|:------------:|:------:|:----------------------------:|:--------:|:--------------:|:-------:|
 | **HLE (Text-only)** | no tools | 23.9 | 26.3 | 19.8* | 7.9 | 19.8 | 25.4 |
 | | w/ tools | 44.9 | 41.7* | 32.0* | 21.7 | 20.3* | 41.0 |
 | | heavy | 51.0 | 42.0 | - | - | - | 50.7 |
@@ -83,7 +82,7 @@ Kimi K2 Thinking is the latest, most capable version of open-source thinking mod
 
 **General Tasks**
 | Benchmark | Setting | K2 Thinking | GPT-5 | Claude Sonnet 4.5<br> (Thinking) | K2 0905 | DeepSeek-V3.2 |
-
+|:----------:|:--------:|:------------:|:------:|:----------------------------:|:--------:|:--------------:|
 | **MMLU-Pro** | no tools | 84.6 | 87.1 | 87.5 | 81.9 | 85.0 |
 | **MMLU-Redux** | no tools | 94.4 | 95.3 | 95.6 | 92.7 | 93.7 |
 | **Longform Writing** | no tools | 73.8 | 71.4 | 79.8 | 62.8 | 72.5 |
@@ -91,7 +90,7 @@ Kimi K2 Thinking is the latest, most capable version of open-source thinking mod
 
 **Agentic Search Tasks**
 | Benchmark | Setting | K2 Thinking | GPT-5 | Claude Sonnet 4.5<br> (Thinking) | K2 0905 | DeepSeek-V3.2 |
-
+|:----------:|:--------:|:------------:|:------:|:----------------------------:|:--------:|:--------------:|
 | **BrowseComp** | w/ tools | 60.2 | 54.9 | 24.1 | 7.4 | 40.1 |
 | **BrowseComp-ZH** | w/ tools | 62.3 | 63.0* | 42.4* | 22.2 | 47.9 |
 | **Seal-0** | w/ tools | 56.3 | 51.4* | 53.4* | 25.2 | 38.5* |
@@ -100,7 +99,7 @@ Kimi K2 Thinking is the latest, most capable version of open-source thinking mod
 
 **Coding Tasks**
 | Benchmark | Setting | K2 Thinking | GPT-5 | Claude Sonnet 4.5<br> (Thinking) | K2 0905 | DeepSeek-V3.2 |
-
+|:----------:|:--------:|:------------:|:------:|:----------------------------:|:--------:|:--------------:|
 | **SWE-bench Verified** | w/ tools | 71.3 | 74.9 | 77.2 | 69.2 | 67.8 |
 | **SWE-bench Multilingual** | w/ tools | 61.1 | 55.3* | 68.0 | 55.9 | 57.9 |
 | **Multi-SWE-bench** | w/ tools | 41.9 | 39.3* | 44.3 | 33.5 | 30.6 |
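
Each table hunk in this change puts a center-aligned delimiter row directly under the header row, which GitHub Flavored Markdown requires before it will render a table. As a rough illustration only, the sketch below derives such a row from a header line. The helper name `delimiter_row` is hypothetical and not part of this repository, and the sketch assumes header cells contain no escaped `\|` characters.

```python
def delimiter_row(header: str) -> str:
    """Build a center-aligned delimiter row with one cell per header column.

    Hypothetical helper for illustration; assumes no escaped pipes in cells.
    """
    # Drop the outer pipes, then split on the inner ones to get the cells.
    cells = header.strip().strip("|").split("|")
    # Each cell becomes ':---:' stretched to roughly the header cell's width,
    # with a minimum of three dashes.
    parts = [":" + "-" * max(len(cell) - 2, 3) + ":" for cell in cells]
    return "|" + "|".join(parts) + "|"


if __name__ == "__main__":
    print(delimiter_row("| Benchmark | Setting | K2 Thinking |"))
```

The widths are cosmetic: a renderer only needs the delimiter row to have the same number of cells as the header, so a row produced this way need not match the committed one character for character.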