Benchmarks

| Benchmark | Category | Unit | Count |
| --- | --- | --- | --- |
| AA-Index | general | % | 3 |
| AA-LCR | reasoning | % | 8 |
| ACEBench | reasoning | % | 2 |
| ActivityNet | vision | % | 1 |
| AdvancedIF | reasoning | % | 2 |
| AGIEval | math | % | 8 |
| AI2 Reasoning Challenge (ARC) | reasoning | % | 1 |
| AI2D | reasoning | % | 31 |
| Aider | coding | % | 4 |
| Aider-Polyglot | coding | % | 21 |
| Aider-Polyglot Edit | coding | % | 10 |
| AIME | math | % | 1 |
| AIME 2024 | math | % | 51 |
| AIME 2025 | math | % | 108 |
| AIME 2026 | math | % | 12 |
| AIR-Bench | safety | % | 1 |
| AITZ_EM | reasoning | % | 3 |
| AlignBench | math | % | 4 |
| AlpacaEval 2.0 | reasoning | % | 4 |
| AMC_2022_23 | math | % | 2 |