Model performance across exams

Model | Ex 1 | Ex 2 | Ex 3 | Ex 4 | Ex 5 | Ex 6 | Ex 7 | Ex 8 | Total |
---|---|---|---|---|---|---|---|---|---|
anthropic/claude-opus-4.1 | 16.0 | 16.0 | 13.0 | 12.0 | 11.0 | 14.0 | 14.0 | 4.0 | 100/105 |
anthropic/claude-sonnet-4 | 13.0 | 18.0 | 12.0 | 12.0 | 11.0 | 14.0 | 14.0 | 4.0 | 98/105 |
deepseek/deepseek-chat-v3.1 | 13.0 | 18.0 | 13.0 | 12.0 | 11.0 | 14.0 | 14.0 | 5.0 | 100/105 |
deepseek/deepseek-r1-0528 | 16.0 | 18.0 | 13.0 | 12.0 | 13.0 | 14.0 | 14.0 | 5.0 | 105/105 |
google/gemini-2.5-pro | 16.0 | 17.0 | 13.0 | 12.0 | 13.0 | 14.0 | 14.0 | 5.0 | 104/105 |
openai/gpt-5 | 16.0 | 18.0 | 13.0 | 12.0 | 13.0 | 14.0 | 14.0 | 5.0 | 105/105 |
openai/gpt-oss-120b | 16.0 | 17.0 | 13.0 | 9.0 | 7.0 | 14.0 | 14.0 | 2.0 | 92/105 |
qwen/qwen3-235b-a22b | 13.0 | 17.0 | 10.0 | 3.0 | 13.0 | 14.0 | 14.0 | 4.0 | 88/105 |
qwen/qwen3-235b-a22b-thinking-2507 | 13.0 | 18.0 | 13.0 | 12.0 | 7.0 | 12.0 | 14.0 | 3.0 | 92/105 |
x-ai/grok-4 | 16.0 | 18.0 | 13.0 | 12.0 | 13.0 | 14.0 | 14.0 | 5.0 | 105/105 |
z-ai/glm-4.5 | 9.0 | 15.0 | 9.0 | 3.0 | 9.0 | 12.0 | 14.0 | 3.0 | 74/105 |

Model | Ex 1 | Ex 2 | Ex 3 | Ex 4 | Ex 5 | Ex 6 | Ex 7 | Ex 8 | Total |
---|---|---|---|---|---|---|---|---|---|
anthropic/claude-opus-4.1 | 15.0 | 18.0 | 13.0 | 12.0 | 12.0 | 14.0 | 14.0 | 3.0 | 101/105 |
anthropic/claude-sonnet-4 | 11.0 | 11.0 | 9.0 | 12.0 | 10.0 | 10.0 | 14.0 | 5.0 | 82/105 |
deepseek/deepseek-chat-v3.1 | 13.0 | 18.0 | 12.0 | 12.0 | 12.0 | 14.0 | 14.0 | 5.0 | 100/105 |
deepseek/deepseek-r1-0528 | 13.0 | 11.0 | 13.0 | 12.0 | 9.0 | 12.0 | 14.0 | 4.0 | 88/105 |
google/gemini-2.5-pro | 16.0 | 18.0 | 13.0 | 13.0 | 12.0 | 14.0 | 14.0 | 4.0 | 104/105 |
openai/gpt-5 | 16.0 | 18.0 | 12.0 | 12.0 | 12.0 | 14.0 | 14.0 | 5.0 | 103/105 |
openai/gpt-oss-120b | 12.0 | 13.0 | 12.0 | 12.0 | 9.0 | 10.0 | 14.0 | 1.0 | 83/105 |
qwen/qwen3-235b-a22b | 16.0 | 12.0 | 11.0 | 12.0 | 7.0 | 14.0 | 14.0 | 3.0 | 89/105 |
qwen/qwen3-235b-a22b-thinking-2507 | 14.0 | 12.0 | 12.0 | 12.0 | 12.0 | 14.0 | 14.0 | 5.0 | 95/105 |
x-ai/grok-4 | 16.0 | 18.0 | 13.0 | 12.0 | 12.0 | 14.0 | 14.0 | 5.0 | 104/105 |
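
Totals in the next table add up only the exercises that have a numeric score. Cells marked `?` are left out, so the denominator in the Total column differs between rows.

Model | Ex 1 | Ex 2 | Ex 3 | Ex 4 | Ex 5 | Ex 6 | Ex 7 | Total |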
---|---|---|---|---|---|---|---|---|
deepseek/deepseek-chat-v3.1 | 5.0 | 6.0 | 0.0 | 5.0 | ? | ? | 5.0 | 21/40 |
google/gemini-2.5-pro | 0.0 | 10.0 | 5.0 | ? | ? | ? | ? | 15/30 |
openai/gpt-5 | 10.0 | 10.0 | 5.0 | ? | ? | ? | ? | 25/30 |
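Each Total above is the sum of the per-exercise scores in its row. The Python sketch below re-adds those scores and compares the result with the reported Total. Like the last table's totals, it skips `?` cells. The helper names `parse_row` and `check_total` are hypothetical and do not come from any published grading tooling. The sketch also assumes every row follows the `name | score | ... | got/max |` layout used here.

```python
from __future__ import annotations

import math


def parse_row(row: str) -> tuple[str, list[float | None], tuple[int, int]]:
    """Split 'model | 5.0 | ? | 21/40 |' into (name, scores, (got, max)).

    Hypothetical helper: `?` cells become None (treated as unscored).
    """
    cells = [c.strip() for c in row.strip().strip("|").split("|")]
    name, *scores, total = cells
    parsed = [None if s == "?" else float(s) for s in scores]
    got, out_of = (int(x) for x in total.split("/"))
    return name, parsed, (got, out_of)


def check_total(row: str) -> bool:
    """Return True if the scored exercises add up to the reported total."""
    _, scores, (got, _) = parse_row(row)
    scored_sum = sum(s for s in scores if s is not None)  # skip `?` cells
    return math.isclose(scored_sum, got)


if __name__ == "__main__":
    rows = [
        "deepseek/deepseek-chat-v3.1 | 5.0 | 6.0 | 0.0 | 5.0 | ? | ? | 5.0 | 21/40 |",
        "z-ai/glm-4.5 | 9.0 | 15.0 | 9.0 | 3.0 | 9.0 | 12.0 | 14.0 | 3.0 | 74/105 |",
    ]
    for r in rows:
        print(parse_row(r)[0], check_total(r))
```

The sketch uses `math.isclose` so that a fractional score (such as a half point) does not trigger a spurious mismatch. It only checks the earned points. The `max` part of each Total is returned but not validated, because the table does not state the maximum for each individual exercise.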