AI Eval Dashboard

Model performance across exams

2024_endterm.txt

Model                               Ex 1  Ex 2  Ex 3  Ex 4  Ex 5  Ex 6  Ex 7  Ex 8    Total
anthropic/claude-opus-4.1           16.0  16.0  13.0  12.0  11.0  14.0  14.0   4.0  100/105
anthropic/claude-sonnet-4           13.0  18.0  12.0  12.0  11.0  14.0  14.0   4.0   98/105
deepseek/deepseek-chat-v3.1         13.0  18.0  13.0  12.0  11.0  14.0  14.0   5.0  100/105
deepseek/deepseek-r1-0528           16.0  18.0  13.0  12.0  13.0  14.0  14.0   5.0  105/105
google/gemini-2.5-pro               16.0  17.0  13.0  12.0  13.0  14.0  14.0   5.0  104/105
openai/gpt-5                        16.0  18.0  13.0  12.0  13.0  14.0  14.0   5.0  105/105
openai/gpt-oss-120b                 16.0  17.0  13.0   9.0   7.0  14.0  14.0   2.0   92/105
qwen/qwen3-235b-a22b                13.0  17.0  10.0   3.0  13.0  14.0  14.0   4.0   88/105
qwen/qwen3-235b-a22b-thinking-2507  13.0  18.0  13.0  12.0   7.0  12.0  14.0   3.0   92/105
x-ai/grok-4                         16.0  18.0  13.0  12.0  13.0  14.0  14.0   5.0  105/105
z-ai/glm-4.5                         9.0  15.0   9.0   3.0   9.0  12.0  14.0   3.0   74/105

2024_retake.txt

Model                               Ex 1  Ex 2  Ex 3  Ex 4  Ex 5  Ex 6  Ex 7  Ex 8    Total
anthropic/claude-opus-4.1           15.0  18.0  13.0  12.0  12.0  14.0  14.0   3.0  101/105
anthropic/claude-sonnet-4           11.0  11.0   9.0  12.0  10.0  10.0  14.0   5.0   82/105
deepseek/deepseek-chat-v3.1         13.0  18.0  12.0  12.0  12.0  14.0  14.0   5.0  100/105
deepseek/deepseek-r1-0528           13.0  11.0  13.0  12.0   9.0  12.0  14.0   4.0   88/105
google/gemini-2.5-pro               16.0  18.0  13.0  13.0  12.0  14.0  14.0   4.0  104/105
openai/gpt-5                        16.0  18.0  12.0  12.0  12.0  14.0  14.0   5.0  103/105
openai/gpt-oss-120b                 12.0  13.0  12.0  12.0   9.0  10.0  14.0   1.0   83/105
qwen/qwen3-235b-a22b                16.0  12.0  11.0  12.0   7.0  14.0  14.0   3.0   89/105
qwen/qwen3-235b-a22b-thinking-2507  14.0  12.0  12.0  12.0  12.0  14.0  14.0   5.0   95/105
x-ai/grok-4                         16.0  18.0  13.0  12.0  12.0  14.0  14.0   5.0  104/105

puzzle.txt

Model                        Ex 1  Ex 2  Ex 3  Ex 4  Ex 5  Ex 6  Ex 7  Total
deepseek/deepseek-chat-v3.1   5.0   6.0   0.0   5.0     ?     ?   5.0  21/40
google/gemini-2.5-pro         0.0  10.0   5.0     ?     ?     ?     ?  15/30
openai/gpt-5                 10.0  10.0   5.0     ?     ?     ?     ?  25/30
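
Each Total above can be cross-checked against the per-exercise scores in its row. The Python sketch below is one possible way to do that, not the dashboard's own code. It assumes that fields are whitespace-separated, that model names contain no spaces, that the last field has the form total/max, and that a "?" marks a score that does not count toward the total. The function names parse_row and total_matches are hypothetical.

```python
from typing import List, Optional, Tuple


def parse_row(line: str) -> Tuple[str, List[Optional[float]], float, float]:
    """Split one table row into (model, scores, total, maximum).

    Assumption: the first field is the model name, the last field is
    'total/max', and everything in between is a score or '?'.
    """
    fields = line.split()
    model, raw_scores, raw_total = fields[0], fields[1:-1], fields[-1]
    # '?' is treated as a missing score and kept as None.
    scores = [None if s == "?" else float(s) for s in raw_scores]
    total_text, max_text = raw_total.split("/")
    return model, scores, float(total_text), float(max_text)


def total_matches(line: str) -> bool:
    """Return True when the non-missing scores add up to the reported total."""
    _, scores, total, _ = parse_row(line)
    graded_sum = sum(s for s in scores if s is not None)
    return abs(graded_sum - total) < 1e-9


if __name__ == "__main__":
    row = "deepseek/deepseek-chat-v3.1 5.0 6.0 0.0 5.0 ? ? 5.0 21/40"
    print(total_matches(row))
```

Because the row is split on any run of whitespace, the same check works whether the columns are separated by single spaces or padded for alignment.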