AI Eval Dashboard

2024_endterm.txt

Model	Ex 1	Ex 2	Ex 3	Ex 4	Ex 5	Ex 6	Ex 7	Ex 8	Total
anthropic/claude-opus-4.1	16.0	16.0	13.0	12.0	11.0	14.0	14.0	4.0	100/105
anthropic/claude-sonnet-4	13.0	18.0	12.0	12.0	11.0	14.0	14.0	4.0	98/105
deepseek/deepseek-chat-v3.1	13.0	18.0	13.0	12.0	11.0	14.0	14.0	5.0	100/105
deepseek/deepseek-r1-0528	16.0	18.0	13.0	12.0	13.0	14.0	14.0	5.0	105/105
google/gemini-2.5-pro	16.0	17.0	13.0	12.0	13.0	14.0	14.0	5.0	104/105
openai/gpt-5	16.0	18.0	13.0	12.0	13.0	14.0	14.0	5.0	105/105
openai/gpt-oss-120b	16.0	17.0	13.0	9.0	7.0	14.0	14.0	2.0	92/105
qwen/qwen3-235b-a22b	13.0	17.0	10.0	3.0	13.0	14.0	14.0	4.0	88/105
qwen/qwen3-235b-a22b-thinking-2507	13.0	18.0	13.0	12.0	7.0	12.0	14.0	3.0	92/105
x-ai/grok-4	16.0	18.0	13.0	12.0	13.0	14.0	14.0	5.0	105/105
z-ai/glm-4.5	9.0	15.0	9.0	3.0	9.0	12.0	14.0	3.0	74/105

Model	Ex 1	Ex 2	Ex 3	Ex 4	Ex 5	Ex 6	Ex 7	Ex 8	Total
anthropic/claude-opus-4.1	15.0	18.0	13.0	12.0	12.0	14.0	14.0	3.0	101/105
anthropic/claude-sonnet-4	11.0	11.0	9.0	12.0	10.0	10.0	14.0	5.0	82/105
deepseek/deepseek-chat-v3.1	13.0	18.0	12.0	12.0	12.0	14.0	14.0	5.0	100/105
deepseek/deepseek-r1-0528	13.0	11.0	13.0	12.0	9.0	12.0	14.0	4.0	88/105
google/gemini-2.5-pro	16.0	18.0	13.0	13.0	12.0	14.0	14.0	4.0	104/105
openai/gpt-5	16.0	18.0	12.0	12.0	12.0	14.0	14.0	5.0	103/105
openai/gpt-oss-120b	12.0	13.0	12.0	12.0	9.0	10.0	14.0	1.0	83/105
qwen/qwen3-235b-a22b	16.0	12.0	11.0	12.0	7.0	14.0	14.0	3.0	89/105
qwen/qwen3-235b-a22b-thinking-2507	14.0	12.0	12.0	12.0	12.0	14.0	14.0	5.0	95/105
x-ai/grok-4	16.0	18.0	13.0	12.0	12.0	14.0	14.0	5.0	104/105

Model	Ex 1	Ex 2	Ex 3	Ex 4	Ex 5	Ex 6	Ex 7	Total
deepseek/deepseek-chat-v3.1	5.0	6.0	0.0	5.0	?	?	5.0	21/40
google/gemini-2.5-pro	0.0	10.0	5.0	?	?	?	?	15/30
openai/gpt-5	10.0	10.0	5.0	?	?	?	?	25/30
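
The Total column appears to be the sum of the graded exercise scores over the points available. A minimal sketch for re-checking the earned part of a row follows. It assumes tab-separated cells and that "?" marks an exercise that has not been graded yet. The function name check_row is hypothetical and not part of the dashboard. The denominator is not recomputed, because the maximum points per exercise are not listed.

```python
# Sketch: recompute the earned points in a "Total" cell from the per-exercise
# scores of one tab-separated row.
# Assumption: "?" means "not graded yet" and is skipped.


def check_row(row: str) -> tuple[float, float, bool]:
    """Return (sum of graded scores, earned points from Total, do they match)."""
    cells = row.rstrip("\n").split("\t")
    # First cell is the model name, last cell is "earned/available".
    scores, total = cells[1:-1], cells[-1]
    graded_sum = sum(float(s) for s in scores if s != "?")
    earned = float(total.split("/")[0])
    return graded_sum, earned, abs(graded_sum - earned) < 1e-9


# Two rows copied from the tables above.
rows = [
    "anthropic/claude-opus-4.1\t16.0\t16.0\t13.0\t12.0\t11.0\t14.0\t14.0\t4.0\t100/105",
    "deepseek/deepseek-chat-v3.1\t5.0\t6.0\t0.0\t5.0\t?\t?\t5.0\t21/40",
]

for row in rows:
    print(row.split("\t")[0], check_row(row))
```

Skipping "?" cells rather than counting them as zero matches the third table, where the available points (40 or 30) are lower for rows with ungraded exercises.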