Current LLM Performance in Theoretical CS Exams
My current day job is as a scientific employee at Technical University of Munich. Our chair has been responsible for our undergraduate CS theory course, which contains the usual topics from formal languages and automata theory to the basics of decidability and complexity theory. It is interesting to see how good AI is at tackling these problems, for a few reasons:
- If AI is very good, it opens up new avenues for cheating, which factors into decisions about exam conditions.
- Many students already use LLMs as a tool for learning, and they could also be a tool for teaching. But possible use-cases depend on their ability to provide accurate responses.
- Also, this is another area in which one can compare human abilities against the computer's, and one that I am very familiar with.
Now, you might say that there are already enough benchmarks that measure mathematical reasoning abilities. However, there are a few reasons why those results might not carry over to this context. In particular,
- our exams are in German, and
- the exercises are written to be graded by humans.
This means that the LLMs must not only arrive at the correct answers, but they must also provide proofs which I find convincing.
Now, before I talk about the results, I would like to highlight that this is not a serious benchmark for comparing models against each other. It is not difficult to game (exercises and solutions are public), it does not have a high resolution, and there is a large variance. Instead, my intention is to take an area where I have a good intuition for difficulty, to evolve my understanding of the kinds of tasks that current models can solve.
The results
Full data: See here for the actual exercises and the answers.
To summarise, LLMs are very good at our exams. As you can see, none of the models tested here score lower than the top 2% of students (!). The open-weight models (Qwen3 and Deepseek) perform at the level of the top 0.5% (if you use the newest version), while the proprietary models outperform the top 0.1% (i.e. the statistical best student).
So if you wanted a model specifically to solve these kinds of exams, we have the following Pareto frontier (note that the graph combines the data of two exams, so I divide by 2):
- GPT-OSS-120B is both the worst and the cheapest model on this graph. Still, this means a price of 2ct for one exam and a grade in the top 1.5%. This would already be the best grade, generally speaking.
- Qwen3-235B-A22B-thinking-2507 is the second cheapest option, and gets you a performance of roughly a top 0.7% submission at the price of $0.08 per exam.
- Deepseek-v3.1 is 70% more expensive, at a whopping $0.13 per exam. But you do get an additional 6P per exam, which would correspond to one grade increment, and is overall a top 0.2% result.
- GPT-5 is almost 8 times more expensive, and gets you an additional 4P per exam, being only 1P shy of a perfect exam. This result is superhuman, at least in the sense that no human has achieved it (yet).
Technically speaking, Grok 4 did get one additional point compared to GPT-5, but this is (well) within measurement error.
Is this surprising?
Yes, at least to some people. Prior to publishing the results, I ran a survey of qualified domain experts (well, I asked my colleagues when they walked past my office), asking them to predict how a state-of-the-art model would perform on one of our exams (which have a total of 105P). Their estimates ranged from 65 to 96P, with the median at 89P. So even the worst models match the median estimate, and most models handily beat the maximum estimate. (Again, note that you have to multiply by 2 to compare with the plot.)
Back when OpenAI released their o1-preview with advanced reasoning, I did an informal test, feeding it the exercises of our then-current exam. It scored within the top 23% of submissions — impressive at the time, but as you can see, models have improved significantly since those ancient times (~1 year ago).
Details
I took the exercises from the 2024 endterm and retake exams. Those exams, their solutions, and the grading schemes are public. I used the grading schemes as-is.
The models are instructed to output their answers in a specific format, which my test harness parses. Some exercises are graded automatically, where this is feasible (e.g. checking whether a regular expression corresponds to a specific language).
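To give an idea of what such an automatic check could look like, here is a minimal sketch (not my actual grader; the function name, the alphabet and the bounded-length approximation are my own, and Python's `re` syntax is not identical to the formal regular expressions used in the course):

```python
from itertools import product
import re

def same_language_up_to(candidate: str, reference: str,
                        alphabet: str = "ab", max_len: int = 8) -> bool:
    """Compare two regexes by full-matching every string over `alphabet`
    up to length `max_len`. This only approximates language equivalence,
    but it catches most wrong submissions."""
    cand, ref = re.compile(candidate), re.compile(reference)
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            w = "".join(chars)
            if bool(cand.fullmatch(w)) != bool(ref.fullmatch(w)):
                return False
    return True

# All strings ending in "abb": two equivalent ways to write the language...
assert same_language_up_to("(a|b)*abb", "(b|a)*abb")
# ...and a wrong answer that gets rejected.
assert not same_language_up_to("(a|b)*abb", "(a|b)*ab")
```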
Note that our exams are designed with grading efficiency in mind, so many exercises are of the “prove or disprove” form. Students have to mark which direction they are attempting, and (generally, but not always) solutions attempting a wrong direction cannot earn partial credit. So I grade these exercises semi-automatically — if the wrong direction is selected, the exercise is graded automatically with 0P, and otherwise is left for the human grader (me) to process.
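The direction check itself is easy to automate; a rough sketch (the class and function names are hypothetical, not taken from my harness):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ProveDisproveAnswer:
    direction: str   # "prove" or "disprove", as marked in the submission
    argument: str    # the free-text proof attempt

def grade_direction(answer: ProveDisproveAnswer,
                    correct_direction: str) -> Optional[int]:
    """Return 0P if the wrong direction was attempted, otherwise None to
    signal that the proof goes to the human grader."""
    if answer.direction != correct_direction:
        return 0      # wrong direction: no partial credit
    return None       # correct direction: needs manual grading

# A submission that tries to disprove a true statement gets 0P automatically.
assert grade_direction(ProveDisproveAnswer("disprove", "..."), "prove") == 0
```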
Sometimes, the models do not answer according to the template. If I had to do it again, I would use tool-calling (or some other method to force structured output), but for this evaluation I simply graded those answers manually. More precisely, I usually fixed the answers by hand, so that the automatic grader could deal with them. (This is not indicated in the UI, sorry.) These errors were more or less evenly distributed across the models, each requiring 1-4 manual interventions (across 68 answers), except for GPT-OSS-120B, which required ~20.
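For reference, a sketch of what the tool-calling variant could look like with an OpenAI-style chat API (the tool name, the two-exercise schema and the prompt are placeholders I made up; the exact request format depends on the provider):

```python
from openai import OpenAI

client = OpenAI()
exam_prompt = "<exam exercises go here>"

# Declare a single tool whose parameters encode the answer template as a
# JSON schema; forcing the model to call it yields machine-parseable output.
answer_tool = {
    "type": "function",
    "function": {
        "name": "submit_answers",
        "description": "Submit the final answers for all exercises.",
        "parameters": {
            "type": "object",
            "properties": {
                "exercise_1": {"type": "string"},
                "exercise_2": {"type": "string"},
            },
            "required": ["exercise_1", "exercise_2"],
        },
    },
}

response = client.chat.completions.create(
    model="gpt-5",
    messages=[{"role": "user", "content": exam_prompt}],
    tools=[answer_tool],
    tool_choice={"type": "function", "function": {"name": "submit_answers"}},
)
# The arguments of the forced tool call are a JSON string matching the schema.
answers_json = response.choices[0].message.tool_calls[0].function.arguments
```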
Disclaimer
Since writing the above, I have accepted a position at Anthropic, which produces the Claude series of LLMs. While this analysis has been completed entirely within the scope of my position as a doctoral researcher at TUM, please note the potential for bias.
Written by Philipp Czerner, 2025-09-01