Current LLM Performance in Theoretical CS Exams

My current day job is as a scientific employee at Technical University of Munich. Our chair has been responsible for our undergraduate CS theory course, which covers the usual topics, from formal languages and automata theory to the basics of decidability and complexity theory. It is interesting to see how good AI is at tackling these problems, for a few reasons:

Now, you might say that there are already enough benchmarks that measure mathematical reasoning abilities. However, there are a few reasons why those results might not carry over to this context. In particular,

This means that the LLMs must not only arrive at the correct answers, but also provide proofs that I find convincing.

Now, before I talk about the results, I would like to highlight that this is not a serious benchmark for comparing models against each other. It is not difficult to game (the exercises and solutions are public), it does not have a high resolution, and there is a large variance. Instead, my intention is to take an area where I have a good intuition for difficulty, and use it to evolve my understanding of the kinds of tasks that current models can solve.

The results

A graph showing the trade-off between performance in the exam and cost.

Full data: See here for the actual exercises and the answers.

To summarise, LLMs are very good at our exams. As you can see, none of the models tested here score lower than the top 2% of students (!). The open-weight models (Qwen3 and DeepSeek) perform at the level of the top 0.5% (if you use the newest versions), while the proprietary models outperform the top 0.1% (i.e. the statistically best student).

So if you wanted a model specifically to solve these kinds of exams, the Pareto frontier looks as follows (note that the graph combines the data of two exams, so I divide by 2):

Technically speaking, Grok 4 did get one additional point compared to GPT-5, but this is (well) within measurement error.
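For readers who want to reproduce the frontier from raw (cost, score) data, here is a minimal Python sketch of how such a Pareto frontier can be computed. The model names and numbers in the example are hypothetical placeholders, not the actual exam results.

```python
# Minimal sketch: computing a Pareto frontier over (cost, score) points.
# A model is on the frontier if no other model achieves at least its score
# at a strictly lower cost.

def pareto_frontier(models):
    """models: list of (name, cost, score) tuples; returns the frontier sorted by cost."""
    # Sort by cost ascending, breaking ties by score descending.
    ordered = sorted(models, key=lambda m: (m[1], -m[2]))
    frontier, best_score = [], float("-inf")
    for name, cost, score in ordered:
        if score > best_score:  # strictly better than every cheaper model
            frontier.append((name, cost, score))
            best_score = score
    return frontier

# Hypothetical placeholder data, not the actual exam results.
data = [("model-a", 0.5, 180), ("model-b", 4.0, 205),
        ("model-c", 1.5, 195), ("model-d", 2.0, 190)]
print(pareto_frontier(data))
# [('model-a', 0.5, 180), ('model-c', 1.5, 195), ('model-b', 4.0, 205)]
```

Note that "model-d" is dropped because "model-c" is both cheaper and higher-scoring, which is exactly the dominance criterion the plot visualises.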

Is this surprising?

Yes, at least to some people. Prior to publishing the results, I ran a survey of qualified domain experts (well, I asked my colleagues when they walked past my office), asking them to predict the performance of a state-of-the-art model on one of our exams (which have a total of 105P). Their estimates ranged from 65 to 96P, with the median at 89P. So even the worst models match the median estimate, and most models handily beat the maximum estimate. (Again, note that you have to multiply by 2 to compare with the plot.)

Back when OpenAI released their o1-preview with advanced reasoning, I did an informal test, feeding it the exercises of our then-current exam. It scored within the top 23% of submissions — impressive at the time, but as you can see, models have improved significantly since those ancient times (~1 year ago).

Details

I took the exercises from the 2024 endterm and retake exams. Those exams, their solutions, and the grading schemes are public. I used the grading schemes as-is.

The models are instructed to output their answers in a specific format, which my test harness parses. Some exercises are graded automatically, where this is feasible (e.g. checking whether a regular expression corresponds to a specific language).
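As an illustration of what such an automatic check might look like, here is a minimal Python sketch. It is not the actual harness: instead of a proper automata-based equivalence test, it simply compares the submitted regular expression against a membership predicate on all strings up to a bounded length, and the example language (words over {a, b} with an even number of a's) is purely illustrative.

```python
# Minimal sketch of an automatic check for "does this regular expression
# describe the target language?". Instead of a full automata-based equivalence
# test, it compares the regex against a membership predicate on all strings up
# to a bounded length -- enough to catch most wrong answers, though not a proof.
import re
from itertools import product

def bounded_equivalence(regex, in_language, alphabet="ab", max_len=8):
    pattern = re.compile(regex)
    for length in range(max_len + 1):
        for chars in product(alphabet, repeat=length):
            word = "".join(chars)
            if bool(pattern.fullmatch(word)) != in_language(word):
                return False  # counterexample found
    return True

# Illustrative target language: all words over {a, b} with an even number of a's.
def even_as(w):
    return w.count("a") % 2 == 0

print(bounded_equivalence(r"(b*ab*a)*b*", even_as))  # True
print(bounded_equivalence(r"(ab)*", even_as))        # False ("b" is a counterexample)
```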

Note that our exams are designed with grading efficiency in mind, so many exercises are of the “prove or disprove” form. Students have to mark which direction they are attempting, and (generally, but not always) solutions attempting the wrong direction cannot earn partial credit. So I grade these exercises semi-automatically: if the wrong direction is selected, the exercise is automatically graded 0P; otherwise it is left for the human grader (me) to process.
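In code, this gate could look roughly like the following Python sketch; the shape of the parsed answer and the field names are my assumptions, not the real harness format.

```python
# Minimal sketch of the semi-automatic gate for "prove or disprove" exercises.
# The shape of the parsed answer is an assumption, not the real harness format.

MANUAL = "needs human grading"

def grade_prove_or_disprove(answer, correct_direction):
    """answer: parsed model output, e.g. {"direction": "disprove", "argument": "..."}."""
    if answer.get("direction") != correct_direction:
        return 0       # wrong direction: automatically 0P (usually no partial credit)
    return MANUAL      # right direction: the argument goes to the human grader

# Example: the statement is false, so only "disprove" attempts can earn points.
print(grade_prove_or_disprove({"direction": "prove", "argument": "..."}, "disprove"))     # 0
print(grade_prove_or_disprove({"direction": "disprove", "argument": "..."}, "disprove"))  # needs human grading
```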

Sometimes, the models do not answer according to the template. If I had to do it again, I would use tool-calling (or some other method to force structured output), but for this evaluation I simply graded those answers manually. More precisely, I usually fixed the answers by hand, so that the automatic grader could deal with them. (This is not indicated in the UI, sorry.) These errors were more or less evenly distributed across the models, each requiring 1-4 manual interventions (across 68 answers), except for GPT-OSS-120B, which required ~20.
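The parsing-with-fallback workflow could be sketched as follows in Python; the "ANSWER:" line template and the function names are assumptions for illustration, not the template actually used.

```python
# Minimal sketch of parsing answers against a fixed template, with a fallback
# queue for answers that do not conform. The "ANSWER: <...>" line is an assumed
# template for illustration, not the one actually used.
import re

ANSWER_RE = re.compile(r"^ANSWER:\s*(.+)$", re.MULTILINE)

manual_queue = []  # (model, exercise, raw output) triples to fix or grade by hand

def extract_answer(model_name, exercise_id, model_output):
    """Return the final answer string, or queue the output for manual handling."""
    matches = ANSWER_RE.findall(model_output)
    if not matches:
        # Non-conforming output: fix it by hand rather than silently awarding 0P
        # for a formatting mistake.
        manual_queue.append((model_name, exercise_id, model_output))
        return None
    return matches[-1].strip()

print(extract_answer("some-model", "ex1", "Reasoning...\nANSWER: L is not regular"))  # L is not regular
print(extract_answer("some-model", "ex2", "I believe the answer is 42."))             # None, queued
```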

Disclaimer

Since writing the above, I have accepted a position at Anthropic, which produces the Claude series of LLMs. While this analysis has been completed entirely within the scope of my position as a doctoral researcher at TUM, please note the potential for bias.

Written by Philipp Czerner, 2025-09-01