An AI for MTG drafting
(Try it, or read the source code.)
A popular (and also my favourite) way to play Magic: The Gathering is Booster Draft. You sit at a table with ~7 other people. Everyone opens a booster pack, consisting of ~15 random cards. You select one of them, then pass the remaining 14 to the person on your left. This is repeated 14 more times, until all 15 cards of each of the packs have been picked.
This whole procedure is then repeated with two more packs, alternating direction, and you end up with 45 cards. You use these to build a deck and then play one-on-one matches against other players.
The goal is clear: pick cards that maximise your chances of winning the subsequent games, or, in other words, that allow you to build a deck as strong as possible.
How to do this is a hard problem — so perhaps we let our AI overlords struggle in our stead? Following that thought, I trained a small model to choose both which cards to pick and which cards to use for the deck.
Motivation
Drafting MTG is hard. (I describe some of the complications below.) When trying to improve at it, you face the problem that it is difficult to know whether your decisions were right.
After the draft you end up with a deck. Estimating whether that deck is good or bad is itself tough — and even if you could do that, you still do not know whether a bad deck is simply you getting unlucky during the draft, or actually a result of bad decisions.
Unlike chess, where you can go back to a position and try out different moves, you cannot simply re-play a draft because you would know what cards are coming.
A strong AI model would be an invaluable training tool. Instead of you having to do enough drafts that the variance evens out, the model gives you immediate feedback, with numbers. It also allows you to pose hypotheticals: you can change an earlier pick and see how it affects the results.
The usefulness of this depends on the playing strength of the model, of course. There are also other use-cases for which it is sufficient that the model has human-like behaviour, regardless of its strength. For example, human-like bots are useful to fill some seats in a draft, either because you do not have enough players, or to stand in temporarily while someone is late, lost connection, or similar.
What makes drafting difficult?
Before going into how the model was trained, let me describe some of the considerations that human players have. These are the kind of patterns that we want the model to identify, and will later function as ways in which to qualitatively evaluate the model's performance.
Card quality. On the most basic level, you want to choose “good” cards and avoid “bad” cards. Merely identifying which cards are (on some abstract level) better than others is already difficult, but in the end this boils down to a simple list. You can find rankings of cards online, created by human experts, and simply copy them.
If this was all, drafting would not be of much interest. Luckily, things get more complicated from here.
Colours. Certain cards work better together than others. The clearest example is the colours. There are five colours in the game (white, blue, black, red, green). While the rules of the game place no prohibition on the colours you play and you could — in theory — combine cards of all five colours as you please, the game mechanics penalise this indirectly: adding additional colours of cards to your deck will reduce its consistency and, therefore, its ability to win.
It is thus preferable to limit yourself to playing cards of only a few colours. (Usually two, but with many caveats and exceptions.) While some players fix their colours in advance, good players start out being flexible and narrow down their possible colours as the draft progresses.
Once you know which colours you play, any card outside of those colours is worthless to you. This is the main dynamic that changes your evaluations throughout the draft: initially, you weigh only the quality of the cards themselves, while at the end you disregard all that do not fit your chosen colours. How and when you go from one to the other is a tough balancing act.
Synergy. Colour is the most visible form of synergy, but there are others. Many cards excel only when used in the context of a particular strategy, and are merely average, or even poor, outside of it. Sometimes these themes are obvious and named on the cards themselves, making the association clear. Sometimes not. And even if the connection is clear, the evaluation of a card can change in subtle ways.
Some cards are independently good, their value less influenced by your other picks. Some cards are good regardless, but maximised in some strategy and push you towards it. Some cards are build-arounds — they are only worth playing if you are going for their strategy, and dead weight if you are not. Some cards may even have anti-synergy and get worse when combined.
Only in the first pick of a draft can a card be considered by itself. Every card after that must be evaluated in the context of the picks you have already made.
Signals. It becomes more complicated still. You are not drafting alone, and the cards you see are not chosen just by randomness. Rather, the cards in a pack passed to you are the ones left after the players before you have made their choice.
Of course, they will pick (what they consider to be) good cards, leaving the less good ones to you. But if you could read their minds, you could gain an advantage by pursuing a strategy different from theirs. Usually you do this by colour — by choosing colours distinct from theirs, you may access cards that they consider worthless (since the cards do not fit within their colours) but which you value highly.
Communicating with other players (outside of taking game actions) is not allowed. But you can use the information you get within the game itself. For example, if you are repeatedly passed packs of cards with strong red cards, you might conclude that the player passing to you does not consider red as one of their colours. This, in turn, should make you more likely to choose red as one of your colours.
This principle is easy enough, but complicated to implement. In practice, you are unlikely to receive pack after pack with strong cards of one colour, and any such pack might well be a false signal — after all, perhaps it initially contained two strong cards of the same colour, one of which was taken. Further, the quality of cards naturally goes down as fewer cards remain in the pack, so you have to have a keen sense of which cards would be considered strong relative to the current pick.
Overview
To train a model, you need data. Luckily, the 17Lands project offers a tool for people to track their drafts, and then publishes aggregated datasets. Our goal is to take those data and train a model to predict the next pick made by the human player.
Needless to say, an approach such as this will usually not give you superhuman performance, since that is not what you are optimising for. It should be good at giving you human-like behaviour though, which is already interesting.
There is some hope of getting an AI that is stronger than the average player, even if only doing imitation learning. Averaging opinions can be beneficial (just like a panel of experts is likely to outperform any single expert on the panel). Moreover, we can feed the model metainformation about the skill level of the humans that it is trying to imitate, and then, during evaluation, tell it to predict the actions of a high-skilled player.
Modelling
First, we have to figure out how we feed the model information about the draft, and how it makes its predictions.
On a high level, I want to encode the draft as a sequence of picks, i.e. the model is supposed to predict the next card to be picked based on the cards that have been picked previously. This is similar to the task solved by current large language models (LLMs), which are fundamentally models for predicting the continuation of a sequence. Since those appear to work quite well, I have made the bold choice to just do the same thing here.
Tokens. In an LLM, a text is first split into a sequence of tokens, each corresponding to a few letters or a short word, encoded as a number. To model a draft, we assign a unique token to each card.
To also include metadata with the picks, we can prefix the sequence of picks with special tokens that encode the metadata.
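As a rough sketch of what this encoding could look like (the card names are real, but the token ids and metadata fields here are invented for illustration, not the actual mapping):

```python
# Hypothetical token tables; the real mapping is generated during preprocessing.
card_to_token = {"Evolving Wilds": 17, "Wary Thespian": 42}
meta_to_token = {("rank", "mythic"): 301, ("event", "PremierDraft"): 310}

def encode_draft(metadata, picks):
    """Prefix the sequence of picks with special tokens encoding the metadata."""
    tokens = [meta_to_token[(key, value)] for key, value in metadata.items()]
    tokens += [card_to_token[name] for name in picks]
    return tokens

encode_draft({"rank": "mythic", "event": "PremierDraft"},
             ["Evolving Wilds", "Wary Thespian"])
```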
Model architecture. Predictions are made by a simple transformer architecture. Many descriptions of this architecture already exist, so I shall be brief:
- The sequence of numbers is mapped to a sequence of vectors by looking up each number in some (learnable) embedding.
- Each position in the sequence can query information about prior tokens via causal multi-head attention.
- The previous step is repeated a number of times.
- Finally, a single linear layer maps the resulting vectors (one for each position) to a prediction over the next token in the sequence.
The parameters of the embedding, the attention layers, and the final prediction head are then trained so that the prediction of the next token matches the observed next token.
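For concreteness, here is a minimal PyTorch sketch of such a model; the layer sizes and implementation details are placeholders, not the hyperparameters of the actual model discussed below.

```python
import torch
import torch.nn as nn

class PickPredictor(nn.Module):
    """Minimal decoder-only transformer along the lines described above."""

    def __init__(self, n_tokens, d_model=128, n_heads=4, n_layers=4, max_len=64):
        super().__init__()
        self.tok_emb = nn.Embedding(n_tokens, d_model)   # learnable token embedding
        self.pos_emb = nn.Embedding(max_len, d_model)    # learnable position embedding
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_tokens)         # prediction over the next token

    def forward(self, tokens):                            # tokens: (batch, seq)
        seq_len = tokens.shape[1]
        pos = torch.arange(seq_len, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        causal = nn.Transformer.generate_square_subsequent_mask(seq_len).to(tokens.device)
        x = self.blocks(x, mask=causal)                   # causal multi-head attention
        return self.head(x)                               # (batch, seq, n_tokens) logits
```

Training then minimises the cross-entropy between the prediction at each position and the token that actually follows it.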
Actually, this is only the core part of the model — we will later extend it a bit. One problem in particular is the inability of the model to “see” which cards are contained in the current pack. As you might imagine, predicting the next pick is quite futile if you have to choose among the ~300 cards that are possible, instead of the ≤15 cards that are actually available in the current pack.
For the moment we fix this by masking out the probabilities: the model produces a prediction for every single possible card, and we then set the probabilities of all cards that are not in the pack to 0.
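A sketch of this masking step, assuming the model's raw outputs are logits over all cards (the function and tensor names are illustrative):

```python
import torch

def pack_probabilities(logits, pack_token_ids):
    """Zero out the probability of every card that is not in the current pack.
    logits: (n_tokens,) raw scores over all possible cards at the current position.
    pack_token_ids: token ids of the <= 15 cards actually in the pack."""
    mask = torch.full_like(logits, float("-inf"))
    mask[pack_token_ids] = 0.0
    return torch.softmax(logits + mask, dim=-1)   # off-pack cards get probability 0
```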
Data preprocessing
As mentioned above, we rely on data of the 17Lands project. Our first task is to convert that data (a bunch of compressed CSV files) to a suitable format for a neural network.
I started with a Python script. CSV parsing, some string processing, nothing fancy. This worked fine for some initial experiments, but processing the full dataset would have taken many hours. This is a problem — it is likely that some issues will appear and require me to re-process all of the data. Having to wait a day each time this happens would significantly slow down development.
So things had to become faster. Two options:
- Go wide. I could parallelise the Python code to run on multiple cores, likely leading to a 10-20x speedup.
- Go tall. Alternatively, re-writing the code in C++ (or similar) would make it faster as well. This speed-up is harder to estimate, but I would ballpark 10-100x here.
When facing an embarrassingly parallel problem (one that can be split into entirely independent subproblems), I tend to parallelise first and ask questions later. Unfortunately, it is not quite as easy here: we have to store a mapping from card names (and metadata information) to token ids. The easiest way to compute this mapping (and thus the way I did it) is on the fly. Whenever the script needs to map a card name to a number, it looks up whether such a mapping already exists, and creates it if it does not.
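A minimal sketch of this idea in Python (simplified, not the actual preprocessing code):

```python
# Map card names (and metadata values) to token ids on the fly: the first time a
# name is seen, it is assigned the next free id.
token_ids = {}

def to_token(name):
    return token_ids.setdefault(name, len(token_ids))
```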
However, this creates a dependency between subproblems, since this mapping and updates to it need to be shared and synchronised. It is not terribly difficult to do, but enough bother that I thought it less effort to rewrite the code in C++ and avoid parallelisation.
This worked, and processing the entire dataset now takes a few minutes. The output are tensors in a simple binary encoding, and a JSON file with the relevant metadata (shapes of the tensors, as well as the token id mapping).
The total dataset comprises roughly 2 million drafts, resulting in 2.1 GiB of binary data (785 MiB compressed).
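To give an idea of how the output might be consumed; the file names, dtype and JSON keys here are assumptions for illustration, not the actual on-disk format:

```python
import json
import numpy as np

with open("metadata.json") as f:
    meta = json.load(f)

# One flat binary file per tensor; shape and dtype come from the JSON metadata.
picks = np.fromfile("picks.bin", dtype=np.int16).reshape(meta["picks_shape"])
token_of = meta["token_ids"]   # card name -> token id, as produced during preprocessing
```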
Implementing and training the model
I used PyTorch to implement the architecture described above. A lot of decisions and details are compressed into that sentence — I do not think it useful to describe those here, but the source code is publicly available for reference.
Most of the compute for training the model was provided by my own, consumer-grade GPU (a 7900XT). I also rented GPUs from a cloud provider (on Vast, to be precise). I did not keep track, but I estimate the total number of GPU hours to be around 100. Training the final model would have taken 51h on the 7900XT (but since parts of the training occurred on a 5070 Ti, the actual time was shorter).
Extensions
Pack contents. As described above, the model is incapable of taking into account the cards contained in the current pack. This might sound like a large handicap, but it is not: since we only consider the predicted probabilities of cards that are actually in the pack, the model is not unfairly penalised for not being able to tell which cards are available to pick.
Essentially, at each pick the model ranks all possible cards (not just the ones in the pack), and its prediction then is the highest-ranked card of the ones in the pack.
Put like this, it might be difficult to see how this is a handicap at all! But let us recall the concept of signals introduced above. In short, strong players will take into account the contents of prior packs to draw conclusions about the strategies that their opponents are pursuing, and then adjust their own strategy accordingly.
Clearly, this is impossible if the contents of the pack are not provided to the model as input. Since reading signals is an advanced strategy, we would expect the model to be able to make good predictions even without it. Still, it makes sense to (optionally) include information about pack contents.
A straightforward way to do this would be to extend the sequence of picks with the contents of the packs (marking the tokens in some way to differentiate them from the picks). However, this would increase the length of the sequence significantly: while there are 45 picks in a draft, you see a total of roughly 360 cards in packs, since every pack is shown to you again, one card smaller, on each subsequent pick. The computational effort required scales with the length of the sequence, and would thus increase by a factor of ~8.
Instead, I decided to process the packs as a separate sequence with another transformer, and then feed the outputs to the main model. This other transformer can be smaller, reducing the overhead. Additionally, it can be turned off, both to improve training efficiency, and evaluate its impact on overall performance.
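A sketch of what such a secondary encoder could look like; the sizes, pooling, and the way its output is combined with the main model are assumptions for illustration:

```python
import torch
import torch.nn as nn

class PackEncoder(nn.Module):
    """Smaller transformer that encodes the cards of the current pack into one
    vector, which can then be fed to the main model (for instance by adding it
    to the input embedding of the corresponding pick position)."""

    def __init__(self, n_tokens, d_small=64, d_model=128, n_heads=2, n_layers=2):
        super().__init__()
        self.emb = nn.Embedding(n_tokens, d_small)
        layer = nn.TransformerEncoderLayer(
            d_small, n_heads, dim_feedforward=4 * d_small,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.proj = nn.Linear(d_small, d_model)

    def forward(self, pack_tokens):                 # (batch, pack_size)
        h = self.blocks(self.emb(pack_tokens))      # (batch, pack_size, d_small)
        return self.proj(h.mean(dim=1))             # one summary vector per pack
```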
Deckbuilding. After picking all cards, a player has to build a deck using the cards they drafted. Since the 17Lands data also contains information on whether a card was played in the deck, I added a second output to the model predicting it.
More precisely, after each pick the model outputs a probability for each card, indicating how likely that card is to be included in the deck. Only the probabilities for the cards that have been picked up to that point are considered for training. It would also be reasonable to drop this restriction, i.e. to force the model to speculate about how likely cards that may or may not be drafted in the future are to be included.
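As a sketch of how such a head and the restricted loss could be implemented (the tensor names, sizes, and details are assumptions, not the actual code):

```python
import torch
import torch.nn as nn

d_model, n_tokens = 128, 300                   # placeholder sizes
deck_head = nn.Linear(d_model, n_tokens)       # scores every card at every position

def deck_loss(hidden, picks, played):
    # hidden: (batch, seq, d_model) transformer output after each pick
    # picks:  (batch, seq)          token id of the card picked at each position
    # played: (batch, seq)          1 if that card ended up in the final deck
    logits = deck_head(hidden)                                  # (batch, seq, n_tokens)
    batch, seq, _ = logits.shape
    # per_pick[b, t, s]: score that position t assigns to the card picked at position s
    index = picks.unsqueeze(1).expand(batch, seq, seq)
    per_pick = logits.gather(2, index)
    target = played.unsqueeze(1).expand(batch, seq, seq).float()
    # only cards that have already been picked (s <= t) contribute to the loss
    mask = torch.tril(torch.ones(seq, seq, device=hidden.device))
    loss = nn.functional.binary_cross_entropy_with_logits(per_pick, target, reduction="none")
    return (loss * mask).sum() / (mask.sum() * batch)
```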
Final architecture. With the above two extensions, the full model consists of the main pick-prediction transformer, the smaller pack-contents transformer feeding into it, and the additional deckbuilding output.
I experimented a bit to find a good set of hyperparameters, eventually settling on a model with ~7 million parameters. The values I used to create the final model can be found in configs/default.json. Please note that I added additional layers during training and changed the learning rate.
Results
The final model has an accuracy of 71% on my testing data. (The testing data consists of 1024 drafts picked uniformly at random from the dataset.) This means that for 71% of picks the card predicted by the model to be the most likely one was indeed the next pick.
The next two subsections attempt to answer how good this accuracy is. But to me, the more interesting question is how strong the model can be (and not how good it is at accurately predicting the behaviour of novices), which is what the other subsections try to evaluate. For the analyses in those sections, I have thus configured the metadata of the drafts to indicate a very strong player.
Comparison with synthetic baselines
The first question should always be whether the model does any better than pure randomness. Here, the answer is yes.
This is easy to calculate. If packs have $n$ cards, the probability of randomly choosing the right card out of a full pack of $n$ (distinct) cards is $1/n$, out of the following pack of $n-1$ cards it is $1/(n-1)$, and so on, so over the whole draft the expected accuracy is $\frac{1}{n}\sum_{k=1}^{n} \frac{1}{k} = \frac{H_n}{n}$. With $n = 14$ this yields ~23%. The actual number would be slightly lower, since on average, sets have slightly more than 14 cards in a pack, but also slightly higher, since there is a low chance that a pack contains two of the same card. In any case, this is much lower than the 71% achieved by the model.
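A quick numerical check of this figure:

```python
# Expected accuracy of random guessing: average of 1/n, 1/(n-1), ..., 1/1.
n = 14
accuracy = sum(1 / k for k in range(1, n + 1)) / n
print(f"{accuracy:.1%}")   # 23.2%
```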
As a second baseline, we can consider what happens if we simply pick the “best” card from every pack, referring to the abstract strength of a card independent of your other picks. As any player will tell you, this is a terrible strategy, since you will end up with cards from all five colours. To estimate some numbers, let us assume that you spend the entirety of pack 1 picking the best card, then choose two colours, and then always pick the best card in one of your two colours.
Under these assumptions, always predicting the best card gives an accuracy of roughly 64%, again much lower than 71%. So even under these very generous assumptions, the inability to react to colours is fatal. Empirically, I get around 48% when trying this strategy on the testing dataset, using the average pick rates to estimate what the best card is.
Comparison with prior work
The only prior work I found was this post, where the author fine-tuned an LLM, also on 17Lands data. They report an accuracy of 65% for their fine-tuned model, and a 75% accuracy for the author themselves. However, these accuracy numbers cannot be directly compared, since the evaluations used different datasets.
Besides, I have found accuracy to be quite volatile. For example, drafters with higher (estimated) skill are easier to predict. (Below I analyse two drafts by human experts, and there my model has an accuracy of 81%.) Accuracy will also depend on the set, since the playerbase changes over time, the number of cards in a pack changes slightly between sets, and also just because of different cards (i.e. if you have many cards of similar strength, accuracy will be lower).
Additionally, you get variance just based on the size of the dataset, especially when evaluating human performance. Assuming that predicting the picks of a draft takes roughly as long as doing the draft, you can do ~two drafts per hour, meaning 90 picks. If you correctly predict 75% of those 90 picks, i.e. 67, the 95% confidence interval is 61-85%. If you do four drafts, again with 75% correct, the 95% confidence interval is 66-83%.
Unfortunately, the size and makeup of the evaluation dataset was not reported. Thus the strongest conclusion I am willing to draw is that the two models and the author all had performance roughly in the same ballpark.
Comparison with human experts
First, let us look at the first draft from this video. Since there is live commentary, we can get some insight into what the two human players are thinking beyond just the picks made. I entered the cards into my tool, and compared the predictions of the model. (I also verified that this draft is not part of the training data.)
For 33 of the 42 picks, the model correctly predicted the first pick. For the other 9 picks, the actual pick was the model's second choice. Except for one pick, the top cards considered by the players and the top picks of the model align very closely.
In terms of deckbuilding, the model seems to have some issues. It does manage to correctly identify the colours early on in the draft. For example, at the point where the players decide that they are unlikely to play blue, the model concurs and gives the blue cards low probability of being included in the deck. Two picks prior, the model still considers it roughly a coinflip whether those cards will be played.
However, the inclusion probability for another card fluctuates somewhat from pick to pick around that time, for no apparent reason.
If you were to build a deck by including all cards with an inclusion probability above 50%, the model's final prediction differs from the actual deck by 3 cards. (The player's deck contains 24 of the 42 drafted cards.)
I also went through a second draft, from this video. This is a competitive event with cash prizes. Here, the model correctly predicts the first pick 35 out of 42 times. Of the other 7 picks, 5 are the second prediction, one the third, and one the fourth. In the final deck, we again see a difference of 3 cards.
In two picks I noted a potential tendency of the model to do rare-drafting, where you pick a card not for its contribution to your deck, but for its (often monetary) value after the draft. (You keep the cards you draft after the games are finished.) This would not be surprising, since this is a consideration of the players making up the dataset the model is trained on.
In pack 2, pick 2 we can use the fact that the model outputs probabilities for all cards, not just the ones in the pack, to determine that the model considers the picked card (Curator of Destinies) to not only be the best card in the pack, but also the best possible card in the entire set. This is in context of the cards already picked — without any context the model considers it the 10th best card in the set (and it has the 4th highest winrate on 17Lands).
The pick-by-pick ranks and some additional notes on the two drafts can be found in draft1.txt and draft2.txt.
Qualitative analysis
Let us go through the considerations of human drafters I have identified above, and see whether the model is able to take them into account.
Card quality. Clearly, the model has a sense of which cards are better than others. This is obvious from the fact that it manages to predict picks with high accuracy. You can also look at pack 1, pick 1 predictions and see that they roughly line up with the 17Lands win rate data, which is considered the best objective source of card quality information. (Of the 20 highest ranked cards of the model and of 17Lands, they agree on 10.)
Colours. Even though the model does not receive the colour of a card as an input, it has managed to deduce them from the data. It prefers picking cards in the same colours as previous picks. This is true even when the cards are so weak that they would never be played in the deck.
The model has also acquired a more subtle understanding of colours. For example, it is possible to splash colours, i.e. play a few very strong cards that are not in your main colours. But to do this, you need to include other cards in your deck that make it easier to play multiple colours, called mana-fixing.
Looking at pack 3, pick 9 of the first draft analysed above, the players (and the model) prefer to pick Evolving Wilds, which is mana-fixing. If we instead were to pick Wary Thespian, the model estimates a lower probability of including our off-colour cards in the deck, which makes sense.
Synergy. To see whether the model can take into account synergies that are not expressed in colours, I took another Foundations draft and modified the packs so that the first picks of both pack 1 and pack 2 were either Homunculus Horde or Extravagant Replication. These two cards are both blue rares with similar win rates, but Homunculus Horde is a build-around for drawing two cards in a turn. And indeed, if the picks are Homunculus Horde, the model consistently gives higher weight to cards that say “draw a card” somewhere in their text. (Of course, the model does not actually know the text of a card.)
Signals. Finally, I wanted to know whether the model is able to read signals. Since I can toggle the model's ability to see the contents of the packs off, a simple evaluation is to compute the accuracy on the test dataset with it turned off.
This does lower accuracy, as expected. The difference is 0.57%, which does not sound like a large number, but consider that this is an advanced strategy, likely employed by only a fraction of the playerbase. If we say that only 10% of players care about signals at all, then we would have to improve accuracy by 5.7% for those players, which is 2-3 picks per draft. This sounds roughly correct to me.
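The back-of-envelope arithmetic, spelled out (the 10% share of signal-reading players is my assumption from above):

```python
overall_gain = 0.0057     # accuracy lost when the pack contents are hidden
signal_readers = 0.10     # assumed fraction of players that read signals at all
per_player_gain = overall_gain / signal_readers   # 5.7% for those players
print(per_player_gain * 45)                       # ~2.6 of the 45 picks in a draft
```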
I then ran a test: I took a draft (Foundations again) in which the first 6 picks are white cards. I modified the packs to include strong green commons and uncommons, but not strong enough that they beat the first pick (a white card). In pick 7, the pack then does not contain a white card. I would predict that the model now favours the green card, relative to the unmodified packs.
However, it does not. The model actually has a small preference against green cards now. I am not sure why — perhaps the model observes that the player has not chosen green cards in previous picks, and deduces a potential dislike of green?
So the model is doing something with the contents of the packs, but it does not seem to consider signals in the straightforward way I would imagine.
Conclusions
The model is able to take the context of the previous picks into account to suggest a good candidate for the next pick. As seen in the comparison with human experts, the model correctly identifies the top candidates in almost all instances, and correctly predicts the actual pick ~80% of the time.
So I would say that it can serve as a good baseline for analysing your own drafts, giving you an indicator of which of your picks line up with the consensus view. You can do differential analyses, where you see how the probabilities of the model change when you go back and change your picks.
There are also some other fun possibilities, such as using the model to study how the metagame evolved (by changing the date of the draft in its metadata), or to look at the differences in behaviour between expert and novice players.
However, poking at the model does reveal some problems, and there is still much potential for improvement. You should certainly not assume that the model's prediction is always the right pick.
Attachments
| name | size |
| --- | --- |
| draft1.txt | 1.7 KiB |
| draft2.txt | 1.0 KiB |
| everything | 1.3 KiB |
Written by Philipp Czerner, 2025-06-12