Fine Tuning Mistral 7B on Magic the Gathering Drafts
Tips, examples, and thoughts from an exploration of the world of fine tuning
In the last six months, I’ve written about fine tuning a few times. Fine tuning is such an enticing technology — promising to fill the gaps in GPT-4’s capabilities while also being faster and cheaper. For as often as fine tuning is discussed, though, I’ve found a surprisingly small amount of content out there that has helped me reason about how effective fine tuning is and how hard it is to successfully fine tune new capabilities into language models.
So, I decided to take things into my own hands, dust off my ML chops, and find out for myself.
Choosing a Problem
I was particularly interested in testing models’ ability to reason (i.e., perform a somewhat complex task that requires high context understanding) about out-of-distribution (i.e., unseen) data. I ended up using a hobby of mine: Magic the Gathering (specifically, draft).
For the unfamiliar: Magic: The Gathering is a strategic trading card game where players use decks of cards representing creatures and spells to battle against their opponents. One of the ways that players play Magic (and my personal favorite way) is draft, where players build their decks by selecting individual cards from a rotating pool of randomized cards passed among them.
Draft fits my criteria pretty nicely:
Reasoning: choosing a card from a randomized pack is quite skill testing and often requires a cohesive understanding of the context (e.g., what cards have you picked so far, what cards are available in the current pack)
Out-of-distribution: New Magic cards are released ~4-6 times a year, and the most recent cards are not found in the training corpus of LLMs.
Another important piece: data. There’s an awesome service called 17lands that has a huge trove of historical data — players use 17lands’ tracking service to track draft data from the digital Magic client. With that data, you can extract “ground truth” by looking at the draft picks made by the best players on the service (sorted by win rate). This is all a bit fuzzy (a lot of great Magic players debate about correct picks all the time), but it’s a good enough signal to test an LLM’s ability to learn a new task.
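To make that concrete, here’s a minimal sketch (in pandas) of the kind of filtering involved. The file name, column names, and win-rate threshold are illustrative assumptions, not the exact schema of the 17lands export:

import pandas as pd

# Load a 17lands public draft export (file and column names are assumptions;
# check the real CSV schema before relying on this).
drafts = pd.read_csv("draft_data_public.PremierDraft.csv")

# Keep only picks made by drafters in the top win-rate buckets and treat
# their choices as "ground truth" labels.
strong_drafters = drafts[drafts["user_game_win_rate_bucket"] >= 0.62]
print(f"{len(strong_drafters)} picks from high-win-rate drafters")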
If you’re curious about data details, here’s an example of what 17lands data looks like when transformed into a prompt for an LLM.
Results + Summary
Let’s get straight to the results, then dig into some specific learnings and thoughts:
Thoughts:
A fine tuned 7B parameter model handily beat GPT-4 and came close to human-level (or at least author-level) performance on this task.
It looks like fine-tuned GPT-3.5 would be even better, but fine-tuning GPT-3.5 is really expensive! (~100x more expensive than fine-tuning Mistral on bare metal + a premium price for each inference). A fine-tuning run of GPT-3.5 equivalent to my largest run of Mistral-7b would have cost ~$500 — an expensive experiment.
Fine tuning is still a bit of an art — I had hoped that this would feel more like engineering than science, but there was a lot of experimentation to be done. In particular, prompt engineering with the long feedback loop of fine-tuning is brutal. I’ll go into more details below.
When in doubt, use axolotl for fine tuning. It bakes in a lot of little optimizations that are easy to miss on your own.
Even the small OSS models are huge by the standard of 5 years ago. It’s one thing to read “7 Billion Parameters”; it’s another to deal with fitting 7 billion parameters and all of the associated math onto a GPU.
I did one interesting experiment, fine tuning a model on one set of cards, then evaluating it on an unseen set of cards. The model seemed to generalize on the concept of drafting, not just memorizing which cards were good.
Field report: methods and learnings along the way
Data
Building a text dataset: The 17lands draft dataset is actually a big CSV file that describes a series of draft picks made by users, roughly with the format of:
The cards that were available in the current pack
The cards the drafter had picked so far
The card the drafter picked from that pack
To make this data suitable for fine tuning a language model, you have to transform it into text — I ended up using the assistant format popularized by OpenAI:
{
  "messages": [
    {
      "role": "system",
      "content": "You are DraftGPT, a Magic the Gathering Hall of Famer and helpful AI assistant that helps players choose what card to pick during a draft. You are a master of the current draft set, and know every card well.\n\nWhen asked for a draft pick, respond with the card's name first."
    },
    {
      "role": "user",
      "content": "In our Magic the Gathering draft, we're on pack 2 pick 13. These are the contents of our pool so far:\n-------------------------\nEvolving Wilds -- (common)\nRat Out -- {B} (common)\nNot Dead After All -- {B} (common)\nHopeless Nightmare -- {B} (common)\nBarrow Naughty -- {1}{B} (common)\nUnassuming Sage -- {1}{W} (common)\nThe Witch's Vanity -- {1}{B} (uncommon)\nSpell Stutter -- {1}{U} (common)\nMintstrosity -- {1}{B} (common)\nWater Wings -- {1}{U} (common)\nBarrow Naughty -- {1}{B} (common)\nGadwick's First Duel -- {1}{U} (uncommon)\nBitter Chill -- {1}{U} (uncommon)\nThe Princess Takes Flight -- {2}{W} (uncommon)\nStockpiling Celebrant -- {2}{W} (common)\nVoracious Vermin -- {2}{B} (common)\nDevouring Sugarmaw // Have for Dinner -- {2}{B}{B} // {1}{W} (rare)\nMisleading Motes -- {3}{U} (common)\nJohann's Stopgap -- {3}{U} (common)\nBesotted Knight // Betroth the Beast -- {3}{W} // {W} (common)\nThreadbind Clique // Rip the Seams -- {3}{U} // {2}{W} (uncommon)\nTwining Twins // Swift Spiral -- {2}{U}{U} // {1}{W} (rare)\nEriette's Whisper -- {3}{B} (common)\nFarsight Ritual -- {2}{U}{U} (rare)\nTwisted Sewer-Witch -- {3}{B}{B} (uncommon)\nInto the Fae Court -- {3}{U}{U} (common)\n-------------------------\n\nTo keep track of what colors are open, you've counted how many cards of each color identity you've seen in the last 5 packs. Here is the breakdown:\nW: 11\nB: 6\nG: 4\nRW: 1\nR: 2\n\nThese are the contents of the pack:\n-------------------------\nCut In -- {3}{R}\nSorcery (common)\nCut In deals 4 damage to target creature.\nCreate a Young Hero Role token attached to up to one target creature you control. (If you control another Role on it, put that one into the graveyard. Enchanted creature has \"Whenever this creature attacks, if its toughness is 3 or less, put a +1/+1 counter on it.\")\n-------------------------\nSkewer Slinger -- {1}{R}\nCreature — Dwarf Knight (common)\nReach\nWhenever Skewer Slinger blocks or becomes blocked by a creature, Skewer Slinger deals 1 damage to that creature.\n1/3\n-------------------------\n\nWhat card would you pick from this pack?"
    },
    {
      "role": "assistant",
      "content": "Cut In"
    }
  ]
}
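For a sense of how those examples get produced, here’s a rough sketch of the transformation step. The field names and the format_card helper are hypothetical stand-ins for the real parsing code, not my exact script:

def build_training_example(pack_cards, pool_cards, picked_card, format_card):
    # pack_cards / pool_cards are lists of card names; format_card is a
    # (hypothetical) helper that renders a card's cost, type, and rules text.
    system = (
        "You are DraftGPT, a Magic the Gathering Hall of Famer and helpful AI "
        "assistant that helps players choose what card to pick during a draft."
    )
    pool_text = "\n".join(format_card(c) for c in pool_cards)
    pack_text = "\n".join(format_card(c) for c in pack_cards)
    user = (
        f"These are the contents of our pool so far:\n{pool_text}\n\n"
        f"These are the contents of the pack:\n{pack_text}\n\n"
        "What card would you pick from this pack?"
    )
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
            {"role": "assistant", "content": picked_card},
        ]
    }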
This very quickly exposes the hardest piece of fine tuning: formatting the data for the right outcome is challenging and fundamentally experimental.
By now, most folks are familiar with prompt engineering — the experimental process of modifying your prompt to get the best performance out of a language model. The prompt engineering process is 100x slower with fine tuning. You typically need to kick off a multiple-hour job to test a prompt. This bogs down the experimental workflow significantly and makes fine-tuning feel just as challenging as classical machine learning.
To illustrate with the Magic draft problem, I considered and tested the following:
~5 prompt formats, in particular how much detail about each card to show
Adding additional context about the last few draft picks to have “memory”
Including training lines of “card trivia,” where the model is asked to remember details about the new cards (a sketch of such a line is just below)
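To illustrate that last idea, a “card trivia” line is just another chat example whose only job is to teach the model what a new card does. The question wording here is illustrative; the card text comes straight from the set:

card_trivia_example = {
    "messages": [
        {"role": "user", "content": "What does Skewer Slinger do?"},
        {
            "role": "assistant",
            "content": (
                "Skewer Slinger -- {1}{R}\n"
                "Creature -- Dwarf Knight (common)\n"
                "Reach\n"
                "Whenever Skewer Slinger blocks or becomes blocked by a creature, "
                "Skewer Slinger deals 1 damage to that creature.\n"
                "1/3"
            ),
        },
    ]
}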
I did ~40 hours of experiments and still don’t feel I’ve conclusively answered which prompt format is “best” for this task. There is a lot of room to experiment.
Running Fine Tuning
Finding GPUs: it probably doesn’t need to be said, but it sucks! Most places don’t have much availability. I ended up renting an hourly GPU from Runpod (an RTX 4090 w/ 24GB of VRAM) for ~$0.70/hr.
Fine tuning script: This isn’t my first ML rodeo, so my gut was to write my own training script with HuggingFace transformers + PEFT. Considering my limited GPU situation, QLoRA seemed like the way to go.
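For context, the skeleton of that kind of hand-rolled QLoRA script looks roughly like this (a minimal sketch with illustrative hyperparameters, not my actual training code):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load Mistral-7B in 4-bit so the weights fit on a 24GB card.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Freeze the quantized weights and attach small trainable LoRA adapters.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
# ...then tokenize the chat-formatted examples and train with a standard
# transformers Trainer (or TRL's SFTTrainer).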
It turns out that writing my own script was a bad idea! There are a whole bunch of finicky little optimizations and options that range from straightforward-if-you-know-about-them to pretty obtuse without reading a research paper. Nothing insurmountable, but it would take a long time to figure out yourself.
I ended up using axolotl, which implements a ton of those optimizations out of the box and was much easier to get running (and running quickly). Their documentation is actually pretty decent, and I think is the right starting point for most people to fine-tune LLMs.
A note on the models: Holy crap, LLMs are seriously large! The last time I trained models regularly was ~2019, when BERT had ~110 million parameters; now, the “small” LLMs are 70 times bigger than that. Models this large are fundamentally cumbersome. Weights being ~16GB makes storage a real concern; GPU memory is challenging even with methods like QLoRA. No wonder the best researchers are such a hot commodity; this is seriously challenging work at the largest scale.
Evaluation
Start with evaluation first: One lesson from ML of old that I don’t think has been adopted enough among the prompt engineering wizards: you should always build a good evaluation before starting your experiments. Here, evaluation was pretty easy (hold out some full drafts from the training data and check if the model picks the same card as the human on the holdout data), but having a good evaluation set made reasoning about fine-tuning much more straightforward.
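As a sketch, the evaluation itself is just exact-match accuracy against the held-out human picks; generate_pick below is a hypothetical wrapper around whichever model is being tested:

def pick_accuracy(holdout_examples, generate_pick):
    # holdout_examples: dicts with "prompt" and "human_pick" keys (illustrative).
    # generate_pick: callable mapping a prompt string to the model's chosen card name.
    correct = 0
    for example in holdout_examples:
        model_pick = generate_pick(example["prompt"]).strip()
        # The models are prompted to respond with the card's name first.
        if model_pick.lower().startswith(example["human_pick"].lower()):
            correct += 1
    return correct / len(holdout_examples)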
Some criteria for language models are hard to define: The “pick the right card” task is pretty easy to define for Magic drafts, but there are some fuzzier things that I would like the final model to do, too:
When it makes different picks, they should be justifiable
It would be nice if the model could give a reasonable explanation for “why” it made a pick
Each of those is much harder to define, and I ended up testing them with the “eye test” by going through a bunch of examples, but this was slow. FWIW, GPT-4 is better at making less “weird” picks and better at justifying its choices than the fine-tuned smaller models.
Key Takeaways
My two biggest takeaways from this experiment:
Fine tuning on new data can be remarkably effective, easily surpassing GPT-4 + in-context learning on both accuracy and cost.
Fine tuning is a fundamentally experimental process to get “right”, and doing it well is a specialized skillset (and in particular, a skillset that is harder to learn than prompt engineering).
Oh, and some Magic stuff
In terms of how the bots actually feel as drafters? Pretty good!
I wired up the draft pick model to the logs generated by Magic Arena, whipped up a quick electron app, and have done a few drafts with a “Magic Copilot”:
Some quirks:
The pick is generated by a fine tuned model, but the commentary is generated by GPT-4. This works well most of the time, but occasionally GPT-4 disagrees with the fine tune and immediately contradicts it 😅
I’ve hooked up eight draft AIs to a simulated draft (i.e., all of the bots are drafting against each other). They have some quirky behavior when passing to each other, in particular a weird tendency to draft mono-colored decks. If there’s a human making other picks, they tend to converge into much more normal-looking decks.
Overall, I would venture to guess this is probably one of the more powerful and humanlike draft AIs out there right now. Compared to the bots in Magic Arena’s quick draft feature, these are much more similar to a high-quality human drafter than a heuristic bot.
Wizards of the Coast — if you’re looking for excessively high fidelity and somewhat expensive to run draft AI, hit me up! I’m happy to send you some LLMs!