Back Original

Computer Chess: The Byte Pair Encoding Variation

I'm not a Machine Learning (ML) expert, but I have a hypothesis: Language models are mostly bad at chess because we don't tokenize chess moves well.

Just look at how the tokenizers for popular model architectures break a PGN string down into tokens:

So, I decided to learn how tokenizers work and build a new one that understands chess.

Here's where I ended up:

If you want to play around with the tokenizer, all of the code is open-source. The Python library is published to PyPI as pgn-tokenizer. A beta version of the TypeScript library is published to npm as @dvdagames/pgn-tokenizer.

The Dataset Defense

I knew the first thing I needed was data. Lots and lots of data. Luckily, the world has no shortage of chess data. Unluckily, there seem to be two classes of chess datasets:

  1. Massive datasets that take up terabytes (TB) of space
  2. Smaller datasets that don't include a PGN string or use non-standard formatting

It also seemed like I could need a lot of time and money to process all the data and train a model, neither of which I was eager to give up just to tinker for fun. So, I'd need to build something that could run on my recently purchased MacBook M4 Pro using a dataset that could reasonably fit on my hard drive.

The Kaggle Gambit

I discovered a (now unpublished) dataset on Kaggle, adapted from the chess research project that included ~3 million games from ChessDB dating back to 1783, which felt like a pretty neat representative sample of the game. Unfortunately, the data stopped about 9 years ago, so there's a lot more data that could be included now that chess has experienced its renaissance post the Queen's Gambit. - I'm one of those folks who discovered chess during the pandemic after watching Beth Harmon stick it to the chess patriarchy.

This dataset didn't fall into either category that I mentioned earlier, despite the underlying source using some truly bizarre formatting. Maybe it was less bizarre almost a decade ago when the research was originally published? I'm sure the labeling served some research goals, but it means the dataset isn't as valuable for general purposes.

Here's an example of the data:

1.t 2.date 3.result 4.welo 5.belo 6.len 7.date_c 8.resu_c 9.welo_c 10.belo_c 11.edate_c 12.setup 13.fen 14.resu2_c 15.oyrange 16.bad_len 17.game...
1 2000.03.14 1-0 2851 None 67 date_false result_false welo_false belo_true edate_true setup_false fen_false result2_false oyrange_false blen_false ### W1.d4 B1.d5 W2.c4 B2.e6 W3.Nc3 B3.Nf6 W4.cxd5 B4.exd5 W5.Bg5 B5.Be7 W6.e3 B6.Ne4 W7.Bxe7 B7.Nxc3 W8.Bxd8 B8.Nxd1 W9.Bxc7 B9.Nxb2 W10.Rb1 B10.Nc4 W11.Bxc4 B11.dxc4 W12.Ne2 B12.O-O W13.Nc3 B13.b6 W14.d5 B14.Na6 W15.Bd6 B15.Rd8 W16.Ba3 B16.Bb7 W17.e4 B17.f6 W18.Ke2 B18.Nc7 W19.Rhd1 B19.Ba6 W20.Ke3 B20.Kf7 W21.g4 B21.g5 W22.h4 B22.h6 W23.Rh1 B23.Re8 W24.f3 B24.Bb7 W25.hxg5 B25.fxg5 W26.d6 B26.Nd5+ W27.Nxd5 B27.Bxd5 W28.Rxh6 B28.c3 W29.d7 B29.Re6 W30.Rh7+ B30.Kg8 W31.Rbh1 B31.Bc6 W32.Rh8+ B32.Kf7 W33.Rxa8 B33.Bxd7 W34.Rh7+
2 2000.03.14 1-0 2851 None 53 date_false result_false welo_false belo_true edate_true setup_false fen_false result2_false oyrange_false blen_false ### W1.e4 B1.d5 W2.exd5 B2.Qxd5 W3.Nc3 B3.Qa5 W4.d4 B4.Nf6 W5.Nf3 B5.c6 W6.Ne5 B6.Bf5 W7.g4 B7.Be4 W8.f3 B8.Bd5 W9.a3 B9.Nbd7 W10.Be3 B10.Nxe5 W11.dxe5 B11.Nxg4 W12.Bd4 B12.e6 W13.b4 B13.Qd8 W14.Nxd5 B14.Qxd5 W15.c4 B15.Ne3 W16.cxd5 B16.Nxd1 W17.dxc6 B17.bxc6 W18.Rxd1 B18.Be7 W19.Ba6 B19.O-O W20.Ke2 B20.Rab8 W21.Rc1 B21.Rfd8 W22.Rhd1 B22.c5 W23.Bxc5 B23.Rxd1 W24.Rxd1 B24.Bxc5 W25.bxc5 B25.g6 W26.c6 B26.Rb2+ W27.Rd2
3 1999.11.20 1-0 2851 None 57 date_false result_false welo_false belo_true edate_false setup_false fen_false result2_false oyrange_false blen_false ### W1.e4 B1.e5 W2.Nf3 B2.Nc6 W3.Bc4 B3.Bc5 W4.c3 B4.Nf6 W5.d3 B5.d6 W6.Bb3 B6.O-O W7.Nbd2 B7.Be6 W8.O-O B8.Qd7 W9.Re1 B9.Rfe8 W10.Nf1 B10.Ne7 W11.Ng3 B11.Bg4 W12.h3 B12.Be6 W13.Bg5 B13.Kh8 W14.Bxf6 B14.gxf6 W15.d4 B15.exd4 W16.cxd4 B16.Bb4 W17.Re3 B17.Rg8 W18.d5 B18.Bxh3 W19.Qd4 B19.Rg6 W20.Qxb4 B20.c5 W21.Qc3 B21.Bg4 W22.Bc2 B22.Rh6 W23.Nh2 B23.b5 W24.b4 B24.Rc8 W25.Bd3 B25.c4 W26.Bc2 B26.Bh5 W27.Nxh5 B27.Rxh5 W28.Qxf6+ B28.Kg8 W29.Bd1

This Kaggle user had already done most of the hard work of cleaning the Chess Research Project's dataset:

Sample of the Kaggle PGN Dataset

After scripting out downloading and some more cleaning to remove games without moves or games with unexpected formatting, I had nice, simple PGN strings to work with. The next day, the original dataset I was building on top of vanished from Kaggle. Luckily, I still had it cached locally and could publish my cleaned PGN dataset to Hugging Face.

Here's a sample of the dataset:

Sample of My Cleaned PGN Dataset

Just some boring PGN strings without any weird formatting or metadata.

And, here are some statistics about this dataset:

Average Elo: 2240

Middlegame

Much has been written about which Large Language Models (LLMs) are good at chess, but I wasn't interested in making the best chess opponent. While chess isn't solved yet, Stockfish has basically won as far as computers playing chess goes, and there are more passionate, and more qualified people out there who will make better contributions to the field than I could.

I just wanted to see if a language model could reliably work with PGN strings, learn some new things, and try to understand more about how these tools work under the hood.

As I tested various models, I thought something prevents the models from attaching appropriate semantic meaning to individual moves PGN strings.

Looking at how the tokenizers for popular model architectures break a PGN string down into tokens illustrates how that could be an issue:

I could be wrong, but it seems like e4 is probably a lot more valuable to making decisions about chess than e and 4 are separatley. I decided to put on my pretend researcher hat and see what would happen if I could tokenize these more semantically.

So, I took all that data that I cleaned, put on Andrej Karpathy's fantastic Let's build the GPT Tokenizer video, and started hacking away at the problem.

It didn't take long to synthesize the crash course in ML that Karpathy was giving, and after testing the tokenizer on a few samples of the larger dataset, I was ready to train it on the entire corpus of ~3 million games.

Blunders

Emboldened by the tokenizer's success, I decided it was time to push things a bit further and train a language model.

I started by pulling up another Karpathy video: Let's build GPT: from scratch, in code, spelled out.

Unfortunately, I made many blunders in this part of the process:

  1. Hubris because tokenizing is a much simpler process than training a model
  2. Not knowing which model architecture to use - I went through a few of them:
    1. I started with GPT-2 based on the Karpathy video
    2. I switcehd to BERT based on some brief research
    3. I finally decided on ModernBERT after some trial and error
  3. Not fully understanding the semantic differences between written language and structured data like PGN strings
  4. Jumping straight into training a model without going through the rest of Karpathy's Neural Networks: Zero to Hero series
  5. Trying to use LLMs and vibe coding to jumpstart my understanding of training a model and getting stuck in debugging loops with code that I didn't fully understand

There are many more, but these were the big ones that stand out in hindsight.

I hit a wall pretty quickly, got frustrated, and ended up setting this project aside to fully immerse myself in my Recurse Center batch.

Endgame

As my 12 weeks at the Recurse Center were coming to an end, I picked up the model training again, and I'm excited to say that there has been a lot of progress on training a language model that not only understands PGN strings, but also seems to be capable of making a few legal, logical moves in a familiar game of chess

I've learned more about the underlying architecture, and have been carefully integrating help from my Generative AI-powered copilot (Claude 3.7 Sonnet in Cursor) to help me refine smaller sections of the training code.

It would be simpler to have it just spit out PGN-like tokens - and that's all my first iteration did - but I'm trying to see if there's a way to encode an understanding of the game by only using the tokens. This involves weighting the tokens based on their position in the PGN string and reducing the weight of tokens like the turn number because they are more prevalent in the training data but only valid in specific token positions.

Use the Arrow button to step through each iteration of the training loop.

It's also important to consider the context of the entire game when making each move, so I'm not just training on a randomly sampled subset of PGN tokens at a time but a progressively longer string of tokens as context for each move in each game of the training set.

In Chess, every previous move influences the future state of the game.

The basic training implementation seeks to:

  1. Start with a baseline sample of short games as a warmup with a lower learning rate increasing gradually
  2. Then use a set of medium-length games with a standard learning rate
  3. Then use a set of commonly played openings with a higher learning rate to improve performance at the start of the game - where I assume users of the Chess Tutor will benefit the most
  4. Finish with a set of longer games with a standard learning rate

After this pre-training process has been refined, my goal will be to use a chess engine, like Stockfish, to evaluate the quality of the model's moves and provide some reinforcement learning to improve the model's performance.

Stalemate

I'm still navigating the training process and working out the challenges of disparate environments. Training code that runs on my Macbook doesn't necessarily work when I'm loading it up on a cloud provider with a small GPU like a T4 - and the various PyTorch or CUDA error messages are only debuggable in the cloud environment, which makes the debugging cycle and resolution time stretch out much longer than it should

This is a work in progress, but I can't wait to share the model and the results of testing my hypothesis with you soon.

I have another hypothesis that trying to get LLMs to play chess isn't a valuable use of time or compute, but that's another post for another day.