The Ultimate Copy-Cat - ML Implementations


#5

Woah thank you so much for this insightful post! We are analyzing the merits of ML for the game so this post is extremely helpful.

The copy strategy, I think, will ultimately never work if you use multiple algos, because it's effectively just an "average" of what algos tend to do in certain situations. Seeing how it plays is really interesting though; the replay you link shows that the maze-like structure is very popular, as even the averaged summation of all the algos you trained on manages to leave a line open for your units to go through.

What is your ML background btw? Have you used it professionally or just played with it as a hobby? Have you tried stuff like OpenAI Gym?

Anyway I’m not a super expert but here are some approaches I would personally try:

  • Completely ML Approach: Pick an algo to copy as a base, and then use something like Q-learning (or NEAT), or even genetic training, to improve upon it. The hard part would be optimizing the random "off policy" moves it can make. Essentially, the way any of these ML approaches works is that you try a random move the algo wouldn't normally make, and if it does well according to your reward function, you increase the likelihood of making it in a similar situation. Because getting a large number of matches is hard, doing just anything random to "explore the action space" is probably not effective, so I would try to make the "off policy" move the algo tries at least a little smart. For example, make sure it doesn't try illegal moves, or obviously bad moves, but this is tricky. Of course, this off-policy component could practically be an algo itself, so finding a not-too-complicated way to try "possibly good" new moves would be nice. Deciding between NEAT, Q-learning, genetic algos, etc. is also tricky. When I was more into ML, Q-learning was the cutting edge, but I always thought NEAT and genetic algos were underrated as well; not sure what the consensus is nowadays, I haven't been keeping up.

  • Multi-Algo Decider: At the opposite end of the spectrum, instead of having the ML part of the algo decide specific firewall placements, you could have it decide between a selection of hardcoded algos after a few turns. For example, start with a basic defense, then use ML to decide which of X preset algo designs should be chosen based on what configuration the enemy has. This would by far be the easiest way to use ML; the output would just be a choice between X possibilities, though it would also require a bunch of hardcoded-ish algos to choose from. Example preset strats could be: left side Ping cannon, right side Ping cannon, EMP line, center-of-board Ping rush, etc. Figuring out which of the subset algos to use is pretty tricky for a regular algo to decide, so an ML approach could definitely help here. You could also have it wait to "decide" until the enemy algo commits most of its cores. Of course, this wouldn't work well against super dynamic algos that completely change their strategy mid-game like Aelgoo, but it could still do really well, probably reach top 10. This algo wouldn't need to be pretrained on copying a previous algo.

    • Heuristics approach: A subset of this approach: instead of having the ML part decide which algo to use, it just decides or predicts features of the game. For example, it could decide whether the enemy is using a maze or not (you could train it on algos you know are mazes and some you know aren't). Is the enemy attacking the left or right side? Is the enemy a Ping cannon? Is it better to attack the right or left side? Basically, you use the ML part just to predict features about the game, and then a regular algo uses those features to make decisions.
  • Hybrid Approach: In between the last two approaches is a hybrid. Instead of choosing the exact firewall location, or just picking between preselected strategies, the ML part of the algo uses a combination of predetermined structures to build with, to avoid clobbering itself. For example, there are a few "standard" defense shapes, like a destructor with some filters around it to soak up damage, or an EMP line. These can also be divided into a few preset locations that are optimal to place them in. For example, you may have an optimized defense that is good against EMPs (call that substructure A) and another that defends well against Pings (substructure B). Instead of having the ML strategy decide "put a destructor here," it could decide "put substructure A in position 3," substructure A being a fortified destructor, for example, and position 3 being the middle left in the front. I imagine it would feel something like how Aelgoo feels to play against. The downside is it wouldn't come up with new substructures, but you could train a separate network that just determines substructures, one that is even a subcomponent of the main big net (there are a lot of crazy neural net graphs out there in the research). To be even more ML-based, you could have it choose more specifically what to do and just have some baseline rules to guide it, like never make a move that completely blocks attack paths. For attacking it gets a bit tricky because attack and defense are so linked, but there are all sorts of things you could do to make it work well, like only use attack structure A if defense structure B is not used (though hopefully the ML also learns this on its own).

    • Pretraining: This algo also wouldn't need to be pretrained on a copy, but there are a lot of weird pretraining techniques that could be useful. For example, the one where you remove the last few layers, replace them with a few layers that output something different (like the substructure choices above), and retrain, thereby keeping some of the basic pattern recognition the bot you are copying from has (but using those heuristics to output higher-level substructure suggestions instead of specific locations). For example, you could copy Aelgoo, then remove the last few layers and have it output substructures instead; whatever Aelgoo is using to detect where to defend will be kept in the first few layers' logic, but the output would be retrained to use your substructures. Of course, the pretraining I'm describing is kind of an old technique and your mileage may vary; sometimes it doesn't really help at all, but sometimes it's very useful, like with autoencoders.
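As a rough illustration of the "smart off-policy" exploration idea in the first bullet, here is a minimal sketch (the function names and action encoding are hypothetical, not from any actual algo) of epsilon-greedy exploration that only ever samples from legal moves:

```python
import random

def choose_action(q_values, legal_mask, epsilon=0.1):
    """Epsilon-greedy over LEGAL actions only: sometimes explore randomly
    among legal moves, otherwise exploit the highest-valued legal move."""
    legal = [i for i, ok in enumerate(legal_mask) if ok]
    if random.random() < epsilon:
        return random.choice(legal)          # "smart" random: never illegal
    return max(legal, key=lambda i: q_values[i])

# toy example: action 1 is illegal, action 2 has the best legal Q-value
q = [0.2, 0.9, 0.5]
mask = [True, False, True]
assert choose_action(q, mask, epsilon=0.0) == 2
```

The same mask could also encode "obviously bad" moves, so exploration spends its limited matches on placements that at least have a chance of being good.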

#6

Thanks for the comments!

@arnby regarding following a consistent strategy, I think an excellent solution (on paper) is your idea of an RNN. Certainly, this makes sense given how each state of the game board is dependent on the turns before it. This would help a lot with making it stick to one strategy and also with recognizing that strategy as independent from others. I thought I would try to tackle NEAT next, since I'm just interested in the algorithm itself and haven't played with it, but with some preliminary work already done towards NEAT, I can say I think it's possible but may be stripped down to a point where it is not functional, and it is going to take quite a bit of work. I'll talk more about NEAT below. But given how NEAT presents some quite difficult challenges, and given that converting my current setup to an RNN will be (fairly) simple, I'll definitely give that a try.


@kkroep, I definitely realize in hindsight what you are saying about your algo being a poor candidate since it focuses on the attack. However, I simply wanted to see if a recreation would work, and I randomly picked a top algo to mine data from. The idea of going after @Aeldrexan's Aelgoo is an excellent one, and it would be very interesting to see the results; I'll definitely give it a test run in a little while out of curiosity. (Assuming @Aeldrexan doesn't have any complaints - speak now or forever hold your peace :slight_smile:).


@C1Junaid, wow, some awesome ideas presented in your reply. My goal is to make whatever ML implementation I use as pure as possible. What I mean by this is that I agree a Multi-Algo Decider is probably a much easier implementation, but I hope to be able to train some sort of network to think without any static elements (I am not including things like pre-training and substructures in this). My general process (I believe) will be to start by giving as little guidance as possible, and slowly work backward in terms of the amount of static instruction regarding the strategy.

I noticed two common themes in your list, and I feel ML can be used to benefit an algo in two main ways. It can either make the decisions itself, or provide information (like what strategy to use) that helps an already-built algo make decisions. I think the latter is a "safer" and probably more reliable choice, which is naturally why I want to try the former first :).

That being said, I will be trying many different approaches and will draw from both sides of this equation. I have never fully done machine learning before, so I am looking to learn lots of different implementations. I have followed ML for quite a while and have done things like build my own (very basic) neural network that did forward and backward propagation from scratch in C++. I have looked at OpenAI Gym before, but haven't done anything with it yet. Every time I started working on something, other things took precedence and time eluded me, so Terminal provides an excellent opportunity for me to learn about this :).


Some thoughts about NEAT and genetic algorithms:

I began some preliminary work setting up a NEAT algo, and it became very clear that the size of each genome, using the raw board as inputs and outputs, is simply too large (this is exaggerated by the inefficiency of Python, which could be offset in C, but I'm not ready to spend that much time on that yet). There are also some issues with training in parallel, since I obviously don't want to create a new folder for every single genome. Instead, I have a single folder that references a pickled Genome object that it will use for its logic. It then gets which pickled object it should use from the (slightly modified) run_arena.py script. This leads to race conditions when creating the algos, but that is pretty easily solved (thought I'd mention it anyway).

The big issue is simply the number of inputs and outputs. If you use each board position, plus health, cores, and bits for each player, as inputs, there are 427 inputs (including one for bias). Then, you can expect a similar number of outputs, or possibly more depending on your implementation. This is nothing, small in fact, for a standard NN model, but in an inefficient language (like I said, I'm not doing this in C), where you need to create a new population every epoch with new mutations, this is extremely slow.

The obvious solution (which is very common) is to simply reduce the number. This is fairly easy for the inputs, since you can simply package the data differently (an analogy would be making an image more pixelated) without really changing the information the network receives. Where I am struggling to find a solution is reducing the number of outputs. This is, of course, assuming an implementation that builds a base. It is much more difficult to take a pixelated image and make it high resolution; similarly, it is challenging to make the number of outputs convey more data without actually creating more nodes. The current idea I am toying with is some sort of hashing function, but if anyone has any thoughts I'd love to hear them. The other issue with the number of inputs/outputs, of course, is just training time: it will take much, much longer to learn with more initial nodes.
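The "pixelating" analogy for the inputs can be sketched as a simple 2x2 pooling pass. This assumes, hypothetically, that the board has been packed into a 28x28 grid; the exact packing is an implementation choice:

```python
def pool2x2(board):
    """Downsample a 2D grid by taking the max of each 2x2 block,
    i.e. make the 'image' of the board more pixelated."""
    h, w = len(board), len(board[0])
    return [[max(board[r][c], board[r][c + 1],
                 board[r + 1][c], board[r + 1][c + 1])
             for c in range(0, w, 2)]
            for r in range(0, h, 2)]

board = [[0] * 28 for _ in range(28)]
board[3][5] = 1                       # one firewall somewhere
small = pool2x2(board)                # 28x28 -> 14x14: 784 cells -> 196
assert len(small) == 14 and len(small[0]) == 14
assert small[1][2] == 1               # the firewall survives pooling
```

This cuts the input count by 75% while keeping coarse positional information; the output side, as noted above, has no such easy trick.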

All of the above problems (which really all come down to training time) make me hesitant to try a NEAT implementation without at least trying the RNN idea first (and the little test with Aelgoo).


Little update:

Regarding the training I mentioned in my first comment (the 11,000 matches): it experiences the same issue observed by everyone thus far regarding committing to a strategy. It is now very confident in the decisions it makes (>0.98), but it still chooses everything. This, more than anything, makes me want to try an RNN, but if anyone has other suggestions, or reasons I shouldn't, just let me know.

Thanks!


#7

I'm currently using PyTorch and haven't had a bot do anything of merit yet, but I've somewhat dealt with the issue of the large input (not entirely though). Neural networks work best with one-hot encoded input data. This increases the input size if you want to use machine learning, but there are a few strategies that I've used.

  1. Architecture is VERY important with ML algos.
    Before I thought about the number of weights, and was just purely thinking about what would be nice if trained properly, I ended up with a network that had a little over a million weights (YIKES!), so think carefully about the structure of your NN.
  2. Use a CNN to reduce the number of weights in a meaningful-ish way.
    Using a CNN with a 2x2 kernel and a stride of 2 should reduce the input space by 75 percent. The weights used in a CNN are worth it to reduce the input space.
  3. Use NNs to reduce the input space. Implementing an autoencoder will help reduce the input space if trained correctly. But NNs use a lot of weights to reduce the input space, so this is a secondary option that may have to be used either way.
  4. Use the least bulky RNN possible. In this scenario, I'd recommend either a vanilla RNN or a GRU; the bulkiness of LSTMs just isn't worth it. RNNs tend to use a lot of weights. If there are other RNN types that need about the same or fewer weights per input than a GRU cell, use those.
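To put rough numbers on point 4, the standard parameter counts for the three cell types follow from one weight block (input-to-hidden plus hidden-to-hidden, plus bias) per gate: one block for a vanilla RNN, three for a GRU, four for an LSTM. A back-of-the-envelope sketch, using the 427 board inputs mentioned earlier and a hypothetical 128 hidden units:

```python
def rnn_params(n, h):
    """Vanilla RNN: one weight block of h*(n+h) weights plus h biases."""
    return h * (n + h) + h

def gru_params(n, h):
    """GRU: three gates, i.e. three such blocks."""
    return 3 * rnn_params(n, h)

def lstm_params(n, h):
    """LSTM: four gates, i.e. four such blocks."""
    return 4 * rnn_params(n, h)

n, h = 427, 128
print(rnn_params(n, h))   # 71168
print(gru_params(n, h))   # 213504
print(lstm_params(n, h))  # 284672
```

So at this size an LSTM carries roughly 70k more parameters than a GRU for the same hidden width, which is the "bulkiness" trade-off being described.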

This is everything I can think of right now. Good luck to the rest of the ML algo coders!


#8

I haven’t had a lot of time lately to continue work on this, but I have done a little bit:

I have begun some preliminary work (honestly, mostly done - it went way faster than I expected) regarding classification of algos.

What I mean by this is taking a current snapshot of the board (looking at one side, or algo) and attempting to identify what sort of strategy it is using. The obvious hiccup is that for an ML approach you'd typically have labels to tell it what is right/wrong. But if you had those, you wouldn't need the ML… So I chose to use unsupervised learning to identify strategies. Essentially, I made it a k-means clustering problem.

Although I have just started this, it is pretty clear that it works quite well. Let's just say that I have found a lot of maze algos…

My thought is this information can be used in a couple of different ways. First, I think it would be interesting to train different neural networks to each counter only one specific strategy, and then choose the NN based on the cluster classification. My belief is this would help with the choosing-and-sticking-to-a-strategy problem I have been facing thus far, since the NN wouldn't have to make that decision. A simpler use case would be to identify the strategies by watching games (which I plan on doing regardless) and hard-code a counter-strategy.

The main challenge with clustering is determining its accuracy based on the k-value (the number of clusters). I am not sure what an optimal k-value would be, and it is an extremely important number. A basic approach would be to simply try a bunch, which is fine because training is actually quite quick, but the problem is then determining a k-value's accuracy. If anyone has any thoughts on determining an optimal k-value, or on how to judge a cluster's accuracy, I would love to hear them. For the moment, I'll just upload an algo that builds a message (abbreviated) similar to @C1Ryan's fee-fi-fo-fum idea to tell me what it thinks the opponent's strategy is, so if you face a messaging algo, that's me :).
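One common answer to the k-value question is the elbow method: run k-means for several values of k, compute the within-cluster sum of squares (inertia), and look for the bend where adding more clusters stops helping much. A toy sketch with a tiny pure-Python k-means and obviously fake 2-D data (not the actual board features):

```python
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(pts):
    return [sum(xs) / len(pts) for xs in zip(*pts)]

def kmeans(points, k, iters=20, seed=0):
    """Tiny k-means over lists of vectors; returns (centroids, inertia)."""
    rng = random.Random(seed)
    cents = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda i: dist2(p, cents[i]))].append(p)
        cents = [mean(g) if g else cents[i] for i, g in enumerate(groups)]
    inertia = sum(min(dist2(p, c) for c in cents) for p in points)
    return cents, inertia

# two obvious blobs: inertia drops sharply from k=1 to k=2, then
# flattens out -- the "elbow" suggests k=2
pts = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
       [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
for k in (1, 2, 3):
    _, inertia = kmeans(pts, k)
    print(k, round(inertia, 3))
```

The silhouette score is another standard way to judge a clustering without labels; either can be computed for each candidate k since training here is cheap.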

Thanks for all the comments thus far, they’ve all been awesome!


#9

Alright, I got my first ML algo (and first ML project in general) up and running locally, which is pretty exciting. I trained it on ~2000 replays, although it's clear that it's confined to copying Demux-like algos due to the hard-coded rounds 0/1 I provide it.

The algo is designed to just provide to me the defensive placements that it thinks would be best. I work the rest of the magic to place the best attack, so it’s certainly not operating “all on its own.”

Here's the first game, which I think is pretty successful and shows it is able to express ideas about how to add on to Demux: second_game.replay (5.8 MB)

Here’s another, in which I don’t give it the first two rounds to give it structure. It’s a hot mess and seems to want to copy some dynamic algos briefly before it spirals out of control: third_game_organic_placements.replay (2.1 MB)

Overall I think the algo works pretty well. It's certainly a bit "unfair" to pure ML algos to hand mine optimized attacks and a well-defined structure, but maybe later on I'll get it to operate a bit more autonomously. I'll see what I can do about getting this running live, but I suspect the dependency workaround will not install in time to be functional. I might need to wait it out for Python dependencies to be supported, unless I figure out how to write my own forward propagation. Update: "Unable to find 'keras'" when I use the workaround…

From here I might just commit it to learning from Demux-algos. Ideally I’d like to set up some kind of self-reinforcement environment, but with the struggle of setting this up that might take a while…


Mixed Results: DQN Experimentation
#10

Excitingly, I got “ML_Bootstrap” up and running on the live servers using some workarounds. It’s no champ, but it’s competing alright. The algo ID is 56279 for those that want to see it derp around and place weird things.

Here’s an alternative method to getting a Keras model running on the live servers without needing Keras, Tensorflow, or Numpy as dependencies:

  • Using this repo, get your Keras model running locally on C++
  • Using your choice of Python-C++ interfacing, (I used this method), get your C++ model back into your Python algo
  • Spend many hours debugging

A couple notes on debugging this: std::cout will mess up the game engine, but you can replace those with std::cerr. Also, the JSON and H5 to plain text conversion is outdated. Line 30 of dump_to_simple_cpp.py should read arch["config"]["layers"]. Everything else should be regular debugging. There may have been some other fixes I had to make, so if anyone’s trying to attempt this and is having immense difficulty, I can fork the repo and upload a fixed version.
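For anyone hitting the same conversion issue, the fix above comes down to the layer list in newer Keras architecture JSON being nested one level deeper than the old script expects. A minimal illustration (the JSON below is a simplified stand-in, not a real exported model):

```python
import json

# Simplified shape of a newer Keras model-architecture JSON: the layer
# list lives under arch["config"]["layers"], not directly under
# arch["config"], which is why line 30 of dump_to_simple_cpp.py breaks.
model_json = json.dumps({
    "class_name": "Sequential",
    "config": {"name": "sequential", "layers": [
        {"class_name": "Dense", "config": {"units": 32}},
    ]},
})

arch = json.loads(model_json)
layers = arch["config"]["layers"]   # the corrected accessor
assert layers[0]["class_name"] == "Dense"
```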

The upside of this method over the other discussed ones is that you can get your Keras model up and running without figuring out how to load a model and its weights, while also avoiding implementing a forward propagation method. The downside, of course, is that you're debugging a slightly broken library, and that's no fun.


Dependencies in Python Algos
What exactly am I supposed to upload?
#11

Thanks for your open post. I was not really developing for the past month, so I am checking back now to find that a lot of interesting stuff is happening. There seem to be quite a few algos around that use a base structure inspired by Demux, but I can tell that under the hood much more interesting algos are hiding than an attempt at a replica. This is great, because now I can also learn a lot from you guys :slight_smile:.

So now the interesting question is, is the ML algorithm able to come up with decisions that are surprisingly intelligent, or is that still hard to achieve? I would be extremely impressed if this ML type approach would be able to formulate attack strategies that beat (my) hand crafted ones. I would expect that constructing multi-turn strategies is extremely difficult to achieve.


#12

The model is designed to give me a set of “optimal” defensive places, given an inputted game state. What I do with those placements and what kind of game state I pass into it (relative to that turn’s game state) are the result of my handcrafted framework, so there’s not a lot of room for “expressing” ideas both ways (the handcrafted logic is muddled by the less-than-perfect defensive placements, and the defensive placements are muddled by an imperfect model of the game state being passed into it).

That being said, you can see the algo expressing some ideas. A couple from the replay I provided:

  • Stuff destructors in the corner to stop the corner cannon
  • Replace firewalls that have been destroyed from the main structure
  • Attempt to express an attack strategy by building out the middle rows of filters

The third bullet point is obviously the weakest, although that’s likely the fault of my approach. I don’t feed it how many cores it has to spend, nor does it give me the entire structure it wants to place (just that turn’s placements). That’s more of an issue with my model than something ML can/can’t do (I think a better model could express more complete attack strategies, or perhaps better logic that buffers placements when there aren’t enough cores for all of them).

A note on the second one: This one was the most surprising idea it expressed. The removal logic is not part of the neural net, that’s being done by hand. Deciding to replace things that have been removed/destroyed is quite interesting to see, particularly since it’s giving me that turn’s placements, not a complete structure. I suppose that replacing firewalls in the core part of the structure is just standard behavior it was able to mimic.


#13

Firstly super awesome to see ML algos, thanks for experimenting with them!

So, what exactly did you train your algo on? Any replays, or ones with Demux-like strats only? How did you judge what the desired output should be (assuming this is supervised learning)? Was it something like: take a game state, look at what firewalls were placed, and if the turn was "good" (the enemy deployed units but they were properly defended against, player health didn't go down), then train on that turn?


#14

Thanks, I had a lot of fun getting it working. It was definitely a huge learning curve (not that I made it very far into the curve).

So, what exactly did you train your algo on?

This was a detail I initially left out on purpose, but I’ll handle the followups with care.

Any replays, or ones with Demux-like strats only?

I downloaded 100 replays from the top 20 algos as of 1-26. The Demux-specific behavior you see is because the algo has a hardcoded turn 0/1, but it "technically" could mimic any other basic strategy it trained on if I supplied a different initial turn. If I make another version with mimicking behavior, I'll probably just download Demux-esque replays.

How did you judge what the desired output should be (assuming this is supervised learning)? Was it something like: take a game state, look at what firewalls were placed, and if the turn was "good" (the enemy deployed units but they were properly defended against, player health didn't go down), then train on that turn?

I'm not sure how to write my own loss function, so instead I calculated this beforehand. I iterated over each turn of each replay to evaluate whether or not the turn was "good," then added it to the dataset if it was. The neural net was then simply trained to produce outputs that mirror those "good" placements for a given game state, which makes it more of a mimicking algo than something that has the capacity to "invent" new ways to play (something I want to explore later, but that would take a lot more work).

The metric for “goodness” is closely tied to my regular algo, so I’ll leave that out.

Edit: Important note on iterating over the replays: technically, one could compare two consecutive turns taken from the replay file to define "goodness," without the need for a simulator at all, much less a game engine.
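That replay-only "goodness" check might look something like the sketch below. The dict layout and the health-based criterion are hypothetical stand-ins; the author's actual metric is deliberately unstated:

```python
def good_turns(frames):
    """Label each turn by comparing it to the next frame: if our health
    did not drop, keep (state, placements) as a training example.
    `frames` is a list of dicts, a simplified stand-in for parsed
    replay turns."""
    dataset = []
    for prev, curr in zip(frames, frames[1:]):
        if curr["my_health"] >= prev["my_health"]:
            dataset.append((prev["state"], prev["placements"]))
    return dataset

frames = [
    {"my_health": 30, "state": "s0", "placements": ["d1"]},
    {"my_health": 30, "state": "s1", "placements": ["d2"]},  # held: good
    {"my_health": 25, "state": "s2", "placements": []},      # took damage
]
assert good_turns(frames) == [("s0", ["d1"])]
```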


#15

I recently did some work on copy-cat ML algos.

Model

I trained a model to place both attacks and defenses according to the last 35 versions of the EAGLE algo.
That amounts to about 800k placements to learn from.

I chose to take several versions of one algo in the hope that it would add some noise to the learning material and keep the model from overfitting (all my ML tries on a single algo overfitted too fast). But I am aware that it will be harder for the model to learn a consistent strategy. The assumption is that the versions do not differ that much.

I uploaded 4 versions of this model, at different learning stages (the number at the end is supposed to be the test accuracy out of 1000):

Results

It did learn a few things:

  • the general structure of an eagle algo
  • the idea of opening and closing paths of attacks to defend and create opportunities
  • several attack patterns from EAGLE:
    * on turn 10 it closes every path to make a corner attack
    * on turn 7 it saw a pretty good opportunity for an EMP attack

What can be improved:

  • I currently give it only the board and the player stats. It has absolutely no clue about the attacks the opponent made or where it scored / got scored on.
  • Use a CNN instead of a flat NN.
  • Overall accuracy: it still does some silly moves, like attacking while closing every path, or not spending cores after receiving a huge blow (the algos hover around a rating of 1400 :disappointed_relieved:).
  • Training material: I am afraid that the test and validation sets are leaking because of repeating patterns among the opponents.

Anyway, thanks to @Felix for this opportunity! :grin:


#16

Very cool stuff, arnby! Interesting to see your progress with it. Out of curiosity, did you ever perform an experiment to see how the algorithm's performance scales as you increase the amount of data? (And side note: are you re-training the final model on the train+valid+test sets?)

For example, take 100k turn frames and train a model on them, then take 200k turns and train a model on those, etc. Then, compare the win rates amongst the different models (relative to each other or to some baseline like the EAGLE model), so that you can estimate how the performance scales in response to the data available, which might tell you whether getting more data is likely to significantly boost the algo's capabilities.

Either way, awesome stuff and thanks for sharing!


#17

Well done @arnby!

I’m really impressed by your ml-clones from my algos :grinning:.

Some other really cool matches I noticed from your ML-algos:

  1. https://terminal.c1games.com/watch/5300681
  2. https://terminal.c1games.com/watch/5300527

And I'm most impressed that your algos sometimes wait many turns before their next attack, because I've seen that your first ML-algos launched attacks every turn.

In the second match, in rounds 6 to 8, your algo entered a situation where it would not have been able to score anymore if the opponent hadn't attacked themselves.
This problem is fixed in the original EAGLE by also using information from previous turns and action phases.
I think that this information could also benefit your algos.

One of the hardest tasks for ML when copying my algo is probably learning to open a path and then attack there. Right now your algos seem to replicate my actions without really understanding that the attack does nothing when there is no opening. That's why I think they can perform well against strategies they learned from, but will be bad against new ones they have never seen before.

I look forward to seeing more improvements to your ML-algos.


#18

Aside from learning from one versus from 35 algo versions, I did not. I initially trained on 40 EAGLE versions, and then dumped the first five because the model learned less successfully on them.

No, I don't; would you? How do you decide then when to stop the learning process (i.e., avoid overfitting)?

Thanks @Felix, These two are pretty cool!

And what a surprise: I just discovered that ML_Eagle_765 has a rating of 2075 :partying_face::partying_face:
I admit that after watching the first few matches, I thought they would never climb the ladder.
I think ML_Eagle_765 got lucky and finally managed to be matched up against opponents it had learned from, hence the climb.

And it is also quite unfamiliar with getting destroyed: I noticed that it didn't know how to react when a large chunk of its base was gone, even though it had the cores to rebuild it.

I will try some variations and if there is some interesting stuff, I will post updates here.


#19

RIP, it would have beaten me if not for one big mistake:
https://terminal.c1games.com/watch/5307823

The removals of that algo really confuse my algo…
Either way, nice work @arnby


#20

@Felix, are you talking about the RL algos? I thought they would go unnoticed :sweat_smile:

There is still a lot of room for improvement, yes :wink:

Btw, is your algo dynamic? Judging by the weirdness of the structure it seems so, or at least to a certain level?


#21

@arnby, yes I meant your RL-algos.

With this tool (thread) I can find everything. Even if it is very far away from the top 10 :grinning:
(hint: it’s currently not possible to find your RL-algos, because I reduced the number of algos that are getting loaded to increase the loading speed.)


#22

Yeah, it’s very dynamic, I was getting bored of just making static structure algos.


#23

I see your point, but I laughed out loud when I realized all my algos have been simple finite state machines with a bunch of if statements… :rofl:

I find debugging and refining a strategy to be very easy when you know exactly what piece of code builds what structure, so I usually try to get as far as I can with simple code, and add in more complex logic later on when I hit a wall.


#24

That's fair. It might be an interesting experiment to compare the win rate of a bot trained on 100% of the match data to that of a bot trained on 50% of the data. If the former is significantly better, that could be a good indication there is still a lot to squeeze out just by collecting more data.

Additionally, just a thought, but something I've used in the past on another game (generals.io) that improved my initial behavioral cloning (copy-cat) approach was to move beyond mimicking a single particular algorithm (which was my first thought too). Instead, I took a huge amount of match data from the top X players and modified the loss function by scaling it with the match outcome (-1 for a loss, 1 for a win).

So for example, let's say I have match data for Player A vs Player B, consisting of 250 turns worth of data (one turn for each player, with a frame of history and the actions they took). Suppose Player A wins the match. In this case, I would take all of his turn data and attempt to train the algorithm with supervised learning to predict the action he would take, multiplying the loss function by 1 (because he won). This is very similar to what you are doing now, except instead of only predicting a particular player's turn data in the whole dataset, you are always predicting the actions taken by the winner of the match. Finally, I would also take the turn data for Player B (the loser) and perform the exact same training, except I would scale the loss function gradient by -1, which essentially trains the network to AVOID mimicking his moves.

The general idea is that you want your network to mimic the best players when they are winning matches, and avoid behaving like the players that lost. However, Terminal might have an advantage because of the HP system and how it definitively leads to a win (I think), so it might be worth training a network to predict how much HP damage it will take AND how much it will deal by playing a certain move; then you just pick the action with the highest expected dealt/received ratio. Or, just use the paradigm above, but instead of scaling the loss by the match outcome (-1 or 1), scale it by (damage_dealt - damage_received), so that it learns to mimic the moves with a high dealt/received ratio and avoid mimicking moves with a bad ratio. When I switched to this paradigm in that other game, I saw a HUGE boost in performance because of the increase in dataset size I got by looking at many more high-level players' data instead of just one.
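The outcome-scaled imitation loss described here can be sketched in a few lines (hypothetical names; a real implementation would apply the scaling to the gradient inside the training framework):

```python
import math

def scaled_nll(probs, action, outcome):
    """Negative log-likelihood of the taken action, scaled by the match
    outcome: +1 (winner -> mimic the move) or -1 (loser -> push its
    probability down). A damage-based scale like (dealt - received)
    drops into the same slot."""
    return -outcome * math.log(probs[action])

# winner took action 0 with predicted prob 0.7: normal imitation loss
win_loss = scaled_nll([0.7, 0.3], 0, outcome=+1)
# loser took action 1 with predicted prob 0.3: the sign flip means
# minimizing this loss AVOIDS the loser's move
lose_loss = scaled_nll([0.7, 0.3], 1, outcome=-1)
assert win_loss > 0 and lose_loss < 0
```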

Typically, you determine the epoch count (how many epochs to train before stopping) via validation. So, for example, you start off with train/valid/test splits. You perform your training process, and suppose after N epochs you see that your validation error starts increasing (and doesn't stop). Now, given this N, you combine your train+valid sets into one bigger set, re-train your model using the best hyperparameters found during the validation process (including N), stop after training N epochs, and evaluate the model on the test set to get your test error.

Finally, after having validated and tested your model, you take the ENTIRE dataset (train+valid+test) and re-perform the entire training process on the whole dataset with the same hyperparameters (and N) as before. The model you end up with is your final model, and it is likely to perform as well as (or better than) its test error indicates, since it had more data to train on than before. It is a very common error to only train on the train set and use that model as your final model, missing out on a large chunk of your dataset.
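The epoch-selection step above reduces to a tiny sketch: pick the epoch with minimum validation error as a simple stand-in for "stop when validation error starts increasing and doesn't stop" (names hypothetical):

```python
def pick_epoch(val_errors):
    """Return N, the epoch index with the lowest validation error."""
    return min(range(len(val_errors)), key=lambda e: val_errors[e])

# validation error falls, bottoms out at epoch 3, then climbs again
val = [0.9, 0.6, 0.45, 0.40, 0.43, 0.5, 0.6]
N = pick_epoch(val)
assert N == 3
# ...then retrain on train+valid (and finally on the full dataset)
# for exactly N epochs with the same hyperparameters
```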

EDIT: One last thing I thought of (sorry for the wall of text LOL): another improvement that significantly increased my bot's win rate in that other game was to introduce data augmentation via rotation/mirroring to exploit the symmetry of the game. I'm not sure if you can rotate for Terminal, but it looks like Terminal is symmetrical horizontally, so you can take all of the turn data you have, then duplicate it and mirror the game state/actions, which will still be completely valid moves. In essence, this quickly doubles the size of your dataset (with a bit of correlation), which is one of the best ways to reduce overfitting.
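The horizontal mirroring is essentially a one-liner over coordinates, assuming a 28-wide board where x maps to 27 - x (the tuple layout below is a hypothetical stand-in for real parsed turn data):

```python
def mirror(state_actions, width=28):
    """Horizontally mirror one turn of data: flip every (x, y)
    coordinate across the board's vertical axis."""
    state, actions = state_actions
    flip = lambda loc: (width - 1 - loc[0], loc[1])
    return ([flip(loc) for loc in state], [flip(loc) for loc in actions])

turn = ([(0, 13), (5, 8)], [(3, 10)])
assert mirror(turn) == ([(27, 13), (22, 8)], [(24, 10)])
```

Appending the mirrored copy of every turn to the dataset doubles its size without collecting a single extra replay.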