That’s fair, it might be an interesting experiment to try: compare the winrate of a bot trained on 100% of the match data to the winrate of a bot trained on 50% of it. If the former is significantly better, that could be a good indication there is still a lot to squeeze out just by collecting more data.
Additionally, just a thought, but something I’ve used in the past on another game (generals.io) that improved on my initial behavioral cloning (copy-cat) approach was to move beyond mimicking a single particular algorithm (which was my first thought too). Instead, I took a huge amount of match data from the top X players and modified the loss function by scaling it with the match outcome (-1 for a loss, 1 for a win).
So for example, let’s say I have match data for Player A vs Player B, consisting of 250 turns worth of data (one turn for each player, with a frame of history and the actions they took). Suppose Player A wins the match. In this case, I take all of his turn data and train the network with supervised learning to predict the action he would take, multiplying the loss by 1 (because he won). This is very similar to what you are doing now, except that instead of only predicting one particular player’s turns across the whole dataset, you are always predicting the actions taken by the winner of each match. Finally, I also take the turn data for Player B (the loser) and perform the exact same training, except I scale the loss gradient by -1, which essentially trains the network to AVOID mimicking his moves.
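To make that concrete, here's a minimal sketch of the outcome-scaled loss for one turn, assuming a discrete action space where your network outputs per-action logits (the function name and shapes are just illustrative, not any particular framework's API):

```python
import math

def outcome_scaled_loss(logits, taken_action, outcome):
    """Cross-entropy on the action the player actually took, scaled by outcome.

    outcome = +1 for the winner's turns (mimic), -1 for the loser's (avoid).
    """
    m = max(logits)                                  # stable softmax
    exps = [math.exp(z - m) for z in logits]
    prob = exps[taken_action] / sum(exps)
    return outcome * -math.log(prob)

logits = [2.0, 0.5, -1.0]
win_loss = outcome_scaled_loss(logits, 0, +1)   # winner: ordinary imitation loss
lose_loss = outcome_scaled_loss(logits, 0, -1)  # loser: negated loss
```

Minimizing the negated term for the loser is the same as maximizing his cross-entropy, i.e. driving the probability of his chosen action down.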
The general idea is that you want your network to mimic the best players when they are winning matches, and avoid behaving like the players that lost. However, Terminals might have an advantage here because of the HP system and how it definitively leads to a win (I think). In that case, it might be worth training a network to predict how much HP damage it will take AND how much it will deal by playing a certain move, then just picking the action with the highest expected dealt/received ratio. Or, just use the paradigm above, but instead of scaling the loss by match outcome (-1 or 1), scale it by (damage dealt - damage received) so that it learns to mimic the moves with a high dealt/received ratio and avoid mimicking moves with a bad one. When I switched to this paradigm while competing in that other game, I saw a HUGE boost in performance because of the increase in dataset size I got from using many high-level players’ data instead of just one player’s.
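The damage-differential variant might look like this sketch. The normalizing `scale` and the per-turn damage bookkeeping are my assumptions; you'd plug in however Terminals reports HP changes each turn:

```python
import math

def damage_scaled_loss(logits, taken_action, dealt, received, scale=10.0):
    """Imitation loss weighted by the turn's damage differential.

    Moves where dealt >> received get a positive weight (mimic them);
    moves where received >> dealt get a negative weight (avoid them).
    The clamp to [-1, 1] keeps one lopsided turn from dominating the gradient.
    """
    weight = max(-1.0, min(1.0, (dealt - received) / scale))
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    prob = exps[taken_action] / sum(exps)
    return weight * -math.log(prob)

logits = [1.5, 0.0]
good = damage_scaled_loss(logits, 0, dealt=8.0, received=2.0)  # weight +0.6
bad = damage_scaled_loss(logits, 0, dealt=1.0, received=9.0)   # weight -0.8
```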
Typically, you determine the epoch count (how many epochs to train before stopping) via validation. So, for example, you start off with train/valid/test splits. You run your training process, and suppose that after N epochs your validation error starts increasing (and doesn’t stop). Given this N, you combine your train + valid sets into one bigger set, re-train your model using the best hyperparameters found during the validation process (including N), stop after N epochs, and evaluate the model on the test set to get your test error.
Finally, after having validated and tested your model, you take the ENTIRE dataset (train + valid + test) and re-run the whole training process on it with the same hyperparameters (and N) as before. The model you get is your final model, and it is likely to perform as well as (or better than) its test error indicates, since it has more data to train on than it did previously. It’s a very common error to train only on the train set and use that model as your final one, missing out on a large chunk of your dataset.
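The "find N, then retrain" procedure can be sketched like this; `train_one_epoch` and `validation_error` are placeholders for your actual training loop, and the patience-based stopping rule is one common way to decide that validation error "doesn't stop" rising:

```python
def find_stopping_epoch(train_one_epoch, validation_error, max_epochs, patience=3):
    """Train on the train split, watching the valid split to pick N."""
    best_err, best_epoch, worse_streak = float("inf"), 0, 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        err = validation_error()
        if err < best_err:
            best_err, best_epoch, worse_streak = err, epoch, 0
        else:
            worse_streak += 1
            if worse_streak >= patience:   # validation error keeps rising: stop
                break
    return best_epoch                      # this is your N

# Toy run with a simulated validation curve that bottoms out at epoch 4:
curve = iter([5.0, 4.0, 3.0, 2.0, 3.0, 4.0, 5.0])
n = find_stopping_epoch(lambda: None, lambda: next(curve), max_epochs=20)
# Afterwards: re-train for exactly n epochs on train+valid and measure test
# error, then re-train for n epochs on train+valid+test for the final model.
```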
EDIT: One last thing I thought of (sorry for the wall of text LOL): another improvement that significantly increased my bot’s winrate in that other game was to introduce data augmentation via rotation/mirroring to exploit the symmetry of the game. I’m not sure whether you can rotate in Terminals, but the board looks horizontally symmetrical, so you can take all of the turn data you have, duplicate it, and mirror the game state/actions, and the mirrored moves will still be completely valid. In essence, this should quickly double the size of your dataset (with a bit of correlation), which is one of the best ways to reduce overfitting.
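A sketch of the mirroring, assuming the state is a 2-D grid of board features and actions are (x, y) placements; the exact representation will depend on how you encode turns, so treat this as illustrative:

```python
def mirror_turn(state, placements):
    """Flip one turn of data across the board's vertical axis.

    state:      2-D grid (list of rows) of board features
    placements: list of (x, y) action coordinates
    """
    width = len(state[0])
    mirrored_state = [row[::-1] for row in state]              # flip each row
    mirrored_actions = [(width - 1 - x, y) for x, y in placements]
    return mirrored_state, mirrored_actions

state = [[1, 2, 3],
         [4, 5, 6]]
m_state, m_actions = mirror_turn(state, [(0, 1)])
```

Append the mirrored copy of every turn to your dataset alongside the original and you get the 2x augmentation for free.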