In preparation for the upcoming competition and the end of the semester, I’m probably shelving this project for the foreseeable future. But, I wanted to share what I’ve been working on for the past two months.
Want to try this out too? Below is a short list of links to things I remember being helpful. It shouldn’t assume too much prior knowledge, though I think starting with supervised learning helped a lot.
Links
Game Engine
- Game documentation
- Turn simulation discussion: One of several posts on the forums discussing the main aspect of the game engine, simulating the game map.
RL
OpenAI Gym
DQN
- keras-rl: Library to run the magic of DQN that plugs right into the OpenAI gym you made.
Getting the Model Running On Terminal
- Keras2Cpp4Terminal: Self-plug for the adaptation of the Keras2Cpp library I made. Note that there are a couple of known issues with the library.
How It Works
I wanted to try a reinforcement learning algo, in contrast to my first, supervised learning project, so I researched the basics of RL and the common libraries for it. I ended up hijacking my C++ turn simulator to create a partial C++ game engine[1] and used that to build a custom OpenAI Gym (to be referred to as “the environment”). From there, I used the keras-rl library to train a neural net with a deep Q-learning policy[2] on the environment I made. Specifically, I trained it (the neural net, to be referred to as “the agent”) against a simple version of the classic “Sawtooth” algo (pink player). This served as a proof of concept of deep Q-learning in Terminal, vastly reduced the state space the agent has to learn about, and sped up the game engine by pitting it against a static opponent rather than a dynamic one.
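For anyone curious what that wiring looks like, here’s a minimal sketch. `TerminalEnv` and `engine` are hypothetical names (the environment itself is sketched further down), and the layer sizes are placeholders rather than my actual model; only the keras-rl calls themselves are the library’s real API.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.optimizers import Adam
from rl.agents.dqn import DQNAgent
from rl.memory import SequentialMemory
from rl.policy import BoltzmannQPolicy

# `TerminalEnv` and `engine` are hypothetical; see the environment sketch below.
env = TerminalEnv(engine)
nb_actions = env.action_space.n   # one output per candidate placement/removal

# A small stand-in model; the real one is closer to the ML_Bootstrap net
# described under "The Agent" below.
model = Sequential([
    Flatten(input_shape=(1,) + env.observation_space.shape),
    Dense(128, activation="relu"),
    Dense(128, activation="relu"),
    Dense(nb_actions, activation="linear"),
])

# Stock keras-rl picks a single discrete action per step; as noted in the
# keras-rl section below, the policy functions had to be modified to emit a
# 0/1 mask over all candidate placements instead.
dqn = DQNAgent(model=model, nb_actions=nb_actions,
               memory=SequentialMemory(limit=50000, window_length=1),
               nb_steps_warmup=100, policy=BoltzmannQPolicy())
dqn.compile(Adam(lr=1e-3), metrics=["mae"])

# 5000 steps here means 5000 turns, not 5000 games.
dqn.fit(env, nb_steps=5000, visualize=False, verbose=1)
```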
Game Engine
The biggest part of recreating the game engine is the turn simulation aspect, and conveniently a C++ turn simulator was the first thing I did in Season 2. I won’t go too much into this, since this aspect of Terminal is both well documented and well discussed.
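For anyone who hasn’t dug through those threads, the per-frame loop looks roughly like this. This is a heavily simplified Python sketch, not my engine (which is in C++), and it skips the documented pathing and targeting rules entirely.

```python
import math
from dataclasses import dataclass

@dataclass
class Structure:
    x: float
    y: float
    health: float
    damage: float = 0.0   # only attacking structures deal damage
    range: float = 0.0

@dataclass
class MobileUnit:
    health: float
    damage: float
    range: float
    path: list            # precomputed (x, y) tiles toward the target edge
    step: int = 0

def _in_range(a, b, r):
    return math.hypot(a[0] - b[0], a[1] - b[1]) <= r

def simulate_action_phase(units, structures):
    """Play frames until every mobile unit has scored or been destroyed.
    Returns how many units reached the opposite edge."""
    scored = 0
    while units:
        # 1. Every mobile unit advances one tile along its precomputed path.
        for u in units:
            u.step += 1
        # 2. Units that reach the end of their path score and leave the board.
        scored += sum(1 for u in units if u.step >= len(u.path) - 1)
        units[:] = [u for u in units if u.step < len(u.path) - 1]
        # 3. Everything attacks the first enemy it finds in range
        #    (the real targeting priorities are more involved).
        for u in units:
            for s in structures:
                if s.health > 0 and _in_range(u.path[u.step], (s.x, s.y), u.range):
                    s.health -= u.damage
                    break
        for s in structures:
            if s.damage > 0:
                for u in units:
                    if u.health > 0 and _in_range((s.x, s.y), u.path[u.step], s.range):
                        u.health -= s.damage
                        break
        # 4. Anything destroyed this frame comes off the board.
        structures[:] = [s for s in structures if s.health > 0]
        units[:] = [u for u in units if u.health > 0]
    return scored
```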
OpenAI Gym Environment
This ended up taking longer than I wanted it to, but the standardized format of the OpenAI Gym saved a lot of time later on, since other libraries like keras-rl could simply attach to it. The main design question is how to reward the agent for taking a particular set of turns. I primarily rewarded the agent with a numeric “value” of its turn’s outcome, plus a few other terms like punishing it for dragging out the game and rewarding it for winning quickly.
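Stripped way down, the environment looks something like this. The `engine` object (and the fields on its return value) is a hypothetical wrapper around my C++ simulator, and the space sizes and reward constants are placeholders; only the Gym interface (`reset`/`step`) is the real contract keras-rl relies on.

```python
import gym
import numpy as np
from gym import spaces

class TerminalEnv(gym.Env):
    """One step == one turn played against the static Sawtooth opponent."""

    def __init__(self, engine, max_turns=100):
        self.engine = engine        # hypothetical wrapper around the C++ simulator
        self.max_turns = max_turns
        self.turn = 0
        # Flattened board state plus resource counts; sizes are placeholders.
        self.observation_space = spaces.Box(low=0.0, high=np.inf,
                                            shape=(420,), dtype=np.float32)
        # One bit per candidate placement/removal the agent may select.
        self.action_space = spaces.MultiBinary(210)

    def reset(self):
        self.turn = 0
        self.engine.new_game()
        return np.asarray(self.engine.observe(), dtype=np.float32)

    def step(self, action):
        self.turn += 1
        # `play_turn` and the `outcome` fields below are hypothetical.
        outcome = self.engine.play_turn(action)
        # Reward the numeric "value" of the turn's outcome...
        reward = outcome.turn_value
        # ...punish dragging the game out...
        reward -= 0.1
        # ...and reward winning quickly.
        done = outcome.game_over or self.turn >= self.max_turns
        if outcome.won:
            reward += 100.0 / self.turn
        obs = np.asarray(self.engine.observe(), dtype=np.float32)
        return obs, reward, done, {}
```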
Keras-rl
A library for all kinds of reinforcement learning. Unfortunately, it did require some diving into to get working. I believe the environment’s observation is supposed to be a numpy array instead of a list, as I had several issues with that. Additionally, I had to get my hands dirty in the policy functions to Terminal-ify them. The library sanitizes the neural net’s outputs to all 1s and 0s (part of making its own data to fit against) before passing them as the “action” to the environment, so selecting placements by the confidence of the neural net’s output couldn’t be done. Instead, I had to give it a method of spending its cores until it didn’t have enough to place the next unit it wanted[3]. There’s a little more that went into this, and I caught a minor bug in my own implementation just by typing this out, but I can’t take away all the debugging fun, can I?
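The core-spending rule from footnote 3 looks roughly like this. The unit costs and the candidate encoding are illustrative, not my actual implementation.

```python
# Illustrative unit costs; see the game documentation for the real values.
UNIT_COSTS = {"FILTER": 1, "ENCRYPTOR": 4, "DESTRUCTOR": 3}

def select_actions(action_mask, candidates, cores):
    """Turn keras-rl's 0/1 action mask into placements the agent can afford.

    candidates[i] says what a 1 at index i means: a ("PLACE", unit_type, loc)
    or ("REMOVE", None, loc) tuple (hypothetical encoding).
    """
    wanted = [c for bit, c in zip(action_mask, candidates) if bit]
    # Removals are free, so they are all granted up front (footnote 3).
    removals = [loc for kind, _, loc in wanted if kind == "REMOVE"]
    placements = []
    for kind, unit_type, loc in wanted:
        if kind != "PLACE":
            continue
        cost = UNIT_COSTS[unit_type]
        if cost > cores:
            break              # stop at the first unit it can't afford
        cores -= cost
        placements.append((unit_type, loc))
    return placements, removals, cores
```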
The Agent
The actual neural net model isn’t much more advanced than what I made with ML_Bootstrap. This time, however, I gave the model control of removals in addition to the defensive placements. I attempted to give it control of attacking as well, but that seemed too complicated to learn in a short amount of time, and in a roundabout way it hurt the defense learning too.
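Concretely, that just means the output layer grows to cover both kinds of decisions. The sizes and widths below are placeholders, not the real ones.

```python
from keras.models import Sequential
from keras.layers import Dense, Flatten

N_PLACEMENT_OUTPUTS = 3 * 60   # (unit type, location) pairs -- illustrative
N_REMOVAL_OUTPUTS = 60         # one per removable location  -- illustrative

model = Sequential([
    Flatten(input_shape=(1, 420)),   # window_length=1 x flattened observation
    Dense(256, activation="relu"),
    Dense(256, activation="relu"),
    Dense(N_PLACEMENT_OUTPUTS + N_REMOVAL_OUTPUTS, activation="linear"),
])
model.summary()
```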
Results
Here’s the game. (6.8 MB)
Not terribly impressive, eh? I trained it for 5000 steps (turns, not games). I will say, though, that while watching the training I witnessed it win in 12 turns as well as win in 17 turns without losing any hp. It failed to latch on and learn from this, however, likely because the start wasn’t as efficient as what you see in the replay and I’m fairly certain it’s a tad too greedy to see beyond that. My previous attempt at training this was able to win in 13 turns, however that model was bugged and couldn’t be reproduced on the live servers.
On the bright side, there seems to be evidence that it did learn something. It generally learned to place destructors out in front of encryptors to protect them, as well as to avoid the first several rows that can be easily destroyed by Sawtooth on turns it doesn’t attack. It failed to devise a replacement policy, unfortunately, and while it could be from a lack of training, I more suspect it to be from a bad model.
Where Does It Go From Here?
Well, I trained it against Sawtooth so I could get a proof of concept that this works. It’s kind of hard to say that it will work as a more general algo that can play against any other algo. Certainly, it would take a lot more work than I could put in right now. Below, I’ll list out a few limitations that I see:
- Full Game Engine: Any plan for a more general DQN algo that I’ve thought of, or discussed with a few other people, has involved creating a full engine that can support running a normal algo, including passing the game states to each algo throughout the action phase (the primary functionality mine lacks).
- Fast Game Engine: My engine is written in C++ and reaches about 10 ms per simulation, but speed is still the primary limitation I face. Brute-forcing the agent’s attack takes roughly 95% of the computation time during training, and the resulting speed is 0.5-1.0 steps per second, so the 5000 steps of training took several hours. I was hoping to reach a game per second, which places me around two orders of magnitude too slow (see the rough arithmetic after this list), and that’s with just one algo running (Sawtooth’s static placements take a negligible amount of computation time).
- Perfect Knowledge: Terminal’s a lot different than other board games in that both players play simultaneously, so each algo lacks perfect knowledge about what the other will do. This is the basis for making a prediction phase. I simply ignored this problem by making Sawtooth place first[4]. Certainly, a more generalized approach would need to overcome this problem, especially when a full game engine is needed. It adds a unique layer of difficulty to have a vast range of outcomes tied to each state, regardless of the action taken.
- State Space/Action Space: The game’s large, basically. The number of possible states is large and complex enough that I’d rather not try to compute it. This is tied to the need for a fast game engine, and it leads me to believe that brute-forcing the best attack via simulation makes it near impossible to scale this model into a general algo.
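For the speed gap mentioned above, the back-of-the-envelope arithmetic (the step rate is from my training runs; the 100-turn worst-case game length is the only assumption):

```python
steps_per_second = 0.75        # observed training speed: 0.5-1.0 turns/sec
turns_per_game = 100           # worst case: a game that goes the distance
target_seconds_per_game = 1.0  # the "game a second" goal

seconds_per_game = turns_per_game / steps_per_second
slowdown = seconds_per_game / target_seconds_per_game
print(f"~{seconds_per_game:.0f} s per game, ~{slowdown:.0f}x slower than the goal")
# ~133 s per game, ~133x slower -- roughly two orders of magnitude
```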
If all that can be overcome, which is no small feat, then I theorize it’s possible to make a DQN algo “master” the game. I think the limited success of learning to play against Sawtooth does support that, on the basis that the limited success is due to my own lack of knowledge of creating a good model and the above limitations, and not something else I missed.
Footnotes
- By “hijacking” I mean adding on more and more functionality to the simulator until it was able to carry out basic game processes beyond “playing out” a game map. This ended up not being a good idea, as it got sloppy very quickly. By “partial game engine,” I mean that it cannot run “regular” algos like you can with the engine.jar. Instead, the environment and the “game engine” have to communicate back and forth to correctly play out the game, with the environment managing resources and the game engine managing the state of the board. Oh, and the environment’s in Python while the engine is in C++. Fun.
- The learning policy is the bit of “magic” that I stayed away from, so I can’t particularly answer what it does or why I chose it, other than that our favorite CodeBullet conveniently released a video on it while I was working on this (what it does) and it was the first thing I read about while researching the basics (why I chose it).
- Removals are “free” as they don’t “cost” any cores. I make the presumption that all desired removals have been selected by the time it attempts to place a unit it lacks the cores for.
- I replicated this on the live servers by manually feeding the algo the first couple of inputs (ignoring the read in game state) and letting my predictor take over from there. My predictor isn’t perfect, and that’s most noticeable by how, locally, the agent actually loses on turn 36 instead of winning on turn 30. ¯\_(ツ)_/¯