Mixed Results: DQN Experimentation


In preparation for the upcoming competition and the end of the semester, I’m probably shelving this project for the foreseeable future. Still, I wanted to share what I’ve been working on for the past two months.

Want to try this out too? Below is a short list of links I remember being helpful. They shouldn’t assume too much prior knowledge. I think starting with supervised learning helped a lot, though.


Game Engine


Open AI Gym


  • keras-rl: Library to run the magic of DQN that plugs right into the OpenAI gym you made.

Getting the Model Running On Terminal

  • Keras2Cpp4Terminal: Self plug for the adaptation of the Keras2Cpp library I made. Note that there are a couple of known issues with the library.

How It Works

I wanted to try out a reinforcement-based algo, instead of my first supervised learning project, so I researched the basics of RL and common libraries for using it. I ended up hijacking my C++ turn simulator to create a partial C++ game engine[1] and using that to create a custom OpenAI Gym (to be referred to as “the environment”). From there, I used the keras-rl library to train a neural net with a deep Q-learning policy[2] on the environment I made. Specifically, I trained it (the neural net, to be referred to as “the agent”) against a simple version of the classic “Sawtooth” algo (pink player). This served as a proof of concept of deep Q-learning in Terminal, vastly reduced the state space the agent has to learn about, and sped up the game engine, since a static opponent is cheaper to run than a dynamic one.

Game Engine

The biggest part of recreating the game engine is the turn simulation aspect, and conveniently a C++ turn simulator was the first thing I did in Season 2. I won’t go too much into this, since this aspect of Terminal is both well documented and well discussed.

Open AI Gym Environment

This ended up taking longer than I wanted it to, but the standardized format of the Open AI Gym saved a lot of time later on since other libraries like keras-rl could simply attach to it. The main thing here is how to give a reward to an agent for taking a particular set of turns. I primarily rewarded the agent with a numeric “value” of its turn’s outcome as well as a few other parameters like punishing it for dragging out the game and rewarding it for winning quickly.
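A minimal sketch of what a reward along these lines could look like is below. The function name, the weights, and the `turn_value` input are all hypothetical and chosen only for illustration; this is not the exact reward used in this project.

```python
def turn_reward(turn_value, turn_number, won=False,
                step_penalty=0.1, win_bonus=100.0, target_turns=30):
    """Hypothetical shaped reward for one agent turn.

    turn_value   -- numeric "value" of the turn's outcome (assumed computed elsewhere)
    turn_number  -- 1-based index of the current turn
    won          -- True if this turn ended the game with a win
    """
    # A small cost on every turn discourages dragging the game out.
    reward = turn_value - step_penalty
    if won:
        # Faster wins keep a larger share of the bonus (never below 10% of it).
        speed_factor = max(0.1, 1.0 - turn_number / (2.0 * target_turns))
        reward += win_bonus * speed_factor
    return reward
```

With these made-up weights, a win on turn 12 earns more than a win on turn 30, and every non-winning turn pays the small step penalty.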


keras-rl is a library for all kinds of reinforcement learning. Unfortunately, it did require some digging to get working. I believe the environment’s observation is supposed to be a numpy array instead of a list, as I had several issues with that. Additionally, I had to get my hands dirty in the policy functions to Terminal-ify them. The library will sanitize the neural net outputs to all 1s and 0s (part of making its own data to fit against) before passing them as the “action” to feed into the environment, so selecting the placements by the confidence of the neural net’s output couldn’t be done. Instead, I had to give it a method of spending its cores until it didn’t have enough to place the next unit it wanted to[3]. There’s a little more that went into this, and I realized a minor bug in my own implementation just by typing it out, but I can’t take away all the debugging fun, can I?
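The "spend cores until the next unit is unaffordable" idea can be sketched roughly as follows. This is a simplified illustration and not the code from this project: the function name, the `(x, y)` cell format, and the single flat `unit_cost` are assumptions.

```python
def choose_placements(wanted_cells, unit_cost, cores, occupied):
    """Place units in the order given until the next one is unaffordable.

    wanted_cells -- cells the agent selected, in selection order
                    (hypothetical format: (x, y) tuples)
    unit_cost    -- cost in cores of one unit (a single unit type is assumed)
    cores        -- cores available this turn
    occupied     -- set of cells that already hold a unit
    """
    placed = []
    for cell in wanted_cells:
        if cell in occupied:
            continue  # already taken, move on to the next wanted cell
        if cores < unit_cost:
            break  # cannot afford the next wanted unit, so stop spending
        placed.append(cell)
        cores -= unit_cost
    return placed, cores


# Example: 7 cores, units cost 3, one wanted cell is already occupied.
placed, left = choose_placements([(0, 0), (1, 1), (2, 2), (3, 3)], 3, 7, {(1, 1)})
print(placed, left)  # [(0, 0), (2, 2)] 1
```

Removals are left out here because they cost nothing, so they never affect when the spending loop stops.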

The Agent

The actual neural net model isn’t much more advanced than what I made with ML_Bootstrap. This time, however, I gave the model additional control of removals instead of just the defensive placements. I attempted to give it control of attacking as well, but this seemed to be too complicated to learn from in a short amount of time and in a roundabout way hurt the defense learning as well.


Here’s the game. (6.8 MB)

Not terribly impressive, eh? I trained it for 5000 steps (turns, not games). I will say, though, that while watching the training I witnessed it win in 12 turns as well as win in 17 turns without losing any hp. It failed to latch on and learn from this, however, likely because the start wasn’t as efficient as what you see in the replay and I’m fairly certain it’s a tad too greedy to see beyond that. My previous attempt at training this was able to win in 13 turns; however, that model was bugged and couldn’t be reproduced on the live servers.

On the bright side, there seems to be evidence that it did learn something. It generally learned to place destructors out in front of encryptors to protect them, as well as to avoid the first several rows that can be easily destroyed by Sawtooth on turns it doesn’t attack. It failed to devise a replacement policy, unfortunately, and while it could be from a lack of training, I suspect it’s more likely from a bad model.

Where Does It Go From Here?

Well, I trained it against Sawtooth so I could get a proof of concept that this works. It’s kind of hard to say that it will work as a more general algo that can play against any other algo. Certainly, it would take a lot more work than I could put in right now. Below, I’ll list out a few limitations that I see:

  • Full Game Engine: Any plan for a more general DQN algo that I’ve thought of, or discussed with a few other people, has involved creating a full engine that can support running a normal algo, including passing the game states to each algo throughout the action phase (the primary functionality mine lacks).
  • Fast Game Engine: My engine is in C++ and can reach about 10 ms per simulation, but this is the primary limitation I face. Brute forcing the agent’s attack is roughly 95% of the computation time during training, and the resultant speed is 0.5-1.0 steps per second. The 5000 steps of training took several hours. I was hoping to reach speeds of a game a second, which places me around two orders of magnitude too slow, and that’s with just one algo running (Sawtooth’s static placements have a negligible amount of computation time).
  • Perfect Knowledge: Terminal’s a lot different than other board games in that both players play simultaneously, so each algo lacks perfect knowledge about what the other will do. This is the basis for making a prediction phase. I simply ignored this problem by making Sawtooth place first[4]. Certainly, a more generalized approach would need to overcome this problem, especially when a full game engine is needed. It adds a unique layer of difficulty to have a vast range of outcomes tied to each state, regardless of the action taken.
  • State Space/Action Space: The game’s large, basically. The number of possible states of the game is large and complex enough I’d rather not try to compute it. This is tied to the need for a fast game engine, which leads me to believe a brute-forced simulation of the best attack makes it near impossible to scale this model into a general algo.

If all that can be overcome, which is no small feat, then I theorize it’s possible to make a DQN algo “master” the game. I think the limited success of learning to play against Sawtooth does support that, on the basis that the limited success is due to my own lack of knowledge of creating a good model and the above limitations, and not something else I missed.

  1. By “hijacking” I mean adding on more and more functionality to the simulator until it was able to carry out basic game processes beyond “playing out” a game map. This ended up not being a good idea, as it got sloppy very quickly. By “partial game engine,” I mean that it cannot run “regular” algos like you can with the engine.jar. Instead, the environment and the “game engine” have to communicate back and forth to correctly play out the game, with the environment managing resources and the game engine managing the state of the board. Oh, and the environment’s in Python while the engine is in C++. Fun.
  2. The learning policy is the bit of “magic” that I stayed away from, so I can’t particularly answer what it does or why I chose it, other than our favorite CodeBullet conveniently released a video while I was working on this (what it does) and it was the first thing I read about while researching the basics (why I chose it).
  3. Removals are “free” as they don’t “cost” any cores. I make the presumption that all desired removals have been selected by the time it attempts to place a unit it lacks the cores for.
  4. I replicated this on the live servers by manually feeding the algo the first couple of inputs (ignoring the read in game state) and letting my predictor take over from there. My predictor isn’t perfect, and that’s most noticeable by how, locally, the agent actually loses on turn 36 instead of winning on turn 30. ¯\_(ツ)_/¯


Awesome post! Super interesting to see any ML developments.

I’ll try to give you some tips from my own experience. You’re already really far along, so maybe you know them already, but if someone else wants to experiment with ML, maybe these will help:

  • What net structure are you using? I would imagine having a few convolution layers could help it learn more generally and a lot faster, though I haven’t tried it myself yet. In general, I’ve found changing the net can be very counter-intuitive. For example, reducing the number of nodes per layer can actually help a ton (it makes it learn much faster and with less noise). Also, changes like keeping or removing the “bias” weight on nodes can make a big difference. People often combine DQN with evolutionary algos to deal with these kinds of hyper-parameters, but manual tweaking can work too. Also make sure you are using the industry-best optimizer functions and so on (I believe Adam was the best when I looked a few years ago).
  • What’s your off-policy behavior? This is basically what the net will do when it wants to try new things. You could save a lot of training time by limiting it to moves that make sense, like no illegal moves (though you may be doing this already). Also, tweaking the off-policy rate (% of the time it does something off-policy) could help. There are some advanced techniques related to this out there too.
  • Strange that the algo didn’t learn more from some of its 13-turn wins. Maybe you need to tweak your reward function? Not sure if you are looking at turns ahead, though (if this turn was used in a game that led to winning quickly, increase the reward). There are also a few ML tricks out there that involve prioritizing certain data and training on it more. For example, if your algo does something “interesting”, train more on those games. Or if your algo seems to have a huge weakness, train on those situations more. That’s the general description; the real techniques can be very efficient and nuanced. One is called “Prioritized Experience Replay”, and this paper is an example (first one I found just now, so there could be better ones out there): https://arxiv.org/pdf/1803.00933.pdf
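To make the "train more on certain data" idea a bit more concrete, here is a minimal sketch of proportional prioritized sampling, where a stored transition is drawn with probability proportional to its priority raised to a power `alpha`. The function name and defaults are assumptions for illustration, and the sketch leaves out the importance-sampling correction and efficient data structures that real implementations use.

```python
import numpy as np


def sample_prioritized(priorities, batch_size, alpha=0.6, rng=None):
    """Sample replay-buffer indices with probability proportional to priority**alpha.

    priorities -- positive numbers, one per stored transition
                  (for example the absolute TD error plus a small constant)
    alpha      -- 0 gives uniform sampling, 1 gives fully proportional sampling
    """
    rng = np.random.default_rng() if rng is None else rng
    scaled = np.asarray(priorities, dtype=np.float64) ** alpha
    probs = scaled / scaled.sum()
    indices = rng.choice(len(probs), size=batch_size, p=probs)
    return indices, probs
```

Transitions from a rare quick win would get a high priority and so show up in training batches far more often than their share of the buffer.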


Thanks for the great feedback! This will be quite useful. Most of the guides/light research I’ve done so far was to get something working, so this will be a great starting point for figuring out how to make it good (instead of just barely functional).

  • I’m using a fully connected model with an input representation of the game map, a hidden layer of size ~40 (completely arbitrary), and an output representation of placements/removes. Previously the hidden layer(s) were much larger, but those were equally arbitrary and I figured they probably slowed the learning. I’ve yet to experiment with convolution layers, though I can see how that could be beneficial. Definitely some more stuff to look into to help improve the model. It is compiled with Adam (and with an equally arbitrary lr (learning rate?) I pulled from some keras-rl example).
  • Not really sure, as I assumed it was in the “magic” of the keras-rl library, but I’m starting to think I should pay more attention to this. While it’s training, it uses the BoltzmannQPolicy, which I believe “explores” according to these lines, where the q_values are the feed-forward outputs of the neural net:
# self.tau and self.clip are probably the defaults; I don't think I set them
exp_values = np.exp(np.clip(q_values / self.tau, self.clip[0], self.clip[1]))
probs = exp_values / np.sum(exp_values)
# This line I modified to get "multiple outputs" for multiple placements
actions = np.random.choice(range(nb_actions), choices, replace=False, p=probs)
  • (Still in bullet point 2) In testing, it uses the GreedyQPolicy and just selects its choices from the highest q_values. A little more under the hood is some sanitization of the outputs to prevent illegal moves, specifically giving it as many choices as it has cores, and selecting from its choices until it has “spent” its cores in areas not already occupied. I’m starting to see that as an issue though, as it’s being fit to produce outputs in one given game state, but in the next turn, those outputs are taken and it needs to fit to something else it probably didn’t want to. I’m not sure what the probabilities it’s using to train look like, though, which might be a separate issue if it’s not exploring enough as the outputs become more fit. AFAIK that’s the only point of exploration, but I’d have to dig into keras-rl more to figure that out.
  • I haven’t explored how exactly it manages the rewards, but I should certainly check to make sure it’s “splashing” the rewards to reduce the greed, at least on episodes that end with the large “win” bounty. I’m guessing it’s not doing that enough, or at least not far back enough if it ended up selecting the greedier opening turns with a much worse ending.
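For reference, here is a minimal sketch of how a discount factor “splashes” a terminal reward back over earlier turns. It computes full discounted returns for one episode, which is a simplification: one-step Q-learning does not compute these directly but approaches them gradually by bootstrapping from the next state’s Q-values. The function name and the example numbers are made up for illustration.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# A win bounty of 100 on the last of four turns, with a deliberately low gamma.
print(discounted_returns([0.0, 0.0, 0.0, 100.0], gamma=0.5))
# [12.5, 25.0, 50.0, 100.0]
```

The lower the discount factor, the less of the final bounty reaches the opening turns, which would be one possible reason for an agent preferring greedy openings.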