Mixed Results: DQN Experimentation

Ryan_Draves · March 24, 2019, 5:40am

In preparation for the upcoming competition and the end of the semester, I’m probably shelving this project for the foreseeable future. But, I wanted to share what I’ve been working on for the past two months.

Want to try this out too? Below is a short list of links of stuff I remember being helpful. It shouldn’t assume too much prior knowledge. I think starting with supervised learning helped a lot though.

Links

Game Engine

Game documentation
Turn simulation discussion: One of several posts on the forums discussing the main aspect of the game engine, simulating the game map.

RL

Basic Overview

Open AI Gym

DQN

keras-rl: Library to run the magic of DQN that plugs right into the OpenAI gym you made.

Getting the Model Running On Terminal

Keras2Cpp4Terminal: Self plug for the adaption of the Keras2Cpp library I made. Note that there are a couple known issues with the library.

How It Works

I wanted to try out a reinforcement-based algo, instead of my first supervised learning project, so I researched the basics of RL and common libraries for using it. I ended up hijacking my C++ turn simulator to create a partial C++ game engine[1] and using that to create a custom OpenAI Gym (to be referred to as “the environment”). From there, I used the keras-rl library to train a neural net with a deep Q-learning policy[2] on the environment I made. Specifcally, I trained it (the neural net, to be referred to as “the agent”) against a simple version of the classic “Sawooth” algo (pink player). This served to be a proof-of-concept of deep Q-learning in Terminal, vastly reduce the state space the agent has to learn about, and in speed up the game engine with a static opponent playing over a dynamic one.

Game Engine

The biggest part of recreating the game engine is the turn simulation aspect, and conveniently a C++ turn simulator was the first thing I did in Season 2. I won’t go too much into this, since this aspect of Terminal is both well documented and well discussed.

Open AI Gym Environment

This ended up taking longer than I wanted it to, but the standardized format of the Open AI Gym saved a lot of time later on since other libraries like keras-rl could simply attach to it. The main thing here is how to give a reward to an agent for taking a particular set of turns. I primarily rewarded the agent with a numeric “value” of its turn’s outcome as well as a few other parameters like punishing it for dragging out the game and rewarding it for winning quickly.

Keras-rl

A library for all kinds of reinforcement learning. This did require some diving into to get working, unfortunately. I believe the environment’s observation is supposed to be a numpy array instead of a list, as I had several issues with that. Additionally, I had to get my hands dirty in the policy functions to Terminal-ify them. The library will sanitize the neural net outputs to all 1s and 0s (part of making its own data to fit against) before passing it as the “action” to feed into the environment, so selecting the placements by the confidence of the neural net’s output couldn’t be done. Instead, I had to give it a method of spending its cores until it didn’t have enough to place the next unit it wanted to[3]. There’s a little more that went on to this, and I realized a minor bug in my own implementation just by typing it out, but I can’t take away all the debugging fun, can I?

The Agent

The actual neural net model isn’t much more advanced than what I made with ML_Bootstrap. This time, however, I gave the model additional control of removals instead of just the defensive placements. I attempted to give it control of attacking as well, but this seemed to be too complicated to learn from in a short amount of time and in a roundabout way hurt the defense learning as well.

Results

Here’s the game. (6.8 MB)

Not terribly impressive, eh? I trained it for 5000 steps (turns, not games). I will say, though, that while watching the training I witnessed it win in 12 turns as well as win in 17 turns without losing any hp. It failed to latch on and learn from this, however, likely because the start wasn’t as efficient as what you see in the replay and I’m fairly certain it’s a tad too greedy to see beyond that. My previous attempt at training this was able to win in 13 turns, however that model was bugged and couldn’t be reproduced on the live servers.

On the bright side, there seems to be evidence that it did learn something. It generally learned to place destructors out in front of encryptors to protect them, as well as to avoid the first several rows that can be easily destroyed by Sawtooth on turns it doesn’t attack. It failed to devise a replacement policy, unfortunately, and while it could be from a lack of training, I more suspect it to be from a bad model.

Where Does It Goes From Here?

Well, I trained it against Sawtooth so I could get a proof of concept that this works. It’s kind of hard to say that it will work as a more general algo that can play against any other algo. Certainly, it would take a lot more work than I could put in right now. Below, I’ll list out a few limitations that I see:

Full Game Engine: Any plan for a more general DQN algo that I’ve thought of, or discussed with a few other people, has involved creating a full engine that can support running a normal algo, including passing the game states to each algo throughout the action phase (the primary functionality mine lacks).
Fast Game Engine: My engine is on C++, and can reach about ~10ms per simulation, but this is the primary limitation I face. Brute forcing the agent’s attack is roughly 95% of the computation time during training, and the resultant speed it 0.5-1.0 steps per second. The 5000 steps of training took several hours. I was hoping to reach speeds of a game a second, which places me around two orders of magnitude too slow, and that’s just one algo running (Sawtooth’s static placements have a negligible amount of computation time).
Perfect Knowledge: Terminal’s a lot different than other board games in that both players play simultaneously, so each algo lacks perfect knowledge about what the other will do. This is the basis for making a prediction phase. I simply ignored this problem by making Sawtooth place first[4]. Certainly, a more generalized approach would need to overcome this problem, especially when a full game engine is needed. It adds a unique layer of difficulty to have a vast range of outcomes tied to each state, regardless of the action taken.
State Space/Action Space: The game’s large, basically. The number of possible states of the game is large and complex enough I’d rather not try to compute it. This is tied to the need for a fast game engine, which leads me to believe a brute-forced simulation of the best attack makes it near impossible to scale this model into a general algo.

If all that can be overcome, which is no small feat, then I theorize it’s possible to make a DQN algo “master” the game. I think the limited success of learning to play against Sawtooth does support that, on the basis that the limited success is due to my own lack of knowledge of creating a good model and the above limitations, and not something else I missed.

Footnotes

By “hijacking” I mean adding on more and more functionality to the simulator until it was able to carry out basic game processes beyond “playing out” a game map. This ended up not being a good idea, as it got sloppy very quickly. By “partial game engine,” I mean that it cannot run “regular” algos like you can with the engine.jar. Instead, the environment and the “game engine” have to communicate back and forth to correctly play out the game, with the environment managing resources and the game engine managing the state of the board. Oh, and the environment’s in Python while the engine is in C++. Fun.
The learning policy is the bit of “magic” that I stayed away from, so I can’t particularly answer what it does or why I chose it, other than our favorite CodeBullet conveniently released a video while I was working on this (what it does) and it was the first thing I read about while researching the basics (why I chose it).
Removals are “free” as they don’t “cost” any cores. I make the presumption that all desired removals have been selected by the time it attempts to place a unit it lacks the cores for.
I replicated this on the live servers by manually feeding the algo the first couple of inputs (ignoring the read in game state) and letting my predictor take over from there. My predictor isn’t perfect, and that’s most noticeable by how, locally, the agent actually loses on turn 36 instead of winning on turn 30. ¯\_(ツ)_/¯

C1Junaid · March 25, 2019, 3:30pm

Awesome post! Super interesting to see any ML developments.

I’ll try to give you some tips from my own experience though you’re already really far along so maybe you know them already, though if someone else wants to experiment with ML maybe these will help:

What net structure are you using? I would imagine having a few convolution layers could help it learn more generally and a lot faster, though haven’t tried myself yet. In general I’ve found changing the net can be very counter-intuitive, for example reducing the number of nodes per layer, can actually help a ton (makes it learn much faster and have less noise). Also, changes like keeping or removing having the “bias” weight on nodes can make a big difference. People often combine DQN with evolutionary algos to deal with these kinds of hyper-parameters, but manual tweaking can work too. Also make sure you are using the industry-best optimizer functions and stuff (I believe adam was the best when I looked a few years ago).
What’s your off-policy behavior? This is basically what will the net do when it wants to try new things. You could help save a lot of training time to limiting it to moves that make sense like no illegal moves (though you may be doing this already). Also tweaking the off-policy rate (% of the time it does something off-policy) could help. There are some advanced techniques related to this out there too.
Strange that the algo didn’t learn from some of its 13 turn wins more. Maybe you need to tweak your reward function? Not sure if you are looking at turns ahead though (if this turn was used in a game that led to winning quickly, increase reward). There are also a few ML tricks out there that involve prioritizing certain data more and training on those more. For example, if your algo does something “interesting” train more on those games. Or if your algo seems to have a huge weakness, train on those situations more. That’s the general description, the real techniques are very efficient and nuanced sometimes, one is called “Prioritized Experience Replay” this paper is an example (first one I found just now so could be better ones out there): https://arxiv.org/pdf/1803.00933.pdf

Ryan_Draves · March 25, 2019, 5:07pm

Thanks for the great feedback! This will be quite useful. Most the guides/light research I’ve done so far was to get something working, so this will be a great starting point into figuring out how to make it good (instead of just barely functional).

I’m using a fully connected model with an input representation of the game map, a hidden layer of like size ~40 (completely arbitrary), and an output representation of placements/removes. Previously the hidden layer(s) were much larger but those were equally as arbitrary and I figured it probably slowed the learning. I’ve yet to experiment with convolution layers, though I can see how that could be beneficial. Definitely some more stuff to look into to help improve the model. It is compiled with Adam (and with an equally arbitrary lr (learning rate?) I pulled from some keras-rl example).
Not really sure, as I assumed it was in the “magic” of the keras-rl libary, but I’m starting to think I should pay more attention this. While it’s training, it uses the BoltzmannQPolicy, which I believe “explores” according the these lines, where the q_values are the forward feed outputs of the neural net:

#                                    Probably defaults, don't think I set them
exp_values = np.exp(np.clip(q_values / self.tau, self.clip[0], self.clip[1]))
probs = exp_values / np.sum(exp_values)
# This line I modified to get "multiple outputs" for multiple placements
actions = np.random.choice(range(nb_actions), choices, replace=False, p=probs)

(Still in bullet point 2) In testing, it uses the GreedyQPolicy and just selects its choices from the highest q_values. A little more under the hood is some sanitation of the outputs to prevent illegal moves, specifically giving it cores amount of choices, and selecting from its choices until it’s “spent” its cores in areas not already occupied. I’m starting to see that as an issue though, as it’s being fit to produce outputs in one given game state, but in the next turn, those outputs are taken and it needs to fit to something else it probably didn’t want to. I’m not sure what those probabilities it’s using to train look like, though, which might be a separate issue if it’s not exploring enough as the outputs become more fit. AFAIK that’s the only point of exploration, but I’d have to dig into keras-rl more to figure that out.
I haven’t explored how exactly it manages the rewards, but I should certainly check to make sure it’s “splashing” the rewards to reduce the greed, at least on episodes that end with the large “win” bounty. I’m guessing it’s not doing that enough, or at least not far back enough if it ended up selecting the greedier opening turns with a much worse ending.

MrTwiggy · November 26, 2019, 6:13pm

Very interesting stuff, Ryan Draves! Out of curiosity, is your C++ game engine open sourced anywhere, or is it closed private source?

I definitely think that with some modifications to your choice of RL algorithm and how you are structuring the problem, I think a deep learning RL model could perform extremely well here. Though I’m not surprised it didn’t learn much, given that it was such a restricted simple NN model and had very small amount of data (I would expect to need 1000x more data to see great results at least)

Ryan_Draves · November 26, 2019, 6:48pm

Yeah, I was quite in over my head in that project, but that’s what learning’s for . The research I started this year has made me more appreciative of graduate degrees in understanding this stuff.

Historically, “turn simulators” are not meant to be shared, as that’s one of the focus points some people aim for in their competitive algos, so I don’t think I can share that part, which is honestly the most important logic. I could however share the rest of the project, and leave the call to the “turn simulator” as an exercise to the user.

Clarification:

Game engine: complete handling of the game in terms of managing resources/restarting a new game, for this purposes we can wrap the OpenAI gym into this, since I didn’t separate those two things.
Turn simulator: the main logic that simulates a game state to get to the start of the next turn. Would presumably return the reward for the agent

I can share the OpenAI gym, if that’s what you’d like, but as mentioned above I’d have to leave the core simulator logic unimplemented.

I have a much better turn simulator now as well. This one is a compiled extension instead of the wrapper shenanigan I used back when I did this project. I could update the project to use this faster simulator (critical for your training), and pending C1 has no objections, privately send you the compiled extension that would otherwise not be in the repo. I would do this at my leisure, as it would take a while to update and I haven’t been involved in this project since I posted in March.

MrTwiggy · November 26, 2019, 7:00pm

Thanks for the quick and very detailed response! I wasn’t aware of the general tendency to not share turn simulators, but that makes sense overall. Developing an efficient simulator is likely to be the largest time-investment for developing any type of RL agent, so it makes sense that people would keep it private for the sake of competitiveness.

Out of curiosity, if I were to develop such a turn simulator, is it frowned upon by C1 or the community to open source it? Or is it just more-so that developers have historically chosen not to share?

Either way, thanks again for the quick info on such an old thread! If I do try to implement modern deep RL techniques to try and compete, I likely won’t use the OpenAI gym environment, so the framework for that logic may not be particularly useful for me personally, but I do appreciate your openness and willingness to share! If I do end up developing a turn simulator and pursuing RL in this game, I will definitely try to share as openly as you have.

Ryan_Draves · November 26, 2019, 8:02pm

Email notifications help. And yes, it would be frowned upon to share a turn simulator. I can’t dig up a thread off the top of my head but C1 has stated so in the past and that’s unlikely to change.

It is quite a time consuming process to make one, first between making it behave perfectly (months of resolving edge cases) and second to make it fast (depends on your experience with such optimizations). If the framework isn’t of much interest to you, I probably won’t bother to share if there’s no interest, but I’d still be willing to share a compiled version of my simulator for your training purposes (pending C1 would allow such a thing). That’d be easier to send to you versus updating the framework to use the much improved simulator. However, that might not actually be perfectly desirable in its own right, as it would limit the formulation of your reward system to whatever data the turn simulator returns (it returns results sensible for greedy optimization).

MrTwiggy · November 26, 2019, 8:19pm

If C1 such a thing, I would be extremely grateful to have you share your compiled simulator! Ultimately, just having a reliable efficient simulator is mostly all I would need to apply the techniques I would like to. Please let me know if you think it might be possible, thanks!

kkroep · November 27, 2019, 10:00am

I would also say that your own simulator is one of the key contributions one should program himself. My style of algos have always relied on finite state machines with simple logic. The reason was because I couldn’t invest enough time to program one and utilize it. I am now considering programming a simple and faulty one anyway just to give my program some extra information to work with.

The nice thing about this game is that different skill sets can be used and sparred against each other. For example strategic thinking, optimizing, high performance coding, artificial intelligence, prediction, etc…

Maybe I have the wrong angle here, but those are my thoughts on this.

Ryan_Draves · November 27, 2019, 2:05pm

My assumption here was two-fold:

End-to-end machine learning attempts aren’t particularly competitive
The turn-simulator would be used to create a game engine for training, not in the actual algo

Using machine learning to optimize turn-simulator results in-game seems more like a genetic algorithm problem to me, so I figured under those two assumptions I wouldn’t lose much to share that part of my work.

kkroep · November 27, 2019, 5:16pm

Ah then maybe I am reading the situation incorrectly. I was thinking that if you simulate every possible attack option for both you and your opponent that can be quite efficient. There probably exist different use cases for runtime turn simulators too.

If the program is closed source and only offline I had the wrong idea.

On a different note, I. Currently planning on having machine learning components in my algo. I understand that full machine learning is super tough, but I think ML can give strong solutions to a subset of problems. Maybe not

Btw, I am only talking about the turn simulator, the machine learning part sounds fine

MrTwiggy · November 27, 2019, 8:03pm

Ryan Draves is exactly correct. I am only interested in using a fast efficient turn simulator so that I can generate large amounts of offline self-play training data, which I think is a prerequisite to being able to train a strong deep learning model, especially with an RL framework. Though, personally, I think an end-to-end machine learning approach has the potential to provide a very competitive agent (though that might require absurd amounts of data and computing resources that just aren’t realistic in this competition).

Though I completely respect if even an offline-simulator usage to share is frowned upon, so whatever C1 thinks is reasonable is okay by me

kkroep · November 28, 2019, 12:25pm

Ah okay, upon reading the conversation again I believe I understand what you meant with “compiled version”. If it is a local executable that cannot be used online I am content with people using it. The question is whether the algorithm will not get stuck on the local optimum that is “ping cannon”. I had a lot of joy seeing aelgoo operate in the previous generation, so maybe this method can bring a cool style to the table.

Ryan_Draves · November 28, 2019, 2:46pm

No, the compiled version of the turn simulator would be a shared library that you can plug into directly as a Python Extension (it would just come with a README on how to interface with it). It works on the live servers just as well. What he wouldn’t get would be the source code of the simulator, but he would have access to use it in a competitive algo (hence a little bit of trust he would responsibly use it only offline). There’s probably a way to compile it as a Windows shared library instead, which would not work on the live servers, but then I’d have to spend more time than I’d like to figuring that out. That’s actually not a terrible option as I could open source the training environment entirely to be Windows-dependent versus just sharing the simulator with one person.

kkroep · November 28, 2019, 4:20pm

Very interesting. The only problem is that c1games is doing all these live events, where I am guessing people compete in a limited time frame. Then it would be problematic if someone finding a simulator like that online and crush the competition. For the online it is probably fine though. I will be seriously impressed if someone programs a successful AI that is not a ping cannon.

The idea of a windows thing is pretty clever actually. I don’t know how large a time investment that would be, but that sounds like a pretty sweet guarantee of fair use, that c1games might be more comfortable with. I would be surprised if a live event gives programmers enough time to use a windows shared library to make a high performance ML algo.

Ryan_Draves · November 28, 2019, 4:46pm

Yes, I guess you run the risk of having the .so version being passed around as a little cheat engine (although it’s not hard to look into someone’s upload and see the little black box powering everything).

The Windows library idea is just a hands-on, faster version of the jar file game engine in the starter kit, essentially. Even if someone was clever enough to make a good ML algo using that framework (or just using the .dll turn simulator within it), that would sound more than fair to me, as I’m not familiar with any way to make that useful on the live servers.

I think the largest time investment is going through this documentation. I might go through with that idea this weekend if I’m bored enough, and open source it pending approval. If so I bet we could even ping Arnby or one of the faster turn simulators if we wanted to move it towards a community project for a fast offline game engine.

arnby · November 28, 2019, 5:09pm

I already have a turn simulator that can be compiled under Linux (.so) or Windows (.pyd)
This simulator is pretty fast, so C1 just has to say the word and I could share it

Edit: under the assumption that there is no shady workaround that allows to use a .pyd file under Linux

Edit 2: As a security measure, the turn_simulator function could need a token as one of its entries.
This token could last something like 24hrs, and would be generated by a 2nd function of the compiled module. This 2nd function would take 30s to return the 24hrs token, which should make the turn_simulator impossible to use online

Isaac · November 28, 2019, 7:16pm

As someone who wrote my experiences publicly about my machine learning experience as well as creating python extensions, I am all for making the experience easier for others. However, I definitely think this needs to be carefully considered for the reasons you all have identified above.

I think most people agree that sharing source code is a big no no. I personally spent a lot of time on my simulator to make it as flexible and fast as possible and I’ll be perfectly honest in that I don’t love the idea of people getting the equivalent to my hard work just “for free”. I’m all for sharing knowledge about how to create a simulator, just not actually sharing the simulator.

That being said, I feel more conflicted regarding the machine learning aspect. Libraries are now supported on the C1 servers (huge supporter of that), making it easier to run ML code on the actual server. This being said, everyone is correct in that this capability is useless for many algorithms if you can’t run a lot of turns/games quickly for training. I would also just mention that a simulator is not necessary for many machine learning models, however, I agree that some of the most interesting ones do require one.

So ultimately, as many others have noted, this is about:

Whether one should be allowed to share a simulator exclusively for the purpose of ML training
Whether it is possible to safely share a simulator without it being used on C1 servers

1: Should be allowed to share a simulator exclusively for the purpose of ML training:

I’ve talked a little about this above already, but ultimately I believe that yes, simulators in principle should be allowed to be shared ONLY for training a machine learning model. Machine learning is hard enough as it is, and I’m of the opinion that we (as a community and C1) should make it as easy as possible to experiment with an ML model. However, this is easier said than done as talked about in #2.

2: Is possible to safely share a simulator without it being used on C1 servers:

I think this is a much harder question. As such, I feel like there would need to be some consensus on how this would be done. I for one created my simulator as a python extension specifically with ease of use on the endpoint in mind. In other words, it would be extremely easy, should someone get just my .so file, for them to plug and play without understanding anything under the hood of my simulator. This is extremely dangerous in my mind and certainly disqualifies the sharing of .so files in my mind.

I like the idea of being able to share .pyd files, since they wouldn’t run online. Note this is still NOT a perfect solution since as @arnby notes:

This will always technically be a consideration since disassembly is a thing, however, I think the risk is low because for anyone who has the experience to do so it would probably be faster to just create a simulator. A larger consideration in my mind with this approach is that it would alienate users who do not use Windows, giving a potentially unfair advantage to Windows users. Ultimately this is a competition, and fair use of resources is important to the integrity of the competition.

Another point that @Ryan_Draves made is that:

and I think this is a good consideration. C1 wouldn’t necessarily have to check everyone’s code, but only the top players since that’s where the money is distributed. So the solution here would basically C1 saying do what you want, but if you use a simulator (or any .so related file) and are in the top ___%, you are required to explain your code, or maybe provide your source code along with the compiled version.

Something I’m less excited about because it’s more restrictive could be that C1’s servers prevent compiled code from being uploaded. Then, if you wanted to have something like a compiled .so file, there could be some specification on how to do so and they compile your code when you upload it. Again, I don’t love this solution because it brings up a lot of problems C1 saw with creating a C++ starter-kit (which has been for the most part fully written and open source in a PR) which is still not an officially supported language because of the challenges it poses for compiling.

Lastly, I just wanted to mention that while I like the concept of an open-sourced project for creating a turn simulator, I don’t think it’s a good idea because it’s ultimately sharing code. Also, I would predict that the people making the simulator would mostly be people who already have one.

I’d love to get some response from C1 about this and their thoughts. However, I like the fact that we haven’t heard anything yet because it means they are taking the time to talk about it in full internally before commenting (or more likely, it’s Thanksgiving break ).

Happy coding everyone :).

arnby · November 28, 2019, 7:32pm

As always, great summary @Isaac!

That would be deadly for me: my top algos since season 1 are for 98% just compiled c++.

Ryan_Draves · November 28, 2019, 8:06pm

@arnby likewise. My last uploads were fully dependent on a shell of Python and everything else C++, as I’m dependent on that kind of performance.

@Isaac My reference to an open source project wasn’t towards an open-source simulator, but rather the DQN framework I made. The “community” part of it would be to plug in someone else’s (better) compiled simulator. I think the very little success I had with the project makes it more than safe to share as a starting point for any OpenAI gym projects, while also providing a reference for running games locally.

I think you’re also right about anyone who can reasonably disassemble a .pyd file can also just make their own high performance simulator, or at the very least, while not their “own” work, put in just as much work as doing it themselves.