Board encoding for ML

arnby · April 29, 2020, 8:37am

Hi there!

For a school project, I worked with a friend of mine on a ML bot for Terminal.
One of the biggest issue was about how to represent the board game so that a model could interpret it and output some moves.

Output

The first tricky task was to represent the output. Indeed the length is variable (from 0 to theoretically hundreds of units) and several units can be placed on the same tile.
We got around this by creating the notion sub-turn. A sub-turn is either the placement of one unit on one tile or the the action to end the turn. For example a turn with 4 placed units is equivalent to 5 sub-turns.

Input

The first thing we did was to apply a rotation to the board to get rid of the diamond shape and the unused tiles. The tiles would then be in a 15*14 rectangle.
Then we added a third dimension for unit types.
The first 3 rows of the third dimensions would be dedicated to firewalls types and the value of a tile would be the remaining health ratio of the firewall.
The 3 next rows would be for the informations units. We decided there to encode a ratio of how many informations were placed on the tile against an arbitrary number (40 for pings and scramblers, 10 for EMP)
The last row was used to put a removal flag so that the model could know that it had flagged a unit in a previous sub-turn.

We also made a second input with useful indicators (health, cores, bits …)

Our model was a mixed CNN - NN, CNN for the board and NN for for the indicators.

Some drawbacks:

the input has a huge majority of zeros, which is not very good for learning
the sub-turn notion induce that there is a priority order for unit placements when sometimes there is not.
Season 5 pretty much made the input worse as 3 rows have to be added to the third dimension to take into account upgraded units.

I would love to hear your thought on this! Would you have an idea to construct a better representation of the game?

Demorf · April 29, 2020, 9:14am

If this was possible, I would not be here trying to solve it. (and C1 team would make sure the game is more complex)

Instead of solving the hole thing, why don’t you try with something simpler (where a bot can shine with calculation power and no intuition)
May be split it in more specific tasks, and the representation will be simpler:

Take a fixed board of static units and only calculate only possible attacks
take a initial board, calculate possible attacks for the enemy, and suggest optimal defense (with FireWalls only)
make it in a smaller frame, with adjusted config: 7:7 with 1-2 range for attacks.

acshikh · April 29, 2020, 7:54pm

I’ll second Demorf’s suggestion to at least try solving a simpler problem, especially at first!

For example, what if the only thing you are trying to learn is where to place filters given static positions of destructors, and static attacks to defend against? And again, maybe starting with a smaller simulated board size.

Etc… etc… but basically until you can make it successful on a simpler model it isn’t worth trying to scale up!

Demorf · April 30, 2020, 4:26am

Also … not sure why you try to have a unified model of everything …
sounds cool yes, but complexity grows exponentially.
Board representation may be depend on the task.
For example for my pathfinding sim, I have the board represented as Set of Tuples with 2 elements,
Just the location, (all the board locations - static unit locations)
And the set is not even ordered with no “grid” awareness.

arnby · April 30, 2020, 1:40pm

Thanks for your replies however you are answering aside. I may not have been very clear on my goal here.

Obviously I know that tackling a smaller problem is easier and most of the time provides faster results.
But what interests me here is purely “How far can a full ML algo climbs the ladder?”
I already did some work on that, and got some interesting results that I talked about in this topic. The algo even went near a rating of 2100 at the time.

I just wanted to discuss about the data structure that could be used because the way it is today doesn’t please me. It can also give some pointers if others want to toy with ML.

ML is not a magical solution, just because you have some working ML algos doesn’t mean that every other ways of addressing Terminal will be less efficient.

max1e6 · April 30, 2020, 3:59pm

I have played around with a CNN based bot. My encoding is pretty similar to yours tbh.

The encoding is something like a 32x28x28 tensor. I have one plane each for each type of player one’s firewalls, then player two’s firewalls, then player one’s stats (hp, cores, bits, these planes are spatially constant), then player two’s stats. The remainder of the plane’s encode the sub-turn’s that player one has made so far.

The output of the network is then just one sub-turn, i.e. placing a unit, removing, upgrading or submitting the turn.

You may say that this input encoding consumes somewhat more memory than it really needs, but the point of this is to format the input in the way that is most friendly to a CNN. Extra inputs/zeros should not reduce the performance of a CNN too much.

My CNN bot was a pure self-play based RL algorithm, I think its highest Elo was somewhere above 1000, but not much above. So it so far has only managed to do a little better than a random algorithm.

arnby · April 30, 2020, 5:56pm

great @max1e6! Thanks for sharing

I tried to keep the third dimensions a minimal as possible but you did quite the opposite, that’s interesting!

but I think this is weird as you have already a lot of plane to encode this.

Ok I see what you mean, I may try to augmente expand a little the third dimension

this is really interesting! How did you setup the RL environnement? Did you made it yourself? How long did you train it?

acshikh · April 30, 2020, 7:31pm

I think the main thing is I was disagreeing about the specific goal! You could say I am in the camp of people generally against end-to-end ML as being too slow and error prone, but as an extremely useful tool when applied a little less ambitiously

arnby · April 30, 2020, 8:13pm

I totally agree with you. In a real case, you need to use ML with care, most of the time you don’t even need it.
But Terminal is a game, that’s why I want to use this opportunity to use only ML as I have absolutely no performance goal or whatever.

max1e6 · April 30, 2020, 9:24pm

I’m still not really sure whether this way is better though, from the point of view of making it easier for the algo to learn stuff. Obviously the larger encoding will be more computationally expensive but not so much that it has a significant impact on the speed of my training.

Yes, it probably is overkill. At the time I chose to encode it this way so that the algorithm would be able to distinguish between a firewall played this turn vs a firewall played in a previous turn that the opponent would already know about. But my algorithm so far is such a bad player that this distinction is 100% irrelevant.

I set up the RL environment using the simulator that I wrote for my algorithms. The RL training so far has been quite a journey of slow debugging and improvements. The thing that makes this really hard is that there is no good way to measure whether the bot is actually improving or not, and thus there is no good way to really drilll down and find things to improve. The process honestly mostly involves generating a guess of what could be going wrong and then experimenting with different solutions.

So far I have tried both a policy learning (PPO) and a q-learning (a custom algorithm, but basically a short experience replay buffer with batched updates) approach. Of the two, q-learning seems to be more stable.

The architecture of the q-learning net also seems to make a big difference. Improvements like bottle-necking, SE blocks, and dueling net all seemed to help the network learn to beat simple opponents.

I originally was doing all my experiments with a self play setup, but this was a really inefficient way for me to improve the training. In order to test out network architecture I used supervised learning to train the network to build a fixed formation of units. Then I used RL to teach it to build the same formation in order to experiment with different RL algos. Now I have moved on to training vs a simple bot using RL, and as it beats the bot I manually make the bot a little harder.

I am now at the point where I am training vs a relatively simple bot that the RL algo can’t seem to make progress against. The problem is that the RL agent fails to discover relatively basic strategies like protecting destructors with filters. I see two potential causes for this behavior, a) either the net never tries this behavior in the first place, or b) the net tries it but then does not manage to learn to do it regularly. So now I need to figure out which of these is the problem and then add some solution (maybe improved exploration or search).

I don’t have time to work on it right now but I hope to have more time in the future to experiment with it. I’m not sure if it will ever work though, since I do not have the compute resources necessary to run the really large experiments that seem to be necessary for all deep RL work.

arnby · May 1, 2020, 6:50am

To do that, I made a pool of algos and just made them affront each other as in Terminal (with rating and so on) . It hugely depend on how strong and diverse the pool of algos is but it give a good way to gauge the RL algo.
From time to time, instead of self play I would also made it go against one of those algos.

Deep Q Learning I guess?
I also tried it several month ago and I also found that’s what was best. One question I was unsure about was how to propagate score through sub-turn. Currently I just give a score to the sub-turn which end the turn and put a high discount factor so that it can propagate through the sub-turns. Did you make it this way too?

Did not try that, I’m gonna look into it.

max1e6 · May 1, 2020, 2:43pm

That sounds really good, if I have time to make a whole pool of algos I will do something like that.

I just say that every sub-turn is a state, and after an episode completes I generate a score for every sub-turn by running them through a checkpointed version of my q-net. Then I generate the target for each sub-turn with a weighted average of all the sub-turns that come afterward, like in TD(lambda).

What approaches have you tried for RL?

arnby · May 1, 2020, 5:50pm

Only DQ Learning but my models were too simples. Now I have to rework almost everything

max1e6 · May 1, 2020, 7:05pm

Have you experimented with combining search with the model output? I want to look into that soon

arnby · May 1, 2020, 7:56pm

what do you mean by combining search?

max1e6 · May 1, 2020, 8:16pm

I want to do a tree search over possible moves which is guided by the q-values for each move. So that the tree search essentially tries many of the best moves suggested by the model.

arnby · May 2, 2020, 7:02am

Yes that’s definitely something I thought about! It could have great potential
The only thing that refrained me from implementing it yet is that it greatly increase training time as you have to make one or several forward pass on the network and then perform the search