Thanks for the great feedback! This will be quite useful. Most of the guides/light research I’ve done so far was aimed at getting something working, so this will be a great starting point for figuring out how to make it good (instead of just barely functional).
- I’m using a fully connected model with an input representation of the game map, a hidden layer of size ~40 (completely arbitrary), and an output representation of placements/removals. Previously the hidden layer(s) were much larger, but those sizes were equally arbitrary and I figured they probably slowed the learning. I’ve yet to experiment with convolutional layers, though I can see how they could be beneficial. Definitely some more stuff to look into to help improve the model. It’s compiled with Adam (and with an equally arbitrary learning rate I pulled from some keras-rl example).
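For concreteness, here’s roughly what that setup looks like, going off the standard keras-rl example structure rather than my exact code (the map shape, action count, and learning rate are placeholders I made up):
# A minimal sketch of the setup described above, in the style of the keras-rl examples.
# MAP_SHAPE, NB_ACTIONS, and the lr are placeholders, not the real values.
from keras.models import Sequential
from keras.layers import Flatten, Dense
from keras.optimizers import Adam
from rl.agents.dqn import DQNAgent
from rl.policy import BoltzmannQPolicy
from rl.memory import SequentialMemory

MAP_SHAPE = (28, 28)   # hypothetical game-map input
NB_ACTIONS = 200       # hypothetical number of placement/removal outputs

model = Sequential()
model.add(Flatten(input_shape=(1,) + MAP_SHAPE))   # keras-rl prepends a window dim of 1
model.add(Dense(40, activation='relu'))            # the ~40-unit hidden layer
model.add(Dense(NB_ACTIONS, activation='linear'))  # raw Q-values, one per action

dqn = DQNAgent(model=model, nb_actions=NB_ACTIONS,
               memory=SequentialMemory(limit=50000, window_length=1),
               policy=BoltzmannQPolicy(), nb_steps_warmup=100,
               target_model_update=1e-2)
dqn.compile(Adam(lr=1e-3), metrics=['mae'])         # the lr here is a guess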
- Not really sure, as I assumed it was in the “magic” of the keras-rl library, but I’m starting to think I should pay more attention to this. While it’s training, it uses the BoltzmannQPolicy, which I believe “explores” according to these lines, where the q_values are the forward-pass outputs of the neural net:
# Probably defaults, don't think I set them
exp_values = np.exp(np.clip(q_values / self.tau, self.clip[0], self.clip[1]))
probs = exp_values / np.sum(exp_values)
# This line I modified to sample "multiple outputs" (choices = number of placements) without replacement
actions = np.random.choice(range(nb_actions), choices, replace=False, p=probs)
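To get a feel for what tau does there, here’s the same softmax as a standalone function (tau=1 and clip=(-500, 500) are keras-rl’s defaults, I believe): a larger tau flattens the probabilities toward uniform exploration, a smaller tau makes the sampling nearly greedy. The Q-values below are made up.
import numpy as np

def boltzmann_probs(q_values, tau=1.0, clip=(-500.0, 500.0)):
    """Temperature-scaled softmax over Q-values (same math as the snippet above)."""
    exp_values = np.exp(np.clip(np.asarray(q_values) / tau, clip[0], clip[1]))
    return exp_values / np.sum(exp_values)

q = [1.0, 2.0, 3.0]                  # made-up Q-values
print(boltzmann_probs(q, tau=1.0))   # ~[0.09, 0.24, 0.67] -- some exploration
print(boltzmann_probs(q, tau=0.1))   # ~[0.00, 0.00, 1.00] -- nearly greedy
print(boltzmann_probs(q, tau=10.0))  # ~[0.30, 0.33, 0.37] -- close to uniform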
- (Still in bullet point 2) In testing, it uses the GreedyQPolicy and just selects its choices from the highest q_values. A little more under the hood there’s some sanitization of the outputs to prevent illegal moves: it’s given a number of choices equal to its cores, and it selects from those choices until it has “spent” its cores on areas not already occupied (roughly what the masking sketch further down shows). I’m starting to see that as an issue, though: it’s being fit to produce outputs for one given game state, but by the next turn those outputs are taken and it has to fit to something else it probably didn’t want. I’m also not sure what the probabilities it uses while training look like, which might be a separate issue if it isn’t exploring enough as the outputs become more fit. AFAIK that’s the only point of exploration, but I’d have to dig into keras-rl more to figure that out.
- I haven’t explored how exactly it manages the rewards, but I should certainly check that it’s “splashing” the rewards back through the episode to reduce the greed, at least on episodes that end with the large “win” bounty. I’m guessing it isn’t doing that enough, or at least not far enough back, if it ended up preferring the greedier opening turns despite a much worse ending (see the discounted-return sketch below).