MCTS is the cornerstone of AlphaGo and many AI applications. We aim to build some intuitions and along the way get our hands dirty.
Published in · 6 min read · Jan 10, 2021
Monte Carlo Tree Search (MCTS) is an important algorithm behind many major successes of recent AI applications such as AlphaGo’s striking showdown in 2016.
In this blog, we will first start with uninformed search, in which we simply traverse the whole search space to find the optimum; this includes depth-first search and breadth-first search.
Then we describe how the MCTS algorithm works. In the end, we apply it to a toy problem: finding the most rewarding leaf node of a binary tree.
Uninformed search, as its name suggests, is a family of generic search algorithms that are given no extra information beyond an abstract problem definition. Although they can be applied universally given the right abstraction of a problem, they typically suffer from efficiency problems as the problem grows large.
A tree with a branching factor of b and depth of d has b^d (read as b to the power of d) leaf nodes.
In games, we use a so-called game tree where each node represents a state of a game and its children nodes represent possible next states for all possible actions a player can take.
To give some idea of scale, Go has an average branching factor of about 250 according to Wikipedia. Some back-of-the-envelope calculation quickly shows how prohibitive it is to use uninformed search in Go:
- At step 1: 250
- At step 2: 250² = 62500
- At step 3: 250³ = 15,625,000
- At step 4: 250⁴ = 3,906,250,000
- At step 5: 250⁵ = 976,562,500,000
- …
- At step 10: 250¹⁰ = 953,674,316,406,250,000,000,000
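The figures above are easy to reproduce with a couple of lines of Python:

```python
def num_states(b: int, d: int) -> int:
    """Number of possible positions after d moves in a game tree
    with branching factor b: simply b to the power of d."""
    return b ** d

# Go's average branching factor is roughly 250.
for step in (1, 2, 3, 4, 5, 10):
    print(f"At step {step}: {num_states(250, step):,}")
```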
After 10 steps, we already see an astronomical number of possible game states. This also gives some sense of why the win of AlphaGo is such a milestone for humankind.
Now we understand the fact that we need something smarter than uninformed search to navigate through a gigantic state space like the game of Go. MCTS is one such smart algorithm that gives us a way out.
Essentially, MCTS uses Monte Carlo simulation to accumulate value estimates to guide towards highly rewarding trajectories in the search tree. In other words, MCTS pays more attention to nodes that are more promising, so it avoids having to brute force all possibilities which is impractical to do.
At its core, MCTS consists of repeated iterations (ideally infinite, in practice constrained by computing time and resources) of 4 steps: selection, expansion, simulation and backup.
Selection
In this step, we use a tree policy to construct a path from the root to the most promising leaf node.
A leaf node is a node which has unexplored child node(s).
A tree policy is an informed policy used for action (node) selection in the snowcap (explored part of the game tree as opposed to the vast unexplored bottom part). One important consideration of such policy is the balance of exploration vs exploitation, a recurring topic in AI, especially in Reinforcement Learning.
In the algorithm behind AlphaGo, a UCB-based policy is used. More specifically, each node has an associated UCB value: its average reward plus an exploration bonus c·sqrt(ln N / n_i), where N is the parent's visit count and n_i the node's own visit count. During selection we always choose the child node with the highest UCB value.
As we can see, the more a node i is visited, the lower the second term in UCB becomes, hence decreasing its probability of being selected again. Therefore, the UCB algorithm has exploration and exploitation inherently built in. Smart and neat!
For a more rigorous yet still easy-to-follow account of exploration vs exploitation (including UCB), please read Lilian Weng’s blog on this topic.
Finally, in the selection step, we can traverse the snowcap or explored part of the tree until a leaf node has been reached.
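As a sketch of UCB1-based selection (the `Node` class and its `visits`, `total_reward`, and `children` attributes are illustrative names of my own, not the repo's actual code):

```python
import math

def ucb(node, parent_visits, c=1.4):
    """UCB1 value: average reward (exploitation) plus an exploration
    bonus that shrinks as the node gets visited more often."""
    if node.visits == 0:
        return float("inf")  # always try unvisited children first
    exploit = node.total_reward / node.visits
    explore = c * math.sqrt(math.log(parent_visits) / node.visits)
    return exploit + explore

def select_child(node):
    """Pick the child with the highest UCB value."""
    return max(node.children, key=lambda ch: ucb(ch, node.visits))
```

Repeating `select_child` from the root until we hit a leaf gives exactly the selection path described above.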
Expansion
Remember that we reached a leaf node at the end of selection?
In the expansion step, we simply pick, at random, an unexplored child node of that leaf.
Simulation (or Roll-out)
During this step, we roll out one or more simulations, accumulating reward for each. The roll-out policy is normally simple, or even purely random, so that it is fast to execute. For example, a win might yield a reward of +1, a draw 0, and a loss -1.
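A purely random roll-out might be sketched like this (the `children` and `reward` attributes are assumed interfaces for illustration, not the repo's API):

```python
import random

def simulate(node):
    """Roll out one game from `node` with a purely random policy.
    Returns the terminal reward, e.g. +1 for a win, 0 for a draw,
    -1 for a loss."""
    while node.children:                  # not yet a terminal state
        node = random.choice(node.children)
    return node.reward
```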
One might ask, why bother?
In the game of Go, this step is the algorithm playing the game out to the end in its head, many times over, much as a Go master might. The true genius sees not only the future, but many versions of it.
Backup
Now that we have played out many simulations, what can we do with them to inform the best next move?
We use the accumulated rewards of simulations to back up and update the values of nodes in the snowcap.
Note: we don’t update the values of nodes visited during the roll-out steps! This is because we need to focus on the vicinity of the root node (the snowcap), based on which we make decisions about the next move(s). The values outside the snowcap are not relevant for such decisions, and storing and computing them would be computationally wasteful.
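A minimal sketch of such a backup, again with illustrative `visits`/`total_reward` attributes:

```python
def backup(path, reward):
    """Propagate the simulation reward up the selection path only
    (root to expanded leaf); nodes visited during the roll-out
    itself are never updated."""
    for node in path:
        node.visits += 1
        node.total_reward += reward
```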
Putting it together
The pseudo-code might look like below:

```python
def run(node, num_rollout):
    """One iteration of select -> expand -> simulate -> backup."""
    path = select(node)
    leaf = path[-1]
    expand(leaf)
    reward = 0
    for i in range(num_rollout):
        reward += simulate(leaf)
    backup(path, reward)
```
Note that this is one iteration of MCTS for a given root node, e.g., a node in the game tree. In practice, we run as many iterations of these 4 steps as possible, and within each simulation step multiple roll-outs can be done. Once computation time or resources are exhausted, we stop the iteration loop and decide on the next move to take; that leaves us with a new root node, from which we run the iterations again, and so on and so forth…
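Once the budget is exhausted, a common (greedy) way to commit to a move is to pick the root's child with the highest average reward; a sketch, assuming each node tracks illustrative `visits` and `total_reward` attributes as before:

```python
def best_move(root):
    """Pick the root's child with the highest average reward; a
    common greedy rule for committing to a move once the search
    budget has run out."""
    return max(root.children,
               key=lambda ch: ch.total_reward / max(ch.visits, 1))
```

The `max(ch.visits, 1)` guard simply avoids dividing by zero for children that were never visited.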
(Code available at: git repo)
Let’s suppose we have a binary tree of depth 7, whose leaf nodes each carry a reward between 0 and 100, drawn from a uniform distribution. Our task is to find the highest reward among the 2⁷ = 128 leaf nodes (a small note: leaf nodes are used here in their rudimentary meaning, i.e., nodes in the tree without children, unlike in MCTS where we use leaf node to denote a node with unexplored children).
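The toy setup might be sketched as follows (`leaf_rewards` is a hypothetical helper of my own, not the repo's code):

```python
import random

def leaf_rewards(depth=7, seed=None):
    """Rewards for the 2**depth leaves of a full binary tree,
    drawn uniformly from [0, 100]."""
    rng = random.Random(seed)
    return [rng.uniform(0, 100) for _ in range(2 ** depth)]

rewards = leaf_rewards(depth=7)
# The task: find the maximum among these 128 values by searching
# the tree with MCTS instead of scanning them all exhaustively.
```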
The reader can check out my git repo and run find_max_leaf_binary_tree.py. The code handily includes a few parameters:
- numb_iter to change the number of MCTS iterations
- num_rollout to change the number of roll-outs during simulation
- exploration_weight to control the balance between exploration and exploitation
Enough to play with!
The default run shows results below:
Expected (max) leaf node: node@7_53, value: 99.16984625936269
Expected (min) leaf node: node@7_121, value: 1.2521064278136151
Found optimal (max) leaf node: node@7_104, value: 95.8546572417074
Not too bad: the value of the found optimal node is pretty close to the expected maximum (note you might not get exactly the same result, due to the probabilistic nature of the algorithm).
Many AI problems have their roots in search problems. After all, life is about searching. Hope you all find your optimal path :).