The Evolution of AlphaGo to MuZero
DeepMind recently released their MuZero algorithm, headlined by superhuman ability in 57 different Atari games.
Reinforcement Learning agents that can play Atari games are interesting because, in addition to having a visually complex state space, Atari games don't come with a perfect simulator.
This requirement for a “perfect simulator” is one of the key limitations that keeps AlphaGo, and subsequent improvements such as AlphaGo Zero and AlphaZero, confined to Chess, Shogi, and Go, and makes them unsuitable for certain real-world applications such as robotic control.
Reinforcement Learning problems are framed within Markov Decision Processes (MDPs) depicted below:
The family of algorithms spanning AlphaGo, AlphaGo Zero, AlphaZero, and MuZero extends this framework by using planning, depicted below:
DeepMind’s AlphaGo, AlphaGo Zero, and AlphaZero exploit having a perfect model of (state, action) → next state to do lookahead planning in the form of Monte Carlo Tree Search (MCTS). MCTS is a perfect complement to using Deep Neural Networks for policy mappings and value estimation because it averages out the errors from these function approximations. MCTS provides a huge boost for AlphaZero in Chess, Shogi, and Go, where you can do perfect planning because you have a perfect model of the environment.
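To make the role of the perfect model concrete, here is a minimal sketch of a single MCTS simulation in that setting. The helpers `legal_moves`, `apply_move`, `policy_net`, and `value_net` are hypothetical stand-ins, not DeepMind's interfaces; the point is only that `apply_move` can be queried exactly because the game rules are fully known.

```python
import math

class Node:
    """One search-tree node; priors come from the policy network."""
    def __init__(self, state, prior=1.0):
        self.state = state
        self.prior = prior        # P(s, a) assigned by the policy network
        self.visits = 0           # N(s, a)
        self.value_sum = 0.0      # sum of backed-up values
        self.children = {}        # move -> Node

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def puct_score(parent, child, c_puct=1.5):
    """Q + u(P): exploitation term plus prior-weighted exploration bonus."""
    u = c_puct * child.prior * math.sqrt(parent.visits) / (1 + child.visits)
    return child.q() + u

def simulate(root, legal_moves, apply_move, policy_net, value_net):
    """One MCTS simulation, relying on an exact (state, action) -> state rule."""
    path, node = [root], root
    # 1. Selection: descend the tree by maximizing Q + u(P).
    while node.children:
        parent = node
        _, node = max(parent.children.items(),
                      key=lambda kv: puct_score(parent, kv[1]))
        path.append(node)
    # 2. Expansion: the perfect simulator gives the exact next state.
    for move in legal_moves(node.state):
        node.children[move] = Node(apply_move(node.state, move),
                                   prior=policy_net(node.state, move))
    # 3. Evaluation and backup: average the leaf value up the path.
    value = value_net(node.state)
    for n in reversed(path):
        n.visits += 1
        n.value_sum += value
        value = -value            # flip perspective for the two-player game
```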
MuZero salvages MCTS planning in domains without such a model by learning a dynamics model, depicted below:
What is distinctive about MuZero's approach to Model-Based Reinforcement Learning, a parametric model mapping (s, a) → (s', r), is that it does not reconstruct the pixel space at s'. Contrast that with the image below from “World Models” by Ha and Schmidhuber:
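As a concrete, hypothetical illustration of what “no pixel reconstruction” means, the dynamics model can be sketched as a network whose outputs are a next hidden state and a reward rather than a next image. The layer sizes and names below are invented for the sketch and are not MuZero's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentDynamics(nn.Module):
    """g: (hidden state, action) -> (next hidden state, reward).
    Note there is no decoder back to pixels anywhere in this model."""
    def __init__(self, hidden_dim=64, num_actions=4):
        super().__init__()
        self.num_actions = num_actions
        self.trunk = nn.Sequential(
            nn.Linear(hidden_dim + num_actions, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
        )
        self.next_state_head = nn.Linear(128, hidden_dim)
        self.reward_head = nn.Linear(128, 1)

    def forward(self, hidden_state, action):
        one_hot = F.one_hot(action, self.num_actions).float()
        x = self.trunk(torch.cat([hidden_state, one_hot], dim=-1))
        return self.next_state_head(x), self.reward_head(x)
```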
This planning algorithm from MuZero is very successful in the Atari domain and could have enormous application potential for Reinforcement Learning problems. This article will walk through the evolution from AlphaGo to AlphaGo Zero, AlphaZero, and finally MuZero to build a better understanding of how MuZero works. I have also made a video explaining this if you are interested:
AlphaGo
AlphaGo is the first paper in the series, showing that Deep Neural Networks could play the game of Go by predicting a policy (mapping from state to action) and value estimate (probability of winning from a given state). These policy and value networks are used to enhance tree-based lookahead search by selecting which actions to take from given states and which states are worth exploring further.
AlphaGo uses 4 Deep Convolutional Neural Networks: 3 policy networks and a value network. 2 of the policy networks are trained with supervised learning on expert moves.
Supervised learning describes loss functions of the form L(y', y). In this case, y' is the action the policy network predicted from a given state, and y is the action the expert human player actually took in that state.
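As a small illustration of that L(y', y), a cross-entropy loss between the network's move logits and the expert's move index might look like the following; the 19×19 = 361 move space and the variable names are assumptions for the sketch, not AlphaGo's training code.

```python
import torch
import torch.nn.functional as F

def sl_policy_loss(policy_logits, expert_moves):
    """L(y', y): cross-entropy between the network's move distribution (y')
    and the move the expert actually played (y)."""
    return F.cross_entropy(policy_logits, expert_moves)

# Hypothetical usage: a batch of 32 positions, 361 candidate moves on a 19x19 board.
loss = sl_policy_loss(torch.randn(32, 361), torch.randint(0, 361, (32,)))
```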
The rollout policy is a smaller neural network that also takes in a smaller input state representation. As a consequence, it models expert moves with significantly lower accuracy than the higher-capacity network. However, the rollout policy network's inference time (the time to predict an action given a state) is 2 microseconds compared to 3 milliseconds for the larger network, making it useful for Monte Carlo Tree Search simulations.
The SL policy network is used to initialize the third policy network, which is trained with self-play and policy gradients. Policy gradients describe the idea of optimizing the policy directly with respect to the resulting rewards, in contrast to other RL algorithms that learn a value function and then make the policy greedy with respect to that value function. The policy-gradient-trained policy network plays against previous iterations of its own parameters, optimizing them to select the moves that result in wins. The self-play dataset is then used to train a value network to predict the winner of a game from a given state.
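A hedged REINFORCE-style sketch of that policy-gradient idea: increase the log-probability of moves from games that were won and decrease it for games that were lost. The tensors and names here are illustrative placeholders, not AlphaGo's actual training code.

```python
import torch

def self_play_policy_gradient_loss(log_probs_of_taken_moves, game_outcome):
    """Policy gradient objective for one self-play game.
    log_probs_of_taken_moves: log pi(a_t | s_t) for each move played.
    game_outcome: +1.0 if this player won, -1.0 if it lost."""
    return -(game_outcome * log_probs_of_taken_moves).sum()

# Hypothetical usage: log-probabilities of 200 moves from a game that was won.
loss = self_play_policy_gradient_loss(torch.log(torch.rand(200)), 1.0)
```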
The final workhorse of AlphaGo is the combination of policy and value networks in MCTS, depicted below:
The idea of MCTS is to perform lookahead search to get a better estimate of which immediate action to take. This is done by starting from a root node (the current state of the board), expanding that node by selecting an action, and repeating this with the subsequent states that result from each state-action transition. MCTS chooses which edge of the tree to follow based on a Q + u(P) term: a weighted combination of the value estimate of the state, the prior probability the policy network assigned to the action, and a negative weighting on how many times the node has already been visited, which pushes the repeated simulations toward less-explored branches. Unique to AlphaGo is the use of rollout policy simulations to complement the value network. The rollout policy simulates to the end of the episode, and whether that playout resulted in a win or a loss is blended with the value function's estimate of that state using an extra parameter, lambda.
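A minimal sketch of that blended leaf evaluation, with `rollout_policy`, `value_net`, and the game helpers as hypothetical placeholders; lambda controls how much weight the rollout outcome gets relative to the value network's estimate.

```python
def rollout_to_end(state, rollout_policy, apply_move, is_terminal, outcome):
    """Play the fast rollout policy until the game ends; return +1 (win) or -1 (loss)."""
    while not is_terminal(state):
        state = apply_move(state, rollout_policy(state))
    return outcome(state)

def leaf_value(state, value_net, rollout_policy, apply_move,
               is_terminal, outcome, lam=0.5):
    """AlphaGo-style blend of the value-network estimate and the rollout result."""
    v = value_net(state)
    z = rollout_to_end(state, rollout_policy, apply_move, is_terminal, outcome)
    return (1 - lam) * v + lam * z
```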
AlphaGo Zero
AlphaGo Zero significantly improves the AlphaGo algorithm by making it more general and starting from “Zero” human knowledge. AlphaGo Zero drops the supervised-learning initialization from expert moves and combines the value and policy networks into a single neural network. This network is also scaled up to a ResNet, compared to the simpler convolutional network in AlphaGo. The contribution of a ResNet performing both value and policy mappings is evident in the diagram below, comparing the dual-task ResNet to separate-task CNNs:
One of the most interesting characteristics of AlphaGo Zero is the way it trains its policy network using the action distribution found by MCTS, depicted below:
The action distribution found by MCTS is used as a supervision signal to update the policy network. This is a clever idea, since MCTS produces a better action distribution through lookahead search than the policy network's instant mapping from state to action.
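A sketch of that supervision signal: the root visit counts from MCTS are normalized into a target distribution π, and the dual-headed network is trained with a cross-entropy term against π plus a mean-squared error against the game outcome z. (The paper also adds an L2 weight penalty, omitted here; names and shapes below are illustrative.)

```python
import torch
import torch.nn.functional as F

def alphago_zero_loss(policy_logits, value_pred, mcts_visit_counts, game_outcome):
    """Cross-entropy to the MCTS visit distribution plus MSE to the final result."""
    pi = mcts_visit_counts / mcts_visit_counts.sum(dim=-1, keepdim=True)
    policy_loss = -(pi * F.log_softmax(policy_logits, dim=-1)).sum(dim=-1).mean()
    value_loss = F.mse_loss(value_pred.squeeze(-1), game_outcome)
    return policy_loss + value_loss

# Hypothetical usage: batch of 8 positions, 362 moves (361 board points + pass).
loss = alphago_zero_loss(torch.randn(8, 362), torch.randn(8, 1),
                         torch.rand(8, 362), torch.ones(8))
```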
AlphaZero
AlphaZero is the first step towards generalizing the AlphaGo family outside of Go, looking at changes needed to play Chess and Shogi as well. This requires formulating input state and output action representations for the residual neural network.
In AlphaGo, the state representation uses a few handcrafted feature planes, depicted below:
AlphaGo Zero uses a more general representation, simply passing in the stone positions from the previous 8 board states for both players and a binary feature plane telling the agent which color it is playing, depicted below:
AlphaZero uses a similar idea to encode the input state representation for Chess and Shogi, depicted below:
AlphaZero also makes some more subtle changes to the algorithm, such as the way the self-play champion is crowned and the elimination of Go-specific data augmentations such as reflections and rotations.
MuZero
This leads us to the current state of the art in this series, MuZero. MuZero presents a very powerful generalization of the algorithm that allows it to learn without a perfect simulator. Chess, Shogi, and Go are all examples of games that come with a perfect simulator: if you move your pawn forward two squares, you know exactly what the resulting state of the board will be. You can't say the same about applying 30 N of force to a given joint in complex dexterous manipulation tasks like OpenAI's Rubik's Cube hand.
The diagram below illustrates the key ideas of MuZero:
Diagram A shows the pipeline of using a representation function h to map raw observations into a hidden state s0 that is used for tree-based planning. In MuZero, the combined value/policy network reasons in this hidden state space, so rather than mapping raw observations to actions or value estimates, it takes these hidden states as inputs. The dynamics function g learns to map from a hidden state and an action to a future hidden state.
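A tiny sketch of the data flow in diagram A, treating `h`, `g`, and `f` as the three learned functions (here just callables); all names are illustrative, not MuZero's code.

```python
def plan_one_action_sequence(observation, actions, h, g, f):
    """Unroll a candidate action sequence entirely in the learned hidden space.
    h: observation -> s0, g: (s, a) -> (s', r), f: s -> (policy logits, value)."""
    s = h(observation)                 # representation function
    total_reward = 0.0
    for a in actions:
        s, r = g(s, a)                 # dynamics function: no pixels involved
        total_reward += r
    _, value = f(s)                    # prediction function at the final hidden state
    return total_reward + value        # rough (undiscounted) estimate of this path
```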
Diagram B shows how the policy network is similarly trained by mimicking the action distribution produced by MCTS as first introduced in AlphaGo Zero.
Diagram C shows how this system is trained. Each of the three neural networks is trained in a joint optimization of the difference between the value network's prediction and the actual return, the difference between the reward predicted by the dynamics model and the intermediate reward actually experienced, and the difference between the MCTS action distribution and the policy mapping.
How does the representation function h get trained in this optimization loop?
The representation function h comes into play in this joint optimization through back-propagation through time. Say you are taking the difference between the MCTS action distribution pi(s1) and the policy distribution p(s1). The output p(s1) is really p(g(s0, a1)), which in turn is p(g(h(raw_input), a1)). This is how backprop through time sends update signals all the way back into the representation function as well.
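A minimal sketch of that unrolled joint loss under the same `h`, `g`, `f` placeholders and made-up target tensors: because every prediction at step k is a function of g(...g(h(obs), a1)..., ak), calling backward() on the summed loss pushes gradients through every dynamics step and into the representation function h.

```python
import torch
import torch.nn.functional as F

def muzero_unrolled_loss(obs, actions, target_policies, target_values,
                         target_rewards, h, g, f):
    """Sum policy / value / reward losses over K unrolled steps.
    Gradients reach h via the chain f(g(...g(h(obs), a_1)..., a_k))."""
    s = h(obs)                                          # hidden state s0
    loss = 0.0
    for k, a in enumerate(actions):
        policy_logits, value = f(s)
        # policy target: the MCTS action distribution pi at this step
        loss = loss - (target_policies[k]
                       * F.log_softmax(policy_logits, dim=-1)).sum(dim=-1).mean()
        # value target: the observed return
        loss = loss + F.mse_loss(value.squeeze(-1), target_values[k])
        s, reward = g(s, a)                             # dynamics step
        # reward target: the intermediate reward actually experienced
        loss = loss + F.mse_loss(reward.squeeze(-1), target_rewards[k])
    return loss
```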
AlphaGo → AlphaGo Zero → AlphaZero → MuZero
I hope this article helped clarify how MuZero works within the context of the previous algorithms, AlphaGo, AlphaGo Zero, and AlphaZero! Thanks for reading!