The success of AlphaGo Zero is a milestone of AI technology. Here we do not try to discuss its social, cultural or even ethical impact. Instead we are interested in the following technical question: why can AlphaGo Zero converge with a limited amount of self-play generated training data and a limited computational cost?
The factors that may influence the performance of AlphaGo Zero include (a) the inherent properties of the game of Go and (b) the structure of AlphaGo Zero (the ResNet-based value and policy network, MCTS, and the reinforcement learning framework).
In this paper we try to give a qualitative answer to this question by showing that AlphaGo Zero can be understood as a special GAN with an expected good convergence property. In other words, research on GANs has identified conditions for the convergence of GANs, and AlphaGo Zero seems to fulfill them well.
II GAN and its convergence
Generative Adversarial Networks (GANs) are designed to approximate a data distribution, given samples drawn from it, with two competing models fighting against each other: a generative model G that captures the data distribution and a discriminative model D that distinguishes the training samples from generated fake data samples.
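As a concrete, if minimal, illustration of this adversarial setup, the following toy sketch trains a two-parameter generator against a logistic discriminator on 1-D Gaussian data. The setup, parameter values and learning rates are our own illustrative choices, not taken from any particular GAN paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D GAN: G(z) = mu + sigma * z tries to match N(3, 1);
# D(x) = sigmoid(w * x + b) tries to tell real samples from fake ones.
def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

mu, sigma = 0.0, 1.0   # generator parameters
w, b = 0.1, 0.0        # discriminator parameters
lr, n = 0.05, 256

for step in range(2000):
    z = rng.standard_normal(n)
    real = 3.0 + rng.standard_normal(n)
    fake = mu + sigma * z

    # --- discriminator: gradient ascent on log D(real) + log(1 - D(fake)) ---
    d_real, d_fake = sigmoid(w * real + b), sigmoid(w * fake + b)
    gw = np.mean((1 - d_real) * real) - np.mean(d_fake * fake)
    gb = np.mean(1 - d_real) - np.mean(d_fake)
    w += lr * gw
    b += lr * gb

    # --- generator: gradient ascent on log D(fake) (non-saturating loss) ---
    d_fake = sigmoid(w * fake + b)
    g_common = (1 - d_fake) * w   # chain rule through fake = mu + sigma * z
    mu += lr * np.mean(g_common)
    sigma += lr * np.mean(g_common * z)

print(round(mu, 1))  # mu should move toward the real mean (3.0)
```

The two updates are exactly the competition described above: D is pushed to separate real from fake, while G is pushed toward whatever D currently accepts as real.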
It is well known that the training of GANs is more difficult than that of normal deep convolutional neural networks (CNNs) due to the following aspects:
In GANs there are two networks, the generator G and the discriminator D, that need to be trained, so we have to deal with a higher complexity.
There may exist a mismatch between the discriminator D and the generator G, which leads to the mode collapse problem.
The cost function may fail to capture the difference between the distributions of the training samples and the generated data.
To deal with the above problems, various solutions have been proposed to improve the convergence of GANs. The main ideas include:
To reduce the complexity: this is achieved by adding constraints to the network structure. Examples of this strategy include InfoGAN and LAPGAN, which reduce the complexity of the generator by introducing constraints on clusters or subspaces of the generated data.
To improve the cost function: the most successful example of this class is WGAN, which introduced the Wasserstein distance, with which the convergence problem of GANs is largely solved. With the Wasserstein distance, the difference between the distributions of the training data and the generated data can be reliably captured, and the mismatch between the training of D and G is no longer a serious problem.
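The key property, that the Wasserstein distance keeps reporting a meaningful difference even when the two distributions barely overlap, can be checked numerically. The sketch below uses the closed-form 1-D empirical Wasserstein-1 distance between sorted samples (a standard identity, not WGAN's network-based estimator) on shifted Gaussians:

```python
import numpy as np

rng = np.random.default_rng(1)

def w1(a, b):
    # Exact empirical Wasserstein-1 distance in 1-D for equal-size samples:
    # optimal transport matches sorted order statistics.
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

real = rng.standard_normal(10_000)
for shift in (0.0, 2.0, 8.0):
    fake = rng.standard_normal(10_000) + shift
    # W1 grows linearly with the shift, even after the supports have
    # essentially separated.
    print(round(w1(real, fake), 1))
```

By contrast, the Jensen-Shannon-type criterion of the original GAN saturates once the supports separate, so its training signal vanishes exactly when the generator is far from the data.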
II-A The geometry of GANs
In order to analyze AlphaGo Zero, here we introduce a geometric picture of GANs, which provides an intuitive understanding of GANs.
In the language of the geometry of deep learning, CNNs and ResNets are all curves in the space of transformations. Accordingly, GANs are represented as a two-segment curve, since there are two networks, G and D, in a GAN. From the geometric point of view, the reasons that GANs are difficult to train can be understood as follows:
The higher system complexity of GANs corresponds to the greater length of the two-segment curve to be found by the training process.
The essential goal of GANs is to train the generative model G on the training data. This means that it is preferable for the information from the training data to be fed directly to G. Instead, in GANs the information flow from the training data samples to G has to first pass through the discriminator D. Obviously the training of G is highly dependent on D, so that a balance between the training of G and D usually needs to be carefully maintained.
The longer information flow path also leads to serious information loss. For example, before the Wasserstein distance was introduced, the difference between the distributions of the training data and the generated data could easily be lost, so that convergence might fail.
From this intuitive geometric picture, training a GAN means finding a two-segment curve connecting the input space of G and the decision output space of D while keeping the curve passing through a neighbourhood of the training data in an elegant way, i.e., without mode collapse. But the information flow pathway shows that we cannot directly see whether the curve passes through the neighbourhood of the training samples. Instead we can only make an evaluation at the end point of the curve, the output of D.
Besides this, since GANs are usually based on CNNs or ResNets, GANs will also benefit from strategies that improve the convergence performance of CNNs and ResNets. For example, from the geometric point of view, spectral normalization of GANs can be understood as setting constraints on the Riemannian metric of the transformation manifold to control the curvature, so that geodesic shooting, like SGD, becomes more stable. For more details on the geometric picture of deep learning, please refer to Dong et al. In this paper we will focus only on the structure of AlphaGo Zero.
Accordingly, to improve the performance of GANs, we can:
Reduce the complexity of GANs by setting constraints on the structure of the networks, or equivalently by reducing the possible shapes of the curves.
Directly feed the information from the training data to G, so that the information loss problem is mitigated.
Find a way to balance the training of G and D to avoid the information loss and mode collapse problems.
In the next section, we will show that AlphaGo Zero can be understood as a specially designed GAN whose structure naturally fulfills the above conditions. We claim that, from the structural point of view, this is the reason that AlphaGo Zero shows an excellent convergence property.
III AlphaGo Zero as GAN
According to Silver et al., AlphaGo Zero combines the original value and policy networks into a single ResNet f that computes the value v and policy p for any state s of the game as (p, v) = f(s). The neural network is trained by a self-play reinforcement learning algorithm that uses MCTS to play each move. In each position s, an MCTS search guided by f, working as a policy improvement operator, is executed to generate a stronger move policy π. Self-play uses the improved policy π to select each move, and the game winner z is regarded as a sample of the value, i.e. a policy evaluation operator. The reinforcement learning algorithm uses these operators repeatedly to improve the policy: the network parameters are updated to make the value and policy (p, v) match the improved policy and self-play winner (π, z). The updated network parameters are then used in the next iteration to make the search stronger, until convergence.
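The parameter update in this loop minimizes, per Silver et al., a loss of the form (z - v)² - π·log p + c‖θ‖², matching the network's value to the game outcome and its policy to the search policy. A direct transcription (variable names and the stabilizing epsilon are ours):

```python
import numpy as np

def alphazero_loss(z, v, pi, p, theta, c=1e-4):
    """Per-position AlphaGo Zero training loss: squared value error
    + cross-entropy to the MCTS policy + L2 weight regularization."""
    value_term = (z - v) ** 2
    policy_term = -np.sum(pi * np.log(p + 1e-12))  # small epsilon for stability
    reg_term = c * np.sum(theta ** 2)
    return value_term + policy_term + reg_term
```

When the network already matches the search output (v = z and p equal to a one-hot π), the loss is essentially zero; any mismatch between (p, v) and (π, z) raises it, which is exactly the "make the network match the improved policy and winner" step described above.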
Now we can describe AlphaGo Zero in the language of GANs as follows.
The discriminator D is a cascaded series of the same network f, connecting from the first move to the end of the game.
The generator G is a cascaded series of the MCTS-improved policies π guided by f, which generates the self-play data.
From a graphical model point of view, the MCTS-enhanced policy can be roughly understood as the result of nonparametric belief propagation (NBP) on a tree.
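The sense in which the search improves on the raw network policy can be seen even in a depth-one toy problem. The sketch below runs PUCT-style simulations against a fixed and deliberately misleading prior; the prior, true action values and constants are invented for illustration, and the value network is replaced by a noisy oracle:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MCTS as a policy improvement operator (depth-1 tree):
p = np.array([0.5, 0.3, 0.2])        # network prior (slightly wrong)
q_true = np.array([0.1, 0.2, 0.6])   # action 2 is actually best

N = np.zeros(3)   # visit counts
W = np.zeros(3)   # accumulated values
c_puct = 1.0
for _ in range(400):
    Q = np.where(N > 0, W / np.maximum(N, 1), 0.0)
    U = c_puct * p * np.sqrt(N.sum() + 1) / (1 + N)   # PUCT exploration bonus
    a = int(np.argmax(Q + U))
    N[a] += 1
    W[a] += q_true[a] + 0.05 * rng.standard_normal()  # noisy value estimate

pi = N / N.sum()          # improved policy from visit counts (tau = 1)
print(int(np.argmax(pi)))  # → 2: the search corrects the prior
```

The visit-count policy π concentrates on the action the prior underrates, which is exactly the policy improvement role that MCTS plays inside AlphaGo Zero.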
So we can establish a GAN structure on the AlphaGo Zero system. We call it AlphaGo GAN, since the other versions of AlphaGo can also be regarded as GANs with minor modifications. It should be noted that in this GAN, both the training data and the generated data are produced by the generator during the self-play procedure.
III-A Demystifying AlphaGo Zero as a GAN
We can now check whether the AlphaGo GAN fulfills the conditions for a good convergence performance of GANs.
Complexity: Both the discriminator D and the generator G have a similar repeated structure. So although the lengths of G and D are huge, the complexity is restricted roughly to the size of the policy and value network f.
Information flow and information loss: In general GANs, the information of the training data can only be fed to G through the output of D; that is, G is trained on the output of D. But in AlphaGo GANs, G is not updated by data-based training. Instead it is directly updated by running an MCTS, or an NBP, based on the information from D. Note that the update of G depends not only on the final output of D but also on the intermediate information of D, i.e. the output of f at every move of the game. Obviously the information of the self-play generated training data and the information of D can be fed to G efficiently.
Mismatch between G and D: It has been indicated that a mismatch between G and D will lead to either slow convergence or mode collapse. In AlphaGo GANs, G is an NBP-enhanced version of D, so the match between G and D is guaranteed.
Training data and generated adversarial data: In the two-segment curve picture of GANs, we require that the generated data pass through the neighbourhood of the training data. In a general GAN, this can only be verified by checking the outputs of D on the training data and the generated data. That is to say, only by checking the distributions of the outputs of the discriminator on the training data and the generated data can we judge whether the training data and the generated data have the same distribution. In AlphaGo Zero, all the data are generated from self-play using the same policy, so the generated data naturally fall in the neighbourhood of the training data. In other words, the winner's and the loser's moves are based on the same knowledge and are just samples from the same distribution.
So we can easily see that AlphaGo GANs fulfill the conditions for a good GAN. It is not surprising that AlphaGo Zero shows a good convergence property.
Based on the GAN structure of AlphaGo Zero, we can then explain the following observations on AlphaGo Zero.
Why AlphaGo Zero converges: The good AlphaGo GAN structure is only one reason for the convergence of AlphaGo Zero. We also have to assume that the problem itself, the game of Go, holds an elegant structure such that convergence can be achieved. This may be a hint that the success of AlphaGo Zero should not be regarded as a universal phenomenon, since the convergence is highly dependent on the problem itself.
Why human knowledge deteriorates its performance: It has been observed that pre-training using human knowledge can result in worse performance. In GAN language, such pre-training leads to a human-knowledge-biased strong policy, i.e. an over-strong discriminator. It is well known that an over-strong discriminator leads to deteriorated convergence. In the language of NBP, the discriminator is so strong that the NBP-based generator cannot further enhance or shift it.
In this work we understand AlphaGo Zero as a GAN structure, which we call AlphaGo GAN. Combining this with the geometric picture of deep learning, we show that AlphaGo Zero can be analyzed as a GAN with a special structure, which fulfills the conditions for good GAN convergence. We then conclude that the convergence of AlphaGo Zero is a joint result of both the special structure of the game of Go and the structure of AlphaGo GAN. The success of AlphaGo Zero is therefore not mysterious, but it is not safe to claim that it can be generalized to other applications.
-  D. Silver, J. Schrittwieser, K. Simonyan, et al. Mastering the game of Go without human knowledge. Nature, 2017.
-  M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. arXiv:1701.07875, 2017.
-  X. Dong, J. S. Wu, and L. Zhou. Why deep learning works? — the geometry of deep learning. arXiv:1710.10784, 2017.
-  Anonymous authors. Spectral normalization for generative adversarial networks. ICLR 2018 under review, 2017.