Explaining AlphaGo: Interpreting Contextual Effects in Neural Networks

01/08/2019 ∙ by Zenan Ling, et al. ∙ 10

In this paper, we propose to disentangle and interpret contextual effects that are encoded in a pre-trained deep neural network. We use our method to explain the gaming strategy of the alphaGo Zero model. Unlike previous studies that visualized image appearances corresponding to the network output or a neural activation only from a global perspective, our research aims to clarify how a certain input unit (dimension) collaborates with other units (dimensions) to constitute inference patterns of the neural network and thus contribute to the network output. The analysis of local contextual effects w.r.t. certain input units is of special value in real applications. Explaining the logic of the alphaGo Zero model is a typical application. In experiments, our method successfully disentangled the rationale of each move during the Go game.




1 Introduction

Interpreting the decision-making logic hidden inside neural networks has become an active research direction in recent years. The visualization of neural networks and the extraction of pixel-level input-output correlations are two typical methodologies. However, previous studies usually interpret the knowledge inside a pre-trained neural network from a global perspective. For example, [18, 15, 11] mined input units (dimensions or pixels) that the network output is sensitive to; [3] visualized receptive fields of filters in intermediate layers; [34, 16, 25, 6, 7, 21] illustrated image appearances that maximized the score of the network output, a filter’s response, or a certain activation unit in a feature map.

However, instead of visualizing the entire appearance that is responsible for a network output or an activation unit, we are more interested in the following questions.

  • How does a local input unit contribute to the network output? Here, we can vectorize the input of the network into a high-dimensional vector, and we treat each dimension as a specific “unit” without ambiguity. As we know, a single input unit is usually not informative enough to make independent contributions to the network output. Thus, we need to clarify which other input units the target input unit collaborates with to constitute inference patterns of the neural network, so as to pass information to high layers.

  • Can we quantitatively measure the significance of the above contextual collaborations between the target input unit and its neighboring units?


Therefore, given a pre-trained convolutional neural network (CNN), we propose to disentangle contextual effects w.r.t. certain input units.

Figure 1: Explaining the alphaGo model. Given the state of the Go board and the next move, we use the alphaGo model to explain the rationale of the move. We first estimate a rough region of contextual collaborations w.r.t. the current move by distilling knowledge from the value net to student nets that receive different regions of the Go board as inputs. Then, given a student net, we analyze fine-grained contextual collaborations within its region of the Go board. In this figure, we use a board state from a real Go game between humans for clarity.

As shown in Fig. 1, we design two methods to interpret contextual collaborations at different scales, which are agnostic to the structure of CNNs. The first method estimates a rough region of contextual collaborations, i.e. clarifying whether the target input unit mainly collaborates with a few neighboring units or most units of the input. This method distills knowledge from the pre-trained network into a mixture of local models (see Fig. 2), where each model encodes contextual collaborations within a specific input region to make predictions. We hope that the knowledge-distillation strategy can help people determine quantitative contributions from different regions. Then, given a model for local collaborations, the second method further analyzes the significance of detailed collaborations between each pair of input units, when we use the local model to make predictions on an image.

Application, explaining the alphaGo Zero model: The quantitative analysis of contextual collaborations w.r.t. a local input unit is of special value in some tasks. For example, explaining the alphaGo model [23, 8] is a typical application.

The alphaGo model contains a value network to evaluate the current state of the game: a high output score indicates a high probability of winning. As we know, the contribution of a single move (i.e. placing a new stone on the Go board) to the output score during the game depends on contextual shapes on the Go board. Thus, disentangling explicit contextual collaborations that contribute to the output of the value network is important to understand the logic of each new move hidden in the alphaGo model.

More crucially, in this study, we explain the alphaGo Zero model [8], which extends the scope of interest of this study from diagnosing feature representations of a neural network to a more appealing issue: letting self-improving AI teach people new knowledge. The alphaGo Zero model is pre-trained via self-play without receiving any prior knowledge from human experience as supervision. In this way, all extracted contextual collaborations represent the automatically learned intelligence, rather than human knowledge.

As demonstrated in well-known Go competitions between the alphaGo and human players [2, 1], the automatically learned model sometimes made decisions that could not be explained by existing gaming principles. The visualization of contextual collaborations may provide new knowledge beyond people’s current understanding of the Go game.

Contributions of this paper can be summarized as follows.
(i) In this paper, we focus on a new problem, i.e. visualizing local contextual effects in the decision-making of a pre-trained neural network w.r.t. a certain input unit.
(ii) We propose two new methods to extract contextual effects via diagnosing feature representations and knowledge distillation.
(iii) We have combined the two proposed methods to explain the alphaGo Zero model, and experimental results have demonstrated the effectiveness of our methods.

2 Related work

Understanding feature representations inside neural networks has become an active research direction in recent years. Related studies include 1) the visualization and diagnosis of network features, 2) disentangling or distilling network feature representations into interpretable models, and 3) learning neural networks with disentangled and interpretable features in intermediate layers.

Network visualization: Instead of analyzing network features from a global view [31, 20, 17], [3] defined six types of semantics for middle-layer feature maps of a CNN, i.e. objects, parts, scenes, textures, materials, and colors. Usually, each filter encodes a mixture of different semantics and is thus difficult to explain.

Visualization of filters in intermediate layers is the most direct method to analyze the knowledge hidden inside a neural network. [34, 16, 25, 6, 33, 5, 35] showed the appearance that maximized the score of a given unit. [6] used up-convolutional nets to invert CNN feature maps to their corresponding images.

Pattern retrieval: Some studies retrieved certain units from intermediate layers of CNNs that were related to certain semantics, although the relationship between a certain semantics and each neural unit was usually not convincing enough. The retrieved units are usually treated in a way similar to conventional mid-level features [26] of images. [38, 39] selected units from feature maps to describe “scenes”. [24] discovered objects from feature maps.

Model diagnosis and distillation: Model-diagnosis methods, such as LIME [18], SHAP [15], influence functions [12], gradient-based visualization methods [7, 21], and [13], extracted image regions that were responsible for network outputs. [30, 37] distilled knowledge from a pre-trained neural network into explainable models to interpret the logic of the target network. Such distillation-based network explanation is related to the first method proposed in this paper. However, unlike previous studies that distilled knowledge into explicit visual concepts, the use of distillation to disentangle local contextual effects has not been explored before.

Figure 2: Division of lattices for two types of student nets. We distill knowledge from the value net into a mixture of four/nine student nets to approximate decision-making logic of the value net.

Learning interpretable representations: A new trend is to learn networks with meaningful feature representations in intermediate layers [10, 27, 14] in a weakly-supervised or unsupervised manner. For example, capsule nets [19] and interpretable RCNN [32] learned interpretable middle-layer features. InfoGAN [4] and β-VAE [9] learned meaningful input codes of generative networks. [36] developed a loss to push each middle-layer filter towards the representation of a specific object part during the learning process without given part annotations.

All the above related studies mainly focused on semantic meanings of a filter, an activation unit, or a network output. In contrast, our work is the first to analyze quantitative contextual effects w.r.t. a specific input unit during the inference process. Clarifying explicit mechanisms of how an input unit contributes to the network output is of special value in applications.

3 Algorithm

In the following two subsections, we will introduce two methods that extract contextual collaborations w.r.t. a certain input unit from a CNN at different scales. Then, we will introduce the application that uses the proposed methods to explain the alphaGo Zero model.

3.1 Determining the region of contextual collaborations w.r.t. an input unit

Since the input feature usually has a huge number of dimensions (units), it is difficult to accurately discover a few input units that collaborate with a target input unit. Therefore, it is important to first approximate the rough region of contextual collaborations before the unit-level analysis of contextual collaborations, i.e. clarifying in which regions contextual collaborations are contained.

Given a pre-trained neural network, an input sample, and a target unit of the sample, we propose a method that uses knowledge distillation to determine the region of contextual collaborations w.r.t. the target input unit. Let $I$ denote the input feature (e.g. an image or the state in a Go board). Note that input features of most CNNs can be represented as a tensor $I\in\mathbb{R}^{H\times W\times D}$, where $H$ and $W$ indicate the height and the width of the input, respectively; $D$ is the channel number. We clip different lattices (regions) $\Lambda_1,\Lambda_2,\ldots,\Lambda_N$ from the input tensor, and input units within the $i$-th lattice are given as $I_{\Lambda_i}\in\mathbb{R}^{h\times w\times D}$, $h\le H$, $w\le W$. Different lattices overlap with each other.

The core idea is that we use a mixture of models to approximate the function of the given pre-trained neural network (namely the teacher net), where each model is a student net and uses input information within a specific lattice $I_{\Lambda_i}$ to make predictions.

$$y \approx \sum_{i=1}^{N}\alpha_i\, y_i \qquad (1)$$

where $y$ and $y_i$ denote the output of the pre-trained teacher net and the output of the $i$-th student net $f_i$, respectively. $\alpha_i$ is a scalar weight, which depends on the input $I$. Because different lattices within the input are not equally informative w.r.t. the target task, input units within different lattices make different contributions to the final network output.

More crucially, given different inputs, the importance of the same lattice may also change. For example, as shown in [21], the head appearance is the dominating feature in the classification of animal categories. Thus, if a lattice corresponds to the head, then this lattice will contribute more than other lattices, thereby having a large weight $\alpha_i$. Therefore, our method estimates a specific weight for each input $I$, i.e. $\alpha_i$ is formulated as a function of $I$ (which will be introduced later).

Significance of contextual collaborations: Based on the above equation, the significance of contextual collaborations within each lattice $\Lambda_i$ w.r.t. an input unit can be measured as $s_i$.

$$\Delta y \approx \sum_{i=1}^{N} s_i, \qquad s_i = \alpha_i\,\Delta y_i \qquad (2)$$

where we revise the value of the target unit in the input and check the change of network outputs, $\Delta y$ and $\Delta y_i$. If contextual collaborations w.r.t. the target unit mainly localize within the $i$-th lattice $\Lambda_i$, then $s_i$ can be expected to contribute the most to the change of $y$.

We conduct two knowledge-distillation processes to learn the student nets and a model for determining $\{\alpha_i\}$, respectively.

Student nets: The first process distills knowledge from the teacher net to each student net $f_i$ with parameters $\theta_i$ based on the distillation loss $\min_{\theta_i}\|y_I-y_{i,I}\|^2$, where the subscript $I$ indicates the output for the input $I$. Considering that $I_{\Lambda_i}$ only contains partial information of $I$, we do not expect $y_{i,I}$ to reconstruct $y_I$ without any errors.

Distilling knowledge to weights: Then, the second distillation process estimates a set of weights $\alpha=[\alpha_1,\alpha_2,\ldots,\alpha_N]$ for each specific input $I$. We use the following loss to learn another neural network $g$ with parameters $\theta_g$ to infer the weights, $\alpha=g(I)$.

$$\min_{\theta_g}\Big\|y_I-\sum_{i=1}^{N}\alpha_i\,y_{i,I}\Big\|^2 \qquad (3)$$
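The mixture of student nets and the per-lattice significance described in this subsection can be illustrated with a minimal Python sketch. Everything named here is hypothetical: the lattice encoding `(top, left, height, width)`, the toy student function, and the fixed weights are stand-ins for the learned student nets and the learned weight network. The sketch also assumes that the weights inferred for the original input are reused after the target unit is revised, which the text does not state.

```python
from typing import Callable, List, Sequence, Tuple

Grid = List[List[float]]
Lattice = Tuple[int, int, int, int]  # (top, left, height, width); hypothetical encoding


def crop(grid: Grid, lattice: Lattice) -> Grid:
    """Return the input units of `grid` that fall inside `lattice`."""
    top, left, height, width = lattice
    return [row[left:left + width] for row in grid[top:top + height]]


def mixture_output(grid: Grid, lattices: Sequence[Lattice],
                   students: Sequence[Callable[[Grid], float]],
                   weights: Sequence[float]) -> float:
    """Weighted sum of student outputs, used to approximate the teacher output."""
    return sum(a * f(crop(grid, lat))
               for a, f, lat in zip(weights, students, lattices))


def lattice_significance(grid: Grid, revised: Grid, lattices: Sequence[Lattice],
                         students: Sequence[Callable[[Grid], float]],
                         weights: Sequence[float]) -> List[float]:
    """Weighted change of each student output when the target unit is revised.

    Assumption: the same weights are used before and after the revision,
    so only the student outputs change.
    """
    return [a * (f(crop(grid, lat)) - f(crop(revised, lat)))
            for a, f, lat in zip(weights, students, lattices)]


def toy_student(patch: Grid) -> float:
    """Stand-in for a trained student net: sums the units it can see."""
    return sum(sum(row) for row in patch)


# Toy example: a 4x4 single-channel input and two overlapping 3x3 lattices.
board = [[0.0] * 4 for _ in range(4)]
board[0][0] = 1.0                        # the target unit is active
revised = [row[:] for row in board]
revised[0][0] = 0.0                      # the target unit is removed
lattices = [(0, 0, 3, 3), (1, 1, 3, 3)]  # only the first one covers the target
students = [toy_student, toy_student]
weights = [0.5, 0.5]                     # would be inferred by the weight network

scores = lattice_significance(board, revised, lattices, students, weights)
print(scores)  # [0.5, 0.0]
```

In this toy case only the first lattice contains the target unit, so it receives the whole significance score, which is the behaviour the region-level analysis relies on.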


3.2 Fine-grained contextual collaborations w.r.t. an input unit

In the above subsection, we introduced a method to distill knowledge of contextual collaborations into student nets of different regions. Given a student net, in this subsection, we develop an approach to disentangle from the student net explicit contextual collaborations w.r.t. a specific input unit $p$, i.e. identifying which input unit $q$ collaborates with $p$ to compute the network output.

We can consider a student net as a cascade of functions of $n$ layers, i.e. $x^{(l)}=\phi_l(x^{(l-1)})$ (or $x^{(l)}=\phi_l(x^{(l-1)})+x^{(l-m)}$ for skip connections), where $x^{(l)}$ denotes the output feature of the $l$-th layer. In particular, $x^{(0)}$ and $x^{(n)}$ indicate the input and output of the network, respectively. We only focus on a single scalar output of the network (we may handle different output dimensions separately if the network has a high-dimensional output). If the sigmoid/softmax layer is the last layer, we use the score before the softmax/sigmoid operation as $x^{(n)}$ to simplify the analysis.

3.2.1 Preliminaries, the estimation of quantitative contribution

As preliminaries of our algorithm, we extend the technique of [22] to estimate the quantitative contribution of each neural activation in a feature map to the final prediction. We use $C_{x^{(l)}}$ to denote the contribution distribution of neural activations on the $l$-th layer $x^{(l)}$. The score of the $i$-th element $C_{x^{(l)}_i}$ denotes the ratio of the unit $x^{(l)}_i$'s score contribution w.r.t. the entire network output score. Because $x^{(n)}$ is the scalar network output, it has a unit contribution $C_{x^{(n)}}=1$. Then, we introduce how to back-propagate contributions to feature maps in low layers.

The method of contribution propagation is similar to network visualization based on gradient back-propagation [16, 33]. However, contribution propagation reflects a more objective distribution of numerical contributions over $x^{(l)}$, instead of biasedly boosting impacts of the most important activations.

Without loss of generality, in this paragraph, we use $x'=\phi(x)$ to simplify the notation of the function of a certain layer. If the layer is a conv-layer or a fully-connected layer, then we can represent the convolution operation for computing each elementary activation score $x'_j$ of $x'$ in a vectorized form $x'_j=\sum_i W_{ij}x_i+b_j$ (see the appendix for details). We consider $W_{ij}x_i$ as the numerical contribution of $x_i$ to $x'_j$. Thus, we can decompose the entire contribution of $x'_j$, $C_{x'_j}$, into elementary contributions of $x_i$, i.e. $C_{x_i\to x'_j}$, which satisfies $C_{x_i\to x'_j}\propto W_{ij}x_i$ (see the appendix for details). Then, the entire contribution of $x_i$ is computed as the sum of elementary contributions from all $x'_j$ in the above layer, i.e. $C_{x_i}=\sum_j C_{x_i\to x'_j}$.

A cascade of a conv-layer and a batch-normalization layer can be rewritten in the form of a single conv-layer, where normalization parameters are absorbed into the conv-layer (see the appendix for details). For skip connections, a neural unit may receive contributions from different layers, and its entire contribution is the sum of the contributions received through each of these connections. If the layer is a ReLU layer or a Pooling layer, the contribution propagation has the same formulation as gradient back-propagation through those layers (see the appendix for details).
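As a rough illustration of contribution propagation through one conv or fully-connected layer followed by a ReLU, the sketch below applies the rule described in the appendix: an output that is not positive passes no contribution, and otherwise each input receives a share proportional to its weighted value, divided by the output score when the bias is non-negative and by the output score minus the bias when the bias is negative. The function name, argument layout (`w[i][j]` connects input `i` to output `j`), and toy numbers are assumptions made for this example, not the authors' implementation.

```python
from typing import List


def propagate_contribution(x: List[float], w: List[List[float]],
                           b: List[float], c_out: List[float]) -> List[float]:
    """Distribute the contributions of a layer's outputs over its inputs.

    x      -- input activations of the layer
    w      -- w[i][j] is the weight from input i to output j
    b      -- bias of each output
    c_out  -- contribution already assigned to each output
    """
    n_in, n_out = len(x), len(b)
    c_in = [0.0] * n_in
    for j in range(n_out):
        score = sum(w[i][j] * x[i] for i in range(n_in)) + b[j]
        if score <= 0:
            continue  # blocked by the following ReLU, so nothing to pass back
        # Equals `score` when b[j] >= 0 and `score - b[j]` when b[j] < 0.
        denom = max(score, score - b[j])
        for i in range(n_in):
            c_in[i] += c_out[j] * w[i][j] * x[i] / denom
    return c_in


# One output with a positive bias: the inputs share (score - b) / score of it.
print(propagate_contribution([1.0, 2.0], [[1.0], [1.0]], [1.0], [1.0]))
# One output with a negative bias: the inputs share the whole contribution.
print(propagate_contribution([1.0, 2.0], [[1.0], [1.0]], [-1.0], [1.0]))
```

Calling this function layer by layer, starting from a unit contribution at the scalar network output, would yield a contribution map over the input units under these assumptions.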

3.2.2 The extraction of contextual collaborations

As discussed in [3], each neural activation of a middle-layer feature can be considered as the detection of a mid-level inference pattern. All input units must collaborate with neighboring units to activate some middle-layer feature units, in order to pass their information to the network output.

Therefore, in this research, we develop a method to
1. determine which mid-level patterns (or which neural activations $x_i$) the target unit $p$ constitutes;
2. clarify which input units $q$ help the target unit $p$ to constitute the mid-level patterns;
3. measure the strength of the collaboration between $q$ and $p$.

Let $x$ and $\bar{x}$ denote the feature map of a certain conv-layer when the network receives input features with the target unit $p$ being activated and the feature map generated without $p$ being activated, respectively. In this way, we can use $x-\bar{x}$ to represent the absolute effect of $p$ on the feature map $x$. The overall contribution of the $i$-th neural unit $C_{x_i}$ depends on the activation score $x_i$, $C_{x_i}\propto\max\{x_i,0\}$, where $\max\{x_i,0\}$ measures the activation strength used for inference. The proportion of the contribution that is affected by the target unit $p$ can be roughly formulated as $\tilde{C}_{x_i}$.

$$\tilde{C}_{x_i}=\begin{cases}C_{x_i}\cdot\dfrac{x_i-\bar{x}_i}{x_i}, & x_i>0\\ 0, & x_i\le 0\end{cases} \qquad (4)$$

where $C_{x_i}=0$ and thus $\tilde{C}_{x_i}=0$ if $x_i\le 0$, because negative activation scores of a conv-layer cannot pass information through the following ReLU layer ($x$ is not the feature map of the last conv-layer before the network output).

In this way, $\tilde{C}_{x}$ highlights a few mid-level patterns (neural activations) related to the target unit $p$. $\tilde{C}_{x_i}$ measures the contribution proportion that is affected by the target unit $p$. We can use $\tilde{C}_{x}$ to replace $C_{x}$ and use techniques in Section 3.2.1 to propagate $\tilde{C}_{x}$ back to the input units $x^{(0)}$. Thus, $\tilde{C}_{x^{(0)}}$ represents a map of fine-grained contextual collaborations w.r.t. $p$. Each element in the map is given as an input unit $q$'s collaboration with $p$.

We can understand the proposed method as follows. The relative activation change $\frac{x_i-\bar{x}_i}{x_i}$ can be used as a weight to evaluate the correlation between $p$ and the $i$-th activation unit (inference pattern). In this way, we can extract input units that strongly influence $p$'s inference patterns, rather than units that affect all inference patterns. Note that both $p$ and $q$ may either increase or decrease the value of $x_i$. This means that the contextual unit $q$ may either boost $p$'s effects on the inference pattern, or weaken $p$'s effects.
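The re-weighting step described above can be sketched in a few lines of Python. This copy of the text does not preserve Equation (4), so the exact form used here is an assumption: each unit's contribution is scaled by its relative activation change (the activation with the target unit minus the activation without it, divided by the activation with it), and units with non-positive activation receive zero. The function and argument names are hypothetical.

```python
from typing import List


def affected_contribution(contrib: List[float], act_with: List[float],
                          act_without: List[float]) -> List[float]:
    """Scale each unit's contribution by its relative activation change.

    contrib      -- contribution of each unit in a conv-layer feature map
    act_with     -- activations when the target input unit is present
    act_without  -- activations when the target input unit is removed
    """
    result = []
    for c, a, a_bar in zip(contrib, act_with, act_without):
        if a <= 0:
            result.append(0.0)          # cannot pass the following ReLU
        else:
            result.append(c * (a - a_bar) / a)
    return result


# Unit 0 halves without the target, unit 1 is unaffected, unit 2 is inactive.
print(affected_contribution([0.5, 0.5, 0.0], [2.0, 1.0, -1.0], [1.0, 1.0, 0.5]))
# [0.25, 0.0, 0.0]
```

The resulting vector would then replace the ordinary contributions of that layer and be propagated back to the input with the propagation rule of Section 3.2.1. A negative entry would indicate that the target unit suppresses the corresponding pattern.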

3.3 Application: explaining the alphaGo Zero model

We use the ELF OpenGo [29, 28] as the implementation of the alphaGo Zero model. We combine the above two methods to jointly explain each move’s logic hidden in the value net of the alphaGo Zero model during the game. As we know, the alphaGo Zero model contains a value net, policy nets, and the module of the Monte-Carlo Tree Search (MCTS). Generally speaking, the superior performance of the alphaGo model greatly relies on the enumeration power of the policy net and the MCTS, but the value net provides the most direct information about how the model evaluates the current state of the game. Therefore, we explain the value net, rather than the policy net or the MCTS. In the ELF OpenGo implementation, the value net is a residual network with 20 residual blocks, each containing two conv-layers. We take the scalar output before the final (sigmoid) layer as the target value to evaluate the current state on the Go board. (Footnote 3: The value net uses the current state, as well as seven most recent states, to output eight values for the eight states. To simplify the algorithm, we take the value corresponding to the current state as the target value.)

Given the current move of the game, our goal is to estimate unit-level contextual collaborations w.r.t. the current move. That is, we aim to analyze which neighboring stones and/or what global shapes help the current move influence the game. We distill knowledge from the value net to student networks to approximate contextual collaborations within different regions. Then, we estimate unit-level contextual collaborations based on the student net.

Determining local contextual collaborations: We design two types of student networks, which receive lattices at two different scales. In this way, we can conduct two distillation processes to learn neural networks that encode contextual collaborations at different scales.

As shown in Fig. 2, we have four student nets for lattices at the first scale. Except for the output, the four student nets have the same network structure as the value net. The four student nets share parameters in all layers. The input of a student net only has two channels corresponding to maps of white stones and black stones, respectively, on the Go board. We crop four overlapping lattices at the four corners of the Go board for both training and testing. Note that we rotate the board state within each lattice to make the top-left position correspond to the corner of the board, before we input it to the student net. The neural network that infers the lattice weights has the same settings as the value net. It receives a concatenation of the four lattices as the input and outputs four scalar weights for the four local student networks. We learn this weight network via knowledge distillation.

Student nets for lattices at the second scale have settings similar to those for the first scale. We divide the entire Go board into nine overlapping lattices. Nine student nets encode local knowledge from the nine local lattices. We learn another neural network, which uses a concatenation of the nine lattices to infer weights for the nine local lattices.

Finally, we select the most relevant lattice at each of the two scales, according to the significance of contextual collaborations within each lattice (Section 3.1), for explanation.
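The cropping of four overlapping corner lattices, together with the rotation that brings the board corner to the top-left position, can be sketched as follows. The lattice size is not preserved in this copy of the text, so it is a free parameter here and the demonstration uses a toy 5x5 board with 3x3 lattices. The choice of counter-clockwise quarter turns (rather than, say, flips) is also an assumption, as are all function names.

```python
from typing import List

Grid = List[List[int]]


def rot90_ccw(m: Grid) -> Grid:
    """Rotate a square grid by 90 degrees counter-clockwise."""
    n = len(m)
    return [[m[j][n - 1 - i] for j in range(n)] for i in range(n)]


def corner_lattices(board: Grid, size: int) -> List[Grid]:
    """Crop the four corner lattices and rotate each one so that the board
    corner sits at position [0][0].

    Returned order: top-left, top-right, bottom-right, bottom-left.
    """
    n = len(board)
    off = n - size
    # (top, left, number of counter-clockwise quarter turns)
    specs = [(0, 0, 0), (0, off, 1), (off, off, 2), (off, 0, 3)]
    result = []
    for top, left, turns in specs:
        patch = [row[left:left + size] for row in board[top:top + size]]
        for _ in range(turns):
            patch = rot90_ccw(patch)
        result.append(patch)
    return result


# Toy 5x5 board whose cells store their own index, with 3x3 lattices.
board = [[r * 5 + c for c in range(5)] for r in range(5)]
lattices = corner_lattices(board, 3)
print([lat[0][0] for lat in lattices])  # [0, 4, 24, 20], the four board corners
```

Because every lattice is presented in the same orientation, a single set of shared parameters can serve all four student nets, which matches the parameter sharing described above.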

Estimating unit-level contextual collaborations: In order to obtain fine-grained collaborations, we apply the method in Section 3.2.2 to explain the two student nets corresponding to the two selected relevant lattices. We also use our method to explain the value net. We compute a map of contextual collaborations for each neural network and normalize values in the map. We sum up the maps of the three networks to obtain the final map of contextual collaborations.

Figure 3: Significance of contextual collaborations w.r.t. the new black stone (the black star). Go players provided possible explanations for contextual collaborations. The red/blue color indicates a significant/insignificant contextual collaboration. Please see the appendix for more results.

More specifically, given a neural network, we use the feature of each conv-layer to compute the initial affected contribution in Equation (4) and propagate it back to the input to obtain a map of collaborations. We sum up maps based on the 1st, 3rd, 5th, and 7th conv-layers to obtain the collaboration map of the network.
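The normalize-and-sum step used to merge collaboration maps can be sketched as below. The text does not say which normalization is applied, so this example assumes division by the largest absolute value in each map; it also assumes that all maps have already been mapped back to the same board coordinates (for student nets this would mean undoing the crop and rotation). The function names are hypothetical.

```python
from typing import List

Map2D = List[List[float]]


def normalize_map(m: Map2D) -> Map2D:
    """Scale a map so that its largest absolute value is 1 (assumed scheme)."""
    peak = max(abs(v) for row in m for v in row)
    if peak == 0:
        return [row[:] for row in m]
    return [[v / peak for v in row] for row in m]


def sum_maps(maps: List[Map2D]) -> Map2D:
    """Normalize each map and add them element-wise."""
    rows, cols = len(maps[0]), len(maps[0][0])
    total = [[0.0] * cols for _ in range(rows)]
    for m in maps:
        normalized = normalize_map(m)
        for r in range(rows):
            for c in range(cols):
                total[r][c] += normalized[r][c]
    return total


# Two toy 2x2 maps, e.g. from two conv-layers or two networks.
combined = sum_maps([[[1.0, -2.0], [0.0, 4.0]],
                     [[2.0, 0.0], [0.0, 0.0]]])
print(combined)  # [[1.25, -0.5], [0.0, 1.0]]
```

The same helper could be applied twice under these assumptions: once over the maps from the selected conv-layers of one network, and once over the maps of the three networks.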

4 Experiments

In experiments, we distilled knowledge of the value network to student nets, and disentangled fine-grained contextual collaborations w.r.t. each new move. We compared the extracted contextual collaborations and human explanations for the new move to evaluate the proposed method.

4.1 Evaluation metric

In this section, we propose two metrics to evaluate the accuracy of the extracted contextual collaborations w.r.t. the new move. Note that considering the high complexity of the Go game, there is no exact ground-truth explanation for contextual collaborations. Different Go players usually have different analyses of the same board state. More crucially, as shown in competitions between the alphaGo and human players [2, 1], the knowledge encoded in the alphaGo was sometimes beyond humans’ current understanding of the Go game and could not be explained by existing gaming principles.

In this study, we compared the similarity between the extracted contextual collaborations and humans’ analysis of the new move. The extracted contextual collaborations were just rough explanations from the perspective of the alphaGo. We expected these collaborations to be close to, but not exactly the same as human understanding. More specifically, we invited Go players who had obtained four-dan grading rank to label contextual collaborations. To simplify the metric, Go players were asked to label a relative strength value of the collaboration between each stone and the target move (stone), no matter whether the relationship between the two stones was collaborative or adversarial. Considering the double-blind policy, the paper will introduce the Go players if the paper is accepted.

Let $\Omega$ be the set of existing stones except for the target stone on the Go board. $v^{*}_q$ denotes the labeled collaboration strength between each stone $q\in\Omega$ and the target stone. $v_q$ is referred to as the collaboration strength estimated by our method, which is computed from the final estimated collaboration value on the stone $q$. We normalized both collaboration strengths over the stones in $\Omega$, and computed the Jaccard similarity between the distribution of $v^{*}$ and the distribution of $v$ as the similarity metric.
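One way to compute a Jaccard similarity between two normalized strength distributions is the weighted (generalized) Jaccard index: the sum of element-wise minima divided by the sum of element-wise maxima. The text does not preserve the exact normalization or the exact Jaccard variant, so both choices below (normalizing absolute values to sum to one, and the min/max form) are assumptions for illustration only, as are the function names.

```python
from typing import List


def normalize_strengths(values: List[float]) -> List[float]:
    """Turn collaboration strengths into a distribution over stones.

    Assumption: absolute values are used and scaled to sum to one.
    """
    magnitudes = [abs(v) for v in values]
    total = sum(magnitudes)
    if total == 0:
        return [0.0 for _ in magnitudes]
    return [m / total for m in magnitudes]


def weighted_jaccard(labeled: List[float], estimated: List[float]) -> float:
    """Generalized Jaccard similarity between two strength vectors that are
    indexed by the same list of stones."""
    p = normalize_strengths(labeled)
    q = normalize_strengths(estimated)
    overlap = sum(min(a, b) for a, b in zip(p, q))
    union = sum(max(a, b) for a, b in zip(p, q))
    return overlap / union if union > 0 else 1.0


# Two stones: annotators rate them equally, the method favours the second.
print(weighted_jaccard([1.0, 1.0], [1.0, 3.0]))
```

Identical distributions give a similarity of 1, and distributions with disjoint support give 0, so the score is bounded and comparable across board states with different numbers of stones.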

In addition, considering the great complexity of the Go game, different Go players may annotate different contextual collaborations. Therefore, we also required Go players to provide a subjective rating for the extracted contextual collaborations of each board state, i.e. selecting one of the five ratings: 1-Unacceptable, 2-Problematic, 3-Acceptable, 4-Good, and 5-Perfect.

4.2 Experimental results and analysis

Fig. 3 shows the significance of the extracted contextual collaborations, as well as possible explanations for contextual collaborations, where the significance of each stone's contextual collaboration was reported as the absolute collaboration strength instead of the original collaboration score in experiments. Without loss of generality, let us focus on the winning probability of the black. Considering the complexity of the Go game, there may be two cases of a positive (or negative) value of the collaboration score. The simplest case is that when a white stone had a negative collaboration score, the white stone decreased the winning probability of the black. However, sometimes a white stone had a positive collaboration score. This may be because the white stone did not sufficiently exhibit its power due to its contexts. Since the white and the black usually had a very similar number of stones on the Go board, putting a relatively ineffective white stone in a local region also wasted the opportunity of winning advantages in other regions in the zero-sum game. Similarly, a black stone may also have either a positive or a negative collaboration score.

The Jaccard similarity between the extracted collaborations and the manually-annotated collaborations was 0.3633. Nevertheless, considering the great diversity of explaining the same game state, the average rating score that was made by Go players for the extracted collaborations was 3.7 (between 3-Acceptable and 4-Good). Please see the appendix for more results.

5 Conclusion and discussions

In this paper, we have proposed two typical methods for quantitative analysis of contextual collaborations w.r.t. a certain input unit in the decision-making of a neural network. Extracting fine-grained contextual collaborations to clarify why and how an input unit passes its information to the network output is of significant value in specific applications, but it has not been well explored before, to the best of our knowledge. In particular, we have applied our methods to the alphaGo Zero model, in order to explain the potential logic hidden inside the model that is automatically learned via self-play without human annotations. Experiments have demonstrated the effectiveness of the proposed methods.

Note that there is no exact ground-truth for contextual collaborations of the Go game, and how to evaluate the quality of the extracted contextual collaborations is still an open problem. As a pioneering study, we do not require the explanation to fit human logic exactly, because human logic is usually not the only correct explanation. Instead, we just aim to visualize contextual collaborations without manually pushing visualization results towards human-interpretable concepts. This is different from some previous studies of network visualization [16, 33] that added losses as the natural image prior, in order to obtain beautiful but biased visualization results. In the future, we will continue to cooperate with professional Go players to further refine the algorithm to visualize more accurate knowledge inside the alphaGo Zero model.


Supplementary materials for the contribution propagation

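Let $x'=x\otimes w+b$ denote the convolutional operation of a conv-layer. We can rewrite this equation in a vectorized form as $x'=W^{\top}x+b$, where $x$ and $x'$ are the vectorized input and output of the layer and $W$ is the corresponding weight matrix. For each output element $x'_j$, $x'_j=b_j+\sum_i W_{ij}x_i$. If the conv-layer is a fully-connected layer, then each element $W_{ij}$ corresponds to an element in $w$. Otherwise, $W$ is a sparse matrix, i.e. $W_{ij}=0$ if $x_i$ and $x'_j$ are too far away to be covered by the convolutional filter.

Thus, we can write $x'_j=b_j+\sum_i z_i$, $z_i=W_{ij}x_i$, to simplify the notation. Intuitively, we can propagate the contribution of $x'_j$ to its compositional elements $z_i$ based on their numerical scores. Note that we only consider the case of $x'_j>0$, because if $x'_j\le 0$, $x'_j$ cannot pass information through the ReLU layer, and we obtain $C_{x'_j}=0$ and thus $C_{x_i\to x'_j}=0$. In particular, when $b_j\ge 0$, all compositional scores $\{z_i\}$ just contribute an activation score $x'_j-b_j$, thereby receiving a total contribution of $C_{x'_j}\cdot\frac{x'_j-b_j}{x'_j}$. When $b_j<0$, we believe the contribution of $C_{x'_j}$ all comes from elements of $\{z_i\}$, and each element's contribution is given as $C_{x'_j}\cdot\frac{z_i}{x'_j-b_j}$. Thus, we get

$$C_{x_i\to x'_j}=C_{x'_j}\cdot\frac{W_{ij}\,x_i}{\max\{x'_j,\;x'_j-b_j\}}$$

When a batch-normalization layer follows a conv-layer, the function of the two cascaded layers can be written as

$$x'=\gamma\cdot\frac{(x\otimes w+b)-\mu}{\sigma}+\beta=x\otimes\Big(\frac{\gamma}{\sigma}\,w\Big)+\frac{\gamma\,(b-\mu)}{\sigma}+\beta$$

where $\mu$ and $\sigma$ denote the mean and the standard deviation used by the batch normalization, and $\gamma$ and $\beta$ denote its scale and shift parameters. Thus, we can absorb parameters for the batch normalization into the conv-layer, i.e. $w\leftarrow\frac{\gamma}{\sigma}\,w$ and $b\leftarrow\frac{\gamma\,(b-\mu)}{\sigma}+\beta$.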
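Absorbing batch normalization into the preceding layer is a standard identity, and it can be checked numerically. The sketch below uses a fully-connected layer in place of a convolution to stay short; the helper names, the `w[i][j]` layout, and the small `eps` term inside the square root are assumptions of this example rather than details taken from the paper.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def linear(x: List[float], w: Matrix, b: List[float]) -> List[float]:
    """out[j] = sum_i w[i][j] * x[i] + b[j]."""
    return [sum(w[i][j] * x[i] for i in range(len(x))) + b[j]
            for j in range(len(b))]


def batch_norm(z: List[float], gamma: List[float], beta: List[float],
               mean: List[float], var: List[float],
               eps: float = 1e-5) -> List[float]:
    """Inference-time batch normalization applied per output channel."""
    return [g * (v - m) / math.sqrt(s + eps) + be
            for v, g, be, m, s in zip(z, gamma, beta, mean, var)]


def fold_batch_norm(w: Matrix, b: List[float], gamma: List[float],
                    beta: List[float], mean: List[float], var: List[float],
                    eps: float = 1e-5) -> Tuple[Matrix, List[float]]:
    """Return weights and biases of a single layer equivalent to
    `linear` followed by `batch_norm`."""
    scale = [g / math.sqrt(s + eps) for g, s in zip(gamma, var)]
    w_folded = [[w_ij * scale[j] for j, w_ij in enumerate(row)] for row in w]
    b_folded = [(b_j - m) * sc + be
                for b_j, m, sc, be in zip(b, mean, scale, beta)]
    return w_folded, b_folded


# Compare the two-step computation with the folded single layer.
x = [1.0, 2.0]
w = [[1.0, 0.5], [2.0, -1.0]]
b = [0.1, -0.2]
gamma, beta = [2.0, 0.5], [0.3, 0.0]
mean, var = [1.0, -1.0], [4.0, 0.25]

reference = batch_norm(linear(x, w, b), gamma, beta, mean, var)
w_f, b_f = fold_batch_norm(w, b, gamma, beta, mean, var)
folded = linear(x, w_f, b_f)
print(reference, folded)
```

After folding, the contribution-propagation rule for a plain conv or fully-connected layer can be applied to the merged layer without any special handling of batch normalization.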

For ReLU layers and Pooling layers, the formulation of the contribution propagation is identical to the formulation for the gradient back-propagation, because the gradient back-propagation and the contribution propagation both pass information to neural activations that are used during the forward propagation.

More results

Considering the great complexity of the Go game, there do not exist ground-truth annotations for the significance of contextual collaborations. Different Go players may have different understanding of the same Go board state, thereby annotating different heat maps for the significance of contextual collaborations. More crucially, our results reflect the logic of the automatically-learned alphaGo Zero model, rather than the logic of humans.

Therefore, in addition to manual annotations of collaboration significance, we also require Go players to provide a subjective evaluation for the extracted contextual collaborations.

We compared the extracted contextual collaborations at different scales (the second, third, fourth, and fifth columns) with annotations made by Go players.


Contextual collaborations of local regions

We show the significance of contextual collaborations within a local lattice. The score reported for each lattice is its significance of contextual collaborations, as defined in Section 3.1.