Recent language models have shown an intriguing range of capabilities. Networks trained on a simple “next-word” prediction task are apparently capable of many other things, such as solving logic puzzles or writing basic code. (See Srivastava et al. (2022) for an encyclopedic list of such examples.) Yet how this type of performance emerges from sequence predictions remains a subject of current debate.
Some have suggested that training on a sequence modeling task is inherently limiting. The arguments range from philosophical (Bender and Koller, 2020) to mathematical (Merrill et al., 2021). A common theme is that seemingly good performance might result from memorizing “surface statistics,” i.e., a long list of correlations that do not reflect a causal model of the process generating the sequence. But relying on spurious correlations may lead to problems on out-of-distribution data (Bender et al., 2021; Floridi and Chiriatti, 2020).
On the other hand, some tantalizing clues suggest language models may do more than collect spurious correlations, building interpretable world models, that is, understandable models of the process producing the sequences they are trained on. Recent evidence suggests language models can develop internal representations for very simple concepts, such as color and direction (Abdou et al., 2021; Patel and Pavlick, 2022), or track boolean states during synthetic tasks (Li et al., 2021) (see our Related Work in section 6 for more detail).
The question remains, then, of how we might investigate the emergence of world models in more complex domains. One possibility comes from Toshniwal et al. (2021), who explore language models trained on chess move sequences. These models learn to predict legal moves with high accuracy. Furthermore, by analyzing predicted moves, the paper shows that the model appears to track the board state. The authors stop short, however, of investigating the form of any internal representations. Such an investigation will be the focus of this paper.
1.1 The game of Othello as testbed for interpretability
The observations of Toshniwal et al. (2021) suggest a new approach to studying the representations learned by sequence models. If we think of a board as the “world,” then games provide us with an appealing experimental testbed to explore world representations of moderate complexity. As our setting, we choose the popular game of Othello (Figure 1), which is simpler than chess. This setting allows us to investigate world representations in a highly controlled context, where both the task and sequence being modeled are synthetic and well-understood.
As a first step, we train a language model (a GPT variant we call Othello-GPT) to extend partial game transcripts (a list of moves made by players) with legal moves. The model has no a priori knowledge of the game or its rules. All it sees during training is a series of tokens derived from the game transcripts. Each token represents a tile where players place their discs. Note that we do not explicitly train the model to make strategically good moves or to win the game. Nonetheless, our model is able to generate legal Othello moves with high accuracy.
Our next step is to look for world representations that might be used by the network. In Othello, the “world” consists of the current board position. A natural question is whether, within the model, we can identify a representation of the board state involved in producing its next move predictions. To study this question, we train a set of probes, i.e., classifiers which allow us to infer the board state from the internal network activations. This type of probing has become a standard tool for analyzing neural networks (Alain and Bengio, 2016; Tenney et al., 2019; Belinkov, 2016).
Using this probing methodology, we find evidence for an emergent world representation. In particular, we show that a non-linear probe is able to predict the board state with high accuracy (section 3). (Linear probes, however, produce poor results.) This probe defines an internal representation of the board state. We then provide evidence that this representation plays a causal role in the network’s predictions. Our main tool is an intervention technique that modifies internal activations so that they correspond to counterfactual board states.
We also discuss how knowledge of the internal world model can be used as an interpretability tool. Using our activation-intervention technique, we create latent saliency maps, which provide insight into how the network makes a given prediction. These maps are built by performing attribution at a high-level setting (the board) rather than a low-level one (individual input tokens or moves).
To sum up, we present four contributions: (1) we provide evidence for an emergent world model in a GPT variant trained to produce legal moves in Othello; (2) we compare the performance of linear and non-linear probing approaches, and find that non-linear probes are superior in this context; (3) we present an intervention technique that suggests that, in certain situations, the emergent world model can be used to control the network’s behavior; and (4) we show how probes can be used to produce latent saliency maps to shed light on the model’s predictions.
2 “Language modeling” of Othello game transcripts
Our approach for investigating internal representations of language models is to narrow our focus from natural language to a more controlled synthetic setting. We are partly inspired by the fact that language models show evidence of learning to make valid chess moves simply by observing game transcripts in training data (Srivastava et al., 2022). We choose the game Othello, which is simpler than chess, but maintains a sufficiently large game tree to avoid memorization. Our strategy is to see what, if anything, a GPT variant learns simply by observing game transcripts, without any a priori knowledge of rules or board structure.
The game is played on an 8x8 board where two players alternate placing white or black discs on the board tiles. The object of the game is to have the majority of one’s color discs on the board at the end of the game. Othello makes a natural testbed for studying emergent world representations since the game tree is far too large to memorize, but the rules and state are significantly simpler than chess.
The following subsections describe how we train a system with no prior knowledge of Othello to predict legal moves with high accuracy. The system itself is not our end goal; instead, it serves as our object of study.
2.1 Datasets: “Championship” and “Synthetic”
We use two sets of training data for the system, which we call “championship” and “synthetic”. Each captures different objectives, namely data quality vs. quantity. While limited in size, championship data reflects strategic moves by expert human players. The synthetic data set is far larger, consisting of legal but otherwise random moves.
Our championship dataset is produced by collecting Othello championship games from two online sources (www.liveothello.com and www.ffothello.org), containing and games, respectively. They are combined and split randomly by into training and validation sets. The games in this dataset were produced by matches where human players presumably made moves with a strategic intent to win. Following this, we generate a synthetic dataset with million games for training and games for validation. We compute this dataset by uniformly sampling leaves from the Othello game tree. Its data distribution is different from the championship games, reflecting no strategy.
2.2 Model and Training
Our goal is to study how much Othello-GPT can learn from pure sequence information, so we provide as few inductive biases as possible. (Note the contrast with a system like AlphaZero (Silver et al., 2018), where the goal was to win highly competitive chess games.) We therefore use only sequential tile indices as input to our model. For example, A4 and H6 are indexed as the rd and the st word in our vocabulary, respectively. Each game (e.g., E3, D3… in Figure 1) is treated as a sentence tokenized with a vocabulary of 60 words, where each word corresponds to one of the 60 tiles on which players put discs, excluding the 4 tiles in the center.
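As an illustration of this tokenization, the sketch below builds a 60-word vocabulary and maps a transcript to token indices. The row-by-row ordering of tiles and the helper names `build_vocab` and `tokenize` are assumptions made for the example; the index order actually used by the model may differ.

```python
# Minimal sketch of the tile vocabulary (assumed ordering: A1, A2, ..., H8).
ROWS = "ABCDEFGH"
CENTER = {"D4", "D5", "E4", "E5"}  # occupied at the start, so never played


def build_vocab():
    """Return a dict mapping each of the 60 playable tiles to an index."""
    tiles = [f"{row}{col}" for row in ROWS for col in range(1, 9)]
    playable = [tile for tile in tiles if tile not in CENTER]
    return {tile: index for index, tile in enumerate(playable)}


def tokenize(transcript, vocab):
    """Map a list of tile names such as ["E3", "D3"] to integer tokens."""
    return [vocab[tile] for tile in transcript]


vocab = build_vocab()
tokens = tokenize(["E3", "D3"], vocab)
```

Nothing about board geometry is encoded here: the model only ever sees the integers in `tokens`.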
We trained an 8-layer GPT model (Radford et al., 2018, 2019; Brown et al., 2020) with an 8-head attention mechanism and a 512-dimensional hidden space. The training was performed in an autoregressive fashion. For each partial game $\{y_t\}_{t=0}^{T-1}$, the computation process starts from indexing a trainable word embedding consisting of the 60 vectors, each for one word, to get $\{x_t^0\}_{t=0}^{T-1}$. They are then sequentially processed by 8 multi-head attention layers. We denote the intermediate feature for the $t$-th token after the $l$-th layer as $x_t^l$. By employing a causal mask, only the features at the immediately preceding layer and earlier time steps, $x_{\le t}^{l-1}$, are visible to $x_t^l$. Finally, $x_{T-1}^8$ goes through a linear classifier to predict logits for $\hat{y}_T$. We minimize the cross-entropy loss between the ground-truth move and the predicted logits by gradient descent.
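The training objective can be sketched in a few lines. The code below is only an illustration of next-token cross-entropy on raw logits; the function names and the list-based representation of logits are assumptions, and a real implementation would use a tensor library.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_move_loss(logits, target):
    """Cross-entropy between the predicted logits and the true next token."""
    return -math.log(softmax(logits)[target])


def sequence_loss(all_logits, tokens):
    """Average loss over one game: logits at step t predict tokens[t + 1]."""
    losses = [
        next_move_loss(all_logits[t], tokens[t + 1])
        for t in range(len(tokens) - 1)
    ]
    return sum(losses) / len(losses)
```

With uniform logits over the 60-word vocabulary, `next_move_loss` equals `math.log(60)`, which is the loss of a model that has learned nothing.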
The model starts from randomly initialized weights, including in the word embedding layer. Though there are geometrical relationships between the 60 words (e.g., C4 is below B4), this inductive bias is not explicitly given to the model but rather left to be learned.
2.3 Othello-GPT Usually Predicts Legal Moves
We now evaluate how well the model’s predictions adhere to the rules of Othello. For each game in the validation set, which was not seen during training, and for each step in the game, we ask Othello-GPT to predict the next legal move conditioned on the partial game before that move. We then calculate the error rate by checking if the top-1 prediction is legal. The error rate is for Othello-GPT trained on the synthetic dataset and for Othello-GPT trained on the championship dataset. For comparison, the untrained Othello-GPT has an error rate of . The main takeaway is that Othello-GPT does far better than chance in predicting legal moves when trained on both datasets. We discuss reasons for the difference between the error rates for the synthetic and championship models later in the paper.
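The evaluation loop can be sketched as follows. Both `predict_top1` and `legal_moves` are hypothetical callables standing in for the trained model and a rules engine, and starting from a one-move prefix is an assumption of this sketch.

```python
def legal_move_error_rate(games, predict_top1, legal_moves):
    """Fraction of (game, step) pairs whose top-1 predicted move is illegal.

    predict_top1(prefix) returns one tile name (hypothetical model wrapper).
    legal_moves(prefix) returns the set of legal tile names (hypothetical engine).
    """
    errors = 0
    total = 0
    for game in games:
        for step in range(1, len(game)):
            prefix = game[:step]
            legal = legal_moves(prefix)
            if not legal:
                continue  # nothing to predict at this step
            total += 1
            if predict_top1(prefix) not in legal:
                errors += 1
    return errors / total if total else 0.0
```

Note that the metric only asks whether the predicted move is legal, not whether it matches the move actually played in the transcript.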
A potential explanation for these results may be that Othello-GPT is simply memorizing all possible transcripts. To test for this possibility, we created a skewed dataset of 20 million games to replace the training set of the synthetic dataset. At the beginning of every game, there are four possible opening moves: C4, D3, E6 and F5. This means the lowest layer of the game tree (first move) has four nodes (the four possible opening moves). For our skewed dataset, we truncate one of these nodes (C4), which is equivalent to removing a quarter of the whole game tree. Othello-GPT trained on the skewed dataset still yields an error rate of. Since Othello-GPT has seen none of these test sequences before, pure sequence memorization cannot explain its performance. (One potential criticism is that certain board states are likely repeated across test sequences, but memorization related to board states rather than move sequences supports our hypothesis of emergent world representations.)
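A minimal sketch of the skewing step, assuming games are stored as lists of tile names: every game that opens with the removed move is held out of training. The function name `split_skewed` is hypothetical.

```python
def split_skewed(games, removed_opening="C4"):
    """Split games into those kept for training and those held out.

    A game is held out when its first move equals `removed_opening`,
    which prunes that whole subtree of the game tree from training.
    """
    kept = [game for game in games if game[:1] != [removed_opening]]
    held_out = [game for game in games if game[:1] == [removed_opening]]
    return kept, held_out
```

Because no training game shares even its first token with a held-out game, any prefix of a held-out game is guaranteed to be unseen.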
If the performance of Othello-GPT is not due to memorization, what is it doing? We now turn to this question by probing for internal representations of the game state.
3 Exploring Internal Representations with Probes
We seek to understand if Othello-GPT computes internal representations of the game state. One standard tool for this task is a “probe” (Alain and Bengio, 2016; Belinkov, 2016; Tenney et al., 2019). A probe is a classifier or regressor whose input consists of internal activations of a network, and which is trained to predict a feature of interest, e.g., part of speech or parse tree depth (Hewitt and Manning, 2019). If we are able to train an accurate probe, it suggests that a representation of the feature is encoded in the network’s activations.
In our case, we want to know whether Othello-GPT’s internal activations contain a representation of the current board state. To study this question, we train probes that predict the board state from the network’s internal activations after a given sequence of moves. Note that the board state (whether each tile is empty or holds a black or white disc) is generally a nonlinear function of the input tokens. On the other hand, since it is straightforward to write a program to compute this function, it makes a natural probe target. (Classifying a tile as unoccupied or occupied can be written as a linear function of the input sequence, thus we consider only the 3-way black/white/empty classifiers.)
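To make the probe target concrete, the sketch below replays a transcript and returns the resulting board. It follows the standard Othello rules (black moves first, a move must flank at least one opposing disc, a player with no legal move passes), which are background knowledge rather than something spelled out in the text. Letters are treated as rows and digits as columns, and all names are illustrative.

```python
EMPTY, BLACK, WHITE = 0, 1, 2
DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def initial_board():
    """Standard start: D4 and E5 white, D5 and E4 black."""
    board = [[EMPTY] * 8 for _ in range(8)]
    board[3][3] = board[4][4] = WHITE
    board[3][4] = board[4][3] = BLACK
    return board


def flips(board, row, col, player):
    """Opposing discs that would be flipped if `player` moved at (row, col)."""
    if board[row][col] != EMPTY:
        return []
    opponent = BLACK + WHITE - player
    flipped = []
    for d_row, d_col in DIRECTIONS:
        line = []
        r, c = row + d_row, col + d_col
        while 0 <= r < 8 and 0 <= c < 8 and board[r][c] == opponent:
            line.append((r, c))
            r, c = r + d_row, c + d_col
        # The run of opposing discs must be closed by one of the player's discs.
        if line and 0 <= r < 8 and 0 <= c < 8 and board[r][c] == player:
            flipped.extend(line)
    return flipped


def legal_moves(board, player):
    """Set of (row, col) squares where `player` may move."""
    return {(r, c) for r in range(8) for c in range(8) if flips(board, r, c, player)}


def board_after(transcript):
    """Replay a list of tile names (e.g. ["C4", "C3"]) and return the board."""
    board, player = initial_board(), BLACK
    for tile in transcript:
        if not legal_moves(board, player):  # forced pass
            player = BLACK + WHITE - player
        row, col = "ABCDEFGH".index(tile[0]), int(tile[1]) - 1
        captured = flips(board, row, col, player)
        if not captured:
            raise ValueError(f"illegal move {tile}")
        board[row][col] = player
        for r, c in captured:
            board[r][c] = player
        player = BLACK + WHITE - player
    return board
```

The flipping step is what makes the board state a nonlinear function of the move sequence: the color of a tile depends on every later move that flanks it.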
We take the autoregressive features $x_t^l$ (the activation for token $t$ after layer $l$) that summarize the partial sequence $y_{\le t}$ as the input to the probe and study results from different layers $l$. The output $p_\theta(x_t^l)$ is a 3-way categorical probability distribution. We randomly split pairs of internal representation and ground-truth tile state by into training and validation sets. Error rates on the validation set are reported. A best random guess yields an error rate of , if the probe always guesses the tile is empty.
3.1 Linear Probes Have High Error Rates
Our first result is that linear classifier probes have relatively poor accuracy. The probe function can be written as $p_\theta(x_t^l) = \mathrm{softmax}(W x_t^l)$, where $\theta = \{W \in \mathbb{R}^{3 \times F}\}$ and $F$ is the number of dimensions of the input $x_t^l$. As Table 1 shows, error rates never dip below . As a baseline, we have included probes trained on a randomly initialized network. (Probes on the randomized network do better than chance; a constant guess of “empty” has a 47% error rate. But that performance is not a surprise, since it makes sense that some information about moves is preserved even by a random network. The key comparison is between the randomized network and the trained network.) We can see that there is only a marginal improvement in accuracy when we move to probing a fully-trained network. This result suggests that if there is an internal representation of the board state, it does not have a simple linear form.
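For concreteness, a linear probe for a single tile can be sketched as below, assuming it takes the form of a softmax over a bias-free linear map of the activation. The class order (empty, black, white), the random weights, and the toy activation are all placeholders for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear_probe(W, x):
    """Return class probabilities for one tile.

    W has 3 rows (one per class), each of length F = len(x).
    """
    logits = [sum(w * v for w, v in zip(row, x)) for row in W]
    return softmax(logits)


# Random numbers stand in for a trained probe and a real 512-dim activation.
rng = random.Random(0)
F = 512
W = [[rng.gauss(0, 0.01) for _ in range(F)] for _ in range(3)]
x = [rng.gauss(0, 1) for _ in range(F)]
probs = linear_probe(W, x)
```

In this setup one such classifier would be trained per board tile, so a full board readout uses 64 weight matrices of this shape.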
3.2 Nonlinear Probes Have Lower Error Rates
Given the poor performance of linear probes, it is natural to ask whether nonlinear probes would have higher accuracy. Moving up one notch of complexity, we apply a 2-layer MLP as a probe. This technique has been used successfully in other language model probing work, e.g., Conneau et al. (2018); Cao et al. (2021); Hernandez and Andreas (2021). The probe function can be written as $p_\theta(x_t^l) = \mathrm{softmax}(W_1\,\mathrm{ReLU}(W_2 x_t^l))$, where $\theta = \{W_1 \in \mathbb{R}^{3 \times H}, W_2 \in \mathbb{R}^{H \times F}\}$ and $H$ is the number of hidden dimensions for the nonlinear probes.
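A two-layer probe differs from the linear one only by a hidden layer. The sketch below assumes a ReLU nonlinearity and no bias terms; both are assumptions of the example, as are the function names.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mlp_probe(W1, W2, x):
    """Two-layer probe: softmax(W1 · ReLU(W2 · x)).

    W2 has H rows of length F (input to hidden).
    W1 has 3 rows of length H (hidden to class logits).
    """
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in W2]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in W1]
    return softmax(logits)
```

The hidden width H is the capacity knob: a larger H lets the probe decode more from the same activations, which is why a randomized-network baseline is needed to interpret its accuracy.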
The probe accuracy for trained networks, shown in Table 2, is significantly better than the linear probe in absolute terms. By contrast, the baseline (probing a randomized network with nonlinear probes) shows almost no improvement over the linear case. These results indicate that the probe may be recovering a nontrivial representation of board state in the network’s activations. In section 4, we describe intervention experiments validating this hypothesis.
3.3 Visualizing the Geometry of Probes
Both linear and nonlinear probes can be viewed as geometric objects. In the case of linear probes, we can associate each classifier with the normal vector to the separating hyperplane. In the case of nonlinear probes, we can treat the second layer of the MLP as a linear classifier and take the same view. This perspective associates a vector to each grid tile, corresponding to the classifier for that grid tile.
A natural question is whether these vectors display any geometric patterns. Figure 2 visualizes their configuration using PCA plots. To make the connection with board geometry clear, we have overlaid a grid in which the vector for a given grid tile is connected to the vectors that represent its neighbors. At left, as a baseline, are weights of probes trained on randomized GPTs; the result is a somewhat jumbled version of the board’s grid. For classifier vectors, however, we see a somewhat clearer geometric correlation with the grid shape itself. This shape may reflect a set of correlations between neighboring grid cells, and could be an interesting object for future study. One point of interest is that probe geometry for the randomized network does not seem completely random. This may fit with the fact that linear probe baseline performance is better than chance, indicating some information about board state can be gleaned from random projections of game move tokens.
4 Validating Probes with Interventional Experiments
Our nonlinear probe accuracies suggest that Othello-GPT computes information reflecting the board state. It’s not obvious, however, whether that information is causal for the model’s predictions. To investigate this issue, we evaluate whether the representations uncovered in section 3 play a causal role in Othello-GPT’s predictions. In the following section, we follow the recommendation of Belinkov (2016), performing a set of interventional experiments to determine the causal relationship between model predictions and the emergent world representations.
To figure out whether the board state information affects the network’s predictions, we modify internal activations during Othello-GPT’s calculation and measure the resulting effects. At a high level, the interventions are as follows: given a set of activations from Othello-GPT, a probe predicts a baseline board state $B$. We save the move predictions associated with $B$, then modify these activations such that our probe reports an updated board state $B'$. Through our protocol, only a single tile distinguishes $B'$ from $B$ (an example can be seen in Figure 3). This small modification results in a different set of possible legal moves for $B'$. If the new predictions match our expectations for $B'$, and not $B$, we conclude the representation had a causal effect on the model.
4.1 Intervention Technique
To implement an intervention that changes the predicted state from a board position $B$ to a modified version $B'$, we must decide (a) in which layers to modify activations, and (b) how to modify those activations. The first question is subtle. Given the causal attention mechanism of GPT-2, modifying activations for only one layer is unlikely to be effective, as later layer computations incorporate information from prior board representations unaffected by our intervention. Instead, we select an initial layer, then modify its activations and those of subsequent layers (see Figure 3 (C)). Our modification uses a simple gradient descent method on the probe’s class score for the particular tile whose state is being modified.
Figure 3 illustrates an intervention on a single feature, turning $x$ into $x'$ such that the corresponding board state $B$ is updated to match the desired $B'$. We observe the effectiveness of these interventions by probing the intervened $x'$, or activations at later layers (see Appendix C), as well as the change in next-step prediction (see subsection 4.2). Consistent with the training process of probes, we use cross-entropy loss between the probe-predicted probability distribution and the desired board state, but rather than optimize probe weights, we optimize $x$ for intervention, where $\alpha$ is the learning rate (see more on the intervention hyper-parameters in Appendix D):
$$x' \leftarrow x - \alpha \frac{\partial \mathcal{L}_{\mathrm{CE}}\big(p_\theta(x), B'\big)}{\partial x}$$
At timestep $t$, the internal activations of an $L$-layer Othello-GPT can be viewed as an $L \times t$ grid of activation vectors. Our intervention process will work by running Othello-GPT sequentially, but using gradient descent to modify key activation vectors at the last timestep so that their board state class scores change. Note that if we change activations only at a middle layer, activations at higher layers are directly affected by pre-intervention information. Therefore, we sequentially intervene at the last timestep, on all activations starting from a preset layer until the final layer, as illustrated in Figure 3.
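The update can be illustrated on a single activation vector. The sketch below swaps in a linear softmax probe so the gradient has a closed form; the experiments described here rely on nonlinear probes, where the gradient would come from automatic differentiation. The learning rate, step count, and function names are arbitrary choices for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def probe(W, x):
    """Linear softmax probe for one tile (stand-in for a trained probe)."""
    return softmax([sum(w * v for w, v in zip(row, x)) for row in W])


def intervene(W, x, target, lr=0.1, steps=100):
    """Gradient descent on x so the frozen probe reports class `target`.

    For cross-entropy on softmax(W x), the gradient with respect to x is
    W^T (p - onehot(target)). The probe weights W are never updated.
    """
    x = list(x)
    for _ in range(steps):
        p = probe(W, x)
        grad_logits = [p[k] - (1.0 if k == target else 0.0) for k in range(len(p))]
        for i in range(len(x)):
            x[i] -= lr * sum(grad_logits[k] * W[k][i] for k in range(len(W)))
    return x


W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
x_before = [2.0, 0.0]
x_after = intervene(W, x_before, target=1)
```

In the full procedure this update would be repeated at every layer from the chosen starting layer to the last, always at the final timestep, so that later layers do not restore the original board state.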
4.2 Evidence for a Causal Role for the Representation
To systematically evaluate if this world representation is causal for model predictions, we create two evaluation benchmarks. Each consists of intervention cases: one factual, one counterfactual. A test case in these benchmarks consists of a triplet of a partial game, a targeted board tile, and a target state. For each case, we give the partial game to Othello-GPT and perform the intervention described in the previous section. That is, we extract its activations in the middle of the computation process, modify them to change the representation of the targeted board tile into the target state, give back the modified world representation, and let the model make a prediction with this new world state.
In the counterfactual case, we specifically ask whether the model’s world representation can represent arrangements of tiles on a board that are unreachable during legal Othello play, i.e., states that do not correspond to any legal sequence of moves. If the model can make correct predictions about such states, it helps rule out the possibility that our probes might have learned to merely project a sequence-oriented internal state to a board-based world model that the probes have hallucinated. If Othello-GPT can make correct predictions about counterfactual states, it provides evidence for an internal representation capable of representing a board rather than just a sequence.
To measure how well the prediction is aligned with ground-truth legal moves, we calculate a prediction set by comparing the prediction probability for each tile with , where is the number of legal post-intervention moves. Then, we calculate an error per case (a sum of false positives and false negatives, shown in Figure 4). (Our qualitative results, along with their interpretation, can be found in Appendix B.) For both benchmarks, nonlinear probes with give the best result: average errors of and respectively. Interventions based on linear probes all give worse than baseline results. Compared to baseline errors ( and ), the proposed intervention technique is effective even under counterfactual board states, suggesting the emergent world representations are causal to model predictions.
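The per-case error can be sketched as a set comparison. The threshold used to form the prediction set is left as a parameter here, since its exact form is not recoverable from the text above; the function names are illustrative.

```python
def prediction_set(probs, threshold):
    """Tiles whose predicted probability exceeds the threshold.

    probs maps tile names to probabilities; threshold is a free parameter
    in this sketch.
    """
    return {tile for tile, p in probs.items() if p > threshold}


def intervention_error(predicted, legal):
    """False positives plus false negatives against the legal-move set."""
    false_positives = len(predicted - legal)
    false_negatives = len(legal - predicted)
    return false_positives + false_negatives
```

An error of zero means the model's predicted moves are exactly the moves that are legal on the post-intervention board.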
5 Latent Saliency Maps: Attribution via Intervention
The intervention technique of the previous section provides insight into the predictions of Othello-GPT. We might also use it to create visualizations which contextualize Othello-GPT’s predictions in terms of the board state. The basic idea is simple. For any tile on the board, we ask how much the network’s prediction would change if we applied the intervention of the previous section to change the state of that tile. This will yield a value per tile, positive or negative, corresponding to its saliency in the prediction (see algorithm 1). We then create a visualization of the board where tiles are colored according to their saliency. Because this map is based on the network’s latent space rather than its input, we call it a latent saliency map.
Figure 5 shows latent saliency maps for the synthetic and championship versions of Othello-GPT. The two diagrams show a clear pattern. The synthetic Othello-GPT shows high saliency for precisely those tiles that are required to make a move legal. In almost all cases, other tiles have lower saliency values. Even without knowing how synthetic-GPT was trained, an experienced Othello player might be able to guess its goal. The latent saliency maps for the championship version, however, are more complex. Although tiles that relate directly to legality typically have high values, many other tiles show high saliency as well. This pattern makes sense, too. Expert moves rely on complex global features of the board. The difference between the latent saliency maps for the two versions of Othello-GPT suggests that the visualization technique is providing useful information about the two systems.
Algorithm 1 (notation): the current board state; p, a legal next move which we try to attribute; the assigned sensitivity values for p.
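The attribution loop can be sketched as follows. `predict_prob` and `intervene_tile` are hypothetical callables wrapping the model and the intervention of section 4, and the sign convention (original probability minus post-intervention probability) is an assumption of this sketch. The toy stand-ins at the bottom exist only to make the example runnable.

```python
def latent_saliency_map(activations, tiles, move, predict_prob, intervene_tile):
    """Saliency of each tile for the prediction of `move`.

    predict_prob(activations, move) returns the model's probability of `move`.
    intervene_tile(activations, tile) returns activations whose probed state
    for `tile` has been changed. Both are hypothetical wrappers.
    """
    base = predict_prob(activations, move)
    saliency = {}
    for tile in tiles:
        modified = intervene_tile(activations, tile)
        saliency[tile] = base - predict_prob(modified, move)
    return saliency


# Toy stand-ins: the "activations" are just a dict of tile states.
def toy_predict_prob(state, move):
    return 0.9 if state["D4"] == "white" else 0.1


def toy_intervene(state, tile):
    flipped = dict(state)
    flipped[tile] = "black" if state[tile] == "white" else "white"
    return flipped


toy_state = {"D4": "white", "E5": "white", "D5": "black"}
saliency = latent_saliency_map(
    toy_state, list(toy_state), "C4", toy_predict_prob, toy_intervene
)
```

In the toy case only D4 receives a nonzero value, since it is the only tile the stand-in predictor depends on.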
6 Related work
Our work fits into a general line of investigation into world representations created by sequence models. For example, Li et al. (2021) fine-tune two language models on synthetic natural language tasks (Long et al., 2016) and find evidence that semantic information about the underlying world state is at least weakly encoded in the activations of the network. More direct evidence of a faithful representation of 3D color space comes from Abdou et al. (2021), who examine activations in the BERT model and find a geometric connection to a standard 3D color space. Another study by Patel and Pavlick (2022) shows that language models can learn to map conceptual domains, e.g., direction and color, onto a grounded world representation via prompting techniques (Brown et al., 2020). These investigations operate in natural language domains, but investigate relatively simple world models.
Another related stream of work concerns neural networks that learn board games. There is a long history of work in AI to learn game moves, but in general, these systems have been given some a priori knowledge of the structure of the game. Even one of the most general-purpose game-playing engines, AlphaZero (Silver et al., 2018), has built-in knowledge of basic board structure and game rules (although, intriguingly, it seems to develop interpretable models of various strategic concepts (McGrath et al., 2021; Forde et al., 2022)).
The work closest to ours is Toshniwal et al. (2021), which trains a language model on chess transcripts. They show strong evidence that transformer networks are building a representation of internal board state, but they stop short of investigating what form that representation takes. Our work can be seen as building on this line of research, with a focus on the geometry of internal representations.
The intervention technique we use in section 4 follows an approach of steering model output while keeping the model frozen. It is related to the ideas behind plug-and-play controllable text generation for autoregressive (Dathathri et al., 2019; Qin et al., 2020; Krause et al., 2020) and diffusion (Li et al., 2022) language models, which optimize the likelihood of the desired attribute and the fluency of generated texts at the same time. These methods naturally involve a trade-off and require several forward and backward passes to generate. Our proposed intervention method stands out in that it works only on internal representations and requires only one forward pass.
Finally, latent saliency maps can be viewed as a generalization of the TCAV (testing with concept activation vectors) approach (Kim et al., 2018; Ghorbani et al., 2019; Koh et al., 2020). In the TCAV setting, attribution is performed via directional derivatives. This is essentially a linearization of the gradient-descent optimization used in our attribution maps.
7 Conclusion
Our experiments provide evidence that Othello-GPT maintains a representation of game board states—that is, the Othello “world”—to produce sequences it was trained on. This representation appears to be nonlinear in an essential way, as supported by the results of our linear probe and nonlinear probe experiments. Further, we find that these representations can be causally linked to how the model makes its predictions. Understanding of the internal representations of a sequence model is interesting in its own right, but may also be helpful in deeper interpretations of the network.
We have also described how interventional experiments may be used to create a “latent saliency map”, which gives a picture, in terms of the Othello board, of how the network has made a prediction. Applied to two versions of Othello-GPT that were trained on different data sets, the latent saliency maps highlight the dramatic differences between the underlying representations of the Othello-GPT trained on the synthetic dataset and its counterpart trained on the championship dataset.
There are several potential lines of future work. One natural extension would be to perform the same type of investigations with other, more complex games. It would also be interesting to compare the strategies learned by a sequence model trained on game transcripts with those of a model trained with a priori knowledge of Othello. One way to study this question is to compare latent saliency maps of Othello-GPT with standard saliency maps of an Othello-playing program which has the actual board state as input.
More broadly, it would be interesting to know how our results generalize to models trained on natural language. One stepping stone might be to look at language models whose training data has included game transcripts. Will we see similar representations of board state? For more complex natural language tasks, can we find meaningful world representations? The tools described in this paper (nonlinear probes, layerwise interventions, and latent saliency maps) may yet prove useful in natural language settings.
We thank members of the Visual Computing Group and the Insight + Interaction Lab at Harvard for their early discussions and feedback. We especially thank Aoyu Wu for helping to make some of the figures and for other valuable suggestions. We gratefully acknowledge the support of Harvard SEAS Fellowship (to KL), Siebel Fellowship (to AH), Open Philanthropy (to DB), and the FTX Future Fund Regrant Program (to DB). This work was partially supported by NSF grant IIS-1901030.
- Can language models encode perceptual structure without grounding? a case study in color. arXiv preprint arXiv:2109.06129. Cited by: §1, §6.
- Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644. Cited by: §1.1, §3.
- Concept gradient: concept-based interpretation without linear assumption. arXiv preprint arXiv:2208.14966. Cited by: Appendix E.
- Probing classifiers: promises, shortcomings, and advances. Computational Linguistics, pp. 1–12. Cited by: §1.1, §3, §4.
- On the dangers of stochastic parrots: can language models be too big?. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 610–623. Cited by: §1.
- Climbing towards nlu: on meaning, form, and understanding in the age of data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5185–5198. Cited by: §1.
- Language models are few-shot learners. arXiv preprint arXiv:2005.14165. Cited by: §2.2, §6.
- Low-complexity probing via finding subnetworks. arXiv preprint arXiv:2104.03514. Cited by: §3.2.
- What you can cram into a single vector: probing sentence embeddings for linguistic properties. arXiv preprint arXiv:1805.01070. Cited by: §3.2.
- Plug and play language models: a simple approach to controlled text generation. arXiv preprint arXiv:1912.02164. Cited by: §6.
- GPT-3: its nature, scope, limits, and consequences. Minds and Machines 30 (4), pp. 681–694. Cited by: §1.
Where, when & which concepts does alphazero learn? lessons from the game of hex.
AAAI Workshop on Reinforcement Learning in Games, Vol. 2. Cited by: §6.
- Towards automatic concept-based explanations. Advances in Neural Information Processing Systems 32. Cited by: §6.
- The low-dimensional linear geometry of contextualized word representations. arXiv preprint arXiv:2105.07109. Cited by: §3.2.
- A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4129–4138. Cited by: Appendix A, §3.
Interpretability beyond feature attribution: quantitative testing with concept activation vectors (tcav).
International conference on machine learning, pp. 2668–2677. Cited by: Appendix E, §6.
- Concept bottleneck models. In International Conference on Machine Learning, pp. 5338–5348. Cited by: §6.
- Gedi: generative discriminator guided sequence generation. arXiv preprint arXiv:2009.06367. Cited by: §6.
- Implicit representations of meaning in neural language models. arXiv preprint arXiv:2106.00737. Cited by: §1, §6.
- Diffusion-lm improves controllable text generation. arXiv preprint arXiv:2205.14217. Cited by: §6.
- Simpler context-dependent logical forms via model projections. arXiv preprint arXiv:1606.05378. Cited by: §6.
- Acquisition of chess knowledge in alphazero. arXiv preprint arXiv:2111.09259. Cited by: Appendix A, §6.
- Provable limitations of acquiring meaning from ungrounded form: what will future language models understand?. Transactions of the Association for Computational Linguistics 9, pp. 1047–1060. Cited by: §1.
- Mapping language models to grounded conceptual spaces. In International Conference on Learning Representations. Cited by: §1, §6.
- Back to the future: unsupervised backprop-based decoding for counterfactual and abductive commonsense reasoning. arXiv preprint arXiv:2010.05906. Cited by: §6.
- Improving language understanding by generative pre-training. Cited by: §2.2.
- Language models are unsupervised multitask learners. OpenAI blog 1 (8), pp. 9. Cited by: §2.2.
- A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science 362 (6419), pp. 1140–1144. Cited by: §2.2, §6.
- Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825. Cited by: Appendix E.
- Beyond the imitation game: quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615. Cited by: §2, §6, footnote 2.
- BERT rediscovers the classical nlp pipeline. arXiv preprint arXiv:1905.05950. Cited by: §1.1, §3.
- Learning chess blindfolded: evaluating language models on state tracking. arXiv preprint arXiv:2102.13249. Cited by: §1.1, §1, §6.
Appendix A Ablation on Nonlinear Probe Accuracies
Although nonlinear probes reach high accuracy in Section 3, we want a deeper understanding of them. For example, we wish to understand when, during a game, the model has developed world representations of board tiles, where in Othello-GPT that information is stored, how difficult it is to decode, and when the model may forget it.
As shown in Figure 6 (B), we plot the accuracies of two-layer probes while varying two experimental settings: (1) Probe Hidden Units: the number of hidden units in the nonlinear probe; (2) Layer: the layer from which the representations are taken. As the number of hidden units, i.e., the probe capacity, increases, probe accuracy rises, since a larger probe can capture more knowledge from the hidden space. Across layers, accuracy peaks midway, which is consistent with studies on natural language (Hewitt and Manning, 2019), where linguistic properties are found to be best probed in the middle layers. The format of our what-how-where plots is similar to the what-when-where plots in McGrath et al. (2021), except that, for each layer of the GPT, we ask how many hidden units the nonlinear probe needs to achieve reasonable accuracy, rather than looking at the number of training epochs.
We are also curious about when, in terms of game progression, these concepts are captured by Othello-GPT. Are they updated immediately after each move? Do they persist, or are they forgotten as newer moves are made? To study this, we divide the probe validation data points by how many steps the tile has been in its current state, and in Figure 6 (A) we plot which concepts can be probed, by probes of what capacity, at each point in the game progression.
For nonlinear probes with a moderate number of hidden units, the accuracy curve is parabolic: concepts are best captured when they have existed for some time, but not too long. This suggests that (1) forgetting happens when Othello-GPT changes its world representations, and (2) there is a period of uncertainty before changes in the board state are reflected in the representation.
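The grouping by "how many steps the tile has been in its current state" can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names `tile_state_ages` and `accuracy_by_age`, the flat-list board encoding, the convention that a freshly changed tile has age 0, and the toy data are all hypothetical.

```python
from collections import defaultdict


def tile_state_ages(boards):
    """For each step, return how many steps each tile has held its current state.

    `boards` is a list of board snapshots (one per move). Each snapshot is a
    flat list of tile states, e.g. 0 = empty, 1 = black, 2 = white
    (a hypothetical encoding chosen for this sketch).
    """
    ages = []
    prev_board, prev_age = None, None
    for board in boards:
        if prev_board is None:
            age = [0] * len(board)
        else:
            # Age grows while the state is unchanged and resets on a change.
            age = [prev_age[i] + 1 if board[i] == prev_board[i] else 0
                   for i in range(len(board))]
        ages.append(age)
        prev_board, prev_age = board, age
    return ages


def accuracy_by_age(boards, probe_predictions):
    """Group probe correctness by tile-state age and return accuracy per age."""
    correct = defaultdict(int)
    total = defaultdict(int)
    all_ages = tile_state_ages(boards)
    for board, pred, age in zip(boards, probe_predictions, all_ages):
        for truth, guess, a in zip(board, pred, age):
            total[a] += 1
            correct[a] += int(truth == guess)
    return {a: correct[a] / total[a] for a in sorted(total)}


# Toy data: 3 tiles over 3 moves; tile 1 changes state at the last step,
# and the (made-up) probe output has not yet caught up with that change.
boards = [[0, 1, 2], [0, 1, 2], [0, 2, 2]]
preds = [[0, 1, 2], [0, 1, 2], [0, 1, 2]]
result = accuracy_by_age(boards, preds)
```

Plotting `result` against age for probes of different capacity would give curves of the kind discussed above; the toy data only illustrates lower accuracy on a just-changed tile.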
Appendix B Prediction heatmaps of counterfactual board states
Figure 7 shows one case of how intervening on the world representation of Othello-GPT changes the model's prediction. Note that the set of ground-truth legal moves is also changed by the intervention. In this case, both the pre-intervention and post-intervention predictions contain errors. Figure 4 shows systematic results across cases.
Appendix C Inter-probe Interaction
In Figure 8 we show, for the same case as in Appendix B, the world states probed from the layers of Othello-GPT, starting from the layer at which we begin to intervene. We can observe that after the intervention is successfully applied at C4 on this first layer, the flipped disc is corrected in the immediately following layer, as seen in the pre-intervention probing results at that layer. However, when the same intervention is applied again on later layers, the model becomes more convinced that C4 should be black and stops correcting it.
Appendix D Ablations on Intervention Hyper-parameters
Our experiments find that this optimization process is robust to different configurations of optimizer, learning rate, and number of steps.
The complete world state, including the states of all board tiles, is encoded in a single internal representation, whereas in the intervention experiment we only wish to change one tile. A natural question is: will the intervention flip tiles we do not intend to change? This is possible, but it can be mitigated by adding the cross-entropy losses of the other tiles as a regularization term, weighted by a hyper-parameter. That is, the loss in the main paper can be expanded to include this weighted term.
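One way to write such a regularized objective is sketched below. The notation is assumed here for illustration and is not taken from the main paper: $x$ is the internal representation, $p_\theta$ the probe, $B'$ the target board state in which only tile $s$ differs from the current one, $t$ ranges over the remaining tiles, and $\beta$ is the regularization weight.

```latex
\mathcal{L}(x) \;=\; \mathrm{CE}\big(p_\theta(x)_s,\; B'_s\big)
\;+\; \beta \sum_{t \neq s} \mathrm{CE}\big(p_\theta(x)_t,\; B'_t\big)
```

In this sketch, $\beta = 0$ optimizes only the tile being flipped, while larger values of $\beta$ increasingly penalize changes to the probed states of the other tiles.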
We sweep this regularization weight under the best-performing configuration on the factual benchmark and report the average number of errors in Table 3. The regularization does not clearly help.
Here we further discuss another hyper-parameter ablated in Figure 4: the starting layer for intervention. If we intervene on too many layers, shallow layers that have not yet developed reliable world representations (according to Figure 6) are touched, making the intervention unreliable. On the other hand, if we intervene only at the deepest layers, the world representation can be changed successfully (see Appendix C), but the model does not have enough remaining computation to adapt to the new world representation and make predictions consistent with it.
Appendix E Attribution via Conceptual Sensitivity
Kim et al. (2018) propose Testing with Concept Activation Vectors (TCAV), which operates on the high-dimensional internal representations of deep models to quantify the degree to which underlying concepts in the input are important for a prediction. It produces heatmaps in the same format as our proposed attribution via intervention method and can therefore serve as a baseline for comparison. In essence, the method uses dot products between a prediction vector and a set of CAVs; here, it tests how sensitive the next-step predictions of Othello-GPT are to these concepts.
To start, we write the prediction as a function of the internal representation, since that representation acts as a summary of the input. We then linearize this function locally, and the linearization, after normalization, serves as our prediction vector. The idea is that, in the hidden space where the representation dwells, the closer a CAV is to this prediction vector, the more the concept it represents contributes to the prediction.
For linear probes, we treat the normalized probe weights as a set of CAVs, one for each tile. For nonlinear probes, we propose a natural nonlinear generalization: local CAVs that replace the normalized probe weights but are specific to each internal representation. At each internal representation, we compute the direction in hidden space that strengthens the probe's current prediction, using gradient backpropagation with SmoothGrad (Smilkov et al., 2017) applied. The result is not sensitive to whitening. This linearization of the probe is concurrently proposed by Bai et al. (2022).
We then assign a sensitivity score to each tile p on the board and plot the results as heatmaps. Similar patterns appear in both Figure 9 and Figure 10, but the suggested pattern does not align well with the actual rules of Othello.
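The scoring step can be illustrated with a short sketch. It assumes that the sensitivity score is the plain dot product between the normalized prediction vector and each normalized CAV, as the description of TCAV above suggests; the function names, tile labels, and toy vectors are hypothetical and do not come from Othello-GPT.

```python
import math


def normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(c * c for c in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [c / norm for c in vector]


def sensitivity_scores(prediction_vector, cavs):
    """Return, per tile, the dot product of the unit prediction vector and unit CAV.

    `prediction_vector` stands in for the local linearization of the prediction
    function; `cavs` maps a tile name to its concept activation vector
    (e.g. linear probe weights). Both are assumptions of this sketch.
    """
    unit_prediction = normalize(prediction_vector)
    return {
        tile: sum(a * b for a, b in zip(unit_prediction, normalize(cav)))
        for tile, cav in cavs.items()
    }


# Toy vectors in a 3-dimensional "hidden space" (made up for illustration).
prediction_vector = [3.0, 4.0, 0.0]
cavs = {
    "C4": [3.0, 4.0, 0.0],    # aligned with the prediction vector
    "D3": [0.0, 0.0, 2.0],    # orthogonal to it
    "E6": [-3.0, -4.0, 0.0],  # pointing the opposite way
}
scores = sensitivity_scores(prediction_vector, cavs)
```

Under these assumptions, a tile whose CAV is aligned with the prediction vector scores near 1, an orthogonal one near 0, and an opposed one near -1; arranging the scores on the board grid gives the heatmap.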
E.1 Discussion of Two Attribution Methods
In Figure 9 and Figure 10, the linear and nonlinear approximations give similar results. Based on this observation, we conjecture that the problem lies not in the CAVs but in the linearization of the prediction function. It is likely that the prediction module of Othello-GPT performs a more complex logical operation than one that can be easily linearized, as is possible in the image domain.
In the latent saliency maps created by attribution via intervention in Figure 5, when multiple discs are flipped, only the first one is found to contribute, which is slightly misaligned with human intuition. However, this is expected from the algorithm we use, because the current prediction remains legal even when the other flipped discs are in the opposite state.
The attribution via intervention method succeeds at visualizing the AND-logic in the Othello rule for flipping one straight line: there must be opponent discs in between and a same-color disc at the other end. However, when the predicted move flips more than one straight line, intervening on one of them does not nullify the prediction, because the Othello rule is an OR-logic on top of the AND-logic: a move is legal when at least one straight line can be flipped. How can such knowledge be extracted with an unlimited number of intervention experiments? We leave this for future research.
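The AND/OR structure of the legality rule can be made concrete with a small sketch. It implements the standard Othello rule on a toy 5x5 board rather than anything specific to Othello-GPT; the board encoding ("B", "W", "." for empty) and the function names are assumptions of this example.

```python
DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def line_can_flip(board, row, col, d_row, d_col, player):
    """AND-logic for one direction: opponent discs in between AND an own disc at the end."""
    opponent = "W" if player == "B" else "B"
    r, c = row + d_row, col + d_col
    seen_opponent = False
    while 0 <= r < len(board) and 0 <= c < len(board[0]):
        if board[r][c] == opponent:
            seen_opponent = True
        elif board[r][c] == player:
            return seen_opponent  # own disc closes the line
        else:
            return False  # empty tile breaks the line
        r, c = r + d_row, c + d_col
    return False  # ran off the board without a closing disc


def is_legal(board, row, col, player):
    """OR-logic over directions: legal if the tile is empty and any line can flip."""
    if board[row][col] != ".":
        return False
    return any(line_can_flip(board, row, col, dr, dc, player)
               for dr, dc in DIRECTIONS)


# Black to play at (2, 2): the move flips a line to the north and one to the east.
board = [list(row) for row in ["..B..", "..W..", "...WB", ".....", "....."]]
legal_before = is_legal(board, 2, 2, "B")

# "Intervene" on the north line only: the east line still makes the move legal.
one_flipped = [row[:] for row in board]
one_flipped[1][2] = "B"
legal_after_one = is_legal(one_flipped, 2, 2, "B")

# Intervene on both lines: no direction satisfies the AND-condition any more.
both_flipped = [row[:] for row in one_flipped]
both_flipped[2][3] = "B"
legal_after_both = is_legal(both_flipped, 2, 2, "B")
```

Here `legal_after_one` stays true while `legal_after_both` becomes false, which mirrors why a single-tile intervention cannot reveal the contribution of a second flippable line.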