Designing and Interpreting Probes with Control Tasks

09/08/2019 ∙ by John Hewitt, et al. ∙ Stanford University

Probes, supervised models trained to predict properties (like parts-of-speech) from representations (like ELMo), have achieved high accuracy on a range of linguistic tasks. But does this mean that the representations encode linguistic structure or just that the probe has learned the linguistic task? In this paper, we propose control tasks, which associate word types with random outputs, to complement linguistic tasks. By construction, these tasks can only be learned by the probe itself. So a good probe (one that reflects the representation) should be selective, achieving high linguistic task accuracy and low control task accuracy. The selectivity of a probe puts linguistic task accuracy in context with the probe's capacity to memorize from word types. We construct control tasks for English part-of-speech tagging and dependency edge prediction, and show that popular probes on ELMo representations are not selective. We also find that dropout, commonly used to control probe complexity, is ineffective for improving selectivity of MLPs, but that other forms of regularization are effective. Finally, we find that while probes on the first layer of ELMo yield slightly better part-of-speech tagging accuracy than the second, probes on the second layer are substantially more selective, which raises the question of which layer better represents parts-of-speech.




1 Introduction

As large-scale unsupervised representations such as BERT and ELMo improve downstream performance on a wide range of natural language tasks devlin2019bidirectional; peters2018deep; radford2019language, what these models learn about language remains an open scientific question. An emerging body of work investigates this question through probes, supervised models trained to predict a property (like parts-of-speech) from a constrained view of the representation. Probes trained on various representations have obtained high accuracy on tasks requiring part-of-speech and morphological information belinkov2017what, syntactic and semantic information peters2018dissecting; tenney2018what, among other properties conneau2018what, providing evidence that deep representations trained on large datasets are predictive of a broad range of linguistic properties.

But when a probe achieves high accuracy on a linguistic task using a representation, can we conclude that the representation encodes linguistic structure, or has the probe just learned the task? Probing papers tend to acknowledge this uncertainty, putting accuracies in context using random representation baselines zhang2018language and careful task design hupkes2018visualisation. Even so, as long as a representation is a lossless encoding, a sufficiently expressive probe with enough training data can learn any task on top of it.

In this paper, we propose control tasks, which associate word types with random outputs, to give intuition for the expressivity of probe families and provide insight into how representation and probe interact to achieve high task accuracy.

Control tasks are based on the intuition that the more a probe is able to make task output decisions independently of the linguistic properties of a representation, the less its accuracy on a linguistic task necessarily reflects the properties of the representation. Thus, a good probe (one that provides insights into the linguistic properties of a representation) should be what we call selective, achieving high linguistic task accuracy and low control task accuracy (see Figure 2).
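To make the idea concrete, the following is a minimal sketch of how a control task for a tagging-style problem might be constructed. It assumes labels are drawn uniformly at random per word type; the function name `build_control_task`, the `num_labels` parameter, and the uniform sampling are illustrative assumptions, not details stated in this section.

```python
import random

def build_control_task(sentences, num_labels, seed=0):
    """Assign every word type one fixed random label, then label each token.

    Assumption (not from the text above): labels are sampled uniformly
    from range(num_labels). The key property is only that the label
    depends on the word type and on nothing else.
    """
    rng = random.Random(seed)
    type_to_label = {}
    labeled_sentences = []
    for sentence in sentences:
        labels = []
        for word in sentence:
            # A word type gets its label the first time it is seen
            # and keeps that label in every later occurrence.
            if word not in type_to_label:
                type_to_label[word] = rng.randrange(num_labels)
            labels.append(type_to_label[word])
        labeled_sentences.append(labels)
    return type_to_label, labeled_sentences

# Hypothetical toy corpus, for illustration only.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
mapping, control_labels = build_control_task(corpus, num_labels=5)
```

Because the outputs carry no linguistic signal, a probe can only score well on such a task by memorizing the word-type-to-label mapping, which is what makes control task accuracy a measure of probe capacity.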

We show that selectivity can be a guide in designing probes and interpreting probing results, complementary to random representation baselines; as of now, there is little consensus on how to design probes. Early probing papers used linear functions shi2016does; ettinger2016probing; alain2016understanding, which are still used bisazza2018lazy; liu2019linguistic, but multi-layer perceptron (MLP) probes are at least as popular belinkov2017what; conneau2018what; adi2017finegrained; tenney2018what; ettinger2018assessing. Arguments have been made for “simple” probes, e.g., that we want to find easily accessible information in a representation liu2019linguistic; alain2016understanding. As a counterpoint though, “complex” MLP probes have also been suggested since useful properties might be encoded non-linearly conneau2018what, and they tend to report similar trends to simpler probes anyway belinkov2017what; qian2016investigating.
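The contrast between the two probe families can be sketched as forward passes in NumPy. This is only an illustration under assumed shapes: the representation size (1024), the number of output classes (45), the hidden size (10), and the ReLU non-linearity are placeholder choices, and the function names are hypothetical.

```python
import numpy as np

def linear_probe(h, W, b):
    """Linear probe: class scores are an affine function of the representation."""
    return h @ W + b

def mlp_probe(h, W1, b1, W2, b2):
    """One-hidden-layer MLP probe (ReLU is an assumed choice of non-linearity)."""
    hidden = np.maximum(0.0, h @ W1 + b1)
    return hidden @ W2 + b2

# Illustrative shapes only: 3 tokens, 1024-dim vectors, 45 classes, 10 hidden units.
rng = np.random.default_rng(0)
h = rng.normal(size=(3, 1024))
linear_scores = linear_probe(h, rng.normal(size=(1024, 45)), np.zeros(45))
mlp_scores = mlp_probe(
    h,
    rng.normal(size=(1024, 10)), np.zeros(10),
    rng.normal(size=(10, 45)), np.zeros(45),
)
```

The MLP can represent functions of the representation that no linear map can, which is the source of both its appeal (non-linearly encoded properties) and the concern that it may learn the task itself.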

Figure 2: Selectivity is defined as the difference between linguistic task accuracy and control task accuracy. As shown, it can vary widely across probes that achieve similar linguistic task accuracies. These results are taken from the selectivity results section.
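The definition in the caption amounts to a single subtraction. A small sketch follows, with helper names (`accuracy`, `selectivity`) and toy predictions that are purely illustrative and not results from the paper.

```python
def accuracy(predictions, gold):
    """Fraction of positions where the prediction matches the gold label."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

def selectivity(linguistic_accuracy, control_accuracy):
    """Selectivity: linguistic task accuracy minus control task accuracy."""
    return linguistic_accuracy - control_accuracy

# Toy numbers for illustration only.
linguistic_acc = accuracy(["DT", "NN", "VBD", "NN"], ["DT", "NN", "VBD", "JJ"])
control_acc = accuracy([3, 1], [3, 4])
probe_selectivity = selectivity(linguistic_acc, control_acc)
```

Two probes with the same linguistic task accuracy can therefore differ sharply in selectivity if one of them also fits the control task well.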

We define control tasks corresponding to English part-of-speech tagging and dependency edge prediction, and use ELMo representations to conduct a broad study of probe families, hyperparameters, and regularization methods, evaluating both linguistic task accuracy and selectivity. We propose that selectivity be used for building intuition about the expressivity of probes and the properties of models, putting probing accuracies into richer context. We find that:

  1. With popular hyperparameter settings, MLP probes achieve very low selectivity, suggesting caution in interpreting how their results reflect properties of representations. For example, on part-of-speech tagging, the linguistic task accuracy of an MLP probe is only modestly higher than its control task accuracy, resulting in low selectivity.

  2. Linear and bilinear probes achieve relatively high selectivity across a range of hyperparameters. For example, a linear probe on part-of-speech tagging achieves similar linguistic task accuracy but considerably lower control task accuracy, and therefore higher selectivity. This suggests that the small accuracy gain of the MLP may be explained by increased probe expressivity.

  3. The most popular method for controlling probe complexity, dropout, does not consistently lead to selective MLP probes. However, controlling MLP complexity through unintuitively small (10-dimensional) hidden states, small training sample sizes, or weight decay leads to higher selectivity and similar linguistic task accuracy.