As large-scale unsupervised representations such as BERT and ELMo improve downstream performance on a wide range of natural language tasks devlin2019bidirectional; peters2018deep; radford2019language, what these models learn about language remains an open scientific question. An emerging body of work investigates this question through probes, supervised models trained to predict a property (like part-of-speech tags) from a constrained view of the representation. Probes trained on various representations have obtained high accuracy on tasks requiring part-of-speech and morphological information belinkov2017what, syntactic and semantic information peters2018dissecting; tenney2018what, among other properties conneau2018what, providing evidence that deep representations trained on large datasets are predictive of a broad range of linguistic properties.
But when a probe achieves high accuracy on a linguistic task using a representation, can we conclude that the representation encodes linguistic structure, or has the probe just learned the task? Probing papers tend to acknowledge this uncertainty, putting accuracies in context using random representation baselines zhang2018language and careful task design hupkes2018visualisation. Even so, as long as a representation is a lossless encoding, a sufficiently expressive probe with enough training data can learn any task on top of it.
In this paper, we propose control tasks, which associate word types with random outputs, to give intuition for the expressivity of probe families and provide insight into how representation and probe interact to achieve high task accuracy.
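Concretely, the core of a control task can be sketched as a fixed random mapping from word types to output labels, so that a probe can succeed only by memorizing the per-type assignment. This is a minimal hypothetical sketch, not the paper's exact construction, whose details may differ; the function name `make_control_task` is illustrative:

```python
import random

def make_control_task(vocab, labels, seed=0):
    """Assign each word type a label sampled at random, fixed per type.

    Because the label depends only on the word type (not on context or
    linguistic structure), a probe can only solve this task by memorizing
    the type-to-label map.  (Illustrative sketch; the paper's control
    tasks may constrain the sampling further.)
    """
    rng = random.Random(seed)
    return {w: rng.choice(labels) for w in vocab}

# Every token of a given type receives the same random label:
control = make_control_task(["the", "cat", "sat", "on", "mat"], ["A", "B", "C"])
```

The mapping is deterministic given the seed, so train and test tokens of the same type agree, which is what lets an expressive probe attain high control task accuracy by memorization alone.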
Control tasks are based on the intuition that the more a probe is able to make task output decisions independently of the linguistic properties of a representation, the less its accuracy on a linguistic task necessarily reflects the properties of the representation. Thus, a good probe (one that provides insights into the linguistic properties of a representation) should be what we call selective, achieving high linguistic task accuracy and low control task accuracy (see Figure 2).
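Operationally, this notion of selectivity can be read as the gap between the two accuracies; a small sketch of that reading (the function name is illustrative):

```python
def selectivity(linguistic_acc, control_acc):
    # Selectivity read as the gap between linguistic task accuracy
    # and control task accuracy: a selective probe scores high on the
    # former and low on the latter.
    return linguistic_acc - control_acc
```

Under this reading, a probe whose control task accuracy nearly matches its linguistic task accuracy has low selectivity, and its linguistic accuracy tells us little about the representation itself.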
We show that selectivity can be a guide in designing probes and interpreting probing results, complementary to random representation baselines; as of now, there is little consensus on how to design probes. Early probing papers used linear functions shi2016does; ettinger2016probing; alain2016understanding, which are still used bisazza2018lazy; liu2019linguistic, but multi-layer perceptron (MLP) probes are at least as popular belinkov2017what; conneau2018what; adi2017finegrained; tenney2018what; ettinger2018assessing. Arguments have been made for “simple” probes, e.g., that we want to find easily accessible information in a representation liu2019linguistic; alain2016understanding. As a counterpoint though, “complex” MLP probes have also been suggested since useful properties might be encoded non-linearly conneau2018what, and they tend to report similar trends to simpler probes anyway belinkov2017what; qian2016investigating.
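The two probe families under discussion can be sketched in a few lines. This is a pure-Python toy with made-up dimensions (real probes are trained on high-dimensional contextual vectors, e.g., ELMo's); the function names and the parameter shapes are illustrative assumptions:

```python
import random

rng = random.Random(0)
d, k, h = 8, 3, 2  # toy sizes: representation dim, label count, MLP hidden units

def affine(x, W, b):
    # W is a list of columns (one per output unit), b the per-unit bias.
    return [sum(xi * wij for xi, wij in zip(x, col)) + bj
            for col, bj in zip(W, b)]

def linear_probe(x, W, b):
    # Linear probe: a single affine map from representation to label scores.
    return affine(x, W, b)

def mlp_probe(x, W1, b1, W2, b2):
    # MLP probe: a ReLU hidden layer adds expressivity over the linear probe.
    hidden = [max(0.0, z) for z in affine(x, W1, b1)]
    return affine(hidden, W2, b2)

# Untrained random parameters stand in for fitted ones:
x = [rng.gauss(0, 1) for _ in range(d)]
W = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(k)]
scores = linear_probe(x, W, [0.0] * k)
```

The extra nonlinearity is what makes the MLP family more expressive, and hence, on this paper's account, better able to memorize a control task.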
We define control tasks corresponding to English part-of-speech tagging and dependency edge prediction, and use ELMo representations to conduct a broad study of probe families, hyperparameters, and regularization methods, evaluating both linguistic task accuracy and selectivity. We propose that selectivity be used for building intuition about the expressivity of probes and the properties of models, putting probing accuracies into richer context. We find that:
With popular hyperparameter settings, MLP probes achieve very low selectivity, suggesting caution in interpreting how their results reflect properties of representations. For example, on part-of-speech tagging, the MLP's control task accuracy is nearly as high as its linguistic task accuracy, resulting in low selectivity.
Linear and bilinear probes achieve relatively high selectivity across a range of hyperparameters. For example, a linear probe on part-of-speech tagging achieves similar linguistic task accuracy to the MLP but far lower control task accuracy, yielding much higher selectivity. This suggests that the small accuracy gain of the MLP may be explained by increased probe expressivity.
The most popular method for controlling probe complexity, dropout, does not consistently lead to selective MLP probes. However, controlling MLP complexity through unintuitively small (10-dimensional) hidden states, as well as small training sample sizes and weight decay, leads to higher selectivity and similar linguistic task accuracy.