The Low-Dimensional Linear Geometry of Contextualized Word Representations

by Evan Hernandez et al.

Black-box probing models can reliably extract linguistic features such as tense, number, and syntactic role from pretrained word representations. However, the manner in which these features are encoded in representations remains poorly understood. We present a systematic study of the linear geometry of contextualized word representations in ELMo and BERT. We show that a variety of linguistic features (including structured dependency relationships) are encoded in low-dimensional subspaces. We then refine this geometric picture, showing that there are hierarchical relations between the subspaces encoding general linguistic categories and more specific ones, and that low-dimensional feature encodings are distributed across many neurons rather than aligned to individual ones. Finally, we demonstrate that these linear subspaces are causally related to model behavior and can be used to perform fine-grained manipulation of BERT's output distribution.
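To make the idea of a low-dimensional linear encoding concrete, here is a minimal, self-contained sketch of a rank-constrained linear probe. This is not the paper's implementation: the factorization, synthetic data, and all hyperparameters below are illustrative assumptions. The probe's weight matrix is factored as A @ B, so the classifier can only read a feature through an r-dimensional subspace of the representation space.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, k, n = 64, 4, 3, 600  # embedding dim, probe rank, #classes, #examples

# Synthetic "contextual representations": the class signal lives in a random
# r-dimensional subspace of R^d, plus isotropic noise.
true_proj = rng.normal(size=(d, r))
centers = rng.normal(size=(k, r)) * 3.0
labels = rng.integers(0, k, size=n)
X = centers[labels] @ true_proj.T + rng.normal(size=(n, d))

# Factored probe: A projects into an r-dim subspace, B classifies inside it.
A = rng.normal(size=(d, r)) * 0.1
B = rng.normal(size=(r, k)) * 0.1

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

lr = 0.05
onehot = np.eye(k)[labels]
for _ in range(500):
    H = X @ A                # project representations into the subspace
    P = softmax(H @ B)       # class probabilities from the low-rank probe
    G = (P - onehot) / n     # gradient of mean cross-entropy w.r.t. logits
    B -= lr * H.T @ G
    A -= lr * X.T @ (G @ B.T)

acc = (softmax(X @ A @ B).argmax(axis=1) == labels).mean()
print(f"rank-{r} probe training accuracy: {acc:.2f}")
```

If a probe constrained to a small rank r recovers the feature about as well as a full-rank one, that is evidence the feature occupies a low-dimensional subspace; the subspace spanned by A's columns can then be inspected or intervened on.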

Related Papers

- Visualizing and Measuring the Geometry of BERT
- Convergent Learning: Do different neural networks learn the same representations?
- Exploring the Role of BERT Token Representations to Explain Sentence Probing Results
- Intrinsic Probing through Dimension Selection
- Pursuit of a Discriminative Representation for Multiple Subspaces via Sequential Games
- Counterfactual Interventions Reveal the Causal Effect of Relative Clause Representations on Agreement Prediction
- Asking without Telling: Exploring Latent Ontologies in Contextual Representations