A Systematic Analysis of Morphological Content in BERT Models for Multiple Languages

04/06/2020
by Daniel Edmiston et al., The University of Chicago

This work describes experiments which probe the hidden representations of several BERT-style models for morphological content. The goal is to examine the extent to which discrete linguistic structure, in the form of morphological features and feature values, presents itself in the vector representations and attention distributions of pre-trained language models for five European languages. The experiments contained herein show that (i) Transformer architectures largely partition their embedding space into convex sub-regions highly correlated with morphological feature value, (ii) the contextualized nature of transformer embeddings allows models to distinguish ambiguous morphological forms in many, but not all cases, and (iii) very specific attention head/layer combinations appear to home in on subject-verb agreement.


1 Introduction

This work describes and reports on experiments designed to probe the contextualized word embeddings and attention distributions of several BERT-style Transformer models for discrete morphological structure. The experiments focus on the level of morphological features and feature values, testing the extent to which these discrete structures are evident in the hidden-layer vectors and attention distributions produced by the models. By experimenting with five different languages, each somewhat different typologically, we hope to provide a general picture of how Transformer architectures model this aspect of morphological information.

To investigate the hidden representations of the models, we perform two types of experiment. For the first type, we perform a series of classification tasks on hidden-layer vector representations, attempting to predict the morphological feature values of contextualized representations. The second type of experiment consists of a task which probes self-attention distributions for what linguists call an agree relationship, comparing the proportions of a sentence’s attention distributions allocated to words which agree for some morphological feature value.

By focusing on morphological information, this work complements much recent work devoted to probing the hidden representations of BERT-style models for syntactic and semantic information. We contend that analysis at the level of morphological feature is particularly useful for evaluating linguistic information within embeddings for three reasons. First, morphological features represent a tangible aspect of meaning for which it is relatively simple to obtain large amounts of quality gold-standard annotation in many languages. Second, by evaluating at the level of morphological feature, experiments are less susceptible to models using heuristics to learn surface patterns [McCoyEtAl2019], testing instead whether they've generalized to the underlying cause of those patterns. Finally, certain morphological features contain aspects of meaning which are vitally important for real-world tasks. For instance, the gender feature is inextricably linked to coreference resolution in many languages, and the morphological feature of mood contains discourse information which is necessary for natural language understanding, for example distinguishing between commands (imperative mood) and statements (indicative mood).

The contributions of this paper are therefore the following: we show that (i) BERT-style architectures are capable of encoding morphological information in their hidden vector representations at the featural level, and do so by dividing the embedding space into convex (i.e. linearly separable) sub-regions, (ii) that the contextualized nature of embeddings aids models' ability to disambiguate morphologically ambiguous forms, but doesn't solve the problem, and (iii) by introducing a score based on Pearson's χ²-test for investigating attention distributions, we show that localized regions in the layer/attention-head space reflect subject-verb agreement.

2 Related Work

With the recent success of Transformer-style architectures [VaswaniEtAl2017] like BERT [DevlinEtAl2018] on many natural language processing tasks, a considerable amount of research has gone into investigating the inner workings of these models, a research program sometimes dubbed "BERTology" [RogersEtAl2020]. Among this literature, work has focused on syntactic aspects [HewittManning2019, CoenenEtAl2019, KimEtAl2020], including subject-verb agreement [Goldberg2019], and also various semantic aspects such as semantic role and model predictions' correlation with human judgment [Ettinger2020].

One particular method of probing the information in these large architectures is to perform different tasks at different layers of the model, seeking to identify where different types of linguistic information may reside [TenneyEtAl2019a]. It has generally been shown that more local, shallow information is reflected in lower layers, and richer, more abstract information in higher layers [PetersEtAl2018b]. Not only have layers been shown to specialize for content; ClarkEtAl2019 further demonstrated the diffusion of linguistic knowledge through such models by showing that BERT's different attention heads learn to focus on different aspects of linguistic meaning.

In addition to the growing literature on BERT-style models, work on evaluating continuous embedding models for morphological information goes back some years. In particular, BelinkovEtAl2017 trained classifiers on word representations extracted from machine-translation models to assess what those models learn about morphology. For work investigating models for agree-phenomena, LinzenEtAl2016 showed that LSTM architectures [HochreiterSchmidhuber1997] successfully model subject-verb agreement in many instances (see also GiulianelliEtAl2018), and RavfogelEtAl2018 put forth the objective of modelling agreement in Basque as a potential baseline for future work. Finally, for the work most similar to ours in investigating morphological information at the featural level, BasiratTang2018 train classifiers to distinguish nominal features in Swedish, and Kohn2015 does the same for more varied features and multiple languages.

This work takes its place in that literature by systematically addressing the question of morphological featural information in the hidden vector representations and attention heads of BERT models for multiple languages.

3 Methodology

3.1 Considered Languages and their Morphology

This study investigates five languages of the Indo-European language family: English, French, German, Russian, and Spanish, each of which inflects for some set of morphological features. Specifically, we investigate the morphological features of Case, Gender, Mood, Number, Person, Tense, and Verb Form (which is related to what is traditionally called Finiteness). For each language, we investigate a BERT-base model [DevlinEtAl2018] pre-trained on a large corpus for that language.[1] The languages, models, features, and each feature's values are organized in Table 1.[2]

[1] See DevlinEtAl2018 for details on the English model, MartinEtAl2019 for the French model, and KuratovArkhipov2019 for the Russian model.
[2] Abbreviations used in Table 1: Nom=Nominative, Acc=Accusative, Dat=Dative, Gen=Genitive, Loc=Locative, Ins=Instrumental, Masc=Masculine, Fem=Feminine, Neut=Neuter, Ind=Indicative, Imp=Imperative, Sub=Subjunctive, Cnd=Conditional, Sing=Singular, Plur=Plural, Pres=Present, Impr=Imperfect, Fut=Future, Fin=Finite, Inf=Infinitive, Ger=Gerund, Part=Participle, Conv=Converb.

Feature    | English    | French    | German | Russian | Spanish
Case       |     --     |    --     |  yes   |   yes   |   --
Gender     |     --     |   yes     |  yes   |   yes   |   yes
Mood       |    yes     |   yes     |  yes   |   yes   |   yes
Number     |    yes     |   yes     |  yes   |   yes   |   yes
Person     |    yes     |   yes     |  yes   |   yes   |   yes
Tense      |    yes     |   yes     |  yes   |   yes   |   yes
Verb Form  |    yes     |   yes     |  yes   |   yes   |   yes
Model      | Base-Cased | CamemBERT | DBMDZ  | RuBERT  | BETO

Table 1: Languages, the features investigated for each, and the models used. All models are BERT-base models, with 12 hidden layers, 12 attention heads, and 768-dimensional vectors. All models are available at https://huggingface.co/models.
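For concreteness, the models can be loaded through the transformers library. The following minimal sketch uses Hub identifiers that are plausible matches for the checkpoints named in Table 1; the identifiers themselves are our assumption, not taken from the paper.

```python
# A sketch of loading the five pre-trained checkpoints; the Hub identifiers
# are plausible matches for the models in Table 1, not confirmed by the paper.
from transformers import AutoModel, AutoTokenizer

MODEL_IDS = {
    "English": "bert-base-cased",
    "French": "camembert-base",
    "German": "bert-base-german-dbmdz-cased",
    "Russian": "DeepPavlov/rubert-base-cased",
    "Spanish": "dccuchile/bert-base-spanish-wwm-cased",  # BETO
}

models = {lang: (AutoTokenizer.from_pretrained(name),
                 AutoModel.from_pretrained(name, output_hidden_states=True))
          for lang, name in MODEL_IDS.items()}
```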

3.2 Experiments

3.2.1 Experiment 1: Classifying by Value

As BERT-base models are a 12-layer Transformer architecture, for each word in a sentence the model produces 13 vectors, including the input vector, each in 768 dimensions.[3] To test the amount of morphological information in a model's hidden vectors, for each layer we perform n-way classification tasks for each feature, where n is the number of feature values that the relevant feature can take (e.g. n=3 for Mood in German, the values being Indicative, Imperative, and Subjunctive).

[3] Strictly speaking, the model produces vectors for each token in a sentence. We derive word embeddings by taking the average of each word's constituent token embeddings.
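The extraction of per-word, per-layer vectors described above (and in footnote 3) can be sketched as follows, assuming the HuggingFace transformers library; the function and the illustrated model are ours, not the authors' released code.

```python
# A minimal sketch of per-word, per-layer embedding extraction with the
# HuggingFace transformers library; illustrative, not the authors' code.
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = BertModel.from_pretrained("bert-base-cased", output_hidden_states=True)
model.eval()

def word_embeddings(words):
    """Return a (13, num_words, 768) tensor: the input layer plus 12 hidden
    layers, each word vector the mean of its subword token vectors."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).hidden_states        # 13 tensors of (1, seq, 768)
    layers = torch.stack(hidden).squeeze(1)        # (13, seq, 768)
    out = []
    for i in range(len(words)):
        # Token positions belonging to word i (word_ids is None for [CLS]/[SEP]).
        idx = [t for t, w in enumerate(enc.word_ids()) if w == i]
        out.append(layers[:, idx, :].mean(dim=1))  # average subword vectors
    return torch.stack(out, dim=1)                 # (13, num_words, 768)

vecs = word_embeddings(["The", "men", "were", "tired"])
```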

The classification tasks are k-means clustering, a linear classifier, and a non-linear classifier, the latter amounting to a 3-layer neural network with ReLU activations. For the k-means task, the best score is taken from amongst ten runs with different centroid seeds. (Weighted) F1 scores are calculated for each experiment.
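A rough sketch of the three tasks on a single layer's vectors, assuming scikit-learn; hyperparameters such as the network's hidden-layer sizes and the cluster-to-label mapping used to score k-means are assumptions, as the paper does not specify them.

```python
# A sketch of the three classification tasks on one layer's vectors, assuming
# scikit-learn; X is (num_examples, 768), y holds integer labels
# 0..n_values-1. Hidden sizes and iteration counts are assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

def score_layer(X, y, n_values):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.15)

    # Unsupervised task: best weighted F1 over ten k-means centroid seeds,
    # mapping each cluster to the majority gold value among its members.
    km_scores = []
    for seed in range(10):
        clusters = KMeans(n_clusters=n_values, random_state=seed).fit_predict(X)
        preds = np.empty_like(y)
        for c in range(n_values):
            mask = clusters == c
            if mask.any():
                preds[mask] = np.bincount(y[mask]).argmax()
        km_scores.append(f1_score(y, preds, average="weighted"))

    # Supervised tasks: a linear classifier and a 3-layer ReLU network.
    lin = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    nn = MLPClassifier(hidden_layer_sizes=(768, 768, 768),
                       activation="relu").fit(X_tr, y_tr)
    return (max(km_scores),
            f1_score(y_te, lin.predict(X_te), average="weighted"),
            f1_score(y_te, nn.predict(X_te), average="weighted"))
```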

3.2.2 Experiment 2: Subject-Verb Agreement in Attention

The second experiment tests whether what linguists call agreement presents itself in the attention coefficients produced by BERT-style models when embedding sentences. Agreement is a syntactic phenomenon in which two syntactic constituents in a certain relationship show agreement for some morphological feature. An example from English showing agreement for Number is below.

The men were tired after a hard day’s work.

In more morphologically rich languages like French, the agree relation can easily encompass every word in a sentence.

Les grands garçons sont tous allés
the.plur tall.plur boy.plur are.plur all.plur left.plur

The tall boys all left

The question is how to investigate the output of an attention head for some layer, call it an attention-matrix, for awareness of this agree relation. Considering that an attention-matrix for a sentence can be interpreted as a sequence of probability distributions over words, we can intuit that an attention-matrix reflects the agree relation if the attention distributions for the words inside the agree relation place a disproportionate amount of probability mass on the other words in the agree relation, and those words outside the agree relation do not.

Then given such an example sentence showing the relevant agree relation, we can quantify the extent to which some attention-matrix from a BERT model reflects this relation using a method based on the χ²-test in the following way. Given a sentence, partition its words into an Agree-set and an Out-set, thus partitioning the matrix of attention distributions into Agree-distributions and Out-distributions.

For each attention distribution in the Agree-distributions, calculate the Pearson χ²-score, where the two possible outcomes are Agree and Out;[4] then calculate the average χ²-score for the distributions in the Agree-set. Repeat the process for the distributions in the Out-set. Per our intuition that an attention-matrix reflects the agree relation insofar as it allots a disproportionate amount of probability mass to words which participate in agree for distributions of words in agree, and does not do so otherwise, a high Agree-score paired with a relatively low Out-score means an attention-matrix focuses probability mass disproportionately between words which participate in the agree relation. The point of considering the Agree-score against the Out-score is to account for the contingency in which words participating in agree are particularly salient for reasons other than agree.[5]

[4] That is, for each distribution calculate the Pearson χ²-score for the probability mass allotted to words in the Agree-set vs. the probability mass allotted to words in the Out-set.

[5] For all attention-matrices, the diagonals were set to 0 and the probability distributions renormalized to sum to 1. This was done to account for the fact that heads have a tendency to focus a large amount of mass on the diagonal, which would skew towards higher Agree-scores for words in the Agree-set, and towards lower scores for words in the Out-set.
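The scoring procedure, including the diagonal handling of footnote 5, can be read as the following NumPy sketch; the expected χ² distribution (probability mass proportional to set size) is our assumption, as the text leaves it implicit.

```python
# One plausible NumPy implementation of the Agree/Out scores for a single
# attention-matrix; the expected distribution (probability mass proportional
# to set size) is an assumption, as the text leaves it implicit.
import numpy as np

def agree_out_scores(attn, agree_idx):
    """attn: (n, n) attention matrix whose rows are distributions over words.
    agree_idx: positions of the words in the Agree-set.
    Returns (average Agree-score, average Out-score)."""
    n = attn.shape[0]
    A = attn.astype(float)
    np.fill_diagonal(A, 0.0)                  # zero the diagonal (footnote 5)...
    A /= A.sum(axis=1, keepdims=True)         # ...and renormalize each row
    in_agree = np.zeros(n, dtype=bool)
    in_agree[list(agree_idx)] = True

    # Pearson chi-squared score per row, with two outcomes: Agree vs. Out.
    expected = np.array([in_agree.mean(), 1.0 - in_agree.mean()])
    observed = np.stack([A[:, in_agree].sum(axis=1),
                         A[:, ~in_agree].sum(axis=1)], axis=1)
    chi2 = ((observed - expected) ** 2 / expected).sum(axis=1)
    return chi2[in_agree].mean(), chi2[~in_agree].mean()
```

An attention-matrix is then taken to reflect agree to the extent that the first returned value is high while the second stays near zero.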

3.3 Data Sets

All data used for the experiments were collected from a collection of Universal Dependency (UD) treebanks [NivreEtAl2016] and UD-compatible lexicons [Sagot2018]. See Appendix A for specifics on the treebanks used. For the classification tasks described above, for each language-feature combination, examples of the form (word, sentence, value) were extracted, the task being to embed the word in the context of its sentence in order to predict its value. For each such classification dataset, 750 examples for each value were sampled,[6] with .85/.15 train-test splits for the supervised tasks.

[6] The only exceptions were the Mood feature in French and Spanish, for which there was insufficient data to extract 750 examples of the imperative. As such, the French Mood dataset consisted of only 249 examples for each value, and the Spanish Mood dataset of only 381 examples for each value.
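The extraction of (word, sentence, value) examples can be sketched with the third-party conllu package as follows; field names follow the CoNLL-U standard, while the function and its details beyond the 750-per-value cap are illustrative assumptions.

```python
# A sketch of extracting (word, sentence, value) examples for one feature from
# a UD treebank with the `conllu` package; the sampling cap mirrors the 750
# examples per value described above, everything else is illustrative.
import random
from conllu import parse_incr

def extract_examples(conllu_path, feature, per_value=750):
    by_value = {}
    with open(conllu_path, encoding="utf-8") as f:
        for sent in parse_incr(f):
            text = " ".join(tok["form"] for tok in sent)
            for tok in sent:
                value = (tok["feats"] or {}).get(feature)
                if value is not None:
                    by_value.setdefault(value, []).append((tok["form"], text, value))
    # Sample up to per_value examples for each feature value (cf. footnote 6).
    return {v: random.sample(xs, min(per_value, len(xs)))
            for v, xs in by_value.items()}

examples = extract_examples("de_gsd-ud-train.conllu", "Case")
```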

An important aspect of morphology in Indo-European languages like those chosen for this study is both the number of values a feature can take (call it feature-length) and the amount of syncretism they display [BaermanEtAl2005], that is, how likely the forms of a language are to be ambiguous for value. For instance, the definite determiner 'der' in German can be nominative, dative, or genitive in Case value depending on context. The amount of such ambiguous forms is an important consideration when classifying.

Confound \ Language     | English | French | German | Russian | Spanish
Pct. of ambiguous forms |  18.1%  | 10.3%  | 26.0%  |  14.1%  |  6.1%
Avg. feature-length     |   2.6   |  3.0   |  2.86  |  3.43   |  3.17
Table 2: Percentage of ambiguous examples across all features for each language, as well as average feature-length for relevant features.

Table 2 describes the statistics of the datasets for each language with regard to the percentage of ambiguous examples and average feature length. Intuitively, one would expect performance to be negatively correlated with feature-length and confounds such as ambiguity. This expectation is borne out in Section 4.1.3.

For the agree-related tasks discussed in Section 3.2.2, examples of subject-verb agreement were collected for English, French, and German. For English, only noun-verb pairs agreeing for the Number feature and marked with the nsubj dependency between noun and verb were chosen, and only when one such dependency was present in the sentence. For French and German, which each show richer agreement phenomena than English, examples of subject-verb agreement were chosen such that the subject was of the form Det-Adj-Noun (or Det-Noun-Adj), and all words agreed for the Number feature, again with the nsubj dependency and the one-example-per-sentence criterion holding. For the English and German treebanks, 2,000 examples were extracted, and 1,521 examples were extracted from the French treebanks.
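For the English case, the selection criteria translate into a filter like the following conllu-based sketch; everything beyond the stated criteria (exactly one nsubj per sentence, head and dependent sharing a Number value) is an assumption.

```python
# A sketch of collecting English agree examples: sentences with exactly one
# nsubj dependency whose head and dependent share a Number value. A fuller
# version would also check the noun/verb POS tags stated in the text.
from conllu import parse_incr

def number(tok):
    return (tok["feats"] or {}).get("Number")

def agree_examples(conllu_path):
    examples = []
    with open(conllu_path, encoding="utf-8") as f:
        for sent in parse_incr(f):
            by_id = {t["id"]: t for t in sent if isinstance(t["id"], int)}
            nsubjs = [t for t in by_id.values() if t["deprel"] == "nsubj"]
            if len(nsubjs) != 1:          # one-example-per-sentence criterion
                continue
            subj = nsubjs[0]
            verb = by_id.get(subj["head"])
            if verb is not None and number(subj) and number(subj) == number(verb):
                examples.append((sent, subj["id"], verb["id"]))
    return examples
```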

4 Results

4.1 Results on Classification Tasks

4.1.1 Results by Feature

Table 3 displays (weighted) F1 scores for each language for each relevant feature and for each classification task. The scores in the table are averaged over all layers. Random baselines for this task are found in Table 8 in Appendix B.[7]

[7] Code to reproduce results can be found at https://github.com/danedmiston/morphology_classifiers.

Feature \ Task |     English     |     French      |     German      |     Russian     |     Spanish     |     Average
               |  KM   Lin   NN  |  KM   Lin   NN  |  KM   Lin   NN  |  KM   Lin   NN  |  KM   Lin   NN  |  KM   Lin   NN
Case           |  --    --   --  |  --    --   --  | 0.25 0.87 0.88  | 0.15 0.84 0.86  |  --    --   --  | 0.2  0.86 0.87
Gender         |  --    --   --  | 0.5  0.96 0.96  | 0.32 0.9  0.91  | 0.28 0.86 0.87  | 0.48 0.97 0.97  | 0.39 0.92 0.93
Mood           | 0.6  0.98 0.98  | 0.3  0.98 0.96  | 0.41 0.91 0.85  | 0.28 0.99 0.99  | 0.26 0.9  0.88  | 0.37 0.95 0.93
Number         | 0.46 0.97 0.97  | 0.49 0.97 0.97  | 0.48 0.93 0.92  | 0.48 0.92 0.93  | 0.46 0.99 0.99  | 0.47 0.96 0.95
Person         | 0.32 0.96 0.95  | 0.33 0.97 0.97  | 0.39 0.96 0.95  | 0.25 0.99 0.96  | 0.31 0.93 0.93  | 0.32 0.96 0.95
Tense          | 0.64 1.0  0.99  | 0.25 0.99 0.99  | 0.55 0.97 0.97  | 0.34 0.97 0.96  | 0.21 0.98 0.96  | 0.4  0.98 0.97
VerbForm       | 0.2  0.88 0.87  | 0.39 1.0  0.96  | 0.33 0.97 0.94  | 0.2  0.99 0.99  | 0.33 0.98 0.98  | 0.29 0.96 0.95
Average        | 0.44 0.96 0.95  | 0.38 0.98 0.97  | 0.39 0.93 0.92  | 0.28 0.94 0.94  | 0.34 0.96 0.95  |

Table 3: Weighted F1 scores for each language and feature; scores averaged across all layers. KM=K-means clustering, Lin=linear classifier, NN=3-layer neural network with ReLU activations. Bold reflects the highest score in each column; red indicates a score at or below the random baseline (see Table 8). '--' marks features a language does not inflect for.

The results indicate that each language’s model reflects a great deal of morphological information at the featural level. However, it appears that supervision aids tremendously in extracting this information; the K-Means clustering scores in most cases fall very near the random baseline scores for the same task, and in some cases below. Meanwhile, in the case of the linear and non-linear classifiers, average performance of the pre-trained models significantly surpasses the random baselines.

Furthermore, the fact that linear classifiers routinely returned F1 scores above 0.9, and in two cases perfect scores, strongly suggests that pre-trained models are partitioning their embedding spaces into convex regions correlated with morphological feature value; the additional power of the non-linear neural network model did not significantly improve performance in most cases. The results further suggest that certain morphological features may be better captured than others. In the linear case, when compared to random baselines the two best performing features on average were VerbForm and Case; Mood and Person were the two worst. In terms of overall performance, the Tense feature was classified the most faithfully, and the Case feature the least. Here, the percentage of ambiguous forms likely played a role; see Section 4.1.3. Further discussion follows in Section 5.

4.1.2 Results by Layer

Above, we averaged over the layers to get a sense of how well these models were reflecting different morphological features in their vector representations. This section presents results where the scores are averaged over the features and presented for each layer.

Figure 1: Layer-wise weighted F1 scores for linear (left) and neural network (right) classifiers averaged over all features.

A visual inspection of the results from the linear classifier in Figure 1 suggests that morphological information is best captured in the middle-to-late layers in German and Russian, with solid performance throughout for English, French, and Spanish. Speculation as to the reason for this can be found in Section 5. The results of the non-linear classifier are more varied, but roughly coincide with those of the linear classifier. See Appendix B for a full report of layer-wise scores (Table 9), as well as random baseline scores for the same task (Table 10).

4.1.3 Effects of Complexity

As mentioned in Section 3.3, not only do the languages involved display ambiguous morphological forms, presumably making classification more difficult, but certain features in certain languages also present the challenge of having a large number of possible values, as many as six in the case of the Russian Case feature. This section presents results suggesting that these factors do indeed make classification more difficult: in most cases, an increase in the percentage of ambiguous forms is negatively correlated with classification performance, as is an increase in the number of values a feature may take.

Language           |    English     |     French     |     German     |    Russian     |    Spanish
Correlation \ Test | Perc.  #-Vals. | Perc.  #-Vals. | Perc.  #-Vals. | Perc.  #-Vals. | Perc.  #-Vals.
Spearman           | -0.80  -0.89   | -0.68   0.6    | -0.75  -0.54   | -0.89  -0.11   | -0.09  -0.25
Pearson            | -0.86  -0.93   | -0.79   0.58   | -0.84  -0.62   | -0.87  -0.45   | -0.06  -0.33
Table 4: Spearman and Pearson correlation scores between performance on a feature (as measured by the linear classifier in Table 3) and the percentage of ambiguous entries/number of possible values for that feature. Light blue indicates statistical significance with p-value below 0.1; dark blue indicates statistical significance with p-value below 0.05.

The results in Table 4 show that ambiguous forms predictably make classification more difficult in all cases, though the effect is less pronounced in Spanish, which displays very little morphological ambiguity. Likewise, as the number of possible values a feature may take increases, performance suffers in all cases except French, which was the strongest-performing language and performed well on all tasks.

While the percentage of ambiguity for a feature is negatively correlated with weighted F1 performance, the depth of BERT-style models does go some way to alleviating this problem, as classification shows a general upward trend through the layers, peaking by the middle layers.

Figure 2: Per-layer performance on the Case feature for the linear classifier for German (left) and Russian (right). Both languages show their strongest performance in the middle layers.

Figure 2 shows the per-layer performance of German and Russian on n-way ambiguous subsets of the Case feature, with forms in German being up to 4-way ambiguous and in Russian up to 5-way ambiguous. In all cases (except where ambiguity=1 in German, meaning the word is unambiguous) there is a pronounced upward trend in performance through the layers, suggesting that BERT is able to make use of context to disambiguate forms for morphological feature value. However, it is worth keeping in mind that the degree of ambiguity is still negatively correlated with performance, suggesting that BERT is far from human-like performance, in spite of its contextualized nature.

4.2 Results on Agree Task

As described in Section 3.2.2, an attention-matrix's reflection of the agree relation can be conveyed by two quantities: the Agree-score and the Out-score. Recalling the definitions of the Agree-score and Out-score as the average χ²-scores for distributions in the Agree-set and the Out-set respectively, a high ratio of Agree-score to Out-score signifies that an attention-matrix focuses a disproportionately high amount of probability mass between words which agree, while placing no such disproportionate mass on words which do not participate in agree.

Head=1 Head=2 Head=3 Head=4 Head=5 Head=6 Head=7 Head=8 Head=9 Head=10 Head=11 Head=12
Layer=1 0.02 0.03 0.01 0.21 3.32 0.02 0.02 0.01 0.02 2.52 0.03 0.01
Layer=2 0.04 0.47 1.11 0.1 0.04 0.03 0.03 0.03 0.05 1.93 0.5 0.02
Layer=3 0.09 4.03 0.02 0.03 0.03 0.04 0.39 0.84 0.03 0.03 0.37 0.03
Layer=4 0.05 0.25 0.03 0.03 0.03 0.04 0.04 0.05 0.04 0.32 0.06 0.03
Layer=5 0.03 0.04 0.02 3.59 0.04 0.03 0.04 0.03 0.04 0.03 0.26 0.19
Layer=6 0.62 0.03 0.03 0.04 0.36 0.05 0.06 0.03 0.03 0.03 0.03 0.57
Layer=7 0.04 0.12 0.04 0.02 0.04 0.03 0.04 0.03 0.04 0.03 0.39 0.03
Layer=8 0.3 0.03 0.05 0.1 0.04 0.03 0.16 0.04 0.03 0.06 0.03 0.08
Layer=9 0.03 0.03 0.92 0.05 0.04 0.08 1.21 0.05 0.04 0.02 0.08 0.04
Layer=10 0.09 0.11 0.99 0.04 0.02 0.05 0.04 0.04 0.03 0.07 0.03 0.08
Layer=11 0.04 0.07 0.04 0.03 0.04 0.61 0.03 0.04 0.04 0.02 0.19 0.03
Layer=12 0.03 0.03 0.79 0.02 0.04 0.12 0.04 0.03 0.03 0.03 0.03 0.04
Table 5: Average Agree-score (average χ²-score over distributions in the Agree-set) over the French Agree dataset for each head/layer combination. Light blue shading denotes that the value exceeds the χ²-score required for p < 0.1 with one degree of freedom (2.706).

Taking French as an initial example, Table 5 shows that a small number of head-layer combinations are focusing a large (and statistically significant) amount of attention on the agreeing set, while most others treat the agreeing set as would be expected by chance. The results in Tables 6 and 7 likewise show that the agree information is concentrated in few head-layer combinations in English and German, though the information is somewhat more diffuse and the scores higher than in French.

In all languages there are combinations which show an average Agree-score (i.e. average score on Agree-set distributions) which is statistically significant with a p-value below 0.1, and others which show an average Agree-score significant with a p-value below 0.05. Meanwhile, most head-layer combinations remain close to 0. In no language is there a significant average Out-score (full results are listed in Appendix B). The relatively high scores for the Agree-set vs. the Out-set, and the fact that the high Agree-scores are localized to a small number of head-layer combinations, suggest that certain head-layer combinations are in fact homing in on the agree relation in discriminating fashion; i.e. BERT-style pre-trained language models appear sensitive to subject-verb agreement.

Figure 3: Heatmaps for attention head-layer combinations for average agree-score for English (left), French (center), and German (right); bright spots indicate high agree-score.

Figure 3 visualizes the information in Tables 5-7, showing the heatmaps of head-layer combinations by average Agree-score for English, French, and German. In all cases, the agree information is spread over different heads but concentrated in a narrow band of layers, with the highest scores located in the early-to-middle layers.

Head=1 Head=2 Head=3 Head=4 Head=5 Head=6 Head=7 Head=8 Head=9 Head=10 Head=11 Head=12
Layer=1 0.03 1.03 0.04 2.11 0.07 0.18 0.29 0.02 0.07 0.08 0.03 0.11
Layer=2 0.03 0.46 1.2 1.93 4.98 0.05 0.09 0.21 0.04 0.05 0.05 6.24
Layer=3 0.1 1.22 0.02 6.88 0.59 0.06 0.04 0.12 0.05 0.05 0.08 0.13
Layer=4 0.06 0.05 0.08 0.04 0.44 0.1 0.07 0.06 0.03 0.37 0.15 6.79
Layer=5 0.07 0.05 0.44 0.19 0.06 0.94 0.14 0.04 0.08 0.61 0.27 0.04
Layer=6 0.03 6.35 0.17 0.09 0.19 0.4 0.31 0.16 1.14 0.7 0.24 0.04
Layer=7 0.05 3.17 0.21 0.27 0.39 1.98 0.26 0.28 0.42 4.97 0.19 0.05
Layer=8 2.61 0.17 0.31 0.45 0.5 0.06 0.21 0.04 0.17 0.16 0.08 0.08
Layer=9 0.15 0.1 0.05 0.19 0.3 0.05 0.18 0.1 0.72 0.58 0.1 0.07
Layer=10 0.09 0.08 0.72 0.23 0.87 0.06 0.07 0.09 0.14 0.18 0.08 0.1
Layer=11 0.07 0.14 0.06 0.06 0.08 0.06 0.12 0.15 0.16 0.15 0.05 0.08
Layer=12 0.05 0.06 0.05 0.09 0.04 0.05 0.04 0.06 0.03 0.04 0.06 0.07
Table 6: Average Agree-score (average χ²-score over distributions in the Agree-set) over the English Agree dataset for each head/layer combination. Light blue shading denotes that the value exceeds the χ²-score required for p < 0.1 with one degree of freedom (2.706). Dark blue shading denotes that the value exceeds the χ²-score required for p < 0.05 (3.841).
Head=1 Head=2 Head=3 Head=4 Head=5 Head=6 Head=7 Head=8 Head=9 Head=10 Head=11 Head=12
Layer=1 0.02 0.17 0.03 0.03 0.01 0.43 0.04 0.04 0.03 3.11 0.02 1.21
Layer=2 0.05 0.05 0.03 0.06 0.03 0.07 3.39 0.03 0.06 0.02 0.08 0.67
Layer=3 3.25 0.03 0.02 4.42 0.58 0.56 8.55 0.05 0.03 0.02 0.09 0.18
Layer=4 8.55 8.55 8.61 0.21 0.25 1.87 1.15 0.02 0.05 8.56 1.02 0.24
Layer=5 0.4 0.04 3.84 0.12 0.37 1.49 5.01 0.05 5.92 0.05 0.77 0.03
Layer=6 0.06 2.05 0.7 1.36 0.02 1.91 0.76 0.07 2.76 0.3 0.03 3.71
Layer=7 0.99 0.09 0.49 0.03 0.02 0.03 0.03 0.81 0.14 3.49 0.09 0.45
Layer=8 0.35 0.04 0.05 0.03 0.04 0.04 0.04 0.05 1.89 0.04 0.5 0.08
Layer=9 0.08 0.16 0.04 0.06 0.1 0.2 0.04 0.34 0.43 0.03 0.11 0.28
Layer=10 0.03 0.3 0.04 0.04 0.04 0.17 0.06 0.06 0.08 0.17 0.05 0.05
Layer=11 0.03 0.03 0.03 0.05 0.81 0.03 0.04 0.04 0.03 0.06 0.05 0.04
Layer=12 0.05 0.03 0.05 0.06 0.04 0.06 0.04 0.04 0.05 0.04 0.19 0.04
Table 7: Average Agree-score (average χ²-score over distributions in the Agree-set) over the German Agree dataset for each head/layer combination. Light blue shading denotes that the value exceeds the χ²-score required for p < 0.1 with one degree of freedom (2.706). Dark blue shading denotes that the value exceeds the χ²-score required for p < 0.05 (3.841).

5 Discussion

Given the overall strong results on all languages for the classification task, it would appear that we can answer in the affirmative that BERT-style models are sensitive to morphological information at the featural level. Furthermore, this information appears to be encoded by models partitioning their space into convex sub-regions by feature value, as feature values are largely recoverable by a linear classifier. At the same time, supervision appears to aid significantly in extracting this morphological information, as initial attempts at unsupervised classification via a K-Means clustering task resulted in scores near the random baselines.

However, in spite of the success of the (supervised) classification tasks, there is room for improvement. Specifically, while syncretic forms are clearly not a problem for human language users, who effortlessly use context to parse the correct featural values from ambiguous forms, the same cannot be said for the models discussed here. The results in Table 4 show ambiguity is negatively correlated with performance on classification, and to a significant degree in many cases. Thus, while the introduction of contextualized information into word embeddings no doubt helps to distinguish ambiguous forms (as shown in Figure 2), BERT-style models have not solved the problem, at least not given the type of classifier examined in this work.

With regard to the location of morphological information in these models, it is typically assumed that in multi-layered models such as ELMo [PetersEtAl2018a] and BERT, relatively "shallow" and local information is housed in the early layers, while information becomes more abstract and non-local as it progresses through the layers. This roughly follows a traditional NLP pipeline [TenneyEtAl2019a], and indeed standard linguistic thought on the process of sound-to-meaning transduction. The results of the classification experiments shown here are therefore somewhat surprising at first glance, with layer being relatively uncorrelated with performance for English, French, and Spanish. One would expect morphological information of the kind necessary for feature-value classification to reside mostly in the middle-to-late layers.

However, German and Russian do in fact show peak overall performance in the middle layers (Figure 1), and the middle-to-late layers show the best ability to disambiguate syncretic forms (Figure 2). This suggests that morphological information of the type considered here does reside mostly in these layers, in line with the findings of PetersEtAl2018b and TenneyEtAl2019a. We speculate that the exceptional performance of English, French, and Spanish throughout all layers is due to their simple morphological paradigms (relative to German and Russian), making morphological values more predictable from low-level information like orthography, and to more fixed word-order syntax potentially making morphological information more recoverable from later layers.

The results from the Agree experiments likewise show that pre-trained BERT-style architectures are sensitive to morphological information. The agree information examined in these experiments is encoded in the attention coefficients of certain head-layer combinations which focus on the agree relation, presumably passing morphological information between words in a sentence which stand in the relevant relation. This result further adds to the evidence that different head-layer combinations specialize for different types of linguistic information (see ClarkEtAl2019). Finally, the heatmaps in Figure 3 tentatively suggest that this information is best reflected in the early-to-middle layers.

6 Conclusion

This work has sought to address the question of the amount of morphological feature information in pre-trained BERT-style models for multiple Indo-European languages, using (i) classification tasks, and (ii) a task designed to identify the agree relation in attention distributions. The results show that the models examined are sensitive to morphological information of the type considered, with experiments showing strong performance for each language on the (supervised) feature-value classification tasks, and also that certain attention heads learn to focus attention in a manner consistent with morphological agreement.

Furthermore, the findings here coincide with other work which suggests that morphological information may be best represented in the middle layers of deep contextualized language models like BERT. Given the results of this study, future work should include (i) identifying morphological information in BERT-style models in an unsupervised fashion, (ii) improving models’ ability to disambiguate syncretic forms for languages with complex inflectional morphology, and (iii) further exploration of how morphological information is shared between words via self-attention.

Acknowledgements

This paper has benefited from fruitful discussion with and helpful comments from John Goldsmith and Taeuk Kim.


Appendix A Treebank Details

For English, the following treebanks were used: the EWT treebank [SilveiraEtAl14], the GUM treebank [Zeldes2017], the LinES treebank [Ahrenberg2007], the English portion of ParTUT [BoscoEtAl2012], English-PUD [ZemanEtAL2018], and the English-Pronouns treebank [Munro2020]. For French, the following treebanks were used: the French Question Bank [JudgeEtAl2006], the GSD French treebank [NivreEtAl2016], the French portion of ParTUT, French-PUD, Sequoia [CanditoSeddah2012], and the French Spoken Treebank, adapted from the Rhapsodie prosodic-syntactic treebank [LacheretEtAl2014]. For German, the following treebanks were used: the HDT-UD treebank [VolkerEtAl2019] and the GSD German treebank. For Russian, the following treebanks were used: the GSD Russian treebank, Russian-PUD, the SynTagRus treebank [NivreEtAl2008], and the Taiga treebank [LyashevskayaEtAl2016]. For Spanish, the following treebanks were used: the AnCora treebank [TauleEtAl2008], the GSD Spanish treebank, and Spanish-PUD.

Appendix B Full Results

Table 8 contains weighted F1 scores for each language-feature combination where the embeddings are taken from randomly initialized, untrained models; it serves as the baseline reference against which Table 3 should be considered.

Feature \ Task |     English     |     French      |     German      |     Russian     |     Spanish     |     Average
               |  KM   Lin   NN  |  KM   Lin   NN  |  KM   Lin   NN  |  KM   Lin   NN  |  KM   Lin   NN  |  KM   Lin   NN
Case           |  --    --   --  |  --    --   --  | 0.21 0.6  0.64  | 0.12 0.28 0.29  |  --    --   --  | 0.16 0.44 0.47
Gender         |  --    --   --  | 0.44 0.62 0.61  | 0.25 0.48 0.52  | 0.26 0.4  0.43  | 0.43 0.74 0.73  | 0.34 0.56 0.58
Mood           | 0.46 0.84 0.84  | 0.17 0.69 0.65  | 0.35 0.74 0.8   | 0.2  0.87 0.92  | 0.18 0.65 0.59  | 0.27 0.76 0.76
Number         | 0.39 0.63 0.61  | 0.39 0.7  0.62  | 0.4  0.58 0.55  | 0.38 0.52 0.53  | 0.34 0.66 0.66  | 0.38 0.62 0.6
Person         | 0.4  0.94 0.94  | 0.33 0.79 0.8   | 0.27 0.74 0.74  | 0.27 0.79 0.79  | 0.24 0.73 0.63  | 0.3  0.8  0.78
Tense          | 0.49 0.76 0.76  | 0.21 0.66 0.65  | 0.44 0.76 0.7   | 0.29 0.53 0.52  | 0.2  0.68 0.64  | 0.33 0.68 0.65
VerbForm       | 0.2  0.51 0.56  | 0.22 0.57 0.56  | 0.27 0.58 0.53  | 0.2  0.46 0.45  | 0.2  0.47 0.47  | 0.22 0.52 0.51
Average        | 0.39 0.74 0.74  | 0.29 0.67 0.65  | 0.31 0.64 0.64  | 0.25 0.55 0.56  | 0.27 0.65 0.62  | 0.3  0.65 0.64
Table 8: Random baselines for weighted F1 scores for each language and feature; scores averaged across all layers. Compare against Table 3. '--' marks features a language does not inflect for.

Table 9 contains the full results for the layer-wise classification scores. Table 10 contains the random baselines against which to compare Table 9; that is, it reflects results from the same task, except that the input embeddings were taken from randomly initialized, untrained models.

Layer \ Task |     English     |     French      |     German      |     Russian     |     Spanish     |     Average
             |  KM   Lin   NN  |  KM   Lin   NN  |  KM   Lin   NN  |  KM   Lin   NN  |  KM   Lin   NN  |  KM   Lin   NN
Input        | 0.32 0.93 0.94  | 0.36 0.95 0.95  | 0.43 0.85 0.85  | 0.22 0.89 0.9   | 0.29 0.94 0.94  | 0.33 0.91 0.92
1            | 0.38 0.94 0.95  | 0.45 0.96 0.96  | 0.31 0.88 0.89  | 0.32 0.9  0.91  | 0.28 0.96 0.95  | 0.35 0.93 0.93
2            | 0.24 0.96 0.96  | 0.38 0.97 0.91  | 0.42 0.9  0.88  | 0.25 0.91 0.92  | 0.34 0.97 0.97  | 0.33 0.94 0.93
3            | 0.48 0.96 0.96  | 0.45 0.98 0.93  | 0.45 0.92 0.91  | 0.32 0.93 0.93  | 0.43 0.97 0.96  | 0.43 0.95 0.94
4            | 0.38 0.96 0.96  | 0.43 0.98 0.98  | 0.34 0.94 0.93  | 0.32 0.94 0.94  | 0.3  0.97 0.96  | 0.35 0.96 0.96
5            | 0.6  0.96 0.96  | 0.44 0.98 0.97  | 0.36 0.95 0.96  | 0.3  0.95 0.95  | 0.33 0.97 0.96  | 0.41 0.96 0.96
6            | 0.31 0.97 0.96  | 0.34 0.98 0.97  | 0.3  0.96 0.96  | 0.2  0.96 0.96  | 0.34 0.96 0.92  | 0.3  0.96 0.95
7            | 0.49 0.96 0.96  | 0.42 0.98 0.98  | 0.51 0.96 0.96  | 0.29 0.95 0.95  | 0.37 0.96 0.96  | 0.42 0.96 0.96
8            | 0.52 0.97 0.96  | 0.3  0.98 0.91  | 0.34 0.95 0.96  | 0.36 0.96 0.96  | 0.3  0.96 0.96  | 0.36 0.96 0.95
9            | 0.55 0.97 0.97  | 0.31 0.99 0.98  | 0.39 0.96 0.96  | 0.25 0.96 0.96  | 0.28 0.96 0.94  | 0.36 0.97 0.96
10           | 0.48 0.96 0.96  | 0.27 0.98 0.98  | 0.48 0.95 0.95  | 0.22 0.96 0.96  | 0.56 0.96 0.95  | 0.4  0.96 0.96
11           | 0.54 0.96 0.95  | 0.34 0.99 0.99  | 0.29 0.94 0.93  | 0.3  0.95 0.95  | 0.34 0.95 0.95  | 0.36 0.96 0.95
12           | 0.45 0.96 0.92  | 0.38 0.99 0.92  | 0.42 0.95 0.94  | 0.35 0.94 0.94  | 0.3  0.95 0.94  | 0.38 0.96 0.93
Average      | 0.44 0.96 0.96  | 0.38 0.98 0.96  | 0.39 0.93 0.93  | 0.28 0.94 0.94  | 0.34 0.96 0.95  | 0.37 0.95 0.95
Table 9: F1 scores by layer, averaged across all relevant features. Bold indicates the best score in each column; red indicates a score at or below the random baseline (see Table 10).
Layer \ Task |     English     |     French      |     German      |     Russian     |     Spanish     |     Average
             |  KM   Lin   NN  |  KM   Lin   NN  |  KM   Lin   NN  |  KM   Lin   NN  |  KM   Lin   NN  |  KM   Lin   NN
Input        | 0.42 0.77 0.78  | 0.28 0.72 0.69  | 0.3  0.68 0.69  | 0.23 0.6  0.61  | 0.25 0.7  0.71  | 0.3  0.69 0.7
1            | 0.45 0.76 0.78  | 0.31 0.71 0.69  | 0.34 0.68 0.67  | 0.25 0.59 0.59  | 0.29 0.7  0.67  | 0.33 0.69 0.68
2            | 0.43 0.75 0.77  | 0.33 0.71 0.67  | 0.32 0.67 0.66  | 0.29 0.58 0.6   | 0.22 0.68 0.66  | 0.32 0.68 0.67
3            | 0.43 0.75 0.77  | 0.3  0.7  0.71  | 0.29 0.65 0.65  | 0.25 0.57 0.55  | 0.23 0.68 0.69  | 0.3  0.67 0.67
4            | 0.42 0.76 0.78  | 0.26 0.68 0.68  | 0.31 0.65 0.67  | 0.24 0.56 0.57  | 0.23 0.68 0.55  | 0.29 0.67 0.65
5            | 0.34 0.74 0.75  | 0.36 0.68 0.67  | 0.31 0.65 0.67  | 0.22 0.55 0.55  | 0.26 0.66 0.67  | 0.3  0.66 0.66
6            | 0.35 0.74 0.74  | 0.24 0.68 0.67  | 0.3  0.63 0.63  | 0.22 0.55 0.58  | 0.29 0.65 0.58  | 0.28 0.65 0.64
7            | 0.36 0.73 0.72  | 0.32 0.67 0.65  | 0.34 0.63 0.67  | 0.3  0.54 0.57  | 0.26 0.65 0.58  | 0.31 0.64 0.64
8            | 0.38 0.73 0.74  | 0.3  0.65 0.64  | 0.33 0.64 0.61  | 0.22 0.54 0.55  | 0.28 0.62 0.62  | 0.3  0.64 0.63
9            | 0.35 0.72 0.74  | 0.29 0.64 0.65  | 0.32 0.62 0.62  | 0.24 0.52 0.56  | 0.28 0.62 0.53  | 0.3  0.62 0.62
10           | 0.38 0.72 0.72  | 0.3  0.63 0.62  | 0.28 0.62 0.65  | 0.25 0.51 0.53  | 0.28 0.62 0.5   | 0.3  0.62 0.6
11           | 0.35 0.7  0.73  | 0.26 0.63 0.58  | 0.31 0.61 0.56  | 0.23 0.5  0.54  | 0.3  0.61 0.53  | 0.29 0.61 0.59
12           | 0.38 0.71 0.68  | 0.26 0.62 0.63  | 0.35 0.6  0.62  | 0.25 0.51 0.49  | 0.28 0.61 0.57  | 0.3  0.61 0.6
Average      | 0.39 0.74 0.75  | 0.29 0.67 0.66  | 0.31 0.64 0.64  | 0.25 0.55 0.56  | 0.27 0.65 0.61  | 0.3  0.65 0.64
Table 10: Random baselines for F1 scores by layer, averaged across all relevant features. Compare against Table 9.

Tables 11-13 show English’s, French’s, and German’s Out-score for each layer-head combination.

Head=1 Head=2 Head=3 Head=4 Head=5 Head=6 Head=7 Head=8 Head=9 Head=10 Head=11 Head=12
Layer=1 0.05 0.18 0.03 0.29 0.09 0.07 0.1 0.03 0.07 0.05 0.07 0.08
Layer=2 0.06 0.28 0.21 0.24 0.56 0.08 0.07 0.26 0.07 0.11 0.11 0.57
Layer=3 0.11 0.26 0.04 0.6 0.16 0.15 0.06 0.09 0.09 0.08 0.09 0.08
Layer=4 0.11 0.1 0.07 0.04 0.25 0.07 0.11 0.13 0.04 0.14 0.14 0.7
Layer=5 0.15 0.08 0.32 0.14 0.08 0.2 0.11 0.09 0.12 0.15 0.15 0.07
Layer=6 0.04 0.65 0.24 0.13 0.18 0.13 0.16 0.28 0.22 0.18 0.18 0.12
Layer=7 0.09 0.49 0.18 0.16 0.26 0.3 0.13 0.2 0.2 0.52 0.15 0.07
Layer=8 0.31 0.18 0.12 0.14 0.17 0.07 0.22 0.08 0.14 0.14 0.14 0.19
Layer=9 0.2 0.12 0.09 0.13 0.17 0.06 0.16 0.1 0.15 0.16 0.1 0.12
Layer=10 0.1 0.15 0.18 0.25 0.17 0.1 0.1 0.11 0.11 0.12 0.08 0.12
Layer=11 0.07 0.11 0.12 0.14 0.11 0.12 0.12 0.14 0.1 0.09 0.09 0.09
Layer=12 0.1 0.1 0.08 0.08 0.08 0.11 0.09 0.11 0.06 0.08 0.11 0.09
Table 11: Average Out-score (average χ²-score over distributions not in the Agree-set) over the English Agree dataset for each head/layer combination.
Head=1 Head=2 Head=3 Head=4 Head=5 Head=6 Head=7 Head=8 Head=9 Head=10 Head=11 Head=12
Layer=1 0.02 0.05 0.02 0.08 0.45 0.02 0.03 0.02 0.03 0.38 0.03 0.02
Layer=2 0.09 0.09 0.13 0.09 0.08 0.05 0.06 0.06 0.06 0.13 0.11 0.05
Layer=3 0.13 0.89 0.05 0.06 0.05 0.09 0.1 0.18 0.06 0.06 0.1 0.07
Layer=4 0.09 0.13 0.05 0.07 0.06 0.07 0.08 0.1 0.09 0.1 0.08 0.08
Layer=5 0.07 0.08 0.05 0.57 0.08 0.07 0.08 0.06 0.09 0.05 0.09 0.08
Layer=6 0.1 0.08 0.05 0.08 0.09 0.06 0.07 0.08 0.07 0.06 0.06 0.17
Layer=7 0.08 0.09 0.06 0.04 0.09 0.06 0.1 0.08 0.08 0.05 0.08 0.08
Layer=8 0.09 0.07 0.1 0.07 0.07 0.08 0.12 0.11 0.06 0.09 0.07 0.08
Layer=9 0.08 0.06 0.22 0.06 0.08 0.07 0.15 0.09 0.08 0.06 0.09 0.08
Layer=10 0.08 0.09 0.15 0.09 0.06 0.08 0.09 0.09 0.07 0.08 0.06 0.07
Layer=11 0.09 0.08 0.09 0.07 0.09 0.12 0.09 0.09 0.1 0.07 0.09 0.09
Layer=12 0.06 0.08 0.07 0.06 0.06 0.07 0.07 0.07 0.08 0.07 0.08 0.08
Table 12: Average Out-score (average χ²-score over distributions not in the Agree-set) over the French Agree dataset for each head/layer combination. Light blue shading denotes that the value exceeds the χ²-score required for p < 0.1 with one degree of freedom (2.706).
Head=1 Head=2 Head=3 Head=4 Head=5 Head=6 Head=7 Head=8 Head=9 Head=10 Head=11 Head=12
Layer=1 0.047 0.107 0.055 0.043 0.041 0.202 0.032 0.044 0.037 0.248 0.045 0.205
Layer=2 0.07 0.133 0.378 0.102 0.163 0.282 0.414 0.296 0.115 0.145 0.101 0.174
Layer=3 0.388 0.847 0.586 0.562 0.419 0.386 0.659 0.667 0.134 0.547 0.314 0.141
Layer=4 0.548 0.55 0.575 0.591 1.309 0.841 1.251 0.606 1.276 0.652 0.454 0.555
Layer=5 0.233 0.086 0.48 1.053 0.484 0.502 0.522 0.316 0.465 0.485 0.461 0.548
Layer=6 0.348 0.407 0.414 0.301 0.083 0.436 0.278 0.307 0.455 0.105 0.212 0.375
Layer=7 0.234 0.178 0.396 0.218 0.08 0.103 0.097 0.297 0.12 0.439 0.412 0.463
Layer=8 0.293 0.158 0.227 0.196 0.347 0.211 0.329 0.202 0.361 0.081 0.249 0.185
Layer=9 0.258 0.223 0.109 0.422 0.182 0.278 0.22 0.252 0.303 0.137 0.143 0.163
Layer=10 0.291 0.324 0.254 0.218 0.233 0.331 0.293 0.273 0.306 0.3 0.324 0.325
Layer=11 0.21 0.222 0.25 0.304 0.304 0.206 0.224 0.267 0.262 0.346 0.251 0.268
Layer=12 0.388 0.293 0.479 0.497 0.209 0.322 0.29 0.246 0.281 0.405 0.221 0.273
Table 13: Average Out-score (average χ²-score over distributions not in the Agree-set) over the German Agree dataset for each head/layer combination.