Predicting Declension Class from Form and Meaning

05/01/2020 ∙ by Adina Williams, et al. ∙ University of York ∙ University of Cambridge ∙ Facebook ∙ NYU ∙ ETH Zurich ∙ Johns Hopkins University

The noun lexica of many natural languages are divided into several declension classes with characteristic morphological properties. Class membership is far from deterministic, but the phonological form of a noun and/or its meaning can often provide imperfect clues. Here, we investigate the strength of those clues. More specifically, we operationalize this by measuring how much information, in bits, we can glean about declension class from knowing the form and/or meaning of nouns. We know that form and meaning are often also indicative of grammatical gender—which, as we quantitatively verify, can itself share information with declension class—so we also control for gender. We find for two Indo-European languages (Czech and German) that form and meaning respectively share significant amounts of information with class (and contribute additional information above and beyond gender). The three-way interaction between class, form, and meaning (given gender) is also significant. Our study is important for two reasons: First, we introduce a new method that provides additional quantitative support for a classic linguistic finding that form and meaning are relevant for the classification of nouns into declensions. Second, we show not only that individual declension classes vary in the strength of their clues within a language, but also that these variations themselves vary across languages. The code is publicly available at https://github.com/rycolab/declension-mi.


1 Introduction

Figure 1: Declension classes, their conditional entropies ($\mathrm{H}$), and their mutual information quantities ($\mathrm{MI}$) relating form ($W$), meaning ($V$), and declension class ($C$), given gender ($G$), in German and Czech. $\mathrm{H}(W \mid G)$ and $\mathrm{H}(V \mid G)$ correspond to the overall uncertainty over forms and meanings given gender—estimating these values falls outside the scope of this paper.

To an English speaker learning German, it may come as a surprise that one cannot necessarily predict the plural form of a noun from its singular. This is because pluralizing nouns in English is relatively simple: usually we merely add an -s to the end (e.g., cat → cats). Of course, not all English nouns follow such a simple rule (e.g., child → children, sheep → sheep, etc.), but those that do not are fairly few. German, on the other hand, has comparatively many nouns following comparatively many common morphological rules. For example, some plurals are formed by adding a suffix to the singular: Insekt ‘insect’ → Insekt-en, Hund ‘dog’ → Hund-e, Radio ‘radio’ → Radio-s. For others, the plural is formed by changing a stem vowel:¹ Mutter ‘mother’ → Mütter, or Nagel ‘nail’ → Nägel. Some others form plurals with both suffixation and vowel change: Haus ‘house’ → Häus-er and Koch ‘chef’ → Köch-e. Still others, like Esel ‘donkey’, have the same form in plural and singular. How baffling for the adult learner! And the problem only worsens when we consider other inflectional morphology, such as case.

¹This vowel change, umlaut, corresponds to fronting.

Disparate plural-formation and case rules of the kind described above split nouns into declension classes. To know a noun’s declension class is to know which morphological form it takes in which context (e.g., benveniste1935; wurzel1989; nubling2008; ackerman2009; ackerman2013; beniamine2016; bonami2016). But this raises the question: what clues can we use to predict the class of a noun? In some languages, predicting declension class is argued to be easier if we know the noun’s phonological form (aronoff1992; dressler1996) or lexical semantics (carstairs1994; corbett2000). However, semantic or phonological clues are, at best, only very imperfect hints as to class (wurzel1989; harris1991; harris1992; aronoff1992; halle1994; corbett2000; aronoff2007). Given this, we quantify how much information a noun’s form and/or meaning shares with its class, and determine whether that amount of information is uniform across classes.

To do this, we measure the mutual information both between declension class and meaning (i.e., a distributional semantic vector) and between declension class and form (i.e., the orthographic form), as in Figure 1. We select two Indo-European languages (Czech and German) that have declension classes. We find that form and meaning both share significant amounts of information, in bits, with declension class in both languages. We further find that form clues are stronger than meaning clues: for form, we uncover a relatively large effect of 0.5–0.8 bits, while for lexical semantics we find a moderate one of 0.3–0.5 bits. We also measure the three-way interaction between form, meaning, and class, finding that phonology and semantics contribute overlapping information about class. Finally, we analyze individual inflection classes and uncover that the amount of information they share with form and meaning is not uniform across classes or languages.

We expect our results to have consequences, not only for NLP tasks that rely on morphological information—such as bilingual lexicon induction, morphological reinflection, and machine translation—but also for debates within linguistics on the nature of inflectional morphology.

2 Declension Classes in Language

The morphological behavior of declension classes is quite complex. Although various factors are doubtless relevant, we focus on phonological and lexical semantic ones here. We have ample reason to suspect that phonological factors might affect class predictability. In the most basic sense, the forms of inflectional suffixes are often altered based on the identity of the final segment of the stem. For example, the English plural suffix is spelled as -s after most consonants, as in ‘cats’, but it is spelled as -es if it appears after s, sh, z, ch, etc., as in ‘mosses’, ‘rushes’, ‘quizzes’, ‘beaches’. Often, differences in the spelling of plural affixes or declension class affixes are due to phonological rules that get noisily realized in orthography, but there might also be additional regularities that do not correspond to phonological rules yet still have an impact. For example, statistical regularities over phonological segments in continuous speech guide first language acquisition (maye2002), even over non-adjacent segments (newport2004). Probabilistic relationships have also been uncovered between the sounds in a word and the word’s syntactic category (farmer2006; monaghan2007; sharpe2017), and between the orthographic form of a word and its argument structure valence (williams2018). Thus, we expect the form of a noun to provide clues to declension class.

Semantic factors, too, are often relevant for determining certain types of morphologically relevant classes, such as grammatical gender, which is known to be related to declension class. It has been claimed that there are only two types of gender systems: semantic systems (where only semantic information is required) and formal systems (where morphological and phonological factors are relevant in addition to semantic information) (corbett2000, 294). Moreover, in a large typological survey, qian2016 find that meaning-sensitive grammatical properties, such as gender and animacy, can be decoded well from distributional word representations for some languages, but less well for others. These examples suggest that it is worth investigating whether noun semantics provides clues about declension class.

Lastly, form and meaning might interact, as in the case of phonaesthemes where the sounds of words provide non-arbitrary clues about their meanings (sapir1929; wertheimer1958; holland1964; maurer2006; monaghan2014; donofrio2014; dingemanse2015; dingemanse2018; pimentel2019). Therefore, we check whether form and meaning jointly share information with declension class.

2.1 Orthography as a proxy for phonology?

We motivate an investigation into the relationship between the form of a word and its declension class by appealing, at least partly, to phonological motivations. However, we make the simplifying assumption that phonological information is adequately captured by orthographic word forms—i.e., strings of written symbols or graphemes. In general, one should question this assumption (vachek1945; luelsdorff1987; sproat2000; sproat2012; neef2012). For the particular languages we investigate here, it is less problematic, as Czech and German are known to have fairly “transparent” mappings between spelling and pronunciation (matvejvcek1998; miles2000; caravolas2001), achieving higher performance on grapheme-to-phoneme conversion than English and other languages with more “opaque” orthographic systems (schlippe2012). These studies suggest that we are justified in taking orthography as a proxy for phonological form. Nonetheless, to guard against phonological information being inaccurately represented in the orthographic form (e.g., vowel lengthening in German), several of our authors, who are fluent reader-annotators of our languages, checked our classes for any unexpected phonological variations. (Examples are in § 3.)

2.2 Distributional Lexical Semantics

We adopt a distributional approach to lexical semantics (harris1954) that relies on pretrained word embeddings. We do this for multiple reasons: First, distributional semantic approaches to creating word vectors, such as word2vec (mikolov2013), have been shown to do well at extracting lexical features such as animacy and taxonomic information (rubinstein2015) and can also recognize semantic anomaly (vecchi2011linear). Second, the distributional approach to lexical meaning can be easily operationalized into a straightforward procedure for extracting “meaning” from text corpora at scale. Finally, having a continuous representation of meaning, like word vectors, enables the training of machine learning classifiers.

2.3 Controlling for grammatical gender?

Grammatical gender has been found to interact with lexical semantics (schwichtenberg2004; williams2019; lera), and often can be determined from form (brooks1993; dobrin1998; frigo1998; starreveld2004). This means that it cannot be ignored in the present study. While the precise nature of the relationship between declension class and gender is far from clear, it is well established that the two should be distinguished (aronoff1992; Wiese.2000; Kurschner.Nubling.2011, inter alia). We first measure the amount of information shared between gender and class, according to the methods described in §4, to verify that the predicted relationship exists. We then verify that gender and class overlap in information in German and Czech to a high degree, but that we cannot reduce one to the other (see Table 3 and § 6). We proceed to control for gender, and subsequently measure how much additional information form or meaning provides about class.

3 Data

Language   Original   Final   Training   Validation   Test   Avg. Length   # Classes
Czech        3011      2672     2138        267        267      6.26          13
German       4216      3684     2948        368        368      5.87          16

Table 1: Number of nouns per language, before (Original) and after (Final) preprocessing, with the train–validation–test split, average stem length, and number of declension classes. Since we use 10-fold cross-validation, all instances are included in the test set at some point and are used to estimate the cross-entropies in § 5.

For our study, we need orthographic forms of nouns, their associated word vectors, and their declension classes. Orthographic forms are the easiest component, as they can be found in any large text corpus or dictionary. We isolated noun lexemes (i.e., syntactic category–specific representations of words) by language. We select Czech nouns from UniMorph (kirov-etal-2018-unimorph) and German nouns from CELEX2 (celex2). For lexical semantics, we trained 300-dimensional word2vec vectors on language-specific Wikipedia.²

²We use the gensim toolkit (gensim).
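Concretely, the embedding step can be reproduced in a few lines of gensim. The sketch below is a minimal version under assumptions of ours: the corpus file name, the whitespace tokenization, and the gensim 4.x keyword are illustrative choices, not the authors' documented pipeline.

```python
# Minimal sketch: train 300-dimensional word2vec vectors on a plain-text
# Wikipedia dump with gensim (4.x API). File name and tokenization are
# illustrative assumptions.
from gensim.models import Word2Vec

class Sentences:
    """Stream one whitespace-tokenized sentence per line."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as fh:
            for line in fh:
                yield line.strip().split()

model = Word2Vec(sentences=Sentences("wiki.cs.txt"), vector_size=300)
vector = model.wv["kráva"]  # the 300-d vector standing in for the noun's "meaning"
```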

We select the nominative singular form as the donor for both orthographic and lexical semantic representations, because it is the canonical lemma in these languages and is also usually the stem for the rest of the morphological paradigm. We restrict our investigation to monomorphemic lexemes because: (i) one stem can take several affixes, which would multiply its contribution to the results, and (ii) certain affixes come with their own class.³

³Since these require special treatment, they are set aside.

Compared to form and meaning, declension class is harder to come by, because it requires linguistic annotation. We associated lexemes with their classes on a by-language basis by relying on annotations from fluent speaker-linguists, either for class determination (for Czech) or for verifying existing dictionary information (for German). For Czech, declension classes were derived by an edit-distance heuristic over affix forms, which grouped lemmata into subclasses if they received the same inflectional affixes (i.e., they constituted a morphological paradigm). If orthographic differences between two sets of suffixes could be accounted for by positing a phonological rule, then the two sets were collapsed into a single set; for example, in the “feminine -a” declension class, we collapsed forms for which the dative singular suffix surfaces as -e following a coronal continuant consonant (figurka:figurce ‘figurine.dat.sg’), as -i following a palatal nasal (piraňa:pirani ‘piranha.dat.sg’), and as -ě following all other consonants (kráva:krávě ‘cow.dat.sg’). As for meaning, descriptively, gender is roughly a superset of declension class in Czech; among the masculine classes, animacy is a critical semantic feature, whereas form seems to matter more for feminine and neuter classes. Our final tally contains a total of 2672 Czech nouns in 13 declension classes.

For German, nouns came morphologically parsed and lemmatized, as well as coded for class (celex2, CELEX2, v.2.5). We use CELEX2 to isolate monomorphemic noun lexemes and bin them into classes. CELEX2 declension classes are more fine-grained than traditional descriptions of declension class; mappings between CELEX2 classes and traditional linguistic descriptions (alexiadou2008) are provided in Table 4 in the Appendix. The CELEX2 declension class encoding is compound and includes: (i) the number prefix (the first slot, ‘S’, is for singular, and the second, ‘P’, for plural); (ii) the morphological form identifier—zero refers to non-existent forms (e.g., the plural is zero for singularia tantum nouns), and other numbers refer to a form identifier of the morphological paradigm (e.g., the genitive applies an additional suffix for singular masculine nouns, but never for feminines); and (iii) an optional ‘u’ identifier, which marks vowel umlaut, if present. More details of the German preprocessing steps are in the Appendix. In the final tally, we consider a total of 16 declension classes, which can be broken into 3 types of singular and 7 types of plural, over a total of 3684 nouns.

After associating nouns with forms, meanings, and classes, we perform exclusions: Because frequency affects class entropy (parker2015), we removed all classes with fewer than 20 lexemes.⁴ We subsequently removed all lexemes which did not appear in our word2vec models trained on Wikipedia dumps. The remaining lexemes were split into 10 folds for cross-validation: one fold for testing, another for validation, and the remaining eight for training. Table 1 shows the train–validation–test splits, average length of nouns, and number of declension classes, by language. Table 5 in the Appendix provides final noun lexeme counts by declension class.

⁴We ran another version of our models that included all the original classes and observed no notable differences.

4 Methods

Notation.

We define each lexeme in a language as a triple. Specifically, the triple consists of an orthographic word form $w$, a distributional semantic vector $\mathbf{v}$ that encodes the lexeme’s semantics, and a declension class $c$. These triples follow an (unknown) probability distribution $p(w, \mathbf{v}, c)$, which can be marginalized to obtain marginal distributions, e.g., $p(c)$. We take the space of word forms to be the Kleene closure over a language’s alphabet $\Sigma$; thus, we have $w \in \Sigma^*$. Our distributional semantic space is a high-dimensional real vector space $\mathbb{R}^d$, where $\mathbf{v} \in \mathbb{R}^d$. The space of declension classes $\mathcal{C}$ is language-specific and contains as many elements as the language has classes, i.e., $c \in \mathcal{C}$, where $|\mathcal{C}| < \infty$. For each noun, a gender $g$ from a language-specific space of genders $\mathcal{G}$ is associated with the lexeme. In both Czech and German, $\mathcal{G}$ contains three genders: feminine, masculine, and neuter. We also consider four random variables: a $\Sigma^*$-valued random variable $W$, an $\mathbb{R}^d$-valued random variable $V$, a $\mathcal{C}$-valued random variable $C$, and a $\mathcal{G}$-valued random variable $G$.

Bipartite Mutual Information.

Bipartite MI (or, simply, MI) is a symmetric quantity that measures how much information (in bits) two random variables share. In the case of $C$ (declension class) and $W$ (orthographic form), we have

$$\mathrm{MI}(C; W) = \mathrm{H}(C) - \mathrm{H}(C \mid W) \quad (1)$$

As can be seen, MI is the difference between an unconditional and a conditional entropy. The unconditional entropy is defined as

$$\mathrm{H}(C) = -\sum_{c \in \mathcal{C}} p(c) \log_2 p(c) \quad (2)$$

and the conditional entropy is defined as

$$\mathrm{H}(C \mid W) = -\sum_{c \in \mathcal{C}} \sum_{w \in \Sigma^*} p(c, w) \log_2 p(c \mid w) \quad (3)$$

A good estimate of $\mathrm{MI}(C; W)$ will naturally encode how much the orthographic word form tells us about its corresponding lexeme’s declension class. Likewise, to measure the interaction between declension class and lexical semantics, we also consider the bipartite mutual information $\mathrm{MI}(C; V)$.
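To make Equations (1)–(3) concrete, here is a toy computation over an invented joint distribution of two classes and two forms; all probability values are illustrative, since in practice $p$ is unknown and must be estimated (§ 5):

```python
import math

# Invented joint distribution p(c, w): 2 declension classes x 2 toy "forms".
p_joint = {("I", "-e"): 0.4, ("I", "-er"): 0.1,
           ("II", "-e"): 0.1, ("II", "-er"): 0.4}

# Marginals p(c) and p(w).
p_c, p_w = {}, {}
for (c, w), p in p_joint.items():
    p_c[c] = p_c.get(c, 0.0) + p
    p_w[w] = p_w.get(w, 0.0) + p

h_c = -sum(p * math.log2(p) for p in p_c.values())           # Eq. (2)
h_c_given_w = -sum(p * math.log2(p / p_w[w])                 # Eq. (3), using
                   for (c, w), p in p_joint.items())         # p(c|w) = p(c,w)/p(w)
mi = h_c - h_c_given_w                                       # Eq. (1)
print(f"H(C)={h_c:.2f}, H(C|W)={h_c_given_w:.2f}, MI(C;W)={mi:.2f} bits")
```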

Tripartite Mutual Information.

To consider the interaction between three random variables at once, we need to generalize MI to three variables. One can calculate tripartite MI as follows:

$$\mathrm{MI}(C; W; V) = \mathrm{MI}(C; W) - \mathrm{MI}(C; W \mid V) \quad (4)$$

As can be seen, tripartite MI is the difference between a bipartite MI and a conditional bipartite MI. The conditional bipartite MI is defined as

$$\mathrm{MI}(C; W \mid V) = \mathrm{H}(C \mid V) - \mathrm{H}(C \mid W, V) \quad (5)$$

In plain terms, Equation 4 is the difference between how much $C$ and $W$ interact and how much they interact after “controlling” for $V$.⁵

⁵We emphasize here the subtle, but important, distinction between $\mathrm{MI}(C; W; V)$ and $\mathrm{MI}(C; W, V)$. (The difference in notation lies in the comma replacing the semicolon.) While the first (tripartite MI) measures the amount of (redundant) information shared by the three variables, the second (bipartite) measures the (total) information that class shares with either the form or the lexical semantics.
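As code, Equations (4)–(5) reduce to two subtractions over (estimated) entropies; the sketch below uses placeholder entropy values, not measured ones:

```python
def tripartite_mi(h_c, h_c_given_w, h_c_given_v, h_c_given_wv):
    """MI(C; W; V) per Eqs. (4)-(5), from four (estimated) entropies in bits."""
    mi_c_w = h_c - h_c_given_w                   # MI(C; W), Eq. (1)
    mi_c_w_given_v = h_c_given_v - h_c_given_wv  # MI(C; W | V), Eq. (5)
    return mi_c_w - mi_c_w_given_v               # Eq. (4)

# Placeholder inputs, not measured values:
print(tripartite_mi(h_c=2.2, h_c_given_w=1.6, h_c_given_v=1.9, h_c_given_wv=1.5))
```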

Controlling for Gender.

Working with mutual information also gives us a natural way to control for quantities that we know influence meaning and form. We do this by considering conditional MI. We consider both bipartite and tripartite conditional mutual information, defined as follows:

$$\mathrm{MI}(C; W \mid G) = \mathrm{H}(C \mid G) - \mathrm{H}(C \mid W, G) \quad (6a)$$
$$\mathrm{MI}(C; W; V \mid G) = \mathrm{MI}(C; W \mid G) - \mathrm{MI}(C; W \mid V, G) \quad (6b)$$

Estimating these quantities tells us how much $C$ and $W$ (and, in the case of tripartite MI, also $V$) interact after we take $G$ (the grammatical gender) out of the picture. Figure 1 provides a graphical summary of this section up to this point.

Normalization.

To further contextualize our results, we consider two normalization schemes for MI. Normalizing renders MI estimates more directly comparable across languages (gates2019). We consider the normalized mutual information, i.e., the fraction of the unconditional entropy that the mutual information accounts for:

$$\mathrm{NMI}(C; W) = \frac{\mathrm{MI}(C; W)}{\min\{\mathrm{H}(C), \mathrm{H}(W)\}} \quad (7)$$

In practice, $\mathrm{H}(C) < \mathrm{H}(W)$ in most cases, and the normalized mutual information is then more appropriately termed the uncertainty coefficient (theil1970):

$$\mathrm{U}(C \mid W) = \frac{\mathrm{MI}(C; W)}{\mathrm{H}(C)} \quad (8)$$

This can be computed from any mutual information equation and yields the percentage of the entropy that the mutual information accounts for—a more interpretable notion of the predictability between class and form or meaning.
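As code, Equation (8) is a single ratio. The sample values below are the German form results from Table 2 (conditioned on gender):

```python
def uncertainty_coefficient(mi, h_c):
    """U(C | W) per Eq. (8): the share of class entropy that MI accounts for."""
    return mi / h_c

# German, form given gender (Table 2): MI = 0.57 bits, H(C | G) = 2.17 bits.
print(f"{uncertainty_coefficient(0.57, 2.17):.1%}")  # ~26%; Table 2's 26.4%
                                                     # reflects unrounded inputs
```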

5 Computation and Approximation

In order to estimate the mutual information quantities of interest per § 4, we need to estimate a variety of entropies. We derive our mutual information estimates from a corpus $\mathcal{D}$.

5.1 Plug-in Estimation of Entropy

The most straightforward quantity to estimate is $\mathrm{H}(C)$. Given a corpus, we may use plug-in estimation: We compute the empirical distribution over declension classes from $\mathcal{D}$ and plug it into the formula for entropy in Equation 2. This estimator is biased (paninski), but it is a suitable choice here because we have only a few declension classes and a large amount of data. Future work will explore whether better estimators (miller1955; hutter2001; archer2013; archer2014) affect the conclusions of studies such as this one.
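A plug-in estimate of $\mathrm{H}(C)$ takes only a few lines; the class labels below are hypothetical:

```python
import math
from collections import Counter

def plugin_entropy(labels):
    """Plug-in estimate of H(C) (Eq. 2) from a list of observed class labels."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

# Hypothetical class column of a corpus D:
print(plugin_entropy(["S1/P1", "S1/P1", "S3/P3", "S1/P5"]))  # 1.5 bits
```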

5.2 Model-based Estimation of Entropy

In contrast, estimating $\mathrm{H}(C \mid W)$ is non-trivial. We cannot simply apply plug-in estimation, because we cannot compute the required infinite sum over $\Sigma^*$. Instead, we follow previous work (brown-etal-1992-estimate; pimentel2019) in using the cross-entropy upper bound to approximate $\mathrm{H}(C \mid W)$ with a model. More formally, for any probability distribution $q(c \mid w)$, we estimate

$$\mathrm{H}(C \mid W) \leq \mathrm{H}_q(C \mid W) = -\sum_{c \in \mathcal{C}} \sum_{w \in \Sigma^*} p(c, w) \log_2 q(c \mid w) \quad (9)$$

To circumvent the need for infinite sums, we use a held-out sample $\tilde{\mathcal{D}}$, disjoint from $\mathcal{D}$, to approximate the true cross-entropy with the following quantity

$$\mathrm{H}_q(C \mid W) \approx -\frac{1}{|\tilde{\mathcal{D}}|} \sum_{(c, w) \in \tilde{\mathcal{D}}} \log_2 q(c \mid w) \quad (10)$$

where we assume the held-out data is distributed according to the true distribution $p$; this approximation becomes exact as $|\tilde{\mathcal{D}}| \to \infty$. While the exposition above focuses on learning a distribution $q$ over classes given forms to approximate $\mathrm{H}(C \mid W)$, the same methodology can be used to estimate all necessary conditional entropies.
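Equation (10) translates directly into an average surprisal over held-out data. In the sketch below, log2_q is a hypothetical callable wrapping whatever classifier plays the role of $q(c \mid w)$:

```python
def heldout_cross_entropy(heldout, log2_q):
    """Eq. (10): estimate H_q(C | W) as the average held-out surprisal.

    heldout: iterable of (class, form) pairs assumed drawn from the true p.
    log2_q:  hypothetical function returning log2 q(c | w) for a trained model.
    """
    pairs = list(heldout)
    return -sum(log2_q(c, w) for c, w in pairs) / len(pairs)
```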

Form and gender: $\mathrm{H}(C \mid W)$ and $\mathrm{H}(C \mid W, G)$.

We train two LSTM classifiers (hochreiter1997)—one for each language. The last hidden state of each LSTM is fed into a linear layer and then a softmax non-linearity to obtain a probability distribution over classes. To condition our model on gender, we embed each gender and feed it into each LSTM’s initial hidden state.
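A minimal PyTorch sketch of such a classifier follows. The sizes, names, and single-layer configuration are illustrative assumptions rather than the tuned setup, and log_softmax returns nats, which must be divided by ln 2 to obtain the bits used in Equation (10):

```python
import torch
import torch.nn as nn

class FormClassifier(nn.Module):
    """q(c | w, g): a character LSTM whose initial hidden state encodes gender."""
    def __init__(self, n_chars, n_genders, n_classes, emb_dim=64, hid_dim=128):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, emb_dim)
        self.gender_emb = nn.Embedding(n_genders, hid_dim)  # fills h_0
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, n_classes)

    def forward(self, chars, gender):
        # chars: (batch, seq_len) character ids; gender: (batch,) gender ids
        h0 = self.gender_emb(gender).unsqueeze(0)     # (1, batch, hid_dim)
        c0 = torch.zeros_like(h0)
        _, (h_n, _) = self.lstm(self.char_emb(chars), (h0, c0))
        return self.out(h_n[-1]).log_softmax(dim=-1)  # log q(c | w, g), in nats
```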

Meaning and gender: $\mathrm{H}(C \mid V)$ and $\mathrm{H}(C \mid V, G)$.

We trained a simple multilayer perceptron (MLP) classifier to predict the declension class given the word2vec representation. When conditioning on gender, we again embedded each gender, concatenating these embeddings with the word2vec ones before feeding the result into the MLP.
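A corresponding sketch for the meaning model, again with illustrative sizes (the gender-embedding dimensionality in particular is our assumption):

```python
import torch
import torch.nn as nn

class MeaningClassifier(nn.Module):
    """q(c | v, g): an MLP over the word2vec vector concatenated with gender."""
    def __init__(self, n_genders, n_classes, vec_dim=300, gen_dim=16, hid_dim=128):
        super().__init__()
        self.gender_emb = nn.Embedding(n_genders, gen_dim)
        self.mlp = nn.Sequential(
            nn.Linear(vec_dim + gen_dim, hid_dim),
            nn.ReLU(),
            nn.Linear(hid_dim, n_classes),
        )

    def forward(self, vec, gender):
        # vec: (batch, vec_dim) word2vec vectors; gender: (batch,) gender ids
        x = torch.cat([vec, self.gender_emb(gender)], dim=-1)
        return self.mlp(x).log_softmax(dim=-1)
```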

Form, meaning, and gender: $\mathrm{H}(C \mid W, V)$ and $\mathrm{H}(C \mid W, V, G)$.

We again trained two LSTM classifiers, but this time also conditioned on meaning (i.e., word2vec). We avoided overfitting by reducing the word2vec dimensionality from its original 300 dimensions with language-specific PCAs; the compressed vectors were then linearly transformed to match the hidden size of the LSTMs and fed in. To also condition on gender, we followed the same procedure, but used half of each LSTM’s initial hidden state for each vector (i.e., compressed word2vec and gender one-hot embeddings).
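The compression step might look as follows with scikit-learn (our choice of library; the target dimensionality was a tuned hyperparameter, so 64 below is arbitrary, and the random matrix merely stands in for real word2vec rows):

```python
import numpy as np
from sklearn.decomposition import PCA

train_vectors = np.random.randn(2948, 300)   # stand-in for the word2vec rows
pca = PCA(n_components=64).fit(train_vectors)
compressed = pca.transform(train_vectors)    # (2948, 64); then linearly
                                             # projected to the LSTM hidden size
```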

Form & Declension Class (LSTM)
           H(C|G)   H(C|W,G)   MI(C;W|G)     U
Czech       1.35      0.56       0.79       58.8%
German      2.17      1.60       0.57       26.4%

Meaning & Declension Class (MLP)
           H(C|G)   H(C|V,G)   MI(C;V|G)     U
Czech       1.35      0.82       0.53       39.4%
German      2.17      1.88       0.29       13.6%

Both (Form and Meaning) & Declension Class (LSTM)
           H(C|G)   H(C|W,V,G)   MI(C;W,V|G)     U
Czech       1.35       0.37         0.98        72.6%
German      2.17       1.50         0.67        30.8%

Tripartite MI
           MI(C;W|G)   MI(C;W|V,G)   MI(C;W;V|G)     U
Czech        0.79         0.44          0.35        25.9%
German       0.57         0.37          0.20         9.2%

Table 2: MI between form and class (top), meaning and class (second), both form and meaning and class (third), and tripartite MI (bottom). All values are calculated given gender and are statistically significant (§ 6).

Optimization.

All classifiers were trained using Adam (kingma2015adam), and code was implemented in PyTorch. Hyperparameters—the number of training epochs, hidden sizes, PCA compression dimension, and number of layers—were optimized using Bayesian optimization with a Gaussian process prior (snoek2012practical). For each experiment, fifty models were trained to maximize expected improvement on the validation set.
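The paper specifies GP-based Bayesian optimization with expected improvement but no particular library; scikit-optimize's gp_minimize is one possible stand-in that matches this description. The search-space bounds and the validation_loss function below are hypothetical:

```python
from skopt import gp_minimize
from skopt.space import Integer

def validation_loss(epochs, hidden_size, pca_dim, n_layers):
    """Hypothetical stand-in: train a classifier with these hyperparameters
    and return its validation cross-entropy. A dummy surface is used here
    so the sketch runs end to end."""
    return ((hidden_size - 256) ** 2 / 1e5 + pca_dim / 1e3
            + 0.01 * n_layers + 0.001 * epochs)

space = [Integer(1, 50, name="epochs"),
         Integer(32, 512, name="hidden_size"),
         Integer(2, 300, name="pca_dim"),
         Integer(1, 3, name="n_layers")]

# Fifty evaluations under a GP prior with the expected-improvement criterion.
result = gp_minimize(lambda x: validation_loss(*x), space,
                     n_calls=50, acq_func="EI")
print(result.x)  # best hyperparameters found
```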

5.3 An Empirical Lower Bound on MI

With our empirical approximations of the desired entropy measures, we can calculate approximate MI values, e.g.,

$$\mathrm{MI}(C; W) \approx \hat{\mathrm{H}}(C) - \mathrm{H}_q(C \mid W) \quad (11)$$

where $\hat{\mathrm{H}}(C)$ is the plug-in estimate of the entropy. Such an approximation, though, is not ideal, since we do not know whether it approximates the true MI from above or below. Nonetheless, since $\mathrm{H}(C)$ is estimated with plug-in estimation, which underestimates entropy, and $\mathrm{H}(C \mid W)$ is estimated with a cross-entropy upper bound, we have

$$\mathrm{MI}(C; W) = \mathrm{H}(C) - \mathrm{H}(C \mid W) \geq \hat{\mathrm{H}}(C) - \mathrm{H}_q(C \mid W) \quad (12)$$

We note that these lower bounds are exact when taking an expectation under the true distribution $p$. We cannot make a similar statement about tripartite MI, though, since it is computed as the difference of two mutual information quantities, both of which are lower-bounded in their approximations.
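Putting § 5 together, each reported MI is simply the difference of the two estimators sketched above:

```python
def mi_lower_bound(h_c_plugin, h_q_heldout):
    """Eqs. (11)-(12): plug-in entropy minus held-out model cross-entropy.

    The first term underestimates H(C) in expectation and the second
    overestimates H(C | W), so the difference lower-bounds the true MI.
    """
    return h_c_plugin - h_q_heldout

# e.g., combining the earlier sketches (names are from those sketches):
# mi = mi_lower_bound(plugin_entropy(train_classes),
#                     heldout_cross_entropy(test_pairs, log2_q))
```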

6 Results

Our main experimental results are presented in Table 2. We find that both form and lexical semantics significantly interact with declension class in both Czech and German. We observe that our estimate of $\mathrm{MI}(C; W \mid G)$ (0.5–0.8 bits) is larger than our estimate of $\mathrm{MI}(C; V \mid G)$ (0.3–0.5 bits). We also observe that the MI estimates in Czech are higher than in German. However, we caution that the estimates for the two languages are not fully comparable, because they hail from models trained on different amounts of data. The tripartite MI estimates between class, form, and meaning were relatively small (0.2–0.35 bits) for both languages. We interpret this finding as showing that much of the information contributed by form is not redundant with the information contributed by meaning—although a substantial amount is. All results in this section were significant for both languages according to Welch’s $t$-test (welch1947generalization), after Benjamini–Hochberg correction for multiple comparisons (bhCorrection).⁶

⁶Welch’s $t$-test differs from Student’s $t$-test (student1908) in that the latter assumes equal variances and the former does not, making it preferable (see delacre2017).

           H(C)   H(C|G)   MI(C;G)     U
Czech      2.75    1.35      1.40     50.8%
German     2.88    2.17      0.71     24.6%

Table 3: MI between class and gender: $\mathrm{H}(C)$ is the class entropy, $\mathrm{H}(C \mid G)$ is the class entropy given gender, $\mathrm{MI}(C; G)$ is their mutual information, and $\mathrm{U}$ is the uncertainty coefficient.
Figure 2: Pointwise MI for declension classes in (a) Czech and (b) German. For each class, four quantities are plotted, with classes increasing in size towards the right: the conditional class entropy, PMI with form, PMI with meaning, and tripartite PMI (top panel).

As a final sanity check, we measure the mutual information between class and gender (see Table 3). In both cases, the mutual information between class and gender is significant. MIs ranged from approximately 0.7 bits in German up to 1.4 bits in Czech, nearly 25% and nearly 51% of the entropy of class, respectively. Like the quantities discussed in § 4, this MI can also be estimated using simple plug-in estimation. Remember, if class were entirely reducible to gender, the conditional entropy of class given gender would be zero. This is not the case: although the conditional entropy of class given gender is lower for Czech (1.35 bits) than for German (2.17 bits), in neither case is declension class informationally equivalent to the language’s grammatical gender system.

7 Discussion and Analysis

Next, we ask whether individual declension classes differ in how idiosyncratic they are; e.g., does any one German declension class share less information with form than the others? To address this, we qualitatively inspect per-class pointwise mutual information (PMI) in Figure 2. See Table 5 in the Appendix for the five highest and lowest surprisal examples per model. Several qualitative trends emerge: (i) classes show a decent amount of variability, (ii) the unconditional entropy for each class is inversely proportional to the class’s size, (iii) PMI with form is higher on average for Czech than German, and (iv) classes that have high PMI with form usually also have high PMI with meaning (with notable exceptions we discuss below).

Czech.

In general, masculine classes have smaller PMI with form than feminine or neuter ones of comparable size—the exception being ‘special, masculine, plural -ata’. This class ends exclusively in -e or -ě, which might contribute to its higher PMI with form. That PMI with form is high for feminine and neuter classes suggests that the overall results might be largely driven by these classes, which predominantly end in vowels. We also note that the high PMI with form for feminine ‘plural -e’ might be driven by the many Latin or Greek loanwords present in this class.

With respect to meaning, recall that masculine declension classes reflect animacy status: ‘animate1’ contains nouns referring mostly to humans, as well as a few animals (kocour ‘tomcat’, čolek ‘newt’); ‘animate2’ mostly animals, with a few humans (syn ‘son’, křesťan ‘Christian’); ‘inanimate1’ contains many plants, staple foods (chléb ‘bread’, ocet ‘vinegar’), and meaningful places (domov ‘home’, kostel ‘church’); and ‘inanimate2’ contains many basic inanimate nouns (kámen ‘stone’). Of these masculine classes, ‘inanimate1’ has a lower PMI with meaning than its class size alone might lead us to predict. Feminine and neuter classes show no clear pattern, although the neuter classes ‘-ení’ and ‘-o’ have comparatively high PMI with meaning.

For tripartite PMI, we observe that ‘masculine, inanimate1’ shows the smallest quantity, followed by most other masculine classes (e.g., the masculine animate classes with -ové or -i plurals), for which PMI with meaning was also low. Among the non-masculine classes, we observe that feminine ‘pl -i’ and the neuter classes ‘-o’ and ‘-ení’ show higher tripartite PMI. The latter two classes have relatively high PMI across the board.

German.

PMI with form tends to be high for classes containing words with umlautable vowels (i.e., S3/P1u, S1/P1u) or loan words (i.e., S3/Loan); in the former case, our models seem able to separate umlautable from non-umlautable vowels, and in the latter, loan-word orthography from native orthography. PMI with meaning is roughly equivalent across classes of different sizes, with the exception of three classes: S1/P4, S3/P1, and S1/P3. S1/P4 consists of highly semantically variable nouns, ranging from relational noun lexemes (e.g., Glied ‘member’, Weib ‘wife’, Bild ‘picture’) to masses (e.g., Reis ‘rice’), which perhaps explains its relatively high PMI with meaning. For S1/P3 and S3/P1, PMI with meaning is low, and we observe that both declension classes idiosyncratically group clusters of semantically similar nouns: S1/P3 contains “exotic” birds (Papagei ‘parrot’, Pfau ‘peacock’), but also nouns ending in -or (Traktor ‘tractor’, Pastor ‘pastor’), whereas S3/P1 contains very few nouns, such as names of months (März ‘March’, Mai ‘May’) and names of mythological beasts (e.g., Sphinx, Alp).

Tripartite PMI is fairly idiosyncratic in German: the lowest quantity comes from the smallest class, S1/P2u. S1/P3, a class with low PMI with meaning from above, also has low tripartite PMI. We speculate that this class could be a sort of ‘catch-all’ class with no clear regularities. The highest tripartite PMI comes from S1/P4, which also had high PMI with meaning. This result suggests that submorphemic meaning-bearing units, or phonaesthemes, might be present; taking inspiration from pimentel2019, which aims to automatically discover such units, we observe that many words in S1/P4 contain the letters {d, e, g, i, l}, often in identically ordered orthographic sequences, such as Bild, Biest, Feld, Geld, Glied, Kind, Leib, Lied, Schild, Viech, Weib, etc. While these letters are common in German orthography, their noticeable presence suggests that further elucidation of declension classes in the context of phonaesthemes could be warranted.

8 Conclusion

We adduce new evidence that declension class membership is neither wholly idiosyncratic nor fully deterministic based on form or meaning in Czech and German. We measure several mutual information quantities that range from 0.2 bits to nearly a bit. Despite their relatively small magnitudes, our measured mutual information between class and form accounted for between 25% and 60% of the class entropy, even after relevant controls, and the MI between class and meaning accounted for between 13% and nearly 40%. We analyze results per class, and find that classes vary in how much information they share with meaning and form. We also observe that classes with high PMI with form often have high PMI with meaning, with a few noted exceptions that have specific orthographic (e.g., German umlauted plurals) or semantic (e.g., Czech masculine animacy) properties. In sum, this paper has proposed a new information-theoretic method for quantifying the strength of morphological relationships and applied it to declension class. We verify and build on existing linguistic findings, showing that the mutual information quantities between declension class, orthographic form, and lexical semantics are statistically significant.

Acknowledgments

Thanks to Guy Tabachnik for informative discussions on Czech phonology, to Jacob Eisenstein for useful questions about irregularity, and to Andrea Sims and Jeff Parker for advice on citation forms. Thanks to Ana Paula Seraphim for helping beautify Figure 1.

References

Appendix A Further Notes on Preprocessing

The breakdown of our declension classes is given in Table 4. We first discuss further details of our preprocessing for German, and then for Czech.

German.

After extracting declension classes from CELEX2, we made some additional preprocessing decisions for German, usually based on orthographic or other considerations. For example, we combined class S1 with S4, P1 with P7, and P6 with P3, because the difference between the members of each pair lies solely in spelling (a final s is doubled in the spelling when the gen.sg -(e)s or the pl -(e)n suffix is attached).

Whether a given singular, say S1, becomes inflected as P1 or P2—or, for that matter, the corresponding umlauted versions of these plural classes—is phonologically conditioned (alexiadou2008). If the stem ends in a trochee whose second syllable consists of schwa plus /n/, /l/, or /r/, the schwa is not realized, i.e., it gets P2, otherwise it gets P1. For this phonological reason, we also chose to collapse P1 and P2.

We also collapsed all loan classes (i.e., those with P8–P10) under one plural class ‘Loan’. This choice resulted in us merging loans with Greek plurals (like P9, Myth-os / Myth-en) with those with Latin plurals (like P8, Maxim-um / Maxim-a and P10, Trauma / Trauma-ta). This choice might have unintended consequences on the results, as the orthography of Latin and Greek differ substantially from each other, as well as from the native German orthography, and might be affecting our measure of higher form-based MI for S1/Loan and S3/Loan classes in Table 3 of the main text. One could reasonably make a different choice, and instead remove these examples from consideration, as we did for classes with fewer than 20 lemmata.

Czech.

The preprocessing for Czech was a bit less involved, since the classes were derived from an edit-distance heuristic. A fluent speaker-linguist identified major noun classes by grouping together nouns with shared suffixes in the surface (orthographic) form. If the differences between two sets of suffixes in the surface form could then be accounted for by positing a basic phonological rule—for example, vowel shortening in monosyllabic words—then the two sets were collapsed.

Among masculine nouns, four large classes were identified that seemed to range from “very animate” to “very inanimate.” The morphological divisions between these classes were very systematic, but there was substantial overlap: dat.sg and loc.sg differentiated ‘animate1’ from ‘animate2’, ‘inanimate1’, and ‘inanimate2’; acc.sg, nom.pl, and voc.pl differentiated ‘animate2’ from ‘inanimate1’ and ‘inanimate2’; and gen.sg differentiated ‘inanimate1’ from ‘inanimate2’ (see Figure 3). Further subdivisions were made within the two animate classes for the apparently idiosyncratic nominative plural suffix, and within the ‘inanimate2’ class, where nouns took either -u or -e as the genitive singular suffix. This division may have once reflected a final palatal on nouns taking -e in the genitive singular, but the distinction has since been lost. All nouns in the ‘inanimate2’ “soft” class end in coronal consonants, whereas nouns in the ‘inanimate1’ “hard” class have a variety of final consonants.

Among feminine nouns, the ‘feminine -a’ class contained all feminine words that end in -a in the nominative singular form. (Note that there exist masculine nouns ending in -a, but these did not pattern with the ‘feminine -a’ class.) The ‘feminine pl -e’ class contained feminine nouns ending in -e, -ě, or a consonant, and, as the name suggests, had the suffix -e in the nominative plural form. The ‘feminine pl -i’ class contained feminine nouns ending in a consonant, with the suffix -i in the nominative plural form. No feminine nouns ended in a dorsal consonant.

Among neuter nouns, all words ended in a vowel.

Figure 3: Czech paradigm for masculine nouns.

Appendix B Some prototypical examples

Czech
stem          class                                          surprisal
azalka        feminine, -a
matamatika    feminine, -a
čtvrtka       feminine, -a
paprika       feminine, -a
matoda        feminine, -a
ptakopysk     masculine, animate1, pl -i                       1.34
špendlík      masculine, inanimate2                            1.34
hospodář      neuter, -ení, derived from verb (instr-pl)       1.36
dudlík        masculine, inanimate2                            1.39
záznamník     masculine, inanimate2                            1.48

German
stem          class               surprisal
Kalesche      fem, 6, S3P3          0.013
Tabelle       fem, 6, S3P3          0.013
Stelze        fem, 6, S3P3          0.014
Lende         fem, 6, S3P3          0.014
Gamasche      fem, 6, S3P3          0.015
Karton        msc, 1, S1P5          2.03
Humus         msc, ?, S3P0          2.06
Mufti         msc, 1, S1P5          2.19
Magma         neu, ?, S1P10         2.23
Los           neu, 1, S1P1          2.43

Table 5: Five lowest and five highest surprisal examples given form and meaning (w2v), by language.

To explore which examples across classes might be most prototypical, we sampled the five highest- and lowest-surprisal examples. The results are in Table 5. We observe that the lowest-surprisal forms generally come from a single class for each language: ‘feminine, -a’ for Czech and S3/P3 for German. These two classes were among the largest, having lower class entropy, and both contain feminine nouns. Forms with higher surprisal generally came from several smaller classes and were predominantly masculine. This sample is small, however, so it remains to be investigated whether this tendency in our data reflects a genuine, statistically significant relationship between gender, class size, and surprisal.