1 Introduction
Neural sequence-to-sequence models are omnipresent in the field of natural language processing due to their impressive performance. They hold the state of the art on a myriad of tasks, e.g., neural machine translation
(NMT; ott-etal-2018-scaling) and abstractive summarization (AS; lewis2019bart). Yet, an undesirable property of these models has been repeatedly observed in word-level tasks: when beam search is used as the decoding strategy, increasing the beam width beyond a small size often leads to a drop in the quality of solutions murray-chiang-2018-correcting; yang-etal-2018-breaking; pmlr-v97-cohen19a. Further, in the context of NMT, it has been shown that the empty string is frequently the most-probable solution under the model stahlberg-byrne-2019-nmt. Some suggest this is a manifestation of the general inadequacy of neural models for language generation tasks koehn-knowles-2017-six; kumar2019calibration; holtzman2019curious; stahlberg_phd; in this work, we find evidence demonstrating otherwise.

Table 1: Preview of results for word-level NMT versus character-level morphological inflection (MI).

| NMT | 63.1% | 46.1% | 44.3% | 6.4% |
| MI | 0.8% | 0.0% | 0.0% | 0.0% |
Sequence-to-sequence transducers for character-level tasks often follow the architectures of their word-level counterparts faruqui-etal-2016-morphological; lee-etal-2017-fully, and have likewise achieved state-of-the-art performance on, e.g., morphological inflection generation wu2020applying and grapheme-to-phoneme conversion G2P. Given prior findings, we might expect to see the same degenerate behavior in these models—however, we do not. We run a series of experiments on morphological inflection (MI) generators to explore whether neural transducers for this task are similarly poorly calibrated, i.e. far from the true distribution $p(\mathbf{y} \mid \mathbf{x})$. We evaluate the performance of two character-level sequence-to-sequence transducers using different decoding strategies; our results, previewed in Tab. 1
, show that evaluation metrics do not degrade with larger beam sizes as in NMT or AS. Additionally, only in extreme circumstances, e.g., low-resource settings with less than 100 training samples, is the empty string ever the global optimum under the model.
Our findings directly refute the claim that neural architectures are inherently inadequate for modeling language generation tasks. Instead, our results admit two potential causes of the degenerate behavior observed in tasks such as NMT and AS: (1) lack of a deterministic mapping between input and output and (2) a (perhaps irreparable) discrepancy between sample complexity and training resources. Our results alone are not sufficient to accept or reject either hypothesis, and thus we leave these as future research directions.
2 Neural Transducers
Sequence-to-sequence transduction is the transformation of an input sequence into an output sequence. Tasks involving this type of transformation are often framed probabilistically, i.e., we model the probability of mapping one sequence to another. On many tasks of this nature, neural sequence-to-sequence models sutskever_rnn; Bahdanau hold the state of the art.
Formally, a neural sequence-to-sequence model defines a probability distribution $p_{\theta}(\mathbf{y} \mid \mathbf{x})$, parameterized by a neural network with a set of learned weights $\theta$, for an input sequence $\mathbf{x}$ and output sequence $\mathbf{y}$. Morphological inflection and NMT are two such tasks, wherein $\mathbf{x}$ and $\mathbf{y}$ are both strings. Neural sequence-to-sequence models are typically locally normalized, i.e. $p_{\theta}$ factorizes as follows:

$$p_{\theta}(\mathbf{y} \mid \mathbf{x}) = \prod_{t=1}^{|\mathbf{y}|} p_{\theta}(y_t \mid \mathbf{y}_{<t}, \mathbf{x}) \qquad (1)$$
Given a vocabulary $V$, each conditional $p_{\theta}(y_t \mid \mathbf{y}_{<t}, \mathbf{x})$ is a distribution over $V$ and a special end-of-sequence symbol eos. We consider $p_{\theta}$ to be well-calibrated if its probability estimates are representative of the true likelihood that a solution $\mathbf{y}$ is correct.
Morphological Inflection.
In the task of morphological inflection, $\mathbf{x}$ is an encoding of the lemma concatenated with a flattened morphosyntactic description (MSD), and $\mathbf{y}$ is the target inflection. As a concrete example, consider inflecting the German word Bruder into the genitive plural, as shown in Tab. 2. Then, $\mathbf{x}$ is the character sequence B r u d e r followed by the MSD tags for the genitive plural, and $\mathbf{y}$ is the character sequence B r ü d e r. As this demonstrates, morphological inflection generation is, by its nature, modeled at the character level faruqui-etal-2016-morphological; wu-cotterell-2019-exact, i.e., our target vocabulary is the set of characters in the language. Note that $\mathbf{y}$ contains only characters, whereas $\mathbf{x}$ additionally contains MSD tags due to the encoding of the MSD. This stands in contrast to NMT, which is typically performed on a (sub)word level, making the vocabulary size orders of magnitude larger.
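To make the encoding concrete, the sketch below assembles source and target sequences for the Bruder example; the specific tag strings (N, GEN, PL) and the flattened lemma-plus-tags layout are illustrative assumptions rather than the exact preprocessing of any particular system.

```python
def encode_example(lemma: str, msd_tags: list[str], inflection: str) -> tuple[list[str], list[str]]:
    """Build the flattened source (lemma characters + MSD tags) and the
    character-level target for one inflection example."""
    source = list(lemma) + msd_tags   # characters, then MSD tags as atomic symbols
    target = list(inflection)         # characters of the inflected form only
    return source, target

src, tgt = encode_example("Bruder", ["N", "GEN", "PL"], "Brüder")
print(src)  # ['B', 'r', 'u', 'd', 'e', 'r', 'N', 'GEN', 'PL']
print(tgt)  # ['B', 'r', 'ü', 'd', 'e', 'r']
```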
Another important differentiating factor of morphological inflection generation in comparison to many other generation tasks in NLP is the one-to-one mapping between source and target. (While there are cases where there exist multiple inflected forms of a lemma, e.g., in English the past tense of dream can be realized as either dreamed or dreamt, these cases, termed "overabundance," are rare OverabundanceinMorphology.) In contrast, there are almost always many correct ways to translate a sentence into another language or to summarize a large piece of text; this characteristic manifests itself in training data where a single phrase has instances of different mappings, making tasks such as translation and summarization inherently ambiguous.
Table 2: The declension paradigm of the German noun Bruder.

| | Singular | Plural |
| Nominativ | Bruder | Brüder |
| Genitiv | Bruders | Brüder |
| Dativ | Bruder | Brüdern |
| Akkusativ | Bruder | Brüder |
3 Decoding
In the case of probabilistic models, the decoding problem is the search for the most-probable sequence among the set of valid sequences $\mathcal{Y}$ under the model $p_{\theta}$:

$$\mathbf{y}^{\star} = \operatorname*{argmax}_{\mathbf{y} \in \mathcal{Y}} \log p_{\theta}(\mathbf{y} \mid \mathbf{x}) \qquad (2)$$
This problem is also known as maximum-a-posteriori (MAP) inference. Decoding is often performed with a heuristic search method such as greedy or beam search reddy-1977, since performing exact search can be computationally expensive, if not impossible. (The search space is exponential in the sequence length and, due to the non-Markov nature of typical neural transducers, dynamic-programming techniques are not helpful.) While, for a deterministic task, greedy search is optimal under a Bayes optimal model (under such a model, the correct token at each time step will be assigned all probability mass), most text generation tasks benefit from using beam search. However, text quality almost invariably decreases once the beam size grows beyond a small value. This phenomenon is sometimes referred to as the beam search curse, and has been investigated in detail by a number of scholarly works koehn-knowles-2017-six; murray-chiang-2018-correcting; yang-etal-2018-breaking; stahlberg-byrne-2019-nmt; pmlr-v97-cohen19a; eikema2020map.

Table 3: Prediction accuracy for the Transformer and HMM under beam search with increasing beam widths (left to right) and under exact search with Dijkstra's algorithm (rightmost column for each model).

| | Transformer | | | | HMM | | | |
| Overall | 90.34% | 90.37% | 90.37% | 90.37% | 86.03% | 85.62% | 85.60% | 85.60% |
| Low-resource | 84.10% | 84.12% | 84.12% | 84.12% | 70.99% | 69.37% | 69.31% | 69.31% |
| High-resource | 94.05% | 94.08% | 94.08% | 94.08% | 93.60% | 93.72% | 93.72% | 93.72% |
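To make the heuristic decoding discussed above concrete, here is a minimal beam search sketch over a locally normalized model as in Eq. 1. The `step_log_probs` function is a hypothetical stand-in for the model's per-step conditional, with the input $\mathbf{x}$ assumed to be baked in (e.g., via a closure over the encoder state); it is not the interface of SGNMT or any other library.

```python
from typing import Callable, Sequence

EOS = "<eos>"

def beam_search(
    step_log_probs: Callable[[Sequence[str]], dict[str, float]],
    beam_size: int = 5,
    max_len: int = 100,
) -> tuple[list[str], float]:
    """Return the best hypothesis found with beam width `beam_size`.

    `step_log_probs(prefix)` is assumed to return the log-probability of every
    vocabulary symbol (plus EOS) given the prefix.
    """
    beam = [(0.0, [])]            # (cumulative log-probability, prefix)
    finished = []                 # hypotheses that have emitted EOS

    for _ in range(max_len):
        candidates = []
        for score, prefix in beam:
            for symbol, logp in step_log_probs(prefix).items():
                if symbol == EOS:
                    finished.append((score + logp, prefix))
                else:
                    candidates.append((score + logp, prefix + [symbol]))
        if not candidates:
            break
        # Keep only the `beam_size` highest-scoring partial hypotheses.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_size]

    pool = finished if finished else beam     # fall back if nothing finished
    best_score, best_prefix = max(pool, key=lambda c: c[0])
    return best_prefix, best_score
```

With `beam_size=1` this reduces to greedy search; larger beams explore more of the search space at higher cost, which is exactly the regime in which word-level models have been reported to degrade.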
Exact decoding can be seen as the case of beam search where the beam size is effectively stretched to infinity. (This interpretation is useful when comparing with beam search with increasing beam widths.) By considering the complete search space, it finds the globally best solution under the model $p_{\theta}$. While, as previously mentioned, exact search can be computationally expensive, we can employ efficient search strategies due to some properties of $p_{\theta}$. Specifically, from Eq. 1, we can see that the scoring function for sequences is monotonically decreasing in $t$. We can therefore find the provably optimal solution with Dijkstra's algorithm dijkstra1959note, which terminates and returns the global optimum the first time it encounters an eos. Additionally, to prevent a large memory footprint, we can lower-bound the search using any complete hypothesis, e.g., the empty string or a solution found by beam search stahlberg-byrne-2019-nmt; meister+al.tacl20. That is, we can prematurely stop exploring solutions whose scores fall below these lower bounds at any point in time. Although exact search is an exponential-time method in this setting, we see that, in practice, it terminates quickly due to the peakiness of $p_{\theta}$ (see App. A). While the effects of exact decoding and of beam search with large beam widths have been explored for a number of word-level tasks stahlberg-byrne-2019-nmt; pmlr-v97-cohen19a; eikema2020map, to the best of our knowledge, they have not yet been explored for any character-level sequence-to-sequence tasks.
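The following is a minimal sketch of the exact search just described, again assuming the hypothetical `step_log_probs` interface. Because appending a symbol multiplies in a probability at most 1, a prefix's score upper-bounds the score of every continuation, so the first complete hypothesis popped from the priority queue is the global optimum, and any complete hypothesis supplies a lower bound for pruning.

```python
import heapq
import itertools
from typing import Callable, Sequence

EOS = "<eos>"

def dijkstra_decode(
    step_log_probs: Callable[[Sequence[str]], dict[str, float]],
    lower_bound: float = float("-inf"),
    max_len: int = 100,
) -> tuple[list[str], float]:
    """Exact decoding as best-first search over prefixes."""
    tie = itertools.count()       # tie-breaker so heapq never compares prefixes
    # Entries: (negative cumulative log-probability, tie-breaker, prefix, complete?)
    frontier = [(0.0, next(tie), [], False)]

    while frontier:
        neg_score, _, prefix, complete = heapq.heappop(frontier)
        score = -neg_score
        if complete:
            return prefix, score            # provably the global optimum
        if len(prefix) >= max_len:
            continue                        # safety cap for this sketch
        for symbol, logp in step_log_probs(prefix).items():
            new_score = score + logp
            if new_score < lower_bound:
                continue                    # cannot beat a known complete hypothesis
            done = symbol == EOS
            new_prefix = prefix if done else prefix + [symbol]
            heapq.heappush(frontier, (-new_score, next(tie), new_prefix, done))

    return [], lower_bound                  # everything was pruned by the bound
```

Seeding `lower_bound` with the score of a beam search solution (or of the empty string) implements the pruning described above and keeps the frontier small when $p_{\theta}$ is peaked.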
4 Experiments
We run a series of experiments using different decoding strategies to generate predictions from morphological inflection generators. We report results for two near-state-of-the-art models: a multilingual Transformer wu2020applying
and a (neuralized) hidden Markov model
(HMM; wu-cotterell-2019-exact). For reproducibility, we mimic their proposed architectures and exactly follow their data pre-processing steps, training strategies and hyperparameter settings.
(https://github.com/shijie-wu/neural-transducer/tree/sharedtasks)

Data.
We use the data provided by the SIGMORPHON 2020 shared task vylomova2020sigmorphon, which features lemmas, inflections, and corresponding MSDs in the UniMorph schema kirov-etal-2018-unimorph for 90 languages in total. The set of languages is typologically diverse (spanning 18 language families) and contains both high- and low-resource examples, providing a spectrum over which we can evaluate model performance. The full dataset statistics can be found on the task homepage (https://sigmorphon.github.io/sharedtasks/2020/task0/). When reporting results, we consider languages with small and large training sets as low- and high-resource, respectively.
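For concreteness, a minimal loader sketch for the shared-task files, assuming the common UniMorph-style three-column layout (lemma, inflected form, and semicolon-separated MSD tags, separated by tabs); the file name in the usage comment is a placeholder.

```python
from pathlib import Path

def load_unimorph_tsv(path: str):
    """Yield (lemma, inflection, msd_tags) triples from a UniMorph-style TSV file.

    Assumes one example per line: lemma <TAB> inflected form <TAB> tag;tag;...
    """
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        lemma, form, msd = line.split("\t")
        yield lemma, form, msd.split(";")

# Hypothetical usage with a placeholder file name:
# for lemma, form, tags in load_unimorph_tsv("deu.trn"):
#     print(lemma, form, tags)
```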
Decoding Strategies.
We decode morphological inflection generators using exact search and beam search with a range of beam widths. We use the SGNMT library stahlberg2017sgnmt for decoding, extended with an implementation of Dijkstra's algorithm.
Table 4: Average log-probability of the solutions found by beam search (two beam widths), of the global optimum, and of the empty string.

| | Beam | Beam | Optimum | Empty String |
| Transformer | -0.619 | -0.617 | -0.617 | -6.56 |
| HMM | -1.08 | -0.89 | -0.80 | -20.15 |
4.1 Results
Tab. 3 shows that the accuracy of predictions from neural MI generators generally does not decrease when larger beam sizes are used for decoding; this observation holds for both model architectures. While it may be expected that models for low-resource languages generally perform worse than those for high-resource ones, this disparity is only prominent for HMMs, for which the gap between high- and low-resource accuracy is over 20 percentage points, versus roughly 10 for the Transformers (Tab. 3). Notably, for the HMM, the global optimum under the model is the empty string far more often for low-resource languages than it is for high-resource ones (see Tab. 5). We can explicitly see the inverse relationship between the log-probability of the empty string and resource size in Fig. 1. In general, across models for all 90 languages, the global optimum is rarely the empty string (Tab. 5). Indeed, under the Transformer-based transducer, the empty string was never the global optimum. This is in contrast to the findings of stahlberg-byrne-2019-nmt, who found for word-level NMT that the empty string was the optimal translation in more than 50% of cases, even under state-of-the-art models. Rather, the average log-probabilities of the empty string (which is quite low) and the chosen inflection lie far apart (Tab. 4).
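Under the factorization in Eq. 1, the empty string's score is simply the first-step log-probability of eos, which makes comparisons like those behind Tab. 4 and Fig. 1 straightforward to compute. The sketch below illustrates this; `step_log_probs` is the same hypothetical per-step interface (with the input baked in) used earlier, not the API of any particular toolkit.

```python
from typing import Callable, Sequence

EOS = "<eos>"

def empty_string_log_prob(
    step_log_probs: Callable[[Sequence[str]], dict[str, float]],
) -> float:
    """log p(empty string | x): the model emits EOS immediately, so the score
    is just the log-probability of EOS at the first step."""
    return step_log_probs([])[EOS]

def sequence_log_prob(
    step_log_probs: Callable[[Sequence[str]], dict[str, float]],
    target: Sequence[str],
) -> float:
    """log p(y | x) for a complete hypothesis y, following Eq. 1: the sum of
    per-step conditional log-probabilities, including the final EOS."""
    total, prefix = 0.0, []
    for symbol in list(target) + [EOS]:
        total += step_log_probs(prefix)[symbol]
        prefix.append(symbol)
    return total
```

When the gap between `sequence_log_prob` for the predicted inflection and `empty_string_log_prob` is as large as in Tab. 4, the empty string cannot be the global optimum.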
[Figure 1: Average log-probability of the empty string plotted against training-set size, showing the inverse relationship between the two.]
5 Discussion
Our findings admit two potential hypotheses for poor calibration of neural models in certain language generation tasks, a phenomenon we do not observe in morphological inflection. First, the tasks in which we observe this property are ones that lack a deterministic mapping, i.e. tasks for which there may be more than one correct solution for any given input. As a consequence, probability mass may be spread over an arbitrarily large number of hypotheses ott2018analyzing; eikema2020map. In contrast, the task of morphological inflection has a near-deterministic mapping. We observe this empirically in Tab. 4, which shows that the probability of the global optimum on average covers most of the available probability mass—a phenomenon also observed by peters-martins-2019-ist. Further, as shown in Tab. 6, the dearth of search errors even when using greedy search suggests there are rarely competing solutions under the model. We posit it is the lack of ambiguity in morphological inflection that allows for the well-calibrated models we observe.
Second, our experiments contrasting high- and low-resource settings indicate insufficient training data may be the main cause of the poor calibration in sequence-to-sequence models for language generation tasks. We observe that MI models trained on fewer samples typically place more probability mass on the empty string. As an extreme example, we consider the case of the Zarma language, whose training set consists of only 56 samples. Under the HMM, the average log-probabilities of the generated inflection and of the empty string are very close. Furthermore, on the test set, the global optimum of the HMM model for Zarma is the empty string 81.25% of the time.
Table 5: Percentage of test samples for which the empty string is the global optimum under the model.

| | HMM | Transformer |
| Overall | 2.03% | 0% |
| Low-resource | 8.65% | 0% |
| High-resource | 0.0002% | 0% |
Table 6: Percentage of search errors under beam search; beam width increases from left (greedy search) to right.

| HMM | 6.20% | 2.33% | 0.001% | 0.0% |
| Transformer | 0.68% | 0.0% | 0.0% | 0.0% |
From this example, we can conjecture that a lack of sufficient training data may manifest itself as a (relatively) high probability of the empty string or a (relatively) low probability of the optimum. We can extrapolate to models for NMT and other word-level tasks, for which we frequently see the above phenomenon. Specifically, our experiments suggest that when neural language generators frequently place high probability on the empty string, there may be a discrepancy between the available training resources and the number of samples needed to successfully learn the target function. While this at first seems an easy problem to fix, we expect the amount of training data needed for tasks such as NMT and AS to be much larger than that for MI, if only because of the size of the output space; perhaps so large that it is essentially unattainable. Under this explanation, for certain tasks, there may not be a straightforward fix to the degenerate behavior observed in some neural language generators.
6 Conclusion
In this work, we investigate whether the poor calibration often seen in sequence-to-sequence models for word-level tasks also occurs in models for morphological inflection. We find that character-level models for morphological inflection are generally well-calibrated, i.e. the probability of the globally best solution is almost invariably much higher than that of the empty string. This suggests the degenerate behavior observed in neural models for certain word-level tasks is not due to an inherent incompatibility of neural models with language generation. Rather, we find evidence that poor calibration may be linked to specific characteristics of a subset of these tasks, and we suggest directions for future exploration of this phenomenon.
References
Appendix A Timing
Average decoding time for each model under beam search and exact search (Dijkstra's algorithm).

| | Transformer (beam search) | Transformer (Dijkstra) | HMM (beam search) | HMM (Dijkstra) |
| Overall | 0.082 | 0.091 | 0.016 | 0.027 |
| Low-resource | 0.072 | 0.082 | 0.013 | 0.032 |
| High-resource | 0.075 | 0.083 | 0.017 | 0.026 |