CharManteau: Character Embedding Models For Portmanteau Creation

07/04/2017 ∙ by Varun Gangal, et al. ∙ Carnegie Mellon University

Portmanteaus are a word formation phenomenon where two words are combined to form a new word. We propose character-level neural sequence-to-sequence (S2S) methods for the task of portmanteau generation that are end-to-end-trainable, language independent, and do not explicitly use additional phonetic information. We propose a noisy-channel-style model, which allows for the incorporation of unsupervised word lists, improving performance over a standard source-to-target model. This model is made possible by an exhaustive candidate generation strategy specifically enabled by the features of the portmanteau task. Experiments find our approach superior to a state-of-the-art FST-based baseline with respect to ground truth accuracy and human evaluation.


1 Introduction

Portmanteaus (or lexical blends; Algeo, 1977) are novel words formed from parts of multiple root words in order to refer to a new concept which can’t otherwise be expressed concisely. Portmanteaus have become frequent in modern-day social media, news reports and advertising, one popular example being Brexit (Britain + Exit) (Petri, 2012). They are found not only in English but also in many other languages, such as Bahasa Indonesia (Dardjowidjojo, 1979), Modern Hebrew (Bat-El, 1996; Berman, 1989) and Spanish (Piñeros, 2004). Their short length makes them ideal for headlines and brand names (Gabler, 2015).

Figure 1: A sketch of our Backward, noisy-channel model. The attentional S2S model with bidirectional encoder gives $P(x|y)$ and the next-character model gives $P(y)$, where $y$ (spime) is the portmanteau and $x$ are the concatenated root words (space and time).

Unlike better-defined morphological phenomena such as inflection and derivation, portmanteau generation is difficult to capture using a set of rules. For instance, Shaw et al. (2014) state that the composition of the portmanteau from its root words depends on several factors, two important ones being maintaining prosody and retaining character segments from the root words, especially the head. Existing work by Deri and Knight (2015) aims to solve the problem of predicting portmanteaus using a multi-tape FST model, which is data-driven, unlike prior approaches. Their methods rely on a grapheme-to-phoneme converter, which takes into account the phonetic features of the language, but may not be available or accurate for non-dictionary words or low-resource languages.

Prior works, such as Faruqui et al. (2016), have demonstrated the efficacy of neural approaches for morphological tasks such as inflection. We hypothesize that such neural methods can (1) provide a simpler and more integrated end-to-end framework than the multiple FSTs used in previous work, and (2) automatically capture features such as phonetic similarity through the use of character embeddings, removing the need for explicit grapheme-to-phoneme prediction. To test these hypotheses, we propose a neural S2S model to predict portmanteaus given the two root words, making three major contributions:

  • We propose an S2S model that attends to the two input words to generate portmanteaus, and an additional improvement that leverages noisy-channel-style modelling to incorporate a language model over the vocabulary of words (§2).

  • Instead of using the model to directly predict output character-by-character, we use the features of portmanteaus to exhaustively generate candidates, making scoring using the noisy channel model possible (§3).

  • We curate and share a new and larger dataset of 1624 portmanteaus (§4).

In experiments (§5), our model performs better than the baseline Deri and Knight (2015) on both objective and subjective measures, demonstrating that such methods can be used effectively in a morphological task.

2 Proposed Models

This section describes our neural models.

2.1 Forward Architecture

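Under our first proposed architecture, the input sequence is $x = \mathrm{concat}(x^{(1)}, \text{“;”}, x^{(2)})$, the two root words joined by a separator character, while the output sequence is the portmanteau $y$. The model learns the distribution $P(y|x)$.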

The network architecture we use is an attentional S2S model (Bahdanau et al., 2014). We use a bidirectional encoder, which is known to work well for S2S problems in which input and output share a similar token order, as is true in our case. Let $\overrightarrow{\mathrm{LSTM}}$ and $\overleftarrow{\mathrm{LSTM}}$ represent the forward and reverse encoder; $e_{enc}()$ and $e_{dec}()$ represent the character embedding functions used by the encoder and decoder. The following equations describe the model:
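A standard bidirectional attentional encoder-decoder consistent with this description can be written as below. This is a sketch under assumed notation: the hidden states $h$, the summation of the two encoder directions, the output parameters $W$ and $b$, and the way the context vector $c_t$ enters the decoder are assumptions, not details confirmed by the surrounding text.

```latex
\begin{align}
\overrightarrow{h}_t &= \overrightarrow{\mathrm{LSTM}}\big(\overrightarrow{h}_{t-1},\, e_{enc}(x_t)\big) \\
\overleftarrow{h}_t &= \overleftarrow{\mathrm{LSTM}}\big(\overleftarrow{h}_{t+1},\, e_{enc}(x_t)\big) \\
h^{enc}_t &= \overrightarrow{h}_t + \overleftarrow{h}_t \\
h^{dec}_t &= \mathrm{LSTM}\big(h^{dec}_{t-1},\, [\,e_{dec}(y_{t-1});\, c_{t-1}\,]\big) \\
p_t &= \mathrm{softmax}\big(W\,[\,h^{dec}_t;\, c_t\,] + b\big)
\end{align}
```

Here $p_t$ is the distribution over the next output character and $[\,\cdot\,;\,\cdot\,]$ denotes vector concatenation.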

The context vector $c_t$ is computed using dot-product attention over encoder states. We choose dot-product attention because it doesn’t add extra parameters, which is important in a low-data scenario such as portmanteau generation.
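Dot-product attention can be illustrated with a few lines of plain Python. This is a minimal sketch, assuming the decoder state and encoder states are vectors of equal dimension; the function name and the toy vectors are hypothetical and not taken from the released code.

```python
import math

def dot_product_attention(decoder_state, encoder_states):
    """Return (weights, context) for one decoder step."""
    # Score each encoder state by its dot product with the decoder state.
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    # Softmax over the scores (shifted by the max for numerical stability).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [v / total for v in exps]
    # The context vector is the attention-weighted sum of encoder states.
    dim = len(encoder_states[0])
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states)) for k in range(dim)]
    return weights, context

# Toy example: three encoder states of dimension 2.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = dot_product_attention([2.0, 0.0], encoder_states)
```

Note that the scoring step has no trainable parameters, which is the property the text highlights for the low-data setting.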

To capture the fact that portmanteaus of two English words typically sound English-like, and to compensate for the fact that available portmanteau data will be small, we pre-train the character embeddings on English words. We use character embeddings learnt using an LSTM language model over words in an English dictionary (specifically, in our experiments, 134K words from the CMU dictionary (Weide, 1998)), where each word is a sequence of characters and the model predicts the next character in the sequence conditioned on the previous characters.

2.2 Backward Architecture

The second proposed model uses Bayes’s rule to reverse the probabilities, $P(y|x) = \frac{P(x|y)\,P(y)}{P(x)}$, to get $\arg\max_y P(y|x) = \arg\max_y P(x|y)\,P(y)$. Thus, we have a reverse model of the probability $P(x|y)$ that the given root words were generated from the portmanteau, and a character language model $P(y)$. The latter is a probability distribution over all character sequences $y \in A^*$, where $A$ is the alphabet of the language. This way of factorizing the probability is also known as a noisy channel model, which has recently also been shown to be effective for neural MT (Hoang et al., 2017; Yu et al., 2016). Such a model offers two advantages:

  1. The reverse direction model (or alignment model) gives higher probability to those portmanteaus from which one can discern the root words easily, which is one feature of good portmanteaus.

  2. The character language model can be trained on a large vocabulary of words in the language. The likelihood of a word $y$ is factorized as $P(y) = \prod_{i=0}^{|y|} P(y_i \mid y_0^{i-1})$, where $y_i^j = y_i, y_{i+1}, \ldots, y_j$, and we train an LSTM to maximize this likelihood.
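The noisy-channel factorization can be sketched as a scoring function that adds a reverse-model log-probability to a language-model log-probability. The sketch below swaps in an add-one-smoothed character bigram model for the LSTM language model purely for illustration, and takes the reverse model as a caller-supplied function; the boundary markers, function names, and toy word list are all assumptions.

```python
import math
from collections import Counter

BOS, EOS = "^", "$"  # assumed word-boundary markers, not from the paper

def train_bigram_lm(words):
    """Count character bigrams; a simple stand-in for the LSTM language model."""
    bigrams, contexts, alphabet = Counter(), Counter(), {EOS}
    for w in words:
        chars = [BOS] + list(w) + [EOS]
        alphabet.update(chars[1:])
        for prev, cur in zip(chars, chars[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts, alphabet

def log_p_word(word, lm):
    """log P(y) as a sum of per-character conditional log-probabilities."""
    bigrams, contexts, alphabet = lm
    chars = [BOS] + list(word) + [EOS]
    v = len(alphabet)
    # Add-one smoothing keeps unseen bigrams from getting zero probability.
    return sum(
        math.log((bigrams[(p, c)] + 1) / (contexts[p] + v))
        for p, c in zip(chars, chars[1:])
    )

def noisy_channel_score(x, y, log_p_x_given_y, lm):
    """log P(x | y) + log P(y); P(x) is constant across candidates y."""
    return log_p_x_given_y(x, y) + log_p_word(y, lm)

lm = train_bigram_lm(["space", "time", "spine", "slime"])
```

With this toy language model, a word-like string such as "spime" receives a higher log-probability than an implausible string such as "tmsp", which is the preference the language-model term is meant to contribute.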

3 Making Predictions

Given these models, we make predictions using one of two methods:

Greedy Decoding:

In most neural sequence-to-sequence models, we perform auto-regressive greedy decoding, selecting the next character greedily based on the probability distribution for the next character at current time step. We refer to this decoding strategy as Greedy.
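Greedy decoding can be sketched as a short loop. In this illustration, `next_char_probs` is a hypothetical callable standing in for the trained decoder, the stop character "." and the length cap are assumptions, and `toy_model` is a hand-built distribution used only to make the example runnable.

```python
def greedy_decode(next_char_probs, stop=".", max_len=30):
    """Repeatedly append the most probable next character until `stop`."""
    output = ""
    while len(output) < max_len:
        probs = next_char_probs(output)  # dict: character -> probability
        best = max(probs, key=probs.get)
        if best == stop:
            break
        output += best
    return output

# Toy stand-in for a trained decoder: it puts most mass on spelling "spime".
def toy_model(prefix, target="spime."):
    nxt = target[len(prefix)] if len(prefix) < len(target) else "."
    return {c: (0.9 if c == nxt else 0.1 / 26) for c in "abcdefghijklmnopqrstuvwxyz."}

result = greedy_decode(toy_model)
```

Because each step commits to a single character, an early mistake cannot be undone, which is one motivation for scoring complete candidates instead.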

Exhaustive Generation:

Many portmanteaus were observed to be a concatenation of a prefix of the first word and a suffix of the second. We therefore generate all candidate outputs which follow this rule. We then score these candidates with the decoder and output the one with the maximum score. We refer to this decoding strategy as Score.
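The candidate generation rule is simple enough to sketch directly. The version below assumes both the prefix and the suffix are non-empty, which the text does not specify; `score` is a hypothetical callable standing in for the model's scoring of a complete candidate.

```python
def generate_candidates(first, second):
    """All strings formed by a non-empty prefix of `first` plus a non-empty suffix of `second`."""
    candidates = set()
    for i in range(1, len(first) + 1):
        for j in range(len(second)):
            candidates.add(first[:i] + second[j:])
    return candidates

def score_decode(first, second, score):
    """Return the highest-scoring candidate (ties broken alphabetically)."""
    return max(sorted(generate_candidates(first, second)), key=score)

candidates = generate_candidates("shopping", "marathon")
```

The candidate set has at most |first| × |second| members, so scoring every candidate stays cheap for word-length inputs.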

Given that our training data is small in size, we expect ensembling (Breiman, 1996) to help reduce model variance and improve performance. In this paper, we ensemble our models wherever mentioned by training multiple models on 80% subsamples of the training data, and averaging log probability scores across the ensemble at test-time.
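The ensembling procedure can be sketched as follows. Sampling without replacement, the seed handling, and the representation of each ensemble member as a callable returning a log-probability are assumptions made for this illustration.

```python
import random

def subsample(data, fraction=0.8, seed=0):
    """Draw a random subsample (without replacement) of the training data."""
    rng = random.Random(seed)
    k = int(len(data) * fraction)
    return rng.sample(data, k)

def ensemble_log_prob(candidate, scorers):
    """Average the log-probability that each ensemble member assigns."""
    return sum(s(candidate) for s in scorers) / len(scorers)

def ensemble_predict(candidates, scorers):
    """Pick the candidate with the highest averaged log-probability."""
    return max(candidates, key=lambda c: ensemble_log_prob(c, scorers))
```

In use, each member would be trained on a different `subsample(train_data, seed=k)` and the resulting scorers passed to `ensemble_predict` together with the exhaustively generated candidates.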

4 Dataset

The existing dataset by Deri and Knight (2015) contains 401 portmanteau examples from Wikipedia. We refer to this dataset as $D_{Wiki}$. Besides being small for detailed evaluation, $D_{Wiki}$ is biased by being from just one source. We manually collect $D_{Large}$, a dataset of 1624 distinct English portmanteaus from the following sources:

  • Urban Dictionary (not all neologisms are portmanteaus, so we manually choose those which are for our dataset)

  • Wikipedia

  • Wiktionary

  • BCU’s Neologism Lists from ’94 to ’12.

Naturally, $D_{Wiki} \subset D_{Large}$. We define $D_{Blind} = D_{Large} - D_{Wiki}$ as the dataset of 1223 examples not from Wikipedia. We observed that 84.7% of the words in $D_{Large}$ can be generated by concatenating a prefix of the first word with a suffix of the second.
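A coverage statistic of this kind can be computed with a simple check. The sketch below assumes non-empty prefixes and suffixes and a hypothetical `(first, second, portmanteau)` triple format; "chortle" (chuckle + snort) is an illustrative uncovered case and is not claimed to be in the dataset.

```python
def is_covered(first, second, portmanteau):
    """True if `portmanteau` is a non-empty prefix of `first` followed by a non-empty suffix of `second`."""
    for i in range(1, len(portmanteau)):
        head, tail = portmanteau[:i], portmanteau[i:]
        if first.startswith(head) and second.endswith(tail):
            return True
    return False

def coverage(examples):
    """Fraction of (first, second, portmanteau) triples that are covered."""
    return sum(is_covered(*ex) for ex in examples) / len(examples)

examples = [
    ("shopping", "marathon", "shopathon"),
    ("ben", "jennifer", "bennifer"),
    ("chuckle", "snort", "chortle"),  # illustrative only, not from the dataset
]
```

Examples that fail this check cannot be produced by prefix-suffix candidate generation, whatever scoring model is used.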

5 Experiments

In this section, we show results comparing various configurations of our model to the baseline FST model of Deri and Knight (2015) (BASELINE). Models are evaluated using exact matches (Matches) and average Levenshtein edit distance (Distance) with respect to the ground truth.
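Both metrics are straightforward to compute. The following is a minimal sketch, assuming unit costs for insertion, deletion, and substitution; the function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def evaluate(predictions, references):
    """Return (exact-match percentage, mean edit distance)."""
    n = len(references)
    matches = sum(p == r for p, r in zip(predictions, references))
    distance = sum(levenshtein(p, r) for p, r in zip(predictions, references))
    return 100.0 * matches / n, distance / n
```

For instance, a prediction of "shoathon" against the reference "shopathon" counts as a non-match with an edit distance of 1.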

Model Attn Ens Init Prediction Matches Distance
Baseline - - - - 45.39% 1.59
Forward Greedy 22.00% 1.98
Greedy 28.00% 1.90
Beam 13.25% 2.47
Beam 15.25% 2.37
Score 30.25% 1.64
Score 32.88% 1.53
Score 42.25% 1.33
Score 41.25% 1.34
Score 6.75% 3.78
Score 6.50% 3.76
Backward Score 37.00% 1.53
Score 42.25% 1.35
Score 48.75% 1.12
Score 46.50% 1.24
Score 5.00% 3.95
Score 4.75% 3.98
Table 1: 10-Fold Cross-Validation results on $D_{Wiki}$. Attn, Ens, Init denote attention, ensembling, and initializing character embeddings respectively.

5.1 Objective Evaluation Results

In Experiment 1, we follow the same setup as Deri and Knight (2015). $D_{Wiki}$ is split into 10 folds. Each fold's model uses 8 folds for training, 1 for validation, and 1 for test. The performance metrics on the test folds are then averaged, in the style of 10-fold cross-validation. Table 1 shows the results of Experiment 1 for various model configurations. We get the BASELINE numbers from Deri and Knight (2015). Our best model obtains 48.75% Matches and 1.12 Distance, compared to 45.39% Matches and 1.59 Distance using BASELINE.
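The 8/1/1 fold rotation can be sketched as below. The shuffling seed and the choice of the next fold as the validation fold are assumptions; the text only fixes the 8/1/1 proportions.

```python
import random

def ten_fold_splits(examples, k=10, seed=0):
    """Yield (train, validation, test) lists for each of k rotations."""
    data = list(examples)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    for t in range(k):
        v = (t + 1) % k  # assumed choice of validation fold
        train = [ex for f in range(k) if f not in (t, v) for ex in folds[f]]
        yield train, folds[v], folds[t]

splits = list(ten_fold_splits(range(401)))
```

Every example appears in exactly one test fold, so averaging the per-fold test metrics uses each example once.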

For Experiment 2, we seek to compare our best approaches from Experiment 1 to the BASELINE on a large, held-out dataset. Each model is trained on $D_{Wiki}$ and tested on $D_{Blind}$. BASELINE was similarly trained only on $D_{Wiki}$, making it a fair comparison. Table 2 shows the results (for BASELINE (Deri and Knight, 2015), we use their trained model from http://leps.isi.edu/fst/step-all.php). Our best model gets a Distance of 1.96, as compared to 2.32 from BASELINE.

We observe that the Backward architecture performs better than the Forward architecture, confirming our hypothesis in §2.2. In addition, ablation results confirm the importance of attention and of initializing the character embeddings. We believe this is because portmanteaus have high fidelity towards their root word characters, and it is critical that the model can observe all root sequence characters, which attention manages to do, as shown in Fig. 2.

Model Attn Ens Init Search Matches Distance
Baseline - - - - 31.56% 2.32
Forward SCORE 25.26% 2.13
SCORE 24.93% 2.32
SCORE 31.23% 1.98
SCORE 28.94% 2.04
Backward SCORE 25.75% 2.14
SCORE 25.26% 2.17
SCORE 31.72% 1.96
SCORE 32.78% 1.96
Table 2: Results on $D_{Blind}$ (1223 Examples). In general, the Backward architecture performs better than the Forward architecture.

Figure 2: Attention matrices while generating slurve from slider;curve, and bennifer from ben;jennifer respectively, using the Forward model. “;” and “.” are separator and stop characters. Darker cells are higher-valued.

5.1.1 Performance on Uncovered Examples

The set of candidates generated before scoring in the approximate SCORE decoding approach sometimes does not cover the ground truth. This holds true for 229 out of 1223 examples in $D_{Blind}$. We compare the Forward approach with a Greedy decoding strategy to the Baseline approach on these examples.

Both Forward+Greedy and the Baseline get 0 Matches on these examples. The Distance for these examples is 4.52 for Baseline and 4.09 for Forward+Greedy. Hence, we see that one of our approaches (Forward+Greedy) outperforms Baseline even for these examples.

5.2 Significance Tests

Since our dataset is still relatively small ( examples), it is essential to verify whether Backward is indeed statistically significantly better than Baseline in terms of Matches.

In order to do this, we use a paired bootstrap comparison (Koehn, 2004) between Backward and Baseline in terms of Matches, averaging across randomly chosen subsets of the data, each of a fixed size. Backward is found to be better (gets more Matches) than Baseline in 99.9% of the subsets.

Similarly, Backward has a lower Distance than Baseline by a margin of in 99.5% () of the subsets.
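A paired bootstrap comparison of this kind can be sketched as follows. The number of subsets, the subset size, and sampling with replacement are assumed parameters chosen for illustration, not the values used in the experiments.

```python
import random

def paired_bootstrap(correct_a, correct_b, num_subsets=1000, subset_size=None, seed=0):
    """Fraction of resampled subsets on which system A gets more matches than B.

    correct_a[i] and correct_b[i] are 1 if the system matched example i, else 0.
    """
    n = len(correct_a)
    subset_size = subset_size or n
    rng = random.Random(seed)
    wins = 0
    for _ in range(num_subsets):
        # Use the same sampled indices for both systems (the "paired" part).
        idx = [rng.randrange(n) for _ in range(subset_size)]
        if sum(correct_a[i] for i in idx) > sum(correct_b[i] for i in idx):
            wins += 1
    return wins / num_subsets
```

The returned fraction plays the role of the percentage of subsets reported above; the same loop can compare mean Distance instead by replacing the match indicators with per-example edit distances and reversing the inequality.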

5.3 Subjective Evaluation and Analysis

On inspecting outputs, we observed that output from our system often seemed good in spite of a high edit distance from the ground truth. This kind of quality is not captured satisfactorily by measures like edit distance. To compare the errors made by our model to the baseline, we designed and conducted a human evaluation task on AMT. (We avoid ground truth comparison because annotators can be biased towards the ground truth due to its existing popularity.) In the survey, we show human annotators outputs from our system and from the baseline. We ask them to judge which alternative is better overall based on the following criteria: 1. It is a good shorthand for the two original words. 2. It sounds better. We requested annotation on a scale of 1-4. To avoid ordering bias, we shuffled the order of the two portmanteaus between our system and the baseline. We restrict annotators to be from Anglophone countries, have HIT Approval Rate % and pay $ per HIT (5 Questions per HIT).

As seen in Table 4, output from our system was labelled better by humans as compared to the baseline 58.12% of the time. Table 3 shows outputs from different models for a few examples.

Input               Forward        Backward      Ground Truth
shopping;marathon   shopparathon   shoathon      shopathon
fashion;fascism     fashism        fashism       fashism
wiki;etiquette      wikiquette     wiquette      wikiquette
clown;president     clowident      clownsident   clownsident
Table 3: Example outputs from different models (Refer to appendix for more examples)

Judgement         Percentage of total
Much Better (1)   29.06
Better (2)        29.06
Worse (3)         25.11
Much Worse (4)    16.74
Table 4: AMT annotator judgements on whether our system’s proposed portmanteau is better or worse compared to the baseline

6 Related Work

Özbal and Strapparava (2012) generate new words to describe a product given its category and properties. However, their method is limited to hand-crafted rules, in contrast to our data-driven approach, and their focus is on brand names. Hiranandani et al. (2017) have proposed an approach to recommend brand names based on a brand/product description. However, they consider only a limited number of features, such as memorability and readability. Smith et al. (2014) devise an approach to generate portmanteaus, which requires user-defined weights for attributes like sounding good. Generating a portmanteau from two root words can be viewed as an S2S problem. Recently, neural approaches have been used for S2S problems (Sutskever et al., 2014) such as MT. Ling et al. (2015) and Chung et al. (2016) have shown that character-level neural sequence models work as well as word-level ones for language modelling and MT. Zoph and Knight (2016) propose S2S models for multi-source MT, which have multi-sequence inputs, similar to our case.

7 Conclusion

We have proposed an end-to-end neural system to model portmanteau generation. Our experiments show the efficacy of the proposed system in predicting portmanteaus given the root words. We conclude that pre-training character embeddings on the English vocabulary helps the model. Through human evaluation, we show that our model’s predictions are superior to the baseline. We have also released our dataset and code (https://github.com/vgtomahawk/Charmanteau-CamReady) to encourage further research on the phenomenon of portmanteaus. We also release an online demo (http://tinyurl.com/y9x6mvy) where our trained model can be queried for portmanteau suggestions. An obvious extension to our work is to try similar models on multiple languages.

Acknowledgements

We thank Dongyeop Kang, David Mortensen, Qinlan Shen and anonymous reviewers for their valuable comments. This research was supported in part by DARPA grant FA8750-12-2-0342 funded under the DEFT program.

References