We describe a simple neural language model that relies only on character-level inputs. Predictions are still made at the word-level. Our model employs a convolutional neural network (CNN) and a highway network over characters, whose output is given to a long short-term memory (LSTM) recurrent neural network language model (RNN-LM). On the English Penn Treebank the model is on par with the existing state-of-the-art despite having 60% fewer parameters. On languages with rich morphology (Arabic, Czech, French, German, Spanish, Russian), the model outperforms word-level/morpheme-level LSTM baselines, again with fewer parameters. The results suggest that on many languages, character inputs are sufficient for language modeling. Analysis of word representations obtained from the character composition part of the model reveals that the model is able to encode, from characters only, both semantic and orthographic information.
Language modeling is a fundamental task in artificial intelligence and natural language processing (NLP), with applications in speech recognition, text generation, and machine translation. A language model is formalized as a probability distribution over a sequence of strings (words), and traditional methods usually involve making an $n$-th order Markov assumption and estimating $n$-gram probabilities via counting and subsequent smoothing [Chen and Goodman 1998]. The count-based models are simple to train, but probabilities of rare $n$-grams can be poorly estimated due to data sparsity (despite smoothing techniques).
Neural Language Models (NLM) address the $n$-gram data sparsity issue through parameterization of words as vectors (word embeddings) and using them as inputs to a neural network [Bengio, Ducharme, and Vincent 2003; Mikolov et al. 2010]. The parameters are learned as part of the training process. Word embeddings obtained through NLMs exhibit the property whereby semantically close words are likewise close in the induced vector space (as is the case with non-neural techniques such as Latent Semantic Analysis [Deerwester, Dumais, and Harshman 1990]).
While NLMs have been shown to outperform count-based $n$-gram language models [Mikolov et al. 2011], they are blind to subword information (e.g. morphemes). For example, they do not know, a priori, that eventful, eventfully, uneventful, and uneventfully should have structurally related embeddings in the vector space. Embeddings of rare words can thus be poorly estimated, leading to high perplexities for rare words (and words surrounding them). This is especially problematic in morphologically rich languages with long-tailed frequency distributions or domains with dynamic vocabularies (e.g. social media).
In this work, we propose a language model that leverages subword information through a character-level convolutional neural network (CNN), whose output is used as an input to a recurrent neural network language model (RNN-LM). Unlike previous works that utilize subword information via morphemes [Botha and Blunsom2014, Luong, Socher, and Manning2013], our model does not require morphological tagging as a pre-processing step. And, unlike the recent line of work which combines input word embeddings with features from a character-level model [dos Santos and Zadrozny2014, dos Santos and Guimaraes2015], our model does not utilize word embeddings at all in the input layer. Given that most of the parameters in NLMs are from the word embeddings, the proposed model has significantly fewer parameters than previous NLMs, making it attractive for applications where model size may be an issue (e.g. cell phones).
To summarize, our contributions are as follows:
on English, we achieve results on par with the existing state-of-the-art on the Penn Treebank (PTB), despite having approximately 60% fewer parameters, and
on morphologically rich languages (Arabic, Czech, French, German, Spanish, and Russian), our model outperforms various baselines (Kneser-Ney, word-level/morpheme-level LSTM), again with fewer parameters.
We have released all the code for the models described in this paper (https://github.com/yoonkim/lstm-char-cnn).
The architecture of our model, shown in Figure 1, is straightforward. Whereas a conventional NLM takes word embeddings as inputs, our model instead takes the output from a single-layer character-level convolutional neural network with max-over-time pooling.
For notation, we denote vectors with bold lower-case (e.g. $\mathbf{x}_t, \mathbf{b}$), matrices with bold upper-case (e.g. $\mathbf{W}, \mathbf{U}$), scalars with italic lower-case (e.g. $t, w$), and sets with cursive upper-case (e.g. $\mathcal{V}, \mathcal{C}$) letters. For notational convenience we assume that words and characters have already been converted into indices.
A recurrent neural network (RNN) is a type of neural network architecture particularly suited for modeling sequential phenomena. At each time step $t$, an RNN takes the input vector $\mathbf{x}_t \in \mathbb{R}^n$ and the hidden state vector $\mathbf{h}_{t-1} \in \mathbb{R}^m$ and produces the next hidden state $\mathbf{h}_t$ by applying the following recursive operation:
$$\mathbf{h}_t = f(\mathbf{W}\mathbf{x}_t + \mathbf{U}\mathbf{h}_{t-1} + \mathbf{b})$$
Here $\mathbf{W} \in \mathbb{R}^{m \times n}$, $\mathbf{U} \in \mathbb{R}^{m \times m}$, $\mathbf{b} \in \mathbb{R}^m$ are parameters of an affine transformation and $f$ is an element-wise nonlinearity. In theory the RNN can summarize all historical information up to time $t$ with the hidden state $\mathbf{h}_t$. In practice however, learning long-range dependencies with a vanilla RNN is difficult due to vanishing/exploding gradients [Bengio, Simard, and Frasconi 1994], which occur as a result of the Jacobian's multiplicativity with respect to time.
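Long short-term memory (LSTM) [Hochreiter and Schmidhuber 1997] addresses the problem of learning long range dependencies by augmenting the RNN with a memory cell vector $\mathbf{c}_t \in \mathbb{R}^m$ at each time step. Concretely, one step of an LSTM takes as input $\mathbf{x}_t, \mathbf{h}_{t-1}, \mathbf{c}_{t-1}$ and produces $\mathbf{h}_t, \mathbf{c}_t$ via the following intermediate calculations:
$$\begin{aligned}
\mathbf{i}_t &= \sigma(\mathbf{W}^i \mathbf{x}_t + \mathbf{U}^i \mathbf{h}_{t-1} + \mathbf{b}^i) \\
\mathbf{f}_t &= \sigma(\mathbf{W}^f \mathbf{x}_t + \mathbf{U}^f \mathbf{h}_{t-1} + \mathbf{b}^f) \\
\mathbf{o}_t &= \sigma(\mathbf{W}^o \mathbf{x}_t + \mathbf{U}^o \mathbf{h}_{t-1} + \mathbf{b}^o) \\
\mathbf{g}_t &= \tanh(\mathbf{W}^g \mathbf{x}_t + \mathbf{U}^g \mathbf{h}_{t-1} + \mathbf{b}^g) \\
\mathbf{c}_t &= \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \mathbf{g}_t \\
\mathbf{h}_t &= \mathbf{o}_t \odot \tanh(\mathbf{c}_t)
\end{aligned}$$
Here $\sigma(\cdot)$ and $\tanh(\cdot)$ are the element-wise sigmoid and hyperbolic tangent functions, $\odot$ is the element-wise multiplication operator, and $\mathbf{i}_t$, $\mathbf{f}_t$, $\mathbf{o}_t$ are referred to as input, forget, and output gates. At $t = 1$, $\mathbf{h}_0$ and $\mathbf{c}_0$ are initialized to zero vectors. Parameters of the LSTM are $\mathbf{W}^j, \mathbf{U}^j, \mathbf{b}^j$ for $j \in \{i, f, o, g\}$.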
Memory cells in the LSTM are additive with respect to time, alleviating the gradient vanishing problem. Gradient exploding is still an issue, though in practice simple optimization strategies (such as gradient clipping) work well. LSTMs have been shown to outperform vanilla RNNs on many tasks, including on language modeling [Sundermeyer, Schluter, and Ney 2012]. It is easy to extend the RNN/LSTM to two (or more) layers by having another network whose input at $t$ is $\mathbf{h}_t$ (from the first network). Indeed, having multiple layers is often crucial for obtaining competitive performance on various tasks [Pascanu et al. 2013].
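The LSTM step described above can be written out in a few lines. The following is only a minimal pure-Python sketch of the standard LSTM recurrence, not the released implementation; the function names (`affine`, `lstm_step`) and the `params` dictionary layout are illustrative assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, x, U, h, b):
    """Return W x + U h + b for list-of-rows matrices and list vectors."""
    return [
        sum(w * xi for w, xi in zip(W_row, x))
        + sum(u * hi for u, hi in zip(U_row, h))
        + b_i
        for W_row, U_row, b_i in zip(W, U, b)
    ]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step. `params` (hypothetical layout) maps each of the
    names 'i', 'f', 'o', 'g' to a tuple (W, U, b)."""
    pre = {name: affine(W, x, U, h_prev, b) for name, (W, U, b) in params.items()}
    i = [sigmoid(a) for a in pre["i"]]    # input gate
    f = [sigmoid(a) for a in pre["f"]]    # forget gate
    o = [sigmoid(a) for a in pre["o"]]    # output gate
    g = [math.tanh(a) for a in pre["g"]]  # candidate cell update
    # The memory cell is updated additively, which is what eases gradient flow.
    c = [f_j * c_j + i_j * g_j for f_j, c_j, i_j, g_j in zip(f, c_prev, i, g)]
    h = [o_j * math.tanh(c_j) for o_j, c_j in zip(o, c)]
    return h, c


# Toy usage: 2-dimensional input, 1 hidden unit, all parameters zero.
zero = ([[0.0, 0.0]], [[0.0]], [0.0])
params = {name: zero for name in "ifog"}
h, c = lstm_step([1.0, -1.0], [0.0], [1.0], params)
```

With all parameters at zero every gate equals 0.5 and the candidate update is 0, so the new cell is simply half of the previous one; this makes the additive cell update easy to check by hand.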
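Let $\mathcal{V}$ be the fixed size vocabulary of words. A language model specifies a distribution over $w_{t+1}$ (whose support is $\mathcal{V}$) given the historical sequence $w_{1:t} = [w_1, \ldots, w_t]$. A recurrent neural network language model (RNN-LM) does this by applying an affine transformation to the hidden layer followed by a softmax:
$$\Pr(w_{t+1} = j \mid w_{1:t}) = \frac{\exp(\mathbf{h}_t \cdot \mathbf{p}^j + q^j)}{\sum_{j' \in \mathcal{V}} \exp(\mathbf{h}_t \cdot \mathbf{p}^{j'} + q^{j'})}$$
where $\mathbf{p}^j$ is the $j$-th column of $\mathbf{P} \in \mathbb{R}^{m \times |\mathcal{V}|}$ (also referred to as the output embedding), and $q^j$ is a bias term. (In our work, predictions are at the word-level, and hence we still utilize word embeddings in the output layer.) Similarly, for a conventional RNN-LM which usually takes words as inputs, if $w_t = k$, then the input to the RNN-LM at $t$ is the input embedding $\mathbf{x}^k$, the $k$-th column of the embedding matrix $\mathbf{X} \in \mathbb{R}^{n \times |\mathcal{V}|}$. Our model simply replaces the input embeddings $\mathbf{X}$ with the output from a character-level convolutional neural network, to be described below.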
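As a rough illustration of the output layer described above, the sketch below computes the softmax distribution over the vocabulary from a hidden state. It is a toy under stated assumptions: the function name `next_word_distribution` is hypothetical, and output embeddings are stored one row per word (the transpose of the column convention used in the text).

```python
import math


def next_word_distribution(h, P, q):
    """Softmax over the vocabulary.

    h: hidden state (list of floats)
    P: P[j] is the output embedding of word j (one row per word; an
       assumption made here for convenience)
    q: q[j] is the bias of word j
    """
    scores = [sum(hi * pi for hi, pi in zip(h, P[j])) + q[j] for j in range(len(P))]
    top = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy usage with a two-word vocabulary: scores are 0 and log(3).
probs = next_word_distribution([1.0, 0.0], [[0.0, 0.0], [math.log(3.0), 0.0]], [0.0, 0.0])
```

The cost of this layer grows linearly with the vocabulary size, since every word contributes one term to the normalizer.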
In our model, the input at time $t$ is an output from a character-level convolutional neural network (CharCNN), which we describe in this section. CNNs [LeCun et al. 1989] have achieved state-of-the-art results on computer vision [Krizhevsky, Sutskever, and Hinton 2012] and have also been shown to be effective for various NLP tasks [Collobert et al. 2011]. Architectures employed for NLP applications differ in that they typically involve temporal rather than spatial convolutions.
Let $\mathcal{C}$ be the vocabulary of characters, $d$ be the dimensionality of character embeddings, and $\mathbf{Q} \in \mathbb{R}^{d \times |\mathcal{C}|}$ be the matrix of character embeddings. (Given that $|\mathcal{C}|$ is usually small, some authors work with one-hot representations of characters. However we found that using lower dimensional representations of characters, i.e. $d < |\mathcal{C}|$, performed slightly better.) Suppose that word $k \in \mathcal{V}$ is made up of a sequence of characters $[c_1, \ldots, c_l]$, where $l$ is the length of word $k$. Then the character-level representation of $k$ is given by the matrix $\mathbf{C}^k \in \mathbb{R}^{d \times l}$, where the $j$-th column corresponds to the character embedding for $c_j$ (i.e. the $c_j$-th column of $\mathbf{Q}$). Two technical details warrant mention here: (1) we append start-of-word and end-of-word characters to each word to better represent prefixes and suffixes and hence $\mathbf{C}^k$ actually has $l + 2$ columns; (2) for batch processing, we zero-pad $\mathbf{C}^k$ so that the number of columns is constant (equal to the max word length) for all words in $\mathcal{V}$.
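We apply a narrow convolution between $\mathbf{C}^k$ and a filter (or kernel) $\mathbf{H} \in \mathbb{R}^{d \times w}$ of width $w$, after which we add a bias and apply a nonlinearity to obtain a feature map $\mathbf{f}^k \in \mathbb{R}^{l - w + 1}$. Specifically, the $i$-th element of $\mathbf{f}^k$ is given by:
$$\mathbf{f}^k[i] = \tanh(\langle \mathbf{C}^k[*, i : i + w - 1], \mathbf{H} \rangle + b)$$
where $\mathbf{C}^k[*, i : i + w - 1]$ is the $i$-to-$(i + w - 1)$-th column of $\mathbf{C}^k$ and $\langle \mathbf{A}, \mathbf{B} \rangle = \mathrm{Tr}(\mathbf{A}\mathbf{B}^T)$ is the Frobenius inner product. Finally, we take the max-over-time
$$y^k = \max_i \mathbf{f}^k[i]$$
as the feature corresponding to the filter $\mathbf{H}$ (when applied to word $k$). The idea is to capture the most important feature (the one with the highest value) for a given filter. A filter is essentially picking out a character $n$-gram, where the size of the $n$-gram corresponds to the filter width.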
We have described the process by which one feature is obtained from one filter matrix. Our CharCNN uses multiple filters of varying widths to obtain the feature vector for $k$. So if we have a total of $h$ filters $\mathbf{H}_1, \ldots, \mathbf{H}_h$, then $\mathbf{y}^k = [y^k_1, \ldots, y^k_h]$ is the input representation of $k$. For many NLP applications $h$ is typically chosen to be in .
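The character-level feature extraction described above (embedding lookup, narrow convolution, bias plus nonlinearity, max-over-time pooling, one feature per filter) can be sketched as follows. This is a minimal pure-Python illustration rather than the released code; the function name `char_cnn_features`, the column-list layout of the filters, and the choice of tanh as the nonlinearity are assumptions made for the example.

```python
import math


def char_cnn_features(word_chars, Q, filters):
    """Map one word to a vector with one max-over-time feature per filter.

    word_chars: list of character indices for the word
    Q:          Q[c] is the d-dimensional embedding of character c
    filters:    list of (H, b); H is a list of w columns, each of length d
    Assumes the word is at least as long as the widest filter (in practice
    padding takes care of this).
    """
    C = [Q[c] for c in word_chars]  # C[j] is the j-th column of the word matrix
    features = []
    for H, b in filters:
        w = len(H)
        feature_map = []
        for i in range(len(C) - w + 1):  # narrow convolution: l - w + 1 positions
            # Frobenius inner product between a window of w columns and the filter
            inner = sum(C[i + a][r] * H[a][r] for a in range(w) for r in range(len(H[a])))
            feature_map.append(math.tanh(inner + b))
        features.append(max(feature_map))  # max-over-time pooling
    return features


# Toy usage: 1-dimensional embeddings for a 3-character alphabet,
# one filter of width 1 and one of width 2.
Q = [[1.0], [0.0], [-1.0]]
filters = [([[1.0]], 0.0), ([[2.0], [-1.0]], 0.0)]
feats = char_cnn_features([0, 1, 2], Q, filters)
```

Because of the max-over-time pooling, the output length depends only on the number of filters, not on the word length, which is what lets the result stand in for a fixed-size word embedding.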
We could simply replace $\mathbf{x}^k$ (the word embedding) with $\mathbf{y}^k$ at each $t$ in the RNN-LM, and as we show later, this simple model performs well on its own (Table 7). One could also have a multilayer perceptron (MLP) over $\mathbf{y}^k$ to model interactions between the character $n$-grams picked up by the filters, but we found that this resulted in worse performance.
Instead we obtained improvements by running $\mathbf{y}^k$ through a highway network, recently proposed by Srivastava et al. (2015). Whereas one layer of an MLP applies an affine transformation followed by a nonlinearity to obtain a new set of features,
$$\mathbf{z} = g(\mathbf{W}\mathbf{y} + \mathbf{b})$$
one layer of a highway network does the following:
$$\mathbf{z} = \mathbf{t} \odot g(\mathbf{W}_H \mathbf{y} + \mathbf{b}_H) + (\mathbf{1} - \mathbf{t}) \odot \mathbf{y}$$
where $g$ is a nonlinearity, $\mathbf{t} = \sigma(\mathbf{W}_T \mathbf{y} + \mathbf{b}_T)$ is called the transform gate, and $(\mathbf{1} - \mathbf{t})$ is called the carry gate. Similar to the memory cells in LSTM networks, highway layers allow for training of deep networks by adaptively carrying some dimensions of the input directly to the output. (Srivastava et al. (2015) recommend initializing $\mathbf{b}_T$ to a negative value, in order to bias the initial behavior towards carry. We initialized $\mathbf{b}_T$ to a small interval around .) By construction the dimensions of $\mathbf{y}$ and $\mathbf{z}$ have to match, and hence $\mathbf{W}_T$ and $\mathbf{W}_H$ are square matrices.
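To make the transform/carry behaviour concrete, here is a small pure-Python sketch of a single highway layer. It is an illustration under assumptions rather than the authors' code: the names `matvec` and `highway_layer` are hypothetical, and ReLU is used for the nonlinearity $g$ purely as an example choice.

```python
import math


def matvec(W, y, b):
    """Return W y + b for a list-of-rows matrix W."""
    return [sum(w * yi for w, yi in zip(row, y)) + bi for row, bi in zip(W, b)]


def highway_layer(y, W_H, b_H, W_T, b_T):
    """z = t * g(W_H y + b_H) + (1 - t) * y, with t = sigmoid(W_T y + b_T)."""
    t = [1.0 / (1.0 + math.exp(-a)) for a in matvec(W_T, y, b_T)]  # transform gate
    g = [max(0.0, a) for a in matvec(W_H, y, b_H)]  # ReLU, an assumed choice of g
    return [t_j * g_j + (1.0 - t_j) * y_j for t_j, g_j, y_j in zip(t, g, y)]


# Toy usage: identity transform weights, and two settings of the gate bias.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
y = [1.0, -2.0]
half_open = highway_layer(y, identity, [0.0, 0.0], zeros, [0.0, 0.0])
mostly_carry = highway_layer(y, identity, [0.0, 0.0], zeros, [-50.0, -50.0])
```

With a strongly negative gate bias the layer passes its input through almost unchanged, which is the carry behaviour that a negative initialization of the transform-gate bias encourages early in training.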
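As is standard in language modeling, we use perplexity ($PPL$) to evaluate the performance of our models. Perplexity of a model over a sequence $[w_1, \ldots, w_T]$ is given by
$$PPL = \exp\left(\frac{NLL}{T}\right)$$
where $NLL = -\sum_{t=1}^{T} \log \Pr(w_t \mid w_{1:t-1})$ is the negative log-likelihood, calculated over the test set. We test the model on corpora of varying languages and sizes (statistics available in Table 1).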
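For readers who want to check the definition numerically, perplexity can be computed directly from the per-word probabilities a model assigns to a held-out sequence. The helper below is a minimal sketch; the name `perplexity` and the list-of-probabilities input format are assumptions for illustration.

```python
import math


def perplexity(word_probs):
    """word_probs[t] is the model probability of the t-th word given its history."""
    nll = -sum(math.log(p) for p in word_probs)  # negative log-likelihood
    return math.exp(nll / len(word_probs))


# A model that always assigns probability 1/4 to the correct word
# has perplexity 4, regardless of sequence length.
ppl = perplexity([0.25] * 8)
```

Perplexity can thus be read as an effective branching factor: a model that is uniformly uncertain among $N$ words at every step has perplexity $N$.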
We conduct hyperparameter search, model introspection, and ablation studies on the English Penn Treebank (PTB) [Marcus, Santorini, and Marcinkiewicz 1993], utilizing the standard training (0-20), validation (21-22), and test (23-24) splits along with pre-processing by Mikolov et al. (2010). With approximately m tokens and k, this version has been extensively used by the language modeling community and is publicly available (http://www.fit.vutbr.cz/~imikolov/rnnlm/).
With the optimal hyperparameters tuned on PTB, we apply the model to various morphologically rich languages: Czech, German, French, Spanish, Russian, and Arabic. Non-Arabic data comes from the 2013 ACL Workshop on Machine Translation (http://www.statmt.org/wmt13/translation-task.html), and we use the same train/validation/test splits as in Botha and Blunsom (2014). While the raw data are publicly available, we obtained the preprocessed versions from the authors (http://bothameister.github.io/), whose morphological NLM serves as a baseline for our work. We train on both the small datasets (Data-s) with m tokens per language, and the large datasets (Data-l) including the large English data which has a much bigger $|\mathcal{V}|$ than the PTB. Arabic data comes from the News-Commentary corpus (http://opus.lingfil.uu.se/News-Commentary.php), and we perform our own preprocessing and train/validation/test splits.
In these datasets only singleton words were replaced with <unk> and hence we effectively use the full vocabulary. It is worth noting that the character model can utilize surface forms of OOV tokens (which were replaced with <unk>), but we do not do this and stick to the preprocessed versions (despite disadvantaging the character models) for exact comparison against prior work.
time steps using stochastic gradient descent where the learning rate is initially set to and halved if the perplexity does not decrease by more than on the validation set after an epoch. On Data-s we use a batch size of and on Data-l we use a batch size of (for greater efficiency). Gradients are averaged over each batch. We train for epochs on non-Arabic and epochs on Arabic data (which was sufficient for convergence), picking the best performing model on the validation set. Parameters of the model are randomly initialized over a uniform distribution with support .
For regularization we use dropout [Hinton et al. 2012] with probability on the LSTM input-to-hidden layers (except on the initial Highway to LSTM layer) and the hidden-to-output softmax layer. We further constrain the norm of the gradients to be below , so that if the norm of the gradient exceeds then we renormalize it to have before updating. The gradient norm constraint was crucial in training the model. These choices were largely guided by previous work of Zaremba et al. (2014) on word-level language modeling with LSTMs.
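The two optimization heuristics described above, rescaling the gradient when its norm exceeds a threshold and halving the learning rate when validation perplexity stalls, can be sketched as below. This is only an illustrative sketch: the function names are hypothetical, the L2 norm is an assumption, and the thresholds are left as arguments, so the numbers in the usage lines are arbitrary placeholders rather than the settings used in the experiments.

```python
import math


def clip_gradient(grad, max_norm):
    """Rescale grad so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grad]
    return list(grad)


def next_learning_rate(lr, prev_val_ppl, val_ppl, min_decrease):
    """Halve lr unless validation perplexity dropped by more than min_decrease."""
    if prev_val_ppl - val_ppl > min_decrease:
        return lr
    return lr / 2.0


# Placeholder values, chosen only to exercise both branches.
clipped = clip_gradient([3.0, 4.0], max_norm=1.0)      # norm 5 is rescaled to norm 1
unchanged = clip_gradient([3.0, 4.0], max_norm=10.0)   # already within the limit
lr_after_stall = next_learning_rate(1.0, 100.0, 99.5, min_decrease=1.0)
lr_after_gain = next_learning_rate(1.0, 100.0, 98.0, min_decrease=1.0)
```

Rescaling preserves the direction of the update and only limits its length, which is why it guards against exploding gradients without otherwise changing the descent direction.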
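Finally, in order to speed up training on Data-l we employ a hierarchical softmax [Morin and Bengio 2005], a common strategy for training language models with very large $|\mathcal{V}|$, instead of the usual softmax. We pick the number of clusters $c = \lceil \sqrt{|\mathcal{V}|} \rceil$ and randomly split $\mathcal{V}$ into mutually exclusive and collectively exhaustive subsets $\mathcal{V}_1, \ldots, \mathcal{V}_c$ of (approximately) equal size. (While Brown clustering/frequency-based clustering is commonly used in the literature, e.g. Botha and Blunsom (2014) use Brown clustering, we used random clusters as our implementation enjoys the best speed-up when the number of words in each cluster is approximately equal. We found random clustering to work surprisingly well.) Then $\Pr(w_{t+1} = j \mid w_{1:t})$ becomes,
$$\Pr(w_{t+1} = j \mid w_{1:t}) = \frac{\exp(\mathbf{h}_t \cdot \mathbf{s}^r + t^r)}{\sum_{r'=1}^{c} \exp(\mathbf{h}_t \cdot \mathbf{s}^{r'} + t^{r'})} \times \frac{\exp(\mathbf{h}_t \cdot \mathbf{p}_r^j + q_r^j)}{\sum_{j' \in \mathcal{V}_r} \exp(\mathbf{h}_t \cdot \mathbf{p}_r^{j'} + q_r^{j'})}$$
where $r$ is the cluster index such that $j \in \mathcal{V}_r$, $\mathbf{s}^r, t^r$ are the cluster-level output embedding and bias, and $\mathbf{p}_r^j, q_r^j$ are the word-level output embedding and bias within cluster $r$. The first term is simply the probability of picking cluster $r$, and the second term is the probability of picking word $j$ given that cluster $r$ is picked. We found that hierarchical softmax was not necessary for models trained on Data-s.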
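The two-level factorization described above can be illustrated with a small pure-Python sketch. It is a toy under stated assumptions, not the released implementation: the names `random_clusters` and `word_probability` are hypothetical, word-level parameters are indexed by global word id (equivalent to per-cluster parameters, since each word belongs to exactly one cluster), and the vocabulary and dimensions in the usage lines are arbitrary.

```python
import math
import random


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def random_clusters(vocab_size, num_clusters, seed=0):
    """Shuffle word ids and deal them round-robin into near-equal clusters."""
    words = list(range(vocab_size))
    random.Random(seed).shuffle(words)
    return [words[r::num_clusters] for r in range(num_clusters)]


def word_probability(j, h, clusters, S, t, P, q):
    """Pr(word j | h) = Pr(cluster r | h) * Pr(j | cluster r, h), j in clusters[r]."""
    r = next(idx for idx, members in enumerate(clusters) if j in members)
    cluster_probs = softmax([dot(h, S[a]) + t[a] for a in range(len(clusters))])
    within_probs = softmax([dot(h, P[w]) + q[w] for w in clusters[r]])
    return cluster_probs[r] * within_probs[clusters[r].index(j)]


# Toy usage: 10 words, ceil(sqrt(10)) = 4 clusters, random parameters.
rng = random.Random(1)
V, m = 10, 3
c = math.ceil(math.sqrt(V))
clusters = random_clusters(V, c)
h = [rng.uniform(-1, 1) for _ in range(m)]
S = [[rng.uniform(-1, 1) for _ in range(m)] for _ in range(c)]
t = [0.0] * c
P = [[rng.uniform(-1, 1) for _ in range(m)] for _ in range(V)]
q = [0.0] * V
total = sum(word_probability(j, h, clusters, S, t, P, q) for j in range(V))
```

Scoring one word touches only the cluster scores plus the words in a single cluster, so with about $\sqrt{|\mathcal{V}|}$ clusters of about $\sqrt{|\mathcal{V}|}$ words each the cost per prediction drops from order $|\mathcal{V}|$ to order $\sqrt{|\mathcal{V}|}$.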
|KN- (Mikolov et al. 2012)||m|
|RNN (Mikolov et al. 2012)||m|
|RNN-LDA (Mikolov et al. 2012)||m|
|genCNN [Wang et al.2015]||m|
|FOFE-FNNLM [Zhang et al.2015]||m|
|Deep RNN [Pascanu et al.2013]||m|
|Sum-Prod Net [Cheng et al.2014]||m|
|LSTM-1 (Zaremba et al. 2014)||m|
|LSTM-2 (Zaremba et al. 2014)||m|
We train two versions of our model to assess the trade-off between performance and size. Architecture of the small (LSTM-Char-Small) and large (LSTM-Char-Large) models is summarized in Table 2. As another baseline, we also train two comparable LSTM models that use word embeddings only (LSTM-Word-Small, LSTM-Word-Large). LSTM-Word-Small uses hidden units and LSTM-Word-Large uses hidden units. Word embedding sizes are also and respectively. These were chosen to keep the number of parameters similar to the corresponding character-level model.
As can be seen from Table 3, our large model is on par with the existing state-of-the-art (Zaremba et al. 2014), despite having approximately 60% fewer parameters. Our small model significantly outperforms other NLMs of similar size, even though it is penalized by the fact that the dataset already has OOV words replaced with <unk> (other models are purely word-level models). While lower perplexities have been reported with model ensembles [Mikolov and Zweig 2012], we do not include them here as they are not comparable to the current work.
The model’s performance on the English PTB is informative to the extent that it facilitates comparison against the large body of existing work. However, English is relatively simple from a morphological standpoint, and thus our next set of results (and arguably the main contribution of this paper) is focused on languages with richer morphology (Table 4, Table 5).
We compare our results against the morphological log-bilinear (MLBL) model from Botha and Blunsom (2014), whose model also takes into account subword information through morpheme embeddings that are summed at the input and output layers. As comparison against the MLBL models is confounded by our use of LSTMs (widely known to outperform their feed-forward/log-bilinear cousins), we also train an LSTM version of the morphological NLM, where the input representation of a word given to the LSTM is a summation of the word's morpheme embeddings. Concretely, suppose that $\mathcal{M}$ is the set of morphemes in a language, $\mathbf{M} \in \mathbb{R}^{n \times |\mathcal{M}|}$ is the matrix of morpheme embeddings, and $\mathbf{m}^j$ is the $j$-th column of $\mathbf{M}$ (i.e. a morpheme embedding). Given the input word $k$, we feed the following representation to the LSTM:
$$\mathbf{x}^k + \sum_{j \in \mathcal{M}_k} \mathbf{m}^j$$
where $\mathbf{x}^k$ is the word embedding (as in a word-level model) and $\mathcal{M}_k \subset \mathcal{M}$ is the set of morphemes for word $k$. The morphemes are obtained by running an unsupervised morphological tagger as a preprocessing step. (We use Morfessor Cat-MAP [Creutz and Lagus 2007], as in Botha and Blunsom (2014).) We emphasize that the word embedding itself (i.e. $\mathbf{x}^k$) is added on top of the morpheme embeddings, as was done in Botha and Blunsom (2014). The morpheme embeddings are of size / for the small/large models respectively. We further train word-level LSTM models as another baseline.
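The morpheme-based input used by this baseline amounts to a simple sum of embeddings. The snippet below is a minimal sketch of that composition; the function name `morpheme_input`, the dictionary-based lookup tables, and the segmentation of "uneventful" are all hypothetical and serve only as an illustration.

```python
def morpheme_input(word, word_emb, morph_emb, morphemes_of):
    """Return the word embedding plus the sum of its morpheme embeddings."""
    x = list(word_emb[word])
    for morpheme in morphemes_of[word]:
        x = [a + b for a, b in zip(x, morph_emb[morpheme])]
    return x


# Made-up 2-dimensional embeddings and a made-up segmentation.
word_emb = {"uneventful": [1.0, 0.0]}
morph_emb = {"un": [0.5, 0.5], "event": [0.0, 1.0], "ful": [-0.5, 0.0]}
morphemes_of = {"uneventful": ["un", "event", "ful"]}
x = morpheme_input("uneventful", word_emb, morph_emb, morphemes_of)
```

Because the word embedding is still part of the sum, this baseline keeps a full word embedding table in addition to the morpheme table.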
On Data-s it is clear from Table 4 that the character-level models outperform their word-level counterparts despite, again, being smaller. (The difference in parameters is greater for non-PTB corpora as the size of the word model scales faster with $|\mathcal{V}|$. For example, on Arabic the small/large word models have m/m parameters while the corresponding character models have m/m parameters respectively.) The character models also outperform their morphological counterparts (both MLBL and LSTM architectures), although improvements over the morphological LSTMs are more measured. Note that the morpheme models have strictly more parameters than the word models because word embeddings are used as part of the input.
Due to memory constraints (all models were trained on GPUs with 2GB memory) we only train the small models on Data-l (Table 5). Interestingly we do not observe significant differences going from word to morpheme LSTMs on Spanish, French, and English. The character models again outperform the word/morpheme models. We also observe significant perplexity reductions even on English when $|\mathcal{V}|$ is large. We conclude this section by noting that we used the same architecture for all languages and did not perform any language-specific tuning of hyperparameters.
Nearest neighbor words (based on cosine similarity) of word representations from the large word-level and character-level (before and after highway layers) models trained on the PTB. Last three words are OOV words, and therefore they do not have representations in the word-level model.
We explore the word representations learned by the models on the PTB. Table 6 has the nearest neighbors of word representations learned from both the word-level and character-level models. For the character models we compare the representations obtained before and after highway layers.
Before the highway layers the representations seem to solely rely on surface forms—for example the nearest neighbors of you are your, young, four, youth, which are close to you in terms of edit distance. The highway layers however, seem to enable encoding of semantic features that are not discernable from orthography alone. After highway layers the nearest neighbor of you is we, which is orthographically distinct from you. Another example is while and though—these words are far apart edit distance-wise yet the composition model is able to place them near each other. The model also makes some clear mistakes (e.g. his and hhs), highlighting the limits of our approach, although this could be due to the small dataset.
The learned representations of OOV words (computer-aided, misinformed) are positioned near words with the same part-of-speech. The model is also able to correct for incorrect/non-standard spelling (looooook), indicating potential applications for text normalization in noisy domains.
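The nearest-neighbor analysis above relies on cosine similarity between word representations. A minimal sketch of such a lookup is given below; the function names and the toy 2-dimensional vectors are invented for illustration and are not representations produced by any of the trained models.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(query, reps, k=5):
    """Return the k words whose representations are closest to the query word."""
    scored = [(cosine(reps[query], vec), word) for word, vec in reps.items() if word != query]
    return [word for _, word in sorted(scored, reverse=True)[:k]]


# Invented vectors, used only to show the call pattern.
reps = {"you": [1.0, 0.0], "we": [0.9, 0.1], "young": [0.0, 1.0]}
neighbors = nearest_neighbors("you", reps, k=2)
```

Since a character-composition model can produce a representation for any string, the same lookup also applies to OOV words, provided their representation is computed from characters first.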
As discussed previously, each filter of the CharCNN is essentially learning to detect particular character $n$-grams. Our initial expectation was that each filter would learn to activate on different morphemes and then build up semantic representations of words from the identified morphemes. However, upon reviewing the character $n$-grams picked up by the filters (i.e. those that maximized the value of the filter), we found that they did not (in general) correspond to valid morphemes.
To get a better intuition for what the character composition model is learning, we plot the learned representations of all character $n$-grams (that occurred as part of at least two words in $\mathcal{V}$) via principal components analysis (Figure 2). We feed each character $n$-gram into the CharCNN and use the CharCNN's output as the fixed dimensional representation for the corresponding character $n$-gram. As is apparent from Figure 2, the model learns to differentiate between prefixes (red), suffixes (blue), and others (grey). We also find that the representations are particularly sensitive to character $n$-grams containing hyphens (orange), presumably because this is a strong signal of a word's part-of-speech.
We quantitatively investigate the effect of highway network layers via ablation studies (Table 7). We train a model without any highway layers, and find that performance decreases significantly. As the difference in performance could be due to the decrease in model size, we also train a model that feeds $\mathbf{y}^k$ (i.e. word representation from the CharCNN) through a one-layer multilayer perceptron (MLP) to use as input into the LSTM. We find that the MLP does poorly, although this could be due to optimization issues.
We hypothesize that highway networks are especially well-suited to work with CNNs, adaptively combining local features detected by the individual filters. CNNs have already proven to be successful for many NLP tasks [Collobert et al. 2011; Shen et al. 2014; Kalchbrenner, Grefenstette, and Blunsom 2014; Kim 2014; Zhang, Zhao, and LeCun 2015; Lei, Barzilay, and Jaakola 2015], and we posit that further gains could be achieved by employing highway layers on top of existing CNN architectures.
We also anecdotally note that (1) having one to two highway layers was important, but more highway layers generally resulted in similar performance (though this may depend on the size of the datasets), (2) having more convolutional layers before max-pooling did not help, and (3) highway layers did not improve models that only used word embeddings as inputs.
|No Highway Layers|
|One Highway Layer|
|Two Highway Layers|
|One MLP Layer|
We next study the effect of training corpus/vocabulary sizes on the relative performance between the different models. We take the German (De) dataset from Data-l and vary the training corpus/vocabulary sizes, calculating the perplexity reductions as a result of going from a small word-level model to a small character-level model. To vary the vocabulary size we take the most frequent words and replace the rest with <unk>. As with previous experiments the character model does not utilize surface forms of <unk> and simply treats it as another token. Although Table 8 suggests that the perplexity reductions become less pronounced as the corpus size increases, we nonetheless find that the character-level model outperforms the word-level model in all scenarios.
We report on some further experiments and observations:
Combining word embeddings with the CharCNN's output to form a combined representation of a word (to be used as input to the LSTM) resulted in slightly worse performance ( on PTB with a large model). This was surprising, as improvements have been reported on part-of-speech tagging [dos Santos and Zadrozny 2014] and named entity recognition [dos Santos and Guimaraes 2015] by concatenating word embeddings with the output from a character-level CNN. While this could be due to insufficient experimentation on our part, it suggests that for some tasks, word embeddings are superfluous: character inputs are good enough. (We experimented with (1) concatenation, (2) tensor products, (3) averaging, and (4) adaptive weighting schemes whereby the model learns a convex combination of word embeddings and the CharCNN outputs.)
While our model requires additional convolution operations over characters and is thus slower than a comparable word-level model which can perform a simple lookup at the input layer, we found that the difference was manageable with optimized GPU implementations; for example on PTB the large character-level model trained at tokens/sec compared to the word-level model which trained at tokens/sec. For scoring, our model can have the same running time as a pure word-level model, as the CharCNN's outputs can be pre-computed for all words in $\mathcal{V}$. This would, however, be at the expense of increased model size, and thus a trade-off can be made between run-time speed and memory (e.g. one could restrict the pre-computation to the most frequent words).
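The speed/memory trade-off mentioned above can be sketched as a lookup table for frequent words with a fallback to on-the-fly composition. The code below is a hypothetical illustration: `build_cache`, `word_representation`, and the stand-in `compose` function are assumed names, and `compose` merely takes the place of the CharCNN plus highway layers.

```python
def build_cache(frequent_words, compose):
    """Pre-compute representations for a chosen set of frequent words."""
    return {word: compose(word) for word in frequent_words}


def word_representation(word, cache, compose):
    """Use the cached vector when available, otherwise compose from characters."""
    if word in cache:
        return cache[word]
    return compose(word)


# Stand-in for the character composition model; it records each call so the
# effect of caching is visible.
calls = []


def compose(word):
    calls.append(word)
    return [float(len(word))]


cache = build_cache(["the"], compose)
rep_frequent = word_representation("the", cache, compose)    # served from the cache
rep_rare = word_representation("eventful", cache, compose)   # composed on demand
```

The size of the cache is the knob: caching every word recovers word-level lookup speed at the cost of storing a full embedding table, while caching only frequent words keeps the model small and pays the convolution cost only for rare words.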
Neural Language Models (NLM) encompass a rich family of neural network architectures for language modeling. Some example architectures include feed-forward [Bengio, Ducharme, and Vincent2003], recurrent [Mikolov et al.2010], sum-product [Cheng et al.2014], log-bilinear [Mnih and Hinton2007], and convolutional [Wang et al.2015] networks.
In order to address the rare word problem, Alexandrescu and Kirchhoff (2006), building on analogous work on count-based $n$-gram language models by Bilmes and Kirchhoff (2003), represent a word as a set of shared factor embeddings. Their Factored Neural Language Model (FNLM) can incorporate morphemes, word shape information (e.g. capitalization) or any other annotation (e.g. part-of-speech tags) to represent words.
A specific class of FNLMs leverages morphemic information by viewing a word as a function of its (learned) morpheme embeddings [Luong, Socher, and Manning 2013; Botha and Blunsom 2014; Qui et al. 2014]. For example Luong, Socher, and Manning (2013) apply a recursive neural network over morpheme embeddings to obtain the embedding for a single word. While such models have proved useful, they require morphological tagging as a preprocessing step.
Another direction of work has involved purely character-level NLMs, wherein both input and output are characters [Sutskever, Martens, and Hinton2011, Graves2013]. Character-level models obviate the need for morphological tagging or manual feature engineering, and have the attractive property of being able to generate novel words. However they are generally outperformed by word-level models [Mikolov et al.2012].
Outside of language modeling, improvements have been reported on part-of-speech tagging [dos Santos and Zadrozny 2014] and named entity recognition [dos Santos and Guimaraes 2015] by representing a word as a concatenation of its word embedding and an output from a character-level CNN, and using the combined representation as features in a Conditional Random Field (CRF). Zhang, Zhao, and LeCun (2015) do away with word embeddings completely and show that for text classification, a deep CNN over characters performs well. Ballesteros, Dyer, and Smith (2015) use an RNN over characters only to train a transition-based parser, obtaining improvements on many morphologically rich languages.
Finally, Ling et al. (2015) apply a bi-directional LSTM over characters to use as inputs for language modeling and part-of-speech tagging. They show improvements on various languages (English, Portuguese, Catalan, German, Turkish). It remains open as to which character composition model (i.e. CNN or LSTM) performs better.
We have introduced a neural language model that utilizes only character-level inputs. Predictions are still made at the word-level. Despite having fewer parameters, our model outperforms baseline models that utilize word/morpheme embeddings in the input layer. Our work questions the necessity of word embeddings (as inputs) for neural language modeling.
Analysis of word representations obtained from the character composition part of the model further indicates that the model is able to encode, from characters only, rich semantic and orthographic features. Using the CharCNN and highway layers for representation learning (e.g. as input into word2vec [Mikolov et al.2013]) remains an avenue for future work.
Insofar as sequential processing of words as inputs is ubiquitous in natural language processing, it would be interesting to see if the architecture introduced in this paper is viable for other tasks—for example, as an encoder/decoder in neural machine translation[Cho et al.2014, Sutskever, Vinyals, and Le2014].
We are especially grateful to Jan Botha for providing the preprocessed datasets and the model results.
Bengio, Y.; Ducharme, R.; and Vincent, P. 2003. A Neural Probabilistic Language Model. Journal of Machine Learning Research 3:1137–1155.