Recent advances in Natural Language Processing (NLP) are characterized by the development of techniques that compute powerful word embeddings and by the extensive use of neural language models. Word Embeddings (WEs) aim at representing individual words in a low–dimensional continuous space, in order to exploit its topological properties to model semantic or grammatical relationships between different words. In particular, they are based on the assumption that functionally or semantically related words appear in similar contexts.
Although the idea of continuous word representations was proposed several years ago, they became widely popular mostly after the work of Mikolov et al., when the CBOW and Skip–Gram models were introduced as implementations of the word2vec idea. Key features of these models are the unsupervised scheme of the learning process and the simplicity of the computation, which allows highly efficient training on very large unlabeled corpora. Moreover, the learning objective function is task–independent, so that the resulting embeddings are suitable for several NLP tasks. WEs generally consist of a single vector for each specific word in a predefined vocabulary of words. The requirement of a predefined vocabulary is an important limitation for every NLP model: rare and Out–Of–Vocabulary (OOV) words will not have a meaningful vector representation. Moreover, WEs do not take into account morphological properties of words. For instance, the same suffix -ing may suggest that two words have some functional similarity. Hence, the information conveyed by the sequence of characters representing a word may be useful to tackle both the problem of unseen words and the modelling of morphology for in–vocabulary tokens. The character structure of tokens can also help to detect Named Entities, usually treated as OOV elements, by recognizing proper nouns (by means of capital letters) or acronyms. Furthermore, a character–based model can deal with noise caused by typos, slang, etc., which are common issues in open–domain systems such as conversational agents or sentiment analysis tools.
There are several NLP tasks in which it is useful to generate vectorial representations of contexts too. In fact, polysemy and homonymy cause inherent semantic ambiguities in language interpretation that can only be resolved by looking at the surrounding context, which is the goal of the Word Sense Disambiguation (WSD) task. Neural approaches have been developed to learn context embeddings, such as context2vec.
In this work we propose a character–based unsupervised model to learn both context and word embeddings from generic text. The model consists of a hierarchy of two distinct Bidirectional Long Short-Term Memory networks (Bi–LSTMs), which encode words as sequences of characters and word–level contextual representations, respectively. Our unsupervised learning approach, despite being more compact than other related algorithms, yields generic embeddings with features that can be efficiently exploited in different NLP tasks requiring either word or context embeddings, such as chunking and WSD, as we show in our comparisons.
2 Related Work
Our unsupervised computational scheme follows that of the CBOW instance of the word2vec algorithm. The method we propose in this paper is inspired by the ideas behind context2vec, which we extend with a bidirectional recurrent neural model that processes words as sequences of characters. We also focus on a single encoder that we use both to represent words alone and words belonging to a context.
There are several approaches that jointly learn task-oriented (supervised) word and character–based representations, which are subsequently either concatenated or combined by a non–linear function. In the model of Miyamoto and Cho, a gate adaptively decides how to mix the two representations, whereas the models proposed by Santos and Zadrozny and by Santos and Guimaraes exploit the concatenation of word embeddings and character representations to address Part–Of–Speech (POS) Tagging and Named Entity Recognition (NER), respectively. In contrast, our work focuses on a single character-level encoder that is trained in an unsupervised manner.
There exist a number of different approaches that extract vectorial representations directly from the character sequences of words, mostly focused on Language Modeling (LM) or Character Language Modeling (CLM). These representations are generally computed by either Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs), mostly LSTMs. Ling et al. applied Bidirectional LSTMs to learn task–dependent character–level features for Language Modeling and POS tagging, showing particular improvements in morphologically rich languages such as Turkish. A multi–layered Hierarchical Recurrent Neural Network was applied by Hwang and Sung to solve CLM. Differently from our approach, the output of this model is a distribution over characters, while we exploit word–level predictions. The character–aware model of Kim et al. is based on a highway network on top of 1-d convolutional filters processing input characters. The resulting output is then handled by an LSTM for an LM task, and the highway–network output provides the distributed representation of a word. Jozefowicz et al. study different architectures, mostly based on CNNs, in LM tasks. The proposed approach differs from most of the previous ones (1) in the learning mechanism, which is completely unsupervised on large text corpora, thus allowing the development of task–independent representations, and (2) in the architecture, which is aimed at obtaining character–aware representations of both contexts and words that are suitable for a large variety of NLP applications.
3 The Character–Aware Neural Model
The proposed model is organized as a hierarchical architecture based on Bi–LSTMs processing sentences. Each sentence is first split into a sequence of words using whitespace characters (i.e. spaces, tabs, newlines, etc.) as separators. Words are further split into sequences of characters, such that there is no need to specify a vocabulary in advance. Then, the character sequence of an input word $w_i$ is processed to obtain its vectorial representation (word embedding), while the character sequences of the surrounding words are used to encode the context to which $w_i$ belongs (context embedding). Given the current sentence, the context of $w_i$ comprises the words that precede and follow $w_i$. Inspired by the CBOW scheme, our model is trained to predict the current word given its context. In the following we describe each layer of the proposed architecture.
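The preprocessing described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `split_sentence` and `target_and_context` are hypothetical and do not come from the original implementation.

```python
def split_sentence(sentence):
    """Split a sentence into words on any whitespace, then each word into characters."""
    words = sentence.split()  # with no argument, split() treats spaces, tabs and newlines alike
    return [list(word) for word in words]


def target_and_context(words, i):
    """Return the i-th word together with its left and right context."""
    return words[i], words[:i], words[i + 1:]


words = split_sentence("The dog\tbarks loudly\n")
target, left, right = target_and_context(words, 1)
print(target)  # ['d', 'o', 'g']
print(left)    # [['T', 'h', 'e']]
print(right)   # [['b', 'a', 'r', 'k', 's'], ['l', 'o', 'u', 'd', 'l', 'y']]
```

No word vocabulary is consulted at any point: every token, including unseen or misspelled ones, is reduced to a sequence of characters.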
3.1 Word and context embeddings
We consider an input sentence $s$ composed of $n$ words, $s = (w_1, \ldots, w_n)$, where each word $w_i$ is a sequence of characters $w_i = (c_{i,1}, \ldots, c_{i,|w_i|})$, $|w_i|$ being the length of the sequence. Each character $c_{i,j}$ is encoded as an index in a dictionary of $C$ characters and it is mapped to a real vector $\hat{c}_{i,j} \in \mathbb{R}^{d_c}$ as

$$\hat{c}_{i,j} = W_c \, \mathbb{1}(c_{i,j}), \tag{1}$$

where $W_c \in \mathbb{R}^{d_c \times C}$ is the matrix of the learnable character representations, each of them of size $d_c$, while $\mathbb{1}(\cdot)$ is a function returning a one-hot representation of its integer input. Note that $C$ is quite small, in the order of hundreds, compared to common word vocabularies, whose size is in the order of hundreds of thousands.

For each input word $w_i$, the first layer of the model extracts a word embedding $e_i$, using a bidirectional recurrent neural network with LSTM cells (Bi-LSTM). Let $\overrightarrow{r_c}$ and $\overleftarrow{r_c}$ be the forward and backward components of a Bi-LSTM, each taking a sequence of character embeddings as input and returning its internal state after the entire sequence has been processed. The embedding of the word $w_i$ is then the concatenation of the two final states:

$$e_i = \left[\, \overrightarrow{r_c}(\hat{c}_{i,1}, \ldots, \hat{c}_{i,|w_i|}),\; \overleftarrow{r_c}(\hat{c}_{i,|w_i|}, \ldots, \hat{c}_{i,1}) \,\right], \tag{2}$$

where we indicated with $[\cdot, \cdot]$ the concatenation operation and we emphasized the backward nature of $\overleftarrow{r_c}$ by showing the character sequence in reverse order.

The second layer follows a similar scheme to compute the contextual embedding $\hat{e}_i$ of the word $w_i$ in the sentence $s$. Let $\overrightarrow{r_e}$ and $\overleftarrow{r_e}$ be the forward and backward components of a Bi-LSTM taking as inputs the embeddings of the left context of $w_i$ (i.e. $e_1, \ldots, e_{i-1}$) and of the right context of $w_i$ (i.e. $e_n, \ldots, e_{i+1}$), respectively. Given the Bi-LSTM internal states obtained after processing the left and right context sequences, the contextual embedding of the word $w_i$ is obtained by projecting their concatenation into a lower-dimensional space by means of a Multi-Layer Perceptron (MLP), with the goal of merging and compressing the left and right context representations:

$$\hat{e}_i = \mathrm{MLP}\!\left( \left[\, \overrightarrow{r_e}(e_1, \ldots, e_{i-1}),\; \overleftarrow{r_e}(e_n, \ldots, e_{i+1}) \,\right] \right). \tag{3}$$
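To make the two-level encoding concrete, the following is a minimal, untrained sketch in plain Python. Several things here are assumptions made only for illustration: a plain tanh recurrent cell stands in for the LSTM cells used in the paper, biases are omitted, the dimensions are toy values, and the names `TanhRNN` and `CharAwareEncoder` are hypothetical.

```python
import math
import random

rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class TanhRNN:
    """Plain tanh recurrent cell: a lightweight stand-in for an LSTM (assumption)."""

    def __init__(self, in_dim, hid_dim):
        self.w_in = rand_matrix(hid_dim, in_dim)
        self.w_rec = rand_matrix(hid_dim, hid_dim)
        self.hid_dim = hid_dim

    def final_state(self, inputs):
        h = [0.0] * self.hid_dim  # the state is reset for every new sequence
        for x in inputs:
            a, b = matvec(self.w_in, x), matvec(self.w_rec, h)
            h = [math.tanh(p + q) for p, q in zip(a, b)]
        return h


class CharAwareEncoder:
    def __init__(self, alphabet, char_dim=4, word_hid=3, ctx_hid=3, mlp_hid=8, ctx_dim=5):
        self.index = {ch: k for k, ch in enumerate(alphabet)}
        self.char_emb = rand_matrix(len(alphabet), char_dim)  # one row per character
        self.char_fwd, self.char_bwd = TanhRNN(char_dim, word_hid), TanhRNN(char_dim, word_hid)
        word_dim = 2 * word_hid
        self.ctx_fwd, self.ctx_bwd = TanhRNN(word_dim, ctx_hid), TanhRNN(word_dim, ctx_hid)
        self.mlp_1 = rand_matrix(mlp_hid, 2 * ctx_hid)
        self.mlp_2 = rand_matrix(ctx_dim, mlp_hid)

    def word_embedding(self, word):
        # Looking up a row is equivalent to multiplying by a one-hot vector.
        chars = [self.char_emb[self.index[ch]] for ch in word]
        # Concatenate the forward state with the state of the reversed sequence.
        return self.char_fwd.final_state(chars) + self.char_bwd.final_state(chars[::-1])

    def context_embedding(self, words, i):
        embs = [self.word_embedding(w) for w in words]
        left = self.ctx_fwd.final_state(embs[:i])             # left context, in reading order
        right = self.ctx_bwd.final_state(embs[i + 1:][::-1])  # right context, read backwards
        hidden = [max(0.0, z) for z in matvec(self.mlp_1, left + right)]  # ReLU hidden layer
        return matvec(self.mlp_2, hidden)


encoder = CharAwareEncoder("abcdefghijklmnopqrstuvwxyz")
sentence = ["the", "dog", "barks"]
word_vec = encoder.word_embedding("dog")
context_vec = encoder.context_embedding(sentence, 1)
print(len(word_vec), len(context_vec))  # 6 5
```

Note that the context embedding of a position never looks at the word in that position, which is what makes it usable as the input for predicting that word during training.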
3.2 Learning algorithm
Both word and context representations are learned following the unsupervised approach used in CBOW [12, 13]. Given a corpus of textual data, the objective of our model is to predict each word given the representation of its surrounding context (Eq. (3)). In particular, the context embedding of Eq. (3) is projected into the space of the corpus vocabulary using a linear projection. Instead of performing a softmax activation and minimizing the cross-entropy (as commonly done in LM tasks), the whole network is trained by minimizing the Noise Contrastive Estimation (NCE) loss function. NCE belongs to a family of classification algorithms that approximate a softmax regression by means of sampling methods. NCE is particularly helpful in all those cases in which the number of output units is prohibitively high, as is the case for our model and related ones.
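As an illustration of the idea behind NCE, the sketch below computes the loss for a single (context, target word) pair in its standard binary-classification form, where each score is compared against the log of the scaled noise probability. The function names, the argument layout, and the numbers in the usage lines are assumptions for the example; the paper does not specify the noise distribution or the number of noise samples.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def nce_loss(target_score, target_noise_prob, noise_scores, noise_probs):
    """NCE loss for one (context, target word) pair with k sampled noise words.

    Scores are unnormalised logits produced by the linear projection of the
    context embedding; the *_prob arguments are probabilities of the same
    words under the noise distribution.
    """
    k = len(noise_scores)
    # The observed word should be classified as coming from the data.
    loss = -log_sigmoid(target_score - math.log(k * target_noise_prob))
    # Each sampled word should be classified as noise:
    # log(1 - sigmoid(x)) equals log_sigmoid(-x).
    for score, prob in zip(noise_scores, noise_probs):
        loss -= log_sigmoid(-(score - math.log(k * prob)))
    return loss


good = nce_loss(5.0, 0.01, [-3.0, -2.0], [0.02, 0.05])
bad = nce_loss(-5.0, 0.01, [3.0, 2.0], [0.02, 0.05])
print(good < bad)  # True
```

Only the target word and the k sampled words are scored, so the cost of one update does not grow with the vocabulary size, unlike a full softmax.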
One could argue that a vocabulary of words is still needed, since it is required to make the aforementioned word prediction. However, this is not a limitation, since it is only necessary at training time, while it is not needed when deploying the model. In principle, a different approach would be feasible, where the context representation of Eq. (3) is decoded into a sequence of characters that represent the word to predict. We tried both approaches and we found the word level prediction to give the best results. Thanks to the dynamic behaviour of the context-level RNNs, our model can deal with contexts of any length. In this work, the state of the RNN is reset at the beginning of a new sentence, to reduce the variability of the contexts.
4 Experimental Results
We conducted different experiments to evaluate the word and context representations developed by the proposed model. In particular, we first trained our model on a large corpus. Then, we detached the learned word and context encoders and considered the tasks of Chunking and Word Sense Disambiguation (WSD), exploiting our word and context embeddings as features for each task-specific classifier, as shown in Figure 2. Depending on the problem at hand, it may be useful to use either both the word and context embeddings or only one of them. Any other additional features can also be concatenated to these representations to obtain a richer input vector. We also evaluated the robustness of our model to character–level noise. To this end, we considered the WSD task when the input words are perturbed by typos, modelled as random replacements of single characters. Finally, we report some qualitative examples, showing the nearest neighbours for both word and context representations of a set of sample words.
Model setup. Our model has been trained on the ukWaC corpus (http://wacky.sslmit.unibo.it/doku.php?id=corpora), which contains 2 billion words. The size of the character embeddings is set to 50, whereas word and context embeddings are of sizes 1000 and 600, respectively. The MLP that maps the RNN states into the context embeddings has one hidden layer of 1200 units with ReLU activation functions. These settings are inspired by those used in the context2vec architecture (the structure of the last projection layer described in Subsection 3.2 is the same). The complete encoding model has around 7 million trainable parameters, which is about 16 times fewer than the context2vec model of Melamud et al.; this is due to the fact that words are encoded using an RNN that does not depend on the vocabulary size.
Chunking. Chunking is a classical NLP problem whose goal is to tag text segments with labels defining their syntactic roles, e.g. noun phrase (NP) or verbal phrase (VP). Each word is uniquely associated with a single tag expressing the segment class and its position within the phrase. An instance of Chunking classification is shown in Figure 2, where the word dog is marked with the label I-NP, standing for Inside-chunk Noun Phrase. A standard benchmark for Chunking is the CoNLL 2000 dataset that contains 211,727 tokens in the training set and 47,377 tokens in the test set. The chunk tag is predicted by training a classifier that receives as input only the concatenation of the word and context embeddings computed by the model. This vector is projected onto a 600 dimensional space, and further processed by a Bi-LSTM that outputs vectors of size 500 that are finally mapped to the space of 23 classes, representing the chunk tags. Weights are updated using Adam Optimizer with default hyper-parameters and weight decay regularization with a factor of .
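The tagging scheme can be illustrated with a short sketch. It assumes the usual B/I/O encoding over the eleven CoNLL 2000 chunk types, which is consistent with the 23 classes mentioned above; the helper `spans_to_tags` and the example sentence are hypothetical.

```python
CHUNK_TYPES = ["ADJP", "ADVP", "CONJP", "INTJ", "LST", "NP", "PP", "PRT", "SBAR", "UCP", "VP"]
# One "begin" and one "inside" tag per chunk type, plus "O" for tokens outside any chunk.
TAGS = ["O"] + [f"{prefix}-{chunk}" for chunk in CHUNK_TYPES for prefix in ("B", "I")]


def spans_to_tags(num_tokens, spans):
    """Convert (start, end_exclusive, chunk_type) spans into one tag per token."""
    tags = ["O"] * num_tokens
    for start, end, chunk in spans:
        tags[start] = f"B-{chunk}"
        for k in range(start + 1, end):
            tags[k] = f"I-{chunk}"
    return tags


tokens = ["The", "dog", "barks", "."]
tags = spans_to_tags(len(tokens), [(0, 2, "NP"), (2, 3, "VP")])
print(len(TAGS))  # 23
print(list(zip(tokens, tags)))
# [('The', 'B-NP'), ('dog', 'I-NP'), ('barks', 'B-VP'), ('.', 'O')]
```

Each token receives exactly one tag, so chunking reduces to a per-token classification over these classes.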
We compared several variants of the proposed model and the resulting F1 scores are shown in Table 1. We report results when using only Word Embeddings (WE), only Context Embeddings (CE), and both of them (WE+CE). In this case we also considered WE and CE that are not generated by our model, but that are variables of the whole architecture trained with the task-level supervision. Both the feature types (WE and CE) are needed to achieve better performances, as expected. This experiment highlights the importance of using embeddings that are pre-trained with our model, that allows us to obtain the best F1 score of . This value can be compared with the results reported by Collobert et al.  (94.32) and by Huang et al.  (94.46), taking into account that in our case we did not make use of any hand-crafted feature nor of any kind of post-processing to adjust incoherent predictions. Moreover, when adding POS tagging features, our model reaches the same performances (93.94) of the state-of-the-art architecture  without Conditional Random Fields. Hence, we can conclude that the proposed architecture provides word and context embeddings that convey enough information to reach competitive performances. Furthermore, it should be considered that the number of parameters in the model is dramatically reduced with respect to such competitors, since there is no word vocabulary.
Word Sense Disambiguation. Experiments on WSD were carried out within the evaluation framework proposed by Raganato et al., which collects multiple benchmarks (Senseval*, SemEval*, and a merged collection, ALL). The goal of WSD is to identify the correct sense of words. We followed the commonly used IMS approach of Zhong and Ng, which is based on an SVM classifier on top of conventional WSD features. We compare our method against the original IMS model and other instances of it in which the WSD features are augmented with different context embeddings.
We report the results in Tables 2 and 3. Our embeddings outperform both the IMS with only conventional features and the IMS augmented with suitably averaged word2vec embeddings, as in Iacobacci et al.; moreover, they are competitive with context2vec representations. It is also worth mentioning that, to the best of our knowledge, the use of context2vec features as input of the IMS is a novel attempt in the literature.
Robustness to typos. Many NLP applications have to deal with noisy textual data. Indeed, misspelled words are likely to be treated as OOV in models based on word dictionaries. We compare the proposed model against context2vec on a WSD task (ALL benchmark), when introducing an increasing probability of randomly perturbing a character of a word. Conventional WSD features are completely removed for both models, which only use context-level representations. Figure 3 shows how the F1 score decreases as the noise probability increases. Both models suffer from word perturbations, but the character-aware embeddings yield a slower degradation in performance, which allows our model to outperform context2vec for high levels of noise.
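A perturbation of this kind can be sketched as follows. The exact protocol is an assumption: here each word is independently selected with probability `noise_prob`, and one of its characters is replaced by a different lowercase ASCII letter. The function names and the alphabet are hypothetical choices for the example.

```python
import random
import string


def perturb_word(word, noise_prob, rng, alphabet=string.ascii_lowercase):
    """With probability noise_prob, replace one random character of word by a different one."""
    if not word or rng.random() >= noise_prob:
        return word
    pos = rng.randrange(len(word))
    choices = [ch for ch in alphabet if ch != word[pos]]  # guarantees an actual change
    return word[:pos] + rng.choice(choices) + word[pos + 1:]


def perturb_sentence(words, noise_prob, seed=0):
    """Apply perturb_word independently to every word, with a reproducible seed."""
    rng = random.Random(seed)
    return [perturb_word(w, noise_prob, rng) for w in words]


clean = ["the", "capital", "of", "italy"]
print(perturb_sentence(clean, 0.0) == clean)  # True: no noise leaves the input intact
noisy = perturb_sentence(clean, 1.0)
print([sum(a != b for a, b in zip(w, v)) for w, v in zip(clean, noisy)])  # [1, 1, 1, 1]
```

A perturbed token almost always falls outside a fixed word vocabulary, whereas a character-level encoder still receives an input that differs from the clean one in a single position.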
Qualitative examples. One of the most intriguing properties of embeddings is their capability to capture semantic and syntactic similarities in the topology of the embedding space. This characteristic is illustrated by means of examples for both the representations (word and context) obtained by the proposed model. Distances between the distributed representations are computed using cosine similarity. In Table 4 we show the 5 nearest neighbours for some given words. The examples show that the character–based model is capable of capturing both morphological and semantic similarities.
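The nearest-neighbour lookup can be sketched as below. The two-dimensional vectors are made-up toy values chosen only to make the example readable; they are not embeddings produced by the model, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbours(query, embeddings, k=5):
    """Return the k keys whose vectors are most similar to embeddings[query]."""
    scored = [(cosine_similarity(embeddings[query], vec), key)
              for key, vec in embeddings.items() if key != query]
    return [key for _, key in sorted(scored, reverse=True)[:k]]


toy = {"walk": [1.0, 0.1], "walking": [0.9, 0.2], "walked": [0.8, 0.0], "pizza": [0.0, 1.0]}
print(nearest_neighbours("walk", toy, k=2))  # ['walked', 'walking']
```

The same ranking procedure applies to context embeddings: a query context is compared against the context vectors of the other sentences, and these are sorted by similarity.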
For the evaluation of context representations, we considered 8 sentences related to 2 different topics (4 sentences each): capitals of states and pizza. A context embedding is obtained by considering the tokens around the word capital or pizza. Then, a random sentence is chosen as query, and the remaining sentences are sorted according to the distance between the query context embedding and their vectors. An example is shown in Table 5, where it is clear that all the contexts related to pizza instances are closer to the query than sentences concerning capitals.
5 Conclusions
We presented an unsupervised neural model that can develop task-independent word and context representations using character-level inputs. We trained our model on a 2 billion word corpus, and the resulting word and context encoders were used to produce robust input features for some popular NLP tasks (Chunking, WSD). The proposed model has shown the capability of building powerful representations that are competitive with state-of-the-art embeddings generated by models with a significantly larger number of parameters. Our future work will include applications of this model to conversational systems.
References
-  Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. Journal of Machine Learning Research 12(Aug), 2493–2537 (2011)
-  Graves, A., Schmidhuber, J.: Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural Networks 18(5-6), 602–610 (2005)
-  Gutmann, M., Hyvärinen, A.: Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In: AISTATS. pp. 297–304 (2010)
-  Hinton, G.E., Mcclelland, J.L., Rumelhart, D.E.: Distributed representations, parallel distributed processing: explorations in the microstructure of cognition, vol. 1: foundations (1986)
-  Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation 9(8), 1735–1780 (1997)
-  Huang, Z., Xu, W., Yu, K.: Bidirectional lstm-crf models for sequence tagging. arXiv preprint arXiv:1508.01991 (2015)
-  Hwang, K., Sung, W.: Character-level language modeling with hierarchical recurrent neural networks. In: Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. pp. 5720–5724. IEEE (2017)
-  Iacobacci, I., Pilehvar, M.T., Navigli, R.: Embeddings for word sense disambiguation: An evaluation study. In: ACL (Volume 1: Long Papers). pp. 897–907 (2016)
-  Jozefowicz, R., Vinyals, O., Schuster, M., Shazeer, N., Wu, Y.: Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410 (2016)
-  Kim, Y., Jernite, Y., Sontag, D., Rush, A.M.: Character-aware neural language models. In: AAAI. pp. 2741–2749 (2016)
-  Ling, W., Dyer, C., Black, A.W., Trancoso, I., Fermandez, R., Amir, S., Marujo, L., Luis, T.: Finding function in form: Compositional character models for open vocabulary word representation. In: EMNLP. pp. 1520–1530 (2015)
-  Melamud, O., Goldberger, J., Dagan, I.: context2vec: Learning generic context embedding with bidirectional lstm. In: Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning. pp. 51–61 (2016)
-  Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
-  Miyamoto, Y., Cho, K.: Gated word-character recurrent language model. In: Proceedings of the 2016 Conference on EMNLP. pp. 1992–1997 (2016)
-  Raganato, A., Camacho-Collados, J., Navigli, R.: Word sense disambiguation: A unified evaluation framework and empirical comparison. In: EACL (2017)
-  Santos, C.D., Zadrozny, B.: Learning character-level representations for part-of-speech tagging. In: ICML. pp. 1818–1826 (2014)
-  Santos, C.N.d., Guimaraes, V.: Boosting named entity recognition with neural character embeddings. arXiv preprint arXiv:1505.05008 (2015)
-  Schuster, M., Paliwal, K.K.: Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45(11), 2673–2681 (1997)
-  Zhong, Z., Ng, H.T.: It makes sense: A wide-coverage word sense disambiguation system for free text. In: ACL. pp. 78–83 (2010)