Text from social media provides a set of challenges that can cause traditional NLP approaches to fail. Informal language, spelling errors, abbreviations, and special characters are all commonplace in these posts, leading to a prohibitively large vocabulary size for word-level approaches. We propose a character composition model, tweet2vec, which finds vector-space representations of whole tweets by learning complex, non-local dependencies in character sequences. The proposed model outperforms a word-level baseline at predicting user-annotated hashtags associated with the posts, doing significantly better when the input contains many out-of-vocabulary words or unusual character sequences. Our tweet2vec encoder is publicly available.
Zipf's law tells us that in any natural language corpus, most word types in the vocabulary will either be absent or occur with low frequency. Estimating the statistical properties of these rare word types is naturally a difficult task. This is analogous to the curse of dimensionality when we deal with sequences of tokens: most sequences will occur only once in the training data. Neural network architectures overcome this problem by defining non-linear compositional models over vector space representations of tokens, and hence assign non-zero probability even to sequences not seen during training [bengio2003neural, kiros2015skip]. In this work, we explore a similar approach to learning distributed representations of social media posts by composing them from their constituent characters, with the goal of generalizing to out-of-vocabulary words as well as unseen sequences at test time.
Traditional Neural Network Language Models (NNLMs) treat words as the basic units of language and assign independent vectors to each word type. To constrain memory requirements, the vocabulary size is fixed beforehand; therefore, rare and out-of-vocabulary words are all grouped together under a common type ‘UNKNOWN’. This choice is motivated by the assumption of arbitrariness in language, which means that surface forms of words have little to do with their semantic roles. Recently, [ling2015finding] challenged this assumption and presented a bidirectional Long Short Term Memory (LSTM) [hochreiter1997long] for composing word vectors from their constituent characters, which can memorize the arbitrary aspects of word orthography as well as generalize to rare and out-of-vocabulary words.
Encouraged by their findings, we extend their approach to a much larger Unicode character set, and model long sequences of text as functions of their constituent characters (including white-space). We focus on social media posts from the website Twitter, which are an excellent testing ground for character based models due to the noisy nature of the text. Heavy use of slang and abundant misspellings mean that there are many orthographically and semantically similar tokens, and special characters such as emojis are also immensely popular and carry useful semantic information. In our moderately sized training dataset of 2 million tweets, there were about 0.92 million unique word types. It would be expensive to capture all these phenomena in a word based model, in terms of both the memory requirement (for the increased vocabulary) and the amount of training data required for effective learning. Additional benefits of the character based approach include language independence of the methods and no need for NLP preprocessing such as word segmentation.
A crucial step in learning good text representations is to choose an appropriate objective function to optimize. Unsupervised approaches attempt to reconstruct the original text from its latent representation [mikolov2013efficient, bengio2003neural]. Social media posts, however, come with their own form of supervision annotated by millions of users, in the form of hashtags which link posts about the same topic together. A natural assumption is that posts with the same hashtags should have embeddings which are close to each other. Hence, we formulate our training objective to minimize the cross-entropy loss at the task of predicting hashtags for a post from its latent representation.
We propose a Bi-directional Gated Recurrent Unit (Bi-GRU) [chung2014empirical] neural network for learning tweet representations. Treating white-space as a special character itself, the model does a forward and backward pass over the entire sequence, and the final GRU states are linearly combined to get the tweet embedding. Posterior probabilities over hashtags are computed by projecting this embedding to a softmax output layer. Compared to a word-level baseline, this model shows improved performance at predicting hashtags for a held-out set of posts. Inspired by recent work in learning vector space text representations, we name our model tweet2vec.
Using neural networks to learn distributed representations of words dates back to [bengio2003neural]. More recently, [mikolov2013efficient] released word2vec, a collection of word vectors trained using shallow log-linear models (CBOW and skip-gram). These word vectors are in widespread use in the NLP community, and the original work has since been extended to sentences [kiros2015skip], documents and paragraphs [le2014distributed], topics [niu2015topic2vec] and queries [grbovic2015context]. All these methods require storing an extremely large table of vectors for all word types and cannot be easily generalized to unseen words at test time [ling2015finding]. They also require preprocessing to find word boundaries, which is non-trivial for a social network domain like Twitter.
In [ling2015finding], the authors present a compositional character model based on bidirectional LSTMs as a potential solution to these problems. A major benefit of this approach is that large word lookup tables can be compacted into character lookup tables and the compositional model scales to large data sets better than other state-of-the-art approaches. While [ling2015finding] generate word embeddings from character representations, we propose to generate vector representations of entire tweets from characters in our tweet2vec model.
Our work adds to the growing body of work showing the applicability of character models for a variety of NLP tasks such as Named Entity Recognition [santos2015boosting], POS tagging [santos2014learning], text classification [zhang2015character] and language modeling [karpathy2015visualizing, kim2015character].
Previously, [luong2013better] dealt with the problem of estimating rare word representations by building them from their constituent morphemes. While they show improved performance over word-based models, their approach requires a morpheme parser for preprocessing which may not perform well on noisy text like Twitter. Also the space of all morphemes, though smaller than the space of all words, is still large enough that modelling all morphemes is impractical.
Hashtag prediction for social media has been addressed earlier, for example in [weston2014tagspace, godin2013using]. [weston2014tagspace] also use a neural architecture, but compose text embeddings from a lookup table of words. They also show that the learned embeddings can generalize to an unrelated task of document recommendation, justifying the use of hashtags as supervision for learning text representations.
Bi-GRU Encoder: Figure 1 shows our model for encoding tweets. It uses a similar structure to the C2W model in [ling2015finding], with LSTM units replaced with GRU units.
The input to the network is defined by an alphabet of characters $C$ (this may include the entire Unicode character set). The input tweet is broken into a stream of characters $c_1, c_2, \ldots, c_m$, each of which is represented by a $1$-by-$|C|$ one-hot encoding. These one-hot vectors are then projected to a character space by multiplying with the matrix $P_C \in \mathbb{R}^{|C| \times d_c}$, where $d_c$ is the dimension of the character vector space. Let $x_1, x_2, \ldots, x_m$ be the sequence of character vectors for the input tweet after the lookup. The encoder consists of a forward-GRU and a backward-GRU. Both have the same architecture, except the backward-GRU processes the sequence in reverse order. Each of the GRU units processes these vectors sequentially, and starting with the initial state $h_0$ computes the sequence $h_1, h_2, \ldots, h_m$ as follows:

$$
\begin{aligned}
r_t &= \sigma(W_r x_t + U_r h_{t-1}), \\
z_t &= \sigma(W_z x_t + U_z h_{t-1}), \\
\tilde{h}_t &= \tanh(W_h x_t + U_h (r_t \odot h_{t-1})), \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t.
\end{aligned}
$$

Here $r_t$, $z_t$ are called the reset and update gates respectively, and $\tilde{h}_t$ is the candidate output state which is converted to the actual output state $h_t$. $W_r, W_z, W_h$ are $d_h \times d_c$ matrices and $U_r, U_z, U_h$ are $d_h \times d_h$ matrices, where $d_h$ is the hidden state dimension of the GRU. The final states $h_m^f$ from the forward-GRU and $h_0^b$ from the backward-GRU are combined using a fully-connected layer to give the final tweet embedding $e_t$:

$$
e_t = W^f h_m^f + W^b h_0^b + b
$$

Here $W^f, W^b$ are $d_t \times d_h$ matrices and $b$ is a $d_t \times 1$ bias term, where $d_t$ is the dimension of the final tweet embedding. In our experiments we set . All parameters are learned using gradient descent.
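As an illustration only, the encoder described above can be sketched in plain Python. This is not the released tweet2vec code: the helper names (`init_gru`, `gru_step`, `encode_tweet`), the zero initial state, the omission of bias terms inside the GRU, the toy dimensions and the initialisation scale are all assumptions made for the example.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, std=0.1):
    """Zero-mean Gaussian initialisation; std=0.1 is an illustrative value."""
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def init_gru(d_c, d_h, rng):
    """W_* are d_h x d_c input matrices, U_* are d_h x d_h recurrent matrices."""
    params = {}
    for gate in ("r", "z", "h"):
        params["W_" + gate] = rand_matrix(d_h, d_c, rng)
        params["U_" + gate] = rand_matrix(d_h, d_h, rng)
    return params


def gru_step(x, h_prev, p):
    """One GRU update from the previous state to the next state."""
    def pre(gate, h_in):
        wx = matvec(p["W_" + gate], x)
        uh = matvec(p["U_" + gate], h_in)
        return [a + b for a, b in zip(wx, uh)]

    r = [sigmoid(a) for a in pre("r", h_prev)]        # reset gate
    z = [sigmoid(a) for a in pre("z", h_prev)]        # update gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)]    # reset gate applied to state
    cand = [math.tanh(a) for a in pre("h", gated)]    # candidate output state
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, cand)]


def run_gru(xs, p, d_h):
    """Run a GRU over the sequence xs from a zero initial state (assumed)."""
    h = [0.0] * d_h
    for x in xs:
        h = gru_step(x, h, p)
    return h


def encode_tweet(text, char_to_id, P_C, fwd, bwd, W_f, W_b, b):
    """Combine the final forward and backward states into one embedding."""
    d_h = len(W_f[0])
    xs = [P_C[char_to_id[ch]] for ch in text]  # row lookup = one-hot times P_C
    h_f = run_gru(xs, fwd, d_h)                # forward pass
    h_b = run_gru(xs[::-1], bwd, d_h)          # backward pass over reversed input
    return [f + g + bias
            for f, g, bias in zip(matvec(W_f, h_f), matvec(W_b, h_b), b)]


rng = random.Random(0)
tweet = "cant sleeeeeeep"
alphabet = sorted(set(tweet))                  # toy alphabet, white-space included
char_to_id = {ch: i for i, ch in enumerate(alphabet)}
d_c, d_h, d_t = 4, 5, 3                        # toy sizes, not the paper's settings
P_C = rand_matrix(len(alphabet), d_c, rng)
fwd, bwd = init_gru(d_c, d_h, rng), init_gru(d_c, d_h, rng)
W_f, W_b = rand_matrix(d_t, d_h, rng), rand_matrix(d_t, d_h, rng)
embedding = encode_tweet(tweet, char_to_id, P_C, fwd, bwd, W_f, W_b, [0.0] * d_t)
print(len(embedding))
```

Because white-space is just another entry in the alphabet, the encoder never needs a tokenizer; a practical implementation would use a tensor library and batched sequences rather than Python lists.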
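Softmax: Finally, the tweet embedding $e_t$ is passed through a linear layer whose output is the same size as the number of hashtags $L$ in the data set. We use a softmax layer to compute the posterior hashtag probabilities:

$$
P(y = j \mid e_t) = \frac{\exp(w_j^\top e_t + b_j)}{\sum_{i=1}^{L} \exp(w_i^\top e_t + b_i)}
$$

where $w_j$ and $b_j$ are the weights and bias of the linear layer for hashtag $j$.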
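Objective Function: We optimize the categorical cross-entropy loss between predicted and true hashtags:

$$
J = \frac{1}{B} \sum_{i=1}^{B} \sum_{j=1}^{L} -t_{i,j} \log(p_{i,j}) + \lambda \lVert \Theta \rVert^2
$$

Here $B$ is the batch size, $L$ is the number of classes, $p_{i,j}$ is the predicted probability that the $i$-th tweet has hashtag $j$, and $t_{i,j} \in \{0, 1\}$ denotes the ground truth of whether the $j$-th hashtag is in the $i$-th tweet. We use L2-regularization on the model parameters $\Theta$, weighted by $\lambda$.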
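The softmax layer and the regularized cross-entropy objective can be illustrated with a small sketch. The function names and the toy inputs are hypothetical, and the code assumes the loss is averaged over the batch and that the L2 penalty is a sum of squared parameters scaled by the regularization weight.

```python
import math


def softmax(scores):
    """Turn linear-layer outputs into posterior probabilities over hashtags."""
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def batch_loss(probs, targets, params, lam):
    """Mean categorical cross-entropy over the batch plus an L2 penalty.

    probs[i][j]   predicted probability that tweet i has hashtag j
    targets[i][j] 1 if hashtag j is attached to tweet i, else 0
    params        flat list of model parameters (hypothetical layout)
    lam           regularization weight
    """
    batch_size = len(probs)
    cross_entropy = 0.0
    for p_i, t_i in zip(probs, targets):
        cross_entropy -= sum(math.log(p) for p, t in zip(p_i, t_i) if t)
    l2 = sum(w * w for w in params)
    return cross_entropy / batch_size + lam * l2


posteriors = softmax([2.0, 0.5, -1.0])
loss = batch_loss([posteriors], [[1, 0, 0]], params=[0.1, -0.2], lam=0.001)
print(posteriors, loss)
```

A tweet with several hashtags simply contributes one log-probability term per true hashtag under this formulation.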
Since our objective is to compare character-based and word-based approaches, we have also implemented a simple word-level encoder for tweets. The input tweet is first split into tokens along white-spaces. A more sophisticated tokenizer may be used, but for a fair comparison we wanted to keep language specific preprocessing to a minimum. The encoder is essentially the same as tweet2vec, with words as the input instead of characters. A lookup table stores word vectors for the most common words (20K here), and the rest are grouped together under the ‘UNK’ token.
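The input side of the word-level baseline can be sketched as follows. The function names `build_vocab` and `encode_words` are hypothetical, and the toy vocabulary size of 2 stands in for the 20K used in the paper.

```python
from collections import Counter


def build_vocab(tweets, max_size):
    """Map the max_size most common white-space tokens to ids; id 0 is UNK."""
    counts = Counter(tok for tweet in tweets for tok in tweet.split())
    vocab = {"UNK": 0}
    for tok, _ in counts.most_common(max_size):
        if tok not in vocab:              # guard against a literal "UNK" token
            vocab[tok] = len(vocab)
    return vocab


def encode_words(tweet, vocab):
    """Replace each token by its id, sending unseen tokens to UNK."""
    return [vocab.get(tok, vocab["UNK"]) for tok in tweet.split()]


train = ["can't sleep", "can't wait", "sleeeeeeep"]
vocab = build_vocab(train, max_size=2)
print(encode_words("can't sleeeeeeep", vocab))
```

The elongated spelling falls outside the vocabulary and collapses to UNK, which is exactly the information loss the character model is meant to avoid.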
Our dataset consists of a large collection of global posts from Twitter (https://twitter.com/) between June 1, 2013 and June 5, 2013. Only English language posts (as detected by the lang field in the Twitter API) with at least one hashtag are retained. We removed infrequent hashtags ( posts) since they do not have enough data for good generalization. We also removed very frequent tags ( posts), which were almost always from automatically generated posts (e.g., #androidgame) that are trivial to predict. The final dataset contains 2 million tweets for training, 10K for validation and 50K for testing, with a total of 2039 distinct hashtags. We use simple regular expressions to preprocess the post text: we remove hashtags (since these are to be predicted) and HTML tags, and replace user-names and URLs with special tokens. We also removed retweets and converted the text to lower-case.
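A minimal sketch of this kind of regex preprocessing is shown below. The exact patterns used by the authors are not given, so the regular expressions here are assumptions; `!url` matches the token visible in Table 1, while `@user` is a hypothetical placeholder for the user-name token.

```python
import re

HTML_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+")
USER_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def preprocess(text):
    """Return (cleaned lower-cased text, list of hashtags to predict)."""
    text = HTML_RE.sub(" ", text)          # drop HTML tags
    text = URL_RE.sub("!url", text)        # URLs become a special token
    text = USER_RE.sub("@user", text)      # user-names become a special token
    hashtags = [h.lower() for h in HASHTAG_RE.findall(text)]
    text = HASHTAG_RE.sub(" ", text)       # hashtags are the labels, not input
    text = re.sub(r"\s+", " ", text).strip().lower()
    return text, hashtags


cleaned, tags = preprocess("Self-cooked scramble egg. YUM!! http://t.co/abc #yummy #Food")
print(cleaned, tags)
```

URLs are replaced before hashtags are extracted so that a `#fragment` inside a link is not mistaken for a label.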
|Tweets||Word model baseline||tweet2vec|
|ninety-one degrees.||#initialsofsomeone.. #nw #gameofthrones||#summer #loveit #sun|
|self-cooked scramble egg. yum!! !url||#music #cheap #cute||#yummy #food #foodporn|
|can’t sleeeeeeep||#gameofthrones #heartbreaker||#tired #insomnia|
|oklahoma!!!!!!!!!!! champions!!!!!||#initialsofsomeone.. #nw #lrt||#wcws #sooners #ou|
|7 % of battery . iphones die too quick .||#help #power #money #s||#fml #apple #bbl #thestruggle|
|i have the cutest nephew in the world !url||#nephew #cute #family||#socute #cute #puppy|
Word vectors and character vectors are both set to size for their respective models. There were 2829 unique characters in the training set and we model each of these independently in a character look-up table. Embedding sizes were chosen such that each model had roughly the same number of parameters (Table 2). Training is performed using mini-batch gradient descent with Nesterov’s momentum. We use a batch size , initial learning rate and momentum parameter . L2-regularization with was applied to all models. Initial weights were drawn from 0-mean Gaussians with and initial biases were set to 0. The hyperparameters were tuned one at a time keeping others fixed, and values with the lowest validation cost were chosen. The resultant combination was used to train the models until performance on the validation set stopped increasing. During training, the learning rate is halved every time the validation set precision increases by less than 0.01 % from one epoch to the next. The models converge in about 20 epochs. Code for training both the models is publicly available on GitHub.
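The learning-rate schedule can be illustrated with a short sketch. It assumes that validation precision is measured in percent and that the 0.01 % threshold is an absolute gain in percentage points; the function name, the starting learning rate and the precision values are made up for the example.

```python
def halve_on_plateau(lr, prev_precision, curr_precision, min_gain=0.01):
    """Halve the learning rate when validation precision (in percent)
    improves by less than min_gain percentage points between epochs."""
    if curr_precision - prev_precision < min_gain:
        return lr / 2.0
    return lr


lr = 0.1                                          # hypothetical starting value
val_precision = [20.0, 22.5, 22.505, 23.0]        # made-up per-epoch numbers
for prev, curr in zip(val_precision, val_precision[1:]):
    lr = halve_on_plateau(lr, prev, curr)
print(lr)
```

Only the second transition (a gain of 0.005 points) falls below the threshold, so the rate is halved once.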
|Model||Word model||tweet2vec|
|Training Time / Epoch||1528s||9649s|
We test the character and word-level variants by predicting hashtags for a held-out test set of posts. Since there may be more than one correct hashtag per post, we generate a ranked list of tags for each post from the output posteriors, and report average precision@1, recall@10 and mean rank of the correct hashtags. These are listed in Table 3.
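The three metrics can be computed from the output posteriors roughly as follows. This is a sketch under assumptions: the function names are hypothetical, recall is computed per post and then averaged, and the mean rank is averaged first over the correct hashtags of each post and then over posts, since the paper does not spell out the averaging.

```python
def rank_tags(posteriors):
    """Hashtag ids sorted from most to least probable."""
    return sorted(range(len(posteriors)), key=lambda j: posteriors[j], reverse=True)


def evaluate(all_posteriors, all_true, k=10):
    """Return (precision@1, recall@k, mean rank) averaged over posts.

    all_true[i] is the non-empty set of correct hashtag ids for post i.
    """
    p_at_1 = r_at_k = mean_rank = 0.0
    for posteriors, true in zip(all_posteriors, all_true):
        ranking = rank_tags(posteriors)
        position = {tag: i + 1 for i, tag in enumerate(ranking)}  # 1-based ranks
        p_at_1 += 1.0 if ranking[0] in true else 0.0
        r_at_k += len(true.intersection(ranking[:k])) / len(true)
        mean_rank += sum(position[t] for t in true) / len(true)
    n = len(all_true)
    return p_at_1 / n, r_at_k / n, mean_rank / n


scores = [[0.1, 0.6, 0.3], [0.5, 0.2, 0.3]]   # toy posteriors over 3 hashtags
gold = [{1}, {1, 2}]
print(evaluate(scores, gold, k=2))
```

Lower mean rank is better, while higher precision@1 and recall@10 are better.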
|Model||Precision @1||Recall @10||Mean Rank|
|Full test set (50K)|
|Rare words test set (2K)|
|Frequent words test set (2K)|
To see the performance of each model on posts containing rare words (RW) and frequent words (FW), we selected two test sets, each containing 2,000 posts. We populated these sets with the posts which had the maximum and minimum number of out-of-vocabulary words respectively, where the vocabulary is defined by the 20K most frequent words. Overall, tweet2vec outperforms the word model, doing significantly better on the RW test set and comparably on the FW set. This improved performance comes at the cost of increased training time (see Table 2), since moving from words to characters results in longer input sequences to the GRU.
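One way to build such subsets is sketched below. The helper names are hypothetical, tokens are split on white-space as in the word baseline, and ties in the out-of-vocabulary count are broken by the original order of the posts, which is an assumption.

```python
def oov_count(post, vocab):
    """Number of white-space tokens in the post that are not in the vocabulary."""
    return sum(1 for tok in post.split() if tok not in vocab)


def split_rare_frequent(posts, vocab, n):
    """Return (rare-word set, frequent-word set), each with n posts."""
    ranked = sorted(posts, key=lambda p: oov_count(p, vocab))  # stable sort
    return ranked[-n:], ranked[:n]


vocab = {"i", "have", "the", "cutest", "nephew"}   # stands in for the 20K words
posts = ["i have the cutest nephew", "cant sleeeeeeep", "the nephew zzz"]
rare, frequent = split_rare_frequent(posts, vocab, n=1)
print(rare, frequent)
```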
We also study the effect of model size on the performance of these models. For the word model we set the vocabulary size to 8K, 15K and 20K respectively. For tweet2vec we set the GRU hidden state size to 300, 400 and 500 respectively. Figure 2 shows precision@1 of the two models as the number of parameters is increased, for each test set described above. There is not much variation in the performance, and moreover tweet2vec always outperforms the word based model for the same number of parameters.
Table 4 compares the models as complexity of the task is increased. We created 3 datasets (small, medium and large) with an increasing number of hashtags to be predicted. This was done by varying the lower threshold of the minimum number of tags per post for it to be included in the dataset. Once again we observe that tweet2vec outperforms its word-based counterpart for each of the three settings.
Finally, Table 1 shows some predictions from the word level model and tweet2vec. We selected these to highlight some strengths of the character based approach: it is robust to word segmentation errors and spelling mistakes, effectively interprets emojis and other special characters to make predictions, and also performs comparably to the word-based approach for in-vocabulary tokens.
We have presented tweet2vec, a character level encoder for social media posts trained using supervision from associated hashtags. Our results show that tweet2vec outperforms the word based approach, doing significantly better when the input post contains many rare words. We have focused only on English language posts, but the character model requires no language specific preprocessing and can be extended to other languages. For future work, one natural extension would be to use a character-level decoder for predicting the hashtags. This would allow generation of hashtags not seen in the training dataset. It will also be interesting to see how our tweet2vec embeddings can be used in domains where there is a need for semantic understanding of social media, such as tracking infectious diseases [signorini2011use]. Hence, we provide an off-the-shelf encoder, trained on the medium dataset described above, to compute vector-space representations of tweets, along with our code on GitHub.
We would like to thank Alex Smola, Yun Fu, Hsiao-Yu Fish Tung, Ruslan Salakhutdinov, and Barnabas Poczos for useful discussions. We would also like to thank Juergen Pfeffer for providing access to the Twitter data, and the reviewers for their comments.