1 Introduction
In recent years, the microblogging site Twitter has become a major social media platform with hundreds of millions of users. The short (140-character limit), noisy, and idiosyncratic nature of tweets makes standard information retrieval and data mining methods ill-suited to Twitter. Consequently, there has been an ever-growing body of IR and data mining literature focusing on Twitter. However, most of these works employ extensive feature engineering to create task-specific, hand-crafted features. This is time-consuming and inefficient, as new features need to be engineered for every task.
In this paper, we present Tweet2Vec, a method for generating general-purpose vector representations of tweets that can be used for any classification task. Tweet2Vec removes the need for expensive feature engineering and can be used to train any standard off-the-shelf classifier (e.g., logistic regression, SVM).
Tweet2Vec uses a CNN-LSTM encoder-decoder model that operates at the character level to learn and generate vector representations of tweets. Our method is especially useful for natural language processing tasks on Twitter where it is particularly difficult to engineer features, such as speech-act classification and stance detection (as shown in our previous work on these topics [13, 12]). There have been several works on generating embeddings for words, most famously Word2Vec by Mikolov et al. [9].
There have also been a number of works that use encoder-decoder models based on long short-term memory (LSTM) networks [11] and gated recurrent neural networks (GRU) [1]. These methods have been used mostly in the context of machine translation: the encoder maps a sentence from the source language to a vector representation, while the decoder conditions on this encoded vector to translate it to the target language. Perhaps the work most related to ours is that of Le and Mikolov [7], who extended the Word2Vec model to generate representations for sentences (called ParagraphVec). However, these models all function at the word level, making them ill-suited to the extremely noisy and idiosyncratic nature of tweets. Our character-level model, on the other hand, can better deal with the noise and idiosyncrasies in tweets. We plan to make our model and the data used to train it publicly available for other researchers who work with tweets.

2 CNN-LSTM Encoder-Decoder
In this section, we describe the CNN-LSTM encoder-decoder model that operates at the character level and generates vector representations of tweets. The encoder consists of convolutional layers to extract features from the characters and an LSTM layer to encode the sequence of features into a vector representation, while the decoder consists of two LSTM layers that predict the character at each time step from the output of the encoder.
2.1 Character-Level CNN Tweet Model
Character-level CNN (CharCNN) is a slight variant of the deep character-level convolutional neural network introduced by Zhang et al. [15]. In this model, we perform temporal convolution and temporal max-pooling operations, which compute one-dimensional convolution and pooling functions, respectively, between input and output. Given a discrete input function g(x) ∈ [1, l] → ℝ, a discrete kernel function f(x) ∈ [1, k] → ℝ, and stride d, the convolution h(y) between f(x) and g(x) and the pooling operation h(y) of g(x) are calculated as:

h(y) = \sum_{x=1}^{k} f(x) \cdot g(y \cdot d - x + c)    (1)

h(y) = \max_{x=1}^{k} g(y \cdot d - x + c)    (2)

where c = k − d + 1 is an offset constant.
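As an illustration of the temporal convolution and max-pooling operations above, the following sketch computes both functions over a 1-d sequence. It is a plain NumPy stand-in rather than the model's actual implementation, and the sliding-window handling is an assumption:

```python
import numpy as np

def temporal_conv(g, f, d):
    """1-d convolution between input g and kernel f with stride d:
    each output is the dot product of a window of g with the reversed kernel."""
    k = len(f)
    return np.array([float(np.dot(g[s:s + k], f[::-1]))
                     for s in range(0, len(g) - k + 1, d)])

def temporal_max_pool(g, k, d):
    """1-d max pooling with window size k and stride d."""
    return np.array([g[s:s + k].max()
                     for s in range(0, len(g) - k + 1, d)])
```

For example, `temporal_conv(np.arange(6.0), np.array([1.0, 0.0, 0.0]), 1)` slides a 3-wide kernel over the sequence, and `temporal_max_pool` with k = d keeps the maximum of each non-overlapping window.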
We adapted this model, which employs temporal convolution and pooling operations, for tweets. The character set includes the lowercase English letters, the digits, special characters, and an unknown character, for 70 characters in total, given below:
abcdefghijklmnopqrstuvwxyz0123456789
,;.!?:’"/\_#$%&^ *~‘+=<>()[]{}
Each character in a tweet is encoded as a one-hot vector x ∈ {0, 1}^70. Hence, each tweet is represented as a binary matrix T of size 150 × 70, with padding wherever necessary, where 150 is the maximum number of characters in a tweet (140 tweet characters plus padding) and 70 is the size of the character set.
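A tweet-to-matrix encoding along these lines can be sketched as follows; the short alphabet string in the test below and the choice of mapping unseen characters to a final "unknown" column are illustrative assumptions:

```python
import numpy as np

def tweet_to_matrix(tweet, alphabet, max_len=150):
    """One-hot encode a tweet as a max_len x |alphabet| binary matrix.
    Rows past the end of the tweet are left as all-zero padding."""
    char_to_idx = {ch: i for i, ch in enumerate(alphabet)}
    unknown = len(alphabet) - 1          # last column reserved for unknowns
    T = np.zeros((max_len, len(alphabet)), dtype=np.int8)
    for pos, ch in enumerate(tweet[:max_len]):
        T[pos, char_to_idx.get(ch, unknown)] = 1
    return T
```

With the full 70-character set above, each tweet becomes the 150 × 70 binary matrix T described in the text.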
Each tweet, in the form of a matrix, is now fed into a deep model consisting of four 1-d convolutional layers. A convolution operation employs a filter F to extract n-gram character features from a sliding window of h characters at the first layer, and learns abstract textual features in the subsequent layers. The convolution in the first layer operates on sliding windows of h characters, and the convolutions in deeper layers are defined in a similar way. Generally, for tweet T, a feature c_i at layer l is generated by:

c_i^{(l)} = g\left( F^{(l)} \odot T_{[i : i+h-1]} + b^{(l)} \right)    (3)

where g is a non-linearity, b^{(l)} is a bias term, and T_{[i : i+h-1]} is the window of h rows starting at row i. This filter is applied across all possible windows of characters in the tweet to produce a feature map. The output of the convolutional layer is followed by a 1-d max-over-time pooling operation [2] over the feature map, which selects the maximum value as the prominent feature from the current filter. In this way, we apply f filters at each layer. The pooling size may vary at each layer (given by p at layer l). The pooling operation shrinks the size of the feature representation and filters out trivial features such as unnecessary combinations of characters. The window length h, number of filters f, and pooling size p at each layer are given in Table 1.
Layer  Window (h)  Filters (f)  Pooling (p)
1  7  512  3
2  7  512  3
3  3  512  N/A
4  3  512  N/A
We define CharCNN(T) to denote the character-level CNN operation on input tweet matrix T. The output from the last convolutional layer of CharCNN(T) is subsequently given as input to the LSTM layer. Since the LSTM works on sequences (explained in Sections 2.2 and 2.3), the pooling operation is restricted to the first two layers of the model (as shown in Table 1).
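As a rough sanity check on Table 1, the sequence length that reaches the LSTM can be traced through the four layers. The sketch below assumes stride-1 "valid" convolutions and non-overlapping pooling, conventions the text leaves implicit:

```python
def charcnn_seq_len(n=150, layers=((7, 3), (7, 3), (3, None), (3, None))):
    """Trace the sequence length through (window h, pooling p) pairs from
    Table 1: a valid convolution shrinks n to n - h + 1, pooling divides by p."""
    for h, p in layers:
        n = n - h + 1          # valid convolution, stride 1
        if p is not None:
            n //= p            # non-overlapping max pooling
    return n
```

Under these assumptions, a 150-character tweet yields a short sequence of 512-dimensional feature vectors for the LSTM to encode.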
2.2 Long Short-Term Memory (LSTM)

In this section we briefly describe the LSTM model [4]. Given an input sequence (x_1, ..., x_T), an LSTM computes the hidden vector sequence (h_1, ..., h_T) and the output vector sequence (y_1, ..., y_T). At each time step, the output of the module is controlled by a set of gates, as functions of the previous hidden state h_{t−1} and the input at the current time step x_t: the forget gate f_t, the input gate i_t, and the output gate o_t. These gates collectively decide the transitions of the current memory cell c_t and the current hidden state h_t. The LSTM transition functions are defined as follows:

i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)
\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
h_t = o_t \odot \tanh(c_t)    (4)

Here, σ is the sigmoid function that has an output in [0, 1], tanh denotes the hyperbolic tangent function that has an output in [−1, 1], and ⊙ denotes component-wise multiplication. The extent to which the information in the old memory cell is discarded is controlled by f_t, while i_t controls the extent to which new information is stored in the current memory cell, and o_t is the output based on the memory cell c_t. LSTMs are explicitly designed for learning long-term dependencies, and therefore we place an LSTM after the convolutional layers to learn dependencies in the sequence of extracted features. In sequence-to-sequence generation tasks, an LSTM defines a distribution over outputs and sequentially predicts tokens using a softmax function:
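A single LSTM transition (Eq. 4) can be sketched in NumPy as follows; the stacked-parameter layout (gates ordered i, f, o, then the candidate cell) is an implementation assumption, not something the text specifies:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step: W maps the input, U the previous hidden state, and b
    is the bias, all stacked for the four gate/candidate computations."""
    d = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b          # shape (4 * d,)
    i_t = sigmoid(z[0 * d:1 * d])         # input gate
    f_t = sigmoid(z[1 * d:2 * d])         # forget gate
    o_t = sigmoid(z[2 * d:3 * d])         # output gate
    c_tilde = np.tanh(z[3 * d:4 * d])     # candidate memory cell
    c_t = f_t * c_prev + i_t * c_tilde    # memory cell update
    h_t = o_t * np.tanh(c_t)              # hidden state
    return h_t, c_t
```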
p(w_t \mid w_1, \dots, w_{t-1}) = \mathrm{softmax}(W_s h_t + b_s)    (5)

where softmax is the activation function. For simplicity, we define LSTM(x_t, h_{t−1}) to denote the LSTM operation on input x_t at timestep t and the previous hidden state h_{t−1}.

2.3 The Combined Model
The CNN-LSTM encoder-decoder model draws on the intuition that the sequence of features (e.g., character and word n-grams) extracted by the CNN can be encoded into a vector representation by the LSTM, embedding the meaning of the whole tweet. Figure 1 illustrates the complete encoder-decoder model. The input and output of the model are the tweet represented as a matrix, where each row is the one-hot vector representation of a character. The encoding and decoding procedures are explained in the following sections.
2.3.1 Encoder
Given a tweet in matrix form T (size 150 × 70), the CNN (Section 2.1) extracts features from the character representation. The one-dimensional convolution involves a filter vector sliding over a sequence and detecting features at different positions. The successive higher-order window representations are then fed into the LSTM (Section 2.2). Since the LSTM extracts a representation from sequential input, we do not apply pooling after convolution at the higher layers of the character-level CNN model. The encoding procedure can be summarized as:

X = \mathrm{CharCNN}(T)    (6)

h_t = \mathrm{LSTM}(x_t, h_{t-1})    (7)

where X is the extracted feature matrix, in which each row x_t can be considered as the input at one timestep of the LSTM, and h_t is the hidden representation at timestep t. The LSTM operates on each row of X along with the hidden vector from the previous timestep to produce the embedding for the subsequent timestep. The vector output at the final timestep, h_T, is used to represent the entire tweet. In our case, the size of h_T is 256.

2.3.2 Decoder
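The encoding procedure of Eqs. (6)-(7) amounts to folding a recurrent transition over the rows of the CharCNN feature matrix. The sketch below uses a toy tanh recurrence as a stand-in for the full LSTM; the shapes (10 timesteps of 512-dim features, a 256-dim embedding) follow the text, but the stand-in itself is an assumption:

```python
import numpy as np

def encode(X, step, h0):
    """Run the recurrent step over each feature row (one timestep per row)
    and return the final hidden state as the tweet embedding."""
    h = h0
    for x_t in X:
        h = step(x_t, h)
    return h

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 512))                 # CharCNN output: 10 x 512
W = rng.normal(size=(256, 512)) * 0.01
U = rng.normal(size=(256, 256)) * 0.01
step = lambda x, h: np.tanh(W @ x + U @ h)     # toy stand-in for LSTM(x_t, h_{t-1})
embedding = encode(X, step, np.zeros(256))     # 256-dim tweet representation
```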
The decoder operates on the encoded representation with two layers of LSTMs. At the initial timestep, the output of the encoding procedure is used as the input to the first LSTM layer. The final LSTM decoder layer generates each character c_t sequentially, combining the previously generated hidden vector (of size 128) with the input for the next timestep's prediction. The prediction of the character at each timestep is given by:

p(c_t \mid c_1, \dots, c_{t-1}) = \mathrm{softmax}\big(\mathrm{LSTM}(x_t, h_{t-1})\big)    (8)

where c_t refers to the character at timestep t and x_t represents the one-hot vector of the character generated at the previous timestep. The result of the softmax is a decoded tweet matrix T′, which is eventually compared with the actual tweet or a synonym-replaced version of the tweet (explained in Section 3) for learning the parameters of the model.
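Decoding per Eq. (8) feeds each predicted character back in as a one-hot vector. The following greedy sketch uses toy stand-ins for the recurrent transition and output projection; the names `step` and `project` are illustrative, and greedy argmax decoding is an assumption about inference:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def greedy_decode(h0, step, project, vocab_size, max_len=150, start_idx=0):
    """Generate characters one at a time, feeding the previous one-hot
    prediction back into the recurrent transition."""
    h = h0
    prev = np.eye(vocab_size)[start_idx]   # one-hot of the start character
    out = []
    for _ in range(max_len):
        h = step(prev, h)                  # recurrent transition (stand-in)
        probs = softmax(project(h))        # distribution over the character set
        idx = int(np.argmax(probs))
        out.append(idx)
        prev = np.eye(vocab_size)[idx]
    return out
```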
3 Data Augmentation & Training
We trained the CNN-LSTM encoder-decoder model on 3 million randomly selected English-language tweets, expanded using data augmentation techniques, which are useful for controlling the generalization error of deep learning models. Data augmentation, in our context, refers to replicating tweets and replacing some of the words in the replicated tweets with their synonyms. These synonyms are obtained from WordNet [3], which contains words grouped together on the basis of their meanings. This involves the selection of replaceable words from the tweet (examples of non-replaceable words are stopwords, user names, hashtags, etc.) and of the number n of words to be replaced. The probability of n is given by a geometric distribution with parameter p, in which P[n] ∼ p^n. Words generally have several synonyms, so the synonym index s of a given word is also determined by another geometric distribution, in which P[s] ∼ q^s. In our encoder-decoder model, we decode the encoded representation to either the actual tweet or a synonym-replaced version of it from the augmented data. Fixed values of p and q were used for training. We also make sure that the POS tags of the replaced words are not completely different from those of the actual words. For regularization, we apply a dropout mechanism after the penultimate layer, which prevents co-adaptation of hidden units by randomly setting a proportion of the hidden units to zero. To learn the model parameters, we minimize the cross-entropy loss as the training objective, using the Adam optimization algorithm [5]. It is given by:
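The augmentation step can be sketched as below. The truncated-geometric sampling, the default p = q = 0.5, and the synonym table are illustrative assumptions (the paper's actual parameter values are not reproduced here), and the POS-tag check is omitted for brevity:

```python
import random

def sample_truncated_geometric(param, max_value):
    """Sample n in {0, ..., max_value} with P[n] proportional to param**n."""
    weights = [param ** n for n in range(max_value + 1)]
    return random.choices(range(max_value + 1), weights=weights)[0]

def augment(tokens, synonyms, p=0.5, q=0.5, non_replaceable=frozenset()):
    """Replace n words (n geometric in p) with their s-th synonym
    (s geometric in q), skipping non-replaceable tokens."""
    candidates = [i for i, w in enumerate(tokens)
                  if w in synonyms and w not in non_replaceable]
    n = sample_truncated_geometric(p, len(candidates))
    out = list(tokens)
    for i in random.sample(candidates, n):
        syns = synonyms[out[i]]
        s = sample_truncated_geometric(q, len(syns) - 1)
        out[i] = syns[s]
    return out
```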
H(p, q) = -\sum_{x} p(x) \log q(x)    (9)

where p is the true distribution (the one-hot vector representing the characters in the tweet) and q is the output of the softmax. This, in turn, corresponds to computing the negative log-probability of the true class.
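With a one-hot target, the loss of Eq. (9) reduces to the negative log-probability of the true character; a minimal sketch:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(p_true, q_pred, eps=1e-12):
    """H(p, q) = -sum_x p(x) log q(x); with one-hot p_true this is the
    negative log-probability assigned to the true character."""
    return float(-np.sum(p_true * np.log(q_pred + eps)))

p = np.array([0.0, 1.0, 0.0, 0.0])     # one-hot true character
q = softmax(np.zeros(4))               # uniform prediction over 4 characters
loss = cross_entropy(p, q)             # equals -log(1/4)
```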
4 Experiments
We evaluated our model using two classification tasks: Tweet semantic relatedness and Tweet sentiment classification.
4.1 Semantic Relatedness
The first evaluation is based on SemEval-2015 Task 1: Paraphrase and Semantic Similarity in Twitter [14]. Given a pair of tweets, the goal is to predict their semantic equivalence (i.e., whether they express the same or a very similar meaning) through a binary yes/no judgement. The dataset provided for this task contains 18K tweet pairs for training and 1K pairs for testing, split between paraphrase and non-paraphrase pairs.
We first extract the vector representations of all the tweets in the dataset using our Tweet2Vec model. We use two features to represent a tweet pair: given two tweet vectors u and v, we compute their element-wise product u ⊙ v and their absolute difference |u − v| and concatenate them (similar to [6]). We then train a logistic regression model on these features, using cross-validation to tune the classification threshold. In contrast to our model, most of the methods used for this task relied largely on extensive feature engineering, or on a combination of feature engineering and semantic spaces. Table 2 shows the performance of our model compared to the top four models in the SemEval 2015 competition, as well as a model trained using ParagraphVec. Our model (Tweet2Vec) outperforms all of these models, without resorting to extensive task-specific feature engineering.
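The pair representation described above can be sketched as follows; the embeddings u and v here are toy 3-dim vectors, whereas in our setting they would be the 256-dim Tweet2Vec outputs, giving a 512-dim feature vector:

```python
import numpy as np

def pair_features(u, v):
    """Concatenate the element-wise product and the absolute difference
    of two tweet embeddings to represent the pair."""
    return np.concatenate([u * v, np.abs(u - v)])

u = np.array([0.2, -0.5, 0.1])
v = np.array([0.1, -0.4, -0.3])
feats = pair_features(u, v)            # length 2 * len(u)
```

A standard logistic regression classifier is then trained on these features, with the decision threshold tuned by cross-validation.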
Model  Precision  Recall  F1
ParagraphVec  0.570  0.680  0.620
nnfeats  0.767  0.583  0.662
ikr  0.569  0.806  0.667
linearsvm  0.683  0.663  0.672
svckernel  0.680  0.669  0.674
Tweet2Vec  0.679  0.686  0.677
4.2 Sentiment Classification
The second evaluation is based on SemEval-2015 Task 10B: Twitter Message Polarity Classification [10]. Given a tweet, the task is to classify it as positive, negative, or neutral in sentiment. The training and test sets contained 9,520 and 2,380 tweets, respectively, spanning the positive, negative, and neutral classes.
As with the previous task, we first extract the vector representations of all the tweets in the dataset using Tweet2Vec and train a logistic regression classifier on them. Even though there are three classes, the SemEval evaluation is binary: performance is measured as the average F1-score of the positive and the negative class. Table 3 shows the performance of our model compared to the top four models in the SemEval 2015 competition (note that only the F1-score is reported by SemEval for this task) and ParagraphVec. Our model outperforms all of these models, again without resorting to any feature engineering.
Model  Precision  Recall  F1
ParagraphVec  0.600  0.680  0.637
INESCID  N/A  N/A  0.642
lsislif  N/A  N/A  0.643
unitn  N/A  N/A  0.646
Webis  N/A  N/A  0.648
Tweet2Vec  0.675  0.719  0.656
5 Conclusion and Future Work
In this paper, we presented Tweet2Vec, a novel method for generating general-purpose vector representations of tweets using a character-level CNN-LSTM encoder-decoder architecture. To the best of our knowledge, ours is the first attempt at learning and applying character-level tweet embeddings. Our character-level model can deal with the noisy and peculiar nature of tweets better than methods that generate embeddings at the word level. Our model is also robust to synonyms, thanks to our data augmentation technique using WordNet.
The vector representations generated by our model are generic, and thus can be applied to tasks of a different nature. We evaluated our model using two different SemEval 2015 tasks: Twitter semantic relatedness and sentiment classification. Simple, off-the-shelf logistic regression classifiers trained on the vector representations generated by our model outperformed the top-performing methods for both tasks, without the need for any extensive feature engineering. This was despite the fact that, due to resource limitations, our Tweet2Vec model was trained on a relatively small set (3 million tweets). Our method also outperformed ParagraphVec, an extension of Word2Vec that handles sentences. This is a small but noteworthy illustration of why our tweet embeddings are well-suited to dealing with the noise and idiosyncrasies of tweets.
For future work, we plan to extend the method to include: 1) augmenting the data by reordering the words in tweets, to make the model robust to word order; and 2) exploiting an attention mechanism [8] in our model to improve the alignment of words in tweets during decoding, which could improve overall performance.
References
 [1] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.

[2] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537, 2011.
[3] C. Fellbaum. WordNet. Wiley Online Library, 1998.
[4] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
 [5] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[6] R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler. Skip-thought vectors. In Advances in Neural Information Processing Systems, pages 3276–3284, 2015.
 [7] Q. V. Le and T. Mikolov. Distributed representations of sentences and documents. arXiv preprint arXiv:1405.4053, 2014.
 [8] J. Li, M.T. Luong, and D. Jurafsky. A hierarchical neural autoencoder for paragraphs and documents. arXiv preprint arXiv:1506.01057, 2015.
 [9] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.

[10] S. Rosenthal, P. Nakov, S. Kiritchenko, S. Mohammad, A. Ritter, and V. Stoyanov. SemEval-2015 Task 10: Sentiment analysis in Twitter. In SemEval, 2015.
[11] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
[12] P. Vijayaraghavan, I. Sysoev, S. Vosoughi, and D. Roy. DeepStance at SemEval-2016 Task 6: Detecting stance in tweets using character and word-level CNNs. In SemEval, 2016.
[13] S. Vosoughi and D. Roy. Tweet acts: A speech act classifier for Twitter. In Proceedings of the 10th ICWSM, 2016.
[14] W. Xu, C. Callison-Burch, and W. B. Dolan. SemEval-2015 Task 1: Paraphrase and semantic similarity in Twitter (PIT). In SemEval, 2015.
 [15] X. Zhang and Y. LeCun. Text understanding from scratch. arXiv preprint arXiv:1502.01710, 2015.