Long Short-Term Memory (LSTM) networks and Convolutional Neural Networks (CNNs) are used in various fields. Both LSTMs and CNNs take sequential inputs of equal length, so variable-length inputs must be padded to a common length. This paper considers a task common to both CNNs and LSTMs, sentiment analysis, and analyses the effect of padding on each. We study two types of padding, namely pre-padding and post-padding. We use Twitter data and classify the tweets into two sentiments, positive and negative. We preprocess the data and use word vectors, that is, distributed representations of words, for this purpose.
2 Literature Survey
LSTMs are used in many areas these days, such as machine translation, as shown in “Sequence to Sequence Learning with Neural Networks” by Ilya Sutskever, Oriol Vinyals and Quoc V. Le; image captioning, as shown in “Show and Tell: A Neural Image Caption Generator” by Oriol Vinyals, Alexander Toshev, Samy Bengio and Dumitru Erhan; handwriting generation, in “Generating Sequences With Recurrent Neural Networks” by Alex Graves; and question answering, by Di Wang and Eric Nyberg in “A Long Short-Term Memory Model for Answer Sentence Selection in Question Answering”. CNNs are mostly used for pattern recognition, either in images or in text. They have been widely used for image classification; one such example is the ImageNet classification work by Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton. They have also been widely used for pattern recognition tasks such as facial recognition, as shown in “Face Recognition: A Convolutional Neural-Network Approach” by S. Lawrence, C. L. Giles, Ah Chung Tsoi and A. D. Back. More recently, they have been applied to natural language processing tasks such as sentence classification, in “Convolutional Neural Networks for Sentence Classification” by Yoon Kim, and sentiment analysis, as shown in “Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts” by Cicero Nogueira dos Santos and Maira Gatti.
In all these examples, the sequences were padded to the maximum length in order to train and test the LSTMs and CNNs on them. In the rest of this section, we first look at various ways of preprocessing variable-length inputs, then briefly discuss word vectors, and finally describe LSTMs and CNNs.
2.1 Preprocessing Variable Length Input Sequences
Generally, the inputs are sequences of integers, floating-point numbers, or vectors of these. The sequences have variable rather than constant lengths (Figure 1).
We have to make them of equal length before we can use them for training. There are four ways of doing this:
Pre-Sequence Truncation: All the sequences are truncated at the beginning to the length of the shortest sequence or to a chosen length (Figure 2). If the chosen length is longer than the shortest sequence, the sequences shorter than the chosen length are padded accordingly.
Post-Sequence Truncation: All the sequences are truncated at the end to the length of the shortest sequence or to a chosen length (Figure 3). If the chosen length is longer than the shortest sequence, the sequences shorter than the chosen length are padded accordingly.
Pre-Padding: All the sequences are padded with zeroes at the beginning up to the length of the longest sequence or to a chosen length greater than that (Figure 4).
Post-Padding: All the sequences are padded with zeroes at the end up to the length of the longest sequence or to a chosen length greater than that (Figure 5).
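The four options above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used in the paper; the function names `fit_length` and `pad_batch` and their parameters are assumptions made for this example.

```python
def fit_length(seq, length, padding="pre", truncating="pre", value=0):
    """Return a copy of seq with exactly `length` items (hypothetical helper)."""
    seq = list(seq)
    if len(seq) > length:
        # Pre-truncation drops items from the beginning, post-truncation from the end.
        return seq[len(seq) - length:] if truncating == "pre" else seq[:length]
    pad = [value] * (length - len(seq))
    # Pre-padding puts the zeroes before the data, post-padding after it.
    return pad + seq if padding == "pre" else seq + pad


def pad_batch(seqs, length=None, padding="pre", truncating="pre"):
    """Bring every sequence to `length`, defaulting to the longest sequence."""
    if length is None:
        length = max(len(s) for s in seqs)
    return [fit_length(s, length, padding, truncating) for s in seqs]


batch = [[5, 3], [7, 1, 4, 9], [2]]
pre_padded = pad_batch(batch, padding="pre")    # zeroes in front
post_padded = pad_batch(batch, padding="post")  # zeroes at the end
```

Choosing a `length` shorter than some sequences triggers truncation for those sequences and padding for the rest, which matches the mixed case described for the truncation options.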
2.2 Word Vectors
Instead of using a Bag of Words (BOW) representation or Continuous Bag of Words (CBOW), we use a representation built from word vectors, in which the words are taken in the order of their appearance as the input to the model.
Here, we use the Word2Vec skip-gram model to build the word vectors. Word vectors in the skip-gram model are learned through a fake task, which is to predict the window of words around a given word. The neural network usually has the whole vocabulary in both the input layer and the output layer, plus one hidden layer whose number of nodes defines the dimension of the word vectors. We first take a word and a window around it; each pair of the given word and a word in its window becomes a training example. After a few epochs, we remove the output layer and keep the hidden (penultimate) layer, whose output is the word vector for the word activated in the input layer.
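The way training pairs are formed from a word and its window can be sketched as follows. This is a minimal illustration of the pairing step only, not of Word2Vec training itself; the function name `skipgram_pairs` and the example sentence are assumptions made for this example.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word and each word in its window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                # window is clipped at the sentence start
        hi = min(len(tokens), i + window + 1)  # and at the sentence end
        for j in range(lo, hi):
            if j != i:                         # a word is not paired with itself
                pairs.append((center, tokens[j]))
    return pairs


tokens = "i love this phone".split()
pairs = skipgram_pairs(tokens, window=1)
```

Each pair would then serve as one (input word, target word) example for the network described above.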
2.3 LSTMs and CNNs
LSTMs are a modification of recurrent neural networks (RNNs). RNNs take their previous output as an input along with the next input; this lets them keep track of their previous output, which makes them well suited to sequences. The modification in LSTMs is that they have a memory, or “cell state”, which is updated as the sequence is processed. This gives them more information and makes them very good at working with sequences where time plays an important role, as in time series forecasting.
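To make the role of the cell state concrete, here is a minimal sketch of one step of a standard LSTM cell, reduced to scalar inputs and states. The weights are arbitrary, untrained values chosen only for illustration, and the names `lstm_step` and `params` are assumptions made for this example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p maps a gate name to (w_x, w_h, b)."""
    def gate(name, activation):
        w_x, w_h, b = p[name]
        return activation(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    i = gate("input", sigmoid)         # how much of the candidate to write
    o = gate("output", sigmoid)        # how much of the cell state to expose
    g = gate("candidate", math.tanh)   # new candidate content
    c = f * c_prev + i * g             # updated cell state (the memory)
    h = o * math.tanh(c)               # output, fed back in at the next step
    return h, c


# Arbitrary illustrative weights (assumption, not trained values).
params = {name: (0.5, 0.5, 0.0)
          for name in ("forget", "input", "output", "candidate")}

h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.25]:
    h, c = lstm_step(x, h, c, params)
```

The previous output `h` and the cell state `c` are both carried from one step to the next, which is what lets the network accumulate information along the sequence.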
CNNs, on the other hand, do not have memory but instead try to find patterns in the given data. Like neurons in ordinary neural networks, neurons in CNNs have learnable weights and biases. These neurons receive multiple inputs and take a weighted sum over them before passing it through an activation function, which produces an output. In addition, the input to a CNN is multi-channeled. CNNs have filters that slide over the input data, take a dot product, and add a bias before producing a number.
Filters in deeper layers extract higher-level features and create multiple feature maps. Unlike in an ordinary neural network, the neurons in a CNN are not fully connected; each is connected only to a subset of the input data. This reduces the number of parameters in the whole network. Next comes a pooling layer, which reduces the spatial size of the representation and thereby the number of parameters and the amount of computation in the network. CNNs end with a fully connected regular neural network, which essentially does the classification.
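The sliding filter and the pooling step can be sketched for a one-dimensional, single-channel input as follows. This is a simplified illustration with a hand-picked filter rather than learned weights; the function names `conv1d` and `max_pool1d` are assumptions made for this example.

```python
def conv1d(xs, kernel, bias=0.0):
    """Slide the kernel over xs; each output is a dot product plus the bias."""
    k = len(kernel)
    return [sum(w * x for w, x in zip(kernel, xs[i:i + k])) + bias
            for i in range(len(xs) - k + 1)]


def max_pool1d(xs, size):
    """Keep the maximum of each non-overlapping window of `size` values."""
    return [max(xs[i:i + size]) for i in range(0, len(xs) - size + 1, size)]


signal = [1.0, 2.0, 0.0, -1.0, 3.0, 1.0]
feature_map = conv1d(signal, kernel=[1.0, -1.0], bias=0.5)
pooled = max_pool1d(feature_map, size=2)
```

Each output of `conv1d` depends on only two neighbouring inputs, which is the local connectivity described above, and `max_pool1d` shrinks the feature map before it would be passed to a fully connected layer.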
The dataset used in this paper was published on the Thinknook website. Only 10% of the tweets were used in this paper. Around 157,860 tweets were taken and split into training and test data. The training set has 126,288 tweets, consisting of 63,001 positive and 63,287 negative tweets, and the test set has 31,572 tweets. All numbers in the tweets were replaced by ‘0’, Twitter handles by ‘1’ and URLs by ‘2’. All words in the text were lemmatized. Stop words were not removed, as negative stop words can affect the sentiment of a tweet. The tweets were then used to train the Word2Vec skip-gram model, with a hidden layer of 100 units, for 5 epochs, producing word vectors of size 100. The window size was 5. The longest tweet had length 93, so all sequences were padded to that length.
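The token replacement step could look like the sketch below. The regular expressions are assumptions about what counts as a number, a handle, or a URL, since the paper does not give its exact rules, and the name `normalize_tweet` is hypothetical. A single pass is used so that the inserted ‘1’ and ‘2’ markers are not themselves rewritten as numbers.

```python
import re

# Assumed patterns; the paper does not specify its exact rules.
TOKEN_RE = re.compile(
    r"(?P<url>(?:https?://|www\.)\S+)"   # URLs
    r"|(?P<handle>@\w+)"                 # Twitter handles
    r"|(?P<number>\d+(?:[.,]\d+)*)"      # integers and simple decimals
)
REPLACEMENT = {"number": "0", "handle": "1", "url": "2"}


def normalize_tweet(text):
    """Replace numbers by '0', handles by '1' and URLs by '2' in one pass."""
    return TOKEN_RE.sub(lambda m: REPLACEMENT[m.lastgroup], text)


example = normalize_tweet("@bob I paid 20 dollars at http://t.co/ab12 lol")
```

Lemmatization is not shown here, as it depends on the lemmatizer chosen.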
4 Final Comparison
The LSTM used in this paper has 4 hidden layers of 100 neurons each. A dropout of 0.2 was used on each layer, with an additional recurrent dropout of 0.2. The LSTM was set to take sequences of length 93, the length of the longest tweet. Tanh was used as the activation function for all layers except the output layer, where sigmoid was used as the activation for classification.
The CNN used in this paper is similar to the LSTM: it has 4 hidden layers of 100 units each, with a dropout of 0.2 on each layer. A linear activation function was used on each layer except the output layer, where a fully connected neuron with a sigmoid activation function was used for prediction.
The following tables compare the models, padding and their accuracies.
[Table: LSTM-4 Pre-Padding vs. LSTM-4 Post-Padding]
Although the post-padding model peaked at 6 epochs and started to overfit after that, its accuracy is considerably lower than that of the pre-padding model. Artificial neural networks are inspired by biological neural networks, so this can be compared to the way people talk, with the zeroes playing the role of silence. Say person X is talking to person Y. If X waits for some time, then talks, and Y replies immediately, there is a higher chance that Y remembers most of what X said, and Y’s reply will be more relevant to it. If instead X says something and Y does not reply immediately, then the longer Y waits to reply, the more of the conversation’s context Y loses, and the less relevant Y’s reply will be.
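The intuition behind this analogy can be illustrated with a toy recurrent unit. This is a deliberately simplified sketch and not the LSTM trained in the paper: it uses a plain recurrence with no gates and no bias, and the weights, the three-value "tweet" and the padding length are all arbitrary assumptions. A trained LSTM may retain information across padding far better than this toy does.

```python
import math


def final_state(seq, w_x=1.0, w_h=0.5):
    """Run a toy recurrence h = tanh(w_x * x + w_h * h) and return the last state."""
    h = 0.0
    for x in seq:
        h = math.tanh(w_x * x + w_h * h)
    return h


tweet = [0.8, -0.3, 0.6]   # stand-in for real word features (assumption)
pad = [0.0] * 20

unpadded = final_state(tweet)
pre = final_state(pad + tweet)    # silence first, then the content
post = final_state(tweet + pad)   # content first, then a long silence
```

With pre-padding the state stays at zero until the content arrives, so the final state equals that of the unpadded sequence. With post-padding the state is repeatedly shrunk by the trailing zeroes, so little of the content is left by the time the output is read.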
[Table: CNN-4 Pre-Padding vs. CNN-4 Post-Padding]
Pre-padding versus post-padding does not matter much for CNNs because, unlike LSTMs, CNNs do not try to remember anything from the previous output but instead try to find patterns in the given data. For sentiment analysis, LSTMs are more efficient than CNNs, and it is better to pre-pad the sequences, as pre-padding was more efficient in the case of the LSTM. In general, pre-padding would be the better choice when multiple types of neural networks are combined to perform a task.
-  Sutskever, I., Vinyals, O. and Le, Q.V. Sequence to sequence learning with neural networks. In Advances in neural information processing systems (pp. 3104-3112), 2014.
-  Vinyals, O., Toshev, A., Bengio, S. and Erhan, D. Show and tell: A neural image caption generator. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on (pp. 3156-3164), 2015.
-  Graves, A., Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
-  Wang, D. and Nyberg, E. A long short-term memory model for answer sentence selection in question answering. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers) (Vol. 2, pp. 707-712), 2015.
-  Krizhevsky, A., Sutskever, I. and Hinton, G.E., Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105), 2012.
-  Lawrence, S., Giles, C.L., Tsoi, A.C. and Back, A.D., Face recognition: A convolutional neural-network approach. IEEE transactions on neural networks, 8(1), pp.98-113. 1997.
-  Kim, Y., Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882. 2014.
-  dos Santos, C. and Gatti, M., Deep convolutional neural networks for sentiment analysis of short texts. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers (pp. 69-78). 2014.
-  Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
-  Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Comput. 9, 8 (November 1997), 1735-1780. 1997.
-  L. C. Jain and L. R. Medsker. Recurrent Neural Networks: Design and Applications (1st ed.). CRC Press, Inc., Boca Raton, FL, USA. 1999.
-  LeCun, Y. and Bengio, Y., Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361(10), 1995.
-  Naji, Ibrahim. Twitter Sentiment Analysis Training Corpus (Dataset). Thinknook [Online]. Available: http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/. [Accessed: 26-Apr-2018].
-  Muller, Thomas; Cotterell, Ryan; Fraser, Alexander; Schutze, Hinrich. Joint Lemmatization and Morphological Tagging with LEMMING. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2268–2274, 2015.
-  Saif, Hassan, Miriam Fernández, Yulan He, and Harith Alani. On stopwords, filtering and data sparsity for sentiment analysis of Twitter. pp. 810-817, 2014.