Written text in the form of documents, reports, emails, articles, etc. forms an integral part of our everyday life. Classification, clustering, analysis and other related machine learning (ML) tasks are important components in making the huge amounts of text useful for higher-level semantic tasks such as web search, sentiment analysis, and open-domain question answering. Natural language processing (NLP) and the accompanying body of research also form a complementary space which can gain from developments in the text processing field, and vice versa. Text and NLP based ML research was largely focused on intelligent feature design until the advent of deep learning. Computer vision (CV) took the pole position in this revolution, but text and NLP are quickly catching up. Present-day text understanding and classification techniques either rely on human-designed features (Wang and Manning (2012)) or on … (2016).
Another complementary field which has been developing recently is the handling of short text snippets (Severyn and Moschitti (2015)), similar to those issued as web search queries or those appearing as titles of documents and web pages. Techniques which have worked for understanding larger texts do not seem to work well in short text scenarios. Unlike paragraphs or documents, short texts do not always observe the syntax of natural language, and they usually lack context. Short texts can also be fairly ambiguous, even to human observers, because they contain polysemes and typos.
Sentence pair processing (Yin et al. (2016)) is the third related field, which is largely based on classifying not one but multiple textual units at the same time. In this paper we study these varied research subfields under one domain, with the assumption that a strong representation can lead to the adoption of standardized convolutional neural network based models from CV in the text and NLP domain.
Representation models can further be divided into two categories: explicit representation and implicit representation (Wang and Wang (2016)). For explicit approaches, a given text is modeled following traditional NLP steps, including chunking, labeling, and syntactic parsing. Although explicit models are easily understood by human beings, they are highly prone to getting entangled in the ambiguities inherent to human language. Additionally, they suffer from the data sparsity problem. For example, when an entity is missing from a knowledge base, one cannot obtain any representative feature for it, and forceful smoothing techniques need to be employed.
For implicit representation, the text is represented using a Neural Language Model (NLM) (Bengio et al. (2003)). An NLM maps texts to an implicit semantic space and parameterizes them as vectors. An implicit representation model can capture richer information from context and facilitate text understanding with the help of deep neural networks. However, apart from not being human interpretable, such models also suffer from data distribution problems such as handling rare words and phrases. Representing the words in a text snippet by their word embeddings and then treating the resulting stack as a matrix on which to run CNNs was proposed by numerous authors (Collobert et al. (2011), Kim (2014)).
In this work we propose a novel representation for text blocks, namely sentences, paragraphs and documents, which can be obtained by encoding the interactions of each word with its neighbors in the text block. More specifically,
- We propose a very simple yet elegant representation for a text block and show that this representation is better suited for additive convolution kernels.
- Coupled with this representation, we use very simple convolution networks, lifted as is from the computer vision domain, to obtain state-of-the-art results on multiple text classification benchmark datasets.
- We propose simple transfer learning experiments, both from text to text and from image to text. To the best of our knowledge, these experiments have never been reported, at least in their most advanced sense.
2 Related Work
The idea that at least some aspects of word meaning can be induced from patterns of word co-occurrence is becoming increasingly popular. The success of word embeddings in numerous text and NLP tasks is largely due to this intuition. A systematic exploration of the principal computational possibilities for formulating and validating representations of word meanings from word co-occurrence statistics was presented by Bullinaria and Levy (2007). Pairwise word interaction as a possible representation unit was proposed by He and Lin (2016). They proposed a pairwise word interaction model and further tied it to a similarity focus mechanism to identify important correspondences for better similarity measurement. Although the starting point is very similar to our work, the focus mechanism is largely hand crafted, leaving the method extremely fine tuned to one particular task.
Another work which closely follows the ideas we discuss here is the neural attention based sentence summarization work by Rush et al. (2015). They propose an attention framework to identify the abstractive summary for a given text block. The cost function is an NLM, with the input pair encoding being done by a simple attention mechanism. Again, the simple attention mechanism comes close to our work, but the NLM makes the representation highly task specific.
We propose a model which is simple both to generate and to train. The trained model can further be used on a completely different dataset. These two facts lead to a generic concept of handling text blocks in a manner similar to the transfer learning scenarios which have become popular in the CV domain. One of the earliest attempts at transfer learning for text classification was proposed by Do and Ng (2005). They proposed to use hand crafted features to represent the documents and then learned a softmax classifier. Even though this satisfies the explicit requirements for transfer learning, the features are again tuned to a specific category of problems. The specific reliance on TF-IDF type features leads to failures in generalization when we switch to modern day datasets arising out of web queries and tweets. Another similar work was proposed by Raina et al. (2006), where a smoothed word correlation matrix was learned for transfer learning.
We propose a document to matrix conversion that utilizes the inherent redundancies present in text. Word embedding methods such as GloVe, proposed by Pennington et al. (2014), use a small window around each word to define a context for the center word. This process is then iterated, whereby the inner product of the center word with the context words is maximized to generate stable representations for the words in vector spaces. This method of finding vector representations has been shown to have multiple desirable properties.
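The windowed context described above can be illustrated with a small co-occurrence counter. This is a simplified sketch rather than the GloVe implementation: it uses plain counts (GloVe additionally down-weights distant context words), and the function name `cooccurrence_counts`, the `window` parameter and the toy sentence are illustrative assumptions.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=2):
    """Count ordered (center, context) pairs that lie within `window` positions."""
    counts = Counter()
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                counts[(center, tokens[j])] += 1
    return counts


tokens = "the cat sat on the mat".split()
counts = cooccurrence_counts(tokens, window=1)
print(counts[("the", "cat")], counts[("the", "sat")])
```

With a symmetric window, the count for a pair and its reverse are equal, which is why a single count per unordered pair is sufficient in practice.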
Looking at the trained embeddings, one might then assume that the embedding for a particular word is highly conditioned on the immediate neighborhood that it appears in. This essentially means a word and its immediate context neighbors span a low dimensional manifold. Locally linear embedding (LLE) (Roweis and Saul (2000)) starts by finding an estimate for the low dimensional manifold and then follows it up with an eigen-decomposition step which finds the lower dimensional representation for the data. Incidentally, for very high dimensional data, finding the lower intrinsic dimension onto which to project the data is itself a challenge, as shown by Gupta and Huang (2010). But the first step of representing the center point as a linear combination of the neighboring points does indeed create a low dimensional representation for the center point to lie in.
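For reference, the first LLE step can be written out explicitly. The notation here ($x$ for the center point, $x_j$ for its $K$ neighbors, $w_j$ for the reconstruction weights) is chosen for illustration and follows the standard formulation of Roweis and Saul (2000), not necessarily the notation used in this paper.

```latex
\min_{w}\; \Big\| x - \sum_{j=1}^{K} w_j x_j \Big\|^2
\quad \text{subject to} \quad \sum_{j=1}^{K} w_j = 1,
\qquad\text{with solution}\qquad
w = \frac{C^{-1}\mathbf{1}}{\mathbf{1}^\top C^{-1}\mathbf{1}},
\quad C_{jk} = (x - x_j)^\top (x - x_k).
```

Here $C$ is the local Gram matrix of the neighbors relative to the center point; when $C$ is close to singular (for example, when $K$ exceeds the embedding dimension), a small multiple of the identity is usually added before inversion.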
Another tangential way to motivate our model is self-attention. Let us assume we want to find the self-attention of each word in a document with its neighbors and repeat it for the entire document. This amounts to identifying the strong affinities present among the words constituting the document. Note that most question answering (QA) techniques such as BiDAF (Seo et al. (2016)) and R-NET (Wang et al. (2017)), developed for the SQuAD dataset challenge proposed by Rajpurkar et al. (2018), use this attention matrix between the question and answer as the starting point for identifying the answer span.
3.1 Convergence Analysis
Let a piece of text be defined as an ordered collection of words $D = (w_1, w_2, \dots, w_N)$. Defining the embedding for each word as $x_i \in \mathbb{R}^d$ and taking a dense inner product of all the words in the document with all other words in the same document, we can write the dense self-attention matrix as
$$A_{ij} = x_i^\top x_j,$$
where $i, j \in \{1, \dots, N\}$ and $N$ is the total number of words in the document. The embedding for each word can be obtained from any of the state-of-the-art embedding techniques. For our comparative experiments we use the 300 dimensional GloVe embedding by Pennington et al. (2014). Setting the neighborhood of every word as the entire document is highly redundant, and hence we set the neighborhood as a small interval around each word, denoted by $k$, which is a hyper-parameter in our work. Introducing this hyper-parameter, the representation for a document can now be written as
$$A^{(k)}_{ij} = x_i^\top x_j, \qquad 0 < |i - j| \le k.$$
Note that for the ends of the document, where the one sided neighborhood is smaller than $k$, we can use zero padding, or we can embed with respect to circular neighbors formed by joining the beginning to the end of the document. We call this representation the document to image (D2I) representation. Starting from the first reconstruction equation of LLE by Roweis and Saul (2000), we can write
$$\varepsilon(W) = \sum_i \Big\| x_i - \sum_j W_{ij} x_j \Big\|^2,$$
where $\varepsilon(W)$ denotes the cost of reconstructing each $x_i$ from its neighbors $x_j$, with $W$ as the matrix of reconstruction weights. This equation is solved individually for each $i$, and hence we can look into each component of the sum separately, dropping the dependence on $i$ and $W$ for notational simplicity, as:
$$\varepsilon = \Big\| x - \sum_j w_j x_j \Big\|^2.$$
One of the properties of the right hand side of the reconstruction equation is the fact that it is scaling and rotation independent (Roweis and Saul (2000)). Denoting as , we can scale the entire equation by and write
where we have absorbed into the new unknown weights . One interesting way to look at Eq. 6 is that and are entirely interchangeable in it.
Now let us look at the cost function used to learn the GloVe embedding vectors. The GloVe optimization function is written as
$$J = \sum_{i,j} f(X_{ij}) \left( x_i^\top \tilde{x}_j + b_{ij} - \log X_{ij} \right)^2,$$
where $x_i$ and $\tilde{x}_j$ are the word and context vectors, we club the two bias terms $b_i$ and $\tilde{b}_j$ into one joint term $b_{ij}$, similar to as defined earlier, and $X_{ij}$ is the count of the two words $i$ and $j$ appearing in the neighborhood of each other in the entire training corpus. Assuming the vectors have been trained, we can again separate the equation out into separate optimization functions for each center word, leading to
$$J_i = \sum_{j} f(X_{ij}) \left( x_i^\top \tilde{x}_j + b_{ij} - \log X_{ij} \right)^2,$$
where $J_i$ denotes the optimized cost function for the center word at index $i$. The local optimum at this point corresponds to the derivative with respect to the weight vectors being close to zero. Also note that the function $f$ scales down low frequency pairs, but caps the high frequency pairs at 1. Using this capping property, at least for the high frequency pairs, we can ignore the scaling and write
$$J_i \approx \sum_{j} \left( x_i^\top \tilde{x}_j + b_{ij} - \log X_{ij} \right)^2.$$
Noting that at the local optima, the cost function changes much more slowly with respect to the bias terms than , we can approximate the term outside the first bracket as a constant for the center word and write the final equation as
where . This equation is a scaled version of Eq. 6, with and the superscript denotes that it is the weight equivalent for the GloVe formulation. Finally, by applying the linearity of expectation (see http://www.math.mcgill.ca/dstephens/556/Handouts/Math556-05-Inequalities.pdf) and the fact that , combining Eq. 6 and Eq. 10, we get
and hence the GloVe cost function is a variational upper bound on the LLE reconstruction cost for high frequency neighborhoods and minimizing it leads to a neighborhood which respects the LLE constraint.
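The capping behavior of the GloVe weighting function used in the argument above can be seen in a short sketch. The functional form and the defaults `x_max = 100` and `alpha = 0.75` are those reported by Pennington et al. (2014); the function name `glove_weight` is ours.

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): grows as (x / x_max) ** alpha, then stays at 1."""
    if x < x_max:
        return (x / x_max) ** alpha
    return 1.0


# Low-frequency pairs are scaled down; high-frequency pairs all receive weight 1.
weights = [glove_weight(x) for x in (1, 10, 100, 1000)]
print(weights)
```

For any pair whose co-occurrence count is at least `x_max`, the weight is exactly 1, which is the regime in which the scaling can be dropped from the per-word cost.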
3.2 Implementation Details
Once the document has been transformed to this matrix of size this can now be reshaped, resized and put through all the transformations which are done over matrices, more specifically images. Any unknown word, which appears as UNK in many of these word embedding techniques, will appear as a row of all zeros in the document image. In our method we remove all such rows from the image. Two examples from our dataset are shown in Fig. 1 (first and third panels). The diagonal patterns of solid blues refer to the unknown words which lead to zeros in the image. These images encode both the short as well as long distance correlations amongst the words in a document. For every word we explicitly encode its interactions with its neighbors. The nearby rows encode the nearby words. If this image were to be convolved by a kernel of size , then the encoding distance increases to . This renders a simplified way to encode long distance relationships amongst the words of a document.
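The transformation described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the window is taken as `k` neighbors on each side, the inner product of a word with itself is left out, unknown words are passed as `None` and treated as zero vectors whose rows are dropped afterwards, and the names `dot` and `d2i` are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def d2i(embeddings, k, circular=False):
    """Return one row per known word, holding inner products with its 2k neighbors.

    embeddings: list of vectors (lists of floats); None marks an unknown word.
    circular:   wrap around the document ends instead of zero padding.
    """
    known = [e for e in embeddings if e is not None]
    if not known:
        return []
    dim = len(known[0])
    # Unknown words become zero vectors, so they contribute zeros to their neighbors.
    vectors = [e if e is not None else [0.0] * dim for e in embeddings]
    n = len(vectors)
    offsets = [o for o in range(-k, k + 1) if o != 0]  # skip the word itself
    image = []
    for i, emb in enumerate(embeddings):
        if emb is None:
            continue  # remove the all-zero row of an unknown word
        row = []
        for o in offsets:
            j = i + o
            if circular:
                j %= n
            row.append(dot(vectors[i], vectors[j]) if 0 <= j < n else 0.0)
        image.append(row)
    return image


doc = [[1.0, 0.0], [1.0, 2.0], None, [2.0, 1.0]]  # toy 2-d embeddings, third word unknown
padded = d2i(doc, k=1)
wrapped = d2i(doc, k=1, circular=True)
print(padded)
print(wrapped)
```

The resulting list of rows can be handed to any image routine for reshaping or resizing; the zeros contributed by the unknown word remain visible in the rows of its neighbors.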
Looking closely at the neighborhood of one pixel in the new representation image, as shown in Fig. 2, we can now define interesting interpretations for the edges in the representation image. For the center pixel, , the vertical edge towards right is . Similarly, the horizontal edge coming down is denoted by , which is essentially the convolution at with the edge operator multiplied by the embedding at to project the embeddings to the inner product space. Similarly, all the edges which can be obtained by subtracting the center pixel in green from its immediate neighbors in blue can be represented by a convolution with a simple edge filter of type followed by a projection to the inner product space.
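The edge interpretation rests on the linearity of the inner product: a difference between two adjacent pixels equals the inner product of one embedding with the difference of two neighboring embeddings. The sketch below checks this on the dense self-attention matrix; the toy vectors and the names `X`, `A` and `edge_differences` are illustrative assumptions, not taken from the paper.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sub(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]


def edge_differences(A, i, j):
    """Differences between pixel (i, j) and its right and lower neighbors."""
    right_diff = A[i][j + 1] - A[i][j]
    down_diff = A[i + 1][j] - A[i][j]
    return right_diff, down_diff


X = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]      # toy embeddings, one row per word
A = [[dot(xi, xj) for xj in X] for xi in X]    # dense self-attention matrix

i, j = 0, 1
right_diff, down_diff = edge_differences(A, i, j)
# Edge filter [-1, 1] applied to the embedding sequence, then projected by an inner product.
print(right_diff == dot(X[i], sub(X[j + 1], X[j])))
print(down_diff == dot(sub(X[i + 1], X[i]), X[j]))
```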
This brings us to the comparison with the preferred way of using word embeddings, which is to represent each word by its embedding, stack the embeddings next to each other, and then apply convolution filters, as proposed by numerous authors (Qiu and Huang (2015), Severyn and Moschitti (2015)). The two example documents from Fig. 1 (first and third panels from the left), when represented as matrices of their stacked word embeddings, appear as shown in Fig. 1 (second and fourth panels). Note that these images still show a lot more cross pollination amongst dimensions, because we train smaller 50 dimensional embeddings for representational purposes. The actual 300 dimensional GloVe embedding shows extremely little cross pollination along the horizontal axis.
This representation has been used by numerous authors to create intermediate representations for sentences, documents, etc., or as input to more complex models such as those proposed by Das et al. (2016), Kim (2014), and Collobert et al. (2011). These operations do encode correlations between the different dimensions of the representation of the same word (1D interactions) well, but do not exploit the redundancies present within the document itself. The relations between the entries of different word vectors have been shown to be fairly non-correlated (see https://nlp.stanford.edu/projects/glove/), and hence the information encoded by convolution kernels is not very well understood in this case. The banded structure of the word embeddings results from the fact that the multiplicative interactions in the GloVe model occur component-wise. While there are additive interactions resulting from the dot product in the cost function, in general there is little room for the individual dimensions to cross-pollinate. Hence any additive kernel convolved with such a representation, as in the work by Qiu and Huang (2015), does not encode the proper information content of the document and tries to force-fit an additive response from non-correlated entities.
Once we move to the inner product based self-attention space, these limitations are removed. This is mostly evident from the fact that in all our experiments we use the simplest convolutional neural network for MNIST, as provided in the TensorFlow tutorial (https://www.tensorflow.org/tutorials/estimators/cnn). Visually it appears that the self-attention based representation encodes more information into a compact space. The self-attention matrix for one of the example documents in Fig. 1 is shown in Fig. 3. The deep red color at the principal diagonal refers to the inner product of the word with itself. This value is removed from the representation. Also note that the regions near the principal diagonal have on average higher energy (darker color), with some strong low energy (lighter color) words. These regions are created by non-connected, rare words such as dossier and incalculable. This also points to a lookup based refinement of the self-attention image, where a rare word can be replaced by its more frequently used synonym using a resource like WordNet (Miller (1995)), but this has been left as a future research direction.
4 Experiments and Results
For the experiments reported in this section we adopt a simple five layer CNN with an additional softmax layer, as shown in Fig. 4.
Note that the document to image (D2I) transformation remains the same for all the datasets. We fix the hyper-parameter .
The TwitterPPDB dataset (https://github.com/castorini/data/tree/master/twitterPPDB) is a new dataset, which is also the largest human-labeled paraphrase corpus to date. It consists of 51,524 sentence pairs and provides the first cross-domain benchmark for automatic paraphrase identification, textual similarity, and question answering tasks. The comparative results for this dataset are shown in Table 1.
|DeepPairwiseWord||He and Lin (2016)||0.7490|
We present a comparative evaluation on WikiQA (Yang et al. (2015)), an open domain question-answer dataset. We use the subtask that assumes that there exists at least one correct answer for each question. The WikiQA dataset consists of 20,360 question-candidate pairs for training, 1,130 pairs in the validation set and 2,352 pairs in the test set, where we adopt the standard setup of only considering questions with correct answers in the test set. For binary classification, mean average precision is not a good metric, and the mean reciprocal rank (MRR) for non-multilabel problems can be obtained by using label ranking average precision (http://scikitlearn.org/stable/modules/model_evaluation.html#label-ranking-average-precision). The comparative results for the WikiQA dataset are shown in Table 2. The ABCNN model of Yin et al. (2016) generates a dense attention matrix between the question and answer representations and showed impressive gains on the WikiQA dataset. The self-embedding matrix shown in Fig. 3 is quite similar to the inner product based attention proposed by Yin et al. (2016). The major difference is that we utilize the word embeddings directly to arrive at the self-attention, whereas Yin et al. (2016) use convolutions over the stacked embedding representation and then generate the intermediate representations which are used to get the attention matrix.
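As a concrete illustration of the metric, the mean reciprocal rank can be computed directly from per-question candidate scores. This is a minimal sketch with made-up scores; the function name `mean_reciprocal_rank` and the input layout are assumptions. When every question has exactly one correct candidate and no scores are tied, the value should coincide with label ranking average precision.

```python
def mean_reciprocal_rank(questions):
    """questions: list of (scores, labels) pairs, one pair per question.

    scores: model scores for the candidate answers (higher is better).
    labels: 1 for a correct candidate, 0 otherwise; at least one 1 per question.
    """
    total = 0.0
    for scores, labels in questions:
        # Candidate indices sorted from highest to lowest score.
        order = sorted(range(len(scores)), key=lambda idx: scores[idx], reverse=True)
        for rank, idx in enumerate(order, start=1):
            if labels[idx] == 1:
                total += 1.0 / rank  # reciprocal rank of the first correct answer
                break
    return total / len(questions)


toy_questions = [
    ([0.2, 0.9, 0.4], [0, 1, 0]),  # correct answer ranked first
    ([0.7, 0.1, 0.5], [0, 0, 1]),  # correct answer ranked second
]
mrr = mean_reciprocal_rank(toy_questions)
print(mrr)
```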
The TrecQA dataset (Voorhees (2001)) from the Text Retrieval Conferences has been widely used for the answer selection task during the past decade. To enable direct comparison with previous work, we used the same training, development, and test sets as released by Yao et al. (2013). The TrecQA data consists of 1,229 questions with 53,417 question-answer pairs in the TRAIN-ALL training set, 82 questions with 1,148 pairs in the development set, and 100 questions with 1,517 pairs in the test set. The comparative results for the TrecQA dataset are shown in Table 3.
Sentences Involving Compositional Knowledge (SICK) is from Task 1 of the 2014 SemEval competition (Marelli et al. (2014)) and consists of 9,927 annotated sentence pairs, with 4,500 for training, 500 as a development set, and 4,927 for testing. Each pair has a relatedness score in [1, 5], which increases with similarity, and an entailment label which takes one of three values: contradict, entail and neutral. The entailment prediction task was a labeling task, and we present results for that task only. The comparative results for the SICK dataset are shown in Table 4. Note that both DTRNN by Socher et al. (2011) and DeepPairwiseWord by He and Lin (2016) use orders of magnitude more parameters than our proposed model. Also note that since the SICK dataset has both relatedness and entailment signals mixed into the data, He and Lin (2016) propose a KL divergence based cost function. We have maintained the softmax cost function to highlight the generic behavior of the proposed model. Note that with the larger amount of data in TwitterPPDB, our model beats the DeepPairwiseWord model, as shown in Table 1. This points towards over-fitting in the case of the smaller dataset.
5 Transfer Learning Experiments
5.1 Text to Text Transfer Learning
We believe that this set of experiments is the first of its kind in this field. We start with the simpler experiment, wherein we train a network on the TwitterPPDB dataset. The hyper-parameter in all these experiments. We choose the same network as shown in Fig. 4. This 5 layer network is first trained with TwitterPPDB data until convergence. We obtain the features from the last dense layer of the trained network, and fine tune it with the TrecQA, WikiQA and SICK data to generate the results shown in Table 5. Although the results are lower than when the model is trained on each dataset's own data, they show that the information transferred through the network is still significant. Also note that for both WikiQA and TrecQA, the transfer learned model still beats the best state-of-the-art models, as shown in Table 2 and Table 3.
|Train Dataset||Fine Tune Dataset||MRR|
5.2 Image to Text Transfer Learning
We train the same network as shown in Fig. 4 with MNIST data (LeCun and Cortes (2010)) and then fine tune the model with WikiQA, TrecQA and SICK data. We fix the size parameter to match the dimension of MNIST images. To the best of our knowledge, this is the first instance of transfer learning between the image and text domains. The MRR values for the three datasets are shown in Table 6.
Note that the results are slightly higher than the corresponding values shown in Table 5. The gain in performance may be due to the fact that the base system had better information to train itself by utilizing the MNIST data. The structure present in the MNIST dataset, when used for pre-training, may lead to more generic learning of the weights than is achieved with the images generated by the D2I conversion of the Twitter dataset. This leads to better performance on the downstream classification tasks. These results are extremely interesting, and we propose to continue working on more such experiments in the future.
|Train Dataset||Fine Tune Dataset||MRR|
6 Conclusion and Future Work
In this paper we have presented a novel yet simple method of incorporating neighborhood information in text and consequently converting it to an image. This representation has some unique properties, such as encoding meaningful edge information, and it can be easily fed to standard networks lifted from the computer vision community. We present comparative results on multiple text classification datasets and show that our model outperforms most of the compared methods. We also present text to text and image to text transfer learning experiments and show that, after transforming to the proposed D2I representation, text and vision can be handled within the ambit of similar models. To the best of our knowledge, this is the first attempt at bridging the gap between the separate networks which have been proposed in the vision and text communities, and we sincerely hope that this will generate a lot of interest within the research community to take this further. The representation is still single channel, and hence networks designed for single channel gray scale images can be used for now. Transforming the self-attention images to multiple channels, such that they can exploit more sophisticated models such as Inception-v3 (Szegedy et al. (2016)), remains an active research opportunity for the future.
- Bengio et al. (2003) Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin. A neural probabilistic language model. J. Mach. Learn. Res., 3:1137–1155, 2003.
- Bullinaria and Levy (2007) J. A. Bullinaria and J. P. Levy. Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, 39(3):510–526, Aug 2007.
- Collobert et al. (2011) R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. J. Mach. Learn. Res., 12:2493–2537, Nov. 2011.
- Conneau et al. (2016) A. Conneau, H. Schwenk, L. Barrault, and Y. Lecun. Very deep convolutional networks for natural language processing. CoRR, abs/1606.01781, 2016.
- Das et al. (2016) A. Das, H. Yenala, M. K. Chinnakotla, and M. Shrivastava. Together we stand: Siamese networks for similar question retrieval. In ACL, 2016.
- Do and Ng (2005) C. B. Do and A. Y. Ng. Transfer learning for text classification. In Proceedings of the 18th International Conference on Neural Information Processing Systems, NIPS’05, pages 299–306, Cambridge, MA, USA, 2005. MIT Press.
- Gupta and Huang (2010) M. Gupta and T. Huang. Regularized maximum likelihood for intrinsic dimension estimation. In UAI 2010, pages 220–227, 2010.
- He and Lin (2016) H. He and J. J. Lin. Pairwise word interaction modeling with deep neural networks for semantic similarity measurement. In HLT-NAACL, pages 937–948. The Association for Computational Linguistics, 2016.
- Kim (2014) Y. Kim. Convolutional neural networks for sentence classification. In EMNLP, pages 1746–1751, 2014.
- LeCun and Cortes (2010) Y. LeCun and C. Cortes. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/.
- Marelli et al. (2014) M. Marelli, L. Bentivogli, M. Baroni, R. Bernardi, S. Menini, and R. Zamparelli. Semeval-2014 task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment, 08 2014.
- Miller (1995) G. A. Miller. Wordnet: A lexical database for english. Commun. ACM, 38(11):39–41, Nov. 1995.
- Pennington et al. (2014) J. Pennington, R. Socher, and C. D. Manning. Glove: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543, 2014.
- Qiu and Huang (2015) X. Qiu and X. Huang. Convolutional neural tensor network architecture for community-based question answering. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI’15, pages 1305–1311. AAAI Press, 2015.
- Raina et al. (2006) R. Raina, A. Y. Ng, and D. Koller. Constructing informative priors using transfer learning. In Proceedings of the 23rd International Conference on Machine Learning, ICML ’06, pages 713–720, New York, NY, USA, 2006. ACM.
- Rajpurkar et al. (2018) P. Rajpurkar, R. Jia, and P. Liang. Know what you don’t know: Unanswerable questions for squad. In ACL 2018, Volume 2: Short Papers, pages 784–789, 2018.
- Roweis and Saul (2000) S. T. Roweis and L. K. Saul. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science, 290(5500):2323–2326, 2000.
- Rush et al. (2015) A. M. Rush, S. Chopra, and J. Weston. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389. Association for Computational Linguistics, 2015.
- Seo et al. (2016) M. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi. Bidirectional Attention Flow for Machine Comprehension. ArXiv e-prints, Nov. 2016.
- Severyn and Moschitti (2015) A. Severyn and A. Moschitti. Learning to rank short text pairs with convolutional deep neural networks. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’15, pages 373–382, New York, NY, USA, 2015. ACM.
- Socher et al. (2011) R. Socher, E. H. Huang, J. Pennington, A. Y. Ng, and C. D. Manning. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Proceedings of the 24th International Conference on Neural Information Processing Systems, NIPS’11, pages 801–809, 2011.
- Szegedy et al. (2016) C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826, 2016.
- Voorhees (2001) E. M. Voorhees. The trec question answering track. Nat. Lang. Eng., 7(4):361–378, Dec. 2001.
- Wang and Manning (2012) S. Wang and C. D. Manning. Baselines and bigrams: Simple, good sentiment and topic classification. In ACL: Short Papers - Volume 2, ACL ’12, pages 90–94, 2012.
- Wang et al. (2017) W. Wang, N. Yang, F. Wei, B. Chang, and M. Zhou. Gated self-matching networks for reading comprehension and question answering. In ACL (Volume 1: Long Papers), pages 189–198, 2017.
- Wang and Wang (2016) Z. Wang and H. Wang. Understanding short texts. In the Association for Computational Linguistics (ACL) (Tutorial), August 2016.
- Yang et al. (2015) Y. Yang, S. W.-t. Yih, and C. Meek. Wikiqa: A challenge dataset for open-domain question answering. EMNLP, pages 2013–2018, September 2015.
- Yao et al. (2013) X. Yao, B. V. Durme, C. Callison-Burch, and P. Clark. Answer extraction as sequence tagging with tree edit distance. In North American Chapter of the Association for Computational Linguistics (NAACL), 2013.
- Yin et al. (2016) W. Yin, H. Schütze, B. Xiang, and B. Zhou. Abcnn: Attention-based convolutional neural network for modeling sentence pairs. Transactions of the Association for Computational Linguistics, 4:259–272, 2016. URL http://aclweb.org/anthology/Q16-1019.
- Yu et al. (2014) L. Yu, K. M. Hermann, P. Blunsom, and S. Pulman. Deep learning for answer sentence selection. Proceedings of the Deep Learning and Representation Learning Workshop: NIPS-2014, 2014.
- Zhang et al. (2017) H. Zhang, J. Rao, J. Lin, and M. D. Smucker. Automatically extracting high-quality negative examples for answer selection in question answering. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’17, pages 797–800, New York, NY, USA, 2017. ACM.