Modern NLP systems rely on word embeddings as input units to encode the statistical semantic and syntactic properties of words, ranging from standard context-independent embeddings such as word2vec Mikolov et al. (2013) and GloVe Pennington et al. (2014) to contextualized embeddings such as ELMo Peters et al. (2018) and BERT Devlin et al. (2018). However, most applications operate at the phrase or sentence level. Hence, the word embeddings are averaged to yield sentence embeddings. Averaging is an efficient compositional operation that leads to good performance. In fact, averaging is difficult to beat with more complex compositional models, as illustrated across several classification tasks: topic categorization, semantic textual similarity, and sentiment classification Aldarmaki and Diab (2018).
Encoding sentences into fixed-length vectors that capture sentence-level linguistic properties, and thereby yield performance gains across different classification tasks, remains a challenge. Given the complexity of most models that attempt to encode sentence structure, such as convolutional, recursive, or recurrent networks, the trade-off between efficiency and performance tips the balance in favor of simpler models like vector averaging. Sequential neural sentence encoders, like Skip-thought Kiros et al. (2015) and InferSent Conneau et al. (2017), can potentially encode rich semantic and syntactic features from sentence structures. However, for practical applications, sequential models are rather cumbersome and inefficient, and the gains in performance are typically modest compared with vector averaging Aldarmaki and Diab (2018). In addition, the more complex models typically do not generalize well to out-of-domain data Wieting et al. (2015). FastSent Hill et al. (2016) is an unsupervised alternative approach of lower computational cost, but similar to vector averaging, it disregards word order. Tensor-based composition can effectively capture word order, but current approaches rely on restricted grammatical constructs, such as transitive phrases, and cannot be easily extended to variable-length sequences of arbitrary structures Milajevs et al. (2014). Therefore, despite its obvious disregard for structural properties, the efficiency and reasonable performance of vector averaging make it more suitable for practical text classification.
In this work, we propose to use the Discrete Cosine Transform (DCT) as a simple and efficient way to model word order and structure in sentences while maintaining practical efficiency. DCT is a widely used technique in digital signal processing applications such as image compression Watson (1994) and speech recognition Huang and Zhao (2000), but to our knowledge, this is the first successful application of DCT to NLP, and in particular to sentence embedding. We use DCT to summarize the general feature patterns in word sequences and compress them into fixed-length vectors. Experiments in probing tasks demonstrate that our DCT embeddings preserve more syntactic and semantic features compared with vector averaging. Furthermore, the results indicate that DCT performance in downstream applications is correlated with these features.
2.1 Discrete Cosine Transform
Discrete Cosine Transform (DCT) is an invertible function that maps an input sequence of $N$ real numbers to the coefficients of $N$ orthogonal cosine basis functions. Given a vector of real numbers $\vec{v} = v_0, \dots, v_{N-1}$, we calculate a sequence of DCT coefficients $c[0], \dots, c[N-1]$ as follows (Footnote 1: There are several variants of DCT. We use DCT type II Shao and Johnson (2008) in our implementation.):

$$c[0] = \sqrt{\frac{1}{N}} \sum_{n=0}^{N-1} v_n \tag{1}$$

and

$$c[k] = \sqrt{\frac{2}{N}} \sum_{n=0}^{N-1} v_n \cos\left[\frac{\pi}{N}\left(n + \frac{1}{2}\right) k\right] \tag{2}$$

for $1 \le k < N$. Note that $c[0]$ is the sum of the input sequence normalized by the square root of the length, which is proportional to the average of the sequence. The $N$ coefficients can be used to reconstruct the original sequence exactly using the inverse transform. In practice, DCT is used for compression by preserving only the coefficients with large magnitudes. Lower-order coefficients represent lower signal frequencies, which correspond to the overall patterns in the sequence Ahmed et al. (1974).
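The transform and its inverse can be sketched in a few lines. This is a minimal illustration that assumes the orthonormal DCT type II normalization (a $1/\sqrt{N}$ factor on the zeroth coefficient and $\sqrt{2/N}$ on the others); the function names `dct2` and `idct2` are hypothetical and are not taken from the authors' code.

```python
import math


def dct2(values):
    """Orthonormal DCT type II of a list of real numbers (assumed normalization)."""
    n_len = len(values)
    coeffs = []
    for k in range(n_len):
        # The zeroth coefficient uses sqrt(1/N); all others use sqrt(2/N).
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n_len)
        total = sum(
            v * math.cos(math.pi / n_len * (n + 0.5) * k)
            for n, v in enumerate(values)
        )
        coeffs.append(scale * total)
    return coeffs


def idct2(coeffs):
    """Inverse of dct2: the transpose of the orthonormal DCT-II matrix."""
    n_len = len(coeffs)
    restored = []
    for n in range(n_len):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt((1.0 if k == 0 else 2.0) / n_len)
            total += scale * c * math.cos(math.pi / n_len * (n + 0.5) * k)
        restored.append(total)
    return restored


signal = [1.0, 2.0, 3.0, 4.0]
coeffs = dct2(signal)      # coeffs[0] is sum(signal) / sqrt(len(signal))
restored = idct2(coeffs)   # matches signal up to floating-point error
```

Because the assumed transform matrix is orthonormal, the inverse is simply its transpose, which is why `idct2` reuses the same scaled cosine terms.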
2.2 DCT Sentence Embeddings
We apply DCT on the word vectors along the length of the sentence. Given a sentence of $N$ words $w_1, \dots, w_N$, we stack the sequence of $d$-dimensional word embeddings in an $N \times d$ matrix, then apply DCT along the rows, that is, down each column over the $N$ word positions. In other words, each feature in the vector space is compressed independently, and the resultant DCT embeddings summarize the feature patterns along the word sequence. To get a fixed-length and consistent sentence vector, we extract and concatenate the first $K$ DCT coefficients and discard higher-order coefficients, which results in sentence vectors of size $Kd$. For cases where $N < K$, we pad the sentence with $K - N$ zero vectors. (Footnote 2: Evaluation script is available at https://github.com/N-Almarwani/DCT_Sentence_Embedding) In image compression, the magnitude of the coefficients tends to decrease with increasing $k$, but we did not observe this trend in text data, except that $c[0]$ tends to have a larger absolute value than the remaining coefficients. Nonetheless, by retaining lower-order coefficients we get a consistent representation of overall feature patterns in the word sequence.
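The embedding procedure described above can be sketched as follows. This is an illustrative reading of the text, not the authors' released script: it assumes the orthonormal DCT type II, pads short sentences with zero vectors before transforming, and uses hypothetical names (`dct_coefficient`, `dct_sentence_embedding`, `num_coeffs`).

```python
import math


def dct_coefficient(column, k):
    """k-th orthonormal DCT-II coefficient of one feature column."""
    n_len = len(column)
    scale = math.sqrt((1.0 if k == 0 else 2.0) / n_len)
    return scale * sum(
        x * math.cos(math.pi / n_len * (n + 0.5) * k)
        for n, x in enumerate(column)
    )


def dct_sentence_embedding(word_vectors, num_coeffs):
    """Concatenate the first num_coeffs DCT coefficients of every feature.

    word_vectors: list of N word embeddings, each a list of d floats.
    Returns a flat list of length num_coeffs * d.
    """
    if not word_vectors:
        raise ValueError("sentence must contain at least one word vector")
    dim = len(word_vectors[0])
    rows = [list(vec) for vec in word_vectors]
    # Pad with zero vectors when the sentence is shorter than num_coeffs.
    while len(rows) < num_coeffs:
        rows.append([0.0] * dim)
    embedding = []
    for k in range(num_coeffs):
        for j in range(dim):
            column = [row[j] for row in rows]  # one feature across all words
            embedding.append(dct_coefficient(column, k))
    return embedding


# Toy 2-dimensional "embeddings" (made-up numbers for illustration).
sentence = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
embedding = dct_sentence_embedding(sentence, num_coeffs=2)   # length 2 * 2
short = dct_sentence_embedding([[1.0, 2.0]], num_coeffs=3)   # padded to 3 rows
```

The output is laid out coefficient by coefficient, so the first `d` entries are the zeroth coefficients of all features, followed by the first-order coefficients, and so on.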
Figure 1 illustrates the properties of DCT embeddings compared to vector averaging (AVG). Notice that the first DCT coefficients, $c[0]$, result in vectors that are independent of word order since the lowest frequency represents the average energy in the sequence. In this sense, $c[0]$ is similar to AVG, where “dog bites man” and “man bites dog” have identical embeddings. The second-order coefficients, $c[1]$, on the other hand, are sensitive to word order, which results in different representations for the above sentence pair. The counterexample “man bitten by dog” shows that $c[1]$ embeddings are most sensitive to the overall patterns (in this case, “man … dog”), which results in an embedding more similar to “man bites dog” than to the semantically similar “dog bites man”. However, there are still some variations in the final embeddings from the different word components (‘bitten’ vs. ‘bites’), which can potentially be useful in downstream tasks. Since both DCT and AVG are unparameterized, the downstream classifiers can incorporate a hidden layer to learn these subtle variations in higher-order features depending on the learning objective.
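The order-invariance of the zeroth coefficient and the order-sensitivity of the first-order coefficient can be checked on a toy example. The 2-dimensional vectors for “dog”, “bites”, and “man” below are made up for illustration, the orthonormal DCT type II is assumed, and the helper names are hypothetical.

```python
import math


def dct_coefficient(values, k):
    """k-th orthonormal DCT-II coefficient of a sequence of floats."""
    n_len = len(values)
    scale = math.sqrt((1.0 if k == 0 else 2.0) / n_len)
    return scale * sum(
        x * math.cos(math.pi / n_len * (n + 0.5) * k)
        for n, x in enumerate(values)
    )


def coefficient_vector(sentence, k):
    """k-th coefficient of every feature, computed across the word positions."""
    dim = len(sentence[0])
    return [dct_coefficient([word[j] for word in sentence], k) for j in range(dim)]


# Made-up toy embeddings; real word vectors would have hundreds of dimensions.
dog, bites, man = [1.0, 0.0], [0.5, 0.5], [0.0, 1.0]
dog_bites_man = [dog, bites, man]
man_bites_dog = [man, bites, dog]

c0_a = coefficient_vector(dog_bites_man, 0)
c0_b = coefficient_vector(man_bites_dog, 0)   # same as c0_a: order is ignored
c1_a = coefficient_vector(dog_bites_man, 1)
c1_b = coefficient_vector(man_bites_dog, 1)   # differs from c1_a: order matters
```

In this toy case the two sentences are exact reversals of each other, so the first-order coefficients come out as negatives of one another while the zeroth coefficients coincide.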
2.3 A Note on Complexity
The cosine terms in Equation 2 can be pre-calculated for efficiency. For a maximum sentence length and a given $K$, the set of terms is fixed and can be tabulated once, independently of the input. The run-time complexity is equivalent to calculating $K$ weighted averages, which is proportional to $K$, where $K$ should be set to a small constant relative to the expected length. (Footnote 3: We experimented with several values of $K$.) Note also that the number of input parameters in downstream classification models will increase linearly with $K$. With parallel implementations, however, the difference in run-time complexity between AVG and DCT is practically negligible.
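One way to realize the pre-calculation is to cache a table of scaled cosine weights per sentence length, so that each coefficient becomes a plain weighted sum. The sketch below assumes the orthonormal DCT type II; the names `dct_weights` and `weighted_sums` are hypothetical, and caching by sentence length is a design choice of this example rather than something stated in the text.

```python
import math
from functools import lru_cache


@lru_cache(maxsize=None)
def dct_weights(sent_len, num_coeffs):
    """Table of scaled cosine terms: one row of sent_len weights per coefficient."""
    table = []
    for k in range(num_coeffs):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / sent_len)
        table.append(tuple(
            scale * math.cos(math.pi / sent_len * (n + 0.5) * k)
            for n in range(sent_len)
        ))
    return tuple(table)


def weighted_sums(feature_values, num_coeffs):
    """First num_coeffs DCT coefficients of one feature, using the cached table."""
    weights = dct_weights(len(feature_values), num_coeffs)
    return [sum(w * x for w, x in zip(row, feature_values)) for row in weights]


first = weighted_sums([1.0, 2.0, 3.0, 4.0], num_coeffs=2)
second = weighted_sums([4.0, 3.0, 2.0, 1.0], num_coeffs=2)  # reuses the cached table
```

The same table serves every feature of every sentence with that length, so the per-sentence work reduces to `num_coeffs` weighted sums per feature.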
3 Experiments and Results
3.1 Evaluation Framework
We use the SentEval toolkit Conneau and Kiela (2018) (Footnote 4: https://github.com/facebookresearch/SentEval) to evaluate the sentence representations on different probing as well as downstream classification tasks. The probing benchmark set was designed to analyze the quality of sentence embeddings. It contains a set of 10 classification tasks, summarized in Table 1, that address a variety of linguistic properties including surface, syntactic, and semantic information Conneau et al. (2018). The downstream set, on the other hand, includes the following standard classification tasks: binary and fine-grained sentiment classification (MR, SST2, SST5) Pang and Lee (2004); Socher et al. (2013), product reviews (CR) Hu and Liu (2004), opinion polarity (MPQA) Wiebe et al. (2005), question type classification (TREC) Voorhees and Tice (2000), natural language inference (SICK-E) Marelli et al. (2014), semantic relatedness (SICK-R, STSB) Marelli et al. (2014); Cer et al. (2017), paraphrase detection (MRPC) Dolan et al. (2004), and subjectivity/objectivity (SUBJ) Pang and Lee (2004).
|WC||Word Content analysis|
|TreeDepth||Tree depth prediction|
|TopConst||Top Constituents prediction|
|BShift||Word order analysis|
|Tense||Verb tense prediction|
|SubjNum||Subject number prediction|
|ObjNum||Object number prediction|
|SOMO||Semantic odd man out|
Table 2: Probing task performance of vector averaging (AVG) and max pooling (MAX) vs. DCT embeddings with various $K$. Majority (baseline), Human (human-bound), and Length (a linear classifier with sentence length as sole feature) are as reported in Conneau et al. (2018).
3.2 Experimental setup
For the word embeddings, we use pre-trained FastText embeddings of size 300 Mikolov et al. (2018) trained on Common-Crawl. We generate DCT sentence vectors by concatenating the first $K$ DCT coefficients, which we denote by $c[0{:}K]$. We compare the performance against vector averaging of the same word embeddings, denoted by AVG, and vector max pooling, denoted by MAX. (Footnote 5: To compare with other sentence embedding models, refer to the results in Conneau and Kiela (2018) and Conneau et al. (2018).)
For all tasks, we trained multi-layer perceptron (MLP) classifiers following the setup in SentEval. We tuned the following hyper-parameters on the validation sets: number of hidden states (in [0, 50, 100, 200, 512]) and dropout rate (in [0, 0.1, 0.2]). Note that the case with 0 hidden states corresponds to a Logistic Regression classifier.
3.3 Results & Discussion
We report the performance in probing tasks in Table 2. In general, DCT yields better performance compared to averaging on all tasks, and larger $K$ often yields improved performance in syntactic and semantic tasks. For the surface information tasks, SentLen and Word Content (WC), $c[0]$ significantly outperforms AVG. This is attributed to the non-linear scaling factor in DCT, where longer sentences are not discounted as much as in averaging. Performance on these tasks decreased with increasing $K$ in $c[0{:}K]$, which reflects the trade-off between deep and surface linguistic properties, as discussed in Conneau et al. (2018).
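The scaling argument can be made explicit with a short derivation. It assumes the orthonormal DCT type II normalization, in which the zeroth coefficient carries a $1/\sqrt{N}$ factor for a sequence $v_0, \dots, v_{N-1}$ of length $N$; under that assumption the zeroth coefficient is the plain average rescaled by $\sqrt{N}$:

```latex
c[0] \;=\; \frac{1}{\sqrt{N}} \sum_{n=0}^{N-1} v_n
     \;=\; \sqrt{N} \cdot \frac{1}{N} \sum_{n=0}^{N-1} v_n
     \;=\; \sqrt{N} \, \mathrm{AVG}(v)
```

Under this reading, two sentences with the same average feature value but lengths $N$ and $4N$ have identical AVG vectors, whereas their zeroth DCT coefficients differ by a factor of 2, so some length information survives in the embedding.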
While increasing $K$ has no positive effect on surface information tasks, syntactic and semantic tasks demonstrate performance gains with larger $K$. This trend is clearly observed in all syntactic tasks and three of the semantic tasks, where DCT performs well above AVG and the performance improves with increasing $K$. The only exception is SOMO, where increasing $K$ actually results in lower performance, although all DCT results are still higher than AVG by about 1% to 34%.
The correlation between the performance in probing tasks and the standard text classification tasks is discussed in Conneau et al. (2018), where they show that most tasks are only positively correlated with a small subset of semantic or syntactic features, with the exception of TREC and some sentiment classification benchmarks. Furthermore, some tasks like SST and SICK-R are actually negatively correlated with the performance in some probing tasks like SubjNum, ObjNum, and BShift. This explains why simple averaging often outperforms more complex models in these tasks. Our results in Table 3 are consistent with these observations: we see improvements in most tasks, but the difference is not as significant as in the probing tasks, except in TREC question classification, where increasing $K$ leads to much better performance. As discussed in Aldarmaki and Diab (2018), the ability to preserve word order leads to improved performance in TREC, which is exactly the advantage of using DCT instead of AVG. Note also that increasing $K$, while preserving more information, increases the number of model parameters, which in turn may hurt the generalization of the model through overfitting. In our experiments, yielded the best trade-off.
3.4 Comparison with Related Methods
Spectral analysis is frequently employed in signal processing to decompose a signal into separate frequency components, each revealing some information about the source signal, to enable analysis and compression. To the best of our knowledge, spectral methods have only recently been exploited to construct sentence embeddings Kayal and Tsatsaronis (2019). (Footnote 6: Independent from and in parallel with this work.)
Kayal and Tsatsaronis propose EigenSent, which uses Higher-Order Dynamic Mode Decomposition (HODMD) Le Clainche and Vega (2017) to construct sentence embeddings. These embeddings summarize the dynamic properties of the sentence. In their work, they compare EigenSent with various sentence embedding models, including a different implementation of the Discrete Cosine Transform (DCT*). In contrast to our implementation described in Section 2.2, DCT* is applied at the word level along the word embedding dimension.
For a fair comparison, we use the same sentiment and text classification datasets, SST-5, 20 newsgroups (20-NG), and Reuters-8 (R-8), as those used in Kayal and Tsatsaronis (2019). We also evaluate using the same pre-trained word embeddings, framework, and approaches as described in their work. Table 4 shows the best results for the various models as reported in Kayal and Tsatsaronis (2019), in addition to the best performance of our model. (Footnote 7: The best results were achieved with k=3 for SST-5 and k=2 for 20-NG and R-8.)
Note that the DCT-based model, DCT*, described in Kayal and Tsatsaronis (2019) performed relatively poorly in all tasks, while our model achieved close to state-of-the-art performance in both the 20-NG and R-8 tasks. Our model outperformed EigenSent on all tasks and generally performed better than or on par with p-means, ELMo, BERT, and EigenSentAvg on both 20-NG and R-8. On the other hand, both EigenSentAvg and ELMo performed better than all other models on SST-5.
We proposed using the Discrete Cosine Transform (DCT) as a mechanism to efficiently compress variable-length sentences into fixed-length vectors in a manner that preserves some of the structural characteristics of the original sentences. By applying DCT on each feature along the word embedding sequence, we efficiently encode the overall feature patterns as reflected in the low-order DCT coefficients. We showed that these DCT embeddings reflect average semantic features, as in vector averaging but with a more suitable normalization, in addition to syntactic features like word order. Experiments using the SentEval suite showed that DCT embeddings outperform the commonly-used vector averaging on most tasks, particularly tasks that correlate with sentence structure and word order. Without compromising practical efficiency relative to averaging, DCT provides a suitable mechanism to represent both the average of the features and their overall syntactic patterns.
- Ahmed et al. (1974). Discrete cosine transform. IEEE Transactions on Computers 100 (1), pp. 90–93.
- Aldarmaki and Diab (2018). Evaluation of unsupervised compositional representations. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 2666–2677.
- Cer et al. (2017). SemEval-2017 task 1: semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 1–14.
- Conneau et al. (2017). Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 670–680.
- Conneau and Kiela (2018). SentEval: an evaluation toolkit for universal sentence representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018).
- Conneau et al. (2018). What you can cram into a single $&!#* vector: probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2126–2136.
- Devlin et al. (2018). BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Dolan et al. (2004). Unsupervised construction of large paraphrase corpora: exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, pp. 350.
- Hill et al. (2016). Learning distributed representations of sentences from unlabelled data. In Proceedings of NAACL-HLT, pp. 1367–1377.
- Hu and Liu (2004). Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 168–177.
- Huang and Zhao (2000). A DCT-based fast signal subspace technique for robust speech recognition. IEEE Transactions on Speech and Audio Processing 8 (6), pp. 747–751.
- Kayal and Tsatsaronis (2019). EigenSent: spectral sentence embeddings using higher-order dynamic mode decomposition. In Proceedings of the 57th Conference of the Association for Computational Linguistics, pp. 4536–4546.
- Kiros et al. (2015). Skip-thought vectors. In Advances in Neural Information Processing Systems, pp. 3294–3302.
- Le Clainche and Vega (2017). Higher order dynamic mode decomposition. SIAM Journal on Applied Dynamical Systems 16 (2), pp. 882–925.
- Marelli et al. (2014). SemEval-2014 task 1: evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pp. 1–8.
- Mikolov et al. (2018). Advances in pre-training distributed word representations. In Proceedings of the 11th Language Resources and Evaluation Conference, Miyazaki, Japan.
- Mikolov et al. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119.
- Milajevs et al. (2014). Evaluating neural word representations in tensor-based compositional settings. Conference on Empirical Methods in Natural Language Processing (EMNLP).
- Pang and Lee (2004). A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, pp. 271.
- Pennington et al. (2014). GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
- Peters et al. (2018). Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 2227–2237.
- Shao and Johnson (2008). Type-II/III DCT/DST algorithms with reduced number of arithmetic operations. Signal Processing 88 (6), pp. 1553–1564.
- Socher et al. (2013). Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631–1642.
- Voorhees and Tice (2000). Building a question answering test collection. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 200–207.
- Watson (1994). Image compression using the discrete cosine transform. Mathematica Journal 4 (1), pp. 81.
- Wiebe et al. (2005). Annotating expressions of opinions and emotions in language. Language Resources and Evaluation 39 (2-3), pp. 165–210.
- Wieting et al. (2015). Towards universal paraphrastic sentence embeddings. CoRR abs/1511.08198.