Compressibility of Distributed Document Representations

by Blaž Škrlj, et al.
Jozef Stefan Institute

Contemporary natural language processing (NLP) revolves around learning from latent document representations, generated either implicitly by neural language models or explicitly by methods such as doc2vec. One of the key properties of the obtained representations is their dimension. Whilst the commonly adopted dimensions of 256 and 768 offer sufficient performance on many tasks, it is often unclear whether the default dimension is the most suitable choice for the subsequent downstream learning tasks. Furthermore, representation dimensions are seldom subject to hyperparameter tuning due to computational constraints. The purpose of this paper is to demonstrate that a surprisingly simple and efficient recursive compression procedure can be sufficient to both significantly compress the initial representation and potentially improve its performance on the task of text classification. Smaller and less noisy representations are desirable during deployment, as orders of magnitude smaller models can significantly reduce the computational overhead and, with it, the deployment costs. We propose CoRe, a straightforward, representation learner-agnostic framework suitable for representation compression. CoRe's performance is showcased and studied on a collection of 17 real-life corpora from biomedical, news, social media, and literary domains. We explored CoRe's behavior when considering contextual and non-contextual document representations, different compression levels, and 9 different compression algorithms. Current results based on more than 100,000 compression experiments indicate that recursive Singular Value Decomposition offers a very good trade-off between the compression efficiency and performance, making CoRe useful in many existing, representation-dependent NLP pipelines.




I Introduction

Contemporary machine learning methods increasingly rely on the quality of latent representations, produced during, e.g., training of deep neural network models or dimensionality reduction techniques. The common denominator of many models used in practice throughout science and industry is a rather arbitrary selection of the embedding dimension: it appears widely accepted that a sufficiently high embedding dimension is preferred (e.g., 256 or 768). However, in many practical scenarios, such as the development of embedded systems, mobile and online learning, model compactness is the desired property [joulin2016fasttextzip, luo2017thinet]. There has been research targeted at finding sufficient neural network architectures for, e.g., mobile deployment [howard2017mobilenets], and, similarly, at the distillation of existing large neural language models [sanh2019distilbert]. Albeit succeeding at their tasks, such research endeavors do not emphasize the actual properties of the obtained representations, but rather the model itself. In recent years, the learned representations have been actively studied, offering insights into how existing representations can be compressed for many practical applications, including image [damahe2019review] and online text classification [acharya2019online].

Finally, as manual annotations are potentially expensive, the domain of self-supervised learning explores to what extent a neural network-based system can learn relevant representations without any supervision [mao2020survey]. The purpose of this paper is to explore, in the domain of document representation learning, to what extent an existing representation can be efficiently compressed with as little performance loss as possible, in a self-supervised manner, i.e., without human annotations.

The contributions of this work are: i) We propose CoRe, an embedding-agnostic methodology for automatic recursive compression of latent document representations. CoRe can construct up to two orders of magnitude (100x) smaller representations whilst maintaining the performance within a few percentage points, offering a drastic reduction of memory consumption for downstream learning. ii) The proposed methodology is evaluated on 17 real-life data sets, where we explore to what extent contextual and non-contextual document representations can be compressed with multiple linear and non-linear dimensionality reduction methods, offering novel insights into the compressibility of different types of document representations. iii) We demonstrate that a very efficient SVD-based recursive compression can yield lower-dimensional representations with better-than-initial performance.

The remainder of this work is structured as follows. In Section II we discuss the related work. In Section III, the proposed CoRe methodology is presented, followed by the description of the considered compression algorithms (Section IV), the empirical evaluation setting (Section V) and the experimental results (Section VI). We conclude with discussion and conclusions in Section VII. The project’s repository is accessible here.

II Related Work

The proposed work builds on the ideas from the fields of representation learning and model compression. The notion of representation learning has been considered throughout different sub-domains of machine learning and has received increasing interest in recent years. Neural network-based embedding learning is becoming the prevailing method for obtaining representations (embeddings) of graphs or their nodes [zhang2018network], images [zerdoumi2018image] and documents [kesiraju2020learning, reimers-2020-multilingual-sentence-bert]. In recent years, two main types of document embeddings emerged, namely the ones based on contextual neural language models [NIPS2017_3f5ee243, Peters:2018] and the non-contextual ones based on earlier word representation learning techniques [le2014distributed]. Commonly used ad hoc dimensionalities of the representations (e.g., 768, 256, and 128) yield satisfactory results; however, they are commonly not rigorously inspected, possibly due to the high computational cost of manually inspecting a model’s behavior with respect to this hyperparameter.

Exploration of how language models can be compressed has been an active effort for more than a decade [talbot-brants-2008-randomized]. As a result, the development of methods that automatically explore a given representation’s properties and attempt to further reduce it is an ongoing research endeavor [acharya2019online, choi2020universal]. With the advent of neural language models, distillation re-emerged as a form of model compression [sun2020contrastive]. Similarly, the pruning of existing models was also shown to be a viable alternative [liu2018efficient, zhu2017prune]. By selecting the representative subword space, very memory-efficient models can be obtained [zhao2019extreme]. High compressibility of neural language models can also be obtained via low-rank approximations [acharya2019online, chen2018groupreduce]. Recently, attempts to exploit external knowledge to compress transformer-based models such as BERT [devlin-etal-2019-bert] were also investigated [sun2019patient]. Finally, the idea of self-supervised learning has recently been explored in the context of language models, albeit originating in the image domain [liu2020self]. For example, ALBERT [lan2019albert] exploits inter-sentence coherence to obtain better and more compact language models. Although many compression-related ideas have been proposed and have already offered model improvements, to our knowledge no systematic evaluation of different representation compression algorithms has yet been conducted. Furthermore, another rationale for this paper is that it is not clear whether recursive compression is a better option than a direct projection to a lower dimension, which we explored systematically in this work.

III Compression of Representations (CoRe)

Fig. 1: Schematic overview of recursive autoencoding, as explored in this work. The initial representation is recursively compressed to a compact, yet expressive representation. This example shows only two compression steps (for readability purposes).

The proposed approach (schematically shown in Figure 1) investigates to what extent a given latent representation can be compressed while maintaining functionality similar to that of the initially obtained representation. This work builds on the idea of lossy compression: the obtained, more compressed representations can be of lower quality, as long as that quality is sufficient for a given practical application. Although there exist many compression benchmarks, they normally measure the reconstruction error of the original input (e.g., a data set). However, as the purpose of learned representations is to enable association between input data and, e.g., a collection of targets (e.g., genres), the quality of the compressed representations is estimated via extrinsic evaluation with the logistic regression classifier.

III-A Recursive autoencoding

Next, we formally describe the incremental reduction of a given representation’s dimension, which is the key idea of CoRe. Let $\mathbf{E}_d$ represent a $d$-dimensional representation of a set of documents (a corpus). Let $P_d$ be the performance of the classifier learned from $d$-dimensional embeddings of the documents. The purpose of this paper is to explore how $P_d$ changes with $d$ and to find a $d'$-dimensional representation, such that $P_{d'}$ is at most $\epsilon$ worse (but can be better) than the initial performance $P_d$. We further define a function $\mathrm{EMB}_{d_1 \rightarrow d_2}$, which maps a $d_1$-dimensional representation to a $d_2$-dimensional one. For example, $\mathrm{EMB}_{512 \rightarrow 256}$ projects the space from 512 to 256 dimensions. The key idea of this paper explores the following recurrence relation: $\mathbf{E}_{d/k} = \mathrm{EMB}_{d \rightarrow d/k}(\mathbf{E}_d)$. The parameter $k$ denotes the dimension reduction factor; the larger the $k$, the more the space is reduced at each step of the optimization. After $s$ steps, CoRe creates the compressed representation $\mathbf{E}_{d'}$, where $d' = d / k^s$. Here, on each call of the EMB function, we learn how to reduce the dimension of the output of the previous call, starting from the initial data frame and proceeding towards lower dimensions.

The key idea revolves around recursive construction of incrementally smaller representations. Notice that up to this point, we did not discuss the properties of EMB – the proposed procedure is agnostic to the embedding algorithm.
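The recursive construction described above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the names `svd_reduce` and `recursive_compress`, the default halving factor and the random input matrix are assumptions made for the example, and an exact NumPy SVD stands in for whichever reduction function is plugged in.

```python
import numpy as np


def svd_reduce(X, target_dim):
    """Project X onto its top `target_dim` singular directions (exact truncated SVD)."""
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    # X @ V_k equals U_k * S_k, so the right singular vectors are not needed.
    return U[:, :target_dim] * S[:target_dim]


def recursive_compress(X, k=2, steps=3, reduce_fn=svd_reduce):
    """Repeatedly shrink the representation by a factor of `k`.

    Each call of `reduce_fn` receives the output of the previous call,
    mirroring the recurrence described in the text. All intermediate
    representations are returned, starting with the input itself.
    """
    current = np.asarray(X, dtype=float)
    history = [current]
    for _ in range(steps):
        target_dim = current.shape[1] // k
        if target_dim < 1:  # nothing left to compress
            break
        current = reduce_fn(current, target_dim)
        history.append(current)
    return history


# Hypothetical input: 100 documents with 64-dimensional embeddings.
X = np.random.default_rng(0).normal(size=(100, 64))
levels = recursive_compress(X, k=2, steps=3)
print([level.shape[1] for level in levels])  # [64, 32, 16, 8]
```

Because `reduce_fn` is a plain argument, any other reducer with the same `(X, target_dim)` signature can be swapped in, which reflects the algorithm-agnostic nature of the procedure.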

III-B Time and Space Complexity

Let denote the initial representation’s dimension. Next, let be the number of recursive steps. Then, the computational complexity is , which can be simplified to , provided that is lower-bounded by some constant, e.g., . Here, the and correspond to the dimensions of the input and hidden layer in a single-layer architecture, to the number of epochs and to the number of documents. Potentially interesting is also the complexity w.r.t. , i.e., the reduction term. By assuming is a power of (commonly the case, e.g., ), the complexity can also be expressed as , indicating that recursive reduction is not much more expensive than its first step, which already requires steps. Indeed, the exact value of the above sum is , thus the proportion of the computations done in the first step of the iteration is approximately which is always greater than if . In terms of space complexity, the complexity is bounded as . The implementation, however, offers an option to hold (and produce) all intermediary compression steps, which results in additional space overhead; in the considered experiments, this overhead was not problematic.
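The claim that the first step dominates the total cost can be illustrated with a short worked derivation. It is a sketch under an assumption that is not recoverable from the text above: the cost of step $i$ is taken to be proportional to the product of its input and output dimensions. The symbols $d$ (initial dimension), $k$ (reduction factor), $s$ (number of steps), $e$ (epochs) and $n$ (documents) are chosen here for illustration only.

```latex
\begin{align}
C_i &= e\,n\,d_i\,d_{i+1} = \frac{e\,n\,d^2}{k^{2i+1}},
      \qquad d_i = \frac{d}{k^i},\\
C   &= \sum_{i=0}^{s-1} C_i
     = \frac{e\,n\,d^2}{k}\sum_{i=0}^{s-1} k^{-2i}
     = \frac{e\,n\,d^2}{k}\cdot\frac{1-k^{-2s}}{1-k^{-2}},\\
\frac{C_0}{C} &= \frac{1-k^{-2}}{1-k^{-2s}}
     \;\ge\; 1-\frac{1}{k^{2}}
     \;\ge\; \frac{3}{4} \quad \text{for } k \ge 2 .
\end{align}
```

Under this assumed cost model, the first step accounts for at least three quarters of the total work when the dimension is at least halved at each step, so the whole recursion costs only a constant factor more than its first step, independently of $s$.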

IV Representation compression algorithms

In the following section we discuss the representation compression techniques we considered in this work intending to systematically explore the space of low-dimensional document representations. The considered methods are summarised in Table I.

| Compression approach | Description |
|---|---|
| UMAP [McInnes2018] | Non-linear compression based on manifold theory |
| Sparse random projections [li2006very] | Johnson-Lindenstrauss lemma-informed projections |
| Truncated SVD [halko2010finding] | Singular value decomposition |
| Cluster-aggregation (mean, median, max) | Clustering of the dimensions followed by per-cluster aggregation |
| Neural autoencoder - small/large | Neural autoencoders of various complexities |
| Random subspaces (inspired by [ho1998nearest, Breskvar2018]) | Random subspace of the target dimension |

TABLE I: Overview of the considered compression algorithms. If no reference is given, the algorithm was built for this work.

Widely adopted methods such as UMAP [McInnes2018] were shown to perform competitively with most learning-based approaches such as t-SNE [tsne]; hence, t-SNE and similar approaches are not included in this work. As contributions of this work, we implemented the following approaches, whose performance we believe offers additional insights into the representations’ properties. Neural-small and Neural-large are two differently sized autoencoder architectures. A single layer example can be stated as


where is a dense representation of documents. The SoftSign activation is defined as $\mathrm{SoftSign}(x) = \frac{x}{1 + |x|}$. The BN (BatchNorm) is defined as $\mathrm{BN}(x) = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}}$, where $\epsilon$ is a small constant required for numeric stability. The goal of AE is to learn the association , . To obtain a low-dimensional representation, the forward pass is considered only up to the first hidden layer, i.e.,

Note how no activations are employed, ensuring non-activated representations. The weight updates are considered as follows. The main hypothesis of this work explores whether incremental reconstruction of latent spaces of lower dimension indeed preserves the performance. As such, the autoencoder attempts (at each step) to overfit the representation, and is hence optimized until the loss approaches zero. Note that being able to reconstruct a given input data set with zero error can be related to lossless compression; thus, CoRe effectively explores whether incremental steps of theoretically lossless compression yield a useful, low-dimensional (lossy) representation.
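The forward pass of a single-hidden-layer autoencoder of this kind can be sketched as follows. This is an illustration under stated assumptions rather than the architecture used in the experiments: the class name `TinyAutoencoder`, the ordering of SoftSign and BatchNorm, the weight initialisation and the layer sizes are all hypothetical, and training (the optimisation of the reconstruction loss) is deliberately omitted.

```python
import numpy as np


def softsign(x):
    """SoftSign activation: x / (1 + |x|)."""
    return x / (1.0 + np.abs(x))


def batch_norm(x, eps=1e-5):
    """Normalise each feature across the batch (no learnable scale or shift)."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


class TinyAutoencoder:
    """Hypothetical single-hidden-layer autoencoder (forward pass only)."""

    def __init__(self, d_in, d_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.1, size=(d_in, d_hidden))
        self.b1 = np.zeros(d_hidden)
        self.W2 = rng.normal(scale=0.1, size=(d_hidden, d_in))
        self.b2 = np.zeros(d_in)

    def encode(self, X):
        # The compressed representation is the non-activated hidden layer.
        return X @ self.W1 + self.b1

    def reconstruct(self, X):
        # Assumed ordering: activation, then batch normalisation, then decoding.
        hidden = batch_norm(softsign(self.encode(X)))
        return hidden @ self.W2 + self.b2


X = np.random.default_rng(1).normal(size=(32, 16))
ae = TinyAutoencoder(d_in=16, d_hidden=8)
Z = ae.encode(X)          # compressed, non-activated representation
X_hat = ae.reconstruct(X)  # reconstruction of the input
loss = float(np.mean((X - X_hat) ** 2))  # the quantity training would minimise
```

Only `encode` is needed once training has finished; `reconstruct` exists solely to define the reconstruction loss.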

Next, we implemented a variant of the random subspaces algorithm, which can be summarised in the following two simple steps. First, randomly select the target number of dimensions from the initial representation. Second, create a subspace based only on these dimensions and perform normalization across samples. This approach serves as a very simple (naïve) baseline.
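The two steps can be sketched as below. The function name `random_subspace` is hypothetical, and z-scoring is assumed as the normalization across samples, since the text does not specify which normalization was used.

```python
import numpy as np


def random_subspace(X, target_dim, seed=0):
    """Keep `target_dim` randomly chosen columns of X and normalise them."""
    rng = np.random.default_rng(seed)
    # Step 1: sample distinct column indices.
    cols = rng.choice(X.shape[1], size=target_dim, replace=False)
    sub = np.asarray(X, dtype=float)[:, cols]
    # Step 2: z-score each kept dimension across samples (assumed normalisation).
    mean = sub.mean(axis=0, keepdims=True)
    std = sub.std(axis=0, keepdims=True)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (sub - mean) / std, cols


X = np.random.default_rng(2).normal(size=(50, 32))
Z, kept = random_subspace(X, target_dim=8)
```

Returning the selected indices alongside the compressed matrix makes it possible to apply the same column selection to unseen documents.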

The third branch of approaches we implemented was aimed at exploring whether the incremental aggregation of detected structures can serve as a simple-to-use compression technique. Here, the two-step procedure operates as follows. First, the KMeans++ [kmeans] algorithm is used on the transposed initial matrix to partition the dimensions into the desired number of clusters. For each detected cluster, an aggregation function is applied. We considered max, mean, and median-based aggregations. Intuitively, this procedure should maintain the key parts of the feature space that are of relevance to maintaining the initial structure. The compression can thus be summarised as:

$\mathbf{E}'_{:,j} = \mathrm{agg}\left(\{\mathbf{E}_{:,i} : i \in C_j\}\right), \quad \mathrm{agg} \in \{\max, \mathrm{mean}, \mathrm{median}\},$

where $\mathbf{E}$ is the initial matrix, $\mathbf{E}'$ the compressed one, and $C_j$ the set of dimensions assigned to the $j$-th cluster.

This approach explores whether the macro structure of the embeddings offers sufficient expressive power. Note that the recursive version of this algorithm incrementally reduces the feature space (instead of using the initial representation at each step, it uses the previously compressed representation). Finally, we included the sparse random projections algorithm as a baseline [li2006very].
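The cluster-aggregation procedure can be sketched with a small self-contained k-means. This is an illustrative approximation rather than the implementation used in the experiments: the helper names are hypothetical, the k-means routine is a plain Lloyd iteration with k-means++ seeding written for this example, and empty clusters are simply dropped, so the output may have fewer columns than requested.

```python
import numpy as np


def kmeans_pp_init(points, n_clusters, rng):
    """k-means++ seeding: new centres are sampled proportionally to squared distance."""
    centers = [points[rng.integers(len(points))]]
    for _ in range(n_clusters - 1):
        diffs = points[:, None, :] - np.array(centers)[None, :, :]
        d2 = (diffs ** 2).sum(axis=-1).min(axis=1)
        total = d2.sum()
        probs = d2 / total if total > 0 else np.full(len(points), 1.0 / len(points))
        centers.append(points[rng.choice(len(points), p=probs)])
    return np.array(centers)


def kmeans_labels(points, n_clusters, n_iter=50, seed=0):
    """Assign each point to one of `n_clusters` clusters (Lloyd's algorithm)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    centers = kmeans_pp_init(points, n_clusters, rng)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        for j in range(n_clusters):
            members = points[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return labels


def cluster_aggregate(X, target_dim, agg=np.mean, seed=0):
    """Cluster the columns of X and aggregate each cluster into one new column."""
    X = np.asarray(X, dtype=float)
    labels = kmeans_labels(X.T, target_dim, seed=seed)  # columns are the points
    columns = [agg(X[:, labels == j], axis=1)
               for j in range(target_dim) if np.any(labels == j)]
    return np.stack(columns, axis=1)


X = np.random.default_rng(3).normal(size=(40, 32))
Z_mean = cluster_aggregate(X, target_dim=8, agg=np.mean)
Z_max = cluster_aggregate(X, target_dim=8, agg=np.max)
```

Passing `np.median` as `agg` gives the third aggregation variant mentioned in the text; for the recursive version, the output of one call is fed into the next call with a smaller `target_dim`.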

| Data set | Documents | Tokens | Max. document length | Unique labels | Topic |
|---|---|---|---|---|---|
| bbc [greene06icml] | 1406 | 49418 | 343 | 4 | news topics |
| subjects | 1786 | 104972 | 227 | 4 | text topics |
| CNN-news | 2107 | 144077 | 605 | 7 | news topics |
| pan-2017 [rangel2017overview] | 3599 | 575670 | 1478 | 2 | gender detection |
| insults | 3946 | 29133 | 288 | 2 | insult detection |
| questions [li2002learning] | 5452 | 13278 | 37 | 6 | question categories |
| SMSSpam [almeida2011contributions] | 5571 | 17744 | 88 | 2 | SMS messages spam detection |
| MedRelation | 8361 | 16796 | 99 | 2 | medical relations detection |
| AAAI2021 [akhtar2021overview] | 8473 | 41387 | 149 | 2 | fake news detection |
| mbti | 8675 | 313038 | 419 | 16 | personality type detection (from text) |
| yelp | 10000 | 76482 | 592 | 5 | review classification |
| hatespeech [gibert2018hate] | 10557 | 29982 | 195 | 4 | white supremacists hate speech detection |
| semeval-2019 [zampierietal2019] | 13240 | 39714 | 155 | 2 | offensive speech prediction |
| articles | 19990 | 285144 | 10692 | 20 | news classification |
| sarcasm [misra2019sarcasm] | 28619 | 50241 | 166 | 2 | sarcasm detection |
| authors | 53678 | 29997 | 1008 | 45 | Victorian era author detection |
| drugs-condition [grasser2018aspect] | 70406 | 96879 | 435 | 150 | drug side effects classification |

TABLE II: Data sets used: their topic and the numbers of documents, tokens, maximal document length, and unique labels.

V Empirical evaluation

The experiments were aimed to explore the compressibility of DistilBERT [sanh2020distilbert] and doc2vec [le2014distributed] representations concerning both the considered compression level, as well as the compression (autoencoding) algorithm used.

For each compression algorithm (and each data set) we conducted stratified three-fold cross-validation. All experiments were repeated three times, which was possible due to the utilization of the SLING supercomputing infrastructure. Similarly, we report the performance for each compression step. Statistical significance is assessed via critical distance diagrams [demvsar2006statistical] which compare average ranks achieved with a given method. We conducted the compression experiments on 17 different real-life data sets, shown in Table II.
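The average ranks that critical distance diagrams compare can be computed as sketched below. The function names and the toy score matrix are hypothetical, the Friedman test itself is not included, and the critical value `q_alpha` is passed in by the caller: the value 2.343 used in the example is the commonly tabulated Nemenyi value for three methods at a significance level of 0.05 and should be checked against a reference table before use.

```python
import numpy as np


def average_ranks(scores):
    """Mean rank of each method over data sets (rank 1 = best, higher score = better).

    `scores` has shape (n_datasets, n_methods). Tied scores share the mean
    of the ranks they span.
    """
    scores = np.asarray(scores, dtype=float)
    ranks = np.empty_like(scores)
    for i, row in enumerate(scores):
        order = np.argsort(-row, kind="stable")
        r = np.empty(len(row))
        r[order] = np.arange(1, len(row) + 1)
        for value in np.unique(row):
            tied = row == value
            r[tied] = r[tied].mean()
        ranks[i] = r
    return ranks.mean(axis=0)


def nemenyi_critical_distance(n_methods, n_datasets, q_alpha):
    """Critical distance: two methods differ if their average ranks differ by more."""
    return q_alpha * np.sqrt(n_methods * (n_methods + 1) / (6.0 * n_datasets))


# Hypothetical scores of three methods on three data sets.
scores = [[0.90, 0.80, 0.70],
          [0.85, 0.80, 0.60],
          [0.70, 0.70, 0.50]]
avg = average_ranks(scores)
cd = nemenyi_critical_distance(n_methods=3, n_datasets=3, q_alpha=2.343)
```

Methods whose average ranks differ by less than `cd` would be joined by a line in the diagram.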

Fig. 2: Overall compression performance for doc2vec-based initial representations and compression (autoencoding) algorithms. The x-axis denotes the number of compression steps (dimension halvings), and the y-axis the difference in (micro) F1 performance w.r.t. the initial representation. The method names with “-dir” correspond to direct projection into a given dimension (non-recursive compression).

V-A Evaluation of representation quality

We are interested in a given compressed representation’s relation to the performance of the initial representation. Hence, we introduce $\Delta$, a measure which is computed as $\Delta = F1_{\mathrm{compressed}} - F1_{\mathrm{initial}}$. If $\Delta > 0$, the compressed representation is better by $\Delta$ when compared to the original (initial) representation’s performance. If $\Delta < 0$, the original representation is better. For multi-class problems, we considered the micro F1 score. Note that the scores are reported as fractions and not as percentages. The classifier used in this work is Logistic Regression with regularization parameter C set to 1 [scikit-learn]. The default dimension of representations was set to 768, as commonly adopted in the literature. Each computational job had at most 2 hours to finish on a 16GB (RAM) 8 core virtual machine with no GPU.
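The comparison between a compressed and an initial representation can be sketched in pure Python. The function names and the toy label lists are hypothetical; the predictions stand in for the outputs of two classifiers, one trained on each representation.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: counts are pooled over all labels before computing F1."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label and truth != label:
                fp += 1
            elif pred != label and truth == label:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def performance_difference(y_true, pred_initial, pred_compressed):
    """Positive values mean the compressed representation performed better."""
    return micro_f1(y_true, pred_compressed) - micro_f1(y_true, pred_initial)


# Hypothetical predictions from classifiers trained on each representation.
y_true = [0, 1, 2, 2, 1, 0]
pred_initial = [0, 1, 2, 1, 1, 0]      # one mistake
pred_compressed = [0, 1, 2, 2, 1, 0]   # no mistakes
delta = performance_difference(y_true, pred_initial, pred_compressed)
```

For single-label multi-class problems such as these, micro F1 coincides with accuracy, which is why one wrong prediction out of six gives a score of 5/6.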

VI Results

Due to space limitations, we comment on all representations, but only show the graphs for the non-contextual ones (doc2vec). Figure 2 shows the overall impact of compression on the performance. Both recursive and non-recursive representations offer sufficient performance across multiple compression steps (above the red line marking the 5% margin); however, recursive compression performs better and, in the case of doc2vec, even exceeds the performance of the original representation.

Fig. 3: doc2vec results. The lines connect methods whose performance is statistically indistinguishable (p = 0.05; Friedman-Nemenyi).

The results indicate that the first few compression steps can even be beneficial for the final representation’s quality. However, once very low-dimensional representations are considered, cluster-max-direct surprisingly yielded the highest-positioned curve, indicating good average performance. This observation only holds for non-contextual representations. For the contextual ones, the SVD-based approaches perform consistently well.

When comparing all considered approaches (their average performance across the data sets), the first conclusion is that contextual representations (BERT) outperform the non-contextual ones (doc2vec). The other main conclusions are i) that recursive SVD’s performance is relatively consistent throughout different levels of compression, ii) that random subspaces are not sufficient for preserving the structure (consistently last places), iii) that the main competitors to SVD are neural autoencoder-based approaches, which are more computationally expensive, albeit being able to perform non-linear decomposition, and iv) that recursive linear representation performs well across different compression levels.

We next present the critical distance diagram [demvsar2006statistical] in Figure 3. The recursive application of SVD consistently offers the best results, regardless of the representation type. Interestingly, the clustering-based reductions were more effective when considering contextual representations (cluster-mean-direct). Overall, however, autoencoder-based representations similarly performed consistently better than e.g., the UMAP-based ones. The results in tabular form (for SVD-based compression) are given in Table III.

Compression (representation) Compr. steps AAAI2021 CNN-news MedRelation SMSSpam articles authors bbc drugs hatespeech insults mbti pan-2017 questions sarcasm semeval-2019 subjects yelp
SVD (BERT) 1 -0.0 -0.004 -0.003 0.001 -0.018 -0.028 0.0 -0.024 0.013 0.002 0.003 0.004 -0.023 -0.019 0.006 0.0 0.022
2 -0.007 -0.021 -0.036 0.001 -0.03 -0.064 0.003 -0.055 0.009 0.013 0.012 -0.003 -0.051 -0.039 0.013 0.0 0.025
3 -0.017 -0.04 -0.09 -0.001 -0.049 -0.111 0.0 -0.108 0.012 0.005 0.025 0.018 -0.112 -0.073 0.004 0.002 0.02
4 -0.038 -0.08 -0.106 -0.009 -0.093 -0.166 -0.009 -0.175 0.011 -0.009 0.02 0.011 -0.16 -0.108 -0.017 -0.013 0.019
5 -0.054 -0.148 -0.134 -0.025 -0.195 -0.212 -0.02 -0.276 0.004 -0.01 0.016 -0.011 -0.21 -0.13 -0.021 -0.018 0.006
6 -0.07 -0.234 -0.141 -0.026 -0.323 -0.242 -0.057 -0.358 -0.006 -0.02 0.021 0.003 -0.256 -0.163 -0.04 -0.027 -0.002
7 -0.099 -0.324 -0.138 -0.04 -0.458 -0.281 -0.119 -0.459 -0.008 -0.02 0.009 -0.052 -0.335 -0.24 -0.057 -0.069 -0.023
8 -0.201 -0.535 -0.158 -0.058 -0.531 -0.298 -0.216 -0.513 -0.007 -0.031 0.008 -0.077 -0.425 -0.28 -0.073 -0.11 -0.052
9 -0.207 -0.575 -0.322 -0.073 -0.59 -0.298 -0.42 -0.516 -0.006 -0.043 0.006 -0.099 -0.407 -0.303 -0.084 -0.195 -0.085
SVD (doc2vec) 1 0.006 0.014 0.014 0.001 -0.012 -0.012 0.001 -0.011 0.002 0.002 0.054 0.016 -0.001 -0.001 0.002 -0.005 -0.001
2 0.006 0.015 0.014 0.001 -0.0 -0.019 0.0 -0.039 0.002 0.002 0.106 0.033 -0.001 -0.001 0.001 -0.005 -0.003
3 0.003 0.036 0.009 0.001 -0.005 -0.02 -0.006 -0.088 0.002 0.002 0.112 0.043 -0.001 -0.006 -0.007 -0.008 0.004
4 -0.008 0.062 -0.003 0.001 -0.019 -0.049 -0.007 -0.128 0.001 0.001 0.094 0.039 -0.001 -0.019 -0.028 -0.008 0.001
5 -0.014 0.058 -0.024 0.001 -0.051 -0.166 -0.003 -0.162 0.002 -0.009 0.039 0.02 -0.006 -0.033 -0.039 -0.005 -0.004
6 -0.027 0.054 -0.075 -0.006 -0.149 -0.341 -0.0 -0.198 -0.001 -0.013 -0.118 0.032 -0.022 -0.042 -0.04 -0.026 -0.024
7 -0.032 0.006 -0.147 -0.015 -0.257 -0.518 -0.005 -0.256 0.001 -0.02 -0.174 -0.004 -0.064 -0.123 -0.041 -0.036 -0.053
8 -0.034 -0.182 -0.145 -0.033 -0.428 -0.657 -0.066 -0.317 0.004 -0.018 -0.24 -0.185 -0.127 -0.164 -0.041 -0.085 -0.117
9 -0.037 -0.351 -0.149 -0.069 -0.531 -0.756 -0.284 -0.362 0.004 -0.021 -0.255 -0.18 -0.208 -0.207 -0.041 -0.09 -0.122
SVD-dir (BERT) 1 -0.0 -0.004 -0.003 0.001 -0.018 -0.028 0.0 -0.023 0.013 0.002 0.003 0.004 -0.022 -0.019 0.006 0.0 0.022
2 -0.008 -0.019 -0.049 0.001 -0.027 -0.064 0.0 -0.056 0.009 0.007 0.013 -0.002 -0.042 -0.04 0.007 0.0 0.022
3 -0.019 -0.042 -0.081 0.0 -0.048 -0.11 -0.003 -0.106 0.009 0.004 0.031 0.018 -0.108 -0.074 0.002 0.002 0.016
4 -0.039 -0.078 -0.105 -0.009 -0.095 -0.167 -0.014 -0.175 0.01 -0.007 0.019 0.011 -0.16 -0.109 -0.016 -0.013 0.018
5 -0.056 -0.146 -0.134 -0.027 -0.192 -0.212 -0.02 -0.276 0.004 -0.009 0.017 -0.011 -0.213 -0.131 -0.021 -0.018 0.003
6 -0.065 -0.233 -0.141 -0.026 -0.323 -0.242 -0.058 -0.357 -0.005 -0.02 0.02 -0.001 -0.255 -0.162 -0.039 -0.027 -0.004
7 -0.099 -0.323 -0.138 -0.04 -0.458 -0.281 -0.119 -0.459 -0.008 -0.02 0.009 -0.054 -0.333 -0.24 -0.057 -0.069 -0.024
8 -0.201 -0.535 -0.158 -0.058 -0.531 -0.299 -0.216 -0.513 -0.007 -0.031 0.008 -0.078 -0.424 -0.279 -0.073 -0.11 -0.053
9 -0.207 -0.575 -0.322 -0.073 -0.59 -0.298 -0.42 -0.516 -0.006 -0.043 0.006 -0.099 -0.407 -0.303 -0.084 -0.195 -0.086
SVD-dir (doc2vec) 1 0.007 -0.012 -0.008 0.003 -0.005 -0.016 0.004 -0.009 0.001 0.0 0.069 0.03 0.001 -0.003 -0.003 0.0 0.002
2 0.006 -0.011 -0.008 0.003 0.007 -0.023 0.003 -0.036 0.001 0.0 0.125 0.052 0.0 -0.004 -0.003 0.0 0.002
3 0.004 0.008 -0.014 0.003 0.004 -0.025 -0.004 -0.086 0.0 0.0 0.132 0.059 0.001 -0.008 -0.011 -0.002 0.007
4 -0.007 0.037 -0.022 0.003 -0.011 -0.054 -0.0 -0.126 -0.001 -0.002 0.112 0.056 0.0 -0.02 -0.032 -0.003 0.003
5 -0.016 0.033 -0.048 0.003 -0.041 -0.169 0.004 -0.158 0.001 -0.009 0.059 0.038 -0.003 -0.034 -0.043 0.004 -0.004
6 -0.028 0.03 -0.098 -0.003 -0.143 -0.345 -0.003 -0.201 -0.002 -0.015 -0.101 0.05 -0.021 -0.045 -0.045 -0.022 -0.022
7 -0.033 -0.013 -0.158 -0.012 -0.248 -0.522 -0.004 -0.256 0.001 -0.02 -0.157 0.013 -0.06 -0.12 -0.044 -0.033 -0.05
8 -0.033 -0.206 -0.162 -0.031 -0.419 -0.661 -0.062 -0.315 0.004 -0.019 -0.219 -0.166 -0.124 -0.166 -0.044 -0.076 -0.111
9 -0.038 -0.372 -0.167 -0.065 -0.524 -0.76 -0.279 -0.36 0.004 -0.022 -0.236 -0.163 -0.208 -0.211 -0.044 -0.079 -0.115
TABLE III: SVD-based compression results (difference in performance w.r.t. the original representation) across the data sets and compression levels. Note how some of the data sets are easier to compress than others (compare, e.g., articles and hatespeech).

Tabular results are in alignment with Figure 3. We observed both drastic performance reductions occurring relatively quickly (e.g., the authors data set) and very gradual declines (hatespeech). Interestingly, the drugs-condition data set, comprised of very long documents, was very hard to compress, always resulting in performance below that of the initial representation.

As an ablation study, we visualized the bbc data set projected to 2D from different higher dimensions (compression levels) in Figure 4.

Fig. 4: Incremental reduction of dimension with recursive SVD and cluster visualization in 2D (bbc). For higher dimensions, the 2D visualizations were obtained with default UMAP parameters.

The observation that, in some data sets, very low dimensions preserve the initial class structure when compressed recursively with CoRe indicates that many real-life problems are potentially over-represented in terms of the commonly considered dimension, and could be analysed via much more compact (and efficient) representations.

VII Discussion and Conclusions

In Section VI, we demonstrated that both contextual and non-contextual representations can be compressed by a factor of up to 100x (e.g., on the bbc data set), whilst maintaining the representation’s classification performance. The observed compressibility varied mostly with respect to the number of documents considered for learning. In this work, we also observed cases where compression yielded a net positive change in performance (Table III). In terms of the compression performance, we observed that linear recursive compression (SVD) performed surprisingly well. Overall, the non-contextual representations (doc2vec-based) could be more easily compressed: in some cases (Figure 2), only tens of dimensions were needed to retain a net positive change. In contrast, contextual representations are, according to our experiments, harder to compress. Overall, however, tens of dimensions were identified by some of the best-performing methods as enough to remain within the considered performance margin, which could already be of practical relevance. An example where lower-dimensional representations could be practical and speed up the overall learning process are AutoML systems.

As the purpose of this work was to explore whether a technique requiring minimal or zero hyperparameter tuning can already improve the performance, we believe that, according to the current results, recursive SVD-based compression is the most suitable one, even though UMAP and neural methods have the potential to perform even better (at the cost of additional hyperparameter tuning).

Further work includes the investigation of a larger collection of neural language model-based representations to confirm the current results. Finally, we identified margin identification as an open problem, i.e., how to efficiently determine a suitable target dimension without computing all intermediary compressions. Identifying it upfront could enable automation of the compression and, when combined with recursive SVD, which required zero hyperparameter tuning, provide a powerful post-hoc method for optimizing representation-based learners.


We thank Tomáš Mikolov for his invaluable comments. The work was supported by the Slovenian Research Agency (young researcher grant (BŠ), grant P2-0103), and by the European Commission (grants 825153, N2-0078, 952215, 825619).