## I Introduction

Contemporary machine learning methods increasingly rely on the quality of latent representations, produced, e.g., during the training of deep neural network models or by *dimensionality reduction techniques*. A common denominator of many models used in practice throughout science and industry is a rather arbitrary selection of the embedding dimension; it appears widely accepted that a sufficiently high embedding dimension (e.g., 256 or 768) is preferred. However, in many practical scenarios, such as the development of embedded systems or mobile and online learning, model compactness is the desired property [joulin2016fasttextzip, luo2017thinet]. Research has targeted, e.g., neural network architectures suitable for mobile deployment [howard2017mobilenets] and, similarly, the distillation of existing large neural language models [sanh2019distilbert]. Although successful at their tasks, such research endeavors emphasize the model itself rather than the actual properties of the obtained representations. In recent years, the learned representations have been actively studied, offering insights into how existing representations can be compressed for many practical applications, including image [damahe2019review] and online text classification [acharya2019online].

Finally, as manual annotations are potentially expensive, the domain of self-supervised learning explores to what extent a neural network-based system can learn relevant representations without any supervision [mao2020survey]. The purpose of this paper is to explore, in the domain of document representation learning, to what extent an existing representation can be efficiently compressed with as little performance loss as possible, in a *self-supervised* manner, i.e., without human annotations.

The contributions of this work are:
i) We propose CoRe, an embedding-agnostic methodology for automatic recursive compression of latent document representations.
CoRe can construct up to two orders of magnitude (100×) smaller representations whilst maintaining the performance within a few percentage points, offering a drastic reduction of memory consumption for down-stream learning.
ii) The proposed methodology is evaluated on 17 real-life data sets, where we explore to what extent *contextual* and *non-contextual* document representations can be *compressed* with multiple linear and non-linear dimensionality reduction methods, offering novel insights into the compressibility of different types of document representations.
iii) We demonstrate that a very efficient SVD-based recursive compression can offer better-than-initial performance resulting from lower-dimensional representations.

The remainder of this work is structured as follows. In Section II we discuss the related work. In Section III, the proposed CoRe methodology is presented, followed by the description of the considered compression algorithms (Section IV), the empirical evaluation setting (Section V) and the experimental results (Section VI). We conclude with a discussion in Section VII. The project’s repository is accessible here.

## II Related Work

The proposed work builds on the ideas from the fields of representation learning and model compression. The notion of representation learning has been considered throughout different sub-domains of machine learning and has received increasing interest in the last years. Neural network-based embedding learning is becoming the prevailing method for obtaining representations (embeddings) of graphs or their nodes [zhang2018network], images [zerdoumi2018image] and documents [kesiraju2020learning, reimers-2020-multilingual-sentence-bert].
In recent years, two main types of document embeddings emerged, namely the ones based on contextual neural language models [NIPS2017_3f5ee243, Peters:2018] and the non-contextual ones based on earlier word representation learning techniques [le2014distributed].
Commonly used *ad hoc* dimensionalities of the representations (e.g., 768, 256, and 128) yield satisfactory results; however, they are rarely inspected rigorously, possibly due to the high computational cost of manually inspecting a model’s behavior with respect to this hyperparameter.

Exploration of how language models can be compressed has been an active effort for more than a decade [talbot-brants-2008-randomized].
As a result, the development of methods that automatically explore a given representation’s properties and attempt to further reduce it is an ongoing research endeavor [acharya2019online, choi2020universal]. With the advent of neural language models, *distillation* re-emerged as a form of model compression [sun2020contrastive]. Similarly, the pruning of existing models was also shown to be a viable alternative [liu2018efficient, zhu2017prune]. By selecting the representative subword space, very memory-efficient models can be obtained [zhao2019extreme]. Similarly, high compressibility of neural language models can be obtained via low-rank approximations [acharya2019online, chen2018groupreduce]. Recently, attempts to exploit external knowledge to compress transformer-based models such as BERT [devlin-etal-2019-bert] were also investigated [sun2019patient].
Finally, the idea of *self-supervised* learning has recently been explored in the context of language models, albeit originating in the image domain [liu2020self]. For example, ALBERT [lan2019albert] exploits inter-sentence coherence to obtain better and more compact language models.
Although many compression-related ideas have been proposed and have already offered model improvements, to our knowledge no systematic evaluation of different representation compression algorithms has yet been conducted. A further rationale for this paper is that it is not clear whether recursive compression is a better option than a direct projection to a lower dimension, which we explore systematically in this work.

## III Compression of Representations (CoRe)

The proposed approach (schematically shown in Figure 1) investigates to what extent a given latent representation can be compressed while maintaining functionality similar to that of the initially obtained representation. This work builds on the idea of *lossy compression*: the obtained, more compressed representations can be of lower quality, as long as that quality is sufficient for a given *practical application*. Although many compression benchmarks exist, they normally measure the reconstruction error of the original input, e.g., a data set. However, as the purpose of *learned* representations is to enable association between input data and, e.g., a collection of targets (such as genres), the quality of the compressed representations is estimated via extrinsic evaluation with the logistic regression classifier.

### III-A Recursive autoencoding

Next, we formally describe the incremental reduction of a given representation’s dimension, the key idea of CoRe. Let $E_d$ represent a $d$-dimensional representation of a set of documents (a corpus). Let $p_d$ be the performance of the classifier learned from $d$-dimensional embeddings of the documents. The purpose of this paper is to explore how $p_d$ changes with $d$ and to find a $d'$-dimensional representation, such that $p_{d'}$ is at most a margin $m$ worse (but can be better) than the initial performance $p_d$. We further define a function $\textrm{EMB}: \mathbb{R}^{d} \to \mathbb{R}^{d'}$. For example, $\textrm{EMB}_{512 \to 256}$ projects the space from 512 to 256 dimensions. The key idea of this paper explores the following recurrence relation: $E_{d/a} = \textrm{EMB}(E_d)$. The parameter $a$ denotes the dimension reduction factor; the larger the $a$, the more the space is reduced at each step of the optimization. After $k$ steps, CoRe creates the compressed representation $E_{d'}$, where $d' = d / a^{k}$. Here, on each call of the EMB function, we learn how to reduce the dimension of the output of the previous call, starting from the initial data frame and proceeding towards lower dimensions.

The key idea revolves around *recursive construction* of incrementally smaller representations. Notice that up to this point, we did not discuss the properties of EMB – the proposed procedure is *agnostic* to the embedding algorithm.
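The recursive construction can be illustrated with a minimal sketch. It assumes a plain truncated SVD written in NumPy as a stand-in for the embedding function; the names `svd_emb` and `core_recursive`, the reduction factor `a=2` and the random input matrix are illustrative assumptions, not the project's actual implementation.

```python
import numpy as np


def svd_emb(X, target_dim):
    """Stand-in EMB: project X (n_docs x d) onto its top right singular vectors."""
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:target_dim].T


def core_recursive(X, a=2, k=3, emb=svd_emb):
    """Apply `emb` k times, dividing the dimension by `a` at every step.

    Returns all intermediate representations, starting with the input.
    """
    reps = [X]
    for _ in range(k):
        current_dim = reps[-1].shape[1]
        target_dim = max(1, current_dim // a)
        # Each call compresses the output of the previous call.
        reps.append(emb(reps[-1], target_dim))
    return reps


# Hypothetical data: 100 documents with 64-dimensional embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 64))

reps = core_recursive(X, a=2, k=3)
dims = [r.shape[1] for r in reps]
print(dims)  # [64, 32, 16, 8]

# Direct (non-recursive) projection to the same final dimension.
direct = svd_emb(X, 8)
```

Any function with the same signature can be passed as `emb`, which mirrors the embedding-agnostic design. With an exact SVD, the recursive and direct projections coincide up to column signs; they can differ once a randomized solver or a non-linear `emb` is used.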

### III-B Time and Space Complexity

Let $d$ denote the initial representation’s dimension. Next, let $k$ be the number of recursive steps. Then, the computational complexity is $\mathcal{O}\big(\sum_{i=0}^{k-1} e \cdot n \cdot \frac{d}{a^{i}} \cdot \frac{h}{a^{i}}\big)$, which can be simplified to $\mathcal{O}(e \cdot n \cdot d \cdot h)$, provided that $a$ is lower-bounded by some constant, e.g., $a \geq 2$. Here, $d$ and $h$ correspond to the dimensions of the input and hidden layer in a single-layer architecture, $e$ to the number of epochs and $n$ to the number of documents. Potentially interesting is also the complexity w.r.t. $a$, i.e., the reduction term. By assuming $d$ is a power of $a$ (commonly the case, e.g., $d = 256 = 2^{8}$) and using $h = d/a$, the complexity can also be expressed as $\mathcal{O}(e \cdot n \cdot d^{2} / a)$, indicating that recursive reduction is not much more expensive than its first step, which already requires $e \cdot n \cdot d^{2} / a$ steps. Indeed, the exact value of the above sum is $e \cdot n \cdot \frac{d^{2}}{a} \cdot \frac{1 - a^{-2k}}{1 - a^{-2}}$, thus the proportion of the computations done in the first step of the iteration is approximately $1 - a^{-2}$, which is at least $3/4$ if $a \geq 2$. In terms of space complexity, the complexity is bounded as $\mathcal{O}(n \cdot d + d \cdot h)$. The implementation, however, offers an option to hold all intermediary compression steps (and produce them), which results in additional space overhead; this was not problematic in the considered experiments.

## IV Representation compression algorithms

In this section, we discuss the representation compression techniques considered in this work, with the aim of systematically exploring the space of low-dimensional document representations. The considered methods are summarised in Table I.

Compression approach | Description |
---|---|
UMAP [McInnes2018] | Non-linear compression based on manifold theory |
Sparse random projections [li2006very] | Johnson-Lindenstrauss lemma-informed projections |
Truncated SVD [halko2010finding] | Singular value decomposition |
Cluster-aggregation (mean, median, max) | Clustering of the dimensions into $d'$ groups (the target dimension) followed by aggregation |
Neural autoencoder - small/large | Neural autoencoders of various complexities |
Random subspaces (inspired by [ho1998nearest, Breskvar2018]) | Random subspace of dimension $d'$ |

Widely adopted methods such as UMAP [McInnes2018] were shown to perform competitively with most learning-based approaches such as t-SNE [tsne]; hence t-SNE and similar approaches are not included in this work. As contributions of this work, we implemented the following approaches, whose performance we believe offers additional insights into the representations’ properties. The *Neural-small* and *Neural-large* are two differently sized autoencoder architectures. A single layer example can be stated as

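$$\textrm{AE}(X) = W_2 \cdot \textrm{SoftSign}(\textrm{BN}(W_1 X + b_1)) + b_2 \qquad (1)$$

where $X$ is a dense representation of documents, and $W_1, W_2$ and $b_1, b_2$ are the weights and biases of the two layers. The SoftSign activation is defined as $\textrm{SoftSign}(x) = \frac{x}{1 + |x|}$. The BN (BatchNorm) is defined as $\textrm{BN}(x) = \frac{x - \mathbb{E}[x]}{\sqrt{\textrm{Var}[x] + \epsilon}} \cdot \gamma + \beta$, where $\gamma$ and $\beta$ are the learned scale and shift parameters. The $\epsilon$ is a small constant required for numeric stability. The goal of AE is to learn the association $X \mapsto X$, i.e., $\textrm{AE}(X) \approx X$. To obtain a low-dimensional representation, the forward pass is considered only up to the first hidden layer, i.e., $W_1 X + b_1$.

Note how no activations are employed, ensuring non-activated representations.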
The weight updates are considered as follows. The main hypothesis of this work explores whether incremental reconstruction of latent spaces of lower dimension indeed preserves the performance. As such, the autoencoder attempts (at each step) to overfit the representation, and is hence optimized until the loss is $\approx 0$. Note that being able to reconstruct a given input data set with zero error can be related to *lossless compression*; thus, CoRe effectively explores whether incremental steps of theoretically lossless compression yield a useful, low-dimensional (lossy) representation.
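The two building blocks named above, SoftSign and batch normalization, follow their standard definitions and can be sketched in NumPy as shown below. This is a minimal illustration only: the helper `encode`, the weight names `W` and `b`, the value `eps=1e-5`, and the fixed scale and shift (`gamma=1`, `beta=0`, which would be learned in a trained network) are assumptions made for the example.

```python
import numpy as np


def softsign(x):
    """SoftSign activation: x / (1 + |x|), bounded in (-1, 1)."""
    return x / (1.0 + np.abs(x))


def batch_norm(x, eps=1e-5, gamma=1.0, beta=0.0):
    """Batch normalization using the statistics of the current batch."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta


def encode(X, W, b):
    """Non-activated bottleneck: forward pass up to the first hidden layer."""
    return X @ W + b


# Hypothetical batch: 5 documents, 4 input dimensions, 2 hidden dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
W = rng.normal(size=(4, 2))
b = np.zeros(2)

Z = encode(X, W, b)          # low-dimensional representation
normalized = batch_norm(Z)   # what the next part of the network would see
activated = softsign(normalized)
```

Only `Z` would be kept as the compressed representation; the normalization and activation affect the reconstruction path during training.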

Next, we implemented a variant of the random subspaces algorithm, which can be summarised in the following two simple steps. First, randomly select $d'$ dimensions (the target dimension) from the initial representation. Second, create a subspace based only on these dimensions and perform normalization across samples. This approach serves as a very simple, naïve baseline.
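The two steps of this baseline can be sketched as follows. The function name `random_subspace` is hypothetical, and the sketch assumes that "normalization across samples" means standardizing each selected dimension to zero mean and unit variance; the original implementation may normalize differently.

```python
import numpy as np


def random_subspace(X, target_dim, seed=0):
    """Keep `target_dim` randomly chosen columns of X and standardize them."""
    rng = np.random.default_rng(seed)
    # Step 1: sample the dimensions without replacement.
    cols = np.sort(rng.choice(X.shape[1], size=target_dim, replace=False))
    sub = X[:, cols]
    # Step 2 (assumed form): z-score each kept dimension across samples.
    return (sub - sub.mean(axis=0)) / (sub.std(axis=0) + 1e-12)


# Hypothetical data: 100 documents with 64-dimensional embeddings.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 64))
sub = random_subspace(X, target_dim=16)
```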

The third branch of approaches we implemented was aimed at exploring whether the incremental aggregation of detected structures can serve as a simple-to-use compression technique. Here, the two-step procedure operates as follows. First, the KMeans++ [kmeans] algorithm is used on the transposed initial matrix to partition the $d$ dimensions into $d'$ clusters. For each detected cluster, an aggregation function is applied. We considered max, mean, and median-based aggregations. Intuitively, this procedure should maintain the key parts of the feature space that are of relevance to maintaining the initial structure. The compression can thus be summarised as:

$$E_{d'} = \textrm{AGG}\big(\textrm{KMeans++}(E_d^{\top}, d')\big), \quad \textrm{AGG} \in \{\max, \textrm{mean}, \textrm{median}\}$$

This approach explores whether the macro structure of the embeddings offers sufficient expressive power. Note that the recursive version of this algorithm incrementally reduces the feature space (instead of using the initial representation $E_d$ at each step, it uses the representation produced by the previous step). Finally, we included the sparse random projections algorithm as a baseline [li2006very].
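A minimal sketch of the cluster-aggregation procedure described above, assuming scikit-learn's `KMeans` with `init="k-means++"` as the clustering step. The function name `cluster_aggregate`, the `n_init` value and the random input are illustrative assumptions rather than the authors' code.

```python
import numpy as np
from sklearn.cluster import KMeans

AGGREGATIONS = {"mean": np.mean, "median": np.median, "max": np.max}


def cluster_aggregate(X, target_dim, agg="mean", seed=0):
    """Cluster the columns of X into `target_dim` groups and aggregate each group."""
    km = KMeans(n_clusters=target_dim, init="k-means++", n_init=10, random_state=seed)
    # Rows of X.T are the original dimensions, so each dimension gets a cluster label.
    labels = km.fit_predict(X.T)
    aggregate = AGGREGATIONS[agg]
    # One output column per cluster: aggregate its member dimensions per document.
    columns = [aggregate(X[:, labels == c], axis=1) for c in range(target_dim)]
    return np.column_stack(columns)


# Hypothetical data: 100 documents with 64-dimensional embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 64))
compressed = cluster_aggregate(X, target_dim=8, agg="max")
```

For the recursive variant, the output of one call would simply be fed into the next call with a smaller `target_dim`.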

Data set | Documents | | | Classes | Topic |
---|---|---|---|---|---|
bbc [greene06icml] | 1406 | 49418 | 343 | 4 | news topics |
subjects | 1786 | 104972 | 227 | 4 | text topics |
CNN-news | 2107 | 144077 | 605 | 7 | news topics |
pan-2017 [rangel2017overview] | 3599 | 575670 | 1478 | 2 | gender detection |
insults | 3946 | 29133 | 288 | 2 | insult detection |
questions [li2002learning] | 5452 | 13278 | 37 | 6 | question categories |
SMSSpam [almeida2011contributions] | 5571 | 17744 | 88 | 2 | SMS messages spam detection |
MedRelation | 8361 | 16796 | 99 | 2 | medical relations detection |
AAAI2021 [akhtar2021overview] | 8473 | 41387 | 149 | 2 | fake news detection |
mbti | 8675 | 313038 | 419 | 16 | personality type detection (from text) |
yelp | 10000 | 76482 | 592 | 5 | review classification |
hatespeech [gibert2018hate] | 10557 | 29982 | 195 | 4 | white supremacists hate speech detection |
semeval-2019 [zampierietal2019] | 13240 | 39714 | 155 | 2 | offensive speech prediction |
articles | 19990 | 285144 | 10692 | 20 | news classification |
sarcasm [misra2019sarcasm] | 28619 | 50241 | 166 | 2 | sarcasm detection |
authors | 53678 | 29997 | 1008 | 45 | Victorian era author detection |
drugs-condition [grasser2018aspect] | 70406 | 96879 | 435 | 150 | drug side effects classification |

## V Empirical evaluation

The experiments aimed to explore the compressibility of DistilBERT [sanh2020distilbert] and doc2vec [le2014distributed] representations with respect to both the considered compression level and the compression (autoencoding) algorithm used.

For each compression algorithm (and each data set) we conducted stratified three-fold cross-validation. All experiments were repeated three times, which was possible due to the utilization of the SLING supercomputing infrastructure. Similarly, we report the performance for each compression step. Statistical significance is assessed via critical distance diagrams [demvsar2006statistical] which compare average ranks achieved with a given method. We conducted the compression experiments on 17 different real-life data sets, shown in Table II.

### V-A Evaluation of representation quality

We are interested in a given compressed representation’s relation to the performance of the initial representation. Hence, we introduce $\Delta$, a measure which is computed as:

$$\Delta = F_1(\textrm{compressed}) - F_1(\textrm{initial}).$$

If $\Delta > 0$, the compressed representation is *better* by $\Delta$ when compared to the original (initial) representation’s performance. If $\Delta < 0$, the original representation is better. For multi-class problems, we considered the micro $F_1$ score. Note that the scores are reported in the $[0, 1]$ range and not as percentages. The classifier used in this work is Logistic Regression with regularization parameter C set to 1 [scikit-learn]. The default dimension of representations was set to 768, as commonly adopted in the literature. Each computational job had at most 2 hours to finish on a 16GB (RAM) 8 core virtual machine with no GPU.
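The evaluation described above, i.e., the difference between the micro F1 score of the compressed and of the initial representation, can be sketched with scikit-learn as follows. The helper names `micro_f1` and `score_difference`, the `max_iter` value, the shuffling seed and the synthetic data are assumptions for illustration; only the classifier (Logistic Regression with C = 1) and the stratified three-fold cross-validation come from the text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score


def micro_f1(X, y, seed=0):
    """Mean micro F1 of logistic regression over stratified 3-fold CV."""
    clf = LogisticRegression(C=1.0, max_iter=1000)
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
    return cross_val_score(clf, X, y, cv=cv, scoring="f1_micro").mean()


def score_difference(X_initial, X_compressed, y):
    """Positive values mean the compressed representation performs better."""
    return micro_f1(X_compressed, y) - micro_f1(X_initial, y)


# Hypothetical data: 90 documents, 32 dimensions, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(90, 32))
y = (X[:, 0] > 0).astype(int)

X_small = X[:, :8]  # stand-in for a compressed representation
base = micro_f1(X, y)
delta = score_difference(X, X_small, y)
```

Scores stay in the [0, 1] range, so the difference lies in [-1, 1] and can be compared directly against a margin such as 0.05.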

## VI Results

Due to space limitations, we comment on all representations, but only show the graphs for the non-contextual ones (doc2vec). Figure 2 shows the overall impact of compression on the performance. Both recursive and non-recursive representations offer sufficient performance (above the red line marking the 5% margin) over multiple compression steps; however, recursive compression performs better and, in the case of doc2vec, even exceeds the performance of the original representation.

The results indicate that the first few compression steps can even be beneficial for the final representation’s quality. However, once very low-dimensional representations are considered, cluster-max-direct surprisingly yielded the highest-positioned curve, indicating good average performance. This observation only holds for non-contextual representations. For the contextual ones, the SVD-based approaches perform consistently well.

When comparing all considered approaches (their average performance across the data sets), the first conclusion is that contextual representations (BERT) outperform the non-contextual ones (doc2vec). The other main conclusions are i) that recursive SVD’s performance is relatively consistent throughout different levels of compression, ii) that random subspaces are not sufficient for preserving the structure (consistent last places), iii) that the main competitors to SVD are neural autoencoder-based approaches, which are more computationally expensive, albeit being able to perform non-linear decomposition, and iv) that recursive *linear* compression performs well across different compression levels.

We next present the critical distance diagram [demvsar2006statistical] in Figure 3. The recursive application of SVD consistently offers the best results, regardless of the representation type. Interestingly, the clustering-based reductions were more effective when considering contextual representations (cluster-mean-direct). Overall, however, autoencoder-based representations similarly performed consistently better than e.g., the UMAP-based ones. The results in tabular form (for SVD-based compression) are given in Table III.

Compression-representation | Compr. steps | AAAI2021 | CNN-news | MedRelation | SMSSpam | articles | authors | bbc | drugs | hatespeech | insults | mbti | pan-2017 | questions | sarcasm | semeval-2019 | subjects | yelp |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SVD (BERT) | 1 | -0.0 | -0.004 | -0.003 | 0.001 | -0.018 | -0.028 | 0.0 | -0.024 | 0.013 | 0.002 | 0.003 | 0.004 | -0.023 | -0.019 | 0.006 | 0.0 | 0.022 |
SVD (BERT) | 2 | -0.007 | -0.021 | -0.036 | 0.001 | -0.03 | -0.064 | 0.003 | -0.055 | 0.009 | 0.013 | 0.012 | -0.003 | -0.051 | -0.039 | 0.013 | 0.0 | 0.025 |
SVD (BERT) | 3 | -0.017 | -0.04 | -0.09 | -0.001 | -0.049 | -0.111 | 0.0 | -0.108 | 0.012 | 0.005 | 0.025 | 0.018 | -0.112 | -0.073 | 0.004 | 0.002 | 0.02 |
SVD (BERT) | 4 | -0.038 | -0.08 | -0.106 | -0.009 | -0.093 | -0.166 | -0.009 | -0.175 | 0.011 | -0.009 | 0.02 | 0.011 | -0.16 | -0.108 | -0.017 | -0.013 | 0.019 |
SVD (BERT) | 5 | -0.054 | -0.148 | -0.134 | -0.025 | -0.195 | -0.212 | -0.02 | -0.276 | 0.004 | -0.01 | 0.016 | -0.011 | -0.21 | -0.13 | -0.021 | -0.018 | 0.006 |
SVD (BERT) | 6 | -0.07 | -0.234 | -0.141 | -0.026 | -0.323 | -0.242 | -0.057 | -0.358 | -0.006 | -0.02 | 0.021 | 0.003 | -0.256 | -0.163 | -0.04 | -0.027 | -0.002 |
SVD (BERT) | 7 | -0.099 | -0.324 | -0.138 | -0.04 | -0.458 | -0.281 | -0.119 | -0.459 | -0.008 | -0.02 | 0.009 | -0.052 | -0.335 | -0.24 | -0.057 | -0.069 | -0.023 |
SVD (BERT) | 8 | -0.201 | -0.535 | -0.158 | -0.058 | -0.531 | -0.298 | -0.216 | -0.513 | -0.007 | -0.031 | 0.008 | -0.077 | -0.425 | -0.28 | -0.073 | -0.11 | -0.052 |
SVD (BERT) | 9 | -0.207 | -0.575 | -0.322 | -0.073 | -0.59 | -0.298 | -0.42 | -0.516 | -0.006 | -0.043 | 0.006 | -0.099 | -0.407 | -0.303 | -0.084 | -0.195 | -0.085 |
SVD (doc2vec) | 1 | 0.006 | 0.014 | 0.014 | 0.001 | -0.012 | -0.012 | 0.001 | -0.011 | 0.002 | 0.002 | 0.054 | 0.016 | -0.001 | -0.001 | 0.002 | -0.005 | -0.001 |
SVD (doc2vec) | 2 | 0.006 | 0.015 | 0.014 | 0.001 | -0.0 | -0.019 | 0.0 | -0.039 | 0.002 | 0.002 | 0.106 | 0.033 | -0.001 | -0.001 | 0.001 | -0.005 | -0.003 |
SVD (doc2vec) | 3 | 0.003 | 0.036 | 0.009 | 0.001 | -0.005 | -0.02 | -0.006 | -0.088 | 0.002 | 0.002 | 0.112 | 0.043 | -0.001 | -0.006 | -0.007 | -0.008 | 0.004 |
SVD (doc2vec) | 4 | -0.008 | 0.062 | -0.003 | 0.001 | -0.019 | -0.049 | -0.007 | -0.128 | 0.001 | 0.001 | 0.094 | 0.039 | -0.001 | -0.019 | -0.028 | -0.008 | 0.001 |
SVD (doc2vec) | 5 | -0.014 | 0.058 | -0.024 | 0.001 | -0.051 | -0.166 | -0.003 | -0.162 | 0.002 | -0.009 | 0.039 | 0.02 | -0.006 | -0.033 | -0.039 | -0.005 | -0.004 |
SVD (doc2vec) | 6 | -0.027 | 0.054 | -0.075 | -0.006 | -0.149 | -0.341 | -0.0 | -0.198 | -0.001 | -0.013 | -0.118 | 0.032 | -0.022 | -0.042 | -0.04 | -0.026 | -0.024 |
SVD (doc2vec) | 7 | -0.032 | 0.006 | -0.147 | -0.015 | -0.257 | -0.518 | -0.005 | -0.256 | 0.001 | -0.02 | -0.174 | -0.004 | -0.064 | -0.123 | -0.041 | -0.036 | -0.053 |
SVD (doc2vec) | 8 | -0.034 | -0.182 | -0.145 | -0.033 | -0.428 | -0.657 | -0.066 | -0.317 | 0.004 | -0.018 | -0.24 | -0.185 | -0.127 | -0.164 | -0.041 | -0.085 | -0.117 |
SVD (doc2vec) | 9 | -0.037 | -0.351 | -0.149 | -0.069 | -0.531 | -0.756 | -0.284 | -0.362 | 0.004 | -0.021 | -0.255 | -0.18 | -0.208 | -0.207 | -0.041 | -0.09 | -0.122 |
SVD-dir (BERT) | 1 | -0.0 | -0.004 | -0.003 | 0.001 | -0.018 | -0.028 | 0.0 | -0.023 | 0.013 | 0.002 | 0.003 | 0.004 | -0.022 | -0.019 | 0.006 | 0.0 | 0.022 |
SVD-dir (BERT) | 2 | -0.008 | -0.019 | -0.049 | 0.001 | -0.027 | -0.064 | 0.0 | -0.056 | 0.009 | 0.007 | 0.013 | -0.002 | -0.042 | -0.04 | 0.007 | 0.0 | 0.022 |
SVD-dir (BERT) | 3 | -0.019 | -0.042 | -0.081 | 0.0 | -0.048 | -0.11 | -0.003 | -0.106 | 0.009 | 0.004 | 0.031 | 0.018 | -0.108 | -0.074 | 0.002 | 0.002 | 0.016 |
SVD-dir (BERT) | 4 | -0.039 | -0.078 | -0.105 | -0.009 | -0.095 | -0.167 | -0.014 | -0.175 | 0.01 | -0.007 | 0.019 | 0.011 | -0.16 | -0.109 | -0.016 | -0.013 | 0.018 |
SVD-dir (BERT) | 5 | -0.056 | -0.146 | -0.134 | -0.027 | -0.192 | -0.212 | -0.02 | -0.276 | 0.004 | -0.009 | 0.017 | -0.011 | -0.213 | -0.131 | -0.021 | -0.018 | 0.003 |
SVD-dir (BERT) | 6 | -0.065 | -0.233 | -0.141 | -0.026 | -0.323 | -0.242 | -0.058 | -0.357 | -0.005 | -0.02 | 0.02 | -0.001 | -0.255 | -0.162 | -0.039 | -0.027 | -0.004 |
SVD-dir (BERT) | 7 | -0.099 | -0.323 | -0.138 | -0.04 | -0.458 | -0.281 | -0.119 | -0.459 | -0.008 | -0.02 | 0.009 | -0.054 | -0.333 | -0.24 | -0.057 | -0.069 | -0.024 |
SVD-dir (BERT) | 8 | -0.201 | -0.535 | -0.158 | -0.058 | -0.531 | -0.299 | -0.216 | -0.513 | -0.007 | -0.031 | 0.008 | -0.078 | -0.424 | -0.279 | -0.073 | -0.11 | -0.053 |
SVD-dir (BERT) | 9 | -0.207 | -0.575 | -0.322 | -0.073 | -0.59 | -0.298 | -0.42 | -0.516 | -0.006 | -0.043 | 0.006 | -0.099 | -0.407 | -0.303 | -0.084 | -0.195 | -0.086 |
SVD-dir (doc2vec) | 1 | 0.007 | -0.012 | -0.008 | 0.003 | -0.005 | -0.016 | 0.004 | -0.009 | 0.001 | 0.0 | 0.069 | 0.03 | 0.001 | -0.003 | -0.003 | 0.0 | 0.002 |
SVD-dir (doc2vec) | 2 | 0.006 | -0.011 | -0.008 | 0.003 | 0.007 | -0.023 | 0.003 | -0.036 | 0.001 | 0.0 | 0.125 | 0.052 | 0.0 | -0.004 | -0.003 | 0.0 | 0.002 |
SVD-dir (doc2vec) | 3 | 0.004 | 0.008 | -0.014 | 0.003 | 0.004 | -0.025 | -0.004 | -0.086 | 0.0 | 0.0 | 0.132 | 0.059 | 0.001 | -0.008 | -0.011 | -0.002 | 0.007 |
SVD-dir (doc2vec) | 4 | -0.007 | 0.037 | -0.022 | 0.003 | -0.011 | -0.054 | -0.0 | -0.126 | -0.001 | -0.002 | 0.112 | 0.056 | 0.0 | -0.02 | -0.032 | -0.003 | 0.003 |
SVD-dir (doc2vec) | 5 | -0.016 | 0.033 | -0.048 | 0.003 | -0.041 | -0.169 | 0.004 | -0.158 | 0.001 | -0.009 | 0.059 | 0.038 | -0.003 | -0.034 | -0.043 | 0.004 | -0.004 |
SVD-dir (doc2vec) | 6 | -0.028 | 0.03 | -0.098 | -0.003 | -0.143 | -0.345 | -0.003 | -0.201 | -0.002 | -0.015 | -0.101 | 0.05 | -0.021 | -0.045 | -0.045 | -0.022 | -0.022 |
SVD-dir (doc2vec) | 7 | -0.033 | -0.013 | -0.158 | -0.012 | -0.248 | -0.522 | -0.004 | -0.256 | 0.001 | -0.02 | -0.157 | 0.013 | -0.06 | -0.12 | -0.044 | -0.033 | -0.05 |
SVD-dir (doc2vec) | 8 | -0.033 | -0.206 | -0.162 | -0.031 | -0.419 | -0.661 | -0.062 | -0.315 | 0.004 | -0.019 | -0.219 | -0.166 | -0.124 | -0.166 | -0.044 | -0.076 | -0.111 |
SVD-dir (doc2vec) | 9 | -0.038 | -0.372 | -0.167 | -0.065 | -0.524 | -0.76 | -0.279 | -0.36 | 0.004 | -0.022 | -0.236 | -0.163 | -0.208 | -0.211 | -0.044 | -0.079 | -0.115 |


The tabular results are in alignment with Figure 3. We observed both drastic performance reductions that set in relatively quickly (e.g., the *authors* data set) and very gradual declines (*hatespeech*). Interestingly, the *drugs-condition* data set, comprised of very long documents, was very hard to compress, always resulting in a performance below that of the initial representation.

As an ablation study, we visualized the *bbc* data set projected to 2D from different higher dimensions (compression levels) in Figure 4.

The observation that, in some data sets, very low dimensions preserve the initial class structure when compressed recursively with CoRe indicates that many real-life problems are potentially over-represented in terms of the commonly considered dimension, and could be analysed via much more compact (and efficient) representations.

## VII Discussion and Conclusions

In Section VI, we demonstrated that both contextual and non-contextual representations can be compressed by a factor of up to 100x (e.g., on the *bbc* data set), whilst maintaining the representation’s classification performance. The observed compressibility varied mostly with respect to the number of documents considered for learning. In this work, we also observed net positive score differences, i.e., compressed representations that outperformed the initial ones (Table III).
In terms of the compression performance, we observed that linear recursive compression (SVD) performed surprisingly well. Overall, the non-contextual representations (doc2vec-based) could be more easily compressed; in some cases (Figure 2), only tens of dimensions were needed to retain a net positive score difference. In contrast, contextual representations are, according to our experiments, harder to compress. Overall, however, tens of dimensions were identified by some of the best-performing methods as enough to remain within the 5% margin, which could already be of practical relevance. An example where lower-dimensional representations could be practical and speed up the overall learning process is AutoML systems.

As the purpose of this work was to explore whether a technique requiring minimal or zero hyperparameter tuning can already improve the performance, we believe that, according to the current results, recursive SVD-based compression is the most suitable one, even though UMAP and neural methods have the potential to perform even better (at the cost of additional hyperparameter tuning).

Further work includes the investigation of a larger collection of neural language model-based representations to confirm the current results. Finally, we identified as an open problem the *margin identification*, i.e., how to efficiently determine the target dimension that still satisfies the margin without computing all intermediary compressions.
Identifying it upfront could offer automation of the compression and, when combined with recursive SVD, which requires no hyperparameter tuning, provide a powerful *post-hoc* method for optimizing representation-based learners.

## Acknowledgment

We thank Tomáš Mikolov for his invaluable comments. The work was supported by the Slovenian Research Agency (young researcher grant (BŠ), grant P2-0103), and by the European Commission (grants 825153, N2-0078, 952215, 825619).
