Square matrices appear in many machine learning problems and models. Optimization over a large square matrix is expensive in memory and in time. Therefore an economic approximation is needed. Conventional approximation approaches factorize the square matrix into a number matrices of much lower ranks. However, the low-rank constraint is a performance bottleneck if the approximated matrix is intrinsically high-rank or close to full rank. In this paper, we propose to approximate a large square matrix with a product of sparse full-rank matrices. In the approximation, our method needs only N(log N)^2 non-zero numbers for an N× N full matrix. We present both non-parametric and parametric ways to find the factorization. In the former, we learn the factorizing matrices directly, and in the latter, we train neural networks to map input data to the non-zero matrix entries. The sparse factorization method is tested for a variety of synthetic and real-world square matrices. The experimental results demonstrate that our method gives a better approximation when the approximated matrix is sparse and high-rank. Based on this finding, we use our parametric method as a scalable attention architecture that performs strongly in learning tasks for long sequential data and defeats Transformer and its several variants.READ FULL TEXT VIEW PDF
Many machine learning models include one or more square matrices, such as the kernel matrix in Support Vector Machines(Cortes and Vapnik, 1995)
or Gaussian Process, the affinity matrix in graph or network representation, and the attention matrix in transformers(Vaswani et al., 2017a). In many machine learning problems, the solution space is also over square matrices, for example, graph matching and network architecture optimization.
Obtaining a full matrix can cause infeasible storage and computational cost if is large. For example, a double-precision kernel matrix of the typical MNIST handwritten digit data set () requires about 36.5G memory. In another example, we need to calculate eight quintillion float numbers to fill a full attention matrix of a human DNA sequence with about 3.2 billion base pairs.
Therefore we need economical surrogates to approximate the full square matrices at a large scale. Matrix factorization is a widely used approach that approximately factorizes the square matrix to some cheaper matrices. In conventional matrix factorization, the factorizing matrices must be low-rank, for example, in Truncated Singular Value Decomposition and Nyström approximation(Williams and Seeger, 2001; Drineas and Mahoney, 2005). However, these conventional approaches do not work well if the square matrix in the approximation is not low-rank.
This paper proposes a novel approximation method called Sparse Factorization (SF), where the approximated square matrix is factorized into some full-rank sparse matrices. Using the Chord protocol (Stoica et al., 2001) to specify the non-zero entry positions222Non-zero entries are those stored, including both non-zeros and explicit zeros., we can match an full matrix with factorizing matrices, where each contains non-zero entries and thus in total non-zeros. Therefore the approximation is economical when is large.
We first study non-parametric learning of the approximation. That is, we directly minimize the approximation loss over the sparse factorizing matrices. The new method was tested on a variety of synthetic and real-world square matrices, with comparison to the most accurate low-rank approximation method, TSVD. We find that SF is especially suitable for approximating sparse square matrices with unknown non-zero positions, with a clear win over TSVD.
The Chord protocol also enables us to parameterize the mapping from input data to the non-zero entries in a matrix row with neural networks. This parametric form of SF thus provides a new attention architecture. We find that only one block of SF attention defeats Transformer and its several state-of-the-art variants on very long synthetic sequence classification and on the public Long Range Arena benchmark tasks.
The remaining of the paper is organized as follows. We first briefly review the matrix approximation based on low-rank factorization in Section 2. In Section 3, we present the machine learning formulation of sparse factorization of square matrices. The parametric formulation using neural networks is proposed in Section 4. In Section 5, we present the experimental settings and results. Conclusion and future work are given in Section 6.
The conventional matrix approximation is to factorize the large square matrix into a few low-rank matrices, for example or with , , and . Nyström decomposition (Williams and Seeger, 2001) is of the tri-factor kind, where and are calculated using normal kernel function, and is learned.
The approximation error can be measured by a certain divergence, for example the squared Frobenius norm (F-norm): . Minimization of over and
has a closed-form solution using Truncated Singular Value Decomposition (TSVD). Denotethe diagonal matrix with the largest
singular values. The corresponding left and right singular vectors as columns form matricesand , respectively. Then the minimum appears by setting and , where we can move any scaling from to or vice versa.
In general , where with . We can always reduce a multi-factor case to the two-factor form by grouping and , where and . It is required that when approximating large square matrices. Otherwise, the factorizing matrices are still too large. This two-factor form reveals a bottleneck as illustrated in Figure 1 (bottom), which shows that the approximation quality of all variants of low-rank matrix factorization cannot exceed TSVD(, ) in terms of Frobenius norm.
As we have seen, the performance of LRMF is capped due to the low-rank constraint. Its performance is poor if the approximated square matrix is far from low-rank, i.e., the sum of the few largest eigenvalues does not dominate the matrix trace.
Here we propose a new approximation method that is free of the low-rank constraint. Our method implements the approximation with a number () of square and sparse factorizing matrices. The approximation is still economical if the total number of non-zero entries is much smaller than .
There are many ways to specify the sparse structure (i.e. the non-zero positions). A good specification should guarantee that the product of the factorizing matrices, , should be a full matrix. The condition prevents that the approximating matrix contains some always-zero entries. In addition, we consider a secondary requirement that each factorizing matrix has the same sparse structure, which provides better symmetry in the approximation.
We thus adopt a modified Chord protocol (Stoica et al., 2001) which originates for peer-to-peer distributed hash tables. In the protocol, the indices from to are organized in a circular graph, where the th node connects to itself and the -th nodes with . We set in our work. The Chord protocol is illustrated in Figure 2 (top).
We use the Chord protocol to specify the non-zero positions in each factorizing matrix, where each matrix row has non-zero entries. Thus every factorizing matrix has non-zeros. Note that the product of the factorizing matrices corresponds to the connections in the circular graph after multiple hops. We can set the number of factorizing matrices to
, which corresponds to the number of hops, and the resulting matrix product becomes a full matrix with high probability(see Stoica et al., 2001, Theorem 2). In total, there are non-zero entries, still much smaller than for a large .
The (Chord) Sparse Factorization (SF) can thus be formulated as the following optimization problem:
where ’s are sparse square matrices with non-zero positions specified by the Chord protocol. The approximation scheme is illustrated in Figure 2 (bottom). We also call the approach non-parametric SF as we directly optimize over the factorizing matrices.
Besides direct optimization over the factorizing matrices, we can consider the mapping from vectorial input data to the factorizing matrix entries because each node in the Chord protocol has the same out-degree. The mapping as a component can endow the model 1) generalization to newly coming data and 2) representation learning by transforming the current embedding representation to a new embedding.
We present a transformation model as a concrete example to illustrate the idea. It is a transformer-like model (see Figure 3 left), where we replace the scaled dot-product attention with a product of sparse square matrices. The matrix product provides an approximation to a full non-normalized attention matrix.
One block of such Parametric Sparse Factorization Attention (PSF-Attn) transforms data in the current embedding to new embedding . There is a number of MLPs in the block, where the MLP with parameters takes (the th row of ) as an input and returns the non-zeros in the th row of , for and . That is, for ,
Similar to transformers, we use another MLP with parameters to convert an row to the corresponding row.
The new model can work as a building block in representation learning frameworks. For example, if we define an objective function over , the learning can be formulated as the following optimization problem:
where the optimization can be implemented with back-propagation and a gradient based algorithm such as Adam (Kingma and Ba, 2014). It is also straightforward to stack multiple PSF-Attn blocks to implement deeper learning.
The PSF-Attn method has two advantages. First, the original transformer and many of its variants (e.g. Wang et al., 2020; Choromanski et al., 2020; Tay et al., 2021; Katharopoulos et al., 2020) are built upon scaled dot-product, which essentially employ low-rank matrix factorization. They may not work well if the attention is intrinsically high rank. In contrast, all factorizing matrices in our method are full-rank, and so is their product. Therefore PSF-Attn does not suffer from the low-rank bottleneck constraint. Second, our model does not require softmax over a large square matrix, avoiding many computational difficulties. Removing the softmax also endows more freedom in mixing the rows because the mixture can go beyond the convex hull.
Our method also differs from the previous multi-layer sparse attention approaches such as (Li et al., 2019; Child et al., 2019; Correia et al., 2019) because they still use scaled dot-products. PSF-Attn can give full attention in one block. For the th element in the sequence, its attention to other elements can directly be obtained by vector-matrix product .
We have performed three groups of experiments. In the first group, we tried the non-parametric SF for different types of square matrices and compared it with TSVD. Next, we demonstrated the scalability of PSF-Attn on synthetic sequences up to tens of thousands positions. Finally, we tested PSF-Attn for the Long Range Arena benchmark data sets to see its performance in real-world practice. The first group of experiments was run on a standard PC with an Intel Core i9 CPU. The second and third groups were run on a Linux server with one NVIDIA-Tesla V100 GPU with 32GB of memory.
|TSVD 126, SF 136||TSVD 2327, SF 2649||TSVD 1870, SF 2297||TSVD 4296, SF 4283||TSVD 6158, SF 5180||TSVD 1705, SF 1342|
|TSVD 3826, SF 1569||TSVD 2525, SF 1648||TSVD 2907, SF 2032||TSVD 7182, SF 5697||TSVD 8629, SF 7058||TSVD 2047, SF 1169|
There are many types of square matrices. TSVD is the best for approximating matrices close to low rank in terms of F-norm. For non-parametric SF, we performed an empirical study 1) to show that SF can supersede TSVD in F-norm by using the same number of non-zeros and 2) to identify the types of square matrices particularly suitable for SF.
Given a square matrix, we used the Matlab fminunc optimizer to solve the problem in Eq. 1, where the non-zero entries in the factorizing matrices are initialized to random numbers between . The total number of non-zero entries for SF and TSVD are and . For a fair comparison, we set the in TSVD such that it has nearly the same number and no fewer non-zeros than SF.
We first used grayscale images as approximated matrices because we can directly see them. Figure 4 (top) shows six typical square images, where we provide the resulting TSVD and SF approximation errors in F-norm below each image (more examples in Appendix 1).
The chess image (matrix) is close to low-rank because the black-and-white chessboard can is two-rank. Therefore TSVD performs better as expected. TSVD also works well for the Lena and apple images because TSVD tends to preserve low-frequency information in images. In contrast, TSVD is not as good as SF for the other three images that contain rich high-frequency details such as lines and corners, which indicates a matrix type where SF can defeat LRMF.
To further verify this, we computed the gradient magnitudes of the images (shown in Figure 4 bottom). In this way, the intensities in constant areas become zero, and the remaining non-zeros are mainly high-frequency details. We can see that SF gives a lower approximation error than TSVD for all such matrices, which confirms that SF is more advantageous for approximating matrices with rich high-frequency details.
In summary, the results show that the TSVD performance does not cap SF by using the same number of non-zeros for approximating square matrices. The winning cases indicate that SF is often better than LRMF when the approximated matrix is 1) sparse, 2) intrinsically high-rank, or 3) containing rich high-frequency details.
|Data type||Data name||TSVD||SF|
|surface mesh||Shuttle Entry||2.73e+03||1.86e+03|
Besides square images, we have also compared TSVD and SF on several other types of square matrices. Table 1 shows the comparison results. The data types include affinity matrices of dense graphs (dense graph), affinity matrix of sparse networks (network), affinity matrix of surface mesh over 3D objects (surface mesh), and covariance matrix of vectorial data (covariance). We can see that for dense graph and covariance types, sometimes TSVD is better while sometimes SF can win. SF wins for most cases for surface mesh because the mesh networks probably do not have a low-rank structure. SF wins all data sets in the network type, which indicates that SF is more effective for approximating sparse matrices. We give the details of the data sets in Appendix 1.1.
Here we test whether PSF-Attn is scalable to approximate large attention matrices. We have used two synthetic data sets composed of long sequences for supervised learning tasks. A similar experimental setup appeared in(Hochreiter and Schmidhuber, 1997) for scalability tests. The details of the data sets and tasks are given below:
Adding Problem. This is a sequence regression task. Each element of an input sequence is a pair of numbers , where , , . We generated signals at two randomly selected positions and such that and elsewhere. The learning target is . For example, an input sequence will have the learning target . Unlike (Hochreiter and Schmidhuber, 1997), we did not restrict the and choice to make the task more challenging. That is, the relevant signals can appear either locally or at a great distance from each other. In evaluation, a network prediction is considered correct if .
Temporal Order. This is a sequence classification task. A sequence consists of randomly chosen symbols from the alphabet , where the first four are noise symbols. Each sequence has two signal symbols, either or , which appear at two arbitrary positions. The four target classes correspond to the ordered combinations of the signal symbols , , , and . For example, an input sequence
should be classified as Class 2.
We prepared data of different sequence lengths for each problem. We started with and gradually increased the length by the factor of two, up to . For every sequence length, we generated 200,000 training and 5,000 testing instances.
We compared PSF-Attn with several popular methods based on scaled dot-product attention (referred to as X-former architectures). The first two, Linformer (Wang et al., 2020) and Performer (Choromanski et al., 2020) have also claimed to be scalable attention-based architectures. For completeness, we also included the original Transformer (Vaswani et al., 2017b)333available at https://github.com/lucidrains/linformer and https://github.com/lucidrains/performer-pytorch.
We followed standard cross-validation techniques to tune the main hyperparameters, such as the number of layers and heads, dimensionality of the token embedding, and query/key/value dimensions. For the Temporal Order problem, we directly fed the data instances to the embedding layers. For the Adding problem, the input data was only two-dimensional, and one of them was real-valued. Directly using such a low-dimensional embedding space would lead to poor attention. We augmented the dimensionality with a linear layer to assure sufficient freedom for the scaled dot-products in the X-former architectures. All the models were optimized using the Adam optimizer(Kingma and Ba, 2014) with the learning rate of 0.001 using batch size of 40.
The results are shown in Figure 5. For the Adding problem, we see that all models work fine for short sequences ( for ). The X-former methods, however, turn worse or even useless when the sequences are longer. Linformer has an error rate of for and for , which is nearly as bad as random guessing (). Transformer becomes problematic ( error) when . Performer starts to get wrong when , giving only accuracy, and when , its prediction becomes almost random guessing. In contrast, PSF-Attn achieves accuracy for all the tested lengths.
A similar pattern holds for the Temporal Order problem. When , all compared models give perfect or nearly perfect predictions. With longer sequences, for example when , the error rates of Transformer, Performer, and Linformer become , , and , respectively. When , their error rates become , , and , respectively. In contrast, PSF-Attn achieves accuracy for , and accuracy for .
In summary, our method has better scalability than the three attention models based on scaled dot-products in terms of lower learning errors. PSF-Attn can still achieve nearlyprediction accuracy for an attention matrix size up to tens of thousands.
|Image, airplane||Image, ship||Pathfinder, Negative||Pathfinder, Positive|
Next we provide the experimental results on Long Range Arena (LRA), a publicly available benchmark for modeling long sequential data (Tay et al., 2020). We select four tasks from LRA that cover various data types and demand flexible reasoning abilities of tested models.
|Transformer (Tay et al., 2020)||36.37||64.27||42.44||71.40|
|Transformer (Zhu et al., 2021)||37.13||65.35||-||-|
|Transformer (Xiong et al., 2021)||37.10||65.02||38.20||74.16|
|Sparse Transformer (Tay et al., 2020)||17.07||63.58||44.24||71.71|
|Longformer (Tay et al., 2020)||35.63||62.58||42.22||69.71|
|Linformer (Tay et al., 2020)||37.70||53.94||38.56||76.34|
|Linformer (Zhu et al., 2021)||37.38||56.12||-||-|
|Linformer (Xiong et al., 2021)||37.25||55.91||37.84||67.60|
|Reformer (Tay et al., 2020)||37.27||56.10||38.07||68.50|
|Reformer (Zhu et al., 2021)||36.44||64.88||-||-|
|Reformer (Xiong et al., 2021)||19.05||64.88||43.29||69.36|
|Performer (Tay et al., 2020)||18.01||65.40||42.77||77.05|
|Performer (Zhu et al., 2021)||32.78||65.21||-||-|
|Performer (Xiong et al., 2021)||18.80||63.81||37.07||69.87|
|BigBird (Tay et al., 2020)||36.06||64.02||40.83||74.87|
|Linear Transformer (Tay et al., 2020)||16.13||65.90||42.34||75.30|
|Transformer-LS (Zhu et al., 2021)||38.36||68.40||-||-|
|RFA-Gaussian (Peng et al., 2021)||36.80||66.00||-||-|
|Nyströmformer (Zhu et al., 2021)||37.34||65.75||-||-|
|Nyströmformer (Xiong et al., 2021)||37.15||65.52||41.58||70.94|
) and standard deviation () across multiple runs in the format.
ListOps. This is a sequence classification task for measuring the ability of models to identify and parse hierarchically constructed data (Nangia and Bowman, 2018). We used an enlarged version of the original ListOps, with a max sequence length up to 2000 and tree depth up to 10 (Tay et al., 2020). Each element in a sequence can be an operator, a digit, and a left or right bracket. The brackets define lists of items. Each operator, MAX, MIN, MED, and SUM_MOD, takes the items in a list as input and returns a digit, where MED means median and SUM_MOD means summation followed by modulo 10. An example sequence [MAX 6 [MED 3 2 2] 8 5 [MIN 8 6 2]] has ground truth answer 8. A prediction is correct if an output value of a neural network matches the ground truth label. For good predictions, a model should access all sequence elements and identify the proper parsing structure.
This is a binary sentiment classification task constructed from the IMDb reviews(Maas et al., 2011)
. Given a review as a sequence of characters, the goal is to classify it as positive or negative. Due to the character-level representation, the sequences are much longer than the word-level version used in conventional language modeling. We truncated or padded every sequence to a fixed length ().
. This task is to classify images into one of ten classes. The images and class labels come from the grayscale version of CIFAR10. Each image is flattened to form a sequence of length 1024. Unlike conventional computer vision, the task requires the predictors to treat the grayscale levels (0-255) as categorical values. That is, each image becomes a sequence of symbols with an alphabet size of 256. Two example images and their class labels are shown in Figure6.
Pathfinder. This is a binary classification task on synthetic images, which is motivated by cognitive psychology (Linsley et al., 2018). Each image (size ) contains two highlighted endpoints and some path-like patterns. Similar to the Image Classification task, the predictors must treat the pixels as categorical values and flatten the image to a sequence of length 1024. The task is to classify whether there is a path consisting of dashes between two highlighted points. Two example images and their classes are shown in Figure 6.
We have tested PSF-Attn in the above LRA learning tasks, where the hyperparameters in PSF-Attn were again tuned by cross-validation. No external data was used for pre-training. We ran PSF-Attn four times with a different random seed for each task. The mean and standard deviation across the multiple runs are reported in Table 2.
For comparison, we quote the prediction accuracies reported for many X-former methods in the literature, including Transformer (Vaswani et al., 2017a), Sparse Transformer (Child et al., 2019), Longformer (Beltagy et al., 2020), Linformer (Wang et al., 2020), Reformer (Kitaev et al., 2020), Performer (Choromanski et al., 2020), BigBird (Zaheer et al., 2020), Linear transformer (Katharopoulos et al., 2020), Transformer-LS (Zhu et al., 2021), RFA-Gaussian (Peng et al., 2021), and Nyströmformer (Xiong et al., 2021). If a method has different implementations, we quote all variants and their results. We exclude results that rely on external data for pre-training.
We see that PSF-Attn wins all tasks by giving the best classification accuracy among all compared methods. Such strong cross-task wins suggest that our method usually provides better attention approximation than those based on scaled dot-products.
Remarkably, our method has substantially improved the state-of-the-art in the Text and Pathfinder tasks. For Text, PSF-Attn achieves accuracy, compared to the runner-up by Transformer-LS. Our method wins by accuracy for Pathfinder, which gains about higher than the best X-former (Linear Transformer ). The significant improvement brought by PSF-Attn is probably because the two tasks involve sparse attention matrices.
We also investigated whether the approximating attention is meaningful by visualization. Figure 7 shows an attention vector (absolute values) of token [“CLS”]. Tokens in the review having more weight are highlighted. We see that PSF-Attn has a good performance in capturing sparse attention and identify the relevant words.
Figure 8 illustrates the attention values for a positive instance in Pathfinder. We calculated the attention vector of the -th point as the -th row in . We then summed the attention vectors of the endpoints and their direct neighbors and reshaped the sum to for visualization, where we see that the positions with high absolute visualized values distribute around the connecting path between the endpoints.
We have proposed a new approximation method called Sparse Factorization for square matrices by using a product of sparse matrices. Given the same budget of non-zero numbers, our method has shown superior performance over conventional low-rank matrix factorization approaches, especially in cases where the approximated matrix is sparse, high-rank, or contains high-frequency details. We have also given parametric design and performed an empirical study on the classification of long sequential data. As the critical attention component, our method has demonstrated clear wins over the conventional scaled dot-product transformer and its several variants in terms of scalability and accuracy.
We have employed the Chord protocol to fix the non-zero positions in the sparse factorizing matrices. Later we could consider the other predefined protocols or even adaptively learned protocols for the sparse structure. In this work, we have considered approximating unconstrained square matrices. In the future, we could investigate the approximation to more specific matrices such as non-negative, stochastic, symmetric, or semi-definite matrices. As a result, we could extend the method for further applications such as large-scale graph matching, Gaussian Process, or network structure identification.
Learning long-range spatial dependencies with horizontal gated recurrent units.In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 152–164, 2018.
Learning word vectors for sentiment analysis.In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies, pages 142–150, 2011.
In addition to six square images from the main paper, we have studied another six square images using the previously described approach. The results are shown in Figure A1. The conclusions are aligned with those in the main paper
There exist cases where SF performs better than TSVD, which means the SF approximation quality is not capped by TSVD by using the same number of non-zeros.
SF is more likely to win for images where there are more high-frequency details, especially for the gradient-magnitude images.
In the main paper Table 1, we have present comparison between TSVD and SF for approximating different types of square matrices using the same number of non-zeros. Here we provide the sources and statistics of the square matrices.
AuralSonar (). It is the Aural Sonar data set from [Chen et al., 2009]. The original research investigated the human ability to distinguish different types of sonar signals by ear (Philips et al., 2006).
Voting (). It is the Voting data set from [Chen et al., 2009], which contains dissimilarities between 435 voting records with 16 scaled voting attributes.
Sawmill (). It is the Sawmill communication network from the Pajek data sets444available at http://vlado.fmf.uni-lj.si/pub/networks/data/. This is a sparse matrix with 124 non-zero entries.
Scotland (). It is the Scotland network from the Pajek data sets, which is about Corporate interlocks in Scotland (1904-5). The matrix is sparse with 644 non-zero entries.
A99m (). It is the GD’99 - Linden strasse network from the Pajek data sets. The network is about the characters and their relations in the long-running German soap opera called ‘Lindenstrasse’. The matrix is sparse with 510 non-zero entries.
Mexican power (). It is the Mexican network from the Pajek data sets. The network contains the core of this political elite: the presidents and their closest collaborators. The matrix is sparse with 117 non-zero entries.
Strike (). It is the Strike network from the Pajek data sets. This is a social network about informal communication within a sawmill on strike. The matrix is sparse with 38 non-zero entries.
Webkb Cornell (). It is the Cornell subset in the LINQS WebKB data set555available at https://linqs.soe.ucsc.edu/data. The network is about citations among the 195 publication from Cornell. The matrix is sparse with 304 non-zero entries.
WorldTrade (). It is the World_trade network in the Pajek data sets. The network is about world trade in miscellaneous manufactures of metal, 1994. The matrix is sparse with 998 non-zero entries.
Mesh1e1 (). It is the mesh1e1 data set in SuiteSparse Matrix Collection666available at https://sparse.tamu.edu/. The matrix was originally from NASA, collected by Alex Pothen. It is a sparse matrix with 306 non-zero entries.
Mesh2e1 (). It is the mesh2e1 data set in SuiteSparse Matrix Collection. The matrix was originally from NASA, collected by Alex Pothen. It is a sparse matrix with 2018 non-zero entries.
OrbitRaising (). It is the orbitRaising_1 data set in SuiteSparse Matrix Collection. The matrix was from an optimal control problem. It is a sparse matrix with 2906 non-zero entries.
Shuttle Entry (). It is the spaceShuttleEntry_1 data set in SuiteSparse Matrix Collection. The matrix was from an optimal control problem. It is a sparse matrix with 6891 non-zero entries.
AntiAngiogenesis (). It is the tumorAntiAngiogenesi_1 data set in SuiteSparse Matrix Collection. The matrix was from an optimal control problem. It is a sparse matrix with 1783 non-zero entries.
Phoneme (). It is the covariance matrix of the Phoneme data set accompanied with the Elements of Machine Learning book [Hastie et al., 2001]. The original data777available at https://web.stanford.edu/~hastie/ElemStatLearn/data.html has 4508 instances of 256 dimensions.
MiniBooNE (). It is the covariance matrix of the MiniBooNE particle identification data set in the UCI Repository888available at https://archive.ics.uci.edu/ml/. The original data has 130064 instances of 50 dimensions.
Covertype (). It is the covariance matrix of the Covertype data set in the UCI Repository. The original data has 581012 instances of 54 dimensions.
Mfeat (). It is the covariance matrix of the Multiple Features data set in the UCI Repository. The original data has 2000 instances of 649 dimensions.
OptDigits (). It is the covariance matrix of the Optical Recognition of Handwritten Digits data set in the UCI Repository. The original data has 5620 instances of 64 dimensions.
PenDigits (). It is the covariance matrix of the Pen-Based Recognition of Handwritten Digits data set in the UCI Repository. The original data has 10992 instances of 16 dimensions.
Acoustic (). It is the covariance matrix of the SensIT Vehicle (acoustic) data set in the LIBSVM Classification (Multi-class) data collection999available at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html. The original data has 98528 instances of 50 dimensions.
IJCNN (). It is the covariance matrix of the ijcnn1 data set in the LIBSVM Classification (Binary class) data collection101010available at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html. The original data has 126701 instances of 22 dimensions.
Spam Ham (). It is the covariance matrix of a data set for email classification practice. The original data has 10000 instances of 448 features.
TIMIT (). It is the covariance matrix of the TIMIT speech recognition data111111available at https://catalog.ldc.upenn.edu/LDC93S1. There are 151290 instances, each contains 390 features (concatenating MFCC features in 10 consecutive 30ms windows).
Votes (). It is the covariance matrix of the Congressional Voting Records data set in the UCI Repository. The original data has 435 instances of 16 dimensions.
|TSVD 820, SF 1312||TSVD 1008, SF 1308||TSVD 1671, SF 2378||TSVD 2623, SF 2879||TSVD 3121, SF 2863||TSVD 1655, SF 1462|
|TSVD 1079, SF 839||TSVD 679, SF 401||TSVD 1824, SF 865||TSVD 3612, SF 1673||TSVD 4344, SF 2656||TSVD 2743, SF 1666|
In this section, we provide a detailed explanation of the LRA experiments [Tay et al., 2020]
. The technical information, such as memory consumption and time per epoch, is provided. In addition, we visualized attention maps for several instances from the Pathfinder task and explained their extraction process in detail.
The data set for the LRA benchmark is publicly available. The information about data and the download link can be found in the official GitHub repository: https://github.com/google-research/long-range-arena.
ListOps The raw data for this problem is organized as three separate files basic_train.tsv, basic_test.tsv, basic_val.tsv for training, testing, and validation data, respectively. The split is fixed. In addition to the tokens described in the main paper, each sequence has "(" and ")" symbols, which should be removed. To equalize the lengths of the sequences, we used the built-in PyTorch padding functional. After the sequences are prepared, the embedding layer processes each unique value, thus mapping elements to the embedding space. The rest of the training process is straightforward.
Text Classification The IMDB data set is downloaded using the tensorflow-dataset package. It includes 25 000 instances for training and another 25 000 for the test set. To transform the raw reviews into sequences, we first went through the whole corpus and extracted the character vocabulary. Then we mapped each sequence to a vector of indices using this vocabulary. Finally, we truncated or padded each sequence to a fixed length of and started training.
Image Classification CIFAR10 is a well-known dataset, which can be downloaded from the torchvision package. The train/test splitting is fixed. To make images gray-scaled, we used standard transformation transforms.Grayscale from the same package. An image is flattened to a sequence of length 1024. Then each element is mapped to a dictionary of size 256 (all possible intensity values) and given to the embedding layer.
Pathfinder The problem data consists of two types of files: images and metafiles. Metafiles store information about all the images and their corresponding labels (positive or negative). There are three classes of images: curv_baseline (easy), curv_contour_length_9 (medium), curv_contour_length_14 (hard). An image class corresponds to the distance between its endpoints (curve length), thus positively correlates with the difficulty level. Three images from the different catalogs are present in Figure A2. The exact data split is not provided. To separate the data into three parts, we iterated over all metafiles from the catalogs and constructed the training/val/test (90%/5%/5%) sets such that all three types of images are present equally. The rest of the processing is similar to the Image Classification task.
In this subsection, we explain the process of attention values extraction in detail. Two different approaches are used to obtain and interpret the training results because PSF-Attn was trained using different pooling strategies: adding the ["CLS"] token to the sequences of the IMDB reviews and simple flattening for the Pathfinder sequences.
Having an attention matrix , the attention vector of the -th sequence element is the -th row of .
Text Classification The ["CLS"] token is responsible for representing the information collected from the whole input sequence. Its attention vector corresponds to the -th row in the attention matrix. After normalization of the absolute matrix values to a range of [0, 1], we extracted the row and visualized its values along with the corresponding review characters of one test instance. The example in the main paper illustrates the ability of PSF-Attn to operate on sparse attention matrices successfully and to output meaningful results reflecting high classification accuracy.
Pathfinder Flattening as a pooling strategy requires another approach to interpret the attention values. The attention matrix for every test instance is of size (1024x1024), while the input image is (32x32). This mismatch makes it problematic to map attention values to the corresponding input image directly. To achieve this, we identified the positions of the endpoints and considered their direct neighbors (18 pixels positions, including themselves) as the source for visualization. We extracted the respective rows of the attention matrix and, finally, took their sum. The resulting vector was reshaped to the original size and depicted in a heat map. Figure A3 shows the attention values obtained using the described method. We can see that the identified attention vectors of store the meaningful information that visually matches the target path.
|Problem||Train Size (k)||Time||Param. (M)||Memory|