structurednets
Structured matrices for compressing neural networks
The low displacement rank (LDR) framework for structured matrices represents a matrix through two displacement operators and a low-rank residual. Existing use of LDR matrices in deep learning has applied fixed displacement operators encoding forms of shift invariance akin to convolutions. We introduce a rich class of LDR matrices with more general displacement operators, and explicitly learn over both the operators and the low-rank component. This class generalizes several previous constructions while preserving compression and efficient computation. We prove bounds on the VC dimension of multi-layer neural networks with structured weight matrices and show empirically that our compact parameterization can reduce the sample complexity of learning. When replacing weight layers in fully-connected, convolutional, and recurrent neural networks for image classification and language modeling tasks, our new classes exceed the accuracy of existing compression approaches, and on some tasks even outperform general unstructured layers while using more than 20x fewer parameters.
Recent years have seen a surge of interest in structured representations for deep learning, motivated by achieving compression and acceleration while maintaining generalization properties. A popular approach for learning compact models involves constraining the weight matrices to exhibit some form of dense but compressible structure and learning directly over the parameterization of this structure. Examples of structures explored for the weight matrices of deep learning pipelines include low-rank matrices [15, 41], low-distortion projections [48], (block-)circulant matrices [8, 17], Toeplitz-like matrices [33, 44], and constructions derived from Fourier-related transforms [36]. Though they confer significant storage and computation benefits, these constructions tend to underperform general fully-connected layers in deep learning. This raises the question of whether broader classes of structured matrices can achieve superior downstream performance while retaining compression guarantees.
Our approach leverages the low displacement rank (LDR) framework (Section 2), which encodes structure through two sparse displacement operators and a low-rank residual term [26]. Previous work studying neural networks with LDR weight matrices assumes fixed displacement operators and learns only over the residual [44, 49]. The only case attempted in practice that explicitly employs the LDR framework uses fixed operators encoding shift invariance, producing weight matrices that were found to achieve superior downstream quality to several other compression approaches [44]. Unlike previous work, we consider learning the displacement operators jointly
with the low-rank residual. Building upon recent progress on structured dense matrix-vector multiplication
[14], we introduce a much more general class of LDR matrices and develop practical algorithms for using these matrices in deep learning architectures. We show that the resulting class of matrices subsumes many previously used structured layers, including constructions that did not explicitly use the LDR framework [36, 17]. When compressing weight matrices in fully-connected, convolutional, and recurrent neural networks, we empirically demonstrate improved accuracy over existing approaches. Furthermore, on several tasks our constructions achieve higher accuracy than general unstructured layers while using an order of magnitude fewer parameters.

To shed light on the empirical success of LDR matrices in machine learning, we draw connections to recent work on learning equivariant representations, and hope to motivate further investigations of this link. Notably, many successful previous methods for compression apply classes of structured matrices related to convolutions
[8, 17, 44]; while their explicit aim is to accelerate training and reduce memory costs, this constraint implicitly encodes a shift-invariant structure that is well-suited to image and audio data. We observe that the LDR construction enforces a natural notion of approximate equivariance to transformations governed by the displacement operators, suggesting that, in contrast, our approach of learning the operators allows for modeling and learning more general latent structures in data that may not be precisely known in advance.

Despite their increased expressiveness, our new classes retain the storage and computational benefits of conventional structured representations. Our construction provides guaranteed compression (from quadratic to linear parameters) and matrix-vector multiplication algorithms that are quasi-linear in the number of parameters. We additionally provide the first analysis of the sample complexity of learning neural networks with LDR weight matrices, which extends to low-rank, Toeplitz-like, and other previously explored fixed classes of LDR matrices. More generally, our analysis applies to structured matrices whose parameters can interact multiplicatively with high degree. We prove that the class of neural networks constructed from these matrices retains VC dimension almost linear in the number of parameters, which implies that LDR matrices with learned displacement operators are still efficiently recoverable from data. This is consistent with our empirical results, which suggest that constraining weight layers to our broad class of LDR matrices can reduce the sample complexity of learning compared to unstructured weights.
We provide a detailed review of previous work and connections to our approach in Appendix B.
We introduce a rich class of LDR matrices where the displacement operators are explicitly learned from data, and provide multiplication algorithms implemented in PyTorch (Section 3). (Our code is available at https://github.com/HazyResearch/structurednets.)
We prove that the VC dimension of multi-layer neural networks with LDR weight matrices, a class which encompasses a broad range of previously explored approaches including the low-rank and Toeplitz-like classes, is quasi-linear in the number of parameters (Section 4).
We empirically demonstrate that our construction improves downstream quality when compressing weight layers in fully-connected, convolutional, and recurrent neural networks compared to previous compression approaches, and on some tasks can even outperform general unstructured layers (Section 5).
The generic term structured matrix refers to an n × n matrix that can be represented in much fewer than n² parameters and admits fast operations such as matrix-vector multiplication. The displacement rank approach represents a structured matrix M through displacement operators (A, B) defining a linear map on matrices, and a residual R, so that if

AM − MB = R    (1)
then M can be manipulated solely through the compressed representation (A, B, R). We assume that A and B have disjoint eigenvalues, which guarantees that M can be recovered from (A, B, R) (c.f. Theorem 4.3.2, Pan [39]). The rank of R (also denoted ∇_{A,B}[M]) is called the displacement rank of M w.r.t. (A, B). (Throughout this paper, we use square matrices for simplicity, but LDR is well-defined for rectangular matrices.) The displacement approach was originally introduced to describe the Toeplitz-like matrices, which are not perfectly Toeplitz but still have shift-invariant structure [26]. These matrices have LDR with respect to shift/cycle operators. A standard formulation uses A = Z_1, B = Z_{−1}, where Z_f denotes the matrix with 1 on the subdiagonal and f in the top-right corner. The Toeplitz-like matrices have previously been applied in deep learning and kernel approximation, and in several cases have performed significantly better than competing compressed approaches [44, 33, 10]. Figure 1 illustrates the displacement (1) for a Toeplitz matrix, showing how the shift-invariant structure of the matrix leads to a residual of rank at most 2.
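To make the Toeplitz example concrete, the following NumPy sketch (ours, for illustration; not from the paper's released code) builds a random Toeplitz matrix and checks that its Sylvester displacement with respect to the shift operators Z_1 and Z_{−1} has rank at most 2:

```python
import numpy as np

def Z(n, f):
    """Unit-f-circulant Z_f: ones on the subdiagonal, f in the top-right corner."""
    M = np.diag(np.ones(n - 1), -1)
    M[0, -1] = f
    return M

n = 8
rng = np.random.default_rng(0)
# Random Toeplitz matrix: T[i, j] depends only on i - j.
t = rng.standard_normal(2 * n - 1)
T = np.array([[t[i - j + n - 1] for j in range(n)] for i in range(n)])

# Sylvester displacement A T - T B with A = Z_1, B = Z_{-1}, as in equation (1).
R = Z(n, 1) @ T - T @ Z(n, -1)
print(np.linalg.matrix_rank(R))  # at most 2 for any Toeplitz matrix
```

A generic dense matrix of the same size has displacement rank n under these operators, so the rank-2 residual is what singles out the Toeplitz-like structure.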
A few distinct classes of useful matrices are known to satisfy a displacement property: the classic types are the Toeplitz-, Hankel-, Vandermonde-, and Cauchy-like matrices (Appendix C, Table 5), which are ubiquitous in other disciplines [39]. These classes have fixed operators consisting of diagonal or shift matrices, and LDR properties have traditionally been analyzed in detail only for these special cases. Nonetheless, a few elegant properties hold for generic operators, stating that certain combinations of (and operations on) LDR matrices preserve low displacement rank. We call these closure properties, and introduce an additional block closure property that is related to convolutional filter channels (Section 5.2).
We use the notation L_r(A, B) to refer to the matrices of displacement rank at most r with respect to the operators (A, B).
LDR matrices are closed under the following operations:
(a) Transpose/Inverse: If M ∈ L_r(A, B), then M^T ∈ L_r(B^T, A^T) and, when M is invertible, M^{−1} ∈ L_r(B, A).
(b) Sum: If M ∈ L_r(A, B) and N ∈ L_s(A, B), then M + N ∈ L_{r+s}(A, B).
(c) Product: If M ∈ L_r(A, B) and N ∈ L_s(B, C), then MN ∈ L_{r+s}(A, C).
(d) Block: Let each M_{ij} satisfy rank(A_i M_{ij} − M_{ij} B_j) ≤ r_{ij} for i, j = 1, …, k. Then the k × k block matrix (M_{ij}) has displacement rank at most Σ_{i,j} r_{ij} with respect to the block-diagonal operators diag(A_1, …, A_k) and diag(B_1, …, B_k).
We consider two classes of new displacement operators. These operators are fixed to be matrices with particular sparsity patterns, where the entries are treated as learnable parameters.
The first operator class consists of subdiagonal (plus corner) matrices: the entries A_{i+1,i} immediately below the diagonal, along with the corner entry A_{1,n}, are the only possible non-zero entries. As Z_f is a special case matching this sparsity pattern, this class is the most direct generalization of Toeplitz-like matrices with learnable operators.
The second class of operators consists of tridiagonal (plus corner) matrices: with the exception of the outer corners A_{1,n} and A_{n,1}, an entry A_{ij} can be non-zero only if |i − j| ≤ 1. Figure 2 shows the displacement operators for the Toeplitz-like class and for our more general operators. We henceforth let LDR-SD and LDR-TD denote the classes of matrices with low displacement rank with respect to subdiagonal and tridiagonal operators, respectively. Note that LDR-TD contains LDR-SD.
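The two parameterizations differ only in which entries are trainable; a minimal sketch (ours, for illustration) of the sparsity patterns, with roughly n free entries for subdiagonal-plus-corner and 3n for tridiagonal-plus-corners:

```python
import numpy as np

n = 6
rng = np.random.default_rng(2)

# Subdiagonal (plus corner) operator: n trainable entries.
sub = np.diag(rng.standard_normal(n - 1), -1)
sub[0, -1] = rng.standard_normal()            # corner entry, as in Z_f

# Tridiagonal (plus corners) operator: 3n trainable entries.
tri = (np.diag(rng.standard_normal(n))
       + np.diag(rng.standard_normal(n - 1), 1)
       + np.diag(rng.standard_normal(n - 1), -1))
tri[0, -1] = rng.standard_normal()            # outer corners
tri[-1, 0] = rng.standard_normal()

print(np.count_nonzero(sub), np.count_nonzero(tri))  # 6 18, i.e. n and 3n
```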
The matrices we introduce can model rich structure and subsume many types of linear transformations used in machine learning. We list some of the structured matrices that have LDR with respect to tridiagonal displacement operators:
The LDR-TD matrices contain:
(a) low-rank matrices;
(b) the other classic displacement structures: Hankel-like, Vandermonde-like, and Cauchy-like matrices;
(c) orthogonal polynomial transforms, including the Discrete Fourier and Cosine Transforms.
Given the parameters (A, B, G, H), the operation that must ultimately be performed is matrix-vector multiplication by M. Several schemes for explicitly reconstructing M from its displacement parameters are known for specific cases [40, 43], but they do not always apply to our general operators. Instead, we use A and B to implicitly construct a slightly different matrix with at most double the displacement rank, which is simpler to work with.
Let K(A, v) denote the Krylov matrix, defined to have i-th column A^i v. For any vectors g_1, …, g_r, h_1, …, h_r, the matrix

M = Σ_{i=1}^{r} K(A, g_i) K(B^T, h_i)^T    (2)

has displacement rank at most 2r with respect to (A, B).
Thus our representation stores the parameters (A, B, G, H), where A and B are subdiagonal or tridiagonal operators (containing n or 3n parameters each), and G = (g_1, …, g_r), H = (h_1, …, h_r) are n × r matrices. These parameters implicitly define the matrix (2), which is the LDR weight layer we use.
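A direct (if inefficient) way to materialize the layer in (2) is to form the Krylov matrices explicitly. The sketch below (ours, for illustration) does so, first checking the classical special case that K(Z_1, v) is the circulant matrix with first column v:

```python
import numpy as np

def krylov(A, v):
    """K(A, v): the matrix whose i-th column is A^i v."""
    n = len(v)
    cols, col = [], v
    for _ in range(n):
        cols.append(col)
        col = A @ col
    return np.column_stack(cols)

n, r = 8, 2
rng = np.random.default_rng(3)

# Special case: for the cyclic shift Z_1, K(Z_1, v) is the circulant of v.
Z1 = np.roll(np.eye(n), 1, axis=0)
v = rng.standard_normal(n)
assert np.allclose(krylov(Z1, v),
                   np.column_stack([np.roll(v, k) for k in range(n)]))

# The LDR layer of equation (2): subdiagonal operators A, B and factors G, H.
A = np.diag(rng.standard_normal(n - 1), -1)
B = np.diag(rng.standard_normal(n - 1), -1)
G, H = rng.standard_normal((n, r)), rng.standard_normal((n, r))
M = sum(krylov(A, G[:, i]) @ krylov(B.T, H[:, i]).T for i in range(r))
print(M.shape)  # (8, 8)
```

This O(n²)-memory construction is only for checking correctness; the fast algorithms below never materialize K or M.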
Generic, near-linear time algorithms for matrix-vector multiplication by LDR matrices with even more general operators, including both the LDR-TD and LDR-SD classes, were recently shown to exist [14]. However, complete algorithms were not provided, as they relied on theoretical results such as the transposition principle [6] that only imply the existence of algorithms. Additionally, the recursive polynomial-based algorithms are difficult to implement efficiently. For LDR-SD, we provide explicit and complete near-linear time algorithms for multiplication by (2), and substantially simplify them so that they are useful in practical settings and implementable with standard library operations. We empirically compare the efficiency of our implementation against unstructured matrix-vector multiplication in Figure 5 and Table 6 in the appendix, showing that LDR-SD accelerates inference by 3.34–46.06x. We also show results for the low-rank and Toeplitz-like classes, which have a lower computational cost. For LDR-TD, we explicitly construct the Krylov matrices from Proposition 3 and then apply the standard matrix-vector multiplication algorithm. Efficient implementations of near-linear time algorithms for LDR-TD are an interesting area of future work.
Define the simultaneous computation of s Fast Fourier Transforms (FFTs), each of size m, to be a batched FFT with total size sm. Consider any subdiagonal matrix A and vector g. Then K(A, g) or K(A, g)^T can be multiplied by any vector by computing O(log n) batched FFTs, each of total size O(n). The total number of computations is O(n log² n).
These algorithms are also automatically differentiable, which we use to compute the gradients when learning. More complete descriptions of these algorithms are presented in Appendix C.
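As a reference point for the fast algorithms (ours, for illustration; not the paper's FFT-based implementation), multiplication by a Krylov matrix can also be written with Horner's rule, costing n products with the sparse operator A, i.e. O(n²) overall for a subdiagonal A, versus the O(n log² n) of the batched-FFT scheme:

```python
import numpy as np

def krylov_matvec(A, g, x):
    """K(A, g) @ x = sum_k x[k] * (A^k g), evaluated by Horner's rule."""
    b = x[-1] * g
    for xk in x[-2::-1]:          # x[n-2], ..., x[0]
        b = A @ b + xk * g
    return b

n = 32
rng = np.random.default_rng(4)
A = np.diag(rng.standard_normal(n - 1), -1)   # subdiagonal operator
g, x = rng.standard_normal(n), rng.standard_normal(n)

# Check against the explicitly materialized Krylov matrix.
K = np.column_stack([np.linalg.matrix_power(A, k) @ g for k in range(n)])
print(np.allclose(krylov_matvec(A, g, x), K @ x))  # True
```

Like the fast algorithms, this formulation is built from differentiable primitives, so gradients with respect to A, g, and x come for free under autodifferentiation.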
The matrices we use in (2) are unusual in that the parameters interact multiplicatively (through the powers of A and B in the Krylov matrices) to implicitly define the actual layer. In contrast, fully-connected layers are linear, and other structured layers, such as Fastfood and ACDC [30, 48, 36], are constant degree in their parameters. However, we can prove that this does not significantly change the learnability of our classes:
Let F denote the class of neural networks with L LDR layers, W total parameters, and piecewise linear activations. Let sign F denote the corresponding classification functions, i.e. {x ↦ sign f(x) : f ∈ F}. The VC dimension of this class is VCdim(sign F) = O(LW log W).
Theorem 2 matches the standard bound for unconstrained weight matrices [4, 24]. This immediately implies a standard PAC-learnable guarantee [46]. Theorem 2 holds for even more general activations and for matrices that include, for example, the broad classes of [14]. The proof is in the appendix, and we empirically validate the generalization and sample complexity properties of our class in Section 5.3.
We observe that displacement rank is related to a line of work outside the resource-constrained learning community, specifically on building equivariant (also called covariant in some contexts [5, 34]) feature representations that transform in predictable ways when the input is transformed. An equivariant feature map φ satisfies

φ(B(x)) = A(φ(x))    (3)

for transformations A, B (invariance is the special case when A is the identity) [32, 16, 42]. This means that perturbing the input by a transformation B before passing through the map φ is equivalent to first finding the features φ(x) and then transforming them by A.
Intuitively, LDR matrices are a suitable choice for modeling approximately equivariant linear maps, since the residual R in (1) measures how far the map is from exactly satisfying (3), and has low complexity. Furthermore, approximately equivariant maps should retain the compositional properties of equivariance, which LDR satisfies via Proposition 1. For example, the product closure property formalizes the notion that the composition of two approximately equivariant maps is still approximately equivariant. Using this intuition, the displacement representation (1) of a matrix decomposes into two parts: the operators A, B define transformations to which the model is approximately equivariant, and the low-complexity residual R controls standard model capacity.
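As a minimal concrete instance (ours, for illustration), a circulant map, i.e. a cyclic convolution, is exactly equivariant to cyclic shifts, and correspondingly its displacement by the shift operator vanishes identically:

```python
import numpy as np

n = 8
rng = np.random.default_rng(5)

Z1 = np.roll(np.eye(n), 1, axis=0)                      # cyclic shift operator
c = rng.standard_normal(n)
C = np.column_stack([np.roll(c, k) for k in range(n)])  # circulant (convolution) map

x = rng.standard_normal(n)
# Equivariance (3) with A = B = Z_1: shifting then mapping == mapping then shifting.
print(np.allclose(C @ (Z1 @ x), Z1 @ (C @ x)))          # True
# The corresponding residual Z_1 C - C Z_1 in (1) is exactly zero.
print(np.linalg.matrix_rank(Z1 @ C - C @ Z1))           # 0
```

An LDR matrix relaxes this: the residual need not be zero, only low-rank, which is the "approximate equivariance" discussed above.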
Equivariance has been used in several ways in the context of machine learning. One formulation, used for example to model ego-motions, supposes that (3) holds only approximately, fixes the transformation B, and uses data to learn an appropriate A [1, 32]. Another line of work uses the representation theory formalization of equivariant maps [12, 27]. We describe this formulation in more detail, and show that LDR satisfies this definition as well, in the appendix. In contrast to previous settings, which fix one or both of A, B, our formulation stipulates that the map can be uniquely determined from the operators A, B and the residual R, and learns the operators themselves as part of an end-to-end model. In Section 5.4 we include a visual example of latent structure that our displacement operators learn, where they recover centering information about objects from a 2D image dataset.
In Section 5.1 we consider a standard setting of compressing a single hidden layer (SHL) neural network and the fully-connected (FC) layer of a CNN for image classification tasks. Following previous work [7, 44], we test on two challenging MNIST variants [29], and include two additional datasets with more realistic objects (CIFAR-10 [28] and NORB [31]). Since SHL models take a single channel as input, we converted CIFAR-10 to grayscale for this task. Our classes and the structured baselines are tested across different parameter budgets in order to show trade-offs between compression and accuracy. As shown in Table 1, in the SHL model our methods consistently achieve higher test accuracy than baselines for compressed training and inference, by 3.14, 2.70, 3.55, and 3.37 accuracy points on MNIST-bg-rot, MNIST-noise, CIFAR-10, and NORB respectively. In the CNN model, as shown in the appendix, we found improvements of 5.56, 0.95, and 1.98 accuracy points over baselines on MNIST-bg-rot, MNIST-noise, and NORB respectively. Additionally, to explore whether learning the displacement operators can facilitate adaptation to other domains, we replace the input-hidden weights in an LSTM for a language modeling task, and show improvements of 0.81–30.47 perplexity points compared to baselines at several parameter budgets.
In addition to experiments on replacing fully-connected layers, in Section 5.2 we also replace the convolutional layer of a simple CNN while preserving performance within 1.05 accuracy points on CIFAR-10. In Section 5.3, we consider the effect of a higher parameter budget. By increasing the rank to just 16, the LDR-SD class meets or exceeds the accuracy of the unstructured FC layer on all datasets we tested, for both SHL and CNN. (In addition to the results reported in Table 1, Figure 3, and the appendix, we also found that at rank 16 the LDR-SD class on the CNN architecture achieved test accuracies of 68.48% and 75.45% on CIFAR-10 and NORB respectively.) Appendix D includes more experimental details and protocols. Our PyTorch code is publicly available at github.com/HazyResearch/structurednets.
Sindhwani et al. [44] showed that for a fixed parameter budget, the Toeplitz-like class significantly outperforms several other compression approaches, including Random Edge Removal [11], Low Rank Decomposition [15], Dark Knowledge [25], HashedNets [7], and HashedNets with Dark Knowledge. Following previous experimental settings [7, 44], Table 1 compares our proposed classes to several baselines using dense structured matrices to compress the hidden layer of a single hidden layer neural network. In addition to Toeplitz-like, we implement and compare to the other classic LDR types, Hankel-like and Vandermonde-like, which were previously indicated as an unexplored possibility [44, 49]. We also show results when compressing the FC layer of a 7-layer CNN based on LeNet in the appendix, along with comparisons to additional baselines at multiple budgets, including network pruning [23] and a baseline used in [7], in which the number of hidden units is adjusted to meet the parameter budget.
Table 1: Test accuracy (%) when compressing the hidden layer of a single hidden layer network; total parameter counts are shown in parentheses. Structured classes with a rank use r = 1.

| Method | MNIST-bg-rot | MNIST-noise | CIFAR-10 | NORB |
|---|---|---|---|---|
| Unstructured | 44.08 (622,506) | 65.15 (622,506) | 46.03 (1,058,826) | 59.83 (1,054,726) |
| LDR-TD (r = 1) | 45.81 (14,122) | 78.45 (14,122) | 45.33 (18,442) | 62.75 (14,342) |
| Toeplitz-like [44] (r = 1) | 42.67 (14,122) | 75.75 (14,122) | 41.78 (18,442) | 59.38 (14,342) |
| Hankel-like (r = 1) | 42.23 (14,122) | 73.65 (14,122) | 41.40 (18,442) | 60.09 (14,342) |
| Vandermonde-like (r = 1) | 37.14 (14,122) | 59.80 (14,122) | 33.93 (18,442) | 48.98 (14,342) |
| Low-rank [15] (r = 1) | 35.67 (14,122) | 52.25 (14,122) | 32.28 (18,442) | 43.66 (14,342) |
| Fastfood [48] | 38.13 (10,202) | 63.55 (10,202) | 39.64 (13,322) | 59.02 (9,222) |
| Circulant [8] | 34.46 (8,634) | 65.35 (8,634) | 34.28 (11,274) | 46.45 (7,174) |
At rank one (the most compressed setting), our classes with learned operators achieve higher accuracy than the fixed-operator classes, and on the MNIST-bg-rot, MNIST-noise, and NORB datasets even improve on FC layers of the same dimensions, by 1.73, 13.30, and 2.92 accuracy points respectively on the SHL task, as shown in Table 1. On the CNN task, our classes improve upon unstructured fully-connected layers by 0.85 and 2.25 accuracy points on the MNIST-bg-rot and MNIST-noise datasets (shown in the appendix). As noted above, at higher ranks our classes meet or improve upon the accuracy of FC layers on all datasets in both the SHL and CNN architectures.
Additionally, in Figure 3 we evaluate the performance of LDR-SD at higher ranks. Note that the ratio of parameters between LDR-SD and the Toeplitz-like or low-rank classes at rank r is (r + 1)/r, which becomes negligible at higher ranks. Figure 3 shows that at just rank 16, the LDR-SD class meets or exceeds the performance of the FC layer on all four datasets, by 5.87, 15.05, 0.74, and 6.86 accuracy points on MNIST-bg-rot, MNIST-noise, CIFAR-10, and NORB respectively, while still maintaining at least 20x fewer parameters.
Of particular note is the poor performance of low-rank matrices. As mentioned in Section 2, every fixed-operator class has the same parameterization (a low-rank matrix). We hypothesize that the main contribution to their marked performance difference is the effect of the learned displacement operator modeling latent invariances in the data, and that the improvement in the displacement rank classes, from low-rank to Toeplitz-like to our learned operators, comes from more accurate representations of these invariances. As shown in Figure 3, broadening the operator class (from Toeplitz-like at rank r to LDR-SD at rank r) is consistently a more effective use of parameters than increasing the displacement rank (from Toeplitz-like at rank r to rank r + 1). Note that LDR-SD at rank r and Toeplitz-like at rank r + 1 have the same parameter count.
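The parameter accounting behind these comparisons is simple arithmetic; the sketch below is our illustration, assuming n = 784 hidden units (28 × 28 inputs flattened), ignoring biases, and counting n parameters per learned subdiagonal operator:

```python
n = 784                          # layer dimension (28 x 28 inputs flattened)
fc = n * n                       # unstructured fully-connected layer

for r in (1, 4, 16):
    low_rank = 2 * n * r         # G, H: the rank-r factors (Toeplitz-like too)
    ldr_sd = low_rank + 2 * n    # plus two learned subdiagonal operators A, B
    print(f"r={r:2d}  params={ldr_sd:6d}  "
          f"compression={fc / ldr_sd:6.1f}x  overhead={ldr_sd / low_rank:.2f}x")
```

Under these assumptions the overhead over plain low-rank is exactly (r + 1)/r, and even at rank 16 the structured layer remains more than 20x smaller than the unstructured one, consistent with the figures quoted above.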
Here, we replace the input-hidden weights in a single-layer long short-term memory network (LSTM) for a language modeling task. We evaluate on the WikiText-2 dataset, consisting of 2M training tokens and a vocabulary size of 33K [35]. We compare to Toeplitz-like and low-rank baselines, both previously investigated for compressing recurrent nets [33]. As shown in Table 2, LDR-SD improves upon the baselines for each budget tested. Though our class does not outperform the unstructured model, we did find that it achieves a significantly lower perplexity than the fixed Toeplitz-like class (by 19.94–42.92 perplexity points), suggesting that learning the displacement operator can help adapt to different domains.

Table 2: Perplexity on the WikiText-2 language modeling task at several parameter budgets.

| Num. Parameters | LDR-SD | Toeplitz-like | Low-rank |
|---|---|---|---|
| 2,048 | 166.97 | 186.91 | 205.72 |
| 3,072 | 154.51 | 177.60 | 179.46 |
| 5,120 | 141.91 | 178.07 | 172.38 |
| 9,216 | 143.60 | 186.52 | 144.41 |
| 17,408 | 132.43 | 162.58 | 135.65 |
| 25,600 | 129.46 | 155.73 | 133.37 |
Convolutional layers of CNNs are a prominent example of equivariant feature maps: convolutions are designed to be shift equivariant, i.e. shifting the input is equivalent to shifting the output. It has been noted that convolutions are a subcase of Toeplitz-like matrices with a particular sparsity pattern; for example, a 1D convolutional filter corresponds to a Toeplitz weight matrix supported on a few diagonals [8, 44]. As channels are simply block matrices (a layer with multiple in- and out-channels, each pair connected by a weight matrix of a given class, is a block matrix of those classes), the block closure property implies that a multi-channel convolutional filter is simply a Toeplitz-like matrix of higher displacement rank (see Appendix C, Corollary 1). In light of the interpretation of LDR as an approximately equivariant linear map (as discussed in Section 4), we investigate whether replacing convolutional layers with more general representations can recover similar performance, without needing the hand-crafted sparsity pattern.
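To see the sparsity pattern concretely, the sketch below (ours, for illustration) writes a 1D "valid" convolution as a banded, rectangular Toeplitz matrix acting on the flattened input:

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 16, 3
w = rng.standard_normal(k)        # filter taps
x = rng.standard_normal(n)        # input signal

# Each output row applies the same k taps, slid over by one position:
# a Toeplitz matrix supported on k diagonals.
T = np.zeros((n - k + 1, n))
for i in range(n - k + 1):
    T[i, i:i + k] = w[::-1]       # reversed taps, shifted by the row index

print(np.allclose(T @ x, np.convolve(x, w, mode="valid")))  # True
```

The Toeplitz-like and LDR-SD layers in Table 3 drop this hand-crafted banded support and let the low-rank residual (and, for LDR-SD, the operators) be learned instead.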
Briefly, we test the simplest multi-channel CNN model on the CIFAR-10 dataset, consisting of one layer of convolutional channels, followed by an FC layer, followed by the softmax layer. The final accuracies are listed in Table 3. The most striking result is for the simple architecture consisting of two layers of a single structured matrix. This comes within 1.05 accuracy points of the highly specialized architecture consisting of convolutional channels + pooling + FC layer, while using fewer layers, hidden units, and parameters. The full details are in Appendix D.

Table 3: Replacing the layers of a simple CNN on CIFAR-10.

| First hidden layer(s) | Last hidden layer | Hidden units | Parameters | Test Acc. |
|---|---|---|---|---|
| 3 Convolutional Channels (CC) | FC | 3072, 512 | 1,573,089 | 54.59 |
| 3CC + Max Pool | FC | 3072, 768, 512 | 393,441 | 55.14 |
| 4CC + Max Pool | FC | 4096, 1024, 512 | 524,588 | 60.05 |
| Toeplitz-like channels | Toeplitz-like | 3072, 512 | 393,216 | 57.29 |
| LDR-SD channels | LDR-SD | 3072, 512 | 417,792 | 59.36 |
| Toeplitz-like matrix | Toeplitz-like | 3072, 512 | 393,216 | 55.29 |
| LDR-SD matrix | LDR-SD | 3072, 512 | 405,504 | 59.00 |
Theorem 2 states that the theoretical sample complexity of neural networks with structured weight matrices scales almost linearly in the total number of parameters, matching the results for networks with fully-connected layers [4, 24]. As LDR matrices have far fewer parameters, the VC dimension bounds for LDR networks are correspondingly lower than those of general unstructured networks. Though VC dimension bounds are sufficient but not necessary for learnability, one might still expect to be able to learn over compressed networks with fewer samples than over unstructured networks. We empirically investigate this using the same experimental setting as Table 1 and Figure 3, and show in the appendix that the structured classes consistently have lower generalization error (as standardly measured by the difference between training and test error) than the unstructured baseline.
We investigate whether LDR models with learned displacement operators require fewer samples to achieve the same test error, compared to unstructured weights, in both the single hidden layer and CNN architectures. Tables in the appendix show our results. In the single hidden layer architecture, when using only 25% of the training data the LDR-TD class exceeds the performance of an unstructured model trained on the full MNIST-noise dataset. On the CNN model, only 50% of the training data is sufficient for LDR-TD to exceed the performance of an unstructured layer trained on the full dataset.
Finally, we examine the actual structures that our models learn. Figure 4(a,b) shows the heat maps of the weight matrices for the Toeplitz-like and LDR-SD classes, trained on MNIST-bg-rot with a single hidden layer model. As is conventional, the input image is flattened to a vector. The Toeplitz-like class is unable to determine that the input is actually a 28 × 28 image rather than a generic vector. In contrast, the LDR-SD class is able to pick up regularity in the input, as its weight matrix displays grid-like periodicity of size 28.
Figure 4(c) reveals why the weight matrix displays this pattern. The equivariance interpretation (Section 4) predicts that the learned operators should encode a meaningful transformation of the inputs. The entries of the learned subdiagonal are in fact recovering a latent invariant of the 2D domain: when visualized as an image, the pixel intensities correspond to how the inputs are centered in the dataset (Figure 4(d)). The appendix shows a similar figure for the NORB dataset, which has smaller objects; there, the subdiagonal learns a correspondingly smaller circle.
We substantially generalize the class of low displacement rank matrices explored in machine learning by considering classes of LDR matrices with displacement operators that can be learned from data. We show these matrices can improve performance on downstream tasks compared to compression baselines and, on some tasks, even general unstructured weight layers. We hope this work inspires additional ways of using structure to achieve both more compact and higher quality representations, especially for deep learning models which are commonly acknowledged to be overparameterized.
We thank Taco Cohen, Jared Dunnmon, Braden Hancock, Tatsunori Hashimoto, Fred Sala, Virginia Smith, James Thomas, Mary Wootters, Paroma Varma, and Jian Zhang for helpful discussions and feedback.
We gratefully acknowledge the support of DARPA under Nos. FA87501720095 (D3M) and FA86501827865 (SDH), NIH under No. N000141712266 (Mobilize), NSF under Nos. CCF1763315 (Beyond Sparsity) and CCF1563078 (Volume to Velocity), ONR under No. N000141712266 (Unifying Weak Supervision), the Moore Foundation, NXP, Xilinx, LETICEA, Intel, Google, NEC, Toshiba, TSMC, ARM, Hitachi, BASF, Accenture, Ericsson, Qualcomm, Analog Devices, the Okawa Foundation, and American Family Insurance, and members of the Stanford DAWN project: Intel, Microsoft, Teradata, Facebook, Google, Ant Financial, NEC, SAP, and VMWare. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views, policies, or endorsements, either expressed or implied, of DARPA, NIH, ONR, or the U.S. Government.
| Symbol | Used For |
|---|---|
| LDR | low displacement rank |
| LDR-SD | matrices with low displacement rank with respect to subdiagonal operators |
| LDR-TD | matrices with low displacement rank with respect to tridiagonal operators |
| A, B | displacement operators |
| ∇_{A,B}[M] | Sylvester displacement, AM − MB |
| r | (displacement) rank |
| G, H | parameters which define the rank-r residual matrix R = GH^T |
| Z_f | unit-f-circulant matrix, with ones on the subdiagonal and f in the top-right corner |
| K(A, v) | Krylov matrix, with i-th column A^i v |
| L_r(A, B) | matrices of displacement rank at most r with respect to (A, B) |
| φ | feature map |
| CC | convolutional channels |
| FC | fully-connected |
Our study of the potential of structured matrices for compressing deep learning pipelines was motivated by exciting work along these lines from Sindhwani et al. [44], the first to suggest the use of low displacement rank (LDR) matrices in deep learning. They specifically explored applications of the Toeplitz-like class, and empirically showed that this class is competitive against many other baselines for compressing neural networks on image and speech domains. Toeplitz-like matrices were similarly found to be effective at compressing RNN and LSTM architectures on a voice search task [33]. Another special case of LDR matrices are the circulant (or block-circulant) matrices, which have also been used for compressing CNNs [8]; more recently, these have been further developed and shown to achieve state-of-the-art results on FPGA and ASIC platforms [17]. Earlier works on compressing deep learning pipelines investigated the use of low-rank matrices [41, 15], perhaps the most canonical type of dense structured matrix, which are also encompassed by our framework, as shown in Proposition 2. Outside of deep learning, Choromanski and Sindhwani [10] examined a structured matrix class that includes Toeplitz-like, circulant, and Hankel matrices (which are all LDR matrices) in the context of kernel approximation.
On the theoretical side, Zhao et al. [49] study properties of neural networks with LDR weight matrices, proving results including a universal approximation property and error bounds. However, they retain the standard paradigm of fixing the displacement operators and varying the low-rank portion. Another natural theoretical question that arises with these models is whether the resulting hypothesis class is still efficiently learnable, especially when learning the structured class itself (as opposed to these previous fixed classes). Recently, Oymak [37] proved a Rademacher complexity bound for one-layer neural networks with low-rank weight matrices. To the best of our knowledge, Theorem 2 provides the first sample complexity bounds for neural networks with a broad class of structured weight matrices, including low-rank matrices, our LDR classes, and other general structured matrices [14].
In Section 3 we suggest that the LDR representation enforces a natural notion of approximate equivariance and satisfies closure properties that one would expect of equivariant representations. The study of equivariant feature maps is of broad interest for constructing more effective representations when known symmetries exist in underlying data. Equivariant linear maps have long been used in algebraic signal processing to derive efficient transform algorithms [18, 19]. The fact that convolutional networks induce equivariant representations, and the importance of this effect on sample complexity and generalization, has been well-analyzed [12, 2, 21, 45]. Building upon the observation that convolutional filters are simply linear maps constructed to be translation equivariant (shifting the input to a convolutional feature map is the same as shifting the output), exciting recent progress has been made on crafting representations invariant to more complex symmetries such as the spherical rotation group [13] and ego-motions [1]. Generally, however, underlying assumptions are made about the domain and invariances present in order to construct feature maps for each application. A few works have explored the possibility of learning invariances automatically from data, and design deep architectures that are in principle capable of modeling and learning more general symmetries [20, 38].
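The translation-equivariance property of convolution can be checked directly for circular convolution (a minimal NumPy sketch; the signal length, filter, and shift amount are arbitrary choices here):

```python
import numpy as np

def circ_conv(filt, x):
    # Circular convolution via the FFT (the DFT diagonalizes circular shifts).
    return np.real(np.fft.ifft(np.fft.fft(filt) * np.fft.fft(x)))

rng = np.random.default_rng(1)
x = rng.standard_normal(16)
filt = rng.standard_normal(16)
shift = 5

# Convolving a shifted input equals shifting the convolved output.
lhs = circ_conv(filt, np.roll(x, shift))
rhs = np.roll(circ_conv(filt, x), shift)
print(np.allclose(lhs, rhs))  # True
```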
Displacement rank has traditionally been used to describe the Toeplitz-like, Hankel-like, Vandermonde-like, and Cauchy-like matrices, which are ubiquitous in disciplines such as engineering, coding theory, and computer algebra. Their associated displacement representations are shown in Table 5.
Structured Matrix  Operator A  Operator B  Displacement Rank

Toeplitz  Z_1  Z_{−1}  ≤ 2
Hankel  Z_1  Z_0^T  ≤ 2
Vandermonde  D(v)  Z_0  ≤ 1
Cauchy  D(s)  D(t)  ≤ 1
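The Cauchy row of the table above, for example, can be verified numerically (a sketch assuming NumPy; the node vectors s and t are arbitrary distinct reals):

```python
import numpy as np

n = 6
s = np.arange(1, n + 1, dtype=float)         # chosen distinct from t
t = np.arange(1, n + 1, dtype=float) + 0.5
# Cauchy matrix: C[i, j] = 1 / (s_i - t_j).
C = 1.0 / (s[:, None] - t[None, :])

# Sylvester displacement with diagonal operators D(s) and D(t).
D = np.diag(s) @ C - C @ np.diag(t)
print(np.linalg.matrix_rank(D))  # 1: D is the all-ones matrix
```

Each entry of D is (s_i − t_j) / (s_i − t_j) = 1, so the displacement is the rank-one all-ones matrix.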
The following identities are easily verified:
The remainder
is the block matrix
This is the sum of matrices of rank and thus has rank .
∎
A block matrix, where each block is a Toeplitz-like matrix of displacement rank , is Toeplitz-like with displacement rank .
Apply Proposition 4 where each has the form . Let and . Note that and (of the same size as ) differ only in entries, and similarly and differ in entries. Since an sparse matrix also has rank at most ,
has rank at most . ∎
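A quick numerical check of this block structure, for the special case where every block is exactly Toeplitz (a sketch assuming NumPy; the block size n and block count k are arbitrary): the nonzero entries of the displacement are confined to k rows and k columns, so its rank is at most 2k.

```python
import numpy as np

def unit_f_circulant(n, f):
    Z = np.diag(np.ones(n - 1), k=-1)
    Z[0, n - 1] = f
    return Z

def random_toeplitz(n, rng):
    t = rng.standard_normal(2 * n - 1)
    return np.array([[t[i - j + n - 1] for j in range(n)] for i in range(n)])

k, n = 2, 4
rng = np.random.default_rng(2)
# k x k block matrix whose blocks are independent n x n Toeplitz matrices.
M = np.block([[random_toeplitz(n, rng) for _ in range(k)] for _ in range(k)])

N = k * n
D = unit_f_circulant(N, 1) @ M - M @ unit_f_circulant(N, -1)
print(np.linalg.matrix_rank(D) <= 2 * k)  # True
```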
Expanding on the claim in Section 3, we formally show that these structured matrices are contained in the tridiagonal (plus corners) LDR class. This includes several types previously used in similar works.



In Table 7, we provide details on the datasets we use for evaluation. For all experiments, the batch size was 50. NORB was downsampled to , and the left stereo image was used. Training was performed with stochastic gradient descent with momentum, with the number of epochs set to 50 on all datasets. 15% of the training data was used as the validation set in all experiments. We fixed momentum at 0.9 for all methods and performed a grid search over the learning rate. Unless otherwise stated, for each method we tested the learning rates {0.0002, 0.0005, 0.001, 0.002}, with three trials (with random initializations) per learning rate. For each trial, we evaluate on the validation set at each epoch, and report the test accuracy of the model with the highest validation accuracy over all learning rates, trials, and epochs.
In Figure 3, for each method and each of the four learning rates, we perform five trials with random initializations and report the average and standard deviation of the test accuracy of the learning rate with the highest average validation accuracy.
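The model-selection protocol above can be sketched as follows (a hypothetical helper; the accuracy values in the toy example are made up for illustration):

```python
def select_best(results):
    """results maps (learning_rate, trial) -> list of (val_acc, test_acc)
    per epoch. Returns the test accuracy of the checkpoint with the highest
    validation accuracy over all learning rates, trials, and epochs."""
    best_val, best_test = float("-inf"), None
    for epochs in results.values():
        for val_acc, test_acc in epochs:
            if val_acc > best_val:
                best_val, best_test = val_acc, test_acc
    return best_test

# Toy illustration with made-up accuracies.
results = {
    (0.001, 0): [(0.90, 0.88), (0.93, 0.91)],
    (0.002, 0): [(0.94, 0.90), (0.92, 0.89)],
}
print(select_best(results))  # 0.9 (the epoch with validation accuracy 0.94)
```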
Dataset  Training Examples  Test Examples  Number of Classes 

MNIST-bg-rot [29]  12000  50000  10
MNIST-noise [29]  12000  2000  10
CIFAR10 [28]  50000  10000  10 
NORB [31]  291600  58320  6 
Rectangles [29]  1200  50000  2 
In these experiments, we used an architecture consisting of a fully-connected hidden layer, followed by a fully-connected softmax layer. In order to be consistent with the architecture used in Sindhwani et al. [44], we do not use a bias term in the hidden layer.
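A minimal sketch of this two-layer architecture (assuming NumPy; W_hidden is where the structured matrix would be substituted, and the ReLU nonlinearity and input dimension here are illustrative assumptions):

```python
import numpy as np

def forward(x, W_hidden, W_softmax):
    # Hidden layer with no bias term, matching the setup of Sindhwani et al. [44].
    h = np.maximum(W_hidden @ x, 0.0)        # ReLU nonlinearity (assumed)
    logits = W_softmax @ h
    z = np.exp(logits - logits.max())        # numerically stable softmax
    return z / z.sum()

rng = np.random.default_rng(3)
x = rng.standard_normal(784)                 # e.g. a flattened 28x28 image
p = forward(x,
            rng.standard_normal((784, 784)), # square hidden weight matrix
            rng.standard_normal((10, 784)))  # softmax layer for 10 classes
print(p.sum())  # probabilities sum to 1
```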
In these experiments, shown in Table LABEL:table:imagesextendedcnn in Appendix LABEL:sec:additionalresults, we tested a LeNet-based architecture. The architecture has two convolution/pool layers with 6 and 16 channels respectively, followed by a fully-connected layer and a final fully-connected logit/softmax layer. We replaced the second-to-last fully-connected layer, which was of dimensions  for the MNIST-bg-rot and MNIST-noise datasets, and  for the CIFAR-10 and NORB experiments. This experiment corresponds to Table 3.
Here, we investigated whether the convolutional layers of CNNs can be learned automatically. For our experiments, we test on the simplest possible multi-channel CNN model on the CIFAR-10 dataset. The model consists of one layer of convolutional channels ( RGB in channels,  out channels, stride ), followed by a fully-connected layer and a final FC+softmax layer (four layers in total). We replace the convolutions with various structured matrices of the same dimensions, keeping the same channel structure (e.g. it would consist of square structured matrices) and number of hidden units. (The convolutions are padded to ensure their input and output dimensions are equal.)
The LDR classes benefit from being composed with LDR matrices of the same type (due to the composition property, Proposition 13), so we additionally replace the subsequent FC layer with the same structured matrix type.
By Proposition 14, channels of Toeplitz-like matrices form a larger Toeplitz-like matrix of the same size. Using this insight, we consider replacing the channel structure of the convolutional layer with either channels of structured matrices or a single wide structured matrix. (Note that this is also able to leverage the asymptotically fast nature of our structured classes.)
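The "asymptotically fast" remark can be made concrete for the simplest LDR class: a circulant matrix-vector product costs O(n log n) via the FFT rather than O(n^2) for a dense multiply (a sketch assuming NumPy):

```python
import numpy as np

def circulant(c):
    """Dense circulant matrix with first column c."""
    n = len(c)
    return np.array([[c[(i - j) % n] for j in range(n)] for i in range(n)])

def fast_circulant_matvec(c, x):
    # Circulant matrices are diagonalized by the DFT, so the matvec
    # reduces to an elementwise product in the Fourier domain.
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

rng = np.random.default_rng(4)
c = rng.standard_normal(32)
x = rng.standard_normal(32)
print(np.allclose(circulant(c) @ x, fast_circulant_matvec(c, x)))  # True
```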
Because convolutional layers appear to depend strongly on pooling (our structured matrices outperform them in isolation), we compare against a version of the CNN with an additional pooling layer after the convolutional channels. Note that this comparison pits the same basic four-layer model with a structured matrix against a five-layer convolutional model with pooling. Since the architectures are quite different and difficult to compare directly, we also experimented with adding more hidden units to the pooling model.
For a language modeling application (code available at https://github.com/pytorch/examples/tree/master/word_language_model), we explored replacing weight matrices in a recurrent neural network with structured matrices. We evaluate on a single-layer LSTM architecture, defined by the standard update equations:

i_t = σ(W_i x_t + U_i h_{t−1} + b_i)
f_t = σ(W_f x_t + U_f h_{t−1} + b_f)
o_t = σ(W_o x_t + U_o h_{t−1} + b_o)
c_t = f_t ∘ c_{t−1} + i_t ∘ tanh(W_c x_t + U_c h_{t−1} + b_c)
h_t = o_t ∘ tanh(c_t)
In our experiments we replace the matrices  with structured matrices. We use a hidden layer of size 128 and a word embedding size of 128. We evaluate on the Wikitext-2 dataset, which consists of Wikipedia articles (2,088,628 training, 217,646 validation, and 245,569 test tokens). The total vocabulary size is 33,278. We use the default hyperparameters and train using stochastic gradient descent with an initial learning rate of 20. The learning rate is annealed by a factor of 4 after each epoch if performance does not improve on the validation set. Results are shown in Table 2.
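The annealing schedule can be sketched as follows (mirroring the logic of the PyTorch word_language_model example; the validation losses in the toy call are made up):

```python
def anneal(lr, val_losses, factor=4.0):
    """Divide the learning rate by `factor` after any epoch whose
    validation loss fails to improve on the best value seen so far."""
    best = float("inf")
    schedule = []
    for loss in val_losses:
        if loss < best:
            best = loss          # improvement: keep the current rate
        else:
            lr /= factor         # no improvement: anneal
        schedule.append(lr)
    return schedule

print(anneal(20.0, [5.0, 4.5, 4.6, 4.4]))  # [20.0, 20.0, 5.0, 5.0]
```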