1 Introduction
1.1 Machine Learning on Constrained Devices
While the leitmotif “local sensing and remote inference” (a slogan we borrow from [DP17]) has been at the foundation of a range of successful AI applications, there exist many scenarios in which it is highly preferable to run machine learning algorithms either directly on the device or at least locally on the edge instead of remotely. As tracking and selling our digital data has become a booming business model [ZUB19], there is an increasing and urgent need to preserve privacy in digital devices. Privacy and security are much harder to compromise if data is processed locally instead of being sent to the cloud. Also, some AI applications may be deployed “off the grid”, in regions far away from any mobile or internet coverage. In addition, the cost of communication may become prohibitive and scaling to millions of devices may be impractical [DP17]. Moreover, some applications cannot afford latency issues that might result from remote processing.
Hence, there is a strong incentive for local inference “at the edge”, if it can be accomplished with little sacrifice in accuracy and within the resource constraints of the edge device. Yet, the difficulty in “pushing AI to the edge” lies exactly in these resource constraints of many edge devices regarding computing power, energy supply, and memory.
In this paper we focus on such constrained devices and propose a hybrid hardware-software framework that has the potential to significantly reduce the computational complexity and memory requirements of on-device machine learning.
1.2 Outline of our approach
When we deploy an AI-equipped device in practice, we know a priori what kind of task this device is supposed to carry out. The key idea of our approach can thus be summarized as follows: we can take this knowledge into account already in the data acquisition step itself and try to measure only the task-relevant information. This significantly lowers the size of the data that enter the device and thus reduces the computational burden and memory requirements of that device.
To that end we propose compressive deep learning, a hybrid hardware-software approach that can be summarized by two steps, see also Figure 1. First, we construct a projection operator that is specifically tailored to the desired machine learning task and which is determined by the entire training set (or a subset of the training set). This projection operator compresses the data simultaneously with the sensing process, as in standard compressive sensing [FR13]. But unlike compressive sensing, our projection operator is tailored specifically to the intended machine learning task, which therefore allows for a much more “aggressive” compression. The construction of the projection operator is of course critical, and various techniques are described in Section 2. This projection will be implemented in hardware, so the data enter the software layer of the device already in compressed form. The data acquisition/compression step is followed by a deep network that processes the compressed data and carries out the intended task.
While our approach is applicable to a wide range of machine learning tasks, in this paper we will mainly focus on image classification. We emphasize that the projection/compression we carry out at the data acquisition step is completely different from standard (JPEG-type) image compression. Firstly, standard image compression happens at the software layer and is completely independent of the data acquisition step, while our proposed compression-while-sensing scheme is an intrinsic part of the data acquisition step. Secondly, standard image compression is designed to work for a vast range of images and is independent of the task we later perform, while our compression scheme is inherently tied to the image classification task we intend to carry out.
1.3 Compressive sensing and beyond
As mentioned above, our approach is in part inspired by ideas from compressive sensing. The compressive sensing paradigm uses simultaneous sensing and compression. At the core of compressive sensing lies the discovery that it is possible to reconstruct a sparse signal exactly from an underdetermined linear system of equations and that this can be done in a computationally efficient manner via ℓ1-minimization, cf. [FR13]. Compressive sensing consists of two parts: (i) The sensing step, which simultaneously compresses the signal. This step is usually implemented (mostly) in hardware. (ii) The signal reconstruction step via carefully designed algorithms (thus this is done by software).
Compressive sensing aligns with a few of our objectives; however, the two approaches differ in the following crucial ways:

Adaptivity: Standard compressive sensing is not adaptive. It considers all possible sparse signals under certain representations. For different data sets, the fundamental assumptions are the same. This assumption is likely too weak for a specific image data set, where only images with certain characteristics are included. For image classification, this is usually the case. Statistical information in the data set may be exploited to achieve better results.

Exact reconstruction: Compressive sensing aims for exact signal reconstruction. That means enough measurements must be taken to ensure all information needed for exact recovery. Clearly, this is too stringent a constraint for image classification where only label recovery is required instead of full signal reconstruction.

Storage and processing cost: It is cumbersome to implement random matrices often proposed in compressive sensing in hardware; only certain structured random matrices can be implemented efficiently.
1.4 Prior work
Various approaches have been proposed for reducing the computational cost of neural networks and for on-device deep learning. We point out that the approaches described below may make different assumptions about the properties and capabilities of these devices.

Quantization: It is observed that representing weights in neural networks with lower precision does not lead to serious performance drop in testing accuracy. In this way, the computational cost of basic arithmetic operations is reduced while the model’s capability of predicting correct labels is largely preserved.

Matrix decomposition: In terms of runtime, convolutional layers are usually the most costly component of a convolutional neural network. However, many of the weights in these layers are redundant and a large amount of insignificant calculations can be avoided. Instead of quantizing the weights, methods from this category exploit the sparsity of the trained convolutional layers and seek to find some low rank representations of these layers.

Pruning: To exploit the sparsity in neural networks, besides treating each layer as a whole and finding a global replacement, it is also possible to deal with each node or connection locally. For a pretrained neural network, connections with low importance can be removed, reducing the total number of weights. More advanced pruning techniques have been developed, see e.g. [HMD15, LUW17, AAN+17].

Hardware: Several special purpose processors have been designed for on-device machine learning, see e.g. [QUA17, INT18, GOO18]. Recently, an all-optical machine learning device has been built, see [LRY+18]. The device can perform image classification at the speed of light without needing external power, and it achieves up to 93.39% classification accuracy for MNIST. Another work describes a hardware implementation of the kernel method, which is potentially a universal preprocessor for any data [SCC+16]. An image is first encoded into a monochromatic laser beam and then projected onto a random medium. Because of frequent scattering by the nanoparticles in the medium, the transmitted light can be seen as a signal obtained by applying a kernel operation involving a random matrix. The resulting signal is then collected by a camera and can be fed into a machine learning model. No energy is consumed except for the incoming laser signal during the process.
Our proposed approach is different from all the methods listed above, but can be combined with any one of them.
Other approaches towards compressing the input in a specific manner before clustering or classification can be found e.g. in [DDW+07, HS10, RP12, RRC+13, TPG+16, MMV18]. However, except for [MMV18], these papers do not tailor the compression operator to the classification task, but rather use compression operators that are in line with classical compressive sensing.
2 Compressive Deep Learning
In a nutshell, our framework can be summarized as follows. First, we construct a compression matrix in the form of a projection operator, determined by the entire training set (or a subset of the training set). This compression matrix will be implemented in hardware, which automatically feeds the compressed data into the next layer of the system. Following the compression matrix there is a neural network that processes the compressed data and recovers the labels.
There are many possibilities to construct the projection operator. It is important to keep in mind that ultimately the projection/compression step is supposed to be realized in hardware. Therefore it makes sense to impose some structure on the projection operator to make it more amenable to an efficient hardware implementation. Unlike compressive sensing, we will not use a random matrix that samples the image space essentially uniformly at random. Instead we construct a projector that focuses on the regions of interest, i.e., we concentrate our measurements on those regions of the ambient space in which the images we aim to classify are located, thereby preserving most information with a small number of measurements. To that end, principal component decomposition would suffice to construct the projection matrix. However, a typical PCA matrix is unstructured and is thus hard to implement efficiently, easily and at low cost in hardware.
Hence, we need to impose additional conditions on the projection operator to be constructed. There is a range of options, but the most convenient one is arguably to consider projections with convolution structure. Convolutions are ubiquitous in signal and image processing, they are a main ingredient of many machine learning techniques (deep learning being one of them), and they can be implemented efficiently in hardware [RG75].
We will consider two approaches to construct such a convolution-structured projection:

We try to find among all convolution matrices with orthogonal rows the one that is “most similar” to the PCA matrix. While this matrix nearness problem is nonconvex, we will prove that there is a convex problem “nearby” and that this convex problem has a convenient explicit solution. This construction of the projection matrix is independent of the CNN we use for image classification.

We construct the convolution projection by jointly optimizing the projection matrix and the CNN used for image classification. We do this by adding a “zeroth” convolution layer to our image classification CNN. The weights of this zeroth layer will give us the coefficients of the (nonunitary) convolution projection matrix. Of course, for the actual image classification we later remove this zeroth layer, since the whole point is to implement this layer in hardware. In theory this should yield a projection matrix with superior performance, because this approach jointly optimizes the projection and the classification. But due to the nonconvex nature of this optimization problem, there is no guarantee that we can actually find the optimal solution.
2.1 Construction of projection with convolution structure via constrained matrix nearness
Given a data set , one can find all principal components of this finite point set, which form an orthonormal basis of the space . These vectors can be grouped into a matrix, with each row being one of the components,
The matrix contains full information about the best approximating affine subspaces of different dimensions to the data set . Dimension reduction can be achieved by using the first few rows of . For any , we can construct
which can serve as a compression matrix that reduces the dimension of the input data from to . Here, the ratio represents the compression rate of the input.
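As an illustration of the PCA-based compression matrix described above, the following is a minimal NumPy sketch. The function name `pca_compression_matrix`, the toy data, and the choice to compute the principal directions through an SVD of the centered data are assumptions made for this example; they are not taken from the paper.

```python
import numpy as np

def pca_compression_matrix(X, m):
    """Return the m leading principal directions of X as rows.

    X: array of shape (num_samples, n), one vectorized sample per row.
    m: number of measurements to keep (m <= n).
    """
    X_centered = X - X.mean(axis=0)  # PCA is computed on centered data
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return Vt[:m]

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 64))            # toy stand-in for a training set
P_m = pca_compression_matrix(X, m=16)         # 16 x 64 "fat" matrix
y = P_m @ X[0]                                # compressed measurement of one sample
compression_rate = X.shape[1] / P_m.shape[0]  # input dimension / number of measurements
```

The rows of `P_m` are orthonormal, so the matrix acts as a projection onto the leading principal subspace; here 64 inputs are reduced to 16 measurements, a compression rate of 4.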
Yet, in general the PCA compression matrix is not a structured matrix, which is required for efficient hardware implementation. As we mentioned before, we try to find a convolution-type matrix that is in some sense as close as possible to it. Since we focus on image classification, our input signals are images. Therefore, we consider two-dimensional convolutions. In terms of matrices that means we either deal with block-circulant matrices with circulant blocks (BCCB) or block-Toeplitz matrices with Toeplitz blocks [DAV79]. Here, for convenience we focus on BCCB matrices. Note however that, as a projection, the PCA compression matrix is naturally a fat matrix (there are more columns than rows), while BCCB matrices are square by definition. Thus, we are concerned with finding the best approximation of the PCA compression matrix by subsampled versions of BCCB matrices.
Assume samples in the data set are all two-dimensional signals of size ; vectorizing these signals yields vectors of length . The space of subsampled matrices with BCCB structure is the space of sub-block-circulant matrices, denoted by , whose elements are matrices with the same dimensions as . See the appendix for an exact definition of sub-block-circulant matrices and . We formulate the problem of constructing a convolution-structured projection matrix close to the PCA matrix as follows: find the sub-block-circulant matrix , an element of , that is most similar to .
How do we measure similarity between matrices? One way to think about this is to identify each matrix with its row space. In our example, the row space of any is an dimensional subspace of , the dimensional Euclidean space. In other words, we identify each row space with a point in the Grassmannian manifold and measure the angle between the row spaces of and . There are various ways to do so. We follow [CHS96] and use the chordal distance for this purpose. Besides its geometric definition, the chordal distance between the row space of and that of can be computed via
Hence our goal is to find the sub-block-circulant matrix that minimizes its chordal distance to the PCA matrix.
(1) 
This structured matrix nearness problem does not have an easy solution. Therefore we consider a “nearby” problem that does have a nice explicit solution. We compute
(2) 
We used the fact that both the PCA matrix and the sub-block-circulant matrices have unit vectors as rows. The above estimate shows that even though the Frobenius distance does not determine the optimal chordal distance between two row spaces, it is always an upper bound of the chordal distance. Inspired by this estimate, instead of the original optimization problem (1), we consider the following problem:
(3)
Note that here is a constant and is determined by the compression rate. This optimization problem can be solved within a more general setting, namely in the context of complete commuting family of unitary matrices.
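The relation between the two distances can be checked numerically. The sketch below uses the standard expression for the chordal distance between the row spaces of two matrices with orthonormal rows, namely the square root of m minus the squared Frobenius norm of A B^T (cf. [CHS96]); the helper names and the random test matrices are assumptions made for this example.

```python
import numpy as np

def chordal_distance(A, B):
    """Chordal distance between the row spaces of A and B.

    A and B have shape (m, n) and orthonormal rows.
    """
    m = A.shape[0]
    overlap = np.linalg.norm(A @ B.conj().T, 'fro') ** 2
    return np.sqrt(max(m - overlap, 0.0))  # clip tiny negative round-off

def random_orthonormal_rows(m, n, rng):
    """Random (m, n) matrix with orthonormal rows, via a QR factorization."""
    Q, _ = np.linalg.qr(rng.standard_normal((n, m)))
    return Q.T

rng = np.random.default_rng(1)
A = random_orthonormal_rows(4, 12, rng)
B = random_orthonormal_rows(4, 12, rng)

d_chordal = chordal_distance(A, B)
d_frobenius = np.linalg.norm(A - B, 'fro')

# The chordal distance only depends on the row space: rotating the rows of A
# by an orthogonal matrix R leaves it unchanged, while the Frobenius
# distance between A and R @ A is generally nonzero.
R, _ = np.linalg.qr(rng.standard_normal((4, 4)))
d_same_space = chordal_distance(A, R @ A)
```

In this example `d_chordal` does not exceed `d_frobenius`, in line with the upper bound used to motivate problem (3).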
To solve (3), we consider the more general problem
(4) 
where is a given matrix of dimension and is a commuting family of unitary matrices. The exact definition of is given in the appendix. More precisely, we assume is a complete commuting family of unitary matrices and is a unitary matrix that diagonalizes elements of . Note that unitary transformation preserves the Frobenius norm since for any ,
As a result, the objective function is then
which can be decomposed into its diagonal part and its off-diagonal part. Since the off-diagonal part is independent of , in order to minimize , we only need to minimize the diagonal part. Let . By definition, . Hence, minimization of the diagonal part becomes
where the norm is simply the Euclidean norm in . The minimizer may not be unique, but one minimizer is given by
(5) 
where . Applying the parametrization map, we get
(6) 
due to the completeness of , and is a minimizer of (4). Thus, we have proved the following theorem.
Theorem 1.
We can readily show that the family of BCCB matrices, denoted by , is a special case.
Corollary 1.
See the appendix for the proof.
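The closed-form minimizer derived above can be sketched for the BCCB case as follows. The code takes the 2D DFT matrix as the diagonalizing unitary matrix, extracts the diagonal of the transformed input, normalizes each entry to unit modulus, and maps back. The function names, the row-major vectorization, and the convention of setting zero diagonal entries to 1 are assumptions made for this example.

```python
import numpy as np

def unitary_dft2(n1, n2):
    """Unitary 2D DFT matrix acting on row-major vectorized n1 x n2 signals."""
    F1 = np.fft.fft(np.eye(n1)) / np.sqrt(n1)
    F2 = np.fft.fft(np.eye(n2)) / np.sqrt(n2)
    return np.kron(F1, F2)

def nearest_unitary_bccb(P, n1, n2):
    """Unitary BCCB matrix closest to the square matrix P in Frobenius norm."""
    F = unitary_dft2(n1, n2)
    d = np.diag(F @ P @ F.conj().T)  # diagonal part in the Fourier basis
    mag = np.abs(d)
    z = np.ones_like(d)              # zero entries: any unit-modulus value works
    nonzero = mag > 1e-12
    z[nonzero] = d[nonzero] / mag[nonzero]  # normalize to the unit circle
    return F.conj().T @ np.diag(z) @ F

rng = np.random.default_rng(2)
P = rng.standard_normal((16, 16))    # toy real input for 4 x 4 images
U = nearest_unitary_bccb(P, 4, 4)
```

For a real input the imaginary part of `U` vanishes up to round-off, which matches the statement about real matrices in the corollary.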
Definition 1 (Downsampling Operator).
For any , a downsampling operator is a one-to-one map from to , where .
The downsampling operator chooses elements from in a certain order without replacement. We will use to sample the rows of a matrix from .
Definition 2 (Subsampled Unitary BCCB).
The Subsampled Unitary BCCB formed by downsampling via the downsampling operator is an matrix given by
We denote . The collection of all such matrices is denoted , which we will simply refer to as .
A subsampled unitary BCCB is formed by certain rows of some unitary block circulant matrix with circulant blocks whose indices are determined by a fixed downsampling operator.
Consider a downsampling operator . Let be a complete commuting family of unitary matrices. If we subsample every element of using , we end up getting a set of subsampled unitary matrices,
Now, consider the following optimization problem,
(8) 
where is a given matrix with dimension of . Define the zero-padding map as follows:
for any . Observe that any solution to
gives a solution to (8). This is because for any ,
where is the th row of and is a unit vector.
Theorem 2.
Downsampled from by some downsampling operator , the collection of subsampled unitary BCCB is obviously a special case since . Therefore, the optimization problem
can be solved by following Theorem 2 and Theorem 1. The procedure of solving this problem is summarized in Algorithm 1.
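One possible reading of this procedure is sketched below: zero-pad the fat matrix to a square one, find the nearest unitary BCCB matrix in the Fourier domain, and keep only the selected rows. This is a sketch under stated assumptions, not the authors' Algorithm 1. In particular, the strided grid used as downsampling pattern, the placement of the given rows at the selected indices during zero-padding, and all function names are choices made for this example.

```python
import numpy as np

def unitary_dft2(n1, n2):
    """Unitary 2D DFT matrix for row-major vectorized n1 x n2 signals."""
    F1 = np.fft.fft(np.eye(n1)) / np.sqrt(n1)
    F2 = np.fft.fft(np.eye(n2)) / np.sqrt(n2)
    return np.kron(F1, F2)

def strided_rows(n1, n2, stride):
    """Row-major indices of the pixels on a regular grid (assumed pattern)."""
    return np.array([i * n2 + j
                     for i in range(0, n1, stride)
                     for j in range(0, n2, stride)])

def subsampled_bccb(P_m, n1, n2, rows):
    """Subsampled unitary BCCB approximation of the real (m, n1*n2) matrix P_m."""
    n = n1 * n2
    P_pad = np.zeros((n, n))
    P_pad[rows] = P_m                      # step 1: zero-padding to n x n
    F = unitary_dft2(n1, n2)
    d = np.diag(F @ P_pad @ F.conj().T)    # step 2: diagonal in the Fourier basis
    mag = np.abs(d)
    z = np.where(mag > 1e-12, d / np.maximum(mag, 1e-12), 1.0)
    U = F.conj().T @ (z[:, None] * F)      # nearest unitary BCCB matrix
    return U[rows].real                    # step 3: keep the selected rows

rng = np.random.default_rng(3)
n1 = n2 = 6
rows = strided_rows(n1, n2, stride=2)      # 9 of the 36 rows
Q, _ = np.linalg.qr(rng.standard_normal((n1 * n2, len(rows))))
P_m = Q.T                                  # toy stand-in for a PCA matrix
A = subsampled_bccb(P_m, n1, n2, rows)
```

Each row of `A`, reshaped to 6 × 6, is a circular shift of one and the same kernel, so applying `A` amounts to a single 2D circular convolution followed by strided sampling.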
2.2 Construction of convolution projection via joint nonconvex optimization
In this section we construct a projection matrix by setting up a joint optimization problem that involves both solving the image classification task and finding the optimal projection matrix. To solve this nonconvex optimization problem we employ deep learning. More precisely, we construct a CNN to which we add an additional layer as zeroth layer. This zeroth layer will consist of one 2D convolution followed by downsampling. The overall architecture is illustrated in Figure 2.
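The forward pass of such a zeroth layer can be sketched in plain NumPy as a single-filter convolution evaluated only on a strided grid. The 5 × 5 filter size, the zero padding, and the use of cross-correlation (the convention of most deep learning frameworks) are assumptions made for this example; in the joint optimization the filter coefficients would be learned together with the classifier.

```python
import numpy as np

def zeroth_layer(image, kernel, stride):
    """Single-filter 2D convolution with zero padding, followed by downsampling.

    Only the outputs on the strided grid are computed. The kernel is assumed
    to have odd side lengths, and the operation is a cross-correlation.
    """
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)))  # zero padding
    H, W = image.shape
    return np.array([[np.sum(padded[i:i + kh, j:j + kw] * kernel)
                      for j in range(0, W, stride)]
                     for i in range(0, H, stride)])

rng = np.random.default_rng(4)
image = rng.random((28, 28))            # toy stand-in for an MNIST-sized image
kernel = rng.standard_normal((5, 5))    # would be learned jointly with the CNN
compressed = zeroth_layer(image, kernel, stride=3)
```

With a 28 × 28 input and stride 3 the output has 10 × 10 entries, and this smaller array is what the remaining network would receive.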
3 Numerical experiments
To demonstrate the capabilities of our methods, we test both the PNN and the circulant approximation against a few other baseline methods. We test these methods on two standard test sets, the MNIST dataset consisting of images depicting handwritten digits, and the Fashion-MNIST dataset, consisting of images of fashion products. The general workflow consists of two steps. In the preprocessing step we subject the images to the projection operator to simulate the compressive image acquisition via hardware. In the second step we feed these images into a convolutional neural network for classification. We will describe the preprocessing methods, the architecture of the network and training details in the rest of the section.
3.1 Preprocessing step: Projection operator
In order to put the PNN and the circulant approximation method in context, we ran a few other methods in the experiments and compared their results against each other. The list below is a brief summary of all these methods, and Table 1 gives the relation between the stride and the compression rate.

Downsampling: Downsample the images using a certain stride. When the stride is equal to 1, the original dataset is used. Downsampling, i.e. just taking low-resolution images, is the easiest (and least sophisticated) way to reduce the dimensionality of the images.

Random Convolution: A random convolutional filter of fixed size is generated and then applied to a few locations in the image, determined by the stride. The size of the resulting image is determined by the stride of the convolution applied to the image. The image size is unchanged when the stride is equal to 1. There are no constraints on the filter size, but we fixed the filter size in our experiments for the sake of simplicity. Projection matrices of this type have been proposed in the context of compressive sensing (e.g. see [ROM09]).

PCA: Compute the PCA components for the entire training set. Given the compression rate and a raw image, the compressed data are the coefficients of the image with respect to a few leading PCA components. When uncompressed (the compression rate is equal to 1), all coefficients are used.

Circulant Approximation: This is the construction outlined in Subsection 2.1. First compute a matrix with BCCB structure that is most similar to the PCA matrix of the training set and then compress images using this matrix. The dimension of the matrix with BCCB structure is determined by the compression rate. It is a square matrix if the data is uncompressed.

PNN: This is the construction presented in Subsection 2.2. Add a convolutional layer with a certain stride and one single feature map on the top of the architecture. Train the resulting network on the original dataset. The first convolutional layer serves as a compressor, and it is optimized along with the rest of the network.
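Table 1: Relation between stride, image dimension, and compression rate.
Stride  Dimension  Compression
1       28 × 28    1.00
2       14 × 14    4.00
3       10 × 10    7.84
4       7 × 7      16.00
5       6 × 6      21.78
6       5 × 5      31.36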
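The compression rates in Table 1 follow from simple arithmetic, assuming 28 × 28 input images (the MNIST image size) and a compressed side length equal to the original side length divided by the stride, rounded up. Both the rounding rule and the function names are assumptions made for this sketch.

```python
import math

def compressed_side(side, stride):
    """Side length after sampling every `stride`-th pixel, starting at pixel 0."""
    return math.ceil(side / stride)

def compression_rate(side, stride):
    """Ratio of the original to the compressed number of pixels."""
    return side * side / compressed_side(side, stride) ** 2

for s in range(1, 7):
    k = compressed_side(28, s)
    print(s, f"{k} x {k}", round(compression_rate(28, s), 2))
```

For example, stride 3 keeps a 10 × 10 grid, so the rate is 784 / 100 = 7.84.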
3.2 Architecture and training
The neural network for the raw data has the following architecture. The first weighted layer is a convolution layer with 32 filters of size 5 × 5, followed by a ReLU nonlinearity and a max-pooling layer with stride 2. The second weighted layer is the same as the first one except that it has 64 filters; it is also followed by ReLU and a max-pooling layer with stride 2. The third weighted layer is a fully connected layer with 256 units, followed by a dropout layer. The last layer is a softmax layer with 10 channels, corresponding to the classes in the dataset. This network has 857,738 weights in total. Architectures of neural networks dealing with compressed data are listed in Table 2.
The table also lists the total number of weights and the number of floating point operations for each forward pass. In general, both numbers decrease when we use smaller inputs. Since we are using very small images in the first place, some networks have two pooling layers and some do not have any. This explains why some networks with smaller inputs have more weights and flops than networks with larger inputs. For larger images, we will be able to apply the same number of pooling operations both for the raw data and the compressed data. In this way, the decrease in the number of weights and flops will be even more significant.
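Table 2: Network architectures for the compressed inputs, by stride.
Stride              1         2         3         4         5         6
Input               28 × 28   14 × 14   10 × 10   7 × 7     6 × 6     5 × 5
First convolution   conv32    conv32    conv32    conv32    conv32    conv32
First pooling       maxpool2  maxpool2  maxpool2  -         -         -
Second convolution  conv64    conv64    conv64    conv64    conv64    conv64
Second pooling      maxpool2  -         -         -         -         -
Fully connected     FC256     FC256     FC256     FC256     FC256     FC256
Output              softmax10 softmax10 softmax10 softmax10 softmax10 softmax10
Weights             857,738   857,738   464,522   857,738   644,746   464,522
MegaFlops           22.93     6.94      3.54      6.70      4.92      3.42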
For PNN, it starts with a convolutional layer with a single filter of the size and a certain stride. The output is then fed into the network described above.
We trained the networks using ADAM. We also did a grid search for hyperparameters. In general, we found that with learning rate , dropout rate generated the best results. Each network was trained for epochs with minibatches of the size .
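The weight counts reported in Table 2 can be reproduced with a short calculation. The sketch below assumes 5 × 5 filters, padding that preserves the spatial size, one bias per filter or unit, and a particular number of 2 × 2 pooling layers per stride; these settings are inferred here because they match the reported totals, and should be read as assumptions rather than as a statement of the authors' exact configuration.

```python
def count_weights(input_side, num_pools, filters=(32, 64), kernel=5,
                  fc_units=256, classes=10):
    """Trainable parameters of conv-conv-FC-softmax with size-preserving padding."""
    c1, c2 = filters
    conv1 = kernel * kernel * 1 * c1 + c1    # single input channel
    conv2 = kernel * kernel * c1 * c2 + c2
    side = input_side
    for _ in range(num_pools):               # each 2 x 2 pooling halves the side
        side //= 2
    dense = side * side * c2 * fc_units + fc_units
    output = fc_units * classes + classes
    return conv1 + conv2 + dense + output

# stride -> (input side length, assumed number of pooling layers)
configs = {1: (28, 2), 2: (14, 1), 3: (10, 1), 4: (7, 0), 5: (6, 0), 6: (5, 0)}
weights = {s: count_weights(side, pools) for s, (side, pools) in configs.items()}
```

Under these assumptions the network for stride 1 has 857,738 weights, and the networks for strides 3 and 6 both have 464,522, because a 10 × 10 input pooled once and a 5 × 5 input without pooling feed the same number of features into the fully connected layer.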
3.3 Results
As shown in Figure 3, for both datasets, both PNN and the circulant approximation achieve higher accuracy than downsampling and random convolution, especially when the data are heavily compressed. In most cases, the PNN method works slightly better than the circulant approximation. However, the circulant approximation method retains high accuracy even at more extreme compression rates on MNIST. Since PNN is a relaxation of the circulant approximation method, the global optimum of the former is never worse than that of the latter. What we observe on MNIST is a result of the training process, which offers no guarantee of reaching the global optimum of the neural network.
Another interesting result is that the PCA method seems not to be affected by extreme compression rates but rather benefits from them. This is probably because only the coefficients of the leading PCA components have a high signal-to-noise ratio, and the rest are mostly noise. Therefore, the PCA method performs better simply by discarding the noisy coefficients. For the purpose of our work, the PCA method cannot be compared with the other methods directly since the PCA matrix is not a convolution and cannot be implemented efficiently in hardware.
3.4 Reducing the number of filters
As we have seen in Table 2, because of the small image size, some networks with smaller inputs have more weights and flops than networks with larger inputs. To further reduce the number of weights and flops for networks with small inputs, we explore the compressibility of these networks by reducing the number of filters. For the case of stride 6, the network given in Table 2 has 32 filters in the first convolutional layer and 64 filters in the second one. Here, we consider using networks with far fewer filters. The architectures of these networks are listed below. The network with index 1 is the same as the network with stride 6 in Table 2.
Index               1         2         3         4         5
Input               5 × 5     5 × 5     5 × 5     5 × 5     5 × 5
First convolution   conv32    conv16    conv8     conv4     conv2
Second convolution  conv64    conv32    conv16    conv8     conv4
Fully connected     FC256     FC256     FC256     FC256     FC256
Output              softmax10 softmax10 softmax10 softmax10 softmax10
Weights             464,522   220,874   108,650   54,938    28,682
MegaFlops           3.42      1.07      0.37      0.15      0.06
The results of these neural networks for MNIST are shown in Figure 4. Both the PNN and the circulant approximation have relatively high accuracy rates even with a very small number of filters.
4 Appendix
4.1 Matrices with circulant structure
Circulant matrices can be thought of as discrete convolutions applicable to one-dimensional signals. They have the nice property that they can be diagonalized by the discrete Fourier transform matrix. For two-dimensional signals, the corresponding matrix is a block circulant matrix with circulant blocks (BCCB).
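The one-dimensional statement can be verified numerically with a few lines of NumPy. The helper `circulant` and the random test vectors are assumptions made for this sketch; the check uses the unitary DFT matrix and NumPy's FFT sign convention.

```python
import numpy as np

def circulant(c):
    """Circulant matrix whose first column is c."""
    n = len(c)
    return np.array([[c[(i - j) % n] for j in range(n)] for i in range(n)])

rng = np.random.default_rng(5)
n = 8
c = rng.standard_normal(n)
x = rng.standard_normal(n)

C = circulant(c)
F = np.fft.fft(np.eye(n)) / np.sqrt(n)   # unitary DFT matrix
D = F @ C @ F.conj().T                   # diagonal, with the DFT of c on the diagonal

# Multiplying by C is a circular convolution with c, computed here via the FFT.
conv_via_fft = np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))
```

The off-diagonal entries of `D` vanish up to round-off, and `C @ x` agrees with `conv_via_fft`, which is the sense in which a circulant matrix is a discrete convolution.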
Definition 3 (Block Circulant Matrix with Circulant Blocks).
An block circulant matrix with circulant blocks has an circulant block structure and each of its blocks is an circulant matrix itself. The matrix can be written as
The lemma below from [DAV79] (Theorem 5.8.1) shows that BCCBs can be diagonalized by the 2D DFT.
Lemma 1 (Diagonalization by 2D DFT).
Let be the two-dimensional unitary Discrete Fourier Transform matrix. Then, a matrix is a block circulant matrix with circulant blocks if and only if is a diagonal matrix. In particular, if is a block circulant matrix with circulant blocks and is its first column, then can be diagonalized by as
4.2 Complete commuting family of unitary matrices
Definition 4 (Commuting Family of Unitary Matrices).
is called a commuting family of unitary matrices if consists of unitary matrices of the same dimension that commute with each other.
Properties of commuting families of unitary matrices are presented below.
Definition 5 (Parametrization Map).
Let be a unitary matrix, then the parametrization map induced by is defined by
Lemma 2.
Let be a commuting family of unitary matrices. There exists a unitary matrix that diagonalizes any . Furthermore, the preimage of under is a subset of the dimensional torus , where .
Proof.
A commuting family of normal matrices can be simultaneously diagonalized by a unitary matrix, and unitary matrices are normal. Therefore, there exists a unitary matrix that diagonalizes any . For any , for some . Note that entries of are eigenvalues of since is unitary. Hence, since is a unitary matrix itself and eigenvalues of a unitary matrix are complex numbers with magnitude 1. Finally, is a one-to-one map and since . ∎
To ensure solvability of (4), we will need completeness for .
Definition 6 (Complete Commuting Family of Unitary Matrices).
A commuting family of unitary matrices is called complete if its preimage under the parametrization map is the entire torus. In other words, .
Proof of Corollary 1.
For any , is a diagonal matrix. This shows is a commuting family, and the unitary matrix that diagonalizes all elements of the family is . By its definition, consists of unitary matrices. To prove it is complete, we need to show , for any . Denote . On the one hand, thanks to Lemma 1, since is diagonal. On the other hand, is unitary since
Here, is the entrywise product and we used the fact that . This shows .
To show the last statement of the corollary, we will use the fact that the eigenvalues of a real circulant matrix are conjugate symmetric, and vice versa (cf. [DAV79], p. 7276.). For any , can be written as , where are circulant blocks of and . Hence, . Since and are real matrices, both and are diagonal matrices with conjugate symmetric diagonals. Then, and are normalized as in (5). The resulting vectors are still conjugate symmetric. Denote these two vectors and , both column vectors. Thus, , and where
Both and are real, and thus is also real. ∎
5 Discussion and Conclusion
We have introduced the first steps towards developing a principled hybrid hardware-software framework that has the potential to significantly reduce the computational complexity and memory requirements of on-device machine learning. At the same time, the proposed framework raises several challenging questions. How much can we compress data so that we can still conduct the desired machine learning task with high accuracy? In certain cases the literature offers some answers (see e.g. [RRC+13, TPG+16, MMV18]), but for more realistic scenarios research is still in its infancy. Moreover, ideas from transfer learning should help in designing efficient deep networks for compressive input. We hope to address some of these questions in our future work.
Acknowledgments
The authors want to thank Donald Pinckney for discussions and initial simulations related to the topic of this paper. The authors acknowledge partial support from NSF via grant DMS 1620455 and from NGA and NSF via grant DMS 1737943.
References
[AAN+17] (2017) Net-Trim: convex pruning of deep neural networks with performance guarantee. In Advances in Neural Information Processing Systems, pp. 3177–3186.
[CHS96] (1996) Packing lines, planes, etc.: packings in Grassmannian spaces. Experimental Mathematics 5 (2), pp. 139–159.
[DDW+07] (2007) The smashed filter for compressive classification and target recognition. In Computational Imaging V, Vol. 6498, pp. 64980H.
[DAV79] (1979) Circulant matrices. John Wiley & Sons.
[DP17] (2017) Implementation of efficient, low power deep neural networks on next-generation Intel client platforms. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6590–6591.
[DHY+17] (2017) More is less: a more complicated network with less inference complexity. In The IEEE International Conference on Computer Vision (ICCV).
[FR13] (2013) A mathematical introduction to compressive sensing. Applied and Numerical Harmonic Analysis, Springer, Basel.
[GOO18] (2018) Edge TPU. Note: https://cloud.google.com/edgetpu
[HMD15] (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149.
[HZC+17] (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
[HS10] (2010) Compressive spectral clustering. In AIP Conference Proceedings, Vol. 1281(1), pp. 1720–1722.
[IHM+16] (2016) SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360.
[INT18] (2018) Hardware. Note: https://ai.intel.com/hardware
[LRY+18] (2018) All-optical machine learning using diffractive deep neural networks. Science 361 (6406), pp. 1004–1008.
[LUW17] (2017) Bayesian compression for deep learning. In Advances in Neural Information Processing Systems, pp. 3288–3298.
[MMV18] (2018) SqueezeFit: label-aware dimensionality reduction by semidefinite programming. arXiv preprint arXiv:1812.02768.
[QUA17] (2017) We are making on-device AI ubiquitous. Note: https://www.qualcomm.com/news/onq/2017/08/16/wearemakingdeviceaiubiquitous
[RG75] (1975) Theory and application of digital signal processing. Englewood Cliffs, NJ, Prentice-Hall, Inc.
[RRC+13] (2013) Compressive classification. In 2013 IEEE International Symposium on Information Theory, pp. 674–678.
[ROM09] (2009) Compressive sensing by random convolution. SIAM Journal on Imaging Sciences 2 (4), pp. 1098–1128.
[RP12] (2012) Compressive clustering of high-dimensional data. In 2012 11th International Conference on Machine Learning and Applications, Vol. 1, pp. 380–385.
[SCC+16] (2016) Random projections through multiple optical scattering: approximating kernels at the speed of light. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pp. 6215–6219.
[TPG+16] (2016) Compressive spectral clustering. In International Conference on Machine Learning, pp. 1002–1011.
[ZUB19] (2019) The Age of Surveillance Capitalism: The Fight for the Future at the New Frontier of Power. PublicAffairs.