Convolutional Neural Networks (CNNs)  are the state-of-the-art technique for image classification, routinely achieving better-than-human performance. New CNN architectures and applications continue to emerge at a prodigious rate. More recently, substantial interest has arisen in compressing neural networks, including CNNs, to use fewer parameters and to require less memory so as to enable running on devices with limited size, weight, and power (SWaP). Note, “compression” in this context refers to reducing these computational and memory requirements while minimizing the effect on classification accuracy; this does not necessarily require that the compression can be reversed.
Compressing the network, however, addresses only one side of the coin: what about compressing the images
to which the CNN is applied? Though images are often stored in compressed form, CNN architectures currently uncompress all images prior to classifying them. Being able to compress the images also presents an additional advantage: given a dataset of large images and a network that expects small images, such a compression algorithm may preserve more information than extant techniques such as downgrading (DG) or cropping. Thus, this brief presents a compression algorithm that reduces the images’ size on disk and does not require (or even allow) the images to be uncompressed prior to being classified by the CNN.
To understand why extant compression algorithms are inadequate, we must consider how the CNN ingests the original image. The first layer of a CNN begins by ingesting a small “convolutional region” from the top-left of the image (the region's size is set by the CNN architecture). After processing this area, the convolutional region “strides” (is translated) s pixels to the right and the process repeats; in this way, the convolutional region “convolves” left-to-right, top-to-bottom across the matrix of pixel intensities (see Figure 1a). Thus, any effective compression scheme must preserve this locality, such that nearby pixels continue to correspond to semantically coherent information. It is this requirement that existing techniques, such as JPEG compression, fail to meet.
In response, we propose “Localized Compression” (LC). Rather than compressing the image as a whole, we divide the original image into b×b blocks and compress each block to c×c (with c < b). This reduces the number of pixels in the compressed image by a factor of (b/c)². While there are no restrictions on b, we require c to be chosen such that the stride s is divisible by c; this ensures that each convolutional region receives the compressed pixels in the same relative order. This is illustrated in Figure 1b.
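As a concrete sanity check, the size relationships above can be verified mechanically. This sketch assumes the notation introduced here (original block size b, compressed block size c, convolutional stride s); the specific numbers in the usage example are illustrative and not taken from the text.

```python
def lc_geometry(image_side, b, c, s):
    """Check Localized Compression's size constraints and report the
    compressed-image geometry for a square single-channel image."""
    assert image_side % b == 0, "b x b blocks must tile the image exactly"
    assert s % c == 0, "the stride s must be divisible by c"
    blocks_per_side = image_side // b
    compressed_side = blocks_per_side * c      # each b x b block becomes c x c
    pixel_reduction = (b / c) ** 2             # factor by which pixel count shrinks
    return compressed_side, pixel_reduction

# Illustrative values (not from the text): a 224x224 image, 8x8 blocks
# compressed to 4x4, and a first-layer stride of 4.
side, factor = lc_geometry(224, b=8, c=4, s=4)
```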
In principle, we could use standard compression techniques like JPEG compression or Principal Component Analysis (PCA) to compress these b×b blocks to c×c blocks. In practice, however, most modern CNN architectures have very small values of s, such as 4 (AlexNet) or even 2 (DenseNet). We must therefore compress already-small blocks into much smaller blocks. This rules out many well-established (reversible) compression techniques, including PCA and JPEG compression.
Instead, we consider two solutions that are compatible with such small sizes: random matrix multiplication (RMM) and percentile-based sampling. RMM entails multiplying the original matrix by random matrices of the appropriate dimensions; this has proven effective on related problems. Percentile-based sampling techniques entail retaining the maximum, minimum, and other values from the original matrix.
We evaluate LC and compare these different options using two standard datasets, ImageNet and the German Traffic Sign Recognition Benchmark, and two different CNN architectures, AlexNet and DenseNet. Both LC and DG produce images of the same size and therefore offer the same reduction in storage and processing requirements; we therefore compare them in terms of classification accuracy. Our results show that LC with percentile-based sampling is approximately 2% more accurate than DG at high compression ratios.
The remainder of this brief is organized as follows. Section II describes related work. Section III provides more detail on random-matrix-based, sampling-based, and other techniques for representing a b×b matrix with a c×c matrix, while Section IV formally defines using these techniques for LC. Section V shows numerical results from applying LC to standard datasets with existing network architectures. Section VI contains a brief digression in which we apply random matrices to the related problem of compressing fully-connected layers. Finally, we draw conclusions in Section VII.
II Related Work
To our knowledge, there is no past work that addresses applying CNNs to compressed images (i.e., without immediately uncompressing each image). There is, however, much related work about compressing the CNN itself, and about compressing the inputs to other types of classifiers.
With respect to compressing the CNN itself, research has focused on four key areas: (1) reducing the number of network parameters by pruning and sharing, (2) using low-rank factorization to compress the network weights, (3) representing convolutional filters as transformations of a small number of base filters, and (4) transferring the essential knowledge from a deep network to a shallower network (knowledge distillation). Of these, using low-rank factorization to compress the network weights is most similar to our paradigm. In particular, Denton et al.
showed that performing tensor decompositions (based on the singular value decomposition) on trained convolutional layers can significantly accelerate CNNs with minimal loss in classification accuracy. While we also consider extensions of the singular value decomposition (Section III), our work is different in that we compress the inputs to the CNN (images) rather than the CNN itself.
Compressing inputs to simple classifiers (single-layer perceptrons) has also been thoroughly studied. In particular, Wimalajeewa and Varshney studied compressing high-dimensional vector input (e.g., telemetry data) to deep, fully-connected neural networks. While we also consider random matrices, we apply them to CNNs rather than simple classifiers or fully-connected neural networks.
Section VI discusses using random matrices to compress a CNN’s fully-connected layers. Here, there is substantial related work: in particular, Cheng et al. have shown that circulant projection matrices (a subset of random projection matrices) efficiently compress fully-connected layers in CNNs, and Wójcik et al. consider random projection matrices for the same purpose in fully-connected (non-convolutional) neural networks. Our work on the fully-connected layers fills in the gap, using random, non-circulant matrices to compress fully-connected layers in CNNs.
III Block Compression Techniques

We begin by considering how to compress two-dimensional b×b blocks to c×c, with c < b (in this work, we treat each channel separately). We refer to the b×b block as B. Here, we consider five options.
Downgrading (DG) entails taking a weighted average or interpolation of neighboring pixels. This technique is already widely used: raw images are typically down- or up-sampled (as well as reshaped or cropped) to a standard size prior to applying the CNN. We do not consider DG a form of LC, since downgraded images are matrices of uncompressed pixels just like the uncompressed images. Rather, in this work, we use DG (as implemented in OpenCV’s INTER_AREA algorithm) as the baseline against which LC is compared.
Principal Component Analysis (PCA) is a widely-used compression procedure in which an image is approximated as a linear combination of its k principal eigenvectors. We apply PCA to B and store the resulting parameters in a c×c matrix, padding with zeros as needed. We must therefore choose k such that the number of PCA parameters does not exceed c²; each retained component requires two length-b vectors and one scalar coefficient. Concretely, we must require that:

k(2b + 1) ≤ c².    (1)
Though PCA has been widely studied, it has two disadvantages. First, its computational complexity is very high, as eigenvectors must be calculated for each block. Second, Equation 1 implies that PCA is simply incompatible with some dimensionalities: whenever 2b + 1 > c², it is impossible to represent even a single principal component of a b×b block in a c×c block. For these reasons, we do not consider PCA further in this work.
Percentiles. With percentile-based sampling, we sort the uncompressed points by their intensity and then sample from this distribution at pre-determined percentile values (e.g., the minimum, 33rd percentile, 67th percentile, and maximum). The computational complexity of this compression technique is somewhat high, as each block’s values must be sorted.
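A minimal sketch of percentile-based sampling follows. The assumption that the c² retained values are evenly spaced order statistics, from the minimum up to the maximum, is ours; the text specifies only an example set of percentiles.

```python
import numpy as np

def percentile_compress(block, c):
    """Compress a square block to c x c by keeping c**2 evenly spaced order
    statistics of its intensities, from the minimum up to the maximum."""
    values = np.sort(block, axis=None)                   # flatten and sort
    idx = np.linspace(0, values.size - 1, num=c * c)     # evenly spaced ranks
    return values[idx.round().astype(int)].reshape(c, c)
```

For c = 2 this keeps the minimum, the maximum, and two interior order statistics near the 33rd and 67th percentiles, matching the four-value example above.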
Random Matrix Multiplication (RMM). Some recent work in compressive sensing has looked at performing dimensionality reduction as a precursor to classification by multiplying the original features on the left by a (sparse) random matrix. In our case, we define x as the vectorized form of B. We then fill a c² × b² matrix, R, with values randomly drawn according to:

R_ij = +1 with probability ρ/2; 0 with probability 1 − ρ; −1 with probability ρ/2,    (2)

where ρ is the fraction of nonzero entries (lower values of ρ are more computationally efficient; in this work, we fix ρ to a small constant). We then perform dimensionality reduction according to y = Rx and reshape the resulting vector y into a c×c block.
Random Matrix Sketching (MS). Similar to RMM, MS fills a b×c matrix S according to equation (2) (we again fix ρ to a small constant) and then compresses according to B′ = SᵀBS. This smaller random matrix further reduces the compression technique’s computational complexity.
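Both random-matrix techniques can be sketched as follows. The ±1 sparse distribution and the sketching form B′ = SᵀBS follow the descriptions above; the density ρ = 0.1 used here is an illustrative value, not the paper's setting.

```python
import numpy as np

def sparse_random_matrix(rows, cols, rho, rng):
    """Entries are +1 or -1, each with probability rho/2, and 0 otherwise
    (the sparse-random-projection distribution of Eq. (2))."""
    u = rng.random((rows, cols))
    return np.where(u < rho / 2, 1.0, np.where(u < rho, -1.0, 0.0))

def rmm_compress(block, c, rho=0.1, rng=None):
    """RMM: vectorize the b x b block, multiply by a c**2 x b**2 random
    matrix, and reshape the result back to c x c."""
    rng = np.random.default_rng(0) if rng is None else rng
    b = block.shape[0]
    R = sparse_random_matrix(c * c, b * b, rho, rng)
    return (R @ block.reshape(-1)).reshape(c, c)

def ms_compress(block, c, rho=0.1, rng=None):
    """MS: sketch with a single b x c random matrix S, via S.T @ B @ S."""
    rng = np.random.default_rng(0) if rng is None else rng
    b = block.shape[0]
    S = sparse_random_matrix(b, c, rho, rng)
    return S.T @ block @ S
```

Note that MS draws one b×c matrix rather than RMM's c²×b² matrix, which is the source of its lower computational cost.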
IV Localized Compression

Localized Compression entails using the techniques in Section III to compress entire images. We refer to the entire uncompressed image as X. We begin by defining b×b blocks over X. We compress each channel separately and so consider only two dimensions here. In principle, these blocks can be offset from one another by any number of pixels (i.e., an uncompressed pixel can be in zero, one, or multiple blocks), but for simplicity, we consider only blocks that completely tile the image with no overlap (i.e., each uncompressed pixel is in exactly one block).
Algorithms 1 and 2 formally define LC for single-channel images (multi-channel images simply apply the compression operation to each channel separately). Algorithm 1 (“inline mode”) is a proof-of-concept in which the compression is performed at runtime: that is, we simply run the CNN as normal, inserting a step wherein we apply the compression operation to each block and then apply the normal convolutional operation to the resulting block. This is conceptually straightforward, but offers little or no savings in terms of storage efficiency (the uncompressed images must be stored) or computational efficiency (the reduction in learned convolutional parameters is roughly offset by the addition of compression operations). Algorithm 2 (“default mode”) makes some adjustments such that the compression is performed prior to runtime. Default mode achieves the same storage and computational efficiency as DG; however, this introduces some complications with respect to data augmentation (described below).
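The compression step shared by both algorithms (tile the image into b×b blocks, compress each block, and reassemble) might look like the following sketch; average_compress is a stand-in per-block compressor used only for illustration, not one of the paper's methods.

```python
import numpy as np

def average_compress(block, c):
    """Stand-in per-block compressor: c x c block means (assumes c divides b)."""
    b = block.shape[0]
    f = b // c
    return block.reshape(c, f, c, f).mean(axis=(1, 3))

def localized_compress(image, b, c, compress_block):
    """Tile a single-channel image into non-overlapping b x b blocks and
    replace each block with its c x c compression."""
    h, w = image.shape
    assert h % b == 0 and w % b == 0, "blocks must tile the image exactly"
    out = np.empty((h // b * c, w // b * c))
    for i in range(h // b):
        for j in range(w // b):
            block = image[i * b:(i + 1) * b, j * b:(j + 1) * b]
            out[i * c:(i + 1) * c, j * c:(j + 1) * c] = compress_block(block, c)
    return out

# An 8x8 image with 4x4 blocks compressed to 2x2 yields a 4x4 image.
small = localized_compress(np.arange(64.0).reshape(8, 8), 4, 2, average_compress)
```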
Algorithm 1 (“inline mode”) begins by resizing each image to the reference size and writing these images to disk. We then cycle through the images as normal. For each image, we apply data augmentation operations with randomly-drawn parameters: these operations may include cropping, taking a left-right flip, or any other data augmentation strategy. We then locally compress each image and classify it using the CNN. Note, when compressing with random matrices, we use the same random matrix for each image.
Algorithm 2 (“default mode”) differs in that it writes to disk after performing the data augmentation and localized compression. It also requires that s be divisible by c so that the convolutional region will always stride over an integer number of compressed blocks. In this way, only the compressed images are written to disk (reducing the storage requirement), and the compression operations must only be performed once (reducing the computational requirement). The challenge with this ordering is that after compression, the full suite of data augmentation techniques can no longer be used; instead, only a limited set of data augmentation techniques can be applied. In particular:
Crops. It is customary to take the final crop during data augmentation (i.e., after resizing the image to the reference size). In default mode, this is still possible; however, the crops must not be allowed to sub-divide the compressed blocks.
Flips. It is customary to take left-right flips of the image during data augmentation. In default mode, it is still possible to reverse the ordering of the blocks; however, the internal structure of each block must not be changed.
Other data augmentation schemes may or may not be applicable post-compression.
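The two block-safe augmentations can be sketched as follows for a single-channel compressed image made of c×c blocks; the alignment checks encode the constraints above.

```python
import numpy as np

def block_aligned_crop(image, c, top, left, out_h, out_w):
    """Crop a compressed image without sub-dividing its c x c blocks:
    every offset and output size must be a multiple of c."""
    for v in (top, left, out_h, out_w):
        assert v % c == 0, "crop must align to block boundaries"
    return image[top:top + out_h, left:left + out_w]

def block_order_flip(image, c):
    """Left-right flip that reverses the order of the c-wide block columns
    but leaves each block's internal pixel order unchanged."""
    w = image.shape[1]
    cols = [image[:, j * c:(j + 1) * c] for j in range(w // c)]
    return np.concatenate(cols[::-1], axis=1)
```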
We therefore expect that networks trained in default mode will be somewhat less accurate than networks trained in inline mode. To bridge this gap, we allow default mode to produce K copies of each image. These copies are produced using the full suite of data augmentation techniques; at runtime, we randomly select one of these K images and then apply the limited set of data augmentation techniques to achieve further augmentation. We therefore expect that increasing K will increase our classification accuracy, but will also increase our storage requirements.
V Numerical Experiments
We test our procedure on two network architectures and two datasets. Our architectures are AlexNet and DenseNet: AlexNet is a dated architecture that has been widely used to evaluate compression algorithms, while DenseNet is a more modern architecture that achieves considerably higher accuracy. Both architectures require input images of a uniform size; we use a single fixed reference size for all images. Our datasets are the German Traffic Sign Recognition Benchmark (GTSRB) (39K training images over 43 classes) and ImageNet 2012 (1.3M training images over 1000 classes). While these are both standard datasets for classification challenges, a key difference is that most ImageNet images are larger than the reference size, whereas most GTSRB images are smaller than the reference size. We expect that LC will be more effective on large images (as there is more information to exploit).
We base our implementation of the networks, including parameters such as weight decay, on those from TensorFlow Slim. In all tests (except where indicated), we begin by resizing and reshaping the images to slightly larger than the reference size, cropping a random reference-sized patch from this resized image, and then performing a left-right flip at random. All evaluation is performed with a single center crop and no left-right flip. We train all networks with the Momentum Optimizer with momentum 0.9 and a learning rate that begins at 0.01 and is reduced by an order of magnitude every 20 epochs, for a total of 65 epochs. This simple scheme is fully network-agnostic and offers relatively fast training times while giving top-1 accuracies only slightly lower than those reported by the network authors.
Our first test compares the percentile, RMM, and MS compression algorithms (as described in Section III) against the baseline of simply downgrading the images to the equivalent size. We use inline mode to allow all algorithms to use identical, off-the-shelf dataset augmentation techniques (as described in Section IV). We fix b and c to the same values for all methods; thus, the final images are much smaller than the reference size. Note, these small image sizes require removing the last pooling layer from AlexNet. Our results, shown in Table I, illustrate that LC is viable: all methods give results within a few percent of the baseline (DG), and the percentile method gives an accuracy 1-2% higher than DG. We therefore perform LC with the percentile-based compression technique in the remainder of this work.
Our second test validates default mode. In particular, we compare default mode’s accuracy for various values of K against the accuracy achieved by inline mode. In addition to classification accuracy, Table II also shows the uncompressed-to-compressed storage ratio (SR) and the uncompressed-to-compressed computational ratio (CR). The results show that LC is still viable in default mode: even with a single stored copy per image, LC remains more accurate than DG. Increasing K further improves the classification accuracy; however, continuing to increase K shows only a modest improvement. In the remainder of this work, we use default mode with a small fixed K.
Third, we test LC for different compression ratios. In particular, we keep c fixed and vary the block size b; larger values of b therefore result in smaller compressed image sizes. Our results for LC and DG are given in Table III, and show that LC consistently outperforms DG for significant compression ratios (i.e., reducing the number of pixels by more than a factor of 4), but the advantage is less clear for smaller compression ratios.
Finally, we consider applying LC to larger images: rather than beginning with reference-sized images and compressing, we begin with large images and compress down to the reference size. To evaluate this, we consider only the ImageNet dataset, and select only those images whose resolution exceeds a fixed threshold (of which there are 30,192 for training and 1,110 for testing). Our results are given in Table IV. Though the CNNs are clearly data-starved, our results suggest that LC gives higher accuracy than DG for larger images just as it did for smaller images.
VI Applying Random Matrices to Fully-Connected Layers
We now take a brief digression to consider a related problem: using the techniques of Section III to compress the fully-connected layers inside the CNN itself. As discussed in Section II, substantial work has been put into compressing these fully-connected layers in deep neural networks; our contribution is to extend this work by applying static, non-circulant random matrices to CNNs. Our strategy is to introduce a new deterministic layer immediately prior to each fully-connected layer that compresses the inputs to the following layer. The percentile-based sampling method, though effective for LC, is not an appropriate choice for this layer, as it would continually reorder the features. We therefore select the MS-based technique for this layer; this efficiently and dramatically reduces the number of weights in the hidden layer.
As a numerical example, we again consider the AlexNet architecture, which contains three fully-connected layers, the first of which receives 6400 inputs. We reshape these inputs into a matrix and multiply on the left by a random matrix, producing 3328 values, which we feed into the next fully-connected layer. We repeat this for each fully-connected layer. We can also reduce the number of nodes in each FC layer to compensate for the reduced number of inputs.
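A sketch of such a deterministic sketching layer follows. The 6400-to-3328 dimensions follow the example above, but the intermediate reshape is not specified here, so this version simply multiplies the flattened activations by a single fixed sparse random matrix; the sparsity ρ = 0.1 is an illustrative value.

```python
import numpy as np

def compress_fc_inputs(features, out_dim, rho=0.1, seed=0):
    """Deterministic sketching layer: multiply flattened activations by a
    fixed sparse random matrix, shrinking the next FC layer's fan-in."""
    rng = np.random.default_rng(seed)          # fixed seed -> static layer
    in_dim = features.shape[-1]
    u = rng.random((out_dim, in_dim))
    R = np.where(u < rho / 2, 1.0, np.where(u < rho, -1.0, 0.0))
    return features @ R.T

# Shrinking a 6400-dimensional activation to 3328 dimensions, as in the
# example above.
x = np.ones((1, 6400))
y = compress_fc_inputs(x, out_dim=3328)
```

Because the seed is fixed, the layer is static: it adds no trainable parameters and produces the same projection at training and inference time.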
Table V shows our results from applying this to the GTSRB and ImageNet datasets with the AlexNet network architecture (we did not consider DenseNet, as it contains only a single fully-connected layer). Table V reports the compression ratio for each of the three fully-connected layers (FC1, FC2, and FC3) and the number of nodes contained in each FC layer (nNodes). Compared to the uncompressed AlexNet, we observe a 1% increase in accuracy on GTSRB when compressing the network by up to 72%, and less than a 1% decrease in accuracy on ImageNet when compressing the network by 45%. The counter-intuitive increase in accuracy for GTSRB may be because the original images are so small that the large network has too many parameters relative to the images’ information content.
VII Conclusions

The primary contribution of this work is to introduce Localized Compression (LC), an alternative to downgrading when CNNs require an image size much smaller than the original image’s resolution. In some sense, LC is a generalization of downgrading: downgrading always performs some form of pixel averaging, whereas LC supports different compression techniques and different values of c. The most successful compression technique, percentile-based sampling, could be viewed as applying a generalization of a pooling layer to the original image. By choosing c such that c divides the stride s, LC supports any network architecture.
We also extended previous work on applying sparse random matrices to deep neural networks. Though percentile-based sampling outperformed random-matrix-based techniques on LC, we showed that sparse random matrices are an effective way to compress both convolutional and fully-connected layers in CNNs.
With respect to LC, our results show that when it is used with percentile-based sampling and relatively high compression ratios, LC gives a 1-2% accuracy improvement over downgrading. LC can therefore be useful in applications where the average image size is much larger than a CNN’s reference size. This is potentially a useful capability: many modern cameras can produce high-resolution images; LC provides a way to exploit the extra information that such devices provide without increasing the computational or SWaP requirements.
The authors would like to acknowledge Profs. Pramod Varshney and Thakshila Wimalajeewa at Syracuse University for sharing their expertise with using random matrices for classification.
This work was supported by the Air Force Research Laboratory under contract FA8650-17-C-1154.
- Y. Cheng, D. Wang, P. Zhou, and T. Zhang (2017). A survey of model compression and acceleration for deep neural networks. CoRR abs/1710.09282.
- Y. Cheng, F. X. Yu, R. S. Feris, S. Kumar, A. Choudhary, and S.-F. Chang (2015). An exploration of parameter redundancy in deep networks with circulant projections. Proceedings of the IEEE International Conference on Computer Vision, pp. 2857–2865.
- J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009). ImageNet: a large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
- E. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus (2014). Exploiting linear structure within convolutional networks for efficient evaluation. ArXiv e-prints.
- (2018). Website.
- I. Goodfellow, Y. Bengio, and A. Courville (2016). Deep learning. MIT Press.
- S. Han, H. Mao, and W. J. Dally (2015). Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. CoRR abs/1510.00149.
- G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger (2016). Densely connected convolutional networks. CoRR abs/1608.06993.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems.
- (2018). Website.
- (2016). Website.
- J. Shlens (2014). A tutorial on principal component analysis. CoRR abs/1404.1100.
- J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel (2012). Man vs. computer: benchmarking machine learning algorithms for traffic sign recognition. Neural Networks.
- (2013). Recovery of sparse matrices via matrix sketching. ArXiv e-prints.
- (2018). Impact of very sparse random projections on compressed classification. Publication forthcoming.
- P. I. Wójcik and M. Kurdziel. Training neural networks on high-dimensional data using random projection. Pattern Analysis and Applications.