I Introduction
Convolutional Neural Networks (CNNs) [6] are the state-of-the-art technique for image classification, routinely achieving better-than-human performance. New CNN architectures and applications continue to emerge at a prodigious rate. More recently, substantial interest has arisen in compressing neural networks, including CNNs, to use fewer parameters and to require less memory so as to enable running on devices with limited size, weight, and power (SWaP). Note, “compression” in this context refers to reducing these computational and memory requirements while minimizing the effect on classification accuracy; it does not necessarily require that the compression be reversible.
Compressing the network, however, addresses only one side of the coin: what about compressing the images to which the CNN is applied? Though images are often stored in compressed form, CNN architectures currently uncompress all images prior to classifying them. Being able to compress the images also presents an additional advantage: given a dataset of large images and a network that expects small images, such a compression algorithm may preserve more information than extant techniques such as downgrading (DG) or cropping. Thus, this brief presents a compression algorithm that reduces the images’ size on disk and does not require (or even allow) the images to be uncompressed prior to being classified by the CNN.
To understand why extant compression algorithms are inadequate, we must consider how the CNN ingests the original image. The first layer of a CNN begins by ingesting a small k × k “convolutional region” from the top-left of the image (the value of k is set by the CNN architecture). After processing this area, the convolutional region “strides” (is translated) s pixels to the right and the process repeats; in this way, the convolutional region “convolves” left-to-right, top-to-bottom across the matrix of pixel intensities (see Figure 1a). Thus, any effective compression scheme must preserve localization, such that nearby pixels generally correspond to semantically coherent information. It is this requirement that existing techniques, such as JPEG compression, fail to meet.

In response, we propose “Localized Compression” (LC). Rather than compressing the image as a whole, we divide the original image into b × b blocks and compress each block to c × c (with c < b). This reduces the number of pixels in the compressed image by a factor of (b/c)². While there are no restrictions on b, we require c to be chosen such that the stride s is divisible by c; this ensures that each convolutional region receives the compressed pixels in the same relative order. This is illustrated in Figure 1b.
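The block partitioning above is straightforward to sketch with array reshaping. The function and variable names below are illustrative, not from the original work; this only shows the non-overlapping b × b tiling that LC assumes, for a single channel.

```python
import numpy as np

def to_blocks(img, b):
    """Tile a single-channel N x N image into non-overlapping b x b blocks.

    Assumes N is divisible by b (images are resized beforehand).
    Returns an array of shape (N//b, N//b, b, b).
    """
    n = img.shape[0]
    assert img.shape == (n, n) and n % b == 0
    return img.reshape(n // b, b, n // b, b).transpose(0, 2, 1, 3)

def from_blocks(blocks):
    """Inverse of to_blocks: reassemble blocks into a full image."""
    r, _, b, _ = blocks.shape
    return blocks.transpose(0, 2, 1, 3).reshape(r * b, r * b)

img = np.arange(14 * 14, dtype=float).reshape(14, 14)
blocks = to_blocks(img, 7)  # four 7 x 7 blocks
assert np.array_equal(from_blocks(blocks), img)
```

Each b × b block would then be compressed to c × c independently, preserving the left-to-right, top-to-bottom ordering the convolutional region relies on.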
In principle, we could use standard compression techniques like JPEG compression or Principal Component Analysis (PCA) to compress these b × b blocks to c × c blocks. In practice, however, most modern CNN architectures have very small strides s, such as 4 (AlexNet) or even 2 (DenseNet). We must therefore compress already-small blocks (e.g., 7 × 7) into much smaller blocks (e.g., 2 × 2). This rules out many well-established (reversible) compression techniques, including PCA and JPEG compression.

Instead, we consider two solutions that are compatible with such small sizes: random matrix multiplication (RMM) and percentile-based sampling. RMM entails multiplying the original matrix by random matrices of the appropriate dimensions; this has proven effective on related problems [15], [14], [16]. Percentile-based sampling entails retaining the maximum, minimum, and other percentile values from the original matrix.
We evaluate LC and compare these different options using two standard datasets, ImageNet [3] and the German Traffic Sign Recognition Benchmark [13], and two different CNN architectures, AlexNet [9] and DenseNet [8]. Both LC and DG produce images of the same size and therefore offer the same reduction in storage and processing requirements; therefore, we compare them in terms of classification accuracy. Our results show that LC with percentile-based sampling is approximately 2% more accurate than DG when b = 7 and c = 2.

The remainder of this brief is organized as follows. Section II describes related work. Section III provides more detail on random-matrix-based, sampling-based, and other techniques to represent a b × b matrix with a c × c matrix, while Section IV formally defines using these techniques for LC. Section V shows numerical results from applying LC to standard datasets with existing network architectures. Section VI contains a brief digression in which we apply random matrices to the related problem of compressing fully-connected layers. Finally, we draw conclusions in Section VII.
II Related Work
To our knowledge, there is no past work that addresses applying CNNs to compressed images (i.e., without immediately uncompressing each image). There is, however, much related work about compressing the CNN itself, and about compressing the inputs to other types of classifiers.
With respect to compressing the CNN itself, research has focused on four key areas: (1) reducing the number of network parameters by pruning and sharing, (2) using low-rank factorization to compress the network weights, (3) representing convolutional filters as transformations of a small number of base filters, and (4) transferring the essential knowledge from a deep network to a shallower network (knowledge distillation) [7], [1]. Of these, using low-rank factorization to compress the network weights is most similar to our paradigm. In particular, Denton et al. [4] showed that performing tensor decompositions (based on the singular value decomposition) on trained convolutional layers can significantly accelerate CNNs with minimal loss in classification accuracy. While we also consider extensions of the singular value decomposition (Section III), our work is different in that we compress the inputs to the CNN (images) rather than the CNN itself.

Compressing inputs to simple classifiers (single-layer perceptrons) has also been thoroughly studied. In particular, Wimalajeewa and Varshney [15] recently considered sparse random matrices to compress the input to simple classifiers (nearest-neighbor classifiers, random forests, and support vector machines), and Wójcik et al. [16] studied compressing high-dimensional vector input (e.g., telemetry data) to deep, fully-connected neural networks. While we also consider random matrices, we apply them to CNNs rather than simple classifiers or fully-connected neural networks.

Section VI discusses using random matrices to compress a CNN’s fully-connected layers. Here, there is substantial related work: in particular, Cheng et al. [2] showed that circulant projection matrices (a subset of random projection matrices) efficiently compress fully-connected layers in CNNs, and Wójcik et al. [16] considered random projection matrices for the same purpose in fully-connected (non-convolutional) neural networks. Our work on the fully-connected layers fills in the gap, using random, non-circulant matrices to compress fully-connected layers in CNNs.
III Theory
We begin by considering how to compress two-dimensional b × b blocks to c × c blocks with c < b (in this work, we treat each channel separately). We refer to the b × b block as B. Here, we consider five options.
Downgrading (DG) entails taking a weighted average or interpolation of neighboring pixels. This technique is already widely used: raw images are typically down- or up-sampled (as well as reshaped or cropped) to a standard size prior to applying the CNN. We do not consider DG a form of LC, since downgraded images are simply smaller matrices of ordinary pixels, just like the uncompressed images. Rather, in this work, we use DG (as implemented in OpenCV’s INTER_AREA algorithm [5]) as the baseline against which LC is compared.

Principal Component Analysis (PCA) [12] is a widely-used compression procedure in which an image is approximated as a linear combination of its first k principal eigenvectors. We apply PCA to B and store the resulting parameters in a c × c matrix, padding with zeros as needed. We must therefore choose k such that the number of PCA parameters does not exceed c². Each retained component of a b × b block requires a singular value plus a left and a right vector (2b + 1 values), so concretely, we must require that:

k(2b + 1) ≤ c²     (1)
Though PCA has been widely studied, it has two disadvantages. First, its computational complexity is very high, as eigenvectors must be calculated for each block. Second, Equation 1 implies that PCA is simply incompatible with some dimensionalities. For example, it is impossible to represent even a single principal component of a b × b block in a c × c block if b = 7 and c = 2. For these reasons, we do not consider PCA further in this work.
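The feasibility check in Equation 1 can be written out directly. This is a small illustrative helper (the function name is ours, and the k(2b + 1) parameter count assumes one singular value plus one left and one right vector per component, as described above):

```python
def pca_fits(b, c, k=1):
    """Check whether a rank-k approximation of a b x b block
    (k singular values plus k left and k right vectors, i.e.
    k*(2b+1) values) fits in the c*c budget of a compressed block."""
    return k * (2 * b + 1) <= c * c

# With b = 7 and c = 2, even one principal component (15 values)
# cannot fit in the 4 available slots:
assert not pca_fits(7, 2)
# A hypothetical 4 x 4 budget (16 slots) could hold one component:
assert pca_fits(7, 4)
```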
Percentiles. With percentile-based sampling, we sort the uncompressed pixels by their intensity and then sample from this distribution at c² predetermined percentile values (e.g., for c = 2, the minimum, 33rd percentile, 67th percentile, and maximum). The computational complexity of this compression technique is somewhat high, as each block’s values must be sorted.
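A minimal sketch of percentile-based sampling follows (names and the evenly-spaced sampling scheme are our illustration; the paper specifies only that the minimum, maximum, and intermediate percentiles are retained):

```python
import numpy as np

def percentile_compress(block, c):
    """Compress a b x b block to c x c by sampling c*c evenly spaced
    percentiles of the sorted intensities (min and max included)."""
    flat = np.sort(block.ravel())
    # c*c sample positions from the 0th to the 100th percentile
    idx = np.linspace(0, flat.size - 1, c * c).round().astype(int)
    return flat[idx].reshape(c, c)

block = np.arange(49, dtype=float).reshape(7, 7)  # intensities 0..48
out = percentile_compress(block, 2)
# retains the minimum, ~33rd percentile, ~67th percentile, and maximum
assert out[0, 0] == 0.0 and out[1, 1] == 48.0
```

Note that, like a pooling layer, this keeps actual pixel values from the block rather than synthesizing new ones.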
Random Matrix Multiplication (RMM). Some recent work in compressive sensing [15] has looked at performing dimensionality reduction as a precursor to classification by multiplying the original features on the left by a (sparse) random matrix. In our case, we define x as the vectorized (b² × 1) form of B. We then fill a c² × b² matrix, Φ, with values randomly drawn according to:

φᵢⱼ = { +1 with probability ρ/2;  0 with probability 1 − ρ;  −1 with probability ρ/2 }     (2)

where ρ is the density of the matrix (lower values of ρ are more efficient; we use a single fixed value of ρ throughout this work). We then perform dimensionality reduction according to y = Φx and reshape the resulting c² × 1 vector y to c × c.
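A sketch of RMM under the notation above (names are ours, and the density value of 1/3 is an assumption for illustration; the paper fixes a single value of ρ but we treat the specific choice as unspecified):

```python
import numpy as np

def sparse_random_matrix(rows, cols, density, rng):
    """Draw a sparse {+1, 0, -1} matrix per Eq. (2): +/-1 each with
    probability density/2, zero otherwise (scaling constants omitted)."""
    u = rng.random((rows, cols))
    return np.where(u < density / 2, 1.0,
                    np.where(u < density, -1.0, 0.0))

def rmm_compress(block, c, phi):
    """Vectorize the b x b block, multiply by the c^2 x b^2 random
    matrix phi, and reshape the result to c x c."""
    x = block.reshape(-1)  # b*b vector
    return (phi @ x).reshape(c, c)

rng = np.random.default_rng(0)
b, c = 7, 2
phi = sparse_random_matrix(c * c, b * b, density=1 / 3, rng=rng)
out = rmm_compress(np.ones((b, b)), c, phi)
assert out.shape == (c, c)
```

Consistent with Section IV, the same Φ would be drawn once and reused for every block of every image.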
Random Matrix Sketching (MS). Similar to RMM, MS fills a c × b matrix Φ according to Equation (2) (with the same fixed ρ) and then compresses according to Y = Φ B Φᵀ. This smaller random matrix further reduces the compression technique’s computational complexity.
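The two-sided sketch is a one-liner given the same sparse matrix generator (again, names and the 1/3 density are our illustrative assumptions):

```python
import numpy as np

def ms_compress(block, phi):
    """Two-sided sketch: phi is a small c x b random matrix drawn as in
    Eq. (2); Y = phi @ B @ phi.T is the compressed c x c block."""
    return phi @ block @ phi.T

rng = np.random.default_rng(1)
b, c = 7, 2
u = rng.random((c, b))
# +/-1 each with probability 1/6, zero otherwise (density 1/3 assumed)
phi = np.where(u < 1 / 6, 1.0, np.where(u < 1 / 3, -1.0, 0.0))
out = ms_compress(np.ones((b, b)), phi)
assert out.shape == (c, c)
```

Compared with RMM’s c² × b² matrix, the c × b sketch matrix stores and multiplies far fewer entries.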
IV Method
Localized Compression entails using the techniques in Section III to compress entire images. We refer to the entire uncompressed image as X. We begin by defining b × b blocks over X (with b > c). We compress each channel separately and so consider only two dimensions here. In principle, these blocks could be offset from one another by any number of pixels (i.e., an uncompressed pixel could be in zero, one, or multiple blocks), but for simplicity, we consider only blocks that completely tile the image with no overlap (i.e., each uncompressed pixel is in exactly one block).
Algorithms 1 and 2 formally define LC for single-channel images (multi-channel images simply apply the compression operation to each channel separately). Algorithm 1 (“inline mode”) is a proof-of-concept in which the compression is performed at runtime: that is, we simply run the CNN as normal, inserting a step wherein we apply the compression operation to each block and then apply the normal convolutional operation to the resulting block. This is conceptually straightforward, but offers little or no savings in terms of storage efficiency (the uncompressed images must be stored) or computational efficiency (the reduction in learned convolutional parameters is roughly offset by the addition of compression operations). Algorithm 2 (“default mode”) makes some adjustments such that the compression is performed prior to runtime. Default mode achieves the same storage and computational efficiency as DG; however, it introduces some complications with respect to data augmentation (described below).
Algorithm 1 (“inline mode”) begins by resizing each image to a standard size and writing these images to disk. We then cycle through the images as normal. For each image, we use the data augmentation operations with randomly-drawn parameters to modify the image: these operations may include cropping to the reference size, taking a left-right flip, or any other data augmentation strategy. We then locally compress each image and classify it using the CNN. Note, when compressing with random matrices, we use the same random matrix for every image.
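The compression step shared by both algorithms can be sketched as a loop over the non-overlapping blocks. This is our illustrative rendering, not the paper’s implementation; the stand-in block compressor below is a hypothetical max-pool, where any of the Section III techniques would be plugged in:

```python
import numpy as np

def localized_compress(img, b, c, compress_block):
    """Apply a block compressor to each non-overlapping b x b block of
    a single-channel image, shrinking it by a factor of b/c per side."""
    n = img.shape[0]
    assert n % b == 0, "image must tile evenly into b x b blocks"
    r = n // b
    out = np.empty((r * c, r * c), dtype=img.dtype)
    for i in range(r):
        for j in range(r):
            block = img[i * b:(i + 1) * b, j * b:(j + 1) * b]
            out[i * c:(i + 1) * c, j * c:(j + 1) * c] = compress_block(block, c)
    return out

def max_pool_block(block, c):  # stand-in compressor for illustration
    return np.full((c, c), block.max())

img = np.zeros((224, 224))
small = localized_compress(img, b=7, c=2, compress_block=max_pool_block)
assert small.shape == (64, 64)  # 224/7 blocks, each now 2 x 2
```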
Algorithm 2 (“default mode”) differs in that it writes to disk after performing the data augmentation and localized compression. It also requires that the stride s be divisible by c so that the convolutional region will always stride over an integer number of compressed blocks. In this way, only the compressed images are written to disk (reducing the storage requirement), and the compression operations need only be performed once (reducing the computational requirement). The challenge with this ordering is that, after compression, the full suite of data augmentation techniques can no longer be used; instead, only a limited set of data augmentation techniques can be applied. In particular:

Crops. It is customary to take the final crop during data augmentation (i.e., after resizing the image to a standard size). In default mode, this is still possible; however, the crops must not be allowed to subdivide the compressed c × c blocks.

Flips. It is customary to take left-right flips of the image during data augmentation. In default mode, it is still possible to reverse the ordering of the c × c blocks; however, the internal structure of each block must not be changed.
Other data augmentation schemes may or may not be applicable post-compression.
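The two block-aligned operations above can be sketched as follows (function names are ours; the constraint is only that no compressed block is subdivided or internally mirrored):

```python
import numpy as np

def aligned_crop(img, crop, c, rng):
    """Random crop of a compressed image whose offsets are multiples of
    c, so no c x c block is subdivided (crop must be divisible by c)."""
    n = img.shape[0]
    assert crop % c == 0 and (n - crop) % c == 0
    i = rng.integers(0, (n - crop) // c + 1) * c
    j = rng.integers(0, (n - crop) // c + 1) * c
    return img[i:i + crop, j:j + crop]

def blockwise_flip(img, c):
    """Left-right flip that reverses the order of c-wide block columns
    while leaving each block's internal structure untouched."""
    n = img.shape[0]
    cols = img.reshape(n, n // c, c)       # split columns into blocks
    return cols[:, ::-1, :].reshape(n, n)  # reverse block order only

rng = np.random.default_rng(0)
img = np.arange(8 * 8, dtype=float).reshape(8, 8)
flipped = blockwise_flip(img, 2)
# the leftmost 2-wide block column moves to the right edge, un-mirrored
assert np.array_equal(flipped[:, 6:8], img[:, 0:2])
```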
We therefore expect that networks trained in default mode will be somewhat less accurate than networks trained in inline mode. To bridge this gap, we allow default mode to produce n copies of each image. These copies are produced using the full suite of data augmentation techniques; at runtime, we randomly select one of these n images and then apply the limited set of data augmentation techniques to achieve further augmentation. We therefore expect that increasing n will increase our classification accuracy, but will also increase our storage requirements.
V Numerical Experiments
We test our procedure on two network architectures and two datasets. Our architectures are AlexNet [9] and DenseNet [8]: AlexNet is a dated architecture that has been widely used to evaluate compression algorithms, while DenseNet is a more modern architecture that achieves considerably higher accuracy. Both architectures require input images of a uniform size; we take 224 × 224 as the reference size for all images. Our datasets are the German Traffic Sign Recognition Benchmark (GTSRB) [13] (39K training images over 43 classes) and ImageNet [3] 2012 (1.3M training images over 1000 classes). While these are both standard datasets for classification challenges, a key difference is that most ImageNet images are larger than the reference size, whereas most GTSRB images are smaller than the reference size. We expect that LC will be more effective on large images (as there is more information to exploit).
We base our implementation of the networks, including parameters such as weight decay, on those from TensorFlow-Slim [11], [10]. In all tests (except where indicated), we begin by resizing and reshaping the images to slightly larger than the reference size (e.g., 256 × 256), cropping a random 224 × 224 patch from this, and then performing a left-right flip at random. All evaluation is performed with a single center crop and no left-right flip. We train all networks with the Momentum Optimizer with momentum 0.9 and a learning rate that begins at 0.01 and is reduced by an order of magnitude every 20 epochs, for a total of 65 epochs. This simple scheme is fully network-agnostic and offers relatively fast training times while giving top-1 accuracies only slightly lower than those reported by the network authors.
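The step schedule described above is simple enough to state exactly (the function and parameter names are ours):

```python
def learning_rate(epoch, base=0.01, drop_every=20):
    """Step schedule used for all networks: start at 0.01 and divide
    by 10 every 20 epochs (training runs for 65 epochs total)."""
    return base * (0.1 ** (epoch // drop_every))

assert learning_rate(0) == 0.01
assert abs(learning_rate(20) - 0.001) < 1e-12
assert abs(learning_rate(64) - 1e-5) < 1e-12
```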
Our first test compares the percentile, RMM, and MS compression algorithms (as described in Section III) against the baseline of simply downgrading the images to the equivalent size. We use inline mode to allow all algorithms to use identical, off-the-shelf dataset augmentation techniques (as described in Section IV). We set b = 7 and c = 2; thus, the final images are 64 × 64. Note, these small image sizes require removing the last pooling layer from AlexNet. Our results, shown in Table I, illustrate that LC is viable: all methods give results within a few percent of the baseline (DG), and the percentile method gives an accuracy 1–2% higher than DG. We therefore use the percentile-based compression technique for LC in the remainder of this work.
Table I

Method        Size       ImageNet            GTSRB
                         AlexNet   DenseNet  AlexNet   DenseNet
Uncompressed  224 x 224  56.8%     68.7%     96.8%     94.9%
Downgraded    64 x 64    27.0%     47.3%     92.8%     92.9%
Percentiles   64 x 64    29.4%     47.9%     94.3%     93.8%
RMM           64 x 64    26.4%     45.7%     93.1%     94.0%
MS            64 x 64    26.4%     42.0%     93.4%     93.4%
Our second test validates default mode. In particular, we compare default mode’s accuracy for various values of n against the accuracy achieved by inline mode. In addition to classification accuracy, Table II also shows the uncompressed-to-compressed computational ratio (CR) and the uncompressed-to-compressed storage ratio (SR). The results show that LC is still viable in default mode: even with n = 1, LC remains more accurate than DG. Setting n = 2 further increases the classification accuracy; however, continuing to increase n shows only a modest improvement. In the remainder of this work, we use default mode with n = 2.
Third, we test LC at different compression ratios. In particular, we keep c = 2 and vary b; larger values of b therefore result in smaller compressed image sizes. Our results for LC and DG are given in Table III and show that LC consistently outperforms DG at significant compression ratios (i.e., reducing the number of pixels by more than a factor of 4), but its advantage is less clear at smaller compression ratios.
Table II

n       CR      SR      ImageNet            GTSRB
                        AlexNet   DenseNet  AlexNet   DenseNet
1       12.25x  12.25x  28.4%     47.4%     94.5%     94.4%
2       12.25x  6.125x  28.9%     47.9%     94.1%     93.5%
4       12.25x  3.06x   29.0%     48.0%     94.4%     93.7%
inline  1x      1x      29.4%     47.9%     94.3%     93.8%
Table III (rows ordered from highest to lowest compression ratio)

Compression   ImageNet                      GTSRB
Ratio         AlexNet       DenseNet       AlexNet       DenseNet
              LC     DG     LC     DG      LC     DG     LC     DG
–             27.6%  25.6%  42.7%  42.3%   94.0%  93.2%  93.7%  93.5%
12.25x        28.9%  27.0%  47.9%  47.3%   94.1%  92.8%  93.5%  92.9%
–             31.8%  30.7%  50.2%  50.2%   94.0%  94.1%  94.0%  93.3%
–             37.0%  36.1%  53.0%  53.2%   94.9%  94.4%  94.7%  93.7%
–             44.2%  45.4%  58.1%  61.2%   95.5%  95.0%  94.3%  94.2%
–             52.5%  52.2%  62.6%  63.4%   96.0%  97.0%  94.3%  94.0%
Finally, we consider applying LC to larger images: rather than beginning with images near the reference size and compressing, we begin with large images and compress to 224 × 224. To evaluate this, we consider only the ImageNet dataset, and select only those images above a minimum size (of which there are 30,192 for training and 1,110 for testing). Our results are given in Table IV. Though the CNNs are clearly data-starved, our results suggest that LC gives higher accuracy than DG on larger images, just as it did on smaller images.
Table IV

Method  AlexNet  DenseNet
DG      16.7%    15.5%
LC      17.2%    17.8%
VI Applying random matrices to fully-connected layers
We now take a brief digression to consider a related problem: using the techniques of Section III to compress the fully-connected layers inside the CNN itself. As discussed in Section II, substantial work has been put into compressing these fully-connected layers in deep neural networks; our contribution is to extend this work by applying static, non-circulant random matrices to CNNs. Our strategy is to introduce a new deterministic layer immediately prior to each fully-connected layer that compresses the inputs to the following layer. The percentile-based sampling method, though effective for LC, is not an appropriate choice for this layer, as it would continually reorder the features. We therefore select the MS-based technique for this layer; this efficiently and dramatically reduces the number of weights in the hidden layer.
As a numerical example, we again consider the AlexNet architecture, which contains three fully-connected layers, the first of which takes 6400 inputs. We reshape these inputs into a matrix and multiply on the left by a random matrix. This produces 3328 values, which we feed into the next fully-connected layer. We repeat this for each fully-connected layer. We can also reduce the number of nodes in each FC layer to compensate for the reduced number of inputs.
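The following is a simplified sketch of this projection layer. We flatten the activations and apply a single fixed sparse random matrix, rather than the reshape-and-multiply scheme in the text; the 6400 → 3328 dimensions come from the example above, while the function names and 1/3 density are our assumptions:

```python
import numpy as np

def compress_fc_input(x, out_dim, rng):
    """Fixed (non-learned) sparse random projection inserted in front of
    a fully-connected layer, shrinking its input dimension."""
    u = rng.random((out_dim, x.size))
    # +/-1 each with probability 1/6, zero otherwise (density 1/3 assumed)
    phi = np.where(u < 1 / 6, 1.0, np.where(u < 1 / 3, -1.0, 0.0))
    return phi @ x

rng = np.random.default_rng(0)
x = np.ones(6400)                    # activations feeding AlexNet's FC1
y = compress_fc_input(x, 3328, rng)  # compressed to 3328 inputs (CR = 0.52)
assert y.shape == (3328,)
```

Because the projection is fixed rather than learned, it adds no trainable parameters, and the following FC layer’s weight matrix shrinks proportionally.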
Table V

CR by Layer
FC1    FC2    FC3    nNodes  GTSRB   ImageNet  CR
1.00   1.00   1.00   4096    96.8%   58.4%     1.00
0.50   1.00   1.00   4096    96.5%   57.5%     0.59
0.50   0.50   0.50   4096    96.0%   57.8%     0.55
0.50   0.50   0.50   2048    98.2%   54.3%     0.28
Table V shows our results from applying this to the GTSRB and ImageNet datasets with the AlexNet network architecture (we did not consider DenseNet, as it contains only a single fully-connected layer). Table V reports the compression ratio for each of the three fully-connected layers (FC1, FC2, and FC3), and the number of nodes contained in each FC layer (nNodes). Compared to the uncompressed AlexNet, we observe a roughly 1% increase in accuracy on GTSRB when compressing the network by up to 72%, and less than a 1% decrease in accuracy on ImageNet when compressing the network by 45%. The counterintuitive increase in accuracy for GTSRB may be because the original images are so small that the large network has too many parameters relative to the images’ information content.
VII Conclusions
The primary contribution of this work is to introduce Localized Compression (LC), an alternative to downgrading when CNNs require an image size much smaller than the original image’s resolution. In some sense, LC is a generalization of downgrading: downgrading always performs some sort of pixel averaging, whereas LC supports different compression techniques and different values of b and c. The most successful compression technique, percentile-based sampling, can be viewed as applying a generalization of a pooling layer to the original image. By choosing c such that c divides the stride s, LC supports any network architecture.
We also extended previous work [15], [16] on applying sparse random matrices to deep neural networks. Though percentile-based sampling outperformed random-matrix-based techniques for LC, we showed that sparse random matrices are an effective way to compress both convolutional and fully-connected layers in CNNs.
With respect to LC, our results show that when it is used with percentile-based sampling and relatively high compression ratios, LC gives a 1–2% accuracy improvement over downgrading. LC can therefore be useful in applications where the average image size is much larger than a CNN’s reference size. This is potentially a useful capability: many modern cameras can produce high-resolution images; LC provides a way to exploit the extra information that such devices provide without increasing the computational or SWaP requirements.
Acknowledgment
The authors would like to acknowledge Profs. Pramod Varshney and Thakshila Wimalajeewa at Syracuse University for sharing their expertise with using random matrices for classification.
This work was supported by the Air Force Research Laboratory under contract FA865017C1154.
References
[1] A survey of model compression and acceleration for deep neural networks. CoRR abs/1710.09282, 2017.
[2] An exploration of parameter redundancy in deep networks with circulant projections. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2857–2865, 2015.
[3] ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255, 2009.
[4] Exploiting linear structure within convolutional networks for efficient evaluation. arXiv:1404.0736, 2014.
[5] OpenCV documentation (website), 2018.
[6] Deep learning. MIT Press, 2016.
[7] Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. CoRR abs/1510.00149, 2015.
[8] Densely connected convolutional networks. CoRR abs/1608.06993, 2016.
[9] ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
[10] TensorFlow-Slim (website), 2018.
[11] TensorFlow-Slim (website), 2016.
[12] A tutorial on principal component analysis. CoRR abs/1404.1100, 2014.
[13] Man vs. computer: benchmarking machine learning algorithms for traffic sign recognition. Neural Networks, 2012.
[14] Recovery of sparse matrices via matrix sketching. arXiv:1311.2448, 2013.
[15] Impact of very sparse random projections on compressed classification. Publication forthcoming, 2018.
[16] Training neural networks on high-dimensional data using random projection. Pattern Analysis and Applications, 2018.