Neural networks have successfully been applied to many domains Bengio_2013 ; LeCunBH2015 . Two trends have sparked the use of neural networks in recent years. Firstly, the data volumes have increased dramatically in many domains yielding large amounts of training data. Secondly, the compute power of today’s systems has significantly increased as well, particularly those of massively-parallel architectures based on graphics processing units. Those specialized architectures can be used to reduce the practical runtime needed for training and applying neural networks, which has led to the development of more and more complex neural network architectures AlexNet ; he2016deep ; densenet .
Many machine learning applications require data to be exchanged between servers and clients during the inference phase. For instance, the data might be stored on a server and users have to download the data in order to process them on a local machine. This is the case, for example, in remote sensing, where current projects produce petabytes of satellite data every year Landsat ; Sentinel . The application of a machine learning model in this field to, e. g., monitor changes on a global scale, often requires the transfer of large amounts of image data between the server and the client that executes the model, see Figure 1. Similarly, data often have to be transferred from clients to servers for further processing. For instance, data collected by mobile devices are transferred to remote servers for the analysis conducted by virtual assistants such as Amazon’s Alexa, Apple’s Siri, or the Google Assistant.
While the reduction of the training and inference runtimes has received considerable attention CoatesHWWCN13 ; Han2015 ; GordonENCWYC18 ; NIPS2016_6250 ; pmlr-v70-kumar17a ; XuKWC2013 , relatively little work has been done regarding the transfer of data induced by such server/client based scenarios. However, this data transfer between clients and servers can become a severe bottleneck, which can significantly affect the way users can make use of the available data. In some cases, the necessary data transfer can be reduced based on prior knowledge (e. g., in case one knows that only certain input channels are relevant for the task to be conducted). Also, for some learning tasks, the data transfer can be reduced by extracting a small number of expressive features from the raw data. In general, such feature based reductions have to be adapted to the specific tasks and might also lead to a worse performance compared to purely data-driven approaches.
Contribution: We propose a framework that automatically selects those parts of the input data that are relevant for a particular network. In particular, our approach aims to select the minimal amount of data needed to achieve a model performance comparable to the one obtained on all the input data. The individual selection criteria can be adapted to the specific needs of the task at hand as well as to the transfer capabilities between the server and the clients. As shown in our experiments, our framework can reduce the amount of data that needs to be transferred during the inference phase, in some cases significantly, without affecting the model performance much.
2 Related Work
Reducing the practical training time has gained significant attention in recent years. This includes the use of specialized techniques such as massively-parallel or distributed implementations CoatesHWWCN13 ; NIPS2012_4687 ; NIPS2018_8028 . Approaches aiming at an efficient inference phase have been proposed, including schemes that aim at reducing the weights of networks or at reducing the amount of floating point operations needed during inference Han2015 ; GordonENCWYC18 . Similarly, methods that aim at small tree-based models have been suggested pmlr-v70-kumar17a ; XuKWC2013 . The transfer of data during the inference phase has been addressed as well, for example by constructing machine learning models under the assumption of a limited prediction-time budget. For instance, Nan et al. NIPS2016_6250 propose a method that prunes features during the construction of random forests such that only few are needed during the inference phase (thus, avoiding costs for the computation and the transfer of the features). In some cases, data compression can be used to reduce the amount of bytes needed to be transferred (e. g., images compressed via JPEG). However, this usually requires retraining a network to find a suitable compression level, which is not known beforehand. Further, such compressed versions might not be available on the server/client side. (Footnote 1: Our framework can handle these compression scenarios as a special case, where the optimal compression level is automatically selected during the training phase.) Deep neural networks have also been used to compress image data 7999241 , but the resulting compressed versions are independent of the learning task.
We conduct a gradient-driven search to find suitable weight assignments for the selection masks. An alternative to our approach are greedy schemes that, e. g., incrementally select input channels or pixels. However, these schemes might yield suboptimal results since only one channel/pixel is selected in each step. Further, these approaches quickly become computationally infeasible in case many channels or input pixels are given. Naturally, an exhaustive search for finding optimal mask assignments is computationally intractable. Our approach can be seen as a trade-off between these two variants. Finally, peripheral vision and deep saliency maps have been proposed to visualize neural networks strasburger2011peripheral ; deepSaliency ; selvaraju2017grad ; these techniques are also somewhat related to our work.
3 Learning Selection Masks
We resort to masks that can be used to select certain parts of the input data. These masks are adapted during the training process such that (a) the predictive power of the network is satisfactory and (b) only a minimal amount of the input data is selected. We will focus on image data in this work for the sake of exposition, but our approach can also be applied to other types of data.
3.1 Selection Masks
The selection masks make it possible to select parts of the data such as certain input channels or individual pixels of the different channels, see Figure 2. For each such mask, an associated cost can be defined, which can be used to adapt the masks to the specific requirements of the task at hand (e. g., in case selecting pixels from one channel will cause less data transfer in the inference phase than from another channel). Our optimization approach resorts to the following mask realizations, see Figure 3:
channel(any): To select an arbitrary number of input channels, a joint mask is used, which contains, for each of the channels, two weights. For instance, a mask with and corresponds to selecting the first but not the second channel. Before applying the mask to an image , the first two axes are broadcasted, which yields a mask .
channel(xor): In a similar fashion, one can select exactly one of the input channels by resorting to a joint mask of the form . Here, exactly one of the weights equals one. For instance, a mask with corresponds to only the last channel being selected. As before, the first two axes are broadcasted prior to the application of the mask, yielding a mask of the form .
pixel(any): To conduct pixel-wise selections, one can directly consider joint masks , which permit selecting individual pixels per channel. For instance, a mask with and for corresponds to selecting all pixels on the diagonal for the first two channels.
pixel(xor): Similarly, one can only allow one channel to be selected per pixel by considering a joint mask of the form , which contains, for each pixel, exactly one non-zero element corresponding to the selected channel for that pixel.
Note that variants of these four selection schemes can easily be obtained. For instance, shapes can be defined that partition the input data into, say, nine rectangular cells by considering masks of the form , where the first two axes are broadcasted to the corresponding cells. Such variants would make it possible to select certain cutouts, see Figure 4. The particular masks can be chosen according to the specific transfer capabilities between server and client. Finally, the different selection masks can also be applied sequentially with individual costs being assigned to them, see Section 4.
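The channel-wise and pixel-wise selections described above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: images are nested lists indexed as image[row][col][channel], each mask entry is a single 0/1 value (rather than the two weights per channel mentioned for the joint masks), and the function names apply_channel_mask, apply_pixel_mask, and is_xor_mask are hypothetical.

```python
from typing import List

Image = List[List[List[float]]]  # indexed as image[row][col][channel]


def apply_channel_mask(image: Image, channel_mask: List[int]) -> Image:
    """Broadcast a per-channel 0/1 mask over all pixels and apply it."""
    return [
        [[value * keep for value, keep in zip(pixel, channel_mask)] for pixel in row]
        for row in image
    ]


def apply_pixel_mask(image: Image, pixel_mask: List[List[List[int]]]) -> Image:
    """Apply a 0/1 mask that holds one entry per pixel and channel."""
    return [
        [
            [value * keep for value, keep in zip(pixel, mask_pixel)]
            for pixel, mask_pixel in zip(row, mask_row)
        ]
        for row, mask_row in zip(image, pixel_mask)
    ]


def is_xor_mask(channel_mask: List[int]) -> bool:
    """Check the channel(xor) constraint: exactly one channel is selected."""
    return all(m in (0, 1) for m in channel_mask) and sum(channel_mask) == 1


# A 2x2 image with three channels (toy values).
image = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [[7.0, 8.0, 9.0], [1.5, 2.5, 3.5]],
]

# channel(any)-style selection: keep the first channel only.
channel_masked = apply_channel_mask(image, [1, 0, 0])

# pixel(any)-style selection: keep the diagonal pixels of the first two channels.
diagonal_mask = [
    [[1, 1, 0], [0, 0, 0]],
    [[0, 0, 0], [1, 1, 0]],
]
pixel_masked = apply_pixel_mask(image, diagonal_mask)
```

In a transfer scenario, only the entries with a mask value of one would have to be sent; the zeros merely mark what the receiving side can leave out.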
3.2 Algorithmic Framework
Let be a training set consisting of images with associated labels . The goal of the training process is to find suitable weight assignments both for the selection masks as well as for the neural network that is applied to the data.
3.2.1 Optimization Approach
Our procedure for learning suitable mask and network weights is given by LearnSelectionMasks, see Algorithm 1: Both the joint selection mask as well as the parameters $\lambda$ and $\tau$ are initialized in Line 1 and Line 1, respectively. The parameter $\lambda$ determines the trade-off between the task loss and the mask loss . Typically, $\lambda$ is initialized with a small positive value (e. g., ) and is gradually increased during training. Both the selection mask and the network are trained simultaneously by iterating over a pre-defined number of epochs, each being split into batches (for the sake of exposition, we assume a batch size of 1). (Footnote 2: Instead of resorting to a fixed number of epochs, other stopping criteria can be used such as stopping as soon as the loss for the involved masks is below a certain user-defined threshold.) For each batch, a discrete mask is computed via the procedure DiscretizeMasks, which is used to obtain the masked image . The induced prediction is then used to compute the task loss . In addition, the overall mask loss is computed. Note that the discretized weights are used in the forward pass, whereas a mask with real-valued weights is used in the backward pass in Line 1. After each epoch, both $\lambda$ and $\tau$ are adapted. As detailed below, the procedure DiscretizeMasks alternates between an “exploration” and a “fixation” phase, specified by the parameter . The final discrete weights for the joint mask are computed in Line 1 and, together with the updated model , returned in Line 1.
Learning Discrete Masks:
Naturally, exhaustive search schemes that find the optimal discrete weights by testing out all possible assignments are computationally infeasible. Also, simple greedy approaches such as forward/backward selection of channels become computationally very challenging and are clearly ill-suited for pixel-wise selections. Learning such discrete masks is difficult since the induced objective is not differentiable, which rules out the use of gradient-based optimizers commonly applied for training neural networks. One way to circumvent this problem is the so-called Gumbel-Max trick, which has been recently proposed in the context of variational auto-encoders gumbelTrick2014 ; gumbel1954statistical ; GumbelVAEJang2016 . The procedure DiscretizeMasks resorts to this trick to discretize the real-valued masks in the forward pass of Algorithm 1. For instance, given a mask corresponding to channel(any), the procedure yields a discrete mask via
where corresponds to the -th channel and where each is either zero or some small random noise, depending on which phase is executed (see below).
Thus, the softmax function is used as a surrogate for the discrete operation. The parameter $\tau$ is called the temperature. A large $\tau$ leads to the resulting weights being close to uniformly distributed, whereas a small value for $\tau$ renders the values output by the softmax close to zero or one. The procedure DiscretizeMasks alternates between “explore” and “fixate”, specified by the parameter . If is true, then is some Gumbel noise derived from a uniform distribution . If is false, no noise is added (i. e., ). In the exploration phase, the optimizer can try out new possible mask assignments, whereas the network weights are adapted to the new data input in the fixation phase. The amount of changes made during the exploration phase is also influenced by the temperature parameter $\tau$.
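The effect of the temperature and of the two phases can be illustrated with a small, self-contained sketch. It is not the DiscretizeMasks procedure itself: the names sample_gumbel, relaxed_weights, and discretize are hypothetical, the real-valued mask weights are treated as unnormalized logits (an assumption), and the Gumbel noise is generated with the standard construction g = -log(-log(u)) for u drawn uniformly from (0, 1).

```python
import math
import random
from typing import List, Optional


def sample_gumbel(rng: random.Random) -> float:
    """Draw standard Gumbel noise via g = -log(-log(u)), u ~ U(0, 1)."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # guard against log(0)
    return -math.log(-math.log(u))


def relaxed_weights(
    logits: List[float],
    tau: float,
    explore: bool,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Temperature-scaled softmax; Gumbel noise is added only when exploring."""
    if rng is None:
        rng = random.Random()
    scaled = [
        (logit + (sample_gumbel(rng) if explore else 0.0)) / tau for logit in logits
    ]
    shift = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - shift) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def discretize(weights: List[float]) -> List[int]:
    """Turn relaxed weights into a one-hot vector by taking the largest entry."""
    best = max(range(len(weights)), key=weights.__getitem__)
    return [1 if index == best else 0 for index in range(len(weights))]


# Two weights for a single channel: "select" versus "do not select".
logits = [2.0, 0.0]

warm = relaxed_weights(logits, tau=100.0, explore=False)  # close to uniform
cold = relaxed_weights(logits, tau=0.1, explore=False)    # close to zero/one
noisy = relaxed_weights(logits, tau=1.0, explore=True, rng=random.Random(0))
```

With explore set to False the output is deterministic, which corresponds to the fixation phase; with explore set to True the noise can occasionally flip which entry is largest, which is how new mask assignments get tried out.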
Initialization and Adaptation:
The selection goal influences the initialization of the real-valued mask . In case all input channels for the channel(any) scheme are equally important, the individual masks are set to for all to initially “select” all of the channels, where for some small . In case the channels should be treated differently, the initialization can be adapted accordingly. For instance, only the first channel can be selected initially by setting for and for , see Section 4.
The procedure InitLambdaTau initializes both $\lambda$ and $\tau$. The parameter $\lambda$, which determines the trade-off between the task loss and the loss associated with all masks, is initialized to a small value (e. g., ). The temperature parameter $\tau$ is initialized to a positive constant (e. g., ). The adaptation of both $\lambda$ and $\tau$ after each epoch is handled by the procedure AdaptLambdaTau: In the course of the training process, the influence of is gradually increased until epochs have been processed or some other stopping criterion is met (e. g., as soon as the desired reduction w.r.t. is achieved). Since the range of values for the model loss is generally not known beforehand, we resort to a scheduler that increases $\lambda$ in Line 1 of Algorithm 1 in case the overall error has not decreased for a certain number of epochs. The scheduler behaves similarly to standard learning rate schedulers, but instead of decreasing the learning rate, the value for $\lambda$ is increased by a certain factor (e. g., ). The temperature $\tau$ influences the outcome of the softmax operation in Equation (2): A large value leads to similar weights being mapped to similar ones via the operation, whereas a small value for $\tau$ amplifies small differences such that the output weights are close to zero/one. For each new assignment of $\lambda$, we resort to a cool-down sequence, where $\tau$ is reset to and gradually decreased by a factor after each epoch (e. g., ). This cool-down sequence lets the process explore different weight assignments at the beginning, whereas binary decisions are fostered towards the end.
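The scheduling behaviour described above can be sketched as a small plateau-based scheduler. This is a hedged illustration, not the AdaptLambdaTau procedure itself: the class name LambdaTauSchedule, the patience parameter, and all default values are assumptions chosen for the example, and resetting the best observed error after each increase of the trade-off parameter is a design choice of this sketch.

```python
class LambdaTauSchedule:
    """Increase lambda on error plateaus and cool down tau in between.

    All default values are hypothetical; the text does not fix them here.
    """

    def __init__(
        self,
        lam: float = 0.01,
        lam_factor: float = 2.0,
        tau_init: float = 1.0,
        tau_decay: float = 0.9,
        patience: int = 3,
    ) -> None:
        self.lam = lam
        self.lam_factor = lam_factor
        self.tau_init = tau_init
        self.tau_decay = tau_decay
        self.patience = patience
        self.tau = tau_init
        self.best_error = float("inf")
        self.stale_epochs = 0

    def step(self, error: float) -> None:
        """Update lambda and tau once per epoch, given the overall error."""
        if error < self.best_error:
            self.best_error = error
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1

        if self.stale_epochs >= self.patience:
            # Plateau reached: put more weight on the mask loss and restart
            # the cool-down so that new mask assignments can be explored.
            self.lam *= self.lam_factor
            self.tau = self.tau_init
            self.best_error = float("inf")
            self.stale_epochs = 0
        else:
            # Cool down: favour binary decisions as the epochs pass.
            self.tau *= self.tau_decay


schedule = LambdaTauSchedule(lam=0.01, lam_factor=2.0, tau_init=1.0, tau_decay=0.5, patience=2)
tau_trace = []
for epoch_error in [1.0, 0.9, 0.9, 0.95]:
    schedule.step(epoch_error)
    tau_trace.append(schedule.tau)
```

In this toy run the error improves twice and then stalls for two epochs, so the temperature is halved three times before the trade-off parameter is doubled and the temperature is reset.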
3.3 Extension and Reduction
Different costs can be assigned to the individual masks, which are jointly taken into account by the overall mask loss . For instance, given input channels, one can resort to different losses to favor the selection of certain channels. This turns out to be useful in case different “versions” for the input channels are available, whose transfer costs vary (e. g., compressed images or thumbnails of different sizes).
Often, pre-trained networks with a fixed input structure are given. The selection of different versions for such networks can be handled via simple operators, see Figure 5: The extend operator can be used to extend a given input feature map (e. g., by generating ten compressed versions of different quality), whereas the merge operator can combine feature maps in a user-defined way (e. g., by summing up the input channels). For instance, an extend operation followed by a channel(xor) selection and a merge operation can be used to gradually select a certain version of each input channel without significantly changing the input for a given network in each step. This makes it possible to learn masks for pre-trained networks without having to retrain the network weights from scratch, see Section 4.
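The interplay of extend, channel(xor) selection, and merge can be illustrated with a toy example. The sketch below makes several assumptions: a channel is a flat list of values in [0, 1], coarser quantization stands in for lower compression quality (real JPEG compression is not used), and the function names extend, select_xor, and merge_sum are hypothetical.

```python
from typing import List


def extend(channel: List[float], levels: List[int]) -> List[List[float]]:
    """Create one version of the channel per quantization level.

    Quantization is only a stand-in for compressed versions of different quality.
    """
    return [[round(value * q) / q for value in channel] for q in levels]


def select_xor(versions: List[List[float]], one_hot: List[int]) -> List[List[float]]:
    """Zero out every version except the single one marked in the one-hot mask."""
    if sum(one_hot) != 1 or any(m not in (0, 1) for m in one_hot):
        raise ValueError("a channel(xor) mask must select exactly one version")
    return [[value * m for value in version] for version, m in zip(versions, one_hot)]


def merge_sum(versions: List[List[float]]) -> List[float]:
    """Merge the versions back into a single channel by summation."""
    return [sum(values) for values in zip(*versions)]


channel = [0.12, 0.57, 0.93]
versions = extend(channel, levels=[2, 10, 100])  # coarse, medium, fine

coarse_input = merge_sum(select_xor(versions, [1, 0, 0]))
fine_input = merge_sum(select_xor(versions, [0, 0, 1]))
```

Because exactly one version is kept and the others are zeroed out, the merged result has the same shape as the original channel, so a network with a fixed input structure can consume it unchanged.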
4 Experiments
We implemented our approach in Python 3.6 using PyTorch (version 1.1). Except for the trade-off parameter, default parameters were used for all experiments (, , , and ). The learning rates for all selection masks were set to . For the networks, the Adam adam optimizer with AMSGrad amsgrad and learning rate was used. The initial assignment as well as the factor for $\lambda$ can have a significant impact. For this reason, we considered a small grid of possible assignments. The influence of this parameter is shown in Figure 14; for all other figures, one of the four configurations is presented.
We considered several classification datasets and network architectures, see Table 1. In addition to the well-known cifar10, mnist, and svhn datasets cifar10 ; lecun2010mnist ; 37648 , we considered two datasets from remote sensing and astronomy, respectively. For each instance of remote, one is given an image with 36 channels originating from six multi-spectral bands available for six different dates PRISHCHEPOV2012195 . The learning goal is to predict the type of change occurring in the central pixel of each image. The astronomical dataset is related to detecting supernovæ ScalzoETAL2017 . Each instance is represented by an image with three channels and the goal is to predict the type of object in the center of the image (a balanced version of the dataset was used). Both remote and supernovæ represent typical datasets in remote sensing and astronomy, respectively, with the target objects being located in the centers of the images. For all experiments, we considered a fixed number of epochs and monitored the classification accuracy on the hold-out set. Each experiment was conducted times and the lines of the figures represent individual runs (the thicker black line is the aggregated mean over all runs). If not stated otherwise, we considered pre-trained networks before applying our selection approach.
4.1 Channel Selection
The first experiment addressed the task of selecting a subset of the input channels. We used remote, supernovæ, and cifar10 as datasets, for which different outcomes were expected. For each of the channels, we assigned the same mask loss . The overall mask loss was the sum over all selected channels.
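A mask loss of this kind, a sum of per-channel costs over the selected channels, can be sketched as follows. The uniform cost of 1/c per channel and the additive combination of task loss and weighted mask loss are assumptions made for this illustration only; the function names mask_loss and total_loss are hypothetical.

```python
from typing import List


def mask_loss(selection: List[int], costs: List[float]) -> float:
    """Sum the costs of all channels whose mask entry is one."""
    return sum(cost for selected, cost in zip(selection, costs) if selected == 1)


def total_loss(task_loss: float, selection: List[int], costs: List[float], lam: float) -> float:
    """Combine task loss and mask loss (additive weighting is an assumption)."""
    return task_loss + lam * mask_loss(selection, costs)


num_channels = 4
uniform_costs = [1.0 / num_channels] * num_channels  # same cost for every channel
selection = [1, 0, 1, 1]                             # three of four channels kept

current_mask_loss = mask_loss(selection, uniform_costs)
objective = total_loss(0.4, selection, uniform_costs, lam=0.1)
```

With uniform costs the mask loss is simply the fraction of channels that are still selected; assigning non-uniform costs would instead favour dropping the channels that are most expensive to transfer.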
The outcome is shown in Figure 6. As expected, channel-wise selection worked best on remote due to many channels carrying similar information. The accuracy started to drop only when fewer than % of the channels were selected. In Figure 7, the selection process is sketched, where each row represents a different epoch (from top to bottom: , , , , ) and where each column corresponds to one of the channels. For supernovæ, the removal of a single channel did not significantly affect the classification accuracy. For some runs, all channels were removed at once, which indicates that the steps made for were too large (thus, a smaller should be considered). On cifar10, only one of the three channels could be dropped with a minimal degradation of accuracy. Thus, as expected, fewer channels could be removed for both supernovæ and cifar10 due to the channels being less redundant compared to remote.
4.2 Pixel-wise Selection
Next, we addressed pixel-wise selections (pixel(any)) and conducted a similar experiment with the three previous datasets. The mask loss was obtained by summing over the selected pixels, where a weight of was assigned to each individual pixel. The results are given in Figure 9. All plots for the mask loss are smoother than for the channel-wise selections because the selection decisions to be made at each step were much more fine-grained (for cifar10 and supernovæ, only three channels but thousands of subpixels are given). The accuracy drops slightly at the beginning of the training process. This is because the networks had not been trained with missing inputs before and, hence, had to learn to compensate for the missing input at the beginning. This effect could be lessened by (a) adding dropout layers to the networks or by (b) decreasing both and to let the approach do less exploration at the beginning. Overall, the achieved reduction relative to the retained accuracy is higher than for the channel-wise selection, although there are notable spikes for supernovæ that most likely stem from the removal of subpixels that are crucial for the classification task (the removal of some central pixels seems to have had a significant impact). The development of the masks w.r.t. is shown in Figure 8.
4.2.1 Feature Map Selection
In many cases, preprocessed data are available on the server/client side. The next experiment was dedicated to such scenarios. In particular, we considered ten compressed versions for the cifar10 images of different JPEG qualities . The goal was to select one of these versions via channel(xor).
To capture the varying costs for the transfer of the different versions, we assigned to each version with quality level . Also, the masks were initialized such that only the version with the highest quality was initially selected. Figure 11 shows the results. It can be seen that the lowest possible value () was obtained for , for which an accuracy of about % remained. Also, an accuracy of about % could be maintained while reaching a loss of about . An illustration of the reduced input over the epochs is given in Figure 10.
4.2.2 Combination of Selection Masks
This experiment demonstrates the use of multiple selection masks and mask losses. The following operations were applied, see Figure 12: First an extend operation was used to generate different JPEG qualities for each channel. Afterwards, a channel(xor) selection operation followed by a merge operation (sum) were applied. Finally, a pixel(any) selection was conducted to select certain subpixels of the merged channels. For this experiment, we used cifar10, mnist, and svhn. The joint mask loss was set to the product of the two previously defined losses.
The results are shown in Figure 13. Note that the models for svhn and mnist were not pre-trained in this case, which is why the accuracies start with a lower value. Since mnist is a dataset with many empty border pixels, our approach was able to remove % of the pixels in the first few epochs. Also, the lowest possible JPEG quality was used. Similar effects can be observed on svhn, although it seems that it was harder to remove pixels due to more background pixels compared to mnist. For cifar10, the results show that the combined masks yielded similar outcomes as for the individual masks, see again Figures 9 and 11.
4.2.3 Influence of
The parameter usually has a great impact on the mask selection process. Figure 14 shows the influence of the four different configurations considered for our experiments given the remote dataset. It can be seen that a large (blue and red lines) leads to the mask loss quickly decreasing. For such settings, it seems that the network was not able to compensate for the loss of information, which is why the accuracy was lower until the network was able to adapt to the new input. A smaller initial value for leads to the selection process taking less input data away at the beginning, which avoids an initial drop in accuracy. Similarly, a large leads to a faster decrease w.r.t. , which can be suboptimal in certain cases.
5 Conclusions
The transfer of data between servers and clients can become a major bottleneck during the inference phase of a neural network. We propose a framework that automatically selects those parts of the data needed by the network to perform well while, at the same time, selecting only a minimal amount of data. Our approach resorts to various types of selection masks that are jointly optimized together with the corresponding network during the training phase. Our experiments show that it is often possible to achieve a good accuracy while transferring significantly less input data. We expect that such selection masks will play an important role in the future for data-intensive domains such as remote sensing or for scenarios where the data transfer bandwidth is very limited.
References
-  Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
-  Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521:436–444, 2015.
-  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In International Conference on Neural Information Processing Systems (NeurIPS), pages 1106–1114. Curran Associates, 2012.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778. IEEE, 2016.
-  Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269. IEEE, 2017.
-  Michael A. Wulder, Jeffrey G. Masek, Warren B. Cohen, Thomas R. Loveland, and Curtis E. Woodcock. Opening the archive: How free data has enabled the science and monitoring promise of landsat. Remote Sensing of Environment, 122(Supplement C):2 – 10, 2012. Landsat Legacy Special Issue.
-  Jian Li and David P. Roy. A global analysis of sentinel-2a, sentinel-2b and landsat-8 data revisit intervals and implications for terrestrial monitoring. Remote Sensing, 9(902), 2017.
-  A. Coates, B. Huval, T. Wang, D. J. Wu, B. C. Catanzaro, and A. Y. Ng. Deep learning with COTS HPC systems. In International Conference on Machine Learning (ICML), volume 28 of JMLR Proceedings, pages 1337–1345. JMLR.org, 2013.
-  Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both weights and connections for efficient neural networks. In International Conference on Neural Information Processing Systems (NeurIPS), pages 1135–1143, Cambridge, MA, USA, 2015. MIT Press.
-  Ariel Gordon, Elad Eban, Ofir Nachum, Bo Chen, Hao Wu, Tien-Ju Yang, and Edward Choi. Morphnet: Fast & simple resource-constrained structure learning of deep networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1586–1595. IEEE Computer Society, 2018.
-  Feng Nan, Joseph Wang, and Venkatesh Saligrama. Pruning random forests for prediction on a budget. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2334–2342. Curran Associates, Inc., 2016.
-  Ashish Kumar, Saurabh Goyal, and Manik Varma. Resource-efficient machine learning in 2 KB RAM for the internet of things. In Doina Precup and Yee Whye Teh, editors, International Conference on Machine Learning (ICML), volume 70 of Proceedings of Machine Learning Research, pages 1935–1944. PMLR, 06–11 Aug 2017.
-  Zhixiang Eddie Xu, Matt J. Kusner, Kilian Q. Weinberger, and Minmin Chen. Cost-sensitive tree of classifiers. In International Conference on Machine Learning (ICML), volume 28 of JMLR Proceedings, pages 133–141. JMLR.org, 2013.
-  Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc’Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Quoc V. Le, and Andrew Y. Ng. Large scale distributed deep networks. In Advances in Neural Information Processing Systems 25, pages 1223–1231. Curran Associates, Inc., 2012.
-  Youjie Li, Mingchao Yu, Songze Li, Salman Avestimehr, Nam Sung Kim, and Alexander Schwing. Pipe-sgd: A decentralized pipelined sgd framework for distributed deep net training. In Advances in Neural Information Processing Systems 31, pages 8045–8056. Curran Associates, Inc., 2018.
-  F. Jiang, W. Tao, S. Liu, J. Ren, X. Guo, and D. Zhao. An end-to-end compression framework based on convolutional neural networks. IEEE Transactions on Circuits and Systems for Video Technology, 28(10):3007–3018, 2018.
-  Hans Strasburger, Ingo Rentschler, and Martin Jüttner. Peripheral vision and pattern recognition: A review. Journal of vision, 11(5):13–13, 2011.
-  Sen He and Nicolas Pugeault. Deep saliency: What is learnt by a deep network about saliency? Computing Research Repository (CoRR), abs/1801.04261, 2018.
-  Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In IEEE International Conference on Computer Vision (ICCV), pages 618–626, 2017.
-  Chris J. Maddison, Daniel Tarlow, and Tom Minka. A* sampling. In International Conference on Neural Information Processing Systems (NeurIPS), pages 3086–3094, 2014.
-  Emil Julius Gumbel. Statistical theory of extreme values and some practical applications: a series of lectures, volume 33. US Government Printing Office, 1954.
-  Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. In International Conference on Learning Representations (ICLR), 2017.
-  Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. Computing Research Repository (CoRR), abs/1412.6980, 2014.
-  Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. In International Conference on Learning Representations (ICLR), 2018.
-  Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. Cifar-10. canadian institute for advanced research, 2009.
-  Yann LeCun, Corinna Cortes, and CJ Burges. Mnist handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2:18, 2010.
-  Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011.
-  Alexander V. Prishchepov, Volker C. Radeloff, Maxim Dubinin, and Camilo Alcantara. The effect of landsat ETM/ETM+ image acquisition dates on the detection of agricultural land abandonment in eastern europe. Remote Sensing of Environment, 126:195 – 209, 2012.
-  R. A. Scalzo, F. Yuan, M. J. Childress, A. Möller, B. P. Schmidt, B. E. Tucker, B. R. Zhang, P. Astier, M. Betoule, and N. Regnault. The skymapper supernova and transient search. Computing Research Repository (CoRR), abs/1702.05585, 2017.