1 Introduction
In the past decade deep neural networks have set new performance standards in many high-impact applications. These include object classification Krizhevsky et al. (2012); Sermanet et al. (2013), speech recognition Hinton et al. (2012), image caption generation Vinyals et al. (2014); Karpathy & Fei-Fei (2014) and domain adaptation Glorot et al. (2011b). As data sets increase in size, so does the number of parameters in these neural networks in order to absorb the enormous amount of supervision Coates et al. (2013). Increasingly, these networks are trained on industrial-sized clusters Le (2013) or high-performance graphics processing units (GPUs) Coates et al. (2013).
Simultaneously, there has been a second trend as applications of machine learning have shifted toward mobile and embedded devices. As examples, modern smart phones are increasingly operated through speech recognition Schuster (2010), robots and self-driving cars perform object recognition in real time Montemerlo et al. (2008), and medical devices collect and analyze patient data Lee & Verma (2013). In contrast to GPUs or computing clusters, these devices are designed for low power consumption and long battery life. Most importantly, they typically have small working memory. For example, even the top-of-the-line iPhone 6 only features a mere 1GB of RAM.¹
¹ http://en.wikipedia.org/wiki/IPhone_6
The disjunction between these two trends creates a dilemma when state-of-the-art deep learning algorithms are designed for deployment on mobile devices. While it is possible to train deep nets offline on industrial-sized clusters (server-side), the sheer size of the most effective models would exceed the available memory, making it prohibitive to perform testing on-device. In speech recognition, one common cure is to transmit processed voice recordings to a computation center, where the voice recognition is performed server-side Chun & Maniatis (2009). This approach is problematic, as it only works when sufficient bandwidth is available and incurs artificial delays through network traffic Kosner (2012). One solution is to train small models for the on-device classification; however, these tend to significantly impact accuracy Chun & Maniatis (2009), leading to customer frustration.
This dilemma motivates neural network compression. Recent work by Denil et al. (2013) demonstrates that there is a surprisingly large amount of redundancy among the weights of neural networks. The authors show that a small subset of the weights is sufficient to reconstruct the entire network. They exploit this by training low-rank decompositions of the weight matrices. Ba & Caruana (2014) show that deep neural networks can be successfully compressed into “shallow” single-layer neural networks by training the small network on the (log) outputs of the fully trained deep network Buciluǎ et al. (2006). Courbariaux et al. (2014) train neural networks with reduced bit precision, and, long predating this work, LeCun et al. (1989) investigated dropping unimportant weights in neural networks. In summary, the accumulated evidence suggests that much of the information stored within network weights may be redundant.
In this paper we propose HashedNets, a novel network architecture to reduce and limit the memory overhead of neural networks. Our approach is compellingly simple: we use a hash function to group network connections into hash buckets uniformly at random such that all connections grouped to the same hash bucket share the same weight value. Our parameter hashing is akin to prior work in feature hashing Weinberger et al. (2009); Shi et al. (2009); Ganchev & Dredze (2008) and is similarly fast and requires no additional memory overhead. The backpropagation algorithm LeCun et al. (2012) can naturally tune the hash bucket parameters and take into account the random weight sharing within the neural network architecture.
We demonstrate on several real-world deep learning benchmark data sets that HashedNets can drastically reduce the model size of neural networks with little impact on prediction accuracy. Under the same memory constraint, HashedNets have more adjustable free parameters than the low-rank decomposition methods suggested by Denil et al. (2013), leading to smaller drops in descriptive power.
Similarly, we also show that for a finite set of parameters it is beneficial to “inflate” the network architecture by reusing each parameter value multiple times. Best results are achieved when networks are inflated by a factor –. The “inflation” of neural networks with HashedNets imposes no restrictions on other network architecture design choices, such as dropout regularization Srivastava et al. (2014); Glorot et al. (2011a); LeCun et al. (2012), or weight sparsity Coates et al. (2011).
2 Feature Hashing
Learning under memory constraints has previously been explored in the context of large-scale learning for sparse data sets. Feature hashing (or the hashing trick) Weinberger et al. (2009); Shi et al. (2009) is a technique to map high-dimensional text documents directly into bag-of-words Salton & Buckley (1988) vectors, which would otherwise require memory-consuming dictionaries to store the indices corresponding to specific input terms.
Formally, an input vector $\mathbf{x} \in \mathbb{R}^d$ is mapped into a feature space with a mapping function $\phi: \mathbb{R}^d \to \mathbb{R}^k$ where $k \ll d$. The mapping $\phi$ is based on two (approximately uniform) hash functions $h: \mathbb{N} \to \{1, \dots, k\}$ and $\xi: \mathbb{N} \to \{-1, +1\}$, and the $k$-th dimension of the hashed input x is defined as $\phi_k(\mathbf{x}) = \sum_{i: h(i) = k} x_i \, \xi(i)$.
The hashing trick leads to large memory savings for two reasons: it can operate directly on the input term strings and avoids the use of a dictionary to translate words into vectors; and the parameter vector of a learning model lives within the much smaller dimensional $\mathbb{R}^k$ instead of $\mathbb{R}^d$. The dimensionality reduction comes at the cost of collisions, where multiple words are mapped into the same dimension. This problem is less severe for sparse data sets and can be counteracted through multiple hashing Shi et al. (2009) or larger hash tables Weinberger et al. (2009).
In addition to memory savings, the hashing trick has the appealing property of being sparsity preserving, fast to compute and storage-free. The most important property of the hashing trick is, arguably, its (approximate) preservation of inner product operations. The second hash function, $\xi$, guarantees that inner products are unbiased in expectation Weinberger et al. (2009); that is,
$$\mathbb{E}_{\phi}\left[\phi(\mathbf{x})^\top \phi(\mathbf{x}')\right] = \mathbf{x}^\top \mathbf{x}'. \qquad (1)$$
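As a minimal sketch of the hashing trick described above, the following Python snippet maps tokens directly into a small signed vector without a dictionary. The helper names `_hash_int` and `hash_features`, the use of MD5 as a stand-in for an (approximately uniform) hash, the 0-based bucket indices and the toy documents are all assumptions made for illustration; they are not taken from Weinberger et al. (2009).

```python
import hashlib
import numpy as np


def _hash_int(key: str, salt: str) -> int:
    """Deterministic 64-bit integer hash of a string (stand-in for any uniform hash)."""
    digest = hashlib.md5(f"{salt}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little")


def hash_features(tokens, k, seed=0):
    """Map a token list to a k-dimensional signed, hashed bag-of-words vector."""
    phi = np.zeros(k)
    for tok in tokens:
        bucket = _hash_int(tok, f"h{seed}") % k                   # bucket hash h
        sign = 1.0 if _hash_int(tok, f"xi{seed}") & 1 else -1.0   # sign hash xi
        phi[bucket] += sign
    return phi


doc_a = "the quick brown fox jumps over the lazy dog".split()
doc_b = "the lazy dog sleeps".split()
phi_a = hash_features(doc_a, k=64)
phi_b = hash_features(doc_b, k=64)

# Hashed inner product; the exact bag-of-words inner product of these documents is 4.
print(phi_a @ phi_b)
```

For a single pair of hash functions the hashed inner product may deviate from the exact value because of collisions; averaging over many independent seeds should approach the exact value, which is the unbiasedness property stated in Eq. (1).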
Finally, Weinberger et al. (2009) also show that the hashing trick can be used to learn multiple classifiers within the same hashed space. In particular, the authors use it for multi-task learning and define multiple hash functions $\phi^{(1)}, \dots, \phi^{(T)}$, one for each task, that map inputs for their respective tasks into one joint space. Let $\mathbf{w}^{(1)}, \dots, \mathbf{w}^{(T)}$ denote the weight vectors of the respective learning tasks; then if $t' \neq t$, a classifier for task $t'$ does not interfere with a hashed input for task $t$; i.e. ${\mathbf{w}^{(t')}}^\top \phi^{(t)}(\mathbf{x}) \approx 0$.
3 Notation
Throughout this paper we type vectors in bold ($\mathbf{x}$), scalars in regular ($C$ or $b$) and matrices in capital bold ($\mathbf{X}$). Specific entries in vectors or matrices are scalars and follow the corresponding convention, i.e. the $i$-th dimension of vector $\mathbf{x}$ is $x_i$ and the $(i,j)$-th entry of matrix $\mathbf{V}$ is $V_{ij}$.
Feed Forward Neural Networks.
We define the forward propagation of the $\ell$-th layer in a neural network as
$$a_i^{\ell+1} = f(z_i^{\ell+1}), \quad \text{where } z_i^{\ell+1} = \sum_{j=0}^{n^\ell} V_{ij}^\ell \, a_j^\ell, \qquad (2)$$
where $\mathbf{V}^\ell$ is the (virtual) weight matrix in the $\ell$-th layer. The vectors $\mathbf{z}^\ell, \mathbf{a}^\ell \in \mathbb{R}^{n^\ell}$ denote the activation units before and after transformation through the transition function $f(\cdot)$. Typical activation functions are the rectified linear unit (ReLU) Nair & Hinton (2010), sigmoid or tanh LeCun et al. (2012).
4 HashedNets
In this section we present HashedNets, a novel variation of neural networks with drastically reduced model sizes (and memory demands). We first introduce our approach as a method of random weight sharing across the network connections and then describe how to facilitate it with the hashing trick to avoid any additional memory overhead.
4.1 Random weight sharing
In a standard fully-connected neural network, there are $(n^\ell + 1) \times n^{\ell+1}$ weighted connections between a pair of layers, each with a corresponding free parameter in the weight matrix $\mathbf{V}^\ell$. We assume a finite memory budget per layer, $K^\ell \ll (n^\ell + 1) \times n^{\ell+1}$, that cannot be exceeded. The obvious solution is to fit the neural network within budget by reducing the number of nodes $n^\ell, n^{\ell+1}$ in layers $\ell, \ell+1$ or by reducing the bit precision of the weight matrices Courbariaux et al. (2014). However, if $K^\ell$ is sufficiently small, both approaches significantly reduce the ability of the neural network to generalize (see Section 6). Instead, we propose an alternative: we keep the size of $\mathbf{V}^\ell$ untouched but reduce its effective memory footprint through weight sharing. We only allow exactly $K^\ell$ different weights to occur within $\mathbf{V}^\ell$, which we store in a weight vector $\mathbf{w}^\ell \in \mathbb{R}^{K^\ell}$. The weights within $\mathbf{w}^\ell$ are shared across multiple randomly chosen connections within $\mathbf{V}^\ell$. We refer to the resulting matrix $\mathbf{V}^\ell$ as virtual, as its size could be increased (i.e. nodes are added to the hidden layer) without increasing the actual number of parameters of the neural network.
Figure 1 shows a neural network with one hidden layer, four input units and two output units. Connections are randomly grouped into three categories per layer and their weights are shown in the virtual weight matrices $\mathbf{V}^1$ and $\mathbf{V}^2$. Connections belonging to the same color share the same weight value; these values are stored in $\mathbf{w}^1$ and $\mathbf{w}^2$, respectively. Overall, the entire network is compressed by a factor , i.e. the weights stored in the virtual matrices $\mathbf{V}^1$ and $\mathbf{V}^2$ are reduced to only six real values in $\mathbf{w}^1$ and $\mathbf{w}^2$. On data with four input dimensions and two output dimensions, a conventional neural network with six weights would be restricted to a single (trivial) hidden unit.
4.2 Hashed Neural Nets (HashedNets)
A naïve implementation of random weight sharing can be trivially achieved by maintaining a secondary matrix consisting of each connection’s group assignment. Unfortunately, this explicit representation places an undesirable limit on potential memory savings.
We propose to implement the random weight sharing assignments using the hashing trick. In this way, the shared weight of each connection is determined by a hash function that requires no storage cost with the model. Specifically, we assign to $V_{ij}^\ell$ an element of $\mathbf{w}^\ell$ indexed by a hash function $h^\ell(i,j)$, as follows:
$$V_{ij}^\ell = w^\ell_{h^\ell(i,j)}, \qquad (3)$$
where the (approximately uniform) hash function $h^\ell(\cdot,\cdot)$ maps a key $(i,j)$ to a natural number within $\{1, \dots, K^\ell\}$. In the example of Figure 1, and therefore . For our experiments we use the open-source implementation xxHash.²
² https://code.google.com/p/xxhash/
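The lookup in Eq. (3) can be illustrated with a short Python sketch. This is not the implementation used in the experiments: MD5 from the standard library replaces xxHash, indices are 0-based, and the names `bucket` and `virtual_matrix` as well as the layer sizes and weight values are hypothetical. The index array `idx` is materialized only for inspection; in a memory-constrained setting the bucket would be recomputed on demand so that only the stored weights occupy memory.

```python
import hashlib
import numpy as np


def bucket(i, j, K, layer=0):
    """Hash the connection (i, j) of a layer to a bucket index in {0, ..., K-1}."""
    digest = hashlib.md5(f"{layer}:{i}:{j}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little") % K


def virtual_matrix(w, n_out, n_in, layer=0):
    """Materialize the virtual weight matrix with V[i, j] = w[bucket(i, j)]."""
    K = len(w)
    idx = np.array([[bucket(i, j, K, layer) for j in range(n_in)]
                    for i in range(n_out)])
    return w[idx], idx


w = np.array([0.5, -1.0, 2.0])            # K = 3 stored weights (illustrative values)
V, idx = virtual_matrix(w, n_out=4, n_in=4)

# 16 virtual weights are backed by at most 3 distinct stored values.
print(V.shape, np.unique(V).size)
```

Because the hash is deterministic, the same assignment is recovered at training and test time without storing it.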
4.3 Feature hashing versus weight sharing
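This section focuses on a single layer throughout and to simplify notation we will drop the superscripts $\ell$. We will denote the input activation as $\mathbf{a} = \mathbf{a}^\ell \in \mathbb{R}^m$ of dimensionality $m = n^\ell$. We denote the output as $\mathbf{z} = \mathbf{z}^{\ell+1} \in \mathbb{R}^n$ with dimensionality $n = n^{\ell+1}$.
To facilitate weight sharing within a feed forward neural network, we can simply substitute Eq. (3) into Eq. (2):
$$z_i = \sum_{j=1}^{m} V_{ij} \, a_j = \sum_{j=1}^{m} w_{h(i,j)} \, a_j. \qquad (4)$$
Alternatively and more in line with previous work Weinberger et al. (2009), we may interpret HashedNets in terms of feature hashing. To compute $z_i$, we first hash the activations from the previous layer, a, with the hash mapping function $\phi_i(\cdot): \mathbb{R}^m \to \mathbb{R}^K$. We then compute the inner product between the hashed representation $\phi_i(\mathbf{a})$ and the parameter vector w,
$$z_i = \mathbf{w}^\top \phi_i(\mathbf{a}). \qquad (5)$$
Both w and $\phi_i(\mathbf{a})$ are $K$-dimensional, where $K$ is the number of hash buckets in this layer. The hash mapping function $\phi_i$ is defined as follows. The $k$-th element of $\phi_i(\mathbf{a})$, i.e. $[\phi_i(\mathbf{a})]_k$, is the sum of variables hashed into bucket $k$:
$$[\phi_i(\mathbf{a})]_k = \sum_{j: h(i,j) = k} a_j. \qquad (6)$$
Starting from Eq. (5), we show that the two interpretations (Eq. (4) and (5)) are equivalent:
$$z_i = \sum_{k=1}^{K} w_k \, [\phi_i(\mathbf{a})]_k = \sum_{k=1}^{K} w_k \sum_{j: h(i,j) = k} a_j = \sum_{j=1}^{m} \sum_{k=1}^{K} w_k \, a_j \, \mathbb{1}[h(i,j) = k] = \sum_{j=1}^{m} w_{h(i,j)} \, a_j.$$
The final term is equivalent to Eq. (4).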
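The equivalence of the two interpretations can also be checked numerically. The sketch below is an illustration under stated assumptions: a random integer table `h` stands in for the hash function, indices are 0-based, and the sizes `m`, `n`, `K` are arbitrary small values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, K = 6, 4, 5                        # input size, output size, number of buckets
w = rng.normal(size=K)                   # stored (shared) weights
a = rng.normal(size=m)                   # input activations
h = rng.integers(0, K, size=(n, m))      # bucket per connection, stand-in for a hash

# Weight-sharing view: z_i = sum_j w[h(i, j)] * a_j
z_shared = (w[h] * a).sum(axis=1)

# Feature-hashing view: z_i = w . phi_i(a), where phi_i(a)[k] sums
# the activations a_j whose connection (i, j) falls into bucket k
phi = np.zeros((n, K))
for i in range(n):
    np.add.at(phi[i], h[i], a)           # unbuffered add handles repeated buckets
z_hashed = phi @ w

print(np.allclose(z_shared, z_hashed))
```

`np.add.at` is used instead of `phi[i][h[i]] += a` because the latter would silently drop contributions whenever several inputs collide in the same bucket.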
Sign factor.
With this equivalence between random weight sharing and feature hashing on input activations, HashedNets inherit several beneficial properties of feature hashing. Weinberger et al. (2009) introduce an additional sign factor $\xi(i,j)$ to remove the bias of hashed inner products due to collisions. For the same reasons we multiply (3) by the sign factor $\xi(i,j)$ for parameterizing V Weinberger et al. (2009):
$$V_{ij} = w_{h(i,j)} \, \xi(i,j), \qquad (7)$$
where $\xi(i,j) \in \{-1, +1\}$ is a second hash function independent of $h$. Incorporating $\xi(i,j)$ into feature hashing and weight sharing does not change the equivalence between them, as the proof in the previous section still holds with the sign term (details omitted for improved readability).
Sparsity.
As pointed out in Shi et al. (2009) and Weinberger et al. (2009), feature hashing is most effective on sparse feature vectors since the number of hash collisions is minimized. We can encourage this effect in the hidden layers with sparsity inducing transition functions, e.g. rectified linear units (ReLU) Glorot et al. (2011a) or through specialized regularization Chen et al. (2014); Boureau et al. (2008). In our implementation, we use ReLU transition functions throughout, as they have also been shown to often result in superior generalization performance in addition to their sparsity inducing properties Glorot et al. (2011a).
Alternative neural network architectures.
While this work focuses on general, fully connected feed forward neural networks, the technique of HashedNets could naturally be extended to other kinds of neural networks, such as recurrent neural networks Pineda (1987) or others Bishop (1995). It can also be used in conjunction with other approaches for neural network compression. All weights can be stored with low bit precision Courbariaux et al. (2014); Gupta et al. (2015), edges could be removed Cireşan et al. (2011) and HashedNets can be trained on the outputs of larger networks Ba & Caruana (2014), yielding further reductions in memory requirements.
4.4 Training HashedNets
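Training HashedNets is equivalent to training a standard neural network with equality constraints for weight sharing. Here, we show how to (a) compute the output of a hash layer during the feed-forward phase, (b) propagate gradients from the output layer back to the input layer, and (c) compute the gradient over the shared weights $\mathbf{w}^\ell$ during the back propagation phase. We use dedicated hash functions between layers $\ell$ and $\ell+1$, and denote them as $h^\ell$ and $\xi^\ell$.
Output.
Adding the hash functions $h^\ell(\cdot,\cdot)$ and $\xi^\ell(\cdot,\cdot)$ and the weight vectors $\mathbf{w}^\ell$ into the feed forward update (2) results in the following forward propagation rule:
$$a_i^{\ell+1} = f\left(\sum_{j=0}^{n^\ell} w^\ell_{h^\ell(i,j)} \, \xi^\ell(i,j) \, a_j^\ell\right). \qquad (8)$$
Error term.
Let $\mathcal{L}$ denote the loss function for training the neural network, e.g. cross entropy or the quadratic loss Bishop (1995). Further, let $\delta_j^\ell$ denote the gradient of $\mathcal{L}$ over activation $j$ in layer $\ell$, also known as the error term. Without shared weights, the error term can be expressed as $\delta_j^\ell = \left(\sum_{i=1}^{n^{\ell+1}} V_{ij}^\ell \, \delta_i^{\ell+1}\right) f'(z_j^\ell)$, where $f'(\cdot)$ represents the first derivative of the transition function $f(\cdot)$. If we substitute Eq. (7) into the error term we obtain:
$$\delta_j^\ell = \left(\sum_{i=1}^{n^{\ell+1}} \xi^\ell(i,j) \, w^\ell_{h^\ell(i,j)} \, \delta_i^{\ell+1}\right) f'(z_j^\ell). \qquad (9)$$
Gradient over parameters.
To compute the gradient of $\mathcal{L}$ with respect to a weight $w_k^\ell$ we need the two gradients,
$$\frac{\partial \mathcal{L}}{\partial V_{ij}^\ell} = a_j^\ell \, \delta_i^{\ell+1} \quad \text{and} \quad \frac{\partial V_{ij}^\ell}{\partial w_k^\ell} = \xi^\ell(i,j) \, \mathbb{1}[h^\ell(i,j) = k]. \qquad (10)$$
Here, the first gradient is the standard gradient of the loss with respect to a (virtual) weight, and the second gradient ties the virtual weight matrix to the actual weights through the hashed map. Combining these two, we obtain
$$\frac{\partial \mathcal{L}}{\partial w_k^\ell} = \sum_{i,j} \frac{\partial \mathcal{L}}{\partial V_{ij}^\ell} \frac{\partial V_{ij}^\ell}{\partial w_k^\ell} \qquad (11)$$
$$= \sum_{i=1}^{n^{\ell+1}} \sum_{j=0}^{n^\ell} a_j^\ell \, \delta_i^{\ell+1} \, \xi^\ell(i,j) \, \mathbb{1}[h^\ell(i,j) = k]. \qquad (12)$$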
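The three training steps can be sketched for a single hashed layer in Python with NumPy. This is an illustrative sketch, not the Torch7 implementation used in the experiments: random tables `h` and `xi` stand in for the two hash functions, the layer is followed directly by a quadratic loss against a hypothetical `target`, and all sizes are arbitrary. The gradient over the shared weights is accumulated per bucket and compared against a central-difference estimate, mirroring the numerical gradient checking mentioned in the experimental setting.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, K = 5, 3, 4                            # inputs, outputs, hash buckets
w = rng.normal(size=K)                       # shared weights of the layer
a = rng.normal(size=m)                       # input activations
h = rng.integers(0, K, size=(n, m))          # bucket index per connection
xi = rng.choice([-1.0, 1.0], size=(n, m))    # sign factor per connection
target = rng.normal(size=n)                  # hypothetical regression target


def forward(weights):
    V = weights[h] * xi                      # virtual weights: signed lookup
    z = V @ a                                # pre-activations
    out = np.maximum(z, 0.0)                 # ReLU transition function
    loss = 0.5 * np.sum((out - target) ** 2) # quadratic loss
    return V, z, out, loss


V, z, out, loss = forward(w)
delta = (out - target) * (z > 0)             # error term at the layer output
grad_V = np.outer(delta, a)                  # dL/dV_ij = delta_i * a_j
grad_w = np.zeros(K)
np.add.at(grad_w, h, grad_V * xi)            # sum over all connections per bucket
grad_a = V.T @ delta                         # error passed to the previous layer (before f')

# Central-difference check of the gradient over the shared weights
eps = 1e-6
numeric = np.array([
    (forward(w + eps * np.eye(K)[k])[3] - forward(w - eps * np.eye(K)[k])[3]) / (2 * eps)
    for k in range(K)
])
print(np.allclose(grad_w, numeric, atol=1e-5))
```

The accumulation with `np.add.at` is the code counterpart of the indicator in the second gradient: every connection contributes to exactly one bucket, so shared weights receive the signed sum of the gradients of all virtual weights that map to them.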
5 Related Work
Deep neural networks have achieved great progress on a wide variety of real-world applications, including image classification Krizhevsky et al. (2012); Donahue et al. (2013); Sermanet et al. (2013); Zeiler & Fergus (2014), object detection Girshick et al. (2014); Vinyals et al. (2014); Razavian et al. (2014), speech recognition Hinton et al. (2012); Graves et al. (2013); Mohamed et al. (2011), and text representation Mikolov et al. (2013).
There have been several previous attempts to reduce the complexity of neural networks under a variety of contexts. Arguably the most popular method is the widely used convolutional neural network Simard et al. (2003). In the convolutional layers, the same filter is applied to every receptive field, both reducing model size and improving generalization performance. The incorporation of pooling layers Zeiler & Fergus (2013) can reduce the number of connections between layers in domains exhibiting locality among input features, such as images. Autoencoders Glorot et al. (2011b) share the notion of tied weights by using the same weights for the encoder and decoder (up to transpose).
Other methods have been proposed explicitly to reduce the number of free parameters in neural networks, but not necessarily for reducing memory overhead. Nowlan & Hinton (1992) introduce soft weight sharing for regularization in which the distribution of weight values is modeled as a Gaussian mixture. The weights are clustered such that weights in the same group have similar values. Since weight values are unknown before training, weights are clustered during training. This approach is fundamentally different from HashedNets since it requires auxiliary parameters to record the group membership for every weight.
Instead of sharing weights, LeCun et al. (1989) introduce “optimal brain damage” to directly drop unimportant weights. This approach requires auxiliary parameters for storing the sparse weights and needs retraining time to fine-tune the resulting architecture. Cireşan et al. (2011) demonstrate in their experiments that randomly removing connections leads to superior empirical performance, which shares the same spirit of HashedNets.
Courbariaux et al. (2014) and Gupta et al. (2015) learn networks with reduced numerical precision for storing model parameters (e.g. bit fixed-point representation Gupta et al. (2015) for a compression factor of over double-precision floating point). Experiments indicate little reduction in accuracy compared with models trained with double-precision floating point representation. These methods can be readily incorporated with HashedNets, potentially yielding further reduction in model storage size.
A recent study by Denil et al. (2013) demonstrates significant redundancy in neural network parameters by directly learning a low-rank decomposition of the weight matrix within each layer. They demonstrate that networks composed of weights recovered from the learned decompositions are only slightly less accurate than networks with all weights as free parameters, indicating heavy over-parametrization in full weight matrices. A follow-up work by Denton et al. (2014) uses a similar technique to speed up test-time evaluation of convolutional neural networks. The focus of this line of work is not on reducing storage and memory overhead, but evaluation speed during test time. HashedNets is complementary to this research, and the two approaches could be used in combination.
Following the line of model compression, Buciluǎ et al. (2006), Hinton et al. (2014) and Ba & Caruana (2014) recently introduce approaches to learn a “distilled” model, training a more compact neural network to reproduce the output of a larger network. Specifically, Hinton et al. (2014) and Ba & Caruana (2014) train a large network on the original training labels, then learn a much smaller “distilled” model on a weighted combination of the original labels and the (softened) softmax output of the larger model. The authors show that the distilled model has better generalization ability than a model trained on just the labels. In our experimental results, we show that our approach is complementary by learning HashedNets with soft targets. Rippel et al. (2014) propose a novel dropout method, nested dropout, to give an order of importance for hidden neurons. Hypothetically, less important hidden neurons could be removed after training, a method orthogonal to HashedNets.
Ganchev & Dredze (2008) are among the first to recognize the need to reduce the size of natural language processing models to accommodate mobile platforms with limited memory and computing power. They propose random feature mixing to group features at random based on a hash function, which dramatically reduces both the number of features and the number of parameters. With the help of feature hashing Weinberger et al. (2009), Vowpal Wabbit, a large-scale learning system, is able to scale to terafeature datasets Agarwal et al. (2014).
6 Experimental Results
3 Layers  5 Layers  

RER  LRD  NN  DK  HashNet  HashNet_DK  RER  LRD  NN  DK  HashNet  HashNet_DK  
mnist  1.43  1.22  
basic  2.89  2.62  
rot  10.34  8.61  
bgrand  12.27  10.76  
bgimg  18.92  18.49  
bgimgrot  50.05  45.67  
rect  22.93  23.86  
convex  2.96  2.36 
3 Layers  5 Layers  

RER  LRD  NN  DK  HashNet  HashNet_DK  RER  LRD  NN  DK  HashNet  HashNet_DK  
mnist  2.65  1.92  
basic  3.79  3.19  
rot  17.62  11.67  
bgrand  20.32  13.76  
bgimg  26.17  20.01  
bgimgrot  58.25  51.93  
rect  30.43  26.95  
convex  3.37  2.64 
We conduct extensive experiments to evaluate HashedNets on eight benchmark datasets. For full reproducibility, our code is available at http://www.weinbergerweb.com.
Datasets.
Datasets consist of the original mnist handwritten digit dataset, along with four challenging variants Larochelle et al. (2007). Each variation amends the original through digit rotation (rot), background superimposition (bgrand and bgimg), or a combination thereof (bgimgrot). In addition, we include two binary image classification datasets: convex and rect Larochelle et al. (2007). All data sets have pre-specified training and testing splits. Original mnist has splits of sizes (training) and (testing). Both convex and rect, as well as each mnist variation set, have (training) and (testing).
Baselines and method.
We compare HashedNets with several existing techniques for size-constrained, feed-forward neural networks. Random Edge Removal (RER) Cireşan et al. (2011) reduces the total number of model parameters by randomly removing weights prior to training. Low-Rank Decomposition (LRD) Denil et al. (2013) decomposes the weight matrix into two low-rank matrices. One of these component matrices is fixed while the other is learned. Elements of the fixed matrix are generated according to a zero-mean Gaussian distribution with standard deviation with inputs to the layer.
Each model is compared against a standard neural network with an equivalent number of stored parameters, Neural Network (Equivalent-Size) (NN). For example, for a network with a single hidden layer of units and a storage compression factor of , we adopt a size-equivalent baseline with a single hidden layer of units. For deeper networks, all hidden layers are shrunk at the same rate until the number of stored parameters equals the target size. In a similar manner, we examine Dark Knowledge (DK) Hinton et al. (2014); Ba & Caruana (2014) by training a distilled model to optimize the cross entropy with both the original labels and soft targets generated by the corresponding full neural network (compression factor ). The distilled model structure is chosen to be the same as the “equivalent-sized” network (NN) at the corresponding compression rate.
Finally, we examine our method under two settings: learning hashed weights with the original training labels (HashNet) and with combined labels and DK soft targets (HashNet_DK). In all cases, memory and storage consumption is defined strictly in terms of free parameters. As such, we count the fixed low-rank matrix in the Low-Rank Decomposition method as taking no memory or storage (providing this baseline a slight advantage).
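To make the size-equivalent baseline concrete, the following Python sketch shrinks all hidden layers at the same rate until the parameter count fits a compressed budget. It is one possible reading of the procedure described above, under assumptions: biases are counted as stored parameters, the width is decreased one unit at a time, and the function names and the layer sizes in the example (100 inputs, 400 hidden units, 10 outputs, compression factor 1/4) are hypothetical values chosen for illustration, not the settings of our experiments.

```python
def count_params(sizes):
    """Weights plus biases of a fully connected network with the given layer sizes."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))


def equivalent_hidden_width(n_in, n_out, hidden, depth, compression):
    """Largest common hidden width whose parameter count fits the compressed budget."""
    budget = compression * count_params([n_in] + [hidden] * depth + [n_out])
    width = hidden
    while width > 1 and count_params([n_in] + [width] * depth + [n_out]) > budget:
        width -= 1                      # shrink every hidden layer at the same rate
    return width


full = count_params([100, 400, 10])
small = equivalent_hidden_width(n_in=100, n_out=10, hidden=400, depth=1, compression=0.25)
print(full, small, count_params([100, small, 10]))
```

A HashedNet under the same budget would keep all 400 hidden units and store only the budgeted number of weights, whereas the equivalent-size baseline has to give up hidden units.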
Experimental setting.
HashedNets and all accompanying baselines were implemented using Torch7 Collobert et al. (2011) and run on NVIDIA GTX TITAN graphics cards with cores and GB of global memory. We use bit precision throughout but note that the compression rates of all methods may be improved with lower precision Courbariaux et al. (2014); Gupta et al. (2015). We verify all implementations by numerical gradient checking. Models are trained via stochastic gradient descent (minibatch size of ) with dropout and momentum. ReLU is adopted as the activation function for all models. Hyperparameters are selected for all algorithms with Bayesian optimization (Snoek et al., 2012) and hand tuning on validation splits of the training sets. We use the open source Bayesian Optimization MATLAB implementation “bayesopt.m” from Gardner et al. (2014).³
³ http://tinyurl.com/bayesopt
Results with varying compression.
Figures 2 and 3 show the performance of all methods on mnist and the rot variant with different compression factors on 3-layer (1 hidden layer) and 5-layer (3 hidden layers) neural networks, respectively. Each hidden layer contains hidden units. The axis in each figure denotes the fractional compression factor. For HashedNets and the low-rank decomposition and random edge removal compression baselines, this means we fix the number of hidden units ($n^\ell$) and vary the storage budget ($K^\ell$) for the weights ($\mathbf{w}^\ell$).
We make several observations: The accuracy of HashNet and HashNet_DK outperforms all other baseline methods, especially in the most interesting case when the compression factor is small (i.e. very small models). Both compression baseline algorithms, low-rank decomposition and random edge removal, tend to not outperform a standard neural network with fewer hidden nodes (black line), trained with dropout. For smaller compression factors, random edge removal likely suffers due to a significant number of nodes being entirely disconnected from neighboring layers. The size-matched NN is consistently the best performing baseline; however, its test error is significantly higher than that of HashNet, especially at small compression rates. The use of Dark Knowledge training improves the performance of HashedNets and the standard neural network. Of all methods, only HashNet and HashNet_DK maintain performance for small compression factors.
For completeness, we show the performance of all methods on all eight datasets in Table 1 for compression factor and Table 2 for compression factor . HashNet and HashNet_DK outperform other baselines in most cases, especially when the compression factor is very small (Table 2). With a compression factor of on average only bits of information are stored per (virtual) parameter.
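The storage per virtual parameter follows from simple arithmetic: if each stored weight occupies a fixed number of bits and the stored weights are a fraction of the virtual ones, the average cost per virtual weight is the product of the two. The short Python sketch below illustrates this; the function name, the 32-bit storage width and the two compression factors are assumptions chosen for the example.

```python
def bits_per_virtual_weight(bits_per_stored_weight, compression_factor):
    """Average storage per virtual weight when only a fraction of weights is stored."""
    return bits_per_stored_weight * compression_factor


# Illustrative values: 32-bit stored weights at two hypothetical compression factors.
for factor in (1 / 8, 1 / 64):
    print(factor, bits_per_virtual_weight(32, factor))
```

This accounting relies on the hash function needing no stored index; an explicit assignment table would add its own bits per connection.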
Results with fixed storage.
We also experiment with the setting where the model size is fixed and the virtual network architecture is “inflated”. Essentially we are fixing $K^\ell$ (the number of “real” weights in $\mathbf{w}^\ell$), and vary the number of hidden nodes ($n^\ell$). An expansion factor of 1 denotes the case where every virtual weight has a corresponding “real” weight, $(n^\ell + 1) \times n^{\ell+1} = K^\ell$. Figure 4 shows the test error rate under various expansion rates of a network with one hidden layer (left) and three hidden layers (right). In both scenarios we fix the number of real weights to the size of a standard fully-connected neural network with hidden units in each hidden layer whose test error is shown by the black dashed line.
With no expansion (at expansion rate 1), different compression methods perform differently. At this point edge removal is identical to a standard neural network and matches its results. If no expansion is performed, the HashNet performance suffers from collisions at no benefit. Similarly, the low-rank method still randomly projects each layer to a random feature space with the same dimensionality.
For expansion rates greater than 1, all methods improve over the fixed-sized neural network. There is a general trend that more expansion decreases the test error until a “sweet spot”, after which additional expansion tends to hurt. The test error of the HashNet neural network decreases substantially through the introduction of more “virtual” hidden nodes, even though no additional parameters are added. In the case of the 5-layer neural network (right) this trend is maintained to an expansion factor of , resulting in “virtual” nodes. One could hypothetically increase $n^\ell$ arbitrarily for HashNet; however, in the limit, too many hash collisions would result in increasingly similar gradient updates for all weights in w.
The benefit from expanding a network cannot continue forever. With random edge removal the network will become very sparsely connected; the low-rank decomposition approach will eventually lead to a decomposition into rank-1 matrices. HashNet also respects this trend, but is much less sensitive when the expansion goes up. Best results are achieved when networks are inflated by a factor .
7 Conclusion
Prior work shows that weights learned in neural networks can be highly redundant Denil et al. (2013). HashedNets exploit this property to create neural networks with “virtual” connections that seemingly exceed the storage limits of the trained model. This can have surprising effects. Figure 4 in Section 6 shows the test error of neural networks can drop nearly , from to , through expanding the number of weights “virtually” by a factor . Although the collisions (or weight-sharing) might serve as a form of regularization, we can probably safely ignore this effect as both networks (with and without expansion) were also regularized with dropout Srivastava et al. (2014) and the hyperparameters were carefully fine-tuned through Bayesian optimization.
So why should additional virtual layers help? One answer is that they probably truly increase the expressiveness of the neural network. As an example, imagine we are provided with a neural network with hidden nodes. The internal weight matrix has weights. If we add another set of hidden nodes, this increases the expressiveness of the network. If we require all weights of connections to these additional nodes to be “reused” from the set of existing weights, it is not a strong restriction given the large number of weights in existence. In addition, the backprop algorithm can adjust the shared weights carefully to have useful values for all their occurrences.
As future work we plan to further investigate model compression for neural networks. One particular direction of interest is to optimize HashedNets for GPUs. GPUs are very fast (through parallel processing) but usually feature small on-board memory. We plan to investigate how to use HashedNets to fit larger networks onto the finite memory of GPUs. A specific challenge in this scenario is to avoid non-coalesced memory accesses due to the pseudo-random hash functions, a sensitive issue for GPU architectures.
References
 Agarwal et al. (2014) Agarwal, Alekh, Chapelle, Olivier, Dudík, Miroslav, and Langford, John. A reliable effective terascale linear learning system. The Journal of Machine Learning Research, 15(1):1111–1133, 2014.
 Ba & Caruana (2014) Ba, Jimmy and Caruana, Rich. Do deep nets really need to be deep? In NIPS, pp. 2654–2662, 2014.

 Bishop (1995) Bishop, Christopher M. Neural Networks for Pattern Recognition. Oxford University Press, Inc., 1995.
 Boureau et al. (2008) Boureau, Y-lan, Cun, Yann L, et al. Sparse feature learning for deep belief networks. In NIPS, pp. 1185–1192, 2008.
 Buciluǎ et al. (2006) Buciluǎ, Cristian, Caruana, Rich, and Niculescu-Mizil, Alexandru. Model compression. In KDD, 2006.
 Chen et al. (2014) Chen, Minmin, Weinberger, Kilian Q., Sha, Fei, and Bengio, Yoshua. Marginalized denoising autoencoders for nonlinear representations. In ICML, pp. 1476–1484, 2014.
 Chun & Maniatis (2009) Chun, Byung-Gon and Maniatis, Petros. Augmented smartphone applications through clone cloud execution. In HotOS, 2009.
 Cireşan et al. (2011) Cireşan, Dan C, Meier, Ueli, Masci, Jonathan, Gambardella, Luca M, and Schmidhuber, Jürgen. High-performance neural networks for visual object classification. arXiv preprint arXiv:1102.0183, 2011.
 Coates et al. (2011) Coates, Adam, Ng, Andrew Y, and Lee, Honglak. An analysis of single-layer networks in unsupervised feature learning. In AISTATS, 2011.
 Coates et al. (2013) Coates, Adam, Huval, Brody, Wang, Tao, Wu, David, Catanzaro, Bryan, and Ng, Andrew. Deep learning with COTS HPC systems. In Proceedings of The 30th International Conference on Machine Learning, pp. 1337–1345, 2013.
 Collobert et al. (2011) Collobert, Ronan, Kavukcuoglu, Koray, and Farabet, Clément. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.
 Courbariaux et al. (2014) Courbariaux, M., Bengio, Y., and David, J.-P. Low precision storage for deep learning. arXiv preprint arXiv:1412.7024, 2014.
 Denil et al. (2013) Denil, Misha, Shakibi, Babak, Dinh, Laurent, de Freitas, Nando, et al. Predicting parameters in deep learning. In NIPS, 2013.
 Denton et al. (2014) Denton, Emily, Zaremba, Wojciech, Bruna, Joan, LeCun, Yann, and Fergus, Rob. Exploiting linear structure within convolutional networks for efficient evaluation. arXiv preprint arXiv:1404.0736, 2014.
 Donahue et al. (2013) Donahue, Jeff, Jia, Yangqing, Vinyals, Oriol, Hoffman, Judy, Zhang, Ning, Tzeng, Eric, and Darrell, Trevor. Decaf: A deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531, 2013.
 Ganchev & Dredze (2008) Ganchev, Kuzman and Dredze, Mark. Small statistical models by random feature mixing. In Workshop on Mobile NLP at ACL, 2008.
 Gardner et al. (2014) Gardner, Jacob, Kusner, Matt, Weinberger, Kilian, Cunningham, John, et al. Bayesian optimization with inequality constraints. In ICML, 2014.
 Girshick et al. (2014) Girshick, Ross, Donahue, Jeff, Darrell, Trevor, and Malik, Jitendra. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
 Glorot et al. (2011a) Glorot, Xavier, Bordes, Antoine, and Bengio, Yoshua. Deep sparse rectifier networks. In AISTATS, 2011a.
 Glorot et al. (2011b) Glorot, Xavier, Bordes, Antoine, and Bengio, Yoshua. Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML, pp. 513–520, 2011b.
 Graves et al. (2013) Graves, Alex, Mohamed, AR, and Hinton, Geoffrey. Speech recognition with deep recurrent neural networks. In ICASSP, 2013.
 Gupta et al. (2015) Gupta, Suyog, Agrawal, Ankur, Gopalakrishnan, Kailash, and Narayanan, Pritish. Deep learning with limited numerical precision. arXiv preprint arXiv:1502.02551, 2015.
 Hinton et al. (2012) Hinton, Geoffrey, Deng, Li, Yu, Dong, Dahl, George E, Mohamed, Abdelrahman, Jaitly, Navdeep, Senior, Andrew, Vanhoucke, Vincent, Nguyen, Patrick, Sainath, Tara N, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE, 29(6):82–97, 2012.
 Hinton et al. (2014) Hinton, Geoffrey, Vinyals, Oriol, and Dean, Jeff. Distilling the knowledge in a neural network. NIPS workshop, 2014.
 Karpathy & Fei-Fei (2014) Karpathy, Andrej and Fei-Fei, Li. Deep visual-semantic alignments for generating image descriptions. arXiv preprint arXiv:1412.2306, 2014.
 Kosner (2012) Kosner, A.W. Client vs. server architecture: Why google voice search is also much faster than siri @ONLINE, October 2012. URL http://tinyurl.com/c2d2otr.
 Krizhevsky et al. (2012) Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
 Larochelle et al. (2007) Larochelle, Hugo, Erhan, Dumitru, Courville, Aaron C, Bergstra, James, and Bengio, Yoshua. An empirical evaluation of deep architectures on problems with many factors of variation. In ICML, pp. 473–480, 2007.

 Le (2013) Le, Quoc V. Building high-level features using large scale unsupervised learning. In ICASSP, pp. 8595–8598. IEEE, 2013.
 LeCun et al. (1989) LeCun, Yann, Denker, John S, Solla, Sara A, Howard, Richard E, and Jackel, Lawrence D. Optimal brain damage. In NIPS, 1989.
 LeCun et al. (2012) LeCun, Yann A, Bottou, Léon, Orr, Genevieve B, and Müller, Klaus-Robert. Efficient backprop. In Neural networks: Tricks of the trade, pp. 9–48. Springer, 2012.
 Lee & Verma (2013) Lee, Kyong Ho and Verma, Naveen. A low-power processor with configurable embedded machine-learning accelerators for high-order and adaptive analysis of medical-sensor signals. Solid-State Circuits, IEEE Journal of, 48(7):1625–1637, 2013.
 Mikolov et al. (2013) Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Greg S, and Dean, Jeff. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
 Mohamed et al. (2011) Mohamed, Abdelrahman, Sainath, Tara N, Dahl, George, Ramabhadran, Bhuvana, Hinton, Geoffrey E, and Picheny, Michael A. Deep belief networks using discriminative features for phone recognition. In ICASSP, 2011.
 Montemerlo et al. (2008) Montemerlo, Michael, Becker, Jan, Bhat, Suhrid, Dahlkamp, Hendrik, Dolgov, Dmitri, Ettinger, Scott, Haehnel, Dirk, Hilden, Tim, Hoffmann, Gabe, Huhnke, Burkhard, et al. Junior: The Stanford entry in the urban challenge. Journal of Field Robotics, 25(9):569–597, 2008.

 Nair & Hinton (2010) Nair, Vinod and Hinton, Geoffrey E. Rectified linear units improve restricted boltzmann machines. In ICML, pp. 807–814, 2010.
 Nowlan & Hinton (1992) Nowlan, Steven J and Hinton, Geoffrey E. Simplifying neural networks by soft weight-sharing. Neural computation, 4(4):473–493, 1992.
 Pineda (1987) Pineda, Fernando J. Generalization of backpropagation to recurrent neural networks. Physical review letters, 59(19):2229, 1987.
 Razavian et al. (2014) Razavian, Ali Sharif, Azizpour, Hossein, Sullivan, Josephine, and Carlsson, Stefan. Cnn features off-the-shelf: an astounding baseline for recognition. In CVPR Workshop, 2014.
 Rippel et al. (2014) Rippel, Oren, Gelbart, Michael A, and Adams, Ryan P. Learning ordered representations with nested dropout. arXiv preprint arXiv:1402.0915, 2014.
 Salton & Buckley (1988) Salton, Gerard and Buckley, Christopher. Term-weighting approaches in automatic text retrieval. Information processing & management, 24(5):513–523, 1988.

 Schuster (2010) Schuster, Mike. Speech recognition for mobile devices at google. In PRICAI 2010: Trends in Artificial Intelligence, pp. 8–10. Springer, 2010.
 Sermanet et al. (2013) Sermanet, Pierre, Eigen, David, Zhang, Xiang, Mathieu, Michaël, Fergus, Rob, and LeCun, Yann. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
 Shi et al. (2009) Shi, Qinfeng, Petterson, James, Dror, Gideon, Langford, John, Smola, Alex, and Vishwanathan, S.V.N. Hash kernels for structured data. Journal of Machine Learning Research, 10:2615–2637, December 2009.
 Simard et al. (2003) Simard, Patrice Y, Steinkraus, Dave, and Platt, John C. Best practices for convolutional neural networks applied to visual document analysis. In ICDAR, volume 2, pp. 958–958. IEEE Computer Society, 2003.
 Snoek et al. (2012) Snoek, Jasper, Larochelle, Hugo, and Adams, Ryan P. Practical bayesian optimization of machine learning algorithms. In NIPS, 2012.
 Srivastava et al. (2014) Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
 Vinyals et al. (2014) Vinyals, Oriol, Toshev, Alexander, Bengio, Samy, and Erhan, Dumitru. Show and tell: A neural image caption generator. arXiv preprint arXiv:1411.4555, 2014.
 Weinberger et al. (2009) Weinberger, Kilian, Dasgupta, Anirban, Langford, John, Smola, Alex, and Attenberg, Josh. Feature hashing for large scale multitask learning. In ICML, 2009.
 Zeiler & Fergus (2013) Zeiler, Matthew D and Fergus, Rob. Stochastic pooling for regularization of deep convolutional neural networks. arXiv preprint arXiv:1301.3557, 2013.
 Zeiler & Fergus (2014) Zeiler, Matthew D and Fergus, Rob. Visualizing and understanding convolutional networks. In ECCV, 2014.