Despite the increased computational power resulting from advances in high-performance computing, e.g. GPUs and large-scale clusters, the deployment of deep neural networks, especially under stringent resource constraints, remains a serious challenge because of their extraordinary demand for storage and computation. Binarizing the data representation at each layer is a natural solution. In practice, most deep neural networks use single-precision (32-bit) floating-point numbers for data representation, so turning these values into binary ones would reduce the required storage by a factor of 32. Also, with real values replaced by binary values, the multiplication operations, which dominate the run time in most cases, reduce to bit operations that are much more efficient. Overall, a binary representation would allow us to design much bigger and more powerful networks using limited computation and storage resources. Moreover, as a way to understand deep neural networks, binary activations give a clear definition of neuron firing, which enables us to study the compositional architectures, the firing patterns, and various other properties in a principled way.
While we desire a complete binary representation, optimizing an integer programming problem over millions of variables is intimidating. It is also somewhat unreasonable to constrain all the activations to be binary from the very beginning, especially in the lower layers. We instead take an alternative approach: the activations are initialized with real values, but are encouraged to diverge gradually toward 0 and 1 during training. The hope is that the network reaches a solution where a binary representation is favored without loss of performance. Concretely, our activation function is bounded between 0 and 1, and linear in between. The slope of the linear part is parameterized and can be learned from data. We encourage binary values by a weight growing constraint on these slope parameters (the opposite of weight decay). Ideally, if the final activations are binary across all the data, the activation function works just like a step function. The limitation of the current approach is that the model consumes slightly more memory and computation during training.
We replace the ReLU nonlinearity with our bounded rectifiers in current deep model architectures, and test the performance on the standard MNIST, CIFAR10, and ImageNet benchmarks. We have the following findings. First, our proposed activation function is at least as expressive as ReLU in practice: we observe no loss of performance when we do not force a complete binary representation. Since binary values are intrinsically more robust, we even find that bounded activations together with the weight growing constraint can be considered a new way to regularize the model. Second, our approach can binarize the last-layer representation without loss of performance on all the tasks, and suffers only modestly when binarizing all layers in the network. We further study how each layer affects the performance when it is binarized. It turns out that the binarization of a few particular layers accounts for most of the loss. Third, when we enlarge the layers with more channels, we gain a clear performance improvement. This indicates that some layers are configured too small and limit the performance of binary representations. Last, since the binary representation makes the output less sensitive to the model parameters, we even manage to binarize the model parameters altogether, which leads to a complete bit-wise deep model.
Binary neurons give a clear definition of firing and inhibition, which is much easier to understand and interpret than real values. To understand these representations, we find that there exist some “positive classes” and “negative classes” for which all the instances in the class consistently cause a binary neuron to turn on or off. In this sense, the function of such a neuron is simply to separate the “positive classes” from the “negative classes”. Some of the neurons are capable of representing high-level concepts like animals and artificial objects, while other neurons are more elusive when judged by class labels alone. To discover their functionality visually, we show the image pattern shared across these classes and assign a semantic description to the neuron. It turns out that the notions these neurons have captured are still quite low level, such as repetitive patterns, square shapes and rugged textures. We find these results intriguing. We hope that our binary model can help to study and understand the properties of deep neural networks, and potentially accelerate the deployment of DNNs in computationally limited environments.
2 Related Works
Learning binary codes is an active research topic in machine learning in general. The code should be reasonably short, but contain as much information as possible, and allow fast computation at inference time. When handling very large data sets, binary codes are essential for developing efficient search and matching algorithms. Torralba et al. (2008) first introduced the problem to the vision community. Techniques based on hashing (Andoni & Indyk, 2006; Weiss et al., 2009) and quantization (Gong & Lazebnik, 2011) have been proposed. Encoding high-dimensional image data into short binary codes could be particularly useful for large-scale similarity search and indexing on the web.
The binary representation for deep models is not a new topic. In the earliest days of neural networks, inspired by biology, the Heaviside step function was used as the activation function. Training algorithms such as Toms (1990); Barlett & Downs (1992) have been proposed. However, they never take full advantage of the back-propagation algorithm, and only work on toy examples with one hidden layer. For deep probabilistic models like RBM, DBN and DBM (Hinton, 2002; Hinton et al., 2006; Salakhutdinov & Hinton, 2009), all of the internal representations are modeled as binary Bernoulli variables. Despite a rather simple representation, these neural networks have shown that binary representations can still be very powerful for modeling complex data. However, for discriminative models, especially the convolutional neural networks that have dramatically advanced the state of the art in many areas, research that focuses on optimizing the binary codes is limited.
Previous work does make some positive observations by directly binarizing the features after regular training is done. As the ReLU activation function has been widely used, the feature patterns can be understood as either activated or inhibited. Agrawal et al. (2014) binarize the features into positives and zeros, and find that the performance drop is negligible. The experiment reveals that a local minimum with a binary representation exists for an individual layer. But it does not discuss how the binary representation could be further optimized, or how binary representations for multiple layers coexist. Later on, Lin et al. (2015); Zhong et al. (2015) merely optimize the representation of the last layer using a static sigmoid function, which is never truly binary. Our work enforces the representations to be truly binary, and it can be applied throughout the whole network.
Our work is closely related to model compression for deep neural networks, as we are optimizing networks under constrained settings. Han et al. (2015) reduce network parameters by more than 30 times without performance loss by removing connection weights that fall below a certain threshold. As another way to compress networks, Soudry et al. (2014); Courbariaux et al. (2015) propose deep networks with completely binary parameters. Unlike these works, which focus on compressing model parameters, we work on compressing the representations. This could be extremely useful for image applications where feature representations consume most of the storage. We believe both lines of research hold the same promise of making computation fast and storage consumption efficient.
Our goal is to learn a binary representation for deep neural networks at some or all of the layers. Large-scale integer programming is extremely hard, especially for deep neural networks with millions of units. We avoid this approach, and instead softly encourage the units to move from real values to binary values. We achieve this by designing an activation function that constrains the value between 0 and 1, and favors 0 and 1 as local minima during training. The approach borrows the nice gradient property of ReLU and takes advantage of the full power of back-propagation for optimization. We now describe the activation function and show how we encourage the units to diverge.
3.1 Adjustable bounded rectifiers
We design the activation function to be a simple linear function, but clip values less than 0 or bigger than 1. Formally, it can be defined as:

$$f(x_i) = \max\bigl(0, \min(1, a_i x_i)\bigr)$$

where $x_i$ is the input of the activation function on the $i$-th channel and $a_i$ is the coefficient controlling the slope of the linear interval. The subscript $i$ indicates that units in the same channel share the same slope parameter. When the slope is small, the unit can perceive a large input range; when the slope is large, it is only sensitive to a small input range. Once the activation is saturated, it has less chance of jumping back to the linear interval, because the gradient in the saturated zone is 0. This is the intuition behind designing an activation function that favors binary values.
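As a minimal sketch of the clipped linear activation described above, the following plain-Python snippet evaluates it for a small and a large slope. The function names, the sample inputs, and the convention that the step function fires for strictly positive inputs are assumptions made for illustration, not details taken from the paper.

```python
def bounded_rectifier(x, slope=1.0):
    """Clip slope * x to the interval [0, 1]."""
    return max(0.0, min(1.0, slope * x))


def step(x):
    """Step function (assumed convention: fires when x > 0)."""
    return 1.0 if x > 0 else 0.0


inputs = [-2.0, -0.1, 0.05, 0.4, 3.0]

# A small slope keeps many inputs inside the linear interval.
small_slope = [bounded_rectifier(x, slope=0.5) for x in inputs]

# A large slope saturates almost every input, so the outputs are binary.
large_slope = [bounded_rectifier(x, slope=100.0) for x in inputs]

print(small_slope)
print(large_slope)
```

With the large slope, every output in this toy example coincides with the step function, which is the behavior the binarization relies on.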
Our activation function introduces a number of extra parameters: one slope coefficient for each feature channel. In fact, these parameters are redundant, because they can be equivalently absorbed into the preceding convolution or fully connected layer. For a channel with slope $a$, the weights and bias $(W, b)$ of the preceding layer can be replaced by $(aW, ab)$, and the activation function then works with a constant slope of one. When the slope is large enough, our function degenerates to something close to the standard step function. The slope parameter can then even be discarded, and the activation function can be replaced by a step function at inference time.
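The redundancy argument can be checked numerically. The sketch below uses a single fully connected unit in plain Python; the weights, bias, slope and inputs are hypothetical values chosen only for illustration.

```python
def clip01(v):
    """Bounded rectifier with slope one."""
    return max(0.0, min(1.0, v))


def linear(weights, bias, inputs):
    """Pre-activation of one fully connected unit."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


weights, bias, slope = [0.3, -0.2, 0.5], 0.1, 2.0
samples = [[1.0, 0.5, 0.2], [-1.0, 2.0, 0.0], [0.2, 0.1, 0.4]]

# Explicit slope applied inside the activation.
with_slope = [clip01(slope * linear(weights, bias, x)) for x in samples]

# Slope folded into the preceding layer: (W, b) -> (slope * W, slope * b).
scaled_weights = [slope * w for w in weights]
absorbed = [clip01(linear(scaled_weights, slope * bias, x)) for x in samples]

print(with_slope)
print(absorbed)
```

Both lists agree up to floating-point rounding, so the slope can be kept as a separate parameter during training and folded away afterwards.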
One may worry that the bounded rectifier hurts the expressiveness of the neural network, since, unlike ReLU, it clips values bigger than 1. However, the “redundant” slope parameters actually make it at least as powerful as ReLU. To see this, we can cast any function expressed by a ReLU network into a network with bounded activations. The idea is to set the slope so that the activations always lie in the linear interval. Consider a network with a convolution layer, a ReLU activation function and another convolution layer, and suppose the maximum output value of the first convolution layer over the data is $M$. We can simply set the slope in the new network to $1/M$. In this way, the bounded rectifiers never saturate at the value 1. Since this shrinks the magnitude of the activations by a factor of $M$, we compensate by scaling up the second convolution by the same factor and applying appropriate translations. We can continue to do this for the rest of the network. The resulting network with bounded rectifiers is then equivalent to the ReLU network.
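The construction can be illustrated on a toy scalar network with one hidden unit. This is a sketch under simplifying assumptions: the maximum pre-activation is measured over a fixed list of sample inputs, all parameter values are hypothetical, and in this scalar case rescaling the second layer is sufficient.

```python
def relu(v):
    return max(0.0, v)


def clip01(v):
    return max(0.0, min(1.0, v))


def relu_net(x, w1, b1, w2, b2):
    """Layer, ReLU, layer."""
    return w2 * relu(w1 * x + b1) + b2


def bounded_net(x, w1, b1, w2, b2, max_pre):
    """Same network with a bounded rectifier of slope 1 / max_pre."""
    slope = 1.0 / max_pre
    # The second layer is scaled up by max_pre to undo the shrinkage.
    return (w2 * max_pre) * clip01(slope * (w1 * x + b1)) + b2


xs = [-3.0, -1.0, 0.0, 0.5, 2.0]
w1, b1, w2, b2 = 1.5, 0.25, -0.8, 0.1

# Largest pre-activation of the first layer over the sample inputs.
max_pre = max(w1 * x + b1 for x in xs)

relu_out = [relu_net(x, w1, b1, w2, b2) for x in xs]
bounded_out = [bounded_net(x, w1, b1, w2, b2, max_pre) for x in xs]
print(relu_out)
print(bounded_out)
```

On these inputs the two networks produce the same outputs up to rounding, because the scaled pre-activation never exceeds 1.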
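The adjustable bounded rectifiers are differentiable almost everywhere, and the slope parameters can be learned end-to-end together with the other model parameters. The gradient of a slope parameter is

$$\frac{\partial E}{\partial a_i} = \sum_{x_i} \frac{\partial E}{\partial f(x_i)} \, x_i \, \mathbb{1}(0 < a_i x_i < 1)$$

where $E$ is the objective function of the neural network, $x_i$ is the input to a unit on the $i$-th channel, $a_i$ is the slope shared by that channel, $f$ is the activation function, and $\mathbb{1}(\cdot)$ is the indicator function. The summation runs over all the units that share the same slope parameter. With this derivative, we can train the whole neural network under the standard framework.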
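The slope gradient can be sanity-checked against a finite-difference estimate. In this sketch the objective is a hypothetical linear function of the activations, so the upstream gradient of each unit is a fixed number; only units whose scaled input lies strictly inside the linear interval contribute.

```python
def clip01(v):
    return max(0.0, min(1.0, v))


def slope_gradient(xs, upstream, slope):
    """Sum of upstream_gradient * x over units in the linear interval."""
    return sum(g * x for x, g in zip(xs, upstream) if 0.0 < slope * x < 1.0)


xs = [-1.0, 0.2, 0.6, 3.0]        # inputs of units sharing one slope
upstream = [0.5, -1.0, 2.0, 0.7]  # dE/df for each unit (toy values)
slope = 1.2


def objective(a):
    """Toy objective whose gradient w.r.t. each activation is `upstream`."""
    return sum(g * clip01(a * x) for x, g in zip(xs, upstream))


analytic = slope_gradient(xs, upstream, slope)

eps = 1e-6
numeric = (objective(slope + eps) - objective(slope - eps)) / (2 * eps)
print(analytic, numeric)
```

The first and last units are saturated (at 0 and 1 respectively), so they drop out of the sum and only the two middle units contribute.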
3.2 Regularization As Weight Growth
Deep neural networks usually have many local minima. Although units tend to get trapped in the saturated regions where no gradient is available, using the above activation function alone is not sufficient to obtain the desired binary representations. We have to aggressively encourage more units to diverge during the training process. One possible way is to interpret each activation as a probability, and add an entropy loss for each individual unit. We find that this generally works, but it can be tricky to choose the loss weights across different layers. Another way is to let the bounded rectifiers get close to step functions. We can achieve this by encouraging the slope parameters to grow. This may sound strange at first, because parameters are usually preferred to be small to avoid overfitting. However, a bigger slope leads to more binary values, which are robust and actually less prone to overfitting. We even find that adding the slope growing constraint can lessen the need for dropout. Concretely, we use the negative log of the slope as the loss function. It has two appealing properties. First, the loss is unbounded towards negative infinity, which drives the slope to grow aggressively. Second, the derivative of the loss goes to zero as the slope goes to infinity, which keeps the optimization process stable. We have also experimented with other loss functions, but none of them works better than the log function.
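The two properties of the negative log loss can be seen in a few lines. This is a sketch that assumes a single positive slope and a hypothetical step size; it runs plain gradient descent on the growth loss alone, without any task loss.

```python
import math


def growth_loss(slope):
    """Negative log penalty on a positive slope."""
    return -math.log(slope)


def growth_loss_grad(slope):
    """Derivative of the penalty: -1 / slope."""
    return -1.0 / slope


slope = 1.0
step_size = 0.5  # hypothetical value, for illustration only
history = []
for _ in range(5):
    # Descending the negative log increases the slope.
    slope -= step_size * growth_loss_grad(slope)
    history.append(slope)

increments = [b - a for a, b in zip([1.0] + history[:-1], history)]
print(history)
print(increments)
```

The slope keeps growing, while each increment is smaller than the previous one, since the gradient magnitude decays as one over the slope.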
3.3 Implementation Issues
At the start of training, the bounded rectifiers behave almost the same as the ReLU function. Therefore, any initialization method that works for ReLU can be directly applied to the bounded rectifiers. In our experiments, we use the Xavier method (Glorot & Bengio, 2010) for all the layers (the method of He et al. (2015) also works), and initialize the slope with the value 1 (a Gaussian with standard deviation 1 also works). Ioffe & Szegedy (2015) explicitly normalize each layer to have zero mean and unit variance. This could be particularly useful for networks that have dramatic training dynamics and extremely different feature magnitudes. For example, the features of the VGG network (Simonyan & Zisserman, 2014) differ greatly in magnitude between the front and the end of the network. However, our activation magnitudes and training dynamics are much more stable because of the activation function. We simply scale the input by a constant factor to make the first layer consistent with the rest.
We divide the optimization protocol into two phases. In the first phase, we set the slope growing constraint to be relatively small. The constraint helps regularize the model, and the training procedure focuses on optimizing the performance, not on producing a binary representation. The first phase can reach exactly the original performance with about 70-90% binary representations (this may vary across layers). In the second phase, we set the slope growing constraint to be much larger. This pushes the remaining continuous values to binary while trying to maintain the objective performance. During the gradient update, the growing constraint is added outside of the learning rate to aggressively binarize the representations. At the inference stage, we replace the bounded rectifiers with step functions.
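One possible reading of "added outside of the learning rate" is that the growth term is not multiplied by the learning rate in the slope update. The sketch below follows that interpretation; the update rule, the function name and all numeric values are assumptions, not the authors' exact procedure.

```python
def update_slope(slope, task_grad, learning_rate, growth_weight):
    """One slope update: task gradient scaled by the learning rate,
    growth term (derivative of -log(slope)) applied without it."""
    growth_grad = -1.0 / slope
    return slope - learning_rate * task_grad - growth_weight * growth_grad


# Phase 1: small growth weight, the task gradient dominates.
phase1 = update_slope(2.0, task_grad=0.3, learning_rate=0.01, growth_weight=0.001)

# Phase 2: much larger growth weight, the slope is pushed up regardless.
phase2 = update_slope(2.0, task_grad=0.3, learning_rate=0.01, growth_weight=0.1)

print(phase1, phase2)
```

Under this reading, lowering the learning rate late in training does not weaken the push toward binary values, which matches the stated goal of the second phase.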
The baseline methods are solvers and architectures from the Caffe software package (Jia et al., 2014). We conduct the experiments in two primary settings: binarizing the features of the last layer, and binarizing the features of all layers. Both settings have great practical value, and our approach gives very promising results. In the last section, we discuss binarizing the model parameters as well to derive a complete bit-wise deep model.
4.1 Binarizing the last layer representation
Table 1. Columns: Regular | Binary | Finetuned | 1st Phase | 1st Binary | 2nd Phase | 2nd Binary
Binarizing the features of the last layer can be very useful for large-scale image search and retrieval algorithms. Our experiments show that features from the last layer of current CNNs can be safely binarized. Take MNIST as an example: the baseline method obtains 99.11% with standard ReLU activations. Directly binarizing the features without retraining the softmax hurts the performance considerably. Retraining the classifier while fixing the binary features boosts the performance to 98.37%. For our method, we monitor the level of binarization as well as the classification performance over the course of training in Figure 2. Training the first phase with a small weight growth steadily binarizes more units. Directly binarizing the features without retraining the softmax gives a performance of 99.12%. Training the second phase with a larger weight growth maintains an accuracy of 99.22%, but substantially increases the level of binarization. Directly binarizing the features gives the same performance of 99.22%. Note that there is no need to retrain the classifier for our method, because the level of binarization is already very high. The corresponding experimental results on CIFAR10 and ImageNet are summarized in Table 1. On all three datasets, we can match the performance obtained with real-valued features. It is important to note that our method does not simply threshold ReLU activations; it learns a totally different representation. For example, the AlexNet features on ImageNet are quite sparse, while ours are not sparse at all: our features are 58% zeros and 42% ones.
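The level of binarization monitored during training can be computed as the fraction of activations that sit exactly at 0 or 1. The sketch below is one plausible definition, written in plain Python; the function names, the optional tolerance and the toy feature vector are assumptions.

```python
def binarization_level(activations, tol=0.0):
    """Fraction of activations saturated at 0 or 1 (within `tol`)."""
    saturated = sum(1 for a in activations if a <= tol or a >= 1.0 - tol)
    return saturated / len(activations)


def fraction_of_ones(activations):
    """Share of activations that are exactly 1, a simple sparsity measure."""
    return sum(1 for a in activations if a >= 1.0) / len(activations)


# Toy feature vector produced by bounded rectifiers.
features = [0.0, 1.0, 1.0, 0.37, 0.0, 1.0, 0.92, 0.0]

level = binarization_level(features)
ones = fraction_of_ones(features)
print(level, ones)
```

Here six of the eight values are saturated, so the level is 0.75, and three of the eight are ones.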
Feature robustness. Binary features are inherently robust and less prone to overfitting. To demonstrate this, we train the deep models for more iterations using a small weight growth constraint. In Figure 3, the baseline model with ReLU activations is severely affected by overfitting, and the training loss drops much faster than the testing loss. For our bounded rectifiers on CIFAR10, longer training continues to give better performance. We reach a performance of 80.48%, which is significantly better than 75.40%, the best performance that the baseline method achieves with the same architecture. On ImageNet, our method only uses dropout on the fc6 layer but not on the fc7 layer. One baseline setting is consistent with ours, and the other uses full dropout training. This experiment shows that even when a binary representation is not desired, bounded rectifiers together with the weight growth constraint should be considered an effective means of regularization.
4.2 Binarizing the whole network
Binarizing the representations across all the layers could save a great deal of memory and speed up computation. It also allows us to explore even deeper and wider network architectures. Clearly, forcing all the layers to be binary is an extremely difficult task. If a layer at the front is binarized and loses too much information, the following layers have no chance to recover it. Also, if a layer at the end is binarized much faster than the previous layers, the gradient signal becomes too weak to be propagated. To our surprise, as shown in Figure 4, our network automatically maintains a steady level of binarization across all the layers while conservatively encouraging binarization.
We conduct experiments on CIFAR10 and ImageNet. For the baseline method, we take the well-trained ReLU model; it suffers a lot if we binarize all the layers simultaneously. (The baseline here is a little lower than the original AlexNet because it is trained with Xavier initialization. We want to cancel out the effects of initialization, since the original AlexNet is carefully tuned.) Alternatively, we can add one binary representation layer at a time from the front to the back, and finetune all the layers behind it as real values. We then repeat this process until all the layers are binarized. The performance increases to 45.48% and 27.73%, respectively. For our approach, we replace the ReLU activations with our bounded activations in all the layers and train the deep model from scratch. After two phases of training, we obtain accuracies of 72.14% and 68.39% by directly replacing each bounded rectifier with a step function. We could also follow the layer-wise finetuning used for the baseline method. In our experiment, to keep things simple, we only finetune the final softmax classifier. Our final results are 73.08% and 68.85%. Details are included in Table 2.
Table 2. Columns: Regular | Binary | Finetuned | 1st Phase | 1st Binary | 2nd Phase | 2nd Binary | Finetuned
Larger network architectures. We lose about 2.4% and 9.5% of accuracy when using the same architecture as the networks with real values. To find out in which layers our approach suffers most, we train a set of models with progressively more binary layers, e.g. binarizing only the last layer, then adding the second-to-last layer, and so on. In Figure 4, on AlexNet, we can see that binarizing the conv2, conv3, and fc7 layers does not hurt the performance, while other layers may cause noticeable losses. It may be the capacity of the network architecture that limits the performance of binary representations. Can we reach the same or even better performance by using bigger networks? To test this possibility, we double the number of feature channels in all the layers, and redo the experiment under the same settings. We reach a performance of 76.22% and 74.23% with the twice-larger networks. This suggests that the architecture should be made larger for binary values. It would also be very interesting to investigate the minimum binary architecture needed to match the original performance. For example, perhaps only the conv1 and conv5 layers need to be made larger, and all the other layers could remain the same. Since this requires an exhaustive search over a large number of models, we leave it for future work.
4.3 Activation patterns of binary feature units
Binary representations give a clear definition of firing and non-firing. Can we develop an interpretation of the binary values zero and one? To start with, we work on the ImageNet dataset and calculate the average responses of the fc7 units for all 1000 classes. This gives us a matrix (see Figure 5) where each element indicates how often a specific neuron fires for each class. We find that for some neurons, there exist some “positive classes” where all the instances in the class consistently fire this neuron, and some “negative classes” where all the instances in the class consistently inhibit this neuron. Such neurons carry strong discriminative information for multi-class classification. In fact, given only one such neuron, we can easily build a classifier to separate the “positive classes” from the “negative classes” by first adding a bias of 0.5 and choosing a weight of 1 for positive classes, -1 for negative classes, and 0 for the remaining ambiguous classes.
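The class-by-neuron matrix and the positive and negative classes of a neuron can be computed as follows. This is a sketch on toy data: the function names are hypothetical, every class is assumed to have at least one sample, and a class counts as positive or negative only when its firing rate is exactly 1 or 0.

```python
def firing_rates(codes, labels, num_classes):
    """rates[c][j] = fraction of class-c samples for which neuron j fires."""
    num_neurons = len(codes[0])
    counts = [0] * num_classes
    sums = [[0] * num_neurons for _ in range(num_classes)]
    for code, label in zip(codes, labels):
        counts[label] += 1
        for j, bit in enumerate(code):
            sums[label][j] += bit
    return [[s / counts[c] for s in sums[c]] for c in range(num_classes)]


def split_classes(rates, neuron):
    """Classes that always fire (positive) or never fire (negative) a neuron."""
    positive = [c for c, row in enumerate(rates) if row[neuron] == 1.0]
    negative = [c for c, row in enumerate(rates) if row[neuron] == 0.0]
    return positive, negative


# Six samples, two binary neurons, three classes (toy values).
codes = [[1, 0], [1, 1], [0, 1], [0, 0], [1, 1], [0, 1]]
labels = [0, 0, 1, 1, 2, 2]

rates = firing_rates(codes, labels, num_classes=3)
print(rates)
print(split_classes(rates, neuron=0))
print(split_classes(rates, neuron=1))
```

In this toy example neuron 0 separates class 0 from class 1, while class 2 stays ambiguous for it.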
In Figure 5, for neurons 1 and 2, the positive and negative classes clearly correspond to semantics, since ImageNet labels are based on the WordNet hierarchy. But for other neurons, like neurons 3 and 4, this does not always seem to be the case. What visual properties do these classes share in order to fire or inhibit the same neuron? For each class among the “positive classes” and “negative classes”, we randomly select one image instance of the class, and find the minimal image representation that highlights the object part contributing most to this neuron. We use the visualization technique of Zhou et al. (2014). Concretely, we first segment the image by edges, regions and corners, and then remove segments greedily until the binary response flips. In Figure 6, we show examples for 7 neurons. Please note that each image does not simply stand for an instance, but for the whole class. Unlike previous visualizations that attempt to find the consistency within a class, we show the consistent image pattern across classes. Our results show that some fc7 neurons do represent high-level concepts, like dogs, animals, and car wheels, while others still capture low-level information like textures and shapes. One interesting result, the pattern “Wings”, reveals the visual consistency of butterfly wings, goose wings, mushroom wings, sailboat wings, and even mountain wings. Moreover, Parizi et al. (2014) put forward the notion of negative parts for classification. Our binary representation provides experimental evidence that negative parts automatically emerge in deep neural networks, and that the same part can be shared as a positive response and a negative response, such as the dog faces in Figure 6.
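The greedy removal step can be sketched as follows. The segment labels and the `neuron_fires` callable are hypothetical stand-ins for a real segmentation and a real forward pass; the loop drops any segment whose removal leaves the binary response unchanged and stops when every further removal would flip it.

```python
def minimal_segments(segments, neuron_fires):
    """Greedily drop segments while the neuron keeps firing."""
    kept = list(segments)
    changed = True
    while changed:
        changed = False
        for seg in list(kept):
            trial = [s for s in kept if s != seg]
            # Keep the removal only if the binary response does not flip.
            if neuron_fires(trial):
                kept = trial
                changed = True
    return kept


def toy_neuron(segments):
    """Hypothetical neuron that fires when both parts are visible."""
    return "wing" in segments and "edge" in segments


result = minimal_segments(["sky", "wing", "grass", "edge"], toy_neuron)
print(result)
```

The surviving segments are the ones whose removal would switch the neuron off, which is what a minimal image representation is meant to highlight.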
4.4 Towards binarizing model parameters
We always lose some performance when using binary representations under the same architecture. However, the number of model parameters remains the same. This means that there should be much more redundancy in the model with binary representations. Intuitively, changing a particular weight of the model does not necessarily change the output, because the value is thresholded to binary. Also, the scale of the weights does not influence the output. Overall, the output responses are less sensitive to the parameters when the representation is binary. Can we compress the model parameters to binary as well? Specifically, after we obtain the model with binary representations, we simply threshold the model weights to binary values and keep them fixed. We leave the conv1 parameters as real values, since the input image is not binary and carries rich sensory information. Then we finetune the bias and slope parameters using back-propagation. In this way, all the intermediate representations as well as most of the model parameters are binarized.
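The scale-invariance argument and the thresholding step can be illustrated on a single step unit. The sketch assumes a zero bias, binary inputs, and sign thresholding to the values -1 and +1; the target set of the thresholding is an assumption made here for illustration.

```python
def step(v):
    return 1 if v > 0 else 0


def dot(weights, inputs):
    return sum(w * x for w, x in zip(weights, inputs))


def binarize_weights(weights):
    """Sign thresholding (assumed target set: -1 and +1)."""
    return [1 if w >= 0 else -1 for w in weights]


weights = [0.7, -0.2, 0.05, -1.3]
inputs = [1, 0, 1, 1]  # binary activations from the previous layer

original = step(dot(weights, inputs))
# Rescaling all weights by a positive constant leaves the step output unchanged.
scaled = step(dot([10 * w for w in weights], inputs))
# Thresholding discards magnitudes, so the output may change.
binary = step(dot(binarize_weights(weights), inputs))

print(original, scaled, binary)
```

In this example thresholding flips the unit's output, which hints at why the bias and slope parameters are finetuned after the weights are fixed.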
We conduct the experiments on the CIFAR10 dataset. For the baseline architecture, we achieve 63.55%, which is 11.85 percentage points worse than the regular model (75.40%), and 9.53 percentage points worse than our binary representation model with real-valued parameters (73.08%). For the twice-larger network architecture, the performance increases to 69.68%. To reach the baseline accuracy, we make the network 4 times larger, and it finally reaches 75.07%. This model contains about 16 times as many model parameters and 4 times as many feature units as the baseline model. However, all the values can be stored and all the computations carried out in a bit-wise way.
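The factor of about 16 follows from the fact that a convolution's weight count scales with the product of its input and output channels. The short check below uses a hypothetical 3x3 layer with 64 channels; the layer sizes are illustrative and not taken from the paper's architecture.

```python
def conv_params(kernel, c_in, c_out):
    """Weights plus biases of one convolution layer."""
    return kernel * kernel * c_in * c_out + c_out


base = conv_params(3, 64, 64)
wide = conv_params(3, 4 * 64, 4 * 64)  # 4x the channels on both sides

param_ratio = wide / base
unit_ratio = (4 * 64) / 64  # output feature units per spatial position

print(base, wide, param_ratio, unit_ratio)
```

The ratio is slightly below 16 because the bias terms only grow by a factor of 4.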
In this paper, we propose adjustable bounded rectifiers, along with optimization techniques for them, to learn binary representations for deep neural networks. Our results confirm the redundancy in current deep models. We can safely binarize the last layer, and even all the layers, with reasonably good performance. We also show that our activation function can be used as a new way to regularize the model even when a binary representation is not desired. Using the learned representations, we can interpret the function of each binary neuron with semantic phrases by examining the minimal activation pattern of input images. We believe our binary model has great practical value for deployment, especially on dedicated hardware. In the future, we plan to search for the smallest binary architecture that matches state-of-the-art real-valued networks, and to incorporate the binarization of model parameters into the optimization.
- Agrawal et al. (2014) Agrawal, Pulkit, Girshick, Ross, and Malik, Jitendra. Analyzing the performance of multilayer neural networks for object recognition. In Computer Vision–ECCV 2014, pp. 329–344. Springer, 2014.
- Andoni & Indyk (2006) Andoni, Alexandr and Indyk, Piotr. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Foundations of Computer Science, 2006. FOCS’06. 47th Annual IEEE Symposium on, pp. 459–468. IEEE, 2006.
- Barlett & Downs (1992) Barlett, Peter L and Downs, Tom. Using random weights to train multilayer networks of hard-limiting units. Neural Networks, IEEE Transactions on, 3(2):202–210, 1992.
- Courbariaux et al. (2015) Courbariaux, Matthieu, Bengio, Yoshua, and David, Jean-Pierre. Binaryconnect: Training deep neural networks with binary weights during propagations. arXiv preprint arXiv:1511.00363, 2015.
- Deng et al. (2009) Deng, Jia, Dong, Wei, Socher, Richard, Li, Li-Jia, Li, Kai, and Fei-Fei, Li. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. IEEE, 2009.
- Glorot & Bengio (2010) Glorot, Xavier and Bengio, Yoshua. Understanding the difficulty of training deep feedforward neural networks. In International conference on artificial intelligence and statistics, pp. 249–256, 2010.
- Gong & Lazebnik (2011) Gong, Yunchao and Lazebnik, Svetlana. Iterative quantization: A procrustean approach to learning binary codes. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pp. 817–824. IEEE, 2011.
- Han et al. (2015) Han, Song, Mao, Huizi, and Dally, William J. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. 2015.
- He et al. (2015) He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. arXiv preprint arXiv:1502.01852, 2015.
- Hinton (2002) Hinton, Geoffrey E. Training products of experts by minimizing contrastive divergence. Neural computation, 14(8):1771–1800, 2002.
- Hinton et al. (2006) Hinton, Geoffrey E, Osindero, Simon, and Teh, Yee-Whye. A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006.
- Huang et al. (2006) Huang, Guang-Bin, Zhu, Qin-Yu, Mao, KZ, Siew, Chee-Kheong, Saratchandran, P, and Sundararajan, N. Can threshold networks be trained directly? Circuits and Systems II: Express Briefs, IEEE Transactions on, 53(3):187–191, 2006.
- Ioffe & Szegedy (2015) Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
- Jia et al. (2014) Jia, Yangqing, Shelhamer, Evan, Donahue, Jeff, Karayev, Sergey, Long, Jonathan, Girshick, Ross, Guadarrama, Sergio, and Darrell, Trevor. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia, pp. 675–678. ACM, 2014.
- Krizhevsky & Hinton (2009) Krizhevsky, Alex and Hinton, Geoffrey. Learning multiple layers of features from tiny images, 2009.
- Lin et al. (2015) Lin, Kevin, Yang, Huei-Fang, Hsiao, Jen-Hao, and Chen, Chu-Song. Deep learning of binary hash codes for fast image retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 27–35, 2015.
- Parizi et al. (2014) Parizi, Sobhan Naderi, Vedaldi, Andrea, Zisserman, Andrew, and Felzenszwalb, Pedro. Automatic discovery and optimization of parts for image classification. arXiv preprint arXiv:1412.6598, 2014.
- Salakhutdinov & Hinton (2009) Salakhutdinov, Ruslan and Hinton, Geoffrey E. Deep boltzmann machines. In International Conference on Artificial Intelligence and Statistics, pp. 448–455, 2009.
- Simonyan & Zisserman (2014) Simonyan, Karen and Zisserman, Andrew. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- Soudry et al. (2014) Soudry, Daniel, Hubara, Itay, and Meir, Ron. Expectation backpropagation: Parameter-free training of multilayer neural networks with continuous or discrete weights. In Advances in Neural Information Processing Systems, pp. 963–971, 2014.
- Toms (1990) Toms, DJ. Training binary node feedforward neural networks by back propagation of error. Electronics letters, 26(21):1745–1746, 1990.
- Torralba et al. (2008) Torralba, Antonio, Fergus, Rob, and Weiss, Yair. Small codes and large image databases for recognition. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pp. 1–8. IEEE, 2008.
- Weiss et al. (2009) Weiss, Yair, Torralba, Antonio, and Fergus, Rob. Spectral hashing. In Advances in neural information processing systems, pp. 1753–1760, 2009.
- Zhong et al. (2015) Zhong, Guoqiang, Yang, Pan, Wang, Sijiang, and Dong, Junyu. A deep hashing learning network. arXiv preprint arXiv:1507.04437, 2015.
- Zhou et al. (2014) Zhou, Bolei, Khosla, Aditya, Lapedriza, Agata, Oliva, Aude, and Torralba, Antonio. Object detectors emerge in deep scene cnns. arXiv preprint arXiv:1412.6856, 2014.