Adjustable Bounded Rectifiers: Towards Deep Binary Representations

11/19/2015 ∙ by Zhirong Wu, et al. ∙ The Chinese University of Hong Kong

Binary representation is desirable for its memory efficiency, computation speed, and robustness. In this paper, we propose adjustable bounded rectifiers to learn binary representations for deep neural networks. While hard-constraining representations across layers to be binary makes training unreasonably difficult, we softly encourage activations to move from real values toward binary values by approximating step functions. Our final representation is completely binary. We test our approach on the MNIST, CIFAR10, and ILSVRC2012 datasets, and systematically study the training dynamics of the binarization process. Our approach can binarize the last-layer representation without loss of performance and binarize all the layers with reasonably small degradation. The memory it saves may allow more sophisticated models to be deployed, thus compensating for the loss. To the best of our knowledge, this is the first work to report results on current deep network architectures using completely binary intermediate representations. Given the learned representations, we find that the firing or inhibition of a binary neuron is usually associated with a meaningful interpretation across different classes. This suggests that the semantic structure of a neural network may be manifested through a guided binarization process.

1 Introduction

Despite the increased computational power resulting from advances in high-performance computing, e.g. GPUs and large-scale clusters, the deployment of deep neural networks, especially under stringent resource constraints, remains a serious challenge because of their extraordinary demands on storage and computation. Binarizing the data representation at each layer would be a natural solution. In practice, most deep neural networks use single-precision (32-bit) floating point numbers for data representation. Hence, turning these values into binary ones would reduce the required storage by 32 times. Also, with real values replaced by binary values, the multiplication operations, which dominate the run-time in most cases, would reduce to bit-operations that are much more efficient. Overall, a binary representation would allow us to design much bigger and more powerful networks using limited computation and storage resources. Moreover, as a way to understand deep neural networks, binary activations give a clear definition of neuron firing, which enables us to study the compositional architectures, the firing patterns, and various other properties in a principled way.

While we desire a completely binary representation, optimizing an integer programming problem over millions of variables is intimidating. It is also somewhat unreasonable to constrain all the activations to be binary from the very beginning, especially in lower layers. We instead take an alternative approach: the activations are initialized with real values, but are gradually encouraged to move toward binary values during the training process. Hopefully, the network can reach a solution where a binary representation is favored without loss of performance. Concretely, our activation function is bounded between 0 and 1, and linear in between. The slope of the linear part is parameterized and can be learned from data. We encourage binary values with a weight-growing constraint on these slope parameters (the opposite of weight decay). Ideally, if the final activations are binary across all the data, the activation function works just like a step function. The limitation of the current approach is that the model consumes slightly more memory and computation during training.

We replace the ReLU nonlinearity with our bounded rectifiers in current deep model architectures, and test the performance on the MNIST, CIFAR10, and ImageNet standard benchmarks. We have the following findings. First, our proposed activation function is at least as expressive as ReLU: we observe no loss of performance when we do not force a completely binary representation. Since binary values are intrinsically more robust, we even find that bounded activations together with the weight-growing constraint can be considered a new way to regularize the model. Second, our approach can binarize the last-layer representation without loss of performance on all the tasks, and suffers modest degradation when binarizing all layers in the network. We further study how each layer affects the performance when it is binarized. It turns out that binarizing some particular layers accounts for most of the loss. Third, when we enlarge the layers with more channels, we gain an obvious performance improvement. This indicates that the configurations of some layers are too small and limit the performance of binary representations. Last, since the binary representation makes the output less sensitive to the model parameters, we even manage to binarize the model parameters altogether, which leads to a completely bit-wise deep model.

Binary neurons give a clear definition of firing and inhibition, which is much easier to understand and interpret than real values. To understand these representations, we find that there exist some “positive classes” and “negative classes” for which all the instances in the class consistently cause a binary neuron to turn on or off. In this sense, the function of this neuron is simply to separate “positive classes” from “negative classes”. Some of the neurons are capable of representing high-level concepts like animals and artificial objects, while some other neurons are more elusive based only on class labels. To discover their functionality visually, we show the shared image pattern across these classes and assign a semantic description to the neuron. It turns out that the notions these neurons capture are still quite low-level, such as repetitive patterns, square shapes and rugged textures. We believe these findings are truly intriguing. We hope that our binary model can help to study and understand the properties of deep neural networks, and potentially accelerate the deployment of DNNs in computationally limited environments.

2 Related Works

Learning binary codes is an active research topic for machine learning in general. The code should be reasonably short, but contain as much information as possible, and allow fast computation during inference time. When handling tremendously large data, binary codes are essential for developing efficient search and matching algorithms.

Torralba et al. (2008) first introduced the problem to the vision community. Techniques based on hashing (Andoni & Indyk, 2006; Weiss et al., 2009) and quantization (Gong & Lazebnik, 2011) have been proposed. Encoding high-dimensional image data into short binary codes can be particularly useful for large-scale similarity search and indexing on the web.

Binary representations for deep models are not a new topic. In the early days of neural networks, the biologically inspired Heaviside step function was used as the activation function, and training algorithms such as Toms (1990); Barlett & Downs (1992) were proposed. However, they never take full advantage of the back-propagation algorithm, and only work on toy examples with one hidden layer. For deep probabilistic models like the RBM, DBN, and DBM (Hinton, 2002; Hinton et al., 2006; Salakhutdinov & Hinton, 2009), all of the internal representations are modeled as binary Bernoulli variables. Despite a rather simple representation, these neural networks have shown that binary representations can still be very powerful for modeling complex data. However, for discriminative models, especially the convolutional neural networks that dramatically advance the state-of-the-art in many areas, research that focuses on optimizing the binary codes is limited.

Previous work does make some positive observations by directly binarizing the features after regular training is done. As the ReLU activation function has been widely used, the feature patterns can be understood as either activated or inhibited. Agrawal et al. (2014) binarize the features into positives and zeros, and find that the performance drop is negligible. The experiment reveals that a local minimum with a binary representation for an individual layer exists. But it does not discuss how the binary representation could be further optimized or how binary representations for multiple layers coexist. Later on, Lin et al. (2015); Zhong et al. (2015) merely optimize the representation of the last layer using a static sigmoid function, which is never truly binary. Our work enforces the representations to be truly binary, and it can be applied throughout the whole network.

Our work is closely related to model compression for deep neural networks, as we are optimizing networks under constrained settings. Han et al. (2015) reduce network parameters by more than 30 times without performance loss by removing connection weights below a certain threshold. As another way to compress networks, Soudry et al. (2014); Courbariaux et al. (2015) propose deep networks with completely binary parameters. Unlike these works, which focus on compressing model parameters, we work on compressing the representations. This could be extremely useful for image applications where feature representations consume most of the storage. We believe both lines of research hold the same promise of making computation fast and storage consumption efficient.

3 Approach

Our goal is to learn a binary representation for deep neural networks at some or all of the layers. Large-scale integer programming is extremely hard, especially for deep neural networks with millions of units. We avoid this approach and instead softly encourage the units to move from real values to binary values. We achieve this by designing an activation function that constrains the value between 0 and 1, and favors 0 and 1 as local minima during training. The approach inherits the nice gradient properties of ReLU and takes full advantage of back-propagation for optimization. We now describe the activation function and show how we encourage the units to diverge.

3.1 Adjustable bounded rectifiers


Figure 1: Adjustable bounded rectifiers and their relationship with ReLU. (a) ReLU. (b) Bounded rectifiers. (c) Functions expressed by a ReLU can always be cast into a module with bounded rectifiers.

We design the activation function to be a simple linear function that clips values below 0 and above 1. Formally, it can be defined as:

f(y_c) = min(max(α_c · y_c, 0), 1),    (1)

where y_c is the input of the activation function on the c-th channel, and α_c is the coefficient controlling the slope of the linear interval. The subscript c indicates that units in the same channel share the same slope parameter. When the slope is small, the unit perceives a large input range; when the slope is large, it is only sensitive to a small input range. Once the activation saturates, it has little chance of jumping back to the linear interval because the gradient in the saturated zone is 0. This is the intuition behind designing an activation function that favors binary values.
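As a quick illustration, here is a minimal NumPy sketch of the forward pass in Eq. 1; the function name and array layout are our own illustrative choices, not the authors' code.

```python
import numpy as np

def bounded_rectifier(y, alpha):
    """Adjustable bounded rectifier, Eq. (1): clip alpha_c * y to [0, 1].

    y:     pre-activations of shape (batch, channels, height, width)
    alpha: per-channel slopes of shape (channels,)
    """
    # Broadcast the per-channel slope over batch and spatial dimensions.
    slope = alpha.reshape(1, -1, *([1] * (y.ndim - 2)))
    return np.clip(slope * y, 0.0, 1.0)

# A large slope pushes most outputs to exactly 0 or 1.
y = np.random.randn(2, 3, 4, 4)
print(bounded_rectifier(y, np.array([1.0, 10.0, 100.0])).round(2))
```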

Our activation function introduces a number of extra parameters, one slope coefficient per feature channel. In fact, these parameters are redundant because they can be equivalently absorbed into the preceding convolution or fully connected layer: scaling that layer's weights and biases by α_c lets the activation function work with a constant slope of one. When the slope is large enough, our function degenerates to something close to the standard step function. The slope parameter can then be discarded and the activation function replaced by a step function at inference time.

One may worry that the bounded rectifier hurts the expressiveness of the neural network, since it clips values bigger than 1, unlike ReLU. However, the “redundant” slope parameters actually make it even more powerful than ReLU. To see this, we can theoretically cast any function expressed by a ReLU network into a network with bounded activations. The idea is to set the slope so that the activations always lie in the linear interval. Consider a network with a convolution layer, a ReLU activation, and another convolution layer, and suppose the maximum activation value of the first convolution layer is M. We can simply set the slope in the new network to 1/M; this way, the bounded rectifier never saturates at 1. Since we shrink the magnitude of the activations by a factor of M, we can compensate by scaling the second convolution by M, with appropriate adjustments to the bias. We can continue this for the rest of the network, and the network with bounded rectifiers is then equivalent to the ReLU network.
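The construction can be checked numerically; the sketch below uses fully connected layers and a hypothetical bound M (bias terms are omitted for brevity).

```python
import numpy as np

np.random.seed(0)
W1, W2 = np.random.randn(8, 5), np.random.randn(3, 8)
x = np.random.randn(5)

# Original ReLU network.
h_relu = np.maximum(W1 @ x, 0.0)
out_relu = W2 @ h_relu

# Bounded-rectifier network: choose a slope 1/M that keeps the rectifier
# out of the upper saturation zone, then scale the next layer by M.
M = np.abs(W1 @ x).max() * 10.0              # any bound above the largest pre-activation
h_bounded = np.clip((1.0 / M) * (W1 @ x), 0.0, 1.0)
out_bounded = (M * W2) @ h_bounded

print(np.allclose(out_relu, out_bounded))    # True: the two networks agree on this input
```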

The adjustable bounded rectifiers are differentiable, and the slope parameters can be learned end-to-end together with the other model parameters. The gradient of the slope parameter is

∂E/∂α_c = Σ_i ∂E/∂f(y_i) · 1{0 < α_c y_i < 1} · y_i,    (2)

where E is the objective function of the neural network and 1{·} is the indicator function. The summation runs over all the units i that share the same slope parameter. With this derivation, we can train the whole neural network under the standard framework.
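A plain NumPy version of this backward step is sketched below, assuming the same (batch, channels, height, width) layout as before; names are illustrative.

```python
import numpy as np

def slope_gradient(y, alpha, grad_output):
    """dE/dalpha_c as in Eq. (2): accumulate dE/df(y_i) * y_i over the units
    of channel c that currently sit in the linear (non-saturated) interval."""
    slope = alpha.reshape(1, -1, *([1] * (y.ndim - 2)))
    in_linear = (slope * y > 0.0) & (slope * y < 1.0)   # indicator 1{0 < alpha*y < 1}
    per_unit = grad_output * y * in_linear
    # Sum over every axis except the channel axis: units in a channel share one slope.
    return per_unit.sum(axis=tuple(i for i in range(y.ndim) if i != 1))

y = np.random.randn(2, 3, 4, 4)
grad_out = np.random.randn(*y.shape)                    # upstream gradient dE/df(y)
print(slope_gradient(y, np.ones(3), grad_out).shape)    # (3,), one gradient per channel
```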

3.2 Regularization As Weight Growth

Deep neural networks usually have many local minima. Although units tend to get trapped in the saturated regions where no gradient is available, using the above activation function alone is not sufficient to obtain the desired binary representations. We have to aggressively encourage more units to diverge during the training process. One possible way is to interpret each activation as a probability and add an entropy loss for each individual unit. We find this generally works, but it can be tricky to choose the loss weights across different layers. Another way is to let the bounded rectifiers get close to step functions. We can achieve this by encouraging the slope parameters to grow. This may sound strange at first, because parameters are usually preferred to be small to avoid overfitting. However, a bigger slope leads to more binary values, which are robust and actually less prone to overfitting. We even find that adding the slope-growing constraint can lessen the need for dropout. Concretely, we use the negative log of the slope, −log(α_c), as the loss function. It has two appealing properties. First, the loss is unbounded towards negative infinity, which drives the slope to grow aggressively. Second, the derivative of the loss goes to zero as the slope goes to infinity, which keeps the optimization process stable. We have also experimented with other penalty functions, but none of them works better than the log function.
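In code, the weight-growth term could look like the sketch below; the coefficient lambda_growth that scales the penalty is our own illustrative name for the constraint strength discussed above.

```python
import numpy as np

def weight_growth_loss(alpha, lambda_growth=1e-4):
    """Negative-log penalty on the slopes: unbounded below as alpha grows,
    so minimizing it pushes the slopes (and hence binarization) up."""
    return -lambda_growth * np.sum(np.log(alpha))

def weight_growth_grad(alpha, lambda_growth=1e-4):
    """Gradient -lambda/alpha, which vanishes as alpha -> infinity,
    keeping the optimization stable for already-large slopes."""
    return -lambda_growth / alpha

alpha = np.array([1.0, 5.0, 50.0])
print(weight_growth_loss(alpha), weight_growth_grad(alpha))
```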

3.3 Implementation Issues

At the start of training, the bounded rectifiers behave almost the same as the ReLU function. Therefore, any initialization method that works for ReLU can be directly applied to the bounded rectifiers. In our experiments, we use the Xavier method (Glorot & Bengio, 2010) for all the layers (He et al. (2015) also works), and initialize the slope with value 1 (a Gaussian with standard deviation 1 also works). Ioffe & Szegedy (2015) normalize each layer to have zero mean and unit variance. This can be particularly useful for networks that have dramatic training dynamics and extremely different feature magnitudes; for example, the feature magnitudes of the VGG network (Simonyan & Zisserman, 2014) differ by orders of magnitude between the front and the end of the network. However, our activation magnitudes and training dynamics are much more stable because of the activation function. We simply scale the input by a constant to make the first layer consistent with the rest.

We divide the optimization protocol into two phases. In the first phase, we set the slope-growing constraint to be relatively small. The constraint helps regularize the model, and the training procedure focuses on optimizing performance rather than on producing a binary representation. The first phase can reach the exact original performance with about 70–90% binary representations (this may vary across layers). In the second phase, we set the slope-growing constraint to be much larger. This pushes the remaining continuous values to binary while trying to maintain the objective performance. During the gradient update cycle, the growth term is applied outside the learning rate to aggressively binarize the representations. At the inference stage, we replace the bounded rectifiers with step functions.
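Once the slopes are large, swapping the bounded rectifier for a hard step at inference is straightforward; a minimal sketch (thresholding at zero, consistent with a very large slope) is shown below.

```python
import numpy as np

def step_activation(y):
    """Inference-time replacement: 1 if the pre-activation is positive, else 0."""
    return (y > 0.0).astype(np.float32)

# Sanity check: with a slope of 100, almost every output of the bounded
# rectifier is already exactly 0 or 1, matching the step function.
y = np.random.randn(1000)
bounded = np.clip(100.0 * y, 0.0, 1.0)
print(np.mean((bounded == 0.0) | (bounded == 1.0)))   # close to 1
```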

4 Experiments

In the following experiments, we work on MNIST, CIFAR10 (Krizhevsky & Hinton, 2009), and the ImageNet ILSVRC2012 dataset (Deng et al., 2009). The baseline methods are the solvers and architectures from the Caffe software package (Jia et al., 2014). We conduct the experiments in two primary settings: binarizing the features of the last layer, and binarizing the features of all layers. Both settings have great practical value, and our approach gives very promising results. We discuss binarizing the model parameters altogether to derive a completely bit-wise deep model in the last section.

4.1 Binarizing the last layer representation

Figure 2: Binarization process for the last layer over training time. Left: MNIST; Right: CIFAR10. Each curve plots the percentage of units that are binary for at least a given fraction (e.g. 90%) of the test data. We plot three curves under three different thresholds to better show the training dynamics. We find that we can make 99.9% of the units binary 99.9% of the time if we train the network for an unreasonably long time; however, it makes no difference to the final performance.

 

            Baselines                                Ours
            Regular       Binary        Finetuned    1st Phase     1st Binary    2nd Phase     2nd Binary
MNIST       99.11         95.38         98.37        99.21         99.12         99.22         99.22
CIFAR10     75.40         38.04         62.39        75.60         74.10         75.96         75.76
ImageNet    56.66/79.92   53.67/77.86   54.06/78.15  56.22/79.54   54.69/78.13   56.23/79.53   56.13/79.45

Table 1: Classification performance for binarizing the last layer representations. “Regular” means continuous feature values. “Binary” means directly binarizing the feature without retraining the softmax. “Finetuned” means retraining the softmax.

Binarizing the features of the last layer can be very useful for large-scale image search and retrieval algorithms. Our experiments show that features from the last layer of current CNNs can be safely binarized. Take MNIST as an example: the baseline method obtains 99.11% with standard ReLU activations. Directly binarizing the features without retraining the softmax suffers a lot, while retraining the classifier with the binary features fixed boosts the performance to 98.37%. For our method, we monitor the level of binarization as well as the classification performance over training time in Figure 2. Training the first phase with a small weight growth steadily binarizes more units; directly binarizing the features without retraining the softmax gives a performance of 99.12%. Training the second phase with a larger weight growth maintains the accuracy at 99.22% but substantially increases the level of binarization, and directly binarizing the features gives the same 99.22%. Note that there is no need to retrain the classifier for our method because the level of binarization is already very high. The corresponding experimental results on CIFAR10 and ImageNet are summarized in Table 1. On all three datasets, we always match the performance obtained with real-valued features. It is important to note that our method does not simply threshold ReLU activations; it learns a totally different representation. For example, the AlexNet features on ImageNet are quite sparse, while ours are not sparse at all: our features are 58% zeros and 42% ones.
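The binarization level reported in Figure 2 can be measured from saved activations along the lines of the sketch below; the near-binary tolerance and array layout are our own assumptions.

```python
import numpy as np

def binarization_level(features, time_threshold=0.9, eps=1e-6):
    """Fraction of units whose activation is (near-)binary on at least
    `time_threshold` of the test examples.

    features: activations of shape (num_examples, num_units), values in [0, 1]
    """
    is_binary = (features < eps) | (features > 1.0 - eps)
    per_unit_rate = is_binary.mean(axis=0)            # how often each unit is binary
    return float((per_unit_rate >= time_threshold).mean())

feats = np.clip(5.0 * np.random.randn(2000, 4096), 0.0, 1.0)   # stand-in for fc7 outputs
for t in (0.9, 0.99, 0.999):
    print(t, binarization_level(feats, t))
```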

Figure 3: Binary features are robust and less prone to overfitting. Left: CIFAR10; Right: ImageNet. We monitor the training loss, the testing loss, and the testing accuracy over longer training runs. The gap between our training loss and testing loss is much smaller than with ReLUs. Note that our representation in this experiment is not completely binary.

Feature robustness. Binary features are inherently robust and less prone to overfitting. To demonstrate this, we train the deep models for longer iterations using a small weight-growth constraint. In Figure 3, the baseline model with ReLU activations is severely affected by overfitting, and its training loss drops much faster than its testing loss. With our bounded rectifiers on CIFAR10, longer training continues to improve performance: we reach 80.48%, significantly better than the 75.40% that the baseline achieves with the same architecture. On ImageNet, our method only uses dropout on the fc6 layer but not the fc7 layer; one baseline setting is consistent with ours, and the other uses full dropout training. This experiment shows that even when a binary representation is not desired, bounded rectifiers together with the weight-growth constraint should be considered an effective way to regularize the model.

4.2 Binarizing the whole network

Binarizing the representations across all the layers could save a great deal of memory and speed up computation. It would also allow us to explore even deeper and wider network architectures. Clearly, forcing all the layers to be binary is an extremely difficult task. If a layer at the front is binarized while losing too much information, the following layers will have no chance to recover. Also, if a layer at the end is binarized much faster than earlier layers, the gradient signal will be too weak to propagate. To our surprise, as shown in Figure 4, our network automatically maintains a steady level of binarization across all the layers while conservatively encouraging binarization.

We conduct experiments on CIFAR10 and ImageNet. For the baseline method (the baseline here is slightly lower than the original AlexNet because it is trained with Xavier initialization; we want to cancel out the effect of initialization since the original AlexNet is carefully tuned), we take the well-trained ReLU model, and it suffers a lot if we binarize all the layers simultaneously. A better option is to add one binary representation layer at a time from front to back, and finetune all the layers behind it with real values, repeating this process until all the layers are binarized. The performance then increases to 45.48% and 27.73%, respectively. For our approach, we replace the ReLU activations with our bounded activations for all the layers and train the deep model from scratch. After the two phases of training, we obtain accuracies of 72.14% and 68.39% by directly replacing each bounded rectifier with a step function. We could also follow the layer-wise finetuning used for the baseline; in our experiments, to keep things simple, we just finetune the final softmax classifier. Our final results are 73.08% and 68.85%. Details are given in Table 2.

Figure 4: Left: Binarization process for all layers vs. training iterations on ImageNet. The curves are plotted in the same way as in Figure 2; we only plot one curve for each layer for simplicity. Naturally, the last layer is binarized more slowly than earlier layers to allow gradients to propagate. Right: Classification performance as the network has more binary layers, using AlexNet on ImageNet. The conv2, conv3, and fc7 layers exhibit no loss of performance when they are binarized.

 

            Baselines                                Ours
            Regular       Binary        Finetuned    1st Phase     1st Binary    2nd Phase     2nd Binary    Finetuned
CIFAR10     75.40         18.15         45.48        75.35         61.87         73.77         72.14         73.08
  2x model  -             -             -            78.05         69.05         77.25         75.71         76.22
ImageNet    55.10/78.35   0.1/0.5       12.44/27.73  53.81/77.74   30.24/54.13   46.64/71.63   43.33/68.39   43.70/68.85
  2x model  -             -             -            57.87/80.93   39.81/63.67   54.07/77.73   49.51/73.61   49.85/74.23

Table 2: Classification performance for binarizing the whole network representations. Key words are consistent with Table 1, except the baseline is finetuned layer-wise. Performance of the binary representation increases a lot when the architecture is made twice larger.

Larger network architectures. We lose about 2.4% and 9.5% accuracy when using the same architectures as the real-valued networks. To find out which layers our approach suffers on most, we train a set of models that gradually have more binary layers, e.g. binarizing only the last layer, then adding the second-to-last layer, and so on. In Figure 4, on AlexNet, we can see that the conv2, conv3, and fc7 layers do not hurt performance, while other layers show some noticeable losses. Perhaps the capacity of the network architecture limits the performance of binary representations. Can we reach the same or even better performance by using bigger networks? To test this possibility, we double the number of feature channels for all the layers and redo the experiment under the same settings. We reach performances of 76.22% and 74.23% for the twice-larger networks. This confirms that the architecture should be made larger for binary values. It would also be interesting to investigate the minimum binary architecture needed to match the original performance; for example, maybe only the conv1 and conv5 layers need to be made larger, and all the other layers could remain the same. Since this requires an exhaustive search over a large number of models, we leave it for future work.

4.3 Activation patterns of binary feature units

Binary representations give a clear definition of firing and non-firing. Can we develop an interpretation of the binary values zero and one? To start, we work on the ImageNet dataset and calculate the average responses of the fc7 units for all 1000 classes. This gives us a matrix (see Figure 5) where each element indicates how often each class fires for a specific neuron. We find that for some neurons, there exist “positive classes” where all the instances in the class consistently fire for this neuron, and “negative classes” where all the instances consistently inhibit it. Such neurons carry strong discriminative information for multi-class classification. In fact, given only one such neuron, we can easily build a classifier to separate “positive classes” from “negative classes” by adding a bias of 0.5 and choosing a weight of 1 for positive classes, -1 for negative classes, and 0 for the remaining ambiguous classes.
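One reading of this construction is sketched below; the class indices are placeholders and the exact weighting scheme is our interpretation of the text.

```python
import numpy as np

def one_neuron_scores(binary_response, positive_classes, negative_classes, num_classes=1000):
    """Score classes from a single binary unit: weight +1 for positive classes,
    -1 for negative classes, 0 for ambiguous ones, with a bias of 0.5 on the unit."""
    w = np.zeros(num_classes)
    w[list(positive_classes)] = 1.0
    w[list(negative_classes)] = -1.0
    return w * (binary_response - 0.5)     # consistent classes get positive scores

scores = one_neuron_scores(1.0, positive_classes=[3, 17], negative_classes=[42])
print(scores[3], scores[42])               # 0.5 and -0.5 when the unit fires
```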

In Figure 5, for neurons 1 and 2, the positive and negative classes clearly correspond to semantics, since ImageNet labels are based on the WordNet hierarchy. But for other neurons, like neurons 3 and 4, this does not always seem to be the case. What visual properties do these classes share in order to fire or inhibit the same neuron? For each class among the “positive classes” and “negative classes”, we randomly select one image instance and find the minimal image representation that highlights the object part contributing most to this neuron, using the visualization technique from Zhou et al. (2014). Concretely, we first segment the image into edges, regions, and corners, and then remove segments greedily until the binary response flips. In Figure 6, we show examples of 7 neurons. Note that each image does not simply represent an instance but the whole class. Unlike previous visualizations that attempt to find consistency within a class, we show the consistent image pattern across classes. Our results show that some fc7 neurons do represent high-level concepts, like dogs, animals, and car wheels, while others still capture low-level information like texture and shape. One interesting result, the pattern “Wings”, reveals the visual consistency of butterfly wings, goose wings, mushroom wings, sailboat wings, and even mountain wings. Moreover, Parizi et al. (2014) put forward the notion of negative parts for classification. Our binary representation experimentally shows that negative parts automatically emerge in deep neural networks, and that the same part can be shared as positive and negative responses, such as the dog faces in Figure 6.
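A rough sketch of the greedy minimal-image procedure is given below; the segmentation mask and the neuron_response callable are placeholders for the actual segmentation and network forward pass, which are not specified here.

```python
import numpy as np

def minimal_image(image, segments, neuron_response, target=1.0):
    """Greedily blank out segments as long as the chosen binary unit keeps its response.

    image:           H x W x 3 array
    segments:        H x W integer mask from any off-the-shelf segmentation
    neuron_response: callable mapping an image to the unit's binary value (0.0 or 1.0)
    """
    current = image.copy()
    remaining = set(np.unique(segments).tolist())
    progress = True
    while progress:
        progress = False
        for seg_id in list(remaining):
            candidate = current.copy()
            candidate[segments == seg_id] = 0           # remove one segment
            if neuron_response(candidate) == target:    # response has not flipped
                current = candidate
                remaining.discard(seg_id)
                progress = True
    return current

# Toy usage: a "neuron" that fires while the top-left segment is present.
img = np.ones((4, 4, 3))
segs = np.arange(16).reshape(4, 4) // 4                 # four row segments
fires = lambda im: float(im[0, 0, 0] > 0)
print(minimal_image(img, segs, fires).sum())            # only the first row survives
```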

Figure 5: Top: Firing matrix for each class on each feature unit. Each element indicates how often a class fires for a particular binary unit. Bottom: A column of the firing matrix reveals an activation distribution for one unit over all categories. Some units are clearly related to semantics, e.g. unit 1 distinguishes animals from artificial objects, and unit 2 distinguishes dogs from birds/fishes/reptiles. Some units (3 and 4) are not obviously semantically related.

4.4 Towards binarizing model parameters

We always lose some performance when using binary representations under the same architecture. However, the number of model parameters remains the same, which means there should be much more redundancy in a model with binary representations. Intuitively, changing a particular weight does not necessarily change the output because the value is thresholded to binary; the scale of the weights also does not influence the output. Overall, with a binary representation the output responses are less sensitive to the parameters. Can we compress the model parameters to binary altogether? Specifically, after we obtain the model with binary representations, we simply threshold the model weights to binary values and keep them fixed. We leave the conv1 parameters as real values since the input image is not binary and carries rich sensory information. We then finetune the bias and slope parameters using back-propagation. In this way, all the intermediate representations as well as most of the model parameters are binarized.
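A sketch of this thresholding step is shown below; we assume a sign threshold to ±1 (in line with the binary-parameter work cited in Section 2), and the parameter naming scheme is our own.

```python
import numpy as np

def binarize_parameters(params, keep_real=("conv1",)):
    """Threshold weights to {-1, +1}, leaving the listed layers and all biases real-valued
    (those, together with the slopes, are finetuned afterwards)."""
    binarized = {}
    for name, w in params.items():
        if any(name.startswith(k) for k in keep_real) or name.endswith("bias"):
            binarized[name] = w
        else:
            binarized[name] = np.where(w >= 0, 1.0, -1.0)
    return binarized

params = {"conv1/weight": np.random.randn(3, 3),
          "fc7/weight": np.random.randn(3, 3),
          "fc7/bias": np.random.randn(3)}
print(binarize_parameters(params)["fc7/weight"])
```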

We conduct the experiments on the CIFAR10 dataset. With the baseline architecture, we achieve 63.55%, which is 11.85% worse than the regular model (75.40%) and 9.53% worse than our binary-representation model with real-valued parameters (73.08%). With the twice-larger network architecture, the performance increases to 69.68%. To reach the baseline accuracy, we make the network 4 times larger, which finally reaches 75.07%. This model contains about 16 times the number of parameters and 4 times the number of feature units of the baseline model; however, all the computations can be stored and carried out bit-wise.

5 Conclusions

In this paper, we propose adjustable bounded rectifiers, along with optimization techniques, to learn binary representations for deep neural networks. Our results confirm the redundancy in current deep models: we can safely binarize the last layer, and even all the layers with reasonably good performance. We also show that our activation function can serve as a new way to regularize the model even when a binary representation is not desired. Using the learned representations, we can interpret the function of each binary neuron with semantic phrases by examining the minimal activation patterns of input images. We believe our binary model has great practical value for deployment, especially on dedicated hardware. In the future, we plan to search for the smallest binary architecture that matches state-of-the-art real-valued networks and to incorporate the binarization of model parameters into the optimization.

Figure 6: Activation patterns of individual neurons. Each column represents a single neuron in the fc7 binary feature vector. We show 6 example classes each for the positive and negative classes. Each visualization is the minimal image that causes the neuron to fire or inhibit; the original image is thumbnailed in the corner. We summarize the visual consistency across classes with our descriptions.

References