Median Binary-Connect Method and a Binary Convolutional Neural Network for Word Recognition

by Spencer Sheen, et al.
University of California, Irvine

We propose and study a new projection formula for training binary weight convolutional neural networks. The projection formula measures the error in approximating a full precision (32 bit) vector by a 1-bit vector in the $\ell_1$ norm instead of the standard $\ell_2$ norm. The $\ell_1$ projector is in closed analytical form and involves a median computation instead of the arithmetic average in the $\ell_2$ projector. Experiments on 10 keywords classification show that the $\ell_1$ (median) BinaryConnect (BC) method outperforms the regular BC, regardless of cold or warm start. The binary network trained by median BC and a recent blending technique reaches test accuracy 92.4 %, which is 1.1 % below the full-precision network accuracy of 93.5 %. On an Android phone app, the trained binary network doubles the speed of the full-precision network in spoken keywords recognition.




1 Introduction

Speeding up convolutional neural networks (CNNs) on mobile platforms such as cellular phones is an important problem in real-world applications of deep learning. The computational task is to reduce network complexity while maintaining accuracy. In this paper, we study weight binarization of the simple audio CNN on TensorFlow [8] for keyword classification, propose a new class of training algorithms, and evaluate the trained binarized CNN on an Android phone app.

The audio CNN [8] has two convolutional layers and one fully connected layer [5]. Weight binarization refers to restricting the weight values in each layer to the form $a\,\{\pm 1\}$, where $a$ is a common scalar factor of full precision (32-bit). The network complexity is thus reduced with no change of architecture. Training, however, requires a projection step and a modification of the classical projected gradient descent (PGD) called BinaryConnect [1]. If the projection is in the sense of Euclidean distance or $\ell_2$ norm, a closed-form solution is available [4]. With this projector, BinaryConnect (BC) updates both the binary weight $w$ and an auxiliary float-precision weight $w_f$ to avoid the weight stagnation that occurs in direct PGD: $w^{k+1} = \mathrm{proj}\,(w^k - \eta\,\nabla f(w^k))$, with $\eta$ small and $f$ the objective function. In other words, the small variation $\eta\,\nabla f(w^k)$ may vanish under the projection to the discrete targets $a\,\{\pm 1\}$, rendering $w^{k+1} = w^k$. The BC update is:

$$ w_f^{k+1} = w_f^k - \eta\,\nabla f(w^k), \qquad w^{k+1} = \mathrm{proj}\,(w_f^{k+1}), $$

where the float weight $w_f$ continues to evolve and eventually moves the binary weight $w$. Two improvements of BC stand out in recent works. One, called BinaryRelax (BR), replaces the “hard projection” to discrete targets by a relaxed (approximate) projection, still in analytical closed form [9], so as to give $w$ and $w_f$ more freedom to explore the high-dimensional non-convex landscape of the training loss function. The relaxation is tightened as training proceeds and becomes a hard projection at the end of training to produce a binarized output. BR outperforms BC on image datasets (CIFAR-10, ImageNet) and various CNNs [9]. The other technique is “blending” [10], which replaces $w_f^k$ in BC’s gradient update by the convex linear combination $(1-\rho)\,w_f^k + \rho\,w^k$, for a small $\rho > 0$. With blending, the sufficient descent inequality holds for small enough $\eta$: $f(w^{k+1}) - f(w^k) \le -c\,\|w^{k+1} - w^k\|^2$, for some positive constant $c$, if $f$ has Lipschitz continuous gradient [10].
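The stagnation of direct PGD and the way BC avoids it can be seen on a toy problem. The following is a minimal sketch under stated assumptions, not the authors' code: the quadratic objective, the vector `target`, the starting point, and the step size are all hypothetical choices made for illustration, and the projector is the mean based ($\ell_2$) one.

```python
import numpy as np

def proj_l2(w):
    """l2 binary projection: sign pattern of w scaled by the mean of |w_j|."""
    return np.mean(np.abs(w)) * np.where(w >= 0, 1.0, -1.0)

# Hypothetical toy objective f(w) = 0.5 * ||w - target||^2, so grad f(w) = w - target.
target = np.array([0.5, -0.5, -0.5, 0.5])
grad = lambda w: w - target
eta = 1e-2                                  # small learning rate
start = np.array([0.5, 0.5, -0.5, 0.5])     # second sign is wrong at the start

# Direct PGD: every iterate is binary, and one small step cannot flip a sign.
w_pgd = proj_l2(start)
for _ in range(500):
    w_pgd = proj_l2(w_pgd - eta * grad(w_pgd))

# BinaryConnect: the float weight accumulates the small steps across iterations.
w_f = start.copy()
w_bc = proj_l2(w_f)
for _ in range(500):
    w_f = w_f - eta * grad(w_bc)            # gradient is evaluated at the binary weight
    w_bc = proj_l2(w_f)

print("PGD signs:", np.sign(w_pgd))
print("BC  signs:", np.sign(w_bc))
```

In this toy run the sign pattern of the PGD iterate never leaves its starting value, while the float weight in BC drifts across zero in the second coordinate and the binary weight follows it.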

In this paper, we replace the $\ell_2$ projector in BC [1] by a new projector in the $\ell_1$ sense and compare their performance on audio keywords data as well as the CIFAR-10 image dataset. The $\ell_1$ projector is found in closed analytical form, and it differs from the $\ell_2$ projector in that the median operation replaces an arithmetic average operation. It is known that the $\ell_1$ norm as a measure of error is more robust to outliers than $\ell_2$, see [6] for a case study on matrix completion. Our experiments on TensorFlow show that BC with the $\ell_1$ projector (called median BC) always outperforms BC under either cold (random) or warm start. Similar findings are observed on the VGG network [7] and CIFAR-10.

The rest of the paper is organized as follows. In section 2, we first present the $\ell_1$ projector, then summarize the BC, median BC and BR algorithms as well as the audio CNN architecture. In section 3, we compare test accuracies for 10 keywords classification and CIFAR-10 on binary weight CNNs trained by the three algorithms. The best trained binary CNN and the full precision audio CNN are implemented on an Android phone app. The binary CNN doubles the processing speed of the full-precision CNN on the app when words on the list are spoken and correctly recognized. The concluding remarks are in section 4.

2 Binary Quantization and Training Algorithms

2.1 Binary Weight Projections

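We consider the problem of finding the closest binary vector to a given real vector $w$, or the projection $\mathrm{proj}_Q\,w$, for $w \in \mathbb{R}^D$, $Q = \mathbb{R}_+ \times \{\pm 1\}^D$. When the distance is Euclidean (in the sense of the $\ell_2$ norm $\|\cdot\|_2$), the problem:

$$ \min_{a > 0,\; u \in \{\pm 1\}^D} \|w - a\,u\|_2 \tag{2.1} $$

has the exact solution [4]:

$$ a^* = \frac{\|w\|_1}{D}, \qquad u^* = \mathrm{sgn}(w), \tag{2.2} $$

where the norm $\|w\|_1 = \sum_{j=1}^{D} |w_j|$ and $\mathrm{sgn}(w) = (q_1, \dots, q_D)$, with $q_j = 1$ if $w_j \ge 0$ and $q_j = -1$ otherwise.

The solution is simply the sign of $w$ times the average of the absolute values of the components of $w$.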
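For completeness, the closed form can be checked in a few lines. The derivation below is a sketch in notation assumed here for illustration: $w \in \mathbb{R}^D$ is the float vector, $a > 0$ the common scale, and $u \in \{\pm 1\}^D$ the sign vector.

```latex
\begin{align*}
\|w - a\,u\|_2^2
  &= \|w\|_2^2 - 2a\,\langle w, u\rangle + a^2\,\|u\|_2^2
   = \|w\|_2^2 - 2a\,\langle w, u\rangle + a^2 D . \\
\intertext{For any fixed $a > 0$ the error decreases as $\langle w, u\rangle$ increases, and}
\langle w, u\rangle &= \sum_{j=1}^{D} w_j u_j \;\le\; \sum_{j=1}^{D} |w_j| = \|w\|_1 ,
  \quad \text{with equality at } u = \mathrm{sgn}(w). \\
\intertext{Substituting $u = \mathrm{sgn}(w)$ leaves a quadratic in $a$:}
\frac{d}{da}\Big( a^2 D - 2a\,\|w\|_1 \Big) &= 0
  \quad\Longrightarrow\quad a = \frac{\|w\|_1}{D} .
\end{align*}
```

The scale is therefore the arithmetic mean of the $|w_j|$, which is the quantity the median replaces in the $\ell_1$ setting.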

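Now let us consider the distance in the $\ell_1$ sense, or solve, for a given $w \in \mathbb{R}^D$:

$$ \min_{a > 0,\; u \in \{\pm 1\}^D} \|w - a\,u\|_1. \tag{2.3} $$

Write the binary vector as $a\,u$, where $u = (u_1, \dots, u_D)$, $u_j = \pm 1$, $a > 0$. Then:

$$ \|w - a\,u\|_1 = \sum_{j=1}^{D} |w_j - a\,u_j|, \tag{2.4} $$

clearly each term is smallest when $u_j$ has the sign of $w_j$,

$$ u^* = \mathrm{sgn}(w), \tag{2.5} $$

and so:

$$ a^* = \arg\min_{a > 0} \sum_{j=1}^{D} \big|\,|w_j| - a\,\big| = \mathrm{median}\big(|w_1|, \dots, |w_D|\big). \tag{2.6} $$

For a derivation of the median solution in (2.6), see section 2 of [6], where a robust low rank matrix factorization in the $\ell_1$ sense is studied. We shall call (2.5)-(2.6) the median based binary projection.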
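The two projections differ only in how the scale is computed, which a short NumPy sketch makes concrete. The function names, the convention sgn(0) = +1, and the example vector are assumptions made for illustration; they are not taken from the paper's code.

```python
import numpy as np

def sgn(w):
    """Sign with the assumed convention sgn(0) = +1, so each entry is +1 or -1."""
    return np.where(w >= 0, 1.0, -1.0)

def proj_l2(w):
    """Mean based (l2) binary projection: returns (scale, signs)."""
    return np.mean(np.abs(w)), sgn(w)

def proj_l1(w):
    """Median based (l1) binary projection: returns (scale, signs)."""
    return np.median(np.abs(w)), sgn(w)

# Hypothetical weight vector whose last entry is an outlier.
w = np.array([0.1, -0.2, 0.3, -5.0])
a2, u2 = proj_l2(w)
a1, u1 = proj_l1(w)

# l1 approximation error of each binary vector.
err_l1_of_mean = np.sum(np.abs(w - a2 * u2))
err_l1_of_median = np.sum(np.abs(w - a1 * u1))
print(a2, a1, err_l1_of_mean, err_l1_of_median)
```

On this vector the mean based scale is pulled toward the outlier, while the median based scale stays near the bulk of the entries and gives the smaller $\ell_1$ error. For even $D$ any value between the two middle magnitudes minimizes the $\ell_1$ error; `np.median` returns their midpoint.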

2.2 Binary Weight Network Training Algorithms

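Let the training loss function be $f$, and the learning rate be $\eta$. We compare three algorithms below. Here $\mathrm{proj}_{\ell_2}$ and $\mathrm{proj}_{\ell_1}$ denote the mean based and the median based binary projections of section 2.1.

Binary-Connect (BC [1]):

$$ w_f^{k+1} = w_f^k - \eta\,\nabla f(w^k), \qquad w^{k+1} = \mathrm{proj}_{\ell_2}(w_f^{k+1}), $$

where $\{w^k\}$ denotes the sequence of desired quantized weights, and $\{w_f^k\}$ is an auxiliary sequence of floating weights (32 bit).

Binary-Connect with median based quantization (Median BC):

$$ w_f^{k+1} = w_f^k - \eta\,\nabla f(w^k), \qquad w^{k+1} = \mathrm{proj}_{\ell_1}(w_f^{k+1}). $$

Binary-Relax (BR [9]):

$$ w_f^{k+1} = w_f^k - \eta\,\nabla f(w^k), \qquad w^{k+1} = \frac{\lambda_k\,\mathrm{proj}_{\ell_2}(w_f^{k+1}) + w_f^{k+1}}{\lambda_k + 1}, $$

where $\lambda_{k+1} = \gamma\,\lambda_k$ for some $\gamma > 1$.

In addition, we shall compare the blended version [10] of the above three algorithms, obtained by replacing $w_f^k$ in the gradient update by $(1-\rho)\,w_f^k + \rho\,w^k$, for a small $\rho > 0$.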
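The three updates and their blended variants share one loop, which the sketch below illustrates on a toy least-squares loss. Everything specific here is an assumption for illustration: the function `train`, its parameter names, the synthetic data, the step size, and the relaxation schedule are hypothetical, and BR is simplified to a single relaxed phase followed by one final hard projection rather than the two-stage procedure of [9].

```python
import numpy as np

def sgn(w):
    return np.where(w >= 0, 1.0, -1.0)

def proj_mean(w):      # l2 projector, used by BC and BR
    return np.mean(np.abs(w)) * sgn(w)

def proj_median(w):    # l1 projector, used by median BC
    return np.median(np.abs(w)) * sgn(w)

def train(grad, w_f, proj, eta=0.05, steps=300, blend=0.0, relax=None):
    """BC-style training loop.

    grad  : callable giving the loss gradient at the current (quantized) weight
    proj  : binary projector, proj_mean or proj_median
    blend : blending parameter; 0 recovers the unblended update
    relax : None for hard projection, or (lam0, growth) for a relaxed projection
    """
    w = proj(w_f)
    lam = relax[0] if relax else None
    for _ in range(steps):
        # float-weight update, optionally blended with the quantized weight
        w_f = (1.0 - blend) * w_f + blend * w - eta * grad(w)
        if relax:
            w = (lam * proj(w_f) + w_f) / (lam + 1.0)   # relaxed projection
            lam *= relax[1]                             # tighten the relaxation
        else:
            w = proj(w_f)                               # hard projection
    return proj(w_f)   # final hard projection gives a binarized output

# Hypothetical toy problem: least squares with a binary ground truth.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 8))
y = A @ (0.5 * sgn(rng.normal(size=8)))
grad = lambda w: A.T @ (A @ w - y) / len(y)
w0 = rng.normal(scale=0.1, size=8)          # cold (random) start

w_bc = train(grad, w0, proj_mean)
w_mbc = train(grad, w0, proj_median)
w_br = train(grad, w0, proj_mean, relax=(1.0, 1.02))
w_mbc_blend = train(grad, w0, proj_median, blend=1e-3)
```

Swapping `proj_mean` for `proj_median` is the only change needed to go from BC to median BC, which is why the two have the same implementation cost.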

2.3 Network Architecture

Let us briefly describe the architecture of the keyword CNN [5, 8] that classifies a one second audio clip as either silence, an unknown word, ‘yes’, ‘no’, ‘up’, ‘down’, ‘left’, ‘right’, ‘on’, ‘off’, ‘stop’, or ‘go’. After pre-processing by windowed Fourier transform, the input becomes a single-channel image (a spectrogram) of size $t \times f$, same as a vector $V \in \mathbb{R}^{t \times f}$, where $t$ and $f$ are the input feature dimensions in time and frequency respectively. Next is a convolution layer that operates as follows. A weight matrix $W \in \mathbb{R}^{(m \times r) \times n}$ is convolved with the input $V$. The weight matrix is a local time-frequency patch of size $m \times r$, where $m \le t$ and $r \le f$. The weight matrix has $n$ hidden units (feature maps), and may down-sample (stride) by a factor $s$ in time and $v$ in frequency. The output of the convolution layer is $n$ feature maps of size $\frac{t-m+1}{s} \times \frac{f-r+1}{v}$. Afterward, a max-pooling operation replaces each $p \times q$ feature patch in the time-frequency domain by its maximum value, which helps remove feature variability due to speaking styles, distortions, etc. After pooling, we have $n$ feature maps of size $\frac{t-m+1}{s\,p} \times \frac{f-r+1}{v\,q}$.

The overall architecture [8] consists of two convolutional layers and one fully-connected layer, followed by a softmax function that outputs class probabilities. The training loss is the standard cross entropy function. The learning rate starts at its initial value and is reduced by a factor of 10 in the late training phase.
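The feature-map sizes above are simple integer arithmetic, sketched here as two helper functions. The helper names and the sample dimensions are hypothetical and are not the settings of the TensorFlow model in [8]; the sketch assumes a convolution without padding and non-overlapping pooling, with fractional sizes rounded to whole positions.

```python
import math

def conv_output_size(t, f, m, r, s=1, v=1):
    """Size after an (m x r) filter with strides (s, v) and no padding.

    ceil((t - m + 1) / s) counts the filter positions along time; it equals
    floor((t - m) / s) + 1. The frequency axis is handled the same way.
    """
    return math.ceil((t - m + 1) / s), math.ceil((f - r + 1) / v)

def pool_output_size(rows, cols, p=1, q=1):
    """Size after non-overlapping (p x q) max-pooling."""
    return rows // p, cols // q

# Illustrative dimensions only (assumed for this example).
t, f = 32, 40          # input size in time and frequency
m, r = 20, 8           # filter size in time and frequency
conv = conv_output_size(t, f, m, r, s=1, v=1)
pooled = pool_output_size(*conv, p=1, q=3)
print(conv, pooled)
```

With these sample values the convolution produces maps of size 13 by 33, and pooling by 3 in frequency reduces them to 13 by 11.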

3 Experimental Results

In this section, we show TensorFlow training results of the binary weight audio CNN based on median BC, BC and BR. With random (cold) start and 16000 iterations (the default, about 70 epochs), BR is the best at 3.4 % loss from the full precision accuracy. Increasing the number of iterations from 16000 to 21000 (warm start), we found that the full precision accuracy goes up to 93.5 %, the highest level observed in our experiments. The binary weight training algorithms also improve, with BR and median BC in a near tie, at 1.4 % and 1.5 % loss from the full precision accuracy respectively. For the same warm start, if blending is turned on ($\rho > 0$), median BC takes the lead at 92.4 %, which narrows the loss to 1.1 % from the full precision accuracy and is the best binary weight model that we trained. Table 1 shows all experimental results, obtained in TensorFlow on a single-GPU machine with an NVIDIA GeForce GTX 745.

We see from Table 1 that BC benefits the most from blending while all methods improve with warm start. Median BC is better than BC in all cases. Since BR involves two stages of training and is more complicated to implement than BC’s single stage training, median BC strikes a good balance of accuracy and simplicity for binary weight network training.

The most accurate full precision (32-bit) and binary (1-bit) weight CNNs are imported to an app on an Android cellular phone (Samsung Galaxy J7) for runtime comparison. A user speaks one of the words on the list (‘Yes’, ‘No’, ‘Up’, ‘Down’, ‘Left’, ‘Right’, ‘On’, ‘Off’, ‘Stop’, ‘Go’). In Fig. 1, a user speaks the word ‘No’; the app recognizes it correctly and reports a runtime in milliseconds (ms). The full-precision model takes 120 ms while the binary weight model takes 59 ms, a 2x speed-up on this particular hardware. A similar speed-up is observed on other words, for example the word ‘Off’ shown in Fig. 2. While importing the binary weight CNN to the app, we keep the standard ‘conv2d’ and ‘matmul’ TensorFlow functions in the CNN. Additional speed-up is possible with a fast multiplication algorithm utilizing the binary weight structure and/or hardware considerations.
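To illustrate why the binary weight structure admits a faster product, the sketch below computes a matrix-vector product with weights of the form $a\,u$, $u_{ij} = \pm 1$, using only additions, subtractions, and a single scaling. It is an illustration of the arithmetic, not the app's implementation and not an optimized kernel; the function name and test data are hypothetical.

```python
import numpy as np

def binary_matvec(a, u, x):
    """Compute (a * u) @ x without multiplying by individual weights.

    a : positive scalar scale
    u : matrix with entries +1 or -1, shape (n_out, n_in)
    x : input vector, shape (n_in,)
    """
    pos = np.where(u > 0, x, 0.0).sum(axis=1)   # inputs where the weight sign is +1
    neg = np.where(u < 0, x, 0.0).sum(axis=1)   # inputs where the weight sign is -1
    return a * (pos - neg)                      # one scaling per output

# Hypothetical data for a quick comparison with the dense product.
rng = np.random.default_rng(1)
a = 0.37
u = np.where(rng.normal(size=(4, 6)) >= 0, 1.0, -1.0)
x = rng.normal(size=6)
y_fast = binary_matvec(a, u, x)
y_dense = (a * u) @ x
```

A production kernel would typically pack the signs into bits and exploit hardware instructions; the point here is only that the per-weight multiplications disappear.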

| Start       | 32 bit | Median BC | BC     | BR     |
|-------------|--------|-----------|--------|--------|
| cold        | 87.9 % | 83.3 %    | 83.0 % | 84.5 % |
| warm        | 93.5 % | 92 %      | 91.7 % | 92.1 % |
| warm, blend | -      | 92.4 %    | 92.2 % | 91.6 % |

Table 1: Test accuracies of the full precision (32 bit) and the binary weight keyword CNN by different training algorithms and initializations (cold/random start, warm start), without ($\rho = 0$) and with ($\rho > 0$) blending.
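The accuracy gaps and the speed-up quoted in this section follow directly from the reported numbers. The snippet below is a small arithmetic check; the dictionary layout is an assumption for illustration, and the blended rows are compared against the warm-start full precision accuracy, as in the text.

```python
# Accuracies (%) copied from Table 1.
full = {"cold": 87.9, "warm": 93.5, "warm, blend": 93.5}
binary = {
    ("cold", "Median BC"): 83.3, ("cold", "BC"): 83.0, ("cold", "BR"): 84.5,
    ("warm", "Median BC"): 92.0, ("warm", "BC"): 91.7, ("warm", "BR"): 92.1,
    ("warm, blend", "Median BC"): 92.4, ("warm, blend", "BC"): 92.2,
    ("warm, blend", "BR"): 91.6,
}

# Loss from the full precision accuracy, rounded to one decimal place.
gaps = {key: round(full[key[0]] - acc, 1) for key, acc in binary.items()}
for key, gap in gaps.items():
    print(key, gap)

# Runtime ratio reported for the word 'No' on the phone app.
speedup = 120 / 59
print(round(speedup, 2))
```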

We also tested median BC, BC and BR on the CIFAR-10 color image dataset. In the experiments, we used the testing images for validation and warm start from pre-trained full precision models. We coded the algorithms in PyTorch [3] and carried out the experiments on a desktop with an NVIDIA GeForce GTX 1080 Ti. We ran 200 epochs for each test. The initial learning rate is 0.1 and decays by a factor of 0.1 at epochs 80 and 140. Phase II of BR starts at epoch 150, and the relaxation parameter in BR is multiplied by a fixed factor after every half epoch. In addition, training used a fixed batch size, weight decay and momentum. We tested the algorithms on VGG architectures [7], and the validation accuracies for CIFAR-10 are summarized in Table 2. Median BC outperforms BC in every case except VGG-16 with blending; blending appears to help BC and BR more than median BC in general.

| Network       | 32 bit  | Median BC | BC      | BR      |
|---------------|---------|-----------|---------|---------|
| VGG-11        | 92.01 % | 89.49 %   | 88.03 % | 88.90 % |
| VGG-11, blend | -       | 90.02 %   | 89.35 % | 89.51 % |
| VGG-13        | 93.56 % | 92.16 %   | 91.85 % | 92.05 % |
| VGG-13, blend | -       | 92.25 %   | 91.85 % | 92.48 % |
| VGG-16        | 93.45 % | 91.83 %   | 91.68 % | 92.08 % |
| VGG-16, blend | -       | 92.01 %   | 92.21 % | 92.74 % |

Table 2: Test accuracies of VGG networks on CIFAR-10 image dataset.
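The CIFAR-10 training schedule described above (200 epochs, learning rate 0.1 decayed by 0.1 at epochs 80 and 140, BR Phase II from epoch 150) can be written as two small functions. This is a sketch with assumed names and zero-based epoch counting, not the authors' training script; in PyTorch the same step schedule is commonly expressed with `torch.optim.lr_scheduler.MultiStepLR`.

```python
def learning_rate(epoch, base_lr=0.1, milestones=(80, 140), gamma=0.1):
    """Step schedule: multiply the rate by gamma at each milestone epoch."""
    passed = sum(epoch >= m for m in milestones)   # milestones already reached
    return base_lr * gamma ** passed

def br_phase(epoch, phase2_start=150):
    """Phase of BinaryRelax: 1 = relaxed projection, 2 = hard projection."""
    return 2 if epoch >= phase2_start else 1

# Print the schedule at a few representative epochs of a 200-epoch run.
for epoch in (0, 79, 80, 139, 140, 149, 150, 199):
    print(epoch, learning_rate(epoch), br_phase(epoch))
```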
Figure 1: Recognition of a spoken word ‘No’ and runtimes (ms) on Android App. Left: full precision CNN model. Right: binary weight CNN model with 2x speedup.
Figure 2: Recognition of a spoken word ‘Off’ and runtimes (ms) on Android App. Left: full precision CNN model. Right: binary weight CNN model with 2x speedup.

4 Conclusion

We studied training algorithms of binary convolutional neural networks via the closed form $\ell_1$ projector on TensorFlow and PyTorch. The median operation takes the place of the arithmetic average in the standard $\ell_2$ projector. Under warm start, the median BinaryConnect improves on the regular BinaryConnect on both audio and image CNNs. The trained binary CNN doubles the speed of the full precision CNN when implemented on an Android phone app and tested on spoken keywords.

We observed that the blending technique, developed in the $\ell_2$ projector context [10], tends to benefit BC more than median BC in experiments on both audio and image data. It remains to develop an alternative blending method for median BC.

In future work, we plan to study the $\ell_1$ projection in binary and higher-bit CNN training on larger datasets, and further understand its strengths. The higher-bit exact and approximate projection formulas in the $\ell_1$ sense are derived in the appendix.


We would like to thank Dr. Meng Yu for suggesting the TensorFlow keyword CNN [8] in this study, and for helpful conversations while our work was in progress.

Appendix: Median $b$-bit Projection Formulas ($b \ge 2$)

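Let us consider $b = 2$ and the problem of finding the closest ternary (2-bit) vector in the $\ell_1$ sense to a given real vector $w$, or the projection $\mathrm{proj}_Q\,w$, for $w \in \mathbb{R}^D$, $Q = \mathbb{R}_+ \times \{0, \pm 1\}^D$.

Write the ternary vector as $a\,u$, where $u = (u_1, \dots, u_D)$, $u_j \in \{0, \pm 1\}$, $a > 0$. Let the index set of nonzero components of $u$ be $J$ with $|J| = t$, the cardinality of $J$. Clearly, the $0$ entries of $u$ ($j \notin J$) approximate the $D - t$ smallest entries of $w$ in absolute value. So $J$ is the index set of the $t$ largest components of $w$ in absolute value. Let $w_{[t]}$ extract the $t$ largest components of $w$ in absolute value, and zero out the rest. Then

$$ \|w - a\,u\|_1 = \|w - w_{[t]}\|_1 + \sum_{j \in J} |w_j - a\,u_j|. \tag{4.1} $$

To minimize the expression in (4.1), we must have $u_j = \mathrm{sgn}(w_j)$, $j \in J$, sgn(0):=0, and:

$$ a = \arg\min_{a > 0} \sum_{j \in J} \big|\,|w_j| - a\,\big|. $$

The optimal value of $a$ is $\mathrm{median}\{|w_j| : j \in J\}$. Let (with $J = J_t$ depending on $t$):

$$ t^* = \arg\min_{1 \le t \le D} \Big( \|w - w_{[t]}\|_1 + \sum_{j \in J_t} \big|\,|w_j| - \mathrm{median}\{|w_i| : i \in J_t\}\,\big| \Big), $$

and $J^*$ be the index set of the $t^*$ largest components of $w$ in absolute value, then the optimal solution to (4.1) is $a^* u^*$ with:

$$ a^* = \mathrm{median}\{|w_j| : j \in J^*\}, \qquad u^*_j = \begin{cases} \mathrm{sgn}(w_j), & j \in J^*, \\ 0, & \text{otherwise.} \end{cases} $$

It is interesting to compare with the ternary projection in the $\ell_2$ sense [11]:

$$ t^* = \arg\max_{1 \le t \le D} \frac{\|w_{[t]}\|_1^2}{t}, \qquad a^* = \frac{\|w_{[t^*]}\|_1}{t^*}, \qquad u^* = \mathrm{sgn}(w_{[t^*]}). $$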
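The ternary case can be sketched by scanning the support size $t$ and keeping the best candidate. The code below is an illustration under assumptions, not the authors' implementation: the function names and the sample vector are hypothetical, the $\ell_1$ version recomputes a median for every $t$ (simple rather than fast), and the $\ell_2$ version uses the standard mean based rule in which $t$ maximizes $\|w_{[t]}\|_1^2 / t$.

```python
import numpy as np

def ternary_l1(w):
    """l1 ternary projection by scanning the support size t = 1..D."""
    order = np.argsort(-np.abs(w))            # indices by decreasing |w_j|
    mags = np.abs(w)[order]
    best = None
    for t in range(1, len(w) + 1):
        a = np.median(mags[:t])               # optimal scale for this support
        err = np.sum(np.abs(mags[:t] - a)) + np.sum(mags[t:])
        if best is None or err < best[0]:
            best = (err, t, a)
    _, t, a = best
    u = np.zeros(len(w))
    u[order[:t]] = np.sign(w[order[:t]])      # sgn(0) = 0 inside the support
    return a, u

def ternary_l2(w):
    """l2 ternary projection: t maximizes ||w_[t]||_1^2 / t, scale is the mean."""
    order = np.argsort(-np.abs(w))
    csum = np.cumsum(np.abs(w)[order])
    t = int(np.argmax(csum ** 2 / np.arange(1, len(w) + 1))) + 1
    u = np.zeros(len(w))
    u[order[:t]] = np.sign(w[order[:t]])
    return csum[t - 1] / t, u

# Hypothetical weight vector with three large and two small entries.
w = np.array([0.05, -0.1, 1.0, -1.2, 0.9])
a1, u1 = ternary_l1(w)
a2, u2 = ternary_l2(w)
print(a1, u1)
print(a2, u2)
```

On this vector both projections select the same three-entry support; they differ in the scale, which is the median of the selected magnitudes in one case and their mean in the other.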

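For $b \ge 3$, exact solutions are combinatorial in nature and too costly computationally [11]. A few iterations of Lloyd’s algorithm [2] are more effective [9, 10]; the algorithm alternates between the assignment step ($u$-update) and the centroid step ($a$-update). In the $u$-update of the $l$-th iteration, with $a^{l-1}$ known from the previous step, each component of $u^l$ is selected from the quantized states $\{0, \pm 1, \dots, \pm(2^{b-1} - 1)\}$ so that $a^{l-1}\,u^l$ is nearest to $w$ component by component. In the $a$-update, the problem:

$$ a^l = \arg\min_{a} \|w - a\,u^l\|_1 $$

has a closed form solution (eq. (2.3), [6]). Let us rename the sequence $\{w_j / u^l_j : u^l_j \neq 0\}$, sorted in ascending order, as $\{z_k\}_{k=1}^{m}$, where $m$ equals the cardinality of $\{j : u^l_j \neq 0\}$. Then:

$$ a^l = z_{k^*}, \qquad k^* = \min\Big\{ k : \sum_{i=1}^{k} |u^l_{(i)}| \ge \frac{1}{2} \sum_{i=1}^{m} |u^l_{(i)}| \Big\}, $$

where $u^l_{(i)}$ corresponds to the renamed index $i$.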
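The alternating scheme can be sketched as follows. This is a minimal illustration under assumptions rather than the paper's implementation: the function names, the initial scale, the iteration count, and the test vector are hypothetical, and the centroid step is written as a weighted median of the ratios $w_j / u_j$ with weights $|u_j|$, which is the minimizer of $\sum_j |u_j|\,|w_j / u_j - a|$.

```python
import numpy as np

def weighted_median(z, c):
    """A minimizer of sum_k c_k * |z_k - a| over a, for weights c_k > 0."""
    order = np.argsort(z)
    z, c = z[order], c[order]
    cum = np.cumsum(c)
    k = np.searchsorted(cum, 0.5 * cum[-1])   # first k with cum[k] >= half the total
    return z[k]

def lloyd_l1(w, bits=3, iters=5):
    """Approximate l1 projection of w onto a * {0, +-1, ..., +-(2^(bits-1) - 1)}^D.

    Assumes w is not identically zero.
    """
    top = 2 ** (bits - 1) - 1
    a = np.max(np.abs(w)) / top               # assumed initial scale
    errors = []
    for _ in range(iters):
        # assignment step: nearest quantized level, component by component
        u = np.clip(np.rint(w / a), -top, top)
        # centroid step: weighted median over the nonzero components of u
        nz = u != 0
        if nz.any():
            a = weighted_median(w[nz] / u[nz], np.abs(u[nz]))
        errors.append(np.sum(np.abs(w - a * u)))
    return a, u, errors

# Hypothetical float weights.
rng = np.random.default_rng(2)
w = rng.normal(size=20)
a, u, errors = lloyd_l1(w, bits=3, iters=5)
print(a, errors)
```

Both steps minimize the same $\ell_1$ objective over one block of variables, so the recorded error cannot increase from one iteration to the next.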