Speeding up convolutional neural networks (CNNs) on mobile platforms such as cellular phones is an important problem in real-world applications of deep learning. The central task is to reduce network complexity while maintaining accuracy. In this paper, we study weight binarization of a simple audio CNN on TensorFlow for keyword classification, propose a new class of training algorithms, and evaluate the trained binarized CNN in an Android phone app.
The audio CNN has two convolutional layers and one fully connected layer. Weight binarization refers to restricting the weight values in each layer to the form $\pm\delta$, where $\delta > 0$ is a common scalar factor in full precision (32-bit). The network complexity is thus reduced with no change of architecture. The network training, however, requires a projection step and a modification of classical projected gradient descent (PGD) called BinaryConnect. If the projection onto the set $Q$ of binary weights is in the sense of the Euclidean distance, or $\ell_2$ norm, a closed-form solution is available. With this projector $\mathrm{proj}_Q$, BinaryConnect (BC) updates both a binary weight $u^t$ and an auxiliary float-precision weight $w^t$ to avoid the weight stagnation that occurs in direct PGD,
$$u^{t+1} = \mathrm{proj}_Q\big(u^t - \eta_t\,\nabla f(u^t)\big),$$
with $\eta_t$ a small learning rate and $f$ the objective function. In other words, the small variation $-\eta_t\,\nabla f(u^t)$ may vanish under the projection to the discrete targets in $Q$, rendering $u^{t+1} = u^t$. The BC update is
$$w^{t+1} = w^t - \eta_t\,\nabla f(u^t), \qquad u^{t+1} = \mathrm{proj}_Q(w^{t+1}),$$
where the float weight $w^t$ continues to evolve and eventually moves the binary weight $u^t$. Two improvements of BC stand out in recent work. One, called BinaryRelax (BR), replaces the "hard projection" onto the discrete targets by a relaxed (approximate) projection, still in analytical closed form, so as to give the weights more freedom to explore the high-dimensional non-convex landscape of the training loss function. The relaxation tightens as training progresses and becomes a hard projection near the end, so that training produces a binarized output. BR outperforms BC on image datasets (CIFAR-10, ImageNet) and various CNNs. The other technique is "blending", which replaces $w^t$ in BC's gradient update by the convex combination $(1-\rho)\,w^t + \rho\,u^t$ for a small $\rho > 0$. With blending, the sufficient-descent inequality
$$f(u^{t+1}) \le f(u^t) - c\,\|u^{t+1} - u^t\|^2$$
holds for small enough $\eta_t$ and some positive constant $c$, provided $f$ has a Lipschitz-continuous gradient.
In this paper, we replace the $\ell_2$ projector in BC by a new projector in the $\ell_1$ sense and compare their performance on audio keyword data as well as the CIFAR-10 image dataset. The $\ell_1$ projector is found in closed analytical form, and it differs from the $\ell_2$ projector in that a median operation replaces the arithmetic average. It is known that the $\ell_1$ norm, as a measure of error, is more robust to outliers than the $\ell_2$ norm; see the cited case study on matrix completion. Our experiments on TensorFlow show that BC with the $\ell_1$ projector (called median BC) always outperforms BC, under either cold (random) or warm start. Similar findings are observed on the VGG network and CIFAR-10.
The rest of the paper is organized as follows. In section 2, we first present the $\ell_1$ projector, then summarize the BC, median BC and BR algorithms as well as the audio CNN architecture. In section 3, we compare test accuracies for 10-keyword classification and CIFAR-10 on binary weight CNNs trained by the three algorithms. The best trained binary CNN and the full-precision audio CNN are implemented in an Android phone app. The binary CNN doubles the processing speed of the full-precision CNN on the app when words on the list are spoken and correctly recognized. Concluding remarks are in section 4.
2 Binary Quantization and Training Algorithms
2.1 Binary Weight Projections
We consider the problem of finding the closest binary vector to a given real vector $w$, i.e., the projection $\mathrm{proj}_Q(w)$, for $w \in \mathbb{R}^n \setminus \{0\}$ and $Q = \{\delta\,s : \delta > 0,\ s \in \{-1,+1\}^n\}$. When the distance is Euclidean (in the sense of the $\ell_2$ norm), the problem
$$\min_{\delta > 0,\ s \in \{-1,+1\}^n} \|\delta\,s - w\|_2^2$$
has the exact solution
$$s^* = \mathrm{sgn}(w), \qquad \delta^* = \frac{\|w\|_1}{n},$$
where $\|w\|_1 = \sum_{i=1}^n |w_i|$ and $\mathrm{sgn}(w)$ is the componentwise sign of $w$. The solution is simply the sign of $w$ times the average of the absolute values of the components of $w$.
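As a numerical sanity check, the $\ell_2$ projection can be written in a few lines of NumPy (an illustrative sketch of the formula above, not the paper's code; the name `proj_l2` is ours):

```python
import numpy as np

def proj_l2(w):
    """l2 projection onto binary vectors {delta * s : delta > 0, s in {-1,+1}^n}:
    the optimal sign is sign(w) and the optimal scale is mean(|w|)."""
    return np.mean(np.abs(w)) * np.sign(w)

w = np.array([0.7, -1.2, 0.1, 2.0])
u = proj_l2(w)
# scale = (0.7 + 1.2 + 0.1 + 2.0) / 4 = 1.0, so u = [1, -1, 1, 1]
```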
Now let us consider the distance in the $\ell_1$ sense, i.e., solve
$$\min_{\delta > 0,\ s \in \{-1,+1\}^n} \|\delta\,s - w\|_1.$$
Write $u = \delta\,s$, where $\delta > 0$ and $s \in \{-1,+1\}^n$, so that $\|u - w\|_1 = \sum_{i=1}^n |\delta\,s_i - w_i|$. Then clearly $s^* = \mathrm{sgn}(w)$, and so
$$\delta^* = \mathrm{median}(|w_1|, \dots, |w_n|).$$
For a derivation of the median solution, see section 2 of the cited work, where a robust low-rank matrix factorization in the $\ell_1$ sense is studied. We shall call this the median-based binary projection.
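The robustness claim can be illustrated with a small NumPy sketch (ours, not the paper's code): with one outlier entry, the mean-based $\ell_2$ scale is dragged far from the bulk of the weights, while the median-based $\ell_1$ scale is unaffected.

```python
import numpy as np

def proj_l1(w):
    """l1 (median-based) binary projection: sign(w) times median(|w|)."""
    return np.median(np.abs(w)) * np.sign(w)

def proj_l2(w):
    """l2 (mean-based) binary projection: sign(w) times mean(|w|)."""
    return np.mean(np.abs(w)) * np.sign(w)

w = np.array([0.5, -0.5, 0.6, 100.0])   # one outlier entry
print(np.median(np.abs(w)))  # 0.55  -> l1 scale ignores the outlier
print(np.mean(np.abs(w)))    # 25.4  -> l2 scale is dominated by it
```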
2.2 Binary Weight Network Training Algorithms
Let the training loss function be $f$, and the learning rate at iteration $t$ be $\eta_t$. We compare the three algorithms below.

Binary-Connect (BC):
$$w^{t+1} = w^t - \eta_t\,\nabla f(u^t), \qquad u^{t+1} = \mathrm{proj}^{\ell_2}_Q(w^{t+1}),$$
where $\{u^t\}$ denotes the sequence of desired quantized weights and $\{w^t\}$ is an auxiliary sequence of floating-point (32-bit) weights.

Binary-Connect with median-based quantization (Median BC):
$$w^{t+1} = w^t - \eta_t\,\nabla f(u^t), \qquad u^{t+1} = \mathrm{proj}^{\ell_1}_Q(w^{t+1}).$$

Binary-Relax (BR):
$$w^{t+1} = w^t - \eta_t\,\nabla f(u^t), \qquad u^{t+1} = \frac{\lambda_t\,\mathrm{proj}^{\ell_2}_Q(w^{t+1}) + w^{t+1}}{\lambda_t + 1},$$
where $\lambda_t = \gamma^t \lambda_0$ for some $\gamma > 1$, so the relaxed projection hardens as training proceeds.

In addition, we shall compare the blended versions of the above three algorithms, obtained by replacing $w^t$ in the gradient update by $(1-\rho)\,w^t + \rho\,u^t$, for a small $\rho > 0$.
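All three updates share the gradient step and differ only in the projection. Below is a toy sketch of median BC with the optional blending term, on a simple quadratic loss (the loss, step size, and function names are illustrative assumptions, not the paper's TensorFlow implementation):

```python
import numpy as np

def proj_l1(w):
    # median-based binary projection: sign(w) * median(|w|)
    return np.median(np.abs(w)) * np.sign(w)

def median_bc(w0, target, lr=0.1, rho=0.0, iters=300):
    """Median BC on the toy loss f(u) = 0.5*||u - target||^2.
    w: float auxiliary weights; u: binary weights used in the forward pass.
    rho > 0 turns on blending of the float and binary weights."""
    w = w0.copy()
    u = proj_l1(w)
    for _ in range(iters):
        grad = u - target                        # gradient of f at the binary weights
        w = (1 - rho) * w + rho * u - lr * grad  # (blended) gradient update
        u = proj_l1(w)                           # projection step
    return u

# a binary-representable target is recovered in the limit:
target = np.array([1.0, -1.0, 1.0, -1.0])
u = median_bc(np.array([0.1, -0.2, 0.3, -0.4]), target)
```

With `rho=0` this is plain median BC; a small positive `rho` gives the blended variant.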
2.3 Network Architecture
We use a CNN to classify a one-second audio clip as either silence, an unknown word, 'yes', 'no', 'up', 'down', 'left', 'right', 'on', 'off', 'stop', or 'go'. After pre-processing by a windowed Fourier transform, the input becomes a single-channel image (a spectrogram) of size $t \times f$, equivalently a vector $x \in \mathbb{R}^{t f}$, where $t$ and $f$ are the input feature dimensions in time and frequency respectively. Next is a convolutional layer that operates as follows. A weight matrix $W \in \mathbb{R}^{m \times r}$ is convolved with the input $x$. The weight matrix is a local time-frequency patch of size $m \times r$, where $m \le t$ and $r \le f$. The layer has $n$ hidden units (feature maps), and may down-sample (stride) by a factor $s$ in time and $v$ in frequency. The output of the convolutional layer is $n$ feature maps of size $\frac{t-m+1}{s} \times \frac{f-r+1}{v}$. Afterward, a max-pooling operation replaces each $p \times q$ feature patch in the time-frequency domain by its maximum value, which helps remove feature variability due to speaking styles, distortions, etc. After pooling, we have $n$ feature maps of size $\frac{t-m+1}{s\,p} \times \frac{f-r+1}{v\,q}$.
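The feature-map sizes follow the standard 'valid' convolution and pooling arithmetic, which can be sketched as follows (a hypothetical helper with illustrative sizes, e.g. a 98 x 40 spectrogram and a 20 x 8 filter, not necessarily the paper's actual dimensions):

```python
def conv_output_size(t, f, m, r):
    """Size of the feature maps after 'valid' convolution (stride 1)
    of a t x f input with an m x r filter."""
    return (t - m + 1, f - r + 1)

def pool_output_size(t, f, p, q):
    """Size after non-overlapping p x q max pooling."""
    return (t // p, f // q)

print(conv_output_size(98, 40, 20, 8))  # (79, 33)
```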
The overall architecture consists of two convolutional layers and one fully-connected layer, followed by a softmax function that outputs class probabilities. The training loss is the standard cross-entropy function. The learning rate begins at a small value and is reduced by a factor of 10 in the late training phase.
3 Experimental Results
In this section, we show TensorFlow training results for the binary weight audio CNN trained by median BC, BC and BR. With random (cold) start and 16000 iterations (the default, about 70 epochs), BR is the best, at 3.4% below the full-precision accuracy. Increasing the number of iterations from 16000 to 21000 (warm start), we found that the full-precision accuracy rises to 93.5%, the highest level observed in our experiments. The binary weight training algorithms also improve, with BR and median BC in a near tie at 1.4% (1.5%) below the full-precision accuracy, respectively. For the same warm start, if blending is turned on, median BC takes the lead at 92.4%, which narrows the gap to 1.1% below the full-precision accuracy, the best binary weight model we trained. Table 1 shows all experimental results, conducted in TensorFlow on a single-GPU machine with an NVIDIA GeForce GTX 745.
We see from Table 1 that BC benefits the most from blending while all methods improve with warm start. Median BC is better than BC in all cases. Since BR involves two stages of training and is more complicated to implement than BC’s single stage training, median BC strikes a good balance of accuracy and simplicity for binary weight network training.
The most accurate full-precision (32-bit) and binary (1-bit) weight CNNs are imported into an app on an Android cellular phone (Samsung Galaxy J7) for runtime comparison. A user speaks one of the words on the list ('Yes', 'No', 'Up', 'Down', 'Left', 'Right', 'On', 'Off', 'Stop', 'Go'). In Fig. 1, a user speaks the word 'No'; the app recognizes it correctly and reports a runtime in milliseconds (ms). The full-precision model runs in 120 ms while the binary weight model takes 59 ms, a 2x speed-up on this particular hardware. A similar speed-up is observed on other words, for example the word 'Off' shown in Fig. 2. While importing the binary weight CNN into the app, we keep the standard 'conv2d' and 'matmul' TensorFlow functions in the CNN. Additional speed-up is possible with a fast multiplication algorithm utilizing the binary weight structure and/or hardware considerations.
Table 1: Test accuracies of the audio CNN for 10-keyword classification.

| start       | 32 bit | Median BC | BC     | BR     |
| cold        | 87.9 % | 83.3 %    | 83.0 % | 84.5 % |
| warm        | 93.5 % | 92.0 %    | 91.7 % | 92.1 % |
| warm, blend |   --   | 92.4 %    | 92.2 % | 91.6 % |
We also tested median BC, BC and BR on the CIFAR-10 color image dataset. In these experiments, we used the test images for validation and warm-started from pre-trained full-precision models. We implemented the algorithms in PyTorch and carried out the experiments on a desktop with an NVIDIA GeForce GTX 1080 Ti. We ran 200 epochs for each test. The initial learning rate is 0.1 and decays by a factor of 0.1 at epochs 80 and 140. Phase II of BR starts at epoch 150, and the relaxation parameter of BR increases by a fixed factor after every half epoch. In addition, batch size, weight decay and momentum were held fixed across runs. We tested the algorithms on VGG architectures; the validation accuracies for CIFAR-10 are summarized in Table 2. We see that median BC outperforms BC; however, blending appears to help BC and BR more than median BC in general.
Table 2: Validation accuracies on CIFAR-10 for VGG networks.

| Network       | 32 bit  | Median BC | BC      | BR      |
| VGG-11        | 92.01 % | 89.49 %   | 88.03 % | 88.90 % |
| VGG-11, blend |   --    | 90.02 %   | 89.35 % | 89.51 % |
| VGG-13        | 93.56 % | 92.16 %   | 91.85 % | 92.05 % |
| VGG-13, blend |   --    | 92.25 %   | 91.85 % | 92.48 % |
| VGG-16        | 93.45 % | 91.83 %   | 91.68 % | 92.08 % |
| VGG-16, blend |   --    | 92.01 %   | 92.21 % | 92.74 % |
4 Concluding Remarks

We studied training algorithms for binary-weight convolutional neural networks via the closed-form $\ell_1$ projector on TensorFlow and PyTorch. The median operation takes the place of the arithmetic average in the standard $\ell_2$ projector. Under warm start, median BinaryConnect improves on regular BinaryConnect for both audio and image CNNs. The trained binary CNN doubles the speed of the full-precision CNN when implemented in an Android phone app and tested on spoken keywords.
We observed that the blending technique tends to benefit BC more than median BC in experiments on both audio and image data. It remains to develop an alternative blending method for median BC.
In future work, we plan to study the $\ell_1$ projection in binary and higher-bit CNN training on larger datasets, and to further understand its strengths. The higher-bit exact and approximate projection formulas in the $\ell_1$ sense are derived in the appendix.
We would like to thank Dr. Meng Yu for suggesting the TensorFlow keyword CNN used in this study, and for helpful conversations while this work was in progress.
Appendix: Median-Based $b$-bit Projection Formulas ($b \ge 2$)
Let us consider $b = 2$ and the problem of finding the closest ternary (2-bit) vector in the $\ell_1$ sense to a given real vector $w$, i.e., the projection $\mathrm{proj}_T(w)$, for $w \in \mathbb{R}^n \setminus \{0\}$ and $T = \{\delta\,s : \delta > 0,\ s \in \{-1,0,+1\}^n\}$.

Write $u = \delta\,s$, where $\delta > 0$ and $s \in \{-1,0,+1\}^n$. Let the index set of nonzero components of $s$ be $\Lambda$, with $k = |\Lambda|$ the cardinality of $\Lambda$. Clearly, the zero components of $u$ (for $i \notin \Lambda$) approximate the $n-k$ smallest entries of $w$ in absolute value, so $\Lambda$ is the index set of the $k$ largest components of $w$ in absolute value. Let $T_k(w)$ extract the $k$ largest components of $w$ in absolute value and zero out the rest. Then
$$\|u - w\|_1 = \sum_{i \in \Lambda} |\delta\,s_i - w_i| + \sum_{i \notin \Lambda} |w_i|. \tag{4.1}$$
To minimize the expression in (4.1), we must have $s_i = \mathrm{sgn}(w_i)$ for $i \in \Lambda$ (with $\mathrm{sgn}(0) := 0$), and
$$\|u - w\|_1 = \sum_{i \in \Lambda} \big|\delta - |w_i|\big| + \sum_{i \notin \Lambda} |w_i|.$$
For a given support $\Lambda_k$, the index set of the $k$ largest components of $w$ in absolute value, the optimal value of $\delta$ is $\delta_k = \mathrm{median}\{|w_i| : i \in \Lambda_k\}$. Let
$$k^* = \arg\min_{1 \le k \le n} \Big(\sum_{i \in \Lambda_k} \big|\delta_k - |w_i|\big| + \sum_{i \notin \Lambda_k} |w_i|\Big),$$
and let $\Lambda^* = \Lambda_{k^*}$; then the optimal solution to (4.1) is $u^* = \delta^* s^*$ with
$$\delta^* = \mathrm{median}\{|w_i| : i \in \Lambda^*\}, \qquad s^* = \mathrm{sgn}(T_{k^*}(w)).$$
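The ternary $\ell_1$ projection can be evaluated by a direct search over the support size $k$ (a brute-force NumPy sketch of the formulas above; the function name and example values are ours):

```python
import numpy as np

def proj_ternary_l1(w):
    """Closest ternary vector delta*s, s in {-1,0,+1}^n, delta > 0, in the
    l1 sense: for each support size k, keep the k largest-magnitude entries,
    set delta to the median of their magnitudes, then pick the best k."""
    n = len(w)
    order = np.argsort(-np.abs(w))                       # indices by decreasing |w_i|
    best_u, best_err = np.zeros(n), np.sum(np.abs(w))    # k = 0 case
    for k in range(1, n + 1):
        idx = order[:k]
        delta = np.median(np.abs(w[idx]))
        u = np.zeros(n)
        u[idx] = delta * np.sign(w[idx])
        err = np.sum(np.abs(u - w))
        if err < best_err:
            best_u, best_err = u, err
    return best_u

w = np.array([2.0, -1.9, 0.1])
u = proj_ternary_l1(w)
# the small entry is zeroed out; the two large ones share delta = 1.95
```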
It is interesting to compare with the ternary projection in the $\ell_2$ sense, where the arithmetic average replaces the median:
$$\delta^* = \frac{\|T_{k^*}(w)\|_1}{k^*}, \qquad k^* = \arg\max_{1 \le k \le n} \frac{\|T_k(w)\|_1^2}{k}.$$
For $b > 2$, exact solutions are combinatorial in nature and too costly computationally. A few iterations of Lloyd's algorithm are more effective [9, 10]; the algorithm iterates between an assignment step ($u$-update) and a centroid step ($\delta$-update). In the $u$-update of the $j$-th iteration, with $\delta^{j-1}$ known from the previous step, each component of $u^j$ is selected from the quantized states $\delta^{j-1} q$, $q$ ranging over the admissible integer levels, so that $u^j$ is nearest to $w$ component by component. In the $\delta$-update, the problem
$$\min_{\delta > 0}\ \sum_{i=1}^n |\delta\,s_i - w_i|, \qquad s_i \ \text{the integer level selected for component}\ i,$$
has a closed-form solution (eq. (2.3), ). Since
$$\sum_{i=1}^n |\delta\,s_i - w_i| = \sum_{s_i \neq 0} |s_i|\,\Big|\delta - \frac{|w_i|}{|s_i|}\Big| + \sum_{s_i = 0} |w_i|,$$
let us rename the sequence in which each ratio $|w_i|/|s_i|$ (for $s_i \neq 0$) is repeated $|s_i|$ times as $(a_1, \dots, a_N)$, where $N = \sum_i |s_i|$. Then
$$\delta^* = \mathrm{median}(a_1, \dots, a_N),$$
a weighted median of the ratios with weights $|s_i|$.
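The renaming trick amounts to a weighted median, which can be sketched as follows (our illustrative code for the $\delta$-update, assuming the signs of $s$ match those of $w$):

```python
import numpy as np

def delta_update_l1(w, s):
    """l1-optimal scale for fixed integer codes s:
    sum_i |delta*s_i - w_i| = sum_{s_i != 0} |s_i| * |delta - w_i/s_i| + const,
    minimized by a weighted median of the ratios w_i/s_i with weights |s_i|."""
    mask = s != 0
    ratios = w[mask] / s[mask]
    weights = np.abs(s[mask]).astype(float)
    order = np.argsort(ratios)
    ratios, weights = ratios[order], weights[order]
    csum = np.cumsum(weights)
    # lower weighted median: first ratio whose cumulative weight
    # reaches half of the total weight
    j = int(np.searchsorted(csum, 0.5 * csum[-1]))
    return ratios[j]

w = np.array([2.0, -3.1, 0.9])
s = np.array([2, -3, 1])
delta = delta_update_l1(w, s)   # weighted median of the ratios [1.0, 1.0333, 0.9]
```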
-  M. Courbariaux, Y. Bengio and J. David, BinaryConnect: Training Deep Neural Networks with Binary Weights during Propagations, Conference on Neural Information Processing Systems (NIPS), pp. 3123-3131, 2015.
-  S. Lloyd, Least squares quantization in PCM, IEEE Trans. Info. Theory 28, pp. 129–137, 1982.
-  A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga and A. Lerer, Automatic differentiation in PyTorch, Tech Report, 2017.
-  M. Rastegari, V. Ordonez, J. Redmon and A. Farhadi, XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks, European Conference on Computer Vision (ECCV), 2016.
-  T. Sainath and C. Parada, Convolutional Neural Networks for Small-footprint Keyword Spotting, Interspeech 2015, pp. 1478-1482, Dresden, Germany, Sept. 6-10.
-  S. Sheen, A Coordinate Descent Method for Robust Matrix Factorization and Applications, SIAM Undergraduate Research Online, Volume 9, January 15, 2016. DOI: 10.1137/15S014472.
-  K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556, 2014.
-  Simple audio recognition tutorial, tensorflow.org, last access Oct 16, 2018.
-  P. Yin, S. Zhang, J. Lyu, S. Osher, Y. Qi and J. Xin, BinaryRelax: A Relaxation Approach for Training Deep Neural Networks with Quantized Weights, arXiv preprint arXiv:1801.06313; SIAM Journal on Imaging Sciences, to appear, 2018.
-  P. Yin, S. Zhang, J. Lyu, S. Osher, Y. Qi and J. Xin, Blended Coarse Gradient Descent for Full Quantization of Deep Neural Networks, arXiv preprint arXiv:1808.05240, 2018.
-  P. Yin, S. Zhang, Y-Y. Qi and J. Xin, Quantization and Training of Low Bit-Width Convolutional Neural Networks for Object Detection, arXiv:1612.06052v2; J. Computational Math, 37(3), 2019, pp. 1-12. Online Aug. 2018: doi:10.4208/jcm.1803-m2017-0301.