1 Introduction
Deploying deeper neural networks with large numbers of parameters has become commonplace in recent years. While this has led to state-of-the-art performance, it has also come with high computational cost and memory requirements, which limit deployment on lower-capacity devices. For this reason, there has been increased interest in the development of efficient deep neural network models that work effectively on devices with limited capabilities. Two fundamental approaches to this problem are designing smaller Convolutional Neural Network (CNN) models, such as MobileNetV2 [1], EfficientNet [2] and ShuffleNet [3], and pruning existing networks to obtain smaller networks with comparable performance. Binary-Weight-Networks, on the other hand, provide a distinct alternative in which full-precision operations are replaced with binary-precision operations. The main benefit of these models is that both memory and computation requirements are significantly reduced without changing the parameter count. Even though this comes with a performance penalty, it enables a trade-off between network performance and computational complexity, allowing such networks to run on limited-capability devices.
In recent years, an increasing number of binary network models and implementations have been proposed [4]. BitFlow [5] is reported to provide a substantial speedup over standard binary network implementations, as well as over full-precision networks. XNOR-SRAM [6] is a hardware solution for ternary XNOR-and-accumulate (XAC) operations, exhibiting significant energy savings. XNOR-Net [7], a prominent type of binary network, has been reported to achieve around 32x memory saving and a 58x theoretical speedup on CPU. XNOR-Net++ [8] proposed an improved training algorithm for binary networks, achieving higher accuracy on ImageNet compared to XNOR-Net [7].
In this paper, we propose an implementation of binary convolutional networks on GPU and an optimization of binary XNOR convolution. While training deep networks has high computational cost, training is generally done once, before deployment of the network, and can therefore be carried out on systems with high computational and memory capacity. The inference path, on the other hand, is run continuously once the network is deployed, and for real-life applications the network is generally required to run on cost-effective devices. Hence, inference is desired to have low computational complexity for cost-effective and widespread deployment. In this work, the XNOR-Net binary network [7] is taken as the reference method and the forward path of this algorithm, used for inference, is optimized on GPU.
2 Background
The main bottleneck of CNN models is the high memory requirement, which hinders their deployment on limited-capacity devices. Binary-Weight-Networks [7] binarize the weight values as opposed to using full precision, and can achieve around 32x memory saving together with a speedup. By approximating both weights and inputs as binary values, XNOR-Net can achieve a 58x speedup in CPU implementations [7]. In this section we first describe binary networks in general and then describe the specifics of XNOR-Net.
2.1 Binary Weight Networks
First, the weight values need to be approximated as binary values so that convolution can be implemented with the help of efficient subtraction and addition operations. The binary weights, $B \in \{+1,-1\}^{c \times w \times h}$, are indexed by the triplet $(i, j, k)$, where $i$ indicates the row, $j$ indicates the column, and $k$ indicates the channel. The weights, $W$, are represented as binary with the help of a scaling factor $\alpha$, so that $W \approx \alpha B$. Then the convolution can be approximated as in Eq. 1, where $\oplus$ indicates a convolution without any multiplication.

$$I * W \approx (I \oplus B)\,\alpha \qquad (1)$$

$B$ is a binary filter and $\alpha \in \mathbb{R}^{+}$ is a scaling factor. To find the optimal solution, the optimization in Eq. 2 is solved.

$$J(B, \alpha) = \|W - \alpha B\|^2, \qquad \alpha^*, B^* = \operatorname*{argmin}_{\alpha, B} J(B, \alpha) \qquad (2)$$

Expanding $J$ gives $J(B, \alpha) = \alpha^2 B^{\top}B - 2\alpha W^{\top}B + W^{\top}W$, where $B^{\top}B = n$ (with $n = c \times w \times h$) and $W^{\top}W$ is constant. The term that is to be minimized is $-2\alpha W^{\top}B$, which requires maximization of $W^{\top}B$. Since $B \in \{+1,-1\}^n$, the maximization can be done by taking the sign of $W$, i.e. $B^* = \operatorname{sign}(W)$. By taking the derivative of $J$ with respect to $\alpha$ and setting it to zero, Eqs. 3 and 4 are obtained.

$$\frac{\partial J}{\partial \alpha} = 2\alpha n - 2\,W^{\top}B = 0 \qquad (3)$$

$$\alpha^* = \frac{W^{\top}B^*}{n} \qquad (4)$$

By replacing $B^*$ with $\operatorname{sign}(W)$, this can be written as in Eq. 5, which implies that the optimal estimate of the binary weight is computed by taking the sign of the weights, and the scale factor is the average of the absolute weight values.

$$\alpha^* = \frac{W^{\top}\operatorname{sign}(W)}{n} = \frac{\sum_i |W_i|}{n} = \frac{1}{n}\|W\|_{\ell 1} \qquad (5)$$
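To make the result of Eq. 5 concrete, the following sketch (names hypothetical, weights stored as a flat vector) computes the optimal binary filter $B^* = \operatorname{sign}(W)$ and the scaling factor $\alpha^*$ as the mean of the absolute weight values:

```cpp
#include <cmath>
#include <vector>

// Optimal binary approximation W ~ alpha * B (Eqs. 2-5):
// B* = sign(W), alpha* = (1/n) * ||W||_l1.
struct BinaryWeights {
    std::vector<int> B;  // entries in {+1, -1}
    double alpha;        // scaling factor
};

BinaryWeights binarize(const std::vector<double>& W) {
    BinaryWeights out;
    out.B.reserve(W.size());
    double l1 = 0.0;
    for (double w : W) {
        out.B.push_back(w >= 0 ? +1 : -1);  // sign(W_i)
        l1 += std::fabs(w);                 // accumulate |W_i|
    }
    out.alpha = l1 / static_cast<double>(W.size());
    return out;
}
```

For example, the weight vector (1, -2, 3, -4) binarizes to (+1, -1, +1, -1) with scale 2.5, the average absolute weight.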
2.2 XNOR Networks
In addition to binarization of weights, XNOR-Networks also binarize the inputs. This can be considered as binarizing the inputs of the convolutions with the help of a binary activation function. Since both the weights and inputs have binary values, the convolution operation can then be implemented using the XNOR operation. Since both are binary vectors, the convolution operation is comprised of shift and dot product operations. In the Binary Weight Network, $W$ is approximated as $\alpha B$; the input $X$ is likewise approximated as $\beta H$, where $H = \operatorname{sign}(X)$ and $\beta$ is its scaling factor, so that $X^{\top}W \approx \beta H^{\top} \alpha B$. This time the optimization process involves the two parameter pairs $(\alpha, B)$ and $(\beta, H)$, as in Eq. 6:

$$\alpha^*, B^*, \beta^*, H^* = \operatorname*{argmin}_{\alpha, B, \beta, H} \|X \odot W - \beta\alpha\, H \odot B\| \qquad (6)$$

where $\odot$ indicates element-wise product. To put the equation into a simpler form, we can define $Y$ as $X \odot W$, $C$ as $H \odot B$, and $\gamma$ as $\beta\alpha$. This can be written using the same approach as in Binary Weight Networks, as in Eqs. 7 and 8.

$$\gamma^*, C^* = \operatorname*{argmin}_{\gamma, C} \|Y - \gamma C\| \qquad (7)$$

$$C^* = \operatorname{sign}(Y) = \operatorname{sign}(X) \odot \operatorname{sign}(W) = H^* \odot B^* \qquad (8)$$

Since $|X_i|$ and $|W_i|$ are independent, this leads to Eq. 9.

$$\gamma^* = \frac{\sum_i |Y_i|}{n} = \frac{\sum_i |X_i|\,|W_i|}{n} \approx \left(\frac{1}{n}\|X\|_{\ell 1}\right)\left(\frac{1}{n}\|W\|_{\ell 1}\right) = \beta^*\alpha^* \qquad (9)$$

For calculating the scale factors over all sliding windows, the average of the absolute values of the input channels, $A = \frac{\sum_k |X_{:,:,k}|}{c}$, is taken and convolved with a 2D filter $k \in \mathbb{R}^{w \times h}$ whose entries all equal $\frac{1}{w \cdot h}$. The resulting matrix $K$ and the final approximation can be defined as in Eqs. 10 and 11, where $\circledast$ indicates the XNOR-based convolution.

$$K = A * k \qquad (10)$$

$$I * W \approx \left(\operatorname{sign}(I) \circledast \operatorname{sign}(W)\right) \odot K\,\alpha \qquad (11)$$
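The key step from Eq. 7 to Eq. 8 is that the sign of an element-wise product factorizes into the product of signs. The small sketch below (helper names hypothetical) checks this identity for nonzero inputs, which is what lets the binary input and binary weight be computed independently:

```cpp
#include <cstddef>
#include <vector>

// sign with the convention sign(x) = +1 for x >= 0
// (valid for the identity below as long as no entry is exactly zero).
int sign(double v) { return v >= 0 ? +1 : -1; }

// Checks Eq. 8 element-wise: sign(X_i * W_i) == sign(X_i) * sign(W_i).
bool sign_factorizes(const std::vector<double>& X,
                     const std::vector<double>& W) {
    for (std::size_t i = 0; i < X.size(); ++i)
        if (sign(X[i] * W[i]) != sign(X[i]) * sign(W[i])) return false;
    return true;
}
```

This is why the binarized input sign(X) and binarized weight sign(W) can be packed into registers separately and combined later with a single XNOR.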
3 Algorithm Implementation
3.1 Binary Convolution
In this section, we first describe the generic implementation of the XNOR convolution. Then, the CPU and GPU implementations and their differences over the same pipeline are described. Binary convolution has the following steps:

1. XNOR convolution bit operations:
   a. Conversion of the input data type to binary type.
   b. XNOR bitwise logical operation on the binary data with the binary weights.
   c. Summation of the output binary bits, where 0 values are counted as -1.
   d. Conversion from binary to float data type.

2. XNOR convolution scaling factor computation:
   a. Channel-wise summation of the input data.
   b. Multiplication of the resulting matrix with the scalar value.

3. Multiplication of the float output of the XNOR convolution with the $K$ and $\alpha$ values.
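The bit-operation steps above can be sketched for a single channel as follows. This is a minimal, hypothetical CPU-side illustration, not the paper's tiled implementation: each image row is packed into one 64-bit mask (so the width must be at most 64), the kernel holds ±1 weights, and GCC's `__builtin_popcountll` does the bit counting:

```cpp
#include <cstdint>
#include <vector>

// Single-channel 'valid' XNOR convolution following the four bit steps:
// 1) binarize input rows into 64-bit masks, 2) XNOR against binary
// weights, 3) count 1-bits with 0-bits counted as -1, 4) emit integers.
std::vector<std::vector<int>> xnor_conv(
        const std::vector<std::vector<float>>& img,
        const std::vector<std::vector<int>>& B, int k) {
    int H = img.size(), W = img[0].size();
    // Step 1: pack sign bits of each row (bit j = 1 iff pixel >= 0).
    std::vector<uint64_t> rows(H, 0);
    for (int i = 0; i < H; ++i)
        for (int j = 0; j < W; ++j)
            if (img[i][j] >= 0) rows[i] |= (uint64_t{1} << j);
    // Pack weight rows the same way (bit j = 1 iff B[i][j] == +1).
    std::vector<uint64_t> wrows(k, 0);
    for (int i = 0; i < k; ++i)
        for (int j = 0; j < k; ++j)
            if (B[i][j] == +1) wrows[i] |= (uint64_t{1} << j);
    std::vector<std::vector<int>> out(H - k + 1,
                                      std::vector<int>(W - k + 1));
    uint64_t mask = (uint64_t{1} << k) - 1;  // low k bits of a window
    for (int y = 0; y < H - k + 1; ++y)
        for (int x = 0; x < W - k + 1; ++x) {
            int matches = 0;
            for (int i = 0; i < k; ++i) {
                // Steps 2-3: XNOR window bits with the weight row, then
                // count 1-bits (agreements between input and weight).
                uint64_t win = (rows[y + i] >> x) & mask;
                matches += __builtin_popcountll(~(win ^ wrows[i]) & mask);
            }
            // Step 4: each agreement is +1, each disagreement -1.
            out[y][x] = 2 * matches - k * k;
        }
    return out;
}
```

The result is the ±1 dot product at each position; multiplying it by the scaling factors of step 3 yields the final approximation of Eq. 11.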
3.1.1 Converting Integer to Binary
3.1.2 Binary Convolution
After converting the input to a binary image, XNOR convolution is applied on the binary image. A simple iterative XNOR operation is enough to obtain the XNOR convolution outputs. The pseudocode is given in Algorithm 2.
The theoretical speedup that can be achieved for this part is 58x [7]. However, networks used in computer vision use larger kernel sizes to obtain a sufficient receptive field, hence convolutions with larger kernels are needed in practice. In this work, we use a larger kernel size, which results in a more modest speedup as it necessitates an iterative approach. In [7], convolution kernel weights fill every bit inside a register; for a 1x1 kernel, this amounts to copying the same sign value into each bit of the register. When larger convolution kernels are used, XNOR convolution cannot be applied to each of the bit-pixel values, since bits at the edge of the registers would require padding. Hence, in our implementation, the weight register only contains one meaningful weight value and the other bits are masked by a bitwise AND operation.

3.1.3 Binary Image to Integer Image
The output of the XNOR convolution is still in binary image format and each convolution result is stored inside a single register. To convert the convolution result into an integer, the total number of 1-bits in the register needs to be counted. In our implementation, we use the relevant x86_64 population-count instruction in the CPU implementation and the special function provided by CUDA (__popc) in the GPU implementation.
3.1.4 Multiplication by Scaling Factor
By averaging the absolute input values across channels, the matrix $A$ is obtained. $A$ is convolved with the box filter $k$ to get the scaling factor matrix $K$ (Eq. 10), which is multiplied with the output together with $\alpha$.
We implemented multi-threaded versions of both vanilla convolution and XNOR convolution on the CPU as baseline methods, to compare against the parallelized versions on the GPU. In the following sections, we first describe the CPU implementation; this is followed by the description of the GPU implementation.
3.2 Binary Convolution on CPU
The CPU implementation has the following steps:

1. Apply zero padding to the 3D input tensor.
2. Convert the tensor and weights to binary type.
3. Apply the bitwise XNOR operation on the binary tensor.
4. Convert the binary tensor to an integer tensor.
5. Repeat steps 2, 3, 4 for all input channels and sum the results across input channels.
6. Repeat steps 2, 3, 4, 5 for all output channels (filters).
7. Sum the input tensor across channels to find the scaling factor.
8. Perform scalar multiplication of the output of step 5 with the $\alpha$ and $K$ scaling factors.
3.2.1 Converting Integer to Binary
The input and weight tensors are stored inside registers with the unsigned long data type to fully utilize 64-bit CPU registers and benefit from 64-bit operations. Each 64-bit register can hold 64 data elements of a matrix. Hence, the input image is divided into 8x8 tiles, each of which is then stored in a single register in binary form. The pseudocode is provided in Algorithm 1.
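The packing step of Algorithm 1 can be sketched as follows (function name hypothetical): an 8x8 tile of pixel values becomes the sign bits of one 64-bit register, with bit index row*8 + column:

```cpp
#include <cstdint>

// Integer-to-binary packing (sketch of Algorithm 1): stores the sign
// bits of an 8x8 pixel tile in a single 64-bit register.
uint64_t pack_tile(const float tile[8][8]) {
    uint64_t reg = 0;
    for (int r = 0; r < 8; ++r)
        for (int c = 0; c < 8; ++c)
            if (tile[r][c] >= 0)  // sign(x) = +1 maps to bit value 1
                reg |= (uint64_t{1} << (r * 8 + c));
    return reg;
}
```

An all-positive tile packs to a register of all 1-bits, and an all-negative tile to all 0-bits.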
3.2.2 Binary Convolution
In this part, CPU registers are used, since the same register is accessed iteratively multiple times. The pseudocode is given in Algorithm 2.
3.2.3 Binary Image to Integer Image
Each convolution result is stored inside a single 64-bit register. To convert the convolution result into an integer value, the total number of 1-bits in the register needs to be counted. This can be done by using the special built-in function __builtin_popcountll of the GCC compiler (the 64-bit variant of __builtin_popcount), which performs this operation more efficiently than hash mapping.
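The conversion described above reduces to one popcount per register. A small sketch (function name hypothetical): since every 1-bit is an agreement (+1) and every 0-bit among the n compared bits is a disagreement (-1), the integer dot product is 2*popcount - n:

```cpp
#include <cstdint>

// Binary-to-integer conversion: count 1-bits in the register with the
// GCC/Clang builtin; the CUDA intrinsics __popc/__popcll play the same
// role on GPU. For n compared bits, dot product = 2*popcount - n.
int binary_to_int(uint64_t reg, int n) {
    return 2 * __builtin_popcountll(reg) - n;
}
```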
3.2.4 Multiplication by Scaling Factor
This part is done as described in 3.1.4.
3.3 Binary Convolution on GPU
In XNOR convolution, the XNOR operation and the scaling factor calculation are independent; hence they can be run asynchronously in two different CUDA streams. The XNOR convolution result is then obtained by multiplying the outputs of these two streams.
The GPU implementation has the following steps, running in two different streams:

Stream 1:
1. Apply zero padding to the 3D input tensor.
2. Convert the tensor and weights to binary type.
3. Apply the bitwise XNOR operation on the binary tensor.
4. Convert the binary tensor to an integer tensor.

Stream 2:
1. Sum the input tensor across channels to find the scaling factor.
2. Perform scalar multiplication of the $\alpha$ and $K$ scaling factors.
3.3.1 Input Scaling Factor
For calculating the scaling factor of the input, the average of the absolute values across channels is taken and convolved with the 2D filter $k$ as in Eq. (10). This includes two steps: averaging across channels and convolution. For the summation, each thread calculates the sum of one pixel across channels.
Computing the input scaling factor matrix includes the following steps after copying from host to device:

1. Set the grid and block sizes.
2. Average the pixels across input channels.
3. Execute the CUDA kernel that computes the convolution with $k$.
4. Deallocate the GPU memory.
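The computation behind steps 2 and 3 can be sketched in plain CPU code as follows (hypothetical helper; in the GPU version each loop iteration over a pixel corresponds to one CUDA thread). It computes $A$ as the per-pixel average of $|X|$ across channels and then convolves $A$ with the constant box filter $k$ of Eq. 10:

```cpp
#include <cmath>
#include <vector>

// Input scaling factor matrix (Eq. 10): A = mean over channels of |X|,
// K = A * k, where the kh x kw filter k has all entries 1/(kh*kw).
std::vector<std::vector<float>> scale_matrix(
        const std::vector<std::vector<std::vector<float>>>& X, // [c][H][W]
        int kh, int kw) {
    int C = X.size(), H = X[0].size(), W = X[0][0].size();
    std::vector<std::vector<float>> A(H, std::vector<float>(W, 0.f));
    for (int c = 0; c < C; ++c)           // average |X| across channels
        for (int i = 0; i < H; ++i)
            for (int j = 0; j < W; ++j)
                A[i][j] += std::fabs(X[c][i][j]) / C;
    // 'valid' convolution with the constant 1/(kh*kw) box filter
    std::vector<std::vector<float>> K(H - kh + 1,
                                      std::vector<float>(W - kw + 1, 0.f));
    for (int i = 0; i + kh <= H; ++i)
        for (int j = 0; j + kw <= W; ++j) {
            float s = 0.f;
            for (int di = 0; di < kh; ++di)
                for (int dj = 0; dj < kw; ++dj)
                    s += A[i + di][j + dj];
            K[i][j] = s / (kh * kw);
        }
    return K;
}
```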
3.3.2 Binary Convolution Operation
The main idea of the algorithm is similar to the CPU version. However, while CPU registers are 64-bit, GPU registers are 32-bit, and each register now holds a 32-element image tile rather than the 64-element tiles used on the CPU. Each CUDA thread converts one image tile (stored in a single register) to binary in parallel. The total number of threads to launch can be calculated as in Eq. 12, where $w_i$, $h_i$, $w_r$, $h_r$, $w_k$, $h_k$ represent the input image width, image height, register tile width, register tile height, kernel width and kernel height, respectively.

$$T = \left\lceil \frac{w_i}{w_r - w_k + 1} \right\rceil \cdot \left\lceil \frac{h_i}{h_r - h_k + 1} \right\rceil \qquad (12)$$
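One plausible reading of this thread-count calculation (an assumption, not necessarily the authors' exact formula) is that register tiles must overlap by the kernel extent minus one, so each tile of size $w_r \times h_r$ contributes $(w_r - w_k + 1) \times (h_r - h_k + 1)$ valid outputs:

```cpp
// Hypothetical tile-count calculation: tiles are laid out with a stride
// equal to the number of valid convolution outputs per tile, and one
// CUDA thread is launched per tile.
int thread_count(int wi, int hi, int wr, int hr, int wk, int hk) {
    int sx = wr - wk + 1, sy = hr - hk + 1;  // tile stride in x and y
    int tx = (wi + sx - 1) / sx;             // ceil(wi / sx)
    int ty = (hi + sy - 1) / sy;             // ceil(hi / sy)
    return tx * ty;                          // one thread per tile
}
```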
After converting the image tiles to binary, XNOR convolution is applied on all input channels. Denoting the number of input channels and the number of output channels by $c_{in}$ and $c_{out}$ respectively, a total of $c_{out}$ different convolutions (having different weights) are calculated for each input channel. For this purpose, two options were explored: (i) using a single kernel to calculate all binary convolutions on all inputs, (ii) using $c_{out}$ kernels, each calculating the binary convolution for one output channel separately. The first approach results in better register utilization, as it keeps intermediate results in register space for the whole process without any need to copy results to global memory. However, this approach prohibits parallelization over output channels: for each input channel, $c_{out}$ convolution operations need to be calculated, and these are executed by the same kernel thread iterating a loop $c_{out}$ times. Therefore, the second approach is preferred. In that case, since the binary image produced by the integer-to-binary conversion can be stored in global memory, multiple streams can access this data. As a result, the streams can run asynchronously, resulting in better parallelization.
3.3.3 Multiplication by Scaling Factor
This is a straightforward element-wise multiplication of the integer convolution output with the $K$ and $\alpha$ scaling factors, with one output element per CUDA thread.
4 Experimental Evaluation
Table I: GPU vanilla convolution vs. GPU XNOR convolution runtimes.

Input Size  Vanilla Conv.  XNOR Conv.  Speedup
--          0.062          0.024       2.58x
--          0.186          0.069       2.70x
--          0.671          0.252       2.66x
--          2.641          0.986       2.68x
Table II: CPU vs. GPU vanilla convolution runtimes.

Input Size  CPU      GPU    Speedup
--          3.437    0.061  56.3x
--          10.623   0.186  57.1x
--          35.811   0.671  53.4x
--          132.714  2.641  50.3x
Table III: CPU vs. GPU XNOR convolution runtimes.

Input Size  CPU     GPU     Speedup
--          0.743   0.0237  31.3x
--          2.531   0.0692  36.6x
--          10.088  0.2519  40.0x
--          42.011  0.9859  42.6x
We ran the experiments on a system with an Intel i7-7700 CPU with 4 cores and an Nvidia GTX 1080 Ti GPU. The time measurements take only the computation into account; memory operations such as allocation, copying and deallocation are excluded. For the multi-core CPU implementation, OpenMP was used; the sub-parts explained above operate on single-channel input matrices. For the GPU implementation, CUDA was used. We observed that the performance was insensitive to block size, so the experiments were conducted with a constant block size of 256. We used a constant kernel size throughout the experiments for both the CPU and GPU versions. All experiments were repeated 100 times and average runtimes were calculated.
As shown in Table I, GPU XNOR convolution provides a speedup of 2.58x to 2.70x over GPU vanilla convolution. The speedup remains fairly constant across different image sizes.
5 Discussion
While the GPU XNOR convolution implementation has better performance than its CPU XNOR and GPU vanilla counterparts, the speedups we observed were lower than those reported in [7]. It has to be noted that our design uses a single kernel thread for each logical operation and, as such, lacks the ability to perform 32 binary logical operations at a time (assuming 32-bit registers). Further optimizations could therefore leverage bitwise parallelism. On the other hand, the use of separate registers allows easier conversion from binary outputs to integers. XNOR convolution requires binarization of the input and multiplication with the scaling factor at the end. Converting integer input image values to binary values, and restoring integer values from the output of the XNOR convolution, are costly operations, as they require sequential write operations to modify each bit inside a register and read them back after the convolution. For a deeper network, this process may be optimized by passing the binary outputs to the next kernel without integer conversion.
XNOR convolution involves two processes that can run concurrently: computing the scaling matrix and the binary convolution operation. As future work, multiple streams can be used to overlap these operations.
6 Conclusions
We have implemented and optimized the XNOR convolution operation [7] used in binary convolutional networks on CPU and GPU, and comparatively evaluated their performance. The experimental results show that up to 42.6x speedup can be achieved on GPU compared to the multi-threaded CPU implementation.
We implemented the operations required for the whole inference path of the binary network (i.e., scaling factor calculation and multiplication, binary-to-integer and integer-to-binary conversion, XNOR convolution) and made the code publicly available at https://github.com/metcan/BinaryConvolutionalNeuralNetworkInferenceonGPU. However, it has to be noted that the operations other than the XNOR convolution are not optimized and were developed for testing only. Hence, for a real-life deployment requiring high levels of performance, these parts also need to be optimized.
References

[1] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 4510–4520.
[2] M. Tan and Q. V. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," arXiv preprint arXiv:1905.11946, 2019.
[3] X. Zhang, X. Zhou, M. Lin, and J. Sun, "ShuffleNet: An extremely efficient convolutional neural network for mobile devices," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6848–6856.
[4] H. Qin, R. Gong, X. Liu, X. Bai, J. Song, and N. Sebe, "Binary neural networks: A survey," Pattern Recognition, vol. 105, p. 107281, 2020.
[5] Y. Hu, J. Zhai, D. Li, Y. Gong, Y. Zhu, W. Liu, L. Su, and J. Jin, "BitFlow: Exploiting vector parallelism for binary neural networks on CPU," in 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2018, pp. 244–253.
[6] S. Yin, Z. Jiang, J. Seo, and M. Seok, "XNOR-SRAM: In-memory computing SRAM macro for binary/ternary deep neural networks," IEEE Journal of Solid-State Circuits, vol. 55, no. 6, pp. 1733–1743, 2020.
[7] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," CoRR, vol. abs/1603.05279, 2016. [Online]. Available: http://arxiv.org/abs/1603.05279
[8] A. Bulat and G. Tzimiropoulos, "XNOR-Net++: Improved binary neural networks," arXiv preprint arXiv:1909.13863, 2019.