Optimization of XNOR Convolution for Binary Convolutional Neural Networks on GPU

by Mete Can Kaya, et al.
Middle East Technical University

Binary convolutional networks have lower computational load and a lower memory footprint than their full-precision counterparts, making them a feasible alternative for deploying computer vision applications on limited-capacity embedded devices. Once trained in less resource-constrained computational environments, they can be deployed for real-time inference on such devices. In this study, we propose an implementation of binary convolutional network inference on GPU, focusing on the optimization of XNOR convolution. Experimental results show that using a GPU can provide a speed-up of up to 42.61x with a kernel size of 3x3. The implementation is publicly available at https://github.com/metcan/Binary-Convolutional-Neural-Network-Inference-on-GPU


1 Introduction

Deploying deep neural networks with large numbers of parameters has become commonplace in recent years. While this has led to state-of-the-art performance, it also comes with high computational cost and memory requirements, which has limited deployment on lower-capacity devices. For this reason, there has been increased interest in the development of efficient deep neural network models that can work effectively on devices with limited capabilities. Two fundamental approaches to this problem are designing smaller Convolutional Neural Network (CNN) models, such as MobileNetV2 [1], EfficientNet [2] and ShuffleNet [3], and pruning existing networks to obtain smaller networks with comparable performance. Binary-Weight-Networks provide a distinct alternative, in which full-precision operations are replaced with binary-precision operations. The main benefit of these models is that both memory and computation requirements are significantly reduced without changing the number of parameters. Even though this comes with a performance penalty, it allows a trade-off between network performance and computational complexity so that such networks can run on limited-capability devices.
In recent years, an increasing number of binary network models and implementations have been proposed [4]. BitFlow [5] is reported to achieve a speed-up over standard binary network implementations as well as over full-precision networks. XNOR-SRAM [6] is a hardware solution for ternary-XNOR-and-accumulate (XAC) operations that exhibits significant energy savings. XNOR-Net [7], a prominent type of binary network, has been reported to provide substantial memory savings and a theoretical speed-up on CPU. XNOR-Net++ [8] proposed an improved training algorithm for binary networks, achieving higher accuracy on ImageNet compared to XNOR-Net.

In this paper, we propose an implementation of binary convolutional networks on GPU and an optimization of binary XNOR convolution. While training deep networks has a high computational cost, training is generally done once, before the deployment of the network; hence it can be done on systems with higher computational and memory capacity. The inference path of the network, on the other hand, runs continuously once deployed, and for real-life applications the network generally needs to be deployed on cost-effective devices. Inference is therefore desired to have low computational complexity for cost-effective and widespread deployment. In this work, the XNOR-Net binary network [7] is taken as the reference method, and the forward path of this algorithm, used for inference, is optimized on GPU.

2 Background

The main bottleneck of CNN models is their high memory requirement, which hinders their deployment on limited-capacity devices. Binary-Weight-Networks [7] binarize the weight values as opposed to using full precision, and can achieve memory savings and speed-up. By approximating both the weights and the input as binary values, XNOR-Net can achieve a further speed-up in CPU implementations. In this section, we first describe binary networks in general and then describe the specifics of XNOR-Net.

2.1 Binary Weight Networks

First, the weight values need to be approximated as binary values so that convolution can be implemented with efficient addition and subtraction operations. A weight filter $W \in \mathbb{R}^{c \times w \times h}$ is indexed by the triplet $(i, j, k)$, where $i$ indicates the row, $j$ the column, and $k$ the channel. The weights $W$ are represented in binary with the help of a scaling factor $\alpha \in \mathbb{R}^+$. The convolution can then be approximated as in Eq. 1, where $\oplus$ indicates a convolution without any multiplication:

$$I * W \approx (I \oplus B)\,\alpha \tag{1}$$

Here $B \in \{+1, -1\}^{c \times w \times h}$ is a binary filter and $\alpha$ is a scaling factor. To find the optimal solution, the optimization in Eq. 2 is solved:

$$J(B, \alpha) = \lVert W - \alpha B \rVert^2, \qquad \alpha^*, B^* = \operatorname*{arg\,min}_{\alpha, B}\; J(B, \alpha) \tag{2}$$

Expanding $J$ gives $J(B, \alpha) = \alpha^2 B^\top B - 2\alpha W^\top B + W^\top W$, where $B^\top B = n$ (with $n = c \times w \times h$) and $W^\top W$ is constant. The parameter-dependent term to be minimized is $-2\alpha W^\top B$, which requires maximization of $W^\top B$. Since $B \in \{+1, -1\}^n$, the maximization is achieved by taking the sign of $W$, i.e. $B^* = \operatorname{sign}(W)$. By taking the derivative of $J$ with respect to $\alpha$ and setting it to zero, Eqs. 3 and 4 are obtained:

$$\frac{\partial J}{\partial \alpha} = 2\alpha n - 2 W^\top B = 0 \tag{3}$$

$$\alpha^* = \frac{W^\top B^*}{n} \tag{4}$$

By replacing $B^*$ with $\operatorname{sign}(W)$, this can be written as in Eq. 5:

$$\alpha^* = \frac{W^\top \operatorname{sign}(W)}{n} = \frac{\sum_i |W_i|}{n} = \frac{1}{n}\,\lVert W \rVert_{\ell 1} \tag{5}$$

which implies that the optimal binary weight estimate is obtained by taking the sign of the weights, and the optimal scale factor is the average of the absolute weight values.
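As a concrete illustration of this result, the following plain C++ sketch (our own, not from the paper's code base) binarizes a small weight vector and computes its scaling factor:

```cpp
#include <cmath>
#include <vector>

// Binarize weights: B = sign(W), alpha = (1/n) * ||W||_l1.
// Illustrative sketch; names are ours, not from the paper's implementation.
float binarize(const std::vector<float>& W, std::vector<int>& B) {
    B.resize(W.size());
    float alpha = 0.0f;
    for (size_t i = 0; i < W.size(); ++i) {
        B[i] = (W[i] >= 0.0f) ? +1 : -1;  // optimal B* = sign(W)
        alpha += std::fabs(W[i]);          // accumulate |W_i|
    }
    return alpha / W.size();               // optimal alpha* = mean(|W|)
}
```

For example, `W = {0.7, -0.2, 0.1, -0.9}` gives `alpha = 0.475` and `B = {+1, -1, +1, -1}`, so each weight is approximated by `alpha * B[i]`.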


2.2 XNOR Networks

In addition to binarizing the weights, XNOR-Net also binarizes the inputs. This can be considered as binarizing the inputs of the convolutions with the help of a binary activation function. Since both the weights and the input have binary values, the convolution can be implemented using the XNOR operation: as both are binary vectors, the convolution is comprised of shift and dot product operations. In the Binary Weight Network, $W$ is approximated as $\alpha B$; the input sub-tensor $X$ is now likewise approximated as $\beta H$, so that $X^\top W \approx \beta H^\top \alpha B$. This time the optimization involves the two parameter pairs $(\alpha, B)$ and $(\beta, H)$ as in Eq. 6:

$$\alpha^*, B^*, \beta^*, H^* = \operatorname*{arg\,min}_{\alpha, B, \beta, H}\; \lVert X \odot W - \beta \alpha\, H \odot B \rVert \tag{6}$$

where $\odot$ indicates element-wise product. To put the equation into a simpler form, we can define $Y$ as $X \odot W$, $C$ as $H \odot B$, and $\gamma$ as $\beta \alpha$. This can be written using the same approach as in Binary Weight Networks, as in Eqs. 7 and 8:

$$\gamma^*, C^* = \operatorname*{arg\,min}_{\gamma, C}\; \lVert Y - \gamma C \rVert \tag{7}$$

$$C^* = \operatorname{sign}(Y) = \operatorname{sign}(X) \odot \operatorname{sign}(W) = H^* \odot B^* \tag{8}$$

Since $X$ and $W$ are independent, this leads to Eq. 9:

$$\gamma^* = \frac{\sum_i |Y_i|}{n} = \frac{\sum_i |X_i|\,|W_i|}{n} \approx \left(\frac{1}{n}\lVert X \rVert_{\ell 1}\right)\left(\frac{1}{n}\lVert W \rVert_{\ell 1}\right) = \beta^* \alpha^* \tag{9}$$

For calculating the scale factors of all sub-tensors, the channel-wise average of absolute input values, $A = \frac{\sum_i |I_{:,:,i}|}{c}$, is taken and convolved with a 2D box filter $k \in \mathbb{R}^{w \times h}$ with entries $k_{ij} = \frac{1}{w \times h}$. The resulting scale matrix $K$ and the final approximation can be defined as in Eqs. 10 and 11:

$$K = A * k \tag{10}$$

$$I * W \approx \left(\operatorname{sign}(I) \circledast \operatorname{sign}(W)\right) \odot K\,\alpha \tag{11}$$

where $\circledast$ indicates a convolution implemented with XNOR and bitcount operations.
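The binary dot product at the heart of this XNOR convolution reduces to an XNOR followed by a bit count. A minimal C++ sketch (ours; the encoding bit 1 = +1, bit 0 = -1 is an assumption of this example):

```cpp
#include <cstdint>

// Dot product of two {+1,-1} vectors packed as bits (1 -> +1, 0 -> -1).
// XNOR marks positions where the signs agree; popcount turns the
// agree/disagree tally into the integer dot product: 2*agreements - n.
int binary_dot(uint64_t x, uint64_t w, int n) {
    uint64_t mask = (n == 64) ? ~0ULL : ((1ULL << n) - 1);
    uint64_t agree = ~(x ^ w) & mask;   // XNOR, limited to the n valid bits
    return 2 * __builtin_popcountll(agree) - n;
}
```

For example, with `n = 4`, `x = 0b1011` and `w = 0b1001` the signs agree at three positions, giving a dot product of 2.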


3 Algorithm Implementation

3.1 Binary Convolution

In this section, we first describe the generic implementation of the XNOR convolution. Then, the CPU and GPU implementations and their differences over the same pipeline are described. Binary convolution has the following steps:

  1. XNOR convolution bit operations:

    1. Conversion of the input data type to binary type.

    2. XNOR bitwise logical operation on the binary data with the binary weights.

    3. Summation of the output binary bits, where 0 values are counted as -1.

    4. Conversion from binary back to the float data type.

  2. XNOR convolution scaling factor computation:

    1. Channel-wise summation of the input data.

    2. Multiplication of the matrix with the scalar value.

  3. Multiplication of the float output of the XNOR convolution with the $K$ and $\alpha$ values.
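The steps above can be sketched for one channel and one 3x3 filter as follows. This is a plain reference version with one int per pixel; the actual implementation packs pixels into registers, and the function name is ours:

```cpp
#include <vector>

// Reference XNOR convolution: binarize input and weights, take the XNOR
// "dot product" per 3x3 window (0-bits count as -1), and return the raw
// integer map, to be scaled by K and alpha afterwards.
std::vector<int> xnor_conv2d(const std::vector<float>& img, int H, int W,
                             const std::vector<float>& w /* 3x3 filter */) {
    auto sgn = [](float v) { return v >= 0.0f ? 1 : -1; };  // binarization
    std::vector<int> out;
    for (int i = 0; i + 3 <= H; ++i)
        for (int j = 0; j + 3 <= W; ++j) {
            int acc = 0;
            for (int di = 0; di < 3; ++di)
                for (int dj = 0; dj < 3; ++dj) {
                    int a = sgn(img[(i + di) * W + (j + dj)]);
                    int b = sgn(w[di * 3 + dj]);
                    acc += (a == b) ? 1 : -1;  // XNOR: agreement -> +1
                }
            out.push_back(acc);
        }
    return out;
}
```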

3.1.1 Converting Integer to Binary

The XNOR convolution operation is a bit-wise logical operation. The input and image tensors are stored inside registers to fully utilize the given processor. The pseudo-code is provided in Algorithm 1.


1:  for each register-sized tile of the input image do
2:     R <- 0
3:     for each pixel p in the tile do
4:        if I[p] >= 0 then
5:           set the bit of R corresponding to p
6:        end if
7:     end for
8:     store R in the binary image
9:  end for
Algorithm 1 Input Image to Binary Image

3.1.2 Binary Convolution

After converting the input to binary image, XNOR convolution is applied on the binary image. A simple iterative XNOR operation is enough to obtain XNOR convolution outputs. The pseudo-code is given in Algorithm 2.

1:  for each output register do
2:     for each kernel weight position do
3:        R <- XNOR(shifted input bits, weight bits)
4:        accumulate bitcount(R AND valid-bit mask)
5:     end for
6:  end for
Algorithm 2 XNOR Convolution

The theoretical speed-up that can be achieved for this part is 58x [7]. However, networks used in computer vision employ larger kernel sizes to obtain a larger receptive field, hence convolutions with larger kernels are needed in practice. In this work, we use a kernel size of 3x3, which results in a more modest speed-up as it necessitates an iterative approach. In [7], convolution kernel weights fill every bit inside a register, which involves copying the same sign value into each bit of the register. When larger convolution kernels are used, XNOR convolution cannot be applied to every bit-pixel value, since bits at the edge of the registers would require padding. Hence, in our implementation, the weight register contains only one meaningful weight value and the other bits are masked by a bitwise AND operation.
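The masking described above might look as follows in plain C++. This is a sketch under our own assumptions about the bit layout; `valid_mask` marks the bit positions that a full window can cover:

```cpp
#include <cstdint>

// Apply one binary weight to a register of packed sign bits: broadcast the
// weight's sign to all bits, XNOR with the pixels, then AND-mask away the
// edge positions that a full kernel window cannot cover. Layout is ours.
uint64_t apply_weight(uint64_t pixels, int weight_bit, uint64_t valid_mask) {
    uint64_t w = weight_bit ? ~0ULL : 0ULL;  // copy the sign into every bit
    return ~(pixels ^ w) & valid_mask;        // XNOR, then mask the edges
}
```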

3.1.3 Binary Image to Integer Image

Output of the XNOR convolution is still in binary image format and each convolution result is stored inside a single register. To convert the convolution result into integer, the total number of 1-bits in the register needs to be counted.

In our implementation, in order to count the 1-bits inside the registers, we use the x86_64 population count instruction (via the compiler built-in) on CPU and the special function provided by CUDA (__popc) on GPU.

3.1.4 Multiplication by Scaling Factor

By averaging the input across channels, $A$ is obtained. $A$ is convolved with the box filter $k$ to get the scaling factor matrix $K$, which is multiplied with the output.

We implemented multi-threaded versions of both vanilla convolution and XNOR convolution on CPU as baseline methods to compare against the parallelized versions on GPU. In this section, we first describe the CPU implementation. This is followed by the description of GPU implementation.

3.2 Binary Convolution on CPU

The CPU implementation has the following steps:

  1. Apply zero padding to the tensor (3D).

  2. Convert the tensor and weights to binary type.

  3. Apply the bit-wise XNOR operation on the binary tensor.

  4. Convert the binary tensor to an integer tensor.

  5. Repeat steps 2, 3, 4 for all input channels and sum the results across input channels.

  6. Repeat steps 2, 3, 4, 5 for all output channels (filters).

  7. Sum the input tensor across channels to find the scaling factor.

  8. Multiply the output of step 5 with $\alpha$ and the scaling factor matrix $K$.

3.2.1 Converting Integer to Binary

The input and image tensors are stored inside registers with the unsigned long data type to fully utilize 64-bit CPU registers and benefit from 64-bit operations. Each 64-bit register can hold 64 data elements of a matrix. Hence, the input image is divided into 64-pixel tiles, each of which is then stored in a single register in binary form. The pseudo-code is provided in Algorithm 1.
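Packing one tile into a register (Algorithm 1) can be sketched as follows; the row-major 8x8 tile layout is our assumption:

```cpp
#include <cstdint>

// Pack a 64-pixel tile of floats into one 64-bit register: one sign bit
// per pixel, bit = 1 for values >= 0. Row-major bit order assumed here.
uint64_t pack_tile(const float tile[64]) {
    uint64_t reg = 0;
    for (int i = 0; i < 64; ++i)
        if (tile[i] >= 0.0f)
            reg |= 1ULL << i;   // set the bit for a non-negative pixel
    return reg;
}
```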

3.2.2 Binary Convolution

In this part, CPU registers are used since there is repeated iterative access to the same register. The pseudo-code is given in Algorithm 2.

3.2.3 Binary Image to Integer Image

Each convolution result is stored inside a single 64-bit register. To convert the convolution result into an integer value, the total number of 1s in the register needs to be counted. This can be done using the built-in function __builtin_popcountll of the GCC compiler, which performs this operation more efficiently than hash mapping.
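The conversion can be sketched as below (the helper name is ours; `n` is the number of valid result bits, and a 0-bit represents -1):

```cpp
#include <cstdint>

// Restore the integer convolution value from a register holding n result
// bits: value = (#ones) - (#zeros) = 2*popcount - n.
int bits_to_int(uint64_t r, int n) {
    int ones = __builtin_popcountll(r);  // GCC built-in; CUDA uses __popc
    return 2 * ones - n;
}
```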

3.2.4 Multiplication by Scaling Factor

This part is done as described in 3.1.4.

3.3 Binary Convolution on GPU

In XNOR convolution, the XNOR operation and the scaling factor calculation are independent, hence they can be run asynchronously in two different CUDA streams. The XNOR convolution result is then obtained by multiplying the outputs of these two streams. The GPU implementation has the following steps running in two different streams:

Stream 1:

  1. Apply zero padding to the tensor (3D).

  2. Convert the tensor and weights to binary type.

  3. Apply the bit-wise XNOR operation on the binary tensor.

  4. Convert the binary tensor to an integer tensor.

Stream 2:

  1. Sum the input tensor across channels to find the scaling factor.

  2. Multiply $\alpha$ with the scaling factor matrix.

3.3.1 Input Scaling Factor

For calculating the scaling factor of the input, the average across channels is taken and convolved with the 2D filter $k$ as in Eq. (10). This involves two steps: averaging across channels and convolution. For the summation, each thread calculates the sum of one pixel across channels.

Computing the input scaling factor matrix $K$ includes the following steps after copying the data from host to device:

  1. Set the grid and block sizes.

  2. Average pixels across input channels.

  3. Execute the CUDA kernel that computes the convolution with the filter $k$.

  4. Deallocate the GPU memory.
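A host-side reference for these steps might look as follows; the array layout and names are ours, and the GPU version would assign one thread per output pixel rather than loop:

```cpp
#include <cmath>
#include <vector>

// Scaling-factor matrix K = A * k: A is the channel-wise mean of |I| and
// k is a kxk box filter with entries 1/(k*k). Valid (unpadded) convolution.
std::vector<float> scale_matrix(const std::vector<float>& in,
                                int C, int H, int W, int k) {
    std::vector<float> A(H * W, 0.0f), K;
    for (int c = 0; c < C; ++c)
        for (int p = 0; p < H * W; ++p)
            A[p] += std::fabs(in[c * H * W + p]) / C;  // mean over channels
    for (int i = 0; i + k <= H; ++i)
        for (int j = 0; j + k <= W; ++j) {
            float s = 0.0f;
            for (int di = 0; di < k; ++di)
                for (int dj = 0; dj < k; ++dj)
                    s += A[(i + di) * W + (j + dj)];
            K.push_back(s / (k * k));                  // box-filter weight
        }
    return K;
}
```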

3.3.2 Binary Convolution Operation

The main idea of the algorithm is similar to the CPU version. However, while CPU registers are 64-bit, GPU registers are 32-bit, and each register now holds a 32-pixel image tile rather than a 64-pixel tile. Each CUDA thread converts one such image tile (stored in a single register) to binary in parallel. The total number of threads to launch can be calculated as in Eq. 12, where $I_w$, $I_h$, $R_w$, $R_h$, $K_w$ and $K_h$ represent the input image width, image height, register tile width, register tile height, kernel width and kernel height, respectively:

$$N_{threads} = \left\lceil \frac{I_w}{R_w - K_w + 1} \right\rceil \times \left\lceil \frac{I_h}{R_h - K_h + 1} \right\rceil \tag{12}$$

Neighbouring tiles overlap by $K_w - 1$ and $K_h - 1$ pixels so that every output pixel can be computed within a single register.
After converting the image tiles to binary, XNOR convolution is applied on all input channels. Denoting the number of input and output channels by $c_{in}$ and $c_{out}$ respectively, a total of $c_{out}$ different convolutions (with different weights) are calculated for each input channel. For this purpose, two options were explored: (i) using a single kernel to calculate all binary convolutions on all inputs, (ii) using $c_{out}$ kernels, each calculating the binary convolution for one output channel. The first approach results in better register utilization, as it keeps data in register space for the whole process without any need to copy results to global memory. However, it prohibits parallelization over output channels: for each input channel, $c_{out}$ convolution operations need to be calculated, and these are executed by the same kernel thread iterating a loop $c_{out}$ times. Therefore, the second approach is preferred. In that case, since the binary image produced by the integer-to-binary conversion is stored in global memory, multiple streams can access it; as a result, the streams can run asynchronously, resulting in better parallelization.
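The chosen design, one kernel per output channel all reading the shared binarized input, can be mimicked on the host with one thread per filter. This is our illustrative analog, not the paper's CUDA code:

```cpp
#include <cstdint>
#include <thread>
#include <vector>

// One worker per output channel, mirroring option (ii): every worker reads
// the shared binarized input (global memory in the CUDA version) and writes
// a disjoint output slot, so no synchronization between workers is needed.
void conv_all_filters(const std::vector<uint64_t>& bin_input,
                      const std::vector<uint64_t>& filters,
                      std::vector<int>& out) {
    out.assign(filters.size(), 0);
    std::vector<std::thread> workers;
    for (size_t f = 0; f < filters.size(); ++f)
        workers.emplace_back([&, f] {
            int acc = 0;
            for (uint64_t x : bin_input)  // XNOR + popcount accumulation
                acc += __builtin_popcountll(~(x ^ filters[f]));
            out[f] = acc;                  // disjoint write per worker
        });
    for (auto& t : workers) t.join();
}
```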

3.3.3 Multiplication by Scaling Factor

This is a straightforward element-wise multiplication of the XNOR convolution output with $K\alpha$, computed by one CUDA thread per output element.

4 Experimental Evaluation

Input Size    Vanilla Conv.    XNOR Conv.    Speed-up
              0.062            0.024         2.58x
              0.186            0.069         2.70x
              0.671            0.252         2.66x
              2.641            0.986         2.68x
TABLE I: Comparison of vanilla convolution with XNOR convolution on GPU (ms).

Input Size    CPU        GPU       Speed-up
              3.437      0.061     56.34x
              10.623     0.186     57.11x
              35.811     0.671     53.37x
              132.714    2.641     50.25x
TABLE II: Comparison of CPU and GPU performance for vanilla convolution (ms).

Input Size    CPU        GPU        Speed-up
              0.743      0.0237     31.35x
              2.531      0.0692     36.58x
              10.088     0.2519     40.05x
              42.011     0.9859     42.61x
TABLE III: Comparison of CPU and GPU performance for XNOR convolution (ms).

We have run the experiments on a system with an Intel i7-7700 CPU with 4 cores and an Nvidia GTX 1080 Ti GPU. The time measurements take only the computation into account, so memory operations such as allocation, copying and deallocation are excluded. OpenMP has been used for the multi-core CPU implementation, and CUDA for the GPU implementation. The sub-parts explained above are implemented for single-channel input matrices. We observed that the performance was insensitive to block size, so the experiments have been conducted with a constant block size of 256. A constant 3x3 kernel size has been used throughout the experiments for both the CPU and GPU versions. All experiments have been repeated 100 times and average run-times have been calculated.

As shown in Table I, GPU XNOR convolution provides a speed-up of 2.58x to 2.70x over GPU vanilla convolution. The speed-up remains fairly constant across different image sizes.

When the CPU and GPU implementations are compared, vanilla convolution is observed to have a speed-up of 50.25x to 57.11x on GPU (Table II). XNOR convolution has a speed-up of 31.35x to 42.61x (Table III), and its speed-up increases with increasing input size due to better utilization of the GPU.

5 Discussion

While the GPU XNOR convolution implementation performs better than its CPU XNOR and GPU vanilla counterparts, the speed-ups we observed were lower than those reported in [7]. It has to be noted that our design uses a single kernel thread for each logical operation and, as such, cannot reach the full 32x (assuming 32-bit registers) binary logical operation speed; further optimizations could therefore leverage bit-wise parallelism. On the other hand, the use of separate registers allows easier conversion from binary outputs to integers. XNOR convolution needs binarization of the input and multiplication with the scaling factor at the end. Converting integer input image values to binary values, and restoring integer values from the output of the XNOR convolution, are costly operations, as they require sequential write operations to modify each bit inside a register and read them back after convolution. For a deeper network, this process may be optimized by passing the binary outputs directly to the next kernel without integer conversion.

XNOR convolution involves two processes that can run concurrently: computing the scaling matrix and the binary convolution operation. As future work, multiple streams can be used to overlap these operations further.

6 Conclusions

We have implemented and optimized the XNOR convolution operation [7] used in binary convolutional networks on CPU and GPU and comparatively evaluated their performance. The experimental results show that a speed-up of up to 42.61x can be achieved on GPU compared to the multi-threaded CPU implementation.

We implemented the operations required for the whole inference path of the binary network (i.e. scaling factor calculation and multiplication, binary-to-integer and integer-to-binary conversion, and XNOR convolution) and made the code publicly available at https://github.com/metcan/Binary-Convolutional-Neural-Network-Inference-on-GPU. However, it has to be noted that the operations other than the XNOR convolution are not optimized and are intended for testing only. Hence, for a real-life deployment requiring high levels of performance, these parts also need to be optimized.