Gated XNOR Networks: Deep Neural Networks with Ternary Weights and Activations under a Unified Discretization Framework

05/25/2017 · Lei Deng et al. · Tsinghua University; The Regents of the University of California

There is a pressing need for an architecture that subsumes the existing binary and ternary networks under a unified framework achieving both higher performance and less overhead. To this end, two fundamental issues are yet to be addressed. The first is how to implement back propagation when the neuronal activations are discrete. The second is how to remove the full-precision hidden weights in the training phase to break the memory/computation bottlenecks. To address the first issue, we present a multi-step neuronal activation discretization method and a derivative approximation technique that enable implementing the back propagation algorithm on discrete DNNs. For the second issue, we propose a discrete state transition (DST) methodology to constrain the weights in a discrete space without saving the hidden weights. In this way, we build a unified framework that subsumes the binary and ternary networks as its special cases. More particularly, we find that when both the weights and activations become ternary values, the DNNs can be reduced to gated XNOR networks (or sparse binary networks), since only the event of a non-zero weight and a non-zero activation enables the control gate to start the XNOR logic operations of the original binary networks. This promises event-driven hardware design for efficient mobile intelligence. We achieve advanced performance compared with state-of-the-art algorithms. Furthermore, the computational sparsity and the number of states in the discrete space can be flexibly modified to suit various hardware platforms.


1 Introduction

Deep neural networks (DNNs) are rapidly developing with the use of big datasets, powerful models/tricks and GPUs, and have been widely applied in various fields [1]-[8], such as vision, speech, natural language, the game of Go, multimodal tasks, etc. However, their huge hardware overhead is also notorious, including enormous memory/computation resources and high power consumption, which has greatly challenged their applications. As we know, most of the computing overhead in DNNs results from the costly multiplications of real-valued synaptic weights and real-valued neuronal activations, as well as the accumulation operations. Therefore, a few compression methods and binary/ternary networks have emerged in recent years, aiming to put DNNs on efficient devices. The former [9]-[14] reduce the network parameters and connections, but most of them do not change the full-precision multiplications and accumulations. The latter [15]-[20] replace the original computations with only accumulations or even binary logic operations.

In particular, binary weight networks (BWNs) [15]-[17] and ternary weight networks (TWNs) [17][18] constrain the synaptic weights to the binary space {-1, 1} or the ternary space {-1, 0, 1}, respectively. In this way, the multiplication operations can be removed. Binary neural networks (BNNs) [19][20] constrain both the synaptic weights and the neuronal activations to the binary space {-1, 1}, which can directly replace the multiply-accumulate operations with binary logic operations, i.e. XNOR; so this kind of network is also called an XNOR network. Even with these most advanced models, some issues remain unsolved. Firstly, the reported networks are based on specially designed discretization and training methods, and there is a pressing need for an architecture that subsumes these networks under a unified framework while achieving both higher performance and less overhead. To this end, how to implement the back propagation for online training when the activations are constrained in a discrete space is yet to be addressed. On the other side, all these networks have to save the full-precision hidden weights in the training phase, which causes frequent data exchange between the external memory for parameter storage and the internal buffer for forward and backward computation.

In this paper, we propose a discretization framework: (1) A multi-step discretization function that constrains the neuronal activations in a discrete space, and a method to implement the back propagation by introducing an approximated derivative for the non-differentiable activation function; (2) A discrete state transition (DST) methodology with a probabilistic projection operator which constrains the synaptic weights in a discrete space without the storage of full-precision hidden weights in the whole training phase. Under such a discretization framework, a heuristic algorithm is provided at the website

https://github.com/AcrossV/Gated-XNOR, where the state numbers of the weights and activations are reconfigurable to suit various hardware platforms. In the extreme case, both the weights and activations can be constrained in the ternary space to form ternary neural networks (TNNs). For a multiplication operation, when the weight, the activation, or both are zero, the corresponding computation unit rests until a non-zero weight and a non-zero activation enable and wake it up. In other words, the computation trigger determined by the weight and activation acts as a control signal/gate, or an event, to start the computation. Therefore, in contrast to the existing XNOR networks, the TNNs proposed in this paper can be treated as gated XNOR networks (GXNOR-Nets). We test this network model over the MNIST, CIFAR10 and SVHN datasets, and achieve comparable performance with state-of-the-art algorithms. An efficient hardware architecture is designed and compared with conventional ones. Furthermore, the sparsity of the neuronal activations can be flexibly modified to improve the recognition performance and hardware efficiency. In short, the GXNOR-Net promises ultra-efficient hardware for future mobile intelligence based on reduced memory and computation, especially in the event-driven running paradigm.

We define several abbreviated terms that will be used in the following sections: (1) CWS: continuous weight space; (2) DWS: discrete weight space; (3) TWS: ternary weight space; (4) BWS: binary weight space; (5) CAS: continuous activation space; (6) DAS: discrete activation space; (7) TAS: ternary activation space; (8) BAS: binary activation space; (9) DST: discrete state transition.

2 Unified discretization framework with multi-level states of synaptic weights and neuronal activations in DNNs

Suppose that there are M training samples {(x_k, y_k), k = 1, 2, ..., M}, where y_k is the label of the k-th sample x_k. In this work, we propose a general deep architecture to efficiently train DNNs in which both the synaptic weights and neuronal activations are restricted in a discrete space defined as

Z_N = { z | z = n/2^(N-1) - 1, n = 0, 1, ..., 2^N }    (1)

where N is a given non-negative integer, i.e. N ∈ {0, 1, 2, ...}, and ΔZ_N = 2^(1-N) is the distance between adjacent states.

Remark 1. Note that different values of N in Z_N denote different discrete spaces. Specifically, when N = 0, z belongs to the binary space {-1, 1} and ΔZ_N = 2. When N = 1, z belongs to the ternary space {-1, 0, 1} and ΔZ_N = 1. Also, as seen in (1), the states in Z_N are constrained in the interval [-1, 1]; without loss of generality, the range can easily be extended to [-H, H] by multiplying a scaling factor H > 0.
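As a concrete sketch of the space defined in (1), the snippet below enumerates Z_N for several N, assuming the reconstructed form z = n/2^(N-1) - 1 (the function name `discrete_space` is illustrative, not from the paper):

```python
def discrete_space(N):
    """Enumerate the states of Z_N = {n / 2**(N-1) - 1 : n = 0, ..., 2**N}."""
    step = 2.0 ** (1 - N)  # distance between adjacent states (Delta Z_N)
    return [n * step - 1.0 for n in range(2 ** N + 1)]

print(discrete_space(0))  # binary space:  [-1.0, 1.0]
print(discrete_space(1))  # ternary space: [-1.0, 0.0, 1.0]
print(discrete_space(2))  # five states:   [-1.0, -0.5, 0.0, 0.5, 1.0]
```

Note how N = 0 and N = 1 recover the binary and ternary special cases of Remark 1.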

In the following subsections, we first formulate the problem for GXNOR-Nets, i.e. the case where both the weights and activations are constrained in the ternary space {-1, 0, 1}. Then we investigate how to implement back propagation in DNNs with ternary synaptic weights and neuronal activations. Finally, a unified discretization framework extending the weights and activations to multi-level states will be presented.

2-A Problem formulation for GXNOR-Net

By constraining both the synaptic weights and neuronal activations to binary states for the computation in both the forward and backward passes, the costly floating-point multiplications and accumulations become very simple logic operations such as XNOR. However, different from XNOR networks, a GXNOR-Net can be regarded as a sparse binary network due to the existence of the zero state, where the number of zero states reflects the network's sparsity. Only when both the pre-neuronal activation and the synaptic weight are non-zero is the forward computation required, marked in red in Fig. 1. This indicates that most of the computation resources can be switched off to reduce power consumption. The enable signal determined by the corresponding weight and activation acts as a control gate for the computation. Therefore, such a network is called a gated XNOR network (GXNOR-Net). Actually, sparsity is also leveraged by other neural networks, such as in [21][22].

Suppose that there are L layers in a GXNOR-Net, where both the synaptic weights and neuronal activations are restricted in a discrete space, except the 0-th (input) layer and the activations of the L-th layer. As shown in Fig. 1, the last, L-th layer is followed by an L2-SVM output layer with the standard hinge loss, which has been shown to perform better than softmax on several benchmarks [23][24].

Fig. 1: GXNOR-Net. In a GXNOR-Net, the forward computation is required only when both the pre-neuronal activation and the synaptic weight are non-zero, marked in red. This indicates that the GXNOR-Net is a sparse binary network, in which most of the computation units can be switched off, reducing power consumption. The enable signal determined by the corresponding weight and activation acts as a control gate for the computation.

Denote a_j^l as the activation of neuron j in layer l, given by

a_j^l = φ( Σ_i W_ij^l · a_i^(l-1) )    (2)

for l = 1, 2, ..., L-1, where φ(·) denotes an activation function and W_ij^l represents the synaptic weight between neuron i in layer l-1 and neuron j in layer l. For the k-th training sample, a_i^0 represents the i-th element of the input vector x_k, i.e. a_i^0 = x_k(i). For the L-th layer of the GXNOR-Net, connected with the L2-SVM output layer, the neuronal activation is not discretized.

The optimization model of GXNOR-Net is formulated as follows:

minimize E(W, A)  subject to  W_ij^l ∈ {-1, 0, 1},  a_j^l ∈ {-1, 0, 1}    (3)

Here E represents the cost function depending on all synaptic weights (denoted as W) and neuronal activations (denoted as A) in all layers of the GXNOR-Net.

For the convenience of presentation, we denote the discrete space Z_N as the DWS when it describes the synaptic weights, and as the DAS when it describes the neuronal activations. The special ternary spaces for synaptic weights and neuronal activations then become the TWS and TAS, respectively. Both the TWS and the TAS are the ternary space {-1, 0, 1}, i.e. Z_1 defined in (1).

The objective is to minimize the cost function in GXNOR-Nets by constraining all the synaptic weights and neuronal activations in TWS and TAS for both forward and backward passes. In the forward pass, we will first investigate how to discretize the neuronal activations by introducing a quantized activation function. In the backward pass, we will discuss how to implement the back propagation with ternary neuronal activations through approximating the derivative of the non-differentiable activation function. After that, the DST methodology for weight update aiming to solve (3) will be presented.

2-B Ternary neuronal activation discretization in the forward pass

We introduce a quantization function φ(x) to discretize the neuronal activations (for layers l = 1, ..., L-1) by setting

a_j^l = φ( x_j^l )    (4)

where

φ(x) = { 1, if x > r;  0, if -r ≤ x ≤ r;  -1, if x < -r }    (5)
Fig. 2: Ternary discretization of neuronal activations and derivative approximation methods. The quantization function (a) together with its ideal derivative in (b) can be approximated by (c) or (d).

In Fig. 2, it is seen that φ(x) quantizes the neuronal activation to the TAS {-1, 0, 1}, and r is a window parameter which controls the excitability of the neuron and the sparsity of the computation.

2-C Back propagation with ternary neuronal activations through approximating the derivative of the quantized activation function

After the ternary neuronal activation discretization in the forward pass, model (3) has now been simplified to the following optimization model:

minimize E(W)  subject to  W_ij^l ∈ {-1, 0, 1}    (6)

As mentioned in the Introduction, in order to implement the back propagation in the backward pass where the neuronal activations are discrete, we need the derivative of the quantization function φ(x) in (5). However, φ(x) is discontinuous and non-differentiable, as shown in Fig. 2(a) and (b). This makes it difficult to implement the back propagation in the GXNOR-Net. To address this issue, we approximate the derivative of φ(x) with respect to x as follows

∂φ(x)/∂x ≈ { 1/(2a), if | |x| - r | ≤ a;  0, otherwise }    (7)

where a is a small positive parameter controlling the steepness of the approximated derivative in the neighbourhood of x = ±r. In real applications, there are many other ways to approximate the derivative. For example, ∂φ(x)/∂x can also be approximated as

∂φ(x)/∂x ≈ (1/a) · max( 0, 1 - | |x| - r | / a )    (8)

for a small given parameter a. The two approximations above are shown in Fig. 2(c) and (d), respectively. It is seen that as a → 0, both approximations approach the ideal impulse function in Fig. 2(b).
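A minimal NumPy sketch of the ternary quantizer and the rectangular derivative approximation (the function names and the default values r = 0.5 and a = 0.25 are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def quantize_ternary(x, r=0.5):
    """Quantization of eq. (5): map activations to {-1, 0, +1};
    r is the window parameter controlling the zero (resting) region."""
    return np.where(x > r, 1.0, np.where(x < -r, -1.0, 0.0))

def approx_derivative(x, r=0.5, a=0.25):
    """Rectangular approximation of the derivative (Fig. 2(c)): a pulse
    of height 1/(2a) and width 2a around each jump point x = +/- r,
    so each pulse has unit area, matching the ideal impulse."""
    return np.where(np.abs(np.abs(x) - r) <= a, 1.0 / (2.0 * a), 0.0)

x = np.array([-1.2, -0.4, 0.1, 0.55, 2.0])
print(quantize_ternary(x))   # [-1.  0.  0.  1.  1.]
print(approx_derivative(x))  # nonzero only near x = +/- r
```

In the backward pass, the surrogate `approx_derivative` stands in wherever the chain rule requires ∂φ/∂x.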

Note that the real-valued increment of the synaptic weight at the k-th iteration at layer l, denoted as ΔW_ij^l(k), can be obtained based on the gradient information

ΔW_ij^l(k) = -η · ∂E/∂W_ij^l evaluated at (W(k), A(k))    (9)

where η represents the learning rate, and W(k) and A(k) denote the respective synaptic weights and neuronal activations of all layers at the current iteration, and

∂E/∂W_ij^l = δ_j^l · a_i^(l-1)    (10)

where x_j^l is the weighted sum of the neuron j's inputs from layer l-1:

x_j^l = Σ_i W_ij^l · a_i^(l-1)    (11)

and δ_j^l is the error signal of neuron j propagated from layer l+1:

δ_j^l = ( ∂φ(x_j^l)/∂x_j^l ) · Σ_m W_jm^(l+1) · δ_m^(l+1)    (12)

where the derivative ∂φ(x)/∂x is approximated through (7) or (8). As mentioned, the L-th layer is followed by the L2-SVM output layer, and the hinge loss function [23][24] is applied for the training. Then, the error back-propagates from the output layer to the anterior layers, and the gradient information for each layer can be obtained accordingly.


2-D Weight update by discrete state transition in the ternary weight space

Now we investigate how to solve (6) by constraining W in the TWS through an iterative training process. Let W_ij^l(k) be the weight state at the k-th iteration step, and ΔW_ij^l(k) be the weight increment on it, derived from the gradient information in (9). To guarantee that the next weight does not jump out of the interval [-1, 1], we define a clipped increment ΔW' to establish a boundary restriction on ΔW:

ΔW'_ij^l(k) = min( max( ΔW_ij^l(k), -1 - W_ij^l(k) ), 1 - W_ij^l(k) )    (13)

and decompose the above ΔW' as:

ΔW'_ij^l(k) = κ_ij · ΔZ_N + ν_ij    (14)

such that

κ_ij = fix( ΔW'_ij^l(k) / ΔZ_N )    (15)

and

ν_ij = rem( ΔW'_ij^l(k), ΔZ_N )    (16)

where fix(·) is a round operation towards zero, and rem(·) generates the remainder of the division between two numbers and keeps the same sign as ΔW'.

Then, we obtain a projected weight increment P(ΔW') and update the weight by

W_ij^l(k+1) = W_ij^l(k) + P( ΔW'_ij^l(k) )    (17)

Now we discuss how to project ΔW' in the CWS so that the next state stays in the TWS, i.e. W_ij^l(k+1) ∈ {-1, 0, 1}. We denote P(·) as a probabilistic projection function given by

P(ΔW') = { κ·ΔZ_N + sign(ν)·ΔZ_N, with probability τ(ν);  κ·ΔZ_N, with probability 1 - τ(ν) }    (18)

where the sign function is given by

sign(x) = { +1, if x ≥ 0;  -1, otherwise }    (19)

and τ(ν), with 0 ≤ τ(ν) ≤ 1, is a state transition probability function defined by

τ(ν) = tanh( m · |ν| / ΔZ_N )    (20)

where m is a nonlinear factor (a positive constant) that adjusts the transition probability in the probabilistic projection.

The above formula (18) implies that P(ΔW') takes a value among (κ-1)·ΔZ_N, κ·ΔZ_N and (κ+1)·ΔZ_N. For example, when ν > 0, P(ΔW') = (κ+1)·ΔZ_N happens with probability τ(ν) and P(ΔW') = κ·ΔZ_N happens with probability 1 - τ(ν). Basically, P(·) describes the transition operation among the discrete states in Z_N defined in (1): the weight moves κ states deterministically and at most one further state probabilistically, so that both W_ij^l(k) and W_ij^l(k+1) belong to Z_N.

Fig. 3: Illustration of DST in TWS. In DST, the weight can directly transit from current discrete state (marked as red circle) to the next discrete state when updating the weight, without the storage of the full-precision hidden weight. With different current weight states, as well as the direction and magnitude of weight increment , there are totally six transition cases when the discrete space is the TWS.

Fig. 3 illustrates the transition process in the TWS. For example, at the current weight state 0, if ν > 0, then the weight has probability τ(ν) of transferring to 1 and probability 1 - τ(ν) of staying at 0; while if ν < 0, it has probability τ(ν) of transferring to -1 and probability 1 - τ(ν) of staying at 0. At the boundary state 1, if ΔW ≥ 0, then the boundary restriction (13) gives κ = 0 and ν = 0, which means the weight stays at 1 with probability 1; if ΔW' < 0 and |ΔW'| < ΔZ_N, then κ = 0 and ν < 0, so the weight has probability τ(ν) of transferring to 0 and probability 1 - τ(ν) of staying at 1; if ΔW' < 0 and |ΔW'| ≥ ΔZ_N, then κ = -1 and ν ≤ 0, so the weight has probability τ(ν) of transferring to -1 and probability 1 - τ(ν) of transferring to 0. Similar analysis holds for the other boundary state -1.
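The whole DST update for a single ternary weight can be sketched as follows. This is a hedged reading of (13)-(20): the clipping form of the boundary restriction and the tanh transition probability are assumptions, and the helper name `dst_update` is ours:

```python
import math
import random

def dst_update(w, dw, dz=1.0, m=3.0, rng=random.random):
    """One DST step in the TWS: w is the current discrete state in
    {-1, 0, +1}; dw is the real-valued increment from eq. (9)."""
    # Boundary restriction (13): keep the next state inside [-1, 1].
    dw = min(max(dw, -1.0 - w), 1.0 - w)
    # Decomposition (14)-(16): dw = kappa * dz + nu, kappa rounded towards zero.
    kappa = math.trunc(dw / dz)
    nu = dw - kappa * dz
    # Probabilistic projection (18)-(20): one extra state with probability tau.
    tau = math.tanh(m * abs(nu) / dz)
    extra = (1 if nu >= 0 else -1) if rng() < tau else 0
    return w + (kappa + extra) * dz

# Deterministic draws for illustration: always jump / never jump.
print(dst_update(0.0, 0.9, rng=lambda: 0.0))  # 1.0
print(dst_update(0.0, 0.9, rng=lambda: 1.0))  # 0.0
```

Note that no full-precision hidden weight is kept anywhere: the function consumes a discrete state and emits a discrete state.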

Based on the above results, we can now solve the optimization model (3) via the DST methodology. The main idea is to update the synaptic weight based on (17) in the ternary space by exploiting the projected gradient information. The main difference between DST and the ideas in recent works such as BWNs [15]-[17], TWNs [17][18], and BNNs or XNOR networks [19][20] is illustrated in Fig. 4. In those works, frequent switches and data exchanges between the CWS and the BWS or TWS are required during the training phase. The full-precision weights have to be saved at each iteration, and the gradient computation is based on the binary/ternary version of the stored full-precision weights, termed the "binarization" or "ternary discretization" step. In stark contrast, the weights in DST are always constrained in a DWS. A probabilistic gradient projection operator is introduced in (18) to directly transform a continuous weight increment into a discrete state transition.

Fig. 4: Illustration of the discretization of synaptic weights. (a) shows that existing schemes frequently switch between two spaces, i.e. the CWS and the BWS/TWS, at each iteration. (b) shows that the weights under our DST are always constrained in a DWS during the whole training phase.

Remark 2. In the inference phase, since both the synaptic weights and neuronal activations are in the ternary space, only logic operations are required. In the training phase, the removal of the full-precision hidden weights drastically reduces the memory cost. The logic forward pass and the additive backward pass (with only a few multiplications at each neuron node) also simplify the training computation to some extent. In addition, the number of zero states, i.e. the sparsity, can be controlled by adjusting the window parameter r in (5), which further makes our framework efficient in real applications through the event-driven paradigm.
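The logic-only inference of Remark 2 can be illustrated with a tiny dot-product sketch: a multiply of ternary values reduces to an XNOR of sign bits, and any zero operand keeps the unit resting (the function name `gxnor_dot` is hypothetical, for illustration only):

```python
def gxnor_dot(weights, activations):
    """Ternary dot product via gated XNOR. All values are in {-1, 0, +1}."""
    acc = 0
    for w, a in zip(weights, activations):
        if w == 0 or a == 0:
            continue  # gate closed: the computation unit stays resting
        # XNOR of the sign bits: +1 if the signs agree, -1 otherwise.
        acc += 1 if (w > 0) == (a > 0) else -1
    return acc

# Matches the ordinary dot product of the same ternary vectors.
print(gxnor_dot([1, 0, -1, 1], [-1, 1, -1, 0]))  # 0
```

In hardware, the `continue` branch corresponds to the control gate never firing, so the XNOR/bitcount units see only the non-zero weight-activation pairs.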

2-E Unified discretization framework: multi-level states of the synaptic weights and neuronal activations

Actually, the binary and ternary networks are not the whole story, since N in Z_N defined in (1) is not limited to 0 or 1 and can be any non-negative integer. There are many hardware platforms that support a multi-level discrete space for more powerful processing ability [25]-[30].

The neuronal activations can be extended to multi-level cases. To this end, we introduce the following multi-step neuronal activation discretization function

a_j^l = φ_N( x_j^l )    (21)

where φ_N(·) maps its input onto the states z of Z_N:

φ_N(x) = z, if z - ΔZ_N/2 < x ≤ z + ΔZ_N/2    (22)

for each z ∈ Z_N, with the two boundary states covering x ≤ -1 and x > 1, respectively; the state spacing ΔZ_N is defined as in (1). To implement the back propagation algorithm, the derivative of φ_N(x) can be approximated at each discontinuous point, as illustrated in Fig. 5. Thus, both the forward pass and backward pass of DNNs can be implemented.
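Under the nearest-state reading of the multi-step function, a multi-level quantizer can be sketched as below; this sketch ignores any per-step window parameter that Fig. 5 may use, and `quantize_multilevel` is an illustrative name:

```python
import numpy as np

def quantize_multilevel(x, N=2):
    """Map inputs onto the 2**N + 1 states of Z_N in [-1, 1] by
    rounding to the nearest state, spaced dz = 2**(1 - N) apart."""
    dz = 2.0 ** (1 - N)
    return np.clip(np.round(x / dz) * dz, -1.0, 1.0)

x = np.array([-1.5, -0.3, 0.2, 0.76, 1.5])
print(quantize_multilevel(x, N=2))  # states: -1.0, -0.5, 0.0, 1.0, 1.0
```

Setting N = 1 recovers a ternary staircase, which is the nearest-state counterpart of the windowed quantizer in (5).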

Fig. 5: Discretization of neuronal activations with multi-level values and derivative approximation methods. The multiple level of the quantization function (a) together with its ideal derivative in (b) can be approximated by (c) or (d).

At the same time, the proposed DST for weight update can also be implemented in a discrete space with multi-level states. In this case, the decomposition of ΔW' is revisited as

ΔW'_ij^l(k) = κ_ij · ΔZ_N + ν_ij    (23)

such that

κ_ij = fix( ΔW'_ij^l(k) / ΔZ_N )    (24)

and

ν_ij = rem( ΔW'_ij^l(k), ΔZ_N )    (25)

and the probabilistic projection function in (18) can also be revisited as follows, now with the general state spacing ΔZ_N = 2^(1-N):

P(ΔW') = { κ·ΔZ_N + sign(ν)·ΔZ_N, with probability τ(ν);  κ·ΔZ_N, with probability 1 - τ(ν) }    (26)

Fig. 6 illustrates the state transition of synaptic weights in the DWS. In contrast to the transition example of the TWS in Fig. 3, κ can be larger than 1, so transitions across multiple states are allowed.

Fig. 6: Discretization of synaptic weights in DWS with multi-level states.

3 Results

We test the proposed GXNOR-Nets over the MNIST, CIFAR10 and SVHN datasets (the code is available at https://github.com/AcrossV/Gated-XNOR). The results are shown in Table 1. The network structure for MNIST is "32C5-MP2-64C5-MP2-512FC-SVM", and that for CIFAR10 and SVHN is "2(128C3)-MP2-2(256C3)-MP2-2(512C3)-MP2-1024FC-SVM". Here MP, C and FC stand for max pooling, convolution and full connection, respectively. Specifically, 2(128C3) denotes 2 convolution layers with 3x3 kernels and 128 feature maps, MP2 means max pooling with window size 2x2 and stride 2, and 1024FC represents a fully-connected layer with 1024 neurons. Here SVM is a classifier with squared hinge loss (L2-Support Vector Machine) right after the output layer. All the inputs are normalized into the range of [-1, +1]. As for CIFAR10 and SVHN, we adopt an augmentation similar to [24], i.e. 4 pixels are padded on each side of the training images, and a 32x32 crop is further randomly sampled from the padded image or its horizontal flip. In the inference phase, we only test using the single view of the original images. The batch size over MNIST, CIFAR10, SVHN are , and , respectively. Inspired by [19], the learning rate decays at each training epoch by multiplying a constant factor β, where β is the decay factor determined by β = (η_end / η_start)^(1/T). Here η_start and η_end are the initial and final learning rates, respectively, and T is the number of total training epochs. The transition probability factor m in equation (20) satisfies , and the derivative approximation uses the rectangular window in Fig. 2(c) where . The base algorithm for gradient descent is Adam, and the presented performance is the accuracy on the testing set.
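The epoch-wise decay described above can be sketched as follows, assuming the usual reading that the rate is multiplied by the same factor β each epoch, with β chosen so the final epoch lands on the target rate (the symbol names `lr_start`, `lr_end` are ours; the paper's actual values are not reproduced here):

```python
def lr_schedule(lr_start, lr_end, epochs):
    """Learning rates lr_k = lr_start * beta**k for k = 0..epochs, with
    beta = (lr_end / lr_start) ** (1 / epochs), so the schedule ends
    exactly at lr_end (up to floating-point rounding)."""
    beta = (lr_end / lr_start) ** (1.0 / epochs)
    return [lr_start * beta ** k for k in range(epochs + 1)]

lrs = lr_schedule(1e-2, 1e-4, 100)
print(lrs[0], lrs[-1])  # starts at 1e-2, ends near 1e-4
```

Because β < 1 whenever the final rate is below the initial one, the schedule is strictly decreasing.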

3-A Performance comparison

Methods                 | MNIST  | CIFAR10 | SVHN
BNNs [19]               | 98.60% | 89.85%  | 97.20%
TWNs [17]               | 99.35% | 92.56%  | N.A.
BWNs [16]               | 98.82% | 91.73%  | 97.70%
BWNs [17]               | 99.05% | 90.18%  | N.A.
Full-precision NNs [17] | 99.41% | 92.88%  | N.A.
GXNOR-Nets              | 99.32% | 92.50%  | 97.37%
TABLE 1: Comparisons with state-of-the-art algorithms and networks.

The networks for comparison in Table 1 are listed as follows: GXNOR-Nets in this paper (ternary synaptic weights and ternary neuronal activations), BNNs or XNOR networks (binary synaptic weights and binary neuronal activations), TWNs (ternary synaptic weights and full-precision neuronal activations), BWNs (binary synaptic weights and full-precision neuronal activations), and full-precision NNs (full-precision synaptic weights and full-precision neuronal activations). Over MNIST, BWNs [16] use fully-connected networks with 3 hidden layers of 1024 neurons and an L2-SVM output layer, BNNs [19] use fully-connected networks with 3 hidden layers of 4096 neurons and an L2-SVM output layer, while our paper adopts the same structure as BWNs [17]. Over CIFAR10 and SVHN, we remove the last fully-connected layer in BWNs [16] and BNNs [19]. Compared with BWNs [17], we just replace the softmax output layer with an L2-SVM layer. It is seen that the proposed GXNOR-Nets achieve comparable performance with the state-of-the-art algorithms and networks. In fact, the accuracies of 99.32% (MNIST), 92.50% (CIFAR10) and 97.37% (SVHN) outperform most of the existing binary or ternary methods. In GXNOR-Nets, the weights are always constrained in the TWS without saving the full-precision hidden weights required by the networks reported in Table 1, and the neuronal activations are further constrained in the TAS. The results indicate that it is possible to perform well even with this kind of extremely hardware-friendly network architecture. Furthermore, Fig. 7 plots the test error as a function of the training epoch. We can see that the GXNOR-Net achieves comparable final accuracy, but converges more slowly than the full-precision continuous NN.

Fig. 7: Training curve. GXNOR-Net can achieve comparable final accuracy, but converges slower than full-precision continuous NN.

3-B Influence of the nonlinear factor, the pulse width and the sparsity

Fig. 8: Influence of the nonlinear factor m for the probabilistic projection. A properly larger value obviously improves performance, while a too large value helps little further.

We analyze the influence of several parameters in this section. Firstly, we study the nonlinear factor m in equation (20) for the probabilistic projection. The results are shown in Fig. 8, in which a larger m indicates stronger nonlinearity. It is seen that properly increasing m obviously improves the network performance, while a too large m helps little further. A moderate value obtains the best accuracy, which is why we use it for the other experiments.

Fig. 9: Influence of the pulse width for derivative approximation. The pulse width used to approximate the derivative of the non-differentiable discretized activation function affects the network performance; a "not too wide & not too narrow" pulse achieves the best accuracy.

Secondly, we use the rectangular approximation in Fig. 2(c) as an example to explore the impact of the pulse width on the recognition performance, as shown in Fig. 9. Both too large and too small values cause worse performance; in our simulation, an intermediate width achieves the highest testing accuracy. In other words, there exists a best configuration for approximating the derivative of the non-differentiable discretized activation function.

Fig. 10: Influence of the sparsity of neuronal activations. Here the sparsity represents the fraction of zero activations. By properly increasing the zero neuronal activations, i.e. computation sparsity, the recognition performance can be improved. There exists a best sparse space of neuronal activations for a specific network and dataset.

Finally, we investigate the influence of this sparsity on the network performance, and the results are presented in Fig. 10. Here the sparsity represents the fraction of zero activations. By controlling the width of the sparse window (determined by r) in Fig. 2(a), the sparsity of neuronal activations can be flexibly modified. It is observed that the network usually performs better when the state sparsity properly increases, while the performance significantly degrades when the sparsity grows further and collapses as the sparsity approaches 1. This indicates that there exists a best sparse space for a specified network and dataset, probably because a proper increase of zero neuronal activations reduces the network complexity, so that overfitting can largely be avoided, like the dropout technique [31]. However, the valid neuronal information reduces significantly if the network becomes too sparse, which causes the performance degradation. Based on this analysis, it is easy to understand why the GXNOR-Nets in this paper usually perform better than the BWNs, BNNs and TWNs. On the other hand, a sparser network is more hardware friendly, which means that it is possible to achieve higher accuracy and less hardware overhead at the same time by configuring the computational sparsity.

3-C Event-driven hardware computing architecture

Fig. 11: Comparisons of hardware computing architectures. (a) A neural network example of one neuron Y with three inputs a1, a2, a3 and the corresponding synaptic weights w1, w2, w3. (b) Full-precision neural network (NN) with multipliers and an accumulator. (c) Binary weight network (BWN) with multiplexers and an accumulator. (d) Ternary weight network (TWN) with multiplexers and an accumulator, under event-driven control. (e) Binary neural network (BNN) with XNOR and bitcount operations. (f) GXNOR-Net with XNOR and bitcount operations, under event-driven control.
Networks              | Multiplication | Accumulation | XNOR   | BitCount | Resting Probability
Full-precision NNs    | M              | M            | 0      | 0        | 0.0%
BWNs                  | 0              | M            | 0      | 0        | 0.0%
TWNs                  | 0              | (2/3)M       | 0      | 0        | 33.3%
BNNs or XNOR Networks | 0              | 0            | M      | 1        | 0.0%
GXNOR-Nets            | 0              | 0            | (4/9)M | 0/1      | 55.6%
TABLE 2: Operation overhead comparisons with different computing architectures.
Fig. 12: Implementation of the GXNOR-Net example. By introducing the event-driven paradigm, most of the operations are efficiently kept in the resting state until the valid gate control signals wake them up. The gate control signal is determined by whether both the weight and the activation are non-zero.

For the different networks in Table 1, the hardware computing architectures can be quite different. As illustrated in Fig. 11, we present typical hardware implementation examples for a triple-input-single-output neural network; the corresponding original network is shown in Fig. 11(a). The conventional hardware implementation for the full-precision NN is based on multipliers for the multiplications of activations and weights, and an accumulator for the dendritic integration, as shown in Fig. 11(b). Although a unit for the nonlinear activation function is also required, we ignore it in all cases of Fig. 11 so that we can focus on how different discrete spaces influence the implementation architecture. The recent BWN in Fig. 11(c) replaces the multiply-accumulate operations with simple accumulations, with the help of multiplexers: depending on the binary weight, the neuron accumulates either the input activation or its negative. In contrast, the TWN in Fig. 11(d) implements the accumulation under an event-driven paradigm by adding a zero state into the binary weight space. When the weight is zero, the neuron is regarded as resting; only when the weight is non-zero, also termed an event, is the neuron accumulation activated. In this sense, the weight state acts as a control gate. By constraining both the synaptic weights and neuronal activations in the binary space, the BNN in Fig. 11(e) further simplifies the accumulation operations in the BWN to efficient binary logic XNOR and bitcount operations. Similar to the event control of the TWN, the TNN proposed in this paper further introduces the event-driven paradigm on top of the binary XNOR network. As shown in Fig. 11(f), only when both the weight and the input are non-zero are the XNOR and bitcount operations enabled and started. In other words, whether the weight or the activation equals zero plays the role of closing or opening the control gate, hence the name gated XNOR network (GXNOR-Net).

Table 2 shows the required operations of the typical networks in Fig. 11. Here we assume that the input number of the neuron is M, i.e. M inputs and one neuron output. We can see that the BWN removes the multiplications in the original full-precision NN, and the BNN replaces the arithmetical operations with efficient XNOR logic operations. However, in full-precision NNs, BWNs (binary weight networks) and BNNs/XNOR networks (binary neural networks), most states of the activations and weights are non-zero, so their resting probability is 0.0%. Furthermore, the TWN and the GXNOR-Net introduce the event-driven paradigm. If the states in the ternary space {-1, 0, 1} follow a uniform distribution, the resting probability of the accumulation operations in the TWN reaches 33.3%, and the resting probability of the XNOR and bitcount operations in the GXNOR-Net further reaches 55.6%. Specifically, in TWNs (ternary weight networks), the synaptic weight has three states {-1, 0, 1} while the neuronal activation is full-precision, so the resting computation only occurs when the synaptic weight is 0, with an average probability of 1/3 ≈ 33.3%. As for the GXNOR-Nets, both the neuronal activation and the synaptic weight have three states {-1, 0, 1}, so the resting computation occurs when either the neuronal activation or the synaptic weight is 0; the average probability is 1 - (2/3)^2 = 5/9 ≈ 55.6%. Note that Table 2 is based on the assumption that the states of all the synaptic weights and neuronal activations follow a uniform distribution. Therefore, the resting probability varies across networks and datasets, and the reported values can only be used as rough guidelines.
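The two resting probabilities follow directly from the uniform-state assumption; a quick check:

```python
# Each ternary value in {-1, 0, +1} is zero with probability 1/3
# under the uniform-state assumption of Table 2.
p_zero = 1.0 / 3.0

# TWN: an accumulation rests only when the weight is zero.
p_rest_twn = p_zero

# GXNOR-Net: an XNOR rests when the weight OR the activation is zero.
p_rest_gxnor = 1.0 - (1.0 - p_zero) ** 2  # = 1 - (2/3)^2 = 5/9

print(f"{p_rest_twn:.1%}, {p_rest_gxnor:.1%}")  # 33.3%, 55.6%
```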

Fig. 12 demonstrates an example hardware implementation of the GXNOR-Net from Fig. 1. The original XNOR operations can be reduced to only those triggered by non-zero weight-activation pairs, and the required bit width of the bitcount operations can also be reduced. In other words, in a GXNOR-Net, most operations are kept in the resting state until the valid gate control signals wake them up, as determined by whether both the weight and the activation are non-zero. This sparse property promises the design of ultra-efficient intelligent devices with the help of the event-driven paradigm, like the famous event-driven TrueNorth neuromorphic chip from IBM [25, 26].

3-D Multiple states in the discrete space

Fig. 13: Influence of the state number in the discrete space. The discrete spaces of weights and activations (DWS and DAS) can take multi-level state numbers, each governed by its own state parameter, defined similarly to the parameter in (1). Along both the weight direction and the activation direction there exists a best discrete space, i.e. an optimal state number.

According to Fig. 5 and Fig. 6, the discrete spaces of synaptic weights and neuronal activations can have multi-level states. Similar to the definition in (1), the DWS and the DAS each have their own state parameter; the available state numbers of the weights and activations grow with these parameters, with the smallest settings corresponding to binary or ternary weights and binary or ternary activations. We test the influence of the two state parameters on the MNIST dataset, and Fig. 13 presents the results, where a larger circle denotes higher test accuracy. In the weight direction the network performs best at an intermediate state number, and the same holds in the activation direction. This indicates that there exists a best discrete space in either the weight direction or the activation direction, similar to the conclusions drawn from the influence analyses in Fig. 8 and Fig. 9 and the sparsity analysis in Fig. 10. In this sense, discretization is also an efficient way to avoid network overfitting and thereby improve algorithm performance. The investigation in this section can serve as a guideline for choosing the best discretization for a particular hardware platform after considering its computation and memory resources.
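As a minimal sketch, assuming the discrete space in (1) consists of 2^N + 1 uniformly spaced states in [-1, 1] (a common convention; the exact definition is given by the paper's equation (1)), the state sets for different parameters can be enumerated as follows:

```python
def discrete_space(N):
    """Uniformly spaced discrete states in [-1, 1]: a sketch assuming
    the space has 2^N + 1 levels with spacing 2^(1-N)."""
    return [n / 2 ** (N - 1) - 1 for n in range(2 ** N + 1)]

print(discrete_space(1))  # [-1.0, 0.0, 1.0]  (ternary, as in GXNOR-Net)
print(discrete_space(2))  # [-1.0, -0.5, 0.0, 0.5, 1.0]  (5 states)
```

Increasing N enlarges the state set, trading computation sparsity and memory cost for finer-grained weights and activations.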

4 Conclusion and Discussion

This work provides a unified discretization framework for both synaptic weights and neuronal activations in DNNs, in which the derivative of the multi-step activation function is approximated and the storage of full-precision hidden weights is avoided by using a probabilistic projection operator that directly realizes DST. On this basis, the complete back propagation learning process can be conveniently implemented when both the weights and activations are discrete. In contrast to existing binary or ternary methods, our model can flexibly modify the state number of weights and activations to suit various hardware platforms, rather than being limited to the special cases of binary or ternary values. We test our model in the case of ternary weights and activations (GXNOR-Nets) on the MNIST, CIFAR10 and SVHN datasets, and achieve performance comparable with state-of-the-art algorithms. In effect, the non-zero states of the weight and activation act as a control signal that either enables the computation unit or keeps it resting. GXNOR-Nets can therefore be regarded as a kind of “sparse binary network” whose sparsity can be controlled by adjusting a pre-given parameter. Moreover, this “gated control” behaviour promises efficient hardware implementations based on the event-driven paradigm, which we have compared against several typical neural networks and their hardware computing architectures. The computation sparsity and the number of states in the discrete space can be properly increased to further improve the recognition performance of GXNOR-Nets.

We have also tested the two curves in Fig. 2 for derivative approximation. We find that the pulse shape (rectangular or triangular) affects the accuracy less than the pulse width (or steepness) shown in Fig. 9. We therefore recommend the rectangular curve in Fig. 2(c): it is simpler than the triangular curve in Fig. 2(d), which makes the approximation more hardware-friendly.
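To make the rectangular approximation concrete, here is a minimal sketch (the pulse width, height, and threshold positions are illustrative assumptions, not the paper's exact values): the zero-almost-everywhere derivative of the multi-step activation is replaced by a rectangular pulse of unit area centred on each step threshold.

```python
def approx_derivative_rect(x, width=0.5, thresholds=(-0.5, 0.0, 0.5)):
    """Rectangular approximation of the multi-step activation derivative:
    each step threshold contributes a pulse of height 1/width (unit area)
    over the interval [t - width/2, t + width/2]; zero elsewhere."""
    return sum(1.0 / width for t in thresholds
               if abs(x - t) <= width / 2)

print(approx_derivative_rect(0.0))  # 2.0 (inside the pulse at t = 0)
print(approx_derivative_rect(2.0))  # 0 (far from every threshold)
```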

Through the above analysis, we see that GXNOR-Net dramatically simplifies computation in the inference phase and reduces memory cost in both the training and inference phases. Regarding the training computation, however, although the method removes the multiplications and additions in the forward pass and most multiplications in the backward pass, it causes slower convergence and adds probabilistic sampling overhead. On powerful GPU platforms with abundant computation resources, the savings from the removed multiplications may cover the overhead of these two issues; on embedded platforms (e.g. FPGA/ASIC), an elaborate architecture design is required.
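The probabilistic sampling overhead mentioned above comes from projecting weight updates onto the discrete grid. The following sketch shows one plausible form of such a stochastic transition (the rule here is an illustrative assumption, not the paper's exact DST projection operator):

```python
import random

def probabilistic_project(w, delta, step=1.0, lo=-1.0, hi=1.0):
    """Illustrative discrete state transition: weight w sits on a grid of
    spacing `step`; the update `delta` moves it by whole grid steps, and
    its fractional remainder triggers one extra hop probabilistically."""
    direction = step if delta > 0 else -step
    hops, frac = divmod(abs(delta), step)  # whole steps + remainder
    new_w = w + direction * hops
    if random.random() < frac / step:      # fractional part -> stochastic hop
        new_w += direction
    return max(lo, min(hi, new_w))         # clip to the bounded state space
```

Each weight update therefore needs one random sample, which is the sampling cost referred to above; in exchange, no full-precision hidden weight has to be stored.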

Although GXNOR-Nets promise event-driven and efficient hardware implementation, the quantitative advantages are not huge when based only on current digital technology, because generating the control gate signals also incurs extra overhead. Still, power consumption can be reduced to a certain extent thanks to fewer state flips in the digital circuits, and this can be further optimized by increasing the computation sparsity. Even more promising, some emerging nanodevices exhibit similar event-driven behaviour, such as gate-controlled memristive devices [32, 33]. With these devices, multi-level multiply-accumulate operations can be implemented directly, and the computation is controlled by an event signal injected into the third terminal of a control gate. These characteristics naturally match our model: the multi-level weights and activations map onto the device states by modifying the number of states in the discrete space, and the event-driven paradigm provides flexible computation sparsity.

Acknowledgment. The work was partially supported by National Natural Science Foundation of China (Grant No. 61475080, 61603209), Beijing Natural Science Foundation (4164086), and Independent Research Plan of Tsinghua University (20151080467).

References

  • [1] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1-9.
  • [2] S. Yu, S. Jia, C. Xu, Convolutional neural networks for hyperspectral image classification, Neurocomputing 219 (2017) 88-98.

  • [3] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, B. Kingsbury, Deep neural networks for acoustic modeling in speech recognition, IEEE Signal Proc. Mag. 29 (2012) 82-97.
  • [4] Z. Huang, S. M. Siniscalchi, C. H. Lee, A unified approach to transfer learning of deep neural networks with applications to speaker adaptation in automatic speech recognition, Neurocomputing 218 (2016) 448-459.

  • [5] J. Devlin, R. Zbib, Z. Huang, T. Lamar, R. Schwartz, J. Makhoul, Fast and robust neural network joint models for statistical machine translation, Proc. Annual Meeting of the Association for Computational Linguistics (ACL), 2014, pp. 1370-1380.
  • [6] F. Richardson, D. Reynolds, N. Dehak, Deep neural network approaches to speaker and language recognition, IEEE Signal Proc. Let. 22 (2015) 1671-1675.
  • [7] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel D. Hassabis, Mastering the game of go with deep neural networks and tree search, Nature 529 (2016) 484-489.
  • [8] A. Karpathy, A. Joulin, F. F. F. Li, Deep fragment embeddings for bidirectional image sentence mapping, Advances in Neural Information Processing Systems (NIPS), 2014, pp. 1889-1897.
  • [9] S. Han, H. Mao, W. J. Dally, Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding, arXiv preprint arXiv:1510.00149 (2015).
  • [10] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, K. Keutzer, Squeezenet: Alexnet-level accuracy with 50x fewer parameters and ¡1MB model size, arXiv preprint arXiv:1602.07360 (2016).
  • [11] S. Han, J. Pool, J. Tran, W. J. Dally, Learning both weights and connections for efficient neural network, Advances in Neural Information Processing Systems (NIPS), 2015, pp. 1135-1143.
  • [12] S. Venkataramani, A. Ranjan, K. Roy, A. Raghunathan, AxNN: energy-efficient neuromorphic systems using approximate computing, Proc. International Symposium on Low Power Electronics and Design, ACM, 2014, pp. 27-32.
  • [13] J. Zhu, Z. Qian, C. Y. Tsui, LRADNN: High-throughput and energy-efficient Deep Neural Network accelerator using Low Rank Approximation, IEEE Asia and South Pacific Design Automation Conference (ASP-DAC), 2016, pp. 581-586.
  • [14] X. Pan, L. Li, H. Yang, Z. Liu, J. Yang, L. Zhao, Y. Fan, Accurate segmentation of nuclei in pathological images via sparse reconstruction and deep convolutional networks, Neurocomputing 229 (2017) 88-99.
  • [15] Z. Lin, M. Courbariaux, R. Memisevic, Y. Bengio, Neural networks with few multiplications, arXiv preprint arXiv:1510.03009 (2015).
  • [16] M. Courbariaux, Y. Bengio, J. P. David, Binaryconnect: Training deep neural networks with binary weights during propagations, Advances in Neural Information Processing Systems (NIPS), 2015, pp. 3105-3113.
  • [17] F. Li, B. Zhang, B. Liu, Ternary weight networks. arXiv preprint arXiv:1605.04711, 2016.
  • [18] C. Zhu, S. Han, H. Mao, W. J. Dally, Trained ternary quantization. arXiv preprint arXiv:1612.01064, 2016.
  • [19] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, Y. Bengio, Binarized neural networks: Training deep neural networks with weights and activations constrained to+ 1 or-1, arXiv preprint arXiv:1602.02830 (2016).
  • [20] M. Rastegari, V. Ordonez, J. Redmon, A. Farhadi, XNOR-Net: ImageNet classification using binary convolutional neural networks, European Conference on Computer Vision (ECCV), 2016, pp. 525-542.

  • [21] A. Knoblauch, G. Palm, F. T. Sommer, Memory capacities for synaptic and structural plasticity, Neural Computation 22 (2010) 289-341.
  • [22] A. Knoblauch, Efficient associative computation with discrete synapses, Neural Computation 28 (2016) 118-186.

  • [23] Y. Tang. Deep learning using linear support vector machines. arXiv preprint arXiv:1306.0239 (2013).
  • [24] C. Y. Lee, S. Xie, P. W. Gallagher, Z. Zhang, Z. Tu, Deeply-supervised nets, International Conference on Artificial Intelligence and Statistics (AISTATS), 2015, pp. 562-570.

  • [25] P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy, J. Sawada, F. Akopyan, B. L. Jackson, N. Imam, C. Guo, Y. Nakamura, B. Brezzo, I. Vo, S. K. Esser, R. Appuswamy, B. Taba, A. Amir, M. D. Flickner, W. P. Risk, R. Manohar, D. S. Modha, A million spiking-neuron integrated circuit with a scalable communication network and interface, Science 345 (2014) 668-673.
  • [26] S. K. Esser, P. A. Merolla, J. V. Arthur, A. S. Cassidy, R. Appuswamy, A. Andreopoulos, D. J. Berg, J. L. McKinstry, Ti. Melano, D. R. Barch, C. di Nolfo, P. Datta, A. Amir, B. Taba, M. D. Flickner, D. S. Modha, Convolutional networks for fast, energy-efficient neuromorphic computing, Proceedings of the National Academy of Science of the United States of America (PNAS) 113 (2016) 11441-11446.
  • [27] B. V. Benjamin, P. Gao, E. McQuinn, S. Choudhary, A. R. Chandrasekaran, J. M. Bussat, R. Alvarez-Icaza, J. V. Arthur, P. A. Merolla, K. Boahen, Neurogrid: a mixed-analog-digital multichip system for large-scale neural simulations. Proceedings of the IEEE 102 (2014) 699-716.
  • [28] S. B. Furber, F. Galluppi, S. Temple, L. A. Plana, The SpiNNaker project. Proceedings of the IEEE 102 (2014) 652-665.
  • [29] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, O. Temam, Diannao: a small-footprint high-throughput accelerator for ubiquitous machine-learning, International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2014, pp. 269-284.

  • [30] M. Prezioso, F. Merrikh-Bayat, B. D. Hoskins, G. C. Adam, K. K. Likharev, D. B. Strukov, Training and operation of an integrated neuromorphic network based on metal-oxide memristors, Nature 521 (2015) 61-64.
  • [31] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (2014) 1929-1958.
  • [32] Y. van de Burgt, E. Lubberman, E. J. Fuller, S. T. Keene, G. C. Faria, S. Agarwal, M. J. Marinella, A. Alec Talin, A. Salleo, A non-volatile organic electrochemical device as a low-voltage artificial synapse for neuromorphic computing, Nature Materials 16 (2017) 414-419.
  • [33] V. K. Sangwan, D. Jariwala, I. S. Kim, K. S. Chen, T. J. Marks, L. J. Lauhon, M. C. Hersam, Gate-tunable memristive phenomena mediated by grain boundaries in single-layer MoS2, Nature Nanotechnology 10 (2015) 403-406.