Learning Digital Circuits: A Journey Through Weight Invariant Self-Pruning Neural Networks

08/30/2019, by Amey Agrawal et al., Qubole, Inc.

Recently, in the paper "Weight Agnostic Neural Networks", Gaier & Ha utilized architecture search to find networks where the topology completely encodes the knowledge. However, architecture search in topology space is expensive. We use the existing framework of binarized networks to find performant topologies by constraining the weights to be either zero or one. We show that such topologies achieve performance similar to standard networks while pruning more than 99% of the weights. We further demonstrate that these topologies can perform tasks using constant weights without any explicit tuning. Finally, we discover that in our setup each neuron acts like a NOR gate, virtually learning a digital circuit. We demonstrate the efficacy of our approach on computer vision datasets.

Code repository: learning-digital-net (learn weight agnostic networks which mimic digital circuits).

1 Introduction

Frankle & Carbin [2] recently introduced the hypothesis that dense feed-forward networks contain a smaller sub-network, “The Winning Ticket” which when trained in isolation reaches test accuracy comparable to the original network. They attribute the efficacy of this sub-network to an “Initialization Lottery”, the specific set of initial weights that make the training effective. They show that re-initializing the weights of the winning ticket depletes the performance.

Based on further analysis of the Lottery Ticket algorithm, Zhou et al. [8] hypothesize that the lottery ticket algorithm works well only when the pruned weights were already headed to zero by gradient descent. They also demonstrate that only the signs of the initial weights are important, not their magnitude. They formulate an algorithm to identify a "Supermask", which represents a certain topology within the network that can produce reasonable results without any training, using random constant weights. However, the constant weights must have the same signs as the original initialization.

In another recent work, Gaier & Ha [3] discover that sparse network topologies can be learned via architecture search. Unlike Zhou et al. [8], they use a constant weight with the same sign for all connections within the network. They demonstrate that the performance of these network topologies remains largely invariant to the choice of the weight. This naturally raises the question of whether such topologies can be learned with backpropagation.

In order to obtain fast inference on low-powered devices, Courbariaux et al. [1] proposed a framework, "BinaryConnect", which allows learning networks whose weights are constrained to two possible values, -1 and 1. In this paper, we modify this framework to learn network topology by constraining the weights to 0 and 1. We propose an explanation for why these topologies work by drawing a parallel to logic gates, and then introduce constructs that allow us to learn networks where each neuron acts like an OR gate and normalization layers mimic a NOT gate. We further demonstrate that the topologies learned through this framework are weight agnostic and can solve complicated vision tasks.

2 Self-Pruning Networks

In BinaryConnect [1], binary weights are used during forward and backward propagation; however, the updates are performed on real-valued weights (W). The real-valued weights are binarized at the beginning of the forward pass. Having high-precision weights allows gradient descent to make many small updates which eventually change the binarized values. Constraining the weights to the values zero and one allows us to learn sparse network topologies. In this section, we describe the changes which allow the BinaryConnect framework to work with our modified constraints.

2.1 Binarization

Courbariaux et al. [1] obtain binarized weights (W_b) from the real-valued weights using the sign function. As a natural extension, we use a simple step function for binarization:

    W_b = 1 if W ≥ 0.5, and 0 otherwise.    (1)

In all our experiments, we also binarize our inputs as described in section 4. For the sake of simplicity, we do not use biases in our neurons.
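As a concrete sketch (PyTorch assumed; the helper name and the use of the 0.5 threshold from Eq. 1 are illustrative, not taken from the released code), the step-function binarization can be written as:

```python
import torch

def binarize(w: torch.Tensor) -> torch.Tensor:
    # Step function of Eq. 1: real-valued weights in [0, 1] are mapped to {0, 1}.
    # The 0.5 threshold mirrors the role the origin plays for the sign function
    # in BinaryConnect.
    return (w >= 0.5).float()
```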

2.2 Weight Clipping

Weight clipping is a widely used regularization technique; in the BinaryConnect model, the real-valued weights (W) are clipped to the range [-1, 1]. Since in our setting we binarize weights to {0, 1}, we change the clipping range to [0, 1].
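A minimal sketch of the modified clipping (PyTorch assumed; clip_weights is an illustrative helper):

```python
import torch

def clip_weights(model: torch.nn.Module) -> None:
    # BinaryConnect clips its real-valued weights to [-1, 1]; since we binarize
    # to {0, 1}, the clipping range becomes [0, 1].
    with torch.no_grad():
        for p in model.parameters():
            p.clamp_(0.0, 1.0)
```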

2.3 Activation Function and Normalization

Similar to the original BinaryConnect network, we employ Batch Normalization (BN) [4]. However, we use tanh instead of Rectified Linear Units (ReLU) [6] as the activation function. We further discuss these choices in section 3.

2.4 Weight Initialization

Most standard weight initialization techniques use a normal distribution with zero mean. However, due to our binarization scheme shown in Eq. 1, most weights drawn from such a distribution would be binarized to zero. We would instead like to initialize our weights from a bimodal distribution such that, post binarization, we have weights set to both zero and one. Hence, we use a Bernoulli distribution for weight initialization. In our experiments, we find that initialization with success probability (p) anywhere in the range [0.0001, 0.04] performs fairly well.
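A possible initialization sketch (PyTorch assumed; the helper name and the default value of p are illustrative, with p chosen inside the quoted range):

```python
import torch

def bernoulli_init_(w: torch.Tensor, p: float = 0.01) -> torch.Tensor:
    # Draw each real-valued weight as 0 or 1 with success probability p, so that
    # after the step-function binarization of Eq. 1 both values are present.
    with torch.no_grad():
        w.copy_(torch.bernoulli(torch.full_like(w, p)))
    return w
```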

Require: a minibatch of (inputs, targets), previous real-valued parameters W (weights), and learning rate η.
Ensure: updated parameters W.
  1. Forward propagation:
  W_b ← binarize(W)
  For k = 1 to L, compute the activations a_k knowing a_{k-1} and W_b
  2. Backward propagation:
  Initialize the output layer's activations gradient ∂C/∂a_L
  For k = L to 2, compute ∂C/∂a_{k-1} knowing ∂C/∂a_k and W_b
  3. Parameter update:
  Compute ∂C/∂W_b knowing ∂C/∂a_k and a_{k-1}
  W ← clip(W − η ∂C/∂W_b)
Algorithm 1: SGD training with BinaryConnect. C is the cost function for the minibatch, η the learning rate, and the functions binarize(W) and clip(W) specify how to binarize and clip the weights. L is the number of layers.
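Below is a minimal PyTorch-style sketch of one such training step under the {0, 1} constraint; the function and variable names are illustrative rather than taken from the released code, a plain SGD update is assumed, and the restriction of binarization to 2-D weight tensors (leaving Batch Normalization parameters real-valued) is our assumption.

```python
import torch

def train_step(model, x, y, loss_fn, lr=0.01):
    # Connection weights (2-D tensors) are binarized; Batch Normalization
    # parameters are left real-valued and trained as usual (an assumption).
    weights = [p for p in model.parameters() if p.dim() > 1]

    # 1. Forward propagation on the binarized weights W_b = binarize(W).
    real = [w.detach().clone() for w in weights]
    with torch.no_grad():
        for w in weights:
            w.copy_((w >= 0.5).float())

    loss = loss_fn(model(x), y)

    # 2. Backward propagation: gradients are computed w.r.t. W_b.
    model.zero_grad()
    loss.backward()

    # 3. Parameter update: apply the gradients to the real-valued weights,
    #    then clip them back into [0, 1].
    with torch.no_grad():
        for w, w_real in zip(weights, real):
            w.copy_((w_real - lr * w.grad).clamp_(0.0, 1.0))
        for p in model.parameters():           # plain SGD for the remaining parameters
            if p.dim() <= 1 and p.grad is not None:
                p.add_(p.grad, alpha=-lr)
    return loss.item()
```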

3 Learning Digital Circuits

We obtain near state-of-the-art results on MNIST using the framework described in section 2. However, we make a peculiar observation: if we remove the Batch Normalization layers from our architecture, the network performs only marginally better than a random baseline. In the rest of this section, we deconstruct the network to understand why Batch Normalization is critical to its performance and propose alternative components which allow us to learn networks that virtually act like digital circuits.

3.1 Network without Batch Normalization

As a direct consequence of constraining all the weights to zero and one, each neuron acts as a simple summation of the inputs it receives from the previous layer. Since we have also binarized the inputs, the pre-activation output of each neuron in the first layer is a non-negative integer. When we apply the activation to these values, the output is semi-binary, that is, either 0 or close to 1 (tanh(0) = 0, while tanh(n) saturates towards 1 for positive integers n), acting like a logical OR gate. Because of the semi-binary nature of this layer's output, we can extend the same reasoning to all subsequent layers in the network. Hence, performing gradient descent to learn the topology of this network is similar to finding a circuit, built only with OR gates, which solves the task. The OR gate is not a universal gate and is thus limited in its expressibility, which could be why our network does not work without Batch Normalization.
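A small numerical illustration of this argument, assuming tanh as the activation (section 2.3); the tensors and their values are only an example:

```python
import torch

x = torch.tensor([[0., 0., 1., 1.],   # binarized inputs for two example cases
                  [0., 0., 0., 0.]])
w = torch.tensor([0., 1., 1., 0.])    # {0, 1} weights select a subset of inputs

pre = x @ w                           # count of selected inputs that are 1
out = torch.tanh(pre)                 # tensor([0.7616, 0.0000])
# The output is non-zero exactly when at least one selected input fires,
# i.e. the neuron behaves like a (soft) OR over the connected inputs.
```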

3.2 Leveraging Batch Level Statistics

Let us now consider a batch of inputs received by the Batch Normalization layer in our network. If the mean of the inputs is close to 1, the high inputs (1s) are normalized to low signals and vice versa. In this scenario, the Batch Normalization layer acts as a NOT gate. On the other hand, if the mean of the input batch is close to 0, the layer preserves the polarity of the signals. Depending on the batch statistics, each unit thus represents either a logical OR or a NOR gate. Being a universal gate, the NOR gate should provide the expressibility required to solve complicated problems.

3.3 Substituting Batch Normalization

In order to evaluate the impact of the logical-NOT-like behaviour of the Batch Normalization layer in isolation from its other properties, we replace the normalization layer with a simple negation operation:

    HardNegation(x) = 1 - x    (2)

As reported in section 4, the network performs reasonably well when trained with the HardNegation layer. We also define a soft negation function as described below:

    SoftNegation(x) = (1 - g) · x + g · (1 - x)    (3)

Here, the invert gate (g) is a scalar which we learn with backpropagation. The values of g are clipped within the range [0, 1]. In our experiments, SoftNegation works significantly better than HardNegation. We also make the interesting observation that the learned values of g are either 0 or 1, with no values in between.
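The two layers can be written as small modules. The sketch below (PyTorch assumed) follows Eqs. 2 and 3 as reconstructed above; the class names and the exact parameterization of the invert gate are illustrative.

```python
import torch
from torch import nn

class HardNegation(nn.Module):
    def forward(self, x):
        # Eq. 2: flip semi-binary activations, mimicking a logical NOT.
        return 1.0 - x

class SoftNegation(nn.Module):
    def __init__(self):
        super().__init__()
        # Invert gate g: a single scalar learned by backpropagation and
        # clipped to [0, 1] along with the weights after each update.
        self.g = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        # Eq. 3: interpolate between identity (g = 0) and negation (g = 1).
        return (1.0 - self.g) * x + self.g * (1.0 - x)
```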

4 Experiments

Figure 1: Variation in test accuracy with weight on the MNIST dataset.

In this section, we show the performance of our framework on the MNIST and Fashion-MNIST datasets. We also demonstrate that the performance of the learned network topologies is largely invariant to the weight.


Method                                | MNIST | Fashion-MNIST
Real-Valued Network                   | 98.1% | 89.5%
Self-Pruning Network (BN)             | 96.7% | 83.2%
Self-Pruning Network (HardNegation)   | 81.5% | 52.9%
Self-Pruning Network (SoftNegation)   | 86.0% | 53.3%
WANN (Tuned Weight) [3]               | 91.9% | -
Supermask (Signed Constant) [8]       | 86.3% | -
Table 1: Test accuracy of networks trained on MNIST and Fashion-MNIST.

4.1 MNIST

MNIST [5] is a standard image classification benchmark dataset with 28 x 28 gray-scale images of handwritten digits. We binarize the images using Eq. 1 and reshape them into a 784-dimensional vector. Our model consists of three dense hidden layers with 2048 units each and tanh activation. We use the negative log-likelihood loss. As shown in Table 1, we obtain near state-of-the-art performance with Batch Normalization. Experiments show that the SoftNegation operation performs significantly better than HardNegation by allowing combinations of OR and NOR layers. Table 2 shows that more than 99% of the connections are pruned (set to 0).

4.2 Fashion-MNIST

Fashion-MNIST was proposed by Xiao et al. [7] as a drop-in replacement for the MNIST dataset with more complexity. It contains 28 x 28 gray-scale images of apparel with labels from 10 classes. We find that the binarized network with Batch Normalization achieves accuracy very close to an identical network with real-valued weights. However, we see a significant drop in performance with the negation layers.

4.3 Weight Invariance

Intriguingly, we observe that in our learned networks the pre-activation outputs of a large number of neurons are greater than one. These outputs saturate to values close to one due to the activation. Hence, the performance of the network should not be greatly affected if we re-scale the outputs of neurons by changing the weights. To verify this hypothesis, we test the performance of the previously learned topologies with weights in the range [0, 4]. Additionally, we do not freeze the Batch Normalization parameters, to facilitate adaptation to the new weights. Figures 1 and 2 show that the performance of these topologies is indeed largely invariant to the weights.
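This sweep can be sketched as follows (PyTorch assumed; rescale_binary_weights is an illustrative helper, and model stands for a topology trained as in the sketch of section 4.1):

```python
import torch

def rescale_binary_weights(model: torch.nn.Module, w: float) -> None:
    # Keep the learned topology (which connections binarize to 1) but replace
    # every surviving connection with the constant w.
    with torch.no_grad():
        for p in model.parameters():
            if p.dim() > 1:                 # connection weights only, not BatchNorm
                p.copy_((p >= 0.5).float() * w)

# Sweep constant weights in [0, 4]; Batch Normalization parameters and running
# statistics stay unfrozen so they can adapt to the rescaled activations.
for w in torch.linspace(0.0, 4.0, steps=9):
    rescale_binary_weights(model, float(w))
    # ...recalibrate Batch Normalization on training batches, then measure test accuracy
```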

Figure 2: Variation in test accuracy with weight on the Fashion-MNIST dataset.

Figure 3: Histograms showing the output of the HardNegation layer throughout training on the MNIST dataset. The number of training batches is on the z-axis.

Figure 4: Histograms depicting the output of the SoftNegation layer throughout training on the MNIST dataset. The z-axis represents the number of training batches. We can observe that the values of the activations saturate towards zero and one as training progresses and the invert gate (g) is learned.
Method              | MNIST                          | Fashion-MNIST
Batch Normalization | 99.36%, 99.49%, 99.33%, 95.56% | 99.08%, 99.49%, 99.86%, 97.16%
HardNegation        | 99.18%, 99.59%, 99.87%, 99.30% | 99.24%, 99.63%, 99.74%, 93.36%
SoftNegation        | 98.97%, 99.73%, 99.89%, 99.83% | 98.81%, 99.69%, 99.86%, 99.78%
Table 2: Percentage of pruned weights, listed per layer.

5 Conclusion

In this paper, we propose a method to augment the BinaryConnect [1] framework to learn networks with weights constrained to zero and one, thus enabling the network to prune weights directly with gradient descent. We show that the topologies learned with our framework achieve performance comparable to their real-valued counterparts. We also demonstrate that these networks are weight agnostic in nature. We observe that Batch Normalization is critical to the functioning of our framework. In order to understand this phenomenon, we deconstruct our architecture to uncover the role batch-level statistics play in the functioning of the network. We then propose negation layers that allow us to learn networks in which each neuron virtually acts as a NOR gate.

References

  • [1] M. Courbariaux, Y. Bengio, and J. David (2015) Binaryconnect: training deep neural networks with binary weights during propagations. In Advances in neural information processing systems, pp. 3123–3131. Cited by: §1, §2.1, §2, §5.
  • [2] J. Frankle and M. Carbin (2018) The lottery ticket hypothesis: finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635. Cited by: §1.
  • [3] A. Gaier and D. Ha (2019) Weight agnostic neural networks. arXiv preprint arXiv:1906.04358. Cited by: §1, Table 1.
  • [4] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §2.3.
  • [5] Y. LeCun (1998) The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/. Cited by: §4.1.
  • [6] V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814. Cited by: §2.3.
  • [7] H. Xiao, K. Rasul, and R. Vollgraf (2017) Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747. Cited by: §4.2.
  • [8] H. Zhou, J. Lan, R. Liu, and J. Yosinski (2019) Deconstructing lottery tickets: zeros, signs, and the supermask. arXiv preprint arXiv:1905.01067. Cited by: §1, §1, Table 1.