AXNet: ApproXimate computing using an end-to-end trainable neural network

07/27/2018, by Zhenghao Peng, et al.

Neural network based approximate computing is a universal architecture promising to gain tremendous energy-efficiency for many error-resilient applications. To guarantee the approximation quality, existing works deploy two neural networks (NNs), e.g., an approximator and a predictor. The approximator provides the approximate results, while the predictor predicts whether the input data is safe to approximate with the given quality requirement. However, it is non-trivial and time-consuming to make these two neural networks coordinate, because they have different optimization objectives, by training them separately. This paper proposes a novel neural network structure, AXNet, which fuses the two NNs into a holistic, end-to-end trainable NN. Leveraging the philosophy of multi-task learning, AXNet can tremendously improve the invocation (proportion of safe-to-approximate samples) and reduce the approximation error. The training effort also decreases significantly. Experimental results show 50.7% more invocation and substantial cuts of training time when compared to existing neural network based approximate computing frameworks.


I Introduction

The conflict between the increasing demand for computing and the sluggish growth of hardware capability has triggered the rapid development of approximate computing, which has achieved massive success in both industry and the research community. Many applications that do not require fully accurate computation can achieve tremendous acceleration and drastic reduction of energy consumption by leveraging approximate computing, especially in domains that call for real-time calculation, fast response and low power consumption, such as learning [24], image processing [17] and scientific computation [21]. Approximate computing can be conducted at different levels, such as the hardware [5], system and software levels. Various approximate computing architectures [24, 17, 16] have been advocated.

Neural network (NN) based approximate computing focuses on acceleration at the software level and has many advantages compared to previous methods. First, neural networks have been proven able to approximate any continuous function [11], so this method can be universally adopted across tasks. Second, the enormous parallelism in neural networks is exploited by rapidly advancing neural network accelerators. An appropriate NN can be easily deployed in the cloud [4] and on the edge [9], thereby achieving high speedup.

However, a single neural network is not safe to serve as an accelerator due to the lack of approximation quality control. Various metrics can represent the approximation quality, e.g., the mean-square error or the absolute error between the approximate and true values. Constraining these metrics allows the framework to control the approximation quality. Distinct quality-control mechanisms, such as statistical and linear models [16], Bayesian networks [18], and neural networks [15], are proactively used to predict whether the approximator can safely approximate the output for given input data. Unsafe input data are sent to the CPU for exact computation. Alternatively, a predictor can monitor the output a posteriori and determine the quality of approximation at run-time [12]. A predicted error exceeding the error bound incurs a rollback of execution [19]. This architecture can dynamically adjust the approximator at run-time but requires more computation. Previous work reports that the neural network based predictor outperforms the others in terms of prediction accuracy [16].

New challenges emerge when both the approximate accelerator and the predictor employ a neural network [15], denoted for simplicity as the approximator and the predictor, respectively. Mahajan et al. [16] first train the best approximator and then, separately, the best predictor. Ignoring the interaction between the two NNs traps the approximate computing framework in a local optimum. To cope with this issue, Xu et al. [20] propose to iteratively and alternately train the approximator and predictor, with a judicious selection of the training data in each iteration. This method reduces the approximation error but inevitably causes exceptionally long training times. None of these methods finds an efficient cooperation between the two NNs that yields the best speedup and approximation accuracy.

The obstacle to making the two NNs cooperate is that the two NNs in the approximate computing framework, although they share the same training data, have different tasks: prediction and regression. Inspired by multi-task learning [3], this paper presents a novel neural network structure, namely AXNet. Instead of weight sharing (a conventional method), AXNet fuses the approximator and predictor together, so that a simple modification of the conventional back-propagation algorithm can train AXNet efficiently and effectively. We further propose a cost-effective deployment in a typical NPU design. To the best of our knowledge, AXNet is the first neural approximator that adopts end-to-end learning; the proposed network fusion method has not appeared in any previous work in the machine learning domain.

The rest of the paper is organized as follows. Section II introduces the related works and motivation. Section III describes the proposed AXNet structure, the fusion methodology and the training algorithm. Section IV shows a case study on the deployment of AXNet in a typical NPU. Experimental results are visualized and analyzed in Section V. Finally, Section VI concludes this paper.

II Related Works and Motivation

This section first introduces the related works on neural approximate computing frameworks that contain a predictor and an approximator, and then motivates this paper.

Mahajan et al. [15] propose an approximate computing architecture consisting of a neural approximator and a neural predictor (Figure 1). First, the approximator is trained to minimize its approximation error. In the training process, the input data of the target function enters the approximator; the output of the approximator is compared with the exact output value of the target function. The square error between the approximate and exact output values is defined as the loss function. Then, they validate the approximator using the same set of input data and derive a series of approximation results and, consequently, the approximation errors. An input sample is labeled safe-to-approximate if the resulting approximation error is within the user-defined error bound. Then, the predictor is trained using pairs of the input data and the derived label. In this method, the approximator and predictor are each trained once, which we denote as "onepass" training. In this neural approximate computing framework, however, only the "safe-to-approximate" data identified by the predictor invokes the approximator. The effective approximation error only accounts for those "safe-to-approximate" data (input data leading to a significant approximation error never enters the approximator). Thus, solely optimizing the approximator cannot efficiently minimize the approximation error of the whole framework. In the onepass training method, the training process of the approximator and that of the predictor are isolated; there is no feedback from the predictor to the approximator.
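As a concrete illustration, the onepass scheme can be sketched as follows. This is a minimal PyTorch-style sketch under assumed toy network sizes, error metric and error bound, not the exact configuration used in [15]:

import torch, torch.nn as nn

# Hypothetical tiny MLPs standing in for the approximator and predictor of [15].
approximator = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 2))
predictor    = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 2))  # 2-way: safe / unsafe

def onepass_train(x, y, error_bound=0.05, epochs=100):
    # Step 1: train the approximator alone to minimize the square error.
    opt_a = torch.optim.Adam(approximator.parameters())
    for _ in range(epochs):
        loss = ((approximator(x) - y) ** 2).mean()
        opt_a.zero_grad(); loss.backward(); opt_a.step()
    # Step 2: relabel the same inputs as safe/unsafe using the trained approximator.
    with torch.no_grad():
        err = ((approximator(x) - y) ** 2).mean(dim=1)      # per-sample error metric
        label = (err <= error_bound).long()                 # 1 = safe-to-approximate
    # Step 3: train the predictor on (input, label) pairs; no feedback to the approximator.
    opt_p = torch.optim.Adam(predictor.parameters())
    for _ in range(epochs):
        loss = nn.functional.cross_entropy(predictor(x), label)
        opt_p.zero_grad(); loss.backward(); opt_p.step()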

Fig. 1: (a) Existing structure with standalone neural networks. (b) Weight-sharing structure that we try.

To cope with this issue, Xu et al. [20] propose to train the approximator and predictor over multiple iterations. The training process in the first iteration is the same as onepass training. In the next iteration, they train the approximator using a subset of the input data; the chosen input data were labeled safe-to-approximate in the previous iteration. The retrained approximator is then validated again and generates updated labels for the whole training set, which are used to train the predictor again. The above process repeats iteratively. Compared to onepass training, this training method produces more precise approximation results. In effect, the predictor guides the training process of the approximator by selecting the training data. Nevertheless, the interference of the predictor also narrows the generalization capability of the approximator, which consequently handles the diversified data encountered in the field poorly. Their experimental results show data discrimination: two clusters appear in the input data space. In one cluster, the input data leads to much lower approximation error, while data in the other cluster causes much higher approximation error. As a result, the iterative training method is not designed to improve the invocation, and hence the speedup, of the approximate accelerator.
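Under the same illustrative assumptions, and reusing the toy approximator, predictor and onepass_train from the sketch above, the iterative scheme of [20] amounts to repeating a relabel/retrain cycle on the shrinking safe subset:

def iterative_train(x, y, error_bound=0.05, iterations=5, epochs=100):
    onepass_train(x, y, error_bound, epochs)                 # iteration 1 is plain onepass
    for _ in range(iterations - 1):
        with torch.no_grad():
            safe = ((approximator(x) - y) ** 2).mean(dim=1) <= error_bound
        # Retrain the approximator only on data that was safe in the previous iteration.
        opt_a = torch.optim.Adam(approximator.parameters())
        for _ in range(epochs):
            loss = ((approximator(x[safe]) - y[safe]) ** 2).mean()
            opt_a.zero_grad(); loss.backward(); opt_a.step()
        # Revalidate on the whole set and retrain the predictor with the updated labels.
        with torch.no_grad():
            label = (((approximator(x) - y) ** 2).mean(dim=1) <= error_bound).long()
        opt_p = torch.optim.Adam(predictor.parameters())
        for _ in range(epochs):
            loss = nn.functional.cross_entropy(predictor(x), label)
            opt_p.zero_grad(); loss.backward(); opt_p.step()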

In previous works, the approximator and predictor are trained separately to minimize their own loss functions. The difficulty of finding a joint loss function impedes a good trade-off between quality and energy-efficiency. Besides, we must spend significant effort and time searching numerous combinations of two sets of hyper-parameters, such as batch size, learning rate, and number of epochs. It is well known in the machine learning field that end-to-end training can decrease the supervision needed and balance the training of both NNs. A predictor and an approximator, associated with different tasks, form a composite structure; this is a typical multi-task learning scenario. It has been proven that improved generalization error bounds can be achieved because of the shared parameters [3, 1]. All of the above motivates us to design a holistic, end-to-end trainable neural network for approximate computing with a quality guarantee.

III Proposed AXNet structure and its training

Multi-task learning can improve generalization by using the domain information contained in the training signals of related tasks as an inductive bias. It does this by learning tasks in parallel while using a shared representation; what is learned for each task can help other tasks be learned better [2]. Inspired by this, we train the approximator and predictor in parallel, rather than successively and separately, using a shared representation. To find a shared representation, we first try the weight-sharing mechanism, a common approach, but fail. We then succeed by fusing neurons between the approximator and the predictor.

III-A Weight sharing mechanism: A false start

We first try a commonly used form of shared representation. We use multi-layer perceptrons (MLPs) as the neural networks in this paper for clarity. We thereby merge the first hidden layers of the approximator and predictor. The rest of the two NNs remains separate, as shown in Figure 1. The resulting neural network contains a prediction subnet and an approximation subnet, which inherit the predictor and the approximator, respectively.

The training procedure is composed of forward propagation (FP) and backward propagation (BP). In the FP stage, we apply the input data x to the approximation subnet and derive the approximated value ŷ. The approximation error depends on the difference between ŷ and the exact output y, as well as on the error metric function E, e.g., the square error. We derive the label t by comparing the approximation error with the error bound B:

t = 1 if E(ŷ, y) ≤ B, otherwise t = 0.   (1)

The label t tells whether the input data is safe to approximate, and is further used in the cost function (2) to train the prediction subnet:

L = L_A(ŷ, y) + L_P(p, t),   (2)

wherein p is the output of the prediction subnet indicating the classification result, L_A denotes the loss function of the approximation subnet, and L_P refers to the loss function of the prediction subnet, i.e., the cross entropy. In the BP stage, using the stochastic gradient descent algorithm, (two sets of) gradients originating from the two loss functions, L_A and L_P, pass through all hidden layers of the two subnets separately until reaching the layer with shared neurons. The sum of the two gradients is used to update the shared neurons.

Such a neural network is end-to-end trainable but has a highly unstable training process, which always converges to a low invocation. Figure 2 (up) provides a preliminary experiment training such a weight-sharing neural network. We find that the gradient of the prediction subnet (cross entropy) is, in most cases, an order of magnitude larger than that of the approximation subnet (MSE), but their difference varies over time. Consequently, the gradients of the prediction subnet dominate the update of all shared weights. At the beginning of the training procedure, the invocation of the approximation subnet fluctuates significantly, which disturbs the prediction results p and causes drastic changes in L_P. We then observe a significant fluctuation of the shared weights that aggravates the swing of the invocation of the approximation subnet. Such interference between the two subnets always leads to two conflicting gradients when updating the shared weights, which in turn incurs oscillation in the training procedure. We cannot diminish this phenomenon by scaling the gradients because the exact orders of magnitude of the two gradients are unknown.
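For reference, a minimal sketch of this weight-sharing attempt is shown below; the layer sizes and the plain (unweighted) sum of the two losses are our own assumptions for illustration:

import torch, torch.nn as nn

class WeightSharingNet(nn.Module):
    # The first hidden layer is shared; the rest of each subnet is separate.
    def __init__(self, n_in=2, n_shared=8, n_out=2):
        super().__init__()
        self.shared = nn.Linear(n_in, n_shared)
        self.approx_head = nn.Sequential(nn.ReLU(), nn.Linear(n_shared, n_out))
        self.pred_head   = nn.Sequential(nn.ReLU(), nn.Linear(n_shared, 2))

    def forward(self, x):
        h = self.shared(x)
        return self.approx_head(h), self.pred_head(h)

def train_step(net, opt, x, y, error_bound=0.05):
    y_hat, p = net(x)
    loss_a = ((y_hat - y) ** 2).mean()                          # L_A: MSE of the approximation subnet
    with torch.no_grad():                                       # label t from equation (1)
        t = (((y_hat - y) ** 2).mean(dim=1) <= error_bound).long()
    loss_p = nn.functional.cross_entropy(p, t)                  # L_P: cross entropy of the prediction subnet
    loss = loss_a + loss_p                                      # joint cost as in equation (2)
    opt.zero_grad(); loss.backward(); opt.step()                # gradients from both losses sum at net.shared
    return loss_a.item(), loss_p.item()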

Fig. 2: Change of invocation in training weight-sharing network (Up) and AXNet (Down).
Fig. 3: Structure of AXNet. Note that the control vectors c_1, ..., c_l and the prediction p are split from the output layer of the prediction subnet. Do not consider these vectors as different layers.

III-B Structure and training of AXNet

To avoid the above coupling effect between the two subnets, this section describes our proposal, AXNet. The structure of AXNet is shown in Figure 3.

Consider an approximation subnet with an n-dimensional input vector x and l hidden layers. Hidden layer k has m_k neurons and outputs a vector of activation values a_k; ŷ denotes the approximated values. The prediction subnet has an output layer o. We split o into l+1 vectors:

o = [c_1, c_2, ..., c_l, p].   (3)

The first l vectors c_1, ..., c_l, called control vectors, "control" the approximation subnet; c_k has m_k elements. The last vector p is the prediction result and has one value if we apply the sigmoid function in the preceding neurons, or two values when applying the softmax activation. The label t and the cost L are defined identically as in equations (1) and (2). Note that the output layer of the prediction subnet requires m_1 + ... + m_l + 1 (or + 2) neurons, depending on the choice of activation function for p, i.e., softmax or sigmoid.

The essence of AXNet is carrying out the Hadamard product (denoted as "∘") between the activation vector a_k and the corresponding control vector c_k. The resulting vector is passed to the successive layer, acting as its input vector in the approximation subnet. Namely:

a_k = f_k(W_k a_{k-1} + b_k) ∘ c_k,  k = 1, ..., l,  with a_0 = x,   (4)

wherein f_k denotes the activation function of hidden layer k, and W_k and b_k are its weight matrix and bias vector. Consequently, all hidden layers of the approximation subnet interlink with the output layer of the prediction subnet.
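Equations (3) and (4) amount to the following forward pass. This is a NumPy sketch with hypothetical layer containers; the choice of activation for the prediction subnet's output is illustrative:

import numpy as np

def axnet_forward(x, pred_layers, approx_layers):
    # pred_layers / approx_layers: lists of (W, b, f) tuples, one per layer.
    # The prediction subnet runs first and its output layer is split into
    # the control vectors c_1..c_l and the prediction p, as in equation (3).
    h = x
    for W, b, f in pred_layers:
        h = f(W @ h + b)
    sizes = [W.shape[0] for W, _, _ in approx_layers[:-1]]      # m_1, ..., m_l
    controls = np.split(h[:sum(sizes)], np.cumsum(sizes)[:-1])
    p = h[sum(sizes):]                                          # prediction result
    # Approximation subnet with the Hadamard product of equation (4).
    a = x
    for (W, b, f), c in zip(approx_layers[:-1], controls):
        a = f(W @ a + b) * c                                    # a_k = f_k(W_k a_{k-1} + b_k) ∘ c_k
    W, b, f = approx_layers[-1]
    return f(W @ a + b), p                                      # approximated output ŷ and prediction p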

Input : X: The input features of training data
Y: The fitting target of training data
l: Num. of hidden layers in approximation subnet
T: maximum iterations
1 while iter < T do
2       iter = iter + 1;
3       # Forward propagation.
4       Get the control vectors c_1, ..., c_l and the prediction p by feeding the prediction subnet with X.
5       a_0 = X;
6       for i=1,…,l do
7             # Inference of approximation subnet.
8             z_i = W_i a_{i-1} + b_i;
9             a_i = f_i(z_i) ∘ c_i;
10
11       end for
12      Calculate ŷ, the label t (equation 1) and the loss L (equation 2).
13       # Backward propagation.
14       g = ∇_{a_l} L_A (back-propagated through the output layer of the approximation subnet);
15       for k=l,l-1,…,1 do
16            # Update approximation subnet.
17             g = g ∘ c_k ∘ f_k'(z_k);
18             ∇_{W_k} L = g a_{k-1}^T;  ∇_{b_k} L = g;
19             Update W_k and b_k;
20             g = W_k^T g;
21
22       end for
23      Calculate the gradients with respect to c_1, ..., c_l and p, and update all parameters in the prediction subnet.
24
25 end while
Algorithm 1 Training Procedure of AXNet.

The entire network can still be trained in an end-to-end manner with back-propagation. The algorithm is shown in Algorithm 1 (refer to Algorithms 6.3 and 6.4 in [8]). A batch of training samples passes through the prediction subnet to collect the control vectors (line 4). Then the FP of the approximation subnet (lines 5-11) derives the label t for training the prediction subnet. In line 9, we apply the Hadamard product to the activation values. In the BP stage (lines 12-21), the gradients of L_A pass through the approximation subnet, and the gradients of both L_A and L_P are used to update the prediction subnet.
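In a modern autograd framework, Algorithm 1 collapses to an ordinary training step, since the Hadamard product is differentiable and back-propagation routes the gradients of L_A through the control vectors into the prediction subnet automatically. The sketch below is illustrative only; the layer sizes, the two hidden layers, and the simple sum of L_A and L_P are our assumptions:

import torch, torch.nn as nn

class AXNet(nn.Module):
    def __init__(self, n_in=2, hidden=(8, 8), n_out=2):
        super().__init__()
        self.approx = nn.ModuleList([nn.Linear(a, b) for a, b in zip((n_in,) + hidden, hidden)])
        self.out = nn.Linear(hidden[-1], n_out)
        # The prediction subnet emits all control vectors plus a 2-way prediction.
        self.pred = nn.Sequential(nn.Linear(n_in, 8), nn.ReLU(), nn.Linear(8, sum(hidden) + 2))

    def hidden_sizes(self):
        return [layer.out_features for layer in self.approx]

    def forward(self, x):
        o = self.pred(x)
        controls = torch.split(o[:, :-2], self.hidden_sizes(), dim=1)
        p = o[:, -2:]
        a = x
        for layer, c in zip(self.approx, controls):
            a = torch.relu(layer(a)) * c          # equation (4): a_k = f_k(W_k a_{k-1} + b_k) ∘ c_k
        return self.out(a), p

def train_step(net, opt, x, y, error_bound=0.05):
    y_hat, p = net(x)
    loss_a = ((y_hat - y) ** 2).mean()
    t = (((y_hat.detach() - y) ** 2).mean(dim=1) <= error_bound).long()   # label from equation (1)
    loss = loss_a + nn.functional.cross_entropy(p, t)                     # joint cost as in equation (2)
    opt.zero_grad(); loss.backward(); opt.step()

A typical usage would be net = AXNet() and opt = torch.optim.Adam(net.parameters()), followed by repeated calls to train_step on mini-batches.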

AXNet shows excellent training stability and a good convergence rate, as shown in Figure 2 (down). Interestingly, the convergence of the prediction subnet falls behind that of the approximation subnet. We denote this phenomenon as the "saturation effect", and attribute the successful training of AXNet to it. The cause of the saturation effect is the skewness of the labels t provided to train the prediction subnet when the true invocation of the approximation subnet is near 0% or 100%. According to a previous study [13], if the number of training examples that correspond to each class (safe-to-approximate or not) varies significantly between the classes, then it may be harder for the network to learn the rarer classes in some cases. Thus, in the beginning, the prediction subnet fails to catch up with the immature approximation subnet. Different from the weight-sharing method, the failed training of the prediction subnet does not affect the training of the approximation subnet, because the approximation subnet is relatively independent of the prediction subnet. Unfortunately, this property also damages the performance of AXNet when the approximation subnet is invoked for almost 100% of the samples. Under this circumstance, common techniques for tackling imbalanced data can be used [14].
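One such standard remedy is to re-weight the cross-entropy term by the inverse class frequency; the snippet below is a hedged sketch, since the paper does not commit to a particular technique from [14]:

import torch, torch.nn as nn

def weighted_prediction_loss(p_logits, t):
    # Weight each class inversely to its frequency in the current batch so that a
    # near-0% or near-100% invocation does not starve the rarer class.
    counts = torch.bincount(t, minlength=2).float().clamp(min=1)
    weight = counts.sum() / (2 * counts)
    return nn.functional.cross_entropy(p_logits, t, weight=weight)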

Note that previous works [20, 16] train the predictor thoroughly only after the approximator. All these works, including this one, provide evidence for delaying (or spending less effort on) training the predictor at the beginning of the training process, when the approximator is too weak to provide high-quality approximate outputs. Otherwise, the skewed samples (most of them unsafe-to-approximate) will destroy the training of the predictor. The resulting predictor makes inaccurate, if not absurd, predictions, which in turn mislead the training of the approximator. The same phenomenon can be observed when training a Generative Adversarial Network (GAN) [7]. A common trick is to train the generator less frequently than the discriminator.

III-C Analysis and Interpretation

Besides the training stability, we mathematically prove other superior properties of AXNet:

First, AXNet improves the capacity to fit the target function by introducing extra non-linearity through the Hadamard product operations. Without loss of generality, suppose both the prediction subnet and the approximation subnet are MLPs with linear activation functions. By rewriting the vector that passes to a hidden layer of the approximation subnet in a concrete mathematical form, we derive the Hadamard product (which is sent to the next layer in the approximation subnet) in dimension j:

(a_1 ∘ c_1)_j = (Σ_i W^A_{ji} x_i + b^A_j)(Σ_i W^P_{ji} x_i + b^P_j)
             = Σ_i Σ_k W^A_{ji} W^P_{jk} x_i x_k + b^P_j Σ_i W^A_{ji} x_i + b^A_j Σ_i W^P_{ji} x_i + b^A_j b^P_j,   (5)

wherein x is the n-dimensional input vector, W^A and b^A refer to the weight matrix and bias vector of the first hidden layer of the approximation subnet, W^P and b^P refer to those of the corresponding control-vector neurons in the prediction subnet, x_i refers to the i-th value of the vector x, and W_{ji} denotes the element of the weight matrix at row j and column i. This equation shows that combinations of all input features, namely quadratic terms, are passed to the rest of the approximation subnet. The successive hidden layers contain even higher-order terms. The Hadamard product thereby introduces higher-order terms, extra non-linearity, and a more complex representation of the input features. High-order terms of input features have been widely used in previous machine learning practice as a feature engineering technique [23], but by hand-crafted selection instead of automatic generation as in this work.
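A quick symbolic check of equation (5) with a two-dimensional input (the matrices below are arbitrary placeholders, not trained weights) confirms that the product of the two linear layers contains the quadratic terms x1**2, x1*x2 and x2**2:

import sympy as sp

x1, x2 = sp.symbols('x1 x2')
x = sp.Matrix([x1, x2])
WA, bA = sp.Matrix([[1, 2], [3, 4]]), sp.Matrix([1, 1])   # first layer of the approximation subnet
WP, bP = sp.Matrix([[5, 6], [7, 8]]), sp.Matrix([2, 2])   # control-vector part of the prediction subnet

a1 = WA * x + bA          # linear activation assumed, as in the derivation above
c1 = WP * x + bP
hadamard = sp.Matrix([sp.expand(a1[j] * c1[j]) for j in range(2)])
print(hadamard)           # each entry contains x1**2, x1*x2 and x2**2 terms plus linear and constant terms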

(a) Output surface of the Bessel function.
(b) Output surface of AXNet.
(c) Output surface of a single approximator.
(d) Activation value before and after the Hadamard product.
Fig. 4: A case study to illustrate the adaptive adjustment imposed on the approximation subnet by the control vector. Near the corner of the input domain, we can see the control vector suppresses the activation value.

Second, thanks to the control vectors, AXNet can adjust the activation values of the hidden layers of the approximation subnet. Control vectors filter the activation values of the hidden layers in the approximation subnet through the Hadamard product. Fig. 4 demonstrates a case study of this effect. The Bessel function is suitable for visualization as it has a two-dimensional input, drawn on the horizontal axes, and a one-dimensional output, drawn on the vertical axis. These figures show that the existence of the prediction subnet improves the fitting capacity of the approximation subnet. When the input approaches the corner of the input domain, the control values in the vector c suppress the activation values (see Fig. 4(d)). The same effect happens at other neurons in this layer. This is why AXNet (Figure 4(b)) produces a better result than the single approximator (Figure 4(c)) near that corner.

Third, this end-to-end network structure inherently balances the two learning tasks and seeks the global optimum thanks to the joint loss function in equation (2). When we train two isolated NNs, each of them inevitably seeks its own optimal parameters. AXNet, in contrast, forces each subnet to consider the loss of the other during the training procedure. The two subnets thus coordinate to achieve the minimal loss, resulting in the maximal invocation and the minimal approximation error. This coordination is more effective than the one in [20], which works by selecting the training samples.

III-D Subnet fusion with a single control vector

A drawback of the current AXNet design is the increased number of neurons and synaptic weights in the output layer of the prediction subnet, as well as the extra Hadamard products. We need m_1 + ... + m_l extra neurons for the control vectors. The larger the approximation subnet becomes, the larger the cost of AXNet.

To resolve the above issue, we further orchestrate a simpler AXNet by interlinking the prediction subnet with a single hidden layer of the approximation subnet, instead of all the hidden layers. In this case, the prediction subnet only needs a single control vector, which dramatically reduces the storage and computation overhead. The experimental results confirm that the simplified AXNet maintains its performance (Figure 10).
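In the PyTorch sketch of Section III-B, this simplification only changes which layer the control vector multiplies; the variant below assumes the prediction subnet is rebuilt with an output layer of size m_fused + 2 and that the fused layer index is chosen by the designer:

# Variant of the AXNet sketch above: self.pred is assumed to output a single control
# vector of size m_fused plus the 2-way prediction p.
def forward_single_control(self, x, fused_layer=0):
    o = self.pred(x)
    c, p = o[:, :-2], o[:, -2:]
    a = x
    for k, layer in enumerate(self.approx):
        a = torch.relu(layer(a))
        if k == fused_layer:
            a = a * c                  # Hadamard product applied at a single hidden layer only
    return self.out(a), p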

IV Architecture of AXNet Accelerator

Fig. 5: Proposed NPU architecture. (a) The structure of the NPU and the data flow of AXNet. (b) The structure of a PE.

Due to the space limit, this section describes a simple NPU architecture that fits AXNet, imitating the NPU architecture in [6]. As shown in Figure 5, the NPU contains many processing engines (PEs) grouped into tiles, a controller, an on-chip memory, and a bus scheduler. A tile (encircled by the rounded rectangle) is composed of a set of identical PEs, an input buffer and an output buffer, all of which are connected by an internal bus (we omit the internal bus arbiter for clarity). We adopt neuron-level parallelism; thus, each PE computes the output of a single neuron (as in equation (4)) in the prediction subnet or the approximation subnet. The input/output buffers temporarily store the input/output vectors and interface with the on-chip memory. The on-chip memory can interface with the DRAM, the input/output buffers in the tiles, and the CPU through the bus. It stores the weight matrices, the input samples and output results of AXNet, and the intermediate results transferred between two adjacent layers of the neural network. The controller is responsible for sending invoke signals through the control bus (dotted lines) to the PEs or the CPU according to the prediction result p. As data transfers concurrently between the tiles, the CPU and the on-chip memory, a bus scheduler is necessary to avoid bus conflicts.

Figure 5 shows the data flow of executing AXNet. When an input sample arrives in the on-chip memory, the NPU schedules the computation of the prediction subnet (the first three stages) and subsequently of the approximation subnet (the last two stages). In stage 1⃝, the input data is fetched from the on-chip memory to the Input Buffer through the data bus. The weight buffer in each PE fetches the weight vectors from the on-chip memory. Upon receiving both the input and the weight vectors, each PE conducts forward propagation of the prediction subnet and generates the prediction result p and the control vectors. In stage 2⃝, the control vectors and the prediction result p of the prediction subnet are sent to the on-chip memory. Specifically, p is placed at a specified address (the grey region inside the on-chip memory). In stage 3⃝, the controller gets p from the on-chip memory. According to the value of p, the controller invokes either the CPU or the approximation subnet through the control bus (dotted lines). If the approximation subnet is invoked, in stage 4⃝, each PE fetches the input data and control vectors from the buffers and conducts forward propagation of the approximation subnet. In stage 5⃝, the approximation result is sent to the on-chip memory through the data bus.

Note that the proposed NPU can statically allocate the computing/storage resources for the whole AXNet if the derived AXNet for an application is small enough. In this case, the weight vectors can stay in the weight buffer of each PE all the time. Otherwise, the NPU dynamically schedules the computation of AXNet layer by layer, and the input/output buffers of each PE and the on-chip memory temporarily accommodate the intermediate results between adjacent layers.

We modify a general PE to compute the Hadamard product induced by the fusion of the two neural networks. Figure 5 shows the internal structure of such a PE. When the PE loads the input vector into the x reg from the Input Buffer and loads the weight data into the w reg and b reg from the Weight Buffer, the Multiply-Add Unit calculates the dot product of the input vector and the weight vector. The resulting product is stored in the temporary reg. After adding the bias, the result is sent to the Activate Unit, which implements the activation function (i.e., ReLU). Although the NPU performs different computations in the prediction subnet and the approximation subnet, the PE keeps the same structure by leveraging a switch unit after the Activate Unit. At first, the switch unit enables the blue dotted path, which directly pushes the activation result into the output reg. When receiving an invoke signal in stage 3⃝, indicating the computation of the approximation subnet, the switch unit activates the approximation subnet units (in solid orange lines). Inside the PE, the Hadamard product reduces to a standard multiplication between one activation value and the corresponding element of the control vector; therefore, a Multiplier carries out this computation. The product is stored in the o reg and waits to be transferred to the Output Buffer.
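The PE datapath can be modeled behaviorally in a few lines; register and unit names follow Figure 5, while the per-neuron control element passed as an argument is a hypothetical interface for illustration, not the actual RTL:

import numpy as np

def pe_compute(x_reg, w_reg, b_reg, control_elem=None, activate=lambda v: max(v, 0.0)):
    # Multiply-Add Unit: dot product of the input and weight vectors, then add the bias.
    temp_reg = float(np.dot(x_reg, w_reg)) + b_reg
    # Activate Unit (ReLU in this NPU sketch).
    a = activate(temp_reg)
    # Switch unit: in prediction-subnet mode the activation goes straight to the output reg;
    # in approximation-subnet mode the extra Multiplier applies the per-neuron control value.
    o_reg = a if control_elem is None else a * control_elem
    return o_reg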

V Experiments

Fig. 6: Comparisons on the true invocation (solid color bars) and predicted invocation (bars with grey lines).
Fig. 7: Comparisons on the prediction accuracy

V-A Experimental Setup

We compare the proposed AXNet to two typical previous methods, i.e., "onepass" training [15] and "iterative" training [20], using the same optimizer (i.e., Adam), error metrics, error bounds, number of hidden layers, activation function (i.e., ReLU) and loss functions (MSE for approximation and cross entropy for prediction). We choose target functions from AxBench, a widely used benchmark suite for approximate computing, including FFT, Bessel, Blackscholes, jpeg, inversek2j, kmeans and Sobel [22]. AxBench provides a tremendous amount of training and testing data. Note that we choose these benchmarks because, first, they are typical applications covering the predominant domains of approximate computing, and second, these choices follow the path of previous works [16, 20, 10].

AXNet has a similar structure and approximately the same parameter count as the neural networks in previous methods. However, AXNet introduces more parameters in the output layer of the prediction subnet. For a fair comparison, we compare i) the invocation, by shrinking the structure of AXNet to roughly match the parameter count of the neural networks in previous methods, and ii) the parameter count, by permuting AXNet to roughly match the invocation of the other methods. Table I shows the full experimental setup.

Benchmark    | Domain               | Error bound & metric  | Method   | A Topology | P Topology | Param. count
inversek2j   | Robotics             | 0.01, relative error  | AXNet    | 2-6-2      | 2-4-8      | 84
             |                      |                       | AXNet    | 2-5-2      | 2-3-7      | 64
             |                      |                       | previous | 2-8-2      | 2-8-2      | 84
sobel        | Image processing     | 0.01, image diff      | AXNet    | 9-8-1      | 9-3-10     | 159
             |                      |                       | AXNet    | 9-7-1      | 9-3-9      | 144
             |                      |                       | previous | 9-8-1      | 9-8-2      | 187
FFT          | Signal processing    | 0.05, absolute error  | AXNet    | 1-4-3-2    | 1-3-9      | 41
             |                      |                       | previous | 1-4-4-2    | 1-4-2      | 56
bessel       | Scientific computing | 0.05, absolute error  | AXNet    | 2-2-2-1    | 2-4-6      | 57
             |                      |                       | AXNet    | 2-2-2-1    | 2-2-6      | 39
             |                      |                       | previous | 2-4-4-1    | 2-4-2      | 59
jpeg         | Image processing     | 0.001, image diff     | AXNet    | 64-16-64   | 64-12-18   | 3129
             |                      |                       | AXNet    | 64-6-64    | 64-6-8     | 1284
             |                      |                       | previous | 64-16-64   | 64-16-2    | 3216
blackscholes | Financial analysis   | 0.001, relative error | AXNet    | 6-6-1      | 6-4-7      | 112
             |                      |                       | AXNet    | 6-5-1      | 6-3-7      | 90
             |                      |                       | previous | 6-8-1      | 6-8-2      | 138
kmeans       | Machine learning     | 0.01, image diff      | AXNet    | 6-4-4-1    | 6-4-10     | 131
             |                      |                       | AXNet    | 6-3-2-1    | 6-3-7      | 81
             |                      |                       | previous | 6-4-4-1    | 6-8-2      | 127

TABLE I: Experimental setup of all benchmarks.

We used four evaluation metrics defined as follows:

  • True invocation: the proportion of safe-to-approximate samples among all the testing data.

  • Predicted invocation: the proportion of samples that the prediction subnet believes to be safe-to-approximate among all testing data.

  • Prediction accuracy: the proportion of samples that are both safe to approximate and predicted as safe-to-approximate.

  • Approximation error: the mean error of the approximation results for the predicted safe-to-approximate samples, also called the "overall error". Concretely, it is the mean of E(ŷ, y) over all samples labeled safe-to-approximate by the prediction subnet.

True invocation evaluates the capability of the approximation subnet and of the approximators in previous works. Predicted invocation and prediction accuracy measure the performance of the prediction subnet and of the predictor/classifier in previous works. Approximation error assesses the overall performance of the approximate computing framework.
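These four metrics can be computed directly from per-sample errors and predictions, as in the following sketch; err, pred_safe and bound are hypothetical names (a per-sample error array, the prediction subnet's boolean decisions, and the error bound), not identifiers from the paper:

import numpy as np

def evaluate(err, pred_safe, bound):
    truly_safe = err <= bound
    true_invocation = truly_safe.mean()                  # share of safe-to-approximate samples
    predicted_invocation = pred_safe.mean()              # share the predictor routes to the approximator
    # Per the definition above: samples that are both truly safe and predicted safe.
    prediction_accuracy = (truly_safe & pred_safe).mean()
    overall_error = err[pred_safe].mean() if pred_safe.any() else 0.0
    return true_invocation, predicted_invocation, prediction_accuracy, overall_error

# Example with toy values:
# evaluate(np.array([0.01, 0.2, 0.03]), np.array([True, False, True]), bound=0.05)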

To evaluate the energy-efficiency, we first derive the speedup and energy reduction of AXNet and then obtain the improvement of energy-efficiency from them. Due to the space limit, we theoretically estimate the performance of AXNet according to the performance of the NPU in [6], which is valid because the proposed design is nearly the same as the original work in terms of the NPU. The extra overhead of the controller and the slight modification of the PE can be ignored in an NPU design.

V-B Results and analysis

Figure 6 shows the true and predicted invocation across the benchmarks. AXNet achieves higher true invocation than the iterative and onepass methods in all benchmarks, by 30.8% and 50.7% on average, respectively. The greatest improvements over the iterative method occur in the kmeans and jpeg benchmarks. AXNet also surpasses the other methods in terms of predicted invocation. The saturation effect happens in the fft, bessel, and kmeans benchmarks (discussed in Section III-C), resulting in 100% predicted invocation.

Figure 7 shows the prediction accuracy. Although our end-to-end trainable AXNet is not trained as extensively as with iterative training, it shows similar prediction accuracy. In some cases, such as inversek2j, jpeg, and blackscholes, the classifier (the same as the "predictor" in this paper) with iterative training exceeds AXNet's prediction accuracy. This is because the iterative training method selects the training data in favor of the classifier. In practice, AXNet still delivers more acceptable approximations than the previous methods thanks to its higher predicted invocation.

Figure 8 presents the overall approximation error that the user finally observes. For each benchmark, we normalize the overall error to that of the onepass method. In all cases, AXNet achieves an excellent reduction of error compared to the onepass method, but in some cases falls behind the iterative method. Note that the overall error is already under the error bound and has no impact on the quality of approximate computing.

Fig. 8: Comparisons on the overall approximation error
Fig. 9: Comparison of the invocation in jpeg when varying the parameter count. The topology of the approximation and prediction subnets and the parameter count are labeled near the stars.
Fig. 10: (a) The ratio of parameter count and the training time of AXNet compared to iterative training. (b) Comparisons on the energy efficiency. (c) Investigation of the scalability of AXNet by fusing two subnets in three ways.

Figure 9 illustrates the variation of the true and predicted invocation when varying the network topology, i.e., adjusting the number of neurons in the hidden layers. AXNet always achieves better (true and predicted) invocation than the iterative method. When the two methods achieve the same true invocation, the iterative method uses a three-layer MLP with 64-16-64 neurons (namely a 64-dimensional input, 16 neurons in the hidden layer, and a 64-dimensional output; similarly hereafter) for the approximator and an MLP with 64-16-2 neurons for the classifier, for a total of 3216 synaptic parameters. AXNet only requires 64-6-64 for the approximation subnet and 64-4-8 for the prediction subnet, a total of 1284 synaptic parameters. These results imply that the fusion of the approximator and predictor in AXNet eliminates many redundant parameters that contribute little to the model's performance. However, we observe that the iterative method yields a more stable invocation as the neural networks become larger, because the iterative method expends much more training effort.

To validate the above observation on the other benchmark functions, the left-side bars in Figure 10 show the ratio of the parameter count of AXNet to that of the predecessor methods when they achieve similar true invocation, as mentioned in Section V-A. We observe that larger structures achieve greater parameter reduction; jpeg, which requires thousands of parameters, shows a substantial reduction of parameter count (left-side bar). We also normalize the training time of AXNet to that of the iterative method. The reduction of training time is as high as 90% for jpeg and 74% on average across these benchmark functions. The right-side bars in Figure 10 show the training time: AXNet consumes much less training time than the iterative method. In the iterative method, some of the training data is intentionally discarded, and the exact training time is unclear. Compared to iterative training, AXNet achieves 13.8x and 32x speedups in training time for bessel and jpeg, respectively. Statistics suggest that FFT incurs the least training time, and thus its reduction of training time is only 50%.

Figure 10 also depicts the energy-efficiency. AXNet outperforms the two previous works in all benchmark applications. The cost of the proposed NPU is almost identical to that in the onepass method, except for the approximation subnet units in each PE. Thus, the enhancement of the true invocation translates into an improvement of energy-efficiency.

Figure 10 also examines the subnet fusion technique described in Section III-D. We try two additional ways of fusing the subnets: applying the Hadamard product only at the first hidden layer of the approximation subnet, and only at the second hidden layer, respectively. We test the true invocation on two representative benchmark functions that require large approximation subnets, i.e., jpeg and sobel. In jpeg, we use an AXNet with topology 64-8-8-64 as the approximation subnet and 64-12-18 (connect all, 1014 parameters) or 64-12-10 (connect one hidden layer, 910 parameters) as the prediction subnet. Similarly for sobel: 9-6-6-1 as the approximation subnet and 9-8-14 (206 parameters) or 9-8-8 (162 parameters) as the prediction subnet. Figure 10 compares the three ways of applying the Hadamard product: at all hidden layers of the approximation subnet ("all"), at the first hidden layer only ("1st"), and at the second hidden layer only ("2nd"). The results suggest that these three ways of fusion make no evident difference in the performance of AXNet, which validates the effectiveness of subnet fusion with a single control vector.

VI Conclusion

This paper presents AXNet, an end-to-end trainable neural network for approximate computing with quality control. Guided by the multi-task learning principle, AXNet fuses the approximator and predictor through the Hadamard product. Experimental results show its superior invocation and energy-efficiency gain over existing neural approximate computing frameworks. We also provide a theoretical interpretation and experimental validation of AXNet's advantage in approximation error. Finally, AXNet incurs much less training time and a smaller network scale than existing works. In future work, we will study compression techniques for AXNet and further interpret the underlying mechanism that enables it. We will also evaluate the speedup and energy reduction on a real AXNet NPU implementation.

References

  • [1] J. Baxter, “Learning internal representations,” pp. 311–320, 1995.
  • [2] R. Caruana, “Multitask learning,” in Learning to learn.    Springer, 1998, pp. 95–133.
  • [3] R. A. Caruana, “Multitask connectionist learning,” in In Proceedings of the 1993 Connectionist Models Summer School.    Citeseer, 1993.
  • [4] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun et al., “DaDianNao: A machine-learning supercomputer,” in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 2014, pp. 609–622.
  • [5] D. Ernst, N. S. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, D. Blaauw, T. Austin, K. Flautner et al., “Razor: A low-power pipeline based on circuit-level timing speculation,” in Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture.    IEEE Computer Society, 2003, p. 7.
  • [6] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, “Neural acceleration for general-purpose approximate programs,” in Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.    IEEE Computer Society, 2012, pp. 449–460.
  • [7] I. Goodfellow, “NIPS 2016 tutorial: Generative adversarial networks,” arXiv preprint arXiv:1701.00160, 2016.
  • [8] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep learning.    MIT press Cambridge, 2016, vol. 1.
  • [9] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. Horowitz, and W. J. Dally, “EIE: Efficient inference engine on compressed deep neural network,” in International Symposium on Computer Architecture, vol. 44, no. 3, pp. 243–254, 2016.
  • [10] X. He, G. Yan, Y. Han, and X. Li, “Acr: Enabling computation reuse for approximate computing,” in Design Automation Conference, 2016, pp. 643–648.
  • [11] K. Hornik, “Approximation capabilities of multilayer feedforward networks,” Neural networks, vol. 4, no. 2, pp. 251–257, 1991.
  • [12] D. S. Khudia, B. Zamirai, M. Samadi, and S. Mahlke, “Rumba: An online quality management system for approximate computing,” in Computer Architecture (ISCA), 2015 ACM/IEEE 42nd Annual International Symposium on.    IEEE, 2015, pp. 554–566.
  • [13] S. Lawrence, I. Burns, A. D. Back, A. C. Tsoi, and C. L. Giles, “Neural network classification and prior class probabilities,” in Neural Information Processing Systems, 1998, pp. 299–313.
  • [14] R. Longadge and S. Dongre, “Class imbalance problem in data mining review,” arXiv preprint arXiv:1305.1707, 2013.
  • [15] D. Mahajan, A. Yazdanbakhsh, J. Park, B. Thwaites, and H. Esmaeilzadeh, “Prediction-based quality control for approximate accelerators,” in Second Workshop on Approximate Computing Across the System Stack, WACAS, 2015.
  • [16] ——, “Towards statistical guarantees in controlling quality tradeoffs for approximate acceleration,” Acm Sigarch Computer Architecture News, vol. 44, no. 3, pp. 66–77, 2016.
  • [17] M. Samadi et al., “SAGE: Self-tuning approximation for graphics engines,” in International Symposium on Microarchitecture, 2013, pp. 13–24.
  • [18] X. Sui, A. Lenharth, D. S. Fussell, and K. Pingali, “Proactive control of approximate programs,” ACM SIGOPS Operating Systems Review, vol. 50, no. 2, pp. 607–621, 2016.
  • [19] T. Wang, Q. Zhang, N. S. Kim, and Q. Xu, “On effective and efficient quality management for approximate computing,” in Proceedings of the 2016 International Symposium on Low Power Electronics and Design.    ACM, 2016, pp. 156–161.
  • [20] C. Xu, X. Wu, W. Yin, Q. Xu, N. Jing, X. Liang, and L. Jiang, “On quality trade-off control for approximate computing using iterative training,” in Proceedings of the 54th Annual Design Automation Conference 2017.    ACM, 2017, p. 52.
  • [21] X. Xu and H. H. Huang, “Exploring data-level error tolerance in high-performance solid-state drives,” IEEE Transactions on Reliability, vol. 64, no. 1, pp. 15–30, 2015.
  • [22] A. Yazdanbakhsh, D. Mahajan, H. Esmaeilzadeh, and P. Lotfi-Kamran, “AxBench: A multiplatform benchmark suite for approximate computing,” IEEE Design & Test, vol. 34, no. 2, pp. 60–68, 2017.
  • [23] Z. Zabokrtsky, “Feature engineering in machine learning,” Institute of Formal and Applied Linguistics, Charles University in Prague, 2015.
  • [24] Q. Zhang et al., “ApproxANN: An approximate computing framework for artificial neural network,” in Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition. EDA Consortium, 2015, pp. 701–706.