I Introduction
The conflict between the increasing demand for computing and the sluggish growth of hardware capability has triggered the rapid development of approximate computing, which has achieved considerable success in both industry and the research community. Many applications that do not require fully accurate computation can achieve substantial acceleration and a drastic reduction in energy consumption by leveraging approximate computing, especially in domains that call for real-time calculation, fast response and low power consumption, such as learning [24], image processing [17] and scientific computation [21]. Approximate computing can be conducted at different levels of the hierarchy, such as the hardware [5], system and software levels. Various approximate computing architectures [24, 17, 16] have been advocated.
Neural network (NN) based approximate computing focuses on acceleration at the software level and has many advantages over previous methods. First, neural networks have been proved able to fit any continuous function [11], and thus this method can be adopted universally across different tasks. Second, the enormous parallelism in neural networks is exploited by rapidly advancing neural network accelerators. An appropriate NN can be easily deserialized and deployed in the cloud [4] and on the edge [9], and therefore achieves high speedup.
However, a single neural network is not safe to serve as an accelerator because it lacks approximation quality control. Various metrics can represent the approximation quality, e.g., the mean-square error and the absolute error between the approximated value and the true value. Constraining those metrics controls the approximation quality. Distinctive quality control mechanisms, such as statistical and linear models [16], Bayesian networks [18] and neural networks [15], are proactively used to predict whether the approximator can safely approximate the output for the given input data. Unsafe input data are sent to the CPU for exact computation. Conversely, predictors can also monitor the output a posteriori and determine the quality of approximation at runtime [12]. A predicted error exceeding the error bound incurs a rollback of execution [19]. This architecture can dynamically adjust the approximator at runtime but takes more computation effort. Previous work reports that the neural network based predictor outperforms the others in prediction accuracy [16].
New challenges emerge if both the approximate accelerator and the predictor employ a neural network [15]; for simplicity, we denote them as the approximator and the predictor, respectively. Mahajan et al. [16] first train the best approximator and subsequently the best predictor, separately. Ignoring the interaction between those two NNs traps the approximate computing framework in a local optimum. To cope with this issue, Xu et al. [20] propose to train the approximator and predictor iteratively and alternately, judiciously selecting the training data in each iteration. This method reduces the approximation error. However, it inevitably causes exceptionally long training time. None of these methods finds an efficient cooperation of both NNs that produces the best speedup and approximation accuracy.
The obstacle to making the two NNs cooperate is that they have different tasks, prediction and regression, although they share the same training data. Inspired by multitask learning [3], this paper presents a novel neural network structure, namely AXNet. Instead of weight sharing, the conventional method, AXNet fuses the approximator and predictor together, so that a simple modification of the conventional backpropagation algorithm can train AXNet efficiently and effectively. We further propose a cost-effective deployment in a typical NPU design. To the best of our knowledge, AXNet is the first neural approximator that can adopt end-to-end learning; the proposed network fusion method has not been seen in any previous work in the machine learning domain.
The rest of the paper is organized as follows. Section II introduces the related works and motivation. Section III describes the proposed AXNet structure, the fusion methodology and the training algorithm. Section IV shows a case study on the deployment of AXNet in a typical NPU. Experimental results are visualized and analyzed in Section V. Finally, Section VI concludes this paper.
II Related Works and Motivation
This section first introduces the related works on neural approximate computing frameworks containing the predictor and approximator and then motivates this paper.
Mahajan et al. [15] propose an approximate computing architecture consisting of a neural approximator and a neural predictor (Figure 1). First, the approximator is trained to minimize its approximation error. In the training process, the input data of the target function enters the approximator, and the output of the approximator is compared with the exact output value of the target function. The square error between the approximate and exact output values is defined as the loss function. Then, they validate the approximator using the same set of input data and derive a series of approximation results and, consequently, the approximation errors. An input is labeled as safe-to-approximate if the resulting approximation error is within the user-defined error bound. Then, the predictor is trained using pairs of the input data and the derived label. In this method, the approximator and predictor are each trained once, denoted as “one-pass” training. In this neural approximate computing framework, however, only the safe-to-approximate data identified by the predictor can invoke the approximator. The effective approximation error therefore accounts only for those safe-to-approximate data (input data leading to significant approximation error never enters the approximator). Thus, solely optimizing the approximator cannot efficiently minimize the approximation error of the whole framework. In the one-pass training method, the training processes of the approximator and the predictor are isolated: there is no feedback from the predictor to the approximator.
Figure 1: (a) Existing structure with standalone neural networks. (b) Weight-sharing structure that we try.
To cope with this issue, Xu et al. [20] propose to train the approximator and predictor in multiple iterations. The training process in the first iteration is the same as the one-pass training. In the next iteration, they train the approximator using a subset of the input data, namely the inputs that were safe-to-approximate in the last iteration. The retrained approximator is then validated again and generates updated labels for the whole training set, which are used to train the predictor again. This process repeats iteratively. Compared to one-pass training, this training method yields more precise approximate results. In effect, the predictor guides the training of the approximator by selecting the training data. Nevertheless, the interference of the predictor also narrows the generalization capability of the approximator, which is thereby ill-suited to the diverse data encountered in the field. Their experimental results show data discrimination: two clusters appear in the input data space. In one cluster, the input data leads to a much lower approximation error, while in the other cluster it causes a much higher approximation error. As a result, the iterative training method is not designed to improve the invocation, and hence the speedup, of the approximate accelerator.
In previous works, the approximator and predictor are trained separately to minimize their own loss functions. The difficulty of finding a joint loss function prevents a good tradeoff between quality and energy efficiency. Besides, significant effort and time must be spent searching numerous combinations of two sets of hyperparameters, such as batch size, learning rate, and number of epochs. It is well known in the machine learning field that end-to-end training can decrease the supervision needed and balance the training of both NNs. A predictor and an approximator, associated with different tasks, form a composite structure, which is a typical multitask learning scenario. It has been proved that improved generalization error bounds can be achieved because of the shared parameters [3, 1]. All of the above motivates us to design a holistic end-to-end trainable neural network for approximate computing with quality guarantee.
III Proposed AXNet structure and its training
Multitask learning can improve generalization by using the domain information contained in the training signals of related tasks as an inductive bias. It does this by learning tasks in parallel while using a shared representation; what is learned for each task can help other tasks be learned better [2]. Inspired by this, we train the approximator and predictor in parallel, rather than successively and separately, using a shared representation. To find a shared representation, we first try the weight sharing mechanism, a common approach, but fail. Then, we succeed by fusing the neurons between the approximator and predictor.
III-A Weight sharing mechanism: A false start
We first try a commonly used form of shared representation. For clarity, we use the multilayer perceptron (MLP) as the neural network in this paper. We merge the first hidden layers of the approximator and predictor. The rest of the two NNs remain separate, as shown in Figure 1. The resulting neural network contains a prediction subnet and an approximation subnet, which inherit from the predictor and approximator, respectively.
The training procedure is composed of forward propagation (FP) and backward propagation (BP). In the FP stage, we apply the input data $x$ to the approximation subnet and derive the approximated value $\hat{y}$. The approximation error depends on the difference between $\hat{y}$ and the exact output $y$, as well as on the error metric function $e(\cdot)$, e.g., the square error. We derive the label $l$ by comparing $e(\hat{y}, y)$ with the error bound:
$$l = \begin{cases} 1, & e(\hat{y}, y) \le \text{error bound} \\ 0, & \text{otherwise} \end{cases} \tag{1}$$
$l$ tells whether the input data is safe-to-approximate, and is further used in the cost function (2) to train the prediction subnet.
$$J = L_A(\hat{y}, y) + L_P(p, l) \tag{2}$$
wherein $p$ is the output of the prediction subnet indicating the classification result, $L_A$ denotes the loss function of the approximation subnet and $L_P$ refers to the loss function of the prediction subnet, i.e., the cross entropy. In the BP stage, which uses the stochastic gradient descent algorithm, two sets of gradients originating from the two loss functions, $L_A$ and $L_P$, pass through the hidden layers of the two subnets separately until they reach the layer with shared neurons. The sum of the two gradients is used to update the shared neurons.
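The labeling rule of equation (1) and the joint cost of equation (2) can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the error metric is taken to be the mean squared error, the prediction output is assumed to be a single probability, and the two losses are assumed to be combined as an unweighted sum.

```python
import math

def squared_error(y_hat, y):
    """Error metric: mean squared error between approximate and exact outputs."""
    return sum((a - b) ** 2 for a, b in zip(y_hat, y)) / len(y)

def safe_label(y_hat, y, error_bound):
    """Equation (1): 1 if the approximation error is within the bound, else 0."""
    return 1 if squared_error(y_hat, y) <= error_bound else 0

def cross_entropy(p, label, eps=1e-12):
    """Binary cross entropy between a predicted probability p and a 0/1 label."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

def joint_loss(y_hat, y, p, error_bound):
    """Equation (2), assumed here to be an unweighted sum of both losses."""
    label = safe_label(y_hat, y, error_bound)
    return squared_error(y_hat, y) + cross_entropy(p, label)
```

For instance, `safe_label([1.0], [1.05], 0.01)` yields 1 because the squared error 0.0025 is within the bound, whereas `safe_label([1.0], [2.0], 0.01)` yields 0.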
Such a neural network is end-to-end trainable but has a highly unstable training process, which always converges to a low invocation. Figure 3(a) shows a preliminary experiment that trains such a weight-sharing neural network. We find that the gradient of the prediction subnet (cross entropy) is, in most cases, an order of magnitude larger than that of the approximation subnet (MSE), but the gap varies over time. Consequently, the gradients of the prediction subnet dominate the update of all the shared weights. At the beginning of the training procedure, changes in the invocation of the approximation subnet cause significant turbulence in the prediction results, resulting in a drastic change of the prediction loss. We then observe a significant fluctuation of the shared weights, which aggravates the swing of the invocation of the approximation subnet. Such interference between the two subnets always leads to two conflicting gradients before the shared weights are updated, which in turn causes oscillation in the training procedure. We cannot diminish this phenomenon by scaling the gradients, because the exact order of magnitude of the two gradients is unknown.
III-B Structure and training of AXNet
To avoid the above coupling effect between the two subnets, in this section we describe our proposal, AXNet. The structure of AXNet is shown in Figure 3.
Consider an approximation subnet which has an input vector $x$ of size $n$ and $N$ hidden layers. Hidden layer $i$ has $m_i$ neurons and outputs a vector of activation values, $a_i$. $\hat{y}$ denotes the approximated values. The prediction subnet has an output layer $o$. We split $o$ into $N+1$ vectors:
$$o = [c_1, c_2, \ldots, c_N, p] \tag{3}$$
The first $N$ vectors, called control vectors, “control” the approximation subnet. The last vector $p$ is the prediction result; it has one value if we apply the sigmoid function in the preceding neurons, or two values when applying softmax activation. The label and the loss functions are defined identically as in equations (1) and (2). Note that the output layer of the prediction subnet requires $\sum_{i=1}^{N} m_i + 1$ or $\sum_{i=1}^{N} m_i + 2$ neurons, depending on the choice of activation function for $p$, i.e., sigmoid or softmax.
The essence of AXNet is carrying out the Hadamard product (denoted as "$\odot$") between the activation vector $a_i$ and the corresponding control vector $c_i$. The resulting vector is passed to the successive layer as the input vector in the approximation subnet. Namely:
$$a_{i+1} = f_{i+1}\big(W_{i+1}(a_i \odot c_i) + b_{i+1}\big) \tag{4}$$
wherein $f_{i+1}$ denotes the activation function of hidden layer $i+1$, and $W_{i+1}$ and $b_{i+1}$ are its weight matrix and bias vector. The output layer likewise consumes $a_N \odot c_N$. Consequently, all hidden layers of the approximation subnet interlink with the output layer of the prediction subnet.
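The forward pass described above can be sketched as follows. This is an illustrative sketch under several assumptions that the text does not fix: ReLU hidden activations, control values taken directly from the linear output of the prediction subnet, and a single sigmoid prediction value. The names `axnet_forward`, `dense` and `split_output` are hypothetical.

```python
import math

def relu(v):
    return [max(0.0, t) for t in v]

def dense(W, b, v):
    """Fully connected layer: W is a list of rows, one row per output neuron."""
    return [sum(w * t for w, t in zip(row, v)) + bias for row, bias in zip(W, b)]

def split_output(o, hidden_sizes):
    """Split the prediction subnet output into control vectors and a prediction."""
    controls, start = [], 0
    for m in hidden_sizes:
        controls.append(o[start:start + m])
        start += m
    return controls, o[start:]

def axnet_forward(x, approx_layers, pred_layers):
    """Each argument is a list of (W, b) pairs; the last pair is the output layer."""
    # Prediction subnet first: its output supplies the control vectors.
    h = x
    for W, b in pred_layers[:-1]:
        h = relu(dense(W, b, h))
    o = dense(*pred_layers[-1], h)
    hidden_sizes = [len(b) for _, b in approx_layers[:-1]]
    controls, logits = split_output(o, hidden_sizes)
    p = 1.0 / (1.0 + math.exp(-logits[0]))  # sigmoid variant: one prediction value

    # Approximation subnet: each hidden activation is gated by its control vector.
    a = x
    for (W, b), c in zip(approx_layers[:-1], controls):
        a = [ai * ci for ai, ci in zip(relu(dense(W, b, a)), c)]
    y_hat = dense(*approx_layers[-1], a)
    return y_hat, p

# Toy 2-2-1 approximation subnet and 2-2-3 prediction subnet (2 controls + 1 prediction).
approx = [([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]), ([[1.0, 1.0]], [0.0])]
pred = [([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
        ([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]], [0.0, 0.0, 0.0])]
y_hat, p = axnet_forward([1.0, 2.0], approx, pred)
```

In this toy setting the hidden activation [1, 2] is multiplied element-wise by the control vector [1, 2], so the output neuron sums 1 and 4.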
The entire network can still be trained in an end-to-end manner by backpropagation. The procedure is shown in Algorithm 1 (refer to Algorithms 6.3 and 6.4 in [8]). A batch of training samples passes through the prediction subnet to collect the control vectors (line 4). Then the FP of the approximation subnet (lines 5-11) derives the labels for training the prediction subnet. In line 9, we apply the Hadamard product to the activation values. In the BP stage (lines 12-21), the gradients of the approximation loss pass through the approximation subnet, and the gradients of both the approximation loss and the prediction loss are used for updating the prediction subnet.
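The gradient flow described in this paragraph can be checked numerically on a toy network. The sketch below is not Algorithm 1; it is a finite-difference check, on a hypothetical one-neuron model with linear activations, that the approximation loss reaches the prediction subnet only through the weights that produce the control value.

```python
def toy_axnet(x, w_a, w_c, w_p, v):
    """One input, one gated hidden neuron, linear activations (toy assumption).

    w_a: approximation-subnet hidden weight; w_c: prediction-subnet weight that
    produces the control value; w_p: prediction-subnet weight that produces the
    safe/unsafe score; v: approximation-subnet output weight.
    """
    a = w_a * x          # hidden activation of the approximation subnet
    c = w_c * x          # control value from the prediction subnet
    score = w_p * x      # prediction score (not used by the approximation path)
    y_hat = v * (a * c)  # the Hadamard product reduces to a scalar product here
    return y_hat, score

def approx_loss(params, x, y):
    y_hat, _ = toy_axnet(x, *params)
    return (y_hat - y) ** 2

def numeric_grad(loss, params, index, x, y, h=1e-6):
    """Central finite difference of loss with respect to params[index]."""
    up = list(params)
    up[index] += h
    down = list(params)
    down[index] -= h
    return (loss(up, x, y) - loss(down, x, y)) / (2 * h)

params = [0.5, 1.5, -0.3, 2.0]  # w_a, w_c, w_p, v
x, y = 1.0, 1.0
grad_w_c = numeric_grad(approx_loss, params, 1, x, y)  # control-vector weight
grad_w_p = numeric_grad(approx_loss, params, 2, x, y)  # prediction-score weight
```

The weight behind the control value receives a non-zero gradient from the approximation loss, while the weight behind the prediction score receives none from it; in the full network that weight would be trained by the prediction loss alone.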
AXNet shows excellent training stability and convergence rate, as shown in Figure 3(b). Interestingly, the convergence of the prediction subnet falls behind that of the approximation subnet. We denote this phenomenon as the “saturation effect”, and attribute the successful training of AXNet to it. The cause of the saturation effect is the skewness of the labels provided to train the prediction subnet when the true invocation of the approximation subnet is near 0% or 100%. According to a previous study [13], if the number of training examples that correspond to each class (safe-to-approximate or not) varies significantly between the classes, it may be harder for the network to learn the rarer class. Thus, in the beginning, the prediction subnet fails to catch up with the immature approximation subnet. Unlike in the weight-sharing method, the failed training of the prediction subnet does not affect the training of the approximation subnet, because the approximation subnet is relatively independent of the prediction subnet. Unfortunately, this property also damages the performance of AXNet when the approximation subnet is invoked almost 100% of the time. Under this circumstance, common techniques for tackling imbalanced data can be used [14].
Note that previous works [20, 16] train the predictor sufficiently after the approximator. All these works, including this one, provide evidence for delaying (or spending less effort on) the training of the predictor at the beginning of the training process, when the approximator is too weak to provide a high-quality approximate output. Otherwise, the skewed samples (most of them unsafe-to-approximate) will destroy the training of the predictor. The resulting predictor makes inaccurate, if not absurd, predictions on data, which in turn misleads the training of the approximator. The same phenomenon can be observed in training a generative adversarial network (GAN) [7]. A common trick is to train the generator less frequently than the discriminator.
III-C Analysis and Interpretation
Besides the training stability, we mathematically prove other superior properties of AXNet:
First, AXNet improves the capacity of fitting the target function by introducing extra nonlinearity through the Hadamard product operations. Without loss of generality, suppose both the prediction subnet and the approximation subnet are MLPs with linear activation functions. By writing the vector that is passed on from a hidden layer of the approximation subnet in concrete mathematical form, we derive the Hadamard product (which is sent to the next layer of the approximation subnet) in dimension $k$:
$$(a \odot c)_k = \Big(\sum_{i=1}^{n} w^{A}_{ki} x_i + b^{A}_k\Big)\Big(\sum_{j=1}^{n} w^{P}_{kj} x_j + b^{P}_k\Big) \tag{5}$$
wherein $x$ is the $n$-dimensional input vector, $W^{A}$ and $W^{P}$ refer to the weight matrices that map $x$ to the activation vector $a$ and to the control vector $c$, $b^{A}$ and $b^{P}$ are the corresponding bias vectors, and $x_i$ refers to the $i$-th value of the vector $x$. $w_{ki}$ denotes the element in row $k$ and column $i$ of the weight matrix $W$. This equation shows that products of all pairs of input features, namely quadratic terms, are passed to the rest of the approximation subnet. The successive hidden layers produce even higher-order terms. The Hadamard product thereby introduces higher-order terms, extra nonlinearity and a more complex representation of the input features. Higher-order terms of input features have been widely used in previous machine learning practice as a feature engineering technique [23], but they were selected by hand rather than generated automatically as in this work.
Second, thanks to the control vectors, AXNet can adjust the activation values of the hidden layers of the approximation subnet. Control vectors filter the activation values of the hidden layers in the approximation subnet through the Hadamard product. Fig. 4 demonstrates a case study of this effect. The Bessel function is suitable for visualization because it has two-dimensional inputs, drawn on the horizontal axes, and a one-dimensional output, drawn on the vertical axis. These figures show that the existence of the prediction subnet improves the fitting capacity of the approximation subnet. When the input approaches a corner of the input domain, the control value in the control vector suppresses the activation value (see Fig. 4(d)). The same effect happens at other neurons in this layer. This is the reason AXNet (Figure 4(b)) produces a better result than a single approximator (Figure 4(c)) near that corner.
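The first property above, that gating with linear activations produces quadratic terms in the input features, can be verified with a small worked example. The sketch below is illustrative only; the function and parameter names (`gated_unit`, `w_a`, `w_c`, and so on) are hypothetical, and each subnet is assumed to reduce to a single affine map of the input.

```python
def affine(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def gated_unit(x, w_a, b_a, w_c, b_c):
    """Product of two affine maps of x: one activation times one control value."""
    return affine(w_a, b_a, x) * affine(w_c, b_c, x)

def expanded(x, w_a, b_a, w_c, b_c):
    """The same value written out as quadratic, linear and constant terms."""
    n = len(x)
    quad = sum(w_a[i] * w_c[j] * x[i] * x[j] for i in range(n) for j in range(n))
    lin = sum((b_c * w_a[i] + b_a * w_c[i]) * x[i] for i in range(n))
    return quad + lin + b_a * b_c

x = [2.0, -1.0]
w_a, b_a = [1.0, 3.0], 0.5
w_c, b_c = [-2.0, 1.0], 1.0
product = gated_unit(x, w_a, b_a, w_c, b_c)
polynomial = expanded(x, w_a, b_a, w_c, b_c)
```

Both expressions evaluate to the same number, which confirms that a single gated neuron already carries every pairwise product of the input features even though neither subnet has a nonlinear activation.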
Third, this end-to-end network structure inherently balances the two learning tasks and seeks the global optimum through the joint loss function in equation (2). When we train two isolated NNs, each of them inevitably seeks its own optimal parameters. AXNet, however, forces each subnet to consider the loss of the other during the training procedure. The two subnets thus coordinate to achieve the minimal loss, resulting in the maximal invocation and the minimal approximation error. This coordination is more effective than the one in [20], which works by selecting the training samples.
III-D Subnet fusion with single control vector
A drawback of the current AXNet design is the increased number of neurons and synaptic weights in the output layer of the prediction subnet, as well as the extra Hadamard products. We need extra neurons for the control vectors. If the approximation subnet becomes larger, the cost of AXNet grows accordingly.
To resolve the above issue, we further devise a simpler AXNet by interlinking the prediction subnet with a single hidden layer of the approximation subnet, instead of all the hidden layers. Hence, the prediction subnet only needs a single control vector, which dramatically reduces the storage and computation overhead. The experimental results confirm that the simplified AXNet maintains its performance (Figure 10).
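The saving from a single control vector can be made concrete by counting the neurons required in the output layer of the prediction subnet. The helper below is a hypothetical sketch; it assumes one control value per gated hidden neuron plus one (sigmoid) or two (softmax) prediction values.

```python
def prediction_output_size(hidden_sizes, activation="softmax", fused_layers=None):
    """Neurons needed in the prediction subnet's output layer.

    hidden_sizes: neurons per hidden layer of the approximation subnet.
    fused_layers: indices of hidden layers that receive a control vector;
                  None means every hidden layer (the full AXNet design).
    """
    if fused_layers is None:
        fused_layers = range(len(hidden_sizes))
    control = sum(hidden_sizes[i] for i in fused_layers)
    prediction = 2 if activation == "softmax" else 1  # sigmoid needs one value
    return control + prediction

full = prediction_output_size([8, 8])                      # all hidden layers fused
single = prediction_output_size([8, 8], fused_layers=[0])  # first hidden layer only
```

With two hidden layers of 8 neurons each, as in the 64-8-8-64 jpeg approximation subnet used in the experiments, the output layer shrinks from 18 to 10 neurons, matching the 64-12-18 and 64-12-10 prediction subnets reported there.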
IV Architecture of AXNet Accelerator
Due to the space limit, this section describes a simple NPU architecture that fits AXNet, modeled on the NPU architecture in [6]. As shown in Figure 5, the NPU contains many processing engines (PEs) grouped into tiles, a controller, an on-chip memory, and a bus scheduler. A tile (encircled by the rounded rectangle) is composed of a set of identical PEs, an input buffer and an output buffer, all of which are connected by an internal bus (we omit the internal bus arbiter for clarity). We adopt neuron-level parallelism. Thus, each PE computes the output of a single neuron (as in equation 4) in the prediction subnet or the approximation subnet. The input/output buffer temporarily stores the input/output vector and interfaces with the on-chip memory. The on-chip memory can interface with the DRAM, the input/output buffers in the tiles and the CPU through the bus. It can store the weight matrices, the input samples and output results of AXNet, and the intermediate results transferred between two adjacent layers of the neural network. The controller is responsible for sending the invoke signal through the control bus (dotted lines) to the PEs or the CPU according to the prediction result. As data transfers concurrently between the tiles, the CPU and the on-chip memory, a bus scheduler is necessary to avoid bus conflicts.
Figure 5 shows the data flow of executing AXNet. When the input sample arrives in the on-chip memory, the NPU schedules the computation for the prediction subnet (the first three stages) and subsequently for the approximation subnet (the last two stages). In stage ①, the input data is fetched from the on-chip memory to the Input Buffer through the data bus. The weight buffer in each PE fetches the weight vectors from the on-chip memory. After receiving both the input and the weight vector, each PE conducts the forward propagation of the prediction subnet and generates the prediction result and the control vectors. In stage ②, the control vectors and the prediction result of the prediction subnet are sent to the on-chip memory. Specifically, the prediction result is placed at a specified address (the grey region inside the on-chip memory). In stage ③, the controller reads the prediction result from the on-chip memory. According to its value, the controller invokes either the CPU or the approximation subnet through the control bus (dotted lines). If the approximation subnet is invoked, in stage ④, each PE fetches the input data and control vectors from the buffers and conducts the forward propagation of the approximation subnet. In stage ⑤, the approximation result is sent to the on-chip memory through the data bus.
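The five stages can be summarized as a small software model of the control flow. This is only a behavioral sketch, not a description of the hardware: the three callables are hypothetical stand-ins for the prediction subnet, the approximation subnet and the exact CPU routine, the on-chip memory is modeled as a dictionary, and the 0.5 decision threshold is an assumption.

```python
def run_axnet_on_npu(x, predict, approximate, exact_on_cpu, threshold=0.5):
    """Follow the five stages: predict, store, dispatch, approximate, write back.

    predict(x) -> (p, controls); approximate(x, controls) -> y_hat;
    exact_on_cpu(x) -> y. All three callables are hypothetical stand-ins.
    """
    memory = {"input": x}                    # on-chip memory holding the sample
    p, controls = predict(memory["input"])   # stage 1: prediction subnet on the PEs
    memory["prediction"] = p                 # stage 2: results go to on-chip memory
    memory["controls"] = controls
    if memory["prediction"] >= threshold:    # stage 3: controller reads the prediction
        # stage 4: approximation subnet uses the input and the control vectors
        memory["output"] = approximate(memory["input"], memory["controls"])
        memory["invoked"] = "npu"
    else:
        memory["output"] = exact_on_cpu(memory["input"])
        memory["invoked"] = "cpu"
    return memory                            # stage 5: result sits in on-chip memory

safe = run_axnet_on_npu(2.0, lambda x: (0.9, [1.0]),
                        lambda x, c: x * c[0] + 0.01, lambda x: x)
unsafe = run_axnet_on_npu(2.0, lambda x: (0.1, [1.0]),
                          lambda x, c: x * c[0] + 0.01, lambda x: x)
```

The first call is dispatched to the approximation path and the second falls back to exact computation, mirroring the controller's choice between the approximation subnet and the CPU.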
Note that the proposed NPU can statically allocate the computing and storage resources for the whole AXNet if the AXNet derived for an application is small enough. In this case, the weight vectors can stay in the weight buffer of each PE all the time. Otherwise, the NPU can dynamically schedule the computation of AXNet layer by layer. In that case, the input/output buffer of each PE and the on-chip memory temporarily accommodate the intermediate results between adjacent layers.
We modify a general PE to compute the Hadamard product induced by the fusion of the two neural networks. Figure 5 shows the internal structure of such a PE. When the PE has loaded the input vector into the x reg from the Input Buffer and the weight data into the w reg and b reg from the Weight Buffer, the Multiply Add Unit calculates the dot product of the input vector and the weight vector. The resulting product is stored in the temporary reg. After the bias is added, the result is sent to the Activate Unit, which implements the activation function (i.e., ReLU). Although the NPU performs different computations in the prediction subnet and the approximation subnet, the PE keeps the same structure by using a switch unit after the Activate Unit. At first, the switch unit enables the blue dotted path, which directly pushes the activation result into the output reg. When receiving an invoke signal in stage ③, indicating the computation of the approximation subnet, the switch unit activates the approximation subnet units (solid orange lines). Inside the PE, the Hadamard product reduces to a standard multiplication between one activation value and the corresponding element of the control vector. Therefore, a Multiplier can carry out this computation. The product is stored in the o reg and waits to be transferred to the Output Buffer.
V Experiments
V-A Experimental Setup
We compare the proposed AXNet to two typical previous methods, i.e., “one-pass” training [15] and “iterative” training [20], using an identical optimizer (Adam), the same error metrics, error bound, number of hidden layers, activation function (ReLU) and loss functions (MSE for approximation and cross entropy for prediction). We choose target functions from AxBench, a widely used benchmark suite for approximate computing, including FFT, Bessel, Blackscholes, jpeg, inversek2j, kmeans and Sobel [22]. AxBench provides a large amount of training and testing data. Note that we choose these benchmarks because, first, they are typical applications covering the predominant domains in approximate computing, and second, these choices follow previous works [16, 20, 10].
AXNet has a similar structure and an approximately equal parameter count to the neural networks in previous methods. However, AXNet introduces more parameters in the output layer of the prediction subnet. For a fair comparison, we compare i) the invocation, by shrinking the structure of AXNet until its parameter count closely matches that of the neural networks in previous methods, and ii) the parameter count, by adjusting AXNet until its invocation closely matches that of the others. Table I shows the full experimental setup.
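Table I: Experimental setup.
| Benchmark | Domain | Error bound (metric) | Method | A Topology | P Topology | Parameters |
|---|---|---|---|---|---|---|
| inversek2j | Robotics | | AXNet | 2-6-2 | 2-4-8 | 84 |
| | | | AXNet | 2-5-2 | 2-3-7 | 64 |
| | | | previous | 2-8-2 | 2-8-2 | 84 |
| sobel | | | AXNet | 9-8-1 | 9-3-10 | 159 |
| | | | AXNet | 9-7-1 | 9-3-9 | 144 |
| | | | previous | 9-8-1 | 9-8-2 | 187 |
| FFT | | 0.05 (absolute error) | AXNet | 1-4-3-2 | 1-3-9 | 41 |
| | | | previous | 1-4-4-2 | 1-4-2 | 56 |
| bessel | | | AXNet | 2-2-2-1 | 2-4-6 | 57 |
| | | | AXNet | 2-2-2-1 | 2-2-6 | 39 |
| | | | previous | 2-4-4-1 | 2-4-2 | 59 |
| jpeg | | | AXNet | 64-16-64 | 64-12-18 | 3129 |
| | | | AXNet | 64-6-64 | 64-6-8 | 1284 |
| | | | previous | 64-16-64 | 64-16-2 | 3216 |
| blackscholes | | | AXNet | 6-6-1 | 6-4-7 | 112 |
| | | | AXNet | 6-5-1 | 6-3-7 | 90 |
| | | | previous | 6-8-1 | 6-8-2 | 138 |
| kmeans | | | AXNet | 6-4-4-1 | 6-4-10 | 131 |
| | | | AXNet | 6-3-2-1 | 6-3-7 | 81 |
| | | | previous | 6-4-4-1 | 6-8-2 | 127 |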
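The parameter totals in Table I can be related to the listed topologies with a short calculation. The sketch below assumes, as an interpretation rather than a statement from the paper, that each total counts the weights and biases of every fully connected layer in both networks; under that assumption it reproduces, for example, the inversek2j, sobel and kmeans rows. The function names are hypothetical.

```python
def mlp_parameters(topology):
    """Weights plus biases of a fully connected network, e.g. [2, 6, 2]."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(topology, topology[1:]))

def total_parameters(approx_topology, pred_topology):
    """Combined size of the approximation and prediction networks."""
    return mlp_parameters(approx_topology) + mlp_parameters(pred_topology)

inversek2j_axnet = total_parameters([2, 6, 2], [2, 4, 8])
inversek2j_previous = total_parameters([2, 8, 2], [2, 8, 2])
kmeans_axnet = total_parameters([6, 4, 4, 1], [6, 4, 10])
```

For inversek2j, the AXNet prediction subnet has a wider output layer (8 neurons instead of 2), yet the total matches the previous method because both hidden layers are narrower.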
We use four evaluation metrics, defined as follows:

True invocation: the proportion of safe-to-approximate samples among all the testing data.

Predicted invocation: the proportion of samples that the prediction subnet believes to be safe-to-approximate among all the testing data.

Prediction accuracy: the proportion of samples that are both safely approximated and predicted as safe-to-approximate.

Approximation error: the mean error of the approximation results for those samples predicted safe-to-approximate, also called the “overall error”. Concretely, it is the mean value of the error metric over all samples labeled safe-to-approximate by the prediction subnet.
True invocation evaluates the ability of the approximation subnet and of the approximators in previous works. Predicted invocation and prediction accuracy measure the performance of the prediction subnet and of the predictor/classifier in previous works. Approximation error assesses the overall performance of the approximate computing framework.
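The four metrics can be computed from per-sample approximation errors and predictor decisions, as in the sketch below. The function name and argument layout are hypothetical, and one reading of the definitions is assumed: prediction accuracy is taken as the fraction of predicted-safe samples that are truly within the error bound.

```python
def evaluate(errors, predicted_safe, error_bound):
    """errors[i]: approximation error of test sample i;
    predicted_safe[i]: True if the predictor marks sample i safe-to-approximate."""
    n = len(errors)
    truly_safe = [e <= error_bound for e in errors]
    invoked = [e for e, p in zip(errors, predicted_safe) if p]
    correct = sum(1 for t, p in zip(truly_safe, predicted_safe) if t and p)
    return {
        "true_invocation": sum(truly_safe) / n,
        "predicted_invocation": len(invoked) / n,
        # Assumption: accuracy is measured over the samples predicted safe.
        "prediction_accuracy": correct / len(invoked) if invoked else 0.0,
        "approximation_error": sum(invoked) / len(invoked) if invoked else 0.0,
    }

metrics = evaluate([0.01, 0.02, 0.2, 0.03], [True, True, True, False], 0.05)
```

In this toy run three of the four samples are truly safe and three are predicted safe, but only two samples are both, so the misclassified sample with error 0.2 raises the overall approximation error above the bound.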
To evaluate the energy efficiency, we first derive the speedup and energy reduction of AXNet and then obtain the improvement in energy efficiency from these two quantities. Due to the space limit, we theoretically estimate the performance of AXNet according to the performance of the NPU in [6], which is valid because the NPU design for AXNet is nearly the same as in the original work. The extra overhead of the controller and the light modification of the PE can be ignored in an NPU design.
V-B Result and analysis
Figure 7 shows the true and predicted invocation across different benchmarks. AXNet achieves greater true invocation than the iterative and one-pass methods in all benchmarks, by 30.8% and 50.7% on average, respectively. The greatest improvements over the iterative method appear in the kmeans and jpeg benchmarks. AXNet also surpasses the other methods in terms of predicted invocation. The saturation effect occurs in the fft, bessel, and kmeans benchmarks (see the discussion in Section III-B), resulting in 100% predicted invocation.
Figure 7 shows the prediction accuracy. Although our end-to-end trainable AXNet is not trained as extensively as in iterative training, it shows a similar prediction accuracy. In some cases, such as inversek2j, jpeg, and blackscholes, the classifier (the same as the “predictor” in this paper) with iterative training exceeds AXNet’s prediction accuracy. This is because the iterative training method selects the training data in favor of the classifier. In practice, AXNet still carries out more acceptable approximations than the previous methods owing to its higher predicted invocation.
Figure 8 presents the overall approximation error that the user finally observes. In each benchmark, we normalize the overall error to that of the one-pass method. In all cases, AXNet substantially reduces the error compared to the one-pass method, but in some cases it falls behind the iterative method. Note that the overall error is already under the error bound, so this gap has no impact on the quality of approximate computing.
Figure 9 illustrates the variation of the true and predicted invocation as the network topology varies, i.e., as the number of neurons in the hidden layers is adjusted. AXNet always achieves better (true and predicted) invocation than the iterative method. When the two methods achieve the same true invocation, the iterative method uses a three-layer MLP with 64-16-64 neurons (namely a 64-dimensional input, 16 neurons in the first hidden layer and a 64-dimensional output; the same notation is used from now on) for the approximator and an MLP with 64-16-2 neurons for the classifier. The total number of synaptic parameters is 3216. AXNet, in contrast, only requires 64-6-64 for the approximation subnet and 64-4-8 for the prediction subnet, 1284 synaptic parameters in total. These results imply that the fusion of the approximator and predictor in AXNet can eliminate a large number of redundant parameters that contribute little to the model’s performance. However, we observe that the iterative method yields more stable invocation as the neural networks become larger, because the iterative method involves much more training effort.
To validate the above observation on other benchmark functions, the left-side bars in Figure 10 show the ratio of the parameter count of AXNet to that of the preceding methods when they have similar true invocation, as mentioned in Section V-A. We observe that larger structures achieve better parameter reduction. The jpeg benchmark requires thousands of parameters, and AXNet achieves a clear reduction of its parameter count (left-side bar). We also normalize the training time of AXNet to that of the iterative method. The reduction of training time is as high as 90% in jpeg and 74% on average among these benchmark functions. The right-side bars in Figure 10 show the training time. AXNet consumes much less training time than the iterative method. In the iterative method, some of the training data is intentionally discarded, so its exact training time is unclear. Compared to iterative training, AXNet can achieve 13.8× and 32× speedups in training time for bessel and jpeg, respectively. Statistics suggest that FFT incurs the least training time, and thus its reduction of training time is only 50%.
Figure 10 depicts the energy efficiency. AXNet outperforms the two previous works in all benchmark applications. The cost of the proposed NPU is almost identical to that in the one-pass method, except for the approximation subnet unit in each PE. Thus, the enhancement of the true invocation contributes to the improvement of the energy efficiency.
Figure 10 shows the examination of the subnet fusion technique described in Section III-D. We try two ways of fusing the subnets with a single control vector: applying the Hadamard product only in the first hidden layer of the approximation subnet, or only in the second hidden layer. We test their true invocation on two representative benchmark functions that require a large approximation subnet, namely jpeg and sobel. In jpeg, we use an AXNet with topology 64-8-8-64 as the approximation subnet and 64-12-18 (connecting all hidden layers, 1014 parameters) or 64-12-10 (connecting one hidden layer, 910 parameters) as the prediction subnet. Likewise in sobel: 9-6-6-1 as the approximation subnet and 9-8-14 (206 parameters) or 9-8-8 (162 parameters) as the prediction subnet. Figure 10 compares three ways of applying the Hadamard product: at all hidden layers of the approximation subnet (“all”), at the first hidden layer (“1st”), and at the second hidden layer (“2nd”). The results suggest that these three ways of fusion make no evident difference to the performance of AXNet, which validates the effectiveness of subnet fusion with a single control vector.
VI Conclusion
This paper presents AXNet, an end-to-end trainable neural network for approximate computing with quality control. Guided by the multitask learning principle, AXNet fuses the approximator and predictor through the Hadamard product. Experimental results show its superior invocation and gain in energy efficiency over existing neural approximate computing frameworks. We also provide a theoretical interpretation and experimental validation of AXNet’s advantage in approximation error. Finally, AXNet incurs much less training time and has a smaller scale than existing works. In future work, we will study compression techniques for AXNet and interpret the underlying mechanism that enables AXNet. We will also evaluate the speedup and energy reduction in a real AXNet NPU implementation.
References
 [1] J. Baxter, “Learning internal representations,” pp. 311–320, 1995.
 [2] R. Caruana, “Multitask learning,” in Learning to learn. Springer, 1998, pp. 95–133.
 [3] R. A. Caruana, “Multitask connectionist learning,” in Proceedings of the 1993 Connectionist Models Summer School. Citeseer, 1993.
 [4] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun et al., “Dadiannao: A machine-learning supercomputer,” in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 2014, pp. 609–622.
 [5] D. Ernst, N. S. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, D. Blaauw, T. Austin, K. Flautner et al., “Razor: A low-power pipeline based on circuit-level timing speculation,” in Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 2003, p. 7.
 [6] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, “Neural acceleration for general-purpose approximate programs,” in Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 2012, pp. 449–460.
 [7] I. Goodfellow, “Nips 2016 tutorial: Generative adversarial networks,” arXiv preprint arXiv:1701.00160, 2016.
 [8] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT press Cambridge, 2016, vol. 1.
 [9] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. Horowitz, and W. J. Dally, “Eie: efficient inference engine on compressed deep neural network,” international symposium on computer architecture, vol. 44, no. 3, pp. 243–254, 2016.
 [10] X. He, G. Yan, Y. Han, and X. Li, “Acr: Enabling computation reuse for approximate computing,” in Design Automation Conference, 2016, pp. 643–648.
 [11] K. Hornik, “Approximation capabilities of multilayer feedforward networks,” Neural networks, vol. 4, no. 2, pp. 251–257, 1991.
 [12] D. S. Khudia, B. Zamirai, M. Samadi, and S. Mahlke, “Rumba: An online quality management system for approximate computing,” in Computer Architecture (ISCA), 2015 ACM/IEEE 42nd Annual International Symposium on. IEEE, 2015, pp. 554–566.
 [13] S. Lawrence, I. Burns, A. D. Back, A. C. Tsoi, and C. L. Giles, “Neural network classification and prior class probabilities,” neural information processing systems, pp. 299–313, 1998.
 [14] R. Longadge and S. Dongre, “Class imbalance problem in data mining review,” arXiv preprint arXiv:1305.1707, 2013.
 [15] D. Mahajan, A. Yazdanbakhsh, J. Park, B. Thwaites, and H. Esmaeilzadeh, “Prediction-based quality control for approximate accelerators,” in Second Workshop on Approximate Computing Across the System Stack, WACAS, 2015.
 [16] D. Mahajan, A. Yazdanbakhsh, J. Park, B. Thwaites, and H. Esmaeilzadeh, “Towards statistical guarantees in controlling quality tradeoffs for approximate acceleration,” ACM SIGARCH Computer Architecture News, vol. 44, no. 3, pp. 66–77, 2016.
 [17] M. Samadi et al., “Sage: self-tuning approximation for graphics engines,” international symposium on microarchitecture, pp. 13–24, 2013.
 [18] X. Sui, A. Lenharth, D. S. Fussell, and K. Pingali, “Proactive control of approximate programs,” ACM SIGOPS Operating Systems Review, vol. 50, no. 2, pp. 607–621, 2016.
 [19] T. Wang, Q. Zhang, N. S. Kim, and Q. Xu, “On effective and efficient quality management for approximate computing,” in Proceedings of the 2016 International Symposium on Low Power Electronics and Design. ACM, 2016, pp. 156–161.
 [20] C. Xu, X. Wu, W. Yin, Q. Xu, N. Jing, X. Liang, and L. Jiang, “On quality tradeoff control for approximate computing using iterative training,” in Proceedings of the 54th Annual Design Automation Conference 2017. ACM, 2017, p. 52.
 [21] X. Xu and H. H. Huang, “Exploring data-level error tolerance in high-performance solid-state drives,” IEEE Transactions on Reliability, vol. 64, no. 1, pp. 15–30, 2015.
 [22] A. Yazdanbakhsh, D. Mahajan, H. Esmaeilzadeh, and P. Lotfi-Kamran, “Axbench: A multiplatform benchmark suite for approximate computing,” IEEE Design & Test, vol. 34, no. 2, pp. 60–68, 2017.
 [23] Z. Zabokrtsky, “Feature engineering in machine learning,” Institute of Formal and Applied Linguistics, Charles University in Prague, 2015.
 [24] Q. Zhang et al., “Approxann: an approximate computing framework for artificial neural network,” in Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition. EDA Consortium, 2015, pp. 701–706.