1 Introduction
The recent availability of high-performance computing platforms has enabled the success of deep neural networks (DNNs) in many demanding fields, especially machine learning and computer vision. At the same time, applications of DNNs have proliferated to platforms ranging from data centers to embedded systems, which opens up new challenges in low-power, low-latency implementations that can maintain state-of-the-art accuracy. While systems with general-purpose CPUs and GPUs are capable of processing very large DNNs, they have high power requirements and are not suitable for embedded systems, which has led to increasing interest in the design of low-power custom hardware accelerators.
In designing low-power hardware for DNNs, one major challenge stems from the high precision used for the network parameters. DNNs with state-of-the-art classification accuracy are typically implemented using single-precision (32-bit) floating-point, which requires large memory for both the network parameters and the intermediate computations. Complex hardware multipliers and adders are also needed to operate on such representations.
On the other hand, the inherent resiliency of DNNs to insignificant errors has resulted in a wide array of hardware-software co-design techniques targeted at lowering the energy and memory footprint of these networks. Such techniques broadly aim either to lower the cost of each operation by reducing the precision [10, 14, 9] or to lower the number of required operations, for example by knowledge distillation [17, 16, 7].
While previous studies offer low-precision DNNs with little reduction in accuracy, the smallest fixed-point solutions proposed require 8 bits or more for both the activations and the network parameters. Furthermore, while solutions with binary and ternary precision prove effective for smaller networks on small datasets, they often lead to unacceptable accuracy loss on large datasets such as ImageNet [19]. In addition, these low-precision techniques usually require precision-specific network designs and therefore cannot readily be applied to a given network without an expensive architecture exploration.
In this work, we aim to tackle the low-power, high-accuracy challenge for DNNs by proposing a hardware-software co-design solution that transforms existing floating-point networks to 8-bit dynamic fixed-point networks with integer power-of-two weights, without changing the network topology. The use of power-of-two weights enables a multiplier-free hardware accelerator design, which efficiently performs computation in dynamic fixed-point precision. More specifically, our contributions in this paper are as follows:


We propose to compress floating-point networks to 8-bit dynamic fixed-point precision with integer power-of-two weights. We then propose to fine-tune the quantized network using student-teacher learning to improve classification accuracy. Our technique requires no change to the network architecture.

We propose a new multiplier-free hardware accelerator for DNNs and synthesize it using an industry-level library. Our custom accelerator efficiently operates in 8-bit multiplier-free dynamic fixed-point precision.

We also propose to utilize an ensemble of dynamic fixed-point networks, resulting in improvements in classification accuracy compared to the floating-point counterpart, while still allowing large energy savings.

We evaluate our methodologies on two demanding and well-studied datasets, namely CIFAR-10 and ImageNet, and use well-recognized network architectures for our experiments. We compare our solution against a baseline floating-point accelerator and quantify the power and energy benefits of our methodology.
The rest of our paper is organized as follows. In Section 2, we provide a brief background on deep neural networks. In Section 3, we summarize previous work related to ours. Section 4 describes our methodologies, and Section 5 presents our accelerator design. Next, in Section 6, we provide the results obtained from our methodologies and our custom accelerator, discussing performance from both hardware and accuracy perspectives. Finally, in Section 7, we conclude our work.
2 Background
Figure 1 shows the template structure of a deep neural network. While a large number of layer types are available in the literature, three layer types are more commonly used in DNNs:


Convolutional Layers: Each neuron in this layer is connected to a subset of inputs with the same spatial dimensions as the kernels, which are typically 3-dimensional as shown in Figure 1. The convolution operation can be formulated as y = Σᵢ wᵢxᵢ + b. Here, x is the input subset, w is the kernel weight matrix, and b is a scalar bias. These layers are used for feature extraction.

Pooling Layers: Pooling layers are simply used to downsample the input data.

Fully-Connected Layers: These layers are similar to convolutional layers, with the difference that the inputs and kernels are one-dimensional vectors. These layers are often used toward the end of the network as classifiers, where the output vector from the final layer (the logits) is fed to a logistic function.

Non-Linearity: For each scalar input x, this layer outputs f(x), where f is a predefined nonlinear function, such as tanh, the rectified linear unit (ReLU), etc.
DNNs are typically based on floating-point precision and trained with the backpropagation algorithm. Each training step involves two phases: forward and backward. In the forward phase, the network is used to perform classification on the input. Afterward, the gradients are propagated back to each layer in the backward phase to update the network's parameters. The largest portion of the computational demand comes from the multiplier blocks utilized in the convolutional and fully-connected layers.
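As a concrete illustration of the layer computations above, the following sketch computes one output activation of a convolutional neuron, the dot product of a 3-D input patch with a same-shaped kernel plus a scalar bias (the shapes here are illustrative toy sizes, not from the paper):

```python
import numpy as np

def conv_neuron(x, w, b):
    """One output activation of a convolutional layer: the dot product
    of a 3-D input patch with a same-shaped kernel, plus a scalar bias."""
    return float(np.sum(x * w) + b)

# A 3x3x2 input patch and kernel (toy sizes; real kernels are e.g. 3x3xC).
patch = np.ones((3, 3, 2))
kernel = np.full((3, 3, 2), 0.5)
print(conv_neuron(patch, kernel, b=1.0))  # 3*3*2*0.5 + 1 = 10.0
```

A fully-connected neuron performs the same multiply-accumulate, just over flattened one-dimensional vectors; this is the operation whose multipliers the paper later replaces with shifts.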
3 Related Works
Previous work on software and hardware implementations of DNNs has been, for the most part, disconnected. Few studies have tried to optimize highly accurate designs under low power budgets. On the accuracy front, one approach to condensing DNNs is to train much smaller networks from large, cumbersome models [17, 16]. Both models are based on floating-point precision. This approach trains the student (smaller model) to mimic the outputs of the teacher (larger model). The loss function for the training is composed of two parts: the loss with respect to the true labels and the loss with respect to the outputs of the teacher model.
Alternatively, DNNs with low-precision data formats have enormous potential for reducing hardware complexity, power, and latency. Not surprisingly, there exists a rich body of literature studying such limited precisions. Previous work in this area has considered a wide range of reduced precisions, including fixed-point [13, 15, 5], ternary (−1, 0, +1) [12], and binary (−1, +1) [14, 8]. Furthermore, comprehensive studies of the effects of different precisions on deep neural networks are also available. Gysel et al. [10] propose Ristretto, a hardware-oriented tool capable of simulating a wide range of signal precisions. While they consider dynamic fixed-point, their focus is on network accuracy, so hardware metrics are not evaluated. On the other hand, Hashemi et al. [9] provide a broad evaluation of different precisions and quantizations on both hardware metrics and network accuracy. However, they do not evaluate dynamic fixed-point.
In the hardware design domain, while a few works have considered different bit-width fixed-point representations in their accelerator designs [9, 6, 18], in contrast to the accuracy analysis, no evaluation of hardware designs using dynamic fixed-point is available. We fill this gap by providing an accelerator design optimized to use dynamic fixed-point representation for intermediate computations while using power-of-two weights.
In recent years, a few works have focused on techniques to reduce the power demands of DNNs at the cost of small reductions in network accuracy. For instance, Tann et al. propose an incremental learning algorithm where the network is trained in incremental steps [11]. The idea is then to turn off large portions of the network to save energy when these portions are not needed to retain accuracy. While this work delivers significant power and energy savings with small accuracy degradation, it is orthogonal to ours and can be applied in conjunction.
Sarwar et al. [4] propose a multiplier-less neural network where an accurate multiplier is replaced with an alphabet-set multiplier to save power. This work, however, focuses on multilayer perceptrons; deep neural networks are not evaluated. In contrast, we evaluate our work on both CIFAR-10 and ImageNet and highlight that our methodology is capable of delivering significant energy savings while even showing improvements in accuracy.
4 Multiplier-Free Dynamic Fixed-Point (MFDFP) Networks
To simplify the hardware implementation, we propose to alter the compute model by replacing multipliers with shift blocks and reducing signal bit widths to 8 bits. We represent the signals using a dynamic fixed-point format, since synaptic weights and signals in different layers can vary greatly in range. Employing a uniform fixed-point representation across the layers would require large bit widths to accommodate such ranges. As demonstrated by others [10, 9], even with 16-bit fixed-point, a significant accuracy drop is observed compared to floating-point representation.
Dynamic fixed-point representation, as proposed in [13], can be described by two variables ⟨B, FL⟩, where B is the bit width and FL is the fractional length. Each B-bit number in this scheme takes the value (−1)^s · 2^{−FL} · Σ_{i=0}^{B−2} 2^i · x_i, where s and x_i are the sign bit and the i-th bit, respectively. The term dynamic refers to the fact that different layers in a DNN can take on different values of FL depending on their ranges. In this work, we deploy 8-bit dynamic fixed-point in all of our experiments.
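A minimal sketch of this quantization, assuming illustrative values for the bit width and fractional length (the names `bw` and `fl` are ours, not the paper's):

```python
import numpy as np

def to_dfp(x, bw=8, fl=5):
    """Quantize x to dynamic fixed-point <bw, fl>: bw total bits, fl of
    them fractional, i.e. a representable step of 2**-fl with saturation
    at the two's-complement range limits."""
    step = 2.0 ** -fl
    lo = -(2 ** (bw - 1)) * step        # most negative representable value
    hi = (2 ** (bw - 1) - 1) * step     # most positive representable value
    return np.clip(np.round(x / step) * step, lo, hi)

# Layers with small-magnitude signals use a larger fl (finer resolution);
# layers with a wide range use a smaller fl (coarser steps, more range).
print(to_dfp(0.7371, bw=8, fl=5))  # -> 0.75 (nearest multiple of 1/32)
print(to_dfp(10.0, bw=8, fl=5))    # -> 3.96875 (saturates at the maximum)
```

The "dynamic" aspect amounts to choosing a different `fl` per layer, which the accelerator supports through per-layer radix-point control signals (Section 5).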
While we adapt our quantization process from the techniques in [10], our work differs from theirs in three aspects: (i) we perform a hardware-software analysis for power-of-two weights and a dynamic fixed-point datapath, (ii) we propose to include student-teacher learning in the fine-tuning process, and (iii) we demonstrate that an ensemble of two MFDFP networks can outperform the floating-point network while achieving significant energy savings. These aspects are described in Algorithm 1 as three phases, which we describe next in more detail.
4.1 Network Quantization (Phase 1)
To construct a dynamic fixed-point network, we take as input a fully trained floating-point network. We first quantize this input network by rounding its weights to the nearest powers of two, and round the intermediate signals to 8-bit dynamic fixed-point using Ristretto [10]. We then fine-tune the network to recover the accuracy lost to quantization (Algorithm 1, Phase 1).
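The weight-rounding step can be sketched as follows (a hedged illustration; the exponent bound `min_exp` reflects the 8-bit constraint discussed in Section 5, and the function name is ours):

```python
import numpy as np

def quantize_pow2(w, min_exp=-8):
    """Round each weight to the nearest signed integer power of two,
    s * 2**e, clipping exponents to [min_exp, 0] since weight
    magnitudes are below 1 and activations are 8-bit."""
    sign = np.sign(w)
    mag = np.abs(w)
    # Guard against log2(0); zero weights stay zero via sign = 0.
    e = np.log2(np.where(mag > 0, mag, 2.0 ** min_exp))
    e = np.clip(np.round(e), min_exp, 0)
    return sign * 2.0 ** e

w = np.array([0.30, -0.12, 0.06])
print(quantize_pow2(w))  # -> [ 0.25   -0.125   0.0625]
```

Rounding in the log domain (nearest exponent) rather than the linear domain keeps the relative quantization error roughly uniform across weight magnitudes.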
DNNs are typically trained using the backpropagation algorithm with variants of gradient descent, which can be ill-suited for low-precision networks. The computed gradients and learning rates are typically very small, which means that parameters may not be updated at all in a low-precision format. Intuitively, training requires high precision to converge to a good minimum, whereas integer power-of-two weights only allow large incremental jumps.
To combat this disparity, we adopt the solution proposed by Courbariaux et al. [14] and keep two sets of weights during the training process: one in quantized precision and one in floating-point. As shown in Algorithm 1, during forward propagation, the floating-point weight set is stochastically or deterministically quantized before the input data is evaluated. In our work, we found that deterministic quantization gives better performance. The output of the quantized network is then used to compute the loss with respect to the true label of the data, and the gradients with respect to this loss are used to update the floating-point parameters during backward propagation. The process is repeated until convergence. This approach allows small gradients to accumulate over time and eventually cause incremental updates in the quantized weights.
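A minimal sketch of one step of this two-weight-set scheme, with `grad_fn` standing in for the real forward/backward pass and `lr` an illustrative learning rate (both hypothetical names, not from the paper):

```python
import numpy as np

def train_step(w_float, grad_fn, lr=0.01, min_exp=-8):
    """One update of the two-weight-set scheme: forward with the
    deterministically quantized weights, then apply the gradient to
    the floating-point copy so small updates can accumulate."""
    sign = np.sign(w_float)
    mag = np.abs(w_float)
    e = np.log2(np.where(mag > 0, mag, 2.0 ** min_exp))
    w_quant = sign * 2.0 ** np.clip(np.round(e), min_exp, 0)
    g = grad_fn(w_quant)             # gradient evaluated at quantized weights
    return w_float - lr * g, w_quant

# Tiny demo: a gradient far too small to move a power-of-two weight
# directly still moves the float copy, which eventually flips w_quant.
w = np.array([0.25])
wq = None
for _ in range(100):
    w, wq = train_step(w, lambda q: np.array([-0.5]), lr=0.01)
print(wq)  # the quantized weight has climbed to a larger power of two
```

Without the floating-point shadow copy, every one of these hundred updates would round back to 0.25 and the network would never learn.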
4.2 Additional Finetuning (Phase 2)
On top of the technique from Courbariaux et al. [14], we propose additional training with a different loss function once training with hard labels no longer improves performance. As shown in Algorithm 1 (Phase 2), in addition to using hard labels, we introduce student-teacher learning, where a student network is trained to mimic the outputs of a teacher network [17, 16]. Ordinarily both networks are floating-point based, with the student typically having far fewer parameters. In our work, we instead treat the dynamic fixed-point network as the student and the floating-point network as the teacher.
The loss function in student-teacher learning incorporates the knowledge learned by the teacher model [16]. Suppose S is the student network and T is the teacher, with output logit vectors z_S and z_T and class probabilities P_S and P_T, respectively. The softmax regression function is relaxed by introducing a temperature parameter τ such that P_S^τ = softmax(z_S/τ) and P_T^τ = softmax(z_T/τ). Let W_S be the parameters of the student network; the loss function for the student model is then defined as:

L(W_S) = H(y, P_S) + λ·H(P_T^τ, P_S^τ),    (1)

where λ is a tunable parameter, H(·,·) is the cross entropy, and y is the one-hot true data label. Using a high temperature τ, we have ∂H(P_T^τ, P_S^τ)/∂z_S ≈ (1/(N·τ²))·(z_S − z_T), where N is the length of the logit vectors z_S and z_T. With zero-meaned logits (Σ_j z_{S,j} = Σ_j z_{T,j} = 0), the approximated gradient is then:

∂L(W_S)/∂z_S ≈ (P_S − y) + (λ/(N·τ²))·(z_S − z_T).    (2)
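A small numerical sketch of a distillation-style loss of this form, combining the hard-label cross entropy with a temperature-softened cross entropy against the teacher (the names `tau` and `lam` and the example logits are illustrative, not values from the paper):

```python
import numpy as np

def softmax(z, tau=1.0):
    """Temperature-relaxed softmax; tau > 1 softens the distribution."""
    z = z / tau
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distill_loss(z_s, z_t, y_onehot, tau=2.0, lam=1.0):
    """Student-teacher loss: cross entropy against the hard label plus
    lam-weighted cross entropy against the teacher's softened outputs."""
    hard = -np.sum(y_onehot * np.log(softmax(z_s)))
    soft = -np.sum(softmax(z_t, tau) * np.log(softmax(z_s, tau)))
    return hard + lam * soft

z_student = np.array([2.0, 0.5, -1.0])
z_teacher = np.array([3.0, 0.0, -2.0])
y = np.array([1.0, 0.0, 0.0])
print(distill_loss(z_student, z_teacher, y))
```

The soft term vanishes toward its minimum as the student's softened distribution approaches the teacher's, so the teacher's inter-class similarity structure is transferred in addition to the hard label.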
4.3 Ensemble of MFDFP Networks (Phase 3)
Deploying an ensemble of DNNs has proven to be a simple and effective method to boost inference accuracy [21]. The idea is to independently train multiple DNNs of the same architecture and use all of them to evaluate each input, with the output chosen by majority vote. Suppose the ensemble consists of M networks producing output logit vectors z_1, …, z_M. Then the output class can simply be taken as the maximum element of Σ_{i=1}^{M} z_i.
This idea is applicable in scenarios where there exists enough time or energy budget to justify evaluating the input on several networks. In Section 6.2 we highlight that, since the reduction in energy from the proposed MFDFP is so dramatic, the designer may implement an ensemble of MFDFP networks in parallel and still save significantly on energy consumption. More specifically, we show that an ensemble of multiplier-free dynamic fixed-point networks can outperform a floating-point network while still achieving significant energy savings. To construct such an ensemble, we run Algorithm 1 multiple times with different starting floating-point networks.
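The ensemble decision rule above can be sketched in a few lines (the variable names and toy logits are ours):

```python
import numpy as np

def ensemble_predict(logit_vectors):
    """Sum the logit vectors from M independently trained networks
    and pick the argmax class, as in the ensemble decision rule."""
    total = np.sum(np.asarray(logit_vectors), axis=0)
    return int(np.argmax(total))

# Two networks disagree on the top class; the summed logits decide.
net1 = np.array([2.0, 1.9, 0.1])
net2 = np.array([1.0, 3.0, 0.2])
print(ensemble_predict([net1, net2]))  # -> 1 (summed logits: [3.0, 4.9, 0.3])
```

Summing logits weighs each network's confidence, so a strongly confident member can outvote a marginally confident one, unlike a plain one-network-one-vote scheme.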
5 Hardware Accelerator Design
As discussed in Section 4, while we maintain low precision in both network signals and parameters for efficiency, providing the network with the flexibility to change the location of the radix point from layer to layer is necessary to minimize accuracy degradation. While improving the accuracy, this scheme incurs complexities in the hardware design, as some bookkeeping is needed to track the location of the radix point in different parts of the network. In the proposed accelerator, we enable such flexibility by providing each set of calculations with the radix indices of both the input feature maps and the output activations. More specifically, we implement this feature by adding control signals dedicated to the input-feature and output-activation radix indices, along with dedicated hardware that shifts the result to the correct position as determined by those indices.
On the other hand, while dynamic fixed-point representation for synaptic weights and activation maps allows for compact bit widths, inference would still require fixed-point multiplications. As described in Section 4, we propose to quantize the weights to integer powers of two, which allows the expensive multiplications to be replaced with arithmetic shifts. These shift operators are far more hardware-friendly than full-scale multipliers. In this quantization scheme, for each weight w we represent its quantized version using two numbers ⟨s, e⟩, where s is the sign of the weight and e = round(log₂|w|) is the integer exponent of the power of two (i.e., w ≈ s·2^e), with round(·) denoting rounding to the nearest integer. Note that we bound the exponent magnitude to 8 since our input data is limited to 8 bits. For each input x, the product x·w is then transformed into a shift of x by e positions with sign s applied. In addition, we observe that the magnitudes of the weights are less than 1, so our rounding leads to 8 possible exponents, and the weights can therefore be encoded in a 4-bit representation. This observation is used to simplify our hardware architecture significantly, as discussed in Section 6.2.
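The shift-for-multiply substitution can be sketched on integer fixed-point values as follows (a software illustration of the idea; the actual design uses a barrel shifter on the 8-bit datapath, and the function name is ours):

```python
def shift_multiply(x_int, sign, e):
    """Multiply an integer fixed-point activation by a power-of-two
    weight s * 2**e using only a shift and a sign flip. Since weight
    magnitudes are below 1, e is negative and the multiply becomes a
    right shift by -e positions."""
    assert e <= 0, "weight magnitudes below 1 imply non-positive exponents"
    return sign * (x_int >> -e)

# x = 96 (an 8-bit fixed-point value), weight = -2**-3 = -0.125
print(shift_multiply(96, -1, -3))  # -> -12, same as 96 * -0.125
```

With only 8 possible exponents plus a sign, each weight fits in 4 bits, which is what shrinks both the multiplier logic and the weight memory in the accelerator.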
To further improve accuracy, we ensure that no intermediate values are lost by mitigating the possibility of overflow: all intermediate signals are given a large enough word width, effectively widening the intermediate wires as needed. To illustrate the idea, Figure 2(a) shows the simplified structure of a single neuron in our proposed implementation, highlighting the main feature of the accelerator design. The dedicated hardware implementing the dynamic fixed-point scheme is shown as "Accumulator & Routing", with the two radix indices marking the locations of the radix points for the input features and output activations, respectively.
To integrate our proposed neuron architecture into a full-scale hardware accelerator, we utilize a tile-based implementation inspired by DianNao [3], where each cycle a small number of physical neurons is fed a new set of data. We implement three separate memory subsystems assigned to input data, weights, and output data, respectively. This memory organization isolates memory transfers from the computation for maximum throughput. The computation itself is performed in neural processing units (NPUs) containing a number of processing units, each implementing 16 neurons with 16 synapses.
Figure 2(b) illustrates the organization of the proposed hardware accelerator. Here we want to stress the benefits of our methodologies relative to the floating-point design; thus, an architectural design-space exploration, such as altering the number of hardware neurons and synapses, is out of the scope of this work.
To incorporate the proposed ensemble of networks, the number of processing units is increased as needed to parallelize the computation of the ensemble. Note that the memory subsystems as well as the control logic also need to be modified to account for the number of processing units. In Section 6.2 we evaluate our methodologies using a single processing unit, implementing a single multiplier-free dynamic fixed-point (MFDFP) network, and using two processing units, which form an ensemble of two networks.
We also implement and compare our hardware design with a conventional 32-bit floating-point architecture using a single processing unit as a baseline. Compared to our proposed design, the baseline implementation utilizes multipliers in the first stage of the design and keeps the bit width constant at 32 bits throughout, for both the activations and the network parameters.
6 Experimental Results
6.1 Experimental Setup
In this section, we discuss our results on the CIFAR-10 and ImageNet 2012 datasets [2, 19] using the well-known DNN architectures from [2] and [20], respectively. We remove all local response normalization layers since they are not amenable to our multiplier-free hardware implementation. All of our experiments are based on Caffe [1].

For CIFAR-10, we begin by training the floating-point networks using the benchmark architecture. For the ImageNet benchmark, we obtain the floating-point model from the Caffe Model Zoo (https://github.com/BVLC/caffe/wiki/Model-Zoo). We then run the networks on their corresponding training sets to obtain the pre-softmax output logits. From these floating-point networks, we construct our proposed MFDFP networks using Algorithm 1.
For our hardware evaluations, we compile our designs using Synopsys Design Compiler with a 65 nm standard cell library in the typical process corner. We synthesize our hardware so that we have zero timing slack for the floating-point design, and therefore use a constant clock frequency of 250 MHz for all our experiments. While using barrel shifters instead of multipliers provides timing slack that could be used to boost the frequency, we keep the frequency constant, as varying it adds another dimension of evaluation that is out of the scope of this work.
6.2 Results
We evaluate our proposed methodology as well as our custom hardware accelerator on CIFAR-10 and ImageNet using a broad range of performance metrics, including classification accuracy, power consumption, design area, and inference time and energy. Table 1 summarizes the design area and the power consumption of the proposed multiplier-free custom accelerator. Values shown in parentheses, (in, w), reflect the number of bits used to represent inputs and weights, respectively. We also implement a floating-point version of our accelerator as a baseline for comparison. As shown in the table, our accelerator achieves significant benefits in both design area and power consumption, using both one processing unit and an ensemble of two networks. Next, we report the results of applying our methodologies and hardware accelerator to our benchmarks.
Table 1: Design area and power consumption of the proposed accelerator versus the floating-point baseline.

Precision (in, w)        | Design Area | Power Cons. | Area Saving (%) | Power Saving (%)
-------------------------|-------------|-------------|-----------------|------------------
Floating-point (32, 32)  | 16.52       | 1361.61     | 0               | 0
Proposed MFDFP (8, 4)    | 1.99        | 138.96      | 87.97           | 89.79
Ens. MFDFP (8, 4)        | 3.96        | 270.27      | 76.00           | 80.15
Table 2: Classification accuracy, inference time, and energy for each precision on CIFAR-10 and ImageNet.

                         | CIFAR-10                                         | ImageNet
Precision                | Class. Acc. (%) | Time    | Energy  | E. Sav. (%) | Class. Acc. (%) | Time      | Energy    | E. Sav. (%)
-------------------------|-----------------|---------|---------|-------------|-----------------|-----------|-----------|------------
Floating-Point (32, 32)  | 81.53           | 246.52  | 335.68  | 0           | 56.95 (79.88)   | 15666.45  | 21332.38  | 0
MFDFP (8, 4)             | 80.77           | 246.27  | 34.22   | 89.81       | 56.16 (79.13)   | 15666.06  | 2176.96   | 89.80
Ensemble MFDFP           | 82.61           | 246.27  | 66.56   | 80.17       | 57.57 (80.29)   | 15666.06  | 4234.07   | 80.15
Figure 3 shows the classification error rate of the baseline floating-point network as well as the fine-tuning process of MFDFP for the ImageNet benchmark. We observe that by fine-tuning using just the data labels (Phase 1), we achieve significant performance, with less than a 1% increase in error rate compared to the floating-point counterpart. Additional training using the student-teacher model (Phase 2), as described in Section 4.2, reduces the error rate even further. In this experiment, we observed that more benefit is achieved when the student-teacher training is started from a non-optimal point of the labels-only training; that is, the starting network should be close to convergence but not at the best point found during labels-only training. In either case, student-teacher learning consistently outperforms labels-only training. For this training, we fix the temperature parameter and start with a learning rate of 1e-3, decreasing it by a factor of 10 whenever learning levels off and stopping when the rate drops below 1e-7.
Furthermore, Table 2 summarizes the accuracy, inference time, and energy of our proposed techniques. As shown in the table, our methodology achieves energy savings as high as 89% in the case of a single MFDFP network, with a maximum accuracy degradation of 0.79% across both benchmarks. This is especially significant as there is absolutely no modification to the network depth or channel sizes. In addition, with extra area budget, we can implement two processing units in our accelerator and, for each benchmark, deploy an ensemble of two MFDFP networks trained from different starting points. As shown in Table 2, this ensemble outperforms the floating-point networks on both benchmarks while still achieving significant energy savings.
Finally, while we design our methodology with the memory footprint in mind, we do not include the power consumption of the main memory subsystem in our evaluations. As a general guideline, however, our methodology emphasizes reductions in network precision and therefore requires 8× less memory than a floating-point implementation, as shown in Table 3. For the ensemble method, the memory requirement essentially doubles relative to a single MFDFP network but remains far lower than that of the floating-point networks.
Table 3: Memory requirements of the network parameters.

Precision        | CIFAR-10 (MB) | ImageNet (MB)
-----------------|---------------|---------------
Floating-Point   | 0.3417        | 237.95
MFDFP            | 0.0428        | 29.75
Ensemble MFDFP   | 0.0855        | 59.50
7 Conclusion
In this work, we proposed a novel hardware-software co-design approach that enables seamless mapping of full-precision deep neural networks to multiplier-free dynamic fixed-point networks. No change to the network architecture is required to maintain accuracy within acceptable bounds. We also formalized the use of student-teacher learning for accuracy improvement in low-precision networks. In addition, we proposed a hardware design capable of incorporating both the dynamic fixed-point and the multiplier-free design aspects, and we proposed to utilize an ensemble of low-precision MFDFP networks to increase accuracy even further. We evaluated our designs using two well-recognized and demanding datasets, namely CIFAR-10 and ImageNet, running on networks well studied in the literature. Using a single MFDFP network on our testbenches, our design achieves up to 90% energy savings with an insignificant accuracy drop of approximately 1%. Using an ensemble of two networks, energy savings of 80% are achievable while delivering accuracy gains of more than 1% for CIFAR-10 and 0.5% for ImageNet top-1 classification accuracy.
Acknowledgment
This work is supported by NSF grant 1420864. We would like to thank NVIDIA Corporation for their generous GPU donation.
References
 [1] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
 [2] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images, 2009.
 [3] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam. DianNao: A smallfootprint highthroughput accelerator for ubiquitous machinelearning. In Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’14, pages 269–284, 2014.
 [4] S.S. Sarwar, S. Venkataramani, A. Raghunathan, K. Roy. Multiplierless Artificial Neurons Exploiting Error Resiliency for EnergyEfficient Neural Computing. In Proc. DATE, 2016.
 [5] C.Z. Tang, and H.K. Kwan. Multilayer feedforward neural networks with single powersoftwo weights. In IEEE Transactions on Signal Processing, pages 2724–2727, 1993.

 [6] M. Sankaradas, J. Murugan, V. Jakkula, S. Cadambi, S. Chakradhar, I. Durdanovic, E. Cosatto, and H.P. Graf. A massively parallel coprocessor for convolutional neural networks. In Proc. IEEE ASAP, 2009.
 [7] A. Romero, N. Ballas, S.E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. FitNets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
 [8] D. Soudry, I. Hubara, and R. Meir. Expectation backpropagation: Parameterfree training of multilayer neural networks with continuous or discrete weights. In Proc. NIPS, pages 963–971, 2014.
 [9] S. Hashemi, N. Anthony, H. Tann, R.I. Bahar, and S. Reda. Understanding the Impact of Precision Quantization on the Accuracy and Energy of Neural Networks. In Proc. DATE, 2017.
 [10] P. Gysel, M. Motamedi, S. Ghiasi, “HardwareOriented Approximation of Convolutional Neural Networks,” in ICLR Workshop, 2016.
 [11] H. Tann, S. Hashemi, I. Bahar, S. Reda, “Runtime Configurable Deep Neural Networks for EnergyAccuracy Tradeoff,” in Proceedings of IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, p. 34. 2016.
 [12] K. Hwang, and W. Sung. Fixedpoint feedforward deep neural network design using weights +1, 0, and 1. In IEEE SiPS, 2014.
 [13] M. Courbariaux, Y. Bengio, and J.-P. David. Low precision arithmetic for deep learning. arXiv preprint arXiv:1412.7024, 2014.
 [14] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or −1. arXiv preprint arXiv:1602.02830, 2016.
 [15] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deep Learning with Limited Numerical Precision. In arXiv preprint arXiv:1502.02551, 2015.
 [16] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In arXiv preprint arXiv:1503.02531, 2015.
 [17] C. Bucilua, R. Caruana, and A. Niculescu-Mizil. Model compression. In Proc. of ACM SIGKDD, 2006.
 [18] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong. Optimizing FPGAbased Accelerator Design for Deep Convolutional Neural Networks. In Proc. ACM/SIGDA FPGA, 2015.
 [19] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A.C. Berg, and F. Li. ImageNet large scale visual recognition challenge. In IJCV, pages 211–252, 2015.
 [20] A. Krizhevsky, I. Sutskever, and G.E. Hinton. ImageNet classification with deep convolutional neural networks. In Proc. NIPS, 2012.
 [21] J. Ba, and R. Caruana. Do deep nets really need to be deep? In Proc. NIPS, pages 2654–2662, 2014.