Streaming MANN: A Streaming-Based Inference for Energy-Efficient Memory-Augmented Neural Networks

by Seongsik Park et al.
Seoul National University

With the successful development of artificial intelligence using deep learning, there has been growing interest in its deployment. The mobile environment is the hardware platform closest to everyday life, and it has become an important arena for the success or failure of artificial intelligence. Memory-augmented neural networks (MANNs) are neural networks proposed to handle question-answering (QA) tasks efficiently, making them well-suited to mobile devices. Because a MANN requires various types of operations and recurrent data paths, it is difficult to accelerate its inference on structures designed for other conventional neural network models, which is one of the biggest obstacles to deploying MANNs in mobile environments. To address these issues, we propose the Streaming MANN. This is the first attempt to implement and demonstrate an architecture for energy-efficient inference of MANNs based on the concept of streaming processing. To achieve the full potential of streaming processing, we also propose a novel approach, called inference thresholding, which uses a Bayesian approach that considers the characteristics of natural language processing (NLP) tasks. To evaluate the proposed approaches, we implemented the architecture and method on a field-programmable gate array (FPGA), which is well suited to streaming processing, and measured the execution time and power consumption of inference on the bAbI dataset. The experimental results showed that the energy efficiency (FLOPS/kJ) of the Streaming MANN increased by a factor of up to about 126 compared to an NVIDIA TITAN V GPU, and by up to 140 when inference thresholding was applied.





I Introduction

Deep neural networks (DNNs) require more computing power and storage than most mobile devices can provide, so DNNs for mobile applications are commonly trained and run on remote servers. This limits performance, depends on network availability, and increases maintenance costs, which motivates the development of on-device inference.

In a dataflow architecture (DFA), data goes directly from one processing element to another, reducing the need for energy-consuming memory accesses [1]. Layer-wise parallelization and recurrent paths can be implemented on DFAs, through the use of fine-grained parallelism. DFAs have therefore been used to realize inference on mobile devices [2, 3, 4].

Memory-augmented neural networks (MANNs), which include memory networks [5], are recurrent neural networks (RNNs) with external memory that increases their learning capacity. MANNs require both recursive and memory operations in each layer, making them difficult to parallelize on CPUs or GPUs.

We propose an accelerator for MANNs based on a field-programmable gate array (FPGA), which uses a DFA to realize energy-efficient inference in the domain of natural language processing (NLP), a major application of MANNs. We also introduce a data-based method of maximum inner-product search (MIPS), called inference thresholding, together with an efficient index ordering. This reduces the operation time of the output layer, and hence of inference as a whole, which is particularly important in tasks with many classes, such as NLP tasks.

Our implementation outperformed a GPU in terms of energy efficiency (FLOPS/kJ) by a factor of 126 on the bAbI dataset [6], and by 140 when inference thresholding was also used. The contributions of this paper are as follows:


  • A streaming-based inference architecture for MANNs, which we believe is the first.

  • Fast inference on this hardware using inference thresholding.

  • Implementation and validation of this approach on an FPGA.

Fig. 1: Proposed architecture of the FPGA-based accelerator for MANNs

II Memory-Augmented Neural Networks

MANNs, which are RNNs with more storage, are designed for question answering (QA) and other NLP tasks [5]. A MANN consists of external memory and a controller, and it learns how to read and write information from and to the memory. The memory operations of a MANN can be divided into three types: addressing, write, and read. Content-based addressing is usually employed in MANNs, and can be expressed as follows:

w_t^i = exp(k_t · m_i) / Σ_{j=1}^{N} exp(k_t · m_j),     (1)

where w_t^i is the read weight of the i-th memory element at time t, m_i is the i-th element of the address memory, N is the number of memory elements, and k_t is a read key.

Each memory element stores an embedded sentence vector as follows:

m_i = Σ_j A_{:,x_ij},     (2)

where A is a word-embedding weight whose columns are read by word index, and x_i = (x_i1, …, x_in) is an input sentence consisting of word indices. A memory read begins with the generation of a read key in the memory controller after the previous write. The read key k_t at time t is found as follows:

k_t = q  if t = 1,   k_t = o_{t-1}  if t > 1,     (3)

where q is a question vector, and o_t is an output vector from the controller, which is described as follows:

o_t = R (r_t + k_t),     (4)

where r_t is a read vector, and R is the weight of the controller. The read vector for content-based addressing is generated by a content memory as follows:

r_t = Σ_{i=1}^{N} w_t^i c_i,     (5)

where c_i is the i-th element of the content memory. The predicted label ŷ produced by inference is given by

ŷ = argmax_{1 ≤ i ≤ D} z_i,   z_i = u_i · r_T,     (6)

where U is the weight of the output layer whose i-th row is u_i, D is the dimension of the output vector, and z_i is a logit with index i.
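Taken together, Eqs. 1-6 describe one inference pass over the external memory. As a concrete illustration, the following is a minimal NumPy sketch of these read operations; the hop count, the array shapes, and the function name are our own illustrative assumptions, not part of the hardware design.

```python
import numpy as np

def softmax(s):
    # Numerically stable softmax over a score vector (Eq. 1).
    e = np.exp(s - s.max())
    return e / e.sum()

def mann_read(q, M, C, R, U, hops=3):
    """One inference pass over the external memory (illustrative sketch).

    q : (d,)   embedded question vector (initial read key, Eq. 3)
    M : (N, d) address memory of embedded sentences (Eq. 2)
    C : (N, d) content memory
    R : (d, d) controller weight (Eq. 4)
    U : (D, d) output-layer weight (Eq. 6)
    """
    k = q                          # read key at t = 1
    for _ in range(hops):
        w = softmax(M @ k)         # content-based addressing, Eq. 1
        r = C.T @ w                # read vector, Eq. 5
        o = R @ (r + k)            # controller output, Eq. 4
        k = o                      # recurrent read key for t > 1, Eq. 3
    z = U @ r                      # logits of the output layer, Eq. 6
    return int(np.argmax(z)), z    # predicted label and logits
```

The recurrence (Eq. 3 feeding Eq. 1 again) is what makes the READ path of the hardware recursive.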
III Hardware Architecture

Fig. 1 shows the architecture and data flow of our accelerator, which consists of several modules that receive inference data and the trained model weights (A, R, and U) from a host computer in the form of streams through a FIFO queue. The relevant part of the pre-trained model is passed to each module.

Control signals from the host, embedded in the data, pass to the CONTROL module, which has an inference control component that signals the other modules. For example, in a QA task, context data in the form of sentences x_i, together with the question q, arrive in the input stream (green line in Fig. 1). When this stream is finished, the READ module generates a read key k_t, and the MEM module uses this key to read a vector from the context memory. Reads can be recursive because the READ module is composed of an RNN. After all read operations are complete, the OUTPUT module returns the answer to the question through the FIFO queue to the host.

The INPUT & WRITE modules receive input data from the host and write embedded vectors to the context and address memory in the MEM module. In an NLP task, a discrete and sparse sentence vector (e.g., a bag-of-words) is converted into a dense embedded vector by the embedding layer. If the input to a MANN consists of word indices, the efficiency of embedding in the INPUT & WRITE module can be improved, as shown in Eq. 2: the embedding module only needs to read the columns of the embedding weight corresponding to the indices of the input words. This reduces both the number of memory accesses needed to read the embedding weights and the number of multiplications needed to calculate the embedding vector, which improves energy efficiency.
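This optimization can be sketched as follows; the function names and shapes are illustrative assumptions. Index-based embedding reads only the needed columns of the embedding weight, whereas the dense formulation touches every column:

```python
import numpy as np

def embed_dense(A, bow):
    # Naive embedding: full matrix-vector product with a sparse
    # bag-of-words vector; every column of A is read and multiplied.
    return A @ bow

def embed_by_index(A, word_indices):
    # Index-based embedding (Eq. 2): read only the columns of A for
    # the words that actually occur; no multiplications are needed.
    return A[:, word_indices].sum(axis=1)
```

For a 0/1 bag-of-words the two are equivalent, but the index form performs one column read per word in the sentence instead of one per vocabulary entry.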

The MEM module consists of the address memory, which is content-addressable (Eq. 1), and the context memory, which generates a read vector by soft addressing based on the attention w_t obtained from the address memory (Eq. 5). The address and context memory together store the embedded vectors from the INPUT & WRITE module. Content-based addressing requires costly operations such as softmax, which incurs an exponentiation and a division that cannot be parallelized efficiently on an FPGA. The MEM module is therefore implemented with element-wise sequential operations that can exploit fine-grained parallelism.
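The element-wise style can be illustrated with a numerically stable, three-pass softmax that touches one element at a time, as a streaming pipeline would. This is a software sketch of one reasonable streaming decomposition, not the exact hardware implementation:

```python
import math

def streaming_softmax(scores):
    # Pass 1: running maximum (for numerical stability).
    m = -math.inf
    for s in scores:
        m = max(m, s)
    # Pass 2: running sum of exponentials.
    total = 0.0
    for s in scores:
        total += math.exp(s - m)
    # Pass 3: emit the normalized weights one element at a time.
    return [math.exp(s - m) / total for s in scores]
```

Each pass reads the score stream once, so no step ever needs the whole vector in flight at the same time.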

The READ module is an RNN, and the OUTPUT module is a fully connected neural network. The READ module generates the read key k_t, which is used to calculate the attention w_t in the MEM module, and receives a read vector r_t from the MEM module (Eqs. 3 and 5). The blue line in the READ module in Fig. 1 shows how a recurrent read path can be implemented efficiently.

The OUTPUT module predicts the label by multiplying the read vector and the weight matrix of the output layer U, as shown in Eq. 6. Matrix multiplication is implemented as a series of dot products because the hardware resources are insufficient to parallelize it directly. In the OUTPUT module the logit of each index is calculated sequentially to find the maximum logit; this takes up a large share of the inference time.
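The sequential scan can be sketched as follows; this is a software model of the OUTPUT module's dataflow, and the function name is ours:

```python
def sequential_mips(U, r):
    # Conventional MIPS: compute each logit z_i = u_i . r in turn and
    # track the running maximum; the time grows linearly with the
    # number of output classes D.
    best_i, best_z = -1, float("-inf")
    for i, u_i in enumerate(U):
        z_i = sum(u * x for u, x in zip(u_i, r))
        if z_i > best_z:
            best_i, best_z = i, z_i
    return best_i, best_z
```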

IV Fast Inference Method

IV-A Inference Thresholding

Input: training dataset D_train = {(x, y)}, inference data x*
Output: predicted label ŷ
Notation: z: vector of logits; z_i: logit value at the i-th index; W: pre-trained model; D: dimension of the output vector; c_th: thresholding constant; H_i^+: histogram of z_i when y = i; H_i^-: histogram of z_i when y ≠ i

Step 1: Estimate the logit distributions
  for (x, y) in D_train:
    z ← forward pass of W on x
    for i in 1..D:
      if y = i: add z_i to H_i^+ else: add z_i to H_i^-
  for i in 1..D:
    estimate p(z_i | y = i) and p(z_i | y ≠ i) from H_i^+ and H_i^- by kernel density estimation

Step 2: Set the inference thresholds
  for i in 1..D:
    θ_i ← min{ z_i : P(y = i | z_i) > c_th }

Step 3: Set the efficient index order
  for i in 1..D:
    s_i ← avg. silhouette coefficient of the z_i clusters
  sort the indices by s_i in descending order

Step 4: Inference thresholding
  do forward pass of W on x* until the output layer
  for i in the sorted index order:
    z_i ← u_i · r_T
    if z_i > θ_i: return ŷ = i
  return ŷ = argmax_i z_i
Algorithm 1 Inference Thresholding

A MANN implemented on a DFA can exploit fine-grained parallelism in each layer. However, in an NLP task the dimension D of the output is much larger than the embedding dimension, making it difficult to parallelize the operations of the output layer [7]. Thus, when calculating a logit in the output layer, we must sequentially calculate the dot product of the input vector and the row of the weight matrix corresponding to each index in the OUTPUT module, as shown in Fig. 2-(a). Because the operation time of the output layer is O(D), the inference time increases with D.

We implement the output layer sequentially, but limit the computation required by introducing inference thresholding (Algo. 1). We approximate the MIPS by speculating that, if a logit z_i exceeds its threshold θ_i, the index i will be the predicted label ŷ. If we can conjecture with sufficient confidence that the logit of index i is the maximum, then we need not compute the remaining logits.
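A software sketch of this speculative scan follows; the per-index thresholds correspond to Step 2 of the algorithm, and the fallback to a full argmax when no logit clears its threshold is our reading of the method:

```python
def thresholded_mips(U, r, thresholds):
    # Inference thresholding: speculate that index i is the predicted
    # label as soon as its logit clears its per-index threshold,
    # skipping the remaining dot products.
    best_i, best_z = -1, float("-inf")
    for i, u_i in enumerate(U):
        z_i = sum(u * x for u, x in zip(u_i, r))
        if z_i > thresholds[i]:
            return i                # early exit: speculative label
        if z_i > best_z:
            best_i, best_z = i, z_i
    return best_i                   # no early exit: full argmax
```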

Inference thresholding was motivated by observing the logit distributions of a trained model, in which the logits fit mixture models, as shown in Fig. 2-(b). To predict whether a logit z_i is the maximum of all logits, we consider two distributions: one in which z_i is the maximum, and one in which it is not.

On this basis we can estimate the conditional probability density functions (PDFs) p(z_i | y = i) and p(z_i | y ≠ i) for the training label y by kernel density estimation (Step 1 in Algo. 1). The PDFs of the inference dataset can be approximated by those obtained from the training dataset. By applying Bayes' theorem to the approximated PDFs, we can obtain the posteriors of the logits for the inference dataset as follows:

P(y = i | z_i) = p(z_i | y = i) P(y = i) / [ p(z_i | y = i) P(y = i) + p(z_i | y ≠ i) P(y ≠ i) ],

where P(y = i) is the probability that the index i is a training label.

To apply the estimated probabilities to the inference process in the output layer, we compare each logit z_i with a threshold θ_i, which is the smallest logit value whose estimated posterior probability exceeds a given bound:

θ_i = min{ z_i : P(y = i | z_i) > c_th },

where c_th is a thresholding constant (Step 2 in Algo. 1). This yields a speculative value for the label.
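Steps 1-2 for a single output index can be sketched as follows, using Gaussian kernel density estimates in place of the histogram-based PDFs. The function name, the evaluation grid, and the >= comparison against c_th are our own assumptions:

```python
import numpy as np
from scipy.stats import gaussian_kde

def fit_threshold(pos_logits, neg_logits, prior_pos, c_th=0.9, grid=400):
    """Estimate the threshold theta_i for one output index i (sketch).

    pos_logits: z_i observed on training samples with y = i
    neg_logits: z_i observed on training samples with y != i
    prior_pos:  P(y = i) estimated on the training set
    """
    p_pos = gaussian_kde(pos_logits)      # p(z_i | y = i), Step 1
    p_neg = gaussian_kde(neg_logits)      # p(z_i | y != i), Step 1
    lo = min(np.min(pos_logits), np.min(neg_logits))
    hi = max(np.max(pos_logits), np.max(neg_logits))
    zs = np.linspace(lo, hi, grid)
    # Bayes' theorem: posterior probability that i is the label given z_i.
    num = p_pos(zs) * prior_pos
    post = num / (num + p_neg(zs) * (1.0 - prior_pos))
    above = zs[post >= c_th]              # Step 2: smallest qualifying logit
    return float(above.min()) if above.size else float("inf")
```

An index whose posterior never reaches c_th receives an infinite threshold and simply never triggers an early exit.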

Fig. 2: MIPS in the OUTPUT module: (a) the conventional method needs to compare all logits; (b) inference thresholding stops the comparison if z_i > θ_i.
Fig. 3: Evaluation of the effect of inference thresholding and index ordering, in terms of accuracy and the number of comparisons required in the MIPS against the threshold constant c_th, on the bAbI dataset (ITH = inference thresholding).

IV-B Efficient Index Order for Inference Thresholding

Inference thresholding is quicker if we order the logits so that those for which thresholding is most effective come first (Fig. 2). Thresholding a logit z_i can be seen as determining whether it belongs to the class y = i. From this perspective, inference thresholding will be more effective for a logit with a long inter-class distance and a short intra-class distance. We therefore sort the indices into descending order of the silhouette coefficient [8] of their logit distributions (Step 3 in Algo. 1).
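Using the standard silhouette definition on the two logit clusters (y = i versus y ≠ i), the ordering of Step 3 can be sketched as follows; the helper names and the hand-rolled one-dimensional silhouette are our own:

```python
import numpy as np

def silhouette_1d(values, member):
    # Mean silhouette coefficient of two clusters of 1-D points:
    # a(x) = mean distance to the same cluster (excluding x),
    # b(x) = mean distance to the other cluster,
    # s(x) = (b - a) / max(a, b).
    vals = np.asarray(values, dtype=float)
    own, other = vals[member], vals[~member]
    s = []
    for group, rest in ((own, other), (other, own)):
        for x in group:
            a = np.abs(group - x).sum() / max(len(group) - 1, 1)
            b = np.abs(rest - x).mean()
            s.append((b - a) / max(a, b))
    return float(np.mean(s))

def efficient_index_order(logits, labels):
    # Sort output indices by descending silhouette of the two logit
    # clusters (y == i vs y != i): well-separated indices come first,
    # so inference thresholding can exit earlier (Step 3).
    D = logits.shape[1]
    scores = [silhouette_1d(logits[:, i], labels == i) for i in range(D)]
    return np.argsort(scores)[::-1]
```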

The effect of inference thresholding and index ordering is depicted in Fig. 3. As the threshold constant c_th decreases, the MIPS requires fewer comparisons but accuracy declines. Ordering improves both accuracy and speed.

V Experimental Results

Fig. 4: Energy efficiency of inference on the bAbI dataset on various configurations compared with the GPU (ITH = inference thresholding).

We implemented the accelerator and measured inference time and power consumption on an Intel Core i9-7900X CPU, on an NVIDIA TITAN V GPU, and on a Xilinx Virtex UltraScale VCU107 FPGA, with the GPU and FPGA linked to the same host CPU.

Time and power measurements were made for the 20 tasks of the bAbI QA dataset [6]. Timings, which included transmission of the pre-trained model and inference data to the GPU and FPGA, were repeated 100 times; power measurements were made over five minutes. We ran the FPGA at 25, 50, 75, and 100 MHz to evaluate the effect of the host-FPGA interface. We set the thresholding constant c_th to 1.0, which reduced accuracy by less than 0.1%.

Averaged timings and power measurements are listed in Table I. Running on the FPGA, the accelerator took less time at higher frequencies, as we would expect, but the improvement was not linear. Inference thresholding reduced timings by 6-18%, depending on frequency. The accelerator ran between 5.2 and 7.5 times faster than the GPU, and between 5.6 and 8.0 times faster than the CPU. The GPU used the most power, and the FPGA running at 25 MHz the least. The CPU used 1.7 times less energy than the GPU, and the FPGA used 74 times less, or 140 times less using inference thresholding.

Results on individual tasks are shown in Fig. 4, again normalized to the performance of the GPU. The FPGA implementation was the most energy-efficient across all tasks, and inference thresholding increased the margin.

Table I: Average measurement results, speedup, and energy efficiency of inference on the bAbI dataset (speedup and FLOPS/kJ normalized to the result on the GPU).

Configuration | Time (s) | Power (W) | Speedup | FLOPS/kJ
CPU           |  242.77  |   23.28   |  0.94   |   1.70
GPU           |  226.90  |   45.36   |  1.00   |   1.00
FPGA
  25 MHz      |   43.54  |   14.71   |  5.21   |  83.74
  50 MHz      |   34.95  |   17.53   |  6.49   | 109.06
  75 MHz      |   31.96  |   19.02   |  7.10   | 120.24
  100 MHz     |   30.28  |   20.10   |  7.49   | 126.72
FPGA + inference thresholding
  25 MHz      |   35.36  |   17.36   |  6.42   | 107.61
  50 MHz      |   30.81  |   20.11   |  7.36   | 122.35
  75 MHz      |   29.18  |   20.18   |  7.78   | 135.87
  100 MHz     |   28.53  |   20.53   |  7.95   | 139.75

Inference thresholding is more beneficial at low operating frequencies. As the frequency increases, inference time is dominated by the interface between the host and the FPGA. If this were not the case, we estimate that our approach would use 162 times less energy than the GPU.

Inference thresholding did not have a significant effect on inference time when the model ran on the CPU or GPU: on the CPU, the output layer represents only a small part of the computation, and the GPU can process the output layer in parallel.

VI Related Work

VI-A DNN Inference Accelerators

Hardware matrix multiplications can reduce inference times for CNN models [9, 2]. Several architectures [2, 4, 3] have been introduced for different types of RNN, such as LSTMs and GRUs. These accelerators save energy, but are not readily extensible to the memory operations required in MANNs. A method of accelerating inference of MANNs has been studied [10], but it has not been implemented in hardware.

VI-B Maximum Inner-Product Search

In applications with large search spaces, including NLP, MIPS takes a long time. Hence, approximations using hashing [11] or clustering [12] have been proposed. Some of these approaches, including sparse access memory [13] and hierarchical memory networks [14], have also been used to accelerate memory reads and writes in MANNs. However, these techniques may be too slow to be used in the output layer of a DNN in resource-limited environments.

VII Conclusion

We believe that the DFA-based approach reported in this paper, and its implementation on an FPGA, represent the first attempt at energy-efficient inference specifically for MANNs. We also introduce a method of speculating about the inference result which avoids computations that are difficult to parallelize. This reduces computation time and saves energy at an extremely small cost in accuracy. We believe that this work shows how inference tasks such as QA may be performed on mobile devices. We also expect that our data-based MIPS will find applications in large-class inference.


This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (Ministry of Science and ICT) [2016M3A7B4911115, 2018R1A2B3001628], the Strategic Initiative for Microbiomes in Agriculture and Food (Ministry of Agriculture, Food and Rural Affairs) [918013-4], and the Brain Korea 21 Plus Project in 2018.