End-to-end learning with deep neural networks (DNNs) has taken center stage in the past few years, achieving state-of-the-art performance in multiple domains including computer vision [Szegedy et al.2017], text analysis [Sutskever et al.2014, Jean et al.2015] and speech recognition [Xiong et al.2016]. End-to-end learning maps input to output based solely on data. However, in many tasks, enabling a DNN to interact with an existing program, knowledge source, or application through its Application Programming Interface (API) can lead to a superior solution. An API defines a functionality, which is accessed via its input and output parameters. For example, consider the simple question: “Is 7.2 greater than 4.5?”. We could answer this question by letting a DNN handle the natural language part and accessing a simple API to solve the logical part. We refer to this as a hybrid solution. This type of solution can outperform pure DNN models by using analytic techniques and closed algorithms already implemented in external applications. Such external applications do not require learning, so using them should reduce the total amount of training data needed.
Training a DNN to interact with an external application through its API poses an inherent difficulty: most DNN training procedures rely on different variants of gradient back-propagation. Therefore, it is only possible to train them end-to-end if all solution parts are implemented as neural networks, or at least can return a valid gradient. Consequently, all parts of the overall solution must be differentiable.
One of the approaches that enable DNNs to make use of external application functionality is that of the Neural Programmer [Neelakantan et al.2016]. The authors implement differentiable functions that approximate the desired functionality and embed them in the network, thus bypassing the call to the external application. Another option [Liang2016] is a two-step approach that includes program induction or program generation. First, the system translates the task’s input into a program, i.e., a sequence of API calls. Second, it executes the program on an external engine. Reinforcement Learning (RL) is another possible approach, which learns a policy: what actions should be taken and when. When learning a policy, RL lets the environment handle the external application functionality and enables the DNN to learn the policy by splitting the learning process into a sequence of discrete decisions. Such splitting prevents the RL process from using gradient guidance.
Our approach, which we refer to as Estimate and Replace, integrates existing non-differentiable applications into DNNs. We use an estimator subnetwork, which we call EstiLayer, to estimate each of the non-differentiable applications. We then train the whole network, which we call EstiNet, to comply with the EstiLayer interface during an end-to-end optimization process. At inference time, we replace each EstiLayer with its external application counterpart and let the EstiNet model solve the task by accessing external application interfaces as needed.
Estimate and Replace streamlines the integration process. It relies on existing training data, and replacing an EstiLayer with its external application counterpart, rather than keeping it within the DNN as is, leads to better results at inference time. The key to this approach is a multi-task training process that forces parts of the network, the EstiLayers, to conform to a predefined functionality, while training other parts of the network to interface with the EstiLayers correctly. Training an internal layer within a DNN to execute a predefined function poses several challenges, which are described in more detail in Section 3.1.
Our work advances existing research with the following main contributions: 1) A new approach to combining DNNs with existing applications. 2) A new training method that supports end-to-end optimization while constraining part of the network to conform to a predefined functionality. 3) Evidence that, with this approach, a DNN can be trained end-to-end using less data than a similar architecture without component estimation and replacement.
2 Related Work
Task-specific architectures for end-to-end deep learning require large datasets and work very well when such data is available, as in the case of neural machine translation [Bahdanau et al.2014]. General purpose end-to-end architectures, suitable for multiple tasks, include the Neural Turing Machine [Graves et al.2014] and its successor, the Differentiable Neural Computer (DNC) [Graves et al.2016]. There is no external application integration in these architectures. Other architectures, such as the Neural Programmer architecture [Neelakantan et al.2016], allow end-to-end training while constraining parts of the network to execute predefined operations by re-implementing specific operations as static differentiable components. This approach has two drawbacks: it requires re-implementation of the API in a differentiable way, which may be difficult, and it lacks the accuracy and possible scalability advantages of an external API.
Program Induction and Program Generation: Program induction is a different approach to interaction with external APIs. The goal is to construct a program comprising a series of operations based on the input, and then execute the program to get the results. When the input is a natural language query, as in our focus, it is possible to use semantic parsing to transform the query into a logical form that describes the program [Liang2016]. Early works required natural language query-program pairs to learn the mapping. Recent works (e.g., [Pasupat and Liang2015]) require only query-answer pairs for training.
Other approaches include neural network-based program induction [Andreas et al.2016], translation of a query into a program using sequence-to-sequence deep learning methods [Lin et al.2017], and learning the program from execution traces [Reed and De Freitas2015, Cai et al.2017]. One major difficulty of neural methods is the need to make a discrete selection at each step in the program. Some works (e.g., [Andreas et al.2016]) overcome this difficulty by substituting the real gradient with an estimate of the gradient using the REINFORCE algorithm [Williams1992].
Reinforcement Learning: Learning to execute the right operation can be viewed as a reinforcement learning problem. For a given input, the agent has to select the optimal action from a set of available actions. The action selection repeats following feedback based on the previous action selection. Earlier works that took this approach include [Branavan et al.2009], and later [Artzi and Zettlemoyer2013]. Recently, [Zaremba and Sutskever2015] proposed a reinforcement extension to Neural Turing Machines [Graves et al.2014]. In [Tamar et al.2016], the authors pose a value iteration based solution for reinforcement learning tasks as an end-to-end learning task with a Value Iteration Network (VIN). VINs are shown to learn how to plan a sequence of actions for a given task.
3 Estimate and Replace
In this section we present Estimate and Replace. Our approach enables a DNN to interact with external, non-differentiable applications, as illustrated in Figure 1. At the heart of this approach lie estimator subnetworks, or EstiLayers, which we use to estimate each external application. The EstiLayers encourage the DNN model, which we refer to as an EstiNet, to comply with the predefined EstiLayer interface during an end-to-end optimization process. At inference time, we replace each EstiLayer with its external application counterpart, and let the EstiNet access it as needed. By replacing EstiLayers with external applications that share the same interface, we can use the strengths of each computational component, namely the DNN and the external application, in an accurate and efficient way.
3.1 Interface Learning Challenges
The Estimate and Replace approach requires a DNN to learn an application interface during training and use it at inference time. This requirement raises three significant training challenges:
Selecting the right interface: To successfully complete a task, we have to train the DNN to access the right API at the right time. Selecting from a collection of APIs is a discrete operation and thus non-differentiable. This inherently discrete API selection poses an immediate difficulty for the end-to-end training process.
Constraining the interface parameters: To replace an EstiLayer with its application counterpart at inference time, we have to constrain its interface to comply with the API input and output parameters. Constraining an EstiLayer to an input and output definition as part of an end-to-end training process poses another challenge.
Executing a sequence of API calls: Most tasks require the execution of a sequence of API calls for successful completion. Executing a call sequence poses an additional burden on the training process. The network now needs to consume the output of an API call and adjust it to conform with the input of the successive call. Addressing this additional network orchestration difficulty is beyond the scope of this work. We plan to confront this challenge in future work.
3.2 The EstiNet Model
To address the abovementioned challenges, EstiNet defines an abstraction with three conceptual subnetworks: 1) Input representation 2) Selectors and 3) EstiLayers. These three subnetworks provide the main building blocks required to solve a specific task. Next, we describe the main functionality of each of these three subnetworks.
3.2.1 Input Representation
EstiNet uses an input representation subnetwork to represent the task input. The task data determines the exact architecture of this subnetwork. The input representation architecture allows the selectors to extract the needed information from the input.
3.2.2 Selectors
EstiNet uses selector subnetworks for two complementary tasks: (1) to figure out the proper API call and (2) to extract the proper API arguments from the input. The selectors assign a probability distribution over a set of classes, and use Gumbel Softmax [Jang et al.2017] to facilitate discrete selections. The task data and the available APIs define the exact number of selectors and their specific output.
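The discrete selection step can be illustrated with a small, framework-free sketch of Gumbel Softmax sampling. It is an illustration only, not the EstiNet implementation: the function names, the temperature value, and the use of plain Python lists are assumptions, and the hard mode here simply takes the arg-max of the soft sample (an autodiff framework would typically pair it with a straight-through gradient).

```python
import math
import random
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [value / temperature for value in logits]
    top = max(scaled)
    exps = [math.exp(value - top) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def sample_gumbel(rng: random.Random) -> float:
    """Draw Gumbel(0, 1) noise by inverse transform sampling."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
    return -math.log(-math.log(u))


def gumbel_softmax(logits: List[float], temperature: float,
                   rng: random.Random, hard: bool = False) -> List[float]:
    """Soft mode returns a relaxed sample; hard mode returns a one-hot vector."""
    noisy = [value + sample_gumbel(rng) for value in logits]
    soft = softmax(noisy, temperature)
    if not hard:
        return soft
    winner = max(range(len(soft)), key=soft.__getitem__)
    return [1.0 if i == winner else 0.0 for i in range(len(soft))]


rng = random.Random(0)
selector_logits = [2.0, 0.5, 0.1]  # hypothetical scores over three classes
soft_selection = gumbel_softmax(selector_logits, temperature=0.5, rng=rng)
hard_selection = gumbel_softmax(selector_logits, temperature=0.5, rng=rng, hard=True)
```

The soft output is a probability vector that can carry gradients during training, while the hard output is the kind of one-hot choice an external API call needs.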
3.2.3 EstiLayers
EstiNet uses EstiLayer subnetworks to estimate an API and its functionality. The main role of the EstiLayer is to estimate an external, non-differentiable application with a differentiable subnetwork, allowing us to train the network end-to-end with stochastic gradient descent (SGD). Forcing an EstiLayer to estimate the application’s functionality and comply with its API allows us to replace it with the actual application at inference time, while keeping the overall functionality intact. In this work, we implement all EstiLayers with a general purpose DNN architecture [Vaswani et al.2017]. EstiLayers are not limited to non-differentiable applications and are also valuable for estimating complex differentiable APIs. Using EstiLayers eliminates the need to directly implement a complex API, by allowing the network to learn the functionality from the data. We can then use the external application at inference time to achieve better performance.
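The replacement idea can be sketched as two callables that share one interface, with a switch that picks the estimator in training or test mode and the real application at inference time. This is a minimal sketch under stated assumptions: `ReplaceableLayer`, `greater_than_api`, and `greater_than_estimator` are hypothetical names, and the "estimator" is a hand-written stand-in that is only accurate on values up to 100, mimicking a learned subnetwork that does not generalize beyond its training range.

```python
from typing import Callable, List

# Shared interface: (column of numbers, scalar argument) -> binary row mask.
LogicFn = Callable[[List[float], float], List[int]]


def greater_than_api(column: List[float], arg: float) -> List[int]:
    """Exact external application (stands in for a non-differentiable API)."""
    return [1 if value > arg else 0 for value in column]


def greater_than_estimator(column: List[float], arg: float) -> List[int]:
    """Hypothetical stand-in for a trained EstiLayer.

    It behaves correctly only for values up to 100 and saturates above that.
    """
    return [1 if min(value, 100.0) > min(arg, 100.0) else 0 for value in column]


class ReplaceableLayer:
    """Holds an estimator and the application it estimates."""

    def __init__(self, estimator: LogicFn, application: LogicFn) -> None:
        self.estimator = estimator
        self.application = application

    def __call__(self, column: List[float], arg: float,
                 inference: bool = False) -> List[int]:
        # Same inputs and outputs either way; only the implementation changes.
        fn = self.application if inference else self.estimator
        return fn(column, arg)


layer = ReplaceableLayer(greater_than_estimator, greater_than_api)
column = [310.0, 395.0, 342.0]  # values outside the estimator's reliable range
test_mode_mask = layer(column, 350.0)
inference_mask = layer(column, 350.0, inference=True)
```

Because the surrounding network only depends on the interface, swapping the implementation leaves the rest of the model untouched; here the estimator fails on out-of-range values while the application answers exactly.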
3.3 Auxiliary Functionality
Enabling EstiNet to interact with an external application requires additional functionality, which we describe below.
3.3.1 Number Representation
Numerical APIs (e.g., greater-than) use numbers as their arguments. Forcing an EstiLayer and a numerical API to have the same interface necessitates a compatible number representation. To this end, EstiNet must represent a number in a format that can be translated to and from a concrete number. Moreover, many tasks require representations for both numbers and words. It may be desirable to embed both representations into the same vector space. Previous works either represent numbers as is [Ling et al.2017] or replace them with a predefined token. In this work, we use a number-embedding approach that allows EstiNet to handle numbers and words interchangeably.
3.3.2 Adaptation Function
EstiNet uses adaptation functions to adapt selector output to the required API input, and to adapt API output to the required input of higher network layers at inference time. An adaptation function may be non-differentiable; thus, it cannot be used during back-propagation. Nevertheless, EstiNet uses these functions for label generation during training (see Section 3.4.2 for more details).
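As a rough illustration of an adaptation function, the sketch below turns a hard (one-hot) selection over query tokens into the concrete number an API expects, and then calls the API to produce a label. The token list, the function names, and the choice of a greater-than API are assumptions made for this example only.

```python
from typing import List


def adapt_argument(hard_selection: List[float], query_tokens: List[str]) -> float:
    """Map a one-hot selection over query tokens to a concrete number.

    This step is not differentiable (arg-max plus string parsing), which is
    why it can be used to generate labels but not inside back-propagation.
    """
    index = max(range(len(hard_selection)), key=hard_selection.__getitem__)
    return float(query_tokens[index])


def greater_than_api(column: List[float], arg: float) -> List[int]:
    """Hypothetical external API: marks the rows whose value exceeds arg."""
    return [1 if value > arg else 0 for value in column]


query_tokens = ["won", "more", "than", "7", "gold", "medals"]
hard_selection = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]  # the selector picked "7"

api_argument = adapt_argument(hard_selection, query_tokens)
api_label = greater_than_api([10.0, 3.0, 8.0, 7.0], api_argument)
```

The resulting `api_label` is the kind of target an estimator's output could be compared against while training.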
3.4 Training Procedures
The procedure of training a DNN end-to-end, which we refer to as plain training, is carried out by providing task labels as supervision. This supervision allows the model to learn the input-to-output mapping end-to-end, but does not provide any guidance for learning an application interface. We use plain training as our baseline and examine two enhanced training procedures to support the Estimate and Replace approach: offline and online training. Next, we describe these two procedures in detail.
3.4.1 Offline Training
Offline training is a two-step procedure. First, we train an EstiLayer to estimate an application’s functionality. We create training data by generating application input and recording the application’s output. Second, we load the trained EstiLayer into the EstiNet model and train the EstiNet end-to-end while keeping the EstiLayer parameters fixed. In a variant that we refer to as offline-trainable, we allow the optimizer to update the EstiLayer parameters during the end-to-end training process. This changes the EstiLayer functionality and, with it, the learned interface with the selector subnetworks. Consequently, we expect performance at inference time to decline. We report the results of these two training procedures in Section 5.2.
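The first offline step, generating application input and recording its output, can be sketched as follows. The sampling ranges, the column length, and the names `generate_estimator_data` and `greater_than_api` are assumptions for illustration; any black-box function with the same signature could be plugged in.

```python
import random
from typing import Callable, List, Tuple

Sample = Tuple[Tuple[List[float], float], List[int]]


def greater_than_api(column: List[float], arg: float) -> List[int]:
    """Hypothetical external application used as the labeling oracle."""
    return [1 if value > arg else 0 for value in column]


def generate_estimator_data(api: Callable[[List[float], float], List[int]],
                            n_samples: int, column_len: int,
                            low: float, high: float, seed: int = 0) -> List[Sample]:
    """Sample random API inputs and record the API output as the label."""
    rng = random.Random(seed)
    data: List[Sample] = []
    for _ in range(n_samples):
        column = [rng.uniform(low, high) for _ in range(column_len)]
        arg = rng.uniform(low, high)
        data.append(((column, arg), api(column, arg)))
    return data


dataset = generate_estimator_data(greater_than_api, n_samples=100,
                                  column_len=25, low=1.0, high=100.0)
```

Each `(input, label)` pair can then supervise an estimator without any task-level data, since the application itself provides the ground truth.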
3.4.2 Online Training
Online training is a multi-task training procedure that jointly trains the EstiLayers and the whole EstiNet model. It calculates an additional loss value, based on a label that it generates online from the external application. Let $x$ be some model input sample. Let $z_{soft}$ and $z_{hard}$ be the selectors’ input to the EstiLayer interface while in soft and hard selection modes, respectively. We then define the EstiNet model prediction to be $\hat{y}_{model}$, the soft online prediction to be $\hat{y}_{soft}$, and the hard online prediction to be $\hat{y}_{hard}$. Furthermore, we define $y$ to be the task label for input sample $x$ and $y_{api}$ to be the API label for the API input sample $z_{hard}$. We can now define the total loss to be:
$$\mathcal{L}_{total} = \alpha \, \mathcal{L}_{task} + \beta \, \mathcal{L}_{api}$$
where $\alpha$ and $\beta$ are hyperparameters of the model, $\mathcal{L}_{task}$ compares $\hat{y}_{model}$ with $y$, and $\mathcal{L}_{api}$ compares the online predictions with $y_{api}$.
Figure 2 presents a schematic diagram of the online training procedure. A variant of the online training procedure is online-pretraining; here, we start by training the EstiLayers as in offline training and then use them during online training. This procedure yielded the best EstiNet performance and, as such, is our recommended training procedure.
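A multi-task loss of this kind can be sketched in a few lines. Everything specific here is an assumption made for illustration: the use of binary cross-entropy for both terms, the weight names `alpha` and `beta`, and the function names. The sketch only shows the general shape, a task term plus an API term whose label comes from calling the external application.

```python
import math
from typing import List


def binary_cross_entropy(pred: List[float], target: List[int],
                         eps: float = 1e-7) -> float:
    """Mean binary cross-entropy between probabilities and 0/1 targets."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(pred)


def online_total_loss(model_pred: List[float], task_label: List[int],
                      estimator_pred: List[float], api_label: List[int],
                      alpha: float, beta: float) -> float:
    """Weighted sum of a task loss and an API (interface) loss."""
    task_loss = binary_cross_entropy(model_pred, task_label)
    api_loss = binary_cross_entropy(estimator_pred, api_label)
    return alpha * task_loss + beta * api_loss


task_label = [1, 0, 1, 0]
api_label = [1, 0, 1, 0]  # assumed to be produced by the external API

good = online_total_loss([0.9, 0.1, 0.8, 0.2], task_label,
                         [0.95, 0.05, 0.9, 0.1], api_label, alpha=1.0, beta=0.5)
bad = online_total_loss([0.4, 0.6, 0.5, 0.5], task_label,
                        [0.5, 0.5, 0.5, 0.5], api_label, alpha=1.0, beta=0.5)
```

The second term is what pushes the estimator toward the application's behavior even when the task loss alone would be satisfied by some other internal interface.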
3.5 Performance Assessment
We assessed EstiNet in two different modes: test mode and inference mode. In test mode, we evaluated the learned network, with its EstiLayers in place, on a dedicated test set. In inference mode, we replaced each EstiLayer with its external application counterpart and set the selectors to perform hard selections. We then measured EstiNet accuracy on the test set. EstiNet demonstrated significant performance improvement in inference mode compared to test mode. This improvement reveals the main potential advantage of EstiNet over existing approaches.
4 Experiment Details
We applied the Estimate and Replace approach to a generated table-based, question-answering task (TAQ). We ran a supporting experiment on an auxiliary dataset to demonstrate the ability of Estimate and Replace to learn from less data. In the following section, we provide the TAQ task description and a detailed implementation of the EstiNet model that solves it.
4.1 The TAQ Task
For the TAQ task we generated a table-based question answering dataset. The TAQ dataset input has two parts: a question and a table. To correctly answer a question from this dataset, the DNN has to access the right table column and apply non-differentiable logic on it using a parameter it extracts from the query. For example, consider a table that describes the number of medals won by each country during the last Olympics, and a query such as: “Which countries won more than 7 gold medals?” To answer this query the DNN has to extract the argument (7 in this case) from the query, access the relevant column (namely, gold medals), and execute the greater-than operation with the extracted argument and column content (namely a vector of numbers) as its parameters. The operation’s output vector holds the indexes of the rows that satisfy the logic condition (greater-than in our example). The final answer contains the names of the countries (i.e., from the countries column) in the selected rows.
4.1.1 TAQ API
Solving the TAQ task requires five basic logic functions: equal-to, less-than, greater-than, max, and min. Each such function defines an API that is composed of two inputs and one output. The first input is a vector of numbers, namely, a column in the table. The second is a scalar, namely, an argument from the question or NaN if the scalar parameter is not relevant. The output is one binary vector, the same size as the input vector. The output vector indicates the selected rows for a specific query and thus provides the answer.
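The interface described above can be sketched as five functions with the same signature. This is an illustrative reading, not the authors' code: the function and dictionary names are hypothetical, and treating max and min as "mark the rows that hold the extreme value" is an assumption consistent with the binary-vector output.

```python
import math
from typing import Callable, Dict, List

# (column of numbers, scalar argument or NaN) -> binary vector of the same size
LogicFn = Callable[[List[float], float], List[int]]


def equal_to(column: List[float], arg: float) -> List[int]:
    return [int(value == arg) for value in column]


def less_than(column: List[float], arg: float) -> List[int]:
    return [int(value < arg) for value in column]


def greater_than(column: List[float], arg: float) -> List[int]:
    return [int(value > arg) for value in column]


def max_rows(column: List[float], arg: float = math.nan) -> List[int]:
    top = max(column)  # the scalar argument is not relevant and is ignored
    return [int(value == top) for value in column]


def min_rows(column: List[float], arg: float = math.nan) -> List[int]:
    bottom = min(column)  # the scalar argument is not relevant and is ignored
    return [int(value == bottom) for value in column]


TAQ_API: Dict[str, LogicFn] = {
    "equal-to": equal_to,
    "less-than": less_than,
    "greater-than": greater_than,
    "max": max_rows,
    "min": min_rows,
}


def answer(names: List[str], mask: List[int]) -> List[str]:
    """Read the entity names of the selected rows."""
    return [name for name, keep in zip(names, mask) if keep]


countries = ["A", "B", "C", "D"]        # hypothetical entity column
gold_medals = [10.0, 3.0, 8.0, 7.0]     # hypothetical numeric column
mask = TAQ_API["greater-than"](gold_medals, 7.0)
selected = answer(countries, mask)
```

A uniform signature is what lets a single selector choose among the operations while the column and argument selectors feed the same two inputs to whichever operation is chosen.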
4.1.2 TAQ Data
We generated tables in which the first row contains column names and the first column contains a list of entities (e.g., countries, teams, products, etc.). Subsequent columns contain the quantitative properties of an entity (e.g., population, number of wins, prices, discounts, etc.). Each TAQ-generated table consists of 5 columns and 25 rows. We generated entity names (e.g., nations and clubs) for the first column by randomly selecting from a closed list. We generated values for the rest of the columns by sampling from a uniform distribution. We sampled values between 1 and 100 for the train set tables, and between 300 and 400 for the test set tables. We further created 2 sets of randomly generated questions that use the 5 API functions. Together, the sets include 20,000 train questions on the train tables and 4,000 test questions on the test tables.
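The table generation described above might look like the following sketch. Several details are assumptions not stated in the text: the entity and column names are placeholders, values are drawn as integers, and the header row is kept separate from the 25 data rows.

```python
import random
from typing import List, Tuple, Union

Cell = Union[str, int]

# Hypothetical closed list of entity names and hypothetical column names.
ENTITY_NAMES = [f"entity_{i}" for i in range(50)]
COLUMN_NAMES = ["name", "population", "wins", "price", "discount"]


def generate_table(low: int, high: int, rng: random.Random,
                   n_rows: int = 25) -> Tuple[List[str], List[List[Cell]]]:
    """Build one table: entity names in column 0, uniform values elsewhere."""
    names = rng.sample(ENTITY_NAMES, n_rows)
    rows: List[List[Cell]] = [
        [name] + [rng.randint(low, high) for _ in COLUMN_NAMES[1:]]
        for name in names
    ]
    return COLUMN_NAMES, rows


rng = random.Random(0)
train_header, train_rows = generate_table(1, 100, rng)    # train value range
test_header, test_rows = generate_table(300, 400, rng)    # disjoint test range
```

Keeping the train and test value ranges disjoint is what makes the test set a check on whether the logic, rather than the specific numbers, was learned.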
4.2 Implementation Details
In this section we provide more details on the exact implementation of the various EstiNet model parts that we designed to solve the TAQ task.
4.2.1 Input Representation
The TAQ input was composed of words, numbers, queries, and tables. The following describes the EstiNet representation for each of these elements.
Word Representation: EstiNet uses word pieces as in [Wu et al.2016] to represent words. A word is a concatenation of word pieces. EstiNet represents each word as the average of its piece embeddings.
Number Representation: EstiNet aims to accurately represent a number and embed it into the same vector space as words. Thus, its number representation follows the float32 scheme [Kahan1996]. Specifically, it starts by representing a number as a 32-dimensional Boolean vector. It then adds a redundancy factor $r$ by repeating each of the 32 digits $r$ times. Last, it pads the resulting vector with zeros. We tried several representation schemes; this approach resulted in the best EstiNet performance.
Query Representation: EstiNet represents a query as a matrix of word embeddings and uses the LSTM model [Hochreiter and Schmidhuber1997] to encode the query matrix into a vector representation $q = o_{last} \in \mathbb{R}^{d_q}$, where $o_{last}$ is the last LSTM output and $d_q$ is the dimension of the LSTM.
Table Representation: For the TAQ task, EstiNet represents a table with $n$ rows and $m$ columns as a three-dimensional tensor. It represents each cell in a table as the average of the piece embeddings of its words.
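The float32-based number representation described in this section (32 binary digits, a redundancy factor, zero padding) can be sketched with the standard `struct` module. The bit order, the redundancy factor of 3, the embedding size of 128, and the majority-vote decoder are assumptions chosen for the example.

```python
import struct
from typing import List, Sequence


def encode_number(value: float, redundancy: int, dim: int) -> List[int]:
    """float32 bits (most significant first), each repeated, then zero-padded."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    digits = [(bits >> (31 - i)) & 1 for i in range(32)]
    repeated = [d for d in digits for _ in range(redundancy)]
    if len(repeated) > dim:
        raise ValueError("dim is too small for 32 * redundancy digits")
    return repeated + [0] * (dim - len(repeated))


def decode_number(vector: Sequence[float], redundancy: int) -> float:
    """Recover the number by majority vote over each group of repeated digits."""
    bits = 0
    for i in range(32):
        group = vector[i * redundancy:(i + 1) * redundancy]
        bit = 1 if 2 * sum(group) >= len(group) else 0
        bits = (bits << 1) | bit
    return struct.unpack(">f", struct.pack(">I", bits))[0]


embedding = encode_number(7.0, redundancy=3, dim=128)
restored = decode_number(embedding, redundancy=3)

# Redundancy lets the decoder tolerate a corrupted copy of a digit.
noisy = list(embedding)
noisy[3] = 0  # flip one of the three copies of the second digit
restored_from_noisy = decode_number(noisy, redundancy=3)
```

A representation that round-trips to a concrete number is what allows the same vector to be consumed by a subnetwork during training and by a numerical API at inference time.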
4.2.2 Selectors
The EstiNet TAQ model uses three selector types: operation, argument, and column. Operation selectors select the correct API. Argument selectors select an argument from the query and hand it to the API. The column selector’s role is to select a column from the table and hand it to the API. EstiNet implements each selector subnetwork as a classifier. Let $C \in \mathbb{R}^{k \times d}$ be the predicted class matrix, where the total number of classes is $k$ and each class is represented by a vector of size $d$. For example, for a selector that has to select a word from a sentence, the matrix $C$ contains the word embeddings of the words in the sentence. One may consider various selector implementation options. We use a simple, fully connected network implementation in which $W$ is the parameter matrix and $b$ is the bias. We define $\hat{z}$ to be the selector prediction before activation and $\hat{p} = \mathrm{softmax}(\hat{z})$ to be the prediction after the softmax activation layer. At inference time, the selector transforms its soft selection into a hard selection to satisfy the API requirements. EstiNet enables this using the Gumbel Softmax hard selection functionality.
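One possible reading of such a selector is sketched below: a query vector is passed through a fully connected layer, each class vector is scored against the result, and a softmax turns the scores into a distribution. The scoring form (a dot product between class vectors and the projected query), the soft selection as a probability-weighted average, and all names and numbers are assumptions made for this sketch rather than details given in the text.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def softmax(scores: Vector) -> Vector:
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def select(class_matrix: Matrix, query: Vector, weights: Matrix, bias: Vector,
           hard: bool = False) -> Tuple[Vector, Vector]:
    """Return (selected vector, class probabilities)."""
    projected = [p + b for p, b in zip(matvec(weights, query), bias)]
    scores = matvec(class_matrix, projected)   # prediction before activation
    probs = softmax(scores)                    # prediction after softmax
    if hard:
        winner = max(range(len(probs)), key=probs.__getitem__)
        return list(class_matrix[winner]), probs
    dim = len(class_matrix[0])
    soft = [sum(p * row[j] for p, row in zip(probs, class_matrix))
            for j in range(dim)]
    return soft, probs


class_matrix = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # three classes of size 2
query = [1.0, 0.0]
weights = [[2.0, 0.0], [0.0, 2.0]]
bias = [0.0, 0.0]

soft_vec, probs = select(class_matrix, query, weights, bias)
hard_vec, _ = select(class_matrix, query, weights, bias, hard=True)
```

The soft output blends class vectors and stays differentiable, whereas the hard output is exactly one class vector, which is the form a real API input requires.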
4.2.3 EstiLayers
The EstiNet TAQ model uses five EstiLayers, one for each of the five logic operations. Each EstiLayer is a general purpose subnetwork that we implement with a transformer network encoder [Vaswani et al.2017]. Specifically, we use $N$ identical layers, each of which consists of two sub-layers. The first is a multi-head attention with $h$ heads, and the second is a fully connected feed forward two-layer network, applied separately to each cell in the sequence. We then employ a residual connection around each of these two sub-layers, followed by layer normalization. Last, we apply a linear transformation to the encoder output, adding a bias and applying the Gumbel Softmax.
In this section we report on the performance of Estimate and Replace interacting with external applications. We start with the performance of the EstiNet model on the TAQ task. We then provide the performance results of offline and online training procedures and compare them with the plain training baseline. Last, we demonstrate the advantages of Estimate and Replace on learning from less data.
5.1 TAQ Performance
The graph depicts the TAQ accuracy of three model configurations: 1) Train: train model and train dataset, 2) Test: train model and test dataset, and 3) Inference: inference model and test dataset. As shown, the train model accuracy on the train dataset reaches 95% after 20 epochs. Interestingly, inference accuracy on the test set is even higher and reaches 100%, while test accuracy is lower than 40%. The graph demonstrates the ability of the TAQ model to learn the logic interface despite its low generalization, indicated by the low test accuracy. Intermediate selector labels, which only exist for generated data, allow us to further assess selector accuracy in learning the interface with the estimators. Figure 3 presents the per-epoch accuracy of the three selectors during model training. The figure shows the selectors’ ability to perfectly learn the EstiLayers’ interface after approximately 10 epochs. Selector test performance indicates the same (not shown in the figure). Figure 3 also assesses EstiLayer accuracy in estimating the logic operations. It shows that EstiLayers achieve near perfect accuracy on the train set, while test set performance is well below optimal. Most importantly, even though suboptimal EstiLayer test performance drives the overall low model performance on the test set, it has no effect on model performance in inference mode, because in inference mode we replace each EstiLayer with its application counterpart.
5.2 Training Procedures
In plain training, which we use as our baseline (first line in Table 1), we train the TAQ model end-to-end without any additional constraints on the selector-estimator interface. As shown, the model can overfit the train set (0.9) but test set performance is low (0.11). Note that inference performance has no real meaning in this case: there is no predefined selector-estimator interface, so replacing the estimator with the external API makes no sense.
Our main goal is to force a model to learn a predefined interface. To achieve that, we first tried a two-step training procedure, which we refer to as offline training. We first train the estimators to overfit the train set and then use the trained estimator model during the end-to-end training process. The second line in the table shows the result of the offline experiment. As shown, the model fails to fit the train set and shows only 0.09 accuracy. Moreover, low train model accuracy results in low inference performance (0.17). We hypothesized that fixing the estimator parameters during the end-to-end training process prevents the rest of the model from fitting the train set. Thus, we ran a third experiment, offline-trainable, which allows the optimizer to update EstiLayer parameters during the end-to-end training process. This enabled the model to successfully fit the train set (0.97) but degraded the interface accuracy at inference time (0.33).
With that failure in mind, we looked for a way to let the model learn the TAQ task end-to-end and at the same time force the estimator to comply with a predefined interface. We designed the online training procedure as a multi-task optimization process with two unrelated loss functions: 1) the task and 2) the external functionality (see Section 3.4.2 for more details). In our first online training experiment (online entry in the table), we trained the estimators and the entire model without first initializing the estimator parameters. We assumed the estimators would gain their estimation functionality during the multi-task optimization process. Indeed, this training procedure led to significant improvement in inference performance (0.69). However, the end-to-end model still faced difficulties in fitting the train set, gaining only 0.76 accuracy. To improve the end-to-end model learning performance, we experimented with the online-pretraining procedure (fifth line in the table), in which we started the end-to-end training process with trained estimator models. As indicated by the table, the improved online training procedure succeeded in fitting the train set (0.98) and achieved near perfect results at inference time (0.98). Low test set performance (0.47) usually indicates a suboptimal model (e.g., due to lack of data). In our case, however, the near perfect inference accuracy demonstrates high generalization abilities in situations that lack data.
5.3 Learning from Less Data
To demonstrate the ability of Estimate and Replace to learn from less data, we ran an auxiliary experiment on a simpler dataset of greater-than/less-than questions. The questions came from 10 different templates, all requiring a true/false answer for 2 real numbers. For example: “Out of x and y, is the first bigger?”, where $x$ and $y$ are float numbers sampled from a distribution. We compared the performance of the EstiNet model in plain and online training procedures. The plain training procedure served as a baseline and we measured its performance in test mode. This is because with plain training, the model does not learn to interact with external applications; thus, inference performance has no meaning. Online training, on the other hand, lets the DNN model learn to interact with external applications, so we can measure its inference performance. Our experiment contained 5 train sets with 250, 500, 1,000, 5,000 or 10,000 questions and a test set with 1,000 questions. Table 2 summarizes the performance differences. The results show that with online training the model generalizes better, and that accuracy differences between the two training procedures increase as the amount of training data decreases. It is interesting to note that to achieve, for example, 0.97 accuracy, online training needs only 5% of the samples that plain training needs. We attribute the superiority of the EstiNet online training performance to what the model learns: it learns to interact with an external application to solve the logical part of the question.
| Train set size | 250 | 500 | 1,000 | 5,000 | 10,000 |
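A dataset of this kind could be generated along the following lines. Only the first template wording comes from the text; the second template, the sampling range, the rounding, and the function name are hypothetical, and the sketch uses two templates rather than ten.

```python
import random
from typing import Callable, List, Tuple

Template = Tuple[str, Callable[[float, float], bool]]

TEMPLATES: List[Template] = [
    ("Out of {x} and {y}, is the first bigger?", lambda x, y: x > y),
    ("Is {x} smaller than {y}?", lambda x, y: x < y),  # hypothetical template
]


def generate_questions(n: int, low: float, high: float,
                       seed: int = 0) -> List[Tuple[str, bool]]:
    """Fill random templates with two floats and record the true/false answer."""
    rng = random.Random(seed)
    samples: List[Tuple[str, bool]] = []
    for _ in range(n):
        template, rule = rng.choice(TEMPLATES)
        x = round(rng.uniform(low, high), 2)
        y = round(rng.uniform(low, high), 2)
        samples.append((template.format(x=x, y=y), rule(x, y)))
    return samples


# One generated set per train size used in the experiment.
train_sets = {size: generate_questions(size, 0.0, 100.0, seed=size)
              for size in (250, 500, 1000, 5000, 10000)}
```

Because the answer is computed by a rule rather than annotated by hand, arbitrarily small or large train sets can be produced to probe how accuracy changes with data volume.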
6 Conclusions and Future Work
Our work presents a new approach to overcoming the non-differentiability challenge while integrating existing applications with DNNs. We use EstiLayers to estimate the external non-differentiable functionality. We then train the DNN end-to-end with the EstiLayers as functionality placeholders. The DNN learns the interface with the EstiLayers and uses it at inference time. Defining and learning an application interface, as well as learning when to use it, involves several non-differentiable aspects that must be overcome by the training process. These include hard selection, typed arguments, and functionality orchestration. We successfully demonstrated the advantages of EstiNet in learning and using a set of predefined, non-differentiable interfaces. We plan to follow up on this work and extend it in two related directions. One direction is to dynamically learn the interface signature of a given functionality. Another is to learn to orchestrate a set of interfaces to solve a complex task. Achieving these will allow us to apply Estimate and Replace to real-world problems such as challenges in the finance and elementary science domains.
- [Andreas et al.2016] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Learning to compose neural networks for question answering. arXiv preprint arXiv:1601.01705, 2016.
- [Artzi and Zettlemoyer2013] Yoav Artzi and Luke Zettlemoyer. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguistics, 1:49–62, 2013.
- [Bahdanau et al.2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
- [Branavan et al.2009] S. R. K. Branavan, Harr Chen, Luke S. Zettlemoyer, and Regina Barzilay. Reinforcement learning for mapping instructions to actions. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1, ACL ’09, pages 82–90, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics.
- [Cai et al.2017] Jonathon Cai, Richard Shin, and Dawn Song. Making neural programming architectures generalize via recursion. arXiv preprint arXiv:1704.06611, 2017.
- [Graves et al.2014] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.
- [Graves et al.2016] Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwinska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, Adrià Puigdomènech Badia, Karl Moritz Hermann, Yori Zwols, Georg Ostrovski, Adam Cain, Helen King, Christopher Summerfield, Phil Blunsom, Koray Kavukcuoglu, and Demis Hassabis. Hybrid computing using a neural network with dynamic external memory. Nature, 538:471–476, 2016.
- [Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- [Jang et al.2017] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2017.
- [Jean et al.2015] Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. On using very large target vocabulary for neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1–10. Association for Computational Linguistics, 2015.
- [Kahan1996] William Kahan. IEEE standard 754 for binary floating-point arithmetic. 1996.
- [Liang2016] Percy Liang. Learning executable semantic parsers for natural language understanding. Commun. ACM, 59(9):68–76, August 2016.
- [Lin et al.2017] Xi Victoria Lin, Chenglong Wang, Deric Pang, Kevin Vu, and Michael D Ernst. Program synthesis from natural language using recurrent neural networks. Technical Report UW-CSE-17-03-01, University of Washington Department of Computer Science and Engineering, Seattle, WA, USA, 2017.
- [Ling et al.2017] Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. Program induction by rationale generation: Learning to solve and explain algebraic word problems. arXiv preprint arXiv:1705.04146, 2017.
- [Neelakantan et al.2016] Arvind Neelakantan, Quoc V Le, Martin Abadi, Andrew McCallum, and Dario Amodei. Learning a natural language interface with neural programmer. arXiv preprint arXiv:1611.08945, 2016.
- [Pasupat and Liang2015] Panupong Pasupat and Percy Liang. Compositional semantic parsing on semi-structured tables. arXiv preprint arXiv:1508.00305, 2015.
- [Reed and De Freitas2015] Scott Reed and Nando De Freitas. Neural programmer-interpreters. arXiv preprint arXiv:1511.06279, 2015.
- [Sutskever et al.2014] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3104–3112. Curran Associates, Inc., 2014.
- [Szegedy et al.2017] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. 2017.
- [Tamar et al.2016] Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel. Value iteration networks. In Advances in Neural Information Processing Systems, pages 2154–2162, 2016.
- [Vaswani et al.2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.
- [Williams1992] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
- [Wu et al.2016] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
- [Xiong et al.2016] Wayne Xiong, Jasha Droppo, Xuedong Huang, Frank Seide, Mike Seltzer, Andreas Stolcke, Dong Yu, and Geoffrey Zweig. Achieving human parity in conversational speech recognition. arXiv preprint arXiv:1610.05256, 2016.
- [Zaremba and Sutskever2015] Wojciech Zaremba and Ilya Sutskever. Reinforcement learning neural turing machines. CoRR, abs/1505.00521, 2015.