1 Introduction
End-to-end supervised learning with deep neural networks (DNNs) has taken the stage in the past few years, achieving state-of-the-art performance in multiple domains including computer vision (Szegedy et al., 2017), machine translation (Sutskever et al., 2014; Jean et al., 2015), and speech recognition (Xiong et al., 2016). Many of the tasks addressed by DNNs can be naturally decomposed into a series of functions. In such cases, it might be advisable to learn neural network approximations for some of these functions and use precise existing functions for others. Examples of such tasks include Semantic Parsing and Question Answering. Since such a decomposition relies partly on precise functions, it may lead to a superior solution compared to an approximated one based solely on a learned neural model.

Decomposing a solution into trainable networks and existing functions requires matching the output of the networks to the input of the existing functions, and vice versa. The input and output are defined by the existing functions’ interface. We shall refer to these functions as black-box functions (bbf), focusing only on their interface. For example, consider the question: “Is 7.2 greater than 4.5?” Given that number comparison is a solved problem in symbolic computation, a natural solution would be to decompose the task into a two-step process of (i) converting the natural language to an executable program, and (ii) executing the program on an arithmetic module. While a DNN may be a good fit for the first step, it would not be a good fit for the second step, as DNNs have been shown to generalize poorly to arithmetic or symbolic functionality (Fodor & Pylyshyn, 1988; He et al., 2016).
In this work, we propose a method for performing end-to-end training of a decomposed solution comprising a neural network that calls black-box functions. Thus, this method benefits from the best of both worlds. We empirically show that such a network generalizes better than an equivalent end-to-end network, and that our training method is more efficient at learning than existing methods used for training a decomposed solution.
The main challenge in decomposing a task into a collection of neural network modules and existing functions is that effective neural network training using gradient-based techniques requires the entire computation trajectory to be differentiable. We outline three existing solutions for this task: (i) End-to-End Training: Although a task is naturally decomposable, it is possible to train a network to fit the symbolic functionality of the task to a differentiable learned function without decomposing it. Essentially, that means solving the problem end-to-end, foregoing the existing black-box function. (ii) Using Intermediate Labels: If we insist, however, on using the black-box function, it is possible to train a network to supply the desired input to the black-box function by providing intermediate labels for translating the task input to the appropriate black-box function inputs. However, intermediate labels are, most often, expensive to obtain and thus produce a bottleneck in gathering data. (iii) Black-Box Optimization:
Finally, one may circumvent the need for intermediate labels or a differentiable approximation with Reinforcement Learning (RL) or Genetic Algorithms (GA) that support black-box function integration during training. Still, these algorithms suffer from high learning variance and poor sample complexity.
We propose an alternative approach called Estimate and Replace that finds a differentiable function approximation, which we term black-box estimator, for estimating the black-box function. We use the black-box estimator as a proxy to the original black-box function during training, and thereby allow the learnable parts of the model to be trained using gradient-based optimization. We compensate for not using any intermediate labels to direct the learnable parts by using the black-box function as an oracle for training the black-box estimator. During inference, we replace the black-box estimator with the original non-differentiable black-box function.
End-to-end training of a solution composed of trainable components and black-box functions poses several challenges we address in this work—coping with non-differentiable black-box functions, fitting the network to call these functions with the correct arguments, and doing so without any intermediate labels. Two more challenges are the lack of prior knowledge on the distribution of inputs to the black-box function, and the use of gradient-based methods when the function approximation is near perfect and gradients are extremely small.
This work is organized as follows: In Section 2, we formulate the problem of decomposing a task to include calls to a black-box function. Section 3 describes the network architecture and training procedures. In Section 4, we present experiments and a comparison to Policy Gradient-based RL and to fully neural models. Section 5 surveys related work. We further discuss the potential and benefits of the modular nature of our approach in Section 6.
2 Learning Black-Box Function Interfaces without Intermediate Labels
In this work, we consider the problem of training a DNN model to interact with black-box functions to achieve a predefined task. Formally, given a labeled pair (x, y), such that some target function h: X → Y satisfies h(x) = y, we assume that there exist:

(1) an argument extractor function g: X → A_1 × … × A_n, and

(2) a black-box function f: A_1 × … × A_n → Y,

such that f(g(x)) = h(x), where n is the number of arguments in the black-box input domain A = A_1 × … × A_n. The domains A_i can be structures of discrete, continuous, and nested variables.

The problem then is to fit h given a dataset {(x_i, y_i)} and given oracle access to f. Then g is an argument extractor function, which takes x as input and outputs a tuple of arguments g(x) = (a_1, …, a_n), and f is a black-box function, which takes these arguments and outputs the final result. Importantly, we require no sample for which the intermediate black-box interface argument labels are available. We note that this formulation allows the use of multiple functions simultaneously, e.g., by defining an additional argument that specifies the “correct” function, or a set of arguments that specify ways to combine the functions’ results.
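As a toy illustration of this decomposition (all concrete names and parsing choices here are hypothetical, not taken from the paper), a hand-coded argument extractor g and a black-box comparison f compose into the target function h:

```python
# Toy sketch of the decomposition h(x) = f(g(x)) (hypothetical names).
# g: argument extractor (learned in practice; hard-coded here).
# f: black-box function, used only through its interface.

def g(x: str):
    # Extract the two float arguments and the operator from a question
    # such as "is 7.2 greater than 4.5?"
    tokens = x.rstrip("?").split()
    a, b = float(tokens[1]), float(tokens[-1])
    op = "gt" if "greater" in x else "lt"
    return a, op, b

def f(a: float, op: str, b: float) -> bool:
    # Black-box arithmetic comparison; exact, not learned.
    return a > b if op == "gt" else a < b

def h(x: str) -> bool:
    return f(*g(x))

print(h("is 7.2 greater than 4.5?"))  # True
```

In the paper's setting only g is a trainable network; f is never differentiated through.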
3 The Estimate and Replace Approach and the EstiNet Model
In this section we present the Estimate and Replace approach, which aims to address the problem defined in Section 2. The approach enables training a DNN that interacts with non-differentiable black-box functions (bbf), as illustrated in Figure 1 (a). The complete model, termed EstiNet, is a composition of two modules—argument extractor and black-box estimator—which learn the argument extraction and the black-box functionality, respectively. The black-box estimator sub-network serves as a differentiable estimator of the black-box function during end-to-end gradient-based optimization. We encourage the estimator to directly fit the black-box functionality by using the black-box function as a label generator during training. At inference time, we replace the estimator with its black-box function counterpart, and let this hybrid solution solve the end-to-end task at hand in an accurate and efficient way. In this way, we eliminate the need for intermediate labels. We refer to running a forward pass with the black-box estimator as test mode and running a forward pass with the black-box function as inference mode. By leveraging the black-box function in inference mode, EstiNet shows better generalization than an end-to-end neural network model. In addition, EstiNet suggests a modular architecture with the added benefits of module reuse and model interpretability.
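The replace step can be sketched as a mode switch in the forward pass (a minimal sketch with hypothetical names; the real modules are neural networks): in test mode the model routes through the differentiable estimator, and in inference mode through the real black-box function.

```python
# Sketch of Estimate and Replace mode switching (hypothetical names).
# test mode      -> forward pass uses the differentiable estimator.
# inference mode -> forward pass calls the real black-box function.

class EstiNetSketch:
    def __init__(self, argument_extractor, estimator, black_box):
        self.argument_extractor = argument_extractor
        self.estimator = estimator
        self.black_box = black_box
        self.mode = "test"

    def forward(self, x):
        args = self.argument_extractor(x)
        fn = self.estimator if self.mode == "test" else self.black_box
        return fn(args)

model = EstiNetSketch(
    argument_extractor=lambda x: (x[0], x[1]),
    estimator=lambda args: args[0] + args[1] + 0.01,  # imperfect approximation
    black_box=lambda args: args[0] + args[1],         # exact function
)
model.mode = "inference"
print(model.forward((2, 3)))  # 5
```

Only the estimator participates in backpropagation; the swap at inference is what recovers exact black-box behavior.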
Adapters
EstiNet uses an adaptation function to adapt the argument extractor’s output to the black-box function input, and to adapt the black-box function’s output to the appropriate final output label format (see Figure 1 (b)). For example, EstiNet uses such a function to convert soft classification distributions to hard selections, or to map classes of text tokens to concrete text.
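For instance, an input adapter for a classification-style argument extractor might turn each soft distribution into a hard selection before calling the black-box function (a sketch; the concrete adapters and class names are task-specific and illustrative):

```python
# Sketch of an adaptation function: convert the argument extractor's
# soft classification distribution into the hard selection the
# black-box function's interface expects. Names are illustrative.

def hard_select(distribution, classes):
    # Pick the class with the highest probability (argmax).
    best = max(range(len(distribution)), key=lambda i: distribution[i])
    return classes[best]

# Soft distribution over operator tokens produced by the network:
op_probs = [0.05, 0.85, 0.10]
op_classes = ["equal-to", "greater-than", "less-than"]
print(hard_select(op_probs, op_classes))  # greater-than
```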
3.1 Training an EstiNet Model
The modular nature of the EstiNet model presents a unique training challenge: EstiNet is a modular architecture where each of its two modules, namely the argument extractor and the black-box estimator, is trained using its own input-label pair samples and loss function.
3.1.1 EstiNet’s Loss Functions and Datasets
We optimize the EstiNet model parameters with two distinct loss functions—the target loss and the black-box loss. Specifically, we optimize the argument extractor’s parameters with respect to the target loss using the task’s dataset during end-to-end training. We optimize the black-box estimator’s parameters with respect to the black-box loss while training it on the black-box dataset:
The black-box dataset
We generate input-output pairs for the black-box dataset by sending an input sample to the black-box function and recording its output as the label. We experimented with generating input samples in two ways: (1) offline sampling, in which we sample from an a priori black-box input distribution, or from a uniform distribution in the absence of one; and (2) online sampling, in which we use the output of the argument extractor module during a forward pass as an input to the black-box function, using an adaptation function as needed for recording the output (see Figure 1 (b)).

3.1.2 Training Procedures
Having two independent datasets and loss functions suggests multiple training procedure options. In the next section we discuss the most prominent ones along with their advantages and disadvantages. We provide an empirical evaluation of these procedures in Section 4.
Offline Training
In offline training we first train the black-box estimator using offline sampling. We then fix its parameters, load the trained black-box estimator into the EstiNet model, and train the argument extractor with the task’s dataset and target loss function. A disadvantage of offline training is noisy training due to the distribution difference between the offline a priori black-box dataset and the actual posterior inputs that the argument extractor computes from the task’s dataset during training. That is, the distribution of the dataset with which we trained the black-box estimator is different from the distribution of inputs it receives during the target loss training.
Online Training
In online training we aim to solve the distribution difference problem by jointly training the argument extractor and the black-box estimator, using the target loss and the black-box loss respectively. Specifically, we train the black-box estimator with the black-box dataset generated via online sampling during the training process.¹ Let L_target and L_bbf be the two respective losses, and let θ_g and θ_f be the parameters of the argument extractor and black-box estimator modules. Then the gradient updates of the EstiNet during online training are:

θ_g ← θ_g − η ∇_{θ_g} L_target,    θ_f ← θ_f − η ∇_{θ_f} L_bbf

Figure 1 (b) presents a schematic diagram of the online training procedure. We note that the online training procedure suffers from a cold-start problem of the argument extractor: initially, the argument extractor generates noisy input for the black-box function, which prevents it from generating meaningful labels for the black-box estimator.

¹We note that this problem is reminiscent of, but different from, Multi-Task Learning, which involves training the same parameters using multiple loss functions. In our case, we train non-overlapping parameters using two losses.
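To make the non-overlapping updates concrete, here is a minimal runnable sketch with scalar toy modules (the linear forms, learning rate, and data are hypothetical; the real modules are neural networks). The target loss updates only the argument extractor's parameter, while the black-box loss, whose labels come from querying the real black-box function on the extractor's own output, updates only the estimator's parameter:

```python
# Toy online-training loop with non-overlapping parameter updates.
# Argument extractor: g(x) = w * x   (parameter w; should learn w -> 1).
# Black box (oracle): f(a) = 2 * a   (used only to generate labels).
# Estimator:          f_hat(a) = v * a  (parameter v; should learn v -> 2).

def f(a):          # black-box oracle, never differentiated through
    return 2.0 * a

w, v, lr = 0.5, 1.0, 0.01
for step in range(2000):
    x, y = 3.0, 6.0          # task sample with target y = 2 * x
    a = w * x                # extractor output = online black-box sample

    # Target loss L_t = (v*a - y)^2 updates w only (v held fixed).
    grad_w = 2.0 * (v * a - y) * v * x
    # Black-box loss L_b = (v*a - f(a))^2 updates v only (a held fixed).
    grad_v = 2.0 * (v * a - f(a)) * a

    w -= lr * grad_w
    v -= lr * grad_v

# The estimator's slope v approaches the oracle slope 2.0, and the
# end-to-end prediction v * w * x approaches the target y.
print(round(v, 2), round(v * w * 3.0, 2))
```

The manual gradients stand in for backpropagation; in practice each loss drives its own optimizer over its own module's parameters.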
Hybrid Training
In hybrid training we aim to solve the cold-start problem by first training the black-box estimator offline, but refraining from freezing its parameters. We load the estimator into the EstiNet model and continue to train it in parallel with the argument extractor, as in online training.
3.1.3 Regularizing black-box estimator overconfidence
In all of the above training procedures, we essentially replace the use of intermediate labels with the use of a black-box dataset for implicitly training the argument extractor via backpropagation. As a consequence, if the gradients of the black-box estimator are small, it is difficult for the argument extractor to learn. Furthermore, if the black-box estimator is a classifier, it tends to grow overly confident as it trains, assigning very high probabilities to specific answers and very low probabilities to the rest (Pereyra et al., 2017). Since these classification functions are implemented with a softmax layer, output values that are close to the function limits result in extremely small gradients. This means that when the estimator reaches a local optimum and is very confident, its gradient updates become small. Through the chain rule of backpropagation, even if the argument extractor is not yet at a local optimum, its gradient updates become small as well, which complicates training.
To overcome this phenomenon, we follow Szegedy et al. (2016) and Pereyra et al. (2017), regularizing the high confidence by introducing (i) Entropy Loss – adding the negative entropy of the output distribution to the loss, thereby maximizing the entropy and encouraging less confident distributions, and (ii) Label Smoothing Regularization – adding the cross-entropy (CE) loss between the output and the training set’s label distribution (for example, a uniform distribution) to the standard CE loss between the predicted and ground-truth distributions. Empirical validation of the phenomenon and of our proposed solution is detailed in Section 4.3.
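The two regularizers can be sketched as penalty terms computed on the estimator's output distribution (a sketch with hypothetical names; in the full model these terms are added to the training loss):

```python
import math

# Sketch of the two confidence regularizers (hypothetical names).

def entropy_penalty(p):
    # Negative entropy of the output distribution. Adding this term to
    # the loss maximizes entropy, discouraging over-confident outputs.
    return sum(pi * math.log(pi) for pi in p if pi > 0)

def label_smoothing_ce(p, num_classes):
    # Cross entropy between the output and a uniform prior, added to
    # the standard CE loss against the ground-truth label.
    u = 1.0 / num_classes
    return -sum(u * math.log(pi) for pi in p)

confident = [0.98, 0.01, 0.01]
smooth = [0.4, 0.3, 0.3]
# Both terms penalize the over-confident distribution more:
print(entropy_penalty(confident) > entropy_penalty(smooth))          # True
print(label_smoothing_ce(confident, 3) > label_smoothing_ce(smooth, 3))  # True
```

Either term keeps the softmax away from its saturated limits, so gradients flowing back to the argument extractor remain usable.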
4 Experiments
We present four experiments of increasing complexity to test the Estimate and Replace approach and compare its performance against existing solutions. Specifically, the experiments demonstrate that by leveraging external black-box functions, we achieve better generalization and better learning efficiency in comparison with existing competing solutions, without using intermediate labels. Appendix A contains concrete details of the experiments.
4.1 TextLogic
We start with a simple experiment that demonstrates the ability of our Estimate and Replace approach to learn a proposed decomposition solution. We show that by leveraging a precise external function, our method performs better with less training data. In this experiment, we train a network to answer simple greater-than/less-than logical questions on real numbers, such as: “is 7.5 greater than 8.2?” We solve the TextLogic task by constructing an EstiNet model with an argument extractor layer that extracts the arguments and operator (7.5, 8.2, and “greater than” in the above example), and a black-box estimator that performs simple logic operations (greater than and less than). We generate the TextLogic questions from ten different templates, all requiring a true/false answer for two float numbers.
Results
We compare the performance of the EstiNet model with a baseline model. This baseline model is equivalent to ours in its architecture, but is trained end-to-end with the task labels as supervision. This supervision allows the model to learn the input-to-output mapping, but does not provide any guidance for decomposing the task and learning the black-box function interface. We used online training for the EstiNet model. Table 1 summarizes the performance differences. The EstiNet model generalizes better than the baseline, and the accuracy difference between the two training procedures increases as the amount of training data decreases. This experiment demonstrates the advantage of the Estimate and Replace approach in training a DNN with less data. For example, to achieve accuracy of 0.97, our model requires only 5% of the data that the baseline training requires.
Table 1: TextLogic test accuracy per train set size.

Train set size  250    500    1,000  5,000  10,000
Baseline        0.533  0.686  0.859  0.931  0.98
EstiNet         0.966  0.974  0.968  0.995  1.0
Difference      81%    41%    13%    7%     2%
4.2 ImageAddition
With the second experiment we seek to demonstrate the ability of our Estimate and Replace approach to generalize by leveraging a precise external function. In addition, we compare our approach to an Actor-Critic-based RL algorithm. The ImageAddition task is to sum the values captured by a sequence of MNIST images. Previously, Trask et al. (2018) showed that their proposed Neural Arithmetic Logic Unit (NALU) cell generalizes better than previous solutions while solving this task with standard end-to-end training.² We solve the task by constructing an EstiNet model with an argument extractor layer that classifies the digit in a given MNIST image, and a black-box estimator that performs the sum operation. The argument extractor takes an unbounded series of MNIST images as input, and outputs a sequence of MNIST classifications of the same length. The black-box estimator, which is a composition of a Long Short-Term Memory (LSTM) layer and a NALU cell, then takes the argument extractor’s output as its input and outputs a single regression number. Solving the ImageAddition task requires the argument extractor to classify every MNIST image correctly without intermediate digit labels. Furthermore, because the sequence length is unbounded, unseen sequence lengths result in unseen sum ranges to which the solution must generalize.

²They refer to this task as the MNIST-Addition task in their work.
Results vs. End-to-End
Table 2 shows a comparison of EstiNet performance with an end-to-end NALU model. Both models were trained on sequences of a fixed length. The argument extractor achieves 98.6% accuracy on MNIST test set classification. This high accuracy indicates that the EstiNet is able to learn the desired behavior, where the arguments are the digits shown in the MNIST images. Thus, it can generalize to any sequence length by leveraging the sum operation. Our NALU-based EstiNet outperforms the plain NALU-based end-to-end network.
Table 2: ImageAddition error per evaluated sequence length k (lower is better).

Model    k = 10  k = 100
NALU     1.42    7.88
EstiNet  0.42    3.3
Results vs. RL
We compare the EstiNet performance with an Actor-Critic-based RL agent as an existing solution for training a neural network that calls a black-box function without intermediate labels. We compare the learning efficiency of the two models by the number of gradient updates required to reach the optimum. Results in Figure 2 show that EstiNet significantly outperforms the RL agent.
4.3 ImageLookup
The third experiment tests the capacity of our approach to deal with non-differentiable tasks, in our case a lookup operation, as opposed to the differentiable addition operation presented in the previous section. With this experiment, we present the effect of replacing the black-box estimator with the original black-box function. We are given a k-dimensional lookup table T: D^k → D, where D is the digits domain 0–9. The ImageLookup input x is a sequence of k MNIST images with corresponding digits d_1, …, d_k. The label y for x is T[d_1, …, d_k]. We solve the ImageLookup task by constructing an EstiNet model with an argument extractor similar to the previous task and a black-box estimator that outputs the classification prediction.
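A sketch of the task's label generation for k = 2 (table contents and sizes are illustrative): the label is a pure lookup on the digits behind the images, so the table itself carries no learnable structure for the estimator to generalize over.

```python
import random

# Sketch of ImageLookup label generation for k = 2 (illustrative sizes).
# The black box is a lookup table T over digit tuples; the label for a
# pair of MNIST images showing digits (d1, d2) is T[d1][d2].

random.seed(0)
DIGITS = range(10)
T = [[random.choice(DIGITS) for _ in DIGITS] for _ in DIGITS]  # 10x10 table

def black_box_lookup(d1: int, d2: int) -> int:
    return T[d1][d2]

# The argument extractor must classify each image into its digit; the
# final label is then a non-differentiable lookup on those digits.
d1, d2 = 7, 3   # ground-truth digits behind two MNIST images
label = black_box_lookup(d1, d2)
print(0 <= label <= 9)  # True
```

Because the table entries are arbitrary, an estimator can only memorize a specific table; swapping in a new table at inference only works through the real lookup, as the results below show.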
Results
Results are shown in Table 3. Successfully solving this task implies the ability to generalize over the black-box function, which in our case is the ability to replace or update the original lookup table with another at inference time without the need to retrain our model. To verify this, we replace the lookup table with a randomly generated one in test mode and observe a performance decrease, as the black-box estimator did not learn the correct lookup functionality. However, in inference mode, where we replace the black-box estimator with the unseen black-box function, performance remains high.
Table 3: ImageLookup accuracy per number of MNIST images.

#MNIST images  Train  Test  Inference  Argument Extractor  Estimator
               0.98   0.11  0.97       0.99                0.98
               0.97   0.1   0.97       0.99                0.98
               0.69   0.1   0.95       0.986               0.7
We also used the ImageLookup task to validate the need for confidence regularization as described in Section 3.1.3. Figure 3 shows empirical evidence of the correlation between over-confidence in the black-box estimator’s output distribution and small gradients in the argument extractor, as well as the reverse when confidence regularizers are applied.
4.4 TextLookupLogic (TLL)
For the last experiment, we applied the Estimate and Replace approach to solve a more challenging task that combines logic and lookup operations. In this task, we demonstrate the generalization ability on the input – a database table in this instance. The table can be replaced with a different one at inference time, like the black-box function from the previous tasks. In addition, with this experiment we compare the offline, online, and hybrid training modes. For this task, we generated a table-based question answering dataset. For example, consider a table that describes the number of medals won by each country during the last Olympics, and a query such as: “Which countries won more than 7 gold medals?” We solve this task by constructing an EstiNet model with an argument extractor layer that (i) extracts the argument from the text, (ii) chooses the logical operation to perform (out of: equal-to, less-than, greater-than, max, and min), and (iii) chooses the relevant column to perform the operation on, along with a black-box estimator that performs the logic operation on the relevant column.
Results
Table 4 summarizes the TLL model performance for the training procedures described in Section 3.1. In offline training the model fails to fit the training set; consequently, low training accuracy results in low inference performance. We hypothesize that fixing the estimator parameters during the end-to-end training process prevents the rest of the model from fitting the train set. The online training procedure indeed led to a significant improvement in inference performance. Hybrid training further improved upon online training in fitting the training set, and this performance carried over to inference mode.
Table 4: TLL accuracy per training procedure.

Training Type  Train  Test  Infer
Offline        0.09   0.02  0.17
Online         0.76   0.22  0.69
Hybrid         0.98   0.47  0.98
5 Related Work
EndtoEnd Learning
Task-specific architectures for end-to-end deep learning require large datasets and work very well when such data is available, as in the case of neural machine translation (Bahdanau et al., 2014). General-purpose end-to-end architectures, suitable for multiple tasks, include the Neural Turing Machine (Graves et al., 2014) and its successor, the Differentiable Neural Computer (Graves et al., 2016). Other architectures, such as the Neural Programmer architecture (Neelakantan et al., 2016), allow end-to-end training while constraining parts of the network to execute predefined operations by re-implementing specific operations as static differentiable components. This approach has two drawbacks: it requires re-implementation of the black-box function in a differentiable way, which may be difficult, and it lacks the accuracy and possibly also the scalability of an existing black-box function. Similarly, Trask et al. (2018) present a Neural Arithmetic Logic Unit (NALU) which uses gated base functions to allow better generalization to arithmetic functionality.

Program Induction and Program Generation
Program induction is a different approach to interaction with black-box functions. The goal is to construct a program comprising a series of operations based on the input, and then execute the program to get the results. When the input is a natural language query, it is possible to use semantic parsing to transform the query into a logical form that describes the program (Liang, 2016). Early works required natural language query-program pairs to learn the mapping, i.e., intermediate labels. Recent works (e.g., Pasupat & Liang, 2015) require only query-answer pairs for training. Other approaches include neural network-based program induction (Andreas et al., 2016), translation of a query into a program using sequence-to-sequence deep learning methods (Lin et al., 2017), and learning the program from execution traces (Reed & De Freitas, 2015; Cai et al., 2017).
Reinforcement Learning
Learning to execute the right operation can be viewed as a reinforcement learning problem. For a given input, the agent must select an action (the input to the black-box function) from a set of available actions. The action selection repeats following feedback based on the previous action selection. Earlier works that took this approach include Branavan et al. (2009) and Artzi & Zettlemoyer (2013). Recently, Zaremba & Sutskever (2015) proposed a reinforcement learning extension to NTMs. Andreas et al. (2016) overcome the difficulty of discrete selections, necessary for interfacing with an external function, by substituting the gradient with an RL-based estimate. Recent work by Liang et al. (2018) and Johnson et al. (2017) achieved state-of-the-art results in Semantic Parsing and Question Answering, respectively, using RL.
6 Discussion
Interpretability via Composability
Lipton (2016) identifies composability as a strong contributor to model interpretability. They define composability as the ability to divide the model into components and interpret them individually to construct an explanation from which a human can predict the model’s output. The Estimate and Replace approach solves the black-box interface learning problem in a way that is modular by design. As such, it provides an immediate interpretability benefit. Training a model to comply with a well-defined and well-known interface inherently supports model composability and, thus, directly contributes to its interpretability.
For example, suppose you want to let a natural language processing model interface with a WordNet service to receive additional synonym and antonym features for selected input words. Because the WordNet interface is interpretable, the intermediate output of the model to the WordNet service (the words for which the model requested additional features) can serve as an explanation to the model’s final prediction. Knowing which words the model chose to obtain additional features for gives insight to how it made its final decision.
Reusability via Composability
An additional clear benefit of model composability in the context of our solution is reusability. Training a model to comply with a well-defined interface induces well-defined module functionality, which is a necessary condition for module reuse.
7 Conclusion
Current solutions for learning using black-box functionality in neural network prediction have critical limitations which manifest themselves in at least one of the following aspects: (i) poor generalization, (ii) low learning efficiency, (iii) under-utilization of available optimal functions, and (iv) the need for intermediate labels. In this work, we proposed an architecture, termed EstiNet, and a training and deployment process, termed Estimate and Replace, which aim to overcome these limitations. We then showed empirical results that validate our approach.
Estimate and Replace is a two-step training and deployment approach by which we first estimate a given black-box functionality to allow end-to-end training via backpropagation, and then replace the estimator with its concrete black-box function at inference time. By using a differentiable estimation module, we can train an end-to-end neural network model using gradient-based optimization. We use labels that we generate from the black-box function during the optimization process to compensate for the lack of intermediate labels. We show that our training process is more stable and has lower sample complexity compared to policy gradient methods. By leveraging the concrete black-box function at inference time, our model generalizes better than end-to-end neural network models. We validate the advantages of our approach with a series of simple experiments. Our approach implies a modular neural network that enjoys added interpretability and reusability benefits.
Future Work
We limit the scope of this work to tasks that can be solved with a single black-box function. Solving the general case of this problem requires learning of multiple black-box interfaces, along with unbounded successive calls, where the final prediction is a computed function over the outputs of these calls. This introduces several difficult challenges. For example, computing the final prediction over a set of black-box functions, rather than a direct prediction of a single one, requires an additional network output module. The input of this module must be compatible with the output of the previous layer, be it an estimation function at training time or its black-box function counterpart at inference time, which belong to different distributions. We reserve this area of research for future work.
As difficult as it is, we believe that artificial intelligence does not lie in mere knowledge, nor in learning from endless data samples. Rather, much of it is in the ability to extract the right piece of information from the right knowledge source for the right purpose. Thus, training a neural network to intelligibly interact with blackbox functions is a leap forward toward stronger AI.
References
 Andreas et al. (2016) Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Learning to compose neural networks for question answering. arXiv preprint arXiv:1601.01705, 2016.
 Artzi & Zettlemoyer (2013) Yoav Artzi and Luke Zettlemoyer. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguistics, 1:49–62, 2013.
 Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
 Branavan et al. (2009) S. R. K. Branavan, Harr Chen, Luke S. Zettlemoyer, and Regina Barzilay. Reinforcement learning for mapping instructions to actions. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1  Volume 1, ACL ’09, pp. 82–90, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics. ISBN 9781932432459. URL http://dl.acm.org/citation.cfm?id=1687878.1687892.
 Cai et al. (2017) Jonathon Cai, Richard Shin, and Dawn Song. Making neural programming architectures generalize via recursion. arXiv preprint arXiv:1704.06611, 2017.
 Fodor & Pylyshyn (1988) Jerry A Fodor and Zenon W Pylyshyn. Connectionism and cognitive architecture: A critical analysis. Cognition, 28(1–2):3–71, 1988.
 Graves et al. (2014) Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.
 Graves et al. (2016) Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwinska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, Adrià Puigdomènech Badia, Karl Moritz Hermann, Yori Zwols, Georg Ostrovski, Adam Cain, Helen King, Christopher Summerfield, Phil Blunsom, Koray Kavukcuoglu, and Demis Hassabis. Hybrid computing using a neural network with dynamic external memory. Nature, 538:471–476, 2016. URL http://dx.doi.org/10.1038/nature20101.
 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. CoRR, abs/1603.05027, 2016. URL http://arxiv.org/abs/1603.05027.
 Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
 Jean et al. (2015) Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. On using very large target vocabulary for neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1–10. Association for Computational Linguistics, 2015. doi: 10.3115/v1/P151001. URL http://www.aclweb.org/anthology/P151001.
 Johnson et al. (2017) Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Fei-Fei Li, C. Lawrence Zitnick, and Ross B. Girshick. Inferring and executing programs for visual reasoning. CoRR, abs/1705.03633, 2017. URL http://arxiv.org/abs/1705.03633.

 Kahan (1996) William Kahan. IEEE Standard 754 for binary floating-point arithmetic. 1996.
 Kingma & Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Liang et al. (2018) Chen Liang, Mohammad Norouzi, Jonathan Berant, Quoc Le, and Ni Lao. Memory augmented policy optimization for program synthesis and semantic parsing. 2018.
 Liang (2016) Percy Liang. Learning executable semantic parsers for natural language understanding. Commun. ACM, 59(9):68–76, August 2016. ISSN 0001-0782. doi: 10.1145/2866568. URL http://doi.acm.org/10.1145/2866568.

 Lin et al. (2017) Xi Victoria Lin, Chenglong Wang, Deric Pang, Kevin Vu, and Michael D Ernst. Program synthesis from natural language using recurrent neural networks. Technical Report UW-CSE-17-03-01, University of Washington Department of Computer Science and Engineering, Seattle, WA, USA, 2017.
 Lipton (2016) Zachary Chase Lipton. The mythos of model interpretability. CoRR, abs/1606.03490, 2016. URL http://arxiv.org/abs/1606.03490.

 Mnih et al. (2016) Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.
 Neelakantan et al. (2016) Arvind Neelakantan, Quoc V Le, Martin Abadi, Andrew McCallum, and Dario Amodei. Learning a natural language interface with neural programmer. arXiv preprint arXiv:1611.08945, 2016.
 Pasupat & Liang (2015) Panupong Pasupat and Percy Liang. Compositional semantic parsing on semi-structured tables. arXiv preprint arXiv:1508.00305, 2015.
 Pereyra et al. (2017) Gabriel Pereyra, George Tucker, Jan Chorowski, Lukasz Kaiser, and Geoffrey E. Hinton. Regularizing neural networks by penalizing confident output distributions. CoRR, abs/1701.06548, 2017. URL http://arxiv.org/abs/1701.06548.
 Reed & De Freitas (2015) Scott Reed and Nando De Freitas. Neural programmer-interpreters. arXiv preprint arXiv:1511.06279, 2015.
 Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (eds.), Advances in Neural Information Processing Systems 27, pp. 3104–3112. Curran Associates, Inc., 2014. URL http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf.

 Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 2818–2826. IEEE Computer Society, 2016. ISBN 978-1-4673-8851-1. doi: 10.1109/CVPR.2016.308. URL https://doi.org/10.1109/CVPR.2016.308.
 Szegedy et al. (2017) Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, volume 4, pp. 12, 2017.
 Trask et al. (2018) Andrew Trask, Felix Hill, Scott Reed, Jack W. Rae, Chris Dyer, and Phil Blunsom. Neural arithmetic logic units. CoRR, abs/1808.00508, 2018. URL http://arxiv.org/abs/1808.00508.
 Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.
 Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
 Xiong et al. (2016) Wayne Xiong, Jasha Droppo, Xuedong Huang, Frank Seide, Mike Seltzer, Andreas Stolcke, Dong Yu, and Geoffrey Zweig. Achieving human parity in conversational speech recognition. arXiv preprint arXiv:1610.05256, 2016.
 Zaremba & Sutskever (2015) Wojciech Zaremba and Ilya Sutskever. Reinforcement learning neural turing machines. CoRR, abs/1505.00521, 2015. URL http://arxiv.org/abs/1505.00521.
Appendix A Experiment Details and Hyperparameters
A.1 Image Experiments
The ImageAddition and ImageLookup tasks use the MNIST training and test sets. The input is a sequence of MNIST images, sampled uniformly from the training set. The black-box function is a sum operation which receives a sequence of digits represented as one-hot vectors. For ImageLookup, the input sequence length defines the task: we tested several lengths, and the input length determines the size of the lookup table. For ImageAddition, we trained on one input length and tested on another. The implementation was done in PyTorch.
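As a minimal sketch of this interface (not the paper's code), the black-box sum simply decodes the one-hot digit vectors it receives and adds them:

```python
import numpy as np

def blackbox_sum(one_hot_digits):
    """Black-box sum: decode each one-hot row to its digit and add.

    one_hot_digits: array of shape (sequence_length, 10), one row per digit.
    """
    digits = np.argmax(one_hot_digits, axis=1)
    return int(digits.sum())

# The sequence [3, 5], encoded as one-hot vectors, sums to 8.
seq = np.eye(10)[[3, 5]]
assert blackbox_sum(seq) == 8
```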
Architecture
The argument extractor for both tasks is a composition of two convolutional layers, each followed by local max-pooling, and a fully-connected layer, which outputs the MNIST classification for an image. The argument extractors for the images in a sequence share their parameters, and each outputs an MNIST classification for one image. The sum estimator is an LSTM network, followed by a NALU cell applied to the final LSTM output, which produces a floating-point regression value. The lookup estimator is a composition of fully-connected layers with ReLU activations. The architecture parameters are detailed in Table 5.

Argument Extractor: conv layer 1 (# filters, filter size, stride); conv layer 2 (# filters, filter size, stride); fully-connected dimensions
Lookup Estimator: fully-connected dimensions
Sum Estimator: LSTM # layers; LSTM hidden size; NALU # layers; NALU hidden size
Training
We used the hybrid training procedure, in which pre-training of the estimator (offline training) continued until performance on synthetic, uniformly sampled 10-class (MNIST) distributions either reached 90% or stopped increasing. The hyperparameters of the model are listed in Table 6. We note that confidence regularization was necessary to stabilize learning and mitigate vanishing gradients. The target losses are cross-entropy for lookup and squared distance for addition. The loss functions are:
where LSR stands for the Label Smoothing Regularization loss, and the remaining symbols stand for the entropy, the output classification, the gold (one-hot) label, and the model and gold MNIST sum regressions, respectively. One weighted component of the loss is the online loss; the other is a threshold entropy loss regularization on the argument extractor's MNIST classifications.
Parameter | Addition | Lookup
Online loss |  |
Entropy loss |  |
Entropy loss threshold |  |
LSR confidence penalty | — |
LSR label distribution prior | — | Uniform
Optimizer | Adam | Adam
Learning rate | 0.001 | 0.001
Batch size | 50 | 20
Table 6: Image tasks hyperparameters.
In the following we describe the RL environment and architecture used in our experiments. We employed fixed-length episodes and experimented with several episode lengths. The MDP was modeled as follows: at each step, a sample is randomly selected from the MNIST dataset, and the handwritten image is used as the state. The agent responds with an action from the set of digit classes. The reward in all steps except the last is 0; the reward in the last step aggregates the absolute errors between the digit labels of the presented examples and the agent's responses.
We use A3C as detailed by Mnih et al. (2016) as the learning algorithm, with a single worker which updates the master network at the end of each episode. The agent model was implemented using two convolutional layers, each followed by max-pooling; the first convolutional layer contains 10 filters. The last two layers are fully connected, of sizes 256 and 10 respectively, with ELU activation, followed by a softmax. We employed Adam optimization (Kingma & Ba, 2014).
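The terminal reward described above can be sketched as follows; the sign convention (negating so that a smaller error means a higher reward) is our assumption, since the text only states that the final reward aggregates the absolute errors:

```python
def episode_reward(labels, actions):
    """Terminal reward for a fixed-length episode: intermediate rewards
    are 0; the final reward aggregates the absolute errors between the
    digit labels and the agent's responses.
    (Negating the error sum is an assumption.)"""
    assert len(labels) == len(actions)
    return -sum(abs(l - a) for l, a in zip(labels, actions))

# One wrong response (3 instead of 2) gives a terminal reward of -1.
assert episode_reward([7, 2, 5], [7, 3, 5]) == -1
```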
A.2 Text Experiments
The TextLogic and TextLookupLogic experiments were implemented in TensorFlow on synthetic datasets generated from textual templates and sampled numbers. We give concrete details for both experiments.
A.2.1 TextLookupLogic
For the TLL task we generated a table-based question answering dataset. The TLL dataset input has two parts: a question and a table. To correctly answer a question from this dataset, the DNN has to access the right table column and apply non-differentiable logic on it using a parameter it extracts from the query. For example, consider a table that describes the number of medals won by each country during the last Olympics, and a query such as: “Which countries won more than 7 gold medals?” To answer this query the DNN has to extract the argument (7 in this case) from the query, access the relevant column (namely, gold medals), and execute the ‘greater than’ operation with the extracted argument and column content (namely a vector of numbers) as its parameters. The operation’s output vector holds the indexes of the rows that satisfy the logic condition (greater-than in our example). The final answer contains the names of the countries (i.e., from the countries column) in the selected rows.
The blackbox function interface
Solving the TLL task requires five basic logic functions: equal-to, less-than, greater-than, max, and min. Each such function defines an API composed of two inputs and one output. The first input is a vector of numbers, namely, a column in the table. The second is a scalar, namely, an argument from the question, or NaN if the scalar parameter is not relevant. The output is a binary vector of the same size as the input vector, indicating the selected rows for a specific query and thus providing the answer.
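Under the interface described above, the five black-box functions can be sketched as plain NumPy operations (the function names are ours; the scalar argument is ignored by max and min):

```python
import numpy as np

def greater_than(column, scalar):
    """Black-box 'greater-than': binary vector marking rows whose
    value exceeds the scalar argument."""
    return (column > scalar).astype(int)

def less_than(column, scalar):
    return (column < scalar).astype(int)

def equal_to(column, scalar):
    return (column == scalar).astype(int)

def col_max(column, scalar=float("nan")):
    """'max' ignores the scalar argument (NaN per the interface)."""
    return (column == column.max()).astype(int)

def col_min(column, scalar=float("nan")):
    return (column == column.min()).astype(int)

gold = np.array([10, 7, 3, 12])
greater_than(gold, 7)  # -> array([1, 0, 0, 1])
```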
TLL data
We generated tables in which the first row contains column names and the first column contains a list of entities (e.g., countries, teams, products). Subsequent columns contain the quantitative properties of an entity (e.g., population, number of wins, prices, discounts). Each TLL-generated table consists of 5 columns and 25 rows. We generated entity names (i.e., nations and clubs) for the first column by randomly selecting from a closed list, and generated values for the rest of the columns by sampling from a uniform distribution: between 1 and 100 for the train set tables, and between 300 and 400 for the test set tables. We further created two sets of randomly generated questions that use the five functions: 20,000 train questions on the train tables and 4,000 test questions on the test tables.
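A table generator consistent with this description might look as follows (a sketch; the entity list here is a hypothetical placeholder for the closed list of nations and clubs):

```python
import random

def generate_table(n_rows=25, n_cols=5, low=1, high=100, entities=None):
    """Generate one TLL table: the first column holds entity names drawn
    from a closed list, the remaining columns hold uniformly sampled
    integers (1-100 for train tables, 300-400 for test tables)."""
    # Hypothetical placeholder entity list; the paper uses nations and clubs.
    entities = entities or [f"entity_{i}" for i in range(100)]
    names = random.sample(entities, n_rows)
    return [[name] + [random.randint(low, high) for _ in range(n_cols - 1)]
            for name in names]

train_table = generate_table(low=1, high=100)
test_table = generate_table(low=300, high=400)
```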
Input representations
The TLL input was composed of words, numbers, queries, and tables. We used word pieces as detailed by Wu et al. (2016) to represent words: a word is a concatenation of word pieces and is represented as the average of its word-piece embeddings.
The exact numerical value of numbers is important to the decision. To accurately represent a number and embed it into the same word vector space, we used a number representation following the float32 scheme (Kahan, 1996). Specifically, we start by representing a number as a 32-dimension Boolean vector, then add redundancy by repeating each of the digits a fixed number of times, and last pad the resulting vector with zeros. We tried several representation schemes; this approach resulted in the best EstiNet performance. We represent the query as a matrix of word embeddings and use an LSTM model (Hochreiter & Schmidhuber, 1997) to encode the query matrix into a vector representation given by the last LSTM output.
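The number representation can be sketched as below; the exact redundancy factor and padded length are elided in the text, so they appear here as parameters:

```python
import struct

def float32_bits(x, redundancy=1, pad_to=None):
    """Represent a number by its 32 IEEE-754 float32 bits (Kahan, 1996),
    repeat each bit `redundancy` times, then zero-pad to length `pad_to`.
    The redundancy factor and padded length are assumptions."""
    packed = struct.pack(">f", x)  # 4 bytes, big-endian float32
    bits = [(byte >> i) & 1 for byte in packed for i in range(7, -1, -1)]
    bits = [b for b in bits for _ in range(redundancy)]
    if pad_to is not None:
        bits += [0] * (pad_to - len(bits))
    return bits

vec = float32_bits(7.5, redundancy=2, pad_to=128)
assert len(vec) == 128
```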
Each table, with its rows and columns, is represented as a three-dimensional tensor; each cell in the table is represented as the word-piece average of its words.
Argument Extractors Architecture
The EstiNet TLL model uses three types of “selectors” (argument extractors): operation, argument, and column. Operation selectors select the correct black-box function. Argument selectors select an argument from the query and hand it to the API. The column selector’s role is to select a column from the table and pass it to the black-box function. We implement each selector sub-network as a classifier over a predicted class matrix, in which each class is represented by a vector. For example, for a selector that has to select a word from a sentence, the matrix contains the word embeddings of the words in the sentence. One may consider various selector implementation options; we use a simple, fully-connected network with a parameter matrix and a bias, taking the selector prediction after a softmax activation layer. At inference time, the selector transforms its soft selection into a hard selection to satisfy the API requirements. EstiNet enables that using Gumbel Softmax hard selection functionality.
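The hard selection can be sketched with a straight-through Gumbel-Softmax; in the real model the soft probabilities carry the gradient, which this NumPy sketch omits:

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax_hard(logits, tau=1.0):
    """Straight-through Gumbel-Softmax selection (a sketch of the hard
    selection the selectors apply at inference; `tau` is the temperature).
    Returns a one-hot vector; in a differentiable framework the soft
    probabilities would be used for the backward pass."""
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = logits + gumbel
    soft = np.exp(y / tau) / np.exp(y / tau).sum()
    hard = np.zeros_like(soft)
    hard[np.argmax(soft)] = 1.0
    return hard

selection = gumbel_softmax_hard(np.array([2.0, 0.1, -1.0]))
assert selection.sum() == 1.0
```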
Estimator Architecture
We use five estimators, one for each of the five logic operations. Each estimator is a general-purpose sub-network that we implement with a transformer network encoder (Vaswani et al., 2017). Specifically, we use a stack of identical layers, each of which consists of two sub-layers. The first is a multi-head attention, and the second is a fully-connected feed-forward two-layer network, applied separately to each cell in the sequence. We then employ a residual connection around each of these two sub-layers, followed by layer normalization. Last, we apply a linear transformation to the encoder output, add a bias, and apply a Gumbel Softmax.
A.2.2 TextLogic
The task input is a sentence that contains a greater-than or less-than question generated from a set of ten possible natural language patterns. The argument extractor must choose the correct tokens from the input to pass to the estimator/black-box function, which executes the greater-than/less-than functionality. For example: “Out of x and y, is the first bigger?”, where x and y are float numbers sampled from a distribution. The architecture is a simple derivative of the TLL model, with two selectors for the two floating-point numbers and a classification of the choice between greater-than and less-than.