Humans have the ability to process symbolic knowledge and maintain symbolic thought (Unger and Deacon, 1998). When reasoning, humans do not require combinatorial enumeration of examples but instead utilise invariant patterns with placeholders replacing specific entities. Symbolic cognitive models (Lewis, 1999) embrace this perspective, viewing the human mind as an information processing system that operates on formal symbols, for example when reading a stream of tokens in natural language. The language of thought hypothesis (Morton and Fodor, 1978) frames human thought as a structural construct with varying sub-components such as “X went to Y”. By recognising what varies across examples, humans are capable of lifting examples into invariant principles that account for other instances. This symbolic thought with variables is learned at a young age through symbolic play (Piaget, 2001). For instance, a child learns that a sword can be substituted with a stick (Frost et al., 2004) and engages in pretend play.
|X:bernhard is a Y:frog|
|Z:lily is a Y:frog|
|Z:lily is A:green|
|what colour is X:bernhard|
Although variables are inherent in models of computation and symbolic formalisms, as in first-order logic (Russell and Norvig, 2018), they are pre-engineered and used to solve specific tasks by means of unification or assignments that bind variables to given values. However, when learning from data only, being able to recognise when and which symbols should take on different values, i.e. symbols that can act as variables, is crucial for lifting examples into general principles that are invariant across multiple instances. Figure 1 shows the invariant learned by our approach: if someone is the same thing as someone else then they have the same colour. With this invariant, our approach can solve all of the training and test examples in task 16 of the bAbI dataset (Weston et al., 2015).
In this paper we address the question of whether a machine can learn and use the notion of a variable, i.e. a symbol that can take on different values. For instance, given an example of the form “bernhard is a frog” the machine would learn that the token “bernhard” could be someone else and the token “frog” could be something else. If we consider unification a selection of the most appropriate value for a variable given a choice of values, we can reframe it as a form of attention. Attention models (Bahdanau et al., 2014; Luong et al., 2015; Chaudhari et al., 2019) allow neural networks to focus on, or attend to, certain parts of the input, often for the purpose of selecting a relevant portion. Since attention mechanisms are also differentiable, they are often jointly learned within a task. This perspective motivates our idea of a unification mechanism that utilises attention and is therefore fully differentiable, which we refer to as soft unification.
Hence, we propose an end-to-end differentiable neural network approach for learning and utilising the notion of a variable that in return can lift examples into invariants used by the network to perform reasoning tasks. Specifically, we (i) propose a novel architecture capable of learning and using variables by lifting a given example through soft unification, (ii) present the empirical results of our approach on four datasets and (iii) analyse the learned invariants that capture the underlying patterns present in the tasks. Our implementation using Chainer (Tokui et al., 2015) is publicly available at https://github.com/nuric/softuni with the accompanying data.
2 Soft Unification
Reasoning with variables involves identifying what variables are, the setting in which they are used as well as the process by which they are assigned values. When the varying components, i.e. variables, of an example are identified, the remaining structure can be lifted into an invariant which then accounts for multiple other instances.
Definition 1 (Variable).
Given a set of symbols $\mathbb{S}$, a variable X is defined as a pair $\text{X} \triangleq (x, s_d)$ where $s_d \in \mathbb{S}$ is the default symbol of the variable and $x$ is a discrete random variable of which the support is $\mathbb{S}$. The representation of a variable $\phi_V(\text{X})$ is equal to the expected value of the corresponding random variable $x$ given the default symbol $s_d$:

$$\phi_V(\text{X}) = \mathbb{E}_{x \mid s_d}[\phi(x)] = \sum_{s \in \mathbb{S}} P(x = s \mid s_d)\, \phi(s) \qquad (1)$$

where $\phi : \mathbb{S} \to \mathbb{R}^d$ is a $d$-dimensional real-valued feature of a symbol $s$.

For example, $\phi$ could be an embedding and $\phi_V(\text{X})$ would become a weighted sum of symbol embeddings as in conventional attention models. The default symbol of a variable is intended to capture the variable’s bound meaning following the idea of referents by Frege (1948). We denote variables using X, Y, A etc. such as X:bernhard where X is the name of the variable and bernhard the default symbol as shown in Figure 1.
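To make Definition 1 concrete, the following minimal Python sketch computes the representation of a variable as the expected feature vector under a probability mass function over symbols. The two-dimensional feature values, the symbol set and the helper name `variable_representation` are illustrative assumptions and are not taken from the released implementation.

```python
# Hypothetical 2-dimensional features for a tiny symbol set (illustrative values only).
EMBEDDINGS = {
    "bernhard": [1.0, 0.0],
    "lily": [0.0, 1.0],
    "frog": [1.0, 1.0],
}


def variable_representation(pmf, embeddings):
    """Return the expected feature vector: sum over symbols of P(x = s) * feature(s)."""
    if abs(sum(pmf.values()) - 1.0) > 1e-9:
        raise ValueError("the probability mass function must sum to 1")
    dim = len(next(iter(embeddings.values())))
    expected = [0.0] * dim
    for symbol, prob in pmf.items():
        for i, value in enumerate(embeddings[symbol]):
            expected[i] += prob * value
    return expected


# A variable X:bernhard whose mass is mostly on lily (assumed probabilities).
rep = variable_representation(
    {"bernhard": 0.25, "lily": 0.75, "frog": 0.0}, EMBEDDINGS
)
```

With all of the mass on the default symbol the variable reduces to the feature of that symbol, which is the sense in which the default symbol carries the bound meaning.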
Definition 2 (Invariant).
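Given a structure (e.g. list, grid) $\mathcal{G}$ over $\mathbb{S}$, an invariant is a pair $I \triangleq (G, \psi)$ where $G \in \mathcal{G}$ is the invariant example such as a tokenised story and $\psi : \mathbb{S} \to [0, 1]$ is a function representing the degree to which the symbol is considered a variable. Thus, the final representation of a symbol $s$ included in $G$, $\phi_I(s)$, is:

$$\phi_I(s) = \psi(s)\, \phi_V((x, s)) + (1 - \psi(s))\, \phi(s) \qquad (2)$$

the linear interpolation between its representation $\phi(s)$ and its variable bound value $\phi_V((x, s))$ with itself as the default symbol.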
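The interpolation in Definition 2 can be sketched in a few lines of Python. The function name `interpolate` and the example vectors are assumptions made for illustration; the sketch only shows how the degree of variableness blends a symbol's own feature with its variable bound value.

```python
def interpolate(variableness, symbol_feature, variable_feature):
    """Blend a symbol's own feature with its variable bound value.

    variableness = 0 keeps the symbol as a constant,
    variableness = 1 replaces it entirely by the bound value.
    """
    if not 0.0 <= variableness <= 1.0:
        raise ValueError("variableness must lie in [0, 1]")
    return [
        variableness * bound + (1.0 - variableness) * own
        for own, bound in zip(symbol_feature, variable_feature)
    ]


# Illustrative vectors: the symbol's own feature and the value it is bound to.
own_feature = [1.0, 0.0]
bound_value = [0.0, 1.0]
half_variable = interpolate(0.5, own_feature, bound_value)
```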
We adhere to the term invariant and refrain from mentioning rules, unground rules, etc. used in logic-based formalisms, e.g. Muggleton and de Raedt (1994), since the invariant structure need not be rule-like and the variables need not carry logical semantics. This distinction is clarified in Section 6.
Definition 3 (Unification).
Given an invariant $I$ and an example $K \in \mathcal{G}$, unification binds the variables in $I$ to symbols in $K$. Defined as a function $g(I, K)$, unification binds variables by computing the probability mass functions, $P$ in equation 1, and returns the unified representation using equation 2. The probability mass function of a variable X:$s_d$ over the symbols $s_i$ in $K$ is:

$$P(x = s_i \mid s_d) = \mathrm{softmax}\big(\phi_U(s_d) \cdot \phi_U(K)^\top\big)_i \qquad (3)$$

where $\phi_U : \mathbb{S} \to \mathbb{R}^d$ is the unifying feature of a symbol and $\phi_U(K)$ is applied element wise to symbols in $K$. If $g$ is differentiable, it is referred to as soft unification.

We distinguish $\phi$ from $\phi_U$ to emphasise that the unifying properties of the symbols might be different from their representations. For example, $\phi(\text{bernhard})$ could represent a specific person whereas $\phi_U(\text{bernhard})$ the notion of someone.
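As a rough illustration of Definition 3, the sketch below computes the probability mass function of a variable as dot-product attention between the unifying feature of its default symbol and the unifying features of the symbols in an example. The use of a plain dot product followed by a softmax, the feature values and the function names are assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def unification_pmf(default_feature, example_features):
    """Attention of a variable's default symbol over the symbols of an example."""
    scores = [
        sum(a * b for a, b in zip(default_feature, feature))
        for feature in example_features
    ]
    return softmax(scores)


# Hypothetical unifying features: bernhard and lily share the notion of "someone".
unifying = {
    "bernhard": [2.0, 0.0],
    "lily": [2.0, 0.0],
    "is": [0.0, 0.0],
    "green": [0.0, 2.0],
}
example = ["lily", "is", "green"]
pmf = unification_pmf(unifying["bernhard"], [unifying[s] for s in example])
```

In this toy setting most of the mass falls on lily, so X:bernhard would mostly take on the value lily when the invariant is unified with the example.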
Overall soft unification incorporates 3 learnable components: $\phi$, $\psi$ and $\phi_U$, which denote the base features, variableness and unifying features of a symbol respectively. Given an upstream, potentially task specific, network $f$, an invariant $I$ and an input example $K$ with a corresponding desired output $a$, the following holds:

$$f \circ g(I, K) = a \qquad (4)$$

where $f$ now predicts based on the unified representation produced by $g$. In this work, we focus on $g$, the invariants it produces together with the interaction of $f \circ g$.
3 Unification Networks
Since soft unification is end-to-end differentiable, it can be incorporated into existing task-specific upstream architectures. We present 3 architectures that model $f$ using a multi-layer perceptron (MLP), a convolutional neural network (CNN) and a memory network (Weston et al., 2014) to demonstrate the flexibility of our approach. In all cases, the $d$-dimensional representations of symbols are learnable embeddings obtained by multiplying the one-hot encoding of the symbol with a randomly initialised embedding matrix. The variableness of symbols is a learnable weight $\psi(s) = \sigma(w_s)$ where $w_s$ is a scalar parameter per symbol and $\sigma$ is the sigmoid function. We consider every symbol independently a variable irrespective of its surrounding context and leave further contextualised formulations as future work. The underlying intuition of this configuration is that a useful symbol for a correct prediction might need to take on other values for different inputs. This usefulness can be viewed as the inbound gradient to the corresponding parameter $w_s$ with $\psi(s)$ acting as a gate. For further model details including the size of the embeddings, please refer to Appendix A.
Unification MLP (UMLP) (: MLP, : RNN) We combine soft unification into a multi-layer perceptron to process fixed length inputs. In this case, the structure is a sequence of symbols with a fixed length , e.g. a sequence of digits 4234. Given an embedded input , the upstream MLP computes the output symbol based on the flattened representations where is the output of the last layer. However, to compute the unifying features , definition 3, uses a bi-directional GRU (Cho et al., 2014) running over such that where is the hidden state of the GRU at symbol and is a learnable parameter. This model emphasises the flexibility around the boundary of and that the unifying features can be computed in any differentiable manner.
Unification CNN (UCNN) (: CNN, : CNN) Given a grid of embedded symbols where is the width and the height, we use a convolutional neural network such that the final prediction is where
this time is the result of global max pooling andare learnable parameters. We also model using a separate convolutional network with the same architecture as and set where
are the convolutional layers. The grid is padded with 0s to obtainafter each convolution such that every symbol has a unifying feature. This model conveys how soft unification can be adapted to the specifics of the domain for example by using a convolution in a spatially structured input.
Unification Memory Networks (UMN) (: MemNN, : RNN) Soft unification does not need to happen prior to in a fashion but can also be incorporated at any intermediate stage multiple times. To demonstrate this ability, we unify the symbols at different memory locations at each iteration of a Memory Network (Weston et al., 2014). Memory networks can handle a list of lists structure such as a tokenised story as shown in Figure 2. The memory network uses the final hidden state of a bi-directional GRU (outer squares in Figure 2) as the sentence representations to compute a context attention. At each iteration, we unify the words between the attended sentences using the same approach in UMLP with another bi-directional GRU (inner diamonds in Figure 2) for unifying features . Following equation 2, the new unified representation of the memory slot is computed and uses it to perform the next iteration. Concretely,
produces an unification tensorwhere and is the number of sentences and words in the invariant respectively, and is the number of sentences in the example such that after the context attentions are applied over and , we obtain as the unified sentence at that iteration. Note that unlike in the UMLP case, the sentences can be of varying length. The prediction is then where is the hidden state of the invariant after iterations. This setup, however, requires pre-training such that the context attentions match the correct sentences.
A task might contain different questions such as “Where is X?” and “Why did X go to Y?”. To let the models differentiate between questions and potentially learn different invariants, we extend them with a repository of invariants and aggregate the predictions from each invariant. One simple approach is to sum the predictions of the invariants used in UMLP and UCNN. Another approach could be to use features from the invariants such as memory representations in the case of UMN. For UMN, we weigh the predictions using a bilinear attention based on the hidden states at the first iteration and such that
. To initially form the repository of invariants, we use the bag-of-words representation of the questions and find the most dissimilar ones based on their cosine similarity as a heuristic to obtain varied examples.
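The heuristic for seeding the repository can be sketched as follows. The text above only states that bag-of-words representations and cosine similarity are used, so the greedy farthest-first selection, the whitespace tokenisation and the function names below are assumptions made for illustration.

```python
import math
from collections import Counter


def bag_of_words(question):
    """Count lower-cased, whitespace-separated tokens."""
    return Counter(question.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two token count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pick_dissimilar(questions, count):
    """Greedy selection (an assumption): start from the first question, then
    repeatedly add the question least similar to those already chosen."""
    bows = [bag_of_words(q) for q in questions]
    chosen = [0]
    while len(chosen) < min(count, len(questions)):
        remaining = [i for i in range(len(questions)) if i not in chosen]
        best = min(
            remaining,
            key=lambda i: max(cosine_similarity(bows[i], bows[j]) for j in chosen),
        )
        chosen.append(best)
    return [questions[i] for i in chosen]


questions = [
    "where is mary",
    "where is john",
    "why did mary go to the kitchen",
]
seeds = pick_dissimilar(questions, 2)
```

Here the second seed is the "why" question because it shares only one token with the first, whereas the two "where" questions share two of their three tokens.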
We use 4 datasets, each consisting of a context, a query and an answer: fixed length sequences of symbols, shapes of symbols in a grid, story based natural language reasoning with the bAbI (Weston et al., 2015) dataset and logical reasoning represented as logic programs. Examples are shown in Table 1 with further samples in Appendix B. In each case we use an appropriate model: UMLP for fixed length sequences, UCNN for grid and UMN for iterative reasoning. We use synthetic datasets of which the data generating distributions are known to evaluate not only the quantitative performance but also the quality of the invariants learned by our approach.
|Where is Mary?||kitchen||1k, 50|
|p(a).||True||1k, 10k, 50|
Fixed Length Sequences We generate sequences of length with 8 unique symbols represented as digits to predict (i) a constant, (ii) the head of the sequence, (iii) the tail and (iv) the duplicate symbol. We randomly generate 1000 triples and then only take the unique ones to ensure the test split contains unseen examples. The training is then performed over a 5-fold cross-validation.
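The generation procedure for the sequence tasks can be sketched as below. The sequence length of 4 (matching the example 4234 used earlier), the value of the constant answer, the task names and the handling of the duplicate task are assumptions for illustration; only the 8 symbols, the four tasks, the 1000 draws and the removal of duplicates come from the description above.

```python
import random

SYMBOLS = list(range(1, 9))  # 8 unique symbols represented as digits
LENGTH = 4                   # assumed sequence length
CONSTANT = 2                 # assumed answer for the constant task


def make_example(task, rng):
    """Return one (sequence, task, answer) triple for the given task."""
    if task == "duplicate":
        # Assumption: exactly one symbol appears twice, all others once.
        distinct = rng.sample(SYMBOLS, LENGTH - 1)
        answer = rng.choice(distinct)
        sequence = distinct + [answer]
        rng.shuffle(sequence)
        return tuple(sequence), task, answer
    sequence = tuple(rng.choice(SYMBOLS) for _ in range(LENGTH))
    answers = {"constant": CONSTANT, "head": sequence[0], "tail": sequence[-1]}
    return sequence, task, answers[task]


def generate(task, count=1000, seed=0):
    """Draw `count` random triples and keep only the unique ones."""
    rng = random.Random(seed)
    return sorted({make_example(task, rng) for _ in range(count)})


head_data = generate("head")
```

Because duplicates are discarded before splitting, a sequence that appears in one fold cannot reappear in another, which is what allows the test split to contain only unseen examples.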
Grid To spatially organise symbols, we generate a grid of size with 8 unique symbols organised into a box of identical symbols, a vertical, diagonal or horizontal sequence of length 3, a cross or a plus shape and a triangle. In each task we predict (i) the identical symbol, (ii) the head of the sequence, (iii) the centre of the cross or plus and (iv) the corner of the triangle respectively. We follow the same procedure as for sequences and randomly generate 1000 triples, discarding duplicates.
bAbI The bAbI dataset has become a standard benchmark for evaluating memory based networks. It consists of 20 synthetically generated natural language reasoning tasks (refer to Weston et al. (2015) for task details). We take the 1k English set and use 0.1 of the training set as validation. Each token is lower cased and considered a unique symbol. Following previous works (Seo et al., 2016; Sukhbaatar et al., 2015), we take multiple word answers also to be a unique symbol in .
Logical Reasoning To demonstrate the flexibility of our approach and distinguish our notion of a variable from that used in logic based formalisms, we generate logical reasoning tasks in the form of logic programs using the procedure by Cingillioglu and Russo (2018). The tasks involve learning over 12 classes of logic programs exhibiting varying paradigms of logical reasoning including negation by failure (Clark, 1978). We generate 1k and 10k logic programs per task for training with 0.1 as validation and another 1k for testing. We set the arity of literals to 1 or 2 using one random character from the English alphabet for predicates and constants, e.g. and an upper case character for logical variables, e.g. .
We probe three aspects of soft unification: the impact of unification on performance over unseen data, the effect of multiple invariants and data efficiency. To that end, we train UMLP and UCNN with and without unification, UMN with pre-training using 1 or 3 invariants over either the entire training set or only 50 examples. Every model is trained 3 times via back-propagation using Adam (Kingma and Ba, 2014) on an Intel Core i7-6700 CPU using the following objective function:
where is the negative log-likelihood with sparsity regularisation over at to discourage the models from utilising spurious number of variables. For UMLP and UCNN, we set for training just the unified output and the converse for the non-unifying versions. To pre-train the UMN, we start with
for 40 epochs then setto jointly train the unified output. For iterative tasks, the mean squared error between hidden states at each iteration and, in the strongly supervised cases, the negative log-likelihood for the context attentions are also added to the objective function. Further details such as batch size and total number of epochs are available in Appendix C.
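An objective of the shape described above can be sketched as follows. This is only a sketch under stated assumptions: the two weights, the sparsity coefficient, the placement of the sparsity penalty alongside the unified loss and all names are hypothetical, and the auxiliary mean squared error and context attention terms used for iterative tasks are omitted.

```python
import math


def negative_log_likelihood(probabilities, target):
    """Negative log-likelihood of the target class under a predicted distribution."""
    return -math.log(probabilities[target])


def objective(plain_probs, unified_probs, target, variableness,
              weight_plain, weight_unified, sparsity):
    """Weighted sum of the plain loss and the unified loss with a sparsity penalty.

    weight_plain, weight_unified and sparsity are hypothetical names for the
    coefficients; variableness holds one value in [0, 1] per symbol.
    """
    plain_loss = negative_log_likelihood(plain_probs, target)
    unified_loss = negative_log_likelihood(unified_probs, target)
    penalty = sparsity * sum(variableness)
    return weight_plain * plain_loss + weight_unified * (unified_loss + penalty)


# Training only the unified output: the plain loss is switched off.
loss = objective([0.5, 0.5], [0.5, 0.5], 0, [1.0, 0.0], 0.0, 1.0, 0.1)
```

Switching the two weights gives the non-unifying baseline, and starting with the unified weight at zero before enabling it mirrors the pre-training schedule described for UMN.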
Figure 3 portrays how soft unification generalises better to unseen examples in test sets - the same sequence or grid never appears in both the training and test sets as outlined in Section 4 - over plain models. Despite having more trainable parameters than alone, the models with unification not only maintain higher accuracy in each iteration and solve the tasks in as few as 250 iterations with training examples but also improve accuracy by when trained with only per task. We believe soft unification architecturally biases the models towards learning structural patterns which in return achieves better results on recognising common patterns of symbols across examples. Results with multiple invariants are identical and the models seem to ignore the extra invariants due to the fact that the tasks can be solved with a single invariant and the regularisation applied on zeroing out unnecessary invariants; further results in Appendix D. The fluctuations in accuracy around iterations 750 to 1000 in UCNN are also caused by penalising which forces the model to relearn the task with less variables half way through training.
|# Invs / Model||1||3||1||3||3||N2N||GN2N||EntNet||QRN||MemNN|
Following Tables 2 and 3, we observe a trend of better performance through strong supervision, more data per task and using only 1 invariant. We believe strong supervision aids with selecting the correct sentences to unify and in a weak setting the model attempts to unify arbitrary context sentences often failing to follow the iterative reasoning chain. The increase in performance with more data and strong supervision is consistent with previous work reflecting how can be bounded by the efficacy of modelled as a memory network. As a result, only in the supervised case do we observe a minor improvement over MemNN by 0.7 in Table 2 and no improvement in the weak case over DMN or IMA in Table 3 with failing 17/20 and 12/12 tasks when trained using only 50 examples. The increase in error rate with 3 invariants, we speculate, stems from having more parameters and more pathways in the model rendering training more difficult.
|# Invs / Model||1||3||1||3||1||3||1||1||3||3||DMN||IMA|
After training, we can extract the learned invariants by applying a threshold on indicating whether a symbol is used as a variable or not. We set for all datasets except for bAbI, we use . The magnitude of this threshold seems to depend on the amount of regularisation , equation 5, and the number of training steps along with batch size all controlling how much is pushed towards 0. Sample invariants shown in Figure 4 describe the common patterns present in the tasks with parts that contribute towards the final answer becoming variables. Extra symbols such as is or travelled do not emerge as variables, as shown in Figure 3(a); we attribute this behaviour to the fact that changing the token travelled to went does not influence the prediction but changing the action, the value of Z:left to ‘picked’ does. However, based on random initialisation, our approach can convert a random symbol into a variable and let compensate for the unifications it produces. For example, the invariant “X:8 5 2 2” could predict the tail of another example by unifying the head with the tail using , equation 3, of those symbols. Pre-training as done in UMN seems to produce more robust and consistent invariants compared to immediately training since, we speculate, by equation 4 might encourage .
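Extracting an invariant by thresholding can be sketched as below. The threshold value, the variableness scores and the order in which variable names are handed out are assumptions for illustration; the output format follows the X:bernhard notation used in Figure 1.

```python
def render_invariant(tokens, variableness, threshold):
    """Mark every token whose variableness exceeds the threshold as a variable.

    The same symbol always receives the same variable name; names are assigned
    in order of first appearance (an assumption about the naming scheme).
    """
    names = iter("XYZABCDEFGHIJKLMNOPQRSTUVW")
    assigned = {}
    rendered = []
    for token in tokens:
        if variableness.get(token, 0.0) > threshold:
            if token not in assigned:
                assigned[token] = next(names)
            rendered.append(f"{assigned[token]}:{token}")
        else:
            rendered.append(token)
    return " ".join(rendered)


# Hypothetical learned variableness values for a four-token sentence.
scores = {"bernhard": 0.9, "is": 0.01, "a": 0.0, "frog": 0.8}
invariant = render_invariant("bernhard is a frog".split(), scores, threshold=0.5)
```

Raising the threshold turns fewer symbols into variables, which is why its magnitude interacts with the amount of regularisation applied during training.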
Interpretability versus Ability A desired property of interpretable models is transparency (Lipton, 2018). A novel outcome of the learned invariants in our approach is that they provide an approximation of the underlying general principle present in the data such as the structure of multi-hop reasoning shown in Figure 3(e). However, certain aspects such as temporal reasoning are still hidden inside . In Figure 3(b), although we observe Z:morning as a variable, the overall learned invariant captures nothing about how changing the value of Z:morning alters the behaviour of . The model might look before or after a certain time point X:bill went somewhere depending what Z:morning binds to. Without the regularising term on , we initially noticed the models using, one might call extra, symbols as variables and binding them to the same value occasionally producing unifications such as “bathroom bathroom to the bathroom” and still predicting, perhaps unsurprisingly, the correct answer as bathroom. Hence, regularising with the correct amount in equation 5 seems critical in extracting not just any invariant but one that represents the common structure.
Soft unification from equation 3 reveals three main patterns: one-to-one, one-to-many or many-to-one bindings as shown in Figure 5. Figures 4(a) and 4(d) capture what one might expect unification to look like, where variables unify with their corresponding counterparts. However, occasionally the model can optimise to use fewer variables and squeeze the required information into a single variable, for example by binding Y:bathroom to john and kitchen as in Figure 4(b). We believe this occurs due to the sparsity constraint on the variableness of symbols encouraging the model to be as conservative as possible. In a similar fashion, the unification can bind a single variable Y:o to both other constants as in Figure 4(e). Finally, if there are more variables than needed as in Figure 4(f), we observe a many-to-one binding with Y:w and Z:e mapping to the same constant. This behaviour raises the question of how the model differentiates between and . We speculate the model uses the magnitude of and to encode the difference despite both variables unifying with the same constant.
7 Related Work
Learning an underlying general principle in the form of an invariant is often the means for arguing generalisation in neural networks. For example, Neural Turing Machines (Graves et al., 2014) are tested on previously unseen sequences to support the view that the model might have captured the underlying pattern or algorithm. In fact, Weston et al. (2014) claim “MemNNs can discover simple linguistic patterns based on verbal forms such as (X, dropped, Y), (X, took, Y) or (X, journeyed to, Y) and can successfully generalise the meaning of their instantiations.” However, this claim is based on the output of the model and unfortunately it is unknown whether the model has truly learned such a representation or indeed is utilising it. Our approach sheds light on this ambiguity and presents these linguistic patterns explicitly as invariants, ensuring their utility through soft unification without solely analysing the output of the upstream network on previously unseen symbols. Although we may associate these invariants with our existing understanding of the task and mistakenly anthropomorphise the machine, for example by thinking it has learned X:mary as someone, it is important to acknowledge that these are just symbolic patterns. In these cases, our interpretations do not necessarily correspond to any understanding of the machine, relating to the Chinese room argument made by Searle (1980).
Learning invariants by lifting ground examples is related to least common generalisation (Reynolds, 1970) by which inductive inference is performed on facts (Shapiro, 1981) such as generalising went(mary,kitchen) and went(john,garden) to went(X,Y). Unlike in a predicate logic setting, our approach allows for soft alignment and therefore generalisation between varying length sequences. Existing neuro-symbolic systems (Broda et al., 2002) focus on inducing rules that adhere to given logical semantics of what a variable and a rule are. For example, Evans and Grefenstette (2018) construct a network by rigidly following the given semantics of first-order logic. Similarly, Lifted Relational Neural Networks (Sourek et al., 2015) ground first-order logic rules into a neural network while Neural Theorem Provers (Rocktäschel and Riedel, 2017) build neural networks using backward-chaining (Russell and Norvig, 2018) on a given background knowledge base with templates. This architectural approach for combining logical variables is also observed with TensorLog (Cohen, 2016) and Logic Tensor Networks (Serafini and d’Avila Garcez, 2016) while grounding logical rules can also be used as regularisation (Hu et al., 2016). However, in these systems the notion of a variable is pre-engineered rather than learned, with a focus on presenting a practical approach to solving certain problems, whereas our motivation stems from a cognitive perspective.
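For contrast with soft unification, the symbolic generalisation of two ground facts mentioned above can be sketched as a small anti-unification routine. The tuple encoding of atoms, the restriction to flat atoms with constant arguments and the function name are assumptions made for this example.

```python
def generalise(atom_a, atom_b):
    """Generalise two flat ground atoms of the form (predicate, arg1, arg2, ...).

    Equal arguments are kept; each distinct pair of differing arguments is
    replaced by a fresh variable, reused if the same pair occurs again.
    Returns None when the predicates or arities differ.
    """
    if atom_a[0] != atom_b[0] or len(atom_a) != len(atom_b):
        return None
    names = iter("XYZUVW")
    pair_to_variable = {}
    arguments = []
    for left, right in zip(atom_a[1:], atom_b[1:]):
        if left == right:
            arguments.append(left)
        else:
            if (left, right) not in pair_to_variable:
                pair_to_variable[(left, right)] = next(names)
            arguments.append(pair_to_variable[(left, right)])
    return (atom_a[0], *arguments)


lifted = generalise(("went", "mary", "kitchen"), ("went", "john", "garden"))
```

This procedure requires both atoms to have the same predicate and arity, which is exactly the rigidity that soft alignment between varying length sequences avoids.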
At first it may seem the learned invariants, Section 6, make the model more interpretable; however, this transparency is not of the model but of the data. The invariant captures patterns that potentially approximate the data generating distribution but we still do not know how the model uses them upstream. Thus, from the perspective of explainable artificial intelligence (XAI) (Adadi and Berrada, 2018), learning invariants or interpreting them does not constitute an explanation of the reasoning model even though “if someone goes somewhere then they are there” might look like one. Instead, it can be perceived as causal attribution (Miller, 2019) in which someone being somewhere is attributed to them going there. This perspective also relates to gradient based model explanation methods such as Layer-Wise Relevance Propagation (Bach et al., 2015) and Grad-CAM (Selvaraju et al., 2017; Chattopadhay et al., 2018). Consequently, a possible view on the variableness of a symbol, equation 2, is a gradient based usefulness measure such that a symbol utilised upstream to determine the answer becomes a variable, similar to how a group of pixels in an image contributes more to its classification.
Finally, one can argue that our model maintains a form of counterfactual thinking (Roese, 1997) in which soft unification creates counterfactuals on the invariant example to alter the output of towards the desired answer, equation 4. The question where Mary would have been if Mary had gone to the garden instead of the kitchen is the process by which an invariant is learned through multiple examples during training. This view relates to methods of causal inference (Pearl, 2019; Holland, 1986) in which counterfactuals are vital as demonstrated in structured models by Pearl (1999).
We presented a new approach for learning variables and lifting examples into invariants through the usage of soft unification. Evaluating on four datasets, we analysed how Unification Networks perform comparatively to existing similar architectures while having the benefit of lifting examples into invariants that capture underlying patterns present in the tasks. Since our approach is end-to-end differentiable, we plan to apply this technique to multi-modal tasks in order to yield multi-modal invariants for example in visual question answering.
We would like to thank Murray Shanahan for his helpful comments, critical feedback and insights regarding this work.
- Peeking inside the black-box: a survey on explainable artificial intelligence (XAI). IEEE Access 6, pp. 52138–52160.
- On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE 10 (7), pp. e0130140.
- Neural machine translation by jointly learning to align and translate.
- Neural-symbolic learning systems. Springer London.
- Grad-CAM++: generalized gradient-based visual explanations for deep convolutional networks. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 839–847.
- An attentive survey of attention models.
- On the properties of neural machine translation: encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation.
- DeepLogic: towards end-to-end differentiable logical reasoning.
- Negation as failure. In Logic and Data Bases, pp. 293–322.
- TensorLog: a differentiable deductive database. arXiv:1605.06523.
- Learning explanatory rules from noisy data. 61, pp. 1–64.
- Sense and reference. 57 (3), pp. 209.
- The developmental benefits of playgrounds. Association for Childhood Education International.
- Neural turing machines.
- Hybrid computing using a neural network with dynamic external memory. Nature 538 (7626), pp. 471–476.
- Tracking the world state with recurrent entity networks. ICLR.
- Statistics and causal inference. Journal of the American Statistical Association 81 (396), pp. 945–960.
- Harnessing deep neural networks with logic rules. ACL.
- Adam: a method for stochastic optimization. ICLR.
- Ask me anything: dynamic memory networks for natural language processing. In ICML, pp. 1378–1387.
- Cognitive modeling, symbolic. pp. 525–527.
- The mythos of model interpretability. Communications of the ACM 61 (10), pp. 36–43.
- Gated end-to-end memory networks. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pp. 1–10.
- Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.
- Explanation in artificial intelligence: insights from the social sciences. Artificial Intelligence 267, pp. 1–38.
- The language of thought. Vol. 75, Philosophy Documentation Center.
- Inductive logic programming: theory and methods. The Journal of Logic Programming 19-20, pp. 629–679.
- Probabilities of causation: three counterfactual interpretations and their identification. Synthese 121 (1-2), pp. 93–149.
- The seven tools of causal inference with reflections on machine learning. Communications of the ACM 62 (3), pp. 54–60.
- The psychology of intelligence. Routledge.
- Transformational systems and algebraic structure of atomic formulas. Machine intelligence 5, pp. 135–151.
- End-to-end differentiable proving. pp. 3791–3803.
- Counterfactual thinking. Psychological Bulletin 121 (1), pp. 133–148.
- Artificial intelligence: a modern approach (3rd edition). Pearson.
- Minds, brains, and programs. 3 (3), pp. 417–424.
- Grad-CAM: visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 618–626.
- Query-reduction networks for question answering. ICLR.
- Logic tensor networks: deep learning and logical reasoning from data and knowledge. arXiv:1606.04422.
- Inductive inference of theories from facts. Yale University, Department of Computer Science.
- Lifted relational neural networks. arXiv:1508.05128.
- End-to-end memory networks. In NIPS, pp. 2440–2448.
- Chainer: a next-generation open source framework for deep learning. In Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS).
- The symbolic species: the co-evolution of language and the brain. Vol. 82, Wiley.
- Towards ai-complete question answering: a set of prerequisite toy tasks.
- Memory networks.
- Dynamic memory networks for visual and textual question answering. pp. 2397–2406.
Appendix A Model Details
A.1 Unification MLP & CNN
Unification MLP (UMLP) To model as a multi-layer perceptron, we take symbol embeddings of size and flatten sequences of length
into an input vector of size. The MLP consists of 2 hidden layers with non-linearity of sizes and respectively. To process the query, we concatenate the one-hot encoding of the task id to yielding a final input of size . For unification features , we use a bi-directional GRU with hidden size and an initial state of
. The hidden state at each symbol is taken with a linear transformation to givewhere is the hidden state of the biGRU. The variable assignment is then computed as an attention over the according to equation 3.
Unification CNN (UCNN) We take symbols embeddings of size to obtain an input grid . Similar to UMLP, for each symbol we append the task id as a one-hot vector to get an input of shape . Then consists of 2 convolutional layers with
filters each, kernel size of 3 and stride 1. We usenon-linearity in between the layers. We pad the grid with 2 columns and 2 rows to a such that the output of the convolutions yield again a hidden output of the same shape. As the final hidden output , we take a global max pool to over to obtain . Unification function is modelled identical to without the max pooling such that where is the hidden output of the convolutional layers.
A.2 Unification Memory Networks
Unlike previous architectures, with UMN we interleave into . We use embedding sizes of and model with an iterative memory network. We take the final hidden state of a bi-directional GRU, with initial state , to represent the sentences of the context and query in a -dimensional vector and the query . The initial state of the memory network is . At each iteration :
where is another -dimensional bi-directional GRU and with the element-wise multiplication and the concatenation of vectors. Taking as the context attention, we obtain the next state of the memory network:
and iterate many times in advance. The final prediction becomes . All weight matrices
and bias vectorsare independent but are tied across iterations.
Appendix B Generated Dataset Samples
Appendix C Training Details
C.1 Unification MLP & CNN
Both unification models are trained on a 5-fold cross-validation over the generated datasets for 2000 iterations with a batch size of 64. We do not use any weight decay and save the training and test accuracies every 10 iterations, as presented in Figure 3.
C.2 Unification Memory Networks
We again use a batch size of 64 and pre-train for 40 epochs then together with for 260 epochs. We use epochs for UMN since the dataset sizes are fixed. To learn alongside , we combine error signals from the unification of the invariant and the example. Following equation 4, the objective function not only incorporates the negative log-likelihood of the answer but also the mean squared error between intermediate states and at each iteration as an auxiliary loss:
We pre-train by setting for 40 epochs and then set . For strong supervision we also compute the negative log-likelihood for the context attention , described in Appendix A
, at each iteration using the supporting facts of the tasks. We apply a dropout of 0.1 for all recurrent neural networks used and only for the bAbI dataset weight decay with 0.001 as the coefficient.
Appendix D Further Results
|# Invs / Model||1||3||1||3||1||3||1||1||3||3||DMN||IMA|
|1 Step NBF||46.4||38.7||3.8||28.8||1.1||21.6||0.1||1.1||8.0||47.6||21.0||2.0|
|2 Steps NBF||48.5||48.9||7.7||39.6||30.4||48.2||0.1||33.4||28.7||50.3||15.0||4.0|
|X:sandra went back to the Y:bathroom|
|is X:sandra in the Y:bathroom|
|X:m ( Y:e )||X:m ( Y:e )|
|X:a ( Y:w , Z:e )||X:a ( Y:w , Z:e )|
|X:m ( T )||X:m ( c )|
|X:x ( A ) not Z:q ( A )||X:x ( Y:z )|