Neural Analogical Matching

04/07/2020
by Maxwell Crouse, et al.
IBM
Northwestern University

Analogy is core to human cognition. It allows us to solve problems based on prior experience, it governs the way we conceptualize new information, and it even influences our visual perception. The importance of analogy to humans has made it an active area of research in the broader field of artificial intelligence, resulting in data-efficient models that learn and reason in human-like ways. While cognitive perspectives of analogy and deep learning have generally been studied independently of one another, the integration of the two lines of research is a promising step towards more robust and efficient learning techniques. As part of a growing body of research on such an integration, we introduce the Analogical Matching Network: a neural architecture that learns to produce analogies between structured, symbolic representations that are largely consistent with the principles of Structure-Mapping Theory.


1 Introduction

Analogical reasoning is a form of inductive reasoning that cognitive scientists consider to be one of the cornerstones of human intelligence gentner2003we; hofstadter2001analogy; hofstadter1995fluid. Analogy shows up at nearly every level of human cognition, from low-level visual processing sagi2012difference to abstract conceptual change gentner1997analogical. Problem solving using analogy is common, with past solutions forming the basis for dealing with new problems holyoak1984development; novick1988analogical. Analogy also facilitates learning and understanding by allowing people to generalize specific situations into increasingly abstract schemas gick1983schema.

Many different theories have been proposed for how humans perform analogy mitchell1993analogy; chalmers1992high; gentner1983structure; holyoak1996mental. One of the most influential theories is Structure-Mapping Theory (SMT) gentner1983structure, which posits that analogy involves the alignment of structured representations of objects or situations subject to certain constraints. Key characteristics of SMT are its use of symbolic representations and its emphasis on relational structure, which allow the same principles to apply to a wide variety of domains.

Until now, the symbolic, structured nature of SMT has made it a poor fit for deep learning. The representations produced by deep learning techniques are incompatible with off-the-shelf SMT implementations like the Structure-Mapping Engine (SME) forbus2017extending, while the symbolic graphs that SMT assumes as input are challenging to encode with traditional neural methods. However, recent advances in deep learning have made it possible to bridge the gap between the two traditions, providing the architectural tools needed to create neural networks that can learn to produce analogies.

Contributions: We introduce the Analogical Matching Network (AMN), a neural architecture that learns to produce analogies between symbolic representations. Though trained on purely synthetic data, we show over a diverse set of existing analogy problems that AMN’s outputs are largely consistent with SMT. With AMN, we aim to push the boundaries of deep learning and extend them to an important area of human cognition. It is our hope that future generations of neural architectures can reap the same benefits from analogy that symbolic reasoning systems and humans currently do. Code for this work is publicly available at https://github.com/mvcrouse/NeuralAnalogy.

2 Related Work

Though unrelated to analogical reasoning, it is worth noting that there has been a surge of interest in building deep learning systems for more formal reasoning. Prior work has investigated deep learning with automated reasoning lee2019mathematical; paliwal2019graph, combinatorial problem solving vinyals2015pointer; khalil2017learning; kool2018attention; emami2018learning, dynamic programming xu2019can, abstract reasoning steenbrugge2018improving; santoro2018measuring, and question-answering clark2018think; clark2015elementary; levesque2012winograd; tafjord2019quarel.

Many different computational models of analogy have been proposed holyoak1989analogical; o1999computability; forbus2017extending, each instantiating a different cognitive theory of analogy. The differences between them are compounded by the computational costs of analogical reasoning, a provably NP-hard problem veale1997competence. These computational models are often used to test cognitive theories of human behavior, but they are also useful tools for applied tasks. For instance, the computational model we compare AMN to in this work, the Structure-Mapping Engine (SME), has been used in natural language question-answering ribeiro2013predicting; crouse2018learning, computer vision chen2019human; chen2018action, and machine reasoning klenk2005solving; friedman2010integrated.

Many of the early approaches to analogy were connectionist gentner1993analogy. The STAR architecture of halford1994connectionist used tensor product representations of structured data to perform simple analogies of the form $R(x, y) \Rightarrow S(f(x), f(y))$. Drama eliasmith2001integrating was an implementation of the multi-constraint theory of analogy holyoak1996mental that employed a holographic representation similar to tensor products to embed structure. LISA hummel1997distributed; hummel2005relational was a hybrid symbolic connectionist approach to analogy. It staged the mapping process temporally, generating mappings from elements of the compared representations that were activated at the same time.

Cognitive perspectives of analogy have gone relatively unexplored in deep learning research, with only a few recent works that address them hill2019learning; zhang2019raven. Generally, prior deep learning work has only considered analogies of the form $A : B :: C : D$ mikolov2013linguistic; reed2015deep, where the task is to identify a relation that holds across a set of examples and then apply it to novel data. Still, such prior works demonstrated progress in applying analogy to more natural perceptual data in the form of images or language. As of yet, no work has explored a deep learning approach to analogy that operates over the graph-based symbolic representations used in standard computational models of analogy.

3 Structure-Mapping Theory

In Structure-Mapping Theory (SMT) gentner1983structure, analogy centers around the structural alignment of relational representations (see Figure 1). A relational representation is a set of logical expressions constructed from entities (e.g., sun), attributes (e.g., YELLOW), functions (e.g., TEMPERATURE), and relations (e.g., GREATER). Structural alignment is the process of producing a mapping between two relational representations (referred to as the base and target). A mapping is a triple $\langle M, I, S \rangle$, where $M$ is a set of correspondences between the base and target, $I$ is a set of candidate inferences (i.e., inferences about the target that can be made from the structure of the base), and $S$ is a structural evaluation score that measures the quality of $M$. Correspondences are pairs of elements between the base and target (i.e., expressions or entities) that are identified as matching with one another. While entities can be matched together irrespective of their labels, there are more rigorous criteria for matching expressions. SMT asserts that matches should satisfy the following properties:

  1. One-to-One: Each element of the base and target can be a part of at most one correspondence.

  2. Parallel Connectivity: Two expressions can be in a correspondence with each other only if their arguments are also in correspondences with each other.

  3. Tiered Identicality: Relations of expressions in a correspondence must match identically, but functions need not be identical if their correspondence would support structural connectivity.

  4. Systematicity: Preference should be given to mappings with more deeply nested expressions.

Atom (left)                      Solar System (right)
[1] nucleus                      [8]  sun
[2] electron                     [9]  planet
[3] MASS([1])                    [10] MASS([8])
[4] MASS([2])                    [11] MASS([9])
[5] ATTRACTS([1], [2])           [12] TEMPERATURE([8])
[6] REVOLVES-AROUND([2], [1])    [13] TEMPERATURE([9])
[7] GREATER([3], [4])            [14] REVOLVES-AROUND([9], [8])
                                 [15] GREATER([10], [11])
                                 [16] GREATER([12], [13])
                                 [17] ATTRACTS([9], [8])
                                 [18] CAUSES(AND([15], [17]), [14])
                                 [19] YELLOW([8])

Figure 1: Relational and graph representations for models of the atom (left) and Solar System (right). Light green edges indicate the set of correspondences between the two graphs.

To understand these properties, we will use the example in Figure 1, which draws an analogy between the Solar System and the Rutherford model of the atom. A set of correspondences $M$ between the base (Solar System) and target (Rutherford atom) is a set of pairs of elements from both sets, e.g., $\{\langle [8], [1] \rangle, \langle [9], [2] \rangle\}$. The one-to-one constraint restricts each element to be a member of at most one correspondence. Thus, if $\langle [8], [1] \rangle$ was a member of $M$, then $\langle [8], [2] \rangle$ could not be added to $M$. Parallel connectivity enforces correspondence between arguments if the parents are in correspondence. In this example, if $\langle [14], [6] \rangle$ was a member of $M$, then both $\langle [9], [2] \rangle$ and $\langle [8], [1] \rangle$ would need to be members of $M$. In addition, parallel connectivity respects argument order when dealing with ordered relations. Tiered identicality is not relevant in this example; however, if [10] used the label WEIGHT instead of MASS, tiered identicality could be used to match [3] and [10], since such a correspondence would allow for a match between their parents. The last property, systematicity, results in larger correspondence sets being preferred over smaller ones. Note that the singleton set $\{\langle [8], [1] \rangle\}$ satisfies SMT’s constraints, but it is clearly not useful by itself. Systematicity captures the natural preference for larger, more interesting matches.
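The one-to-one and parallel connectivity constraints can be checked mechanically. The sketch below is illustrative only: the dictionary encoding of part of Figure 1, the function names, and the assumption that every relation is ordered are ours, not part of SME or AMN.

```python
# Hypothetical encoding of part of Figure 1: node id -> (label, argument ids).
# Entities have no arguments; every relation is assumed to be ordered.
BASE = {  # Solar System
    8: ("sun", ()),
    9: ("planet", ()),
    10: ("MASS", (8,)),
    11: ("MASS", (9,)),
    14: ("REVOLVES-AROUND", (9, 8)),
    15: ("GREATER", (10, 11)),
}
TARGET = {  # Rutherford atom
    1: ("nucleus", ()),
    2: ("electron", ()),
    3: ("MASS", (1,)),
    4: ("MASS", (2,)),
    6: ("REVOLVES-AROUND", (2, 1)),
    7: ("GREATER", (3, 4)),
}


def one_to_one(matches):
    """True if no base or target node appears in more than one pair."""
    bases = [b for b, _ in matches]
    targets = [t for _, t in matches]
    return len(set(bases)) == len(bases) and len(set(targets)) == len(targets)


def parallel_connectivity(matches, base, target):
    """True if every matched expression has its arguments matched in order."""
    for b, t in matches:
        b_args, t_args = base[b][1], target[t][1]
        if len(b_args) != len(t_args):
            return False
        if any(pair not in matches for pair in zip(b_args, t_args)):
            return False
    return True


good = {(8, 1), (9, 2), (14, 6)}
print(one_to_one(good), parallel_connectivity(good, BASE, TARGET))  # True True
print(one_to_one(good | {(8, 2)}))                                  # False
print(parallel_connectivity({(14, 6)}, BASE, TARGET))               # False
```

Tiered identicality and systematicity are left out of this sketch, since they depend on label comparison and on scoring rather than on set membership alone.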

Candidate inferences are statements from the base that are projected into the target to fill in missing structure gentner1993analogy; bowdle1997informativity; gentner1998analogy. Given a set of correspondences $M$, candidate inferences are created from statements in the base that are supported by expressions in $M$ but are not part of $M$ themselves. Returning to Figure 1, one candidate inference would be CAUSES(AND([7],[5]),[6]), derived from [18] by substituting its arguments with the expressions they correspond to in the target. In this work, we adopt SME’s default criteria for computing candidate inferences. Valid candidate inferences are all statements that have some dependency that is included in the correspondences or an ancestor that is a candidate inference (e.g., an expression whose parent has arguments in the correspondences).
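The substitution step in the Figure 1 example can be sketched as a small recursive rewrite. The tuple encoding and the helper names `project` and `render` are hypothetical; this is not SME's procedure for deciding which statements qualify, only the projection of one statement once its arguments have correspondences.

```python
def project(expr, mapping):
    """Replace each base node id in `expr` with its corresponding target id.

    `expr` is either an int (a node id) or a tuple (label, arg1, arg2, ...).
    A KeyError means an argument has no correspondence to substitute.
    """
    if isinstance(expr, int):
        return mapping[expr]
    label, *args = expr
    return (label, *(project(a, mapping) for a in args))


def render(expr):
    """Pretty-print an expression in the bracketed style of Figure 1."""
    if isinstance(expr, int):
        return f"[{expr}]"
    label, *args = expr
    return f"{label}({', '.join(render(a) for a in args)})"


# Base statement [18] and the correspondences its arguments take part in.
statement_18 = ("CAUSES", ("AND", 15, 17), 14)
base_to_target = {15: 7, 17: 5, 14: 6}

inference = project(statement_18, base_to_target)
print(render(inference))  # CAUSES(AND([7], [5]), [6])
```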

The concepts above carry over naturally into graph-theoretic notions. The base and target are considered semi-ordered directed-acyclic graphs (DAGs) $G_B = \langle V_B, E_B \rangle$ and $G_T = \langle V_T, E_T \rangle$, where $V_B$ and $V_T$ are sets of nodes and $E_B$ and $E_T$ are sets of edges. Each node corresponds to some expression and has a label given by its relation, function, attribute, or entity name. Structural alignment is the process of finding a maximum weight bipartite matching $M \subseteq V_B \times V_T$, where $M$ satisfies the pairwise-disjunctive constraints imposed by parallel connectivity. Finding candidate inferences is then determining the subset of nodes from $V_B \setminus \{b : \langle b, t \rangle \in M\}$ with support in $M$.

4 Model

4.1 Model Components

Given a base $G_B$ and target $G_T$, AMN produces a set of correspondences $M$ and a set of candidate inferences $I$. A key design choice of this work was to avoid using rules or architectures that force particular outputs whenever possible. AMN is not forced to output correspondences that satisfy the constraints of SMT; instead, conformance with SMT is reinforced through performance on training data. Our architecture uses Transformers vaswani2017attention and pointer networks vinyals2015pointer and takes inspiration from the work of kool2018attention. A high-level overview is given in Figure 2, which shows how each of the three main components (graph embedding, correspondence selection, and candidate inference selection) interact with one another.

Figure 2: An overview of the model pipeline
Representing Structure:

When embedding the nodes of $G_B$ and $G_T$, there are representational concerns to keep in mind. First, because matching should be done on the basis of structure, the labels of entities should not be taken into account during the alignment process. Second, because SMT’s constraints require AMN to be able to recognize when a particular node is part of multiple correspondences, AMN should maintain distinguishable representations for distinct nodes, even if those nodes have the same labels. Last, the architecture should not be vocabulary dependent, i.e., AMN should generalize to symbols it has never seen before. To achieve each of these, AMN first parses the original input into two separate graphs, a label graph and a signature graph (see Figure 3).

The label graph will be used to get an estimate of structural similarities. To generate the label graph, AMN substitutes each entity node’s label with a generic entity token. This reflects that entity labels have no inherent utility for producing matchings. Then, each function and predicate node is assigned a randomly chosen generic label (from a fixed set of such labels) based on its arity and orderedness. Assignments are made consistently across the entire graph, e.g., every instance of MASS in both the base and target would be assigned the same generic replacement label. This substitution means the original label is not used in the matching process, which allows AMN to generalize to new symbols.

The label graph is not sufficient to produce representations that can be used for the matching process, as it represents a node by only label-based features which are shared amongst different nodes (e.g., the label graph can’t distinguish between identical relations with different entities as the leaves), an issue known as the type-token distinction kahneman1992reviewing; wetzel2006types. To contend with this, a signature graph is constructed that represents nodes in a way that respects object identity. To construct the signature graph, AMN replaces each distinct entity with a unique identifier (drawn from a fixed set of possible identifiers). It then assigns each function and predicate a new label based solely on its arity and orderedness, ignoring the original symbol. For instance, ATTRACTS and REVOLVES-AROUND would be assigned the same label because they are both ordered predicates with arity 2.
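The two relabelings can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: the node encoding, the label formats, the pool sizes, and the choice to give distinct symbols distinct generic labels are all ours.

```python
import random


def label_graph(nodes, rng, pool_size=8):
    """Entities -> generic token; each symbol -> a random generic label.

    nodes: id -> (label, argument ids, ordered flag). Base and target nodes
    are passed together so that assignments are consistent across both.
    Assumption: distinct symbols of the same arity/orderedness get distinct
    generic labels (an IndexError is raised if the pool runs out).
    """
    assigned, used, out = {}, set(), {}
    for nid, (label, args, ordered) in nodes.items():
        if not args:
            out[nid] = ("ENTITY", args)
            continue
        if label not in assigned:
            kind = (len(args), ordered)
            free = [i for i in range(pool_size) if (kind, i) not in used]
            choice = rng.choice(free)
            used.add((kind, choice))
            assigned[label] = f"G{len(args)}{'o' if ordered else 'u'}_{choice}"
        out[nid] = (assigned[label], args)
    return out


def signature_graph(nodes, rng, num_ids=16):
    """Entities -> unique identifiers; symbols -> arity/orderedness only."""
    entities = [nid for nid, (_, args, _) in nodes.items() if not args]
    idents = rng.sample(range(num_ids), len(entities))
    out = {nid: (f"ID_{ident}", ()) for nid, ident in zip(entities, idents)}
    for nid, (_, args, ordered) in nodes.items():
        if args:
            kind = "ORD" if ordered else "UNORD"
            out[nid] = (f"ARITY{len(args)}_{kind}", args)
    return out


rng = random.Random(0)
nodes = {
    1: ("nucleus", (), False), 2: ("electron", (), False),
    3: ("MASS", (1,), True), 5: ("ATTRACTS", (1, 2), True),
    6: ("REVOLVES-AROUND", (2, 1), True),
    8: ("sun", (), False), 10: ("MASS", (8,), True),
}
labels = label_graph(nodes, rng)
signatures = signature_graph(nodes, rng)
```

In the label graph both MASS nodes share one generic label and all entities collapse to a single token, while in the signature graph ATTRACTS and REVOLVES-AROUND become indistinguishable but every entity keeps its own identifier.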

Figure 3: Original graph (left), its label graph (middle), and its signature graph (right)

As all input graphs will be DAGs, AMN uses two separate DAG LSTMs crouse2019improving to embed the nodes of the label and signature graphs (equations detailed in Appendix 7.2.1). Each node embedding is computed as a function of its complete set of dependencies in the original graph. For a graph with node set $V$, the set of label structure embeddings is written as $L_V$ and the set of signature embeddings is written as $S_V$. Before passing these embeddings to the next step, each element of $S_V$ is scaled to unit length, i.e. each $s_v \in S_V$ becomes $s_v / \lVert s_v \rVert$, which gives our network an efficiently checkable criterion for whether or not two nodes are likely to be equal, i.e., when the dot product of two signature embeddings is 1.
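The equality criterion follows from the unit-length scaling: the dot product of two unit vectors is their cosine similarity, which reaches 1 only when they point in the same direction. A small sketch with made-up vectors (the helper names are ours):

```python
import math


def unit(vec):
    """Scale a vector to unit (Euclidean) length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


same_node = dot(unit([3.0, 4.0]), unit([3.0, 4.0]))
other_node = dot(unit([3.0, 4.0]), unit([4.0, 3.0]))
print(same_node, other_node)
```

Here `same_node` is 1 up to floating point error, while `other_node` is strictly smaller (0.96 for these vectors).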

Correspondence Selector:

The graph embedding procedure yields two sets of node embeddings (label structure and signature embeddings) for the base and target. We utilize the set of embedding pairs for each node of $G_B$ and $G_T$ (with node sets $V_B$ and $V_T$), writing $l_v$ to denote the label structure embedding of node $v$ taken from $L_V$ and $s_v$ the signature embedding of node $v$ taken from $S_V$. We first define the set of unprocessed correspondences

$$\mathcal{C}^{(0)} = \{ \langle [l_b ; l_t],\ s_b,\ s_t \rangle : b \in V_B,\ t \in V_T,\ \lVert l_b - l_t \rVert \le \epsilon \}$$

where $[\,\cdot\,;\,\cdot\,]$ denotes vector concatenation and $\epsilon$ is the tiered identicality threshold that governs how much the subgraphs rooted at two nodes may differ and still be considered for correspondence (held at a fixed value in this work). The first element of each correspondence in $\mathcal{C}^{(0)}$, i.e., $h_c = [l_b ; l_t]$, is then passed through an $N$-layered Transformer encoder (equations detailed in Appendix 7.2.3) to produce a set of encoded correspondences $\mathcal{C}$.

Figure 4: The correspondence selection process, where START-TOK and END-TOK are the start and stop tokens and $\mathcal{C}$, $\mathcal{O}$, and $\mathcal{R}$ are the sets of encoded, selected, and remaining correspondences

The Transformer decoder selects a subset of correspondences that constitutes the best analogical match (see Figure 4). The attention-based transformations are only performed on the initial element of each tuple, i.e., $h_c$ in $\langle h_c, s_b, s_t \rangle$. We let $\mathcal{O}_t$ be the processed set of all selected correspondences at timestep $t$ (after the attention layers) and $\mathcal{R}_t$ be the set of all remaining correspondences (with $\mathcal{O}_0 = \{\text{START-TOK}\}$ and $\mathcal{R}_0 = \mathcal{C} \cup \{\text{END-TOK}\}$). The decoder generates compatibility scores between each pair of elements, i.e., $\alpha_{od}$ for each $\langle o, d \rangle \in \mathcal{O}_t \times \mathcal{R}_t$. These are combined with the signature embedding similarities to produce a final compatibility

$$\pi_{od} = \mathrm{FFN}\big([\alpha_{od} ;\ s_{b_o}^{\top} s_{b_d} ;\ s_{t_o}^{\top} s_{t_d}]\big)$$

where FFN is a two layer feed-forward network with ELU activations clevert2015fast and $s_{b_o}$, $s_{t_o}$ (respectively $s_{b_d}$, $s_{t_d}$) are the base and target signature embeddings of correspondence $o$ (respectively $d$). Recall that the signature components were scaled to unit length. Thus, we would expect closeness in the original graph to be reflected by dot-product similarity and identicality to be indicated by a maximum value dot-product, i.e. $s_{b_o}^{\top} s_{b_d} = 1$ or $s_{t_o}^{\top} s_{t_d} = 1$. Once each pair has been scored, AMN selects an element of $\mathcal{R}_t$ to be added to $\mathcal{O}_{t+1}$. For each $d \in \mathcal{R}_t$, we compute its value to be

$$v_d = \mathrm{FFN}\Big(\Big[\max_{o \in \mathcal{O}_t} \pi_{od} ;\ \min_{o \in \mathcal{O}_t} \pi_{od} ;\ \frac{1}{|\mathcal{O}_t|} \sum_{o \in \mathcal{O}_t} \pi_{od}\Big]\Big)$$

where FFN is a two layer feed-forward network with ELU activations. A softmax is applied to these scores and the highest valued element is added to $\mathcal{O}_{t+1}$. The use of maximum, minimum, and average is intended to let the network capture both individual and aggregate evidence. Individual evidence is given by a pairwise interaction between two correspondences (e.g., two correspondences that together violate the one-to-one constraint). Conversely, aggregate evidence is given by the interaction of a correspondence with everything selected thus far (e.g., a correspondence needed for several parallel connectivity constraints). When END-TOK is selected, the set of correspondences returned is the set of node pairs from $V_B$ and $V_T$ associated with elements in $\mathcal{O}_t$.
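One decoding step of this kind can be sketched in plain Python. Everything here is illustrative: the compatibility scores are made up, and `toy_value` is a hand-picked linear scorer standing in for the learned feed-forward network, which this sketch does not reproduce.

```python
import math


def aggregate(scores):
    """Summarize one candidate's compatibility with everything selected so far."""
    return (max(scores), min(scores), sum(scores) / len(scores))


def softmax(values):
    shift = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def select_next(compat, value_fn):
    """compat: candidate -> compatibility scores against each selected item."""
    names = list(compat)
    probs = softmax([value_fn(aggregate(compat[n])) for n in names])
    best = max(range(len(names)), key=probs.__getitem__)
    return names[best], probs[best]


def toy_value(features):
    """Placeholder for the learned FFN (an assumption, not the trained model)."""
    highest, lowest, average = features
    return 0.5 * highest + 2.0 * lowest + 1.0 * average


compat = {
    "m1": [0.9, 0.8, 0.7],        # consistently compatible
    "m2": [0.95, -0.9, 0.6],      # one strong conflict with a selected item
    "END-TOK": [0.1, 0.1, 0.1],
}
choice, prob = select_next(compat, toy_value)
print(choice)  # m1
```

The minimum feature is what lets a single strong conflict (individual evidence) veto `m2` even though its maximum is the highest of the three, while the average carries the aggregate evidence.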

Candidate Inference Selector:

The output of the correspondence selector is a set of correspondences $M$. The candidate inferences associated with $M$ are drawn from the nodes of the base graph $V_B$ that were not used in $M$. Let $V_{in}$ and $V_{out}$ be the subsets of $V_B$ that were and were not used in $M$, respectively. We first extract all signature embeddings for both sets, i.e., $S_{in}$ and $S_{out}$. In this module there are no Transformer components, with AMN operating directly on $S_{in}$ and $S_{out}$.

AMN will select elements from $S_{out}$ (the signature embeddings of base nodes not used in the correspondences) to return. Like before, we let $\mathcal{O}_t$ be the set of all selected elements from $S_{out}$ and $\mathcal{R}_t$ be the set of all remaining elements from $S_{out}$ at timestep $t$. AMN computes compatibility scores between pairs of output options with candidate inference and previously selected nodes, i.e. $\alpha_{od}$ for each $\langle o, d \rangle \in (S_{in} \cup \mathcal{O}_t) \times \mathcal{R}_t$, where $S_{in}$ holds the signature embeddings of base nodes used in the correspondences. The compatibility scores are given by a simple single-headed attention computation (see Appendix 7.2.2). Unlike the correspondence encoder-decoder, there are no other values to combine these scores with, so they are used directly to compute a value $v_d$ for each element $d \in \mathcal{R}_t$.

A softmax is applied to these values and the highest valued element is added to $\mathcal{O}_{t+1}$. Once the special end token is selected, the decoding procedure stops and returns the set of nodes associated with elements in $\mathcal{O}_t$.

4.2 Model Scoring

Structural Match Scoring:

In order to avoid counting erroneous correspondence predictions towards the score of the output correspondences $M$, we first identify all correspondences that are either degenerate or violate the constraints of SMT. Degenerate correspondences are correspondences between constants that have no higher-order structural support in $M$ (i.e., if either constant has no parent that participates in a correspondence in $M$). To determine if a correspondence $\langle b, t \rangle$ violates SMT, we check whether the subgraphs of the base and target rooted at $b$ and $t$ satisfy the one-to-one matching, parallel connectivity, and tiered identicality constraints (see Section 3). The check can be computed in time linear with the size of the corresponding subgraphs. Let the valid subset of $M$ be $M_{val}$. A correspondence $m \in M_{val}$ is considered a root correspondence if there does not exist another correspondence $m' \in M_{val}$ such that $m' \neq m$ and a node in $m'$ is an ancestor of a node in $m$. We define $M_{root}$ to be the set of all such root correspondences. For a correspondence $\langle b, t \rangle$ in $M_{root}$, its score is given as the size of the subgraph rooted at $b$ in the base. The structural match score for $M$ is then the sum of scores for all correspondences in $M_{root}$, i.e., $S(M) = \sum_{\langle b, t \rangle \in M_{root}} \mathrm{size}(b)$. This repeatedly counts nodes that appear in the dependencies of multiple correspondences, which leads to higher scores for more interconnected matchings (in keeping with the systematicity preference of SMT).
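The scoring rule can be sketched on a fragment of Figure 1. The graph encoding and function names below are ours, and the input is assumed to already be the valid subset of correspondences (the degeneracy and constraint checks are not repeated here).

```python
def descendants(node, graph):
    """All nodes reachable from `node` along argument edges, including itself."""
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph[current])
    return seen


def structural_score(valid, base, target):
    """Sum of base-subgraph sizes over root correspondences."""
    total = 0
    for b, t in valid:
        has_ancestor = any(
            (b2, t2) != (b, t)
            and (b in descendants(b2, base) - {b2}
                 or t in descendants(t2, target) - {t2})
            for b2, t2 in valid
        )
        if not has_ancestor:  # (b, t) is a root correspondence
            total += len(descendants(b, base))
    return total


# Node id -> argument ids, taken from part of Figure 1.
base = {8: (), 9: (), 10: (8,), 11: (9,), 14: (9, 8), 15: (10, 11)}
target = {1: (), 2: (), 3: (1,), 4: (2,), 6: (2, 1), 7: (3, 4)}
valid = {(8, 1), (9, 2), (10, 3), (11, 4), (14, 6), (15, 7)}

print(structural_score(valid, base, target))  # 8
```

The two roots are the REVOLVES-AROUND pair (subgraph size 3) and the GREATER pair (subgraph size 5); the sun and planet nodes are counted under both, which is the repeated counting described above.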

Structural Evaluation Maximization:

Dynamically assigning labels to each example allows AMN to handle never-before-seen symbols, but its inherent randomness can lead to significant variability in terms of outputs. AMN combats this by running each test problem $r$ times and returning the predicted match that maximizes the structural evaluation score, i.e., $\arg\max_{i \in \{1, \ldots, r\}} S(M_i)$, where $M_i$ is the match predicted on the $i$-th run and $S$ is the structural evaluation score. Notably, AMN does not attempt to alter or correct the mapping it chooses this way, so unlike systems like SME, the mapping it returns can include constraint violations.
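The re-run strategy amounts to a best-of-several selection. In this sketch the predictor is replaced by canned outputs and the score by set size, both of which are stand-ins of ours for AMN's stochastic prediction and the structural match score.

```python
def best_of_runs(predict, score, runs):
    """Call a stochastic predictor `runs` times and keep the best-scoring output."""
    return max((predict() for _ in range(runs)), key=score)


# Canned outputs standing in for three runs with different random labelings.
canned = iter([
    {(8, 1)},
    {(8, 1), (9, 2), (14, 6)},
    {(8, 1), (9, 2)},
])
best = best_of_runs(lambda: next(canned), score=len, runs=3)
print(sorted(best))  # [(8, 1), (9, 2), (14, 6)]
```

Nothing is repaired after selection, so if the best-scoring run contains a constraint violation, that violation is returned as is.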

5 Experiments

5.1 Data Generation and Training

AMN was trained on 100,000 synthetic analogy examples, where a single example consisted of base and target graphs, a set of correspondences, and a set of nodes from the base to be candidate inferences. To generate a synthetic example, we first generated a set of random graphs $G_C$, which formed the basis for the correspondences. Next, we constructed the base $G_B$ by further generating graphs around $G_C$. Likewise, for the target $G_T$ we built another set of graphs around $G_C$. The graphs of $G_C$ were then used to form the correspondences between the base and target. Any element in $G_B$ that was an ancestor of a node from $G_C$ or a descendant of such an ancestor was considered a candidate inference. Figure 5 provides an example. In the figure, the dark green nodes indicate the initial random graphs $G_C$ after being copied into the base and target. The red and blue nodes show the graphs built around $G_C$ to form $G_B$ and $G_T$. The light green edges indicate the gold set of correspondences generated from $G_C$. During training, each generated example was turned into a batch of inputs by repeatedly running the encoding procedure (which dynamically assigns node labels) over the original base and target.

Figure 5: Synthetic example with a base (red), target (blue), and shared subgraphs (green)

5.2 Experimental Domains

Though all training was done with synthetic data, we evaluated the effectiveness of AMN on both synthetic data and data used in previous analogy experiments. The corpus of previous analogy examples was taken from the public release of SME (http://www.qrg.northwestern.edu/software/sme4/index.html). Importantly, AMN was not trained on the corpus of existing analogy examples (AMN never learned from a real-world analogy example). In fact, there was no overlap between the symbols used in that corpus and the symbols used for the synthetic data. We briefly describe each of the domains AMN was evaluated on below (more detailed descriptions can be found in forbus2017extending). Examples of AMN’s outputs can be found in Appendix 7.3.

  1. Synthetic: this domain consisted of 1000 examples generated with the same parameters as the training data (useful as a sanity check for AMN’s performance).

  2. Visual Oddity: this problem setting was initially proposed to explore cultural differences in geometric reasoning in dehaene2006core. The work of lovett2011cultural modeled the findings of the original experiment computationally with qualitative visual representations and analogy. We extracted 3405 analogical comparisons from the computational experiment.

  3. Moral Decision Making: this domain was taken from the work of dehghani2008moraldm, who introduced a computational model of moral decision making that used SME to reason through moral dilemmas. From the works of dehghani2008moraldm; dehghani2008integrated, we extracted 420 analogical comparisons.

  4. Geometric Analogies: this domain originated from one of the first computational analogy experiments evans1964program. Each problem was an incomplete analogy of the form $A : B :: C : \, ?$, where each of $A$, $B$, and $C$ was a manually encoded geometric figure and the goal was to select the figure that best completed the analogy from an encoded set of possible answers. While in the original work all figures had to be manually encoded, in lovett2009solving; lovett2012modeling it was shown that the analogy problems could be solved with structure-mapping over automatic encodings (produced by the CogSketch system forbus2011cogsketch). From that work we extracted 866 analogies.

5.3 Results and Discussion

Domain     Runs  Struct. Perf.  Larger  Equiv.  Err. Free  1-to-1 Err.  PC Err.  Degen. Err.
Synthetic  1     0.702          0.000   0.308   0.342      0.007        0.106    0.018
Synthetic  16    0.948          0.001   0.671   0.684      0.006        0.021    0.009
Oddity     1     0.775          0.062   0.404   0.483      0.152        0.223    0.000
Oddity     16    0.957          0.075   0.492   0.571      0.130        0.139    0.000
Moral DM   1     0.617          0.014   0.017   0.076      0.001        0.169    0.030
Moral DM   16    0.968          0.081   0.210   0.352      0.000        0.039    0.015
Geometric  1     0.870          0.066   0.539   0.654      0.041        0.116    0.000
Geometric  16    1.038          0.069   0.707   0.783      0.029        0.043    0.000

(a) AMN correspondence prediction results in terms of performance ratio (left), solution type rate (middle, ↑ better), and error rate (right, ↓ better)

Domain     Runs  Avg. CI F1  Avg. CI Prec.  Avg. CI Rec.  Avg. CI Acc.  Avg. CI Spec.
Synthetic  16    0.899       0.867          0.964         0.860         0.733
Oddity     16    0.991       0.995          0.993         0.991         0.811
Moral DM   16    0.897       0.832          0.984         0.830         0.441
Geometric  16    0.960       0.954          0.993         0.951         0.838

(b) AMN candidate inference prediction results

Table 1: AMN experimental results

Table 1(a) shows the results for AMN across different values of $r$, where $r$ denotes the re-run hyperparameter detailed in Section 4.2. When evaluating on the synthetic data, the comparison set of correspondences was given by the data generator, whereas when evaluating on the three other analogy domains, the comparison set of correspondences was given by the output of SME. It is important to note that we are using SME as our stand-in for SMT (as it is the most widely accepted computational model of SMT). Thus, we do not want significantly different results from SME in the correspondence selection experiments (e.g., substantially higher or lower structural evaluation scores).

In the Struct. Perf. column, the numbers reflect the average across examples of the structural evaluation score of AMN divided by that of the comparison correspondence sets. For the other columns of Table 1(a), the numbers represent average fractions of examples or correspondences (e.g., 0.684 should be interpreted as 68.4%). Candidate inference prediction performance was measured relative to the set of correspondences AMN generated, i.e., all candidate inferences were computed from the predicted correspondences, and treated as the true positives. In many problems from the non-synthetic domains, every non-correspondence node was a candidate inference (which can lead to inflated precision and recall values). Thus, we also report the specificity (i.e., true negative rate) of AMN for only problems with non-candidate inference nodes.
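For reference, the candidate inference metrics in Table 1(b) follow the usual confusion-matrix definitions. The sketch below uses made-up node sets; the function name and the choice to return `None` when a denominator is zero (for instance, specificity on a problem with no non-candidate-inference nodes) are ours.

```python
def ci_metrics(predicted, gold, candidates):
    """Precision, recall, F1, accuracy and specificity over candidate nodes."""
    tp = len(predicted & gold)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    tn = len(candidates - predicted - gold)

    def ratio(num, den):
        return num / den if den else None

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "specificity": ratio(tn, tn + fp),  # true negative rate
    }


candidates = {1, 2, 3, 4, 5, 6}   # base nodes not used in the correspondences
gold = {1, 2, 3, 4}               # nodes that really are candidate inferences
predicted = {1, 2, 3, 5}          # nodes the model selected
metrics = ci_metrics(predicted, gold, candidates)
print(metrics["precision"], metrics["recall"], metrics["specificity"])  # 0.75 0.75 0.5
```

When every candidate node is a true candidate inference there are no negatives at all, which is why specificity is only meaningful on the subset of problems described above.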

Analysis: The left side of Table 1(a) shows the average ratio of AMN’s performance (labeled Struct. Perf.), as measured by structural evaluation score, against the comparison method’s performance (i.e., data generator correspondences or SME). As can be seen, AMN was around 95-104% of SME’s performance in terms of structural evaluation score on the non-synthetic domains, which indicates that it was finding similar structural matches. Again, we note that higher structural evaluation scores do not necessarily indicate a “better” match, as our goal is to conform to SMT’s predictions.

The middle of Table 1(a) gives us the best sense of how well AMN modeled SMT. We observe AMN’s performance in terms of the proportion of larger, equivalent, and error-free matches it produces (labeled Larger, Equiv., and Err. Free, respectively). Error-free matches do not contain degenerate correspondences or SMT constraint violations, whereas equivalent and larger matches are both error-free and have the same / larger structural evaluation score as compared to the gold set of correspondences. The Equiv. column provides the best indication that AMN could model SMT. It shows that, with 16 runs, about 49% (Visual Oddity) and 71% (Geometric Analogies) of AMN’s outputs were SMT-satisfying, error-free analogical matches with the exact same structural score as SME (the leading computational model of SMT).

The right side of Table 1(a) shows the frequency of the different types of errors, including violations of the one-to-one / parallel connectivity constraints, and degenerate correspondences (labeled 1-to-1 Err., PC Err., and Degen. Err.). It shows that AMN had fairly low error rates across domains (except for Visual Oddity). Importantly, degenerate correspondences were very infrequent, which is significant because it verifies that AMN leveraged higher-order relational structure when generating matches.

Table 1(b) shows that AMN was fairly effective in predicting candidate inferences. The high accuracy (labeled Avg. CI Acc.) scores for both the Visual Oddity and Geometric Analogies domains indicate that AMN was able to capture the notion of structural support when determining candidate inferences. The non-zero specificity (labeled Avg. CI Spec.) results show that, while it more often classified nodes as candidate inferences, it was capable of distinguishing non-candidate inference nodes as well.

6 Conclusions

In this paper, we introduced the Analogical Matching Network, a neural approach that learned to produce analogies consistent with Structure-Mapping Theory. Despite being trained on completely synthetic data, AMN was capable of performing well on a varied set of analogies drawn from previous work involving analogical reasoning. AMN demonstrated renaming invariance, structural sensitivity, and the ability to find solutions in a combinatorial search space, all of which are key properties of symbolic reasoners and are known to be important to human reasoning.

References

7 Appendix

7.1 Model Details

In the DAG LSTM, the node embeddings were 32-dimensional vectors and the edge embeddings were 16-dimensional vectors. For all Transformer components, our model used multi-headed attention with 2 attention layers, each having 4 heads. In each multi-headed attention layer, the query and key vectors were projected to 128-dimensional vectors. The feed-forward networks used in the Transformer components had one hidden layer with a dimensionality twice that of the input vector size. The feed-forward networks used to compute the values in the correspondence selector used two 64-dimensional hidden layers. The models were constructed with the PyTorch library [paszke2019pytorch].
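For quick reference, the hyperparameters stated above can be collected in a small configuration object. This is only a sketch: the class name and field names are hypothetical and do not come from the authors' code; only the numeric values are taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AMNHyperparameters:
    """Hypothetical container; only the default values come from the text."""
    node_embedding_dim: int = 32         # DAG LSTM node embeddings
    edge_embedding_dim: int = 16         # DAG LSTM edge embeddings
    attention_layers: int = 2            # per Transformer component
    attention_heads: int = 4             # heads per attention layer
    attention_projection_dim: int = 128  # projected query/key size
    ffn_hidden_multiplier: int = 2       # Transformer FFN hidden = 2 x input
    selector_hidden_dims: tuple = (64, 64)  # correspondence selector FFN

    def transformer_ffn_hidden_dim(self, input_dim: int) -> int:
        """Hidden size of a Transformer feed-forward layer for a given input size."""
        return self.ffn_hidden_multiplier * input_dim
```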

Loss Function:

As both the correspondence and candidate inference components use a softmax, the loss function is categorical cross entropy. Teacher forcing is used to guide the decoder to select the correct choices during training. With $\mathcal{L}_{corr}$ the loss for correspondence selection and $\mathcal{L}_{ci}$ the loss for candidate inference selection, the final loss is a weighted combination of $\mathcal{L}_{corr}$ and $\mathcal{L}_{ci}$ (with the weighting held fixed in our experiments), which is minimized with Adam [kingma2014adam].
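The loss described above can be sketched in plain Python. Several details here are assumptions rather than statements from the paper: the two losses are combined as a weighted sum, per-step losses are averaged, and `ci_weight` is a hypothetical parameter whose value is not taken from the paper. With teacher forcing, the scores at each decoding step are assumed to have been computed with the gold choices from earlier steps fed back into the decoder.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, gold_index):
    """Categorical cross entropy of one softmax decision."""
    return -math.log(softmax(scores)[gold_index])


def teacher_forced_loss(step_scores, gold_indices):
    """Average cross entropy over decoding steps.

    step_scores[t] holds the raw scores over the options at step t,
    assumed to be computed with the gold choices of steps < t as context.
    """
    losses = [cross_entropy(s, g) for s, g in zip(step_scores, gold_indices)]
    return sum(losses) / len(losses)


def combined_loss(corr_loss, ci_loss, ci_weight):
    """Assumed form: correspondence loss plus weighted candidate inference loss."""
    return corr_loss + ci_weight * ci_loss
```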

7.2 Background

7.2.1 DAG LSTMs

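DAG LSTMs extend Tree LSTMs [tai2015improved] to DAG-structured data. As with Tree LSTMs, DAG LSTMs compute each node embedding as the aggregated information of all their immediate predecessors (the equations for the DAG LSTM are identical to those of the Tree LSTM). The difference between the two is that DAG LSTMs stage the computation of a node’s embedding based on the order given by a topological sort of the input graph. Batching of computations is done by grouping together updates of independent nodes (where two nodes are independent if they are neither ancestors nor predecessors of one another). As in [crouse2019improving], for a node, $n$, its initial node embedding, $s_n$, is assigned based on its label and arity. The DAG LSTM then computes the final embedding $h_n$ to be

$$
\begin{aligned}
i_n &= \sigma\Big(W_i s_n + \sum_{p \in \mathcal{P}(n)} U_i^{(e_{np})} h_p + b_i\Big) \\
o_n &= \sigma\Big(W_o s_n + \sum_{p \in \mathcal{P}(n)} U_o^{(e_{np})} h_p + b_o\Big) \\
\hat{c}_n &= \tanh\Big(W_c s_n + \sum_{p \in \mathcal{P}(n)} U_c^{(e_{np})} h_p + b_c\Big) \\
f_{np} &= \sigma\Big(W_f s_n + U_f^{(e_{np})} h_p + b_f\Big) \\
c_n &= i_n \odot \hat{c}_n + \sum_{p \in \mathcal{P}(n)} f_{np} \odot c_p \\
h_n &= o_n \odot \tanh(c_n)
\end{aligned}
$$

where $\odot$ is element-wise multiplication, $\sigma$ is the sigmoid function, $\mathcal{P}$ is the predecessor function that returns the arguments for a node, $e_{np}$ is the type of the edge between $n$ and $p$, and $U_i$, $U_o$, $U_c$, and $U_f$ are learned matrices per edge type. $i$ and $o$ represent input and output gates, $\hat{c}$ and $c$ are memory cells, and $f$ is a forget gate.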
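The staging and batching described above can be illustrated with a short sketch. It covers only the scheduling, not the LSTM update itself. The function name and the dictionary-based graph encoding are hypothetical. Each node is assigned a level equal to one more than the deepest of its arguments, so nodes that share a level are never ancestors of one another and can be updated together.

```python
def topological_levels(predecessors):
    """Group DAG nodes into batches that can be updated together.

    predecessors maps each node to the list of its argument nodes.
    Batch 0 holds nodes without arguments; a node lands in batch k + 1
    when its deepest argument is in batch k.
    """
    level = {}

    def depth(node):
        # Memoized longest distance from this node down to a leaf.
        if node not in level:
            args = predecessors.get(node, [])
            level[node] = 0 if not args else 1 + max(depth(a) for a in args)
        return level[node]

    for node in predecessors:
        depth(node)

    batches = {}
    for node, d in level.items():
        batches.setdefault(d, []).append(node)
    # Sort within each batch only to make the output deterministic.
    return [sorted(batches[d]) for d in sorted(batches)]


# Toy expression graph for cause(greater(a, b), flow(a, b)).
example = {
    "cause": ["greater", "flow"],
    "greater": ["a", "b"],
    "flow": ["a", "b"],
    "a": [],
    "b": [],
}
schedule = topological_levels(example)
```

In this toy graph the shared arguments `a` and `b` are embedded once in the first batch and reused by both `greater` and `flow`, which is the benefit of treating the input as a DAG rather than a tree.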

7.2.2 Multi-Headed Attention

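The multi-headed attention (MHA) mechanism of [vaswani2017attention] is used in our work to compare correspondences against one another. In this work, MHA is given two inputs, a query vector $q$ and a list of key vectors to compare the query vector against, $\langle k_1, \ldots, k_n \rangle$. In $N$-headed attention, $N$ separate attention transformations are computed. For transformation $i$ we have

$$
\begin{aligned}
\hat{\alpha}_{ij} &= \frac{(W_i^{(q)} q)^\top (W_i^{(k)} k_j)}{\sqrt{d}} \\
\alpha_{ij} &= \frac{\exp(\hat{\alpha}_{ij})}{\sum_{j'} \exp(\hat{\alpha}_{ij'})} \\
q_i &= \sum_{j} \alpha_{ij} W_i^{(v)} k_j
\end{aligned}
$$

where each of $W_i^{(q)}$, $W_i^{(k)}$, and $W_i^{(v)}$ are learned matrices and $d$ is the dimensionality of the projected vectors $W_i^{(k)} k_j$. The final output vector $q'$ for input $q$ is then given as a combination of its transformations

$$
q' = \sum_{i=1}^{N} W_i^{(o)} q_i
$$

where each $W_i^{(o)}$ is a distinct learned matrix for each $i$. In implementation, the comparisons of query and key vectors are batched together and performed as efficient matrix multiplications.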
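A minimal, unbatched sketch of this mechanism in plain Python follows. It uses the standard scaled dot-product formulation of [vaswani2017attention] and combines heads by summing their output projections, which is equivalent to concatenating the heads and applying one output matrix. The function names and the list-of-lists matrix encoding are illustrative assumptions, and a real implementation would batch these operations as matrix multiplications.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(query, keys, w_q, w_k, w_v):
    """One head: compare the projected query against every projected key."""
    q = matvec(w_q, query)
    projected_keys = [matvec(w_k, k) for k in keys]
    scale = math.sqrt(len(q))  # dimensionality of the projected vectors
    scores = [sum(a * b for a, b in zip(q, pk)) / scale for pk in projected_keys]
    weights = softmax(scores)
    values = [matvec(w_v, k) for k in keys]
    # Attention-weighted sum of the value vectors.
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]


def multi_head_attention(query, keys, heads):
    """heads is a list of (w_q, w_k, w_v, w_o) tuples, one per head."""
    output = None
    for w_q, w_k, w_v, w_o in heads:
        head_out = matvec(w_o, attention_head(query, keys, w_q, w_k, w_v))
        output = head_out if output is None else [a + b for a, b in zip(output, head_out)]
    return output
```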

7.2.3 Transformer Encoder-Decoder

The Transformer-based encoder-decoder is given two inputs, a comparison set $C$ and an output set $O$. At a high level, $C$ will be encoded into a new set $E$, which will inform a selection process that picks elements of $O$ to return. In the context of pointer networks, the set $O$ begins as the encoded input set, i.e., $O = E$.

Encoder:

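First, the elements of the comparison set $C$, i.e. $c_i \in C$, are passed through $L$ layers of an attention-based transformation. For element $c_i$ in the $l$-th layer (i.e., $c_i^{(l)}$, with $c_i^{(0)} = c_i$) this is performed as follows

$$
\begin{aligned}
\hat{c}_i^{(l)} &= \mathrm{LN}\big(c_i^{(l-1)} + \mathrm{MHA}_C^{(l)}(c_i^{(l-1)})\big) \\
c_i^{(l)} &= \mathrm{LN}\big(\hat{c}_i^{(l)} + \mathrm{FFN}^{(l)}(\hat{c}_i^{(l)})\big)
\end{aligned}
$$

where LN denotes the use of layer normalization [ba2016layer], $\mathrm{MHA}_C^{(l)}$ (Appendix 7.2.2) denotes the use of self multi-headed attention for layer $l$ (i.e., attention between $c_i^{(l-1)}$ and the other elements of $C$), and $\mathrm{FFN}^{(l)}$ is a two-layer feed-forward neural network with ELU [clevert2015fast] activations. After $L$ layers of processing, the set of encoded inputs $E$ is given by $E = \{ c_i^{(L)} : c_i \in C \}$.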

Decoder:

With encoded comparison elements $E$ and a set of potential outputs $O$, the objective of the decoder is to use $E$ to inform the selection of some subset of output options $D \subseteq O$ to return. Decoding happens sequentially; at each timestep $t$ the decoder selects an element from $O \cup \{\text{END-TOK}\}$ (where END-TOK is a learned triple) to add to $D$. If END-TOK is chosen, the decoding procedure stops and $D$ is returned.

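Let $D_t$ be the set of elements that have been selected by timestep $t$ and $O_t$ be the remaining unselected elements at timestep $t$. First, $D_t$ is processed with an $L$-layered attention-based transformation. For an element $d_j \in D_t$ (with $d_j^{(0)} = d_j$) this is given by

$$
\begin{aligned}
\hat{d}_j^{(l)} &= \mathrm{LN}\big(d_j^{(l-1)} + \mathrm{MHA}_D^{(l)}(d_j^{(l-1)})\big) \\
\check{d}_j^{(l)} &= \mathrm{LN}\big(\hat{d}_j^{(l)} + \mathrm{MHA}_E^{(l)}(\hat{d}_j^{(l)})\big) \\
d_j^{(l)} &= \mathrm{LN}\big(\check{d}_j^{(l)} + \mathrm{FFN}^{(l)}(\check{d}_j^{(l)})\big)
\end{aligned}
$$

where $\mathrm{MHA}_D$ denotes the use of self multi-headed attention, $\mathrm{MHA}_E$ denotes the use of multi-headed attention against elements of the encoded set $E$, and FFN is a two-layer feed-forward neural network with ELU activations. We will consider the already selected outputs to be the transformed selected outputs, i.e., $D_t = \{ d_j^{(L)} \}$. For a pair, $\langle o_i, d_j \rangle$, with $o_i \in O_t$ and $d_j \in D_t$, we compute their compatibility as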

where $W_o$ and $W_d$ are learned matrices, $d$ is the dimensionality of an option embedding $o_i$, and FFN is a two-layer feed-forward network with ELU activations. This defines a matrix $A \in \mathbb{R}^{|O_t| \times |D_t|}$ of compatibility scores between the remaining options $O_t$ and the selected outputs $D_t$. One can then apply some operation (e.g., max pooling) to produce a vector of values $v \in \mathbb{R}^{|O_t|}$, which can be fed into a softmax to produce a distribution over options from $O_t$. The highest probability element $o^*$ from the distribution is then added to the set of selected outputs, i.e., $D_{t+1} = D_t \cup \{o^*\}$.
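The selection loop described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation. The `compatibility` argument is a hypothetical stand-in for the learned scoring function, max pooling is used as the pooling operation, and a `None` placeholder stands in for the context at the first step because the handling of an empty selection is not specified here.

```python
import math

END_TOK = "END-TOK"


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_decode(options, compatibility):
    """Greedily select options until END_TOK wins or nothing is left.

    compatibility(option, selected_item) returns a float score; it is a
    stand-in for the learned scorer. selected_item is None on the first
    step (an assumption of this sketch).
    """
    selected = []
    remaining = list(options)
    for _ in range(len(options) + 1):
        candidates = remaining + [END_TOK]
        context = selected if selected else [None]
        # One row of scores per candidate, max-pooled over the context.
        pooled = [max(compatibility(o, d) for d in context) for o in candidates]
        probs = softmax(pooled)
        best = candidates[max(range(len(candidates)), key=probs.__getitem__)]
        if best == END_TOK:
            break
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy scorer that ignores the context and ranks options by a fixed value.
toy_scores = {"a": 2.0, "b": 1.0, "c": -1.0, END_TOK: 0.0}
chosen = greedy_decode(["c", "a", "b"], lambda o, d: toy_scores[o])
```

With the toy scorer, decoding picks the options that outrank END-TOK in score order and then stops. In the model the scores depend on what has already been selected, so the ranking can change from step to step.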

7.3 AMN Example Outputs

For the outputs from the non-synthetic domains (all but the first figure), only small subgraphs of the original graphs are shown, as the original graphs were too large to be displayed.

Figure 6: AMN output for an example from the Synthetic domain
Figure 7: AMN output for an example from the Visual Oddity domain
Figure 8: AMN output for an example from the Moral Decision Making domain
Figure 9: AMN output for an example from the Geometric Analogies domain