Embedding Symbolic Knowledge into Deep Networks

09/03/2019 ∙ by Yaqi Xie, et al.

In this work, we aim to leverage prior symbolic knowledge to improve the performance of deep models. We propose a graph embedding network that projects propositional formulae (and assignments) onto a manifold via an augmented Graph Convolutional Network (GCN). To generate semantically-faithful embeddings, we develop techniques to recognize node heterogeneity, and semantic regularization that incorporates structural constraints into the embedding. Experiments show that our approach improves the performance of models trained to perform entailment checking and visual relation prediction. Interestingly, we observe a connection between the tractability of the propositional theory representation and the ease of embedding. Future exploration of this connection may elucidate the relationship between knowledge compilation and vector representation learning.




1 Introduction

The recent advances in design and training methodology of deep neural networks (NNs) have led to wide-spread application of machine learning in diverse domains such as medical image classification Litjens2017 and game-playing Silver2017. Although demonstrably effective on a variety of tasks, deep NNs have voracious appetites; obtaining a good model typically requires large amounts of labelled data, even when the learnt concepts could be described succinctly in symbolic representation. As a result, there has been a surge of interest in techniques that combine symbolic and neural reasoning BSBB17 including a diverse set of approaches to inject existing prior domain knowledge into NNs, e.g., via knowledge distillation Hu2016HarnessingDN, probabilistic priors ansari2019hyperprior, or auxiliary losses Xu2018. However, doing so in a scalable and effective manner remains a challenging open problem. One particularly promising approach is through learned embeddings, i.e., real-vector representations of prior knowledge, that can be easily processed by NNs Kim2016; allamanis2016; evans2018; Zhu2015; Tai2015.

In this work, we focus on embedding symbolic knowledge expressed as logical rules. In sharp contrast to connectionist NN structures, logical formulae are explainable, compositional, and can be explicitly derived from human knowledge. Inspired by insights from the knowledge representation community, this paper investigates embedding alternative representation languages to improve the performance of deep networks. To this end, we focus on two languages: Conjunctive Normal Form (CNF) and decision-Deterministic Decomposable Negation Normal Form (d-DNNF) Darwiche2001b; Darwiche2002. Every Boolean formula can be succinctly represented in CNF, but CNF is intractable for most queries of interest such as satisfiability. On the other hand, representing a Boolean formula in d-DNNF may lead to an exponential size blowup, but d-DNNF is tractable (polynomial-time) for most queries such as satisfiability, counting, enumeration and the like Darwiche2002.

In comparison to prior work that treats logical formulae as symbol sequences, CNF and d-DNNF formulae are naturally viewed as graph structures. Thus, we utilize recent Graph Convolutional Networks (GCNs) Kipf2017 (that are robust to relabelling of nodes) to embed logic graphs. We further employ a novel method of semantic regularization to learn embeddings that are semantically consistent with d-DNNF formulae. In particular, we augment the standard GCN to recognize node heterogeneity and introduce soft constraints on the embedding structure of the children of AND and OR nodes within the logic graph. An overview of our Logic Embedding Network with Semantic Regularization (LENSR) is shown in Fig. 1.

Figure 1: LENSR overview. Our GCN-based embedder projects logic graphs representing formulae or assignments onto a manifold where entailment is related to distance; satisfying assignments are closer to the associated formula. Such a space enables fast approximate entailment checks — we use this embedding space to form logic losses that regularize deep neural networks for a target task.

Once learnt, these logic embeddings can then be used to form a logic loss that guides NN training; the loss encourages the NN to be consistent with prior knowledge. Experiments on a synthetic model-checking dataset show that LENSR is able to learn high quality embeddings that are predictive of formula satisfiability. As a real-world case-study, we applied LENSR to the challenging task of Visual Relation Prediction (VRP) where the goal is to predict relations between objects in images. Our empirical analysis demonstrates that LENSR significantly outperforms baseline models. Furthermore, we observe that LENSR with d-DNNF achieves a significant performance improvement over LENSR with CNF embedding. We propose the notion of embeddable-demanding to capture this observed, plausible relationship between the tractability of a representation language and the ease of learning vector representations.

To summarize, this paper contributes a framework for utilizing logical formulae in NNs. Different from prior work, LENSR is able to utilize d-DNNF structure to learn semantically-constrained embeddings. To the best of our knowledge, this is also the first work to apply GCN-based embeddings for logical formulae, and experiments show the approach to be effective on both synthetic and real-world datasets. Practically, the model is straight-forward to implement and use. We have made our source code available online at https://github.com/ZiweiXU/LENSR. Finally, our evaluations suggest a connection between the tractability of a normal form and its amenability to embedding; exploring this relationship may reveal deep connections between knowledge compilation Darwiche2002 and vector representation learning.

2 Background and Related Work

Logical Formulae, CNF and d-DNNF

Logical statements provide a flexible declarative language for expressing structured knowledge. In this work, we focus on propositional logic, where a proposition $p$ is a statement which is either True or False. A formula $f$ is a compound of propositions connected by logical connectives, e.g., $\neg, \wedge, \vee, \Rightarrow$. An assignment $\tau$ is a function which maps propositions to True or False. An assignment that makes a formula $f$ True is said to satisfy $f$, denoted $\tau \models f$.

A formula that is a conjunction of clauses (each clause being a disjunction of literals) is in Conjunctive Normal Form (CNF). Let $X$ be the set of propositional variables. A sentence in Negation Normal Form (NNF) is defined as a rooted directed acyclic graph (DAG) where each leaf node is labeled with True, False, $x$, or $\neg x$, $x \in X$; and each internal node is labeled with $\wedge$ or $\vee$ and can have arbitrarily many children. Deterministic Decomposable Negation Normal Form (d-DNNF) Darwiche2001b; Darwiche2002 further imposes that the representation is: (i) Deterministic: An NNF is deterministic if the operands of $\vee$ in all well-formed Boolean formulae in the NNF are mutually inconsistent; (ii) Decomposable: An NNF is decomposable if the operands of $\wedge$ in all well-formed Boolean formulae in the NNF are expressed on mutually disjoint sets of variables. In contrast to CNF and more general forms, d-DNNF has many desirable tractability properties (e.g., polytime satisfiability and polytime model counting). These tractability properties make d-DNNF particularly appealing for complex AI applications Darwiche2001.
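To make the tractability claim concrete, the following is a minimal sketch (not the authors' code) of model counting on a small d-DNNF. The tuple-based node encoding and the function names are assumptions made for this illustration, and True/False leaves are omitted for brevity. Decomposability lets an AND node multiply the counts of its children, and determinism lets an OR node add them, once each child is padded for the variables it does not mention.

```python
from typing import FrozenSet

# Hypothetical node encoding, chosen only for this sketch:
#   ("lit", name, True/False)   literal x (True) or NOT x (False)
#   ("and", child, child, ...)  decomposable conjunction
#   ("or", child, child, ...)   deterministic disjunction

def variables(node) -> FrozenSet[str]:
    """Set of variable names mentioned below `node`."""
    if node[0] == "lit":
        return frozenset([node[1]])
    result: FrozenSet[str] = frozenset()
    for child in node[1:]:
        result |= variables(child)
    return result

def count_models(node) -> int:
    """Count models of `node` over exactly the variables it mentions."""
    if node[0] == "lit":
        return 1
    if node[0] == "and":
        # Children share no variables, so their counts multiply.
        total = 1
        for child in node[1:]:
            total *= count_models(child)
        return total
    # "or": children are mutually inconsistent, so their counts add.
    # Each child is padded by 2^(number of variables it leaves free).
    all_vars = variables(node)
    total = 0
    for child in node[1:]:
        free = len(all_vars) - len(variables(child))
        total += count_models(child) * 2 ** free
    return total

# (p => q) AND m AND n, written as the d-DNNF (NOT p OR (p AND q)) AND m AND n.
rule = ("and",
        ("or", ("lit", "p", False), ("and", ("lit", "p", True), ("lit", "q", True))),
        ("lit", "m", True),
        ("lit", "n", True))
num_models = count_models(rule)
```

This sketch recomputes shared sub-graphs; an implementation intended for large DAGs would memoize the per-node results so that the count stays linear in the size of the graph.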

Although building d-DNNFs is a difficult problem in general, practical compilation can often be performed in reasonable time. We use c2d Darwiche2004NewAI, which can compile relatively large d-DNNFs; in our experiments, it took less than 2 seconds to compile a d-DNNF from a CNF with 1000 clauses and 1000 propositions on a standard workstation. Our GCN can also embed other logic forms expressible as graphs and thus, other logic forms (e.g., CNF) could be used when d-DNNF compilation is not possible or is prohibitively expensive.

Logic in Neural Networks

Integrating learning and reasoning remains a key problem in AI and encompasses various methods, including logic circuits Liang2019, Logic Tensor Networks Donadello2017; SerafiniG2016, and knowledge distillation Hu2016HarnessingDN. Our primary goal in this work is to incorporate symbolic domain knowledge into connectionist architectures. Recent work can be categorized into two general approaches.

The first approach augments the training objective with an additional logic loss as a means of applying soft-constraints Xu2018; StewartE2017; Tim2015; Demeester2016. For example, the semantic loss used in Xu2018 quantifies the probability of generating a satisfying assignment by randomly sampling from a predictive distribution. The second approach is via embeddings, i.e., learning vector based representations of symbolic knowledge that can be naturally handled by neural networks. For example, the ConvNet Encoder Kim2016 embeds formulae (sequences of symbols) using a stack of one-dimensional convolutions. TreeRNN allamanis2016 and TreeLSTM encoders Tai2015; Le2015; Zhu2015 recursively encode formulae using recurrent neural networks.

This work adopts the second, embedding-based approach and adapts the Graph Convolutional Network (GCN) Kipf2017 towards embedding logical formulae expressed in d-DNNF. The prior work discussed above has focused largely on CNF (and more general forms), and has neglected d-DNNF despite its appealing properties. Unlike the ConvNet and TreeRNN/LSTM, our GCN is able to utilize semantic information inherent in the d-DNNF structure, while remaining invariant to proposition relabeling.

3 Logic Embedding Network with Semantic Regularization

Figure 6: (a)–(c) Logic graph examples of the formula $(p \Rightarrow q) \wedge m \wedge n$ in (a) general form, (b) CNF, and (c) d-DNNF. This formula could encode a rule for “person wearing glasses” where $p$ denotes wear(person,glasses), $q$ denotes in(glasses,person), $m$ denotes exist(person) and $n$ denotes exist(glasses). (d) An example DAG showing a more complex d-DNNF logic rule.

In this section, we detail our approach, from logic graph creation to model training and eventual use on a target task. As a guide, Fig. 1 shows an overview of our model. LENSR specializes a GCN for d-DNNF formulae. A logical formula (and corresponding truth assignments) can be represented as a directed or undirected graph $G = (V, E)$ with $N$ nodes, $v_i \in V$, and edges $(v_i, v_j) \in E$. Individual nodes are either propositions (leaf nodes) or logical operators ($\wedge, \vee, \Rightarrow$), where subjects and objects are connected to their respective operators. In addition to the above nodes, we augment the graph with a global node, which is linked to all other nodes in the graph.

As a specific example (see Fig. 6), consider an image which contains a person and a pair of glasses. We wish to determine the relation between them, e.g., whether the person is wearing the glasses. We could use spatial logic to reason about this question; if the person is wearing the glasses, the image of the glasses should be “inside” the image of the person. Expressing this notion as a logical rule, we have: $(\text{wear(person, glasses)} \Rightarrow \text{in(glasses, person)}) \wedge \text{exist(person)} \wedge \text{exist(glasses)}$. Although the example rule above results in a tree structure, d-DNNF formulae are DAGs in general.

3.1 Logic Graph Embedder with Heterogeneous Nodes and Semantic Regularization

We embed logic graphs using a multi-layer Graph Convolutional Network Kipf2017, which is a first-order approximation of localized spectral filters on graphs Hammond2011; Defferrard2016. The layer-wise propagation rule is,

$$Z^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} Z^{(l)} W^{(l)}\right) \tag{1}$$

where $Z^{(l)}$ are the learnt latent node embeddings at the $l$-th layer (note that $Z^{(0)} = X$, the input node features), and $\tilde{A} = A + I_N$ is the adjacency matrix of the undirected graph $G$ with added self-connections via the identity matrix $I_N$. $\tilde{D}$ is a diagonal degree matrix with $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$. The layer-specific trainable weight matrices are $W^{(l)}$, and $\sigma(\cdot)$ denotes the activation function. To better capture the semantics associated with the logic graphs, we propose two modifications to the standard graph embedder: heterogeneous node embeddings and semantic regularization.
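As an illustration of the propagation rule of the standard GCN Kipf2017 (self-connections added to the adjacency matrix, symmetric degree normalization, a linear map, then an activation), here is a minimal pure-Python sketch of a single layer. The function names and the choice of ReLU as the activation are assumptions for this example; it does not include the heterogeneous weights or the regularizer introduced next.

```python
import math
from typing import List

Matrix = List[List[float]]

def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def gcn_layer(adj: Matrix, z: Matrix, w: Matrix) -> Matrix:
    """One GCN layer: relu(D~^-1/2 (A + I) D~^-1/2 Z W)."""
    n = len(adj)
    # Add self-connections: A~ = A + I.
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    # Diagonal degree of A~: D~_ii = sum_j A~_ij.
    deg = [sum(row) for row in a_tilde]
    # Symmetric normalization: A~_ij / sqrt(d_i * d_j).
    a_hat = [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
             for i in range(n)]
    hidden = matmul(matmul(a_hat, z), w)
    # ReLU is assumed here as the activation function.
    return [[max(0.0, x) for x in row] for row in hidden]
```

For a two-node graph with a single edge, one-hot input features and an identity weight matrix, every output entry is 0.5: each node averages its own feature with its neighbour's.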

Heterogeneous Node Embedder.

In the default GCN embedder, all nodes share the same set of embedding parameters. However, different types of nodes have different semantics, e.g., compare an $\Rightarrow$ node vs. a proposition node. Thus, learning may be improved by using distinct information propagation parameters for each node type. Here, we propose to use type-dependent logical gate weights and attributes, i.e., a different $W^{(l)}$ for each of the five node types (leaf, global, $\wedge$, $\vee$, $\Rightarrow$).

Semantic Regularization.

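d-DNNF logic graphs possess certain structural/semantic constraints, and we propose to incorporate these constraints into the embedding structure. More precisely, we regularize the children embeddings of $\wedge$ gates to be orthogonal. This intuitively corresponds to the constraint that the children do not share variables (i.e., $\wedge$ is decomposable). Likewise, we propose to constrain the $\vee$ gate children embeddings to sum up to a unit vector, which corresponds to the constraint that one and only one child of a $\vee$ gate is true (i.e., $\vee$ is deterministic). The resultant semantic regularizer loss is:

$$\ell_r(f) = \sum_{v_i \in \mathcal{N}_O} \Big\lVert \sum_{v_j \in \mathcal{C}_i} q(v_j) - \mathbf{1} \Big\rVert_2^2 + \sum_{v_k \in \mathcal{N}_A} \big\lVert V_k^\top V_k - \mathrm{diag}(V_k^\top V_k) \big\rVert_2^2 \tag{2}$$

where $q$ is our logic embedder, $\mathcal{N}_O$ is the set of $\vee$ nodes, $\mathcal{N}_A$ is the set of $\wedge$ nodes, $\mathcal{C}_*$ is the set of child nodes of $v_*$, and $V_k = [q(v_1), q(v_2), \ldots, q(v_l)]$ where $v_l \in \mathcal{C}_k$.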
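The two soft constraints can be sketched as follows. This is one possible reading rather than the authors' implementation: it assumes that the OR term penalizes the squared distance between the element-wise sum of the child embeddings and the all-ones vector, and that the AND term penalizes the squared off-diagonal entries of the Gram matrix of the child embeddings. The function names are hypothetical.

```python
from typing import List

Vector = List[float]

def or_penalty(children: List[Vector]) -> float:
    """Squared distance between the summed child embeddings and the
    all-ones vector (assumed target for the determinism constraint)."""
    dim = len(children[0])
    summed = [sum(child[k] for child in children) for k in range(dim)]
    return sum((s - 1.0) ** 2 for s in summed)

def and_penalty(children: List[Vector]) -> float:
    """Sum of squared dot products between distinct children, i.e. the
    squared off-diagonal entries of the Gram matrix (orthogonality)."""
    total = 0.0
    for i, a in enumerate(children):
        for j, b in enumerate(children):
            if i != j:
                dot = sum(x * y for x, y in zip(a, b))
                total += dot ** 2
    return total

def semantic_regularizer(or_groups: List[List[Vector]],
                         and_groups: List[List[Vector]]) -> float:
    """Total penalty over the children of all OR nodes and all AND nodes."""
    return (sum(or_penalty(g) for g in or_groups)
            + sum(and_penalty(g) for g in and_groups))
```

Orthogonal children incur no AND penalty, while two identical unit vectors are penalized once for each ordered pair.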

3.2 Embedder Training with a Triplet Loss

As previously mentioned, LENSR minimizes distances between the embeddings of formulae and satisfying assignments in a shared latent embedding space. To achieve this, we use a triplet loss that encourages formulae embeddings to be close to satisfying assignments, and far from unsatisfying assignments.

Formally, let $q(\cdot)$ be the embedding produced by the modified GCN embedder. Denote $q(f)$ as the embedding of the d-DNNF logic graph for a given formula, and $q(\tau_T)$ and $q(\tau_F)$ as the assignment embeddings for a satisfying and unsatisfying assignment, respectively. For assignments, the logical graph structures are simple and shallow; assignments are a conjunction of propositions and thus, the pre-augmented graph is a tree with one $\wedge$ gate. Our triplet loss is a hinge loss:

$$\ell_t(f, \tau_T, \tau_F) = \max\{d(q(f), q(\tau_T)) - d(q(f), q(\tau_F)) + m,\ 0\} \tag{3}$$

where $d(x, y)$ is the squared Euclidean distance between vector $x$ and vector $y$, and $m$ is the margin. We make use of a SAT solver, python-sat Ignatiev18PySAT, to obtain the satisfying and unsatisfying assignments. Training the embedder entails optimizing a combined loss:

$$L_{emb} = \sum_{f} \sum_{\tau_T, \tau_F} \ell_t(f, \tau_T, \tau_F) + \lambda_r \ell_r(f) \tag{4}$$

where $\ell_t$ is the triplet loss above, $\ell_r$ is the semantic regularization term for d-DNNF formulae, and $\lambda_r$ is a hyperparameter that controls the strength of the regularization. The summation is over formulas and associated pairs of satisfying and unsatisfying assignments in our dataset. In practice, pairs of assignments are randomly sampled for each formula during training.
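The hinge-style triplet loss with a squared Euclidean distance can be sketched in a few lines. The function names are assumptions, the embeddings are plain lists standing in for the embedder outputs, and adding the regularizer once per formula is also an assumption of this sketch.

```python
from typing import List, Tuple

Vector = List[float]

def squared_distance(x: Vector, y: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def triplet_loss(formula: Vector, sat: Vector, unsat: Vector,
                 margin: float) -> float:
    """Hinge loss: pull the satisfying assignment towards the formula
    and push the unsatisfying one at least `margin` further away."""
    return max(squared_distance(formula, sat)
               - squared_distance(formula, unsat) + margin, 0.0)

def embedder_loss(formula: Vector,
                  pairs: List[Tuple[Vector, Vector]],
                  regularizer: float,
                  lambda_r: float,
                  margin: float) -> float:
    """Triplet losses over sampled (sat, unsat) pairs of one formula,
    plus the weighted semantic regularizer of that formula."""
    triplets = sum(triplet_loss(formula, s, u, margin) for s, u in pairs)
    return triplets + lambda_r * regularizer
```

When the satisfying assignment is already closer than the unsatisfying one by more than the margin, the triplet term is zero and only the regularizer contributes.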

3.3 Target Task Training with a Logic Loss

Finally, we train the target model $h$ by augmenting the per-datum loss with a logic loss $\ell_{logic}$:

$$L = L_c + \lambda \ell_{logic} \tag{5}$$

where $\ell_{logic} = \lVert q(f_x) - q(h(x)) \rVert_2^2$ is the embedding distance between the formula $f_x$ related to the input $x$ and the predictive distribution $h(x)$, $L_c$ is the task-specific loss (e.g., cross-entropy for classification), and $\lambda$ is a trade-off factor. Note that the distribution $h(x)$ may be any relevant predictive distribution produced by the network, including intermediate layers. As such, intermediate outputs can be regularized with prior knowledge for later downstream processing. To obtain the embedding of $h(x)$ as $q(h(x))$, we first compute an embedding for each predicted relationship $p_i$ by taking an average of the relationship embeddings weighted by their predicted probabilities. Then, we construct a simple logic graph $\bigwedge_i p_i$, which is embedded using $q$.
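A small sketch of the two ingredients described above: the probability-weighted average of relationship embeddings, and the task loss augmented by an embedding distance. The function names are hypothetical, and the step that embeds the resulting logic graph with the GCN is deliberately left out, so the vectors passed to `augmented_loss` stand in for the two embedder outputs.

```python
from typing import List

Vector = List[float]

def expected_embedding(probs: List[float],
                       relation_embeddings: List[Vector]) -> Vector:
    """Average of relationship embeddings weighted by predicted
    probabilities (normalized in case they do not sum to one)."""
    total = sum(probs)
    dim = len(relation_embeddings[0])
    return [sum(p * e[k] for p, e in zip(probs, relation_embeddings)) / total
            for k in range(dim)]

def augmented_loss(task_loss: float, formula_emb: Vector,
                   prediction_emb: Vector, trade_off: float) -> float:
    """Task loss plus the trade-off factor times the squared Euclidean
    distance between the formula and prediction embeddings."""
    distance = sum((a - b) ** 2 for a, b in zip(formula_emb, prediction_emb))
    return task_loss + trade_off * distance

soft_relation = expected_embedding([0.25, 0.75], [[0.0, 0.0], [4.0, 8.0]])
```

Because the averaging uses the predicted probabilities directly, the logic loss remains differentiable with respect to the network outputs in an autograd framework.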

4 Empirical Results: Synthetic Dataset

Figure 10: (a) Prediction loss (on the training set) as training progressed (line shown is the average over 10 runs with shaded region representing the standard error); (b) Formulae satisfiability vs. distance in the embedding space, showing that LENSR learnt a good representation by projecting d-DNNF logic graphs; (c) Test accuracies indicate that the learned d-DNNF embeddings outperform the general form and CNF embeddings, and are more robust to increasing formula complexity.

In this section, we focus on validating that d-DNNF formulae embeddings are more informative relative to embeddings of general form and CNF formulae. Specifically, we conduct tests using an entailment checking problem; given the embedding of a formula f and the embedding of an assignment $\tau$, predict whether $\tau$ satisfies f.

Experiment Setup and Datasets.

We trained 7 different models using general, CNF, and d-DNNF formulae (with and without heterogeneous node embedding and semantic regularization). For this test, each LENSR model comprised 3 layers, with 50 hidden units per layer. LENSR produces a 100-dimensional embedding for each input formula/assignment. The neural network used for classification is a 2-layer perceptron with 150 hidden units. We set the margin $m$ in Eqn. 3 and the regularization weight $\lambda_r$ in Eqn. 4 using grid search to find reasonable values.

To explicitly control the complexity of formulae, we synthesized our own dataset. The complexity of a formula is (coarsely) reflected by its number of variables $n_v$ and the maximum formula depth $d_m$. We prepared three datasets with different $(n_v, d_m)$ settings and labelled their complexity as “low”, “moderate”, and “high”. We attempted to provide a good coverage of potential problem difficulty: the “low” case represents easy problems that all the compared methods were expected to do well on, and the “high” case represents very challenging problems. For each formula, we use the python-sat package Ignatiev18PySAT to find its satisfying and unsatisfying assignments. There are 1000 formulae in each difficulty level. We take at most 5 satisfying assignments and 5 unsatisfying assignments for each formula in our dataset. We converted all formulae and assignments to CNF and d-DNNF.
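For small formulae, the collection of satisfying and unsatisfying assignments can be illustrated by plain enumeration. The paper uses python-sat for this step; the sketch below is a brute-force stand-in that uses only the standard library, with clauses written as lists of signed integers (the DIMACS convention). The function names and the `limit` parameter are assumptions for this example.

```python
from itertools import product
from typing import Dict, List, Tuple

Clause = List[int]             # e.g. [-1, 2] means (NOT x1 OR x2)
Assignment = Dict[int, bool]   # variable index -> truth value

def satisfies(cnf: List[Clause], assignment: Assignment) -> bool:
    """True if every clause has at least one literal made true."""
    return all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in cnf)

def split_assignments(cnf: List[Clause], num_vars: int,
                      limit: int = 5) -> Tuple[List[Assignment], List[Assignment]]:
    """Enumerate all assignments and keep at most `limit` satisfying
    and `limit` unsatisfying ones."""
    sat: List[Assignment] = []
    unsat: List[Assignment] = []
    for values in product([False, True], repeat=num_vars):
        assignment = {i + 1: v for i, v in enumerate(values)}
        (sat if satisfies(cnf, assignment) else unsat).append(assignment)
    return sat[:limit], unsat[:limit]

# (p => q) AND m AND n with p=1, q=2, m=3, n=4.
example_cnf = [[-1, 2], [3], [4]]
sat_examples, unsat_examples = split_assignments(example_cnf, num_vars=4)
```

Enumeration is exponential in the number of variables, which is why a SAT solver is the practical choice for formulae with many propositions.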

Formula Form | HE | SR | Acc. (%), Low | Acc. (%), Moderate | Acc. (%), High
General | - | - | 89.63 (0.25) | 70.32 (0.89) | 68.51 (0.53)
CNF | | - | 90.02 (0.18) | 71.19 (0.93) | 69.42 (1.03)
CNF | | - | 90.25 (0.15) | 73.92 (1.01) | 68.79 (0.69)
d-DNNF | | | 89.91 (0.31) | 82.49 (1.11) | 70.56 (1.16)
d-DNNF | | | 90.22 (0.23) | 82.28 (1.40) | 71.46 (1.17)
d-DNNF | | | 90.27 (0.55) | 81.30 (1.29) | 70.54 (0.62)
d-DNNF | | | 90.35 (0.32) | 83.04 (1.58) | 71.52 (0.54)
Table 1: Prediction accuracy and standard error over 10 independent runs with models using different forms of formulae and regularization. Standard error shown in brackets. “HE” means the model is a heterogeneous embedder, “SR” means the model is trained with semantic regularization. “✓” denotes “with the respective property” and “-” denotes “Not Applicable”. The best scores are in bold.
Results and Discussion.

Table 1 summarizes the classification accuracies across the models and datasets over 10 independent runs. In brief, the heterogeneous embedder with semantic regularization trained on d-DNNF formulae outperforms the alternatives. We see that semantic regularization works best when paired with heterogeneous node embedding; this is relatively unsurprising since the AND and OR operators are regularized differently and distinct sets of parameters are required to propagate relevant information.

In our experiments, we found the d-DNNF model to converge faster than the CNF and general form (Fig. 10(a)). Utilizing both semantic regularization and heterogeneous node embedding further improves the convergence rate. The resultant embedding spaces are also more informative of satisfiability; Fig. 10(b) shows that the distances between the formulae and associated assignments better reflect satisfiability for d-DNNF. This results in higher accuracies (Fig. 10(c)), particularly on the moderate complexity dataset. We posit that the differences on the low and high regimes were smaller because (i) in the low case, all the methods performed reasonably well, and (ii) on the high regime, embedding the constraints helps to a limited extent and points to avenues for future work.

Overall, these results provide empirical evidence for our conjecture that d-DNNF are more amenable to embedding, compared to CNF and general form formulae.

5 Visual Relation Prediction

In this section, we show how our framework can be applied to a real-world task, Visual Relation Prediction (VRP), to train improved models that are consistent with both training data and prior knowledge. The goal of VRP is to predict the correct relation between two objects given visual information in an input image. We evaluate our method on VRD Lu2016Visual. The VRD dataset contains 5,000 images with 100 object categories and 70 annotated predicates (relations). For each image, we sample pairs of objects and induce their spatial relations. If there is no annotation for a pair of objects in the dataset, we label it as having “no-relation”.

Propositions and Constraints
Figure 13: (a) The 10 spatial relations used in the Visual Relation Prediction task, and an example image illustrating the relation in(helmet, person). (b) A prediction comparison between neural networks trained w/ and w/o LENSR. A tick indicates a correct prediction. In this example, the misleading effects of the street are corrected by spatial constraints on “skate on”.

The logical rules for the VRP task consist of logical formulae specifying constraints. In particular, there are three types of propositions in our model:

  • Existence Propositions  The existence of each object forms a proposition which is True if it exists in the image and False otherwise. For example, proposition p=exist(person) is True if a person is in the input image and False otherwise.

  • Visual Relation Propositions  Each of the candidate visual relation together with its subject and object forms a proposition. For example, wear(person, glasses) is a proposition and has value True if there is a person wearing glasses in the image and False otherwise.

  • Spatial Relation Propositions  In order to add spatial constraints, e.g. a person cannot wear the glasses if their bounding boxes do not overlap, we define 10 types of spatial relationships (illustrated in Fig. 13(a)). We assign a proposition for each spatial relation such that the proposition evaluation is True if the relation holds and False otherwise, e.g. in(glasses, person). Furthermore, exactly one spatial relation proposition for a fixed subject-object pair is True, i.e. spatial relation propositions for a fixed subject-object pair are mutually exclusive.

The above propositions are used to form two types of logical constraints:

  • Existence Constraints.  The prerequisite of any relation is that relevant objects exist in the image. Therefore $p(\text{sub}, \text{obj}) \Rightarrow (\text{exist}(\text{sub}) \wedge \text{exist}(\text{obj}))$, where p is any of the visual or spatial relations introduced above.

  • Spatial Constraints.  Many visual relations hold only if a given subject and object follow a spatial constraint. For example, a person cannot be wearing glasses if the bounding boxes for the person and the glasses do not overlap. This observation gives us rules such as $\text{wear(person, glasses)} \Rightarrow \text{in(glasses, person)}$.

For each image $i$ in the training set we can generate a set of clauses $F_i = \{c_{ij}\}$ where $c_{ij} = \bigvee_k p_{jk}$. Each clause $c_{ij}$ represents a constraint in image $i$ where $j$ is the constraint index, and each proposition $p_{jk}$ represents a relation in image $i$ where $k$ is the proposition index in the constraint $c_{ij}$. We obtain the relations directly from the annotations for the image and calculate the constraints based on the definitions above. Depending on the number of objects in image $i$, $F_i$ can contain 50 to 1000 clauses and variables. All these clauses are combined together to form a formula $f = \bigwedge_i F_i$.
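The two constraint types translate directly into clauses. The sketch below is illustrative only: the string-based proposition names and the `(name, polarity)` literal encoding are assumptions, not the representation used in the released code. An implication `a => (b AND c)` becomes the two clauses `(NOT a OR b)` and `(NOT a OR c)`, and `a => (s1 OR s2)` becomes the single clause `(NOT a OR s1 OR s2)`.

```python
from typing import List, Tuple

Literal = Tuple[str, bool]   # (proposition name, True if positive)
Clause = List[Literal]

def existence_clauses(relation: str, subj: str, obj: str) -> List[Clause]:
    """relation(subj,obj) => exist(subj) AND exist(obj), as two clauses."""
    prop = f"{relation}({subj},{obj})"
    return [
        [(prop, False), (f"exist({subj})", True)],
        [(prop, False), (f"exist({obj})", True)],
    ]

def spatial_clause(visual_prop: str, allowed_spatial: List[str]) -> Clause:
    """visual_prop => (spatial_1 OR spatial_2 OR ...), as one clause."""
    return [(visual_prop, False)] + [(s, True) for s in allowed_spatial]

wear_rules = existence_clauses("wear", "person", "glasses")
wear_rules.append(spatial_clause("wear(person,glasses)", ["in(glasses,person)"]))
```

Collecting such clauses for every annotated relation in an image yields a per-image clause set, and the conjunction of all clause sets gives the formula that is later embedded.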

5.1 VRP Model Training

Figure 14: The framework we use to train VRP models with LENSR.

Using the above formulae, we train our VRP model in a two-step manner; first, we train the embedder $q$ using only f. The embedder is a GCN with the same structure as described in Sec. 4. Then, the embedder is fixed and the target neural network $h$ is trained to predict relations together with the logic loss (described in Sec. 3.3). The training framework is illustrated in Fig. 14. In our experiment, $h$ is an MLP with 2 layers and 512 hidden units. To elaborate:

Embedder Training.

For each training image, we generate an intermediate formula $f_i$ that only contains propositions related to the current image $I_i$. To do so, we iterate over all clauses in f and add a clause $c$ into the intermediate formula if all subjects and objects of all literals in $c$ are in the image. The formula $f_i$ is then appended with existence and spatial constraints defined in Sec. 5.
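The clause-filtering step can be sketched as follows, assuming (for illustration only) that propositions are strings such as `wear(person,glasses)` and that literals are `(name, polarity)` pairs. The helper names are hypothetical.

```python
import re
from typing import List, Set, Tuple

Literal = Tuple[str, bool]   # (proposition name, True if positive)
Clause = List[Literal]

def objects_of(prop: str) -> Set[str]:
    """Extract the argument names from a string like 'wear(person,glasses)'."""
    match = re.search(r"\((.*)\)", prop)
    if match is None:
        return set()
    return {arg.strip() for arg in match.group(1).split(",")}

def intermediate_formula(clauses: List[Clause],
                         objects_in_image: Set[str]) -> List[Clause]:
    """Keep only the clauses whose literals mention objects present in the image."""
    kept = []
    for clause in clauses:
        mentioned: Set[str] = set()
        for name, _polarity in clause:
            mentioned |= objects_of(name)
        if mentioned <= objects_in_image:
            kept.append(clause)
    return kept

all_clauses = [
    [("wear(person,glasses)", False), ("in(glasses,person)", True)],
    [("ride(person,horse)", False), ("exist(horse)", True)],
]
filtered = intermediate_formula(all_clauses, {"person", "glasses"})
```

Only the first clause survives for an image that contains a person and glasses but no horse.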

To obtain the vector representation of a proposition, we first convert its corresponding relation into a phrase (e.g. p=wear(person, glasses) is converted to “person wear glasses”). Then, the GloVe embeddings Pennington2014glove for each word are summed to form the embedding for the entire phrase. The formula is then either kept as CNF or converted to d-DNNF Darwiche2004NewAI depending on the embedder. Similar to Sec. 4, the assignments of $f_i$ are found and used to train the embedder using the triplet loss (Eqn. 3).
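The phrase construction and the summation of word vectors can be illustrated as below. The two-dimensional vectors are made-up placeholders, not real GloVe values, and the function names are assumptions for this sketch.

```python
import re
from typing import Dict, List

def relation_to_phrase(prop: str) -> str:
    """Turn 'wear(person, glasses)' into 'person wear glasses'."""
    match = re.fullmatch(r"(\w+)\((\w+),\s*(\w+)\)", prop)
    if match is None:
        raise ValueError(f"unexpected proposition format: {prop}")
    relation, subj, obj = match.groups()
    return f"{subj} {relation} {obj}"

def phrase_embedding(phrase: str,
                     word_vectors: Dict[str, List[float]]) -> List[float]:
    """Sum the word vectors of every word in the phrase."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for word in phrase.split():
        for k, value in enumerate(word_vectors[word]):
            total[k] += value
    return total

# Placeholder vectors for illustration only (not actual GloVe entries).
toy_vectors = {"person": [1.0, 0.0], "wear": [0.0, 2.0], "glasses": [0.5, 0.5]}
phrase = relation_to_phrase("wear(person, glasses)")
embedding = phrase_embedding(phrase, toy_vectors)
```

Summation makes the phrase embedding independent of word order, so the subject and object roles are carried by the graph structure rather than by the initial node features.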

Target Model Training.

After $q$ is trained, we fix its parameters and use it to train the relation prediction network $h$. In our relation prediction task, we assume the objects in the images are known; we are given the object labels and bounding boxes. Although this is a strong assumption, object detection is an upstream task that is handled by other methods and is not the focus of this work. Indeed, all compared approaches are provided with exactly the same information. The input to the model is the image, and the labels and bounding boxes of all detected objects. The network predicts a relation based on the visual feature and the embedding of class labels:

$$\hat{y} = h([v_l; v_b; v_v]) \tag{6}$$

where $\hat{y}$ is the relation prediction, $v_l$ is the GloVe embedding for the class labels of subjects and objects, $v_b$ is the relative bounding box positions of subjects and objects, $v_v$ is the visual feature extracted from the union bounding box of the objects, and $[\cdot\,;\cdot]$ indicates concatenation of vectors. We compute the logic loss term as

$$\ell_{logic} = \Big\lVert q(f) - q\Big(\bigwedge_i p_i\Big) \Big\rVert_2^2 \tag{7}$$

where $p_i$ is the predicate for a relation predicted to hold in the image, and f is the formula generated from the input information. As previously stated, our final objective function is $L = L_c + \lambda \ell_{logic}$ where $L_c$ is the cross entropy loss, and $\lambda$ is a trade-off factor. We optimized this objective using Adam Kingma2015AdamAM with learning rate .

Although our framework can be trained end-to-end, we trained the logic embedder and target network separately to (i) alleviate potential loss fluctuations in joint optimization, and (ii) enable the same logic embeddings to be used with different target networks (for different tasks). The networks could be further optimized jointly to fine-tune the embeddings for a specific task, but we did not perform fine-tuning for this experiment.

5.2 Empirical Results

Table 2 summarizes our results and shows the top-5 accuracy score of the compared methods (top-5 accuracy was used as our performance metric because a given pair of objects may have multiple relationships, and reasonable relations may not have been annotated in the dataset). We clearly see that our GCN approach (with heterogeneous node embedding and semantic regularization) performs far better than the baseline model without logic embeddings. Note also that direct application of d-DNNFs via the semantic loss Xu2018 only resulted in marginal improvement over the baseline. A potential reason is that the constraints in VRP are more complicated than those explored in prior work: there are thousands of propositions and a straightforward use of d-DNNFs causes the semantic loss to rapidly approach $\infty$. Our embedding approach avoids this issue and thus, is able to better leverage the encoded prior knowledge. Our method also outperforms the state-of-the-art TreeLSTM embedder Tai2015; since RNN-based embedders are not invariant to variable-ordering, they may be less appropriate for symbolic expressions, especially propositional logic.
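For reference, top-k accuracy counts a prediction as correct when the annotated relation appears among the k highest-scoring candidates. A minimal sketch, with a hypothetical function name and plain lists standing in for the network outputs:

```python
from typing import List

def top_k_accuracy(scores: List[List[float]], labels: List[int],
                   k: int = 5) -> float:
    """Fraction of examples whose true label is among the k top-scored classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores for this example.
        top = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        if label in top:
            hits += 1
    return hits / len(labels)

toy_scores = [[0.1, 0.9, 0.0], [0.8, 0.05, 0.15]]
toy_labels = [1, 2]
```

With these toy scores, the second example is a miss at k=1 but a hit at k=2, which mirrors why a top-5 metric is more forgiving when several relations are plausible for the same pair of objects.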

As a qualitative comparison, Fig. 13(b) shows an example where logic rules embedded by LENSR help the target task model. The top-3 predictions of the neural network trained with LENSR are all reasonable answers for the input image. However, the top-3 relations predicted by the baseline model are unsatisfactory and the model appears misled by the street between the subject and the object. LENSR leverages the logic rules that indicate that the “skate on” relation requires the subject to be “above” or “overlap above” the object, which corrects for the effect of the street.

Model | Form | HE | SR | Top-5 Acc. (%)
without logic | - | - | - | 84.30
with semantic loss Xu2018 | - | - | - | 84.76
with treeLSTM embedder Tai2015 | CNF | - | - | 85.76
with treeLSTM embedder Tai2015 | d-DNNF | - | - | 82.99
LENSR | CNF | | - | 85.39
LENSR | CNF | | - | 85.70
LENSR | d-DNNF | | | 85.37
Table 2: Performance of VRP under different configurations. “HE” indicates a heterogeneous node embedder, “SR” means the model uses semantic regularization. “✓” denotes “with the respective property” and “-” denotes “Not Applicable”. The best scores are in bold.

6 Discussion

Our experimental results show an interesting phenomenon: the usage of d-DNNF (when paired with semantic regularization) significantly improved performance compared to other forms. This raises a natural question of whether d-DNNF embeddings are easier to learn. Establishing a complete formal connection between improved learning and compiled forms is beyond the scope of this work. However, we use the size of the space of the formulas as a way to argue about ease of learning and formalize this through the concept of embeddable-demanding.

Definition 1 (Embeddable-Demanding)

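Let $L_1, L_2$ be two compilation languages. $L_1$ is at least as embeddable-demanding as $L_2$ iff there exists a polynomial $p$ such that for every sentence $\alpha \in L_2$, there exists $\beta \in L_1$ such that (i) $|\beta| \le p(|\alpha|)$. Here $|\alpha|, |\beta|$ are the sizes of $\alpha, \beta$ respectively, and $\beta$ may include auxiliary variables. (ii) The transformation from $\alpha$ to $\beta$ is poly time. (iii) There exists a bijection between models of $\beta$ and models of $\alpha$.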

Theorem 1

CNF is at least as embeddable-demanding as d-DNNF, but if d-DNNF is at least as embeddable-demanding as CNF then $P = PP$.

The proof and detailed theorem statement are provided in the Appendix. More broadly, Theorem 1 is a first step towards a more comprehensive theory of the embeddability of different logical forms. Future work in this area could potentially yield interesting insights and new ways of leveraging symbolic knowledge in deep neural networks.

7 Conclusion

To summarize, this paper proposed LENSR, a novel framework for leveraging prior symbolic knowledge. By embedding d-DNNF formulae using an augmented GCN, LENSR boosts the performance of deep NNs on model-checking and VRP tasks. The empirical results indicate that constraining embeddings to be semantically faithful, e.g., by allowing for node heterogeneity and through regularization, aids model training. Our work also suggests potential future benefits from a deeper examination of the relationship between tractability and embedding, and the extension of semantic-aware embedding to alternative graph structures. Future extensions of LENSR that embed other forms of prior symbolic knowledge could enhance deep learning where data is relatively scarce (e.g., real-world interactions with humans Soh2015 or objects taunyazov2019towards). To encourage further development, we have made our source code available online at https://github.com/ZiweiXU/LENSR.


This work was supported in part by a MOE Tier 1 Grant to Harold Soh and by the National Research Foundation Singapore under its AI Singapore Programme [AISG-RP-2018-005]. It was also supported by the National Research Foundation, Prime Minister’s Office, Singapore under its Strategic Capability Research Centres Funding Initiative.


Appendix A Supplementary Material

A.1 Embedder Training Algorithm

The algorithm for training the embedder is summarized in Algorithm 1.

Input: f: CNF formula; I: Training images; m: Margin
Output: q: Embedder

q ← init()
repeat
    for all I_i ∈ I do
        // Create intermediate formula
        f_i ← empty formula
        O_i ← all_instances_in(I_i)
        for all c ∈ f do
            if all_instances_in(c) ⊆ O_i then
                f_i ← f_i ∧ c
            end if
        end for
        f_i ← append_constraints(f_i)
        τ_T ← sat_assig_of(f_i)
        τ_F ← unsat_assig_of(f_i)
        ℓ ← triplet loss (Eqn. 3) on (f_i, τ_T, τ_F) with margin m
        q ← update_with(q, ℓ)
    end for
until Convergence
return q
Algorithm 1 trainEmbedder

A.2 Embeddable-Demanding

The significant performance improvement due to the use of d-DNNF raises the question of whether the language represented by d-DNNF has a smaller search space and is therefore potentially easier to learn. To this end, we introduce the concept of embeddable-demanding below.

The following theorems use standard complexity-theoretic terms, and we refer the reader to the standard text arora2009computational for a detailed treatment of these concepts.

Definition 2 (Embeddable-Demanding)

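Let $L_1, L_2$ be two compilation languages. $L_1$ is at least as embeddable-demanding as $L_2$ iff there exists a polynomial $p$ such that for every sentence $\alpha \in L_2$, there exists $\beta \in L_1$ such that (i) $|\beta| \le p(|\alpha|)$. Here $|\alpha|, |\beta|$ are the sizes of $\alpha, \beta$ respectively, and $\beta$ may include auxiliary variables. (ii) The transformation from $\alpha$ to $\beta$ is poly time. (iii) There exists a bijection between models of $\beta$ and models of $\alpha$.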

Theorem 2

CNF is at least as embeddable-demanding as d-DNNF, but if d-DNNF is at least as embeddable-demanding as CNF then $P = PP$.

Proof 1

(1) Prove that CNF is at least as embeddable-demanding as d-DNNF, i.e. for every formula $\alpha$ in d-DNNF, there exists a polynomial size, polynomial time computable CNF formula $\beta$ such that there is a one-to-one polynomial time computable mapping between models of $\beta$ and models of $\alpha$.

Observe that d-DNNF represents a circuit, which can be encoded into an equisatisfiable CNF formula of polynomial size due to the NP-completeness of CNF satisfiability. In particular, the usage of Tseytin encoding Tseitin1983 ensures that the resulting CNF is of linear size. Furthermore, let the d-DNNF $\alpha$ be defined over the set of variables denoted by $X$, then Tseytin encoding introduces a set of auxiliary variables, say $Y$, for the resulting formula $\beta$ such that $\alpha \equiv \exists Y\, \beta$. Therefore, the mapping from models of $\beta$ to models of $\alpha$ is achieved just by projection of models of $\beta$ on $X$.
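The projection argument can be checked on a small example. The sketch below is a textbook-style Tseytin encoding written for this illustration (the tuple-based node encoding and the function names are assumptions): every internal node receives a fresh auxiliary variable, a handful of clauses tie that variable to its children, and the root variable is asserted. Brute-force enumeration then confirms that each model of the original formula extends to exactly one model of the CNF.

```python
from itertools import product
from typing import List, Tuple

# Hypothetical node encoding: ("lit", name, polarity), ("and", ...), ("or", ...).

def tseytin_cnf(root, variables: List[str]) -> Tuple[List[List[int]], int]:
    """Return (clauses, total number of variables) for an NNF graph."""
    ids = {name: i + 1 for i, name in enumerate(variables)}
    next_id = [len(variables) + 1]
    clauses: List[List[int]] = []

    def encode(node) -> int:
        if node[0] == "lit":
            return ids[node[1]] if node[2] else -ids[node[1]]
        kids = [encode(child) for child in node[1:]]
        gate = next_id[0]          # fresh auxiliary variable for this node
        next_id[0] += 1
        if node[0] == "and":
            clauses.extend([[-gate, k] for k in kids])     # gate => each child
            clauses.append([gate] + [-k for k in kids])    # all children => gate
        else:
            clauses.extend([[gate, -k] for k in kids])     # each child => gate
            clauses.append([-gate] + kids)                 # gate => some child
        return gate

    clauses.append([encode(root)])   # assert the root
    return clauses, next_id[0] - 1

def cnf_models(clauses: List[List[int]], num_vars: int) -> List[Tuple[bool, ...]]:
    """All satisfying assignments, found by exhaustive enumeration."""
    return [values for values in product([False, True], repeat=num_vars)
            if all(any(values[abs(l) - 1] == (l > 0) for l in c) for c in clauses)]

rule = ("and",
        ("or", ("lit", "p", False), ("and", ("lit", "p", True), ("lit", "q", True))),
        ("lit", "m", True), ("lit", "n", True))
clauses, total_vars = tseytin_cnf(rule, ["p", "q", "m", "n"])
models = cnf_models(clauses, total_vars)
projected = {m[:4] for m in models}   # drop the auxiliary variables
```

The CNF has three auxiliary variables and three models, and projecting onto the four original variables gives three distinct assignments, matching the three models of the original rule.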

(2) Prove that if d-DNNF is at least as embeddable-demanding as CNF then $P = PP$. In other words, if for every formula $\alpha$ in CNF, there exists a polynomial size, polynomial time computable d-DNNF $\beta$ such that there is a bijection between models of $\beta$ and models of $\alpha$, then $P = PP$. $P = PP$ implies the collapse of the entire polynomial hierarchy, in particular $P = NP$.

Assume for every formula $\alpha$ in CNF, there exists a polynomial size, polynomial time computable d-DNNF $\beta$ such that there is a bijection between models of $\beta$ and models of $\alpha$. Since d-DNNF allows counting in polynomial time and the existence of a bijection implies that the number of models of $\beta$ is equal to that of $\alpha$, we can compute the number of models of an arbitrary CNF formula in polynomial time; therefore $P = \#P$. In this context, it is worth noting that the entire polynomial hierarchy is shown to be contained in $P^{PP}$, i.e., $PH \subseteq P^{PP}$ toda1991pp.

A.3 Computing Infrastructure

We trained our models using PyTorch 0.4.1 on one NVIDIA GTX 1080 Ti 12GB GPU.

A.4 Hyper-parameter Selection

Our hyper-parameters include: the margin $m$ in the triplet loss of the embedder, the semantic regularizer weight $\lambda_r$, and the logic loss weight $\lambda$. We selected all three by grid search over a range of candidate values and used the same setting across all experiments.