SLASH: Embracing Probabilistic Circuits into Neural Answer Set Programming

10/07/2021, by Arseny Skryagin et al., Technische Universität Darmstadt

The goal of combining the robustness of neural networks and the expressivity of symbolic methods has rekindled the interest in neuro-symbolic AI. Recent advancements in neuro-symbolic AI often consider specifically tailored architectures consisting of disjoint neural and symbolic components, and thus do not exhibit the desired gains that can be achieved by integrating them into a unifying framework. We introduce SLASH, a novel deep probabilistic programming language (DPPL). At its core, SLASH consists of Neural-Probabilistic Predicates (NPPs) and a logical program, united via answer set programming. The probability estimates resulting from NPPs act as the binding element between the logical program and the raw input data, thereby allowing SLASH to answer task-dependent logical queries. This allows SLASH to elegantly integrate the symbolic and neural components into a unified framework. We evaluate SLASH on the benchmark task of MNIST addition as well as on novel tasks for DPPLs, such as missing-data prediction and set prediction, achieving state-of-the-art performance and thereby showing the effectiveness and generality of our method.


1 Introduction

In recent years, Neuro-Symbolic AI approaches to learning (Hudson and Manning, 2019; d’Avila Garcez et al., 2019; Jiang and Ahn, 2020; d’Avila Garcez and Lamb, 2020), which integrate low-level perception with high-level reasoning by combining data-driven neural modules with logic-based symbolic modules, have gained traction. This combination of sub-symbolic and symbolic systems has been shown to have several advantages for various tasks such as visual question answering and reasoning (Yi et al., 2018), concept learning (Mao et al., 2019b), and improved properties for explainable and revisable models (Ciravegna et al., 2020; Stammer et al., 2021).

Rather than designing specifically tailored Neuro-Symbolic architectures, where often the neural and symbolic modules are disjoint and trained independently (Yi et al., 2018; Mao et al., 2019a; Stammer et al., 2021), deep probabilistic programming languages (DPPLs) provide an exciting alternative (Bingham et al., 2018; Tran et al., 2017; Manhaeve et al., 2018; Yang et al., 2020). Specifically, DPPLs integrate neural and symbolic modules via a unifying programming framework, with probability estimates acting as the “glue” between the separate modules, allowing for reasoning over noisy, uncertain data and, importantly, for joint training of the modules. Additionally, prior knowledge and biases in the form of logical rules can easily be added with DPPLs, rather than creating implicit architectural biases, thereby integrating neural networks into downstream logical reasoning tasks.

Figure 1: SLASH Attention illustrated for a visual reasoning task. SLASH with Neural-Probabilistic Predicates (NPPs) consisting of a slot attention encoder and Probabilistic Circuits (PCs). The slot encoder is shared over all NPPs. Each PC learns the joint distribution over the slot encodings and the object attributes of a specific category, e.g. color. Via targeted queries to the NPPs, one can obtain task-related probabilities, e.g. conditional probabilities for the task of set prediction. Given the probability estimates from the NPP(s) and a SLASH program, containing a set of facts and logical statements about the world, the probability of the truth value of a task-related query is computed via answer set programming. The entire system, including the neural and probabilistic modules, is finally trained end2end via a single loss function.

Object-centric deep learning has recently brought forth several exciting avenues of research by introducing inductive biases into neural networks to extract objects from visual scenes in an unsupervised manner (Zhang et al., 2019; Burgess et al., 2019; Engelcke et al., 2020; Greff et al., 2019; Lin et al., 2020; Locatello et al., 2020; Jiang and Ahn, 2020); we refer to Greff et al. (2020) for a detailed overview. A motivation for this line of investigation, which notably has been around for a longer period of time (Fodor and Pylyshyn, 1988; Marcus, 2019), is that objects occur as natural building blocks in human perception and possess advantageous properties for many cognitive tasks, such as scene understanding and reasoning. With a DPPL, these advancements can be improved by integrating the previously mentioned components into the DPPL’s programming framework and further adding constraints about objects and their properties in the form of logical statements, e.g. stating that an object can have only one color, rather than implicitly enforcing this via one-hot encodings.

We propose SLASH – a novel DPPL that, similar to the punctuation symbol, can be used to efficiently combine several paradigms into one. Specifically, SLASH represents a scalable programming language that seamlessly integrates probabilistic logical programming with neural representations and tractable probabilistic estimations. Fig. 1 shows an example instantiation of SLASH, termed SLASH Attention, for object-centric set prediction. SLASH consists of several key building blocks. Firstly, it makes use of Neural-Probabilistic Predicates (NPPs) for probability estimation. NPPs consist of neural and/or probabilistic circuit (PC) modules and act as a unifying term, encompassing the neural predicates of DeepProbLog and NeurASP, as well as purely probabilistic predicates. In this work, we introduce a much more powerful “flavor” of NPPs that consist jointly of neural and PC modules, taking advantage of the power of neural computations together with true density estimation of PCs. Depending on the underlying task one can thus ask a range of queries to the NPP, e.g. sample an unknown, desired variable, but also query for conditional class probabilities. Example NPPs consisting of a slot attention encoder and several PCs are depicted in Fig. 1 for the task of set prediction. The slot encoder is shared across all NPPs, whereas the PC of each NPP models a separate category of attributes. In this way, each NPP models the joint distribution over slot encodings and object attribute values, such as the color of an object. By querying the NPP, one can obtain task-related probability estimations, such as the conditional attribute probability.

The second component of SLASH is the logical program, which consists of a set of facts and logical statements defining the state of the world of the underlying task. For example, one can define the rules for when an object possesses a specific set of attributes (cf. Fig. 1). Thirdly, an ASP module is used to combine the first two components. Given a logical query about the input data, the logical program and the probability estimates obtained from the NPP(s), the ASP module produces a probability estimate for the truth value of the query, stating, e.g., how likely it is for a specific object in an image to be a large, dark red triangle. In contrast to query evaluation in Prolog (Colmerauer and Roussel, 1996; Clocksin and Mellish, 2003), which may lead to an infinite loop, many modern answer set solvers use Conflict-Driven Clause Learning (CDCL) which, in principle, always terminates.

Training in SLASH is performed efficiently in a batch-wise and end2end fashion, by integrating the parameters of all modules, neural and probabilistic, into a single loss term. SLASH thus allows a simple, fast and effective integration of sub-symbolic and symbolic computations. In our experiments, we investigate the advantages of SLASH in comparison to SOTA DPPLs on the benchmark task of MNIST-Addition (Manhaeve et al., 2018). We hereby show SLASH’s increased scalability regarding computation time, as well as SLASH’s ability to handle incomplete data via true probabilistic density modelling. Next, we show that SLASH Attention provides superior results for set prediction in terms of accuracy and generalization abilities compared to a baseline slot attention encoder. With our experiments, we thus show that SLASH is a realization of “one system – two approaches” (Bengio, 2019), that can successfully be used for performing various tasks and on a variety of data types.

We make the following contributions: (1) We introduce neural-probabilistic predicates, efficiently integrating answer set programming with probabilistic inference via our novel DPPL, SLASH. (2) We successfully train neural, probabilistic and logic modules within SLASH for complex data structures end2end via a simple, single loss term. (3) We show that SLASH provides various advantages across a variety of tasks and data sets compared to state-of-the-art DPPLs and neural models.

2 Neuro-Symbolic Logic Programming

Neuro-Symbolic AI can be divided into two lines of research, depending on the starting point. Both, however, have the same final goal: to combine low-level perception with logical constraints and reasoning.

A key motivation of Neuro-Symbolic AI (d’Avila Garcez et al., 2009; Mao et al., 2019b; Hudson and Manning, 2019; d’Avila Garcez et al., 2019; Jiang and Ahn, 2020; d’Avila Garcez and Lamb, 2020) is to combine the advantages of symbolic and neural representations into a joint system. This is often done in a hybrid approach where a neural network acts as a perception module that interfaces with a symbolic reasoning system, e.g. (Mao et al., 2019b; Yi et al., 2018). The goal of such an approach is to mitigate the issues of one type of representation by means of the other, e.g. using the power of symbolic reasoning systems to handle the generalizability issues of neural networks and, conversely, handling the difficulty of noisy data for symbolic systems via neural networks. Recent work has also shown the advantage of Neuro-Symbolic approaches for explaining and revising incorrect decisions (Ciravegna et al., 2020; Stammer et al., 2021). Many of these previous works, however, train the sub-symbolic and symbolic modules separately.

Deep Probabilistic Programming Languages (DPPLs) are programming languages that combine deep neural networks with probabilistic models and allow a user to express a probabilistic model via a logical program. Similar to Neuro-Symbolic architectures, DPPLs thereby unite the advantages of different paradigms. DPPLs are related to earlier works such as Markov Logic Networks (MLNs) (Richardson and Domingos, 2006). Thereby, the binding link is the Weighted Model Counting (WMC) introduced in LPMLN (Lee and Wang, 2016). Several DPPLs have been proposed by now, among which are Pyro (Bingham et al., 2018), Edward (Tran et al., 2017), DeepProbLog (Manhaeve et al., 2018), and NeurASP (Yang et al., 2020).

To resolve the scalability issues of DeepProbLog, which uses Sentential Decision Diagrams (SDDs) (Darwiche, 2011) as the underlying data structure to evaluate queries, NeurASP (Yang et al., 2020) offers a solution by utilizing Answer Set Programming (ASP) (Dimopoulos et al., 1997; Soininen and Niemelä, 1999; Marek and Truszczyński, 1999; Calimeri et al., 2020). In this way, NeurASP changes the paradigm from query evaluation to model generation, i.e. instead of constructing an SDD or a similar knowledge representation system, NeurASP generates the set of all possible solutions (one model per solution) and estimates the probability of the truth value of each of these solutions. However, all of the DPPLs that handle learning in a relational, probabilistic setting and in an end2end fashion are limited to estimating conditional class probabilities only.

3 The SLASH Framework

In this section, we introduce our novel DPPL, SLASH. Before we dive into the details of this, it is necessary to first introduce Neural-Probabilistic Predicates, for which we require an understanding of Probabilistic Circuits. Finally, we will present the learning paradigm of SLASH.

The term probabilistic circuit (PC) (Choi et al., 2020) represents a unifying framework that encompasses all computational graphs which encode probability distributions and guarantee tractable probabilistic modelling. These include Sum-Product Networks (SPNs) (Poon and Domingos, 2011), which are deep mixture models represented via rooted directed acyclic graphs with a recursively defined structure.
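To illustrate what this tractability buys, consider the following minimal sketch (our own toy example in Python, not the circuit architecture used later in the experiments): a small mixture of fully factorized Bernoulli leaves for which joint, marginal and conditional queries can all be answered with a single bottom-up pass. Marginalizing a variable merely replaces its leaf value by 1, which is exactly the property SLASH later exploits, e.g., for handling missing data.

import numpy as np

# Toy sum-product style model over two binary variables X1, X2:
# a mixture (sum node) of K fully factorized components (product nodes).
K = 3
rng = np.random.default_rng(0)
weights = rng.dirichlet(np.ones(K))           # mixture weights of the sum node
theta = rng.uniform(0.1, 0.9, size=(K, 2))    # P(X_i = 1) per component and variable

def leaf(k, i, x):
    # Leaf distribution P(X_i = x | component k); x = None marginalizes the leaf.
    if x is None:
        return 1.0
    return theta[k, i] if x == 1 else 1.0 - theta[k, i]

def prob(x1=None, x2=None):
    # Evaluate P(X1 = x1, X2 = x2); None entries are marginalized out.
    return float(sum(w * leaf(k, 0, x1) * leaf(k, 1, x2)
                     for k, w in enumerate(weights)))

p_joint = prob(1, 0)        # P(X1 = 1, X2 = 0)
p_marg = prob(1, None)      # P(X1 = 1), X2 marginalized out
p_cond = p_joint / p_marg   # P(X2 = 0 | X1 = 1)
print(p_joint, p_marg, p_cond)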

((a)) NPPs come in various flavors depending on the data and underlying task.
((b)) Minimal SLASH program and query for set prediction.
Figure 2: (a) Depending on the data set and underlying task, SLASH requires a suitable Neural-Probabilistic Predicate (NPP) that computes query-dependent probability estimates. An NPP can be composed of neural and probabilistic modules, or (depicted via slash symbol) only one of these two. (b) A minimal SLASH program and query for the set prediction task, here only showing the NPP that models the color category per object. For the full program, we refer to the Appendix.

3.1 Neural-Probabilistic Predicates

Previous DPPLs, DeepProbLog (Manhaeve et al., 2018) and NeurASP (Yang et al., 2020), introduced the Neural Predicate as an annotated disjunction or as a propositional atom, respectively, to acquire conditional class probabilities, P(C|X), via the softmax function at the output of an arbitrary DNN. As mentioned in the introduction, this approach has certain limitations concerning inference capabilities. To resolve this issue, we introduce Neural-Probabilistic Predicates (NPPs).

Formally, we denote with

npp(h(t), [v_1, ..., v_n])     (1)

a Neural-Probabilistic Predicate h. Thereby, (i) npp is a reserved word to label an NPP, (ii) h is a symbolic name of either a PC, an NN or a joint of a PC and NN (cf. Fig. 1(a)); e.g., color_attr is the name of an NPP in Fig. 1(b). Additionally, (iii) t denotes a “term” and (iv) v_1, ..., v_n are placeholders for each of the n possible outcomes of h. For example, the placeholders for color_attr are the color attributes of an object (Red, Blue, Green, etc.).
An NPP abbreviates an arithmetic literal of the form h(t) = v_i with i in {1, ..., n}. Furthermore, we denote with Π_npp a set of NPPs of the form stated in (Eq. 1), and with r_Π the set of all rules of one NPP, which denote the possible outcomes obtained from an NPP in Π_npp, e.g. {color_attr(o1) = red, ..., color_attr(o1) = yellow} for the example depicted in Fig. 1(b).

Rules of the form

npp(h(t), [v_1, ..., v_n]) :- Body     (2)

are used as an abbreviation for the application of an NPP to multiple entities, e.g. to each of the slots for the task of set prediction (cf. Fig. 1(b)). Hereby, Body of Eq. 2 is identified with ⊤ (tautology, true) or ⊥ (contradiction, false) during grounding. Rules of the form Head :- Body with an NPP appearing in Head are prohibited for SLASH programs.

In this work, we largely make use of NPPs that contain probabilistic circuits (specifically SPNs), which allow for tractable density estimation and modelling of joint probabilities. In this way, it is possible to answer a much richer set of probabilistic queries, i.e. P(X, C), P(C|X) and P(X|C).

In addition to this, we introduce the arguably more interesting type of NPP that combines a neural module with a PC. Hereby, the neural module learns to map the raw input data into an optimal latent representation, e.g. object-based slot representations. The PC, in turn, learns to model the joint distribution of these latent variables and produces the final probability estimates. This type of NPP nicely combines the representational power of neural networks with the advantages of PCs in probability estimation and query flexibility.

To make the different probabilistic queries distinguishable in a SLASH program, we introduce the following notation. We mark a given variable with “+” and a queried variable with “-”. E.g., within the running example of set prediction (cf. Fig. 1 and 1(b)), with the query color_attr(+X, -C) one is asking for P(C|X). Similarly, with color_attr(-X, +C) one is asking for P(X|C) and, finally, with color_attr(-X, -C) for P(X, C).
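As a purely illustrative sketch of this querying mechanism, the following Python snippet wires a toy “encoder” to a small mixture model that stands in for the PC; the class ToyNPP, its shapes and all numbers are our own assumptions for illustration and not the implementation used in the paper.

import numpy as np

rng = np.random.default_rng(0)

class ToyNPP:
    # Hypothetical NPP sketch: an "encoder" maps raw input to a latent code z,
    # and a small mixture (standing in for a probabilistic circuit) models the
    # joint over latents and attribute classes c (densities left unnormalized for brevity).
    def __init__(self, n_classes=3, n_components=4, latent_dim=2):
        self.W = rng.normal(size=(8, latent_dim))             # toy linear "slot encoder"
        self.weights = rng.dirichlet(np.ones(n_components))   # mixture weights
        self.means = rng.normal(size=(n_components, latent_dim))
        self.class_probs = rng.dirichlet(np.ones(n_classes), size=n_components)

    def encode(self, x):
        return x @ self.W                                      # z = f_NN(x)

    def joint(self, z):
        # Vector of (unnormalized) joint values P(z, c) for every class c.
        comp_density = np.exp(-0.5 * ((z - self.means) ** 2).sum(-1))
        return (self.weights[:, None] * comp_density[:, None] * self.class_probs).sum(0)

    def query(self, x, mode):
        joint = self.joint(self.encode(x))
        if mode == "P(C|X)":                                   # color_attr(+X, -C)
            return joint / joint.sum()
        if mode == "P(X,C)":                                   # color_attr(-X, -C)
            return joint
        if mode == "P(X|C)":                                   # color_attr(-X, +C)
            return joint / (self.weights @ self.class_probs)   # divide by P(c)
        raise ValueError(mode)

npp = ToyNPP()
print(npp.query(rng.normal(size=8), "P(C|X)"))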

To summarize: an NPP can consist of neural and/or probabilistic modules and produces query-dependent probability estimates. Due to the flexibility of its definition, the term NPP subsumes the predicates of previous works (Manhaeve et al., 2018; Yang et al., 2020), but also the more expressive predicates discussed above. The specific “flavor” of an NPP should be chosen depending on what type of probability estimation is required (cf. Fig. 1(a)).

3.2 The SLASH Language and Semantics

Fig. 1 presents an illustration of SLASH, exemplified for the task of set prediction, with all of its key components. Having introduced the NPPs, which produce the probability estimates, we now describe how these estimates are used for answering logical queries. We begin by formally defining a SLASH program.

Definition 1.

A SLASH program Π is the union of Π_asp and Π_npp. Therewith, Π_asp is the set of propositional rules (standard rules from ASP-Core-2 (Calimeri et al., 2020)), and Π_npp is a set of Neural-Probabilistic Predicates of the form stated in Eq. 1.

Fig. 1(b) depicts a minimal SLASH program for the task of set prediction, exemplifying a set of propositional rules and neural-probabilistic predicates. Similar to NeurASP, SLASH builds on ASP and as such adopts its syntax for the most part. We therefore now address how to translate our NPPs into an ASP-compatible form so that the success probability of a logical query can be obtained from all possible solutions, and thus define SLASH’s semantics. For SLASH to translate the program into a form compatible with the ASP solver, each rule of the form (Eq. 1) is rewritten to the set of rules

{h(t) = v_1; ...; h(t) = v_n} = 1.     (3)

The ASP solver should understand this as “pick exactly one rule from the set”. After the translation is done, we can ask the ASP solver for the solutions of Π. We denote a set of ASP constraints, i.e. rules of the form “:- Body”, as a query Q (annotation), and each of the solutions of Π with respect to Q as a potential solution I (referred to as a stable model in ASP). With I|_{r_Π} we denote the projection of I onto r_Π, and with N_I the number of potential solutions of Π agreeing with I on r_Π. Because we aim to calculate the success probability of the query Q, we first formalize the probability of a potential solution.

Definition 2.

We specify the probability of a potential solution I for the program Π as the product of the probabilities of all atoms in I|_{r_Π}, divided by the number N_I of potential solutions of Π agreeing with I on r_Π:

P_Π(I) = 1/N_I · ∏_{a ∈ I|_{r_Π}} p(a).     (4)

The probability of a query is then defined as follows.

Definition 3.

The probability of the query Q given the set of potential solutions of Π is defined as

P_Π(Q) = Σ_{I ⊨ Q} P_Π(I).     (5)

Thereby, I ⊨ Q reads as “I satisfies Q”. The probability of a set of queries {Q_1, ..., Q_m} is defined as the product of the probabilities of the individual queries, i.e.

P_Π({Q_1, ..., Q_m}) = ∏_{j=1}^{m} P_Π(Q_j).     (6)
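To make Definitions 2 and 3 concrete, the following brute-force sketch (our own toy example with digits restricted to {0, 1}; in SLASH this enumeration is delegated to the ASP solver) computes the success probability of an MNIST-Addition query.

from itertools import product

# NPP outputs: P(digit(i) = d) for two images, digits restricted to {0, 1}.
p = {"i1": {0: 0.8, 1: 0.2},
     "i2": {0: 0.3, 1: 0.7}}

def solutions():
    # Each potential solution (stable model) fixes one digit per image; its
    # probability is the product of the chosen atoms' probabilities (Eq. 4,
    # with N_I = 1 since the digit atoms determine the solution uniquely here).
    for d1, d2 in product(p["i1"], p["i2"]):
        yield {"digit(i1)": d1, "digit(i2)": d2, "sum": d1 + d2}, p["i1"][d1] * p["i2"][d2]

# Success probability of the query addition(i1, i2, 1): the sum of the
# probabilities of all potential solutions satisfying it (Eq. 5).
p_query = sum(prob for sol, prob in solutions() if sol["sum"] == 1)
print(p_query)   # 0.8 * 0.7 + 0.2 * 0.3 = 0.62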

3.3 Learning in SLASH

To define the gradient calculation required for learning, let us recall that, due to SLASH’s semantics, our interest lies in rewarding the right solutions (desired success probability p_Q = 1) and penalizing wrong ones (p_Q = 0), for which we require gradient ascent. We therefore denote with Π(θ) the SLASH program under consideration, whereby θ is the set of the NPPs’ parameters associated with Π. We further assume the queries in the set Q = {Q_1, ..., Q_m} to be independent of each other, such that their joint probability factorizes as in Eq. 6.

We maximize the log-likelihood of Q under our program Π(θ), i.e., we are searching for θ* such that

θ* = argmax_θ log P_{Π(θ)}(Q) = argmax_θ Σ_{j=1}^{m} log P_{Π(θ)}(Q_j).     (7)

Referring to the probabilities of the atoms in r_Π as p, we calculate their gradients w.r.t. θ via backpropagation as

∂ log P_{Π(θ)}(Q) / ∂θ = ∂ log P_{Π(θ)}(Q) / ∂p · ∂p / ∂θ.     (8)

Here, ∂p/∂θ is computed via backward propagation through the NPPs (cf. Eq. 10 in the appendix for backpropagation with joint NPPs), and ∂ log P_{Π(θ)}(Q)/∂p is characterized by the following proposition.

Proposition 1.

Let Π(θ) be a SLASH program and let Q be a query with P_{Π(θ)}(Q) > 0. Furthermore, let p denote the probability of an atom a in r_Π, i.e. p = P(a). Then the partial derivative ∂ log P_{Π(θ)}(Q)/∂p can be expressed in closed form in terms of the probabilities P_{Π(θ)}(I) of the potential solutions I satisfying Q.

From this proposition it thus follows that the gradient for the right solution does not have to be positive. Finally, we remark that, similarly to Manhaeve et al. (2018) and Skryagin et al. (2020), we use the learning from entailment setting. Thus, for the SLASH program Π(θ) with parameters θ and a set of pairs (Q, p_Q), where Q is a query and p_Q its desired success probability, we compute the final loss function, L, as:

(9)

With this loss function we ultimately wish to maximize the estimated success probability of the queries. We remark that the defined loss function holds regardless of the NPP’s form (NN with softmax, PC, or PC jointly with NN). The only difference will be the second term, i.e. P(C|X) or P(X, C), depending on the NPP and task.

With the loss function (9) we have combined the optimization of each module, whether neural or probabilistic, into a single term that can be optimized via gradient ascent. Importantly, rather than requiring a novel loss function for each individual task and data set, with SLASH, it is possible to simply incorporate the specific requirements into the logic program. The training loss, however, remains the same. We refer to the Appendix A for further details.
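A minimal sketch of such a training step, using a toy linear NPP, a brute-force stand-in for the ASP solver, and a plain negative log-likelihood of the query as one simple instantiation of the learning-from-entailment objective (the exact loss term of Eq. 9 is not reproduced here), could look as follows in PyTorch.

import torch

# Toy NPP: a linear layer followed by softmax plays the role of the digit NPP;
# any differentiable NPP (PC, or NN jointly with PC) would plug in the same way.
npp = torch.nn.Linear(784, 10)
optimizer = torch.optim.Adam(npp.parameters(), lr=1e-3)

def query_probability(img1, img2, target_sum):
    # P(addition(img1, img2, target_sum)) via brute-force enumeration of the
    # stable models; SLASH delegates this enumeration to the ASP solver.
    p1 = torch.softmax(npp(img1), dim=-1)
    p2 = torch.softmax(npp(img2), dim=-1)
    total = 0.0
    for d1 in range(10):
        for d2 in range(10):
            if d1 + d2 == target_sum:
                total = total + p1[d1] * p2[d2]
    return total

# One training step on a single (query, label) pair with desired success probability 1.
img1, img2, label = torch.randn(784), torch.randn(784), 7
loss = -torch.log(query_probability(img1, img2, label) + 1e-12)
optimizer.zero_grad()
loss.backward()
optimizer.step()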

4 Empirical Results

The advantage of SLASH lies in the efficient integration of neural, probabilistic and symbolic computations. To emphasize this, we conduct a variety of experimental evaluations.

Experimental Details. We use two benchmark data sets, namely MNIST (LeCun et al., 1998b) for the task of MNIST-Addition, and a variant of the ShapeWorld data set (Kuhnle and Copestake, 2017) for object-centric set prediction. For all experiments, other than run-time benchmarking, we present the average and the standard deviation over five runs with different random seeds for parameter initialization.

For the ShapeWorld experiments, we generate a data set we refer to as ShapeWorld4. Images of ShapeWorld4 contain between one and four objects, with each object consisting of four attributes: a color (red, blue, green, gray, brown, magenta, cyan or yellow), a shade (bright or dark), a shape (circle, triangle or square) and a size (small or big). Thus, each object can be created from 96 different attribute combinations (8 colors × 2 shades × 3 shapes × 2 sizes). Fig. 1 depicts an example image.

We measure performance via classification accuracies in the MNIST-Addition task. In our ShapeWorld4 experiments we present the average precision. We refer to appendix B for the SLASH programs and queries of each experiment, and appendix C for a detailed description of hyperparameters and further details.

Evaluation 1: SLASH outperforms SOTA DPPLs in MNIST-Addition. The task of MNIST-Addition (Manhaeve et al., 2018) is to predict the sum of two MNIST digits, presented only as raw images. During test time, however, a model should classify the images directly. Thus, although a model does not receive explicit information about the depicted digits, it must learn to identify digits via indirect feedback on the sum prediction.

We compare the test accuracy after convergence and the running time per epoch (RTE) between three DPPLs: DeepProbLog (Manhaeve et al., 2018), NeurASP (Yang et al., 2020) and SLASH, using a probabilistic circuit (PC) or a deep neural network (DNN) as NPP. Notably, the DNN used in SLASH (DNN) is the LeNet5 model (LeCun et al., 1998a) of DeepProbLog and NeurASP. We note that when using the PC as NPP, we have also extracted conditional class probabilities P(C|X), by marginalizing the class variables to acquire the normalization constant P(X) from the joint P(X, C), and calculating P(C|X) = P(X, C) / P(X).
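Concretely, this normalization is a log-sum-exp over the class dimension of the PC’s joint output; a short sketch of the computation (our own illustration, independent of the EiNet implementation):

import torch

# log P(X, C) for a batch of inputs and 10 digit classes, e.g. as returned by
# a PC that models the joint over an image and its class variable.
log_joint = torch.randn(32, 10)

# Normalization constant log P(X): marginalize (sum out) the class variable.
log_evidence = torch.logsumexp(log_joint, dim=1, keepdim=True)

# Conditional class probabilities P(C|X) = P(X, C) / P(X).
probs = (log_joint - log_evidence).exp()    # each row sums to 1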

((a)) Baseline MNIST Addition: test accuracy (%) and RTE for DeepProbLog, NeurASP, SLASH (PC) and SLASH (DNN).
((b)) Missing data MNIST Addition: test accuracy (%) of DeepProbLog and SLASH (PC) at 50%, 80%, 90% and 97% missing pixels.
Table 1: MNIST Addition Results. Test accuracy corresponds to the percentage of correctly classified test images. (a) Test accuracies in percent and run time per epoch (RTE) for the MNIST Addition task with various DPPLs, including SLASH with an NPP that models the joint probabilities (SLASH (PC)) and one that models only conditional probabilities (SLASH (DNN)). (b) Test accuracies in percent for the MNIST Addition task with missing data, comparing DeepProbLog with SLASH (PC). The amount of missing data was varied between 50% and 97% of the pixels per image.

The results can be seen in Tab. 1(a). We observe that training SLASH with a DNN NPP produces SOTA accuracies compared to DeepProbLog and NeurASP, confirming that SLASH’s batch-wise loss computation leads to improved performance. We further observe that the test accuracy of SLASH with a PC NPP is slightly below that of the other DPPLs; however, we argue that this may be because a PC, in comparison to a DNN, learns a true mixture density rather than just conditional probabilities. The advantages of doing so are investigated in the next experiments. Note that optimal architecture search for PCs, e.g. for computer vision, remains an open research question.

An important practical result can be found when comparing the RTE measurements. Particularly, even though it uses the same DNN as the other DPPLs, SLASH (DNN) is approx. 20 times faster than DeepProbLog and three times faster than NeurASP. Also, SLASH (PC) is faster than both DeepProbLog and NeurASP, although it internally performs exact probabilistic inference. These differences in computation time are the result of SLASH’s parallel calls to the ASP solver. These evaluations show SLASH’s superiority on the benchmark MNIST-Addition task.

Evaluation 2: Handling Missing Data with SLASH. SLASH offers the advantage of its flexibility to use various kinds of NPPs. Thus, in comparison to previous DPPLs, one can easily integrate NPPs into SLASH that perform joint probability estimation. For this evaluation, we consider the task of MNIST-Addition with missing data. We trained SLASH (PC) and DeepProbLog on the MNIST-Addition task with images in which a percentage of the pixels per image has been removed. It is important to mention here that, whereas DeepProbLog handles the missing data simply as background pixels, SLASH (PC) specifically models the missing data as uncertain data by marginalizing the denoted pixels at inference time. We use DeepProbLog here as a representative of DPPLs without true density estimation.
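The following sketch shows the underlying idea on a toy, fully factorized Gaussian leaf layer: a missing pixel’s leaf is simply integrated out and contributes a factor of one to the density, so no imputation is required. This is our own illustration of the principle; the EiNet used in the experiments realizes the same marginalization within its circuit structure.

import math
import torch

def log_density_with_missing(x, mask, means, log_stds):
    # Log-density under a fully factorized Gaussian leaf layer. mask[i] == 0
    # marks pixel i as missing; its (normalized) leaf is integrated out and
    # thus contributes a log-density of 0 to the sum.
    var = (2 * log_stds).exp()
    log_leaf = -0.5 * (x - means) ** 2 / var - log_stds - 0.5 * math.log(2 * math.pi)
    return (mask * log_leaf).sum()

pixels = torch.rand(784)
mask = (torch.rand(784) > 0.9).float()      # keep only ~10% of the pixels
print(log_density_with_missing(pixels, mask, torch.zeros(784), torch.zeros(784)))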

The results can be seen in Tab. 1(b) for 50%, 80%, 90% and 97% missing pixels per image. We observe that at 50%, DeepProbLog and SLASH produce almost equal accuracies. At 80 percent missing pixels, there is a substantial difference in the ability of the two DPPLs to correctly classify images, with SLASH being very stable. By further increasing the percentage of missing pixels, this difference becomes even more substantial, with SLASH still reaching a high test accuracy even when 97% of the pixels per image are missing, whereas DeepProbLog degrades to a much lower average test accuracy. We further note that SLASH, in comparison to DeepProbLog, produces largely reduced standard deviations over runs. Thus, by utilizing the power of true density estimation, SLASH, with an appropriate NPP, can produce more robust results in comparison to other DPPLs.

Figure 3: ShapeWorld4 Experiments. (a) Converged test average precision scores for the set prediction task with ShapeWorld4 (top) and ShapeWorld4 CoGenT (bottom). (b) Test average precision scores for set prediction with ShapeWorld4 over the training epochs. In these experiments we compared a baseline slot encoder versus SLASH Attention with slot attention and PC-based NPPs. For the CoGenT experiments, a model is trained on one training set and tested on two separate test conditions. The Condition A test set contains attribute compositions which were also seen during training. The Condition B test set contains attribute compositions which were not seen during training, e.g. yellow circles were not present in the training set, but present in Condition B test set.

Evaluation 3: Improved Concept Learning via SLASH. We show that SLASH can be very effective for the complex task of set prediction, which previous DPPLs have not tackled. We revert to the ShapeWorld4 data set for this setting.

For set prediction, a model is trained to predict the discrete attributes of a set of objects in an image (cf. Fig. 1 for an example ShapeWorld4 image). The difficulty for the model lies in the fact that it must match an unordered set of corresponding attributes (with a varying number of entities across samples) with its internal representations of the image.

The slot attention module introduced by Locatello et al. (2020) allows for an attractive object-centric approach to this task. Specifically, this module represents a pluggable, differentiable module that can easily be added to any architecture and, through a competitive softmax-based attention mechanism, can enforce the binding of specific parts of a latent representation into permutation-invariant, task-specific vectors, called slots.
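For reference, a condensed sketch of the slot attention iteration following the description in Locatello et al. (2020); layer norms, the residual MLP and further details of the original module are omitted, and all shapes and hyperparameters are illustrative only. In SLASH Attention, the resulting slot vectors are the latent representations whose joint with the attribute classes is modelled by the PCs of the NPPs.

import torch
import torch.nn as nn

class SlotAttentionSketch(nn.Module):
    # Condensed slot attention step: slots compete for input features via a
    # softmax over slots, then each slot is updated with a GRU cell.
    def __init__(self, dim=32, n_slots=4, n_iters=3):
        super().__init__()
        self.n_slots, self.n_iters, self.scale = n_slots, n_iters, dim ** -0.5
        self.slots_mu = nn.Parameter(torch.randn(1, 1, dim))
        self.slots_sigma = nn.Parameter(torch.ones(1, 1, dim))
        self.to_q, self.to_k, self.to_v = (nn.Linear(dim, dim) for _ in range(3))
        self.gru = nn.GRUCell(dim, dim)

    def forward(self, inputs):                                  # (batch, n_inputs, dim)
        b, n, d = inputs.shape
        slots = self.slots_mu + self.slots_sigma * torch.randn(b, self.n_slots, d)
        k, v = self.to_k(inputs), self.to_v(inputs)
        for _ in range(self.n_iters):
            q = self.to_q(slots)
            attn = torch.softmax(torch.einsum('bid,bjd->bij', q, k) * self.scale, dim=1)
            attn = attn / attn.sum(dim=-1, keepdim=True)        # weights per slot
            updates = torch.einsum('bij,bjd->bid', attn, v)     # weighted mean of inputs
            slots = self.gru(updates.reshape(-1, d), slots.reshape(-1, d)).reshape(b, -1, d)
        return slots                                            # (batch, n_slots, dim)

slots = SlotAttentionSketch()(torch.randn(2, 16, 32))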

In our experiments, we wish to show that by adding logical constraints to the training setting, one can improve the overall performances and generalization properties of such a model. For this, we train SLASH with NPPs as depicted in Fig. 1 consisting of a shared slot encoder and separate PCs, each modelling the mixture of latent slot variables and the attributes of one category, e.g. color. For ShapeWorld4, we thereby have altogether four NPPs. SLASH is trained via queries of the kind exemplified in Fig. 7 in the Appendix. We refer to this configuration as SLASH Attention.

We compare SLASH Attention to a baseline slot attention encoder using an MLP and a Hungarian loss for predicting the object properties from the slot encodings, as in Locatello et al. (2020). The results of these experiments can be found in Fig. 3a (top). We observe that the average precision after convergence on the held-out test set is greatly improved with SLASH Attention compared to the baseline model. Additionally, in Fig. 3b we observe that SLASH Attention reaches the average precision of the baseline model in far fewer epochs. Thus, we can summarize that adding logical knowledge to the training procedure via SLASH can greatly improve the capabilities of a neural module for set prediction.

Evaluation 4: Improved Compositional Generalization with SLASH. To test the hypothesis that SLASH Attention possesses improved generalization properties in comparison to the baseline model, we ran experiments on a variant of ShapeWorld4 similar to the CLEVR Compositional Generalization Test (CoGenT) (Johnson et al., 2017). The goal of CoGenT is to investigate a model’s ability to handle novel combinations of attributes that were not seen during training.

For this purpose, we established two conditions within a ShapeWorld4 CoGenT data set: Condition (A) – the training and test data set contains squares with the colors gray, blue, brown, or yellow, triangles with the colors red, green, magenta, or cyan and circles of all colors. Condition (B) – the training set is as in Condition (A). However, the test set contains squares with the colors red, green, magenta, or cyan, triangles with the colors gray, blue, brown, or yellow and circles of all colors. The goal is to investigate how well a model can generalize that, e.g., also squares can have the color red, although never having seen evidence for this during training.

The resulting average precision test scores are presented in Fig. 3a (bottom). We observe that, even though the SLASH program used for this experiment was not explicitly written to handle compositional generalization, SLASH Attention shows greatly improved generalization capabilities. This can be seen in the considerably higher average precision scores on the Condition (B) test set in comparison to the baseline model. Importantly, this trend still holds even when subtracting the higher precision scores observed in Condition (A).

To summarize our findings from the experiments on set prediction: we observe that adding prior knowledge in the form of logical constraints via SLASH can greatly improve a neural module in terms of performance and generalizability. On a side note: training neural networks for novel tasks often involves defining explicit loss functions, e.g. a Hungarian loss for set prediction. In contrast, with SLASH the training loss remains the same no matter the choice of NPP and underlying task; task-related requirements simply need to be added as lines of code to the SLASH program. This additionally highlights SLASH’s versatility and flexibility.

Summary of all Empirical Results. All empirical results together clearly demonstrate that the flexibility of SLASH is highly beneficial and can easily outperform the state of the art: one can freely combine what is required to solve the underlying task, whether (deep) neural networks, probabilistic circuits, or logic.

5 Conclusion and Future Work

We introduced SLASH, a novel DPPL that integrates neural computations with tractable probability estimates and logical statements. The key ingredients of SLASH for achieving this are Neural-Probabilistic Predicates (NPPs), which can be flexibly constructed out of neural and/or probabilistic circuit modules based on the data and underlying task. With these NPPs, one can produce task-specific probability estimates. The details and additional prior knowledge of a task are neatly encompassed within a SLASH program with only a few lines of code. Finally, via Answer Set Programming and Weighted Model Counting, the logical SLASH program and the probability estimates from the NPPs are combined to estimate the truth value of a task-specific query. Our experiments show the power and efficiency of SLASH, improving upon previous DPPLs on the benchmark MNIST-Addition task in terms of performance, efficiency and robustness. Importantly, by integrating a SOTA slot attention encoder into NPPs and adding a few logical constraints, SLASH demonstrates improved performance and generalizability in comparison to the pure slot encoder for the task of object-centric set prediction, a setting no DPPL has tackled yet. This shows the great potential of DPPLs to elegantly combine logical reasoning with neural computations and uncertainty estimates, and represents an immensely flexible and attractive path for future research.

Interesting avenues for future work include benchmarking SLASH on additional data types and tasks. One should explore unsupervised and weakly supervised learning using logic with SLASH and investigate how far logical constraints can help unsupervised object discovery. In direct alignment with our work, one should also investigate image generation via the beneficial feature of PCs to generate random samples; in particular, it should be possible to generate images that are consistent with a logical knowledge base. This is important for moving from data-rich to knowledge-rich AI.

Ethics Statement

With our work, we have shown that one can add prior knowledge and logical constraints to the training of learning systems. We postulate that SLASH can therefore additionally be used to identify and remove biases or undesirable behavior, by adding constraints within the SLASH program. We observe that this feature, however, also has the potential danger to be used in the opposite way, e.g. explicitly adding bias and discriminatory factors to a system. To the best of our knowledge, our study does not raise any ethical, privacy or conflict of interest concerns.

Acknowledgments

This work has been supported by ICT-48 Network of AI Research Excellence Center “TAILOR” (EU Horizon 2020, GA No 952215), the BMEL/BLE under the innovation support program, project “AuDiSens” (FKZ28151NA187) as well as the German Federal Ministry of Education and Research and the Hessian Ministry of Science and the Arts within the National Research Center for Applied Cybersecurity “ATHENE”. Additionally, it has benefited from the Hessian research priority program LOEWE within the project WhiteBox, the HMWK cluster project “The Third Wave of AI” and the Collaboration Lab “AI in Construction” (AICO).

References

  • Y. Bengio (2019) From System 1 Deep Learning to System 2 Deep Learning. Invited talk, NeurIPS. Cited by: §1.
  • E. Bingham, J. P. Chen, M. Jankowiak, F. Obermeyer, N. Pradhan, T. Karaletsos, R. Singh, P. Szerlip, P. Horsfall, and N. D. Goodman (2018) Pyro: Deep Universal Probabilistic Programming. Journal of Machine Learning Research. Cited by: §1, §2.
  • C. P. Burgess, L. Matthey, N. Watters, R. Kabra, I. Higgins, M. Botvinick, and A. Lerchner (2019) Monet: unsupervised scene decomposition and representation. arXiv preprint arXiv:1901.11390. Cited by: §1.
  • F. Calimeri, W. Faber, M. Gebser, G. Ianni, R. Kaminski, T. Krennwallner, N. Leone, M. Maratea, F. Ricca, and T. Schaub (2020) ASP-core-2 input language format. In Theory and Practice of Logic Programming, Cited by: §2, Definition 1.
  • Y. Choi, A. Vergari, and G. Van den Broeck (2020) Probabilistic circuits: a unifying framework for tractable probabilistic models. Technical report. Cited by: §3.
  • G. Ciravegna, F. Giannini, M. Gori, M. Maggini, and S. Melacci (2020) Human-driven fol explanations of deep learning.. In IJCAI, pp. 2234–2240. Cited by: §1, §2.
  • W. Clocksin and C. S. Mellish (2003) Programming in prolog. Springer-Verlag Berlin Heidelberg. Cited by: §1.
  • A. Colmerauer and P. Roussel (1996) The birth of prolog. In History of programming languages—II, Cited by: §1.
  • A. d’Avila Garcez and L. C. Lamb (2020) Neurosymbolic AI: the 3rd wave. CoRR abs/2012.05876. Cited by: §1, §2.
  • A. S. d’Avila Garcez, M. Gori, L. C. Lamb, L. Serafini, M. Spranger, and S. N. Tran (2019) Neural-symbolic computing: an effective methodology for principled integration of machine learning and reasoning. FLAP 6 (4), pp. 611–632. Cited by: §1, §2.
  • A. S. d’Avila Garcez, L. C. Lamb, and D. M. Gabbay (2009) Neural-symbolic cognitive reasoning. Cognitive Technologies, Springer. Cited by: §2.
  • A. Darwiche (2011) SDD: A New Canonical Representation of Propositional Knowledge Bases. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence (IJCAI), pp. 819–826. Note: http://reasoning.cs.ucla.edu/fetch.php?id=121&type=pdf Cited by: §2.
  • Y. Dimopoulos, B. Nebel, and J. Koehler (1997) Encoding planning problems in nonmonotonic logic programs. In European Conference on Planning, pp. 169–181. Cited by: §2.
  • M. Engelcke, A. R. Kosiorek, O. P. Jones, and I. Posner (2020) GENESIS: generative scene inference and sampling with object-centric latent representations. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, Cited by: §1.
  • J. A. Fodor and Z. W. Pylyshyn (1988) Connectionism and cognitive architecture: a critical analysis. Cognition 28 (1-2), pp. 3–71. Cited by: §1.
  • K. Greff, R. L. Kaufman, R. Kabra, N. Watters, C. Burgess, D. Zoran, L. Matthey, M. Botvinick, and A. Lerchner (2019) Multi-object representation learning with iterative variational inference. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, pp. 2424–2433. Cited by: §1.
  • K. Greff, S. van Steenkiste, and J. Schmidhuber (2020) On the binding problem in artificial neural networks. CoRR abs/2012.05208. Cited by: §1.
  • D. A. Hudson and C. D. Manning (2019) Learning by abstraction: the neural state machine. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 5901–5914. Cited by: §1, §2.
  • J. Jiang and S. Ahn (2020) Generative neurosymbolic machines. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.), Cited by: §1, §1, §2.
  • J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. Lawrence Zitnick, and R. Girshick (2017) CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.
  • D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), Cited by: §C.3.
  • A. Kuhnle and A. Copestake (2017) ShapeWorld: a new test methodology for multimodal language understanding. In arXiv, Cited by: §C.1, §4.
  • Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998a) Gradient-Based Learning Applied To Document Recognition. In IEEE, pp. 2278–2324. Cited by: §4.
  • Y. LeCun, C. Cortes, and C. J.C. Burges (1998b) The MNIST database of handwritten digits. Cited by: §4.
  • J. Lee and Y. Wang (2016) Weighted Rules under the Stable Model Semantics. In AAAI, Cited by: §2.
  • Z. Lin, Y. Wu, S. V. Peri, W. Sun, G. Singh, F. Deng, J. Jiang, and S. Ahn (2020) SPACE: unsupervised object-oriented scene representation via spatial attention and decomposition. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, Cited by: §1.
  • F. Locatello, D. Weissenborn, T. Unterthiner, A. Mahendran, G. Heigold, J. Uszkoreit, A. Dosovitskiy, and T. Kipf (2020) Object-centric learning with slot attention. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 11525–11538. Cited by: §C.2, §C.3, §1, §4, §4.
  • R. Manhaeve, S. Dumančić, A. Kimmig, T. Demeester, and L. De Raedt (2018) DeepProbLog: Neural Probabilistic Logic Programming. In NeurIPS, pp. 3753–3763. Cited by: §C.3, §1, §1, §2, §3.1, §3.1, §3.3, §4, §4.
  • J. Mao, C. Gan, P. Kohli, J. B. Tenenbaum, and J. Wu (2019a) The neuro-symbolic concept learner: interpreting scenes, words, and sentences from natural supervision. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, Cited by: §1.
  • J. Mao, C. Gan, P. Kohli, J. B. Tenenbaum, and J. Wu (2019b) The neuro-symbolic concept learner: interpreting scenes, words, and sentences from natural supervision. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, Cited by: §1, §2.
  • G. F. Marcus (2019) The algebraic mind: integrating connectionism and cognitive science. MIT press. Cited by: §1.
  • V. W. Marek and M. Truszczyński (1999) Stable models and an alternative logic programming paradigm. In The Logic Programming Paradigm, pp. 375–398. Cited by: §2.
  • R. Peharz, S. Lang, A. Vergari, K. Stelzner, A. Molina, M. Trapp, G. Van den Broeck, K. Kersting, and Z. Ghahramani (2020) Einsum networks: fast and scalable learning of tractable probabilistic circuits. In Proceedings of the 37th International Conference on Machine Learning (ICML), Cited by: §C.3.
  • H. Poon and P. Domingos (2011) Sum-Product Networks: A New Deep Architecture. In UAI, pp. 337–346. Cited by: §C.3, §C.3, §3.
  • M. Richardson and P. Domingos (2006) Markov logic networks. Machine Learning 68, pp. 107–136. Cited by: §2.
  • A. Skryagin, K. Stelzner, A. Molina, F. Ventola, and K. Kersting (2020) SPLog: sum-product logic. In In International Conference on Probabilistic Programming, Cited by: §3.3.
  • T. Soininen and I. Niemelä (1999) Developing a declarative rule language for applications in product configuration. In International Symposium on Practical Aspects of Declarative Languages, pp. 305–319. Cited by: §2.
  • W. Stammer, P. Schramowski, and K. Kersting (2021) Right for the right concept: revising neuro-symbolic concepts by interacting with their explanations. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pp. 3619–3629. Cited by: §1, §1, §2.
  • D. Tran, M. D. Hoffman, R. A. Saurous, E. Brevdo, K. Murphy, and D. M. Blei (2017) Deep Probabilistic Programming. In ICLR, Cited by: §1, §2.
  • Z. Yang, A. Ishay, and J. Lee (2020) NeurASP: embracing neural networks into answer set programming. In IJCAI, Cited by: §C.3, §1, §2, §2, §3.1, §3.1, §4.
  • K. Yi, J. Wu, C. Gan, A. Torralba, P. Kohli, and J. Tenenbaum (2018) Neural-symbolic VQA: disentangling reasoning from vision and language understanding. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 1039–1050. Cited by: §1, §1, §2.
  • Y. Zhang, J. S. Hare, and A. Prügel-Bennett (2019) Deep set prediction networks. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 3207–3217. Cited by: §1.

Appendix A Appendix A – Details on Parameter Learning

In this appendix, we discuss details on parameter learning in SLASH. We consider the setting of learning from entailment, i.e., for the given SLASH program Π(θ) with parameters θ and a set of pairs (Q, p_Q), where Q is a query and p_Q its desired success probability, we compute for a loss function L:

To this end, we defined the loss function in Eq. 9 as

Since the data associated with the query is forwarded to an NPP, so that SLASH obtains the success probability of the query using the output probabilities from this NPP, we can formally write

where the last equality follows the notion of the set of classes and its cardinality. Furthermore, we remark that the derived loss function also applies to the case of a SLASH program entailing multiple NPPs. For example, the program for ShapeWorld4 entails four of them; see Figure 6 for details.

Finally, we note that the mathematical transformations listed above hold for any type of NPP and task-dependent query (NN with softmax, PC, or PC jointly with NN). The only difference will be the second term, i.e. P(C|X) or P(X, C), depending on the NPP and task. The NPP in the form of a single PC was used as the example here.

Backpropagation for joint NN and PC NPPs: If, within the SLASH program Π(θ), the NPP forwards the data tensor through an NN first, i.e., the NPP models a joint distribution over the NN’s output variables z by a PC, then we rewrite (8) to

∂ log P_{Π(θ)}(Q) / ∂θ_NN = ∂ log P_{Π(θ)}(Q) / ∂p · ∂p / ∂z · ∂z / ∂θ_NN.     (10)

Thereby, θ_NN is the set of the NN’s parameters and ∂z/∂θ_NN is computed by the backward propagation through the NN.

Appendix B Appendix B – SLASH Programs

Here, the interested reader will find the SLASH programs which we compiled for our experiments. Figure 4 presents the program for the MNIST-Addition task, and Figure 6 the program for the set prediction task with the slot attention encoder and the subsequent CoGenT test. Note the use of the “+” and “-” notation for indicating whether a random variable is given or being queried for.

# Define images
img(i1). img(i2).
# Define Neural-Probabilistic Predicate
npp(digit(1, X), [0,1,2,3,4,5,6,7,8,9]) :- img(X).
# Define the addition of digits given two images and the resulting sum
addition(A, B, N) :- digit(+A, -N1), digit(+B, -N2), N = N1 + N2.
Figure 4: SLASH Program for MNIST addition. The same program was used for the training with missing data.
# Is 7 the sum of the digits in img1 and img2?
:- addition(image_id1, image_id2, 7)
Figure 5: Example SLASH Query for MNIST addition. The same type of query was used for the training with missing data.
# Define slots
slot(s1). slot(s2). slot(s3). slot(s4).
# Define identifiers for the objects in the image
# (there are up to four objects in one image).
obj(o1). obj(o2). obj(o3). obj(o4).
# Assign each slot to an object identifier
{assign_one_slot_to_one_object(X, O): slot(X)}=1 :- obj(O).
# Make sure the matching is one-to-one between slots
# and object identifiers.
:- assign_one_slot_to_one_object(X1, O1),
   assign_one_slot_to_one_object(X2, O2),
   X1==X2, O1!=O2.
# Define all Neural-Probabilistic Predicates
npp(color_attr(1, X), [red, blue, green, grey, brown,
             magenta, cyan, yellow, bg]) :- slot(X).
npp(shape_attr(1, X), [circle, triangle, square, bg]) :- slot(X).
npp(shade_attr(1, X), [bright, dark, bg]) :- slot(X).
npp(size_attr(1, X), [big,small,bg]) :- slot(X).
# Object O has the attributes C and S and H and Z if ...
has_attributes(O, C, S, H, Z) :- slot(X), obj(O),
                 assign_one_slot_to_one_object(X, O),
                 color_attr(+X, -C), shape_attr(+X, -S),
                 shade_attr(+X, -H), size_attr(+X, -Z).
Figure 6: SLASH Program for ShapeWorld4. The same program was used for the CoGenT experiments.
# Does object o1 have the attributes red, circle, bright, small?
:- has_attributes(o1, red, circle, bright, small)
Figure 7: Example SLASH Query for ShapeWorld4 experiments. In other words, this query corresponds to asking SLASH: “Is object 1 a small, bright red circle?”.

Appendix C Appendix C – Experimental Details

c.1 ShapeWorld4 Generation

The ShapeWorld4 and ShapeWorld4 CoGenT data sets were generated using the original scripts of (Kuhnle and Copestake, 2017) (https://github.com/AlexKuhnle/ShapeWorld). The exact scripts will be added together with the SLASH source code.

c.2 Average Precision computation (ShapeWorld4)

For the baseline slot encoder experiments on ShapeWorld4 we measured the average precision score as in Locatello et al. (2020). In comparison to the baseline slot encoder, when applying SLASH Attention, however, we handled the case of a slot not containing an object, e.g. only background variables, differently. Whereas Locatello et al. (2020) add an additional binary identifier to the multi-label ground truth vectors, we have added a background (bg) attribute to each category (cf. Fig. 6). A slot is thus considered to be empty (i.e. not containing an object) if each NPP returns a high conditional probability for the bg attribute.
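A sketch of how such an emptiness check can be phrased (the function name, attribute ordering and threshold are our own choices for illustration):

def slot_is_empty(npp_conditionals, bg_index=-1, threshold=0.5):
    # Treat a slot as empty if every category NPP assigns a high conditional
    # probability to its bg attribute. npp_conditionals maps a category name
    # to that category's probability vector (bg stored at bg_index).
    return all(p[bg_index] > threshold for p in npp_conditionals.values())

example = {"color": [0.05, 0.05, 0.90],    # last entry is the bg attribute
           "shape": [0.10, 0.10, 0.80]}
print(slot_is_empty(example))              # True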

c.3 Model Details

For those experiments using NPPs with a PC, we have used Einsum Networks (EiNets) to implement the probabilistic circuits. EiNets are an implementation design for SPNs introduced by Peharz et al. (2020) that mitigates the computational cost issues from which earlier SPN implementations suffered. This is accomplished by combining several arithmetic operations into a single monolithic einsum operation.
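The core operation can be pictured as a single einsum that fuses the product and sum nodes of one layer; the following toy sketch is our own illustration and does not use the EinsumNetworks API (which, among other things, additionally works in the log domain for numerical stability).

import torch

# Toy einsum layer: each output sum node o mixes all K x K pairwise products
# of its two child regions' values with normalized weights, in one einsum call.
batch, K, n_out = 8, 4, 5
left = torch.rand(batch, K)                           # left child region values
right = torch.rand(batch, K)                          # right child region values
w = torch.rand(n_out, K, K)
w = w / w.sum(dim=(1, 2), keepdim=True)               # normalized sum-node weights

out = torch.einsum('bi,bj,oij->bo', left, right, w)   # product + sum nodes fused
print(out.shape)                                      # torch.Size([8, 5])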

For all experiments, the ADAM optimizer (Kingma and Ba, 2015) was used, with no weight decay.

MNIST-Addition Experiments. For the MNIST-Addition experiments, we ran the DeepProbLog and NeurASP programs with their original configurations, as stated in (Manhaeve et al., 2018) and (Yang et al., 2020), respectively. For the SLASH MNIST-Addition experiments, we have used the same neural module as in DeepProbLog and NeurASP when training SLASH with the neural NPP (SLASH (DNN)), represented in Tab. 2. When using a PC NPP (SLASH (PC)), we have used an EiNet with the Poon-Domingos (PD) structure (Poon and Domingos, 2011) and normal distributions for the leaves. The hyperparameters for the EiNet are depicted in Tab. 3.

The learning rate and batch size for the DNN were 0.005 and 100 for DeepProbLog, NeurASP and SLASH (DNN). For the EiNet, these were 0.01 and 100.

Type Size/Channels Activation Comment
Encoder - - -
Conv 5 x 5 1x28x28 - stride 1
MaxPool2d 6x24x24 ReLU kernel size 2, stride 2
Conv 5 x 5 6x12x12 - stride 1
MaxPool2d 16x8x8 ReLU kernel size 2, stride 2
Classifier - - -
MLP 16x4x4,120 ReLU -
MLP 120,84 ReLU -
MLP 84,10 - Softmax
Table 2: Neural module – LeNet5 for MNIST-Addition experiments.
Variables Width Height Number of Pieces Class count
784 28 28 [4,7,28] 10
Table 3: Probabilistic Circuit module – EiNet for MNIST-Addition experiments.
Type Size/Channels Activation Comment
Conv 5 x 5 32 ReLU stride 1
Conv 5 x 5 32 ReLU stride 1
Conv 5 x 5 32 ReLU stride 1
Conv 5 x 5 32 ReLU stride 1
Position Embedding - - -
Flatten axis: [0, 1, 2 x 3] - flatten x, y pos.
Layer Norm - - -
MLP (per location) 32 ReLU -
MLP (per location) 32 - -
Slot Attention Module 32 ReLU -
MLP 32 ReLU -
MLP 16 Sigmoid -
Table 4: Baseline slot encoder for ShapeWorld4 experiments.

ShapeWorld4 Experiments. For the baseline slot attention experiments with the ShapeWorld4 data set, we have used the architecture presented in Tab. 4. For further details, we refer to the original work of Locatello et al. (2020). The slot encoder used 4 slots and 3 attention iterations in all experiments.

For the SLASH Attention experiments with ShapeWorld4, we have used the same slot encoder as in Tab. 4; however, we replaced the final MLPs with 4 individual EiNets with the Poon-Domingos structure (Poon and Domingos, 2011). Their hyperparameters are given in Tab. 5.

EiNet Variables Width Height Number of Pieces Class count
Color 32 8 4 [4] 9
Shape 32 8 4 [4] 4
Shade 32 8 4 [4] 3
Size 32 8 4 [4] 3
Table 5: Probabilistic Circuit module – EiNet for ShapeWorld4 experiments.

The learning rate and batch size for SLASH Attention were 0.01 and 512 for both ShapeWorld4 and ShapeWorld4 CoGenT. The learning rate and batch size for the baseline slot encoder were 0.0004 and 512.