1 Introduction
In recent years, Neuro-Symbolic AI approaches to learning (Hudson and Manning, 2019; d’Avila Garcez et al., 2019; Jiang and Ahn, 2020; d’Avila Garcez and Lamb, 2020), which integrate low-level perception with high-level reasoning by combining data-driven neural modules with logic-based symbolic modules, have gained traction. This combination of sub-symbolic and symbolic systems has been shown to have several advantages for various tasks such as visual question answering and reasoning (Yi et al., 2018), concept learning (Mao et al., 2019b) and improved properties for explainable and revisable models (Ciravegna et al., 2020; Stammer et al., 2021).
Rather than designing specifically tailored Neuro-Symbolic architectures, where the neural and symbolic modules are often disjoint and trained independently (Yi et al., 2018; Mao et al., 2019a; Stammer et al., 2021), deep probabilistic programming languages (DPPLs) provide an exciting alternative (Bingham et al., 2018; Tran et al., 2017; Manhaeve et al., 2018; Yang et al., 2020). Specifically, DPPLs integrate neural and symbolic modules via a unifying programming framework, with probability estimates acting as the “glue” between separate modules. This allows for reasoning over noisy, uncertain data and, importantly, joint training of the modules. Additionally, prior knowledge and biases in the form of logical rules can easily be added with DPPLs, rather than creating implicit architectural biases, thereby integrating neural networks into downstream logical reasoning tasks.
Object-centric deep learning has recently brought forth several exciting avenues of research by introducing inductive biases to neural networks to extract objects from visual scenes in an unsupervised manner (Zhang et al., 2019; Burgess et al., 2019; Engelcke et al., 2020; Greff et al., 2019; Lin et al., 2020; Locatello et al., 2020; Jiang and Ahn, 2020). We refer to Greff et al. (2020) for a detailed overview. A motivation for this line of investigation, which notably has been around for a longer period of time (Fodor and Pylyshyn, 1988; Marcus, 2019), is that objects occur as natural building blocks in human perception and possess advantageous properties for many cognitive tasks, such as scene understanding and reasoning. With a DPPL, these advancements can be improved by integrating the previously mentioned components into the DPPL’s programming framework and further adding constraints about objects and their properties in the form of logical statements, e.g. about color singularity, rather than implicitly enforcing this via one-hot encodings.
We propose SLASH, a novel DPPL that, similar to the punctuation symbol, can be used to efficiently combine several paradigms into one. Specifically, SLASH represents a scalable programming language that seamlessly integrates probabilistic logical programming with neural representations and tractable probabilistic estimations. Fig. 1 shows an example instantiation of SLASH, termed SLASH Attention, for object-centric set prediction. SLASH consists of several key building blocks. Firstly, it makes use of Neural-Probabilistic Predicates (NPPs) for probability estimation. NPPs consist of neural and/or probabilistic circuit (PC) modules and act as a unifying term, encompassing the neural predicates of DeepProbLog and NeurASP, as well as purely probabilistic predicates. In this work, we introduce a much more powerful “flavor” of NPPs that consist jointly of neural and PC modules, taking advantage of the power of neural computations together with true density estimation of PCs. Depending on the underlying task one can thus ask a range of queries to the NPP, e.g. sample an unknown, desired variable, but also query for conditional class probabilities. Example NPPs consisting of a slot attention encoder and several PCs are depicted in Fig. 1 for the task of set prediction. The slot encoder is shared across all NPPs, whereas the PC of each NPP models a separate category of attributes. In this way, each NPP models the joint distribution over slot encodings and object attribute values, such as the color of an object. By querying the NPP, one can obtain task-related probability estimations, such as the conditional attribute probability.
The second component of SLASH is the logical program, which consists of a set of facts and logical statements defining the state of the world of the underlying task. For example, one can define the rules for when an object possesses a specific set of attributes (cf. Fig. 1). Thirdly, an ASP module is used to combine the first two components. Given a logical query about the input data, the logical program and the probability estimates obtained from the NPP(s), the ASP module produces a probability estimate about the truth value of the query, stating, e.g., how likely it is for a specific object in an image to be a large, dark red triangle. In contrast to query evaluation in Prolog (Colmerauer and Roussel, 1996; Clocksin and Mellish, 2003), which may lead to an infinite loop, many modern answer set solvers use Conflict-Driven Clause Learning (CDCL) which, in principle, always terminates.
Training in SLASH is performed efficiently in a batch-wise and end-to-end fashion, by integrating the parameters of all modules, neural and probabilistic, into a single loss term. SLASH thus allows a simple, fast and effective integration of sub-symbolic and symbolic computations. In our experiments, we investigate the advantages of SLASH in comparison to SOTA DPPLs on the benchmark task of MNIST-Addition (Manhaeve et al., 2018). We hereby show SLASH’s increased scalability regarding computation time, as well as SLASH’s ability to handle incomplete data via true probabilistic density modelling. Next, we show that SLASH Attention provides superior results for set prediction in terms of accuracy and generalization abilities compared to a baseline slot attention encoder. With our experiments, we thus show that SLASH is a realization of “one system – two approaches” (Bengio, 2019), which can successfully be used for performing various tasks and on a variety of data types.
We make the following contributions: (1) We introduce neural-probabilistic predicates, efficiently integrating answer set programming with probabilistic inference via our novel DPPL, SLASH. (2) We successfully train neural, probabilistic and logic modules within SLASH for complex data structures end-to-end via a simple, single loss term. (3) We show that SLASH provides various advantages across a variety of tasks and data sets compared to state-of-the-art DPPLs and neural models.
2 Neuro-Symbolic Logic Programming
Neuro-Symbolic AI can be divided into two lines of research, depending on the starting point. Both, however, have the same final goal: to combine low-level perception with logical constraints and reasoning.
A key motivation of Neuro-Symbolic AI (d’Avila Garcez et al., 2009; Mao et al., 2019b; Hudson and Manning, 2019; d’Avila Garcez et al., 2019; Jiang and Ahn, 2020; d’Avila Garcez and Lamb, 2020) is to combine the advantages of symbolic and neural representations into a joint system. This is often done in a hybrid approach where a neural network acts as a perception module that interfaces with a symbolic reasoning system, e.g. (Mao et al., 2019b; Yi et al., 2018). The goal of such an approach is to mitigate the issues of one type of representation by the other, e.g. using the power of symbolic reasoning systems to handle the generalizability issues of neural networks and, on the other hand, handling the difficulty of noisy data for symbolic systems via neural networks. Recent work has also shown the advantage of Neuro-Symbolic approaches for explaining and revising incorrect decisions (Ciravegna et al., 2020; Stammer et al., 2021). Many of these previous works, however, train the sub-symbolic and symbolic modules separately.
Deep Probabilistic Programming Languages (DPPLs) are programming languages that combine deep neural networks with probabilistic models and allow a user to express a probabilistic model via a logical program. Similar to Neuro-Symbolic architectures, DPPLs thereby unite the advantages of different paradigms. DPPLs are related to earlier works such as Markov Logic Networks (MLNs) (Richardson and Domingos, 2006). The binding link is Weighted Model Counting (WMC), introduced in LP^{MLN} (Lee and Wang, 2016). Several DPPLs have been proposed by now, among which are Pyro (Bingham et al., 2018), Edward (Tran et al., 2017), DeepProbLog (Manhaeve et al., 2018), and NeurASP (Yang et al., 2020).
To resolve the scalability issues of DeepProbLog, which uses Sentential Decision Diagrams (SDDs) (Darwiche, 2011) as the underlying data structure to evaluate queries, NeurASP (Yang et al., 2020) offers a solution by utilizing Answer Set Programming (ASP) (Dimopoulos et al., 1997; Soininen and Niemelä, 1999; Marek and Truszczyński, 1999; Calimeri et al., 2020). In this way, NeurASP changes the paradigm from query evaluation to model generation, i.e. instead of constructing an SDD or similar knowledge representation system, NeurASP generates a set of all possible solutions (one model per solution) and estimates the probability for the truth value of each of these solutions. All DPPLs that handle learning in a relational, probabilistic setting and in an end-to-end fashion are, however, limited to estimating only conditional class probabilities.
3 The SLASH Framework
In this section, we introduce our novel DPPL, SLASH. Before diving into its details, we first introduce Neural-Probabilistic Predicates, for which we require an understanding of Probabilistic Circuits. Finally, we present the learning paradigm of SLASH.
The term probabilistic circuit (PC) (Choi et al., 2020) represents a unifying framework that encompasses all computational graphs which encode probability distributions and guarantee tractable probabilistic modelling. These include Sum-Product Networks (SPNs) (Poon and Domingos, 2011), which are deep mixture models represented via a rooted directed acyclic graph with a recursively defined structure.
3.1 Neural-Probabilistic Predicates
Previous DPPLs, DeepProbLog (Manhaeve et al., 2018) and NeurASP (Yang et al., 2020), introduced the Neural Predicate as an annotated disjunction or as a propositional atom, respectively, to acquire conditional class probabilities, $P(C|X)$, via the softmax function at the output of an arbitrary DNN. As mentioned in the introduction, this approach has certain limitations concerning inference capabilities. To resolve this issue, we introduce Neural-Probabilistic Predicates (NPPs).
Formally, we denote with
(1) $npp\left(h(x), [v_1, \ldots, v_n]\right)$
a Neural-Probabilistic Predicate $h$.
Thereby, (i) npp is a reserved word to label an NPP, (ii) $h$ is a symbolic name of either a PC, NN or a joint of a PC and NN (cf. Fig. 1(a)), e.g., color_attr is the name of an NPP of Fig. 1(b). Additionally, (iii) $x$ denotes a “term” and (iv) $v_1, \ldots, v_n$ are placeholders for each of the $n$ possible outcomes of $h$. For example, the placeholders for color_attr are the color attributes of an object (Red, Blue, Green, etc.).
An NPP abbreviates an arithmetic literal of the form $c = v$ with $c \in \{h(x)\}$ and $v \in \{v_1, \ldots, v_n\}$.
Furthermore, we denote with $\Pi^{npp}$ a set of NPPs of the form stated in (Eq. 1) and with $r^{npp}$ the set of all rules $c = v$ of one NPP, which denotes the possible outcomes, obtained from an NPP in $\Pi^{npp}$, e.g. $r^{color\_attr} = \{c = Red, c = Blue, c = Green, \ldots\}$ for the example depicted in Fig. 1(b).
Rules of the form
(2) $npp\left(h(x), [v_1, \ldots, v_n]\right) \leftarrow Body$
are used as an abbreviation for application to multiple entities, e.g. multiple slots for the task of set prediction (cf. Fig. 1(b)). Hereby, Body of Eq. 2 is identified by $\top$ (tautology, true) or $\bot$ (contradiction, false) during grounding. Rules of the form $Head \leftarrow Body$ with $r^{npp}$ appearing in Head are prohibited for $\Pi^{npp}$.
In this work, we largely make use of NPPs that contain probabilistic circuits (specifically SPNs) which allow for tractable density estimation and modelling of joint probabilities. In this way, it is possible to answer a much richer set of probabilistic queries, i.e. $P(X, C)$, $P(X|C)$ and $P(C|X)$.
In addition to this, we introduce the arguably more interesting type of NPP that combines a neural module with a PC. Hereby, the neural module learns to map the raw input data into an optimal latent representation, e.g. object-based slot representations. The PC, in turn, learns to model the joint distribution of these latent variables and produces the final probability estimates. This type of NPP nicely combines the representational power of neural networks with the advantages of PCs in probability estimation and query flexibility.
To make the different probabilistic queries distinguishable in a SLASH program, we introduce the following notation. We denote a given variable with $+$ and the query variable with $-$. E.g., within the running example of set prediction (cf. Fig. 1 and 1(b)), with the query $color\_attr(+X, -C)$ one is asking for $P(C|X)$. Similarly, with $color\_attr(-X, +C)$ one is asking for $P(X|C)$ and, finally, with $color\_attr(-X, -C)$ for $P(X, C)$.
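The relation between the query types of an NPP can be illustrated with a small discrete joint distribution. The following is only a minimal sketch: the table `JOINT`, the two input values and the three colors are hypothetical, and a real NPP would use a PC such as an SPN (possibly on top of a neural encoder) rather than a lookup table.

```python
# Hypothetical joint P(X, C) over a discretised input X (two values) and a
# class C (three colours). A real NPP would use a PC such as an SPN instead.
JOINT = {
    ("x0", "red"): 0.30, ("x0", "blue"): 0.05, ("x0", "green"): 0.05,
    ("x1", "red"): 0.10, ("x1", "blue"): 0.35, ("x1", "green"): 0.15,
}


def joint(x, c):
    """Joint query: P(X = x, C = c)."""
    return JOINT[(x, c)]


def class_given_input(x, c):
    """Conditional class query: P(C = c | X = x) = P(x, c) / sum_c' P(x, c')."""
    p_x = sum(p for (xi, _), p in JOINT.items() if xi == x)
    return JOINT[(x, c)] / p_x


def input_given_class(x, c):
    """Conditional input query: P(X = x | C = c) = P(x, c) / sum_x' P(x', c)."""
    p_c = sum(p for (_, ci), p in JOINT.items() if ci == c)
    return JOINT[(x, c)] / p_c
```

All three queries are answered from the same joint, which is the flexibility a softmax-only predicate does not offer, since it only ever returns the conditional class probabilities.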
To summarize: an NPP can consist of neural and/or probabilistic modules and produces query-dependent probability estimates. Due to the flexibility of its definition, the term NPP contains the predicates of previous works (Manhaeve et al., 2018; Yang et al., 2020), but also the more interesting predicates discussed above. The specific “flavor” of an NPP should be chosen depending on what type of probability estimation is required (cf. Fig. 1(a)).
3.2 The SLASH Language and Semantics
Fig. 1 presents an illustration of SLASH, exemplified for the task of set prediction, with all of its key components. Having introduced the NPPs, which produce probability estimates, we now move on to how these probability estimates are used for answering logical queries. We begin by formally defining a SLASH program.
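Definition 1. A SLASH program $\Pi$ is the union of $\Pi^{asp}$ and $\Pi^{npp}$. Therewith, $\Pi^{asp}$ is the set of propositional rules (standard rules from ASP-Core-2 (Calimeri et al., 2020)), and $\Pi^{npp}$ is a set of Neural-Probabilistic Predicates of the form stated in Eq. 1.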
Fig. 1(b) depicts a minimal SLASH program for the task of set prediction, exemplifying a set of propositional rules and neural predicates. Similar to NeurASP, SLASH requires ASP and as such adopts its syntax for the most part. We therefore now address integrating our NPPs into an ASP-compatible form to obtain the success probability for the logical query given all possible solutions. Thus, we define SLASH’s semantics. For SLASH to translate the program into a form compatible with the ASP solver, the rules (Eq. 1) are rewritten to the set of rules:
(3) $1\{h(x) = v_1; \ldots; h(x) = v_n\}1$
The ASP solver should understand this as “Pick exactly one rule from the set”. After the translation is done, we can ask an ASP solver for the solutions for $\Pi$. We denote a set of ASP constraints in the form $\leftarrow b_1, \ldots, b_n$ as queries $Q$ (annotation), and each of the solutions with respect to $Q$ as a potential solution, $I$ (referred to as a stable model in ASP). With $I|_{r^{npp}}$ we denote the projection of $I$ onto $r^{npp}$, the set of outcome rules of the NPPs, and with $Num(I|_{r^{npp}}, \Pi)$ the number of possible solutions of the program $\Pi$ agreeing with $I$ on $r^{npp}$. Because we aim to calculate the success probability of the query $Q$, we formalize the probability of a potential solution $I$ beforehand.
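Definition 2.
We specify the probability of the potential solution, $I$, for the program $\Pi$ as the product of the probabilities of all atoms in $I|_{r^{npp}}$, the projection of $I$ onto the NPP outcome rules $r^{npp}$, divided by the number of potential solutions of $\Pi$ agreeing with $I$ on $r^{npp}$:
(4) $P_{\Pi}(I) = \begin{cases} \dfrac{\prod_{c = v \,\in\, I|_{r^{npp}}} P_{\Pi}(c = v)}{Num(I|_{r^{npp}}, \Pi)} & \text{if } I \text{ is a potential solution of } \Pi, \\ 0 & \text{otherwise.} \end{cases}$
The probability of a query is then defined as follows.
Definition 3.
The probability of the query $Q$ given the set of possible solutions is defined as
(5) $P_{\Pi}(Q) := \sum_{I \models Q} P_{\Pi}(I)$
Thereby, $I \models Q$ reads as “$I$ satisfies $Q$”. The probability of the set of queries $\mathcal{Q} = \{Q_1, \ldots, Q_l\}$ is defined as the product of the probability of each, i.e.
(6) $P_{\Pi}(\mathcal{Q}) := \prod_{Q_i \in \mathcal{Q}} P_{\Pi}(Q_i) = \prod_{Q_i \in \mathcal{Q}} \sum_{I \models Q_i} P_{\Pi}(I)$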
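Definitions 2 and 3 can be traced on a toy version of MNIST-Addition. The sketch below is not SLASH’s implementation: the probabilities in `NPP_PROBS` are made up, digits are restricted to 0..2, and instead of calling an ASP solver the potential solutions are enumerated by brute force. In this toy program every choice of one outcome per NPP yields exactly one potential solution, so the divisor in Definition 2 is assumed to be 1.

```python
from itertools import product
from math import prod

# Hypothetical NPP outputs: for each of two images, a distribution over digits 0..2.
NPP_PROBS = {
    "i1": {0: 0.7, 1: 0.2, 2: 0.1},
    "i2": {0: 0.1, 1: 0.6, 2: 0.3},
}


def potential_solutions():
    """Enumerate assignments that pick exactly one outcome per NPP."""
    names = sorted(NPP_PROBS)
    for values in product(*(NPP_PROBS[n] for n in names)):
        yield dict(zip(names, values))


def solution_probability(solution, num_agreeing=1):
    """Product of the atom probabilities divided by the number of agreeing solutions."""
    return prod(NPP_PROBS[n][v] for n, v in solution.items()) / num_agreeing


def query_probability(target_sum):
    """Sum the probabilities of all potential solutions that satisfy the query."""
    return sum(
        solution_probability(s)
        for s in potential_solutions()
        if s["i1"] + s["i2"] == target_sum
    )
```

For the query “the digits sum to 1”, the two satisfying solutions are (0, 1) and (1, 0), so `query_probability(1)` evaluates to 0.7 · 0.6 + 0.2 · 0.1 = 0.44.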
3.3 Learning in SLASH
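To define the gradient calculation that is required for learning, let us recall that due to SLASH’s semantics, our interest lies in rewarding the right solutions ($v = c$) and penalizing wrong ones ($v \neq c$), for which we require gradient ascent. We therefore denote with $\Pi(\theta)$ the SLASH program under consideration, whereby $\theta$ is the set of the NPPs’ parameters associated with $\Pi$. We assume that for the set of queries, $\mathcal{Q}$, it holds that $P_{\Pi(\theta)}(Q) > 0$ for all $Q \in \mathcal{Q}$.
We maximize the log-likelihood of $\mathcal{Q}$ under our program $\Pi(\theta)$, i.e., we are searching for $\hat{\theta}$ such that
(7) $\hat{\theta} \in \arg\max_{\theta} \log P_{\Pi(\theta)}(\mathcal{Q}) = \arg\max_{\theta} \sum_{Q \in \mathcal{Q}} \log P_{\Pi(\theta)}(Q)$
Referring to the probabilities of the NPP outcomes as $p$, we calculate their gradients w.r.t. $\theta$ via backpropagation as
(8) $\sum_{Q \in \mathcal{Q}} \dfrac{\partial \log P_{\Pi(\theta)}(Q)}{\partial \theta} = \sum_{Q \in \mathcal{Q}} \dfrac{\partial \log P_{\Pi(\theta)}(Q)}{\partial p} \times \dfrac{\partial p}{\partial \theta}$
whereby $\frac{\partial p}{\partial \theta}$ is computed via backward propagation through the NPPs (cf. Eq. 10 in the appendix for backpropagation with joint NPPs) and $\frac{\partial \log P_{\Pi(\theta)}(Q)}{\partial p}$ is defined as: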
Proposition 1.
Let $\Pi(\theta)$ be a SLASH program and let $Q$ be a query with $P_{\Pi(\theta)}(Q) > 0$. Furthermore, let $p$ be the label of the probability of an atom $c = v$ of an NPP, denoting $P_{\Pi(\theta)}(c = v)$. Thus, we have,
From this proposition it thus follows that the gradient for the right solution () does not have to be positive. Finally, we remark that, similarly to Manhaeve et al. (2018) and Skryagin et al. (2020), we use the learning from entailment setting. Thus, for the SLASH program with parameters and a set of pairs , where is a query and its desired success probability, we compute the final loss function, , as:
(9) 
With this loss function we ultimately wish to maximize the estimated success probability. We remark that the defined loss function holds regardless of the NPP’s form (NN with Softmax, PC or PC jointly with NN). The only difference will be the second term, which depends on the NPP and task.
With the loss function (9) we have combined the optimization of each module, whether neural or probabilistic, into a single term that can be optimized via gradient ascent. Importantly, rather than requiring a novel loss function for each individual task and data set, with SLASH, it is possible to simply incorporate the specific requirements into the logic program. The training loss, however, remains the same. We refer to Appendix A for further details.
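To make the idea of maximizing the log-likelihood of a query via gradient ascent concrete, the following toy sketch trains two softmax “predicates” from the sum of their outputs alone. It is an illustration under strong assumptions, not SLASH’s training procedure: the parameters `theta` are plain logit vectors instead of network or PC weights, the gradient is obtained by central finite differences instead of backpropagation, and the query probability is computed by brute-force enumeration instead of an ASP solver.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def log_query_prob(theta, target_sum):
    """log P(Q) for the query 'digit1 + digit2 == target_sum'.

    theta holds two logit vectors, one per image; each softmax plays the role
    of an NPP returning class probabilities.
    """
    p1, p2 = softmax(theta[0]), softmax(theta[1])
    prob = sum(
        p1[a] * p2[b]
        for a in range(len(p1))
        for b in range(len(p2))
        if a + b == target_sum
    )
    return math.log(prob)


def numerical_gradient(theta, target_sum, eps=1e-6):
    """Central finite differences; a stand-in for backpropagation."""
    grad = [[0.0] * len(row) for row in theta]
    for i, row in enumerate(theta):
        for j in range(len(row)):
            plus = [r[:] for r in theta]
            minus = [r[:] for r in theta]
            plus[i][j] += eps
            minus[i][j] -= eps
            grad[i][j] = (log_query_prob(plus, target_sum)
                          - log_query_prob(minus, target_sum)) / (2 * eps)
    return grad


def train(theta, target_sum, steps=50, lr=0.5):
    """Plain gradient ascent on the log-likelihood of the query."""
    for _ in range(steps):
        grad = numerical_gradient(theta, target_sum)
        theta = [[t + lr * g for t, g in zip(row, grow)]
                 for row, grow in zip(theta, grad)]
    return theta
```

Starting from uniform logits over the digits 0..2 and the query “the digits sum to 4”, the only satisfying solution is (2, 2), so gradient ascent pushes both predicates towards the digit 2 although neither ever receives a direct label.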
4 Empirical Results
The advantage of SLASH lies in the efficient integration of neural, probabilistic and symbolic computations. To emphasize this, we conduct a variety of experimental evaluations.
Experimental Details. We use two benchmark data sets, namely MNIST (LeCun et al., 1998b) for the task of MNIST-Addition and a variant of the ShapeWorld data set (Kuhnle and Copestake, 2017) for object-centric set prediction. For all experiments, other than runtime benchmarking, we present the average and the standard deviation over five runs with different random seeds for parameter initialization.
For ShapeWorld experiments, we generate a data set we refer to as ShapeWorld4. Images of ShapeWorld4 contain between one and four objects, with each object consisting of four attributes: a color (red, blue, green, gray, brown, magenta, cyan or yellow), a shade (bright, or dark), a shape (circle, triangle or square) and a size (small or big). Thus, each object can be created from 84 different combinations of attributes. Fig. 1 depicts an example image.
We measure performance via classification accuracies in the MNIST-Addition task. In our ShapeWorld4 experiments we present the average precision. We refer to Appendix B for the SLASH programs and queries of each experiment, and Appendix C for a detailed description of hyperparameters and further details.
Evaluation 1: SLASH outperforms SOTA DPPLs in MNIST-Addition. The task of MNIST-Addition (Manhaeve et al., 2018) is to predict the sum of two MNIST digits, presented only as raw images. During test time, however, a model should classify the images directly. Thus, although a model does not receive explicit information about the depicted digits, it must learn to identify digits via indirect feedback on the sum prediction.
We compare the test accuracy after convergence and running time per epoch (RTE) between the three DPPLs: DeepProbLog (Manhaeve et al., 2018), NeurASP (Yang et al., 2020) and SLASH, using a probabilistic circuit (PC) or a deep neural network (DNN) as NPP. Notably, the DNN used in SLASH (DNN) is the LeNet5 model (LeCun et al., 1998a) of DeepProbLog and NeurASP. We note that when using the PC as NPP, we have also extracted conditional class probabilities $P(C|X)$, by marginalizing the class variables $C$ to acquire the normalization constant $P(X)$ from the joint $P(X, C)$, and calculating $P(C|X) = P(X, C) / P(X)$.


The results can be seen in Tab. 1(a). We observe that training SLASH with a DNN NPP produces SOTA accuracies compared to DeepProbLog and NeurASP, confirming that SLASH’s batch-wise loss computation leads to improved performances. We further observe that the test accuracy of SLASH with a PC NPP is slightly below the other DPPLs. We argue, however, that this may be because a PC, in comparison to a DNN, is learning a true mixture density rather than just conditional probabilities. The advantages of doing so will be investigated in the next experiments. Note that optimal architecture search for PCs, e.g. for computer vision, is an open research question.
An important practical result can be found when comparing the RTE measurements. In particular, we observe that even though it uses the same DNN as the other DPPLs considered, SLASH (DNN) is approx. 20 times faster than DeepProbLog and three times faster than NeurASP. SLASH (PC) is also faster than both DeepProbLog and NeurASP, although internally performing exact probabilistic inference. These differences in computation time are the result of parallel calls to the ASP solver of SLASH. These evaluations show SLASH’s superiority on the benchmark MNIST-Addition task.
Evaluation 2: Handling Missing Data with SLASH. SLASH offers the advantage of its flexibility to use various kinds of NPPs. Thus, in comparison to previous DPPLs, one can easily integrate NPPs into SLASH that perform joint probability estimation. For this evaluation, we consider the task of MNIST-Addition with missing data. We trained SLASH (PC) and DeepProbLog on the MNIST-Addition task with images in which a percentage of pixels per image has been removed. It is important to mention here that whereas DeepProbLog handles the missing data simply as background pixels, SLASH (PC) specifically models the missing data as uncertain data by marginalizing the denoted pixels at inference time. We use DeepProbLog here as a representative of DPPLs without true density estimation.
The results can be seen in Tab. 1(b) for , , and missing pixels per image. We observe that at , DeepProbLog and SLASH produce almost equal accuracies. With percent missing pixels, there is a substantial difference in the ability of the two DPPLs to correctly classify images, with SLASH being very stable. By further increasing the percentage of missing pixels, this difference becomes even more substantial with SLASH still reaching a test accuracy even when of the pixels per image are missing, whereas DeepProbLog degrades to an average of test accuracy. We further note that SLASH, in comparison to DeepProbLog, produces largely reduced standard deviations over runs. Thus, by utilizing the power of true density estimation SLASH, with an appropriate NPP, can produce more robust results in comparison to other DPPLs.
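Marginalizing missing pixels is cheap in a model that factorizes over its inputs, which is the property PCs exploit. The sketch below is a deliberately tiny stand-in for the PC used in the experiments: a mixture over two hypothetical classes, each a product of independent Bernoulli pixel distributions, with made-up parameters in `PRIOR` and `PIXEL_ON`. A missing pixel is marked with `None` and its factor is replaced by 1, i.e. the sum over both pixel states.

```python
from math import prod

# Hypothetical two-class model over four binary pixels. Each class is a product
# of independent Bernoulli leaves, and the classes are mixed by a sum node:
# P(x, c) = P(c) * prod_i P(x_i | c).
PRIOR = {"zero": 0.5, "one": 0.5}
PIXEL_ON = {
    "zero": [0.9, 0.1, 0.1, 0.9],
    "one": [0.1, 0.9, 0.9, 0.1],
}


def joint(pixels, label):
    """P(observed pixels, label); pixels set to None are marginalised out."""
    factors = []
    for value, p_on in zip(pixels, PIXEL_ON[label]):
        if value is None:
            factors.append(1.0)  # sum over both pixel states: p_on + (1 - p_on)
        else:
            factors.append(p_on if value == 1 else 1.0 - p_on)
    return PRIOR[label] * prod(factors)


def posterior(pixels):
    """P(label | observed pixels) via Bayes' rule on the marginalised joint."""
    scores = {label: joint(pixels, label) for label in PRIOR}
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}
```

With only the first pixel observed, `posterior([1, None, None, None])` still favors the class "zero" with probability 0.9, whereas treating the three missing pixels as background (value 0) would feed the model evidence that was never observed.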
Evaluation 3: Improved Concept Learning via SLASH. We show that SLASH can be very effective for the complex task of set prediction, which previous DPPLs have not tackled. We revert to the ShapeWorld4 data set for this setting.
For set prediction, a model is trained to predict the discrete attributes of a set of objects in an image (cf. Fig. 1 for an example ShapeWorld4 image). The difficulty for the model lies in the fact that it must match an unordered set of corresponding attributes (with a varying number of entities over samples) with its internal representations of the image.
The slot attention module introduced by Locatello et al. (2020) allows for an attractive object-centric approach to this task. Specifically, this module represents a pluggable, differentiable module that can be easily added to any architecture and, through a competitive softmax-based attention mechanism, can enforce the binding of specific parts of a latent representation into permutation-invariant, task-specific vectors, called slots.
In our experiments, we wish to show that by adding logical constraints to the training setting, one can improve the overall performances and generalization properties of such a model. For this, we train SLASH with NPPs as depicted in Fig. 1 consisting of a shared slot encoder and separate PCs, each modelling the mixture of latent slot variables and the attributes of one category, e.g. color. For ShapeWorld4, we thereby have altogether four NPPs. SLASH is trained via queries of the kind exemplified in Fig. 7 in the Appendix. We refer to this configuration as SLASH Attention.
We compare SLASH Attention to a baseline slot attention encoder using an MLP and Hungarian loss for predicting the object properties from the slot encodings as in Locatello et al. (2020). The results of these experiments can be found in Fig. 3a (top). We observe that the average precision after convergence on the held-out test set with SLASH Attention is greatly improved over that of the baseline model. Additionally, in Fig. 3b we observe that SLASH Attention reaches the average precision value of the baseline model in far fewer epochs. Thus, we can summarize that adding logical knowledge in the training procedure via SLASH can greatly improve the capabilities of a neural module for set prediction.
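The baseline needs a matching step because predictions and targets form unordered sets. The following sketch shows that step in isolation. It is not the Hungarian algorithm itself: for the handful of objects per image a brute-force search over all orderings finds the same optimal assignment, and the mismatch count used as a cost is a hypothetical choice made for illustration.

```python
from itertools import permutations


def mismatch(pred, target):
    """Number of attribute positions on which two objects disagree."""
    return sum(p != t for p, t in zip(pred, target))


def best_matching(predictions, targets):
    """Return (cost, assignment) minimising total mismatch over all orderings.

    assignment[i] is the index of the target matched to predictions[i].
    Brute force is O(n!) and only meant for the handful of slots used here.
    """
    assert len(predictions) == len(targets)
    best = None
    for perm in permutations(range(len(targets))):
        cost = sum(mismatch(predictions[i], targets[j]) for i, j in enumerate(perm))
        if best is None or cost < best[0]:
            best = (cost, perm)
    return best
```

A loss built this way has to be designed per task, which is the contrast drawn with SLASH, where the task-specific part lives in the logic program.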
Evaluation 4: Improved Compositional Generalization with SLASH. To test the hypothesis that SLASH Attention possesses improved generalization properties in comparison to the baseline model, we ran experiments on a variant of ShapeWorld4 similar to the CLEVR Compositional Generalization Test (CoGenT) (Johnson et al., 2017). The goal of CoGenT is to investigate a model’s ability to handle novel combinations of attributes that were not seen during training.
For this purpose, we established two conditions within a ShapeWorld4 CoGenT data set: Condition (A) – the training and test data set contains squares with the colors gray, blue, brown, or yellow, triangles with the colors red, green, magenta, or cyan and circles of all colors. Condition (B) – the training set is as in Condition (A). However, the test set contains squares with the colors red, green, magenta, or cyan, triangles with the colors gray, blue, brown, or yellow and circles of all colors. The goal is to investigate how well a model can generalize that, e.g., also squares can have the color red, although never having seen evidence for this during training.
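The two conditions can be written down as a small predicate over shape and color pairs. This is only a sketch of the split as described in the text; the function names and the string encoding of shapes, colors and conditions are assumptions made for illustration.

```python
COLORS_A = {"gray", "blue", "brown", "yellow"}
COLORS_B = {"red", "green", "magenta", "cyan"}
ALL_COLORS = COLORS_A | COLORS_B


def allowed(shape, color, condition):
    """Whether a (shape, color) pair may appear in the test set of a condition."""
    if shape == "circle":
        return color in ALL_COLORS
    if shape == "square":
        return color in (COLORS_A if condition == "A" else COLORS_B)
    if shape == "triangle":
        return color in (COLORS_B if condition == "A" else COLORS_A)
    raise ValueError(f"unknown shape: {shape}")


def is_novel_in_b(shape, color):
    """True for combinations tested in Condition (B) but never seen in training."""
    return allowed(shape, color, "B") and not allowed(shape, color, "A")
```

Since training uses the Condition (A) combinations in both settings, every square and triangle in the Condition (B) test set is a novel combination, while circles are not.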
The resulting average precision test scores are presented in Fig. 3a (bottom). We observe that, even though the SLASH Program used for this experiment was not explicitly written to handle composition generalization, SLASH Attention shows greatly improved generalization capabilities. This can be seen in the approx. higher average precision scores on the Condition (B) test set in comparison to the baseline model. Importantly, this trend still holds even when subtracting the higher precision scores observed in Condition (A).
To summarize our findings from the experiments on set prediction: we observe that adding prior knowledge in the form of logical constraints via SLASH can greatly improve a neural module in terms of performance and generalizability. On a side note: training neural networks for novel tasks often involves defining explicit loss functions, e.g. Hungarian loss for set prediction. In contrast, with SLASH, no matter the choice of NPP and underlying task, the training loss remains the same. Task-related requirements simply need to be added as lines of code to the SLASH program. This additionally highlights SLASH’s versatility and flexibility.
Summary of all Empirical Results. All empirical results together clearly demonstrate that the flexibility of SLASH is highly beneficial and can easily outperform state-of-the-art approaches: one can freely combine what is required to solve the underlying task, be it (deep) neural networks, probabilistic circuits, or logic.
5 Conclusion and Future Work
We introduced SLASH, a novel DPPL that integrates neural computations with tractable probability estimates and logical statements. The key ingredients of SLASH to achieve this are Neural-Probabilistic Predicates (NPPs) that can be flexibly constructed out of neural and/or probabilistic circuit modules based on the data and underlying task. With these NPPs, one can produce task-specific probability estimates. The details and additional prior knowledge of a task are neatly encompassed within a SLASH program with only a few lines of code. Finally, via Answer Set Programming and Weighted Model Counting, the logical SLASH program and probability estimates from the NPPs are combined to estimate the truth value of a task-specific query. Our experiments show the power and efficiency of SLASH, improving upon previous DPPLs in the benchmark MNIST-Addition task in terms of performance, efficiency and robustness. Importantly, by integrating a SOTA slot attention encoder into NPPs and adding a few logical constraints, SLASH demonstrates improved performances and generalizability in comparison to the pure slot encoder for the task of object-centric set prediction, a setting no DPPL has tackled yet. This shows the great potential of DPPLs to elegantly combine logical reasoning with neural computations and uncertainty estimates, and represents an immensely flexible and attractive path for future research.
Interesting avenues for future work include benchmarking SLASH on additional data types and tasks. One should explore unsupervised and weakly supervised learning using logic with SLASH and investigate how far logical constraints can help unsupervised object discovery. In direct alignment with our work, one should also investigate image generation via the beneficial feature of PCs to generate random samples. Actually, it should be possible to generate images that encapsulate logical knowledge bases. This is important to move from data-rich to knowledge-rich AI.
Ethics Statement
With our work, we have shown that one can add prior knowledge and logical constraints to the training of learning systems. We postulate that SLASH can therefore additionally be used to identify and remove biases or undesirable behavior, by adding constraints within the SLASH program. We observe that this feature, however, also has the potential danger to be used in the opposite way, e.g. explicitly adding bias and discriminatory factors to a system. To the best of our knowledge, our study does not raise any ethical, privacy or conflict of interest concerns.
Acknowledgments
This work has been supported by ICT48 Network of AI Research Excellence Center “TAILOR” (EU Horizon 2020, GA No 952215), the BMEL/BLE under the innovation support program, project “AuDiSens” (FKZ28151NA187) as well as the German Federal Ministry of Education and Research and the Hessian Ministry of Science and the Arts within the National Research Center for Applied Cybersecurity “ATHENE”. Additionally, it has benefited from the Hessian research priority program LOEWE within the project WhiteBox, the HMWK cluster project “The Third Wave of AI” and the Collaboration Lab “AI in Construction” (AICO).
References
 From System 1 Deep Learning to System 2 Deep Learning. Note: Invited talk NeurIPS External Links: Link Cited by: §1.

Pyro: Deep Universal Probabilistic Programming.
In
Journal of Machine Learning Research
, Cited by: §1, §2.  Monet: unsupervised scene decomposition and representation. arXiv preprint arXiv:1901.11390. Cited by: §1.
 ASPcore2 input language format. In Theory and Practice of Logic Programming, Cited by: §2, Definition 1.
 Probabilistic circuits: a unifying framework for tractable probabilistic models. Technical report Technical report. Cited by: §3.
 Humandriven fol explanations of deep learning.. In IJCAI, pp. 2234–2240. Cited by: §1, §2.
 Programming in prolog. SpringerVerlag Berlin Heidelberg. Cited by: §1.
 The birth of prolog. In History of programming languages—II, Cited by: §1.
 Neurosymbolic AI: the 3rd wave. CoRR abs/2012.05876. Cited by: §1, §2.
Neural-symbolic computing: an effective methodology for principled integration of machine learning and reasoning. FLAP 6 (4), pp. 611–632. Cited by: §1, §2.
Neural-symbolic cognitive reasoning. Cognitive Technologies, Springer. Cited by: §2.
SDD: A New Canonical Representation of Propositional Knowledge Bases. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence (IJCAI), pp. 819–826. Note: http://reasoning.cs.ucla.edu/fetch.php?id=121&type=pdf Cited by: §2.
Encoding planning problems in nonmonotonic logic programs. In European Conference on Planning, pp. 169–181. Cited by: §2.
GENESIS: generative scene inference and sampling with object-centric latent representations. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, Cited by: §1.
Connectionism and cognitive architecture: a critical analysis. Cognition 28 (1-2), pp. 3–71. Cited by: §1.
Multi-object representation learning with iterative variational inference. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, pp. 2424–2433. Cited by: §1.
 On the binding problem in artificial neural networks. CoRR abs/2012.05208. External Links: 2012.05208 Cited by: §1.
Learning by abstraction: the neural state machine. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 5901–5914. Cited by: §1, §2.
Generative neurosymbolic machines. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.), Cited by: §1, §1, §2.
CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.
Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), Cited by: §C.3.
 ShapeWorld: a new test methodology for multimodal language understanding. In arXiv, Cited by: §C.1, §4.
Gradient-Based Learning Applied To Document Recognition. In IEEE, pp. 2278–2324. Cited by: §4.
THE MNIST DATABASE of handwritten digits. Cited by: §4.
 Weighted Rules under the Stable Model Semantics. In AAAI, Cited by: §2.
SPACE: unsupervised object-oriented scene representation via spatial attention and decomposition. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, Cited by: §1.
Object-centric learning with slot attention. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 11525–11538. Cited by: §C.2, §C.3, §1, §4, §4.
 DeepProbLog: Neural Probabilistic Logic Programming. In NeurIPS, pp. 3753–3763. Cited by: §C.3, §1, §1, §2, §3.1, §3.1, §3.3, §4, §4.
The neuro-symbolic concept learner: interpreting scenes, words, and sentences from natural supervision. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, Cited by: §1.
The neuro-symbolic concept learner: interpreting scenes, words, and sentences from natural supervision. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, Cited by: §1, §2.
 The algebraic mind: integrating connectionism and cognitive science. MIT press. Cited by: §1.
 Stable models and an alternative logic programming paradigm. In The Logic Programming Paradigm, pp. 375–398. Cited by: §2.
 Einsum networks: fast and scalable learning of tractable probabilistic circuits. In Proceedings of the 37th International Conference on Machine Learning (ICML), Cited by: §C.3.
Sum-Product Networks: A New Deep Architecture. In UAI, pp. 337–346. Cited by: §C.3, §C.3, §3.
 Markov logic networks. Machine Learning 68, pp. 107–136. Cited by: §2.
SPLog: sum-product logic. In International Conference on Probabilistic Programming, Cited by: §3.3.
 Developing a declarative rule language for applications in product configuration. In International Symposium on Practical Aspects of Declarative Languages, pp. 305–319. Cited by: §2.
Right for the right concept: revising neuro-symbolic concepts by interacting with their explanations. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pp. 3619–3629. Cited by: §1, §1, §2.
 Deep Probabilistic Programming. In ICLR, Cited by: §1, §2.
NeurASP: embracing neural networks into answer set programming. In IJCAI, Cited by: §C.3, §1, §2, §2, §3.1, §3.1, §4.
Neural-symbolic VQA: disentangling reasoning from vision and language understanding. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 1039–1050. Cited by: §1, §1, §2.
Deep set prediction networks. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 3207–3217. Cited by: §1.
Appendix A – Details on Parameter Learning
In this appendix, we discuss details of parameter learning in SLASH. We consider the learning-from-entailment setting, i.e., for a given SLASH program $\Pi$ with parameters $\theta$ and a set $\mathcal{Q}$ of pairs $(Q, p)$, where $Q$ is a query and $p$ its desired success probability, we compute, for a loss function $L$:
To this end, we defined the loss function in the equation as
Since the data associated with the query is forwarded to an NPP, and SLASH obtains the success probability of the query from the output probabilities of this NPP, we can formally write
the last equation follows the notion of the set of classes with the cardinality of . Furthermore, we remark that the derived loss function also applies to a SLASH program containing multiple NPPs. For example, the program we considered for ShapeWorld4 contains four of them. See Figure 6 for details.
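As a point of reference, the learning-from-entailment objective introduced at the beginning of this appendix is usually written in the form below, which follows the formulation used for DeepProbLog (Manhaeve et al., 2018). The notation is an assumption made for illustration: $\Pi(\theta)$ denotes the program with parameters $\theta$, $\mathcal{Q}$ the set of pairs $(Q, p)$ of a query and its desired success probability, and $L$ the loss function. The second line is simply the chain rule that a gradient-based optimizer would apply to each summand.

```latex
\begin{align}
\hat{\theta} &= \arg\min_{\theta} \; \frac{1}{|\mathcal{Q}|}
  \sum_{(Q,\,p) \in \mathcal{Q}} L\bigl(P_{\Pi(\theta)}(Q),\, p\bigr), \\
\frac{\partial}{\partial \theta} L\bigl(P_{\Pi(\theta)}(Q),\, p\bigr)
  &= \frac{\partial L}{\partial P_{\Pi(\theta)}(Q)} \cdot
     \frac{\partial P_{\Pi(\theta)}(Q)}{\partial \theta}.
\end{align}
```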
Finally, we note that the mathematical transformations listed above hold for any type of NPP and task-dependent query (NN with Softmax, PC, or PC jointly with NN). The only difference is the second term, whose exact form depends on the NPP and the task. The derivation above used an NPP in the form of a single PC as the example.
Backpropagation for joint NN and PC NPPs: If, within the SLASH program, the NPP forwards the data tensor through a NN first, i.e., the NPP models a joint over the NN’s output variables by a PC, then we rewrite Eq. (8) as Eq. (10). The additional term is the gradient with respect to the set of the NN’s parameters, and it is computed by backward propagation through the NN.
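The chain-rule structure behind this joint NN and PC case can be illustrated with a small numerical sketch. Everything in it is a hypothetical toy and not the SLASH implementation: the "NN" is a single tanh unit with parameters `gamma = (w, b)`, and the "PC" is one sum node over two Gaussian leaves with fixed weights. The gradient of the PC's log-density with respect to the NN's parameters is the PC's gradient with respect to its input, multiplied by the NN's Jacobian; a finite-difference check confirms the result.

```python
import math

# Hypothetical toy PC: one sum node over two Gaussian leaves (illustration only).
PC_WEIGHTS = (0.3, 0.7)
PC_LEAVES = ((-0.5, 0.4), (0.6, 0.3))  # (mean, std) of each leaf


def gaussian_pdf(z, mu, sigma):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def pc_log_density(z):
    """log p(z) of the toy PC."""
    p = sum(w * gaussian_pdf(z, mu, s) for w, (mu, s) in zip(PC_WEIGHTS, PC_LEAVES))
    return math.log(p)


def pc_dlog_density_dz(z):
    """d log p(z) / dz, i.e. the gradient arriving at the PC's input."""
    p, dp = 0.0, 0.0
    for w, (mu, s) in zip(PC_WEIGHTS, PC_LEAVES):
        leaf = w * gaussian_pdf(z, mu, s)
        p += leaf
        dp += leaf * (-(z - mu) / s ** 2)
    return dp / p


def nn_forward(x, gamma):
    """Toy 'NN': a single tanh unit with parameters gamma = (w, b)."""
    w, b = gamma
    return math.tanh(w * x + b)


def nn_jacobian(x, gamma):
    """(dz/dw, dz/db) for the toy NN."""
    z = nn_forward(x, gamma)
    dz_dpre = 1.0 - z * z
    return (dz_dpre * x, dz_dpre)


def grad_log_density_wrt_gamma(x, gamma):
    """Chain rule: (d log p / dz) * (dz / dgamma)."""
    upstream = pc_dlog_density_dz(nn_forward(x, gamma))
    return tuple(upstream * dz for dz in nn_jacobian(x, gamma))


def finite_difference(x, gamma, eps=1e-6):
    """Central finite differences of log p(NN(x)) with respect to each parameter."""
    grads = []
    for i in range(len(gamma)):
        plus, minus = list(gamma), list(gamma)
        plus[i] += eps
        minus[i] -= eps
        diff = pc_log_density(nn_forward(x, plus)) - pc_log_density(nn_forward(x, minus))
        grads.append(diff / (2.0 * eps))
    return tuple(grads)
```

In an automatic-differentiation framework the same product is obtained by calling backward propagation once on the PC's output, so no hand-written Jacobian is needed.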
Appendix B – SLASH Programs
Here, the interested reader will find the SLASH programs which we compiled for our experiments. Figure 4 presents the program for the MNIST Addition task, and Figure 6 the program for the set prediction task with the slot attention encoder and the subsequent CoGenT test. Note the use of the “+” and “-” notation for indicating whether a random variable is given or being queried for.
Appendix C – Experimental Details
C.1 ShapeWorld4 Generation
The ShapeWorld4 and ShapeWorld4 CoGenT data sets were generated using the original scripts of Kuhnle and Copestake (2017) (https://github.com/AlexKuhnle/ShapeWorld). The exact scripts will be released together with the SLASH source code.
C.2 Average Precision computation (ShapeWorld4)
For the baseline slot encoder experiments on ShapeWorld4, we measured the average precision score as in Locatello et al. (2020). When applying SLASH Attention, however, we handled the case of a slot not containing an object (e.g. only background variables) differently from the baseline slot encoder. Whereas Locatello et al. (2020) add an additional binary identifier to the multi-label ground truth vectors, we added a background (bg) attribute to each category (cf. Fig. 6). A slot is thus considered empty (i.e., not containing an object) if each NPP returns a high conditional probability for the bg attribute.
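The empty-slot rule described above can be sketched as follows. This is a minimal illustration, not the evaluation code used for the experiments: the function name, the dictionary layout and, in particular, the numeric `threshold` are assumptions, since the text only requires a "high" conditional probability for the bg attribute.

```python
def slot_is_empty(npp_posteriors, bg_label="bg", threshold=0.5):
    """Return True if every NPP assigns at least `threshold` probability to bg.

    npp_posteriors maps a category name (e.g. "color") to a dict that maps
    attribute values to conditional probabilities. The threshold value is a
    hypothetical choice made for this sketch.
    """
    if not npp_posteriors:
        return False  # no NPP outputs, so nothing supports "empty"
    return all(
        posterior.get(bg_label, 0.0) >= threshold
        for posterior in npp_posteriors.values()
    )


# Hypothetical NPP outputs for two slots, one per category of ShapeWorld4.
background_slot = {
    "color": {"red": 0.02, "bg": 0.98},
    "shape": {"circle": 0.05, "bg": 0.95},
    "shade": {"bright": 0.10, "bg": 0.90},
    "size": {"small": 0.03, "bg": 0.97},
}
object_slot = {
    "color": {"red": 0.90, "bg": 0.10},
    "shape": {"circle": 0.85, "bg": 0.15},
    "shade": {"bright": 0.30, "bg": 0.70},  # one NPP leaning towards bg is not enough
    "size": {"small": 0.95, "bg": 0.05},
}
```

Requiring agreement of all NPPs means that a single uncertain category does not cause an object to be discarded as background.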
C.3 Model Details
For the experiments using NPPs with PCs, we used Einsum Networks (EiNets) to implement the probabilistic circuits. EiNets are a novel implementation design for SPNs, introduced by Peharz et al. (2020), that reduces the computational costs from which earlier SPN implementations suffered. This is accomplished by combining several arithmetic operations into a single monolithic einsum operation.
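To make the idea of a single fused operation concrete, the sketch below compares a two-step evaluation (product nodes first, then weighted sums) with one fused contraction $S_k = \sum_{i,j} W_{kij} N_i N'_j$. It is a plain-Python illustration under simplifying assumptions: the function names are hypothetical, the values are kept in the linear domain, and no batching is used, whereas a practical implementation would typically work with log-densities for numerical stability.

```python
def two_step_layer(weights, left, right):
    """Materialise all product nodes, then evaluate the sum nodes."""
    # Product nodes: one value per pair (i, j) of child entries.
    products = [[l * r for r in right] for l in left]
    # Sum nodes: weighted sum over all product nodes.
    return [
        sum(w_k[i][j] * products[i][j]
            for i in range(len(left)) for j in range(len(right)))
        for w_k in weights
    ]


def fused_einsum_layer(weights, left, right):
    """Single contraction S[k] = sum_ij W[k][i][j] * left[i] * right[j].

    With NumPy arrays this corresponds to
    numpy.einsum("kij,i,j->k", weights, left, right).
    """
    return [
        sum(w_k[i][j] * left[i] * right[j]
            for i in range(len(left)) for j in range(len(right)))
        for w_k in weights
    ]
```

Both functions return the same values; the fused form avoids storing the intermediate product nodes, which is the source of the savings once the contraction is dispatched to an optimized einsum routine.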
For all experiments, the ADAM optimizer (Kingma and Ba, 2015) with and , and no weight decay was used.
MNIST-Addition Experiments For the MNIST-Addition experiments, we ran the DeepProbLog and NeurASP programs with their original configurations, as stated in Manhaeve et al. (2018) and Yang et al. (2020), respectively. When training SLASH with the neural NPP (SLASH (DNN)), we used the same neural module as in DeepProbLog and NeurASP, shown in Tab. 2. When using a PC NPP (SLASH (PC)), we used an EiNet with the Poon-Domingos (PD) structure (Poon and Domingos, 2011) and normal distributions for the leaves. The hyperparameters for the EiNet are listed in Tab. 3. The learning rate and batch size for the DNN were 0.005 and 100 for DeepProbLog, NeurASP and SLASH (DNN). For the EiNet, these were 0.01 and 100.
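
Table 2: Neural module used for the MNIST-Addition experiments.

| Type | Size/Channels | Activation | Comment |
|---|---|---|---|
| Encoder | | | |
| Conv 5 x 5 | 1x28x28 | | stride 1 |
| MaxPool2d | 6x24x24 | ReLU | kernel size 2, stride 2 |
| Conv 5 x 5 | 6x12x12 | | stride 1 |
| MaxPool2d | 16x8x8 | ReLU | kernel size 2, stride 2 |
| Classifier | | | |
| MLP | 16x4x4,120 | ReLU | |
| MLP | 120,84 | ReLU | |
| MLP | 84,10 | | Softmax |
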
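The sizes listed for the MNIST-Addition neural module can be cross-checked with a short shape trace. The sketch assumes unpadded convolutions, which is consistent with a 5 x 5 kernel mapping 28 x 28 inputs to 24 x 24 feature maps; the helper names are hypothetical, and the channel counts 6 and 16 are read off the table.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel, stride):
    """Spatial output size of a max-pooling layer (no padding)."""
    return (size - kernel) // stride + 1


def trace_encoder(height=28, width=28):
    """Return the (channels, height, width) shape before and after each layer."""
    # (kind, kernel, stride, output channels); pooling keeps the channel count.
    layers = [
        ("conv", 5, 1, 6),
        ("pool", 2, 2, None),
        ("conv", 5, 1, 16),
        ("pool", 2, 2, None),
    ]
    channels = 1
    shapes = [(channels, height, width)]
    for kind, kernel, stride, out_channels in layers:
        if kind == "conv":
            height = conv_out(height, kernel, stride)
            width = conv_out(width, kernel, stride)
            channels = out_channels
        else:
            height = pool_out(height, kernel, stride)
            width = pool_out(width, kernel, stride)
        shapes.append((channels, height, width))
    return shapes


shapes = trace_encoder()
flattened = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]  # input size of the first MLP
```

Under these assumptions the trace reproduces the entries of the Size/Channels column, which lists the input shape of each layer, and the first MLP receives 16 x 4 x 4 = 256 features.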
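
Table 3: EiNet hyperparameters for the MNIST-Addition experiments.

| Variables | Width | Height | Number of Pieces | Class count |
|---|---|---|---|---|
| 784 | 28 | 28 | [4,7,28] | 10 |
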
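
Table 4: Baseline slot attention architecture for the ShapeWorld4 experiments.

| Type | Size/Channels | Activation | Comment |
|---|---|---|---|
| Conv 5 x 5 | 32 | ReLU | stride 1 |
| Conv 5 x 5 | 32 | ReLU | stride 1 |
| Conv 5 x 5 | 32 | ReLU | stride 1 |
| Conv 5 x 5 | 32 | ReLU | stride 1 |
| Position Embedding | | | |
| Flatten | axis: [0, 1, 2 x 3] | | flatten x, y pos. |
| Layer Norm | | | |
| MLP (per location) | 32 | ReLU | |
| MLP (per location) | 32 | | |
| Slot Attention Module | 32 | ReLU | |
| MLP | 32 | ReLU | |
| MLP | 16 | Sigmoid | |
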
ShapeWorld4 Experiments For the baseline slot attention experiments with the ShapeWorld4 data set, we used the architecture presented in Tab. 4. For further details, we refer to the original work of Locatello et al. (2020). The slot encoder used 4 slots and 3 attention iterations in all experiments.
For the SLASH Attention experiments with ShapeWorld4, we used the same slot encoder as in Tab. 4, but replaced the final MLPs with 4 individual EiNets with Poon-Domingos structure (Poon and Domingos, 2011). Their hyperparameters are listed in Tab. 5.
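
Table 5: EiNet hyperparameters for the SLASH Attention experiments on ShapeWorld4.

| EiNet | Variables | Width | Height | Number of Pieces | Class count |
|---|---|---|---|---|---|
| Color | 32 | 8 | 4 | [4] | 9 |
| Shape | 32 | 8 | 4 | [4] | 4 |
| Shade | 32 | 8 | 4 | [4] | 3 |
| Size | 32 | 8 | 4 | [4] | 3 |
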
The learning rate and batch size for SLASH Attention were 0.01 and 512 for both ShapeWorld4 and ShapeWorld4 CoGenT. The learning rate and batch size for the baseline slot encoder were 0.0004 and 512.