1 Introduction
The unification of low-level perception and high-level reasoning is a long-standing problem in artificial intelligence, which has the potential not only to bring the areas of logic and learning closer together but also to demonstrate how abstract concepts might emerge from sensory data. Precisely because deep learning methods dominate perception-based learning, including vision, speech, and linguistic grammar, there is a fast-growing literature on how to integrate symbolic reasoning and deep learning. Efforts have ranged from providing a truth-theory for deep learning [Serafini and Garcez, 2016, Garcez et al., 2012], to neural architectures that enable differentiable computation for symbolic constraints [Xu et al., 2017, Bošnjak et al., 2017, Rocktäschel and Riedel, 2017, Santoro et al., 2017], to embeddings for graph and relational data [Yang et al., 2014, Lin et al., 2015, Niepert et al., 2016, Dumancic et al., 2018]. Approaches such as DeepProbLog [Manhaeve et al., 2018], on the other hand, treat deep learning as an external computation and integrate its predictions as an external predicate in a probabilistic logic programming framework. Broadly, efforts seem to fall into three camps: those focused on semantic characterizations (i.e., defining a logic whose formulas capture deep learning), constrained learning (i.e., integrating symbolic constraints in deep learning), and hybrid methods (allowing neural computations and symbolic reasoning to coexist separately, to enjoy the strengths of both worlds).
In this paper, we identify another dimension to this inquiry: what do the hidden layers really capture, and how can we reason about that logically? In particular, we consider autoencoders (AEs) [Goodfellow et al., 2016, Kingma and Welling, 2013, Rezende et al., 2014]. As a variant of neural networks, AE frameworks are perhaps the most popular for dimensionality reduction, but their inner workings remain opaque. Basically, given an encoder e, one first applies it to input data to obtain a feature layer (FL) and then attempts to recover the input from the FL using a decoder d. Constraints on the FL can lead to massive reductions in dimensionality and identify salient features for applications such as anomaly detection. (See, for example, [Dumancic et al., 2018], which is a purely logical approach inspired by autoencoding principles.) Thus, we ask the question: can we inject a logical language onto the FL to perform Boolean reasoning over the FL’s variables?
The exact choice of the language would depend on what we intend to do with the logic. A purely discrete representation such as propositional logic may not be very interesting or insightful about what the FL really captures, especially in cases where there may be probabilities assigned to image labels. In that regard, there has been an interesting development in knowledge representation over the last few years. As a special case of probabilistic logical models [De Raedt and Kimmig, 2015, Getoor and Taskar, 2007], tractable probabilistic models have emerged as an extension to data structures such as binary decision diagrams (BDDs). In particular, probabilistic sentential decision diagrams (PSDDs) [Kisa et al., 2014] are a complete and canonical representation of a probability distribution defined over the models of a propositional theory. By imposing certain properties on the propositional representation, such as decomposability and determinism, probabilistic queries can be answered in polynomial time in the size of the data structure by way of model counting. Their parameters can be learned efficiently from data, which allows us to view the representation through a generative lens over a logical base. We discuss below how these features are put to use, but more generally, we see the work as a step in repurposing deep learning in logical space, contributing to the emergence of high-level reasoning from a low-level system. Our contributions are orthogonal in many regards to the existing literature on neuro-symbolic systems, and thus we imagine there is room for exploring other kinds of integration with that literature.
Interestingly, we take note of multiple approaches for visually inspecting and interpreting neural networks (NNs) in the literature [Szegedy et al., 2013, Yosinski et al., 2015], with a special focus on understanding convolutions of deep networks after training [Simonyan et al., 2013, Zeiler and Fergus, 2014]. While many of these methods yield various analyses of what happens in a given NN, including saliency maps [Simonyan et al., 2013], they differ significantly in thrust from our contributions. For example, optimization methods are usually used to infer and decode regions of interest in a specific layer of a pretrained network. In contrast, our logical approach uses a symbolic framework to make sense of the NN via a generative model.
As such we do not infer the meaning of individual variables, although it is possible to visualize them, but rather compute conditional probabilities over those variables.
It should also be noted that recent circuit models attempt to tackle vision problems too (e.g., [Poon and Domingos, 2011, Gens and Domingos, 2012, Liang and Van den Broeck, 2019]), and so it is possible to realize the entire image classification pipeline using these tractable probabilistic models (usually, however, as a discriminative model). We think this is a very exciting development. What our work attempts to do, however, is to inspect state-of-the-art deep learning architectures (especially models like AEs that are very powerful for dimensionality reduction) via such symbolic generative models, in cases where such architectures are already in place or are tackling problems still to be addressed using a pure circuit scheme. In that regard, as mentioned, our work is to be seen as attempting to repurpose the latent space in a logical manner.
Our approach offers the following capabilities. We learn a PSDD over a discretized FL, which yields a joint distribution over the individual variables of the FL, including image labels. This allows us, among other things, to visualize these individual variables by conditional sampling. In particular, this enables us to generate example images for a class to get a sense of what was learned. Moreover, the modular structure of the proposed model makes it possible to learn relations over multiple images at a time. Finally, because of the logical structure that we impose, noisy labels can also be handled. We also discuss how the learned representation can be evaluated on well-known datasets, in terms of both reconstructability (i.e., generative capabilities) and classification accuracy.
At the outset, from an engineering (as opposed to mathematical) viewpoint, it should be noted that, since circuit software packages have not reached the same maturity as deep learning packages, the reported accuracy is not as competitive as that of state-of-the-art systems. Nonetheless, although the reported accuracy is lower, the model has considerably more functionality, including the ability to sample prototypical images for each of the learned classes, ultimately aiding us in understanding what has been learned by the model (i.e., visually showing us what the model thinks a given class represents).
2 Preliminaries
2.1 Probabilistic Sentential Decision Diagrams
Sentential decision diagrams (SDDs) were first introduced in [Darwiche, 2011] and are tractable representations of propositional knowledge bases. SDDs are shown to be a strict subset of deterministic decomposable negation normal form (d-DNNF), a popular representation for probabilistic reasoning applications [Chavira and Darwiche, 2008] due to its desirable properties. Decomposability and determinism especially ensure tractable probabilistic inference. PSDDs extend SDDs with probabilities, and are a complete and canonical representation of joint probability distributions [Kisa et al., 2014].
Intuitively, PSDDs are parametrized directed acyclic graphs (DAGs), as seen in Figure 1. Here, each terminal node represents a univariate (Bernoulli) distribution over a binary variable (e.g. ) with a probability represented by the tuple . Within the graph, each internal node is either an AND or an OR node. An AND node has two inputs, termed the prime (the left one) and the sub (the right one). An OR node can have an arbitrary number of inputs, where each of the input wires is annotated with a probability; together, these make up a normalised distribution over the variables represented by the corresponding vtree. Moreover, OR and AND gates always alternate, such that a given OR node can also be represented as a set of AND nodes or decisions: .
In order to retain the desirable properties of SDDs for inference (tractability) and canonicity, similar syntactic restrictions hold here as well. Firstly, each of the AND gates has to be decomposable, meaning that the vtree nodes represented by the prime and the sub share no variables. In other words, the prime and the sub have to represent probability distributions over disjoint sets of variables. Analogously, determinism demands that for each possible world (or assignment), there can be at most one prime that assigns a nonzero probability to that world.
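To make decomposability and determinism concrete, the following is a minimal sketch of a toy PSDD over two Boolean variables. The class names (`Literal`, `Bernoulli`, `Decision`), the variable names `A` and `B`, and all parameter values are hypothetical and chosen purely for illustration; they are not taken from any PSDD software package.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Literal:
    """Deterministic terminal: probability 1 if the variable has the given sign."""
    var: str
    positive: bool

    def prob(self, world):
        return 1.0 if world[self.var] == self.positive else 0.0


@dataclass(frozen=True)
class Bernoulli:
    """Terminal node holding Pr(var = True) = theta."""
    var: str
    theta: float

    def prob(self, world):
        return self.theta if world[self.var] else 1.0 - self.theta


@dataclass(frozen=True)
class Decision:
    """OR node whose inputs are AND nodes given as (prime, sub, weight) triples."""
    elements: tuple

    def prob(self, world):
        # Determinism: at most one prime is non-zero for any world.
        # Decomposability: prime and sub mention disjoint variables,
        # so their probabilities simply multiply.
        return sum(w * p.prob(world) * s.prob(world) for p, s, w in self.elements)


# Primes A and not-A are mutually exclusive; the subs are distributions over B only.
toy = Decision(elements=(
    (Literal("A", True), Bernoulli("B", 0.9), 0.3),
    (Literal("A", False), Bernoulli("B", 0.2), 0.7),
))

worlds = [dict(zip(("A", "B"), values)) for values in product([False, True], repeat=2)]
total = sum(toy.prob(w) for w in worlds)
print(total, toy.prob({"A": True, "B": True}))
```

Because the two primes partition the assignments to `A` and the wire weights sum to one, the probabilities of all four worlds sum to one without any explicit normalisation step.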
In [Liang et al., 2017, Bekker et al., 2015], a learning regime is proposed that is capable of learning a PSDD as well as the underlying SDD and vtree [Pipatsrisawat and Darwiche, 2010] directly from data. It works by iteratively updating the structure of the PSDD to better fit the data, applying specified clone and split operations to the PSDD at each step. Learning is then carried out until a time limit is reached or a predefined score converges on the validation data (if present, otherwise on the training data). This score is based on the log-likelihood of the model given the data, but takes the size of the graph into account as well. The log-likelihood of a PSDD given data is then a sum of log-likelihood contributions per node:
(1) 
where is the number of examples that satisfy the node context of and the base of prime .
Additionally, [Liang et al., 2017] also proposed an algorithm for learning ensembles of PSDDs (EMLearnPSDD) which is built on the learnPSDD algorithm and the soft structural EM algorithm by [Friedman, 1998]. This algorithm consists of two nested learners, where the outer EM is learning the structure and the inner EM is learning the parameters. This is the algorithm we predominantly use in our experiments.
2.2 Neural Networks & Autoencoders
An AE is a specific instance of an artificial neural network (NN) that is intended to reproduce the given input as its output [Goodfellow et al., 2016]. It consists of two parts: the encoder e and the decoder d. Internally, it has a hidden layer referred to as the feature layer (FL) such that for some input data and where is the reconstructed input, , and dim is the dimensionality of the FL. Restrictions are usually imposed on the structure of the network, such as reduced dimensionality in the FL or added noise on the input. By reducing the dimension of the FL relative to the dimension of the input , the network is intended to learn the most valuable features for reconstructing the original image. The learning procedure then consists of learning the encoder and decoder functions simultaneously by minimising the reconstruction loss (e.g., mean squared error), penalising for being dissimilar to .
The variational autoencoder (VAE) [Kingma and Welling, 2013, Rezende et al., 2014] is a variant of the AE that builds on a stochastic generalisation of the classical AE architecture: instead of deterministic functions, e and d are stochastic mappings and . Thus, we can also view e and d as conditional probability distributions. Utilising this probabilistic interpretation, the VAE framework defines a distribution (e.g., the Gaussian distribution) such that FL samples can be drawn from that distribution. We can then use the Kullback-Leibler divergence ( ) to push the encoder network to be as similar as possible to our chosen distribution while at the same time maximizing the likelihood of the reconstructed input, which is achieved by updating the weights based on the gradient of:
(2)
The reparameterization trick [Kingma and Welling, 2013] then makes stochastic gradient descent possible by reformulating the sampling step in terms of stochastic input layers.
Finally, we leverage the Gumbel-Max trick [Gumbel, 1954, Maddison et al., 2014], which yields the Gumbel-Softmax distribution, a continuous distribution over the simplex that can approximate samples from a categorical distribution [Jang et al., 2016].
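As an illustration of the Gumbel-Softmax relaxation mentioned above, the following is a minimal standard-library sketch. The function names, the temperature value, and the example class probabilities are assumptions made for illustration only; an actual VAE implementation would use a differentiable tensor library rather than plain Python lists.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one sample from the standard Gumbel(0, 1) distribution."""
    u = rng.random()
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # guard against log(0)
    return -math.log(-math.log(u))


def gumbel_softmax(logits, temperature, rng):
    """Relaxed categorical sample: softmax((logits + Gumbel noise) / temperature)."""
    noisy = [(l + sample_gumbel(rng)) / temperature for l in logits]
    m = max(noisy)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
probs = [0.1, 0.6, 0.3]                      # hypothetical class probabilities
logits = [math.log(p) for p in probs]
soft = gumbel_softmax(logits, temperature=0.5, rng=rng)
hard = max(range(len(soft)), key=soft.__getitem__)  # discrete value via argmax
print(soft, hard)
```

Lower temperatures push the relaxed sample towards a one-hot vector, while the argmax of the relaxed sample is distributed according to the underlying categorical distribution (the Gumbel-Max property).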
3 Methodology & Evaluation Metrics
In this section, we propose a novel model for representing and learning a symbolic generative model from a neural network that is trained over unstructured data . This is possible by means of the intermediate FL, defined over discrete variables. The model is to be considered as generative but over a set of domains, meaning that we can perform conditional sampling for a given domain with respect to other domains. Domains, written as represent disjoint subsets of . For example, suppose we are given an image dataset such as MNIST. Here, the images are denoted by domain (say, ) and the corresponding labels by domain (say, ) such that . The model is then able to approximate the distributions and moreover, can sample from ) and .
3.1 Architecture
We now discuss the formal and architectural components of our model.
Definition 1
: (Feature layer) The FL is a finite set of discrete (typically, Boolean) variables; that is, for some . Intuitively, FL represents an encoded discretized version of the original data . If the data is split into different domains (disjoint sets, as explained above) , then the FL can also be split into disjoint domains: . The size of FL is then the number of variables and clearly, . Further, for a given domain , we use to denote the number of possible values that the discrete variables can take.
3.1.1 Encoders and Decoder Specification
For a given domain , we have an encoder and a decoder such that and . This is like the usual AE setup, where we map from the input domain (e.g. ) to an intermediate representation (e.g. ) using the encoder and then map back to the original domain using the decoder. This formulation also works for stochastic encoder/decoder networks with and utilising the Gumbel-Softmax distribution.
Essentially, encoders and decoders are functions of varying complexity depending on the domain. While some encoder-decoder pairs may be learned from data, others can be defined deterministically. Revisiting the MNIST dataset, for example, we may define domain to be the images (e.g. ) and domain to represent the corresponding labels (e.g. ). In particular, the encoder and decoder for domain ( ) are deep convolutional neural networks. As for domain , the encoder/decoder can be defined as the one-hot encoding and the function respectively (, ).
The learning of these functions (if applicable) is done in an unsupervised manner using VAEs in our setup and is referred to as learning phase I throughout this article. A visualization of the pipeline can be seen in Figure 2.
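A deterministic encoder-decoder pair for the label domain could look like the sketch below. It assumes that the decoding function paired with the one-hot encoding is an argmax over the label variables, which is a natural choice but is an assumption here; the function names are likewise hypothetical.

```python
def one_hot_encode(label, num_classes):
    """Encode an integer class label as a list of Boolean (0/1) FL variables."""
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    return [1 if i == label else 0 for i in range(num_classes)]


def argmax_decode(bits):
    """Decode FL label variables back to a class index (assumed argmax decoder)."""
    return max(range(len(bits)), key=bits.__getitem__)


encoded = one_hot_encode(3, num_classes=10)
decoded = argmax_decode(encoded)
print(encoded, decoded)
```

Since neither function has trainable parameters, this domain needs no learning in phase I; only the image domain requires a learned encoder-decoder pair.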
3.1.2 The Logical Interpretation
The (logical) generative model represents the dependencies between the individual variables of the FL; in other words, it represents the joint probability distribution over the variables of the FL, i.e., . In this paper, we chose to use PSDDs [Kisa et al., 2014], though other models such as sum-product networks (SPNs) [Poon and Domingos, 2011] could have been used as well. This choice was based on the reported ability of PSDDs to handle constraints in the learning regime. These could be one-of label constraints or any other kind of Boolean function over the inputs.
3.1.3 Learning
The learning of the system is done in two phases (see Figure 2). Learning phase I denotes the learning of the encoder-decoder function pairs for each domain (independently and unsupervised). Once learning phase I is completed, we can use the encoders to map the data to the FL representation and learn the logical generative model (learning phase II), that is, the PSDD. Essentially, the variables of the FL are the propositions of the PSDD.
3.2 Querying
Due to the generative property of PSDDs, we are able to perform any query of the form: where is the query and is the evidence, both Boolean functions over the variables of the FL. Furthermore, by Theorem 7 of [Kisa et al., 2014] such probabilities can be computed in one pass through the tree and thus in polynomial time w.r.t. the size of the graph. (Thus, these are referred to as tractable models in the literature.) Specifically, in the MNIST example, such queries would take the form: where and correspond to two domains of the data ( and ).
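The semantics of such a conditional query can be illustrated with a brute-force sketch over a tiny joint distribution. This enumeration is exponential in the number of variables and is only a stand-in for the polynomial-time PSDD computation; the toy distribution, the variable names (`f1`, `f2`, `y`), and the helper names are all hypothetical.

```python
from itertools import product


def toy_joint(world):
    """Hypothetical joint over two feature variables f1, f2 and a label variable y."""
    p = 0.5                                   # Pr(y) = 0.5 for either value
    p_f1 = 0.9 if world["y"] else 0.1         # f1 depends on the label
    p *= p_f1 if world["f1"] else 1.0 - p_f1
    p *= 0.5                                  # f2 is an independent fair coin
    return p


def conditional_probability(joint, variables, query, evidence):
    """Pr(query | evidence), where query and evidence are Boolean functions of a world."""
    numerator = denominator = 0.0
    for values in product([False, True], repeat=len(variables)):
        world = dict(zip(variables, values))
        p = joint(world)
        if evidence(world):
            denominator += p
            if query(world):
                numerator += p
    if denominator == 0.0:
        raise ValueError("evidence has zero probability")
    return numerator / denominator


variables = ["f1", "f2", "y"]
posterior = conditional_probability(
    toy_joint, variables,
    query=lambda w: w["y"],        # is the label variable true?
    evidence=lambda w: w["f1"],    # given that feature f1 is true
)
print(posterior)
```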
3.2.1 Generative Query
Given evidence , we define the task of a generative query as one that samples values for all variables in the FL which are not assigned in the evidence. That is, , which is discussed in Algorithm 1 and is equivalent to .
As mentioned before, the resulting assignments of variables returned by the algorithm can then be decoded using the decoder d.
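The idea of a generative query can be sketched as follows. This is not a reproduction of Algorithm 1: it is a brute-force stand-in that enumerates all completions of the evidence and draws one in proportion to its joint probability, whereas a PSDD would sample top-down through the circuit. The toy distribution and all names are hypothetical.

```python
import random
from itertools import product


def toy_joint(world):
    """Hypothetical joint over two feature variables f1, f2 and a label variable y."""
    p_f1 = 0.9 if world["y"] else 0.1
    p = 0.5 * (p_f1 if world["f1"] else 1.0 - p_f1)
    return p * 0.5  # f2 is an independent fair coin


def generative_query(joint, variables, evidence, rng):
    """Sample values for all variables not fixed by the evidence (a dict var -> bool)."""
    free = [v for v in variables if v not in evidence]
    worlds, weights = [], []
    for values in product([False, True], repeat=len(free)):
        world = dict(evidence)
        world.update(zip(free, values))
        worlds.append(world)
        weights.append(joint(world))
    if sum(weights) == 0.0:
        raise ValueError("evidence has zero probability")
    # Draw one completion proportionally to its joint probability.
    return rng.choices(worlds, weights=weights, k=1)[0]


rng = random.Random(0)
variables = ["f1", "f2", "y"]
sample = generative_query(toy_joint, variables, {"y": True}, rng)
print(sample)  # the feature values in the sample would then be passed to the decoder
```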
3.3 Evaluation
In order to evaluate our model on image datasets, we focus on two main aspects: first, classification accuracy, and second, the recoverability of the trained model in terms of how it interprets a given domain, as explained below.
Classification accuracy is used as a quantifiable score that is easily comparable to other learning systems. We train the model on multiple image datasets before asking the model to classify unseen images ( ) into one of the possible categories () using the following formulation (here denotes a sample drawn from a distribution):
(3)
(4)  
(5) 
We investigate the interpretability of the model by manually inspecting samples drawn from the distribution for some evidence. For the MNIST dataset, for example, we sample images for each category or class and check if the images correspond to the class. Such samples will be computed as follows (where denotes the images as before, and are the class labels):
(6)  
(7)  
(8) 
Finally, we analyse the variables of the FL. This sheds some light on the inner workings of the model, and gives us an insight into what the individual variables capture. Basically, we approximate the expectation of decoded FL samples where, in a binary setting, samples would be drawn conditional on a specific variable being true or false:
(9) 
This is approximated by:
(10) 
Here we define the decoded image to be pixels in width and pixels in height. Then each greyscale image is normalized to be an element of . What follows is that and as such it has to be normalized accordingly in order to produce an image.
(11) 
Here, is depicted as a tuple of images corresponding to the variable being true vs. false and vice versa.
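The Monte Carlo approximation described above can be sketched as below. The sketch assumes that decoded images are flattened lists of pixel intensities in [0, 1], and that the quantity being normalised is the difference between the mean image with the variable true and the mean image with the variable false, which lies in [-1, 1] and is rescaled linearly to [0, 1]. Both the difference and the rescaling are assumptions made for illustration, and the function names are hypothetical.

```python
def mean_image(images):
    """Pixel-wise mean of a list of equally sized, flattened greyscale images."""
    n = len(images)
    return [sum(pixels) / n for pixels in zip(*images)]


def contrast_image(images_true, images_false):
    """Rescaled difference of mean images for a variable being true vs. false."""
    mean_true = mean_image(images_true)
    mean_false = mean_image(images_false)
    # The difference lies in [-1, 1]; map it linearly onto [0, 1] for display.
    return [(a - b + 1.0) / 2.0 for a, b in zip(mean_true, mean_false)]


# Two-pixel placeholder "images" standing in for decoded FL samples.
decoded_true = [[1.0, 0.0], [1.0, 0.0]]
decoded_false = [[0.0, 0.0], [0.0, 1.0]]
result = contrast_image(decoded_true, decoded_false)
print(result)
```

Under this rescaling, pixels near 0.5 are unaffected by the variable, while values close to 1 or 0 indicate pixels that the variable switches on or off.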
4 Experiments
In this section, we investigate the predictive accuracy on unseen data (via a held-out test set) as well as the generative power of the model. Firstly, we consider the standard classification task. Secondly, we run experiments where noise is added to the label of each entry; that is, each training entry has additional random labels specified (in addition to the correct one). Thirdly, we explore tasks which consist of at least two images and possibly a symbolic value. In one such experiment, for example, a data point is defined over two images representing successive integers, and we are then interested in generating one image given the other (e.g., generate 7 if the first image is 6). These are referred to as functional tasks. We conclude with an analysis of the FL. A comprehensive listing of the experimental setups and the experiments run is given in the supplementary appendix.
4.0.1 Data
In order to get a comprehensive understanding of the capabilities of the proposed model, we used three different datasets. First, we used the MNIST dataset [LeCun et al., 1998], containing 28x28 (grayscale) images that represent handwritten digits, along with the corresponding class labels. After the first set of experiments on the MNIST dataset, we took the hyperparameters of the best performing models and reran the experiments on the FASHION dataset [Xiao et al., 2017], which contains 28x28 (grayscale) images of fashion items whose labels belong to one of 10 categories. Finally, to investigate the scalability of the model, we used the EMNIST (extended MNIST) dataset [Cohen et al., 2017], where the images are handwritten numbers and letters of the English alphabet with the corresponding labels (47 classes in total).
4.0.2 Hardware
Since most experiments involved two training phases using very different optimization methods, we made use of two different cluster architectures in order to improve performance. Learning phase I, which is concerned with learning the parameters of a deep NN using minibatch gradient descent and backpropagation, was run on GPU clusters. The cluster nodes used here are a combination of Dell PowerEdge R730 and Dell PowerEdge T630 machines. Each has two 16-core Xeon CPUs, and the GPUs are NVIDIA Tesla K40m, GeForce GTX Titan X and GeForce Titan X cards. Learning phase II, on the other hand, uses the learnPSDD structure learning algorithm, and this was run on CPU clusters, where each of the 21 nodes is a Dell PowerEdge R815 with four 16-core Opteron CPUs and 256GB of memory.
4.1 Classification Task
Given training examples consisting of images and their labels, we trained the encoder-decoder pair unsupervised in the first instance, and the PSDD on the whole FL ( + representing the image and label respectively) in the second instance. The hyperparameters explored here include the vtree search algorithm used, as well as the option of compressing the label from a one-hot encoding to a binary representation (e.g. ). Furthermore, we varied the number of variables in and the categorical dimension of such variables (denoted by and respectively). Note that the categorical dimension essentially corresponds to whether we interpret the features in a Boolean space or a finite multi-valued space.
4.1.1 MNIST
The best classification accuracy on the MNIST dataset was measured at: using 32 binary variables (and a categorical dimension of 2) and a one-hot encoded . In comparison, we note that discriminative models such as convolutional NNs achieve 99.3% [LeCun et al., 1998] on the same task. A more comprehensive overview is given in Figure 3. Here we can see that there is a clear trade-off between the expressiveness of the FL and the ability of the PSDD learner to interpret this FL. In other words, if the FL is too small, the neural model will not be able to learn a meaningful mapping that retains valuable information in the encoding, and if the FL is too large, the PSDD learner struggles to find correlations between the variables.
From the best setup above, to test the generative abilities, we sampled images for each of the 10 categories using the proposed conditional-sampling algorithm. These samples are depicted in Figures 3(a), 3(b) for 2 of the 10 classes. Since these are samples, we should expect to see some variation, and the figures show this. In a sense, the system demonstrates a prototypical understanding of what the labels represent. It is interesting to relate this insight to approaches such as [Lake et al., 2015] that involve an explicit token construction framework for generating images. We imagine that it might be possible to use the variables induced in our framework as a token generator, which we leave for the future.
4.1.2 Fashion
For the FASHION dataset, we achieve a classification accuracy of using the hyperparameters of the best performing MNIST model. Interestingly enough, the reconstruction loss (binary cross-entropy) of the neural model is smaller (thus, better) in this scenario than in the MNIST case, and the PSDD score is larger (thus, better) as well. The predictive accuracy on the held-out test set is nevertheless considerably lower. This can be due to many reasons, most notably perhaps the additional complexity of the images. The additional variability of FASHION influences the computed reconstruction loss of the VAE, as it computes an average over the pixel differences between the original image and the reconstructed one. For testing the generative abilities, once again, we sampled images for 2 of the 10 classes, as shown in Figures 3(c) and 3(d).
4.1.3 EMNIST
Finally, when running experiments on the EMNIST dataset, we found that the system is not quite capable of scaling to such a large number of image classes, which we suspect seriously affects the performance of the PSDD learner. Additionally, the VAE is confronted with a much more complex task in differentiating symbols (e.g., “1” and “l”). Here we only recorded an overall best accuracy of , where would be the expected random accuracy. It is an interesting question for the future to consider how to handle so many image classes with a Boolean learner.
4.2 Noisy Label Task
As mentioned earlier, we are interested in challenging the system by providing randomly generated additional labels to the correct one during training. To evaluate the experiment, we computed the accuracy on a held out set (for MNIST) only containing the right label (no noise). Generally speaking, as expected, we observed a decreasing accuracy with increasing noise: for example, adding one additional label (noisy1) decreases accuracy by to . Even if 2 and 3 noise labels are added, the accuracy only decreases to and respectively. Intuitively, the experiment requires the PSDD to reason about the possible labels for a given image and thus, we show that the logical model performs this reasoning in a satisfactory manner.
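One way to construct such noisy training labels is sketched below. It assumes that the additional labels are drawn uniformly without replacement from the incorrect classes and that the result is stored as a multi-hot vector over the label variables; the exact sampling and encoding scheme is an assumption made for illustration, and the function name is hypothetical.

```python
import random


def noisy_multi_hot(label, num_classes, num_noise, rng):
    """Multi-hot label vector holding the correct class plus num_noise random wrong ones."""
    others = [c for c in range(num_classes) if c != label]
    noise = rng.sample(others, num_noise)  # distinct incorrect classes
    active = {label, *noise}
    return [1 if c in active else 0 for c in range(num_classes)]


rng = random.Random(0)
vector = noisy_multi_hot(label=4, num_classes=10, num_noise=2, rng=rng)
print(vector)
```

At test time, the held-out entries would keep only the single correct label, so that accuracy measures how well the model has learned to disregard the spurious ones.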
4.3 Functional Tasks
The idea here is to have training examples consisting of at least two images and maybe a symbolic value that denotes the relationship between these images. However, no semantic characterisation is provided for this symbolic value in our setup, so the system tries to map the image pairs to the value purely from visual features. (Thus, the machinery of logically defining such functions, as seen in, e.g., DeepProbLog [Manhaeve et al., 2018], could be used to extend our framework further.)
The simplest one is where we provide an image, and expect the system to generate a second image such that the integers present in the images are successors. This demonstration is depicted in Figure 5, where we observe that the predecessor/successor integer’s image was generated successfully.
In an additional set of experiments, we also provide a symbolic variable () that is the evaluation of a mathematical function over the two images. One example of such a function is the Boolean XOR. Here, we first train the unsupervised VAE on the whole (e.g., MNIST) data, and then create a custom training dataset where each entry contains two images, each either “0” or “1”, and the result of applying the logical XOR operation to the labels of these two images (). Thus, the FL in this task is made of three individual parts: representing two images and , representing the evaluation of the XOR function on the original labels of the two images. We can then evaluate the accuracy of the predicted symbolic value on a held-out test set. Conversely, we can sample one of the images given the other image and a specified value. To reiterate, this is purely visual reasoning; to make that clear, we also repeat the experiment with the FASHION dataset, treating T-shirts and Trousers as corresponding to true and false respectively (e.g., ). (All other digits in MNIST and all other image classes in FASHION are discarded.) The classification accuracies that we measured on a held-out test set were and for MNIST and FASHION respectively. Generated samples are shown in Figure 6. As an example of a more complex function, we also conducted experiments on MNIST, where and range over all images present in the dataset but constitutes the result of the arithmetic plus operation on the original labels of the two images, such that . Here, we have many more possible values and multiple combinations of images that correspond to the same value. However, the recorded classification accuracy on the held-out test set is only . Thus, the conclusion to be drawn here is that although the logical generative model does allow us to formulate challenging tasks over mathematical and logical functions, it currently only resolves them in terms of the visual features.
So a second interesting direction for the future is to understand how to go beyond this and find a way to incorporate (or learn) the semantic meaning of the mathematical function. PSDDs [Liang et al., 2017], for example, can be trained with constraints which might offer a possible way to make progress in this direction.
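The construction of the XOR training set described above can be sketched as follows. The sketch assumes that images are already grouped by their original label and that pairs are drawn uniformly at random; the function name and the string placeholders standing in for images are hypothetical.

```python
import random


def make_xor_dataset(images_by_label, num_entries, rng):
    """Build entries (image_a, image_b, xor_value) from images labelled 0 or 1."""
    dataset = []
    for _ in range(num_entries):
        label_a, label_b = rng.randint(0, 1), rng.randint(0, 1)
        image_a = rng.choice(images_by_label[label_a])
        image_b = rng.choice(images_by_label[label_b])
        # Only the XOR of the two labels is stored; the labels themselves are dropped.
        dataset.append((image_a, image_b, label_a ^ label_b))
    return dataset


# String placeholders stand in for image arrays of the digits "0" and "1".
images_by_label = {0: ["zero_a", "zero_b"], 1: ["one_a", "one_b"]}
rng = random.Random(0)
xor_data = make_xor_dataset(images_by_label, num_entries=100, rng=rng)
print(xor_data[:3])
```

Each image of a pair would then be passed through the image encoder, and the symbolic value would occupy its own part of the FL, so that the PSDD sees only the encoded features and the XOR result.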
4.4 FL Analysis
To understand what a given variable in the FL actually represents, we use the generative query algorithm and Equation 11. To evaluate this more concretely, we sample images for our best performing model on MNIST and FASHION. In Figure 7, we computed five times with for each variable of . In Figure 7, each row corresponds to one of the first 14 variables of the FL (in order from 1 to 14, top to bottom). What this demonstrates is that the model accords meaningful elements of the images to each variable of the FL. Indeed, we see that individual variables correspond to different shapes such as slightly bent lines in row/variable 1 of Figure 6(a) or circular objects in row/variable 4. In a sense, the FL is able to identify discrete visual components for the images, which hints at a compact and compositional understanding of the domain in question.
5 Conclusion & Discussion
In this work, we were interested in understanding what precisely the latent space of AEs captures, and whether that space could be inspected from a logical viewpoint. In that regard, we motivated the learning of a symbolic generative model on the FL, which allows us to inspect the hidden layers and perform logical reasoning over the variables of these layers. For example, by means of a conditional sampling algorithm, we were able to generate prototypical images for a label, and moreover, generate labels for images.
As mentioned previously, with regard to the standard classification task, we see that our model cannot compete with state-of-the-art systems such as deep convolutional neural networks (CNNs). However, when making such comparisons, one should consider that discriminative models such as CNNs are not generative, whereas generative models (e.g., VAEs) are not appropriate for classifying images as they may not discriminate between individual images.
Although one might, in general, consider a lower performance to be a reasonable trade-off in exchange for increased functionality and interpretability, we do not think this is a “fundamental” trade-off: our observation has been that the PSDD software seems more capable of handling intricate Boolean reasoning than SPNs, but both struggle somewhat when considering multi-valued discrete variables. Note that by increasing the “encoding” space, we allow for more granular reconstructions of the latent space, and so we should expect to approach the performance of state-of-the-art models. Unfortunately, however, the PSDD software struggles when considering many multi-valued discrete variables, so some engineering effort is required. In contrast, many conventional deep learning software packages have benefited from considerable optimization.
In addition to classifying images, the model was put to the test in challenging tasks capturing structural, logical or mathematical relationships between pairs of images, as well as the handling of noisy labels. While we did observe scalability issues when considering a very large set of classes, the underlying framework still offers an insightful logical view of the hidden layers. This opens up interesting avenues for the future, such as integrating our framework with existing neuro-symbolic frameworks. In particular, can one of these frameworks provide a way to reason about mathematical functions in a semantic manner (and perhaps also learn them), rather than through the purely visual quality exploited in the current setup? Can proposals from statistical relational learning [Getoor and Taskar, 2007] help us capture and reason about intricate logical relationships between variables? The overall goal, then, is to get a better grasp of how abstract concepts and high-level reasoning might emerge from low-level sensory data. We hope that this work, which attempts to repurpose a deep learning framework in logical space, provides some insight into how that is possible, and at the same time shows the benefits of using a symbolic generative model in a differentiable latent space.
Acknowledgements
We would like to thank John Quinn for his valuable input on this work and many fruitful conversations. Anton Fuxjaeger was supported by the Engineering and Physical Sciences Research Council (EPSRC) Centre for Doctoral Training in Pervasive Parallelism (grant EP/L01503X/1) at the University of Edinburgh, School of Informatics. Vaishak Belle was supported by the Royal Society University Research Fellowship. We would also like to thank our reviewers for their helpful suggestions.
References
 [Bekker et al., 2015] Bekker, J., Davis, J., Choi, A., Darwiche, A., and Van den Broeck, G. (2015). Tractable learning for complex probability queries. In Advances in Neural Information Processing Systems, pages 2242–2250.

 [Bošnjak et al., 2017] Bošnjak, M., Rocktäschel, T., Naradowsky, J., and Riedel, S. (2017). Programming with a differentiable forth interpreter. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 547–556. JMLR.org.
 [Chavira and Darwiche, 2008] Chavira, M. and Darwiche, A. (2008). On probabilistic inference by weighted model counting. Artificial Intelligence, 172(6–7):772–799.
 [Cohen et al., 2017] Cohen, G., Afshar, S., Tapson, J., and van Schaik, A. (2017). Emnist: an extension of mnist to handwritten letters. arXiv preprint arXiv:1702.05373.
 [Darwiche, 2011] Darwiche, A. (2011). Sdd: A new canonical representation of propositional knowledge bases. In IJCAI Proceedings - International Joint Conference on Artificial Intelligence, volume 22, page 819.
 [De Raedt and Kimmig, 2015] De Raedt, L. and Kimmig, A. (2015). Probabilistic (logic) programming concepts. Machine Learning, 100(1):5–47.
 [Dumancic et al., 2018] Dumancic, S., Guns, T., Meert, W., and Blockeel, H. (2018). Auto-encoding logic programs. In International Conference on Machine Learning, Stockholm, Sweden.
 [Friedman, 1998] Friedman, N. (1998). The bayesian structural em algorithm. In Proceedings of the Fourteenth conference on Uncertainty in artificial intelligence, pages 129–138. Morgan Kaufmann Publishers Inc.
 [Garcez et al., 2012] Garcez, A. S. d., Broda, K. B., and Gabbay, D. M. (2012). Neural-symbolic learning systems: foundations and applications. Springer Science & Business Media.
 [Gens and Domingos, 2012] Gens, R. and Domingos, P. (2012). Discriminative learning of sum-product networks. In Advances in Neural Information Processing Systems, pages 3239–3247.
 [Getoor and Taskar, 2007] Getoor, L. and Taskar, B. (2007). Introduction to statistical relational learning.
 [Goodfellow et al., 2016] Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep learning, volume 1. MIT Press, Cambridge.
 [Gumbel, 1954] Gumbel, E. J. (1954). Statistical theory of extreme values and some practical applications: a series of lectures, volume 33. US Government Printing Office.
 [Jang et al., 2016] Jang, E., Gu, S., and Poole, B. (2016). Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144.
 [Kingma and Welling, 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
 [Kisa et al., 2014] Kisa, D., Van den Broeck, G., Choi, A., and Darwiche, A. (2014). Probabilistic sentential decision diagrams. In KR.
 [Lake et al., 2015] Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338.
 [LeCun et al., 1998] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., et al. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.
 [Liang et al., 2017] Liang, Y., Bekker, J., and Van den Broeck, G. (2017). Learning the structure of probabilistic sentential decision diagrams. In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence (UAI).
 [Liang and Van den Broeck, 2019] Liang, Y. and Van den Broeck, G. (2019). Learning logistic circuits. In Proceedings of the 33rd Conference on Artificial Intelligence (AAAI).

 [Lin et al., 2015] Lin, Y., Liu, Z., Sun, M., Liu, Y., and Zhu, X. (2015). Learning entity and relation embeddings for knowledge graph completion. In Twenty-ninth AAAI conference on artificial intelligence.
 [Maddison et al., 2014] Maddison, C. J., Tarlow, D., and Minka, T. (2014). A* sampling. In Advances in Neural Information Processing Systems, pages 3086–3094.
 [Manhaeve et al., 2018] Manhaeve, R., Dumančić, S., Kimmig, A., Demeester, T., and De Raedt, L. (2018). Deepproblog: Neural probabilistic logic programming. arXiv preprint arXiv:1805.10872.
 [Niepert et al., 2016] Niepert, M., Ahmed, M., and Kutzkov, K. (2016). Learning convolutional neural networks for graphs. In International conference on machine learning, pages 2014–2023.
 [Pipatsrisawat and Darwiche, 2010] Pipatsrisawat, T. and Darwiche, A. (2010). A lower bound on the size of decomposable negation normal form. In AAAI.
 [Poon and Domingos, 2011] Poon, H. and Domingos, P. (2011). Sum-product networks: A new deep architecture. In Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, pages 689–690. IEEE.
 [Rezende et al., 2014] Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082.
 [Rocktäschel and Riedel, 2017] Rocktäschel, T. and Riedel, S. (2017). End-to-end differentiable proving. In Advances in Neural Information Processing Systems, pages 3788–3800.
 [Santoro et al., 2017] Santoro, A., Raposo, D., Barrett, D. G., Malinowski, M., Pascanu, R., Battaglia, P., and Lillicrap, T. (2017). A simple neural network module for relational reasoning. In Advances in neural information processing systems, pages 4967–4976.
 [Serafini and Garcez, 2016] Serafini, L. and Garcez, A. d. (2016). Logic tensor networks: Deep learning and logical reasoning from data and knowledge. arXiv preprint arXiv:1606.04422.
 [Simonyan et al., 2013] Simonyan, K., Vedaldi, A., and Zisserman, A. (2013). Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034.
 [Szegedy et al., 2013] Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. (2013). Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.
 [Xiao et al., 2017] Xiao, H., Rasul, K., and Vollgraf, R. (2017). Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.
 [Xu et al., 2017] Xu, J., Zhang, Z., Friedman, T., Liang, Y., and den Broeck, G. V. (2017). A semantic loss function for deep learning with symbolic knowledge. CoRR, abs/1711.11157.

 [Yang et al., 2014] Yang, M.-C., Duan, N., Zhou, M., and Rim, H.-C. (2014). Joint relational embeddings for knowledge-based question answering. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 645–650.
 [Yosinski et al., 2015] Yosinski, J., Clune, J., Nguyen, A., Fuchs, T., and Lipson, H. (2015). Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579.
 [Zeiler and Fergus, 2014] Zeiler, M. D. and Fergus, R. (2014). Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer.