1 Introduction
In the last few years, the availability of large amounts of supervised data has led to a significant improvement in the performance of subsymbolic approaches like artificial neural networks. In particular, deep neural networks have achieved impressive results in several tasks, thanks to their ability to jointly learn the decision function and the data representation from the low-level perception inputs [13, 24]. However, the dependency on the amount and quality of training data is also a major limitation of this class of approaches. Standard neural networks can struggle to represent relational knowledge on different input patterns, or relevant output structures, which have been shown to bring significant benefits in many challenging applications like image segmentation tasks [3]. For this reason, much work has been done in the direction of learning and representing relations using embeddings [17, 31, 43, 8, 33, 1] and in developing and injecting relational features into the learning process [38, 34].
On the other hand, symbolic approaches [4, 21, 32] are generally based on probabilistic logic reasoners, and can express high-level relational dependencies in a certain domain of discourse and perform exact or approximate inference in the presence of uncertainty. Markov Logic Networks (MLN) [35] and their variants like Probabilistic Soft Logic [2] are relational undirected models that map First-Order Logic formulas to a Markov network, and allow training the parameters of the reasoner and performing inference under uncertainty.
Another related line of research studies hybrid approaches leveraging neural networks to learn the structure of the reasoning process, as done, for instance, by Relational Restricted Boltzmann Machines [20] and auto-encoding logic programs [9]. Similarly, Neural Markov Logic Networks [30] extend MLN by defining the potential functions as general neural networks which are trained together with the model parameters. Neural Theorem Prover [36, 37] is an end-to-end differentiable prover that shows state-of-the-art performance on some link prediction benchmarks by combining Prolog backward chaining with a soft unification scheme. TensorLog [45, 19] is a recent framework that reuses the deep learning infrastructure of TensorFlow to perform probabilistic logical reasoning.
Whereas the previously discussed methods provide a large step forward in the definition of a flexible and data-driven reasoning process, they still do not allow co-optimizing the low-level learners processing the environmental data. Methods bridging the gap between symbolic and subsymbolic levels are commonly referred to as neurosymbolic approaches [11, 20, 40]. An early attempt to integrate learning and reasoning is the work by Lippi et al. [25]. The main limitation of this work is that it was devised ad hoc to solve a specific task in bioinformatics and it does not define a general methodology to apply it to other contexts.
A methodology to inject logic into deep reinforcement learning has been proposed by Jiang et al. [18], while a distillation method to inject logic knowledge into the network weights is proposed by Hu et al. [16]. Deep Structured Models [3] define a schema to inject complex output structures into deep learners. The approach is general, but it focuses on imposing statistical structure on the output predictions rather than on logic knowledge. Hazan et al. [15] integrate learning and inference in Conditional Random Fields [42], but they also do not focus on logic reasoning. The Semantic Loss [44] translates the desired output structure of a learner into a loss, which can also accommodate logic knowledge. However, the loss and the resulting reasoning process are fixed, thus limiting the flexibility of the approach. Deep ProbLog [27] is a neurosymbolic approach, based on the probabilistic logic programming language ProbLog [4] and approximating the predicates via deep learning. This approach is very flexible but it is limited to cases where exact inference is possible, as it lacks a modular and scalable solution like the one proposed in this paper.
Deep Logic Models (DLM) [28] are instead capable of jointly training the sensory and reasoning layers in a single differentiable architecture, which is a major advantage with respect to related approaches like Semantic-based Regularization [5], Logic Tensor Networks [6] or Neural Logic Machines [7]. However, DLM is based on a brittle stacking of the learning and reasoning modules, failing to provide a truly tight integration with how the low-level learner employs the supervised data. For this reason, DLM requires heuristics like training plans to make learning effective.
This paper presents Relational Neural Machines (RNM), a novel framework introducing fundamental improvements over previous state-of-the-art models in terms of scalability and in the tightness of the connection between the trainer and the reasoner. An RNM is able to perfectly replicate the effectiveness of training from supervised data of standard deep architectures, while still co-training a reasoning module over the environment that is built during the learning process. The bonding is very general as any (deep) learner can be integrated and any output or input structure can be expressed. On the other hand, when restricted to pure symbolic reasoning, RNM can replicate the expressivity of Markov Logic Networks [35].
The outline of the paper is as follows. Section 2 presents the model and how it can be used to integrate logic and learning. Section 3 studies tractable approaches to perform inference and model training from supervised and unsupervised data. Section 4 shows the experimental evaluation of the proposed ideas on various datasets. Finally, Section 5 draws some conclusions and highlights some planned future work.
2 Model
A Relational Neural Machine establishes a probability distribution over a set of $n$ output variables of interest $\mathbf{y} = \{y_1, \ldots, y_n\}$, given a set of predictions made by one or multiple deep architectures, and the model parameters. In this paper the output variables are assumed to be binary, i.e. $y_i \in \{0, 1\}$, but the model can be extended to deal with continuous values for regression tasks.
Unlike standard neural networks, which compute the output via a simple forward pass, the output computation in an RNM can be decomposed into two stages: a low-level stage processing the input patterns, and a subsequent semantic stage, expressing constraints over the output and performing higher level reasoning. In this paper, it is assumed that there is a single network processing the input sensorial data, but the theory is trivially extended to any number of learners. The first stage processes the input patterns $\mathbf{x}$, returning the values $\mathbf{f}$ using the network with parameters $\mathbf{w}$. The higher layer takes $\mathbf{f}$ as input and applies reasoning using a set of constraints, whose parameters are indicated as $\boldsymbol{\lambda}$, then it returns the set of output variables $\mathbf{y}$.
An RNM defines a conditional probability distribution in the exponential family:
$$p(\mathbf{y} \mid \mathbf{f}, \boldsymbol{\lambda}) = \frac{1}{Z} \exp\Big(\sum_{c} \lambda_c \Phi_c(\mathbf{f}, \mathbf{y})\Big) \qquad (1)$$
where $Z$ is the partition function and the potentials $\Phi_c$ express some properties on the input and output variables. The parameters $\boldsymbol{\lambda} = \{\lambda_c\}$ determine the strength of the potentials $\Phi_c$.
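As a minimal sketch of the exponential-family form of Equation 1, the distribution over a handful of binary outputs can be computed by brute-force enumeration. The two potentials and their weights below are hypothetical and chosen only for illustration, and the dependence on the network outputs is assumed to be folded into the potential functions.

```python
import itertools
import math

def log_linear_distribution(potentials, weights, n_vars):
    """Return {assignment: probability} with p(y) proportional to exp(sum_c w_c * Phi_c(y))."""
    assignments = list(itertools.product([0, 1], repeat=n_vars))
    scores = [
        math.exp(sum(w * phi(y) for w, phi in zip(weights, potentials)))
        for y in assignments
    ]
    z = sum(scores)  # partition function: sum over all 2^n assignments
    return {y: s / z for y, s in zip(assignments, scores)}

# Hypothetical potentials over two binary outputs (not taken from the paper).
def agree(y):
    return 1.0 if y[0] == y[1] else 0.0  # rewards equal outputs

def first_on(y):
    return float(y[0])  # rewards y_0 = 1

dist = log_linear_distribution([agree, first_on], weights=[2.0, 0.5], n_vars=2)
most_probable = max(dist, key=dist.get)
```

The enumeration over all assignments is exactly what makes the partition function expensive for realistic numbers of output variables.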
This model can express a vast range of typical learning tasks. We start by reviewing how to express simple classification problems, before moving to general neurosymbolic integration mixing learning and reasoning. A main advantage of RNMs is that they can jointly express and solve these use cases, which have typically been studied as stacked separate problems.
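In a classical and pure supervised learning setup, the patterns are i.i.d., so it is possible to split $\mathbf{y}, \mathbf{f}$ into disjoint sets grouping the variables of each pattern, forming separate cliques. Let us indicate as $\mathbf{y}(x), \mathbf{f}(x)$ the portion of the output and function variables referring to the processing of an input pattern $x$. A single potential $\Phi_0$ is needed to represent supervised learning, and this potential decomposes over the patterns as:
$$\Phi_0(\mathbf{y}, \mathbf{f}) = \sum_{x \in S} \phi_0(\mathbf{y}(x), \mathbf{f}(x)) \qquad (2)$$
where $S$ is the set of supervised patterns. This yields the distribution (with the weight of the single potential absorbed into $\phi_0$),
$$p(\mathbf{y} \mid \mathbf{f}, \boldsymbol{\lambda}) = \frac{1}{Z} \exp\Big(\sum_{x \in S} \phi_0(\mathbf{y}(x), \mathbf{f}(x))\Big) \qquad (3)$$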
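One-label classification.
The mutual exclusivity rule requires assigning zero probability to assignments stating that a pattern can belong to more than one class. The following potential is defined for any generic input pattern $x$:
$$\phi_0(\mathbf{y}(x), \mathbf{f}(x)) = \begin{cases} -\infty & \text{if } \exists\, i \neq j : y_i(x) = y_j(x) = 1 \\ \mathbf{f}(x)^{\top} \mathbf{y}(x) & \text{otherwise} \end{cases}$$
When only the potential $\Phi_0$ is used, each pattern corresponds to a set of outputs independent of the other pattern outputs given $\mathbf{f}$, the partition function decomposes over the patterns, and the probability distribution simplifies to:
$$p(\mathbf{y} \mid \mathbf{f}) = \prod_{x \in S} \frac{\exp\big(f_{t(x)}(x)\big)}{\sum_{j} \exp\big(f_j(x)\big)}$$
where $t(x)$ is the class assigned to $x$ by $\mathbf{y}$. This result provides an elegant justification for the usage of the softmax output for networks used in one-label classification tasks.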
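The link between the mutual exclusivity potential and the softmax can be checked numerically. The sketch below assumes that the admissible assignments are exactly the one-hot vectors and that the potential on them is the inner product between the network outputs and the assignment; the logit values are arbitrary.

```python
import math

def one_label_distribution(logits):
    """Normalize exp(f . y) over one-hot assignments y only."""
    n = len(logits)
    one_hots = [tuple(1 if i == k else 0 for i in range(n)) for k in range(n)]
    scores = [
        math.exp(sum(f * y for f, y in zip(logits, y_vec)))
        for y_vec in one_hots
    ]
    z = sum(scores)  # local partition function of a single pattern
    return [s / z for s in scores]

def softmax(logits):
    m = max(logits)  # shift by the maximum for numerical stability
    exps = [math.exp(f - m) for f in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, -1.0, 0.5]  # arbitrary network outputs for one pattern
probs = one_label_distribution(logits)
```

Under these assumptions the two functions return the same probabilities, which is the sense in which the softmax output layer arises from the model.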
Multi-label.
The following potential is expressed for each input pattern $x$: $\phi_0(\mathbf{y}(x), \mathbf{f}(x)) = \mathbf{f}(x)^{\top} \mathbf{y}(x)$.
When this potential is plugged into Equation 2 and the result into Equation 3, the partition function can be decomposed into one component for each pattern and class, since each pattern and classification output is independent of all the other classifications:
$$p(\mathbf{y} \mid \mathbf{f}) = \prod_{x \in S} \prod_{i \in P(x)} \sigma\big(f_i(x)\big) \prod_{i \in N(x)} \big(1 - \sigma(f_i(x))\big)$$
where $\sigma$ is the sigmoid function and $P(x)$, $N(x)$ are the sets of positive and negative classes for pattern $x$. This result provides an elegant justification for the usage of a sigmoidal output layer for multi-label classification tasks.
2.1 Neurosymbolic integration
The most interesting and general case is when the presented model is used to perform both learning and reasoning, a task referred to in the literature as neurosymbolic integration.
The general model described in Equation 1 is materialized with one potential $\Phi_0$ enforcing the consistency with the supervised data together with potentials representing the logic knowledge. Using a similar approach to Markov Logic Networks, a set of First-Order Logic (FOL) formulas is input to the system, and there is a potential $\Phi_c$ for each formula. The general form of the conditional probability distribution becomes:
$$p(\mathbf{y} \mid \mathbf{f}, \boldsymbol{\lambda}) = \frac{1}{Z} \exp\Big(\sum_{x \in S} \phi_0(\mathbf{y}(x), \mathbf{f}(x)) + \sum_{c} \lambda_c \Phi_c(\mathbf{y})\Big) \qquad (4)$$
where it is assumed that some (or all) of the predicates in a knowledge base (KB) are unknown and need to be learned together with the parameters $\boldsymbol{\lambda}$ driving the reasoning process.
A grounded expression (the same applies to an atom or predicate) is a FOL rule whose variables are assigned to specific constants. It is assumed that the undirected graphical model has the following structure: each grounded atom corresponds to a node in the graph, and all nodes connected by at least one rule are connected in the graph, so that there is one clique (and hence one potential) for each grounding of the formula in $\mathbf{y}$. It is assumed that all the potentials resulting from the $c$-th formula share the same weight $\lambda_c$; therefore the potential $\Phi_c$ is the sum over all groundings of $\phi_c$ in the world $\mathbf{y}$, such that $\Phi_c(\mathbf{y}) = \sum_{\mathbf{y}_{c,g}} \phi_c(\mathbf{y}_{c,g})$, where $\phi_c(\mathbf{y}_{c,g})$ assumes a value equal to 1 or 0 if the grounded formula holds true or false, respectively. This yields the probability distribution:
$$p(\mathbf{y} \mid \mathbf{f}, \boldsymbol{\lambda}) = \frac{1}{Z} \exp\Big(\sum_{x \in S} \phi_0(\mathbf{y}(x), \mathbf{f}(x)) + \sum_{c} \lambda_c \sum_{\mathbf{y}_{c,g}} \phi_c(\mathbf{y}_{c,g})\Big)$$
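To make the grounding-based potential concrete, the following sketch counts how many groundings of a formula hold true in a given world. The constants, the predicates `Cites` and `AI`, and the rule itself are hypothetical and serve only as an illustration of a potential built as a sum over groundings.

```python
import itertools

def grounded_potential(formula, constants, arity, world):
    """Number of groundings of `formula` that hold true in `world`."""
    return sum(
        1
        for grounding in itertools.product(constants, repeat=arity)
        if formula(world, *grounding)
    )

# Hypothetical world: truth values of the grounded atoms.
world = {
    ("Cites", "p1", "p2"): True,
    ("Cites", "p2", "p3"): True,
    ("AI", "p1"): True,
    ("AI", "p2"): True,
    ("AI", "p3"): False,
}

def rule(world, x, y):
    # Cites(x, y) AND AI(x) => AI(y); atoms missing from the world count as False.
    body = world.get(("Cites", x, y), False) and world.get(("AI", x), False)
    return (not body) or world.get(("AI", y), False)

constants = ["p1", "p2", "p3"]
phi = grounded_potential(rule, constants, arity=2, world=world)
```

Here 8 of the 9 groundings are satisfied: only the pair (p2, p3) violates the rule, because the body holds while the head does not.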
Example.
Suppose that a classifier must be trained to detect the objects in images for a multi-object detection task on real-world pictures. A knowledge graph may be available to describe hierarchical dependencies among the object classes, or object compositions. Pictures may be correlated by the locations where they have been shot. Table 1 shows the knowledge that could be used to express such a task, together with the set of unknown predicates to be trained. Other predicates may be known a priori based on the meta-information attached to the images. Figure 1 shows the graphical model correlating the output variables and the inputs instantiated for some of the rules. The goal of the training process is to train the classifiers approximating the predicates, but also to establish the relevance of each rule. For example, a formula that reliably holds in the data is likely to be associated with a higher weight than one relating predicates that are unlikely to correlate in the data.
Logic Tensor Networks.
Logic Tensor Networks (LTN) [39] is a framework to learn neural networks under the constraints imposed by some prior knowledge expressed as a set of FOL clauses. As shown in this paragraph, LTN is a special case of an RNM, when the $\boldsymbol{\lambda}$ parameters are frozen. In particular, an LTN expresses each FOL rule via a continuous relaxation of a logic rule using fuzzy logic. The strength $\lambda_c$ of the rule is assumed to be known a priori and not trained by the LTN. These rules provide a prior for the $\mathbf{f}$ functions. Therefore, assuming the $\boldsymbol{\lambda}$ parameters are fixed, an LTN considers the following distribution:
$$p(\mathbf{f} \mid \mathbf{y}, \boldsymbol{\lambda}) \propto \exp\Big(\sum_{x \in S} \phi_0(\mathbf{y}(x), \mathbf{f}(x))\Big) \cdot \exp\Big(\sum_c \lambda_c \Phi^s_c(\mathbf{f})\Big)$$
where $\Phi^s_c$ is the continuous relaxation of the $c$-th logic rule, $\phi_0$ is used to express the fitting of the supervised data and the prior gives preference to the functions respecting the logic constraints. The parameters $\mathbf{w}$ of the $\mathbf{f}$ of an LTN can be optimized via gradient ascent by maximizing the likelihood of the training data.
Semantic-Based Regularization.
Semantic-Based Regularization (SBR) [5] defines a learning and reasoning framework which allows training neural networks under the constraints imposed by the prior logic knowledge. The declarative language Lyrics [29] is available to provide a flexible and easy-to-use front end for the SBR framework. At training time, SBR employs the knowledge as done by LTN, while at inference time SBR uses a continuous relaxation of the $c$-th logic rule and of the output vector. Therefore, SBR can also be seen as a special instance of an RNM, when the $\boldsymbol{\lambda}$ parameters are frozen and the continuous relaxation of the logic is used at test time. Both LTN and SBR have a major disadvantage with respect to RNM, as they cannot learn the weights of the reasoner, which are required to be known a priori. This is very unlikely to happen in most real-world scenarios, where the strength of each rule must be co-trained with the learning system.
3 Learning and Inference
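Training.
A direct model optimization in RNM is intractable in most interesting cases, as the computation of the partition function requires a summation over all possible assignments of the output variables. However, if the partition function is assumed to factorize into separate independent groups of variables $\mathbf{y}_i \in \mathbf{y}$, it holds that:
$$Z \approx \prod_i Z_i$$
A particularly interesting case is when it is assumed that the partition function factorizes over the potentials, as done in piecewise likelihood [41]:
$$Z \approx \prod_c Z_c = \prod_c \Big[\sum_{\mathbf{y}'_c} \exp\big(\lambda_c \Phi_c(\mathbf{f}, \mathbf{y}'_c)\big)\Big]$$
where $\mathbf{y}_c$ is the subset of variables in $\mathbf{y}$ that are involved in the computation of $\Phi_c$. We indicate as piecewise-local probability for the $c$-th constraint:
$$p^{PL}_c(\mathbf{y}_c \mid \mathbf{f}, \lambda_c) = \frac{\exp\big(\lambda_c \Phi_c(\mathbf{f}, \mathbf{y}_c)\big)}{Z_c} \qquad (5)$$
Under this assumption, the factors can be distributed over the potentials, giving the following generalized piecewise likelihood:
$$p(\mathbf{y} \mid \mathbf{f}, \boldsymbol{\lambda}) \approx \prod_c p^{PL}_c(\mathbf{y}_c \mid \mathbf{f}, \lambda_c)$$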
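The effect of factorizing the partition function over the potentials can be illustrated on a toy model. In the sketch below each potential is a hypothetical triple (weight, scope, function over the scoped variables); the factorization is exact when the scopes are disjoint and only approximate when they share variables.

```python
import itertools
import math

def exact_log_z(potentials, n_vars):
    """log Z computed by enumerating all 2^n joint assignments."""
    total = 0.0
    for y in itertools.product([0, 1], repeat=n_vars):
        total += math.exp(sum(
            w * phi(tuple(y[i] for i in scope)) for w, scope, phi in potentials
        ))
    return math.log(total)

def piecewise_log_z(potentials):
    """Sum of the local log Z_c, each enumerating only its own scope."""
    log_z = 0.0
    for w, scope, phi in potentials:
        z_c = sum(
            math.exp(w * phi(v))
            for v in itertools.product([0, 1], repeat=len(scope))
        )
        log_z += math.log(z_c)
    return log_z

# Hypothetical potentials used only for illustration.
def agree(v):
    return 1.0 if v[0] == v[1] else 0.0

def implies(v):
    return 1.0 if (not v[0]) or v[1] else 0.0

disjoint = [(1.0, (0, 1), agree), (0.5, (2, 3), implies)]
overlapping = [(1.0, (0, 1), agree), (0.5, (1, 2), implies)]
```

With overlapping scopes the piecewise value overestimates the exact one in this example, since the shared variable is summed over independently in each local factor.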
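If the variables in $\mathbf{y}$ are binary, the computation of $Z$ requires a summation over all possible assignments, which has $O(2^{|\mathbf{y}|})$ complexity. Using the local decomposition this is reduced to $O(2^{\hat{n}})$, where $\hat{n}$ is the size of the largest potential. When a single potential involves too many variables, the pseudo-likelihood decomposition can be used, where each variable is factorized into a separate component with linear complexity with respect to the number of variables:
$$p^{PPL}_c(\mathbf{y}_c \mid \mathbf{f}, \lambda_c) = \prod_{y_i \in \mathbf{y}_c} p\big(y_i \mid \mathbf{y}_c \setminus y_i, \mathbf{f}, \lambda_c\big)$$
where the factorization is performed with respect to the single variables, which has a cost proportional to $|\mathbf{y}_c|$.
Assuming that the constraints are divided into two groups $PL$ and $PPL$, for which the local piecewise partitioning and the pseudo-likelihood approximations are used, respectively, the distribution becomes:
$$p(\mathbf{y} \mid \mathbf{f}, \boldsymbol{\lambda}) \approx \prod_{c \in PL} p^{PL}_c(\mathbf{y}_c \mid \mathbf{f}, \lambda_c) \cdot \prod_{c \in PPL} p^{PPL}_c(\mathbf{y}_c \mid \mathbf{f}, \lambda_c)$$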
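If the $c$-th constraint is factorized using the piecewise-local partitioning, the derivatives of the log-likelihood with respect to the model potential weights are:
$$\frac{\partial \log p(\mathbf{y} \mid \mathbf{f}, \boldsymbol{\lambda})}{\partial \lambda_c} \approx \Phi_c(\mathbf{f}, \mathbf{y}_c) - E_{\mathbf{y}'_c \sim p^{PL}_c}\big[\Phi_c(\mathbf{f}, \mathbf{y}'_c)\big] \qquad (6)$$
and with respect to the learner parameters:
$$\frac{\partial \log p(\mathbf{y} \mid \mathbf{f}, \boldsymbol{\lambda})}{\partial w_k} \approx \sum_i \frac{\partial f_i}{\partial w_k} \sum_c \lambda_c \Big[\frac{\partial \Phi_c(\mathbf{f}, \mathbf{y}_c)}{\partial f_i} - E_{\mathbf{y}'_c \sim p^{PL}_c}\Big[\frac{\partial \Phi_c(\mathbf{f}, \mathbf{y}'_c)}{\partial f_i}\Big]\Big] \qquad (7)$$
In the remainder of this section, it is assumed that all potentials are approximated using the piecewise-local approximation to keep the notation simple. The extension to the pseudo-likelihood is trivial: it is enough to replace $p^{PL}_c$ with $p^{PPL}_c$ in Equation 6.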
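For a single piecewise-local factor, the derivative with respect to its weight takes the usual exponential-family form: the observed value of the potential minus its expected value under the local distribution. The sketch below checks this identity against a finite-difference estimate; the potential, the weight and the observed assignment are hypothetical.

```python
import itertools
import math

def local_log_prob(weight, phi, y_c):
    """log of exp(weight * phi(y_c)) / Z_c for one local factor."""
    z_c = sum(
        math.exp(weight * phi(v))
        for v in itertools.product([0, 1], repeat=len(y_c))
    )
    return weight * phi(y_c) - math.log(z_c)

def grad_weight(weight, phi, y_c):
    """Observed potential minus its expectation under the local distribution."""
    assignments = list(itertools.product([0, 1], repeat=len(y_c)))
    scores = [math.exp(weight * phi(v)) for v in assignments]
    z_c = sum(scores)
    expected = sum(s * phi(v) for s, v in zip(scores, assignments)) / z_c
    return phi(y_c) - expected

def implies(v):
    # Hypothetical potential: 1 when v[0] => v[1] holds, 0 otherwise.
    return 1.0 if (not v[0]) or v[1] else 0.0

weight, observed, h = 0.7, (1, 1), 1e-5
analytic = grad_weight(weight, implies, observed)
numeric = (
    local_log_prob(weight + h, implies, observed)
    - local_log_prob(weight - h, implies, observed)
) / (2 * h)
```

The gradient is positive here because the observed assignment satisfies the formula more often than the current weight predicts.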
Training for neurosymbolic integration.
An interesting case is when a potential represents the level of satisfaction of a logical constraint over its groundings in the world $\mathbf{y}$. In this case the predicates of the $c$-th formula are grounded with a set of groundings, and $\mathbf{y}_{c,g}$ indicates the set of outputs for the grounded predicates of one grounding in the world $\mathbf{y}$. Therefore, the potential is the sum over the grounded formulas:
$$\Phi_c(\mathbf{y}) = \sum_{\mathbf{y}_{c,g}} \phi_c(\mathbf{y}_{c,g})$$
where $\phi_c(\mathbf{y}_{c,g}) \in \{0, 1\}$ is the satisfaction of the formula (False or True) by the grounded predicates $\mathbf{y}_{c,g}$.
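Table 2: The operations computed by the logic operators for the Product, Gödel and Łukasiewicz t-norms.
op  Product  Gödel  Łukasiewicz
x ∧ y  x · y  min(x, y)  max(0, x + y − 1)
x ∨ y  x + y − x · y  max(x, y)  min(1, x + y)
¬x  1 − x  1 − x  1 − x
x ⇒ y  1 if x ≤ y, else y / x  1 if x ≤ y, else y  min(1, 1 − x + y)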
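Therefore, each grounding corresponds to a separate potential, even if they all share the same weight. Assuming that each grounding of a formula is independent of all the others, we can approximate $Z_c$ as:
$$Z_c \approx \big(Z^g_c\big)^{N_c}$$
where $N_c$ is the total number of groundings of the $c$-th formula in $\mathbf{y}$ and each grounded formula shares the same local partition function $Z^g_c$. $Z^g_c$ can be efficiently computed by precomputing $n^+_c, n^-_c$, indicating the number of possible different grounding assignments satisfying or not satisfying the $c$-th formula. Clearly, since for a formula with $n_c$ atoms there are $2^{n_c}$ possible assignments, it holds that $n^-_c = 2^{n_c} - n^+_c$, yielding:
$$Z^g_c = n^+_c e^{\lambda_c} + n^-_c$$
Using the piecewise-local approximation for each grounding, the derivatives with respect to the model parameters become:
$$\frac{\partial \log p(\mathbf{y} \mid \mathbf{f}, \boldsymbol{\lambda})}{\partial \lambda_c} \approx \sum_{\mathbf{y}_{c,g}} \Big(\phi_c(\mathbf{y}_{c,g}) - E_{\mathbf{y}' \sim p^{PL}_c}\big[\phi_c(\mathbf{y}')\big]\Big)$$
Let us indicate as $\bar{\phi}_c$ the average satisfaction of the $c$-th constraint over the training data; then the gradient is null when, for all constraints:
$$E_{\mathbf{y}' \sim p^{PL}_c}\big[\phi_c(\mathbf{y}')\big] = \bar{\phi}_c \qquad (8)$$
The expected value of the satisfaction of the formula for a grounding can be efficiently computed for a $\{0, 1\}$-valued $\phi_c$ as:
$$E_{\mathbf{y}' \sim p^{PL}_c}\big[\phi_c(\mathbf{y}')\big] = \frac{n^+_c e^{\lambda_c}}{n^+_c e^{\lambda_c} + n^-_c}$$
yielding the following optimal assignment to the $c$-th parameter for a given assignment $\mathbf{y}$:
$$\lambda_c = \log \frac{\bar{\phi}_c}{1 - \bar{\phi}_c} - \log \frac{n^+_c}{n^-_c} \qquad (9)$$
which shows that the log-likelihood is maximized by selecting a $\lambda_c$ equal to the difference between the log odds of the constraint satisfaction in the data and the log odds of the prior satisfaction of the constraint if all assignments are equally probable.
When the world is fully observed during training, $\mathbf{y}$ indicates the training data assignments; substituting it into Equation 9 returns the maximum likelihood assignment for the parameters.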
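The closed-form weight described above, the log odds of the observed satisfaction minus the log odds of the prior satisfaction, can be sketched in a few lines. The formula, its number of atoms and the observed satisfaction rate are hypothetical; the counts of satisfying and non-satisfying assignments are obtained by enumerating the truth table.

```python
import itertools
import math

def count_satisfying(formula, n_atoms):
    """Return (n_plus, n_minus): truth assignments satisfying / violating the formula."""
    n_plus = sum(
        1
        for values in itertools.product([False, True], repeat=n_atoms)
        if formula(*values)
    )
    return n_plus, 2 ** n_atoms - n_plus

def optimal_weight(avg_satisfaction, n_plus, n_minus):
    """Log odds of the data satisfaction minus log odds of the prior satisfaction."""
    data_log_odds = math.log(avg_satisfaction / (1.0 - avg_satisfaction))
    prior_log_odds = math.log(n_plus / n_minus)
    return data_log_odds - prior_log_odds

def expected_satisfaction(weight, n_plus, n_minus):
    """Expected satisfaction of one grounding under its local distribution."""
    return n_plus * math.exp(weight) / (n_plus * math.exp(weight) + n_minus)

def implies(a, b):
    return (not a) or b

n_plus, n_minus = count_satisfying(implies, 2)   # 3 satisfying, 1 violating
weight = optimal_weight(0.9, n_plus, n_minus)    # rule holds on 90% of groundings
```

A rule satisfied exactly as often as chance would predict (here 75% of the time) receives a weight of zero, so it has no influence on the distribution.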
When the world is not fully observed during training, an iterative EM schema can be used to marginalize over the unobserved data in the expectation step using the inference methodology as described in the next paragraph. Then, the average constraint satisfaction can be recomputed, and then the parameters can be updated in the maximization step. This process is then iterated until convergence. Algorithm 1 reports the complete training algorithm for RNMs.
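Inference.
The MAP inference process searches for the most probable assignment of $\mathbf{y}$ given the evidence and the fixed parameters $\mathbf{w}, \boldsymbol{\lambda}$. The problem of finding the best assignment $\mathbf{y}^\star_q$ to the unobserved query variables given the evidence $\mathbf{y}_e$ and the current parameters can be stated as:
$$\mathbf{y}^\star_q = \arg\max_{\mathbf{y}'_q} \sum_c \lambda_c \Phi_c\big(\mathbf{f}, [\mathbf{y}'_q, \mathbf{y}_e]\big) \qquad (10)$$
where $[\mathbf{y}_q, \mathbf{y}_e]$ indicates a full assignment to the $\mathbf{y}$ variables, split into the query and evidence sets.
Gradient-based techniques cannot be readily used to optimize the MAP problem stated by Equation 10, since the problem is discrete. A possible solution is to relax the $\mathbf{y}$ values into the $[0, 1]$ interval and assume that each potential $\Phi_c(\mathbf{f}, \mathbf{y})$ has a continuous surrogate $\Phi^s_c(\mathbf{f}, \mathbf{y})$ which collapses into the original potential when the $\mathbf{y}$ assume crisp values and is continuous with respect to each $y_i$. As described in the following, continuous surrogates are very appropriate to describe the potentials representing logic knowledge for neurosymbolic integration and probabilistic logic reasoning.
When relaxing the potentials to accept continuous input variables, the MAP solution can be found by gradient-based techniques by computing the derivative with respect to each output variable:
$$\frac{\partial \log p(\mathbf{y}'_q \mid \mathbf{y}_e, \mathbf{f}, \boldsymbol{\lambda})}{\partial y_k} = \sum_c \lambda_c \frac{\partial \Phi^s_c\big(\mathbf{f}, [\mathbf{y}'_q, \mathbf{y}_e]\big)}{\partial y_k} \qquad (11)$$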
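A minimal sketch of gradient-based MAP inference on relaxed variables is given below. It is not the authors' implementation: it assumes a weighted sum of continuous surrogates as the objective, uses finite differences in place of analytic derivatives, and projects the query variables back onto the unit interval after each step. The two potentials and their weights are hypothetical.

```python
def map_inference(objective, y_init, query_idx, steps=200, lr=0.1, eps=1e-4):
    """Projected gradient ascent on the query variables; evidence stays fixed."""
    y = list(y_init)
    for _ in range(steps):
        grads = []
        for k in query_idx:
            up, down = list(y), list(y)
            up[k] += eps
            down[k] -= eps
            grads.append((objective(up) - objective(down)) / (2 * eps))
        for k, g in zip(query_idx, grads):
            y[k] = min(1.0, max(0.0, y[k] + lr * g))  # project onto [0, 1]
    return y

def make_objective(rule_weight, prior_weight):
    def objective(y):
        implication = min(1.0, 1.0 - y[0] + y[1])  # Lukasiewicz residuum of y0 => y1
        prefer_false = 1.0 - y[1]                  # soft prior pushing y1 towards 0
        return rule_weight * implication + prior_weight * prefer_false
    return objective

# y[0] is evidence (true); y[1] is the query variable, started at 0.5.
strong_rule = map_inference(make_objective(2.0, 0.5), [1.0, 0.5], query_idx=[1])
weak_rule = map_inference(make_objective(2.0, 3.0), [1.0, 0.5], query_idx=[1])
```

When the rule weight dominates, the query variable is driven to 1; when the opposing prior dominates, it is driven to 0, which shows how the learned weights decide which constraint prevails.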
T-Norm Fuzzy Logics.
Fuzzy logics extend the set of Boolean logic truth values to the unit interval $[0, 1]$ and, as a consequence, they can be exploited to convert Boolean logic expressions into continuous and differentiable ones. In particular, a t-norm fuzzy logic [14] is defined upon the choice of a certain t-norm [23]. A t-norm is a binary operation generalizing to continuous values the Boolean logic conjunction ($\wedge$), while it recovers the classical AND when the variables assume the crisp values 0 (false) or 1 (true).
Throughout this paper, we assume that given a certain variable assuming a continuous value $x \in [0, 1]$, its negation $\neg x$ (also called strong negation) is evaluated as $1 - x$. Moreover, a t-norm and the strong negation allow the definition of additional logical connectives. For instance, the implication ($\Rightarrow$) may be defined as the residuum of the t-norm, while the OR ($\vee$) operator, also called t-conorm, may be defined according to the De Morgan law with respect to the t-norm and the strong negation.
Different t-norm fuzzy logics have been proposed in the literature. Table 2 reports the operations computed by different logic operators for the three fundamental continuous t-norms, i.e. Product, Gödel and Łukasiewicz logics. Furthermore, a fragment of the Łukasiewicz logic [12] has been recently proposed for translating logic inference into a differentiable optimization problem, since it defines a large class of clauses which are translated into convex functions.
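The three fundamental t-norms and the connectives derived from them can be written down directly. The sketch below uses the standard textbook definitions, with the t-conorm obtained through the De Morgan law and the implication given by the residuum of each t-norm; the function names are illustrative.

```python
def product_tnorm(x, y):
    return x * y

def godel_tnorm(x, y):
    return min(x, y)

def lukasiewicz_tnorm(x, y):
    return max(0.0, x + y - 1.0)

def negation(x):
    return 1.0 - x  # strong negation

def tconorm(tnorm, x, y):
    # De Morgan law: x OR y = NOT(NOT x AND NOT y)
    return negation(tnorm(negation(x), negation(y)))

def product_residuum(x, y):
    return 1.0 if x <= y else y / x

def godel_residuum(x, y):
    return 1.0 if x <= y else y

def lukasiewicz_residuum(x, y):
    return min(1.0, 1.0 - x + y)

TNORMS = [product_tnorm, godel_tnorm, lukasiewicz_tnorm]
RESIDUA = [product_residuum, godel_residuum, lukasiewicz_residuum]
```

On crisp inputs all three families coincide with the Boolean connectives; they differ only on intermediate truth values, which is where the choice of t-norm affects the gradients.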
Quantifiers play an important role in defining different ways to aggregate logical propositions over (possibly large) sets of domain variables. The universal quantifier ($\forall$) and the existential quantifier ($\exists$) express the fact that a clause should hold true over all or at least one grounding, respectively. Both the universal and existential quantifiers are generally converted into real-valued functions according to different aggregation functions, e.g. the universal one as a t-norm and the existential one as a t-conorm over the groundings. When multiple universally or existentially quantified variables are present, the conversion is recursively performed from the outer to the inner variables. For example, consider the rule
where are three unary predicates defined on the input set . In this case, the output vector is defined as follows,
where is the output of predicate when grounded with . The continuous surrogate for the FOL rule grounded over all patterns in the domain, in case of the product tnorm and universal quantifier converted with the arithmetic mean, is given by:
4 Experiments
The proposed model has been experimentally evaluated on two different datasets where the relational structure on the output or input data may be exploited.
4.1 MNIST Following Pairs
This small toy task is designed to highlight the capability of RNMs to learn and employ soft rules that hold only for a sub-portion of the whole dataset. The MNIST dataset contains images of handwritten digits, and this task assumes that additional relational logic knowledge is available to reason over the digits. In particular, given a certain subset of images, a binary predicate $link$ between image pairs is considered. Given two images $x, y$, whose corresponding digits are denoted by $i, j$, a link between $x$ and $y$ is established if the second digit follows the first one, i.e. $j = i + 1$. However, it is assumed that the $link$ predicate is noisy; therefore, for $j \neq i + 1$, there is a given degree of probability that $link(x, y)$ is established anyway. The knowledge about the $link$ predicate can be represented by the following FOL formula:
$$\forall x\, \forall y\, \forall i\, \forall j \;\; link(x, y) \wedge digit(x, i) \wedge digit(y, j) \Rightarrow j = i + 1$$
where $digit(x, i)$ is a binary predicate indicating if a number $i$ is the digit class of the image $x$. Since the $link$ predicate holds true also for pairs of non-consecutive digits, the above rule is violated by a certain percentage of digit pairs. Therefore, the manifold established by the predicate can help in driving the prediction, but the noisy links force the reasoner to be flexible about how to employ the knowledge.
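One way to build such a noisy relation is sketched below. The text does not spell out the exact noise scheme, so the parameterization is an assumption: every pair of consecutive digits is linked, and every other ordered pair is linked independently with probability `noise`. The function name and the seed are hypothetical.

```python
import random

def build_links(digits, noise, seed=0):
    """Return the set of ordered index pairs (i, j) connected by the noisy link relation."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    links = set()
    for i in range(len(digits)):
        for j in range(len(digits)):
            if i == j:
                continue
            if digits[j] == digits[i] + 1:
                links.add((i, j))      # informative link: second digit follows the first
            elif rng.random() < noise:
                links.add((i, j))      # spurious link added by the noise process
    return links

digits = [0, 1, 2, 5]                  # digit classes of four hypothetical images
clean = build_links(digits, noise=0.0)
noisy = build_links(digits, noise=0.3)
saturated = build_links(digits, noise=1.0)
```

With zero noise the rule holds for every link, while at full noise every pair is linked and the relation carries no information about the digits.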
The training set is created by randomly selecting 50 images from the MNIST dataset and by adding the link relation with an incremental degree of noise. For each degree of noise in the training set, we created an equally sized test set with the same degree of noise. A neural network with hidden sigmoid neurons is used to process the input images.
Figure 2 reports a comparison between RNM and the baseline provided by the neural network, varying the percentage of links that are predictive of a digit following another one. When the link predicate only holds for consecutive digit pairs, RNM is able to perfectly predict the images on the test set using this information. When the link becomes less informative (more noisy), RNM is still able to employ the rule as a soft suggestion. However, when the percentage of predictive links approaches the prior probability that two randomly picked digits follow each other, the relation is not informative at all. In this case, RNM is still able to detect that the formula is not useful, and only the supervised data is used to learn the predictions. As a result, the predictive accuracy of RNM matches that of the neural network.
4.2 Document Classification on the CiteSeer dataset
The CiteSeer dataset [26] is a collection of scientific papers, each one assigned to one class out of a fixed set of classes. The papers are connected to each other by a citation network. Each paper in the dataset is described via its bag-of-words, i.e. a vector where the $i$-th element has a value equal to 1 or 0, depending on whether the $i$-th word in the vocabulary is present or not present in the document, respectively. A fixed dictionary of unique words is used for this experiment. The domain knowledge used for this task states that connected papers tend to be about the same topic, with one rule for each class $C_k$:
$$\forall x\, \forall y \;\; C_k(x) \wedge Cite(x, y) \Rightarrow C_k(y)$$
where $Cite$ is an evidence predicate (i.e. its value over the groundings is known a priori) determining whether a pattern cites another one. Different topics are differently closed with respect to other fields, and the above rules hold to different degrees.
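The claim that the citation rules hold to different degrees for different topics can be quantified with a simple statistic: for each topic, the fraction of outgoing citations that stay within the topic. The sketch below uses a tiny hypothetical citation graph with made-up paper identifiers and topic labels, not the actual dataset.

```python
from collections import defaultdict

def same_topic_rate(citations, topic):
    """For each topic, the fraction of outgoing citations pointing to the same topic."""
    total = defaultdict(int)
    same = defaultdict(int)
    for src, dst in citations:
        total[topic[src]] += 1
        if topic[src] == topic[dst]:
            same[topic[src]] += 1
    return {t: same[t] / total[t] for t in total}

# Hypothetical papers, topics and citation links.
topic = {"a": "ML", "b": "ML", "c": "DB", "d": "DB", "e": "IR"}
citations = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "d"), ("d", "c"), ("e", "a")]
rates = same_topic_rate(citations, topic)
```

A topic with a high rate corresponds to a rule that is rarely violated, so a learned weight for that rule would be expected to be larger than for a topic whose papers frequently cite other fields.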
A neural network with three hidden layers with ReLU activation functions and one output layer using the softmax activation is used as the baseline for this task. RNM employs the same network but with no output layer, as the output layer is computed as part of the inference process, as shown in Section 2 for the one-label and multi-label classification cases. The Adam optimizer [22] is used to update the weights. A variable portion of the data is sampled for training, part of which is kept as a validation set, while the remaining data is used as the test set.
Table 3: Results in the fully observed case.
% training data  NN baseline  SBR  RNM
90  0.723  0.726  0.732
75  0.717  0.719  0.726
50  0.707  0.712  0.726
25  0.674  0.682  0.709
10  0.645  0.650  0.685
Fully Observed Case.
The train and test sets are kept separate, and all links between train and test papers are dropped, so that the train and test data are two separate worlds. Table 3 reports the results obtained by the baseline neural network, compared against SBR trained using the Lyrics framework and RNM, as an average over ten different samples of the train and test data. Since SBR cannot learn the weights of the rules, these are validated by selecting the best performing ones on the validation set. RNM improves over the other methods for all tested configurations, thanks to its ability to select the best weights for each rule, exploiting the fact that each research community has a different intra-community citation rate.
Table 4: Accuracy in the partially observed case.
% training data  NN baseline  SBR  RNM
90  0.726  0.780  0.780
75  0.708  0.764  0.766
50  0.695  0.747  0.753
25  0.667  0.729  0.735
10  0.640  0.703  0.708
Partially Observed Case.
This experiment assumes that the training, validation and test data are available at training time [10], even if only the training labels are used during the training process. This configuration models a real-world scenario where partial knowledge of a world is given, but it is required to perform inference over the unknown portion of the environment. In this transductive experiment, all CiteSeer papers are supposed to be available in a single world together with the full citation network. Only a variable percentage of the supervised data is used during training. Therefore, the world is only partially observed at training time, and the EM schema described by Algorithm 1 must be used during training.
Table 4 reports the accuracy results obtained by the baseline neural network, compared against SBR and RNM. The SBR weights are validated by selecting the best performing ones on the validation set. RNM matches SBR when 90% of the data is used for training and improves over the other methods in all the other tested configurations, thanks to its ability to select the best weights for each rule.
5 Conclusions and Future Work
This paper presented Relational Neural Machines, a novel framework providing a tight integration between learning from supervised data and logic reasoning, which allows improving the quality of both the modules processing the low-level input data and the high-level reasoning about the environment. The presented model provides significant advantages over previous work in terms of scalability and flexibility, without any trade-off in exploiting the supervised data. The preliminary experimental results are promising, showing that the tighter integration between the symbolic and subsymbolic levels helps in exploiting the input and output structures. As future work, we plan to undertake a larger experimental exploration of RNM on more structured real-world problems.
References

 [1] Miltiadis Allamanis, Pankajan Chanthirasegaran, Pushmeet Kohli, and Charles Sutton, ‘Learning continuous semantic representations of symbolic expressions’, in Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 80–88. JMLR.org, (2017).
 [2] Stephen H Bach, Matthias Broecheler, Bert Huang, and Lise Getoor, ‘Hinge-loss Markov random fields and probabilistic soft logic’, Journal of Machine Learning Research, 18, 1–67, (2017).
 [3] Liang-Chieh Chen, Alexander Schwing, Alan Yuille, and Raquel Urtasun, ‘Learning deep structured models’, in International Conference on Machine Learning, pp. 1785–1794, (2015).
 [4] Luc De Raedt, Angelika Kimmig, and Hannu Toivonen, ‘Problog: A probabilistic prolog and its application in link discovery’, in Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI’07, pp. 2468–2473, San Francisco, CA, USA, (2007). Morgan Kaufmann Publishers Inc.
 [5] Michelangelo Diligenti, Marco Gori, and Claudio Sacca, ‘Semantic-based regularization for learning and inference’, Artificial Intelligence, 244, 143–165, (2017).
 [6] I Donadello, L Serafini, and AS d’Avila Garcez, ‘Logic tensor networks for semantic image interpretation’, in IJCAI International Joint Conference on Artificial Intelligence, pp. 1596–1602, (2017).
 [7] Honghua Dong, Jiayuan Mao, Tian Lin, Chong Wang, Lihong Li, and Denny Zhou, ‘Neural logic machines’, in International Conference on Learning Representations, (2019).
 [8] Sebastijan Dumančić and Hendrik Blockeel, ‘Demystifying relational latent representations’, in International Conference on Inductive Logic Programming, pp. 63–77. Springer, (2017).
 [9] Sebastijan Dumančić, Tias Guns, Wannes Meert, and Hendrik Blockeel, ‘Learning relational representations with autoencoding logic programs’, in Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 6081–6087. AAAI Press, (2019).
 [10] Alexander Gammerman, Volodya Vovk, and Vladimir Vapnik, ‘Learning by transduction’, in Proceedings of the Fourteenth conference on Uncertainty in artificial intelligence, pp. 148–155. Morgan Kaufmann Publishers Inc., (1998).
 [11] Artur S d’Avila Garcez, Krysia B Broda, and Dov M Gabbay, Neural-symbolic learning systems: foundations and applications, Springer Science & Business Media, 2012.
 [12] Francesco Giannini, Michelangelo Diligenti, Marco Gori, and Marco Maggini, ‘On a convex logic fragment for learning and reasoning’, IEEE Transactions on Fuzzy Systems, (2018).
 [13] Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep learning, volume 1, MIT press Cambridge, 2016.
 [14] Petr Hájek, Metamathematics of fuzzy logic, volume 4, Springer Science & Business Media, 2013.
 [15] Tamir Hazan, Alexander G Schwing, and Raquel Urtasun, ‘Blending learning and inference in conditional random fields’, The Journal of Machine Learning Research, 17(1), 8305–8329, (2016).
 [16] Zhiting Hu, Xuezhe Ma, Zhengzhong Liu, Eduard H. Hovy, and Eric P. Xing, ‘Harnessing deep neural networks with logic rules’, in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 712, 2016, Berlin, Germany, Volume 1: Long Papers, (2016).
 [17] Shoaib Jameel and Steven Schockaert, ‘Entity embeddings with conceptual subspaces as a basis for plausible reasoning’, in Proceedings of the Twenty-second European Conference on Artificial Intelligence, pp. 1353–1361. IOS Press, (2016).
 [18] Zhengyao Jiang and Shan Luo, ‘Neural logic reinforcement learning’, arXiv preprint arXiv:1904.10729, (2019).
 [19] William W Cohen, Fan Yang, and Kathryn Rivard Mazaitis, ‘Tensorlog: Deep learning meets probabilistic databases’, Journal of Artificial Intelligence Research, 1, 1–15, (2018).
 [20] Navdeep Kaur, Gautam Kunapuli, Tushar Khot, Kristian Kersting, William Cohen, and Sriraam Natarajan, ‘Relational restricted boltzmann machines: A probabilistic logic learning approach’, in International Conference on Inductive Logic Programming, pp. 94–111. Springer, (2017).
 [21] Angelika Kimmig, Stephen Bach, Matthias Broecheler, Bert Huang, and Lise Getoor, ‘A short introduction to probabilistic soft logic’, in Proceedings of the NIPS Workshop on Probabilistic Programming: Foundations and Applications, pp. 1–4, (2012).
 [22] Diederik P Kingma and Jimmy Ba, ‘Adam: A method for stochastic optimization’, arXiv preprint arXiv:1412.6980, (2014).
 [23] E.P. Klement, R. Mesiar, and E. Pap, Triangular Norms, Kluwer Academic Publisher, 2000.
 [24] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, ‘Deep learning’, Nature, 521(7553), 436, (2015).
 [25] Marco Lippi and Paolo Frasconi, ‘Prediction of protein residue contacts by markov logic networks with grounding–specific weights’, Bioinformatics, 25(18), 2326–2333, (2009).
 [26] Qing Lu and Lise Getoor, ‘Link-based classification’, in Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 496–503, (2003).
 [27] Robin Manhaeve, Sebastijan Dumančić, Angelika Kimmig, Thomas Demeester, and Luc De Raedt, ‘Deepproblog: Neural probabilistic logic programming’, arXiv preprint arXiv:1805.10872, (2018).
 [28] Giuseppe Marra, Francesco Giannini, Michelangelo Diligenti, and Marco Gori, ‘Integrating learning and reasoning with deep logic models’, in Proceedings of the European Conference on Machine Learning, (2019).
 [29] Giuseppe Marra, Francesco Giannini, Michelangelo Diligenti, and Marco Gori, ‘Lyrics: a general interface layer to integrate ai and deep learning’, arXiv preprint arXiv:1903.07534, (2019).
 [30] Giuseppe Marra and Ondřej Kuželka, ‘Neural markov logic networks’, arXiv preprint arXiv:1905.13462, (2019).
 [31] Pasquale Minervini, Luca Costabello, Emir Muñoz, Vít Nováček, and Pierre-Yves Vandenbussche, ‘Regularizing knowledge graph embeddings via equivalence and inversion axioms’, in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 668–683. Springer, (2017).
 [32] Stephen Muggleton and Luc De Raedt, ‘Inductive logic programming: Theory and methods’, The Journal of Logic Programming, 19, 629–679, (1994).
 [33] Maximilian Nickel, Lorenzo Rosasco, and Tomaso Poggio, ‘Holographic embeddings of knowledge graphs’, in Thirtieth AAAI Conference on Artificial Intelligence, (2016).
 [34] Mathias Niepert, ‘Discriminative gaifman models’, in Advances in Neural Information Processing Systems, pp. 3405–3413, (2016).
 [35] Matthew Richardson and Pedro Domingos, ‘Markov logic networks’, Machine learning, 62(1), 107–136, (2006).
 [36] Tim Rocktäschel and Sebastian Riedel, ‘Learning knowledge base inference with neural theorem provers’, in Proceedings of the 5th Workshop on Automated Knowledge Base Construction, pp. 45–50, (2016).
 [37] Tim Rocktäschel and Sebastian Riedel, ‘End-to-end differentiable proving’, in Advances in Neural Information Processing Systems, pp. 3788–3800, (2017).
 [38] Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap, ‘A simple neural network module for relational reasoning’, in Advances in neural information processing systems, pp. 4967–4976, (2017).
 [39] Luciano Serafini, Ivan Donadello, and Artur d’Avila Garcez, ‘Learning and reasoning in logic tensor networks: theory and application to semantic image interpretation’, in Proceedings of the Symposium on Applied Computing, pp. 125–130. ACM, (2017).
 [40] Gustav Sourek, Vojtech Aschenbrenner, Filip Zelezny, Steven Schockaert, and Ondrej Kuzelka, ‘Lifted relational neural networks: Efficient learning of latent relational structures’, Journal of Artificial Intelligence Research, 62, 69–100, (2018).
 [41] Charles Sutton and Andrew McCallum, ‘Piecewise pseudolikelihood for efficient training of conditional random fields’, in Proceedings of the 24th international conference on Machine learning, pp. 863–870. ACM, (2007).
 [42] Charles Sutton, Andrew McCallum, et al., ‘An introduction to conditional random fields’, Foundations and Trends® in Machine Learning, 4(4), 267–373, (2012).
 [43] Quan Wang, Bin Wang, and Li Guo, ‘Knowledge base completion using embeddings and rules’, in Twenty-Fourth International Joint Conference on Artificial Intelligence, (2015).
 [44] Jingyi Xu, Zilu Zhang, Tal Friedman, Yitao Liang, and Guy Van den Broeck, ‘A semantic loss function for deep learning with symbolic knowledge’, in Proceedings of the 35th International Conference on Machine Learning (ICML), (July 2018).
 [45] Fan Yang, Zhilin Yang, and William W Cohen, ‘Differentiable learning of logical rules for knowledge base reasoning’, in Advances in Neural Information Processing Systems, pp. 2319–2328, (2017).