The success of deep learning relies on the availability of large amounts of supervised training data. This prevents a wider adoption of machine learning in real-world applications, where collecting training data is often slow and expensive and requires extensive human intervention. Introducing prior knowledge into the learning process is a fundamental step in overcoming these limitations. First, it does not require the training process to induce the rules from the training set, thereby reducing the amount of training data required. Second, prior knowledge can be used to express the desired behavior of the learner on any input, providing better behavior guarantees in adversarial or uncontrolled environments.
This paper presents LYRICS (Learning Yourself Reasoning and Inference with ConstraintS), a TensorFlow environment based on a declarative language for integrating prior knowledge into machine learning, which allows the full expressiveness of First Order Logic (FOL) to define the knowledge. LYRICS has its roots in frameworks like Semantic Based Regularization (SBR) [5, 6], built on top of kernel machines, and Logic Tensor Networks (LTN) [20], which can be applied to neural networks. These frameworks transform the FOL clauses into a set of constraints that are jointly optimized during learning. LYRICS generalizes both approaches by allowing the prior knowledge to be enforced transparently at training and test time, and by dropping all restrictions on the form of the prior knowledge. The lack of a declarative frontend makes SBR and LTN hard to extend beyond the classical classification tasks to which they have been applied in the past. LYRICS, on the other hand, defines a declarative language, which lowers the barrier to building models that exploit the available domain knowledge in any machine learning context.
In particular, any many-sorted first-order logical theory can be expressed in the framework, which allows declaring domains of different sorts, with constants, predicates and functions. LYRICS provides a very tight integration of learning and logic, as any computational graph can be bound to a FOL predicate. This makes it possible to constrain the learner during both training and inference. Since the framework is agnostic to the learners that are bound to the predicates, it can be used in a vast range of applications including classification, generative or adversarial ML, sequence-to-sequence learning, collective classification, etc.
1.1 Previous work
In the past few years many authors have tackled specific applications by integrating logic and learning. Minervini et al. [13] propose using prior knowledge to correct the inconsistencies of an adversarial learner. Their methodology is designed ad hoc for the tackled task and is limited to Horn clauses. A method to distill the knowledge into the weights of a learner is presented in [10], which is also based on a fuzzy generalization of FOL. However, the definition of the framework is limited to universally quantified formulas and to a small set of logic operators. Another line of research [18, 4] attempts to use logical background knowledge to improve the embeddings for Relation Extraction. However, these works are also based on ad hoc solutions that lack a common declarative mechanism that can be easily reused. They are all limited to a subset of FOL, and they only allow the knowledge to be injected at training time, with no guarantee that the output on the test set respects the knowledge.
Markov Logic Networks (MLN) [17] and Probabilistic Soft Logic (PSL) [11, 2] provide a generic AI interface layer for machine learning by implementing a probabilistic logic. However, the integration with the underlying learning processes working on the low-level sensory data is shallow: a low-level learner is trained independently, then frozen and stacked with the AI layer providing a higher-level inference mechanism. The language proposed in this paper instead makes it possible to directly improve the underlying learner, while also providing the higher-level integration with logic. TensorLog [3] is a more recent framework that integrates probabilistic logical reasoning with the deep-learning infrastructure of TF. However, TensorLog is limited to reasoning and does not allow the learners to be optimized while performing inference. TensorFlow Distributions [8] and Edward [21]
2 The Declarative Language
LYRICS defines a TensorFlow (TF, https://www.tensorflow.org/) environment in which learning and reasoning are integrated. The definition of the knowledge in the presented environment starts by defining a certain number of domains in the considered world. A domain determines a collection of individuals of the world that share the same representation space and, thus, can be analyzed and manipulated in a homogeneous way. For example, a domain can collect the set of considered pixel images, or the sentences of a book as bag-of-words. The domains are then filled with their “inhabitants”, on which the learning and reasoning will be carried out.
For example, a domain of images can be defined as:
where data_images is the placeholder of the input data. The elements of a domain are a sort of “anonymous” individuals that are collectively processed. On the other hand, an individual of a domain can also be separately specified, and a specific behavior can be defined for it. In particular, specific individuals can be added to a domain using the following construct:
A function can be defined to map elements from the input domains into an element of an output domain. A unary function takes as input an element from a domain and transforms it into an element of the same or of another domain, while an n-ary function takes as input n elements, mapping them into an element of its output domain. The following example defines an image “encoder” function:
where the FOL function is bound to its TF implementation, which in this case is the CNNEncoder function.
A predicate can be defined as a function mapping elements of the input domains to truth values, for example isCat(x). For instance, a predicate approximated by a neural network NN, taking as input the patterns in a domain, can be defined as:
Finally, it is possible to state the knowledge about the world by means of a set of constraints. Each constraint is a generic FOL formula using as atoms the previously defined functions and predicates. For instance, if we are given the domain Images, composed of animal images, and two predicates bird and flies defined on it, the user can express the knowledge that all the birds fly by means of the constraint:
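In standard FOL notation, the knowledge that all birds fly, stated over the predicates bird and flies introduced above, corresponds to the formula below. It is written in LaTeX for illustration only; the concrete LYRICS constraint syntax may differ.

```latex
\forall x \;\; \mathrm{bird}(x) \rightarrow \mathrm{flies}(x)
```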
3 From Logic to Learning
TensorFlow, on top of which the framework is built, performs computations by building a computational graph, where the nodes of the graph are operations manipulating the tensors represented by their incoming edges. TF performs automatic differentiation of a generic computational graph by exploiting the chain rule of calculus. Since the framework implements all its components within TensorFlow, any TensorFlow model can be integrated in LYRICS. The proposed framework compiles a high-level description of the knowledge into a computational graph by translating each piece of logic knowledge into constraints. The resulting computational graph is optimized using the standard TF optimization mechanism.
Domains and Individuals. Domains and individuals allow users to provide data to the framework as tensors and represent the leaves of the computational graph. While domains are represented by constant tensors, individuals can be represented by both constant and variable tensors. In this way, the user is allowed to provide the knowledge of the existence of a certain individual, even if its feature representation is unknown, and its representation will be optimized to be coherent with the other pieces of knowledge provided.
Functions and Predicates. Functions allow the mapping among different tensors of (possibly) different domains, while predicates express the truth degree of some property for those tensors. Both functions and predicates can be implemented using any TF computational graph. If the graph does not contain any variable tensor (i.e. it is not parametric), then we say it is given; otherwise all the variables will be automatically learned to maximize the satisfaction of the constraints, and we say the function/predicate is learnable. Learnable functions can be (deep) neural networks, kernel machines or radial basis functions, whose weights must be learned.
Constraints. The integration of learning and logical reasoning is achieved by compiling the logical rules into continuous real-valued constraints. The logical rules correlate all the defined elements and enforce some desired behaviour on them.
Variables, functions, predicates, logical connectives and quantifiers can all be seen as nodes of an expression tree. The evaluation of a constraint corresponds to a post-order visit of the expression tree, where the visit action builds the corresponding portion of the computational graph. In particular:
visiting a variable substitutes the variable with a tensor in the domain it belongs to;
visiting a function or predicate provides input tensors to the TF models implementing those functions;
visiting a connective combines predicates by means of the real-valued operations associated to the connective;
visiting a quantifier aggregates the outputs of the expressions obtained for the single variable groundings.
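The four visit actions above can be sketched in plain Python. This is a minimal illustration and not the LYRICS implementation: it evaluates truth degrees directly instead of building a TF graph, the node classes `Atom`, `Implies` and `Forall` are hypothetical, the implication is assumed to be "not a or b" under the product T-norm, and the truth degrees are made-up stand-ins for learned models.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    """Leaf: a predicate applied to a variable, e.g. bird(x)."""
    predicate: str
    variable: str


@dataclass
class Implies:
    """Connective node with two sub-formulas."""
    left: object
    right: object


@dataclass
class Forall:
    """Quantifier node binding one variable over a named domain."""
    variable: str
    domain: str
    body: object


def visit(node, predicates, domains, binding):
    """Post-order visit: children are evaluated before their parent."""
    if isinstance(node, Atom):
        # The variable is replaced by its current grounding and passed
        # to the model implementing the predicate.
        return predicates[node.predicate](binding[node.variable])
    if isinstance(node, Implies):
        a = visit(node.left, predicates, domains, binding)
        b = visit(node.right, predicates, domains, binding)
        # Assumed real-valued implication: "not a or b", product T-norm.
        return 1.0 - a * (1.0 - b)
    if isinstance(node, Forall):
        values = [
            visit(node.body, predicates, domains, {**binding, node.variable: item})
            for item in domains[node.domain]
        ]
        # The quantifier aggregates over all groundings (average for "forall").
        return sum(values) / len(values)
    raise TypeError(f"unknown node type: {type(node).__name__}")


# Hypothetical truth degrees standing in for learned models.
bird = {"tweety": 1.0, "rex": 0.0}
flies = {"tweety": 0.8, "rex": 0.1}
predicates = {"bird": bird.get, "flies": flies.get}
domains = {"Animals": ["tweety", "rex"]}

formula = Forall("x", "Animals", Implies(Atom("bird", "x"), Atom("flies", "x")))
truth = visit(formula, predicates, domains, {})
print(truth)
```

In the actual framework each visit step would emit TF operations instead of Python floats, so that the resulting truth degree stays differentiable with respect to the model weights.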
Connectives and quantifiers are treated using the fuzzy generalization of FOL that was first proposed by Novák [16]. In particular, a T-norm fuzzy logic generalizes Boolean logic to variables assuming values in [0, 1]. A T-norm fuzzy logic is defined by its T-norm, which models the logical AND and from which the other operations can be derived. Table 1 shows some possible implementations of the connectives using different T-norms.
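As a concrete reference, the sketch below implements three textbook T-norms (product, Gödel and Łukasiewicz) and derives OR from each of them through De Morgan's law with the negation 1 - x. These are the standard definitions from the fuzzy logic literature; they are an assumption here and may not match Table 1 entry for entry.

```python
def product_and(x, y):
    """Product T-norm."""
    return x * y


def godel_and(x, y):
    """Goedel (minimum) T-norm."""
    return min(x, y)


def lukasiewicz_and(x, y):
    """Lukasiewicz T-norm."""
    return max(0.0, x + y - 1.0)


def negation(x):
    """Standard fuzzy negation."""
    return 1.0 - x


def derived_or(t_norm):
    """OR derived from a T-norm: x or y = not(not x and not y)."""
    return lambda x, y: negation(t_norm(negation(x), negation(y)))


T_NORMS = {
    "product": product_and,
    "godel": godel_and,
    "lukasiewicz": lukasiewicz_and,
}

for name, t_norm in T_NORMS.items():
    t_or = derived_or(t_norm)
    print(name, "AND:", t_norm(0.7, 0.6), "OR:", t_or(0.7, 0.6))
```

On the Boolean values 0 and 1 all three T-norms coincide with the classical AND, which is what makes them valid generalizations of Boolean logic; they differ only on intermediate truth degrees.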
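Let us consider a FOL formula with variables $x_1, x_2, \ldots$, and let $\mathcal{P}$ indicate the vector of predicates and $\mathcal{P}(\mathcal{X})$ be the set of all grounded predicates over all the groundings $\mathcal{X}$ of the variables. The degree of truth of a formula containing an expression $E$ with a universally quantified variable $x_i$ is the average of the T-norm generalization $t_E(\cdot)$ of $E$, when grounding $x_i$ over its domain $\mathcal{X}_i$:
$$\forall x_i \, E\big(\mathcal{P}(\mathcal{X})\big) \;\longrightarrow\; \Phi_{\forall}\big(\mathcal{P}(\mathcal{X})\big) = \frac{1}{|\mathcal{X}_i|} \sum_{x_i \in \mathcal{X}_i} t_E\big(\mathcal{P}(\mathcal{X})\big)$$
The truth degree of the existential quantifier is instead defined as the maximum of the T-norm expression over the domain of the quantified variable:
$$\exists x_i \, E\big(\mathcal{P}(\mathcal{X})\big) \;\longrightarrow\; \Phi_{\exists}\big(\mathcal{P}(\mathcal{X})\big) = \max_{x_i \in \mathcal{X}_i} t_E\big(\mathcal{P}(\mathcal{X})\big)$$
When multiple quantified variables are present, the conversion is performed recursively from the outer to the inner variables. Figure 1 shows the translation of a logic formula into its expression tree and subsequently into a TensorFlow computational graph.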
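The two quantifier rules and their recursive nesting can be illustrated with a few lines of Python. The truth table below is made up for the example, and the helper names `forall` and `exists` are hypothetical; the only assumption taken from the text is that the universal quantifier averages and the existential quantifier takes the maximum.

```python
def forall(values):
    """Universal quantifier: average truth degree over the groundings."""
    values = list(values)
    return sum(values) / len(values)


def exists(values):
    """Existential quantifier: maximum truth degree over the groundings."""
    return max(values)


# Made-up truth degrees of an expression E(x, y) for every grounding.
truth = {
    "x1": {"y1": 0.9, "y2": 0.2},
    "x2": {"y1": 0.1, "y2": 0.8},
}
xs = ["x1", "x2"]
ys = ["y1", "y2"]

# forall x exists y E(x, y): the outer quantifier wraps the inner one.
forall_exists = forall(exists(truth[x][y] for y in ys) for x in xs)

# exists y forall x E(x, y): swapping the quantifiers changes the result.
exists_forall = exists(forall(truth[x][y] for x in xs) for y in ys)

print(forall_exists, exists_forall)
```

The two results differ, which mirrors the classical fact that the order of nested quantifiers matters and shows why the conversion has to follow the nesting of the formula.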
Supervisions. Any available supervision for the functions or predicates can be integrated into the learning problem. LYRICS provides a placeholder where this fitting is expressed, called PointwiseConstraint. This construct points to a computational graph where a loss is applied for each supervision (e.g. the loss defaults to the cross-entropy loss for any neural network classifier), and it can be overridden to achieve a different behavior:
where model is the function for which to enforce supervisions labels on data inputs.
Cost Function. Let us assume to be given a knowledge base $KB$, consisting of a set of FOL formulas, where some of the elements (individuals, functions or predicates) are unknown. The learning process aims at finding a good approximation of each unknown element, so that the estimated values will satisfy the FOL formulas for the input samples. A computational graph is built for each constraint; let $f$ and $\mathcal{P}$ be the vectors of learnable functions and predicates, respectively. Let $\Phi_h\big(f(\mathcal{X}), \mathcal{P}(\mathcal{X})\big)$ indicate the degree of satisfaction of the $h$-th constraint evaluated on all its inputs $\mathcal{X}$. The following term is minimized to satisfy the constraints:
$$C_e(f, \mathcal{P}) = \sum_{h=1}^{H} \lambda_h \Big(1 - \Phi_h\big(f(\mathcal{X}), \mathcal{P}(\mathcal{X})\big)\Big)$$
where $\lambda_h$ is a weight for the $h$-th logical constraint. LYRICS also allows classical supervised learning to be integrated with learning from the constraints modeling the prior knowledge. The overall cost function is composed of two terms, one forcing the fitting of the supervised examples (both on functions and predicates) and one for the satisfaction of the logical constraints:
$$C(f, \mathcal{P}) = \mathcal{L}\big(f, \mathcal{S}_f\big) + \mathcal{L}\big(\mathcal{P}, \mathcal{S}_{\mathcal{P}}\big) + \lambda_c \, C_e(f, \mathcal{P})$$
where $\mathcal{S}_f$ and $\mathcal{S}_{\mathcal{P}}$ denote the supervised data for functions and predicates respectively, $\mathcal{L}$ is a loss function and $\lambda_c$ is the weight of the constraint term.
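The structure of this cost can be sketched numerically. The code below is an illustration under stated assumptions rather than the LYRICS implementation: the penalty for each constraint is assumed to be one minus its degree of satisfaction, the supervised loss is assumed to be the binary cross-entropy, and all function and parameter names are hypothetical.

```python
import math


def cross_entropy(prediction, label, eps=1e-12):
    """Binary cross-entropy for one supervised example (assumed loss)."""
    p = min(max(prediction, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def constraint_cost(satisfactions, weights):
    """Weighted penalty: each constraint contributes weight * (1 - satisfaction)."""
    return sum(w * (1.0 - s) for s, w in zip(satisfactions, weights))


def total_cost(predictions, labels, satisfactions, weights, constraint_weight):
    """Supervised fitting term plus the weighted constraint term."""
    supervised = sum(cross_entropy(p, y) for p, y in zip(predictions, labels))
    return supervised + constraint_weight * constraint_cost(satisfactions, weights)


# Made-up values: two supervised examples and two logical constraints.
cost = total_cost(
    predictions=[0.9, 0.2],
    labels=[1, 0],
    satisfactions=[0.5, 1.0],
    weights=[2.0, 1.0],
    constraint_weight=0.5,
)
print(cost)
```

A fully satisfied constraint contributes nothing to the cost, so the constraint term only pushes on the parts of the knowledge that the current models violate.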
4 Learning and Reasoning with LYRICS
This section presents a list of examples illustrating the range of learning tasks that can be expressed in the proposed framework. In particular, it is shown how to force label coherence in semi-supervised or transductive learning tasks, how to implement collective classification over the test set, how to perform rule deduction from the learned predicates as in classical Inductive Logic Programming (ILP) and pure logical reasoning, and how to address generative tasks or pattern completion in the case of missing features. The examples are presented using LYRICS syntax directly, to show that the final implementation of a problem closely retraces its abstract definition. The software of the framework and the experiments are made available at “URL hidden during the blind review process”.
In this task we assume that a set of points distributed along an outer and an inner circle is available. The inner points belong to a given class, while the outer points do not. A random selection of points is supervised (either positively or negatively), as shown in Figure 2. The remaining points are split into unsupervised training points, also shown in Figure 2, and points left as the test set. A neural network is assumed to have been created in TF to approximate the predicate for this class.
The network can be trained by making it fit the supervised data. So, given the vector of data X, a neural network NN_A and the vector of supervised data X_s, with the vector of associated labels y_s, the supervised training of the network can be expressed by the following:
Let’s now assume that we want to express manifold regularization for the learned function: e.g. points that are close should be similarly classified. This can be expressed as:
where f_close is a given function determining if two patterns are close. The training is then re-executed starting from the same initial conditions as in the supervised-only case.
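One way to write the manifold regularization rule in FOL is shown below in LaTeX. The predicate names are illustrative assumptions: $A$ stands for the class predicate approximated by NN_A, and $\mathrm{close}$ for the closeness relation computed by f_close.

```latex
\forall x \, \forall y \;\; \mathrm{close}(x, y) \rightarrow \big( A(x) \leftrightarrow A(y) \big)
```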
Collective classification [19] performs the class assignments exploiting any known correlation among the test patterns. This paragraph shows how to exploit these correlations in LYRICS. Here, we assume that the patterns are represented as datapoints. The classification task is a multi-label problem where the patterns belong to three classes. The class assignments are defined by membership regions corresponding to three overlapping rectangles, as shown in Figure 3. The examples are partially labeled and drawn from a uniform distribution on both the positive and negative regions for all the classes.
In a first stage, the classifiers for the three classes are trained in a supervised fashion using a two-layer neural network taking four positive and four negative examples for each class. This is implemented via the following declaration:
The test set is composed of random points, and the assignments performed by the classifiers after the training are reported in Figure 3. In a second stage, some prior knowledge about the task at hand is assumed to be available. In particular, any pattern must belong to (at least) one of two of the classes. Furthermore, it is known that one class is defined as the intersection of the other two. The collective classification step is performed by seeking the class assignments that are close to the initial classifier predictions but also respect the logical constraints on the test set:
where X_test is the set of test datapoints, and the outputs priorsA, priorsB, priorsC of the classifiers act as priors for the final assignments. As Figure 3 shows, the collective step fixes some wrong predictions.
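The idea of seeking assignments that stay close to the priors while respecting the rules can be illustrated on a single test point. The sketch below makes several assumptions that are not confirmed by the text: the classes are named A, B and C after the priorsA, priorsB and priorsC outputs, the rules are "A or B" and "C if and only if A and B", closeness is measured by the squared distance, and the search is a brute-force enumeration over Boolean assignments instead of the continuous optimization used by the framework.

```python
from itertools import product


def satisfies_rules(a, b, c):
    """Assumed rules: A or B must hold, and C is the intersection of A and B."""
    at_least_one = bool(a or b)
    c_is_intersection = bool(c) == bool(a and b)
    return at_least_one and c_is_intersection


def collective_assignment(priors):
    """Return the rule-consistent Boolean assignment closest to the priors."""
    best, best_distance = None, float("inf")
    for assignment in product([0, 1], repeat=3):
        if not satisfies_rules(*assignment):
            continue
        distance = sum((p - v) ** 2 for p, v in zip(priors, assignment))
        if distance < best_distance:
            best, best_distance = assignment, distance
    return best


# Made-up classifier outputs (priors) for classes A, B and C on one test point.
print(collective_assignment((0.9, 0.8, 0.3)))
```

In this toy case the classifiers are confident about A and B but not about C, so the consistent assignment closest to the priors switches C on: the rules correct a prediction that the independent classifier got wrong.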
It is also possible to mark a set of constraints as test-only, in order to perform model checking. Model checking can be used as a fundamental step to perform rule deduction using Inductive Logic Programming techniques [14].
The presented framework can be used as a tool for pure logical reasoning. This case is illustrated by the following example, where a few individuals are added to the domain People without any underlying data representation by the statement:
The individuals are assumed to be related via parental relations defined by the following predicates:
where the given binary predicate eq holds true iff the two input individuals are the same person.
Finally, some relations between the individuals are known:
The prior knowledge provided for this task expresses some well-known semantics about parental constraints. For example, it is possible to express that nobody can be father or grandfather of himself as:
Two further rules state that fatherhood is an asymmetric relation: if you are the father or grandfather of someone, he cannot be your father or grandfather. Furthermore, someone cannot be both father and grandfather of the same person. These are expressed as:
Another rule expresses that the father of the father is a grandfather, and that one person has at most one father in the considered world:
The learning task seeks to infer the unknown relations among the individuals. After the learning phase, the predicate values for all the groundings are output, and it is correctly concluded that the following facts hold true: grandFather("Marco", "Michele"), grandFather("Marco", "Giuseppe"), grandFather("Marco", "Francesco"), etc. On the other hand, nothing can be concluded regarding who is the grandfather of “Franco” and “Andrea”, so these values are left equal to their prior values. Once the training has been performed and the grounded predicates have been computed, model checking can be performed by stating the rules that should be verified. For example:
As expected, the evaluation of the rule would return that it is perfectly verified by the computed assignments.
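For intuition, the same kind of conclusion can be reproduced with crisp (Boolean) deduction. The sketch below is not how LYRICS reasons, since the framework optimizes fuzzy truth degrees, and the individuals "a" to "d" and their father facts are hypothetical placeholders rather than the facts used in the experiment.

```python
def deduce_grandfathers(father):
    """Apply the rule: father(x, y) and father(y, z) imply grandFather(x, z)."""
    return {(x, z) for (x, y) in father for (y2, z) in father if y == y2}


def is_irreflexive(relation):
    """Model check: nobody is related to himself."""
    return all(x != y for (x, y) in relation)


def are_disjoint(first, second):
    """Model check: no pair belongs to both relations."""
    return not (first & second)


# Hypothetical father facts: a is the father of b, b is the father of c and d.
father = {("a", "b"), ("b", "c"), ("b", "d")}
grand_father = deduce_grandfathers(father)

print(sorted(grand_father))
print(is_irreflexive(father), is_irreflexive(grand_father))
print(are_disjoint(father, grand_father))
```

The last two lines play the role of model checking: they verify that the deduced assignments respect the rules that nobody is his own father or grandfather and that nobody is both father and grandfather of the same person.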
Images of handwritten digits belonging to three digit classes are extracted from the MNIST dataset. We want to solve both a classification task, aiming at identifying which digit an image represents, and a generation task, learning generative functions that produce images from images. In particular, two generative functions, next and prev, must be learned such that, given an image of a digit, they produce an image of the next and previous digit, respectively (using a circular mapping, so that the first digit follows the last one and the last digit precedes the first one).
This generative task is solved in two steps: first, the classifier is learned in a purely supervised fashion, then the image generator is trained in an unsupervised fashion by exploiting the knowledge of the relations among the classes. The domain of images and the binding of the three digit predicates to the outputs of the neural network are expressed as:
where the network has a softmax activation function on the output layer and the Slice command selects a specific network output given its index in the output layer. This makes it possible to bind different predicates to the same network, which is useful in many classification tasks to limit the number of parameters and to share the development of the latent feature representations across the predicates.
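The effect of binding several predicates to slices of one softmax network can be sketched in plain Python. The helper `slice_output` below is a hypothetical analogue of the Slice command, not the LYRICS API, and `toy_network` is a made-up stand-in for a trained classifier with three outputs.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_network(x):
    """Hypothetical classifier: three logits computed from a 2-D input."""
    logits = [x[0], x[1], -(x[0] + x[1])]
    return softmax(logits)


def slice_output(network, index):
    """Return a predicate that reads one output unit of a shared network."""
    return lambda x: network(x)[index]


# One predicate per class, all backed by the same network.
digit_predicates = [slice_output(toy_network, k) for k in range(3)]

sample = (2.0, 0.5)
degrees = [predicate(sample) for predicate in digit_predicates]
print(degrees)
```

Because the three predicates share one softmax layer, their truth degrees on any input sum to one, so the predicates are mutually exclusive by construction.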
The following definition declares the generative task:
where the given function is implemented by the cosine similarity. The first rules define the meaning of the next and prev mappings, and the last ones enforce that next is the inverse of prev and vice versa. Figure 4 shows an input image (left) and the outputs of the functions next (center) and prev (right).
A set of patterns is drawn from a double-moon-shaped distribution, as shown in Figure 5. The patterns distributed along one moon belong to the class, while the patterns along the other one do not. This learning task is expressed as:
Let us now assume that there are two new individuals for which no feature representation is available, but it is known that the first belongs to the class and the second does not. This can be expressed as:
We assume to know in advance that the first and second individuals are close to a positive and a negative example for the class, respectively. This can be stated as:
where close is a given predicate deciding whether two points are close and eq implements a differentiable equality function.
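A differentiable equality can take several forms. The sketch below uses a Gaussian kernel of the squared distance purely as an assumption for illustration; it is not necessarily the definition used by the framework, and the `sharpness` parameter is hypothetical.

```python
import math


def eq(x, y, sharpness=1.0):
    """Soft equality in (0, 1]: equals 1 for identical points and decays smoothly."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sharpness * squared_distance)


print(eq((1.0, 2.0), (1.0, 2.0)))
print(eq((0.0, 0.0), (1.0, 0.0)), eq((0.0, 0.0), (2.0, 0.0)))
```

Any function of this kind gives a non-zero gradient with respect to the coordinates of its arguments, which is what allows the representation of an individual to be moved towards the points it should be equal to.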
Since the feature representations of the two individuals are not defined and are left as free variables, the framework can learn them in order to respect the constraints defined by the logic rules. Figure 5 shows the values of the individuals after training. They have been correctly placed in regions where the density of the points of the corresponding class is high.
Document Classification on the Citeseer dataset.
This section applies the proposed framework to a standard ML dataset. The CiteSeer dataset (https://linqs.soe.ucsc.edu/data) consists of scientific papers, each one assigned to one of the following classes: Agents, AI, DB, ML and HCI. The papers are not independent, as they are connected by a citation network with links. Each paper in the dataset is described via its bag-of-words representation, which is a vector having the same size as the vocabulary, with the i-th element equal to 1 or 0 depending on whether the i-th word in the vocabulary is present or not present in the document, respectively. The dictionary consists of unique words. This learning task is expressed as:
where the first line defines the domain of scientific articles to classify, and one predicate for each class is defined and bound to an output of a neural network, which features a softmax activation function on the output layer.
The domain knowledge that if a paper cites another one, they are likely to share the same topic, is expressed as:
where f_cite is a given function determining whether a pattern cites another one. Finally, the supervision on a variable size training set can be provided by means of:
where X_s is a subset of the domain of papers on which we enforce supervisions y_s.
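The binary bag-of-words encoding used to describe each paper can be sketched as follows. The vocabulary and the document are made up for the example, and the whitespace tokenization is a simplifying assumption.

```python
def bag_of_words(document, vocabulary):
    """Binary vector: element i is 1 if the i-th vocabulary word occurs in the document."""
    present = set(document.lower().split())
    return [1 if word in present else 0 for word in vocabulary]


# Made-up vocabulary and document.
vocabulary = ["agent", "database", "learning", "logic"]
vector = bag_of_words("Learning with logic constraints", vocabulary)
print(vector)
```

The resulting vector has one entry per vocabulary word, so every paper is mapped to the same fixed-size input regardless of its length.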
|% data in training set|

|Method|Accuracy (%)|
|---|---|
|ICA Naive Bayes|76.83|
|GS Naive Bayes|76.80|
|ICA Logistic Regression|77.32|
|GS Logistic Regression|76.99|
|Loopy Belief Propagation|77.59|
Table 2 reports the accuracy obtained by a neural network with one hidden layer (200 hidden neurons) trained in a supervised fashion and by training the same network from supervision and logic knowledge in LYRICS, varying the amount of available training data and averaging over 10 random splits of the training and test data. The improvements over the baseline are statistically significant for all the tested configurations. Table 3 compares the neural network classifiers against two other content-based classifiers, namely logistic regression (LR) and Naive Bayes (NB), and against collective classification approaches using network data: Iterative Classification Algorithm (ICA) [15] and Gibbs Sampling (GS), both applied on top of the output of the LR and NB content-based classifiers. Furthermore, the results of the two top performers on this task, Loopy Belief Propagation (LBP) and Relaxation Labeling through Mean-Field Approach (MF), are reported. The accuracy values are obtained as an average over 10 folds created by random splits of the overall data into train and test sets. Unlike the other network-based approaches, which can only be run at test time (collective classification), LYRICS can distill the knowledge into the weights of the neural network. Its accuracy is the highest among all the tested methodologies, even though the underlying neural network classifier trained only via the supervisions performed slightly worse than the other content-based competitors.
This paper presents a novel and general framework, called LYRICS, to bridge logic reasoning and deep learning. The framework is directly implemented in TensorFlow, allowing a seamless integration that is architecture agnostic. The frontend of the framework is a declarative language based on First Order Logic. This paper presents a set of examples illustrating the generality and expressivity of the framework, which can be applied to a large range of tasks, including classification, pattern generation, and symbolic reasoning.
[1] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al.: TensorFlow: A system for large-scale machine learning. In: OSDI. vol. 16, pp. 265–283 (2016)
[2] Bach, S.H., Broecheler, M., Huang, B., Getoor, L.: Hinge-loss Markov random fields and probabilistic soft logic. arXiv preprint arXiv:1505.04406 (2015)
[3] Cohen, W.W.: TensorLog: A differentiable deductive database. arXiv preprint arXiv:1605.06523 (2016)
[4] Demeester, T., Rocktäschel, T., Riedel, S.: Lifted rule injection for relation embeddings. arXiv preprint arXiv:1606.08359 (2016)
[5] Diligenti, M., Gori, M., Maggini, M., Rigutini, L.: Bridging logic and kernel machines. Machine Learning 86(1), 57–88 (2012)
[6] Diligenti, M., Gori, M., Saccà, C.: Semantic-based regularization for learning and inference. Artificial Intelligence (2015)
[7] Diligenti, M., Roychowdhury, S., Gori, M.: Image classification using deep learning and prior knowledge. In: Proceedings of Third International Workshop on Declarative Learning Based Programming (DeLBP) (February 2018)
[8] Dillon, J.V., Langmore, I., Tran, D., Brevdo, E., Vasudevan, S., Moore, D., Patton, B., Alemi, A., Hoffman, M., Saurous, R.A.: TensorFlow Distributions. arXiv preprint arXiv:1711.10604 (2017)
[9] Hájek, P.: Metamathematics of fuzzy logic, vol. 4. Springer Science & Business Media (1998)
[10] Hu, Z., Ma, X., Liu, Z., Hovy, E., Xing, E.: Harnessing deep neural networks with logic rules. arXiv preprint arXiv:1603.06318 (2016)
[11] Kimmig, A., Bach, S., Broecheler, M., Huang, B., Getoor, L.: A short introduction to probabilistic soft logic. In: Proceedings of the NIPS Workshop on Probabilistic Programming: Foundations and Applications. pp. 1–4 (2012)
[12] Lu, Q., Getoor, L.: Link-based classification. In: Proceedings of the 20th International Conference on Machine Learning (ICML-03). pp. 496–503 (2003)
[13] Minervini, P., Demeester, T., Rocktäschel, T., Riedel, S.: Adversarial sets for regularising neural link predictors. arXiv preprint arXiv:1707.07596 (2017)
[14] Muggleton, S., De Raedt, L.: Inductive logic programming: Theory and methods. The Journal of Logic Programming 19, 629–679 (1994)
[15] Neville, J., Jensen, D.: Iterative classification in relational data. In: Proc. AAAI-2000 Workshop on Learning Statistical Models from Relational Data. pp. 13–20 (2000)
[16] Novák, V., Perfilieva, I., Močkoř, J.: Mathematical principles of fuzzy logic (1999)
[17] Richardson, M., Domingos, P.: Markov logic networks. Machine Learning 62(1), 107–136 (2006)
[18] Rocktäschel, T., Singh, S., Riedel, S.: Injecting logical background knowledge into embeddings for relation extraction. In: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 1119–1129 (2015)
[19] Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., Eliassi-Rad, T.: Collective classification in network data. AI Magazine 29(3), 93 (2008)
[20] Serafini, L., Garcez, A.S.d.: Learning and reasoning with logic tensor networks. In: AI*IA. pp. 334–348 (2016)
[21] Tran, D., Hoffman, M.D., Saurous, R.A., Brevdo, E., Murphy, K., Blei, D.M.: Deep probabilistic programming. In: International Conference on Learning Representations (2017)