Learning and T-Norms Theory

07/26/2019 ∙ by Giuseppe Marra, et al. ∙ Università di Siena UNIFI 2

Deep learning has been shown to achieve impressive results in several domains like computer vision and natural language processing. Deep architectures are typically trained following a supervised scheme and, therefore, they rely on the availability of a large amount of labeled training data to effectively learn their parameters. Neuro-symbolic approaches have recently gained popularity to inject prior knowledge into a deep learner without requiring it to induce this knowledge from data. These approaches can potentially learn competitive solutions with a significant reduction of the amount of supervised data. A large class of neuro-symbolic approaches is based on First-Order Logic to represent prior knowledge, that is relaxed to a differentiable form using fuzzy logic. This paper shows that the loss function expressing these neuro-symbolic learning tasks can be unambiguously determined given the selection of a t-norm generator. When restricted to simple supervised learning, the presented theoretical apparatus provides a clean justification to the popular cross-entropy loss, that has been shown to provide faster convergence and to reduce the vanishing gradient problem in very deep structures. One advantage of the proposed learning formulation is that it can be extended to all the knowledge that can be represented by a neuro-symbolic method, and it allows the development of a novel class of loss functions, that the experimental results show to lead to faster convergence rates than other approaches previously proposed in the literature.

I Introduction

Deep Neural Networks [1] have been a break-through for several classification problems involving sequential or high-dimensional data. However, deep neural architectures strongly rely on a large amount of labeled data to develop powerful feature representations. Unfortunately, it is difficult and labor intensive to annotate large collections of data. Prior knowledge based on logic is a natural solution to make learning efficient when the training data is scarce. The integration of logic inference with learning could also overcome another limitation of deep architectures, which is that they mainly act as black-boxes from a human perspective, making their usage challenging in failure critical applications. For these reasons, neuro-symbolic approaches [2] integrating logic and learning have recently gained a lot of attention in the machine learning and artificial intelligence communities. One of the most common approaches relies on expressing the prior knowledge using First-Order Logic (FOL). The integration with a deep learner is then implemented by relaxing the FOL knowledge to a differentiable form using t-norms. The resulting constraints can be enforced using gradient-based optimization techniques [3, 4].

However, most work in this area has approached the translation of logic rules into a differentiable form as a collection of heuristics that often lack consistency and have no clear theoretical justification. For example, there is no agreement about the relation between the selected t-norm and the aggregation performed to express quantification. This paper shows that it is possible to fully derive the cost function expressing the prior knowledge, given only the selection of a t-norm generator. The translation of the logic connectives, the quantifiers and the final loss implementing the constraints are all univocally determined by the generator choice, which leads to a principled and semantically consistent translation.

As shown by this paper, the classical fitting of the training data common to all supervised learning schemes becomes a special case of a logic constraint. Since a careful choice of the loss function has been pivotal to the success of deep learning, it is interesting to study the relation between the supervised loss and the generator choice. Indeed, when restricted to simple supervised learning, the presented theoretical apparatus provides a clean justification for the popular cross-entropy loss [5], which has been shown to provide faster convergence and to reduce the vanishing gradient problem in very deep structures. Moreover, the experimental results show that, when integrating logic knowledge, a careful choice of the generator can provide both a faster convergence of the training process and a better final accuracy, following from having preserved the semantics of the prior knowledge. While these results can be applied to any neuro-symbolic learning task, the theory also suggests the definition of new loss functions for supervised learning, which are potentially more effective than the cross-entropy at limiting the vanishing gradient issue.

The paper is organized as follows: Section II presents some prior work on the integration of learning and logic inference, while Section III presents the basic concepts about t-norms, generators and aggregator functions. Section IV introduces the learning frameworks used to represent supervised learning in terms of logic rules and Section V presents the experimental results. Finally, Section VI draws some conclusions.

II Related Works

Neuro-symbolic approaches [2] express the internal or output structure of the learner using logic. First-Order Logic (FOL) is often selected as the declarative framework for the knowledge because of its flexibility and expressive power. This class of methodologies is rooted in previous work from the Statistical Relational Learning community, which developed frameworks for performing logic inference in the presence of uncertainty. For example, Markov Logic Networks [6] and Probabilistic Soft Logic [7] integrate First-Order Logic (FOL) and graphical models. A common solution to integrate logic reasoning with uncertainty and deep learning relies on using deep networks to approximate the FOL predicates, and the overall architecture is optimized end-to-end by relaxing the FOL into a differentiable form, which translates into a set of constraints. This approach is followed with minor variants by Semantic Based Regularization [3], the Lyrics framework [8], Logic Tensor Networks [4], the Semantic Loss [9] and DeepProbLog [10], which extends the ProbLog [11, 12] framework by using predicates approximated by jointly learned functions.

Within this class of approaches, it is of fundamental importance to define how to perform the fuzzy relaxation of the formulas in the knowledge base. For instance, Serafini et al. [13] introduce a learning framework where formulas are converted according to the Łukasiewicz t-norm and t-conorm. Giannini et al. [14] also propose to convert the formulas according to Łukasiewicz logic, but they exploit the weak conjunction in place of the t-norm to get convex functional constraints. A more practical approach has been considered in Semantic Based Regularization (SBR), where all the fundamental t-norms have been evaluated on different learning tasks [3]. However, no unified principle emerges from this prior work to express the cost function to be optimized with respect to the selected fuzzy logic. For example, all the aforementioned approaches rely on a fixed loss function linearly measuring the distance of the formulas from the 1-value. Even if this may be justified from a logical point of view (it corresponds to the strong negation $\neg x = 1 - x$), it is not clear whether this choice is principled from a learning standpoint, since all deep learning approaches use very different loss functions to enforce the fitting of the supervised data.

From a learning point of view, different quantifier conversions can be taken into account and validated as well. For instance, the arithmetic mean and the maximum operator have been used to convert the universal and existential quantifiers, respectively, in Diligenti et al. [3]. Different possibilities have been considered for the universal quantifier in Donadello et al. [4], while the existential quantifier depends on this choice via the application of the strong negation using the De Morgan law. The arithmetic mean operator has been shown to achieve better performance in the conversion of the universal quantifier [4], with the existential quantifier implemented by Skolemization. However, the universal and existential quantifiers can be thought of as a generalized AND and OR, respectively. Therefore, converting the quantifiers using a mean operator has no direct justification inside a logic theory.

III Fundamentals of T-Norm Fuzzy Logic

Many-valued logics have been introduced in order to extend the admissible set of truth values from true ($1$) and false ($0$) to a scale of truth degrees having absolutely true and absolutely false as boundary cases. In particular, in fuzzy logic the set of truth values coincides with the real unit interval $[0,1]$. In this section, the basic notions of fuzzy logic together with some remarkable examples are introduced. According to [15], a fuzzy logic can be defined upon a certain t-norm (triangular norm) representing an extension of the Boolean conjunction.

Definition 1 (t-norm).

$T: [0,1]^2 \to [0,1]$ is a t-norm if and only if for every $x, y, z \in [0,1]$:

$$T(x,y) = T(y,x), \quad T(x, T(y,z)) = T(T(x,y), z), \quad T(x,1) = x, \quad T(x,0) = 0, \quad x \le y \Rightarrow T(x,z) \le T(y,z).$$

$T$ is a continuous t-norm if it is continuous as a function.

Table I reports the algebraic definition of t-norms and other logical operators definable by the chosen t-norm for Gödel, Łukasiewicz and Product logics respectively, which are referred to as the fundamental fuzzy logics because all the continuous t-norms can be obtained from them by ordinal sums [16, 17]. The notation of the logical operators in Table I is given by the following definitions according to a certain t-norm $T$:

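Definition 2 (definable connectives from a t-norm).

Writing $x \otimes y = T(x,y)$, for every $x, y \in [0,1]$:

  • residuum: $x \Rightarrow y = \sup\{z : x \otimes z \le y\}$

  • bi-residuum: $x \Leftrightarrow y = (x \Rightarrow y) \otimes (y \Rightarrow x)$

  • weak conjunction: $x \wedge y = x \otimes (x \Rightarrow y)$

  • weak disjunction: $x \vee y = ((x \Rightarrow y) \Rightarrow y) \wedge ((y \Rightarrow x) \Rightarrow x)$

  • residual negation: $\sim x = x \Rightarrow 0$

  • strong negation: $\neg x = 1 - x$

  • t-conorm: $x \oplus y = \neg(\neg x \otimes \neg y)$

  • material implication: $x \rightarrow y = \neg x \oplus y$

Operator | Gödel | Łukasiewicz | Product
$x \otimes y$ | $\min\{x,y\}$ | $\max\{0, x+y-1\}$ | $x \cdot y$
$x \Rightarrow y$ | $1$ if $x \le y$, else $y$ | $\min\{1, 1-x+y\}$ | $1$ if $x \le y$, else $y/x$
$x \Leftrightarrow y$ | $1$ if $x = y$, else $\min\{x,y\}$ | $1 - |x-y|$ | $1$ if $x = y$, else $\min\{x,y\}/\max\{x,y\}$
$x \wedge y$ | $\min\{x,y\}$ | $\min\{x,y\}$ | $\min\{x,y\}$
$x \vee y$ | $\max\{x,y\}$ | $\max\{x,y\}$ | $\max\{x,y\}$
$\sim x$ | $1$ if $x = 0$, else $0$ | $1-x$ | $1$ if $x = 0$, else $0$
$\neg x$ | $1-x$ | $1-x$ | $1-x$
$x \oplus y$ | $\max\{x,y\}$ | $\min\{1, x+y\}$ | $x+y-x \cdot y$
$x \rightarrow y$ | $\max\{1-x, y\}$ | $\min\{1, 1-x+y\}$ | $1-x+x \cdot y$
TABLE I: The truth functions for the residuum, bi-residuum, weak conjunction, weak disjunction, residual negation, strong negation, t-conorms and material implication of the fundamental fuzzy logics.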
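As a small illustration of Definition 1, the sketch below implements the three fundamental t-norms in plain Python and checks the t-norm axioms numerically on a coarse grid. The function names, the grid and the tolerance are choices made for this example only, and a grid check is of course not a proof.

```python
import itertools

def t_godel(x, y):
    """Goedel t-norm: minimum."""
    return min(x, y)

def t_lukasiewicz(x, y):
    """Lukasiewicz t-norm: bounded difference."""
    return max(0.0, x + y - 1.0)

def t_product(x, y):
    """Product t-norm."""
    return x * y

def satisfies_tnorm_axioms(t, grid, tol=1e-9):
    """Check commutativity, associativity, monotonicity and the
    boundary conditions T(x,1)=x, T(x,0)=0 on every grid point."""
    for x, y, z in itertools.product(grid, repeat=3):
        if abs(t(x, y) - t(y, x)) > tol:
            return False
        if abs(t(x, t(y, z)) - t(t(x, y), z)) > tol:
            return False
        if x <= y and t(x, z) > t(y, z) + tol:
            return False
    return all(abs(t(x, 1.0) - x) <= tol and abs(t(x, 0.0)) <= tol
               for x in grid)

GRID = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0

for name, t in [("Goedel", t_godel),
                ("Lukasiewicz", t_lukasiewicz),
                ("Product", t_product)]:
    print(name, satisfies_tnorm_axioms(t, GRID))
```

The arithmetic mean, by contrast, fails the check (it violates $T(x,1) = x$), which is consistent with the remark that mean-based aggregation is not a t-norm.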

III-A Archimedean T-Norms

In mathematics, t-norms [18, 19] are a special kind of binary operations on the real unit interval especially used in engineering applications of fuzzy logic. Table I reports the fundamental continuous t-norms, but a much wider class of t-norms has been considered in the literature. In addition, there are several techniques to construct customized t-norms that are more suitable to deal with a certain problem, such as rotations or ordinal sums of other t-norms, or the definition of parametric classes. This section introduces Archimedean t-norms [20], a special class of t-norms that can be constructed by means of unary monotone functions, called generators.

Definition 3.

A t-norm $T$ is said to be Archimedean if for every $x \in (0,1)$, $T(x,x) < x$. In addition, $T$ is said to be strict if for all $x \in (0,1)$, $0 < T(x,x) < x$; otherwise $T$ is said to be nilpotent.

For instance, the Łukasiewicz ($T_L$) and Product ($T_P$) t-norms are respectively nilpotent and strict, while the Gödel ($T_M$) t-norm is not Archimedean; indeed, it is idempotent ($T_M(x,x) = x$, for all $x \in [0,1]$). In addition, the Łukasiewicz and Product t-norms are enough to represent the whole classes of nilpotent and strict Archimedean t-norms [19].

Theorem 1.

Any nilpotent t-norm is isomorphic to $T_L$ and any strict t-norm is isomorphic to $T_P$.

A fundamental result for the construction of t-norms by additive generators is based on the following theorem [21].

Theorem 2.

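Let $g: [0,1] \to [0,+\infty]$ be a strictly decreasing function with $g(1) = 0$ and $g(x) + g(y) \in Range(g) \cup [g(0^+), +\infty]$ for all $x, y$ in $[0,1]$, and $g^{-1}$ its pseudo-inverse. Then the function $T: [0,1]^2 \to [0,1]$ defined as

$$T(x,y) = g^{-1}\left(\min\{g(0^+),\, g(x) + g(y)\}\right) \qquad (1)$$

is a t-norm and $g$ is said to be an additive generator for $T$. $T$ is strict if $g(0^+) = +\infty$, otherwise $T$ is nilpotent.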

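Example 1.

If we take $g(x) = 1 - x$, we get

$$T(x,y) = 1 - \min\{1,\, 1 - x + 1 - y\} = \max\{0,\, x + y - 1\} = T_L(x,y),$$

while taking $g(x) = -\log(x)$, we get

$$T(x,y) = \exp\left(-\min\{+\infty,\, -\log(x) - \log(y)\}\right) = x \cdot y = T_P(x,y).$$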
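The construction of Theorem 2 can be sketched in a few lines of Python. The helper `make_tnorm` and its argument names are hypothetical (they are not part of any framework mentioned in the paper); the caller is assumed to supply the generator, its inverse on $[0, g(0^+)]$ and the value $g(0^+)$.

```python
import math

def make_tnorm(g, g_inv, g_at_zero):
    """Binary t-norm T(x, y) = g_inv(min(g(0+), g(x) + g(y)))
    built from an additive generator g (Equation 1)."""
    def t(x, y):
        return g_inv(min(g_at_zero, g(x) + g(y)))
    return t

# Generator g(x) = 1 - x, with g(0+) = 1 (nilpotent case).
t_luk = make_tnorm(lambda x: 1.0 - x, lambda u: 1.0 - u, 1.0)

# Generator g(x) = -log(x), with g(0+) = +inf (strict case).
def g_prod(x):
    return math.inf if x == 0.0 else -math.log(x)

t_prod = make_tnorm(g_prod, lambda u: math.exp(-u), math.inf)

print(t_luk(0.7, 0.6))   # Lukasiewicz: max(0, 0.7 + 0.6 - 1)
print(t_prod(0.5, 0.4))  # Product: 0.5 * 0.4
```

Only the generator changes between the two cases; the composition rule stays the same, which is the point exploited in the rest of the paper.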

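An interesting consequence of equation (1) is that it also allows us to define the other fuzzy connectives deriving from the t-norm in terms of the additive generator. For instance:

$$x \Rightarrow y = g^{-1}\left(\max\{0,\, g(y) - g(x)\}\right), \quad x \Leftrightarrow y = g^{-1}\left(|g(x) - g(y)|\right), \quad x \oplus y = 1 - g^{-1}\left(\min\{g(0^+),\, g(1-x) + g(1-y)\}\right) \qquad (2)$$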
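To make the dependence of the other connectives on the generator concrete, the following sketch computes the residuum and the t-conorm from a generator. It assumes the standard generator-based expressions $x \Rightarrow y = g^{-1}(\max\{0, g(y) - g(x)\})$ and $x \oplus y = 1 - g^{-1}(\min\{g(0^+), g(1-x) + g(1-y)\})$; the function names are illustrative.

```python
import math

def residuum(g, g_inv, x, y):
    """x => y = g_inv(max(0, g(y) - g(x)))."""
    return g_inv(max(0.0, g(y) - g(x)))

def t_conorm(g, g_inv, g_at_zero, x, y):
    """x (+) y = 1 - g_inv(min(g(0+), g(1 - x) + g(1 - y)))."""
    return 1.0 - g_inv(min(g_at_zero, g(1.0 - x) + g(1.0 - y)))

# Lukasiewicz generator and inverse.
def g_luk(x):
    return 1.0 - x

def g_luk_inv(u):
    return 1.0 - u

# Product generator and inverse.
def g_prod(x):
    return math.inf if x == 0.0 else -math.log(x)

def g_prod_inv(u):
    return math.exp(-u)

# Lukasiewicz residuum: min(1, 1 - x + y)
print(residuum(g_luk, g_luk_inv, 0.8, 0.4))
# Product residuum: y / x when x > y
print(residuum(g_prod, g_prod_inv, 0.8, 0.4))
# Product t-conorm: x + y - x * y
print(t_conorm(g_prod, g_prod_inv, math.inf, 0.3, 0.5))
```

The printed values agree with the closed forms usually reported for the Łukasiewicz and Product logics.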

The isomorphism between addition on $[0,+\infty]$ and multiplication on $[0,1]$ by the logarithm and the exponential functions allows two-way transformations between additive and multiplicative generators of a t-norm. If $g$ is an additive generator of a t-norm $T$, then the strictly increasing function $\theta: [0,1] \to [0,1]$ defined as $\theta(x) = e^{-g(x)}$ is a multiplicative generator of $T$, namely:

$$T(x,y) = \theta^{-1}\left(\max\{\theta(0^+),\, \theta(x) \cdot \theta(y)\}\right).$$

Conversely, if $\theta$ is a multiplicative generator of $T$, then $g(x) = -\log(\theta(x))$ is an additive generator of $T$. For instance, $\theta(x) = e^{x-1}$ and $\theta(x) = x$ are multiplicative generators of $T_L$ and $T_P$, respectively. Additive and multiplicative generators are isomorphic and we decide to focus on the former for simplicity. We only mention that positive multiples of an additive generator and positive powers of a multiplicative generator determine the same t-norm.

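III-B Parameterized classes of t-norms

Given a generator of a t-norm depending on a certain parameter, we can define a class of related t-norms depending on such parameter. For instance, given a generator function $g$ of a t-norm $T$ and $\lambda > 0$, then $T^{\lambda}$, corresponding to the generator function $g^{\lambda}(x) = (g(x))^{\lambda}$, denotes a class of increasing t-norms. In addition, let $T_D$ and $T_M$ denote the Drastic (defined by $T_D(x,y) = \min\{x,y\}$ if $\max\{x,y\} = 1$, and $T_D(x,y) = 0$ otherwise) and Gödel t-norms respectively, we get:

$$\lim_{\lambda \to 0^+} T^{\lambda} = T_D, \qquad \lim_{\lambda \to +\infty} T^{\lambda} = T_M.$$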

On the other hand, several parameterized families of t-norms have been introduced and studied in the literature [19]. In the following we recall some prominent examples we will exploit in the experimental evaluation.

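Definition 4 (The Schweizer-Sklar family).

For $-\infty < \lambda < +\infty$, consider:

$$g^{SS}_{\lambda}(x) = \begin{cases} -\log(x) & \text{if } \lambda = 0 \\ \dfrac{1 - x^{\lambda}}{\lambda} & \text{otherwise} \end{cases}$$

The t-norms corresponding to this generator are called Schweizer-Sklar t-norms, and they are defined according to:

$$T^{SS}_{\lambda}(x,y) = \begin{cases} T_M(x,y) & \text{if } \lambda = -\infty \\ (x^{\lambda} + y^{\lambda} - 1)^{1/\lambda} & \text{if } -\infty < \lambda < 0 \\ T_P(x,y) & \text{if } \lambda = 0 \\ \max\{0,\, x^{\lambda} + y^{\lambda} - 1\}^{1/\lambda} & \text{if } 0 < \lambda < +\infty \\ T_D(x,y) & \text{if } \lambda = +\infty \end{cases}$$

A Schweizer-Sklar t-norm is Archimedean if and only if $\lambda > -\infty$, continuous if and only if $\lambda < +\infty$, strict if and only if $-\infty < \lambda \le 0$ and nilpotent if and only if $0 < \lambda < +\infty$. This t-norm family is strictly decreasing for $\lambda \ge 0$ and continuous with respect to $\lambda \in [-\infty, +\infty]$; in addition, $T^{SS}_{1} = T_L$.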
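The Schweizer-Sklar family can be explored numerically with the sketch below, which builds the t-norm from the generator $(1 - x^{\lambda})/\lambda$ (and $-\log x$ at $\lambda = 0$) for finite $\lambda$. It assumes $g(0^+) = 1/\lambda$ for $\lambda > 0$ and $g(0^+) = +\infty$ otherwise; the function names are hypothetical.

```python
import math

def ss_generator(lam):
    """Additive generator of the Schweizer-Sklar t-norm (finite lam)."""
    def g(x):
        if x == 0.0 and lam <= 0:
            return math.inf              # strict case: g(0+) = +inf
        if lam == 0:
            return -math.log(x)
        return (1.0 - x ** lam) / lam
    return g

def ss_generator_inverse(lam):
    """Inverse of the generator on [0, g(0+)]."""
    def g_inv(u):
        if lam == 0:
            return math.exp(-u)
        base = max(0.0, 1.0 - lam * u)   # clamp floating-point noise
        return base ** (1.0 / lam)
    return g_inv

def ss_tnorm(lam, x, y):
    """T(x, y) = g_inv(min(g(0+), g(x) + g(y))) for the SS generator."""
    g, g_inv = ss_generator(lam), ss_generator_inverse(lam)
    g_at_zero = 1.0 / lam if lam > 0 else math.inf
    return g_inv(min(g_at_zero, g(x) + g(y)))

for lam in (-1, 0, 1, 2):
    print(lam, ss_tnorm(lam, 0.8, 0.9))
```

With $\lambda = 1$ the sketch reproduces the Łukasiewicz t-norm and with $\lambda = 0$ the Product t-norm, and the printed values decrease as $\lambda$ grows.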

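Definition 5 (Frank t-norms).

For $0 < \lambda \le +\infty$, consider:

$$g^{F}_{\lambda}(x) = \begin{cases} -\log(x) & \text{if } \lambda = 1 \\ 1 - x & \text{if } \lambda = +\infty \\ \log\left(\dfrac{\lambda - 1}{\lambda^{x} - 1}\right) & \text{otherwise} \end{cases}$$

The t-norms corresponding to this generator are called Frank t-norms and they are strict if $\lambda < +\infty$. The overall class of Frank t-norms is decreasing and continuous.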

IV Logic and Learning

This paper presents a theoretical apparatus that may be exploited in different learning settings, especially in contexts where some relational knowledge about the task is available, and the input patterns are not assumed to be independent and identically distributed. According to the learning from constraints paradigm [22], knowledge is represented by a set of constraints and the learning process is conceived as the problem of finding the task functions (implementing FOL predicates) that best satisfy the constraints. In particular, in multi-task learning, additional information can be expressed by logical constraints, and supervisions are a special class of constraints forcing the fitting of the positive and negative examples for the task. An example of extra prior knowledge that may be available about a learning task is a statement like “any pattern classified as a cat has to be classified as an animal”, where cat and animal have to be thought of as the membership functions of two classes to learn. In such a sense, symbolic logic provides a natural way to express factual and abstract knowledge about a problem by means of logical formulas.

IV-A Neuro-symbolic learning tasks

Let us consider a multi-task learning problem, where $\mathbf{f} = (f_1, \ldots, f_J)$ denotes the vector of real-valued functions (task functions) to be determined. Given the set $\mathcal{X}$ of available data, a supervised learning problem can be generally formulated as $\min_{\mathbf{f}} \mathcal{L}(\mathcal{X}, \mathbf{f})$, where $\mathcal{L}$ is a positive-valued functional denoting a certain loss. In this framework, this setup is expanded assuming that the task functions are FOL predicates and all the available knowledge about these predicates, including supervisions, is collected into a knowledge base expressed via a set of FOL formulas $KB$. The learning task is generally expressed as:

$$\min_{\mathbf{f}} \mathcal{L}(\mathcal{X}, KB, \mathbf{f}).$$

The link between FOL knowledge and learning can be summarized as follows.

  • Each Individual is an element of a specific domain, which can be used to ground the predicates defined on such domain. Any replacement of variables with individuals for a certain predicate is called grounding.

  • Predicates express the truth degree of some property for an individual (unary predicate) or group of individuals (n-ary predicate). In particular, this paper will focus on learnable predicate functions implemented by (deep) neural networks, even if other models could be used. FOL functions can also be included by learning their approximation like done for predicates, however function-free FOL is used in the paper to keep the notation simple.

  • The knowledge base (KB) is a collection of FOL formulas expressing the learning task. The integration of learning and logical reasoning is achieved by compiling the logical rules into continuous real-valued constraints, which correlate all the defined elements and enforce some desired behaviour on them.

For a given rule in the KB, individuals, predicates, logical connectives and quantifiers can all be seen as nodes of an expression tree [23]. The translation to a constraint corresponds to a post-order visit of the expression tree, where the visit action builds the corresponding portion of the computational graph. In particular:

  • visiting a variable substitutes the variable with the corresponding feature representation of the individual to which the variable is currently assigned;

  • visiting a predicate computes the output of the predicate with the current input groundings;

  • visiting a connective combines the grounded predicate values by means of the real-valued operations associated to the connective;

  • visiting a quantifier aggregates the outputs of the expressions obtained for the single individuals (variable groundings).

Thus, the compilation of the expression tree allows us to convert formulas into real-valued functions, represented by a computational graph, where predicate functions are composed by means of the truth-functions corresponding to connectives and quantifiers. Given a generic formula $\varphi$, we call the corresponding real-valued function its functional representation $f_{\varphi}$. This representation is tightly dependent on the particular choice of the translating t-norm. For instance, given two predicates $p_1, p_2$ and the formula $\varphi(x) = p_1(x) \Rightarrow p_2(x)$, the functional representation of $\varphi$ in Łukasiewicz logic is given by $f_{\varphi}(x) = \min\{1,\, 1 - p_1(x) + p_2(x)\}$.
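The visit of the expression tree described above can be sketched as a small recursive evaluator. This is only an illustration under Łukasiewicz semantics: the classes `Atom`, `And`, `Implies` and the toy predicates are hypothetical, and a real implementation would build a computational graph instead of returning floats.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union

@dataclass(frozen=True)
class Atom:
    predicate: str   # name of a unary predicate
    variable: str    # name of the variable it is applied to

@dataclass(frozen=True)
class And:
    left: "Formula"
    right: "Formula"

@dataclass(frozen=True)
class Implies:
    left: "Formula"
    right: "Formula"

Formula = Union[Atom, And, Implies]

def evaluate(formula: Formula,
             predicates: Dict[str, Callable[[float], float]],
             grounding: Dict[str, float]) -> float:
    """Children are evaluated first, then combined by the connective."""
    if isinstance(formula, Atom):
        # Variable -> individual's features, predicate -> truth degree.
        return predicates[formula.predicate](grounding[formula.variable])
    left = evaluate(formula.left, predicates, grounding)
    right = evaluate(formula.right, predicates, grounding)
    if isinstance(formula, And):
        return max(0.0, left + right - 1.0)    # Lukasiewicz t-norm
    if isinstance(formula, Implies):
        return min(1.0, 1.0 - left + right)    # Lukasiewicz residuum
    raise TypeError(f"unsupported node: {formula!r}")

# Toy predicates on a scalar feature in [0, 1] (illustrative only).
PREDICATES = {"p1": lambda x: x, "p2": lambda x: x * x}
rule = Implies(Atom("p1", "x"), Atom("p2", "x"))
print(evaluate(rule, PREDICATES, {"x": 0.8}))  # min(1, 1 - 0.8 + 0.64)
```

Swapping the two connective branches for generator-based operators would change the semantics without touching the traversal.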

A special note concerns quantifiers, which have to be thought of as aggregating operators with respect to the predicate domains. For instance, according to Novak [24], who first proposed a fuzzy generalization of first-order logic, the universal and existential quantifiers may be converted as the infimum and supremum over a domain variable (or minimum and maximum when dealing with finite domains), which are common to any t-norm fuzzy logic. In particular, given a formula $\varphi(x_i)$ depending on a certain variable $x_i \in \mathcal{X}_i$, where $\mathcal{X}_i$ denotes the available samples for one of the involved predicates in $\varphi$, the semantics of the quantifiers are fuzzified as:

$$\forall x_i\, \varphi(x_i) \longrightarrow \min_{x_i \in \mathcal{X}_i} f_{\varphi}(x_i), \qquad \exists x_i\, \varphi(x_i) \longrightarrow \max_{x_i \in \mathcal{X}_i} f_{\varphi}(x_i).$$

As shown in the next section, this quantifier translation is not well justified for all t-norms and this paper provides a more principled approach to perform this translation.
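For a finite set of samples, the min/max quantifier translation mentioned above amounts to the following sketch. The truth degrees of the two predicates are made-up numbers, and the Łukasiewicz implication is used only to have something to aggregate.

```python
def forall_min(truth_values):
    """Universal quantifier as the minimum over the groundings."""
    return min(truth_values)

def exists_max(truth_values):
    """Existential quantifier as the maximum over the groundings."""
    return max(truth_values)

def implication_lukasiewicz(a, b):
    return min(1.0, 1.0 - a + b)

# Hypothetical outputs of two predicates on three samples.
p1 = [0.9, 0.2, 0.7]
p2 = [0.8, 0.6, 0.1]

groundings = [implication_lukasiewicz(a, b) for a, b in zip(p1, p2)]
forall_value = forall_min(groundings)
exists_value = exists_max(groundings)
print(groundings, forall_value, exists_value)
```

Note that the minimum depends on the single worst grounding only, so in a gradient-based setting just one sample would receive a learning signal per rule; this is one practical reason to look for a different aggregation.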

Once all the formulas in $KB$ are converted into real-valued functions, their distance from satisfaction (i.e. distance from 1-evaluation) can be computed according to a certain decreasing mapping $L$ expressing the penalty for the violation of any constraint. Assuming rule independence, learning can be formulated as the joint minimization over the single rules using the following loss function factorization:

$$\mathcal{L}(\mathcal{X}, KB, \mathbf{f}) = \sum_{\psi \in KB} \lambda_{\psi}\, L\left(f_{\psi}(\mathcal{X}, \mathbf{f})\right) \qquad (3)$$

where any $\lambda_{\psi}$ denotes the weight for the logical constraint $\psi$ in the $KB$, which can be selected via cross-validation or jointly learned [25, 26], $f_{\psi}$ is the functional representation of the formula $\psi$ according to a certain t-norm fuzzy logic and $L$ is a decreasing function denoting the penalty associated to the distance from satisfaction of formulas, so that $L(1) = 0$. This paper will show that the selected semantics for converting a generic formula $\psi$ and the choice of the loss $L$ are intrinsically connected, and that both can be derived from the selection of a t-norm generator.

IV-B Loss Functions by T-Norms Generators

In this section, we present a novel approach to combine the choice of both the fuzzy conversion of formulas and the penalty map according to a unified principle. In particular, we investigate the mapping of formulas into constraints by means of generated t-norm fuzzy logics, and we exploit the same additive generator of the t-norm to map the formulas to be satisfied into the functional constraints to be minimized, i.e. we consider $L = g$. Moreover, since the quantifiers can be seen as generalized AND and OR over the grounded expressions (see Remark 1), we show that the fuzzy conversion, as well as the overall loss function expressed in Equation 3, only depends on the chosen t-norm generator.

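Remark 1.

Given a formula $\varphi(x)$ defined on $\mathcal{X}$, the role of the quantifiers has to be interpreted as follows,

$$\forall x\, \varphi(x) \simeq \varphi(x_1) \otimes \ldots \otimes \varphi(x_N), \qquad \exists x\, \varphi(x) \simeq \varphi(x_1) \oplus \ldots \oplus \varphi(x_N),$$

where $\mathcal{X} = \{x_1, \ldots, x_N\}$ denotes the available samples.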

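Given a certain formula $\varphi(x)$ depending on a variable $x$ that ranges in the set $\mathcal{X}$ and its corresponding functional representation $f_{\varphi}(x)$, the conversion of any universal quantifier may be carried out by means of an Archimedean t-norm $T$, while the existential quantifier by a t-conorm. For instance, given the formula $\psi = \forall x\, \varphi(x)$, we have:

$$f_{\psi}(\mathcal{X}, \mathbf{f}) = g^{-1}\left(\min\left\{g(0^+),\, \sum_{x \in \mathcal{X}} g\left(f_{\varphi}(x)\right)\right\}\right) \qquad (4)$$

where $g$ is an additive generator of the t-norm $T$.

Since any generator function is decreasing and $g(1) = 0$, the generator function is a very natural choice for the loss that maps the fuzzy conversion of the formula, as reported in Equation 4, into a constraint to be minimized. By exploiting the same generator of $T$ to map $f_{\psi}$ into a loss function, we get the following term to be minimized:

$$L\left(f_{\psi}(\mathcal{X}, \mathbf{f})\right) = \begin{cases} \min\left\{g(0^+),\, \sum_{x \in \mathcal{X}} g\left(f_{\varphi}(x)\right)\right\} & \text{if } T \text{ is nilpotent} \\ \sum_{x \in \mathcal{X}} g\left(f_{\varphi}(x)\right) & \text{if } T \text{ is strict} \end{cases} \qquad (5)$$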

As a consequence, the following result can be provided with respect to the convexity of the loss $L(f_{\varphi})$.

Proposition 1.

If $g$ is a linear function and $f_{\varphi}$ is concave, $L(f_{\varphi})$ is convex. If $g$ is a convex function and $f_{\varphi}$ is linear, $L(f_{\varphi})$ is convex.

Proof.

Both the arguments follow since if $f_{\varphi}$ is concave (we recall that a linear function is both concave and convex) and $g$ is a convex non-increasing function defined over a univariate domain, then $g \circ f_{\varphi}$ is convex. ∎

Proposition 1 establishes a general criterion to define convex constraints according to a certain generator, depending on the fuzzy conversion $f_{\varphi}$ and, in turn, on the logical expression $\varphi$. Some application cases are reported in Example 2.

TABLE II: Example of the translation of with respect to the selection of a t-norm generator . The simplification expressed on the right side is general and can be applied for a wide range of logical operators.
Example 2.

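If $g(x) = 1 - x$ we get the Łukasiewicz t-norm, which is nilpotent. Hence, from Equation 5 we get:

$$L\left(f_{\psi}(\mathcal{X}, \mathbf{f})\right) = \min\left\{1,\, \sum_{x \in \mathcal{X}} \left(1 - f_{\varphi}(x)\right)\right\}.$$

In case $f_{\varphi}$ is concave [14], this function is convex.

If $g(x) = -\log(x)$ (Product t-norm), from Equation 5 we get a generalization of the cross-entropy loss:

$$L\left(f_{\psi}(\mathcal{X}, \mathbf{f})\right) = -\sum_{x \in \mathcal{X}} \log\left(f_{\varphi}(x)\right).$$

In case $f_{\varphi}$ is linear (e.g. a literal), this function is convex.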
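The two cases of Example 2 can be checked numerically with the sketch below. It assumes that a supervision is encoded as the rule "for every positive example $P(x)$ holds and for every negative example $\neg P(x) = 1 - P(x)$ holds"; the helper names and the sample outputs are illustrative.

```python
import math

def g_lukasiewicz(x):
    return 1.0 - x

def g_product(x):
    return math.inf if x == 0.0 else -math.log(x)

def universal_loss(g, g_at_zero, truth_values):
    """Loss of a universally quantified formula:
    min(g(0+), sum over groundings of g(truth value))."""
    return min(g_at_zero, sum(g(v) for v in truth_values))

def supervised_truth_values(outputs, labels):
    """Truth degree of each supervision: P(x) for positives,
    strong negation 1 - P(x) for negatives."""
    return [p if y == 1 else 1.0 - p for p, y in zip(outputs, labels)]

def binary_cross_entropy(outputs, labels):
    """Reference (unnormalized) binary cross-entropy."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1.0 - p)
                for p, y in zip(outputs, labels))

outputs = [0.9, 0.2, 0.6]   # hypothetical classifier outputs
labels = [1, 0, 1]
truth = supervised_truth_values(outputs, labels)

loss_luk = universal_loss(g_lukasiewicz, 1.0, truth)
loss_prod = universal_loss(g_product, math.inf, truth)
print(loss_luk, loss_prod, binary_cross_entropy(outputs, labels))
```

With the generator $-\log$, the loss coincides with the summed binary cross-entropy on the same outputs, while the generator $1 - x$ gives a clipped linear penalty.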

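So far, we only considered the case of a general formula $\varphi$. In the following, different cases of interest for $\varphi$ are reported. Given an additive generator $g$ for a t-norm $T$, additional connectives may be expressed with respect to $g$, as reported e.g. by Equation 2. If $p_1, p_2$ are two unary predicate functions sharing the same input domain $\mathcal{X}$, the following formulas yield the following penalty terms, where we supposed $T$ strict for simplicity:

$$\forall x\, p_1(x) \longrightarrow \sum_{x \in \mathcal{X}} g(p_1(x))$$
$$\forall x\, p_1(x) \otimes p_2(x) \longrightarrow \sum_{x \in \mathcal{X}} \left(g(p_1(x)) + g(p_2(x))\right)$$
$$\forall x\, p_1(x) \Rightarrow p_2(x) \longrightarrow \sum_{x \in \mathcal{X}} \max\{0,\, g(p_2(x)) - g(p_1(x))\}$$
$$\forall x\, p_1(x) \Leftrightarrow p_2(x) \longrightarrow \sum_{x \in \mathcal{X}} \left|g(p_1(x)) - g(p_2(x))\right|$$

Depending on the generator, different loss functions may arise from the same FOL formula. Further, one may consider customized loss components that are more suitable for a certain learning problem, or exploit the described construction to recover already known machine learning losses, as for the cross-entropy loss (see Example 2).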
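As an illustration of how the same formula yields different penalties under different generators, the sketch below evaluates the penalty of a universally quantified implication and equivalence. It assumes a strict t-norm and the generator-based forms $\max\{0, g(p_2) - g(p_1)\}$ for the residuum and $|g(p_1) - g(p_2)|$ for the bi-residuum; the generator $1/x - 1$ is used only as a second example of a strict generator.

```python
import math

def implication_penalty(g, p1_values, p2_values):
    """Penalty of 'forall x: p1(x) => p2(x)' for a strict generator g."""
    return sum(max(0.0, g(b) - g(a)) for a, b in zip(p1_values, p2_values))

def equivalence_penalty(g, p1_values, p2_values):
    """Penalty of 'forall x: p1(x) <=> p2(x)' for a strict generator g."""
    return sum(abs(g(a) - g(b)) for a, b in zip(p1_values, p2_values))

def g_log(x):
    """Product generator (inputs assumed strictly positive here)."""
    return -math.log(x)

def g_reciprocal(x):
    """Another strict generator, g(x) = 1/x - 1 (inputs > 0)."""
    return 1.0 / x - 1.0

# Hypothetical predicate outputs on two samples.
p1 = [0.9, 0.2]
p2 = [0.6, 0.7]

print(implication_penalty(g_log, p1, p2))         # log(0.9 / 0.6)
print(implication_penalty(g_reciprocal, p1, p2))  # 1/0.6 - 1/0.9
print(equivalence_penalty(g_log, p1, p2))
```

Only the first sample violates the implication ($p_1 > p_2$), so it is the only one contributing to the implication penalty, whereas both samples contribute to the equivalence penalty.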

Example 3.

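If $g(x) = \frac{1}{x} - 1$, with corresponding strict t-norm $T(x,y) = \frac{xy}{x + y - xy}$, the functional constraint 5 that is obtained applying $g$ to the formula $\forall x\, p_1(x) \Rightarrow p_2(x)$ is given by

$$\sum_{x \in \mathcal{X}} \max\left\{0,\, \frac{1}{p_2(x)} - \frac{1}{p_1(x)}\right\}.$$

While if $g(x) = 1 - x^2$, with corresponding nilpotent t-norm $T(x,y) = \sqrt{\max\{0,\, x^2 + y^2 - 1\}}$, the constraint is given by

$$\min\left\{1,\, \sum_{x \in \mathcal{X}} \max\left\{0,\, p_1(x)^2 - p_2(x)^2\right\}\right\}.$$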

An interesting property of this method consists in the fact that, in case of compound formulas, some occurrences of the generator may be simplified. For instance, this is shown in Table II for the formula . However, this property does not hold for all the connectives that are definable upon a certain generated t-norm (see Definition 2). For instance, becomes:

This suggests identifying the connectives that, on one hand, allow the simplification of any occurrence of $g^{-1}$ by applying $g$ in its corresponding functional expression and, on the other hand, allow the evaluation of $g$ only on grounded predicates. For short, in the following we say that the formulas built upon such connectives have the simplification property.

Lemma 1.

Any formula , whose connectives are restricted to , has the simplification property.

Proof.

The proof is by induction with respect to the number of connectives occurring in .

  • If , i.e. for a certain , ; then , hence has the simplification property.

  • If , then for and we have the following cases.

    • If , then we get and the claim follows by inductive hypothesis on whose number of involved connectives is less or equal than . The argument still holds replacing with and with .

    • If , then we get

      As in the previous case, the claim follows by inductive hypothesis on .

    • The remaining of the cases can be treated at the same way and noting that .

The simplification property provides several advantages from an implementation point of view. On one hand, it allows the evaluation of the generator function only on grounded predicate expressions and avoids an explicit computation of the pseudo-inverse $g^{-1}$. In addition, this property provides a general method to implement $n$-ary t-norms, of which universal quantifiers can be seen as a special case since we only deal with finite domains (see more in Section IV-D).

The simplification property yields an interesting analogy between truth functions and loss functions. In logic, the truth degree of a formula is obtained by combining the truth degrees of its sub-formulas by means of connectives and quantifiers. In the same way, the loss corresponding to a formula that satisfies the property is obtained by combining the losses corresponding to its sub-formulas, with connectives and quantifiers combining losses rather than truth degrees.
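The analogy between truth functions and loss functions can be verified numerically on a small compound formula. The sketch below takes the Product generator and the formula $(p_1 \otimes p_2) \Rightarrow p_3$ as an assumed example: the loss is computed once by first evaluating the truth degree (which needs $g^{-1}$) and once by combining the losses of the atoms directly.

```python
import math

def g(x):
    """Product generator."""
    return math.inf if x == 0.0 else -math.log(x)

def g_inv(u):
    return math.exp(-u)

def conj(x, y):
    """t-norm from the generator (strict case)."""
    return g_inv(g(x) + g(y))

def residuum(x, y):
    """Residuum from the generator."""
    return g_inv(max(0.0, g(y) - g(x)))

def loss_direct(p1, p2, p3):
    """Evaluate the truth degree first, then apply g to it."""
    return g(residuum(conj(p1, p2), p3))

def loss_simplified(p1, p2, p3):
    """Apply g to the grounded atoms only; no inverse is needed."""
    return max(0.0, g(p3) - (g(p1) + g(p2)))

print(loss_direct(0.9, 0.8, 0.5), loss_simplified(0.9, 0.8, 0.5))
```

Both routes give the same value, but the simplified one applies the non-linear generator only to the atoms and composes the results with sums and a max.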

IV-C Example

Let’s consider a simple multi-label classification task where the objects $A, B$ must be detected in a set of input images $\mathcal{I}$, represented as a set of features. The learning task consists in determining the predicates $p_A(i), p_B(i)$, which return true if and only if the input image $i$ is predicted to contain the object $A, B$, respectively. The positive supervised examples are provided as two sets (or equivalently their membership functions) $P_A \subset \mathcal{I}, P_B \subset \mathcal{I}$ with the images known to contain the object $A, B$, respectively. The negative supervised examples for $A, B$ are instead provided as two sets $N_A \subset \mathcal{I}, N_B \subset \mathcal{I}$. Furthermore, the location where the images have been taken is assumed to be known, and a predicate $SameLoc(i_1, i_2)$ can be used to express whether the images $i_1, i_2$ have been taken in the same location. It is finally assumed that it is known as prior knowledge that two images taken in the same location are likely to contain the same object.

TABLE III: Example of the declaration of a learning task expressed using FOL.

The semantics of the above learning task can be expressed using FOL via the statement declarations shown in Table III, where it was assumed that images have been taken in the same location and it holds that and . The statements define the constraints that the learners must respect, expressed as FOL rules. Please note that also the fitting of the supervisions are expressed as constraints, .

Given a selection of t-norm generator $g$ and a set of images $\mathcal{I}$, this DFL program is compiled into the following optimization task:

where is a meta-parameter deciding how strongly the -th contribution should be weighted, is the set of image pairs having the same location and the first two elements of the cost function express the fitting of the supervised data, while the latter two express the knowledge about co-located images.

IV-D Discussion

The presented framework can be contextualized among a new class of learning frameworks, which exploit the continuous relaxation of FOL provided by fuzzy operators to integrate logic knowledge in the learning process [3, 27, 8]. All these frameworks require the user to define all the operators of a given t-norm fuzzy logic. On the other hand, the presented framework requires only the generator to be defined. This provides several advantages, like a minimal implementation effort and an improved numerical stability. Indeed, by exploiting the simplification property, the non-linear operation (the generator) is applied only to grounded atoms, whereas all compositions are performed via stable operators (e.g. min, max, sum). On the contrary, the previous FOL relaxations correspond to an arbitrary mix of non-linear operators, which can potentially lead to numerically unstable implementations.

The presented framework provides a fundamental advantage in the integration with tensor-based machine learning frameworks like TensorFlow [28] or PyTorch [29]. Modern deep learning architectures can be effectively trained by leveraging tensor operations performed via Graphics Processing Units (GPU). However, this ability is conditioned on the possibility of concisely expressing the operators in terms of simple parallelizable operations like sums or products over $n$ arguments, which are often implemented as atomic operations in GPU computing frameworks and do not require resorting to slow iterative procedures. Fuzzy logic operators cannot always be easily generalized to their $n$-ary form. For example, the Łukasiewicz conjunction $T_L(x,y) = \max\{0,\, x + y - 1\}$ can be generalized to $n$-ary form as $T_L(x_1, \ldots, x_n) = \max\{0,\, \sum_{i=1}^{n} x_i - n + 1\}$. On the other hand, the general SS t-norm for $-\infty < \lambda < 0$, $T^{SS}_{\lambda}(x,y) = (x^{\lambda} + y^{\lambda} - 1)^{1/\lambda}$, does not have any generalization and the implementation of the $n$-ary form must resort to an iterative application of the binary form, which is very inefficient in tensor-based computations. Previous frameworks like LTN and SBR had to limit the form of the formulas that can be expressed, or carefully select the t-norms in order to provide efficient $n$-ary implementations. However, the presented framework can express operators in $n$-ary form in terms of the generators. Thanks to the simplification property, $n$-ary operators for any Archimedean t-norm can always be expressed as $T(x_1, \ldots, x_n) = g^{-1}\left(\min\{g(0^+),\, \sum_{i=1}^{n} g(x_i)\}\right)$.
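The difference between the iterative and the generator-based $n$-ary form can be sketched in plain Python as follows. The SS t-norm with $\lambda = -1$ and its generator $1/x - 1$ are used as an assumed example; in a tensor framework the sum over the arguments would be a single reduction operation rather than a Python loop.

```python
from functools import reduce
import math

def ss_minus_one(x, y):
    """Binary Schweizer-Sklar t-norm with lambda = -1."""
    if x == 0.0 or y == 0.0:
        return 0.0
    return 1.0 / (1.0 / x + 1.0 / y - 1.0)

def nary_iterative(binary_tnorm, values):
    """n-ary form by repeated application of the binary t-norm."""
    return reduce(binary_tnorm, values, 1.0)

def nary_from_generator(g, g_inv, g_at_zero, values):
    """n-ary form in one shot: g_inv(min(g(0+), sum_i g(x_i)))."""
    return g_inv(min(g_at_zero, sum(g(v) for v in values)))

def g_ss(x):
    """Generator of the SS t-norm with lambda = -1."""
    return math.inf if x == 0.0 else 1.0 / x - 1.0

def g_ss_inv(u):
    return 1.0 / (1.0 + u)

values = [0.9, 0.8, 0.7]
print(nary_iterative(ss_minus_one, values))
print(nary_from_generator(g_ss, g_ss_inv, math.inf, values))

# Lukasiewicz: generator form vs. closed n-ary form.
luk_values = [0.9, 0.95, 0.85]
print(nary_from_generator(lambda x: 1.0 - x, lambda u: 1.0 - u, 1.0, luk_values))
print(max(0.0, sum(luk_values) - len(luk_values) + 1))
```

The two computations agree, and the generator-based one needs a single pass over the arguments.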

V Experimental Results

The experimental results have been carried out using the Deep Fuzzy Logic (DFL) software framework (http://sailab.diism.unisi.it/deep-logic-framework/), which allows expressing a learning task as a set of FOL formulas. The formulas are compiled into a learning task using the theory of generators described in the previous section. The learning task is then cast into an optimization problem as shown in Section IV-C and, finally, is solved via tensor propagation within the TensorFlow (TF) environment (https://www.tensorflow.org/) [28]. In the rest of this section, it is assumed that each FOL constant corresponds to a tensor storing its feature representation. DFL functions and predicates correspond to a TF computational graph. If the graph does not contain any parameter, it is said to be given, otherwise the function/predicate is said to be learnable, and the parameters will be optimized to maximize the constraint satisfaction. Please note that any learner expressed as a TF computational graph can be transparently integrated into DFL.

V-A The learning task

The CiteSeer dataset [30] consists of 3312 scientific papers, each one assigned to one of 6 classes: Agents, AI, DB, IR, ML and HCI. The papers are not independent as they are connected by a citation network with 4732 links. This dataset defines a relational learning benchmark, where it is assumed that the representation of an input document is not sufficient for its classification without exploiting the citation network. The citation network is typically employed by assuming that two papers connected by a citation belong to the same category.

Fig. 1: Learning dynamics in terms of test accuracy on a supervised task when choosing different t-norms generated by the parameterized SS and Frank families. Panels: (a) SS - GD; (b) Frank - GD; (c) SS - Adam; (d) Frank - Adam. (a) and (b) are learning processes optimized with vanilla gradient descent, while (c) and (d) are learning processes optimized with Adam.

This knowledge can be expressed by providing a general rule of the form $\forall x\, \forall y\;\; Cite(x, y) \rightarrow (P(x) \leftrightarrow P(y))$, where $Cite$ is a binary predicate encoding the fact that $x$ is citing $y$ and $P$ is a task function implementing the membership function of one of the six considered categories. This logical formula expresses a general concept called manifold regularization, which often emerges in relational learning tasks. Indeed, by linking the predictions of two distinct documents, the behaviour of the underlying task functions is regularized, enforcing smooth transitions over the manifold induced by the relation.

Each paper is represented via its bag-of-words, which is a vector having the same size as the vocabulary, with the i-th element equal to 1 or 0 depending on whether the i-th word in the vocabulary is present or absent in the document, respectively. The dictionary consists of 3703 unique words. The set of input document representations is split into a training set and a test set. The percentage of documents in the two splits is varied across the different experiments. The six task functions $P_i$, with $i \in \{Agents, AI, DB, IR, ML, HCI\}$, are bound to the six outputs of a Multi-Layer Perceptron (MLP) implemented in TF. The neural architecture has 3 hidden layers, with 100 ReLU units each, and softmax activation on the output. Therefore, the task functions share the weights of the hidden layers in such a way that all of them can exploit a common hidden representation. The $Cite$ predicate is a given (fully known a priori) function, which outputs 1 if the document passed as first argument cites the document passed as second argument, and 0 otherwise. Furthermore, a given function $S_i$ is defined for each $P_i$, such that it outputs 1 iff $x$ is a positive example for the category $i$ (i.e. it belongs to that category). A manifold regularization learning problem can be defined in DFL by providing, $\forall i$, the following two rules:

$\forall x\, \forall y\;\; Cite(x, y) \rightarrow (P_i(x) \leftrightarrow P_i(y))$ (6)
$\forall x\;\; S_i(x) \rightarrow P_i(x)$ (7)

where only positive supervisions have been provided because the trained networks for this task employ a softmax activation function on the output layer, which has the effect of imposing mutual exclusivity among the task functions, reinforcing the positive class and discouraging all the others. While this behaviour could have been trivially expressed using logic, this network architecture provides a principled baseline to compare against, and it was therefore used across all the experiments for this dataset.
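To make the two rules concrete, the following is a hedged NumPy sketch of the penalties they would induce under the product t-norm generator $g(t) = -\log t$. It assumes that implication is the residuum expressed through the generator, $g(a \rightarrow b) = \max(0, g(b) - g(a))$, and that the citation and supervision predicates are crisp (0 or 1). The function names are hypothetical and this is not the DFL API. Under these assumptions the supervision rule reduces to the cross-entropy of the positive examples, and the manifold rule to an absolute difference of log-probabilities along citation links.

```python
import numpy as np

def g_product(t, eps=1e-12):
    """Additive generator of the product t-norm: g(t) = -log(t)."""
    return -np.log(np.clip(t, eps, 1.0))

def supervision_loss(P, S):
    """Penalty for: forall x, S_i(x) -> P_i(x).

    With crisp S, g(S) is 0 for positives, so they contribute g(P_i(x));
    non-positives satisfy the implication trivially and contribute 0.
    """
    return np.sum(S * g_product(P))

def manifold_loss(P, edges):
    """Penalty for: forall x, y, Cite(x, y) -> (P_i(x) <-> P_i(y)).

    For a citing pair, g(P(x) <-> P(y)) = |g(P(x)) - g(P(y))|;
    pairs without a citation contribute nothing.
    """
    src, dst = edges[:, 0], edges[:, 1]
    return np.sum(np.abs(g_product(P[src]) - g_product(P[dst])))

# Toy example: 3 documents, 2 classes (softmax-like rows), 2 citation links.
P = np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]])
labels = np.array([0, 0, 1])
S = np.eye(2)[labels]                      # one-hot positive supervisions
edges = np.array([[0, 1], [1, 2]])         # (citing, cited) index pairs

sup = supervision_loss(P, S)
cross_entropy = -np.sum(np.log(P[np.arange(3), labels]))
man = manifold_loss(P, edges)
print(sup, cross_entropy, man)
```

For a nilpotent generator the sum inside the bi-implication would additionally be clipped at $g(0)$; the product case needs no clipping because $g(0^+)$ is unbounded.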

DFL allows users to specify the weights of the formulas, which are treated as hyperparameters. Since at most two constraints are used per predicate, the weight of the constraint expressing the fitting of the supervisions (Equation 7) is set to a fixed value equal to 1, while the weight of the manifold regularization rule expressed by Equation 6 is cross-validated over a grid of values.
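The setup of this subsection can be summarized by the following NumPy sketch, given here only as an illustration. It covers the binary bag-of-words encoding, a forward pass through an MLP with 3 hidden layers of 100 ReLU units and a 6-way softmax output, and the weighted sum of the two constraint penalties. The initialization scheme, the function names and the example weight passed to `weighted_objective` are assumptions made for the example; the actual experiments were run in TensorFlow.

```python
import numpy as np

VOCAB_SIZE, HIDDEN, N_CLASSES = 3703, 100, 6

def bag_of_words(word_ids, vocab_size=VOCAB_SIZE):
    """Binary vector with 1 where the i-th vocabulary word occurs in the document."""
    x = np.zeros(vocab_size)
    x[np.asarray(word_ids, dtype=int)] = 1.0   # repeated words still map to 1
    return x

def init_mlp(rng, sizes):
    """One (weights, bias) pair per layer; He-style scaling is an assumption."""
    return [(rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, X):
    """Shared hidden layers, then one softmax output per task function."""
    h = X
    for W, b in params[:-1]:
        h = np.maximum(0.0, h @ W + b)                 # ReLU hidden layer
    W, b = params[-1]
    logits = h @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

def weighted_objective(supervision_penalty, manifold_penalty, w_manifold):
    """Supervision weight fixed to 1; the manifold weight is a hyperparameter."""
    return 1.0 * supervision_penalty + w_manifold * manifold_penalty

rng = np.random.default_rng(0)
params = init_mlp(rng, [VOCAB_SIZE, HIDDEN, HIDDEN, HIDDEN, N_CLASSES])
X = np.stack([bag_of_words([1, 5, 5, 42]), bag_of_words([0, 3702])])
P = forward(params, X)        # P[d, i] plays the role of P_i evaluated on document d
print(P.shape, P.sum(axis=1))
```

Because the six outputs come from one softmax layer, raising the membership of one category necessarily lowers the others, which is why only positive supervisions are needed.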

V-B Results

V-B1 Convergence rate

This experimental setup aims at verifying the relation between the choice of the generator and the speed of convergence of the training process. In particular, a simple supervised learning setup is assumed for this experiment, where the learning task is defined by Equation 7, which simply enforces the fitting of the supervised examples. The training and test sets are composed of 90% and 10% of the total number of papers, respectively. Two parametric families of t-norms have been considered: the SS family (Definition 4) and the Frank family (Definition 5). Their parameter was varied to construct classical t-norms for some special values of the parameter, but also to evaluate some intermediate ones. In order to keep a clear intuition behind the results, optimization was initially carried out using simple Gradient Descent with a fixed learning rate. Results are shown in Figures 1(a) and 1(b): strict t-norms tend to learn faster than nilpotent ones, because they penalize highly unsatisfied ground formulas more strongly. This difference is still clearly present, although slightly reduced, when using the state-of-the-art dynamic learning rate optimization algorithm Adam [31], as shown in Figures 1(c) and 1(d). This finding is consistent with the empirically well-known fact that the cross-entropy loss performs well in supervised learning tasks for deep architectures, because it is effective in avoiding vanishing gradients. The cross-entropy loss corresponds to a strict generator with $\lambda = 0$ and $\lambda = 1$ in the SS and Frank families, respectively. This selection corresponds to a fast and stable converging solution when paired with Adam, while there are faster converging solutions when using a fixed learning rate.
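The stronger penalization of unsatisfied formulas by strict t-norms can be checked numerically. The sketch below assumes the standard additive generator of the Schweizer-Sklar (SS) family, $g_\lambda(t) = (1 - t^\lambda)/\lambda$ for $\lambda \neq 0$ and $g_0(t) = -\log t$, whose derivative is $-t^{\lambda - 1}$; the function names are hypothetical. It prints the loss and the gradient magnitude for a badly violated ground formula with truth degree 0.01.

```python
import numpy as np

def ss_generator(t, lam):
    """Additive generator of the SS family (assumed standard form)."""
    t = np.asarray(t, dtype=float)
    if lam == 0.0:
        return -np.log(t)              # product t-norm: cross-entropy-like penalty
    return (1.0 - t ** lam) / lam      # lam = 1 gives the Lukasiewicz generator 1 - t

def ss_generator_grad(t, lam):
    """Derivative of the generator with respect to t: -t^(lam - 1)."""
    t = np.asarray(t, dtype=float)
    return -(t ** (lam - 1.0))

truth = 0.01                           # a highly unsatisfied ground formula
rows = []
for lam in (-1.5, -1.0, 0.0, 1.0, 1.5):
    loss = float(ss_generator(truth, lam))
    grad = float(abs(ss_generator_grad(truth, lam)))
    rows.append((lam, loss, grad))
    print(f"lambda={lam:+.1f}  loss={loss:14.3f}  |grad|={grad:14.3f}")
```

For $\lambda \leq 0$ the penalty is unbounded as the truth degree approaches 0, while for $\lambda > 0$ it saturates at $1/\lambda$, so the gradient magnitude at low truth degrees shrinks as $\lambda$ grows.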

V-B2 Classification accuracy

The effect of the selection of the generator on classification accuracy is tested on a classification task with manifold regularization in the transductive setting, where all the data is available at training time, even if only the training set supervisions are used during learning. In particular, the data is split into different datasets, where 10%, 25%, 50%, 75% and 90% of the available data is used as a test set, while the remaining data forms the training data. During training, the fitting of the supervised data defined by Equation 7 can be applied only to the training data, while manifold regularization (Equation 6) can be enforced on all the available data. In this experiment, the Adam optimizer and the SS family of parametric t-norms have been employed. Table IV shows the average test accuracy and its standard deviation over 10 different samples of the train/test splits. As expected, all generator selections improve the final accuracy over that obtained by pure supervised learning, as manifold regularization brings relevant information to the learner.
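The transductive protocol can be sketched as follows, as a minimal NumPy illustration under stated assumptions rather than the code used for the experiments. The supervision penalty is restricted to the training indices, while the manifold penalty runs over every citation link, including those touching test documents. The product generator is assumed for both terms, and the function names, the toy data and the weight `w_manifold=0.1` are hypothetical.

```python
import numpy as np

def transductive_split(n_docs, test_fraction, rng):
    """Random split of document indices into train and test parts."""
    perm = rng.permutation(n_docs)
    n_test = int(round(test_fraction * n_docs))
    return perm[n_test:], perm[:n_test]          # train_idx, test_idx

def transductive_loss(P, labels, edges, train_idx, w_manifold, eps=1e-12):
    """Supervision on training documents only, manifold term on all links."""
    logp = np.log(np.clip(P, eps, 1.0))
    sup = -np.sum(logp[train_idx, labels[train_idx]])   # test labels never used
    src, dst = edges[:, 0], edges[:, 1]
    man = np.sum(np.abs(logp[src] - logp[dst]))         # train and test documents
    return sup + w_manifold * man

rng = np.random.default_rng(0)
n_docs, n_classes = 20, 6
logits = rng.normal(size=(n_docs, n_classes))
P = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
labels = rng.integers(0, n_classes, size=n_docs)
edges = rng.integers(0, n_docs, size=(30, 2))            # toy citation links

train_idx, test_idx = transductive_split(n_docs, 0.25, rng)
loss = transductive_loss(P, labels, edges, train_idx, w_manifold=0.1)
print(len(train_idx), len(test_idx), loss)
```

Changing the labels of the test documents leaves the objective untouched, whereas their predictions still influence it through the citation links.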

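| Test split | $\lambda$ | Supervised: Avg Accuracy | Supervised: Stddev | Manifold: Avg Accuracy | Manifold: Stddev |
|---|---|---|---|---|---|
| 10% | -1.5 | 72.44 | 0.8 | 79.07 | 1.07 |
| 10% | -1.0 | 72.26 | 0.96 | 79.37 | 0.68 |
| 10% | 0.0 | 71.63 | 0.74 | 79.37 | 0.84 |
| 10% | 1.0 | 71.57 | 0.88 | 78.58 | 0.69 |
| 10% | 1.5 | 71.93 | 1.11 | 77.77 | 0.89 |
| 25% | -1.5 | 72.22 | 0.46 | 77.17 | 0.70 |
| 25% | -1.0 | 72.02 | 0.52 | 77.51 | 0.72 |
| 25% | 0.0 | 71.35 | 0.56 | 77.39 | 0.50 |
| 25% | 1.0 | 71.22 | 0.47 | 77.36 | 0.64 |
| 25% | 1.5 | 71.51 | 0.77 | 76.41 | 0.57 |
| 50% | -1.5 | 70.94 | 0.56 | 75.52 | 0.46 |
| 50% | -1.0 | 70.98 | 0.51 | 76.16 | 0.32 |
| 50% | 0.0 | 70.49 | 0.52 | 75.71 | 0.39 |
| 50% | 1.0 | 70.07 | 1.71 | 76.39 | 0.46 |
| 50% | 1.5 | 70.09 | 0.47 | 75.97 | 0.55 |
| 75% | -1.5 | 67.06 | 0.58 | 72.25 | 0.50 |
| 75% | -1.0 | 66.96 | 0.44 | 72.48 | 0.50 |
| 75% | 0.0 | 67.02 | 0.54 | 72.73 | 0.61 |
| 75% | 1.0 | 66.34 | 0.29 | 73.77 | 0.34 |
| 75% | 1.5 | 65.93 | 0.64 | 73.37 | 0.37 |
| 90% | -1.5 | 61.09 | 0.78 | 66.02 | 2.51 |
| 90% | -1.0 | 61.59 | 0.44 | 67.24 | 1.72 |
| 90% | 0.0 | 61.52 | 0.33 | 68.60 | 0.75 |
| 90% | 1.0 | 61.31 | 0.52 | 70.69 | 0.52 |
| 90% | 1.5 | 61.17 | 0.84 | 70.32 | 0.89 |

TABLE IV: Test accuracy of collective classification in transductive setting on the CiteSeer dataset for different percentages of available training data and different selections of the parameter of the SS generator family.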

Table IV also shows test accuracy when the parameter $\lambda$ of the SS parametric family is selected from the grid $\{-1.5, -1, 0, 1, 1.5\}$, where values of $\lambda \leq 0$ move across strict t-norms (with $\lambda = 0$ being the product t-norm), while values greater than $0$ move across nilpotent t-norms (with $\lambda = 1$ being the Łukasiewicz t-norm). Strict t-norms seem to provide slightly better performance than nilpotent ones on the supervised task for almost all the test set splits. However, this does not hold in learning tasks using manifold regularization and a limited number of supervisions, where nilpotent t-norms perform better. An explanation of this behaviour can be found in the different nature of the two constraints, i.e. the supervision constraint of Equation 7 and the manifold regularization constraint of Equation 6. Indeed, while supervisions provide hard constraints that need to be strongly satisfied, manifold regularization is a general soft rule, which should allow exceptions. When the number of supervisions is small and manifold regularization drives the learning process, the milder behaviour of nilpotent t-norms is better, as it more closely models the semantics of the prior knowledge. Finally, it is worth noticing that very strict t-norms (e.g. $\lambda = -1.5$ in the provided experiment) yield high standard deviations compared to other t-norms, especially in the manifold regularization setup. This shows the presence of a trade-off between the improved learning speed provided by strict t-norms and the instability due to their extremely non-linear behaviour.

V-B3 Competitive evaluation

Table V compares the accuracy of the selected neural model (NN), trained only with the supervised constraint, against two other content-based classifiers, namely logistic regression (LR) and Naive Bayes (NB). These baseline classifiers have been compared against collective classification approaches using the citation network data: Iterative Classification Algorithm (ICA) [32] and Gibbs Sampling (GS) [33], applied on top of the output of the LR and NB content-based classifiers.

| Classification Method | Accuracy |
|---|---|
| Naive Bayes | 74.87 |
| ICA Naive Bayes | 76.83 |
| GS Naive Bayes | 76.80 |
| Logistic Regression | 73.21 |
| ICA Logistic Regression | 77.32 |
| GS Logistic Regression | 76.99 |
| Loopy Belief Propagation | 77.59 |
| Mean Field | 77.32 |
| NN | 72.26 |
| DFL | **79.37** |

TABLE V: Comparison of the accuracy on the CiteSeer dataset obtained by content-based and relational classifiers against supervised and relational learning expressed using DFL. All reported results are computed as averages over random splits of the train and test data. The bold number indicates the best performer and a statistically significant improvement over the competitors.

Furthermore, the results are compared against the two top performers on this task: Loopy Belief Propagation (LBP) [34] and Relaxation Labeling through Mean-Field Approach (MF) [34]. Finally, Table V reports the results of DFL, obtained by training the same neural network with both supervision and manifold regularization constraints, using a generator from the SS family. The accuracy values are obtained as an average over 10 folds created by random splits of 90% and 10% of the data for the train and test sets, respectively. Unlike the other relational approaches, which can only be executed at inference time (collective classification), DFL can distill the knowledge into the weights of the neural network. The accuracy of DFL is the highest among all the tested methodologies, in spite of the fact that the neural network trained only on the supervisions performs slightly worse than the other content-based competitors.

VI Conclusions

This paper presents a framework to embed prior knowledge expressed as logic statements into a learning task. In particular, it was shown how the choice of the t-norm generator used to convert the logic into a differentiable form defines the resulting loss function used during learning. When restricting the attention to supervised learning, the framework recovers popular loss functions like the cross-entropy loss, and allows new loss functions to be defined by choosing the parameters of t-norm parametric forms. The presented theory has led to the implementation of a general software simulator, called DFL, which bridges logic reasoning and deep learning using the unifying concept of t-norm generator as a general abstraction to translate any FOL declarative knowledge into an optimization problem solved in TensorFlow.

References

  • [1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” nature, vol. 521, no. 7553, p. 436, 2015.
  • [2] A. S. d. Garcez, K. B. Broda, and D. M. Gabbay, Neural-symbolic learning systems: foundations and applications.   Springer Science & Business Media, 2012.
  • [3] M. Diligenti, M. Gori, and C. Sacca, “Semantic-based regularization for learning and inference,” Artificial Intelligence, vol. 244, pp. 143–165, 2017.
  • [4] I. Donadello, L. Serafini, and A. d’Avila Garcez, “Logic tensor networks for semantic image interpretation,” in IJCAI International Joint Conference on Artificial Intelligence, 2017, pp. 1596–1602.
  • [5] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning.   MIT press Cambridge, 2016, vol. 1.
  • [6] M. Richardson and P. Domingos, “Markov logic networks,” Machine learning, vol. 62, no. 1, pp. 107–136, 2006.
  • [7] S. H. Bach, M. Broecheler, B. Huang, and L. Getoor, “Hinge-loss markov random fields and probabilistic soft logic,” Journal of Machine Learning Research, vol. 18, pp. 1–67, 2017.
  • [8] G. Marra, F. Giannini, M. Diligenti, and M. Gori, “Lyrics: a general interface layer to integrate ai and deep learning,” arXiv preprint arXiv:1903.07534, 2019.
  • [9] J. Xu, Z. Zhang, T. Friedman, Y. Liang, and G. V. d. Broeck, “A semantic loss function for deep learning with symbolic knowledge,” arXiv preprint arXiv:1711.11157, 2017.
  • [10] R. Manhaeve, S. Dumančić, A. Kimmig, T. Demeester, and L. De Raedt, “Deepproblog: Neural probabilistic logic programming,” arXiv preprint arXiv:1805.10872, 2018.
  • [11] L. De Raedt, A. Kimmig, and H. Toivonen, “Problog: A probabilistic prolog and its application in link discovery,” in Proceedings of the 20th International Joint Conference on Artifical Intelligence, ser. IJCAI’07.   San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2007, pp. 2468–2473. [Online]. Available: http://dl.acm.org/citation.cfm?id=1625275.1625673
  • [12] L. D. Raedt, P. Frasconi, K. Kersting, and S. M. (Eds), Probabilistic Inductive Logic Programming.   Springer, Lecture Notes in Artificial Intelligence, 2008, vol. 4911.
  • [13] L. Serafini, I. Donadello, and A. d. Garcez, “Learning and reasoning in logic tensor networks: theory and application to semantic image interpretation,” in Proceedings of the Symposium on Applied Computing.   ACM, 2017, pp. 125–130.
  • [14] F. Giannini, M. Diligenti, M. Gori, and M. Maggini, “On a convex logic fragment for learning and reasoning,” IEEE Transactions on Fuzzy Systems, 2018.
  • [15] P. Hájek, Metamathematics of fuzzy logic.   Springer Science & Business Media, 2013, vol. 4.
  • [16] P. S. Mostert and A. L. Shields, “On the structure of semigroups on a compact manifold with boundary,” Annals of Mathematics, pp. 117–143, 1957.
  • [17] S. Jenei, “A note on the ordinal sum theorem and its consequence for the construction of triangular norms,” Fuzzy Sets and Systems, vol. 126, no. 2, pp. 199–205, 2002.
  • [18] E. P. Klement, R. Mesiar, and E. Pap, “Triangular norms. position paper i: basic analytical and algebraic properties,” Fuzzy Sets and Systems, vol. 143, no. 1, pp. 5–26, 2004.
  • [19] ——, Triangular norms.   Springer Science & Business Media, 2013, vol. 8.
  • [20] ——, “Triangular norms. position paper iii: Continuous t-norms,” Fuzzy Sets and Systems, vol. 145, pp. 439–454, 08 2004.
  • [21] ——, “Triangular norms. position paper ii: general constructions and parameterized families,” Fuzzy Sets and Systems, vol. 145, no. 3, pp. 411–438, 2004.
  • [22] G. Gnecco, M. Gori, S. Melacci, and M. Sanguineti, “Foundations of support constraint machines,” Neural computation, vol. 27, no. 2, pp. 388–480, 2015.
  • [23] M. Diligenti, S. Roychowdhury, and M. Gori, “Image classification using deep learning and prior knowledge,” in Proceedings of Third International Workshop on Declarative Learning Based Programming (DeLBP), February 2018.
  • [24] V. Novák, I. Perfilieva, and J. Mockor, Mathematical principles of fuzzy logic.   Springer Science & Business Media, 2012, vol. 517.
  • [25] S. Kolb, S. Teso, A. Passerini, and L. De Raedt, “Learning smt (lra) constraints using smt solvers.” in IJCAI, 2018, pp. 2333–2340.
  • [26] G. Marra, F. Giannini, M. Diligenti, and M. Gori, “Integrating learning and reasoning with deep logic models,” arXiv preprint arXiv:1901.04195, 2019.
  • [27] L. Serafini and A. d. Garcez, “Logic tensor networks: Deep learning and logical reasoning from data and knowledge,” arXiv preprint arXiv:1606.04422, 2016.
  • [28] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., “Tensorflow: A system for large-scale machine learning.” in OSDI, vol. 16, 2016, pp. 265–283.
  • [29] N. Ketkar, “Introduction to pytorch,” in Deep learning with python.   Springer, 2017, pp. 195–208.
  • [30] S. Fakhraei, J. Foulds, M. Shashanka, and L. Getoor, “Collective spammer detection in evolving multi-relational social networks,” in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’15.   ACM, 2015, pp. 1769–1778.
  • [31] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [32] J. Neville and D. Jensen, “Iterative classification in relational data,” in Proc. AAAI-2000 Workshop on Learning Statistical Models from Relational Data, 2000, pp. 13–20.
  • [33] Q. Lu and L. Getoor, “Link-based classification,” in Proceedings of the 20th International Conference on Machine Learning (ICML-03), 2003, pp. 496–503.
  • [34] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad, “Collective classification in network data,” AI magazine, vol. 29, no. 3, p. 93, 2008.