On the relation between Loss Functions and T-Norms

07/18/2019 ∙ by Francesco Giannini, et al. ∙ Università di Siena UNIFI 4

Deep learning has been shown to achieve impressive results in several domains like computer vision and natural language processing. A key element of this success has been the development of new loss functions, like the popular cross-entropy loss, which has been shown to provide faster convergence and to reduce the vanishing gradient problem in very deep structures. While the cross-entropy loss is usually justified from a probabilistic perspective, this paper shows an alternative and more direct interpretation of this loss in terms of t-norms and their associated generator functions, and derives a general relation between loss functions and t-norms. In particular, the presented work shows intriguing results leading to the development of a novel class of loss functions. These losses can be exploited in any supervised learning task and could lead to faster convergence rates than the commonly employed cross-entropy loss.


Code Repositories

tf-logic

A Tensorflow extension to allow Logic Based loss functions



1 Introduction

A careful choice of the loss function has been pivotal to the success of deep learning. In particular, the cross-entropy loss, or log loss, measures the performance of a classifier and increases when the predicted probability of an assignment diverges from the actual label [7]. In supervised learning, the cross-entropy loss has a clear interpretation, as it attempts to minimize the divergence between the distributions of the predicted and given pattern labels. From a practical standpoint, the main advantage of this loss is that it limits the vanishing gradient issue for networks with sigmoidal or softmax output activations.

Recent advancements in Statistical Relational Learning (SRL) [16] make it possible to inject prior knowledge, often expressed using a logic formalism, into a learner. One of the most popular lines of research in this community attempts to define frameworks for performing logic inference in the presence of uncertainty. For example, Markov Logic Networks [18] and Probabilistic Soft Logic [1] integrate First Order Logic (FOL) and graphical models. More recently, many attempts have focused on integrating reasoning under uncertainty with deep learning [20]. A common solution, followed by approaches like Semantic Based Regularization [4] and Logic Tensor Networks [5], relies on using deep networks to approximate the FOL predicates, and the overall architecture is optimized end-to-end by relaxing the FOL into a differentiable form, which translates into a set of constraints. For the sake of overall consistency, one question that naturally arises in this context is how the fitting of the supervised examples can be expressed using a logic formalism. Starting from this question, this paper follows an orthogonal approach to the definition of a loss function, by studying the relation between the translation of the prior knowledge using t-norms and the resulting loss function. In particular, the notion of t-norm generator plays a fundamental role in the behavior of the corresponding loss. Remarkably, the cross-entropy loss can be naturally derived within this framework. However, the presented theoretical results suggest that there is a larger class of loss functions that correspond to the different possible translations of logic using t-norms, and some of these loss functions are potentially more effective than the cross-entropy at limiting the vanishing gradient issue, therefore providing a faster convergence rate.

The paper is organized as follows: Section 2 presents the basic concepts about t-norms, generators and aggregator functions. Section 3 introduces the learning frameworks used to represent supervised learning in terms of logic rules, while Section 4 presents the experimental results and, finally, Section 5 draws some conclusions.

2 Fuzzy Aggregation Functions

Aggregation takes place on a set of values, typically representing preferences or satisfaction degrees restricted to the unit interval. There are several ways to aggregate them into a single value expressing an overall combined score, according to what is expected from such mappings. The purpose of aggregation functions is to combine inputs that are typically interpreted as degrees of membership in fuzzy sets, degrees of preference or strength of evidence. Aggregation functions have been studied by several authors in the literature [2, 3], and they are successfully used in many practical applications, for instance see [8, 19]. Please note that the fuzzy aggregation functions covered in this section can be directly applied to the output of a multi-task classifier, when implemented via a neural network with sigmoidal or softmax output units.

2.0.1 Basic Definitions.

Aggregation functions are defined for inputs of any cardinality; however, for simplicity the main definitions are provided only for the binary case. A (binary) aggregation function is a non-decreasing function $A: [0,1]^2 \to [0,1]$ such that $A(0,0) = 0$ and $A(1,1) = 1$. An aggregation function $A$ can be categorized according to the pointwise order in Equation 1 as: conjunctive when $A \le \min$, disjunctive when $\max \le A$, averaging (a mean) when $\min \le A \le \max$, and hybrid otherwise; where $\min$ and $\max$ are the aggregation functions for the minimum and maximum, respectively.

$$A_1 \le A_2 \iff A_1(x,y) \le A_2(x,y) \ \text{for all } x, y \in [0,1] \qquad (1)$$

Conjunctive and disjunctive functions combine values as if they were related by logical AND and OR operations, respectively. On the other hand, averaging functions have the property that low values can be compensated by high values. Mean computation is the most common way to combine the inputs, since it is assumed that the total score cannot lie above the largest or below the smallest input, but depends on all of them.

2.1 Archimedean T-Norms

Although averaging functions have nice properties for aggregating fuzzy values, they are suitable to represent neither a conjunction nor a disjunction, because they do not generalize their Boolean counterparts. For this reason, we focus on t-norms and t-conorms [11, 14], which are associative, commutative aggregation functions with 1 and 0 as neutral element, respectively. Table 1 reports the Gödel, Lukasiewicz and Product t-norms, which are referred to as the fundamental t-norms because all the continuous t-norms can be obtained as ordinal sums of the fundamental t-norms [10]. A simple example of a t-norm that is not continuous is given by the Drastic t-norm $T_D$, which always returns zero except for $T_D(x,1) = T_D(1,x) = x$. Archimedean t-norms [13] are a class of t-norms that can be constructed by means of unary monotone functions, called generators.

Gödel: $T_M(x,y) = \min\{x, y\}$
Lukasiewicz: $T_L(x,y) = \max\{0,\ x + y - 1\}$
Product: $T_P(x,y) = x \cdot y$
Table 1: Fundamental t-norms.
Definition 1

A t-norm $T$ is said to be Archimedean if, for every $x \in (0,1)$, $T(x,x) < x$. In addition, $T$ is said to be strict if $0 < T(x,x) < x$ for all $x \in (0,1)$; otherwise $T$ is said to be nilpotent.

For instance, the Lukasiewicz t-norm is nilpotent and the Product t-norm is strict, while the Gödel t-norm $T_M$ is not Archimedean; indeed, $T_M(x,x) = x$ for all $x \in [0,1]$. The Lukasiewicz and Product t-norms are enough to represent the whole classes of nilpotent and strict Archimedean t-norms [14].
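The three fundamental t-norms and the diagonal conditions of Definition 1 can be sketched in a few lines of Python. The function names and the sample points are illustrative assumptions, and checking a handful of points is only a numerical spot check, not a proof of the property.

```python
def t_godel(x: float, y: float) -> float:
    """Gödel (minimum) t-norm."""
    return min(x, y)


def t_lukasiewicz(x: float, y: float) -> float:
    """Lukasiewicz t-norm."""
    return max(0.0, x + y - 1.0)


def t_product(x: float, y: float) -> float:
    """Product t-norm."""
    return x * y


def is_archimedean_on(t_norm, points) -> bool:
    """Spot check of T(x, x) < x on sample points taken from (0, 1)."""
    return all(t_norm(x, x) < x for x in points)


def is_strict_on(t_norm, points) -> bool:
    """Spot check of 0 < T(x, x) < x on sample points taken from (0, 1)."""
    return all(0.0 < t_norm(x, x) < x for x in points)


# Hypothetical sample points inside the open unit interval.
samples = [0.1, 0.25, 0.5, 0.75, 0.9]
```

On these samples the Lukasiewicz t-norm passes the Archimedean check but fails the strictness check (it reaches zero on the diagonal), the Product t-norm passes both, and the Gödel t-norm fails the Archimedean check.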

A fundamental result for the construction of t-norms by additive generators is based on the following theorem [12]:

Theorem 2.1

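Let $g: [0,1] \to [0,+\infty]$ be a strictly decreasing function with $g(1) = 0$ and $g(x) + g(y) \in \mathrm{Range}(g) \cup [g(0^+), +\infty]$ for all $x, y$ in $[0,1]$, and $g^{-1}$ its pseudo-inverse. Then the function $T: [0,1]^2 \to [0,1]$ defined as

$$T(x,y) = g^{-1}\big(g(x) + g(y)\big) \qquad (2)$$

is a t-norm and $g$ is said to be an additive generator for $T$.

Any t-norm with an additive generator is Archimedean; if $g$ is continuous then $T$ is continuous; $T$ is strict if and only if $g(0) = +\infty$, otherwise it is nilpotent.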

Example 1

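If we take $g(x) = 1 - x$, then also $g^{-1}(y) = 1 - y$ and we get $T_L$:

$$T(x,y) = 1 - \min\{1,\ (1 - x) + (1 - y)\} = \max\{0,\ x + y - 1\}$$

Example 2

Taking $g(x) = -\log(x)$, we have $g^{-1}(y) = e^{-y}$ and we get $T_P$:

$$T(x,y) = e^{-(-\log(x) - \log(y))} = x \cdot y$$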
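The construction of a t-norm from an additive generator can be sketched as follows. This is a minimal illustration under stated assumptions: the helper name `make_t_norm` and its arguments are hypothetical, and the pseudo-inverse is emulated by clipping the sum at $g(0^+)$ before applying the ordinary inverse.

```python
import math


def make_t_norm(g, g_inv, g_at_zero):
    """Build T(x, y) = g_inv(g(x) + g(y)) from an additive generator g.

    g_at_zero stands for g(0+). Sums above it are clipped, which is how the
    pseudo-inverse behaves for nilpotent generators.
    """
    def t_norm(x, y):
        return g_inv(min(g_at_zero, g(x) + g(y)))
    return t_norm


def g_luk(x):
    """Generator of the Lukasiewicz t-norm."""
    return 1.0 - x


def g_luk_inv(y):
    return 1.0 - y


def g_prod(x):
    """Generator of the Product t-norm, with g(0) = +inf."""
    return -math.log(x) if x > 0.0 else math.inf


def g_prod_inv(y):
    return math.exp(-y)


t_luk = make_t_norm(g_luk, g_luk_inv, g_at_zero=1.0)
t_prod = make_t_norm(g_prod, g_prod_inv, g_at_zero=math.inf)
```

With these definitions, `t_luk(0.7, 0.6)` is approximately `max(0, 0.7 + 0.6 - 1)` and `t_prod(0.5, 0.4)` is approximately `0.5 * 0.4`, up to floating-point rounding.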

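Equation 2 allows the other fuzzy connectives to be derived as functions of the generator:

$$\text{residuum:}\quad x \Rightarrow y = g^{-1}\big(\max\{0,\ g(y) - g(x)\}\big) \qquad \text{bi-residuum:}\quad x \Leftrightarrow y = g^{-1}\big(|g(x) - g(y)|\big) \qquad (3)$$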
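As a hedged illustration of how the residual implication and the bi-residuum follow from a generator, the sketch below assumes the standard generator-based forms $g^{-1}(\max\{0, g(y) - g(x)\})$ and $g^{-1}(|g(x) - g(y)|)$. All function names are illustrative, and the Product generator is only evaluated on inputs in $(0,1]$ to avoid infinite values.

```python
import math


def residuum(g, g_inv, x, y):
    """Residual implication x => y expressed through the generator g."""
    return g_inv(max(0.0, g(y) - g(x)))


def biresiduum(g, g_inv, x, y):
    """Bi-residuum x <=> y expressed through the generator g."""
    return g_inv(abs(g(x) - g(y)))


def g_luk(x):
    """Lukasiewicz generator."""
    return 1.0 - x


def g_luk_inv(y):
    return 1.0 - y


def g_prod(x):
    """Product generator; assumes x in (0, 1]."""
    return -math.log(x)


def g_prod_inv(y):
    return math.exp(-y)
```

For the Lukasiewicz generator this gives the familiar `min(1, 1 - x + y)` implication, and for the Product generator it gives `min(1, y / x)`.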

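If $g$ is expressed as a parametric function, it is possible to define families of t-norms, each member being constructed from the generator obtained by setting the parameters to specific values. Several parametric families of t-norms have been introduced [2]. The experimental section of this paper employs the families of Schweizer–Sklar and Frank t-norms, depending on a parameter $\lambda$ and $\nu$ respectively, whose generators are defined as:

$$g^{SS}_\lambda(x) = \begin{cases} -\log(x) & \text{if } \lambda = 0 \\ \dfrac{1 - x^\lambda}{\lambda} & \text{otherwise} \end{cases} \qquad g^{F}_\nu(x) = \begin{cases} -\log(x) & \text{if } \nu = 1 \\ 1 - x & \text{if } \nu = +\infty \\ \log\left(\dfrac{\nu - 1}{\nu^x - 1}\right) & \text{otherwise} \end{cases} \qquad (4)$$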
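The two parametric generators can be sketched in Python as below. This assumes the standard forms of the Schweizer–Sklar and Frank generators from the t-norm literature; the function names and the parameter names `lam` and `nu` are illustrative choices, and inputs are assumed to lie in $(0,1]$.

```python
import math


def g_schweizer_sklar(x: float, lam: float) -> float:
    """Schweizer–Sklar generator; lam = 0 is the Product limit -log(x)."""
    if lam == 0.0:
        return -math.log(x)
    return (1.0 - x ** lam) / lam


def g_frank(x: float, nu: float) -> float:
    """Frank generator for nu > 0; nu = 1 and nu = +inf are limit cases."""
    if nu == 1.0:
        return -math.log(x)
    if math.isinf(nu):
        return 1.0 - x
    return math.log((nu - 1.0) / (nu ** x - 1.0))
```

Both generators vanish at `x = 1`, and for `lam = 1` the Schweizer–Sklar generator coincides with the Lukasiewicz generator `1 - x`.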

3 From Formulas to Loss Functions

A learning process can be thought of as a constraint satisfaction problem, where the constraints represent the knowledge about the functions to be learned. In particular, multi-task learning can be expressed via a set of constraints expressing the fitting of the supervised examples, plus any additional abstract knowledge.

Let us consider a set of unknown task functions $P = \{p_1, \ldots, p_J\}$ defined on $\mathbb{R}^n$, all collected in the vector $\mathbf{p} = (p_1, \ldots, p_J)$, and a set of known functions or predicates $S$. Given the set $\mathcal{X} \subseteq \mathbb{R}^n$ of available data, a learning problem can be generally formulated as $\min_{\mathbf{p}} \mathcal{L}(\mathcal{X}, S, \mathbf{p})$, where $\mathcal{L}$ is a positive-valued functional denoting a certain loss function. Each predicate is approximated by a neural network providing an output value in $[0,1]$. The available knowledge about the task functions consists of a set of FOL formulas $KB = \{\varphi_1, \ldots, \varphi_H\}$, and the learning process aims at finding a good approximation of each unknown element, so that the estimated values will satisfy the formulas for the input samples. Since any formula is true if it evaluates to 1, in order to satisfy the constraints we may minimize the following loss function:

$$\mathcal{L}(\mathcal{X}, S, \mathbf{p}) = \sum_{h=1}^{H} \lambda_h\, L\big(f_h(\mathcal{X}, S, \mathbf{p})\big) \qquad (5)$$

where each $\lambda_h$ is the weight for the $h$-th logical constraint, which can be selected via cross-validation or jointly learned [15, 21], $f_h$ is the truth-function corresponding to the formula $\varphi_h$ according to a certain t-norm fuzzy logic, and $L$ is a decreasing function denoting the penalty associated with the distance from satisfaction of the formulas, so that $L(1) = 0$. In the following, we study different forms for the cost function $L$ and how it depends on the choice of the t-norm generator. In particular, a t-norm fuzzy logic generalizes Boolean logic to variables assuming values in $[0,1]$ and is defined by its t-norm modeling the logical AND [9]. The connectives can be treated using the fuzzy generalization of first-order logic that was first proposed by Novák [17]. The universal and existential quantifiers occurring in the formulas in $KB$ allow the aggregation of different evaluations (groundings) of the formulas on the available data. For instance, given a formula $\varphi(x_i)$ depending on a certain variable $x_i \in \mathcal{X}_i$, where $\mathcal{X}_i$ denotes the available samples for the $i$-th argument of one of the involved predicates in $\varphi$, we may convert the quantifiers as the minimum and maximum operations that are common to any t-norm fuzzy logic:

$$\forall x_i\, \varphi(x_i) \longrightarrow \min_{x_i \in \mathcal{X}_i} f_\varphi(x_i, S, \mathbf{p}) \qquad \exists x_i\, \varphi(x_i) \longrightarrow \max_{x_i \in \mathcal{X}_i} f_\varphi(x_i, S, \mathbf{p})$$

3.1 Loss Functions by T-Norms Generators

A quantifier can be seen as a way to aggregate all the possible groundings of a predicate variable that, in turn, are $[0,1]$-values. Different aggregation functions have also been considered: for example, in [5] the authors use a mean operator to convert the universal quantifier. However, this has the drawback that the existential quantifier would receive the same semantic conversion, so the authors handle it via Skolemization instead. Even if this choice may yield some learning benefits, it has no direct justification inside a logic theory. Moreover, it does not suggest how to map the functional translation of the formula into a constraint. In the following, we investigate the mapping of formulas into constraints by means of generated t-norm fuzzy logics, and we exploit the same additive generator of the t-norm to map the formula into the functional constraints to be minimized, i.e. $L = g$.

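Given a certain formula $\varphi(x)$ depending on a variable $x$ that ranges in the set $\mathcal{X}$ and its corresponding functional representation $f_\varphi(x, S, \mathbf{p})$ evaluated on each $x \in \mathcal{X}$, the conversion of universal and existential quantifiers should have semantics equivalent to the AND and OR of the evaluation of the formula over the groundings, respectively. This can be realized by directly applying the t-norm or t-conorm over the groundings. For instance, for the universal quantifier:

$$\forall x\, \varphi(x) \longrightarrow g^{-1}\Big(\min\Big\{g(0^+),\ \sum_{x \in \mathcal{X}} g\big(f_\varphi(x, S, \mathbf{p})\big)\Big\}\Big) \qquad (6)$$

where $g$ is an additive generator of the t-norm $T$ corresponding to the universal quantifier. Since any generator function is decreasing, in order to maximize the satisfaction of $\forall x\, \varphi(x)$ we can minimize $g$ applied to Equation 6, namely:

$$\min\Big\{g(0^+),\ \sum_{x \in \mathcal{X}} g\big(f_\varphi(x, S, \mathbf{p})\big)\Big\} \quad \text{if } T \text{ is nilpotent} \qquad (7)$$
$$\sum_{x \in \mathcal{X}} g\big(f_\varphi(x, S, \mathbf{p})\big) \quad \text{if } T \text{ is strict} \qquad (8)$$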

As a consequence, with respect to the convexity of the expressions in Equations 7-8, we get the following result, which is an immediate consequence of how convexity is preserved by function composition.

Proposition 1

If $g$ is a linear function and $f_\varphi$ is concave, Equation 7 is convex. If $g$ is a convex function and $f_\varphi$ is linear, Equation 8 is convex.

Example 3

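If $g(x) = 1 - x$ (Lukasiewicz t-norm), from Equation 7 we get:

$$\min\Big\{1,\ \sum_{x \in \mathcal{X}} \big(1 - f_\varphi(x, S, \mathbf{p})\big)\Big\}$$

Hence, in case $f_\varphi$ is concave (see [6] for a characterization of the concave fragment of Lukasiewicz logic), this function is convex.

If $g(x) = -\log(x)$ (Product t-norm), from Equation 8 we get the cross-entropy:

$$-\sum_{x \in \mathcal{X}} \log\big(f_\varphi(x, S, \mathbf{p})\big)$$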
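The penalty associated with a universally quantified formula can be sketched numerically. The snippet below is an illustration under stated assumptions: the truth values of the groundings are given as a plain list, the function name `universal_penalty` is hypothetical, and the cap at $g(0^+)$ is only active for nilpotent generators (for strict ones it is passed as infinity).

```python
import math


def universal_penalty(truth_values, g, g_at_zero):
    """Penalty min{g(0+), sum_x g(f(x))} for a universally quantified formula.

    truth_values: truth degrees in (0, 1] of the formula on each grounding.
    g_at_zero: g(0+); use math.inf for a strict t-norm so the cap is inactive.
    """
    return min(g_at_zero, sum(g(v) for v in truth_values))


def g_luk(x):
    """Lukasiewicz generator (nilpotent case, g(0+) = 1)."""
    return 1.0 - x


def g_prod(x):
    """Product generator (strict case, g(0+) = +inf); assumes x > 0."""
    return -math.log(x)


# Hypothetical truth degrees of a formula on three groundings.
values = [0.9, 0.8, 0.95]
luk_penalty = universal_penalty(values, g_luk, 1.0)
prod_penalty = universal_penalty(values, g_prod, math.inf)
```

Here `luk_penalty` is the capped sum of the distances from full satisfaction, while `prod_penalty` is the negative log of the product of the truth degrees, i.e. a cross-entropy-like term.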

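As already pointed out in Section 2, if $g$ is an additive generator for a t-norm $T$, then the residual implication and the bi-residuum with respect to $T$ are given by Equation 3. In particular, if $p_1, p_2$ are two unary predicate functions sharing the same input domain $\mathcal{X}$, the following formulas yield the following penalty terms:

$$\forall x\, p_1(x) \Rightarrow p_2(x) \longrightarrow \sum_{x \in \mathcal{X}} \max\{0,\ g(p_2(x)) - g(p_1(x))\}$$
$$\forall x\, p_1(x) \Leftrightarrow p_2(x) \longrightarrow \sum_{x \in \mathcal{X}} |g(p_1(x)) - g(p_2(x))|$$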

3.2 Redefinition of supervised Learning with Logic

In this section, we study the case of supervised learning w.r.t. the choice of a certain additive generator. Let us consider a multi-task classification problem with predicates $p_1, \ldots, p_J$ defined over the same input domain, with a supervised training set $\{(x_k, y_k)\}$, where each $y_k$ is the output class for the pattern $x_k$ and $\mathcal{X}$ is the overall set of supervised patterns. Finally, the known predicate $S_j$ is defined for each predicate $p_j$ such that $S_j(x_k) = 1$ iff $y_k = j$, and we indicate as $\mathcal{X}_j$ the set of positive examples for the $j$-th predicate. Then, we can enforce the supervision constraints for $p_j$ as:

$$\forall x\, S_j(x) \Leftrightarrow p_j(x) \longrightarrow \sum_{x \in \mathcal{X}} |g(S_j(x)) - g(p_j(x))|$$

In the special case of predicates implemented by neural networks and exclusive multi-task classification, where each pattern should be assigned to one and only one class, the exclusivity can be enforced using a softmax output activation. Typically, in this scenario, only the positive supervisions are explicitly listed, and since it holds that $g(S_j(x)) = g(1) = 0$ for $x \in \mathcal{X}_j$, this yields:

$$\sum_{j=1}^{J} \sum_{x \in \mathcal{X}_j} g\big(p_j(x)\big) \qquad (9)$$

For instance, in the case of Lukasiewicz and Product logic, we have, respectively:

$$\sum_{j=1}^{J} \sum_{x \in \mathcal{X}_j} \big(1 - p_j(x)\big) \qquad -\sum_{j=1}^{J} \sum_{x \in \mathcal{X}_j} \log\big(p_j(x)\big)$$

corresponding to the $L_1$ and cross-entropy losses, respectively.
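A minimal sketch of a generator-based supervised loss is given below, assuming that each pattern comes with a vector of class probabilities (for example softmax outputs) and the index of its positive class. The function name and the toy data are illustrative, and the loss is a plain sum over positive supervisions rather than a mean.

```python
import math


def generated_supervised_loss(probs, labels, g):
    """Sum of g(p_j(x)) over the positive supervisions.

    probs: list of per-pattern class probability lists.
    labels: list with the index of the positive class of each pattern.
    g: additive generator applied to the probability of the positive class.
    """
    return sum(g(p[y]) for p, y in zip(probs, labels))


def g_luk(x):
    """Lukasiewicz generator, giving an L1-type loss."""
    return 1.0 - x


def g_prod(x):
    """Product generator, giving the cross-entropy loss; assumes x > 0."""
    return -math.log(x)


# Hypothetical predictions for two patterns over three classes.
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
labels = [0, 1]
l1_loss = generated_supervised_loss(probs, labels, g_luk)
ce_loss = generated_supervised_loss(probs, labels, g_prod)
```

Swapping the generator is the only change needed to move between the two losses, which is the design point the section makes: the loss is determined by the generator of the chosen t-norm.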

4 Experimental Results

(a) The Schweizer–Sklar t-norms
(b) The Frank t-norms
Figure 1: Convergence speed of multiple generated loss functions on the MNIST classification task for different values of the parameter of Equation 4. The well-known cross-entropy loss is equivalent to the loss obtained by the generator $-\log(x)$.

The proposed framework makes it possible to recover well-known loss functions by expressing the fitting of the supervision using logic and then carefully selecting the t-norm used to translate the resulting formulas. However, a main strength of the proposed theory is that it becomes possible to derive new principled losses starting from any family of parametric t-norms. Driven by the huge impact that cross-entropy has had w.r.t. classical loss functions in improving convergence speed and generalization capabilities, we designed a set of experiments to investigate how the choice of a t-norm can lead to a loss function with better performance than the cross-entropy loss. The Schweizer–Sklar and the Frank parametric t-norms defined in Section 2.1 have been selected for this experimental evaluation, given the large spectrum of t-norms that can be generated by varying their parameter. The well-known MNIST dataset is used as benchmark for all the presented experiments. In order to have a fair comparison, the same neural network architecture is used during all the runs: a 1-hidden-layer neural network with 50 hidden ReLU units and 10 softmax output units. The softmax activation function makes it possible to express only positive supervisions, as commonly done in mutually exclusive classification using the cross-entropy loss. Optimization is carried out using vanilla gradient descent with a fixed learning rate.

Results are shown in Figure 1, which reports the accuracy on the test set of a neural network trained on the MNIST dataset. Specific choices of the parameter recover classical loss functions, like the cross-entropy loss, which is equivalent to the loss obtained using the generator $-\log(x)$. The results confirm that the cross-entropy loss converges faster than the $L_1$ loss obtained when using the generator $1 - x$. However, there is a wide range of possible choices for the parameter that brings an even faster convergence and better generalization than the widely adopted cross-entropy.
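One way to build intuition for these convergence differences is to look at how strongly a generated loss reacts to a poorly fit positive example. The sketch below assumes the Schweizer–Sklar generator in its standard form, $(1 - p^\lambda)/\lambda$ with $-\log(p)$ at $\lambda = 0$, whose derivative with respect to the predicted probability $p$ is $-p^{\lambda - 1}$. The function name is illustrative, the snippet ignores the softmax layer, and it is not the code used for the experiments.

```python
def ss_generator_derivative(p: float, lam: float) -> float:
    """Derivative w.r.t. p of the Schweizer–Sklar generator (1 - p**lam) / lam.

    The same formula covers lam = 0, where the generator is -log(p) and the
    derivative is -1 / p.
    """
    return -(p ** (lam - 1.0))


# Slope of the loss at a poorly fit positive example (p = 0.1).
slope_l1 = ss_generator_derivative(0.1, 1.0)             # Lukasiewicz / L1 case
slope_cross_entropy = ss_generator_derivative(0.1, 0.0)  # Product case
slope_negative_lam = ss_generator_derivative(0.1, -1.0)  # a negative parameter
```

Under these assumptions the slope magnitude grows as the parameter decreases, so badly classified patterns contribute larger gradients for the cross-entropy case than for the $L_1$ case, and larger still for negative parameter values.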

5 Conclusions

This paper presents a framework to embed prior knowledge expressed as logic statements into a learning task, showing how the choice of the t-norm used to convert the logic into a differentiable form defines the resulting loss function used during learning. When restricting the attention to supervised learning, the framework recovers popular loss functions like the cross-entropy loss, and makes it possible to define new loss functions corresponding to the choice of the parameters of t-norm parametric forms. The experimental results show that some newly defined losses provide a faster convergence rate than the commonly used cross-entropy loss. Future work will focus on testing the loss functions in more structured learning tasks, like the ones commonly addressed with Logic Tensor Networks and Semantic Based Regularization. The parametric form of the loss functions makes it possible to define joint learning tasks, where the loss parameters are co-optimized during learning, for example using maximum likelihood estimators.

References

  • [1] Bach, S.H., Broecheler, M., Huang, B., Getoor, L.: Hinge-loss markov random fields and probabilistic soft logic. Journal of Machine Learning Research 18, 1–67 (2017)
  • [2] Beliakov, G., Pradera, A., Calvo, T.: Aggregation functions: A guide for practitioners, vol. 221. Springer (2007)
  • [3] Calvo, T., Kolesárová, A., Komorníková, M., Mesiar, R.: Aggregation operators: properties, classes and construction methods. In: Aggregation operators, pp. 3–104. Springer (2002)
  • [4] Diligenti, M., Gori, M., Sacca, C.: Semantic-based regularization for learning and inference. Artificial Intelligence 244, 143–165 (2017)
  • [5] Donadello, I., Serafini, L., d’Avila Garcez, A.: Logic tensor networks for semantic image interpretation. In: IJCAI International Joint Conference on Artificial Intelligence. pp. 1596–1602 (2017)
  • [6] Giannini, F., Diligenti, M., Gori, M., Maggini, M.: On a convex logic fragment for learning and reasoning. IEEE Transactions on Fuzzy Systems (2018)
  • [7] Goodfellow, I., Bengio, Y., Courville, A.: Deep learning, vol. 1. MIT press Cambridge (2016)
  • [8] Grabisch, M., Marichal, J.L., Mesiar, R., Pap, E.: Aggregation functions: means. Information Sciences 181(1), 1–22 (2011)
  • [9] Hájek, P.: Metamathematics of fuzzy logic, vol. 4. Springer Science & Business Media (2013)
  • [10] Jenei, S.: A note on the ordinal sum theorem and its consequence for the construction of triangular norms. Fuzzy Sets and Systems 126(2), 199–205 (2002)
  • [11] Klement, E.P., Mesiar, R., Pap, E.: Triangular norms. position paper i: basic analytical and algebraic properties. Fuzzy Sets and Systems 143(1), 5–26 (2004)
  • [12] Klement, E.P., Mesiar, R., Pap, E.: Triangular norms. position paper ii: general constructions and parameterized families. Fuzzy Sets and Systems 145(3), 411–438 (2004)
  • [13] Klement, E.P., Mesiar, R., Pap, E.: Triangular norms. position paper iii: continuous t-norms. Fuzzy Sets and Systems 145(3), 439–454 (2004)
  • [14] Klement, E.P., Mesiar, R., Pap, E.: Triangular norms, vol. 8. Springer Science & Business Media (2013)
  • [15] Kolb, S., Teso, S., Passerini, A., De Raedt, L.: Learning smt (lra) constraints using smt solvers. In: IJCAI. pp. 2333–2340 (2018)
  • [16] Koller, D., Friedman, N., Džeroski, S., Sutton, C., McCallum, A., Pfeffer, A., Abbeel, P., Wong, M.F., Heckerman, D., Meek, C., et al.: Introduction to statistical relational learning. MIT press (2007)
  • [17] Novák, V., Perfilieva, I., Mockor, J.: Mathematical principles of fuzzy logic, vol. 517. Springer Science & Business Media (2012)
  • [18] Richardson, M., Domingos, P.: Markov logic networks. Machine learning 62(1), 107–136 (2006)
  • [19] Torra, V., Narukawa, Y.: Modeling decisions: information fusion and aggregation operators. Springer Science & Business Media (2007)
  • [20] Xu, J., Zhang, Z., Friedman, T., Liang, Y., Broeck, G.V.d.: A semantic loss function for deep learning with symbolic knowledge. arXiv preprint arXiv:1711.11157 (2017)
  • [21] Yang, F., Yang, Z., Cohen, W.W.: Differentiable learning of logical rules for knowledge base reasoning. In: Advances in Neural Information Processing Systems. pp. 2319–2328 (2017)