OWL 2 ontologies are nowadays a popular means to represent structured knowledge, and their formal semantics is based on Description Logics (DLs). The basic ingredients of DLs are concept descriptions (in First-Order Logic terminology, unary predicates), inheritance relationships among them, and instances of them.
Although a significant amount of work has been carried out on DLs, the application of machine learning techniques to OWL 2 ontologies, viz. DL ontologies, has received relatively less attention than the Inductive Logic Programming (ILP) setting (see e.g. [91, 92] for more insights on ILP). We refer the reader to [71, 93] for an overview and to Section 2.
In this work, we focus on the problem of automatically learning fuzzy concept inclusion axioms from OWL 2 ontologies. More specifically, given a target class of an OWL ontology, we address the problem of learning fuzzy $\mathcal{EL}(\mathbf{D})$ concept inclusion axioms that describe sufficient conditions for being an individual instance of that class.
Consider an ontology that describes the meaningful entities of a city (for instance, http://donghee.info/research/SHSS/ObjectiveConceptsOntology(OCO).html). Now, one may fix a city, say Pisa, and extract from Web sites both the properties of its hotels, such as location and price, and the users' judgements of those hotels, e.g., from Trip Advisor (http://www.tripadvisor.com). Using the terminology of the ontology, one may then ask what characterises good hotels in Pisa (our target class) according to the user feedback, and learn that, for instance, ‘An expensive Bed and Breakfast is a good hotel’ (see also Section 5 later on).
The objective is essentially the same as in e.g. [70, 114], except that we now propose to rely on the Real AdaBoost boosting algorithm, adapted to the (fuzzy) OWL case. Of course, as in [68, 114], we continue to support so-called fuzzy concept descriptions and fuzzy concrete domains [76, 112, 113], as in ‘an expensive Bed and Breakfast is a good hotel’. Here, the concept expensive is a so-called fuzzy concept, i.e. a concept for which the membership of an individual in the class is not necessarily a binary yes/no question, but rather a matter of degree in $[0,1]$. For instance, in our example, the degree of expensiveness of a hotel may depend on its price: the higher the price, the more expensive the hotel. Here, the range of the ‘attribute’ hotel price becomes a so-called fuzzy concrete domain, which allows one to specify fuzzy labels such as ‘high/moderate/low price’.
We recall that (discrete) AdaBoost [46, 108, 47] uses weak hypotheses whose outputs are restricted to the discrete set of classes, and combines them via leveraging weights in a linear vote. Real AdaBoost, on the other hand, generalises it by admitting real-valued weak hypotheses (see  for a comparison to approaches to real-valued AdaBoost).
Besides the fact that (to the best of our knowledge) the use of both (discrete) AdaBoost (with the notable exception of ) and its generalisation to real-valued weak hypotheses in the context of OWL 2 ontologies is essentially unexplored, the main features of our algorithm, called Fuzzy OWL-Boost, are the following:
the fuzzy concept inclusion axioms are then linearly combined into a new fuzzy concept inclusion axiom describing sufficient conditions for being an individual instance of the target class;
all generated fuzzy concept inclusion axioms can then be directly encoded as Fuzzy OWL 2 axioms [11, 12], as Fuzzy OWL 2 supports the linear combination of weighted concepts. As a consequence, a Fuzzy OWL 2 reasoner, such as fuzzyDL [10, 13], can then be used to automatically determine whether (and to which degree) an individual belongs to the target class.
Let us remark that we rely on real-valued AdaBoost because the weak hypotheses that Fuzzy OWL-Boost generates are indeed fuzzy concept inclusion axioms and, thus, the degree to which an instance satisfies them is a real-valued degree of truth in $[0,1]$.
We proceed as follows. In Section 2 we compare our work with closely related work that has appeared so far. For completeness, we refer to Appendix A, in which we provide a much more extensive list of references related to OWL rule learning, though less related to our setting. In Section 3 we recap the salient notions we rely on in this paper. Then, in Section 4 we present our algorithm Fuzzy OWL-Boost, whose effectiveness is evaluated in Section 5. Section 6 concludes and points to some topics for further research.
2 Related Work
Concept inclusion axiom learning in DLs stems from statistical relational learning, where classification rules are (possibly weighted) Horn clause theories learned from examples (see e.g. [91, 92]), and various methods have been proposed in the DL context so far (see e.g. [71, 93]). The general idea consists of exploring the search space of potential concept descriptions that cover the available training examples using so-called refinement operators (see, e.g. [5, 59, 62]). The goal is then to learn a concept description of the underlying DL language covering (possibly) all provided positive examples and (possibly) not covering any of the provided negative examples. The fuzzy case (see [67, 70, 114]) is a natural extension in which one relies on fuzzy DLs [9, 113] and fuzzy ILP (see e.g. ) instead.
Closely related to our work are [44, 67, 70, 114]. The works [67, 70], which stem essentially from [68, 69, 72, 73, 74, 75], propose fuzzy Foil-like algorithms and are inspired by fuzzy ILP variants such as [29, 109, 111] (see, e.g. , for an overview of fuzzy rule learning methods), while here we rely on a real-valued variant of AdaBoost. Let us note that [67, 73] consider the weaker hypothesis representation language DL-Lite , while here we rely on fuzzy $\mathcal{EL}(\mathbf{D})$ as in [68, 69, 72, 74, 75, 70]. Fuzzy $\mathcal{EL}(\mathbf{D})$ has also been considered in , which however differs from [67, 70] in that a (fuzzy) probabilistic ensemble evaluation of the fuzzy concept description candidates has been considered. (Also, as far as we were able to figure out, concrete datatypes were not addressed in the evaluation.) Discrete boosting has been considered in , which also shows how to derive a weak learner (called wDLF) from conventional learners using some sort of random downward refinement operator covering at least a positive example and yielding a minimal score fixed with a threshold. Besides the fact that we deal here with fuzziness in the hypothesis language and a real-valued variant of AdaBoost, the weak learner we propose here differs from the previous one by using a kind of gradient-descent-like algorithm to search for the best alternative. Notably, this also deviates from ‘fuzzy’ rule learning AdaBoost variants, such as [28, 87, 90, 107, 122], in which the weak learner is required to generate the whole rule search space before selecting the best current alternative. Such an approach is essentially infeasible in the OWL case due to the size of the search space.
Finally,  can learn fuzzy OWL DL concept equivalence axioms from Fuzzy OWL 2 ontologies by interfacing with the fuzzyDL reasoner . The candidate concept expressions are provided by the underlying DL-Learner [57, 15, 16] system. However, it has been tested only on a toy ontology so far. Last, but not least, let us mention , which is based on an ad-hoc translation of fuzzy Łukasiewicz DL constructs into fuzzy Logic Programming (fuzzy LP) and then uses a conventional ILP method to learn rules. Unfortunately, the method is not sound, as it has been shown that the mapping from fuzzy DLs to LP is incomplete  and entailment in Łukasiewicz is undecidable .
While it is not our aim here to provide an extensive overview of the literature on learning w.r.t. ontologies, we refer the interested reader to Appendix A for an extensive list of references, which may be the subject of a survey paper instead.
To keep the paper self-contained, we first introduce the main notions related to (Mathematical) Fuzzy Logics and Fuzzy Description Logics we will use in this work (see  for a more extensive introduction to both).
3.1 Mathematical Fuzzy Logic
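Fuzzy Logic is the logic of fuzzy sets . A fuzzy set $A$ over a countable crisp set $X$ is a function $A\colon X \to [0,1]$, called the fuzzy membership function of $A$. A crisp set $A$ is characterised by a membership function $A\colon X \to \{0,1\}$ instead. The ‘standard’ fuzzy set operations conform to $(A \cap B)(x) = \min(A(x), B(x))$, $(A \cup B)(x) = \max(A(x), B(x))$ and $\bar{A}(x) = 1 - A(x)$ ($\bar{A}$ is the set complement of $A$), the cardinality of a fuzzy set is often defined as $|A| = \sum_{x \in X} A(x)$, while the inclusion degree between $A$ and $B$ is typically defined as $\frac{|A \cap B|}{|A|}$.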
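As an illustration, the standard operations (pointwise minimum, maximum and complement, sum-based cardinality, and inclusion degree as the relative cardinality of the intersection) can be sketched in a few lines of Python. This is only a minimal sketch: the dictionary representation, the function names and the hotel degrees are assumptions made for the example, not part of the formal development.

```python
from typing import Dict, Iterable

FuzzySet = Dict[str, float]  # element -> membership degree in [0, 1]


def intersection(a: FuzzySet, b: FuzzySet) -> FuzzySet:
    """Standard intersection: pointwise minimum of the two degrees."""
    return {x: min(a.get(x, 0.0), b.get(x, 0.0)) for x in set(a) | set(b)}


def union(a: FuzzySet, b: FuzzySet) -> FuzzySet:
    """Standard union: pointwise maximum of the two degrees."""
    return {x: max(a.get(x, 0.0), b.get(x, 0.0)) for x in set(a) | set(b)}


def complement(a: FuzzySet, universe: Iterable[str]) -> FuzzySet:
    """Standard complement: one minus the degree, over an explicit universe."""
    return {x: 1.0 - a.get(x, 0.0) for x in universe}


def cardinality(a: FuzzySet) -> float:
    """Cardinality as the sum of all membership degrees."""
    return sum(a.values())


def inclusion_degree(a: FuzzySet, b: FuzzySet) -> float:
    """Degree to which a is included in b: |a AND b| / |a|.

    Returning 1.0 for an empty a is a convention chosen for this sketch.
    """
    size = cardinality(a)
    return cardinality(intersection(a, b)) / size if size > 0 else 1.0


# Hypothetical membership degrees for three hotels.
cheap = {"h1": 0.8, "h2": 0.2, "h3": 0.0}
good = {"h1": 0.6, "h2": 0.5, "h3": 0.9}
print(inclusion_degree(cheap, good))
```

Read this way, the inclusion degree of cheap in good measures how much of the "cheap" mass also counts as "good", which is the intuition behind the confidence degree used later for concept inclusions.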
The trapezoidal (Fig. 1 (a)), the triangular (Fig. 1 (b)), the $L$-function (left-shoulder function, Fig. 1 (c)), and the $R$-function (right-shoulder function, Fig. 1 (d)) are frequently used to specify membership functions of fuzzy sets.
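The four shapes can be written as piecewise-linear functions. The sketch below assumes the usual parameterisation (breakpoints a < b <= c < d for the trapezoid, a < b for the shoulders); the function names and parameter order are illustrative assumptions rather than notation fixed by the text.

```python
from typing import Callable

Membership = Callable[[float], float]


def trapezoidal(a: float, b: float, c: float, d: float) -> Membership:
    """0 outside (a, d), rising on [a, b], 1 on [b, c], falling on [c, d]."""
    def mu(x: float) -> float:
        if x <= a or x >= d:
            return 0.0
        if x < b:
            return (x - a) / (b - a)
        if x <= c:
            return 1.0
        return (d - x) / (d - c)
    return mu


def triangular(a: float, b: float, c: float) -> Membership:
    """A trapezoid whose plateau collapses to the single point b."""
    return trapezoidal(a, b, b, c)


def left_shoulder(a: float, b: float) -> Membership:
    """1 up to a, decreasing linearly to 0 at b."""
    def mu(x: float) -> float:
        if x <= a:
            return 1.0
        if x >= b:
            return 0.0
        return (b - x) / (b - a)
    return mu


def right_shoulder(a: float, b: float) -> Membership:
    """0 up to a, increasing linearly to 1 at b."""
    def mu(x: float) -> float:
        if x <= a:
            return 0.0
        if x >= b:
            return 1.0
        return (x - a) / (b - a)
    return mu
```

For instance, `right_shoulder(50, 100)` could model a hypothetical "high price" label over room prices in euros.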
Although fuzzy sets have a greater expressive power than classical crisp sets, their usefulness depends critically on the capability to construct appropriate membership functions for various given concepts in different contexts. We refer the interested reader to, e.g., . One easy and typically satisfactory method to define the membership functions is to uniformly partition the range of, e.g. salary values (bounded by a minimum and maximum value), into 5 or 7 fuzzy sets using triangular (or trapezoidal) functions (see Figure 2). Another popular approach may consist in using the so-called C-means fuzzy clustering algorithm (see, e.g. ) with three or five clusters, where the fuzzy membership functions are triangular functions built around the centroids of the clusters (see also ).
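The uniform partition can be sketched as follows. The construction below is one common reading, stated here as an assumption: evenly spaced centres, shoulder functions at the two ends and triangular functions in between, so that the degrees at any point of the range sum to one. The function name and the salary bounds are hypothetical.

```python
from typing import Callable, List


def uniform_partition(lo: float, hi: float, n: int = 5) -> List[Callable[[float], float]]:
    """Split [lo, hi] into n >= 2 fuzzy sets with evenly spaced centres."""
    step = (hi - lo) / (n - 1)
    centres = [lo + i * step for i in range(n)]

    def make(i: int) -> Callable[[float], float]:
        centre = centres[i]

        def mu(x: float) -> float:
            if i == 0 and x <= centre:       # left shoulder: fully true below the range
                return 1.0
            if i == n - 1 and x >= centre:   # right shoulder: fully true above the range
                return 1.0
            # Triangular flank: degree falls linearly with the distance from the centre.
            return max(0.0, 1.0 - abs(x - centre) / step)

        return mu

    return [make(i) for i in range(n)]


# Hypothetical salary range, partitioned into five labels.
labels = ["very low", "low", "medium", "high", "very high"]
partition = uniform_partition(0.0, 100.0, n=5)
degrees = {name: mu(30.0) for name, mu in zip(labels, partition)}
print(degrees)
```

A value of 30 falls between the centres of the second and third sets, so it belongs mostly to "low" and partly to "medium", and to no other label.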
In Mathematical Fuzzy Logic , the convention prescribing that a formula is either true or false (w.r.t. an interpretation ) is changed and is a matter of degree measured on an ordered scale that is no longer $\{0,1\}$, but typically $[0,1]$. This degree is called degree of truth of the formula in the interpretation . Here, fuzzy formulae have the form , where and is a First-Order Logic (FOL) formula, encoding that the degree of truth of is greater than or equal to . So, for instance, states that ‘Hotel Verdi is cheap’ is true to degree greater than or equal to . From a semantics point of view, a fuzzy interpretation maps each atomic formula into $[0,1]$ and is then extended inductively to all FOL formulae as follows:
where is the domain of , and , , , and are so-called t-norms, t-conorms, implication functions, and negation functions, respectively, which extend the Boolean conjunction, disjunction, implication, and negation, respectively, to the fuzzy case.
One usually distinguishes three different logics, namely Łukasiewicz, Gödel, and Product logics  (notably, a theorem states that any other continuous t-norm can be obtained as a combination of them), whose truth combination functions are reported in Table 1.
Note that the operators for ‘standard’ fuzzy logic, namely , , and , can be expressed in Łukasiewicz logic. More precisely, . Furthermore, the implication is called Kleene-Dienes implication (denoted ), while Zadeh implication (denoted ) is the implication if ; otherwise.
An r-implication is an implication function obtained as the residuum of a continuous t-norm $\otimes$, i.e. $a \Rightarrow b = \sup\{c \mid a \otimes c \leq b\}$. (Note that Łukasiewicz, Gödel and Product implications are r-implications, while Kleene-Dienes implication is not.) Note also that, given an r-implication $\Rightarrow$, we may also define its related negation $\ominus$ by means of $\ominus a = a \Rightarrow 0$ for every $a \in [0,1]$.
The notions of satisfiability and logical consequence are defined in the standard way, where a fuzzy interpretation satisfies a fuzzy formula , or is a model of , denoted as , iff . Notably, from and one may conclude (if is an r-implication) (this inference is called fuzzy modus ponens).
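To make the three logics and fuzzy modus ponens concrete, the following sketch encodes the usual textbook t-norms and their residual implications for Łukasiewicz, Gödel and Product logic, and derives the modus ponens bound as the t-norm of the two degrees. The class and function names are assumptions made for this example.

```python
from typing import Callable, NamedTuple


class FuzzyLogic(NamedTuple):
    tnorm: Callable[[float, float], float]        # fuzzy conjunction
    implication: Callable[[float, float], float]  # residuum of the t-norm


LUKASIEWICZ = FuzzyLogic(
    tnorm=lambda a, b: max(a + b - 1.0, 0.0),
    implication=lambda a, b: min(1.0 - a + b, 1.0),
)
GOEDEL = FuzzyLogic(
    tnorm=min,
    implication=lambda a, b: 1.0 if a <= b else b,
)
PRODUCT = FuzzyLogic(
    tnorm=lambda a, b: a * b,
    implication=lambda a, b: 1.0 if a <= b else b / a,
)


def negation(logic: FuzzyLogic, a: float) -> float:
    """Negation induced by an r-implication: a implies 0."""
    return logic.implication(a, 0.0)


def modus_ponens(logic: FuzzyLogic, alpha: float, beta: float) -> float:
    """Lower bound for psi, given phi >= alpha and (phi -> psi) >= beta."""
    return logic.tnorm(alpha, beta)


# With hypothetical degrees 0.8 for the premise and 0.5 for the implication:
for name, logic in [("Lukasiewicz", LUKASIEWICZ), ("Goedel", GOEDEL), ("Product", PRODUCT)]:
    print(name, modus_ponens(logic, 0.8, 0.5))
```

The inference is safe because, for a residuum, combining a degree with the degree of the implication via the t-norm can never exceed the degree of the consequent.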
3.2 Fuzzy Description Logics basics
We recap here the fuzzy DL , which extends the well-known fuzzy DL  with the weighted concept construct (indicated with the letter ) [12, 113]. is expressive enough to capture the main ingredients of fuzzy DLs we are going to consider here. Note that fuzzy DLs and fuzzy OWL 2 in particular, cover many more language constructs than we use here (see, e.g. [9, 12, 113]).
We start with the notion of fuzzy concrete domain, that is a tuple with datatype domain and a mapping that assigns to each data value an element of , and to every -ary datatype predicate a -ary fuzzy relation over . Therefore, maps indeed each datatype predicate into a function from to . Typical datatypes predicates are characterized by the well known membership functions (see also Fig. 1)
where e.g. is the left-shoulder membership function and corresponds to the crisp set of data values that are greater than or equal to the value .
Now, consider pairwise disjoint alphabets and , where is the set of individuals, is the set of concept names (also called atomic concepts) and is the set of role names. Each role is either an object property or a datatype property. The set of concepts are built from concept names using connectives and quantification constructs over object properties and datatype properties , as described by the following syntactic rule ():
An ABox consists of a finite set of assertion axioms. An assertion axiom is an expression of the form (called concept assertion, is an instance of concept to degree greater than or equal to ) or of the form (called role assertion, is an instance of object property to degree greater than or equal to ), where are individual names, is a concept, is an object property and is a truth value. A Terminological Box or TBox is a finite set of General Concept Inclusion (GCI) axioms, where a fuzzy GCI is of the form ( is a sub-concept of to degree greater than or equal to), where is a concept and . We may omit the truth degree of an axiom; in this case is assumed and we call the axiom crisp. We also write as a macro for the two GCIs and . We may also call a fuzzy GCI of the form , where is a concept name, a rule and its body. A Knowledge Base (KB) is a pair , where is a TBox and is an ABox. With we denote the set of individuals occurring in .
Concerning the semantics, let us fix a fuzzy logic and a fuzzy concrete domain . Now, unlike classical DLs in which an interpretation maps e.g. a concept into a set of individuals , i.e. maps into a function (either an individual belongs to the extension of or does not belong to it), in fuzzy DLs, maps into a function and, thus, an individual belongs to the extension of to some degree in , i.e. is a fuzzy set. Specifically, a fuzzy interpretation is a pair consisting of a nonempty (crisp) set (the domain) and of a fuzzy interpretation function that assigns: (i) to each atomic concept a function ; (ii) to each object property a function ; (iii) to each datatype property a function ; (iv) to each individual an element such that if (the so-called Unique Name Assumption); and (v) to each data value an element . Now, a fuzzy interpretation function is extended to concepts as specified below (where ):
The satisfiability of axioms is then defined by the following conditions: (i) satisfies an axiom if ; (ii) satisfies an axiom if ; (iii) satisfies an axiom if with (however, note that under standard logic is interpreted as and not as ). is a model of iff satisfies each axiom in . If has a model we say that is satisfiable (or consistent). We say that entails axiom , denoted , if any model of satisfies . The best entailment degree of of the form , : or :, denoted , is defined as
Please note that (i.e. implies , and similarly, (i.e. implies . However, in both cases the other way around does not hold. Furthermore, we may well have that both and hold.
Eventually, consider concept , a GCI , a KB , a set of individuals and a (weight) distribution over . Then the cardinality of w.r.t. and , denoted , is defined as
while the weighted cardinality w.r.t. , and , denoted , is defined as
Furthermore, the confidence degree (also called inclusion degree) of w.r.t. and , denoted , is defined as
Similarly, the weighted confidence degree (also called weighted inclusion degree) of w.r.t. , and , denoted , is defined as
Example 3.1 (Example 1.1 cont.)
Let us consider the following axiom
where is a datatype property whose values are measured in euros and the price concrete domain has been automatically fuzzified as illustrated in Figure 3. Now, it can be verified that for hotel , whose room price is euro, i.e. we have the assertion : in the KB, we infer under Product logic that (using fuzzy modus ponens, , where ).
4 Learning Fuzzy Concept Inclusions via Real-Valued Boosting
To start with, we introduce our learning problem.
4.1 The Learning Problem
In general terms, the learning problem we are going to address is stated as follows:
a satisfiable KB and its individuals ;
a target concept name with an associated unknown classification function , where for each , the possible values (labels) correspond, respectively, to ( is a positive example of ) and ( is a non-positive example of );
a hypothesis space of classifiers ;
a training set (the positive and non-positive examples of , respectively) of individual-label pairs:
With we denote the set of individuals occurring in . We assume that for all , , i.e. both and hold for all (essentially, we state that does not already know whether is an instance of or not). We write if is a positive example (i.e., ), if is a non-positive example (i.e., ).
Learn: a classifier that is the result of Empirical Risk Minimisation (ERM) on . That is,
where is a loss function such that measures how different the prediction of a hypothesis is from the true outcome and is the risk associated with hypothesis over , defined as the expectation of the loss function over .
The effectiveness of the learned classifier is then assessed by determining on a test set , disjoint from .
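As a point of reference, Empirical Risk Minimisation over a finite hypothesis space amounts to picking the hypothesis with the smallest average loss on the training set. The sketch below is illustrative only: the 0/1 labels, the zero-one loss on "prediction degree greater than zero", and the two toy hypotheses over hotel prices are all assumptions of this example.

```python
from typing import Callable, Sequence, Tuple

Hypothesis = Callable[[str], float]  # individual -> prediction degree in [0, 1]
Example = Tuple[str, int]            # (individual, label): 1 positive, 0 non-positive


def zero_one_loss(prediction: float, label: int) -> float:
    """0 if 'degree > 0' agrees with the label, 1 otherwise."""
    return 0.0 if (prediction > 0.0) == (label == 1) else 1.0


def empirical_risk(h: Hypothesis, examples: Sequence[Example],
                   loss: Callable[[float, int], float] = zero_one_loss) -> float:
    """Average loss of h over the training examples."""
    return sum(loss(h(a), y) for a, y in examples) / len(examples)


def erm(hypotheses: Sequence[Hypothesis], examples: Sequence[Example],
        loss: Callable[[float, int], float] = zero_one_loss) -> Hypothesis:
    """Return a hypothesis with minimal empirical risk."""
    return min(hypotheses, key=lambda h: empirical_risk(h, examples, loss))


# Hypothetical data: room prices in euros and user judgements.
price = {"h1": 40.0, "h2": 50.0, "h3": 120.0, "h4": 150.0}
training = [("h1", 0), ("h2", 0), ("h3", 1), ("h4", 1)]


def expensive(a: str) -> float:
    """Right-shoulder style degree of being expensive, between 50 and 150 euros."""
    return min(1.0, max(0.0, (price[a] - 50.0) / 100.0))


def always_good(a: str) -> float:
    return 1.0


best = erm([always_good, expensive], training)
print(best.__name__, empirical_risk(best, training))
```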
In our setting, we assume that a hypothesis is a fuzzy GCI of the form
where each is a so-called fuzzy concept expression (note that is a basic ingredient of the OWL profile language OWL EL ) defined according to the following syntax ( is the concrete domain of boolean values):
For , the classification prediction value of w.r.t. , and is defined as (for ease, we omit and )
Note that, as stated above, essentially a hypothesis is a sufficient condition (expressed via the weighted sum of concepts) for being an individual instance of a target concept to some positive degree. So, if then is a non-positive instance of , while if then is a positive instance of to some degree and, thus, we distinguish between positive and non-positive instances of only. Furthermore, let us note that even if is a crisp KB, the possible occurrence of fuzzy concrete domains in expressions of the form in the left-hand side of a hypothesis may imply that .
Note that in e.g.  a hypothesis is of the form instead.
Clearly, the set of hypotheses generated by this syntax is potentially infinite due, e.g., to conjunction and the nesting of existential restrictions. The set is made finite by imposing further restrictions on the generation process, such as the maximal number of conjuncts and the depth of existential nestings allowed.
One may also think of further partitioning the set of non-positive examples into a set of negative and a set of unknown examples (and use as labelling set , respectively, with –positive, – unknown, – negative), as done in many other approaches (see e.g. ). That is, an individual is a negative example of if , while is an unknown example of if neither nor hold. In that case, usually we are looking for an exact definition of , i.e. a hypothesis is of the stronger form instead. (We recall that a hypothesis as in Eq. 5 does not allow us to infer negative instances of , while does. That is, we may well have the case and with .) Which one to choose may depend on the application domain and on the effectiveness of the approach. We do not address this case here.
It is easily verified that indeed a hypothesis can be rewritten as a set of rules of the form (with new concept names):
where, as we will see later on, each fuzzy GCI is a weak hypothesis (classifier), while their aggregation is computed via Real AdaBoost, in which each indicates how much contributes to the classification prediction value.
We conclude with the notions of consistent, non-redundant, sound, complete and strongly complete hypothesis w.r.t. , which are defined as follows:
is a consistent;
- Strong Completeness.
We say that a hypothesis covers (strongly covers) an example iff (). Therefore, soundness states that a learned hypothesis is not allowed to cover a non-positive example, while the way (strong) completeness is stated guarantees that all positive examples are (strongly) covered.
In general, a learned (induced) hypothesis has to be consistent, non-redundant and sound w.r.t. , but not necessarily complete; of course, these conditions can also be relaxed.
4.2 The Learning Algorithm
We now present our real-valued boosting-based algorithm, which is based on a boosting schema applied to a fuzzy GCI learner. Our learning method creates an ensemble of classifiers made up of fuzzy concept expressions (see Eq. 5), each of which is provided by a fuzzy weak learner, whose predictiveness is required to be better than random. Essentially, at each round the weak learner generates a fuzzy candidate GCI of the form that determines a change to the distribution of the weights associated with the examples. The weights of misclassified examples are increased, indicating the harder examples to focus on, so that a better classifier can be produced in the next round. The weak hypotheses are then eventually combined into a hypothesis (see Eq. 6). We rely on Real AdaBoost [85, 86] as the boosting algorithm, while we use a weak learner that is similar to Foil-$\mathcal{DL}$ [67, 68, 70]; both need to be adapted to our specific setting.
Formally, consider a KB , a training set , a set of individuals with , and a weight distribution over (the weight of w.r.t. is denoted ). With we indicate the uniform distribution over , i.e. (with ). Furthermore, consider a weak hypothesis of the form returned by the weak learner. Note that for , . Next, we transform this value into a value in as required by Real AdaBoost. So, let be the transformation function
and let the classification prediction value of w.r.t. , and be defined as (again for ease, we omit and )
We also define the examples labelling over in the following way: for
Then, the Fuzzy OWL-Boost algorithm, which iteratively calls a weak learner, is shown in Algorithm 1, which we comment on briefly next. The algorithm is essentially the same as Real AdaBoost, except for a few context-dependent parts. In Step 2 we initialise the set of individuals to be considered as . Essentially, all individuals will be weighted. The main loop (Steps 5 - 11) is the same as for Real AdaBoost, with the particularity that in Step 6 we invoke a fuzzy GCI (weak) learner that is assumed to return a GCI of the form . Note that, for ease of presentation, we did not include an additional condition that causes a break of the loop. In fact, an implicit condition of boosting is that the error of a weak learner is below $1/2$. This may be implemented in our case by adding another step before Step 12 that computes the error
where is defined as ()
and determines whether there is a disagreement between the signs of and . Then, if we break the loop. In Step 12 we add the (weak) learned fuzzy GCI to the hypothesis set . In Steps 14 - 18 we prepare the final classifier ensemble. To do so, we have to perform a normalisation step. In fact, since in Real AdaBoost generally , we have to normalise the set of values () before building the weighted sum in Step 16. To do so, we rely on the well-known softmax function. Eventually, in Step 17, we determine the degree to be attached to the ensemble classifier, computed as the confidence value, which resembles the well-known precision measure used in machine learning (precision is also called positive predictive value and roughly is the percentage of positive instances among all retrieved instances).
We next describe the weak learner we employ here. As anticipated, we will use a Foil-$\mathcal{DL}$-like weak learner [67, 68, 70], which however needs to be adapted to our specific setting. In general terms the weak learning algorithm, called wFoil-$\mathcal{DL}$, proceeds as follows:
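The overall loop can be sketched as follows. This is a hedged illustration, not a transcription of Algorithm 1: it assumes labels in {-1, +1}, weak hypotheses already mapped into [-1, 1], the edge-based leveraging coefficient ½ ln((1 + r)/(1 - r)) commonly used for real-valued AdaBoost, a break when the weighted sign-disagreement error reaches 1/2, and a final softmax over the coefficients. The function name and signature are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

WeakHypothesis = Callable[[str], float]  # individual -> value in [-1, 1]
WeakLearner = Callable[[Sequence[str], Sequence[int], Sequence[float]], WeakHypothesis]


def real_adaboost_sketch(individuals: Sequence[str], labels: Sequence[int],
                         weak_learner: WeakLearner,
                         rounds: int) -> List[Tuple[float, WeakHypothesis]]:
    """Return (normalised weight, weak hypothesis) pairs; labels are -1 or +1."""
    n = len(individuals)
    weights = [1.0 / n] * n  # uniform initial distribution over the examples
    hypotheses: List[WeakHypothesis] = []
    alphas: List[float] = []
    for _ in range(rounds):
        h = weak_learner(individuals, labels, weights)
        preds = [h(a) for a in individuals]
        # Weighted error: total weight of examples whose sign disagrees with the label.
        error = sum(w for w, y, p in zip(weights, labels, preds) if (p > 0) != (y > 0))
        if error >= 0.5:  # the weak learner is no better than chance: stop
            break
        # Weighted edge, clamped away from +/-1 so that the coefficient stays finite.
        edge = sum(w * y * p for w, y, p in zip(weights, labels, preds))
        edge = max(min(edge, 1.0 - 1e-9), -1.0 + 1e-9)
        alphas.append(0.5 * math.log((1.0 + edge) / (1.0 - edge)))
        hypotheses.append(h)
        # Examples the weak hypothesis gets wrong gain weight; then renormalise.
        weights = [w * (1.0 - edge * y * p) for w, y, p in zip(weights, labels, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    if not alphas:
        return []
    # Softmax normalisation, so the coefficients form a convex combination.
    top = max(alphas)
    exps = [math.exp(a - top) for a in alphas]
    norm = sum(exps)
    return [(e / norm, h) for e, h in zip(exps, hypotheses)]
```

The softmax step is what makes the result expressible as a weighted sum of concepts whose weights add up to one; the attached confidence degree of the ensemble would be computed separately.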
start from concept ;
apply a refinement operator to find more specific concept description candidates;
exploit a scoring function to choose the best candidate;
re-apply the refinement operator until a good candidate is found;
iterate the whole procedure until a satisfactory coverage of the positive examples is achieved.
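The inner part of this procedure (start general, refine, score, repeat) can be illustrated with a deliberately simplified, crisp sketch. Here a concept is just a set of atomic conjuncts, the refinement operator adds one conjunct, the most general concept is the empty conjunction, and plain confidence serves as the score. All of these are simplifying assumptions of the example, not the actual refinement operator or scoring function; the outer iteration over uncovered positives is omitted.

```python
from typing import FrozenSet, List, Sequence, Set, Tuple

Example = Tuple[Set[str], bool]  # (atomic concepts holding for the individual, is positive)


def greedy_refine(atoms: Sequence[str], examples: List[Example],
                  min_conf: float = 0.9, max_conjuncts: int = 3) -> Tuple[FrozenSet[str], float]:
    """Greedily specialise a conjunction until it is confident enough."""

    def confidence(body: FrozenSet[str]) -> float:
        # Fraction of positives among the examples covered by the conjunction.
        covered = [positive for features, positive in examples if body <= features]
        return sum(covered) / len(covered) if covered else 0.0

    body: FrozenSet[str] = frozenset()  # the most general concept covers everything
    best = confidence(body)
    while best < min_conf and len(body) < max_conjuncts:
        # Refinement step: every way of adding one more conjunct.
        candidates = [body | {atom} for atom in atoms if atom not in body]
        if not candidates:
            break
        candidate = max(candidates, key=confidence)
        if confidence(candidate) <= best:  # no improvement: stop refining
            break
        body, best = candidate, confidence(candidate)
    return body, best


# Hypothetical hotel descriptions.
data: List[Example] = [
    ({"BnB", "Expensive"}, True),
    ({"BnB"}, False),
    ({"Expensive", "Central"}, True),
    ({"Central"}, False),
]
print(greedy_refine(["BnB", "Expensive", "Central"], data))
```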
We briefly detail these steps.
Computing fuzzy datatypes. For a numerical datatype , we allow an equal-width triangular/trapezoidal partition of values into a finite number of fuzzy sets (typically, or sets), which is identical to [67, 70, 114] (see, e.g. Figure 2). However, we additionally allow the use of the C-means fuzzy clustering algorithm over with or clusters, where the fuzzy membership function is a triangular function built around the centroid of a cluster. Note that C-means has not been considered in [67, 70, 114]; specifically, it has not been considered so far in fuzzy GCI learning.
The refinement operator. The refinement operator we employ is the same as in [67, 68, 74, 114] except that now we add the management of boolean values as well. Essentially, the refinement operator takes as input a concept and generates new, more specific concept description candidates (i.e., ). For the sake of completeness, we recap the refinement operator here. Let be an ontology, be the set of all atomic concepts in , the set of all object properties in , the set of all numeric datatype properties in , the set of all boolean datatype properties in and a set of (fuzzy) datatypes. The refinement operator is shown in Table 2.
The scoring function. The scoring function we use to assign a score to each candidate hypothesis is essentially a weighted gain function, similar to the one employed in [67, 68, 74, 114] and implements an information-theoretic criterion for selecting the best candidate at each refinement step. Specifically, given a GCI of the form chosen at the previous step, a KB , a set of individuals , a weight distribution over , a set of examples and a candidate GCI of the form , then
where is the weighted cardinality of positive examples covered by that are still covered by . Note that the gain is positive if the confidence degree increases.
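A Foil-style gain of this kind is commonly written as p · (log₂ cf(new) − log₂ cf(old)), which is positive exactly when the confidence increases. The sketch below assumes that form, and it also assumes a simple weighted confidence (weighted covered positives over weighted covered examples); both formulas, the function names and the numbers are illustrative assumptions rather than the definitions used in the paper.

```python
import math
from typing import Dict, Set


def weighted_confidence(body_degree: Dict[str, float], positives: Set[str],
                        weights: Dict[str, float]) -> float:
    """Weighted share of the covered mass that falls on positive examples."""
    covered = sum(weights[a] * d for a, d in body_degree.items())
    covered_pos = sum(weights[a] * d for a, d in body_degree.items() if a in positives)
    return covered_pos / covered if covered > 0 else 0.0


def foil_gain(p: float, cf_new: float, cf_old: float) -> float:
    """Information-theoretic gain; requires both confidences to be > 0."""
    return p * (math.log2(cf_new) - math.log2(cf_old))


# Hypothetical data: degrees to which four hotels satisfy the old and new rule body.
weights = {"h1": 0.25, "h2": 0.25, "h3": 0.25, "h4": 0.25}
positives = {"h3", "h4"}
old_body = {"h1": 1.0, "h2": 1.0, "h3": 1.0, "h4": 1.0}  # the most general body
new_body = {"h1": 0.0, "h2": 0.2, "h3": 0.7, "h4": 1.0}  # e.g. "expensive"

cf_old = weighted_confidence(old_body, positives, weights)
cf_new = weighted_confidence(new_body, positives, weights)
# Weighted cardinality of positives still covered by the refinement (assumed form).
p = sum(weights[a] * new_body[a] for a in positives)
print(cf_old, cf_new, foil_gain(p, cf_new, cf_old))
```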
Stop Criterion. wFoil-$\mathcal{DL}$ stops when the confidence degree is above a given threshold , or no better weak learner can be found that does not cover any negative example (in ) above a given percentage. Note that in Foil-$\mathcal{DL}$ instead, non-positive examples are not allowed to be covered.
The wFoil-$\mathcal{DL}$ Algorithm. The wFoil-$\mathcal{DL}$ algorithm is defined in Algorithm 2, which we comment on briefly next. Steps 1 - 3 are simple initialisation steps. Steps 5 - 21 are the main loop, from which we may exit in case there is no improvement (Step 16), as well as when the confidence degree of the weak learner determined so far is above a given threshold or it does not cover any negative example above a given percentage (Step 18). Note that the latter case guarantees soundness of the weak learner if the percentage is set to . In Step 8 we determine all new refinements, which are then scored in Steps 10 - 15 in order to determine the one with the best gain. Eventually, once we exit from the main loop, the best weak learner found is returned (Steps 22 and 23).