Over the last few years, knowledge graphs have become indispensable in a large number of data-driven applications (hogan2021knowledge). For instance, many companies followed the example of the Google Knowledge Graph (singhal2012introducing), including LinkedIn (he2016building), Microsoft (shrivastava2017bring), eBay (pittman2017cracking), Amazon (krishnan2018making), AirBnB (chang2018scaling), and Uber (hamad2018food). Although definitions and categorizations of knowledge graphs differ (hogan2020knowledge), knowledge graphs are valuable for many tasks, including information retrieval in search engines and structured machine learning in fixed domains of discourse. However, such real-world knowledge graphs often contain millions of entities and concepts (nodes) and billions of assertions (edges), making tasks such as concept learning difficult to solve.
In the previous decade, lehmann2010concept investigated concept learning using refinement operators. Their results showed improvements in accuracy over previous approaches from Inductive Logic Programming (ILP) on different data sets, suggesting that this is an interesting research direction to explore. A similar work was carried out by badea2000refinement in the description logic $\mathcal{ALER}$. The authors applied the ILP learning procedure for Horn clauses (https://en.wikipedia.org/wiki/Horn_clause, accessed on 23.06.2021) to description logics.
Although learning in description logics using refinement operators has several advantages over other methods, it still suffers from scalability issues. On large knowledge graphs, concept learning as described in (lehmann2010concept) and (badea2000refinement) can be very slow, as the refinement tree can grow very large. This scalability issue is pointed out in (rizzo2020class), where the authors propose meta-heuristics to tackle it. In this paper, we aim at speeding up the concept learning process by predicting the length of the target concept in advance, so that the search space can be reduced. To the best of our knowledge, no similar work has been carried out before, and our findings will serve as a foundation for further investigations in this direction. We evaluate our approach using CELOE. In a nutshell, our contributions are:
Design of different neural network architectures for learning concept lengths.
Design and implementation of an expressive length-based refinement operator for the generation of training data.
Integration of trained concept length predictors into the CELOE algorithm, resulting in a new algorithm which we call CELOE-CLP.
The rest of the paper is organized as follows: In Section 2, we present the notations and terminologies needed in the following sections. In Section 3, we provide a background on description logics, knowledge graph embedding techniques, as well as refinement operators for description logics. In particular, we focus on the description logic $\mathcal{ALC}$. Section 4 presents previous work on concept learning using refinement operators. In Section 5, we describe our method for learning concept lengths, and we present the results we obtained on different knowledge graphs in Section 6. Section 7 draws conclusions from our findings and introduces new directions for future work.
2 Terminology and Notation
In this section, we present some special notations and terminologies used throughout the paper. We write $\mathcal{K}$ for a knowledge graph and $N_I$ for the set of all individuals in $\mathcal{K}$. $|\cdot|$ denotes the cardinality function, i.e., the function that maps a set to its number of elements. The terms "concept" and "class expression" are used interchangeably. If $C$ is a class expression, then $Pos(C)$ and $Neg(C)$ denote the sets of positive and negative examples of $C$, respectively.
Note that this section only presents notations that are common to several sections. Any other notation will be explicitly defined in the section where it is used.
3.1 Description Logics
Description logics (baader2003description) are a family of languages for knowledge representation. They are fundamental to the Web Ontology Language (OWL), which is often used for the formal specification of RDF knowledge graphs. While description logics can be combined with other techniques to yield more powerful formalisms (rosati2006dl+; rudolph2007relational), we focus on the description logic $\mathcal{ALC}$ (Attributive Language with Complement). It is the smallest description logic that is propositionally closed. Its basic components are concept names (e.g., Teacher, Human), role names (e.g., hasChild, bornIn), and individuals (e.g., Mike, Jack). Table 1 introduces the syntax of $\mathcal{ALC}$ and its semantics. In the table, $\mathcal{I}$ is an interpretation and $\Delta^{\mathcal{I}}$ its domain; see (lehmann2010concept) for more details.
In $\mathcal{ALC}$, (syntactic) concept lengths are defined recursively (lehmann2010concept):
$\mathit{length}(A) = \mathit{length}(\top) = \mathit{length}(\bot) = 1$, for all atomic concepts $A$,
$\mathit{length}(\neg C) = \mathit{length}(C) + 1$, for all concepts $C$,
$\mathit{length}(\exists r.C) = \mathit{length}(\forall r.C) = \mathit{length}(C) + 2$, for all concepts $C$,
$\mathit{length}(C \sqcap D) = \mathit{length}(C \sqcup D) = \mathit{length}(C) + \mathit{length}(D) + 1$, for all concepts $C$ and $D$.
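To make the recursion concrete, the following is a minimal Python sketch; the tuple-based concept encoding is our illustrative assumption, not the paper's representation:

```python
# Minimal sketch: syntactic length of an ALC concept.
# Concepts are nested tuples (a hypothetical encoding):
#   ("atom", name), ("not", C), ("exists", r, C), ("forall", r, C),
#   ("and", C, D), ("or", C, D)

def concept_length(c):
    tag = c[0]
    if tag == "atom":                 # length(A) = 1
        return 1
    if tag == "not":                  # length(not C) = 1 + length(C)
        return 1 + concept_length(c[1])
    if tag in ("exists", "forall"):   # length(exists r.C) = length(forall r.C) = 2 + length(C)
        return 2 + concept_length(c[2])
    if tag in ("and", "or"):          # length(C and D) = 1 + length(C) + length(D)
        return 1 + concept_length(c[1]) + concept_length(c[2])
    raise ValueError(f"unknown constructor: {tag}")

# Example: exists hasChild.(Teacher or not Human) has length 2 + (1 + 1 + 2) = 6
expr = ("exists", "hasChild", ("or", ("atom", "Teacher"), ("not", ("atom", "Human"))))
print(concept_length(expr))  # → 6
```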
3.2 Refinement Operators
[Refinement Operator] A quasi-ordering is a reflexive and transitive relation. Let $(S, \preceq)$ be a quasi-ordered space. A downward (upward) refinement operator on $S$ is a mapping $\rho: S \rightarrow 2^S$ such that for all $C \in S$, $D \in \rho(C)$ implies $D \preceq C$ ($C \preceq D$).
As an example, let $\mathcal{K}$ be a knowledge graph, and assume the sets $N_C$ of concept names and $N_R$ of role names in $\mathcal{K}$ are given. Let $\mathcal{C}$ be the set of all concept expressions (rudolph2011foundations) that can be constructed from $N_C$ and $N_R$ (note that $\mathcal{C}$ is infinite and every concept name is a concept expression). Consider a mapping $\rho$ on $(\mathcal{C}, \sqsubseteq)$ that maps each concept to a set of more specific concepts. Such a $\rho$ is clearly a downward refinement operator, and we have, for example, that:
The first bullet line can be justified by the presence of the relevant assertions in $\mathcal{K}$. Using the semantics of the constructors in Table 1, it is straightforward to prove that the second bullet line holds. Refinement operators can have a number of important properties which we do not discuss in this paper; for further details, we refer the reader to lehmann2010concept. In the context of concept learning, these properties can be exploited to optimize the traversal of the space of concepts in search of a specific concept.
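As an illustration of the definition, a toy downward refinement operator over the subsumption ordering can be sketched as follows; the concept names and tuple encoding are hypothetical:

```python
# Toy downward refinement operator over the subsumption quasi-ordering:
# rho(C) = { C and A | A a concept name }. Every refinement (C and A) is
# subsumed by C, so rho is downward. Names are illustrative only.

CONCEPT_NAMES = ["Teacher", "Human", "Parent"]

def rho(c):
    """Return the set of one-step downward refinements of concept c."""
    return {("and", c, ("atom", a)) for a in CONCEPT_NAMES}

refinements = rho(("atom", "Human"))
# e.g. ("and", ("atom", "Human"), ("atom", "Teacher")) is in rho(Human)
print(len(refinements))  # → 3
```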
3.3 Knowledge Graph Embeddings
A knowledge graph embedding function maps a knowledge graph into a continuous vector space to facilitate downstream tasks such as link prediction, knowledge graph completion, and concept learning in description logics (wang2017knowledge). Approaches for embedding knowledge graphs can be subdivided into two categories: the first uses only the facts in the knowledge graph (nickel2012factorizing; weston2013connecting; bordes2014semantic), while the second additionally takes into account information about entities and relations, such as textual descriptions (xie2016representation; wang2016text). Both categories usually initialize each entity and relation with a random vector, matrix, or tensor, and then define a scoring function to learn the embeddings so that facts observed in the knowledge graph receive high scores while non-observed facts are assigned lower scores (unless there is a good reason for a high score, for instance when a fact is supposed to hold but is not observed in the knowledge graph, or when it is a logical implication of the learned patterns). We refer the reader to the survey by wang2017knowledge for more details. In this work, we use the Convolutional Complex Embedding Model (ConEx) by demir2020convolutional, which achieves state-of-the-art results with fewer trainable parameters.
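The score-based idea can be illustrated with a TransE-style distance score, a simpler stand-in for ConEx, which is what this work actually uses; the entities, relations, and dimension below are made up:

```python
import math
import random

random.seed(42)
DIM = 4  # embedding dimension (illustrative)

# Randomly initialized embeddings for a handful of made-up entities/relations.
entities = {e: [random.uniform(-1, 1) for _ in range(DIM)]
            for e in ("Mike", "Jack", "Paderborn")}
relations = {r: [random.uniform(-1, 1) for _ in range(DIM)] for r in ("bornIn",)}

def score(h, r, t):
    """TransE-style plausibility: the negative L2 distance between h + r
    and t, so higher (closer to 0) means more plausible."""
    return -math.sqrt(sum((entities[h][i] + relations[r][i] - entities[t][i]) ** 2
                          for i in range(DIM)))

# Training would adjust the vectors so that observed triples such as
# (Mike, bornIn, Paderborn) score higher than corrupted triples.
s = score("Mike", "bornIn", "Paderborn")
```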
4 Related Work
The rich syntax of description logics and other W3C (https://www.w3.org/) standards (hitzler2009foundations) paved the way for solving challenging tasks such as link prediction, concept learning, and knowledge graph completion. lehmann2010concept investigated concept learning using refinement operators. In their work, they studied different combinations of properties that a refinement operator can have and designed their own refinement operator. The latter was used to learn concept expressions in many knowledge graphs, including Carcinogenesis, Poker, Forte, and Moral (see https://github.com/SmartDataAnalytics/DL-Learner/releases for the datasets). Their approach proved to be competitive in accuracy with (and on some examples, superior to) the state of the art, mainly inductive logic programs. A similar work was previously carried out by Badea and Nienhuys-Cheng (badea2000refinement) in the description logic $\mathcal{ALER}$. The authors applied the ILP learning procedure for Horn clauses to description logics. They built a refinement operator for learning in $\mathcal{ALER}$, which turned out to have an advantage: it avoids the difficulties related to the computation of most specific concepts (MSC) (baader1998least) faced by the approach in cohen1994learning. The evaluation of their approach on real ontologies from different domains showed promising results, but it had the disadvantage of depending on the instance data.
lehmann2011class proposed an algorithm for learning concepts (CELOE), with a focus on ontology engineering, namely the extension of OWL ontologies. The authors argue that machine learning equivalent class suggestions have two main advantages over human expert suggestions: 1.) the suggested class expressions fit the instance data, and 2.) it is easier to understand an OWL class expression than to understand the structure of an ontology and manually generate a class expression. The evaluation of their approach on real ontologies showed promising results. Its disadvantage is mainly the dependency on the availability of instance data, and the quality of the ontology.
rizzo2018framework proposed DL-Focl, a modified version of DL-Foil (fanizzi2008dl) for concept learning, which employs meta-heuristics to help reduce the search space. The algorithm computes a generalization as a disjunction of partial descriptions, each covering a part of the positive examples and ruling out as many negative and uncertain-membership examples as possible. Each partial description is generated by selecting refinements that cover at least one positive example and score a certain minimum gain. The first release of DL-Focl (rizzo2018framework) was essentially based on omission rates: to check whether further iterations are required, it compares the scores of the current concept definition and of the best concept obtained at that stage. Its latest versions (rizzo2020class) arrived with two more tricks used during the search process: one variant employs a lookahead strategy that assesses the quality of the next possible refinements of the current partial description, while another attempts to solve the myopia problem by introducing a local memory, used to avoid reconsidering sub-optimal choices previously made.
5 Learning Concept Lengths
In this section, we address the problem of predicting the correct concept length for a given learning problem as a means to improve the runtime of concept learning approaches based on refinement operators. We use the Closed World Assumption (CWA) to obtain the negative examples of each concept: every individual in the knowledge graph that does not belong to a concept expression is a negative example for that concept.
5.1 Training Data Construction for Length Prediction
Given a knowledge graph $\mathcal{K}$, we compute the transitive closure of the rdf:type statements and add the resulting triples to $\mathcal{K}$. In our preliminary experiments, this technique led to better embeddings (we used ConEx (demir2020convolutional) to compute the embeddings). Next, we proceed as follows:
Generate class expressions of various lengths using a length-based refinement operator (length-based refinement operators generally restrict the length of the child nodes to be generated so that it does not exceed and/or fall below some predefined integer values). In this process, short concepts are preferred over long concepts, i.e., when two concepts have the same set of positive examples, the longer one is left out.
Get the positive and negative examples for each generated class expression via instance retrieval under the closed world assumption described above. We then define a hyper-parameter $N$ that represents the total number of positive and negative examples we want to use, since using all positive and all negative examples might lead to scalability issues. The sampling of positive and negative examples is done as follows:
If the sets of positive and negative examples each contain at least $N/2$ individuals, then we randomly sample $N/2$ individuals from each of the two sets.
Otherwise, we take all individuals in the minority set and sample the remaining number of individuals from the other set.
On the vector representations of entities (individuals), we create an extra dimension at the end of the entries, where we insert 1 for positive examples and -1 for negative examples. In other words, for each target concept $C$, we define an injective function that maps the embedding $v \in \mathbb{R}^d$ of an entity to $(v, 1) \in \mathbb{R}^{d+1}$ if the entity is a positive example of $C$, and to $(v, -1)$ otherwise, where $d$ is the dimension of the embedding space. Intuitively, this is a categorization function that facilitates the distinction between positive and negative examples without affecting the quality of the embeddings. Thus, a data point in the training, validation, and test data sets is a tuple $(X, \mathit{length}(C))$, where $X$ is a matrix of shape $(N, d+1)$ whose first rows are the labeled embeddings of the sampled positive examples and whose remaining rows are those of the sampled negative examples.
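The sampling and labeling steps above can be sketched as follows; the function name, the use of +1/-1 as labels, and the toy embeddings are our assumptions:

```python
import random

# Sketch of the training-matrix construction: sample up to N/2 examples
# from each side (taking all of a minority set and topping up from the
# other), then append a +/-1 label as an extra dimension.

def build_input_matrix(pos, neg, embeddings, n_total, rng=random):
    """Return a list of N rows, each an embedding with a label appended."""
    half = n_total // 2
    if len(pos) >= half and len(neg) >= half:
        pos_s = rng.sample(pos, half)
        neg_s = rng.sample(neg, half)
    elif len(pos) < len(neg):
        pos_s = list(pos)  # minority set: take everything
        neg_s = rng.sample(neg, min(len(neg), n_total - len(pos_s)))
    else:
        neg_s = list(neg)
        pos_s = rng.sample(pos, min(len(pos), n_total - len(neg_s)))
    return ([embeddings[e] + [1.0] for e in pos_s] +
            [embeddings[e] + [-1.0] for e in neg_s])  # shape (N, d+1)

# Toy embeddings of dimension d = 2 for six made-up individuals.
emb = {f"e{i}": [0.1 * i, 0.2 * i] for i in range(6)}
X = build_input_matrix(["e0", "e1"], ["e2", "e3", "e4", "e5"], emb, n_total=6)
print(len(X), len(X[0]))  # → 6 3
```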
We view the prediction of concept lengths as a classification problem with classes $1, \dots, L$, where $L$ is the length of the longest concept in the training data set. As shown in Table 2, the concept length distribution is imbalanced. To prevent concept length predictors from overfitting on the majority classes, we used the weighted cross-entropy loss
$\mathcal{L} = -\frac{1}{m}\sum_{i=1}^{m}\sum_{l=1}^{L} w_l \,\mathbb{1}[y_i = l]\, \log p_{i,l},$
where $m$ is the batch size, $p_{i,l}$ is the predicted probability (or score) of class $l$ for the $i$-th example in the batch, $y_i$ is the target class of the $i$-th example, $w$ is a weight vector, and $\mathbb{1}[\cdot]$ is the indicator function, equal to 1 if its argument is true and 0 otherwise. The weight vector $w$ is chosen to be inversely related to the class counts $n_l$, where $n_l$ is the number of class expressions of length $l$ in the training data set, so that rare lengths receive larger weights. Table 2 provides details on the training, validation, and test data sets for each of the four knowledge graphs.
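The loss can be sketched in a few lines; this is a pure-Python illustration, whereas the actual training would use a deep learning framework's weighted cross-entropy (e.g., PyTorch's CrossEntropyLoss with a weight argument):

```python
import math

# Pure-Python sketch of the weighted cross-entropy loss described above.

def weighted_cross_entropy(probs, targets, weights):
    """probs[i][l]: predicted probability of length class l for sample i;
    targets[i]: true class index; weights[l]: weight of class l.
    Returns the batch mean of -w[y_i] * log p_{i, y_i}."""
    m = len(targets)
    return -sum(weights[y] * math.log(p[y]) for p, y in zip(probs, targets)) / m

# One common way to set class weights inversely to class counts n_l
# (the counts here are hypothetical):
counts = [50, 30, 20]
weights = [sum(counts) / (len(counts) * n) for n in counts]

probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
loss = weighted_cross_entropy(probs, [0, 1], weights)
print(round(loss, 4))
```

Note that the rarest class (here, the third) receives the largest weight, which counteracts the length imbalance reported in Table 2.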
[Table 2: concept length distribution of the training, validation, and test data for the Carcinogenesis, Family Benchmark, Mutagenesis, and Semantic Bible knowledge graphs.]
5.2 Concept Length Predictors
We considered four neural network architectures: Long Short-Term Memory (LSTM) (hochreiter1997long), Gated Recurrent Unit (GRU) (cho2014learning), a convolutional neural network (CNN), and a multi-layer perceptron (MLP); their configurations are described in Section 6.3.
5.3 Concept Learning
[Class Expression Learning] Given a set of positive examples $E^+$ and a set of negative examples $E^-$, the learning problem is to find a class expression $C$ such that the F-measure, i.e., the harmonic mean of the precision and recall of $C$ with respect to $E^+$ and $E^-$,
$F_1(C) = \frac{2 \cdot \mathit{precision}(C) \cdot \mathit{recall}(C)}{\mathit{precision}(C) + \mathit{recall}(C)},$
is maximized. In this work, we are interested in finding such a concept expression by using refinement operators (lehmann2010concept). Note that the solution concept may not be unique; however, by the training data construction in Section 5.1, our proposed method favors short solutions.
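A minimal sketch of this objective in Python, where `retrieved` stands for the individuals a candidate concept covers (the helper names are ours):

```python
# F-measure of a candidate concept, given the set of individuals it
# covers (retrieved) and the positive/negative example sets.

def f_measure(retrieved, positives, negatives):
    tp = len(retrieved & positives)   # covered positives
    fp = len(retrieved & negatives)   # covered negatives
    fn = len(positives - retrieved)   # missed positives
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

pos, neg = {"a", "b", "c"}, {"d", "e"}
# Covers 2 of 3 positives and 1 negative: precision = recall = 2/3.
print(f_measure({"a", "b", "d"}, pos, neg))
```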
In CELOE-CLP, the concept length learner predicts the length of the target concept for each learning problem; this prediction is used as the maximum child length in the refinement operator. As a result, our concept learner only tests concepts whose length is at most the predicted length during the search process. Note that CELOE's refinement operator does not support a global setting of the maximum length of refinements. Hence, we use our own implementation of a length-based refinement operator in CELOE-CLP. Figure 1 illustrates CELOE-CLP's exploration strategy.
Refinements whose length exceeds the predicted length are ignored during the search. In the figure, the highlighted concept is of length 7 (which is greater than the predicted length of 5) and is therefore neither tested nor added to the search tree. During concept learning, we sample positive and negative examples from the considered learning problem so that their total number is $N$, as described in step 2 of Section 5.1. For learning problems with fewer than $N$ positive and negative examples in total, we up-sample the initial sets of examples.
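The pruning idea can be sketched as a best-first search that discards over-long refinements; the refinement operator and quality function below are toy stand-ins (sets instead of concepts), not CELOE-CLP's actual components:

```python
import heapq

# Toy sketch of length-bounded search: refinements longer than the
# predicted target length are neither scored nor added to the search tree.

def length_pruned_search(start, refine, length, quality, max_length, steps=100):
    """Best-first search over concepts, discarding every candidate whose
    length exceeds the predicted max_length."""
    heap = [(-quality(start), 0, start)]
    best, best_q, tick = start, quality(start), 0
    while heap and steps > 0:
        _, _, c = heapq.heappop(heap)
        steps -= 1
        for d in refine(c):
            if length(d) > max_length:
                continue  # pruned: longer than the predicted concept length
            q = quality(d)
            tick += 1  # unique tie-breaker for the heap
            if q > best_q:
                best, best_q = d, q
            heapq.heappush(heap, (-q, tick, d))
    return best, best_q

# Concepts are modeled as sets of names; quality is closeness to a target set.
names = {"A", "B", "C", "D"}
target = frozenset({"A", "B"})
refine = lambda c: [c | {n} for n in names - c]
quality = lambda c: -len(c ^ target)  # 0 means a perfect match
best, q = length_pruned_search(frozenset(), refine, len, quality, max_length=2)
print(sorted(best), q)  # → ['A', 'B'] 0
```

With `max_length=2`, no three-element set is ever scored or enqueued, which mirrors how the predicted length bounds CELOE-CLP's search tree.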
We carried out our experiments on four benchmark knowledge graphs: Carcinogenesis, Family Benchmark, Mutagenesis, and Semantic Bible (all available on the DL-Learner page), described in Table 3.
[Table 3: dataset statistics per knowledge graph — number of instances, concepts, object properties, data properties, and triples.]
The training of our concept length learners was carried out on a single 12GB memory NVIDIA K80 GPU with 24GB of RAM and an 8-core Intel Xeon E5-2695 with 2.30GHz and 16GB RAM. In contrast, we did not need a GPU to conduct our experiments on concept learning (with CELOE and CELOE-CLP). Hence, we only used the CPU.
6.3 Hyper-parameter Optimization
In our preliminary experiments on all four knowledge graphs, we used random search (bergstra2012random) to select fitting hyper-parameters (summarized in Table 4). Our experiments suggest that two layers for the recurrent neural networks (LSTM, GRU) are the best choice in terms of computation cost and classification accuracy. In addition, two linear layers, batch normalization, and dropout layers are used to increase performance. The CNN model consists of two convolution layers, two linear layers, two dropout layers, and a batch normalization layer. Finally, we chose four layers for the MLP model, also with batch normalization and dropout layers. The Rectified Linear Unit (ReLU) and Scaled Exponential Linear Unit (SELU) (klambauer2017self) activation functions are used in the intermediate layers of all models, whereas the Sigmoid function is used in the output layers.
We ran the experiments in a cross-validation setting. Table 4 gives an overview of the hyper-parameter settings on each of the four knowledge graphs considered. The number of epochs was set based on the training speed; for example, on the Carcinogenesis knowledge graph, length predictors are able to achieve state-of-the-art results with just 50 epochs (see Section 6.4 for details).
[Table: number of parameters and training time (s) per model and knowledge graph.]
The Adam optimizer (kingma2014adam) is used to train the length predictors. The number of examples $N$ and the embedding dimension $d$ were varied over a range of values; we finally chose the values that gave the best trade-off between classification accuracy and computation cost on the four data sets considered.
6.4 Results and Discussion
In Figures 2, 3, and 4, we show the training curves for the four model architectures. Due to space constraints, we do not show the training curves for the Family Benchmark knowledge graph; they can be found in our public GitHub repository (https://github.com/ConceptLengthLearner/ReproducibilityRepo).
In all figures, we can observe that the loss decreased during training on both the training and the validation data sets, which suggests that the models were able to learn. The Gated Recurrent Unit (GRU) model outperforms the other models on all knowledge graphs. The input to the MLP model is the average of the embeddings of the positive and negative examples for a concept, which we consider a poor input representation. As shown in Figures 2 to 4, this model underperformed compared to the other three architectures. We also assessed the element-wise multiplication of the embeddings and obtained similar results. However, as reflected in Table 6, all models outperform a random model.
[Table: average runtime (s) and average solution length per approach and knowledge graph.]
Table 6 compares our proposed neural network architectures and a random model on the Carcinogenesis, Family Benchmark, Mutagenesis, and Semantic Bible knowledge graphs. From the table, it appears that the recurrent neural network models (GRU, LSTM) outperform the other two models (CNN and MLP) on the two large and hierarchically rich knowledge graphs (Carcinogenesis, Mutagenesis); hierarchically rich in this context refers to the large number of axioms and declarations (declaration axioms, logical axioms, relations between instances, etc.) present in a knowledge graph. While the convolution model tends to overfit on all knowledge graphs, the MLP model is simply unable to extract relevant information from the averaged embeddings. On the Family Benchmark and Semantic Bible knowledge graphs, which have comparatively few declarations and axioms (witnessed by the number of triples in Table 3), none of our proposed networks could achieve state-of-the-art results. Hence, our learning approach is more suitable for rich (or large) knowledge graphs, which aligns well with our objective of speeding up concept learning on large knowledge graphs. Moreover, all our proposed models are clearly better than a random model, with the smallest average performance (F1 score) difference achieved by the MLP and the largest by the GRU.
Table 7 presents the comparison between CELOE and CELOE-CLP on a hundred learning problems per knowledge graph. In this evaluation, CELOE-CLP uses our best concept length predictor (GRU). The results in the table are reported as mean and standard deviation. Moreover, we used Wilcoxon's rank-sum test to check whether the difference in performance between the two concept learners is significant (marked in the table) or not. The null hypothesis for this test is that the two compared distributions are the same. From the results of the test, the differences in runtime and solution length were found to be significant in most cases. Hence, CELOE-CLP outperforms CELOE in terms of runtime and solution length (biased towards short solution concepts). Moreover, even when the differences are not significant, our approach tends to compute better solutions (higher F1 score). In our experiments, CELOE-CLP is on average faster than CELOE on the Carcinogenesis, Family Benchmark, Mutagenesis, and Semantic Bible knowledge graphs. The full comparison results, including the learning problems that we considered as well as the computed solutions, can be found in our GitHub repository: https://github.com/ConceptLengthLearner/ReproducibilityRepo.
7 Conclusion and Future Work
We investigated the prediction of concept lengths in the description logic $\mathcal{ALC}$ to speed up the concept learning process using refinement operators. To this end, four neural network architectures were evaluated on four benchmark knowledge graphs: Carcinogenesis, Family Benchmark, Mutagenesis, and Semantic Bible. The evaluation results suggest that all of our proposed models are superior to a random model, with the recurrent neural networks (GRU and LSTM) performing best at this task. We showed that integrating our concept length predictors into a concept learner can reduce the search space and improve the runtime while preserving the quality (F-measure) of the solution concepts.
Even though our proposed learning approach was very efficient when dealing with concepts of length up to 11, its behavior is not guaranteed when longer concepts are considered. Moreover, the use of generic embedding techniques might lead to sub-optimal results. Hence, we plan to investigate learning embeddings of a given knowledge graph jointly with the lengths of its complex (long) class expressions.