1 Introduction
Over the last few years, knowledge graphs have become indispensable in a large number of data-driven applications (hogan2021knowledge). For instance, many companies followed the example of the Google Knowledge Graph (singhal2012introducing), including LinkedIn (he2016building), Microsoft (shrivastava2017bring), eBay (pittman2017cracking), Amazon (krishnan2018making), AirBnB (chang2018scaling), and Uber (hamad2018food). Although there are different definitions and categories of knowledge graphs (hogan2020knowledge), knowledge graphs are valuable for many tasks, including information retrieval in search engines and structured machine learning in fixed domains of discourse. However, such real-world knowledge graphs often contain millions of entities and concepts (nodes) with billions of assertions (edges), thus making tasks such as concept learning difficult to solve.
In the previous decade, lehmann2010concept
investigated concept learning using refinement operators. Their results showed improvements in accuracy over previous approaches from Inductive Logic Programs (ILP) on different data sets, suggesting that it is an interesting research direction to explore. A similar work was carried out by
badea2000refinement in the description logic $\mathcal{ALER}$. The authors applied the learning procedure of ILP in Horn clauses (https://en.wikipedia.org/wiki/Horn_clause, accessed on 23.06.2021) to description logics. Although learning in description logics using refinement operators showed several advantages over other methods, it still suffers from scalability issues. On large knowledge graphs, concept learning as described in (lehmann2010concept) and (badea2000refinement) can be very slow, as the refinement tree can grow very large. This scalability issue is pointed out in (rizzo2020class)
, where the authors proposed to use metaheuristics to tackle the issue. In this paper, we aim at
speeding up the concept learning process by predicting the length of the target concept in advance so that the search space can be reduced. To the best of our knowledge, no similar work has been carried out before, and our findings will serve as a foundation for more investigations in this direction. We evaluate our approach using CELOE. In a nutshell, our contributions are:
Design of different neural network architectures for learning concept lengths.

Design and implementation of an expressive length-based refinement operator for the generation of training data.

Integration of trained concept length predictors into the CELOE algorithm, resulting in a new algorithm which we call CELOE-CLP.
The rest of the paper is organized as follows: In Section 2, we present the notations and terminologies needed in the following sections. In Section 3, we provide background on description logics, knowledge graph embedding techniques, as well as refinement operators for description logics. In particular, we focus on the description logic $\mathcal{ALC}$. Section 4 presents previous work on concept learning using refinement operators. In Section 5, we describe our method for learning concept lengths, and we present the results obtained on different knowledge graphs in Section 6. Section 7 draws conclusions from our findings and introduces new directions for future work.
2 Terminology and Notation
In this section, we present some special notations and terminologies used throughout the paper. $\mathcal{K}$ denotes a knowledge graph and $N_I$ the set of all individuals in $\mathcal{K}$. $|\cdot|$ is the cardinality function, that is, a function that takes a set and returns the number of elements in the set. The terms “concept” and “class expression” are used interchangeably. If $C$ is a class expression, then $Pos(C)$ and $Neg(C)$ are the sets of positive and negative examples of $C$, respectively.
Note that this section only presents notations that are more common to different sections. Any other notation will be explicitly defined in the section where it is used.
3 Background
3.1 Description Logics
Description logics (baader2003description) are a family of languages for knowledge representation. They are fundamental to the Web Ontology Language OWL, which is often used for the formal representation of RDF knowledge graphs. While description logics can be combined with other techniques to yield more powerful formalisms (rosati2006dl+; rudolph2007relational), we focus on the description logic $\mathcal{ALC}$ (Attributive Language with Complement). It is the smallest propositionally closed description logic. Its basic components are concept names (e.g., Teacher, Human), role names (e.g., hasChild, bornIn) and individuals (e.g., Mike, Jack). Table 1 introduces the syntax of $\mathcal{ALC}$ and its semantics. In the table, $\mathcal{I}$ is an interpretation and $\Delta^{\mathcal{I}}$ its domain; see (lehmann2010concept) for more details.
Construct  Syntax  Semantics

Atomic concept  $A$  $A^{\mathcal{I}} \subseteq \Delta^{\mathcal{I}}$
Atomic role  $r$  $r^{\mathcal{I}} \subseteq \Delta^{\mathcal{I}} \times \Delta^{\mathcal{I}}$
Top concept  $\top$  $\Delta^{\mathcal{I}}$
Bottom concept  $\bot$  $\emptyset$
Conjunction  $C \sqcap D$  $C^{\mathcal{I}} \cap D^{\mathcal{I}}$
Disjunction  $C \sqcup D$  $C^{\mathcal{I}} \cup D^{\mathcal{I}}$
Negation  $\neg C$  $\Delta^{\mathcal{I}} \setminus C^{\mathcal{I}}$
Existential restriction  $\exists r.C$  $\{x \mid \exists y, (x, y) \in r^{\mathcal{I}} \wedge y \in C^{\mathcal{I}}\}$
Universal restriction  $\forall r.C$  $\{x \mid \forall y, (x, y) \in r^{\mathcal{I}} \Rightarrow y \in C^{\mathcal{I}}\}$
In $\mathcal{ALC}$, (syntactic) concept lengths are defined recursively (lehmann2010concept):

$length(A) = length(\top) = length(\bot) = 1$, for all atomic concepts $A$,

$length(\neg C) = 1 + length(C)$, for all concepts $C$,

$length(\exists r.C) = length(\forall r.C) = 2 + length(C)$, for all concepts $C$,

$length(C \sqcap D) = length(C \sqcup D) = 1 + length(C) + length(D)$, for all concepts $C$ and $D$.
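The recursive length definition above can be sketched directly in code. The following is a minimal illustration (the tuple-based concept representation and all names are ours, not the paper's):

```python
def concept_length(c):
    """Syntactic length of an ALC concept expression.

    Atomic concepts, 'TOP' and 'BOTTOM' are plain strings (length 1);
    composite concepts are tuples tagged with their constructor.
    """
    if isinstance(c, str):          # atomic concept, top or bottom
        return 1
    op = c[0]
    if op == "not":                 # length(¬C) = 1 + length(C)
        return 1 + concept_length(c[1])
    if op in ("exists", "forall"):  # length(∃r.C) = length(∀r.C) = 2 + length(C)
        return 2 + concept_length(c[2])
    if op in ("and", "or"):         # length(C ⊓ D) = 1 + length(C) + length(D)
        return 1 + concept_length(c[1]) + concept_length(c[2])
    raise ValueError(f"unknown constructor: {op}")

# ∃hasChild.(Male ⊔ ¬Teacher) has length 2 + (1 + 1 + (1 + 1)) = 6
example = ("exists", "hasChild", ("or", "Male", ("not", "Teacher")))
```

For instance, `concept_length(example)` evaluates to 6, matching the recursive rules above.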
3.2 Refinement Operators
[Refinement Operator] A quasi-ordering is a reflexive and transitive relation. Let $(S, \preceq)$ be a quasi-ordered space. A downward (upward) refinement operator on $(S, \preceq)$ is a mapping $\rho: S \rightarrow 2^S$ such that for all $C \in S$, $D \in \rho(C)$ implies $D \preceq C$ ($C \preceq D$, respectively). As an example, let $\mathcal{K}$ be a knowledge graph.
Assume the sets of concept names and role names in $\mathcal{K}$ are given by $N_C$ and $N_R$, respectively.
Let $\mathcal{C}$ be the set of all concept expressions (rudolph2011foundations) that can be constructed from $N_C$ and $N_R$ (note that $\mathcal{C}$ is infinite and every concept name is a concept expression). One can define a mapping $\rho$ on $\mathcal{C}$ that maps every concept expression to a set of more specific concept expressions; such a $\rho$ is clearly a downward refinement operator.
Such refinements can be justified using the semantics of the constructors in Table 1. Refinement operators can have a number of important properties which we do not discuss in this paper (for further details, we refer the reader to lehmann2010concept). In the context of concept learning, these properties can be exploited to optimize the traversal of the space of concepts in search of a specific concept.
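The notion of a downward refinement operator can be illustrated with a toy sketch. The operator below maps each concept to its conjunctions with concept names; since every $C \sqcap A$ is subsumed by $C$, it is downward with respect to subsumption. The concept names are hypothetical and not taken from the paper's example:

```python
# Hypothetical set of concept names for illustration only.
CONCEPT_NAMES = {"Male", "Female", "Teacher"}

def rho(concept):
    """Toy downward refinement operator: rho(C) = {C ⊓ A | A a concept name}.

    Every refinement C ⊓ A denotes a subset of C's instances, i.e.
    C ⊓ A ⊑ C, so rho is downward w.r.t. the subsumption quasi-ordering.
    """
    return {("and", concept, name) for name in CONCEPT_NAMES}

refinements = rho("Male")
# contains e.g. ("and", "Male", "Teacher"), denoting Male ⊓ Teacher ⊑ Male
```

Real operators, such as the one used by lehmann2010concept, refine along many more axes (disjunction, negation, role restrictions), but the subsumption-preserving contract is the same.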
3.3 Knowledge Graph Embeddings
A knowledge graph embedding function maps knowledge graphs to continuous vector spaces to facilitate downstream tasks such as link prediction, knowledge graph completion and concept learning in description logics
(wang2017knowledge). Approaches for embedding knowledge graphs can be subdivided into two categories: the first category only uses facts in the knowledge graph (nickel2012factorizing; weston2013connecting; bordes2014semantic), while the second also takes into account additional information about entities and relations, such as textual descriptions (xie2016representation; wang2016text). Both categories usually initialize each entity and relation with a random vector, matrix, or tensor, and then define a scoring function that learns embeddings such that facts observed in the knowledge graph receive high scores, while non-observed facts are assigned lower scores (unless there is a good reason for them to be given a high score, for instance, when the fact is supposed to hold but is not observed in the knowledge graph, or the fact is a logical implication of the learned patterns). We refer the reader to the survey
wang2017knowledge for more details. In this work, we use the Convolutional Complex Embedding Model (ConEx) by demir2020convolutional, which has been shown to achieve state-of-the-art results with fewer trainable parameters.
4 Related Work
The rich syntax of description logics and other W3C (https://www.w3.org/) standards (hitzler2009foundations) paved the way for solving challenging tasks such as link prediction, concept learning, and knowledge graph completion. lehmann2010concept investigated concept learning using refinement operators. In their work, they studied different combinations of properties that a refinement operator can have and designed their own refinement operator. The latter was used to learn concept expressions in many knowledge graphs, including Carcinogenesis, Poker, Forte, and Moral (see https://github.com/SmartDataAnalytics/DLLearner/releases for datasets). Their approach proved to be competitive in accuracy with (and on some examples, superior to) the state of the art, mainly inductive logic programs. A similar work was already carried out by Badea and Nienhuys-Cheng (badea2000refinement) in the description logic $\mathcal{ALER}$. The authors applied the learning procedure of ILP in Horn clauses to description logics. They built a refinement operator for learning in $\mathcal{ALER}$ which turned out to have an advantage: it avoids the difficulties related to the computation of the most specific concepts (MSC) (baader1998least) faced by the approach in (cohen1994learning). The evaluation of their approach on real ontologies from different domains showed promising results, but it had the disadvantage of depending on the instance data.
lehmann2011class proposed an algorithm for learning concepts (CELOE), with a focus on ontology engineering, namely the extension of OWL ontologies. The authors argue that machine-learned equivalent class suggestions have two main advantages over human expert suggestions: 1.) the suggested class expressions fit the instance data, and 2.) it is easier to understand an OWL class expression than to understand the structure of an ontology and manually generate a class expression. The evaluation of their approach on real ontologies showed promising results. Its disadvantage is mainly its dependency on the availability of instance data and on the quality of the ontology.
rizzo2018framework proposed DL-Focl, a modified version of DL-Foil (fanizzi2008dl) for concept learning, which employs metaheuristics to help reduce the search space. The algorithm computes a generalization as a disjunction of partial descriptions, each covering a part of the positive examples and ruling out as many negative and uncertain-membership examples as possible. Each partial description is generated by selecting refinements that cover at least one positive example and score a certain minimum gain. The first release of DL-Focl (rizzo2018framework) was essentially based on omission rates: to check whether further iterations are required, it compares the scores of the current concept definition and of the best concept obtained at that stage. Its later versions (rizzo2020class) arrived with two more tricks, which are used during the search process: one version employs a lookahead strategy by assessing the quality of the next possible refinements of the current partial description, while the other attempts to solve the myopia problem of DL-Foil by introducing a local memory, used to avoid reconsidering suboptimal choices previously made.
5 Method
In this section, we address the problem of predicting the correct concept length for a given learning problem as a means to improve the runtime of concept learning approaches based on refinement operators. Besides, we use the Closed World Assumption (CWA) on the negative examples of each concept: every individual in the knowledge graph that does not belong to a concept expression is a negative example for that concept.
5.1 Training Data Construction for Length Prediction
Given a knowledge graph $\mathcal{K}$, we compute the transitive closure of the rdf:type statements and add the resulting statements to $\mathcal{K}$. This technique led to better embeddings (we used ConEx (demir2020convolutional) to compute the embeddings) in our preliminary experiments. Next, we proceed as follows:

Generate class expressions of various lengths using a length-based refinement operator (length-based refinement operators generally restrict the length of the child nodes to be generated so that it does not exceed and/or fall below some predefined integer values). In this process, short concepts are preferred over long concepts, i.e., when two concepts have the same set of positive examples, the longer concept is left out.

Get the positive and negative examples for each generated class expression $C$, using the closed-world semantics described above. We then define a hyperparameter $N$ that represents the total number of positive and negative examples we want to use; choosing to use all positive and all negative examples might lead to scalability issues. The sampling of positive and negative examples is done as follows:

If $|Pos(C)| \ge N/2$ and $|Neg(C)| \ge N/2$, then we randomly sample $N/2$ individuals from each of the two sets $Pos(C)$ and $Neg(C)$.

Otherwise, we take all individuals in the minority set and sample the remaining number of individuals from the other set.

On the vector representations of entities (individuals), we create an extra dimension at the end of the entries, where we insert 1 for positive examples and 0 for negative examples. In other words, we define an injective function $f_C$ for each target concept $C$:

$$f_C: \mathbb{R}^d \rightarrow \mathbb{R}^{d+1}, \quad f_C(x_e) = \begin{cases} [x_e, 1] & \text{if } e \in Pos(C), \\ [x_e, 0] & \text{if } e \in Neg(C), \end{cases} \qquad (1)$$
where $d$ is the dimension of the embedding space and $e$ is the entity whose embedding is $x_e$. Intuitively, $f_C$ is a categorization function that facilitates the distinction between positive and negative examples without affecting the quality of the embeddings. Thus, a data point in the training, validation, and test data sets is a tuple $(M_C, length(C))$, where $M_C$ is a matrix of shape $(N, d+1)$. Let $n^{+}$ and $n^{-}$ be the number of positive and negative examples for $C$, respectively. Assume the embedding vectors of the positive examples are $p_1, \dots, p_{n^{+}}$ and those of the negative examples are $q_1, \dots, q_{n^{-}}$. Then, the $i$-th row of $M_C$ is given by:

$$M_C[i] = \begin{cases} f_C(p_i) & \text{if } i \le n^{+}, \\ f_C(q_{i-n^{+}}) & \text{otherwise.} \end{cases} \qquad (2)$$
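The sampling and labeling steps above can be sketched as follows. This is a minimal illustration with names of our own choosing; in particular, the label values 1 (positive) and 0 (negative) in the extra dimension are an assumption of this sketch:

```python
import random
import numpy as np

def sample_examples(pos, neg, n):
    """Sample up to n/2 positives and n/2 negatives; when one set is too
    small, keep it entirely and top up from the other set."""
    half = n // 2
    if len(pos) >= half and len(neg) >= half:
        return random.sample(pos, half), random.sample(neg, half)
    if len(pos) < len(neg):   # positives are the minority set: keep them all
        return list(pos), random.sample(neg, min(n - len(pos), len(neg)))
    return random.sample(pos, min(n - len(neg), len(pos))), list(neg)

def build_matrix(pos_emb, neg_emb):
    """Stack embeddings and append the extra label dimension (assumed 1/0)."""
    rows = [np.append(x, 1.0) for x in pos_emb]   # positives get label 1
    rows += [np.append(x, 0.0) for x in neg_emb]  # negatives get label 0
    return np.stack(rows)                         # shape (N, d + 1)
```

One such matrix, paired with the target concept's length, forms a single training point.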
We view the prediction of concept lengths as a classification problem with classes $1, \dots, L$, where $L$ is the length of the longest concept in the training data set. As shown in Table 2, the concept length distribution is imbalanced. To prevent concept length predictors from overfitting on the majority classes, we used the weighted cross-entropy loss:
$$\mathcal{L}(P, y) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{l=1}^{L} w_l \, \mathbb{1}(y_i = l) \log P_{il},$$

where $m$ is the batch size, $P \in \mathbb{R}^{m \times L}$ is the batch matrix of predicted probabilities (or scores), $y$ is the batch vector of targets, $w$ is a weight vector, and $\mathbb{1}$ is the indicator function ($\mathbb{1}(\text{true}) = 1$ and $\mathbb{1}(\text{false}) = 0$). The weight vector is defined by $w_l \propto 1/c_l$, where $c_l$ is the number of class expressions of length $l$ in the training data set. Table 2 provides details on the training, validation, and test data sets for each of the four knowledge graphs.
Concept  Carcinogenesis  Family Benchmark  Mutagenesis  Semantic Bible  

length  train  val  test  train  val  test  train  val  test  train  val  test 
3  3647  405  1013  84  10  23  1038  115  288  487  54  135 
5  782  87  217  570  63  158  1156  129  321  546  61  152 
6  88  10  24  0  0  0  0  0  0  162  18  45 
7  1143  127  318  997  111  277  1310  146  364  104  12  29 
9  0  0  0  1960  218  547  0  0  0  73  8  21 
11  0  0  0  0  0  0  0  0  0  41  5  11 
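A weighted cross-entropy of the kind described in Section 5.1 can be sketched in plain NumPy. The inverse-class-frequency weighting used here is one common choice and an assumption of this sketch, as are all function names:

```python
import numpy as np

def class_weights(counts):
    """Inverse-frequency weights: w_l ∝ 1 / c_l, where c_l counts the
    training concepts of length l (zero-count classes get weight 0)."""
    counts = np.asarray(counts, dtype=float)
    return np.where(counts > 0, 1.0 / np.maximum(counts, 1.0), 0.0)

def weighted_cross_entropy(P, y, w):
    """P: (m, L) predicted probabilities; y: targets in {1, ..., L};
    w: per-class weights. Returns the mean weighted negative log-likelihood."""
    m = P.shape[0]
    p_target = P[np.arange(m), y - 1]   # each row's probability for its target
    return -np.mean(w[y - 1] * np.log(p_target))
```

Rare lengths thus contribute more to the loss, counteracting the imbalance visible in Table 2.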
5.2 Concept Length Predictors
We considered four neural network architectures: Long Short Term Memory (LSTM)
(hochreiter1997long), Gated Recurrent Unit (GRU)
(cho2014learning), MultiLayer Perceptron (MLP), and Convolutional Neural Network (CNN). The implementation details as well as hyperparameter setting for each of the networks are given in Section
6.3.
5.3 Concept Learning
[Class Expression Learning] Given a set of positive examples $P$ and a set of negative examples $N$, the learning problem is to find a class expression $C$ such that the F-measure

$$F(C) = \frac{2 \cdot precision(C) \cdot recall(C)}{precision(C) + recall(C)} \qquad (3)$$

is maximized, where precision and recall are computed over the individuals covered by $C$ with respect to $P$ and $N$. In this work, we are interested in finding such a concept expression by using refinement operators (lehmann2010concept). Note that the solution concept may not be unique; however, by the training data construction in Section 5.1, our proposed method favors short solutions.
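The F-measure objective can be computed directly from the set of individuals a candidate concept covers. A minimal sketch (names are ours):

```python
def f_measure(retrieved, positives):
    """F-measure of a candidate concept: `retrieved` is the set of
    individuals the concept covers, `positives` the positive examples."""
    tp = len(retrieved & positives)   # true positives
    if tp == 0:
        return 0.0
    precision = tp / len(retrieved)
    recall = tp / len(positives)
    return 2 * precision * recall / (precision + recall)

# A concept covering {a, b, x} against positives {a, b, c}:
# precision = 2/3, recall = 2/3, so F = 2/3.
```

A concept learner scores each candidate this way and keeps the best-scoring expression found so far.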
5.4 CeloeClp
In CELOE-CLP, the concept length learner predicts the length of the target concept for each learning problem, and this length is used as the maximum child length in the refinement operator. As a result, our concept learner only tests concepts of length at most the predicted length during the search process. Note that CELOE's refinement operator does not support a global setting of the maximum length of refinements. Hence, we used our own length-based refinement operator in CELOE-CLP. Figure 1 illustrates the CELOE-CLP exploration strategy.
Refinements whose length exceeds the predicted length are ignored during the search. In the figure, one refinement is of length 7, which is greater than the predicted length of 5, and is therefore neither tested nor added to the search tree. During concept learning, we sample positive and negative examples from the considered learning problem so that their total number is $N$, as described in Section 5.1. For learning problems where the total number of positive and negative examples is less than $N$, we upsample the initial sets of examples.
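The pruning rule at the heart of CELOE-CLP amounts to a length filter applied before candidates are scored. A minimal sketch, treating the refinement operator and length function as black boxes supplied by the caller (the function name is ours):

```python
def prune_refinements(concept, rho, concept_length, predicted_length):
    """Keep only refinements whose length does not exceed the predicted
    target length; longer candidates are never scored or expanded."""
    return [d for d in rho(concept) if concept_length(d) <= predicted_length]
```

Because discarded refinements are never added to the search tree, the search space shrinks to concepts of at most the predicted length.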
6 Evaluation
6.1 Datasets
We carried out our experiments on four benchmark knowledge graphs: Carcinogenesis, Family Benchmark, Mutagenesis, and Semantic Bible (available on the DL-Learner page), described in Table 3.
Dataset  #instances  #concepts  #obj. properties  #data properties  # triples 

Carcinogenesis  22,372  142  4  15  193,878 
Family Benchmark  202  18  4  0  4,064 
Mutagenesis  14,145  86  5  6  124,134 
Semantic Bible  724  48  29  9  9,092 
6.2 Hardware
The training of our concept length learners was carried out on a single 12GB NVIDIA K80 GPU on a machine with 24GB of RAM, and on an 8-core Intel Xeon E5-2695 at 2.30GHz with 16GB of RAM. In contrast, we did not need a GPU for our experiments on concept learning (with CELOE and CELOE-CLP); hence, we only used the CPU.
6.3 Hyperparameter Optimization
In our preliminary experiments on all four knowledge graphs, we used a random search (bergstra2012random) to select fitting hyperparameters (as summarized in Table 4
). Our experiments suggest that choosing two layers for the recurrent neural networks (LSTM, GRU) is the best choice in terms of computational cost and classification accuracy. In addition, two linear layers, batch normalization and dropout layers are used to increase performance. The CNN model is made of two convolution layers, two linear layers, two dropout layers, and a batch normalization layer. Finally, we chose 4 layers for the MLP model, also with batch normalization and dropout layers. The Rectified Linear Unit (ReLU) and the Scaled Exponential Linear Unit (SELU)
(klambauer2017self) activation functions are used in the intermediate layers of all models, whereas the Sigmoid function is used in the output layers. We ran the experiments in a cross-validation setting. Table 4
gives an overview of the hyperparameter settings on each of the four knowledge graphs considered. The number of epochs was set based on the training speed; for example, on the Carcinogenesis knowledge graph, length predictors are able to achieve state-of-the-art results with just 50 epochs (see Section
6.4 for details).

Dataset  #epochs  lr  d  batch size  N

Carcinogenesis  50  0.003  40  512  1,000 
Family Benchmark  150  0.003  40  256  101 
Mutagenesis  100  0.003  40  512  1,000 
Semantic Bible  200  0.003  40  256  362 
Carcinogenesis  Family Benchmark  
#parameters  train. time (s)  #parameters  train. time (s)  
LSTM  160,208  188.42  160,610  105.70 
GRU  125,708  191.16  126,110  102.85 
CNN  838,968  16.77  536,992  20.82 
MLP  61,681  10.04  61,807  16.32 
Mutagenesis  Semantic Bible  
#parameters  train. time (s)  #parameters  train. time (s)  
LSTM  160,208  228.13  161,012  196.20 
GRU  125,708  228.68  126,512  197.86 
CNN  838,248  44.74  96,684  18.43 
MLP  61,681  14.29  61,933  9.56 
The Adam optimizer (kingma2014adam) is used to train the length predictors. We varied the number of examples N and the embedding dimension d in preliminary experiments, and finally chose the values reported in Table 4 as the best trade-off between classification accuracy and computational cost on the four data sets considered.
6.4 Results and Discussion
In Figure 2, Figure 3 and Figure 4, we show the training curves for the four model architectures. Due to space constraints, we do not show the training curves for the Family Benchmark knowledge graph; however, they can be found in our public GitHub repository (https://github.com/ConceptLengthLearner/ReproducibilityRepo).
In all figures, we can observe that the loss decreased during training on both the training and the validation data sets, which suggests that the models were able to learn. The Gated Recurrent Unit (GRU) model outperforms the other models on all knowledge graphs. The input to the MLP model is the average of the embeddings of the positive and negative examples of a concept, which discards much of the available information. As shown in Figures 2 to 4, this model underperformed compared to the other three architectures. We also assessed the element-wise multiplication of the embeddings and obtained similar results. However, as reflected in Table 6, all models outperform a random model.
Carcinogenesis  Family Benchmark  

LSTM  GRU  CNN  MLP  RM  LSTM  GRU  CNN  MLP  RM  
Train. Accuracy  
Val. Accuracy  
Test Accuracy  
Test F1  
Mutagenesis  Semantic Bible  
LSTM  GRU  CNN  MLP  RM  LSTM  GRU  CNN  MLP  RM  
Train. Accuracy  
Val. Accuracy  
Test Accuracy  
Test F1 
Carcinogenesis  Family Benchmark  

CELOE  CELOECLP  CELOE  CELOECLP  
Avg. F1  
Avg. Runtime (s)  
Avg. Solution Length  
Mutagenesis  Semantic Bible  
CELOE  CELOECLP  CELOE  CELOECLP  
Avg. F1  
Avg. Runtime (s)  
Avg. Solution Length 
Table 6 compares our proposed neural network architectures and a random model on the Carcinogenesis, Family Benchmark, Mutagenesis and Semantic Bible knowledge graphs. From the table, it appears that the recurrent neural network models (GRU, LSTM) outperform the other two models (CNN and MLP) on the two large and hierarchically rich knowledge graphs (Carcinogenesis, Mutagenesis); hierarchically rich in this context refers to the large number of axioms and declarations (declaration axioms, logical axioms, relations between instances, etc.) present in a knowledge graph. While the convolutional model tends to overfit on all knowledge graphs, the MLP model is simply unable to extract relevant information from the averaged embeddings. On the Family Benchmark and Semantic Bible knowledge graphs, which appear to have few declarations and axioms (as witnessed by the number of triples in Table 3), all our proposed networks could not achieve state-of-the-art results. Hence, our learning approach is more suitable for rich (or large) knowledge graphs, which aligns well with our objective of speeding up concept learning on large knowledge graphs. Besides, all our proposed models are clearly better than a random model, with the smallest average performance (F1 score) gain achieved by the MLP and the largest by the GRU.
Table 7 presents the comparison between CELOE and CELOE-CLP on a hundred learning problems per knowledge graph. In this evaluation, CELOE-CLP uses our best concept length predictor (GRU). The results in the table are averaged over the learning problems. Moreover, we used Wilcoxon's rank-sum test to check whether the difference in performance between the two concept learners is significant. The null hypothesis for this test is: "the two distributions that we compare are the same". From the results of the test, the differences in runtime and solution length were found to be significant in most cases. Hence, CELOE-CLP outperforms CELOE in terms of runtime and solution length (it is biased towards short solution concepts). Moreover, even where the differences are not significant, our approach tends to compute better solutions (higher F1 score). In our experiments, CELOE-CLP is on average faster than CELOE on the Carcinogenesis, Family Benchmark, Mutagenesis, and Semantic Bible knowledge graphs. The full comparison results, including the learning problems that we considered as well as the computed solutions, can be found in our GitHub repository: https://github.com/ConceptLengthLearner/ReproducibilityRepo.
7 Conclusion and Future Work
We investigated the prediction of concept lengths in the description logic $\mathcal{ALC}$ to speed up the concept learning process using refinement operators. To this end, four neural network architectures were evaluated on four benchmark knowledge graphs: Carcinogenesis, Family Benchmark, Mutagenesis, and Semantic Bible. The evaluation results suggest that all of our proposed models are superior to a random model, with recurrent neural networks (GRU and LSTM) performing best at this task. We showed that integrating our concept length predictors into a concept learner can reduce the search space and improve the runtime while preserving the quality (F-measure) of solution concepts.
Even though our proposed learning approach was efficient when dealing with concepts of length up to 11 (in $\mathcal{ALC}$), its behavior is not guaranteed when longer concepts are considered. Moreover, the use of generic embedding techniques might lead to suboptimal results. Hence, we plan to consider embedding a given knowledge graph while learning the lengths of its complex (long) class expressions.