Dividing the Ontology Alignment Task with Semantic Embeddings and Logic-based Modules

02/25/2020 ∙ by Ernesto Jiménez-Ruiz, et al. ∙ City, University of London

Large ontologies still pose serious challenges to state-of-the-art ontology alignment systems. In this paper we present an approach that combines a neural embedding model and logic-based modules to accurately divide an input ontology matching task into smaller and more tractable matching (sub)tasks. We have conducted a comprehensive evaluation using the datasets of the Ontology Alignment Evaluation Initiative. The results are encouraging and suggest that the proposed method is adequate in practice and can be integrated within the workflow of systems unable to cope with very large ontologies.


1 Introduction

The problem of (semi-)automatically computing an alignment between independently developed ontologies has been extensively studied in recent years. As a result, a number of sophisticated ontology alignment systems currently exist [DBLP:journals/tkde/ShvaikoE13, DBLP:books/daglib/0032976] (ontology matching surveys and approaches: http://ontologymatching.org/). The Ontology Alignment Evaluation Initiative (OAEI) [DBLP:conf/semweb/AlgergawyCFFFHH18, DBLP:conf/semweb/AlgergawyFFFHHJ19] (evaluation campaigns: http://oaei.ontologymatching.org/) has played a key role in the benchmarking of these systems by facilitating their comparison on the same basis and the reproducibility of the results. The OAEI includes different tracks organised by different research groups. Each track contains one or more matching tasks involving small (e.g., conference), medium (e.g., anatomy), large (e.g., phenotype) or very large (e.g., largebio) ontologies. Some tracks only involve matching at the terminological level (e.g., concepts and properties), while other tracks also expect an alignment at the assertional level (i.e., instance data).

Large ontologies still pose serious challenges to ontology alignment systems. For example, several systems participating in the largebio track (http://www.cs.ox.ac.uk/isg/projects/SEALS/oaei/) were unable to complete the largest tasks during the latest OAEI campaigns. These systems typically use advanced alignment methods and are able to cope with small and medium-sized ontologies with competitive results, but fail to complete large tasks within a given time frame or with the available resources (e.g., memory).

There have been several efforts in the literature to divide the ontology alignment task (e.g., [Hamdi:2010, hu:2008]). These approaches, however, have not been successfully evaluated with very large ontologies, failing to scale or producing partitions of the ontologies that lead to information loss [pereira:2017]. In this paper we propose a novel method to accurately divide the matching task into several independent, smaller and manageable (sub)tasks, so as to scale systems that cannot cope with very large ontologies. (A preliminary version of this work was published on arXiv [arxiv18_division] and in the Ontology Matching workshop [om18_division].) Unlike state-of-the-art approaches, our method: (i) preserves the coverage of the relevant ontology alignments while keeping manageable matching subtasks; (ii) provides a formal notion of matching subtask and semantic context; (iii) uses neural embeddings to compute an accurate division by learning semantic similarities between words and ontology entities according to the ontology alignment task at hand; (iv) computes self-contained (logical) modules to guarantee the inclusion of the (semantically) relevant information required by an alignment system; and (v) has been successfully evaluated with very large ontologies.

2 Preliminaries

A mapping (also called match) between entities of two ontologies $O_1$ and $O_2$ is typically represented as a 4-tuple $\langle e_1, e_2, r, c \rangle$, where $e_1$ and $e_2$ are entities of $O_1$ and $O_2$, respectively; $r$ is a semantic relation, typically one of $\{\sqsubseteq, \sqsupseteq, \equiv\}$; and $c$ is a confidence value, usually a real number within the interval $[0, 1]$. (In this work we accept any input ontology in the OWL 2 language [OWL2]; we refer to (OWL 2) concepts, properties and individuals as entities.) For simplicity, we refer to a mapping as a pair $\langle e_1, e_2 \rangle$. An ontology alignment $M$ is a set of mappings between two ontologies $O_1$ and $O_2$.

An ontology matching task $MT$ is composed of a pair of ontologies $O_1$ (typically called the source) and $O_2$ (typically called the target) and possibly an associated reference alignment $M^{RA}$. The objective of a matching task is to discover an overlapping of $O_1$ and $O_2$ in the form of an alignment $M$. The size or search space of a matching task is typically bound to the size of the Cartesian product between the entities of the input ontologies: $|Sig(O_1)| \times |Sig(O_2)|$, where $Sig(O)$ denotes the signature (i.e., the entities) of $O$ and $|\cdot|$ denotes the size of a set.

An ontology matching system is a program that, given as input a matching task $MT = \langle O_1, O_2 \rangle$, generates an ontology alignment $M$. (Typically automatically, although there are systems that also allow human interaction [ker2019].) The standard evaluation measures for an alignment $M$ are precision (P), recall (R) and f-measure (F), computed against a reference alignment $M^{RA}$ as follows:

$P = \frac{|M \cap M^{RA}|}{|M|}, \quad R = \frac{|M \cap M^{RA}|}{|M^{RA}|}, \quad F = 2 \cdot \frac{P \cdot R}{P + R} \qquad (1)$
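For illustration, the following minimal sketch (our own, not the authors' implementation) computes these measures over alignments represented as sets of entity pairs:

def evaluate(M, M_RA):
    # M and M_RA are sets of (e1, e2) pairs; M_RA is the reference alignment.
    tp = len(M & M_RA)
    P = tp / len(M) if M else 0.0
    R = tp / len(M_RA) if M_RA else 0.0
    F = 2 * P * R / (P + R) if (P + R) > 0 else 0.0
    return P, R, F

# Toy example with made-up entity names.
M = {("fma:Heart", "nci:Heart"), ("fma:Lung", "nci:Liver")}
M_RA = {("fma:Heart", "nci:Heart"), ("fma:Lung", "nci:Lung")}
print(evaluate(M, M_RA))  # (0.5, 0.5, 0.5)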

Figure 1: Pipeline to divide a given matching task $MT$.

2.1 Problem definition and quality measures

We denote the division of an ontology matching task $MT$, composed of the ontologies $O_1$ and $O_2$, as the process of finding $n$ matching subtasks $MT_i = \langle O_1^i, O_2^i \rangle$ (with $i = 1, \ldots, n$), where $O_1^i \subseteq O_1$ and $O_2^i \subseteq O_2$.

Size of the division. The size of each matching subtask is smaller than the original task and thus reduces the search space. Let $D_n^{MT} = \{MT_1, \ldots, MT_n\}$ be the division of a matching task $MT$ into $n$ subtasks. The size ratios of the subtasks $MT_i$ and of the division $D_n^{MT}$ with respect to the original matching task size are computed as follows:

$\text{SizeRatio}(MT_i, MT) = \frac{|Sig(O_1^i)| \cdot |Sig(O_2^i)|}{|Sig(O_1)| \cdot |Sig(O_2)|} \qquad (2)$
$\text{SizeRatio}(D_n^{MT}, MT) = \sum_{i=1}^{n} \text{SizeRatio}(MT_i, MT) \qquad (3)$

The ratio $\text{SizeRatio}(MT_i, MT)$ is less than $1.0$, while the aggregation $\text{SizeRatio}(D_n^{MT}, MT)$, with $n$ being the number of matching subtasks, can be greater than $1.0$, as the matching subtasks depend on the division technique and may overlap.

Alignment coverage. The division of the matching task aims at preserving the target outcomes of the original matching task. The coverage is calculated with respect to a relevant alignment $M$, possibly the reference alignment $M^{RA}$ of the matching task if it exists, and indicates whether that alignment can still be (potentially) discovered with the matching subtasks. The formal notion of coverage is given in Definitions 1 and 2.

Definition 1 (Coverage of a matching task)

Let $MT = \langle O_1, O_2 \rangle$ be a matching task and $M$ an alignment. We say that a mapping $m = \langle e_1, e_2 \rangle \in M$ is covered by the matching task if $e_1 \in Sig(O_1)$ and $e_2 \in Sig(O_2)$. The coverage of $MT$ w.r.t. $M$ (denoted as $\text{Coverage}(MT, M)$) represents the set of mappings in $M$ covered by $MT$.

Definition 2 (Coverage of the matching task division)

Let $D_n^{MT} = \{MT_1, \ldots, MT_n\}$ be the result of dividing a matching task $MT$, and $M$ an alignment. We say that a mapping $m \in M$ is covered by $D_n^{MT}$ if $m$ is covered by at least one of the matching subtasks $MT_i$ (with $i = 1, \ldots, n$) as in Definition 1. The coverage of $D_n^{MT}$ w.r.t. $M$ (denoted as $\text{Coverage}(D_n^{MT}, M)$) represents the set of mappings in $M$ covered by $D_n^{MT}$. The coverage is given as a ratio with respect to the (covered) alignment:

$\text{CoverageRatio}(D_n^{MT}, M) = \frac{|\text{Coverage}(D_n^{MT}, M)|}{|M|} \qquad (4)$
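As an illustration (our own sketch, with subtasks simplified to pairs of entity signatures), Equations 2-4 can be computed as follows:

def size_ratio(subtask, task):
    # subtask and task are pairs (sig1, sig2) of entity sets (Equation 2)
    (sig1_i, sig2_i), (sig1, sig2) = subtask, task
    return (len(sig1_i) * len(sig2_i)) / (len(sig1) * len(sig2))

def aggregated_size_ratio(division, task):
    # division is a list of subtasks (Equation 3)
    return sum(size_ratio(subtask, task) for subtask in division)

def coverage_ratio(division, alignment):
    # a mapping (e1, e2) is covered if both ends fall inside some subtask (Definitions 1-2)
    covered = {(e1, e2) for (e1, e2) in alignment
               if any(e1 in s1 and e2 in s2 for (s1, s2) in division)}
    return len(covered) / len(alignment)  # Equation 4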

3 Methods

In this section we present our approach to compute a division $D_n^{MT}$ given a matching task $MT = \langle O_1, O_2 \rangle$ and the number of target subtasks $n$. We rely on locality ontology modules to extract self-contained modules of the input ontologies. The module extraction and the task division are tailored to the ontology alignment task at hand by embedding the contextual semantics of a (combined) inverted index of the ontologies in the matching task.

Figure 1 shows an overview of our approach. (i) The ontologies $O_1$ and $O_2$ are indexed using the lexical index LexI (see Section 3.2); (ii) LexI is divided into clusters based on the semantic embeddings of its entries (see Section 3.4); (iii) the entries in those clusters derive potential mapping sets (see Section 3.3); and (iv) the context of these mapping sets leads to the matching subtasks (see Sections 3.1 and 3.3). Next, we elaborate on the methods behind these steps.

3.1 Locality modules and context

Logic-based module extraction techniques compute ontology fragments that capture the meaning of an input signature (e.g., set of entities) with respect to a given ontology. That is, a module contains the context (i.e., sets of semantically related entities) of the input signature. In this paper we rely on bottom-locality modules [DBLP:journals/jair/GrauHKS08, DBLP:conf/esws/Jimenez-RuizGSSL08], which will be referred to as locality-modules or simply as modules. These modules include the ontology axioms required to describe the entities in the signature. Locality-modules compute self-contained ontologies and are tailored to tasks that require reusing a fragment of an ontology. Please refer to [DBLP:journals/jair/GrauHKS08, DBLP:conf/esws/Jimenez-RuizGSSL08] for further details.

Locality-modules play a key role in our approach as they provide the context for the entities in a given mapping or set of mappings, as formally presented in Definition 3.

Definition 3 (Context of a mapping and an alignment)

Let $m = \langle e_1, e_2 \rangle$ be a mapping between two ontologies $O_1$ and $O_2$. We define the context of $m$ (denoted as $\text{Context}(m, O_1, O_2)$) as a pair of locality modules $O_1^m \subseteq O_1$ and $O_2^m \subseteq O_2$, where $O_1^m$ and $O_2^m$ include the entities semantically related to $e_1$ and $e_2$, respectively. Similarly, the context for an alignment $M$ between two ontologies $O_1$ and $O_2$ is denoted as $\text{Context}(M, O_1, O_2) = \langle O_1^M, O_2^M \rangle$, where $O_1^M$ and $O_2^M$ are modules including the semantically related entities for the entities $e_1$ and $e_2$ in each mapping $m = \langle e_1, e_2 \rangle \in M$.

Intuitively, as the context of an alignment $M$ (i.e., $\text{Context}(M, O_1, O_2) = \langle O_1^M, O_2^M \rangle$) semantically characterises the entities involved in that alignment, a matching task $MT = \langle O_1, O_2 \rangle$ can be reduced to the task $MT^M = \langle O_1^M, O_2^M \rangle$ without information loss in terms of finding $M$ (i.e., $\text{Coverage}(MT^M, M) = M$). For example, in the small OAEI largebio tasks [DBLP:conf/semweb/AlgergawyCFFFHH18, DBLP:conf/semweb/AlgergawyFFFHHJ19] systems are given the context of the reference alignment as a (reduced) matching task (e.g., $\langle FMA^{M^{RA}}, NCI^{M^{RA}} \rangle$), instead of the whole FMA and NCI ontologies.

# | Index key | Index value (entities of O_1) | Index value (entities of O_2)
1 | disorder | :Disorder_of_pregnancy, :Disorder_of_stomach | :Pregnancy_Disorder
2 | disorder, pregnancy | :Disorder_of_pregnancy | :Pregnancy_Disorder
3 | carcinoma, basaloid | :Basaloid_carcinoma | :Basaloid_Carcinoma, :Basaloid_Lung_Carcinoma
4 | follicul, thyroid, carcinom | :Follicular_thyroid_carcinoma | :Follicular_Thyroid_carcinoma
5 | hamate, lunate | :Lunate_facet_of_hamate | -
Table 1: Excerpt of the inverted lexical index LexI. For readability, the index values have been split into entities of O_1 and entities of O_2. '-' indicates that the ontology does not contain entities for that entry.

3.2 Indexing the ontology vocabulary

We rely on a semantic inverted index (we will refer to this index as LexI). This index maps sets of words to the entities where these words appear. LexI encodes the labels of all entities of the input ontologies $O_1$ and $O_2$, including their lexical variations (e.g., preferred labels, synonyms), in the form of key-value pairs, where the key is a set of words and the value is a set of entities such that the set of words of the key appears in (one of) the entity labels. Similar indexes are commonly used in information retrieval applications [DBLP:books/daglib/0031897] and Entity Resolution systems [blocking-er-2019], and are also exploited in ontology alignment systems (e.g., LogMap [Jimenez-Ruiz:2011], ServOMap [servomap14] and AML [aml18]) to reduce the search space and enhance the matching process. Table 1 shows a few example entries of LexI for two input ontologies.

LexI is created as follows. (i) Each label associated with an ontology entity is split into a set of words; for example, the label "Lunate facet of hamate" is split into the set {"lunate", "facet", "of", "hamate"}. (ii) Stop-words are removed from the set of words. (iii) Stemming techniques are applied to each word (i.e., {"lunat", "facet", "hamat"}). (iv) Combinations of subsets of words also serve as keys in LexI; for example, {"lunat", "facet"}, {"hamat", "lunat"} and so on (in order to avoid a combinatorial blow-up, the number of computed subsets of words is limited). (v) Entities leading to the same (sub)set of words are associated with the same key in LexI; for example, {"disorder"} is associated with three entities. Finally, (vi) entries in LexI pointing to entities of only one ontology, or associated with a number of entities larger than a given threshold (fixed in the experiments), are not considered. Note that a single entity label may lead to several entries in LexI, and each entry in LexI points to one or more entities.
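A simplified sketch of steps (i)-(vi) follows (our own illustration, not the actual implementation; it uses NLTK for stop-words and stemming, and the subset and threshold values are arbitrary examples):

# Simplified construction of the inverted index LexI from entity labels.
# Assumes each ontology is given as a dict {entity_uri: [label, ...]};
# requires nltk and nltk.download("stopwords").
from itertools import combinations
from collections import defaultdict
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOP = set(stopwords.words("english"))
stem = PorterStemmer().stem

def label_words(label):
    # steps (i)-(iii): split, remove stop-words, stem
    words = [w.lower() for w in label.replace("_", " ").split()]
    return [stem(w) for w in words if w not in STOP]

def build_lexi(labels_o1, labels_o2, max_subset_size=2, max_entities=60):
    lexi = defaultdict(lambda: (set(), set()))
    for side, labels in enumerate((labels_o1, labels_o2)):
        for entity, entity_labels in labels.items():
            for label in entity_labels:
                words = label_words(label)
                if not words:
                    continue
                # step (iv): the full word set plus bounded subsets of it
                keys = {frozenset(words)}
                size = min(max_subset_size, len(words))
                keys.update(frozenset(c) for c in combinations(words, size))
                for key in keys:                 # step (v)
                    lexi[key][side].add(entity)
    # step (vi): drop entries covering a single ontology or too many entities
    return {k: v for k, v in lexi.items()
            if v[0] and v[1] and len(v[0]) + len(v[1]) <= max_entities}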

3.3 Covering matching subtasks

Each entry (i.e., each key-value pair) in LexI is a source of candidate mappings. For instance, the first example in Table 1 suggests the candidate mappings $\langle$:Disorder_of_pregnancy, :Pregnancy_Disorder$\rangle$ and $\langle$:Disorder_of_stomach, :Pregnancy_Disorder$\rangle$, since these entities are associated with the {"disorder"} entry in LexI. These mappings are not necessarily correct but will link lexically related entities, that is, those entities sharing at least one word among their labels (e.g., "disorder"). Given a subset of entries or rows of LexI (i.e., $l \subseteq$ LexI), the function $\text{Mappings}(l)$ provides the set of mappings derived from $l$. We refer to the set of all (potential) mappings suggested by LexI (i.e., $\text{Mappings}(\text{LexI})$) as $M^{\text{LexI}}$. $M^{\text{LexI}}$ represents a manageable subset of the Cartesian product between the entities of the input ontologies; for example, in the largebio tasks the number of potential mappings suggested by LexI is orders of magnitude smaller than the Cartesian product between the signatures of the input ontologies.

Since standard ontology alignment systems rarely discover mappings outside $M^{\text{LexI}}$, the context of $M^{\text{LexI}}$ (recall Definition 3) can be seen as a reduced matching task $MT^{\text{LexI}} = \langle O_1^{\text{LexI}}, O_2^{\text{LexI}} \rangle$ of the original task $MT$. However, the modules $O_1^{\text{LexI}}$ and $O_2^{\text{LexI}}$, although smaller than $O_1$ and $O_2$, can still be challenging for many ontology matching systems. A solution is to divide or cluster the entries in LexI so as to obtain several tasks involving smaller ontology modules.

Definition 4 (Matching subtasks from LexI)

Let $MT = \langle O_1, O_2 \rangle$ be a matching task, LexI the inverted index of the ontologies $O_1$ and $O_2$, and $\{c_1, \ldots, c_n\}$ a set of $n$ clusters of entries in LexI. We denote the set of matching subtasks from LexI as $D_n^{MT} = \{MT_1, \ldots, MT_n\}$, where each cluster $c_i$ leads to the matching subtask $MT_i = \langle O_1^i, O_2^i \rangle$, such that $\text{Mappings}(c_i)$ is the set of mappings suggested by the LexI entries (i.e., key-value pairs) in $c_i$, and $O_1^i$ and $O_2^i$ represent the context of $\text{Mappings}(c_i)$ w.r.t. $O_1$ and $O_2$.

Quality of the matching subtasks. The matching subtasks in Definition 4 rely on LexI and on the notion of context; thus it is expected that the tasks in $D_n^{MT}$ will cover most of the mappings that a matching system can compute, that is, the coverage ratio will be close to $1.0$. Furthermore, the use of locality modules to compute the context guarantees the extraction of matching subtasks that are suitable for ontology alignment systems in terms of preservation of the logical properties of the given signature.

Intuitively, each cluster of LexI will lead to a matching task $MT_i$ that is smaller (with respect to both $MT$ and $MT^{\text{LexI}}$) in terms of search space. Hence $\text{SizeRatio}(MT_i, MT)$ will be smaller than $1.0$. The overall aggregation of ratios (cf. Equation 3) depends on the clustering strategy of the entries in LexI and is also expected to be smaller than $1.0$.
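Continuing the LexI sketch above, Definition 4 can be illustrated as follows (clusters of LexI entries are computed in Section 3.4; `extract_module` stands for a bottom-locality module extractor, which we only assume here and do not reproduce):

def mappings_from_entries(entries):
    # entries: iterable of LexI values (entities_O1, entities_O2);
    # every cross pair within an entry is a candidate mapping.
    return {(e1, e2) for (ents1, ents2) in entries for e1 in ents1 for e2 in ents2}

def subtasks_from_clusters(clusters, onto1, onto2, extract_module):
    # extract_module(ontology, seed_signature) -> module (context of the seed)
    subtasks = []
    for cluster in clusters:  # each cluster is a list of LexI values
        mappings = mappings_from_entries(cluster)
        seed1 = {e1 for (e1, _) in mappings}
        seed2 = {e2 for (_, e2) in mappings}
        subtasks.append((extract_module(onto1, seed1), extract_module(onto2, seed2)))
    return subtasks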

Reducing the search space in each matching subtask has the potential of enabling the evaluation of systems that cannot cope with the original matching task in a given time-frame or with (limited) computational resources.

OAEI track | Source of M^RA | Source of M^S | Task | Ontology | Version | Size (classes)
Anatomy | Manually created [mousealignment05] | Consensus (vote=3) | AMA-NCIA | AMA | v.2007 | 2,744
Anatomy | | | AMA-NCIA | NCIA | v.2007 | 3,304
Largebio | UMLS-Metathesaurus [umlsassessment11] | Consensus (vote=3) | FMA-NCI | FMA | v.2.0 | 78,989
Largebio | | | FMA-SNOMED | NCI | v.08.05d | 66,724
Largebio | | | SNOMED-NCI | SNOMED CT | v.2009 | 306,591
Phenotype | Consensus alignment (vote=2) [phenotype2017] | Consensus (vote=3) | HPO-MP | HPO | v.2016 | 11,786
Phenotype | | | HPO-MP | MP | v.2016 | 11,721
Phenotype | | | DOID-ORDO | DOID | v.2016 | 9,248
Phenotype | | | DOID-ORDO | ORDO | v.2016 | 12,936
Table 2: Matching tasks. AMA: Adult Mouse Anatomy. DOID: Human Disease Ontology. FMA: Foundational Model of Anatomy. HPO: Human Phenotype Ontology. MP: Mammalian Phenotype. NCI: National Cancer Institute Thesaurus. NCIA: Anatomy fragment of NCI. ORDO: Orphanet Rare Disease Ontology. SNOMED CT: Systematized Nomenclature of Medicine – Clinical Terms. The Phenotype ontologies were downloaded from BioPortal. For all tracks we use the consensus alignment with vote=3 as system mappings M^S. The Phenotype track does not have a gold standard, so a consensus alignment with vote=2 is used as reference M^RA.

3.4 Semantic embeddings

We use a semantic embedding approach to identify, given the number of target subtasks $n$, a set of clusters of entries from LexI. As in Definition 4, these clusters lead to the set of matching subtasks $D_n^{MT}$. The semantic embeddings aim at representing in the same (vector) space the features about the relationships among the words and ontology entities that occur in LexI. Hence, words and entities that belong to similar semantic contexts will typically have similar vector representations.

Embedding model. Our approach currently relies on the StarSpace toolkit (https://github.com/facebookresearch/StarSpace) and its neural embedding model [wu_2017] to learn embeddings for the words and ontology entities in LexI. We adopt the TagSpace [DBLP:conf/emnlp/WestonCA14] training setting of StarSpace. Applied to our setting, StarSpace learns associations between a set of words (i.e., keys in LexI) and a set of relevant ontology entities (i.e., values in LexI). The StarSpace model is trained by assigning a $d$-dimensional vector to each of the relevant features (e.g., the individual words and the ontology entities in LexI). Ultimately, the look-up matrix (the matrix of embeddings, i.e., latent vectors) is learned by minimising the loss function in Equation 5.

$\sum_{(a, b) \in E^+,\; b^- \in E^-} L^{batch}\big(sim(a, b),\; sim(a, b_1^-), \ldots, sim(a, b_k^-)\big) \qquad (5)$

In this loss function we compare positive samples with negative samples. Hence we need to indicate the generator of positive pairs $E^+$ (in our setting, word-entity pairs from LexI) and the generator of negative entries $E^-$ (in our case we sample from the list of entities in the values of LexI). StarSpace follows the strategy by Mikolov et al. [mikolov_2013] and selects a random subset of $k$ negative examples for each batch update. Note that we tailor the generators to the alignment task by sampling from LexI. The similarity function $sim$ operates on the $d$-dimensional vectors (e.g., those of $a$, $b$ and $b^-$); in our case we use the standard dot product in Euclidean space.
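As an illustration of this training setting (our own sketch; the __label__ convention and the single-dash flags follow the public StarSpace documentation, and `lexi` is assumed to be the index built as in Section 3.2), the LexI entries can be serialised into a TagSpace training file and trained with the hyperparameters reported in Section 4:

# Serialise LexI entries as StarSpace TagSpace (trainMode 0) training lines:
# the words of the key, followed by the associated entities as __label__ tags.
def to_starspace_lines(lexi):
    lines = []
    for key, (ents1, ents2) in lexi.items():
        words = " ".join(sorted(key))
        labels = " ".join("__label__" + e for e in sorted(ents1 | ents2))
        lines.append(words + " " + labels)
    return lines

with open("lexi_train.txt", "w") as f:  # assuming `lexi` built as in the earlier sketch
    f.write("\n".join(to_starspace_lines(lexi)))

# Training (shell): starspace train -trainFile lexi_train.txt -model lexi_model \
#                   -trainMode 0 -similarity dot -epoch 100 -dim 64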

Clustering strategy. The semantic embedding of each entry $(K, V) \in$ LexI is calculated by concatenating (i) the mean vector representation of the vectors associated with each word in the key $K$ with (ii) the mean vector of the vectors of the ontology entities in the value $V$, as in Equation 6, where $\oplus$ represents the concatenation of two vectors, $v_w$ and $v_e$ represent the $d$-dimensional vector embeddings learnt by StarSpace for a word $w$ and an entity $e$, and the resulting embedding is a $2d$-dimensional vector.

$embedding(K, V) = \Big(\frac{1}{|K|} \sum_{w \in K} v_w\Big) \oplus \Big(\frac{1}{|V|} \sum_{e \in V} v_e\Big) \qquad (6)$

Based on these embeddings we then perform standard clustering with the K-means algorithm to obtain the set of clusters of LexI entries $\{c_1, \ldots, c_n\}$. For example, following our approach, in the example of Table 1 the entries in rows 1 and 2 (respectively rows 3 and 4) would belong to the same cluster.
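A sketch of Equation 6 and of the clustering step (ours, using numpy and scikit-learn; `word_vec` and `entity_vec` stand for look-ups into the embeddings learnt by StarSpace):

# Embed each LexI entry by concatenating the mean word vector of the key with
# the mean entity vector of the value (Equation 6), then cluster with K-means.
import numpy as np
from sklearn.cluster import KMeans

def entry_embedding(key, value, word_vec, entity_vec):
    # key: set of words; value: set of ontology entities (both ontologies)
    key_vec = np.mean([word_vec[w] for w in key], axis=0)
    value_vec = np.mean([entity_vec[e] for e in value], axis=0)
    return np.concatenate([key_vec, value_vec])  # 2d-dimensional vector

def cluster_lexi(lexi, word_vec, entity_vec, n):
    entries = list(lexi.items())
    X = np.stack([entry_embedding(k, v[0] | v[1], word_vec, entity_vec)
                  for k, v in entries])
    labels = KMeans(n_clusters=n, n_init=10).fit_predict(X)
    clusters = [[] for _ in range(n)]
    for (key, value), label in zip(entries, labels):
        clusters[label].append(value)  # keep the entry values per cluster
    return clusters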

Suitability of the embedding model. Although we could have followed other embedding strategies, we advocated learning new entity embeddings with StarSpace for the following reasons: (i) ontologies, particularly in the biomedical domain, may bring specialised vocabulary that is not fully covered by precomputed word embeddings; (ii) to embed not only words but also concepts of both ontologies; and (iii) to obtain embeddings tailored to the ontology alignment task (i.e., to learn similarities among words and concepts dependent on the task). StarSpace provides the required functionalities to embed the semantics of LexI and identify accurate clusters. Precise clusters will lead to smaller matching tasks and, thus, to a reduced global size of the computed division of the matching task (cf. Equation 3).

4 Evaluation

In this section we provide empirical evidence to support the suitability of the proposed method to divide the ontology alignment task. We rely on the datasets of the Ontology Alignment Evaluation Initiative (OAEI) [DBLP:conf/semweb/AlgergawyCFFFHH18, DBLP:conf/semweb/AlgergawyFFFHHJ19], more specifically, on the matching tasks provided in the anatomy, largebio and phenotype tracks. Table 2 provides an overview of these OAEI tasks and the related ontologies and mapping sets.

The methods have been implemented in Java (https://github.com/ernestojimenezruiz/logmap-matcher) and Python (neural embedding strategy: https://github.com/plumdeq/neuro-onto-part), and tested on an Ubuntu laptop with an Intel Core i9-8950HK CPU@2.90GHz and a fixed maximum amount of allocated RAM. Datasets, matching subtasks, computed mappings and other supporting resources are available in the Zenodo repository [zenodo_material_ecai]. For all of our experiments we used the following StarSpace hyperparameters: -trainMode 0 -similarity dot --epoch 100 --dim 64.


Figure 2: Quality measures of the divisions $D_n^{MT}$ with respect to the number of matching subtasks $n$: (a) coverage ratio over the reference alignments; (b) coverage ratio over the system (consensus) mappings; (c) size ratio of the divisions; (d) module sizes of the subtasks in $D_n^{MT}$ for FMA-NCI.

4.1 Adequacy of the division approach

We have evaluated the adequacy of our division strategy in terms of coverage (as in Equation 4) and size (as in Equation 3) of the resulting divisions $D_n^{MT}$ for each of the matching tasks in Table 2.

Coverage ratio. Figures 2(a) and 2(b) show the coverage of the different divisions $D_n^{MT}$ with respect to the reference alignment and the system-computed mappings, respectively. As system mappings we have used the consensus alignment with vote=3, that is, mappings that have been voted for by at least 3 systems in the last OAEI campaigns. The overall coverage results are encouraging: (i) the divisions cover the vast majority of the reference alignment mappings for all tasks, with SNOMED-NCI being the case with the lowest coverage; (ii) when considering system mappings, the coverage is very high for all divisions, with AMA-NCIA being the only exception; (iii) increasing the number of divisions tends to slightly decrease the coverage in some of the test cases; this is expected behaviour, as the computed divisions include different semantic contexts (i.e., locality modules) and some relevant entities may fall outside the division; finally, (iv) as shown in [pereira:2017], the coverage results of state-of-the-art partitioning methods (e.g., [hu:2008, Hamdi:2010]) are very low for the OAEI largebio track (FMA-NCI, FMA-SNOMED and SNOMED-NCI), which makes the obtained results even more valuable.

Size ratio. The results in terms of the size (i.e., search space) of the selected divisions are presented in Figure 2(c). The search space is reduced with respect to the original task in all cases, with the FMA-NCI and FMA-SNOMED cases showing the largest reductions. The gain in the reduction of the search space becomes relatively stable after a given division size; this result is expected, since the context provided by locality modules ensures modules with the necessary semantically related entities. The scatter plot in Figure 2(d) visualises the size of the source modules against the size of the target modules for the FMA-NCI matching subtasks over divisions of different sizes. Each point represents the pair $\langle |Sig(O_1^i)|, |Sig(O_2^i)| \rangle$ for the source and target modules $O_1^i$ and $O_2^i$ (with $i = 1, \ldots, n$) of a matching subtask in $D_n^{MT}$. It can be noted that, on average, the size of the source and target modules decreases as the size of the division increases; in particular, the largest subtask shrinks as the number of subtasks grows.

Tool | Task | Matching subtasks | P | R | F | Min (s) | Max (s) | Total (s)
MAMBA (v.2015) | AMA-NCIA | 5 | 0.870 | 0.624 | 0.727 | 73 | 785 | 1,981
MAMBA (v.2015) | AMA-NCIA | 10 | 0.885 | 0.623 | 0.731 | 41 | 379 | 1,608
MAMBA (v.2015) | AMA-NCIA | 50 | 0.897 | 0.623 | 0.735 | 8 | 154 | 1,377
FCA-Map (v.2016) | FMA-NCI | 20 | 0.656 | 0.874 | 0.749 | 39 | 340 | 2,934
FCA-Map (v.2016) | FMA-NCI | 50 | 0.625 | 0.875 | 0.729 | 19 | 222 | 3,213
FCA-Map (v.2016) | FMA-SNOMED | 50 | 0.599 | 0.251 | 0.354 | 6 | 280 | 3,455
FCA-Map (v.2016) | FMA-SNOMED | 100 | 0.569 | 0.253 | 0.350 | 5 | 191 | 3,028
FCA-Map (v.2016) | SNOMED-NCI | 150 | 0.704 | 0.629 | 0.664 | 5 | 547 | 16,822
FCA-Map (v.2016) | SNOMED-NCI | 200 | 0.696 | 0.630 | 0.661 | 5 | 395 | 16,874
SANOM (v.2017) | FMA-NCI | 20 | 0.475 | 0.720 | 0.572 | 40 | 1,467 | 9,374
SANOM (v.2017) | FMA-NCI | 50 | 0.466 | 0.726 | 0.568 | 15 | 728 | 7,069
SANOM (v.2017) | FMA-SNOMED | 100 | 0.145 | 0.210 | 0.172 | 3 | 1,044 | 13,073
SANOM (v.2017) | FMA-SNOMED | 150 | 0.143 | 0.209 | 0.170 | 3 | 799 | 10,814
POMap++ (v.2018) | FMA-NCI | 20 | 0.697 | 0.732 | 0.714 | 24 | 850 | 5,448
POMap++ (v.2018) | FMA-NCI | 50 | 0.701 | 0.748 | 0.724 | 11 | 388 | 4,041
POMap++ (v.2018) | FMA-SNOMED | 50 | 0.520 | 0.209 | 0.298 | 4 | 439 | 5,879
POMap++ (v.2018) | FMA-SNOMED | 100 | 0.522 | 0.209 | 0.298 | 3 | 327 | 4,408
ALOD2vec (v.2018) | FMA-NCI | 20 | 0.697 | 0.813 | 0.751 | 115 | 2,141 | 13,592
ALOD2vec (v.2018) | FMA-NCI | 50 | 0.698 | 0.813 | 0.751 | 48 | 933 | 12,162
ALOD2vec (v.2018) | FMA-SNOMED | 100 | 0.702 | 0.183 | 0.290 | 9 | 858 | 12,688
ALOD2vec (v.2018) | FMA-SNOMED | 150 | 0.708 | 0.183 | 0.291 | 7 | 581 | 10,449
Table 3: Evaluation of systems that failed to complete OAEI tasks in the 2015-2018 campaigns. Times reported in seconds (s).

Computation times. The time to compute the divisions of a matching task is tied to the number of locality modules to extract, which can be computed in polynomial time relative to the size of the input ontology [DBLP:journals/jair/GrauHKS08]. The creation of LexI does not add an important overhead, while the training of the neural embedding model represents the largest share of the time, being shortest for AMA-NCIA and longest for SNOMED-NCI. The same holds for the overall time required to compute a division of a matching task, which was the shortest for AMA-NCIA and the longest for SNOMED-NCI.

4.2 Evaluation of OAEI systems

In this section we show that the division of the alignment task enables systems that, given some computational constraints, were unable to complete an OAEI task. We have selected the following five systems from the latest OAEI campaigns, which include novel alignment techniques but failed to scale to very large matching tasks: MAMBA (v.2015) [mamba2015], FCA-Map (v.2016) [fcamap2016], SANOM (v.2017) [sanom17], ALOD2vec (v.2018) [alod2vec2018] and POMap++ (v.2018) [pomap2018]. MAMBA failed to complete the anatomy track, while FCA-Map, SANOM, ALOD2vec and POMap++ could not complete the largest tasks in the largebio track. MAMBA and SANOM threw an out-of-memory exception with the allocated memory, whereas FCA-Map, ALOD2vec and POMap++ did not complete the tasks within the allowed time frame. We have used the SEALS infrastructure to conduct the evaluation [DBLP:conf/semweb/AlgergawyCFFFHH18, DBLP:conf/semweb/AlgergawyFFFHHJ19].

Table 3 shows the obtained results in terms of precision, recall, f-measure, and computation times (time for the easiest and for the hardest subtask, and total time for all subtasks) over different divisions computed using our strategy. For example, FCA-Map was run over the divisions with 20 and 50 matching subtasks (i.e., $D_{20}^{MT}$ and $D_{50}^{MT}$) in the FMA-NCI case. Note that for each matching subtask $MT_i$ a system generates a partial alignment $M_i$; the final alignment $M$ for the (original) matching task is computed as the union of all partial alignments ($M = \bigcup_{i=1}^{n} M_i$). The results are encouraging and can be summarised as follows:

  1. We enabled several systems to produce results even for the largest OAEI test case (e.g., FCA-Map with SNOMED-NCI).

  2. The computation times are also very good, falling within the allowed time frame, especially given that the (independent) matching subtasks were run sequentially without parallelization.

  3. Increasing the number of matching subtasks in a division, with the exception of FCA-Map, is beneficial in terms of total computation time.

  4. Increasing the number of matching subtasks has a positive or neutral effect on the f-measure for MAMBA, POMap++ and ALOD2vec, while it slightly reduces the f-measure for FCA-Map and SANOM.

  5. The global f-measure results are lower than those of the top OAEI systems; nevertheless, since the above systems could not be evaluated at all without the divisions, these results were obtained without any fine-tuning of their parameters.

  6. The computation time of the hardest subtask is also reduced as the number of subtasks increases. This has a positive impact on the monitoring of alignment systems, as the hardest subtask is completed in a reasonable time.

5 Related work

Partitioning and blocking. Partitioning and modularization techniques have been extensively used within the Semantic Web to improve efficiency when solving the task at hand (e.g., visualization [stuckenschmidt:2009, agibetov_2015], reuse [DBLP:conf/esws/Jimenez-RuizGSSL08], debugging [DBLP:conf/aswc/SuntisrivarapornQJH08], classification [DBLP:conf/semweb/RomeroGH12]). Partitioning or blocking has also been widely used to reduce the complexity of the ontology alignment task [aml18]. In the literature there are two major categories of partitioning techniques, namely independent and dependent. Independent techniques typically use only the structure of the ontologies and are not concerned with the ontology alignment task when performing the partitioning, whereas dependent partitioning methods rely on both the structure of the ontology and the ontology alignment task at hand. Although our approach does not compute (non-overlapping) partitions of the ontologies, it can be considered a dependent technique.

Prominent examples of ontology alignment systems including partitioning techniques are Falcon-AO [hu:2008], GOMMA [DBLP:conf/dils/GrossHKR10], COMA++ [alger:2011] and TaxoMap [Hamdi:2010]. Falcon-AO, GOMMA and COMA++ perform independent partitioning, where the clusters of the source and target ontologies are extracted independently. Then, pairs of similar clusters (i.e., matching subtasks) are aligned using standard techniques. TaxoMap [Hamdi:2010] implements a dependent technique where the partitioning is combined with the matching process. TaxoMap proposes two methods, namely PAP (partition, anchor, partition) and APP (anchor, partition, partition). The main difference between these methods is the order of extraction of (preliminary) anchors to discover pairs of partitions to be matched (i.e., matching subtasks). SeeCOnt [Alger:2015] presents a seeding-based clustering technique to discover independent clusters in the input ontologies. Their approach has been evaluated with the Falcon-AO system by replacing its native PBM (Partition-based Block Matching) module [pbm2016]. Laadhar et al. [pomap2018] have recently integrated within the system POMap++ a hierarchical agglomerative clustering algorithm to divide an ontology into a set of partitions.

The above approaches, although they presented interesting ideas, did not provide guarantees about the size and coverage of the discovered partitions or divisions. Furthermore, they have not been successfully evaluated on very large ontologies. On the one hand, as reported by Pereira et al. [pereira:2017], the coverage results of the PBM method of Falcon-AO, and of the PAP and APP methods of TaxoMap, are very low for the OAEI largebio track. On the other hand, as discussed in Section 4, POMap++ fails to scale with the largest largebio tasks.

Note that the recent work in [DBLP:conf/sac/LaadharGMRTG19] has borrowed from our workshop paper [om18_division] the quality measures presented in Section 2.1. They obtain competitive coverage results for medium size ontologies; however, their approach, as in POMap++, does not scale for large ontologies.

Blocking techniques are also extensively used in Entity Resolution (see [blocking-er-2019] for a survey). Although related, the problem of blocking in ontologies is different as the logically related axioms for a seed signature play an important role when computing the blocks.

Our dependent approach, unlike traditional partitioning and blocking methods, computes overlapping self-contained modules (i.e., locality modules [DBLP:journals/jair/GrauHKS08]). Locality modules guarantee the extraction of all semantically related entities for a given signature. This capability enhances the coverage results and enables the inclusion of the (semantically) relevant information required by an alignment system. It is worth mentioning that the need for self-contained and covering modules, although not thoroughly studied, was also highlighted in preliminary work by Paulheim [Paulheim:2008].


Embedding and clustering.

Recently, machine learning techniques such as semantic embeddings [cai2018comprehensive] have been investigated for ontology alignment. They often first learn vector representations of the entities and then predict the alignment [azmy2019matching, zhang2019multi, sun2019transedge]. However, most of them focus on the alignment of ontology individuals (i.e., the ABox) without considering the ontology concepts and axioms at the terminological level (i.e., the TBox). Nkisi-Orji et al. [nkisi2018ontology] predict the alignment between ontology concepts with a Random Forest classifier, but incorporate the embeddings of words alone, without other semantic components as in our work. Furthermore, these approaches focus on predicting the alignment, while our work aims at boosting an existing alignment system. Our framework could potentially be adopted by systems like [nkisi2018ontology] if they face scalability problems with large ontologies.

Another piece of related work is the clustering of semantic components using the canopy clustering algorithm [mccallum2000efficient], where objects are grouped into canopies and each object can be a member of multiple canopies. For example, Wu et al. [wu2018towards] first extracted canopies (i.e., mentions) from a knowledge base, and then grouped the entities accordingly so as to find the entities with the same semantics (i.e., canonicalization). Since we focus on a different task, ontology alignment, the context that can be used (such as the embeddings of the words and ontology entities in LexI) differs from these works, which leads to a different clustering method.

6 Conclusions and future work

We have developed a novel framework to split the ontology alignment task into several matching subtasks based on a semantic inverted index, locality modules, and a neural embedding model. We have performed a comprehensive evaluation which suggests that the obtained divisions are suitable in practice in terms of both coverage and size. The division of the matching task allowed us to obtain results for five systems which failed to complete these tasks in the past. We have focused on systems failing to complete a task, but a suitable adoption and integration of the presented framework within the pipeline of any ontology alignment system has the potential to improve the results in terms of computation times.

Opportunities. Reducing the ontology matching task into smaller and more manageable tasks may also bring opportunities to enhance (i) user interaction [ker2019], (ii) reasoning and repair [DBLP:phd/dnb/Meilicke11], (iii) benchmarking and monitoring [DBLP:conf/semweb/AlgergawyCFFFHH18, DBLP:conf/semweb/AlgergawyFFFHHJ19], and (iv) parallelization. The computed independent matching subtasks can potentially be run in parallel in evaluation platforms like the HOBBIT [hobbit16]. The current evaluation was conducted sequentially as (i) the SEALS instance only allows running one task at a time, and (ii) the evaluated systems were not designed to run several tasks in parallel; for instance, we managed to run MAMBA outside SEALS, but it relies on a MySQL database and raised a concurrent access exception.

Impact on the f-measure. As shown in Section 4.2, the impact of the number of divisions on the f-measure depends on the evaluated systems. In the near future we aim at conducting an extensive evaluation of our framework over OAEI systems able to deal with the largest tasks in order to obtain more insights about the impact on the f-measure. In [arxiv18_division] we reported a preliminary evaluation where YAM-Bio [yambio2017] and AML [aml2013] kept similar f-measure values, while LogMap [Jimenez-Ruiz:2011] had a reduction in f-measure, as the number of divisions increased.

Number of subdivisions. Currently our strategy requires the number of matching subtasks or divisions $n$ as input. The required number of matching subtasks may be known beforehand if, for example, the matching subtasks are to be run in parallel on a number of available CPUs. For the cases where the resources are limited, or where a matching system is known to cope only with small ontologies, we plan to design an algorithm to estimate the number of divisions so that the size of the matching subtasks in the computed divisions is appropriate to the system and the resource constraints.

Dealing with a limited or large lexicon. The construction of LexI shares a limitation with state-of-the-art systems when the input ontologies are lexically disparate or in different languages. In such cases, LexI can be enriched with general-purpose lexicons (e.g., WordNet), more specialised background knowledge (e.g., the UMLS Metathesaurus) or translated labels obtained via online services (e.g., Google). On the other hand, a large lexicon may also have an important impact on the computation times. Our evaluation shows, however, that we can cope with very large ontologies with a rich lexicon (e.g., the NCI Thesaurus).

Notion of context. Locality-based modules are typically much smaller than the whole ontology and they have led to very good results in terms of size and coverage. We plan, however, to study different notions of context of an alignment (e.g., the tailored modules proposed in [DBLP:journals/jair/RomeroKGH16]) to further improve the results in terms of size while keeping the same level of coverage.

This work was supported by the SIRIUS Centre for Scalable Data Access (Norges forskningsråd), the AIDA project (Alan Turing Institute), Samsung Research UK, Siemens AG, and the EPSRC projects AnaLOG, OASIS and UK FIRES. We would also like to thank the anonymous reviewers that helped us improve this work.

References