NEMA: Automatic Integration of Large Network Management Databases

06/01/2020 ∙ by Fubao Wu, et al. ∙ 0

Network management, whether for malfunction analysis, failure prediction, performance monitoring and improvement, generally involves large amounts of data from different sources. To effectively integrate and manage these sources, automatically finding semantic matches among their schemas or ontologies is crucial. Existing approaches on database matching mainly fall into two categories. One focuses on the schema-level matching based on schema properties such as field names, data types, constraints and schema structures. Network management databases contain massive tables (e.g., network products, incidents, security alert and logs) from different departments and groups with nonuniform field names and schema characteristics. It is not reliable to match them by those schema properties. The other category is based on the instance-level matching using general string similarity techniques, which are not applicable for the matching of large network management databases. In this paper, we develop a matching technique for large NEtwork MAnagement databases (NEMA) deploying instance-level matching for effective data integration and connection. We design matching metrics and scores for both numerical and non-numerical fields and propose algorithms for matching these fields. The effectiveness and efficiency of NEMA are evaluated by conducting experiments based on ground truth field pairs in large network management databases. Our measurement on large databases with 1,458 fields, each of which contains over 10 million records, reveals that the accuracies of NEMA are up to 95 achieves 2

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 11

page 14

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

With the development of big data analytics and data mining techniques, scalable network measurement and analysis techniques have been used in finding hidden information or patterns to help with network management, network monitoring and network security [38, 11, 16, 33]. In one scenario of important network management, big network vendors serving various customers own a multitude of databases on network product information, network configurations, geographical information on deployment sites, incident logs, etc. Since the databases serve different departments, they are usually separated from each other and independently managed by different departments and groups. However, network data are inherently designed to host “connections” among different devices, groups that the devices are in, and functionality the devices serve together; they are often required to be shared and put together to be used in many important tasks such as network prediction, semantic query, network analysis and fault detection. For example, with the data connections, one can easily model a graph database to query related product configurations and performance issues around faulty devices or service incidents, which are matched with another product-related table [28, 1]. As many studies [10, 12, 7, 36] point out, the discovery of matching fields is the most crucial and foremost step for the integration of databases. For this reason, in this paper we aim to automatically construct such matchings that lead to efficient network management and analysis.

There is an abundance of research on field matching and integration approaches for different data formats in different contexts such as relational databases, XML and object-oriented data formats [14, 31, 37, 5]. Existing database matching approaches include two main categories of techniques. One is based on schema-level matching, which exploits metadata using schema characteristics such as field names, data types, structural properties and other schema information [4, 15]. However, network management databases from different sources have different design standards and naming conventions [18, 9]. Even similar fields can have different names (e.g. “product_family” can also be named as “product series”). The other category is instance-level matching, which uses the record values of two fields to calculate the similarity and to determine matched fields [8, 29, 27, 24]

. Most of the previously proposed schemes rely on syntactic similarities, sampling or machine learning techniques that are meant to extract common patterns from the matching data corpus. However, it is difficult to directly apply these techniques to network databases or challenging to reliably construct dictionaries, corpus, and training labels with large datasets when the naming convention is not consistent and diverse.

The challenges for matching these network databases are: (1) The database design is not ideally uniform. The data tables are created in different groups and departments by different people. Therefore, it is not reliable to use schema information to match data directly and easily. (2) The data is noisy and irregular. Some table fields contain unexpected records such as null, invalid values and typos. Some table records are either partly missing, incomplete or incorrect. Some fields have a large amount of records, while some have very few records. (3) The table contents are complicated and heterogeneous with numerical and non-numerical data formats. (4) No thesauri or auxiliary information exist that we can rely on for matching. The only set of observation available is the database itself.

To solve these challenges, we propose an automatic matching technique for large NEtwork MAnagement databases (NEMA) to construct a graph database for network management and analysis effectively and efficiently. We propose several algorithms for numerical matching and non-numerical respectively based on instance matching. Our main contributions are as follows:

  1. We propose effective range difference similarity and bucket dot product similarity metric to match numerical fields, and top priority metric to match non-numerical fields.

  2. To make the algorithm more scalable for large network management databases, we utilize min-hash-locality sensitive hashing algorithm for faster processing with a little scarification of accuracy.

  3. We further propose to use the proposed similarity metric scores as features for classification to improve the reliability of our matching technique.

  4. We experimentally demonstrate the effectiveness and efficiency of our matching algorithms in the real Cisco network management databases for constructing a graph network database.

The rest of this paper is organized as follows. We define the problem in Section II. Section III describes NEMA matching algorithms in detail including both numerical matching and non-numerical matching. Experimental evaluation is presented in Section V and related work is shown in Section VI. We conclude our paper in Section VII.

Ii Problem Description

Given structured network management databases, our goal is to create a graph database of network management by finding the most accurate matched field pairs among different tables in these databases. The matching of two fields is determined by the matching score measured by their record pair similarities. We utilize the matched results to construct a graph database for semantic query, network analysis, network prediction, etc. [26, 21, 22].

To illustrate our problem and algorithms clearly, we use three sample tables below as a toy example throughout the rest of this paper. In Table II, PRODUCT Table () contains 2 fields (primary key), and with 7 records respectively. In Table II, INCIDENT Table () contains 2 fields (primary key), and with 7 records respectively. In Table III, ORDER Table ( ) contains 3 fields (primary key), , and with 7 records respectively. The problem is to find whether these 7 fields match among the Table II, II and III by evaluating the matching of their records, then to construct a graph database for efficient network analysis and management.

product_id family
107 AIR series
108 con series
109 con series
150 47-7000
151 cisco0500
152 80-7066C
153 con5100
TABLE II: INCIDENT ()
incident_key prod_key
201 107
202 107
203 108
204 109
207 150
208 151
209 152
TABLE I: PRODUCT ()
order_key incident_id product_name
301 201 AIR1212AC
302 201 AIR1002
303 203 con5122
304 204 mem-4700m-64d=
305 207 47-7066C
306 208 cisco0510
307 208 cs6012
TABLE III: ORDER ()

Record Matching: Given two instances e1 and e2, a record matching function is defined as a 4-tuple: where e1 and e2 are two field records; is a similarity score (typically in the [0, 1]) between e1 and e2; is a relation (e.g., equivalence, part-of, overlapping, etc.) between e1 and e2. The matching function asserts that the relation holds between the record and with score . For numerical matching, only if two records are equal, they are considered matched. For example, records {107, 108, 109, 150, 151, 152 } in field in Table II are matched as the equal records in field in Table II, respectively. For a non-numerical pair, however, if the similarity score of two records are higher than a threshold based on a similarity metric, they are considered matched. Here we consider the part-of relation in the network management databases for meaningful relations including subgroup versus group, product versus product family, subseries versus series, and so on. For example, the record “con5122” in field and the record “con5100” in field can have high similarity score with part-of relation. A record pair which is matched is called a matched record pair, and it is called a non-matched record pair if the pair is not matched.

Field Matching: A field here means a database field indicating the names of a column and the single piece of data stored. Given two fields , and a threshold , we define as the matching/similarity score (e.g. Jaccard similarity) between two fields and . If value is above , we call a matched field pair, otherwise it is called a non-matched field pair. In the toy example, has a high matching score with Jaccard similarity, so can be correlated and matched. Moreover, the field pair can also be matched in terms of many matched record pairs such as some pairs , and , and so on.

Graph Database: One effective way to utilize matched results is to construct a graph database for semantic query, network analysis, network prediction, etc. [26, 21, 22], which is also our goal. We define a graph database as a labeled, attributed and undirected graph where is the node set containing all the records appearing in the fields which match, is the edge set between node pairs for node set . is a set of label information of node set , which are the index attributes for the columns of the table. is a set of label information of edge sets , which are the relations of two records from two tables’ column attributes. Specifically, when we construct a graph database from matching of relational databases, a node consists of a field and a record value in a row; the label of comprises of the other field information in the same row as node ; an edge is the matching between two records; indicates the field information when two records of the fields matches. Figure 1 shows an example of a constructed graph database from parts of matched results in Tables II, II and III. For example, a specific node “product_id: 107” in the Table II has a node label “PRODUCT” and “Family:AIR series”. Node matches with “prod_key: 107” in Table II, so it is connected to the node “Incident_key:202” with a Product-incident relation, to the node “Incident_key: 201” with a Product-incident relation, and to the node “Family:AIR series” with a Product-family relation.

Fig. 1: Graph database created from Table II, II and III

Iii Matching Algorithm

To find whether two fields match, one simple way is to use field name matching. If the name of two fields are the same or similar, they are matched. However, this is not reliable for many database sources because they are noisy and irregular. Moreover, the network databases comprise of numerical and non-numerical fields, and they have different attributes and matching requirements. For numerical matching, we consider equivalence relation between record pairs. For non-numerical matching, however, we do not directly consider the equivalence relation as the matching standard. non-numerical record values are possible to be semantically correlated with different names. For example, the non-numerical fields and in the example Table II and III have very few common characters on their names, but they are semantically correlated that has a part-of relation with . Moreover, in terms of field records, records “cisco0510” and “cisco0500” in these two fields can be considered belonging to the same family and being matched with a high similarity. However, the record pair “47-7066C” and “80-7066C” are considered to be non-matched with different families, even though the two strings have many common characters. Details will be discussed in III-C). Hence, we make use of the record matching to decide whether two fields match to improve matching accuracy and satisfy semantic matching. Overall, we match numerical and non-numerical fields separately and design different matching algorithms for each of them.

Iii-a System Overview

The system overview is shown in Figure 2. We divide the structured data into numerical data with only numerical fields and non-numerical data with only non-numerical fields. In each part, we develop an independent matching algorithm for field matching. Matching algorithms for numerical and non-numerical data are quite different, which will be introduced in section III-B and III-C. The results of each part are combined together to load into a graph database.

Fig. 2: System diagram of automatic integration for network databases

Iii-B Numerical Matching

Numerical fields are table fields with records which are numerical values. For example, the and in Table II are numerical fields. Their record values construct a basis for similarity metrics of fields. We define each numerical field record values as a set. This is transfered to a problem of set similarity.

There are some common methods for solving set similarity including Jaccard index, Dice index, Hamming distance, cosine similarity

[6], etc. However, it is not practical to just use one method to get accurate decision bounds of matching because of the noisiness and complexity of the structured databases. We propose a synthetic column-wise numerical matching algorithm to get the decision bounds to determine whether two fields are matched or not. The numerical matching algorithm is shown in Figure 3.

Fig. 3: Numerical matching flow

The process of this algorithm is shown as follows:

  • We preprocess the numerical data, such as removing null values, negative values and some exceptional non-numerical values in every field.

  • We apply range difference similarity metric to all the preprocessed numerical field pairs between every two tables.

  • After we have got the range difference similarity score for each field pair, a threshold (which will be discussed later) will be decided to cut the filtered results.

  • Finally, bucket dot product similarity metric will be applied to the range difference similarity metric’s filtered results. Then we sort the similarity scores of all the pairs to select the most correlated field pairs, that is, top- result as matched pairs with scores above a certain threshold will be selected to input to a graph database.

Iii-B1 Preprocessing

To deal with the irregular and noisy data, we do some preprocessing of field record values before the formal similarity calculations. This includes removing null values, negative values, any non-numerical values and considering unique values only. We do not consider negative values because they are useless and noisy data in the real databases. Almost all the table fields are about identifications, keys or numbers which can possibly be matched among them. In our databases, an average of of field record values are removed (not including unique value reduction) which does not impact the evaluation of instance-based algorithms.

Iii-B2 Range Difference Similarity Metric

Considering the noisy and sparse characteristics of data, Jaccard similarity coefficient as a similarity metric, which measures how many common values between two sets, is not ideal for differentiating some matched pairs and non-matched pairs. For example, for an non-matched field pair with limited number of records, the number of common values in these two fields might take a large portion and hence the Jaccard similarity is very high for them. Their distribution of range, however, can be quite different, which is probably not to be matched in most cases. Therefore, we propose range difference (RD) similarity metric to measure the distribution of these field pairs first. Using RD similarity metric first, we can effectively and efficiently prune lots of unwanted computations, which can also hugely reduce time consumption for further matching. Given a field set

, we sort the record value and then get the different percentile (10th, 20th, 30th,…, 90th percentile). The percentile value is recorded as . Therefore, given two field record sets, and , the RD value for each percentile is given as

(1)

We use 20th plus 30th percentile as the low range coverage, and 80th plus 90th percentile as the high range coverage, to cover the distribution of field record values. Hence the RD similarity score for a field pair (A, B) is defined as:

(2)

To keep consistent with the general similarity metric and be convenient for comparisons, we use 1 minus the averaged RD value as the RD similarity score RDS. The metric based on this similarity score is called RD similarity metric. Using this similarity metric, we can get the similar distribution for matched pairs. The value of is in [0, 1]. The bigger the similarity score, the more correlated the pair. There are several cases about the similarity score considered here:

(1) If there are no overlaps between two field ranges, would be as low as the minimum value 0.

(2) If two fields have similar distributions, would be higher, up to 1.

(3) If two field ranges overlap at the head, tail or in the middle, can fall into a middle value.

In our toy example, the matched pairs and have values as high as 0.996 and 0.999 respectively. In contrast, the pairs and have no overlaps with value 0, which are not matched pairs. The more correlated the field pairs are, the higher RD similarity score they have. Therefore, using a threshold to filter results, we can almost rule out case (1) and part of case (3), then mainly consider case (2) to differentiate them further. To minimize the error of RD similarity score in the first step, we can use a conservative threshold close to the boundary to only filter out definite non-matched pairs, which will be discussed in the section V.

Iii-B3 Bucket Dot Product Similarity Metric

After we consider the distribution of field pairs with range difference similarity metric, we propose bucket dot product (BDP) similarity metric to further refine the filtered results of RD similarity metric. BDP similarity metric is to divide the whole concatenated ranges of two fields into different bucket/bins and compress each bucket as one point to calculate dot product similarity. The intuition behind this is that matched pairs generally have more common values than non-matched pairs. If we increase the bucket size up to a certain value to calculate dot product, it can make the similarities of all the non-matched pairs decrease more, and meanwhile make the similarities of all the matched pairs drop less, therefore it effectively increases the similarity gaps between matched pairs and non-matched pairs. Therefore, a good design of BDP will help significantly differentiate between matched pairs and non-matched pairs.

The general dot product similarity of two vectors

and with elements is defined as follows:

(3)

We use the bucket number () to determine the number of buckets for calculating the dot product. Given two field record sets and , we first derive the required vectors and for the input to the BDP similarity calculation. The vectors and derived from and are constructed in this way. Given two sets and , we concatenate and ’s value ranges as a combined set , and then divide into several buckets according to the . If there is any one value in or falling in a bucket, the bucket point for or ’ is 1, otherwise it is 0. Then we apply the general dot product similarity to and . Therefore, the BDP similarity score (normalized) is defined as follows:

(4)

where decides the sparsity/density of range distributions. Since sets and usually have different sizes with different ranges, it would make sense for the same for each set. For example, we calculate the BDP similarity for a field pair in Table II and III as and with . We first concatenate these two field ranges into a set {201, 202, 203, 204, 207, 208, 209}. Then we construct a set ={{201, 202, 203},{204, 207, 208}, {209}} according to the bucket number . After that, we get the vector = {1, 1, 1} and = {1, 1, 0}. Finally, the BDP similarity score is 0.816, which is high for matching. If we set as 4, BDP similarity score for this pair is 1, which is the highest for matching.

is also an important factor to affect the quality of this metrics. According to our experimental observations, it is affected by the data range and distribution. Generally, matched pairs would have more similar ranges than non-matched pairs. A trade-off value of would effectively improve matched pairs’ similarities more and also do not help improve non-matched pairs’ similarities much, which can potentially increase more gaps between matched and non-matched pairs. The selection of will be discussed later in the section V.

Iii-B4 Primary Key Constraint Matching

If we consider all numerical fields and then do match computations between each other, it would be very time-consuming and not scalable when we have a large number of fields. If there are tables from database sources, and an average of fields in each table, the maximum comparisons are required. If we consider only table and table matching in a semantic way, the primary key constraint [4, 17] can be utilized to reduce the time for pair comparisons. Moreover, relational databases generally have primary keys and closely-related foreign keys in a table, which describe the semantic meanings of the table. It is important to match two tables in this semantic way to construct an effective graph database. Assuming every table has at least a primary key, we can identify the primary key and use the primary key’s records to compare with all other keys records in other tables. We identify the primary key as a field that has unique record values before we apply RD and BDP similarity metrics. For example, and are identified as primary keys in these two tables II and II, so we only need to compare , , and in total, reducing 33% times of comparisons.

Iii-B5 Final Top- Selection

To construct a high quality graph database, more true positive field pairs are preferable from higher similarity scores. Also, manual thresholds could lead to selection instability of field pairs with low similarity scores, so we seek top-k to further refine the quality of matching for the graph database. We sort all the candidate pairs by the similarity scores in a non-ascending order, then we verify this final field pair matching results to select top- field pairs, which involves only a little human labor.

Iii-C Non-numerical Matching based on Top Priority Matching

We propose fast algorithms for non-numerical field matching–top priority match metric (TPM) for fast filtering, and match ratio score for final similarity computations. The diagram is shown in Figure 4. The main process of our non-numerical matching algorithms is as follows:

Fig. 4: Non-numerical matching flow based on TPM
  • Preprocess non-numerical data: this is an important step that decides the quality of our non-numerical matching algorithm. After splitting non-numerical data from the original databases, we use our designed natural language processing methods of segmentation, stemming and prefixing for every field record value.

  • Calculate the record-wise similarities iteratively: It is very time-consuming to apply cosine similarity to the combination of every record pair in a field pair. Considering the scalability of large-scale data matching, we propose top priority match metric for record-wise similarity calculation. In the process of iterative computations, we check the termination condition to terminate the iterations earlier, which significantly reduces time complexity.

  • Calculate match ratio score: after record-wise similarity computations for every field pair are finished, we calculate the defined matching ratio score for each field pair.

  • Select top- results: we sort the field pairs by the matching ratio score in a non-ascending order, and the top- field pairs are selected as the final results as an input into a graph database.

Here, we discuss each proposed steps in detail.

Iii-C1 Preprocessing

Non-numerical matching considers partial match between two strings. For example, product “cisco0510” and product “cisco0500” are in the same series, which is considered as a partial match. Hence, we propose the following preprocessing method. (1) Parse every record string , remove null value, separate alphabetic and numerical characters into different new substrings, and tokenize the string words. (2) Stem the alphabetic strings of the original record and new substrings. (3) Reserve the prefixes with a certain length of the original numerical strings and the new substrings if they are digital substrings. The prefix length is 2 here according to the general characteristics of network databases and our experiments.

Each record string is preprocessed in those three steps above. For example, we have an original field record {’mem-4700m-64d=’} in Table III. We can obtain a string collection {’4700’, ’64’, ’4700m’, ’d’, ’m’, ’mem 4700m 64d’, ’64d’, ’mem’, ’47xx’} after preprocessing .

Iii-C2 Top Priority Match Metric for Record-wise Similarity

One intuitive way is to preprocess all the combination of record pair comparisons and calculate the similarity of each record-wise pairs. That would be very time-consuming or even unfeasible when the data are large. Specially, if two fields are not correlated as a matched pair, it would be costly for useless computations. Therefore, we propose a fast record-wise matching algorithm called top priority match metric for record-wise similarity (TPM) to fulfill this. Intuitively, if two fields and are correlated, there will be a high percentage of record-wise pairs that have higher similarities. The probability of a matched record pair encountered is higher than non-matched record pairs. Therefore, we first sort all the preprocessed records in each two fields and , then we compute how many of records in are matched with records in from top to bottom, and vice-versa. The comparisons can hence be terminated as long as the current record pair similarity achieves below the similarity threshold we set for deciding the matching of a record pair, which greatly reduce the times of comparisons with combinations.

Iii-C3 Record Pair Similarity

In this fast record-wise comparisons, the record pair similarity used is cosine similarity between two record collections after preprocessing two records. It decides how and when to reduce the comparisons of matching in a fast and effective way. A threshold is to decide how similar a record pair is as a matched record pair, which can also be adjusted by users.

Given a preprocessed string collection and another preprocessed string collection , we remove duplicated elements and transfer them into a set ( union ). Next we generate a binary vector and a binary vector according to the value distribution of and in , then we calculate the cosine similarity between and by getting their dot product divided by their magnitude multiplication.

(5)

For example, we have a preprocessed string collection {cisco, 0510, cisco0510, 05xx}, and {cisco, 05xx, 0500, cisco0500}, we transfer them into a set of union , {cisco, 0510, cisco0510, 05xx, 0500, cisco0500}. Then the binary vectors generated according to , and are {1, 1, 1, 1, 0, 0} and {1, 0, 0, 1, 1, 1}. Finally, we calculate the cosine similarity of and as the similarity of and , that is, .

Iii-C4 Matching Ratio Score

Matching ratio score is proposed to calculate the final similarity for a field pair. After we have gone through the reducing comparisons for record-wise similarities, we select the number of record pairs that have similarity scores above . A matching ratio score as a final field pair similarity is the average of ratios of top matched record pairs calculated as follows:

Given two non-numerical sets and , there are items in and items in B. The matching ratio score(MR) between and is defined as follows.

(6)

where

(7)

and

(8)

where and are the cosine similarities of the record pairs and , respectively. value is in [0, 1] and it is the final similarity score to decide the correlation of each field pair.

Iii-C5 Final Top- Results

Similar to numerical field matching, matched non-numerical field pairs in the result list are more meaningful and important than non-matched field pairs, so we select top- results of non-numerical field pairs sorted with matching ratio scores in non-ascending order as an input to a graph database. value can be selected by users for deciding most effective field pairs in a graph database and limiting the size of the graph database.

Iii-D More Efficient Hashing Algorithm for Non-numerical Matching

The proposed TPM metric can be effective to distinguish between matched and non-matched non-numerical fields. However, it possibly involves all the pairwise record combinations in the worst time complexity, which is time-consuming for large databases with millions of records. Therefore, we propose applying more scalable minHash-locality sensitive hashing algorithm (MH-LSH) [20]

to estimate the matching score of non-numerical fields in the databases. It can greatly reduce the comparison size and time for non-numerical record-wise pairs with little cost of matching accuracy down.

The proposed diagram for non-numerical matching based on MH-LSH is shown in Figure 5. The main process of the algorithm is as follows:

  • Preprocess non-numerical data: this step is the same as the preprocessing step of TPM algorithm.

  • Select matched field pairs fast: we apply the locality sensitive hashing technique in the database field matching for fast selecting field pairs that are correlated. In this process, it uses minHash similarity to decide the filtering results for field pairs.

  • Field pair similarity calculation: We use minHash technique in the databases to estimate the matching score of field pairs.

  • Select top- results: Same as the operation for non-numerical matching based on TPM, we sort the field pairs by the estimated matching score in a non-ascending order, the top- field pairs are selected as the final results of field pairs as an input to a graph database.

Fig. 5: Non-numerical matching flow based on MH-LSH

Here, we discuss important steps in detail.

Iii-D1 Matching Score Estimation with MinHash

The similarity score of matched field pairs can be estimated fast with minHash signatures. Given two fields and , we can evaluate the similarity between them as follows. We choose hash functions . For each hash function , we let a signature of set be for . Let a signature of set be for . Then, the probability that the two sets have the same minHash signatures can be used to estimate the similarity between them.

(9)

Here each record in a field or is also preprocessed with the same preprocessing method of TPM algorithm. The preprocessed record strings are combined into a new set like a word set in a document. We also use shingle (a substring of length ) to create a set of -shingles strings and apply different hash functions on the set of strings.

Iii-D2 Field Pair Selection with Locality Sensitive Hashing

Performing pairwise similarity measurement can be time consuming with large amounts of field pairs available. In order to identify which field pairs are similar quickly, we propose using locality sensitive hashing (LSH) to select the candidate field pairs.

The values of minHash signature for one field are grouped into -tuples (referred to as sketches) with rows. Similar field pairs have similar minHash signatures and hence have a high probability of having the same sketches. Moreover, dissimilar pairs have low chance of falling into the same sketch. The probability that two fields of and have at least one sketch (of size ) in common out of is

(10)

Therefore, we can find the candidate pairs with the designed number and . The selection of and is generally decided by a threshold shown in [20], which indicates how similar the two fields is to be considered as a candidate pair, and can also be set by users. In this way, if pairs with similarity above , they will be selected as candidate pairs to be further estimated, and the matching score between them with minHash will be calculated.

Iv Classification-based Matching

Our proposed algorithms for numerical and non-numerical matching involve several manual thresholds to get the matching results. The manual threshold selection is trivial and may not be generic to new databases. We propose a classification-based method to decide the matching or non-matching. Using our proposed similarity scores as features, we construct a classification-based matching system.

For a classification method, there are four important factors involved–input data, feature design, model selection and training/testing. The details of classification are introduced in the following sections.

Iv-a Sampling for ground truth

To address the problem of limited availability of the ground truth in our Cisco dataset, we use a sampling method [19] to synthesize more ground truth pairs, which preserves the ”synchronization property” (to preserve the Jaccard similarity of original sets). Given an original ground truth and , we sample a new ground truth pair based on and . The sampling process is to make sure if a particular sample is sampled in and is also in , then it is also sampled from [19]. Assume we have only

real numerical ground truth pairs that are real data from the databases and labeled by humans. To extend ground truth pairs, we synthesize 100 times of ground truth pairs from each of the field pairs. For example, in our dataset, there are 60 real numerical ground truth pairs that are real data from the databases and labeled by humans, so we synthesize 6000 more ground truth pairs which are enough to apply to a classifier. Synthetic non-numerical ground truth pairs are also generated in the similar way.

Iv-B Feature Design for Classification

In the previous numerical field matching, we have generated range difference (RD) similarity and bucket dot product (BDP) similarity metrics. We propose to use these similarity scores as features. There is one important manual threshold-bucket number involved while calculating BDP similarity, so we generate 15 different BDP similarity scores based on different bucket numbers in [100, 200, 500, 1000, 2000, 5000, 10000, 50000, 100000, 500000, 1000000, 5000000, 10000000, 50000000, 100000000], which are obtained from the data range of our real ground truth. As a consequence, for each filed pair instance, we have 16 features of scores in total.

For non-numerical field pairs, we have proposed matching ratio score and minHash similarity metric to decide the similarity of these pairs. During the calculation, we have one important record-pair threshold while calculating matching ratio score. Therefore, we use 4 different values in [0.3, 0.4, 0.5, 0.6] to calculate different matching ratios as features. In total, 5 features of scores are obtained for each non-numerical field pair.

Iv-C Training and Testing with SVM

We select Support Vector Machine (SVM) as our classifier for the reason that SVM works well on unstructured data such as non-numerical text and scales well with high dimensional data. We split our data into

as training and validation data, as the test data. In training stage, 5-fold cross validation is used for training and validation, and the results are shown in the experiment.

V Experimental Evaluation

In the section, we evaluate our technique NEMA for structured network database matching. To be specific, we measure the effectiveness of NEMA using ground truths for numerical and non-numerical data that are annotated by humans. Meanwhile, experiments on a large dataset are also conducted, and we show the top-k effective results of matching field pairs. Moreover, comparisons of NEMA with other baseline methods are also shown.

V-a Data

The structured network management databases available in the form of database tables are provided by Cisco Systems, Inc. The dataset includes “install_base” and “service_request” databases. They are generated by different departments and groups of the corporation, which are heterogeneous and diversely distributed.

In these databases, there are 21 tables which contain 1,458 columns. Each column has 10 million records on the average. Out of them, there are 679 numerical fields and 779 non-numerical fields. Therefore, a complete match involves the maximum 1,067,882 field matching decisions. With primary key constrains in numerical matching, there are 5 “primary keys” on average in each table, which would reduce to 374,326 field pairs matching.

We have ground truth field pairs that are annotated by humans to be matched or non-matched field pairs for a subset of the data. There are 60 field pairs of ground truth in numerical data and 40 pairs in non-numerical data. For future reference, table names in service_request database start with “T_”, and start with “X_” in install_base database, respectively.

V-B Experimental Setup

We implement NEMA system in Python. To evaluate the effectiveness of NEMA , we evaluate the numerical and non-numerical algorithm parts, respectively. For each part, we first evaluate its algorithms based on the ground truth data. Then all the column pairs in the large dataset are evaluated in the following experiments shown in sections V-C2 and V-D2, which shows the effectiveness of NEMA. Finally, we compare with the common matching system COMA [3] on the ground truth for both schema-level and instance-level matching.

To construct an effective graph database, identifying accurately matched pairs and reducing non-matched pairs appearing at the top are much more important than finding more non-matched pairs. Also, our evaluation is based on the balanced ground truth of positive and negative pairs. Therefore, we use an overall metric “accuracy” () to evaluate our field matching algorithms. is calculated with true positive(), true negative(), total positive pairs() and total negative pairs() (Here positive means matched and negative means non-matched).

V-C Evaluation based on Numerical Data

We evaluate our technique NEMA on the numerical data in two parts. We use numerical ground truth to evaluate the effectiveness of NEMA numerical algorithm. Then the matching results of all of other numerical field pairs are also described.

V-C1 Evaluating of Ground Truth

We show the evaluation result of NEMA numerical algorithms and the compared baseline method-Jaccard similarity using numerical ground truth (We choose Jaccard similarity since it is an exemplar method considering common values of two sets for similarity calculation). In the dataset, there are 30 matched ground truth field pairs which are originally from fields pairs annotated by humans or from join operations in the databases and proved to be matched field pairs. We randomly sample 300 non-matched field pairs confirmed by humans and select 30 non-matched field pairs from them as non-matched ground truth.

No. Table.fieldA Table.fieldB Matched class
1 T_INCI.prod_hw_key T_HW_PROD.bl_prod_key 1
2 T_INCI.cur_ct_key T_CT.bl_ct_key 1
3 T_INCI.up_tech_key T_TECH.bl_tech_key 1
4 T_INCI.inci_id T_INCI_I2.inci_id 1
5 T_INCI.bl_cot_key T_COT.bl_cot_key 1
6 T_INCI.inci_id T_OR_HD.inci_id 1
7 T_INCI.item_id T_PROD.item_id 1
8 T_INCI.ins_site_key T_SITE.bl_site_key 1
9 T_OR_LN.prod_key T_PROD.bl_prod_key 1
10 T_OR_HD.header_id T_OR_LN.header_id 1
1 T_OR_HD.order_dur T_OR_LN.loc_key 0
2 T_CAL.bl_cal_key T_PROD.item_id 0
3 T_INCI_I2 T_DEFT.deft_id 0
4 T_INCI_I2.res_time T_TECH.sub_tech_id 0
5 X_PRO.list_price T_TECH.sub_tech_id 0
6 T_INCI_I2.res_time T_TECH.bl_tech_key 0
7 T_OR_HD.deliv_dur T_DEFT.deft_key 0
8 T_OR_LN.hold_dur T_TECH.sub_tech_id 0
9 T_IN.serlevel_key T_INCI.last_dur 0
10 T_IN.closect_key T_INCI.resp_tz 0
TABLE IV: Example of numerical ground truth field pairs

Table IV shows 10 matched and 10 non-matched field pair examples of numerical ground truths. Columns “Table.field A” or “Table.field B” shows a table name and its field pair name to be compared. The matched class column indicates that the field pair is matched with value “1” or non-matched with value “0”. For example, in the first row, “T_INCI.prod_hw_key” indicates a field “prod_hw_key” in the table “T_INCI”, and “T_HW_PROD.bl_prod_key” indicates a field “bl_prod_key” in the table “T_HW_PROD”. This field pair about products’ key is matched, which is indicated with “1” in the matched class value. The remaining rows share the same characteristics as well.

Fig. 6: Jaccard similarity metric on the numerical ground truths

The problem of similarity of a numerical field pair is modeled as the problem of similarity of a set pair. The baseline method for the similarity of a numerical field pair is the well-known Jaccard similarity which measures the similarity of two given sets.

Figure 6 shows the Jaccard similarity scores on these ground truth field pairs. Red circle represents matched pairs and blue star represents non-matched pairs. axis denotes the index of these 30 matched and 30 non-matched field pairs, and axis indicates Jaccard similarity score. From this figure, we can see that there are about half of positive and negative pairs mixed together from which are difficult to differentiate.

Fig. 7: RD similarity metric on the numerical ground truths

Figure 7 shows the RD similarity metric on these ground truth field pairs. We know that a pair is more likely to match if its RD similarity score is larger. There are about only positive and negative pairs mixed together, which has greatly improved the result over the Jaccard similarity result.

From the previous result of RD similarity metric, we can set a low threshold and then rule out the result pairs below as non-matched field pairs. To minimize the error of initial matching, we can select a conservative threshold close to 0.1 (all of the uncertain pairs are above 0.1) to eliminate some generally precise non-matched pairs.

After we filter field pairs from RD similarity metric, there are totally 39 field pairs left, that is, about of all ground truth pairs (all of them are non-matched field pairs) are pruned. In these 39 pairs, there are 30 matched pairs and 9 non-matched pairs waiting to be input to the BDP similarity metric.

Figure 8 shows the BDP result after the previous RD similarity metric result. axis denotes the index of these 30 matched and 9 non-matched pairs, and axis indicates the normalized BDP similarity score in a non-ascending order. We can see from that the BDP result has very good decision line between matched pairs and non-matched pairs in which the final threshold is chosen around 0.1. The final accuracy can be up to 95% when is 0.1.

Fig. 8: BDP similarity metric on the remaining numerical ground truth pairs

The bucket number is the main factor affecting the BDP similarity and the final results. To evaluate the effectiveness of different values, we use all the numerical ground truth to test the accuracy with different values from 100, 200, 300, 400, 500 up to 100,000,000 where there exists some numerical records that have values up to 10 billion in almost all the fields. Figure 9 shows the accuracy of BDP similarity metric with different values when the threshold is set at 0.1 and 0.2. It shows a similar summit that the accuracy is at an optimal value when the is around from 10,000 to 200,000. BDP accuracy goes down when the becomes smaller or bigger. We use a trade-off value in our experiments to avoid overfitting.

Fig. 9: BDP accuracy comparison with different bucket numbers on the numerical ground truths

V-C2 Top- Similarity Results

The matching experiment based on all the 679 numerical fields is shown here. Table V shows top- matching results. We use RD similarity score threshold and bucket number for BDP similarity computation. All the rows are accurate matches, which are also confirmed by humans. The accuracy can be up to 100% for the top- results, which shows a great potential for our numerical matching algorithm applied on the large dataset, also reducing lots of human labor for matching. Moreover, with our system-aided matching findings, we can find some pairs matching which are difficult to be found with human annotations such as “changewg_key” and “subregion_key”, indicating which regions that the workgroup mainly serves.

To ensure a graph database more meaningful and complete, the selection of top- results is finally decided by users. Users can observe the top- results and rule out the unwanted matches as the final input pairs to a graph database so as to introduce less “noisy” connections of the graph database.

Table.field A Table.field B BDP similarity
T_DEFT.deft_key T_INCI_DE.bl_def_key 0.844
T_OR_LN.item_id X_PRO.item_id 0.698
T_DEFT.deft_id T_INCI_DE.defect_id 0.682
T_INCI.item_id X_PRO.item_id 0.673
T_DEFT.deft_key T_PROD.item_id 0.64
X_INS.item_id X_PRO.item_id 0.597
T_PROD.bl_prod_key T_SUR.task_key 0.567
T_INCI.changewg_key T_WK.subregion_key 0.551
T_INCI.changewg_key T_WK.wkgrp_key 0.551
T_INCI.changewg_key T_WK.theater_key 0.551
T_INCI.currentwg_key T_WK.theater_key 0.545
T_INCI.currentwg_key T_WK.subregion_key 0.545
T_INCI.currentwg_key T_WK.wkgrp_key 0.545
T_INCI.hwversion_id T_PROD.bl_prod_key 0.507
T_INCI.createwkgrp_key T_WK.theater_key 0.507
T_INCI.createwkgrp_key T_WK.subregion_key 0.507
T_INCI.createwkgrp_key T_WK.wkgrp_key 0.507
T_PROD.item_id X_PRO.item_id 0.468
T_INCI.prod_hw_key T_HW_PROD.bl_prod_key 0.463
T_SUR.evalwkgrp_key T_WK.wkgrp_key 0.433
TABLE V: Top- similarity result of numerical field pairs

V-D Evaluation based on Non-numerical Data

We evaluate our NEMA non-numerical algorithms based on TPM and Hashing on the non-numerical data in two parts as well. We use non-numerical ground truth data to evaluate the effectiveness of our algorithms. The matching results of all the other non-numerical field pairs are then described.

V-D1 Evaluating of ground truth

The experiment for evaluating of non-numerical ground truths is shown here. There are 20 matched ground truth pairs which are annotated by humans. Additionally, 20 non-matched ground truth pairs are randomly selected and verified. Part of examples are shown in table VI.

No. Table.field A Table.field B Matching class
1 T_PROD.item_name X_INS.item_name 1
2 T_COT.cpr_country T_SITE.country 1
3 T_CT.temp_desc T_PROD.item_desc 1
4 T_CT.ctserv_line X_SAH.servline_name 1
5 T_INCI.curr_wg_name T_WK.wkgp_name 1
6 T_CT.temp_desc X_SAH.temp_name 1
7 T_SITE.cust_state X_SAH.billto_state 1
8 T_CT.temp_name X_SAH.temp_desc 1
9 T_PROD.prod_family T_HW_PROD.family 1
10 T_PROD.prod_family T_HW_PROD.erp_family 1
1 T_INCI.init_gp_name T_SITE.address 0
2 T_DEFT.deft_submitter T_SITE.email_addr 0
3 T_COT.cpr_country T_INCI.summary 0
4 T_SITE.address1 X_PRO.prod_family 0
5 T_PROD.prod_family X_SAH.hdrcust_name 0
6 T_INCI.tacpica_ct T_HW_PROD.family 0
7 T_WK.wkgp_desc X_PRO.physisn_loc 0
8 T_SITE.county T_SUR.batchcot_name 0
9 T_INCI.customersw_ver T_SITE.state 0
10 T_OR_LN.partsloc_code X_INS.item_name 0
TABLE VI: Examples of non-numerical ground truths

We first analyze the ground truth record-pairs and show the viability for the record pair similarity threshold . Table VII shows the record pair similarity scores of 9 different record pairs in a field pair (“T_PROD.prod_subgrp”, “T_HW_PROD.platform”). The first 7 rows of pairs are matched record pairs that have higher similarity scores. The last 2 rows are not matched record pairs with lower score of 0.3333, which is lower than 0.4. It is conservative to capture the matching among different records with a similarity value 0.4 assigned to the threshold .

T_PROD.prod_subgrp T_HW_PROD.platform Record similarity
c900 series c900 series 1
c2950 series c2916 series 0.8
1601r series 1601 series 0.775
css2950 css2916 0.667
C2960 C2960CX 0.577
C3560CX C3560X 0.5
AIR35CE AIR35SE 0.4
ts900 cs900 0.333
c800 s800 0.333
TABLE VII: Sample of non-numerical record pairs

We demonstrate the effectiveness of non-numerical algorithms based on TPM and MH-LSH by calculating matching ratio score based on TPM and estimated matching score based on MH-LSH on non-numerical ground truths. Figure 10 shows matching ratio scores of this ground truth data matching in a non-ascending order based on TPM. The matching ratio scores of almost all the matched pairs are above the non-matched pairs’s. If we use the threshold 0.1 or select top- results from this, the accuracy can achieve about , which shows the effectiveness of NEMA based on TPM. Figure 11 shows the matching scores of these ground truths in a non-ascending order based on MH-LSH. Although the decision boundary is not as good as the TPM-based result in Figure 10, the accuracy can achieve 90% when the threshold is 0.1 or top 19 results are selected.

Fig. 10: TPM-based matching ratio score
Fig. 11: MH-LSH-based matching score

V-D2 Top- similarity results

There are 779 non-numerical fields in the large dataset. Considering that almost all the primary keys in a table are numerical fields, we do not consider primary key constraint matching method for non-numerical matching. The record similarity threshold is set to be 0.4 here based on the analysis of Table VII. The final top list of matching ratio scores are obtained based on TPM algorithm from all the non-numerical field pairs.

Table VIII shows the top- results of field pair matching based on TPM. We can see that all the field pairs are matched pairs, and they are also confirmed by humans later, which shows the effectiveness of NEMA based on TPM algorithm.

Table.field1 Table.field2 Matching ratio
T_INCI_I2.currentwg_key T_WK.wkgp_name 0.637
T_INCI_I2.currentwg_key T_WK.wkgp_desc 0.632
T_INCI.initwg_name T_WK.wkgp_name 0.63
T_INCI.initwg_name T_WK.wkgp_desc 0.628
T_WK.wkgmgr_email T_SUR.eval_email 0.626
T_INCI.creatorwg_name T_WK.wkgp_name 0.624
T_INCI.creatorwg_name T_WK.wkgp_desc 0.622
T_INCI.curr_wg_name T_WK.wkgp_name 0.611
T_INCI.curr_wg_name T_WK.wkgp_desc 0.607
X_SAH.billto_state T_SITE.state 0.504
X_SAH.billto_state T_SITE.cust_state 0.499
T_INCI.initwg_name T_INCI_I2.curr_wg_name 0.462
T_INCI.initwg_name T_INCI_I2.wkgrp_name 0.457
T_COT.cpr_country T_SITE.country 0.453
T_SITE.cust_country T_COT.cpr_country 0.453
T_INCI_I2.wkgrp_name T_INCI.curr_wg_name 0.451
T_COT.cpr_country T_SITE.cust_country 0.447
T_INCI.curr_wg_name T_INCI_I2.curr_wg_name 0.442
T_SITE.country T_COT.cpr_country 0.44
T_HW_PROD.erpplatform X_SCDC.productsub_grp 0.431
TABLE VIII: Top- matching results of non-numerical field pairs

V-E Evaluation of Classification-based Matching

As described in the section IV, we use SVM-based classifier to decide the matching of two field pairs. Here we show the numerical and non-numerical data testing result using cross-validation. We have generated 6000 synthetic numerical ground truth and non-numerical ground truth pairs based on the 60 numerical field pair and 40 non-numerical field pair, respectively. We run the whole experiment 20 times and obtain the average validation and testing accuracy which are shown in the table IX. It shows high accuracy of classification-based method to match field pairs with our proposed similarity metrics and also avoids the manual thresholds.

Data/Average ACC Numerical Non-numerical
Validation() 0.995 0.953
Testing() 0.972 0.946
TABLE IX: SVM validation and testing result on synthetic numerical and non-numerical ground truth

V-F Comparisons with baseline methods

We compare our technique NEMA with baseline methods of COMA system [13] and rule-based method Regex [24] here. COMA is a state of the art and popular hybrid matching tool and system supporting both schema-level and instance-level matching. Regex is an instance-level rule-based matching method based on regular expressions. We test the numerical and non-numerical ground truth matching of NEMA and COMA in schema-level and instance-level and Regex in instance-level. Their accuracies and differences on matching results are compared and analyzed quantitatively.

V-F1 Comparison of Accuracy

We measure the accuracy and compare the COMA in the schema level and instance level and Regex in the instance level for numerical and non-numerical data matching. On the schema level matching, COMA uses the best field matching similarity ”0” (which has no corresponding line in the COMA system) as a threshold in the schema-level matching. On the instance level matching, COMA has one similar instance-level matching that uses aggregated maximum record-wise similarities to obtain the final field pair similarities. The record-wise similarity is based on common similarity metrics such as edit distance [32] and trigram [2]. Edit distance is a technique to measure how dissimilar two strings are to one another by counting the minimum number of operations required to transform one string into the other. Trigram is a technique of splitting a string into triples of characters and comparing those to the trigrams of another string. The field matching similarity between two fields and in COMA is defined as follows:

(11)

Regex is a matching method based on regular expression by creating patterns from sampling instances of one field and then match against instances of another field to decide matching.

Figure 12 shows the accuracy comparisons among COMA, NEMA and Regex. COMA-SCH is COMA with schema-level matching algorithm. COMA-ED and COMA-TRG means that COMA uses edit distance and trigram to measure the record similarity in instance-level matching, respectively. Regex is the matching based on regular expression in [24]. NEMA-(RD+BDP)/TPM is NEMA using RD and BDP in numerical data matching, and TPM in non-numerical data matching. NEMA-MH_LSH indicates that NEMA uses minHash-locality sensitive hashing in non-numerical data matching, without MH-LSH bar shown for numerical matching.

The accuracies of NEMA-(RD+BDP)/TPM in numerical and non-numerical data matching can be up to , as high as COMA’s non-numerical data matching, but having higher accuracy than COMA-SCH and Regex matching. For numerical matching, the accuracies of COMA-ED, COMA-TRG and Regex are - lower than NEMA-(RD+BDP)/TPM because of the ineffectiveness to identify non-matched pairs of numerical ground truths. For non-numerical matching, the accuracies of COMA-ED and COMA-TRG are higher than NEMA-(RD+BDP)/TPM which is about higher than Regex. However, the field matching score of COMA is measured based on its general string similarity matching, which is not well applied to the network management database matching for record pair similarity requirements. A large number of pairs with high record similarities in COMA are not thought of as matches in the network management databases, which shows the usefulness of the NEMA non-numerical algorithm.

Fig. 12: Accuracy comparison

V-F2 Comparisons of Mismatched Examples

We further analyze the differences of COMA and NEMA in matching the ground truth field pairs. Table X shows the field pairs in every row and its similarity scores by COMA and NEMA. These field pairs are found to be matched pairs by NEMA with relatively high similarity scores, but COMA shows no similarities with score 0. For COMA, the field names have very few common characters in spelling, even though the semantic commonality exists. NEMA does not rely on the inaccurate schema-level properties, but it uses the record instance for the decisions of field matching, which indirectly considers the semantic correspondences. If the record instances for some matched pairs are incomplete or missing, however, the similarity scores for these field pairs are also low. Table XI shows the two field pairs that have low similarities in NEMA. Although the field names in each pair express the same thing semantically, the record instances in the fields are actually incomplete and have very few in common between each other. However, in our databases, the missing or incomplete field pairs are very few compared to the large number of field pairs, which does not affect the overall performance for the numerical matching with NEMA.

Table.fieldA Table.fieldB COMA-ED NEMA-TPM
T_INCI.ins_site_key T_SITE.partysite_id 0 0.208
T_OR_HD.creator_id T_INCI.lastup_by 0 0.643
T_CT.temp_desc T_PROD.item_desc 0 0.05
TABLE X: Example of field pairs matched by NEMA, but not by COMA
Table.fieldA Table.fieldB COMA-ED NEMA-TPM
T_INCI.bl_cot_key T_CT.bl_cot_key 0.750 0.091
T_SUR.bl_surv_key T_SUR_ANS.bl_surv_key 0.76 0.009
TABLE XI: Example of field pairs matched by COMA, but not by NEMA

Here we analyze the specific record pair examples of non-numerical instance-level matching. COMA uses standard edit distance and trigram to calculate the similarities of records, which is not quite suitable for the matching requirement of network management databases. Table XII below shows 9 examples of records pairs and three different kinds of similarities (NEMA-TPM, COMA-ED, COMA-TRG). The first 7 rows as one group are thought of as matched record pairs, the last two rows in the other group are non-matched record pairs. We can see from that the similarity of matched pairs based on COMA are quite similar around 0.7 for these two groups, from which is not easy to differentiate. While the similarities by NEMA have good differences (0.333 for non-matched pairs, 0.4 above for matched pairs). This further demonstrates that NEMA is more suitable for the network database matching.

Record 1 Record 2 NEMA-TPM COMA-ED COMA-TRG
c900 series c900 series 1 1 1
c2950 series c2916 series 0.8 0.833 0.6
1601r series 1601 series 0.775 0.909 0.738
css2950 css2916 0.667 0.714 0.6
C2960 C2960CX 0.577 0.6 0.775
C3560CX C3560X 0.5 0.833 0.671
AIR35CE AIR35SE 0.4 0.857 0.6
c800 s800 0.333 0.75 0.5
ts 900 cs900 0.333 0.8 0.667
TABLE XII: Examples of record similarity comparisons

V-F3 Comparison of Efficiency

Considering the expensive time consumption for non-numerical field pair matching, we test the efficiency based on the whole non-numerical ground truth. We run the experiment 20 times on the same machine with the same data to calculate the average computation time and standard deviation (SD) without data loading time. Figure

13 shows the total computation time spent for COMA-SCH, COMA-ED, COMA-TRG, Regex, NEMA-TPM and NEMA-MH_LSH. COMA-SCH and Regex using schema information only are fastest among all methods, but the accuracies are the lowest. Except from from COMA-SCH and Regex, COMA-ED is the slowest, taking about 12,000 seconds. NEMA-TPM takes about 2500 seconds, about 5 times speedup than COMA-ED. NEMA-MH_LSH takes 860 seconds, which is about 14 times faster than COMA-ED. NEMA-TPM reaches the best trade-off between accuracy and efficiency among these algorithms.

Fig. 13: Efficiency comparison

Vi Related Work

The structured data matching is an old and important research topic but unsolved and ever-growing problem, which has a wide range of applications in database integration, migration, semantic query, etc. [5]. In survey papers [30, 34], the authors proposes a solution taxonomy differentiating between element and structure level, schema and instance level, language and constraint-based matching techniques. Furthermore, P. Shivaiko et al. [35] reviews the state-of-the-art matching systems which were based on strings, structure, data instance and semantics matching techniques using different schema formats such as database, XML, OWL, RDFS, etc. In database schema matching, previous common matching systems in schema-level are introduced in several prototypes such as Similarity Flooding (SF) [25], Coma [3], etc.

SF [25] is a matching algorithm that models two structured columns to be compared as two directed labeled graphs. It makes use of field data, key properties and the string-based alignment (prefix and suffix test) to obtain the alignments between two nodes of the graph. The similarity is calculated from similar nodes to adjacent neighbors through propagation. Our NEMA only relies on the data instance values to infer the matching of fields, which does not utilize the structured properties and data types. However, SF uses a metric for matching quality based on the intended matching results, which is similar to our accuracy metric based on top- results.

Coma [3] is a composite matching system providing extensible library and framework for combining obtained results. It contains mainly 6 elementary matchers using string-based techniques, 5 hybrid matchers using names and structural paths, and one reuse-oriented matcher based on previous matching results . The composite matcher effectively improves the match quality over single matchers using the default combination strategy. Compared to SF, the overall average matching quality are the best among them [3]. The extended version Coma++ [13] utilizes the shared taxonomy and pivot schema to further improve the overall matching quality. In our evaluation, we compare with the Coma++ method using the default combination strategy and find our technique NEMA overall outperforms than COMA in schema-level matching.

Except from the previous matching approaches using field and structural information matching, data instanced-based approaches [27, 23, 8, 39] use the similarity metric or machine learning or rule-based methods to determine the similarity of fields. In [23], the authors utilize a corpus that contains schema and mappings between some schema pairs, and learn the constraints from schema statistics to help generate more matching pairs. In [8], the authors uses the mutual information of statistics to measures the similarity of schema instances between two columns to decide the matching, which shows an effective method based on instances. Also, the authors in [29] propose a new sample-driven approach which enables the end-users to easily construct their own data to match the source and target schema. [24] proposes a rule-based method by creating regular expression pattern to match against columns. [39]

uses matching-learning technique of training neural networks for getting candidate pairs and then filters the pairs with a rule-based algorithm. Our NEMA uses proposed similarity metrics as features to train a SVM to classify for matching effectively and efficiently. COMA

[13] proposes two instance-level matching methods based on the constraint of instance data and the content-based matching to measure field matching. The constraint-based method relies on the general, numerical and pattern constraint which has specific limitation to the specific data which is not suitable for the network databases. The content-based matching depends on the aggregation of similarity scores of instance contents and it is kind of similar to our NEMA technique on content-based similarity measurement, which are compared with NEMA in the experimental evaluations.

To sum up, most of currently popular matching approaches and systems focus on schema-level information matching. The data instances level matching approaches using field record values are mostly based on some statistical models and machine learning from corpus. We further explore the database instance matching by comparing field records using different metrics and propose effective and overall matching algorithms considering the characteristic of network database matching for a graph database construction for efficient data query, analysis and management.

Vii Conclusion

In this paper we propose a systematic technique NEMA to match databases for network management. Different from previous database matching approaches, we design a technique to match numerical and non-numerical fields in instance-level respectively, which can effectively be integrated into a graph database for network management and analysis. For numerical matching, we propose range difference similarity and bucket dot product similarity metrics. For non-numerical matching, we design top priority match metric and also propose applying minHash-locality sensitive hashing algorithm, which reduces the matching time for large databases. To address the drawback of manual thresholds, an effective classification-based method is also proposed based on the proposed similarity metrics. The results of NEMA are demonstrated to be promising with better efficiency and comparable accuracy than the baseline algorithms.

With the explosion of big data and popularity of distributed graph processing systems, this work has the potential to significantly reduce the human work involving identifying the matching fields for a large graph database construction and also be applied for large-scale data matching. A majority of partial matching pairs can be found by our matching algorithms which are not easily detected by humans.

References

  • [1] C. C. Aggarwal and H. Wang (2010) Graph data management and mining: a survey of algorithms and applications. In Managing and mining graph data, pp. 13–68. Cited by: §I.
  • [2] R. C. Angell, G. E. Freund, and P. Willett (1983) Automatic spelling correction using a trigram similarity measure. Information Processing & Management 19 (4), pp. 255–261. Cited by: §V-F1.
  • [3] D. Aumueller, H. Do, S. Massmann, and E. Rahm (2005) Schema and ontology matching with coma++. In Proceedings of the 2005 ACM SIGMOD international conference on Management of data, pp. 906–908. Cited by: §V-B, §VI, §VI.
  • [4] A. Bhattacharjee and H. Jamil (2009) OntoMatch: a monotonically improving schema matching system for autonomous data integration. In Information Reuse & Integration, 2009. IRI’09. IEEE International Conference on, pp. 318–323. Cited by: §I, §III-B4.
  • [5] S. Castano, A. Ferrara, and S. Montanelli (2018) Matching techniques for data integration and exploration: from databases to big data. In A Comprehensive Guide Through the Italian Database Research Over the Last 25 Years, pp. 61–76. Cited by: §I, §VI.
  • [6] M. Cheatham and P. Hitzler (2013) String similarity metrics for ontology alignment. In The Semantic Web–ISWC 2013, pp. 294–309. Cited by: §III-B.
  • [7] C. Chen, B. Golshan, A. Y. Halevy, W. Tan, and A. Doan (2018)

    BigGorilla: an open-source ecosystem for data preparation and integration.

    .
    IEEE Data Eng. Bull. 41 (2), pp. 10–22. Cited by: §I.
  • [8] W. Chen, H. Guo, F. Zhang, X. Pu, and X. Liu (2012) Mining schema matching between heterogeneous databases. In Consumer Electronics, Communications and Networks (CECNet), 2012 2nd International Conference on, pp. 1128–1131. Cited by: §I, §VI.
  • [9] C. Coronel and S. Morris (2016) Database systems: design, implementation, & management. Cengage Learning. Cited by: §I.
  • [10] C. Daraio, M. Lenzerini, C. Leporelli, P. Naggar, A. Bonaccorsi, and A. Bartolucci (2016) The advantages of an ontology-based data management approach: openness, interoperability and data quality. Scientometrics, pp. 1–15. Cited by: §I.
  • [11] M. Dastbaz, C. Pattinson, and B. Akhgar (2015) Green information technology: a sustainable approach. Morgan Kaufmann. Cited by: §I.
  • [12] X. L. Dong and D. Srivastava (2015) Big data integration. Synthesis Lectures on Data Management 7 (1), pp. 1–198. Cited by: §I.
  • [13] D. Engmann and S. Massmann (2007) Instance matching with coma+. Cited by: §V-F, §VI, §VI.
  • [14] D. Gomez-Cabrero, I. Abugessaisa, D. Maier, A. Teschendorff, M. Merkenschlager, A. Gisel, E. Ballestar, E. Bongcam-Rudloff, A. Conesa, and J. Tegnér (2014) Data integration in the era of omics: current and future challenges. BioMed Central. Cited by: §I.
  • [15] B. Gu, Z. Li, X. Zhang, A. Liu, G. Liu, K. Zheng, L. Zhao, and X. Zhou (2016) The interaction between schema matching and record matching in data integration. IEEE Transactions on Knowledge and Data Engineering 29 (1), pp. 186–199. Cited by: §I.
  • [16] D. S. Himmelstein and S. E. Baranzini (2015) Heterogeneous network edge prediction: a data integration approach to prioritize disease-associated genes. PLoS computational biology 11 (7), pp. e1004259. Cited by: §I.
  • [17] E. Ioannou and Y. Velegrakis (2019) Embench++: data for a thorough benchmarking of matching-related methods. Semantic Web (Preprint), pp. 1–16. Cited by: §III-B4.
  • [18] D. Jang, T. Kim, and H. Kim (2017) History management for network information of iot devices. In International Conference on Security with Intelligent Computing and Big-data Services, pp. 138–145. Cited by: §I.
  • [19] H. Köhler, X. Zhou, S. Sadiq, Y. Shu, and K. Taylor (2010) Sampling dirty data for matching attributes. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, pp. 63–74. Cited by: §IV-A.
  • [20] J. Leskovec, A. Rajaraman, and J. D. Ullman (2014) Mining of massive datasets. Cambridge University Press. Cited by: §III-D2, §III-D.
  • [21] M. Lissandrini, M. Brugnara, and Y. Velegrakis (2018) Beyond macrobenchmarks: microbenchmark-based graph database evaluation. Proceedings of the VLDB Endowment 12 (4), pp. 390–403. Cited by: §II, §II.
  • [22] T. Lose, P. van Heusden, and A. Christoffels (2019) COMBAT-tb-neodb: fostering tuberculosis research through integrative analysis using graph database technologies. Bioinformatics. Cited by: §II, §II.
  • [23] J. Madhavan, P. A. Bernstein, A. Doan, and A. Halevy (2005) Corpus-based schema matching. In 21st International Conference on Data Engineering (ICDE’05), pp. 57–68. Cited by: §VI.
  • [24] O. A. Mehdi, H. Ibrahim, and L. S. Affendey (2012) Instance based matching using regular expression. Procedia Computer Science 10, pp. 688–695. Cited by: §I, §V-F1, §V-F, §VI.
  • [25] S. Melnik, H. Garcia-Molina, and E. Rahm (2002) Similarity flooding: a versatile graph matching algorithm and its application to schema matching. In Data Engineering, 2002. Proceedings. 18th International Conference on, pp. 117–128. Cited by: §VI, §VI.
  • [26] J. J. Miller (2013) Graph database applications and concepts with neo4j. In Proceedings of the Southern Association for Information Systems Conference, Atlanta, GA, USA, Vol. 2324. Cited by: §II, §II.
  • [27] H. Nottelmann and U. Straccia (2007) Information retrieval and machine learning for probabilistic schema matching. Information processing & management 43 (3), pp. 552–576. Cited by: §I, §VI.
  • [28] G. Pantuza, F. Sampaio, L. F. Vieira, D. Guedes, and M. A. Vieira (2014) Network management through graphs in software defined networks. In 10th International Conference on Network and Service Management (CNSM) and Workshop, pp. 400–405. Cited by: §I.
  • [29] L. Qian, M. J. Cafarella, and H. Jagadish (2012) Sample-driven schema mapping. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pp. 73–84. Cited by: §I, §VI.
  • [30] E. Rahm and P. A. Bernstein (2001) A survey of approaches to automatic schema matching. the VLDB Journal 10 (4), pp. 334–350. Cited by: §VI.
  • [31] A. D. Raut and M. Atique (2014) A survey of indexing techniques for xml database. Compusoft 3 (1), pp. 461. Cited by: §I.
  • [32] E. S. Ristad and P. N. Yianilos (1998) Learning string-edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (5), pp. 522–532. Cited by: §V-F1.
  • [33] S. Sahu, A. Mhedhbi, S. Salihoglu, J. Lin, and M. T. Özsu (2017) The ubiquity of large graphs and surprising challenges of graph processing. Proceedings of the VLDB Endowment 11 (4), pp. 420–431. Cited by: §I.
  • [34] P. Shvaiko and J. Euzenat (2005) A survey of schema-based matching approaches. In Journal on data semantics IV, pp. 146–171. Cited by: §VI.
  • [35] P. Shvaiko and J. Euzenat (2013) Ontology matching: state of the art and future challenges. IEEE Transactions on knowledge and data engineering 25 (1), pp. 158–176. Cited by: §VI.
  • [36] M. Stonebraker and I. F. Ilyas (2018) Data integration: the current status and the way forward.. IEEE Data Eng. Bull. 41 (2), pp. 3–9. Cited by: §I.
  • [37] N. Sünderhauf, T. T. Pham, Y. Latif, M. Milford, and I. Reid (2017) Meaningful maps with object-oriented semantic mapping. In Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on, pp. 5079–5085. Cited by: §I.
  • [38] J. Wang, Y. Wu, N. Yen, S. Guo, and Z. Cheng (2016) Big data analytics for emergency communication networks: a survey. IEEE Communications Surveys & Tutorials 18 (3), pp. 1758–1778. Cited by: §I.
  • [39] Y. Yang, M. Chen, and B. Gao (2008) An effective content-based schema matching algorithm. In 2008 International Seminar on Future Information Technology and Management Engineering, pp. 7–11. Cited by: §VI.