Machine learning is a scientific discipline concerned with the design and development of algorithms that allow computers to “learn” from data. More precisely, “learning” here means the ability to automatically recognize complex patterns and make “intelligent” decisions based on data. Machine learning is therefore closely related to fields such as statistics, probability theory, data mining, pattern recognition, artificial intelligence, adaptive control and theoretical computer science.
Machine learning algorithms can be classified into the following types:
supervised learning algorithms: a function (classifier) that maps inputs to outputs is generated from labeled input-output examples in the training set;
unsupervised learning algorithms: patterns in the input are recognized, and the examples have no labels;
semi-supervised learning algorithms: information from labeled and unlabeled examples is combined;
reinforcement learning: actions are generated from observations of the world. Every action has some impact on the environment, and the environment provides feedback that is translated into a score that guides the learning process.
The principal supervised learning techniques currently applied or under consideration at statistical agencies worldwide to solve the record linkage matching problem are: classification trees [Cohen98, elfe2002] [Bilenko2003, Christen2008a, Christen2008b] and neural networks [Wilson2011]. In this short paper, another machine learning technique is proposed to solve the record linkage problem: the multi-criteria classification method Electre Tri. It is the first time that a multi-criteria machine learning technique is used to solve the record linkage problem.
This application addresses one of the “many challenges in applying supervised machine learning to record linkage matching” [Kenn15], showing that the multi-criteria classification method Electre Tri provides good results on the record linkage problem in terms of classification model performance. The application is important in light of the growing use of data from administrative sources. In this context, an important problem is finding matching pairs of records from heterogeneous databases while preserving the privacy of the database parties. To this end, the secure computation of distance metrics is important for secure record linkage [RCF2004].
The paper is organized as follows. Section 2 introduces the record linkage problem; Section 3 describes the Electre Tri method, used to solve the record linkage problem; and Section 4 reports a preliminary experiment conducted on simulated data. The paper closes with some final remarks and conclusions.
2 Linked Data: the Record Linkage
Generally speaking, when two data sets are integrated, the objective is to detect those records in the different data sets that belong to the same statistical unit. This allows the reconstruction of a unique record that contains all the information collected on that unit from the different data sources.
Record linkage is therefore the methodology of bringing together corresponding records from two or more files, or of finding duplicates within files [Winkler99]. For the first situation, the definition of record linkage in [fellegi69] is more precise: “Record linkage is a solution to the problem of recognizing those records in two files which represent identical persons, objects, or events (said to be matched)”.
The term record linkage originated in the public health area when files of individual patients were brought together using name, date-of-birth and other information [Winkler99].
One of the main motivations for using record linkage is the construction of large databases to answer new information needs [felle97].
To better understand the problem, a small practical example is now presented. Suppose the user wants to link two data sets of persons, A and B, for which the variables Name, Address and Age are known.
Suppose that Table A contains the following values:
| ID | Name | Address | Age |
|----|------|---------|-----|
| a1 | John A Smith | 16 Main Street | 16 |
| a2 | Javier Martinez | 49 E Applecross Road | 33 |
| a3 | Gillian Jones | 645 Reading Aev | 22 |
Furthermore, suppose that Table B contains the following values:
| ID | Name | Address | Age |
|----|------|---------|-----|
| b1 | J H Smith | 16 Main St | 17 |
| b2 | Haveir Marteenez | 49 Aplecross Raod | 36 |
| b3 | Jilliam Brown | 123 Norcross Blvd | 43 |
The two tables contain two pairs of records that probably refer to the same persons and that the method should identify as matches: ’John A Smith’ with ’J H Smith’ and ’Javier Martinez’ with ’Haveir Marteenez’.
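As a minimal sketch of how such candidate matches might be scored, the snippet below compares every record of Table A with every record of Table B using the similarity ratio of Python's standard `difflib.SequenceMatcher`. Averaging the name and address similarities, and the helper names `similarity` and `best_match`, are assumptions made for illustration; this is not the distance measure used in the experiment of Section 4.

```python
from difflib import SequenceMatcher

# Records from the example tables: id -> (name, address, age).
TABLE_A = {
    "a1": ("John A Smith", "16 Main Street", 16),
    "a2": ("Javier Martinez", "49 E Applecross Road", 33),
    "a3": ("Gillian Jones", "645 Reading Aev", 22),
}
TABLE_B = {
    "b1": ("J H Smith", "16 Main St", 17),
    "b2": ("Haveir Marteenez", "49 Aplecross Raod", 36),
    "b3": ("Jilliam Brown", "123 Norcross Blvd", 43),
}


def similarity(rec_a, rec_b):
    """Average string similarity of the name and address fields, in [0, 1]."""
    name = SequenceMatcher(None, rec_a[0].lower(), rec_b[0].lower()).ratio()
    addr = SequenceMatcher(None, rec_a[1].lower(), rec_b[1].lower()).ratio()
    return (name + addr) / 2


def best_match(id_a):
    """Return the id of the Table B record most similar to record id_a."""
    return max(TABLE_B, key=lambda id_b: similarity(TABLE_A[id_a], TABLE_B[id_b]))


# Print, for each record of A, its nearest record of B and the score.
for id_a in TABLE_A:
    id_b = best_match(id_a)
    print(id_a, id_b, round(similarity(TABLE_A[id_a], TABLE_B[id_b]), 2))
```

Every record of A receives a nearest neighbour in B, including a3, which has no true counterpart. A score alone is therefore not enough: a decision rule or a classifier is still needed to separate matches from nonmatches.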
Modern record linkage begins with the pioneering work of Newcombe et al. [newcombe59], who introduced odds ratios of frequencies and the decision rules for delineating matches and nonmatches. In recent years, advances have yielded computer systems that incorporate sophisticated ideas from computer science, statistics and operations research [Winkler99].
Then, Fellegi and Sunter [fellegi69] introduced a mathematical foundation for record linkage. Their theory demonstrated the optimality of the decision rules used by Newcombe and introduced a variety of ways of estimating crucial matching probabilities (parameters) directly from the files being matched.
Formally, given two files A and B to be matched, each pair of records, one from A and one from B, has to be classified as either a true match or a true nonmatch.
The odds ratio of probabilities is:
$$R = \frac{P(\gamma \in \Gamma \mid M)}{P(\gamma \in \Gamma \mid U)}$$
where $\gamma$ is an arbitrary agreement pattern in the comparison space $\Gamma$, $M$ is the set of true matches and $U$ is the set of true nonmatches. Between these two sets lies the intermediate set of the possible matches.
The decision rule reported below helps to classify the pairs:
if $R > \mathit{Upper}$, then the pair is a designated match,
if $\mathit{Lower} \le R \le \mathit{Upper}$, then the pair is a designated potential match,
if $R < \mathit{Lower}$, then the pair is a designated nonmatch.
Estimating the thresholds Upper and Lower objectively is not easy; the choice is left to the analyst. The decision rule creates three different sets: the designated matches, the designated potential matches and the designated nonmatches. They constitute a partition of the set of all record pairs into three subsets (matches, potential matches and nonmatches) whose pairwise intersections are empty.
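The decision rule can be sketched in a few lines of Python. The sketch assumes binary agreement patterns (one agree/disagree flag per compared field) and conditional independence of the fields, so that the ratio factorizes over the fields. The agreement probabilities and the two thresholds below are hypothetical values chosen only for illustration, and the function names are not taken from any library.

```python
from math import prod


def odds_ratio(pattern, m_probs, u_probs):
    """Ratio R = P(pattern | M) / P(pattern | U) for a binary agreement pattern.

    Assumes the fields agree or disagree independently of each other
    within the true matches (M) and within the true nonmatches (U).
    """
    p_given_m = prod(m if agree else 1 - m for agree, m in zip(pattern, m_probs))
    p_given_u = prod(u if agree else 1 - u for agree, u in zip(pattern, u_probs))
    return p_given_m / p_given_u


def classify_pair(r, lower, upper):
    """Apply the three-way decision rule to an odds ratio r."""
    if r > upper:
        return "match"
    if r < lower:
        return "nonmatch"
    return "potential match"


# Hypothetical probabilities of agreement on (name, address, age).
M_PROBS = (0.9, 0.8, 0.7)  # among true matches
U_PROBS = (0.1, 0.2, 0.3)  # among true nonmatches
LOWER, UPPER = 0.1, 10.0   # hypothetical thresholds chosen by the analyst

for pattern in [(True, True, True), (True, False, False), (False, False, False)]:
    r = odds_ratio(pattern, M_PROBS, U_PROBS)
    print(pattern, classify_pair(r, LOWER, UPPER))
```

Moving the two thresholds apart enlarges the set of potential matches, which is the part of the partition that typically needs further review.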
The idea is to solve the record linkage problem as a multi-criteria classification problem whose a priori defined classes are the subsets of this partition.
Without going into too much detail, the next section gives a brief introduction to the Electre Tri method.
3 The multi-criteria method Electre Tri
In Multi Criteria Decision Aid, a finite set of objects (alternatives, actions, projects) is evaluated by a finite set of criteria, which measure their performances.
A criterion is a real-valued function $g_j$ such that $g_j(a)$ indicates the performance of the alternative $a$ on the criterion $g_j$. The comparison of any pair of alternatives $a$ and $b$ may be grounded on the comparison of the two values $g_j(a)$ and $g_j(b)$ [Electre.manual].
In general, a criterion can be of either gain or cost type: gain means that the Decision Maker (DM) prefers higher values on the criterion, while cost means that the DM prefers lower values.
Many types of criteria have been studied in the literature, such as the true-criterion, pseudo-criterion, pre-criterion, semi-criterion and others [Electre.manual].
In the case of a true-criterion, if the difference between two performances is positive, the alternatives are in a strict preference relation, whereas if the difference is equal to 0, they are in an indifference relation.
Electre Tri is a pseudo-criterion-based method. This type of criterion takes into account that data can be affected by errors due to uncertainty and imprecision, so that small and large differences in performance cannot imply the same binary relation. To define “small” and “large”, two values are considered: the indifference and preference thresholds.
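A minimal sketch of how a pseudo-criterion might turn a difference of performances into a binary relation, assuming a gain-type criterion with a constant indifference threshold q and a constant preference threshold p (with q not larger than p). The function name `pseudo_criterion_relation` is illustrative, and the label "weakly preferred" for the zone between the two thresholds follows common usage in the Electre literature rather than the text above.

```python
def pseudo_criterion_relation(g_a, g_b, q, p):
    """Relation between alternatives a and b on one gain-type criterion.

    q is the indifference threshold and p the preference threshold,
    with 0 <= q <= p.
    """
    if not 0 <= q <= p:
        raise ValueError("thresholds must satisfy 0 <= q <= p")
    diff = g_a - g_b
    if abs(diff) <= q:
        # Small difference: not significant given the imprecision of the data.
        return "a indifferent to b"
    better, worse = ("a", "b") if diff > 0 else ("b", "a")
    # Large difference: strict preference; in between: weak preference.
    strength = "strictly" if abs(diff) > p else "weakly"
    return f"{better} {strength} preferred to {worse}"


print(pseudo_criterion_relation(10, 9.5, q=1, p=3))
print(pseudo_criterion_relation(10, 8, q=1, p=3))
print(pseudo_criterion_relation(10, 5, q=1, p=3))
```

Setting q = p = 0 recovers the true-criterion, where any positive difference implies strict preference.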
In the literature, grouping problems are divided into clustering, classification and sorting problems, depending on whether the classes are known a priori or a posteriori. The sorting problem is a classification problem handled with a multi-criteria approach, which requires preference information from the Decision Maker (DM). The aim of an ordinal sorting problem is thus to assign each alternative to one of the predefined ordered categories.
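Formally, given $k$ predefined ordered categories $C_1, \dots, C_k$ and a finite set of alternatives $A = \{a_1, \dots, a_n\}$, evaluated on a finite set of criteria $G = \{g_1, \dots, g_m\}$, in the case where all criteria are gain-type the relations among the categories are $C_1 \prec C_2 \prec \dots \prec C_k$, where $b_h$ is the profile that is the upper limit of category $C_h$ and the lower limit of category $C_{h+1}$, for $h = 1, \dots, k-1$. In this way, $C_1$ and $C_k$ are the worst and the best categories respectively.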
The Electre Tri method is based on outranking relations, denoted by S, which characterize how the alternatives compare with the profiles, because the assignment of an alternative to a specific category follows from the comparison, on all criteria, of its performances with those of the profiles.
The relation $a \, S \, b_h$ validates or invalidates the assertion “$a$ outranks $b_h$”, whose meaning is “$a$ is at least as good as $b_h$” on the set G of criteria, where $a$ is an alternative and $b_h$ is a profile.
In the context of the Electre Tri method, the outranking relation is validated by computing four indices [Mousseau.Slowinski:1998, Electre.manual]:
the partial concordance indices on each criterion;
the global concordance index on all the criteria;
the partial discordance indices on each criterion;
the credibility index on all the criteria.
Computing the partial concordance indices requires the values of the profiles and of the preference and indifference thresholds; if any of these parameters is unknown, the index cannot be computed. Computing the global concordance index requires the weights, which represent the importance coefficients of the criteria. Computing the partial discordance indices requires the values of the profiles and of the preference and veto thresholds. The credibility index corresponds to the global concordance index weakened by veto effects; if veto thresholds do not enter the model, the credibility index is equal to the global concordance index. To pass from the credibility index to an outranking relation, a cutting level lambda must be fixed: it is the minimum value of the credibility index for which the outranking relation holds. Finally, the assignment of an alternative to a category does not result directly from the outranking relation; one (or both) of the two proposed exploitation procedures, the pessimistic and the optimistic assignment procedures, must be used. These procedures analyze the way an alternative compares to the profiles so as to determine the category to which the alternative should be assigned.
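The chain from indices to assignment can be sketched as follows. This is a minimal illustration, assuming gain-type criteria, thresholds that are constant per criterion, and the piecewise-linear form of the indices in which Electre Tri is commonly presented; only the pessimistic procedure is shown. All function names and numeric values are hypothetical, and categories are numbered from 1 (worst) upward.

```python
def partial_concordance(g_a, g_b, q, p):
    """Support of one criterion for 'a is at least as good as b', in [0, 1]."""
    diff = g_b - g_a
    if diff <= q:
        return 1.0
    if diff >= p:
        return 0.0
    return (p - diff) / (p - q)


def partial_discordance(g_a, g_b, p, v):
    """Opposition of one criterion to 'a is at least as good as b', in [0, 1]."""
    diff = g_b - g_a
    if diff <= p:
        return 0.0
    if diff >= v:
        return 1.0
    return (diff - p) / (v - p)


def credibility(a, b, weights, q, p, v=None):
    """Global concordance, weakened by veto effects when v is given."""
    conc = [partial_concordance(x, y, qj, pj) for x, y, qj, pj in zip(a, b, q, p)]
    global_c = sum(w * c for w, c in zip(weights, conc)) / sum(weights)
    if v is None:
        return global_c  # no veto: credibility equals global concordance
    sigma = global_c
    for x, y, pj, vj in zip(a, b, p, v):
        d = partial_discordance(x, y, pj, vj)
        if d > global_c:  # only strong discordance weakens the index
            sigma *= (1 - d) / (1 - global_c)
    return sigma


def pessimistic_assignment(a, profiles, weights, q, p, lam, v=None):
    """Compare a with the profiles from best to worst; stop at the first one it outranks."""
    for h in range(len(profiles), 0, -1):
        if credibility(a, profiles[h - 1], weights, q, p, v) >= lam:
            return h + 1
    return 1


# Hypothetical model: two criteria, three categories, two profiles.
PROFILES = [(0.4, 0.4), (0.8, 0.8)]
Q, P, W, LAM = (0.05, 0.05), (0.1, 0.1), (1, 1), 0.75
for alt in [(0.9, 0.85), (0.5, 0.6), (0.1, 0.2)]:
    print(alt, pessimistic_assignment(alt, PROFILES, W, Q, P, LAM))
```

Raising the cutting level makes outranking harder to establish, so under the pessimistic procedure alternatives tend to move towards lower categories.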
One of the main difficulties is the elicitation of the various parameters, which in Electre Tri are the profiles, the weights, the thresholds (preference, indifference and veto) and the cutting level lambda. Even though these parameters can be interpreted, it can be difficult to fix their values directly (direct elicitation) and to have a clear global understanding of the implications of these values in terms of the output [Mousseau.Slowinski:1998].
In order to estimate the parameter values indirectly, De Leone and Minnetti [rdlvm:electre] proposed a new estimation methodology whose procedure is composed of two phases: the first is dedicated to the estimation of the profiles and thresholds, the second to the estimation of the weights and of the cutting level. The core of the procedure is the estimation of the profiles, performed by Linear Programming (LP) on a training set.
Let be the number of categories, the number of criteria, the LP problem is the following:
where is a small positive value.
The problem (1) minimizes the sum, over all criteria and all alternatives in the training set, of the classification errors that arise when an alternative’s performance lies outside the category to which it belongs. The first two constraints define the error.
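As an illustration only, an LP of this kind could take the following shape. The notation is an assumption of this sketch and is not taken from [rdlvm:electre], which gives the exact formulation: there are $k$ categories and $m$ criteria, $T$ is the training set, $h(a)$ is the index of the category to which the training alternative $a$ belongs, $b_h^j$ is the value of profile $b_h$ on criterion $g_j$ (the unknowns), $\theta_{aj}$ is the error of $a$ on criterion $g_j$, and $\varepsilon$ is a small positive value.

```latex
\begin{aligned}
\min \quad & \sum_{a \in T} \sum_{j=1}^{m} \theta_{aj} \\
\text{s.t.} \quad
& \theta_{aj} \ge b_{h(a)-1}^{j} - g_j(a), && a \in T,\ j = 1, \dots, m,\ h(a) > 1, \\
& \theta_{aj} \ge g_j(a) - b_{h(a)}^{j} + \varepsilon, && a \in T,\ j = 1, \dots, m,\ h(a) < k, \\
& b_{h+1}^{j} \ge b_{h}^{j} + \varepsilon, && h = 1, \dots, k-2,\ j = 1, \dots, m, \\
& \theta_{aj} \ge 0, && a \in T,\ j = 1, \dots, m.
\end{aligned}
```

In this sketch, at the optimum the error $\theta_{aj}$ is zero when $g_j(a)$ is at least the lower profile of the category of $a$ and at most its upper profile minus $\varepsilon$; otherwise it grows linearly with the distance from the violated profile.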
4 Application to Real Data: a first experiment
As said in the previous section, the multi-criteria approach requires preference information from the DM, including binary relations. Since the three subsets can be ordered by strict preference (matches are preferred to potential matches, which are preferred to nonmatches), the record linkage problem can be structured as an ordinal sorting problem, that is, a classification problem whose classes are ordered by strict preference binary relations.
Moreover, the advantage of multi-criteria decision methods over other classification methods lies in the possibility of assigning a weight to each criterion, which not all classification methods allow, and of using the preference information provided by the DM to estimate the parameters of the classification model.
The proposed application aims to find a classification model (i.e. a classifier or learner) that assigns each record pair to one of the three categories (matches, potential matches and nonmatches), following the two-phase procedure formulated by De Leone and Minnetti [rdlvm:electre].
The input data used in the application were taken from Winkler's American Census data (in the SecondString files for approximate string matching techniques). Two data sets, A and B, are considered, containing 449 and 392 records respectively, with 327 true links.
The variables (textual fields from synthetic census data) are the following:
DS (labels of the data sets with A and B);
LASTCODE (middle name initial);
NUMCODE (address street number);
STREET (address street name).
In this short paper, results from preliminary experiments are reported, because the application is ongoing research, owing to its complexity.
Some variables contain missing values, which complicate the analysis. To simplify the analysis, the records with missing values were deleted.
There are a number of popular methods for estimating the learner’s ability to generalize; the test set method was used here. In this experiment, the choice of distance measure and the search for the training set played the most important roles; they contributed to the good results of the classification model found by Electre Tri [Min14].
The performance of the classification model, applied to the test set (83868 alternatives) was 99.09% when all the criteria have the same importance and the lambda parameter is .
If lambda increases, the performance increases, up to 99.81% when and 99.89% when .
When the importance coefficients of the criteria were allowed to differ, the performances of the models remained substantially the same as the lambda parameter varied.
In the case of performance 99.09%, the classification errors were made on the false links; that is, the model recognized almost all the true links. The opposite situation occurred in the case of performance 99.89%, when the model recognized almost all the false links and misclassified the true links. The choice of the most suitable model is left to the DM, depending on his or her preferences.
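This trade-off is easier to see when the overall performance is split by class. The sketch below assumes that performance is measured as accuracy (the share of correctly classified pairs) and uses hypothetical counts, not the counts of the experiment; `summarize` and the two model names are illustrative.

```python
def summarize(tp, fn, fp, tn):
    """Accuracy plus the share of true links and of false links recognized.

    tp / fn: true links classified as links / as non-links;
    fp / tn: false links classified as links / as non-links.
    """
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "true_link_recall": tp / (tp + fn),
        "false_link_recall": tn / (tn + fp),
    }


# Hypothetical counts for 10,000 candidate pairs, 100 of them true links.
model_low_lambda = summarize(tp=99, fn=1, fp=80, tn=9820)
model_high_lambda = summarize(tp=60, fn=40, fp=2, tn=9898)

for name, stats in [("low lambda", model_low_lambda), ("high lambda", model_high_lambda)]:
    print(name, {key: round(value, 4) for key, value in stats.items()})
```

Because true links are rare among candidate pairs, a model can score higher overall while missing many of them, which is why the final choice depends on the preferences of the DM.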
5 Final Conclusion and Remarks
In this short paper, the Electre Tri machine learning technique was proposed for solving record linkage matching. It is the first time that a multi-criteria decision technique is used to solve the record linkage problem.
The proposed application started with an initial experiment, which indicates that applying Electre Tri to record linkage can provide good results in terms of classifier performance. Only the results of this preliminary experiment are reported here. The experiment also confirmed that record linkage is more sensitive to the quality of preprocessing and standardization than to that of matching, as stated in [Win14].
As a consequence, other distance measures for the construction of the input data matrix, as well as different schemes for the search of the training set, will be used.