Developing a Temporal Bibliographic Data Set for Entity Resolution

06/20/2018 ∙ by Yichen Hu, et al. ∙ Australian National University

Entity resolution is the process of identifying groups of records within or across data sets where each group represents a real-world entity. Novel techniques that consider temporal features to improve the quality of entity resolution have recently attracted significant attention. However, there are currently no large data sets available that contain both temporal information and ground truth information to evaluate the quality of temporal entity resolution approaches. In this paper, we describe the preparation of a temporal data set based on author profiles extracted from the Digital Bibliography and Library Project (DBLP). We completed missing links between publications and author profiles in the DBLP data set using the DBLP public API. We then used the Microsoft Academic Graph (MAG) to link temporal affiliation information for DBLP authors. We selected around 80K (1%) author profiles that cover about 2 million (50%) publications, using information in DBLP such as multiple author names and personal web profiles, to improve the reliability of the resulting ground truth while keeping the data set challenging for temporal entity resolution research.

1. Introduction

The Digital Bibliography and Library Project (DBLP) is a computer science bibliography database developed and maintained by the University of Trier, which has been used by many researchers over the past three decades for experimental studies (ley_dblp_2002, ). By 2018, DBLP contained more than 4.1 million publications (https://dblp.org/statistics/recordsindblp.html). While DBLP attempts to identify individual authors, in many cases errors caused by homonyms, that is, different authors who share the same name, are not detected (http://dblp.org/faq/). DBLP has the potential to be used as a temporal data set with publications as records, each of which is associated with a time-stamp (i.e. publication date), and with authors identified by their unique identifiers. Because authors may change their affiliation over time, it is challenging to identify all records that refer to the same author throughout their career.

Having a large temporal bibliographic data set can be valuable for real-time and temporal entity resolution research (altowim_progresser:_2018, ; ramadan_dynamic_2015, ). However, so far such a data set has not been made available because the public version of DBLP does not contain reliable publication-author cross references. Developing such a data set is also challenging due to data quality issues around name variations, misspellings, and frequent names shared by many authors. The version of the DBLP data set we use is the archive from 1 April 2018 (http://dblp.org/xml/release/dblp-2018-04-01.xml.gz).

A second publicly available bibliographic data set, the Microsoft Academic Graph (MAG) (sinha_overview_2015, ), was released for the KDD Cup 2016 competition and is also used by us in this paper. The MAG data set used for KDD Cup 2016 and in this paper is a subset that contains records for 2,258,482 publications from the full MAG data set (which contains records for 166,192,182 publications). A significant proportion of the MAG data set overlaps with the DBLP data set, and MAG covers temporal affiliation changes, which are missing in DBLP. However, unlike DBLP, the MAG data set does not contain unique author identifiers. Figure 1 shows the schemas of the DBLP and MAG data sets.

Figure 1. The schemas of the DBLP and MAG source data sets we used to create a temporal DBLP data set. Note that there is no AuthorID in the DBLP Articles table, and the PaperID in the MAG data set is not the same as the ArticleID in the DBLP Articles table. Therefore the three data sets are not cross-referenced with each other.

In this paper, we describe our approach to create a temporal data set for entity resolution using a subset of relatively reliable author profiles from DBLP. We add the temporal affiliation information for this refined temporal DBLP data set using the MAG data set. Finally, we discuss the characteristics of the proposed temporal DBLP data set. To summarize our contributions:

  • We create a temporal author entity data set based on DBLP and using cross-reference information harvested through DBLP’s open API.

  • We complete a significant proportion of temporal information in our proposed temporal data set using affiliation information from the MAG data set.

  • We refine the temporal data set to minimize false matches using information in DBLP, including the existence of multiple names of the same profile, as well as persistent and non-persistent personal URLs.

  • We make the generated temporal data set available on GitHub (https://github.com/E-Chen/A-refined-DBLP-temporal-dataset).

We next provide necessary background, and in Section 3, we discuss the technique we use to create our DBLP temporal data set. In Section 4, we complete the temporal affiliation information in our temporal DBLP data set, and in Section 5 we address the data quality issues by selecting a subset of author profiles that have either multiple names, static affiliation information, or personal web profiles and are more likely to be accurate. We then describe the characteristics of the source and generated data sets in Section 6.

        <www key="homepages/r/CJvanRijsbergen" ... >
        <author>C. J. van Rijsbergen</author>
        <author>Cornelis Joost van Rijsbergen</author>
        <author>Keith van Rijsbergen</author>
        <note type="affiliation">University of Glasgow, UK</note>
        <title>Home Page</title>
        <url>http://www.dcs.gla.ac.uk/~keith/</url>
        </www>
Figure 2. An author record in DBLP.

2. Background

DBLP is a popular data set used to evaluate entity resolution techniques (kopcke_evaluation_2010, ; wang_efficient_2015, ) as well as research in areas such as clustering and bibliometrics. However, DBLP is known to have data quality issues (kopcke_object_2014, ), such as heterogeneity problems, homonyms, and synonyms.

Several recent approaches for temporal entity resolution have used various smaller subsets of DBLP. Li et al. (li_linking_2011, ) used 738 temporal records of 18 authors from DBLP, while Wang et al. (wang_rule-based_2018, ) used 3,572 temporal records of more than 20 authors. Chiang et al. (chiang_tracking_2014, ) used a larger subset of 100K records of an unspecified number of authors; however, this data set was only used for scalability experiments and not to evaluate entity resolution quality.

The affiliation information in DBLP can be useful to study the temporal changes of attribute values in the context of temporal entity resolution (chiang_tracking_2014, ). However, it is difficult to use in a temporal entity resolution scenario because the affiliation information in DBLP is non-temporal, as shown in line 5 in Figure 2. It is therefore currently not possible to determine the time period for an affiliation of an author using only the DBLP data set. Chiang et al. (chiang_tracking_2014, ) manually added temporal affiliation information for 258 authors; however, this approach is expensive and does not scale to the size of DBLP.

3. Linking Publications to Unique Author Profiles

In this section we discuss how we create temporal records for authors in DBLP using the existing DBLP XML file and open API (http://dblp.org/faq/). The desired form of our temporal data set is a temporally sorted list of publication records for each author, where each author has an individual record for each of their publications, each author is identified by a unique author identifier (ID), and each publication is identified by a unique publication ID.

DBLP is available for public download as an XML file (https://dblp.uni-trier.de/xml/) which contains two separate lists of records, as illustrated in Figure 1. The first list contains publication records, where each publication has a unique publication ID (as key in the record), a list of authors described by author tags, the year of the publication, and other publication specific attributes, such as title, venue etc. (ley_dblp:_2009, ). Figure 3 shows an example publication record from the DBLP XML file.

        <inproceedings key="conf/chiir/OBrienFJLTR16" mdate="2016-04-12">
        <author>Heather L. O’Brien</author>
        <author>Nicola Ferro</author>
        <author>Hideo Joho</author>
        <author>Dirk Lewandowski</author>
        <author>Paul Thomas</author>
        <author>Keith van Rijsbergen</author>
        <title>
        System And User Centered Evaluation Approaches in Interactive Information Retrieval (SAUCE 2016).
        </title>
        <pages>337-340</pages>
        <year>2016</year>
        <booktitle>CHIIR</booktitle>
        <crossref>conf/chiir/2016</crossref>
        <url>db/conf/chiir/chiir2016.html#OBrienFJLTR16</url>
        </inproceedings>
Figure 3. A publication record in DBLP.

The second list contains author records, where each author has a unique author ID (as key in the record), and author attributes that specify the names that an author has used in their publications. The author identifiers are created by the editors of DBLP either manually or using algorithms, and contain an unknown percentage of errors (see the news item of 2017-09-15 at https://dblp.uni-trier.de/news/). Some author records, such as the one shown in Figure 2, have associated URLs recorded (such as a university profile page or a Google Scholar profile page, as specified by url tags) and are anticipated to be more reliable than those that do not have a URL (ley_dblp:_2009, ). Since August 2017, some authors have their ORCID imported as well (https://dblp.uni-trier.de/faq/17334571). We discuss the reliability of author profiles further in Section 5. Some authors have one or more affiliation attributes. However, because no time-stamp is attached to an affiliation, it is not known when an author was a member of, or worked for, a certain institution.
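As an illustration of the author record structure described above, the following minimal sketch parses one author record into a dictionary. It assumes the record is a well-formed www element like the one in Figure 2, with the abbreviated attributes left out; the function name parse_author_record and the dictionary field names are hypothetical and are not part of DBLP.

```python
import xml.etree.ElementTree as ET


def parse_author_record(www_xml: str) -> dict:
    """Turn one DBLP <www> author record into a plain dictionary."""
    record = ET.fromstring(www_xml.strip())
    return {
        # The key attribute of the record serves as the author ID.
        "author_id": record.get("key"),
        # Every <author> tag holds one name the author has used.
        "names": [a.text for a in record.findall("author")],
        # Affiliations are stored as <note type="affiliation"> and carry no time-stamp.
        "affiliations": [
            n.text.strip()
            for n in record.findall("note")
            if n.get("type") == "affiliation" and n.text
        ],
        "urls": [u.text for u in record.findall("url")],
    }


# The record from Figure 2, with the elided attributes omitted.
example = """
<www key="homepages/r/CJvanRijsbergen">
  <author>C. J. van Rijsbergen</author>
  <author>Cornelis Joost van Rijsbergen</author>
  <author>Keith van Rijsbergen</author>
  <note type="affiliation">University of Glasgow, UK</note>
  <title>Home Page</title>
  <url>http://www.dcs.gla.ac.uk/~keith/</url>
</www>
"""

profile = parse_author_record(example)
print(profile["author_id"], len(profile["names"]))
```

The sketch handles a single record only; reading the full DBLP XML file would additionally require streaming and entity handling, which are not shown here.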

The DBLP XML file does not contain ID-based cross references between the list of publications and the list of authors (as Figures 2 and 3 show). Author names in the publication list are recorded in plain text and cannot be used to refer to a particular author in the author list when two or more authors share exactly the same name. The records in the author list have no references to their corresponding publications at all.

To solve this issue, we use the public API provided by DBLP to retrieve the publication records of each author, using author identifiers collected from the author list. The API we used in the paper is http://dblp.org/pid/pid.xml, where pid refers to an author ID. Each publication record retrieved for an author is in the same format as the publication record shown in Figure 3. However, since the publications are obtained using a specific author identifier, we can now link them to a specific author profile.
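The query step can be sketched as follows. This is a hedged illustration rather than the code used for the paper: it assumes that the pid is the author key without its homepages/ prefix, and that the response wraps publication records in outer elements (the dblpperson wrapper in the sample is an assumption). The names pid_url and parse_publications are hypothetical, and no network request is made.

```python
import xml.etree.ElementTree as ET

# Record types used for publications in the DBLP XML format.
PUBLICATION_TAGS = {
    "article", "inproceedings", "proceedings", "book",
    "incollection", "phdthesis", "mastersthesis",
}


def pid_url(author_id: str) -> str:
    """Build the query URL for one author ID.

    Assumption: the pid is the author key without its "homepages/" prefix.
    """
    prefix = "homepages/"
    pid = author_id[len(prefix):] if author_id.startswith(prefix) else author_id
    return f"http://dblp.org/pid/{pid}.xml"


def parse_publications(response_xml: str) -> list:
    """Extract key, year, title, and author names from a response document.

    Named character entities (which would need the DBLP DTD) are not handled.
    """
    root = ET.fromstring(response_xml)
    publications = []
    for elem in root.iter():
        if elem.tag not in PUBLICATION_TAGS:
            continue
        title = elem.find("title")
        publications.append({
            "key": elem.get("key"),
            "year": elem.findtext("year"),
            "title": "".join(title.itertext()).strip() if title is not None else "",
            "authors": [a.text for a in elem.findall("author")],
        })
    return publications


# Hypothetical, shortened response; the <dblpperson> wrapper is an assumption.
sample = """<dblpperson>
  <r><inproceedings key="conf/ictir/ZucconAR11" mdate="2017-05-25">
    <author>Guido Zuccon</author>
    <author>Leif Azzopardi</author>
    <author>C. J. van Rijsbergen</author>
    <title>An Analysis of Ranking Principles and Retrieval Strategies.</title>
    <year>2011</year>
  </inproceedings></r>
</dblpperson>"""

publications = parse_publications(sample)
print(pid_url("homepages/r/CJvanRijsbergen"))
print(publications[0]["authors"])
```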

Algorithm 1 shows the procedure used to create a temporal data set using DBLP. Since an author can have a number of names (as Figure 2 shows), we first create an author index that uses the AuthorID of each DBLP author profile as index key, where each key links to its associated author names (lines 3 to 6). Note that in line 4 we use a selection strategy, discussed in Section 5, to select author profiles that have multiple names, an affiliation, or at least one non-persistent or persistent URL.

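Input:
- A: A list of DBLP author profiles
Output:
- T: A list of DBLP temporal records
1: T = [ ]
2: AuthorIndex = { }
3: foreach a in A do
4:     if HasSupport(a) then
5:         foreach name in a.names do
6:             AuthorIndex[a.AuthorID].append(name)
7: foreach id in AuthorIndex do
8:     P = QueryDBLP(id)
9:     foreach p in P do
10:         foreach name in p.authors do
11:             if name in AuthorIndex[id] then
12:                 t = CreateRecord(id, name, p)
13:                 T.append(t)
14: sort T by time-stamp
15: return T
Algorithm 1. Create temporal records
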
        <inproceedings key="conf/chiir/OBrienFJLTR16" mdate="2016-04-12">
        <author>Heather L. O’Brien</author>
        <author>Nicola Ferro</author>
        <author>Hideo Joho</author>
        <author>Dirk Lewandowski</author>
        <author>Paul Thomas</author>
        <author>Keith van Rijsbergen</author>
        <title>
        System And User Centered Evaluation Approaches in Interactive Information Retrieval (SAUCE 2016).
        </title>
        <pages>337-340</pages>
        <year>2016</year>
        <booktitle>CHIIR</booktitle>
        <crossref>conf/chiir/2016</crossref>
        <url>db/conf/chiir/chiir2016.html#OBrienFJLTR16</url>
        </inproceedings>
        ...
        <inproceedings key="conf/ictir/ZucconAR11" mdate="2017-05-25">
        <author>Guido Zuccon</author>
        <author>Leif Azzopardi</author>
        <author>C. J. van Rijsbergen</author>
        <title>
        An Analysis of Ranking Principles and Retrieval Strategies.
        </title>
        <pages>151-163</pages>
        <year>2011</year>
        <booktitle>ICTIR</booktitle>
        <crossref>conf/ictir/2011</crossref>
        <url>db/conf/ictir/ictir2011.html#ZucconAR11</url>
        </inproceedings>
Figure 4. An example of DBLP API query response using author identifier homepages/r/CJvanRijsbergen.

For each AuthorID in the author index, we obtain a list of its associated publications using the DBLP API (lines 7 and 8). For example, when we query the author ID homepages/r/CJvanRijsbergen using the DBLP API, we retrieve a list of publication records as shown in Figure 4.

For each author name in a retrieved publication record, we check whether it exists among the names associated with the queried AuthorID in the author index. When we find an exact match of a name, we create a temporal record using the author ID, the matched author name, and the remaining information from the publication record (lines 9 to 13).

We create a temporal record for each author of a publication if the author has an author ID in the author index, regardless of whether the author is the first author or not. We create a Boolean attribute IsFirstAuthor to indicate if an author is the first author of a certain publication. The reason for creating one temporal record for each author is to make the data set more interesting for entity resolution, as it introduces more temporal records for each author.

For example, let an author profile in the DBLP XML file be {key: homepages/r/CJvanRijsbergen, names: [C. J. van Rijsbergen, Cornelis Joost van Rijsbergen, Keith van Rijsbergen]}. We query the author ID homepages/r/CJvanRijsbergen and retrieve a list of publication records as Figure 4 shows. We compare each name of the author against each author name in each publication record. For the first publication record we can see that Keith van Rijsbergen exactly matches a name in the profile, and we therefore create a temporal record: {AuthorID: homepages/r/CJvanRijsbergen, PublicationID: conf/chiir/OBrienFJLTR16, AuthorName: Keith van Rijsbergen, Year: 2016, CoAuthors: [Heather L. O’Brien,...,Paul Thomas], Title: System And User Centered...}. We can also see that this publication has six authors, and assuming each of them has an author identifier in the author index, six temporal records with different AuthorName, AuthorID and CoAuthors values will be created.
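The procedure and the worked example can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the original implementation: fetch_publications stands in for the DBLP API query and has_support for the profile selection strategy of Section 5, both passed in as hypothetical callables, and the dictionary field names mirror the example record above.

```python
def create_temporal_records(author_profiles, fetch_publications, has_support):
    """Create one temporal record per (author, publication) pair."""
    # Index from author ID to the names of that author (supported profiles only).
    author_index = {}
    for profile in author_profiles:
        if has_support(profile):
            author_index[profile["author_id"]] = list(profile["names"])

    records = []
    for author_id, names in author_index.items():
        for pub in fetch_publications(author_id):
            for position, name in enumerate(pub["authors"]):
                # Only an exact name match links the publication to the author.
                if name in names:
                    records.append({
                        "AuthorID": author_id,
                        "PublicationID": pub["key"],
                        "AuthorName": name,
                        "Year": pub["year"],
                        "CoAuthors": [a for a in pub["authors"] if a != name],
                        "Title": pub["title"],
                        "IsFirstAuthor": position == 0,
                    })
    return records


# Stand-in data taken from Figures 2 and 3; the stub replaces the real API call.
profiles = [{
    "author_id": "homepages/r/CJvanRijsbergen",
    "names": ["C. J. van Rijsbergen", "Cornelis Joost van Rijsbergen",
              "Keith van Rijsbergen"],
}]
stub_response = [{
    "key": "conf/chiir/OBrienFJLTR16",
    "year": 2016,
    "title": "System And User Centered Evaluation Approaches in "
             "Interactive Information Retrieval (SAUCE 2016).",
    "authors": ["Heather L. O’Brien", "Nicola Ferro", "Hideo Joho",
                "Dirk Lewandowski", "Paul Thomas", "Keith van Rijsbergen"],
}]

records = create_temporal_records(
    profiles,
    fetch_publications=lambda author_id: stub_response,
    has_support=lambda profile: len(profile["names"]) > 1,
)
print(records[0]["AuthorName"], records[0]["IsFirstAuthor"])
```

Passing the query and the selection strategy as functions keeps the sketch testable without network access.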

This approach assumes that there is no publication that has two authors with exactly the same name. In other words, while we understand that different authors can share exactly the same name, we assume authors who have exactly the same name are never co-authors of the same paper. For example, if we query for an author who has two names, Tom Peter and T. Peter, and obtain a publication record with two authors, Tom Peter and T. Peter, we would have difficulty deciding which name the author actually used in this publication. When processing the DBLP data set, however, we did not encounter any case where multiple authors shared the same name on the same paper.
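A check for the ambiguous case in the Tom Peter example could look like the sketch below. It flags publications in which more than one listed author name belongs to the queried profile; the function name and the publication keys are hypothetical.

```python
def ambiguous_publications(profile_names, publications):
    """Return keys of publications where several author names match one profile."""
    names = set(profile_names)
    return [
        pub["key"]
        for pub in publications
        if sum(1 for author in pub["authors"] if author in names) > 1
    ]


# Hypothetical publications for an author known as "Tom Peter" and "T. Peter".
publications = [
    {"key": "example/1", "authors": ["Tom Peter", "T. Peter"]},
    {"key": "example/2", "authors": ["Tom Peter", "Jane Doe"]},
]
flagged = ambiguous_publications(["Tom Peter", "T. Peter"], publications)
print(flagged)
```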

This DBLP data set is created to simulate a real-world online database, where records of individuals are added to the database one-by-one in a temporal sequence. Records in our temporal DBLP data set are sorted by year and month in ascending order. For each publication venue in each year, we currently assign it a randomly generated month value. In future work we aim to extract the actual publication dates from publication profiles in DBLP.
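The ordering step can be illustrated as follows. The sketch assumes each temporal record carries a Venue and a Year field (the field names are hypothetical), draws one random month per venue and year so that all records of the same venue in the same year share it, and then sorts by year and month. The fixed seed is only there to make the sketch repeatable.

```python
import random


def assign_months_and_sort(records, seed=0):
    """Attach a random month per (venue, year) and sort records by year and month."""
    rng = random.Random(seed)
    month_of = {}
    dated = []
    for rec in records:
        key = (rec["Venue"], rec["Year"])
        if key not in month_of:
            month_of[key] = rng.randint(1, 12)  # inclusive on both ends
        # Copy the record so the input list is left unchanged.
        dated.append(dict(rec, Month=month_of[key]))
    return sorted(dated, key=lambda r: (r["Year"], r["Month"]))


records = [
    {"PublicationID": "conf/chiir/OBrienFJLTR16", "Venue": "CHIIR", "Year": 2016},
    {"PublicationID": "conf/ictir/ZucconAR11", "Venue": "ICTIR", "Year": 2011},
    {"PublicationID": "conf/chiir/Example16", "Venue": "CHIIR", "Year": 2016},  # hypothetical
]
ordered = assign_months_and_sort(records)
print([(r["Year"], r["Month"]) for r in ordered])
```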

Figure 5. Completing DBLP temporal affiliation information using the MAG data set. The article titles in the DBLP temporal records are matched to paper titles in MAG records. When a match is found, the author name of that MAG record is compared to all author names associated to the corresponding DBLP temporal record. If a unique matching name pair can be found, the affiliation information from the MAG record is added to its corresponding DBLP temporal record.

4. Completing Temporal Affiliation Information using MAG

From Figure 2, we can see that the affiliation information in DBLP is attached to author profiles and does not contain any temporal information. As a result we cannot allocate such affiliation information to the DBLP temporal records we generated, because we cannot tell when the affiliation of an author was valid or had changed. The Microsoft Academic Graph (MAG) (sinha_overview_2015, ) is an open bibliography data set which currently contains 166,192,182 articles (https://www.openacademic.ai/oag/). The records in the MAG data set contain affiliation information at different points in time for the authors of each paper. This makes it possible to extract temporal affiliation information for authors. However, there is no unique identifier for authors in the MAG data set, and therefore it is not easy to construct an author-based temporal data set using the MAG data set alone. Figure 5 shows how we use records in MAG to complete affiliation information for DBLP temporal records.

Algorithm 2 shows in detail how we complete affiliation information for DBLP temporal records. We first map MAG records into an index using their normalized titles (lines 1 to 4) and map author IDs to their names (lines 5 to 8). Then, for each DBLP temporal record, we check if its title can be found in the MAG index (lines 9 to 12). When we find a matching title, we compare the author name of the corresponding MAG record against all names that are related to the author ID of the temporal record. If a matching name is found, we complete the affiliation information of the temporal record using the affiliation information of the MAG record. In the title normalization function (lines 3 and 10) we remove all punctuation, spaces, and special characters from both DBLP and MAG titles. Using this approach, we were able to match a total of 418,197 titles across the DBLP and MAG data sets.

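Input:
- A: A list of DBLP author profiles
- M: A list of MAG records
- T: A list of DBLP temporal records
Output:
- T: Updated DBLP temporal records
1: MAGIndex = { }
2: foreach m in M do
3:     title = Normalize(m.title)
4:     MAGIndex[title] = m
5: AuthorIndex = { }
6: foreach a in A do
7:     foreach name in a.names do
8:         AuthorIndex[a.AuthorID].append(name)
9: foreach t in T do
10:     title = Normalize(t.title)
11:     names = AuthorIndex[t.AuthorID]
12:     if title in MAGIndex then
13:         m = MAGIndex[title]
14:         foreach name in names do
15:             if name == m.name then
16:                 t.affil = m.affil
17: return T
Algorithm 2. Complete DBLP temporal affiliation information using MAG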
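
The title-based linkage can be sketched as below. This is an illustration under assumptions rather than the original code: the record field names are hypothetical, titles are normalized by keeping only letters and digits (case is left unchanged because case folding is not mentioned in the text), and the index keeps a list of MAG records per title so that an affiliation is only copied when exactly one MAG author name matches, following the "unique matching name pair" wording of Figure 5.

```python
def normalize_title(title):
    """Drop punctuation, spaces, and special characters from a title."""
    return "".join(ch for ch in title if ch.isalnum())


def complete_affiliations(author_profiles, mag_records, temporal_records):
    """Copy MAG affiliations onto DBLP temporal records matched by title and name."""
    # Index MAG records by normalized title (one entry per author of a paper).
    mag_index = {}
    for m in mag_records:
        mag_index.setdefault(normalize_title(m["title"]), []).append(m)

    # Index author IDs to all names of the author.
    author_index = {p["author_id"]: set(p["names"]) for p in author_profiles}

    for t in temporal_records:
        candidates = mag_index.get(normalize_title(t["Title"]), [])
        names = author_index.get(t["AuthorID"], set())
        matches = [m for m in candidates if m["author_name"] in names]
        if len(matches) == 1:  # require a unique matching name pair
            t["Affiliation"] = matches[0]["affiliation"]
    return temporal_records


# Illustrative data; the MAG rows and their affiliation values are made up.
profiles = [{"author_id": "homepages/r/CJvanRijsbergen",
             "names": ["C. J. van Rijsbergen", "Keith van Rijsbergen"]}]
mag = [
    {"title": "An analysis of ranking principles, and retrieval strategies",
     "author_name": "C. J. van Rijsbergen", "affiliation": "University of Glasgow"},
    {"title": "An analysis of ranking principles, and retrieval strategies",
     "author_name": "Guido Zuccon", "affiliation": "University of Glasgow"},
]
temporal = [{"AuthorID": "homepages/r/CJvanRijsbergen",
             "Title": "An analysis of ranking principles and retrieval strategies."}]

completed = complete_affiliations(profiles, mag, temporal)
print(completed[0].get("Affiliation"))
```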

5. Selecting Reliable Author Profiles

DBLP recently conducted an analysis (see the news item of 2017-09-15 at https://dblp.uni-trier.de/news/) based on about 70,000 ORCIDs, which are persistent identifiers for researchers (for details see http://orcid.org). This analysis discovered 600 records where an author ID is related to more than one ORCID, and 5,000 records where the same ORCID appears in more than one author profile. These findings suggest that a significant number of author IDs are inaccurate in DBLP. If we assume the ORCIDs are accurate entity identifiers, then an author profile that is related to more than one ORCID actually contains multiple author profiles that were wrongly linked together. When multiple author profiles share the same ORCID, it indicates that these author profiles actually refer to the same author and should be merged. Note that not all ORCIDs have been incorporated into an author profile: our analysis in Section 6 shows that only 20,954 ORCIDs have been added to author profiles, and the rest of the ORCIDs are still pending validation.
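The two kinds of inconsistency described above can be detected from a list of (author ID, ORCID) pairs, as the following sketch shows. It is not the analysis code used by DBLP; the function name and the identifiers in the example are made up for illustration.

```python
from collections import defaultdict


def find_orcid_conflicts(links):
    """Find profiles linked to several ORCIDs and ORCIDs linked to several profiles."""
    orcids_of = defaultdict(set)    # author ID -> ORCIDs
    profiles_of = defaultdict(set)  # ORCID -> author IDs
    for author_id, orcid in links:
        orcids_of[author_id].add(orcid)
        profiles_of[orcid].add(author_id)
    # A profile with several ORCIDs may wrongly merge several authors.
    split_candidates = {a: o for a, o in orcids_of.items() if len(o) > 1}
    # An ORCID on several profiles suggests the profiles should be merged.
    merge_candidates = {o: a for o, a in profiles_of.items() if len(a) > 1}
    return split_candidates, merge_candidates


# Made-up identifiers for illustration only.
links = [
    ("homepages/00/0001", "0000-0000-0000-0001"),
    ("homepages/00/0001", "0000-0000-0000-0002"),
    ("homepages/00/0002", "0000-0000-0000-0003"),
    ("homepages/00/0003", "0000-0000-0000-0003"),
    ("homepages/00/0004", "0000-0000-0000-0004"),
]
split_candidates, merge_candidates = find_orcid_conflicts(links)
print(sorted(split_candidates), sorted(merge_candidates))
```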

Since we expect the combined data set to be used to evaluate entity resolution techniques, it is important to reduce the number of false matches and false non-matches (missing true matches), as false matches and missing true matches reduce both precision and recall (Chr12, ; Han17, ). By examining the original DBLP data set, we discovered several types of information that can be used to select a more reliable subset of author profiles.

        <www key="homepages/189/7254" ... >
        <author>Jeffrey A. McDougall</author>
        </www>
(a) The most common type of profile in DBLP.
        <www key="homepages/77/10481" ... >
        <author>Emitza Guzman</author>
        <author>Adriana Emitzá Guzmán Ortega</author>
        <note type="affiliation">Technical University Munich, Germany</note>
        </www>
(b) A profile with multiple author names.
        <www key="homepages/c/PeterChristen" ... >
        <author>Peter Christen</author>
        <url>http://cs.anu.edu.au/~Peter.Christen/</url>
        <note type="affiliation">The Australian National University</note>
        </www>
(c) A profile with a non-persistent URL.
        <www key="homepages/99/3847-1" ... >
        <author>Wei Song</author>
        <note type="affiliation">
        University of New South Wales, School of Computer Science and Engineering, Sydney, Australia
        </note>
        <url>https://orcid.org/0000-0001-7573-3557</url>
        </www>
(d) A profile with a persistent URL.

Figure 6. Four categories of author profiles in DBLP.

We identified four categories of author profiles in the DBLP data set with different levels of reliability:

  1. No support: Figure 6 (a) shows the most common type of profile, which has only one author ID and one author name. More than 99% (1,984,904) of author profiles in DBLP are of this type. They were likely created either for temporary and non-persistent researchers, or they are missed true matches to an existing profile. We consider this type of profile to have no support and to be the least reliable.

  2. Multiple names or affiliation: 38,035 profiles have more than one author name or affiliation information, as Figure 6 (b) shows, suggesting these profiles have been merged by the DBLP team either manually or automatically using an algorithm. Name changes and variations are important aspects that make a data set realistic for evaluating temporal entity resolution techniques, so including profiles with multiple names is of considerable benefit to our temporal data set.

  3. Non-persistent URLs: 27,146 profiles contain at least one non-persistent URL, such as a staff profile page from a university, as Figure 6 (c) shows. Non-persistent URLs are not meant to be used as author identifiers, but they can be used as relatively strong evidence that a profile received fair attention and scrutiny.

  4. Persistent URLs: 27,496 profiles contain verified persistent identifiers from a third party, such as Google Scholar, ORCID, or Scopus (https://www.scopus.com/), as Figure 6 (d) shows. We consider profiles with persistent URLs to be the most reliable.

In the temporal data set we aim to develop, we create temporal records using author profiles that either have multiple names, an affiliation, or have at least one non-persistent or persistent URL (i.e. these profiles are in one of the last three categories). We call these criteria support information.
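
The selection criteria can be expressed as a small classifier over parsed author profiles. The sketch below is an assumption-laden illustration: the profile dictionaries use hypothetical field names, and the list of hosts treated as persistent identifiers is a guess based on the providers named above, not a rule taken from DBLP.

```python
from urllib.parse import urlparse

# Assumption: hosts of the persistent identifier providers named in the text.
PERSISTENT_HOSTS = ("orcid.org", "scholar.google.com", "www.scopus.com")


def support_information(profile):
    """Return the set of support categories that apply to an author profile."""
    support = set()
    if len(profile["names"]) > 1 or profile["affiliations"]:
        support.add("multiple names or affiliation")
    for url in profile["urls"]:
        host = urlparse(url).netloc.lower()
        if host in PERSISTENT_HOSTS:
            support.add("persistent URL")
        else:
            support.add("non-persistent URL")
    return support


def has_support(profile):
    """A profile is selected if at least one support category applies."""
    return bool(support_information(profile))


# The four example profiles of Figure 6.
profile_a = {"names": ["Jeffrey A. McDougall"], "affiliations": [], "urls": []}
profile_b = {"names": ["Emitza Guzman", "Adriana Emitzá Guzmán Ortega"],
             "affiliations": ["Technical University Munich, Germany"], "urls": []}
profile_c = {"names": ["Peter Christen"],
             "affiliations": ["The Australian National University"],
             "urls": ["http://cs.anu.edu.au/~Peter.Christen/"]}
profile_d = {"names": ["Wei Song"],
             "affiliations": ["University of New South Wales, School of Computer "
                              "Science and Engineering, Sydney, Australia"],
             "urls": ["https://orcid.org/0000-0001-7573-3557"]}

for p in (profile_a, profile_b, profile_c, profile_d):
    print(p["names"][0], sorted(support_information(p)))
```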

6. Characteristics of Data sets

Figure 7. Distribution of support information in author profiles in the proposed DBLP temporal data set.

In this section we discuss some of the statistics and characteristics of the generated temporal data set, as well as of the MAG and DBLP data sets in general.

Figure 7 shows the distribution of author profiles that have at least one of the three categories of support information described in the previous section. We can see that the majority of author profiles are supported by at least one URL, and more than 15K profiles have at least two types of support information. Also note that although the DBLP team imported about 70K ORCIDs into DBLP, only about 20K of them can be found in DBLP author profiles; the remaining ORCIDs are located in publication records and are waiting to be linked to an author profile.

Figure 8 shows the number of author profiles, publications, and new author profiles by year. A new author profile refers to a profile that was newly added to a data set. The number of new authors increased sharply in the last ten years while the total number of authors and publications increased steadily over time.

Figure 8. Number of author profiles, publications, and new author profiles in each year in the generated temporal DBLP data set.

Figure 9 shows the average number of publications for each author per year and the average number of co-authors for each publication each year. Both of these have grown steadily since 1990. Authors are collaborating more over time suggesting that research networks and communities are getting more complex over time.

Figure 9. Average number of publications per author and average co-authors per publication in each year in the generated temporal DBLP data set.
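
One way to compute such yearly averages from temporal records is sketched below. The definitions are one possible reading and are assumptions: publications per author is taken as the number of temporal records in a year divided by the number of distinct authors in that year, and co-authors per publication is averaged over distinct publications. The field names follow the example temporal record in Section 3.

```python
from collections import defaultdict


def yearly_averages(records):
    """Return {year: (avg publications per author, avg co-authors per publication)}."""
    record_count = defaultdict(int)      # year -> number of temporal records
    authors = defaultdict(set)           # year -> distinct author IDs
    coauthor_counts = defaultdict(dict)  # year -> {publication ID: co-author count}
    for r in records:
        year = r["Year"]
        record_count[year] += 1
        authors[year].add(r["AuthorID"])
        coauthor_counts[year][r["PublicationID"]] = len(r["CoAuthors"])
    return {
        year: (
            record_count[year] / len(authors[year]),
            sum(coauthor_counts[year].values()) / len(coauthor_counts[year]),
        )
        for year in sorted(record_count)
    }


# Made-up records: author A has two publications, author B shares one of them.
records = [
    {"Year": 2016, "AuthorID": "A", "PublicationID": "P1", "CoAuthors": ["B", "C"]},
    {"Year": 2016, "AuthorID": "A", "PublicationID": "P2", "CoAuthors": []},
    {"Year": 2016, "AuthorID": "B", "PublicationID": "P1", "CoAuthors": ["A", "C"]},
]
averages = yearly_averages(records)
print(averages)
```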

Figure 10 shows the number of temporal records that have their affiliation information completed in our generated data set. We can see that a large percentage of temporal records do not have their affiliation information completed. As mentioned in Section 4, we managed to match 418,197 titles from which we completed affiliation information for 205,490 temporal records. Note that the MAG data set we used was a snapshot provided in 2013, and therefore papers published after 2013 cannot be linked. In the future we plan to use an updated and completed version of the MAG data set (https://www.openacademic.ai/oag/) to complete the affiliation information of more authors.

Figure 10. Number of temporal records and records with affiliation information completed using MAG in each year in the generated temporal DBLP data set.

Table 1. Statistics of MAG data set
Measure                               Total      Distribution                   Avg    Median  Max
Number of unique name strings         2,316,982  Per publication                5.56   3       3,203
Number of publications                2,258,482  Per unique name string         5.42   1       3,360
Number of unique affiliation strings  771,997    Per unique name string         1.77   1       565

Table 2. Statistics of the DBLP XML data set
Measure                               Total      Distribution                   Avg    Median  Max
Number of authors                     2,066,233  Per unique name string         1.01   1       141
Number of unique name strings         2,082,526  Per author                     1.02   1       10
Number of unique name strings         -          Per publication                2.93   3       287
Number of publications                4,011,876  Per unique name string         5.72   2       3,133

Table 3. Statistics of the generated DBLP temporal data set
Measure                               Total      Distribution                   Avg    Median  Max
Number of authors                     81,177     Per unique name string         1.05   1       69
Number of authors                     -          Per publication                1      1.5     37
Number of authors                     -          Per venue                      112.6  26      19,703
Number of authors                     -          Per unique affiliation string  2.47   1       167
Number of unique name strings         112,166    Per author                     1.43   1       10
Number of publications                1,956,963  Per author                     36.1   16      1,273
Number of unique affiliation strings  88,773     Per author                     2.7    1       136

Table 3 shows the statistics of the generated temporal DBLP data set. In total we generated 2,931,038 temporal records for 81,177 authors and 1,956,963 publications between 1936 and 2018. In contrast to the DBLP data set without filtering of unreliable profiles (Table 2), the proposed data set contains only 1% of profiles while it covers about 50% of all publications in DBLP, suggesting that authors who have a more developed DBLP profile are more likely to be regular authors. Since the MAG data set does not have unique author identifiers, statistics for author profiles are not available for the MAG data set.

The most commonly shared author name is Wei Wang, which is shared by 69 different authors as Table 3 shows. The author with the largest number of names is Naufal M Saad, who has 10 different names: Muhammad Naufal Bin Muhammad Saad, Naufal M Saad,..., Mohammed Naufal bin Mohamad Saad, M Naufal Mohamad Saad. The venue with the largest number of authors is CoRR, which involves 19,703 authors. 18,047 authors do not have a publication as the first author.

Many authors have multiple different affiliation strings in MAG, but when we inspect these affiliations, most of them are different names of the same institution. For example, "Computer Vision LabETH Zurich DITET BIWI", "Computer Vision Laboratory ETH Zurich", and "ETH Zurich Computer Vision Laboratory Sternwartstrasse 7 8092 Switzerland" are some affiliation strings for the same author. This suggests that a user of our data set may wish to standardize these affiliation strings before using the data set. To avoid variance caused by different standardization techniques and to keep the quality and variability of the data set, we did not apply standardization to the generated data set.

7. Conclusion and Future Work

We have described the development of a temporal data set for entity resolution which was created using the publicly available Digital Bibliography and Library Project (DBLP) and the Microsoft Academic Graph (MAG) data sets. We used DBLP’s public API to link author profiles and publication records, and then created one temporal record for each author of each publication in DBLP. We then matched titles from MAG records to this temporal DBLP data set to complete the affiliation information for each author for each publication record. We used three categories of support information to refine the DBLP author profiles to improve the quality of the proposed temporal data set. We generated a temporal data set with 2,931,038 records for 81,177 authors and 1,956,963 publications between 1936 and 2018, where most of the temporal records refer to publications after 1990. The data set is made freely available on GitHub for public use (https://github.com/E-Chen/A-refined-DBLP-temporal-dataset).

So far we have used only a subset of the large MAG which was made available for the KDD competition in 2016. A larger and more complete version of MAG (sinha_overview_2015, ) which has been linked to AMiner (tang_arnetminer:_2008, ) has recently been made available. We aim to create a larger and more comprehensive temporal data set by linking DBLP to the full MAG and AMiner data sets in the future.

We also plan to use similarity comparison algorithms, such as Jaccard similarity or edit distance (Chr12, ), to match titles between MAG and DBLP. In addition, we noticed data parsing issues with the DBLP XML file that may have resulted in some multi-line titles being ignored, and we aim to fix this issue in the future.
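
As an indication of what such approximate title matching could look like, the sketch below computes a token-based Jaccard similarity between two titles. It is only an illustration of the general technique: the tokenization (lowercased word tokens) and any match threshold a user would apply are assumptions, not choices made in this paper.

```python
import re


def jaccard_similarity(title_a, title_b):
    """Jaccard similarity of the lowercased word-token sets of two titles."""
    tokens_a = set(re.findall(r"\w+", title_a.lower()))
    tokens_b = set(re.findall(r"\w+", title_b.lower()))
    if not tokens_a and not tokens_b:
        return 1.0  # two empty titles are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Punctuation and case differences do not affect the score.
same = jaccard_similarity("Linking Temporal Records", "linking temporal records.")
# One extra token: 3 shared tokens out of 4 distinct tokens.
close = jaccard_similarity("Linking Temporal Records", "Linking of Temporal Records")
print(same, close)
```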

References

  • (1) Altowim, Y., Kalashnikov, D.V., Mehrotra, S.: ProgressER: Adaptive Progressive Approach to Relational Entity Resolution. ACM TKDD 12(3), 1–45 (2018)
  • (2) Chiang, Y.H., Doan, A., Naughton, J.F.: Tracking Entities in the Dynamic World: A Fast Algorithm for Matching Temporal Records. PVLDB 7(6) (2014)
  • (3) Christen, P.: Data Matching – Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer (2012)
  • (4) Hand, D., Christen, P.: A note on using the f-measure for evaluating record linkage algorithms. Statistics and Computing pp. 1–9 (2017)
  • (5) Köpcke, H.: Object Matching on Real-world Problems. PhD Thesis (2014)
  • (6) Köpcke, H., Thor, A., Rahm, E.: Evaluation of entity resolution approaches on real-world match problems. PVLDB 3(1-2), 484–493 (2010)
  • (7) Ley, M.: The DBLP computer science bibliography: Evolution, research issues, perspectives. In: International Symposium on String Processing and Information Retrieval. pp. 1–10. Springer (2002)
  • (8) Ley, M.: DBLP: some lessons learned. PVLDB 2(2), 1493–1500 (2009)
  • (9) Li, P., Dong, X., Maurino, A., Srivastava, D.: Linking Temporal Records. PVLDB 4(11) (2011)
  • (10) Ramadan, B., Christen, P., Liang, H., Gayler, R.W.: Dynamic Sorted Neighborhood Indexing for Real-Time Entity Resolution. JDIQ 6(4), 15:1–15:29 (2015)
  • (11) Sinha, A., Shen, Z., Song, Y., Ma, H., Eide, D., Hsu, B.J.P., Wang, K.: An Overview of Microsoft Academic Service (MAS) and Applications. In: ACM WWW. pp. 243–246 (2015)
  • (12) Tang, J., Zhang, J., Yao, L., Li, J., Zhang, L., Su, Z.: ArnetMiner: Extraction and Mining of Academic Social Networks. In: ACM SIGKDD. pp. 990–998. New York, USA (2008)
  • (13) Wang, H., Ding, X., Li, J., Gao, H.: Rule-based Entity Resolution on Database with hidden temporal Information. IEEE TKDE pp. 1–1 (2018)
  • (14) Wang, Q., Vatsalan, D., Christen, P.: Efficient Interactive Training Selection for Large-Scale Entity Resolution. In: PAKDD. Ho Chi Minh City, Vietnam (2015)