Implementing Recommendation Algorithms in a Large-Scale Biomedical Science Knowledge Base

10/24/2017
by   Jessica Perrie, et al.

The number of biomedical research articles published has doubled in the past 20 years. Search-engine-based systems naturally center around searching, but researchers may not have a clear goal in mind, or the goal may be expressed in a query that a literature search engine cannot easily answer, such as identifying the most prominent authors in a given field of research. The discovery process can be improved by providing researchers with recommendations for relevant papers or for researchers who are working on related bodies of work. In this paper we describe several recommendation algorithms that were implemented in the Meta platform. The Meta platform contains over 27 million articles and continues to grow daily. It provides an online map of science that organizes, in real time, all published biomedical research. The ultimate goal is to make it quicker and easier for researchers to filter through scientific papers, find the most important work, and keep up with emerging research results. Meta generates and maintains a semantic knowledge network consisting of these core entities: authors, papers, journals, institutions, and concepts. We implemented several recommendation algorithms and evaluated their efficiency in this large-scale biomedical knowledge base. We selected recommendation algorithms that could take advantage of the unique environment of the Meta platform: those that make use of diverse datasets such as citation networks, text content, semantic tag content, and co-authorship information, and those that can scale to very large datasets. In this paper, we describe the recommendation algorithms that were implemented and report on their relative efficiency and the challenges associated with developing and deploying a production recommendation engine system.


1 Introduction

Digital libraries continue to expand as new literature is written and old literature is digitized. As a result, scientific databases have emerged as one of the milestones of the modern scientific enterprise. One of the main goals of these resources is to refine the methods of information retrieval and augment citation analysis Falagas et al. (2008). A frequent challenge for researchers is to keep up to date with and find relevant research. Recommendation systems, made popular by eCommerce platforms, have become an important research tool that helps scientists and researchers find relevant results in a growing number of disparate sources of literature.

In this paper we describe our experience implementing several recommendation algorithms in a large-scale biomedical research knowledge base known as Meta (https://meta.com/). Meta Molyneux and Molyneux (2012) is a biomedical-focused discovery and distribution platform whose chief goal is to enable rapid browsing of personalized, filterable streams of new research. Newly published findings are delivered to researchers, who can subscribe to any context or entity in the semantic network. The network incorporates biomedical controlled vocabularies and ontologies, five core entities (papers, researchers, institutions, journals, concepts), and relations among the entities (e.g., researchers write papers, papers mention concepts, journals publish papers, etc.). It currently indexes over 27M papers, including 1.7M full-text articles. The recommendation algorithms presented in this paper were implemented in Meta and make use of the diverse datasets available in the Meta knowledge base, including citation networks, text content, semantic tag content, and co-authorship information. The ultimate goal is to make it quicker and easier for researchers to filter through scientific papers, find the most important work, and discover the most relevant research tools and products.

The remainder of this paper is organized as follows. In Section 2, we survey related scientific databases with a particular focus on biomedical sciences. We provide an overview of the recommendation system that was implemented in the Meta platform in Section 3. The recommendation algorithms we implemented are described in Section 4. An evaluation of the run time of each algorithm and practical considerations are discussed in Section 5. We conclude with suggestions for future work in Section 6.

2 Related Work

Major online scientific databases currently in use by biomedical researchers include PubMed, Google Scholar (GS), Web of Science (WoS), Scopus, Microsoft Academic (MA), Semantic Scholar (S2), and Meta. PubMed is a free online resource developed and maintained by the National Center for Biotechnology Information (NCBI) in the United States Canese and Weis (2013); NCBI (2017). It comprises over 27 million references from the MEDLINE database, in addition to other life science journals and online books NIH (2017). PubMed is mostly focused on medicine and biomedical literature, whereas the other resources described below cover various scientific fields Falagas et al. (2008). It provides search filters that help narrow the search results to a specific clinical study or topic, as well as approximately 50 search fields and tags (e.g., first author name, publisher, title, etc.) NCBI (2017). Search results in PubMed can be sorted by different criteria such as publication date or relevance NCBI (2017). The relevance of a document for a single-term query depends on the inverse global weight of the term, the local weight of the term, the weight of the fields the term appears in, and the field length; newer publications also receive higher weight NCBI (2017). Furthermore, for a specific article the researcher can view its related articles. The similarity score of two documents is measured by the number of terms they have in common. Overall, around 2 million terms are identified, and they are weighted based on the number of different documents in the database that contain the term (global weight) and the number of times the term occurs in the first and the second document (local weight). The location of a term can also give it a small advantage in the local weighting; for example, if the term is in the title, it is counted twice NCBI (2017). For each article, the similarity score is computed relative to all other articles in the database, and the most similar documents are identified and stored to reduce retrieval time NCBI (2017). Citation analysis is limited to journals in PubMed Central, PubMed's repository for open-access full-text articles, which contains more than 1.5 million full-text biomedical articles Masic and Milinovic (2012). For instance, if a publication that is not in PubMed Central cites an article, the article's citation count will not increase Shariff et al. (2013). There are also a number of plugins available for PubMed that extend the available features of the database Dokuwiki (2016).

Google Scholar is another free service, which crawls the web and finds scholarly articles, theses, books, abstracts, and court opinions Google (2017a). Documents are indexed by their meta-tags. If meta-tags are not available, automatic format inspection is used (for example, the title will have a large font, and author names should come right before or after the title in a slightly smaller font). Many argue that this inclusion process creates problems such as dirty and erroneous metadata De Winter et al. (2014), inclusion of non-scientific documents De Winter et al. (2014), and even spamming and manipulation of citation analysis measures Beel and Gipp (2010); Lopez-Cozar et al. (2012). However, Google tries to rectify these problems by allowing authors and researchers to directly curate the data Google (2017b) and by providing guidelines for webmasters on how to format their websites and use meta-tags Google (2017c). In comparison to PubMed, Google Scholar provides very limited search fields (title, author, publication year, all text, and publisher). In addition, many of the documents in the corpus lack some of these fields, for example, publication year De Winter et al. (2014). However, Google Scholar performs full-text search, which distinguishes it from PubMed and Web of Science De Winter et al. (2014).

Search results in Google Scholar are ordered by a relevance ranking of the documents, reportedly based on weighing the full text of each document, where it was published, who it was written by, and how often and how recently it has been cited in other scholarly literature Google (2017a); De Winter et al. (2014). The exact method of finding relevant documents is not specified, but in a recent study Google Scholar was found to return twice as many relevant articles as PubMed Shariff et al. (2013). Others have found that Google Scholar articles were more likely to be classified as relevant, had higher numbers of citations, and were published in higher impact factor journals Nourbakhsh et al. (2012). In Google Scholar, researchers can access the citation analysis view of a specific paper by clicking on the cited by link located beside its name. Researchers can also view articles related to a specific article by clicking on the related articles link. Another feature of Google Scholar is Google Scholar Metrics (GSM), by which Google ranks scholarly publications based on their h5-index (the largest number h such that h articles published in that publication in the last five years have at least h citations each). Publications include articles from journals (94%), selected conferences in Computer Science and Electrical Engineering (4%), and preprints from arXiv, SSRN, NBER and RePEc (2%) Martín-Martín et al. (2014).

Web of Science (WoS) is developed and maintained by Clarivate Analytics (formerly the Institute of Scientific Information (ISI) of Thomson Reuters) and, in comparison with other resources, covers the oldest publications, with archived records going back to 1900 Falagas et al. (2008); Cision (2016). The WoS indexing procedure is manual: a group of editors updates the journal coverage by identifying and evaluating promising new journals or removing journals that have become less useful Testa (July 18, 2016). In order to evaluate publications, the editors consider criteria such as the journal's basic publishing standards, its editorial content, the international diversity of its authorship, and the citation data associated with it Testa (July 18, 2016). Some argue that this manual selection is a potential threat for WoS, since it may not be able to keep up with the rapid pace of knowledge production and the coverage might not be satisfactory, especially in comparison with resources such as Google Scholar De Winter et al. (2014); Larsen and Von Ins (2010). Recently, WoS and Google Scholar have established a collaborative effort to interlink their data sources, which allows researchers to search in Google Scholar and move to WoS for deeper citation analysis such as in-depth citation history research Kreisman (November 6, 2013); Clarivate (2017). WoS finds relevant articles using keywords in the search query and citation-based methods. One of these citation-based methods is called Keyword Plus Garfield (1990). In the Keyword Plus method, in addition to title words, author-supplied keywords, and abstract words, the titles of cited papers are processed and the most commonly recurring words and phrases are used to retrieve relevant articles Garfield (1990). WoS includes some tools for visualizing citation relationships.

Scopus was launched at nearly the same time as Google Scholar and is developed and maintained by Elsevier. It is the largest abstract and citation database of peer-reviewed literature Elsevier (2017a). Like WoS, the indexing procedure is manual, and journals are evaluated based on a number of criteria, including content, online availability, journal policies, and publishing regularity Elsevier (2017a). In comparison to other generic resources like WoS and GS, Scopus offers a wider range of search fields, called proximities. Scopus also offers a tool called Journal Analyzer, which a researcher can use to compare up to ten Scopus sources on different parameters, including citations, SCImago Journal Rank (SJR), Source Normalized Impact per Paper (SNIP), and the percentage of documents not cited Edith Cowan University Library (2017). In Scopus, related articles are suggested based on shared references, authors, and/or keywords Elsevier (2017b).

Microsoft Academic (MA) is another free academic search and discovery resource, developed by Microsoft Research Harzing (2016). Unlike WoS and Scopus, the indexing process is done automatically. MA uses semantic search rather than keyword search and allows search inputs in natural language Microsoft (2017a). Both GS and MA offer profiles for authors; however, a study shows that GS profiles include more citations with a strong bias toward the information and computing areas, whereas MA profiles are disciplinarily better balanced Ortega and Aguillo (2014). In GS, profiles are created voluntarily and authors can freely edit and modify them; in MA, on the other hand, profiles are automatically generated and authors can perform only restricted editing, such as merging or suggesting changes Ortega and Aguillo (2014). MA aims not only to help researchers find scholarly articles online, but also to help them discover relationships between authors and organizations Hands (2012). MA enables researchers to see the top authors, publications, and journals of a specific scientific domain Harzing (2016). In addition, it provides visualizations using the Microsoft Academic Graph, which shows publications, citations among publications, authors, and relations of authors to institutions, publication venues, and research fields Microsoft (2017b). The co-author graph and co-author path offered by MA can be a valuable tool for analyzing collaboration in research Hands (2012).

Semantic Scholar (S2) is a free scholarly search engine developed by the Allen Institute for Artificial Intelligence in 2015 AI2 (2017). Similar to MA, S2 uses semantic search rather than keyword search and allows search inputs in natural language. S2 covers over 40 million scientific research articles Jones (November 11, 2016). The S2 ranking system is based on the word-based model in Elasticsearch, which matches query terms with various parts of a paper, combined with document features such as citation count and publication time in a learning-to-rank architecture T.-Y. Liu (2009). S2 uses Explicit Semantic Ranking (ESR) to connect queries and documents using semantic information from a knowledge graph Xiong et al. (2017). An academic knowledge graph, built from S2's corpus, includes concept entities, their descriptions, context correlations, relationships with authors and venues, and embeddings trained from the graph structure. Queries and documents are represented by entities in the knowledge graph, providing "smart phrasing" for ranking. Semantic relatedness between query and document entities is computed in the embedding space, which provides a soft match between related entities.

The Meta recommendation system described in this paper implements and compares a set of recommendation algorithms more diverse than those available in the other biomedical paper systems, and it uses the largest number of unique features from the papers. PubMed Canese and Weis (2013) has the same coverage in terms of number of papers, but PubMed uses text-based similarity recommendations on metadata only, whereas the Meta system makes use of several similarity algorithms based on metadata, full text, and semantic relationships.

These platforms, to differing degrees, enable researchers to access scientific publications and identify related or relevant articles through search capability or using recommendation systems. Recommendation systems have emerged as a promising approach for dealing with the ever increasing body of academic literature.

Several other existing systems, such as reference management systems, provide some aspects of recommendation, citation management, or citation analysis Bollacker et al. (1998); Lawrence et al. (1999); Beel et al. (2014); Bollen and Van de Sompel (2006); Jack (2012). Compared to the large-scale systems surveyed above, these tools do not have extensive coverage of the literature. Furthermore, many of these techniques rely on self-identified user preferences or on a partial list of the user's citations Corman et al. (2002). The effectiveness of these techniques is limited in that recommendations are based either on only one theoretical mechanism, namely similarity between user preferences, or solely on network statistics derived from the user's citation list Huang et al. (2008). When user preference information is not available, recommendations are made based solely on information about the papers using content-based filtering techniques. The algorithms presented in this paper make recommendations based on information about the papers such as co-authorship and citation networks, as well as proximity of citations in the text, similarity of words in the text, and semantic tags.

3 Overview

The algorithms described in this paper were integrated into Meta’s paper-to-paper recommendation system and make use of its large-scale semantic knowledge base. The paper-to-paper recommendation system has four main components: (a) public and private data sources that feed the knowledge network; (b) an extract, transform, load (ETL) pipeline that disambiguates the entities and discovers relations among them; (c) base recommendation algorithms that use a single specific type of data to make recommendations for a paper; and, (d) aggregation algorithms that combine recommendations from the base recommenders to generate the final set of recommendations optimized on specific criteria (see Figure 1). The seven base recommendation algorithms are described in detail in Section 4.

Three main data sources are used to populate the knowledge base. PubMed is the central repository for all biomedical publications and provides a detailed API through which biomedical journal and conference papers can be retrieved Canese and Weis (2013). A PubMed record contains title, abstract, and metadata (e.g., authors, affiliations, keywords, DOI, ISSN, etc.), and sometimes also information on the cited papers. Each PubMed paper has a unique id (PMID) corresponding to a unique digital object identifier (DOI) registered by Crossref (http://www.crossref.org/), a non-profit association of scholarly publishers that develops the infrastructure to distribute and maintain DOIs. From Crossref, we gathered metadata for about 50.9 million documents and citations for some of them. Our third data source is full-text articles from publisher partners of Meta, which, at the time of our experiment, included Elsevier, Sage, DeGruyter, PLoS, and BMC, among others. The Meta full-text pipeline contains adapters for diverse publishers and extracts both metadata and citation information from full-text content, which arrives in both XML and PDF formats.

Figure 1: Data flow of Meta’s recommendation engine

Each paper then goes through a disambiguation engine that has two main tasks. The first is disambiguating the authors of the paper, where the goal is to associate the paper with existing authors in the database or to assign a newly discovered author. At the time of our experiment, Meta's author database contained approximately 11 million biomedical researcher profiles calculated from 24.5 million papers spanning 89 million paper-author relationship tuples. Meta's author disambiguation algorithm is modeled after the winning algorithms of the KDD Cup 2013 Author Disambiguation challenge (track 2) Li et al. (2015); J. Liu et al. (2013). Given a manually disambiguated paper-author assignment training set, a random forest classifier is trained to discriminate between correct and incorrect author-paper assignments. Given an existing paper-to-author assignment database and a newly published paper, the algorithm uses the classification model to compare the paper against each candidate author's profile, which included over 43 predictive features at the time of our experiment. If the author with the maximum match probability achieves a threshold, the paper is assigned to this candidate author; otherwise, a new author profile is generated and the paper is assigned as the first paper of the newly discovered author. The 43 predictive features span five major categories: author name similarity metrics (Levenshtein, Jaro-Winkler, Jaccard, etc.), paper content similarity (mostly based on TF-IDF), affiliation similarity, co-authorship information, and author active-time compatibility. Meta's author disambiguation algorithm was evaluated in terms of F1 score, AU-ROC, and AU-PRC.

The second disambiguation process deals with concept mentions. Once a concept mention is recognized through an entity recognizer (such as GNAT Hakenberg et al. (2008), DNorm Leaman et al. (2013), Neji Campos et al. (2013), etc.), it is normalized into the canonical name from UMLS Bodenreider et al. (1998) and becomes a semantic tag. Among the many concept types, we used only the Medical Subject Headings (MeSH) in our algorithms.

Next, the paper goes through a citation extraction phase, during which references listed by the paper are identified, resolved into unambiguous, directed DOI-DOI pairs, and added to Meta's citation network, which has roughly 580 million citations. For papers with full text, where possible, we also extract pairwise proximities of the references. Finally, the text and semantic tag components of the paper are indexed into an inverted index Manning et al. (2008), which is built using a Hadoop MapReduce-based TF-IDF builder. The recommendation algorithms presented in this paper operate on the transformed data in Meta's semantic knowledge network. The algorithms were implemented using a diverse technology stack: Hadoop, Java, Python, and MySQL. Some of the algorithms depend heavily on the Hadoop-based MapReduce framework, while others were implemented with direct SQL queries. The recommended papers produced by the base algorithms were aggregated using a number of rank aggregation algorithms, all implemented using the SciPy and NumPy packages for Python.

4 Recommendation Algorithms

The paper-to-paper recommendation problem can be stated as follows: given a database of papers D = {p_1, p_2, ..., p_N} and a paper p in D that is of interest to a researcher r, recommend a list of papers R = (p_r1, p_r2, ..., p_rk) to r such that the papers in R are judged to be related to p and/or in some way useful to r. The list may be partially ordered, such that p_r1 is considered to be more relevant than p_r2, and so on.

We implemented seven recommendation algorithms on a database of more than 24 million biomedical papers; since we ran our experiments, the Meta database has grown to 27 million biomedical papers. We focused on two main criteria when choosing which algorithms to include: the ability to scale and the ability to leverage the various available data types. This meant that we mainly chose simple yet powerful algorithms instead of complex ones, with the expectation that the rank aggregation step can effectively compensate for any weaknesses in the base algorithms. Hence, we also implemented four different algorithms that aggregate results from the seven base algorithms. The details of each are presented below. The algorithms we implemented are inspired by existing work Gipp and Beel (2009); Dwork et al. (2001); Ailon et al. (2008); Ali and Meilă (2012); Kessler (1963); Marshakova-Shaikevich (1973); Small (1973) and have been customized for our dataset of biomedical papers. Table 1 summarizes the algorithms that are described in this section.

Name Short Description
B-CCS: Co-citation Similarity. Recommends papers cited by similar citing papers Marshakova-Shaikevich (1973); Small (1973).
B-BC: Bibliographic Coupling. Recommends papers with similar references Kessler (1963).
B-IBCF: Item-Based Collaborative Filtering. Treats citations as user-item purchases and recommends items to users that are similar to ones the user already bought.
B-CCP: Co-citation Proximity. Recommends papers that are co-cited close together in the text Gipp and Beel (2009).
B-AS: Abstract Similarity. Recommends papers with similar text content.
B-STS: Semantic Similarity. Recommends papers with similar semantic content.
B-CA: Co-authorship. Recommends papers with similar/shared authors Sugiyama and Kan (2011); Newman (2001).
A-LP: LP-based Aggregation. Aggregates based on a linear programming relaxation of the aggregation problem Ailon et al. (2008).
A-BS: Beam Search Aggregation. Aggregates based on a beam search heuristic Ali and Meilă (2012).
A-BL: Borda Aggregation. Aggregates by simply averaging over the ranks de Borda (1781).
A-MS: Merge Sort Aggregation. Aggregates based on a merge-sort-based heuristic Ali and Meilă (2012).
Table 1: Summary of recommendation and rank aggregation algorithms used in our system

4.1 Base Recommendation Algorithms

The base recommendation algorithms make use of citation information, content information in abstracts, the full text of the papers, and authorship information.

4.1.1 Citation-based Algorithms

We generated a citation network of the papers in our database by gathering, using a fully automated technique, citations from 50.9 million documents from across the sciences, metadata from over 24.6 million PubMed documents, and the full text of over 16 million articles. The resulting citation network has over 17 million nodes (a subset of the biomedical papers in the 50.9 million articles) and over 350 million edges. The following base algorithms that use the citation network were implemented: Co-citation Similarity (B-CCS), Bibliographic Coupling (B-BC), Item-Based Collaborative Filtering (B-IBCF), and Co-citation Proximity (B-CCP). Figure 2 illustrates a sample data set of papers with citations indicated.

Figure 2: Citation structures of sample documents. Citation-based algorithms produce the following recommendations for Paper E, in order: B-CCS: A and C (tied), then B and D (tied); B-BC: Z, Y; B-IBCF: C, D; B-CCP: A, D, B, C.
Co-citation Similarity (B-CCS)

Intuitively, papers that are cited by the same paper, or co-cited Marshakova-Shaikevich (1973); Small (1973), many times are likely to be similar to each other. This notion of similarity provides us with a basis for recommendation. Referring to the example in Figure 2, given Paper E, B-CCS recommends Papers A and C ahead of Papers B and D because Papers A and C are each co-cited with Paper E by two papers, whereas Papers B and D are each co-cited with Paper E by only one paper.

The notion of co-cited papers can be captured by using incoming citation vectors. Given a citation network that contains N papers, we define the incoming citation vector of a paper p as an N-dimensional bit vector v_p in which v_p[i] = 1 if paper i cites p, and v_p[i] = 0 otherwise. Then, p and q are co-cited by paper i if v_p[i] = v_q[i] = 1. Two papers with many 1's in the same positions in their incoming citation vectors are co-cited by many papers.

To recommend papers related to paper p, we can apply standard vector similarity metrics such as cosine similarity on v_p and v_q for all papers q to find the papers that are most co-cited with p. Cosine similarity also normalizes similarity scores by the norms of the vectors, intuitively weighting papers with many incoming citations less than papers with few incoming citations. However, cosine similarity gives an equal weight to all coordinates of v_p and v_q. Suppose there is a hypothetical paper i that cites a lot of papers; then, for many papers q, v_q[i] = 1. Conversely, if a paper j cites few papers, then v_q[j] = 1 for only a few papers q. Intuitively, coordinate j should contribute more than coordinate i because it is rarer: being co-cited by a paper with few outgoing citations is worth more than being co-cited by a paper with many outgoing citations. To account for this, before applying cosine similarity we normalize the incoming citation vectors by dividing each coordinate of v_p and v_q by the number of outgoing citations of the paper represented by that coordinate.
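To make the normalization concrete, the sketch below computes the normalized co-citation similarity for a small, in-memory citation network using NumPy and SciPy (packages the system already relies on elsewhere). The function name and the dense output matrix are illustrative only; the production computation is distributed and, as described next, restricted to pairs of papers that share at least one incoming citation.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags

def cocitation_similarity(citations, n_papers):
    """Normalized co-citation similarity between all pairs of papers.

    `citations` is a list of (citing_id, cited_id) pairs with ids in
    [0, n_papers). Column p of the citation matrix C is the incoming
    citation vector of paper p. Each citing paper's row is divided by its
    number of outgoing citations, so co-citation by a paper with a short
    reference list counts for more; the normalized incoming citation
    vectors are then compared with cosine similarity.
    """
    citing, cited = zip(*citations)
    C = csr_matrix((np.ones(len(citations)), (citing, cited)),
                   shape=(n_papers, n_papers))
    out_degree = np.asarray(C.sum(axis=1)).ravel()
    inv = np.divide(1.0, out_degree, out=np.zeros_like(out_degree),
                    where=out_degree > 0)
    C = diags(inv) @ C                       # down-weight prolific citers
    norms = np.sqrt(np.asarray(C.multiply(C).sum(axis=0)).ravel())
    S = np.asarray((C.T @ C).todense())      # dot products of column vectors
    denom = np.outer(norms, norms)
    return np.divide(S, denom, out=np.zeros_like(S), where=denom > 0)
```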

The number of pairwise similarity computations grows quadratically with the number of papers in the database and is on the order of 10^14 for 25M papers. To speed up this computation, we only consider pairs of papers with at least one common incoming citation, which resulted in a substantial decrease in the number of pairwise similarity computations.

Bibliographic Coupling (B-BC)

Papers with similar citation profiles are intuitively more similar than papers with different citation profiles Kessler (1963); this gives us yet another basis for recommendation. In this case, we compute the N-dimensional outgoing citation vector of each paper p as w_p, where w_p[i] = 1 if p cites paper i and w_p[i] = 0 otherwise. Then, p and q both cite paper i if w_p[i] = w_q[i] = 1. Two papers with many 1's in the same positions in their outgoing citation vectors cite many of the same papers.

We then employ the same algorithm used for co-citation similarity (B-CCS) except with the citation edges reversed. We normalize outgoing citation vectors by penalizing coordinates that represent papers with many incoming citations (those that are cited by many papers); then, given a paper, we compute the cosine similarity between it and every other paper to obtain papers with highly similar citation profiles as recommendations. The penalization step is the same as in B-CCS. The intuition behind it is: two papers citing a paper with few incoming citations is worth more than citing a paper with many incoming citations.

In the example in Figure 2, for Paper E, B-BC recommends Paper Z before Paper Y because Paper Z has more citations in common with Paper E (two shared references), while Paper Y has only one citation in common with Paper E.

Similar to our approach for pairwise similarity computations in the co-citation similarity (B-CCS) algorithm, we only consider pairs of papers with at least one common outgoing citation, which substantially reduces the number of computations.

Item-based Collaborative Filtering (B-IBCF)

The item-based collaborative filtering algorithm is implemented with Apache Mahout's Hadoop-based recommender (http://mahout.apache.org/users/recommender/intro-itembased-hadoop.html). Using the citation network, we treat each citation edge as a user-item interaction: paper p citing paper q represents user p buying item q. We treat all our papers as both items and users and recommend papers (items) to papers (users) based on citations. We perform the standard item-based collaborative filtering approach Sarwar et al. (2001): given a user (paper) u, we want to recommend items (papers) that u does not already have (does not already cite) and that are similar to items u already has (already cites). Just like the co-citation similarity algorithm, similarity is based on vector similarity. Given an item (paper) q, its user vector is the binary vector of users (papers) that have purchased (cited) it; that is, it is the incoming citation vector v_q, and we consider q to be an item bought by those users i for which v_q[i] = 1. Since these vectors are binary, we use Mahout's log-likelihood vector similarity measure to compute the similarity between the items that user u has bought and the items that u does not have, and we pick the best items by averaging similarity scores across all items that u has. Intuitively, given a paper p, we recommend the papers most similar to its citations (using log-likelihood similarity, which intuitively captures co-citation similarity).
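The following is a minimal in-memory sketch of this item-based step for a single input paper. For brevity it scores candidate items with cosine similarity rather than the log-likelihood measure Mahout uses in production, and the function name and arguments are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix

def ibcf_recommend(citations, n_papers, paper, top_k=10):
    """Item-based collaborative filtering over the citation network.

    Each citation (u cites i) is treated as user u "buying" item i.
    Candidate items are scored by their average similarity to the items
    `paper` already has, and items `paper` already cites are excluded.
    """
    citing, cited = zip(*citations)
    R = csr_matrix((np.ones(len(citations)), (citing, cited)),
                   shape=(n_papers, n_papers))
    owned = np.flatnonzero(R[paper].toarray().ravel())   # items `paper` already cites
    if owned.size == 0:
        return []
    norms = np.sqrt(np.asarray(R.multiply(R).sum(axis=0)).ravel())
    sims = np.asarray((R.T @ R[:, owned]).todense())     # similarity of every item to owned items
    denom = np.outer(norms, norms[owned])
    sims = np.divide(sims, denom, out=np.zeros_like(sims), where=denom > 0)
    scores = sims.mean(axis=1)                           # average over the owned items
    scores[owned] = -np.inf                              # never recommend already-cited papers
    scores[paper] = -np.inf
    return list(np.argsort(-scores)[:top_k])
```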

As shown in the example in Figure 2, for Paper E, B-IBCF recommends Paper C and then Paper D: the cited paper that shares more citations with Paper E in turn cites Paper C (which Paper E does not cite), while the cited paper that shares only one citation with Paper E cites Paper D (which Paper E also does not cite). Papers that Paper E already cites are not recommended.

The primary difference between this algorithm and B-CCS is that, given an input paper p, B-CCS finds papers closest to p using co-citation similarity. This algorithm, in contrast, does not look at the input paper directly; it treats the input paper as the set of papers it cites, and recommends papers closest to those citations by averaging co-citation similarity between the citations and other papers. The hope is that looking at a paper's citations provides more information than the paper itself.

Co-citation Proximity (B-CCP)

The co-citation proximity approach is based on citation proximity analysis Gipp and Beel (2009). The intuition behind the algorithm is that if citations occur close together in the text of a paper, the cited papers are likely to be more closely related than if the citations were further apart. We use a different weighting scheme for the proximity occurrences than Gipp and Beel (2009), and we aggregate the occurrence values.

We processed each paper d to extract all possible citation pairs between the papers referenced in the citation list of d. Each citation pair is given a proximity type (group – within the same square brackets – sentence, paragraph, section, or paper) based on the minimal distance between the two citations. The proximity type is determined by parsing the structure of the document's XML format or by applying minor heuristics.

Relationship weights quantify the different minimum proximities between citation pairs and are summed across citing documents to indicate the similarity of a document pair. For example, co-citations in the same paper are assigned a weight of 1, co-citations in the same section a weight of 2, and so on. If paper X and paper Y are cited once within the same sentence (a total relationship weight of 4) but paper X and paper Z are cited within the same section in three additional documents (a total relationship weight of 3 × 2 = 6), then paper X has a stronger similarity to paper Z than to paper Y. We also experimented with and applied the approach to a larger dataset (over 16 million documents) than what Gipp and Beel used (1.2 million) Gipp and Beel (2009).
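A minimal sketch of the weight accumulation is shown below. The weights for paper (1), section (2), and sentence (4) follow the description above; the values assumed here for paragraph and group proximity are inferred from the worked example that follows and may differ from the production configuration.

```python
from collections import defaultdict

# Relationship weights per proximity type. Paper, section, and sentence
# weights are as described above; paragraph and group weights are
# assumptions inferred from the worked example.
PROXIMITY_WEIGHTS = {"paper": 1, "section": 2, "paragraph": 3,
                     "sentence": 4, "group": 5}

def ccp_scores(co_citations):
    """Sum relationship weights over citing documents.

    `co_citations` is an iterable of (paper_a, paper_b, proximity_type)
    tuples, one per citing document, where proximity_type is the closest
    proximity at which a and b are co-cited in that document.
    """
    scores = defaultdict(int)
    for a, b, proximity in co_citations:
        pair = tuple(sorted((a, b)))
        scores[pair] += PROXIMITY_WEIGHTS[proximity]
    return scores
```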

Referring back to the example in Figure 2, for Paper E, B-CCP recommends documents based on their minimal citation proximity to Paper E over the multiple papers in which Paper E is cited. The recommended documents are ordered as follows: Paper A, which is cited in the same sentence as a citation to Paper E (weight of 4) in one citing paper and in the same section (weight of 2) in another (total weight of 6); Paper D, which is cited in the same group as Paper E (weight of 5) in one citing paper; Paper B, which is cited in the same sentence as Paper E (weight of 4) in one citing paper; and Paper C, which is cited in the same paper as Paper E (weight of 1) in one citing paper and in the same section as Paper E (weight of 2) in another (total weight of 3).

One issue with this approach is the situation in which paper and paper are cited in the same sentence but used to contrast each other Gipp  Beel (2009). This is not a significant issue in our case because our large collection of papers means that consistently co-cited papers will have a stronger connection. Additionally, even if two papers are co-cited in the context of a disagreement and/or conflict because they propose opposing theories, the fact that they are frequently co-cited may make them strongly related (i.e., such that one would be a good recommendation for the other).

4.1.2 Content-based Algorithms

We can also identify similar papers to recommend based on the content of the paper or its abstract. These similarity-based algorithms make use of terms in the text and semantic meaning of the terms in the text.

Figure 3: Example of common words and keywords (based on the MeSH ontology) represented by rectangles in the documents. Content-based algorithms produce the following recommendations for Paper E, in order: B-AS: A, B, C, D (using words); B-STS: B, A, D, C (using keywords).
Abstract Similarity (B-AS)

Almost every paper includes an abstract that typically summarizes the paper’s focus, methods, experiments, results, and contributions in a succinct and efficient manner. Many research article search engines index only the abstract (rather than the full text of the article) because abstracts provide sufficient information about the full paper. Two articles with similar abstracts are likely to be similar articles; therefore, we used the text of abstracts as a basis for recommending articles. To determine abstract similarity, we use a TF-IDF similarity measure on the words of the abstract. TF-IDF (term frequency-inverse document frequency) is calculated as the product of the term frequency (TF: the number of times a term occurs in a document) and the inverse document frequency (IDF: a measure of how common or rare the term is across all documents).

Using the B-AS algorithm to recommend papers for Paper E in Figure 3, Paper A is recommended before Paper B because Paper A contains three instances of an infrequent word (highlighted in light purple). Paper B is recommended before Paper C because Paper B contains one instance of the infrequent word and two frequent words (highlighted in green and pink). Papers C and D both contain frequent words in common with Paper E, but Paper C contains more instances of words in common with Paper E (three vs. two); hence, it is recommended before Paper D.

To obtain accurate TF-IDF similarity, we first normalize the abstracts by tokenizing them into words and eliminating external token punctuation and stop-word tokens. TF-IDF is then calculated at the token level. We calculate the inverse document frequency of each token over our entire paper abstract dataset (approximately 14 million abstracts). The inverse document frequency of a token t amongst the N papers in the dataset is defined as

idf(t) = log(N / n_t),

where n_t is the number of papers in the set in which t occurs.

Then, given two abstracts from papers a and b, we compute their TF-IDF vectors; that is, the abstracts are expanded into V-dimensional vectors, where V is the number of distinct words that occur across all abstracts (in our database this is approximately 9 million distinct words), such that the position for token t in the vector of paper a contains tf(t, a) × idf(t). The term frequency tf(t, a) is defined as the number of times token t occurs in the abstract of paper a.

Given the two TF-IDF vectors v_a and v_b for papers a and b, respectively, we compute their cosine similarity, (v_a · v_b) / (||v_a|| ||v_b||), to obtain the final similarity score. Intuitively, this similarity score captures abstracts that share similar terms, strengthened by the number of times a term occurs in the abstracts under consideration and penalized by how common the term is amongst all abstracts. Thus, we expect rare terms that occur frequently in both abstracts to indicate strong similarity between the abstracts.
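A minimal sketch of the B-AS scoring on a small in-memory collection is shown below; the production system computes the same statistics with a Hadoop-based TF-IDF pipeline over the full corpus, and the tokenization shown here is a simplification.

```python
import math
import re
from collections import Counter

def tfidf_vectors(abstracts, stop_words=frozenset()):
    """TF-IDF vectors for a list of abstracts, kept as sparse dicts.

    idf(t) = log(N / n_t), tf(t, a) = count of t in abstract a.
    """
    tokenized = [[t for t in re.findall(r"[a-z0-9]+", a.lower())
                  if t not in stop_words]
                 for a in abstracts]
    n = len(tokenized)
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    idf = {t: math.log(n / df) for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in Counter(tokens).items()}
            for tokens in tokenized]

def cosine(u, v):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```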

Suppose that for a given paper p in our dataset, we want to obtain the top 50 papers similar to p using abstract TF-IDF similarity. Computed naively, this is extremely inefficient, as it requires a similarity calculation against every other abstract in the dataset (approximately 14 million). Therefore, as a fast approximation, for a given paper abstract we consider only those paper abstracts that share at least one rare term with it, where a term t is considered rare when n_t falls below a fixed threshold. This step cuts the number of similarity calculations down by more than 3,000-fold. For the top recommended papers, the abstracts should intuitively share at least one rare term, so this filtering step should not eliminate too many papers; in practice, this heuristic search-space reduction strategy works well.

Semantic Similarity (B-STS)

Unfortunately, the B-AS algorithm is very sensitive to ambiguity and synonymy problems. To overcome this issue, we use semantic relationships to infer indirect mentions. Traditional TF-IDF similarity-based systems are not able to identify similarity among different terms for the same concept, but normalized field/concept annotations provide a principled way to detect and measure similarity. Hence, we applied named entity recognition algorithms to all papers in our database to identify mentions of concepts such as genes, chemicals, diseases, and research areas, which are all included in the MeSH ontology Nelson (2009).

There are about 28,000 terms and 139,000 supplementary concepts in MeSH. For every paper we capture a summary of the paper based on the fields it contains. Intuitively, papers that share more fields are more similar than papers that share fewer fields. As in the abstract similarity algorithm (B-AS), we use TF-IDF similarity to compute semantic similarity in exactly the same way, except that instead of using normalized tokens representing words of the abstract, we use the fields associated with the paper. TF-IDF inherently treats papers that share many rare fields as closest to each other. Note that the term frequency of a field in a paper is either 0 or 1, because our field tagger only records the existence of each field in a paper. As in abstract similarity, we only compare papers that share at least one rare field, where rare is defined as occurring in at most 5,000 papers in the dataset. This heuristic filtering approach reduces the number of pairs we have to compare to 72.2 billion without jeopardizing the quality of the recommendations.

Going back to the example in Figure 3, having reduced the words to their semantic fields, the frequency of instances within each paper no longer has an impact. Paper B is recommended first because it shares the most infrequent terms with Paper E. Paper A and then Paper D are recommended next because Paper A contains a term more infrequent than those of Paper D. Finally, Paper C is recommended because it contains one infrequent term in common with Paper E.

4.1.3 Co-authorship Similarity (B-CA)

The main idea behind co-authorship-based recommendations is that papers which share authors are likely to be related to each other Sugiyama and Kan (2011); Newman (2001). At the time of our experiment, Meta's author database contained approximately 11 million automatically discovered biomedical researcher profiles calculated from 24.6 million papers spanning 89 million paper-author relationship tuples; Meta's author disambiguation algorithm is modeled after the winning algorithms of the KDD Cup 2013 Author Disambiguation challenge (track 2) Li et al. (2015). We take a simple approach: we first build the co-authorship network in which each node represents a paper and a weighted edge between two papers p and q represents the number of authors shared by p and q. Then, for a given paper p, we traverse the co-authorship graph to each of its one- and two-hop neighbors and calculate a shared-author score for each neighbor as the sum of the weighted edges on the paths from p to that neighbor. Each one- and two-hop neighbor is ranked by its shared-author score with p, and the papers with the highest scores are recommended (ties are broken randomly).
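A minimal sketch of this traversal is shown below, assuming the co-authorship network is available as an adjacency dictionary; the data structure and function name are illustrative, and ties are broken arbitrarily here rather than randomly.

```python
from collections import defaultdict

def coauthorship_recommendations(edges, paper, top_k=10):
    """Rank the one- and two-hop neighbours of `paper` by shared-author score.

    `edges` maps each paper to a dict of its neighbours, where
    edges[p][q] is the number of authors shared by papers p and q.
    A neighbour's score is the sum of edge weights along each one- or
    two-hop path from `paper`, accumulated over all such paths.
    """
    scores = defaultdict(int)
    for q, w1 in edges.get(paper, {}).items():          # one-hop neighbours
        scores[q] += w1
        for r, w2 in edges.get(q, {}).items():          # two-hop neighbours
            if r != paper:
                scores[r] += w1 + w2
    scores.pop(paper, None)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```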

As shown in the example in Figure 4, in one and two hops from Paper E, Paper B has a shared-author score of six (three on the path E-A-B, one on the path E-B, and two on the path E-C-B), and hence is the first recommendation. Paper A is next because it has a score of four over its one- and two-hop paths (one on E-A and three on E-B-A), while Paper C is last because its score is only three (one on E-C and two on E-B-C).

Figure 4: Co-authorship structure where common authors are shown as icons along paths. Recommendations for Paper E are as follows: B-CA B, A, C.

4.2 Aggregation Algorithms

We implemented four rank aggregation methods Dwork et al. (2001); Ailon et al. (2008); Ali and Meilă (2012) to aggregate results from the base algorithms described above.

4.2.1 Problem Definition and Notation

Given a set of n elements and K complete rankings (permutations) of these elements, π_1, ..., π_K, the goal is to find the Kemeny optimal ranking Kemeny and Snell (1962), i.e., the ranking π that minimizes the sum over k of d(π, π_k), where d(·, ·) is the number of pairwise disagreements between a pair of rankings, also known as the Kendall distance. When complete rankings are not available, we place all unranked objects at the bottom of the list and consider all objects in this set to be tied with each other. The problem of finding the Kemeny optimal ranking is NP-hard Bartholdi III et al. (1989). See Ali and Meilă (2012) for a comprehensive survey of algorithms to compute the Kemeny ranking. Here, we use four different algorithms to approximate the Kemeny ranking.

The precedence matrix Q has entries Q_ij that represent the fraction of rankings in which element i is ranked higher than element j, i.e., Q_ij = (1/K) Σ_k 1[i ≺_{π_k} j], where 1[·] is the indicator function and ≺_{π_k} is the precedence operator for ranking π_k.
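For illustration, the precedence matrix can be built from a set of (possibly partial) rankings as in the sketch below; treating tied, unranked elements as contributing to neither direction is a simplifying assumption of this sketch.

```python
import numpy as np

def precedence_matrix(rankings, n):
    """Precedence matrix Q for K (possibly partial) rankings of n elements.

    Q[i, j] is the fraction of rankings that place element i before
    element j. Unranked elements are placed, tied, at the bottom of each
    list; tied pairs add to neither Q[i, j] nor Q[j, i].
    """
    Q = np.zeros((n, n))
    for ranking in rankings:                  # ranking: element ids, best first
        position = {e: p for p, e in enumerate(ranking)}
        bottom = len(ranking)                 # shared position of unranked elements
        for i in range(n):
            for j in range(n):
                if i != j and position.get(i, bottom) < position.get(j, bottom):
                    Q[i, j] += 1
    return Q / len(rankings)
```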

4.2.2 LP approximation (A-LP)

The problem of finding the Kemeny optimal ranking can be solved exactly by posing it as an integer linear program (ILP). Specifically, consider the following optimization problem, with one variable x_ij for every ordered pair of distinct elements (i, j), where x_ij = 1 means that i is ranked before j:

minimize Σ_{i ≠ j} Q_ji x_ij    (1)

subject to
x_ij ∈ {0, 1} for all i ≠ j,
x_ij + x_ji = 1 for all i ≠ j,
x_ij + x_jk + x_ki ≥ 1 for all distinct i, j, k.

The first set of constraints ensures that the x_ij are binary variables. The second and third sets of constraints are symmetry and transitivity constraints, respectively, which ensure that x represents a ranking. Note that this formulation also solves the minimum weighted feedback arc set problem in tournaments Ailon et al. (2008). The binary constraints can be relaxed to 0 ≤ x_ij ≤ 1, resulting in a linear program (LP) relaxation of the ILP. Even though the LP can be solved using off-the-shelf LP solvers, in practice we found this to be prohibitively expensive due to the large number of transitivity constraints, which is cubic in the number of elements n.
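A minimal sketch of the LP relaxation using SciPy's linprog is shown below. It materializes all pairwise variables and all transitivity constraints explicitly, which illustrates why the approach becomes prohibitively expensive as the effective list size grows; the sketch follows the formulation above but is not the production implementation.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def kemeny_lp_relaxation(Q):
    """LP relaxation of the Kemeny ILP sketched above.

    One variable x_ij per ordered pair (i != j), read as "i is ranked
    before j". The transitivity constraints are cubic in n.
    """
    n = Q.shape[0]
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    index = {p: k for k, p in enumerate(pairs)}
    # Placing i before j disagrees with the Q[j, i] fraction of the inputs.
    c = np.array([Q[j, i] for (i, j) in pairs])

    A_eq, b_eq = [], []                       # antisymmetry: x_ij + x_ji = 1
    for i, j in itertools.combinations(range(n), 2):
        row = np.zeros(len(pairs))
        row[index[(i, j)]] = row[index[(j, i)]] = 1.0
        A_eq.append(row)
        b_eq.append(1.0)

    A_ub, b_ub = [], []                       # transitivity: x_ab + x_bc + x_ca >= 1
    for i, j, k in itertools.combinations(range(n), 3):
        for a, b, d in ((i, j, k), (i, k, j)):    # both cyclic orientations
            row = np.zeros(len(pairs))
            row[index[(a, b)]] = row[index[(b, d)]] = row[index[(d, a)]] = -1.0
            A_ub.append(row)
            b_ub.append(-1.0)

    result = linprog(c,
                     A_ub=np.array(A_ub) if A_ub else None,
                     b_ub=np.array(b_ub) if b_ub else None,
                     A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                     bounds=(0.0, 1.0), method="highs")
    return result.x, pairs
```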

4.2.3 Beam Search (A-BS)

The set of all permutations can be represented as a tree, where each permutation corresponds to a path from the root to a leaf, and every path from the root to an internal node represents a partial ranking. We use beam search to explore the set of all permutations and output the best ranking found. The basic idea is to keep only B candidate solutions (partial rankings) at each level of the tree, where B is a user-defined parameter known as the beam width; these candidates represent the best partial rankings found so far by the heuristic search algorithm. The tree is explored in a breadth-first fashion from the root all the way down to the leaves, and the final solution is selected from the best candidates found at the lowest level of the tree. A greedy version of the algorithm is obtained by setting B = 1, where at each level only one candidate solution is considered. In the other extreme, with an unbounded beam width, the algorithm explores the exponential number of possible rankings/paths in the tree.

In order to select the best candidate solutions at each level of the tree, we need to define a cost function to score partial rankings. This cost function can be defined using the precedence matrix as C(σ) = Σ_{(i, j) ∈ P(σ)} Q_ji, where σ is a partial ranking and P(σ) is the set of all pairs (i, j) such that i precedes j in the partial ranking, including transitive pairs. Our implementation of the algorithm takes about 3.58 s/paper on a single machine with 8 threads.
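A minimal sketch of the beam search over partial rankings is given below; the incremental cost of appending an element is exactly its contribution to the cost function defined above, and the default beam width is arbitrary.

```python
def beam_search_ranking(Q, beam_width=5):
    """Approximate the Kemeny ranking by beam search over partial rankings.

    Q is the n x n precedence matrix defined above. Appending element e
    to a partial ranking places e before every still-unranked element r,
    which adds Q[r, e] (the mass of inputs preferring r before e) to the
    cost, so the cost of a completed path equals C(sigma).
    """
    n = Q.shape[0]
    beam = [(0.0, [], frozenset(range(n)))]   # (cost, partial ranking, remaining)
    for _ in range(n):
        candidates = []
        for cost, partial, remaining in beam:
            for e in remaining:
                extra = sum(Q[r, e] for r in remaining if r != e)
                candidates.append((cost + extra, partial + [e], remaining - {e}))
        candidates.sort(key=lambda c: c[0])
        beam = candidates[:beam_width]
    return beam[0][1]
```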

4.2.4 Borda Counts (A-BL)

A simple way to aggregate rankings is to rank objects by their average rank across the multiple input rankings de Borda (1781). This is equivalent to sorting the elements by the column sums of the precedence matrix, i.e., scoring element j by Σ_i Q_ij and placing elements with smaller scores higher. Our implementation of the algorithm takes about 0.161 s/paper on a single machine with 8 threads.
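A minimal sketch, assuming the precedence matrix Q defined above:

```python
import numpy as np

def borda_aggregate(Q):
    """Borda-style aggregation from the precedence matrix.

    The column sum of Q for element j counts how often the other elements
    are ranked above j, which orders elements the same way their average
    rank does; elements with smaller column sums are ranked higher.
    """
    losses = Q.sum(axis=0)               # column sums
    return list(np.argsort(losses))      # ascending: fewest losses first
```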

4.2.5 Sort-based Approximation (A-MS)

Comparison-based sorting algorithms such as merge sort or quick sort can be adapted to aggregate rankings using the precedence matrix Ali and Meilă (2012). Instead of comparing a pair of elements i and j directly, the sorting algorithm compares Q_ij and Q_ji. We refer the reader to Schalekamp and van Zuylen for more details on comparison-sort methods for rank aggregation. In our experiments, we adapted merge sort to solve the rank aggregation problem. Our implementation takes about 0.159 s/paper on a single machine with 8 threads.
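A corresponding sketch using Python's built-in sort (a merge-sort variant) with a comparator driven by the precedence matrix:

```python
from functools import cmp_to_key

def sort_aggregate(Q):
    """Comparison-sort aggregation: put i before j whenever the input
    rankings place i before j at least as often as the reverse."""
    def compare(i, j):
        return -1 if Q[i, j] >= Q[j, i] else 1

    n = Q.shape[0]
    # sorted() uses Timsort, a merge-sort variant, matching the
    # merge-sort-based heuristic described above.
    return sorted(range(n), key=cmp_to_key(compare))
```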

4.2.6 Weighted Aggregation

We note that the algorithms described above can be adapted to take weights into account, where a weight is assigned to each of the base recommendation algorithms. Let w_1, ..., w_K denote these weights, with Σ_k w_k = 1. We can modify the precedence matrix as Q_ij = Σ_k w_k 1[i ≺_{π_k} j] and use it as input to the above algorithms (A-LP, A-BS, A-BL, A-MS). Determining weights for the algorithms is left for future work.

5 Moving to Production: Practical Aspects

In this section, we focus on challenges associated with selecting and deploying a production recommendation engine system. All but one base recommender algorithm (B-CA) are implemented using the MapReduce platform and hence are linearly scalable. We were able to generate recommendations from base algorithms for over 24.6 million articles in less than a week using a small scale (32 cores) Hadoop cluster built on top of commodity hardware. However, the aggregation algorithms do not fit naturally into the MapReduce framework and presented the main challenge in terms of runtime.

When implementing a recommendation system in a production platform, there are several issues to consider, but runtime performance is one of the most critical. The runtime complexity of all of the aggregation algorithms is mostly determined by the effective size of the list of papers, i.e., the size of the union of all the ranked lists returned by the base algorithms. Let each base algorithm output a ranking of m elements (without loss of generality, assume m is fixed) and let K be the number of base algorithms. The effective size n then depends on a measure ω ∈ [0, 1] of the overlap among the base algorithms' lists: ω = 0 indicates that all the ranked lists are mutually exclusive (n = Km), and ω = 1 indicates that all the ranked lists are the same (n = m).

In order to measure the runtime performance of our algorithms, we asked 14 active biomedical researchers to each select 15 papers from their field of study. There was one duplicate paper; thus, a total of 209 papers were used to evaluate the performance of our algorithms. Figure 5(A) shows, for each of the 209 papers, the cumulative distribution of the number of base recommenders able to generate recommendations, including the fraction of cases in which the aggregation algorithms receive input from at least three base algorithms and the fraction in which all base recommenders can generate recommendations. Figure 5(B) shows the pairwise overlap rate between base recommenders, indicating only modest overlap between the citation-based recommendation algorithms and very little overlap among the remaining pairs. As such, the effective size n tends to be large, and the runtime complexity of the aggregation algorithms becomes important. The number of variables and constraints in the LP of the LP Approximation (A-LP) algorithm is quadratic and cubic in n, respectively. The runtime complexity of Beam Search (A-BS) grows with both the effective size n and the beam width B. The runtime complexities of the Borda Counts (A-BL) and Sort-based (A-MS) algorithms are the same as that of a sorting algorithm on a list of size n, i.e., O(n log n).

Figure 5: (A) Cumulative distribution of base recommenders generating input for the aggregation step. (B) Percentage of overlap between pairs of base recommenders. (C-E) Runtime vs. effective size for the A-BL, A-MS, and A-BS algorithms.

Although not shown here, it is worth noting that the amount of overlap has a significant effect on the runtime of A-LP (unlike the other aggregation algorithms), since the reduction in the number of variables and constraints is quadratic and cubic in n, respectively. Also, adding more base algorithms (i.e., increasing K) affects the LP more than the other aggregation methods unless the overlap ω increases at the same rate. Indeed, we omitted the LP-based algorithm because it was prohibitively slow for the majority of papers.

The runtimes for A-BL, A-MS, and A-BS, given in Figure 5(C)-(E), respectively, show decisively that the A-BL and A-MS algorithms perform similarly and are faster than the A-BS algorithm. A decision about which algorithm(s) to ultimately deploy in practice must take this runtime performance into account. It took about a week to generate recommendations for all the papers in the Meta database.

Another metric to use when selecting the final algorithm(s) is coverage, which in this context is defined as the number of papers for which our system can generate recommendations. Aggregation algorithms overcome a fundamental shortcoming of the base recommendation algorithms, namely that no single base algorithm can produce recommendations for all papers; indeed, B-IBCF, B-CCP, B-BC, B-CCS, and B-STS each fail to generate recommendations for some fraction of the papers. Finally, and perhaps most importantly, a decision about which algorithm(s) to deploy in a production system must consider the quality and relevance of the recommendation results. There are several methods that can be used to evaluate the relevance and usefulness of the output of recommendation algorithms, and future work will evaluate and compare our algorithms on this dimension. However, even if a base recommender were an overall winner from a quality-of-output perspective, it most likely could not be used as the sole algorithm because of its lack of coverage, which in turn means that aggregation is necessary.

6 Conclusions and Future Work

In this paper, we presented several recommendation algorithms that were implemented and evaluated in Meta’s large-scale biomedical science knowledge base. Existing academic paper recommendation engines, especially those in biomedical sciences, are limited in scope, size and functionality. We experimented with seven base recommender algorithms, and four aggregation algorithms. Base recommender algorithms utilize diverse sets of data such as a citation network, text content, semantic tag content, and co-authorship information. We compared the algorithms according to runtime complexity and scalability and discussed some of the considerations in implementing recommendation algorithms in a large-scale production system.

The main focus for our future work will be to consider the quality of the resulting recommendations from the algorithms and to compare the results according to relevance and usefulness for biomedical researchers. Once the quality of recommendations from the different algorithms is understood, future work can also consider how to adapt the aggregation algorithms by assigning weights to each of the base recommendation algorithms.

Acknowledgements

The authors would like to thank Meta’s Data Science team for their valuable feedback and support during this work. The authors would also like to thank Bahar Ghadiri Bashardoost for contributions to the research and algorithm implementation. This research was partially funded by an Engage Grant from Canada’s Natural Sciences and Engineering Research Council (NSERC) and support from Smart Computing for Innovation (SOSCIP).

References

  • AI2. (2017). Leverage AI to combat information overload. http://allenai.org/semantic-scholar/ Last accessed: 23 October 2017.
  • Ailon, N., Charikar, M., and Newman, A. (2008). Aggregating inconsistent information: Ranking and clustering. Journal of the ACM, 55(5), Article 23.
  • Ali, A. and Meilă, M. (2012). Experiments with Kemeny ranking: What works when? Mathematical Social Sciences, 64(1), 28–40.
  • Bartholdi III, J., Tovey, C., and Trick, M. (1989). Voting schemes for which it can be difficult to tell who won the election. Social Choice and Welfare, 6, 157–165.
  • Beel, J. and Gipp, B. (2010). Academic search engine spam and Google Scholar's resilience against it. Journal of Electronic Publishing, 13(3).
  • Beel, J., Langer, S., Gipp, B., and Nürnberger, A. (2014). The architecture and datasets of Docear's research paper recommender system. D-Lib Magazine, 20(11).
  • Bodenreider, O., Nelson, S.J., Hole, W.T., and Chang, H.F. (1998). Beyond synonymy: Exploiting the UMLS semantics in mapping vocabularies. In Proceedings of the AMIA Symposium (p. 815).
  • Bollacker, K.D., Lawrence, S., and Giles, C.L. (1998). CiteSeer: An autonomous web agent for automatic retrieval and identification of interesting publications. In Proceedings of the 2nd International Conference on Autonomous Agents (pp. 116–123).
  • Bollen, J. and Van de Sompel, H. (2006). An architecture for the aggregation and analysis of scholarly usage data. In Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries (pp. 298–307). ACM.
  • Campos, D., Matos, S., and Oliveira, J.L. (2013). A modular framework for biomedical concept recognition. BMC Bioinformatics, 14, 281.
  • Canese, K. and Weis, S. (2013). PubMed: The bibliographic database. In The NCBI Handbook [Internet]. http://www.ncbi.nlm.nih.gov/books/NBK153385/ Last accessed: 23 October 2017.
  • Cision. (2016). Acquisition of the Thomson Reuters intellectual property and science business by Onex and Baring Asia completed. http://www.prnewswire.com/news-releases/acquisition-of-the-thomson-reuters-intellectual-property-and-science-business-by-onex-and-baring-asia-completed-300337402.html Last accessed: 23 October 2017.
  • Clarivate. (2017). Web of Science: Core collection help. https://images.webofknowledge.com/images/help/WOS/hp_full_record.html Last accessed: 23 October 2017.
  • Corman, S.R., Kuhn, T., McPhee, R.D., and Dooley, K.J. (2002). Studying complex discursive systems. Human Communication Research, 28(2), 157–206.
  • de Borda, J.C. (1781). Mémoire sur les élections au scrutin. Histoire de l'Académie Royale des Sciences, Paris, 657–664.
  • De Winter, J.C., Zadpoor, A.A., and Dodou, D. (2014). The expansion of Google Scholar versus Web of Science: A longitudinal study. Scientometrics, 98(2), 1547–1565.
  • Dokuwiki. (2016). Pubmed plugin. https://www.dokuwiki.org/plugin:pubmed Last accessed: 23 October 2017.
  • Dwork, C., Kumar, R., Naor, M., and Sivakumar, D. (2001). Rank aggregation methods for the web. In Proceedings of the 10th International Conference on World Wide Web (pp. 613–622).
  • Edith Cowan University Library (2017) SCjAnalyzerEdith Cowan University Library.  2017. Research: Find Highly Ranked Journals. Research: Find highly ranked journals. http://ecu.au.libguides.com/research/find-highly-ranked-journals Last accessed: 23 October 2017
  • Elsevier (20171) scopusContentElsevier.  20171. The largest up-to-date collection of global, unbiased and expertly sourced research. The largest up-to-date collection of global, unbiased and expertly sourced research. https://www.elsevier.com/solutions/scopus/content Last accessed: 23 October 2017
  • Elsevier (20172) SCrecomElsevier.  20172. Search, Discover, Analyze. Search, Discover, Analyze. https://www.elsevier.com/solutions/scopus/features Last accessed: 23 October 2017
  • Falagas . (2008) falagas2008comparisonFalagas, ME., Pitsouni, EI., Malietzis, GA.  Pappas, G.  2008February. Comparison of PubMed, Scopus, Web of Science, and Google Scholar: Strengths and weaknesses Comparison of PubMed, Scopus, Web of Science, and Google Scholar: Strengths and weaknesses. The Journal of the Federation of American Societies for Experimental Biology222338–342.
  • Garfield (1990) garfield1990keywordsGarfield, E.  1990. Keywords Plus-ISI’s breakthrough retrieval method. 1. Expanding your searching power on current-contents on diskette Keywords Plus-ISI’s breakthrough retrieval method. 1. Expanding your searching power on current-contents on diskette. Current Contents325–9.
  • Gipp  Beel (2009) Gipp09aGipp, B.  Beel, J.  2009July. Citation Proximity Analysis (CPA) - A new approach for identifying related work based on co-citation analysis Citation Proximity Analysis (CPA) - A new approach for identifying related work based on co-citation analysis. B. Larsen  J. Leta (), Proceedings of the 12th International Conference on Scientometrics and Informetrics (ISSI’09) Proceedings of the 12th international conference on scientometrics and informetrics (issi’09) ( 2). Rio de Janeiro, BrazilInternational Society for Scientometrics and Informetrics. ISSN 2175-1935
  • Google (20171) google2017Google.  20171. Google Scholar: About. Google scholar: About. https://scholar.google.ca/intl/en/scholar/about.html Last accessed: 15 August 2017
  • Google (20172) gsCitationGoogle.  20172. Google Scholar citations. Google Scholar citations. https://scholar.google.ca/intl/en/scholar/citations.html Last accessed: 23 October 2017
  • Google (20173) gsIncGoogle.  20173. Inclusion guidelines for webmasters. Inclusion guidelines for webmasters. https://scholar.google.ca/intl/en/scholar/inclusion.html Last accessed: 23 October 2017
  • Hakenberg . (2008) hakenberg2008interHakenberg, J., Plake, C., Leaman, R., Schroeder, M.  Gonzalez, G.  2008. Inter-species normalization of gene mentions with GNAT Inter-species normalization of gene mentions with GNAT. Bioinformatics2416i126–i132.
  • Hands (2012) MSpaperHands, A.  2012. Microsoft Academic Search – http://academic.research.microsoft.com Microsoft Academic Search – http://academic.research.microsoft.com. Technical Services Quarterly293251-252.
  • Harzing (2016) harzing2016microsoftHarzing, AW.  2016. Microsoft Academic (Search): a Phoenix arisen from the ashes? Microsoft academic (search): a phoenix arisen from the ashes? Scientometrics10831637–1647.
  • Huang . (2008) huang2008ciHuang, Y., Contractor, N.  Yao, Y.  2008. CI-KNOW: Recommendation based on social networks CI-KNOW: Recommendation based on social networks. Proceedings of the International Conference on Digital Government Research Proceedings of the international conference on digital government research ( 27–33).
  • Jack (2012) MendRecJack, K.  2012. Mendeley: Recommendation systems for academic literature. Mendeley: Recommendation systems for academic literature. http://www.slideshare.net/KrisJack/mendeley-recommendation-systems-for-academic-literature Last accessed: 23 October 2017
  • Jones (November 11, 2016) jones2016Jones, N.  November 11, 2016. AI science search engines expand their reach. AI science search engines expand their reach. http://www.nature.com/news/ai-science-search-engines-expand-their-reach-1.20964 Last accessed: 23 October 2017
  • Kemeny  Snell (1962) KemenyS62Kemeny, J.  Snell, J.  1962. Mathematical models in social sciences Mathematical models in social sciences. Blaisdell, New York.
  • Kessler (1963) kessler1963bibliographicKessler, MM.  1963. Bibliographic coupling between scientific papers Bibliographic coupling between scientific papers. American documentation14110–25.
  • Kreisman (November 6, 2013) wosGSKreisman, R.  November 6, 2013. Thomson Reuters-Google Scholar linkage offers big win for STM users and publishers. Thomson Reuters-Google Scholar linkage offers big win for STM users and publishers. Outsell, Inc. Advancing the Business of Information.
  • Larsen  Von Ins (2010) larsen2010rateLarsen, PO.  Von Ins, M.  2010. The rate of growth in scientific publication and the decline in coverage provided by Science Citation Index The rate of growth in scientific publication and the decline in coverage provided by science citation index. Scientometrics843575–603.
  • Lawrence . (1999) Lawrence99digitallibrariesLawrence, S., Giles, CL.  Bollacker, K.  1999. Digital libraries and autonomous citation indexing Digital libraries and autonomous citation indexing. IEEE Computer32667–71.
  • Leaman . (2013) leaman2013dnormLeaman, R., Doğan, RI.  Lu, Z.  2013. DNorm: Disease name normalization with pairwise learning to rank DNorm: Disease name normalization with pairwise learning to rank. Bioinformatics29222909–2917.
  • Li . (2015) li2013combinationLi, CL., Su, YC., Lin, TW., Tsai, CH., Chang, WC., Huang, KH.Yang, C.  2015. Combination of feature engineering and ranking models for paper-author identification in KDD Cup 2013 Combination of feature engineering and ranking models for paper-author identification in KDD Cup 2013.

    The Journal of Machine Learning Research1612921–2947.

  • J. Liu . (2013) liu2013rankingLiu, J., Lei, KH., Liu, JY., Wang, C.  Han, J.  2013. Ranking-based name matching for author disambiguation in bibliographic data Ranking-based name matching for author disambiguation in bibliographic data. Proceedings of the 2013 KDD Cup Workshop Proceedings of the 2013 KDD Cup Workshop ( 8).
  • TY. Liu (2009) liu2009Liu, TY.  2009. Learning to rank for information retrieval Learning to rank for information retrieval. Foundations and Trends in Information Retrieval33225–331.
  • Lopez-Cozar . (2012) lopez2012manipulatingLopez-Cozar, ED., Robinson-García, N.  Torres-Salinas, D.  2012. Manipulating Google Scholar citations and Google Scholar metrics: Simple, easy and tempting Manipulating Google Scholar citations and Google Scholar metrics: Simple, easy and tempting. arXiv preprint arXiv:1212.0638.
  • Manning . (2008) manning2008scoringManning, CD., Raghavan, P.  Schütze, H.  2008. Scoring, term weighting and the vector space model Scoring, term weighting and the vector space model. Introduction to information retrieval1002–4.
  • Marshakova-Shaikevich (1973) marshakova1973Marshakova-Shaikevich, I.  1973. System of document connections based on references System of document connections based on references. Scientific and Technical Information Serial of VINITI63–8.
  • Martín-Martín . (2014) martin2014googleMartín-Martín, A., Ayllón, JM., Orduña-Malea, E.  López-Cózar, ED.  2014. Google Scholar Metrics 2014: A low cost bibliometric tool Google Scholar Metrics 2014: A low cost bibliometric tool. arXiv preprint arXiv:1407.2827.
  • Masic  Milinovic (2012) masic2012Masic, I.  Milinovic, K.  2012. On-line Biomedical Databases–The Best Source For Quick Search of the Scientific Information in the Biomedicine On-line biomedical databases–the best source for quick search of the scientific information in the biomedicine. Acta Informatica Medica20272.
  • Microsoft (20171) MicrosoftFAQ2016Microsoft.  20171. Frequently Asked Questions. Frequently asked questions. https://academic.microsoft.com/faq Last accessed: 23 october 2017
  • Microsoft (20172) MicrosoftAcademicGraphMicrosoft.  20172. Microsoft Academic Graph. Microsoft academic graph. https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/ Last accessed: 23 October 2017
  • Molyneux  Molyneux (2012) molyneux2012systemMolyneux, S.  Molyneux, A.  201209 21. System and method for establishing a dynamic meta-knowledge network. System and method for establishing a dynamic meta-knowledge network. Google Patents. US Patent App. 13/623,933
  • NCBI (2017) pubmedHelpNCBI.  2017. PubMed Help. PubMed Help. http://www.ncbi.nlm.nih.gov/books/NBK3827/ Last accessed: 23 October 2017
  • Nelson (2009) nelson2009medicalNelson, SJ.  2009. Medical terminologies that work: The example of MeSH Medical terminologies that work: The example of MeSH. Proceedings of the 2009 10th International Symposium on Pervasive Systems, Algorithms, and Networks (ISPAN) Proceedings of the 2009 10th International Symposium on Pervasive Systems, Algorithms, and Networks (ISPAN) ( 380–384).
  • Newman (2001) newman2001structureNewman, ME.  2001. The structure of scientific collaboration networks The structure of scientific collaboration networks. Proceedings of the National Academy of Sciences982404–409.
  • NIH (2017) difPubMedNIH.  2017. MEDLINE, PubMed, and PMC (PubMed Central): How are they different? MEDLINE, PubMed, and PMC (PubMed Central): How are they different? https://www.nlm.nih.gov/pubs/factsheets/dif_med_pub.html Last accessed: 23 October 2017
  • Nourbakhsh . (2012) nourbakhsh2012medicalNourbakhsh, E., Nugent, R., Wang, H., Cevik, C.  Nugent, K.  2012. Medical literature searches: A comparison of PubMed and Google Scholar Medical literature searches: A comparison of PubMed and Google Scholar. Health Information and Libraries Journal293214–222.
  • Ortega  Aguillo (2014) ortega2014microsoftOrtega, JL.  Aguillo, IF.  2014. Microsoft Academic Search and Google Scholar citations: Comparative analysis of author profiles Microsoft Academic Search and Google Scholar citations: Comparative analysis of author profiles. Journal of the Association for Information Science and Technology6561149–1156.
  • Sarwar . (2001) sarwar2001itemSarwar, B., Karypis, G., Konstan, J.  Riedl, J.  2001. Item-based collaborative filtering recommendation algorithms Item-based collaborative filtering recommendation algorithms. Proceedings of the 10th international conference on World Wide Web Proceedings of the 10th international conference on World Wide Web ( 285–295).
  • Schalekamp  Zuylen () Schalekamp98Schalekamp, F.  Zuylen, A.  . Rank aggregation: Together we’re strong Rank aggregation: Together we’re strong. Proceedings of the 11th Workshop on Algorithm Engineering and Experiments. proceedings of the 11th workshop on algorithm engineering and experiments.
  • Shariff . (2013) shariff2013retrievingShariff, SZ., Bejaimal, SA., Sontrop, JM., Iansavichus, AV., Haynes, RB., Weir, MA.  Garg, AX.  2013. Retrieving clinical evidence: A comparison of PubMed and Google Scholar for quick clinical searches Retrieving clinical evidence: A comparison of PubMed and Google Scholar for quick clinical searches. Journal of Medical Internet Research158.
  • Small (1973) small1973coSmall, H.  1973. Co-citation in the scientific literature: A new measure of the relationship between two documents Co-citation in the scientific literature: A new measure of the relationship between two documents. Journal of the American Society for information Science244265–269.
  • Sugiyama  Kan (2011) sugiyama2011serendipitousSugiyama, K.  Kan, MY.  2011. Serendipitous recommendation for scholarly papers considering relations among researchers Serendipitous recommendation for scholarly papers considering relations among researchers. Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries ( 307–310).
  • Testa (July 18, 2016) wosSelectionTesta, J.  July 18, 2016. The Thomson Reuters Journal Selection Process. The Thomson Reuters journal selection process. https://clarivate.com/essays/journal-selection-process/ Last accessed: 23 October 2017
  • Xiong . (2017) xiong2017explicitXiong, C., Power, R.  Callan, J.  2017. Explicit Semantic Ranking for Academic Search via Knowledge Graph Embedding Explicit semantic ranking for academic search via knowledge graph embedding. Proceedings of the 26th International Conference on World Wide Web Proceedings of the 26th international conference on world wide web ( 1271–1279).