GrapAL: Querying Semantic Scholar's Literature Graph

We introduce GrapAL (Graph database of Academic Literature), a versatile tool for exploring and investigating scientific literature which satisfies a variety of use cases and information needs requested by researchers. At the core of GrapAL is a Neo4j graph database with an intuitive schema and a simple query language. In this paper, we describe the basic elements of GrapAL, how to use it, and several use cases such as finding experts on a given topic for peer reviewing, discovering indirect connections between biomedical entities, and computing citation-based metrics.




1 Introduction

Researchers rely on scientific literature to perform a wide variety of tasks such as searching for papers, assessing candidates applying for a research position, and keeping track of papers published on topics of interest. Several software products are available to help researchers perform these tasks. For example, many biomedical researchers use PubMed to find papers relevant to their studies, Google Scholar allows researchers to verify and curate their user profiles, and Semantic Scholar extracts research topics, figures, and tables from papers and links them to external content such as slides, videos and GitHub repositories. However, features provided by such products are chosen and designed to satisfy only the most common tasks, ignoring the long tail of informational needs such as finding experts on a given topic, identifying potential collaborators, assessing influence between research areas, and discovering connections between biological entities.

In this paper, we address these limitations by introducing GrapAL, a tool which provides a flexible and efficient way to query the Semantic Scholar knowledge base. GrapAL is publicly available, along with documentation. In the following section (§2), we introduce the schema and query language used in GrapAL and discuss how users can connect to the database. In §3, we show how GrapAL can be used to satisfy several example informational needs identified through user studies. In §4, we discuss some of the design choices and the system architecture for GrapAL.

2 How to use GrapAL

GrapAL is designed to satisfy a wide variety of use cases and scenarios requested by users of Semantic Scholar who wish to process scientific literature to facilitate their own work. To achieve this, we design GrapAL as a Neo4j property graph with an intuitive schema, and make it available to query using the Cypher query language (Francis et al., 2018).

Figure 1: Overview of GrapAL schema. * denotes an indexed property.


Fig. 1 demonstrates the schema of our graph database, which consists of 7 node types (displayed in turquoise) and 8 edge types (displayed in purple). At the core of the graph is the Paper node. A Paper node connects to a Venue node (via an APPEARS_IN edge), to Author nodes (via AUTHORS edges), to Affiliation nodes (via AFFILIATED_WITH edges), to Entity nodes (via MENTIONS edges), to RelationInstance nodes (via MENTIONS_RELATION edges), and to other Paper nodes (via CITES edges). A RelationInstance node (e.g., causes[Smoking, Cancer]) represents an n-ary relationship of type Relation (via a WITH_RELATIONSHIP edge) between two or more Entity nodes (via WITH_ENTITY edges). Table 1 provides the number of instances of each node and edge in the schema at the time of this writing. This schema modifies the one described in Ammar et al. (2018) to include additional node types (e.g., Affiliation) and optimizes for query execution time. See Ammar et al. (2018) for details on how we extract entities and various metadata for each paper.

Node Type           Count
Affiliation*        16M
Author              17M
Entity              493K
Paper               46M
Relation            51
RelationInstance    347K
Venue*              78K

Edge Type

Table 1: Approximate node and edge cardinalities. * indicates node types that are not de-duplicated.

Query Language.

Before we discuss realistic case studies in §3, we introduce the query language used to query GrapAL with a few simple example queries. The following query matches Author nodes in GrapAL and returns the first 10:

    MATCH (a:Author)
    RETURN a
    LIMIT 10

The following query applies additional property value constraints:

    MATCH (a:Author {last: "Ellis", first: "Clarence"})
    RETURN a

Note the round brackets used to specify an instance of node type Author, and the curly brackets used to specify its properties.

The following query finds a paper by its title and returns its authors:

    MATCH (a:Author)-[:AUTHORS]->(p:Paper)
    WHERE p.title = "One-shot learning of object categories"
    RETURN a

Note the use of square brackets to specify edges and the arrow to specify edge direction.

More information about the Cypher query language can be found in Francis et al. (2018).

Connecting to GrapAL.

Users can query GrapAL in three different ways. First, an interactive graphical interface is available, which is suitable for interactive exploration of GrapAL with a relatively small number of results. Second, the Neo4j HTTP API can be used for batch processing of the graph, e.g., to execute a query for a list of authors of interest; documentation for the API is available online. Third, developers can query the graph natively in their applications using one of the Neo4j language drivers. Neo4j officially supports five languages (.NET, Java, JavaScript, Go and Python), but drivers are available for a longer list of programming languages thanks to the active Neo4j community.
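
As a rough illustration of the HTTP option, the sketch below prepares one request per author of interest using only the Python standard library. The host name `grapal.example.org`, the port, and the helper `build_cypher_request` are placeholders introduced here, not details taken from GrapAL's documentation. The endpoint path is the transactional endpoint of Neo4j 3.x and may differ on other versions.

```python
import json
import urllib.request

# Hypothetical endpoint: replace the host with the real GrapAL server.
# The path follows the Neo4j 3.x transactional HTTP API (an assumption here).
ENDPOINT = "http://grapal.example.org:7474/db/data/transaction/commit"


def build_cypher_request(query, parameters=None, endpoint=ENDPOINT):
    """Wrap one Cypher statement in the JSON body the HTTP API expects."""
    body = {"statements": [{"statement": query, "parameters": parameters or {}}]}
    return urllib.request.Request(
        endpoint,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


# One request per author; query parameters avoid pasting names into the query.
QUERY = "MATCH (a:Author {first: $first, last: $last}) RETURN a"
authors = [("Clarence", "Ellis"), ("Regina", "Barzilay")]
batch = [build_cypher_request(QUERY, {"first": f, "last": l}) for f, l in authors]

# Sending a request needs network access and a live server, for example:
#     with urllib.request.urlopen(batch[0]) as response:
#         result = json.load(response)
```

Building the requests separately from sending them keeps the batch easy to inspect, throttle, or retry.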

3 Case Studies

In the early phase of this project, we conducted several user studies with computer science and biomedical researchers to better understand the kinds of questions they would like to answer with this data. In this section, we focus on a few use cases that were given by researchers and translate each to a GrapAL query in Cypher.

Shortest Path.

Consider researcher a seeking an introduction or an endorsement to work with researcher b. By finding the shortest path between the two researchers in GrapAL, researcher a can identify collaborators connecting the two.

    MATCH p=shortestPath((a:Author)-[:AUTHORS*0..6]-(b:Author))
    WHERE a.first = "Swabha" AND a.last = "Swayamdipta"
      AND b.first = "Regina" AND b.last = "Barzilay"
    RETURN p

This query returns a path that connects the two researchers through papers each has authored, linked by their common co-author Luke Zettlemoyer.

We constrain the path to at most six edges, all of type AUTHORS. For ambiguous names, it may be necessary to first find the author page on Semantic Scholar and fetch the unique author ID from the URL. For example, Swabha Swayamdipta’s ID is 2705113, as it appears in the URL of her author page, so the clause a.first = "Swabha" AND a.last = "Swayamdipta" can be replaced with a.author_id = 2705113. We note that the shortest path can be computed between other node types (and through other edge types) in GrapAL as well.
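
To make the traversal concrete, here is a toy breadth-first search over an undirected author-paper graph with a cap on the number of edges. That is conceptually what a bounded shortest-path query asks for. The graph, the node names, and the `shortest_path` helper are invented for illustration. How Neo4j implements `shortestPath` internally is not described in this paper and may well differ.

```python
from collections import deque

# Toy undirected graph: authors link to papers through AUTHORS edges.
# All node names are made up for this example.
EDGES = [
    ("author:A", "paper:1"),
    ("author:C", "paper:1"),
    ("author:C", "paper:2"),
    ("author:B", "paper:2"),
    ("author:D", "paper:3"),
]


def build_adjacency(edges):
    """Index each node's neighbours, ignoring edge direction."""
    adjacency = {}
    for left, right in edges:
        adjacency.setdefault(left, set()).add(right)
        adjacency.setdefault(right, set()).add(left)
    return adjacency


def shortest_path(adjacency, start, goal, max_edges=6):
    """Return a shortest node list from start to goal with at most max_edges edges, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        if len(path) - 1 >= max_edges:
            continue  # this path already uses the maximum number of edges
        for neighbour in sorted(adjacency.get(node, ())):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None


graph = build_adjacency(EDGES)
path = shortest_path(graph, "author:A", "author:B")
```

In this toy graph the path alternates between author and paper nodes, so the intermediate author on it is the shared collaborator.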

Finding Experts.

One of the pain points in organizing a conference is identifying reviewers who are knowledgeable on a given research topic. By querying GrapAL, members of the organizing committee can find more reviewers and rely less on their immediate social network when deciding whom to invite. For example, the following query can be used to find researchers who published the most papers on “Relationship extraction” after 2013.

    MATCH (a:Author)-[:AUTHORS]->(p:Paper),
          (p)-[:MENTIONS]->(:Entity {name: "Relationship extraction"})
    WHERE p.year > 2013
    WITH a, count(p) AS cp
    RETURN a, cp
    ORDER BY cp DESC

To find the exact canonical name of an entity node, we can run a case-insensitive regular expression query like the following.

    MATCH (e:Entity)
    WHERE e.name =~ "(?i)relationship extraction"
    RETURN e

Papers at the Intersection of Entities.

Search engine results sometimes make it difficult to find papers that discuss the intersection of multiple topics or fields. With GrapAL, we can return all papers that mention every entity in a given set, such as “Constraint programming” and “Natural language processing”.

    MATCH (p:Paper)-[:MENTIONS]->(e1:Entity {name: "Constraint programming"}),
          (p:Paper)-[:MENTIONS]->(e2:Entity {name: "Natural language processing"})
    RETURN p

The result of this query is a list of papers that mention both of the aforementioned entities.
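
The same pattern extends to any number of entities by adding one MENTIONS clause per entity. The sketch below generates such a query as a string. The helper name `build_intersection_query` and the parameter names `name0`, `name1`, ... are assumptions made for this example and are not part of GrapAL.

```python
def build_intersection_query(entity_names):
    """Return (cypher, parameters) matching papers that mention every named entity."""
    if not entity_names:
        raise ValueError("at least one entity name is required")
    patterns = []
    parameters = {}
    for index, name in enumerate(entity_names):
        key = "name{}".format(index)
        # Only the first pattern needs to declare the Paper label.
        node = "(p:Paper)" if index == 0 else "(p)"
        patterns.append("{}-[:MENTIONS]->(:Entity {{name: ${}}})".format(node, key))
        parameters[key] = name
    cypher = "MATCH " + ", ".join(patterns) + " RETURN p"
    return cypher, parameters


cypher, parameters = build_intersection_query(
    ["Constraint programming", "Natural language processing"]
)
```

Passing entity names as query parameters, rather than pasting them into the query text, avoids quoting problems for names that contain quotation marks.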

Connecting Scientific Concepts.

Some researchers wanted to explore direct and indirect connections between two scientific concepts (entities) of interest, e.g., the impact of ‘adjuvant antiestrogen therapy (Arimidex)’ on ‘estrogen receptors’. Using GrapAL, we can find the shortest path between these two entities and get a chain of interactions. This is potentially useful in piecing together research and finding connections between different biomedical entities, which may be scattered across papers.

    MATCH path=shortestPath(
      (er:Entity {name: "Estrogen Receptors"})-[:WITH_ENTITY*0..15]-(ar:Entity {name: "Arimidex"}))
    WITH nodes(path) AS ns
    UNWIND ns AS n
    MATCH (n)-[:WITH_ENTITY {position: 0}]->(e0:Entity),
          (n)-[:WITH_ENTITY {position: 1}]->(e1:Entity),
          (n)-[:WITH_RELATIONSHIP]->(r:Relation)
    RETURN e0, r, e1

This query returns a list of triples (e0, r, e1) which connect ‘Arimidex’ to ‘Estrogen Receptors’. The UNWIND operator allows us to examine each node on the shortest path and process it as needed.

Citation-Based Metrics.

Citations are often used to define metrics for assessing the impact of a researcher or a venue. In addition to computing traditional metrics such as h-index and i10-index, GrapAL can also be used to compute more granular metrics, such as how often papers in one conference cite papers in another conference. For example:

    MATCH (p1:Paper)-[:APPEARS_IN]->(naacl:Venue),
          (p2:Paper)-[:APPEARS_IN]->(cvpr:Venue),
          path=((p1)-[:CITES]->(p2))
    WHERE naacl.text =~ ".*NAACL.*" AND cvpr.text =~ ".*CVPR.*"
    RETURN count(path)

This query returns the number of times a NAACL paper cites a CVPR paper. We use the =~ operator to match on venue names by regular expression because venues are stored as unstructured strings.
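
For reference, the two traditional metrics mentioned above can be computed from a list of per-paper citation counts as in the sketch below. The counts are made-up numbers, and the sketch assumes they have already been retrieved from the graph for one author. The function names are chosen here for illustration.

```python
def h_index(citation_counts):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, citations in enumerate(ranked, start=1):
        if citations >= rank:
            h = rank
        else:
            break  # counts only decrease from here, so no larger h exists
    return h


def i10_index(citation_counts):
    """Number of papers with at least 10 citations."""
    return sum(1 for citations in citation_counts if citations >= 10)


# Hypothetical citation counts for one author's six papers.
counts = [25, 12, 10, 4, 3, 0]
author_h = h_index(counts)
author_i10 = i10_index(counts)
```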

4 System Design

Graph Database.

Due to the high connectivity in the data and the nature of queries GrapAL is designed for, we opted to create GrapAL using a graph-native database instead of a more conventional relational database. Unlike a relational database, a graph database provides a natural and efficient way to query and traverse multi-hop relations without using computationally expensive join operations. While several graph database systems have become available recently (e.g., AWS Neptune, dgraph), we decided to use the Neo4j platform, which was released in 2007, has a strong community of developers, and is the most widely used graph database system as of the time of this writing. The main limitation we experienced using Neo4j is that it is not distributed (the same is true for other widely-used options), but this was not a severe problem since we were able to fit the data for GrapAL on a single server.

Building and Deploying GrapAL.

GrapAL is powered by the same data that powers the Semantic Scholar website, as described in Ammar et al. (2018). We use a staging server to read a snapshot of the data as Spark DataFrames from AWS S3 and write CSV files which match the property schema described earlier. Due to the sheer number of records, we process different shards of the data in parallel before aggregating all shards into one CSV file for each node and edge type of the schema. Then, we use the Neo4j CSV import function to build the database. Once the data is imported, the database files are copied over to a production server which serves the dataset publicly and has lower processor and memory requirements compared to the staging server. We plan to rebuild GrapAL at a monthly cadence with new snapshots of the data.
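
The shard aggregation step can be pictured with the small in-memory sketch below, which concatenates CSV shards that share a header into a single file body. This is a simplification under stated assumptions: the actual pipeline uses Spark and is not shown in this paper, the `merge_shards` helper is hypothetical, and the column names are invented. Only the `:ID(...)` header field follows the syntax of Neo4j's CSV import tool.

```python
import csv
import io


def merge_shards(shard_texts):
    """Concatenate CSV shards that share a header into a single CSV string."""
    output = io.StringIO()
    writer = csv.writer(output, lineterminator="\n")
    expected_header = None
    for text in shard_texts:
        rows = csv.reader(io.StringIO(text))
        header = next(rows, None)
        if header is None:
            continue  # skip empty shards
        if expected_header is None:
            expected_header = header
            writer.writerow(header)  # write the header exactly once
        elif header != expected_header:
            raise ValueError("shard header does not match: {}".format(header))
        writer.writerows(rows)
    return output.getvalue()


# Hypothetical Author shards with invented columns and values.
shard_a = "author_id:ID(Author),first,last\n1,Ada,Lovelace\n"
shard_b = "author_id:ID(Author),first,last\n2,Alan,Turing\n"
merged = merge_shards([shard_a, shard_b])
```

Checking that every shard carries the same header catches schema drift between shards before the import step runs.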

5 Related Work

Related APIs have been made available to help researchers navigate scientific literature. Singh et al. (2018) provide an API to interact with the ACL anthology. However, it is limited to the areas of computational linguistics and natural language processing, and it uses a predefined list of query templates with placeholders for authors, papers and venues. Springer Nature SciGraph provides an API for accessing publication metadata from the Springer Nature corpus, but it is limited to papers and books published by Springer Nature. The Microsoft Academic Graph (Shen et al., 2018) similarly offers an API for examining academic literature. Because it is a relational database, it is hard to query with complex, multi-hop relations, as discussed in §4. This work is also related to a line of NLP work focusing on scientific documents including citation prediction (e.g., Yogatama et al., 2011; Bhagavatula et al., 2018), author modeling (e.g., Sim et al., 2015), stylometry (e.g., Bergsma et al., 2012), bibliometrics (e.g., Foulds and Smyth, 2013; Weihs and Etzioni, 2017) and information extraction (e.g., Kergosien et al., 2018; Andruszkiewicz and Hazan, 2018).


6 Conclusion

GrapAL is a versatile tool for exploring and investigating scientific literature built on the Neo4j graph database framework. We describe the basic elements of GrapAL, how to use it, and use cases such as finding experts on a given topic for peer reviewing, discovering indirect connections between biomedical entities, and computing citation-based metrics.

Future improvements include more metadata and changes to the structure of affiliation and venue data. We intend to change the data pipeline architecture to perform event-based incremental updates rather than a regular batch build. We continue to improve the models used to populate GrapAL’s nodes and edges (e.g., author disambiguation and entity extraction and linking).


Acknowledgments

We thank Khaled Ammar for his graph database suggestions, Michal Guerquin for his help in designing and building the pipeline, and Darrell Plessas for his technical assistance. We also thank Noah Smith and the Semantic Scholar team for their support.


References

  • Ammar et al. (2018) Waleed Ammar, Dirk Groeneveld, Chandra Bhagavatula, Iz Beltagy, Miles Crawford, Doug Downey, Jason Dunkelberger, Ahmed Elgohary, Sergey Feldman, Vu Ha, Rodney Kinney, Sebastian Kohlmeier, Kyle Lo, Tyler Murray, Hsu-Han Ooi, Matthew E. Peters, Joanna Power, Sam Skjonsberg, Lucy Lu Wang, Chris Wilhelm, Zheng Yuan, Madeleine van Zuylen, and Oren Etzioni. 2018. Construction of the literature graph in Semantic Scholar. In NAACL-HLT.
  • Andruszkiewicz and Hazan (2018) Piotr Andruszkiewicz and Rafal Hazan. 2018. Annotated corpus of scientific conference’s homepages for information extraction. In LREC. European Language Resource Association.
  • Bergsma et al. (2012) Shane Bergsma, Matt Post, and David Yarowsky. 2012. Stylometric analysis of scientific articles. In ACL-HLT, pages 327–337. Association for Computational Linguistics.
  • Bhagavatula et al. (2018) Chandra Bhagavatula, Sergey Feldman, Russell Power, and Waleed Ammar. 2018. Content-based citation recommendation. In NAACL-HLT.
  • Foulds and Smyth (2013) James Foulds and Padhraic Smyth. 2013. Modeling scientific impact with topical influence regression. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 113–123. Association for Computational Linguistics.
  • Francis et al. (2018) Nadime Francis, Alastair Green, Paolo Guagliardo, Leonid Libkin, Tobias Lindaaker, Victor Marsault, Stefan Plantikow, Mats Rydberg, Petra Selmer, and Andrés Taylor. 2018. Cypher: An evolving query language for property graphs. In SIGMOD Conference.
  • Kergosien et al. (2018) Eric Kergosien, Amin Farvardin, Maguelonne Teisseire, Marie-Noelle Bessagnet, Joachim Schöpfel, Stéphane Chaudiron, Bernard Jacquemin, Annig Lacayrelle, Mathieu Roche, Christian Sallaberry, and Jean-Philippe Tonneau. 2018. Automatic identification of research fields in scientific papers. In LREC. European Language Resource Association.
  • Shen et al. (2018) Zhihong Shen, Hao Ma, and Kuansan Wang. 2018. A web-scale system for scientific knowledge exploration. In ACL.
  • Sim et al. (2015) Yanchuan Sim, Bryan Routledge, and Noah A. Smith. 2015. A utility model of authors in the scientific community. In EMNLP, pages 1510–1519. Association for Computational Linguistics.
  • Singh et al. (2018) Mayank Singh, Pradeep Dogga, Sohan Patro, Dhiraj Barnwal, Ritam Dutt, Rajarshi Haldar, Pawan Goyal, and Animesh Mukherjee. 2018. CL Scholar: The ACL anthology knowledge graph miner. In NAACL 2018.
  • Weihs and Etzioni (2017) Luca Weihs and Oren Etzioni. 2017. Learning to predict citation-based impact measures. 2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL), pages 1–10.
  • Yogatama et al. (2011) Dani Yogatama, Michael Heilman, Brendan O’Connor, Chris Dyer, Bryan R. Routledge, and Noah A. Smith. 2011. Predicting a scientific community’s response to an article. In EMNLP, pages 594–604. Association for Computational Linguistics.