Metrics derived from publication outputs are often a key factor in published quantitative rankings of universities and other research institutions, but these rankings are derived using coarse-grained measures that do not enable an observer to gain insight into the unique scholarly contributions of an institution (huang2020comparison). Thus, ultimately they are not fit for the purpose of understanding the capacity or track record of a university or department to support a specific line of research or project. Likewise, they do not provide historical context of the institution’s role in developing fields of research. Yet without an alternative, these rankings are often used as indicators of a “good choice” for prospective students, funding agencies, and national review boards. The principle of particularity proposed by marginson2014university is that rankings should not rely on proxy measures that do not directly measure the particular qualities of universities that they purport to capture. Certainly, aggregate rankings fail that standard for anything but the most basic questions of relative merit or reputation (collins2016ranking). If we wish to understand the distinctive qualities of each university’s research contributions then we need better ways to ask finely-honed questions about what those universities are producing now and have produced over time.
In fact, information that could enable us to answer those more nuanced questions does exist within the abstracts and texts of the publications that the researchers at an institution generate. Thus, it would be advantageous to create techniques that allow us to directly tap into that information. Indeed, information retrieval methods that have been developed for ranking web search results provide us with a flexible set of approaches to explore the bespoke qualities of universities’ research contributions (white2007combining; bar2016bibliometrics). Providing the ability for end users to define ad hoc keyword queries and retrieve ranked results from an index opens the door to millions of different ways of exploring the scholarly production of an institution. For example, if we imagine a prospective graduate student wanting to study “quantum computing”, she might try to find a university that is highly ranked in Computer Science or Physics. Yet this is no guarantee that there are professors or research groups there working on the specific problems related to quantum computing that she is interested in. Furthermore, if she were to refine her search to a narrower topic, such as “quantum decoherence”, the ranked categories are even less useful. Ideally, if she were to look at the number and quality of relevant research publications on quantum computing produced by different universities, that would provide a more appropriate initial indicator of the suitability of the institution for her study. However, given the sheer number of articles available, and with no way of exploring them by institution, it would be hard to know where to start. Solving this then becomes an information search and, ultimately, a sensemaking problem.
In this article we present the design and implementation of Quoka (or QUesting Open Knowledge Atlas), a data-driven, interactive, and searchable atlas of scholarly knowledge production built from the text and metadata of 100 million research artifacts. The atlas is designed to enable a user to enter an ad hoc keyword query and then explore relevant research outputs of thousands of institutions world-wide over time. In doing so it provides multiple types of feedback on an interactive map and timeline display. At the global level a heatmap display shows geographic regions where relevant publications are being produced. At a more zoomed-in level, individual institutions are visually ranked by markers that are sized based on an information retrieval ranking of the institution’s research outputs and depending on the user’s query. The timeline display shows changes in the prominence of the topic over time and it also serves as a mechanism to filter the map display by year. Thus, for any given topic, geographic regions and institutions can be compared over different time periods. Additionally, by selecting an institution on the map, the top most relevant publications are shown for the filtered time period.
The four main contributions of the research detailed in this article are as follows:
We introduce a technique that organizes and indexes scholarly publications along two dimensions: institution and year of publication, which together provide a useful frame of reference for exploring, searching, and comparing research topics and institutions. Because a text index underlies the proposed system, the user can easily drill down and read specific documents produced at an institution.
While some prior work has been done on geographically and temporally organizing research publications from specific publishers (cf. gao2013spatiotemporal), the scale of the data used in Quoka—with regard to both the breadth of topics as well as the number of objects (>100 million)—far exceeds anything done previously; the index is comparable in size to commercial scholarly search engines such as Google Scholar. The result is an open-ended and regularly updated atlas that can be used in multiple disciplinary contexts.
We re-frame the ranking problem as one of sensemaking, whereby the goal is essentially one of schematizing the evidence from a diverse set of data (i.e., research publications) (pirolli2005sensemaking). For that to work effectively we need a tool that allows a user to forage and collect contextually-relevant information, which is functionality that we prototype in the interactive atlas.
We demonstrate through use cases how the interactive atlas can provide a nuanced historical and institution-based perspective of scholarly production for a set of example topics, and discuss the implications for the sociological study of knowledge production.
The remainder of this article is organized as follows. After providing background material on scientometrics, sensemaking, and information retrieval, we explain the design goals and issues for the atlas, including data collection and preparation. We follow with a description of the implementation of the Quoka system, including both the back-end index as well as the front-end user interface. Finally, we conclude with a brief discussion of the next steps.
2 Related Work
Despite the common goal of providing a ranking result based on content, the most prominent ranking methods in science do not incorporate methods from information retrieval (bar2016bibliometrics; mayr2015scientometrics; white2007combining). Where academic outputs do play a role in rankings (of either individuals or institutions), it is more often based on factors such as the reputation of the researchers (safon2013global), the quantity of citations (hirsch2007does), or other network-based measures (ding2009pagerank). Indeed, rankings based on these measures have been widely critiqued for not capturing the quality of the research that is produced and even for being harmful to institutions (bornmann2007we; amsler2012university; lynch2015control; biagioli2020gaming). Our goal here is not to create a new ranking for the purpose of globally comparing universities, but rather to create tools that can highlight the variety and comparative focuses of many different institutions. For this, the ability to search and contextualize the research outputs of an institution is key.
2.1 Sensemaking

Sensemaking is the process that an individual or group engages in to synthesize knowledge from complex and varied evidence in order to aid decision-making (russell1993cost). In human-computer interaction studies, sensemaking has been framed in terms of constructing representational schemas that explain the evidence, for example, a summary report that synthesizes findings from data. Sensemaking has been studied in many different domains, including intelligence analysis, medical decision making, and education (pirolliRussell2011). pirolli2005sensemaking have created a model of sensemaking that consists of two major loops, a foraging loop and a sensemaking loop. In the foraging loop information is found and organized into evidence. In the sensemaking loop a structured story is built from this evidence by representing it in a schematic form (e.g., a visualization or narrative) and by creating hypotheses that support the sensemaker’s conclusions. The two loops feed into each other, so the creation of a schematic, for example, can lead to a re-evaluation of the evidence or require seeking out additional information. Sensemaking, therefore, is an iterative process that flows in both a bottom-up and top-down manner (pirolli2005sensemaking).
Sensemaking has taken on a slightly different definition when viewed from a psychological perspective. It has been characterized as an active form of situational awareness, where frames, or mental models, are created from data and iteratively re-framed to interpret past events and predict future ones (klein2006making; klein2006making2). Regardless of the formalization used, sensemaking tools are designed to facilitate the iterative steps required for someone to come to a conclusion or decision in the face of heterogeneous, often ambiguous, and sometimes contradictory data (kirschner2012visualizing). Computational tools aid humans performing sensemaking by supporting information seeking and foraging tasks, or through collaborative sensemaking interfaces.
2.2 Why discovering relationships between institutions and knowledge production is sensemaking
In this section we describe three example scenarios that demonstrate how coming to an understanding of the scholarly output of an institution or research organization is, in fact, a kind of sensemaking activity as described above. In each case the sensemaker utilizes a combination of bottom-up and top-down processes to reach a conclusion on a problem they are interested in. As described in pirolli2005sensemaking the bottom-up processes include searching and filtering for information, reading and extracting evidence from the information that is found, representing the information in a schematic form, such as a narrative, and making a decision. The top-down processes include re-evaluating the conclusions based on external feedback, which will lead to searching for additional support, evidence, and relations to either bolster the previous conclusions or re-evaluate them.
2.2.1 Scenario 1: A student researching where to go to graduate school.
In the introduction we presented a scenario of a student researching where to go for graduate school. Attracting students is one of the common motivators for university rankings, but for students wishing to explore the offerings at different universities in greater depth, rankings are not particularly useful. Undoubtedly, many potential Ph.D. students have a general interest in a domain of research but might not have been exposed to the breadth of topics and projects in progress at various universities. It is also possible that relevant research is being conducted in a different type of department than the one the student is considering, as often occurs in interdisciplinary contexts. Finding a good potential program and advisor is a sensemaking task because it involves distilling many different forms of evidence through an iterative process in which the student learns more about the research that is being conducted and who is doing it. This might begin in a bottom-up way with a broad exploratory search through the different programs that are available, or in a more top-down manner, where, for example, an undergraduate research advisor has predisposed the student to look at a few specific programs, which then expands to other programs as she reads more of the research that is happening there and elsewhere. As the process progresses the student might refine her schema of what type of program is suitable and might even change the focus of her planned research. This can, of course, also change after she has begun her studies, but the decision about where to apply and enrol represents a clear example of sensemaking from heterogeneous information, with the goal of making a critical decision that can have significant implications for the student’s future.
2.2.2 Scenario 2: A national review board wanting to get a picture of the research landscape.
Universities and individual researchers are increasingly required to report on the performance of their research output to national review boards (hicks2012performance). When measuring research performance, several factors can be used to assess impact, a task complicated by the many ways that different disciplines and individual researchers contribute to knowledge production.
2.2.3 Scenario 3: A researcher studying the history and sociology of an academic research field.
Over time, new academic research disciplines are created and existing ones shift in focus and scope. These developments, as well as the underlying causes and actors involved, are of interest to historians of the academy. Younger research fields, such as geographic information science (GIScience), often have fervent debates about the nature of the discipline (reitsma2013revisiting), while older disciplines might re-brand or splinter as the field progresses. One example of this is the clear shift in 2004 away from “humanities computing” to “digital humanities” (hockey2004history; vanhoutte2013gates). The role of geography in the development of research and innovation has also been extensively studied (saxenian1996regional; jons2013global; Frenken2020).
2.3 Chronotopic information interaction
Chronotopic information interaction is a design paradigm that uses the inherent spatio-temporal structure found in a heterogeneous document collection to support information seeking behavior. This structure can be derived from document metadata as well as the references to places and dates within the text of the documents (Adams2020). A search engine that uses this form of interaction visually emplaces search results within an integrated geographical and temporal frame of reference, which provides context to explore and discover information. This geographical and historical context allows a user to leverage their individual expert knowledge about different locations and historical events while exploring the search results (duggan2008knowledge). When applied to research publications, chronotopic information interaction enables the assessment of an institution’s historical role in the development of research as well as a way to explore the contributions that different institutions make over time. In the Quoka system chronotopic information interaction is used to augment sensemaking tasks to understand the role of institutions and researchers in knowledge production over space and time.
3 Overview of the Atlas
The Quoka atlas is a publicly-available interactive website (https://pteraform.csse.canterbury.ac.nz/quoka/). Similar to a standard search engine, a keyword query is entered in a search box and a geographic and historical overview is presented on a web map and timeline, as seen in Figure 1. In that example the results for the query “evolutionary psychology” are shown. A heatmap is rendered on the map derived from the index scores for each institution, and the timeline shows the relative salience of the research topic over time, from 1960 to 2020. Since geographically proximate institutions blend together when zoomed out, the heatmap enables one to see geographic clustering of research activity.
A range of years can be selected along the timeline and the heatmap will adjust accordingly. This allows the user to investigate how the geography of research has changed over time. In Figure 2 we see that the field of evolutionary psychology was much less geographically distributed before 1984. As we zoom in on the map, the heatmap fades and individual institutions are marked with circles sized according to the institution’s score (see Figure 3). By selecting an institution with a mouse click we can explore relevant research articles that were published by researchers there. Figure 4 shows that when we zoom into Stanford University and select it, a ranked list of articles related to evolutionary psychology is presented, filtered by the currently selected time range.
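The step from per-institution relevance scores to heatmap weights can be sketched in Python. This is an illustrative reconstruction, not the system’s actual rendering code: the function name and the max-normalization of weights are assumptions.

```python
def heatmap_points(inst_scores, inst_coords):
    """Turn per-institution relevance scores into weighted heatmap points.

    inst_scores: {institution_id: score} from the institution-year index.
    inst_coords: {institution_id: (lat, lon)} from institution metadata.
    Weights are normalized to [0, 1] so rendering is query-independent.
    """
    max_score = max(inst_scores.values(), default=0.0)
    if max_score == 0:
        return []
    return [
        (lat, lon, inst_scores[i] / max_score)
        for i, (lat, lon) in inst_coords.items()
        if i in inst_scores
    ]
```

Re-running this with scores restricted to the selected year range yields the time-filtered heatmap described above.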
In their investigation of sensemaking by intelligence analysts, pirolli2005sensemaking described how, during the initial information foraging steps that lead to sensemaking, analysts made use of a “shoebox”: an intermediate place to store found information that would later be used to identify potentially relevant relations and evidence, which are subsequently used to support findings and conclusions. Being able to share this found information is useful for collaborative sensemaking as well, but when handing off information to others it is also important to communicate additional information that enables them to understand the task context during the search process (paul2009cosense). Therefore, in the shoebox we save the search state (query and institution) that led to an article being found, and provide a means for the searcher to record additional notes and comments. Figure 4 shows that the user has previously added an article from the University of Michigan and has written some notes.
4.1 Data sources
The Quoka indexes are built using the COKI Academic Observatory data collection pipeline, which fetches data about research publications from multiple sources and exposes synthesized data as a collection of Google BigQuery datasets. Figure 5 shows a high level schematic of the automated data pipeline and its constituent technologies. Data on over 100 million research artifacts has been collected from Unpaywall, Microsoft Academic Graph, Open Citations, and CrossRef, and this data is updated in an automated manner on a regular basis. Metadata about institutions is matched to this data from Geonames and the Global Research Identifier Database. Exported snapshots of this integrated data power various dashboards and visualizations, and they also serve as the input data for the Quoka indexes.
4.2 Index creation
The server back end of Quoka relies on two text indexes that are built from the Academic Observatory data and implemented using Elasticsearch and the Apache Lucene library (mccandless2010lucene; gormley2015elasticsearch). The current implementation is based on a snapshot of the COKI data from November 2020, and the indexes are in total approximately 400 GB in size.
The first is an index of the aggregate text produced by each institution per year (the institution-year index). Given a keyword query, this index generates a score for each institution. These scores can be used for ranking as well as visual feedback, and in the case of the atlas as indicators on the interactive map. The second, the DOI index, is an index of each research artifact organized by digital object identifier. The fields stored in the index for each DOI include: title, abstract, Microsoft Academic Graph fields of study (WangEtAl2020), authors, publisher, journal name, Global Research Identifier Database (GRID, https://grid.ac/) ids, year of publication, citation count, and open access information. This index allows for ranking of research objects, filtered by institution, year, and other criteria.
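The DOI index fields listed above could be declared with an Elasticsearch mapping along the following lines. The field names follow the list above, but the exact names and the field types are illustrative assumptions, not the deployed schema.

```python
# Sketch of an Elasticsearch mapping for the DOI index.
# Full-text fields ("text") are analyzed for keyword search;
# exact-match fields ("keyword") support filtering and aggregation.
doi_index_mapping = {
    "mappings": {
        "properties": {
            "doi":             {"type": "keyword"},
            "title":           {"type": "text"},
            "abstract":        {"type": "text"},
            "fields_of_study": {"type": "keyword"},
            "authors":         {"type": "text"},
            "publisher":       {"type": "keyword"},
            "journal":         {"type": "keyword"},
            "grid_ids":        {"type": "keyword"},   # contributing institutions
            "year":            {"type": "integer"},
            "citation_count":  {"type": "integer"},
            "open_access":     {"type": "boolean"},
        }
    }
}
```

Keeping `grid_ids` and `year` as exact-match fields is what makes per-institution and per-year filtering cheap at query time.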
Quoka uses an information-based scoring scheme for both indexes (clinchant2010information). Given a query $q$, the score for institution $i$ over a set of years $Y$ is the sum of the institution’s per-year scores (Equation 1):

$$s(i, q, Y) = \sum_{y \in Y} s(i, q, y) \qquad (1)$$

The institution’s score for each year, $s(i, q, y)$, is defined as follows in Equation 2:

$$s(i, q, y) = \sum_{w \in q} -\log \mathrm{Prob}\left(T_w \geq t_{w,i,y} \mid \lambda_w\right) \qquad (2)$$

Here $t_{w,i,y}$ is an H2 term-frequency-normalized version of the sum of occurrences of the word $w$ in all the text produced by institution $i$ in year $y$ (amati2002probabilistic). This normalization is based on the total number of words produced by the institution within the year, so that institutions that produce less content overall, but proportionately more with regard to the given query, are not scored artificially low. $\lambda_w$ is the average number of institution-year pairs in which the word $w$ occurs.
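The scoring in Equations 1 and 2 can be sketched in pure Python. This is an illustrative reconstruction under stated assumptions, not the deployed code (in practice the scoring runs inside Elasticsearch): the function names are invented, and the log-logistic form of the information model (clinchant2010information) is assumed for $\mathrm{Prob}(T_w \geq t \mid \lambda_w)$.

```python
import math

def h2_normalize(tf, doc_len, avg_len, c=1.0):
    # DFR H2 normalization (amati2002probabilistic): damp the raw term
    # frequency by the ratio of average to actual amount of text, so
    # institutions producing less text overall are not penalized.
    return tf * math.log2(1.0 + c * avg_len / doc_len)

def year_score(query_terms, term_freqs, doc_len, avg_len, df, n_pairs):
    # Equation 2 (log-logistic assumption): sum over query terms of
    # -log Prob(T_w >= t_{w,i,y} | lambda_w), with lambda_w the average
    # rate of institution-year pairs containing the word.
    score = 0.0
    for w in query_terms:
        tfn = h2_normalize(term_freqs.get(w, 0), doc_len, avg_len)
        lam = df.get(w, 0) / n_pairs
        if tfn > 0 and lam > 0:
            score += math.log((tfn + lam) / lam)
    return score

def institution_score(query_terms, per_year_stats, years):
    # Equation 1: sum of per-year scores over the selected year range.
    # per_year_stats maps year -> (term_freqs, doc_len, avg_len, df, n_pairs).
    return sum(
        year_score(query_terms, *per_year_stats[y])
        for y in years if y in per_year_stats
    )
```

Because Equation 1 is a plain sum over years, restricting the timeline selection simply drops terms from the sum, which is what makes the year filter on the map cheap to apply.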
For the DOI index, a similar measure is used with an additional normalization parameter for the overall length of the text associated with the DOI (title, abstract, etc.). In addition to the text, the year and the institutions that contributed to the content are stored in separate fields. This enables the retrieval of DOIs for a given institution, filtered to those that match a given year range.
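Retrieval filtered by institution and year range maps naturally onto a standard Elasticsearch bool query. The sketch below builds such a request body in Python; the field names (`title`, `abstract`, `grid_ids`, `year`) are assumptions based on the index description above.

```python
def doi_query(keywords, grid_id, year_from, year_to, size=10):
    # Full-text relevance scoring on title and abstract, combined with
    # non-scoring filters on institution and publication year.
    return {
        "size": size,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": keywords,
                        "fields": ["title", "abstract"],
                    }
                },
                "filter": [
                    {"term": {"grid_ids": grid_id}},
                    {"range": {"year": {"gte": year_from, "lte": year_to}}},
                ],
            }
        },
    }
```

Putting institution and year in the `filter` clause means they constrain the result set without affecting the relevance score, so the ranking of an institution’s articles depends only on the keyword match.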
4.3 Web application
The atlas is a single page web application designed to run in modern browsers. The user interface is built using the React library (https://reactjs.org/).
A reactive web server written using Eclipse Vert.x (https://vertx.io/) acts as middleware between the Elasticsearch index and the client web application, filtering and handling query requests from the client and formatting the results as a JSON object for the client application. The use of Elasticsearch and Vert.x provides a scalable architecture that can respond to changing demands on the system from users.
We presented the design and implementation of Quoka, an interactive atlas for exploring the research outputs of institutions around the world. The system supports sensemaking tasks related to understanding the creation and history of academic research, and can help to provide a more nuanced picture of the heterogeneity of research being conducted at different institutions. The Quoka service consists of a back-end data infrastructure and information retrieval index, combined with an interactive web-based interface.
Next steps will involve developing more sensemaking components for the Quoka system, including the integration of structured domain-based knowledge to support context-based search and to improve the usability and personalization of the system.
The initial prototype of the Quoka atlas was developed during a generous visiting researcher opportunity provided to Benjamin Adams by the Curtin Institute of Computation from Oct.-Nov. 2019. This research was also supported by New Zealand Ministry of Business Innovation & Employment, Grant/Award Number: UOAX1932. William Wallace helped with initial prototyping of the sandbox component.