Discovering research expertise at institutions pursuing scientific research can be difficult. For example, companies and research laboratories often struggle to pinpoint the researchers at an institution who could collaborate on the problems they wish to solve (Belkhodja and Landry, 2007); this can lead to methods such as sending scores of emails to faculty in the hope that the appropriate individual will eventually be found. However, this method often leads to lost opportunities. Furthermore, within institutions themselves, planning major research efforts often involves the same tactics when it comes to finding appropriate people.
Both of these situations share a similar problem: the lack of an easily accessible, up-to-date record with relevant information for user queries. At the moment, most institutions have manually curated directories, which hold each individual’s department, position, and contact information. However, when trying to learn more about a researcher these directories can sometimes provide inaccurate or incomplete information. For instance, a full professor’s latest research interests may be very different from those at the time that faculty member joined the institution years ago. Additionally, the information typically lacks important details about their specific fields of interest as well as their publications. Therefore, these directories can be less helpful for the companies, laboratories, and government agencies that seek to pursue business or fund projects at a specific university. This hinders the ability of both external and internal individuals to pinpoint research talents, understand the scope of research activities at an institution, and discover new connections.
To solve this common challenge faced by research institutions, our work makes the following contributions:
PeopleMap, an interactive tool that “maps out” researchers based on their research interests and publications by leveraging embeddings generated by natural language processing (NLP) techniques. PeopleMap is:
The first visualization dedicated to helping users explore researcher embeddings; while there has been research that develops methods to recommend research papers and publication venues (Beel et al., 2016; Medvet et al., 2014; Beel, 2017; Alhoori and Furuta, 2017; Küçüktunç et al., 2013), less work focuses on developing usable, easy-to-access tools for users to interactively explore researcher datasets. PeopleMap fills this research gap and seeks to improve the interpretability and explorability of researcher datasets.
An open-source, sustainable web application for the community that can be easily accessed via web browsers. PeopleMap is released under the permissive MIT license, and its code repository is available at https://github.com/poloclub/people-map. Besides the PeopleMap visualization, the repository also provides a series of data collection and preprocessing tools that allow users to create a researcher dataset from any list of researchers found on Google Scholar. Additionally, it includes a step-by-step documentation guide (https://app.gitbook.com/@poloclub/s/people-map/) that covers every step of the process from downloading the repository to launching the PeopleMap platform (Section 4). With the combined data collection resources and PeopleMap visualization, the tool provides an automated solution for researcher interest summarization and discovery, which simplifies the exploration of the work of scientific researchers (Section 3).
PeopleMap Usage Scenarios and Deployment. As a first real-world use case of PeopleMap, we have successfully implemented PeopleMap for the Institute for Data Engineering and Science (IDEaS), a major cross-campus research entity at Georgia Tech (http://ideas.gatech.edu/) whose members include faculty from across colleges and departments on campus. Preliminary feedback from IDEaS’ leadership team has been positive; they are very excited about PeopleMap’s interactivity and the way that this tool can be easily updated for new members. The live PeopleMap for IDEaS can be found at https://poloclub.github.io/people-map/ideas/.
To demonstrate the easy application of PeopleMap to a different organization’s members, we also implemented PeopleMap for the Center of Machine Learning at Georgia Tech (https://ml.gatech.edu/), another major cross-campus entity. The live PeopleMap for the Center of Machine Learning can be found at https://poloclub.github.io/people-map/ml/. We also provide an additional usage scenario to highlight how a potential user could implement and use PeopleMap.
3. Introducing PeopleMap
PeopleMap is an open-source, web-browser-based visualization tool that maps out researchers using NLP techniques, allowing users to explore the information extracted from researchers’ profiles using textual embeddings. It can determine the possible groupings of similarly-interested researchers, represent how researchers align with specified fields of study, and reveal potential Gaussian distributions describing the research topics present in the dataset. PeopleMap’s user interface consists of four major components: (1) Map View (Figure 1A) visualizes the research topic similarities among researchers; (2) Research Query (Figure 1B) allows users to search for researchers and query areas of study; (3) Researcher View (Figure 1C) shows the detailed information of the researcher hovered over by the user (e.g., affiliation, citations, interests); (4) Control Panel (Figure 1D) allows users to adjust the hyperparameters of the Map View visualization. Next, we describe each component in more detail.
3.1. Mapping Out Researcher Interests
The Map View of PeopleMap (Figure 1A) is a visualization of embeddings representing the researchers in the selected dataset. Within the Map View, each dot represents a researcher and their corresponding embedding projected into a two-dimensional space. With the researcher data extracted from Google Scholar, these embeddings were created using term frequency–inverse document frequency (TFIDF) (Jones, 1972) matrices and principal component analysis (PCA) (Wold et al., 1987), which are discussed in greater detail in the following sections:
3.1.1. Collecting Google Scholar data for each researcher
Generating a PeopleMap visualization requires only public data that anyone can access. Using a Python-based module called scholarly (https://github.com/scholarly-python-package/scholarly), we collect each researcher’s public information from Google Scholar, which includes the researcher’s profile, publications, and research interests. The specific information collected is:
Google Scholar profile URL
Top 50 most cited publications (titles, abstracts, and years of publication)
Top 50 most recent publications (titles, abstracts, and years of publication)
Google Scholar profile keywords
Google Scholar profile photo
PeopleMap formats and stores all researcher data in a CSV file, one column for each category of information listed above.
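As a rough illustration of this storage layout, the sketch below writes one row per researcher with Python's standard csv module. The column names, the added name column, and the example values are all hypothetical placeholders; the actual PeopleMap file may be organized differently.

```python
import csv
import io

# Hypothetical column names, one per category of collected information
# (plus a name column added here for readability).
COLUMNS = [
    "name",
    "profile_url",
    "most_cited_publications",
    "most_recent_publications",
    "keywords",
    "photo_url",
]


def write_researchers(rows, handle):
    """Write a header line followed by one CSV row per researcher."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Placeholder record; every value below is made up for illustration.
example = {
    "name": "Example Researcher",
    "profile_url": "https://scholar.google.com/citations?user=EXAMPLE",
    "most_cited_publications": "Title A. Abstract A. 2015 | Title B. Abstract B. 2018",
    "most_recent_publications": "Title C. Abstract C. 2020",
    "keywords": "visualization; machine learning",
    "photo_url": "https://example.org/photo.png",
}

buffer = io.StringIO()
write_researchers([example], buffer)
csv_text = buffer.getvalue()
```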
3.1.2. Researcher Embeddings
Using the publication data extracted from Google Scholar, the titles and abstracts of each researcher’s publications are first concatenated together to create a combined document for each researcher. Additionally, the Google Scholar keywords of each researcher can also be concatenated into their respective combined documents. After their creation, these combined documents are normalized and prepared for analysis by:
Removing words with non-English alphabet characters to restrict the bounds of analysis
Eliminating words fewer than two characters in length to mitigate noise in the data
Lowercasing words to simplify capitalization
Removing HTML tags
Removing stop-words
Stemming words to simplify syntax
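The normalization steps listed above could be sketched as follows. This is a minimal standard-library illustration, not the PeopleMap implementation: the stop-word list is a tiny sample, naive_stem is a crude stand-in for a real stemmer, and the order of the steps is an assumption.

```python
import re

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "for", "to", "is", "on", "we", "with"}


def naive_stem(word):
    """Crude suffix stripping, standing in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(document):
    """Turn a combined document into a list of normalized tokens."""
    text = re.sub(r"<[^>]+>", " ", document)  # remove HTML tags
    text = text.lower()  # lowercase
    tokens = [t.strip(".,;:!?()[]\"'") for t in text.split()]
    tokens = [t for t in tokens if re.fullmatch(r"[a-z]+", t)]  # English letters only
    tokens = [t for t in tokens if len(t) >= 2]  # drop words under two characters
    tokens = [t for t in tokens if t not in STOP_WORDS]  # remove stop-words
    return [naive_stem(t) for t in tokens]  # stem


tokens = normalize("<b>Learning</b> graphs and 3D models of the Universe.")
```

Here the token 3D is discarded because it contains a digit, and the remaining words are lowercased and stemmed.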
Once the documents have been normalized, they are converted into researcher embeddings, one per researcher, using the TFIDF technique. This technique takes into account both how often each word occurs within a researcher’s publications and how common that word is across all researchers. It thus provides a quantitative method by which we can ignore common words shared by most, if not all, of the researchers, while measuring specific “important” or “characteristic” words that differentiate researchers (Jones, 1972). Each researcher’s embedding is a column in a TFIDF matrix, with each row holding the values of a specific word across the researcher embeddings. The following equation represents the combination of $n$ total researcher embeddings, each individually represented as a vector $\mathbf{r}_i$, to create the combined TFIDF matrix $M$:
\[ M = \begin{bmatrix} \mathbf{r}_1 & \mathbf{r}_2 & \cdots & \mathbf{r}_n \end{bmatrix} \]
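To make the matrix layout concrete, the sketch below builds a small TFIDF matrix with one row per word and one column per researcher. It assumes one common TFIDF variant (relative term frequency multiplied by the logarithm of the inverse document frequency); the exact weighting used by PeopleMap is not specified here, and the function name is hypothetical.

```python
import math
from collections import Counter


def tfidf_matrix(documents):
    """Return (vocabulary, matrix): one row per term, one column per document."""
    n_docs = len(documents)
    counts = [Counter(doc) for doc in documents]
    vocabulary = sorted(set().union(*documents))
    # Number of documents in which each term appears at least once.
    doc_freq = {term: sum(1 for c in counts if term in c) for term in vocabulary}
    matrix = []
    for term in vocabulary:
        idf = math.log(n_docs / doc_freq[term])
        row = [
            (c[term] / len(doc)) * idf if doc else 0.0
            for c, doc in zip(counts, documents)
        ]
        matrix.append(row)
    return vocabulary, matrix


# Three toy "researchers", each given as a list of normalized tokens.
docs = [["graph", "graph", "learn"], ["learn", "vision"], ["learn", "robot"]]
vocabulary, matrix = tfidf_matrix(docs)
```

In this toy example the word shared by every researcher receives a weight of zero in every column, while a word used by only one researcher receives a positive weight only in that researcher's column.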
With the researcher embeddings in the TFIDF matrix, it is necessary to first reduce the dimensionality of the embeddings, which are vectors in a several-thousand-dimensional space, so that they can be visualized. To achieve this, principal component analysis (PCA) is used to assist in feature extraction and elimination, simplifying the researcher embeddings into vectors within a two-dimensional space that can be visualized in the Map View (Figure 1A).
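The projection step can be illustrated with a small, dependency-free PCA sketch. It takes one vector per researcher and assumes far fewer researchers than dimensions, so it works on the small researcher-by-researcher Gram matrix and finds the top two components by power iteration. This is one possible way to compute PCA, not necessarily how PeopleMap does it; the function name and parameters are hypothetical, and the sign of each axis is arbitrary.

```python
import math
import random


def pca_2d(vectors, iterations=500, seed=0):
    """Project row vectors onto their top two principal components."""
    n, d = len(vectors), len(vectors[0])
    # Center every dimension at zero mean.
    means = [sum(v[j] for v in vectors) / n for j in range(d)]
    centered = [[v[j] - means[j] for j in range(d)] for v in vectors]
    # n x n Gram matrix; cheaper than a d x d covariance when n << d.
    gram = [[sum(a * b for a, b in zip(x, y)) for y in centered] for x in centered]
    rng = random.Random(seed)
    coords = [[0.0, 0.0] for _ in range(n)]
    for k in range(2):
        # Power iteration for the current dominant eigenvector of the Gram matrix.
        u = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        eigenvalue = 0.0
        for _ in range(iterations):
            w = [sum(g * x for g, x in zip(row, u)) for row in gram]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # no variance left along this component
                eigenvalue = 0.0
                break
            u = [x / norm for x in w]
            eigenvalue = norm
        # Coordinate of sample i on component k is sqrt(eigenvalue) * u[i].
        for i in range(n):
            coords[i][k] = math.sqrt(eigenvalue) * u[i]
        # Deflate so the next pass finds the following component.
        gram = [[gram[i][j] - eigenvalue * u[i] * u[j] for j in range(n)] for i in range(n)]
    return coords


# Four toy points lying on a straight line in three dimensions.
points = [[0.0, 0.0, 0.0], [1.0, 1.0, 0.0], [2.0, 2.0, 0.0], [3.0, 3.0, 0.0]]
coords = pca_2d(points)
```

Because the toy points lie on a line, all of their variance falls on the first axis and the second coordinate of every point is zero.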
We chose PCA as a starting embedding technique, because PeopleMap is one of the first tools for interactively mapping out researchers. Our primary goal is to create a platform that improves the explorability and interpretability of researcher datasets. While there are many potential embedding techniques for the textual data of researchers, we aimed to start with more classic embeddings that could provide adjustable parameters for the platform. We purposefully used PCA over other potential visualization techniques, such as UMAP (McInnes et al., 2018) or t-SNE (Maaten and Hinton, 2008), because they tend to find structure within the noise of a dataset with small sample sizes compared to the dimensionality of the data, while PCA is well justified as a linear model for such datasets (McInnes et al., 2018). Thus, we use PCA since it fits the constraints of our researcher dataset and allows us to still find emergent patterns among the researcher embeddings. In the future, we endeavor to improve the complexity of our embeddings by exploring several potential embedding techniques.
3.2. Querying Researchers and Areas of Study
The Research Query component (Figure 2) allows the user both to locate specific researchers and to see which researchers are aligned with each of the Google Scholar keywords collected from the researcher dataset. When the user searches for a researcher, PeopleMap highlights the researcher’s representation in Map View by enlarging the dot’s radius and outlining it; PeopleMap also displays the researcher’s Google Scholar profile information in the Researcher View. When calculating a researcher’s alignment with a selected Google Scholar keyword, PeopleMap uses similarity analysis between researcher embeddings and topic embeddings, which is discussed in the next section.
3.2.1. Similarity Analysis
The TFIDF researcher embeddings used for the Map View component are also used to calculate the similarity between a researcher and a specified topic. For example, if a user wants to see which researchers use a specific term prominently throughout their work, the researcher embeddings can reveal which researchers use the term most often relative to their overall writing. To calculate this, the specified topic (e.g., “natural language processing”) is first converted into a TFIDF embedding using the same process outlined for the researchers’ publications (Section 3.1.2). Then, the cosine similarity between the specified topic embedding and each of the researcher embeddings in the TFIDF matrix is calculated, which indicates the similarity between the two vectors: the higher the value, the greater the similarity (Ramos and others, 2003). The following equation represents the cosine similarity between the specified topic embedding, represented as the vector $\mathbf{t}$, and the current researcher embedding, represented as the vector $\mathbf{r}$, to produce the resulting similarity score, represented as $s$:
\[ s = \frac{\mathbf{t} \cdot \mathbf{r}}{\lVert \mathbf{t} \rVert \, \lVert \mathbf{r} \rVert} \]
By performing cosine similarity calculations between the specified topic embedding and the researcher embeddings in the TFIDF matrix, the top similarity scores can be used to find the researchers that most align with the specified topic. These researchers are, in turn, highlighted in the Map View when the specified topic is queried in the Research Query component (Figure 2 shows an example query). Researchers are colored based on how well they align with the query: the darker the color, the stronger the alignment. The Research Query tool, together with the color gradient visualizing the query results, helps users better understand the scope of research relevance among the researchers. The researchers more prominently highlighted are those who tend to use the query term proportionally more than their peers in the dataset. This can serve as a reference to begin inquiries into the individual’s research rather than serve as a full assessment of their contributions to that research topic.
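The ranking described above can be sketched in a few lines of Python. The function names, the dictionary layout, and the default of five results are assumptions made for illustration; vectors with zero length are given a score of zero to avoid dividing by zero.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero embedding is treated as unrelated
    return dot / (norm_a * norm_b)


def rank_researchers(topic, embeddings, top_k=5):
    """Return the top_k (name, score) pairs, most similar first."""
    scores = {name: cosine_similarity(topic, vec) for name, vec in embeddings.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]


# Toy three-word vocabulary; researcher names are placeholders.
topic = [1.0, 0.0, 0.0]
embeddings = {"A": [2.0, 0.0, 0.0], "B": [1.0, 1.0, 0.0], "C": [0.0, 3.0, 0.0]}
ranking = rank_researchers(topic, embeddings, top_k=2)
```

Because cosine similarity ignores vector length, researcher A scores highest even though the topic vector and A's embedding differ in magnitude.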
3.3. Clustering Researchers
To help users more easily identify groups of related researchers, the Map View (Figure 1A) colors the researcher dots to indicate clusters of associated researchers. The intention of this coloration is not to create strictly-defined groups of researchers. Rather, we want to help users visualize the scope of shared interests among researchers. To assign these colorings, we use Gaussian mixture modeling, which will be explained in greater detail in the following section.
3.3.1. Assigning Colors
Previously, we used PCA to reduce the dimensionality of the researcher embeddings, projecting them into a two-dimensional space for visualization (Section 3.1.2). This dimension reduction of the researcher embeddings is also necessary for clustering techniques to be performed. In the researcher dataset for the IDEaS faculty at Georgia Tech (which is visualized in Figure 1), the researcher embeddings have over 11,000 dimensions, with each dimension representing a word in the vast vocabulary shared by the researcher dataset; however, there are only 83 datapoints. Thus, considering the complexity of the data, it is necessary to simplify the dimensionality of the data before performing clustering (Bellman, 2015).
Therefore, using the newly-reduced researcher vectors created using PCA, the total set of researcher vectors is analyzed using Gaussian mixture modeling. Using this technique, the overall distribution of researcher vectors is categorized into several different Gaussian distributions (visualized distributions in Figure 3). These distributions are meant to assist the user in their understanding of the different topics within the researcher dataset and how these topics are shared among different groups.
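For reference, the standard form of a Gaussian mixture model over the two-dimensional researcher vectors can be written as below. The symbols ($K$ components with weights $\pi_k$, means $\boldsymbol{\mu}_k$, and covariances $\boldsymbol{\Sigma}_k$) are introduced here for illustration, and assigning each researcher to the component with the highest responsibility is an assumption about how the clusters are read off, not a detail confirmed above.

```latex
% Mixture density over a 2-D researcher vector x
p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),
\qquad \sum_{k=1}^{K} \pi_k = 1

% Responsibility of component k for x, and the resulting (assumed) hard assignment
\gamma_k(\mathbf{x}) =
  \frac{\pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}
       {\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)},
\qquad
\hat{k}(\mathbf{x}) = \arg\max_{k} \gamma_k(\mathbf{x})
```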
Once these researcher vectors are clustered using Gaussian mixture modeling, they are visualized within the Map View component of PeopleMap and colored according to their designated Gaussian distribution, with each distribution being assigned a unique color. Researcher dots that are close together tend to reflect a similarity in research pursuits between the two researchers; increased distance between researcher dots reflects the opposite. Using the distance and coloring of researcher embeddings as metrics for gauging similarity, the user can better understand the relationship between each of the researchers as well as the diversity of topics in the Map View (Figure 3).
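One way to draw a two-dimensional Gaussian distribution is as an ellipse derived from its covariance matrix. The sketch below assumes the ellipse's semi-axes extend three standard deviations along each principal axis, matching the Show Distributions setting described in Section 3.4; the function name and return format are hypothetical.

```python
import math


def gaussian_ellipse(cov, n_std=3.0):
    """Return (semi_major, semi_minor, angle_radians) for a 2x2 covariance matrix."""
    a, b, c = cov[0][0], cov[0][1], cov[1][1]
    # Closed-form eigenvalues of the symmetric matrix [[a, b], [b, c]].
    mean_var = (a + c) / 2.0
    spread = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    lam1 = mean_var + spread
    lam2 = max(mean_var - spread, 0.0)  # guard against tiny negative round-off
    # Orientation of the major axis, measured from the x-axis.
    angle = 0.5 * math.atan2(2.0 * b, a - c)
    return n_std * math.sqrt(lam1), n_std * math.sqrt(lam2), angle


# Variance 4 along x and 1 along y, with no correlation.
major, minor, angle = gaussian_ellipse([[4.0, 0.0], [0.0, 1.0]])
```

The standard deviations here are 2 and 1, so the semi-axes are 6 and 3, and the major axis lies along the x-axis.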
3.4. Calibrating Exploration
To change the settings of the Map View, the user can use the control panel at the bottom of the visualization (Figure 1D) to manipulate the Map View and investigate the relationships and information presented by the dataset. The following settings assist in the exploratory process of the researcher dataset, allowing the user to explore the impact of different variables on the overall visualization and patterns among the researchers.
Show Distributions: In order to help the user better understand how each cluster of researchers is formed, this toggle permits the user to see the Gaussian distributions calculated by the Gaussian mixture model (discussed in Section 3.3). Each distribution is colored differently according to the dots within it. Additionally, each distribution visualizes the space covered by three standard deviations of the distribution along each of its axes.
#Clusters: To assist the user in their exploration of the researcher embedding clusters, this slider allows the user to adjust the number of Gaussian distributions generated by the Gaussian mixture model algorithm (discussed in Section 3.3). The slider itself does not change the embeddings of the researchers. As the number of clusters increases, the generated Gaussian distributions become increasingly tight. Likewise, as the number of clusters decreases, the Gaussian distributions become more expansive.
Show All Names: To help users find specific researchers and recognize individuals in different clusters, this toggle displays the names of researchers alongside their respective dot within the Map View. It can be used to find researchers without hovering over each dot individually.
Keywords Emphasis: This drop-down allows the user to adjust the emphasis that is placed on a researcher’s Google Scholar keywords, compared to their titles and abstracts, when generating their TFIDF embedding (Section 3.1.2). By increasing the emphasis, more copies of a researcher’s keywords are concatenated into the combined document that is used to generate their TFIDF embedding. By decreasing the emphasis, fewer copies of a researcher’s keywords are concatenated into their combined document. The purpose of this drop-down is to increase or decrease the weight placed on a researcher’s self-identified topics of study when calculating their position in the visualization, allowing the user to better understand the characteristics of each researcher’s fields of interest.
Publication Set: This drop-down allows the user to select which publications they would like to use for the Map View. The default option uses a researcher’s 50 most cited publications to characterize their research, while the other option uses a researcher’s 50 most recent publications. These options allow users to explore the researcher dataset from two different angles: what a researcher may be best known for and what they are currently working on.
Researcher details on demand: To see more information about a researcher, users can hover over the researcher’s dot in the Map View (Figure 1A), which will display in the Researcher View (Figure 1C) the researcher’s:
Google Scholar Profile Keywords
Total Citation Count
Google Scholar Profile Link
Google Scholar Profile Photo
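The Keywords Emphasis setting described above amounts to repeating a researcher's keywords in their combined document before the TFIDF embedding is computed. A minimal sketch, assuming the emphasis level is an integer number of copies (the function and parameter names are hypothetical):

```python
def build_document(publication_tokens, keyword_tokens, emphasis=1):
    """Append `emphasis` copies of the keyword tokens to the publication tokens."""
    if emphasis < 0:
        raise ValueError("emphasis must be non-negative")
    return list(publication_tokens) + list(keyword_tokens) * emphasis


combined = build_document(["graph", "learn"], ["vision"], emphasis=3)
```

Raising the emphasis increases the term frequency of the keyword tokens in the combined document, which in turn increases their weight in the resulting embedding; an emphasis of zero leaves the keywords out entirely.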
4. Usage and Access of PeopleMap
4.1. PeopleMap Code Repository and Documentation
In addition to the source code, we provide two live demos of PeopleMap that allow anybody to explore and become familiar with the PeopleMap platform. The first demo analyzes the publications of the faculty in Georgia Tech’s Center of Machine Learning (https://poloclub.github.io/people-map/ml/), while the second demo analyzes the publications of the faculty at the Institute for Data Engineering and Science (IDEaS), also at Georgia Tech (https://poloclub.github.io/people-map/ideas/).
The corresponding datasets for these two faculty groups are available alongside the source code in the GitHub repository: https://github.com/poloclub/people-map.
4.2. Example Usage Scenario
James is an academic director at a university, looking to develop a new project centered around the study of black holes. He is looking for potential colleagues at his university with whom he can begin working on this new project. While he does have some current connections with professors at his university, he would like to explore the diversity of researchers at his university by using PeopleMap.
To start, James clones the PeopleMap repository and begins following the steps of the documentation. Next, he goes to the university directory and gathers the Google Scholar profile names of all of the relevant researchers. Using tools included in the repository, he gathers their relevant publication information, processes the text, and generates the data files for the PeopleMap platform.
With PeopleMap fully set up, James begins exploring the researcher dataset with all the tools explained in Section 3. First, he uses the Publication Set drop-down and selects Most Recent Publications since he wants to find researchers currently focusing on studying black holes. Next, James clicks the Research Query component (Section 3.2) and types “black holes”, searching to see the researchers most closely aligned with the topic. The tool then highlights the top-five researchers associated with the topic. From this initial search, he discovers several individuals he did not know from his previous correspondence and decides to look a little deeper.
Using this information, James proceeds to use the Researcher View component (Figure 1C) to identify the researchers, clicking on their Google Scholar profile links to see some of their published work. However, before ending his search, he would like to see some of the other researchers that are in close proximity to the ones already selected. Using the Keywords Emphasis drop-down, he tries different emphasis levels to see the groups of researchers that emerge near the previously identified researchers, using the Show All Names toggle to take note of other researchers that are frequently associated with the ones found using the Research Query component. With this wide array of researchers, James is confident he has gathered all the potential collaborators and proceeds to use their Google Scholar profiles found in the Researcher View component, as well as other resources, to gauge which ones would be the best fit for the project.
5. Predicted Impact
5.1. Enhanced and Enabled CIKM Research Activities
PeopleMap aims to facilitate several different CIKM research areas. As a tool for the visualization and exploration of researcher datasets, PeopleMap seeks to assist in the data presentation of research fields of interest and researcher information. Furthermore, PeopleMap can provide functionalities for users and interfaces for information and data systems by increasing the interactivity and explorability of researcher datasets through the PeopleMap platform and its functionalities. By assisting in both of these CIKM research areas, PeopleMap offers a new platform for public and private organizations to both explore the interests of their members and summarize the fields of study their members pursue.
5.2. Scaling the Impact of PeopleMap
PeopleMap for research entities. PeopleMap could transform how research talents at research institutions may be summarized and discovered by both internal and external collaborators. At Georgia Tech, we have successfully developed PeopleMap for two major research entities: IDEaS and the Center for Machine Learning. The leadership of IDEaS are very excited about this tool, especially the interactivity and explorability that it provides for researcher datasets as well as the ease with which it can be updated for new members. While we used the tool for faculty datasets in IDEaS and the Center of Machine Learning in Section 4.1, it could be applied to the entirety of the College of Computing or even Georgia Tech as a whole. The scope of the researchers included is a matter of preference for the group seeking to implement PeopleMap.
PeopleMap for larger entities. Using the data-collecting and processing tools that are part of the PeopleMap repository, it is possible to expand the platform to other researcher datasets, as long as these researchers have Google Scholar profiles with their associated publications listed. The PeopleMap for IDEaS visualizes 83 researchers. However, it is possible to have significantly more researchers than this amount; the limiting factor for the total count is essentially the size of the Map View visualization. As more researchers are added, the higher number of dots can lead to greater visual complexity in the visualization, potentially causing “overplotting” as it becomes harder to distinguish between each of the dots and locate specific individuals using either the Show All Names toggle or the Researcher View component. Additionally, the researcher information within PeopleMap does not update automatically when researchers’ Google Scholar profiles update. PeopleMap users would need to re-run the data collection and processing step to refresh PeopleMap.
PeopleMap as a complementary resource. Rather than replacing current directories, we developed PeopleMap as a tool to complement these existing directories. PeopleMap can be used in conjunction with the directories of universities, companies, agencies, and other institutions to lend an additional perspective upon the diversity of research interests that the institution holds.
6. Conclusion and Future Work
PeopleMap, in its current form, will continue to be useful for years to come, but we plan on continuing to improve the system by increasing the sophistication of the NLP techniques used in analysis and expanding the available functionalities for exploring researcher datasets. In the current version of PeopleMap, we use TFIDF to generate researcher embeddings (Section 3.1.2) from our gathered researcher data before using PCA and Gaussian mixture modeling for visualizing these embeddings and performing clustering techniques (Section 3.3). However, as we seek to increase the complexity of our embeddings, we plan on exploring several potential embedding techniques. For example, we aim to extract hidden layers from pretrained and finetuned Transformer (Vaswani et al., 2017) models such as BERT (Devlin et al., 2018). Prior work has explored fine-tuning these models on text data from the scientific domain, yielding improved results on downstream tasks (Beltagy et al., 2019). However, we aim to use similar techniques in the context of visualization. Using these techniques, we open up the possibility of both improved information extraction and visualization of researcher datasets.
Lastly, we hope that PeopleMap can assist any individual seeking to delve deeper into the fields of interests found within any group of researchers. We encourage any institution composed of published researchers to use PeopleMap if they would like to explore the diversity of content produced by their members. We expect that recommendation systems for research papers and publication venues will continue to be a topic of interest in coming years, as there have been several different studies addressing potential platforms and solutions (Beel et al., 2016; Medvet et al., 2014; Beel, 2017; Alhoori and Furuta, 2017; Küçüktunç et al., 2013). Furthermore, we also expect organizations will seek to improve outdated directory systems so that both internal and external groups can more efficiently and confidently connect with researchers for potential collaborations.
- Alhoori and Furuta (2017). Recommendation of scholarly venues based on dynamic user interests. Journal of Informetrics 11 (2), pp. 553–563.
- Beel et al. (2016). Paper recommender systems: a literature survey. International Journal on Digital Libraries 17 (4), pp. 305–338.
- Beel (2017). Towards effective research-paper recommender systems and user modeling based on mind maps. arXiv preprint arXiv:1703.09109.
- Belkhodja and Landry (2007). The triple-helix collaboration: why do researchers collaborate with industry and the government? What are the factors that influence the perceived barriers? Scientometrics 70 (2), pp. 301–332.
- Bellman (2015). Adaptive control processes: a guided tour. Princeton University Press.
- Beltagy et al. (2019). SciBERT: a pretrained language model for scientific text.
- Devlin et al. (2018). BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Jones (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation.
- Küçüktunç et al. (2013). TheAdvisor: a webservice for academic recommendation. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 433–434.
- Maaten and Hinton (2008). Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605.
- McInnes et al. (2018). UMAP: uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.
- Medvet et al. (2014). Publication venue recommendation based on paper abstract. 2014 IEEE 26th International Conference on Tools with Artificial Intelligence, pp. 1004–1010.
- Ramos and others (2003). Using TF-IDF to determine word relevance in document queries. In Proceedings of the First Instructional Conference on Machine Learning, Vol. 242, pp. 133–142.
- Vaswani et al. (2017). Attention is all you need.
- Wold et al. (1987). Principal component analysis. Chemometrics and Intelligent Laboratory Systems 2 (1-3), pp. 37–52.