Delivering Scientific Influence Analysis as a Service on Research Grants Repository

08/23/2019 · Yuming Wang, et al. · Huazhong University of Science & Technology

Research grants have played an important role in seeding and promoting fundamental research projects worldwide. There is a growing demand for developing and delivering scientific influence analysis as a service on research grant repositories. Such analysis can provide insight on how research grants help foster new research collaborations, encourage cross-organizational collaborations, influence new research trends, and identify technical leadership. This paper presents the design and development of a grants-based scientific influence analysis service, coined as GImpact. It takes a graph-theoretic approach to design and develop large scale scientific influence analysis over a large research-grant repository with three original contributions. First, we mine the grant database to identify and extract important features for grant influence analysis and represent such features using graph-theoretic models. For example, we extract an institution graph and multiple associated aspect-based collaboration graphs, including a discipline graph and a keyword graph. Second, we introduce self-influence and co-influence algorithms to compute two types of collaboration relationship scores based on the number of grants and the types of grants for institutions. We compute the self-influence scores to reflect the grant-based research collaborations among institutions and compute multiple co-influence scores to model the various types of cross-institution collaboration relationships in terms of disciplines and subject areas. Third, we compute the overall scientific influence score for every pair of institutions by introducing a weighted sum of the self-influence score and the multiple co-influence scores and conduct an influence-based clustering analysis. We evaluate GImpact using a real grant database, consisting of 2512 institutions and their grants received over a period of 14 years...


1 Introduction

Research grants from governments and industry have played an important role in seeding and fostering fundamental and cutting-edge research projects, resulting in many research innovations and scientific discoveries. However, existing scientific influence analysis services to date have mainly centered on evaluating the impact factor of a journal or a conference based on citation counts. The first proposal for the Journal Impact Factor was introduced by E. Garfield in 1955[1] to evaluate the influence of journals. It has been developed for more than 60 years[2] and is still widely used today. In 2005, Hirsch[3, 4] proposed the h-index to measure the influence of an individual researcher by combining both quantity (the number of publications) and quality (the count of citations). Several variant indices have been proposed to further enhance the h-index, such as the g-index[5] and the hg-index[6], by adding one or more new attributes into the indices or changing the way of processing citation counts. Such publication citation-based impact factors have been used by many research labs and academic institutions as one factor to evaluate the scholarly achievement of a researcher. Google Scholar is a popular service for this purpose.

As big data and cloud computing become ubiquitous, there is a growing demand for developing large scale scientific influence analysis on research grants repositories and delivering such analysis as a service. In contrast to the publication citation counts, the scientific influence analysis on research grants can provide insight on how research grants help foster new research collaborations, encourage cross-organizational collaborations, influence new research trends, and identify technical leadership. For example, by examining the research grant repository over a certain period of time, it can reveal a number of interesting perspectives: in which subject areas academic institutions and industry researchers collaborate by means of cross-organization projects, and the type of influence that research grants have on prioritizing certain research subjects over others, on research trends, and on leadership in different research subject areas. Analysis of research grants data may also reveal the specific research subject areas that are in demand or on the priority list of governments or industry. However, very few research efforts have been devoted to grant-based scientific influence analysis, and the existing ones rely on statistical methods [7, 8].

In this paper, we develop a graph-theoretic approach to mine a research grants repository for large scale grants-based influence analysis, coined as GImpact, and our design and development goal is to deliver GImpact as a service with three original contributions. First, we mine a large scale grant database to identify and extract important features for grant-based scientific influence analysis and represent such features using graph theoretic models. For example, we can extract features to analyze research collaborations among different individual researchers and/or among different institutions by constructing a grant based researcher collaboration graph or an institution collaboration graph. For each of such graphs, we can again associate multiple aspect-based collaboration graphs that further characterize the grant-based collaborations through different aspects of collaboration, such as the disciplines identified by funding agencies of the research grants or the subject areas or keywords. Due to the space constraint, in this paper, we focus on the scientific influence analysis of grants on cross-institution collaboration. Thus, we construct an institution collaboration graph with institutions as vertices and joint grants between a pair of institutions as an edge. Similarly, we construct associated aspect-based collaboration graphs as additional features to enrich our influence analysis on cross-institution collaborations, such as a discipline graph and a keyword graph. The discipline graph has the disciplines as vertices and the grant based relationships between disciplines as edges with edge weighted by the total number of grants that are relevant to both disciplines. The keyword graph reflects the relationship among subject areas in the context of grants, and has the subject keywords as vertices and an edge between a pair of keywords if both keywords are covered by some grant(s), weighted by the total number of grants that cover both of the keywords. Second, we develop graph-theoretical algorithms to compute the collaboration relationship score between a pair of institutions based on their grant data to reflect two types of influences: self-influence and co-influence. We compute self-influence scores for each pair of institutions in terms of joint-grants based collaboration relationship, taking into account also the traversal reachability on the institution collaboration graph. We also compute the co-influence scores for each pair of institutions by incorporating each associated aspect-based collaboration graph. For example, if one institution is reachable from another institution through the graph traversal between the institution graph and one of its associated collaboration graphs, such as the discipline graph or the keyword graph, we will compute their co-influence score based on the statistical properties of all possible graph-traversal paths among the two institutions. Third, we compute the overall scientific influence scores by integrating the self-influence score and the multiple co-influence scores for each pair of institutions and conduct a scientific influence based clustering analysis on the institution graph by partitioning the institution collaboration graph into K clusters, with K as one of the service application interface parameters.
The GImpact approach presents a general purpose scientific influence analysis as a service framework and a suite of graph-theoretic influence computation algorithms that are capable of mining large scale grant data repositories with an easy-to-use API. We evaluate GImpact using a real grant database, consisting of 2512 institutions and their grants received over a period of 14 years. Our experimental results show that the GImpact influence analysis approach can effectively identify the grant-based research collaboration groups and provide valuable insight and an in-depth understanding of the scientific influence of research grants on research programs, institution leadership, and future collaboration opportunities in different research subject areas.

2 Overview

2.1 Research Grants Dataset

The dataset for the study is obtained from the Social Sciences Management Databases of Chinese Universities (SMDB), which consists of all projects in Humanities and Social Sciences funded by the Ministry of Education of China from 2005 to 2018. Table I shows research grant samples; Tables II, III, and IV show institution samples, discipline samples, and keyword samples respectively. Table V shows the basic statistical characteristics of the dataset.

Record Institution Discipline Keyword
R01 CUFE, SHUFE D63040, D79071 K06, K11
R02 SHUFE, SWUFE D63044, D79071, D84074 K08, K12, K13
R03 BNU, FUDAN D79071, D81030, D88031 K03, K08
R04 RUC D7907340 K01, K09
R05 FUDAN D7907340 K07, K05, K02, K04
R06 CUFE D7907340 K01, K10
TABLE I: The Research Grant Samples
Institution InstitutionName
PKU Peking University
RUC Renmin University of China
BNU Beijing Normal University
CUFE Central University of Finance and Economics
CUPL China University of Political Science and Law
FUDAN Fudan University
ECNU East China Normal University
SHUFE Shanghai University of Finance and Economics
ECUPL East China University of Political Science and Law
WHU Wuhan University
SWUFE Southwestern University of Finance and Economics
SWUPL Southwest University of Political Science and Law
TABLE II: The Institution Samples
Discipline DisciplineName
D190 Psychology
D630 Management
D63040 Enterprise Management
D63044 Public Management
D740 Linguistics
D790 Economics
D79071 Finance
D7907340 Financial Markets
D81030 Administration
D820 Law
D84074 Labor Science
D870 Library, Information and Documentation
D88031 Educational Economics
TABLE III: The Discipline Samples
Keyword KeywordName
K01 Economic Cycle
K02 Monetary Assets
K03 Compulsory Education
K04 International Pricing Power
K05 Bulk Goods
K06 Tax Policy
K07 Derivatives
K08 Financial
K09 Asset Pricing
K10 Capital Market
K11 Venture Capital
K12 Supply Mechanism
K13 Labor Force
TABLE IV: The Keyword Samples
Characteristics Number
# of Research Grants 334068
# of Institutions 2512
# of Disciplines 1569
# of Keywords 20097
# of Institutions per Grant (Min, Avg, Max) 1, 1.74, 10
# of Disciplines per Grant (Min, Avg, Max) 1, 1.19, 5
# of Keywords per Grant (Min, Avg, Max) 1, 7.87, 19
TABLE V: The Basic Statistical Characteristics

Raw data is plain-text records in the database managed by a relational DBMS. In order to perform the proposed scientific influence analysis on the grant database, we need to perform feature extraction and convert the relational tables of grant records in plain text format into graph representations. For example, each record contains a collection of attributes, such as the institutions involved, the disciplines related, the keywords associated, and so forth. We show an example fragment of the grant record samples in Table I. For record R01, we can learn that institutions CUFE and SHUFE have a direct grant collaboration. The related discipline areas are D63040 and D79071, and the associated keywords are K06 and K11. If we focus on analyzing the scientific influence of grants on research collaboration among institutions through subject areas captured by disciplines and keywords, then we can extract features from the grant database by modeling each grant by a selection of attributes, such as the institutions, the disciplines and the keywords. We can then formulate this projected version of the grant database as a collection of triples, each of the format (institutions, disciplines, keywords), where the first element is the institution collection, the second is the discipline collection, and the third is the keyword collection. If the main focus of our scientific influence analysis is on institution collaboration, then we construct the institution graph first, with institutions as vertices and joint grants between a pair of institutions as an edge weighted by the number of joint grants. For each additional attribute, we construct an aspect-based collaboration graph, such as a discipline graph and a keyword graph, to highlight the relationship among different values of the attribute and enrich our influence analysis on cross-institution collaborations. For each of the specific attributes, we can construct a graph by extracting the relationship features between attributes of the same type. Although each aspect-based collaboration graph is homogeneous in nature, the entire collection of graphs is heterogeneous, with one primary attribute as the main collaboration graph for influence analysis and the other attributes as the additional aspects of collaboration that capture the different aspects of collaboration relationships important to characterize the influence between the vertices in the main collaboration graph, i.e., institutions, in our case. The discipline graph has the disciplines as vertices and the grant based relationships between disciplines as edges, with each edge weighted by the total number of grants that are relevant to both disciplines. The keyword graph reflects the relationship among subject areas in the context of grants, and has the subject keywords as vertices and an edge between a pair of keywords if both keywords are covered by some grant(s), weighted by the total number of grants that cover both of the keywords.
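To make this feature extraction concrete, the sketch below (in Python, whereas the paper's implementation is in Java 8) builds the institution graph and the two aspect graphs from such (institutions, disciplines, keywords) triples, together with the heterogeneous institution-discipline edges used later by the influence graphs. The records are the toy samples of Table I and all helper names are illustrative.

```python
from itertools import combinations
from collections import Counter

# Toy projection of Table I: each grant record as (institutions, disciplines, keywords).
records = [
    ({"CUFE", "SHUFE"}, {"D63040", "D79071"}, {"K06", "K11"}),
    ({"SHUFE", "SWUFE"}, {"D63044", "D79071", "D84074"}, {"K08", "K12", "K13"}),
]

def pair_graph(value_sets):
    """Weighted homogeneous graph: an edge between two values for every grant
    that contains both, weighted by the number of such grants."""
    edges = Counter()
    for values in value_sets:
        for a, b in combinations(sorted(values), 2):
            edges[(a, b)] += 1
    return edges

institution_graph = pair_graph(r[0] for r in records)   # joint-grant edges
discipline_graph  = pair_graph(r[1] for r in records)   # co-occurring disciplines
keyword_graph     = pair_graph(r[2] for r in records)   # co-occurring keywords

# Heterogeneous institution-discipline edges for the influence graph.
inst_discipline = Counter()
for insts, discs, _ in records:
    for i in insts:
        for d in discs:
            inst_discipline[(i, d)] += 1

print(institution_graph)    # Counter({('CUFE', 'SHUFE'): 1, ('SHUFE', 'SWUFE'): 1})
print(inst_discipline[("SHUFE", "D79071")])   # SHUFE appears with D79071 in two grants
```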

We would also like to note that the techniques developed in our GImpact are generic, and one can choose other primary attributes instead of the institution, such as the researchers who are the PI or co-PIs of a grant, making institution one of the aspect-based collaboration graphs. Due to the space constraint, in this paper we focus on showcasing our approach by conducting the scientific influence analysis of grants on cross-institution collaboration. We use two example attributes, disciplines and keywords, to illustrate the selection of attributes to extract features that represent different collaboration aspects in terms of graphs.

Definition 1 (Research Grants Network)

A research grants network is a heterogeneous information network defined by an undirected graph G = (V, E), where V is the set of vertices of heterogeneous types, representing attributes of research grants, such as institution, discipline, and keyword, and E is the set of edges denoting the relationships between pairs of vertices, either of homogeneous types, such as institution-institution, discipline-discipline, and keyword-keyword, or of heterogeneous types, such as institution-discipline and institution-keyword.

Consider the research grant samples in Table I. From record R01, we can extract the homogeneous links CUFE-SHUFE (solid line) and D63040-D79071 (dotted line) directly from the grant database, and the heterogeneous links CUFE-D63040, CUFE-D79071, SHUFE-D63040, and SHUFE-D79071 (dashed lines). Similarly, from record R02, we extract the homogeneous links SHUFE-SWUFE, D63044-D79071, D63044-D84074, and D79071-D84074 and the heterogeneous links SHUFE-D63044, SHUFE-D79071, SHUFE-D84074, SWUFE-D63044, SWUFE-D79071, and SWUFE-D84074. Figure 1 shows an illustrative example of the research grants network built from record R01 and record R02. It consists of two types of attributes (vertices): institutions (black circle) and disciplines (blue square), and three types of relationships (edges): institution-institution (solid line), discipline-discipline (dotted line), and institution-discipline (dashed line). The edges from record R01 are marked in red, the edges from record R02 are marked in green, and the edges from both record R01 and record R02 are marked in yellow.

Fig. 1: An illustration of the Research Grants Network

By building a heterogeneous research grants network from multiple homogeneous networks extracted directly from the grant database of plain-text grant records, we can perform GImpact-based scientific influence analysis on the multiple homogeneous graphs together and find both direct and indirect collaboration relationships among vertices of the primary collaboration graph, such as the relationships CUFE-SHUFE, SHUFE-SWUFE, and CUFE-SHUFE-SWUFE in the institution graph, in terms of not only joint grants but also the grant-based scientific influence obtained by mining all the homogeneous graphs collectively as a whole. As a byproduct, we can also learn about relationships between an institution and its associated aspects of collaboration, e.g., SHUFE took part in the disciplines D63040, D63044, D79071, and D84074. We can thus learn hidden relationships that are not observable from the plain-text grant records through simple summarization techniques.

2.2 Related Work

The design and development of GImpact are inspired primarily by social network analysis research efforts in the last decade. Social network analysis promotes community detection [9, 10, 11] and social influence computation [12, 13, 14, 15, 16, 17, 18]. Most existing social network analysis techniques focus on a single network of homogeneous vertices with homogeneous links, such as a social network of people with friendships among them, without explicitly modeling different types of links, which constrains the social influence analysis to superficial, connection-specific relationships such as friendships. Also, most of the existing social network influence analysis is based on co-authorship using the DBLP dataset. To the best of our knowledge, none of the prior work has explored scientific influence analysis on a large-scale grant database.

Existing research efforts on research grants data repositories are limited. [7] is the first to study the impact of governmental funding on publication counts and citation counts from research programs at the National Institute on Aging (NIA), aiming at improving the quality of funded research. [8] evaluates the impact of receiving NIH grants by comparing the subsequent publications and citations of funded standard research grants (R01s) with those of unsuccessful grant applications, showing an insignificant difference between these two groups in terms of both publication counts and citation counts. We argue that scientific influence analysis on the innovation of research programs, institution leadership, and future collaboration opportunities can provide more useful indicators for grant impact evaluation than publication counts and citation counts alone.

2.3 Problem Statement

The first problem we intend to address in the development of GImpact is to develop graph-theoretic and statistical methods to compute indirect grant collaboration relationships among institutions based on those observable features captured in the grant database in terms of grant records.

Consider the research grant samples in Table I. From record R01, we can find that institutions CUFE and SHUFE jointly applied for a grant in the disciplines D63040 and D79071. From record R02, institutions SHUFE and SWUFE jointly applied for a grant in the disciplines D63044, D79071, and D84074. Although CUFE and SWUFE did not apply for a grant jointly, CUFE and SWUFE both applied for a grant with SHUFE. Furthermore, CUFE and SWUFE both applied for a grant in the discipline D79071. Thus, conducting grant-based scientific influence analysis based only on the joint grant information may lead to biased or inaccurate results. We argue that a comprehensive scientific influence analysis on a grant repository should take into account not only the direct relationships that can be obtained from the grant database records but also the many types of indirect relationships among institutions that have contributed to research initiatives and research projects in the same or related disciplines and on the same or similar topic keywords. We argue that measuring grant-based scientific influence across institutions should consider both direct and indirect collaborations in the context of grant data. Thus, it is important to extract features that represent different collaboration aspects in addition to the grant data on institutions. For example, disciplines and keywords are important attributes that reflect the collaboration aspects of different institutions. Furthermore, the relationships between an institution and its associated disciplines in the grant discipline graph and its associated topic keywords in the grant keyword graph are highly relevant as well.

The second problem we propose to tackle in GImpact is to develop statistical mining algorithms to compute two types of influence measures: self-influence and co-influence. The self-influence refers to the influence score that is computed based only on the graph traversal information in a primary collaboration graph of homogeneous vertices, such as the institution graph in which vertices have edges between them if they have joint grants. The graph traversal on the joint grants based institution graph will capture the indirect relationship among institutions that have indirect grant-based collaboration relationships. The co-influence refers to the influence score that is computed based on both the graph traversal information on a primary collaboration graph and its multiple associated aspect-based collaboration graphs. The graph traversal on this collection of homogeneous graphs will also capture the indirect relationship among institutions that have indirect grant-based collaboration relationships in terms of common disciplines or common topic keywords.

In the development of GImpact, we attempt to answer two fundamental questions: (1) How to measure the overall scientific influence between any pair of institutions quantitatively; and (2) How to utilize the overall scientific influence scores to identify grant-based institution clusters. To address the first question, we will compute the overall scientific influence score between any pair of institutions using a weighted sum of the self-influence score and the multiple co-influence scores. The overall scientific influence score reflects not only the collaboration patterns in the institution graph through direct and indirect joint grant relationships but also the collaboration patterns through common disciplines and common keywords as well as indirectly related disciplines in the discipline aspect graph and indirectly related topic keywords in the keyword aspect graph. To address the second question, we will develop a scientific-influence-distance-based graph clustering algorithm to partition the set of institutions, denoted by V_I, into K disjoint clusters C_1, C_2, ..., C_K, where C_1 ∪ C_2 ∪ ... ∪ C_K = V_I and C_i ∩ C_j = ∅ for i ≠ j. The clustering result should achieve a good balance between intra-cluster similarity, i.e., the vertices within one cluster should have close collaboration relationships and similar collaboration patterns, and inter-cluster similarity, i.e., the vertices in different clusters should have relatively looser collaboration relationships and dissimilar collaboration patterns.

The final outcome of GImpact is the grant-based overall scientific influence for each given institution, which is represented by a ranked list of other institutions sorted by their influence scores with respect to this institution, based on both the direct and indirect joint grants and the grants that are related directly or indirectly through common or similar disciplines and/or keywords.

2.4 Solution Approach and Overall Framework

Given a grant database of plain-text records, the users of GImpact are asked to identify the primary collaboration attribute, say institution, and the secondary attributes that can be modeled as different grant-aspect graphs, say disciplines and keywords. We then construct the corresponding research grants network in three steps. (1) We first extract all distinct institutions and their associated aspect attributes from the grant database and represent each institution with the selection of attributes, such as grant ID, institution ID, discipline IDs, and keyword IDs. (2) We construct the institution graph, which contains homogeneous vertices of type institution and homogeneous edges between institutions if they have joint grants. We use GImpact to learn self-influence collaboration patterns from direct and indirect joint-grant-based collaborations. (3) We construct multiple grant-aspect graphs, each of which contains a set of homogeneous vertices of one attribute type and a set of homogeneous edges between two aspect vertices if they are reflected in a common grant, such as two disciplines appearing in at least one common grant, or two keywords appearing in at least one common grant. Each of such grant-based aspect graphs will be used by GImpact to perform the scientific influence analysis to learn co-influence collaboration patterns by exploring graph traversal across both the primary institution graph and the aspect graph, utilizing direct and indirect traversal paths.

Definition 2 (Grant-based Institution Graph)

A grant-based institution graph is a subgraph of G, represented as G_I = (V_I, E_I), where V_I is the set of institutions and E_I is the set of edges denoting the joint grant relationship between a pair of institutions. Let n denote the total number of institutions in G_I; we have n = |V_I|.

Definition 3 (Grant-based Aspect Graph)

A grant-based aspect graph, denoted as G_{A_i} = (V_{A_i}, E_{A_i}), is a subgraph of G, corresponding to a grant-specific aspect attribute, such as discipline or keyword, where V_{A_i} is the set of distinct aspect attribute values and E_{A_i} is the set of edges denoting the direct relationship between two aspect values if they are covered by the same grant, reflecting the grant-specific aspect relevance, such as discipline relevancy in the discipline graph or keyword relevancy in the keyword graph. n_a denotes the total number of grant-aspect-specific vertices in G_{A_i}, and n_a = |V_{A_i}|.

Definition 4 (Grant-based Influence Graph)

A grant-based influence graph is defined based on the institution graph G_I and a grant-based aspect graph G_{A_i}, and is denoted as G_{IA_i} = (V_I ∪ V_{A_i}, E_{A_i} ∪ E_{IA_i}), where V_I is the set of institutions, V_{A_i} is the set of grant-based aspect vertices in the i-th aspect graph (1 ≤ i ≤ m), E_{A_i} is the set of edges denoting the direct relationship between two distinct aspect attribute values, such as discipline relevancy and keyword relevancy, and E_{IA_i} is the set of edges, each connecting an institution vertex and an aspect vertex, denoting the direct relationship between an institution and its aspect attribute value, weighted by the number of grants this institution has with that aspect attribute value, such as the same discipline in the discipline graph.

Figure 2 provides an example workflow to illustrate the three main tasks of the GImpact framework. Figure 2a shows an example fragment of the research grants network extracted from our grant database SMDB. It consists of three types of vertices: institutions (black circle), disciplines (blue square), and keywords (green square), as well as five types of direct relationships (edges) extracted directly from the grant database plain-text records: institution-institution (black solid line), discipline-discipline (blue dotted line), keyword-keyword (green dotted line), institution-discipline (red dashed line), and institution-keyword (purple dashed line). We decouple the research grants network in Figure 2a into three sub-graphs: an institution graph in Figure 2b, and two influence graphs: the institution-discipline influence graph in Figure 2c and the institution-keyword influence graph in Figure 2d. Black numbers in brackets indicate the number of grants of an institution, and black numbers on an institution-institution edge represent the number of joint grants. Red numbers on the red dashed edges between an institution and a discipline denote the number of grants that the institution applied for in that discipline. Purple numbers on the purple dashed edges between an institution and a keyword denote the number of grants that the institution applied for covering that keyword. Blue numbers on a blue dotted edge represent the number of grants that cover both disciplines. Green numbers on a green dotted edge represent the number of grants that cover both keywords. For ease of presentation, we removed the edges with weight less than 10.

Fig. 2: An Illustration of the workflow of the GImpact Influence Analysis Framework

Task 1: Feature Extraction from the plain-text grant database. A user of our GImpact service should first select the primary grant-related attribute of interest for conducting grant-based scientific influence analysis, such as the institution, the lead PI, or the entire PI and co-PI team. Assume we select the attribute institution in the grant record as our primary attribute for the collaboration influence study. Then the user needs to select a set of secondary attributes as the collaboration aspects for the influence analysis. For example, common or similar disciplines and keywords can be indicators of research collaboration relevance. The selection of primary and secondary grant-based attributes is always user-specific and influence-task-specific. Consider the research grant samples in Table I. From record R01, we can learn that the research domain is D63040 (Enterprise Management) and D79071 (Finance) from its disciplines, and that the research content is about K06 (Tax Policy) and K11 (Venture Capital) from its topic keywords. Using this information, one can learn that the grant R01 is about tax policy and venture capital in the domains of enterprise management and finance. Such information can be leveraged to conduct research-grant-based influence analysis of institution CUFE or SHUFE respectively over the grant repository. The outcome of the first task is the grant-based graphs, with an institution graph as the primary collaboration graph and grant-based aspect-specific influence graphs.

Task 2: Computing the overall scientific influence score. GImpact performs this task by computing the self-influence score on the primary institution collaboration graph and the multiple co-influence scores on the grant aspect-specific influence graphs. We divide this step into three sub-tasks: (1) Computing the self-influence score for every pair of vertices on the institution graph based on the direct and indirect joint-grant relationships. We use a homogeneous influence spread model based on heat diffusion [17] to compute the self-influence score. The result is an n×n matrix, denoted by S, with each entry representing the self-influence score of a pair of institutions. (2) Computing co-influence scores on each of the grant aspect-specific influence graphs G_{IA_i} (for 1 ≤ i ≤ m). We utilize the heterogeneous influence spread model [18] to compute the co-influence scores. The result is an n×n matrix, denoted by C_i, representing the co-influence that is spread between institutions via one grant-aspect-specific influence graph, such as the discipline graph. If we have m aspect-specific influence graphs, then m co-influence score matrices are produced in this task, denoted as C_1, ..., C_m. (3) Finally, we obtain the overall influence score for each pair of institutions by integrating the self-influence score and the co-influence scores with a weighting scheme and record the result in an n×n matrix F, such as a weighted sum of the self-influence score with weight α and the co-influence scores with weights β_1, ..., β_m.

Task 3: Grant-based Influence Clustering Analysis. In addition to performing influence analysis using the self-influence matrix S, the co-influence matrices C_1, ..., C_m, and the overall influence matrix F, we want to utilize the overall influence scores as an influence-based distance function to perform graph clustering analysis. For example, based on the K-Medoids clustering method [19], we design our influence-based clustering algorithm GImpact to partition all institutions into K grant-based collaboration clusters. Unlike the conventional K-Medoids clustering method, we refine the centroid-based initialization function. When assigning points to a cluster, we consider both the influence score to the centroid and the influence scores to all points in the cluster. Also, we select the new centroid by maximizing the intra-cluster influence similarity and minimizing the inter-cluster influence similarity.

3 Scientific Influence Analysis

In this section, we will discuss how to compute the overall scientific influence score for each pair of institutions. We will first introduce the heat diffusion [17] based general influence spread model to perform the influence spread process through graph traversal in an undirected graph. Then we utilize the homogeneous influence spread model to compute the self-influence score on the primary institution collaboration graph and the heterogeneous influence spread model to compute multiple co-influence scores on the grant aspect-specific influence graphs G_{IA_i}. Finally, we will integrate the self-influence score and the co-influence scores into the overall scientific influence score with weights α and β_1, ..., β_m.

3.1 General Influence Spread Model

Inspired by heat diffusion[17], we develop our general influence spread model to perform the influence spread process through graph traversal. Heat diffusion is a physical phenomenon in which heat transfers from a hot object to a cold object. The heat diffusion phenomenon is very similar to influence spread. For example, an extraordinary institution can be considered as the hot object that transfers heat to, i.e., spreads influence on, a moderate institution considered as the cold object. Once an institution applies for a grant jointly with other institutions, it spreads influence on those institutions. Similarly, if an institution applies for a grant in a discipline, it spreads influence on that discipline. The more institutions apply for grants in a discipline, the more likely other institutions are to apply for grants in that discipline as well; thus the institution influences other institutions through this discipline.

Consider an undirected graph G = (V, E), where V is the vertex set and E is the edge set. We use N to represent the size of V, i.e., N = |V|. A vertex v_i can be considered as an object in the thermal system, and an edge (v_i, v_j) can be considered as a heat transfer tunnel (influence path) between v_i and v_j. Suppose at time t, the amount of heat that vertex v_i receives from v_j during a period Δt is proportional to the temperature f_j(t) of the vertex v_j at time t and the probability p_ji of the vertex v_i receiving heat from v_j. Based on the above assumptions, we can define the heat received by v_i from v_j as p_ji·f_j(t)·Δt. As a result, the temperature change at vertex v_i between time t+Δt and time t is defined by the heat it receives minus the heat it sends. This is formulated as

f_i(t+Δt) − f_i(t) = α·Δt·Σ_{j:(v_j,v_i)∈E} p_ji·( f_j(t) − f_i(t) )    (1)

where α is the heat conductivity or the influence spread coefficient. For ease of representation, we can express the above formulation in matrix form:

f(t+Δt) − f(t) = α·Δt·H·f(t)    (2)

where

H_ij = p_ji,  if (v_i, v_j) ∈ E and i ≠ j; H_ij = 0 otherwise for i ≠ j    (3)
H_ii = −Σ_{j:(v_i,v_j)∈E} p_ji    (4)

where H is an N×N matrix, called the one-hop heat diffusion kernel or the one-hop influence spread kernel, as the heat diffusion only considers one-hop diffusion in the whole process. The value H_ij for i ≠ j indicates the heat that vertex v_i receives from its neighbor v_j, while the value H_ii for i = j indicates the heat that vertex v_i sends out to all of its neighbors. The heat received should equal the heat sent; therefore the sum of each row of the matrix H should be 0.

In the limit Δt → 0, this becomes

df(t)/dt = α·H·f(t)    (5)

By solving this differential equation, we get

f(t) = e^{αtH}·f(0)    (6)

where f(0) is the initial heat distribution or the initial influence distribution. The matrix e^{αtH} is an N×N matrix, called the multi-hop heat diffusion kernel or multi-hop influence spread kernel, as the heat diffusion considers infinitely many hops from the initial heat distribution. It can be expanded as a Taylor series, where I is the identity matrix:

e^{αtH} = I + αtH + ((αt)²/2!)·H² + ((αt)³/3!)·H³ + ...    (7)

The multi-hop influence spread kernel captures both direct and indirect influence paths between any two vertices in the undirected graph G. The influence spread coefficient α is a user-specific parameter, representing the speed of the influence spread process.
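To make Eq. (7) concrete, the minimal sketch below (in Python with NumPy, whereas the paper's implementation is in Java) builds a one-hop kernel whose rows sum to zero and approximates the multi-hop kernel by a truncated Taylor expansion. The truncation order and the toy probability matrix are illustrative assumptions, not values from the paper.

```python
import numpy as np

def one_hop_kernel(P):
    """Build the one-hop kernel H from a matrix P of receiving probabilities,
    where P[i, j] = p_ji is the probability that vertex i receives heat from j.
    The diagonal is set so that every row of H sums to zero (Eqs. 3-4)."""
    H = P.astype(float).copy()
    np.fill_diagonal(H, 0.0)
    np.fill_diagonal(H, -H.sum(axis=1))
    return H

def multi_hop_kernel(H, alpha=1.0, t=1.0, order=20):
    """Approximate the multi-hop kernel e^{alpha * t * H} by a truncated
    Taylor series (Eq. 7). `order` is the truncation depth (an assumption)."""
    n = H.shape[0]
    kernel = np.eye(n)
    term = np.eye(n)
    for k in range(1, order + 1):
        term = term @ (alpha * t * H) / k   # ((alpha*t)^k / k!) * H^k
        kernel += term
    return kernel

# Toy 3-vertex example with symmetric receiving probabilities (illustrative only).
P = np.array([[0.0, 0.5, 0.2],
              [0.5, 0.0, 0.3],
              [0.2, 0.3, 0.0]])
H = one_hop_kernel(P)
R = multi_hop_kernel(H, alpha=1.0, t=1.0)
print(np.round(R, 3))
```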

3.2 Homogeneous Influence Spread Model

The institution graph G_I is a homogeneous graph and only contains homogeneous edges between institutions. Based on the general influence spread model mentioned above, we only need to consider the homogeneous edges in the graph, i.e., H is an n×n matrix. The probability p_ji of the institution v_i receiving influence from the institution v_j can be defined as

p_ji = w_ij / (w_i · w_j),  if (v_i, v_j) ∈ E_I, and p_ji = 0 otherwise    (8)

where w_i (or w_j) is the weight of vertex v_i (or v_j), e.g., the number of grants of the institution v_i (or v_j), and w_ij is the weight of the edge (v_i, v_j), e.g., the number of joint grants between the institution v_i and the institution v_j.

By defining the probability p_ji, we can compute the one-hop influence spread kernel H according to Eq.(3). Meanwhile, given a specific α and t, we can obtain the multi-hop influence spread kernel e^{αtH} according to Eq.(7) as well. The multi-hop influence spread kernel in the homogeneous influence spread model is an n×n matrix that captures both direct and indirect collaboration relationships, i.e., self-influence collaboration patterns. Thus the self-influence score matrix can be defined as S = e^{αtH}.

Figure 2g shows an example of the self-influence scores based on the institution graph of Figure 2b, where both α and t are set to 1. The black numbers on black edges represent the self-influence scores between institutions. Although the weight of edge CUPL-CUFE and the weight of edge CUPL-SWUPL are both 276 in Figure 2b, the self-influence scores of edge CUPL-CUFE and edge CUPL-SWUPL are 0.024 and 0.021 in Figure 2g respectively. It shows that the self-influence score can capture not only direct collaboration relationships but also indirect collaboration relationships.
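As a sketch of how the self-influence matrix S of this section can be computed, the snippet below builds H from grant counts and joint-grant counts and exponentiates it with SciPy. The normalization w_ij/(w_i·w_j) mirrors the reading of Eq. (8) given above but is an assumption, and the institutions and counts are toy values, not SMDB data.

```python
import numpy as np
from scipy.linalg import expm  # matrix exponential for e^{alpha * t * H}

def self_influence(grants, joint_grants, alpha=1.0, t=1.0):
    """Self-influence matrix S = e^{alpha*t*H} on the institution graph.

    grants:       dict institution -> #grants (vertex weights w_i)
    joint_grants: dict {frozenset({a, b}): #joint grants} (edge weights w_ij)
    The receiving probability p_ji = w_ij / (w_i * w_j) is one plausible
    reading of Eq. (8); swap in the paper's exact normalization if it differs."""
    nodes = sorted(grants)
    idx = {v: k for k, v in enumerate(nodes)}
    H = np.zeros((len(nodes), len(nodes)))
    for pair, w_ij in joint_grants.items():
        a, b = tuple(pair)
        p = w_ij / (grants[a] * grants[b])
        H[idx[a], idx[b]] = p
        H[idx[b], idx[a]] = p
    np.fill_diagonal(H, -H.sum(axis=1))      # rows of H sum to zero
    S = expm(alpha * t * H)                   # multi-hop kernel, Eqs. (6)-(7)
    return nodes, S

# Toy data (illustrative, not from SMDB).
grants = {"CUFE": 300, "SHUFE": 280, "SWUFE": 260}
joint = {frozenset({"CUFE", "SHUFE"}): 40, frozenset({"SHUFE", "SWUFE"}): 30}
nodes, S = self_influence(grants, joint)
print(nodes)
print(np.round(S, 4))
```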

3.3 Heterogeneous Influence Spread Model

The influence graph G_{IA_i} is a heterogeneous graph and contains not only homogeneous edges between aspect attributes but also heterogeneous edges between institutions and aspect attributes. Based on the general influence spread model mentioned above, we need to redefine the one-hop influence spread kernel for a heterogeneous graph [18] to capture both homogeneous and heterogeneous edges. Thus we divide the one-hop influence spread kernel H into four parts:

H = [ H_AA  H_AI ; H_IA  D ]    (9)

where H_AA is an n_a×n_a matrix with a similar definition as Eq.(3), representing the grant-based relationship between aspect attributes, defined by Eq.(12); H_AI is an n_a×n matrix representing the scientific influence that aspect attributes receive from institutions, defined by Eq.(11); H_IA is an n×n_a matrix representing the scientific influence that institutions receive from aspect attributes, defined by Eq.(10); and D is an n×n diagonal matrix that keeps the sum of each row of the matrix H at 0.

[H_IA]_ij = w(v_i, a_j) / Σ_{a_k ∈ N_A(v_i)} w(v_i, a_k),  if (v_i, a_j) ∈ E_IA, and 0 otherwise    (10)

where w(v_i, a_j) is the weight on the influence path (v_i, a_j). [H_IA]_ij is defined by w(v_i, a_j) normalized by the sum of the weights on (v_i, a_k) for any a_k in N_A(v_i), the set of aspect neighbors of institution v_i.

[H_AI]_ij = w(a_i, v_j) / Σ_{v_k ∈ N_I(a_i)} w(a_i, v_k),  if (a_i, v_j) ∈ E_IA, and 0 otherwise    (11)

where w(a_i, v_j) is the weight on the influence path (a_i, v_j). [H_AI]_ij is defined by w(a_i, v_j) normalized by the sum of the weights on (a_i, v_k) for any v_k in N_I(a_i), the set of institution neighbors of aspect attribute a_i.

[H_AA]_ij = sim(a_i, a_j),  if (a_i, a_j) ∈ E_A and i ≠ j, and 0 otherwise    (12)
[H_AA]_ii = −( Σ_{j≠i} [H_AA]_ij + Σ_j [H_AI]_ij )    (13)

where sim(a_i, a_j) is the similarity between the aspect attribute a_i and the aspect attribute a_j. [H_AA]_ii summarizes the influence that the aspect attribute a_i sends to other aspect attributes and institutions.

For the diagonal matrix D, the diagonal entry is defined as D_ii = −Σ_j [H_IA]_ij, which summarizes the influence that the institution v_i sends to the aspect attributes.

By defining the one-hop influence spread kernel H for the heterogeneous network and giving a specific α and t, we can utilize Eq.(7) to obtain the multi-hop influence spread kernel e^{αtH}. The multi-hop influence spread kernel in the heterogeneous influence spread model is an (n_a+n)×(n_a+n) matrix that captures the relationships between institutions and their associated aspect attributes. The aspect-institution block of the multi-hop influence spread kernel (the block corresponding to H_AI) captures the scientific influence that aspect attributes receive from institutions, i.e., the co-influence collaboration patterns.

Figure 2e and Figure 2f show examples of co-influence collaboration patterns based on the discipline influence graph of Figure 2c and the keyword influence graph of Figure 2d respectively, where both α and t are set to 1. The red numbers on red dashed lines measure the scientific influence that disciplines receive from institutions. The purple numbers on purple dashed lines measure the scientific influence that keywords receive from institutions. For ease of presentation, we removed the edges with weight less than 0.001.

3.4 Co-Influence Score

The aspect-institution block of the multi-hop influence spread kernel represents the co-influence collaboration patterns in the associated influence graph. The co-influence collaboration pattern of an institution v_j is an n_a-dimensional vector and can be formulated as

P(v_j) = ( [e^{αtH}]_AI )_{·, j}    (14)

i.e., the column of the aspect-institution block of the multi-hop kernel that corresponds to institution v_j.

The co-influence score is the similarity score of the co-influence collaboration patterns between pair-wise institutions. The similarity of co-influence collaboration patterns should consider not only the angle difference but also the length difference between two vectors. A prolific institution should have a longer co-influence collaboration vector because of more grants, while an ordinary institution should have a shorter co-influence collaboration vector because of fewer grants; consequently, a prolific institution and an ordinary institution may have little research collaboration relationship even if the angle, i.e., the proportion of participation, is similar. Based on this, we can define the co-influence score between two institutions as

C(v_i, v_j) = ( P(v_i)·P(v_j) / (‖P(v_i)‖·‖P(v_j)‖) ) · ( min(‖P(v_i)‖, ‖P(v_j)‖) / max(‖P(v_i)‖, ‖P(v_j)‖) )    (15)

which scales the cosine similarity of the two pattern vectors by the ratio of their lengths.

Figure 2h and Figure 2i show examples of co-influence scores based on the discipline collaboration patterns of Figure 2e and the keyword collaboration patterns of Figure 2f respectively. The black numbers on black edges represent the co-influence scores between institutions.

From Figure 2h and Figure 2i, we can observe that the co-influence scores from the discipline and keyword aspects are similar but not identical. It means that the aspect attributes of discipline and keyword are complementary to each other for research collaboration relevance. Meanwhile, comparing with Figure 2g, the self-influence scores and the co-influence scores are quite different. It means that the co-influence score contains different information from the self-influence score. Thus, combining them into an overall scientific influence score is of great significance.
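The following snippet sketches the co-influence score of Eq. (15) as reconstructed above, i.e., cosine similarity scaled by the length ratio of the two pattern vectors; both the form and the example vectors are illustrative assumptions rather than the paper's exact definition.

```python
import numpy as np

def co_influence_score(p_i, p_j):
    """Similarity of two co-influence pattern vectors: cosine similarity scaled
    by the ratio of vector lengths, so that a prolific and an ordinary institution
    are not scored as close merely because their participation proportions look alike."""
    p_i, p_j = np.asarray(p_i, float), np.asarray(p_j, float)
    norm_i, norm_j = np.linalg.norm(p_i), np.linalg.norm(p_j)
    if norm_i == 0 or norm_j == 0:
        return 0.0
    cosine = float(p_i @ p_j) / (norm_i * norm_j)
    length_ratio = min(norm_i, norm_j) / max(norm_i, norm_j)
    return cosine * length_ratio

# Two institutions with similar direction but very different magnitudes.
print(co_influence_score([10.0, 2.0, 0.5], [1.0, 0.2, 0.05]))  # penalized by length ratio
print(co_influence_score([10.0, 2.0, 0.5], [9.0, 2.1, 0.4]))   # close in angle and length
```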

3.5 Overall Scientific Influence Measure

The overall scientific influence score considers not only the direct and indirect collaboration relationships between institutions, i.e., the self-influence collaboration patterns, but also the relationships between institutions and aspect attributes, i.e., the co-influence collaboration patterns, by integrating the self-influence score and the multiple co-influence scores.

By setting a given value for t, the self-influence score can be rewritten as S = e^{αH}, where α is the influence spread coefficient that can be used as the weight of the self-influence score. The overall scientific influence score is defined as

F = e^{αH} + Σ_{k=1}^{m} β_k·C_k    (16)

where m is the number of influence graphs, C_k is the k-th co-influence score matrix, and β_k is the weight for the k-th co-influence score. The scalar form can be written as

F(v_i, v_j) = S(v_i, v_j) + Σ_{k=1}^{m} β_k·C_k(v_i, v_j)    (17)

Figure 2j shows an example of the overall scientific influence scores based on the self-influence scores of Figure 2g, the discipline co-influence scores of Figure 2h, and the keyword co-influence scores of Figure 2i, where α, β_1, and β_2 are all set to 1. The black numbers on black edges represent the overall scientific influence between institutions.
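A minimal sketch of the weighted combination in Eq. (16) follows; the matrices and weights are illustrative placeholders, not the experiment settings of Table VII.

```python
import numpy as np

def overall_influence(S, co_matrices, betas):
    """Overall scientific influence F = S + sum_k beta_k * C_k (Eq. 16).
    S is the self-influence matrix; co_matrices and betas are the co-influence
    matrices C_k and their weights beta_k."""
    F = np.array(S, dtype=float)
    for beta, C in zip(betas, co_matrices):
        F += beta * np.asarray(C, dtype=float)
    return F

S  = np.array([[1.0, 0.02], [0.02, 1.0]])      # toy self-influence scores
C1 = np.array([[1.0, 0.30], [0.30, 1.0]])      # toy discipline co-influence
C2 = np.array([[1.0, 0.10], [0.10, 1.0]])      # toy keyword co-influence
print(overall_influence(S, [C1, C2], betas=[1.0, 1.0]))
```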

4 Graph Clustering Analysis

In this section, we will introduce our grant-based influence clustering algorithm GImpact, which partitions all institutions into K grant-based collaboration clusters by utilizing the overall scientific influence score as an influence-based distance function. GImpact follows the conventional K-Medoids clustering algorithm [19] and incorporates some new techniques for centroid initialization, vertex assignment, and centroid update. A distance metric is essential for the conventional K-Medoids clustering algorithm. A simple idea is to use the adjacency matrix: for a weighted undirected graph, the distance between two vertices is the sum of the weights of the edges on the shortest path connecting them. We argue that the overall scientific influence score can be a more useful distance metric than the simple adjacency matrix.

4.1 Purpose and Objective

The purpose of the clustering analysis is to partition the institution set V_I into K disjoint clusters C_1, ..., C_K, where C_1 ∪ ... ∪ C_K = V_I and C_i ∩ C_j = ∅ for i ≠ j, in order to find close collaboration clusters. The objective of the clustering analysis is to maximize the intra-cluster influence score and minimize the inter-cluster influence score to achieve a good balance between the following two characteristics: (1) vertices within one cluster should have close collaboration relationships and similar collaboration patterns; (2) vertices in different clusters should have relatively looser collaboration relationships and dissimilar collaboration patterns.

Definition 5 (Intra-Cluster Influence Score)

The intra-cluster influence score is the average influence score of the vertices in a cluster to its centroid. For a group of disjoint clusters C_1, ..., C_K, where c_i is the centroid of cluster C_i, the intra-cluster influence score of C_i is defined as below:

IntraInf(C_i) = (1/|C_i|) · Σ_{v ∈ C_i} F(v, c_i)    (18)
Definition 6 (Inter-Cluster Influence Score)

The inter-cluster influence score is the average influence score of the vertices in one cluster C_i to the centroid of another cluster C_j. For a group of disjoint clusters C_1, ..., C_K, where c_i (or c_j) is the centroid of cluster C_i (or C_j), the inter-cluster influence score between C_i and C_j is defined as below:

InterInf(C_i, C_j) = (1/|C_i|) · Σ_{v ∈ C_i} F(v, c_j)    (19)

Without loss of generality, a good centroid can represent its cluster. Moreover, considering only the influence scores between the centroid and the vertices in the cluster, rather than all pairs of vertices, effectively reduces the time complexity from O(|C_i|²) to O(|C_i|), which is essential for a large-scale graph.
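A small sketch of the two cluster-quality scores of Eqs. (18)-(19), assuming the overall influence matrix F is indexed by vertex position; the matrix and clusters below are toy values.

```python
import numpy as np

def intra_influence(F, cluster, centroid):
    """Average influence of cluster members to their centroid (Eq. 18).
    F is the overall influence matrix; cluster is a list of vertex indices."""
    return float(np.mean([F[v, centroid] for v in cluster]))

def inter_influence(F, cluster_i, centroid_j):
    """Average influence of vertices in one cluster to another cluster's centroid (Eq. 19)."""
    return float(np.mean([F[v, centroid_j] for v in cluster_i]))

# Toy 4-vertex influence matrix and two clusters (illustrative values).
F = np.array([[1.0, 0.8, 0.1, 0.2],
              [0.8, 1.0, 0.2, 0.1],
              [0.1, 0.2, 1.0, 0.7],
              [0.2, 0.1, 0.7, 1.0]])
print(intra_influence(F, cluster=[0, 1], centroid=0))
print(inter_influence(F, cluster_i=[0, 1], centroid_j=2))
```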

4.2 Centroid Initialization

Centroid initialization is the first step of the K-Medoids clustering algorithm. The purpose of centroid initialization is to obtain a collection of K initial centroids. Good initial centroids usually lead to good clustering results, while bad initial centroids do the opposite. There are many different centroid initialization schemes, such as DENCLUE [20] and K-Means++ [21]. The original K-Medoids clustering algorithm uses the random initialization scheme. The idea of DENCLUE is to choose the local maxima of the density function as centroids. The idea of K-Means++ is to choose the first centroid at random and to choose the remaining centroids with probability proportional to their squared distance from the closest existing center. We test different centroid initialization schemes and discuss their respective features.

4.2.1 Random Initialization

The centroids are selected completely at random in this scheme. The random result serves as the baseline for all the other schemes.

4.2.2 Top-K Degree Initialization

Degree is the number of edges that a node has [22]. A vertex with a higher degree has locally the largest number of neighbors and can diffuse heat to as many vertices as possible. This is clearly better than a completely random selection in the data space. We sort all vertices in descending order of their degrees and select the top-K vertices as the initial centroids.

4.2.3 Top-K Density Initialization

The density of a vertex is the sum of all influence scores related to itself:

density(v) = Σ_{u ∈ V_I, u ≠ v} F(v, u)    (20)

The larger the density value of a vertex, the faster the vertex can diffuse and receive influence. Compared to the top-K degree scheme, this scheme considers the edge weights. Thus we select the top-K vertices, sorted in descending order of their density values, as the initial centroids.

4.2.4 Mixed Initialization

Inspired by K-Medoids++ [21], we develop a mixed centroid initialization scheme. A good centroid should be far from all other centroids, but it should not be an outlier either. Our initialization scheme chooses the first centroid as the vertex with the maximum density value, calculated by Eq.(20). Then we choose the remaining centroids by considering both the average influence score and the maximum influence score with respect to the other existing centroids. A new centroid with the smallest average influence score will be as far as possible from the other existing centroids. However, if the distances to the other existing centroids are unevenly distributed, for example, the distance to one of the centroids is very close while the distances to the other centroids are very far, the average influence score may still be quite small. A new centroid with the smallest maximum influence score will be as far as possible from the nearest existing centroid; however, it may select an outlier in practice. Based on the above considerations, we define the mix score of a vertex v as:

mix(v) = (1/|Ĉ|) · Σ_{c_j ∈ Ĉ} F(v, c_j) + max_{c_j ∈ Ĉ} F(v, c_j)    (21)

where c_j is the existing centroid of cluster C_j in the current centroid set Ĉ. Compared to the above schemes, this scheme chooses a good vertex as the starting centroid and guarantees that the centroids are separated as much as possible. The other schemes do not take this into account, so the selected centroids may belong to the same actual cluster. Thus, at the beginning of the clustering algorithm, we repeatedly select the vertex with the smallest mix score as the next centroid.
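The sketch below implements this mixed initialization under the reconstruction of Eqs. (20)-(21) given above (average plus maximum influence to the already chosen centroids); the equal weighting of the two terms and the toy matrix are assumptions.

```python
import numpy as np

def mixed_initialization(F, k):
    """Mixed centroid initialization: start from the vertex with the largest
    density (Eq. 20), then repeatedly add the vertex whose mix score (average +
    maximum influence to the centroids chosen so far, Eq. 21) is smallest,
    i.e., the vertex farthest from all existing centroids without being picked
    purely by one criterion."""
    n = F.shape[0]
    density = F.sum(axis=1) - np.diag(F)          # Eq. (20)
    centroids = [int(np.argmax(density))]
    while len(centroids) < k:
        candidates = [v for v in range(n) if v not in centroids]
        scores = [F[v, centroids].mean() + F[v, centroids].max() for v in candidates]
        centroids.append(candidates[int(np.argmin(scores))])  # smallest mix score
    return centroids

# Toy overall influence matrix for 5 institutions (illustrative values).
F = np.array([[1.0, 0.9, 0.1, 0.2, 0.1],
              [0.9, 1.0, 0.2, 0.1, 0.1],
              [0.1, 0.2, 1.0, 0.8, 0.7],
              [0.2, 0.1, 0.8, 1.0, 0.6],
              [0.1, 0.1, 0.7, 0.6, 1.0]])
print(mixed_initialization(F, k=2))
```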

4.3 Vertex Assignment

After centroids have been chosen in the iteration, we need to assign all the remaining vertices to centroids (or clusters). We test two different schemes for vertex assignment:

4.3.1 Closest Centroid

The simple idea of vertex assignment is to assign each vertex to its closest centroid, i.e., the centroid with the largest influence score to the vertex. If we have good centroids all the time, this scheme works well. However, if the selection of centroids is not good in an iteration, for example, a centroid lying at the edge of a cluster or two centroids falling into the same cluster, the clustering result will develop in a worse direction.

4.3.2 Dynamic Assignment

Considering the problem of the closest centroid scheme, we develop a dynamic vertex assignment scheme. When looking for the most appropriate cluster for a vertex, we consider not only the influence score to the centroid but also the influence scores to all the vertices already assigned to the cluster. When the assignment process has just begun, it produces the same result as the closest centroid scheme. As the assignment process proceeds, it becomes different and better. Even if a poor centroid was selected in the last iteration, for example, one at the edge of a cluster, we can still assign the right vertices to it, because the center of the gradually growing cluster that we assign vertices to progressively approaches the actual center of the data space. The order in which vertices are assigned affects the result of the assignment, so the vertex order is randomly shuffled in each iteration. Based on the above considerations, we choose for each vertex the cluster with the largest combined influence score to the centroid and to the vertices already assigned to that cluster.

4.4 Centroid Update

After assigning all remaining vertices to centroids (or clusters), we need to update new centroids from the assigned cluster. Centroid update is essential for K-Medoids clustering algorithm. A good centroid update schema should make the whole clustering process develop in a good direction. We test two different schemes for centroid update:

4.4.1 Most Central

The simple idea of centroid update is to choose the most central vertex as the new centroid. Before discussing the concept of most central, we first define the influence score vector of a vertex v in cluster C_i, which is a vector of length |C_i|. The j-th element of the influence score vector is the influence score between vertex v and the other vertex u_j in C_i:

IV(v) = ( F(v, u_1), F(v, u_2), ..., F(v, u_{|C_i|}) ),  u_j ∈ C_i    (22)

The average influence score vector of cluster C_i is also a vector of length |C_i|. The j-th element of the average influence score vector is the average influence score between vertex u_j and the other vertices in C_i:

AV(C_i) = (1/|C_i|) · Σ_{v ∈ C_i} IV(v)    (23)

The new centroid is the vertex whose influence score vector is the closest to the average influence score vector. This can be formulated as

c_i = argmin_{v ∈ C_i} ‖ IV(v) − AV(C_i) ‖    (24)

This scheme relies heavily on a good assignment of vertices. If the assignment is bad, it will update to a bad centroid; the bad centroid will in turn lead to a bad assignment, and the clustering result will become worse and worse.

4.4.2 Max Objective

We think a good centroid update scheme should make the clustering result develop in the direction of maximizing the objective function. Thus we update the centroid of cluster C_i to the vertex that maximizes the objective function. The objective function of a vertex v ∈ C_i is defined as:

obj(v) = ( Σ_{u ∈ C_i} F(u, v) ) / ( Σ_{u ∉ C_i} F(u, v) )    (25)

The numerator of Eq.(25) represents the influence from the cluster to the new centroid, i.e., the intra-cluster influence score. The denominator of Eq.(25) captures the influence from the other clusters to the new centroid, i.e., the inter-cluster influence score. Maximizing the objective function means maximizing the numerator while minimizing the denominator, so the new centroid achieves a good balance between a larger intra-cluster influence score and a smaller inter-cluster influence score.

The new centroid is the vertex whose objective function reaches the maximum. This can be formulated as

c_i = argmax_{v ∈ C_i} obj(v)    (26)

4.5 Algorithm Summarization

Given an institution graph G_I and the associated influence graphs G_{IA_1}, ..., G_{IA_m}, Table VI shows the steps for partitioning the institution graph into K close collaboration clusters.

 

Input:  An institution graph G_I, associated influence graphs G_{IA_1}, ..., G_{IA_m}, the number of clusters K, the weights α, β_1, ..., β_m
Output: K clusters C_1, ..., C_K
1:  Calculate the self-influence matrix and the co-influence matrices respectively
2:  Integrate them into the overall influence matrix F with weights α, β_1, ..., β_m
3:  Choose K initial centroids
4:  Repeat
5:     Assign each vertex to a centroid
6:     Update the new centroid of each cluster
7:  Until centroids no longer change or the maximum number of iterations is reached
8:  Return K clusters C_1, ..., C_K

TABLE VI: GImpact Algorithm
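As a runnable sketch of steps 3-8 of Table VI, the snippet below combines the mixed initialization, the simpler closest-centroid assignment of Section 4.3.1, and the max-objective centroid update of Eqs. (25)-(26). It assumes the overall influence matrix F has already been computed; the toy matrix and the stopping rule are illustrative choices.

```python
import numpy as np

def gimpact_cluster(F, k, max_iter=50, seed=0):
    """Influence-based K-Medoids-style clustering over the overall influence
    matrix F: mixed initialization, closest-centroid assignment, and the
    max-objective centroid update (a sketch, not the full dynamic assignment)."""
    rng = np.random.default_rng(seed)
    n = F.shape[0]
    density = F.sum(axis=1) - np.diag(F)
    centroids = [int(np.argmax(density))]
    while len(centroids) < k:                      # mixed initialization
        candidates = [v for v in range(n) if v not in centroids]
        scores = [F[v, centroids].mean() + F[v, centroids].max() for v in candidates]
        centroids.append(candidates[int(np.argmin(scores))])

    for _ in range(max_iter):
        clusters = {c: [c] for c in centroids}
        order = [v for v in rng.permutation(n) if v not in centroids]
        for v in order:                            # closest-centroid assignment
            best = max(centroids, key=lambda c: F[v, c])
            clusters[best].append(v)

        new_centroids = []
        for c, members in clusters.items():        # max-objective update, Eq. (25)-(26)
            others = [u for u in range(n) if u not in members]
            def objective(v):
                intra = F[members, v].sum()
                inter = F[others, v].sum() if others else 1e-12
                return intra / inter
            new_centroids.append(max(members, key=objective))
        if sorted(new_centroids) == sorted(centroids):
            break                                  # converged
        centroids = new_centroids
    return clusters

F = np.array([[1.0, 0.9, 0.1, 0.2, 0.1],
              [0.9, 1.0, 0.2, 0.1, 0.1],
              [0.1, 0.2, 1.0, 0.8, 0.7],
              [0.2, 0.1, 0.8, 1.0, 0.6],
              [0.1, 0.1, 0.7, 0.6, 1.0]])
print(gimpact_cluster(F, k=2))
```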

5 Optimization

Our research grants dataset contains a special feature that we can take advantage of: the discipline attribute has a hierarchical structure. The hierarchical structure reveals the relationships between disciplines as supervisory information that can be used to supplement connections and to aggregate connections. In this section, we will show how we use this special feature to optimize the influence analysis.

5.1 The Hierarchical Structure

The discipline attribute has a three-layer structure, consisting of first-level disciplines, second-level disciplines, and third-level disciplines. A first-level discipline contains multiple second-level disciplines, and a second-level discipline contains multiple third-level disciplines. Figure 3 provides an illustrative example fragment of the hierarchical structure of the first-level discipline D820/Law.

Fig. 3: An example fragment of the hierarchical structure of D820/Law

Since interdisciplinary grants in our dataset are not common, the connections between disciplines are sparse. The hierarchical structure can provide relationships between disciplines to supplement connections. For example, D8204010 (International Public Law) and D8204020 (International Private Law), which are under the same second-level discipline, have a close relationship. Besides, we can treat the hierarchy as a hand-crafted classification. Thus we can classify all 1569 disciplines into 63 first-level discipline categories to aggregate connections if they are under the same first-level discipline. For each keyword, we choose the discipline with the most associated grants as its feature discipline, which uniquely represents the keyword. Following the discipline classification, we can classify all 20097 keywords into the 63 first-level discipline categories as well.

5.2 Supplement Connections

Suppose that disciplines under the same parent discipline have high relevance. We define the coefficient θ_2 for disciplines under the same second-level discipline, and the coefficient θ_1 for disciplines under the same first-level discipline. The coefficient θ_2 should be larger than θ_1, because disciplines under the same second-level discipline should have higher relevance than disciplines that only share the same first-level discipline. The supplement weight value is based on the average weight w̄ of all edges in the discipline aspect graph. Thus the weight after supplement can be defined as

w'(a_i, a_j) = w(a_i, a_j) + θ_2·w̄, if a_i and a_j are under the same second-level discipline;
w'(a_i, a_j) = w(a_i, a_j) + θ_1·w̄, if a_i and a_j are under the same first-level discipline but different second-level disciplines;
w'(a_i, a_j) = w(a_i, a_j), otherwise    (27)
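The sketch below applies the supplement rule of Eq. (27) as reconstructed above to a dictionary of discipline-discipline edge weights; the coefficient values, the second-level codes, and the toy weights are illustrative assumptions.

```python
def supplement_weights(weights, parent2, parent1, theta2=0.5, theta1=0.2):
    """Supplement discipline-discipline edge weights using the hierarchy.
    `weights` maps frozenset({a, b}) -> weight; parent2/parent1 map a discipline
    code to its second-/first-level ancestor. theta2 > theta1; all values here
    are illustrative, not the paper's settings."""
    avg = sum(weights.values()) / len(weights) if weights else 0.0
    disciplines = set(parent1)
    out = dict(weights)
    for a in disciplines:
        for b in disciplines:
            if a >= b:
                continue                      # visit each unordered pair once
            key = frozenset({a, b})
            w = out.get(key, 0.0)
            if parent2[a] == parent2[b]:
                out[key] = w + theta2 * avg   # same second-level discipline
            elif parent1[a] == parent1[b]:
                out[key] = w + theta1 * avg   # same first-level discipline only
    return out

# Toy hierarchy fragment under D820 (Law); ancestor codes are illustrative.
parent1 = {"D8204010": "D820", "D8204020": "D820", "D8203010": "D820"}
parent2 = {"D8204010": "D82040", "D8204020": "D82040", "D8203010": "D82030"}
weights = {frozenset({"D8204010", "D8203010"}): 3.0}
print(supplement_weights(weights, parent2, parent1))
```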

5.3 Aggregate Connections

The hierarchical structure can be seen as a hand-crafted classification. We can apply such a classification on the influence graph to aggregate connections between institutions and aspect attributes. Such aggregation can ignore the differences between particular research directions under the same research area to make the co-influence collaboration patterns more visible, and it improves computing performance for a large-scale graph.

To apply classification on influence graph, we can partition the influence graph into parts, denoted by . We can construct a auxiliary matrix for aggregation. The rows of the auxiliary matrix represent disjoint parts, and the columns of the auxiliary matrix represent institutions and aspect attributes. The auxiliary matrix is defined by

(28)

where each entry is the probability that an aspect attribute belongs to the corresponding part. Multiplying the auxiliary matrix by the multi-hop influence spread kernel yields the aggregated influence spread kernel. Thus the formulation of the co-influence collaboration pattern of an institution is updated to

(29)
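A toy Java sketch of this aggregation step follows. It assumes the auxiliary matrix stores, for each first-level category, the membership probabilities of the aspect attributes, and that the multi-hop influence spread kernel is indexed by aspect attributes (rows) and institutions (columns); the sizes and values are illustrative only.

/** Toy sketch: aggregate the multi-hop influence spread kernel by
 *  first-level discipline categories via a matrix product. */
public class AggregateConnectionsSketch {

    /** Plain matrix product: (p x m) times (m x n) gives (p x n). */
    static double[][] multiply(double[][] a, double[][] b) {
        int p = a.length, m = b.length, n = b[0].length;
        double[][] c = new double[p][n];
        for (int i = 0; i < p; i++)
            for (int k = 0; k < m; k++)
                for (int j = 0; j < n; j++)
                    c[i][j] += a[i][k] * b[k][j];
        return c;
    }

    public static void main(String[] args) {
        // Auxiliary matrix (2 categories x 4 aspect attributes): each entry is the
        // probability that an aspect attribute belongs to a first-level category.
        double[][] aux = {
                {1.0, 1.0, 0.0, 0.0},
                {0.0, 0.0, 1.0, 1.0}};
        // Multi-hop influence spread kernel (4 aspect attributes x 3 institutions),
        // placeholder values.
        double[][] kernel = {
                {0.20, 0.05, 0.01},
                {0.10, 0.02, 0.01},
                {0.01, 0.30, 0.25},
                {0.02, 0.15, 0.40}};
        double[][] aggregated = multiply(aux, kernel); // 2 x 3
        // aggregated = {{0.30, 0.07, 0.02}, {0.03, 0.45, 0.65}}:
        // the influence each institution spreads to each first-level category.
    }
}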

Figure 4(a) and Figure 4(b) show examples of the aggregated multi-hop influence spread kernels of Figure 2e and Figure 2f, respectively. The red numbers on the red dashed lines measure the scientific influence that the discipline categories receive from institutions. As shown in the figures, CUPL, SWUPL, and ECUPL have a very high influence on the D820 (Law) research area, while CUFE, SWUFE, and SHUFE have a very high influence on the D790 (Economics) research area. Compared to the result before aggregation, ignoring the differences under the same first-level discipline makes the similarity of institutions in the same research area much higher, which plays a critical role in the graph clustering analysis.

(a) Co-influence Patterns on Discipline
(b) Co-influence Patterns on Keyword
Fig. 4: Aggregated Result of Co-influence Patterns

6 Evaluation

In this section, we show the effectiveness of our proposed influence analysis approach, GImpact, and the efficiency of our optimizations, using a real grants dataset collected from the Social Sciences Management Databases of Chinese Universities (SMDB).

6.1 Datasets and Experiments Setup

In our experiments, we choose institution as the primary attribute, and discipline and keyword as the set of secondary attributes. We vary the number of clusters K = 25, 50, 100, 200. The associated weights used for each K are listed in Table VII. All experiments are conducted on Windows 10 with 8GB 1600MHz DDR3 memory and a 3.3GHz Intel Core i5 processor. We implement all algorithms in Java 8.

Type K Weight 1 Weight 2 Weight 3
Institution 25 1 - -
Institution+Discipline 25 0.7 1.3 -
Institution+Keyword 25 1.3 0.7 -
Institution+Discipline+Keyword 25 0.15 1.9 0.95
Institution 50 1 - -
Institution+Discipline 50 0.5 1.5 -
Institution+Keyword 50 0.25 1.75 -
Institution+Discipline+Keyword 50 0.8 2.1 0.1
Institution 100 1 - -
Institution+Discipline 100 1.45 0.55 -
Institution+Keyword 100 0.25 1.75 -
Institution+Discipline+Keyword 100 2.05 0.8 0.15
Institution 200 1 - -
Institution+Discipline 200 1.55 0.45 -
Institution+Keyword 200 0.65 1.35 -
Institution+Discipline+Keyword 200 2.2 0.6 0.2
TABLE VII: The Weights for The Overall Scientific Influence Score

6.2 Evaluation Methods

We use two metrics to evaluate the quality of the clustering results. The first metric is the density of clusters, which can be defined as:

(30)

Density measures the cohesiveness within clusters. It reflects the consistency between the clustering results and the existing institution collaboration relationships. The larger the density value, the more cohesive the clustering results.

The second metric is the Davies-Bouldin Index (DBI) [23], which measures the uniqueness (separation) of clusters.

(31)
(32)

where, for each cluster, we use its centroid, the influence score between two vertices, and the average influence score from the vertices in a cluster to its centroid. A clustering with a higher intra-cluster collaboration score and a lower inter-cluster collaboration score will have a lower DBI value.
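For reference, the classical distance-based form of the index in [23] is

DBI = \frac{1}{K} \sum_{i=1}^{K} \max_{j \neq i} \frac{\sigma_i + \sigma_j}{d(c_i, c_j)},

where \sigma_i is the average scatter of the vertices in cluster i around its centroid c_i and d(c_i, c_j) is the separation between centroids c_i and c_j. We assume Equations (31) and (32) adapt this form to influence scores, which is why higher intra-cluster and lower inter-cluster influence yields a lower value here.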

6.3 Model Effectiveness

We evaluate the effectiveness of the overall scientific influence score by using different numbers of influence graphs under different K. By comparing the clustering results obtained with different influence graphs, we can learn whether integrating multiple influence graphs improves the clustering analysis. Figure 5(a) and Figure 5(b) show the density comparison and the DBI comparison, respectively, varying the number of clusters K = 25, 50, 100, 200.

(a) Density
(b) DBI
Fig. 5: Different Influence Graph Comparison

The clustering results using both the discipline influence graph and the keyword influence graph have a significantly higher density value and a substantially lower DBI value. This indicates that the overall scientific influence score effectively improves clustering quality, because it considers not only self-influence collaboration patterns but also co-influence collaboration patterns. The clustering results thus capture the full features of the whole research grants network rather than only the simple features from the institution graph.

Meanwhile, the clustering results using only the discipline influence graph or only the keyword influence graph are not as good as the results using both. This indicates that neither discipline nor keyword alone describes the research area comprehensively. As K increases, these single-graph results become even worse than not using influence graphs at all.

6.4 Clustering Comparison

We compare different schemes in the clustering analysis under different K. By comparing the clustering results produced by different schemes, we can evaluate how well each scheme performs.

6.4.1 Centroid Initialization Comparison

Figure 6(a) and Figure 6(b) show the quality comparison of the different initialization schemes discussed above. The clustering results with the mixed scheme have significantly better quality than the other schemes, because the mixed scheme guarantees that the initial centroids are separated as much as possible, while the other schemes do not take this into account. If bad centroids are chosen, it is hard to recover a good result. In addition, the top-K degree scheme and the top-K density scheme are slightly better than the random scheme.

(a) Density
(b) DBI
Fig. 6: Centroid Initialization Comparison

6.4.2 Vertex Assignment Comparison

Figure 7(a) and Figure 7(b) show the quality comparison of the different vertex assignment schemes discussed above. When K is small, the clustering results with the dynamic assignment scheme have significantly better quality than those with the closest centroid scheme; when K is large, the two schemes have almost the same quality. The reason is that a smaller K yields more elements per cluster, and the more elements a cluster contains, the more visible the benefit of the dynamic assignment scheme. When clusters are relatively small, i.e., K is relatively large, the quality of the two schemes is nearly the same.

(a) Density
(b) DBI
Fig. 7: Vertex Assignment Comparison

6.4.3 Centroid Update Comparison

Figure 8(a) and Figure 8(b) show the quality comparison of the different centroid update schemes discussed above. The clustering results with the max objective scheme have a significantly higher density value than those with the most central scheme, and the DBI of the two schemes is nearly the same. This demonstrates that the max objective scheme finds better centroids than the most central scheme, because it achieves a good balance between a larger intra-cluster influence score and a smaller inter-cluster influence score.

(a) Density
(b) DBI
Fig. 8: Centroid Update Comparison

6.5 Optimization Efficiency

We compare the quality achieved by the different optimization strategies under different K, to evaluate their pros and cons.

Figure 9(a) and Figure 9(b) show the quality comparison of the different optimization strategies discussed above. As shown in the figures, both supplementing and aggregation increase the density value, and combining them yields better results. However, aggregation increases the DBI value notably, because the influence scores become larger in magnitude after aggregation; this inflates the DBI value, whereas the density value is not affected by the magnitude of the influence scores.

(a) Density
(b) DBI
Fig. 9: Hierarchy Optimization Comparison

6.6 Case Study

6.6.1 Co-influence Collaboration Patterns

Table VIII and Table IX show the details of the co-influence collaboration patterns of discipline and keyword after aggregation. Each row gives the probability that an institution belongs to each first-level discipline category, and each column gives the contribution of each institution to a given first-level discipline category, i.e., institution-level leadership in different research subject areas.

From these two tables, we can observe some interesting phenomena. First, comprehensive institutions, e.g., PKU, RUC, and WHU, usually have relatively high values across all research subject areas, while specialized institutions, e.g., CUFE, CUPL, and BNU, usually have very high scores in their primary research subject areas and relatively low scores elsewhere. This is consistent with common knowledge. Second, by summing each column, we can learn which disciplines are the primary research subject areas of social science, and by summing each row, we can learn which institutions contribute more to the development of social science.
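For instance, summing the columns of Table VIII over the twelve institutions listed gives roughly 2.41 for D790 (Economics) and 2.21 for D820 (Law), versus roughly 0.64 for D190, which reflects the dominance of economics and law among these particular institutions.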

Institution D190 D630 D740 D790 D820 D870
PKU 0.015 0.019 0.086 0.123 0.061 0.059
RUC 0.017 0.077 0.023 0.254 0.108 0.081
BNU 0.297 0.015 0.093 0.048 0.035 0.022
CUFE 0.022 0.055 0.008 0.479 0.028 0.011
CUPL 0.127 0.009 0.031 0.026 0.660 0.006
FUDAN 0.025 0.023 0.093 0.117 0.016 0.056
ECNU 0.092 0.022 0.070 0.076 0.014 0.011
SHUFE 0.007 0.065 0.016 0.560 0.025 0.006
ECUPL 0.011 0.021 0.021 0.043 0.565 0.011
WHU 0.009 0.052 0.020 0.115 0.106 0.214
SWUFE 0.010 0.070 0.010 0.515 0.012 0.010
SWUPL 0.012 0.028 0.022 0.057 0.576 0.009
TABLE VIII: The Co-influence Patterns of Discipline
Institution D190 D630 D740 D790 D820 D870
PKU 0.004 0.065 0.207 0.122 0.206 0.015
RUC 0.029 0.149 0.092 0.161 0.197 0.015
BNU 0.060 0.022 0.144 0.041 0.178 0.016
CUFE 0.008 0.152 0.022 0.539 0.140 0.008
CUPL 0.008 0.027 0.045 0.049 0.717 0.003
FUDAN 0.007 0.077 0.226 0.172 0.075 0.013
ECNU 0.050 0.040 0.117 0.096 0.055 0.009
SHUFE 0.003 0.118 0.096 0.502 0.114 0.001
ECUPL 0.004 0.024 0.048 0.086 0.659 0.003
WHU 0.004 0.110 0.100 0.183 0.190 0.048
SWUFE 0.003 0.171 0.086 0.547 0.097 0.002
SWUPL 0.004 0.049 0.050 0.072 0.630 0.003
TABLE IX: The Co-influence Patterns of Keyword

6.6.2 Influence Scores

Table X shows the self-influence scores of PKU, FUDAN, CUFE, SHUFE, CUPL, and ECUPL. The influence matrix is symmetric. Except for the diagonal, we highlight self-influence scores higher than 0.003. We observe that the self-influence score between institutions is mainly driven by geographical location and research area. For example, PKU, CUFE, and CUPL are all located in Beijing, while FUDAN, SHUFE, and ECUPL are all located in Shanghai. Meanwhile, the influence score between CUPL and ECUPL reaches 0.005, because both primarily study political science and law.

Institution PKU FUDAN CUFE SHUFE CUPL ECUPL
PKU 1.000 0.002 0.006 0.001 0.004 0.002
FUDAN 0.002 1.000 0.001 0.007 0.001 0.004
CUFE 0.006 0.001 1.000 0.003 0.008 0.004
SHUFE 0.001 0.007 0.003 1.000 0.002 0.006
CUPL 0.004 0.001 0.008 0.002 1.000 0.005
ECUPL 0.002 0.004 0.004 0.006 0.005 1.000
TABLE X: Self-influence Score

Table XI and Table XII show the co-influence scores of discipline and keyword, respectively. We highlight co-influence scores higher than 0.5, except for the diagonal. The research area of an institution mainly determines its co-influence scores, because discipline and keyword adequately describe an institution's research area. For instance, the discipline co-influence score between CUFE and SHUFE reaches 0.935 and the keyword co-influence score reaches 0.807, because both primarily study finance and economics. The law-oriented institutions CUPL and ECUPL show a similar result, and the comprehensive institutions PKU and FUDAN also have a high co-influence score. This indicates that the co-influence score captures the commonality of institutions' research areas.

Institution PKU FUDAN CUFE SHUFE CUPL ECUPL
PKU 1.000 0.696 0.400 0.389 0.134 0.160
FUDAN 0.696 1.000 0.344 0.347 0.063 0.074
CUFE 0.400 0.344 1.000 0.935 0.061 0.068
SHUFE 0.389 0.347 0.935 1.000 0.050 0.055
CUPL 0.134 0.063 0.061 0.050 1.000 0.854
ECUPL 0.160 0.074 0.068 0.055 0.854 1.000
TABLE XI: Co-influence Score of Discipline
Institution PKU FUDAN CUFE SHUFE CUPL ECUPL
PKU 1.000 0.732 0.297 0.415 0.281 0.307
FUDAN 0.732 1.000 0.298 0.476 0.111 0.131
CUFE 0.297 0.298 1.000 0.807 0.157 0.198
SHUFE 0.415 0.476 0.807 1.000 0.127 0.163
CUPL 0.281 0.111 0.157 0.127 1.000 0.948
ECUPL 0.307 0.131 0.198 0.163 0.948 1.000
TABLE XII: Co-influence Score of Keyword

Table XIII shows the overall scientific influence scores when the weights of the self-influence score and the two co-influence scores are all set to 1. Except for the diagonal, we highlight the overall scientific influence scores higher than 1. By integrating the self-influence score and the co-influence scores, we enhance the research area component of the self-influence score, so that the overall influence score reflects what we care about through the choice of influence graphs. In addition, the geographical component of the self-influence score complements the research area component of the co-influence scores. Thus, we obtain an overall description of the scientific influence of institutions.

Institution PKU FUDAN CUFE SHUFE CUPL ECUPL
PKU 3.000 1.429 0.703 0.806 0.419 0.470
FUDAN 1.429 3.000 0.643 0.829 0.176 0.209
CUFE 0.703 0.643 3.000 1.745 0.226 0.269
SHUFE 0.806 0.829 1.745 3.000 0.179 0.224
CUPL 0.419 0.176 0.226 0.179 3.000 1.808
ECUPL 0.470 0.209 0.269 0.224 1.808 3.000
TABLE XIII: Overall Scientific Influence Score
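As a consistency check, with all three weights set to 1 each entry of Table XIII is simply the sum of the corresponding entries in Tables X, XI, and XII: for CUFE and SHUFE, 0.003 + 0.935 + 0.807 = 1.745, and every diagonal entry is 1 + 1 + 1 = 3.000.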

7 Conclusion

We have presented the design and development of GImpact, a grant-based scientific influence analysis service. It takes a graph-theoretic approach to designing and developing large-scale scientific influence analysis over a large research-grant repository, with three original contributions. First, we mine the grant database to identify and extract important features for grant-based scientific influence analysis and represent such features using graph-theoretic models. In our first prototype of GImpact, we construct an institution graph and two grant-aspect-specific influence graphs, i.e., a discipline graph and a keyword graph. Second, we utilize the heat-diffusion-based influence spread model to calculate the self-influence score and the co-influence scores, which capture two types of collaboration relationships. Third, we compute the overall scientific influence score for every pair of institutions by introducing a weighted sum of the self-influence score and the multiple co-influence scores, and conduct an influence-based clustering analysis. Evaluating GImpact using a real grant database consisting of 2512 institutions and their grants received over a period of 14 years, we show that GImpact can effectively identify grant-based research collaboration groups and provide valuable insight for an in-depth understanding of the scientific influence of research grants on research programs, institution leadership, and future collaboration opportunities.

Acknowledgments

The authors from Huazhong University of Science and Technology, Wuhan, China, are supported by the Chinese university Social sciences Data Center (CSDC) construction projects (2017-2018) from the Ministry of Education, China. The first author, Dr. Yuming Wang, is a visiting scholar at the School of Computer Science, Georgia Institute of Technology, funded by China Scholarship Council (CSC) for the visiting period of one year from December 2017 to December 2018. Prof. Ling Liu’s research is partially supported by the USA National Science Foundation CISE grant 1564097 and an IBM faculty award. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funding agencies.

References

  • [1] E. Garfield, “Citation indexes for science; a new dimension in documentation through association of ideas,” Science, vol. 122, no. 3159, pp. 108–111, 1955.
  • [2] E. Garfield, “The history and meaning of the journal impact factor,” JAMA, vol. 295, no. 1, pp. 90–93, 2006.
  • [3] J. E. Hirsch, “An index to quantify an individual’s scientific research output,” Proceedings of the National Academy of Sciences, vol. 102, no. 46, pp. 16569–16572, 2005.
  • [4] ——, “Does the h index have predictive power?” Proceedings of the National Academy of Sciences, vol. 104, no. 49, pp. 19193–19198, 2007.
  • [5] L. Egghe, “Theory and practise of the g-index,” Scientometrics, vol. 69, no. 1, pp. 131–152, 2006.
  • [6] S. Alonso, F. Cabrerizo, E. Herrera-Viedma, and F. Herrera, “hg-index: A new index to characterize the scientific output of researchers based on the h- and g-indices,” Scientometrics, vol. 82, no. 2, pp. 391–400, 2009.
  • [7] K. W. Boyack and K. Börner, “Indicator-assisted evaluation and funding of research: Visualizing the influence of grants on the number and citation counts of research papers,” Journal of the American Society for Information Science and Technology, vol. 54, no. 5, pp. 447–461, 2003.
  • [8] B. A. Jacob and L. Lefgren, “The impact of research grant funding on scientific productivity,” Journal of public economics, vol. 95, no. 9-10, pp. 1168–1177, 2011.
  • [9] M. E. Newman and M. Girvan, “Finding and evaluating community structure in networks,” Physical Review E, vol. 69, no. 2, p. 026113, 2004.
  • [10] S. Fortunato, “Community detection in graphs,” Physics Reports, vol. 486, no. 3-5, pp. 75–174, 2010.
  • [11] B. S. Khan and M. A. Niazi, “Network community detection: A review and visual survey,” arXiv preprint arXiv:1708.00977, 2017.
  • [12] G. Wang, Q. Hu, and P. S. Yu, “Influence and similarity on heterogeneous networks,” in Proceedings of the 21st ACM international conference on Information and knowledge management.   ACM, 2012, pp. 1462–1466.
  • [13] T. Arif, R. Ali, and M. Asger, “Scientific co-authorship social networks: A case study of computer science scenario in india,” International Journal of Computer Applications, vol. 52, no. 12, 2012.
  • [14] S. Bergsma, R. L. Mandryk, and G. McCalla, “Learning to measure influence in a scientific social network,” in Canadian Conference on Artificial Intelligence.   Springer, 2014, pp. 35–46.
  • [15] J. Jiang, P. Shi, B. An, J. Yu, and C. Wang, “Measuring the social influences of scientist groups based on multiple types of collaboration relations,” Information Processing & Management, vol. 53, no. 1, pp. 1–20, 2017.
  • [16] P. L. Giudice, P. Russo, D. Ursino et al., “A new social network analysis-based approach to extracting knowledge patterns about research activities and hubs in a set of countries,” International Journal of Business Innovation and Research, vol. 17, no. 2, pp. 147–186, 2018.
  • [17] H. Ma, H. Yang, M. R. Lyu, and I. King, “Mining social networks using heat diffusion processes for marketing candidates selection,” in Proceedings of the 17th ACM conference on Information and knowledge management.   ACM, 2008, pp. 233–242.
  • [18] Y. Zhou and L. Liu, “Social influence based clustering of heterogeneous information networks,” in Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining.   ACM, 2013, pp. 338–346.
  • [19] L. Kaufman and P. Rousseeuw, Clustering by means of medoids.   North-Holland, 1987.
  • [20] A. Hinneburg, D. A. Keim et al., “An efficient approach to clustering in large multimedia databases with noise,” in KDD, vol. 98, 1998, pp. 58–65.
  • [21] D. Arthur and S. Vassilvitskii, “k-means++: The advantages of careful seeding,” in Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms.   Society for Industrial and Applied Mathematics, 2007, pp. 1027–1035.
  • [22] R. A. Hanneman and M. Riddle, “Introduction to social network methods,” 2005.
  • [23] D. L. Davies and D. W. Bouldin, “A cluster separation measure,” IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 2, pp. 224–227, 1979.