Research grants from governments and industry have played an important role in seeding and fostering fundamental and cutting-edge research projects, resulting in many research innovations and scientific discoveries. However, existing scientific influence analysis services to date have mainly centered on evaluating the impact factor of a journal or a conference based on citation counts. The first proposal for the Journal Impact Factor was introduced by E. Garfield in 1955 to evaluate the influence of journals. It has been developed for more than 60 years and is still widely used today. In 2005, Hirsch [3, 4] proposed the h-index to measure the influence of an individual researcher by combining both quantity (the number of publications) and quality (the count of citations). Several variant indices have been proposed to further enhance the h-index, such as the g-index and the hg-index, by adding one or more new attributes to the index or by changing the way citation counts are processed. Such citation-based impact factors have been used by many research labs and academic institutions as one factor in evaluating the scholarly achievement of a researcher. Google Scholar is a popular service for this purpose.
As big data and cloud computing become ubiquitous, there is a growing demand for developing large scale scientific influence analysis on research grant repositories and delivering such analysis as a service. In contrast to publication citation counts, scientific influence analysis on research grants can provide insight into how research grants help foster new research collaborations, encourage cross-organizational collaborations, influence new research trends, and identify technical leadership. For example, examining a research grant repository over a certain period of time can reveal a number of interesting perspectives: the subject areas in which academic institutions and industry researchers collaborate by means of cross-organization projects, the type of influence that research grants have on prioritizing certain research subjects over others, the research trends, and the leadership in different research subject areas. Analysis of research grant data may also reveal the specific research subject areas that are in demand or on the priority lists of governments or industry. However, very few research efforts have addressed grants-based scientific influence analysis using statistical methods [7, 8].
In this paper, we develop a graph-theoretic approach to mine a research grants repository for large scale grants-based influence analysis, coined as GImpact, and our design and development goal is to deliver GImpact as a service with three original contributions. First, we mine a large scale grant database to identify and extract important features for grant-based scientific influence analysis and represent such features using graph theoretic models. For example, we can extract features to analyze research collaborations among different individual researchers and/or among different institutions by constructing a grant based researcher collaboration graph or an institution collaboration graph. For each of such graphs, we can again associate multiple aspect-based collaboration graphs that further characterize the grant-based collaborations through different aspects of collaboration, such as the disciplines identified by funding agencies of the research grants or the subject areas or keywords. Due to the space constraint, in this paper, we focus on the scientific influence analysis of grants on cross-institution collaboration. Thus, we construct an institution collaboration graph with institutions as vertices and joint grants between a pair of institutions as an edge. Similarly, we construct associated aspect-based collaboration graphs as additional features to enrich our influence analysis on cross-institution collaborations, such as a discipline graph and a keyword graph. The discipline graph has the disciplines as vertices and the grant based relationships between disciplines as edges with edge weighted by the total number of grants that are relevant to both disciplines. The keyword graph reflects the relationship among subject areas in the context of grants, and has the subject keywords as vertices and an edge between a pair of keywords if both keywords are covered by some grant(s), weighted by the total number of grants that cover both of the keywords. 
Second, we develop graph-theoretical algorithms to compute the collaboration relationship score between a pair of institutions based on their grant data, reflecting two types of influence: self-influence and co-influence. We compute self-influence scores for each pair of institutions in terms of their joint-grant based collaboration relationship, also taking into account the traversal reachability on the institution collaboration graph. We also compute the co-influence scores for each pair of institutions by incorporating each associated aspect-based collaboration graph. For example, if one institution is reachable from another institution through graph traversal between the institution graph and one of its associated collaboration graphs, such as the discipline graph or the keyword graph, we compute their co-influence score based on the statistical properties of all possible graph-traversal paths between the two institutions. Third, we compute the overall scientific influence scores by integrating the self-influence score and the multiple co-influence scores for each pair of institutions, and we conduct a scientific influence based clustering analysis by partitioning the institution collaboration graph into $K$ clusters, with $K$ as one of the service application interface parameters. The GImpact approach presents a general purpose scientific-influence-analysis-as-a-service framework and a suite of graph-theoretic influence computation algorithms that are capable of mining large scale grant data repositories with an easy-to-use API. We evaluate GImpact using a real grant database, consisting of 2512 institutions and their grants received over a period of 14 years.
Our experimental results show that the GImpact influence analysis approach can effectively identify the grant-based research collaboration groups and provide valuable insight and an in-depth understanding of the scientific influence of research grants on research programs, institution leadership, and future collaboration opportunities in different research subject areas.
2.1 Research Grants Dataset
The dataset for the study is obtained from the Social Sciences Management Databases of Chinese Universities (SMDB), which consists of all projects in Humanities and Social Sciences funded by the Ministry of Education, China, from 2005 to 2018. Table I shows research grant samples, and Tables II, III, and IV show institution samples, discipline samples, and keyword samples respectively. Table V shows basic statistical characteristics of the dataset.
Table I: Research grant samples

| Grant | Institutions | Disciplines | Keywords |
|---|---|---|---|
| R01 | CUFE, SHUFE | D63040, D79071 | K06, K11 |
| R02 | SHUFE, SWUFE | D63044, D79071, D84074 | K08, K12, K13 |
| R03 | BNU, FUDAN | D79071, D81030, D88031 | K03, K08 |
| R05 | FUDAN | D7907340 | K07, K05, K02, K04 |

Table II: Institution samples

| Abbreviation | Institution |
|---|---|
| RUC | Renmin University of China |
| BNU | Beijing Normal University |
| CUFE | Central University of Finance and Economics |
| CUPL | China University of Political Science and Law |
| ECNU | East China Normal University |
| SHUFE | Shanghai University of Finance and Economics |
| ECUPL | East China University of Political Science and Law |
| SWUFE | Southwestern University of Finance and Economics |
| SWUPL | Southwest University of Political Science and Law |

Table III: Discipline samples

| Code | Discipline |
|---|---|
| D870 | Library, Information and Documentation |

Table IV: Keyword samples

| Code | Keyword |
|---|---|
| K04 | International Pricing Power |

Table V: Basic statistics of the dataset

| Statistic | Value |
|---|---|
| # of Research Grants | 334068 |
| # of Institutions | 2512 |
| # of Disciplines | 1569 |
| # of Keywords | 20097 |
| # of Institutions per Grant (Min, Avg, Max) | 1, 1.74, 10 |
| # of Disciplines per Grant (Min, Avg, Max) | 1, 1.19, 5 |
| # of Keywords per Grant (Min, Avg, Max) | 1, 7.87, 19 |
The raw data consists of plain-text records in a database managed by a relational DBMS. To perform the proposed scientific influence analysis on the grant database, we need to perform feature extraction and convert the relational tables of plain-text grant records into graph representations. Each record contains a collection of attributes, such as the institutions involved, the related disciplines, and the associated keywords. We show an example fragment of the grant records in Table I. For record R01, we can learn that institutions CUFE and SHUFE have a direct grant collaboration. The related discipline areas are D63040 and D79071, and the associated keywords are K06 and K11. If we focus on analyzing the scientific influence of grants on research collaboration among institutions through subject areas captured by disciplines and keywords, then we can extract features from the grant database by modeling each grant by a selection of attributes, such as the institutions, the disciplines, and the keywords. We can then formulate this projected version of the grant database as a collection of triples, one per grant, each consisting of the institution collection, the discipline collection, and the keyword collection of that grant. If the main focus of our scientific influence analysis is on institution collaboration, then we first construct the institution graph, with institutions as vertices and joint grants between a pair of institutions as an edge weighted by the number of joint grants. For each additional attribute, we construct an aspect-based collaboration graph, such as a discipline graph or a keyword graph, to highlight the relationships among different values of the attribute and to enrich our influence analysis on cross-institution collaborations. For each specific attribute, we construct the graph by extracting the relationship features between values of the same attribute type.
Although each aspect-based collaboration graph is homogeneous in nature, the entire collection of graphs is heterogeneous: one primary attribute defines the main collaboration graph for influence analysis, and the other attributes provide additional aspects of collaboration that are important for characterizing the influence between the vertices in the main collaboration graph, i.e., institutions in our case. The discipline graph has the disciplines as vertices and the grant based relationships between disciplines as edges, with each edge weighted by the total number of grants that are relevant to both disciplines. The keyword graph reflects the relationship among subject areas in the context of grants. It has the subject keywords as vertices and an edge between a pair of keywords if both keywords are covered by some grant(s), weighted by the total number of grants that cover both keywords.
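The graph construction described above can be sketched in a few lines of Python. This is a minimal illustration rather than the GImpact implementation: the helper name `cooccurrence_graph` and the in-memory record layout are assumptions, and only records R01 to R03 of Table I are used. Vertex weights count the grants that mention a value, and edge weights count the grants that mention both endpoints.

```python
from collections import Counter
from itertools import combinations

# Grant records copied from Table I: (grant id, institutions, disciplines, keywords).
GRANTS = [
    ("R01", ["CUFE", "SHUFE"], ["D63040", "D79071"], ["K06", "K11"]),
    ("R02", ["SHUFE", "SWUFE"], ["D63044", "D79071", "D84074"], ["K08", "K12", "K13"]),
    ("R03", ["BNU", "FUDAN"], ["D79071", "D81030", "D88031"], ["K03", "K08"]),
]


def cooccurrence_graph(value_lists):
    """Build one homogeneous graph (hypothetical helper, not from the paper).

    vertex_weights[v] counts the grants that mention v.
    edge_weights[(u, v)] counts the grants that mention both u and v,
    with each key stored as a sorted pair.
    """
    vertex_weights, edge_weights = Counter(), Counter()
    for values in value_lists:
        distinct = sorted(set(values))
        vertex_weights.update(distinct)
        edge_weights.update(combinations(distinct, 2))
    return vertex_weights, edge_weights


# One graph per attribute: the primary institution graph and two aspect graphs.
inst_vertices, inst_edges = cooccurrence_graph(g[1] for g in GRANTS)
disc_vertices, disc_edges = cooccurrence_graph(g[2] for g in GRANTS)
kw_vertices, kw_edges = cooccurrence_graph(g[3] for g in GRANTS)
```

Here `inst_edges[("CUFE", "SHUFE")]` is the number of joint grants between the two institutions, and `disc_vertices["D79071"]` is the number of sample grants in that discipline.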
We would also like to note that the techniques developed in GImpact are generic, and one can choose another primary attribute instead of the institution, such as the researchers who are the PI or co-PIs of a grant, making institutions one of the aspect-based collaboration graphs. Due to the space constraint, in this paper we focus on showcasing our approach by conducting the scientific influence analysis of grants on cross-institution collaboration. We use two example attributes, disciplines and keywords, to illustrate the selection of attributes from which to extract features that represent different collaboration aspects in terms of graphs.
Definition 1 (Research Grants Network)
A research grants network is a heterogeneous information network defined by an undirected graph $G = (V, E)$, where $V$ is the set of vertices of heterogeneous types, representing attributes of research grants, such as institution, discipline, and keyword, and $E$ is the set of edges denoting the heterogeneous relationships between a pair of vertices of homogeneous types, such as institution-institution, discipline-discipline, and keyword-keyword.
Consider the research grant samples in Table I. From record R01, we can extract the homogeneous links CUFE-SHUFE and D63040-D79071 directly from the grant database, which are drawn as solid lines, and the heterogeneous links CUFE-D63040, CUFE-D79071, SHUFE-D63040, and SHUFE-D79071, which are drawn as dotted lines. Similarly, from record R02, we extract the homogeneous links SHUFE-SWUFE, D63044-D79071, D63044-D84074, and D79071-D84074 and the heterogeneous links SHUFE-D63044, SHUFE-D79071, SHUFE-D84074, SWUFE-D63044, SWUFE-D79071, and SWUFE-D84074. Figure 1 shows an illustrative example of the research grants network built from records R01 and R02. It consists of two types of attributes (vertices), institutions (black circles) and disciplines (blue squares), and three types of relationships (edges): institution-institution (solid line), discipline-discipline (dotted line), and institution-discipline (dashed line). The edges from record R01 are marked in red, the edges from record R02 are marked in green, and the edges from both records are marked in yellow.
By building a heterogeneous research grants network from multiple homogeneous networks extracted directly from the plain-text grant records, we can perform GImpact based scientific influence analysis on the multiple homogeneous graphs together. This lets us find both direct and indirect collaboration relationships among vertices of the primary collaboration graph, such as the relationships CUFE-SHUFE, SHUFE-SWUFE, and CUFE-SHUFE-SWUFE in the institution graph, in terms of not only joint grants but also the grant based scientific influence obtained by mining all the homogeneous graphs collectively. As a byproduct, we can also learn about relationships between an institution and its associated aspects of collaboration, e.g., SHUFE took part in the disciplines D63040, D63044, D79071, and D84074. We can thus learn hidden relationships that are not observable from the plain-text grant records through simple summarization techniques.
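As a small illustration of the link extraction above, the following sketch derives the homogeneous and heterogeneous links of records R01 and R02 and collects the disciplines each institution took part in. The function name `extract_links` and the dictionary layout are hypothetical choices made for this example, and keywords are omitted to match Figure 1.

```python
from collections import defaultdict
from itertools import combinations, product

# Institutions and disciplines of records R01 and R02 (Table I).
RECORDS = {
    "R01": (["CUFE", "SHUFE"], ["D63040", "D79071"]),
    "R02": (["SHUFE", "SWUFE"], ["D63044", "D79071", "D84074"]),
}


def extract_links(institutions, disciplines):
    """Return (homogeneous, heterogeneous) link sets for one grant record."""
    # Homogeneous links join two vertices of the same type.
    homogeneous = set(combinations(sorted(institutions), 2))
    homogeneous |= set(combinations(sorted(disciplines), 2))
    # Heterogeneous links join an institution with a discipline.
    heterogeneous = set(product(institutions, disciplines))
    return homogeneous, heterogeneous


links = {}
disciplines_of = defaultdict(set)
for grant_id, (insts, discs) in RECORDS.items():
    links[grant_id] = extract_links(insts, discs)
    for inst, disc in links[grant_id][1]:
        disciplines_of[inst].add(disc)
```

With these two records, `disciplines_of["SHUFE"]` contains the four disciplines listed in the text, which is the kind of byproduct relationship the network exposes.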
2.2 Related Work
The design and development of GImpact are inspired primarily by social network analysis research efforts in the last decade. Social network analysis promotes community detection [9, 10, 11] and social influence computation [12, 13, 14, 15, 16, 17, 18]. Most existing social network analysis techniques focus on a single network of homogeneous vertices with homogeneous links, such as a social network of people connected by friendships, without explicitly modeling different types of links. This constrains the social influence analysis to superficial, connection-specific relationships such as friendship. Also, most of the existing social network influence analysis is based on co-authorship using the DBLP dataset. To the best of our knowledge, no prior work has explored scientific influence analysis on large scale grant databases.
Existing research efforts on research grants data repositories are limited. One study is the first to examine the impact of governmental funding on publication counts and citation counts for research programs at the National Institute on Aging (NIA), aiming at improving the quality of funded research. Another evaluates the impact of receiving an NIH grant by comparing the subsequent publications and citations of recipients of standard R01 research grants with those of unsuccessful applicants, and finds no significant difference between the two groups in either publication counts or citation counts. We argue that scientific influence analysis on the innovation of research programs, institution leadership, and future collaboration opportunities can provide more useful indicators for grant impact evaluation than publication counts and citation counts alone.
2.3 Problem Statement
The first problem we intend to address in the development of GImpact is to develop graph-theoretic and statistical methods to compute indirect grant collaboration relationships among institutions based on the observable features captured in the grant records of the grant database.
Consider the research grant samples in Table I. From record R01, we can find that institutions CUFE and SHUFE jointly applied for a grant in the disciplines D63040 and D79071. From record R02, institutions SHUFE and SWUFE jointly applied for a grant in the disciplines D63044, D79071, and D84074. Although CUFE and SWUFE did not apply for a grant jointly, both applied for a grant with SHUFE. Furthermore, both applied for a grant in the discipline D79071. Thus, conducting grant based scientific influence analysis using only the joint grant information may lead to biased or inaccurate results. We argue that a comprehensive scientific influence analysis on a grant repository should take into account not only the direct relationships that can be obtained from the grant database records but also the many types of indirect relationships among institutions that have contributed to research initiatives and research projects in the same or related disciplines and on the same or similar topic keywords. Measuring grant based scientific influence across institutions should therefore consider both direct and indirect collaborations in the context of grant data, which makes it important to extract features that represent different collaboration aspects in addition to the grant data on institutions. For example, disciplines and keywords are important attributes that reflect the collaboration aspects of different institutions. Furthermore, the relationships between an institution and its associated disciplines in the grant discipline graph and its associated topic keywords in the grant keyword graph are highly relevant as well.
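The CUFE and SWUFE example can be checked mechanically. The sketch below, which uses hypothetical helper names (`partners`, `disciplines`) and only records R01 and R02, confirms that the two institutions have no joint grant yet are linked indirectly through a shared partner and a shared discipline.

```python
# Institutions and disciplines of records R01 and R02 (Table I).
RECORDS = [
    ({"CUFE", "SHUFE"}, {"D63040", "D79071"}),
    ({"SHUFE", "SWUFE"}, {"D63044", "D79071", "D84074"}),
]


def partners(inst):
    """Institutions that share at least one grant with `inst`."""
    found = set()
    for insts, _ in RECORDS:
        if inst in insts:
            found |= insts - {inst}
    return found


def disciplines(inst):
    """Disciplines of all grants that `inst` took part in."""
    found = set()
    for insts, discs in RECORDS:
        if inst in insts:
            found |= discs
    return found


# Direct relationship: do CUFE and SWUFE share a grant?
direct = "SWUFE" in partners("CUFE")
# Indirect relationships: a common partner and a common discipline.
shared_partners = partners("CUFE") & partners("SWUFE")
shared_disciplines = disciplines("CUFE") & disciplines("SWUFE")
```

An analysis limited to `direct` would report no relationship at all, while the two intersections expose the indirect paths that the influence scores are meant to capture.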
The second problem we propose to tackle in GImpact is to develop statistical mining algorithms to compute two types of influence measures: self-influence and co-influence. The self-influence refers to the influence score that is computed based only on the graph traversal information in a primary collaboration graph of homogeneous vertices, such as the institution graph in which vertices have edges between them if they have joint grants. The graph traversal on the joint grants based institution graph will capture the indirect relationship among institutions that have indirect grant-based collaboration relationships. The co-influence refers to the influence score that is computed based on both the graph traversal information on a primary collaboration graph and its multiple associated aspect-based collaboration graphs. The graph traversal on this collection of homogeneous graphs will also capture the indirect relationship among institutions that have indirect grant-based collaboration relationships in terms of common disciplines or common topic keywords.
In the development of GImpact, we attempt to answer two fundamental questions: (1) how to measure the overall scientific influence between any pair of institutions quantitatively; and (2) how to utilize the overall scientific influence scores to identify grant based institution clusters. To address the first question, we compute the overall scientific influence score between any pair of institutions using a weighted sum of the self-influence score and the multiple co-influence scores. The overall scientific influence score reflects not only the collaboration patterns in the institution graph through direct and indirect joint grant relationships but also the collaboration patterns through common disciplines and common keywords, as well as through indirectly related disciplines in the discipline aspect graph and indirectly related topic keywords in the keyword aspect graph. To address the second question, we develop a scientific influence distance based graph clustering algorithm to partition the set of institutions, denoted by $V_I$, into $K$ disjoint clusters $C_k$ ($1 \le k \le K$), where $\bigcup_{k=1}^{K} C_k = V_I$ and $C_k \cap C_l = \emptyset$ for $k \ne l$. The clustering result should achieve a good balance between high intra-cluster similarity, i.e., the vertices within one cluster should have close collaboration relationships and similar collaboration patterns, and low inter-cluster similarity, i.e., the vertices in different clusters should have relatively looser collaboration relationships and dissimilar collaboration patterns.
The final outcome of GImpact is the grant-based overall scientific influence of each given institution, represented as a list of the other institutions ranked by the influence score that this institution has on them. The score is based on both the direct and indirect joint grants and on the grants that are related directly or indirectly through common or similar disciplines and/or keywords.
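To make this outcome concrete, the following sketch produces such a ranked list from a matrix of pairwise influence scores. The function name `ranked_influence`, the convention that row `i` holds the scores of institution `i`, and the numeric values are all assumptions for illustration; they are not results from the paper.

```python
def ranked_influence(names, influence, source):
    """Rank every other institution by the influence score in `source`'s row."""
    i = names.index(source)
    scored = [(names[j], influence[i][j]) for j in range(len(names)) if j != i]
    # Highest influence score first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical overall influence scores for three institutions.
NAMES = ["CUFE", "SHUFE", "SWUFE"]
INFLUENCE = [
    [1.00, 0.30, 0.10],
    [0.30, 1.00, 0.25],
    [0.10, 0.25, 1.00],
]

ranking = ranked_influence(NAMES, INFLUENCE, "SHUFE")
```

With these made-up scores, the list for SHUFE places CUFE ahead of SWUFE.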
2.4 Solution Approach and Overall Framework
Given a grant database of plain-text records, the users of GImpact are asked to identify the primary collaboration attribute, say institution, and the secondary attributes that can be modeled as different grant-aspect graphs, say disciplines and keywords. We then construct the corresponding research grants network in three steps. (1) We first extract all distinct institutions and their associated aspect attributes from the grant database and represent each institution with the selection of attributes, such as grant ID, institution ID, discipline IDs, and keyword IDs. (2) We construct the institution graph, which contains homogeneous vertices of type institution and homogeneous edges between institutions that have joint grants. We use GImpact to learn self-influence collaboration patterns from direct and indirect joint-grant based collaborations. (3) We construct multiple grant-aspect graphs, each of which contains a set of homogeneous vertices of one attribute type and a set of homogeneous edges between two aspect vertices that appear in a common grant, e.g., two disciplines that are listed in at least one common grant, or two keywords that appear in at least one common grant. Each grant-based aspect graph is used by GImpact in the scientific influence analysis to learn co-influence collaboration patterns by exploring graph traversal across both the primary institution graph and the aspect graph, utilizing direct and indirect traversal paths.
Definition 2 (Grant-based Institution Graph)
A grant-based institution graph is a subgraph of the research grants network $G$, and is represented as $G_I = (V_I, E_I)$, where $V_I$ is the set of institutions, and $E_I$ is the set of edges denoting the joint grant relationship between a pair of institutions. Let $n_I$ denote the total number of institutions in $V_I$; we have $|V_I| = n_I$.
Definition 3 (Grant-based Aspect Graph)
A grant-based aspect graph, denoted as $G_A = (V_A, E_A)$, is a subgraph of the research grants network $G$, corresponding to a grant-specific aspect attribute, such as disciplines or keywords, where $V_A$ is the set of distinct aspect attribute values and $E_A$ is the set of edges denoting the direct relationship between two aspect values if they are covered by the same grant, reflecting the grant-specific aspect relevance, such as discipline relevancy in the discipline graph or keyword relevancy in the keyword graph. $n_A$ denotes the total number of grant aspect specific vertices in $V_A$, and $|V_A| = n_A$.
Definition 4 (Grant-based Influence Graph)
A grant-based influence graph is defined based on the institution graph $G_I$ and a grant-based aspect graph $G_{A_i}$, and is denoted as $IG_i = (V_I, V_{A_i}, E_{A_i}, E_{IA_i})$, where $V_I$ is the set of institutions, $V_{A_i}$ is the set of grant-based aspect vertices in the $i$-th aspect graph ($1 \le i \le m$), $E_{A_i}$ is the set of edges denoting the direct relationship between two distinct aspect attribute values, such as discipline relevancy and keyword relevancy, and $E_{IA_i}$ is the set of edges, each connecting an institution vertex and an aspect vertex, denoting the direct relationship between an institution and its aspect attribute value, weighted by the #grants this institution has with the same aspect attribute value, such as the same discipline in the discipline graph.
Figure 2 provides an example workflow to illustrate the three main tasks of the GImpact framework. Figure 2a shows an example fragment of the research grants network extracted from our grant database SMDB. It consists of three types of vertices: institutions (black circles), disciplines (blue squares), and keywords (green squares), as well as five types of direct relationships (edges) extracted directly from the plain-text records of the grant database: institution-institution (black solid line), discipline-discipline (blue dotted line), keyword-keyword (green dotted line), institution-discipline (red dashed line), and institution-keyword (green dashed line). We decouple the research grants network in Figure 2a into three sub-graphs: an institution graph in Figure 2b, and two influence graphs, namely the institution-discipline influence graph in Figure 2c and the institution-keyword influence graph in Figure 2d. Black numbers in brackets indicate the #grants of an institution, and black numbers on an institution-institution edge represent #joint-grants. Red numbers on the red dashed edge between an institution and a discipline denote the #grants that the institution applied for in that discipline. Similarly, purple numbers on the purple dashed edge between an institution and a keyword denote the #grants of that institution that cover that keyword. Blue numbers on a blue dotted edge represent the #grants covered by both disciplines. Green numbers on a green dotted edge represent the #grants that cover both keywords. For ease of presentation, we removed the edges with weight less than 10.
Task 1: Feature extraction from the plain-text grant database. A user of our GImpact service should first select the primary grant related attribute on which they want to conduct grant based scientific influence analysis, such as the institution, the lead PI, or the entire team of PI and co-PIs. Assume that we select the institution attribute in the grant record as our primary attribute for the collaboration influence study. The user then needs to select a set of secondary attributes as the collaboration aspects for the influence analysis. For example, common or similar disciplines and keywords can be indicators of research collaboration relevance. The selection of primary and secondary grant based attributes is always user-specific and task-specific. Consider the research grant samples in Table I. From record R01, we can learn that the research domain is D63040 (Enterprise Management) and D79071 (Finance) from its disciplines, and that the research content is about K06 (Tax Policy) and K11 (Venture Capital) from its topic keywords. From this selected information, one can learn that grant R01 is about tax policy and venture capital in the domain of enterprise management and finance. Such information can be leveraged to conduct research grants based influence analysis of institution CUFE or SHUFE over the grant repository. The outcome of the first task is the set of grant-based graphs, with an institution graph as the primary collaboration graph and grant-based aspect-specific influence graphs.
Task 2: Computing the overall scientific influence score. GImpact performs this task by computing the self-influence score on the primary institution collaboration graph and the multiple co-influence scores on the grant aspect-specific influence graphs. We divide this step into three sub-tasks. (1) Computing the self-influence score for every pair of vertices on the institution graph $G_I$ based on the direct and indirect joint-grant relationships. We use a homogeneous influence spread model based on heat diffusion to compute the self-influence score. The result is an $n_I \times n_I$ matrix, denoted by $W_0$, where $n_I$ is the number of institutions and each entry represents the self-influence score of a pair of institutions. (2) Computing co-influence scores on each of the grant aspect-specific influence graphs $IG_i$ (for $1 \le i \le m$). We utilize the heterogeneous influence spread model to compute the co-influence scores. The result is an $n_I \times n_I$ matrix, denoted by $W_i$, representing the co-influence that is spread between institutions via one grant-aspect specific influence graph, such as the discipline graph. If we have $m$ aspect-specific influence graphs $IG_1, \dots, IG_m$, we obtain $m$ co-influence score matrices in this task, denoted as $W_1, \dots, W_m$. (3) Finally, obtaining the overall influence score for each pair of institutions by integrating the self-influence score and the co-influence scores through a weighting scheme and recording the result in a matrix $W$, such as a weighted sum of the self-influence score with weight $\omega_0$ and the co-influence scores with weights $\omega_1, \dots, \omega_m$, i.e., $W = \omega_0 W_0 + \sum_{i=1}^{m} \omega_i W_i$.
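The integration in sub-task (3) can be sketched as a plain weighted sum of matrices. The function name `overall_influence`, the choice of weights that sum to 1, and the toy 2-by-2 matrices are assumptions for illustration; the text only states that a weighted scheme such as a weighted sum is used.

```python
def overall_influence(self_influence, co_influences, weights):
    """Weighted sum of one self-influence and several co-influence matrices.

    weights[0] scales the self-influence matrix and weights[i] scales the
    i-th co-influence matrix (i >= 1). All matrices are square lists of lists.
    """
    if len(weights) != 1 + len(co_influences):
        raise ValueError("need exactly one weight per matrix")
    matrices = [self_influence, *co_influences]
    n = len(self_influence)
    return [
        [sum(w * m[r][c] for w, m in zip(weights, matrices)) for c in range(n)]
        for r in range(n)
    ]


# Hypothetical scores for two institutions (illustrative values only).
SELF = [[1.0, 0.2], [0.2, 1.0]]
CO_DISCIPLINE = [[1.0, 0.4], [0.4, 1.0]]
CO_KEYWORD = [[1.0, 0.0], [0.0, 1.0]]

overall = overall_influence(SELF, [CO_DISCIPLINE, CO_KEYWORD], [0.5, 0.25, 0.25])
```

In this toy case the off-diagonal entry combines a weak joint-grant signal with a stronger discipline signal, so two institutions with few joint grants can still receive a noticeable overall score.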
Task 3: Grant-based influence clustering analysis. In addition to performing influence analysis using the self-influence matrix $W_0$, the co-influence matrices $W_1, \dots, W_m$, and the overall influence matrix $W$, we want to utilize the overall influence scores as an influence based distance function to perform graph clustering analysis. For example, building on the K-Medoids clustering method, we design our influence based clustering algorithm in GImpact to partition all institutions into $K$ grant-based collaboration clusters. Unlike the conventional K-Medoids clustering method, we refine the centroid-based initialization function. When assigning points to a cluster, we consider both the influence score to the centroid and the influence score to all points in the cluster. Also, we select the new centroid by maximizing the intra-cluster influence similarity and minimizing the inter-cluster influence similarity.
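For orientation, here is a sketch of the conventional K-Medoids loop driven by an influence (similarity) matrix, where higher scores mean closer vertices. It is the plain baseline, not the refined GImpact variant described above: the refined initialization, the cluster-wide assignment rule, and the inter-cluster term in the centroid update are deliberately left out. The function names and the toy matrix are assumptions.

```python
def assign(similarity, medoids):
    """Label each vertex with the index of its most similar medoid."""
    return [
        max(range(len(medoids)), key=lambda k: similarity[v][medoids[k]])
        for v in range(len(similarity))
    ]


def k_medoids(similarity, medoids, max_iter=100):
    """Plain K-Medoids on a similarity matrix; returns (medoids, labels)."""
    medoids = list(medoids)
    for _ in range(max_iter):
        labels = assign(similarity, medoids)
        updated = []
        for k, old in enumerate(medoids):
            members = [v for v, lab in enumerate(labels) if lab == k]
            if not members:
                updated.append(old)  # keep the medoid of an empty cluster
                continue
            # New medoid: the member with the largest total similarity
            # to all members of its cluster.
            updated.append(
                max(members, key=lambda c: sum(similarity[c][v] for v in members))
            )
        if updated == medoids:
            break
        medoids = updated
    return medoids, assign(similarity, medoids)


# Hypothetical influence scores: vertices 0-1 and 2-3 form two tight groups.
SIMILARITY = [
    [1.0, 0.8, 0.1, 0.1],
    [0.8, 1.0, 0.1, 0.2],
    [0.1, 0.1, 1.0, 0.7],
    [0.1, 0.2, 0.7, 1.0],
]

final_medoids, labels = k_medoids(SIMILARITY, medoids=[0, 3])
```

The baseline is sensitive to the starting medoids; for instance, starting this toy example from vertices 0 and 1 leaves it in a poor local optimum, which is one reason a refined initialization is worthwhile.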
3 Scientific Influence Analysis
In this section, we discuss how to compute the overall scientific influence score for each pair of institutions. We first introduce the general influence spread model, based on heat diffusion, which performs the influence spread process through graph traversal in an undirected graph. Then we utilize the homogeneous influence spread model to compute the self-influence score on the primary institution collaboration graph $G_I$ and the heterogeneous influence spread model to compute multiple co-influence scores on the grant aspect-specific influence graphs $IG_i$. Finally, we integrate the self-influence score and the co-influence scores into the overall scientific influence score with weights $\omega_0, \omega_1, \dots, \omega_m$.
3.1 General Influence Spread Model
Inspired by heat diffusion, we develop our general influence spread model to perform the influence spread process through graph traversal. Heat diffusion is a physical phenomenon in which heat transfers from a hot object to a cold object. This phenomenon closely resembles influence spread. For example, an extraordinary institution can be considered as the hot object, which transfers heat (spreads influence) to a moderate institution considered as the cold object. Once an institution applies for a grant with other institutions, it spreads influence on those institutions. Likewise, if an institution applies for a grant in a discipline, it spreads influence on this discipline. The more institutions apply for grants in this discipline, the more likely other institutions are to apply for grants in it. Thus this institution influences other institutions through this discipline.
Consider a undirected graph , where is the vertex set and is the edge set. We use to represent the size of , i.e. . The vertex can be considered as a object in the thermal system and the edge can be considered as a heat transfer tunnel (influence path) between and . Suppose at time , the amount of heat that vertex received from during a period of should be proportional to the temperature of the vertex at time
and the probabilityof the vertex receive heat from . Based on the above assumptions, we can define . As a result, the temperature change at vertex between time and time is defined by the heat it receives subtract the heat it sends. This is formulated as
where is the heat conductivity or the influence spread coefficient. For ease of representation, we can express the above formulation into a matrix form:
where is a matrix, called the one-hop heat diffusion kernel or the one-hop influence spread kernel, as the heat diffusion only considers one-hop diffusion in the whole process. The value when indicates the heat that vertex receives from its neighbor , while the value when indicates the heat that vertex sends to its all neighbors. The heat received should equal to the heat sent, therefore the sum of each row of the matrix should be 0.
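In the limit $\Delta t \to 0$, this becomes

$$\frac{d f(t)}{d t} = \alpha H f(t)$$

By solving this differential equation, we get

$$f(t) = e^{\alpha t H} f(0)$$

where $f(0)$ is the initial heat distribution, or the initial influence distribution. The matrix $e^{\alpha t H}$ is an $n \times n$ matrix, called the multi-hop heat diffusion kernel or the multi-hop influence spread kernel, as it accounts for diffusion over infinitely many steps from the initial heat distribution. It can be expanded as a Taylor series, where $I$ is the identity matrix:

$$e^{\alpha t H} = I + \alpha t H + \frac{\alpha^2 t^2}{2!} H^2 + \frac{\alpha^3 t^3}{3!} H^3 + \cdots$$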
The multi-hop influence spread kernel $e^{\alpha t H}$ captures both direct and indirect influence paths between any two vertices in the undirected graph $G$. The influence spread coefficient $\alpha$ is a user-specified parameter that represents the speed of the influence spread process.
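The Taylor expansion above can be evaluated numerically by truncating the series. The following is a minimal sketch in pure Python, not the authors' implementation: the function names, the toy probability matrix `P`, and the truncation length `terms` are all assumptions made for illustration. The diagonal of the one-hop kernel is set to the negative row sum, which enforces the zero-row-sum property stated in the text.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def one_hop_kernel(p):
    """Copy the off-diagonal probabilities and set each diagonal entry
    so that its row sums to zero (assumed construction)."""
    n = len(p)
    h = [[p[i][j] if i != j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        h[i][i] = -sum(h[i])
    return h


def multi_hop_kernel(h, alpha=1.0, t=1.0, terms=30):
    """Approximate exp(alpha * t * H) with the first `terms` Taylor terms."""
    n = len(h)
    scaled = [[alpha * t * x for x in row] for row in h]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]  # current series term, starts at I
    for k in range(1, terms):
        term = mat_mul(term, scaled)
        term = [[x / k for x in row] for row in term]  # (alpha t H)^k / k!
        for i in range(n):
            for j in range(n):
                result[i][j] += term[i][j]
    return result


# Toy path graph A - B - C with made-up symmetric probabilities.
P = [[0.0, 0.5, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 0.5, 0.0]]
H = one_hop_kernel(P)
K = multi_hop_kernel(H, alpha=1.0, t=1.0)
```

In this toy graph A and C share no edge, yet `K[0][2]` is positive, which illustrates how the multi-hop kernel picks up indirect influence paths.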
3.2 Homogeneous Influence Spread Model
The institution graph is a homogeneous graph that contains only homogeneous edges between institutions. Based on the general influence spread model above, we only need to consider the homogeneous edges in the graph, i.e., the one-hop kernel $H$ is a square matrix whose rows and columns are both indexed by institutions. The probability $p_{ij}$ that institution $v_i$ receives influence from institution $v_j$ can be defined as
where $w_i$ (or $w_j$) is the weight of vertex $v_i$ (or $v_j$), e.g., the number of grants of institution $v_i$ (or $v_j$), and $w_{ij}$ is the weight of edge $(v_i, v_j)$, e.g., the number of joint grants between institution $v_i$ and institution $v_j$.
By defining the probability $p_{ij}$, we can compute the one-hop influence spread kernel $H$ according to Eq.(3). Then, given specific values of $\alpha$ and $t$, we can obtain the multi-hop influence spread kernel $e^{\alpha t H}$ according to Eq.(7). In the homogeneous influence spread model, the multi-hop influence spread kernel is a matrix that captures both direct and indirect collaboration relationships, i.e., self-influence collaboration patterns. Thus the self-influence score can be defined as this kernel, $e^{\alpha t H}$.
Figure 2g shows an example of the self-influence score based on the institution graph of Figure 2b, where both $\alpha$ and $t$ are set to 1. The black number on each black edge represents the self-influence score between the two institutions. Although the weights of edge CUPL-CUFE and edge CUPL-SWUPL are both 276 in Figure 2b, their self-influence scores in Figure 2g are 0.024 and 0.021, respectively. This shows that the self-influence score captures not only direct collaboration relationships but also indirect ones.
3.3 Heterogeneous Influence Spread Model
The influence graph is a heterogeneous graph that contains not only homogeneous edges between aspect attributes but also heterogeneous edges between institutions and aspect attributes. Based on the general influence spread model above, we need to redefine the one-hop influence spread kernel for the heterogeneous graph to capture both homogeneous and heterogeneous edges. Thus we divide the one-hop influence spread kernel into four parts:
where the first part is a matrix with a definition similar to Eq.(3), representing the grant-based relationship between aspect attributes, defined by Eq.(12); the second part is a matrix representing the scientific influence that aspect attributes receive from institutions, defined by Eq.(11); the third part is a matrix representing the scientific influence that institutions receive from aspect attributes, defined by Eq.(10); and the fourth part is a diagonal matrix that keeps the sum of each row of the kernel at 0.
where is the weight on influence path . is defined by normalized by the sum of weights on for any in .
where is the weight on influence path . is defined by normalized by the sum of weights on for any in .
where each off-diagonal entry is the similarity between two aspect attributes, and each diagonal entry summarizes the influence that the aspect attribute sends to other aspect attributes and institutions.
For the diagonal matrix, each diagonal entry summarizes the influence that the institution sends to aspect attributes.
By defining the one-hop influence spread kernel for the heterogeneous network and choosing specific values of $\alpha$ and $t$, we can use Eq.(7) to obtain the multi-hop influence spread kernel. In the heterogeneous influence spread model, the multi-hop influence spread kernel is a matrix that captures the relationships between institutions and their associated aspect attributes. The corresponding part of the multi-hop influence spread kernel captures the scientific influence that aspect attributes receive from institutions, i.e., co-influence collaboration patterns.
Figure 2e and Figure 2f show examples of co-influence collaboration patterns based on the discipline influence graph of Figure 2c and the keyword influence graph of Figure 2d, respectively, where both $\alpha$ and $t$ are set to 1. The red numbers on red dashed lines measure the scientific influence that disciplines receive from institutions. The purple numbers on purple dashed lines measure the scientific influence that keywords receive from institutions. For ease of presentation, we omit the edges with weights below 0.001.
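The four-part kernel described in this subsection can be assembled as a block matrix. The sketch below is only an illustration under stated assumptions: aspect attributes are ordered before institutions, the helper name `assemble_kernel` and the toy block values are invented here, and the diagonal is filled with the negative row sum rather than with the paper's Eq.(10) to Eq.(12), which did not survive extraction.

```python
def assemble_kernel(a, b, c):
    """Stack the blocks as [[A, B], [C, D]].

    a: m x m aspect-to-aspect block (its diagonal is ignored)
    b: m x n block, influence that aspect attributes receive from institutions
    c: n x m block, influence that institutions receive from aspect attributes
    The whole diagonal, including the institution block D, is then
    overwritten so that every row of the result sums to zero.
    """
    m, n = len(a), len(c)
    size = m + n
    h = [[0.0] * size for _ in range(size)]
    for i in range(m):
        for j in range(m):
            if i != j:
                h[i][j] = a[i][j]
        for j in range(n):
            h[i][m + j] = b[i][j]
    for i in range(n):
        for j in range(m):
            h[m + i][j] = c[i][j]
    for i in range(size):
        h[i][i] = -sum(h[i])
    return h


# Two aspect attributes and two institutions with made-up weights.
A = [[0.0, 0.2], [0.2, 0.0]]
B = [[0.6, 0.1], [0.0, 0.7]]
C = [[0.9, 0.0], [0.3, 0.7]]
H_het = assemble_kernel(A, B, C)
```

Exponentiating such a kernel and reading off the aspect-by-institution block would then give the co-influence collaboration patterns described above.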
3.4 Co-Influence Score
The part of the multi-hop influence spread kernel that relates aspect attributes to institutions represents the co-influence collaboration patterns in the associated influence graph. The co-influence collaboration pattern of an institution is a vector and can be formulated as
The co-influence score is the similarity score of the co-influence collaboration patterns of a pair of institutions. This similarity should consider not only the angle between the two vectors but also the difference in their lengths. A prolific institution has a longer co-influence collaboration vector because it has more grants, while an ordinary institution has a shorter one because it has fewer grants. A prolific institution and an ordinary institution may have little research collaboration even if the angle, i.e., the proportion of participation, is similar. With this in mind, we define the co-influence score between two institutions as
Figure 2h and Figure 2i show examples of the co-influence score based on the discipline collaboration patterns of Figure 2e and the keyword collaboration patterns of Figure 2f, respectively. The black numbers on black edges represent the co-influence score between institutions.
From Figure 2h and Figure 2i, we can observe that the co-influence scores of discipline and keyword are similar but not identical. This means that the discipline and keyword aspect attributes complement each other in capturing research collaboration relevance. Meanwhile, compared with Figure 2g, the self-influence score and the co-influence score are quite different, which means that the co-influence score carries different information from the self-influence score. Combining them into an overall scientific influence score is therefore worthwhile.
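To make the "angle plus length" requirement concrete, here is one possible similarity that penalizes both. It is an assumed stand-in, not the paper's formula (which is missing from this extraction): it multiplies the cosine of the two pattern vectors by the ratio of the shorter length to the longer one. The function name is hypothetical.

```python
import math


def co_influence_score(x, y):
    """Cosine similarity scaled by the length ratio of the two vectors.

    Returns 0.0 if either vector is all zeros.
    """
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    cosine = sum(a * b for a, b in zip(x, y)) / (norm_x * norm_y)
    length_ratio = min(norm_x, norm_y) / max(norm_x, norm_y)
    return cosine * length_ratio
```

With this choice, two institutions with the same participation proportions but very different grant volumes score lower than two institutions of comparable size, which matches the intuition in the paragraph on prolific and ordinary institutions.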
3.5 Overall Scientific Influence Measure
The overall scientific influence score integrates the self-influence score and multiple co-influence scores. It thus considers not only direct and indirect collaboration relationships between institutions, i.e., self-influence collaboration patterns, but also relationships between institutions and aspect attributes, i.e., co-influence collaboration patterns.
By fixing the value of $t$, the self-influence score can be rewritten as a function of $\alpha$ alone, where $\alpha$ is the influence spread coefficient, which can be used as the weight of the self-influence score. The overall scientific influence score is defined as
where is the number of influence graphs, is the co-influence score, is the weight for the co-influence score, and The scalar form can be written as
Figure 2j shows an example of overall scientific influence scores based on the self-influence score of Figure 2g, the discipline co-influence score of Figure 2h, and the keyword co-influence score of Figure 2i, where the influence spread coefficient and the two co-influence weights are all set to 1. The black numbers on black edges represent the overall scientific influence between institutions.
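As a rough illustration of the integration step, the sketch below adds weighted co-influence matrices to a self-influence matrix entry by entry. It assumes the overall score is a plain weighted sum, that the self-influence matrix was already computed for the chosen influence spread coefficient, and that any constraint on the weights is handled by the caller. The function name and the toy numbers are invented.

```python
def overall_score(self_score, co_scores, weights):
    """Return self_score + sum_k weights[k] * co_scores[k], entry by entry."""
    n = len(self_score)
    total = [row[:] for row in self_score]
    for weight, co in zip(weights, co_scores):
        for i in range(n):
            for j in range(n):
                total[i][j] += weight * co[i][j]
    return total


# Toy 2 x 2 example with two influence graphs, both weighted 1.
self_s = [[0.0, 0.1], [0.1, 0.0]]
co_discipline = [[1.0, 0.5], [0.5, 1.0]]
co_keyword = [[1.0, 0.25], [0.25, 1.0]]
overall = overall_score(self_s, [co_discipline, co_keyword], [1.0, 1.0])
```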
4 Graph Clustering Analysis
In this section, we introduce our grant-based influence clustering algorithm, GImpact, which partitions all institutions into grant-based collaboration clusters by using the overall scientific influence score as an influence-based distance function. GImpact follows the conventional K-Medoids clustering algorithm and incorporates new techniques for centroid initialization, vertex assignment, and centroid update. The distance metric is essential to the conventional K-Medoids clustering algorithm. A simple idea is to use the adjacency matrix: for a weighted undirected graph, the distance between two vertices is the sum of the weights of the edges on the shortest path connecting them. We argue that the overall scientific influence score is a more useful distance metric than the simple adjacency matrix.
4.1 Purpose and Objective
The purpose of the clustering analysis is to partition the institution set into K disjoint clusters, whose union is the whole institution set and whose pairwise intersections are empty, in order to find close collaboration clusters. The objective of the clustering analysis is to maximize the intra-cluster influence score and minimize the inter-cluster influence score, so as to balance the following two characteristics: (1) vertices within one cluster should have close collaboration relationships and similar collaboration patterns; (2) vertices in different clusters should have relatively looser collaboration relationships and dissimilar collaboration patterns.
Definition 5 (Intra-Cluster Influence Score)
The intra-cluster influence score is the average influence score of the vertices in a cluster to the centroid of that cluster. Given a group of disjoint clusters and the centroid of each cluster, the intra-cluster influence score of a cluster is defined as below:
Definition 6 (Inter-Cluster Influence Score)
The inter-cluster influence score is the average influence score of the vertices in one cluster to the centroid of another cluster. Given a group of disjoint clusters and the centroid of each cluster, the inter-cluster influence score between two clusters is defined as below:
Without loss of generality, a good centroid can represent its cluster. Moreover, considering only the influence scores between the centroid and the vertices in the cluster, rather than between all pairs of vertices, reduces the time complexity from quadratic to linear in the cluster size. This is essential for a large-scale graph.
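Definitions 5 and 6 can be sketched directly from their wording. In the snippet below the function names and the toy score matrix are assumptions, and so is the choice to include the centroid itself when averaging over a cluster.

```python
def intra_cluster_score(score, cluster, centroid):
    """Average influence score from the vertices of a cluster to its centroid."""
    return sum(score[v][centroid] for v in cluster) / len(cluster)


def inter_cluster_score(score, cluster, other_centroid):
    """Average influence score from the vertices of one cluster
    to the centroid of another cluster."""
    return sum(score[v][other_centroid] for v in cluster) / len(cluster)


# Toy symmetric influence scores for four institutions.
S = [[1.0, 0.8, 0.1, 0.2],
     [0.8, 1.0, 0.3, 0.1],
     [0.1, 0.3, 1.0, 0.6],
     [0.2, 0.1, 0.6, 1.0]]
intra = intra_cluster_score(S, [0, 1], centroid=0)
inter = inter_cluster_score(S, [0, 1], other_centroid=2)
```

Both functions touch only one column of the score matrix per cluster, which is where the saving over all-pairs averaging comes from.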
4.2 Centroid Initialization
Centroid initialization is the first step of the K-Medoids clustering algorithm. Its purpose is to obtain a collection of initial centroids. Good initial centroids usually lead to good clustering results, while bad initial centroids do the opposite. There are many centroid initialization schemes, such as DENCLUE and K-Means++. The original K-Medoids clustering algorithm uses random initialization. The idea of DENCLUE is to choose the local maxima of the density function as centroids. The idea of K-Means++ is to choose the first centroid at random and to choose each remaining centroid with probability proportional to its squared distance from the closest existing center. We test different centroid initialization schemes and discuss their respective features.
4.2.1 Random Initialization
The centroids are selected completely at random in this scheme. The random result serves as the baseline for all other schemes.
4.2.2 Top-K Degree Initialization
The degree of a vertex is the number of edges incident to it. A vertex with a larger degree has a local maximum of the number of neighbors and can diffuse heat to as many vertices as possible, which makes it a better starting point than a completely random selection in the data space. We sort all vertices in descending order of degree and select the top-K vertices as the initial centroids.
4.2.3 Top-K Density Initialization
The density function of a vertex is the sum of all influence scores related to the vertex.
The larger the density value of a vertex, the faster the vertex can diffuse and receive influence. Compared with the top-K degree scheme, this scheme also considers the weights of the edges. We therefore sort the vertices in descending order of density value and select the top-K vertices as the initial centroids.
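The two top-K schemes differ only in the quantity being ranked. The sketch below contrasts them on a toy weighted graph; the helper names, the exclusion of the diagonal, and the tie-breaking by vertex index are assumptions made here.

```python
def degrees(w):
    """Number of non-zero off-diagonal entries per vertex."""
    return [sum(1 for j, x in enumerate(row) if j != i and x > 0)
            for i, row in enumerate(w)]


def densities(w):
    """Sum of off-diagonal scores per vertex."""
    return [sum(x for j, x in enumerate(row) if j != i)
            for i, row in enumerate(w)]


def top_k(values, k):
    """Indices of the k largest values, ties broken by smaller index."""
    return sorted(range(len(values)), key=lambda i: (-values[i], i))[:k]


# Vertex 0 has many weak edges; vertices 1 and 4 share one strong edge.
W = [[0, 1, 1, 1, 0],
     [1, 0, 0, 0, 9],
     [1, 0, 0, 0, 0],
     [1, 0, 0, 0, 0],
     [0, 9, 0, 0, 0]]
by_degree = top_k(degrees(W), 2)
by_density = top_k(densities(W), 2)
```

Here the degree ranking picks vertices 0 and 1, while the density ranking picks vertices 1 and 4, because density also takes the edge weights into account.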
4.2.4 Mixed Initialization
Inspired by K-Medoids++, we develop a mixed centroid initialization scheme. A good centroid should be far from all other centroids, but it should not be an outlier either. Our scheme chooses the vertex with the maximum density value, calculated by Eq.(20), as the first centroid. We then choose the remaining centroids by considering both the average influence score and the maximum influence score with respect to the existing centroids. A new centroid with the smallest average influence score is as far as possible from the existing centroids overall. However, if its distances to the existing centroids are unevenly distributed, for example, very close to one centroid and very far from the others, the average influence score may still be quite small. A new centroid with the smallest maximum influence score is as far as possible from its nearest existing centroid. However, it may select an outlier in practice. Based on the above considerations, we define the mix score of a vertex as:
where the average and the maximum are taken over the existing centroids in the centroid set. Compared with the schemes above, this scheme starts from a good vertex and keeps the centroids separated as much as possible. The other schemes do not take this into account, so the centroids they select may belong to the same actual cluster. We thus select each new centroid as the vertex with the smallest mix score at the beginning of the clustering algorithm.
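The mixed scheme can be sketched as a greedy loop. Since the exact mix score is not recoverable from this text, the snippet assumes a convex combination of the average and the maximum influence score to the existing centroids, with a hypothetical balance parameter `lam`; the function name and toy matrix are likewise invented.

```python
def mixed_init(score, k, lam=0.5):
    """Pick k centroids: the densest vertex first, then repeatedly the
    vertex with the smallest assumed mix score
    lam * average + (1 - lam) * maximum of its scores to chosen centroids."""
    n = len(score)
    density = [sum(score[i][j] for j in range(n) if j != i) for i in range(n)]
    centroids = [max(range(n), key=lambda i: density[i])]
    while len(centroids) < k:
        best, best_mix = None, None
        for v in range(n):
            if v in centroids:
                continue
            to_centroids = [score[v][c] for c in centroids]
            mix = (lam * sum(to_centroids) / len(to_centroids)
                   + (1 - lam) * max(to_centroids))
            if best is None or mix < best_mix:
                best, best_mix = v, mix
        centroids.append(best)
    return centroids


# Two tight groups {0, 1, 2} and {3, 4, 5} with weak links between them.
S6 = [[1.0, 1.0, 1.0, 0.1, 0.1, 0.1],
      [1.0, 1.0, 0.9, 0.1, 0.1, 0.1],
      [1.0, 0.9, 1.0, 0.1, 0.1, 0.1],
      [0.1, 0.1, 0.1, 1.0, 0.9, 0.9],
      [0.1, 0.1, 0.1, 0.9, 1.0, 0.9],
      [0.1, 0.1, 0.1, 0.9, 0.9, 1.0]]
initial = mixed_init(S6, 2)
```

On this toy input the first centroid is the densest vertex of the first group and the second centroid lands in the other group, which is the separation property the scheme aims for.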
4.3 Vertex Assignment
After centroids have been chosen in the iteration, we need to assign all the remaining vertices to centroids (or clusters). We test two different schemes for vertex assignment:
4.3.1 Closest Centroid
The simplest idea for vertex assignment is to assign each vertex to its closest centroid. If we have good centroids all the time, this scheme works well. However, if the centroids selected in an iteration are poor, for example, a centroid at the edge of a cluster or two centroids in the same cluster, the clustering result will deteriorate.
4.3.2 Dynamic Assignment
To address the problem of the closest centroid scheme, we develop a dynamic vertex assignment scheme. When looking for the most appropriate cluster for a vertex, we consider not only its distance to the centroid but also its distance to all vertices already assigned to the cluster. When the assignment process has just begun, it gives the same result as the closest centroid scheme. As the assignment process proceeds, the result diverges and improves. Even if a poor centroid was selected in the last iteration, for example, one at the edge of a cluster, we can still assign the right vertices to it, because the center of the gradually growing cluster progressively approaches the actual center of the data space. Since the order in which vertices are assigned affects the outcome, the vertex order is randomly shuffled in each iteration. Based on the above considerations, we choose the centroid for each vertex as follows:
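One way to realize the dynamic assignment idea is sketched below. It assumes that a vertex joins the cluster with the highest average influence score over the centroid and all members assigned so far; this rule, the function name, and the `seed` parameter are illustrative assumptions, not the paper's formula.

```python
import random


def dynamic_assign(score, centroids, seed=0):
    """Assign every non-centroid vertex, in shuffled order, to the cluster
    whose current members (centroid included) it scores highest with on average."""
    rng = random.Random(seed)
    clusters = {c: [c] for c in centroids}
    rest = [v for v in range(len(score)) if v not in clusters]
    rng.shuffle(rest)
    for v in rest:
        def affinity(c):
            members = clusters[c]
            return sum(score[v][u] for u in members) / len(members)
        best = max(centroids, key=affinity)
        clusters[best].append(v)
    return clusters


# Two tight groups {0, 1, 2} and {3, 4, 5} with weak links between them.
S6 = [[1.0, 1.0, 1.0, 0.1, 0.1, 0.1],
      [1.0, 1.0, 0.9, 0.1, 0.1, 0.1],
      [1.0, 0.9, 1.0, 0.1, 0.1, 0.1],
      [0.1, 0.1, 0.1, 1.0, 0.9, 0.9],
      [0.1, 0.1, 0.1, 0.9, 1.0, 0.9],
      [0.1, 0.1, 0.1, 0.9, 0.9, 1.0]]
assigned = dynamic_assign(S6, [0, 3])
```

While a cluster holds only its centroid, the rule reduces to the closest centroid scheme, as the text notes for the start of the assignment process.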
4.4 Centroid Update
After assigning all remaining vertices to centroids (or clusters), we need to select a new centroid for each resulting cluster. Centroid update is essential to the K-Medoids clustering algorithm. A good centroid update scheme should steer the whole clustering process in a good direction. We test two different schemes for centroid update:
4.4.1 Most Central
The simplest idea for centroid update is to choose the most central vertex as the new centroid. Before discussing the concept of most central, we first define the influence score vector of a vertex in a cluster, which is a vector whose length is the size of the cluster. Each element of the influence score vector is the influence score between the vertex and another vertex of the cluster:
The average influence score vector of a cluster is a vector of the same length. Each element of the average influence score vector is the average influence score between the vertices of the cluster and one vertex of the cluster:
The new centroid is the vertex whose influence score vector is closest to the average influence score vector. This can be formulated as
This scheme relies heavily on a good assignment of vertices. A bad assignment leads to a bad centroid, and a bad centroid in turn leads to a bad assignment, so the clustering result becomes progressively worse.
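The most central update can be sketched as follows. The snippet assumes that influence score vectors are restricted to the members of the cluster and that "closest" means smallest squared Euclidean distance; both choices and the function name are assumptions.

```python
def most_central(score, cluster):
    """Return the member whose influence score vector is closest
    (squared Euclidean distance) to the cluster's average vector."""
    vectors = {v: [score[v][u] for u in cluster] for v in cluster}
    size = len(cluster)
    average = [sum(vectors[v][k] for v in cluster) / size for k in range(size)]

    def distance(v):
        return sum((x - y) ** 2 for x, y in zip(vectors[v], average))

    return min(cluster, key=distance)


# Vertex 1 sits between vertices 0 and 2.
S3 = [[1.0, 0.8, 0.2],
      [0.8, 1.0, 0.8],
      [0.2, 0.8, 1.0]]
new_centroid = most_central(S3, [0, 1, 2])
```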
4.4.2 Max Objective
A good centroid update scheme should steer the clustering result toward maximizing the objective function. We therefore choose as the new centroid the vertex of the cluster that maximizes the objective function. The objective function of a vertex is defined as:
The numerator of Eq.(25) represents the distance from the cluster to the new centroid, i.e., the intra-cluster influence score. The denominator of Eq.(25) captures the distance from the other clusters to the new centroid, i.e., the inter-cluster influence score. Maximizing the objective function means maximizing the numerator and minimizing the denominator. The new centroid thus achieves a good balance between a larger intra-cluster influence score and a smaller inter-cluster influence score.
The new centroid is the vertex of the cluster whose objective function reaches its maximum. This can be formulated as
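A minimal sketch of the max objective update is given below. It assumes the objective is the ratio of the intra-cluster average to the average over all vertices of the other clusters pooled together; how the paper aggregates over several other clusters is not recoverable here, so that pooling, the function name, and the toy data are assumptions.

```python
def max_objective(score, clusters, target):
    """Return the member of clusters[target] that maximizes
    (average score from own cluster) / (average score from other clusters)."""
    members = clusters[target]
    others = [v for cid, vs in clusters.items() if cid != target for v in vs]

    def objective(c):
        intra = sum(score[v][c] for v in members) / len(members)
        if not others:
            return intra
        inter = sum(score[v][c] for v in others) / len(others)
        return float("inf") if inter == 0 else intra / inter

    return max(members, key=objective)


# Vertex 0 leans toward the other cluster; vertex 1 is central in its own.
S5 = [[1.0, 0.8, 0.2, 0.3, 0.3],
      [0.8, 1.0, 0.8, 0.1, 0.1],
      [0.2, 0.8, 1.0, 0.1, 0.1],
      [0.3, 0.1, 0.1, 1.0, 0.9],
      [0.3, 0.1, 0.1, 0.9, 1.0]]
groups = {"a": [0, 1, 2], "b": [3, 4]}
updated = max_objective(S5, groups, "a")
```

Vertex 1 wins here because it combines the highest intra-cluster average with a low average score from the other cluster.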
4.5 Algorithm Summarization
Given an institution graph and its associated influence graphs, Table VI shows the steps for partitioning the institution graph into close collaboration clusters.
Our research grants dataset has a special feature that we can take advantage of: the discipline attribute has a hierarchical structure. This structure reveals the relationships between disciplines and serves as supervisory information that can be used to supplement connections and to aggregate connections. In this section, we show how we use this feature to optimize the influence analysis.
5.1 The Hierarchical Structure
The discipline attribute has a three-layer structure consisting of first-level, second-level, and third-level disciplines. A first-level discipline contains multiple second-level disciplines, and a second-level discipline contains multiple third-level disciplines. Figure 3 provides an illustrative fragment of the hierarchical structure under the first-level discipline D820/Law.
Since interdisciplinary grants are uncommon in our dataset, connections between disciplines are rare. The hierarchical structure provides relationships between disciplines that can supplement these connections. For example, D8204010 (International Public Law) and D8204020 (International Private Law), which are under the same second-level discipline, are closely related. Besides, we can treat the hierarchy as a handcrafted classification. We can thus classify all 1569 disciplines into 63 first-level discipline categories and aggregate the connections of disciplines under the same first-level discipline. For each keyword, we choose the discipline with the most associated grants as its feature discipline, which uniquely represents the keyword. Following the discipline classification, we can classify all 20097 keywords into the 63 first-level discipline categories as well.
5.2 Supplement Connections
Suppose that disciplines under the same parent discipline have high relevance. We define one coefficient for disciplines under the same second-level discipline and another coefficient for disciplines under the same first-level discipline. The former should be larger than the latter, because disciplines under the same second-level discipline should be more relevant to each other than disciplines that only share a first-level discipline. The supplement weight is based on the average weight of all edges in the discipline aspect graph. Thus the weight after supplement can be defined as
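As an illustration of supplementing connections, the sketch below adds a share of the average edge weight to every pair of disciplines that share a parent. Several things are assumed: the additive form of the supplement, the coefficient names and values `c_second` and `c_first`, and the reading of codes such as D8204010 as a four-character first-level prefix and a six-character second-level prefix. The codes D8201010 and D7901010 are made up for the example.

```python
from itertools import combinations


def supplement_weights(codes, weights, c_second=0.5, c_first=0.1):
    """Return edge weights after supplementing pairs that share a parent.

    weights maps a sorted pair of discipline codes to an edge weight.
    """
    average = sum(weights.values()) / len(weights)
    result = {}
    for a, b in combinations(sorted(codes), 2):
        w = weights.get((a, b), 0.0)
        if a[:6] == b[:6]:      # same second-level discipline (assumed prefix)
            w += c_second * average
        elif a[:4] == b[:4]:    # same first-level discipline (assumed prefix)
            w += c_first * average
        if w > 0:
            result[(a, b)] = w
    return result


codes = ["D8204010", "D8204020", "D8201010", "D7901010"]
observed = {("D7901010", "D8204010"): 2.0}
supplemented = supplement_weights(codes, observed)
```

The two International Law disciplines gain an edge even though no interdisciplinary grant connects them, while unrelated disciplines stay unconnected.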
5.3 Aggregate Connections
The hierarchical structure can be seen as a handcrafted classification. We can apply this classification to the influence graph to aggregate the connections between institutions and aspect attributes. Such aggregation ignores the differences between particular research directions under the same research area, which makes the co-influence collaboration patterns more visible and improves computing performance on a large-scale graph.
To apply the classification to the influence graph, we partition the influence graph into disjoint parts. We then construct an auxiliary matrix for aggregation. The rows of the auxiliary matrix represent the disjoint parts, and the columns represent institutions and aspect attributes. The auxiliary matrix is defined by
where each entry is the probability that an aspect attribute belongs to a part. Multiplying the auxiliary matrix by the multi-hop influence spread kernel then gives the aggregated influence spread kernel. Thus the formulation of the co-influence collaboration pattern of an institution is updated to
Figure (a) and Figure (b) show examples of the aggregated multi-hop influence spread kernels of Figure 2e and Figure 2f, respectively. The red numbers on red dashed lines measure the scientific influence that discipline categories receive from institutions. As shown in the figures, CUPL, SWUPL, and ECUPL have a very high influence on the D820 (Law) research area, while CUFE, SWUFE, and SHUFE have a very high influence on the D790 (Economics) research area. Compared with the result before aggregation, ignoring the differences under the same first-level discipline makes the similarity of institutions in the same research area much higher, which plays a critical role in the graph clustering analysis.
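The aggregation step amounts to a matrix product. The sketch below covers only the aspect attribute columns of the auxiliary matrix and applies it to an attribute-by-institution block of the kernel; the function name, the hard 0/1 memberships, and the toy values are assumptions.

```python
def aggregate(aux, block):
    """Multiply a parts x attributes membership matrix by an
    attributes x institutions influence block."""
    attributes = len(block)
    institutions = len(block[0])
    return [[sum(aux[p][a] * block[a][i] for a in range(attributes))
             for i in range(institutions)]
            for p in range(len(aux))]


# Three disciplines grouped into two first-level categories.
membership = [[1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]]
influence = [[0.2, 0.0],
             [0.3, 0.1],
             [0.0, 0.6]]
aggregated = aggregate(membership, influence)
```

The first institution's influence on the two disciplines of the first category is merged into a single entry, which is what makes institutions in the same research area look more alike after aggregation.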
In this section, we evaluate the effectiveness of our proposed influence analysis approach, GImpact, and the efficiency of our optimization using a real grants dataset collected from the Social Sciences Management Databases of Chinese Universities (SMDB).
6.1 Datasets and Experiments Setup
In our experiments, we choose institution as the primary attribute and discipline and keyword as the set of secondary attributes. We vary the number of clusters K = 25, 50, 100, 200. The associated weights used for each K are listed in Table VII. All experiments are conducted on Windows 10 with 8GB of 1600MHz DDR3 memory and a 3.3GHz Intel Core i5 CPU. We implement all algorithms in Java 8.
6.2 Evaluation Methods
We use two metrics to evaluate the quality of the clustering result. The first metric is the density of clusters, defined as:
Density measures the cohesiveness within clusters. It reflects the consistency between the clustering results and the current institution collaboration relationships. The larger the density value, the more cohesive the clustering results.
The second metric is the Davies-Bouldin Index (DBI), which measures the uniqueness of clusters.
where the formula uses the centroid of each cluster, the influence score between two vertices, and the average influence score of the vertices in a cluster to its centroid. A clustering with a higher intra-cluster influence score and a lower inter-cluster influence score has a lower DBI value.
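Both metrics can be sketched in a few lines. The forms used below are assumptions chosen to match the prose, since the original formulas are missing: density is taken as the fraction of edges whose endpoints fall in the same cluster, and DBI as the average, over clusters, of the largest ratio between the centroid-to-centroid influence score and the sum of the two intra-cluster averages. Function names and toy data are invented.

```python
def density(edges, clusters):
    """Fraction of edges with both endpoints in the same cluster."""
    label = {v: cid for cid, vs in clusters.items() for v in vs}
    inside = sum(1 for u, v in edges if label[u] == label[v])
    return inside / len(edges)


def dbi(score, clusters):
    """Assumed similarity-based Davies-Bouldin index.

    clusters maps each centroid to the list of its members."""
    sigma = {c: sum(score[v][c] for v in vs) / len(vs)
             for c, vs in clusters.items()}
    total = 0.0
    for ci in clusters:
        total += max(score[ci][cj] / (sigma[ci] + sigma[cj])
                     for cj in clusters if cj != ci)
    return total / len(clusters)


S = [[1.0, 0.8, 0.1, 0.2],
     [0.8, 1.0, 0.3, 0.1],
     [0.1, 0.3, 1.0, 0.6],
     [0.2, 0.1, 0.6, 1.0]]
good = dbi(S, {0: [0, 1], 2: [2, 3]})
bad = dbi(S, {0: [0, 2], 1: [1, 3]})
share = density([(0, 1), (1, 2), (3, 4), (2, 3)], {0: [0, 1, 2], 3: [3, 4]})
```

Under these assumed forms, the clustering that groups strongly connected vertices together gets the lower DBI, in line with the statement above.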
6.3 Model Effectiveness
We evaluate the effectiveness of the overall scientific influence score by using different numbers of influence graphs under different K. By comparing the clustering results obtained with different influence graphs, we can learn whether integrating multiple influence graphs benefits the clustering analysis. Figure (a) and Figure (b) show the density comparison and the DBI comparison, respectively, for K = 25, 50, 100, 200.
The clustering results with both the discipline influence graph and the keyword influence graph have a significantly higher density and a substantially lower DBI. This indicates that the overall scientific influence score effectively improves the quality of clustering, because it considers not only self-influence collaboration patterns but also co-influence collaboration patterns. The clustering results can thus capture the features of the whole research grants network rather than only those of the institution graph.
Meanwhile, the clustering results with only the discipline influence graph or only the keyword influence graph are not as good as those with both influence graphs. This indicates that neither discipline nor keyword alone can describe the research area comprehensively. As K increases, these results become even worse than those obtained without influence graphs.
6.4 Clustering Comparison
We compare the different schemes used in the clustering analysis under different K. By comparing the clustering results obtained with different schemes, we can evaluate how well each scheme performs.
6.4.1 Centroid Initialization Comparison
Figure (a) and Figure (b) show the quality comparison of the initialization schemes discussed above. The clustering results with the mixed scheme have significantly better quality than those of the other schemes, because the mixed scheme keeps the initial centroids separated as much as possible, while the other schemes do not take this into account. Once bad centroids are chosen, it is hard to recover a good result. Besides, the top-K degree scheme and the top-K density scheme are slightly better than the random scheme.
6.4.2 Vertex Assignment Comparison
Figure (a) and Figure (b) show the quality comparison of the vertex assignment schemes discussed above. When K is small, the clustering results with the dynamic assignment scheme have significantly better quality than those with the closest centroid scheme. When K is large, the two schemes have almost the same quality. The reason is that clusters contain more elements when K is small than when K is large, and the more elements a cluster has, the more visible the benefit of dynamic assignment becomes. When the number of elements is relatively small, i.e., K is relatively large, the quality of the two schemes is nearly the same.
6.4.3 Centroid Update Comparison
Figure (a) and Figure (b) show the quality comparison of the centroid update schemes discussed above. The clustering results with the max objective scheme have a significantly higher density value than those with the most central scheme, while the DBI of the two schemes is nearly the same. This demonstrates that the max objective scheme finds better centroids than the most central scheme, because it achieves a good balance between a larger intra-cluster influence score and a smaller inter-cluster influence score.
6.5 Optimization Efficiency
We compare the quality of different optimization strategies under different K to evaluate their pros and cons.
Figure (a) and Figure (b) show the quality comparison of the optimization strategies discussed above. As shown in the figures, both supplement and aggregation increase the density value, and combining them gives better results. However, aggregation notably increases the DBI value, because the order of magnitude of the influence scores becomes larger after aggregation. This raises the DBI value, whereas the density value is not affected by the order of magnitude of the influence scores.
6.6 Case Study
6.6.1 Co-influence Collaboration Patterns
Table VIII and Table IX show details of the co-influence collaboration patterns of discipline and keyword after aggregation. Each row shows the probability that an institution belongs to each first-level discipline category. Each column shows the contribution of each institution to a certain first-level discipline category, i.e., the institution-level leadership in different research subject areas.
From these two tables, we can observe some interesting phenomena. First, comprehensive institutions, e.g., PKU, RUC, and WHU, usually have relatively high values in all research subject areas, whereas specialized institutions, e.g., CUFE, CUPL, and BNU, usually have very high scores in their primary research subject areas and relatively low scores in other areas. This is consistent with common sense. Second, by summing each column, we can learn which disciplines are the primary research subject areas of social science, and by summing each row, we can learn which institutions contribute more to the development of social science.
6.6.2 Influence Scores
Table X shows the self-influence scores of PKU, FUDAN, CUFE, SHUFE, CUPL, and ECUPL. The influence matrix is symmetric. Except for the diagonal, we highlight the self-influence scores higher than 0.003. We can see that the self-influence score between institutions is mainly driven by geographical position and research area. For example, PKU, CUFE, and CUPL are all located in Beijing, and FUDAN, SHUFE, and ECUPL are all located in Shanghai. Meanwhile, the influence score between CUPL and ECUPL reaches 0.005, because both primarily study political science and law.
Table XI and Table XII show the co-influence scores of discipline and keyword, respectively. We highlight the co-influence scores higher than 0.5, except for the diagonal. The co-influence score is mainly determined by the research area of an institution, because discipline and keyword adequately describe that research area. For instance, the discipline co-influence score between CUFE and SHUFE reaches 0.935 and the keyword co-influence score is 0.807, because both primarily study finance and economics. The political science and law institutions CUPL and ECUPL show a similar result. Besides, the comprehensive institutions PKU and FUDAN have a high co-influence score as well. This indicates that the co-influence score captures the commonality of the research areas of institutions.
Table XIII shows the overall scientific influence score when the influence spread coefficient and the two co-influence weights are all set to 1. Except for the diagonal, we highlight the overall scientific influence scores higher than 1. By integrating the self-influence score and the co-influence scores, we strengthen the research area component of the self-influence score, so that the overall influence score reflects what we care about through the choice of influence graphs. Besides, the geographical position component of the self-influence score supplements the research area component of the co-influence score. Thus, we obtain an overall description of the scientific influence of institutions.
We have presented the design and development of GImpact, a grant-based scientific influence analysis service. It takes a graph-theoretic approach to design and develop large scale scientific influence analysis over a large research-grant repository with three original contributions. First, we mine the grant database to identify and extract important features for grant-based scientific influence analysis and represent such features using graph theoretic models. In our first prototype of GImpact, we construct an institution graph and two grant aspect specific influence graphs, i.e., a disciplines graph and a keywords graph. Second, we utilize the heat-diffusion based influence spread model to calculate the self-influence score and the co-influence scores to compute two types of collaboration relationships. Third, we compute the overall scientific influence score for every pair of institutions by introducing a weighted sum of the self-influence score and the multiple co-influence scores, and conduct an influence-based clustering analysis. Evaluating GImpact using a real grant database, consisting of 2512 institutions and their grants received over a period of 14 years, we show that GImpact can effectively identify the grant-based research collaboration groups and provide valuable insight on an in-depth understanding of the scientific influence of research grants on research programs, institution leadership, and future collaboration opportunities.
The authors from Huazhong University of Science and Technology, Wuhan, China, are supported by the Chinese university Social sciences Data Center (CSDC) construction projects (2017-2018) from the Ministry of Education, China. The first author, Dr. Yuming Wang, is a visiting scholar at the School of Computer Science, Georgia Institute of Technology, funded by China Scholarship Council (CSC) for the visiting period of one year from December 2017 to December 2018. Prof. Ling Liu’s research is partially supported by the USA National Science Foundation CISE grant 1564097 and an IBM faculty award. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funding agencies.
-  E. Garfield, “Citation indexes for science; a new dimension in documentation through association of ideas,” Science, vol. 122, no. 3159, pp. 108–111, 1955.
-  E. Garfield, “The history and meaning of the journal impact factor,” JAMA, vol. 295, no. 1, pp. 90–93, 2006.
-  J. E. Hirsch, “An index to quantify an individual’s scientific research output,” Proceedings of the National Academy of Sciences, vol. 102, no. 46, pp. 16569–16572, 2005.
-  ——, “Does the h index have predictive power?” Proceedings of the National Academy of Sciences, vol. 104, no. 49, pp. 19193–19198, 2007.
-  L. Egghe, “Theory and practise of the g-index,” Scientometrics, vol. 69, no. 1, pp. 131–152, 2006.
-  S. Alonso, F. Cabrerizo, E. Herrera-Viedma, and F. Herrera, “hg-index: A new index to characterize the scientific output of researchers based on the h- and g-indices,” Scientometrics, vol. 82, no. 2, pp. 391–400, 2009.
-  K. W. Boyack and K. Börner, “Indicator-assisted evaluation and funding of research: Visualizing the influence of grants on the number and citation counts of research papers,” Journal of the American Society for Information Science and Technology, vol. 54, no. 5, pp. 447–461, 2003.
-  B. A. Jacob and L. Lefgren, “The impact of research grant funding on scientific productivity,” Journal of Public Economics, vol. 95, no. 9-10, pp. 1168–1177, 2011.
-  M. E. Newman and M. Girvan, “Finding and evaluating community structure in networks,” Physical Review E, vol. 69, no. 2, p. 026113, 2004.
-  S. Fortunato, “Community detection in graphs,” Physics Reports, vol. 486, no. 3-5, pp. 75–174, 2010.
-  B. S. Khan and M. A. Niazi, “Network community detection: A review and visual survey,” arXiv preprint arXiv:1708.00977, 2017.
-  G. Wang, Q. Hu, and P. S. Yu, “Influence and similarity on heterogeneous networks,” in Proceedings of the 21st ACM international conference on Information and knowledge management. ACM, 2012, pp. 1462–1466.
-  T. Arif, R. Ali, and M. Asger, “Scientific co-authorship social networks: A case study of computer science scenario in India,” International Journal of Computer Applications, vol. 52, no. 12, 2012.
-  S. Bergsma, R. L. Mandryk, and G. McCalla, “Learning to measure influence in a scientific social network,” in Canadian Conference on Artificial Intelligence. Springer, 2014, pp. 35–46.
-  J. Jiang, P. Shi, B. An, J. Yu, and C. Wang, “Measuring the social influences of scientist groups based on multiple types of collaboration relations,” Information Processing & Management, vol. 53, no. 1, pp. 1–20, 2017.
-  P. L. Giudice, P. Russo, D. Ursino et al., “A new social network analysis-based approach to extracting knowledge patterns about research activities and hubs in a set of countries,” International Journal of Business Innovation and Research, vol. 17, no. 2, pp. 147–186, 2018.
-  H. Ma, H. Yang, M. R. Lyu, and I. King, “Mining social networks using heat diffusion processes for marketing candidates selection,” in Proceedings of the 17th ACM conference on Information and knowledge management. ACM, 2008, pp. 233–242.
-  Y. Zhou and L. Liu, “Social influence based clustering of heterogeneous information networks,” in Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2013, pp. 338–346.
-  L. Kaufman and P. Rousseeuw, Clustering by means of medoids. North-Holland, 1987.
-  A. Hinneburg, D. A. Keim et al., “An efficient approach to clustering in large multimedia databases with noise,” in KDD, vol. 98, 1998, pp. 58–65.
-  D. Arthur and S. Vassilvitskii, “k-means++: The advantages of careful seeding,” in Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, 2007, pp. 1027–1035.
-  R. A. Hanneman and M. Riddle, “Introduction to social network methods,” 2005.
-  D. L. Davies and D. W. Bouldin, “A cluster separation measure,” IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 2, pp. 224–227, 1979.