Big Missing Data: are scientific memes inherited differently from gendered authorship?

by   Tanya Araújo, et al.

This paper seeks to build upon the previous literature on gender aspects in research collaboration and knowledge diffusion. Our approach adds the meme inheritance notion to traditional citation analysis, as we investigate if scientific memes are inherited differently from gendered authorship. Since authors of scientific papers inherit knowledge from their cited authors, once authorship is gendered we are able to characterize the inheritance process with respect to the frequencies of memes and their propagation scores depending on the gender of the authors. By applying methodologies that enable the gender disambiguation of authors, big missing data on the gender of citing and cited authors is dealt with. Our empirically based approach allows for investigating the combined effect of meme inheritance and gendered transmission. Results show that scientific memes do not spread differently from either male or female cited authors. Likewise, the memes that we analyse were not found to propagate more easily via male or female inheritance.




1 Introduction

Researchers now live in an era of a New Data Frontier, a term coined by Maryann Feldman and co-authors in a recent paper [1]. Very large databases available in digital format help to enlarge the horizons of knowledge in multiple domains, such as social media communication, health, business, finance, economics, invention and innovation, and scientific diffusion and progress. Some recent literature based on big data spans multiple data sources, disciplines and applications, including i) financial markets forecasting through text-based information from news and social media [2]; ii) investor sentiment research built on machine learning and natural language processing [3]; iii) analysis of co-occurring terms in opinion dynamics ([4], [5]); and iv) innovation diffusion based on epidemic models [6].

In addition, various datasets have been used as sources for the identification of patterns of knowledge diffusion and scientific progress at both individual and institutional levels. Big data on scientific knowledge, in particular, has received considerable attention. Studies have explored both traditional scientific data sources, such as publications and patents [7], and new data sources, such as university websites [8] or communication within large virtual academic communities [9].

Notwithstanding, most analyses of knowledge transmission, channels and mechanisms are based on classical large scientific databases like Google Scholar, PubMed, Scopus and Web of Science, which have been intensely compared, discussed and evaluated ([10], [11], [12], [13], [14], [15]).

Besides providing raw data on publications and patents, some large scientific databases publish science metrics, such as Impact Factor per journal, h-index per author and citation scores. Studies using citation indicators have provided insights into scientific performance and impact [16], and knowledge dissemination among individuals [17], firms [18] or regions [19], as well as, between universities and firms [20]. Citation content and frequency also feed network approaches used in the study of social contagion mechanisms. Research collaboration and co-authorship help to discover patterns of collaboration within scientific communities of authors, inventors or innovators [21].

Tobias Kuhn and co-authors [22] study the spread of scientific knowledge using Dawkins’ concept of the meme, the cultural analogue of the gene in genetic evolution. Richard Dawkins illustrates the scientific meme as a replicator in The Selfish Gene: “If a scientist hears, or reads about, a good idea, he passes it on to his colleagues and students. He mentions it in his articles and his lectures. If the idea catches on, it can be said to propagate itself, spreading from brain to brain. If the meme is a scientific idea, its spread will depend on how acceptable it is to the population of individual scientists; a rough measure of its survival value could be obtained by counting the number of times it is referred to in successive years in scientific journals.” [23].

Although the idea of memes is not completely original, as Dawkins acknowledges, it has received growing interest ever since. The concept of meme has been explored in several scientific areas. In Economics, Robert Shiller recently drew attention to memes and narratives in his Presidential Address delivered at the American Economic Association meeting: “There is a daunting amount in the scholarly literature about narratives, in a number of academic departments, and associated concepts of memetics, norms, social epidemics, contagion of ideas. While we may never be able to explain why some narratives go viral and significantly influence thinking while other narratives do not [...] We economists should not just throw up our hands and decide to ignore this vast literature.” [24].

Here, we aim to investigate if scientific memes are inherited differently from gendered authorship. Since authors of scientific papers inherit knowledge from their cited authors, once authorship is gendered (by applying methodologies that enable the gender disambiguation of authors), we are able to characterize the inheritance process with respect to the frequencies of scientific memes and their propagation scores depending on the gender of the authors. Would female inheritance - represented by the citations of female authors - favor the propagation of some specific meme? Likewise, would some particular memes propagate more via male inheritance?

Moreover, our paper seeks to build upon the previous literature about gender aspects in research communication, collaboration and co-authorship ([21], [25], [26], [27], [28], [29], [30], [31], [32], [33]) and scientific outcome impact by gender ([34], [35], [36], [37], [38], [39], [40], [41], [42], [43], [44], [45]).

By including gender in the study of knowledge spread and adding a gender perspective to the “spreading of good ideas, from brain to brain”, to adopt Dawkins’s words, our empirically based research aims to improve the understanding of how knowledge spreads in three ways:

  • it sheds some light on previous mixed and puzzling results about women in science;

  • it adds the meme inheritance notion to traditional citation network analysis and

  • it accurately identifies the gender of authors, dealing with the issue of missing data in large scientific databases.

The remainder of this paper is structured as follows: Section 2 briefly describes some scientific databases, calling attention to the lack of information about the gender of citing and cited authors. Section 3 presents the data and the methodologies used in the paper. Section 4 presents and discusses the results of the empirical analyses. The final section provides conclusions, policy implications and some promising research avenues.

2 Big Missing Data

Four databases (Web of Science WoS, SCOPUS, Google Scholar GS and PubMed) and three repositories (arXiv, RePEc and BASE-Bielefeld) were searched in order to explore the possibilities of extracting information regarding the gender of both citing and cited authors and the memes found in the abstracts of citing and cited records (papers). The following criteria were adopted to select the datasets to examine in detail: size and accuracy of the data; citation tracking possibilities; downloading capabilities; and the possibility of gathering, for each record, at least its title and abstract.

Several studies have compared the coverage, features, and citation analysis capabilities of GS, PubMed, SCOPUS and WoS. These comparative studies usually focus on a particular research topic like biomedical information [12], medical journals [14], oncology and condensed matter physics [11], library and information science [15] or environmental sciences [10]. Other studies address only the accuracy of one database [46]. This literature, however, does not systematically review citation analysis linked to the full identification of authors.

Web of Science (WoS)

The Web of Science (WoS), formerly the ISI Web of Knowledge, describes itself as the “gold standard for research discovery and analytics” [47] and the primary research platform for information in the sciences, social sciences, arts, and humanities. It uses cited reference search to track past research and screen current advances in over 100 years' worth of content that is fully indexed, including 59 million records and backfiles dating back to 1898. Web of Science consists of six databases containing information gathered from thousands of scholarly journals, books, book series and other scientific outcomes. The Master Journal List includes 22,832 titles. The databases included in WoS are: Science Citation Index Expanded (SCI-Expanded); Social Sciences Citation Index (SSCI); Arts & Humanities Citation Index (A&HCI); Conference Proceedings Citation Index - Science (CPCI-S); Conference Proceedings Citation Index - Social Sciences & Humanities (CPCI-SSH); and Emerging Sources Citation Index (ESCI). The WoS also includes two chemistry databases: Index Chemicus (IC) and Current Chemical Reactions (CCR-Expanded).

The Web of Science adopts a selection process for the inclusion of journals in its content coverage [48]. The most frequent criticisms of WoS are its bias towards American-based, English-language journals, its failure to completely cover other citation sources (e.g. books) and its failure to include citations from outside the WoS database. Despite the criticism, WoS is widely used in scientometric analyses based on the articles and article citations on a given subject ([10], [13], [29], [49]). The WoS has features for browsing, searching, sorting, saving and exporting data. A citation report can be generated by author (or by institution, etc.), and a citation map can be produced. Each record (an article) can be graphically represented or mapped, linking the record to all the records that cite or are cited by the target record. Most of the articles comprise an abstract and a set of keywords. The number of keywords varies across journals and scientific domains. From 2006 onwards, the reporting of authors' names in WoS changed to include the full name (given and family name). However, the full names of cited authors, i.e., the authors in the bibliographic list of references of each article, are not provided.


SCOPUS

SCOPUS, developed and owned by Elsevier, is presented on its own Web page as “the largest abstract and citation database of peer-reviewed literature: scientific journals, books and conference proceedings” [50]. It includes 66 million records, 22,748 peer-reviewed journals and 7.7 million conference papers. SCOPUS includes publications from Sciences, Social Sciences and Arts and Humanities. SCOPUS covers more journals than WoS and provides better coverage of non-North American sources. Most of the articles comprise abstracts. The citations of an author and the articles that cite the original article (using Citation Tracker) make it possible to base the analysis of citations on different criteria and enable the researcher to create an exportable spreadsheet of the citations, which may or may not include self citations. Author names in SCOPUS can be arranged differently. Consequently, the given name is missing for an unknown share of authors in the database. Thus, the SCOPUS database does not allow for the gender disambiguation of authors in the bibliographic list of references of the articles. In a recent publication prepared by Elsevier, the SCOPUS data are combined with other data sources in order to obtain information on the first names and gender of the authors [51].

Google Scholar (GS)

Google Scholar was created in 2004 and comprises all fields of knowledge and several types of documents, including abstracts, peer-reviewed and non-peer-reviewed papers, print and electronic journals, conference proceedings, books, theses, dissertations, preprints, technical reports, monographs, patents, and legal documents. GS defines neither the number of journals covered nor the time span of the database. Moreover, for a document with the same title and authorship, it includes all the versions available online. Because the content coverage is unknown and several versions of the same paper are included, there is consensus among the scientific community that GS must be used with caution and is not suitable for citation analysis. Authors can create a Google Scholar Account profile. The allocation of articles to authors is carried out automatically, and scientific outcomes are frequently attributed to the wrong author.

PubMed

PubMed is an important resource for clinicians and researchers. The developer/owner is the National Center for Biotechnology Information (NCBI), at the US National Library of Medicine (NLM), National Institutes of Health. The dataset includes Medline (1966-present), old Medline (1950-1965), PubMed Central, and other NLM databases. Citations are not provided. Instead, for each article, there is information about similar articles. The presentation of the names of the authors is incomplete, as it does not include their given names.

RePEc Repository

The Research Papers in Economics (RePEc) is developed by volunteers from 89 countries to promote the dissemination of research in Economics and associated fields. It includes working papers, journal articles, books, book chapters and software components, in a total of 2 million research outputs from 2,300 journals and 4,300 working paper series. There are 48,000 registered authors, who are ranked according to the citations received by their scientific outcomes. The citation coverage of RePEc is in general incomplete compared with WoS and Scopus. The citations in RePEc are collected by an experimental project, CitEc, and only a minority of all works can be analyzed. Within RePEc, a Genealogy project is being built on a voluntary basis to create a dataset of advisors and advisees [52].

BASE (Bielefeld Academic Search Engine)

BASE, one of the largest search engines for academic resources, indexes more than 100 million documents (about 60% full text Open Access) from more than 5,000 sources. It contains different kinds of documents (text, image-video, software, and datasets). The total of 73,595,901 text documents comprises, among others, books and book parts, article contributions to journals/newspapers, patents and theses.


arXiv

ArXiv is a pre-print archive of working papers in Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance and Statistics. The repository arXiv, where High Energy Physics belongs, was founded in 1991 by Paul Ginsparg, a theoretical physicist at Cornell University, and since then it has received growing interest and use from the scientific community [53]. A study comparing this pre-publication repository with publication databases has been carried out by Bar-Ilan [54].

There are several studies about the research impact of the material in the repository of arXiv namely using citation analysis ([55], [56], [16], [57], [58]). Other studies use arXiv to identify trends and build the agenda for future research in multiple scientific domains ([58], [49]).

From name to gender

In fact, the two major bibliographic databases, Web of Science (WoS) and Scopus, both of which cover several scientific domains and many types of scientific outputs, do not include the information needed to answer our research questions. Moreover, the large databases and repositories of scientific outputs, whatever their coverage (by domain, period or type of document) and citation search strategies, produce quantitatively and qualitatively different citation material.

Given the goal of this paper, the databases described have strong weaknesses resulting from the absence of gender information about the authors. The large majority of bibliometric and patent databases do not include information about the gender of the author, inventor or innovator. Under certain conditions this information can be obtained indirectly through the given or family name of the author. The situation is worse with regard to the transmission of knowledge (from citing to cited author), because the databases provide neither information on citations by gender nor the full name of the cited author, i.e., authors in the bibliographic list of references. Thus, given this lack of information, the only way to overcome these weaknesses is to obtain the information from the first name or the family name of the contributors concerned.

It is possible to generate the missing information from the given name of the author if it exists, which is rare. This procedure was adopted here to deal with missing data on gender. In brief, some databases include the full name of the authors, which enables, at least partially, the identification of their gender. However, the bibliographic list of references (citations) in each article does not provide the full name of the cited authors. This lack of information is solved in the present research by using the dataset provided by the Stanford Network Analysis Platform (SNAP) together with the GitHub package Predicting Gender from Names using Historical Data ([59], [60]) for the gender disambiguation of authors.


The Stanford Network Analysis Platform (SNAP) is a general purpose network analysis and graph mining library. From the Stanford Large Network Dataset Collection we were able to download the dataset recorded from the repository arXiv: the hep-th High Energy Physics [61]. The detailed description of this dataset is presented in the next section.

3 Data and Methodology

We used the dataset recorded from the repository arXiv, the hep-th High Energy Physics - Theory and provided by Jure Leskovec at Stanford Large Network Dataset Collection [61]. The data covers papers in the period from January 1993 to April 2003 (124 months) within a dataset of 29,555 papers and 352,807 links.

The available dataset is organized in three files, two of which were the main source of the work herein presented:

1) cit-HepTh-abstracts provides, for each paper:
• Paper
• Authors
• Title
• Abstract

2) cit-HepTh provides:
• list of directed edges from Paper to Paper

Each directed link $(i, j)$ in the citation network cit-HepTh indicates that paper $i$ cites paper $j$. If a paper cites, or is cited by, a paper outside the dataset, the list does not contain any information about this.
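As a minimal sketch of how such an edge list can be read, the Python fragment below assumes the plain-text layout commonly used for SNAP edge files (comment lines starting with `#`, then one whitespace-separated pair of paper identifiers per line). The function name `parse_edge_list` and the sample identifiers are illustrative assumptions, not part of the dataset.

```python
from typing import List, Tuple


def parse_edge_list(text: str) -> List[Tuple[str, str]]:
    """Return (citing, cited) pairs from an edge-list string.

    Assumed layout: '#' comment lines, then 'citing<whitespace>cited' per line.
    """
    edges = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        citing, cited = line.split()[:2]
        edges.append((citing, cited))
    return edges


# Hypothetical three-link sample (identifiers are made up for illustration).
sample = "# FromNodeId\tToNodeId\n1001\t9304045\n1001\t9308122\n\n9304045\t9308122\n"
edges = parse_edge_list(sample)
print(len(edges), "links read")
```

Identifiers are kept as strings here so that leading zeros, if any, are preserved; this is a design choice of the sketch rather than a property of the source files.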

For each paper in the dataset, the field Abstract includes the corresponding abstract of the paper. The field Authors provides the full name of the authors in 70% of the observations. Since most of the papers in this dataset comprise the first (given) name of the authors, we were able to classify the citing and cited authors by gender.

Gender disambiguation of authors

While bibliometric databases have been the main sources for the study of knowledge spread, because they do not include information separated by male and female authors, the study of the scientific production by gender has been frequently limited to surveys or case studies [37]. When the number of observations is low, gender allocation is done manually. Automatically allocating gender to researchers depends on the availability of: (i) databases with gender information that allow matching by researcher name or code. For example, Abramo and coauthors [62] combine data from WoS with data from the Italian Ministry of Education, Universities and Research; (ii) databases with a list of male and female given names in different languages, as for instance, the database built under the EU project Improving Human Research Potential and the Socio-Economic Knowledge Base ([63], [64]) and the method used in a recent report from Elsevier [51]; (iii) language specific characteristics that allow for systematically allocating gender from each researcher’s given name as, for example, the Portuguese given names [21]. Polish names allow gender to be allocated from the family name [65]. By contrast, most Asian researchers publishing in English-language journals have to adopt a phonetic version of their given and family names, which creates ambiguities in the attribution of the authors’ gender [66].

This research adopts the methodology mentioned in (ii) and uses the GitHub package ([59], [60]) as the data source for the gender disambiguation of authors.
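The name-based approach in (ii) can be illustrated with a small Python sketch. It is not the package cited above: the lookup table `NAME_TABLE`, the helper names and the sample author strings are all hypothetical, and the rule that an initial such as "J." carries no gender information is an assumption of this sketch.

```python
import re
from typing import Optional

# Hypothetical given-name table; a real study would use a large historical list.
NAME_TABLE = {"maria": "female", "anna": "female", "john": "male", "paul": "male"}


def first_given_name(authors_field: str) -> Optional[str]:
    """Return the lower-cased given name of the first listed author, if any."""
    first_author = re.split(r",| and ", authors_field)[0].strip()
    tokens = first_author.split()
    if len(tokens) < 2:
        return None  # no separate given name and family name
    if tokens[0].endswith(".") or len(tokens[0]) < 2:
        return None  # an initial does not reveal the given name
    return tokens[0].lower()


def infer_gender(authors_field: str) -> Optional[str]:
    """Return 'female', 'male', or None when the gender stays missing."""
    given = first_given_name(authors_field)
    return NAME_TABLE.get(given) if given else None


print(infer_gender("Maria Silva and J. Doe"))  # first author has a full given name
print(infer_gender("J. Doe, Paul Smith"))      # first author only has an initial
```

Papers for which the function returns `None` correspond to the missing-gender cases; only the first author is inspected, in line with the first-author convention used in the rest of the analysis.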

3.1 Sample Characteristics

Tables 1 and 2 display an overview of the basic information compiled from the arXiv hep-th High Energy Physics dataset [61] after the gender disambiguation of authors. Author gender was accurately assessed for 70% of the papers. As mentioned earlier, the loss of information in the process of disambiguation is in line with other studies ([34], [49]). A gendered paper is a paper that enables the gender disambiguation of at least its first author. A gendered author is an author whose name enables the identification of their gender. The average number of papers by gendered author is 2.1. A gendered link is a link between two gendered papers; 58% of the citations are gendered links. Most of the papers that appear in the citation network are both citing and cited papers: 20,468 of 27,770 (about 74%), as the values in Table 1 show. A similar ratio applies when we consider just the papers with gendered authors (13,673 of 19,153, about 71%).

                                           All       With gender
Number of papers                           29,555    20,657
Number of citations                        352,807   206,405
Number of citing papers                    25,058    17,230
Number of cited papers                     23,180    15,596
Size of union (citing or cited)            27,770    19,153
Size of intersection (citing and cited)    20,468    13,673

Table 1: Paper-centered information from the original dataset after gender disambiguation of authors.
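The paper-centered quantities in Table 1 can be derived from the edge list alone. The sketch below is a hedged illustration in Python; the function names and the toy data are assumptions, and a link is counted as gendered only when both of its papers are gendered, following the definition given above.

```python
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str]  # (citing paper, cited paper)


def network_summary(edges: Iterable[Edge]) -> Dict[str, int]:
    """Count citing papers, cited papers, and the union/intersection of both."""
    edges = list(edges)
    citing = {c for c, _ in edges}
    cited = {d for _, d in edges}
    return {
        "citations": len(edges),
        "citing": len(citing),
        "cited": len(cited),
        "union": len(citing | cited),
        "intersection": len(citing & cited),
    }


def gendered_links(edges: Iterable[Edge], gendered: Set[str]) -> List[Edge]:
    """Keep the links whose citing and cited papers are both gendered."""
    return [(c, d) for c, d in edges if c in gendered and d in gendered]


# Toy data (hypothetical identifiers).
toy_edges = [("p1", "p2"), ("p1", "p3"), ("p2", "p3"), ("p4", "p1")]
summary = network_summary(toy_edges)
kept = gendered_links(toy_edges, gendered={"p1", "p2", "p3"})
print(summary, len(kept))
```

The counts obey the usual inclusion-exclusion identity (union = citing + cited - intersection), which also holds for the values reported in Table 1: 25,058 + 23,180 - 20,468 = 27,770 for all papers and 17,230 + 15,596 - 13,673 = 19,153 for gendered papers.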

Table 2 provides author-centered information considering just the first author of each paper, which corresponds to 9,830 unique gendered authors. Although 44.6% of papers have a second author (5.1% have a third author, and just 0.009% have a fourth author), in the current research, for simplicity, only the first author of each paper was considered. The distribution of first authors by gender yields 1,079 female and 8,751 male authors. The percentage of female authors is 10.9 in the citing papers and 9.0 in the cited papers. We found that 22.6% of the citations are self citations. Female and male authors display percentages of 24% and 24.9% of self citations, respectively.

                                     All      Female   Male    Missing
Number of 1st authors                14,099   1,079    8,751   4,269
Number of 2nd authors                6,496    687      4,200   1,609
% of 1st authors citing by gender    100      10.9     89.1
% of 1st authors cited by gender     100      9.0      91.0

Table 2: Author-centered information after the disambiguation of authors.

The difference between the percentage of female cited and male cited authors seems to mirror the universe of the publications’ authorship (citing authors) by gender. Because the cited papers correspond by definition to a period of time before the citing papers, and the proportion of female researchers for the subject area Physics and Astronomy has tended to increase in developed countries [51], the slight difference found (from 10.9 and 89.1 to 9.0 and 91.0) merely reflects a potential number of citable papers authored by women that is smaller than that of the citing papers.

Figure 1: The distributions of the number of (a) authored papers by authors, (b) citations by citing authors and (c) citations by cited authors.
Figure 2: Author-based scatter plots and correlation coefficients between: (a) the number of papers authored and the number of citations made, (b) the number of papers authored and the number of citations received, and (c) the number of citations made and the number of citations received.

Figure 1 shows the distributions of the number of papers by authors (14,099: 1,079 females, 8,751 males and 4,269 missing). It also shows the distribution of the number of citations by either the citing or the cited authors in the sample. Each value on the horizontal axis represents the (numeric) identification of an author. Numbers were assigned according to the names of the authors ranked in alphabetical order. Figure 1(a) shows 29,544 papers distributed over 14,099 authors (11 papers had no author name, so only 29,544 papers were considered). Figures 1(b) and 1(c) represent, respectively, the citations by citing and cited authors. For example, the highest value in Figure 1(b) shows that there is an author that cites more than 450 other authors. At the same time, the highest value in Figure 1(c) shows that there is an author that receives more than 3,000 citations.

Scales are different because the distribution of the citations made is much more homogeneous than the distribution of the citations received. In terms of a citation network, and considering a directed network of authors, the average number of in-coming links per author is close to the average number of out-going links, yet the former are much less equally distributed. There are authors receiving more than 1,500 and even more than 3,000 citations. A much more balanced distribution characterizes the citations made, where the most citing author does not go beyond 467 citations, as the second plot in Figure 1 shows.
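The author-level counts behind these distributions can be sketched as follows. This is a minimal Python illustration under stated assumptions: each paper is attributed to its first author only, every paper-to-paper link counts as one citation made and one citation received, and the mapping `first_author` and the identifiers are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def author_degrees(
    edges: Iterable[Tuple[str, str]], first_author: Dict[str, str]
) -> Tuple[Counter, Counter]:
    """Return (citations made, citations received) per first author."""
    made, received = Counter(), Counter()
    for citing, cited in edges:
        a, b = first_author.get(citing), first_author.get(cited)
        if a is None or b is None:
            continue  # skip links whose papers have no known first author
        made[a] += 1       # out-going link of the citing author
        received[b] += 1   # in-coming link of the cited author
    return made, received


# Toy data: author A wrote p1 and p3, author B wrote p2.
toy_edges = [("p1", "p2"), ("p3", "p2"), ("p1", "p3")]
toy_authors = {"p1": "A", "p2": "B", "p3": "A"}
made, received = author_degrees(toy_edges, toy_authors)
print(made.most_common(1), received.most_common(1))
```

Because every retained link adds one out-going and one in-coming count, the totals of the two counters coincide; only their spread across authors can differ, which is the contrast described above.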

The three scatter plots in Figure 2 show the strength of the correlation between three pairs of observations (the number of papers authored, the number of citations made, and the number of citations received) that were computed for each author in the dataset. Each mark (o) in the plots represents an author, with the $x$ and $y$ coordinates given by two of the following quantities: the number of papers authored, the number of citations made, and the number of citations received. Figure 2(a) shows the scatter plot of the number of papers authored against the number of citations made, Figure 2(b) shows the number of authored papers against the number of citations received, and Figure 2(c) concerns the number of citations made against the number of citations received.

The correlation coefficients (shown in the title of each scatter plot) were computed over the entire set of authors (14,099 authors). As expected, more productive authors have more opportunities to cite other authors, since their authored papers are the citation vehicle. The strongest correlation (0.67) was found between the number of papers authored and the number of citations made per author. The lowest correlation, between the number of papers authored and the number of citations received, reflects the lag between the publication of the scientific output and its acknowledgement by peers, as well as the relation between quantity and quality. Sometimes the more productive authors (productivity evaluated by the number of papers) are not those that are cited more often.
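For reference, a correlation coefficient between two per-author quantities can be computed as in the sketch below. The text does not state which coefficient was used; the Pearson coefficient is assumed here, and the sample vectors are made up for illustration.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-author counts: papers authored vs. citations made.
papers_authored = [1, 2, 3, 4, 10]
citations_made = [5, 9, 20, 22, 60]
r = pearson(papers_authored, citations_made)
print(round(r, 2))
```

Each position in the two lists refers to the same author, so the coefficient measures how closely one per-author quantity tracks the other.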

3.2 Memes Selection

Following the approach of Kuhn and co-authors [22], our research is driven by the characterization of the propagation (or inheritance) mechanism of memes and not just by their frequency of occurrence as is usual in citation analysis.

A first step in this direction is the selection of a subset of memes from the whole set of the most frequently occurring words in the entire set of 29,555 papers. Our meme selection process starts by using the word-counting procedure of Voyant Tools ([67], [68]). Voyant Tools allows for defining a list of words to be excluded from the word-counting procedure (stopwords). Typically, a stopword list contains functional words that do not carry much meaning, such as determiners and prepositions (“in”, “to”, “from”, among others). Table 3 shows 40 memes selected among the most frequently occurring words in the abstracts (without stopwords) and ranked by frequency of occurrence. It also shows the frequency of each selected meme computed from both the 29,555 papers and from the 20,657 gendered papers.

The memes selected correspond to the most frequent words in our sample that carry a specific meaning in the field of High Energy Physics. Many frequent words like “theory”, “dimensional” or “field” were excluded since they are not specific enough, occurring frequently also in papers from other scientific areas. These words were excluded together with functional words because we are interested in the thematic similarities of the papers as opposed to, for instance, stylistic similarities between different authors.
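The counting step can be approximated outside Voyant Tools with a short Python sketch. It is an illustration only: the stopword list is a tiny hypothetical sample, the set of generic words repeats the three examples named in the text, and the tokenization (lower-cased alphabetic runs) is an assumption that may differ from the tool actually used.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

STOPWORDS = {"in", "to", "from", "the", "of", "a", "and", "is", "we"}  # hypothetical sample
GENERIC_WORDS = {"theory", "dimensional", "field"}  # examples given in the text


def candidate_memes(abstracts: Iterable[str], top_n: int) -> List[Tuple[str, int]]:
    """Rank the words that survive both exclusion lists by total occurrences."""
    counts = Counter()
    for text in abstracts:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in STOPWORDS and word not in GENERIC_WORDS:
                counts[word] += 1
    return counts.most_common(top_n)


# Made-up abstracts for illustration.
toy_abstracts = [
    "The gauge theory of the string",
    "A string in the gauge field",
    "Quantum string",
]
ranking = candidate_memes(toy_abstracts, top_n=2)
print(ranking)
```

The final choice of memes in the paper also relies on judging which frequent words are specific to High Energy Physics, a step this sketch reduces to a fixed exclusion set.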

Rank  Meme             Frequency    Rank  Meme             Frequency
1     space            9,249        2     gauge            8,082
3     string           7,517        4     quantum          6,275
5     symmetry         5,682        6     brane            5,153
7     mass             5,082        8     gravity          4,621
9     group            4,600        10    conformal        3,389
11    potential        3,331        12    spin             2,604
13    hole             2,395        14    supersymmetry    2,220
15    supergravity     2,118        16    topological      2,079
17    phase            2,068        18    abelian          2,034
19    magnetic         1,983        20    manifold         1,967
21    matter           1,829        22    spacetime        1,812
23    vacuum           1,802        24    coupled          1,795
25    tensor           1,763        26    massless         1,654
27    renormalization  1,418        28    cosmological     1,393
29    gravitational    1,362        30    bosonic          1,352
31    chern            1,277        32    temperature      1,172
33    lattice          1,033        34    discrete         1,023
35    fermionic        981          36    relativistic     932
37    superconformal   752          38    singularity      727
39    cohomology       465          40    hierarchy        464

Table 3: The absolute frequency of 40 selected memes from the frequently occurring words in the abstracts of the 29,555 papers.

Figure 3: The relative frequency of occurrence of 40 memes computed from all 29,555 papers ($f_m$) and from 20,657 gendered papers ($f_m^g$).

Figure 3 shows the relative frequencies (the ratio of papers carrying the meme in each subset of papers) of each selected meme in Table 3. The relative frequencies are computed both from the 29,555 papers ($f_m$) and from the subset of the 20,657 gendered papers ($f_m^g$). The two values differ only slightly: the differences are, on average, smaller than 5% of $f_m$, and in two thirds of the memes the value of $f_m^g$ is greater than the corresponding $f_m$ value, meaning that the relative frequencies of two thirds of the selected memes slightly increase when computed from the gendered papers. The vertical dashed line in Figure 3 marks the cut-off that singles out the 15 memes with the highest frequency of occurrence in the subset of gendered papers. In the next section, we compute the propagation score of these 15 memes and discuss its relation with the frequency of occurrence.
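The relative frequency used here, the share of papers in a given set whose abstract carries the meme, can be sketched in Python as follows. The function names and the toy abstracts are hypothetical, and a paper is assumed to carry a meme when the exact word appears in its abstract (so plural forms would not match).

```python
import re
from typing import Dict, Iterable, Optional


def carries(abstract: str, meme: str) -> bool:
    """True when the meme occurs as a whole word in the abstract."""
    return meme in set(re.findall(r"[a-z]+", abstract.lower()))


def relative_frequency(
    abstracts: Dict[str, str], meme: str, subset: Optional[Iterable[str]] = None
) -> float:
    """Share of papers (all, or only those in `subset`) that carry the meme."""
    ids = list(abstracts) if subset is None else [p for p in subset if p in abstracts]
    if not ids:
        raise ValueError("empty paper set")
    return sum(carries(abstracts[p], meme) for p in ids) / len(ids)


# Made-up abstracts; p4 plays the role of a paper without a gendered first author.
toy = {
    "p1": "Gauge symmetry in string theory",
    "p2": "Black hole entropy",
    "p3": "A string model",
    "p4": "Quantum gravity",
}
f_all = relative_frequency(toy, "string")
f_gendered = relative_frequency(toy, "string", subset=["p1", "p2", "p3"])
print(f_all, round(f_gendered, 3))
```

Calling the function once on all papers and once on the gendered subset gives the two values compared for each meme in Figure 3.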

4 Results and Discussion

Since authors of scientific papers inherit knowledge from the authors they cite, once authorship is gendered our research questions can be rephrased:

  • Are the frequency and propagation of a meme (from cited paper to citing paper) influenced by the gendered authorship of the cited paper?

  • Do the selected memes spread differently from either male or female cited authors?

To answer these questions we characterize the inheritance process with respect to the frequencies of memes and their propagation scores depending on the gendered authorship of the cited papers.

Starting from such a gender-oriented perspective and restricting our sample to the set of 20,657 gendered papers, two indicators are computed for each selected meme: the relative frequency and the propagation score [22].

As already mentioned, the relative frequency of a meme $m$ computed from the set of 20,657 gendered papers ($f^g_m$) is the ratio of papers carrying the meme in this subset. The propagation score is given by:

$$P^g_m = \frac{d_{m \to m} / d_{\to m}}{d_{m \to \not m} / d_{\to \not m}}$$

where $d_{m \to m}$ is the number of papers that carry the meme $m$ and cite at least one paper carrying this meme, while $d_{\to m}$ is the number of all papers (meme carrying or not) that cite at least one paper that carries the meme $m$. Following Kuhn and co-authors [22], we also compute $d_{m \to \not m}$ as the number of papers that carry the meme $m$ and do not cite any paper carrying this meme, and $d_{\to \not m}$ as the number of all papers (meme carrying or not) that do not cite a paper that carries the meme $m$.

Since $g$ in $P^g_m$ stands for gendered, its computation is made from the citation network of 206,405 gendered links. When computing the propagation score for each specific gender ($P^F_m$ and $P^M_m$), we constrain the subsets of links being considered so that the cited papers conform to each specific gender. Therefore, in computing the female (male) propagation score of a meme, $P^F_m$ ($P^M_m$), the terms $d_{m \to m}$, $d_{\to m}$, $d_{m \to \not m}$ and $d_{\to \not m}$ account only for the cited papers of a female (male) author.
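The computation described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the score is taken as the ratio of two fractions, following the verbal definitions and reference [22], namely the share of meme carriers among papers that cite a carrier, divided by the share of meme carriers among papers that cite no carrier. No smoothing term is applied, the set of papers considered is assumed to be the citing papers that appear in the supplied links, and all paper ids, gender labels and function names are hypothetical.

```python
def propagation_score(carriers, links):
    """Propagation score of a meme from (citing, cited) links.

    carriers: set of ids of papers that carry the meme
    links:    list of (citing_id, cited_id) pairs
    Returns None when a denominator would be zero.
    """
    citing_papers = {citing for citing, _ in links}
    cites_carrier = {citing for citing, cited in links if cited in carriers}
    no_carrier_cited = citing_papers - cites_carrier

    d_m_to_m = len(cites_carrier & carriers)          # carry the meme, cite a carrier
    d_to_m = len(cites_carrier)                       # cite a carrier
    d_m_to_not_m = len(no_carrier_cited & carriers)   # carry the meme, cite no carrier
    d_to_not_m = len(no_carrier_cited)                # cite no carrier

    if 0 in (d_to_m, d_m_to_not_m, d_to_not_m):
        return None
    return (d_m_to_m / d_to_m) / (d_m_to_not_m / d_to_not_m)


def links_by_cited_gender(links, gender_of, gender):
    """Keep only the links whose cited paper has the given gender label."""
    return [(citing, cited) for citing, cited in links
            if gender_of.get(cited) == gender]


# Toy citation network (hypothetical ids and gender labels).
carriers = {1, 2, 3, 6, 9}
gender_of = {1: "M", 10: "M", 9: "F", 7: "F"}  # gender of each cited paper
links = [
    (2, 1), (3, 1), (4, 1), (5, 10), (6, 10),  # links to male-authored papers
    (2, 9), (4, 9), (5, 7), (6, 7), (8, 7),    # links to female-authored papers
]

p_gendered = propagation_score(carriers, links)
p_male = propagation_score(carriers, links_by_cited_gender(links, gender_of, "M"))
p_female = propagation_score(carriers, links_by_cited_gender(links, gender_of, "F"))
```

Restricting the links before counting, rather than filtering the papers afterwards, mirrors the statement that all four terms account only for cited papers of the given gender.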

Figure 4: The relative frequency of memes ($f^F_m$ and $f^M_m$) in gendered papers.

Figures 4 and 5 show, respectively, the values of the relative frequencies ($f^F_m$ and $f^M_m$) and propagation scores ($P^F_m$ and $P^M_m$) of the 15 memes whose frequency of occurrence in the subset of gendered papers lies above the chosen threshold. The only noteworthy difference in the propagation score by gender concerns the value obtained for the meme "Spin". In this specific case, the propagation score via male inheritance is stronger than via female inheritance. In the other 14 cases, results confirm the near absence of any difference between female and male transmission of memes.

Figure 5: The propagation score of memes ($P^F_m$ and $P^M_m$) by gender of the cited authors.
Figure 6: The relative frequency ($f^g_m$) and propagation score ($P^g_m$) of gendered papers plotted against each other.

Figure 6 shows a scatter plot where the coordinates of each of the 15 memes are given by its relative frequency ($f^g_m$) and propagation score ($P^g_m$). When the relative frequency and propagation score are plotted against each other, our results are in line with the outcomes presented in reference [22], showing that less frequent memes tend to propagate more (via citation links). A possible reason for such a simple relation between the relative frequencies and propagation scores of scientific memes may lie in the fact that the less frequent ones are presumably more informative and therefore occur less often. Likewise, functional words, such as determiners and prepositions, carry less meaning, occur very frequently and therefore occupy the most central positions in linguistic (co-occurrence) networks ([69], [70]).

Computing the correlation coefficient between the values of $f^g_m$ and $P^g_m$ for the set of gendered papers yields a negative value. As the propagation score of a meme captures how interesting it is for the scientific community, our results confirm that being interesting is inversely related to occurring frequently. The scatter plot in Figure 6 shows that such a simple relation holds when citation ties are gendered.
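A correlation of this kind can be computed as in the sketch below. The text does not say which coefficient is used, so the Pearson coefficient is assumed here, and the frequency and score values are invented purely for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Invented values: relative frequency and propagation score of four memes.
frequencies = [0.20, 0.10, 0.05, 0.02]
scores = [1.5, 2.5, 4.0, 6.0]

r = pearson(frequencies, scores)  # negative: rarer memes have higher scores
```

A negative coefficient corresponds to the inverse relation between frequency of occurrence and propagation score shown in Figure 6.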

Not surprisingly, given that information on the gender of the authors one cites is usually missing, the transmission of memes is free of gender-homophily trends in citation choices.

There is a broad literature ([71], [17], [72], [73], [74]) on social relations (social networks included) showing that many social systems create contexts in which homophilic relationships hold. From friendship to co-membership and marriage, several studies have discussed the role that similarity plays in the creation of human relationships. The phenomenon of establishing ties with similar individuals has been extensively studied through network approaches, regardless of whether similarity is based on age, religion, education, occupation, or gender. Recent research on the structure of citation networks [75] presents a method for measuring the similarity between articles through the overlap between the bibliographic lists of references included in these articles (cited papers). A related study is discussed by Ramon Ferrer i Cancho [76], who defines a similarity network between articles on linguistic, cognitive and brain networks. There, instead of bibliographies, the similarity between articles is measured on the basis of similar words used in the abstracts of the articles. The network approach thus allows for clustering articles on linguistic networks into different modules depending on whether they deal with semantics or functional brain networks.

When the gender aspect is considered, a large-scale analysis of gendered authorship [77] based on eight million papers across multiple areas reveals that women are significantly under-represented as authors of single-authored papers. Araújo and Fontainha arrived at similar results when analyzing gendered authorship of scientific papers through a network approach [21]. This paper, seeking to build upon the previous literature on gender aspects in research transmission, adds the memes approach and the propagation score methodology to the usual citation analysis. Our computation of the propagation scores of memes characterized by the gendered authorship of the citing and cited papers allows for investigating the combined effect of meme inheritance and gendered transmission.

In so doing, and despite the small difference found for the meme "Spin", our results show that the propagation of the selected memes does not seem to be influenced by gendered authorship. The selected memes do not spread differently from either male or female cited authors. Neither female nor male inheritance seems to favor the propagation of any of the selected memes. Likewise, with a single exception, the memes that we analyzed were not found to propagate more easily via male or female inheritance.

5 Conclusion

Our approach adds the meme inheritance notion to traditional citation analysis, as we investigate whether scientific memes are inherited differently from gendered authorship. Results reveal that the inheritance process does not differ by gender. The descriptive analysis suggests the absence of any gender-homophily trend in citation ties. The empirical analysis also shows that scientific output is very unbalanced by gender in the scientific domain under analysis: women account for only a minority of the authorship outputs. Moreover, our results are in line with the results presented in reference [22], confirming that there is a simple relation between the frequency of occurrence of a scientific meme and its propagation score via citation links. Here we show that such a simple relation also holds when citation ties are gendered.

The paper contributes to a more precise characterization of women in research and, in doing so, can help inform the design, follow-up and evaluation of research programs and projects that include gender balance in their objectives. The EU Research and Innovation programme, Horizon 2020, for example, specifically stipulates three objectives pertaining to gender equality: to foster gender balance in Horizon 2020 research teams; to ensure gender balance in decision-making; and to integrate gender/sex analysis in research and innovation (R&I) content (eige.europa). The present paper, by contributing to a better understanding of knowledge transmission by gender, will also help to increase the quality and relevance of the production and diffusion of R&I outputs ([78]).

Concerning citation analysis and scientometrics, this paper goes a step further by investigating the interplay of meme transmission and gendered authorship. The methodology can be useful for academics conducting citation studies and knowledge diffusion analyses. For big data developers, owners, editors, administrators, and funding agencies, the present study also enlarges the horizons of knowledge production and dissemination. In particular, not only are the owners of big databases in a strategic position, but they also have the resources to develop new tools to deal with the lack of information on gender. In the future, when the big bibliometric databases start to include it as a regular procedure, this study can be replicated on a broader scope, free of missing data.

Future research is planned to further explore citation networks of gendered authors. Following the work of Ciotti and co-authors [75], we envision applying our gender-oriented perspective to the definition of networks of authors based on the overlap between their common references. The network approach might then allow for clustering gendered authors into different groups depending on multiple characteristics of their bibliographic references. Moreover, applying well-known statistical tools inspired by network studies in other domains may bring important contributions to the study of networks of scientific collaboration. We envision that finding structural differences between citation networks of different types may indicate their usefulness in a more applied context, as tools for knowledge diffusion and transfer.


Financial support by FCT (Fundação para a Ciência e a Tecnologia), Portugal is gratefully acknowledged. This article is part of the Strategic Project: UID/ECO/00436/2013. The authors thank R. Vilela Mendes for providing help in the identification of important physics concepts. The research reported in this paper is based on the findings of the PLOTINA project ("Promoting gender balance and inclusion in research, innovation and training"), which has received funding from the European Union's Horizon 2020 research and innovation programme, under Grant Agreement N. 666008. The views and opinions expressed in this publication are the sole responsibility of the authors and do not necessarily reflect the views of the European Commission.


  • [1] Feldman, M., Kenney, M., Lissoni, F., 2015a. The new data frontier: Special issue of research policy. Research Policy, 44(9), pp.1629-1632.
  • [2] Bukovina, J., 2016. Social media big data and capital markets-An overview. Journal of Behavioral and Experimental Finance, 11, pp.18-26.
  • [3] Curme, C., Stanley, H.E., Vodenska, I., 2015. Coupled Network Approach to Predictability of Financial Market Returns and News Sentiments. International Journal of Theoretical and Applied Finance, v.18, n.7, pp.1-26.
  • [4] Banisch, S., Lima, R., Araújo, T., 2012. Agent based models and opinion dynamics as Markov chains. Social Networks 34, no. 4, 549-561.
  • [5] Weichselbraun, A., Gindl, S., Scharl, A., 2014. Enriching semantic knowledge bases for opinion mining in big data applications. Knowledge-Based Systems, 69, pp.78-85.
  • [6] Pulkki-Brännström, A.M., Stoneman, P., 2013. On the patterns and determinants of the global diffusion of new technologies. Research Policy, 42(10), pp.1768-1779.
  • [7] Feldman, M.P., Kogler, D.F., Rigby, D.L., 2015b. rKnowledge: the spatial diffusion and adoption of rDNA methods. Regional studies, 49(5), pp.798-817.
  • [8] Geuna, A., Kataishi, R., Toselli, M., Guzmán, E., Lawson, C., Fernandez-Zubieta, A., Barros, B., 2015. SiSOB data extraction and codification: A tool to analyze scientific careers. Research Policy, 44(9), pp.1645-1658.
  • [9] Fontainha, E., Martins, J. T., Vasconcelos, A. C., 2015. Network analysis of a virtual community of learning of economics educators. Information Research, 20 (1).
  • [10] Adriaanse, L. S., Rensleigh, C., 2013. Web of Science, Scopus and Google Scholar: A content comprehensiveness comparison. The Electronic Library, 31(6), pp.727-744.
  • [11] Bakkalbasi, N., Bauer, K., Glover, J., Wang, L., 2006. Three options for citation tracking: Google Scholar, Scopus and Web of Science. Biomedical digital libraries, 3(1), p.7.
  • [12] Falagas, M.E., Pitsouni, E.I., Malietzis, G.A., Pappas, G., 2008. Comparison of PubMed, Scopus, web of science, and Google scholar: strengths and weaknesses. The FASEB journal, 22(2), pp.338-342.
  • [13] Harzing, A.W., Alakangas, S., 2016. Google Scholar, Scopus and the Web of Science: a longitudinal and cross-disciplinary comparison. Scientometrics, 106(2), pp.787-804.
  • [14] Kulkarni, A.V., Aziz, B., Shams, I., Busse, J.W., 2009. Comparisons of citations in Web of Science, Scopus, and Google Scholar for articles published in general medical journals. Jama, 302(10), pp.1092-1096.
  • [15] Meho, L.I., Yang, K., 2007. Impact of data sources on citation counts and rankings of LIS faculty: Web of Science versus Scopus and Google Scholar. Journal of the American Society for Information Science and Technology, 58(13), pp.2105-2125.
  • [16] Evans, T. S., 2012. Universality of Performance Indicators Based on Citation and Reference Counts. Scientometrics, vol. 93, no. 2, 2012, pp. 473-495.
  • [17] Sorenson, O., 2006. Complexity, Networks and Knowledge Flow. Research Policy, vol. 35, no. 7, pp. 994-1017.
  • [18] Aharonson, B. S., M. A. Schilling, 2016. Mapping the Technological Landscape: Measuring Technology Distance, Technological Footprints, and Technology Evolution. Research Policy, vol. 45, no. 1, pp. 81-96.
  • [19] Bornmann, L., Wagner, C., Leydesdorff, L., 2015. BRICS countries and scientific excellence: A bibliometric analysis of most frequently cited papers. Journal of the Association for Information Science and Technology, 66(7), pp.1507-1513.
  • [20] Azagra-Caro, J.M., Barberá-Tomás, D., Edwards-Schachter, M., Tur, E.M., 2016. Dynamic interactions between university-industry knowledge transfer channels: A case study of the most highly cited academic patent. Research Policy.
  • [21] Araújo, T., Fontainha, E., 2017. The specific shapes of gender imbalance in scientific authorships: a network approach. Journal of Informetrics, 11(1), 88-102.
  • [22] Kuhn, T., Perc, M., Helbing, D., 2014. Inheritance patterns in citation networks reveal scientific memes. Physical Review X 4, no. 4, 041036.
  • [23] Dawkins, R., 1976. The Selfish Gene. Oxford Landmark Science, ISBN-13: 978-0198788607
  • [24] Shiller, R. J., 2017. Narrative Economics. American Economic Review, 107(4), pp. 967-1004.
  • [25] Astebro, T., Thompson, P., 2011. Entrepreneurs, Jacks of all trades or Hobos? Research Policy, 40(5), pp.637-649.
  • [26] Bozeman, B., Gaughan, M., 2011. How Do Men and Women Differ in Research Collaborations? An Analysis of the Collaborative Motives and Strategies of Academic Researchers. Research Policy, vol. 40, no. 10, pp. 1393-1402.
  • [27] Brooks, C., Fenton, E.M., Walker, J.T., 2014. Gender and the evaluation of research. Research Policy, 43(6), pp.990-1001.
  • [28] Gonzalez-Brambila, C., Veloso, F.M., 2007. The determinants of research output and impact: A study of Mexican researchers. Research Policy, 36(7), pp.1035-1051.
  • [29] Sugimoto, C. R., 2015. On the Relationship between Gender Disparities in Scholarly Communication and Country-Level Development Indicators. Science and Public Policy, vol. 42, no. 6, pp. 789-810.
  • [30] Tartari, V., Salter, A., 2015. The engagement gap: Exploring gender differences in University-Industry collaboration activities. Research Policy, 44(6), pp.1176-1191.
  • [31] Van Rijnsoever, F.J., Hessels, L.K., 2011. Factors associated with disciplinary and interdisciplinary research collaboration. Research policy, 40(3), pp.463-472.
  • [32] Viana, M. P., 2013. On Time-Varying Collaboration Networks. Journal of Informetrics, vol. 7, no. 2, pp. 371-378.
  • [33] Ynalvez, M. A., Shrum, W. M., 2011. Professional Networks, Scientific Collaboration, and Publication Productivity in Resource-Constrained Research Institutions in a Developing Country. Research Policy, vol. 40, no. 2, 2011, pp. 204-216.
  • [34] Beaudry, C., Lariviere, V., 2016. Which gender gap? Factors affecting researchers’ scientific impact in science and medicine. Research Policy, 45(9), pp.1790-1817.
  • [35] Copenheaver, C. A., 2010. Lack of Gender Bias in Citation Rates of Publications by Dendrochronologists: What Is Unique About This Discipline? Tree-Ring Research, vol. 66, no. 2, pp. 127-133.
  • [36] de Melo-Martin, I., 2013. Patenting and the Gender Gap: Should Women Be Encouraged to Patent More? Science and Engineering Ethics, vol. 19, no. 2, pp. 491-504.
  • [37] Frietsch, R., Haller, I., Funken-Vrohlings, M., Grupp, H., 2009. Gender-specific patterns in patenting and publishing. Research Policy, vol. 38, no. 4, pp. 590-599.
  • [38] Giuri, P., Mariani, M., Brusoni, S., Crespi, G., Francoz, D., Gambardella, A., Garcia-Fontes, W., Geuna, A., Gonzales, R., Harhoff, D., Hoisl, K., 2007. Inventors and invention processes in Europe: Results from the PatVal-EU survey. Research policy, 36(8), pp.1107-1127.
  • [39] Ghiasi, G., 2015. On the Compliance of Women Engineers with a Gendered Scientific System. Plos One, vol. 10, no. 12, p. 19.
  • [40] Hunt, J., Garant, J.-P., Herman, H., Munroe, D., 2013. Why Are Women Underrepresented Amongst Patentees? Research Policy, vol. 42, no. 4, pp. 831-843.
  • [41] Jung, T., Ejermo, O., 2014. Demographic Patterns and Trends in Patenting: Gender, Age, and Education of Inventors. Technological Forecasting and Social Change, vol. 86, pp. 110-124.
  • [42] Mauleón, E., Bordons, M., 2010. Male and female involvement in patenting activity in Spain. Scientometrics, vol. 83, no. 3, pp. 605-621.
  • [43] Meng, Y., 2016. Collaboration Patterns and Patenting: Exploring Gender Distinctions. Research Policy, vol. 45, no. 1, pp. 56-67.
  • [44] Mihaljevic-Brandt, H., 2016. The Effect of Gender in the Publication Patterns in Mathematics. Plos One, vol. 11, no. 10, p. 23.
  • [45] Okon-Horodynska, E., Zachorowska-Mazurkiewicz, A., Wisla, R., Sierotowicz, T., 2015. Gender in the Creation of Intellectual Property of the Selected European Union Countries. Economics & Sociology, vol. 8, no. 2, pp. 115-125.
  • [46] Franceschini, F., Maisano, D., Mastrogiacomo, L., 2016. The museum of errors/horrors in Scopus. Journal of Informetrics, 10(1), pp.174-182.
  • [47] Web of Science Web Page
  • [48] Testa, J., 2016. The Thomson Reuters Journal Selection Process.
  • [49] Lariviere, V., 2008. Long-Term Variations in the Aging of Scientific Literature: From Exponential Growth to Steady-State Science (1900-2004). Journal of the American Society for Information Science and Technology, vol. 59, no. 2, pp. 288-296.
  • [50] Scopus Elsevier Web Page
  • [51] Elsevier, 2017. Gender in the Global Research Landscape - Analysis of research performance through a gender lens across 20 years, 12 geographies, and 27 subject areas. Elsevier.
  • [52] RePEc Genealogy Web Page
  • [53] Ginsparg, P., 2011. arXiv at 20. Nature, vol. 476, no. 7359, pp. 145-147.
  • [54] Bar-Ilan, J., 2014. Astrophysics Publications on arXiv, Scopus and Mendeley: A Case Study. Scientometrics, vol. 100, no. 1, pp. 217-225.
  • [55] Brody, T., 2006. Earlier Web Usage Statistics as Predictors of Later Citation Impact. Journal of the American Society for Information Science and Technology, vol. 57, no. 8, pp. 1060-1072.
  • [56] Davis, P. M., Fromerth, M. J., 2007. Does the arXiv Lead to Higher Citations and Reduced Publisher Downloads for Mathematics Articles? Scientometrics, vol. 71, no. 2, pp. 203-215.
  • [57] Goldberg, S. R., 2015. Modelling Citation Networks. Scientometrics, vol. 105, n. 3, pp. 1577-1604.
  • [58] Haque, A. U., Ginsparg, P., 2010. Last but Not Least: Additional Positional Effects on Citation and Readership in arXiv. Journal of the American Society for Information Science and Technology, vol. 61, no. 12, pp. 2381-2388.
  • [59] Blevins, C., Mullen, L., 2015. Jane, John … Leslie? A Historical Method for Algorithmic Gender Prediction. Digital Humanities Quarterly 9, no. 3.
  • [60] Mullen, L., 2016. Predict Gender from Names Using Historical Data. R package version 0.5.2.
  • [61] Leskovec, J., Sosič, R., 2014. SNAP: A General-Purpose Network Analysis and Graph-Mining Library. ACM Transactions on Intelligent Systems and Technology, 8.
  • [62] Abramo, G., D’Angelo, C. A., Murgia, G., 2013. Gender differences in research collaboration. Journal of Informetrics, 7, 811-822.
  • [63] Naldi, F., Parenti, V., 2002. Scientific and Technological Performance by Gender. A feasibility study on Patent and Bibliometric Indicators, Vol. II : Methodological.
  • [64] Naldi, F., Luzi, D., Valente, A., Parenti, V., 2004. Scientific and technological performance by gender. Moed, Henk, F., Glänzel, W., Schmoch, U. (Eds.), Handbook of Quantitative Science and Technology Research - The Use of Publication and Patent Statistics in Studies of S&T Systems. Kluwer Academic Publishers, Dordrecht/Boston/London, pp. 299-314.
  • [65] Kosmulski, M., 2015. Gender disparity in Polish science by year (1975-2014) and by discipline. Journal of Informetrics, 9(3), pp.658-666.
  • [66] Qiu, J., 2008. Scientific publishing: Identity crisis. Nature 451, 766-767.
  • [67] Voyant-Tools:
  • [68] Sinclair, S., Rockwell, G., 2016. Voyant Tools. Web.
  • [69] Kurths, J., Zamora-López, G., Russo, E., Gleiser, Pablo M., Zhou, C., 2011. Characterizing the complexity of brain and mind. Phil. Trans. R. Soc. A. 369, 3730-3747.
  • [70] Araújo T., Banisch S., 2016. Multidimensional Analysis of Linguistic Networks, Towards a Theoretical Framework for Analyzing Complex Linguistic Networks 107-131. Springer Berlin Heidelberg.
  • [71] Granovetter, M., 1973. The Strength of Weak Ties. American Journal of Sociology, vol. 78, n. 6, pp. 1360-1380.
  • [72] Krichel, T., Bakkalbasi, N., 2006. A social network analysis of research collaboration in the economics community. Journal of Information Management and Scientometrics, 3, 1-12.
  • [73] Borgatti, S. P., Mehra, A., Brass, D. J., Labianca, G., 2009. Network analysis in the social sciences. Science, 323(5916), 892-895.
  • [74] Cainelli, G., Maggioni, M. A., Uberti, T. E., de Felice, A., 2015. The strength of strong ties: How co-authorship affect productivity of academic economists? Scientometrics, 102, 673-699.
  • [75] Ciotti, V., Bonaventura, M., Nicosia, V., Panzarasa, P., Latora, V., 2016. Homophily and missing links in citation networks. EPJ Data Science, 5(1), p.7.
  • [76] Ferrer i Cancho, R., 2012. Bibliography on linguistic, cognitive and brain networks.
  • [77] West, J., Jacquet, J., King, M., Correll, S., Bergstrom, C., 2013. The Role of Gender in Scholarly Authorship. PLoS ONE 8.
  • [78] European Institute for Gender Equality Web Page