1 Measuring homophily in coauthorships
1.1 Quantifying assortativity
We consider the unit of analysis to be an authorship (an instance of coauthoring a single article) rather than an author who may have coauthored multiple articles; implications are addressed in the Discussion. For a set of authorships, Ted Bergstrom’s $\alpha$ (bergstrom2003algebra) is $\alpha = p - q$, where
$p$ is the probability that a randomly selected coauthor of a randomly selected male authorship is male, and
$q$ is the probability that a randomly selected coauthor of a randomly selected female authorship is male.
Our analysis framework does not depend on a particular metric; however, we use $\alpha$ because of its interpretation as a difference in risks and its connection to Wright’s coefficient of inbreeding (wright1949genetical). Indeed, $\alpha$ is a generalization of Wright’s coefficient to the multi-author scenario, and the two measures are equivalent when all papers have two authors. bergstrom2016index show that $\alpha$ is equal to the observed coauthor-gender correlation in a given collection of papers, while wang2016relationship show that $\alpha$ is equivalent to Newman’s network-based assortativity coefficient (newman2003mixing) in an appropriately weighted network.
For a set of authorships, the possible values of $\alpha$ depend on various structural aspects such as the gender ratio, the total number of authorships, and the number of authorships on each paper. For concreteness, suppose all papers have two authors, and let $m$ be the proportion of the less frequent gender. If the proportion of female-male papers is what we would expect under random pairings, $2m(1-m)$, then $\alpha = 0$; if there are no female-male papers, $\alpha = 1$; if the proportion of female-male papers is the largest attainable value, $2m$, then $\alpha = -m/(1-m)$, which is between $-1$ and $0$.
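The three two-author cases above can be checked numerically. The sketch below is only an arithmetic check under the stated assumption that every paper has exactly two authors; the function name and the symbols `x` (proportion of female-male papers) and `m` (proportion of the less frequent gender) are illustrative choices, not taken from the analysis code.

```python
def alpha_two_author(x, m):
    """Alpha = p - q when every paper has exactly two authors.

    x: proportion of papers that are female-male (0 <= x <= 2 * m)
    m: proportion of authorships held by the less frequent gender (0 < m <= 0.5)
    Alpha is unchanged if the two gender labels are swapped, so it does not
    matter which gender is the less frequent one.
    """
    # Probability that the coauthor of a majority-gender authorship
    # is also of the majority gender.
    same_given_majority = 1 - x / (2 * (1 - m))
    # Probability that the coauthor of a minority-gender authorship
    # is of the majority gender.
    majority_given_minority = x / (2 * m)
    return same_given_majority - majority_given_minority


m = 0.25  # hypothetical value for illustration
random_pairing = alpha_two_author(2 * m * (1 - m), m)   # expected mixing
no_mixed_papers = alpha_two_author(0.0, m)              # complete segregation
max_mixed_papers = alpha_two_author(2 * m, m)           # every minority author on a mixed paper
print(random_pairing, no_mixed_papers, max_mixed_papers)
```

With these inputs the first two values are 0 and 1, and the third equals $-m/(1-m)$ up to floating-point error.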
1.2 Homophily in a heterogeneous setting
When analyzing papers from across the scholarly landscape, observing a large value of $\alpha$ is not sufficient to indicate that gender plays an active role in coauthorship decisions. To elucidate this point, we distinguish between the structural, behavioral, and compositional aspects of gender homophily.
Within a tightly focused intellectual community, if coauthorships are formed randomly, the expected $\alpha$ should be near 0, but the exact distribution of $\alpha$ in each community depends on the various structural aspects mentioned previously. For instance, in Figure 1, when authorships are permuted within their fields without regard to gender, the expected values of $\alpha$ for fields $A$ and $B$ are not exactly 0. We use structural homophily to describe the deviation of $\alpha$ from 0 that arises due to these structural aspects.
When examining aggregations of intellectual communities, we no longer expect to see teams formed completely at random because individuals are more likely to coauthor with others who share intellectual interests. Given that gender representation varies across disciplines, collaboration along the lines of shared interests generates homophily that we denote compositional homophily. Figure 1 provides an illustrative example with two fields, $A$ and $B$, with 4 papers each. Within each field, the observed configuration of authorships yields its own value of $\alpha$; however, when all 8 papers are aggregated, the resulting $\alpha$ differs from the within-field values. This occurs because the proportion of females in $B$ is much greater than that in $A$. If the intellectual interests in $A$ differ from those in $B$, it would be reasonable to conclude that the observed gender homophily is actually driven by discipline homophily. blau1977inequality uses “consolidation” to describe homophily induced by factors (in this case discipline) which are associated with the factor of interest (in this case gender). In statistics, it is a case of Simpson’s paradox (simpson1951contingency) where homophily is confounded by gender imbalances across scholarly fields. In population genetics, this phenomenon is known as the Wahlund effect (walhund1928zuzammensetzung): random mating within subpopulations does not imply Hardy-Weinberg equilibrium in the population as a whole. Indeed, under the Wahlund effect, we would not expect Wright’s F (which would be equivalent to $\alpha$) to be 0.
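A small numerical illustration of compositional homophily may help. The two fields below are hypothetical (they are not the fields of Figure 1) and contain only two-author papers; each field has $\alpha = 0$ on its own, yet pooling them produces a positive $\alpha$ because one field is mostly male and the other mostly female.

```python
def alpha_from_counts(n_fm, n_mm, n_ff):
    """Alpha = p - q for two-author papers, given counts of each paper type."""
    # p: share of male authorships whose single coauthor is male.
    p = 2 * n_mm / (2 * n_mm + n_fm)
    # q: share of female authorships whose single coauthor is male.
    q = n_fm / (2 * n_ff + n_fm)
    return p - q


# Hypothetical counts: (female-male, male-male, female-female) papers.
field_x = (6, 9, 1)   # 75% of authorships are male
field_y = (6, 1, 9)   # 75% of authorships are female
pooled = tuple(a + b for a, b in zip(field_x, field_y))

print(alpha_from_counts(*field_x))  # within field X
print(alpha_from_counts(*field_y))  # within field Y
print(alpha_from_counts(*pooled))   # both fields aggregated
```

In this constructed example the within-field values are both 0 while the pooled value is 0.25, so the aggregate homophily is entirely compositional.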
In contrast to structural and compositional homophily, which could occur even if authors select coauthors irrespective of gender, we use behavioral homophily to describe deviations of $\alpha$ from its expected value under structural and compositional homophily which could be due to explicit or implicit consideration of gender when selecting coauthors.
These notions of homophily map onto the two components of homophily discussed by mcpherson2001birds, baseline homophily and inbreeding homophily, in a context of voluntary and professional network ties such as those of friendship, support, and advice. Specifically, they described baseline homophily as homophily “created by the demography of the potential tie pool” and inbreeding homophily as “homophily measured as explicitly over and above the opportunity set” (mcpherson2001birds, p. 419). Our notion of structural homophily aligns with the baseline homophily of mcpherson2001birds, though we prefer the term “structural,” which does not have temporal connotations (i.e., it does not evoke baseline observations in longitudinal studies). mcpherson2001birds emphasize that their definition of inbreeding homophily does not refer to “choice homophily purified of structural factors,” but instead encompasses “homophily induced by social structures below the population level to homophily induced by other dimensions with which the focal dimension is correlated, and to homophily induced by personal preferences.” (mcpherson2001birds, p. 419). Indeed, compositional homophily accounts for homophily induced by “structures below the population level” and the correlated dimension of intellectual interests, and we assume the remaining homophily, behavioral homophily, is “induced by personal preferences.” We acknowledge, however, that the behavioral homophily we measure may not be strictly due to gender and may be correlated with other dimensions of social stratification that we do not observe in our data, such as race or ethnicity.
1.3 Data
In the JSTOR corpus, “subpopulations” which may create compositional homophily correspond to tight intellectual communities that focus on similar research questions. To identify these communities, we apply a hierarchical implementation of the InfoMap network clustering algorithm to the citation network on the JSTOR corpus (rosvall2008maps; rosvall2011multilevel). The algorithm reveals the hierarchical structure of the corpus through efficient coding of random walks on the citation network. At the lowest level of the clustering, each paper is grouped into one of 1,450 terminal fields which form the finest partition of the data. These terminal fields are indicative of scholarly communities tied by shared narrow research topics or methodologies. Each higher level of the clustering forms a progressively coarser partition of the documents by aggregating terminal fields into composite fields. Finally, there are 24 identified top-level fields that are indicative of disciplinary divisions (west2013role) such as molecular and cell biology, economics, statistics, and sociology. The hierarchical structure obtained from the InfoMap algorithm has up to 6 levels. At any given level of the hierarchy, papers in a common field are more connected to each other via citations than they are to papers from neighboring fields. Likewise, the fields defined by a lower (finer) level of the hierarchy are more connected than fields defined by a higher (coarser) level in the hierarchical clustering. This hierarchical clustering allows us to test for behavioral homophily at varying levels of granularity. An interactive browser of the clustering can be accessed at the Eigenfactor browser: http://eigenfactor.org/projects/gender_homophily.
We include all papers clustered into one of the terminal fields that were published between 1960 and 2011 and have more than one author. After the cleaning procedure described in Section 4, this amounts to 252,413 papers with 807,588 authorships. We impute the gender of authorships using first names, as discussed in Section 4.1 and the supplement.
2 Results
2.1 Measuring Behavioral Homophily
To estimate the contributions of structural, compositional, and behavioral homophily to the overall observed homophily, we compare the observed $\alpha$ to $\alpha$ measured on plausible hypothetical configurations which aim to reflect all relevant aspects of coauthorship choice except behavioral homophily. Specifically, we sample these configurations from the null distribution, described below and given explicitly in (2), that encodes the null hypothesis of no behavioral homophily. Roughly speaking, we fix the papers and field structure and shuffle authorships so that coauthorships are formed without regard to gender. Systematic differences between the observed $\alpha$ and the values from the null would suggest behavioral gender homophily.
To reflect the underlying structural homophily, we restrict the distribution to configurations that preserve structural aspects of our data: i.e., the total number of male/female authorships, the number of authorships on each paper, and the number of papers/authorships in each field. To capture compositional homophily and scholarly connectivity across terminal fields, we allow inter-terminal-field swaps with a probability proportional to the flow of citations between the authorship’s original terminal field and other terminal fields in the corpus. Configurations with authorships in their original or nearby (as defined by citation flows) terminal fields are much more likely than configurations where they are far away; in almost all cases, an authorship remains in its original terminal field with probability above 0.9. This ensures that under the null distribution, the gender ratio of any terminal field stays close to the observed ratio. However, inter-field swaps may occur with small probability to reflect cross-field collaborations; this also makes the null distribution less sensitive to the otherwise discrete assignment of documents to terminal fields. Finally, we treat authorships within a terminal field as exchangeable; i.e., in the counterfactual world, all authorships are equally likely to appear on any other paper in the terminal field and coauthorships are formed without regard to gender.
We calculate $\alpha$ for each configuration and test for the presence of behavioral homophily. The p-value for each field is the proportion of $\alpha$’s from the null distribution which are greater than or equal to the observed $\alpha$. A small p-value implies that under only structural and compositional factors, the observed $\alpha$ is unlikely to occur and suggests the presence of behavioral homophily. Direct sampling from the distribution is intractable, so we use a Markov chain Monte Carlo Metropolis-Hastings procedure.
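The logic of the test can be sketched with a much simpler null than the one used in the paper. The code below assumes a single field, shuffles gender labels uniformly across fixed authorship slots (so paper sizes and gender totals are preserved), and ignores inter-field swaps and the Metropolis-Hastings sampler entirely; function names and the sample size are illustrative.

```python
import random


def alpha(papers):
    """Alpha = p - q; each paper is a list of 'M'/'F' labels with >= 2 entries."""
    male_props, female_props = [], []
    for paper in papers:
        n_male = paper.count("M")
        for g in paper:
            # Proportion of this authorship's coauthors who are male.
            prop = (n_male - (g == "M")) / (len(paper) - 1)
            (male_props if g == "M" else female_props).append(prop)
    p = sum(male_props) / len(male_props)
    q = sum(female_props) / len(female_props)
    return p - q


def permutation_p_value(papers, n_samples=500, seed=0):
    """Share of gender-blind shuffles whose alpha is >= the observed alpha."""
    rng = random.Random(seed)
    observed = alpha(papers)
    sizes = [len(paper) for paper in papers]
    pool = [g for paper in papers for g in paper]
    hits = 0
    for _ in range(n_samples):
        rng.shuffle(pool)
        shuffled, start = [], 0
        for size in sizes:  # refill each paper, keeping its number of authorships
            shuffled.append(pool[start:start + size])
            start += size
        if alpha(shuffled) >= observed - 1e-12:  # tolerance for rounding
            hits += 1
    return observed, hits / n_samples


segregated = [["M", "M"]] * 10 + [["F", "F"]] * 10
print(permutation_p_value(segregated))
```

A fully segregated field like `segregated` gives an observed $\alpha$ of 1 and a very small p-value, whereas a field made up only of female-male papers gives a p-value of 1.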
2.2 Main Analysis
Table 1 summarizes results for the entire JSTOR corpus and all top-level fields. The first column gives both the observed $\alpha$ and the expected $\alpha$ from the null distribution. The expected $\alpha$ is positive for every top-level field, implying that even when collaborator choices are gender-blind, same-gender coauthorships are expected to occur more often simply because of the structure and gender composition of these fields. Also, the observed $\alpha$ exceeds the expected $\alpha$ in all top-level fields. Figure 2 provides a representation of the hierarchical clustering for Economics; the observed $\alpha$ is .11, but given only structural and compositional homophily, we would expect an $\alpha$ of .02 (Table 1). Similar illustrations for all fields are available in the interactive browser.
For concrete interpretation, consider a setting where every field consists of 100 two-author papers. Then for a given $\alpha$ and proportion of female authorships $f$, $200 f (1-f)(1-\alpha)$ is the number of heterophilous (female-male) papers. In the FM Papers column (Table 1), we report this number for the observed/expected $\alpha$ of each top-level field, setting $f$ to the proportion of observed female authorships in that field. Note that the magnitude of changes in $\alpha$ does not always correspond to observed papers in a direct way; e.g., Education and Organizational and Marketing have similar observed and expected $\alpha$ values, but the difference between the observed and expected heterophilous papers is larger in Education than in Organizational and Marketing (6.0 vs 4.2) because Education has a larger proportion of female authorships.
We perform hypothesis tests for all top-level, composite, and terminal fields using p-values adjusted by the Benjamini-Yekutieli (benjamini2001control) procedure to control the false discovery rate. The procedure is likely quite conservative in our setting; in the supplement, we provide results for less conservative multiple testing procedures and different false discovery rates. We reject the null hypothesis of no behavioral homophily in the JSTOR corpus as a whole, in 20/24 (83%) top-level fields, in 82/280 (29%) composite fields (not including the top-level fields), and in 124/1450 (9%) terminal fields. Across JSTOR, and in almost every top-level field, the incidence of significant behavioral homophily in composite fields is at least as large as the incidence among terminal fields. We posit two reasons for this. First, composite fields are an aggregation of terminal fields, and we expect that behavioral homophily in a composite field aggregates roughly as an “or” operator over its terminal fields (i.e., behavioral homophily in a single terminal field typically implies behavioral homophily in the corresponding composite field). However, as seen in the Eigenfactor browser, there are composite fields with significant homophily despite having no significant terminal fields. Thus, we also posit that composite fields have higher testing power due to their larger size. In general, there is a tradeoff between increasing testing power by aggregating data and controlling for confounders by analyzing the data at a fine-grained level. This highlights the benefit of our approach, which allows testing homophily in composite fields, where we have more power, while still accounting for compositional effects.
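For readers unfamiliar with the Benjamini-Yekutieli adjustment, a minimal pure-Python sketch follows. It implements the standard step-up adjustment, in which the $k$-th smallest of $n$ p-values is multiplied by $n \, c(n) / k$ with $c(n) = \sum_{j=1}^{n} 1/j$ and monotonicity is then enforced; the example p-values and the 0.05 threshold are made up for illustration and are not the values used in the analysis.

```python
def benjamini_yekutieli(p_values):
    """Return Benjamini-Yekutieli adjusted p-values in the input order."""
    n = len(p_values)
    c_n = sum(1.0 / k for k in range(1, n + 1))  # harmonic correction factor
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is the
    # minimum over all equal-or-larger ranks (step-up monotonicity).
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n * c_n / rank)
        adjusted[idx] = running_min
    return adjusted


raw = [0.001, 0.02, 0.03, 0.5]           # hypothetical field-level p-values
adj = benjamini_yekutieli(raw)
rejected = [a <= 0.05 for a in adj]      # 0.05 is an assumed FDR level
print(adj, rejected)
```

Because of the extra factor $c(n)$, the adjustment is more conservative than the Benjamini-Hochberg procedure, which is consistent with the remark above that it is likely conservative in this setting.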
Field  $\alpha$ (Obs / Exp)  FM Papers (Obs / Exp)  P-value  Signif / Total (Term)  Signif / Total (Comp)  
JSTOR  .11/.05  38.6/41.1  .00  124/1450  82/280 
Mol/Cell Bio  .05/.01  38.2/39.8  .00  29/178  19/44 
Eco/Evol  .06/.02  31.9/33.3  .00  17/257  15/56 
Economics  .11/.02  18.7/20.8  .00  9/136  11/28 
Sociology  .19/.07  38.5/44.2  .00  13/94  12/21 
Prob/Stat  .09/.03  26/27.8  .00  1/90  2/23 
Org/mkt  .16/.04  29/33.2  .00  8/68  3/4 
Education  .16/.04  41.2/47.2  .00  12/42  6/10 
Occ Health  .10/.02  41.7/45.5  .00  12/24  1/1 
Anthro  .12/.03  38.5/42  .00  5/63  2/8 
Law  .17/.08  29.7/32.9  .00  0/98  1/16 
History  .16/.07  32.9/36.5  .00  0/49  1/6 
Phys Anthro  .07/.01  34.7/36.8  .00  1/32  2/10 
Intl Poli Sci  .09/.02  27.3/29.1  .03  0/34  0/2 
US Poli Sci  .15/.07  25.2/27.4  .00  2/37  1/6 
Philosophy  .10/.03  18.7/20.3  .03  0/45  0/8 
Math  .04/.01  14.3/14.7  1.00  0/46  0/9 
Vet Med  .09/.01  38.3/41.6  .00  7/19  1/2 
Cog Sci  .18/.09  35.7/39.4  .00  4/14  3/3 
Radiation  .09/.01  34/36.8  .00  3/14  1/5 
Demography  .15/.06  40.3/44.4  .00  0/20  1/2 
Classics  .07/.01  38.7/41.2  .27  0/35  0/8 
Opr Res  .03/.00  16.6/17.1  .73  0/18  0/4 
Plant Phys  .08/.02  29.1/31  .03  1/21  0/3 
Mycology  .03/.01  36.8/37.5  1.00  0/16  0/1 
In the supplement, we compare results from our approach to those from a naive approach that does not account for compositional homophily and only accounts for structural homophily by treating all individuals within a given composite or top-level field as exchangeable. In this analysis, we find significant homophily in 21/24 top-level fields and 157/280 composite fields. Unsurprisingly, in almost all cases, the expected $\alpha$ under the null distribution which accounts for compositional homophily is larger than the expected $\alpha$ when only accounting for structural homophily.
2.3 Secondary Analysis
We also test whether certain characteristics of a terminal field are associated with significant behavioral homophily observed in the multi-author papers. We fit a logistic regression for all terminal fields where the outcome is whether or not statistically significant behavioral homophily is detected and the covariates are the ratio of the percentage of solo authorships which are female to the percentage of authorships on multi-authored papers which are female, the proportion of all authorships (solo and multi) which are female, and the log of the number of authorships.
Previous work (boschini2007team) has shown that gender homophily is positively associated with increased female representation. We also include an indicator for whether the field is majority female and the interaction term; this allows the association between behavioral homophily and female % to change slope in majority-female terminal fields, since gender dynamics may systematically differ in majority-female fields. Furthermore, where concerns about gender discrimination are common, we might observe a positive association across subfields between relative rates of solo authorship for the lower-frequency gender (women in most cases) and increased behavioral homophily, since both would be rational choices in reaction to gender discrimination in collaboration (rubin2017discrimination; oconnor2016dynamics; ferber1980disadvantage; mcdowell1992effect).
We calculate robust standard errors using a generalized estimating equation procedure with a diagonal working covariance where the clusters correspond to toplevel fields.
(1) 
The ratio of female solo authorships to female multi-authorships is not significant; however, field size and the proportion of females have a statistically significant positive association with behavioral homophily. The estimate of the interaction term is negative, but neither the majority-female indicator nor the interaction term is significant (p-values of .07 and .052). Further details are provided in the supplement.
2.4 Sensitivity to missing gender indicators
Our main analysis used gender indicators for the 87.9% of authorships with first names that are used predominantly for one gender and removed the other 12.1% of authorships (see Section 4.1). This rate of missingness compares favorably with previous studies (sugimoto2013global), and we explore the impact of missingness with a sensitivity analysis using two multiple imputation strategies.
The first strategy imputes each missing indicator according to the proportions of assigned genders in its original terminal field. This assumes that there is no behavioral homophily in the missing data because the imputed genders are conditionally independent given the terminal field, providing a reasonable lower bound on the homophily we might have obtained given the full data. The second strategy imputes each missing gender indicator according to the proportions of assigned genders on its original paper; if a paper contains only unassigned authorships, we impute a single gender for all authorships according to the proportions of assigned genders for the terminal field. By construction, papers with one or no assigned authorships are always homophilous; thus this imputation strategy provides a reasonable upper bound on the homophily we might have observed given the full data. We repeat the main analysis procedure to test for behavioral homophily in each of the imputed data sets; details are given in the supplement. For each strategy, Table 2 shows the average proportion (across 10 imputations) of fields with significant behavioral homophily. In both strategies, we assume that the observed gender proportions are good estimates of the true gender proportions, and we do not address bias which may be induced if one gender is more likely to be unidentified than the other. On this point, Sugimoto et al. (sugimoto2013global) hand-checked a sample of 1000 authorships randomly selected across all fields. For names for which no prior records existed, the proportions of men and women (.68 and .32) were consistent with the proportions of men and women in the classified names (.69 and .31 for author-name combinations). In names which were not classified due to prevalent use for both genders, men were slightly overrepresented (.79 and .21).
Analysis  Terminal  Composite  Top 
Main  0.09  0.29  0.83 
Sensitivity (low homophily)  0.07  0.25  0.78 
Sensitivity (high homophily)  0.54  0.82  1.00 
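The two imputation strategies of Section 2.4 can be sketched for a single terminal field as follows. This is an illustrative reading of the description, not the authors' code: missing genders are encoded as `None`, the function names are hypothetical, and each call produces one imputed data set (the analysis averages over 10 such imputations).

```python
import random


def _draw(prop_female, rng):
    """Draw 'F' with probability prop_female, otherwise 'M'."""
    return "F" if rng.random() < prop_female else "M"


def _field_prop_female(papers):
    known = [g for paper in papers for g in paper if g is not None]
    return known.count("F") / len(known)


def impute_by_field(papers, rng):
    """Low-homophily strategy: draw from the field-wide assigned proportions."""
    prop_f = _field_prop_female(papers)
    return [[g if g is not None else _draw(prop_f, rng) for g in paper]
            for paper in papers]


def impute_by_paper(papers, rng):
    """High-homophily strategy: draw from the assigned proportions on each paper."""
    field_prop_f = _field_prop_female(papers)
    imputed = []
    for paper in papers:
        assigned = [g for g in paper if g is not None]
        if assigned:
            prop_f = assigned.count("F") / len(assigned)
            imputed.append([g if g is not None else _draw(prop_f, rng)
                            for g in paper])
        else:
            # No assigned authorships: one draw is shared by the whole paper.
            imputed.append([_draw(field_prop_f, rng)] * len(paper))
    return imputed


field = [["F", None, None], [None, None], ["M", "F", None], ["M", "M"]]
print(impute_by_field(field, random.Random(0)))
print(impute_by_paper(field, random.Random(0)))
```

Under the paper-level strategy, a paper with a single assigned authorship is always completed with that same gender, which is why it yields an upper bound on homophily.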
3 Discussion
When controlling for the hierarchical structure of scholarly communities and for field-specific cultures of collaboration, we observe behavioral gender homophily in coauthorships across wide swaths of the JSTOR corpus. This holds across all levels of granularity, from top-level scholarly fields to intellectually narrow terminal fields.
Although we focus on gender and coauthorship, our methodology generalizes to studying homophily in other contexts where confounding occurs. For example, racial homophily may be confounded by spatial structure and homophily by illicit substance use in adolescents may be confounded by age or peer environment. Using a similar sampling procedure to control for observable structures could allow for a more nuanced analysis of homophily in these contexts as well.
While this methodology represents a substantial step in understanding homophily by allowing identification of its structural, compositional, and behavioral components, there are a number of other methodological issues that present fruitful avenues for future work. For example, there are likely compositional effects due to aggregating data across time because gender representation has changed over time. Our analysis addresses some temporal aspects explicitly by considering publications from a limited timespan and implicitly by using the hierarchical clustering, which may capture some time dynamics. However, a future analysis might directly incorporate temporal information into the null distribution. Also, because disambiguating authorships across papers is difficult without additional identifying information (torvik2009author), our analysis considers authorships rather than authors. However, in terminal fields with very few female authors, this may actually overestimate structural and compositional homophily (and underestimate behavioral homophily) by allowing for configurations with large $\alpha$ where multiple authorships corresponding to the same female author are reassigned to the same paper. In addition, since individuals are more likely to coauthor with previous coauthors, a future analysis with disambiguated authors could capture coauthorship dependency across papers. Finally, we choose not to include solo-authored papers because their inclusion would require strong modeling assumptions about the decision to write a solo-author paper versus to collaborate on a multi-author paper. However, it is unclear whether this systematically biases our behavioral homophily estimates.
In the secondary analysis, we find that female representation and field size are positively associated with statistically significant behavioral homophily. Scientifically, this result may seem counterintuitive on its face; however, it is not surprising from the perspective of homophily (mcpherson1987homophily): as the representation of women increases, it becomes more likely that same-gender individuals who are sufficiently compatible along other key dimensions become available as prospective coauthors. This is consistent with prior research in Economics which finds that behavioral homophily tends to be larger in subfields with a higher proportion of females (boschini2007team). Indeed, if women’s representation increases in areas of scholarship that remain stereotyped as male, women may have implicit or explicit preferences to collaborate with other women to protect against stereotype threat: experiments demonstrate that the presence of other women enhances women’s confidence, performance, and motivation in male-stereotyped domains (murphy2007signaling; sekaquaptewa2003solo; inzlicht2000threatening; stout2011steming; marx2002female). However, because more balanced gender representation and larger field size also increase the power of our testing procedure, future work is needed to disentangle whether these factors are actually associated with increased homophily or whether the association simply reflects our increased ability to detect homophily. We also find that the ratio of the proportion of single-authored papers written by females to the proportion of female multi-authorships is not significantly associated with behavioral homophily. However, an analysis which directly models single-author papers could be more conclusive.
Future work should further evaluate the short-term and long-term strategic value of gender-homophilous collaboration for women. In the short term, do women who engage in gender-homophilous relationships experience higher rates of retention in the authorship pool, productivity, and impact? (Sarsons (sarsons2015gender) shows that women who coauthor instead of solo-authoring are less likely to receive tenure, but the effect is less pronounced if women coauthor with women instead of men.) In the long term, do gender-homophilous coauthorships give rise to gender-homophilous intellectual communities? And if so, does increasing the ratio of women in an intellectual community lead to its devaluation or a decline in its impact, just as increasing the ratio of women in an occupation can decrease its prestige (goldin2014pollution)?
While many open questions remain, the direct implications of our current results are important: since behavioral gender homophily is not due to structural and compositional aspects such as gender imbalances across subdisciplines, and is endemic to some of the smallest intellectual communities, it might only be mitigated by changing the current cultural norms and perceptions that drive behavioral gender homophily within those communities.
4 Data and Methods
4.1 JSTOR Data
We impute authorship gender from first names by using name-specific gender percentages from Social Security and crowdsourced records. For each authorship, as in West et al. (west2013role), we treat gender as known if the respective first name (or one of the first names in the case of double names) is used for only one gender at least 95% of the time. We start by using United States Social Security Administration records, which allow gender imputation for 75.3% of authorships. Using genderize.io (wais2016gender), which obtains gender prevalence by first name from user profiles across major social networks, we impute gender for an additional 12.6% of authorships. The remaining 12.1% of authorship instances consist of 7.6% of authorships whose first names appear in neither database and 4.5% of authorships whose first names are used for both genders with at least 5% frequency. We provide detailed descriptive statistics in the supplement. In the main analysis, authorships with unimputed genders are omitted. For instance, a paper with 2 female, 1 male, and 2 unimputed authorships is treated as an article with 2 female and 1 male authorships.
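The 95% rule described above can be sketched as a small lookup function. Everything concrete here is hypothetical: the lookup table stands in for the Social Security and genderize.io records, its values are made up, and the handling of double names (the first name that passes the threshold decides) is an assumption, since the text does not say how conflicting names are resolved.

```python
def impute_gender(first_names, female_share, threshold=0.95):
    """Return 'F', 'M', or None for one authorship.

    first_names: the authorship's first name(s); double names give several.
    female_share: dict mapping a lowercase name to the share of its uses
        that are female (hypothetical stand-in for the name databases).
    """
    for name in first_names:
        share = female_share.get(name.lower())
        if share is None:
            continue  # name appears in no database
        if share >= threshold:
            return "F"
        if 1.0 - share >= threshold:
            return "M"
    return None  # unknown name, or used for both genders too often


# Made-up shares for illustration only.
shares = {"mary": 0.99, "john": 0.01, "jordan": 0.30}
print(impute_gender(["Mary"], shares))            # F
print(impute_gender(["Jordan"], shares))          # None
print(impute_gender(["Jordan", "John"], shares))  # M
```

Authorships for which the function returns `None` correspond to the unimputed authorships that the main analysis omits.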
The $\alpha$ values and p-values for each discipline and all the other descriptive statistics reported in this paper will be made openly available on our project website http://eigenfactor.org/projects/gender_homophily. Because the raw publication data are provided by JSTOR under license to the authors, requests for the raw data should be made to JSTOR directly. Code for the analysis and plots is available at https://github.com/ysamwang/genderHomophily.
4.2 Sampling procedure
We use $i$ to denote an authorship in $I$, the set of all authorship instances. Let $c = \{(f_i^{c}, d_i^{c})\}_{i \in I}$ be a configuration where $f_i^{c}$ and $d_i^{c}$ denote the terminal field and document to which authorship $i$ is assigned, let $c^{obs}$ denote the observed configuration, and let $w_{f,f'}$ denote the probability that an authorship originally observed in terminal field $f$ might instead author a paper in terminal field $f'$. We define the gender-blind null distribution as follows:
(2)  $P(c) = \dfrac{\prod_{i \in I} w_{f_i^{obs},\, f_i^{c}}}{\sum_{c' \sim c^{obs}} \prod_{i \in I} w_{f_i^{obs},\, f_i^{c'}}}$ if $c \sim c^{obs}$, and $P(c) = 0$ otherwise.
The equivalence relation $c \sim c^{obs}$ indicates that $c$ is a permutation of the authorships in $c^{obs}$ which preserves the total authorships per terminal field, the total numbers of male and female authorships, and the number of authorships per paper. Because the denominator in (2) cannot be calculated easily, we sample gender-blind hypothetical configurations indirectly using a Markov chain Monte Carlo Metropolis-Hastings sampling procedure. A description of how the $w_{f,f'}$ are determined by the observed citation flows and the details of the sampling procedure are provided in the supplement. For the main analysis, we use 75,000 samples from the null distribution after burn-in to calculate each p-value. For each of the sensitivity analyses, we use 9,000 samples from the null distribution after burn-in.
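A minimal sketch of a swap-based Metropolis-Hastings sampler is given below. It is not the authors' sampler (which is described in their supplement); it assumes that the null probability of a configuration is proportional to the product, over authorships, of the field-to-field probabilities, and it proposes swapping the positions of two uniformly chosen authorships. All names (`sample_configuration`, `orig_field`, `slot_field`, `w`) are illustrative.

```python
import random


def sample_configuration(orig_field, slot_field, w, n_steps, seed=0):
    """Run n_steps of swap proposals and return the final assignment.

    orig_field[i]: observed terminal field of authorship i.
    slot_field[s]: terminal field of the paper that owns authorship slot s;
        slot i is authorship i's observed position, so
        slot_field[i] == orig_field[i].
    w[a][b]: probability that an authorship observed in field a instead
        authors a paper in field b.
    Returns slot_of, where slot_of[i] is the slot occupied by authorship i.
    """
    rng = random.Random(seed)
    n = len(orig_field)
    slot_of = list(range(n))  # start from the observed configuration
    for _ in range(n_steps):
        i, j = rng.sample(range(n), 2)
        fi, fj = slot_field[slot_of[i]], slot_field[slot_of[j]]
        oi, oj = orig_field[i], orig_field[j]
        current = w[oi][fi] * w[oj][fj]    # weight of the two authorships now
        proposed = w[oi][fj] * w[oj][fi]   # weight after swapping positions
        # Symmetric proposal, so accept with probability min(1, proposed/current).
        if rng.random() * current < proposed:
            slot_of[i], slot_of[j] = slot_of[j], slot_of[i]
    return slot_of


fields = [0, 0, 0, 0, 1, 1, 1, 1]          # hypothetical: two terminal fields
w = [[0.95, 0.05], [0.05, 0.95]]           # hypothetical flow-based probabilities
print(sample_configuration(fields, fields, w, n_steps=1000))
```

Because slots are fixed and only occupants are exchanged, every sampled configuration keeps the number of authorships per paper and per field, and attaching a gender to each authorship keeps the gender totals as well. Computing $\alpha$ on each sampled configuration then gives draws from the null distribution of $\alpha$.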
References
Appendix A Measure of Homophily
Recall that $\alpha = p - q$, where $p$ is the probability that a randomly selected coauthor of a randomly selected male authorship is male and $q$ is the probability that a randomly selected coauthor of a randomly selected female authorship is male. We calculate $\alpha$ for the example given in Figure 1 in the main text.
For Field A, there are males and females. To calculate in the following equation, we calculate the proportion of male coauthors for each male authorship and then take the average. The values in the equation correspond to authorships from left to right. To calculate , we calculate the proportion of male coauthors for each female authorship and then take the average–again from left to right.
(3)  
For Field B, there is male and females.
(4)  
For Fields $A$ and $B$ combined, there are 12 males and 10 females.
(5)  
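In Section 2.2 of the main manuscript, we describe a concrete interpretation of $\alpha$ displayed in the “FM Papers” column of Table 1. In particular, we assume a field consists of 100 two-author papers and let $f$ and $1 - f$ be the proportions of female and male authorships, respectively. We can calculate the number (which may be fractional) of female-male papers ($n_{FM}$), male-male papers ($n_{MM}$), and female-female papers ($n_{FF}$) which would result in a specific $\alpha$. In this setting
$$\alpha = \frac{2 n_{MM}}{2 n_{MM} + n_{FM}} - \frac{n_{FM}}{2 n_{FF} + n_{FM}}, \qquad n_{FM} + n_{MM} + n_{FF} = 100, \qquad \frac{2 n_{FF} + n_{FM}}{200} = f,$$
because there are 100 papers total, and since $f$ is the proportion of female authorships. Solving for $n_{FM}$, $n_{MM}$, and $n_{FF}$ then yields:
$$n_{FM} = 200 f (1 - f)(1 - \alpha), \qquad n_{MM} = 100 (1 - f) - \tfrac{1}{2} n_{FM}, \qquad n_{FF} = 100 f - \tfrac{1}{2} n_{FM}.$$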
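The closed-form solution of this system can be checked numerically. The sketch below assumes the two-author setting of this appendix; the function names, the symbol `f` for the proportion of female authorships, and the input values are illustrative. It computes the three paper counts for a given $\alpha$ and `f` and then recomputes $\alpha = p - q$ from the counts to confirm the round trip.

```python
def paper_counts(alpha, f, n_papers=100):
    """Counts of (female-male, male-male, female-female) two-author papers."""
    n_fm = 2 * n_papers * f * (1 - f) * (1 - alpha)
    n_mm = n_papers * (1 - f) - n_fm / 2   # male authorships not on mixed papers
    n_ff = n_papers * f - n_fm / 2         # female authorships not on mixed papers
    return n_fm, n_mm, n_ff


def alpha_from_counts(n_fm, n_mm, n_ff):
    """Recompute alpha = p - q from the paper counts."""
    p = 2 * n_mm / (2 * n_mm + n_fm)
    q = n_fm / (2 * n_ff + n_fm)
    return p - q


n_fm, n_mm, n_ff = paper_counts(alpha=0.10, f=0.30)  # hypothetical inputs
print(n_fm, n_mm, n_ff)
print(alpha_from_counts(n_fm, n_mm, n_ff))  # recovers alpha up to rounding
```

As a sanity check, `paper_counts(0, 0.5)` returns 50 mixed papers and 25 of each same-gender type, which is what random pairing with equal representation should give.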
Appendix B JSTOR Description
Table 3 shows the size of each of the 24 top level fields identified by the map equation. The values are calculated for all papers published in or after 1960. Note that the table describes the data prior to the data cleaning procedure, so counts of authorships, papers, terminal fields and composite fields shown here may differ from those given in the main manuscript which refer to data after the cleaning procedure. Specifically, Classics, Law, and Philosophy have entire terminal fields which are removed by the cleaning procedure. Table 4 presents the structural characteristics of each top level field. For the multiauthor columns, we report the proportion amongst all authorships on a multiauthor paper; e.g., all female authorships on multiauthored papers divided by the total count of authorships on multiauthored papers. For the intraclass correlation (ICC) of individuals with unimputed genders, we use the statistic from (ridout1999estimating). This gives a measure of how unimputed authorships cluster by paper. Anecdotally, unimputed authorships are often names which have been Romanized. Thus a high ICC may indicate homophily by race or ethnicity.
Label  Authors (Count)  Papers (Count)  Terminal Fields  Composite Fields 
Anthropology  37588  30499  63  8 
Classical studies  10596  9061  37  8 
Cognitive science  15715  5553  14  3 
Demography  9653  5509  20  2 
Ecology and evolution  264853  116327  257  56 
Economics  95934  59096  136  28 
Education  40188  23065  42  10 
History  26449  24043  49  6 
Law  23974  19779  105  16 
Mathematics  18348  14125  46  9 
Molecular & Cell biology  382971  92528  178  44 
Mycology  7469  3679  16  1 
Operations research  13716  7780  18  4 
Organizational and marketing  34254  17963  68  4 
Philosophy  21738  19126  46  8 
Physical anthropology  29693  16703  32  10 
Plant physiology  9159  5436  21  3 
Political science - international  15283  11835  34  2 
Political science - US domestic  12581  7824  37  6 
Pollution and occupational health  50967  12359  24  1 
Probability and Statistics  37471  22094  90  23 
Radiation damage  14118  4215  14  5 
Sociology  57146  31662  94  21 
Veterinary medicine  17756  4796  19  2 
Label  Prop Single Author (Papers)  Prop Single Author (Auth)  Single-Author % F  Single-Author % M  Single-Author % U  Multi-Author % F  Multi-Author % M  Multi-Author % U  ICC
Anthropology  0.86  0.70  0.27  0.63  0.10  0.28  0.61  0.12  0.10 
Classical studies  0.93  0.79  0.22  0.70  0.08  0.27  0.65  0.08  0.05 
Cognitive science  0.25  0.09  0.29  0.64  0.07  0.28  0.62  0.10  0.09 
Demography  0.57  0.32  0.24  0.61  0.15  0.30  0.53  0.16  0.11 
Ecology and evolution  0.37  0.16  0.14  0.79  0.08  0.20  0.70  0.11  0.10 
Economics  0.55  0.34  0.08  0.81  0.11  0.11  0.77  0.12  0.10 
Education  0.55  0.31  0.35  0.58  0.07  0.41  0.50  0.08  0.07 
History  0.92  0.84  0.24  0.70  0.06  0.23  0.69  0.08  0.05 
Law  0.85  0.70  0.17  0.78  0.06  0.22  0.71  0.06  0.05 
Mathematics  0.75  0.58  0.06  0.76  0.18  0.06  0.73  0.21  0.19 
Molecular & Cell biology  0.14  0.03  0.19  0.70  0.10  0.23  0.61  0.16  0.09 
Mycology  0.45  0.22  0.20  0.71  0.09  0.22  0.65  0.13  0.08 
Operations research  0.48  0.27  0.05  0.81  0.14  0.08  0.72  0.20  0.21 
Organizational and marketing  0.40  0.21  0.18  0.72  0.10  0.19  0.68  0.12  0.12 
Philosophy  0.89  0.78  0.09  0.82  0.08  0.10  0.78  0.11  0.12 
Physical anthropology  0.66  0.37  0.22  0.72  0.06  0.22  0.68  0.09  0.07 
Plant physiology  0.53  0.31  0.13  0.79  0.08  0.17  0.72  0.10  0.07 
Political science-international  0.78  0.60  0.16  0.74  0.10  0.17  0.73  0.10  0.08 
Political science-US domestic  0.57  0.35  0.17  0.76  0.07  0.18  0.75  0.07  0.05 
Pollution and occupational health  0.22  0.05  0.24  0.65  0.11  0.31  0.53  0.17  0.19 
Probability and Statistics  0.54  0.32  0.08  0.75  0.17  0.13  0.69  0.19  0.20 
Radiation damage  0.23  0.07  0.22  0.66  0.11  0.22  0.62  0.16  0.14 
Sociology  0.52  0.29  0.30  0.63  0.08  0.38  0.53  0.08  0.07 
Veterinary medicine  0.26  0.07  0.27  0.60  0.12  0.25  0.59  0.15  0.22 
The plots below show how the following quantities have changed over time for each top level field: the average number of authors per paper, the proportion of papers with multiple authors, and the imputed gender proportions. The values are calculated on the data before the data-cleaning procedure, which removes authorship instances with unimputed genders.
Appendix C Data Cleaning Procedures
For the main analysis, we impute gender indicators for authorships with first names that are used for a single gender with at least 95% frequency in either the U.S. Social Security records or the genderizeR database. We consider the gender indicator to be missing for authorships whose first names either do not appear in those databases or are not used with at least 95% frequency for one gender. We subsequently remove authorships with unimputed genders from our main analysis. This removal turns some articles that originally had multiple authors into single-author papers, which are excluded from the analysis. The following table shows the proportion of authorships and papers that are lost solely due to unimputed genders. The denominator includes only papers with multiple authors that were published from 1960 to 2012. The "Prop Authors with Unimputed Gender" column is the proportion of authors for which we do not impute a gender indicator. The "Prop Lost" columns give the proportion of authors (or papers) lost after removing the authorships with unimputed gender indicators and then removing the resulting single-author papers. For authorships, this proportion includes the authorships with unimputed genders.
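The cleaning rule described above can be illustrated with a minimal sketch. The lookup table `name_counts`, the function names, and the toy papers are hypothetical stand-ins; the actual analysis draws on the U.S. Social Security records and the genderizeR database, which this sketch collapses into a single dictionary for simplicity.

```python
from typing import Optional

THRESHOLD = 0.95  # minimum share of one gender required to impute (from the text)


def impute_gender(first_name: str, name_counts: dict) -> Optional[str]:
    """Return 'F', 'M', or None (unimputed) for a first name.

    name_counts is a hypothetical lookup {name: (female_count, male_count)}.
    """
    counts = name_counts.get(first_name.lower())
    if counts is None:
        return None  # name does not appear in the reference data
    female, male = counts
    total = female + male
    if total == 0:
        return None
    if female / total >= THRESHOLD:
        return "F"
    if male / total >= THRESHOLD:
        return "M"
    return None  # name is not used for one gender at least 95% of the time


def clean_papers(papers: dict, name_counts: dict) -> dict:
    """Drop unimputed authorships, then drop papers left with fewer than two authors."""
    cleaned = {}
    for paper_id, first_names in papers.items():
        genders = [impute_gender(name, name_counts) for name in first_names]
        kept = [g for g in genders if g is not None]
        if len(kept) >= 2:
            cleaned[paper_id] = kept
    return cleaned


# Toy data: "jordan" falls below the 95% threshold, "wei" is absent from the lookup.
name_counts = {"mary": (990, 10), "john": (5, 995), "jordan": (300, 700)}
papers = {"p1": ["Mary", "John", "Jordan"], "p2": ["Mary", "Jordan"], "p3": ["John", "Wei"]}
cleaned = clean_papers(papers, name_counts)
```

In this toy example only `p1` survives: `p2` and `p3` each retain a single imputed authorship and are therefore dropped as single-author papers.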
Label  Prop Authors with Unimputed Gender  Authors Remaining  Authors Prop Lost  Papers Remaining  Papers Prop Lost
Anthropology  0.12  9326  0.17  3466  0.15 
Classical studies  0.08  1976  0.11  610  0.09 
Cognitive science  0.10  12510  0.13  3814  0.07 
Demography  0.16  5069  0.22  1930  0.17 
Ecology and evolution  0.11  192091  0.13  66152  0.08 
Economics  0.12  51691  0.19  22178  0.15 
Education  0.08  24356  0.12  9396  0.09 
History  0.08  3699  0.12  1596  0.11 
Law  0.06  6526  0.10  2765  0.09 
Mathematics  0.21  5319  0.31  2459  0.25 
Molecular & Cell biology  0.16  303761  0.18  73357  0.07 
Mycology  0.13  4828  0.17  1759  0.11 
Operations research  0.20  7217  0.28  3025  0.21 
Organizational and marketing  0.12  22299  0.18  9137  0.13 
Philosophy  0.11  3897  0.18  1770  0.15 
Physical anthropology  0.09  16463  0.12  5175  0.07 
Plant physiology  0.10  5388  0.14  2287  0.10 
Political science-international  0.10  5128  0.16  2247  0.14 
Political science-US domestic  0.07  7269  0.11  3068  0.09 
Pollution and occupational health  0.17  39703  0.18  8845  0.06 
Probability and Statistics  0.19  18763  0.27  7600  0.21 
Radiation damage  0.16  10710  0.18  2902  0.09 
Sociology  0.08  35858  0.12  13600  0.09 
Veterinary medicine  0.15  13741  0.17  3275  0.06 
Total  0.14  807588  0.16  252413  0.11 
Appendix D Sampler Details
Recall from the main text that for each authorship in the set of all authorships , we let denote the terminal field and denote the document to which is assigned. We denote the entire configuration of all authorships as and denote the configuration which we actually observe in the data as . We use a Markov chain Monte Carlo Metropolis-Hastings sampler to draw samples from the gender-blind null distribution:
(6) 
where the equivalence relationship indicates that the number of total authorships per terminal field, the total numbers of male and female authorships, and the number of authorships per paper is the same in and .
Define a permutation cycle of length to be a set of authorships in which is reassigned to the current terminal field and document of and is reassigned the current terminal field and document of . Any hypothetical configuration of authorship assignments can be decomposed into disjoint permutation cycles of the observed data . The sampling procedure starts with the observed assignments of authorships to papers within terminal fields and generates assignments by successively modifying the current state by a series of permutation cycles. We generate a proposal for each of these cycles by first randomly selecting a cycle length from a geometric distribution. Then, specific authorships are selected to form the permutation cycle. This proposed permutation cycle is then accepted or rejected with the appropriate Metropolis-Hastings probability. For , let , the authorships in a terminal field where any authorship originally from terminal field could be reassigned.
The length of the proposed cycle , where is a tuning parameter which regulates the average cycle length. A larger value of will yield longer cycles resulting in larger changes in the proposal but a lower probability of acceptance; a smaller value of will yield shorter cycles resulting in smaller changes in the proposal but a higher probability of acceptance. In general, the maximum length of a permutation cycle in the decomposition could be up to , the number of authorships in our corpus. Thus, any distribution which has positive support over would be sufficient for irreducibility. Under the scheme proposed in Algorithm 1, (as defined by the gender-blind null distribution) could be 0 since we have not guaranteed that . In addition, since we are selecting authorships with replacement, if an authorship is selected twice on the cycle. However, if we were to sample without replacement, we would need to condition on authorships that had been previously selected, so the proposal probabilities would no longer be symmetric, since the probability of traversing a cycle would not be invariant to the orientation of the cycle.
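As a rough illustration of the proposal mechanism, the following sketch draws a cycle length from a geometric distribution, selects authorships with replacement, and rotates their (terminal field, document) slots along the cycle. It is not Algorithm 1 itself: the minimum cycle length of 2, the parametrization by a stopping probability `p_stop`, uniform selection over all authorships (ignoring the terminal-field reassignment probabilities), and the user-supplied `target` function standing in for the null distribution are all assumptions made for illustration.

```python
import random


def propose_cycle(n_authorships, p_stop, rng):
    """Draw a geometric cycle length (assumed minimum 2) and pick authorships with replacement."""
    length = 2
    while rng.random() > p_stop:  # smaller p_stop gives longer cycles on average
        length += 1
    return [rng.randrange(n_authorships) for _ in range(length)]


def apply_cycle(slots, cycle):
    """Move each authorship on the cycle into the slot of the next one.

    slots[a] is the (terminal field, document) pair currently held by authorship a.
    Returns None when an authorship repeats, which this sketch treats as a rejection.
    """
    if len(set(cycle)) < len(cycle):
        return None
    new_slots = list(slots)
    for i, a in enumerate(cycle):
        nxt = cycle[(i + 1) % len(cycle)]
        new_slots[a] = slots[nxt]
    return new_slots


def mh_step(slots, p_stop, target, rng):
    """One Metropolis-Hastings step; the proposal is symmetric, so only the target ratio matters."""
    proposal = apply_cycle(slots, propose_cycle(len(slots), p_stop, rng))
    if proposal is None:
        return slots
    ratio = target(proposal) / target(slots)  # target(slots) is assumed to be positive
    return proposal if rng.random() < ratio else slots


rng = random.Random(0)
slots = [("f1", "p1"), ("f1", "p1"), ("f1", "p2"), ("f2", "p3")]
rotated = apply_cycle(slots, [0, 2, 3])
state = slots
for _ in range(100):
    state = mh_step(state, 0.5, lambda s: 1.0, rng)  # flat target used only for the demo
```

Because slots are only permuted among authorships, the number of authorships per paper and per terminal field is preserved by construction, whatever the target.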
Remark 1.
Let be the described proposal distribution in Algorithm 1. Then is symmetric such that .
Proof.
Let and be two assignments which differ by cycle . For notational convenience, let and . Then,
(7)  
A proposal of from requires traversing the cycle in the opposite direction.
(8)  
∎
Remark 2.
The Markov chain produced from the proposal procedure in Algorithm 1 is irreducible if and the cycle length is chosen from a distribution with support over where is the number of authorship instances.
Proof.
For each with , there exists a decomposition of into disjoint sets such that is a permutation of some subset of . Let be the sequence of assignments which correspond to updating the permutation cycles , . Since there are a finite number of disjoint cycles and the proposal for permuting each cycle is positive, then the joint probability of permuting all cycles is also positive, so . Because the transition support is symmetric, we can also reverse each cycle to move with positive probability from .
Thus, for any two states and with positive probability under the null,
∎
To allow for collaboration across terminal fields, we use observed citation data from one terminal field to another to define the authorship reassignment probability, , between terminal fields and . Here, we make two simplifying assumptions. First, we threshold the citation flow between terminal fields at 5% of outgoing citations. Authorship reassignments between terminal fields that have little connectivity are highly unlikely. Thresholding the citation data produces a network of terminal fields that is sparser (has a greater number of disjoint graph components), which allows the sampling procedure to be parallelized more efficiently. Second, to ensure that our sampling procedure can reach all that have positive probability under the null distribution, we allow for authorship reassignment between terminal fields to be possible in both directions.
More formally, let be the observed proportion of citations from terminal field to terminal field , . We define the authorship reassignment probabilities between terminal fields as follows:

Set any proportions below the 5% threshold to 0

Set

Renormalize the proportions so
This procedure allows us to take into account substantial connectivity between terminal fields and also ensures that authorship reassignments between terminal fields are possible in both directions:
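The three steps above can be sketched as follows. The exact symmetrization rule is not recoverable from the text here, so taking the elementwise maximum of the two directions is an assumption, as are the function name and the toy citation matrix.

```python
def reassignment_probabilities(citations, threshold=0.05):
    """Threshold, symmetrize, and renormalize a matrix of citation proportions.

    citations[i][j] is the proportion of citations from field i that go to field j.
    """
    n = len(citations)
    # Step 1: set proportions below the threshold to 0.
    kept = [[c if c >= threshold else 0.0 for c in row] for row in citations]
    # Step 2 (assumed rule): make the support symmetric via the elementwise maximum.
    sym = [[max(kept[i][j], kept[j][i]) for j in range(n)] for i in range(n)]
    # Step 3: renormalize each row so that it sums to 1.
    probs = []
    for row in sym:
        total = sum(row)
        probs.append([x / total if total > 0 else 0.0 for x in row])
    return probs


# Toy example with three terminal fields.
citations = [
    [0.90, 0.08, 0.02],
    [0.03, 0.95, 0.02],
    [0.30, 0.00, 0.70],
]
probs = reassignment_probabilities(citations)
```

In the toy example, field 0 sends only 2% of its citations to field 2, but field 2 sends 30% to field 0, so reassignment between the two fields remains possible in both directions after symmetrization.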
Appendix E Results
E.1 Sampler Convergence
As recommended by Gelman and Shirley (gelman2011inference), we run 3 separate chains and discard roughly the first half of each chain as burn-in. In particular, we run 3 chains of 45,000 MCMC samples each and discard the first 20,000 from each chain as burn-in. Such long chains and burn-in are necessary because all chains were initialized from the same starting values. We then combine the remaining 75,000 samples (25,000 from each chain) to estimate p-values for the observed values. To check for convergence in the distribution of each relevant field, we compare the distributions of from each chain using a Kolmogorov-Smirnov test as suggested by (brooks2003nonparametric). Figure 4 shows the p-values for a two-sided Kolmogorov-Smirnov test for equality of distributions. Because the distribution of is discrete, instead of using the typical asymptotic distribution of the KS statistic to calculate p-values, we bootstrap p-values using the R function ks.boot from the Matching package (sekhon2011matching). We see that the p-values across all comparisons are roughly uniformly distributed, as we would expect if the distribution of were similar across all chains.
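To make the idea of a resampling-based Kolmogorov-Smirnov comparison concrete, the sketch below computes the two-sample KS statistic in a way that handles ties, and estimates a p-value by permuting the pooled draws. This is a permutation variant written for illustration, not a reimplementation of ks.boot; the function names and the toy chains are hypothetical.

```python
import random


def ks_statistic(x, y):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    i = j = 0
    d = 0.0
    # Evaluate both empirical CDFs at every distinct observed value (safe with ties).
    for v in sorted(set(xs) | set(ys)):
        while i < len(xs) and xs[i] <= v:
            i += 1
        while j < len(ys) and ys[j] <= v:
            j += 1
        d = max(d, abs(i / len(xs) - j / len(ys)))
    return d


def ks_resampling_pvalue(x, y, n_resamples=1000, seed=0):
    """Estimate a p-value by reshuffling the pooled sample between the two groups."""
    rng = random.Random(seed)
    observed = ks_statistic(x, y)
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        if ks_statistic(pooled[:len(x)], pooled[len(x):]) >= observed:
            hits += 1
    return (hits + 1) / (n_resamples + 1)


# Toy chains taking a small number of discrete values.
chain_a = [0.00, 0.01, 0.01, 0.02, 0.02, 0.02, 0.03, 0.03]
chain_b = [0.00, 0.01, 0.02, 0.02, 0.02, 0.03, 0.03, 0.04]
p_value = ks_resampling_pvalue(chain_a, chain_b)
```

Note that successive MCMC draws are autocorrelated, so any resampling scheme that treats them as exchangeable is only a rough diagnostic.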
E.2 Comparison with Naive Approach
We can compare the expected value of from the null distribution which accounts for structural and compositional homophily (Eq. (2) in the main document) to the expected value of from a naive null distribution which only accounts for structural homophily. We construct a naive null distribution for each level in the full hierarchical clustering by preserving all structure of (terminal or composite) fields with depth less than , but treating all fields of depth as a terminal field. We then recalculate the swap probabilities given the citation flows and run the sampler for 5,000 samples. We discard the first 1,000 as burn-in and use the remaining 4,000 samples to calculate an expected value and p-values.
Columns labeled “Struct” provide the expected value of , the expected number of female-male papers, the p-value for behavioral homophily, and the number of significant composite fields under the null hypothesis of only structural (but not compositional) homophily. Under only structural homophily, the expected value is smaller than the expected when also preserving compositional homophily. In addition, the p-value decreases for all top-level fields when only considering structural homophily. In many top-level fields, the number of composite fields with behavioral increases when only capturing structural homophily and never decreases.
Field  Obs  Exp  Struct  Obs  Exp  Struct  P-value (Main)  P-value (Struct)  Signif Comp (Main)  Signif Comp (Struct)
JSTOR  .11  .05  .00  38.6  41.1  43.3  .00  .00  82/280  157/280 
Mol/Cell Bio  .05  .01  .00  38.2  39.8  40.2  .00  .00  19/44  35/44 
Eco/Evol  .06  .02  .00  31.9  33.3  34.0  .00  .00  15/56  33/56 
Economics  .11  .02  .00  18.7  20.8  21.1  .00  .00  11/28  18/28 
Sociology  .19  .07  .00  38.5  44.2  47.4  .00  .00  12/21  19/21 
Prob/Stat  .09  .03  .00  26.0  27.8  28.6  .00  .00  2/23  12/23 
Org/mkt  .16  .04  .00  29.0  33.2  34.7  .00  .00  3/4  4/4 
Education  .16  .04  .00  41.2  47.2  49.2  .00  .00  6/10  9/10 
Occ Health  .10  .02  .00  41.7  45.5  46.3  .00  .00  1/1  1/1 
Anthro  .12  .03  .00  38.5  42.0  43.5  .00  .00  2/8  4/8 
Law  .17  .08  .00  29.7  32.9  35.7  .00  .00  1/16  4/16 
History  .16  .07  .00  32.9  36.5  39.1  .00  .00  1/6  2/6 
Phys Anthro  .07  .01  .00  34.7  36.8  37.1  .00  .00  2/10  2/10 
Intl Poli Sci  .09  .02  .00  27.3  29.1  29.8  .03  .00  0/2  1/2 
US Poli Sci  .15  .07  .00  25.2  27.4  29.6  .00  .00  1/6  2/6 
Philosophy  .10  .03  .00  18.7  20.3  20.8  .03  .00  0/8  0/8 
Math  .04  .01  .00  14.3  14.7  14.9  1.00  .12  0/9  0/9 
Vet Med  .09  .01  .00  38.3  41.6  42.0  .00  .00  1/2  2/2 
Cog Sci  .18  .09  .00  35.7  39.4  43.3  .00  .00  3/3  3/3 
Radiation  .09  .01  .00  34.0  36.8  37.3  .00  .00  1/5  4/5 
Demography  .15  .06  .00  40.3  44.4  47.3  .00  .00  1/2  1/2 
Classics  .07  .01  .00  38.7  41.2  41.7  .27  .01  0/8  0/8 
Opr Res  .03  .00  .00  16.6  17.1  17.1  .73  .33  0/4  0/4 
Plant Phys  .08  .02  .00  29.1  31.0  31.8  .03  .00  0/3  1/3 
Mycology  .03  .01  .00  36.8  37.5  37.9  1.00  .60  0/1  0/1 
E.3 Calculating and adjusting P-values
In a sampled configuration, if a field only contains authorships of a single gender, is undefined. When calculating a p-value, we consider this as . This approach is conservative because it increases p-values, but in practice it has very little effect on our results.
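A minimal sketch of this convention follows, assuming a one-sided p-value computed as the share of null draws at least as large as the observed value, with undefined draws represented as NaN and counted toward that share. Both the one-sided form and the NaN representation are assumptions for illustration.

```python
import math


def mc_pvalue(observed, sampled):
    """Share of null draws at least as large as the observed value.

    Undefined draws (NaN) are counted as at least as extreme, which can only
    increase the p-value and is therefore conservative.
    """
    extreme = sum(1 for s in sampled if math.isnan(s) or s >= observed)
    return extreme / len(sampled)


# Four null draws, one of them undefined.
p = mc_pvalue(0.1, [0.0, 0.05, 0.2, float("nan")])
```

Here two of the four draws count as extreme (0.2 and the undefined draw), so `p` is 0.5 rather than the 0.25 obtained if the undefined draw were counted as not extreme.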
In the main manuscript, we control the false discovery rate (FDR) at .05 with the Benjamini-Yekutieli procedure (benjamini2001control), which allows for arbitrary dependence among the p-values but is more conservative than the Benjamini-Hochberg procedure (benjamini1995controlling), which only allows for certain types of positive dependence. Table 7 replicates the last 3 columns of Table 1 of the main manuscript using the Benjamini-Yekutieli procedure with an FDR of .005 as well as the Benjamini-Hochberg procedure with FDRs of .05 and .005.
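The difference between the two procedures can be seen in a short sketch of the standard step-up adjustments. The function name and the toy p-values are illustrative; the only difference between the methods is the extra harmonic-sum factor applied by Benjamini-Yekutieli.

```python
def fdr_adjust(pvalues, method="BY"):
    """Return Benjamini-Hochberg ("BH") or Benjamini-Yekutieli ("BY") adjusted p-values."""
    m = len(pvalues)
    # BY multiplies by the harmonic sum 1 + 1/2 + ... + 1/m to allow arbitrary dependence.
    c = sum(1.0 / k for k in range(1, m + 1)) if method == "BY" else 1.0
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m * c / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.001, 0.01, 0.03, 0.5]
bh = fdr_adjust(pvals, method="BH")
by = fdr_adjust(pvals, method="BY")
n_bh = sum(p <= 0.05 for p in bh)  # hypotheses rejected under BH at FDR .05
n_by = sum(p <= 0.05 for p in by)  # hypotheses rejected under BY at FDR .05
```

With these four toy p-values, BH rejects three hypotheses at an FDR of .05 while BY rejects two, which mirrors the pattern in Table 7 of BH flagging more fields than BY at the same rate.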
BY; Rate  BH; Rate  BH; Rate  

Field  Pvalue  Term  Comp  Pvalue  Term  Comp  Pvalue  Term  Comp 
JSTOR  .00  68/1450  65/280  .00  114/1450  81/280  .00  261/1450  125/280 
Mol/Cell Bio  .00  16/178  18/44  .00  28/178  19/44  .00  53/178  29/44 
Eco/Evol  .00  7/257  13/56  .00  14/257  15/56  .00  43/257  25/56 
Economics  .00  4/136  8/28  .00  8/136  11/28  .00  25/136  17/28 
Sociology  .00  9/94  9/21  .00  12/94  12/21  .00  26/94  14/21 
Prob/Stat  .00  0/90  1/23  .00  1/90  2/23  .00  7/90  7/23 
Org/mkt  .00  5/68  3/4  .00  6/68  3/4  .00  15/68  3/4 
Education  .00  6/42  4/10  .00  10/42  6/10  .00  17/42  6/10 
Occ Health  .00  8/24  1/1  .00  12/24  1/1  .00  15/24  1/1 
Anthro  .00  2/63  1/8  .00  5/63  2/8  .00  8/63  2/8 
Law  .00  0/98  0/16  .00  0/98  1/16  .00  1/98  4/16 
History  .00  0/49  0/6  .00  0/49  1/6  .00  1/49  1/6 
Phys Anthro  .00  1/32  2/10  .00  1/32  2/10  .00  6/32  3/10 
Intl Poli Sci  .03  0/34  0/2  .00  0/34  0/2  .00  3/34  0/2 
US Poli Sci  .00  0/37  0/6  .00  2/37  1/6  .00  4/37  3/6 
Philosophy  .03  0/45  0/8  .00  0/45  0/8  .00  5/45  0/8 
Math  1.00  0/46  0/9  .16  0/46  0/9  .16  2/46  0/9 
Vet Med  .00  5/19  1/2  .00  7/19  1/2  .00  8/19  2/2 
Cog Sci  .00  2/14  3/3  .00  4/14  3/3  .00  7/14  3/3 
Radiation  .00  3/14  1/5  .00  3/14  1/5  .00  5/14  3/5 
Demography  .00  0/20  0/2  .00  0/20  0/2  .00  3/20  2/2 
Classics  .27  0/35  0/8  .03  0/35  0/8  .03  1/35  0/8 
Opr Res  .73  0/18  0/4  .09  0/18  0/4  .09  1/18  0/4 
Plant Phys  .03  0/21  0/3  .00  1/21  0/3  .00  4/21  0/3 
Mycology  1.00  0/16  0/1  .27  0/16  0/1  .27  1/16  0/1 
E.4 Secondary Analysis
We examine whether certain terminal field characteristics are associated with statistically significant behavioral homophily. In particular, we fit a logistic regression where the dependent variable is whether or not significant behavioral homophily was detected using the Benjamini-Yekutieli FDR procedure (benjamini2001control) with . We include the following independent variables: the ratio of the % of solo authorships which are female to the % of authorships on multi-authored papers which are female (); the log of the number of authorships (); the proportion of female authorships (); an indicator of whether the field is majority female (); and an interaction between and . The interaction term allows the association of female proportion to differ depending on whether the field is majority female or not.
We fit the logistic regression specified in (1) using a generalized estimating equation (GEE) (gee2015); to account for dependency across terminal fields, we use robust standard errors and specify clusters aligning to top level field. We also specify a diagonal working covariance. The results are shown in Table 8. We see that the ratio of female solo authorships to female multi-authorships is not significant, but the size of the terminal field and the proportion of female authorships are statistically significant. We also see that neither the indicator for whether a field is majority female nor the interaction term is significant at the .05 level. While it is interesting to note that the estimate of the interaction term is negative, we also caution that the estimate may not be precise since the majority female indicator is only positive for of the terminal fields.
(9) 
Term  Estimate  Robust S.E.  Robust z  P-value
Intercept  14.05  1.00  14.09  0.00 
log(Authorships)  1.45  0.09  15.25  0.00 
Proportion Female  6.70  1.55  4.32  0.00 
Majority Female Indicator  13.30  7.33  1.81  0.07 
Ratio Solo vs Multi Females  0.18  0.35  0.52  0.60 
Proportion Female Majority Female Interaction  24.70  12.73  1.94  0.052 
Alternatively, if we define as whether behavioral homophily was detected under the Benjamini-Hochberg (benjamini1995controlling) FDR control procedure, we see that the significant/non-significant covariates do not change, but the p-values for the majority female indicator and interaction term are much further from .05 than when defining significance using the Benjamini-Yekutieli procedure. Results are shown in Table 9.
Term  Estimate  Robust S.E.  Robust z  P-value
Intercept  10.22  0.72  14.26  0.00 
log(Authorships)  1.20  0.09  12.75  0.00 
Proportion Female  4.11  1.06  3.89  0.00 
Majority Female Indicator  3.08  4.01  0.77  0.44 
Ratio Solo vs Multi Females  0.20  0.22  0.89  0.37 
Proportion Female Majority Female Interaction  5.63  7.22  0.78  0.44 
E.5 Sensitivity Analysis: Missing Gender Indicators
To evaluate how sensitive our main results are to the missing gender indicators, we impute gender for authorships with missing gender indicators under two scenarios:

Low homophily: Each authorship with a missing gender indicator is assigned a gender at random according to the proportions of observed genders on its original terminal field. This procedure assumes that there is no behavioral homophily in the imputed data because the imputed genders are conditionally independent given the terminal field. Thus, it gives a reasonable lower bound on the homophily we might have observed given the full data.

High homophily: Each authorship with a missing gender indicator is assigned a gender at random according to the proportions of observed genders on its original paper. If the original paper contains only authorships with missing gender indicators, we assign all authorships on the paper the same gender indicator which is drawn randomly according to the proportions of observed genders for its original terminal field. Because papers with at most one assigned gender indicator are homophilous by construction, this provides a reasonable upper bound on the homophily we might have observed given the full data.
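The two imputation scenarios can be sketched for a single paper as follows. Genders are coded "F" and "M", with `None` for a missing indicator; the function names and the representation of a paper as a list of gender codes are hypothetical choices made for illustration.

```python
import random


def impute_low(paper, field_prop_female, rng):
    """Low homophily: draw each missing gender independently from the field-level proportion."""
    return [
        g if g is not None else ("F" if rng.random() < field_prop_female else "M")
        for g in paper
    ]


def impute_high(paper, field_prop_female, rng):
    """High homophily: draw missing genders from the paper's own observed proportions."""
    observed = [g for g in paper if g is not None]
    if not observed:
        # No observed genders on the paper: one field-level draw shared by all authorships.
        shared = "F" if rng.random() < field_prop_female else "M"
        return [shared] * len(paper)
    prop_female = observed.count("F") / len(observed)
    return [
        g if g is not None else ("F" if rng.random() < prop_female else "M")
        for g in paper
    ]


rng = random.Random(1)
low = impute_low(["F", None, "M"], 0.5, rng)
high_one_observed = impute_high(["F", None, None], 0.3, rng)
high_none_observed = impute_high([None, None, None], 0.5, rng)
```

When a paper has exactly one observed gender, the high homophily rule copies it to every missing authorship, so the completed paper is single-gender by construction.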
For each scenario, we carry out 10 imputations and then repeat the entire sampling and testing procedures used for the main analysis. Table 10 gives the resulting proportions of terminal, composite, and top level fields with significant behavioral homophily under the Benjamini-Yekutieli FDR procedure with under the low and high homophily missing data imputation scenarios. We observe that, on average, 7%, 25%, and 78% of terminal, composite, and top level fields, respectively, exhibit statistically significant in the low homophily scenario; for the high homophily scenario the corresponding averages are 54%, 82%, and 100%.
Scenario  Terminal  Composite  Top
Main Analysis  0.09  0.29  0.83 
Low Imputation 1  0.06  0.23  0.83 
Low Imputation 2  0.07  0.25  0.75 
Low Imputation 3  0.06  0.25  0.75 
Low Imputation 4  0.08  0.25  0.75 
Low Imputation 5  0.06  0.26  0.75 
Low Imputation 6  0.07  0.28  0.75 
Low Imputation 7  0.07  0.25  0.83 
Low Imputation 8  0.07  0.26  0.79 
Low Imputation 9  0.07  0.25  0.75 
Low Imputation 10  0.06  0.26  0.79 
Low Imputation Avg  0.07  0.25  0.78 
High Imputation 1  0.54  0.82  1.00 
High Imputation 2  0.53  0.82  1.00 
High Imputation 3  0.53  0.82  1.00 
High Imputation 4  0.53  0.82  1.00 
High Imputation 5  0.54  0.82  1.00 
High Imputation 6  0.53  0.81  1.00 
High Imputation 7  0.54  0.82  1.00 
High Imputation 8  0.54  0.83  1.00 
High Imputation 9  0.54  0.83  1.00 
High Imputation 10  0.54  0.82  1.00 
High Imputation Avg  0.54  0.82  1.00 