How individual behaviors drive inequality in online community sizes: an agent-based simulation

Why are online community sizes so extremely unequal? Most answers to this question have pointed to general mathematical processes drawn from physics like cumulative advantage. These explanations provide little insight into specific social dynamics or decisions that individuals make when joining and leaving communities. In addition, explanations in terms of cumulative advantage do not draw from the enormous body of social computing research that studies individual behavior. Our work bridges this divide by testing whether two influential social mechanisms used to explain community joining can also explain the distribution of community sizes. Using agent-based simulations, we evaluate how well individual-level processes of social exposure and decisions based on individual expected benefits reproduce empirical community size data from Reddit. Our simulations contribute to social computing theory by providing evidence that both processes together—but neither alone—generate realistic distributions of community sizes. Our results also illustrate the potential value of agent-based simulation to online community researchers to both evaluate and bridge individual and group-level theories.


1. Introduction

Massive inequality in membership size is a feature of virtually every platform hosting online communities. A few very large communities attract the vast majority of contributors while most communities receive almost none. This pattern occurs in peer production communities like open source software and wikis as well as discussion-oriented communities like forums. For example, in January 2017 the “r/AskReddit” community on Reddit had over 680,000 unique contributors who made over 5.3 million comments. The next most active community on Reddit, “r/politics” had around 156,000 contributors and 2.2 million comments. Meanwhile, the median number of contributors to a Reddit community that month was three and the median number of total comments was five.

Footnote 1: All data from Reddit (https://reddit.com) used in this paper was gathered and republished by Pushshift (https://pushshift.io).

Why are some online communities so much larger than others? Why do people join online communities? Most prior answers to the first question in social computing research treat group size as the result of mathematical mechanisms drawn from physics such as cumulative advantage or preferential attachment. These approaches minimize agency and lack a clear link to individual-level behavior. The body of work answering the second question has provided explanations of how people make these individual decisions but only in the context of a single community. Social computing research suggests three sequential subprocesses that influence individual decisions to join or leave communities. In the first step, people learn about communities through social ties and social influence. We call this social exposure. In a second step following exposure, people decide whether to participate for the first time or not (i.e., to join a community). Finally, after a period of activity, people decide to discontinue their participation. Prior work treats joining and exit decisions as a function of individuals’ motivations, abilities, and/or expectations that a given community will satisfy their goals in some way. We refer to this approach to joining and exit decisions as individual expected benefits (IEB).

These individual-level explanations typically do not attempt to explain macro-level phenomena like the distribution of participants across communities. Logically, group sizes must emerge as a function of individual decisions about which groups to join. But how well do individual joining processes explain macro outcomes like membership size? Understanding higher-level implications of individual-level behaviors is often difficult. In social systems like online communities, individual behaviors are interdependent and combine to produce macro-level patterns in complicated ways that are hard to predict. This “micro-macro divide” makes it very difficult to directly test findings across levels.

One approach to bridging this divide is agent-based simulation (or ABS, also known as agent-based modeling). ABS involves building empirically-informed formal approximations of theories of individual behavior, using computers to step through simulated interactions between individual agents behaving according to the formalized theories, and comparing the macro-level outcomes of these simulations against empirically-observed data.

To bridge the micro-macro divide in online community research, we formalize models of theories of social exposure and IEB decisions and use ABS to simulate computational agents acting according to these models across a range of simulated situations. We also simulate a novel joint model of social exposure and IEB. We evaluate our models and the underlying theories by assessing how well agents reproduce the patterns of community size observed empirically in Reddit. We find that simulated agents acting according to either social exposure or IEB alone do not produce empirically plausible community size distributions. In contrast, when the two sub-processes of exposure and IEB decisions act in tandem, sufficient positive feedback emerges to produce highly skewed distributions similar to that of Reddit.

This paper makes three primary contributions. First, we provide a framework for connecting micro-level social computing research influenced by social psychology with macro-level research on organizational behavior and group and population dynamics. In doing so, we show how higher-level patterns can enrich our understanding of individual behavior, and vice versa. Second, we provide arguments and evidence for the usefulness of agent-based simulation in social computing research. Finally, we provide a theoretical synthesis between social exposure and IEB decisions and show that both together provide a good explanation for the extreme inequality in online community sizes.

2. Background

Social computing research on community participation and community growth is divided into two largely distinct bodies of micro- and macro-level scholarship. To motivate our use of agent-based simulation as a bridge between the two levels, we briefly review salient examples of each. First, we consider micro-level explanations of community membership that focus on two sub-processes: how people learn about communities through social exposure and how they decide to join or leave them based on expected benefits. We then discuss prior explanations of community size distributions from a macro-level perspective, underscoring that such explanations typically have little to say about individual-level behaviors. Next, we introduce ABS as a method for evaluating the impact of micro-level community joining behavior on the macro-level distribution of community sizes. Finally, we discuss how micro-level processes may contribute to macro-level behavioral outcomes and why macro-level consequences matter for micro-level models.

2.1. Explaining individual-level community joining and exit

2.1.1. Social exposure

While people may learn about new communities in multiple ways, exposure often occurs through social ties (Kraut et al., 2012). Social exposure-based explanations of online community joining are one example of how social ties predict the adoption of computing behaviors. For example, the more friends who adopt a technology product (Bhatt et al., 2010), patterns of behavior (Bakshy et al., 2009; State and Adamic, 2015), or shared information (Bakshy et al., 2012), the more likely a focal person is to do the same. A few studies have looked explicitly at the decision to join online groups and have found the same dynamics: the more friends a person has in a group, the more likely they are to join it (Backstrom et al., 2006; Kairam et al., 2012). Social exposure to new online communities is a common feature of participation on the Internet and users of online communities are frequently being exposed to new communities. For example, in January 2017, 792,643 comments on Reddit—over 1% of all public comments made on Reddit that month—included links to other Reddit communities.

2.1.2. Decisions based on individual expected benefits

Once exposed to a set of communities, people join online communities in order to advance their goals and to meet their needs (Klandermans, 2004; Resnick et al., 2012). Individuals decide whether and how to participate in a community based on their personal attributes, motivations, and experiences. Potential members reason prospectively about the community and whether they can imagine themselves as a part of it (Antin and Cheshire, 2010; Antin, 2011; Antin et al., 2012). Over time, individuals will remain or leave depending on how existing members respond to them (e.g., do they bite the newbies or reach out to offer support?) or based on shifting perceptions of the community, other members, or their own role (Bryant et al., 2005; Halfaker et al., 2011, 2013; Morgan and Halfaker, 2018; Panciera et al., 2009). As part of these decisions, people may estimate the impact and importance of their contributions. For example, when the government of the People’s Republic of China eliminated access to Wikipedia for its citizens, thereby decreasing the editor community and audience of Chinese Wikipedia, Chinese-speaking Wikipedians from the rest of the world dramatically reduced their participation (Zhang and Zhu, 2011). All else equal, larger communities with larger audiences tend to attract more participants. After joining a group, some people deepen their commitment to it and develop attachments in the form of identity and/or social bonds (Kraut et al., 2012). The extent to which a person has these attachments also influences whether they continue to participate or leave (Danescu-Niculescu-Mizil et al., 2013; Kairam et al., 2012). Eventually, individual community members reach a peak of participation and exit a community when it no longer helps to meet their goals.

To summarize this prior work in a schematic way, people decide to join and participate in online communities as long as they expect that the benefits they receive outweigh the costs of participating. We use the term individual expected benefits (IEB) to describe this way of thinking about joining decisions. The IEB approach is illustrated in the final chapter of the book Building Successful Online Communities (Kraut et al., 2012), wherein Resnick et al. (2012) provide an equation-based model intended to summarize how people weigh costs and benefits when deciding whether to join an online community. Like the rest of the book, the model is presented as part of an extensive literature review and attempts to integrate and summarize a rich body of work on the topic of community joining and exit decisions in social computing theory.

In Resnick et al.’s model, benefits fall into two categories: participation benefits and early adopter benefits. Participation benefits are a function of the eventual size of the community and might include social relationships, entertainment, and information. Early adopter benefits are benefits which only accrue to early members and include influence, reputation, or even revenue sharing. The costs to join a community (which they call “startup costs”) include learning new software, learning the norms of a new community, and building reputation and social ties. If someone believes that the expected utility of the participation benefits plus the early adopter benefits outweigh the costs then they will join that community or continue to participate. If not, they leave the community or never participate in the first place.

2.2. Explaining inequality in online community size

Although distributions of community size emerge as a consequence of individuals joining and leaving communities, prior work has analyzed these distributions without making serious efforts to explain their relationship to individual-level behaviors. One clear takeaway from studies of community size across a range of platforms and contexts is that distributions of community sizes are highly skewed (Johnson et al., 2014). For example, in one month on the platform Reddit, twelve subreddits (i.e., topic-based communities) had over 100,000 unique contributors, while over 44,000 subreddits had fewer than five contributors (Figure 1). Free/open source software communities (Graves, 2013; Healy and Schussman, 2003; Crowston et al., 2008), wikis and other peer production projects (Benkler et al., 2015), online discussion forums (Jones et al., 2002; Panek et al., 2018), and many other platforms follow similar patterns.

Figure 1. Distribution of members per subreddit in January 2017. The x-axis is the number of users with at least five comments in that subreddit in the month. The y-axis is the count of communities with that number of users. Both axes are log-scaled.

Whether identified as following a scale-free “power law” (Adamic and Huberman, 2000; Barabási and Albert, 1999), “weakly” scale-free, or merely log-normal (Broido and Clauset, 2019) distributions, explanations of community size typically invoke cumulative advantage and/or preferential attachment (Barabási and Albert, 1999; Merton, 1968). In these models, current community size is the only variable used to predict future community size. In short, popular communities become ever more popular. Drawing from models used in physics, these explanations emphasize mathematics of accumulation without providing credible social mechanisms of individual action that might act as reasonable approximations of the mathematics. Sociologists have referred to these types of explanations as “undersocialized” accounts (e.g., Frank, 1993).
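To make the cumulative advantage mechanism concrete, the following is a minimal Python sketch in which current community size is the only input to future growth. It is an illustration under stated assumptions, not a model taken from the cited work: the function name, the "+1" weight that lets empty communities attract a first member, and the use of 9,000 joiners and 200 communities are all choices made here for illustration.

```python
import random


def simulate_cumulative_advantage(n_communities=200, n_joins=9000, seed=0):
    """Assign joiners one at a time to communities.

    Each joiner picks a community with probability proportional to
    (current size + 1). The +1 is an illustrative assumption so that
    empty communities can still be chosen.
    """
    rng = random.Random(seed)
    sizes = [0] * n_communities
    for _ in range(n_joins):
        weights = [size + 1 for size in sizes]
        chosen = rng.choices(range(n_communities), weights=weights, k=1)[0]
        sizes[chosen] += 1
    return sizes


sizes = simulate_cumulative_advantage()
# Compare the largest and smallest communities to the mean size.
print("largest:", sorted(sizes, reverse=True)[:5])
print("smallest:", sorted(sizes)[:5])
print("mean:", sum(sizes) / len(sizes))
```

Nothing in this sketch refers to what any individual knows, wants, or decides, which is the sense in which such accounts are described as undersocialized.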

Could models of social exposure and IEB act as social mechanisms which reproduce the broader dynamics of cumulative advantage or preferential attachment, and thus explain higher-level patterns of inequality in community sizes based on empirically-grounded, individual-level research? In that social computing scholars studying joining processes have focused on individual-level and group-level outcomes, they have not attempted to understand patterns in populations of groups. As a result, we simply do not know.

Of course, researchers have considered why groups grow or founder. Most of this prior work also focuses on just one level. For example, some studies explain variation in group outcomes based on group-level behavior. In this work communities appear as independent entities and features of members, their interactions, or institutions predict outcomes like size or longevity (Crowston and Howison, 2005; Crowston et al., 2006; Kittur and Kraut, 2010; Kraut and Fiore, 2014; Shaw and Hill, 2014). Although valuable, this work tells us little about the distribution of community sizes because group membership is presumed and differences in outcomes arise from things that happen within communities.

Other studies analyze relationships between groups—e.g., whether overlaps in group membership predict group survival or growth (TeBlunthuis et al., 2017; Wang et al., 2013; Zhu et al., 2014a; Zhu et al., 2014b). Again, this research explains variation between community outcomes and does not account for the overall distributions of community size.

2.3. Agent-based simulation

In short, previous research provides little insight into how well individual behavioral models of joining and exit explain inequality in online community sizes. One explanation for this gap in our understanding is the challenges of drawing inferences across levels of theorizing and analysis. It is difficult to test competing theories of joining and community size simultaneously because doing so requires studying individual-level processes across many organizations and communities. This poses a barrier to research because many challenges involved in conducting empirical social computing research scale with the number of communities and contexts involved. For example, issues of access to data on users and their motivation, challenges related to both the size and diversity of data, and nuts-and-bolts issues related to the navigation of idiosyncratic features of multiple communities make it challenging enough for researchers to study detailed user-level behavior within a single community. Doing it across thousands or even millions of communities is often simply not possible.

Prior work in the social sciences describes this type of challenge in terms of a “micro-macro divide” (Opp, 2011). To address these challenges, researchers in ecology, economics, and sociology have characterized “agent-based complex systems” (Grimm et al., 2005) where properties of larger systems emerge from the decisions of interdependent agents. These disciplines often employ agent-based simulations in order to bridge the micro-macro divide (Grimm et al., 2005; Wilensky and Rand, 2015). Agent-based simulations (ABSs) capture important aspects of a context in the form of simple rules that agents follow. Researchers then create virtual experiments by modifying these rules in order to identify the conditions under which interacting agents produce various higher-level outcomes (Lazer and Friedman, 2007). At their best, such models reveal patterns of emergent collective behavior that even thoughtful analysis of micro-level action might never predict (and vice-versa). In one of the earliest agent-based simulations, Schelling (1971) showed how a world in which agents hold even a very weak preference to live near someone who resembles them becomes completely segregated, without any reference to geography, jobs, moving costs, social networks, or almost anything else that real people seem to consider in deciding to move.

Theories like critical mass theory (Marwell and Oliver, 1993) and complex contagion (Centola and Macy, 2007) were developed with the help of agent-based simulations and have been influential in social computing research (Raban et al., 2010; Romero et al., 2011). Despite this history of benefiting from ABS, a strong fit with social computing research questions, and calls for use from prominent scholars (Ren and Kraut, 2014), agent-based simulations remain extremely rare in HCI and social computing.

Because they bypass many of the challenges with large-scale empirical research described above, ABSs offer a feasible approach to bridge theories and findings at different levels of analysis. They can allow theory and data at higher levels to influence micro-level mechanisms and provide grounded, validated explanations for macro-level patterns. ABSs also let us explore how variation in assumptions and models of individual behavior combine and aggregate to group and population dynamics and can thus help enrich theories of individual and group behavior.

2.4. Bridging the “micro-macro divide” between joining/exit and community size

An ABS approach allows us to explore the community size distributions that emerge when agents join and leave communities according to models of social exposure and/or IEB decisions. This approach helps to formalize questions and theories, test how changes to assumptions and parameters influence outcomes, and evaluate the simulated outcomes against empirical baselines.

Consider the implications of social exposure dynamics on community size distributions. Social exposure provides a plausible micro-level mechanism of cumulative advantage. In that larger communities have more members by definition, they also have more people who can talk about them. This could lead to more people learning about and joining larger communities than smaller ones. This would result in even more people talking about the larger communities in the future. The result is a positive feedback mechanism where success begets success (van de Rijt et al., 2014).

Joining decisions based on IEB could also contribute to skewed outcomes. If individuals take participation benefits and early-adopter benefits as described by Resnick et al. (2012) into account as they decide which communities to join, they will gravitate towards those that are already large or that appear positioned to become largest. This should also create a feedback loop, where individuals join quickly growing communities, causing others to be even more likely to see them as quickly growing. As with social exposure, these dynamics could produce skewed group sizes.

Together, the positive feedback patterns created by social exposure and IEB decisions should amplify each other. With both forces operating jointly, people are both more likely to be exposed to larger communities as well as more likely to join them. Subsequently, these new members will expose others to these larger communities, who will also be more likely to join them, and so on. We would expect this to result in even more skewed distributions of community size than either model operating alone.

3. Methods

We created a set of agent-based simulations to explore whether computational models of social exposure and/or IEB decisions generate distributions of community size that resemble those observed empirically. To do so, we formalized theories of social exposure and IEB decisions into algorithms that simulated agents that decide which communities to join or leave and which to share with others. For each of the micro-level mechanisms, we simulated a series of decisions and interactions among a population of agents over time to observe the macro-level consequences in a population of hypothetical communities. We show the results of these simulations and compare the distribution of hypothetical community sizes to a sample drawn from Reddit. Additional details about the general approach, how we generated each set of simulations, and the process we follow for comparing the simulated results against the empirical baseline appear below.

3.1. Simulated models of community joining and exit

We performed four families of simulations. First, a set of null models with agents joining and leaving communities randomly to provide a starting point for comparison. We then ran models which test the influence of social exposure and IEB decisions separately. Finally, we simulated a model that includes both social exposure and IEB decisions.

For each of the four sets of simulations, we parameterize key conceptual variables involved in community joining and exit and run each simulation with slightly varied parameter values. Doing so serves several purposes. First, the varied parameters help ensure that the results from a set of simulations represent a general, meaningful pattern rather than an artifact of any specific parameter. This is especially important because many of the parameter values are choices informed by prior research rather than precise empirical estimates. Second, by observing the results over a range of plausible values, we can evaluate variation in outcomes over different realizations of the theories we test. Finally, this approach reduces the risk that our findings reflect flukes that only appear in a limited set of conditions.

All of the models are intended to simulate populations of people and communities over time. Each simulation proceeds through a series of steps. In each time step, each agent sees a subset of communities (the exposure set). Agents then decide whether to leave any of their current communities and whether to join any of the communities in the exposure set. Our model stores the results of these decisions and then moves to the next agent until all agents have completed their decisions. The process iterates repeatedly over a series of time steps until we declare a stopping point.

We fix the size of each of our simulations to 9,000 agents and 200 communities. We use 9,000 agents because it is few enough to be computationally tractable but large enough to allow for complex dynamics to emerge. We chose 200 communities because this sets the ratio of members to communities close to what is observed empirically on Reddit. In January 2017, there were 3,578,907 active commenters on Reddit who commented in 78,201 subreddits, a ratio of roughly 46:1, close to the 45:1 ratio in our simulations. We run each simulation for 24 time steps in order to allow the community size distributions to reach a steady state. Finally, we measure the number of members per community.

3.1.1. Null models

We first simulate a null model, in which exposure and participation are random. The null model does the following during each simulated time step: Each agent randomly leaves any community they currently belong to with probability $p_{\text{leave}}$. Each agent draws a random sample from the pool of all communities with probability $p_{\text{expose}}$, and each agent randomly chooses to join each sampled community with probability $p_{\text{join}}$.

We can observe how often people leave online communities, and so we set a value for $p_{\text{leave}}$ based on the proportion of people who leave subreddits each month on Reddit. Of all user accounts leaving at least five comments in any given subreddit in January 2017, 56% did not comment at least five times in that same subreddit in February. As a result, we set $p_{\text{leave}}$ to 0.56. Exposure is less visible. People are exposed to new online communities but we don’t know how many or which ones. Neither can we estimate the probability of participating after exposure. Because we assume that both are fairly low, we vary $p_{\text{expose}}$ and $p_{\text{join}}$ over several low values for our null simulations.
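The null model's three random steps (leave, exposure, join) can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function and parameter names are hypothetical, the default exposure and joining probabilities of 0.05 are placeholders rather than values from the paper, and only the 0.56 leave probability, the 9,000 agents, the 200 communities, and the 24 time steps come from the text.

```python
import random


def run_null_model(n_agents=9000, n_communities=200, n_steps=24,
                   p_leave=0.56, p_expose=0.05, p_join=0.05, seed=0):
    """Simulate random leaving, exposure, and joining; return community sizes.

    p_expose and p_join defaults are illustrative placeholders.
    """
    rng = random.Random(seed)
    memberships = [set() for _ in range(n_agents)]
    for _ in range(n_steps):
        for agent in range(n_agents):
            # Leave each current community independently with p_leave.
            stay = {c for c in memberships[agent] if rng.random() >= p_leave}
            # Each community enters the exposure set with p_expose.
            exposed = [c for c in range(n_communities)
                       if rng.random() < p_expose]
            # Join each exposed community independently with p_join.
            joined = {c for c in exposed if rng.random() < p_join}
            memberships[agent] = stay | joined
    sizes = [0] * n_communities
    for member_of in memberships:
        for c in member_of:
            sizes[c] += 1
    return sizes


# A deliberately small run so the example finishes quickly.
sizes = run_null_model(n_agents=300, n_communities=20, n_steps=5)
print(sorted(sizes, reverse=True))
```

Because every community is treated identically here, any differences in final size come from chance alone, which is what makes this a useful baseline for the other model families.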

3.1.2. Social exposure models

Theories of social exposure suggest that people learn about new communities through people they are already connected to. Our social exposure models capture this dynamic by randomly sampling community members from each of the focal agent’s communities. Larger communities have more people and therefore more opportunities for social connections, so the number of “neighbors” $n_i$ sampled from the $i$th community increases with that community’s current size $s_i$. A ceiling function rounds $n_i$ up to the nearest integer, ensuring that we always sample at least one neighboring agent.

During each time step, each of these neighboring community members samples up to $k$ of their other communities to share with the focal agent. We vary $k$ across different simulations. This process defines an exposure set for each agent whose maximum size is $k$ times the number of sampled neighbors. The size of the exposure set can be smaller if the chosen fellow members belong to fewer than $k$ other communities or if they share a community already in the exposure set. An agent not yet belonging to any communities is exposed to a random sample of all communities with uniform probability $p_{\text{expose}}$. Because social exposure theories have nothing to say about community exit, agents exit communities randomly with probability $p_{\text{leave}}$ as in our null models. From our null models, we found that choices for the exposure and joining probabilities ($p_{\text{expose}}$ and $p_{\text{join}}$) were of little importance to the qualitative distributional properties of interest. As a result, we fix both in the models reported here and include models with alternative values in the appendix as robustness checks.

No research that we know of provides insight into which communities people are likely to share with others. We show one model designed to capture this ignorance in which agents simply choose randomly which communities to share. However, one argument of this paper is that individual-level theories can and should be informed by higher-level phenomena. If we consider the skew toward membership in large communities observed empirically, we might guess that agents are more likely to share large communities. We therefore also simulate a social exposure model where agents share the largest communities to which they belong.
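The two sharing rules described above (share at random, or share the largest communities) can be sketched as a single Python function that builds one agent's exposure set. This is a hedged illustration: the function name and data layout are hypothetical, and the rule for how many neighbors to sample (the natural logarithm of community size, rounded up) is an assumption made here, since the text only says that the number grows with size and is rounded up.

```python
import math
import random


def exposure_set(agent, memberships, members, k, share_largest, rng):
    """Return the communities shared with `agent` by sampled neighbors.

    memberships: dict mapping agent -> set of communities
    members:     dict mapping community -> set of agents
    k:           maximum communities each neighbor shares
    """
    exposed = set()
    for community in memberships[agent]:
        others = sorted(members[community] - {agent})
        if not others:
            continue
        # ASSUMPTION: neighbors sampled = ceil(ln(size)), at least one.
        n_neighbors = max(1, math.ceil(math.log(len(members[community]))))
        n_neighbors = min(n_neighbors, len(others))
        for neighbor in rng.sample(others, n_neighbors):
            # The neighbor's communities other than the shared one.
            candidates = sorted(memberships[neighbor] - {community})
            if share_largest:
                candidates.sort(key=lambda c: len(members[c]), reverse=True)
                shared = candidates[:k]
            else:
                shared = rng.sample(candidates, min(k, len(candidates)))
            exposed.update(shared)
    return exposed


memberships = {0: {"a"}, 1: {"a", "b", "c"}, 2: {"b"}, 3: {"b"}}
members = {"a": {0, 1}, "b": {1, 2, 3}, "c": {1}}
rng = random.Random(0)
print(exposure_set(0, memberships, members, k=1, share_largest=True, rng=rng))
```

In the toy data, agent 0 has a single neighbor (agent 1), who belongs to two other communities. With `k=1` and `share_largest=True`, only the larger of the two is shared.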

3.1.3. IEB decision models

In order to formalize and simulate individual expected benefits theories, we begin with the formal IEB model by Resnick et al. (2012). Resnick et al. formalize their logic in equations but do not elaborate rules with enough detail to define behavior for the computational agents in our simulation. As a result, we first need to translate general concepts from the Resnick et al. model into rules specific enough for agents to follow.

Following Resnick et al., we model participation decisions as a function of community size and anticipated future growth. For any agent, the expected benefit of joining a community equals participation benefits ($B_p$) plus early adopter benefits ($B_e$) minus startup costs ($C$). Resnick et al. treated community success as dichotomous, claiming that people benefit from participation in communities that “succeed” by growing to a certain (undefined) size but not from participation in those that “fail.” We extend their approach by using a continuous representation of participation benefits. The intuition behind doing so is that a community with 1,000 people has more information, opportunities for friendship, and so on, than a community with 100 people, but that the smaller community provides some benefits. Support for this choice comes from recent survey work that suggests that small communities provide value to their members (Foote et al., 2017). We expect benefits of size to scale sublinearly so that a community with 1,000 people is not 10 times more valuable than one with 100. Agents therefore calculate participation benefits ($B_p$) as a logarithmic function of their estimate of the future size ($\hat{s}$) of a community.

In the first set of IEB models, we assume that agents estimate future community size ($\hat{s}$) by observing the current size and age of the community and extrapolating linearly six time steps into the future, with a slope based on the observed size and age.

In this way, early adopter benefits are a function of both the current size of the community, in that joiners of small communities have more opportunities for influence and status, and of the estimated future size ($\hat{s}$), because influence and status are more valuable in larger communities.

As with benefits from community success, the Resnick et al. model treats early adopter benefits as dichotomous. In this approach, a subset of early adopters benefit if a community succeeds, while no one else does. As with participation benefits, it makes more sense to model early adopter benefits as continuous so that they increase with the eventual size of the community, but decrease with the size of a community when an agent begins participating. Specifically, agents calculate early adopter benefits ($B_e$) as the ratio of the natural logarithm of their estimate of the community’s future size ($\hat{s}$) to the natural logarithm of the current size if they join ($s$). To avoid numeric issues, we add small constants to both quantities before taking logarithms.

Treating benefits as a continuous function makes it possible for an agent to rank the possible communities and join the top-ranked ones.

Finally, for the sake of parsimony, we assume that startup costs are fixed and identical across communities, but apply only to communities to which an agent does not already belong. The total individual expected benefit for a given agent and a given community is therefore the sum of participation benefits and early adopter benefits, minus the startup cost where it applies.

In this expression, the participation benefits and early adopter benefits are the agent’s own estimates for the community in question, and the startup cost is multiplied by an indicator that equals 1 when the agent does not already belong to the community (and thus has to pay the startup cost to join) and 0 otherwise.
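Putting the three terms together, a hedged sketch of the total expected benefit might look like the following. The logarithm offsets, the default `startup_cost` of 1.0, and all names are assumptions for illustration only.

```python
import math

def total_expected_benefit(estimated_future_size: float, current_size: float,
                           is_member: bool, startup_cost: float = 1.0) -> float:
    """Hypothetical total: participation + early adopter - startup cost (non-members only)."""
    participation = math.log(estimated_future_size + 1)  # assumed +1 offset
    early_adopter = math.log(estimated_future_size + 1) / math.log(current_size + 2)  # assumed offsets
    pays_startup = 0 if is_member else 1  # indicator: 1 only when the agent must join
    return participation + early_adopter - startup_cost * pays_startup

as_newcomer = total_expected_benefit(500, 50, is_member=False)
as_member = total_expected_benefit(500, 50, is_member=True)
```

For the same community, the two values differ by exactly the startup cost, which means that staying in a current community is slightly favored over joining an otherwise equivalent new one.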

Our formalization of the IEB theory captures the main ideas of the Resnick et al. model and makes more realistic assumptions in several respects. Figure 2 uses the total expected benefits function to visualize expected benefits over a range of values for the community size when an agent joins it (increasing along the x-axis) and the predicted community size (increasing along the y-axis). The greatest expected benefits occur in the top-left of the figure, representing communities that are predicted to grow large and that the agent has the opportunity to join early.

As our simulations proceed, agents also make decisions about whether to stay in communities or exit. More formally, in each time step an agent may join or exit any of a set of communities consisting of the union of their current communities and the exposure set. The agent participates in a fixed proportion of these communities, namely those with the highest expected benefits. We know from empirical research that participation is rare, but we do not know how rare. We therefore run simulations across a set of values for this proportion.
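The selection step can be sketched as a simple ranking. The function name, the use of a dictionary of benefits, and rounding the number of kept communities up are all assumptions; the text does not specify how fractional counts are handled.

```python
import math

def choose_communities(expected_benefits: dict, proportion: float) -> set:
    """Keep the top `proportion` of candidate communities, ranked by expected benefit."""
    n_keep = math.ceil(proportion * len(expected_benefits))  # rounding up is an assumption
    ranked = sorted(expected_benefits, key=expected_benefits.get, reverse=True)
    return set(ranked[:n_keep])

# Candidates are the union of current communities and the exposure set.
benefits = {"a": 3.2, "b": 7.9, "c": 5.1, "d": 0.4}
chosen = choose_communities(benefits, proportion=0.5)
```

Because current communities and newly exposed ones are ranked together, a current community that falls out of the top set is exited in the same step in which a better-ranked community is joined.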

Figure 2. Visualization of total expected benefits in our formalized IEB process. The x-axis shows the current size of the community when the agent considers joining and the y-axis represents the agent’s prediction for how large the community will grow six time steps in the future.

As with the social exposure models, we also simulate alternative specifications of the IEB models. Profound skew in community size suggests that people may value community size more highly than linear model terms can accommodate. To address this, we also simulate agents who extrapolate from the current community size using a quadratic function. This increases the predicted size of already large communities by much more than it increases the predicted size of small communities and further reinforces the strong preference for larger communities.
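To make the contrast between the two projections concrete, the sketch below assumes growth curves that pass through the origin, so that size is proportional to age (linear) or to age squared (quadratic). This parameterization, the function name, and the `degree` argument are assumptions and may not match the specification used in the simulations.

```python
def extrapolate(current_size: float, age: int, horizon: int = 6, degree: int = 1) -> float:
    """Hypothetical projection assuming size grows proportionally to age ** degree."""
    if age <= 0:
        return float(current_size)  # no growth history to extrapolate from
    return current_size * ((age + horizon) / age) ** degree

# Two communities of the same age (12 steps) but very different sizes.
large_linear = extrapolate(1000, 12, degree=1)
large_quadratic = extrapolate(1000, 12, degree=2)
small_linear = extrapolate(10, 12, degree=1)
small_quadratic = extrapolate(10, 12, degree=2)

# Extra predicted members gained by switching to the quadratic projection.
large_gain = large_quadratic - large_linear
small_gain = small_quadratic - small_linear
```

Under these assumptions, switching to the quadratic projection adds hundreds of predicted members to the large community but only a handful to the small one, which is the reinforcing effect the paragraph above describes.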

3.1.4. Combined models

Finally, we simulate a model that includes both social exposure and IEB decisions. For this combined model, we use the versions of social exposure and IEB intended to produce the most skewed distributions: social exposure in which agents share the largest communities, and joining and exit decisions determined by the IEB formulas using a quadratic projection of community size.

3.2. Empirical validation

The sizes of subreddits on Reddit provide the empirical baseline for our comparison. In order to construct a sample that matches the scale of our simulations, we used data published by Pushshift to identify the number of active members of all 23,663 subreddit communities active in January 2017. We define an active member as a unique username that has commented at least five times during the period of data collection. Because the communities in our sample can only have up to 9,000 members, we truncate the plots at that point. This excludes the size data of 29 subreddits which had more than 9,000 members.
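The active-member definition above can be sketched as a simple counting step. The input format (one `(username, subreddit)` pair per comment) and the choice to apply the five-comment threshold within each subreddit are assumptions; the function name is hypothetical and this is not the authors' processing pipeline.

```python
from collections import Counter

def active_member_counts(comments, min_comments: int = 5) -> dict:
    """Count, per subreddit, the usernames with at least `min_comments` comments there."""
    per_user = Counter(comments)  # number of comments per (username, subreddit) pair
    sizes = Counter()
    for (username, subreddit), n_comments in per_user.items():
        if n_comments >= min_comments:
            sizes[subreddit] += 1
    return dict(sizes)

# Toy data: one pair per comment.
comments = ([("ann", "r/cats")] * 5
            + [("bob", "r/cats")] * 2
            + [("bob", "r/dogs")] * 6)
sizes = active_member_counts(comments)
```

In the toy data, "bob" does not count as an active member of r/cats because only two of his comments were posted there.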

We analyze the results of our agent-based simulations through visual comparison of the distributions of community size generated by the simulations and those observed in a sample of online communities. We present complementary empirical cumulative distribution functions (eCDFs) to visualize data from our simulations and from real communities. The community size is shown along the x-axis. The proportion of communities at least as large as any given value along the x-axis is shown along the y-axis. Due to the skew of the data, we log both axes. We elaborate on the empirical sample, eCDFs, and the rationale for our comparisons below.
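A complementary eCDF of this kind can be computed directly from a list of community sizes. The sketch below uses a hypothetical function name and a deliberately simple quadratic-time loop for clarity rather than speed.

```python
def complementary_ecdf(sizes):
    """Return (size, proportion of communities at least that large) pairs."""
    n = len(sizes)
    distinct = sorted(set(sizes))
    return [(x, sum(1 for s in sizes if s >= x) / n) for x in distinct]

points = complementary_ecdf([1, 1, 2, 5, 10])
# points == [(1, 1.0), (2, 0.6), (5, 0.4), (10, 0.2)]
```

Plotting the first coordinate against the second on logarithmic axes gives the kind of curve compared throughout the results.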

We inspect our plots to evaluate whether our simulations generate heavy tails reflecting the large number of very large communities found empirically. Such distributions appear as a straight or nearly straight diagonal line on our plots. In contrast, normal (Gaussian) distributions deviate quickly and sharply from the diagonal. Our analysis considers whether the simulated eCDFs generated by the different models form straight lines as well as how well they align with the eCDF from Reddit.

For the different families of simulations, we present the results as grids of plots. Each cell of these grids corresponds to a single permutation of the possible parameter values described above; the permuted parameter names and values appear above and to the right of each grid. Every plot includes the community size eCDF produced by the corresponding simulation in blue as well as the community size eCDF from our sample of subreddits in red-orange. The x-axis and y-axis labels for the plots appear along the bottom and to the left of each grid. Each curve starts at the top of the y-axis at the point corresponding to the smallest community and reaches 0 at the size of the largest community at the corresponding point on the x-axis. The eCDF produced by the subreddit communities is identical in every sub-plot of every figure, and differences in appearance result from shifts in the scaling of axes and aspect ratios.

Interpretation of our results rests on identifying qualitative similarities between the eCDF from Reddit and the eCDFs from our simulations. While quantitative procedures and statistical tests for comparing CDFs exist, our goal is not to simulate a process that can generate the precise distributions from Reddit. Rather, our goal is to simulate theoretical models of community joining and evaluate if they can generate a distribution of community sizes with characteristics that are qualitatively similar to those observed in reality. For each simulation, we ask whether community members’ collective behavior traces a fairly straight diagonal line across the range of the log-log eCDF plot. This shape means that a considerable number of both large and small communities exist.

4. Results

Overall, we find that only a synthesis of social exposure and IEB decisions produces community size distributions that broadly resemble empirically observed patterns of behavior. The null model fails to produce realistic distributions of community sizes. Our simulations of agents using either individual expected benefits-based decisions or social exposure rules alone do better than the null model but produce either too few large communities or too few small communities to reproduce empirical patterns. We explain each set of results in more detail below.

4.1. Null models

Figure 3. Null models with random exposure and random decisions to join. Moving from left to right across the grid, each column reflects an increasing probability that agents will join a community they are exposed to. Rows reflect increasing probability of exposure from top to bottom. Each sub-plot visualizes eCDFs, where the x-axis is the proportion of all contributors that are in a given community and the y-axis is the number of communities at least that large.

Figure 3 shows a grid of eCDF plots of community sizes for the null model simulations, which incorporate random exposure and random joining and exit decisions. The grid of plots shows how the simulated eCDF changes with varying joining probability (increasing from left to right) and exposure probability (increasing from top to bottom).
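A null model of this kind can be sketched in a few lines. Everything below is an illustrative assumption rather than the simulation code behind Figure 3: the function and parameter names, the order of exit and joining within a time step, and the toy parameter values.

```python
import random

def simulate_null_model(n_agents, n_communities, p_expose, p_join, p_exit,
                        n_steps, seed=0):
    """Hypothetical null model: random exposure, random joining, random exit."""
    rng = random.Random(seed)
    memberships = [set() for _ in range(n_agents)]
    for _ in range(n_steps):
        for member_of in memberships:
            # Random exit from each current community.
            for community in list(member_of):
                if rng.random() < p_exit:
                    member_of.discard(community)
            # Random exposure, followed by a random decision to join.
            for community in range(n_communities):
                if (community not in member_of
                        and rng.random() < p_expose
                        and rng.random() < p_join):
                    member_of.add(community)
    sizes = [0] * n_communities
    for member_of in memberships:
        for community in member_of:
            sizes[community] += 1
    return sizes

sizes = simulate_null_model(n_agents=200, n_communities=20, p_expose=0.1,
                            p_join=0.1, p_exit=0.05, n_steps=30)
```

Because every community is treated identically and no feedback links size to future growth, the resulting sizes cluster around a common value, which is consistent with the bell-shaped distributions reported for the null models.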

The simulated eCDFs (in blue) do not form straight diagonal lines, nor do they align with the empirical eCDF produced by subreddit communities. Starting in the top left of the grid, the simulation with the lowest probabilities of random joining and exit decisions and of random exposure mainly generates small communities. Towards the bottom right of the grid, the joining and exit decision and exposure probabilities get higher and lead to uniformly larger communities. Nearly all of the simulated eCDFs assume a vertical appearance at some point along the community size distribution (x-axis). In the context of a log-log CDF plot, this pattern is characteristic of more bell-shaped (normal) distributions with relatively low variance.

When the probability of exposure and joining and exit decisions are low, the distribution is slightly skewed. At higher probabilities, it is bell-shaped. The overall similarity of these results across the grid suggests that none of the parameters in the ranges we include have a strong direct effect on the skew of community sizes.

4.2. Social exposure models

Figure 4. Social exposure models. Complementary eCDF plots showing when agents are exposed to communities via others in their current communities. Moving from left to right, the number of communities that each “neighbor” shares increases. Upper plots show when agents share a random set of communities, lower plots show when they share the largest communities to which they belong.

Figure 4 plots the results from simulations of individual-level social exposure to new communities, with the probabilities of random joining and exit held fixed. (The appendix includes results from a model with a different joining probability.) In this grid, the number of communities shared per neighbor increases over the columns from left to right. The two rows contain versions of the models where agents share randomly selected communities (top) and the biggest communities (bottom). The top row suggests that social exposure when agents share a random set of communities generates a distribution of communities of similar sizes rather than a range of large and small communities. This occurs no matter how many communities each neighbor shares, even as the size of communities grows as the number of shared communities increases.

We find that models in which agents share only the largest communities to which they belong (the bottom row of Figure 4) produce more realistic distributions of community size. As the number of communities shared per neighbor increases (looking left to right across the row), the simulated eCDF shifts closer to the diagonal and the eCDF from Reddit. However, substantial deviations remain even in the bottom right plots that offer the best fit. These simulations produce too few small communities and too few large communities to line up more closely with the empirical data. Visually, this is why the simulated eCDFs in this bottom row start out above the eCDF from Reddit and then cross below it at some point along the x-axis.
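The two sharing rules compared in Figure 4 can be sketched as a single function. The name `communities_to_share`, its arguments, and the default random seed are hypothetical; the sketch only illustrates the difference between sharing random communities and sharing the largest ones.

```python
import random

def communities_to_share(memberships, sizes, k, rule="largest", rng=None):
    """Pick up to k of a neighbor's communities to expose an agent to."""
    members = list(memberships)
    if rule == "largest":
        # Share the k largest communities the neighbor belongs to.
        return sorted(members, key=lambda c: sizes[c], reverse=True)[:k]
    # Otherwise share a uniformly random subset of the neighbor's communities.
    rng = rng or random.Random(0)
    return rng.sample(members, min(k, len(members)))

sizes = {"a": 40, "b": 900, "c": 7, "d": 310}
shared = communities_to_share({"a", "b", "c", "d"}, sizes, k=2)
# shared == ["b", "d"]
```

Under the "largest" rule, communities that are already big are shared more often, which creates the feedback between size and exposure that the random rule lacks.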

4.3. IEB models

Figure 5. IEB models. Complementary eCDFs of community size when agents are exposed to a random set of communities and choose based on IEB. Moving from left to right, the proportion of communities that agents join increases. From top to bottom, the probability of random exposure increases.

Figure 5 shows results of simulations with agents that make joining and exit decisions based on individual expected benefits after being exposed to a random subset of communities. The rows of the grid vary the proportion of communities from the exposure set that each agent joins. The columns vary the value of the random exposure probability. As with the null models and the social exposure models, none of the IEB models produce results that align with the empirical baseline across the full range of the eCDFs.

The IEB models generate better fits to the empirical data than either the null models or social exposure models. The best fits in the grid appear along the left column and towards the top row. This suggests that agents making IEB decisions and joining a small proportion of the largest communities they are randomly exposed to produce the most realistic distribution of community sizes. However, the simulated results remain consistently less skewed than the empirical data at the high end of the community size range. The single best-fitting simulated eCDF in Figure 5 (in the top-left) tracks the empirical eCDF very closely across most of the community size range, and then falls off. This illustrates that the IEB decision models also fail to generate a sufficient number of the very largest communities. The second IEB model, in which agents fit a quadratic rather than a linear equation, produced nearly identical results. A figure showing these results is included in the Appendix.

4.4. Combined models: Social exposure plus IEB

The combined models incorporate the two sub-processes of community joining and exit considered separately above. Figure 6 shows the results of these models across a range of parameters for each of the sub-processes. In terms of social exposure, the plots in the grid show agents that share an increasing number of communities with their neighbors, from left to right. In terms of IEB decisions, the plots display agents that join an increasing proportion of communities from their exposure set, from top to bottom.

Figure 6. Combined model. Community sizes when agents are exposed to new communities via social exposure and make participation decisions based on IEB. Moving from left to right, the number of communities that each “neighbor” shares increases. Moving from top to bottom, the proportion of the considered communities that an agent chooses to join or remain in increases.

As before, none of the plots in this grid show perfect alignment between the simulated and empirical data across the full range of both the x and y axes. In general, the deviations once again emerge at the upper end of the community size distribution, indicating that even social exposure and IEB decisions combined fail to generate a sufficient number of the very largest communities within the framework of our simulations. However, several of these plots align much more closely than any of the others we have presented thus far, and many of them appear as straight or nearly straight lines. In particular, the simulation data plotted in the top left cell and the plots in the right three cells of the middle row line up closely with the empirical eCDF right up until the very upper end of the community size distribution. Other cells along the top row also show good alignment.

The additive effects of the two sub-processes likely explain the improved fit we observe among several of the combined models. The fact that the improved fit recurs across multiple parameter values suggests that the improvements come from the features of the model rather than any particular parameter value on its own. In the appendix, we show the results of a robustness test with lower exposure probabilities and find similar results.

5. Discussion

Our simulations suggest that the combination of social exposure and IEB, two mechanisms of community joining and exit identified in previous research, produces realistic distributions of community size. Simulations based on social exposure or IEB alone result in right-skewed community sizes but do not generate extremely large communities. A combined model produces community sizes that more closely resemble the entire range of empirical community sizes. These findings extend prior research by providing evidence of this joint relationship. The ABS results also illuminate aspects of the relationships between the levels of individual and collective behavior that prior research had not considered.

A few findings bear further comment. First, when we model agents exposed randomly to communities, no decision process produces enough really large communities. Random exposure cannot explain empirical distributions of community size in the absence of a plausible mechanism of exposure to a much larger proportion of communities. Mechanisms that make exposure to larger communities more likely than exposure to smaller ones are required to produce extremely large communities. Social exposure provides a compelling mechanism that does this and has prior empirical and theoretical support.

However, even the effect of social exposure depends on which communities agents share. When agents share communities randomly, social exposure also has little impact on the distribution of community sizes. The first set of social exposure simulations suggests that when people belong to only a few communities and share randomly, no community gets large enough to benefit from the feedback mechanism introduced by social exposure. Our second set of social exposure simulations uses information about patterns in community size to modify our initial model so that agents choose to share the largest communities to which they belong. Only then does social exposure lead to skewed community sizes. This suggests an extension to our understanding of social exposure: that people are more likely to share large communities. We believe this reflects a novel proposition that should be tested empirically.

That said, this mechanism appears to have some limits. Even when agents only share the largest communities they belong to, community sizes remain concentrated and very few communities reach the massive sizes we observe in Reddit. Figure 8 in our appendix shows additional analyses that suggest that this shortcoming appears even more pronounced when agents are initially exposed to a smaller proportion of the community space.

Another initially perplexing finding provides a clue as to what may be happening. Although we might expect the most extreme skew when agents share only a few of the very largest communities they belong to (thus making them ever larger), the simulations suggest the greatest skew emerges when agents share many. If we look closely at Figure 4 we see that large communities are always produced and what improves the fit is the production of more moderately large and small communities. One explanation is that by only sharing one community, people quickly converge on one or two large communities. When more communities are shared, this includes more medium-sized communities.

Our finding that neither social exposure nor IEB decisions alone could produce the extent of cumulative advantage necessary to generate the very largest communities also extends prior work. On their own, each sub-process provides a plausible social mechanism for cumulative advantage. Previous studies analyzing social exposure or IEB decisions separately had neither evaluated directly what kinds of macro-level outcomes they produce nor considered the implications of the two processes interacting. Similarly, previous studies that identified cumulative advantage as a mathematical mechanism of observed community size distributions had not evaluated whether or to what degree specific micro-level behaviors could approximate inequality in the distribution of individuals across communities. The ABS results we present bridge this divide and advance both bodies of prior work simultaneously.

Our findings are far from obvious and the simulations presented might have produced very different results. For example, at the outset of this project, we did not know whether both sub-processes would be necessary. Nor did we know whether agents who share and join the largest communities would create a bi-modal distribution where nearly all communities were small and all agents belonged to a few very large communities. Just because we, the researchers, designed and manipulated the parameters of the simulated models does not mean we could anticipate or determine the results.

The reasons that social exposure and IEB decisions produce a good approximation of the full distribution deserve attention in future research. For example, we expect that clustering may play an important role in explaining why agents who preferentially share and join the largest communities nevertheless wind up in many small and medium-sized communities. Such a pattern might occur because many communities remain “unheard of” outside of local networks of agents who share overlapping communities. This proposition merits further evaluation and might help explain other dimensions of empirically observed behavior.

In sum, our simulations suggest that social exposure combined with IEB decision-making provides a reasonable explanation for empirical community size patterns. Our results also indicate that there is more to the story, however, and that these processes explain much, but not all, of the skew in community size. Below we discuss other features that future work should consider.

5.1. Limitations

Our approach has several important limitations. The most important is common to all agent-based simulations. While we chose a set of models that we believe capture the most important aspects of the real-world social computing system we seek to understand, other reasonable formulations might lead to different outcomes. Guided by theory, we attempted to identify the aspects of the model most likely to be key sources of variation and to parameterize and test those aspects. Although we are confident that our models are useful and valid in their current form, we have no doubts that they can be productively elaborated upon.

ABS, like all scientific modeling, intentionally elides details of the real world systems we seek to understand. For example, we follow most other ABSs in treating all of the agents and collectives in our simulations as homogeneous. While our model provides plenty of opportunity for certain types of heterogeneity to arise, it is obvious that real people are heterogeneous in their resources, interests, and skills. Similarly, communities have topics which may be of broad or very narrow interest. Although we chose a more parsimonious approach to modeling that ignores it, we are confident that this heterogeneity contributes to heterogeneity in community sizes and can explain some of the remaining skew in participation our models fail to capture. Other aspects of these systems are also likely to influence group sizes and are deserving of attention, such as the role of pseudonymity/anonymity, heterogeneity in costs to contribute, tools for social interaction, or technological features like recommendation systems or default community memberships.

Finally, our work is limited in that we evaluated our ABSs using only a single macro-level behavior (community size) from a single empirical data source (Reddit). Future work should look at how well these simulations predict additional outcomes and behavioral patterns at meso- and macro-levels such as clustering in participation networks, heterogeneity in individual participation rates, or temporal patterns of contribution. Additional studies might also validate or revisit findings against multiple empirical baselines to ensure that conclusions do not reflect the biases of a single platform, interface, or time period.

6. Conclusion

Two social computing theories of how individuals learn about, join, and leave communities provide a reasonable explanation for how highly unequal distributions of community sizes arise. These results link micro-level models of social exposure and IEB joining and leaving decisions to a distinct scholarship on population-level distributions of community size. The results also support novel theoretical extensions of these micro-level models that can be tested empirically.

In practical terms, the results underscore that highly skewed and unequal community sizes need not result from failures on the part of community leaders or participants. Instead, macro-level inequalities likely arise through the aggregation of individual tendencies and preferences magnified by the massive scale of large sites like Reddit and others. The fact that two very simple processes explain so much about community sizes also suggests that platforms should be cautious when changing things that could directly impact either what people share or the information visible to help them make decisions. Designers, community managers, advertisers and others may want to nudge users towards broader and more equitable community sizes. However, doing so may require fundamentally transforming the ways that individuals learn about or decide to participate in their communities.

Micro-macro divides such as the one explored in this study provide many opportunities for future social computing research. Theoretically-grounded ABS combined with empirical validation provides an ideally-suited approach to advancing these inquiries. We hope others will extend and evaluate the results presented here, both directly through modeling other macro-level aspects of online communities, and through using agent-based simulations to bridge other micro-macro divides to contribute to our understanding of social computing systems.

Acknowledgements.
An earlier version of this paper was part of the first author’s PhD dissertation, and some of the text and images appear in that thesis. This work was supported by NSF grants IIS-1617468, IIS-1617129, IIS-1908850 and IIS-1910202. Versions of this paper received very helpful feedback from participants at the International Communication Association conference, the Organizational Communication Mini Conference, and the International Conference on Computational Social Science. The simulations were run on the University of Washington’s Hyak computing cluster.

References

  • Adamic and Huberman (2000) Lada A. Adamic and Bernardo A. Huberman. 2000. Power-Law Distribution of the World Wide Web. Science 287, 5461 (March 2000), 2115–2115. https://doi.org/10.1126/science.287.5461.2115a
  • Antin (2011) Judd Antin. 2011. My Kind of People?: Perceptions About Wikipedia Contributors and Their Motivations. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’11). ACM, New York, NY, USA, 3411–3420. https://doi.org/10.1145/1978942.1979451
  • Antin and Cheshire (2010) Judd Antin and Coye Cheshire. 2010. Readers Are Not Free-Riders: Reading as a Form of Participation on Wikipedia. In Proceedings of the 2010 ACM Conference on Computer Supported Cooperative Work (CSCW ’10). ACM, New York, NY, USA, 127–130. https://doi.org/10.1145/1718918.1718942
  • Antin et al. (2012) Judd Antin, Coye Cheshire, and Oded Nov. 2012. Technology-Mediated Contributions: Editing Behaviors among New Wikipedians. In Proceedings of the ACM 2012 Conference on Computer Supported Cooperative Work (CSCW ’12). ACM, New York, NY, USA, 373–382. https://doi.org/10.1145/2145204.2145264
  • Backstrom et al. (2006) Lars Backstrom, Dan Huttenlocher, Jon Kleinberg, and Xiangyang Lan. 2006. Group Formation in Large Social Networks: Membership, Growth, and Evolution. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’06). ACM, New York, NY, USA, 44–54. https://doi.org/10.1145/1150402.1150412
  • Bakshy et al. (2009) Eytan Bakshy, Brian Karrer, and Lada A. Adamic. 2009. Social Influence and the Diffusion of User-Created Content. In Proceedings of the 10th ACM Conference on Electronic Commerce (EC ’09). Association for Computing Machinery, Stanford, California, USA, 325–334. https://doi.org/10.1145/1566374.1566421
  • Bakshy et al. (2012) Eytan Bakshy, Itamar Rosenn, Cameron Marlow, and Lada Adamic. 2012. The Role of Social Networks in Information Diffusion. In Proceedings of the 21st International Conference on World Wide Web (WWW ’12). Association for Computing Machinery, Lyon, France, 519–528. https://doi.org/10.1145/2187836.2187907
  • Barabási and Albert (1999) Albert-László Barabási and Réka Albert. 1999. Emergence of Scaling in Random Networks. Science 286, 5439 (Oct. 1999), 509–512. https://doi.org/10.1126/science.286.5439.509
  • Benkler et al. (2015) Yochai Benkler, Aaron Shaw, and Benjamin Mako Hill. 2015. Peer Production: A Form of Collective Intelligence. In Handbook of Collective Intelligence, Thomas W. Malone and Michael S. Bernstein (Eds.). MIT Press, Cambridge, MA, 175–204.
  • Bhatt et al. (2010) Rushi Bhatt, Vineet Chaoji, and Rajesh Parekh. 2010. Predicting Product Adoption in Large-Scale Social Networks. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management (CIKM ’10). Association for Computing Machinery, Toronto, ON, Canada, 1039–1048. https://doi.org/10.1145/1871437.1871569
  • Broido and Clauset (2019) Anna D. Broido and Aaron Clauset. 2019. Scale-Free Networks Are Rare. Nature Communications 10, 1 (March 2019), 1–10. https://doi.org/10.1038/s41467-019-08746-5
  • Bryant et al. (2005) Susan L. Bryant, Andrea Forte, and Amy Bruckman. 2005. Becoming Wikipedian: Transformation of Participation in a Collaborative Online Encyclopedia. In Proceedings of the 2005 International ACM SIGGROUP Conference on Supporting Group Work (GROUP ’05). ACM, New York, NY, 1–10. https://doi.org/10.1145/1099203.1099205
  • Centola and Macy (2007) Damon Centola and Michael Macy. 2007. Complex Contagions and the Weakness of Long Ties. Amer. J. Sociology 113, 3 (Nov. 2007), 702–734. https://doi.org/10.1086/521848
  • Crowston and Howison (2005) Kevin Crowston and James Howison. 2005. The Social Structure of Free and Open Source Software Development. First Monday 10, 2 (Feb. 2005). https://doi.org/10.5210/fm.v10i2.1207
  • Crowston et al. (2008) Kevin Crowston, Kangning Wei, James Howison, and Andrea Wiggins. 2008. Free/Libre Open-Source Software Development: What We Know and What We Do Not Know. Comput. Surveys 44, 2 (March 2008), 7:1–7:35. https://doi.org/10.1145/2089125.2089127
  • Crowston et al. (2006) Kevin Crowston, Kangning Wei, Qing Li, and J. Howison. 2006. Core and Periphery in Free/Libre and Open Source Software Team Communications. In Proceedings of the 39th Annual Hawaii International Conference on System Sciences, 2006. HICSS ’06, Vol. 6. IEEE Computer Society, Kauai, Hawaii, 118a. https://doi.org/10.1109/HICSS.2006.101
  • Danescu-Niculescu-Mizil et al. (2013) Cristian Danescu-Niculescu-Mizil, Robert West, Dan Jurafsky, Jure Leskovec, and Christopher Potts. 2013. No Country for Old Members: User Lifecycle and Linguistic Change in Online Communities. In Proceedings of the 22nd International Conference on World Wide Web - WWW ’13. ACM Press, Rio de Janeiro, Brazil, 307–318. https://doi.org/10.1145/2488388.2488416
  • Foote et al. (2017) Jeremy Foote, Darren Gergle, and Aaron Shaw. 2017. Starting Online Communities: Motivations and Goals of Wiki Founders. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (CHI ’17). ACM, New York, NY, 6376–6380. https://doi.org/10.1145/3025453.3025639
  • Frank (1993) Robert H. Frank. 1993. The Strategic Role of the Emotions: Reconciling over-and Undersocialized Accounts of Behavior. Rationality and Society 5, 2 (1993), 160–184. https://doi.org/10.1177/1043463193005002003
  • Graves (2013) John David Nicholas Graves. 2013. Open Source Software Development as a Complex System. Thesis. Auckland University of Technology.
  • Grimm et al. (2005) Volker Grimm, Eloy Revilla, Uta Berger, Florian Jeltsch, Wolf M. Mooij, Steven F. Railsback, Hans-Hermann Thulke, Jacob Weiner, Thorsten Wiegand, and Donald L. DeAngelis. 2005. Pattern-Oriented Modeling of Agent-Based Complex Systems: Lessons from Ecology. Science 310, 5750 (Nov. 2005), 987–991. https://doi.org/10.1126/science.1116681
  • Halfaker et al. (2013) Aaron Halfaker, R. Stuart Geiger, Jonathan T. Morgan, and John Riedl. 2013. The Rise and Decline of an Open Collaboration System: How Wikipedia’s Reaction to Popularity Is Causing Its Decline. American Behavioral Scientist 57, 5 (May 2013), 664–688. https://doi.org/10.1177/0002764212469365
  • Halfaker et al. (2011) Aaron Halfaker, Aniket Kittur, and John Riedl. 2011. Don’t Bite the Newbies: How Reverts Affect the Quantity and Quality of Wikipedia Work. In Proceedings of the 7th International Symposium on Wikis and Open Collaboration (WikiSym ’11). ACM, New York, NY, 163–172. https://doi.org/10.1145/2038558.2038585
  • Healy and Schussman (2003) Kieran Healy and Alan Schussman. 2003. The Ecology of Open-Source Software Development. (2003).
  • Johnson et al. (2014) Steven L. Johnson, Samer Faraj, and Srinivas Kudaravalli. 2014. Emergence of Power Laws in Online Communities: The Role of Social Mechanisms and Preferential Attachment. Management Information Systems Quarterly 38, 3 (2014), 795–808.
  • Jones et al. (2002) Q. Jones, G. Ravid, and S. Rafaeli. 2002. An Empirical Exploration of Mass Interaction System Dynamics: Individual Information Overload and Usenet Discourse. In Proceedings of the 35th Annual Hawaii International Conference on System Sciences. IEEE Computer Society, Big Island, Hawaii, 1050–1059. https://doi.org/10.1109/HICSS.2002.994061
  • Kairam et al. (2012) Sanjay Ram Kairam, Dan J. Wang, and Jure Leskovec. 2012. The Life and Death of Online Groups: Predicting Group Growth and Longevity. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining (WSDM ’12). ACM, New York, NY, USA, 673–682. https://doi.org/10.1145/2124295.2124374
  • Kittur and Kraut (2010) Aniket Kittur and Robert E. Kraut. 2010. Beyond Wikipedia: Coordination and Conflict in Online Production Groups. In Proceedings of the 2010 ACM Conference on Computer Supported Cooperative Work (CSCW ’10). ACM, New York, NY, 215–224. https://doi.org/10.1145/1718918.1718959
  • Klandermans (2004) Bert Klandermans. 2004. Why Social Movements Come into Being and Why People Join Them. In The Blackwell Companion to Sociology, Judith R. Blau (Ed.). Blackwell Publishing Ltd, Oxford, UK, 268–281. https://doi.org/10.1002/9780470693452.ch19
  • Kraut and Fiore (2014) Robert E. Kraut and Andrew T. Fiore. 2014. The Role of Founders in Building Online Groups. In Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW ’14). ACM, Baltimore, Maryland, USA, 722–732. https://doi.org/10.1145/2531602.2531648
  • Kraut et al. (2012) Robert E. Kraut, Paul Resnick, and Sara Kiesler. 2012. Building Successful Online Communities: Evidence-Based Social Design. MIT Press, Cambridge, MA.
  • Lazer and Friedman (2007) David Lazer and Allan Friedman. 2007. The Network Structure of Exploration and Exploitation. Administrative Science Quarterly 52, 4 (Dec. 2007), 667–694. https://doi.org/10.2189/asqu.52.4.667
  • Marwell and Oliver (1993) Gerald Marwell and Pamela Oliver. 1993. The Critical Mass in Collective Action: A Micro-Social Theory. Cambridge University Press, Cambridge, UK.
  • Merton (1968) Robert K. Merton. 1968. The Matthew Effect in Science. Science 159, 3810 (1968), 56–63.
  • Morgan and Halfaker (2018) Jonathan T. Morgan and Aaron Halfaker. 2018. Evaluating the Impact of the Wikipedia Teahouse on Newcomer Socialization and Retention. In Proceedings of the 14th International Symposium on Open Collaboration (OpenSym ’18). ACM, New York, NY, 20:1–20:7. https://doi.org/10.1145/3233391.3233544
  • Opp (2011) Karl-Dieter Opp. 2011. Modeling Micro-Macro Relationships: Problems and Solutions. The Journal of Mathematical Sociology 35, 1-3 (Jan. 2011), 209–234. https://doi.org/10.1080/0022250X.2010.532257
  • Panciera et al. (2009) Katherine Panciera, Aaron Halfaker, and Loren Terveen. 2009. Wikipedians Are Born, Not Made: A Study of Power Editors on Wikipedia. In Proceedings of the ACM 2009 International Conference on Supporting Group Work (GROUP ’09). ACM, New York, NY, 51–60. https://doi.org/10.1145/1531674.1531682
  • Panek et al. (2018) Elliot Panek, Connor Hollenbach, Jinjie Yang, and Tyler Rhodes. 2018. The Effects of Group Size and Time on the Formation of Online Communities: Evidence from Reddit. Social Media + Society 4, 4 (Oct. 2018), 2056305118815908. https://doi.org/10.1177/2056305118815908
  • Raban et al. (2010) Daphne R. Raban, Mihai Moldovan, and Quentin Jones. 2010. An Empirical Study of Critical Mass and Online Community Survival. In Proceedings of the 2010 ACM Conference on Computer Supported Cooperative Work (CSCW ’10). ACM, New York, NY, USA, 71–80. https://doi.org/10.1145/1718918.1718932
  • Ren and Kraut (2014) Yuqing Ren and Robert E. Kraut. 2014. Agent Based Modeling to Inform the Design of Multiuser Systems. In Ways of Knowing in HCI, Judith S. Olson and Wendy A. Kellogg (Eds.). Springer New York, New York, NY, 395–419. https://doi.org/10.1007/978-1-4939-0378-8_16
  • Resnick et al. (2012) Paul Resnick, Joseph Konstan, Yan Chen, and Robert E Kraut. 2012. Starting New Online Communities. In Building Successful Online Communities: Evidence-Based Social Design. MIT Press, Cambridge, MA, 231–280.
  • Romero et al. (2011) Daniel M. Romero, Brendan Meeder, and Jon Kleinberg. 2011. Differences in the Mechanics of Information Diffusion Across Topics: Idioms, Political Hashtags, and Complex Contagion on Twitter. In Proceedings of the 20th International Conference on World Wide Web (WWW ’11). ACM, New York, NY, USA, 695–704. https://doi.org/10.1145/1963405.1963503
  • Schelling (1971) Thomas C. Schelling. 1971. Dynamic Models of Segregation. The Journal of Mathematical Sociology 1, 2 (July 1971), 143–186. https://doi.org/10.1080/0022250X.1971.9989794
  • Shaw and Hill (2014) Aaron Shaw and Benjamin Mako Hill. 2014. Laboratories of Oligarchy? How the Iron Law Extends to Peer Production. Journal of Communication 64, 2 (2014), 215–238. https://doi.org/10.1111/jcom.12082
  • State and Adamic (2015) Bogdan State and Lada Adamic. 2015. The Diffusion of Support in an Online Social Movement: Evidence from the Adoption of Equal-Sign Profile Pictures. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW ’15). Association for Computing Machinery, Vancouver, BC, Canada, 1741–1750. https://doi.org/10.1145/2675133.2675290
  • TeBlunthuis et al. (2017) Nathan TeBlunthuis, Aaron Shaw, and Benjamin Mako Hill. 2017. Density Dependence without Resource Partitioning: Population Ecology on Change.Org. In Companion of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing (CSCW ’17 Companion). ACM, New York, NY, USA, 323–326. https://doi.org/10.1145/3022198.3026358
  • van de Rijt et al. (2014) Arnout van de Rijt, Soong Moon Kang, Michael Restivo, and Akshay Patil. 2014. Field Experiments of Success-Breeds-Success Dynamics. Proceedings of the National Academy of Sciences 111, 19 (May 2014), 6934–6939. https://doi.org/10.1073/pnas.1316836111
  • Wang et al. (2013) Xiaoqing Wang, Brian S. Butler, and Yuqing Ren. 2013. The Impact of Membership Overlap on Growth: An Ecological Competition View of Online Groups. Organization Science 24, 2 (2013), 414–431. https://doi.org/10.1287/orsc.1120.0756
  • Wilensky and Rand (2015) Uri Wilensky and William Rand. 2015. An Introduction to Agent-Based Modeling: Modeling Natural, Social, and Engineered Complex Systems with NetLogo. MIT Press, Cambridge, Massachusetts.
  • Zhang and Zhu (2011) Xiaoquan Zhang and Feng Zhu. 2011. Group Size and Incentives to Contribute: A Natural Experiment at Chinese Wikipedia. The American Economic Review 101, 4 (2011), 1601–1615. https://doi.org/10.2307/23045913
  • Zhu et al. (2014a) Haiyi Zhu, Jilin Chen, Tara Matthews, Aditya Pal, Hernan Badenes, and Robert E. Kraut. 2014a. Selecting an Effective Niche: An Ecological View of the Success of Online Communities. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’14). ACM, New York, NY, USA, 301–310. https://doi.org/10.1145/2556288.2557348
  • Zhu et al. (2014b) Haiyi Zhu, Robert E. Kraut, and Aniket Kittur. 2014b. The Impact of Membership Overlap on the Survival of Online Communities. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’14). ACM, New York, NY, 281–290. https://doi.org/10.1145/2556288.2557213

Appendix A: IEB model with quadratic projection

Figure 7. IEB model with quadratic projection. Results when agents are exposed to a random set of communities and use IEB to decide which to join or leave. In these simulations, agents use a quadratic equation to estimate future community size.

Figure 7 shows the results of a set of simulations in which exposure is random and joining and leaving decisions are made based on the IEB equations, with agents projecting future community size using a quadratic equation. The results are nearly identical to those in Figure 5, suggesting that agents were already joining the largest communities.
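To make the idea of a quadratic projection concrete, the following is a minimal sketch rather than the estimation procedure used in the simulations, which this appendix does not specify. It assumes that an agent observes a community's size at equally spaced time steps and extrapolates the unique quadratic passing through the three most recent observations. The function name `project_size_quadratic` and its parameters are hypothetical.

```python
def project_size_quadratic(history, steps_ahead=1):
    """Project a future community size from its recent size history.

    Assumption (not from the paper): sizes are observed at equally spaced
    time steps, and the projection extrapolates the exact quadratic through
    the last three observations. `history` must be non-empty.
    """
    if len(history) < 3:
        # Too few observations to define a quadratic; fall back to the
        # most recent size as the projection.
        return float(history[-1])

    a, b, c = history[-3:]          # sizes at t = 0, 1, 2
    first_diff = b - a              # discrete analogue of the slope
    second_diff = c - 2 * b + a     # discrete analogue of the curvature

    # Newton forward-difference form of the interpolating quadratic,
    # evaluated `steps_ahead` steps past the last observation.
    t = 2 + steps_ahead
    projected = a + first_diff * t + second_diff * t * (t - 1) / 2

    # A community cannot have a negative number of members.
    return max(0.0, float(projected))
```

For a community whose sizes were 1, 4, and 9, this sketch projects 16 for the next step. When the second difference is zero it reduces to a linear extrapolation, and a steeply declining history is clamped at zero.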

Appendix B: Robustness check results

B.1 Social Exposure and Random Joining

Figure 8. Community sizes when people are exposed to new communities via people in their current communities sharing communities to which they belong. Moving from left to right, the number of communities that each “neighbor” shares increases. On the top, neighbors share random communities. On the bottom, they share the biggest communities to which they belong.

Figure 8 shows that when people are less likely to join new communities, nearly all of the skew of community size disappears, suggesting that the influence of social exposure by itself is fragile.
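The two sharing rules compared in Figure 8 can be sketched as follows. This is an illustration under stated assumptions, not the simulation code: the function `communities_shared`, its parameters, and the dictionary of community sizes are all hypothetical names introduced here. A neighbor shares up to `k` of the communities it belongs to, chosen either at random (top row) or by size (bottom row).

```python
import random


def communities_shared(neighbor_memberships, community_sizes, k,
                       share_biggest, rng=None):
    """Return up to `k` communities that a neighbor exposes an agent to.

    Assumptions (not from the paper): `neighbor_memberships` is an iterable
    of community identifiers, and `community_sizes` maps each identifier to
    its current member count.
    """
    rng = rng or random.Random()
    candidates = list(neighbor_memberships)
    k = min(k, len(candidates))  # a neighbor cannot share more than it has

    if share_biggest:
        # Bottom row of Figure 8: share the largest communities first.
        ranked = sorted(candidates,
                        key=lambda c: community_sizes[c],
                        reverse=True)
        return ranked[:k]

    # Top row of Figure 8: share a uniformly random subset.
    return rng.sample(candidates, k)
```

Increasing `k` corresponds to moving from left to right in the figure. Passing a seeded `random.Random` instance as `rng` makes the random-sharing condition reproducible.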

B.2 Combined Model

On the other hand, Figure 9 shows that varying the initial proportion of communities that people are exposed to has very little effect on the overall shape of the eventual outcomes.

Figure 9. Combined model with the initial probability of exposure set to . Community sizes when agents are exposed to new communities via social exposure and make participation decisions based on IEB. Moving from left to right, the number of communities that each “neighbor” shares increases. Moving from top to bottom, the proportion of the considered communities that an agent chooses to join or remain in increases.