The Goal-Gradient Hypothesis in Stack Overflow

02/14/2020 ∙ by Nicholas Hoernle, et al.

According to the goal-gradient hypothesis, people increase their efforts toward a reward as they close in on the reward. This hypothesis has recently been used to explain users' behavior in online communities that use badges as rewards for completing specific activities. In such settings, users exhibit a "steering effect," a dramatic increase in activity as the users approach a badge threshold, thereby following the predictions made by the goal-gradient hypothesis. This paper provides a new probabilistic model of users' behavior, which captures users who exhibit different levels of steering. We apply this model to data from the popular Q&A site, Stack Overflow, and study users who achieve one of the badges available on this platform. Our results show that only a fraction (about 20%) of all users exhibit strong steering, whereas the activity of more than 40% of users does not appear to be affected by the badge. In particular, we find that for some of the population, increased activity in and around the badge acquisition date may reflect a statistical artifact rather than steering, as was thought in prior work. These results are important for system designers who hope to motivate and guide their users towards certain actions. We highlight the need for further studies which investigate what motivations drive the non-steered users to contribute to online communities.




1. Introduction

A well-known finding from behavioral science research is that efforts towards a goal increase with proximity to the goal. This phenomenon, termed the goal-gradient hypothesis, has been demonstrated in a variety of settings, from animal studies in the lab to consumer purchasing behavior (Hull, 1932; Kivetz et al., 2006). More recently, the goal-gradient effect was observed in people’s behavior in online communities in the presence of virtual rewards such as badges and reputation points (Mutter and Kundisch, 2014; Anderson et al., 2013). We study this “steering” phenomenon in one such community, that of Stack Overflow (SO). We identify who exhibits steering, who does not, and how this steering behavior can be characterised from observational data.

We present a generative model of steering as the deviation from users’ default rate of activity while allowing individual users to vary in their adherence to this deviation. The model is able to fit a complex multimodal distribution over the parameters governing users’ activities. This allows it to capture different levels of steering in the population. We apply the model to SO data which includes the interaction history of people who achieved a common badge type. When the badge threshold is crossed, meaning the user completes a requisite number of a certain action type on the platform, the badge is awarded. Using the model and the interaction data, we investigate the following research questions: Are all badge achievers affected by the goal-gradient (or steering) hypothesis in the same way? If some users do not steer, what portion of the population falls under this category? Finally, does the presence of these users in the data set change any conclusions that were previously drawn about the phenomenon of steering?

Results show that the model provides a good fit to the data and reveal the following insights. First, more than 40% of the users (Figure 1, scatter plot, bottom left) exhibit a consistent activity rate that does not appear to be affected by the badge. In addition, we prove that the mean activity count for these users is consistent with that of a random process. Second, about 20% of users dramatically increase their rate of activity prior to achieving the badge (Figure 1, scatter plot, top right). It is the effect of this small population of steered users on aggregate measures that has led to previous claims of steering (Anderson et al., 2013). Third, the majority of these steered users (Figure 1, top right) decrease their activity rate beyond what is claimed in prior work (Anderson et al., 2013), reaching close to zero after the badge has been achieved.

This paper is the first to provide a data-driven approach to the design and evaluation of models for studying user behavior in the presence of threshold badges. The model can be fit to and tested on real-world data (e.g., from SO in this paper) and can thereby be used to test hypotheses about users’ steering behavior, leading to a better understanding of how badges motivate users in online communities.

2. Related Work

We begin by relating to the general literature on the effect of badges in online communities. We then present in detail the specific work of Anderson et al. (2013) which helps to motivate the generative models that we develop in Section 3.

2.1. The Study of Online Badges

The goal-gradient hypothesis stems from behavioral research where animals were observed to increase their effort as they approach a reward (Hull, 1932; Kivetz et al., 2006). Kivetz et al. (2006) studied the behavior of different populations of people who were working toward various rewards and concluded that the hypothesis holds true for people. Subjects who received a loyalty card, which tracked the number of coffees purchased from a local coffee chain, purchased coffee significantly more frequently the closer they were to earning a free cup of coffee. The authors recognized the existence of a group of participants who did not complete their coffee cards for the duration of the study, and did not exhibit a noticeable change in their coffee purchasing habits. They concluded that the loyalty card effect was constrained to the population of participants who handed in their completed loyalty cards in exchange for the free-coffee reward.

Anderson et al. (2013) and Mutter and Kundisch (2014) were the first to study the goal-gradient hypothesis in online settings. They studied the observed effect of badges on the behavior of participants in large Q&A sites. Both studies found evidence that users increase their rate of work as they approach the badge threshold. However, they did not address the possibility that some users might achieve the badge as a consequence of their routine interactions on the website rather than being steered by the badge. There is a possibility that people’s actions are governed by motivations other than badges. We extend these works by allowing for this possibility, such that we can characterize the true changes to users’ behavior under the influence of a badge, and distinguish this from the case where users do not noticeably change their interaction patterns.

Other studies have independently confirmed that the presence of online badges increases the probability that a user will act in a manner to achieve the badge, as well as the rate at which the user will perform those actions
(Kusmierczyk and Gomez-Rodriguez, 2018; Yanovsky et al., 2019; Bornfeld and Rafaeli, 2017; Ipeirotis and Gabrilovich, 2014). Kusmierczyk and Gomez-Rodriguez (2018) highlight the importance of modeling the “utility heterogeneity” among the users but they study badges which have a threshold of and do not characterize how one might change one’s behavior in the presence of the badge incentive. Yanovsky et al. (2019) study the presence of different populations within the SO database by employing a clustering routine. They discovered notably different responses to the badge based on the cluster that a user belongs to. Their study did not acknowledge the possibility that the observed data might be consistent with a hypothesis that some users do not exhibit steering. Anderson et al. (2014) studied the implementation of a badge system in a massive open online course and they provide a prescriptive system for the design of badges such that there is a maximum effect on the population. Zhang et al. (2019) suggest that SO create new badges to encourage users to integrate helpful comments into the accepted answers. They thereby present an example of how system designers might use a badge to encourage a desired behavior from their user base.

2.2. A Utility Model for Steering

Most relevant to our work is the paper of Anderson et al. (2013), who present a parametric description of a user’s utility when the user is steered by badges. (Footnote 1: Anderson et al. (2013) coined the term “steering”, which refers to the goal-gradient effect in the context of badges.) This model is generative in that it describes the change to a user’s distribution over actions under the assumption that the user tries to maximize some utility derived by earning new badges. The model describes a user as having a preferred distribution from which actions are sampled. As users approach the required threshold for achieving a badge, they deviate from their preferred distributions. The deviation from the preferred distribution is controlled by the utility gained by achieving the badge and the cost for deviating from the preferred distribution.

We let $x_u^d$ refer to the distribution over the count of actions that a user $u$ takes on day $d$. The user’s utility is a function of $x_u^d$ and it is the sum of three terms. (Footnote 2: Our notation differs slightly from that of Anderson et al. (2013), who use a parameter to refer to a user’s distribution over the next action. We rather use $x_u^d$ to denote the distribution over the count of actions on a particular day. The two are linked (the distribution over the next action influences the count of actions on a specific day); however, we choose to model directly the data that is available from SO.) The first term, $\sum_b I_b V_b$, is the non-negative value that a user derives from already-attained badge rewards (where $V_b$ is the assumed value of a badge and $I_b$ is the indicator that the user has attained badge $b$). The second term, $\theta\,\mathbb{E}_{x_u^d}[U_u^{d+1}]$, describes the user’s expected future utility, discounted by $\theta$, when acting under the distribution $x_u^d$. The final term, $g(x_u^d, P_u)$, is a cost function that penalises the user for deviating from the preferred distribution $P_u$ on that day. The cost represents the unwillingness of the users to change their behavior, and it is in tension with the users’ desire to achieve future badges.

The utility on day $d$ for user $u$ is then (Anderson et al., 2013):

$$U_u^d(x_u^d) = \sum_b I_b V_b + \theta\,\mathbb{E}_{x_u^d}\left[U_u^{d+1}\right] - g(x_u^d, P_u)$$

It is important to note that the cost term $g$ is only paid when users deviate from their preferred distribution $P_u$. As such, this model assumes users deviate only to attain the value $V_b$ from the badge and only if that value outweighs the cost that is paid for deviating. This means that a deviation in the rate of actions which are incentivized by the badge must be an increase before the badge is achieved and cannot be an increase after the badge is achieved. We will make these same assumptions in the models in Section 3.1.

This utility-based model presents a compelling description of how people respond to badges; however, it was not evaluated or tested by fitting it to specific data from SO. Rather, predictions of the model were compared to aggregated data from SO and we show in Section 6 that the aggregated analysis from this count data can lead to incorrect conclusions. The lack of analysis on individual level predictions limits the credibility of the study as well as its practical value — it is difficult to apply the utility-based model to the placement of badges without a means of determining the appropriate model parameters for a given community of contributors.

In this work we address the shortcomings of the utility-based approach by introducing a probabilistic model which allows us to use the vast literature on posterior inference in such models to assist with parameter estimation
(Blei, 2014; Rezende and Mohamed, 2015; Kingma and Welling, 2013; Kingma et al., 2016; Ranganath et al., 2014). The probabilistic model has two advantages over this prior work: (1) posterior distributions for latent parameters in the model can be learnt from real-world interaction data and (2) the model’s fit to data can be used to test and update scientific hypotheses (for example, in this paper we propose and validate that while some users may steer in a similar way, there may exist users who do not experience steering).

3. Modeling User Activities

We model users’ activities in SO as a distribution over their action counts. The model aims to incorporate the major aspects of the utility model from Anderson et al. (2013) but it frames the problem such that parameters can be estimated from data and the models can be tested on their fit to unseen user action data to allow for model comparison (Box and Hunter, 1962; Blei, 2014). Moreover, the model allows for different users to experience different levels of steering.

In Section 3.1, we begin by providing a conceptual generative model of user activities in SO. Sections 3.2 and 3.3 describe the specifics of the model parameters. Finally, in Section 3.4 we detail how the latent parameters are represented by the model.

3.1. A Generative Model of Steering

Let $P_u$ be a latent parameter that controls the rate of activity for user $u$; this is the preferred distribution of user $u$. $P_u$ induces a probability distribution over the action counts $A_u$ of user $u$. Let $\beta$ denote the deviation of the user’s activity from $P_u$ as a result of steering. The observed data for each user, $A_u$, consists of daily action counts for a predetermined number of weeks before and after achieving the badge. Thus, for $D$ days of interaction, $P_u$, $\beta$ and $A_u$ are all vectors of length $D$.


Figure 2 presents generative models of user behavior in SO. White circles denote latent random variables and colored circles denote observed random variables; solid lines represent conditional dependence. Model 0 (left) describes a non-steering model, in which the observed action counts $A_u$ depend only on the user’s preferred distribution $P_u$. Model 1 (right) is a steering model in which a user deviates systematically from $P_u$ in a manner that is controlled by $\beta$. As the values of $\beta$ increase (above 0), the user experiences an increased rate of actions (above their preferred distribution). Similarly, as $\beta$ decreases (below 0), the user experiences a decreased rate of actions. Model 1 assumes that all users are steered in the same way. Model 2 (bottom) relaxes this assumption by introducing a user-specific strength parameter $\alpha_u$ which mediates the effect of $\beta$ for user $u$. As $\alpha_u$ decreases, the user deviates less from their base distribution. When $\alpha_u$ is very close to zero, the user’s activity converges to that described by Model 0.

The parameter $\beta$ that controls how a user responds to the badge is a vector of length $D$ (each day relative to the badge indicates a different amount of steering). To reflect the intuition developed by Anderson et al. (2013) and explained in Section 2.2, we constrain $\beta$ to be non-negative before day 0, the day when the user achieves the badge. Moreover, $\beta$ is constrained to be non-positive after this day to reflect the intuition that a user gains no further utility from the badge once it has been achieved (and thus does not work harder than their preferred distribution $P_u$). $\beta$ therefore implicitly includes the trade-off between the cost function and the badge utility discussed in Section 2.2.
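The sign constraint on the deviation vector can be imposed in several ways; the text does not say which one is used. The following is a minimal sketch under the assumption that an unconstrained real vector is passed through a softplus and given the required sign on each side of the badge day. The function names and the treatment of the badge day itself (grouped with the "before" days) are illustrative assumptions.

```python
import math


def softplus(x):
    """log(1 + exp(x)), computed in a numerically stable way."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def constrain_deviation(raw, badge_index):
    """Map unconstrained values to a deviation vector that is
    non-negative up to the badge day and non-positive afterwards.

    raw         -- one unconstrained real value per day (hypothetical input)
    badge_index -- position of the badge day (day 0) within the vector
    """
    constrained = []
    for i, value in enumerate(raw):
        if i <= badge_index:  # assumption: the badge day counts as "before"
            constrained.append(softplus(value))
        else:
            constrained.append(-softplus(value))
    return constrained
```

Because the softplus is smooth, a parameterization of this kind remains compatible with gradient-based training.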

Figure 2. Model 0 (baseline model) has no notion of a badge: only a user’s preferred distribution $P_u$ induces the distribution over the observed actions. Model 1 allows for a global badge deviation ($\beta$) from a user’s preferred distribution and it is experienced by all users. Model 2 has a user-specific strength parameter ($\alpha_u$) that mediates the adherence to $\beta$.

3.2. Likelihood of Action Counts

In this section we define the parameters that govern the distribution over users’ action counts in SO. We wish to describe a variety of behaviors, including users who contribute sporadically and those who are more consistent. We therefore model action counts using a zero-inflated Poisson distribution. The zero-inflated Poisson distribution has a rate parameter $\lambda_u^d$ and a Bernoulli probability $\pi_u^d$ associated with each user $u$ and each day of interaction $d$. The Bernoulli probability describes the event that user $u$ is active or not on a given day. The rate parameter describes the expected count of actions that the user will perform under a Poisson distribution, conditioned on the user being active. Note that a user can be active on the platform without performing an action (e.g., logs on to the SO website but does not contribute). Conceptually, this would correspond to drawing a 1 from the Bernoulli distribution but a count of 0 actions from the Poisson distribution.

The probability that user $u$ performs $a$ actions on day $d$ is presented in Equation 1. We refer to the parameters $\pi_u^d$ and $\lambda_u^d$ as a user’s rate parameters for day $d$.

$$P(A_u^d = a \mid \pi_u^d, \lambda_u^d) = (1 - \pi_u^d)\,\mathbb{1}[a = 0] + \pi_u^d\,\frac{(\lambda_u^d)^{a}\, e^{-\lambda_u^d}}{a!} \qquad (1)$$
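As a concrete illustration of the zero-inflated Poisson likelihood described above, the sketch below evaluates the probability of a single daily count. The function and argument names are hypothetical; `active_prob` stands for the Bernoulli probability that the user is active and `rate` for the Poisson rate conditioned on being active.

```python
import math


def zip_pmf(count, active_prob, rate):
    """Probability of `count` actions on one day under a zero-inflated Poisson.

    A zero can arise in two ways: the user is inactive, or the user is
    active but the Poisson draw is zero.
    """
    if count < 0:
        return 0.0
    poisson = math.exp(-rate) * rate ** count / math.factorial(count)
    inactive_mass = (1.0 - active_prob) if count == 0 else 0.0
    return inactive_mass + active_prob * poisson
```

For example, `zip_pmf(0, 0.5, 2.0)` combines the probability of being inactive (0.5) with the probability of being active and still contributing nothing (0.5 times exp(-2)).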


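3.3. Deriving the Rate Parameters $\pi_u$ and $\lambda_u$

This section connects the rate parameters $\pi_u$ and $\lambda_u$ to the generative models of Section 3.1. Each of $P_u$, $\beta$ and $\alpha_u$ includes two components, one to determine each of $\pi_u$ and $\lambda_u$. For $D$ days of interaction, $P_u$ comprises two real-valued vectors, each of length $D$: $P_u^{\pi}$ is the user’s preferred distribution that is associated with $\pi_u$ and $P_u^{\lambda}$ is the user’s preferred distribution associated with the parameter $\lambda_u$. Similarly, $\beta$ comprises two real-valued vectors of length $D$, where $\beta^{\pi}$ is associated with $\pi_u$ and $\beta^{\lambda}$ is associated with $\lambda_u$. Finally, $\alpha_u = (\alpha_u^{\pi}, \alpha_u^{\lambda})$ is a tuple of two numbers between 0 and 1 which mediate (multiply) $\beta^{\pi}$ and $\beta^{\lambda}$ for user $u$.

Equation 2 derives a vector of probability values $\pi_u$ (one for each day of interaction) as the element-wise sigmoid transformation of a vector that is the addition of the user’s preferred distribution $P_u^{\pi}$ with $\beta^{\pi}$, where $\beta^{\pi}$ is mediated by (multiplied by) $\alpha_u^{\pi}$. Equation 2 also derives a vector of strictly positive rate values $\lambda_u$ (one for each day of interaction) as the element-wise softplus transformation of the vector $P_u^{\lambda} + \alpha_u^{\lambda}\beta^{\lambda}$.

$$\pi_u = \mathrm{sigmoid}\left(P_u^{\pi} + \alpha_u^{\pi}\,\beta^{\pi}\right), \qquad \lambda_u = \mathrm{softplus}\left(P_u^{\lambda} + \alpha_u^{\lambda}\,\beta^{\lambda}\right) \qquad (2)$$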


The complete generative description for Model 2 is as follows (Models 1 and 0 are generated in the same way, with parameter $\alpha_u$ set to 1 and 0 respectively):

  1. Sample $P_u$ and $\alpha_u$ from their prior distributions (see Section 3.4).

  2. Compute $\pi_u$ and $\lambda_u$ using Equation 2.

  3. Sample the vector of the counts of actions $A_u$ for user $u$ from the zero-inflated Poisson as in Equation 1.
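The three steps above can be sketched in a few lines of Python. This is a simplified illustration rather than the authors' implementation: the latent values are passed in as fixed lists instead of being sampled from the flow-based prior of Section 3.4, and all function and argument names are assumptions. A strength of `(0.0, 0.0)` reproduces Model 0 and `(1.0, 1.0)` reproduces Model 1.

```python
import math
import random


def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def softplus(x):
    """log(1 + exp(x)), computed in a numerically stable way."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sample_poisson(rate, rng):
    """Knuth's multiplication method; adequate for small rates."""
    limit, k, p = math.exp(-rate), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def sample_user_counts(pref_active, pref_rate, dev_active, dev_rate, strength, rng):
    """Draw one user's daily action counts.

    pref_* -- preferred-distribution vectors (one value per day)
    dev_*  -- steering deviation vectors (one value per day)
    strength -- pair of numbers in [0, 1] scaling the two deviations
    """
    s_active, s_rate = strength
    counts = []
    for pa, pr, da, dr in zip(pref_active, pref_rate, dev_active, dev_rate):
        prob_active = sigmoid(pa + s_active * da)   # step 2: probability of being active
        rate = softplus(pr + s_rate * dr)           # step 2: Poisson rate when active
        is_active = rng.random() < prob_active      # step 3: Bernoulli draw
        counts.append(sample_poisson(rate, rng) if is_active else 0)
    return counts
```

Setting the strength to zero makes the output independent of the deviation vectors, which is the sense in which Model 2 contains Model 0 as a special case.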

3.4. Generating the Latent Parameters $P_u$ and $\alpha_u$

The distributions over $P_u$ and $\alpha_u$ can be complex and multi-modal (reflecting that people display varying activity patterns). Following Rezende and Mohamed (2015), we represent these distributions with a simple distribution (often an $n$-dimensional Gaussian) which we transform via a series of bijective mappings to form a complex and possibly multi-modal distribution (Rezende and Mohamed, 2015; Papamakarios et al., 2019). The use of such transformations, called normalizing flows, has been shown to improve the modeling of complex distributions (Papamakarios et al., 2019).

The output of the normalizing flows corresponds to a sample in latent space from the generating distribution for a user’s preferred distribution. We transform the output through a feed-forward network to a real-valued vector which is $P_u$. In this work we constrain $P_u$ such that it can learn a different distribution for each day of the week. This is to enforce that $P_u$ does not evolve through time but to allow different days to have different expected activity rates.

The steering strength parameters for Model 2 are assigned to be two elements of the $n$-dimensional vector obtained after the transformations from the normalizing flows. In this way, the latent dimensionality for all three models in Figure 2 is kept constant. These real values are transformed via a sigmoid function to create $\alpha_u$. The dimensions, number of hidden units, and number of layers in this setup can be found in Appendix B. (Footnote 3: Modeling and inference code can be found at the repository: anonymous for review.)
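To make the idea of a normalizing flow concrete, the sketch below implements a single planar flow layer in the form given by Rezende and Mohamed (2015), which is the flow family named in Section 5. It is an illustration only: the parameter values are arbitrary, the function names are hypothetical, and the layer is invertible only when the dot product of `w` and `u` is at least -1.

```python
import math


def planar_flow(z, w, u, b):
    """Apply f(z) = z + u * tanh(w . z + b).

    Returns the transformed point and log|det Jacobian|, which for a planar
    flow equals log|1 + (1 - tanh^2(w . z + b)) * (u . w)|.
    """
    activation = sum(wi * zi for wi, zi in zip(w, z)) + b
    h = math.tanh(activation)
    out = [zi + ui * h for zi, ui in zip(z, u)]
    h_prime = 1.0 - h * h
    det = 1.0 + h_prime * sum(ui * wi for ui, wi in zip(u, w))
    return out, math.log(abs(det))


def apply_flows(z, layers):
    """Compose several planar layers, accumulating the log-determinants.

    layers -- list of (w, u, b) tuples; the log-density of the output is
    the base log-density of the input minus the returned total.
    """
    total_log_det = 0.0
    for w, u, b in layers:
        z, log_det = planar_flow(z, w, u, b)
        total_log_det += log_det
    return z, total_log_det
```

Stacking several such layers is what lets a simple Gaussian base distribution be reshaped into a multi-modal one while its density remains cheap to evaluate.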

4. Amortized Variational Inference for Steering

To infer the underlying parameters over the latent space, we use amortized inference (Ranganath et al., 2014; Kingma and Welling, 2013). Amortized inference uses a neural network to encode a data point into the latent parameters that are associated with its posterior distribution. Moreover, the inference objective allows model comparison such that hypotheses about the data can be tested (e.g., allowing us to validate the inclusion of the steering parameter $\beta$).

A fully-specified generative model defines a joint distribution over some latent random variables ($z$) and the observed random variables ($x$). The challenge is to infer the posterior of the latent parameters given the data that was actually observed, $p(z \mid x)$. For all but a handful of conjugate models, the posterior is intractable to derive analytically. It is therefore common to use approximate methods which include Markov chain Monte Carlo (Neal, 1993) and variational inference (Blei et al., 2003; Hoffman et al., 2013). Variational inference is a popular method for approximating the intractable posterior distribution by introducing a different (and more easily sampled from and evaluated) distribution over the same latent variables, $q(z)$. By minimizing the KL-divergence between $q(z)$ and the true posterior $p(z \mid x)$, one obtains an approximation to the true posterior (Hoffman et al., 2013).

Traditionally, variational inference proposes to update the parameters of the approximation with an (optionally stochastic) coordinate ascent routine, presenting an algorithm that has strong ties to expectation-maximization
(Bishop, 2006). More recently, Kingma and Welling (2013) proposed that the parameters of the approximation rather be learnt as a function of an inference network such that the shared parameters of the network amortize the learning across the data examples. This new form of variational inference proposes that the parameters of the approximating distribution then are the transformation of the input data point through some network. The use of inference networks, along with the reparameterization trick (to allow for efficient back-propagation computation), has collectively been called “black-box variational inference” (Ranganath et al., 2014). If this inference network is paired with a twin generator network, the variational auto-encoder from Kingma and Welling (2013) is recovered.

It is important to note that minimizing the KL-divergence between $q(z)$ and $p(z \mid x)$ is equivalent to maximizing the variational objective, called the Evidence Lower BOund (see Hoffman et al. (2013) for a derivation and discussion). The ELBO derives its name from the fact that it lower-bounds the marginal log-likelihood of the data under the assumptions of the model, a fact easily derived in the next equation, where Jensen’s inequality is applied in the final line. It is due to this lower bound on the marginal log-likelihood that it is also common to use the ELBO for model comparison (as is done in Section 5.1).

$$\begin{aligned} \log p(x) &= \log \int p(x, z)\, dz \\ &= \log \mathbb{E}_{q(z)}\left[\frac{p(x, z)}{q(z)}\right] \\ &\geq \mathbb{E}_{q(z)}\left[\log p(x, z) - \log q(z)\right] = \mathrm{ELBO} \end{aligned}$$
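The lower-bound property can be checked on a toy model whose marginal likelihood is known in closed form. The sketch below is not the steering model: it assumes a standard normal prior on a scalar latent, a unit-variance Gaussian likelihood centred on the latent, and a Gaussian approximating distribution with mean `q_mean` and standard deviation `q_std`. All names are illustrative.

```python
import math


def log_marginal(x):
    """Exact log p(x): with z ~ N(0, 1) and x | z ~ N(z, 1), x ~ N(0, 2)."""
    return -0.5 * math.log(2.0 * math.pi * 2.0) - x * x / 4.0


def elbo(x, q_mean, q_std):
    """Closed-form ELBO for a Gaussian q(z) = N(q_mean, q_std^2)."""
    var = q_std * q_std
    # E_q[log p(x | z)]
    expected_log_lik = -0.5 * math.log(2.0 * math.pi) - 0.5 * ((x - q_mean) ** 2 + var)
    # E_q[log p(z)]
    expected_log_prior = -0.5 * math.log(2.0 * math.pi) - 0.5 * (q_mean ** 2 + var)
    # Entropy of q, i.e. -E_q[log q(z)]
    entropy = 0.5 * math.log(2.0 * math.pi * math.e * var)
    return expected_log_lik + expected_log_prior + entropy
```

In this toy model the exact posterior is Gaussian with mean `x / 2` and variance `1 / 2`, so the bound is tight at that choice of `q` and strictly loose elsewhere, which is why a higher ELBO is read as evidence of a better model or a better approximation.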

5. Empirical Study

SO has provided us with the anonymized data of the interaction of users on the site from January 2017 to April 2019. We focus our analysis on the users who achieved the Electorate badge, which is awarded to users who vote on 600 questions, with at least 25% of their total votes cast on questions. (Footnote 4: The data from the users’ voting actions is not publicly available, but qualitatively similar results can be obtained on other badge types in the freely available repository of SO data. We thank SO for the access to this voting data.) This is the same badge type that was studied by Anderson et al. (2013). The observed data is the number of actions (question votes) per user per day for a fixed number of weeks before and after achieving the badge, making $D$ days of interaction per person.

We compare the performance of Model 0, Model 1, and Model 2 as described in Section 3.1. We also include a naïve baseline that uses maximum likelihood assignments for the rate parameters by setting $\pi^d$ to equal the fraction of users who were active on day $d$, and $\lambda^d$ to equal the mean of the active users’ action counts for day $d$. For all models, we report two measures of performance: the evidence lower bound (the ELBO), which is the lower bound on the marginal log-likelihood of the data under the model assumptions (Kingma and Welling, 2013; Hoffman et al., 2013; Rezende and Mohamed, 2015); and the mean squared error (MSE) of the model for reconstructing the original number of actions for each user. Parameter estimation is done in PyTorch and Adam is used to maximize the ELBO with a learning rate of 0.001. We use planar normalizing flows (Rezende and Mohamed, 2015); the latent dimensionality and the number of flow layers are among the settings reported in Appendix B.
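The naïve baseline described above reduces to two per-day summary statistics. A minimal sketch follows, assuming that "active" means the user performed at least one action on that day (a simplification, since the model itself allows an active user to perform zero actions). The function name and data layout are hypothetical.

```python
def naive_baseline(counts):
    """Per-day estimates for the naive baseline.

    counts -- list of per-user lists, counts[u][d] = actions of user u on day d
    Returns (active_fraction, active_mean), each with one value per day.
    """
    n_users = len(counts)
    n_days = len(counts[0])
    active_fraction, active_mean = [], []
    for d in range(n_days):
        active = [row[d] for row in counts if row[d] > 0]
        active_fraction.append(len(active) / n_users)
        # With no active users on a day the mean is undefined; use 0.0 by convention.
        active_mean.append(sum(active) / len(active) if active else 0.0)
    return active_fraction, active_mean
```

Because these estimates are shared by every user, the baseline cannot represent individual differences in activity, which is the gap the amortized models are meant to close.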

5.1. Model Comparison

We trained the models on a training set of users (with a separate validation set), while results are reported on a held-out test set of users. Table 1 compares the performance of the models on this test set. The results from Table 1 show that Model 2 outperforms the other models, achieving a higher bound on the marginal log-likelihood and a lower reconstruction error on unseen data. Moreover, the benefit of the amortized approach (which learns a complex representation of $P_u$ for each user) is demonstrated in that Models 0, 1, and 2 all had a higher ELBO and consequently a lower reconstruction error than the naïve baseline. Both Model 1 and Model 2 outperform Model 0, suggesting that the inclusion of the steering parameter $\beta$ does increase the probability of the data.

Model 2 (w/ $\alpha_u$)
Model 1 (w/o $\alpha_u$) 0.174
Model 0 (Baseline) 0.179
Naïve Baseline 49.33
Table 1. ELBO (lower bound to log-likelihood) and mean squared error reconstruction.

Figure 1 (left) presents a scatter plot of the magnitude of the inferred $\alpha_u^{\pi}$ and $\alpha_u^{\lambda}$ parameters (the two components of the user-specific steering strength). Each point in the scatter plot corresponds to a user. The scatter plot shows two clear modes. One mode is in the top right corner of the plot, which corresponds to users with both components close to 1. We refer to the roughly 20% of users in this mode as the “strong-steerers” because they adhere strictly to the $\beta$ deviation. The second, and larger, mode is in the bottom left of the plot, which corresponds to both components being close to 0. We refer to the roughly 40% of users in this mode as the “non-steerers” as they appear to eschew the $\beta$ deviation.

We show samples from the strong-steered population in the top right-hand corner of Figure 1. The plots show the true count of actions as a function of time alongside the expected number of actions under the assumptions of Model 2. The red vertical line, on day 0, corresponds to the day that the user achieved the Electorate badge. It is important to note the high number of actions (both expected and true) before day 0, when the badge was achieved. After day 0, both the true and expected numbers of actions drop dramatically.

In contrast to these strong-steered users, the bottom right of Figure 1 presents samples from the non-steered population. The counts of actions appear to show no change around day 0. These users appear not to change their behavior in the presence of the badge.

5.2. Analysis of Steering

The form of the inferred parameter $\beta$ shows the effect of steering on users over time. The plot of $\beta$ as a function of time is presented in Figure 3. The magnitudes of the values of $\beta$ indicate direct changes to the probability that the user is active, as well as expected changes in the number of actions on a given day. In accordance with related work, users increase their action count as they approach the day upon which they achieved the badge (Anderson et al., 2013; Bornfeld and Rafaeli, 2017; Mutter and Kundisch, 2014). At its peak, $\beta$ indicates how many additional question-votes (over their preferred distribution) the strong-steerers are expected to cast.

A novel insight of our model is that $\beta$ decreases below 0 after the badge has been achieved. That is, users may decrease their activity beyond their preferred distribution level after they have achieved the badge. This result suggests that users who are steered strongly may stop contributing altogether once the badge has been achieved.

Figure 3. Expected deviation $\beta$ from a user’s preferred distribution $P_u$.

Figure 4(a) presents the mean number of interactions per user as a function of the number of days until/after the badge is achieved. The three lines correspond to the three groups from Figure 1: non-steerers, strong-steerers and the other users who are in neither of the two modes. We choose a deliberately low cutoff to define a user as a member of the strong-steerer group but enforce that only users with low $\alpha_u$ values be considered as non-steerers. In particular, we highlight the strong-steerers, 20.9% of users, who experience steering as is described by previous work: they increase their rate of work dramatically before the badge is achieved. Notice that the mean interaction count from these users drops past the other groups to close to 0 after achieving the badge. In Figure 4(b), we present the mean number of interactions per user for only the weeks after the badge has been achieved, such that this behavior can be properly seen.

(a) Weeks before and after badge achievement.
(b) Focused plot on only the weeks after badge achievement.
Figure 4. Mean number of actions per day for users who are grouped by the magnitude of their steering strength parameters ($\alpha_u$).

We also highlight the non-steered population (41.8%) who show no change in interaction rates before or after the receipt of the badge. There is a distinct uptick in the mean number of question-vote actions on the day before and on the day of the badge achievement (Figure 4(a), orange line). It is possible that this “bump” might mistakenly be seen as the response of the users to the badge incentive. In fact, this bump is an artifact of the analysis technique, which centers trajectories around a threshold that is crossed by the cumulative sum of the trajectory entries (see Section 6 and Appendix A for a discussion and proof of this claim).

Table 2 presents the sizes of these three groups (when considering the entire data set). We highlight the fact that the non-steerer population is twice as large as the strong-steerer population and while the strong-steerer population is the minority, it is the highly engaged interaction activity from these users that may have led to some previous conclusions about steering.

Table 2. Number of users in each group (strong-steerers, non-steerers, and other), where the latent parameter $\alpha_u$ is used to threshold the amount of steering.

Our final result studies the number of badges that are achieved by the 20.9% of the population who are characterised as strong-steerers in comparison to the 41.8% of the population who, we claim, do not act in a manner that indicates steering. Figure 5 shows a mode on one badge (given the data during the available interval) for the non-steered population, with an exponential decay in the number of people who achieve two or more badges. In contrast, the steered population has a much greater count of two or more badges. We argue this is further evidence for the claim that the group in the top right of the scatter plot in Figure 1 actively pursues badges while the same cannot be said for the group in the bottom left.

Figure 5. Normalized histogram of the number of badges achieved by the strong-steerer and non-steerer populations respectively.


6. The Phantom Steering Effect

The population of non-steerers in Figure 4(a) displays a sharp uptick in the mean of their action count on the day before and on the day of the badge achievement. We prove that such a bump arises as an artifact of centering the data on day 0, and is therefore expected to arise even in the absence of a steering effect. We show this “phantom steering” bump occurs in the setting of Model 0 (Figure 2), where daily action counts are independent draws from some unchanging latent distribution. Our proof (and the intuition arising from it) suggests that a similar bump arises in the presence of steering as well. It is possible that this bump may have served to inflate previous conclusions about how users change their behavior when working to achieve badges (Anderson et al., 2013).

We also derive the expected size of this bump, and evaluate the extent to which these variously-steered populations’ bumps can be explained by this statistical artifact.

6.1. Derivation of Phantom Steering for Model 0

For users acting under Model 0 we present Theorem 6.1, which implies that for sufficiently large badge thresholds the expected number of actions on day 0 (the day of badge achievement) is greater than the expected number of actions on any other day.

We introduce this theorem via the following intuitive example. Suppose that the badge threshold is chosen randomly from some large range of possible action counts, and consider the cumulative number of actions from a user up to (and including) a given day. As long as the user continues to act on the platform, this cumulative count will eventually traverse the whole range of possible thresholds. Moreover, as the count of actions on any day is a random variable (drawn from the user’s preferred distribution), the cumulative count is more likely to cross the threshold on a day on which the user makes relatively more contributions. This claim holds even under the no-steering assumptions of Model 0, in which users’ action counts on each day are independent draws from their preferred distribution (which is not influenced by steering).
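The intuition above can be checked with a short simulation. The sketch below is a minimal illustration under stated assumptions, not the analysis used in this paper: the daily-count distribution `DAILY_COUNTS`, the threshold of 200 actions, and all function names are hypothetical. Every simulated user draws independent daily counts from the same unchanging distribution (the no-steering setting), and we compare the mean count on the threshold-crossing day with the mean count on earlier days.

```python
import random

# Hypothetical "preferred distribution": each listed value is equally likely
# on any given day. This is an illustrative assumption, not fitted to SO data.
DAILY_COUNTS = [0, 0, 1, 2, 3, 10]


def simulate_user(threshold, rng):
    """Simulate one non-steered user until the badge threshold is reached.

    Returns (count on the crossing day, list of counts on all earlier days).
    """
    total = 0
    earlier_days = []
    while True:
        today = rng.choice(DAILY_COUNTS)  # independent draw, no steering
        if total + today >= threshold:
            return today, earlier_days
        total += today
        earlier_days.append(today)


def phantom_bump(num_users=5000, threshold=200, seed=0):
    """Return (mean count on the badge day, pooled mean count on earlier days)."""
    rng = random.Random(seed)
    badge_day_counts = []
    earlier_counts = []
    for _ in range(num_users):
        badge_day, earlier = simulate_user(threshold, rng)
        badge_day_counts.append(badge_day)
        earlier_counts.extend(earlier)
    badge_mean = sum(badge_day_counts) / len(badge_day_counts)
    earlier_mean = sum(earlier_counts) / len(earlier_counts)
    return badge_mean, earlier_mean


if __name__ == "__main__":
    badge_mean, earlier_mean = phantom_bump()
    print(f"mean actions on badge day:    {badge_mean:.2f}")
    print(f"mean actions on earlier days: {earlier_mean:.2f}")
```

Although no simulated user changes behaviour, the badge-day mean comes out well above the earlier-day mean, because high-activity days are more likely to be the ones that carry the cumulative count over the threshold.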

We formalize this intuition in Theorem 6.1, the proof of which appears in Appendix A. The theorem concerns the random variable that describes the number of actions a user performs on the day that they receive the badge, considered as a function of the number of actions required to achieve the badge (the badge threshold), for a user who acts according to Model 0.

Theorem 6.1 ().

If is bounded then:

This expected bump size holds in the limit as the badge threshold becomes large with respect to the mean of the user's daily action distribution. For a fixed distribution, the convergence to this limit is exponential in the threshold.
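To make the limiting statement concrete, the following sketch computes the expected badge-day count exactly for a finite threshold and compares it with the classical size-biased mean E[X^2]/E[X] = mu + sigma^2/mu from renewal theory (the "inspection paradox"). That the limit in Theorem 6.1 takes this form is an assumption made here, as are the example distribution `PMF` and the function names; X denotes a daily action count with mean mu and variance sigma^2.

```python
# Hypothetical daily-count distribution: PMF[k] = P(X = k) for k = 0..4.
PMF = [0.3, 0.3, 0.2, 0.1, 0.1]


def crossing_day_mean(pmf, threshold):
    """Exact expected count on the day the cumulative sum first reaches `threshold`.

    Days are independent draws from `pmf` (no steering). Requires threshold >= 1
    and pmf[0] < 1.
    """
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    p0 = pmf[0]
    max_k = len(pmf) - 1
    # starts[m] = expected number of days that begin with cumulative count m.
    # Zero-action days leave the count unchanged, hence the 1 / (1 - p0) factor.
    starts = [0.0] * threshold
    starts[0] = 1.0 / (1.0 - p0)
    for m in range(1, threshold):
        inflow = sum(starts[m - k] * pmf[k] for k in range(1, min(m, max_k) + 1))
        starts[m] = inflow / (1.0 - p0)
    expected = 0.0
    for m in range(max(0, threshold - max_k), threshold):
        need = threshold - m  # smallest daily count that crosses from m
        expected += starts[m] * sum(k * pmf[k] for k in range(need, max_k + 1))
    return expected


def size_biased_mean(pmf):
    """E[X^2] / E[X], which equals mu + sigma^2 / mu."""
    mean = sum(k * p for k, p in enumerate(pmf))
    second_moment = sum(k * k * p for k, p in enumerate(pmf))
    return second_moment / mean


if __name__ == "__main__":
    for n in (1, 5, 20, 60):
        print(n, crossing_day_mean(PMF, n))
    print("size-biased mean:", size_biased_mean(PMF))
```

For this example distribution the finite-threshold values settle onto the size-biased mean within a few dozen actions, which is in keeping with the exponential convergence stated above.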

Theorem 6.1 applies when a user’s distribution is identical for all days and their actions are drawn independently from it. In order to distinguish between days of the week as in Section 3.4, we now generalize this theorem to the setting where each user has a distinct distribution for each day of the week and they draw from these distributions independently in turn. The days are indexed relative to the day of the badge, so that the repeating sequence of day-of-the-week distributions describes a user’s preferred distribution for the entire period under study. (Note that days are indexed arbitrarily for different users, indicating that they may start their interaction, and achieve the badge, on any day.) We generalize Theorem 6.1 to the following result, the proof of which is also relegated to Appendix A:

Theorem 6.2 ().

If each of the distributions is finite and nonzero (and nonnegative and integer valued), then

6.2. Empirical Comparison to the Non-Steerers

We compare the observed data under the assumptions of Model 0 to the theoretical predictions of Theorem 6.2. For each user in the non-steerer population and for each day of the week, we use the data of their day-of-the-week contributions for the weeks under study to compute a sample mean and (unbiased) sample variance of their action distributions for each day of the week.

We use the right-hand side of Theorem 6.2 to create an estimator of the expectation under the assumptions of Model 0:


Figure 6 plots densities of the residuals, which are created by subtracting the user’s actual number of actions on day 0 from their expected number of actions under the theoretical model of the phantom steering bump. Figure 6(a) shows the density for the residuals from the non-steerer population (who are assumed to act under Model 0). We compare this to the density that arises from the strong-steerer population in Figure 6(b) (which is not expected to conform to Model 0).
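The estimator and residual described above can be sketched as follows. This is a minimal illustration under a stated assumption: that the right-hand side of Theorem 6.2 has the form sum_d (mu_d^2 + sigma_d^2) / sum_d mu_d, the periodic analogue of the size-biased mean, where mu_d and sigma_d^2 are the mean and variance of the day-d distribution. The symbols, the function names, and the toy data are hypothetical and are not taken from the paper's implementation.

```python
from statistics import mean, variance


def phantom_bump_estimate(counts_by_weekday):
    """Plug-in estimate of the expected badge-day count with no steering.

    counts_by_weekday: one list per day of the week, each holding a single
    user's observed action counts on that weekday (at least two per weekday,
    since the unbiased sample variance needs two or more observations).
    Assumed form: sum_d (m_d**2 + s_d**2) / sum_d m_d.
    """
    means = [mean(counts) for counts in counts_by_weekday]
    variances = [variance(counts) for counts in counts_by_weekday]  # n - 1 denominator
    total_mean = sum(means)
    if total_mean == 0:
        raise ValueError("user has no recorded actions")
    return sum(m * m + v for m, v in zip(means, variances)) / total_mean


def residual(counts_by_weekday, actual_day0_count):
    """Predicted minus actual; negative values mean the prediction was too low."""
    return phantom_bump_estimate(counts_by_weekday) - actual_day0_count


if __name__ == "__main__":
    # Toy user: the same two observations for every weekday.
    toy_user = [[1, 3] for _ in range(7)]
    print(phantom_bump_estimate(toy_user))          # estimate under no steering
    print(residual(toy_user, actual_day0_count=5))  # negative: more active than predicted
```

With identical weekday distributions the estimate reduces to the single-distribution form m + s^2/m, which is the relationship one would expect between Theorem 6.1 and Theorem 6.2.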

(a) Histogram of residuals for the non-steerer group
(b) Histogram of residuals for the strong-steerer group
Figure 6. Residuals of actual vs predicted bump sizes for various populations. These are histograms of the residuals for different groups of users.

It is striking that in Figure 6(a) the residuals are symmetric and centered nearly at 0, which suggests that for the non-steerer group the estimator is in fact an accurate estimate of the day-0 action count on average. Moreover, this means that the observed contributions of users identified as non-steerers on the day of the badge are consistent with what we would expect under the assumptions of Model 0: that their number of actions on any given day is an independent draw from some day-of-the-week distribution, and that they are not affected by badge proximity at all. Notably, as we move to the group of users identified as strong-steerers, the residuals skew asymmetrically and shift towards negative values. This signifies that the estimate for a user’s day-0 contributions systematically underestimates the actual number of contributions for the users whom we identify as strong-steerers. Just as Theorem 6.2 and Model 0 suggest that Figure 6(a) should be symmetric and centered at zero, this bias in Figure 6(b) is precisely what the goal-gradient hypothesis predicts for steered users.

7. Limitations

The empirical study in Section 5 has a number of limitations which we list here. However, we note that most of these limitations can be addressed in future work which is discussed in Section 8.

First, our empirical analysis was limited to 6,916 SO users who achieved the Electorate badge, which is the highest badge category for vote actions. This is a small population of users in comparison to the 25,314 users who achieved the other vote-action badge (called the Civic Duty badge). We focused on the former badge as it was the badge associated with the highest reported steering effect from Anderson et al. (2013). We did, however, conduct a similar study on the users who achieved the Civic Duty badge, which is awarded for performing vote actions. Our results are qualitatively similar (with a similar clustering of users), but the deviation that is learnt is smaller for this population, perhaps reflecting that the badge is easier to achieve and therefore engenders a smaller response from the users. The mean plot of users’ actions (analogous to Figure 3(b) for the Electorate badge) is available in Appendix C. The thresholds of alternative vote-type badges are set at a single action, and thus steering cannot be measured for users who achieve such badges (however, this effect has been studied in Kusmierczyk and Gomez-Rodriguez (2018)).

A second limitation that is related to the first is our focus on only one action type, that of voting-actions. We chose to do this in order to study the effect of the badge on the action that it directly incentivizes, which is presumably associated with the largest response to the badge ( in our model). This is a limitation as previous work has raised the possibility that a badge of one type (e.g., incentivizing voting actions) can have effects on actions of another type (e.g., editing or reviewing actions) (Anderson et al., 2013; Mutter and Kundisch, 2014). We note this is a limitation of our empirical study and not a limitation of the models that have been presented as the models can extend easily to this new setting where many actions are included in a new likelihood model.

A third limitation is that we explicitly assume a single parametric form for the steering parameter. This assumption is that all steered users deviate from their preferred distribution in the same way, and that users only differ in the strength parameter S. However, a significant portion of the population (roughly 37%, the remainder after the 20.9% of strong-steerers and the 41.8% of non-steerers) is neither steered nor non-steered (these users are not in the two modes that are described in Figure 1), which suggests that this simplifying assumption is inadequate.

Finally, we note a limitation of the modeling approach outlined in Section 3. While the model does provide a good fit to the data, it does not reason about cognitive aspects affecting users’ behavior, such as intention to achieve a badge or motivation to contribute to the site. This is natural given the observational form of this study, and we instead focus on providing descriptive insights into people’s behavior. The cognitive processes and intrinsic motivations for pursuing badges remain unknown.

8. Conclusion and Future work

We have presented a novel probabilistic model that describes how users interact on the SO platform and in particular how these users respond to badge incentives on the website. We demonstrated how this model can be fit to the data that is provided by SO and we investigated the distribution that is learnt over the latent space that describes the “steering effect”.

Our results provide a more informed understanding of how users respond to badges in online communities. First, that some users do exhibit steering supports the claims made by previous work (Anderson et al., 2013; Mutter and Kundisch, 2014; Yanovsky et al., 2019; Li et al., 2012). However, approximately 40% of the population do not exhibit steering. These users do not change the rate of their activity for the weeks under study, rather they continue to act with the same rate well after the badge has been achieved. This suggests that these users have reasons for performing voting actions on SO which do not include the desire to obtain the Electorate badge.

Second, the 20% of the population identified as strong-steerers significantly decrease their level of contributions after day zero, beyond what was previously reported. It is possible that assigning additional badges, with thresholds beyond those already in place in SO will continue to motivate such users.

Third, any analysis of badge behavior must take into account the presence of the phantom steering bump which has not previously been acknowledged in the context of badges. This statistical artifact is model independent and may lead to inflated conclusions about the effect of badges on users’ behavior.

Future work will apply the models of Section 3 to badge types from SO that are awarded for activities other than vote actions, such as the Strunk&White or Copy Editor badges that reward editing actions. We wish to study the indirect effect of badges of a certain type on actions that are not directly incentivized by this badge, as stated by Anderson et al. (2013) and Li et al. (2012). Our model can be directly applied to studying this question by modeling the likelihood of activities as a vector for each action type.

Also, we will update the model to include multiple parameters to capture different responses to badges across users. It is possible that the roughly 37% of users who fall into neither the strong-steerer nor the non-steerer group have responses that are not described by a single parameter; allowing for this more intricate model will allow us to capture more users in some behavioral archetype.

A final exciting note for this work is that we are in the process of working with SO to run a survey of the users on the website. A deeper understanding of how and why users contribute to these peer-production websites will inform the design of more personalized and effective rewards that motivate and engage the users.


  • Anderson et al. (2013) Ashton Anderson, Daniel Huttenlocher, Jon Kleinberg, and Jure Leskovec. 2013. Steering user behavior with badges. In Proceedings of the 22nd international conference on World Wide Web. ACM, 95–106.
  • Anderson et al. (2014) Ashton Anderson, Daniel Huttenlocher, Jon Kleinberg, and Jure Leskovec. 2014. Engaging with massive online courses. In Proceedings of the 23rd international conference on World Wide Web. ACM, 687–698.
  • Bishop (2006) Christopher M Bishop. 2006. Pattern recognition and machine learning. Springer.
  • Blei (2014) David M Blei. 2014. Build, compute, critique, repeat: Data analysis with latent variable models. Annual Review of Statistics and Its Application 1 (2014), 203–232.
  • Blei et al. (2003) David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research 3, Jan (2003), 993–1022.
  • Bornfeld and Rafaeli (2017) Benny Bornfeld and Sheizaf Rafaeli. 2017. Gamifying with badges: A big data natural experiment on Stack Exchange. First Monday 22, 6 (2017).
  • Box and Hunter (1962) George EP Box and William G Hunter. 1962. A useful method for model-building. Technometrics 4, 3 (1962), 301–318.
  • Hoffman et al. (2013) Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. 2013. Stochastic variational inference. The Journal of Machine Learning Research 14, 1 (2013), 1303–1347.
  • Hull (1932) Clark L Hull. 1932. The goal-gradient hypothesis and maze learning. Psychological Review 39, 1 (1932), 25.
  • Ipeirotis and Gabrilovich (2014) Panagiotis G Ipeirotis and Evgeniy Gabrilovich. 2014. Quizz: targeted crowdsourcing with a billion (potential) users. In Proceedings of the 23rd international conference on World Wide Web. ACM, 143–154.
  • Kingma et al. (2016) Durk P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. 2016. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems. 4743–4751.
  • Kingma and Welling (2013) Diederik P Kingma and Max Welling. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114 (2013).
  • Kivetz et al. (2006) Ran Kivetz, Oleg Urminsky, and Yuhuang Zheng. 2006. The goal-gradient hypothesis resurrected: Purchase acceleration, illusionary goal progress, and customer retention. Journal of Marketing Research 43, 1 (2006), 39–58.
  • Kusmierczyk and Gomez-Rodriguez (2018) Tomasz Kusmierczyk and Manuel Gomez-Rodriguez. 2018. On the causal effect of badges. In Proceedings of the 2018 World Wide Web Conference. 659–668.
  • Li et al. (2012) Zhuolun Li, Ke-Wei Huang, and Huseyin Cavusoglu. 2012. Quantifying the impact of badges on user engagement in online Q&A communities. In International Conference on Information Systems.
  • Mutter and Kundisch (2014) Tobias Mutter and Dennis Kundisch. 2014. Behavioral mechanisms prompted by badges: The goal-gradient hypothesis. In International Conference on Information Systems.
  • Neal (1993) Radford M Neal. 1993. Probabilistic inference using Markov chain Monte Carlo methods. Department of Computer Science, University of Toronto Toronto, ON, Canada.
  • Papamakarios et al. (2019) George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. 2019. Normalizing Flows for Probabilistic Modeling and Inference. arXiv preprint arXiv:1912.02762 (2019).
  • Ranganath et al. (2014) Rajesh Ranganath, Sean Gerrish, and David Blei. 2014. Black box variational inference. In Artificial Intelligence and Statistics. 814–822.
  • Rezende and Mohamed (2015) Danilo Jimenez Rezende and Shakir Mohamed. 2015. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770 (2015).
  • Yanovsky et al. (2019) Stav Yanovsky, Nicholas Hoernle, Omer Lev, and Kobi Gal. 2019. One Size Does Not Fit All: Badge Behavior in Q&A Sites. In Proceedings of the 27th ACM Conference on User Modeling, Adaptation and Personalization. ACM, 113–120.
  • Zhang et al. (2019) Haoxiang Zhang, Shaowei Wang, Tse-Hsun Chen, and Ahmed E Hassan. 2019. Reading Answers on Stack Overflow: Not Enough! IEEE Transactions on Software Engineering (2019).


Appendix A Omitted Proofs

Here we present the proof of Theorem 6.1, which we restate here in a more general setting. Let be a nonnegative, bounded, and integer-valued random variable. Let be independent random variables which are distributed identically to . We will be concerned with the partial sums . Let denote the random variable which is the copy that brings across the threshold ; that is, for which and .

Theorem A.1 ().

If is nonnegative, integer-valued, and bounded then

More generally, we also consider the case when the are drawn from distributions repeatedly in turn. Then the partial sums are , where all copies of are independent. Let denote the event that is drawn from distribution , and let . For this setting we have the following theorem:

Theorem A.2 ().

If each of the distributions is finite, nonzero, nonnegative, and integer valued then

Theorem A.1 follows directly from Theorem A.2 by taking the to be identically distributed. Therefore we focus on proving Theorem A.2.
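As a numerical sanity check on the periodic setting, the following Monte Carlo sketch draws daily counts from a repeating weekly cycle of distributions and compares the mean of the threshold-crossing draw with sum_d E[X_d^2] / sum_d E[X_d]. That Theorem A.2 has this limiting form is an assumption made here; the weekday and weekend distributions, the threshold, and the function names are likewise hypothetical.

```python
import random

# Hypothetical day-of-the-week distributions (each listed value equally likely).
WEEKDAY_COUNTS = [0, 1, 2, 6]
WEEKEND_COUNTS = [0, 0, 1]
WEEK = [WEEKDAY_COUNTS] * 5 + [WEEKEND_COUNTS] * 2


def crossing_draw(threshold, rng):
    """Return the daily count that carries the cumulative sum across `threshold`."""
    day = rng.randrange(len(WEEK))  # a user may start on any day of the week
    total = 0
    while True:
        today = rng.choice(WEEK[day % len(WEEK)])
        if total + today >= threshold:
            return today
        total += today
        day += 1


def simulated_crossing_mean(num_users=8000, threshold=100, seed=1):
    """Monte Carlo estimate of the expected threshold-crossing draw."""
    rng = random.Random(seed)
    return sum(crossing_draw(threshold, rng) for _ in range(num_users)) / num_users


def assumed_limit():
    """sum_d E[X_d^2] / sum_d E[X_d] over one full weekly cycle."""
    second_moments = sum(sum(x * x for x in c) / len(c) for c in WEEK)
    first_moments = sum(sum(c) / len(c) for c in WEEK)
    return second_moments / first_moments


if __name__ == "__main__":
    print("simulated:", simulated_crossing_mean())
    print("assumed limit:", assumed_limit())
```

When all seven distributions are identical, the ratio collapses to E[X^2]/E[X], matching the remark that Theorem A.1 is the special case of Theorem A.2 with identically distributed draws.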

We begin by showing that the likelihood of the sequence visiting any given number is asymptotically uniform. Let us define and and observe that if then . Also, if then clearly . For the for which , we have the following lemma:

Lemma A.3 ().

If is nonzero, nonnegative, and bounded then


First, it suffices to assume that . This is because the integer-valued random variable has mean and , and proving the claim for implies the claim for . It also suffices to assume that . This is because the sequence remains at a specific value only so long as the independent draws are , after which it leaves forever. The expected number of steps that lingers at for is exactly , where . Since by assumption, we may prove the claim for . Then and

and so proving the claim for proves the claim for also. To conclude, we may assume without loss of generality that and that .

Let the maximum value that obtains be . Then the obey the recurrence


with the initial conditions and for all . Because is bounded by , we may break up into “epochs” , and then define with . For any we can then iteratively expand the terms in Equation 4 for which until the expression for each depends only on the previous epoch, which gives an alternative recurrence of the form


where (and the initial conditions are the values of for ). It is important to note that these do not depend on . The recurrences in Equation 4 and Equation 5 give as a convex combination of previous values, and so we may rewrite Equation 5 as , where is a right stochastic square matrix. Furthermore it follows from the assumption that is primitive. Therefore the Perron-Frobenius Theorem implies that converges exponentially quickly to a matrix of the form , where and are the unique right and left eigenvectors of corresponding to the eigenvalue 1. This in turn implies that converges to some uniform vector , and therefore that .

Finally we argue that . We can show this by considering the mean number of times that intersects some interval . Since the converge, for fixed we may use linearity of expectation to choose large enough to guarantee that for any given . On the other hand, by considering the as “restarting” when they reach the epoch preceding , we may use the central limit theorem to argue that . Taking the limit as becomes large yields . ∎
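The content of the lemma can be illustrated numerically. The sketch below iterates a renewal-style recurrence for the probability that the sequence of partial sums ever lands exactly on a given value, and shows that this probability flattens out to 1/E[X]. The names `u` and `STEP_PMF`, and the example distribution, are assumptions introduced for illustration; the sketch also adopts the simplifications made in the proof, namely that the step distribution places no mass on zero and that the gcd of its support is 1.

```python
# Hypothetical step distribution: STEP_PMF[k] = P(X = k), with P(X = 0) = 0.
STEP_PMF = [0.0, 0.5, 0.3, 0.2]


def visit_probabilities(pmf, horizon):
    """Return u with u[m] = P(some partial sum equals m), for m = 0..horizon.

    Assumes pmf[0] == 0, mirroring the without-loss-of-generality step above.
    """
    if pmf[0] != 0:
        raise ValueError("this sketch assumes P(X = 0) = 0")
    max_step = len(pmf) - 1
    u = [0.0] * (horizon + 1)
    u[0] = 1.0  # the partial sums start at 0
    for m in range(1, horizon + 1):
        # Condition on the size k of the step that lands exactly on m.
        u[m] = sum(pmf[k] * u[m - k] for k in range(1, min(m, max_step) + 1))
    return u


if __name__ == "__main__":
    u = visit_probabilities(STEP_PMF, 100)
    mean_step = sum(k * p for k, p in enumerate(STEP_PMF))
    print("u[1..5]:", u[1:6])
    print("u[100]:", u[100], " 1/E[X]:", 1 / mean_step)
```

The early values oscillate, but they settle quickly onto a common constant, which is the asymptotic uniformity that the proof of Theorem A.2 relies on.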

With this lemma in hand, we are ready to analyze the distribution of in the limit.

Proof of Theorem A.2.

First, we claim that we may assume without loss of generality that the gcd of the supports of is . To see this, define new integer random variables . If the claim holds for then since it follows that

and the general claim is proven.

For a fixed threshold , we are interested in the event that both is a copy of and that . Then

It follows that

where the last equality holds by passing the limit through the sum and applying Lemma A.3 to in order to conclude that each of the probabilities on the right hand side approaches . Therefore,

We conclude that

Appendix B Network Structure

The following tables detail the number of parameters and the structure of the generator and inference networks for the models from Section 3. The structures of these networks can be updated for different inference implementations without changing the fundamental structure of the models proposed in Section 3.

Layer group Layer # Details
Latent dim - 20-dim Gaussian
Normalizing Flows - layers.
Feed-Forward 1 Hidden units: 200; Activation: ReLU
2 Hidden units: 400; Activation: ReLU
3 Hidden units: 200; Activation: ReLU
Table 3. Layers, Number of Parameters and Activation Units for Generator Network
Layer group Layer # Details
Feed-Forward 1 Hidden units: 200; Activation: ReLU
2 Hidden units: 400; Activation: ReLU
3 Hidden units: 200; Activation: ReLU
- Mean parameters for Gaussian. Hidden units: 50; Activation: ReLU
- Variance parameters for Gaussian. Hidden units: 50; Activation: ReLU
Table 4. Layers, Number of Parameters and Activation Units for Inference Network

Appendix C Plots for Civic Duty Achievers

The following two plots show the deviation that was learnt for the Civic Duty population of users () and the mean plot of actions for the groups inferred from the plot of (with the same thresholds as in the paper for the Electorate group).

Figure 7. Expected deviation from a user’s preferred distribution for Civic Duty.
Figure 8. Mean number of actions per day for users who are grouped by the magnitude of their steering strength parameters (S) from the Civic Duty badge.