1 Introduction
Many everyday situations involve sequential decision making, where one makes a decision based on some private information and the previous actions of others. Consider, for example, the following classroom experiment (see [9, Chapter 16]). An experimenter puts an urn containing three balls at the front of the room. This urn is either majority blue, containing two blue balls and one yellow ball, or majority yellow, containing two yellow balls and one blue ball; both urns are equally likely to be chosen. The students then come to the front of the room one by one. Each student draws a ball at random from the urn and puts it back without showing it to the rest of the class. The student then has to guess the majority color of the urn, announcing her guess publicly. Each student thus makes her decision based on her draw and the announcements of those who went before her.
Let us consider how such an experiment proceeds. The first student only has her own draw to go by, so she will announce the drawn color as her best guess of the majority color. The second student knows this, so together with her own draw she has two independent draws as information. If the colors of the two draws agree, then the second student announces this color. If the colors of the two draws differ, then she has to use a tie-breaking rule; let us assume that she breaks ties by following her own draw. With this choice we see that the second student also announces the color of the ball she drew. Hence the third student has three independent draws as information and her best guess for the majority color of the urn will be the majority color among the three draws.
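As a quick sanity check of this reasoning, the posterior can be computed in closed form. The following sketch (the function name and interface are ours) assumes the two-balls-versus-one urn above, under which each draw shifts the posterior odds by a factor of 2:

```python
def posterior_majority_blue(n_blue: int, n_total: int) -> float:
    """Posterior probability that the urn is majority blue, given n_blue
    blue draws out of n_total independent draws (with replacement) and a
    uniform prior over the two urns."""
    # Each blue draw has likelihood 2/3 vs 1/3, each yellow draw 1/3 vs 2/3,
    # so the posterior odds collapse to 2^(2*n_blue - n_total).
    odds = 2 ** (2 * n_blue - n_total)
    return odds / (1 + odds)

# The majority color among three independent draws is indeed the MAP guess:
assert posterior_majority_blue(2, 3) > 0.5 > posterior_majority_blue(1, 3)
```

Note that two draws of one color and one of the other give posterior 2/3 for that color, exactly the same as a single draw of that color; this is why, once two announcements agree, a third student's contrary draw cannot outweigh them.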
Notice that if the first two announced colors were blue, then the third student announces blue regardless of the color of her draw. The fourth student knows this and hence the announcement of the third student has no information value. The fourth student is thus in the same situation as the third one and will also just announce blue following the first two students. Following the same logic, all subsequent students will announce blue, regardless of the color of their draw. This phenomenon is known as herding or as an information cascade. Its study was originated by Banerjee [4] and by Bikhchandani, Hirshleifer, and Welch [6], independently and concurrently; we refer to Easley and Kleinberg [9, Chapter 16] for an exposition.
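This herding dynamic is easy to simulate. In the sketch below (our own code, assuming the tie-break "follow your own draw"), a student announces her own draw whenever the net lead of inferred signals is at most one, and otherwise joins the cascade; once the lead reaches two, every subsequent announcement is fixed.

```python
import random

def simulate_classroom(n_students: int, majority_blue: bool, rng: random.Random):
    """Simulate the urn experiment with all-Bayesian students who break
    ties by following their own draw. Returns the list of announcements,
    with 'B' marking blue and 'Y' marking yellow."""
    p_blue = 2/3 if majority_blue else 1/3
    lead = 0  # (#inferred blue signals) - (#inferred yellow signals)
    announcements = []
    for _ in range(n_students):
        draw = 'B' if rng.random() < p_blue else 'Y'
        if lead >= 2:        # cascade on blue: announcement carries no information
            guess = 'B'
        elif lead <= -2:     # cascade on yellow
            guess = 'Y'
        else:                # |lead| <= 1: own draw decides (ties follow own draw)
            guess = draw
            lead += 1 if draw == 'B' else -1
        announcements.append(guess)
    return announcements
```

Running this with `majority_blue=True` occasionally produces an all-yellow tail of announcements: a wrong cascade locked in by the first couple of draws.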
The main issue with information cascades is that they can be wrong. For instance, it might be that the urn is majority yellow, but the first two students draw a blue ball and announce blue, and hence everyone announces blue as their guess. The main cause of this is that information cascades can be based on very little information: the actions of a few initial actors can determine all subsequent actions. This also explains why information cascades are fragile: if additional information is revealed (e.g., someone reveals not only their action but also their private signal) or if some people deviate from rational behavior, then wrong information cascades can be broken. The focus of this paper is to analyze the fragility of information cascades quantitatively.
To this end, we study a variant of the simple model of sequential decision making studied previously, where not all agents are Bayesian: some are “revealers”, who disregard the actions of others and act solely based on their private signal. This is motivated both by empirical results from laboratory experiments on human behavior in such a setting [3] and by theoretical considerations [5]; see Section 1.3 for further discussion of related work. We assume that the player at time $t$ is a revealer with probability $p_t$, independently of everything else, and is a Bayesian otherwise. While agents do not know whether those before them were Bayesians or revealers, this process still introduces additional information that can be useful for making inferences. Are wrong information cascades broken in such a model? That is, do people eventually learn the “correct” action?
We show that the answer is yes: there exist revealing probabilities $(p_t)_{t \geq 1}$ such that learning occurs. Moreover, we study the optimal asymptotic rate at which the error probability at time $t$ can go to zero. We show that the optimal policy is for the player at time $t$ to follow their private information with probability $p_t$ on the order of $1/t$, leading to a learning rate on the order of $1/t$, where the constants in both are explicit.
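To make the model concrete, here is a simulation sketch (entirely our own code and parameter choices, not the paper's). The key point is that a Bayesian can update the likelihood of the observed action history under each state online, because the conditional action distribution at each step depends on the history only through the current likelihoods:

```python
import random

# P(signal | theta): the majority color is drawn with probability 2/3.
P_SIG = {'B': {'B': 2/3, 'Y': 1/3},
         'Y': {'B': 1/3, 'Y': 2/3}}

def run(n, theta, p, rng):
    """Simulate n players; player t is a revealer with probability p(t).
    Returns the fraction of the last n//10 players who guess theta correctly."""
    like = {'B': 1.0, 'Y': 1.0}  # likelihood of the action history under each state

    def bayes_guess(sig):
        # MAP estimate given the actions so far and own signal; ties follow the signal.
        wB = like['B'] * P_SIG['B'][sig]
        wY = like['Y'] * P_SIG['Y'][sig]
        return 'B' if wB > wY else 'Y' if wY > wB else sig

    tail = max(1, n // 10)
    correct = 0
    for t in range(1, n + 1):
        pt = p(t)
        sig = 'B' if rng.random() < P_SIG[theta]['B'] else 'Y'
        act = sig if rng.random() < pt else bayes_guess(sig)
        # Probability of the observed action under each hypothetical state
        # (computed before updating, since bayes_guess reads `like`):
        probs = {x: pt * P_SIG[x][act]
                    + (1 - pt) * sum(P_SIG[x][s] for s in 'BY' if bayes_guess(s) == act)
                 for x in 'BY'}
        like = {x: like[x] * probs[x] for x in 'BY'}
        norm = like['B'] + like['Y']  # normalize to avoid numerical underflow
        like = {x: like[x] / norm for x in 'BY'}
        if t > n - tail and act == theta:
            correct += 1
    return correct / tail
```

With `p = lambda t: 0.0` this reproduces pure cascades (the tail is all one color, right or wrong); with slowly vanishing revealing probabilities such as `p = lambda t: min(1.0, 5/t)` (an illustrative choice), the occasional visible deviations give later Bayesians enough information to escape wrong cascades.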
1.1 Model and main result
We describe the simplest case of the model first, in order to focus on the conceptual points; we discuss generalizations at the end of the paper. The state of the world is $\theta \in \{\mathsf{B}, \mathsf{Y}\}$, chosen uniformly at random, that is, $\mathbb{P}(\theta = \mathsf{B}) = \mathbb{P}(\theta = \mathsf{Y}) = 1/2$. At times $t = 1, 2, \ldots$, players try to guess the state of the world, based on their private information, as well as observing the actions (guesses) of those before them.
The private signals are drawn in the following way. There is an urn that contains two types of balls: type $\mathsf{B}$ balls are blue and type $\mathsf{Y}$ balls are yellow. Given $\theta$, there are two balls of type $\theta$ in the urn and one ball of the other type. Each player draws a single ball (with replacement) from the urn; its color is their private signal. In other words, the private signals $s_1, s_2, \ldots$ are i.i.d. with the following distribution: $\mathbb{P}(s_t = \theta \mid \theta) = 2/3$ and $\mathbb{P}(s_t \neq \theta \mid \theta) = 1/3$.
The goal of the players is to guess the majority color (type) of the balls in the urn. We denote the actions (guesses) of the players by $a_1, a_2, \ldots$. We assume that each player is one of two kinds:

- a Bayesian, whose guess is the maximum a posteriori (MAP) estimate of $\theta$ given the actions of previous players and their own private signal (we assume that if the posteriors are equal, then a Bayesian follows their private signal); or
- a revealer, whose guess is their private signal.
We assume that player $t$ is a revealer with probability $p_t$, independently of everything else. Formally, let $R_1, R_2, \ldots$ be independent Bernoulli random variables (and also independent of everything else) such that $\mathbb{P}(R_t = 1) = p_t$. If $R_t = 0$, then player $t$ is a Bayesian and hence their guess $a_t$ is the MAP estimate, while if $R_t = 1$, then player $t$ is a revealer and hence $a_t$ equals their private signal $s_t$. Note that players do not know whether the players before them are Bayesians or revealers. We do assume, however, that the players know the probabilities $(p_t)_{t \geq 1}$. The players aim to learn the majority color/type of the urn, that is, to learn $\theta$, and also to minimize the probability of an incorrect guess. Denote by $\mathbb{P}(a_t \neq \theta)$ the probability that the guess of player $t$ is incorrect. We aim to understand the optimal asymptotic rate at which the error probability can go to zero; the following theorem is our main result.
Theorem 1.1.
Consider the model described above and let
(1.1) 
We have that
(1.2) 
That is, the optimal rate of learning is on the order of $1/t$, and we obtain the specific constant as well in (1.1). As we shall see, one can get arbitrarily close to the optimum by taking
(1.3) 
for $t \geq 1$, where $\varepsilon > 0$ is arbitrary (and where we use the standard notation $x \wedge y := \min\{x, y\}$).
1.2 Heuristic explanation of the optimal rate of learning
We now provide intuition for why $1/t$ is the optimal order for the rate of learning, as well as the reasons behind the constant in (1.1). First, note that the probability that player $t$ is a revealer and draws a ball of the minority color is $p_t/3$. When this occurs, player $t$ guesses incorrectly, implying that $\mathbb{P}(a_t \neq \theta) \geq p_t/3$. So in order for the error probability to go to zero, $p_t$ must go to zero as $t \to \infty$.
On the other hand, $p_t$ cannot go to zero too quickly. If $\sum_t p_t < \infty$, then by the Borel–Cantelli lemma there will be only finitely many revealers almost surely. This leads to a situation similar to when there are no revealers: if a correct cascade has not started before the last revealer, then there is a constant probability of ending up in a wrong cascade.
In fact, $p_t$ should decay on the order of $1/t$ to achieve the optimal rate of learning. To see a lower bound of this order, let $p_t = \varepsilon/t$ for $\varepsilon$ small. By a Chernoff bound, with high probability there will be at most $2\varepsilon \log t$ revealers among the first $t$ players. If the first two players and all revealers until time $t$ draw balls of the minority color, then every announcement until time $t$ is of the minority color. This event has probability at least $c\, t^{-C\varepsilon}$ for some constants $c, C > 0$, which is greater than $1/t$ if $\varepsilon$ is small enough.
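The count of revealers in this style of argument is easy to check directly. A small Monte Carlo sketch (our own code; the values of the parameters `eps` and `t` are arbitrary illustrative choices) confirms that with revealing probability $\varepsilon/s$ at step $s$, the number of revealers among the first $t$ players has mean roughly $\varepsilon \log t$ and rarely exceeds a constant multiple of it:

```python
import random
from math import log

def count_revealers(t: int, eps: float, rng: random.Random) -> int:
    # Player s is a revealer independently with probability eps/s (capped at 1).
    return sum(rng.random() < min(1.0, eps / s) for s in range(1, t + 1))

eps, t = 0.5, 10_000
rng = random.Random(0)
samples = [count_revealers(t, eps, rng) for _ in range(200)]
mean = sum(samples) / len(samples)
print(mean, eps * log(t))              # the two should be roughly comparable
print(max(samples), 2 * eps * log(t) + 5)  # a Chernoff-style bound, rarely exceeded
```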
To see an upper bound, we now argue that if $p_t = C/t$ with $C$ large enough, then the probability that a wrong cascade (among Bayesians) lasts until time $t$ is $o(1/t)$. Indeed, in such a cascade some of the revealers will be visible: precisely those that deviate from the cascade consensus. The player at time $s$ has probability $2p_s/3$ of being a revealer who draws the majority color, and probability $p_s/3$ of being a revealer who draws the minority color. Hence in a wrong cascade there will be, in expectation, roughly $\frac{2C}{3} \log t$ deviations from the cascade consensus by time $t$, while in a right cascade the expected number of deviations is only roughly $\frac{C}{3} \log t$. The total number of deviations by time $t$ will roughly be Poisson distributed. The probability that a Poisson random variable is larger by a constant factor than its mean $\lambda$ is exponentially small in $\lambda$, and here $\lambda$ is on the order of $C \log t$. Hence by taking $C$ large this exponential in $C \log t$ will be $o(1/t)$.
In fact, the heuristics of the previous paragraph give the right constant as well. To distinguish between right and wrong cascades we need to distinguish between $\mathrm{Poi}(2\lambda/3)$ and $\mathrm{Poi}(\lambda/3)$ random variables, where $\lambda := \sum_{s \leq t} p_s$. The total variation distance between them satisfies
(1.4) $1 - d_{\mathrm{TV}}\left( \mathrm{Poi}(2\lambda/3), \mathrm{Poi}(\lambda/3) \right) = \exp(-g(\lambda))$
for some (explicit) function $g$. The right hand side of (1.4) is roughly the error probability if player $t$ is a Bayesian. This term should be balanced with the term $p_t/3$ coming from player $t$ being a revealer and drawing a ball of the minority color. This balancing requires choosing $p_t$ so that the two terms are of the same order, which occurs when $p_t$ is a constant multiple of $1/t$, just as in (1.3).
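The Poisson comparison above is easy to probe numerically. The following sketch (our own code; the truncation point and the sampled values of $\lambda$ are arbitrary choices) computes the total variation distance between $\mathrm{Poi}(2\lambda/3)$ and $\mathrm{Poi}(\lambda/3)$ and shows the gap $1 - d_{\mathrm{TV}}$ shrinking rapidly as $\lambda$ grows:

```python
from math import exp

def poisson_pmf_upto(mu: float, k_max: int):
    """List of Poisson(mu) probabilities for k = 0, ..., k_max."""
    p, out = exp(-mu), []
    for k in range(k_max + 1):
        out.append(p)
        p *= mu / (k + 1)  # P(k+1) = P(k) * mu / (k+1), avoiding overflow
    return out

def tv_poisson(mu1: float, mu2: float, k_max: int = 500) -> float:
    """Total variation distance between Poisson(mu1) and Poisson(mu2),
    truncated at k_max (ample for the means used here)."""
    p1 = poisson_pmf_upto(mu1, k_max)
    p2 = poisson_pmf_upto(mu2, k_max)
    return 0.5 * sum(abs(a - b) for a, b in zip(p1, p2))

for lam in (3.0, 10.0, 30.0):
    print(f"lambda = {lam:4.0f}:  1 - d_TV = {1 - tv_poisson(2*lam/3, lam/3):.2e}")
```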
1.3 Related work
The special case of the model described in the previous subsection where every agent is Bayesian (i.e., with $p_t = 0$ for every $t$) is identical to the model described in the exposition of Easley and Kleinberg [9, Chapter 16]. The original model of Bikhchandani, Hirshleifer, and Welch [6] differs only in its tie-breaking rule (breaking ties by flipping a fair coin), while that of Banerjee [4] differs in the signal distribution (false signals are drawn from a continuous distribution). Despite these minor differences, these models all share the same phenomenological behavior described in the introductory paragraphs.
In particular, Bikhchandani et al. [6] emphasize the fragility of information cascades with respect to different types of shocks, as prior work on conforming behavior could not explain this phenomenon. They show examples from numerous fields (e.g., politics, zoology, medicine, and finance) where cascades occur and are fragile. The current paper can be viewed as a more detailed quantitative exploration of the fragility of cascades. What amount of additional information is needed to break wrong cascades? What is the optimal rate of learning that can be achieved?
One possible source of additional information comes from people not acting in a rational, Bayesian manner. It is well documented that human behavior is often irrational (see, e.g., [14]). In the information cascades setting, laboratory experiments by Anderson and Holt [3] show that while most participants act rationally, many do not. When deviations from rational behavior occur, participants often act mainly or solely based on their private information, disregarding the information in the actions of those before them (see also related experiments and results by Çelen and Kariv [7] for a setting with continuous signals). Such individuals effectively reveal their private signal, which is valuable information for those coming after them. The model described in Section 1.1, which contains Bayesians and revealers, captures this empirically observed behavioral phenomenon.
A closely related model was introduced and studied by Bernardo and Welch [5]. This model also contains two types of individuals: (1) rational ones and (2) overconfident ones, termed “entrepreneurs”, who put more weight on their private signal than a rational individual would. As the authors mention in their paper, their motivation was not to show that information cascades can be broken by overconfident behavior, but rather to offer a simple explanation for the existence of overconfident individuals based on group selection principles. Nevertheless, since here our focus is on breaking wrong information cascades, we compare our work with theirs in this regard.
In [5] the overconfidence of entrepreneurs is termed “modest” if they still put positive weight on the information from individuals before them, and it is termed “extreme” if they act solely based on their private signal. If the overconfidence of entrepreneurs is modest, then a wrong cascade still occurs with probability bounded away from zero, which is undesirable. Only if the overconfidence of entrepreneurs is extreme, as in the model in Section 1.1, can learning occur eventually with probability one. Bernardo and Welch study their model via simulations which, in the extreme overconfidence setting, suggest that a vanishing fraction of entrepreneurs is optimal. Our work is a rigorous and much more detailed study of this model (we note that there are small differences between the model studied here and the model of Bernardo and Welch [5]; for instance, in [5] it is assumed that the identities of entrepreneurs are known, whereas we do not make this assumption); in particular, our results imply that the optimal number of entrepreneurs is logarithmic in the size of the group.
The recent work of Cheng, Hann-Caruthers, and Tamuz [8] also considers sequential learning models with non-Bayesian agents and shows that wrong cascades can be avoided if there are some non-Bayesian agents. However, they assume (as in [5]) that each agent knows which of the previous agents were revealers, an assumption that we do not make. More importantly, the main contribution of the current paper is the explicit characterization of the optimal rate of learning.
Another possible source of additional information is using the first few agents as “guinea pigs”, that is, forcing them to follow their private signals; see the work of Sgroi [20]. Le, Subramanian, and Berry [16] point out that this is related to the multi-armed bandit literature, with guinea pigs corresponding to agents used for exploration [15, 2]. They also mention that it follows from this literature that the optimal number of guinea pigs is logarithmic in the number of agents [2]; this is consistent with our results, albeit in a slightly different setting.
The framework for sequential decision making described in this paper assumes finite discrete private signals. If the informativeness of private signals is unbounded (e.g., Gaussian signals), then wrong cascades do not form and asymptotic learning occurs [21]. In such settings the main question concerns the speed of asymptotic learning; see, for instance, the work of Hann-Caruthers, Martynov, and Tamuz [12].
The framework of this paper also fits into the broader field of social learning. In particular, there is a large literature on learning in social networks. Acemoglu, Dahleh, Lobel, and Ozdaglar [1] consider a model of sequential decision making where agents act only once, but each agent can only observe a subset of previous actions, based on a stochastic social network. One of their results is that asymptotic learning occurs even when private signals have bounded informativeness, provided there are sufficiently many individuals whose neighborhoods are non-persuasive and whose actions are therefore necessarily influenced by their private signals. Similarly to revealers in the model described in Section 1.1, these individuals provide enough information for those coming after them to achieve asymptotic learning.
Another typical setting that is studied involves agents who take repeated actions based on their private signal, as well as observing the actions of their neighbors in the network. The main questions include whether all agents eventually learn the correct action, what the speed of learning is if it occurs, and how these depend on the network topology. We highlight recent work of Harel, Mossel, Strack, and Tamuz [13], which is similar in spirit to the current paper in that it provides a detailed study of the asymptotic rates of social learning in a mean-field setting. A complete overview of the literature is beyond the scope of this article; we refer the reader to the two papers above, as well as to the works of Gale and Kariv [11], Mossel, Sly, and Tamuz [17, 18, 19], and the references therein.
2 Proof of Theorem 1.1
The action of player $t$ can be wrong in two ways: (i) if they act as a Bayesian and the MAP estimator is incorrect, or (ii) if they act on only their private signal and their draw from the urn is of the minority type/color. Hence, conditioning on the coin flip deciding whether player $t$ is a Bayesian or a revealer, we obtain that
$\mathbb{P}(a_t \neq \theta) = (1 - p_t)\, \mathbb{P}(\widehat{\theta}_t \neq \theta) + p_t\, \mathbb{P}(s_t \neq \theta),$
where $\widehat{\theta}_t$ denotes the MAP estimate of player $t$. Now recall that, given $\theta$, the private signal $s_t$ equals the minority color with probability $1/3$, and so $\mathbb{P}(s_t \neq \theta) = 1/3$. Thus
(2.1) $\mathbb{P}(a_t \neq \theta) = (1 - p_t)\, \mathbb{P}(\widehat{\theta}_t \neq \theta) + \frac{p_t}{3}.$
So we need to understand the probability that the MAP estimator is incorrect at time $t$. We summarize the behavior of the MAP estimator in Lemma 2.1 and then prove Theorem 1.1 using this, before turning to the proof of the lemma. In the statement of the lemma and throughout the paper we use standard asymptotic notation; for instance, $f(t) = O(g(t))$ as $t \to \infty$ if $\limsup_{t \to \infty} |f(t)|/g(t) < \infty$, and $f(t) = o(g(t))$ as $t \to \infty$ if $\lim_{t \to \infty} f(t)/g(t) = 0$.
Lemma 2.1.
Consider the setting of Theorem 1.1 and fix .

Suppose that
(2.2) for every . Then
(2.3) 
Suppose that
(2.4) Then
(2.5)
Proof of Theorem 1.1.
The rest of this section consists of the proof of Lemma 2.1. We start in Section 2.1 by introducing notation and making basic observations about the MAP estimator that are useful for both bounds in Lemma 2.1. Then we turn to the proof of Lemma 2.1 (a) in Section 2.2 and we conclude with the proof of Lemma 2.1 (b) in Section 2.3.
2.1 The MAP estimator
In this subsection we introduce some notation and make basic observations about the MAP estimator that are useful for the bounds in Lemma 2.1, which is proven subsequently.
For , let denote the probability measure conditioned on , that is, . Similarly, denotes expectation conditioned on . For and , denote by the distribution of . That is, for , let
Similarly, for and , denote by the distribution of . Define also the corresponding likelihoods:
with for . An outside observer who records the actions of the first players can compute the likelihoods and , while player can compute the likelihoods and . If player is a Bayesian, then their guess is based on the likelihoods and . Specifically, since the prior on is uniform, we have that
(2.8) 
where the last line is due to the tie-breaking rule; recall that if the posteriors are equal, then a Bayesian follows their private signal.
For and , define
and note that, since is independent of everything else, we have that for and . Thus in order to understand the likelihoods and , we need to analyze and . Define the likelihood and the likelihood ratios as
respectively. We can write
(2.9) 
and hence we can determine the action of player for given and . Note that the random variable takes values in , and hence we have the following three cases.

If , then and hence , regardless of the value of . Hence if player is a revealer and , and otherwise.

If , then . This can be checked by considering both cases. If then . Therefore by (2.9) we have that and hence . If then by the definition of the MAP estimator, while if then by the tie-breaking rule. The case of is analogous.

If , then and hence , regardless of the value of . Hence if player is a revealer and , and otherwise.
The three cases above describe how the action of a player depends on their private signal and on the actions of those who acted before them. This allows us to analyze how the likelihood ratio evolves.
The probability that the MAP estimator makes an error at time $t$ can be expressed using the likelihood ratio as follows. To abbreviate the notation for vectors, we write . First, conditioning on the value of we obtain that (2.10) 
The two terms on the right hand side of (2.10) are equal due to symmetry, so
(2.11) 
Using (2.8) we obtain the following upper and lower bounds:
It is more convenient to work with the likelihood ratio, so using the fact that we obtain that
(2.12) 
To obtain parts (a) and (b) of Lemma 2.1 we bound from above and below the probabilities appearing in (2.12).
2.2 An upper bound
Proof of Lemma 2.1 (a).
By (2.12) our goal is to show that
(2.13) 
as $t \to \infty$, and recall that we assume that the revealing probabilities are as in (2.2).
Let $(\mathcal{F}_t)_{t \geq 1}$ denote the filtration defined by the random variables $a_1, \ldots, a_t$. Observe that, given $\theta$, the inverse of the likelihood ratio is a martingale with respect to this filtration. In particular, this implies that its expectation is constant in $t$. Since $x \mapsto x^{\delta}$ is a concave function for $x \geq 0$ when $\delta \in (0, 1)$, we have that, given $\theta$, the sequence of $\delta$-th powers of the inverse likelihood ratio is a supermartingale with respect to this filtration. Thus its expectation is nonincreasing in $t$. We now compute the conditional expectation explicitly:
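The two facts used here are instances of a general one-step computation: under the true state, the likelihood ratio of the wrong state to the true state has conditional expectation 1 (a martingale step), and by Jensen's inequality its $\delta$-th power, $\delta \in (0,1)$, has conditional expectation at most 1 (a supermartingale step). A one-step numeric check with a 2/3-versus-1/3 two-outcome observation (our own toy instance, not the paper's exact chain):

```python
# Outcome distributions under the two states.
pB = {'B': 2/3, 'Y': 1/3}
pY = {'B': 1/3, 'Y': 2/3}

# Expectation under B of the wrong-over-right likelihood ratio: exactly 1.
m1 = sum(pB[x] * pY[x] / pB[x] for x in 'BY')
assert abs(m1 - 1.0) < 1e-12

# Its delta-th power contracts strictly (the supermartingale step).
delta = 0.5
m_delta = sum(pB[x] * (pY[x] / pB[x]) ** delta for x in 'BY')
assert m_delta < 1.0  # here equal to 2*sqrt(2)/3, about 0.943
```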
(2.14) 
The values of the conditional probabilities in (2.14) depend on the value of . As described previously, we have three cases:
(2.15) 
and also
(2.16) 
Plugging these back into (2.14) we obtain the conditional expectation in the three cases:
(2.17)  
(2.18)  
(2.19) 
The right hand side of (2.18) is strictly less than , while the right hand sides of (2.17) and (2.19) converge to as . To estimate the quantities in (2.17) and (2.19), note that as . Defining
we have from (2.17)–(2.19) that
(2.20)  
(2.21)  
(2.22) 
as . On the interval the function is concave and nonnegative with , it attains its maximum at
and its maximum value is
(2.23) 
where recall the definition of from (1.1). Note also that , due to the fact that .
We group the cases of (2.20) and (2.21) together, but treat them separately from the case of (2.22), which leads to defining the following random sets:
In words, the set is the set of time indices when the estimator is equal to regardless of the private signal at this time. Define also
and note that , since is a partition of . By (2.22), we have that there exists such that for any we have that
(2.24) 
Similarly, by (2.20), together with the fact that the right hand side of (2.20) is greater than the right hand side of (2.21) for all large enough, we have that there exists such that for any we have that
(2.25) 
Let , and note that by the choice of , together with (2.23), we have that
(2.26) 
as . Define also the random variable . Putting together (2.24) and (2.25), it follows by induction that there exists such that for any we have that
(2.27) 
Since , the expectation in (2.27) is bounded above by a constant independent of . That is, there exists such that
(2.28) 
Next, we claim that implies that
(2.29) 
and furthermore that there exists a constant such that
(2.30) 
We defer the proofs of both of these claims to Appendix A.
From (2.29) we get the following bound on the probability of interest:
(2.31) 
By (2.30) the second term is at most , so in order to show (2.13) it suffices to bound from above the first term in the display above. We can break the event into subevents based on the value of . Recall that , so if and , then . Letting we obtain the bound
(2.32) 
We estimate each term in this sum. First, we can rewrite this probability as follows:
If both inequalities hold in the display above, then also the product of the expressions on the left hand sides is greater than or equal to the product of the expressions on the right hand sides. We thus obtain the following bound:
Using Markov’s inequality, together with (2.28) with and , we obtain that
(2.33) 
where the second inequality follows from the facts that and . Recalling from (2.26) that , and using (2.32) and (2.33), we arrive at the following bound:
for some constant . Putting this together with (2.31) and (2.30) we obtain (2.13). ∎