1 Introduction
Recommendation systems are the backbone of numerous digital platforms, from web search engines to video sharing websites to music streaming services. To produce high-quality recommendations, these platforms rely on data obtained through interactions with users. This fundamentally links the quality of a platform’s services to how well the platform can attract users.
What a platform must do to attract users depends on the amount of competition in the marketplace. If the marketplace has a single platform—such as Google prior to Bing or Pandora prior to Spotify—then the platform can accumulate users by providing any reasonably acceptable quality of service given the lack of alternatives. This gives the platform great flexibility in its choice of recommendation algorithm. In contrast, the presence of competing platforms makes user participation harder to achieve and intuitively places greater constraints on the recommendation algorithms. This raises the questions: how does competition impact the recommendation algorithms chosen by digital platforms? How does competition affect the quality of service for users?
Conventional wisdom tells us that competition benefits users. In particular, users vote with their feet by choosing the platform on which they participate. The fact that users have this power forces the platforms to fully cater to user choices and thus improves user utility. This phenomenon has been formalized in classical markets where firms produce homogenous products (bertrand), where competition has been established to perfectly align market outcomes with user utility. Since user wellbeing is considered central to the healthiness of a market, perfect competition is traditionally regarded as the “gold standard” for a healthy marketplace: this conceptual principle underlies measures of market power (lerner) and antitrust policy (dukelaw).
In contrast, competition has an ambiguous relationship with user wellbeing in digital marketplaces, where digital platforms are data-driven and compete via recommendation algorithms that rely on data from user interactions. Informally speaking, these marketplaces exhibit an interdependency between user utility, the platforms’ choices of recommendation algorithms, and the collective choices of other users. In particular, the size of a platform’s user base impacts how much data the platform has and thus the quality of its service; as a result, an individual user’s utility level depends on the number of users that the platform has attracted thus far. Having a large user base enables a platform to have an edge over competitors without fully catering to users, which casts doubt on whether classical alignment insights apply to digital marketplaces.
The ambiguous role of competition in digital marketplaces—which falls outside the scope of our classical understanding of competition power—has gained center stage in recent policymaking discourse. Indeed, several interdisciplinary policy reports (stiger19; cremer2019competition) have been dedicated to highlighting ways in which the structure of digital marketplaces fundamentally differs from that of classical markets. For example, these reports suggest that data accumulation can encourage market tipping, which leaves users particularly vulnerable to harm (as we discuss in more detail at the end of Section 1.1). Yet, no theoretical foundation has emerged to formally examine the market structure of digital marketplaces and assess potential interventions. To propel the field forward and arm policymaking discourse with technical tools, it is necessary to develop mathematically founded models to investigate competition in digital marketplaces.
1.1 Our contributions
Our work takes a step towards building a theoretical foundation for studying competition in digital marketplaces. We present a framework for studying platforms that compete on the basis of learning algorithms, focusing on alignment with user utility at equilibrium. We consider a stylized duopoly model based on a multi-armed bandit problem where user utility depends on the incurred rewards. We show that competition may no longer perfectly align market outcomes with user utility. Nonetheless, we find that market outcomes exhibit a weaker form of alignment: the user utility is at least as large as the optimal utility in a population with only one user. Interestingly, there can be multiple equilibria, and the gap between the best and worst equilibria can be substantial.
Model.
We consider a market with two platforms and a population of users. Each platform selects a bandit algorithm from a given class of algorithms. After the platforms commit to algorithms, each user decides which platform they wish to participate on. Each user’s utility is the (potentially discounted) cumulative reward that they receive from the bandit algorithm of the platform that they chose. Users arrive at a Nash equilibrium. (In Section 2, we discuss subtleties that arise from having multiple Nash equilibria.) Each platform’s utility is the number of users who participate on that platform, and the platforms arrive at a Nash equilibrium. The platforms either maintain separate data repositories about the rewards of their own users, or the platforms maintain a shared data repository about the rewards of all users.
Alignment results.
To formally consider alignment, we introduce a metric—that we call the user quality level—that captures the utility that a user would receive when a given pair of competing bandit algorithms are implemented and user choices form an equilibrium. Table 1 summarizes the alignment results in the case of a single user and multiple users. A key quantity that appears in the alignment results is , which denotes the expected utility that a user receives from the algorithm when users all participate in the same algorithm.
For the case of a single user, an idealized form of alignment holds: the user quality level at any equilibrium is the optimal utility that a user can achieve within the class of algorithms . Idealized alignment holds regardless of the informational assumptions on the platform.
The nature of alignment fundamentally changes when there are multiple users. At a high level, we show that idealized alignment breaks down since the user quality level is no longer guaranteed to be the global optimum that cooperative users can achieve. Nonetheless, a weaker form of alignment holds: the user quality level never falls below the single-user optimum. Thus, the presence of other users cannot make a user worse off than if they were the only participant, but users may not be able to fully benefit from the data provided by others.
More formally, consider the setting where the platforms have separate data repositories. We show that there can be many qualitatively different Nash equilibria for the platforms. The user quality level across all equilibria spans the full range between the single-user optimum and the global optimum; i.e., any user quality level in this range is realizable in some Nash equilibrium of the platforms and its associated Nash equilibrium of the users (Theorem 2). Moreover, the user quality level at any equilibrium is contained in this range (Theorem 3). When the number of users is large, the gap between the single-user optimum and the global optimum can be significant, since the latter has access to as many observations at each time step as there are users, whereas the former has only one. The fact that the single-user optimum is realizable means that the market outcome might only exhibit a weak form of alignment. The intuition behind this result is that the performance of an algorithm is controlled not only by its efficiency in transforming information to action, but also by the level of data it has gained through its user base. Since platforms have separate data repositories, a platform can thus make up for a suboptimal algorithm by gaining a significant user base. On the other hand, the global optimal user quality level is nonetheless realizable, which suggests that equilibrium selection could be used to determine when bad equilibria arise and to nudge the marketplace towards a good equilibrium.
What if the platforms were to share data? At first glance, it might appear that with data sharing, a platform can no longer make up for a suboptimal algorithm with data, and the idealized form of alignment would be recovered. However, we construct two-armed bandit problem instances where every symmetric equilibrium for the platforms has user quality level strictly below the global optimum (Theorems 4 and 5). The mechanism for this suboptimality is that the global optimal solution requires “too much” exploration. If other users engage in their “fair share” of exploration, an individual user would prefer to explore less and free-ride off of the data obtained by other users. The platform is thus forced to explore less, which drives down the user quality level. To formalize this, we establish a connection to strategic experimentation (BH98). Nonetheless, although not every user quality level between the single-user optimum and the global optimum may be realizable, the user quality level at any symmetric equilibrium is still guaranteed to lie within this range (Theorem 7).
Connection to policy reports.
Our work provides a mathematical explanation of phenomena documented in recent policy reports (stiger19; cremer2019competition). The first phenomenon that we consider is market dominance from data accumulation. The accumulation of data has been suggested to result in winner-takes-all markets where a single player dominates and where market entry is challenging (stiger19). The data advantage of the dominant platform can lead to lower quality services and lower user utility. Theorems 2 and 3 formalize this mechanism. We show that once a platform has gained the full user base, market entry is impossible and the platform only needs to achieve weak alignment with user utility to retain its user base (see discussion in Section 4.2). The second phenomenon that we consider is the impact of shared data access. While the separate data setting captures much of the status quo of proprietary data repositories in digital marketplaces, sharing data access has been proposed as a solution to market dominance (cremer2019competition). Will shared data access deliver on its promises? Theorems 4 and 5 highlight that sharing data does not solve the alignment issues, and uncover free-riding as a mechanism for misalignment.
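Table 1: User quality level at equilibrium, by data setting and number of users.
Setting | Single user | Multiple users
Separate data repositories | optimal utility within the algorithm class | full range from the single-user optimum to the global optimum
Shared data repository | optimal utility within the algorithm class | subset of the range from the single-user optimum to the global optimum (strict subset in safe-risky arm problem)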
1.2 Related work
We discuss the relation between our work and research on competing platforms, incentivizing exploration, and strategic experimentation.
Competing platforms.
AMSW20 also examine the interplay between competition and exploration in bandit problems in a duopoly economy. They focus on platform regret, showing that platforms must both choose a greedy algorithm at equilibrium and thus illustrating that competition is at odds with regret minimization. In contrast, we take a user-centric perspective and demonstrate that competition aligns market outcomes with user utility. Interestingly, the findings in AMSW20 and our findings are not at odds: the result in AMSW20 can be viewed as alignment, since the optimal choice for a fully myopic user results in regret in the long run. Our alignment results also apply to non-myopic users and when multiple users may arrive at every round.
Outside of the bandits framework, another line of work has also studied the behavior of competing learners when users can choose between platforms. BT17; BT19 study equilibrium predictors chosen by competing offline learners in a PAC learning setup. Other work has focused on the dynamics when multiple learners apply out-of-box algorithms, showing that specialization can emerge (GZKZ21; DCRMF22) and examining the role of data purchase (KGZ22); however, these works do not consider which algorithms the learners are incentivized to choose to gain users. In contrast, we investigate equilibrium bandit algorithms chosen by online learners, each of whom aims to maximize the size of its user base. The interdependency between the platforms’ choices of algorithms, the data available to the platforms, and the users’ decisions in our model drives our alignment insights.
Other aspects of competing platforms that have been studied include competition under exogenous network effects (R09; WW14), experimentation in price competition (BV2000), dueling algorithms which compete for a single user (IKLMPT11), and measures of a digital platform’s power in a marketplace (HJM22).
Incentivizing exploration.
This line of work has examined how the availability of outside options impacts bandit algorithms. FKKK14 show that Bayesian Incentive Compatibility (BIC) suffices to guarantee that users will stay on the platform. Follow-up work (e.g., MSS15; SS21) examines which bandit algorithms are BIC. KMP13 explores the use of monetary transfers.
Strategic experimentation.
This line of work has investigated equilibria when a population of users each choose a bandit algorithm. BH98; BH00; BHSimple analyze the equilibria in a risky-safe arm bandit problem; we leverage their results in our analysis of equilibria in the shared data setting. Work on strategic experimentation (see HS17 for a survey) has also investigated exponential bandit problems (KRC15), the impact of observing actions instead of payoffs (RSV07), and the impact of cooperation (BP21).
2 Model
We consider a duopoly market with two platforms performing a multi-armed bandit learning problem and a population of users who choose between platforms. Platforms commit to bandit algorithms, and then each user chooses a single platform to participate on for the learning task.
2.1 Multiarmed bandit setting
Consider a Bayesian bandit setting where there are arms with priors . At the beginning of the game, the mean rewards of arms are drawn from the priors . These mean rewards are unknown to both the users and the platforms but are shared across the two platforms. If the user’s chosen platform recommends arm , the user receives reward drawn from a noisy distribution with mean .
Consider a class of bandit algorithms that map the information state given by the posterior distributions to an arm to be pulled. The information state is taken to be the set of posterior distributions for the mean rewards of each arm. We assume that each algorithm can be expressed as a function mapping the information state to a distribution over arms. (This assumption means that an algorithm’s choice is independent of the time step conditioned on the information state. Classical bandit algorithms such as Thompson sampling (T33), finite-horizon UCB (LR85), and the infinite-time Gittins index (G79) fit into this framework. This assumption is not satisfied by the infinite-time-horizon UCB.)
Running example: risky-safe arm bandit problem.
To concretize our results, we consider the risky-safe arm bandit problem as a running example. The noise distribution is Gaussian. The first arm is a risky arm whose prior distribution is supported on two values, one corresponding to a “low reward” and the other to a “high reward.” The second arm is a safe arm with known reward (its prior is a point mass at that reward). In this case, the information state permits a one-dimensional representation given by the posterior probability that the risky arm is high reward.
We construct a natural algorithm class as follows. Each measurable function of the posterior probability has an associated algorithm, which recommends arm 1 with the probability given by the function and arm 2 otherwise. We define the class of all randomized algorithms to be the set of all such algorithms. This class contains Thompson sampling, the Greedy algorithm, and mixtures of these algorithms with uniform exploration. We consider restrictions of the class in some results.
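The running example can be made concrete with a short sketch. The code below is illustrative only: the names `low`, `high`, `safe`, `sigma`, and `p` are hypothetical labels for the two risky-arm values, the safe reward, the Gaussian noise scale, and the posterior probability that the risky arm is high reward. It also assumes that the safe reward lies strictly between the two risky values, so that Thompson sampling pulls the risky arm with probability equal to the posterior, and it assumes that Greedy breaks ties in favor of the risky arm.

```python
import math

def posterior_update(p, reward, low, high, sigma):
    """Bayes update of p = P(risky arm is high) after one Gaussian reward
    observed from the risky arm (two-point prior on {low, high})."""
    # Log-likelihood ratio of "high" versus "low" for a Gaussian observation.
    llr = (high - low) * (2 * reward - low - high) / (2 * sigma ** 2)
    return p / (p + (1 - p) * math.exp(-llr))

def thompson(p):
    """Probability of pulling the risky arm under Thompson sampling
    (assumes low < safe < high)."""
    return p

def make_greedy(low, high, safe):
    """Greedy: pull the risky arm iff its posterior mean is at least `safe`
    (the tie-breaking rule is an assumption)."""
    def f(p):
        return 1.0 if p * high + (1 - p) * low >= safe else 0.0
    return f

def with_uniform_exploration(f, eps):
    """Mix an algorithm with uniform exploration over the two arms."""
    return lambda p: (1 - eps) * f(p) + eps * 0.5
```

Each algorithm is represented only by the probability with which it recommends the risky arm at a given posterior, which is all that the one-dimensional information state requires.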
2.2 Interactions between platforms, users, and data
The interactions between the platform and users impact the data that the platform receives for its learning task. The platform action space is a class of bandit algorithms that map an information state to an arm to be pulled. The user action space is . For , we denote by the action chosen by user .
Order of play.
The platforms commit to algorithms and respectively, and then users simultaneously choose their actions prior to the beginning of the learning task. We emphasize that user participates on platform for the full duration of the learning task. (In Appendix B.2, we discuss the assumption that users cannot switch platforms between time steps.)
Data sharing assumptions.
In the separate data repositories setting, each platform has its own (proprietary) data repository for keeping track of the rewards incurred by its own users. Platforms 1 and 2 thus have separate information states. In the shared data repository setting, the platforms share a single information state, which is updated based on the rewards incurred by users of both platforms. (In web search, recommender systems can query each other, effectively building a shared information state.)
Learning task.
The learning task is determined by the platforms’ choices of algorithms, the users’ actions, and the specifics of data sharing between platforms. At each time step:
1. Each user arrives at their chosen platform. The platform recommends an arm to that user, drawn from its algorithm applied to the platform’s current information state. (The randomness of arm selection is fully independent across users and time steps.) The user receives a noisy reward.
2. After providing recommendations to all of its users, platform 1 observes the rewards incurred by its users. Platform 2 similarly observes the rewards incurred by its users. Each platform then updates its information state with the corresponding posterior updates.
3. A platform may have access to external data that does not come from users. To capture this, we introduce background information into the model. Both platforms observe the same background information of a given quality: for each arm, the platforms observe the same realization of a noisy reward. When this observation is uninformative, we say that there is no background information. The corresponding posterior updates are then used to update the information state (the shared state in the case of shared data; both platforms’ states in the case of separate data).
In other words, platforms receive information from users (and background information), and users receive rewards based on the recommendations of the platform that they have chosen.
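The time-step loop above can be sketched as a small simulation. This is a minimal illustration under stated assumptions, not the paper's formal model: it uses the risky-safe arm example with hypothetical parameter names (`low`, `high`, `safe`, `sigma`, `prior`), represents each algorithm by a function from the posterior to the probability of recommending the risky arm, omits background information, and returns undiscounted cumulative rewards.

```python
import math
import random

def bayes_update(p, reward, low, high, sigma):
    # Posterior probability that the risky arm is high after one risky reward.
    llr = (high - low) * (2 * reward - low - high) / (2 * sigma ** 2)
    return p / (p + (1 - p) * math.exp(-llr))

def run_learning_task(f1, f2, actions, shared, horizon, prior=0.5,
                      low=0.0, high=1.0, safe=0.5, sigma=1.0, seed=0):
    """Simulate the learning task; actions[i] in {1, 2} is user i's platform."""
    rng = random.Random(seed)
    risky_mean = high if rng.random() < prior else low  # drawn once from the prior
    algos = {1: f1, 2: f2}
    state = {1: prior, 2: prior}      # one information state per platform
    totals = [0.0] * len(actions)     # cumulative (undiscounted) reward per user
    for _ in range(horizon):
        observed = {1: [], 2: []}     # informative (risky-arm) rewards per platform
        for i, platform in enumerate(actions):
            if rng.random() < algos[platform](state[platform]):
                reward = rng.gauss(risky_mean, sigma)
                observed[platform].append(reward)
            else:
                reward = rng.gauss(safe, sigma)  # safe arm reveals nothing new
            totals[i] += reward
        for platform in (1, 2):
            # Shared data: every platform sees all rewards; separate: only its own.
            rewards = observed[1] + observed[2] if shared else observed[platform]
            for r in rewards:
                state[platform] = bayes_update(state[platform], r, low, high, sigma)
    return totals, state
```

Under shared data the two information states coincide at every step, while under separate data a platform with no users never moves away from the prior.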
2.3 Utility functions and equilibrium concept
User utility is generated by rewards, while the platform utility is generated by user participation.
User utility function.
We follow the standard discounted formulation for bandit problems (e.g. (GJ79; BH98)), where the utility incurred by a user is defined by the expected (discounted) cumulative reward received across time steps. The discount factor parameterizes the extent to which agents are myopic. Let denote the utility of a user if they take action when other users take actions and the platforms choose and . For clarity, we make this explicit in the case of discrete time setup with horizon length . Let denote the arm recommended to user at time step . The utility is defined to be
where the expectation is over randomness of the incurred rewards and the algorithms. In the case of continuous time, the utility is
where denotes the discount factor and denotes the payoff received by the user. (For discounted utility, it is often standard to introduce a multiplier for normalization; see, e.g., BH98. The utility could equivalently have been defined with this multiplier without changing any of our results.) In both cases, observe that the utility function is symmetric in user actions.
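As a small illustration of the discrete-time formulation, the sketch below computes a discounted cumulative reward for one realized reward sequence. The name `beta` and the convention that the first reward is weighted by `beta ** 0` are assumptions made here for illustration; the user utility in the text is the expectation of such a quantity over the randomness of the rewards and the algorithms.

```python
def discounted_utility(rewards, beta):
    """Discounted cumulative reward of one realized reward sequence.

    Assumed convention: the reward at step t (starting from t = 0) is
    weighted by beta ** t, so beta = 0 keeps only the first reward
    (fully myopic) and beta = 1 is the undiscounted sum.
    """
    return sum((beta ** t) * r for t, r in enumerate(rewards))
```

For instance, three unit rewards with `beta = 0.5` give 1 + 0.5 + 0.25.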
The utility function implicitly differs in the separate and shared data settings, since the information state evolves differently in these two settings. When we wish to make this distinction explicit, we denote the corresponding utility functions by and .
User equilibrium concept.
We assume that after the platforms commit to algorithms and , the users end up at a pure strategy Nash equilibrium of the resulting game. More formally, let be a pure strategy Nash equilibrium for the users if for all . The existence of a pure strategy Nash equilibrium follows from the assumption that the game is symmetric and the action space has 2 elements (C04).
One subtlety is that there can be multiple equilibria in this general-sum game. For example, there are always at least 2 (pure strategy) equilibria when the platforms commit to the same algorithm: one equilibrium where all users choose the first platform, and another where all users choose the second platform. Interestingly, there can be multiple equilibria even when one platform chooses a “worse” algorithm than the other platform. We use dedicated notation for the set of pure strategy Nash equilibria for the users under a given pair of platform algorithms, and we drop the dependence on the algorithms when they are clear from the context. In Section B.1, we discuss our choice of solution concept, focusing on what the implications would have been of including mixed Nash equilibria in this set.
Platform utility and equilibrium concept.
The utility of the platform roughly corresponds to the number of users who participate on that platform. This captures that in markets for digital goods, where platform revenue is often derived from advertisement or subscription fees, the number of users serviced is a proxy for platform revenue.
Since there can be several user equilibria for a given choice of platform algorithms, we formalize platform utility by considering the worst-case user equilibrium for the platform. In particular, we define platform utility to be the minimum number of users that a platform would receive at any pure strategy equilibrium for the users. More formally, for a given pair of algorithms chosen by platform 1 and platform 2, the utility of each platform is the minimum, over the set of pure strategy Nash equilibria for the users, of the number of users whose action is that platform.
3 Formalizing the Alignment of a Market Outcome
The alignment of an equilibrium outcome for the platforms is measured by the amount of user utility that it generates. In Section 3.1, we introduce the user quality level to formalize alignment. In Section 3.2, we show an idealized form of alignment for a single user (Theorem 1). In Section 3.3, we turn to the case of multiple users and discuss benchmarks for the user quality level. In Section 3.4, we describe mild assumptions on the algorithm class that we use in our alignment results for multiple users.
3.1 User quality level
Given a pair of platform algorithms, we introduce the following metric to measure the alignment between platform algorithms and user utility. We again take a worst-case perspective and define the user quality level to be the minimum utility that any user can receive at any pure strategy equilibrium for users.
Definition 1 (User quality level).
Given algorithms and chosen by the platforms, the user quality level is defined to be .
Since the utility function is symmetric and is the class of all pure strategy equilibria, we can equivalently define as for any .
To simplify notation in our alignment results, we introduce the reward function which captures how the utility that a given algorithm generates changes with the number of users who contribute to its data repository. For an algorithm , let the reward function be defined by:
where corresponds to a vector with coordinates equal to one.
3.2 Idealized alignment result: The case of a single user
When there is a single user, the platform algorithms turn out to be perfectly aligned with user utilities at equilibrium. To formalize this, we consider the optimal utility that could be obtained by a user across any choice of actions by the platforms and users (not necessarily at equilibrium). Using the setup of the single-user game, we can see that this is equal to the maximum, over algorithms in the class, of the utility generated when that algorithm serves the single user. We show that the user quality level always meets this benchmark (we defer the proof to Appendix C).
Theorem 1.
Suppose that , and consider either the separate data setting or the shared data setting. If is a pure strategy Nash equilibrium for the platforms, then the user quality level is equal to .
Theorem 1 shows that in a single-user market, two firms are sufficient to perfectly align firm actions with user utility. This stands in parallel to classical Bertrand competition in the pricing setting (bertrand).
Proof sketch of Theorem 1.
There are only 2 possible pure strategy equilibria for the user: either the user chooses platform 1 and receives the utility generated by platform 1’s algorithm, or the user chooses platform 2 and receives the utility generated by platform 2’s algorithm. If one platform chooses a suboptimal algorithm for the user (i.e., an algorithm whose utility for the user is strictly below the optimum), then the other platform will receive the user (and thus achieve utility 1) if it chooses an optimal algorithm. This means that a pair of algorithms is a pure strategy Nash equilibrium for the platforms if and only if at least one of the two algorithms is optimal for the user. The user thus receives the optimal utility. We defer the full proof to Appendix C.
3.3 Benchmarks for user quality level
In the case of multiple users, this idealized form of alignment turns out to break down, and formalizing alignment requires a more nuanced consideration of benchmarks. We define the single-user optimal utility of the algorithm class to be the maximal possible user utility that can be generated by a platform that serves only a single user and thus relies on this user for all of its data. On the other hand, we define the global optimal utility of the algorithm class to be the maximal possible user utility that can be generated by a platform when all of the users in the population are forced to participate on the same platform. The platform can thus maximally enrich its data repository in each time step.
3.4 Assumptions on
While our alignment results for a single user applied to arbitrary algorithm classes, we require mild assumptions on in the case of multiple users to endow the equilibria with basic structure.
Information monotonicity requires that an algorithm ’s performance in terms of user utility does not worsen with additional posterior updates to the information state. Our first two instantiations of information monotonicity—strict information monotonicity and information constantness—require that the user utility of grow monotonically in the number of other users participating in the algorithm. Our third instantiation of information monotonicity—side information monotonicity—requires that the user utility of not decrease if other users also update the information state, regardless of what algorithm is used by the other users. We formalize these assumptions as follows:
Assumption 1 (Information monotonicity).
For any given discount factor and number of users , an algorithm is strictly information monotonic if is strictly increasing in for . An algorithm is information constant if is constant in for . An algorithm is side information monotonic if for every measurable function mapping information states to distributions over and for every , it holds that where has all coordinates equal to .
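The first two conditions in Assumption 1 can be checked numerically once the reward function of an algorithm has been estimated. The sketch below is illustrative only: `R` is a hypothetical callable giving the utility generated when `n` users contribute to the algorithm's data repository, and `tol` is an assumed numerical tolerance. Side information monotonicity is not covered, since it quantifies over the behavior of other users' algorithms.

```python
def classify_information_monotonicity(R, N, tol=1e-12):
    """Classify a reward function R(n), n = 1..N, as strictly increasing,
    constant, or neither (up to the tolerance tol)."""
    values = [R(n) for n in range(1, N + 1)]
    diffs = [b - a for a, b in zip(values, values[1:])]
    if all(d > tol for d in diffs):
        return "strictly information monotonic"
    if all(abs(d) <= tol for d in diffs):
        return "information constant"
    return "neither"
```

With `N = 1` there are no differences to compare, so the first branch holds vacuously.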
While information monotonicity places assumptions on each algorithm in , our next assumption places a mild restriction on how the utilities generated by algorithms in relate to each other. Utility richness requires that the set of user utilities spanned by is a sufficiently rich interval.
Assumption 2 (Utility richness).
A class of algorithms is utility rich if the set of utilities is a contiguous set, the supremum of is achieved, and there exists such that .
These assumptions are satisfied for natural bandit setups, as we show in Section 6.
4 Separate data repositories
We investigate alignment when the platforms have separate data repositories. In Section 4.1, we show that there can be many qualitatively different equilibria for the platforms and characterize the alignment of these equilibria. In Section 4.2, we discuss factors that drive the level of misalignment in a marketplace.
4.1 Multitude of equilibria and the extent of alignment
In contrast with the single-user setting, the marketplace can exhibit multiple equilibria for the platforms. As a result, to investigate alignment, we investigate the range of achievable user quality levels. Our main finding is that the equilibria in a given marketplace can exhibit a vast range of alignment properties. In particular, every user quality level in between the single-user optimal utility and the global optimal utility can be realized by some equilibrium for the platforms.
Theorem 2.
Nonetheless, there is a baseline (although somewhat weak) form of alignment achieved by all equilibria. In particular, every equilibrium for the platforms has user quality level at least the single-user optimum.
Theorem 3.
Suppose that each algorithm in is either strictly information monotonic or information constant (see Assumption 1). In the separate data setting, at any pure strategy Nash equilibrium for the platforms, the user quality level lies in the following interval:
An intuition for these results is that the performance of an algorithm depends not only on how it transforms information to actions, but also on the amount of information to which it has access. A platform can make up for a suboptimal algorithm by attracting a significant user base: if a platform starts with the full user base, it is possible that no single user will switch to the competing platform, even if the competing platform chooses a strictly better algorithm. However, if a platform’s algorithm is highly suboptimal, then the competing platform will indeed be able to win the full user base.
Proof sketch of Theorem 2 and Theorem 3
The key idea is that pure strategy equilibria for users take a simple form. Under strict information monotonicity, we show that every pure strategy equilibrium is in the set (Lemma 12). The intuition is that the user utility strictly grows with the amount of data that the platform has, which in turn grows with the number of other users participating on the same platform. It is often better for a user to switch to the platform with more users, which drives all users to a single platform in equilibrium.
The reward functions and determine which of these two solutions are in . It follows from definition that is in if and only if . This inequality can hold even if is a better algorithm in the sense that for all . The intuition is that the performance of an algorithm is controlled not only by its efficiency in choosing the possible action from the information state, but also by the size of its user base. The platform with the worse algorithm can be better for users if it has accrued enough users.
This characterization of the set enables us to reason about the platform equilibria. To prove Theorem 2, we show that is an equilibrium for the platforms as long as . This, coupled with utility richness, enables us to show that every utility level in can be realized. To prove Theorem 3, we first show that the platforms cannot both choose highly suboptimal algorithms: in particular, if and are both below the single-user optimal , then is not in equilibrium. Moreover, if one of the platforms chooses an algorithm where , then all of the users will choose the other platform in equilibrium. The full proofs are deferred to Appendix D.
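The structure used in this proof sketch can be explored numerically in the separate data setting. The sketch below is an illustration under assumptions: `R1` and `R2` are hypothetical callables giving the utility that each platform's algorithm generates for a user when `n` users participate on that platform, and a user profile is summarized by the number `k` of users on platform 1 (which suffices because the utility function is symmetric in user actions).

```python
def user_equilibria(R1, R2, N):
    """Counts k of users on platform 1 that form a pure-strategy user equilibrium."""
    eq = []
    for k in range(N + 1):
        # A user on platform 1 must not gain by joining the N - k users on platform 2.
        stay1 = k == 0 or R1(k) >= R2(N - k + 1)
        # A user on platform 2 must not gain by joining the k users on platform 1.
        stay2 = k == N or R2(N - k) >= R1(k + 1)
        if stay1 and stay2:
            eq.append(k)
    return eq

def platform_utilities(R1, R2, N):
    """Worst-case number of users for each platform over all user equilibria."""
    eq = user_equilibria(R1, R2, N)
    return min(eq), min(N - k for k in eq)

def user_quality_level(R1, R2, N):
    """Minimum utility received by any user at any pure-strategy user equilibrium."""
    levels = []
    for k in user_equilibria(R1, R2, N):
        if k > 0:
            levels.append(R1(k))
        if k < N:
            levels.append(R2(N - k))
    return min(levels)
```

For example, with `R1 = lambda n: n`, `R2 = lambda n: 2 * n`, and `N = 3`, the only equilibria are `k = 0` and `k = 3`: platform 2's algorithm is better at every user count, yet all users staying on platform 1 is still an equilibrium because `R1(3) >= R2(1)`.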
4.2 What drives the level of misalignment in a marketplace?
The existence of multiple equilibria makes it more subtle to reason about the alignment exhibited by a marketplace. The level of misalignment depends on two factors: first, the size of the range of realizable user quality levels, and second, the selection of equilibrium within this range. We explore each of these factors in greater detail.
How large is the range of possible user quality levels?
Both the algorithm class and the structure of the user utility function determine the size of the range of possible user quality levels. We informally examine the role of the user’s discount factor on the size of this range.
First, consider the case where users are fully non-myopic (so their rewards are undiscounted across time steps). The gap between the single-user optimal utility and global optimal utility can be substantial. To gain intuition for this, observe that the utility generated when all users participate corresponds to the algorithm receiving as many observations at every time step as there are users, whereas the utility generated for a single user corresponds to a single observation per time step. For example, consider an algorithm whose regret grows according to where is the number of samples collected, and let OPT be the maximum achievable reward. Since utility and regret are related up to additive factors for fully non-myopic users, we have that while .
At the other extreme, consider the case where users are fully myopic. In this case, the range collapses to a single point. The intuition is that the algorithm generates the same utility for a user regardless of the number of other users who participate: in particular, is equal to for any algorithm . To see this, we observe that the algorithm’s behavior beyond the first time step does not factor into user utility, and the algorithm’s selection at the first time step is determined before it receives any information from users. Put differently, although can receive times more information, there is a delay before the algorithm sees this information. Thus, in the case of fully myopic users, the user quality level is always equal to the global optimal user utility so idealized alignment is actually recovered. When users are partially nonmyopic, the range is no longer a single point, but the range is intuitively smaller than in the undiscounted case.
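The collapse of the range for myopic users can be illustrated with a small numerical sketch. Everything here is a hypothetical stand-in: the per-step reward curve, the horizon, and the geometric weight `weight ** t` (where `weight = 0` plays the role of a fully myopic user and `weight = 1` of an undiscounted one) are assumptions of the sketch and need not match the discount-factor parameterization used in the setups of Appendix A.

```python
import math

def weighted_utility(weight, n_users, horizon=50, opt=1.0):
    """Toy utility: sum over t of weight**t times the reward at step t.

    Assumption: the reward at step t depends only on data from earlier
    steps, so at t = 0 it is the same for every number of users.
    """
    total = 0.0
    for t in range(horizon):
        samples_seen = n_users * t  # information arrives with a one-step delay
        reward = opt - 1.0 / math.sqrt(1 + samples_seen)
        total += (weight ** t) * reward
    return total

# Gap between a platform with 100 users and one with a single user.
gaps = {
    w: weighted_utility(w, 100) - weighted_utility(w, 1)
    for w in (0.0, 0.5, 0.9, 1.0)
}
for w, g in gaps.items():
    print(f"weight={w}: gap={g:.4f}")
```

In this toy model the gap is exactly zero for the myopic weight and grows as later time steps receive more weight, mirroring the qualitative picture described above.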
Which equilibrium arises in a marketplace?
When the gap between the singleuser optimal and global optimal utility levels is substantial, it becomes ambiguous what user quality level will be realized in a given marketplace. Which equilibrium arises in a marketplace depends on several factors.
One factor is the secondary aspects of the platform objective that aren’t fully captured by the number of users. For example, suppose that the platform cares about its reputation and thus is incentivized to optimize for the quality of the service. This could drive the marketplace towards higher user quality levels. On the other hand, suppose that the platform derives other sources of revenue generated from recommending content depending on who created the content. If these additional sources of revenue are not aligned with user utility, then this could drive the marketplace towards lower user quality levels.
Another factor is the mechanism under which platforms arrive at equilibrium solutions, such as market entry. We informally show that market entry can result in the worst possible user utility within the range of realizable levels. To see this, notice that when one platform enters the marketplace shortly before another platform, all of the users will initially choose the first platform. The second platform will win over users only if , where denotes the algorithm of the second platform and denotes the algorithm of the first platform. In particular, the platform is susceptible to losing users only if . Thus, the worst possible equilibrium can arise in the marketplace, and this problem only worsens if the first platform enters early enough to accumulate data beforehand. This finding provides a mathematical backing for the barriers to entry in digital marketplaces that are documented in policy reports (stiger19).
This finding points to an interesting direction for future work: what equilibria arise from other natural mechanisms?
5 Shared data repository
What happens when data is shared between the platforms? We show that both the nature of alignment and the forces that drive misalignment fundamentally change. In Section 5.1, we show a construction where the user quality levels do not span the full set . Despite this, in Section 5.2, we establish that the user quality level at any symmetric equilibrium continues to be at least .
5.1 Construction where global optimal is not realizable
In contrast with the separate data setting, the set of user quality levels at symmetric equilibria for the platforms does not necessarily span the full set . To demonstrate this, we show that in the riskysafe arm problem, every symmetric equilibrium has user quality level strictly below .
Theorem 4.
Let the algorithm class consist of the algorithms where , , and is continuous at and . In the shared data setting, for any choice of prior and any background information quality , there exists an undiscounted riskysafe arm bandit setup (see Setup 1) such that the set of realizable user quality levels for algorithm class is equal to a singleton set:
where
Theorem 5.
In the shared data setting, for any discount factor and any choice of prior , there exists a discounted riskysafe arm bandit setup with no background information (see Setup 2) such that the set of realizable user quality levels for algorithm class is equal to a singleton set:
where
Theorems 4 and 5 illustrate examples where there is no symmetric equilibrium for the platforms that realizes the global optimal utility —regardless of whether users are fully nonmyopic or have discounted utility. These results have interesting implications for shared data access as an intervention in digital marketplace regulation (e.g. see cremer2019competition). At first glance, it would appear that data sharing would resolve the alignment issues, since it prevents platforms from gaining market dominance through data accumulation. However, our results illustrate that the platforms may still not align their actions with user utility at equilibrium.
Comparison of separate and shared data settings.
To further investigate the efficacy of shared data access as a policy intervention, we compare alignment when the platforms share a data repository to alignment when the platforms have separate data repositories, highlighting two fundamental differences. We focus on the undiscounted setup (Setup 1) analyzed in Theorem 4; in this case, the algorithm class satisfies information monotonicity and utility richness (see Lemma 8) so the results in Section 4.1 are also applicable. (Footnote 5: In the discounted setting, not all of the algorithms in necessarily satisfy the information monotonicity requirements used in the alignment results for the separate data setting. Thus, Theorem 5 cannot be used to directly compare the two settings.) The first difference in the nature of alignment is that there is a unique symmetric equilibrium for the shared data setting, which stands in contrast to the range of equilibria that arose in the separate data setting. Thus, while the particularities of equilibrium selection significantly impact alignment in the separate data setting (see Section 4.2), these particularities are irrelevant from the perspective of alignment in the shared data setting.
The second difference is that the user quality level of the symmetric equilibrium in the shared data setting is in the interior of the range of user quality levels exhibited in the separate data setting. The alignment in the shared data setting is thus strictly better than the alignment of the worst possible equilibrium in the separate data setting. Thus, if we take a pessimistic view of the separate data setting, assuming that the marketplace exhibits the worst possible equilibrium, then data sharing does help users. On the other hand, the alignment in the shared data setting is also strictly worse than the alignment of the best possible equilibrium in the separate data setting. This means that if we instead take an optimistic view of the separate data setting, and assume that the marketplace exhibits this bestcase equilibrium, then data sharing is actually harmful for alignment. In other words, when comparing data sharing and equilibrium selection as regulatory interventions, data sharing is worse for users than maintaining separate data and applying an equilibrium selection mechanism that shifts the market towards the best equilibria.
Mechanism for misalignment.
Perhaps counterintuitively, the mechanism for misalignment in the shared data setting is that a platform must perfectly align its choice of algorithm with the preferences of a user (given the choices of other users). In particular, the algorithm that is optimal for one user given the actions of other users is different from the algorithm that would be optimal if the users were to cooperate. This is because exploration is costly to users, so users don’t want to perform their fair share of exploration, and would rather freeride off of the exploration of other users. As a result, a platform that chooses an algorithm with the global optimal strategy cannot maintain its user base. We formalize this phenomenon by establishing a connection with strategic experimentation, drawing upon the results of BH98; BH00; BHSimple (see Appendix E.2 for a recap of the relevant results).
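The freeriding incentive can be seen in a deliberately simplified stand-in for the strategic experimentation game. This sketch is not the bandit game analyzed in the paper: the square-root benefit of pooled exploration, the linear private cost, the constants, and the function names are all hypothetical choices made only to exhibit the gap between the cooperative and equilibrium exploration levels.

```python
import math

N_PLAYERS = 4
COST = 2.0                                  # assumed private cost per unit of exploration
GRID = [k / 256 for k in range(257)]        # candidate exploration levels in [0, 1]

def utility(own, others_total):
    """Shared benefit of pooled exploration minus the player's own cost."""
    return math.sqrt(own + others_total) - COST * own

def best_response(others_total):
    """Exploration level on the grid that maximizes a single player's utility."""
    return max(GRID, key=lambda e: utility(e, others_total))

def symmetric_equilibria():
    """Levels e such that e is a best response when all other players play e."""
    return [e for e in GRID if best_response((N_PLAYERS - 1) * e) == e]

def cooperative_level():
    """Common level that maximizes each player's utility if everyone adopts it."""
    return max(GRID, key=lambda e: utility(e, (N_PLAYERS - 1) * e))

eq = symmetric_equilibria()
coop = cooperative_level()
print("symmetric equilibrium levels:", eq)
print("cooperative level:", coop)
print("best response to cooperators:", best_response((N_PLAYERS - 1) * coop))
```

In this toy model, a player facing cooperating peers prefers to stop exploring altogether, so the cooperative level is not an equilibrium, and the unique symmetric equilibrium yields lower utility than cooperation would.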
Proof sketches of Theorem 4 and Theorem 5.
The key insight is that the symmetric equilibria of our game are closely related to the equilibria of the following game . Let be an player game where each player chooses an algorithm in within the same bandit problem setup as in our game. The players share an information state corresponding to the posterior distributions of the arms. At each time step, all of the users arrive at the platform, player pulls the arm drawn from , and the players all update . The utility received by a player is given by their discounted cumulative reward.
We characterize the symmetric equilibria of the original game for the platforms.
Lemma 6.
The solution is in equilibrium if and only if is a symmetric pure strategy equilibrium of the game described above.
Moreover, the user quality level is equal to , which is also equal to the utility achieved by players in when they all choose action .
In the game , the global optimal algorithm corresponds to the solution when all players cooperate rather than arriving at an equilibrium. Intuitively, all of the players choosing is not an equilibrium because exploration comes at a cost to utility, and thus players wish to “freeride” off of the exploration of other players. The value corresponds to the cooperative maximal utility that can be obtained by the players.
To show Theorem 5, it suffices to analyze the structure of the equilibria of . Interestingly, BH98; BH00; BHSimple studied, in the context of strategic experimentation, a game very similar to instantiated in the riskysafe arm bandit problem with algorithm class . We provide a recap of the relevant aspects of their results and analysis in Appendix E.2. At a high level, they showed that there is a unique symmetric pure strategy equilibrium and showed that the utility of this equilibrium is strictly below the global optimal. We can adapt this analysis to conclude that the equilibrium player utility in is strictly below . The full proof is deferred to Appendix E.
5.2 Alignment theorem
Although not all values in can be realized, we show that the user quality level at any symmetric equilibrium is always at least .
Theorem 7.
Suppose that every algorithm in is side information monotonic (Assumption 1). In the shared data setting, at any symmetric equilibrium , the user quality level is in the interval .
Theorem 7 demonstrates that the freeriding effect described in Section 5.1 cannot drive the user quality level below the singleuser optimal. Recall that the singleuser optimal is also a lower bound on the user quality level for the separate data setting (see Theorem 3). This means that regardless of the assumptions on data sharing, the market outcome exhibits a weak form of alignment where the user quality level is at least the singleuser optimal.
Proof sketch of Theorem 7.
We again leverage the connection to the game described in the proof sketch of Theorem 5. The main technical step is to show that at any symmetric pure strategy equilibrium , the player utility is at least (Lemma 16). Intuitively, since is a best response for each player, they must receive no more utility by choosing . The utility that they would receive from playing if there were no other players in the game is . The presence of other players can be viewed as background updates to the information state, and the information monotonicity assumption on guarantees that these updates can only improve the player’s utility in expectation. The full proof is deferred to Appendix E.
6 Algorithm classes that satisfy our assumptions
We describe several different bandit setups under which the assumptions on described in Section 3.4 are satisfied.
Discussion of information monotonicity (Assumption 1).
We first show that in the undiscounted, continuoustime, riskysafe arm bandit setup, the information monotonicity assumptions are satisfied for essentially any algorithm.
Lemma 8.
Consider the undiscounted, continuoustime riskysafe arm bandit setup (see Setup 1). Any algorithm satisfies strict information monotonicity and side information monotonicity.
While the above result focuses on undiscounted utility, we also show that information monotonicity can be achieved with discounting. In particular, we show that our form of information monotonicity is satisfied by ThompsonSampling (proof is deferred to Appendix F).
Lemma 9.
For the discretetime riskysafe arm bandit problem with finite time horizon, prior , users, and no background information (see Setup 3), ThompsonSampling is strictly information monotonic and side information monotonic for any discount factor .
In fact, we actually show in the proof of Lemma 9 that the ThompsonSampling algorithm that explores uniformly with probability and applies ThompsonSampling with probability also satisfies strict information monotonicity and side information monotonicity.
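As an illustration of the kind of algorithm discussed here, the sketch below mixes uniform exploration with Thompson sampling in a two-state riskysafe arm model. The Bernoulli reward model, the success probabilities 0.8 and 0.2, the assumption that the safe reward lies between the two risky means, and all function names are hypothetical choices for the sketch rather than details taken from Setup 3.

```python
import random

def risky_arm_probability(posterior_high, eps=0.0):
    """Probability of pulling the risky arm.

    With eps = 0 this is Thompson sampling in a two-state model: sample a
    state from the posterior and pull the risky arm iff the sample is 'high'.
    With probability eps the arm is instead chosen uniformly at random.
    """
    return eps * 0.5 + (1.0 - eps) * posterior_high

def bayes_update(posterior_high, reward, p_high=0.8, p_low=0.2):
    """Posterior that the risky arm is 'high' after one Bernoulli reward."""
    like_high = p_high if reward == 1 else 1.0 - p_high
    like_low = p_low if reward == 1 else 1.0 - p_low
    numerator = posterior_high * like_high
    return numerator / (numerator + (1.0 - posterior_high) * like_low)

def simulate(n_users, horizon, eps, prior=0.5, risky_is_high=True, seed=0):
    """Run one platform whose users all feed a shared information state."""
    rng = random.Random(seed)
    posterior = prior
    true_mean = 0.8 if risky_is_high else 0.2
    pulls = 0
    for _ in range(horizon):
        pull_prob = risky_arm_probability(posterior, eps)
        observed = []
        for _ in range(n_users):
            if rng.random() < pull_prob:      # this user pulls the risky arm
                observed.append(1 if rng.random() < true_mean else 0)
        for reward in observed:               # update after all users have acted
            posterior = bayes_update(posterior, reward)
        pulls += len(observed)
    return posterior, pulls

final_posterior, risky_pulls = simulate(n_users=10, horizon=20, eps=0.1)
print(final_posterior, risky_pulls)
```

Only pulls of the risky arm are informative in this model, so the posterior moves faster when more users participate or when the exploration probability is larger.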
These information monotonicity assumptions become completely unrestrictive for fully myopic users, where user utility is fully determined by the algorithm’s performance at the first time step, before any information updates are made. In particular, any algorithm is information constant and sideinformation monotonic.
We note that a conceptually similar variant of information monotonicity was studied in previous work on competing bandits AMSW20. Since AMSW20 focused on a setting where a single myopic user arrives at every time step, they require a different information monotonicity assumption, that they call Bayes monotonicity. (An algorithm satisfies Bayes monotonicity if its expected reward is nondecreasing in time.) Bayes monotonicity is strictly speaking incomparable to our information monotonicity assumptions; in particular, Bayes monotonicity does not imply either strict information monotonicity or side information monotonicity.
Discussion of utility richness (Assumption 2).
At an intuitive level, as long as the algorithm class reflects a range of exploration levels, it will satisfy utility richness.
We first show that in the undiscounted setup in Theorem 4, the algorithm class satisfies utility richness (proof in Appendix F).
Lemma 10.
Consider the undiscounted, continuoustime riskysafe arm bandit setup (see Setup 1). The algorithm class satisfies utility richness.
Since the above result focuses on a particular bandit setup, we also describe a general operation to transform an algorithm class into one that satisfies utility richness. In particular, the closure of an algorithm class under mixtures with uniformly random exploration satisfies utility richness (proof in Appendix F).
Lemma 11.
Consider any discretetime setup with finite time horizon and bounded mean rewards. For , let be the algorithm that chooses an arm at random with probability . Suppose that the reward of every algorithm is at least (the reward of uniform exploration), and suppose that the supremum of is achieved. Then, the algorithm class satisfies utility richness.
Example classes that achieve information monotonicity and utility richness.
Together, the results above provide two natural bandit setups that satisfy strict information monotonicity, side information monotonicity, and utility richness.

• The algorithm class in the undiscounted, continuoustime riskysafe arm bandit setup with any users (see Setup 1).
• The class of Thompson sampling algorithms in the discrete time riskysafe arm bandit setup with discount factor , users, and no background information (see Setup 3).
These setups, which span the full range of discount factors, provide concrete examples where our alignment results are guaranteed to apply.
An interesting direction for future work would be to provide a characterization of algorithm classes that satisfy these assumptions (especially information monotonicity).
7 Discussion
Towards investigating competition in digital marketplaces, we present a framework for analyzing competition between two platforms performing multiarmed bandit learning through interactions with a population of users. We propose and analyze the user quality level as a measure of the alignment of market equilibria. We show that unlike in typical markets of products, competition in this setting does not perfectly align market outcomes with user utilities, both when the platforms maintain separate data repositories and when the platforms maintain a shared data repository.
Our framework further allows us to compare the separate and shared data settings, and we show that the nature of misalignment fundamentally depends on the data sharing assumptions. First, different mechanisms drive misalignment: when platforms have separate data repositories, the suboptimality of an algorithm can be compensated for with a larger user base; when the platforms share data, a platform can’t retain its user base if it chooses the global optimal algorithm since users wish to freeride off of the exploration of other users. Another aspect that depends on the data sharing assumptions is the specific form of misalignment exhibited by market outcomes. The set of realizable user quality levels ranges from the singleuser optimal to the global optimal in the separate data setting; on the other hand, in the shared data setting, neither of these endpoints may be realizable. These differences suggest that data sharing performs worse as a regulatory intervention than a welldesigned equilibrium selection mechanism.
More broadly, our work reveals that competition has subtle consequences for users in digital marketplaces that merit further inquiry. We hope that our work provides a starting point for building a theoretical foundation for investigating competition and designing regulatory interventions in digital marketplaces.
8 Acknowledgments
We would like to thank Yannai Gonczarowski, Erik Jones, Rad Niazadeh, Jacob Steinhardt, Nilesh Tripuraneni, Abhishek Shetty, and Alex Wei for helpful comments on the paper. This work is supported in part by the National Science Foundation under grant CCF2145898, the Mathematical Data Science program of the Office of Naval Research under grant number N000141812764, the Vannevar Bush Faculty Fellowship program under grant number N000142112941, a C3.AI Digital Transformation Institute grant, the Paul and Daisy Soros Fellowship, and the Open Phil AI Fellowship.
References
Appendix A Example bandit setups
We consider the following riskysafe arm setups in our results. The first setup is a riskysafe arm bandit setup in continuous time, where user rewards are undiscounted.
Setup 1 (Undiscounted, continuous time riskysafe arm setup).
Consider a riskysafe arm bandit setup where the algorithm class is
The bandit setup is in continuous time: if a platform chooses algorithm , then at a given time step with information state , the user of that platform devotes a fraction of the time step to the risky arm and the remainder of the time step to the safe arm. Let the prior be initialized so . Let the rewards be such that the fullinformation payoff . Let the background information quality be . Let the time horizon be infinite, and suppose the user utility is undiscounted. (Footnote 6: Formally, this means that the user utility is the limit as the time horizon goes to infinity, or alternatively the limit as the discount factor vanishes. See BH00 for a justification that these limits are welldefined.)
The next setup is again a riskysafe arm bandit setup in continuous time, but this time with discounted rewards.
Setup 2 (Discounted, continuous time riskysafe arm setup).
Consider a riskysafe arm bandit setup where the algorithm class is . The bandit setup is in continuous time: if a platform chooses algorithm , then at a given time step with information state , the user of that platform devotes a fraction of the time step to the risky arm and the remainder of the time step to the safe arm. Let the high reward be , the low reward be , and let the prior be initialized to some where is the safe arm reward. Let the time horizon be infinite, suppose that there is no background information , and suppose the user utility is discounted with discount factor .
Finally, we consider another discounted riskysafe bandit setup, but this time with discrete time and finite time horizon.
Setup 3 (Discrete, riskysafe arm setup).
Consider a riskysafe arm bandit setup where the algorithm class is , where denotes the Thompson sampling algorithm given by . The bandit setup is in discrete time: if a platform chooses algorithm , then at a given time step with information state , the user of that platform chooses the risky arm with probability and the safe arm with probability . Let the time horizon be finite, suppose that the user utility is discounted with discount factor , that there is no background information , and the prior be initialized to
Appendix B Further details about the model choice
We examine two aspects of our model (the choice of equilibrium set and the action space of users) in greater detail.
B.1 What would change if users can play mixed strategies?
Suppose that were defined to be the set of all equilibria for the users, rather than only pure strategy equilibria. The main difference is that all users might no longer choose the same platform at equilibrium, which would change the nature of the set . In particular, even when both platforms choose the same algorithm , there is a symmetric mixed equilibrium where all users randomize equally between the two platforms. At this mixed equilibrium, the utility of the users is
, since the number of users at each platform would follow a binomial distribution. This quantity might be substantially lower than
depending on the nature of the bandit algorithms. As a result, the user quality level , which is measured by the worst equilibrium for the users in , could be substantially lower than . Moreover, the condition for to be an equilibrium for the platforms would still be that , so there could exist a platform equilibrium with user quality level much lower than . Intuitively, the introduction of mixtures corresponds to users no longer coordinating between their choices of platforms; this leads to no single platform accumulating all of the data, thus lowering user utility.
B.2 What would change if users could change platforms at each round?
Our model assumes that users choose a platform at the beginning of the game which they participate on for the duration of the game. In this section, we examine this assumption in greater detail, informally exploring what would change if the users could switch platforms.
First, we provide intuition that in the shared data setting, there would be no change in the structure of the equilibrium as long as the algorithm class is closed under mixtures (i.e. if , then the algorithm that plays with probability and with probability must be in ). A natural model for users switching platforms would be that users see the public information state at every round and choose a platform based on this information state (and the algorithms of the platforms). A user’s strategy is thus a mapping from an information state to , and the platform would receive utility for a user depending on the fraction of time that they spend on that platform. Suppose that symmetric (mixed) equilibria for users are guaranteed to exist for any choice of platform algorithms, and define the platform’s utility by the minimal number of (fractional) users that they receive at any symmetric mixed equilibrium. In this model, we again see that is a symmetric equilibrium for the platforms if and only if is a symmetric pure strategy equilibrium in the game defined in Section 5.1. (To see this, note that if is not a symmetric pure strategy equilibrium, then the platform can achieve higher utility by choosing that is a deviation for a player in the game . If is a symmetric pure strategy equilibrium, then ). Thus, the alignment results will remain the same.
In the separate data setting, even defining a model where users can switch platforms is more subtle since it is unclear how the information state of the users should be defined. One possibility would be that each user keeps track of their own information state based on the rewards that they observe. Studying the resulting equilibria would require reasoning about the evolution of user information states and furthermore may not capture practical settings where users see the information of other users. Given these challenges, we defer the analysis of users switching platforms in the case of separate data to future work.
Appendix C Proof of Theorem 1
We prove Theorem 1.
Proof of Theorem 1.
We split into two cases: (1) either or , and (2) or .
Case 1: or .
We show that is an equilibrium.
Suppose first that and . We see that the strategies and , where the user chooses platform 1, is in the set of equilibria . This means that . Suppose that platform 1 chooses another algorithm . Since , we see that is still an equilibrium. Thus, . This implies that is a best response for platform 1, and an analogous argument shows is a best response for platform 2. When the platforms choose , at either of the user equilibria or , the user utility is . Thus .
Now, suppose that exactly one of and holds. WLOG, suppose . Since , we see that . On the other hand, . This means that and . Thus, is a best response for platform 1 trivially because for all by definition. We next show that is a best response for platform 2. If platform 2 plays another algorithm , then will still be in equilibrium for the users since platform 1 offers the maximum possible utility. Thus, , and is a best response for platform 2. When the platforms choose , the only user equilibrium is where the user utility is . Thus .
Case 2: or .
It suffices to show that is not an equilibrium. WLOG, suppose that . We see that . Thus, . However, if platform 2 switches to , then is equal to and so . This means that is not a best response for platform 2, and thus is not an equilibrium. ∎
Appendix D Proofs for Section 4
In the proofs of Theorems 2 and 3, the key technical ingredient is that pure strategy equilibria for users take a simple form. In particular, under strict information monotonicity, we show that in every pure strategy equilibrium , all of the users choose the same platform.
Lemma 12.
Suppose that every algorithm is either strictly information monotonic or information constant (see Assumption 1). For any choice of platform algorithms such that at least one of and is strictly information monotonic, it holds that:
Proof.
WLOG, assume that is strictly information monotonic. Assume for sake of contradiction that the user strategy profile (with users choosing platform 1 and users choosing platform 2) is in . Since is an equilibrium, a user choosing platform 1 does not want to switch to platform 2. The utility that they currently receive is and the utility that they would receive from switching is , so this means:
Similarly, since is an equilibrium, a user choosing platform 2 does not want to switch to platform 1. The utility that they currently receive is and the utility that they would receive from switching is , so this means:
Putting this all together, we see that:
which is a contradiction since is either strictly information monotonic or information constant. ∎
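The conclusion of Lemma 12 can be checked by brute force on small examples. The sketch below enumerates every split of the users across the two platforms and keeps the splits from which no user wants to switch. The two utility curves and the function names are hypothetical; the only modeling assumption carried over is that a user's utility depends on the platform's algorithm and on the number of users at that platform.

```python
import math

def pure_user_equilibria(r1, r2, n_users):
    """Return all splits (n1, n2) that are pure strategy equilibria for users.

    r1(n) and r2(n) give a user's utility on platform 1 or 2 when that
    platform has n users in total.
    """
    equilibria = []
    for n1 in range(n_users + 1):
        n2 = n_users - n1
        # A user on platform 1 compares staying against joining platform 2.
        stay_1 = n1 == 0 or r1(n1) >= r2(n2 + 1)
        # A user on platform 2 compares staying against joining platform 1.
        stay_2 = n2 == 0 or r2(n2) >= r1(n1 + 1)
        if stay_1 and stay_2:
            equilibria.append((n1, n2))
    return equilibria

# Toy utility curves, both strictly increasing in the number of users.
# Platform 2 runs the better algorithm: it is higher at every user count.
def worse_algorithm(n):
    return 1.0 - 1.0 / math.sqrt(n)

def better_algorithm(n):
    return 1.0 - 0.6 / math.sqrt(n)

result = pure_user_equilibria(worse_algorithm, better_algorithm, n_users=5)
print(result)
```

In this example only the two profiles in which every user joins the same platform survive, including the one where all users sit on the platform with the worse algorithm.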
D.1 Proof of Theorem 2
We prove Theorem 2.
Proof of Theorem 2.
Since the algorithm class is utility rich (Assumption 2), we know that for any , there exists an algorithm such that . We claim that is an equilibrium and we show that .
To show that is an equilibrium, suppose that platform 1 chooses any algorithm . We claim that . To see this, notice that the utility that a user receives from choosing platform 2 is , and the utility that they would receive if they deviate to platform 1 is . By definition, we see that: