# Robust Experimentation in the Continuous Time Bandit Problem

We study the experimentation dynamics of a decision maker (DM) in a two-armed bandit setup (Bolton and Harris (1999)), where the agent holds ambiguous beliefs regarding the distribution of the return process of one arm and is certain about the other one. The DM entertains multiplier preferences à la Hansen and Sargent (2001), so we frame the decision-making environment as a two-player differential game against nature in continuous time. We characterize the DM’s value function and her optimal experimentation strategy, which turns out to follow a cut-off rule with respect to her belief process. The belief threshold for exploring the ambiguous arm is found in closed form and is shown to be increasing with respect to the ambiguity aversion index. We then study the effect of providing an unambiguous information source about the ambiguous arm. Interestingly, we show that the exploration threshold rises unambiguously as a result of this new information source, thereby leading to more conservatism. This analysis also sheds light on the efficient time to reach for an expert opinion.


## 1 Introduction

There are natural cases where experimentation must be performed in ambiguous environments, where the distribution of future shocks is unknown. For example, consider a diagnostician who has two treatments for a particular set of symptoms. One is the conventional treatment that has been widely tested and has a known success rate. Alternatively, there is a second treatment that was recently discovered and is due for further study. The diagnostician must perform a sequence of experiments on patients to figure out the success/failure rate of the new treatment. However, the adverse effects of mistreatment on certain types of patients are fatal, so the medics must consider the worst-case scenario for the patients while evaluating the new treatment. As another case, consider the R&D example of Weitzman (1979), where the research department of an organization is assigned the task of selecting one of two technologies producing the same commodity. The research division holds a prior on the savings generated by each technology, but the observations of each alternative during the experimentation stage are obfuscated by ambiguous sources such as the quality of researchers and managerial biases toward one choice. Therefore, the technology that is selected and sent to the development stage must be robust against these sources, because once developed it will be used in mass production, so even minor miscalculations in the research stage can lead to huge losses in the sales stage relative to what could have been achieved.

At the core of our paper is an experimentation process between two projects framed as a two-armed bandit problem. The return rate to one arm is known to be $r$, whereas the return rate of the second arm is a binary random variable $\theta \in \{\underline{\theta}, \bar{\theta}\}$, such that $\underline{\theta} < r < \bar{\theta}$. The decision maker (henceforth DM) holds an initial prior $p_0$, which can be updated when she invests in the second project and learns its output. At the outset, she has to sequentially choose arms to learn about the unknown return rate while maximizing her net experimentation payoff. Specifically, in our model the observations of the second arm are obfuscated by a Wiener process whose distribution, from the perspective of the DM, is unknown; this arm is therefore called the ambiguous arm. Central to the agent’s decision-making problem is her preference for robustness against a candidate set of distributions of the future shocks that conceal the ambiguous arm’s return rate. Our investigation of the multiplicity of shock distributions is motivated from both the subjective and the objective perspectives. Subjectively, the DM might be ambiguity averse, and the multiple prior set (for the shock distribution) would be part of her axiomatic utility representation (Gilboa and Schmeidler (1989)). Alternatively, the DM might be subject to an experimentation setup where the results are objectively drawn from a family of distributions, and she wants to maintain a form of robustness against this multiplicity; this is along the lines of model uncertainty pioneered by Hansen and Sargent (2001) and Hansen et al. (2006).

### 1.1 Summary of results

We frame the decision-making environment in which the DM has multiplier preferences, à la Hansen and Sargent (2001), as a two-player continuous-time differential game against nature (the second player). The DM’s goal is to find an allocation strategy between the two arms that maximizes her payoff under the distribution picked by nature. We express the (first player’s) payoff function with respect to two control processes: (i) the DM’s allocation choice process between the two arms, and (ii) nature’s adversarial choice of the underlying distribution. The DM follows the max-min strategy, namely at every point in time she chooses her allocation weights between the two arms, and then nature picks the shock distribution that minimizes the DM’s continuation payoff. We then characterize the value function (to the DM) as a solution to a certain HJBI (Hamilton-Jacobi-Bellman-Isaacs) equation.

In this game, nature’s move, i.e., the choice of the shock distribution, has two important impacts (with opposite forces) on the DM. First, it affects the current flow payoff of experimentation; second, it distorts the DM’s posterior formation and consequently her continuation strategy. In equilibrium the DM knows nature’s best-response strategy; therefore, when she Bayes-updates her belief about $\theta$, she is no longer concerned about all possible distributions of shocks. This gives rise to a unique law of motion for the posterior process, and reduces the HJBI equation to a second-order HJB equation.

We derive a closed-form expression for the DM’s value function with respect to her posterior and characterize her robust optimal experimentation strategy. It turns out that in equilibrium her strategy follows a cut-off rule with respect to her belief. Specifically, she switches from the ambiguous arm to the safe arm whenever her posterior drops below a certain threshold. We also find a closed-form equation for the cut-off value that allows us to perform a number of comparative statics. In particular, the threshold for selecting the ambiguous arm unambiguously rises as the DM’s ambiguity aversion index increases. (The direction of such a response is intuitive; however, the sharp characterization of the threshold by means of continuous-time techniques provides us with the extent of this response.) Also, we establish that the marginal value of receiving good news about $\theta$ is increasing, namely the value function is convex in the belief.

We then explore the effect of an additional unambiguous information source. In particular, we are interested to know what happens when, for example, the experimentation unit hires an expert to release risky but unambiguous information about $\theta$. The new value function is obtained in closed form, and the DM’s optimal strategy again turns out to follow a cut-off rule (with a different threshold). Interestingly, we show that under any circumstances the cut-off rises relative to the previous case as a result of the extra information. Therefore, the DM becomes more conservative about choosing the second arm when offered such information. Lastly, we show that the surplus generated by the expert attains its maximum in the range of beliefs where the experimentation unit would otherwise select the ambiguous arm but does not have strong enough evidence in favor of this decision. Therefore, our model sheds light on the best time to seek an expert opinion.

### 1.2 Related literature and organization of the paper

The literature on the robust bandit problem is limited, but recently there have been some attempts to bring several aspects of robustness into play. Specifically, in Caro and Gupta (2013) and Kim and Lim (2015) the discrete-time multi-armed bandit problem is studied while the state transition probabilities are drawn from an ambiguous set of conditional distributions. In Caro and Gupta (2013) the set of multiple transition probabilities at every period is constrained through a relative entropy condition, whereas Kim and Lim (2015) impose an entropic penalty cost directly in the objective function of the DM rather than hard-thresholding it as a constraint. In a different work, Li (2019) studies the multi-armed bandit in which the DM entertains max-min utility and follows prior-by-prior Bayes updating from her initial rectangular multiple prior set, where each candidate distribution in this set is identified by the i.i.d. shocks it generates in the future. Our work differs from these treatments in the following aspects: (i) contrary to the first two works, the Brownian diffusion treatment of the Markov transitions allows for a richer set of perturbations around the benchmark model, which extends the scope of robustness that the DM demands; (ii) the continuous-time framework lets us obtain sharp, closed-form results on the value function and the optimal experimentation policy, which in turn deliver comparative statics with regard to the parameters of the model and, importantly, the ambiguity aversion index; (iii) we are explicit about the state variable in our setup, and specifically we characterize it as the DM’s posterior process regarding the second arm’s return rate; (iv) our setup is flexible enough to address distinct informational environments, such as the effect of the provision of an expert opinion.

In the economic literature, after the seminal work of Gittins (1979), the continuous-time problem of optimal experimentation in a noisy environment, where the payoff to the unexplored arm (the second arm is often referred to as the unexplored one) is subject to a Brownian motion, is studied in Bolton and Harris (1999) and Keller and Rady (1999). Aside from these works, there is a growing literature on experimentation in multiple-agent environments where free-riding issues arise; a nonexhaustive list includes Keller et al. (2005), Heidhues et al. (2015) and Bonatti and Hörner (2017).

Our treatment of robust preferences in continuous time relies heavily on the fundamental works by Hansen and Sargent (2001), Hansen et al. (2006) and Hansen and Sargent (2011). (In a closely related discrete-time framework, Epstein and Schneider (2003) and Maccheroni et al. (2006b) present recursive utility representations aimed at capturing the preference for robustness.) Our paper is also related to the literature studying the effects of robustness and ambiguity in different decision-making frameworks, such as Riedel (2009), Cheng and Riedel (2013), Miao and Rivera (2016), Wu et al. (2018) and Luo (2017). It is also related to the relatively understudied topic of learning under ambiguity; see, for example, Marinacci (2002), Epstein and Schneider (2007) and Epstein and Ji (2019). Finally, a set of experimental works adopting different notions of ambiguity aversion has tested that the ambiguous arm of the experiment has a lower Gittins index, which prompts the DM to undervalue the information from exploration. To name a few, we can point to Anderson (2012) and Meyer and Shi (1995) in the context of airline choice and Viefers (2012) in investment choice.

The remainder of the paper is organized as follows. To build intuition, in section 2 we present some of the forces behind the model in a two-period example. Next, in section 3 the full features of the experimentation setup and payoff function are explained in a continuous-time framework. In section 4, we apply the dynamic programming analysis and present variational characterizations of the value function. Section 5 offers the closed-form expression for the value function, properties of the optimal experimentation strategy, and some comparative statics results. In section 6, we extend our setup to capture the effect of an additional unambiguous information source. Concluding remarks are presented in section 7, and the proofs of all results are given in appendix A.

## 2 Two-period example

Our goal in this example is to highlight the main trade-offs that the DM and her opponent, nature, face in their dynamic interaction. Let $t \in \{1, 2\}$, and at each period the DM allocates her resources between two available choices, namely the safe and the ambiguous project. The incremental returns to each arm when she allocates $1-\mu_t$ of her resources to the safe (first) arm and $\mu_t$ to the ambiguous (second) arm are

$$\Delta y_{1,t} = (1-\mu_t)\,r, \qquad \Delta y_{2,t} = \mu_t\theta + \sqrt{\mu_t}\,\varepsilon_t. \tag{2.1}$$

Here $r = 1$ is the return rate of the safe project, and $\theta \in \{0, 2\}$ is the unknown return to the second arm. The DM’s prior on this set at period one is given by $p_1$, the probability of the high return, which is not subject to any ambiguity. However, at each period the return to the second arm is obfuscated by an independent Gaussian shock $\varepsilon_t$ (for simplicity, assume $\varepsilon_1$, $\varepsilon_2$ and the period-one belief on $\theta$ are independent of each other) that could be drawn from two distributions, namely for each $t$ the law of $\varepsilon_t$ belongs to the set $\{N(-0.5, 1), N(0.5, 1)\}$. (This set clearly satisfies neither the rectangularity condition nor the convexity property of Gilboa and Schmeidler (1989); however, it serves only expositional purposes.) We take no stance on whether this multiple prior set is the subjective belief of the DM or literally the objective moves that nature takes against the DM. Our solution concept for both cases is the so-called max-min. However, the first situation reflects a decision-theoretic choice of an ambiguity-averse agent with a subjective multiple prior set, whereas the second interpretation is more in line with the notion of robust decision making.

The timing of this example is as follows. At the beginning of period one the DM chooses $\mu_1$. Then nature responds by picking $h_1$ as the mean of $\varepsilon_1$. The returns to both arms, i.e., $(\Delta y_{1,1}, \Delta y_{2,1})$, are realized. The DM forms the family of beliefs $\{p_2^h\}$ at the beginning of period two and takes the appropriate action $\mu_2$. Nature chooses $h_2$ as the mean of the second period’s shock. Subsequently the second period’s returns are realized and the game ends.

What happens at the sub-game perfect equilibrium of this game? For this we need to look at the sub-game starting at $t = 2$. Regardless of the DM’s action $\mu_2$, nature always picks $h_2 = -0.5$, because the game ends at this period and this is the worst-case distribution from the DM’s perspective. Because nature’s choice at period two is trivial, we drop the index one from $h_1$ and henceforth denote it by $h$, which is the only non-trivial choice of nature in this example. The DM’s posterior beliefs after the realization of the first-period returns are

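$$p_2^h = \left(1 + \frac{1-p_1}{p_1}\exp\left\{2\left(\sqrt{\mu_1}\,h + \mu_1 - \Delta y_{2,1}\right)\right\}\right)^{-1} 1_{\{\mu_1 > 0\}} + p_1 1_{\{\mu_1 = 0\}}, \qquad h \in \{-0.5, 0.5\}. \tag{2.2}$$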
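As a sanity check on (2.2), the following minimal Python sketch compares the closed-form posterior with a direct application of Bayes' rule. It assumes, as the coefficients in (2.2) and (2.3) suggest, that $\theta \in \{0, 2\}$ and that the shock has unit variance and mean $h$; the function names are hypothetical and chosen only for illustration.

```python
import math

def posterior(p1, mu1, dy2, h):
    """Posterior probability of the high return after period one, as in (2.2)."""
    if mu1 == 0:
        return p1  # no resources on the ambiguous arm, hence no learning
    odds = (1 - p1) / p1 * math.exp(2 * (math.sqrt(mu1) * h + mu1 - dy2))
    return 1 / (1 + odds)

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior_bayes(p1, mu1, dy2, h):
    """Direct Bayes rule. Assumption: theta is 0 or 2 and the shock is N(h, 1),
    so dy2 is Gaussian with mean mu1*theta + sqrt(mu1)*h and variance mu1."""
    shift = math.sqrt(mu1) * h
    lik_high = normal_pdf(dy2, 2 * mu1 + shift, mu1)
    lik_low = normal_pdf(dy2, shift, mu1)
    return p1 * lik_high / (p1 * lik_high + (1 - p1) * lik_low)
```

For a fixed observation, a larger $h$ produces a smaller posterior, which is the comparison the DM faces across the members of her posterior set.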

It is important to note that the posterior probability is no longer unique, and the DM faces a set of posteriors, one for each choice of nature in period one. Even though we face a two-player game where nature’s actions are not observable to the DM, at equilibrium the DM knows the minimizing choice of nature, so her family of posteriors effectively reduces to a single posterior induced by nature’s worst-case action. This point becomes clearer as we proceed through the equilibrium analysis. For every member of the posterior set, the DM’s optimal action at period two (anticipating that nature will choose $h_2 = -0.5$) pins down her expected second-period payoff $v_2(p_2^h)$. Note that this expectation is with respect to nature’s equilibrium distribution choice, namely the shock law with mean $-0.5$. Assume the experimenter’s intertemporal discount rate is $\delta$. Further, let $E^h$ denote expectation under the probability measure induced by the independent product of the prior on $\theta$ and the first-period shock distribution with mean $h$. Therefore, the DM’s value function as of the beginning of period one is

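$$v_1(p_1) = \max_{\mu_1 \in [0,1]}\ \min_{h \in \{-0.5, 0.5\}} \left\{ \left[(1-\mu_1) + 2\mu_1 p_1 + \sqrt{\mu_1}\,h\right] + \delta E^h\left[v_2(p_2^h)\right] \right\}. \tag{2.3}$$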
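The second-period problem embedded in (2.3) can be sketched numerically. The snippet below is an illustration under stated assumptions, not the paper's derivation: it takes $r = 1$, $\theta \in \{0, 2\}$ and $h_2 = -0.5$, reads the period-two flow payoff off the bracket in (2.3), and searches a grid of allocations. Because the term $-0.5\sqrt{\mu_2}$ is convex in $\mu_2$, the maximizer sits at a corner, so under these assumptions the DM simply compares the safe payoff $1$ with $2p_2 - 0.5$. All names are hypothetical.

```python
import numpy as np

def period_two_payoff(mu2, p2, h2=-0.5, r=1.0, theta_high=2.0):
    """Expected period-two flow payoff, mirroring the bracket in (2.3).
    Assumption: theta is 0 or theta_high, and nature plays h2."""
    return (1 - mu2) * r + mu2 * theta_high * p2 + np.sqrt(mu2) * h2

def best_allocation(p2, grid=np.linspace(0.0, 1.0, 1001)):
    """Grid search over the share mu2 placed on the ambiguous arm."""
    payoffs = period_two_payoff(grid, p2)
    best = int(np.argmax(payoffs))
    return grid[best], payoffs[best]

if __name__ == "__main__":
    for p2 in (0.6, 0.7, 0.8, 0.9):
        mu2, value = best_allocation(p2)
        print(f"p2={p2:.2f}: mu2={mu2:.1f}, value={value:.3f}")
```

Under these assumptions the two corners tie at $p_2 = 0.75$, which is where the sketch switches from the safe arm to the ambiguous arm.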

Below we point out some of the underlying equilibrium forces that show up in this two-period example.

1. Nature’s first-period action or, alternatively, the DM’s most pessimistic perception of the shock distribution plays two roles. The first is the current payoff channel: nature’s choice of $h$ affects the current payoff of the DM by changing the mean return of the ambiguous arm, i.e., $\mu_1\theta + \sqrt{\mu_1}\,h$. This is a positive force, as higher values of $h$ correspond to a higher mean flow payoff. The second is the informational channel: the shock distribution affects the next-period belief of the DM, hence changes her course of action and thereby the continuation payoff. This has a negative effect, because as $h$ increases, the distribution of $\Delta y_{2,1}$ shifts to the right in the FOSD sense and, for a fixed $\Delta y_{2,1}$, lowers the likelihood ratio in (2.2), which in turn depresses the continuation payoff. At equilibrium, nature weighs these forces and picks the action whose negative effect outweighs the positive one, and thus reduces the DM’s payoff more. However, it cannot completely balance out the marginal impact of these forces, mainly because we assumed the multiple prior set consists of only two distributions. When the complete model is laid out in section 3, we allow for a quite general multiple prior set, so nature can precisely cancel out the marginal effects, thereby lowering the DM’s payoff as much as possible.

2. From the point of view of the DM, there is an option value of experimentation. Specifically, in the first period she selects the ambiguous arm (even partially) only to observe the payout of the second arm, and then may decide to abandon the ambiguous project depending on the outcome of the first period. In this example, the DM switches back to the safe arm in the second period if her equilibrium posterior drops below a certain threshold.

3. The DM’s value function is unambiguously increasing in her initial belief (as can be confirmed from (2.3)), but the marginal value of good news need not be increasing (meaning $v_1''$ is not always positive). This is mainly due to the finite-horizon setup of the two-period model, which is relaxed in later sections.

4. The value function in (2.3) refers to the max-min value of the game, which is associated with the strategic order of actions in which the DM takes her action first and then nature responds in every period. This is the same approach that we pursue when we present the complete model. However, one might wonder when this max-min value coincides with the min-max one or, in other words, when the strategic order of the players’ actions becomes irrelevant. In this example the max-min value is strictly less than the min-max value. Although it is not related to the study of this paper, we note that with compact and convex action spaces for both players, the von Neumann minimax theorem could be applied, and therefore one can conceive of the unique value of the zero-sum game between the DM and nature.

We do not intend to delve deeper into this example and express more specific results and comparative statics, mainly because such analysis will be carried out for the complete model later in the paper.

## 3 Experimentation model

The time horizon is infinite and $t \in [0, \infty)$. There are two projects available to the DM for experimentation. Her choice at time $t$ is thus to allocate her resources between the two alternatives, namely $\mu_t$ to the ambiguous arm and $1-\mu_t$ to the safe arm. (The goal of this section is to study the interplay between ambiguity regarding the new arm and optimal experimentation, so for simplicity we assume that the conventional arm has a sure return rate of $r$ and is not subject to any source of randomness. Therefore, it is only the second arm that carries the Brownian motion term.) The return processes of the projects are

$$dy_{1,t} = (1-\mu_t)\,r\,dt, \qquad dy_{2,t} = \mu_t\theta\,dt + \sigma\sqrt{\mu_t}\,dB_t. \tag{3.1}$$

Here $B$ is a Brownian motion relative to some underlying stochastic basis (the underlying stochastic basis and the joint structure of the processes are described in the subsection devoted to the weak formulation) that represents the shock process, and $\theta$ is unknown to the DM but belongs to the binary set $\{\underline{\theta}, \bar{\theta}\}$, where $\underline{\theta} < r < \bar{\theta}$. The DM has an initial belief $p_0$ about $\theta$, which is independent of $B$. The form of the return processes in (3.1) follows Bolton and Harris (1999), but we let the DM associate multiple distributions with the shock process. Specifically, the DM holds a single belief over $\theta$, which represents the uncertainty due to risk, but has multiple beliefs regarding the shock distribution, which represents the uncertainty due to ambiguity. (This type of uncertainty is sometimes referred to as model uncertainty in the literature.)

### 3.1 A framework for modelling ambiguity

Our take on ambiguity or model uncertainty is similar to Hansen et al. (2006) and Hansen and Sargent (2011). In particular, we assume there is a family of pairs $\{(P^h, B^h) : h \in \mathcal{H}\}$ such that for each $h \in \mathcal{H}$, $B^h$ is a Brownian motion under $P^h$, and the DM views $\{P^h : h \in \mathcal{H}\}$ as her multiple prior set. We think of $\mathcal{H}$, which thus far has not been defined, as nature’s action space, and each $h \in \mathcal{H}$ is deemed a possible move of nature. We assume there exists a benchmark probability specification $P$ that is equivalent (mutually absolutely continuous with respect) to each member of this set. The benchmark measure and the multiple prior set are interpreted differently based on the context. For example, the DM might believe that $P$ is the underlying probability measure, but considers the measures $P^h$ as approximations of the true distribution because she has a preference for robustness. Alternatively, the set could be conceived as the multiple prior set for the ambiguity-averse DM in the axiomatic treatment of Gilboa and Schmeidler (1989).

The DM has multiplier preferences and maximizes the following payoff over an admissible set of experimentation strategies (with some technical considerations that are elaborated later in the paper):

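$$\inf_{h \in \mathcal{H}} \left\{ E^{P^h}\left[\delta\int_0^\infty e^{-\delta t}\,d\left(y_{1,t} + y_{2,t}\right)\right] + \alpha\,H(P^h; P) \right\} \tag{3.2}$$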

Here $\delta$ is the time discount rate. The first term in the DM’s utility is simply the expected discounted payoff from both projects taken with respect to the measure $P^h$, and the second term penalizes belief misspecification, using the relative discounted entropy to measure the discrepancy between $P^h$ and $P$. The parameter $\alpha > 0$ scales this penalty: the larger $\alpha$ is, the costlier it is for nature to deviate from the benchmark, and hence the smaller the utility loss that the DM suffers. We shall also interpret $\alpha$ as the inverse of ambiguity aversion and relate (3.2) to the dynamic variational utility representation of Maccheroni et al. (2006a) and Maccheroni et al. (2006b). A large $\alpha$ means that the DM does not suffer much from ambiguity aversion. In contrast, as $\alpha \to 0$, nature can distort the distribution at almost no cost, and the DM experiences a larger utility loss.
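To see how $\alpha$ disciplines nature, consider a one-shot analog of (3.2), offered only as an illustration and not as part of the model: the DM receives $c\,\varepsilon$, where $\varepsilon \sim N(h, 1)$ under $P^h$ and $\varepsilon \sim N(0, 1)$ under the benchmark, so the relative entropy is $h^2/2$. Nature then minimizes $c\,h + \alpha h^2/2$, which gives $h^* = -c/\alpha$ and a utility loss of $c^2/(2\alpha)$. The Python sketch below (hypothetical names, grid search instead of the closed form) checks this.

```python
import numpy as np

def robust_value(c, alpha, h_grid=np.linspace(-10.0, 10.0, 200001)):
    """One-shot multiplier problem: min over h of c*h + alpha*h**2/2.
    Assumption: payoff c*eps with eps ~ N(h, 1) under nature's measure,
    whose relative entropy with respect to N(0, 1) equals h**2/2."""
    penalized = c * h_grid + alpha * h_grid**2 / 2
    i = int(np.argmin(penalized))
    return h_grid[i], penalized[i]

if __name__ == "__main__":
    for alpha in (0.5, 2.0, 50.0):
        h_star, value = robust_value(1.0, alpha)
        print(f"alpha={alpha}: h*={h_star:.3f}, worst-case value={value:.4f}")
```

As $\alpha$ grows, the minimizing distortion shrinks toward zero and the worst-case value approaches the benchmark expected payoff, which matches the reading of $\alpha$ as the inverse of ambiguity aversion.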

In the next subsection we use the weak-formulation approach from the theory of stochastic processes to elaborate and simplify DM’s utility function (3.2).

### 3.2 Weak formulation

In this part we present a sound foundation for the joint structure of all the stochastic processes in the model. (The material in this subsection might look somewhat technical and unnecessary to some readers, but it is essential for a rigorous development of the model.) Let $(\Omega, \mathcal{F}, \mathbb{F} = (\mathcal{F}_t)_{t \ge 0}, P)$ be the stochastic basis, where the filtration $\mathbb{F}$ satisfies the usual conditions, that is, it is right-continuous and $P$-complete. The average rate of return to the ambiguous project, $\theta$, is a binary $\mathcal{F}_0$-measurable random variable.

###### Definition 1 (Strategy spaces).

The DM’s strategy space, with a representative point $\mu$, is the set of all $\mathbb{F}$-progressive processes taking values in $[0, 1]$ (we refer to Karatzas and Shreve (2012) for the definition of progressive processes). Nature’s strategy space $\mathcal{H}$, with a representative point $h$, is the space of all bounded $\mathbb{F}$-progressive processes.

###### Definition 2 (Integral forms).

For any pair of processes $(X, Y)$ where $X$ is $Y$-integrable (the notion of integral depends on the context and could be either the path-wise Stieltjes integral or the stochastic Itô integral), we use the alternative notation for integration: $(X \cdot Y)_t := \int_0^t X_s\,dY_s$. Further, the symbol $\imath$ refers to the identity mapping on $[0, \infty)$. Then the differential return expressions in (3.1) can be represented in the integral form $y_1 = ((1-\mu)\,r \cdot \imath)$ and $y_2 = (\mu\theta \cdot \imath) + (\sigma\sqrt{\mu} \cdot B)$.

To model the ambiguity we appeal to the weak formulation. In particular, we think of ambiguity as the source that changes the distribution of the return process $y_2$, but not its sample paths. For this, on every finite interval $[0, T]$ we define the probability measure $P^h_T$ with the following Radon-Nikodym derivative process:

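$$\frac{dP^h_T}{dP}\bigg|_{\mathcal{F}_t} := L^{h,T}_t = \exp\left\{(h \cdot B)_t - \tfrac{1}{2}\,(h^2 \cdot \imath)_t\right\}, \qquad \forall\, t \le T \tag{3.3}$$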

This relation explains how nature, with its choice of $h$, could induce a new probability measure. Girsanov’s theorem implies that $P^h_T$ is mutually absolutely continuous with respect to $P$, which is often called an equivalent measure and denoted by $P^h_T \sim P$ on $\mathcal{F}_T$. It also implies that the mean-shifted process $B^{h,T}_t := B_t - \int_0^t h_s\,ds$ is an $\mathbb{F}$-Brownian motion under $P^h_T$ over the interval $[0, T]$. The main catch here is that we can only characterize the perturbations of the benchmark probability model over finite intervals; for example, we know what $P^h_T$ looks like on $\mathcal{F}_T$ for any finite $T$. However, what is needed for the utility representation in (3.2) is a specification of $P^h$ on the terminal $\sigma$-field $\mathcal{F}_\infty$. For this we need to use a limiting argument to consistently send $T \to \infty$ and obtain $P^h$ as an appropriate limit of $P^h_T$. Our proposal for this is as follows. For any process $h \in \mathcal{H}$ and an increasing sequence of finite times $T_n \uparrow \infty$, we repeatedly apply Girsanov’s theorem to obtain a family of consistent probability measures $\{P^h_{T_n}\}$, where $P^h_{T_n} \sim P$ on $\mathcal{F}_{T_n}$ for every $n$. In a similar vein we obtain the likelihood ratio process $L^{h,T_n}$ and the Brownian motion $B^{h,T_n}$ for every $n$. Next, we explain how to naturally define the limit of each of the three components.

1. Likelihood process limit: Expression (3.3) implies that the likelihood processes are path-wise consistent with each other, i.e., $L^{h,T_m}_t = L^{h,T_n}_t$ for every $t \le T_m \le T_n$. Therefore, one can define the process $L^h$ on $[0, \infty)$ in a meaningful sense, such that its restriction to any finite interval coincides with the corresponding member of the sequence of likelihood processes. This concludes the construction of the limit likelihood process. Importantly, this construction suggests that $L^h$ must be a martingale with respect to $P$ on $[0, \infty)$. To see this, note that a bounded $h$ causes Novikov’s condition to hold, so $L^{h,T_n}$ is a uniformly integrable martingale on $[0, T_n]$ with respect to $P$ for every $n$. Because of the path-wise consistency, this immediately establishes the martingale property of $L^h$ on $[0, \infty)$.

2. Probability measure limit: First, recall that for every $n$, $P^h_{T_n}$ is a probability measure on $\mathcal{F}_{T_n}$. Then the path-wise consistency resulting from (3.3) implies that these measures indeed match each other, namely $P^h_{T_m}(A) = P^h_{T_n}(A)$ for every $A \in \mathcal{F}_{T_m}$ where $m \le n$. Thus, we can apply theorem 4.2 in Parthasarathy (2005), which guarantees the existence of a closing probability measure $P^h$ on $\mathcal{F}_\infty$ such that its restrictions to finite intervals coincide with the above sequence of probability measures, yet it need not be equivalent to $P$ on $\mathcal{F}_\infty$. That is, restricted to every finite $T$, $P^h \sim P$ on $\mathcal{F}_T$, but this may not be true on $\mathcal{F}_\infty$.

3. Brownian motion limit: Applying Girsanov’s theorem lets us deduce that $B^{h,T_n}$ is a Brownian motion under $P^h_{T_n}$ on $[0, T_n]$ for every $n$. Since $P^h = P^h_{T_n}$ on $\mathcal{F}_{T_n}$, it turns out that $B^{h,T_n}$ is also a Brownian motion under $P^h$. Also note that path-wise consistency holds for the sequence of Brownian motions, namely $B^{h,T_m}_t = B^{h,T_n}_t$ for all $t \le T_m \le T_n$. Therefore, in the same manner that we defined $L^h$ from $L^{h,T_n}$, we can define $B^h$ as the process on $[0, \infty)$ such that its restrictions to any finite interval satisfy the properties of Brownian motion.

The construction of $(P^h, B^h)$ illustrated above allows us to express the return process of the ambiguous project in terms of the $P^h$-Brownian motion $B^h$:

$$dy_{2,t} = \left[\mu_t\theta + \sigma\sqrt{\mu_t}\,h_t\right]dt + \sigma\sqrt{\mu_t}\,dB^h_t \tag{3.4}$$

The merit of the weak formulation now becomes clear: the sample paths of the return processes $(y_1, y_2)$ are essentially fixed, but the probability distribution that assigns weights to subsets of sample paths is controlled by the choice of $h$. So in a sense nature’s move is to select the distribution of the returns, not their sample paths.
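The change of measure in (3.3) can be illustrated with a small Monte Carlo experiment. The sketch below is only a numerical check under simplifying assumptions (a constant $h$, a finite horizon $T$, and hypothetical function and parameter names): paths of $B$ are drawn once under the benchmark $P$, and expectations under $P^h_T$ are obtained by reweighting the same paths with the likelihood ratio.

```python
import numpy as np

def girsanov_check(h=0.5, T=1.0, steps=20, n_paths=200_000, seed=0):
    """Reweight benchmark Brownian paths by the likelihood ratio in (3.3).
    Assumption: nature's strategy h is constant over [0, T]."""
    rng = np.random.default_rng(seed)
    dt = T / steps
    dB = rng.normal(0.0, np.sqrt(dt), size=(n_paths, steps))
    B_T = dB.sum(axis=1)                      # terminal value of B under P
    L_T = np.exp(h * B_T - 0.5 * h**2 * T)    # (h.B)_T - (h^2.i)_T / 2 for constant h
    mean_L = L_T.mean()                       # should be close to 1 (martingale)
    mean_B = (L_T * B_T).mean()               # mean of B_T under P^h, close to h*T
    mean_Bh = (L_T * (B_T - h * T)).mean()    # mean of the shifted process, close to 0
    return mean_L, mean_B, mean_Bh

if __name__ == "__main__":
    print(girsanov_check())
```

The sample paths are never changed; only their weights are, which is the sense in which nature selects a distribution rather than a path.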

Now that we know what is meant by $P^h$ on $\mathcal{F}_\infty$, we can analyze both terms of (3.2), which are expectations under $P^h$; this is the goal of the next subsection.

### 3.3 Unravelling the payoff function

We begin the simplification of (3.2) by elaborating the second term, that is, the entropy cost of ambiguity aversion. Recall that $P^h$ and $P$ need not be equivalent measures on $\mathcal{F}_\infty$, yet their restrictions $P^h_t$ and $P_t$ are indeed equivalent probability measures on $\mathcal{F}_t$ for every finite $t$. That said, the relative discounted entropy is defined as

$$H(P^h; P) := \lim_{T \to \infty} \delta\int_0^T e^{-\delta t}\,H(P^h_t; P_t)\,dt, \tag{3.5}$$

where $H(P^h_t; P_t) = E^h[\log L^h_t]$ is the relative entropy between the restrictions of $P^h$ and $P$ to $\mathcal{F}_t$. Expression (3.5), which is proposed in Hansen et al. (2006), presents a proxy for the discrepancy between two measures that are not necessarily equivalent on the terminal $\sigma$-field, and hence whose relative entropy could be infinite, but which are equivalent and have finite relative entropy on each finite interval, say $[0, t]$. Therefore, one may hope that relation (3.5) is well-defined.

###### Lemma 3.

The discounted relative entropy in (3.5) is well-defined, namely for every $h \in \mathcal{H}$ it is finite and satisfies

$$H(P^h; P) = \frac{1}{2}\,E^h\left[\int_0^\infty e^{-\delta t}\,h_t^2\,dt\right] < \infty. \tag{3.6}$$
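A short derivation sketch of (3.6) may help here. It assumes the standard definition $H(P^h_t; P_t) = E^h[\log L^h_t]$ and that $h$ is bounded, so that the stochastic integral against $B^h$ has zero mean and Fubini's theorem applies; it is not a substitute for the proof in the appendix.

```latex
\begin{aligned}
H(P^h_t; P_t)
  &= E^h\!\left[\log L^h_t\right]
   = E^h\!\left[(h \cdot B)_t - \tfrac{1}{2}\,(h^2 \cdot \imath)_t\right] \\
  &= E^h\!\left[(h \cdot B^h)_t + \tfrac{1}{2}\,(h^2 \cdot \imath)_t\right]
   = \tfrac{1}{2}\,E^h\!\left[\int_0^t h_s^2\,ds\right], \\[4pt]
H(P^h; P)
  &= \delta \int_0^\infty e^{-\delta t}\,\tfrac{1}{2}\,
     E^h\!\left[\int_0^t h_s^2\,ds\right] dt
   = \tfrac{1}{2}\,E^h\!\left[\int_0^\infty h_s^2
     \left(\delta \int_s^\infty e^{-\delta t}\,dt\right) ds\right] \\
  &= \tfrac{1}{2}\,E^h\!\left[\int_0^\infty e^{-\delta s}\,h_s^2\,ds\right].
\end{aligned}
```

The second line uses $dB_t = h_t\,dt + dB^h_t$. If $|h_t| \le K$ for all $t$, the last expression is at most $K^2/(2\delta)$, which gives finiteness.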

Roughly speaking, for the first component of the payoff function we need to take the expectation of $\delta\int_0^\infty e^{-\delta t}\,d(y_{1,t} + y_{2,t})$ under the measure $P^h$. This is within our reach because we stated the dynamics of $y_2$ in terms of $B^h$ in (3.4). However, the drift term in $y_2$ contains the random variable $\theta$, which needs to be learned and projected onto the DM’s information set. For this we present an optimal filtering result under each measure $P^h$.

###### Remark 4.

The DM’s initial prior is unaffected under the different probability distributions $P^h$. This is because the benchmark measure and all its variations agree on $\mathcal{F}_0$, which results from $L^h_0 = 1$ for every $h \in \mathcal{H}$.

In light of this remark, we want to continuously estimate and update the DM’s posterior on $\theta$ based on her available information at every point in time. Her information set at time $t$ contains the path of output from each project, $(y_{1,s}, y_{2,s})_{s \le t}$, the history of her allocation process and, importantly, nature’s moves up until time $t$, i.e., $(h_s)_{s \le t}$. Note that at each time $t$, the DM’s ambiguity is with regard to the future path of $h$, and she has no uncertainty about the history of nature’s moves in the past. Some might not be willing to make this assumption about the ex-post observability of nature’s moves to the DM. However, this is not an important assumption, for two reasons. First, on the equilibrium path the DM knows the history of nature’s past moves. Second, in theory we can find the filtering equation under every possible history of nature’s actions and then let the DM pessimistically choose from this family of posteriors. In summary, the filtering problem that the DM faces at time $t$ is to update her posterior based on the available information set $\mathcal{F}^{y_1, y_2, \mu, h}_t$. Of secondary importance is to note that $y_1$ conveys no information about $\theta$, and thus can be dropped from the information set.

###### Definition 5.

For every , define as the posterior probability and as the conditional mean. At , let and .

###### Lemma 6 (Liptser and Shiryaev (2013) theorem 8.1).

The conditional probability of the event given the filtration evolves according to the following stochastic differential equation:

$$dp^h_t=\frac{(\bar\theta-\underline\theta)\sqrt{\mu_t}}{\sigma}\,p^h_t\,(1-p^h_t)\,d\bar B^h_t \tag{3.7}$$

Here is called the innovation process which is a Brownian motion under , and is characterized by . As a result of this, the law of motion for would be

$$dy_{2,t}=\left[\mu_t\,m(p^h_t)+\sqrt{\mu_t}\,h_t\right]dt+\sigma\sqrt{\mu_t}\,d\bar B^h_t. \tag{3.8}$$

Sketch of the proof. First note that from the filtering point of view the process contains the same information as . Therefore, on the region , we have for every and . Next, applying theorem 8.1 of Liptser and Shiryaev (2013) and taking as the observable process and as the subject of filtering imply that:

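$$\begin{aligned}
\mathbb{E}^h\big[\theta\,\big|\,\mathcal{F}^{\tilde y_2,\mu,h}_t\big]
&=\mathbb{E}^h\big[\theta\,\big|\,\mathcal{F}^{\tilde y_2,\mu,h}_0\big]
+\sigma^{-1}\int_0^t\Big(\mathbb{E}^h\big[\theta(\sqrt{\mu_s}\,\theta+h_s)\,\big|\,\mathcal{F}^{\tilde y_2,\mu,h}_s\big]
-\mathbb{E}^h\big[\theta\,\big|\,\mathcal{F}^{\tilde y_2,\mu,h}_s\big]\,\mathbb{E}^h\big[\sqrt{\mu_s}\,\theta+h_s\,\big|\,\mathcal{F}^{\tilde y_2,\mu,h}_s\big]\Big)\,d\bar B^h_s\\
&=\mathbb{E}^h\big[\theta\,\big|\,\mathcal{F}^{\tilde y_2,\mu,h}_0\big]
+\sigma^{-1}\int_0^t\sqrt{\mu_s}\,\Big(\mathbb{E}^h\big[\theta^2\,\big|\,\mathcal{F}^{\tilde y_2,\mu,h}_s\big]
-\mathbb{E}^h\big[\theta\,\big|\,\mathcal{F}^{\tilde y_2,\mu,h}_s\big]^2\Big)\,d\bar B^h_s
\end{aligned} \tag{3.9}$$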

This expression underlies the filtering equation for the posterior process , as it readily amounts to

$$p^h_t=p_0+\sigma^{-1}(\bar\theta-\underline\theta)\int_0^t\sqrt{\mu_s}\,p^h_s\,(1-p^h_s)\,d\bar B^h_s, \tag{3.10}$$

and thus verifies equation (3.7). It is worth mentioning that, since there is no ambiguity about at time with respect to the distribution of , the first term on the right-hand side of (3.9) is independent of . ∎
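To get a feel for the filtering equation (3.7), the following is a minimal simulation sketch and not part of the original analysis. It assumes a constant allocation $\mu$ and uses a plain Euler-Maruyama discretization of (3.7); the function name `simulate_belief` and every parameter value are hypothetical choices made for illustration.

```python
import math
import random


def simulate_belief(p0, theta_hi, theta_lo, sigma, mu, horizon, n_steps, seed=0):
    """Euler-Maruyama path of (3.7) with a constant allocation mu (an assumption)."""
    rng = random.Random(seed)
    dt = horizon / n_steps
    # Signal-to-noise factor multiplying p (1 - p) in (3.7).
    scale = (theta_hi - theta_lo) * math.sqrt(mu) / sigma
    p = p0
    path = [p]
    for _ in range(n_steps):
        d_b = rng.gauss(0.0, math.sqrt(dt))  # increment of the innovation process
        p = p + scale * p * (1.0 - p) * d_b
        # The exact diffusion stays inside (0, 1); the discretized one may
        # overshoot, so the sketch clips it back to [0, 1].
        p = min(max(p, 0.0), 1.0)
        path.append(p)
    return path
```

With `mu = 0` the scale factor vanishes and the belief stays at `p0`, which mirrors the fact that the DM learns nothing about the ambiguous arm while she pulls only the safe one.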

At this stage we have developed all the tools required to express the utility function in (3.2) in terms of the initial belief and the players' actions. For this we define the infinite-horizon payoff as the limit of its finite-horizon counterparts. The reason is that the constructed process is a Brownian motion only over finite intervals, and we cannot extend it to the entire , unless we impose further restrictions on and to obtain uniform integrability of the likelihood processes, which we refrain from doing. Therefore, inspired by (3.2), we define the utility of the DM from taking action while nature chooses by

 (3.11)
###### Proposition 7.

For every choice of and , the net discounted average payoff defined in (3.11) can be expressed as:

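$$V(p;\mu,h)=\mathbb{E}^h\left[\delta\int_0^\infty e^{-\delta t}\Big((1-\mu_t)\,r+\mu_t\,m(p^h_t)+\sigma\sqrt{\mu_t}\,h_t+\frac{\alpha}{2\delta}h_t^2\Big)\,dt\right] \tag{3.12}$$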

This proposition serves us well, because the integrand is now -progressively measurable, which in turn allows us to carry out a dynamic programming scheme that expresses the value function in terms of the current belief. This is the goal of the next section.

## 4 Dynamic programming analysis

Our analysis so far offers expression (3.12) as the DM's payoff in the two-player differential game against nature. For any point in time, say $t$, define the expected continuation value conditioned on $\mathcal{G}_t$ as

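$$J(p,t;\mu,h):=\mathbb{E}^h\left[\delta\int_t^\infty e^{-\delta s}\Big((1-\mu_s)\,r+\mu_s\,m(p^h_s)+\sigma\sqrt{\mu_s}\,h_s+\frac{\alpha}{2\delta}h_s^2\Big)\,ds\,\Big|\,\mathcal{G}_t\right]. \tag{4.1}$$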

In that is the time value of the state process . For every the process as well as are time homogeneous Markov diffusions. Furthermore, the players’ action spaces at the time sub-game — and resp. for the DM and the nature — are essentially isomorphic to and . These two premises imply that the max-min value of the game for the DM, i.e , is time homogeneous. Specifically, there exists a value function such that

$$\sup_{\mu\in\mathcal{U}_t}\,\inf_{h\in\mathcal{H}_t}J(p,t;\mu,h)=e^{-\delta t}v(p) \tag{4.2}$$

Our goal in the next theorem is to present a verification result for the value function. For this we appeal to the theory of viscosity solutions (Crandall et al. (1984)), which provides the appropriate setting for Bellman equations. The reason is that, as it turns out, the value function is not twice continuously differentiable everywhere, so classical verification techniques relying on Ito's lemma do not apply. We offer some preliminary definitions that are linked to the work of Zhou et al. (1997), thereby setting the groundwork for the viscosity solution concept. (There were some technical gaps in the proof of the verification theorem in that paper, which are addressed and corrected in the follow-up papers Gozzi et al. (2005) and Gozzi et al. (2010); thanks to the anonymous referee for bringing this to the author's attention.)

###### Definition 8.

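Let . The superdifferential of $w$ at $x_0$ is denoted by $D^+w(x_0)$:

$$D^+w(x_0)=\left\{(\xi_1,\xi_2)\in\mathbb{R}^2:\ \limsup_{x\to x_0}\frac{w(x)-w(x_0)-(x-x_0)\,\xi_1-\frac{1}{2}(x-x_0)^2\,\xi_2}{(x-x_0)^2}\le 0\right\} \tag{4.3}$$

A generic member of this set is referred to by . The subdifferential, denoted by $D^-w(x_0)$, is defined as

$$D^-w(x_0)=\left\{(\xi_1,\xi_2)\in\mathbb{R}^2:\ \liminf_{x\to x_0}\frac{w(x)-w(x_0)-(x-x_0)\,\xi_1-\frac{1}{2}(x-x_0)^2\,\xi_2}{(x-x_0)^2}\ge 0\right\}. \tag{4.4}$$

A generic member of this set is referred to by .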

Notice that a continuous function may fail to be once or twice continuously differentiable, but it always has non-empty superdifferential and subdifferential sets on a dense subset of its domain (Lions (1983)).

In the verification theorem that follows we show that the value function in (4.2) is the viscosity solution to a certain HJBI equation with the following form

$$w(p)=\sup_{\mu\in[0,1]}\,\inf_{h\in\mathbb{R}}\Big\{g(p,\mu,h)+K\big(p,w'(p),w''(p),\mu,h\big)\Big\}, \tag{4.5}$$

where the specific form of the coefficients $g$ and $K$ will be given in the theorem's statement. (Note that $w'$ and $w''$ should not be confused with the first and second derivatives, as these may not exist for a continuous function. This form is just a representation of the HJBI equation, which has a viscosity solution in the sense of Definition 9 and may not admit a smooth classical solution.) As a last step before presenting the theorem, we state what it means to be a viscosity solution to an HJBI equation.

###### Definition 9.

A function is called a viscosity solution of (4.5) if it is both a viscosity subsolution and a viscosity supersolution that are respectively equivalent to:

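$$-w(p)+\sup_{\mu\in[0,1]}\,\inf_{h\in\mathbb{R}}\Big\{g(p,\mu,h)+K(p,\xi_1,\xi_2,\mu,h)\Big\}\le 0,\quad\forall(\xi_1,\xi_2)\in D^+w(p), \tag{4.6a}$$

$$-w(p)+\sup_{\mu\in[0,1]}\,\inf_{h\in\mathbb{R}}\Big\{g(p,\mu,h)+K(p,\xi_1,\xi_2,\mu,h)\Big\}\ge 0,\quad\forall(\xi_1,\xi_2)\in D^-w(p). \tag{4.6b}$$

###### Theorem 10.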

Suppose $w$ is Lipschitz and is a viscosity solution to the following HJBI equation:

$$w(p)=\sup_{\mu\in[0,1]}\,\inf_{h\in\mathbb{R}}\Big\{(1-\mu)\,r+\mu\,m(p)+\sigma\sqrt{\mu}\,h+\frac{\alpha}{2\delta}h^2+\frac{\mu}{2\delta}\Phi(p)\,w''(p)\Big\}, \tag{4.7}$$

where . Then $w$ equals $v$, the value function in (4.2). In equilibrium, the worst-case density generator is , where is the DM's best response in

$$w(p)=\sup_{\mu\in[0,1]}\Big\{(1-\mu)\,r+\mu\,m(p)-\frac{\sigma^2\delta}{2\alpha}\mu+\frac{\mu}{2\delta}\Phi(p)\,w''(p)\Big\}. \tag{4.8}$$

(This equation should also be interpreted in the viscosity sense, by dropping the infimum in Definition 9.)
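The step from (4.7) to (4.8) can be checked by carrying out the inner minimization over $h$ explicitly. The following is a short sketch, assuming $\alpha>0$ and $\delta>0$ so that the bracket in (4.7) is strictly convex in $h$; the symbol $h^*(\mu)$ is notation introduced here for illustration only.

$$\begin{aligned}
\frac{\partial}{\partial h}\Big(\sigma\sqrt{\mu}\,h+\frac{\alpha}{2\delta}h^2\Big)=0
&\;\Longrightarrow\; h^*(\mu)=-\frac{\delta\sigma}{\alpha}\sqrt{\mu},\\
\sigma\sqrt{\mu}\,h^*(\mu)+\frac{\alpha}{2\delta}\,h^*(\mu)^2
&=-\frac{\sigma^2\delta}{\alpha}\mu+\frac{\sigma^2\delta}{2\alpha}\mu
=-\frac{\sigma^2\delta}{2\alpha}\mu .
\end{aligned}$$

Substituting this minimum back into (4.7) reproduces the term $-\frac{\sigma^2\delta}{2\alpha}\mu$ in (4.8), and evaluating $h^*$ at the DM's best response gives nature's corresponding minimizer.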

As stated in the previous theorem, on the equilibrium path of the game the DM knows nature's best response, that is . Therefore, her posterior process follows (3.7) for the prescribed . Importantly, this means that in equilibrium the DM is no longer concerned about all possible distributions of past shocks. The one picked by nature is known to the DM on the equilibrium path, which gives rise to a unique law of motion for the posterior belief. Note that this does not mean that ambiguity is mitigated on the equilibrium path. It simply means that, as in static decision making, where the ambiguity-averse agent first perceives the worst-case distribution from her set of multiple priors and then responds, here too she forms her belief and reacts based on the worst-case distribution chosen by nature. Henceforth, by $p$ in (4.8) and in the rest of the paper we mean the equilibrium posterior value, often referred to simply as the belief.

Note that the right-hand side of (4.8) is linear in $\mu$. This is in part due to the effect of $\sqrt{\mu}$ as the volatility term in the ambiguous arm. Consequently, the DM's optimal strategy at every point in time is to either explore the ambiguous arm or exploit the safe arm. (The trade-off between exploration and exploitation has been studied in different contexts; for instance, Manso (2011) explains such a trade-off for financial incentives in entrepreneurship.) As a result, the DM's value function satisfies the following variational relation:

$$v(p)=\max\Big\{r,\ m(p)-\frac{\sigma^2\delta}{2\alpha}+\frac{1}{2\delta}\Phi(p)\,v''(p)\Big\} \tag{4.9}$$

In economic terms, $r$ is the DM's reservation value, which can always be achieved regardless of her experimentation strategy. The term $m(p)$ is the expected rate of return from pulling the second arm when the current belief is $p$. The important term in expression (4.9) is $\frac{\sigma^2\delta}{2\alpha}$, which we call the ambiguity cost. Higher ambiguity aversion, which translates to lower $\alpha$, implies a higher cost incurred upon pulling the ambiguous arm. Lastly, $\frac{1}{2\delta}\Phi(p)\,v''(p)$ is the continuation payoff that the DM can expect from holding on to the second arm. We postpone a more elaborate set of analytical results on the value function to the next subsection and instead present the intuition behind the DM's optimal strategy.

###### Lemma 11.

The DM’s optimal allocation choice with ambiguity aversion admits the following representation:

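$$\mu^*(p)=\begin{cases}
1 & \text{if } \dfrac{1}{2\delta}\Phi(p)\,v''(p)-\dfrac{\sigma^2\delta}{2\alpha}>r-m(p),\\[1ex]
\in[0,1] & \text{if } \dfrac{1}{2\delta}\Phi(p)\,v''(p)-\dfrac{\sigma^2\delta}{2\alpha}=r-m(p),\\[1ex]
0 & \text{otherwise.}
\end{cases} \tag{4.10}$$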

This result is the analogue of Lemma 4 in Bolton and Harris (1999), tailored to capture ambiguity aversion. One can think of $r-m(p)$ as the opportunity cost of experimentation that the DM incurs by not choosing the safe arm. Therefore, she selects the second project only when the continuation value of experimentation, adjusted by the ambiguity cost, exceeds its opportunity cost. In particular, whenever the two values match, the DM can pursue a mixed strategy, allocating her resources between the two arms in arbitrary proportions. However, the set of times at which she chooses the mixed strategy has Lebesgue measure zero, precisely because the belief follows a diffusion process and the middle case in (4.10) never happens -a.e. Ambiguity aversion essentially creates a situation in which the DM believes that upon continuation she will face the most destructive shock distribution, which already lowers the value of experimentation. Importantly, this loss is independent of the current belief level and should be viewed as a fixed cost for which an ambiguity-averse agent must be compensated in order to undertake the second project.

## 5 Properties of the value function and comparative statics

In this section we derive a closed-form expression for the value function and present sharp comparative statics with respect to the ambiguity aversion index $\alpha$.

###### Theorem 12.

On the equilibrium path the DM follows a cut-off experimentation strategy. In particular, there exists $\bar p$ such that she selects the safe arm if and only if her posterior belief drops below $\bar p$. Further, the value function is convex on .

A substantive result of convexity is that even in the presence of ambiguity aversion the marginal value of good news about the second project is increasing.

Next, we want to find a closed-form expression for the value function and, in particular, for the cut-off probability $\bar p$. For this we make a technical assumption that turns out to be necessary and sufficient for the existence of $\bar p$ in $(0,1)$. Namely, we exclude the cases where the DM always pulls the second arm and where she never does.

###### Assumption 13.

Define . Then we assume .

As becomes clear later, one can think of as a lower bound on . Therefore essentially means that DM never selects the ambiguous arm. This is due to a combination of two forces, namely a large ratio of safe to ambiguous return — that is the first term in — and high normalized ambiguity cost — that is the second term in — which prevents the DM from exploring the second arm. Assumption 13 not only ensures that , but as it will turn out it implies . Having made this assumption, on exploration region the following differential equation holds:

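$$v(p)=m(p)-\frac{\sigma^2\delta}{2\alpha}+\frac{1}{2\delta}\Phi(p)\,v''(p), \tag{5.1}$$

which has the general solution (Polyanin and Zaitsev (2017), page 547)

$$v(p)=m(p)-\frac{\sigma^2\delta}{2\alpha}+c\,p^{1-\lambda}(1-p)^{\lambda},\quad p\in(\bar p,1]. \tag{5.2}$$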

Here $c$ is a constant determined from the boundary condition and , where . The value-matching (or, equivalently, no-arbitrage) condition implies that the DM should be indifferent between the two arms at $\bar p$. Therefore $v(\bar p)=r$, which yields

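$$v(p)=m(p)-\frac{\sigma^2\delta}{2\alpha}+\Big(r-m(\bar p)+\frac{\sigma^2\delta}{2\alpha}\Big)\frac{p^{1-\lambda}(1-p)^{\lambda}}{\bar p^{\,1-\lambda}(1-\bar p)^{\lambda}},\quad\forall p\in[\bar p,1]. \tag{5.3}$$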

The DM faces a free-boundary problem: she needs to find the optimal cut-off $\bar p$. For this we apply the smooth-pasting condition (Dixit (2013)), which imposes the continuity of the directional derivatives at $\bar p$. Assumption 13, together with some algebra, yields the following expression for the cut-off probability:

$$\bar p=\frac{(\lambda-1)\,\eta}{\lambda-\eta} \tag{5.4}$$

It is positive because , and is less than one again because . This observation now supports making assumption 13.
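For readers who want to trace the algebra behind (5.4), here is a hedged sketch. It assumes a two-point prior, so that $m(p)=p\,\bar\theta+(1-p)\,\underline\theta$, writes $\kappa:=\frac{\sigma^2\delta}{2\alpha}$ for the ambiguity cost (a shorthand introduced here), and takes $\eta=\frac{r-\underline\theta+\kappa}{\bar\theta-\underline\theta}$. This form of $\eta$ is our reading of the two terms described after Assumption 13, not a definition quoted from the text. Since $v\equiv r$ to the left of $\bar p$, smooth pasting amounts to $v'(\bar p)=0$ in (5.3).

$$\begin{aligned}
&\frac{d}{dp}\log\!\big(p^{1-\lambda}(1-p)^{\lambda}\big)=\frac{1-\lambda-p}{p\,(1-p)},\\
&v'(\bar p)=(\bar\theta-\underline\theta)+\big(r-m(\bar p)+\kappa\big)\frac{1-\lambda-\bar p}{\bar p\,(1-\bar p)}=0,\\
&r-m(\bar p)+\kappa=(\bar\theta-\underline\theta)(\eta-\bar p)
\;\Longrightarrow\;
\bar p\,(1-\bar p)=(\eta-\bar p)(\lambda-1+\bar p),\\
&\bar p\,(\lambda-\eta)=(\lambda-1)\,\eta
\;\Longrightarrow\;
\bar p=\frac{(\lambda-1)\,\eta}{\lambda-\eta}.
\end{aligned}$$

Under this reading, $\lambda>1$ and $0<\eta<1$ together give $0<\bar p<1$, because $(\lambda-1)\eta<\lambda-\eta$ is equivalent to $\eta<1$.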

###### Remark 14.

The value function in (5.3) with the prescribed $\bar p$ is continuous, increasing and convex. Therefore, its maximum derivative is attained at , which is bounded above because , so that $v$ is Lipschitz continuous. Hence, $v$ has all the properties required by the verification Theorem 10.

Some comparative statics. The cut-off value $\bar p$ is lower-bounded by . Further, it is increasing in . Expression (5.4) provides a sharp characterization of the cut-off value, and one could perform a number of comparative statics on $\bar p$ with respect to the parameters of the model. Here we point to only two interesting ones. The first, and more important, is the effect of ambiguity on the cut-off value. As the DM becomes more ambiguity averse, namely as $\alpha$ becomes smaller, the value of $\bar p$ increases unambiguously. This confirms our intuition that a more ambiguity-averse DM is more conservative and explores less. Expression (5.4) offers a fine indicator of the extent of this under-exploration. The second channel is the effect of $\bar\theta-\underline\theta$, which represents the range of possible return rates under the second arm. As this range shrinks to zero, the ambiguity cost is amplified more intensely, and the DM has less incentive to pick the second project.
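The comparative static in $\alpha$ can be illustrated numerically. The sketch below evaluates (5.4) and (5.3) under assumptions that are not quoted from the text: $\Phi(p)=\big(\tfrac{\bar\theta-\underline\theta}{\sigma}\big)^2p^2(1-p)^2$ as read off from (3.7), $\lambda$ as the root larger than one of $\lambda(\lambda-1)=2\delta\sigma^2/(\bar\theta-\underline\theta)^2$, $m(p)=p\,\bar\theta+(1-p)\,\underline\theta$, and $\eta=\frac{r-\underline\theta}{\bar\theta-\underline\theta}+\frac{\sigma^2\delta}{2\alpha(\bar\theta-\underline\theta)}$. The function names and parameter values are hypothetical.

```python
import math


def cutoff(r, theta_lo, theta_hi, sigma, delta, alpha):
    """Cut-off belief (5.4) under the assumed forms of lambda and eta."""
    gap = theta_hi - theta_lo
    lam = 0.5 * (1.0 + math.sqrt(1.0 + 8.0 * delta * sigma ** 2 / gap ** 2))
    eta = (r - theta_lo) / gap + sigma ** 2 * delta / (2.0 * alpha * gap)
    if not 0.0 < eta < 1.0:
        raise ValueError("eta outside (0, 1): no interior cut-off")
    return (lam - 1.0) * eta / (lam - eta)


def value(p, r, theta_lo, theta_hi, sigma, delta, alpha):
    """Value function: r below the cut-off, expression (5.3) above it."""
    gap = theta_hi - theta_lo
    lam = 0.5 * (1.0 + math.sqrt(1.0 + 8.0 * delta * sigma ** 2 / gap ** 2))
    p_bar = cutoff(r, theta_lo, theta_hi, sigma, delta, alpha)
    if p <= p_bar:
        return r
    cost = sigma ** 2 * delta / (2.0 * alpha)  # ambiguity cost

    def mean(q):
        # Conditional mean m(q) under the assumed two-point prior.
        return q * theta_hi + (1.0 - q) * theta_lo

    def shape(q):
        # Homogeneous solution q^(1 - lambda) (1 - q)^lambda from (5.2).
        return q ** (1.0 - lam) * (1.0 - q) ** lam

    return mean(p) - cost + (r - mean(p_bar) + cost) * shape(p) / shape(p_bar)


if __name__ == "__main__":
    for a in (4.0, 2.0, 1.0):
        print(a, cutoff(0.5, 0.0, 1.0, 1.0, 0.1, a))
```

Running the loop with decreasing values of `alpha` shows the cut-off rising, in line with the claim that a more ambiguity-averse DM explores less.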

As a last note in this section we address a concern about the entangled effects of $\alpha$ and $\sigma$. One might wonder whether what we refer to as the ambiguity aversion parameter, i.e. $\alpha$, can be absorbed into the volatility $\sigma$, and thus can never be identified separately even with an infinite amount of data. However, this is not the case, as we can offer an identification scheme that disentangles $\alpha$ from $\sigma$. Suppose that all other parameters are identified, namely and . Then a continuous stream of the agent's belief process would let us compute the quadratic variation from (3.7). Further, by spotting the point where she stops exploring and pulls the safe arm, we can back out $\bar p$. These two equations allow us to uniquely identify $\alpha$ and $\sigma$.
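The identification argument can be sketched as a two-step inversion. The code below is an illustration under stated assumptions rather than the author's procedure: it treats $r$, $\underline\theta$, $\bar\theta$ and $\delta$ as the known parameters, reads the quadratic-variation rate $\mu\big(\tfrac{\bar\theta-\underline\theta}{\sigma}\big)^2p^2(1-p)^2$ off (3.7), and reuses the assumed forms of $\lambda$ and $\eta$ described above. The function name `identify_sigma_alpha` is hypothetical.

```python
import math


def identify_sigma_alpha(qv_rate, p, p_bar, r, theta_lo, theta_hi, delta, mu=1.0):
    """Recover (sigma, alpha) from the belief's quadratic-variation rate and the cut-off."""
    gap = theta_hi - theta_lo
    # Step 1: (3.7) gives d<p>_t/dt = mu * (gap / sigma)^2 * p^2 * (1 - p)^2.
    sigma = gap * math.sqrt(mu) * p * (1.0 - p) / math.sqrt(qv_rate)
    # Step 2: invert (5.4) for eta, using the assumed relation
    # lambda * (lambda - 1) = 2 * delta * sigma^2 / gap^2.
    lam = 0.5 * (1.0 + math.sqrt(1.0 + 8.0 * delta * sigma ** 2 / gap ** 2))
    eta = lam * p_bar / (lam - 1.0 + p_bar)
    # Under the assumed form of eta, eta * gap - (r - theta_lo) is the ambiguity cost.
    cost = eta * gap - (r - theta_lo)
    if cost <= 0.0:
        raise ValueError("implied ambiguity cost is not positive")
    alpha = sigma ** 2 * delta / (2.0 * cost)
    return sigma, alpha
```

The first step uses only the observed volatility of beliefs, so $\sigma$ is pinned down before $\alpha$ enters; the cut-off then identifies $\alpha$ through the ambiguity cost.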

## 6 Value of unambiguous information

In this section we study the value of information about which the DM holds no ambiguity. In practice, one can think of a scenario in which the experimentation unit hires an expert to continuously provide her opinion about the true rate of return of the ambiguous arm. Some questions naturally arise in this context. For example, what is the fair price of such a service? How much must the expert be compensated for providing such information? When should an experimentation unit that faces ambiguity hire this expert?

To answer such questions, let $x_t$ be the information that the expert releases at time $t$ about $\theta$, which in its simplest form can be thought of as a noisy signal of $\theta$, namely:

$$dx_t=\theta\,dt+\gamma\,dW_t \tag{6.1}$$

In this expression $W$ is a -Brownian motion under the benchmark measure and is independent of and . Further, $\gamma$ is the constant volatility that represents the level of the DM's confidence in the expert's information. Therefore, the DM can use this signal in addition to the second arm's payoff process to update her belief about $\theta$. Obviously, this new source of information improves the precision of the filtering process, in the sense that it lowers the conditional variance of the estimated $\theta$ at every point in time. The law of motion for the new posterior process in the presence of the unambiguous information source follows the logic of Lemma 6:

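$$dp^h_t=p^h_t\,(1-p^h_t)\,(\bar\theta-\underline\theta)\left[\frac{\sqrt{\mu_t}}{\sigma}\,d\bar B^h_t+\frac{1}{\gamma}\,d\overline{W}_t\right] \tag{6.2}$$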

Here $\bar B^h$ and $\overline{W}$ are independent -Brownian motions under . We can now state the counterpart of Theorem 10 for this case. Its proof is easier, because the candidate solution belongs to the space $C^2$, so we do not need the viscosity solution concept. This is owed to the fact that the diffusion coefficient for is independent of , thereby relaxing the degeneracy that appears when . As a result of the restriction to the space $C^2$, Ito's lemma can be applied directly to the candidate value function, and one can follow the proof of Theorem 10, bypassing the steps that deal with viscosity super- and subsolutions and replacing them with Ito's rule.

###### Proposition 15.

Suppose is the unique solution to the following HJBI equation:

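$$\tilde v(p)=\sup_{\mu\in[0,1]}\,\inf_{h\in\mathbb{R}}\Big\{(1-\mu)\,r+\mu\,m(p)+\sqrt{\mu}\,\sigma h+\frac{\alpha}{2\delta}h^2+\frac{1}{2\delta}\big(\mu\,\Phi(p;\sigma)+\Phi(p;\gamma)\big)\tilde v''(p)\Big\} \tag{6.3}$$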

In that . Then, is indeed the value function in presence of unambiguous information . In the equilibrium, the worst-case density generator is , where is the DM’s best response solving:

 ~v(p)=supμ∈[0,1]{(1−μ)r+μm(p)−σ2δ2αμ+1