1 Introduction
In recent years, data-driven sequential decision-making has received a lot of attention and has found a wide range of applications in operations management, such as dynamic inventory control (see, e.g., Huh et al. (2011), Chen and Plambeck (2008), Chen et al. (2019b, a), Lei et al. (2019)), dynamic pricing (see, e.g., Besbes and Zeevi (2009, 2015), Wang et al. (2014), Chen et al. (2019c), Broder and Rusmevichientong (2012)), and dynamic assortment optimization (see, e.g., Rusmevichientong and Topaloglu (2012), Saure and Zeevi (2013), Agrawal et al. (2019), Wang et al. (2018), Chen et al. (2018)). Take personalized/contextual dynamic pricing as an example: it is usually assumed that the underlying demand, which is a function of the price and the customer’s contextual information, follows a certain probabilistic model with unknown parameters. Over a finite selling horizon of length $T$, one customer arrives at each time period. The seller observes the characteristics of the customer and makes the price decision. The arriving customer then makes the purchase decision based on the posted price. The seller observes the purchase decision, updates her knowledge about the demand model, and might change the pricing policy accordingly for future customers. The key challenge in dynamic pricing is to accurately estimate the underlying model parameter in the demand function, which is then used to determine future prices. Existing literature on dynamic pricing only constructs a point estimator of the underlying model parameter, i.e., it estimates the parameter by a single number or vector, without quantifying the uncertainty of the estimator. Uncertainty quantification is very useful for practitioners. It is highly desirable for the seller to obtain confidence intervals of the underlying demand function, which are guaranteed to cover the true demand function with probability $1-\alpha$ (also known as the confidence level, e.g., $1-\alpha = 0.95$).

Although the construction of confidence intervals is a classical topic in statistics (Stigler 2002), existing results in the statistical literature mainly deal with independent and non-adaptive data. Sequentially collected data behave quite differently from independent data. In particular, in the (contextual) dynamic pricing problem, both the decision (e.g., the price) and the collected customers’ contextual information at each time period are adaptive and heavily correlated with information obtained in previous periods. Due to this sequential dependence, estimators computed from adaptively collected data might have severe distributional bias even as the sample size goes to infinity (Deshpande et al. 2018, 2019). Such a bias invalidates the classical approach of constructing confidence intervals (e.g., Wald’s test; see Chapter 17 of Keener (2010)).
The main goal of our paper is to construct a debiased estimator that is asymptotically normal, centered at the true model parameter, with a simple covariance matrix structure. Based on the proposed debiased estimator, we construct both pointwise confidence intervals (i.e., confidence intervals valid for any given decision variable (price) and contextual information) and uniform confidence intervals (i.e., confidence intervals uniformly valid for all decision variables and contextual information). To highlight our main idea, we consider the problem of constructing confidence intervals for the demand function in dynamic pricing, which is one of the most important data-driven sequential decision problems in revenue management.
In particular, we study a stylized personalized dynamic pricing model with $T$ selling periods. At each selling period $t\in\{1,\dots,T\}$, a potential customer arrives with an observable personal context vector $x_t$. Instead of assuming that $\{x_t\}$ are independent across time periods as in the existing literature (e.g., Chen et al. (2015), Miao et al. (2019)), we allow $x_t$ to depend on information from previous selling periods. This is a more practical scenario, since a customer’s contextual information might be heavily correlated with previous prices and realized demands. For example, consecutive time periods of lower posted prices or higher demands may attract new customers from a different population, whose contextual information differs from that of previous customers. After observing the contextual information $x_t$ of the arriving customer, the seller decides the price $p_t$, and the customer then decides on a realized demand $y_t$. We assume the demand of the arriving customer follows a general probabilistic model,
$$y_t = f_{\theta_0}(p_t, x_t) + \varepsilon_t, \qquad (1)$$
where $f_\theta(\cdot,\cdot)$ is a parametric function parameterized by $\theta\in\mathbb{R}^d$ with a known form (e.g., linear or logistic), $\theta_0$ is an unknown parameter vector that models the demand behaviors, and $\{\varepsilon_t\}$ are zero-mean, conditionally independent (conditioning on $p_t$ and $x_t$) noise variables. A typical objective of the retailer is to maximize his/her expected revenue, or more specifically
$$\max_{\{p_t\}}\ \mathbb{E}\Big[\sum_{t=1}^{T} p_t\, f_{\theta_0}(p_t, x_t)\Big],$$
without knowing the model $f_{\theta_0}$ a priori. In this paper, our goal is to construct confidence intervals for both the true model parameter $\theta_0$ and the underlying demand function $f_{\theta_0}$ (see the definitions in Sec. 1.1).
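The demand model in Eq. (1) is very general and covers two widely used demand models: the linear model and the logistic model. In the linear model, $y_t$ is modeled as
$$y_t = \langle \phi(p_t, x_t), \theta_0 \rangle + \varepsilon_t, \qquad (2)$$
where $\phi(\cdot,\cdot)$ is a known feature map for the price and contextual information, and $\{\varepsilon_t\}$ are noise variables. In the logistic regression model, $y_t\in\{0,1\}$ is a binary demand realized according to the logistic model
$$\Pr[y_t = 1 \mid p_t, x_t] = f_{\theta_0}(p_t, x_t) = \frac{\exp(\langle \phi(p_t, x_t), \theta_0\rangle)}{1 + \exp(\langle \phi(p_t, x_t), \theta_0\rangle)}. \qquad (3)$$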
For example, Qiang and Bayati (2016) and Miao et al. (2019) consider a special case of the feature map, where $\phi(p, x) = (p, x)$ is the concatenation of the price $p$ and the contextual vector $x$.
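As a minimal sketch (not code from the paper), the two demand models (2) and (3) and the per-customer revenue $p\,f_\theta(p,x)$ can be written with NumPy, assuming the concatenation feature map $\phi(p,x) = (p, x)$ and a hypothetical finite price grid standing in for the price range; all function names are illustrative.

```python
import numpy as np

def feature_map(p, x):
    """Concatenation feature map phi(p, x) = (p, x) (an assumed special case)."""
    return np.concatenate(([p], np.atleast_1d(np.asarray(x, dtype=float))))

def linear_demand(theta, p, x):
    """Linear model (2): f_theta(p, x) = <phi(p, x), theta>."""
    return float(feature_map(p, x) @ theta)

def logistic_demand(theta, p, x):
    """Logistic model (3): purchase probability exp(z) / (1 + exp(z)), z = <phi(p, x), theta>."""
    z = float(feature_map(p, x) @ theta)
    return 1.0 / (1.0 + np.exp(-z))

def best_price(demand, theta, x, price_grid):
    """Maximize the expected revenue p * f_theta(p, x) over a finite grid of prices."""
    revenues = [p * demand(theta, p, x) for p in price_grid]
    return float(price_grid[int(np.argmax(revenues))])
```

For instance, with $\theta = (-1, 10)$ and the scalar context $x = 1$, the linear demand is $10 - p$, so `best_price(linear_demand, theta, [1.0], np.linspace(0.0, 10.0, 101))` returns a price of 5.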
In contextual dynamic pricing models, two dependency or feedback structures are essential to model the pricing dynamics in practice. The first feedback structure is that the retailer, after observing a sequence of customers’ purchasing activities, could leverage his/her knowledge or estimates of the unknown model to offer more profitable pricing decisions. In other words, the prices sequentially decided by the retailer are statistically correlated with the purchasing activities of prior customers. The second feedback structure involves the types (reflected in the context vectors $x_t$) of arriving customers, which could well depend on historical prices (e.g., consistently high prices might attract more affluent customers) and on the realized demands in previous selling periods. Hence, the context vectors are statistically correlated with the prices and demands in previous time periods.
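Now we rigorously formulate the above-mentioned feedback structures. A contextual dynamic pricing model can be written as $\mathcal{M} = (T, f_{\theta_0}, \phi, \mathcal{P}, \{\Gamma_t\}_{t=1}^{T})$, where $T$ is the time horizon, $f_{\theta_0}$ is the unknown regression model, $\phi$ is the feature map, $\mathcal{P}$ is the price range, and $\{\Gamma_t\}$ characterizes the underlying context generation procedure, such that $x_t = \Gamma_t(x_1, p_1, y_1, \dots, x_{t-1}, p_{t-1}, y_{t-1}; \xi_t)$, where $\xi_t$ is a certain random variable. A contextual dynamic pricing algorithm/strategy over $T$ time periods can be written as $\pi = \{\pi_t\}_{t=1}^{T}$, where $\pi_t$ is a function mapping the history of prior selling periods to the offered price for the incoming customer at time $t$, i.e., $p_t = \pi_t(x_1, p_1, y_1, \dots, x_{t-1}, p_{t-1}, y_{t-1}, x_t; \zeta_t)$; here $\zeta_t$ is another random variable. The functions $\{\Gamma_t\}$ and $\{\pi_t\}$ capture the two feedback structures mentioned in the previous paragraph, where both $x_t$ and $p_t$ are statistically correlated with $\{(x_\tau, p_\tau, y_\tau)\}$ in prior selling periods $\tau < t$.

1.1 Our contribution: uncertainty quantification in sequentially collected data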
The main objective of this paper is to quantify the uncertainty of the demand function learned from purchase data collected under dynamically and adaptively chosen prices and contexts. Namely, we construct two types of confidence intervals for the underlying demand function $f_{\theta_0}(p,x)$, pointwise confidence intervals and uniform confidence intervals, which are introduced as follows.
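For a prespecified confidence level $1-\alpha$ at the end of $T$ time periods, where $\alpha\in(0,1)$ is usually a small constant such as 0.1 or 0.05, our goal is to construct lower and upper confidence interval edges $\ell_\alpha(p,x)$ and $u_\alpha(p,x)$ such that, for any given price $p$ and context $x$,
$$\lim_{T\to\infty}\Pr\big[\ell_\alpha(p,x) \le f_{\theta_0}(p,x) \le u_\alpha(p,x)\big] = 1-\alpha. \qquad (4)$$
The confidence interval in (4) is known as the pointwise confidence interval since it holds for a fixed price $p$ and context vector $x$.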
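In many applications, we are also interested in confidence intervals with uniform coverage. More specifically, for a predetermined confidence level $1-\alpha$, the edges $\ell_\alpha$ and $u_\alpha$ should satisfy, for all $\alpha\in(0,1)$, that
$$\lim_{T\to\infty}\Pr\big[\ell_\alpha(p,x) \le f_{\theta_0}(p,x) \le u_\alpha(p,x),\ \ \forall\, p\in\mathcal{P},\ x\in\mathcal{X}\big] = 1-\alpha, \qquad (5)$$
where $\mathcal{X}$ is a certain compact subset of Euclidean space serving as the domain of all context vectors.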
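Because (4) and (5) differ only in where the quantifier over $(p,x)$ sits, a simulation study can estimate both coverage probabilities from the same intervals. The sketch below is illustrative: it assumes the domain is replaced by a finite grid of $(p,x)$ points (an approximation of "for all $p$ and $x$"), and the helper name is hypothetical.

```python
import numpy as np

def empirical_coverage(lower, upper, truth):
    """Monte Carlo coverage of confidence bands on a finite grid of (p, x) points.

    lower, upper: arrays of shape (n_trials, n_points) with interval edges
    computed in independent simulation trials; truth: shape (n_points,) with
    the true demand f_theta0(p, x) on the same grid.
    Returns (pointwise coverage per grid point, uniform coverage over the grid).
    """
    lower, upper, truth = map(np.asarray, (lower, upper, truth))
    covered = (lower <= truth) & (truth <= upper)  # truth broadcasts over trials
    pointwise = covered.mean(axis=0)               # estimates the probability in (4)
    uniform = covered.all(axis=1).mean()           # estimates the probability in (5)
    return pointwise, float(uniform)
```

Uniform coverage can never exceed the smallest pointwise coverage on the grid, which is why uniform intervals must be wider to reach the same level $1-\alpha$.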
To construct these confidence intervals, we also provide confidence intervals for the true model parameter $\theta_0$, which might be of independent interest in practice.
As mentioned, the main difficulty in constructing these confidence intervals lies in the two dependency structures of the prices and contexts. In contrast to the non-adaptive case, where the maximum likelihood estimator (MLE) is asymptotically unbiased and normal, the MLE based on adaptive data may have a significant distributional bias. In the next subsection, we briefly discuss two popular contextual dynamic pricing algorithms in the literature to better illustrate the adaptive data collection process. We also explain in Sec. 3 why the classical construction of confidence intervals fails in our problem.
1.2 Online policies for contextual dynamic pricing
We mention two popular online policies for the contextual dynamic pricing problem.
The $\epsilon$-greedy policy.
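An $\epsilon$-greedy policy (Watkins 1989) has a parameter $\epsilon\in(0,1)$ to balance the trade-off between exploration and exploitation. At each selling period $t$, with probability $\epsilon$, a price $p_t\in\mathcal{P}$ is selected uniformly at random for exploration. With probability $1-\epsilon$, the exploitation price is set based on the current estimate $\hat\theta_{t-1}$:
$$p_t = \arg\max_{p\in\mathcal{P}}\ p\, f_{\hat\theta_{t-1}}(p, x_t), \qquad \hat\theta_{t-1} = \arg\min_{\theta\in\Theta}\ \sum_{\tau=1}^{t-1}\rho(\theta; p_\tau, x_\tau, y_\tau) + \lambda\|\theta\|_2^2, \qquad (6)$$
where $\hat\theta_{t-1}$ is the regularized empirical-risk minimizer (ERM) using sales data from prior selling periods and $\lambda\ge 0$ is a regularization parameter. Here $\rho(\cdot)$ is a certain risk function depending on the particular class of the underlying demand model $f_\theta$. For example, for the linear demand model, the least-squares function is commonly used:
$$\rho(\theta; p, x, y) = \tfrac{1}{2}\big(y - \langle \phi(p,x), \theta\rangle\big)^2.$$
For the logistic demand model, the negative log-likelihood function is often adopted,
$$\rho(\theta; p, x, y) = -y\log f_\theta(p,x) - (1-y)\log\big(1 - f_\theta(p,x)\big).$$
More generally, a common choice of $\rho$ is the negative log-likelihood of the demand model. In principle, the risk function should be selected such that the underlying true model $\theta_0$ minimizes the function in expectation. Detailed assumptions on $\rho$ will be given in Sec. 2.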
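The $\epsilon$-greedy loop with the least-squares ERM (6) can be sketched as follows. This is an illustrative simulation, not the paper’s experimental setup: it assumes the linear model (2) with $\phi(p,x) = (p, x)$, a scalar context whose mean drifts with the previous posted price (a hypothetical instance of the context feedback $\Gamma_t$), Gaussian noise, and a finite price grid. Because the least-squares ERM has a closed form, the sketch keeps the running sums $\sum_\tau \phi_\tau\phi_\tau^\top$ and $\sum_\tau y_\tau\phi_\tau$ instead of refitting from scratch.

```python
import numpy as np

def ridge_erm(A, b, lam):
    """Minimizer of sum_tau 0.5*(y_tau - <phi_tau, theta>)^2 + lam*||theta||^2,
    given A = sum_tau phi_tau phi_tau^T and b = sum_tau y_tau phi_tau."""
    return np.linalg.solve(A + 2.0 * lam * np.eye(A.shape[0]), b)

def run_eps_greedy(theta0, T, eps, lam, price_grid, rng):
    """Simulate epsilon-greedy pricing on a linear demand model (hypothetical setup)."""
    d = len(theta0)
    A, b = np.zeros((d, d)), np.zeros(d)
    theta_hat = np.zeros(d)
    prev_price = float(np.mean(price_grid))
    prices = np.empty(T)
    for t in range(T):
        # Toy context feedback: x_t depends on the previous posted price.
        x_t = 1.0 + 0.1 * prev_price + rng.normal(scale=0.1)
        if rng.random() < eps:
            p_t = float(rng.choice(price_grid))                # exploration
        else:
            revenue = [p * (theta_hat @ np.array([p, x_t])) for p in price_grid]
            p_t = float(price_grid[int(np.argmax(revenue))])   # exploitation, as in (6)
        phi_t = np.array([p_t, x_t])
        y_t = phi_t @ theta0 + rng.normal(scale=0.5)           # linear demand (2)
        A += np.outer(phi_t, phi_t)
        b += y_t * phi_t
        theta_hat = ridge_erm(A, b, lam)                       # refit (6) with the new sale
        prices[t] = p_t
        prev_price = p_t
    return theta_hat, prices
```

For example, `run_eps_greedy(np.array([-1.0, 10.0]), 2000, 0.1, 1.0, np.linspace(1.0, 9.0, 17), np.random.default_rng(0))` returns an estimate near $(-1, 10)$; note that every posted price and every context in the loop depends on earlier sales, which is exactly the adaptivity discussed above.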
The Upper Confidence Bound (UCB) policy.
In the UCB policy (or more specifically the LinUCB policy for linear or generalized linear contextual bandits (Rusmevichientong and Tsitsiklis 2010, Filippi et al. 2010, Abbasi-Yadkori et al. 2012)), a regularized estimate $\hat\theta_{t-1}$ is computed at every selling period as in (6). Afterwards, an offered price is selected to maximize the revenue evaluated at an upper bound of the demand function, or more specifically
$$p_t = \arg\max_{p\in\mathcal{P}}\ p\,\big[f_{\hat\theta_{t-1}}(p, x_t) + \Delta_t(p, x_t)\big], \qquad (7)$$
where $\Delta_t(p,x)$ is a certain form of confidence bound such that $|f_{\hat\theta_{t-1}}(p,x) - f_{\theta_0}(p,x)| \le \Delta_t(p,x)$ with high probability for all $p$ and $x$, where $\theta_0$ is the underlying true model parameter. We refer readers to the works of Abbasi-Yadkori et al. (2012), Rusmevichientong and Tsitsiklis (2010), and Filippi et al. (2010) for different variants of $\Delta_t$ in linear and generalized linear contextual bandits.
While the UCB policy naturally constructs “upper confidence bounds”, these bounds are inadequate for predicting reasonable demand ranges because they yield intervals too wide to be useful. In fact, confidence bounds in UCB are constructed using concentration inequalities whose constants are far from tight. Given the prespecified confidence level $1-\alpha$, our goal is to construct demand confidence intervals that have statistically accurate coverage as defined in (4) and (5), allowing potential users to understand exactly the range of expected demands at certain confidence levels.
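For the linear model (2), one common form of the bonus in (7), written here as $\Delta_t(p,x) = \beta\,\sqrt{\phi(p,x)^\top V_{t-1}^{-1}\phi(p,x)}$ with $V_{t-1} = \lambda I_d + \sum_{\tau<t}\phi_\tau\phi_\tau^\top$, is in the spirit of LinUCB. The sketch below assumes that form, the feature map $\phi(p,x) = (p,x)$, and a finite price grid; $\beta$ is a hypothetical tuning constant rather than a calibrated value from the cited works. It also makes the point above concrete: the width is driven by $\beta$, which concentration-based analyses choose conservatively.

```python
import numpy as np

def ucb_price(theta_hat, V, x, price_grid, beta):
    """Price rule (7) with a LinUCB-style bonus beta * sqrt(phi^T V^{-1} phi).

    theta_hat: current estimate from (6); V: regularized Gram matrix
    lam * I + sum_tau phi_tau phi_tau^T; x: context vector; beta: assumed constant.
    """
    V_inv = np.linalg.inv(V)
    best_p, best_val = None, -np.inf
    for p in price_grid:
        phi = np.concatenate(([p], np.atleast_1d(x)))
        width = beta * np.sqrt(phi @ V_inv @ phi)   # Delta_t(p, x)
        val = p * (phi @ theta_hat + width)         # revenue under the optimistic demand
        if val > best_val:
            best_p, best_val = float(p), val
    return best_p
```

With `beta=0.0` the rule reduces to the greedy price; increasing `beta` pushes the choice toward prices whose demand is less certain.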
1.3 Related works
Data-driven sequential decision-making has been extensively studied for revenue and inventory management problems with unknown or changing environments. In most existing literature, effective online policies are developed to maximize revenues. However, how to provide accurate confidence intervals for the key underlying probabilistic model parameters (e.g., demand function or utility parameters) has not been well explored. Recently, Ban (2020) considered the construction of confidence intervals (for the demand functions) in an inventory control model. Compared to the approach proposed in this paper, Ban (2020) derives asymptotic normality of certain SAA strategies, while our approach debiases general empirical-risk minimizers, so that the constructed confidence intervals are applicable to a wide range of online policies, such as $\epsilon$-greedy, upper confidence bounds or Thompson sampling. Technically, the limiting distributions in Ban (2020) were established using Stein’s method, while our proposed approach is inspired by one-step estimators in asymptotic statistics (Van der Vaart 2000).

Recently, debiased estimators have been extensively investigated for high-dimensional penalized estimation (Van de Geer et al. 2014, Zhang and Zhang 2014, Javanmard and Montanari 2014, Wang et al. 2019), since regularization (e.g., the $\ell_1$ penalty in Lasso (Tibshirani 1996)) leads to bias in the estimator. However, these works only deal with non-adaptively collected data and thus cannot be applied to our setting. The recent works of Deshpande et al. (2018, 2019) applied the debiasing approach to construct confidence intervals from adaptively collected data, including multi-armed and linear contextual bandit problems. While Deshpande et al. (2018, 2019) mainly focus on linear models, this paper provides confidence intervals for general parametric models $f_\theta$. The extension to general parametric model classes poses some unique technical challenges, such as the sequential estimation of Fisher’s information matrix. Further details are given in Sec. 4.

1.4 Notations and paper organization
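Throughout this paper we adopt the following asymptotic notations. For sequences $\{a_n\}$ and $\{b_n\}$, we write $a_n = o(b_n)$ or $a_n \ll b_n$ if $\lim_{n\to\infty} a_n/b_n = 0$; we write $a_n = O(b_n)$ or $a_n \lesssim b_n$ if $\limsup_{n\to\infty} |a_n|/b_n < \infty$.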
The rest of the paper is organized as follows. In Sec. 2 we list the assumptions made in this paper, including a discussion of why they are useful and relevant; in Sec. 3 we review the classical Wald’s intervals for constructing confidence intervals and explain why this classical approach fails in contextual dynamic pricing problems; in Sec. 4 we propose the debiased approach and demonstrate, through both theoretical and empirical analysis, that our proposed confidence intervals are accurate in dynamic pricing. Finally, in Sec. 5 we conclude the paper by mentioning several directions for future research. Proofs of some technical lemmas are deferred to the supplementary material.
2 Models and Assumptions
In this section we state the assumptions imposed throughout this paper. Most of them are standard in the literature on dynamic pricing and contextual bandits. There are, however, a few additional assumptions made for the specific purpose of building accurate confidence intervals, which are common in the statistical literature.
2.1 Assumptions on the demand model
We first list assumptions on the underlying demand function $f_\theta(p,x)$ (i.e., the mean of the demand), as well as on the underlying true parameter $\theta_0$.

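(A1) For all $t$, $p_t\in\mathcal{P}$ and $x_t\in\mathcal{X}$ for some compact sets $\mathcal{P}$ and $\mathcal{X}$, and $\theta_0\in\Theta$ for some known compact parameter class $\Theta\subseteq\mathbb{R}^d$;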

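(A2) The demand function $f_\theta(p,x)$ is continuously differentiable with respect to $\theta$, and furthermore $|f_\theta(p,x)|$ and $\|\nabla_\theta f_\theta(p,x)\|_2$ are uniformly bounded for all $\theta\in\Theta$ and $(p,x)\in\mathcal{P}\times\mathcal{X}$;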
Assumptions (A1) and (A2) assert that both the context vectors and the unknown model parameter are bounded, and furthermore that the known demand function satisfies basic smoothness properties. In particular, they imply that the expected demands are bounded and cannot be arbitrarily large. The two examples $f_\theta(p,x) = \langle \phi(p,x), \theta\rangle$ (linear regression model) and $f_\theta(p,x) = \exp(\langle\phi(p,x),\theta\rangle)/(1+\exp(\langle\phi(p,x),\theta\rangle))$ (logistic regression model) satisfy both conditions, provided that the feature map $\phi$ is bounded.

2.2 Assumptions on the noise variables
Recall that the noise variable $\varepsilon_t$ is defined as
$$\varepsilon_t = y_t - f_{\theta_0}(p_t, x_t), \qquad (8)$$
which is the difference between the realized demand and its (conditional) expectation. We list assumptions on the noise variables across the selling periods.

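(B1) Conditioned on $\{(p_t, x_t)\}$, the noise variables $\{\varepsilon_t\}_{t=1}^{T}$ are independent, centered and bounded sub-Gaussian random variables;

(B2) There exists a known variance function $\sigma^2(p, x; \theta)$ such that
$$\mathbb{E}\big[\varepsilon_t^2 \mid p_t, x_t\big] = \sigma^2(p_t, x_t; \theta_0) \qquad (9)$$
for all $t$, and $\sigma^2(p,x;\theta)$ is Lipschitz continuous with respect to $\theta$ and bounded from above and away from zero.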
In the above assumptions, (B1) is a standard assumption that the noise variables are all centered and sub-Gaussian with light tails, conditioned on the offered price $p_t$ and the context vector $x_t$. (B2) imposes further assumptions on the variance of the noise variables. In particular, it assumes that the conditional variance of $\varepsilon_t$ (conditioned on $p_t$ and $x_t$) is bounded, never zero, and smooth. Such an assumption is useful in demand models that are inherently heteroscedastic. For example, in the logistic demand model where $y_t$ is a Bernoulli variable with $\Pr[y_t = 1 \mid p_t, x_t] = f_{\theta_0}(p_t, x_t)$, it is easy to verify that $\sigma^2(p,x;\theta) = f_\theta(p,x)\big(1 - f_\theta(p,x)\big)$, and all conditions in Assumption (B2) hold.

2.3 Assumptions on the risk function
The empirical risk minimization problem in Eq. (6) is the workhorse of our model estimates $\hat\theta$. As discussed, popular risk functions include the least-squares loss function $\rho(\theta; p, x, y) = \frac{1}{2}\big(y - f_\theta(p,x)\big)^2$ and the negative log-likelihood function $\rho(\theta; p, x, y) = -\log \Pr_\theta(y \mid p, x)$. Below we list the assumptions imposed on the risk function $\rho$ so that the ERM estimates satisfy the desired properties.
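(C1) The risk function $\rho(\theta; p, x, y)$ is three times continuously differentiable with respect to $\theta$, and furthermore $\|\nabla_\theta^2\rho(\theta;p,x,y)\|_{\mathrm{op}}$ and $\|\nabla_\theta^3\rho(\theta;p,x,y)\|_{\mathrm{op}}$ are uniformly bounded for all $\theta\in\Theta$ and $(p,x,y)$;

(C2) For all $p\in\mathcal{P}$ and $x\in\mathcal{X}$, $\mathbb{E}\big[\nabla_\theta\rho(\theta_0; p, x, y) \mid p, x\big] = 0$;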
Here in Assumption (C1), $\nabla_\theta^3\rho$ is a symmetric third-order tensor, and its operator norm is defined as $\|\nabla_\theta^3\rho\|_{\mathrm{op}} = \sup_{\|u\|_2=\|v\|_2=\|w\|_2=1}|\nabla_\theta^3\rho[u,v,w]|$. For the linear demand model and the least-squares loss, Assumption (C1) is implied by the boundedness of $\phi$; for other parametric models (e.g., the logistic regression model) and the negative log-likelihood loss, Assumption (C1) is a standard condition used in the analysis of maximum likelihood estimators. Finally, Assumption (C2) means that the true model parameter $\theta_0$ is a stationary point of the expected loss function $\mathbb{E}[\rho(\theta;p,x,y)\mid p,x]$, which is satisfied by both the least-squares loss and the negative log-likelihood loss. In the statistical literature, $\nabla_\theta\rho(\theta;p,x,y)$ is known as (the negative of) the score function, whose expectation is zero under $\theta_0$.
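To make (B2), (C1) and (C2) concrete for the logistic model with the negative log-likelihood loss, the sketch below (an illustration with hypothetical function names) uses the standard identities $\nabla_\theta\rho = (f_\theta(p,x) - y)\,\phi(p,x)$ and $\nabla^2_\theta\rho = f_\theta(1-f_\theta)\,\phi\phi^\top$, where $f_\theta = \exp(\langle\phi,\theta\rangle)/(1+\exp(\langle\phi,\theta\rangle))$. Averaging the gradient over $y\sim\mathrm{Bernoulli}(f_{\theta_0}(p,x))$ gives $(f_\theta - f_{\theta_0})\,\phi$, which vanishes at $\theta = \theta_0$, and the Hessian’s scalar factor $f_\theta(1-f_\theta)$ is the Bernoulli variance discussed under (B2).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll_grad(theta, phi, y):
    """Gradient in theta of rho = -y*log f - (1-y)*log(1-f), f = sigmoid(<phi, theta>)."""
    return (sigmoid(phi @ theta) - y) * phi

def nll_hessian(theta, phi):
    """Hessian f*(1-f)*phi phi^T; it does not depend on y."""
    f = sigmoid(phi @ theta)
    return f * (1.0 - f) * np.outer(phi, phi)

def expected_score(theta, theta0, phi):
    """E[grad rho(theta; p, x, y) | p, x] when y ~ Bernoulli(f_theta0(p, x))."""
    f0 = sigmoid(phi @ theta0)
    return f0 * nll_grad(theta, phi, 1.0) + (1.0 - f0) * nll_grad(theta, phi, 0.0)
```

Since each Hessian is a rank-one positive semidefinite matrix, averaging over many prices and contexts is what makes the expected Hessian strictly positive definite.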
2.4 Assumptions on the contextual pricing model
Finally, we state an assumption on the behavior of the contexts under the contextual pricing model $\mathcal{M}$.

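(D1) There exists a positive constant $\kappa$ such that, for any selling period $t$ and filtration $\mathcal{F}_{t-1}$ generated by $\{(x_\tau, p_\tau, y_\tau)\}_{\tau<t}$, it holds that $\lambda_{\min}\big(\mathbb{E}[\nabla_\theta f_\theta(p_t,x_t)\nabla_\theta f_\theta(p_t,x_t)^\top \mid \mathcal{F}_{t-1}]\big) \ge \kappa$ and $\lambda_{\min}\big(\mathbb{E}[\nabla_\theta^2\rho(\theta; p_t, x_t, y_t) \mid \mathcal{F}_{t-1}]\big) \ge \kappa$ for all $\theta\in\Theta$, where $p_t$ and $x_t$ could potentially depend on $\mathcal{F}_{t-1}$.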
Assumption (D1) concerns two quantities: the (expected) outer product of demand gradients $\mathbb{E}[\nabla_\theta f_\theta\nabla_\theta f_\theta^\top \mid \mathcal{F}_{t-1}]$, which by definition is always positive semidefinite, and the (expected) Hessian of the loss function $\mathbb{E}[\nabla_\theta^2\rho \mid \mathcal{F}_{t-1}]$, which can in principle be any symmetric matrix but is in general positive semidefinite for common loss functions such as least squares or negative log-likelihoods. Assumption (D1) then essentially assumes that both quantities are positive definite in a “strict” sense, by lower bounding the smallest eigenvalues of both by a positive constant $\kappa$. Since both expectations are conditioned upon the adaptively chosen prices and context vectors, Assumption (D1) requires the lower bound on the smallest eigenvalues to hold for any such prices/contexts chosen in prior selling periods. Finally, we remark that the exact value of $\kappa$ does not need to be known, as it is only used in the theoretical analysis of the validity of the confidence intervals constructed by our proposed algorithm.

3 Limitation of Classical Wald’s Intervals
In classical parametric statistics with i.i.d. data points, the Wald’s interval is a standard approach for building asymptotic confidence intervals from maximum likelihood estimates. In this section, we review the Wald’s interval in the context of contextual dynamic pricing, and discuss why this classical method cannot be directly applied because of the feedback structures present in our problem.
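Suppose that after $T$ selling periods the offered prices, purchase activities and customers’ context vectors are $\{(p_t, y_t, x_t)\}_{t=1}^{T}$. Let $\hat\theta$ be the maximum likelihood estimate
$$\hat\theta = \arg\max_{\theta\in\Theta}\ \sum_{t=1}^{T}\log \Pr_\theta(y_t \mid p_t, x_t), \qquad (10)$$
which is equivalent to Eq. (6) with $\lambda = 0$ and $\rho(\theta; p, x, y) = -\log\Pr_\theta(y \mid p, x)$. Using classical statistical theory (see, e.g., Van der Vaart (2000)), if $\{(p_t, y_t, x_t)\}_{t=1}^{T}$ are statistically independent, then under mild regularity conditions it holds that
$$\hat I_T^{1/2}(\hat\theta - \theta_0) \xrightarrow{d} \mathcal{N}(0, I_d), \qquad (11)$$
where $\hat I_T = \sum_{t=1}^{T}\nabla_\theta^2\rho(\hat\theta; p_t, x_t, y_t)$ is the sample Fisher’s information matrix. With Eq. (11), using the delta method (which asserts that if $\sqrt{n}(X_n - \mu) \xrightarrow{d} \mathcal{N}(0, \Sigma)$, then $\sqrt{n}(g(X_n) - g(\mu)) \xrightarrow{d} \mathcal{N}(0, \nabla g(\mu)^\top\Sigma\nabla g(\mu))$ for any $g$ differentiable at $\mu$; see, for example, Van der Vaart (2000)), we have for fixed $(p, x)$ that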
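$$\frac{f_{\hat\theta}(p,x) - f_{\theta_0}(p,x)}{\sqrt{\nabla_\theta f_{\hat\theta}(p,x)^\top \hat I_T^{-1}\nabla_\theta f_{\hat\theta}(p,x)}} \xrightarrow{d} \mathcal{N}(0, 1). \qquad (12)$$
A confidence interval on $f_{\theta_0}(p,x)$ can then be constructed as
$$\Big[f_{\hat\theta}(p,x) - z_{\alpha/2}\sqrt{\nabla_\theta f_{\hat\theta}(p,x)^\top \hat I_T^{-1}\nabla_\theta f_{\hat\theta}(p,x)},\ \ f_{\hat\theta}(p,x) + z_{\alpha/2}\sqrt{\nabla_\theta f_{\hat\theta}(p,x)^\top \hat I_T^{-1}\nabla_\theta f_{\hat\theta}(p,x)}\Big], \qquad (13)$$
where $z_{\alpha/2} = \Phi^{-1}(1-\alpha/2)$ is the $(1-\alpha/2)$-quantile of a standard normal random variable $Z\sim\mathcal{N}(0,1)$ and $\Phi$ denotes the cumulative distribution function of $Z$, i.e., $\Phi(z) = \Pr[Z \le z]$.

While the Wald’s interval is a general-purpose and the most classical approach to constructing confidence intervals, one of the key assumptions in its construction is statistical independence among the data collected across selling periods $t = 1, \dots, T$. It is known that, without such independence, the Wald’s interval can be significantly biased, as in the case of multi-armed bandit predictions (Deshpande et al. 2018) and least-squares estimation in non-mixing time series (Lai and Wei 1982).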
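For the logistic model, the classical recipe (10) to (13) can be sketched as follows: Newton’s method for the MLE, the observed information $\hat I_T = \sum_t f_{\hat\theta}(1-f_{\hat\theta})\,\phi_t\phi_t^\top$, and the delta-method interval with $\nabla_\theta f_\theta(p,x) = f_\theta(1-f_\theta)\,\phi(p,x)$. This is an illustrative implementation that inherits the i.i.d. assumption behind the Wald construction; applying it unchanged to adaptively collected data is exactly what the discussion above warns against. The Newton iteration has no step-size control and assumes the data are not separable; the rows of `Phi` are $\phi(p_t,x_t)^\top$.

```python
import numpy as np
from statistics import NormalDist

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_mle(Phi, y, iters=50):
    """MLE (10) for the logistic model via plain Newton steps (lambda = 0)."""
    theta = np.zeros(Phi.shape[1])
    for _ in range(iters):
        f = sigmoid(Phi @ theta)
        grad = Phi.T @ (f - y)                        # gradient of the negative log-likelihood
        H = (Phi * (f * (1 - f))[:, None]).T @ Phi     # sample Fisher information at theta
        theta = theta - np.linalg.solve(H, grad)
    return theta

def wald_interval(Phi, y, phi_new, alpha=0.05):
    """Wald interval (13) for the demand at a new feature vector phi_new = phi(p, x)."""
    theta_hat = logistic_mle(Phi, y)
    f = sigmoid(Phi @ theta_hat)
    I_hat = (Phi * (f * (1 - f))[:, None]).T @ Phi
    f_new = sigmoid(phi_new @ theta_hat)
    g = f_new * (1 - f_new) * phi_new                 # gradient of f_theta at theta_hat
    se = np.sqrt(g @ np.linalg.solve(I_hat, g))       # delta-method standard error, as in (12)
    z = NormalDist().inv_cdf(1 - alpha / 2)           # z_{alpha/2}
    return f_new - z * se, f_new + z * se
```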


To better illustrate the failure of Wald’s interval for adaptively collected data, Figure 1 plots the empirical distributions of the normalized estimation error $\hat I_T^{1/2}(\hat\theta - \theta_0)$ and of the normalized errors of predicted demands for the Wald’s interval approach. In particular, we consider a simple logistic demand model with a fixed true parameter $\theta_0$, a fixed price range $\mathcal{P}$, and a specifically designed context generating process. The empirical distributions are obtained from 5000 independent trials, each with $T$ selling periods and prices determined by the LinUCB algorithm described in Sec. 1.2. The top panels in Figure 1 depict the distributions of two coordinates of $\hat I_T^{1/2}(\hat\theta - \theta_0)$, and the bottom panels show normalized demand prediction errors for three different price-context pairs $(p, x)$. One can easily see that, in contextual dynamic pricing, the distributions of both the estimation errors and the prediction errors (of demands) deviate significantly from the desired limiting distributions $\mathcal{N}(0, I_d)$ (see (11)) and $\mathcal{N}(0, 1)$ (see (12)), calling for more sophisticated methods to construct accurate confidence intervals.
4 Main Algorithm and Analysis
The pseudocode of our proposed algorithm for constructing confidence intervals of the demand function is given in Algorithm 1.
At a high level, the objective of Algorithm 1 is to construct accurate confidence intervals in both the “pointwise” sense (i.e., confidence intervals for the expected demand in (4) for a single price $p$ and context $x$) and the “uniform” sense (i.e., confidence intervals in (5) for $f_{\theta_0}$ that hold uniformly over all possible prices and contexts). The input to Algorithm 1 is the historical price, context, and demand data over $T$ selling periods, during which an adaptive dynamic pricing strategy is used. The adaptivity of the pricing strategy means that the demands and prices are highly correlated, and therefore the basic Wald’s intervals cannot be directly applied, as discussed in the previous section.
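The key idea behind our proposed approach is to “debias” the empirical risk estimate $\hat\theta$ (also termed the “pilot” estimate in Algorithm 1). More specifically, built upon the biased pilot estimate $\hat\theta$, we construct a “whitening” matrix $W = [w_1, \dots, w_T]\in\mathbb{R}^{d\times T}$ satisfying certain correlation and norm conditions (the procedure for constructing such a whitening matrix is presented in Algorithm 2 and Sec. 4.3), and a debiased estimate $\hat\theta^d$ is computed by adding a bias-correction term to the ERM estimate $\hat\theta$, or more specifically
$$\hat\theta^d = \hat\theta + \sum_{t=1}^{T} w_t\,\hat\varepsilon_t = \hat\theta + W\hat\varepsilon, \qquad (14)$$
where $\hat\varepsilon = (\hat\varepsilon_1, \dots, \hat\varepsilon_T)^\top$ and $\hat\varepsilon_t = y_t - f_{\hat\theta}(p_t, x_t)$, $t = 1, \dots, T$. For example, in the linear demand case, $\hat\varepsilon_t = y_t - \langle\phi(p_t,x_t), \hat\theta\rangle$, while in the logistic case, $\hat\varepsilon_t = y_t - \exp(\langle\phi(p_t,x_t),\hat\theta\rangle)/(1+\exp(\langle\phi(p_t,x_t),\hat\theta\rangle))$. With this bias correction, it can be proved that the bias contained in $\hat\theta^d$ is dominated by the main error terms, which are asymptotically normal, as shown in Theorem 4.1 later. With the asymptotic normality of $\hat\theta^d$, both pointwise and uniform confidence intervals can be constructed using either the delta method as in Eq. (12) or Monte Carlo methods, as shown in Steps 1 and 15 of Algorithm 1.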
(15) 
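Given a pilot estimate and a whitening matrix, the correction (14) is a single matrix-vector product. The sketch below takes $W$ as an input, since its construction (Algorithm 2, Sec. 4.3) is not reproduced here. For the interval it assumes, purely for illustration, the plug-in covariance $W\,\mathrm{diag}(\hat\sigma_1^2,\dots,\hat\sigma_T^2)\,W^\top$ of $W\varepsilon$ combined with the delta method; the exact steps of Algorithm 1 may differ. Function names are hypothetical, and `demand(Phi, theta)` is assumed to evaluate $f_\theta$ row-wise on a matrix of features.

```python
import numpy as np
from statistics import NormalDist

def debias(theta_hat, W, Phi, y, demand):
    """Debiased estimate (14): theta_d = theta_hat + sum_t w_t * eps_hat_t,
    where eps_hat_t = y_t - f_{theta_hat}(p_t, x_t) and w_t is column t of W (d x T)."""
    eps_hat = y - demand(Phi, theta_hat)
    return theta_hat + W @ eps_hat

def pointwise_interval(theta_d, W, sigma2, phi_new, demand, demand_grad, alpha=0.05):
    """Illustrative delta-method interval around f_{theta_d}(p, x).

    Assumes Cov(theta_d) is approximated by W diag(sigma2) W^T, where sigma2[t]
    is a plug-in estimate of the noise variance sigma^2(p_t, x_t; theta).
    """
    cov = (W * sigma2) @ W.T                     # scales column t of W by sigma2[t]
    g = demand_grad(phi_new, theta_d)
    half = NormalDist().inv_cdf(1 - alpha / 2) * np.sqrt(g @ cov @ g)
    center = float(demand(phi_new[None, :], theta_d)[0])
    return center - half, center + half
```

As a sanity check of the formula, in the linear model with the non-adaptive choice $W = (\Phi^\top\Phi)^{-1}\Phi^\top$ (here $\Phi$ denotes the design matrix `Phi`), the debiased estimate equals ordinary least squares regardless of the pilot, because $\hat\theta + W(y - \Phi\hat\theta) = Wy$.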
In the rest of this section we provide a rigorous analysis of the proposed confidence intervals in Algorithm 1. In Sec. 4.1, we perform a bias-variance decomposition of the debiased estimate $\hat\theta^d$ and prove in Theorem 4.1 that, under certain conditions, $\hat\theta^d$ is asymptotically normally distributed; in Sec. 4.2 we upper bound the estimation error of the pilot estimate $\hat\theta$; and in Sec. 4.3 we propose a procedure for constructing the whitening matrix $W$ such that the conditions in Theorem 4.1 are satisfied. Finally, in Sec. 4.4 we prove that both the pointwise and uniform confidence intervals asymptotically attain the nominal level $1-\alpha$, theoretically establishing the accuracy of the constructed intervals.

4.1 Analysis of the debiased estimator
In Step 1 of Algorithm 1, a debiased estimate $\hat\theta^d$ is constructed based on the biased ERM estimate $\hat\theta$ and a certain “whitening matrix” $W$. In this section we analyze the asymptotic distributional properties of $\hat\theta^d$ under certain conditions on $W$. How to obtain a whitening matrix satisfying the desired conditions is discussed in Sec. 4.3.
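For notational simplicity, for each $t$ we denote by $\tilde g_t = \nabla_\theta f_{\tilde\theta_t}(p_t, x_t)$ the gradient at time $t$ evaluated at some point $\tilde\theta_t$ on the line segment between $\theta_0$ and $\hat\theta$ (obtained from a Taylor expansion in the proof below), and write $\varepsilon = (\varepsilon_1, \dots, \varepsilon_T)^\top$, recalling the definition of $\varepsilon_t$ in (8). The following lemma shows a bias-variance decomposition of $\hat\theta^d - \theta_0$.

Lemma 4.1. The estimation error can be decomposed as $\hat\theta^d - \theta_0 = b + v$, where the bias term satisfies $b = \big(I_d - \sum_{t=1}^{T} w_t\tilde g_t^\top\big)(\hat\theta - \theta_0)$ almost surely, and the variance term is $v = W\varepsilon = \sum_{t=1}^{T} w_t\varepsilon_t$.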
Proof.
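Proof of Lemma 4.1. Recall the definition $\hat\theta^d = \hat\theta + W\hat\varepsilon$, where $W = [w_1, \dots, w_T]$ and $\hat\varepsilon_t = y_t - f_{\hat\theta}(p_t, x_t)$. By (1) and (8), $\hat\varepsilon_t = \varepsilon_t + f_{\theta_0}(p_t, x_t) - f_{\hat\theta}(p_t, x_t)$. Subsequently,
$$\hat\theta^d - \theta_0 = (\hat\theta - \theta_0) + \sum_{t=1}^{T} w_t\big(f_{\theta_0}(p_t,x_t) - f_{\hat\theta}(p_t,x_t)\big) + \sum_{t=1}^{T} w_t\varepsilon_t.$$
Next, by Taylor expansion and the smoothness of $f$ (see Assumption (A2)), we have for every $t$ that $f_{\hat\theta}(p_t,x_t) - f_{\theta_0}(p_t,x_t) = \tilde g_t^\top(\hat\theta - \theta_0)$. Hence, $\sum_{t} w_t\big(f_{\theta_0}(p_t,x_t) - f_{\hat\theta}(p_t,x_t)\big) = -\sum_{t} w_t\tilde g_t^\top(\hat\theta - \theta_0)$. We then have
$$\hat\theta^d - \theta_0 = \Big(I_d - \sum_{t=1}^{T} w_t\tilde g_t^\top\Big)(\hat\theta - \theta_0) + W\varepsilon = b + v,$$
which completes the proof. ∎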
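In the linear model, $\nabla_\theta f_\theta(p,x) = \phi(p,x)$ does not depend on $\theta$, so the decomposition of Lemma 4.1 reduces to the exact identity $\hat\theta^d - \theta_0 = (I_d - \sum_t w_t\phi_t^\top)(\hat\theta - \theta_0) + W\varepsilon$ with $\phi_t = \phi(p_t, x_t)$. The sketch below checks it numerically for an arbitrary pilot estimate and an arbitrary whitening matrix; the random inputs are synthetic and only exercise the algebra, and the helper name is hypothetical.

```python
import numpy as np

def linear_bias_variance(theta_hat, theta0, W, Phi, eps):
    """Bias and variance terms for the linear model:
    b = (I - sum_t w_t phi_t^T)(theta_hat - theta0), v = W eps.
    Phi stacks phi_t^T as rows, so sum_t w_t phi_t^T equals W @ Phi."""
    d = len(theta0)
    bias = (np.eye(d) - W @ Phi) @ (theta_hat - theta0)
    var = W @ eps
    return bias, var

rng = np.random.default_rng(3)
T, d = 200, 3
Phi = rng.normal(size=(T, d))
theta0 = rng.normal(size=d)
eps = rng.normal(scale=0.5, size=T)
y = Phi @ theta0 + eps                                # linear model (2)
theta_hat = theta0 + rng.normal(scale=0.1, size=d)    # any pilot estimate
W = rng.normal(scale=0.01, size=(d, T))               # any whitening matrix
theta_d = theta_hat + W @ (y - Phi @ theta_hat)       # debiased estimate (14)
bias, var = linear_bias_variance(theta_hat, theta0, W, Phi, eps)
print(np.allclose(theta_d - theta0, bias + var))      # True
```

The bias vanishes exactly when $\sum_t w_t\phi_t^\top = I_d$, which is why the conditions of Theorem 4.1 are stated in terms of how close $W$ comes to this identity.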
Our next theorem shows that, when the bias term $b$ is sufficiently small, the (suitably normalized) error $\hat\theta^d - \theta_0$ converges in distribution to a multivariate Gaussian distribution.

Theorem 4.1. Suppose the following conditions hold:

The non-anticipativity condition: the $t$-th column of $W$, $w_t$, is measurable with respect to $\mathcal{F}_{t-1}$;

as ;

Let be a diagonal matrix with ;
(16)
Then it holds that
(17) 
We note that the first and second items, and the condition on $W$ in the third item, are all related to the whitening matrix $W$, and will be satisfied by our construction of $W$ in Sec. 4.3 (see Lemma 4.3 and Corollary 2). The convergence rate condition on the pilot estimator $\hat\theta$ in the third item will be verified in the next subsection (Sec. 4.2) using tools from self-normalized empirical processes.
The proof of Theorem 4.1 involves two main steps. The first step shows that, under the non-anticipativity condition imposed on $W$, the variance term $v$ converges in distribution to a normal distribution, using martingale central limit theorem (CLT)-type arguments. The second step shows that the bias term $b$ is asymptotically dominated by $v$, and therefore the entire estimation error converges in distribution to a normal distribution. The complete proof is given below.
Proof.
Proof of Theorem 4.1. Adopt the decomposition $\hat\theta^d - \theta_0 = b + v$ in Lemma 4.1. By definition, $v = W\varepsilon = \sum_{t=1}^{T} w_t\varepsilon_t$. For every $t$ define $\xi_t = w_t\varepsilon_t$ and $S_t = \sum_{\tau=1}^{t}\xi_\tau$. Because $\mathbb{E}[\xi_t \mid \mathcal{F}_{t-1}] = 0$ by the non-anticipativity condition, we know that $\{S_t\}_{t=1}^{T}$ is a martingale. The following lemma shows how the characteristic functions of the (normalized) martingale converge to the characteristic function of a Gaussian random vector.

Lemma 4.2. Let $Z$ be a fresh sample from the standard $d$-dimensional Gaussian distribution. Define also . Then for any , , it holds that

The proof of Lemma 4.2 is based on standard Fourier-analytic approaches (Billingsley 2008, Lai and Wei 1982, Brown 1971), and is deferred to the supplementary material. Lemma 4.2 shows that the characteristic function of converges pointwise to the characteristic function of , provided that as . By Lévy’s continuity theorem, this implies , or more specifically
(18) 
Because and is treated as a constant in this paper, the third condition in Theorem