Consider a monopolist that prices and sells a variety of products over time. Accounting for the impact of prices on demand can greatly improve revenue. This dependence is often complex, with purchase decisions influenced not only by prevailing prices but also price histories. For example, purchases can be triggered by price reductions or prices of alternatives. In this paper we develop an approach to learning such behavioral patterns through setting prices and observing sales, with a goal of maximizing cumulative revenue over the course of many product life cycles.
Efficient learning calls for a thoughtful balance between maximizing revenue generated by a current product and probing to gain information that can be leveraged to increase subsequent revenue. Reference effects, by which we mean dependencies of current demand on past prices, bring substantial complexity to this so-called exploration-exploitation dilemma. First, there can be ambiguity as to whether purchases are triggered by the current price or some relation to past prices. To disambiguate, an effective learning algorithm must attribute delayed consequences to inter-temporal pricing decisions. Second, exploration entails coordinated selection of price sequences; independent selection of spot prices may not suffice. This is because demand may respond favorably to particular price histories, and probing appropriately selected price sequences can be required to efficiently learn that.
Despite the considerable complexity of this problem, we provide what to our knowledge is the first tractable systematic approach. We proceed by framing the problem as one of reinforcement learning and then, for particular classes of demand models, developing a computationally efficient learning algorithm based on Thompson sampling. To offer some assurance of statistical efficiency, we establish a bound on expected regret. With respect to the number of past products , the bound grows as
, which indicates that per-period expected regret vanishes over time. The bound applies very broadly across model classes, as it depends on the Kolmogorov and eluder dimensions, which are statistics that quantify model complexity in relation to data requirements for model fitting and for exploration, respectively. In particular, the dominant term in our bound grows with the geometric mean of the two notions of dimension. We also present simulation results that demonstrate strong performance relative to less sophisticated exploration schemes.
For the sake of exposition, most of our discussion will focus on a simplified setting in which the firm sells indistinguishable products in sequence under unchanging market conditions, discontinuing each product before launching the next, and with each product marketed over a fixed number of time periods. The price can be adjusted in each of these time periods, and the demand for a product in any given period depends on its prevailing and previous prices. In Section 5, we explain how algorithms and results can be extended to treat more complex models that capture important features of realistic problems. This includes models with covariates that capture distinguishing features of products and varying market conditions and that allow for simultaneous pricing and sales of multiple products with overlapping life cycles of varying duration.
There is a substantial literature on pricing with reference effects. Mazumdar et al. (2005) provides a comprehensive survey that covers both behavioral research that provides evidence for and examines the structure of reference effects, and methodological research on how pricing strategies should respond. Strategies for particular model classes have been developed in Greenleaf (1995), Kopalle et al. (1996), Fibich et al. (2003), Ahn et al. (2007), Popescu and Wu (2007), and Heidhues and Kőszegi (2014). However, these papers treat the problem of pricing given known demand models, with no learning required. Electronically mediated markets, the increasing availability of data, and advances in the field of machine learning have fueled a vast and growing literature on learning to price. We refer the reader to den Boer (2015) for a comprehensive review of the literature and research directions. To our knowledge, our work is the first to treat learning in the presence of reference effects.
The fact that pricing decisions can result in delayed consequences has presented an obstacle to the development of efficient algorithms that learn to price effectively with reference effects. In this paper, we offer a new approach through framing the problem as one of reinforcement learning and bringing to bear recent developments in the application of Thompson sampling to such problems. It is worth noting, however, that we are not the first to apply Thompson sampling to a pricing problem. In particular, Ferreira et al. (2015) considers an approach based on Thompson sampling to address a multiproduct pricing problem with resource constraints, though without reference effects.
Our pricing strategy is based on the posterior sampling reinforcement learning (PSRL) algorithm, originally proposed by Strens (2000). Building on analysis techniques developed for Thompson sampling, regret analyses for PSRL are developed in Osband et al. (2013) and Osband and Van Roy (2014). In principle, results of Osband and Van Roy (2014) apply to the problem we consider in this paper, but the associated regret bound depends on a Lipschitz constant that is not clear how to characterize in our context. We instead build directly on the technical tools of Russo and Van Roy (2014) and Osband and Van Roy (2014) to derive custom regret bounds for PSRL in our pricing model.
The rest of this paper is organized as follows. In Section 2, we formulate a dynamic pricing problem that addresses reference effects. In Section 3, we propose Thompson Pricing (TP) as a heuristic strategy for the problem and provide a general regret bound in Theorem 3.2 of that section. To carry out a more concrete study, in Section 4, we consider the special case of linear demand. For this context, we specialize and interpret regret bounds and present computational results that illustrate merits of TP. Section 5 discusses how our dynamic pricing model and algorithm can be generalized to accommodate complexities arising in practical settings, such as observation of covariates that inform demand forecasts, coordinated pricing across multiple products, and pricing of products with overlapping sales seasons. Section 6 presents an analysis of TP in a general setting that accommodates various aforementioned complexities, leading to the main technical result of the paper (Theorem 6.2). We offer concluding remarks in Section 7.
2 Problem Formulation
In this section, we formulate a dynamic pricing problem and highlight the role played by reference effects. To facilitate exposition of core ideas that we will develop in the paper, our model leaves out many complexities that may be required to adequately address practical contexts. In Section 5, we will discuss how the model and ideas can be extended to accommodate some such complexities.
Consider a monopolist selling indistinguishable products over a sequence of sales seasons under unchanging market conditions. We will think of each sales season as an episode of interaction between the monopolist and the consumer market. Let each episode last for time periods. At the start of each time period, the monopolist sets a price, observes random demand, and collects revenue. We assume that the monopolist faces no supply constraint, so that all demands are met. As an illustration, one might think of the monopolist as a seller of coats that are sold over the autumn and winter who adjusts the price each week. In this case, each episode is a six-month period and each period lasts a week.
At the start of each period of an episode , the seller sets a price and observes demand
, which, conditioned on information available when the price is set, is log-normally distributed with parameters and . Hence, denotes expected demand, while represents uncertainty. The expected revenue upon setting the price is given by .
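The following is a minimal sketch of how demand with a prescribed expectation can be simulated. It assumes a particular parameterization that is not recoverable from the text above: the underlying normal has mean log(d) - sigma^2/2 and standard deviation sigma, which is the choice that makes d the mean of the log-normal variable. The function name and the numerical values are hypothetical.

```python
import numpy as np

def sample_demand(expected_demand, sigma, rng, size=None):
    """Draw log-normal demand whose mean equals expected_demand.

    Assumption: the log of demand is normal with mean
    log(expected_demand) - sigma**2 / 2 and standard deviation sigma.
    """
    mu = np.log(expected_demand) - 0.5 * sigma**2
    return rng.lognormal(mean=mu, sigma=sigma, size=size)

rng = np.random.default_rng(0)
draws = sample_demand(expected_demand=10.0, sigma=0.5, rng=rng, size=200_000)
sample_mean = draws.mean()  # close to 10.0 for a large sample
```

The correction term -sigma^2/2 matters: without it, the mean of the simulated demand would exceed the intended expected demand by a factor of exp(sigma^2/2).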
The expected demand for a product may depend on factors such as the quality of the item, the price, and consumer behavior. If the monopolist does not understand these dependencies a priori, in order to identify an optimal price, he must learn through experimentation.
2.1 Memoryless Demand
In the simplest case, one might assume that expected demand depends only on the current price; that is for some function , where is an unknown parameter representing what the monopolist does not know about demand structure. Given knowledge of , the monopolist should maintain a constant price over time. However, a monopolist may have to learn through experimentation. This calls for a pricing strategy that balances between exploration and exploitation and converges over time to an optimal price. This can be viewed as a structured bandit learning problem, and several pricing algorithms have been proposed for variations of the problem Ferreira et al. (2015), Kincaid and Darling (1963), Gallego and Van Ryzin (1994, 1997), Bitran et al. (1998), Besbes and Zeevi (2009), Araman and Caldentey (2009), Farias and Van Roy (2010), Besbes and Zeevi (2012), Lobo and Boyd (2003).
In the memoryless demand model we have described, the way in which consumers respond to a current price does not depend on price history. In reality, reference effects play a substantial role in purchase decisions Mazumdar et al. (2005). For example, offering a discount often increases demand not only because the new price is low, but also because it is lower than the previous price. Black Friday and Cyber Monday sales constitute well-known examples of this phenomenon. While such reference effects occur naturally, they cannot be learned using memoryless demand models. A broader class of dynamic models is called for, as well as more sophisticated strategies involving strategic sequencing of prices to maximize cumulative revenue. We now discuss such a model.
2.2 Reference Effects
We consider a model in which expected demand over a time period may depend not only on current price but also on previous prices within the current episode. Here, is a parameter that represents duration of memory in the demand model. In any period of episode , we consider the state of the demand model to be
which represents the -step price history of the product. Then, the expected demand at period is taken to be
for a demand function , which depends on an unknown parameter . Dependence of the expected demand on the state captures reference effects.
Note that prices in environments with reference effects bear delayed consequences: a price does not only influence immediate but also future demand. Therefore, given knowledge of , an optimal pricing strategy does not fix a constant price, as would be the case with a memoryless demand model, but rather, plans a sequence of prices that vary over periods of the episode. Such a sequence generates expected revenue
over the episode, where is the state at the start of period , under this price sequence. Therefore, an optimal price sequence is given by
Note that this optimization problem can be solved via dynamic programming. We write
to indicate that is the solution of the associated dynamic program applied to the above optimization problem with environment variable and horizon . We also provide an argument for a price constraint ; if then each price is constrained to the interval .
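To make the planning step concrete, the following sketch solves the episode-level problem by dynamic programming over a finite price grid. It is an illustration under stated assumptions, not the procedure used in the paper: prices are restricted to a grid, the state is the tuple of the most recent prices in the episode, and `toy_demand` is a hypothetical demand function with a one-step reference effect.

```python
from functools import lru_cache

def plan_prices(demand, price_grid, horizon, memory):
    """Return (best expected revenue, price sequence) for one episode.

    demand(price, state) gives expected demand, where state is a tuple of
    up to `memory` most recent prices in the episode (oldest first).
    """
    grid = tuple(price_grid)

    @lru_cache(maxsize=None)
    def value(period, state):
        if period == horizon:
            return 0.0, ()
        best = None
        for p in grid:
            # Append the new price and keep only the last `memory` prices.
            next_state = (state + (p,))[-memory:] if memory > 0 else ()
            future, tail = value(period + 1, next_state)
            total = p * demand(p, state) + future
            if best is None or total > best[0]:
                best = (total, (p,) + tail)
        return best

    return value(0, ())

def toy_demand(price, state):
    # Hypothetical model: demand rises when the price drops below the last price.
    reference = state[-1] if state else price
    return max(0.0, 10.0 - price + 0.5 * (reference - price))

grid = [2.0, 4.0, 6.0, 8.0]
best_revenue, best_prices = plan_prices(toy_demand, grid, horizon=3, memory=1)
```

Memoizing on the pair (period, state) is what keeps the computation from enumerating every price sequence; the number of distinct states grows with the grid size raised to the memory length rather than to the horizon.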
Since the environment is unknown to the seller, he must experiment to learn the demand model while earning revenue. A pricing strategy is a sequence of policies such as to be executed in consecutive episodes, where policy is a (possibly random) function of the history observed prior to episode . We will assess the performance of a pricing strategy in terms of its cumulative regret over episodes defined by
With reference effects, the agent has the ability to influence purchasing behavior by exploiting the manner in which consumers react to price trajectories. As such, the learning problem involves learning how to influence consumer behavior, which is a deeper issue than that addressed when learning with a memoryless demand model.
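As a small illustration of the performance measure, the sketch below accumulates regret across episodes. It assumes the usual definition, in which the regret of an episode is the gap between the expected revenue of an optimal price sequence and the expected revenue of the policy actually applied; the function name and the numbers are hypothetical.

```python
def cumulative_regret(optimal_revenue, policy_revenues):
    """Running sum of per-episode gaps to the optimal expected revenue."""
    regret, total = [], 0.0
    for revenue in policy_revenues:
        total += optimal_revenue - revenue
        regret.append(total)
    return regret

# Expected revenue of the applied policy in four successive episodes.
curve = cumulative_regret(100.0, [70.0, 85.0, 95.0, 100.0])
# curve == [30.0, 45.0, 50.0, 50.0]
```

A strategy that eventually learns an optimal price sequence produces a curve that flattens, as in this example, so that per-episode regret vanishes.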
3 Thompson Pricing
In this section, we present Thompson Pricing (TP), a heuristic strategy that learns to price with reference effects, and bound its cumulative regret.
3.1 The Algorithm
With TP, the seller begins with a prior distribution over the unknown parameter . Let
denote the history of observations made prior to the start of the th episode. Based on this history, the agent generates a sample from the posterior distribution and then, treating this sample as truth, computes the policy
The resulting policy is applied through episode . After episode , the posterior distribution is updated based on observations made over the episode, and the process repeats. A more precise description of TP is provided as Algorithm 1.
Note that TP determines the price trajectory for the entire episode at the start of each episode. Each trajectory can probe the market to reveal consequences not only of individual prices but of a price sequence. Sampling from the posterior distribution over models trades off between exploiting what has been learned and exploring the unknown.
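The episode-level loop described in this subsection can be sketched generically as follows. This is a skeleton under assumptions, not a transcription of Algorithm 1: the four callables (`sample`, `plan`, `run_episode`, `update`) are hypothetical placeholders for posterior sampling, dynamic programming, market interaction, and the posterior update.

```python
def thompson_pricing(belief, sample, plan, run_episode, update, num_episodes):
    """Generic posterior-sampling loop over episodes.

    belief       -- representation of the current posterior
    sample       -- belief -> sampled parameter
    plan         -- sampled parameter -> list of prices for the whole episode
    run_episode  -- list of prices -> list of observed demands
    update       -- (belief, prices, demands) -> new belief
    """
    history = []
    for _ in range(num_episodes):
        theta_hat = sample(belief)      # draw a model from the posterior
        prices = plan(theta_hat)        # treat the sample as truth and plan
        demands = run_episode(prices)   # apply the whole trajectory
        belief = update(belief, prices, demands)
        history.append((prices, demands))
    return belief, history
```

Note that the sample is drawn once per episode and the entire price trajectory is committed to before any demand in that episode is observed, which is what distinguishes this loop from period-by-period resampling.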
3.2 Regret Bound
In principle, TP can be applied with any distribution over demand functions, though computational requirements vary greatly depending on the problem class. We now provide a general regret bound that applies broadly. In subsequent sections, we specialize TP and its regret bound to more specific problem classes that admit efficient computation.
Let denote the support of the prior distribution . In each time period, the state of the system is characterized by prices quoted so far within the episode. As such, the state space is given by . The demand function maps the current price and state to expected demand. Hence, the set of possible demand functions is given by
We assume that the range of the demand function is bounded. There exists such that, for all , , and , .
We will provide a regret bound that applies to any class of demand functions. The dependence of regret on the class of demand functions can be characterized by statistics that reflect suitable notions of complexity. The regret bound that we will provide depends on two such statistics.
Let be a collection of functions, each mapping a set to . For all , let denote the -covering number of with respect to the supremum norm. Let be a collection of functions, each mapping a set to . The Kolmogorov dimension of , denoted by , is
The Kolmogorov dimension is a notion of complexity commonly used to quantify the number of data samples required to avoid statistical overfit. Sample complexity results in statistical learning that build on this concept typically apply to contexts where data is generated by a stationary source. In our pricing problem, data is not produced by an exogenous stationary source but rather through probing actions that hone in on an optimal price sequence. To bound sample requirements in such a context, a new notion of complexity is called for. To serve this need, we will use the eluder dimension, as introduced in Russo and Van Roy (2014). To define this, we begin with a notion of dependence. Let be a collection of functions, each mapping a set to . An element is -dependent on with respect to if any pair of functions satisfying also satisfies . Further, is -independent of with respect to if is not -dependent on . Intuitively, an action is independent of if two functions that make similar predictions at can nevertheless differ significantly in their predictions at . This concept suggests the following notion of dimension. Let be a collection of functions, each mapping a set to . The -eluder dimension is the length of the longest sequence of elements of such that, for some , every element is -independent of its predecessors.
and, in an asymptotic notation,
Theorem 3.2 provides an upper bound on the regret of TP applied to an arbitrary class of demand functions. The regret bound established in this theorem depends on the geometric mean of Kolmogorov and eluder dimensions. Furthermore, this regret bound is increasing in the maximum price and demand uncertainty . The proof of this theorem is quite involved and presented in Section 6. In fact, the analysis of Section 6 addresses a more general problem that allows for covariates and multiproduct pricing. Theorem 3.2 follows immediately from the more general result.
4 A Linear Demand Model
In this section, we study the special case of a linear demand model, in which
are unknown parameters of the demand function. Note that the state vector increases in length over the first time periods, and represent coefficient vectors that multiply state vectors of different lengths. The model includes a total of unknown parameters, which can be encoded in terms of a vector . We consider a normal prior distribution over with mean and covariance matrix .
Thanks to conjugacy properties of normal distributions, the posterior distribution of after any number of episodes remains normal. To specify the update rules for posterior means and covariances, let us define a few auxiliary variables. Given observations gathered over the ’th episode, define
and for , let
Then, the mean and covariance matrix of the posterior distribution are updated according to
At the start of each ’th episode, TP samples from the prevailing posterior distribution and applies the policy
throughout the episode. Note that the state evolves deterministically over the episode; that is, the state in any time period is determined by the state in the previous time period and the selected price. With these linear dynamics and linear demand model, the dynamic program reduces to a quadratic optimization problem which can be solved efficiently, as carried out by Algorithm 2.
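For readers who want a concrete picture of the conjugate update performed at the end of each episode, the sketch below gives the textbook posterior update for a linear model with a normal prior. It is stated under assumptions and is not claimed to reproduce the paper's update equations: observations are modeled as a feature vector times the parameter plus Gaussian noise of known variance, and the feature construction from prices and states is left to the caller.

```python
import numpy as np

def gaussian_posterior_update(mean, cov, features, targets, noise_var):
    """Posterior of theta given y = features @ theta + Gaussian noise.

    mean, cov -- prior mean (k,) and covariance (k, k)
    features  -- one row of regressors per observation, shape (n, k)
    targets   -- observed responses, shape (n,)
    noise_var -- assumed known noise variance
    """
    Phi = np.atleast_2d(features)
    y = np.asarray(targets, dtype=float)
    prior_precision = np.linalg.inv(cov)
    post_precision = prior_precision + Phi.T @ Phi / noise_var
    post_cov = np.linalg.inv(post_precision)
    post_mean = post_cov @ (prior_precision @ mean + Phi.T @ y / noise_var)
    return post_mean, post_cov

rng = np.random.default_rng(0)
theta_true = np.array([1.0, -2.0, 0.5])
Phi = rng.normal(size=(500, 3))
y = Phi @ theta_true + 0.1 * rng.normal(size=500)
post_mean, post_cov = gaussian_posterior_update(
    np.zeros(3), 10.0 * np.eye(3), Phi, y, noise_var=0.01
)
```

Because the posterior stays normal, the same function can be applied episode after episode, feeding the returned mean and covariance back in as the next prior.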
To illustrate successive steps of Algorithm 2, we introduce some notation. Given an matrix and for , we let denote the ’th element of . For , we take to be the submatrix of consisting of the elements in column and rows to . Similarly for , denotes the submatrix of consisting of the elements in row and columns to . By (3) and (12), the expected revenue of policy in an episode is
where , for . Now, let be an matrix which satisfies
for all ,
for all ,
all other entries of are equal to 0.
Given , (15) can be expressed as
where with a slight abuse of notation, is treated as an dimensional vector. Therefore, given the sampled parameter vector , the dynamic program in (14) is equivalent to
The quadratic optimization problem in (17) can be solved efficiently via standard convex optimization tools provided that the matrix is negative semi-definite. On the other hand, if is not negative semi-definite, the optimization problem in (17) is NP-hard in general. While is not guaranteed to be negative semi-definite for all realizations of , for appropriate values of the prior mean (for example, with the mean of being large and negative), would be negative semi-definite with high probability. From a practical point of view, at the start of each episode, sampling from the posterior distribution can be repeated until the sampled results in a negative semi-definite . Algorithm 3 describes the specialization of TP to the described linear environment.
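The resampling safeguard can be sketched as follows. The exact structure of the matrix in the quadratic form is not recoverable from the text, so `toy_matrix` is a hypothetical stand-in (own-price coefficient on the diagonal, one-step reference coefficient on the first subdiagonal), and the function names and numerical values are assumptions. The check uses the symmetric part of the matrix, since a quadratic form depends on its matrix only through that part.

```python
import numpy as np

def is_negative_semidefinite(A, tol=1e-10):
    """Check concavity of p -> p @ A @ p via the symmetric part of A."""
    sym = 0.5 * (A + A.T)
    return bool(np.linalg.eigvalsh(sym).max() <= tol)

def sample_until_nsd(mean, cov, build_matrix, rng, max_tries=1000):
    """Resample from N(mean, cov) until the induced matrix is NSD."""
    for _ in range(max_tries):
        theta = rng.multivariate_normal(mean, cov)
        if is_negative_semidefinite(build_matrix(theta)):
            return theta
    raise RuntimeError("no negative semi-definite sample found")

def toy_matrix(theta, horizon=3):
    # Hypothetical structure for illustration only.
    beta, gamma = theta
    return beta * np.eye(horizon) + gamma * np.eye(horizon, k=-1)

rng = np.random.default_rng(0)
theta = sample_until_nsd(
    mean=np.array([-2.0, 0.3]), cov=0.01 * np.eye(2),
    build_matrix=toy_matrix, rng=rng,
)
```

Rejecting samples in this way means the planner no longer draws from the exact posterior, so it is best viewed as a practical device for keeping each episode's optimization convex.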
In addition to TP, let us consider two other pricing strategies. First, consider a seller who is agnostic to reference effects and adopts a simple pricing strategy based on Thompson sampling which is suitable for memoryless demands. Specifically, such a seller assumes that the expected demand at period in episode is
for some unknown and considers a normal prior distribution over with mean and covariance matrix . At the beginning of each period in episode , is sampled from the prevailing posterior distribution and the price
is set for the product throughout the period. Upon observing the random demand at the end of this period, the posterior distribution remains normal with its mean and covariance updated via
The above memoryless pricing strategy, which performs near optimally in memoryless environments Ferreira et al. (2015), fails drastically in the presence of reference effects. This failure can be attributed to ignoring the reference effects. With the above misspecified demand model, the seller does not take the delayed consequences of prices into account, while the optimal pricing strategy takes advantage of this phenomenon. As a result, the memoryless pricing strategy is not able to learn the optimal pricing strategy in an environment with reference effects.
As a second pricing strategy, consider a seller who assumes the demand model of (12), but instead of TP employs a weak version of Thompson sampling as follows. At the beginning of each period within an episode, a model is sampled from the prevailing posterior distribution and treating it as truth, a price is set greedily to maximize the expected immediate revenue. Upon observing the random demand at the end of the period, the posterior parameters are updated according to (13) and the process repeats.
In this pricing strategy, the seller does account for the effect of previous prices on current demand when maximizing the expected immediate revenue, but he overlooks the effect of the current price on future demand. By contrast, the optimal pricing strategy determines the price trajectory for an episode so as to fully exploit the delayed consequences of prices and maximize the total expected revenue. For example, the optimal pricing strategy might suggest keeping prices low (and collecting low revenue) in the first few periods in exchange for large demand (and large revenue) in subsequent periods. The greedy behavior of the weak version of Thompson sampling does not allow for such strategic planning and hence prevents the seller from learning the optimal strategy.
To compare the performance of TP with the above two alternative pricing strategies, we simulate an environment with linear demand as described. In the simulation, we let , , and . We assume that the prior distributions of and are and , respectively. Further, we assume that each component of has a prior distribution for . Figure 1 shows the per-episode regret of these three pricing strategies, averaged over thousand random realizations. As depicted in this figure, TP (Algorithm 3) quickly learns the optimal strategy and hence its per-episode regret diminishes quickly. However, as a result of model misspecification, the memoryless pricing strategy, which ignores the reference effects, fails to learn the unknown parameters and suffers from a large non-diminishing per-episode regret. This observation indicates that a more sophisticated pricing strategy is required in the presence of reference effects and that neglecting such effects substantially degrades performance. Figure 1 also depicts the per-episode regret achieved by the weak version of Thompson sampling described above. As discussed earlier, this version of Thompson sampling does not plan for the future and, as Figure 1 shows, its per-episode regret converges to a non-zero constant.
To explore the effect of memory duration on the performance of TP, we simulated the same scenario but with different values for . Figure 2 shows the expected per-episode regret of TP over 100 episodes for . As depicted in this figure, TP suffers more regret when is larger as in this case the prices have more persistent consequences and it takes longer for TP to learn the optimal pricing policy for an episode.
An alternative pricing algorithm that is suitable in the described linear environment is one based on the certainty equivalence principle. The certainty equivalence pricing strategy works similarly to TP, except that, instead of sampling from the posterior distribution at the beginning of each episode, it uses the maximum likelihood estimate of to compute the price trajectory within each episode. Furthermore, to enforce exploration in such an algorithm, dithering techniques, such as -greedy, can be adopted. At any period, the -greedy pricing strategy follows the certainty equivalence strategy with probability and sets a random price with probability . To compare the performance of these alternative pricing strategies with that of TP, we simulate the same scenario described above. Figure 3 shows the average per-episode regret of TP, the certainty equivalence strategy and the -greedy strategy for various values of . The vertical axis in Figure 3 is in logarithmic scale to better present the differences. As depicted in this figure, the certainty equivalence strategy works reasonably well in the described environment and adding randomness via dithering does not improve its performance. However, TP performs better, as its per-episode regret converges to 0 at a faster rate. Note that in this scenario, the maximum possible price is and hence the difference between the per-episode regret of TP and the certainty equivalence strategy after 1000 episodes represents a significant improvement.
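The dithering rule can be sketched in a few lines. The text does not say how the random price is drawn, so the sketch assumes a uniform draw between zero and the maximum allowable price; the function and argument names are hypothetical.

```python
import numpy as np

def epsilon_greedy_price(ce_price, max_price, epsilon, rng):
    """Follow the certainty equivalence price, exploring with prob. epsilon.

    Assumption: the exploratory price is uniform on [0, max_price].
    """
    if rng.random() < epsilon:
        return float(rng.uniform(0.0, max_price))
    return float(ce_price)

rng = np.random.default_rng(0)
prices = [epsilon_greedy_price(5.0, 10.0, 0.1, rng) for _ in range(1000)]
share_explored = sum(p != 5.0 for p in prices) / len(prices)  # roughly 0.1
```

Unlike posterior sampling, this rule explores at the same rate regardless of how much has already been learned and randomizes each price independently rather than probing coordinated price sequences.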
Based on the results of Theorem 3.2, an upper bound can be established for cumulative regret of TP when applied in the described linear demand environment. Before stating the result, we assume that the parameter space is bounded. There exists such that . The following corollary follows directly from Theorem 3.2. Consider an environment where expected demand is linearly parameterized as in (12). Under Assumptions 8 and 3, cumulative regret of TP after episodes satisfies
Corollary 3 shows that per-episode regret of TP in the described linear environment decreases at a rate of with the number of episodes . Pricing strategies that have been proposed in the literature for memoryless demand models achieve a similar per-episode regret rate in the absence of reference effects Ferreira et al. (2015). This indicates that although dynamic pricing with reference effects entails additional challenges, TP performs efficiently in that context with no additional cost in terms of the regret rate. Moreover, the regret bound established by Corollary 3 is increasing in the history parameter . Clearly, as increases the prices have more persistent effects and the optimal strategy admits a more complicated structure. Hence, it takes longer for TP to learn the optimal strategy. From another perspective, dictates the number of unknown parameters of the model and it takes longer to effectively learn within a model with more unknown parameters. Furthermore, as the horizon increases, the optimal policy within an episode takes a more complicated form as more sophisticated planning is required for larger horizons. Therefore, as reflected in (18), it takes longer for TP to effectively learn the optimal policy in larger horizons.
For the sake of exposition of our main ideas, we have so far focused our attention on a simplified setting where indistinguishable products are sold sequentially and our description of TP has been adapted to this scenario. In this section, we discuss how TP can be generalized to incorporate the effect of covariates on the demand and carry out multiproduct pricing possibly with variable and overlapping life cycles.
5.1 The Effect of Covariates
The setting of Section 2 can be extended to the case where the products sold in successive episodes are distinct. For example, in one episode the seller may be selling a certain type of coat while in the next episode he sells a certain type of shoe. Although the seller deals with the same environment in both episodes, the coat and the shoe will experience different demands when offered at the same price. More generally, in addition to prices and consumer behavior, demand for a product depends on characteristics of the product itself. To allow for such dependencies, we assume that at the beginning of episode , the agent has access to a context vector which encodes characteristics of the product being sold in that episode, such as its lifetime, its production cost and whether a similar product currently exists in the market. Furthermore, may contain other covariates that can influence the demand in episode , such as the current inflation rate and the average income of the potential consumers. The context vector may differ from episode to episode, but it remains fixed within each episode.
To incorporate the effect of the context on the demand, we extend our formulation in (2) and assume that the expected demand observed at period for product – the product being sold at episode – is
where similar to Section 2, is the price of product at period , is the step price history of the product representing the reference effect and is an unknown parameter. In this case, a pricing policy achieves an expected revenue of
in episode , where is the state induced by policy at period . Then, the optimal pricing policy at episode is . Note that the optimal pricing policy in episode depends on the context . Similar to Section 2 and given the parameter , the optimal policy in episode can be computed by means of a dynamic program. We write
to indicate that the policy is the solution of the associated dynamic program at episode with periods and given the parameter and context . Also, represents the maximum allowable price for the products.
To better illustrate the extension of TP to this setting, we focus on a linear demand model. Specifically, we assume that the expected demand at period in episode is given by
where , and are unknown parameters of the demand function. There are a total of unknown parameters in this model which can be encoded in terms of a vector where is an dimensional vector generated by stacking the columns of into a single column.
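Stacking the columns of a matrix parameter into a vector is what allows a term that is bilinear in the context and the state to be treated as linear in the unknown parameters. The sketch below checks this identity numerically. It assumes, for illustration only, a bilinear term of the form z' G x with context z, state x and coefficient matrix G; the names and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_context, n_state = 3, 4
z = rng.normal(size=n_context)             # context vector (hypothetical)
x = rng.normal(size=n_state)               # price-history state (hypothetical)
G = rng.normal(size=(n_context, n_state))  # coefficient matrix

bilinear = z @ G @ x

# Column stacking corresponds to Fortran (column-major) order.
vec_G = G.flatten(order="F")
features = np.kron(x, z)   # Kronecker product of state and context
linear = features @ vec_G  # same number, now linear in vec_G
```

With the feature vector built this way, the stacked parameter can be learned with the same normal prior and conjugate update as in the model without covariates.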
TP can be adopted in the same way as in Section 4 to generate pricing policies for this problem. Specifically, starting with a prior distribution on , TP draws a sample from the posterior distribution at the start of each episode and uses dynamic programming to compute a policy which is then executed throughout the episode. Thanks to conjugacy properties of normal distributions, the posterior distribution of after any number of episodes remains normal. To specify the update rules for posterior means and covariances, let us define some auxiliary variables as in Section 4. Given the observations gathered at episode , define
and for any , let
where denotes the Kronecker product. Then, at the end of episode , the mean and covariance matrix of the posterior distribution are updated according to
Note that the state in any time period is determined by the state in the previous time period and the selected price. Thanks to such deterministic evolution of the states and linearity of the demand function, the dynamic program step in TP reduces to a quadratic optimization problem. To see this, note that from (20) and (21), the expected revenue of a price vector in the ’th episode can be expressed as
Now, given the set of parameters , define the matrix such that
for all ,
for all ,
all other entries of are equal to 0.
Then, (23) can be expressed as
Therefore, given the sampled parameter at the beginning of episode , the policy to be applied in episode is given by
Similar to Section 4, the matrix is not guaranteed to be negative semi-definite in which case the optimization problem in (25) is not convex. However, with appropriate choice of the prior mean (for example when is negatively large with high probability) would be negative semi-definite with high probability. When implementing TP, posterior sampling at the beginning of episode can be repeated until the sampled results in a negative semi-definite . Algorithm 4 describes successive steps of the above solution and Algorithm 5 presents the generalization of to incorporate the effect of covariates.
We can also generalize the result of Theorem 3.2 and establish a regret bound for TP in such a scenario. Let denote the set of all possible context vectors. We make the following assumption. There exists such that . The following theorem provides a regret bound for TP in the above linear environment. Consider an environment where the expected demand is given as in (21). Under Assumptions 8, 3 and 5, the regret of TP after episodes would be
Theorem 5 is proved in Section 6. This theorem yields a regret bound for TP in the presence of covariates similar to that in the case of no covariates. The only difference is the dependence of (26) on the number of covariates . Note that scales the number of unknown parameters of the model and appears linearly in the regret bound.
5.2 Multiproduct Pricing
So far, we have been considering a single product dynamic pricing problem where at each episode, the seller prices and sells a single product. In many practical situations, however, multiple products are being sold by the seller and he needs to simultaneously price all of them. Potentially, these products can be related to each other in a way that the demand for one of them depends on the price of all the products. Specifically, suppose that a set of products are being sold at episode . At period in this episode, the agent sets a price vector such that its ’th component, denoted as , is the price of product . While we can extend the model to incorporate the effect of covariates on the demand as well, we neglect such effects here to ease the exposition. Let
be the dimensional state vector at period in episode and let be the demand vector at this period such that is the expected demand for product . Focusing on a linear demand function, the expected demand can be modeled as
where , and are the unknown parameters of the demand function. There are a total of unknown parameters which can be encoded in a vector where is a vector generated by stacking the columns of on top of each other and is generated in the same way for .
The agent observes a random demand vector at period in episode such that , the demand observed for product
, is a log-normal random variable with parameters and . Note that we have . The expected revenue achieved at this period is . A pricing policy in this case is a sequence of price vectors such that . The optimal pricing policy is the one that maximizes the expected revenue over the episode:
Similar to the single product setting, this optimization problem can be solved via a dynamic program. We overload our notation and write
to denote that is the solution of the optimization problem in (28) when the demand function is governed by the parameter .
TP can be adapted to learn the optimal policy in this multiproduct pricing problem. Similar to single product scenario, TP starts with a prior distribution on . Similar to Section 4 and thanks to conjugacy properties of normal distributions, the posterior distribution of after any number of episodes remains normal. To specify the update rules for posterior means and covariances, we define some auxiliary variables. Upon observing at episode , let be
and for any , define as
Some linear algebra leads to the following update rules for the mean and covariance matrix at the end of episode :