1 Introduction
A key challenge for many physical retailers is choosing where to display their products. In many large stores, it can be difficult for consumers to find what they are looking for, since a typical retailer may sell thousands of products. Additionally, consumers often purchase goods on impulse that they had not intended to buy beforehand. Proper placement reduces search costs and maximizes "impulse" buys [2]. For example, suppose a shopper visits a supermarket intending to purchase groceries. As the shopper checks out, he sees a soft drink placed near the cash register and adds it to his cart. The shopper's decision to purchase the drink was in part a function of the environmental cues and placement of the product [13]. The main idea of this work is to propose a strategy for automating the decision process of product placement in an optimal way.
Some existing work explores domains adjacent to the optimal product allocation problem. A large body of operations research analyzes shelf space distribution. For example, early work proposed a dynamic programming algorithm to discover an optimal shelf allocation strategy [20]. Other work poses shelf space allocation as a constrained optimization problem that can be solved via simulated annealing [3]. More contemporary studies propose frequent pattern mining approaches to determine profitable product item sets [4] [1]. To the best of our knowledge, none of the existing literature has studied the spatial effects of product locations across the entire store.
However, learning a strategy for optimal product allocation is nontrivial. First, the number of candidate allocation strategies is large, but historical data usually only explores a small subset. Moreover, sales are also correlated with other factors such as holidays and store promotions, which makes the search space even bigger. Because of this data sparsity, we cannot directly rely on historical data to learn the best strategy. Second, the cost of experimentation and exploration is high. It is not feasible to perform extensive experiments due to the potential lost revenue and the physical cost of moving products around the store. Finally, the correlation between product positions and sales is likely complex and nonlinear due to the dynamic nature of the market; simple search heuristics may not provide an optimal policy. For all of these reasons, we need an approach that can accurately reflect the environment in a cost-efficient way.
Therefore, we design a new framework to solve these challenges. We propose a probabilistic spatial demand simulator to act as a mirror of the real environment and as a mechanism to study more complex search algorithms, such as reinforcement learning, without incurring the high cost of exploration in the physical world. We train the proposed model using a new, real-world dataset. Additionally, when deployed online, the model could be used to perform Monte Carlo rollouts for efficient exploration and experimentation [8]. In our experiments, we demonstrate that the proposed model can effectively recover ground truth test data in two retail environments. Finally, we do a preliminary study into different optimization techniques using the proposed model.
In summary, the key contributions of our paper are:

We study the new problem of optimal product allocation in physical retail

We propose a probabilistic model of spatial demand that can accurately recover observed data, and generate data for new environment states

We train PSDsim on real data from two different retail stores

We do a preliminary study into various optimization methods and show that Deep Q-Learning can learn an optimal allocation policy
2 Problem Definition
In the following section, we provide a formal definition of the optimal allocation problem. Additionally, we define the necessary components of our reinforcement learning agent: the state space, action space, reward function, and state transition function.
2.1 Optimal Allocation Problem
In a physical retail environment with a set of spatial regions, we represent the environment with a spatial graph, G = (V, E), where each region is a vertex, r ∈ V, in the graph, and the spatial neighboring relation between two regions, r_i and r_j, is represented as an edge, (r_i, r_j) ∈ E. From G, we can construct the adjacency matrix, A.
Additionally, we observe a set of products, P, that are sold. For each product, p ∈ P, we know the retail price, price_p.
The decision process faced by the retailer is to allocate each product in P across regions in V. We define the allocation policy as a function, f:

(1) f : P × V → {0, 1}

(2) f(p, r) = 1 if product p is allocated to region r, else 0

where S = {(p, r) : f(p, r) = 1} is the set of selected product-region pairs, such that S ⊆ P × V. This function is typically dynamic over time, which we denote as f_t. To simplify computation, we treat f_t as a |P| × |V| grid and refer to it as the board configuration, B_t, at time, t. An optimal retail strategy is to find the allocation policy that maximizes revenue:
(3) f* = argmax_f Σ_{t=1}^{T} Σ_{p ∈ P} Σ_{r ∈ V} price_p · q_{p,r,t}

where price_p is the price for product, p, q_{p,r,t} is the quantity sold in region, r, and T is the future time horizon of analysis. The main idea of the current work is to discover the long-term, optimal allocation policy, f*, from data.
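To make the objective concrete, the revenue in Eq. (3) can be computed directly from a price vector and a quantity tensor. A minimal NumPy sketch (array names and shapes are our own illustration, not the paper's implementation):

```python
import numpy as np

def total_revenue(prices, quantities):
    """Total revenue over the horizon: the sum over products, regions,
    and time of price times quantity sold (Eq. 3).

    prices:     array of shape (n_products,)
    quantities: array of shape (n_products, n_regions, n_times)
    """
    # Broadcast prices across regions and time, then sum everything.
    return float((prices[:, None, None] * quantities).sum())

# Toy example: 2 products, 2 regions, 3 days.
prices = np.array([2.0, 5.0])
q = np.ones((2, 2, 3))  # one unit of each product per region per day
revenue = total_revenue(prices, q)  # (2 + 5) * 2 regions * 3 days = 42.0
```

Searching over allocation policies then amounts to comparing this quantity across candidate board configurations.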
2.2 Optimal Allocation as a Markov Decision Process
We believe that the optimal allocation problem is well suited for reinforcement learning because the RL agent is designed for sequential decision making that maximizes expected discounted reward over time. We frame the inputs as a Markov Decision Process (MDP). An MDP is defined by the tuple (S, A, P, R, γ), where S is the state space, A is the set of possible actions, P is the (typically unknown) state transition function, R is the reward function and γ is the discount factor.
State At each time, t, we observe the state of the retail environment, s_t ∈ S. We define the state, s_t, as the tuple of state features, s_t = (B_t, d_t, r_{t−1}), where B_t is the current board configuration, d_t is the current day of the week (e.g., Sunday = 0), and r_{t−1} is a vector denoting the revenue at the previous time, t − 1.

Action We define the action space, A, as indicating "to place", "take away" or "do nothing" for each product, p, in each region, r.

Reward The reward function in this case is the total product revenue at time, t, constrained by the monetary cost, c_t, of placing a set of products in each region:

(4) R_t = Σ_{r ∈ V} Σ_{p ∈ P} price_p · q_{p,r,t} − c_t
State transition function: The state transition function, P(s_{t+1} | s_t, a_t), gives the probability of moving to state, s_{t+1}, given the current state and action. In the optimal allocation problem the exact transition function is unknown, since the next state, s_{t+1}, depends on the results of the previous time, t. We model this transition as a stochastic process.
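A gym-style transition under this MDP might look like the following sketch, where the Poisson demand stand-in, the action encoding, and all names are hypothetical placeholders for the stochastic demand model described in Section 3:

```python
import numpy as np

rng = np.random.default_rng(0)

PLACE, REMOVE, NOTHING = 1, -1, 0  # hypothetical action encoding

def step(board, action, prices):
    """One environment transition: apply the allocation action to the
    board configuration, then draw a stochastic reward.

    board:  binary array (n_products, n_regions), 1 = product placed
    action: array (n_products, n_regions) with entries in {1, -1, 0}
    """
    next_board = np.clip(board + action, 0, 1)
    # Stochastic demand stand-in: quantity is nonzero only where a
    # product is placed; the real model is the demand simulator.
    quantity = next_board * rng.poisson(3.0, size=board.shape)
    reward = float((prices[:, None] * quantity).sum())
    return next_board, reward

board = np.zeros((2, 3), dtype=int)
action = np.zeros((2, 3), dtype=int)
action[0, 1] = PLACE           # place product 0 in region 1
board2, r = step(board, action, np.array([2.0, 5.0]))
```

The stochasticity of the reward is what makes the transition function a distribution rather than a deterministic map.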
3 Proposed Method
Figure 2: An overview of the proposed model as a Bayesian network. The boxes are "plates" representing structures in the data. The plates represent products, regions, and time, respectively. Circles denote random variables and squares are deterministic quantities. The model decomposes quantity as a function of region, product, time, and autoregressive weights.
In this section, we define our framework for solving the optimal allocation problem. Specifically, we outline our proposed environment model that is used to simulate spatial demand.
3.1 Stochastic Model of Spatial Demand
We propose the following stochastic model of spatial demand in physical retail. See Figure 2 for an overview. In the current work, the stochastic model is used as a 'simulator' to enable offline policy learning. There are many advantages to using a probabilistic model for the optimal product allocation problem. First, we are able to incorporate prior knowledge about the data generating process, which can improve data efficiency and model effectiveness. Second, it provides a natural framework for simulating future scenarios through Monte Carlo rollouts.
Our ultimate objective is to maximize total revenue at time, t, which is defined as:

(5) R_t = Σ_{r ∈ V} R_{r,t}

where R_{r,t} is the revenue for region, r. Region-level revenue is calculated over products, p ∈ P:

(6) R_{r,t} = Σ_{p ∈ P} price_p · q_{p,r,t}
The key variable of interest is, q_{p,r,t}, the quantity sold for product, p, in region, r, at time, t. We model q_{p,r,t} as a truncated normal random variable:

(7) q_{p,r,t} ~ TruncatedNormal(μ, σ², a, b)

where the pdf of the truncated normal distribution is ψ(q; μ, σ, a, b) = φ((q − μ)/σ) / (σ (Φ((b − μ)/σ) − Φ((a − μ)/σ))), φ is the standard normal pdf, and Φ is its cumulative distribution function. See [5] for more details. We set a = 0 and b = ∞, which constrains quantity to be non-negative, q_{p,r,t} ≥ 0. The prior for q_{p,r,t} is characterized by the mean, μ, which is a linear function of environment features, x, and learned weights, w, and the inverse gamma distribution for the variance, σ²:

(8) μ = wᵀx

(9) σ² ~ InverseGamma(α, β)
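For intuition, quantities can be drawn from the zero-truncated normal in Eq. (7) by simple rejection sampling; a hedged NumPy sketch (a production implementation would use a dedicated truncated-normal sampler such as scipy.stats.truncnorm):

```python
import numpy as np

def sample_truncated_normal(mu, sigma, size, a=0.0, rng=None):
    """Draw samples from a normal distribution truncated below at `a`
    (here a = 0, matching the non-negativity constraint on quantity)
    by rejection sampling: redraw until enough samples land in [a, inf).
    """
    rng = rng or np.random.default_rng(0)
    out = np.empty(0)
    while out.size < size:
        draws = rng.normal(mu, sigma, size=size)
        out = np.concatenate([out, draws[draws >= a]])
    return out[:size]

# Quantities for one product-region pair with mean demand 5, sd 2.
q = sample_truncated_normal(mu=5.0, sigma=2.0, size=1000)
```

Rejection sampling is adequate here because little mass lies below zero; for means near the truncation point an inverse-CDF sampler is more efficient.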
In our environment, we observe temporal features, x_s, region features, x_r, product features, x_p, and autoregressive features, x_q: x = [x_s, x_r, x_p, x_q]. We discuss our feature extraction approach more in section 3.3.

3.1.1 Region-level Weights
We initially model the weights for each spatial region, w_r, with a multivariate normal distribution, with mean vector, μ_r, and covariance matrix, Σ_r:

(10) w_r ~ N(μ_r, Σ_r)
3.1.2 Product-level Weights
We also define weights for each product, w_p, as follows:

(11) w_p ~ N(μ_p, Σ_p)

(12) μ_p ~ N(μ_0, Σ_0)

(13) Σ_p = diag(s_p) L_p L_pᵀ diag(s_p)

We put a multivariate normal prior over the mean vector, μ_p, which has hyperparameters, μ_0 and Σ_0. Additionally, we put an LKJ prior over the covariance matrix, Σ_p. We reparameterize Σ_p as its Cholesky decomposition, L_p, so that the underlying correlation matrix follows an LKJ distribution [10]. The standard deviations, s_p, follow a half-Cauchy distribution. The advantage of the LKJ prior is that it is more computationally tractable than other covariance priors [10].

3.1.3 Temporal weights
The temporal features capture the long-term and short-term seasonality of the environment. The temporal weights are defined similarly to the product weights. Namely, the temporal weights, w_s, follow a multivariate normal distribution, with a normal prior over the mean, and the LKJ prior for the covariance matrix:
(14) w_s ~ N(μ_s, Σ_s)

(15) μ_s ~ N(μ_0, Σ_0)

(16) Σ_s = diag(s_s) L_s L_sᵀ diag(s_s)
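The Cholesky reparameterization in Eqs. (13) and (16) amounts to assembling a covariance matrix from a correlation Cholesky factor and a vector of standard deviations. A small NumPy illustration (the matrix values are arbitrary):

```python
import numpy as np

def covariance_from_cholesky(L_corr, stds):
    """Assemble a covariance matrix from a lower-triangular Cholesky
    factor of a correlation matrix and a vector of standard deviations,
    mirroring the LKJ reparameterization: Sigma = D C D, where
    C = L L^T is the correlation matrix and D = diag(stds).
    """
    C = L_corr @ L_corr.T          # correlation matrix
    D = np.diag(stds)
    return D @ C @ D

# A valid correlation-Cholesky factor: each row has unit norm.
L = np.array([[1.0, 0.0],
              [0.6, 0.8]])
Sigma = covariance_from_cholesky(L, stds=np.array([2.0, 3.0]))
```

Sampling L directly (as PyMC3's LKJ machinery does) sidesteps the positive-definiteness constraints that make dense covariance priors expensive.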
3.1.4 Autoregressive weight
Finally, we specify the weight, w_q, of previously observed revenue values on q_{p,r,t}. The feature, x_q, is an autoregressive feature denoting the previous k values of product-level revenue, R_{p,t−1}, …, R_{p,t−k}. We assume a truncated normal prior for w_q, and half-Cauchy priors for the location, λ, and scale, τ:

(17) w_q ~ TruncatedNormal(λ, τ, a, b)

(18) λ ~ HalfCauchy(β_λ)

(19) τ ~ HalfCauchy(β_τ)

We again set a = 0 and b = ∞ such that w_q ≥ 0.
(20) w_r ~ N(μ_r, Σ)

(21) μ_r ~ N(μ_0, Σ)

Note that both w_r and μ_r share the same covariance structure, Σ. Thus, the region weights are only hierarchical in their means. Additionally, we treat the upper-level mean vector, μ_0, as a hyperparameter. In Section 4 we test which environment model is more effective at predicting revenue on a test set.
3.2 Training
We train the proposed model using the No-U-Turn Sampler (NUTS) algorithm [7]. This allows us to draw samples from the posterior distribution of model weights, W, as well as the posterior predictive distribution of quantity, q_{p,r,t}, and revenue, R_t. We use Automatic Differentiation Variational Inference (ADVI) [9] as an initialization point for the sampling procedure. All models are implemented in PyMC3 [15]. We initialize with ADVI using 200,000 iterations. Once initialized, we sample the posterior using NUTS with a tuning period of 5,000 draws followed by 5,000 samples across four chains.
3.3 Feature Extraction
In order to train the proposed model, we extract environment-level features, x, which are composed of temporal features, x_s, region features, x_r, product features, x_p, and previous sales features, x_q.

Temporal features We use a one-hot vector denoting the day of the week for, x_s. This feature vector captures the short-term temporality common in physical retail settings. For example, weekends tend to be busier shopping days than weekdays.

Region features We again use a one-hot vector for spatial regions, x_r. This feature vector "turns on" the weight that each region has on quantity via the weight vector, w_r.

Product features We expect each product to vary in popularity. We capture this effect by constructing a one-hot vector for products, x_p.

Previous sales features Finally, we construct an autoregressive sales feature, x_q, that represents the sales at time, t − 1. We use the previous sales for product, p, summed across all regions, r ∈ V. This feature captures micro-fluctuations in demand for each product.
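The four feature groups above can be assembled into the feature vector x as follows. The dimensions and function names are our own illustration; only the one-hot encoding scheme is taken from the text:

```python
import numpy as np

def one_hot(index, size):
    """Standard one-hot encoding used for the day-of-week, region,
    and product features."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

def build_features(day_of_week, region, product, prev_sales,
                   n_regions, n_products):
    """Concatenate the feature groups into the environment feature
    vector x = [x_s, x_r, x_p, x_q]. `prev_sales` is the product's
    revenue at the previous time step, summed over regions."""
    return np.concatenate([
        one_hot(day_of_week, 7),       # temporal features, x_s
        one_hot(region, n_regions),    # region features, x_r
        one_hot(product, n_products),  # product features, x_p
        [prev_sales],                  # autoregressive feature, x_q
    ])

# Saturday (index 6), region 3 of 17, product 0 of 15, $120 prior sales.
x = build_features(day_of_week=6, region=3, product=0,
                   prev_sales=120.0, n_regions=17, n_products=15)
```

With 17 regions and 15 products (store 1), x has 7 + 17 + 15 + 1 = 40 entries.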
4 Experiments
In the following section we first describe the dataset and discuss interesting features of the problem. Next, we perform empirical evaluations of the proposed model across two large retail environments, showing that it recovers test data more accurately than elementary baselines. We explore the model by discussing the estimation of region weights, and show that it is robust to previously unseen states. Finally, we do a preliminary inquiry into effective methods for optimization.
4.1 Dataset Description
Store Id  Regions  Products  Time Horizon
#1  17  15  8/2018 - 8/2019
#2  12  15  8/2018 - 8/2019
Stores: We collect data from two large supermarket and retail stores in Salt Lake City, UT, USA. Each store primarily sells groceries, common household goods and clothing. Our dataset is comprised of transactions from August 2018 to August 2019.
Products: We observe quantities sold for a set of 15 products, as well as each product’s average price over the year. All of the products in our dataset are popular beverage products.
Regions: The data provides daily counts of quantities at the region-product level. Additionally, the locations of the products are varied in product "displays". These displays are small groups of products intended to catch the eye of the shopper. See Figure 1 for an example of a product display layout. Store 1 comprises 17 regions, and store 2 has 12. Each region represents a section of the store. In general, regions tend to be constructed based on the functionality of each space (e.g., pharmacy, deli, etc.). We construct a spatial graph of these regions.
4.2 Model Evaluation
We first evaluate the effectiveness of the proposed model in predicting revenue on a test dataset. Specifically, we partition the time series into a training period from August 1, 2018 to July 4, 2019, and a test period from July 5, 2019 to August 31, 2019. We compare the proposed model to a variety of discriminative baselines, and simpler variants of the proposed model. We evaluate all models in terms of the following error metrics:
(22) MSE = (1/n) Σ_{i=1}^{n} (R̂_i − R_i)²

(23) MAE = (1/n) Σ_{i=1}^{n} |R̂_i − R_i|

where the predicted revenue is equal to the quantity times price for the product, p, in the region, r, at time, t: R̂_{p,r,t} = price_p · q̂_{p,r,t}. To compare to the discriminative models, we obtain a point estimate for q̂_{p,r,t} by computing the mean of the samples taken from the posterior predictive distribution.
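The evaluation pipeline reduces to averaging posterior predictive draws into a point estimate and scoring it with Eqs. (22) and (23); a minimal sketch with toy numbers:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error (Eq. 22): penalizes large errors heavily."""
    return float(np.mean((y_pred - y_true) ** 2))

def mae(y_true, y_pred):
    """Mean absolute error (Eq. 23): average absolute deviation."""
    return float(np.mean(np.abs(y_pred - y_true)))

# Posterior predictive samples -> point estimate by averaging draws,
# then score against held-out revenue. Rows are draws, columns are
# test observations.
samples = np.array([[9.0, 11.0],
                    [11.0, 13.0]])
point_est = samples.mean(axis=0)    # [10.0, 12.0]
y_test = np.array([10.0, 10.0])
```

The same two scoring functions apply unchanged to the discriminative baselines, which produce point predictions directly.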
Environment Model  Store 1 MSE  Store 1 MAE  Store 2 MSE  Store 2 MAE
OLS  2845.61  28.01  4816.41  34.81
RF  2908.73  26.77  5090.11  36.34
MLP  4037.91  34.66  7322.86  44.37
Proposed  2615.32  27.67  4492.52  34.48
4.2.1 Baseline Approaches
The proposed model is a generative environment model and is able to draw samples from the full posterior distribution of revenue, R_t. We also compare to the following discriminative prediction models:

Linear Regression (OLS): Classical least squares regression that decomposes predicted quantity as a linear function of weights: q̂ = wᵀx.

Random Forest (RF): An ensemble regressor that learns many decision trees and averages over the labels in each terminal node to compute, q̂. We use 100 trees.
Multilayer Perceptron (MLP): A simple neural network with two hidden layers of dimensions 256 and 128, with ReLU activations, MSE loss, and a stochastic gradient descent optimizer.
We use the same features for all baselines. The features used in the experiment are described above.
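As an illustration of the simplest baseline, the OLS weights can be recovered in closed form with a least-squares solve over the shared feature matrix (synthetic data below; the RF and MLP baselines would consume the same X):

```python
import numpy as np

# Synthetic stand-in for the shared feature matrix and quantity labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + rng.normal(scale=0.01, size=200)  # near-noiseless

# Closed-form OLS fit: minimize ||Xw - y||^2.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_pred = X @ w_hat   # predicted quantity, q_hat = w^T x per row
```

Unlike the proposed Bayesian model, this fit yields a single point estimate of w with no posterior uncertainty and no mechanism for injecting prior knowledge.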
4.2.2 Results
We report the results in Table 2. Additionally, predictions over the test set are plotted in Figure 3. Overall we have the following observations from the experiment.
First, the proposed model is overall more accurate at predicting future states than the baselines. In particular, the proposed model yields the smallest MSE scores. MSE gives a higher penalty to large errors, so in general the proposed model tends to make fewer large mistakes than all other baselines. This result holds in both store 1 and store 2. Additionally, the proposed model minimizes the MAE score in store 2, but is beaten only by the Random Forest baseline in store 1. Upon closer analysis we see that the Random Forest baseline has the second largest MSE score in store 1, which indicates that the Random Forest regressor has a higher variance than the proposed model. Overall, the proposed model is better than or comparable to all baselines in both retail stores.
Second, the use of prior information in the proposed model allows it to perform better than the discriminative baselines. Because the proposed model is a generative, Bayesian regression model, we are able to set key hyperparameters according to our prior knowledge. For example, we know that retail sales increase on the weekends. By guiding the estimation of model parameters through the use of human knowledge, the proposed model is able to achieve prediction performance superior to OLS, RF, and the MLP in nearly all cases.
4.3 Optimization Techniques
In this section we perform a preliminary study into various search algorithms to solve the optimal product allocation problem with the proposed environment model. Because exploration and experimentation in the physical world is costly, it is often preferable to design an agent that can learn a policy offline before deploying it into the online environment [8].
4.3.1 Search Algorithms
To this end we compare four methods to search the problem space: random search, naive search, Tabu search, and Deep Q-Learning.

Random Search A search algorithm that relies on a totally random policy: at each time step, choose a random action.

Naive Search The naive strategy in this case is simply “do nothing.” At each time step, we do not move any products and do not deviate from the initialized allocation policy. This baseline allows us to assess whether searching and exploration is useful at all.

Tabu Search: A local neighborhood search algorithm that maintains a memory structure called a "Tabu" list. The "Tabu" list is comprised of recent actions to encourage exploration and avoid getting trapped in local maxima. We implement the Tabu algorithm with a "Tabu" list of the previous 50 actions. We treat the local neighborhood search as the enumeration over the set of feasible actions given the current state, s_t.

Deep Q-Learning (DQN): A reinforcement learning algorithm that utilizes a neural network to approximate the state-action value function, Q(s, a). The DQN typically employs an ε-greedy strategy for exploration. The exploration probability, ε, is typically annealed throughout training. DQN has been shown to be effective for learning policies in complex, dynamic environments such as Atari [14], Go [16] [17], ride dispatching [12], and traffic signal control [18]. We train our DQN using 50,000 training iterations prior to the test period.
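To illustrate the contrast between greedy local search and the learning-based methods, here is a minimal Tabu search sketch on a toy one-dimensional objective. The neighborhood function, objective, and Tabu-list size are illustrative stand-ins for the action enumeration described above:

```python
def tabu_search(objective, start, neighbors, n_iters=100, tabu_size=50):
    """Minimal Tabu search: greedy local moves plus a bounded memory of
    recently visited states to avoid cycling. The paper's version
    enumerates feasible allocation actions instead of integer moves.
    """
    current = best = start
    tabu = [start]
    for _ in range(n_iters):
        # Candidate moves exclude recently visited states.
        candidates = [s for s in neighbors(current) if s not in tabu]
        if not candidates:
            break
        current = max(candidates, key=objective)  # greedy step
        tabu.append(current)
        tabu = tabu[-tabu_size:]                  # bounded memory
        if objective(current) > objective(best):
            best = current
    return best

# Toy objective with its global maximum at s = 8.
f = lambda s: -(s - 8) ** 2
best = tabu_search(f, start=0, neighbors=lambda s: [s - 1, s + 1])
```

Each step maximizes immediate gain, which is exactly why Tabu falls behind DQN when the best action sequence sacrifices short-term reward.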
4.3.2 Policy Evaluation
In this section we conduct a policy evaluation experiment. We randomly fix the initial environment state and allow each of the search algorithms listed above to interact with the environment according to its corresponding strategy in a test period of one episode. The state in store 1 is initialized with 96 productregion pairs, while the state in store 2 has 30. We record the total reward accumulated by each agent during the entire episode. For each store, we vary the episode length in 30 day increments: 30, 60, and 90 days in the future. This allows us to evaluate whether longer rollouts have an effect on the policy of each agent. The results of the policy evaluation experiment are reported in table 4.
In general, we see that DQN is the most effective search algorithm in both stores, and across all three episode settings. In each case, it accumulates the most total reward in the test episode. On average, DQN is 24.5% better than Tabu, in terms of cumulative test reward. Tabu is the second most effective search strategy, beating out the random and naive search heuristics in all cases. Interestingly, the naive search baseline of “do nothing” is more effective than random searching in store 1, but not in store 2.
Additionally, it appears that as the episode length increases, so too does the relative effectiveness of DQN compared to Tabu. In the store 1, 30-day episode setting, DQN exceeds Tabu by $10k. This difference increases to $30k for 60 days and $72k for 90 days. In store 2 we see a similar effect. The difference between DQN and Tabu increases from $12k to $13.5k to $16k in the 30, 60, and 90-day settings respectively. Not only is DQN more effective, but its performance relative to the other baselines improves with longer episodes.
DQN excels as episode length increases in large part because the underlying Q function is an approximation of discounted, expected reward over time. This allows the agent to think multiple steps ahead and take a set of actions that yield low immediate reward, but higher reward in later steps. Conversely, the random and Tabu search baselines are short-term or greedy search algorithms. Especially in the case of Tabu: at each time step, an action is selected solely based on what will maximize short-term reward. These results suggest that the correlation between spatial allocation and sales is complex and dynamic. Thus, both baselines achieve suboptimal policies.
It is also interesting to note the behavior of the naive search compared to the random strategy across the two stores. In store 1, the environment is initialized with an allocation strategy that already has many product placements (96). We see that the naive strategy is a strong baseline, and is superior to the random policy in each of the 30, 60 and 90-day settings. However, in store 2, where the initial allocation is more sparse (30 placements), the random policy is better than or equal to the naive search. This suggests that as more products are placed, it is more difficult to find incremental improvements in the allocation strategy.
5 Related Work
There are two major streams of literature that intersect with our problem: 1) shelf space allocation and 2) deep reinforcement learning for spatial allocation.
5.1 Shelf Space Allocation
The shelf space allocation problem has been studied in the operations research literature for many decades. Some classical work approaches the problem by proposing a dynamic programming algorithm to allocate limited shelf space among a finite set of products. In this case, the objective function is composed of revenue, costs and a set of constraints [20]. Later work proposed a simulated annealing optimization approach that accounts for two primary decision variables: product assortment and allocated space for each product [3]. This optimization technique accounts for many different environment variables such as item profitability, brand elasticities, and supply chain features. More recently, frequent pattern mining algorithms have been proposed to allocate product shelf space. For instance, Brijs et al. [4] propose the PROFSET algorithm, which is an association rule algorithm that mines customer basket sets to identify profitable product pairings. This algorithm is an extension of frequent item set algorithms that also accounts for product value. Extensions of this idea have also been proposed. Aloysius and Binu propose a PrefixSpan algorithm for shelf allocation that first identifies complementary categories from historical purchase data before identifying product mix strategies within categories [1].
These existing studies differ from our work in the following ways. First, they all focus on micro-regions (shelves) within the retail environment. The spatial effects these models capture are markedly different from the macro-level ones tackled in the current work. Second, these studies focus on the number of each product on a shelf. They try to maximize profitability given the fixed shelf volume. This optimization problem is fundamentally different from allocating products across the entire store. For these reasons, none of these methods can be directly applied to our problem.
5.2 Deep Reinforcement Learning for Spatial Resource Allocation
Recent breakthroughs in reinforcement learning [14] [16] [17] have spurred interest in RL as an optimization approach in complex and dynamic environments. In particular, recent studies have proposed RL algorithms as a mechanism for spatiotemporal resource allocation.
Order dispatching. Significant attention has been paid to the order dispatching problem in ride sharing systems. Briefly, order dispatching refers to the problem of efficiently matching riders and drivers in an urban environment. The RL agent must learn the complex spatial dynamics to learn a policy to solve the dispatching problem. For example, Lin et al. [12] tackle the dispatch problem by proposing a contextual multiagent reinforcement learning framework that coordinates strategies among a large number of agents to improve driver allocation in physical space. Additionally, Li et al. [11] also approach the order dispatching problem with multiagent reinforcement learning (MARL). Their method relies on the mean field approximation to capture the dynamic, spatially distributed fluctuations in supply and demand. They empirically show that MARL can reduce supplydemand gaps in peak hours.
Traffic signal control Increasing traffic congestion is a key concern in many urban areas. Recent efforts to optimize traffic control systems via reinforcement learning have shown encouraging results. These systems seek to adjust traffic lights to real-time fluctuations in traffic volume and road demand. Wei et al. [18] propose IntelliLight, a phase-gated deep neural network that approximates state-action values. More recently, [19] proposes a graph attentional network to facilitate cooperation between many traffic signals.
Spatial demand for electronic tolls Chen et al. [6] propose a dynamic electronic toll collection system that adjusts to traffic patterns and spatial demand for roads in real time. Their proposed algorithm, PG, is an extension of policy gradient methods and decreases traffic volume and travel time.
While these reinforcement learning methods deal with the large-scale optimization of spatial resources, they cannot be directly applied to the product allocation problem because they all rely on domain-specific simulators. We propose our model in an effort to extend these state-of-the-art optimization techniques to our problem.
6 Conclusion
In this paper, we studied the automation of product placement in retail settings. The problem is motivated by the fact that well-placed products can maximize impulse buys and minimize search costs for consumers. Solving this allocation problem is difficult because location-based, historical data is limited in most retail settings. Consequently, the number of possible allocation strategies is massive compared to the number of strategies typically explored in historical data. Additionally, it is generally costly to experiment with and explore new policies because of the economic cost of suboptimal strategies, and the operational cost of deploying a new allocation strategy. Therefore, we propose a probabilistic environment model, called PSDsim, that is designed to mirror the real world and allow for automated search, simulation and exploration of new product allocation strategies. We train the proposed model on real data collected from two large retail environments. We show that the proposed model can make accurate predictions on test data. Additionally, we do a preliminary study into various optimization methods using the proposed model as a simulator. We discover that deep reinforcement learning techniques can learn a more effective policy than baselines. On average, DQN offers an improvement of 24.5% over Tabu search in terms of cumulative test reward.
References
 [1] (2011) An approach to products placement in supermarkets using prefixspan algorithm. Journal of King Saud University  Computer and Information Sciences. Cited by: §1, §5.1.
 [2] (2015) Does urge to buy impulsively differ from impulsive buying behaviour? assessing the impact of situational factors. Journal of Retailing and Consumer Services. Cited by: §1.
 [3] (1994) A model for determining retail product category assortment and shelf space allocation. Decision Sciences. Cited by: §1, §5.1.
 [4] (2001) A data mining framework for optimal product selection in retail supermarket data: the generalized profset model. In arXiv, External Links: Link Cited by: §1, §5.1.
 [5] (2014) The truncated normal distribution. External Links: Link Cited by: §3.1.

 [6] (2018) DyETC: dynamic electronic toll collection for traffic congestion alleviation. In Proceedings of the 2018 AAAI Conference on Artificial Intelligence (AAAI'18). Cited by: §5.2.
 [7] (2011) The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research. Cited by: §3.2.
 [8] (2019) Model-based reinforcement learning for Atari. In arXiv, External Links: Link Cited by: §1, §4.3.
 [9] (2017) Automatic differentiation variational inference. Journal of Machine Learning Research. Cited by: §3.2.

 [10] (2009) Generating random correlation matrices based on vines and extended onion method. Journal of Multivariate Analysis. Cited by: §3.1.2.
 [11] (2019) Efficient ridesharing order dispatching with mean field multi-agent reinforcement learning. In Proceedings of the World Wide Web Conference, TheWebConf'19. Cited by: §5.2.
 [12] (2018) Efficient large-scale fleet management via multi-agent deep reinforcement learning. In Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, KDD'18. Cited by: 4th item, §5.2.
 [13] (2008) The role of store environmental stimulation and social factors on impulse purchasing. Journal of Services Marketing. Cited by: §1.
 [14] (2015) Playing atari with deep reinforcement learning. In arXiv, External Links: Link Cited by: 4th item, §5.2.
 [15] (2016) Probabilistic programming in Python using PyMC3. PeerJ Computer Science. Cited by: §3.2.
 [16] (2016) Mastering the game of go with deep neural networks and tree search. Nature. Cited by: 4th item, §5.2.
 [17] (2017) Mastering the game of go without human knowledge. Nature. Cited by: 4th item, §5.2.
 [18] (2018) IntelliLight: a reinforcement learning approach for intelligent traffic light control. In in Proceedings of the 2018 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’18), Cited by: 4th item, §5.2.
 [19] (2019) CoLight: learning networklevel cooperation for traffic signal control. In in Proceedings of the 2019 ACM on Conference on Information and Knowledge Management (CIKM’19), Cited by: §5.2.
 [20] (1986) A dynamic programming approach for product selection and supermarket shelfspace allocation. The Journal of Operational Research Society. Cited by: §1, §5.1.