Large-scale shopping cart data provides unprecedented opportunities for researchers to understand consumer behavior and to predict how it responds to interventions such as promotions and price changes. Consider the shopping cart in Figure 1. This customer has purchased items for their baby (diapers, formula), their dog (dog food, dog biscuits), some seasonal fruit (cherries, plums), and the ingredients for tacos (taco shells, salsa, and beans). Shopping cart datasets may contain thousands or millions of customers like this one, each engaging in dozens or hundreds of shopping trips.
In principle, shopping cart datasets could help reveal important economic quantities about the marketplace. They could help evaluate counterfactual policies, such as how changing the price of a product will affect demand for it and for related products; and they could help characterize consumer heterogeneity, which would allow firms to consider different interventions for different segments of their customers. However, large-scale shopping cart datasets are too complex and heterogeneous for classical methods of analysis, which necessarily prune the data to a handful of categories and a small set of customers. In this paper, our goal is to develop a method that can usefully analyze large collections of complete shopping baskets.
Shopping baskets are complex because, as the example shopper in Figure 1 demonstrates, many interrelated forces are at play when determining what a customer decides to buy. For example, the customer might consider how well the items go together, their own personal preferences and needs (and whims), the purpose of the shopping trip, the season of the year, and, of course, the prices and the customer’s personal sensitivity to them. Moreover, these driving forces of consumer behavior are unobserved elements of the market. Our goal is to extract them from observed datasets of customers’ final purchases.
To this end we develop shopper, a sequential probabilistic model of market baskets. shopper uses interpretable components to model the forces that drive customer choice, and we designed shopper to capture properties of interest regarding how items interact with other items; in particular, we are interested in answering counterfactual queries with respect to item prices. We also develop an efficient posterior inference algorithm to estimate these forces from large-scale data. We demonstrate shopper by analyzing data from a major chain grocery store (Che, Chen and Chen, 2012). We found that shopper provides accurate predictions even under price interventions, and that it also helps identify complementary and substitutable pairs of items.
1.1 Main idea
shopper is a hierarchical latent variable model of market baskets whose generative process comes from an assumed model of consumer behavior. (In the language of social science, shopper is a structural model of consumer behavior, where the elements of the model include the specification of consumer preferences, information, and behavior, i.e., utility maximization. In the language of probabilistic models, it can equivalently be defined by a generative model.) shopper posits that a customer walks into the store and chooses items sequentially. Further, the customer might decide to stop shopping and pay for the items; this is the “checkout” item. At each step, the customer chooses among the previously unselected items, conditioning on the items already in the basket. The customer’s choices also depend on various other aspects: the prices of the items and the customer’s sensitivities to them, the season of the year, and the customer’s general shopping preferences (which are specific to each customer).
One key feature of shopper is that each possible item $c$ is associated with latent attributes $\alpha_c$, a vector representation that is learned from the data. This is similar in spirit to methods from machine learning that estimate semantic attributes of vocabulary words by analyzing the words close to them in sentences (Bengio et al., 2003). In shopper, “words” are items and “sentences” are baskets of purchased items.
shopper uses the latent attributes in two ways. First, they help represent the existing basket when considering which item to select next. Specifically, each possible item is also associated with a vector of interaction coefficients $\rho_c$, parameters that represent which kinds of basket-level attributes it tends to appear with. For example, the interaction coefficients and attributes can capture that when taco shells are in the basket, the customer has a high probability of choosing taco seasoning. Second, the latent attributes are used as the basis for representing customer preferences. Each customer $u$ in the population also has a vector of interaction coefficients $\theta_u$, which we call “preferences,” that represents the types of items that they tend to purchase. For example, the customer preferences and attributes can capture that some customers tend to purchase baby items or dog food.
Mathematically, at the $i$th step of the sequential process, the customer chooses item $c$ with a probability that depends on the latent features $\alpha_c$ of item $c$, on her preferences $\theta_u$, and on the representation of the items that are already in the basket, $\frac{1}{i-1} \sum_{j=1}^{i-1} \alpha_{y_j}$. In its vanilla form, shopper posits that this probability takes a log-bilinear form,

$$ p(y_i = c \mid y_{1:i-1}) \propto \exp\Big\{ \theta_u^\top \alpha_c + \rho_c^\top \Big( \frac{1}{i-1} \sum_{j=1}^{i-1} \alpha_{y_j} \Big) \Big\}. \qquad (1) $$
Section 3 provides more details about the model and also describes how to incorporate price and seasonal effects.
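As a purely illustrative sketch of this log-bilinear choice rule, the snippet below computes softmax choice probabilities from hypothetical attribute vectors `alpha`, interaction coefficients `rho`, and a preference vector `theta`; the dimensions and random values are placeholders, not the parameters fit in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
C, K = 6, 3                      # hypothetical: 6 items, 3 latent dimensions
alpha = rng.normal(size=(C, K))  # latent item attributes (placeholders)
rho = rng.normal(size=(C, K))    # per-item interaction coefficients
theta = rng.normal(size=K)       # one customer's preference vector

def choice_probs(basket, alpha, rho, theta):
    """Softmax over not-yet-chosen items under a log-bilinear score."""
    if basket:
        basket_rep = alpha[basket].mean(axis=0)  # average attributes in basket
    else:
        basket_rep = np.zeros(alpha.shape[1])
    scores = alpha @ theta + rho @ basket_rep    # log-bilinear mean utility
    scores[basket] = -np.inf                     # cannot re-choose an item
    p = np.exp(scores - scores.max())            # stabilized softmax
    return p / p.sum()
```

Items already in the basket receive probability zero, so the choice is effectively over the remaining items.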
When learned from data, the latent attributes capture meaningful dimensions of the data. For example, Figure 2 illustrates a two-dimensional projection of learned latent item attributes. Similar items are close together in the attribute space, even though explicit attributes (such as category or purpose) are not provided to the algorithm.
In the simplest shopper model, the customer is myopic: at each stage they do not consider that they may later add additional items into the basket. This assumption is problematic when items have strong interaction effects, since it is likely that a customer will consider other items that complement the currently chosen item. For example, if a customer considers purchasing taco seasoning, they should also consider that they will later want to purchase taco shells.
To relax this assumption, we include a second key feature of shopper, called “thinking ahead,” where the customer considers the next choice in making the current choice. For example, consider a scenario where taco shells have an unusually high price and where the customer is currently contemplating putting taco seasoning in the basket. As a consequence of thinking ahead, the model dampens the probability of choosing seasoning because of the high price of its likely complement, taco shells. shopper models one step of thinking ahead. (It may be plausible in some settings that customers think further ahead when shopping; we leave this extension for future work.)
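One step of thinking ahead can be sketched as follows. Here `psi` is a hypothetical per-item baseline utility standing in for prices and preferences, and the greedy one-step lookahead is our illustrative reading of the mechanism, not the paper's exact estimation objective.

```python
import numpy as np

rng = np.random.default_rng(1)
C, K = 5, 2                       # hypothetical store with 5 items
alpha = rng.normal(size=(C, K))   # latent attributes (placeholders)
rho = rng.normal(size=(C, K))     # interaction coefficients (placeholders)
psi = rng.normal(size=C)          # per-item baseline utility (an assumption)

def mean_utility(c, basket):
    """Deterministic utility of item c given the current basket."""
    u = psi[c]
    if basket:
        u += rho[c] @ alpha[basket].mean(axis=0)
    return u

def lookahead_score(c, basket):
    """Myopic utility of c plus the best achievable next-step utility."""
    hypothetical = basket + [c]
    remaining = [j for j in range(C) if j not in hypothetical]
    best_next = max(mean_utility(j, hypothetical) for j in remaining)
    return mean_utility(c, basket) + best_next
```

If a likely complement becomes expensive, its `mean_utility` falls, which lowers `best_next` and hence the lookahead score of the current candidate.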
1.2 Main results
We fit shopper
to a large data set of market baskets, estimating the latent features of the items, seasonal effects, preferences of the customers, and price sensitivities. We evaluate the approach with held-out data, used in two ways. First, we hold out items from the data at random and assess their probability under the posterior predictive distribution. This checks how well shopper (and our corresponding inference algorithm) captures the distribution of the data. Second, we evaluate on held-out baskets in which items have larger price variation compared to their average values. This evaluation assesses how well shopper can evaluate counterfactual price changes. With both modes of evaluation, we found that shopper gave better predictions than state-of-the-art econometric choice models and machine learning models for item consumption.
Further, shopper can help identify pairs of items that are substitutes and complements—where purchasing one item makes the second item less or more attractive, respectively. This is one of the fundamental challenges of analyzing consumer behavior data. Formally, two items are complements if the demand for one increases when the price of the other decreases; they are substitutes if the demand for one rises when the price of the other item increases. Studying complementary and substitutable items is key to making predictions about (joint) demand.
Notice that being complements is more than the propensity for two items to be co-purchased; items may be purchased together for many reasons. As one reason, two items may be co-purchased because people who like one item tend to also like the other item. For example, baby formula and diapers are often co-purchased but they are not complements—when baby formula is unavailable at the store, it does not affect the customer’s preference for diapers. shopper can disentangle complementarity from other sources of co-purchase because prices change frequently in the dataset. If two items are co-purchased but not complements, increasing the price of one does not decrease the purchase rate of the other. (As shown by Athey and Stern (1998), among others, if there is sufficient variation in the price of each item, with a sufficiently large dataset it is theoretically possible to separate correlated preferences from complementarity; that is, the parameters that describe complementarity are econometrically identified.) Using the item-to-item interaction term in shopper, we develop a measure to quantify how complementary two items are.
Now we turn to substitutes, where purchasing one item decreases the utility of another, e.g., two brands of (otherwise similar) taco shells. Although in principle substitutes can be treated symmetrically to complements (increasing the price of one brand of taco shells increases the probability that the other is purchased if they are substitutes), in practice the two concepts are not equally easy to discover in the data. The difference is that in most shopping data, purchase probabilities are very low for all items, and so most pairs of items are rarely purchased together. In this paper, we introduce an alternative way to find relationships among products, “exchangeability.” Two products are exchangeable if they tend to have similar pairwise interactions with other products; for example, two brands of taco shells would have similar interactions with products such as tomatoes and beans. Of course, products that are usually purchased together, like hot dogs and buns, will also have similar pairwise interactions with other products (such as ketchup). Our results suggest that items that are exchangeable and not complementary tend to be substitutes (in the sense of being in the same general product category in the grocery store hierarchy).
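One plausible way to operationalize exchangeability is to compare the pairwise-interaction profiles of two items, e.g., with cosine similarity; the sketch below uses hypothetical random attributes and this particular similarity as illustrative choices, and the paper's precise measure may differ.

```python
import numpy as np

rng = np.random.default_rng(2)
C, K = 8, 3
alpha = rng.normal(size=(C, K))   # hypothetical latent attributes
rho = rng.normal(size=(C, K))     # hypothetical interaction coefficients
# Make items 0 and 1 near-duplicates, like two brands of taco shells.
alpha[1] = alpha[0] + 0.01 * rng.normal(size=K)
rho[1] = rho[0] + 0.01 * rng.normal(size=K)

def exchangeability(c1, c2):
    """Cosine similarity of the two items' pairwise-interaction profiles."""
    p1 = np.delete(rho[c1] @ alpha.T, [c1, c2])  # rho_{c1}^T alpha_j, j != c1, c2
    p2 = np.delete(rho[c2] @ alpha.T, [c1, c2])
    return p1 @ p2 / (np.linalg.norm(p1) * np.linalg.norm(p2))
```

Near-duplicate items interact with the rest of the store almost identically, so their profiles are nearly parallel.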
2 Related work
shopper relates closely to several lines of research in machine learning and economics. We discuss them in turn.
2.1 Machine learning: Word embeddings and recommendation
One of the central ideas in shopper is that the items in the store have latent vector representations, and we estimate these representations from observed shopping basket data. In developing shopper, we were directly inspired by the neural probabilistic language model of Bengio et al. (2003, 2006). That model specifies a joint probability of sequences of words, parameterized by a vector representation of the vocabulary. In language, vector representations of words (also called “distributed representations”) allow us to reason about their usage and meaning (Harris, 1954; Firth, 1957; Bengio et al., 2003; Mikolov et al., 2013a). Here we expand on this idea to study consumer behavior, and we show how the same mathematical concepts that inspired word representations can help understand large-scale consumer behavior.
We note that neural language models have spawned myriad developments in other so-called word embedding methods for capturing latent semantic structure in language (Mnih and Hinton, 2007; Mnih and Teh, 2012; Mikolov et al., 2013b; Mikolov et al., 2013a; Mikolov, Yih and Zweig, 2013; Mnih and Kavukcuoglu, 2013; Pennington, Socher and Manning, 2014; Levy and Goldberg, 2014; Vilnis and McCallum, 2015; Arora et al., 2016; Barkan, 2016; Bamler and Mandt, 2017). There has been some work on extrapolating embedding ideas to shopping items (Rudolph et al., 2016; Liang et al., 2016; Barkan and Koenigstein, 2016), but these newer methods are not readily appropriate to the analyses here. The reason is that they are conditionally specified models, defined by the conditional distribution of each item given the others. In contrast, shopper
directly models the joint distribution of items; this enables shopper to more easily account for additional information, such as price and customer preferences. From the econometric perspective, casting shopper as a proper generative model enables a direct interpretation as a structural equation model with meaningful parameters. This lets us use shopper to make counterfactual predictions, such as about how price changes will affect firm revenue and consumer welfare.
The way shopper models customer preferences is similar to modern recommender systems, where matrix factorization models are standard. Matrix factorization decomposes observed data of customer/item interactions into latent preferences and item attributes (Canny, 2004; Hu, Koren and Volinsky, 2008; Ma et al., 2011; Wang and Blei, 2011; Gopalan, Hofman and Blei, 2015), potentially with an infinite number of latent parameters (Görür, Jäkel and Rasmussen, 2006; Gopalan et al., 2014). Matrix factorization methods have been used for web services recommendations (Stern, Herbrich and Thore, 2009) and also market basket data (Wan et al., 2017). Though it is effective, matrix factorization does not directly capture item-to-item co-occurrence, which is the main motivation behind models based on latent attributes. shopper includes both attributes and preferences, thereby combining ideas from distributed representations with those from recommendation. shopper also goes further by including price (and price sensitivities) and seasonal effects, and by generally seeking a realistic model of consumer behavior. It is more appropriate than machine learning recommendation methods for evaluating counterfactual questions. Incorporating observed attributes also relates to recommender systems with attributes (see, e.g., Abernethy et al., 2009), although shopper can consider time-varying item attributes such as prices.
2.2 Economics and marketing: Discrete choice and utility maximization
shopper expands on discrete choice models of consumer behavior, models widely used in economics and marketing. The majority of papers in this literature analyze the discrete choice of a consumer who selects a single product from a set of prespecified imperfect substitutes; examples include choices of laundry detergent, personal computers, or cars. As reviewed by Keane et al. (2013), this literature focuses on estimating cross-price elasticities, accounting for (latent) consumer heterogeneity in tastes. A small literature on “market mapping” (Elrod, 1988; Elrod and Keane, 1995; Chintagunta, 1994) considers latent attributes of items within a product category (e.g., laundry detergent). shopper is similar in spirit in its focus on estimating latent characteristics of items using panel data. However, shopper differs in the scale it handles, considering hundreds or thousands of items and hundreds of thousands of shopping trips. This is not simply a matter of improving computational speed; analyzing the whole market for counterfactual prediction requires a more thorough treatment of issues like complementarity, substitutability, and customer heterogeneity. A complementary approach to ours is the work by Semenova et al. (2018), who consider observational high-dimensional product attributes (e.g., text descriptions and images) rather than latent features. Also see our related paper for another approach at this same scale (Athey et al., 2017).
shopper also differs significantly from other discrete choice models in that it considers interaction effects among a large number of items in a consumer’s basket without a priori imposing a structure of product relationships. A variety of datasets track a panel of consumers over time, recording their baskets in each shopping trip. This includes supermarket shopping datasets and online panel datasets of shopping from vendors like Nielsen and comScore, as well as data collected internally by online and offline retailers. However, there is relatively little work in economics and marketing that analyzes consumers’ choices over many items in a basket using this type of data. Even when multiple items are considered, few methods attempt to estimate directly whether items are substitutes or complements; instead, products are typically assumed to be independent of one another across categories, or strong substitutes within categories (in the sense that purchasing one item precludes purchasing others). shopper does not require a prespecified ontology of categories; the interaction effects are estimated directly from the data.
Of course, estimating the extent to which products are substitutes and complements for one another is a hard problem when there are many products: if there are $C$ products, there are $2^C$ possible baskets, and without further restrictions, a consumer’s preferences over baskets thus have $2^C$ parameters. One approach is to make use of more aggregate data combined with functional form assumptions; for example, the almost ideal demand system (Deaton and Muellbauer, 1980) considers a consumer maximizing utility subject to a budget constraint over a period of time, where the share of budget allocated to each good is taken to be continuous. With functional form assumptions, the consumer’s expenditure share for each product can be written as a linear function of transformations of the prices of other products, and the parameters of these demand functions can be interpreted as expressing underlying consumer preferences, including the extent to which products are substitutes or complements. This approach has a number of attractive features, including tractability and computational simplicity, and it can be used to handle many products simultaneously. However, in reality prices often change relatively frequently, and without data about when a consumer shopped or what prices the consumer saw, inferences drawn from aggregate data can be misleading. In addition, more aggregated data loses crucial information about the co-purchases an individual makes on the same trip.
A small set of papers in economics and marketing (Athey and Stern, 1998) attempts to use individual choice data to estimate parameters that describe interaction effects (substitutes or complements) with a very small number of items; see Chintagunta and Nair (2011) and Berry et al. (2014) for recent surveys of the literature. For example, Train, McFadden and Ben-Akiva (1987) treat each bundle as a discrete alternative, but use a nested logit to account for correlation among related bundles, while Gentzkow (2007) incorporates a parameter in the utility for complementarity among two items. Song and Chintagunta (2007) build a utility-maximization framework where consumers select not just whether to purchase, but how much, and apply it to supermarket purchase data for two products, laundry detergent and fabric softener. However, due to computational considerations, most of the papers that deal with disaggregated data have focused on a very small number of products and a small number of customers. With shopper, we are able to jointly model thousands of items and millions of purchased items.
3 A Bayesian model of sequential discrete choice
We develop shopper, a sequential probabilistic model of market baskets. shopper treats each choice as one over the set of available items, where the attributes of each item are latent variables.
We describe shopper in three stages. Section 3.1 describes a basic model of sequential choice with latent item attributes; Section 3.2 extends the model to capture user heterogeneity, seasonal effects, and price; Section 3.3 develops “thinking ahead,” where we model each choice in a way that considers the next choice as well.
shopper comes with significant computational challenges, both because of its complex functional form and the size of the data that we would like to analyze. We defer these challenges to Section 5, where we develop an efficient variational algorithm for approximate posterior inference.
3.1 Sequential choice with latent item attributes
We describe the basic model in terms of its generative process or, in the language of economics and marketing, in terms of its structural model of consumer behavior. Each customer walks into the store and picks up a basket. She chooses items to put in the basket, one by one, each one conditional on the previous items in the basket. The process stops when she purchases the “checkout” item, which we treat as a special item that ends the trip.
From the utility-maximizing perspective, shopper works as follows. First, the customer walks into the store and obtains utilities for each item. She considers all of the items in the store and places the highest-utility item in her basket. (In some applications, it may not be reasonable to model the consumer as considering all items. In a supermarket, a customer might consider one part of the store at a time; in an online store, a customer may only consider the products that appear on a search results page. Corresponding extensions to shopper are straightforward but, for simplicity, we do not include them here.) Modeling customers as choosing the most desirable items in the store first ensures that, among goods that are close substitutes, they select the most desirable one. In the second step, once the first item is in the basket, the customer again considers all remaining items in the store and selects the highest-utility choice. However, relative to the first decision, the utilities of the products change. First, the specification of the utility allows for interaction effects—items may be substitutes (e.g., two brands of taco seasoning) or complements (e.g., taco shells and taco seasoning), in the sense that the utility of an item may be higher or lower as a result of having the first item in the basket. Second, the customer’s utilities have a random component that changes as she shops. This represents, for example, changing ideas about what is needed, or different impressions of products when reconsidering them. (Another extension of the model would be to allow for correlation of the random component over choice events within a shopping trip; we leave this for future work.) The customer repeats this process—adjusting utilities and choosing the highest-utility item among what is left in the store—until the checkout item is the highest-utility item.
We describe this perspective more formally. Consider the $t$th trip and let $n_t$ be the number of choices, i.e., the number of purchased items. Denote the (ordered) basket by $y_t = (y_{t,1}, \ldots, y_{t,n_t})$, where each $y_{t,i}$ is one of $C$ items and choice $y_{t,n_t}$ is always the checkout item. (As an aside, the special checkout item connects to self-delimiting neural networks (Schmidhuber, 2012), where a “halt neuron” is used to signal the end of a sequence.) Let $y_{t,(1:i-1)}$ denote the items that are in the basket up to position $i-1$.
Consider the choice of the $i$th item. The customer makes this choice by selecting the item that has maximal utility $U_{t,c}$, which is a function of the items in the basket thus far. The utility of item $c$ is

$$ U_{t,c}(y_{t,(1:i-1)}) = \Psi_{t,c}(y_{t,(1:i-1)}) + \epsilon_{t,c}. \qquad (2) $$

Here $\Psi_{t,c}(\cdot)$ is the deterministic part of the utility, a function of the other items in the basket. We define $\Psi_{t,c} = -\infty$ for $c \in y_{t,(1:i-1)}$, so that the choice is effectively over the set of non-purchased items. The random variable $\epsilon_{t,c}$ is assumed to follow a zero-mean Gumbel distribution (generalized extreme value type I), which is independent across items.
This behavioral rule—and particularly the choice of Gumbel error terms—implies that the conditional choice probability of an item that is not yet in the basket is a softmax,

$$ p(y_{t,i} = c \mid y_{t,(1:i-1)}) = \frac{\exp\{\Psi_{t,c}(y_{t,(1:i-1)})\}}{\sum_{c'} \exp\{\Psi_{t,c'}(y_{t,(1:i-1)})\}}. \qquad (3) $$
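The equivalence between Gumbel-perturbed utility maximization and the softmax is a standard result, and it can be checked numerically. The sketch below, with made-up deterministic utilities, simulates many choice events and compares the empirical choice frequencies to the softmax probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = np.array([1.0, 0.0, -0.5])              # made-up deterministic utilities
softmax = np.exp(scores) / np.exp(scores).sum()  # implied choice probabilities

# Simulate many choice events: add i.i.d. Gumbel noise and take the argmax.
n = 200_000
noise = rng.gumbel(size=(n, scores.size))
choices = np.argmax(scores + noise, axis=1)
freq = np.bincount(choices, minlength=scores.size) / n
```

With enough draws, `freq` matches `softmax` to within Monte Carlo error.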
Notice the denominator is a sum over a potentially large number of items. For example, Section 6 analyzes data with nearly six thousand items. Section 5 describes fast methods for handling this computational bottleneck.
Eq. 3 gives a model of observed shopping trips, from which we can infer the utility function. The core of shopper is in the unnormalized log probabilities $\Psi_{t,c}$, which correspond to the mean utilities of the items. We assume that they have a log-linear form,

$$ \Psi_{t,c}(y_{t,(1:i-1)}) = \psi_{t,c} + \rho_c^\top \Big( \frac{1}{i-1} \sum_{j=1}^{i-1} \alpha_{y_{t,j}} \Big). \qquad (4) $$
There are two terms. The first term $\psi_{t,c}$ is a latent variable that varies by item and by trip; below we use this term to capture properties such as user heterogeneity, seasonality, and price. We focus now on the second term, which introduces two important latent variables: the per-item interaction coefficients $\rho_c$ and the item attributes $\alpha_c$, both real-valued $K$-vectors. When $\rho_c^\top \alpha_{c'}$ is large, then having item $c'$ in the basket increases the benefit to the customer of placing item $c$ in the basket (the items are complements in the customer’s utility); conversely, when the expression is negative, the items are substitutes.
Unlike traditional factorization methods, the factorization is asymmetric. We interpret $\alpha_c$ as latent item characteristics or attributes; we interpret $\rho_c$ as the interaction of item $c$ with the other items in the basket, as described by their attributes (i.e., their $\alpha$’s). These interactions allow that even though two items might be different in their latent attributes (e.g., taco shells and beans), they still may be co-purchased because they are complements to the consumer. They can also capture that similar items (e.g., two different brands of the same type of taco seasoning) may explicitly not be purchased together—these are substitutable items. Below we also allow the latent attributes to further affect the item’s latent mean utility $\psi_{t,c}$ on the trip.
We also note that the second term has a scaling factor, $1/(i-1)$. This scaling factor captures the idea that in a large basket, each individual item has a proportionally smaller interaction with new purchases. This assumption may be more reasonable in some scenarios than in others; exploring alternative ways to account for basket size is a direction for future work.
Finally, note that we assumed that $\Psi_{t,c}$ is additive in the other items in the basket; this rules out non-linear interactions with other items. Again, this restriction may be more realistic in some applications than in others, but it is possible to extend the model to consider more complex interaction patterns if necessary. In this paper, we only consider linear interaction effects.
3.1.1 Baskets as unordered sets of items
Given Eq. 3, the probability of an ordered basket is the product of the individual choice probabilities. Assuming that $y_{t,n_t}$ is the checkout item, it is

$$ p(y_t \mid \rho, \alpha) = \prod_{i=1}^{n_t} p(y_{t,i} \mid y_{t,(1:i-1)}, \rho, \alpha). \qquad (5) $$
The probabilities come from Eq. 3 and Eq. 4, and we have made explicit the dependence on the interaction coefficients and item attributes. The parameters of this softmax are determined by the interaction vectors and the attributes of the items that have already been purchased. Given a dataset of (ordered) market baskets, we can use this likelihood to fit each item’s latent attributes and interaction coefficients.
In many datasets, however, the order in which the items are added to the basket is not observed. shopper implies the likelihood of an unordered set by summing over all possible orderings,

$$ p(\mathcal{Y}_t \mid \rho, \alpha) = \sum_{\pi} p(y_{t,\pi} \mid \rho, \alpha). \qquad (6) $$

Here $\pi$ is a permutation (with the checkout item fixed to the last position) and $y_{t,\pi}$ is the permuted basket $(y_{t,\pi(1)}, \ldots, y_{t,\pi(n_t-1)}, y_{t,n_t})$. Its probability is given in Eq. 5. In Section 6, we study a large dataset of unordered baskets; this is the likelihood that we use when fitting shopper.
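For small baskets, this sum over orderings can be computed by brute force. The sketch below uses hypothetical random parameters and a basic interaction-only score; item index `CHECKOUT` and the toy dimensions are our illustrative assumptions.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
C, K = 5, 2                      # hypothetical: 5 items, 2 latent dimensions
alpha = rng.normal(size=(C, K))  # latent attributes (placeholders)
rho = rng.normal(size=(C, K))    # interaction coefficients (placeholders)
CHECKOUT = 0                     # illustrative index of the "checkout" item

def choice_prob(c, basket):
    """Softmax probability of choosing c next, given the current basket."""
    rep = alpha[basket].mean(axis=0) if basket else np.zeros(K)
    scores = rho @ rep
    scores[basket] = -np.inf     # items already chosen are unavailable
    p = np.exp(scores - scores.max())
    return (p / p.sum())[c]

def seq_prob(ordered):
    """Probability of an ordered basket: product of choice probabilities."""
    prob, basket = 1.0, []
    for c in ordered:
        prob *= choice_prob(c, basket)
        basket.append(c)
    return prob

def unordered_prob(items):
    """Sum the ordered likelihood over all orderings, checkout fixed last."""
    return sum(seq_prob(list(perm) + [CHECKOUT])
               for perm in itertools.permutations(items))
```

The factorial number of orderings is exactly why the paper's inference algorithm (Section 5) must avoid this brute-force sum for realistic basket sizes.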
3.1.2 Utility maximization of baskets
Here we describe how the sequential model behind shopper relates to a utility maximization model over unordered sets of items, in which the customer has a specific utility for each basket (i.e., with $2^C$ choices, where $C$ is the number of items).
Let $\mathcal{Y}_t$ be the (unordered) set of items purchased in trip $t$. Define a consumer’s utility over unordered baskets as follows:

$$ U_t(\mathcal{Y}_t) = \sum_{c \in \mathcal{Y}_t} \psi_{t,c} + \frac{1}{|\mathcal{Y}_t| - 1} \sum_{c, c' \in \mathcal{Y}_t : \, c \neq c'} \nu_{c,c'}, \qquad (7) $$

where $\nu_{c,c'}$ is a term that describes the interaction between items $c$ and $c'$. Now consider the following model of utility maximization. At each stage where a consumer must select an item, the consumer has an extreme form of myopia, whereby she assumes that she will immediately check out after selecting this item, and she does not consider the possibility that she could put any items in her basket back on the shelf. Other than this myopia, she behaves rationally, maximizing her utility over unordered items as given by Eq. 7.
If (and only if) $\rho_c^\top \alpha_{c'} = \rho_{c'}^\top \alpha_c$ for all pairs of items, the behavioral model of sequential shopping is consistent with the myopic model of utility maximization; this says that the impact of product $c$ on the purchase of product $c'$ is symmetric to the impact of $c'$ on product $c$, holding fixed the rest of the basket. Thus, we can think of utility maximization by this type of myopic consumer as imposing an additional constraint on the probabilistic model. (We do not impose such a symmetry constraint in our model for the reasons outlined below Eq. 4. This follows the common practice in economic modeling of estimating the richer models motivated by theory, but without imposing all the restrictions on the parameters implied by that theory; this approach often simplifies computation. Browning and Meghir (1991) also estimate an econometric model without imposing symmetry restrictions implied by utility maximization, and then impose the restrictions in a second step using a minimum-distance approach.)
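The asymmetry of the factorization, and one simple way to force symmetry, can be illustrated numerically. Setting the interaction coefficients equal to the attributes is only one illustrative way to satisfy the symmetry condition, not the model's specification.

```python
import numpy as np

rng = np.random.default_rng(4)
C, K = 4, 2
alpha = rng.normal(size=(C, K))  # hypothetical latent attributes
rho = rng.normal(size=(C, K))    # hypothetical interaction coefficients

# pairwise[c, c'] = rho_c^T alpha_{c'}: the effect of c' in the basket on c.
pairwise = rho @ alpha.T
asymmetry = np.abs(pairwise - pairwise.T).max()  # generally nonzero

# Forcing rho_c = alpha_c yields the interaction matrix alpha alpha^T,
# which is symmetric and hence consistent with the unordered-utility model.
sym_pairwise = alpha @ alpha.T
sym_gap = np.abs(sym_pairwise - sym_pairwise.T).max()
```

With generic parameters the pairwise matrix is asymmetric, which is exactly why the sequential model is richer than the symmetric utility model.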
Finally, we note that a fully rational consumer with full information about the prices of all products could in principle consider all of the possible bundles in the store simultaneously and maximize over them. However, given that the number of bundles is $2^C$, we argue that considering such a large number of bundles simultaneously is probably not a good approximation to human behavior. Even if consumers are not as myopic as we assume, it is more realistic to assume that they follow some simple heuristics.
3.2 Preferences, seasons, popularity, and price
The basic shopper model of the previous section captures a customer’s sequential choices as a function of latent attributes and interaction coefficients. shopper is flexible, however, in that we can include other forces in the model of customer choice; specifically, they can be incorporated into the (unobserved) mean utility $\psi_{t,c}$ of each item $c$, which varies with the trip and customer. Here we describe extensions that capture item popularity, customer preferences, price sensitivity, and seasonal purchasing habits (e.g., for holidays and growing seasons). All of these factors are important when modeling real-world consumer demand.
3.2.1 Item popularity
We capture overall (time-invariant) item popularity with a latent intercept term $\lambda_c$ for each item. When inferred from data, popular items will have a high value of $\lambda_c$, which will generally increase their choice probabilities.
3.2.2 Customer preferences
In our data, each trip $t$ is associated with a particular customer $u_t$, and that customer’s preferences affect her choices. We model preference with a per-customer latent vector $\theta_u$. For each choice, we add the inner product $\theta_{u_t}^\top \alpha_c$ to the unnormalized log probability of each item. This term increases the probability of the types of items that the customer tends to purchase. Note that shopper shares the attributes $\alpha_c$ with the part of the model that characterizes interaction effects. The inference algorithm finds latent attributes that interact both with customer preferences and with other items in the basket.
3.2.3 Price sensitivity
We next include per-customer price sensitivity. Let $r_{t,c}$ denote the price of item $c$ at trip $t$. We consider that each customer has an individualized price sensitivity to each item, denoted $\gamma_{u,c}$, and we add the term $-\gamma_{u_t,c} \log r_{t,c}$ to the unnormalized log probabilities in Eq. 4. We place a minus sign in the price term to make the choice less likely as the price increases. Further, we constrain $\gamma_{u,c}$ to be positive; this constraint ensures that the resulting price elasticities are negative. (The price elasticity of demand is a measure used in economics of the responsiveness of the quantity demanded of a good, $Q$, to a change in its price, $P$. It is defined as $\frac{\partial Q / Q}{\partial P / P}$.)
Including $\gamma_{uc}$ posits a large number of latent variables (one per customer–item pair) and, moreover, it is reasonable to assume the sensitivities are correlated; e.g., a customer’s sensitivity to peanut butter and almond butter might be similar. Thus we use matrix factorization to model the price sensitivities. Specifically, we decompose the user/item price sensitivity matrix into per-user latent vectors $\gamma_u$ and per-item latent vectors $\beta_c$, where $\gamma_{uc} = \gamma_u^\top \beta_c$. This factorization models the complete matrix of price sensitivities with fewer latent variables.
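A small numpy sketch of this low-rank parameterization follows. The dimensions are arbitrary, and the exponential device for keeping the factors positive is an illustrative assumption (the paper uses gamma priors instead); the point is the reduction in parameter count.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, K = 1000, 500, 10

# Low-rank price sensitivities: gamma[u, c] = gamma_u @ beta_c.
# Positivity (needed for negative elasticities) via exponentiated factors.
gamma_u = np.exp(rng.normal(size=(n_users, K)))
beta_c = np.exp(rng.normal(size=(n_items, K)))
gamma = gamma_u @ beta_c.T                 # full (n_users, n_items) matrix

full_params = n_users * n_items            # unconstrained sensitivities
factored_params = K * (n_users + n_items)  # low-rank parameterization
```

With these toy sizes the factorization uses 15,000 latent variables instead of 500,000, while still producing a strictly positive sensitivity for every user–item pair.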
Finally, instead of the raw price, we use the normalized price, i.e., the price for this trip divided by the per-item mean price. Normalized prices have two attractive properties. First, they allow the sensitivities $\gamma_{uc}$ to be on a comparable scale, avoiding potential issues that arise when different items have prices that vary by orders of magnitude. Second, they ensure that the other parameters in Eq. 4 capture the average outcome distribution: the price term vanishes when the price takes its average value, because the log of the normalized price is then zero.
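A minimal sketch of the normalization, with made-up prices:

```python
import numpy as np

# prices[t, c]: price of item c at trip t (hypothetical values)
prices = np.array([[2.0, 10.0],
                   [4.0, 10.0],
                   [3.0, 10.0]])

# Normalize each item's price by its per-item mean across trips.
normalized = prices / prices.mean(axis=0)

# log(normalized price) is zero when an item sells at its average price,
# so the price term drops out of the utility at average prices.
log_norm = np.log(normalized)
```

Item 1 never deviates from its mean price, so its log normalized price is identically zero; item 0 sells at its mean on the third trip, so that entry is zero as well.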
3.2.4 Seasonal effects
We complete the model with seasonal effects. shopper is designed to estimate the effect of counterfactual changes in policy, such as prices. So, it is important that the parameters of the model associated with price represent the true causal effect of price on customer choices. Seasonal effects are a potential confounder—the season simultaneously affects the price and the demand of an item—and so neglecting to control for seasonality can lead to estimates of the latent variables that cannot isolate the effects of changing prices. For example, the demand for candy goes up around Halloween and the supermarket may decide to put candy on sale during that period. The demand increases partly because of the price, but also because of the time of year. Controlling for seasonal effects isolates the causal effect of price.
We assume that seasonal effects are constant for each calendar week (Athey et al. (2017) conduct a series of empirical studies that support the idea that controlling for week effects is sufficient to identify the causal effect of price) and, as for the price effect, we factorize the week-by-item seasonal effects matrix. Denote the week of trip $t$ as $w_t$. We posit per-week latent vectors $\delta_w$ and per-item latent vectors $\mu_c$, and we include a term in Eq. 4 that models the effect of week $w_t$ on item $c$’s probability, $\delta_{w_t}^\top \mu_c$. Note this allows correlated seasonal effects across items: items with similar vectors $\mu_c$ will have similar seasonal effects, so Halloween candy and pumpkin cookies might share similar vectors.
3.2.5 Putting it together
We described popularity, preferences, price, and season. We combine these effects in the per-item per-trip mean utility $\psi_{tc}$,
$$\psi_{tc} = \lambda_c + \theta_{u_t}^\top \alpha_c - \left(\gamma_{u_t}^\top \beta_c\right) \log r_{tc} + \delta_{w_t}^\top \mu_c.$$
The mean utility $\psi_{tc}$ is used in the probabilities for all choices of trip $t$. With these extensions we have introduced several new latent variables to infer from data: item popularities $\lambda_c$, customer preferences $\theta_u$, price sensitivity factorizations $\gamma_u$ and $\beta_c$, and seasonal factorizations $\delta_w$ and $\mu_c$.
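A hedged sketch of how the four terms (popularity, preference, price, season) combine into one mean utility. The variable names and all numbers are illustrative assumptions, not fitted values.

```python
import numpy as np

def mean_utility(lam_c, theta_u, alpha_c, gamma_u, beta_c,
                 delta_w, mu_c, log_price):
    """Mean utility = popularity + preference - price term + seasonal term."""
    return (lam_c
            + theta_u @ alpha_c
            - (gamma_u @ beta_c) * log_price
            + delta_w @ mu_c)

K = 3
args = dict(lam_c=0.5,
            theta_u=np.ones(K), alpha_c=np.ones(K),
            gamma_u=np.full(K, 0.2), beta_c=np.ones(K),
            delta_w=np.zeros(K), mu_c=np.ones(K))
psi_avg = mean_utility(log_price=0.0, **args)   # item at its average price
psi_high = mean_utility(log_price=0.5, **args)  # marked-up price
```

At the average price the log normalized price is zero and the price term drops out; a marked-up price lowers the utility through the positive sensitivity factors.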
3.3 Thinking ahead
As a final extension, we develop a model of customers that “think ahead.” Consider a pair of items $c$ and $c'$, where both $\rho_c^\top \alpha_{c'}$ and $\rho_{c'}^\top \alpha_c$ are high, so that the utility of $c$ increases when $c'$ is in the basket and vice versa. In this case, we say that the goods are complements to the consumer. Accurately accounting for complementarity is particularly important when estimating counterfactuals based on different prices; theoretically, increasing the price of one item should lower the probability of the other.
In this scenario, our baseline model specifies that the consumer considers the goods one at a time. The effect of a high price for item $c$ is that it reduces the probability that $c$ is chosen (and, if chosen, it is less likely to be chosen early in the ordering). Not having item $c$ in the basket at each new consideration event reduces the attractiveness of item $c'$. However, a more realistic model has the consumer anticipate, when she considers item $c$, that she might also consider item $c'$. When the consumer considers the pair, the high price of one item deters the purchase of both.
We address this issue by introducing thinking ahead. When considering the item at step $i$, the consumer looks ahead to the item she may purchase at step $i+1$. We emphasize that the model does not, at this point, assume that the consumer actually purchases the next item according to that rule. When the consumer comes to consider what to choose for step $i+1$, she follows the same think-ahead-one-step rule just described.
Thinking ahead adjusts the unnormalized log probabilities of the current item in Eq. 4. Specifically, it adds a term to the log linear model that is the utility of the optimal next item. (Note this could be the checkout item.) The unnormalized log probability of item $c$ at step $i$ becomes its own utility plus the maximum, over candidate next items $c'$, of the utility of $c'$ given the hypothetical basket that contains the first $i-1$ purchased items and item $c$. Since the customer has in mind a tentative next item when deciding about the current item, we call this “thinking one step ahead.”
Note that thinking ahead assumes that the next item $c'$ will itself be selected without looking further ahead; that is, $c'$ is chosen purely based on its own utility, without accounting for the interaction effects of $c'$ with future purchases. The thinking-ahead idea can be extended to two or more steps ahead, but at the expense of higher computational complexity. See Section 4 for an illustrative example that clarifies how the thinking-ahead procedure works.
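The one-step look-ahead can be sketched as follows. Here `interaction[c, j]` is a hypothetical stand-in for the item-to-item interaction term, and the myopic choice of the next item mirrors the description above; this is an illustration, not the paper's exact scoring rule.

```python
import numpy as np

def lookahead_score(c, psi, interaction, basket):
    """One-step think-ahead score for item c.

    psi[c]            : base utility of item c given the current basket
    interaction[c, j] : boost to item j's utility from c being in the basket
    The score adds the utility of the best *next* item, assuming the
    customer tentatively adds c and then picks one more item myopically.
    """
    candidates = [j for j in range(len(psi)) if j != c and j not in basket]
    next_utils = [psi[j] + interaction[c, j] for j in candidates]
    return psi[c] + max(next_utils)

# Toy complements: items 0 and 1 boost each other; item 2 is a loner.
psi = np.array([1.0, 1.0, 1.2])
interaction = np.zeros((3, 3))
interaction[0, 1] = interaction[1, 0] = 2.0

score_0 = lookahead_score(0, psi, interaction, basket=set())
score_2 = lookahead_score(2, psi, interaction, basket=set())
```

Even though item 2 has the highest stand-alone utility, item 0 scores higher once the customer anticipates adding its complement next.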
The thinking-ahead mechanism is an example of inferring agents’ preferences from their behavior in a dynamic model, a type of exercise with a long history in applied econometrics (e.g., Wolpin (1984); Hotz and Miller (1993)). From the machine learning perspective, it closely resembles the motivation behind inverse reinforcement learning (Russell, 1998; Ng and Russell, 2000). Inverse reinforcement learning analyzes an agent’s behavior in a variety of circumstances and aims to learn its reward function. (This is in contrast to reinforcement learning, where the reward function is given.) To see the connection, consider the unnormalized log probability of Eq. 9 as the reward for choosing an item and then acting optimally for one step (i.e., choosing the best next item). With this perspective, fitting shopper to observed data is akin to learning the reward function.
Finally, note the thinking-ahead model also connects to utility maximization over unordered baskets, with the utility function given in Eq. 7. Now the model is consistent with a consumer who maximizes her utility over unordered items at each step, but where at each step she is myopic in two ways. First, as before, she does not consider that she can remove items from her cart. Second, she assumes that she will buy exactly one more item in the store and then check out, and that the next item will be the one that maximizes her utility over unordered baskets (taking as given her cart at that point, including the latest item added, as well as the belief that she will then check out). Again, our probabilistic model is consistent with this interpretation if and only if the interaction effects are symmetric, $\rho_c^\top \alpha_{c'} = \rho_{c'}^\top \alpha_c$ for all pairs of items (a constraint that we do not enforce).
3.4 Full model specification
We specified shopper, a sequential model of shopping trips. Each trip $t$ is a tuple containing a collection of purchased items ($y_t$), the customer who purchased them ($u_t$), the week the shopping trip took place ($w_t$), and the prices of all items ($r_t$). The likelihood of each shopping trip captures the sequential process of the customer choosing items (Eq. 3), where each choice is a log linear model (Eq. 4). shopper includes terms for item popularity, customer preferences, price sensitivity, and seasonal effects (Eq. 8). We further described a variant that models the customer thinking ahead and acting optimally (Eq. 9). Though this is a model of ordered baskets, we can use it to evaluate the probability of an unordered basket of items (Eq. 6).
shopper is based on log linear terms that involve several types of latent parameters: per-item interaction coefficients $\rho_c$, per-item attributes $\alpha_c$, per-item popularities $\lambda_c$, per-user preferences $\theta_u$, per-user per-item price sensitivities $\gamma_u^\top \beta_c$, and per-week per-item seasonal effects $\delta_w^\top \mu_c$. Our goal is to estimate these parameters given a large data set of trips.
We take a Bayesian approach. We place priors on the parameters to form a Bayesian model of shopping trips, and then we form estimates of the latent parameters by approximating the posterior. We use independent Gaussian priors for the real-valued parameters $\rho_c$, $\alpha_c$, $\lambda_c$, $\theta_u$, $\delta_w$, and $\mu_c$. We use gamma priors for the positive-valued parameters associated with price sensitivity, $\gamma_u$ and $\beta_c$. With the approximate posterior, we can use the resulting estimates to identify various types of purchasing patterns and to make predictions about the distribution of market baskets; in particular, predictions under price changes.
4 Illustrative simulation
We now describe a simulation of customers to illustrate how shopper works. The purpose of our simulation study is to show shopper’s ability to handle heterogeneous purchases, complements, and counterfactual settings (i.e., interventions on price). More specifically, we ask whether the model can disentangle correlated preferences and complements, both of which might contribute to co-purchasing. We illustrate that distinguishing between these two types of relationships relies on the thinking-ahead property from Section 3.3.
We simulate the following world.
There are eight different items: coffee, diapers, ramen, candy, hot dogs, hot dog buns, taco shells, and taco seasoning.
Customers have correlated preferences, and there are two types of customers: new parents and college students. New parents frequently buy coffee and diapers; college students frequently buy ramen and candy. By correlated preferences we mean that a customer’s type determines which items she considers; within a type, each preferred item (e.g., coffee and diapers for a new parent) is decided on independently, based on its price. But note that new parents never buy ramen or candy, regardless of price, and college students never buy coffee or diapers.
The other items represent complementary pairs. In addition to their preferred items, each customer also buys either hot dogs and hot dog buns or taco shells and taco seasoning. In this imaginary world, customers never buy just one item in the complementary pair, and they always buy one of the pairs (but not both pairs).
Customers are sensitive to price. When the price of a preferred item is low (e.g., coffee for a new parent), they buy that item with high probability; when the price of a preferred item is high, they buy it with lower probability. Each customer decides on buying their preferred items independently. That is, a new parent makes independent decisions about coffee and diapers (each based on their respective prices), and similarly for college students and their choices of candy and ramen.
Sensitivity to the price of complementary pairs is different, because a high price for one of the items in a pair (e.g., hot dogs) lowers the probability of purchasing the pair as a whole. Specifically, when the prices of all complementary pairs are low, each customer purchases one pair or the other with equal probability. When one item (e.g., hot dogs) has a high price, each customer buys the lower-priced pair (taco shells and taco seasoning, in this case) with high probability and the pair containing the high-priced item with low probability.
With this specification, we simulate customers of both types (new parents and college students) and many trips per customer. For each trip, the first four items (coffee, diapers, ramen, candy) each have some chance of being marked up to a high price. Further, there is some chance that one of the items in the complementary pairs (hot dogs, hot dog buns, taco shells, taco seasoning) is marked up. At most one of the items in the complementary pairs has a high price.
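A toy simulator along these lines might look like the following. The purchase probabilities (0.9 when cheap, 0.5 when marked up) and the rule of always choosing the cheaper pair are simplifying assumptions, since the exact values are not reproduced here.

```python
import random

PREFERRED = {"parent": ["coffee", "diapers"], "student": ["ramen", "candy"]}
PAIRS = [("hot dogs", "hot dog buns"), ("taco shells", "taco seasoning")]

# The probabilities below and the deterministic cheaper-pair rule are
# placeholder assumptions for illustration.
def simulate_trip(customer_type, high_priced, rng):
    basket = [item for item in PREFERRED[customer_type]
              if rng.random() < (0.5 if item in high_priced else 0.9)]
    cheap_pairs = [p for p in PAIRS if not (set(p) & high_priced)]
    pair = rng.choice(cheap_pairs) if cheap_pairs else rng.choice(PAIRS)
    basket.extend(pair)  # customers always buy exactly one full pair
    return basket

rng = random.Random(0)
basket = simulate_trip("parent", high_priced={"taco shells"}, rng=rng)
```

With taco shells marked up, this new parent's basket always contains the hot dog pair, never the taco pair, and never the student-only items.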
Given the simulated data, we fit a shopper model that contains terms for latent attributes, user preferences, and price. For simplicity, we omit seasonal effects here. (The approximate inference algorithm is described below in Section 5.) One of our goals is to confirm the intuition that thinking ahead correctly handles complementary pairs of items. Thus we fit shopper both with and without thinking ahead.
Consider a shopping trip where the prices of coffee and taco shells are high, and consider a customer who is a new parent. As an example, this customer may buy diapers, hot dogs, hot dog buns, and then check out. Table 1 shows the predicted probabilities of each possible item at each stage of the shopping trip. They are illustrated for the models with and without thinking ahead.
First consider the preference items in the first stage, before anything is placed in the basket. The customer has a high probability of purchasing diapers, and low probabilities of purchasing coffee, ramen, and candy. The probabilities for ramen and candy are low because of the type of customer (new parent); the probability for coffee is low because of its high price.
Now consider the complementary pairs. In the model without thinking ahead, the customer has high probability of buying hot dogs, hot dog buns, and taco seasoning; because of its high price, she has a low probability of buying taco shells. But this is incorrect. Knowing that the price of taco shells is high, she should have a lower probability of buying taco seasoning because it is only useful to buy along with taco shells. The thinking-ahead model captures this, giving both taco seasoning and taco shells a low probability.
Subsequent stages further illustrate the intuitions behind shopper. First, each stage zeros out the items that are already in the basket (e.g., at stage 2, diapers have probability zero). Second, once one item of a complementary pair is bought, the probability of the other half of the pair increases and the probability of the alternative pair becomes low. In this case, once the customer buys hot dogs, the probability of the taco products goes to zero and the probability of hot dog buns increases.
[Table 1: predicted probabilities of each item at each stage of the example shopping trip (stage 1: diapers; stage 2: hot dogs; stage 3: hot dog buns; stage 4: checkout), for the models with and without thinking ahead.]
As a final demonstration on simulated data, we generate a test set from the simulator with additional shopping trips for each customer. On this test set, we “intervene” on the price distribution: the probability of a preference item having a high price differs from training, and one of the four complementary items always has a high price. On this test set, a better model will provide higher held-out log probability, and we confirm that thinking ahead helps: the thinking-ahead model gives a higher average held-out log probability than the model without thinking ahead.
5 Posterior inference
Section 3 defined shopper, a Bayesian model of market baskets. Given a large data set, our goal is to infer the latent parameters, the components that govern item attributes, user preferences, price sensitivity, and seasonal effects. We then use the inferred parameters to make predictions, e.g., about demand, and to characterize patterns of purchase behavior. In this section we describe a variational inference algorithm to solve this problem.
For conciseness, we denote the collection of latent variables as $z$, the observed shopping baskets as $y$, and the observed characteristics of the shopping trips as $x$, where $x_t$ indicates the customer, calendar week, and item prices for the $t$th trip. The posterior is
$$p(z \mid y, x) = \frac{p(z)\, p(y \mid z, x)}{p(y \mid x)}.$$
It is difficult to compute the posterior in closed form because of the intractable denominator, the marginal likelihood of the observed data. Further, as we described in Section 3
, the likelihood term is itself intractable because of the sum over the large number of items in the denominator of the softmax function. We must use approximate Bayesian inference.
Variational inference (Jordan et al., 1999; Wainwright and Jordan, 2008) is an alternative to MCMC for approximating otherwise intractable Bayesian posteriors. Compared to MCMC, it more easily scales to large data, and especially so in non-conjugate models such as shopper.
Variational inference approximates the posterior with a parameterized family of distributions $q(z; \nu)$ over the latent variables; the parameters $\nu$ are called variational parameters. The idea is to find the member of the family that is closest to the exact posterior, where closeness is measured by the KL divergence. Minimizing the KL divergence is equivalent to maximizing the ELBO,
$$\mathcal{L}(\nu) = \mathbb{E}_{q}\!\left[\log p(z, y \mid x)\right] - \mathbb{E}_{q}\!\left[\log q(z; \nu)\right],$$
where the expectations are taken with respect to the variational distribution.
The ELBO provides a lower bound on the marginal likelihood $\log p(y \mid x)$, and hence its name (evidence lower bound). Its maximizer, $\nu^{*}$, also minimizes the KL divergence. When we use variational inference, we first optimize the ELBO and then use the resulting distribution $q(z; \nu^{*})$ as a proxy for the exact posterior, e.g., to investigate the inferred hidden structure or to approximate the posterior predictive distribution. In summary, variational inference uses optimization to approximate the posterior. For a review of variational inference, see Blei, Kucukelbir and McAuliffe (2017).
To develop a variational inference algorithm for shopper, we first specify the variational family and then develop the optimization procedure for fitting its parameters. Appendix A gives all the details; here we sketch the main ideas.
The first step in deriving a variational inference algorithm is to specify the variational family $q(z; \nu)$. Following many successful applications of variational inference, we posit the mean-field family, where each latent variable is governed by its own distribution and endowed with its own variational parameter. We use Gaussian variational factors for the latent variables with Gaussian priors, and gamma variational factors for the latent variables with gamma priors. Though the mean-field variational family makes strong independence assumptions, we emphasize that the latent variables are not identically distributed, and thus the variational family is still very flexible. Thus, for example, the variational distribution of the latent attributes of peanut butter will peak at one point in attribute space; the latent attributes of taco shells will (likely) peak at a different point.
Given the variational family, we next formulate how to optimize Eq. 11 with respect to the variational parameters; we use gradient-based stochastic optimization. This problem is complicated for several reasons: the data set sizes are large, the expectations are intractable, and the likelihood is expensive to calculate. To overcome these obstacles, we bring together a variety of recent innovations around variational inference, summarized below and detailed in Appendix A.
The first issue is that the data sets are large. The gradient contains a term for every item purchased (millions of terms for the experiments of Section 6), and this is too expensive to be practical, especially because we need to calculate it at each iteration. Thus we lean on stochastic optimization (Robbins and Monro, 1951; Blum, 1954; Bottou, Curtis and Nocedal, 2016), where we follow cheaper-to-compute unbiased noisy estimates of the gradient. Following Hoffman et al. (2013), we calculate noisy gradients by subsampling from the data and taking a scaled gradient relative to the subsample.
The second issue is that, even setting aside the large data, the expectations in Eq. 11 are analytically intractable. Again we sidestep this issue with stochastic optimization, devising the gradient itself as an expectation and then forming unbiased noisy gradients with Monte Carlo approximations. In particular, we use the reparameterization gradient (Kingma and Welling, 2014; Titsias and Lázaro-Gredilla, 2014; Rezende, Mohamed and Wierstra, 2014) and its generalizations (Ruiz, Titsias and Blei, 2016; Naesseth et al., 2017).
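A self-contained sketch of the reparameterization gradient for a single Gaussian variational factor follows; the objective is a toy function, not the shopper ELBO.

```python
import numpy as np

rng = np.random.default_rng(0)

# Gaussian variational factor q(z) = N(mu, sigma^2), toy objective
# E_q[f(z)] with f(z) = -(z - 2)^2. Reparameterize z = mu + sigma * eps,
# eps ~ N(0, 1); then d/d(mu) E_q[f(z)] = E[f'(z)], estimated by
# Monte Carlo over samples of eps.
mu, sigma = 1.5, 0.8
df = lambda z: -2.0 * (z - 2.0)   # f'(z)

eps = rng.standard_normal(100_000)
z = mu + sigma * eps
grad_mu_estimate = df(z).mean()

# Exact value: E_q[f] = -((mu - 2)^2 + sigma^2), so d/d(mu) = -2(mu - 2) = 1.
```

The Monte Carlo estimate concentrates around the exact gradient of 1.0, which is what makes reparameterization gradients usable inside stochastic optimization.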
The third issue is that we need to evaluate the probability of unordered baskets (Eq. 6), which involves an expensive summation over permutations of items, even for moderately-sized baskets. We address this by deriving a variational bound using Jensen’s inequality, which resembles the variational inference procedure for the Indian buffet process (Doshi-Velez et al., 2009). After applying this bound, we can leverage stochastic variational inference by subsampling permutations (Hoffman et al., 2013).
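The flavor of this bound can be seen in a small sketch: for a sum over the $n!$ orderings of a basket, Jensen's inequality gives $\log \sum_\sigma p(\sigma) \geq \log n! + \mathbb{E}_{\text{unif}}[\log p(\sigma)]$, and the expectation can be estimated by subsampling permutations. This is a generic illustration of the device, not the paper's exact bound.

```python
import math
import random

def jensen_lower_bound(log_p_sigma, items, n_samples, rng):
    """Estimate the Jensen bound log(sum_sigma p(sigma)) >=
    log(n!) + E_unif[log p(sigma)] by subsampling random permutations."""
    total = 0.0
    for _ in range(n_samples):
        sigma = items[:]
        rng.shuffle(sigma)
        total += log_p_sigma(sigma)
    return math.lgamma(len(items) + 1) + total / n_samples

# Toy check: if p(sigma) = 1/n! for every ordering, the bound is tight
# and equals log(sum_sigma p(sigma)) = log 1 = 0.
n = 4
log_p = lambda sigma: -math.lgamma(n + 1)
bound = jensen_lower_bound(log_p, list(range(n)), 10, random.Random(0))
```

Subsampling a handful of permutations gives an unbiased estimate of the bound without enumerating all $n!$ orderings.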
The final computational issue is how to evaluate likelihoods of the form in Eq. 3. This is expensive because of the number of items to sum over in the denominator. We solve the problem with the approach developed by Titsias (2016), which derives a bound for which we can obtain cheap unbiased estimates of the gradient. (Alternatively, we could adapt the approach of Ruiz et al. (2018).)
Our implementation of shopper in C++ is publicly available at https://github.com/franrruiz/shopper-src.
6 Empirical Study
We now use shopper to study a real-world data set of consumers. Our dataset contains 570,878 baskets from a large grocery store, anonymized for data privacy reasons (Che, Chen and Chen, 2012). (The dataset is available to researchers at Stanford and Berkeley by application and has been used previously in other research papers; see https://are.berkeley.edu/SGDC.) These baskets contain 5,968,528 purchases of 5,590 unique items. The data spans 97 weeks and 3,206 different customers. We split the baskets into a training, validation, and test set. The test set contains all the baskets in the last two months of the data collection period; the validation set contains a fraction of the remaining purchases, chosen at random.
We study the predictive performance of shopper both observationally and under price intervention. Note that the store changes prices regularly, and this variation allows us to empirically separate correlated latent preferences from price effects, once we control for seasonality.
We also demonstrate how to use shopper to qualitatively analyze the data. It finds interpretable latent features, which can be used to find complementary and exchangeable items.
We first focus on category-level data, where we group individual items—according to their UPC—into their category (such as “oranges,” “flour,” “pet supplies,” etc.), which yields a much smaller set of unique categories. The category-level data contains fewer items and is computationally less demanding for any model, so we use these data to compare the predictive performance of shopper to other models. We obtain better predictive accuracy (in terms of test log-likelihood) than the competing methods, especially under price interventions.
Second, we focus on UPC-level data, with all 5,590 items. Based only on posterior inference with the observed basket data, shopper identifies characteristics of items, including complements and exchangeable items. We define quantities to measure these concepts, and we find that exchangeability is a good proxy for finding substitutes.
6.1 Category-level data
We compare shopper to other latent factor models of shopping data. In particular, we apply HPF (Gopalan, Hofman and Blei, 2015) and exponential family embeddings (Rudolph et al., 2016). HPF focuses on user preferences; exponential family embeddings focuses on item-to-item interactions. Table 2 summarizes the differences between shopper and these models; note that none of the comparison models captures price or seasonal effects.
We also fitted two simpler models. First, we studied a multinomial logistic regression model that predicts an item conditioned on the rest of the items in the basket and the item prices. The inputs are binary indicators of the items in the basket excluding the target item, as well as the logarithm of the normalized prices of all the items. Second, we studied factor analysis (Cattell, 1952) on the basket-by-item matrix. However, we do not report the results of these two models in the paper because they gave worse predictions than all other methods on the test set (corresponding to the last two months of data). (On the validation set, the multinomial logistic regression model performed slightly better than exponential family embeddings but worse than HPF; we emphasize that multinomial logistic regression takes the item prices as inputs, in contrast to exponential family embeddings and HPF.)
[Table 2: summary of the models — B-Emb (Rudolph et al., 2016; binary data), P-Emb (Rudolph et al., 2016; count data), HPF (Gopalan, Hofman and Blei, 2015; count data), and shopper (this paper; binary data), which alone includes terms for item interactions, customer preferences, price, and seasonal effects.]
6.1.1 Quantitative results
We fit each model to the category-level data. In shopper, we set most of the Gaussian hyperparameters to zero mean and unit variance. (The seasonal effect hyperparameters have a smaller variance, 0.01, because we do not expect large seasonal effects.) The price sensitivity parameters have gamma priors, whose shape and rate we set to fixed values. As for the comparison models, we modify HPF to allow for multiple shopping trips by the same user; in its original construction it can only capture a single trip. Finally, we implement two versions of exponential family embeddings: Bernoulli embeddings for binary data (labeled “B-Emb”) and Poisson embeddings for count data (labeled “P-Emb”). We down-weight the zeros, as suggested by Rudolph et al. (2016). For all methods, we use the validation set to assess convergence.
To choose the number of latent factors $K$, we first fix the number of factors for the price and seasonal vectors and run shopper for several values of $K$. We choose $K = 100$ because it provides the best predictions on the validation set. For that value of $K$, we then explore several settings for the number of latent factors of the price and seasonal vectors, again choosing the best-performing value on the validation set. We use the same $K$ for HPF and exponential family embeddings.
To evaluate the models, we calculate the average log-likelihood of the test-set items. For each item, we calculate its probability conditioned on the other observed items in the basket. Higher log probability indicates a better model fit. Table 3 shows the results; the numbers in parentheses show the standard deviation, obtained via bootstrap on the test samples.
We study several types of test sets. The second column corresponds to a typical test set, containing two months of data. Columns 3–5 focus on skewed test sets, where the target items have more extreme prices with respect to their average price; to control for seasonal effects, we consider items whose price falls outside progressively wider ranges around the per-month average price. These evaluations are suggestive of the performance under price intervention, i.e., where the distribution of prices is different in testing than in training. The numbers in parentheses in the table heading indicate the number of purchases considered in each column.
We report results that incrementally add terms to the basic shopper model. The most basic model contains user information with item-to-item interactions (“I+U”); it improves predictions over the competing models. The next model adds seasonal effects (“I+U+S”); it improves performance on the typical test set only. (To compute predictions for the models with seasonal effects, we set the seasonal parameter for the weeks in the test set equal to the value of the parameter for the same week one year earlier.) We also consider a model that includes price sensitivity (“I+U+P”); it further improves performance. The full model adds seasonal effects to it (“I+U+P+S”); in general it gives the best performance. As expected, modeling price elasticity is important in making counterfactual predictions. The performance gap between the models with and without price increases as the test-set prices diverge from their average values.
Table 3 compares models with different numbers of latent parameters. (Table 4 gives the number of latent parameters in each model.) Exponential family embeddings posit two $K$-length vectors for each item. HPF has one $K$-length vector for each user and each item. shopper has one $K$-length vector for each user and two $K$-length vectors for each item, in addition to the intercept terms and the price and seasonal components.
[Table 3: average test log-likelihood (standard deviations in parentheses) for B-Emb (Rudolph et al., 2016), P-Emb (Rudolph et al., 2016), HPF (Gopalan, Hofman and Blei, 2015), and the shopper variants, on each test set.]
[Table 4: number of latent parameters (per user, per item, per week, and total) for B-Emb, P-Emb, HPF, and shopper.]
[Table 5: average per-item test log-likelihood over three items and over entire baskets, for the models with and without thinking ahead.]
Finally, we study the empirical performance of “thinking ahead,” particularly when predicting groups of items. Table 5 shows two metrics: the first column is the average (per-item) test log-likelihood over three items, conditioned on the rest of items in the basket; the second column is average (per-item) log-likelihood over the entire basket. (Here we exclude the checkout item, and we compute the predictions based on the ordering in which items are listed in the test set.) The model with “thinking ahead” more correctly handles complements and price sensitivity, and it provides better predictive performance.
6.1.2 Qualitative results
shopper provides a better predictive model of consumer behavior. We now use the fitted model to qualitatively understand the data.
First, we assess the attribute vectors $\alpha_c$, confirming that they capture meaningful dimensions of the items. (Recall that each $\alpha_c$ is a 100-dimensional real-valued vector.) As one demonstration, we project them onto a 2-dimensional space using tSNE (van der Maaten and Hinton, 2008) and then examine the items in different regions of the projected space. Figure 5 shows two particular regions: one collects different types of pet food and supplies; the other collects different cleaning products. As a second demonstration, we can use the cosine distance to find items similar to a “query item.” Table 6 shows the three most similar items to a set of queries.
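The query-by-cosine-similarity step can be sketched as follows; the attribute vectors here are two-dimensional toy values, not fitted ones.

```python
import numpy as np

def most_similar(query, attributes, names, top=3):
    """Rank items by cosine similarity of their attribute vectors."""
    a = attributes / np.linalg.norm(attributes, axis=1, keepdims=True)
    sims = a @ a[names.index(query)]          # cosine similarity to the query
    order = [i for i in np.argsort(-sims) if names[i] != query]
    return [names[i] for i in order[:top]]

# Hypothetical 2-dimensional attribute vectors for illustration only.
names = ["granulated sugar", "flour", "brown sugar", "cat litter"]
alpha = np.array([[1.0, 0.1], [0.9, 0.2], [1.0, 0.15], [-0.8, 1.0]])
top = most_similar("granulated sugar", alpha, names)
```

The baking items cluster near the query while the unrelated item lands last, mirroring the kind of neighborhoods reported in Table 6.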
| mollusks | organic vegetables | granulated sugar | cat food dry/moist |
|---|---|---|---|
| finfish all other - frozen | organic fruits | flour | cat food wet |
| crustacean non-shrimp | citrus | baking ingredients | cat litter & deodorant |
| shrimp family | cooking vegetables | brown sugar | pet supplies |
[Table 7: highest and lowest seasonal effects for Halloween candy, cherries, and turkey - frozen.]
Other latent variables reveal different aspects of consumer behavior. For example, Table 7 shows the highest and lowest seasonal effects for a set of items. The model correctly captures that Halloween candy is more popular near Halloween and that turkey is more popular near Thanksgiving. It also captures the seasonal availability of fruits, e.g., cherries.
These investigations are on the category-level analysis. For more fine-grained qualitative assessments—especially those around complementarity and exchangeability—we now turn to the UPC-level model.
6.2 UPC-level data
We fit shopper to the UPC-level data, which contains 5,590 unique items. We use the same dimensionality of the latent vectors as in Section 6.1, i.e., $K = 100$ for $\rho_c$, $\alpha_c$, and $\theta_u$, and a smaller number of latent features for the seasonal and price vectors. We additionally tie the price vectors and seasonal effect vectors across all items in the same category. To speed up computation, we fit this model without thinking ahead.
We can again find items similar to “query” items using the cosine distance between attribute vectors $\alpha_c$. Table 8 shows similar items for several queries; the model identifies qualitatively related items.
For another view, Figure (a) shows a two-dimensional tSNE projection (van der Maaten and Hinton, 2008) of the attribute vectors. This figure colors the items according to their group (groups are defined as one level of hierarchy above categories; some examples of groups are “jams, jellies, and spreads,” “salty snacks,” and “canned fruits”), and it reveals that items in the same category are often close to each other in attribute space. When groups are mixed in a region, they tend to be items that appear in similar contexts, e.g., hot dogs, hamburger buns, and soda (Figure (b)).
| Dentyne ice gum peppermint | california avocados | Coca Cola classic soda fridge pack 1 |
|---|---|---|
| Wrigleys gum orbit white peppermint | tomatoes red tov/cluster | Sprite soda fridge pack |
| Dentyne ice shivermint | apples fuji medium 100ct | Coca Cola classic fridge pack 2 |
| Dentyne ice gum spearmint | tomatoes roma red | Coca Cola soda cherry fridge pack |
6.2.1 Substitutes, complements, and exchangeability metrics
A key objective for applications of shopper is to estimate interaction effects among products. These effects are described by the interaction coefficients and the attribute vectors. When the two interaction terms between a pair of items are large and positive, purchasing one item increases the consumer's preference for the other, and vice versa. When these two terms are negative and large, the items may be substitutes: putting one item in the basket reduces the need for the other item.
All else equal, complements are relatively likely to be purchased together, while substitutes are less likely to be purchased together. In addition, for complementary items c and c′, when the price of c increases, the customer is less likely to purchase item c′. We define the complementarity metric for a pair of items as the increase in the probability of purchasing one item when the other is the only item in the basket, relative to its probability when the basket is empty.
Most shopping data is sparse (each individual item has a low probability of purchase), and so it can be difficult to accurately identify pairs of items that are substitutes. We introduce a new metric, which we call exchangeability, that can help identify substitutes. For a pair of items (c, c′), our notion of exchangeability depends on the distributions over items that are induced when conditioning on c or on c′; if those distributions are similar, then we say that c and c′ are exchangeable.
Let p(· | c) denote the distribution over items given that item c is the only item in the basket. (In the definition of p(· | c) and p(· | c′), we zero out the probabilities of items c, c′, and checkout.) We measure the similarity between the distributions p(· | c) and p(· | c′) with the symmetrized KL divergence. The exchangeability metric is
E(c, c′) = KL( p(· | c) ‖ p(· | c′) ) + KL( p(· | c′) ‖ p(· | c) ).
With this definition, two items that are intuitively exchangeable, such as two different prepared sandwiches or two yogurts of the same brand, will exhibit smaller values of the exchangeability metric.
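The symmetrized KL computation behind the exchangeability metric can be sketched as follows. This is our illustration, not the paper's code: the conditional distributions below are hypothetical numbers, whereas in practice they come from the fitted model (with the conditioning items and checkout zeroed out).

```python
import math

def symmetrized_kl(p, q):
    """Symmetrized KL divergence between two discrete distributions."""
    kl_pq = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    kl_qp = sum(qi * math.log(qi / pi) for pi, qi in zip(p, q) if qi > 0)
    return kl_pq + kl_qp

def exchangeability(p_given_a, p_given_b):
    """Lower values indicate the two conditioning items are more exchangeable."""
    return symmetrized_kl(p_given_a, p_given_b)

# Toy conditional distributions over the remaining items (hypothetical numbers),
# e.g., the induced probabilities of franks, mustard, and ketchup.
p_given_hotdog_buns = [0.50, 0.30, 0.20]
p_given_potato_buns = [0.45, 0.35, 0.20]
p_given_napkins     = [0.10, 0.10, 0.80]

print(exchangeability(p_given_hotdog_buns, p_given_potato_buns))  # small: near-substitutes
print(exchangeability(p_given_hotdog_buns, p_given_napkins))      # large: unrelated items
```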
With several example queries, Table 9 shows the three most complementary items and the three most exchangeable items. (In the computation of these probabilities, we consider an "average customer," obtained by averaging the per-user vectors, in an "average week," obtained by averaging the per-week vectors. The probabilities were also obtained assuming that no item other than the query item is in the shopping basket.) shopper correctly recovers complements, such as taco shells and taco seasoning, hot dogs and buns, and mustard and hamburger buns. shopper typically identifies the most exchangeable items as being of the same type, such as several types of buns or tortillas; highly exchangeable items are suggestive of substitutes.
Finally, note that shopper can model triplets of complementary items when the items are also pairwise complements, but it cannot capture triplets of complements whose items do not form complementary pairs.
Table 9: For each query item, the three items with the highest complementarity scores and the three items with the best exchangeability scores.
|query items||complementarity score||exchangeability score|
|mission tortilla soft taco 1||taco bell taco seasoning mix||mission fajita size|
| ||mcrmck seasoning mix taco||mission tortilla soft taco 2|
| ||lawrys taco seasoning mix||mission tortilla fluffy gordita|
|private brand hot dog buns||bp franks meat||ball park buns hot dog|
| ||bp franks bun size||private brand hotdog buns potato 1|
| ||bp franks beef bun length||private brand hotdog buns potato 2|
|private brand mustard squeeze bottle||private brand hot dog buns||frenchs mustard classic yellow squeeze|
| ||private brand cutlery full size forks||frenchs mustard classic yellow squeezed|
| ||best foods mayonnaise squeeze||heinz ketchup squeeze bottle|
|private brand napkins all occasion||private brand selection plates 6 7/8 in||vnty fair napkins all occasion 1|
| ||private brand selection plates 8 3/4 in||vnty fair napkins all occasion 2|
| ||private brand cutlery full size forks||private brand selection premium napkins|
We developed shopper, a probabilistic model of consumer behavior. shopper is a sequential model of a consumer that posits a latent structure to how he or she shops. In posterior inference, shopper estimates item attributes, item-item interactions, price elasticities, seasonal effects, and customer preferences. We used shopper to analyze large-scale shopping data from a grocery store and evaluated it with out-of-sample predictions. In addition to evaluating on a random sample of held-out data, we also evaluated on counterfactual test sets, i.e., on days when prices were systematically different from average.
shopper is a computationally feasible, interpretable approach to modeling consumer choice over complex shopping baskets. While the existing literature in economics and marketing usually considers two or three products (Berry et al., 2014), shopper makes it possible to consider choices among thousands of products that potentially interact with one another. Using supermarket shopping data, we showed that shopper can uncover these relationships.
There are several avenues for future work. Consider the question of how the probabilistic model relates to maximizing a consumer's utility over the entire basket. Section 3.1.2 introduced a heuristic model of consumer behavior that is consistent with the probabilistic model. This heuristic lends itself to computational tractability; it enables us to analyze datasets of baskets that involve thousands of products and thousands of consumers. But the heuristic also involves a fairly myopic consumer. In future work, it would be interesting to consider alternative heuristic models.
Another avenue is to embellish the distribution of baskets. One way is to expand the model to capture within-basket heterogeneity. Shopping trips may reflect a collection of needs, such as school lunches, a dinner party, and pet care, and items may interact with other items within each need. Capturing the heterogeneous patterns within the baskets would sharpen the estimated interactions. Another embellishment is to expand the model of baskets to include a budget. A budget imposes constraints on the number of items (or their total price) purchased in a single trip.
This work was supported by ONR N00014-15-1-2209, ONR 133691-5102004, NIH 5100481-5500001084, NSF CCF-1740833, the Alfred P. Sloan Foundation, the John Simon Guggenheim Foundation, Facebook, Amazon, and IBM. Francisco J. R. Ruiz is supported by the EU H2020 programme (Marie Skłodowska-Curie Individual Fellowship, grant agreement 706760). The authors also thank Tilman Drerup and Tobias Schmidt for research assistance, as well as the Cyber Initiative at Stanford. Finally, we also acknowledge the support of NVIDIA Corporation with the donation of a GPU used for this research.
Appendix A Details on the Inference Algorithm
Here we provide the technical details of the variational inference procedure and a description of the full algorithm.
Recall the notation introduced in Section 5 of the main paper: z denotes the set of all latent variables in the model, y is the collection of (unordered) baskets, and x denotes the observed covariates of the shopping trips. We use mean-field variational inference, i.e., we approximate the posterior with a fully factorized variational distribution,
q(z) = ∏_i q(z_i),
where the product runs over the individual latent variables.
We set each variational factor in the same family as its prior: Gaussian variational distributions with diagonal covariance matrices for the real-valued latent variables (item attributes, interaction coefficients, user preferences, price vectors, and seasonal effects), and independent gamma variational distributions for the positive price sensitivity terms. We parameterize the Gaussian in terms of its mean and standard deviation, and we parameterize the gamma in terms of its shape and mean.
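As a small illustration of the (shape, mean) parameterization of the gamma factors: a gamma with shape a and mean m has scale m / a, so sampling reduces to a one-line conversion. The sketch below is ours (the function name is hypothetical).

```python
import random

def sample_gamma_shape_mean(shape, mean, rng=random):
    """Sample from a gamma distribution parameterized by shape and mean.

    random.gammavariate takes (shape, scale); the mean of a gamma is
    shape * scale, so scale = mean / shape.
    """
    scale = mean / shape
    return rng.gammavariate(shape, scale)

rng = random.Random(0)
draws = [sample_gamma_shape_mean(shape=2.0, mean=3.0, rng=rng) for _ in range(200000)]
print(sum(draws) / len(draws))  # close to the requested mean of 3.0
```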
Let ν denote the vector containing all the variational parameters. We wish to find the variational parameters that maximize the ELBO,
L(ν) = E_{q(z; ν)}[ log p(y, z | x) − log q(z; ν) ].
In this supplement, we show how to apply stochastic optimization to maximize the bound on the log marginal likelihood. In more detail, we first describe how to tackle the intractable expectations using stochastic optimization and the reparameterization trick, and then we show how to leverage stochastic optimization to decrease the computational complexity.
A.1 Intractable expectations: Stochastic optimization and the reparameterization trick
We are interested in maximizing an objective function of the form
L(ν) = E_{q(z; ν)}[ f(z) ],  where  f(z) = log p(y, z | x) − log q(z; ν).   (16)
The main challenge is that the expectations in Eq. 16 are analytically intractable. Thus, the variational algorithm we develop obtains and follows noisy estimates of the gradient ∇_ν L(ν). We obtain these estimates via stochastic optimization; in particular, we apply the reparameterization trick to form Monte Carlo estimates of the gradient (Kingma and Welling, 2014; Titsias and Lázaro-Gredilla, 2014; Rezende, Mohamed and Wierstra, 2014).
In reparameterization, we first introduce a transformation z = T(ε; ν) of the latent variables and an auxiliary distribution q(ε; ν), such that we can obtain samples from the variational distribution following a two-step process: first, draw ε from the auxiliary distribution q(ε; ν); second, set z = T(ε; ν).
The requirement for the transformation and the auxiliary distribution is that this procedure must provide a variable z that is distributed according to the variational distribution q(z; ν). Here we have considered the generalized reparameterization approach (Ruiz, Titsias and Blei, 2016; Naesseth et al., 2017), which allows the auxiliary distribution to depend on the variational parameters (this is necessary because the gamma random variables are not otherwise reparameterizable).
Once we have introduced the auxiliary variable ε, we can rewrite the gradient of the objective in Eq. 16 as an expectation with respect to the auxiliary distribution q(ε; ν),
∇_ν L(ν) = ∇_ν E_{q(ε; ν)}[ f(T(ε; ν)) ].
We now push the gradient into the integral and apply the chain rule for derivatives to express the gradient as an expectation,
∇_ν L(ν) = E_{q(ε; ν)}[ ∇_z f(z) ∇_ν T(ε; ν) ],  with z = T(ε; ν).   (19)
To obtain this expression, we have assumed that the remaining term vanishes: the only explicit dependence of f on ν is through the term log q(z; ν), and the expectation of the score function ∇_ν log q(z; ν) is zero.
We can now obtain a Monte Carlo estimate of the expectation in Eq. 19 (and therefore of the gradient of interest) by drawing a sample ε from q(ε; ν) and evaluating the argument of the expectation. That is, we form the gradient estimator as
∇_ν L(ν) ≈ ∇_z f(z) ∇_ν T(ε; ν),   (20)
where z = T(ε; ν) and ε ∼ q(ε; ν). This assumes that we are able to evaluate the function f and its gradient. (We show in Section A.2 how to do that efficiently.)
We use a transformation and auxiliary distribution for each variational factor. For a Gaussian variational factor with mean μ and standard deviation σ, we use the standard reparameterization,
z = μ + σ ε,  with ε ∼ N(0, 1).
For the gamma variational factors, the transformation is defined through a rejection sampling procedure; see Naesseth et al. (2017) for further details about the reparameterization trick for gamma random variables. (In particular, we also apply the "shape augmentation trick," which allows us to reparameterize a gamma random variable with a given shape in terms of another gamma random variable whose shape is larger by a positive integer. See Naesseth et al. (2017) for additional details.)
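For the Gaussian factors, the reparameterized gradient estimator can be sketched in a few lines. This is our toy illustration, not the paper's code: f(z) = z² stands in for the model's objective, so that the exact gradient of E[f] with respect to the mean is known and the Monte Carlo estimate can be checked against it.

```python
import random

def reparam_grad_mu(f_grad, mu, sigma, n_samples=100000, rng=random):
    """Monte Carlo estimate of d/dmu E_{z ~ N(mu, sigma^2)}[f(z)]
    via the standard reparameterization z = mu + sigma * eps, eps ~ N(0, 1).
    By the chain rule, d/dmu f(mu + sigma * eps) = f'(z) * (dz/dmu) = f'(z)."""
    total = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)
        z = mu + sigma * eps
        total += f_grad(z)  # dz/dmu = 1
    return total / n_samples

# Toy objective: f(z) = z^2, so E[f] = mu^2 + sigma^2 and the exact gradient is 2*mu.
rng = random.Random(1)
est = reparam_grad_mu(lambda z: 2.0 * z, mu=1.5, sigma=0.5, n_samples=100000, rng=rng)
print(est)  # close to the exact value 2 * 1.5 = 3.0
```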
Algorithm 1 summarizes the resulting variational inference procedure. At each iteration, we obtain a sample of the latent variables via the auxiliary distribution and the transformation; we evaluate the function f and its gradient with respect to the latent variables; we form the gradient estimate in Eq. 20; and we take a gradient step for the variational parameters. In the stochastic optimization procedure, we adaptively set the step size as proposed in the ADVI algorithm (Kucukelbir et al., 2017).
A.2 Computational complexity: Stochastic optimization and lower bounds
The algorithm in Section A.1 requires evaluating the model log joint (as well as its gradient). There are three issues that make this evaluation expensive. First, evaluating the log likelihood is expensive because it involves a summation over shopping trips; this is a problem when the dataset is large. Second, evaluating the softmax involves computing its normalization constant, which contains a summation over all items; this becomes an issue when there are thousands of items and we need to evaluate many softmax probabilities. Third, computing the probability over unordered baskets is also expensive, as it involves a summation over all possible permutations.
We address these issues by combining two techniques: data subsampling and lower bounds on the ELBO. We first describe data subsampling for evaluating the log likelihood. Note that the log likelihood involves a summation over many terms, one per shopping trip,
log p(y | x, z) = Σ_{t=1}^{T} log p(y_t | x_t, z),
where T is the total number of shopping trips.
We can obtain an unbiased estimate of the log likelihood (and its gradient) by sampling a random subset of shopping trips. Let S be the (randomly chosen) set of trips. The estimator
(T / |S|) Σ_{t ∈ S} log p(y_t | x_t, z)
is unbiased because its expected value is the log likelihood (Hoffman et al., 2013). Thus, we subsample data terms to obtain unbiased estimates of the log likelihood and its gradient, resulting in a computationally more efficient algorithm.
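The subsampling estimator is easy to check numerically. The sketch below is ours: the per-trip log-likelihood values are toy numbers, and we verify that scaling the batch sum by T / |S| averages out to the full-data sum.

```python
import random

def subsampled_log_likelihood(log_lik_terms, batch_size, rng=random):
    """Unbiased estimate of sum(log_lik_terms) from a random subset of trips:
    scale the batch sum by T / |S|."""
    T = len(log_lik_terms)
    batch = rng.sample(range(T), batch_size)
    return (T / batch_size) * sum(log_lik_terms[t] for t in batch)

# Toy per-trip log likelihoods (hypothetical numbers).
terms = [-2.0, -1.5, -3.0, -0.5, -2.5, -1.0]
full = sum(terms)  # -10.5

rng = random.Random(0)
estimates = [subsampled_log_likelihood(terms, batch_size=2, rng=rng)
             for _ in range(50000)]
print(sum(estimates) / len(estimates))  # averages to the full sum, about -10.5
```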
Second, we describe how to form variational bounds to address the issue of the expensive normalization constant of the softmax. Each softmax log probability is given by
log p(c | ·) = ψ_c − log Σ_{c′} exp(ψ_{c′}),
where ψ_c denotes the unnormalized log probability of item c and the summation runs over the items c′ that can be chosen next.
The summation over items c′ is expensive, and we cannot easily form an unbiased estimator because of the non-linearity introduced by the logarithm. Hence, we apply the one-vs-each bound (Titsias, 2016), which allows us to write
log p(c | ·) ≥ Σ_{c′ ≠ c} log σ(ψ_c − ψ_{c′}),
where σ(·) is the sigmoid function. We can form unbiased estimates of this summation via subsampling. More precisely, we randomly sample a set S of items (each of them distinct from c and from the other items in the basket). Then, we form the following unbiased estimator:
(C / |S|) Σ_{c′ ∈ S} log σ(ψ_c − ψ_{c′}).
Here, C stands for the total number of items.
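The one-vs-each bound is easy to verify numerically. The sketch below is ours; the utilities ψ are arbitrary toy values, and we check that the bound never exceeds the exact log softmax probability.

```python
import math

def log_softmax_prob(psis, c):
    """Exact log softmax probability of item c."""
    logz = math.log(sum(math.exp(p) for p in psis))
    return psis[c] - logz

def one_vs_each_bound(psis, c):
    """One-vs-each lower bound (Titsias, 2016):
    log p(c) >= sum_{j != c} log sigmoid(psi_c - psi_j)."""
    def log_sigmoid(x):
        return -math.log1p(math.exp(-x))
    return sum(log_sigmoid(psis[c] - psis[j])
               for j in range(len(psis)) if j != c)

# Toy unnormalized log probabilities (hypothetical values).
psis = [1.0, 0.2, -0.5, 0.7]
exact = log_softmax_prob(psis, 0)
bound = one_vs_each_bound(psis, 0)
print(bound <= exact)  # True: the bound never exceeds the exact log probability
```

Subsampling the terms of the bound (as in the estimator above) then trades the summation over all items for a summation over a small random set.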
Finally, we show how to deal with the issue of unordered baskets. Recall that each log likelihood term involves a summation over all possible permutations of the basket (holding the checkout item in the last position),
log p(y_t | x_t, z) = log Σ_π p(y_{t,π} | x_t, z),
where π ranges over the permutations of the items in basket y_t and y_{t,π} denotes the correspondingly ordered basket.
Following a similar procedure as Doshi-Velez et al. (2009), we introduce an auxiliary distribution r(π) to rewrite the expression above as an expectation with respect to r(π), and then we apply Jensen's inequality:
log Σ_π p(y_{t,π} | x_t, z) = log E_{r(π)}[ p(y_{t,π} | x_t, z) / r(π) ] ≥ E_{r(π)}[ log p(y_{t,π} | x_t, z) − log r(π) ].
Since the bound involves a direct summation over permutations, we can subsample terms to alleviate the computational complexity. For simplicity, we set r(π) to be a uniform distribution over all possible permutations, and thus we avoid introducing auxiliary variational parameters, which would be too expensive to fit otherwise. In particular, we form an unbiased estimate of the bound by sampling one random permutation π and evaluating the term log p(y_{t,π} | x_t, z). (We ignore the term E_{r(π)}[ log r(π) ] because it is a constant.)
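The Jensen bound over permutations, with a uniform r(π), can be checked on a toy basket. This sketch is ours: log_p_perm is a hypothetical stand-in for the model's ordered-basket log probability, and the one-permutation estimate mirrors the subsampling step described above.

```python
import itertools
import math
import random

def exact_log_marginal(log_p_perm, items):
    """log of the sum over all orderings of the basket."""
    return math.log(sum(math.exp(log_p_perm(p))
                        for p in itertools.permutations(items)))

def jensen_bound(log_p_perm, items):
    """Jensen lower bound with a uniform distribution over permutations:
    log sum_pi p(pi) >= log(n!) + E_uniform[log p(pi)]."""
    perms = list(itertools.permutations(items))
    avg_log = sum(log_p_perm(p) for p in perms) / len(perms)
    return math.log(len(perms)) + avg_log

def one_sample_estimate(log_p_perm, items, rng=random):
    """Unbiased one-permutation estimate of the non-constant part of the bound."""
    perm = list(items)
    rng.shuffle(perm)
    return log_p_perm(tuple(perm))

# Toy ordered-basket log probability (hypothetical): later positions penalized.
def log_p_perm(perm):
    return sum(-0.1 * i * item for i, item in enumerate(perm))

items = (1, 2, 3)
print(jensen_bound(log_p_perm, items) <= exact_log_marginal(log_p_perm, items))  # True
```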
To sum up, we have derived a lower bound on the ELBO, which can still be written as an expectation with respect to the variational distribution q(z; ν). More importantly, we can efficiently evaluate the argument of this expectation and its gradient with respect to the latent variables z.
Putting it all together, the function f(z) that we use replaces the exact log likelihood with the bounds above. We obtain an unbiased estimate by subsampling shopping trips, one random permutation for each trip, and a subset of items for the one-vs-each bound; the number of subsampled trips and items is held fixed in our experiments.