1 Introduction
Estimation of discrete-choice models in which consumers face high-dimensional choice sets is computationally challenging. In this paper, we propose a new estimator that is tractable for semiparametric multinomial models with very large choice sets. Our estimator utilizes random projection, a powerful dimensionality-reduction technique from the machine learning literature. To our knowledge, this is the first use of random projection in the econometrics literature on discrete-choice models. Using random projection, we can feasibly estimate high-dimensional discrete-choice models without specifying particular distributions for the random utility errors; that is, our approach is semiparametric.
In random projection, high-dimensional vectors are replaced by random low-dimensional linear combinations of the components of the original vectors. The Johnson-Lindenstrauss Lemma, the backbone of random projection techniques, guarantees that, with high probability, the high-dimensional vectors can be embedded in a lower-dimensional Euclidean space in the sense that pairwise distances and inner products among the projected-down lower-dimensional vectors are approximately preserved.
Specifically, we are given a $d$-by-$p$ data matrix, where $d$ is the dimensionality of the choice sets. When $d$ is very large, we encounter computational problems that render estimation difficult: estimating semiparametric discrete-choice models is already challenging, and large choice sets exacerbate the computational challenges; moreover, in extreme cases, the choice sets may be so large that typical computers will not be able to hold the data in memory (RAM) all at once for computation and manipulation. [Footnote 1: For example, Ng (2015) analyzes terabytes of scanner data that required an amount of RAM that was beyond the budget of most researchers.]
Using the idea of random projection, we propose first, in a data pre-processing step, premultiplying the large $d$-by-$p$ data matrix by a $k$-by-$d$ (with $k \ll d$) stochastic matrix, resulting in a smaller $k$-by-$p$ compressed data matrix that is more manageable. Subsequently, we estimate the discrete-choice model using the compressed data matrix in place of the original high-dimensional dataset. Specifically, in the second step, we estimate the discrete-choice model without needing to specify the distribution of the random utility errors by using inequalities derived from cyclic monotonicity, a generalization of the notion of monotonicity for vector-valued functions which always holds for random-utility discrete-choice models; see Rockafellar (1970) and Chiong et al. (2016).

A desirable and practical feature of our procedure is that the random projection matrix is sparse, so that generating it and multiplying it with the large data matrix is computationally parsimonious. For instance, when the dimensionality of the choice set is $d = 5{,}000$, the random projection matrix consists of roughly 99% zeros, and indeed only 1% of the data matrix is needed or sampled.
We show theoretically that the random projection estimator converges to the unprojected estimator as $k$ grows large. We utilize results from the machine learning literature, which show that random projection enables embeddings of points from high-dimensional into low-dimensional Euclidean space with high probability, and hence we can consistently recover the original estimates from the compressed dataset. In the simulation, even with small and moderate $k$, we show that the noise introduced by random projection is reasonably small. In summary, $k$ controls the tradeoff between using a small, tractable dataset for estimation and the error in estimation.
As an application of our procedures, we estimate a model of soft drink choice in which households choose not only which soft drink product to purchase, but also the store that they shop at. In the dataset, households can choose from over 3000 (store/soft drink product) combinations, and we use random projection to reduce the number of choices to 300, one-tenth of the original number.
1.1 Related Literature
Difficulties in estimating multinomial choice models with very large choice sets were already considered in the earliest econometric papers on discrete-choice models (McFadden (1974, 1978)). There, within the special multinomial logit case, McFadden discussed simulation approaches to estimation based on sampling the choices faced by consumers; subsequently, this “sampled logit” model was implemented in Train et al. (1987). This sampling approach depends crucially on the multinomial logit assumption on the errors, and particularly on the independence of the errors across items in the large choice set. [Footnote 2: See also Davis et al. (2016) and Keane and Wasi (2012) for other applications of sampled logit-type discrete-choice models. On a related note, Gentzkow et al. (2016) use a Poisson approximation to enable parallel computation of a multinomial logit model of legislators’ choices among hundreds of thousands of phrases.]

In contrast, the approach taken in this paper is semiparametric, as we avoid making specific parametric assumptions for the distribution of the errors. Our closest antecedent is Fox (2007), who uses the maximum-score approach of Manski (1975, 1985) to estimate semiparametric multinomial choice models with large choice sets but using only a subset of the choices. [Footnote 3: Fox and Bajari (2013) use this estimator for a model of the FCC spectrum auctions, and also point out another reason why choice sets may be high-dimensional: specifically, when choice sets of consumers consist of bundles of products. The size of this combinatorial choice set is necessarily exponentially increasing in the number of products. Even though the vectors of observed market shares will be sparse, with many zeros, as long as a particular bundle does not have zero market share across all markets, it will still contain identifying information.] Identification relies on a “rank-order” assumption, which is an implication of the Independence of Irrelevant Alternatives (IIA) property, and hence can be considered as a generalized version of IIA. It is satisfied by exchangeability of the joint error distribution.
In contrast, our cyclic monotonicity approach allows for a non-exchangeable joint error distribution with arbitrary correlation between the choice-specific error terms, but requires full independence of the errors from the observed covariates. [Footnote 4: Besides Fox (2007), the literature on semiparametric multinomial choice models is quite small, and includes the multiple-index approach of Ichimura and Lee (1991) and Lee (1995), and a pairwise-differencing approach in Powell and Ruud (2008). These approaches do not appear to scale up easily when choice sets are large, and also are not amenable to dimension reduction using random projection.] In particular, our approach accommodates models with error structures in the generalized extreme value family (i.e., nested logit models, which are typically non-exchangeable distributions), and we illustrate this in our empirical application below, where we consider a model of joint store and brand choice in which a nested-logit (generalized extreme value) model would typically be used.
Indeed, Fox’s rank-order property and the cyclic monotonicity property used here represent two different (and non-nested) generalizations of Manski’s (1975) maximum-score approach for semiparametric binary choice models to a multinomial setting. The rank-order property restricts the dependence of the utility shocks across choices (exchangeability), while cyclic monotonicity restricts the dependence of the utility shocks across different markets (or choice scenarios). [Footnote 5: Haile et al. (2008) refer to this independence of the utility shocks across choice scenarios as an “invariance” assumption, while Goeree et al. (2005) call the rank-order property a “monotonicity” or “responsiveness” condition.]
The ideas of random projection were popularized in the machine learning literature on dimensionality reduction (Vempala (2000); Achlioptas (2003); Dasgupta and Gupta (2003)). As these papers point out, both by mathematical derivations and computational simulations, random projection allows computationally simple and low-distortion embeddings of points from high-dimensional into low-dimensional Euclidean space. However, the random projection approach will not work with all high-dimensional models. The reason is that while the reduced-dimension vectors maintain the same length as the original vectors, the individual components of these lower-dimensional vectors may have little relation to the components of the original vectors. Thus, models in which the components of the vectors are important would not work with random projection.
In many high-dimensional econometric models, however, only the lengths and inner products among the data vectors are important; this includes least-squares regression models with a fixed number of regressors but a large number of observations and, as we will see here, aggregate (market-level) multinomial choice models where consumers in each market face a large number of choices. But it will not work in, for instance, least-squares regression models in which the number of observations is modest but the number of regressors is large; such models call for regressor selection or reduction techniques, including LASSO or principal components. [Footnote 6: See Belloni et al. (2012), Belloni et al. (2014), and Gillen et al. (2015).] Neither LASSO nor principal components maintains the lengths and inner products of the data vectors; typically, they will result in reduced-dimension vectors with length strictly smaller than the original vectors.
Section 2 presents our semiparametric discrete-choice modeling framework, and the moment inequalities derived from cyclic monotonicity which we will use for estimation. In Section 3, we introduce random projection and show how it can be applied to the semiparametric discrete-choice context to overcome the computational difficulties with large choice sets. We also show formally that the random-projection version of our estimator converges to the full-sample estimator as the dimension of the projection increases. Section 4 contains results from simulation examples, demonstrating that random projection works well in practice, even when choice sets are only moderately large. In Section 5, we estimate a model of households’ joint decisions of store and brand choice, using store-level scanner data. Section 6 concludes.
2 Modeling Framework
We consider a semiparametric multinomial choice framework in which the choice-specific utilities are assumed to take a single-index form, but the distribution of utility shocks is unspecified and treated as a nuisance element. [Footnote 7: Virtually all the existing papers on semiparametric multinomial choices use similar setups (Fox (2007), Ichimura and Lee (1991), Lee (1995), Powell and Ruud (2008)).] Specifically, an agent chooses from among $d$ alternatives or choices. High-dimensionality here refers to a large value of $d$. The utility that the agent derives from choice $j$ is $X_j\beta + \epsilon_j$, where $\beta = (\beta_1, \dots, \beta_p)' \in \mathbb{R}^p$ are unknown parameters, and $X_j$ is a $1 \times p$ vector of covariates specific to choice $j$. Here, $\epsilon_j$ is a utility shock, encompassing unobservables which affect the agent’s utility from the $j$-th choice.
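Let $u_j = X_j\beta$ denote the deterministic part of utility that the agent derives from choice $j$, and let $u = (u_1, \dots, u_d)'$, which we assume to lie in the set $\mathcal{U} \subseteq \mathbb{R}^d$. For a given $u \in \mathcal{U}$, the probability that the agent chooses $j$ is $p_j(u) = \Pr\left(u_j + \epsilon_j \ge \max_{i \ne j} \{u_i + \epsilon_i\}\right)$. Denote the vector of choice probabilities as $p(u) = (p_1(u), \dots, p_d(u))'$. Now observe that the choice probabilities vector $p$ is a vector-valued function such that $p: \mathcal{U} \to \mathbb{R}^d$.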
In this paper, we assume that the utility shocks $\epsilon = (\epsilon_1, \dots, \epsilon_d)'$ are distributed independently of $X = (X_1, \dots, X_d)$, but otherwise allow them to follow an unknown joint distribution that can be arbitrarily correlated among different choices $j$. This leads to the following proposition:

Proposition 1.
Let $\epsilon$ be independent of $X$. Then the choice probability function $p: \mathcal{U} \to \mathbb{R}^d$ satisfies cyclic monotonicity.
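Definition 1 (Cyclic Monotonicity).
Consider a function $p: \mathcal{U} \to \mathbb{R}^d$, where $\mathcal{U} \subseteq \mathbb{R}^d$. Take a length-$L$ cycle of points in $\mathcal{U}$, denoted as the sequence $(u_1, u_2, \dots, u_L, u_1)$. The function $p$ is cyclic monotone with respect to the cycle $(u_1, u_2, \dots, u_L, u_1)$ if and only if
$$\sum_{l=1}^{L} (u_{l+1} - u_l) \cdot p(u_l) \le 0 \tag{1}$$
where $u_{L+1} = u_1$. The function $p$ is cyclic monotone on $\mathcal{U}$ if and only if it is cyclic monotone with respect to all possible cycles of all lengths on its domain (see Rockafellar (1970)).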
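To make the definition concrete, the following is a minimal Python sketch that evaluates the left-hand side of the cyclic monotonicity inequality for one cycle. The function names and the use of multinomial logit probabilities are illustrative assumptions, not part of the framework above; logit is used only because its choice probabilities are the gradient of a convex function, so the inequality must hold.

```python
import numpy as np

def cycle_sum(us, ps):
    """Left-hand side of inequality (1) for the cycle (u_1, ..., u_L, u_1).

    us, ps: arrays of shape (L, d); row l holds u_l and p(u_l).
    """
    us = np.asarray(us, dtype=float)
    ps = np.asarray(ps, dtype=float)
    u_next = np.roll(us, -1, axis=0)  # row l holds u_{l+1}, with u_{L+1} = u_1
    return float(np.sum((u_next - us) * ps))

def logit_shares(u):
    """Multinomial logit probabilities (assumed here only as an example)."""
    z = np.exp(u - np.max(u))
    return z / z.sum()

rng = np.random.default_rng(0)
us = rng.normal(size=(4, 6))                  # a length-4 cycle, d = 6 choices
ps = np.array([logit_shares(u) for u in us])
lhs = cycle_sum(us, ps)                       # non-positive under cyclic monotonicity
```

The same helper can be applied to any sub-cycle, for example `cycle_sum(us[:2], ps[:2])` for a cycle of length 2.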
Proposition 1 arises from the underlying convexity properties of the discrete-choice problem. We refer to Chiong et al. (2016) and Shi et al. (2016) for the full details. Briefly, the independence of $\epsilon$ and $X$ implies that the social surplus function of the discrete-choice model, defined as
$$\mathcal{G}(u) = E\left[\max_{j \in \{1, \dots, d\}} (u_j + \epsilon_j)\right],$$
is convex in $u$. Subsequently, for each vector of utilities $u \in \mathcal{U}$, the corresponding vector of choice probabilities $p(u)$ lies in the subgradient of $\mathcal{G}$ at $u$ [Footnote 8: See Theorem 1(i) in Chiong et al. (2016). This is the Williams-Daly-Zachary Theorem (cf. McFadden (1981)), generalized to the case when the social surplus function may be non-differentiable, corresponding to cases where the utility shocks have bounded support or follow a discrete distribution.]; that is:
$$p(u) \in \partial \mathcal{G}(u) \tag{2}$$
By a fundamental result in convex analysis (Rockafellar (1970), Theorem 23.5), the subgradient of a convex function satisfies cyclic monotonicity, and hence $p(u)$ satisfies the CM inequalities in (1) above. (In fact, any function that satisfies cyclic monotonicity must be a subgradient of some convex function.) Therefore, cyclic monotonicity is the appropriate vector generalization of the fact that the slope of a scalar-valued convex function is monotone increasing.
2.1 Inequalities for Estimation
Following Shi et al. (2016), we use the cyclic monotonicity inequalities in (1) to estimate the parameters $\beta$. [Footnote 9: See also Melo et al. (2015) for an application of cyclic monotonicity for testing game-theoretic models of stochastic choice.] Suppose we observe the aggregate behavior of many independent agents across $n$ different markets. In this paper, we assume the researcher has access to such aggregate data, in which the market-level choice probabilities (or market shares) are directly observed. Such data structures arise often in aggregate demand models in empirical industrial organization (e.g., Berry and Haile (2014), Gandhi et al. (2013)).
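Our dataset consists of $\mathcal{D} = \{(X^{(m)}, p^{(m)})\}_{m=1}^{n}$, where $p^{(m)}$ denotes the $d \times 1$ vector of choice probabilities, or market shares, in market $m$, and $X^{(m)}$ is the $d \times p$ matrix of covariates for market $m$ (where row $j$ of $X^{(m)}$ corresponds to $X^{(m)}_j$, the vector of covariates specific to choice $j$ in market $m$). Assuming that the distribution of the utility shock vectors $(\epsilon^{(1)}, \dots, \epsilon^{(n)})$ is i.i.d. across all markets, then by Proposition 1, the cyclic monotonicity inequalities (1) will be satisfied across all cycles in the data $\mathcal{D}$: that is,
$$\sum_{l=1}^{L} \left(X^{(a_{l+1})}\beta - X^{(a_l)}\beta\right) \cdot p^{(a_l)} \le 0, \quad \text{for all cycles } (a_l)_{l=1}^{L+1} \text{ in data } \mathcal{D},\ L \ge 2 \tag{3}$$
Recall that a cycle in data $\mathcal{D}$ is a sequence of distinct integers $(a_l)_{l=1}^{L+1}$, where $a_{L+1} = a_1$, and each integer is smaller than or equal to $n$, the number of markets.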
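From the cyclic monotonicity inequalities in (3), we define a criterion function which we will optimize to obtain an estimator of $\beta$. This criterion function is the sum of squared violations of the cyclic monotonicity inequalities:
$$Q(\beta) = \sum_{\text{all cycles in data } \mathcal{D};\ L \ge 2} \left[ \sum_{l=1}^{L} \left(X^{(a_{l+1})}\beta - X^{(a_l)}\beta\right) \cdot p^{(a_l)} \right]_+^2 \tag{4}$$
where $[x]_+ = \max\{x, 0\}$. Our estimator is defined as
$$\hat\beta = \arg\min_{\beta \in \mathcal{B}:\ \|\beta\| = 1} Q(\beta).$$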
The parameter space $\mathcal{B}$ is defined to be a convex subset of $\mathbb{R}^p$. The parameters are normalized such that the vector $\hat\beta$ has a Euclidean length of 1. This is a standard normalization that is also used in the Maximum Rank Correlation estimator, for instance, in Han (1987) and Hausman et al. (1998). Shi et al. (2016) show that the criterion function above delivers consistent interval estimates of the identified set of parameters under the assumption that the covariates are exogenous. The criterion function here is convex, and the global minimum can be found using subgradient descent (since it is not differentiable everywhere). [Footnote 10: Because the cyclic monotonicity inequalities involve differences in $X\beta$, no constant terms need be included in the model, as they would simply difference out across markets. Similarly, any outside good with mean utility normalized to zero would also drop out of the cyclic monotonicity inequalities.]
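As an illustration of the criterion function, the sketch below sums the squared violations of the cyclic monotonicity inequalities over all cycles of length 2 and 3. It is a minimal sketch under stated assumptions: the function name `cm_criterion`, the array layout, the restriction to short cycles, and the logit test data are all hypothetical choices made for the example, not the authors' implementation.

```python
import itertools
import numpy as np

def cm_criterion(beta, X, P, max_len=3):
    """Sum of squared violations of the cyclic monotonicity inequalities.

    X: (n, d, p) covariates, P: (n, d) market shares, beta: (p,).
    Only cycles of length 2..max_len are used (a computational shortcut).
    """
    U = X @ beta                      # (n, d): row m holds X^(m) beta
    n = U.shape[0]
    total = 0.0
    for L in range(2, max_len + 1):
        for cyc in itertools.permutations(range(n), L):
            if cyc[0] != min(cyc):
                continue              # rotations of a cycle give the same inequality
            idx = np.array(cyc)
            nxt = np.roll(idx, -1)    # a_{l+1}, with a_{L+1} = a_1
            viol = np.sum((U[nxt] - U[idx]) * P[idx])
            total += max(viol, 0.0) ** 2
    return total

# Hypothetical test data: logit shares generated at a known unit-length beta0.
rng = np.random.default_rng(1)
n, d, p = 5, 8, 2
X = rng.normal(size=(n, d, p))
beta0 = np.array([np.cos(0.75 * np.pi), np.sin(0.75 * np.pi)])
E = np.exp(X @ beta0)
P = E / E.sum(axis=1, keepdims=True)
q_true = cm_criterion(beta0, X, P)    # no violations at the generating parameter
q_flip = cm_criterion(-beta0, X, P)   # the sign-flipped parameter violates them
```

Enumerating all cycles grows quickly with the number of markets, which is one reason to cap the cycle length in a sketch like this.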
The derivation of our estimation approach for discrete-choice models does not require that all the choice probabilities be strictly positive; that is, zero choice probabilities are allowed for. [Footnote 11: Specifically, Eq. (2) allows some of the components of the choice probability vector $p(u)$ to be zero.] The possibility of zero choice probabilities is especially important and empirically relevant in a setting with large choice sets, as datasets with large choice sets (such as store-level scanner data) often have zero choice probabilities for many products (cf. Gandhi et al. (2013)).
For reasons discussed earlier, high-dimensional choice sets pose particular challenges for semiparametric estimation. Next, we describe how random projection can help reduce the dimensionality of our problem.
3 Random Projection
Our approach consists of two steps. In the first, data pre-processing step, the data matrix $\mathcal{D}$ is embedded into a lower-dimensional Euclidean space. This dimensionality reduction is achieved by premultiplying $\mathcal{D}$ with a random projection matrix, resulting in a compressed data matrix $\tilde{\mathcal{D}}$ with fewer rows but the same number of columns (that is, the number of markets and covariates is not reduced, but the dimensionality of choice sets is reduced). In the second step, the estimator outlined in Equation (4) is computed using only the compressed data $\tilde{\mathcal{D}}$.
A random projection matrix $R$ is a $k$-by-$d$ matrix (with $k \ll d$) such that each entry $R_{i,j}$ is distributed i.i.d. according to $\frac{1}{\sqrt{k}} F$, where $F$ is any mean-zero distribution with unit variance. For any $d$-dimensional vectors $u$ and $v$, premultiplication by $R$ yields the random reduced-dimensional ($k \times 1$) vectors $Ru$ and $Rv$; thus, $Ru$ and $Rv$ are the random projections of $u$ and $v$, respectively.
By construction, a random projection matrix $R$ has the property that, given two high-dimensional vectors $u$ and $v$, the squared Euclidean distance between the two projected-down vectors, $\|Ru - Rv\|^2$, is a random variable with mean equal to $\|u - v\|^2$, the squared distance between the two original high-dimensional vectors. Essentially, the random projection procedure replaces each high-dimensional vector with a random lower-dimensional counterpart, the length of which is a mean-preserving spread of the original vector’s length. [Footnote 12: For a detailed discussion, see Chapter 1 in Vempala (2000).]

Most early applications of random projection utilized Gaussian random projection matrices, in which each entry of $R$ is generated independently from the standard Gaussian (normal) distribution. However, for computational convenience and simplicity, we focus in this paper on sparse random projection matrices, in which many elements will be equal to zero with high probability. Moreover, different choices of the probability distribution $F$ can lead to different variances and error tail bounds of $\|Ru - Rv\|^2$. Following the work of Li et al. (2006), we introduce a class of sparse random projection matrices that can also be tailored to enhance the efficiency of random projection.

Definition 2 (Sparse Random Projection Matrix).
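A sparse random projection matrix is a $k$-by-$d$ matrix $R$ such that each $(i,j)$-th entry is independently and identically distributed according to the following discrete distribution, for $s \ge 1$:
$$R_{i,j} = \sqrt{\frac{s}{k}} \times \begin{cases} +1 & \text{with probability } \frac{1}{2s} \\ 0 & \text{with probability } 1 - \frac{1}{s} \\ -1 & \text{with probability } \frac{1}{2s} \end{cases}$$
By choosing a higher $s$, we produce sparser random projection matrices. Li et al. (2006) show that:
$$\mathrm{Var}\left(\|Ru - Rv\|^2\right) = \frac{1}{k}\left(2\|u - v\|^4 + (s - 3)\sum_{j=1}^{d} (u_j - v_j)^4\right) \tag{5}$$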
It appears from this variance formula that a higher value of $s$ reduces the efficiency of random projections. It turns out that when $d$ is large, which is exactly the setting where random projection is needed, the first term in the variance formula above dominates the second term. Therefore, we can set large values of $s$ to achieve very sparse random projection, with negligible loss in efficiency. More concretely, we can set $s$ to be as large as $\sqrt{d}$. We will see in the simulation example that when $d = 5{,}000$, setting $s = \sqrt{d}$ implies that each entry of the random projection matrix is zero with probability 0.986; that is, only 1.4% of the data are sampled on average. Yet we find that sparse random projection performs just as well as a dense random projection. [Footnote 13: More precisely, as shown by Li et al. (2006), if all fourth moments of the data to be projected down are finite, i.e. $E[u_j^4] < \infty$, $E[v_j^4] < \infty$, $E[u_j^2 v_j^2] < \infty$, for all $j = 1, \dots, d$, then the term $\|u - v\|^4$ in the variance formula (Eq. 5) dominates the second term $(s - 3)\sum_{j=1}^{d} (u_j - v_j)^4$ for large $d$ (which is precisely the setting in which we wish to use random projection).]
Besides the sparse random projection ($s = \sqrt{d}$), we will also try $s = 1$, where the minimum variance is achieved. We call this the optimal random projection. If we let $s = 3$, we obtain a variance of $\frac{2}{k}\|u - v\|^4$, which, interestingly, is the same variance achieved by the benchmark Gaussian random projection (in which each element of the random projection matrix is distributed i.i.d. according to the standard Gaussian; see Achlioptas (2003)). Since Gaussian random projection is dense and has the same efficiency as the sparse random projection with $s = 3$, the class of random projections proposed in Definition 2 is preferred in terms of both efficiency and sparsity. Moreover, random uniform numbers are much easier to generate than Gaussian random numbers.
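The construction in Definition 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name `sparse_projection` is hypothetical, the matrix is stored densely for simplicity (a sparse storage format would be the natural choice at scale), and the vectors `u` and `v` are simulated placeholders rather than data from the paper.

```python
import numpy as np

def sparse_projection(k, d, s, rng):
    """Draw a k-by-d matrix with i.i.d. entries sqrt(s/k) * {+1, 0, -1},
    taken with probabilities 1/(2s), 1 - 1/s and 1/(2s)."""
    signs = rng.choice([1.0, 0.0, -1.0], size=(k, d),
                       p=[0.5 / s, 1.0 - 1.0 / s, 0.5 / s])
    return np.sqrt(s / k) * signs

rng = np.random.default_rng(0)
d, k = 5000, 100
s = np.sqrt(d)                                 # very sparse choice, s = sqrt(d)
R = sparse_projection(k, d, s, rng)
zero_share = float(np.mean(R == 0.0))          # close to 1 - 1/s, about 0.986

u, v = rng.normal(size=d), rng.normal(size=d)  # hypothetical data vectors
# Ratio of projected to original squared distance; equal to 1 in expectation.
ratio = float(np.sum((R @ (u - v)) ** 2) / np.sum((u - v) ** 2))
```

Setting `s = 1` in the same function gives the dense matrix of equally likely signs that the text calls the optimal random projection.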
3.1 Random Projection Estimator
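We introduce the random projection estimator. Given the dataset $\mathcal{D} = \{(X^{(1)}, p^{(1)}), \dots, (X^{(n)}, p^{(n)})\}$, define the compressed dataset by $\tilde{\mathcal{D}} = \{(\tilde X^{(1)}, \tilde p^{(1)}), \dots, (\tilde X^{(n)}, \tilde p^{(n)})\}$, where $(\tilde X^{(m)}, \tilde p^{(m)}) = (R X^{(m)}, R p^{(m)})$ for all markets $m$, and $R$ is a sparse $k \times d$ random projection matrix as in Definition 2.

Definition 3 (Random projection estimator).
The random projection estimator is defined as $\tilde\beta_k \in \arg\min_{\beta} Q(\beta, \tilde{\mathcal{D}})$, where $Q(\beta, \tilde{\mathcal{D}})$ is the criterion function in Equation (4) in which the input data is $\tilde{\mathcal{D}}$.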
The compressed dataset $\tilde{\mathcal{D}}$ has $k$ rows, whereas the original dataset has a larger number of rows, $d$. Note that the identities of the markets and covariates (i.e. the columns of the data matrix) are unchanged in the reduced-dimension data matrix; as a result, the same compressed dataset can be used to estimate different utility/model specifications with varying combinations of covariates and markets.
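The compression step amounts to applying one projection matrix to every market. The sketch below shows this under hypothetical names and array shapes (`compress`, with covariates stacked as an `(n, d, p)` array and shares as `(n, d)`); the simulated shares are placeholders used only to exercise the code.

```python
import numpy as np

def compress(X, P, R):
    """Apply one projection matrix R (k, d) to every market.

    X: (n, d, p) -> (n, k, p);  P: (n, d) -> (n, k).
    """
    X_tilde = np.einsum('kd,ndp->nkp', R, X)
    P_tilde = P @ R.T
    return X_tilde, P_tilde

rng = np.random.default_rng(0)
n, d, p, k = 3, 50, 2, 10
X = rng.normal(size=(n, d, p))
P = rng.dirichlet(np.ones(d), size=n)                   # hypothetical market shares
R = rng.choice([-1.0, 1.0], size=(k, d)) / np.sqrt(k)   # Definition 2 with s = 1
X_tilde, P_tilde = compress(X, P, R)

beta = np.array([0.6, 0.8])
# Projection commutes with the linear index: (R X) beta equals R (X beta).
same = bool(np.allclose(X_tilde @ beta, (X @ beta) @ R.T))
```

Because the covariate and market axes are untouched, a specification that uses only some covariates can simply slice the last axis of `X_tilde` without re-running the projection.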
We will benchmark the random projection estimator against the estimator $\hat\beta \in \arg\min_{\beta} Q(\beta, \mathcal{D})$, where $Q(\beta, \mathcal{D})$ is the criterion function in Equation (4) in which the uncompressed data $\mathcal{D}$ is used as input. In the next section, we will prove convergence of the random projection estimator to the benchmark estimator using uncompressed data, as $k$ grows large. Here we provide some intuition and state some preliminary results for this convergence result.
Recall from the previous section that the Euclidean distance between two vectors is preserved in expectation as these vectors are compressed into a lower-dimensional Euclidean space. In order to exploit this feature of random projection for our estimator, we rewrite the estimating inequalities, which are based on cyclic monotonicity, in terms of Euclidean norms.
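Definition 4 (Cyclic Monotonicity in terms of Euclidean norms).
Consider a function $p: \mathcal{U} \to \mathbb{R}^d$, where $\mathcal{U} \subseteq \mathbb{R}^d$. Take a length-$L$ cycle of points in $\mathcal{U}$, denoted as the sequence $(u_1, u_2, \dots, u_L, u_1)$. The function $p$ is cyclic monotone with respect to the cycle $(u_1, u_2, \dots, u_L, u_1)$ if and only if
$$\sum_{l=2}^{L+1} \left( \|u_l - p_l\|^2 - \|u_l - p_{l-1}\|^2 \right) \le 0 \tag{6}$$
where $u_{L+1} = u_1$, and $p_l$ denotes $p(u_l)$. The function $p$ is cyclic monotone on $\mathcal{U}$ if and only if it is cyclic monotone with respect to all possible cycles of all lengths on its domain.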
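The inequalities (1) and (6) equivalently define cyclic monotonicity; a proof is given in the appendix. Therefore, from Definition 4, we can rewrite the estimator in (4) as $\hat\beta = \arg\min_{\beta \in \mathcal{B}:\ \|\beta\| = 1} Q(\beta)$, where the criterion function is defined as the sum of squared violations of the cyclic monotonicity inequalities:
$$Q(\beta) = \sum_{\text{all cycles in data } \mathcal{D};\ L \ge 2} \left[ \sum_{l=2}^{L+1} \left( \left\|X^{(a_l)}\beta - p^{(a_l)}\right\|^2 - \left\|X^{(a_l)}\beta - p^{(a_{l-1})}\right\|^2 \right) \right]_+^2 \tag{7}$$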
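The equivalence between the inner-product and Euclidean-norm forms of cyclic monotonicity can be checked numerically. In the sketch below (function names are hypothetical), the norm form is written as the sum over the cycle of $\|u_l - p_l\|^2 - \|u_l - p_{l-1}\|^2$; expanding the squares, the $\|p_l\|^2$ terms telescope around the cycle and what remains is exactly twice the inner-product form, so the two expressions always share the same sign.

```python
import numpy as np

def cm_inner(us, ps):
    """Inner-product form: sum_l (u_{l+1} - u_l) . p_l, with u_{L+1} = u_1."""
    return float(np.sum((np.roll(us, -1, axis=0) - us) * ps))

def cm_norm(us, ps):
    """Norm form: sum_l ||u_l - p_l||^2 - ||u_l - p_{l-1}||^2 over the cycle."""
    p_prev = np.roll(ps, 1, axis=0)   # row l holds p_{l-1}, wrapping around
    return float(np.sum((us - ps) ** 2) - np.sum((us - p_prev) ** 2))

rng = np.random.default_rng(0)
us = rng.normal(size=(5, 7))          # arbitrary points: the identity is algebraic
ps = rng.normal(size=(5, 7))
inner, norm = cm_inner(us, ps), cm_norm(us, ps)
```

Since the identity is purely algebraic, it holds for arbitrary arrays and not only for genuine choice probabilities, which is why random inputs suffice for the check.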
To see the intuition behind the random projection estimator, we introduce the Johnson-Lindenstrauss Lemma. This lemma states that there exists a linear map (which can be found by drawing different random projection matrices) such that there is a low-distortion embedding. There are different versions of this theorem; we state a typical one:
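Lemma 1 (Johnson-Lindenstrauss).
Let $\delta \in (0, \frac{1}{2})$. Let $\mathcal{U} \subset \mathbb{R}^d$ be a set of $C$ points, and $k = O(\log C / \delta^2)$. There exists a linear map $f: \mathbb{R}^d \to \mathbb{R}^k$ such that for all $u, v \in \mathcal{U}$:
$$(1 - \delta)\|u - v\|^2 \le \|f(u) - f(v)\|^2 \le (1 + \delta)\|u - v\|^2$$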
Proofs of the Johnson-Lindenstrauss Lemma can be found in, among others, Dasgupta and Gupta (2003); Achlioptas (2003); Vempala (2000). The proof is probabilistic, and demonstrates that, with a non-zero probability, the choice of a random projection satisfies the error bounds stated in the Lemma. For this reason, the Johnson-Lindenstrauss Lemma has become a term that collectively represents random projection methods, even when the implication of the lemma is not directly used.
As the statement of the Lemma makes clear, the reduced dimension $k$ controls the tradeoff between tractability and error in estimation. Notably, these results do not depend on $d$, the original dimension of the choice set (which is also the number of columns of $R$). Intuitively, this is because the JL Lemma only requires that the lengths are maintained between the set of projected and unprojected vectors. The definition of the random projection matrix (recall Section 3 above) ensures that the length of each projected vector is an unbiased estimator of the length of the corresponding unprojected vector, regardless of $d$; hence, $d$ plays no direct role in satisfying the error bounds postulated in the JL Lemma. [Footnote 14: However, $d$ does affect the variance of the length of the projected vectors, and hence affects the probabilities of achieving those bounds; see Achlioptas (2003) for additional discussion.]

According to Li et al. (2006), “the JL lemma is conservative in many applications because it was derived based on Bonferroni correction for multiple comparisons.” That is, the magnitude for $k$ in the statement of the Lemma is a worst-case scenario, and larger than necessary in many applications. This is seen in our computational simulations below, where we find that small values for $k$ still produce good results.
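The role of the reduced dimension can be illustrated with a small experiment that measures the worst relative error in squared pairwise distances after projection. The sketch below is illustrative only: the function name `max_distortion`, the simulated points, and the two values of the reduced dimension are hypothetical choices, and a dense sign matrix (the $s = 1$ case) is assumed for simplicity.

```python
import numpy as np

def max_distortion(points, k, rng):
    """Largest relative error in squared pairwise distances after projecting
    the rows of `points` (C, d) to k dimensions with an s = 1 projection."""
    C, d = points.shape
    R = rng.choice([-1.0, 1.0], size=(k, d)) / np.sqrt(k)
    proj = points @ R.T
    worst = 0.0
    for i in range(C):
        for j in range(i + 1, C):
            orig = np.sum((points[i] - points[j]) ** 2)
            new = np.sum((proj[i] - proj[j]) ** 2)
            worst = max(worst, abs(new / orig - 1.0))
    return worst

rng = np.random.default_rng(0)
points = rng.normal(size=(10, 1000))              # C = 10 points in d = 1000
dist_small_k = max_distortion(points, 20, rng)    # coarse embedding
dist_large_k = max_distortion(points, 500, rng)   # finer embedding
```

In runs of this kind the worst-case distortion shrinks as the reduced dimension grows, which is the tradeoff between tractability and error described in the text.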
The feature that the cyclic monotonicity inequalities can be written in terms of Euclidean norms between vectors justifies the application of the Johnson-Lindenstrauss Lemma, and hence random projection, to our estimator, which is based on these inequalities. In contrast, the “rank-order” inequalities, which underlie the maximum-score approach to semiparametric multinomial choice estimation [Footnote 15: For instance, Manski (1985), Fox (2007). The rank-order property makes pairwise comparisons of choices within a given choice set, and states that, for all $i \ne j$, $p_i(u) > p_j(u)$ iff $u_i > u_j$.], cannot be rewritten in terms of Euclidean norms between data vectors, and hence random projection cannot be used for those inequalities.
3.2 Convergence
In this section we show that, for any given data $\mathcal{D}$, the random projection estimator computed using the compressed data $\tilde{\mathcal{D}} = R \cdot \mathcal{D}$ converges in probability to the corresponding estimator computed using the uncompressed data $\mathcal{D}$, as $k$ grows large, where $k$ is the number of rows in the random projection matrix $R$. We begin with the simplest case, where the dimensionality of the original choice set $d$ is fixed, while the reduced dimension $k$ grows. [Footnote 16: In Appendix D we consider the case where $d$ grows with $k$.]
In order to highlight the random projection aspect of our estimator, we assume that the market shares and other data variables are observed without error. Hence, given the original (uncompressed) data $\mathcal{D}$, the criterion function $Q(\beta, \mathcal{D})$ is deterministic, while the criterion function $Q(\beta, \tilde{\mathcal{D}})$ is random solely due to the random projection procedure.
All proofs for results in this section are provided in Appendix C. We first show that the random-projected criterion function converges uniformly to the unprojected criterion function:
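Theorem 1 (Uniform convergence of criterion function).
For any given dataset $\mathcal{D}$, we have $\sup_{\beta \in \mathcal{B}} \left| Q(\beta, \tilde{\mathcal{D}}) - Q(\beta, \mathcal{D}) \right| \xrightarrow{p} 0$, as $k$ grows.
Essentially, from the defining features of the random projection matrix $R$, we can argue that $Q(\beta, \tilde{\mathcal{D}})$ converges in probability to $Q(\beta, \mathcal{D})$, pointwise in $\beta$. Then, because $Q(\beta, \tilde{\mathcal{D}})$ is convex in $\beta$ (which we will also show), we can invoke the Convexity Lemma from Pollard (1991), which says that pointwise and uniform convergence are equivalent for convex random functions.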
Finally, under the assumption that the deterministic criterion function $Q(\beta, \mathcal{D})$ (i.e. computed without random projection) admits an identified set, the random projection estimator converges in a set-wise sense to the same identified set. Convergence of the set estimator here means convergence in the Hausdorff distance, where the Hausdorff distance between two sets $A$ and $B$ is $d_H(A, B) = \max\left\{ \sup_{a \in A} \inf_{b \in B} \|a - b\|,\ \sup_{b \in B} \inf_{a \in A} \|a - b\| \right\}$.
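Assumption 1 (Existence of identified set $\Theta^*$).
For any given data $\mathcal{D}$, we assume that there exists a set $\Theta^*$ (that depends on $\mathcal{D}$) such that $\sup_{\beta \in \Theta^*} Q(\beta, \mathcal{D}) = \inf_{\beta \in \mathcal{B}} Q(\beta, \mathcal{D})$ and, $\forall \nu > 0$, $\inf_{\beta \notin B(\Theta^*, \nu)} Q(\beta, \mathcal{D}) > \sup_{\beta \in \Theta^*} Q(\beta, \mathcal{D})$, where $B(\Theta^*, \nu)$ denotes a union of open balls of radius $\nu$, each centered on an element of $\Theta^*$.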
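Theorem 2.
Suppose that Assumption 1 holds. For any given data $\mathcal{D}$, the random projection estimator $\tilde\Theta_k = \arg\min_{\beta \in \mathcal{B}} Q(\beta, \tilde{\mathcal{D}})$ converges in half-Hausdorff distance to the identified set $\Theta^*$ as $k$ grows, i.e. $\sup_{\beta \in \tilde\Theta_k} \inf_{\beta' \in \Theta^*} \|\beta - \beta'\| \xrightarrow{p} 0$ as $k$ grows.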
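The distinction between the Hausdorff and half-Hausdorff distances is easy to see on finite sets. The sketch below uses the standard definitions on arrays of parameter vectors; the function names and the one-dimensional example (a point estimate lying inside a gridded interval) are hypothetical and chosen only to show why the one-sided distance can be zero while the two-sided distance is not.

```python
import numpy as np

def half_hausdorff(A, B):
    """sup over rows a of A of the distance from a to the nearest row of B.

    A: (m, p), B: (q, p) finite sets of parameter vectors.
    """
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)  # (m, q) distances
    return float(D.min(axis=1).max())

def hausdorff(A, B):
    """Two-sided Hausdorff distance: the larger of the two half-distances."""
    return max(half_hausdorff(A, B), half_hausdorff(B, A))

# Hypothetical example in one dimension: a point estimate inside an interval.
estimate = np.array([[2.0]])
id_set = np.linspace(1.0, 3.5, 251).reshape(-1, 1)   # grid on [1.0, 3.5]
h_est_to_set = half_hausdorff(estimate, id_set)      # estimate lies in the set
h_set_to_est = half_hausdorff(id_set, estimate)      # far end of the interval
h_full = hausdorff(estimate, id_set)
```

A point estimate nested inside an interval is at zero half-distance from it, even though the interval itself is far from being covered by the point.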
In Appendix D, we analyse the setting where the dimensionality of the choice set, $d$, grows with $k$, with $d$ growing much faster than $k$. We show that convergence still holds true in this setting under one mild assumption. This assumption says that, for all $d$-dimensional vectors of covariates in the data $\mathcal{D}$, the fourth moment exists as $d$ grows.
4 Simulation Examples
In this section, we show simulation evidence that random projection performs well in practice. In these simulations, the sole source of randomness is the random projection matrices. This allows us to isolate the noise introduced by random projections, and to examine how the performance of random projections varies as we change $k$, the reduced dimensionality. Accordingly, the market shares and other data variables are assumed to be observed without error.
The main conclusion from this section is that the error introduced by random projection is negligible, even when the reduced dimension is very small. In the tables below, we see that the random projection method produces interval estimates that are always strictly nested within the identified set which was obtained when the full uncompressed data are used.
4.1 Setup
We consider projecting down from to . Recall that is the number of choices in our context. There are markets. The utility that an agent in market receives from choice is , where and independently across all choices and markets .^{17}^{17}17We also considered two other sampling assumptions on the regressors, and found that the results are robust to: (i) strong brand effects: , where , , and ; (ii) strong market effects: , where , , and .
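We normalize the parameters such that $\|\beta\| = 1$. This is achieved by parameterizing $\beta$ using polar coordinates: $\beta_1 = \cos\theta$ and $\beta_2 = \sin\theta$, where $\theta \in [0, 2\pi]$. The true parameter is $\theta_0 = 0.75\pi \approx 2.3562$.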
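With a single angle as the free parameter, the criterion can be minimized by a grid search, and the set of minimizers reported as an interval of angles. The sketch below assumes the convention $\beta_1 = \cos\theta$, $\beta_2 = \sin\theta$; the helper names, the grid resolution, and the toy criterion (flat on a known interval) are hypothetical and stand in for the actual criterion function.

```python
import numpy as np

def beta_from_theta(theta):
    """Unit-length parameter vector (beta_1, beta_2) = (cos theta, sin theta)."""
    theta = np.asarray(theta, dtype=float)
    return np.stack([np.cos(theta), np.sin(theta)], axis=-1)

def argmin_interval(thetas, q_values, tol=1e-12):
    """Smallest and largest grid point whose criterion is within tol of the minimum."""
    q_values = np.asarray(q_values, dtype=float)
    keep = q_values <= q_values.min() + tol
    return float(thetas[keep].min()), float(thetas[keep].max())

thetas = np.linspace(0.0, 2.0 * np.pi, 721)   # hypothetical grid on [0, 2*pi]
betas = beta_from_theta(thetas)               # shape (721, 2), each row has norm 1

# Hypothetical criterion that is flat (zero) on [1.5, 2.5] and positive elsewhere.
q_values = np.maximum(np.abs(thetas - 2.0) - 0.5, 0.0) ** 2
lb, ub = argmin_interval(thetas, q_values)    # approximately (1.5, 2.5)
```

When the criterion has a unique minimizer on the grid, the two returned bounds coincide, which matches the way lower and upper bounds are reported as equal in some designs.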
To highlight a distinct advantage of our approach, we choose a distribution of the error term that is neither exchangeable nor belongs to the generalized extreme value family. Specifically, we let the additive error term follow an MA(2) process, so that the errors are serially correlated across products. To summarize, the utility that agent in market derives from choice is , where , and is distributed i.i.d with .
Using the above specification, we generate the data $\mathcal{D} = \{(X^{(1)}, p^{(1)}), \dots, (X^{(n)}, p^{(n)})\}$ for $n$ markets, where $p^{(m)}$ corresponds to the $d$-by-1 vector of simulated choice probabilities for market $m$: the $j$-th row of $p^{(m)}$ is the simulated probability that choice $j$ is chosen in market $m$. We then perform random projection on $\mathcal{D}$ to obtain the compressed dataset $\tilde{\mathcal{D}}$. Specifically, for all markets $m$, $(\tilde X^{(m)}, \tilde p^{(m)}) = (R X^{(m)}, R p^{(m)})$, where $R$ is a realized $k \times d$ random projection matrix as in Definition 2. Having constructed the compressed dataset, the criterion function in Eq. 4 is used to estimate $\beta$. We restrict to cycles of length 2 and 3 in computing Eq. 4; however, we find that using only cycles of length 2 did not change the result in any noticeable way.
The random projection matrix is parameterized by $s$ (see Definition 2). We set $s = 1$, which corresponds to the optimal random projection matrix. In Table 2, we show that sparse random projections ($s = \sqrt{d}$ in Definition 2) perform just as well. Sparse random projections are much faster to perform; for instance, when $d = 5{,}000$, we sample less than 2% of the data, as over 98% of the entries of the random projection matrix are zeros.
In these tables, the rows correspond to different designs where the dimension of the dataset is projected down from $d$ to $k$. For each design, we estimate the model using 100 independent realizations of the random projection matrix. We report the means of the upper and lower bounds of the estimates, as well as their standard deviations. We also report the interval spanned by the 25th percentile of the lower bounds and the 75th percentile of the upper bounds. The last column reports the actual identified set that is computed without using random projections. (In the Appendix, Tables 5 and 6, we see that in all the runs, our approach produces interval estimates that are strictly nested within the identified sets.)

The results indicate that, in most cases, optimization of the randomly-projected criterion function $Q(\beta, \tilde{\mathcal{D}})$ yields a unique minimum, in contrast to the unprojected criterion function $Q(\beta, \mathcal{D})$, which is minimized on an interval. For instance, in the fourth row of Table 1, we see that the true identified set for this specification, computed using the unprojected data, is [1.2038, 3.5914], but the projected criterion function is always uniquely minimized (across all 100 replications). Moreover, the average point estimate for $\theta$ is equal to 2.3766, where the true value is 2.3562. This is unsurprising, and occurs often in the moment inequality literature; the random projection procedure introduces noise into the projected inequalities so that, apparently, there are no values of the parameters which jointly satisfy all the projected inequalities, leading to a unique minimizer for the projected criterion function.

| Design | mean LB (s.d.) | mean UB (s.d.) | 25th LB, 75th UB | True id set |
|---|---|---|---|---|
|  | 2.3459 (0.2417) | 2.3459 (0.2417) | [2.1777, 2.5076] | [1.4237, 3.2144] |
|  | 2.2701 (0.2582) | 2.3714 (0.2832) | [2.1306, 2.6018] | [1.2352, 3.4343] |
|  | 2.4001 (0.2824) | 2.4001 (0.2824) | [2.2248, 2.6018] | [1.1410, 3.4972] |
|  | 2.3766 (0.3054) | 2.3766 (0.3054) | [2.1306, 2.6018] | [1.2038, 3.5914] |
|  | 2.2262 (0.3295) | 2.4906 (0.3439) | [1.9892, 2.7667] | [1.2038, 3.5914] |

Table 1. Replicated 100 times using independently realized random projection matrices. The true value of is 2.3562. Rightmost column reports the interval of points that minimized the unprojected criterion function.

| Design | mean LB (s.d.) | mean UB (s.d.) | 25th LB, 75th UB | True id set |
|---|---|---|---|---|
|  | 2.3073 (0.2785) | 2.3073 (0.2785) | [2.1306, 2.5076] | [1.4237, 3.2144] |
|  | 2.2545 (0.2457) | 2.3473 (0.2415) | [2.0363, 2.5076] | [1.2352, 3.4343] |
|  | 2.3332 (0.2530) | 2.3398 (0.2574) | [2.1777, 2.5076] | [1.1410, 3.4972] |
|  | 2.3671 (0.3144) | 2.3671 (0.3144) | [2.1777, 2.5547] | [1.2038, 3.5914] |
|  | 2.3228 (0.3353) | 2.5335 (0.3119) | [2.1306, 2.7667] | [1.2038, 3.5914] |

Table 2. Replicated 100 times using independently realized sparse random projection matrices (where in Definition 2). The true value of is 2.3562. Rightmost column reports the interval of points that minimized the unprojected criterion function.
5 Empirical Application: a discrete-choice model incorporating both store and brand choices
For our empirical application, we use supermarket scanner data made available by the Chicago-area Dominick's supermarket chain. (Footnote 18: This dataset has previously been used in many papers in both economics and marketing; see a partial list at http://research.chicagobooth.edu/kilts/marketingdatabases/dominicks/papers.) Dominick's operated a chain of grocery stores across the Chicago area, and the database records sales information on many product categories, at the store and week level, for each Dominick's store. For this application, we look at the soft drinks category.
For our choice model, we consider a model in which consumers choose both the type of soft drink and the store at which they make their purchase. Such a model of joint store and brand choice allows consumers to change not only their brand choices but also their store choices in response to across-time variation in economic conditions. For instance, Coibion et al. (2015) analyze supermarket scanner data and find evidence suggesting the importance of “store-switching” in dampening the effects of inflation in posted store prices during recessions.
Such a model of store and brand choice also highlights a key benefit of our semiparametric approach. A typical parametric model for store and brand choice would be a nested logit model, in which the available brands and stores belong to different tiers of the nesting structure. However, one issue with the nested logit approach is that the results may not be robust to different assumptions on the nesting structure: for instance, one researcher may nest brands below stores, while another may be inclined to nest stores below brands. These two alternative specifications would differ in how the joint distribution of the utility shocks between brands at different stores is modeled, leading to different parameter estimates. Typically, there are no a priori guides on the correct nesting structure to impose. (Footnote 19: Because of this, Hausman and McFadden (1984) have developed formal econometric specification tests for the nested logit model.)

In this context, a benefit of our semiparametric approach is that we are agnostic as to the joint distribution of utility shocks; hence our approach accommodates models in which stores are in the upper nest and brands in the lower nest, or vice versa, as well as any other model in which the stores or brands could be divided into further sub-nests.
We have “markets”, where each market corresponds to a distinct two-week interval between October 3, 1996 and April 30, 1997, which is the last recorded date. We include sales at eleven Dominick's supermarkets in north-central Chicago, as illustrated in Figure 1. Among these eleven supermarkets, most are classified as premium-tier stores, while two are medium-tier stores (distinguished by dark black spots in Figure 1); stores in different tiers sell different ranges of products.

| Definition | Summary statistics |
|---|---|
| The average price of the store-upc at market | Mean: $2.09, s.d.: $1.77 |
| The fraction of weeks in market for which store-upc was on sale as a bonus or promotional purchase; for instance “buy-one-get-one-half-off” deals | Mean: 0.27, s.d.: 0.58 |
| Total units of store-upc sold in market | Mean: 60.82, s.d.: 188.37 |
| A dummy variable indicating the period spanning 11/14/96 to 12/25/96, which includes the Thanksgiving and Christmas holidays | 6 weeks (3 markets) |
| Medium, non-premium stores | 2 out of 11 stores |
| Number of store-upc | 3059 |

Table 3. Number of observations is 45885 = 3059 upcs × 15 markets (2-week periods).
Note: Stores in the same tier share similar product selection, and also pricing to a certain extent.
Our store and brand choice model consists of choices, each corresponding to a unique store and universal product code (UPC) combination. We also define an outside option, for a total of choices. (Footnote 20: The outside option is constructed as follows: first we construct the market share as , where is the total units of store-upc sold in market , and is the total number of customers visiting the 11 stores and purchasing something at market . The market share for market ’s outside option is then .)
The summary statistics for our data sample are in Table 3.
Specification  (A)  (B)  (C)  (D)
price
bonus
price × bonus
holiday  0.0901  0.0238
price × holiday
price × medium_tier
Cycles of length 2 & 3

Table 4. The first row of each entry presents the median coefficient across 100 random projections; the second row presents the 25th and 75th percentiles among the 100 random projections. We use cycles of length 2 and 3 in computing the criterion function (Eq. 4).
Table 4 presents the estimation results. As in the simulation results above, we ran 100 independent random projections, and thus obtained 100 sets of parameter estimates, for each model specification. The results reported in Table 4 are therefore summary statistics of the estimates for each parameter. Since no location normalization is imposed for the error terms, we do not include constants in any of the specifications. For estimation, we used cycles of length 2 and 3. (Footnote 21: The results did not change in any noticeable way when we varied the length of the cycles used in estimation.)
Across all specifications, the price coefficient is strongly negative. The holiday indicator has a positive (but small) coefficient, suggesting that, all else equal, the end-of-year holidays are a period of peak demand for soft drink products. (Footnote 22: cf. Chevalier et al. (2003). Results are similar if we define the holiday period to extend into January, or to exclude Thanksgiving.) In addition, the interaction between price and holiday is strongly negative across specifications, indicating that households are more price-sensitive during the holiday season. To gauge the magnitude of this effect, consider a soft drink product priced initially at $1.00 with no promotion. The median parameter estimates for Specification (C) suggest that during the holiday period, households’ willingness-to-pay for this product falls by as much as if the price of the product had increased by $0.27 during non-holiday periods. (Footnote 23: , where equals a scaling factor we used to scale the price data so that the price vector has the same length as the vector. The rescaling of data vectors is without loss of generality, and improves the performance of random projection by Eq. (5).)
We also obtain a positive sign on bonus and a negative sign on the interaction price × bonus across all specifications, although their magnitudes are small, and there is more variability in these parameters across the different random projections. Discounts thus seem to make consumers more price-sensitive (i.e., they make the price coefficient more negative). Since any price discounts will be captured in the price variable itself, the bonus coefficients capture additional effects that the availability of discounts has on behavior, beyond price. Hence, the negative coefficient on the interaction price × bonus may be consistent with a bounded-rationality view of consumer behavior, whereby the availability of a discount on a brand draws consumers’ attention to its price, making them more aware of a product’s exact price once they know that it is on sale.
In specification (D), we introduce the store-level covariate medium-tier, interacted with price. However, the estimates of its coefficient are noisy, and vary widely across the 100 random projections. This is not surprising, as medium-tier is a time-invariant variable and, apparently here, interacting it with price still does not result in enough variation for reliable estimation.
6 Conclusion
In this paper, we used random projection, an important tool for dimension reduction from machine learning, to estimate multinomial-choice models with large choice sets, a class of models that arises in many empirical applications. Unlike many recent applications of machine learning in econometrics, dimension reduction here is not required for selecting among high-dimensional covariates, but rather for reducing the inherent high dimensionality of the model (i.e., reducing the size of agents’ choice sets).
Our estimation procedure takes two steps. First, the high-dimensional choice data are projected (embedded stochastically) into a lower-dimensional Euclidean space. This procedure is justified by results in machine learning, which show that the pairwise distances between data points are preserved, with high probability, during data compression. As we show, in practice the random projection can be very sparse, in the sense that only a small fraction (roughly 1%) of the dataset is used in constructing the projection. In the second step, estimation proceeds using the cyclic monotonicity inequalities implied by the multinomial choice model. By using these inequalities for estimation, we avoid making explicit distributional assumptions regarding the random utility errors; hence, our estimator is semiparametric. The estimator works well in computational simulations and in an application to a real-world supermarket scanner dataset.
We are currently considering several extensions. First, we are undertaking another empirical application in which consumers can choose among bundles of brands, which would more fully exploit the benefits of our random projection approach. Second, another benefit of random projection is that it preserves privacy, in that the researcher no longer needs to handle the original dataset but rather a “jumbled-up” random version of it. (Footnote 24: cf. Heffetz and Ligett (2014).) We are currently exploring additional applications of random projection for econometric settings in which privacy may be an issue.
Appendix A Additional Tables and Figures

| Design | min LB, max UB | True id set |
|---|---|---|
|  | [1.8007, 3.3087] | [1.4237, 3.2144] |
|  | [1.7536, 2.9317] | [1.2352, 3.4343] |
|  | [1.6593, 2.9317] | [1.1410, 3.4972] |
|  | [1.6593, 3.1202] | [1.2038, 3.5914] |
|  | [1.6593, 3.1202] | [1.2038, 3.5914] |

Table 5.

| Design | min LB, max UB | True id set |
|---|---|---|
|  | [1.4237, 2.9788] | [1.4237, 3.2144] |
|  | [1.7536, 2.9788] | [1.2352, 3.4343] |
|  | [1.6122, 3.0259] | [1.1410, 3.4972] |
|  | [1.4237, 3.3558] | [1.2038, 3.5914] |
|  | [1.6593, 3.0259] | [1.2038, 3.5914] |

Table 6.

Appendix B Equivalence of alternative representation of cyclic monotonicity
Here we show the equivalence of Eqs. (1) and (6), as two alternative statements of the cyclic monotonicity inequalities. We begin with the second statement (6). We have
Similarly
In the previous two displayed equations, the first two terms cancel out. By shifting the indices forward we have:
Moreover, by definition of a cycle that , , we then have:
Hence
Therefore, cyclic monotonicity of Eq. (1) is satisfied if and only if this formulation of cyclic monotonicity in terms of Euclidean norms is satisfied.
∎
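The equivalence can also be checked numerically. The sketch below assumes that the inner-product form sums (U_{l+1} - U_l) · p_l around a cycle and that the norm form sums ||U_l - p_l||^2 - ||U_l - p_{l-1}||^2 around the same cycle; this indexing is an assumption about Eqs. (1) and (6), and the random data are purely illustrative. Under these assumed forms the squared norms of the choice probability vectors telescope around the cycle, so the norm form equals exactly twice the inner-product form and the two share the same sign.

```python
import numpy as np

rng = np.random.default_rng(2)
L, d = 4, 6                                   # hypothetical cycle length and dimension
U = rng.normal(size=(L, d))                   # one utility vector per market in the cycle
P = rng.dirichlet(np.ones(d), size=L)         # one probability vector per market

nxt = np.roll(np.arange(L), -1)               # successor of each market in the cycle
prv = np.roll(np.arange(L), 1)                # predecessor of each market in the cycle

# Assumed inner-product form: sum over the cycle of (U_{l+1} - U_l) . p_l
inner_form = sum((U[nxt[l]] - U[l]) @ P[l] for l in range(L))

# Assumed norm form: sum over the cycle of ||U_l - p_l||^2 - ||U_l - p_{l-1}||^2
norm_form = sum(np.sum((U[l] - P[l]) ** 2) - np.sum((U[l] - P[prv[l]]) ** 2)
                for l in range(L))

print(np.isclose(norm_form, 2.0 * inner_form))
```

Since the two expressions differ only by a positive factor, one is nonpositive if and only if the other is, which is the content of the equivalence.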
Appendix C Proof of Theorems in Section 3.2
We first introduce two auxiliary lemmas.
Lemma 2 (Convexity Lemma, Pollard (1991)).
Suppose is a sequence of convex random functions defined on an open convex set in , which converges in probability to some , for each . Then goes to zero in probability, for each compact subset of .
Lemma 3.
The criterion function is convex in for any given dataset , where is an open convex subset of .
Proof.
We want to show that , where , and we suppress the dependence of on the data .
(8)  
(9)  
Proof of Theorem 1: Recall from Eq. (5) that for any two vectors , and for the class of by random projection matrices, , considered in Definition 2, we have:
(10)  
(11) 
Therefore by Chebyshev’s inequality, converges in probability to as grows large. It follows that for any given , and , we have , where and are the projected versions of and . Applying the Continuous Mapping Theorem to the criterion function in Eq. 7, we obtain that converges in probability to pointwise for every as grows large.
By Lemma 3, the criterion function is convex in for any given data , where is an open convex subset of . Therefore, we can immediately invoke the Convexity Lemma to show that pointwise convergence of the function implies that converges uniformly to .
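The pointwise convergence step can be illustrated by simulation. The sketch below uses a dense projection matrix with entries +1 or -1 and a 1/sqrt(k) scaling, which is assumed here to correspond to the dense case of Definition 2; the dimension, the values of k, and the number of replications are hypothetical. It draws two fixed unit vectors and records how the projected inner product varies across independent projections.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 500                                        # hypothetical original dimension
u = rng.normal(size=d)
v = rng.normal(size=d)
u /= np.linalg.norm(u)
v /= np.linalg.norm(v)
true_ip = float(u @ v)                         # inner product before projection

def projected_inner_products(k, reps=100):
    """Inner products of the projected vectors across independent projections."""
    draws = np.empty(reps)
    for r in range(reps):
        R = rng.choice([1.0, -1.0], size=(k, d))   # dense +/-1 entries (assumed form)
        draws[r] = (R @ u) @ (R @ v) / k           # each vector is scaled by 1/sqrt(k)
    return draws

draws = {k: projected_inner_products(k) for k in (10, 100, 400)}
for k, x in draws.items():
    print(k, round(x.mean() - true_ip, 4), round(x.std(), 4))
```

In this sketch the average projected inner product stays close to the unprojected one while its spread shrinks as k increases, which is the behavior that the Chebyshev argument formalizes.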
Proof of Theorem 2: The result follows readily from Assumption 1 and Theorem 1, by invoking Chernozhukov et al. (2007). The key is in recognizing that (i) in our random finite-sample criterion function, the randomness stems from the by random projection matrix, and (ii) the deterministic limiting criterion function here is defined to be the criterion function computed without random projection, taking the full dataset as given. We can then strengthen the notion of half-Hausdorff convergence to full Hausdorff convergence by using the augmented set estimator as in Chernozhukov et al. (2007).
Appendix D Additional convergence result
Assumption 2.
Suppose that as the dimensionality of the choice set, , grows, the (deterministic) sequence of data satisfies the following two assumptions. (i) Let be any vector of covariates in ; then exists and is bounded as grows. (ii) Without loss of generality, assume that for all vectors of covariates in the data , as grows. Part (ii) is without loss of generality because the cardinality of utilities can be rescaled.
As before, the only source of randomness is in the random projection. Accordingly, the sequence of data as grows is deterministic.
Theorem 3.
Proof.
Let be the dimensional vector of utilities that market derives from each of the choices (before realization of shocks), and let be the corresponding observed choice probabilities for market in the data . For any , and any pair of markets , we have from Equation 5:
(12) 