1 Introduction
Choice modeling aims to find statistical models that capture human behavior when faced with a set of alternatives. Classical examples include consumer purchasing decisions, choices of schooling or employment, and commuter choices for modes of transportation among available options. Traditional models are based on different assumptions about human decision making, e.g., Thurstone's Case V model (Thurstone, 1927) or the Bradley-Terry-Luce (BTL) model (Bradley and Terry, 1952). Nevertheless, in complex scenarios, like online shopping sessions presenting numerous alternatives to user-specific queries, these assumptions are often too restrictive to provide accurate predictions.
Formally, there is a universe of alternatives $\mathcal{U}$, possibly infinite. In each choice situation, some finite choice set $S \subseteq \mathcal{U}$ is considered. A choice model is a distribution over the alternatives of a given choice set $S$, where the probability of choosing the item $i$ among $S$ is denoted $P(i \mid S)$. These models can be further parameterized by the alternatives' features and by those of the individual making the choice.

An important class of choice models is the Multinomial Logit (MNL), a generalization of the BTL model (defined for pairwise choices only) to larger sets. Any model satisfying Luce's axiom, also known as independence of irrelevant alternatives (Luce, 1959), is equivalent to some MNL model (Luce, 1977). In this class, the probability of choosing some item $i$ from a given set $S$ can be expressed as $P(i \mid S) = e^{u_i} / \sum_{j \in S} e^{u_j}$, where $u_i$ is the latent value of the item $i$. Luce's axiom implies stochastic transitivity, i.e., if $p_{ab} \geq 1/2$ and $p_{bc} \geq 1/2$, then $p_{ac} \geq \max(p_{ab}, p_{bc})$, where $p_{xy} \equiv P(x \mid \{x, y\})$ (Luce, 1977). Stochastic transitivity implies the necessity of a total order across all elements and also prevents the model from expressing cyclic preference situations like the stochastic rock-paper-scissors game described in Section 3.2. Thurstone's Case V model exhibits strict stochastic transitivity but does not satisfy Luce's axiom (Adams and Messick, 1958). Luce's axiom and stochastic transitivity are strong assumptions that often do not hold for empirical choice data (see Ragain and Ugander (2016) and references therein). For example, Luce's axiom prevents models from expressing framing effects like asymmetric dominance (or the decoy effect), which occurs when adding a third alternative that is dominated by one of the two alternatives in a choice set increases the preference for the dominating alternative (Huber et al., 1982).
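To make the MNL form concrete, here is a minimal sketch (with hypothetical latent values) computing $P(i \mid S)$ as a softmax of the latent values; the final assertion illustrates Luce's axiom, since the odds between two items are unchanged by the composition of the choice set:

```python
import numpy as np

def mnl_probabilities(utilities):
    """Multinomial Logit: P(i | S) = exp(u_i) / sum_{j in S} exp(u_j)."""
    u = np.asarray(utilities, dtype=float)
    e = np.exp(u - u.max())  # subtract the max for numerical stability
    return e / e.sum()

# Hypothetical latent values for a choice set of three alternatives.
p3 = mnl_probabilities([1.0, 0.0, -0.5])
p2 = mnl_probabilities([1.0, 0.0])
# Luce's axiom: the odds between two items are unchanged when the
# choice set grows, since both probabilities share the same normalizer.
assert abs(p3[0] / p3[1] - p2[0] / p2[1]) < 1e-9
```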
A larger class of models is that of Random Utility Models (RUM) (Block and Marschak, 1960; Manski, 1977), which includes MNL but also other models satisfying neither Luce's axiom nor stochastic transitivity. This class associates with each alternative $i \in \mathcal{U}$ a random utility $U_i$ and defines, for each subset $S$, the probability $P(i \mid S) = \Pr(U_i \geq U_j, \forall j \in S)$. RUM exhibits regularity, i.e., if $A \subseteq B$ then $P(i \mid A) \geq P(i \mid B)$. Regularity also prevents models from expressing framing effects such as asymmetric dominance (Huber et al., 1982). The class of Nested MNL (McFadden, 1980) can express RUM models but also others that do not obey regularity. Nevertheless, inference is practically difficult for Nested MNL models.

Recently, a more flexible class of models called Pairwise Choice Markov Chains (PCMC) has been introduced in Ragain and Ugander (2016). This class includes MNL but also other models that satisfy neither Luce's axiom, nor stochastic transitivity, nor regularity. It defines the choice distribution as the stationary distribution of a continuous-time Markov chain defined by some transition rate matrix. Still, it satisfies a weakened version of Luce's axiom called uniform expansion, stating that if we add "copies" of an alternative (with no preference between them), the probability of choosing one element among the copies is invariant to the number of copies. Although the flexibility of this class is appealing, the proposed inference is based on maximizing the likelihood of the rate matrix for the observed choices, which is prone to overfitting when the number of observations for each possible alternative is small and is inappropriate when new alternatives can be seen at test time.
Alternatives and individuals making choices can be described by a set of features that can then be used to understand their impact on the choice probability. A linear-in-parameters MNL assumes that the latent value is given by a linear combination of the features of the alternatives and the individual. These models can take the individual's features into account, but inference suffers when observations are scarce and is inappropriate when new alternatives can be seen at test time. The latent class MNL (LC-MNL) model (Greene and Hensher, 2003) takes individual heterogeneity into account by using a Bayesian mixture over different latent classes (whose number must be specified) in which homogeneity and linearity are assumed. A linear-in-features parameterization for PCMC was suggested in (Ragain and Ugander, 2016, Appendix), but it still requires building a rate matrix over the whole universe at training time, which makes it inappropriate for large universes. In complex cases like airline itinerary choice, where the alternatives strongly depend on an individual-specific query and some features, like price, can be dynamic, the previous approaches have limited expressive power or are inappropriate.
Two recently introduced methods allow complex feature handling for alternatives and individuals. Mottini and Acuna-Agost (2017) propose a recurrent neural network method that learns to point, within a sequence of alternatives, to the chosen one. This model is appealing because of its feature learning capability, but neither its choice-theoretic properties nor its dependence on the order of the sequence has been studied.
Lhéritier et al. (2019) propose to train a Random Forest classifier to predict whether an alternative is going to be chosen or not, independently of the rest of the alternatives of the choice set. This approach does not take into account the fact that in each choice set exactly one alternative is chosen. For this reason, the probabilities provided by the model are only used as scores to rank the alternatives, which can be interpreted as latent values, making it essentially equivalent to a nonlinear MNL. To escape this limitation and make the latent values dependent on the session, relative features are added (e.g., the price of the $i$-th alternative is expressed relative to the prices of the other alternatives in the session). The nonparametric nature of this model is appealing, but its choice-theoretic properties have not been studied either.

In this work, we propose to endow PCMC with neural-network-based feature handling, therefore enjoying both the good theoretical properties of PCMC and the complex feature handling of the previous neural-network-based and nonparametric methods. This neural network parameterization of PCMC makes the inference amortized, allowing it to handle universes of large (and even infinite) size, as shown in our experiments on airline itinerary choice modeling.
2 Background: Pairwise Choice Markov Chains
2.1 Definition
A Pairwise Choice Markov Chain (PCMC) (Ragain and Ugander, 2016) defines the choice probability $P(i \mid S)$ as the probability mass on the alternative $i$ of the stationary distribution of a continuous-time Markov chain (CTMC) whose set of states corresponds to $S$. The model's parameters are the off-diagonal entries $q_{ij} \geq 0$ of a rate matrix $Q$ indexed by pairs of elements in $\mathcal{U}$. Given a choice set $S$, the choice distribution $\pi_S$ is the stationary distribution of the continuous-time Markov chain given by the matrix $Q_S$ obtained by restricting the rows and columns of $Q$ to elements in $S$ and setting $q_{ii} = -\sum_{j \in S \setminus \{i\}} q_{ij}$ for each $i \in S$. Therefore, the distribution is parameterized by the off-diagonal transition rates of $Q_S$.
The constraint

$q_{ij} + q_{ji} > 0$ for all $i \neq j$  (1)

is imposed in order to guarantee that the chain has a single closed communicating class, which implies the existence and uniqueness of the stationary distribution (see, e.g., Norris (1997)), obtained by solving
$\pi_S Q_S = \mathbf{0}$ and $\pi_S \mathbf{1}^\top = 1$  (2)

where $\mathbf{0}$ and $\mathbf{1}$ are row vectors of zeros and ones, respectively. Since any column of $Q_S$ is the opposite of the sum of the rest of the columns, it is equivalent to solve

$\pi_S \tilde{Q}_S = [\mathbf{0} \; 1]$  (3)

where $\tilde{Q}_S$ is obtained by replacing the last column of $Q_S$ with $\mathbf{1}^\top$.
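The linear system of Eq. 3 can be made concrete with a small NumPy sketch; the rate values below are hypothetical and `stationary_distribution` is an illustrative helper, not part of any library:

```python
import numpy as np

def stationary_distribution(Q):
    """Solve pi Q = 0 with pi 1 = 1 (Eqs. 2-3) for a CTMC rate matrix Q
    whose rows sum to zero, by replacing the last column of Q with ones."""
    n = Q.shape[0]
    A = Q.astype(float).copy()
    A[:, -1] = 1.0                  # tilde-Q: last column replaced by ones
    b = np.zeros(n)
    b[-1] = 1.0                     # right-hand side [0, ..., 0, 1]
    return np.linalg.solve(A.T, b)  # pi A = b  <=>  A^T pi^T = b^T

# Hypothetical transition rates q_ij for a 3-alternative choice set.
q = np.array([[0.0, 2.0, 1.0],
              [1.0, 0.0, 3.0],
              [2.0, 1.0, 0.0]])
Q_S = q - np.diag(q.sum(axis=1))    # q_ii = -sum_{j != i} q_ij
pi = stationary_distribution(Q_S)
assert abs(pi.sum() - 1.0) < 1e-9 and np.allclose(pi @ Q_S, 0.0)
```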
2.2 Properties
In Ragain and Ugander (2016), it is shown that PCMC models can represent any MNL model, but also models that are non-regular and do not satisfy stochastic transitivity (using the rock-paper-scissors example of Section 3.2).
In addition, they show that PCMC models feature a property termed contractibility, which intuitively means that subsets can be "contracted" to a single "type" when the probability of choosing an element of a subset is independent of the pairwise probabilities between elements within the subsets. Formally, a partition of $S$ into non-empty sets $A_1, \dots, A_k$ is a contractible partition if $q_{ij} = \lambda_{ab}$ for all $i \in A_a$ and $j \in A_b$, for some $\lambda_{ab} > 0$, for all $a \neq b$. Then, the following proposition is shown.
Proposition 1 (Ragain and Ugander (2016)).
For a given $S$, let $A_1, \dots, A_k$ be a contractible partition for two PCMC models on $S$ represented by rate matrices $Q$ and $Q'$ with stationary distributions $\pi$ and $\pi'$, respectively. Then, for any $a \in \{1, \dots, k\}$, $\sum_{i \in A_a} \pi_i = \sum_{i \in A_a} \pi'_i$.
It is then shown that contractibility implies uniform expansion, formally defined as follows.
Definition 1 (Uniform Expansion).
Consider a choice between $n$ elements in a set $S(1) = \{i_1, \dots, i_n\}$ and another choice from a set $S(k)$ containing $k$ copies of each of the $n$ elements: $S(k) = \{i_{11}, \dots, i_{1k}, \dots, i_{n1}, \dots, i_{nk}\}$. The axiom of uniform expansion states that $P_{S(1)}(i_j) = \sum_{l=1}^{k} P_{S(k)}(i_{jl})$ for each $j \in \{1, \dots, n\}$ and all $k \geq 1$.
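Since MNL models belong to the PCMC class, uniform expansion can be illustrated numerically on an MNL instance with hypothetical latent values: the $k$ copies of each element jointly receive exactly the probability the element had in the original set.

```python
import numpy as np

def mnl(u):
    """Softmax choice probabilities for latent values u."""
    e = np.exp(np.asarray(u, dtype=float))
    return e / e.sum()

u = [0.3, -1.2, 0.8]         # latent values of the n = 3 original elements
k = 4                         # number of indistinguishable copies of each
p_one = mnl(u)                # choice distribution over S(1)
p_k = mnl(np.repeat(u, k))    # choice distribution over S(k)
# Uniform expansion: the k copies of element j jointly receive P_{S(1)}(i_j).
assert np.allclose(p_k.reshape(len(u), k).sum(axis=1), p_one)
```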
2.3 Inference
Given a dataset of observed choices $D$, the inference method proposed in Ragain and Ugander (2016) consists in maximizing the log-likelihood of the rate matrix $Q$ indexed by the elements of $\mathcal{U}$

$\log L(Q; D) = \sum_{S} \sum_{i \in S} C_{iS} \log \pi_S(i)$  (4)

where $\pi_S(i)$ denotes the probability that $i$ is selected from $S$ as a function of $Q$, and $C_{iS}$ denotes the number of times in the data that $i$ was chosen out of set $S$.
This optimization is difficult since there is no general closed-form expression for $\pi_S(i)$, and its implicit definition also makes it difficult to derive gradients of the likelihood with respect to the parameters of $Q$. The authors propose to use Sequential Least Squares Programming (SLSQP) to maximize $\log L$, which is non-concave in general. However, in their experiments, they encounter numerical instabilities leading to violations of the constraint of Eq. 1 in the PCMC definition, which were solved with additive smoothing at the cost of some efficacy of the model. In addition, when the examples of each alternative are scarce, as in the application of Section 4, this inference approach is prone to severe overfitting and is inappropriate for predicting unseen alternatives. These two drawbacks motivate the amortized inference approach we introduce next.
3 PCMC-Net
We propose an amortized inference approach for PCMC based on a neural network architecture, called PCMC-Net, that uses the alternatives' and the individual's features to determine the transition rates and can be trained using standard stochastic gradient descent techniques.
3.1 Architecture
Input layer
Let $x_i \in \mathcal{X}$ be the tuple of features of the $i$-th alternative of the choice set, belonging to a given feature space $\mathcal{X}$, and let $y \in \mathcal{Y}$ be the tuple of the individual's features, belonging to a given feature space $\mathcal{Y}$. The individual's features are allowed to be an empty tuple.
Representation layer
The first layer is composed of a representation function for the alternatives' features

$E_x(\cdot\,; \theta_x) : \mathcal{X} \to \mathbb{R}^{d_x}$  (5)

and a representation function for the individual's features

$E_y(\cdot\,; \theta_y) : \mathcal{Y} \to \mathbb{R}^{d_y}$  (6)

where $\theta_x$ and $\theta_y$ are the sets of weights parameterizing them and $d_x, d_y$ are hyperparameters. These functions can include, e.g., embedding layers for categorical variables, a convolutional network for images or text, etc., depending on the inputs' types.
Cartesian product layer
In order to build the transition rate matrix, all the pairs of distinct alternatives need to be considered. This is accomplished by computing the Cartesian product

$\{(E_x(x_i), E_x(x_j)) : i \neq j\}$  (7)

The combinations of embedded alternatives are concatenated together with the embedded features of the individual, i.e.,

$z_{ij} = E_x(x_i) \oplus E_x(x_j) \oplus E_y(y)$  (8)

where $\oplus$ denotes vector concatenation.
Transition rate layer
The core component is a model of the transition rate $q_{ij}$:

$q_{ij} = \epsilon + \max(0, f(z_{ij}; \theta_f))$  (9)

where $f$ consists of multiple fully connected layers parameterized by a set of weights $\theta_f$, and $\epsilon > 0$ is a hyperparameter. Notice that taking the maximum with 0 and adding $\epsilon$ guarantees non-negativity and the condition of Eq. 1. The transition rate matrix is then obtained as follows:

$Q_S = [q_{ij}]_{i,j \in S}$ with $q_{ii} = -\sum_{j \in S \setminus \{i\}} q_{ij}$  (10)
Stationary distribution layer
The choice probabilities correspond to the stationary distribution $\pi_S$, which is guaranteed to exist and be unique by the condition of Eq. 1, and can be obtained by solving the system

$\pi_S \tilde{Q}_S = [\mathbf{0} \; 1]$  (11)

by, e.g., partially pivoted LU decomposition, which can be differentiated with automatic differentiation.
The whole network is represented in Fig. 1.
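As an illustration of the full pipeline, the following NumPy sketch mimics the forward pass with random, untrained weights and the individual's features taken as an empty tuple. The layer structure follows Eqs. 5–11, but the one-layer rate network and all dimensions are illustrative choices, not the trained architecture of Section 4:

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h, eps = 4, 8, 1e-2            # representation dim, hidden dim, epsilon of Eq. 9

# Hypothetical "learned" weights, drawn at random for illustration only.
W_rep = rng.normal(size=(3, d_x))     # representation layer for 3 raw features (Eq. 5)
W1 = rng.normal(size=(2 * d_x, d_h))  # rate network f on a concatenated pair (Eq. 9)
w2 = rng.normal(size=d_h)

def rate(e_i, e_j):
    """Eq. 9: q_ij = eps + max(0, f(e_i ++ e_j)), so that q_ij + q_ji > 0 (Eq. 1)."""
    z = np.concatenate([e_i, e_j])    # Eq. 8 with an empty individual representation
    return eps + max(0.0, float(np.tanh(z @ W1) @ w2))

def forward(X):
    """Choice probabilities for a set of alternatives with raw features X (n x 3)."""
    E = X @ W_rep                     # representation layer
    n = len(X)
    Q = np.zeros((n, n))
    for i in range(n):                # Cartesian product + transition rate layers
        for j in range(n):
            if i != j:
                Q[i, j] = rate(E[i], E[j])
    Q -= np.diag(Q.sum(axis=1))       # q_ii = -sum_{j != i} q_ij (Eq. 10)
    A = Q.copy()
    A[:, -1] = 1.0                    # stationary distribution layer (Eq. 11)
    b = np.zeros(n)
    b[-1] = 1.0
    return np.linalg.solve(A.T, b)

pi = forward(rng.normal(size=(5, 3))) # a choice set of 5 alternatives
assert abs(pi.sum() - 1.0) < 1e-9 and (pi > 0).all()
```

Since every off-diagonal rate is at least $\epsilon$, the chain is irreducible and the resulting distribution is strictly positive.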
3.2 Properties
Nonregularity
As shown in Ragain and Ugander (2016), non-regular models can be obtained with certain rate matrices. For example, the stochastic rock-paper-scissors game can be described by a non-regular model obtained with the following transition rate matrix, where $p \in (1/2, 1]$ is the probability that the winning gesture is chosen in a pairwise comparison and the rows and columns are ordered as rock, paper, scissors:

$Q = \begin{pmatrix} -1 & p & 1-p \\ 1-p & -1 & p \\ p & 1-p & -1 \end{pmatrix}$  (12)
PCMC-Net can represent such a model by setting the design parameters as follows. In this case, the individual's features correspond to an empty tuple, yielding an empty vector as representation. By setting $E_x$ to a one-hot representation of the alternative (thus $d_x = 3$), a fully connected network $f$ consisting of one neuron (i.e., six coefficients and one bias) is enough to represent this matrix, since only six combinations of inputs are of interest.
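The non-regularity of this model can be checked numerically. The sketch below assumes the PCMC convention that the pairwise probability of choosing $i$ over $j$ is $q_{ji} / (q_{ij} + q_{ji})$ and takes $p = 0.9$: rock receives probability $1 - p = 0.1$ against paper alone, but probability $1/3$ once scissors is added, violating regularity.

```python
import numpy as np

def stationary(q):
    """Stationary distribution of the CTMC with off-diagonal rates q (Eq. 3)."""
    n = q.shape[0]
    A = q - np.diag(q.sum(axis=1))  # set the diagonal so that rows sum to zero
    A[:, -1] = 1.0
    b = np.zeros(n)
    b[-1] = 1.0
    return np.linalg.solve(A.T, b)

p = 0.9  # probability that the winning gesture is chosen in a pairwise duel
# Off-diagonal rates, rows/columns ordered rock, paper, scissors:
# q_ij = p when j beats i, so that P(i | {i, j}) = q_ji / (q_ij + q_ji).
q = np.array([[0.0,     p, 1 - p],
              [1 - p, 0.0,     p],
              [    p, 1 - p, 0.0]])

pi_full = stationary(q)                          # choice over all three gestures
p_rock_vs_paper = q[1, 0] / (q[0, 1] + q[1, 0])  # = 1 - p = 0.1
# Adding scissors *increases* rock's probability (0.1 -> 1/3): regularity fails.
assert np.isclose(pi_full[0], 1 / 3) and pi_full[0] > p_rock_vs_paper
```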
Nonparametric limit
More generally, the following theorem shows that any PCMC model can be arbitrarily well approximated by PCMC-Net.
Theorem 1.
If $E_x$, $E_y$ and $f$ are given enough capacity, PCMC-Net can approximate any PCMC model arbitrarily well.
Proof.
PCMC-Net forces the transition rates to be at least $\epsilon$, whereas the PCMC definition allows any $q_{ij} \geq 0$ as long as $q_{ij} + q_{ji} > 0$. Since multiplying all the entries of a rate matrix by some constant $c > 0$ does not affect the stationary distribution of the corresponding CTMC, let us consider, without loss of generality, an arbitrary PCMC model given by a transition rate matrix $Q$ whose off-diagonal entries are either at least $\epsilon$ or zero. Let $\pi$ be its stationary distribution. Then, let us consider the matrix $Q_c$ obtained by replacing the null entries of $Q$ by $\epsilon$ and by multiplying the non-null entries by some $c \geq 1$, and let $\pi_c$ be its stationary distribution. Since, by Cramer's rule, the entries of the stationary distribution are continuous functions of the entries of the rate matrix, for any $\delta > 0$, there exists $c$ large enough such that $\max_i |\pi_i - \pi_{c,i}| < \delta$.
Since deep neural networks are universal function approximators (Hornik et al., 1989), PCMC-Net can represent any such rate matrix $Q_c$ arbitrarily well if enough capacity is given to the network, which completes the proof. ∎
Contractibility
Let $Q$ and $Q'$ be the rate matrices obtained after the transition rate layer of two different PCMC-Nets on a finite universe of alternatives $\mathcal{U}$. Then, Proposition 1 can be applied. Regarding uniform expansion, when copies of an alternative are added to a choice set, their transition rates to and from the other elements of the choice set are identical, since they depend only on the alternatives' features. Therefore, PCMC-Net satisfies uniform expansion.
3.3 Inference
The logarithmic loss is used to assess the predicted choice distribution $\hat{P}_\theta(\cdot \mid S, y)$, given by the model parameterized by $\theta$ on the input $(S, y)$, against the index of the actual choice, denoted $c^*$, i.e.,

$\mathcal{L}(\theta) = -\log \hat{P}_\theta(c^* \mid S, y)$  (13)

Training can be performed using stochastic gradient descent with dropout to avoid overfitting; unlike the original inference approach, this procedure is numerically stable.
4 Experiments on airline itinerary choice modeling
In this section, we instantiate PCMC-Net for the case of airline itinerary choice modeling. As shown in Babutsidze et al. (2019), this kind of data often exhibits asymmetric dominance, calling for more flexible models such as PCMC. Nevertheless, in the considered dataset, alternatives rarely repeat themselves, which makes the original inference approach for PCMC inappropriate.
4.1 Dataset
We used the dataset from Mottini and Acuna-Agost (2017), consisting of flight booking sessions for a set of European origins and destinations. Each booking session contains up to 50 different itineraries, one of which has been booked by the customer. There are 815,559 distinct alternatives, among which 84% are singletons and 99% are observed at most seven times. In total, there are 33,951 choice sessions, of which 27,160 were used for training and 6,791 for testing. The dataset has a total of 13 features, both numerical and categorical, corresponding to individuals and alternatives (see Table 1).
Type  Feature  Range/Cardinality  
Individual  Categorical  Origin/Destination  97 
Search Office  11  
Numerical  Departure weekday  [0,6]  
Stay Saturday  [0,1]  
Continental Trip  [0,1]  
Domestic Trip  [0,1]  
Days to departure  [0, 343]  
Alternative  Categorical  Airline (of first flight)  63 
Numerical  Price  [77.15,16781.5]  
Stay duration (minutes)  [121,434000]  
Trip duration (minutes)  [105, 4314]  
Number connections  [2,6]  
Number airlines  [1,4]  
Outbound departure time (in s from midnight)  [0, 84000]  
Outbound arrival time (in s from midnight)  [0, 84000] 
4.2 Instantiation of PCMC-Net
PCMC-Net was implemented in PyTorch (Paszke et al., 2017). During training, a minibatch is composed of a number of sessions whose numbers of alternatives can vary; dynamic computation graphs are required in order to adapt to the varying session size. Stochastic gradient optimization is performed with Adam (Kingma and Ba, 2015).

In our experiments, numerical variables are unidimensional and thus are not embedded. They were standardized during a preprocessing step. Each categorical input of cardinality $c$ is passed through an embedding layer whose output dimension is obtained from $c$ by the usual rule of thumb.
We maximize regularization by using a dropout probability of 1/2 (see, e.g., Baldi and Sadowski (2013)). The additive constant $\epsilon$ of Eq. 9 was set to a small fixed value. The linear solver was implemented with torch.solve, which uses LU decomposition. Table 2 shows the hyperparameters and learning parameters that were optimized by performing 25 iterations of Bayesian optimization (using GPyOpt (The GPyOpt authors, 2016)). Early stopping is performed during training if no significant improvement with respect to the best log loss obtained so far is made on a validation set (a random sample consisting of 10% of the choice sessions of the training set) during 5 epochs.
parameter  range  best value 
learning rate    0.001 
batch size (in sessions)    16 
hidden layers in $f$    2 
nodes per layer in $f$    512 
activation  {ReLU, Sigmoid, Tanh, LeakyReLU}  LeakyReLU 
Using the hyperparameter values returned by the Bayesian optimization procedure and the number of epochs at early stopping (66), the final model is obtained by training on the union of the training and validation sets.
4.3 Results
We compare the performance of the PCMC-Net instantiation against three simple baselines:

Uniform: probabilities are assigned uniformly to each alternative.

Cheapest: alternatives are ranked by increasing price. This method is non-probabilistic.

Shortest: alternatives are ranked by increasing trip duration. This method is also non-probabilistic.
We also compare against the results presented in Lhéritier et al. (2019):

Multinomial Logit (MNL): choice probabilities are determined from the alternatives’ features only, using some feature transformations to improve the performance.

Latent Class Multinomial Logit (LC-MNL): in addition to the alternatives' features, it uses the individual's features to model the probability of belonging to latent classes, whose number is determined using the Akaike Information Criterion. Feature transformations are also used to improve the performance.

Random Forest (RF): a classifier is trained on the alternatives as if they were independent, considering both the individual's and the alternatives' features and using as label whether each alternative was chosen or not. Some alternatives' features are transformed to make them relative to the values within each choice set. Since the classifier evaluates each alternative independently, the probabilities within a given session generally do not add up to one, and are therefore just interpreted as scores to rank the alternatives.
Finally, we compare to

Deep Pointer Networks (DPN) (Mottini and Acuna-Agost, 2017): a recurrent neural network that uses both the features of the individual and those of the alternatives to learn to point to the chosen alternative from the choice sets given as sequences. The results depend on the order of the alternatives, which was taken as in the original paper, that is, as they were shown to the user.
We compute the following performance measures on the test set:

TOP-$k$ accuracy: the proportion of choice sessions in which the actual choice is among the top $k$ ranked alternatives. Ties are broken randomly. We consider $k \in \{1, 5\}$.

Normalized Log Loss (NLL): given a probabilistic choice model $\hat{P}$ and $N$ test sessions with choice sets $S_m$ and actual choices $c^*_m$, $\mathrm{NLL} = -\frac{1}{N} \sum_{m=1}^{N} \log \hat{P}(c^*_m \mid S_m)$.
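These two metrics can be sketched as follows; the sessions and probabilities below are made up for illustration, and ties are broken deterministically here rather than randomly:

```python
import numpy as np

def top_k_accuracy(probs_per_session, chosen, k):
    """Fraction of sessions whose actual choice is among the k best-ranked alternatives."""
    hits = 0
    for p, c in zip(probs_per_session, chosen):
        top = np.argsort(p)[::-1][:k]  # indices of the k highest scores
        hits += int(c in top)
    return hits / len(chosen)

def normalized_log_loss(probs_per_session, chosen):
    """Average negative log-probability assigned to the actual choices."""
    return -np.mean([np.log(p[c]) for p, c in zip(probs_per_session, chosen)])

# Two hypothetical sessions with 3 and 4 alternatives; chosen indices 0 and 2.
sessions = [np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.2, 0.6, 0.1])]
chosen = [0, 2]
assert top_k_accuracy(sessions, chosen, k=1) == 1.0  # both choices ranked first
assert abs(normalized_log_loss(sessions, chosen) + (np.log(0.7) + np.log(0.6)) / 2) < 1e-12
```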
Table 3 shows that PCMC-Net outperforms all the contenders in all the considered metrics. It achieves a 21.3% increase in TOP-1 accuracy and a 12.8% decrease in NLL with respect to the best contender for each metric. In particular, we observe that the best contenders in TOP-$k$ accuracy are LC-MNL and RF, both requiring manual feature engineering to achieve such performance, whereas PCMC-Net automatically learns the best representations. We also observe that our results are significantly better than those obtained with the previous deep learning approach DPN, showing the importance of the PCMC definition in our deep learning approach for modeling the complex behaviors observed in airline itinerary choice data.
method  TOP 1  TOP 5  NLL 

Uniform  .063  .255  3.24 
Cheapest  .164  .471  – 
Shortest  .154  .472  – 
MNL*  .224  .624  2.44 
LC-MNL*  .271  .672  2.33 
RF*  .273  .674  – 
DPN  .257  .665  2.33 
PCMC-Net  .331  .745  2.03 
(* results reported in Lhéritier et al. (2019))
5 Conclusions
We proposed PCMC-Net, a generic neural network architecture that equips PCMC choice models with amortized, automatic-differentiation-based inference using the alternatives' features. As a side benefit, the construction allows conditioning the probabilities on the individual's features. We showed that PCMC-Net is able to approximate any PCMC model arbitrarily well and thus maintains the flexibility of the class (e.g., the ability to represent non-regular models) and the desired property of uniform expansion. Being neural network based, PCMC-Net allows complex feature handling like previous machine-learning and deep-learning-based approaches, but with additional theoretical guarantees.
We proposed a practical implementation and showed the benefits of the construction on the challenging problem of airline itinerary choice prediction, where asymmetric dominance effects are often observed and where alternatives rarely appear more than once, making the original inference approach for PCMC inappropriate.
As future work, we foresee investigating the application of PCMC-Net to data with complex features (e.g., images, texts, graphs) to assess the impact of such information on preferences and choice.
References
Adams and Messick (1958). An axiomatic formulation and generalization of successive intervals scaling. Psychometrika 23(4), pp. 355–368.
The GPyOpt authors (2016). GPyOpt: a Bayesian optimization framework in Python. http://github.com/SheffieldML/GPyOpt
Babutsidze et al. (2019). Asymmetric dominance in airfare choice. In 6th International Conference on Choice Modelling.
Baldi and Sadowski (2013). Understanding dropout. In Advances in Neural Information Processing Systems, pp. 2814–2822.
Block and Marschak (1960). Random orderings and stochastic theories of response. Contributions to Probability and Statistics 2, pp. 97–132.
Bradley and Terry (1952). Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika 39(3/4), pp. 324–345.
Greene and Hensher (2003). A latent class model for discrete choice analysis: contrasts with mixed logit. Transportation Research Part B: Methodological 37(8), pp. 681–698.
Hornik et al. (1989). Multilayer feedforward networks are universal approximators. Neural Networks 2(5), pp. 359–366.
Huber et al. (1982). Adding asymmetrically dominated alternatives: violations of regularity and the similarity hypothesis. Journal of Consumer Research 9(1), pp. 90–98.
Kingma and Ba (2015). Adam: a method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA.
Lhéritier et al. (2019). Airline itinerary choice modeling using machine learning. Journal of Choice Modelling 31, pp. 198–209.
Luce (1977). The choice axiom after twenty years. Journal of Mathematical Psychology 15(3), pp. 215–233.
Luce (1959). Individual choice behavior: a theoretical analysis. Wiley, New York, NY, USA.
Manski (1977). The structure of random utility models. Theory and Decision 8(3), pp. 229–254.
McFadden (1980). Econometric models for probabilistic choice among products. Journal of Business, pp. S13–S29.
Mottini and Acuna-Agost (2017). Deep choice model using pointer networks for airline itinerary prediction. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1575–1583.
Norris (1997). Markov chains. Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press.
Paszke et al. (2017). Automatic differentiation in PyTorch.
Ragain and Ugander (2016). Pairwise choice Markov chains. In Advances in Neural Information Processing Systems, pp. 3198–3206.
Thurstone (1927). A law of comparative judgment. Psychological Review 34(4), p. 273.