PCMC-Net: Feature-based Pairwise Choice Markov Chains

09/25/2019 · Alix Lhéritier et al. · Amadeus IT Group SA

Pairwise Choice Markov Chains (PCMC) have been recently introduced to overcome limitations of choice models based on traditional axioms, which are unable to express empirical observations from modern behavioral economics such as framing effects and asymmetric dominance. The inference approach that estimates the transition rates between each possible pair of alternatives via maximum likelihood suffers when the examples of each alternative are scarce, and it is inappropriate when new alternatives can be observed at test time. In this work, we propose an amortized inference approach for PCMC by embedding its definition into a neural network that represents transition rates as a function of the alternatives' and the individual's features. We apply our construction to the complex case of airline itinerary booking, where singletons are common (due to varying prices and individual-specific itineraries), and where asymmetric dominance and behaviors strongly dependent on market segments are observed. Experiments show our network significantly outperforming, in terms of prediction accuracy and logarithmic loss, feature-engineered standard and latent-class Multinomial Logit models as well as recent machine learning approaches.


1 Introduction

Choice modeling aims at finding statistical models that capture human behavior when faced with a set of alternatives. Classical examples include consumer purchasing decisions, choices of schooling or employment, and commuter choices for modes of transportation among available options. Traditional models are based on different assumptions about human decision making, e.g. Thurstone's Case V model (Thurstone, 1927) or the Bradley-Terry-Luce (BTL) model (Bradley and Terry, 1952). Nevertheless, in complex scenarios, like online shopping sessions presenting numerous alternatives to user-specific queries, these assumptions are often too restrictive to provide accurate predictions.

Formally, there is a universe of alternatives $U$, possibly infinite. In each choice situation, some finite choice set $S \subseteq U$ is considered. A choice model is a distribution over the alternatives of a given choice set $S$, where the probability of choosing the item $i$ among $S$ is denoted $P(i \mid S)$. These models can be further parameterized by the alternatives' features and by those of the individual making the choice.

An important class of choice models is the Multinomial Logit (MNL), a generalization of the BTL model (defined for pairwise choices only) to larger sets. Any model satisfying Luce's axiom, also known as independence of irrelevant alternatives (Luce, 1959), is equivalent to some MNL model (Luce, 1977). In this class, the probability of choosing some item $i$ from a given set $S$ can be expressed as $P(i \mid S) = v_i / \sum_{j \in S} v_j$, where $v_i$ is the latent value of the item $i$. Luce's axiom implies stochastic transitivity, i.e., if $p_{ij} \geq 1/2$ and $p_{jk} \geq 1/2$, then $p_{ik} \geq \max(p_{ij}, p_{jk})$, where $p_{ij} \equiv P(i \mid \{i, j\})$ (Luce, 1977). Stochastic transitivity implies the necessity of a total order across all elements and also prevents models from expressing cyclic preference situations like the stochastic rock-paper-scissors game described in Section 3.2. Thurstone's Case V model exhibits strict stochastic transitivity but does not satisfy Luce's axiom (Adams and Messick, 1958). Luce's axiom and stochastic transitivity are strong assumptions that often do not hold for empirical choice data (see (Ragain and Ugander, 2016) and references therein). For example, Luce's axiom prevents models from expressing framing effects like asymmetric dominance (or decoy effect), which occurs when adding a third alternative that is dominated by one of the alternatives of a choice set of two increases the preference towards the alternative dominating the decoy (Huber et al., 1982).
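As a minimal illustration of this choice rule (the values below are ours, not from the paper), the MNL probabilities are simply the normalized latent values:

```python
import numpy as np

def mnl_choice_probs(latent_values):
    """MNL choice rule: P(i | S) = v_i / sum_{j in S} v_j for positive latent values v_j."""
    v = np.asarray(latent_values, dtype=float)
    return v / v.sum()

print(mnl_choice_probs([3.0, 2.0, 1.0]))  # -> [0.5, 0.333..., 0.166...], sums to one
```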

A larger class of models is the one of Random Utility Models (RUM) (Block and Marschak, 1960; Manski, 1977), which includes MNL but also other models satisfying neither Luce's axiom nor stochastic transitivity. This class affiliates with each $i \in U$ a random variable $X_i$ and defines for each subset $S \subseteq U$ the probability $P(i \mid S) = \Pr(X_i \geq X_j \ \forall j \in S)$. RUM exhibits regularity, i.e., if $A \subseteq B$ then $P(i \mid A) \geq P(i \mid B)$. Regularity also prevents models from expressing framing effects and asymmetric dominance (Huber et al., 1982). The class of Nested MNL (McFadden, 1980) allows expressing RUM models but also others that do not obey regularity. Nevertheless, inference is practically difficult for Nested MNL models.
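For concreteness, a RUM can be simulated directly from this definition. The sketch below uses hypothetical Gaussian utilities (a Thurstonian choice model); the numbers are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([1.0, 0.8, 0.2])                  # one latent utility per alternative in S
samples = rng.normal(means, 1.0, size=(100_000, 3))
choices = samples.argmax(axis=1)                   # the alternative with maximal utility is chosen
probs = np.bincount(choices, minlength=3) / len(choices)
print(probs)                                       # empirical estimate of P(i | S) for S = {0, 1, 2}
```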

Recently, a more flexible class of models called Pairwise Choice Markov Chains (PCMC) has been introduced in Ragain and Ugander (2016). This class includes MNL but also other models that satisfy neither Luce's axiom, nor stochastic transitivity, nor regularity. This class defines the choice distribution as the stationary distribution of a continuous time Markov chain defined by some transition rate matrix. Still, it satisfies a weakened version of Luce's axiom called uniform expansion, which states that if we add "copies" of an alternative (with no preference between them), the total probability of choosing one of the copies is invariant to the number of copies. Although the flexibility of this class is appealing, the proposed inference is based on maximizing the likelihood of the rate matrix for the observed choices, which is prone to overfitting when the number of observations for each possible alternative is small and is inappropriate when new alternatives can be seen at test time.

Alternatives and the individuals making choices can be described by sets of features, which can then be used to understand their impact on the choice probability. A linear-in-parameters MNL assumes that the latent value $v_i$ is given by a linear combination of the features of the alternative and of the individual. Features of the individual can be taken into account by these models, but inference suffers when observations are scarce and is inappropriate when new alternatives can be seen at test time. The latent class MNL (LC-MNL) model (Greene and Hensher, 2003) takes into account individual heterogeneity by using a Bayesian mixture over different latent classes (whose number must be specified) in which homogeneity and linearity are assumed. A linear-in-features parameterization for PCMC was suggested in (Ragain and Ugander, 2016, Appendix), but it still requires building a rate matrix over the alternatives at training time, which makes it inappropriate for large universes. In complex cases like airline itinerary choice, where the alternatives strongly depend on an individual-specific query and some features, like price, can be dynamic, the previous approaches have limited expressive power or are inappropriate.

Two recently introduced methods allow complex feature handling for alternatives and individuals. Mottini and Acuna-Agost (2017) propose a recurrent neural network method that learns to point, within a sequence of alternatives, to the chosen one. This model is appealing because of its feature learning capability, but neither its choice-theoretic properties nor its dependence on the order of the sequence have been studied.

Lhéritier et al. (2019) propose to train a Random Forest classifier to predict whether an alternative is going to be chosen or not, independently of the rest of the alternatives of the choice set. This approach does not take into account the fact that in each choice set exactly one alternative is chosen. For this reason, the probabilities provided by the model are only used as scores to rank the alternatives, which can be interpreted as latent values, making it essentially equivalent to a non-linear MNL. To escape this limitation and make the latent values dependent on the session, relative features are added (e.g., the price of the $i$-th alternative is expressed relative to the prices of the other alternatives in the session). The non-parametric nature of this model is appealing, but its choice-theoretic properties have not been studied either.

In this work, we propose to equip PCMC with neural-network-based feature handling, thereby enjoying both the good theoretical properties of PCMC and the complex feature handling of the aforementioned neural network based and non-parametric methods. This neural network parameterization of PCMC amortizes inference, allowing it to handle large (and even infinite) universes, as shown in our experiments on airline itinerary choice modeling.

2 Background: Pairwise Choice Markov Chains

2.1 Definition

A Pairwise Choice Markov Chain (PCMC) (Ragain and Ugander, 2016) defines the choice probability $P(i \mid S)$ as the probability mass on alternative $i$ of the stationary distribution of a continuous time Markov chain (CTMC) whose set of states corresponds to $S$. The model's parameters are the off-diagonal entries $q_{ij} \geq 0$ of a rate matrix $Q$ indexed by pairs of elements in $U$. Given a choice set $S \subseteq U$, the choice distribution $\pi_S$ is the stationary distribution of the continuous time Markov chain given by the matrix $Q_S$ obtained by restricting the rows and columns of $Q$ to the elements of $S$ and setting $q_{ii} = -\sum_{j \in S \setminus \{i\}} q_{ij}$ for each $i \in S$. Therefore, the distribution $\pi_S$ is parameterized by the transition rates of $Q$.

The constraint

$q_{ij} + q_{ji} > 0 \quad \text{for all } i \neq j \in U$   (1)

is imposed in order to guarantee that the chain has a single closed communicating class, which implies the existence and the unicity of the stationary distribution (see, e.g., Norris (1997)), obtained by solving

$\pi_S Q_S = \mathbf{0}, \qquad \pi_S \mathbf{1}^T = 1$   (2)

where $\mathbf{0}$ and $\mathbf{1}$ are row vectors of zeros and ones, respectively. Since any column of $Q_S$ is the opposite of the sum of the rest of the columns, it is equivalent to solve

$\pi_S \tilde{Q}_S = (0, \dots, 0, 1)$   (3)

where $\tilde{Q}_S$ is obtained from $Q_S$ by replacing its last column with ones.
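The following sketch (ours, not the authors' code) computes a PCMC choice distribution exactly as defined above: restrict the rate matrix to the choice set, set the diagonal, and solve the linear system of Eq. 3.

```python
import numpy as np

def pcmc_choice_probs(Q, S):
    """Choice probabilities of a PCMC model: Q is the matrix of non-negative transition
    rates over the universe (zero diagonal), S is a list of indices forming the choice set."""
    Q_S = Q[np.ix_(S, S)].astype(float)    # restrict rows and columns to S
    Q_S -= np.diag(Q_S.sum(axis=1))        # set q_ii = -sum_{j != i} q_ij
    Q_S[:, -1] = 1.0                       # replace the last column by ones, as in Eq. 3
    rhs = np.zeros(len(S)); rhs[-1] = 1.0
    return np.linalg.solve(Q_S.T, rhs)     # solves pi_S Q~_S = (0, ..., 0, 1)

# Illustrative rates for a universe of three alternatives.
Q = np.array([[0.0, 2.0, 1.0],
              [1.0, 0.0, 3.0],
              [2.0, 1.0, 0.0]])
print(pcmc_choice_probs(Q, [0, 1]))        # pairwise choice probabilities
print(pcmc_choice_probs(Q, [0, 1, 2]))     # choice probabilities over the full set
```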

2.2 Properties

In Ragain and Ugander (2016), it is shown that PCMC models can represent any MNL model, but also models that are non-regular and that do not satisfy stochastic transitivity (using the rock-paper-scissors example of Section 3.2).

In addition, they show that PCMC models feature a property termed contractibility, which intuitively means that we can "contract" subsets of alternatives to a single "type" when the probability of choosing an element of a subset is independent of the pairwise probabilities between elements within the subsets. Formally, a partition $\{A_1, \dots, A_k\}$ of $U$ into non-empty sets is a contractible partition if $q_{ij} = \lambda_{mn}$ for all $i \in A_m, j \in A_n$, for some $\lambda_{mn} > 0$, for $m \neq n$. Then, the following proposition is shown.

Proposition 1 (Ragain and Ugander (2016)).

For a given $S \subseteq U$, let $\{A_1, \dots, A_k\}$ be a contractible partition for two PCMC models on $U$ represented by rate matrices $Q$ and $Q'$ with stationary distributions $\pi_S$ and $\pi'_S$. Then, for any $m \in \{1, \dots, k\}$, $\sum_{i \in A_m \cap S} \pi_S(i) = \sum_{i \in A_m \cap S} \pi'_S(i)$.

It is then shown that contractibility implies uniform expansion, formally defined as follows.

Definition 1 (Uniform Expansion).

Consider a choice between $n$ elements in a set $S(1) = \{i_{11}, \dots, i_{n1}\}$ and another choice from a set $S(k)$ containing $k$ copies of each of the $n$ elements: $S(k) = \{i_{11}, \dots, i_{1k}, \dots, i_{n1}, \dots, i_{nk}\}$. The axiom of uniform expansion states that $P(i_{m1} \mid S(1)) = \sum_{j=1}^{k} P(i_{mj} \mid S(k))$ for each $m \in \{1, \dots, n\}$ and all $k \geq 1$.
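Uniform expansion can be checked numerically for a PCMC model. The sketch below (hypothetical rates; the solver is the same as in the Section 2.1 sketch) duplicates every element of a two-element choice set, with an arbitrary symmetric rate between copies, and verifies that the total mass of the copies of the first element equals the original choice probability:

```python
import numpy as np

def pcmc_choice_probs(Q, S):
    """Same helper as in the Section 2.1 sketch (Eqs. 2-3)."""
    Q_S = Q[np.ix_(S, S)].astype(float)
    Q_S -= np.diag(Q_S.sum(axis=1))
    Q_S[:, -1] = 1.0
    rhs = np.zeros(len(S)); rhs[-1] = 1.0
    return np.linalg.solve(Q_S.T, rhs)

q_ab, q_ba, gamma = 2.0, 1.0, 5.0              # gamma: rate between copies (any positive value)
Q1 = np.array([[0.0,  q_ab],
               [q_ba, 0.0]])                   # S(1) = {a, b}
Q2 = np.array([[0.0,   gamma, q_ab,  q_ab],    # S(2) = {a1, a2, b1, b2}
               [gamma, 0.0,   q_ab,  q_ab],
               [q_ba,  q_ba,  0.0,   gamma],
               [q_ba,  q_ba,  gamma, 0.0]])
p1 = pcmc_choice_probs(Q1, [0, 1])
p2 = pcmc_choice_probs(Q2, [0, 1, 2, 3])
print(p1[0], p2[0] + p2[1])                    # both equal P(a | S(1)) = 1/3
```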

2.3 Inference

Given a dataset $\mathcal{D}$ of observed choices $(i, S)$ with $i \in S$, the inference method proposed in Ragain and Ugander (2016) consists in maximizing the log-likelihood of the rate matrix $Q$ indexed by the alternatives appearing in $\mathcal{D}$:

$\log L(Q; \mathcal{D}) = \sum_{S} \sum_{i \in S} C_{iS} \log \pi_S(i; Q)$   (4)

where $\pi_S(i; Q)$ denotes the probability that $i$ is selected from $S$ as a function of $Q$, and $C_{iS}$ denotes the number of times in the data that $i$ was chosen out of set $S$.

This optimization is difficult since there is no general closed form expression for $\pi_S(i; Q)$, and the implicit definition also makes it difficult to derive gradients of $\log L$ with respect to the parameters $q_{ij}$. The authors propose to use Sequential Least Squares Programming (SLSQP) to maximize $\log L$, which is nonconcave in general. However, in their experiments, they encounter numerical instabilities leading to violations of the constraint of Eq. 1 (i.e., $q_{ij} + q_{ji} = 0$), which were solved with additive smoothing at the cost of some efficacy of the model. In addition, when the examples of each alternative are scarce, as in the application of Section 4, this inference approach is prone to severe overfitting and is inappropriate for predicting unseen alternatives. These two drawbacks motivate the amortized inference approach we introduce next.
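For illustration, the objective of Eq. 4 can be written directly in terms of the stationary distributions (a sketch with hypothetical counts, using the same solver as in the Section 2.1 sketch; the original approach then maximizes this quantity with SLSQP):

```python
import numpy as np

def pcmc_choice_probs(Q, S):
    """Same helper as in the Section 2.1 sketch (Eqs. 2-3)."""
    Q_S = Q[np.ix_(S, S)].astype(float)
    Q_S -= np.diag(Q_S.sum(axis=1))
    Q_S[:, -1] = 1.0
    rhs = np.zeros(len(S)); rhs[-1] = 1.0
    return np.linalg.solve(Q_S.T, rhs)

def pcmc_log_likelihood(Q, counts):
    """counts: dict {(i, S): C_iS} with S a tuple of indices into Q and i the chosen index."""
    return sum(c * np.log(pcmc_choice_probs(Q, list(S))[S.index(i)])
               for (i, S), c in counts.items())

Q = np.array([[0.0, 2.0, 1.0],
              [1.0, 0.0, 3.0],
              [2.0, 1.0, 0.0]])
print(pcmc_log_likelihood(Q, {(0, (0, 1)): 3, (2, (0, 1, 2)): 5}))
```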

3 PCMC-Net

We propose an amortized inference approach for PCMC based on a neural network architecture called PCMC-Net that uses the alternatives’ and the individual’s features to determine the transition rates and can be trained using standard stochastic gradient descent techniques.

3.1 Architecture

Input layer

Let $x_i \in \mathcal{F}_A$ be the tuple of features of the $i$-th alternative of the choice set, belonging to a given feature space $\mathcal{F}_A$, and let $y \in \mathcal{F}_I$ be the tuple of the individual's features, belonging to a given feature space $\mathcal{F}_I$. The individual's features are allowed to be an empty tuple.

Representation layer

The first layer is composed of a representation function for the alternatives' features

$\rho_A(\,\cdot\,; \Theta_A): \mathcal{F}_A \to \mathbb{R}^{d_A}$   (5)

and a representation function for the individual's features

$\rho_I(\,\cdot\,; \Theta_I): \mathcal{F}_I \to \mathbb{R}^{d_I}$   (6)

where $\Theta_A$ and $\Theta_I$ are the sets of weights parameterizing them and $d_A, d_I \in \mathbb{N}$ are hyperparameters. These functions can include, e.g., embedding layers for categorical variables, a convolutional network for images or text, etc., depending on the inputs' types.

Cartesian product layer

In order to build the transition rate matrix, all the pairs of different alternatives need to be considered. This is accomplished by computing the cartesian product

$\{(\rho_A(x_i), \rho_A(x_j)) : i, j \in \{1, \dots, |S|\},\ i \neq j\}$   (7)

The combinations of embedded alternatives are concatenated together with the embedded features of the individual, i.e.

$z_{ij} = \rho_A(x_i) \oplus \rho_A(x_j) \oplus \rho_I(y)$   (8)

where $\oplus$ denotes vector concatenation.

Transition rate layer

The core component is a model of the transition rate $q_{ij}$ from alternative $i$ to alternative $j$:

$q_{ij} = \max(0, f(z_{ij}; \Theta_f)) + \epsilon$   (9)

where $f$ consists of multiple fully connected layers parameterized by a set of weights $\Theta_f$, and $\epsilon > 0$ is a hyperparameter. Notice that taking the maximum with 0 and adding $\epsilon$ guarantees non-negativity and the condition of Eq. 1. The transition rate matrix is then obtained as follows:

$Q_S = [q_{ij}]_{i, j \in \{1, \dots, |S|\}} \quad \text{with} \quad q_{ii} = -\sum_{j \neq i} q_{ij}$   (10)

Stationary distribution layer

The choice probabilities correspond to the stationary distribution $\pi_S$, which is guaranteed to exist and be unique by the condition of Eq. 1 and can be obtained by solving the system

$\pi_S \tilde{Q}_S = (0, \dots, 0, 1)$   (11)

with $\tilde{Q}_S$ defined as in Eq. 3, by, e.g., partially-pivoted LU decomposition, which can be differentiated with automatic differentiation.

The whole network is represented in Fig. 1.

Figure 1: PCMC-Net. $\times$ denotes the cartesian product and $\oplus$ vector concatenation.
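To make the construction concrete, the following is a minimal PyTorch sketch of a PCMC-Net forward pass for a single session. It is an illustrative simplification rather than the authors' implementation: all features are assumed to be numerical (so the representation functions are plain fully connected layers), the module names, dimensions and value of $\epsilon$ are arbitrary, and torch.linalg.solve plays the role of the LU based solver mentioned in Section 4.2.

```python
import torch
import torch.nn as nn

class PCMCNet(nn.Module):
    """Illustrative PCMC-Net for purely numerical features (names and sizes are arbitrary)."""
    def __init__(self, alt_dim, ind_dim, d_alt=16, d_ind=8, hidden=64, eps=0.1):
        super().__init__()
        self.eps = eps
        # Representation functions rho_A and rho_I (Eqs. 5-6), here plain fully connected layers.
        self.rho_alt = nn.Sequential(nn.Linear(alt_dim, d_alt), nn.LeakyReLU())
        self.rho_ind = nn.Sequential(nn.Linear(ind_dim, d_ind), nn.LeakyReLU())
        # Transition rate network f (Eq. 9), applied to a pair of alternative
        # representations concatenated with the individual representation.
        self.f = nn.Sequential(nn.Linear(2 * d_alt + d_ind, hidden), nn.LeakyReLU(),
                               nn.Linear(hidden, 1))

    def forward(self, alts, ind):
        # alts: (n, alt_dim) features of the n alternatives of one session; ind: (ind_dim,)
        n = alts.shape[0]
        r = self.rho_alt(alts)                           # (n, d_alt)
        y = self.rho_ind(ind).expand(n, n, -1)           # individual representation, broadcast
        # Cartesian product of alternative representations and concatenation (Eqs. 7-8).
        ri = r.unsqueeze(1).expand(n, n, -1)
        rj = r.unsqueeze(0).expand(n, n, -1)
        z = torch.cat([ri, rj, y], dim=-1)               # (n, n, 2*d_alt + d_ind)
        # Transition rates q_ij = max(0, f(z_ij)) + eps (Eq. 9), diagonal set as in Eq. 10.
        q = torch.clamp(self.f(z).squeeze(-1), min=0.0) + self.eps
        q = q * (1.0 - torch.eye(n))
        Q = q - torch.diag(q.sum(dim=1))
        # Stationary distribution (Eq. 11): replace the last column by ones and solve.
        Q_tilde = torch.cat([Q[:, :-1], torch.ones(n, 1)], dim=1)
        rhs = torch.zeros(n); rhs[-1] = 1.0
        return torch.linalg.solve(Q_tilde.transpose(0, 1), rhs)   # choice probabilities
```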

3.2 Properties

Non-regularity

As shown in Ragain and Ugander (2016), non-regular models can be obtained with certain rate matrices. For example, the stochastic rock-paper-scissors game can be described by a non-regular model obtained with a transition rate matrix of the following form, with $a > 2b > 0$:

$Q = \begin{pmatrix} \cdot & a & b \\ b & \cdot & a \\ a & b & \cdot \end{pmatrix}$   (12)

where the rows and columns are indexed by rock, paper and scissors, and the diagonal entries are set as in Eq. 10 on each choice set.

PCMC-Net can represent such a model with the following design choices. In this case, the individual's features correspond to an empty tuple, yielding an empty vector as representation. By setting $\rho_A$ to a one-hot representation of the alternative (thus $d_A = 3$), a fully connected network $f$ consisting of one neuron (i.e., six coefficients and one bias) is enough to represent this matrix, since only six combinations of inputs are of interest.
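A quick numerical check of non-regularity (illustrative values $a = 0.8$, $b = 0.2$, with the rate convention above and the same solver as in the Section 2.1 sketch): adding scissors to {rock, paper} increases the probability of choosing rock, which regularity forbids.

```python
import numpy as np

def pcmc_choice_probs(Q, S):
    """Same helper as in the Section 2.1 sketch (Eqs. 2-3)."""
    Q_S = Q[np.ix_(S, S)].astype(float)
    Q_S -= np.diag(Q_S.sum(axis=1))
    Q_S[:, -1] = 1.0
    rhs = np.zeros(len(S)); rhs[-1] = 1.0
    return np.linalg.solve(Q_S.T, rhs)

a, b = 0.8, 0.2                                    # illustrative values with a > 2b
Q_rps = np.array([[0.0, a,   b  ],                 # rows/columns ordered as rock, paper, scissors
                  [b,   0.0, a  ],
                  [a,   b,   0.0]])
print(pcmc_choice_probs(Q_rps, [0, 1]))            # P(rock | {rock, paper}) = b/(a+b) = 0.2
print(pcmc_choice_probs(Q_rps, [0, 1, 2]))         # P(rock | {rock, paper, scissors}) = 1/3 > 0.2
```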

Non-parametric limit

More generally, the following theorem shows that any PCMC model can be arbitrarily well approximated by PCMC-Net.

Theorem 1.

If $\rho_A$ and $f$ are given enough capacity, PCMC-Net can approximate any PCMC model arbitrarily well.

Proof.

PCMC-Net forces the transition rates to be at least $\epsilon$, whereas the PCMC definition allows any $q_{ij} \geq 0$ as long as $q_{ij} + q_{ji} > 0$. Since multiplying all the entries of a rate matrix by some constant $c > 0$ does not affect the stationary distribution of the corresponding CTMC, let us consider, without loss of generality, an arbitrary PCMC model given by a transition rate matrix $Q$ whose off-diagonal entries are either at least $\epsilon$ or zero. Let $\pi$ be its stationary distribution. Then, let us consider the matrix $Q_c$ obtained by replacing the null entries of $Q$ by $\epsilon$ and by multiplying the non-null entries by some $c \geq 1$, and let $\pi_c$ be its stationary distribution. Since, by Cramer's rule, the entries of the stationary distribution are continuous functions of the entries of the rate matrix, for any $\delta > 0$ there exists $c$ such that $\|\pi - \pi_c\|_\infty < \delta$.

Since deep neural networks are universal function approximators (Hornik et al., 1989), PCMC-Net can represent any such $Q_c$ arbitrarily well if enough capacity is given to the network, which completes the proof. ∎
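The scaling argument in the proof can be checked numerically: replacing the zero rates of a matrix $Q$ by $\epsilon$ and multiplying its non-zero rates by a growing constant $c$ drives the stationary distribution back towards that of $Q$. The sketch below (illustrative values, same solver as in the Section 2.1 sketch) shows the approximation error shrinking as $c$ grows.

```python
import numpy as np

def pcmc_choice_probs(Q, S):
    """Same helper as in the Section 2.1 sketch (Eqs. 2-3)."""
    Q_S = Q[np.ix_(S, S)].astype(float)
    Q_S -= np.diag(Q_S.sum(axis=1))
    Q_S[:, -1] = 1.0
    rhs = np.zeros(len(S)); rhs[-1] = 1.0
    return np.linalg.solve(Q_S.T, rhs)

eps = 0.1
Q = np.array([[0.0, 1.0, 0.0],                     # some zero rates, but q_ij + q_ji > 0 for all pairs
              [0.0, 0.0, 2.0],
              [3.0, 0.0, 0.0]])
S = [0, 1, 2]
target = pcmc_choice_probs(Q, S)                   # stationary distribution of the original model
for c in (1, 10, 100, 1000):
    Qc = np.where(Q > 0, c * Q, eps)               # all rates are now >= eps, as produced by Eq. 9
    np.fill_diagonal(Qc, 0.0)
    print(c, np.abs(pcmc_choice_probs(Qc, S) - target).max())   # error shrinks as c grows
```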

Contractibility

Let $Q$ and $Q'$ be the rate matrices obtained after the transition rate layer of two different PCMC-Nets on a finite universe of alternatives $U$. Then, Proposition 1 can be applied. Regarding uniform expansion, when copies of an alternative are added to a choice set, their transition rates to and from the other elements of the choice set are identical, since they depend only on the features. Therefore, PCMC-Net allows uniform expansion.

3.3 Inference

The logarithmic loss is used to assess the predicted choice distribution $\pi_S(\,\cdot\,; \Theta)$, given by the model parameterized by $\Theta = (\Theta_A, \Theta_I, \Theta_f)$ on the input $(x_1, \dots, x_{|S|}, y)$, against the index of the actual choice, denoted $i^*$, i.e.

$\ell(\Theta) = -\log \pi_S(i^*; \Theta)$   (13)

Training can be performed using standard stochastic gradient descent techniques with dropout to avoid overfitting and, unlike the original inference approach, it is numerically stable.
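A training step with this loss is standard; the sketch below assumes a model such as the PCMCNet module sketched in Section 3.1, which returns the choice probabilities of one session.

```python
import torch

def training_step(model, optimizer, alts, ind, chosen_idx):
    """One SGD step on a single session; `model` is assumed to behave like the PCMCNet
    sketch of Section 3.1, i.e., it returns the session's choice probabilities."""
    optimizer.zero_grad()
    pi = model(alts, ind)               # stationary distribution = predicted choice probabilities
    loss = -torch.log(pi[chosen_idx])   # logarithmic loss of Eq. 13
    loss.backward()                     # gradients flow through the linear solve
    optimizer.step()
    return loss.item()
```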

4 Experiments on airline itinerary choice modeling

In this section, we instantiate PCMC-Net for the case of airline itinerary choice modeling. As shown in Babutsidze et al. (2019), this kind of data often exhibits asymmetric dominance, calling for more flexible models such as PCMC. Nevertheless, in the considered dataset, alternatives rarely repeat themselves, which makes the original inference approach for PCMC inappropriate.

4.1 Dataset

We used the dataset from Mottini and Acuna-Agost (2017), consisting of flight booking sessions on a set of European origins and destinations. Each booking session contains up to 50 different itineraries, one of which has been booked by the customer. There are 815559 distinct alternatives, among which 84% are singletons and 99% are observed at most seven times. In total, there are 33951 choice sessions, of which 27160 were used for training and 6791 for testing. The dataset has a total of 13 features, both numerical and categorical, describing individuals and alternatives (see Table 1).

Type         Kind          Feature                                     Range/Cardinality
Individual   Categorical   Origin/Destination                          97
                           Search Office                                11
             Numerical     Departure weekday                            [0, 6]
                           Stay Saturday                                [0, 1]
                           Continental Trip                             [0, 1]
                           Domestic Trip                                [0, 1]
                           Days to departure                            [0, 343]
Alternative  Categorical   Airline (of first flight)                    63
             Numerical     Price                                        [77.15, 16781.5]
                           Stay duration (minutes)                      [121, 434000]
                           Trip duration (minutes)                      [105, 4314]
                           Number of connections                        [2, 6]
                           Number of airlines                           [1, 4]
                           Outbound departure time (s from midnight)    [0, 84000]
                           Outbound arrival time (s from midnight)      [0, 84000]
Table 1: Features of the airline itinerary choice dataset. Categorical features report their cardinality; numerical features report their range.

4.2 Instantiation of PCMC-Net

PCMC-Net was implemented in PyTorch (Paszke et al., 2017). During training, a mini-batch is composed of a number of sessions whose numbers of alternatives can vary. Dynamic computation graphs are required in order to adapt to the varying session size. Stochastic gradient optimization is performed with Adam (Kingma and Ba, 2015).

In our experiments, numerical variables are unidimensional and thus are not embedded. They were standardized during a preprocessing step. Each categorical input of cardinality $c$ is passed through an embedding layer whose output dimension is set from $c$ by the usual rule of thumb.

We maximize regularization by using a dropout probability of 0.5 (see, e.g., Baldi and Sadowski (2013)). The additive constant $\epsilon$ of Eq. 9 was set to a fixed value. The linear solver was implemented with torch.solve, which uses LU decomposition. Table 2 shows the hyperparameters and learning parameters that were optimized by performing 25 iterations of Bayesian optimization (using GPyOpt (The GPyOpt authors, 2016)). Early stopping is performed during training if no significant improvement with respect to the best log loss obtained so far is made on a validation set (a random sample consisting of 10% of the choice sessions from the training set) during 5 epochs.

parameter                  range                                  best value
learning rate                                                     0.001
batch size (in sessions)                                          16
hidden layers in f                                                2
nodes per layer in f                                              512
activation                 {ReLU, Sigmoid, Tanh, LeakyReLU}       LeakyReLU
Table 2: Hyper- and learning parameters optimized with Bayesian optimization.

Using the hyperparameter values returned by the Bayesian optimization procedure and the number of epochs at which early stopping occurred (66), the final model is obtained by training on the union of the training and validation sets.

4.3 Results

We compare the performance of the PCMC-Net instantiation against three simple baselines:

  • Uniform: probabilities are assigned uniformly to each alternative.

  • Cheapest: alternatives are ranked by increasing price. This method is non-probabilistic.

  • Shortest: alternatives are ranked by increasing trip duration. This method is also non-probabilistic.

We also compare against the results presented in Lhéritier et al. (2019):

  • Multinomial Logit (MNL): choice probabilities are determined from the alternatives’ features only, using some feature transformations to improve the performance.

  • Latent Class Multinomial Logit (LC-MNL): in addition to the alternatives' features, it uses the individual's features to model the probability of belonging to latent classes, whose number is determined using the Akaike Information Criterion. Feature transformations are also used to improve the performance.

  • Random Forest (RF): a classifier is trained on the alternatives as if they were independent, considering both the individual's and the alternatives' features and using as label whether each alternative was chosen or not. Some of the alternatives' features are transformed to make them relative to the values of each choice set. Since the classifier evaluates each alternative independently, the probabilities within a given session generally do not sum to one, and are therefore only interpreted as scores to rank the alternatives.

Finally, we compare to:

  • Deep Pointer Networks (DPN) (Mottini and Acuna-Agost, 2017): a recurrent neural network that uses both the features of the individual and those of the alternatives to learn to point to the chosen alternative from the choice sets given as sequences. The results are dependent on the order of the alternatives, which was taken as in the original paper, that is, as they were shown to the user.

We compute the following performance measures on the test set:

  • TOP-N accuracy: proportion of choice sessions in which the actual choice is among the top N ranked alternatives. Ties are broken at random. We consider N ∈ {1, 5}.

  • Normalized Log Loss (NLL): given a probabilistic choice model $\hat{P}$, $\mathrm{NLL} = -\frac{1}{|\mathcal{D}_{\mathrm{test}}|} \sum_{(S, i^*) \in \mathcal{D}_{\mathrm{test}}} \log \hat{P}(i^* \mid S)$, where $\mathcal{D}_{\mathrm{test}}$ is the set of test sessions. A sketch of both measures is given below.
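Both measures can be computed per session, as in the following sketch (an assumed implementation, not the evaluation code of the paper):

```python
import numpy as np

def top_n_accuracy(session_probs, session_choices, n=1):
    """session_probs: per-session vectors of probabilities (or scores);
    session_choices: index of the actual choice in each session."""
    hits = 0
    for probs, chosen in zip(session_probs, session_choices):
        probs = np.asarray(probs)
        idx = np.random.permutation(len(probs))               # random tie-breaking
        order = idx[np.argsort(-probs[idx], kind="stable")]   # indices ranked by decreasing score
        hits += chosen in order[:n]
    return hits / len(session_probs)

def normalized_log_loss(session_probs, session_choices):
    return -np.mean([np.log(p[c]) for p, c in zip(session_probs, session_choices)])
```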

Table 3 shows that PCMC-Net outperforms all the contenders in all the considered metrics. It achieves a 21.3% increase in TOP-1 accuracy and a 12.8% decrease in NLL with respect to the best contender for each metric. In particular, we observe that the best contenders in TOP-N accuracy are LC-MNL and RF, both of which require manual feature engineering to achieve such performance, whereas PCMC-Net learns its representations automatically. We also observe that our results are significantly better than those obtained with the previous deep learning approach DPN, showing the importance of the PCMC definition in our deep learning approach for modeling the complex behaviors observed in airline itinerary choice data.

method      TOP-1   TOP-5   NLL
Uniform     .063    .255    3.24
Cheapest    .164    .471    –
Shortest    .154    .472    –
MNL*        .224    .624    2.44
LC-MNL*     .271    .672    2.33
RF*         .273    .674    –
DPN         .257    .665    2.33
PCMC-Net    .331    .745    2.03
Table 3: Results on airline itinerary choice prediction. * indicates cases with feature engineering.

5 Conclusions

We proposed PCMC-Net, a generic neural network architecture that equips PCMC choice models with amortized, automatic-differentiation-based inference using the alternatives' features. As a side benefit, the construction allows conditioning the choice probabilities on the individual's features. We showed that PCMC-Net is able to approximate any PCMC model arbitrarily well and thus maintains the flexibility of the class (e.g., the ability to represent non-regular models) and the desired property of uniform expansion. Being neural network based, PCMC-Net allows complex feature handling, as previous machine learning and deep learning based approaches do, but with additional theoretical guarantees.

We proposed a practical implementation showing the benefits of the construction on the challenging problem of airline itinerary choice prediction, where asymmetric dominance effects are often observed and where alternatives rarely appear more than once, making the original inference approach for PCMC inappropriate.

As future work, we foresee investigating the application of PCMC-Net to data with complex features (e.g., images, text, graphs) to assess the impact of such information on preferences and choice.

References

  • E. Adams and S. Messick (1958) An axiomatic formulation and generalization of successive intervals scaling. Psychometrika 23 (4), pp. 355–368.
  • The GPyOpt authors (2016) GPyOpt: a Bayesian optimization framework in Python. http://github.com/SheffieldML/GPyOpt
  • Z. Babutsidze, W. Rand, E. Mirzayev, I. Rafai, N. Hanaki, T. Delahaye, and R. Acuna-Agost (2019) Asymmetric dominance in airfare choice. In 6th International Conference of Choice Modelling.
  • P. Baldi and P. J. Sadowski (2013) Understanding dropout. In Advances in Neural Information Processing Systems, pp. 2814–2822.
  • H. D. Block and J. Marschak (1960) Random orderings and stochastic theories of response. Contributions to Probability and Statistics 2, pp. 97–132.
  • R. A. Bradley and M. E. Terry (1952) Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika 39 (3/4), pp. 324–345.
  • W. H. Greene and D. A. Hensher (2003) A latent class model for discrete choice analysis: contrasts with mixed logit. Transportation Research Part B: Methodological 37 (8), pp. 681–698.
  • K. Hornik, M. Stinchcombe, and H. White (1989) Multilayer feedforward networks are universal approximators. Neural Networks 2 (5), pp. 359–366.
  • J. Huber, J. W. Payne, and C. Puto (1982) Adding asymmetrically dominated alternatives: violations of regularity and the similarity hypothesis. Journal of Consumer Research 9 (1), pp. 90–98.
  • D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA.
  • A. Lhéritier, M. Bocamazo, T. Delahaye, and R. Acuna-Agost (2019) Airline itinerary choice modeling using machine learning. Journal of Choice Modelling 31, pp. 198–209.
  • R. D. Luce (1977) The choice axiom after twenty years. Journal of Mathematical Psychology 15 (3), pp. 215–233.
  • R. D. Luce (1959) Individual choice behavior: a theoretical analysis. Wiley, New York, NY, USA.
  • C. F. Manski (1977) The structure of random utility models. Theory and Decision 8 (3), pp. 229–254.
  • D. McFadden (1980) Econometric models for probabilistic choice among products. Journal of Business, pp. S13–S29.
  • A. Mottini and R. Acuna-Agost (2017) Deep choice model using pointer networks for airline itinerary prediction. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1575–1583.
  • J. R. Norris (1997) Markov chains. Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press.
  • A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch.
  • S. Ragain and J. Ugander (2016) Pairwise choice Markov chains. In Advances in Neural Information Processing Systems, pp. 3198–3206.
  • L. L. Thurstone (1927) A law of comparative judgment. Psychological Review 34 (4), pp. 273.