1 Introduction
The airline industry is evolving. After years of consolidation, carriers are refocusing their competitive advantages on promoting and merchandising the right product to the right traveler at the right time. To gather more valuable travelers’ data and squeeze the distribution structure, carriers are encouraging travelers to purchase through their own website or app. IATA (Harteveldt, 2016) projects purchase through airline mobile app will surge from 2% to 7% while website purchase will increase from 33% to 37% by year 2021. This creates great opportunities for carriers to achieve traveler personalization and increase their travelers’ stickiness as currently rich travelers’ information is retained by distributors. Personalization also means traditional methods of using aggregate data to infer travelers’ preference may no longer be satisfactory and techniques requiring smaller samples from the same individual may be more beneficial.
An airline capturing travelers’ online actions related to promotions such as clickthroughs or bookings sits on a trove of data. These online actions may result in clicked offers, purchased offers and nonclicked offers. Furthermore, the airline can record more information such as visit duration based on an offer, number of repeated clicks based on a different offer, etc. With the collected actions and information, the airline can construct pairwise comparisons each time it promotes. On the next occasion, more attractive promotions can be actively found through an active learning algorithm and pushed to the traveler based on learned preference. One advantage of active learning is a low number of interactions with the traveler in order to learn his or her preferences. This not only improves customer experience but also increases revenue.
We develop an algorithm and model that can actively search for one individual’s most preferred choice based exclusively on pairwise comparisons. We rely on an individual’s preference over different choices as a latent function depending on choices’ features. The probabilistic model takes pairwise comparisons and latent function values as input. We apply pairwise comparison as it is a more effective approach to collect individual’s preference than using a choice set with multiple choices. With a series of pairwise comparisons as the input, we approximate the joint distribution describing the comparisons first and then build a probabilistic model which is different from multiplying each pairwise probability one by one under the traditional assumption of independency. In order to capture discrete choice models our noise is Gumbel distributed and not normal as standardly assumed. This yields the fact that the joint distribution over pairs does not decompose. The goal of the overall algorithm is to minimize the number of queries we request before finding the most preferred choice. One difference between our active learning algorithm and the past works is that we use deep Gaussian process (DGP) priors which have been studied in other supervised tasks (Damianou and Lawrence, 2013; Bui et al., 2016). In the airline industry and elsewhere, choice models are commonly used in stated preference inference. Within that, Gumbel noises are a required assumption and this motivates us to develop models using similar assumptions. In the following sections, we use “instance,” “choice” and “itinerary” interchangeably.
The contributions of this paper are as follows:

To the best of our knowledge, we are the first to use DGPs, not only in conjunction to Gumbel noises but also in preferencebased active learning;

We apply correlated Gumbel noises to active preference learning. To our best knowledge, no prior works have focused on this since normal noise is always assumed. This captures the setting of nested logit models popular in discrete choice utility studies;

Deviating from traditional choice models which use a linear combination between weights and features as the utility, we assume DGP priors to the latent utility;

We develop a way to approximate a preference chain derived from pairwise comparisons which proved to be effective by means of the experiments;

A new acquisition function has been proposed and used in the active learning process.
The rest of the paper is organized as follows. Section 2 provides the literature review and Section 3 introduces the preliminaries needed. Section 4 provides model assumptions, the DGP models, probabilistic formulations, the acquisition function and the active learning algorithm. Section 5 includes experiments on latent functions and an airline itinerary dataset.
2 Literature review
Travel demand and preference studies have previously focused on choice models with wide use in airlines’ revenue management systems such as passenger origindestination simulators (Carrier and Weatherford, 2015). For empirical studies using choice models, we refer to Coldren and Koppelman (2005), Warburg et al. (2006) and Garrow (2010). In these studies, multinomial logit models, nested logit (NL) models, mixed logit models and generalized extreme value (GEV) models are widely used. For details of these models including model assumptions and formulas, we refer to McFadden (1978), BenAkiva and Bierlaire (1999) and Koppelman and Sethi (2000).
We rely on Bayesian optimization which is applicable in situations where a closedform expression of the objective function is not available, but where observations (possibly noisy) of this function at sampled values (Brochu et al., 2010) are readily available. Instead of modeling the utility function as a linear combination between weights and features which is a traditional practice in choice models, we assume utilities over different choices follow a black box function. Different from Bayesian optimization which returns the function value at each step, our model learns hidden values from pairwise preference relations using a probabilistic approach. In this way, the algorithm only learns which choice/point yields a higher utility/function value. Chu and Ghahramani (2005) introduce a nonparametric Bayesian approach to preference learning over instances or labels with the latent function values imposed by GP priors. Brochu et al. (2007) develop an algorithm based on GP which automatically decides what items are best presented to an individual in order to find the item that they value highly in as few trials as possible. The difference between Brochu et al. (2007) and Chu and Ghahramani (2005) is that the latter selects instances from a finite pool while Brochu et al. (2007) maximize a latent function over a continuous space. A random selection method is used in Brochu et al. (2007) to search a continuous space and the results are used as a baseline which is a common benchmark in active learning for regression models. Our work is different from previous works in three ways. Firstly, instead of using probit or ThurstoneMosteller models, our assumptions are based on NL models and this leads to more complex probabilistic formulations but more realistic models. Secondly, besides GP priors which can be considered as simplified DGPs with no hidden hierarchies, we mainly focus on DGP models that in turn yield better performances. Thirdly, the proposed modeling assumptions lead to a unique probability distribution to compare instances. We approximate this distribution and incorporate it in the associated acquisition function within the active learning algorithm.
GP models have been widely studied in regression and classification. Analytic inference, expressivity, integration over hypotheses and closedform predictive distribution are the main useful properties of GPs (Duvenaud, 2014). Besides the works on active preference learning, GP models can also be applied to semisupervised learning. For example, semisupervised GP (Sindhwani et al., 2005) provides a novel prior that exploits the localized spatial structure spanned by the given unlabeled data. For details of GPs and machine learning, we refer to William and Rasmussen (1996).
Empirically, deep models seem to have structural advantages that can improve the quality of learning in complicated datasets associated with abstract information (Bengio, 2009). The GP latent variable model (GPLVM) introduced by Lawrence (2004, 2005) assumes a single hidden layer and treats the hidden layer as latent variables. Titsias and Lawrence (2010) variationally integrate out the latent variables in GPLVM and compute a closedform Jensen’s lower bound on the true log marginal likelihood of the data. The key is to expand the probability space with extra variables which allows for priors on the latent space to be propagated through a nonlinear mapping. These extra variables are called inducing points. The concept of using inducing points was originally proposed to solve GP models in Snelson and Ghahramani (2006) with a large dataset because solving the models involves computing the inverse of an matrix where is the number of instances. The inducing variables are latent function values evaluated at some subset of the dataset and can be sampled from training data (Titsias, 2009).
Compared to GP models or the GPLVM model, DGP models are more general that may provide better prediction performances. DGP models come with deep structures and use GPs as the mappings to connect layers. A single layer DGP model is effectively the GPLVM model (Damianou and Lawrence, 2013). Damianou and Lawrence (2013) apply a variational approach which approximately marginalises out the latent space, thus allowing for automatic structure discovery in DGP models’ hierarchy. However, in the variational approach (Damianou and Lawrence, 2013), the number of variational parameters increases linearly with the number of training data points which hinders the use of this method for largescale datasets (Bui et al., 2016). Bui et al. (2016) then use a novel approximate Expectation Propagation (EP) scheme and a probabilistic backpropagation algorithm for DGP models. Different from the variational approach (Damianou and Lawrence, 2013), the algorithm in Bui et al. (2016) does not retain an explicit approximate distribution over the hidden variables. Our algorithmic approach relies on the active learning framework in Brochu et al. (2007) and the DGP approximation in Bui et al. (2016).
3 Preliminaries
In this section, we start with NL models and then introduce some previous works on active preference learning with probit models, and an approximation in DGP models. Suppose we have a set of choice vectors
with , and a set of pairwise preference relations at step , . The active learning algorithm adds a new pairwise comparison to after completing step . We assume there is an unobservable latent function value for each so the latent values reflect pairwise comparisons in well, i.e. if , then where and are noises. The goal is to find based on .3.1 NL models
NL models belong to the family of GEV models. In GEV models, the unobservable portions of utility for all alternatives are jointly distributed as generalized extreme values. In general, NL models assume that a set of choices can be grouped into a nest. For example, nonstop itineraries can be grouped into one nest while onestop itineraries belong to another nest. If we have a set of unobserved utilities for instances in , then the cumulative joint distribution is
(1) 
In equation (1), is the nest index and refers to the set of all choices within nest . Value can be used to measure the correlation between unobserved utilities among choices within the same nest . If is the utility of choice , then the correlation between and within the same nest can be written as
Based on assumptions on NL models, we next summarize the expressions of and (Dagsvik and Liu, 2006; Oviedo and Yoo, 2016). For , if and are in the same nest then
(2) 
If and are in different nests, we have
(3) 
With regard to , there are 5 cases to consider where is in nest , in nest and in nest :

Case 1:

Case 2:

Case 3: or )

Case 4:

Case 5: .
Corresponding to these cases, we have the probability formulas defined in Table 1.
Table 1. Probability expressions
Case #  Probability expression 

Case 1  
Case 2  
Case 3  
Case 4  
Case 5 
3.2 Active Preference Learning with Probit Models
Here we summarize the approach from Brochu et al. (2007). In active preference learning, the objective is
When using probit models,
is assumed to be normally distributed with variance
. Then the probability of a pairwise comparison can be developed as where is the cumulative density function of the standard normal distribution.The active preference learning approach in Brochu et al. (2007) with probit models assumes GP priors on
. In GPs, the prior probability of latent values
is a multivariate Gaussian distribution with
(4) 
In this distribution, matrix is specified by using Gaussian kernel functions
Each element in refers to .
For simplicity we use instead of . We next outline the approach from Brochu et al. (2007). To compute or approximate the posterior distribution of based on , we use Laplace approximation which carries out the Taylor expansion at the maximum a posteriori (MAP) point and retains the terms up to the second order (MacKay, 1994):
We denote and as the Hessian matrix at the MAP point.
In view of this, we first find hidden values maximizing . According to Bayes we have and thus we can find the MAP point by maximizing the probability (Chu and Ghahramani, 2005; Brochu et al., 2010).
If then
since follows a multivariate Gaussian distribution.
To get the optimal MAP, we compute the gradient of as
Similarly, we have the Hessian matrix defined as
To find , we can use second order optimization methods.
3.3 DGP models and EP
In this outline, we follow Bui et al. (2016). In a DGP model,
with as a Gaussian noise and as GP mappings. As a result, is the same as in GP models while is different. We can compute through a marginal distribution by integrating out as
The marginal distribution has no explicit formula and is intractable since and are treated through different nonlinear GP priors. If is the noise at the hidden layer , we can write
The flow of the entire propagation is presented in Figure 1. We augment the probability space with at the hidden and top layers. Each point has an inducing input, namely . Using the bottom layer as an example, and map through the same GP prior to and .
With the inducing points and , we can rewrite and expand the marginal distribution of as
The EP algorithm provides an approximation for distributions belonging to the exponential family (Seeger, 2007; Bui et al., 2016). Bui et al. (2016) approximate by
(5) 
Here, includes all model parameters and is the number of instances. Vectors , correspond to natural parameters of the approximate posterior to , and prior , respectively. Vector is the natural parameter of the cavity distribution (not corresponding to a data point). Function is a log normalizer of Gaussian.
In addition, we have where with being an approximate data factor. Term can be expanded as
According to the property of conditional Gaussian (Damianou, 2015), we have where
and the form refers to the kernel matrix. Then the term can be computed by marginalizing out using marginalization and conditioning of Gaussian distributions (Damianou, 2015). The term can be defined in the same way as . After marginalizing out the inducing points , Bui et al. (2016) approximate
by marginalizing out the hidden layer where and have explicit formulas. Those formulas involve convolutions of covariance functions with Gaussian distributions. For the general formulas, we refer to Titsias and Lawrence (2010).
4 Models and Algorithms
4.1 Problem Setting
We start by formally defining the discrete choice GP (DCGP) and discrete choice DGP (DCDGP) models considered herein. Let
be the joint cumulative distribution function of a multivariate extreme random variable. Let
be a zero mean Gaussian prior and let be the covariance matrix computed by a predetermined kernel function. The DCGP model readsFor simplicity we use to represent in the DCGP model. In , is the scale parameter for nest in a Gumbel distribution. The difference between DCGP and standard GP is in the noise which is Gumbel distributed instead of Gaussian. This has two consequences: (1) the distribution is much more complex, and (2) the acquisition function should capture this distribution.
In the definition of the DCDGP model, we assume there is one hidden layer and hidden mappings. It is easy to extend the hierarchy to deeper layers. Each hidden mapping follows a GP and maps to . There is also a mapping that transforms to the response layer. Each represents a latent feature vector . Similar to , we write to represent . The network structure is depicted in Figure 2.
The model reads
Random variable is a noise term and is normally distributed. Here we assume that takes as input and takes as input, and thus we have
since we assume are additive independent identically distributed Gaussian noises (Rasmussen and Williams, 2006).
The DCDGP model has the same differentiating factors as GP. We do want to point out that deep GPs have never been applied before for preference learning.
4.2 Algorithm
In both cases, we follow the algorithm from Section 3.2. In this section we focus on aspects that are different: an approximate preference chain to compute , a novel acquisition function, and algorithmic specifications.
4.2.1 DCGP and DCDGP algorithms
In the DCGP model, we use Laplace approximation to approximate as a function of . To get the gradients and the Hessian matrix, we use and as explained in Section 3.2. The probability is developed based on the observed pairwise comparisons where we use a preference chain explained later instead of independent pairwise probabilities to approximate it. The distribution is defined in (4). The model parameter vector in DCGP includes kernel and nest parameters.
In the DCDGP model, Laplace approximation is also used for the distribution and the preference chain behind the probability is formulated identically as in the DCGP model. However, due to the complex structure of deep GP models we use as explained in Section 3.3 to approximate the distribution . In the DCDGP model, we find by maximizing the loglikelihood function . The model parameter vector in DCDGP includes kernel, nest and DGP parameters introduced in (5).
4.2.2 Preference chain
We focus here on estimating
. The challenge is the fact that the distribution does not decompose by pairwise comparison and indeed, it seems impossible to get it exactly.We view as an acyclic graph. We call a preference chain any digraph consisting of a directed path with nodes on the path having at most one other incoming arc with the tail node of such an arc having zero indegree. In layman words, a preference chain has one “main path” and “offsprings” of single arcs. In the overall algorithm, we always maintain a preference chain based on the digraph with respect to . Probability is first approximated by which is further approximated.
All previous work assume pairwise comparisons are independent but we use the preference chain for two reasons: (1) the preference chain has a better representation of an individual’s ordered preference on choices; (2) in our models, Gumbel errors are correlated so the independent assumption does not apply.
Probability needs to be computed in each iteration of the algorithm presented in Section 3.2. There are two updating step types: 1) in a MAP optimization iteration, does not change, but changes, and 2) from to one additional comparison is added. During MAP optimization, all model parameters are updated: , , etc.
At the beginning, we construct the initial based on the initial as follows. From each nest we select a random arc in . The highest utility instance (head of the arc) becomes a candidate of as the main chain. All these instances are nodes in and we find the longest path among them. This becomes the main path in . To it we add offsprings corresponding to the lower utility instances. The only exception is the lower utility instance from the lowest utility node in the main path which is actually added as a main path node (instead of an offspring). This procedure yields the initial preference chain . In Figure 3, both red and black arcs correspond to where the connected red arcs represent the main path in and the dotted arcs stand for offsprings. Note that nest 3 does not have a representative in .
When moving from to , a new pairwise comparison is added. This pairwise comparison is always between an instance not in and the highest utility instance in , i.e. the zero outdegree node in . Based on the comparison in the new pair, the new instance is either added as an offspring to or to the main path in as the new highest utility instance. Clearly this preserves the structure of the preference chain.
We have observed that the initial has to include many instances from different nests as otherwise s do not update well. For this reason we have designed the appropriate initialization strategy outlined above.
Based on Section 3.1, we can derive the probability when the length of a chain is less than or equal to 3. However, the probability of the longest chain for nested logit model is difficult to compute when the length of a chain is greater than 3 using the same method detailed in Dagsvik and Liu (2006). So we further approximate , where is the main path in and are the offsprings.
Probability is decomposed by pair and each individual probability is computed by (2) or (3). For we approximate it by independent triplets (and a possible pair). For triplets we use Table 1. For example,
4.2.3 Acquisition function with probability improvement
In Bayesian optimization, acquisition functions are used to sample the next point. Typically, acquisition functions are defined such that high acquisition corresponds to potentially high values of the objective function, whether because the probability of improvement is high, the expected value is high, the uncertainty is low, or both (Brochu, 2010). The probability of improvement is a popular choice. For example, in a noisefree situation (i.e. no Gumbel terms), we can use
with defined as the index of the next instance to sample, as the index of the current most preferred instance, and .
In both DCGP and DCDGP models, the noise is Gumbel distributed and thus we need to compute the probability .
Let define the logistic distribution with location 0 and scalar . We have where and because is based on GP.
According to Crooks (2009), we can approximate as a reparameterized logistic function
Here if are in different nests and otherwise according to the common nest .
To summarize, we have the probability improvement function or acquisition function defined as
(6) 
4.2.4 Active learning algorithm
The general framework of the active learning algorithm is presented in Algorithms 1 and 2. In Algorithm 1, refers to instances which have not yet been compared and is a set containing instances which have already been compared, i.e. they appear at least once in . Value represents the instance yielding the highest utility up to iteration
is the lower threshold on probability improvement. The algorithm covers a single step of going from to (index is omitted for clarity).Function takes and returns preference chain based on the method outlined in Section 4.2.2. Function based on Section 3.2 returns the Hessian matrix computed at MAP with corresponding , estimated covariance matrix , maximum estimated mean of all alternatives in and model parameters . Function returns the acquisition value. Function returns the true preference comparison between two given choices. In Algorithm 2, the probability improvement is computed. Index refers to the index of in . Function is based on the predictive distribution in Brochu et al. (2007). We use to compute the cumulative density of the normalGumbel convolution based on (6).
5 Experiments
We implement the proposed DCGP and DCDGP models in two different scenarios. The first scenario is to measure capabilities of the algorithm of searching a maximum of some latent functions. The latent functions come from a prior work (Brochu et al., 2007), but we modify them by adding correlated Gumbel noises to the function values. The second scenario applies the models to a public airline itinerary dataset. We implement the algorithms in Python 2.7 and run them on a Linux server with Intel Xeon CPU at 2.2 GHZ and 32 GB memory. In training, to find the hyperparameters in the kernel function and NL parameters , we optimize them together with hidden by using Adam as the optimization algorithm in both DCGP and DCDGP models.
For each experiment, we create 10 different starting scenarios where each scenario starts with different pairwise comparisons. As the algorithm requires computing nest parameters, to ensure the starting scenarios include instances from every nest we use a twophase comparison method. In the first phase, we randomly select two instances from each nest and compare them. In the second phase, we use the preferred instances (i.e. the “winners” in the comparisons) from the first phase and compare them to get a complete order among them. We use two different DCDGP model structures: one model with a one dimension hidden layer named as DCDGP1 and one model with a five dimension hidden layer named as DCDGP5. The results that follow represent overall performance values across these 10 random starting scenarios.
To benchmark the active learning algorithm, we create a random search method. At each query
, it randomly selects an instance from those that have not yet been compared. The random search method applies to the same starting scenarios as our algorithm. To obtain an average performance, we run it 500 times for each scenario. With 500 runs, the standard deviation of performance values is low.
5.1 Latent Function Experiments
5.1.1 Experiment Design
The functions we evaluate are defined as
which come from Brochu et al. (2007).
To attach correlated Gumbel noises to and values, we first discretize the space which also provides an effective reflection to our active learning problem setting (i.e. a finite number of instances). In discretization, we split each dimension equally to obtain a grid. In different latent functions we choose a different grid size. In the 2D function, we use 0.05 as the discretization length so the dataset has points. In the 4D function, we use 0.2 and have 1,296 points. In the 6D function, we use 0.25 which yields 15,625 points.
Now we explain how to allocate each point to different nests. Based on the contour plot of the 2D function, there are 4 local maxima at points (0.15, 0.15), (0.15, 0.65), (0.65, 0.15), (0.65, 0.65). We assume that each local maximum represents a nest. To decide the nest for other points, we use these four centers and assign other grid points based on their Euclidean distances to the centers. In the 2D case, we end up with 4 nests. In the 4D and 6D case, as the functions are symmetric it is easy to infer their local maxima (i.e. the centers). However in the 4D case, this strategy creates a large number of 16 nests which is not typical in NL model studies. We combine them as follows. One center in the 4D case is (0.15, 0.15, 0.15, 0.65). Since the function is symmetric, some permutations of these coordinates are local maxima. We assume that all such permutations are in the same nest. As the functions are symmetric, this strategy guarantees that local maxima with the same value belong to the same nest. In this way, we divide all points into 5 nests in the 4D case. To consolidate nests for the 6D case, we use the same method as in the 4D case to obtain 7 nests.
After space discretization and nest assignment, reasonable noises are generated. In most empirical studies within the transportation industry, the nest parameter , hence we use this fact to derive parameters to generate Gumbel noises. Firstly, we rank the nests based on the objective value of the center. Then we generate a mean value for each as follows. In the 4D function, we have 5 nests, and we assume the mean values are taken sequentially from (0.80, 0.75, 0.70, 0.65, 0.60). The variances are configured based on coefficient of variation . We then use a truncated normal distribution to sample with lower and upper bounds to be . Taking 0.60 as an example, we use .
5.1.2 Results
The results from running the algorithm on the 2D/4D/6D functions are presented in Figures 4 to 6. The xaxis represents the relative gap between the current best found value (i.e. at query step ) and the maximum of the latent function. The yaxis is the number of queries (or additional pairwise comparisons) the algorithm requests. The lower a number of queries is, the better.
In general, the DCDGP models outperform the DCGP model and the random method after the query in the 2D experiment. In contrast, the DCGP model is only better than the random method after the query. In the 4D experiment in Figure 5, we observe a similar trend as in the 2D case. However, in the 6D experiment all three models outperform the benchmark quickly. In summary,

in the 2D experiment, to achieve the same performance of the benchmark at the query, the DCDGP models take around 1012 queries;

in the 4D experiment, to achieve the same performance of the benchmark at the query, the DCDGP models take around 20 queries; and

in the 6D experiment, to achieve the same performance of the benchmark at the query, the DCDGP models take around 1821 queries.
As the dimensionality is increasing, the performances of the DCDGP models are more pronounced. Secondly, when the dimensionality increases and the number of instances increases sharply, the DCGP model has a performance boost. It is not only better than the benchmark but also yields a comparable performance as the DCDGP models within the first 10 queries. This indicates that when the latent function’s complexity is increasing, the DCGP model’s performance increases. We think at low dimensionality, the DCGP model using the proposed acquisition function is easily trapped at a local optimum so its performance is worse than the random method in the 2D and 4D cases. As for DCDGP1 and DCDGP5, when the function’s complexity increases, the more complex model (i.e. DCDGP5) explores a better representation of the hidden function and yields a better result.
In Table 3, we present the computation time comparison in the 6D case for the first 10 queries based on our proposed acquisition function.
The main conclusions are as follows.

There is not much of a difference in the first 5 queries. The random method is worse but not by much.

After 5 queries, DCDGP models drastically outperform the random method.

In a low dimension, DCDGP1 performs better than DCDGP5. As complexity increases, DCDGP5 becomes better.

As the number of queries increases, DCDGP5 is much better.

The DCGP model with the proposed acquisition function does not work well for small problems.
In summary, if a problem is high dimensional with a large number of choices, we recommend DCDGP5. If the dimensionality is low, we should use DCDGP1. If a medium or low number of queries is required, we advise to use DCDGP1, else DCDGP5.
Furthermore, we also test the adaptive upper confidence bound (UCB) (Brochu, 2009) function. This function is defined as
where is the step, is the dimension of (e.g. 6 in the 6D case), and
is a random value drawn from the uniform distribution between 0 and 1. We have results using adaptive UCB compared with our proposed acquisition function in the 2D, 4D and 6D experiments in Appendix 8.1, Figures 9 to 11. From the comparisons, we draw three main conclusions as follows.

The DCGP model with adaptive UCB works better than our proposed acquisition function and it outperforms the benchmark.

The DCDGP models with adaptive UCB are not as effective as our proposed acquisition function when a problem is low dimensional.

The DCDGP models with adaptive UCB can be more effective than our acquisition function if a problem is high dimensional.
5.2 Airline Itinerary Experiment
The itinerary dataset can be obtained from https://github.com/jpn/larch. It is from U.S. continental markets in May of 2013 and is retrieved from a ticketing database provided by the Airlines Reporting Corporation. Each choice set has a booked itinerary (i.e. no rank). To construct the preference rank, we use all choice sets to build an NL model first. We then pick the largest five choice sets and we assume they are from one choice set. This yields in total 543 itineraries. Next the parameters and logsum terms from the NL model are applied to compute the probability of choosing each itinerary.
5.2.1 An NL itinerary model
Based on the level of service (i.e. stop and nonstop) and time of day (i.e. morning, afternoon and evening), we divide the itineraries into six different nests. Similar to the classic NL models, we assume the logsum parameters s are the same for all nests. Then the probability of choosing itinerary belonging to nest is formulated as
where and is the set of all nests. The estimated model parameters are summarized in Table 2. Time period 2 to 9 is originally encoded in the dataset which stand for different departure time windows. Equipment type refers to the aircraft type.
Table 2. NL model parameters
Explanatory variables  Parameters  Explanatory variables  Parameters 

Time period 2  0.082  Carrier 2  0.095 
Time period 3 
0.104  Carrier 3  0.500 
Time period 4 
0.069  Carrier 4  0.435 
Time period 5 
0.110  Carrier 5  0.508 
Time period 6 
0.204  Equipment type 2  0.379 
Time period 7 
0.259  High yield fare  0.001 
Time period 8 
0.265  Low yield fare  0.001 
Time period 9 
0.076  Elapsed time  0.005 
Nonstop 
3.090  
Logsum  0.792  
Loglikelihood at convergence  777501.590  
Loglikelihood at zero 
953940.440  
Rho Squared w.r.t. Null Parameters 
0.185 
Note: All explanatory variables and the logsum parameter are significantly different from zero at level 0.01.
5.2.2 Results
We first present the results of the itinerary experiment in Figure 7 by using the proposed acquisition function. Note that this is a 17 dimensional problem. The figure shows that the average performances of the DCGP and DCDGP models only outperform the benchmark within the first 15 queries. Since the results indicate the proposed models are underperformed after some point, we also test the adaptive UCB. We get the results presented in Figure 8. In the figure, we observe that DCDGP5 is the most preferred model if the target is to find the best itinerary within 20 queries, otherwise DCDGP1 should be used.
5.3 Summary
Table 4 shows the performances of the DCGP and DCDGP models in all experiments. We then summarize the recommendations as follow.

If the number of instances is high (i.e. 6D) and computation time is a concern (e.g. the DCDGP1 and DCDGP5 spend 50% to 70% more time than the DCGP model), we recommend using the DCGP model with adaptive UCB.

When the feature dimension is high and the running time is not a concern, we recommend a DCDGP model with more hidden mappings (DCDGP5) with our proposed acquisition function if it requires a few queries, and less hidden mapping (DCDGP1) with adaptive UCB if a high number of queries is tolerable.

If the number of instances is low (i.e. 2D) or in a low feature number context and time is a concern, we recommend a DCGP model using adaptive UCB.

When the number of instances is low (i.e. 2D) or in a low feature number context, we recommend a DCDGP model with less hidden mappings and our proposed acquisition function if time is not a concern.
Table 3. Computation time for the first 10 queries in the 6D experiment (in seconds)
GP  DGP1  DGP5 
6,076  8,557 (+41%)  11,720 (+93%) 
Table 4. Performance comparisons
Query #  2D  4D  6D  Airline 

10  DCDGP1  DCDGP5  DCDGP5  DCDGP5 (UCB) 
1020 
DCDGP1  DCDGP5  DCDGP1/DCDGP5  DCDGP5 (UCB) 
2030 
DCDGP1  DCDGP1/DCDGP5  DCDGP5  DCDGP1 (UCB) 
3040 
DCDGP1  DCDGP5  DCDGP1 (UCB)   
4050 
DCDGP1  DCDGP5  DCDGP1 (UCB)   
6 Conclusion
In this paper, we present a new active preference learning algorithm which embeds GP and DGP models. The models explore a novel way of learning preference from pairwise data which include classic choice model assumptions. A new acquisition function has also been proposed to solve this problem. The new methods have been applied to four different experiments and the proposed DCDGP models have been proven to be more effective than the DCGP model and a random search method. The issues of underperformed situations can be explained by the complexity of the latent functions and the choice of acquisition function. In general, we conclude that when the choice set is large and we require fast response, then the DCGP model is preferred. On the other hand, if computational time is not a big concern, we recommend using the DCDGP models with complex structure to get a better performance. The proposed method can be applied not only in the airline or travel industry but also in other scenarios involving disentangling customers’ preferences in a sequential way. In the future, we will investigate other approximation methods in the DGP models and alternative acquisition functions which may provide good solutions faster.
7 References
References
 (1) Bengio, Y. (2009) Learning Deep Architectures for AI. Foundations and Trends in Machine Learning. 1127.
 (2) BenAkiva, M.E. and Bierlaire, M. (1999) Discrete choice methods and their applications to short term travel decisions. Handbook of Transportation Science. Randolph W. Hall, ed.

(3)
Brochu, E., Cora, V. M. and Freitas, N.D. (2009) A tutorial on Bayesian optimization of expensive cost functions with application to active user modeling and hierarchical reinforcement learning.
Technical report. Department of Computer Science, University of British Columbia.  (4) Brochu, E., Freitas, N.D. and Ghosh, A. (2007) Active preference learning with discrete choice data. Advances in Neural Information Processing Systems. 409416.
 (5) Bui, T.D., HernándezLobato, D., Li, Y., HernándezLobato, J.M. and Turner, R.E. (2016) Deep Gaussian processes for regression using approximate expectation propagation. Proceedings of the International Conference on Machine Learning. 14721481.
 (6) Carrier, E. and Weatherford, L. (2015) Implementation of MNL choice models in the passenger origindestination simulator (PODS). Journal of Revenue and Pricing Management. 400407.
 (7) Chu, W., Ghahramani, Z. (2005) Preference learning with Gaussian processes. Proceedings of the International Conference on Machine Learning. 137144.
 (8) Coldren, G.M. and Koppelman, F.S. (2005) Modeling the competition among airtravel itinerary shares: GEV model development. Transportation Research Part A. 345365.
 (9) Crooks, G.E. (2009) Logistic approximation to the logisticnormal integral. Technical report. Lawrence Berkeley National Laboratory.
 (10) Dagsvik, J.K., Liu, G. (2009) A framework for analyzing rankordered data with application to automobile demand. Transportation Research Part A. 112.

(11)
Daminaou, A. and Lawrence, N.D. (2014) Deep Gaussian processes.
Proceedings of the 16th International Conference on Artificial Intelligence and Statistics
. 207215.  (12) Damianou, A. (2015) Deep Gaussian processes and variational propagation of uncertainty. PhD Thesis.
 (13) Duvenaud, D.K. (2014) Automatic model construction with Gaussian processes. PhD Thesis.
 (14) Garrow, L.A. (2010) Discrete choice modeling and air travel demand: theory and applications. Ashgate Publishing: Aldershot, United Kingdom.
 (15) Harteveldt, H.H. (2016) The future of airline distribution. International Air Transport Association.
 (16) Koppelman, F.S. and Sethi, V. (2000) Closed form discrete choice models. Handbook of Transport Modeling. Pergamon Press, Oxford. 211228.

(17)
Lawrence, N.D. (2004) Gaussian process latent variable models for visualization of high dimensional data.
Neural Information Processing Systems. 
(18)
Lawrence, N.D. (2005) Probabilistic nonlinear principal component analysis with Gaussian process latent variable models.
Journal of Machine Learning Research. 17831816. 
(19)
MacKay, D.J.C. (1994) Bayesian methods for backpropagation networks.
Models of Neural Networks III
. 211254.  (20) McFadden, D. (1978) Modeling the choice of residential location. Transportation Research Record. 7277.
 (21) Oviedo, J.L. and Yoo, H.I. (2017) A latent class nested logit model for rankordered data with application to Cork Oak reforestation. 10211051. Environmental and Resource Economics.
 (22) Sindhwani, V., Chu, W. and Keerthi, S.S. (2007) Semisupervised Gaussian process classification. The Twentieth International Joint Conferences on Artificial Intelligence.
 (23) Snelson, E. and Ghahramani, Z. (2006) Sparse Gaussian process using pseudoinputs. Advances in Neural Information Processing Systems. MIT Press.
 (24) Titsias, M.K. and Lawrence, N.D. (2010) Bayesian Gaussian process latent variable model. Proceedings of the 13th International Conference on Artificial Intelligence and Statistics. 844851.
 (25) Titsias, M.K. (2009) Variational learning of inducing variables in sparse Gaussian processes. Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics. 567574.
 (26) Warburg, V., Bhat, C. and Adler, T. (2006) Modeling demographic and unobserved heterogeneity in air passengers’ sensitivity to service attributes in itinerary choice. Transportation Research Record.
 (27) Williams, C.K.I. and Rasmussen, C.E. (1996) Gaussian processes for regression. Advances in Neural Information Processing Systems. MIT Press. 598604.
8 Appendix
8.1 2D, 4D and 6D with adaptive UCB compared to original results
In Figures 911, we show the comparison of the proposed acquisition function and UCB.
Comments
There are no comments yet.