Preference learning is a well-studied and widely applicable area of the machine learning literature. Preference elicitation is by no means a new problem (Schapire and Singer, 1998), and is now ubiquitous in many different forms in nearly all subfields of machine learning. One such scenario is the active learning setting, where one sequentially and adaptively queries the user to most efficiently learn his or her preferences. In general, learning in an online setting can be more efficient than doing so in an offline supervised learning setting, which is consequential when queries are expensive. This is often the case for preference elicitation, where a user may not be inclined to answer too many questions. The ability to adaptively query the user with particular exemplars that facilitate learning the labels of the rest is invaluable in the context of preference elicitation.
In particular, there is great interest in using choice-based queries to learn the preferences of an individual user. In this setting, a user is offered two or more alternatives and is asked to select the alternative he or she likes most. There are other types of responses that can assess one’s preferences among a set of alternatives, such as rating each of the items on a scale, or giving a full preference order for all alternatives in the set. However, choosing the most-preferred item in a given set is a natural task, and is a more robust measurement of preference than rating or fully-ranking items. For this reason, choice-based methods have been shown to work well in practice (see Louviere et al., 2000), and these are the types of queries we study. In this paper, we formulate the problem of sequential choice-based preference elicitation as a finite horizon adaptive learning problem.
The marketing community has long been focused on preference elicitation and isolating features that matter the most to consumers. In this field, conjoint analysis is a class of methods that attempts to learn these important features by offering users a subset of alternatives (Green and Srinivasan, 1978). Lately, there has been a push in the marketing community to design sequential methods that adaptively select the best subset of alternatives to offer the user. In the marketing research literature, this is referred to as adaptive choice-based conjoint analysis. In the past, geometrically-motivated heuristics have been used to adaptively choose questions (Toubia et al., 2004). These heuristics have since evolved to include probabilistic modeling that captures the uncertainty in user responses (Toubia et al., 2007).
These problems are also tackled by the active learning community. For instance, Maldonado et al. (2015) use existing support vector machine (SVM) technology to identify features users find important. In the context of preference elicitation in the active learning literature, there are two main approaches. The first is to take a non-parametric approach and infer a full preference ranking, labeling every pairwise combination of alternatives (Fürnkranz and Hüllermeier, 2003). The benefit to this approach is the generality offered by a non-parametric model and its ability to capture realistic noise. Viewing preference learning as a generalized binary search problem, Nowak (2011) proves exponential convergence in probability to the correct preferential ordering for all alternatives in a given set, and shows his algorithm is optimal up to a constant factor. Unfortunately, this probabilistic upper bound is weakened by a coefficient that is quadratic in the total number of alternatives, and the running time of this optimal policy is proportional to the number of valid preferential orderings of all the alternatives. These issues are common for non-parametric ranking models. Using a statistical learning theoretic framework, Ailon (2012) develops an adaptive and computationally efficient algorithm to learn a ranking, but the performance guarantees are only asymptotic. In practice, one can only expect to ask a user a limited number of questions, and in this scenario, Yu et al. (2012) show that taking a Bayesian approach to optimally and adaptively selecting questions is indispensable to the task of learning preferences for a given user. In the search for finite-time results and provable bounds, we opt to learn a parametric model using a Bayesian approach. In particular, this paper focuses on a greedy policy that maximally reduces posterior entropy of a linear classifier, leveraging information theory to derive results pertaining to this policy.
Maximizing posterior entropy reduction has long been a suggested objective for learning algorithms (Lindley, 1956; Bernardo, 1979), especially within the context of active learning (MacKay, 1992). But even within this paradigm of preference elicitation, there is a variety of work that depends on the user response model. For example, Dzyabura and Hauser (2011) study maximizing entropy reduction under different response heuristics, and Saure and Vielma (2016) use ellipsoidal credibility regions to capture the current state of knowledge of a user’s preferences. Using an entropy-based objective function allows one to leverage existing results in information theory to derive theoretical finite-time guarantees (Jedynak et al., 2012). Most similar to our methodology, Brochu et al. (2010) and Houlsby et al. (2011) model a user’s utility function using a Gaussian process, updating the corresponding prior after each user response, and adaptively choose questions by minimizing an estimate of posterior entropy. However, while the response model is widely applicable and the method shows promise in practical situations, the lack of theoretical guarantees leaves much to be desired. Ideally, one would want concrete performance bounds for an entropy-based algorithm under a parameterized response model. In contrast, this paper proves information theoretic results in the context of adaptive choice-based preference elicitation for arbitrary feature-space dimension, leverages these results to derive bounds for performance, and shows that a greedy entropy reduction policy (hereafter referred to as entropy pursuit) optimally reduces posterior entropy of a linear classifier over the course of multiple choice-based questions. In particular, the main contributions of the paper are summarized as follows:
Section 3.3 presents results showing that the linear lower bound can be attained by a greedy algorithm up to a multiplicative constant when we are allowed to fabricate alternatives (i.e., when the set of alternatives has a non-empty interior). Further, the bound is attained exactly with moderate conditions on the noise channel.
Section 4 focuses on misclassification error, a more intuitive metric of measuring knowledge of a user’s preferences. In the context of this metric, we show a Fano-type lower bound on the optimal policy in terms of an increasing linear function of posterior differential entropy.
Finally in Section 5, we provide numerical results demonstrating that entropy pursuit performs similarly to an alternative algorithm that greedily minimizes misclassification error. This is shown in a variety of scenarios and across both metrics. Taking into account the fact that entropy pursuit is far more computationally efficient than the alternative algorithm, we conclude that entropy pursuit should be preferred in practical applications.
2 Problem Specification
The alternatives $x^{(i)} \in \mathbb{R}^d$ are represented by $d$-dimensional feature vectors that encode all of their distinguishing aspects. Let $\mathbb{X} \subset \mathbb{R}^d$ be the set of all such alternatives. Assuming a linear utility model, each user has her own linear classifier $\boldsymbol{\theta} \in \Theta \subset \mathbb{R}^d$ that encodes her preferences. (Throughout the paper, we use boldface to denote a random variable.) At time epoch $k$, given $m$ alternatives $X_k = \{x_k^{(1)}, x_k^{(2)}, \dots, x_k^{(m)}\} \in \mathbb{X}^m$, the user prefers to choose the alternative $i$ that maximizes $\boldsymbol{\theta}^T x_k^{(i)}$. However, we do not observe this preference directly. Rather, we observe a signal influenced by a noise channel. In this case, the signal is the response we observe from the user.
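Let $\mathbb{Z} = \{1, 2, \dots, m\}$ denote the $m$ possible alternatives. We define $\mathbf{Z}_k(X_k)$ to be the alternative that is consistent with our linear model with classifier $\boldsymbol{\theta}$ after asking question $X_k = \{x_k^{(1)}, \dots, x_k^{(m)}\}$, that is, $\mathbf{Z}_k(X_k) = \min\{\arg\max_{i \in \mathbb{Z}} \boldsymbol{\theta}^T x_k^{(i)}\}$. The minimum is just used as a tie-breaking rule; the specific rule is not important so long as it is deterministic. We do not observe $\mathbf{Z}_k(X_k)$, but rather observe a signal $\mathbf{Y}_k(X_k) \in \mathbb{Y}$, which depends on $\mathbf{Z}_k(X_k)$. We allow $\mathbb{Y}$ to characterize any type of signal that can be received from posing questions in $\mathbb{X}^m$. In general, the density of the conditional distribution of $\mathbf{Y}_k(X_k)$ given $\mathbf{Z}_k(X_k) = z$ is denoted $f^{(z)}(\cdot)$. In this paper, we primarily consider the scenario in which $\mathbb{Y} = \mathbb{Z}$, where nature randomly perturbs $\mathbf{Z}_k(X_k)$ to some (possibly the same) element in $\mathbb{Z}$. In this scenario, the user’s response to the preferred alternative is the signal $\mathbf{Y}_k(X_k)$, which is observed in lieu of the model-consistent “true response” $\mathbf{Z}_k(X_k)$. In this case, we define a noise channel stochastic matrix $P$ by setting $P^{(zy)} = f^{(z)}(y)$ to describe what is called a discrete noise channel.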
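The response model above can be illustrated with a short simulation. This is only a sketch under stated assumptions: the function names, the example classifier, the question, and the mixing weight `alpha` are hypothetical, indices are 0-based rather than 1-based, and the transmission matrix is an illustrative symmetric-style channel rather than one prescribed by the paper.

```python
import numpy as np

def true_response(theta, question):
    """Model-consistent answer: smallest index maximizing theta^T x (0-based)."""
    utilities = question @ theta       # shape (m,), one utility per alternative
    return int(np.argmax(utilities))   # np.argmax returns the first maximizer on ties

def observed_signal(z, P, rng):
    """Sample the user's response from row z of the transmission matrix P."""
    return int(rng.choice(P.shape[1], p=P[z]))

rng = np.random.default_rng(0)
theta = np.array([1.0, -0.5])          # hypothetical user classifier
question = np.array([[0.2, 0.1],       # m = 3 alternatives in d = 2 dimensions
                     [0.9, 0.3],
                     [0.9, 0.3]])      # ties with the previous alternative
m = question.shape[0]
alpha = 0.8                            # assumed probability of a faithful answer
P = alpha * np.eye(m) + (1 - alpha) * np.full((m, m), 1.0 / m)

z = true_response(theta, question)     # tie is broken toward the smaller index
y = observed_signal(z, P, rng)         # noisy signal, possibly equal to z
```

The tie between the last two alternatives is resolved deterministically, which is all the model requires of the tie-breaking rule.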
One sequentially asks the user questions and learns from each of their responses. Accordingly, let $\mathbb{P}_k$ be the probability measure conditioned on the $\sigma$-field generated by the responses $\mathbf{Y}_1(X_1), \dots, \mathbf{Y}_{k-1}(X_{k-1})$. Similarly, let $\mathcal{Y}_k = \{\mathbf{Y}_\ell(X_\ell) : \ell = 1, \dots, k-1\}$ denote the history of user responses. As we update, we condition on the previous outcomes, and subsequently choose a question $X_k$ that depends on all previous responses $\mathcal{Y}_k$ from the user. Accordingly, let policy $\pi$ return a comparative question $X_k \in \mathbb{X}^m$ that depends on time epoch $k$ and past response history $\mathcal{Y}_k$. The selected question $X_k$ may also depend on i.i.d. random uniform variables, allowing for stochastic policies. We denote the space of all such policies $\pi$ as $\Pi$. In this light, let $\mathbb{E}^\pi$ be the expectation operator induced by policy $\pi$.
In this paper, we consider a specific noise model, which is highlighted in the following assumptions.
Noise Channel Assumptions
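For every time epoch $k$, signal $\mathbf{Y}_k(X_k)$ and true response $\mathbf{Z}_k(X_k)$ corresponding to comparative question $X_k$, we assume
model-consistent response $\mathbf{Z}_k(X_k)$ is a deterministic function of question $X_k$ and linear classifier $\boldsymbol{\theta}$, and
given true response $\mathbf{Z}_k(X_k)$, signal $\mathbf{Y}_k(X_k)$ is conditionally independent of linear classifier $\boldsymbol{\theta}$ and previous history $\mathcal{Y}_k$, and
the conditional densities $f = \{f^{(z)} : z \in \mathbb{Z}\}$ differ from each other on a set of Lebesgue measure greater than zero.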
The first two assumptions ensure that all the information regarding the linear classifier $\boldsymbol{\theta}$ is contained in some true response $\mathbf{Z}_k(X_k)$. In other words, the model assumes that no information about the linear classifier is lost if we focus on inferring the true response instead. The last assumption is focused on identifiability of the model: since we infer by observing a signal, it is critical that we can tell the conditional distributions of these signals apart, and the latter condition guarantees this.
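One of the benefits this noise model provides is allowing us to easily update our beliefs of the linear classifier $\boldsymbol{\theta}$. For a given question $X = \{x^{(1)}, \dots, x^{(m)}\} \in \mathbb{X}^m$ and true response $z \in \mathbb{Z}$, let
$$A^{(z)}(X) = \left\{ \theta \in \Theta : \theta^T x^{(z)} \ge \theta^T x^{(i)} \ \forall i > z, \quad \theta^T x^{(z)} > \theta^T x^{(i)} \ \forall i < z \right\}.$$
These sets form a partition of $\Theta$ that depends on the question $X$ we ask at each time epoch, where each set $A^{(z)}(X)$ corresponds to all linear classifiers that are consistent with the true response $z$.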
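Let $\mu_k$ denote the prior measure of $\boldsymbol{\theta}$ at time epoch $k$. Throughout the paper, we assume that $\mu_k$ is absolutely continuous with respect to $d$-dimensional Lebesgue measure, admitting a corresponding Lebesgue density $p_k$. At every epoch, we ask the user a comparative question that asks for the most preferred option in $X_k = \{x_k^{(1)}, \dots, x_k^{(m)}\}$. We observe signal $\mathbf{Y}_k(X_k)$, and accordingly update the prior. Suppose that the Noise Channel Assumptions hold. Then we can write the posterior $p_{k+1}$ as
$$p_{k+1}\big(\theta \mid \mathbf{Y}_k(X_k) = y\big) = \frac{\sum_{z \in \mathbb{Z}} \mathbb{I}\big(\theta \in A^{(z)}(X_k)\big)\, f^{(z)}(y)}{\sum_{z \in \mathbb{Z}} \mu_k\big(A^{(z)}(X_k)\big)\, f^{(z)}(y)}\; p_k(\theta),$$
where $\mathbb{I}$ denotes the indicator function and $A^{(z)}(X_k)$ is the set of linear classifiers consistent with true response $z$. Using Bayes’ rule, we see
$$p_{k+1}\big(\theta \mid \mathbf{Y}_k(X_k) = y\big) \propto \mathbb{P}_k\big(\mathbf{Y}_k(X_k) = y \mid \boldsymbol{\theta} = \theta\big)\, p_k(\theta) = \sum_{z \in \mathbb{Z}} \mathbb{P}_k\big(\mathbf{Y}_k(X_k) = y \mid \mathbf{Z}_k(X_k) = z, \boldsymbol{\theta} = \theta\big)\, \mathbb{P}_k\big(\mathbf{Z}_k(X_k) = z \mid \boldsymbol{\theta} = \theta\big)\, p_k(\theta).$$
Now we use a property of $\mathbf{Y}_k(X_k)$ and $\mathbf{Z}_k(X_k)$ from the Noise Channel Assumptions, namely that $\mathbf{Y}_k(X_k)$ and $\boldsymbol{\theta}$ are conditionally independent given $\mathbf{Z}_k(X_k)$. This implies
$$p_{k+1}\big(\theta \mid \mathbf{Y}_k(X_k) = y\big) \propto \sum_{z \in \mathbb{Z}} f^{(z)}(y)\, \mathbb{P}_k\big(\mathbf{Z}_k(X_k) = z \mid \boldsymbol{\theta} = \theta\big)\, p_k(\theta) = \sum_{z \in \mathbb{Z}} f^{(z)}(y)\, \mathbb{I}\big(\theta \in A^{(z)}(X_k)\big)\, p_k(\theta),$$
where the last line is true because $\mathbf{Z}_k(X_k)$ is a deterministic function of $\boldsymbol{\theta}$ and $X_k$. Normalizing to ensure the density integrates to one gives the result. The Noise Channel Assumptions allow us to easily update the prior on $\boldsymbol{\theta}$. As we will see next, they also allow us to easily express the conditions required to maximize one-step entropy reduction.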
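The Bayesian update described above can be sketched numerically by approximating the prior with weighted samples of the linear classifier. This is an illustrative approximation, not the paper's procedure: the sample-based representation, the function name `posterior_weights`, and the example data are all assumptions. Each sample's weight is multiplied by the likelihood of the observed signal given that sample's model-consistent response, and the weights are then renormalized.

```python
import numpy as np

def posterior_weights(thetas, weights, question, y, P):
    """Reweight samples of the classifier after observing signal y.

    thetas:   (n, d) samples standing in for the prior density
    weights:  (n,) prior weights summing to one
    question: (m, d) alternatives offered to the user
    P:        (m, m) row-stochastic matrix, P[z, y] = Pr(signal y | true response z)
    """
    # Model-consistent response of each sample (first maximizer breaks ties).
    z = np.argmax(thetas @ question.T, axis=1)
    # Prior weight times the likelihood of the observed signal.
    unnormalized = weights * P[z, y]
    return unnormalized / unnormalized.sum()

# Hypothetical example: two samples that disagree on a pairwise comparison.
thetas = np.array([[1.0, 0.0], [-1.0, 0.0]])
weights = np.array([0.5, 0.5])
question = np.array([[1.0, 0.0], [-1.0, 0.0]])
P = np.array([[0.9, 0.1], [0.1, 0.9]])
updated = posterior_weights(thetas, weights, question, y=0, P=P)
```

Because the likelihood depends on a sample only through its model-consistent response, all samples in the same consistency set are rescaled by the same factor.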
3 Posterior Entropy
We focus on how we select the alternatives we offer to the user. First, we need to choose a metric to evaluate the effectiveness of each question. One option is to use a measure of dispersion of the posterior distribution of the linear classifier $\boldsymbol{\theta}$, with the objective of decreasing the amount of spread as much as possible with every question. Along these lines, we elect to use differential entropy for its tractability.
For a probability density $p$, the differential entropy of $p$ is defined as
$$H(p) = -\int_{\Theta} p(\theta) \log_2 p(\theta)\, d\theta.$$
For the entirety of this paper, all logarithms are base-2, implying that both Shannon and differential entropy are measured in bits. Because we ask the user multiple questions, it is important to incorporate the previous response history $\mathcal{Y}_k$ when considering posterior entropy. Let $H_k$ be the entropy operator at time epoch $k$ such that $H_k(\boldsymbol{\theta}) = H(p_k)$, which takes into account all of the previous observation history $\mathcal{Y}_k$. Occasionally, when looking at the performance of a policy $\pi$, we would want to randomize over all such histories. This is equivalent to the concept of conditional entropy, with $\mathbb{E}^\pi[H_k(\boldsymbol{\theta})] = H^\pi(\boldsymbol{\theta} \mid \mathcal{Y}_k)$.
Throughout the paper, we represent discrete distributions as vectors. Accordingly, define
$$\Delta^m = \Big\{ u \in \mathbb{R}^m : \sum_{z=1}^{m} u^{(z)} = 1, \ u \ge 0 \Big\}$$
to be the set of discrete probability distributions over $m$ alternatives. For a probability distribution $u \in \Delta^m$, we define $h(u)$ to be the Shannon entropy of that discrete distribution, namely
$$h(u) = -\sum_{z=1}^{m} u^{(z)} \log_2 u^{(z)}.$$
Here, we consider discrete probability distributions over the alternatives we offer, which is why distributions are indexed by $z$.
Since stochastic matrices are used to model some noise channels, we develop similar notation for matrices. Let $\Delta^{m \times m}$ denote the set of $m \times m$ row-stochastic matrices. Similarly to how we defined the Shannon entropy of a vector, we define $h(P)$ as an $m$-vector with the Shannon entropies of the rows of $P$ as its components. In other words,
$$h(P)^{(z)} = -\sum_{y=1}^{m} P^{(zy)} \log_2 P^{(zy)}.$$
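The two entropy notions for discrete distributions, one for a probability vector and one applied row-wise to a stochastic matrix, can be written compactly. The sketch below assumes the usual convention that zero-probability terms contribute nothing; the function names are hypothetical.

```python
import numpy as np

def shannon_entropy(u):
    """Shannon entropy in bits of a probability vector, treating 0 log 0 as 0."""
    u = np.asarray(u, dtype=float)
    nz = u > 0                          # skip zero-probability entries
    return float(-np.sum(u[nz] * np.log2(u[nz])))

def row_entropies(P):
    """Vector whose z-th component is the Shannon entropy of row z of P."""
    return np.array([shannon_entropy(row) for row in np.asarray(P, dtype=float)])

# Hypothetical 2 x 2 transmission matrix with one noiseless row.
P_example = np.array([[1.0, 0.0],
                      [0.5, 0.5]])
h_rows = row_entropies(P_example)       # first row carries no noise, second one bit
```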
An important concept in information theory is mutual information, which measures the entropy reduction of a random variable when conditioning on another. It is natural to ask about the relationship between the information gain of the linear classifier $\boldsymbol{\theta}$ and that of the true response $\mathbf{Z}_k(X_k)$ after observing signal $\mathbf{Y}_k(X_k)$. Mutual information in this context is defined as
$$I_k\big(\boldsymbol{\theta}; \mathbf{Y}_k(X_k)\big) = H_k(\boldsymbol{\theta}) - H_k\big(\boldsymbol{\theta} \mid \mathbf{Y}_k(X_k)\big).$$
One critical property of mutual information is that it is symmetric, or in other words, $I_k(\boldsymbol{\theta}; \mathbf{Y}_k(X_k)) = I_k(\mathbf{Y}_k(X_k); \boldsymbol{\theta})$ (see Cover, 1991, p. 20). In the context of our model, this means that observing signal $\mathbf{Y}_k(X_k)$ gives us the same amount of information about linear classifier $\boldsymbol{\theta}$ as observing the linear classifier would provide about the signal. This is one property we exploit throughout the paper, since the latter case only depends on the noise channel, which by assumption does not change over time. We show in Theorem 3 below that the Noise Channel Assumptions allow us to determine how the noise channel affects the posterior entropy of linear classifier $\boldsymbol{\theta}$.
The first identity, given by (4), says that the noise provides an additive effect with respect to entropy, particularly because the noise does not depend on the linear classifier $\boldsymbol{\theta}$ itself. The second identity, given by (5), highlights the fact that the signal $\mathbf{Y}_k(X_k)$ provides the same amount of information on the linear classifier $\boldsymbol{\theta}$ as it does on the true answer $\mathbf{Z}_k(X_k)$ for a given question. This means that the entropy of both $\boldsymbol{\theta}$ and $\mathbf{Z}_k(X_k)$ is reduced by the same number of bits when asking question $X_k$. Intuitively, asking the question that would gain the most clarity from a response would also do the same for the underlying linear classifier. This is formalized in Theorem 3 below.
The following information identities hold under the Noise Channel Assumptions for all time epochs $k$. The first is the Noise Separation Equality, namely
and the Noise Channel Information Equality, given by
where the latter term does not depend on response history .
Using the symmetry of mutual information,
Further, we know because and are conditionally independent given . Also, since is a function of and , it must be that . Putting these together gives us the first identity. To prove the second identity, we use the fact that
Again, because is a function of and . This yields . Substitution into the first identity gives us
The entropy pursuit policy is one that maximizes the reduction in entropy of the linear classifier, namely $I_k(\boldsymbol{\theta}; \mathbf{Y}_k(X_k)) = H_k(\boldsymbol{\theta}) - H_k(\boldsymbol{\theta} \mid \mathbf{Y}_k(X_k))$, at each time epoch. We leverage the results from Theorem 3 to find conditions on questions that maximally reduce entropy in the linear classifier $\boldsymbol{\theta}$. However, we first need to introduce some more notation.
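For a noise channel parameterized by densities $f = \{f^{(z)} : z \in \mathbb{Z}\}$, let $\varphi(\cdot\,; f)$ denote the function on domain $\Delta^m$ defined as
$$\varphi(u; f) = H\Big(\sum_{z \in \mathbb{Z}} u^{(z)} f^{(z)}\Big) - \sum_{z \in \mathbb{Z}} u^{(z)} H\big(f^{(z)}\big). \tag{6}$$
We will show in Theorem 8 that (6) refers to the reduction in entropy from asking a question, where the argument $u \in \Delta^m$ depends on the question. We define the channel capacity over noise channel $f$, denoted $C(f)$, to be the supremum of $\varphi$ over this domain, namely
$$C(f) = \sup_{u \in \Delta^m} \varphi(u; f),$$
and this denotes the maximal amount of entropy reduction at every step. These can be similarly defined for a discrete noise channel. For a noise channel parameterized by transmission matrix $P$, we define
$$\varphi(u; P) = h\big(u^T P\big) - u^T h(P), \tag{8}$$
and $C(P)$ is correspondingly the supremum of $\varphi(\cdot\,; P)$ in its first argument. In Theorem 8 below, we show that $\varphi(u; f)$ is precisely the amount of entropy over linear classifiers $\boldsymbol{\theta}$ reduced by asking a question with respective predictive distribution $u$ under noise channel $f$. For a given question $X \in \mathbb{X}^m$, define $u_k(X) \in \Delta^m$ such that $u_k^{(z)}(X) = \mu_k\big(A^{(z)}(X)\big)$ for all $z \in \mathbb{Z}$, where $\mu_k$ is the prior measure at time epoch $k$ and $A^{(z)}(X)$ is the set of linear classifiers consistent with true response $z$. Suppose that the Noise Channel Assumptions hold. Then for a fixed noise channel parameterized by $f$,
$$I_k\big(\boldsymbol{\theta}; \mathbf{Y}_k(X_k)\big) = \varphi\big(u_k(X_k); f\big).$$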
Consequently, for all time epochs , we have
and there exists that attains the supremum. Moreover, if there exists some such that , then the upper bound is attained. We first use (5) from Theorem 3, namely that . We use the fact that mutual information is symmetric, meaning that the entropy reduction in while observing is equal to that in while observing . Putting this together with the definition of mutual information yields
which is equal to , where . Therefore, the optimization problem in (10) is equivalent to
Since , we can relax the above problem to
It is known that mutual information is concave in the probability mass function of the input (see Cover, 1991, p. 31), and strictly concave when the likelihood functions differ on a set of positive measure. Thus, for a fixed noise channel $f$, $\varphi(\cdot\,; f)$ is concave on $\Delta^m$, a compact convex set, implying an optimal solution $u_*$ exists and the optimal objective value $C(f)$ is attained. Further, if we can construct some question $X_k \in \mathbb{X}^m$ such that $u_k^{(z)}(X_k) = u_*^{(z)}$ for every $z \in \mathbb{Z}$, then the upper bound is attained. We have shown that entropy reduction of the posterior of $\boldsymbol{\theta}$ depends only on the implied predictive distribution of a given question and structure of the noise channel. If we are free to fabricate alternatives to achieve the optimal predictive distribution, then we reduce the entropy of the posterior by a fixed amount $C(f)$ at every time epoch. Perhaps the most surprising aspect of this result is the fact that the history $\mathcal{Y}_k$ plays no role in the amount of entropy reduction, which is important for showing that entropy pursuit is an optimal policy for reducing entropy over several questions.
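For a discrete noise channel, the entropy reduction can be evaluated directly from a predictive distribution and a transmission matrix. The sketch below assumes the discrete form of the channel function, $h(u^T P) - u^T h(P)$, and compares it against mutual information between the true response and the signal computed from the joint distribution. The channel, the mixing weight `alpha`, and all function names are hypothetical choices for illustration.

```python
import numpy as np

def entropy_bits(q):
    """Shannon entropy in bits, treating 0 log 0 as 0."""
    q = np.asarray(q, dtype=float)
    nz = q > 0
    return float(-np.sum(q[nz] * np.log2(q[nz])))

def channel_function(u, P):
    """Entropy of the predicted signal minus the expected row entropy of P."""
    row_h = np.array([entropy_bits(row) for row in P])
    return entropy_bits(u @ P) - float(u @ row_h)

def mutual_information(u, P):
    """Mutual information (bits) between true response and signal, from the joint."""
    joint = u[:, None] * P                   # joint[z, y] = u[z] * P[z, y]
    product = u[:, None] * joint.sum(axis=0)[None, :]
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / product[mask])))

m, alpha = 3, 0.7                            # assumed symmetric-style channel
P = alpha * np.eye(m) + (1 - alpha) * np.full((m, m), 1.0 / m)
uniform = np.full(m, 1.0 / m)
rng = np.random.default_rng(0)
candidates = rng.dirichlet(np.ones(m), size=200)   # random predictive distributions
gains = np.array([channel_function(u, P) for u in candidates])
capacity_at_uniform = channel_function(uniform, P)
```

For this particular symmetric channel, every row of `P` has the same entropy, so the uniform predictive distribution maximizes the output entropy and none of the random candidates should exceed `capacity_at_uniform`.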
In practice, one can usually ask more than one question, and it is natural to ask if there is an extension that gives us a bound on the posterior entropy after asking several questions. Using the results in Theorem 8, we can derive an analogous lower bound for this case. For a given policy , we can write the entropy of linear classifier after time epochs as
and a lower bound for the differential entropy of after asking questions is given below by
Further, if a given policy $\pi$ and history $\mathcal{Y}_k$ indicate that comparative question $X_k$ should be posed to the user, then the lower bound is attained if and only if $u_k(X_k) = u_*$, with $u_*$ as defined in Theorem 8. Thus, entropy pursuit is an optimal policy. Using the information chain rule, we can write the entropy reduction for a generic policy $\pi$ as
where the last inequality comes directly from Theorem 8, and the upper bound is attained if and only if for every . This coincides with the entropy pursuit policy. Essentially, Corollary 3 shows that the greedy entropy reduction policy is, in fact, the optimal policy over any time horizon. However, there is still an important element that is missing: how can we ensure that there exists some alternative that satisfies the entropy pursuit criteria? We address this important concern in Section 3.3.
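When alternatives cannot be fabricated, one way to approximate the greedy step is to score every candidate question by the entropy reduction implied by its predictive distribution and pick the best. The sketch below assumes a finite pool of alternatives, a sample-based approximation of the posterior, and a discrete noise channel; these choices and all names are hypothetical, and the exhaustive search over subsets is only practical for small pools.

```python
import itertools
import numpy as np

def entropy_bits(q):
    q = np.asarray(q, dtype=float)
    nz = q > 0
    return float(-np.sum(q[nz] * np.log2(q[nz])))

def channel_function(u, P):
    row_h = np.array([entropy_bits(row) for row in P])
    return entropy_bits(u @ P) - float(u @ row_h)

def predictive_distribution(thetas, weights, question):
    """Posterior probability that each offered alternative is the true response."""
    z = np.argmax(thetas @ question.T, axis=1)       # consistent answer per sample
    return np.bincount(z, weights=weights, minlength=question.shape[0])

def greedy_question(thetas, weights, alternatives, P):
    """Pick the m-subset of alternatives with the largest one-step entropy reduction."""
    m = P.shape[0]
    best_idx, best_gain = None, -np.inf
    for idx in itertools.combinations(range(len(alternatives)), m):
        u = predictive_distribution(thetas, weights, alternatives[list(idx)])
        gain = channel_function(u, P)
        if gain > best_gain:                         # keep the first best subset
            best_idx, best_gain = idx, gain
    return best_idx, best_gain

# Hypothetical posterior samples, candidate pool, and pairwise noise channel.
thetas = np.array([[1.0, 2.0], [-1.0, 2.0], [1.0, -2.0], [-1.0, -2.0]])
weights = np.array([0.4, 0.1, 0.4, 0.1])
alternatives = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]])
P = np.array([[0.9, 0.1], [0.1, 0.9]])
best_idx, best_gain = greedy_question(thetas, weights, alternatives, P)
```

In this example the first pair splits the posterior mass 0.8 to 0.2, while pairs involving the third alternative split it evenly, so the selected question is one of the latter.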
3.1 Optimality Conditions for Predictive Distribution
Because of the properties of entropy, the noise channel function has a lot of structure. We use this structure to find conditions for a non-degenerate optimal predictive distribution as well as derive sensitivity results that allow the optimality gap of a close-to-optimal predictive distribution to be estimated.
Before we prove structural results for the channel equation $\varphi$, some more information theoretic notation should be introduced. Given two densities $f$ and $g$, the cross entropy of these two densities is defined as
$$H(f, g) = -\int f(y) \log_2 g(y)\, dy.$$
Using the definition of cross entropy, the Kullback-Leibler divergence between two densities $f$ and $g$ is defined as
$$\mathrm{KL}(f \,\|\, g) = H(f, g) - H(f).$$
Kullback-Leibler divergence is a tractable way of measuring the difference of two densities. An interesting property of Kullback-Leibler divergence is that for any densities $f$ and $g$, $\mathrm{KL}(f \,\|\, g) \ge 0$, with equality if and only if $f = g$ almost surely. Kullback-Leibler divergence plays a crucial role in the first-order information for the channel equation $\varphi$.
We now derive results that express the gradient and Hessian of $\varphi$ in terms of the noise channel, which can either be parameterized by $f$ in the case of a density, or by a fixed transmission matrix $P$ in the discrete noise channel case. For these results to hold, we require the cross entropy $H(f^{(z)}, f^{(w)})$ to be bounded in magnitude for all $z, w \in \mathbb{Z}$, which is an entirely reasonable assumption.
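For a fixed noise channel characterized by $f = \{f^{(z)} : z \in \mathbb{Z}\}$, if the cross entropy terms $H(f^{(z)}, f^{(w)})$ are bounded for all $z, w \in \mathbb{Z}$, then the first and second partial derivatives of $\varphi(\cdot\,; f)$ with respect to $u$ are given by
$$\frac{\partial \varphi(u; f)}{\partial u^{(z)}} = \mathrm{KL}\Big(f^{(z)} \,\Big\|\, \sum_{i \in \mathbb{Z}} u^{(i)} f^{(i)}\Big) - \xi, \qquad \frac{\partial^2 \varphi(u; f)}{\partial u^{(z)} \partial u^{(w)}} = -\xi \int \frac{f^{(z)}(y)\, f^{(w)}(y)}{\sum_{i \in \mathbb{Z}} u^{(i)} f^{(i)}(y)}\, dy,$$
where $\xi = \log_2 e$, and $\mathrm{KL}(\cdot \,\|\, \cdot)$ is the Kullback-Leibler Divergence.
In particular, if a discrete noise channel is parameterized by transmission matrix $P$, the gradient and Hessian matrix of $\varphi(\cdot\,; P)$ can be respectively expressed as
$$\nabla_u \varphi(u; P) = -P \log_2\big(P^T u\big) - h(P) - \xi e, \qquad \nabla_u^2 \varphi(u; P) = -\xi\, P\, \mathrm{diag}\big(P^T u\big)^{-1} P^T,$$
where $e$ is the vector of all ones and the logarithm is taken component-wise. We first prove the result in the more general case when the noise channel is parameterized by $f$. From the definition of $\varphi$,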
where the logarithm is taken component-wise. We first prove the result in the more general case when the noise channel is parameterized by . From the definition of ,
Since is convex, by Jensen’s inequality, , which is bounded. By the Dominated Convergence Theorem, we can switch differentiation and integration operators, and thus,
Concerning the second partial derivative, Kullback-Leibler divergence is always non-negative, and therefore, Monotone Convergence Theorem again allows us to switch integration and differentiation, yielding
For the discrete noise channel case, the proof is analogous to above, using Equation (8). Vectorizing yields
Similarly, the discrete noise channel analogue for the second derivative is
and vectorizing gives us the Hessian matrix.
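A quick numerical check can make the discrete-channel derivatives concrete. The sketch below assumes the discrete channel function $h(u^T P) - u^T h(P)$ with base-2 logarithms, a gradient whose $z$-th component is the Kullback-Leibler divergence between row $z$ of $P$ and the predicted signal distribution minus $\log_2 e$, and a Hessian equal to $-\log_2 e$ times $P\,\mathrm{diag}(P^T u)^{-1} P^T$. The example matrix has strictly positive entries, and all names are hypothetical.

```python
import numpy as np

XI = np.log2(np.e)   # converts natural-log derivatives to bits

def channel_function(u, P):
    q = u @ P                                        # predicted signal distribution
    return float(-np.sum(q * np.log2(q)) + u @ np.sum(P * np.log2(P), axis=1))

def gradient(u, P):
    q = u @ P
    return np.sum(P * np.log2(P / q), axis=1) - XI   # row-wise KL divergence minus XI

def hessian(u, P):
    q = u @ P
    return -XI * (P / q) @ P.T                       # entry (z, w): -XI * sum_y P_zy P_wy / q_y

def numerical_gradient(u, P, eps=1e-6):
    """Central finite differences, treating u as an unconstrained positive vector."""
    g = np.zeros_like(u)
    for z in range(len(u)):
        step = np.zeros_like(u)
        step[z] = eps
        g[z] = (channel_function(u + step, P) - channel_function(u - step, P)) / (2 * eps)
    return g

P = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.7, 0.1],
              [0.1, 0.3, 0.6]])
u = np.array([0.5, 0.3, 0.2])
g_exact, g_numeric = gradient(u, P), numerical_gradient(u, P)
H = hessian(u, P)
```

The Hessian is a negative multiple of a Gram-type matrix, so its eigenvalues are non-positive, which is consistent with the concavity of the channel function.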
One can now use the results in Lemma 3.1 to find conditions for an optimal predictive distribution for a noise channel parameterized either by densities $f$ or transmission matrix $P$. There has been much research on how to find the optimal predictive distribution given a noise channel, as in Gallager (1968). Generally, there are two methods for finding this quantity. The first relies on solving a constrained concave maximization problem by using a first-order method. The other involves using the Karush-Kuhn-Tucker conditions necessary for an optimal solution (see Gallager, 1968, p. 91 for proof). [Gallager] Given a noise channel parameterized by $f = \{f^{(z)} : z \in \mathbb{Z}\}$, the optimal predictive distribution $u_*$ satisfies
$$\mathrm{KL}\Big(f^{(z)} \,\Big\|\, \sum_{i \in \mathbb{Z}} u_*^{(i)} f^{(i)}\Big) \begin{cases} = C(f) & \text{if } u_*^{(z)} > 0, \\ \le C(f) & \text{if } u_*^{(z)} = 0, \end{cases}$$
where $C(f)$ is the channel capacity. The difficulty in solving this problem comes from determining whether or not $u_*^{(z)} > 0$. In the context of preference elicitation, when fixing the number of offered alternatives $m$, it is critical for every alternative to contribute to reducing uncertainty. However, having a noise channel where $u_*^{(z)} = 0$ implies that it is more efficient to learn without offering alternative $z$.
To be specific, we say that a noise channel parameterized by $f = \{f^{(z)} : z \in \mathbb{Z}\}$ is admissible if there exists some $u_* \in \mathrm{int}(\Delta^m)$ such that for all $z \in \mathbb{Z}$,
$$\mathrm{KL}\Big(f^{(z)} \,\Big\|\, \sum_{i \in \mathbb{Z}} u_*^{(i)} f^{(i)}\Big) = C$$
for some $C > 0$. Otherwise, we say the noise channel is inadmissible. Admissibility is equivalent to the existence of a predictive distribution where all $m$ alternatives are used to learn a user’s preferences. For pairwise comparisons, any noise channel where $f^{(1)}$ and $f^{(2)}$ differ on a set of non-zero Lebesgue measure is admissible. Otherwise, for $m > 2$, there are situations when $u_*^{(z)} = 0$ for some $z \in \mathbb{Z}$, and Lemma 3.1 provides one of them. In particular, if one density $f^{(z)}$ is a convex combination of any of the others, then the optimal predictive distribution will always have $u_*^{(z)} = 0$.
Suppose the noise channel is parameterized by densities , and its corresponding optimal predictive distribution is . If there exists for such that and for all , then . Suppose . Take any such that . We will construct a such that and . Define as
It is easy to verify that . But since entropy is strictly concave, we have . Consequently,
and therefore, one can always increase the objective value of by setting . Of course, there are other cases where the predictive distribution is not strictly positive for every . For example, even if one of the densities is an approximate convex combination, the optimal predictive distribution would likely still have . In general, there is no easy condition to check whether or not . However, our problem assumes is relatively small, and so it is simpler to find and confirm the channel is admissible. In the case of a discrete noise channel, Shannon and Weaver (1948) gave an efficient way to do this by solving a relaxed version of the concave maximization problem, provided that the transmission matrix is invertible.
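[Shannon] For a discrete noise channel parameterized by a non-singular transmission matrix $P$, let
$$q_* = \frac{\exp\big(-\xi^{-1} P^{-1} h(P)\big)}{e^T \exp\big(-\xi^{-1} P^{-1} h(P)\big)}, \tag{13}$$
where $\xi = \log_2 e$, $e$ is the vector of all ones, and the exponential is taken component-wise. If there exists $u \ge 0$ such that $u^T P = q_*^T$, then $u$ is the optimal predictive distribution, meaning that $\nabla_u \varphi(u; P) = \beta e$ for some $\beta \in \mathbb{R}$, and $\varphi(u; P) = C(P)$, and the noise channel is admissible. Otherwise, there exists some $z \in \mathbb{Z}$ such that $u_*^{(z)} = 0$, and the noise channel is inadmissible. Using (8) and Lagrangian relaxation,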
Differentiating with respect to and setting equal to zero yields
and since is invertible,
since for all stochastic matrices . Algebra yields
where is some positive constant. We require , and if , it must be that
implying that if and only if . Hence, is a normalizing constant that allows . Thus, we can set as in (13), and now it is clear that . We can invert to find an explicit form for , but is only feasible for the original optimization problem if it is non-negative. However, if there exists some such that , then the optimal solution to the relaxed problem is feasible for the original optimization problem, proving the theorem.
If there does not exist some $u \ge 0$ that satisfies the condition of the theorem for the distribution defined in (13), then the non-negativity constraint would be tight, and $u_*^{(z)} = 0$ for some $z$. In this case, the noise channel is inadmissible, because it implies asking the optimal question under entropy pursuit would assign zero probability to one of the alternatives being the model-consistent answer, and thus poses a question of strictly fewer than $m$ alternatives to the user.
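The relaxed problem for a non-singular transmission matrix admits a direct computation, sketched below. The formulas are a reading of the stationarity conditions under base-2 logarithms and should be treated as assumptions: the optimal signal distribution is taken proportional to $2^{-P^{-1} h(P)}$ component-wise, the capacity is the base-2 logarithm of the normalizing constant, and the predictive distribution is recovered by solving a linear system. The function name and example channel are hypothetical, and the returned capacity is only meaningful when the recovered distribution is non-negative.

```python
import numpy as np

def shannon_closed_form(P):
    """Solve the relaxed capacity problem for a non-singular row-stochastic P.

    Returns (u, capacity, feasible), where feasible indicates that u is strictly
    positive, so that the relaxed solution solves the original problem and every
    alternative is used.
    """
    P = np.asarray(P, dtype=float)
    safe = np.where(P > 0, P, 1.0)                 # log2(1) = 0 handles zero entries
    row_h = -np.sum(P * np.log2(safe), axis=1)     # entropies of the rows of P
    w = np.exp2(-np.linalg.solve(P, row_h))        # unnormalized signal distribution
    q = w / w.sum()                                # candidate optimal signal distribution
    u = np.linalg.solve(P.T, q)                    # predictive distribution with u^T P = q^T
    capacity = float(np.log2(w.sum()))
    feasible = bool(np.all(u > 0))
    return u, capacity, feasible

# Hypothetical symmetric-style channel with m = 3.
P = 0.8 * np.eye(3) + 0.2 / 3
u_star, capacity, feasible = shannon_closed_form(P)
```

For this symmetric example the recovered predictive distribution is uniform, and the row-wise Kullback-Leibler divergences from the signal distribution all equal the capacity.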
The condition of $P$ being non-singular has an enlightening interpretation. Having a non-singular transmission matrix implies there would be no two distinct predictive distributions for the true response $\mathbf{Z}_k(X_k)$ that yield the same predictive distribution over the signal $\mathbf{Y}_k(X_k)$. This is critical for the model to be identifiable, and prevents the previous problem of having one row of $P$ being a convex combination of other rows. The non-singular condition is reasonable in practice: it is easy to verify that matrices of the form $P = \alpha I + (1 - \alpha)\, e v^T$ for some $v \in \Delta^m$ and $\alpha \in [0, 1]$ are invertible if and only if $\alpha > 0$. Transmission matrices of this type are fairly reasonable: with probability $\alpha$, the user selects the “true response,” and with probability $1 - \alpha$, the user selects from discrete distribution $v$, regardless of $\mathbf{Z}_k(X_k)$. The symmetric noise channel is a special case of this. In general, if one models $P = \alpha I + (1 - \alpha) Q$, where $Q$ is an $m \times m$ stochastic matrix, then $P$ is non-singular if and only if $-\alpha / (1 - \alpha)$ is not an eigenvalue of $Q$, which guarantees that $P$ is invertible when $\alpha > 1/2$. Nevertheless, regardless of whether or not $P$ is singular, it is relatively easy to check the admissibility of a noise channel, and consequently conclude whether or not it is a good modeling choice for the purpose of preference elicitation.
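The invertibility claims for mixture-type transmission matrices can be checked numerically. The sketch below assumes the form "identity with weight alpha plus a stochastic matrix with weight one minus alpha"; the helper name, the random draws, and the value of `alpha` are hypothetical. When the stochastic matrix has identical rows, the eigenvalues of the mixture are alpha (repeated) and one, so the determinant is a power of alpha. For a general stochastic matrix and alpha above one half, every eigenvalue of the mixture has magnitude at least twice alpha minus one.

```python
import numpy as np

def mixture_channel(alpha, Q):
    """Transmission matrix alpha * I + (1 - alpha) * Q for a row-stochastic Q."""
    m = Q.shape[0]
    return alpha * np.eye(m) + (1 - alpha) * Q

rng = np.random.default_rng(1)
m, alpha = 4, 0.6
v = rng.dirichlet(np.ones(m))
Q_rank_one = np.tile(v, (m, 1))                  # every row equals v
Q_general = rng.dirichlet(np.ones(m), size=m)    # arbitrary row-stochastic matrix

P_rank_one = mixture_channel(alpha, Q_rank_one)
P_general = mixture_channel(alpha, Q_general)

det_rank_one = np.linalg.det(P_rank_one)         # expected to be alpha ** (m - 1)
smallest_eig = np.min(np.abs(np.linalg.eigvals(P_general)))
det_degenerate = np.linalg.det(mixture_channel(0.0, Q_rank_one))   # alpha = 0 case
```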
3.2 Sensitivity Analysis
In reality, we cannot always fabricate alternatives so that the predictive distribution is exactly optimal. In many instances, the set of alternatives $\mathbb{X}$ is finite. This prevents us from choosing a question $X_k$ such that $u_k(X_k) = u_*$ exactly. But if we can find a question that has a predictive distribution that is sufficiently close to optimal, then we can reduce the entropy at a rate that is close to the channel capacity. Below, we elaborate on our definition of sufficiently close by showing $\varphi$ is strongly concave, using the Hessian to construct quadratic upper and lower bounds on the objective function $\varphi$.
If there exists such that and (i.e., if the noise channel is admissible), then there exist constants such that
Further, suppose transmission matrix encoding a discrete noise channel is non-singular, and has minimum probability , maximum probability , channel capacity and distribution such that