The task of selecting from a collection of items one which is in some sense optimal for a specific user is a classical AI task. Several algorithms have been explored to perform such tasks and automated recommendations are today included in most modern e-commerce websites [he2016interactive, sarwar2000analysis]. In standard setups, no interaction with users is considered, and the recommendation system bases its decision on some background information about the user, historical records of her choices and of the choices of other similar users, and, more recently, automatically inferred contextual information [adomavicius2011context, lu2015recommender].
Yet, modern technologies such as chat-bots or personal assistants need systems able to support and model dynamic and sequential interactions, which leads to a substantial redesign of the traditional approaches. Here we focus on such a newer class of recommendation systems, called here conversational, where by conversation we mean a sequence of dynamically customized interactions between the user and the system that takes place before the latter returns the recommendation. This class of systems exploits knowledge-based recommendation techniques and is based on a strong interaction with the users. Therefore, this type of recommendation system should be considered when the goal is to support the user in purchasing a high-involvement product. In such a situation, indeed, the user wants to be involved in the decision process, and thus is not bothered by the need to interact with the system [jugovac2017interacting].
To achieve that, we take inspiration from existing approaches in the field of computer testing [wainer2000computerized, butz2006web, mangili2017reliable], whose goal is to determine the skills of a student on the basis of the answers to the questions of a test. In particular, we focus on adaptive approaches in which the system selects the next question to ask the student from a given set of questions on the basis of the previous answers, using information-theoretic scores that are also used to decide when the test should end. This can be easily achieved with generative probabilistic models such as, for instance, Bayesian networks [Millan2002], which are sequentially updated any time the answer to a new question is collected. The adaptive concept is converted here into a conversational approach which lends itself to the (future) development of a dynamic generation of questions and richer interaction models.
The conversational procedure is applied to the recommendation system of Stagend (stagend.com), an online platform that allows organizers of events (playing here the role of the users) to book artists (playing here the role of the items) for their events. The goal of the recommendation system is therefore to select the most suitable artist for a particular event based on a set of needs elicited by the user. Results of some simulations based on data from the Stagend platform are presented to show the advantages of the conversational approach over the earlier procedure based on a static questionnaire, and to discuss possible improvements.
Consider a recommendation system based on a catalogue including items, say . The system is supposed to support the user in selecting a single item from the catalogue on the basis of a conversational process. We assume that, at the end of the conversation, the user always picks one item from . The uncertain quantity denotes the element of to be picked by the user. We consider a Bayesian setup and model the subjective probabilities of picking the different items before the start of the conversational process as a prior mass function .
We call questions the interactions between the user and the system. A generic question is denoted as an uncertain quantity taking its possible values, to be called answers, from . Here we assume a finite set of possible answers for each question, say . As a model of the relation between and , we might be able to assess a conditional probability table whose columns are indexed by the answers and whose rows are associated with the items. After an answer
is collected, the probability mass function over the items can be updated by Bayes' theorem, i.e.,
This shows that the impact of an answer to the choice is only based on the relative proportions of the values of for each . In particular, setting an element of the conditional probability table equal to zero implies a logical constraint for which the answer associated with column makes the choice of the item associated with row impossible.
Conversely, decisions of the recommendation system are based on the conditional probability which may be strongly influenced by the prior . Therefore, to better elicit the behaviour of the model, understand its underpinning assumptions and foresee the system behaviors, it will often be more natural to consider the joint probability , rather than the conditional one . In fact, as the probabilities of the joint events completely describe the model, they allow on one hand to derive the probabilities of interest and
and, on the other hand, to reason about the relative weight of the posterior probabilities assigned to different items after conditioning on the answer, as, for strictly positive probabilities, it holds:
for each and .
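For reference, the Bayes update and the ratio property discussed above can be written as follows. The symbols are assumptions made here for illustration: $X$ denotes the item variable, $Q$ a question, $x$ and $x'$ two items, and $q$ an answer. The second identity holds for strictly positive probabilities and follows because both posteriors share the same normalizing constant $P(q)$.

```latex
\begin{align}
P(x \mid q) &= \frac{P(q \mid x)\, P(x)}{\sum_{x'} P(q \mid x')\, P(x')}, \\
\frac{P(x \mid q)}{P(x' \mid q)} &= \frac{P(x, q)}{P(x', q)} .
\end{align}
```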
We start by considering a static approach to the elicitation of the user needs, based on a list of questions , called here a questionnaire. Selection of the optimal item to be suggested to a user at the end of the conversation is based on the conditional probability of each item
given the list of answers collected, hereafter described by the vector. In order to improve the user experience it is desirable to minimize the number of questions necessary to identify the optimal suggestion. To this end, the list of questions in should be built dynamically as the conversation proceeds. Such a customized list of questions is called hereafter a conversation. Before discussing this dynamic approach, we present the setup that allows the computation of , be it the output of a questionnaire or of a conversation. To model a conversation we initially formulate a naive-like assumption stating the conditional independence of the questions given the item, i.e.,
More general models will be discussed later on in this paper. Under this assumption, if the conditional probability tables associated with each are available, the probability can be obtained by recursively applying Bayes' theorem to update the probability after the first answers by conditioning also on the next answer .
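The recursive update under the conditional independence assumption can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used on the platform: the item names, question names and probability values below are hypothetical, and the function names are introduced here only for the example.

```python
def bayes_update(prior, likelihood):
    """One Bayes step: prior and likelihood are dicts item -> probability,
    where likelihood[item] is P(observed answer | item)."""
    unnormalized = {x: prior[x] * likelihood[x] for x in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("the answer is incompatible with every item")
    return {x: p / total for x, p in unnormalized.items()}


def posterior_after_answers(prior, cpts, answers):
    """Apply bayes_update once per collected answer.

    cpts: question -> item -> answer -> P(answer | item)
    answers: list of (question, answer) pairs, in the order they were collected
    """
    posterior = dict(prior)
    for question, answer in answers:
        # Missing entries are read as zero, i.e. as logical incompatibility.
        likelihood = {x: cpts[question][x].get(answer, 0.0) for x in posterior}
        posterior = bayes_update(posterior, likelihood)
    return posterior


# Hypothetical catalogue with three items and two questions.
prior = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}
cpts = {
    "Q1": {"a": {"yes": 0.9, "no": 0.1}, "b": {"yes": 0.5, "no": 0.5}, "c": {"no": 1.0}},
    "Q2": {"a": {"high": 0.2, "low": 0.8}, "b": {"high": 0.6, "low": 0.4},
           "c": {"high": 0.7, "low": 0.3}},
}
posterior = posterior_after_answers(prior, cpts, [("Q1", "yes"), ("Q2", "high")])
print(posterior)
```

With these numbers item "c" is ruled out by the first answer, and the remaining mass splits into roughly 0.375 for "a" and 0.625 for "b". Because of the independence assumption, the order in which the answers are processed does not change the final posterior.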
To open the discussion, let us present a toy example, inspired by the Stagend case study reported in the final part. The example will be used throughout the paper to illustrate the main features of the proposed method.
Table 1. Background information about the entertainers.
|DJ available for all types of events|
|Band available for weddings and corporate events|
|Magician available for birthdays and parties for kids|

Table 2. Questions and possible answers.
|Which entertainment are you looking for?|
|Which event are you organizing?|
|Party for kids|
Elicitation by Logical Compatibility
Consider the background information about the entertainers in Table 1 and the questions (and possible answers) in Table 2. An underlying notion of logical compatibility can therefore be formalized as follows. Given an item and an answer to a question , we say that is compatible with , and we denote this as , if satisfies all the logical requirements implied by . As an example , i.e., the second entertainer in the catalogue is compatible with a corporate event, but , i.e., the third entertainer is not a musician. Given , we can therefore define an indicator function for such a compatibility concept for each pair and as follows:
Following the above definition of compatibility, each item can be characterized by the set of compatible answers for each of the possible questions. For instance, artist in Example 1 is described by set of answers to and to .
In the probabilistic model proposed, the notion of compatibility is translated into the assumption that the joint probability and, following Equation (1), also the conditional probability , have the same support as , i.e., they are zero whenever . To elicit , and hence , when and are compatible, we need additional sets of assumptions. In the remaining part of this section we discuss strategies to tackle this problem and analyze the underlying assumptions and consequences in terms of posterior inferences.
Items Compatible with a Single Answer
The above discussed notion of logical compatibility might be sufficient to elicit a conditional probability table without any additional assumption in the special case in which each item is compatible only with a single answer. Let denote such answer in the case of item , i.e., and for each . Under these assumptions, the number of non-zero elements of the joint mass function is equal to the catalogue cardinality . We assume all these non-zero probabilities to be equal, this meaning for each , while the corresponding conditional probabilities have value one (e.g., see Table 3). After conditioning on the observed answer , all the items compatible with will have the same posterior probability, whereas all incompatible items will receive zero probability mass. Therefore, in this simple setting, the Bayesian approach proposed implements a logical filter, making the items incompatible with the answers impossible. Such a situation occurs in the following example.
In the setup of Example 1, consider the conditional probability table for . Each entertainer is consistent with only a single answer to this question. The corresponding quantification of the conditional probability table is depicted in Table 3. If we assume, for instance, that a new user answers 1 to the first question , we can update the probabilities assigned to each entertainer by Equation (1) and conclude a posterior probability equal to one for artist and zero for the other two entertainers.
Items Compatible with Multiple Answers
Consider a question such that the assumptions in the previous section are not satisfied and there are items compatible with more than a single answer. In this case, the elicitation of the joint mass function requires further assumptions. Below we describe two alternative strategies for such elicitation. It is easy to verify that both strategies reduce to the elicitation discussed in the previous case for items compatible with a single answer, which is a special case of the one considered here.
Uniform Joint Strategy (UJS).
Let us assume, as in the previous section, that all the compatible item/question combinations, i.e., all the pairs , with and , such that have the same probability. If denotes the number of these pairs, this assumption corresponds to assigning . Let us also define the versatility of an item with respect to a question as the number of answers of that are compatible with the item . Then, . From the joint mass function , we obtain and for each .
It is easy to note that, before the question is asked, the more versatile an item is, the higher is its probability of being selected. However, once the desired characteristic of the item has been elicited by the user, all items compatible with it will be assigned the same posterior probability.
In the setup of Example 1, UJS can be applied to the elicitation of the conditional probability table for question . This corresponds to the joint probability mass function in Table 4. By summing over the rows we obtain the corresponding marginal probabilities for the entertainers: , and , while the corresponding conditional probability table is the one depicted in Table 5. Note that, before the question, entertainer has twice the probability of the others to be chosen, as it is the most versatile one. However, once we collect the answer, e.g., the two compatible entertainers, i.e., and , receive the same conditional probability:
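The UJS numbers of this example can be reproduced with the following sketch. The compatibility sets are read off the entertainer descriptions of the toy example; the answer labels and the function names are assumptions introduced here for illustration.

```python
from fractions import Fraction

# Answers to "Which event are you organizing?" compatible with each entertainer.
# The answer labels are assumed from the entertainer descriptions.
compatible = {
    "DJ": ["wedding", "corporate", "birthday", "kids_party"],
    "Band": ["wedding", "corporate"],
    "Magician": ["birthday", "kids_party"],
}


def ujs(compatible):
    """Uniform Joint Strategy: every compatible (item, answer) pair gets mass 1/k."""
    k = sum(len(answers) for answers in compatible.values())
    joint = {(x, q): Fraction(1, k) for x, answers in compatible.items() for q in answers}
    # Marginal over items: versatility of the item divided by k.
    prior = {x: Fraction(len(answers), k) for x, answers in compatible.items()}
    # Conditional P(answer | item) = joint / marginal.
    cpt = {(x, q): p / prior[x] for (x, q), p in joint.items()}
    return joint, prior, cpt


def posterior_given_answer(joint, items, answer):
    """P(item | answer), obtained by normalizing the column of the joint."""
    column = {x: joint.get((x, answer), Fraction(0)) for x in items}
    total = sum(column.values())
    return {x: p / total for x, p in column.items()}


joint, prior, cpt = ujs(compatible)
post_wedding = posterior_given_answer(joint, compatible, "wedding")
print(prior)
print(post_wedding)
```

Exact fractions are used so that the result can be compared directly with the tables: the DJ has prior 1/2 and the other two entertainers 1/4 each, while after the answer "wedding" the DJ and the Band share the posterior mass equally.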
Uniform Prior Strategy (UPS).
As shown by Example 3, UJS might produce non-uniform values for the marginal probability mass function over the items. Alternatively, we can impose such a “prior” uniformity, i.e., for each
. If we also assume that, given the item, the probability mass is uniformly distributed over all the compatible answers, we have and, for the conditional probability, . A consequence of this model is that compatible items may have different posterior probabilities, with the more versatile ones being less probable than the less versatile ones. Such a model can then be more suitable to situations where specialization of the item can be considered a positive feature.
Here, still in the setup of Example 1, we apply UPS to the elicitation of the conditional probability table for question . By definition, the prior is uniform, i.e., for all entertainers, and the conditional probability table is the same as before. In this case, all items have the same prior probability, whereas, after collecting answer , being less versatile, entertainer is assigned a larger probability than , coherently with the fact that the joint distribution assigns a larger probability to the joint event than to the event :
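The effect of UPS on the same toy question can be checked with a short sketch. As before, the answer labels and function names are assumptions made for illustration; the compatibility sets come from the entertainer descriptions.

```python
from fractions import Fraction

# Answers to "Which event are you organizing?" compatible with each entertainer
# (labels assumed from the entertainer descriptions).
compatible = {
    "DJ": ["wedding", "corporate", "birthday", "kids_party"],
    "Band": ["wedding", "corporate"],
    "Magician": ["birthday", "kids_party"],
}


def ups(compatible):
    """Uniform Prior Strategy: uniform prior over items, and for each item a
    uniform conditional over its compatible answers."""
    n = len(compatible)
    prior = {x: Fraction(1, n) for x in compatible}
    cpt = {(x, q): Fraction(1, len(answers))
           for x, answers in compatible.items() for q in answers}
    joint = {(x, q): prior[x] * p for (x, q), p in cpt.items()}
    return joint, prior, cpt


def posterior_given_answer(joint, items, answer):
    """P(item | answer), obtained by normalizing the column of the joint."""
    column = {x: joint.get((x, answer), Fraction(0)) for x in items}
    total = sum(column.values())
    return {x: p / total for x, p in column.items()}


joint, prior, cpt = ups(compatible)
post_wedding = posterior_given_answer(joint, compatible, "wedding")
print(post_wedding)
```

After the answer "wedding" the Band, which is compatible with two answers only, receives posterior 2/3, while the more versatile DJ receives 1/3. This mirrors the joint masses 1/6 and 1/12 assigned to the two events.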
Coping with Multiple Questions
In general, the complete list of possible questions can be divided into questions for which we assume UJS, and a set of questions for which we assume UPS (if the single answer compatibility holds for a question, this can be arbitrarily included in or ). In such a general case, because of Equation (3), the joint mass function is given by:
where, coherently with the UJS and the UPS strategies described above, we have set
from which it follows that for all and , and
In the Stagend example the complete model can be graphically represented by Figure 1, where nodes represent variables, edges describe relations of conditional dependence and the absence of a path connecting two nodes corresponds to conditional independence of the corresponding variables. The conditional probability tables for and are the ones in Tables 3 and 5, respectively, whereas the prior is given in Table 7 and depends on the strategy adopted for modeling question .
Shaping the Conversation.
In classical recommendation systems the assessment of the user preferences with respect to the different items is based on a static block of background information about the user. Such information can be already available in the system or directly obtained from the user after some kind of reduced interaction, e.g., a predefined questionnaire. However, as discussed in the introduction, in modern setups the process of collecting information about the user preferences with respect to the catalogue should be based on a sequence of dynamic interactions. In this view, the questionnaire approach gives way to a conversational process taking the form of a personalized sequence of questions dynamically picked from a larger set of questions. The prior probability mass function is thus sequentially updated each time a new answer is collected, and the updated probability is used to select the most informative next question. The choice among a possibly huge set of candidate questions/interactions can be driven by information-theoretic criteria, making any sequence potentially different from the others. In particular, taking inspiration from the literature in the field of adaptive testing, we pick the question that minimizes the conditional entropy (and hence maximizes the expected information gain). This choice is the most natural one as it allows reducing the entropy of the mass function in order to concentrate most of the probability mass on a limited number of items. More formally, the adaptive conversational process selects the question such that:
where and is the entropy of the posterior mass function over after the answer to the question , and is the set of questions the system can choose from.
This procedure is iterated after any answer and the conversation ends if the posterior entropy falls below a fixed threshold. As the entropy of a mass function is defined as , a natural threshold is the entropy of a mass function over which is uniform on items, while the other ones have zero probability, i.e., . Setting this value in the stopping rule forces the system to halt when most of the posterior probability mass is concentrated on the most probable items.
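The selection and stopping rules can be sketched as follows. This is an illustrative sketch under stated assumptions: natural logarithms are used, the stopping test is non-strict, the parameter m (number of items on which the reference uniform distribution is concentrated) is named here for convenience, and the answer labels of the toy questions are assumed.

```python
import math


def entropy(pmf):
    """Shannon entropy (natural log) of a dict item -> probability."""
    return -sum(p * math.log(p) for p in pmf.values() if p > 0)


def conditional_entropy(prior, cpt):
    """Expected posterior entropy over items after asking one question.

    cpt: item -> answer -> P(answer | item)
    """
    answers = {a for row in cpt.values() for a in row}
    expected = 0.0
    for a in answers:
        joint = {x: prior[x] * cpt[x].get(a, 0.0) for x in prior}
        p_answer = sum(joint.values())
        if p_answer > 0:
            posterior = {x: p / p_answer for x, p in joint.items()}
            expected += p_answer * entropy(posterior)
    return expected


def next_question(prior, cpts, candidates):
    """Pick the candidate question with minimal conditional entropy."""
    return min(candidates, key=lambda q: conditional_entropy(prior, cpts[q]))


def should_stop(posterior, m):
    """Stop when the entropy is at most that of a uniform mass over m items."""
    return entropy(posterior) <= math.log(m)


# Toy example: uniform prior, one deterministic question and one UPS-style question.
prior = {"DJ": 1 / 3, "Band": 1 / 3, "Magician": 1 / 3}
cpts = {
    "entertainment": {"DJ": {"dj": 1.0}, "Band": {"band": 1.0}, "Magician": {"magician": 1.0}},
    "event": {
        "DJ": {"wedding": 0.25, "corporate": 0.25, "birthday": 0.25, "kids_party": 0.25},
        "Band": {"wedding": 0.5, "corporate": 0.5},
        "Magician": {"birthday": 0.5, "kids_party": 0.5},
    },
}
best = next_question(prior, cpts, ["event", "entertainment"])
print(best, conditional_entropy(prior, cpts["event"]))
```

Here the question about the type of entertainment is selected, since any of its answers identifies a single entertainer and the expected posterior entropy is zero, whereas the question about the event leaves two candidates after every answer.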
The approach discussed so far allows us to properly model the relation between items and questions and dynamically update the probabilities during the conversation. Yet, such an approach is suitable only for small catalogues, as it implicitly requires the assessment of the logical compatibility relations between each item and each question. As an example, in the case study under consideration, the catalogue includes more than three thousand items (i.e., entertainers), which prevents a straightforward application of the ideas outlined above. To bypass such a limitation, we have already introduced a characterization of each item based on the set of compatible answers to each question. Here, we formalize this characterization and make it independent from the set of questions by introducing the concept of item properties. A property is a random quantity taking its values in a finite set . Let denote the joint set of relevant properties used to characterize an item. We assume that this set of properties is a sufficient description of the item and, consequently, the questions and the item variables are conditionally independent given the properties. Moreover, we initially assume that each question refers to a single property and that questions are conditionally independent given their relative property, as well as properties are conditionally independent given the item . An example of this augmented setup is reported here below.
In the same setup of Example 1, consider two properties and modeling, respectively, the type of entertainer and the type of event. Question refers to property , while refers to . Adopting the Markov condition for directed graphs [koller2009], the above discussed conditional independence assumptions correspond to the graph in Figure 2.
The simplest way to define properties consists in regarding them as latent (i.e., not directly observable) clones of the questions, which are instead intended as manifest (i.e., observable) variables. In other words, if is a question, we define a property whose possible states are in one-to-one correspondence with the answers and we set the logical constraint meaning that if and are the elements in the correspondence, and zero otherwise. In terms of compatibility functions, this means where and are the compatible states of and . Under the above assumptions, the marginalization of the properties simply returns a model equivalent to the one defined in the previous section. More expressive setups can be obtained by specifying properties that are not latent clones of the questions they are associated with. This allows items to be described by general properties, streamlining the elicitation of compatibility relations between items and questions. For instance, a state of property can imply both logical requirements of the answers and . Such a property would be compatible with both answers, but equivalent to none of them. This situation can be modeled by conditional probabilities , strictly larger than zero.
In the Stagend platform, it is well known that the organizer of an event may be well satisfied with a band even if she asked only for a musician. This can be modeled by assigning a positive value to the probability of asking for a musician when the best matching item is a band (i.e., corresponding to Band).
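Since questions and items are assumed conditionally independent given the properties, the table relating a question to the items is recovered by summing over the property states. The sketch below shows this marginalization for a single property, first with a latent clone and then with a softened table in the spirit of the band/musician example. All numerical values and labels are hypothetical.

```python
def answer_given_item(prop_given_item, answer_given_prop):
    """P(answer | item) = sum over property states of
    P(answer | state) * P(state | item)."""
    table = {}
    for item, prop_pmf in prop_given_item.items():
        row = {}
        for state, p_state in prop_pmf.items():
            for answer, p_answer in answer_given_prop[state].items():
                row[answer] = row.get(answer, 0.0) + p_state * p_answer
        table[item] = row
    return table


# Each entertainer has exactly one type (logical constraint).
prop_given_item = {"DJ": {"dj": 1.0}, "Band": {"band": 1.0}}

# Latent clone: the answer coincides with the property state.
clone = {"dj": {"dj": 1.0}, "band": {"band": 1.0}}

# Softened table (hypothetical values): a user best served by a band
# may still ask for a musician.
softened = {"dj": {"dj": 1.0}, "band": {"band": 0.8, "musician": 0.2}}

print(answer_given_item(prop_given_item, clone))
print(answer_given_item(prop_given_item, softened))
```

With the clone table the result is the deterministic table of the model without properties, while with the softened table the answer "musician" no longer rules out the Band.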
Moreover, the possibility of grouping items based on their properties can be exploited to learn model parameters with less fragmentation in the data. E.g., the probability
can be estimated from all selected items having property; instead, to estimate one should consider only the cases where item has been chosen, which can be very few or even zero for new items.
Finally, more complex relations between questions and properties or property values can be easily modeled. Some examples are: (i) multiple questions referring to the same property, modeled by assuming conditional independence between the questions given the property, which simply requires eliciting independently the probabilities and for all possible states of and and ; (ii) multiple properties for the same question: if is the set of properties associated with , this requires eliciting the probabilities for all values of and all joint combinations of states for the properties in ; (iii) questions relevant only for some items: e.g., the Stagend question "Do you want the musician to play any particular instrument?" should define a preference among musicians without changing the probability of the property musician with respect to band, DJ and entertainer, and can be modeled by assuming for all items for which the question is not relevant.
In the Stagend example, the most requested entertainers for weddings are musicians and DJs. To model this we remove the assumption of marginal independence between the properties Type of event and Type of entertainer, i.e., and . In the graphical language of Figure 2, this corresponds to adding an edge connecting the two properties (Figure 3). To define the corresponding model, the elicitation of the conditional probabilities for all possible values of conditioned over all possible combinations of values is required. Then, in this toy example, where only three entertainers have been considered, adding the above dependency between properties requires the elicitation of 36 parameter values, which can become many more in real applications. It would be much easier to reason about the marginal probability of the type of entertainer given the type of event.
Eliciting Dependencies between Properties
In the previous section we noticed that adding dependencies between properties increases the number of model parameters (e.g., conditional probabilities), and this prevents a fast elicitation for huge catalogues.
As we did for the conditional probability tables of the questions (i.e., ), we could better perform a single elicitation for all the items with the same value for the properties of interest, that is, focusing only on properties (and not on items) in the knowledge-based elicitation of probabilities, and grouping all items with the same property values when learning from data.
This can be achieved as follows. Assume we wish to model the conditional probability table for property given a set of parent properties , while the remaining properties are assumed independent from and denoted by . The procedure for eliciting the (possibly huge) number of conditional probabilities for all values of and all joint combinations consists of two steps. First, the conditional probabilities marginalized over are elicited based on prior knowledge, data or both, while the conditional probabilities , where is the vector of all property values, are elicited based on logical constraints. Afterwards, the probabilities are derived from the relation:
while the prior is derived from
Concerning the conditional , it is derived from the joint which has the same form as in Equation (9) with and replaced by the vectors of properties and for which we assume, respectively, UJS and UPS. is then given by:
where is the number of items compatible with the combination of property values in . Notice that in case no item is compatible with the probability cannot be derived from Equation (16). This inconsistency arises from the fact that only the vectors of property values that are compatible with at least one of the items in the catalogue are, indeed, possible. Let be the set of all possible for which . To solve the above inconsistency and account for the logical impossibility of all , an initial, possibly inconsistent, elicitation or empirical estimate of the probability of the joint properties needs to be revised as follows:
Notice that the probability of each combination of property values is uniformly distributed over all items that are compatible with that combination.
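One way to read the revision step is sketched below: combinations of property values with no compatible item are set to zero, the remaining joint probabilities are renormalized, and the mass of each possible combination is spread uniformly over its compatible items. The normalization over the joint is an assumption of this sketch, and the combinations, items and numbers are hypothetical.

```python
def revise_joint(p_props, items_by_props):
    """Zero out impossible property combinations and renormalize the others.

    p_props: combination -> initially elicited probability
    items_by_props: combination -> list of items compatible with it
    """
    possible = {c for c, items in items_by_props.items() if items}
    total = sum(p for c, p in p_props.items() if c in possible)
    return {c: (p / total if c in possible else 0.0) for c, p in p_props.items()}


def item_given_props(items_by_props):
    """Uniform distribution over the items compatible with each possible combination."""
    return {c: {x: 1.0 / len(items) for x in items}
            for c, items in items_by_props.items() if items}


# Hypothetical elicitation over (type of entertainer, type of event).
p_props = {("band", "wedding"): 0.3, ("band", "birthday"): 0.2, ("dj", "wedding"): 0.5}
items_by_props = {
    ("band", "wedding"): ["Band"],
    ("band", "birthday"): [],          # no item in the catalogue: impossible
    ("dj", "wedding"): ["DJ_1", "DJ_2"],
}

revised = revise_joint(p_props, items_by_props)
conditional = item_given_props(items_by_props)
# Prior over items implied by the revised joint and the uniform conditional.
item_prior = {}
for combo, row in conditional.items():
    for item, p in row.items():
        item_prior[item] = item_prior.get(item, 0.0) + revised[combo] * p
print(revised)
print(item_prior)
```

In this illustration the mass 0.2 initially assigned to the impossible combination is redistributed proportionally, giving 0.375 and 0.625 to the two possible ones, and the two DJs share the mass of their combination equally.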
In the Stagend example the two steps above correspond to the elicitation of the conditional probabilities of the model in Figure 4. The only difference compared to the desired model (Figure 3) is the direction of the edges connecting to and . Therefore, the two models define the same set of independence assumptions. Assume that the relation between properties can be modeled by the conditional probabilities in the square brackets of Table 8, which describe a situation where the preferred types of entertainers are DJs and musicians for weddings, bands and entertainers for corporate events, DJs and entertainers for birthdays, DJs and bands for parties. Moreover, we assume for the prior probabilities , , according to which weddings are twice as popular as any other type of event. The corresponding joint model, however, does not comply with the logical impossibility of all joint states that are not compatible with any item, which requires such joint states to have zero probability. As all values of are compatible with at least one item, we focus on . Cells corresponding to impossible joint states are highlighted in grey in Table 8. Their values are set to zero and the re-normalized probabilities are shown in the table next to the initial ones.
From this elicitation the prior probabilities , , and the conditional follow. Therefore, in this simple case where each artist is compatible with one single value of property , we have that reduces to the zero-entropy conditional based only on logical constraints, whereas the defined dependence between and only affects the prior distribution over items. If a further entertainer performing at weddings both as a DJ and as a band were included in the catalogue, it would result in strictly positive probabilities both for equal to band and DJ.
Learning from Data
The above discussed procedure for the elicitation of the model parameters in the recommendation system is based on structural judgements about the logical compatibility between items, properties and answers, possibly complemented by the judgements of a domain expert about the relations between the properties (irrespective of the items). Yet, historical data involving observations of the parameters to be quantified might be available. Following a Bayesian paradigm, we can naturally merge these two sources of information by using the outputs of the elicitation process as the parameters of a Dirichlet prior to be combined with the multinomial likelihood of the observed data. This has the potential of further increasing the discriminative power of the system and the quality of the recommendations.
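A minimal sketch of this combination is given below, using the standard conjugate update of a Dirichlet prior with multinomial counts. The parameter `strength` (an equivalent sample size that scales the elicited probabilities into Dirichlet parameters) is an assumption of this sketch, as are the numbers in the example.

```python
def dirichlet_posterior_mean(elicited, counts, strength=1.0):
    """Posterior mean of a multinomial parameter under a Dirichlet prior.

    elicited: state -> elicited probability (prior mean)
    counts: state -> number of observations in the historical data
    strength: equivalent sample size given to the elicited prior (assumed)
    """
    alpha = {s: strength * p for s, p in elicited.items()}
    total = sum(alpha.values()) + sum(counts.get(s, 0) for s in alpha)
    return {s: (alpha[s] + counts.get(s, 0)) / total for s in alpha}


# Hypothetical example: elicited probabilities for two answers and observed counts.
elicited = {"wedding": 0.5, "corporate": 0.5}
counts = {"wedding": 3, "corporate": 1}
estimate = dirichlet_posterior_mean(elicited, counts, strength=4.0)
print(estimate)
```

With a prior strength of 4 and four observations, the estimate lies halfway between the elicited value 0.5 and the observed frequency 0.75 for "wedding", i.e., 0.625. A larger strength keeps the estimate closer to the elicited values, a smaller one lets the data dominate.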
For an empirical validation we consider the Stagend recommendation system. Currently, the platform includes entertainers. In its static version, the system submits to all interested users a questionnaire including questions intended to identify the entertainer that best matches the needs of the users. Stagend advisors select a small subset of artists (in general fewer than ten) to be presented to each user. The questions identify properties used to characterize the entertainers. We model the relation between questions and items by means of the compatibility-based elicitation and simulate a conversational process using the answers collected in questionnaires filled by actual Stagend users. At the end of each simulated conversation, a set of items is retained by our model. When all questions are asked, in cases over the simulated, the final set is empty. This shows that the notion of logical compatibility can be too strict in some cases. For the remaining questionnaires, we compare the decision of our algorithm to the subset of items selected by Stagend advisors. Let FI denote the fraction of items suggested by the advisors that are actually retained by our model; the average FI over the 91 remaining simulations is . Again, this can be explained by the fact that logical compatibility is not always fully respected by the advisors' suggestions, sometimes due to a poor knowledge of the entertainers in the catalogue, sometimes to better diversify the offer.
Figure 5 (left) shows how the entropy of as well as the number of retained entertainers (NRI) decrease with the number of questions asked. However, after a certain number of questions, both the entropy and the NRI stop decreasing. Notice that the order of the questions is defined in an adaptive way. The right-hand side of Figure 5 shows how the fraction of retained entertainers (NRI/) and the fraction of questions asked (NQ/) vary with respect to the entropy threshold.
A new approach to automatic recommendations, which assumes a dynamic interaction between the system and the user to provide customized and self-adaptive recommendations, has been developed on the basis of a pure Bayesian approach. The framework introduced in this paper lays the ground for several future developments, among which is the dynamic generation of questions in order to improve the conversational nature of the system. This could be based on a natural language generation system interacting with the structured probabilistic description of item properties and elicitation of user needs [mostafazadeh2016generating].