Play Duration based User-Entity Affinity Modeling in Spoken Dialog System

06/29/2018 ∙ by Bo Xiao, et al. ∙ University of Massachusetts Amherst ∙ Amazon

Multimedia streaming services over spoken dialog systems have become ubiquitous. User-entity affinity modeling is critical for the system to understand and disambiguate user intents and personalize user experiences. However, fully voice-based interaction demands quantification of novel behavioral cues to determine user affinities. In this work, we propose using play duration cues to learn a matrix factorization based collaborative filtering model. We first binarize play durations to obtain implicit positive and negative affinity labels. The Bayesian Personalized Ranking objective and learning algorithm are employed in our low-rank matrix factorization approach. To cope with uncertainties in the implicit affinity labels, we propose to apply a weighting function that emphasizes the importance of high confidence samples. Based on a large-scale database of Alexa music service records, we evaluate the affinity models by computing Spearman correlation between play durations and predicted affinities. Comparing different data utilizations and weighting functions, we find that employing both positive and negative affinity samples with a convex weighting function yields the best performance. Further analysis demonstrates the model's effectiveness on individual entity level and provides insights on the temporal dynamics of observed affinities.




1 Introduction

Fully voice-controlled consumer products such as Amazon Echo have connected millions of users to their desired multimedia services through the spoken dialog system. Challenges remain when the system attempts to resolve ambiguities. One type of ambiguity is inherent: the same mention may stand for multiple valid entities. For example, “play hello” may refer to the song either by Adele or by Lionel Richie. Without screen and keypad, it would be a poor user experience to read out the list of candidate entities and have the user select by voice. The other type of ambiguity comes from errors in Spoken Language Understanding (SLU), incurred by the Automatic Speech Recognition (voice to text) and Natural Language Understanding (semantic interpretation) components. Although multimedia entity retrieval engines are designed to be robust to minor distortion in the query, they are still vulnerable when an erroneous entity accidentally matches a noisy query. To address these challenges, we investigate user-entity affinity models as an additional source of meta-information. Estimating a user’s preference for each entity may facilitate personalized entity resolution, e.g., mapping the user’s mention of “hello” to the song from their favored artist. Moreover, a user’s mention of “hello” may be decoded as “hello” and “halo”, resulting in multiple candidate entities to play. The affinity information may aid in ranking SLU interpretations containing multiple hypotheses of user mentions and intents that lead to distinct entities.

In the literature, extensive studies have been conducted on Collaborative Filtering (CF) based large scale user-entity affinity modeling [1]. CF models leverage the similarities of users and entities to infer unobserved affinities, given sparse observations of user-entity interactions. Early methods represented each user as a vector of entities that the user had interacted with, and vice versa [2, 3]. An unobserved affinity to an entity may be predicted by a weighted sum of the user’s affinities to the neighbors of the entity that the user has interacted with. In addition, Matrix Factorization (MF) methods have been found to outperform the aforementioned neighborhood-based methods [4, 5]. MF models the users and entities in a common, low dimensional vector space such that the inner product of these vectors (a.k.a. embeddings) approximates the user-entity affinity [2].

Despite the generalizability of CF, applying it to spoken dialog systems is not straightforward. Users interact with the voice interface of the system and express affinities in a fundamentally different way than in typical Web-based applications. In the past, CF models have employed various explicit or implicit behavioral cues [6]. For example, in online shopping or multimedia streaming applications, purchasing behavior and user ratings are explicit signals [3, 7]. In online Web search, clicking and browsing behaviors, such as click frequency and dwell time on a page, are extracted as implicit cues [8, 9, 10]. In the scenario of voice-based search, previous studies have mainly focused on improving robustness towards noisy queries [11, 12, 13]. However, there has not been much work on affinity cue quantification. Since pay-per-view transactions are typically not adopted in voice-based services, explicit affinity cues are not generally available, or tend to be very sparse.

We refer to the multimedia streaming event triggered by a user request as a playback. The playback may be completed by finishing the content or terminated early by the user. The total time of such a streaming event is defined as the play duration. We hypothesize that play duration may be an implicit behavioral cue for affinity, which is analogous to, but substantially different from, dwell time in Web browsing. In Web browsing, users can proactively choose the page to view by clicking or tapping, so that even when the dwell time is short, it may still indicate some positive, albeit low, relevance. If nothing is relevant, the user may not view any page, and no dwell time would be recorded. In spoken dialog systems, users only receive resolutions passively; a playback does not necessarily indicate any positive affinity. A terminated playback may be attributed to an acceptable but not favored entity, e.g., requesting the remix version but getting the original version of a song; or to a wrong SLU interpretation and a totally irrelevant entity, e.g., playing the song Halo when the mention is hello. We include the former as a part of positive affinity, and define the latter as negative affinity. Empirically, negative affinity should cause a much shorter play duration, as users tend to immediately stop an irrelevant playback. Capturing negative affinity is critical to improving user experience. This motivates us to consider binarization of play durations to extract positive and negative affinities.

Play duration is implicit in representing affinity because there are external factors that are infeasible to capture. Users may have various reasons to stop the playback early or leave it unattended. Heuristically, confidence of positive affinity is higher when the play duration is much longer, and confidence of negative affinity is higher when the play duration is shorter. This inspires us to add a weight [6] to each observation according to the confidence of the positive/negative label.

Following these ideas, in this work we investigate MF based CF modeling of user-entity affinity, using play duration cues. As a case study, we focus on Alexa music services, though the modeling approach is agnostic to application domain or entity type. In the rest of the paper, we explain the modeling approaches in Section 2; describe the data set, experiment setting, and evaluation results in Section 3; and conclude the paper with future directions in Section 4.

2 Matrix factorization model

Let $U$ and $E$ denote the sets of users and entities, respectively. We denote user-entity affinities in a matrix $Y$. A cell $y_{ue}$ in $Y$ denotes the affinity between user $u \in U$ and entity $e \in E$. Our goal is to derive an embedding $x_u \in \mathbb{R}^k$ for each $u$, and $x_e \in \mathbb{R}^k$ for each $e$, such that their inner products approximate $Y$. We define the affinity score $\hat{y}_{ue} = x_u^\top x_e$. (Constant biases for each $u$ and $e$ are not included, as the goal is not rating estimation or recommendation.) The size of dominates the model’s complexity in .

Directly representing affinities in the form of $Y$ has some challenges. First, distinct $(u, e)$ tuples may have dramatically different amounts of observations and variances of estimation, so that the confidence of $y_{ue}$ varies. Moreover, holding the values of $Y$ requires $O(|U| \bar{n})$ spatial complexity, where $\bar{n}$ is the average count of entities played by a user. This is dramatically larger than the model size. Therefore, we label each interaction separately and process the observations sequentially.

2.1 Implicit positive and negative affinity

We discretize observations into implicit positive/negative labels by an entity type specific threshold $\tau$, which is based on a certain percentile (omitted) of play durations for that entity type. For example, for music domain entities, sec for song track, and sec for radio station and album. Let $D$ denote the entire observation set. For a request $(u, e) \in D$, we denote its play duration as $t_{ue}$, and the observed affinity label as $y_{ue}$ in (1).
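Equation (1) does not survive in this copy. A form consistent with the description above, writing $t$ for the play duration of a request, $\tau$ for the entity type specific threshold, and $y$ for the label, might look as follows. The symbols and the $\pm 1$ encoding are assumptions, not the authors' notation.

```latex
y =
\begin{cases}
+1, & t \ge \tau \\
-1, & t < \tau
\end{cases}
\tag{1}
```

Whether a duration exactly equal to the threshold counts as positive cannot be recovered from the text; the non-strict inequality here is a guess.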


2.2 BPR training samples

We optimize the Bayesian Personalized Ranking (BPR) [14] objective to fit our MF model. BPR minimizes pair-wise ranking errors of affinities, rather than point-wise reconstruction errors of , resulting in better prevention of overfitting. Typically, one BPR training example is composed by a given user and a pair of entities with opposite affinity labels. In the past, BPR has been widely applied in learning MF models [15, 16, 17]. Although there are more advanced CF methods [18, 19, 20], in this work we adopt this simple model to allow a clear experimental analysis of play duration cues in spoken dialog systems.

For our task, directly pairing positive and negative samples in $D$ for each user may lead to over-sampling the less frequent negative samples and overfitting on those. Also, it requires spatial complexity. In this work, we employ a simple approach: sampling entities randomly from the entity set $E$ as negative peers for positive observations, and as positive peers for negative observations. Since the entity set is large, we expect a low chance of drawing an entity that the user has ever interacted with. We drop the sampled entity if it is identical to the entity in the observation. Let $D^{+}$ and $D^{-}$ denote positive/negative observations, respectively. We draw BPR training sample sets $S^{+}$ and $S^{-}$ as in (2) and (3), where $e^{-}$ and $e^{+}$ are randomly sampled negative/positive entities, respectively. Let $n$ be the count of training samples generated from each observation.
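The sampling scheme described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `draw_bpr_samples`, the tuple layout, and the `+1`/`-1` label encoding are all assumptions made for the example.

```python
import random


def draw_bpr_samples(observations, entity_ids, n, rng):
    """Draw BPR training triples by random peer sampling.

    observations: list of (user, entity, label), label +1 or -1 (assumed encoding).
    entity_ids:   list of all entity IDs to sample peers from.
    n:            number of peers drawn per observation.
    Returns (pos_samples, neg_samples), each a list of
    (user, observed_entity, sampled_peer).
    """
    pos_samples, neg_samples = [], []
    for user, entity, label in observations:
        for _ in range(n):
            peer = rng.choice(entity_ids)
            if peer == entity:
                continue  # drop a peer identical to the observed entity
            if label > 0:
                # the peer plays the role of a negative entity
                pos_samples.append((user, entity, peer))
            else:
                # the peer plays the role of a positive entity
                neg_samples.append((user, entity, peer))
    return pos_samples, neg_samples
```

Because dropped peers are not redrawn, an observation may yield fewer than `n` triples; whether the original system redraws is not stated in the text.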


2.3 Observation confidence weighting

Binary labels do not capture fine-grained notions of affinity, e.g., a user stopping a playback at 1 sec vs.  sec may indicate different levels of affinity. We assign lower weights to samples closer to the threshold $\tau$. The weight for each sample is derived from the play duration through a function over the time axis. Figure 1 illustrates the weighting functions in our investigation. The weighting functions reach their maximum at 0 sec and at an entity type specific upper duration (e.g., 300 sec for song track, 1800 sec for album and station); beyond that upper duration, the weight remains constant.
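The exact functional forms behind Figure 1 are not recoverable from the text, so the following is only a hedged sketch of how such confidence weights could be computed. The function `confidence_weight`, the floor `w_min`, the threshold value used in the usage note, and the specific linear, quadratic, and logarithmic shapes are all assumptions chosen to match the qualitative description: lowest weight at the threshold, highest at 0 sec and at the upper duration, constant beyond it.

```python
import math


def confidence_weight(t, tau, t_max, shape="convex", w_min=0.1):
    """Hypothetical confidence weight for a play duration t (seconds).

    tau:   binarization threshold for the entity type (assumed value).
    t_max: duration at which the weight saturates (e.g. 300 for a song track).
    w_min: weight assigned exactly at the threshold (assumed floor).
    """
    if not 0.0 < tau < t_max:
        raise ValueError("expected 0 < tau < t_max")
    t = min(max(t, 0.0), t_max)  # weight stays constant beyond t_max
    # Normalized distance from the threshold, in [0, 1] on either side.
    if t < tau:
        d = (tau - t) / tau
    else:
        d = (t - tau) / (t_max - tau)
    if shape == "uniform":
        f = 1.0
    elif shape == "linear":
        f = d
    elif shape == "convex":
        f = d * d
    elif shape == "concave":
        f = 1.0 - (1.0 - d) ** 2
    elif shape == "log":
        f = math.log1p(9.0 * d) / math.log(10.0)
    else:
        raise ValueError("unknown shape: %s" % shape)
    return w_min + (1.0 - w_min) * f
```

With a placeholder threshold of 30 sec for a song track, a playback stopped at 15 sec gets a smaller weight under the convex shape than under the linear or concave ones, which is the sense in which a convex curve emphasizes only high confidence samples.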

Figure 1: Observation confidence weighting curves.

Customer ID | Slot type | User mention | Resolution | Entity ID | Duration (sec)
User_3 | SongName | hello | Hello | Entity_5 | 10.0
User_4 | StationName | n. p. r. | WBUR radio | Entity_7 | 900.0
User_9 | AlbumName | trolls | Trolls (OST) | Entity_2 | 400.0

Table 1: Illustration of Alexa music data set

2.4 Objective function and optimization

Let $\sigma$ and $\Theta$ denote the logistic function and the embeddings, respectively. Let $\lambda$ be the regularization parameter. Let $\delta^{+}$ be the difference of affinity scores between the positive entity and the sampled negative entity for user $u$, as shown in (4). Similarly, let $\delta^{-}$ be that for negative observations, as shown in (5). Let $w$ be the sample weight. The objective function for BPR optimization is shown in (6).
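Equations (4) to (6) are missing from this copy. A reconstruction in the spirit of the standard weighted BPR objective [14] is given below. All symbols are assumptions: $\hat{y}_{ue}$ is the affinity score of user $u$ and entity $e$, $e'$ is the randomly sampled peer entity, $S^{+}$ and $S^{-}$ are the sample sets built from positive and negative observations, $w$ is the sample weight, $\sigma$ is the logistic function, $\Theta$ the embeddings, and $\lambda$ the regularization parameter.

```latex
\begin{align}
\delta^{+}_{u,e,e'} &= \hat{y}_{ue} - \hat{y}_{ue'}, & (u, e, e') &\in S^{+} \tag{4} \\
\delta^{-}_{u,e,e'} &= \hat{y}_{ue'} - \hat{y}_{ue}, & (u, e, e') &\in S^{-} \tag{5} \\
\max_{\Theta} \;& \sum_{S^{+}} w \ln \sigma\big(\delta^{+}_{u,e,e'}\big)
  + \sum_{S^{-}} w \ln \sigma\big(\delta^{-}_{u,e,e'}\big)
  - \lambda \lVert \Theta \rVert^{2} \tag{6}
\end{align}
```

In both cases the difference is oriented so that a larger value means the entity with the more positive label is ranked higher; the exact form of the regularization term in the original is not known.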


To allow for online updates, we optimize the model with stochastic gradient descent. In (7), we illustrate the formula for updating the embedding values for the case of positive observations [14]. The formula for negative observations differs only by subscripts. We employ the AdaGrad regularized dual averaging algorithm [21] for regularization and learning rate adaptation.
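As a rough illustration of the kind of update the missing equation (7) describes, here is a single weighted BPR gradient step for one training triple. It uses plain stochastic gradient ascent with an L2 penalty rather than the AdaGrad regularized dual averaging algorithm the paper employs, and the function name, learning rate, and penalty form are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def bpr_sgd_step(x_u, x_pos, x_neg, weight=1.0, lr=0.05, lam=0.01):
    """One gradient ascent step on weight * ln(sigmoid(delta)) - lam * ||theta||^2,
    where delta = x_u . x_pos - x_u . x_neg.

    x_pos is the embedding of the entity that should rank higher,
    x_neg the one that should rank lower. Returns updated copies.
    """
    delta = dot(x_u, x_pos) - dot(x_u, x_neg)
    g = weight * (1.0 - sigmoid(delta))  # derivative of weight * ln(sigmoid(delta))
    new_u = [u + lr * (g * (p - n) - 2.0 * lam * u)
             for u, p, n in zip(x_u, x_pos, x_neg)]
    new_pos = [p + lr * (g * u - 2.0 * lam * p) for u, p in zip(x_u, x_pos)]
    new_neg = [n + lr * (-g * u - 2.0 * lam * n) for u, n in zip(x_u, x_neg)]
    return new_u, new_pos, new_neg
```

For a sample built from a negative observation, the same step applies with the sampled positive peer passed as `x_pos` and the observed entity as `x_neg`, which mirrors the remark that the two formulas differ only by subscripts.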


In testing, we compute the affinity prediction in (8) as the cosine similarity between the user embedding and the entity embedding, which is bounded between $-1$ and $1$.
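A minimal sketch of the test-time prediction, assuming embeddings are stored as plain lists of floats; the function name and the handling of an all-zero vector are illustrative choices, not taken from the paper.

```python
import math


def cosine_affinity(user_vec, entity_vec):
    """Cosine similarity between a user embedding and an entity embedding."""
    inner = sum(a * b for a, b in zip(user_vec, entity_vec))
    norm_u = math.sqrt(sum(a * a for a in user_vec))
    norm_e = math.sqrt(sum(b * b for b in entity_vec))
    if norm_u == 0.0 or norm_e == 0.0:
        return 0.0  # assumption: treat an all-zero embedding as neutral
    return inner / (norm_u * norm_e)
```

Unlike the raw inner product used in training, the cosine score is insensitive to embedding norms, so scores are comparable across users and entities with very different observation counts.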

Data Weight
(a) uniform
(b) uniform
(c) log
(d) concave quad.
(e) linear
(f) convex quad.
(g) convex quad.
Table 2: Spearman correlation between play durations and predicted affinities (showing absolute differences)

3 Experiments

3.1 Data set

In this work, we use a subset of Alexa music service logs over a three-month period, containing requests to playable entities such as song track, radio station, and album. The observed interactions are from far more than 100K users, amounting to over 10M requests/playbacks, and 100K distinct entities. Table 1 illustrates fields in the data set with a few representative examples. Slot type denotes system interpreted entity type.

Considering complexity and robustness, we do not include Customer/Entity IDs with fewer than 5 occurrences in the model. Instead, we replace these IDs with “<User_UNK>” and “<Entity_UNK>” labels. The ratio of requests with a UNK index is below 5%. We expect the UNK index to capture an averaged characteristic of low-frequency users/entities. Finally, for anonymization and computational efficiency, we replace all Customer/Entity IDs, including UNK, by integers and store them in a vocabulary.
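The ID preprocessing described above can be sketched as follows. The helper names `build_vocab` and `encode` and the choice of index 0 for the UNK label are assumptions made for illustration.

```python
from collections import Counter


def build_vocab(ids, unk_label, min_count=5):
    """Map frequent raw IDs to integers; index 0 is reserved for the UNK label."""
    counts = Counter(ids)
    vocab = {unk_label: 0}
    for raw_id, count in counts.items():
        if count >= min_count:  # IDs with fewer occurrences fall back to UNK
            vocab[raw_id] = len(vocab)
    return vocab


def encode(raw_id, vocab, unk_label):
    """Return the integer index of raw_id, or the UNK index if it is not in vocab."""
    return vocab.get(raw_id, vocab[unk_label])
```

The same `encode` fallback also covers users or entities that first appear at test time, so an affinity prediction is always computable.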

3.2 Evaluation method

We compute the following two metrics for evaluation. The first is the Spearman correlation between the predicted affinities and the play durations. Note that Spearman correlation ignores absolute values and only computes the correlation of ranks, which suits our problem as play duration may not have a linear relation to the latent affinity. As the lengths of song tracks are much shorter than those of stations and albums, we also normalize play durations by dividing by the entity type specific thresholds, and report the correlation in this normalized case separately. Entity type specific values are also evaluated.
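For reference, Spearman correlation is the Pearson correlation of the ranks of the two variables. The sketch below is a dependency-free version that assigns average ranks to ties; the function names are illustrative, and a library routine would normally be used instead.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return float("nan")  # undefined when one variable is constant
    return cov / math.sqrt(var_x * var_y)


def spearman(predicted_affinities, play_durations):
    return pearson(average_ranks(predicted_affinities),
                   average_ranks(play_durations))
```

Dividing durations by a single constant leaves the ranks unchanged, so the normalization only matters when entity types with different thresholds are pooled into one ranking.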

Secondly, we simulate prediction of the affinity labels by the predicted affinity. By varying the decision boundary on the cosine similarities, we can plot the ROC curve for True Positive Rate — true positive prediction out of actual positive samples vs. False Positive Rate — false positive prediction out of actual negative samples. We then report the Area Under ROC Curve (AUC) metric. We may also modify the thresholds in (1), then recompute the AUC metric as the affinity labels are modified. This allows evaluation of the model’s discriminative power towards different ranges of play duration.
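The AUC can be computed without explicitly sweeping the decision boundary, since it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below uses that equivalence; the function names and the quadratic-time loop are illustrative simplifications. `relabel` shows how affinity labels could be regenerated under a modified threshold before recomputing the metric.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y > 0]
    neg = [s for s, y in zip(scores, labels) if y <= 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties contribute half
    return wins / (len(pos) * len(neg))


def relabel(play_durations, threshold):
    """Binarize durations with a (possibly modified) threshold; +1/-1 is an assumed encoding."""
    return [1 if t >= threshold else -1 for t in play_durations]
```

For example, calling `roc_auc(scores, relabel(durations, 2 * tau))` for some threshold `tau` would probe how well the predicted affinities separate longer playbacks from the rest.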

3.3 Experiment setting

We first use only the third month (m3) to gain insights in the experiment, then investigate whether more data (m1 to m3) increases the accuracy. In a causal setting, we reserve the second last day in m3 for development, and reserve the last day for testing. The implementation was based on the FACTORIE library [22]. To further speed up experiments, we employed the Hogwild trainer for parallel computing [23]. For unseen users or entities in the test set, we substitute their IDs with the UNK index; thus the affinity prediction is always computable.

We optimized hyper-parameters by grid search on the dev set, including the embedding dimension, learning rate, regularization, negative sampling count, training iterations, etc. As a result, , , , , yielded the optimal performance. The relatively low values of and are favorable for large scale application. The superior result from a relatively small negative sampling count is probably because of less collision with entities of the same affinity polarity in the random sampling.

3.4 Overall results

Table 2 tags settings from (a) to (g). The first two columns denote data utilization and sample weighting method. The remaining columns report , , and AUC of predicting , and . Metric values in (a) — using samples only from — serve as a baseline (all baseline metrics are moderately positive but omitted). Absolute differences by columns are included in the rest of the table, except that and values are relative to the baseline of and , respectively.

Improvements in (b) demonstrate the advantage of incorporating negative affinity samples . Results in (c) to (f) show the effectiveness of confidence based sample weighting. We can see that convex weighting is superior to linear weighting, which in turn exceeds concave weightings. This implies that emphasizing high confidence observations is helpful. In addition, we found that except in (a). This means after normalization for entity types, the predicted affinities still capture the correlation to play durations. Performance gain with respect to three entity types are consistent, meaning the models tend not to be dominated by a single entity type. in (f) is lower than that in (c) to (d), whereas in (f) exceeds that in (c) to (d). This suggests linear weighting has better prediction power for , whereas convex quadratic weighting achieves better characterization of affinity in longer duration range. Finally, we found adding more observations further improved the performance in (g).

3.5 Evaluation per entity

Unlike radio stations which have unbounded content, song tracks have bounded lengths (unless they are on repeat). We investigate how well the model is capturing personalized preferences with respect to each individual entity. We compute and collect its histogram in each entity type. Note that not all have sufficient observations in the test set for a statistically meaningful estimate of . We focused on the set , where and denote the count in test set and p-value, respectively. Duplicate tuples in the test set have identical affinity prediction, and may inflate due to tie-breaking in ranking the predicted affinities. Hence only the first occurrence of is taken.

Figure 2: Histograms of entity-wise Spearman correlation .

As a result, Figure 2 shows the histograms (absolute values are omitted) computed based on the model in Table 2, row (g). We can see that consistently for three entity types, has a unimodal distribution and for almost all . This lends further support to the effectiveness of the model in capturing personalized affinities to each individual entity.

3.6 Evaluation on seen vs. unseen interactions

Despite the causal split of the data, the test set has significant overlap with the training set in terms of (user, entity) tuples, due to frequent recurrence of customer usage. We hence divide the test set by whether the tuple has been seen in the training set, and evaluate performance in each part as shown in Table 3. The first row presents metrics of row (g) in Table 2. The rest of the table includes absolute differences.
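The partition of the test set can be sketched as below, assuming each test row starts with a user ID followed by an entity ID; the row layout and the function name are hypothetical.

```python
def split_seen_unseen(train_pairs, test_rows):
    """Split test rows by whether their (user, entity) tuple occurs in training.

    train_pairs: iterable of (user, entity) tuples from the training set.
    test_rows:   list of tuples whose first two fields are (user, entity).
    """
    seen_pairs = set(train_pairs)
    seen, unseen = [], []
    for row in test_rows:
        key = (row[0], row[1])
        if key in seen_pairs:
            seen.append(row)
        else:
            unseen.append(row)
    return seen, unseen
```

Each part can then be scored separately with the same correlation and AUC metrics used for the full test set.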

As expected, there is a significant reduction of performance on the unseen part. But surprisingly, the reduction on the seen part is similar. For song track and station, performance on the unseen part is even better. The reason might be heavily biased customer behavior in the seen part. Both the system and the customers evolve over time, and there are far fewer negative affinities in the seen part. On the system side, SLU errors get fixed over time, so that given the same request, affinity may become positive. On the customer side, successful requests may be much more likely to reoccur than failed ones. We conjecture that both effects led to a narrower dynamic range of play durations in the seen part, causing a reduction of perceived correlation, as discriminating fine levels of positive affinities is more difficult than separating negative affinities from positive ones. This suggests temporal dynamics might be a non-negligible factor in understanding user preference.

Test set
Table 3: Correlation (abs. diff.) on seen vs. unseen tuples

4 Conclusion

The unique characteristics of fully voice-based multimedia services demand innovations in modeling user-entity affinity. In this work, we have proposed to capture implicit affinities exhibited in play durations. We used the BPR objective and low-rank matrix factorization to model binarized positive vs. negative affinities. In the experiment, we found that utilizing positive and negative samples together outperformed using positive samples only; weighting samples with a convex quadratic function yielded the best outcome; and adding more training data further improved the result. We further demonstrated the model’s effectiveness by evaluating on the basis of individual entities, and gained insights about the dynamics of system and customer behavior by analyzing performance for seen vs. unseen interactions.

In the future, we would like to cover more entity types including audio books and videos, following a multi-view learning approach [24]. We may also incorporate heterogeneous side information such as explicit feedback, similarity of items, etc. [5, 17, 25]. Moreover, we may extend the affinity matrix to a 3D tensor, following the multi-verse CF method based on tensor factorization [26]. For example, the user’s mention text may be added so that the affinity prediction is aware of the user’s request. We would like to further evaluate the effectiveness of affinity prediction in the production environment for the optimization of SLU interpretation and customer experience.


  • [1] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl, “Item-based collaborative filtering recommendation algorithms,” in Proceedings of the 10th international conference on World Wide Web, 2001, pp. 285–295.
  • [2] Y. Koren, “Factorization meets the neighborhood: a multifaceted collaborative filtering model,” in Proceedings of the 14th ACM SIGKDD, 2008, pp. 426–434.
  • [3] G. Linden, B. Smith, and J. York, “Amazon.com recommendations: Item-to-item collaborative filtering,” IEEE Internet computing, vol. 7, no. 1, pp. 76–80, 2003.
  • [4] J. Bennett, S. Lanning et al., “The netflix prize,” in Proceedings of KDD cup and workshop, 2007, p. 35.
  • [5] Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization techniques for recommender systems,” Computer, vol. 42, no. 8, 2009.
  • [6] Y. Hu, Y. Koren, and C. Volinsky, “Collaborative filtering for implicit feedback datasets,” in IEEE International Conference on Data Mining, 2008, pp. 263–272.
  • [7] Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan, “Large-scale parallel collaborative filtering for the netflix prize,” in International Conference on Algorithmic Applications in Management, 2008, pp. 337–348.
  • [8] E. Agichtein, E. Brill, and S. Dumais, “Improving web search ranking by incorporating user behavior information,” in Proceedings of the ACM SIGIR, 2006, pp. 19–26.
  • [9] S. Shen, B. Hu, W. Chen, and Q. Yang, “Personalized click model through collaborative filtering,” in Proceedings of the fifth ACM international conference on Web search and data mining, 2012, pp. 323–332.
  • [10] F. Cai, S. Wang, and M. de Rijke, “Behavior-based personalization in web search,” Journal of the Association for Information Science and Technology, vol. 68, no. 4, pp. 855–868, 2017.
  • [11] J. Rao, F. Ture, H. He, O. Jojic, and J. Lin, “Talking to your tv: Context-aware voice search with hierarchical recurrent neural networks,” in Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, 2017, pp. 557–566.
  • [12] R. Levitan and D. Elson, “Detecting retries of voice search queries,” in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), vol. 2, 2014, pp. 230–235.
  • [13] M. Shokouhi, U. Ozertem, and N. Craswell, “Did you say u2 or youtube?: Inferring implicit transcripts from voice search logs,” in Proceedings of the 25th International Conference on World Wide Web, 2016, pp. 1215–1224.
  • [14] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme, “Bpr: Bayesian personalized ranking from implicit feedback,” in Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence, 2009, pp. 452–461.
  • [15] H. Yin, H. Chen, X. Sun, H. Wang, Y. Wang, and Q. V. H. Nguyen, “Sptf: A scalable probabilistic tensor factorization model for semantic-aware behavior prediction,” in International Conference on Data Mining, 2017, pp. 585–594.
  • [16] S. Riedel, L. Yao, A. McCallum, and B. M. Marlin, “Relation extraction with matrix factorization and universal schemas,” in Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2013, pp. 74–84.
  • [17] R. He and J. McAuley, “Vbpr: Visual bayesian personalized ranking from implicit feedback.” in AAAI, 2016, pp. 144–150.
  • [18] C.-Y. Wu, A. Ahmed, A. Beutel, A. J. Smola, and H. Jing, “Recurrent recommender networks,” in Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, 2017, pp. 495–503.
  • [19] S. Sedhain, A. K. Menon, S. Sanner, and L. Xie, “Autorec: Autoencoders meet collaborative filtering,” in Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 111–112.
  • [20] D. Liang, J. Altosaar, L. Charlin, and D. M. Blei, “Factorization meets the item embedding: Regularizing matrix factorization with item co-occurrence,” in Proceedings of the 10th ACM conference on recommender systems, 2016, pp. 59–66.
  • [21] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, no. Jul, pp. 2121–2159, 2011.
  • [22] A. McCallum, K. Schultz, and S. Singh, “Factorie: Probabilistic programming via imperatively defined factor graphs,” in Proceedings of the Annual Conference on Neural Information Processing Systems, 2009, pp. 1249–1257.
  • [23] B. Recht, C. Re, S. Wright, and F. Niu, “Hogwild: A lock-free approach to parallelizing stochastic gradient descent,” in Proceedings of the Annual Conference on Neural Information Processing Systems, 2011, pp. 693–701.
  • [24] A. M. Elkahky, Y. Song, and X. He, “A multi-view deep learning approach for cross domain user modeling in recommendation systems,” in Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 278–288.
  • [25] H. Chen and J. Li, “Learning multiple similarities of users and items in recommender systems,” in International Conference on Data Mining, 2017, pp. 811–816.
  • [26] A. Karatzoglou, X. Amatriain, L. Baltrunas, and N. Oliver, “Multiverse recommendation: n-dimensional tensor factorization for context-aware collaborative filtering,” in Proceedings of the fourth ACM conference on Recommender systems, 2010, pp. 79–86.