Active Learning in Recommendation Systems with Multi-level User Preferences

11/30/2018 ∙ by Yuheng Bu, et al. ∙ Amazon

While recommendation systems generally observe user behavior passively, there has been an increased interest in directly querying users to learn their specific preferences. In such settings, considering queries at different levels of granularity to optimize user information acquisition is crucial to efficiently providing a good user experience. In this work, we study the active learning problem with multi-level user preferences within the collective matrix factorization (CMF) framework. CMF jointly captures multi-level user preferences with respect to items and relations between items (e.g., book genre, cuisine type), generally resulting in improved predictions. Motivated by finite-sample analysis of the CMF model, we propose a theoretically optimal active learning strategy based on the Fisher information matrix and use this to derive a realizable approximation algorithm for practical recommendations. Experiments are conducted using both the Yelp dataset directly and an illustrative synthetic dataset in the three settings of personalized active learning, cold-start recommendations, and noisy data -- demonstrating strong improvements over several widely used active learning methods.


1 Introduction

Recommendation systems are widely studied in both academic and commercial settings. Most existing work considers the passive, item-level feedback scenario, where the recommendation system observes historical user feedback for a set of items with respect to a population of users and estimates unobserved user-item utility to make recommendations. However, exclusively considering item-level feedback disregards frequently available information regarding relations (e.g., cuisine type, product substitutability [McAuley, Pandey, and Leskovec2015]) between entities (e.g., items, users, cuisines) in the database. Additionally, standard recommendation systems only observe historical user-item responses to estimate model parameters, unlike active learning systems, which can directly query the user to learn user preferences more efficiently. Noting these shortcomings, online recommendation systems [Bresler, Chen, and Shah2014], preference elicitation methods [Chen and Karger2006], and interactive recommendation systems [Mahmood and Ricci2009] all provide directions to mitigate these issues.

In this work, we are specifically interested in adding system-initiative capabilities with respect to querying the user. As motivation, suppose you are interested in purchasing a book to read during an upcoming vacation. A standard recommendation system would observe your past purchases/ratings and select a book expected to align with learned preferences. However, a multi-turn dialogue agent may be able to ask preference questions of varying types and levels of granularity (e.g., Did you like G.K. Chesterton’s ‘The Man Who Was Thursday’? [item utility elicitation], Are you looking for light reading? [use case information], Do you like science fiction novels? [category information]). These query families offer greater flexibility than simply making recommendations and allow asking questions that can more quickly improve the model for making session-specific recommendations.

The core of our proposed method is an active learning extension to the collective matrix factorization (CMF) model. CMF produces a low-dimensional embedding that is shared across each relation in which the item/user participates and jointly represents all available sources of information on different levels [Gupta and Singh2015]. By enabling active learning within CMF, we can generate a personalized active learning session to efficiently estimate the CMF parameters and make high-quality recommendations. Our primary contributions include: (1) framing the question selection problem for multi-level user preferences in a system-initiative active recommendation system within the CMF framework; (2) providing a theoretical analysis of an optimal active learning strategy for CMF and a corresponding realizable approximation; and (3) demonstrating that the proposed algorithm outperforms strong baselines on real-world Yelp data and an illustrative synthetic dataset that explicitly satisfies the CMF assumptions in standard, cold start, and noisy data settings.

2 Related Works

This work draws upon several existing research areas. Below, we itemize some of the most relevant related work.

Recommendation Systems: Recommendation systems are frequently categorized as content-based [Pazzani and Billsus2007] or collaborative filtering [Koren, Bell, and Volinsky2009] methods. This work builds upon the collective matrix factorization (CMF) model [Singh and Gordon2008, Gupta and Singh2015], a collaborative filtering method that generalizes matrix factorization to also account for different levels of relations between entities. Within the collaborative filtering approach, observation sparsity is the primary limitation, particularly in the cold start setting (i.e., where a new item has a small number of ratings or a new user has rated a small number of items). Approaches to ameliorate this issue include user preference elicitation [Rashid, Karypis, and Riedl2008], interview construction [Zhou, Yang, and Zha2011, Sun et al.2013], and optimal experimental design [Anava et al.2015]. The distinguishing aspect of our work is that we develop an active and personalized querying strategy that jointly accounts for several types of questions based on item preferences and other relationships between entities.

Active Learning: Active learning has been widely studied (e.g., [Settles2009] for a general survey, [Rubens, Kaplan, and Sugiyama2011, Elahi, Ricci, and Rubens2016] in the context of recommendation systems), describing when a learning algorithm observes a large set (or stream) of unlabeled examples and can choose a subset for labeling – attempting to maximize performance while minimizing annotation effort. The most closely related work within the recommendation systems setting is active learning in the matrix completion scenario [Bhargava, Ganti, and Nowak2017]. While related, this differs from our work as we use the CMF setting to jointly estimate a representation for several levels of relationships. From a methodological perspective, our work draws upon recent results for active learning with Fisher information based querying functions [Sourati et al.2017] and convergence properties of active learning for maximum likelihood estimation [Chaudhuri et al.2015]. We expand upon these results for the CMF setting.

Conversational Agents: One potential motivation for this work is conversational recommendations [Mahmood and Ricci2009, Christakopoulou, Radlinski, and Hofmann2016], where questions are determined by modeling different relationship types within a unified framework and a theoretically well-motivated active learning strategy. However, it should be noted that developing goal-oriented conversational agents (e.g., [Young et al.2013]) and specifically dialogue managers is a very mature AI subfield. We are only considering the restricted setting of asking a personalized set of questions to provide an optimized recommendation.

3 Preliminaries and Model

We first briefly review the probabilistic collective matrix factorization model [Singh and Gordon2008] and concretely formalize our active learning extension.

3.1 General Notation

We use lower case letters to denote scalars and vectors, and upper case letters to denote matrices and sets. $I$ is an appropriately sized identity matrix. Superscript $\top$ denotes a vector or matrix transpose and $|\cdot|$ denotes the support size of a set. The $\ell_2$-norm of a vector $x$ is defined as $\|x\|_2 = \|x\|_I$, where $\|x\|_A = \sqrt{x^\top A x}$ is defined for a vector $x$ and a matrix $A$ of appropriate dimensions. The Frobenius norm of a matrix $A$ is defined as $\|A\|_F = \sqrt{\mathrm{Tr}(A^\top A)}$.

3.2 Relational Data

We represent the set of entities by $\mathcal{E}$ and the set of relations between them by $\mathcal{R}$, respectively. Denote the observed database by $\mathcal{D}$, which consists of $N$ tuples of the form $(r, e_1, e_2, y)$, where $r \in \mathcal{R}$ is a relation between two different types of entities, $e_1, e_2 \in \mathcal{E}$ are a pair of different entities, $y \in \{0, 1\}$ is the label denoting whether $r(e_1, e_2)$ holds (or not), and $N$ is the total number of observed relations in database $\mathcal{D}$.

For example, a simple database that consists only of the user ratings for restaurants would contain users and businesses as the entities, and only a single relation $r$, such that $y = 1$ if the user $e_1$ liked the business $e_2$. As in this example, many real-life databases are sparse (i.e., only a very small subset of possible relations are observed) and the goal of modeling is to be able to complete this database such that we can make recommendations based on the predictions. Specifically, given any query $(r, e_1, e_2)$ that is absent from the observed database, we would like to predict whether the relation $r(e_1, e_2)$ holds.

3.3 Collective Factorization Model

The collective matrix factorization model [Singh and Gordon2008, Gupta and Singh2015] extends the commonly used matrix factorization model to multiple matrices by assigning each entity a low-dimensional latent vector that is shared across all relations where the entity appears. Formally, we assign each entity $e$ in our database a $k$-dimensional latent vector $\phi_e \in \mathbb{R}^k$, and denote the matrix of all such latent vectors by $\Phi$. We model the probability that $r(e_1, e_2)$ equals $1$ by:

$$P\big(r(e_1, e_2) = 1 \mid \Phi\big) = \sigma\big(\phi_{e_1}^\top \phi_{e_2}\big), \tag{1}$$

where $\sigma$ is the sigmoid function $\sigma(x) = 1 / (1 + e^{-x})$.
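As a minimal sketch of the scoring rule in this subsection, the Python snippet below computes the probability of a relation as the sigmoid of a dot product between two latent vectors. The three-dimensional vectors and the names `relation_probability`, `phi_user`, `phi_business`, and `phi_category` are illustrative assumptions, not values or identifiers from the paper.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def relation_probability(phi_a, phi_b):
    """Probability that a relation holds between two entities, modeled as
    the sigmoid of the dot product of their latent vectors."""
    return sigmoid(sum(a * b for a, b in zip(phi_a, phi_b)))


# Hypothetical latent vectors (assumed values, for illustration only).
phi_user = [0.5, -1.0, 0.2]
phi_business = [1.0, -0.5, 0.0]
phi_category = [-0.3, 0.8, 0.1]

p_rating = relation_probability(phi_user, phi_business)    # user-business relation
p_category = relation_probability(phi_user, phi_category)  # user-category relation
```

Because each entity keeps a single latent vector across relations, the same function scores both the user-business pair and the user-category pair.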

The CMF model presents a number of advantages in our setting. By sharing the entity factors amongst all the relations, we are able to produce a joint representation based on all sources of information on different levels. For example, the factors used to predict user ratings will leverage information from other ratings in a collaborative filtering fashion, from business categories via the set inclusion relationships, and even from words that appear in the reviews (the details of the model as applied to the Yelp data are described in Section 5.1). This joint representation also allows the active learning algorithm to ask different types of personalized questions to different users in multi-turn recommendation system settings. A further advantage of learning a CMF model is that all the entities are effectively embedded in the same $k$-dimensional space, and thus similarities and distances can be computed and analyzed for any set of entities. Finally, test-time inference takes constant time and thus is very efficient: we only require a dot-product between low-dimensional vectors for estimating the probability of a relation existing between a pair of entities.

3.4 Problem Definition

In defining our personalized active learning problem, let $u \in \mathcal{U}$ be a chosen user, where $\mathcal{U}$ is the set of all user entities. Let $\mathcal{V}$ be the set of all entities except for user entities, let $\mathcal{S}_u \subseteq \mathcal{V}$ represent the set of entities for which the labels of the relations between user $u$ and these entities are observed in the database $\mathcal{D}$, and denote by $\mathcal{P}_u$ the set of entities for which the labels of their relations with user $u$ are missing in the database $\mathcal{D}$. We assume the active learning procedure operates in $T$ iterations. In each iteration, the active learning algorithm chooses $m$ entities (denoted by the set $\mathcal{Q}$) from the unlabeled set $\mathcal{P}_u$ and gets answers by asking user $u$ questions with respect to these entities, adding these labeled entities to the existing labeled set $\mathcal{S}_u$ and removing them from the unlabeled entity set $\mathcal{P}_u$. The goal of active learning is to find $mT$ entities in total, specifically the most informative questions, to query their relations with the current user $u$.

More precisely, in the $t$-th iteration, we have a labeled entity set $\mathcal{S}_u$ and an unlabeled entity set $\mathcal{P}_u$. Given a constraint $m$ on the number of questions, our goal is to select $m$ questions $\mathcal{Q} \subseteq \mathcal{P}_u$ for user $u$ in order to minimize the cross entropy loss on the set of unlabeled entities $\mathcal{P}_u$.

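Define $\ell(y \mid u, e, \hat{\Phi})$ to be the negative log likelihood function of the CMF model with parameters $\hat{\Phi}$ estimated by our algorithm, and assume that the database is generated from the CMF model with some unknown true parameter $\Phi^*$. Then, the cross entropy loss for a specific user $u$ on the unlabeled set $\mathcal{P}_u$ can be written as

$$L_{\mathcal{P}_u}(\hat{\Phi}) = \frac{1}{|\mathcal{P}_u|} \sum_{e \in \mathcal{P}_u} \mathbb{E}_{y \sim \Phi^*}\big[\ell(y \mid u, e, \hat{\Phi})\big], \tag{2}$$

where $\mathbb{E}_{y \sim \Phi^*}$ denotes taking expectation of $y$ over the CMF distribution with $\Phi^*$.

Our goal is to minimize the following objective function:

$$\min_{\mathcal{Q} \subseteq \mathcal{P}_u} \; \mathbb{E}\big[L_{\mathcal{P}_u}(\hat{\Phi})\big], \tag{3}$$

where the expectation is taken over the estimation of $\hat{\Phi}$.

Clearly, the objective function inherently depends on how we estimate the latent matrix $\Phi$ given the relations from the observed set $\mathcal{S}_u$ and the answers to the active learning questions provided by the user. Thus we can decompose the problem in Equation (3) into two distinct problems:

  1. Model estimation: given a subset of entities $\mathcal{Q}$ and their relations with user $u$, estimate the matrix $\hat{\Phi}$ that minimizes the cross entropy loss in (2).

  2. Question selection: given a pool of available questions $\mathcal{P}_u$, select the most informative subset of $\mathcal{P}_u$ as the active learning questions $\mathcal{Q}$.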

4 Algorithm and Analysis

In this section we tie together the model estimation and question selection problems, and present an approach that tackles the unified problem as a whole. We derive a finite-sample estimation error bound, which is the theoretical underpinning motivating the optimal active learning strategy proposed in this section.

Essentially, the optimality relies on the assumption that the labels are generated via CMF model defined in (1), and consequently on the method used for model estimation: maximum likelihood (ML) estimation. To make this point clearer we first introduce the ML estimator.

4.1 Maximum Likelihood Estimation

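Since we assume the database is generated from the CMF model with some unknown parameter $\Phi^*$, it is impossible for us to compute the loss function (2) directly. However, we can compute the empirical loss for user $u$ on the labeled set $\mathcal{S}_u$ by averaging over the observed data as follows:

$$\hat{L}_{\mathcal{S}_u}(\Phi) = \frac{1}{|\mathcal{S}_u|} \sum_{e \in \mathcal{S}_u} \ell(y_{u,e} \mid u, e, \Phi), \tag{4}$$

where $y_{u,e}$ denotes the observed label of the relation between user $u$ and other entity $e$.

To jointly estimate the latent vectors for all users and other entities, we define the empirical cross-entropy loss on the entire database $\mathcal{D}$ as follows:

$$\hat{L}_{\mathcal{D}}(\Phi) = \frac{1}{N} \sum_{(r, e_1, e_2, y) \in \mathcal{D}} \ell(y \mid e_1, e_2, \Phi). \tag{5}$$

To avoid over-fitting, we place a prior on all latent vectors in the form of a zero-mean Gaussian distribution with identity covariance matrix, i.e., $\phi_e \sim \mathcal{N}(0, I)$. Minimizing the following regularized empirical loss on the observed database gives the maximum likelihood estimator of the latent matrix $\Phi$,

$$\hat{\Phi} = \arg\min_{\Phi \in \Theta} \; \hat{L}_{\mathcal{D}}(\Phi) + \lambda \|\Phi\|_F^2, \tag{6}$$

where $\lambda$ is a positive regularization parameter and $\Theta$ is a compact subset of $\mathbb{R}^{k \times |\mathcal{E}|}$.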

As with several matrix factorization implementations, we use stochastic gradient descent (SGD) by cycling over the entries of the database multiple times, updating the latent factors in the direction of stochastic gradient for each entry.
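The SGD procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `sgd_cmf`, the toy database, and the hyperparameter values (`lam`, `lr`, `epochs`, the initialization scale) are all placeholders, and labels are assumed to be encoded as 0/1.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sgd_cmf(tuples, num_entities, dim, lam=0.01, lr=0.1, epochs=200, seed=0):
    """Fit latent vectors by SGD on the L2-regularized cross-entropy loss.

    tuples: list of (entity_a, entity_b, label) with label in {0, 1}.
    The relation type is omitted because all relations share the same
    per-entity latent vectors.
    """
    rng = random.Random(seed)
    phi = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(num_entities)]
    data = list(tuples)
    for _ in range(epochs):
        rng.shuffle(data)  # cycle over the database entries in random order
        for a, b, y in data:
            score = sum(pa * pb for pa, pb in zip(phi[a], phi[b]))
            err = sigmoid(score) - y  # derivative of the log loss w.r.t. the score
            for k in range(dim):
                grad_a = err * phi[b][k] + lam * phi[a][k]
                grad_b = err * phi[a][k] + lam * phi[b][k]
                phi[a][k] -= lr * grad_a
                phi[b][k] -= lr * grad_b
    return phi


# Hypothetical toy database: entities 0-1 are users, 2-3 are businesses.
toy_db = [(0, 2, 1), (0, 3, 0), (1, 2, 0), (1, 3, 1)]
phi = sgd_cmf(toy_db, num_entities=4, dim=2, epochs=500)
```

Both gradients are computed from the pre-update vectors before either entity is modified, so each step is a proper stochastic gradient step on the regularized loss for that entry.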

4.2 Active Learning Based on Error Bound Minimization

As discussed in Appendix A, the maximum likelihood estimator in (6) is consistent and asymptotically normal. Since there are multiple users related to the same entity (ratings from different users), we can treat the ML estimates in (6) as a good estimation for the latent vectors $\phi_e$ of the non-user entities, and use active learning techniques to improve the estimation of the user's latent vector $\phi_u$. Our discussion below will focus on determining the best strategy to choose the question set $\mathcal{Q}$ actively, such that the optimal convergence rate for estimating $\phi_u$ can be achieved.

Given an estimation of the entity vectors $\phi_e$, the losses defined in (2) and (4) are exclusively functions of $\phi_u$. Thus, we will use the notation $L(\phi_u)$ and $\hat{L}(\phi_u)$ in the following discussion. For our active learning to work correctly, we require the following conditions.

Assumption 1

For any entity $e$, the Hessian matrix of the log likelihood function

$$H(\phi_u, e) = \nabla^2_{\phi_u}\, \ell(y \mid u, e, \phi_u) \tag{7}$$

is a function of only $\phi_u$ and $e$ (does not depend on $y$).

Then, the Fisher information matrix of parameter $\phi_u$ on a set $S$ can be written as $I_S(\phi_u) = \frac{1}{|S|} \sum_{e \in S} H(\phi_u, e)$, which is not a function of the labels $y$. In addition to the above, we need the following assumption to hold to establish the optimality of the active question selection algorithm.
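For the logistic model in (1), the label-independence required by Assumption 1 can be checked directly: with the entity vector held fixed, the Hessian of the negative log likelihood with respect to the user's latent vector is $p(1-p)\,\phi_e \phi_e^\top$, where $p = \sigma(\phi_u^\top \phi_e)$. This is the standard logistic regression result rather than a formula quoted from the paper. The sketch below computes the resulting Fisher information over a set of entities with NumPy; averaging over the set (rather than summing) and the name `fisher_information` are assumptions.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def fisher_information(phi_user, entity_vectors):
    """Average Fisher information of the user's latent vector over a set of entities.

    Each entity contributes p (1 - p) * e e^T with p = sigmoid(phi_user . e),
    an expression that contains no label.
    """
    E = np.asarray(entity_vectors, dtype=float)          # shape (n, k)
    p = sigmoid(E @ np.asarray(phi_user, dtype=float))   # shape (n,)
    w = p * (1.0 - p)                                    # Bernoulli variances
    return (E * w[:, None]).T @ E / len(E)               # shape (k, k)


# Hypothetical example: a user at the origin and two orthogonal entity vectors.
info = fisher_information([0.0, 0.0], np.eye(2))
```

Since the function never receives labels, the information of a candidate question set can be evaluated before any question is asked, which is what makes label-free question selection possible.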

Assumption 2

(Active learning regularity conditions)

  1. Concentration at : For any and , we have

    (8)

    holds with probability one.

  2. Lipschitz continuity: There exists a neighborhood of and a constant , such that for all , are -Lipschitz, namely,

    (9)

    holds, for .

We now present the main result of our paper. The proof of the following theorem and all the supporting lemmas will be presented in Appendix B and C.

Theorem 1

Suppose satisfies the regularity conditions in Assumptions 1, 2 and 3. Let be the ML estimate using question set

(10)

Suppose further that for some constant . Then, large enough such that , we have:

(11)

where .

Remark 1

The proof of Theorem 1 only requires the condition that is the minimizer of the cross entropy loss . Thus, similar bounds on different loss function, for example, mean square error, can be obtained using similar proof technique with different expressions.

The upper and lower bounds in Theorem 1 demonstrate that the cross entropy loss of the ML estimator using question set with size is essentially . Motivated by this result, we should select the question set that minimizes . Unfortunately, we cannot do this, since is unknown. Recall that we use in the original ML estimator (6) as the true value of to calculate , again we can use as a coarse approximation of . We then choose the set which minimizes . The algorithm is formally presented in Algorithm 1.

Algorithm 1 Fisher information based algorithm
Input: Unlabeled set and labeled set , ML estimates of latent vectors and , for
Output: An estimation of latent vector
Method:
1: Solve the following semi-definite programming (SDP) problem (refer to Theorem 1):
2: Solve the ML estimation on labeled set :

Remark 2

To avoid the computational issue of solving SDP when the unlabeled set is extremely large, since holds for positive definite matrices, we can approximate the SDP problem by minimizing or even maximizing as discussed in [Sourati et al.2017].
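The selection step can be illustrated with a simple greedy heuristic. The sketch below is not the SDP of Algorithm 1: it assumes the selection criterion takes the trace form $\mathrm{Tr}(I_{\mathcal{Q}}^{-1} I_{\mathcal{P}})$ used in the cited Fisher information literature [Chaudhuri et al.2015, Sourati et al.2017], where $I_{\mathcal{Q}}$ is the information of the selected questions and $I_{\mathcal{P}}$ that of the whole pool, and it adds one entity at a time. The names `greedy_select` and `per_entity_information` and the small `ridge` term that keeps the matrix invertible are hypothetical.

```python
import numpy as np


def per_entity_information(phi_user, E):
    """Stack of label-free information matrices p (1 - p) * e e^T, one per entity."""
    p = 1.0 / (1.0 + np.exp(-(E @ phi_user)))
    return (p * (1.0 - p))[:, None, None] * (E[:, :, None] * E[:, None, :])


def greedy_select(phi_user, E_pool, num_questions, ridge=1e-3):
    """Greedily pick entities that most reduce trace(I_Q^{-1} I_pool)."""
    infos = per_entity_information(phi_user, E_pool)
    I_pool = infos.mean(axis=0)                  # information over the whole pool
    I_Q = ridge * np.eye(E_pool.shape[1])        # ridge keeps I_Q invertible
    chosen = []
    for _ in range(num_questions):
        best, best_val = None, np.inf
        for j in range(len(E_pool)):
            if j in chosen:
                continue
            # trace(A^{-1} B) computed via a linear solve instead of an inverse
            val = np.trace(np.linalg.solve(I_Q + infos[j], I_pool))
            if val < best_val:
                best, best_val = j, val
        chosen.append(best)
        I_Q = I_Q + infos[best]
    return chosen


# Hypothetical pool: two identical entity vectors and one orthogonal to them.
E_pool = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
picked = greedy_select(np.zeros(2), E_pool, num_questions=2)
```

In this toy pool the second pick is the orthogonal entity rather than the duplicate of the first, which shows how a trace criterion favors questions that cover directions of the latent space not yet informed by earlier answers.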

5 Experimental setup

To demonstrate the efficiency of Algorithm 1 and verify the theoretical results, we compare our Fisher information based active learning algorithm with four baseline algorithms on a real Yelp dataset and an illustrative synthetic dataset. In this section, we introduce these datasets and discuss the setup of our experiments.

5.1 Yelp Dataset

Yelp contains rich relational data for businesses and users in the form of business categories, user reviews and ratings [Gupta and Singh2015]. Much of this relational data characterizes different levels of user preferences that pose exciting potential for integration; for example, incorporating additional information about the business categories and user preferences can significantly improve rating prediction. The entities present in the Yelp database are users, businesses and categories. We denote the sets of these entities by $\mathcal{U}$, $\mathcal{B}$ and $\mathcal{C}$, respectively, and represent each entity by a $k$-dimensional latent vector. In the following subsections, we describe in detail the various relations we use from Yelp and show how we represent them as binary relational matrices.

Ratings

The original 5-scale ratings are converted to a binary-valued relation between businesses and users, with high ratings (4 and 5) as positive ($1$) and low ratings (3 and below) as negative ($0$). We denote the resulting binary rating matrix by $R$, with size $|\mathcal{B}| \times |\mathcal{U}|$.
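A minimal sketch of this conversion is shown below. The 1/0 label encoding, the function name `binarize_rating`, and the sample records are assumptions made for illustration; only the rule that 4 and 5 stars count as positive comes from the text.

```python
def binarize_rating(stars: int, positive_threshold: int = 4) -> int:
    """Map a 1-5 star rating to a binary label: 1 for 4 or 5 stars, 0 otherwise."""
    if not 1 <= stars <= 5:
        raise ValueError("star rating must be between 1 and 5")
    return 1 if stars >= positive_threshold else 0


# Hypothetical (user, business, stars) records.
raw_ratings = [("u1", "b1", 5), ("u1", "b2", 2), ("u2", "b1", 4), ("u2", "b3", 3)]

# Sparse representation of the binary rating relation: only observed pairs are stored.
binary_ratings = {(user, biz): binarize_rating(stars) for user, biz, stars in raw_ratings}
```

Storing only observed pairs keeps unobserved entries distinct from negative ones, which matters because the unobserved entries are the candidates for active learning questions.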

Business Categories

Each business in the Yelp dataset is categorized according to the nature of the business. The categories available include broad-level classes such as Restaurant, Bank and Grocery, and fine-grained descriptions such as Mexican Food, Pizza, and Delis. The business category data can be viewed as a binary relation between businesses and categories and is represented as a matrix $M_{BC}$ (fully observed and complete) with size $|\mathcal{B}| \times |\mathcal{C}|$.

User Categories

The relations between users and categories are not contained in the Yelp dataset explicitly. Thus, we construct the matrix $M_{UC}$ of this relation in the following way to capture the user’s opinions with respect to a certain type of food: if the user $u$ has visited a business $b$ with category $c$, then we set $M_{UC}(u, c) = 1$, which means the user $u$ holds a positive opinion towards the category $c$.

Figure 1: Overview of the entities and the relations in the collective factorization model for the Yelp Dataset.

Datasets and Sizes

The Yelp dataset contains data from 9 cities across 4 countries; we focus on the data from Urbana-Champaign. For our active learning algorithm comparison, we only consider users that have at least 10 ratings and categories associated with at least 5 businesses.

Since the user-category matrix only contains observed categories (all positives), we sample negative data entries for each user by randomly selecting a set of categories that were not observed. The number of negative samples chosen for each user is the same as the number of positive samples.
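The balanced negative sampling described here can be sketched as below. The function name `sample_negative_categories`, the category names, and the cap applied when too few unobserved categories remain are assumptions for illustration.

```python
import random


def sample_negative_categories(visited, all_categories, rng):
    """Return as many unobserved categories as the user has observed ones."""
    unobserved = sorted(set(all_categories) - set(visited))
    k = min(len(set(visited)), len(unobserved))  # cap if too few categories remain
    return rng.sample(unobserved, k)


# Hypothetical user who has visited businesses in two categories.
all_categories = ["Pizza", "Delis", "Bank", "Grocery", "Mexican Food"]
visited = {"Pizza", "Delis"}
negatives = sample_negative_categories(visited, all_categories, random.Random(0))
```

Sorting the unobserved categories before sampling makes the draw reproducible for a fixed seed, since set iteration order is not guaranteed across runs.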

After this procedure, our modified Yelp dataset contains 473 users, 858 businesses and 33 categories. Note that the ratings matrix and the user categories matrix are very sparse, only of and of are observed.

5.2 CMF Generated Synthetic Dataset

Since the Yelp dataset is not actually generated from the CMF model, it is possible that Theorem 1 does not hold at all and Algorithm 1 cannot achieve a reasonable convergence rate. Thus, we also construct an illustrative synthetic dataset using the CMF model defined in (1) to show the efficiency of our active learning algorithm.

We draw samples of matrices , , using the same structure as we discussed for the Yelp data (Figure 1), where are generated i.i.d. from the Gaussian distribution . We set , and , and the dimension of latent vector .

Figure 2: $F_1$ score comparison of different active learning strategies. Panels: (a) Yelp data, ratings matrix only; (b) Yelp data, all three matrices; (c) Synthetic data, ratings matrix only; (d) Synthetic data, all three matrices. The Fisher information based algorithm outperforms the other algorithms presented in Section 5.3 in all settings.

5.3 Baseline Algorithms

Uncertainty sampling: Select the set of entities with the highest noisy affinity variance.

Maximum or Minimum Model Change: Select the entities that lead to the greatest (smallest) model change , if we knew the labels [Elahi, Ricci, and Rubens2016].

Random selection: The entities are selected randomly from the pool of available questions. While ostensibly rather weak, a randomly selected set is actually a statistically good representative of the pool if a sufficiently large number of samples is collected.

Upper and Lower bound: These two baselines are not active learning schemes, but they are provided in the following comparison to show the efficiency of active learning. The lower bound is obtained by training the CMF model without any active questions, which is the starting point of all the other algorithms; while the upper bound is obtained by using all the available training samples.
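Two of the baselines above can be sketched in a few lines. The snippet interprets the "noisy affinity variance" of uncertainty sampling as the Bernoulli variance $p(1-p)$ of the predicted label, which is an assumption rather than a definition given in the text; the function names and the toy pool are also hypothetical.

```python
import math
import random


def predicted_variance(phi_user, phi_entity):
    """Bernoulli variance p (1 - p) of the predicted label for one entity."""
    score = sum(a * b for a, b in zip(phi_user, phi_entity))
    p = 1.0 / (1.0 + math.exp(-score))
    return p * (1.0 - p)


def uncertainty_sampling(phi_user, pool, num_questions):
    """Pick the entities whose predicted labels are most uncertain."""
    ranked = sorted(pool, key=lambda e: predicted_variance(phi_user, pool[e]), reverse=True)
    return ranked[:num_questions]


def random_selection(pool, num_questions, rng):
    """Pick entities uniformly at random from the pool."""
    return rng.sample(sorted(pool), num_questions)


# Hypothetical pool mapping entity names to latent vectors.
pool = {"a": [0.0, 1.0], "b": [3.0, 0.0], "c": [0.5, 0.0]}
uncertain = uncertainty_sampling([1.0, 0.0], pool, num_questions=2)
randoms = random_selection(pool, 2, random.Random(0))
```

Uncertainty sampling ranks each entity in isolation, so unlike a criterion based on the full information matrix it does not account for redundancy among the selected questions.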

5.4 Active Learning Setup

The primary evaluation for comparing different active learning algorithms will be with respect to predicting user ratings. In particular, we consider whether incorporating different levels of questions into the active learning process provides significant improvement in predictions. The simple model that performs the standard matrix factorization of the ratings matrix will only ask active learning questions with respect to the ratings matrix. Moreover, we can compare it with the case where the model predicts ratings by incorporating business categories and user categories and performs active learning on both the ratings matrix and the user-category matrix.

Since we are unable to retrieve the actual answer from a real user, we need to construct the ground truth for active learning responses. In the following experiments, we use the predictions given by the pre-trained CMF model using all available training data as the ground truth, which also serves as the upper bound of active learning performance. We run cross validation to determine the hyperparameters for the pre-trained model and other test models. The value of the regularization constant , dimension of the latent variables ( for Synthetic data) and learning rate is used. We use the default logistic threshold of for the output of the sigmoid function to predict whether a relation holds between entities. The performance of the prediction is measured in terms of the $F_1$ score, defined as the harmonic mean of the precision and recall.
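The evaluation metric can be sketched as follows. The threshold value did not survive in the text above, so the default of 0.5 used here is an assumption (it is the conventional choice for a sigmoid output); the function name and the sample data are likewise illustrative.

```python
def f1_score(y_true, probs, threshold=0.5):
    """F1 score: harmonic mean of precision and recall after thresholding."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # precision and recall are both zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)


# Hypothetical labels and predicted probabilities.
y_true = [1, 1, 0, 0]
probs = [0.9, 0.4, 0.6, 0.1]
score = f1_score(y_true, probs)
```

With these sample values one true positive, one false positive, and one false negative give precision and recall of 0.5 each, so the harmonic mean is also 0.5.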

Figure 3: $F_1$ score comparison of different active learning strategies in the cold start setting. Panels: (a) Yelp data, ratings matrix only; (b) Synthetic data, ratings matrix only.

6 Experimental Results

In this section, we compare the performance of different active learning algorithms with the CMF model in predicting user ratings under different settings. First, we present the rating prediction experiments under the personalized active learning setting in Section 6.1 on both Yelp and synthetic dataset. Secondly, we investigate the performance of different active learning algorithms for user cold-start estimation on these two datasets in Section 6.2, where we make rating predictions for users with no past observed ratings or reviews. Finally, for the synthetic dataset, we compare the performance of different active learning strategies when the model is updated with noisy responses in Section 6.3.

6.1 Personalized Active Learning

To compare the performance of different active learning strategies in the personalized active learning setting, we perform evaluations on a held-out test set from the observed data. The process can be described as follows:

  1. For each user in the dataset, we randomly choose 20% (30% for synthetic data, the same below) from the original matrix and as the test data, 20% (10%) as the training set, and perform active learning on the remaining 60% of data.

  2. In each iteration, we randomly choose 25% of users, and ask each of them one question chosen by the question selection algorithm under evaluation. The CMF model is then updated with answers estimated by the pre-trained model using 80% (70%) of all available training data.

  3. Repeat this active learning process for 25 iterations.

  4. Compute the score on the test dataset with 50 trials of Monte Carlo runs.

In Figures 2(a) and 2(b), we compare Algorithm 1 with the four active learning baselines described in Section 5.3 on the Urbana-Champaign dataset. Figure 2(a) shows the performance of models trained using exclusively the ratings matrix, and the models in Figure 2(b) are trained using all three matrices collectively. Moreover, Figures 2(c) and 2(d) show experimental results on the synthetic dataset.

We first observe that scores achieved by the lower bound using only in Figures 2(a) and 2(c) are and respectively. The lower bound in Figures 2(b) and 2(d) using , and are and respectively, showing that incorporating different types of information significantly improves the prediction accuracy.

We also note that the upper bound performance can be achieved with fewer samples by using active learning. Specifically, in Figures 2(a) and 2(b), the upper bound is trained with ratings (80% of all data), but the active learning algorithms requires ratings (start from 20% of all data, combined with 25 rounds of updates). The data efficiency improvements can also be observed in Figures 2(c) and 2(d) for the synthetic dataset.

Moreover, our Fisher information based algorithm is strictly better than all the other baseline algorithms in all cases. The improvement of the Fisher information based algorithm is smaller when using all three matrices collectively: by utilizing different levels of information, the lower bound achieved in this setting is already quite high, which limits the potential improvements of active learning.

6.2 Cold Start Setting

Another major challenge faced by recommendation systems is to predict ratings for new users for which no reviews or ratings have been observed. To compare the performance of different active learning strategies in the cold-start setting, we carry out the following experiment:

  1. We randomly choose 20% (40%) of users as the cold-start users; the training dataset contains no records for these cold-start users. The test set is constructed from 50% of the cold-start users' data.

  2. In each iteration, we ask every cold-start user one active learning question chosen by the question selection algorithm under evaluation. The CMF model is then updated with answers estimated by the pre-trained model using 50% of available training data for cold-start users.

  3. Repeat this active learning process for 15 rounds.

  4. Compute the score on the test dataset for the cold-start users with 50 trials of Monte Carlo runs.

In Figures 3(a) and 3(b), we compare our Fisher information based active learning algorithm with the baselines in the cold start setting using both the Yelp dataset and the synthetic dataset.

Since there are no records about the cold-start users in the training set, the lower bound in these figures starts from the score of with pure guessing. However, a reasonable score of can be achieved after just 2 queries with Algorithm 1 in both figures, which demonstrates the applicability of our results even in the cold-start setting.

We note that the Maximum Model Change baseline works quite well in Figures 3(a) and 3(b); one possible interpretation is that exploring user preference is more important than minimizing the noise of the response in the cold-start setting. However, note that our proposed method does not require changing querying functions between the cold-start and general settings, as it achieves strong performance in both cases.

6.3 Noise Tolerance

To study the performance of different active learning strategies when only noisy responses are available, we modify the experiment described in Section 6.1. Instead of updating the model with answers estimated by the pre-trained CMF model, we regenerate samples using the same distribution used to construct the synthetic dataset in Section 5.2. Thus, the user response to the same question may differ (i.e., be flipped) across Monte Carlo runs, with results shown in Figure 4.


Figure 4: $F_1$ score comparison of different active learning strategies with noisy answers. Upper bound is around 0.80.

Comparing Figure 4 with Figure 2(c), the performance of all the algorithms decreases due to the influence of the noisy responses. The proposed Fisher information based algorithm still outperforms all the others in this situation, demonstrating the robustness of our method.

7 Conclusions

In this paper, we consider the question selection problem for recommendation systems with multi-level user preferences. Building on the CMF framework, we provide a theoretical analysis of an optimal active learning strategy with a realizable approximation algorithm. Our experiments on synthetic and Yelp data demonstrate that the proposed algorithm performs well in different practical settings. In the future, we plan to address the problem of noisy responses in active learning, especially when the quality of a response is related to the type of question. This would be particularly relevant in the context of dialogue managers for modeling, in a more abstract setting, the loop of natural language generation, user response, and natural language understanding within conversational recommendation systems.

References

  • [Anava et al.2015] Anava, O.; Golan, S.; Golbandi, N.; Karnin, Z.; Lempel, R.; Rokhlenko, O.; and Somekh, O. 2015. Budget-constrained item cold-start handling in collaborative filtering recommenders via optimal design. In Proceedings of the 24th International Conference on World Wide Web, 45–54. International World Wide Web Conferences Steering Committee.
  • [Bhargava, Ganti, and Nowak2017] Bhargava, A.; Ganti, R.; and Nowak, R. 2017. Active positive semidefinite matrix completion: Algorithms, theory and applications. In Proc. of the International Conference on Artificial Intelligence and Statistics (AISTATS), 1349–1357.
  • [Bresler, Chen, and Shah2014] Bresler, G.; Chen, G. H.; and Shah, D. 2014. A latent source model of online collaborative filtering. In Advances in Neural Information Processing Systems (NIPS).
  • [Chaudhuri et al.2015] Chaudhuri, K.; Kakade, S. M.; Netrapalli, P.; and Sanghavi, S. 2015. Convergence rates of active learning for maximum likelihood estimation. In Advances in Neural Information Processing Systems (NIPS), 1090–1098.
  • [Chen and Karger2006] Chen, H., and Karger, D. R. 2006. Less is more: Probabilistic models for retrieving fewer relevant documents. In Proc. of the International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 429–436.
  • [Christakopoulou, Radlinski, and Hofmann2016] Christakopoulou, K.; Radlinski, F.; and Hofmann, K. 2016. Towards conversational recommender systems. In KDD, 815–824.
  • [Elahi, Ricci, and Rubens2016] Elahi, M.; Ricci, F.; and Rubens, N. 2016. A survey of active learning in collaborative filtering recommender systems. Computer Science Review 20:29–50.
  • [Ferguson2017] Ferguson, T. S. 2017. A course in large sample theory. Routledge.
  • [Gupta and Singh2015] Gupta, N., and Singh, S. 2015. Collective factorization for relational data: An evaluation on the yelp datasets. Technical report, Yelp Dataset Challenge, Round 4.
  • [Koren, Bell, and Volinsky2009] Koren, Y.; Bell, R.; and Volinsky, C. 2009. Matrix factorization techniques for recommender systems. Computer 42(8):30–37.
  • [Mahmood and Ricci2009] Mahmood, T., and Ricci, F. 2009. Improving recommender systems with adaptive conversational strategies. In Hypertext.
  • [McAuley, Pandey, and Leskovec2015] McAuley, J.; Pandey, R.; and Leskovec, J. 2015. Inferring networks of substitutable and complementary products. In Proc. of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 785–794.
  • [Pazzani and Billsus2007] Pazzani, M. J., and Billsus, D. 2007. Content-based recommendation systems. In The Adaptive Web.
  • [Rashid, Karypis, and Riedl2008] Rashid, A. M.; Karypis, G.; and Riedl, J. 2008. Learning preferences of new users in recommender systems: An information theoretic approach. SIGKDD Explorations Newsletter 10(1):90–100.
  • [Rubens, Kaplan, and Sugiyama2011] Rubens, N.; Kaplan, D.; and Sugiyama, M. 2011. Active learning in recommender systems. In Recommender Systems Handbook. Springer. 735–767.
  • [Settles2009] Settles, B. 2009. Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison.
  • [Singh and Gordon2008] Singh, A. P., and Gordon, G. J. 2008. Relational learning via collective matrix factorization. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, 650–658. ACM.
  • [Sourati et al.2017] Sourati, J.; Akcakaya, M.; Leen, T. K.; Erdogmus, D.; and Dy, J. G. 2017. Asymptotic analysis of objectives based on fisher information in active learning. Journal of Machine Learning Research (JMLR) 18:1–41.
  • [Sun et al.2013] Sun, M.; Li, F.; Lee, J.; Zhou, K.; Lebanon, G.; and Zha, H. 2013. Learning multiple-question decision trees for cold-start recommendation. In Proc. of the ACM International Conference on Web Search and Data Mining (WSDM), 445–454.
  • [Young et al.2013] Young, S.; Gašić, M.; Thomson, B.; and Williams, J. D. 2013. POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE 101(5):1160–1179.
  • [Zhou, Yang, and Zha2011] Zhou, K.; Yang, S.; and Zha, H. 2011. Functional matrix factorizations for cold-start recommendation. In Proc. of the International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 315–324.

Appendix A Regularity conditions for Asymptotic Normality of ML estimator

The following regularity conditions are needed to establish the standard asymptotic normality of the ML estimator [Ferguson2017].

Assumption 3

(Regularity conditions for ML estimation)

  1. Identifiability: , for .

  2. Compactness: is an interior point of the compact set .

  3. Smoothness: is smooth, such that the first, second and third derivatives of exist.

  4. Strong Convexity: The Fisher information matrix of is positive definite.

  5. Boundedness: .
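As a rough numerical illustration of what these conditions deliver, the sketch below checks asymptotic normality on a one-parameter Bernoulli model. The Bernoulli model is only a stand-in chosen because its ML estimate and Fisher information have closed forms; it is not the CMF model, and the names `fisher_information_bernoulli` and `simulate_scaled_mle_errors` are hypothetical helpers. Under the regularity conditions, the variance of sqrt(n) times the estimation error should approach the inverse Fisher information, which is positive here as in condition 4.

```python
import numpy as np

def fisher_information_bernoulli(theta):
    """Fisher information of one Bernoulli(theta) observation."""
    return 1.0 / (theta * (1.0 - theta))

def simulate_scaled_mle_errors(theta, n, trials, seed=0):
    """Return sqrt(n) * (theta_hat - theta) for independent trials.

    For a Bernoulli model the ML estimate is the sample mean, so each
    trial only needs the number of successes out of n draws.
    """
    rng = np.random.default_rng(seed)
    successes = rng.binomial(n, theta, size=trials)
    theta_hat = successes / n
    return np.sqrt(n) * (theta_hat - theta)

theta_true = 0.3  # illustrative value, strictly inside (0, 1)
errors = simulate_scaled_mle_errors(theta_true, n=1000, trials=4000)

# Asymptotic normality predicts variance 1 / I(theta) for the scaled error.
empirical_var = errors.var()
predicted_var = 1.0 / fisher_information_bernoulli(theta_true)
print(f"empirical variance: {empirical_var:.4f}")
print(f"inverse Fisher information: {predicted_var:.4f}")
```

The two printed values should be close for large n; the agreement degrades as `theta_true` approaches 0 or 1, where the parameter is no longer well inside the parameter space.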

Note that in the original CMF model (1), parameter is not unique and the parameter space is not compact. We avoid this problem by focusing on the regularized estimation in (6), and further assume that . Thus, it can be verified that Assumption 3 holds for (6), which ensures the asymptotic normality.
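The non-uniqueness and non-compactness mentioned above can be seen in a small sketch. It assumes, as in standard matrix factorization, that predictions depend on the latent factors only through a product of the form U Vᵀ, and it uses a squared-norm penalty as a stand-in regularizer; the exact forms of (1) and (6) are not reproduced here, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, rank = 5, 4, 3

# Hypothetical latent factors for users and items.
U = rng.normal(size=(n_users, rank))
V = rng.normal(size=(n_items, rank))

# A random orthogonal matrix obtained from a QR decomposition.
Q, _ = np.linalg.qr(rng.normal(size=(rank, rank)))

# Rotating both factor matrices leaves every prediction unchanged,
# so the factors are not identifiable from the predictions alone.
U_rot, V_rot = U @ Q, V @ Q
same_pred_rotation = np.allclose(U @ V.T, U_rot @ V_rot.T)

# Rescaling by any c > 0 also leaves predictions unchanged, and c can
# be arbitrarily large, so the unregularized parameter set is unbounded.
c = 10.0
U_scaled, V_scaled = c * U, V / c
same_pred_scaling = np.allclose(U @ V.T, U_scaled @ V_scaled.T)

def l2_penalty(A, B):
    """Sum of squared entries of both factor matrices (assumed regularizer)."""
    return np.sum(A ** 2) + np.sum(B ** 2)

penalty_original = l2_penalty(U, V)
penalty_scaled = l2_penalty(U_scaled, V_scaled)
penalty_rotated = l2_penalty(U_rot, V_rot)

print(same_pred_rotation, same_pred_scaling)
print(penalty_original, penalty_scaled, penalty_rotated)
```

In this sketch the penalty grows under the rescaling but is unchanged by the rotation, so a squared-norm regularizer by itself discourages the unbounded direction while leaving the rotational ambiguity in place.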

Appendix B Useful lemma

Lemma 1

[Chaudhuri et al.2015] Suppose are random functions drawn i.i.d. from a distribution, where . Denote and be another function. Let , and . Assume:

  1. Assumption 3 holds for .

  2. Assumption 2 holds for and .

  3. For , we need .

  4. There exists a neighborhood of and a constant , such that are -Lipschitz, namely

    hold with probability one, for .

Choose and define , where is an appropriately chosen constant. Let be another appropriately chosen constant. If is large enough so that , then:

where .

Appendix C Proof Sketch for Theorem 1

We first let , , then

(12)

where and , for , then and let . Using the notation in Section 4.2, we can compute that

Using Assumption 2 and the condition in Theorem 1 that , we see that this satisfies the hypothesis of Lemma 1 with constants

We now apply Lemma 1 to prove Theorem 1 and it can be verified that

(13)