Matrix factorization CanRec09 ; KesMonOh08 is a popular method for building recommender systems. As exemplified by the Netflix Prize competition NetflixPrize , it has become a dominant technology within collaborative-filtering recommenders. Matrix factorization provides better predictive accuracy than classical neighborhood methods while remaining scalable and offering much flexibility for modeling a variety of real-life situations KorBelVol09 .
The cold-start problem
A major problem facing collaborative-filtering recommender systems is how to provide recommendations when rating data is too sparse for a subset of users or items. As a special case, the so-called cold-start problem SPUP02 is how to make recommendations to new users who have not yet rated any item or to deal with new items that have not yet been rated by users.
The cold-start problem is usually addressed by incorporating additional input sources to compensate for the lack of rating data. In addition to ratings, the analyst may for example collect certain user attributes, such as gender, age, or other demographic information AdoTuz05 ; Koren08 .
Another approach for dealing with the cold-start problem is to ask users to rate a minimum number of (well chosen) items RACLMKR02 .
Relying on additional input sources to address the cold-start problem may be difficult to deploy in practice, as privacy-conscious users may be reluctant to supply some of their attributes. The second approach does not require collecting extra information beyond the ratings and is very efficient. Unfortunately, the additional ratings received may reveal a lot of information about a user to the analyst. Recent research has indeed demonstrated that the analyst can use them to infer private user attributes such as political affiliation KosStiGra13 ; SAPBDKOT13 , sexual orientation KosStiGra13 , age WBIT12 , gender SAPBDKOT13 ; WBIT12 , and even drug use KosStiGra13 . Further privacy threats are reported in BWIT14 ; CKNFS11 ; SP:NarShm08 .
A natural question therefore raised in IMWBFT14 is how a privacy-conscious user can benefit from recommender systems while preventing the inference of her private information.
In this paper, we show how a privacy-conscious user can learn her profile without revealing any information to the analyst. The protocol is practical and proven secure against semi-honest adversaries. Its communication complexity grows only with the square root of the number of items.
Once the privacy-conscious user learns her profile, she can run a straightforward protocol to learn the predicted rating of any item in the database. This only requires computing the inner product between the user profile (known to the user) and the item profile (known to the analyst) (CCS:NIWJTB13, Sect. 4.1).
In IMWBFT14 , Ioannidis et al. propose a learning protocol which enables the user to prevent the analyst from learning some (previously defined) private user attributes. This protocol perfectly hides these chosen attributes from the analyst, in an information-theoretic way. The authors also prove that, when the analyst ends up knowing the resulting profile, no such protocol can be more accurate, nor can it disclose less information for the same accuracy.
Unfortunately, this protocol also has several drawbacks, most of them inherent to the fact that it is information-theoretically secure and does not rely on computational assumptions. First, this protocol still needs to disclose some information about the analyst database to everybody. Second, this protocol is not as accurate as a non-privacy-preserving protocol would be. This is inherent to the fact that Ioannidis et al. restricted themselves to protocols where the analyst learns an approximate profile of the user at the end, so that the resulting user profile must not contain any information about the private attribute. Third, it can only hide a small fixed set of attributes: all attributes that are not explicitly hidden may be recovered by the analyst. Moreover, it may be hard for a user to decide which attributes are really important to her, due to the wide range of possible attributes. Finally, the analyst needs to ask users (see Footnote 1) to reveal which attributes they deem private. This may not only bother these users considerably, but also raises the question of the reliability of these data. No user is likely to admit that she is a drug addict, for example, even if she is assured that this data will not be disclosed (see Footnote 2).
Footnote 1: In the simplest scenario, we have to restrict to non-privacy-conscious users. But it would also be possible to compute item profiles using privacy-preserving matrix factorization CCS:NIWJTB13 .
Footnote 2: Notice in particular that, in the privacy-preserving matrix factorization protocol in CCS:NIWJTB13 , in case of collusion between the CSP (Crypto Service Provider) and the analyst, it is possible to recover all data sent by the user. This means that governmental agencies may force the recommendation systems to disclose these private user attributes.
$\mathbb{R}$ is the field of real numbers. For any integer $n$, $\mathbb{Z}_n$ is the ring of integers modulo $n$, while $\mathbb{Z}_n^*$ is its multiplicative group. Vectors are always column vectors and are denoted as $u$ or $v$. Matrices are denoted with capital letters.
2.1 Cryptographic tools
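A public-key encryption scheme is defined by three algorithms: $\mathsf{KeyGen}$, $\mathsf{Enc}$, and $\mathsf{Dec}$. $\mathsf{KeyGen}(1^\kappa)$ generates a matching pair of public key $pk$ and secret key $sk$, given a security parameter $\kappa$ (unary notation). The public key is used to encrypt a message $m$ into a ciphertext $c$: $c = \mathsf{Enc}_{pk}(m)$. The secret key is used to decrypt a ciphertext $c$: $m = \mathsf{Dec}_{sk}(c)$. We assume that the encryption scheme is perfectly correct and semantically secure (i.e., IND-CPA) GolMic84 .
An additively homomorphic encryption scheme is such that the message set is an additive group, and there exists a randomized operation $\boxplus$ such that $\mathsf{Enc}_{pk}(m_1) \boxplus \mathsf{Enc}_{pk}(m_2)$ is distributed identically to a fresh ciphertext of $m_1 + m_2$. This operation can be extended to a scalar multiplication by an integer $k$: $k \boxdot \mathsf{Enc}_{pk}(m)$ is a fresh ciphertext of $k \cdot m$; that is, $k \boxdot c = c \boxplus \cdots \boxplus c$ ($k$ times).
To simplify the notation, we will sometimes use $[\![m]\!]$ for $\mathsf{Enc}_{pk}(m)$ and omit $pk$ when clear from the context. We thus have $[\![m_1]\!] \boxplus [\![m_2]\!] = [\![m_1 + m_2]\!]$ and $k \boxdot [\![m]\!] = [\![k \cdot m]\!]$.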
Example 1 (Paillier encryption scheme).
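We recall the Paillier encryption scheme EC:Paillier99 , which is an additively homomorphic encryption scheme that is semantically secure under the Decisional Composite Residuosity (DCR) assumption. $\mathsf{KeyGen}(1^\kappa)$ generates two large equal-length primes $p$ and $q$, computes $N = pq$, and sets $\lambda = \mathrm{lcm}(p-1, q-1)$ and $\mu = \lambda^{-1} \bmod N$. The public key is $pk = N$ while the secret key is $sk = (\lambda, \mu)$. $\mathsf{Enc}_{pk}(m)$ picks a uniformly random integer $r \in \mathbb{Z}_N^*$ and returns $c = (1+N)^m \, r^N \bmod N^2$. $\mathsf{Dec}_{sk}(c)$ returns $m = L(c^\lambda \bmod N^2) \cdot \mu \bmod N$ where $L(x) = (x-1)/N$. The scheme is additively homomorphic: given $c_1 = [\![m_1]\!]$ and $c_2 = [\![m_2]\!]$, $c_1 \boxplus c_2 = c_1 \, c_2 \, r^N \bmod N^2$ with $r$ uniformly random in $\mathbb{Z}_N^*$.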
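To make the homomorphic operations concrete, here is a minimal toy sketch of a Paillier-style scheme in Python. It assumes the common variant with generator $1+N$; the function names (`keygen`, `encrypt`, `decrypt`, `add`, `scalar_mul`) and the tiny primes are illustrative assumptions only, and the parameters are far too small to be secure.

```python
import math
import secrets

def keygen(p, q):
    """Toy key generation from two given primes (real keys use large random primes)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)                               # inverse of lambda mod n
    return n, (lam, mu)

def rand_unit(n):
    """Sample a uniformly random element of Z_n^*."""
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            return r

def encrypt(n, m):
    """c = (1+n)^m * r^n mod n^2 with fresh randomness r."""
    n2 = n * n
    return pow(1 + n, m, n2) * pow(rand_unit(n), n, n2) % n2

def decrypt(n, sk, c):
    """m = L(c^lambda mod n^2) * mu mod n, where L(x) = (x - 1) / n."""
    lam, mu = sk
    return (pow(c, lam, n * n) - 1) // n * mu % n

def add(n, c1, c2):
    """Homomorphic addition, re-randomized so the result looks fresh."""
    n2 = n * n
    return c1 * c2 * pow(rand_unit(n), n, n2) % n2

def scalar_mul(n, k, c):
    """Homomorphic multiplication of the plaintext by the integer k."""
    n2 = n * n
    return pow(c, k, n2) * pow(rand_unit(n), n, n2) % n2

n, sk = keygen(1009, 1013)          # toy primes, NOT secure
c1, c2 = encrypt(n, 20), encrypt(n, 22)
print(decrypt(n, sk, add(n, c1, c2)))        # sum of the plaintexts
print(decrypt(n, sk, scalar_mul(n, 3, c1)))  # 3 times the first plaintext
```

Multiplying ciphertexts adds plaintexts, and exponentiating a ciphertext multiplies the plaintext by a scalar; the extra `r^n` factor only re-randomizes the result.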
A $1$-out-of-$m$ oblivious transfer (OT) protocol is a cryptographic protocol between two parties: a sender and a receiver. The receiver has an index $i \in \{1,\dots,m\}$ as input. The sender knows a database $(x_1,\dots,x_m)$. At the end of the protocol, the receiver learns $x_i$, while the sender learns nothing.
As in our protocol $m$ is the number of items in the database, we need to use practical OT protocols with communication complexity sublinear in $m$. We propose to use as $1$-out-of-$m$ OT the basic PIR (Private Information Retrieval) protocol in (PKC:OstSke07, Sect. 2.2) using the Paillier homomorphic encryption scheme, together with a classical OT protocol SODA:NaoPin01 which is used to mask the PIR database. The resulting OT has two rounds (one message from the receiver to the sender followed by one message from the sender to the receiver) and its communication complexity is proportional to $\sqrt{m}$.
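The square-root communication of such a PIR can be illustrated with a small sketch. It assumes the database is arranged as a grid of width about $\sqrt{m}$ and uses a toy Paillier instance; the helper names (`to_grid`, `query`, `answer`, `retrieve`) are hypothetical. The masking step that turns the PIR into an OT is deliberately omitted, so in this sketch the receiver could decrypt every reply and learn a whole column.

```python
import math
import secrets

# Toy Paillier parameters (far too small for real use).
P, Q = 1009, 1013
N = P * Q
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)
MU = pow(LAM, -1, N)

def enc(m):
    """Paillier encryption of m with fresh randomness."""
    while True:
        r = secrets.randbelow(N - 1) + 1
        if math.gcd(r, N) == 1:
            return pow(1 + N, m, N2) * pow(r, N, N2) % N2

def dec(c):
    return (pow(c, LAM, N2) - 1) // N * MU % N

def to_grid(db):
    """Arrange a non-empty database as rows of width ceil(sqrt(len(db)))."""
    width = math.isqrt(len(db) - 1) + 1
    padded = db + [0] * (-len(db) % width)
    return [padded[i:i + width] for i in range(0, len(padded), width)], width

def query(col, width):
    """Receiver: encrypted unit vector selecting one column."""
    return [enc(1 if c == col else 0) for c in range(width)]

def answer(grid, q):
    """Sender: one ciphertext per row, encrypting the row entry at the hidden column."""
    replies = []
    for row in grid:
        acc = enc(0)
        for x, e in zip(row, q):
            acc = acc * pow(e, x, N2) % N2  # adds x * (selection bit) to the plaintext
        replies.append(acc)
    return replies

def retrieve(db, i):
    """Run both sides locally and return db[i] as seen by the receiver."""
    grid, width = to_grid(db)
    row, col = divmod(i, width)
    replies = answer(grid, query(col, width))
    return dec(replies[row])

db = [100 + 7 * i for i in range(20)]
print(retrieve(db, 13), db[13])
```

Both messages contain roughly $\sqrt{m}$ ciphertexts, which is where the square-root communication comes from.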
2.2 Matrix factorization
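The goal of matrix factorization is to predict unobserved ratings $r_{ij}$ for some user $i \in \{1,\dots,n\}$ and some item $j \in \{1,\dots,m\}$, given access to a set $\mathcal{M}$ of user/item pairs $(i,j)$ for which a rating has been generated. Matrix factorization provides $d$-dimensional vectors $u_i, v_j \in \mathbb{R}^d$ such that
$$r_{ij} \approx \langle u_i, v_j \rangle = u_i^{\top} v_j. \tag{1}$$
This allows the analyst to predict missing ratings (i.e., those with $(i,j) \notin \mathcal{M}$). Vector $u_i$ is referred to as the profile of user $i$ while vector $v_j$ is referred to as the profile of item $j$.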
2.3 Learning the profile of a user
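Specifically, when a new user wishes to use the service, she submits a batch of ratings $(r_j)_{j \in S}$ for a subset $S \subseteq \{1,\dots,m\}$ of items. Upon receiving these ratings, the analyst can estimate her profile $u$ through the following least-squares estimation,
$$u = \arg\min_{u' \in \mathbb{R}^d} \sum_{j \in S} \bigl(r_j - \langle u', v_j \rangle\bigr)^2,$$
and subsequently predict ratings for items $j \notin S$, using Eq. (1).
Defining matrix $V = (v_j)_{j \in S} \in \mathbb{R}^{d \times |S|}$, whose columns are the profiles of the rated items, and column vector $r = (r_j)_{j \in S}$, the profile of a user can be computed as follows:
$$u = (V V^{\top})^{-1} V r.$$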
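The closed-form least-squares solution can be checked with a short exact-arithmetic sketch. It assumes the normal equations $(\sum_{j} v_j v_j^{\top})\, u = \sum_{j} r_j v_j$ over the rated items; the function names (`solve`, `learn_profile`, `predict`) and the toy data are illustrative, not taken from the paper.

```python
from fractions import Fraction

def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination over the rationals.

    Raises StopIteration if A is singular (no pivot can be found).
    """
    d = len(A)
    M = [[Fraction(x) for x in row] + [Fraction(y)] for row, y in zip(A, b)]
    for col in range(d):
        pivot = next(r for r in range(col, d) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        inv = 1 / M[col][col]
        M[col] = [x * inv for x in M[col]]
        for r in range(d):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[r][d] for r in range(d)]

def learn_profile(item_profiles, ratings):
    """Return u minimizing sum_j (r_j - <u, v_j>)^2 via the normal equations."""
    d = len(item_profiles[0])
    A = [[sum(v[a] * v[b] for v in item_profiles) for b in range(d)] for a in range(d)]
    b = [sum(r * v[a] for v, r in zip(item_profiles, ratings)) for a in range(d)]
    return solve(A, b)

def predict(u, v):
    """Predicted rating: inner product of user and item profiles, as in Eq. (1)."""
    return sum(x * y for x, y in zip(u, v))

rated_items = [(1, 0), (0, 1), (1, 1)]   # toy item profiles, d = 2
u = learn_profile(rated_items, [2, 3, 5])
print(u, predict(u, (2, 1)))
```

Using `Fraction` keeps the result exact, which mirrors the fact that the protocol later recovers rational entries rather than floating-point approximations.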
3 Our learning protocol
We design a two-round learning protocol between a privacy-conscious user and an analyst, allowing the user to learn her profile $u$ from her (private) ratings $(r_j)_{j \in S}$, where $S \subseteq \{1,\dots,m\}$. At the end of the protocol, the analyst will learn nothing (except the size of $S$), while the user will only learn her profile $u$ and nothing else about the analyst database (except the dimension $d$, the database of items and its size $m$, and bounds $B_r$ and $B_v$ on entries of ratings $r_j$ and of profiles of items $v_j$, respectively).
We insist that our protocol hides the set $S$ of actual items that the user is rating, as this set might already leak significant information about her. If an upper bound $s$ on $|S|$ is known, the exact size of $S$ can trivially be masked by adding fake items (with profile $0$ and fake rating $0$) so that the protocol always uses a set of size $s$.
Consider the ring $\mathbb{Z}_N$. We assume that $N$ is either a prime or is hard to factor, so that for all intents and purposes $\mathbb{Z}_N$ behaves as a field (since a non-zero non-invertible element of $\mathbb{Z}_N$ would yield a factor of $N$). Up to using fixed-point arithmetic (e.g., by multiplying values by some integer scaling factor), we suppose that the entries of $r_j$ and $v_j$ are integers, and so can be considered as elements of $\mathbb{Z}_N$.
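A fixed-point embedding of this kind might look as follows. This is only a sketch under assumptions: the scaling factor `scale`, the prime modulus, and the convention of reading residues above $N/2$ as negative numbers are illustrative choices, not details given in the text.

```python
N = 2**61 - 1  # a prime modulus, chosen here only for illustration

def encode(x, scale, modulus=N):
    """Map a real number to Z_N as round(x * scale) mod N."""
    return round(x * scale) % modulus

def decode(a, scale, modulus=N):
    """Centered lift: residues above N/2 are interpreted as negative integers."""
    if a > modulus // 2:
        a -= modulus
    return a / scale

scale = 2**8
a, b = encode(3.25, scale), encode(-2.0, scale)
print(decode(a, scale), decode(b, scale))
# A product of two encoded values carries the scale twice.
print(decode(a * b % N, scale * scale))
```

The last line shows why the modulus must leave enough headroom: every multiplication increases the effective scale, so intermediate values grow even when the real-valued quantities stay small.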
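The user generates a key pair $(pk, sk)$ for the homomorphic encryption scheme and encrypts her ratings $r_{j_1},\dots,r_{j_s}$, where $S = \{j_1,\dots,j_s\}$: $c_k = [\![r_{j_k}]\!]$ for $k = 1,\dots,s$. She also initiates $s$ independent OT protocols as a receiver with respective selection indexes $j_1,\dots,j_s$.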
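The analyst generates $R$, $(P_k)_k$, and $(q_k)_k$, and computes, for every $k \in \{1,\dots,|S|\}$ and every item $j \in \{1,\dots,m\}$, the following matrices and vectors (over $\mathbb{Z}_N$ and over the ciphertext space, respectively):
$$A_{k,j} = R \, v_j v_j^{\top} + P_k, \qquad [\![b_{k,j}]\!] = (R \, v_j) \boxdot [\![r_{j_k}]\!] \boxplus [\![q_k]\!],$$
where $[\![r_{j_k}]\!]$ is the $k$-th encrypted rating received from the user, $(P_k)_k$ and $(q_k)_k$ are uniformly random matrices and vectors summing up to zero in $\mathbb{Z}_N^{d \times d}$ and $\mathbb{Z}_N^{d}$ respectively, and $R$ is a uniform matrix in $\mathrm{GL}_d(\mathbb{Z}_N)$ (the group of invertible matrices in $\mathbb{Z}_N^{d \times d}$).
Footnote 4: We slightly abuse notation here. For vectors, the bracket notation and the $\boxplus$ and $\boxdot$ operators are applied component-wise.
The analyst then answers the $k$-th OT message from the user, as an OT sender with database $(A_{k,j}, [\![b_{k,j}]\!])_{j \in \{1,\dots,m\}}$.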
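The effect of the random invertible matrix and of the zero-sum masks can be simulated in plaintext. The sketch below is an assumption-laden illustration, not the protocol itself: it uses no encryption and no OT, takes $N$ prime, and assumes that the entry obtained for the $k$-th rating has the shape $R \cdot [\, v v^{\top} \mid r\, v \,] + \text{mask}_k$, a shape inferred from the surrounding description. All function names are hypothetical.

```python
import secrets

N = 2**61 - 1  # prime, so Z_N is a field in this toy simulation

def rand_matrix(rows, cols):
    return [[secrets.randbelow(N) for _ in range(cols)] for _ in range(rows)]

def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) % N for col in zip(*Y)] for row in X]

def add(X, Y):
    return [[(a + b) % N for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def solve_aug(M):
    """Solve the augmented system [A | b] over Z_N; return None if A is singular."""
    M = [row[:] for row in M]
    d = len(M)
    for col in range(d):
        pivot = next((r for r in range(col, d) if M[r][col] % N), None)
        if pivot is None:
            return None
        M[col], M[pivot] = M[pivot], M[col]
        inv = pow(M[col][col], -1, N)
        M[col] = [x * inv % N for x in M[col]]
        for r in range(d):
            if r != col:
                f = M[r][col]
                M[r] = [(x - f * y) % N for x, y in zip(M[r], M[col])]
    return [M[r][d] for r in range(d)]

def random_invertible(d):
    """Rejection-sample a uniform element of GL_d(Z_N)."""
    while True:
        R = rand_matrix(d, d)
        if solve_aug([row + [0] for row in R]) is not None:
            return R

def augmented(v, r):
    """[v v^T | r v] as a d x (d+1) integer matrix."""
    return [[vi * vj for vj in v] + [r * vi] for vi in v]

def simulate(items, chosen, ratings):
    """Return the profile (mod N) that the user derives from her masked entries."""
    d, s = len(items[0]), len(chosen)
    R = random_invertible(d)
    masks = [rand_matrix(d, d + 1) for _ in range(s - 1)]
    masks.append([[-sum(m[a][b] for m in masks) % N for b in range(d + 1)]
                  for a in range(d)])  # last mask makes all masks sum to zero
    total = [[0] * (d + 1) for _ in range(d)]
    for k, (j, r) in enumerate(zip(chosen, ratings)):
        entry = add(matmul(R, augmented(items[j], r)), masks[k])  # what OT number k returns
        total = add(total, entry)
    return solve_aug(total)  # masks cancel; R cancels when solving the system

items = [(1, 0), (0, 1), (1, 1), (2, 1)]
print(simulate(items, chosen=[0, 1, 2], ratings=[2, 3, 5]))
```

Each individual entry is uniformly masked, the masks cancel only in the sum, and the common factor $R$ disappears when the linear system is solved, so the result equals the unmasked least-squares solution modulo $N$.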
Bounds for correctness
The scheme is correct when the above rational reconstruction succeeds. From WanGuyDav82 ; FC:FouSteWac02 and Hadamard’s inequality, we can show correctness when , where and are upper bounds on the absolute values of the coefficients of the item profiles and of the ratings , respectively. For example, if , , , , this is already satisfied for an integer of bits.
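Rational reconstruction itself can be sketched with the extended Euclidean algorithm, in the spirit of WanGuyDav82 . The function below is an illustrative implementation under the usual assumption that twice the product of the numerator and denominator bounds is smaller than the modulus; the name `rational_reconstruct` and its parameters are not taken from the paper.

```python
import math

def rational_reconstruct(a, modulus, num_bound, den_bound):
    """Find (n, d) with n = a * d (mod modulus), |n| <= num_bound, 0 < d <= den_bound.

    Assumes 2 * num_bound * den_bound < modulus, which makes the answer unique.
    Returns None when no such fraction in lowest terms exists.
    """
    r0, r1 = modulus, a % modulus
    t0, t1 = 0, 1
    # Invariant: r_i = t_i * a (mod modulus) at every step.
    while r1 > num_bound:
        q = r0 // r1
        r0, r1 = r1, r0 - q * r1
        t0, t1 = t1, t0 - q * t1
    if t1 == 0 or abs(t1) > den_bound or math.gcd(r1, abs(t1)) != 1:
        return None
    return (r1, t1) if t1 > 0 else (-r1, -t1)

N = 2**61 - 1
a = 7 * pow(3, -1, N) % N           # the residue representing 7/3
print(rational_reconstruct(a, N, 1000, 1000))
b = -5 * pow(4, -1, N) % N          # the residue representing -5/4
print(rational_reconstruct(b, N, 1000, 1000))
```

The modulus has to be large enough for the bounds on the numerators and denominators of the profile entries, which is what the correctness condition on $N$ expresses.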
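Security against semi-honest adversaries follows from the security of the OT protocol, the IND-CPA property of the homomorphic encryption scheme, and from the following fact: since the masking matrix $R$ is invertible and $\mathrm{GL}_d(\mathbb{Z}_N)$ is a group, the masked matrix and the masked vector that the user obtains after summing the values she received only reveal her profile $u$.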
3.2 Instantiation using Paillier homomorphic encryption scheme
The scheme can be instantiated using the Paillier encryption scheme and the OT described in Section 2. We can use the internal construction of the OT to avoid sending ciphertexts of the ratings. Concretely, in the OT construction, the user encrypts a vector used to “select” the correct value to be received. If we use two OT protocols for each rating $r_{j_k}$, one for the matrix part and one for the vector part of the database entries (instead of a single one for the pair), then for the second OT, the user just encrypts $r_{j_k}$ times the selection vector instead of the selection vector itself, so that she receives $r_{j_k}$ times the value to be received.
The resulting protocol for items, dimension , and ratings from the user (modulus of size bits for Paillier encryption scheme and an elliptic curve over a 256-bit prime field for the base OT SODA:NaoPin01 ), has the following performance on a non-optimized single-thread implementation (on a laptop, CPU Intel® i7-7567U, GHz, turbo GHz): less than s to generate the first round by the user, less than s to generate the second round by the analyst, less than s to finalize the protocol by the user. The user requires less than s of computation (excluding communication). The analyst time is mostly spent in the exponentiations required in the OT protocol (modulo ): there are of them. These exponentiations can be trivially parallelized. The communication complexity is less than MB and essentially grows linearly with .
- (1) Gediminas Adomavicius and Alexander Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6):734–749, June 2005.
- (2) Smriti Bhagat, Udi Weinsberg, Stratis Ioannidis, and Nina Taft. Recommending with an agenda: Active learning of private attributes using matrix factorization. In Alfred Kobsa et al., editors, 8th ACM Conference on Recommender Systems (RecSys 2014), pages 65–72. ACM Press, October 2014.
- (3) Joseph A. Calandrino, Ann Kilzer, Arvind Narayanan, Edward W. Felten, and Vitaly Shmatikov. “You Might Also Like:” Privacy risks of collaborative filtering. In 2011 IEEE Symposium on Security and Privacy (S&P 2011), pages 231–246. IEEE Press, May 2011.
- (4) Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.
- (5) Pierre-Alain Fouque, Jacques Stern, and Jan-Geert Wackers. Cryptocomputing with rationals. In Matt Blaze, editor, 6th International Conference on Financial Cryptography (FC 2002), volume 2357 of LNCS, pages 136–146. Springer, March 2003.
- (6) Shafi Goldwasser and Silvio Micali. Probabilistic encryption. Journal of Computer and System Sciences, 28(2):270–299, 1984.
- (7) Stratis Ioannidis, Andrea Montanari, Udi Weinsberg, Smriti Bhagat, Nadia Fawaz, and Nina Taft. Privacy tradeoffs in predictive analytics. In Sujay Sanghavi et al., editors, 2014 International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS 2014), pages 57–69. ACM Press, June 2014.
- (8) Raghunandan H. Keshavan, Andrea Montanari, and Sewoong Oh. Learning low rank matrices from entries. In 46th Annual Allerton Conference on Communication, Control, and Computing, pages 1365–1372. IEEE Press, September 2008.
- (9) Yehuda Koren. Factorization meets the neighborhood: A multifaceted collaborative filtering model. In Ying Li, Bing Liu, and Sunita Sarawagi, editors, 14th ACM International Conference on Knowledge Discovery and Data Mining (KDD 2008), pages 426–434. ACM Press, August 2008.
- (10) Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, August 2009.
- (11) Michal Kosinski, David Stillwell, and Thore Graepel. Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences of the United States of America, 110(15):5802–5805, April 2013.
- (12) Moni Naor and Benny Pinkas. Efficient oblivious transfer protocols. In S. Rao Kosaraju, editor, 12th Annual Symposium on Discrete Algorithms (SODA 2001), pages 448–457. ACM-SIAM, January 2001.
- (13) Arvind Narayanan and Vitaly Shmatikov. Robust de-anonymization of large sparse datasets. In 2008 IEEE Symposium on Security and Privacy (S&P 2008), pages 111–125. IEEE Press, May 2008.
- (14) Netflix Prize. http://www.netflixprize.com/.
- (15) Valeria Nikolaenko, Stratis Ioannidis, Udi Weinsberg, Marc Joye, Nina Taft, and Dan Boneh. Privacy-preserving matrix factorization. In Ahmad-Reza Sadeghi, Virgil D. Gligor, and Moti Yung, editors, 20th ACM Conference on Computer and Communications Security (ACM-CCS 2013), pages 801–812. ACM Press, November 2013.
- (16) Rafail Ostrovsky and William E. Skeith III. A survey of single-database private information retrieval: Techniques and applications (invited talk). In Tatsuaki Okamoto and Xiaoyun Wang, editors, 10th International Conference on Practice and Theory in Public-Key Cryptography (PKC 2007), volume 4450 of LNCS, pages 393–411. Springer, April 2007.
- (17) Pascal Paillier. Public-key cryptosystems based on composite degree residuosity classes. In Jacques Stern, editor, 18th Annual International Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT ’99), volume 1592 of LNCS, pages 223–238. Springer, May 1999.
- (18) Al Mamunur Rashid, Istvan Albert, Dan Cosley, Shyong K. Lam, Sean M. McNee, Joseph A. Konstan, and John Riedl. Getting to know you: Learning new user preferences in recommender systems. In 7th International Conference on Intelligent User Interfaces, pages 127–134. ACM Press, January 2002.
- (19) Salman Salamatian, Amy Zhang, Flávio du Pin Calmon, Sandilya Bhamidipati, Nadia Fawaz, Branislav Kveton, Pedro Oliveira, and Nina Taft. How to hide the elephant –or the donkey– in the room: Practical privacy against statistical inference for large data. In IEEE Global Conference on Signal and Information Processing (GlobalSIP 2013), pages 269–272. IEEE Press, December 2013.
- (20) Andrew I. Schein, Alexandrin Popescul, Lyle H. Ungar, and David M. Pennock. Methods and metrics for cold-start recommendations. In 25th Annual International ACM Conference on Research and Development in Information Retrieval, pages 253–260. ACM Press, August 2002.
- (21) Paul S. Wang, M. J. T. Guy, and James H. Davenport. $p$-adic reconstruction of rational numbers. SIGSAM Bull., 16(2):2–3, May 1982.
- (22) Udi Weinsberg, Smriti Bhagat, Stratis Ioannidis, and Nina Taft. BlurMe: Inferring and obfuscating user gender based on ratings. In Padraig Cunningham et al., editors, 6th ACM Conference on Recommender Systems (RecSys 2012), pages 195–202. ACM Press, September 2012.