Recommendation systems [17, 18, 19, 20, 33] use user ratings to provide personalized suggestions of items such as movies and products. Popular services that offer such recommendations include Amazon [34], Netflix, IMDb and BarnesAndNoble [35]. Collaborative Filtering (CF) and Content-based (CB) recommendation are two commonly used techniques for building recommendation systems. CF systems [21, 22, 23] gather user ratings for different items in a given domain and compare users, by their similarities and differences, to determine which items to recommend. Content-based methods recommend items by comparing a representation of the content a user is interested in with a representation of the content of each item.
Several datasets, such as MovieLens [1] and Netflix [2], are available for testing and benchmarking recommendation systems.
We present an Indian regional movie dataset along similar lines. India has been the largest producer of movies in the world for the last few years, with great diversity in languages and viewers. As per the UNESCO cinema statistics [9], India produces around 1,724 movies every year, of which as many as 1,500 are in Indian regional languages. India's importance in the global film industry stems largely from Bollywood, which is based in Mumbai. With a population of 1.3 billion, India has a huge audience base: there are more than two thousand multiplexes in the country, and over 2.2 billion movie tickets were sold in 2016. Box office revenue in India rises every year. There is therefore a clear need for a MovieLens-like dataset in the Indian context that can be used for testing and benchmarking recommendation systems for Indian viewers. As of now, no recommendation system exists for Indian regional cinema that can tap into the rich diversity of such movies and provide regional movie recommendations to interested audiences.
At present, the Netflix and MovieLens datasets do not contain a comprehensive listing of regional productions, as the clipping in Figure 1 shows. A substantial source of data covering movies from various regions, languages and genres, and encompassing a wider folklore, is therefore strongly needed, in a format suitable for building and benchmarking recommendation systems.
To capture the diversity of Indian regional cinema, popular platforms such as Netflix are shifting their focus towards it [36, 37]. The goal is to bring some of the greatest stories from Indian regional cinema to a global platform, exposing viewers to a wide variety of new and diverse stories from India. As a result of this initiative, Indian regional cinema will be available across countries. A recommendation system built on a dataset of such movies and their audience can prove useful in this setting. Here, we present such a dataset, which is the first of its kind.
The main contributions of this paper are as follows:
Web portal for data collection: A web portal where a user can sign up by filling in details such as email, date of birth, gender, home town, languages known and occupation. The user can then rate movies as like/dislike.
Indian Regional Cinema Dataset: The first dataset of Indian regional cinema, containing ratings by users for different regional movies along with user and movie metadata. User metadata is collected when the user signs up on the portal. Movie metadata consists of genre, release year, description, language, writer, director, cast and IMDb rating.
Detailed analysis of the dataset using supervised and unsupervised Collaborative Filtering techniques.
2. Related Work
MovieLens [1] is a web-based portal that recommends movies. It uses the film preferences of its users, collected in the form of movie ratings and preferred genres, and applies collaborative filtering techniques to make movie recommendations. MovieLens was created in 1997 by GroupLens Research, a research lab in the Department of Computer Science and Engineering at the University of Minnesota. The Indian Regional Cinema dataset is inspired by MovieLens. The primary goal was to collect data for research on personalized recommendations. MovieLens released three datasets for testing recommendation systems: the 100K, 1M and 10M datasets. A 20M dataset was released as well in 2016. In these datasets, users and movies are represented by integer IDs, and ratings lie on a 5-star scale (whole stars from 1 to 5 in the 100K and 1M datasets, half-star steps from 0.5 to 5 in the later releases).
Netflix released a training dataset for its contest, the Netflix Prize [8], which consists of about 100,000,000 ratings for 17,770 movies given by 480,189 users. Each rating in the training dataset consists of four entries: user, movie, date of grade and grade. Users and movies are represented by integer IDs, while ratings range from 1 to 5.
These datasets largely cover Hollywood movies and TV series and their viewers. They are not designed for user communities that are inclined towards watching Indian regional cinema.
From the viewpoint of recommender systems, there has been a lot of work that uses users' ratings for items, together with metadata, to predict their likes and dislikes for other items [4, 5, 6, 11]. Many unsupervised and supervised collaborative filtering techniques have been proposed and benchmarked on the MovieLens dataset. In this paper, we use popular techniques such as user-user similarity to establish a baseline, and then apply more advanced techniques, namely Blind Compressed Sensing, Probabilistic Matrix Factorization, Matrix Completion and Supervised Matrix Factorization, to our dataset to provide benchmarking results. These techniques are chosen over others because they have been shown to provide better accuracy in recent works.
3. Indian Regional Movie Dataset
This is the first dataset of Indian regional cinema. It covers movies in 18 different regional languages and a variety of user ratings for such movies. It consists of 919 users with varying demographics and 2,851 movies of different genres, with about 10K ratings in total.
3.1. Metadata Information
The data for movies has been scraped from IMDb [3]. IMDb has a collection of Indian movies spanning multiple Indian regional languages and genres. Each movie is associated with the following metadata.
Movie id: Each movie has a unique id for its representation.
Description: Description of the movie for users.
Language: Language(s) used in the movie. A movie may have been released in multiple regional languages. The distribution is shown in Table 1.
Release date: Date of release of the movie.
Rating count: Number of ratings the movie has received on IMDb, used to judge the popularity of the movie.
Crew: Director, writer and cast of the movie.
Genre: Movie genre. It can be one or more of the 20 genres available on IMDb. The number of movies for each genre is shown in Figure 2.
[Table 1: Movie Count and User Count per language]
For better recommendations, it is important to include the factors which influence user ratings the most. Following is the metadata information collected for a user:
User id: A user will have a unique id for its representation.
Languages: The languages known by the users. Its count is shown in Table 1.
State: The state of India that the user belongs to. The region wise distribution is shown in Figure 3.
Age: Date of birth of the user is taken as input to calculate the age of the user. The distribution is shown in Figure 4.
Gender: Denotes the gender of the user. Gender distribution of the data is shown in Figure 5.
Occupation: It denotes the occupation of the user. It can be any one out of student, self-employed, service, retired and others. Its distribution is shown in Figure 6.
[Table 2: Movie Count, User Count, Rating Count, Sparsity (%) and Release Year of the compared datasets]
3.2. MovieLens vs Indian Regional Cinema Dataset
The key difference between the presented dataset and MovieLens is that the latter does not contain movies from Indian regional cinema; MovieLens has only a few Hindi and Urdu movies. In addition, our data has been collected mainly from viewers of regional movies in India. The user metadata thus collected can be used to recommend more relevant movies to such audiences.
Also, the MovieLens datasets are biased towards a certain category of users. They contain data only from users who have rated at least twenty movies. The datasets do not include the data of those users who could not find enough movies to rate or did not find the system easy enough to use. There is a possibility that there is a fundamental difference between such users and the other users in the datasets. Our dataset makes no such distinction among users based on the number of movies that they have rated.
To make rating multiple movies easier for a user, we use binary ratings: a user can either "like" or "dislike" a movie, denoted by "1" and "-1" in the dataset. MovieLens, on the other hand, uses a 10-point scale (from 0.5 to 5 in half-star steps). A basic comparison of these datasets is shown in Table 2, which lists the number of users, movies and ratings, the release years and the sparsity of each dataset.
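The sparsity column of Table 2 can be sanity-checked with a short calculation. The sketch below assumes the usual definition of sparsity (the percentage of user-movie pairs that carry no rating) and uses the rounded figure of 10,000 ratings, so the result is only approximate; the function name is illustrative.

```python
def sparsity_percent(n_users: int, n_movies: int, n_ratings: int) -> float:
    """Percentage of user-movie pairs without a rating (assumed definition)."""
    return 100.0 * (1.0 - n_ratings / (n_users * n_movies))


# Counts reported for the Indian regional cinema dataset. 10_000 stands in
# for the rounded "10K" figure, so the value below is approximate.
regional_sparsity = sparsity_percent(n_users=919, n_movies=2851, n_ratings=10_000)
print(f"{regional_sparsity:.2f}%")
```

Under these assumptions, fewer than one in two hundred user-movie pairs has a rating, which is why techniques that cope well with sparse data matter for this dataset.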
3.3. Dataset Collection
For the collection of user information and movie ratings, a web portal named Flickscore [15] was created, where users can sign up by filling in their details as shown in Figure 7. Users provide their preferred languages so that the portal can ask them to rate movies in those languages.
While signing up, the user is prompted to fill up the metadata information. The user can then login to the portal to rate movies as either like or dislike and the responses are recorded as shown in Figure 8.
4. Unsupervised Collaborative Filtering techniques
To analyze the dataset, unsupervised techniques such as user-user similarity, item-item similarity, Matrix Factorization, Probabilistic Matrix Factorization and Blind Compressed Sensing are used. The main advantages of these techniques are their ease of implementation and their incremental nature. On the other hand, they depend on human-provided ratings, and their performance degrades as the sparsity of the data increases. Because they rely on users' ratings to make predictions, these techniques cannot address the cold start problem, i.e., the case where a new user or item is added to the dataset and no ratings are available for it. Bias correction is performed on the dataset by calculating the global mean, user bias and item bias, and the above techniques are then used to predict the rating of a user for an item.
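The bias-correction step can be illustrated as follows. The text does not spell out how the biases are estimated, so this is a minimal sketch under one common convention: the user bias is the mean deviation of a user's ratings from the global mean, and the item bias is the mean of what remains. The function names and the toy matrix are hypothetical.

```python
import numpy as np


def bias_terms(R, mask):
    """Estimate global mean, user biases and item biases from observed ratings.

    R    : users x items rating matrix (values at unobserved entries are ignored)
    mask : boolean matrix, True where a rating was observed
    """
    mu = R[mask].mean()
    n_user = np.maximum(mask.sum(axis=1), 1)  # guard against users with no ratings
    b_user = ((R - mu) * mask).sum(axis=1) / n_user
    n_item = np.maximum(mask.sum(axis=0), 1)  # guard against items with no ratings
    b_item = ((R - mu - b_user[:, None]) * mask).sum(axis=0) / n_item
    return mu, b_user, b_item


def remove_bias(R, mask, mu, b_user, b_item):
    """Subtract the baseline mu + b_u + b_i from every observed rating."""
    return (R - mu - b_user[:, None] - b_item[None, :]) * mask


# Toy like/dislike matrix: 1 = like, -1 = dislike, 0 = not rated.
R = np.array([
    [ 1, -1,  1,  0],
    [ 1, -1,  0,  1],
    [-1,  1, -1, -1],
], dtype=float)
mask = R != 0

mu, b_user, b_item = bias_terms(R, mask)
residual = remove_bias(R, mask, mu, b_user, b_item)
```

A collaborative filtering model is then fitted to `residual`, and the baseline is added back to its output to obtain the final prediction.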
4.1. User and Item-based similarity
In the user-user model, a similarity matrix is calculated in which each entry $(u, u')$ holds the cosine similarity between a user $u$ and another user $u'$. It denotes how similar the two users $u$ and $u'$ are: the higher the score, the higher the similarity. Similarly, in the item-item model, each entry of the similarity matrix $A$ denotes the cosine similarity between an item $i$ and another item $i'$; the higher the score, the more similar the two items.
Cosine similarity can be calculated for two users $u$ and $u'$ using the following equation:
$$\mathrm{sim}(u, u') = \frac{\sum_{i} r_{ui}\, r_{u'i}}{\sqrt{\sum_{i} r_{ui}^{2}}\;\sqrt{\sum_{i} r_{u'i}^{2}}}$$
where $r_{ui}$ denotes the rating by user $u$ for item $i$.
The prediction for user $u$ for item $i$ is computed as:
$$\hat{r}_{ui} = \sum_{u'} w_{uu'}\, r_{u'i}, \qquad w_{uu'} = \frac{\mathrm{sim}(u, u')}{\sum_{u''} \lvert \mathrm{sim}(u, u'') \rvert}$$
where $w_{uu'}$ is the normalized similarity weight and $r_{u'i}$ is the rating by user $u'$ for item $i$.
As with user-user similarity, item-item similarity is calculated by computing the cosine similarity between two items, and ratings are predicted in the same way.
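A minimal sketch of user-user prediction with cosine similarity is given below. It assumes that unrated entries are stored as 0 and that the prediction is a similarity-weighted average over the other users who rated the item; the function names and the toy matrix are illustrative only.

```python
import numpy as np


def cosine_similarity_matrix(R):
    """Pairwise cosine similarity between the rows of R (users x items).

    Unrated entries are assumed to be stored as 0, so they contribute
    neither to the dot products nor to the norms.
    """
    norms = np.linalg.norm(R, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for users with no ratings
    unit = R / norms
    return unit @ unit.T


def predict_user_user(R, u, i):
    """Similarity-weighted average of the ratings other users gave to item i."""
    sim = cosine_similarity_matrix(R)[u]
    raters = np.flatnonzero(R[:, i] != 0)
    raters = raters[raters != u]
    denom = np.abs(sim[raters]).sum()
    if denom == 0:
        return 0.0  # no informative neighbour; fall back to a neutral score
    return float(sim[raters] @ R[raters, i] / denom)


# Toy like/dislike matrix: 1 = like, -1 = dislike, 0 = not rated.
R = np.array([
    [ 1, -1,  1,  0],
    [ 1, -1,  0,  1],
    [-1,  1, -1, -1],
], dtype=float)

pred = predict_user_user(R, u=0, i=3)
print(pred)
```

The item-item variant follows by applying `cosine_similarity_matrix` to `R.T` and averaging over the items the user has already rated.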
4.2. Matrix factorization
Users' likes and dislikes are driven by hidden traits (latent factors) that can be inferred from the pattern of their ratings. This model maps users and movies to a joint latent factor space. Each item $i$ and user $u$ is associated with a vector $q_i$ and $p_u$ respectively, which measures the extent to which the item possesses, or the user prefers, those factors. The dot product $q_i^{T} p_u$ denotes the liking of user $u$ for item $i$ and approximates the rating $r_{ui}$ [10]. Computing the mapping of each user and item to factor vectors is a major challenge. Imputation of the missing ratings can prove expensive, as it noticeably increases the amount of data. Instead, the observed ratings are modelled directly, with regularization, using the following equation:
$$\min_{q,\,p} \sum_{(u,i) \in \kappa} \left( r_{ui} - q_i^{T} p_u \right)^{2} + \lambda \left( \lVert q_i \rVert^{2} + \lVert p_u \rVert^{2} \right)$$
Here, $\kappa$ is the set of those user-item pairs in the training set for which $r_{ui}$ is known.
The system uses the already observed ratings to fit a model on them and uses that model to predict the new ratings.
The intuition behind using matrix factorization to analyze this dataset is that there should be some latent features that determine how a user rates an item. For example, two users may give high ratings to a certain movie if they both like the actors/actresses of the movie, or if the movie is an action movie, which is a genre preferred by both users. Hence, if we can discover these latent features, we should be able to predict the rating given by a certain user to a certain item, because the features associated with the user should match with the features associated with the item.
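The regularized objective over observed ratings can be minimized with stochastic gradient descent, as sketched below. The learning rate, regularization weight, number of factors and the toy ratings are arbitrary choices made for illustration, not the settings used in the experiments.

```python
import numpy as np


def sgd_matrix_factorization(ratings, n_users, n_items, n_factors=2,
                             lr=0.05, reg=0.02, n_epochs=500, seed=0):
    """Fit user vectors P and item vectors Q on observed (user, item, rating) triples."""
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((n_users, n_factors))  # one row per user
    Q = 0.1 * rng.standard_normal((n_items, n_factors))  # one row per item
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]          # prediction error on this rating
            p_u = P[u].copy()              # keep the old value for the item update
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_u - reg * Q[i])
    return P, Q


# Observed like/dislike ratings as (user, item, rating) triples (toy data).
observed = [
    (0, 0, 1), (0, 1, -1), (0, 2, 1),
    (1, 0, 1), (1, 1, -1), (1, 3, 1),
    (2, 0, -1), (2, 1, 1), (2, 2, -1), (2, 3, -1),
]

P, Q = sgd_matrix_factorization(observed, n_users=3, n_items=4)
train_rmse = float(np.sqrt(np.mean([(r - P[u] @ Q[i]) ** 2 for u, i, r in observed])))
```

An unobserved rating, for example user 0 on item 3, is then estimated as `P[0] @ Q[3]`.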
Table 3:
|Technique|MovieLens 100K MAE|MovieLens 100K RMSE|Our Dataset MAE|Our Dataset RMSE|MovieLens 1M MAE|MovieLens 1M RMSE|
|Probabilistic Matrix Factorization|0.7564|0.9639|0.481|0.9372|0.7241|0.9127|
|Blind Compressed Sensing|0.7356|0.9409|0.463|0.9612|0.6917|0.8789|
Table 4:
|Technique|MovieLens 100K MAE|MovieLens 100K RMSE|Our Dataset MAE|Our Dataset RMSE|MovieLens 1M MAE|MovieLens 1M RMSE|
|Supervised Matrix Factorization|0.7199|0.9196|0.4367|0.9283|0.6709|0.8567|
4.3. Probabilistic Matrix Factorization
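Probabilistic Matrix Factorization (PMF) can handle large datasets because it scales linearly with the number of observations in the dataset. Let $R_{ij}$ represent the rating of user $i$ for movie $j$. Let $U \in \mathbb{R}^{D \times N}$ and $V \in \mathbb{R}^{D \times M}$ be the latent feature matrices for users and movies, respectively. Their column vectors are denoted $U_i$, representing user-specific latent feature vectors, and $V_j$, representing movie-specific latent feature vectors [4]. Maximizing the log posterior over movie and user features with fixed hyperparameters is equivalent to minimizing the following objective:
$$E = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{M} I_{ij} \left( R_{ij} - U_i^{T} V_j \right)^{2} + \frac{\lambda_U}{2} \sum_{i=1}^{N} \lVert U_i \rVert^{2} + \frac{\lambda_V}{2} \sum_{j=1}^{M} \lVert V_j \rVert^{2}$$
where $I_{ij}$ is 1 if user $i$ has rated movie $j$ and 0 otherwise, and $\lambda_U$ and $\lambda_V$ are the regularization parameters for users and items respectively. A local minimum of this objective can be computed by gradient descent in $U$ and $V$. The model performance is measured by computing the mean absolute error (MAE) and root mean squared error (RMSE) on the test set.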
This model can be viewed as a probabilistic extension of the SVD model, since if all ratings have been observed, the objective given by the equation reduces to the SVD objective in the limit of prior variances going to infinity. This technique better addresses the sparsity and scalability problems and thus improves prediction performance. It gives an intuitive rationale for recommendation.
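The gradient-descent fit of the PMF objective can be sketched with full-batch updates on a masked rating matrix. The step size, regularization weights, latent dimension and toy data below are assumptions chosen only so that the example runs quickly.

```python
import numpy as np


def pmf_objective(R, I, U, V, lam_u, lam_v):
    """Sum-of-squares error on observed entries plus quadratic penalties."""
    err = I * (R - U.T @ V)
    return (0.5 * np.sum(err ** 2)
            + 0.5 * lam_u * np.sum(U ** 2)
            + 0.5 * lam_v * np.sum(V ** 2))


def pmf_fit(R, I, n_factors=2, lam_u=0.05, lam_v=0.05, lr=0.05, n_iters=500, seed=0):
    """Full-batch gradient descent in U (D x N) and V (D x M)."""
    rng = np.random.default_rng(seed)
    n_users, n_movies = R.shape
    U = 0.1 * rng.standard_normal((n_factors, n_users))
    V = 0.1 * rng.standard_normal((n_factors, n_movies))
    history = []
    for _ in range(n_iters):
        err = I * (R - U.T @ V)              # zero wherever no rating was observed
        grad_U = -V @ err.T + lam_u * U      # gradient with respect to U
        grad_V = -U @ err + lam_v * V        # gradient with respect to V
        U -= lr * grad_U
        V -= lr * grad_V
        history.append(pmf_objective(R, I, U, V, lam_u, lam_v))
    return U, V, history


# Toy like/dislike matrix: 1 = like, -1 = dislike, 0 = not rated.
R = np.array([
    [ 1, -1,  1,  0],
    [ 1, -1,  0,  1],
    [-1,  1, -1, -1],
], dtype=float)
I = (R != 0).astype(float)

U, V, history = pmf_fit(R, I)
R_hat = U.T @ V  # predicted ratings for every user-movie pair
```

Each iteration touches every observed rating exactly once, which is the sense in which the cost grows linearly with the number of observations.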
4.4. Blind Compressed Sensing
A dense user latent factor matrix is a reasonable assumption, as each user will like or dislike every trait to a certain extent. However, any item will possess only a few of the attributes and never all. Hence, the item latent factor matrix will ideally have a sparse structure rather than a dense one as formulated in earlier works.
The objective of this approach is to find the user and item latent factor matrices. As per the approach, the user latent factor matrix can be dense, but the same does not logically follow for the item latent factor matrix. Enforcing sparsity of the item latent factor matrix increases the recommendation accuracy significantly [5]. The following objective is minimized:
$$\min_{U,\,V} \lVert Y - M \odot (UV) \rVert_F^{2} + \lambda_u \lVert U \rVert_F^{2} + \lambda_v \lVert \mathrm{vec}(V) \rVert_1$$
where $\lambda_u$ and $\lambda_v$ are the regularization parameters for users and items respectively, $M$ is the binary mask matrix and $Y$ is the rating matrix. $U$ and $V$ are the user latent factor matrix and the item latent factor matrix respectively, both of which were assumed to be dense in earlier models.
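One way to minimize an objective with a dense user factor matrix and a sparse item factor matrix is alternating proximal gradient descent, sketched below. This solver is an illustrative choice and not necessarily the algorithm used in [5]; the symbols `Y`, `M`, `U`, `V`, the penalty weights and the toy data are assumptions.

```python
import numpy as np


def soft_threshold(X, tau):
    """Proximal operator of tau * ||X||_1 (element-wise shrinkage towards zero)."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)


def bcs_objective(Y, M, U, V, lam_u, lam_v):
    """Masked squared error + Frobenius penalty on U + l1 penalty on V."""
    return (np.sum((M * (Y - U @ V)) ** 2)
            + lam_u * np.sum(U ** 2)
            + lam_v * np.sum(np.abs(V)))


def bcs_fit(Y, M, n_factors=2, lam_u=0.1, lam_v=0.1, n_iters=300, seed=0):
    rng = np.random.default_rng(seed)
    n_users, n_items = Y.shape
    U = 0.5 * rng.standard_normal((n_users, n_factors))   # dense user factors
    V = 0.5 * rng.standard_normal((n_factors, n_items))   # item factors, driven sparse
    history = [bcs_objective(Y, M, U, V, lam_u, lam_v)]
    for _ in range(n_iters):
        # U-step: plain gradient step, step size 1/L with L a Lipschitz bound.
        step = 1.0 / (2.0 * (np.linalg.norm(V, 2) ** 2 + lam_u))
        grad_U = 2.0 * (M * (U @ V - Y)) @ V.T + 2.0 * lam_u * U
        U = U - step * grad_U
        # V-step: gradient step on the data term, then soft thresholding for the l1 term.
        step = 1.0 / (2.0 * np.linalg.norm(U, 2) ** 2 + 1e-12)
        grad_V = 2.0 * U.T @ (M * (U @ V - Y))
        V = soft_threshold(V - step * grad_V, step * lam_v)
        history.append(bcs_objective(Y, M, U, V, lam_u, lam_v))
    return U, V, history


# Toy like/dislike matrix: 1 = like, -1 = dislike, 0 = not rated.
Y = np.array([
    [ 1, -1,  1,  0],
    [ 1, -1,  0,  1],
    [-1,  1, -1, -1],
], dtype=float)
M = (Y != 0).astype(float)

U, V, history = bcs_fit(Y, M)
```

The soft-thresholding step is what sets small entries of `V` exactly to zero; a larger `lam_v` yields a sparser item factor matrix.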
4.5. Matrix Completion
Matrix completion involves filling in the missing entries of a partially observed matrix. It aims to compute the matrix with the lowest rank or, if the rank $r$ of the completed matrix is known, a matrix of rank $r$ that matches the known entries. A popular approach for solving the problem is nuclear-norm-regularized (NN) matrix approximation [7], as shown in the following equation:
$$\min_{R} \; \frac{1}{2} \lVert M \odot (Y - R) \rVert_F^{2} + \lambda \lVert R \rVert_{*}$$
where $\lVert R \rVert_{*}$ is the nuclear norm of $R$ (the sum of its singular values) and $M$ is the binary mask. $R$ is the imputed rating matrix and $Y$ is the original rating matrix.
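Nuclear-norm-regularized completion can be illustrated with the classic Soft-Impute iteration, which repeatedly fills the missing entries with the current estimate and shrinks the singular values. This is a simple stand-in for the faster alternating-least-squares solver of [7]; the value of `lam`, the iteration count and the toy data are assumptions.

```python
import numpy as np


def nn_objective(Y, M, R, lam):
    """Half the squared error on observed entries plus lam times the nuclear norm."""
    return 0.5 * np.sum((M * (Y - R)) ** 2) + lam * np.linalg.norm(R, "nuc")


def soft_impute(Y, M, lam=0.5, n_iters=200):
    """Iterative singular-value soft thresholding on the imputed matrix."""
    R = np.zeros_like(Y, dtype=float)
    for _ in range(n_iters):
        filled = M * Y + (1.0 - M) * R            # observed ratings + current guesses
        left, s, right_t = np.linalg.svd(filled, full_matrices=False)
        s = np.maximum(s - lam, 0.0)              # shrink singular values
        R = (left * s) @ right_t
    return R


# Toy like/dislike matrix: 1 = like, -1 = dislike, 0 = not rated.
Y = np.array([
    [ 1, -1,  1,  0],
    [ 1, -1,  0,  1],
    [-1,  1, -1, -1],
], dtype=float)
M = (Y != 0).astype(float)

R_hat = soft_impute(Y, M)
```

Singular values that fall below `lam` are set to zero, so the regularization weight directly controls the rank of the completed matrix.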
5. Supervised Collaborative Filtering techniques
To analyze the dataset, supervised techniques such as Supervised Matrix Factorization are also used. The main advantage of supervised methods is that they can make predictions for a new user or a new item for which no ratings are available. This situation is known as the cold start problem, and unsupervised techniques fail to handle it [28, 29, 30, 31]. These methods are scalable and use the metadata of users and items, which helps them relate users to items and give more accurate predictions. Bias correction is performed on the dataset by calculating user bias and item bias, and the above technique is then used to calculate the rating of a new user for an item.
5.1. Supervised Matrix Factorization
The task of predicting ratings is difficult largely because of the sparsity of the ratings available in the database of a recommender system. Using knowledge of user demographics and item categories can therefore enhance prediction accuracy [25, 26, 27, 32]. Classes are formed according to users' age group, gender and occupation, and a user can belong to multiple classes at a time. Class label information is used to learn the latent factor vectors of users and movies in a supervised setting, so that they are consistent with the available class labels. The class labels impose additional constraints that shrink the search space and thereby reduce the indeterminacy of the problem.
Mathematically, within the matrix factorization framework, additional information in the form of user metadata (for $U$) and item metadata (for $V$) can be used, and the following objective is minimized [6]:
$$\min_{U,\,V,\,A,\,C} \lVert Y - M \odot (UV) \rVert_F^{2} + \lambda_u \lVert U \rVert_F^{2} + \lambda_v \lVert V \rVert_F^{2} + \mu_u \lVert W - UA \rVert_F^{2} + \mu_v \lVert Q - CV \rVert_F^{2}$$
where $w_{ij} = 1$ if user $i$ belongs to class $j$, else 0. $A$ is the linear map from the latent factor space to the classification domain, with $C$ as its counterpart for items. $Q$ is the class information matrix for items, created similarly to $W$. Other variables have their usual meanings. Introducing supervised learning into the latent factor model helps improve the prediction accuracy by reducing the problem of rating matrix sparsity. The values of the regularization parameters are determined using the L-curve technique [16].
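The effect of adding class labels to the factorization can be sketched with a simplified, user-side-only variant: the usual masked reconstruction error plus a term that asks a linear map of the user factors to reproduce the class membership matrix. All symbols (`Y`, `M`, `U`, `V`, `W`, `A`), the weights and the toy data are assumptions made for illustration and do not reproduce the exact formulation of [6].

```python
import numpy as np


def smf_objective(Y, M, U, V, W, A, lam, mu):
    """Masked rating error + penalties on U and V + class label consistency."""
    return (np.sum((M * (Y - U @ V)) ** 2)
            + lam * (np.sum(U ** 2) + np.sum(V ** 2))
            + mu * np.sum((W - U @ A) ** 2))


def smf_fit(Y, M, W, n_factors=2, lam=0.05, mu=0.5, lr=0.01, n_iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    n_users, n_items = Y.shape
    n_classes = W.shape[1]
    U = 0.1 * rng.standard_normal((n_users, n_factors))
    V = 0.1 * rng.standard_normal((n_factors, n_items))
    A = 0.1 * rng.standard_normal((n_factors, n_classes))  # latent space -> class labels
    history = [smf_objective(Y, M, U, V, W, A, lam, mu)]
    for _ in range(n_iters):
        res = M * (U @ V - Y)   # rating residual on observed entries
        lab = U @ A - W         # class label residual
        grad_U = 2.0 * res @ V.T + 2.0 * lam * U + 2.0 * mu * lab @ A.T
        grad_V = 2.0 * U.T @ res + 2.0 * lam * V
        grad_A = 2.0 * mu * U.T @ lab
        U, V, A = U - lr * grad_U, V - lr * grad_V, A - lr * grad_A
        history.append(smf_objective(Y, M, U, V, W, A, lam, mu))
    return U, V, A, history


# Toy like/dislike matrix: 1 = like, -1 = dislike, 0 = not rated.
Y = np.array([
    [ 1, -1,  1,  0],
    [ 1, -1,  0,  1],
    [-1,  1, -1, -1],
], dtype=float)
M = (Y != 0).astype(float)

# Hypothetical class membership: W[i, j] = 1 if user i belongs to class j.
W = np.array([[1, 0], [1, 0], [0, 1]], dtype=float)

U, V, A, history = smf_fit(Y, M, W)
```

For a new user with known class labels but no ratings, the learned map `A` ties the labels to the latent space, which is what allows a factor vector, and hence ratings, to be estimated in the cold start case.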
6. Experiments and Results
Three different datasets are used to compare the results of the supervised and unsupervised collaborative filtering techniques used to predict user ratings: MovieLens 100K, MovieLens 1M and our dataset of Indian regional movies. For error calculation, the mean absolute error (MAE) and root mean squared error (RMSE) are calculated between the actual ratings and the predicted ratings. The datasets are divided into 5 folds for evaluation. The ratings are binarized into like/dislike (1/-1) labels for the experiments. Results of the different techniques on these datasets are shown in Table 3 and Table 4.
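The evaluation protocol can be sketched in a few lines. The binarization threshold for star ratings is not stated in the text, so the `threshold` argument below is a hypothetical parameter; the fold assignment is a plain random split of rating indices.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error between actual and predicted ratings."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))


def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted ratings."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def k_fold_indices(n_ratings, n_folds=5, seed=0):
    """Randomly split rating indices into n_folds nearly equal test folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_ratings), n_folds)


def binarize(ratings, threshold):
    """Map star ratings to like (1) / dislike (-1); threshold is an assumption."""
    return np.where(np.asarray(ratings) >= threshold, 1, -1)


folds = k_fold_indices(n_ratings=10)
labels = binarize([1, 3, 4, 5], threshold=4)
example_mae = mae([1, -1, 1], [1, 1, 0])
example_rmse = rmse([1, -1, 1], [1, 1, 0])
```

In each round one fold serves as the test set and the remaining four as the training set, and the reported MAE and RMSE are averaged over the five rounds.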
As the values in Table 3 indicate, the basic cosine similarity measures between users and between movies perform fairly well on all datasets. The minimum MAE values are obtained on our dataset. Since the sparsity of the regional cinema dataset and the MovieLens 1M dataset is very high (as indicated in Table 2), techniques like Probabilistic Matrix Factorization and Blind Compressed Sensing perform better than the basic similarity measures, and among them the lowest MAE is again obtained on our dataset.
To use the metadata, the information is encoded as one-hot vectors of 1's and 0's; in the case of languages, multiple 1's can be present in the vector. Since supervised techniques use both user and item metadata, they outperform the unsupervised collaborative filtering techniques. Among all three datasets, the minimum MAE is obtained on our dataset. This shows that the Indian Regional Cinema dataset can prove useful for building and benchmarking recommendation systems in the Indian context, which has highly diverse languages and demographics.
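The metadata encoding can be illustrated as follows. The occupation categories are the ones listed for the user metadata; the language vocabulary is a hypothetical four-language subset chosen only to keep the example short, and the helper names are illustrative.

```python
import numpy as np

OCCUPATIONS = ["student", "self-employed", "service", "retired", "others"]
LANGUAGES = ["Hindi", "Tamil", "Telugu", "Bengali"]  # hypothetical subset of the 18 languages


def one_hot(value, vocabulary):
    """Vector with a single 1 at the position of `value`."""
    vec = np.zeros(len(vocabulary), dtype=int)
    vec[vocabulary.index(value)] = 1
    return vec


def multi_hot(values, vocabulary):
    """Vector with a 1 for every entry of `values` (several 1's are allowed)."""
    vec = np.zeros(len(vocabulary), dtype=int)
    for value in values:
        vec[vocabulary.index(value)] = 1
    return vec


# A student who knows Hindi and Tamil.
user_vector = np.concatenate([
    one_hot("student", OCCUPATIONS),
    multi_hot(["Hindi", "Tamil"], LANGUAGES),
])
print(user_vector.tolist())
```

Stacking such vectors for all users gives a class information matrix of the kind used by the supervised technique.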
7. Conclusion and Future Work
India is a country with not only many languages but also a highly diverse population. Indian regional cinema is correspondingly diverse in the number of languages and in the demographics of its viewers. Thousands of such movies are produced annually, and there is a huge community of people who watch them. A recommender system for Indian regional movies is therefore needed to address the preferences of the growing number of their viewers. This dataset has around 10K ratings by Indian users, along with their demographic information. We believe that this dataset could be used to design, improve and benchmark recommendation systems for Indian regional cinema. We plan to release the dataset after its publication. We further want to release another version of this dataset with more ratings and users, which will help improve the current state of recommender systems for the Indian audience.
- (1) MovieLens: https://movielens.org/
- (2) Netflix: https://www.netflix.com/in/
- (3) IMDb: http://www.imdb.com/
- (4) A. Mnih and R. Salakhutdinov, Probabilistic matrix factorization, in Proc. Adv. Neural Inf. Process. Syst., 2007, pp. 1257-1264.
- (5) A. Gogna and A. Majumdar. (2015). Blind compressive sensing framework for collaborative filtering. Available: http://arxiv.org/abs/1505.01621
- (6) A. Gogna and A. Majumdar, A Comprehensive Recommender System Model: Improving Accuracy for Both Warm and Cold Start Users, in IEEE Access, vol. 3, no. , pp. 2803-2813, 2015.
- (7) T. Hastie, R. Mazumder, J. Lee, R. Zadeh, Matrix Completion and Low-Rank SVD via Fast Alternating Least Squares, https://arxiv.org/abs/1410.2596.
- (8) Netflix Prize: http://www.netflixprize.com/
- (9) http://uis.unesco.org/en/news/record-number-films-produced
- (10) Yehuda Koren, Matrix Factorization Techniques for Recommender Systems, Published by the IEEE Computer Society, August 2009.
- (11) Xiaoyuan Su and Taghi M. Khoshgoftaar, A survey of collaborative filtering techniques, in Advances in artificial intelligence, vol. 2009, pp. 4, 2009.
- (12) Thomas Hofmann, Latent semantic models for collaborative filtering, in ACM Transactions on Information Systems (TOIS), vol. 22, no. 1, pp. 89-115, 2004.
- (13) Gediminas Adomavicius and Alexander Tuzhilin, Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions, in Knowledge and Data Engineering, IEEE Transactions on, vol. 17, no. 6, pp. 734-749, 2005.
- (14) Sivan Gleichman and Yonina C Eldar, Blind compressed sensing, in Information Theory, IEEE Transactions on, vol. 57, no. 10, pp. 6958-6975, 2011.
- (15) Flickscore: http://flickscore.iiitd.edu.in
- (16) C. L. Lawson and R. J. Hanson, Solving Least Squares Problems, vol. 161. Englewood Cliffs, NJ, USA: Prentice-Hall, 1974.
- (17) M. F. Hornick and P. Tamayo, Extending recommender systems for disjoint user/item sets: The conference recommendation problem, IEEE Trans. Knowl. Data Eng., vol. 24, no. 8, pp. 1478-1490, Aug. 2012.
- (18) Q. Liu, E. Chen, H. Xiong, Y. Ge, Z. Li and X. Wu, A cocktail approach for travel package recommendation, IEEE Trans. Knowl. Data Eng., vol. 26, no. 2, pp. 278-293, Feb. 2014.
- (19) Y. Koren and R. Bell, Advances in collaborative filtering, in Recommender Systems Handbook. New York, NY, USA: Springer, 2011, pp. 145-186.
- (20) X. Su and T. M. Khoshgoftaar, A survey of collaborative filtering techniques, Adv. Artif. Intell., vol. 2009, Jan. 2009, Art. ID 4.
- (21) R. M. Bell and Y. Koren, Improved neighborhood-based collaborative filtering, in Proc. KDD-Cup Workshop, 2007, pp. 7-14.
- (22) C. Desrosiers and G. Karypis, A comprehensive survey of neighborhood based recommendation methods, in Recommender Systems Handbook. New York, NY, USA: Springer, 2011, pp. 107-144.
- (23) J. Wang, A. P. de Vries, and M. J. T. Reinders, Unifying user-based and item-based collaborative filtering approaches by similarity fusion, in Proc. 29th Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retr., 2006, pp. 501-508.
- (24) S. Gleichman and Y. C. Eldar, Blind compressed sensing, IEEE Trans. Inf. Theory, vol. 57, no. 10, pp. 6958-6975, Oct. 2011.
- (25) H. Ma, D. Zhou, C. Liu, M. R. Lyu, and I. King, Recommender systems with social regularization, in Proc. 4th ACM Int. Conf. Web Search Data Mining, 2011, pp. 287-296.
- (26) P. Massa and P. Avesani, Trust-aware recommender systems, in Proc. ACM Conf. Recommender Syst., 2007, pp. 17-24.
- (27) X. Tang, Y. Xu, and S. Geva, Learning higher order interactions for user and item profiling based on tensor factorization, in Proc. 20th Int. Conf. Intell. User Interfaces, 2015, pp. 213-224.
- (28) Q. Gu, J. Zhou, and C. Ding, Collaborative filtering: Weighted non negative matrix factorization incorporating user and item graphs, in Proc. SDM, 2010, pp. 199-210.
- (29) S.-T. Park, D. Pennock, O. Madani, N. Good, and D. DeCoste, Naive filterbots for robust cold-start recommendations, in Proc. 12th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2006, pp. 699-705.
- (30) A. L. V. Pereira and E. R. Hruschka, Simultaneous co-clustering and learning to address the cold start problem in recommender systems, Knowl.-Based Syst., vol. 82, pp. 11-19, Jul. 2015.
- (31) Z.-K. Zhang, C. Liu, Y.-C. Zhang, and T. Zhou, Solving the cold-start problem in recommender systems with social tags, EPL (Europhys. Lett.), vol. 92, no. 2, p. 28002, Nov. 2010.
- (32) A.-T. Nguyen, N. Denos, and C. Berrut, Improving new user recommendations with rule-based induction on cold user data, in Proc. ACM Conf. Recommender Syst., 2007, pp. 121-128.
- (33) G. Adomavicius and A. Tuzhilin, Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions, IEEE Trans. Knowl. Data Eng., vol. 17, no. 6, pp. 734-749, Jun. 2005.
- (34) Amazon: http://www.amazon.in
- (35) BarnesAndNoble: https://www.barnesandnoble.com
- (36) https://www.clickondetroit.com/entertainment/netflix-steps-up-its-battle-with-amazon-in-india
- (37) http://www.financialexpress.com/industry/technology/the-desi-content-battleground/798145/
- (38) https://grouplens.org/datasets/movielens/