Indian Regional Movie Dataset for Recommender Systems

01/07/2018 ∙ by Prerna Agarwal, et al. ∙ IIIT Delhi 0

The Indian regional movie dataset is the first database of regional Indian movies, users and their ratings. It covers movies in 18 different Indian regional languages and includes metadata of users with varying demographics, capturing the diversity of Indian regional cinema and its huge viewership. We analyze the dataset, which contains roughly 10K ratings from 919 users on 2,851 movies, using supervised and unsupervised collaborative filtering techniques such as Probabilistic Matrix Factorization, Matrix Completion and Blind Compressed Sensing. User metadata includes age, occupation, home state and known languages; movie metadata includes genre, language, release year and cast. India has a wide base of viewers, as is evident from the large number of movies released every year and the huge box-office revenue. This dataset can be used for designing recommendation systems for Indian users and regional movies, which do not yet exist. The dataset can be downloaded from




1. Introduction

Recommendation systems [17, 18, 19, 20, 33] utilize user ratings to provide personalized suggestions of items like movies and products. Some popular brands that provide such services are Amazon [34], Netflix, IMDb and BarnesAndNoble [35]. Collaborative Filtering (CF) and Content-based (CB) recommendation are two commonly used techniques for building recommendation systems. CF systems [21, 22, 23] operate by gathering user ratings for different items in a given domain and comparing users, their similarities and differences, to determine the items to be recommended. Content-based methods recommend items by comparing representations of the content a user is interested in with representations of the content of an item.
There are several datasets, such as MovieLens [1] and Netflix [2], that are available for testing and benchmarking recommendation systems.
We present an Indian regional movie dataset along similar lines. India has been the largest producer of movies in the world for the last few years, with great diversity in languages and viewers. As per the UNESCO cinema statistics [9], India produces around 1,724 movies every year, with as many as 1,500 of them in Indian regional languages. India's importance in the global film industry is largely because India is home to Bollywood in Mumbai. India has a huge audience base: with a population of 1.3 billion, it has more than two thousand multiplexes, and over 2.2 billion movie tickets were sold in 2016. Box office revenue in India keeps rising every year. There is therefore a strong need for a dataset like MovieLens in the Indian context that can be used for testing and benchmarking recommendation systems for Indian viewers. As of now, no recommendation system exists for Indian regional cinema that can tap into the rich diversity of such movies and provide regional movie recommendations to interested audiences.

1.1. Motivation

As of now, the Netflix and MovieLens datasets do not have a comprehensive listing of regional productions, as the clipping in Figure 1 (borrowed from [39]) shows. Therefore, a substantial source of data covering movies of various regions, languages and genres, and encompassing a wider folklore, is strongly needed in a format suitable for building and benchmarking recommendation systems.
To capture the diversity of Indian regional cinema, popular platforms like Netflix are trying to shift their focus towards it [36, 37]. The goal is to bring some of the greatest stories from Indian regional cinema to a global platform. Through this, viewers are exposed to a wide variety of new and diverse stories from India, and Indian regional cinema becomes available across countries. A recommendation system built on a dataset of such movies and their audience can prove useful in such situations. Here, we present such a dataset, which is the first of its kind.

Figure 1. Amazon and Netflix: Focus on Regional Films

1.2. Contributions

  • Web portal for data collection: A web portal where a user can sign up by filling in details such as email, date of birth, gender, home town, languages known and occupation. The user can then rate movies as like/dislike.

  • Indian Regional Cinema Dataset: It is the first dataset of Indian Regional Cinema which contains ratings by users for different regional movies along with user and movie metadata. User metadata is collected while signing up on the portal. Movie metadata consists of genre, release year, description, language, writer, director, cast and IMDb rating.

  • Detailed analysis of the dataset using some supervised and unsupervised Collaborative Filtering techniques.

2. Related Work

MovieLens [1] is a web-based portal that recommends movies. It uses the film preferences of its users, collected in the form of movie ratings and preferred genres, and then applies collaborative filtering techniques to make movie recommendations. MovieLens was created in 1997 by GroupLens Research, a research lab in the Department of Computer Science and Engineering at the University of Minnesota [38]. The Indian Regional Cinema dataset is inspired by MovieLens. The primary goal of MovieLens was to collect data for research on personalized recommendations. MovieLens released three datasets for testing recommendation systems: 100K, 1M and 10M. A 20M dataset was released as well in 2016. In these datasets, users and movies are represented with integer IDs, while ratings are given in stars up to 5 (whole stars in the 100K and 1M datasets, half-star steps in the later ones).
Netflix released a training data set for their contest, the Netflix Prize [8], which consists of about 100,000,000 ratings for 17,770 movies given by 480,189 users. Each rating in the training dataset consists of four entries: user, movie, date of grade, grade. Users and movies are represented with integer IDs, while ratings range from 1 to 5.
These datasets are largely for Hollywood movies and TV series, and their viewers. They are not designed for user communities that are inclined towards watching Indian regional cinema.
From the viewpoint of recommender systems, there has been a lot of work on using user ratings for items, together with metadata, to predict users' likes and dislikes of other items [4, 5, 6, 11]. Many unsupervised and supervised collaborative filtering techniques have been proposed and benchmarked on the MovieLens dataset. In this paper, we use user-user similarity to establish a baseline and then apply more advanced techniques, namely Blind Compressed Sensing, Probabilistic Matrix Factorization, Matrix Completion and Supervised Matrix Factorization, to our dataset to provide benchmarking results. These techniques are chosen over others because they have been shown to provide better accuracy in recent work [6].

3. Indian Regional Movie Dataset

This is the first dataset of Indian regional cinema; it covers movies in 18 different regional languages and a variety of user ratings for them. It consists of about 10K ratings given by 919 users with varying demographics to 2,851 movies of different genres.

3.1. Metadata Information

The data for movies has been scraped from IMDb [3]. IMDb has a collection of Indian movies spanning multiple Indian regional languages and genres. Each movie is associated with the following metadata.

  • Movie id: Each movie has a unique id for its representation.

  • Description: Description of the movie for users.

  • Language: Language(s) used in the movie. A movie may have been released in multiple regional languages. The distribution is shown in Table 1.

  • Release date: Date of release of the movie.

  • Rating count: Number of ratings the movie has on IMDb, used to judge the popularity of the movie.

  • Crew: Director, writer and cast of the movie.

  • Genre: Movie genre. It can be one or more of the 20 genres available on IMDb. The number of movies for each genre is shown in Figure 2.

Figure 2. Genre Distribution

Language      Movie Count   User Count
Hindi         615           902
Bengali       582           28
Assamese      22            9
Tamil         313           30
Nepali        51            9
Punjabi       150           78
Rajasthani    18            14
Malayalam     346           16
Bhojpuri      26            21
Kannada       303           11
Haryanvi      3             18
Manipuri      8             4
Urdu          129           23
Marathi       204           14
Telugu        338           18
Oriya         98            6
Gujarati      49            7
Konkani       6             4
Table 1. Language Distribution

For better recommendations, it is important to include the factors which influence user ratings the most. Following is the metadata information collected for a user:

  • User id: A user will have a unique id for its representation.

  • Languages: The languages known by the user. The number of users who know each language is shown in Table 1.

  • State: The state of India that the user belongs to. The region wise distribution is shown in Figure 3.

    Figure 3. Region-wise Distribution
  • Age: Date of birth of the user is taken as input to calculate the age of the user. The distribution is shown in Figure 4.

    Figure 4. Age Distribution
  • Gender: Denotes the gender of the user. Gender distribution of the data is shown in Figure 5.

    Figure 5. Gender Distribution
  • Occupation: It denotes the occupation of the user. It can be any one out of student, self-employed, service, retired and others. Its distribution is shown in Figure 6.

    Figure 6. Occupation Distribution

Dataset          Movie Count   User Count   Rating Count   Sparsity (%)   Release Year
Our Dataset      2851          919          10,000         99.96          2017
MovieLens 100K   1700          1000         100,000        99.94          4/1998
MovieLens 1M     6000          4000         1,000,000      99.96          2/2003

Table 2. Comparison of Datasets

3.2. MovieLens vs Indian Regional Cinema Dataset

The key difference between the presented dataset and MovieLens is that the latter does not contain movies from Indian regional cinema; MovieLens has only a few Hindi and Urdu movies. Also, our data has been collected mainly from viewers of regional movies in India. The user metadata thus collected can be used to recommend more relevant movies to such audiences.
The MovieLens datasets are also biased towards a certain category of users. They contain data only from users who have rated at least twenty movies. The datasets do not include the data of users who could not find enough movies to rate or did not find the system easy enough to use. There may be a fundamental difference between such users and the users in the datasets. Our dataset makes no such distinction among users based on the number of movies that they have rated.
To make rating multiple movies easier for a user, we use binary ratings: a user can either "like" or "dislike" a movie, denoted by "1" and "-1" in the dataset. MovieLens, on the other hand, uses a star scale with up to ten levels (0.5 to 5 in half-star steps). A basic comparison of these datasets is shown in Table 2, which lists the number of movies, users and ratings, the sparsity and the release year of each dataset.

3.3. Dataset Collection

For the collection of user information and movie ratings, a web portal named Flickscore [15] was created, where users can sign up by filling in their details as shown in Figure 7. Users have to provide their preferred languages so that the portal can ask them to rate movies in those languages.

Figure 7. Sign Up form on Portal

While signing up, the user is prompted to fill in the metadata information. The user can then log in to the portal to rate movies as either like or dislike, and the responses are recorded as shown in Figure 8.

Figure 8. Rating Movies on Portal

4. Unsupervised Collaborative Filtering techniques

To analyze the dataset, several unsupervised techniques are used: user-user similarity, item-item similarity, Matrix Factorization, Probabilistic Matrix Factorization, Blind Compressed Sensing and Matrix Completion. The main advantages of such techniques are their ease of implementation and their incremental nature. On the other hand, they depend entirely on user-provided ratings, so their performance degrades as the sparsity of the data increases. For the same reason, these techniques cannot address the cold start problem, i.e., the case where a new user or item with no available ratings is added to the dataset. Bias correction is performed on the dataset by calculating the global mean, user bias and item bias, and the above techniques are then used to predict the rating of a user for an unrated item.
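
The bias correction step can be illustrated with a minimal sketch. The exact formulas are not given in the text, so this sketch assumes that the user bias is the mean deviation of a user's ratings from the global mean and that the item bias is the mean deviation that remains after removing the global mean and the user bias. The names `bias_correct` and `baseline` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def bias_correct(ratings):
    """ratings: list of (user, item, rating) triples, e.g. rating in {1, -1}.

    Returns the global mean, user biases, item biases and the residual
    ratings left after removing all three (assumed definitions, see lead-in).
    """
    mu = sum(r for _, _, r in ratings) / len(ratings)

    # User bias: average deviation of the user's ratings from the global mean.
    user_dev = defaultdict(list)
    for u, _, r in ratings:
        user_dev[u].append(r - mu)
    b_user = {u: sum(d) / len(d) for u, d in user_dev.items()}

    # Item bias: average deviation left after global mean and user bias.
    item_dev = defaultdict(list)
    for u, i, r in ratings:
        item_dev[i].append(r - mu - b_user[u])
    b_item = {i: sum(d) / len(d) for i, d in item_dev.items()}

    residuals = [(u, i, r - mu - b_user[u] - b_item[i]) for u, i, r in ratings]
    return mu, b_user, b_item, residuals


def baseline(mu, b_user, b_item, user, item):
    """Baseline prediction; unseen users or items fall back to zero bias."""
    return mu + b_user.get(user, 0.0) + b_item.get(item, 0.0)
```

A collaborative filtering model would then be fitted to the residuals, and the baseline added back to its output at prediction time.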

4.1. User and Item-based similarity

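In the user-user model, a similarity matrix $A$ is calculated, where each entry $A_{uu'}$ is the score computed by cosine similarity between a user $u$ and another user $u'$. It denotes how similar the two users $u$ and $u'$ are: the higher the score, the higher the similarity. Similarly, in the item-item model, each entry $A_{ii'}$ of the similarity matrix $A$ denotes the cosine similarity score between an item $i$ and another item $i'$; the higher the score, the more similar the two items.
Cosine similarity can be calculated for two users $u$ and $u'$ using the following equation:

$$A_{uu'} = \frac{\sum_i r_{ui}\, r_{u'i}}{\sqrt{\sum_i r_{ui}^2}\,\sqrt{\sum_i r_{u'i}^2}}$$

where $r_{ui}$ denotes the rating by user $u$ for item $i$.
The prediction for user $u$ for item $i$ is computed as:

$$\hat{r}_{ui} = \sum_{u'} w_{uu'}\, r_{u'i}, \qquad w_{uu'} = \frac{A_{uu'}}{\sum_{u''} |A_{uu''}|}$$

where $w_{uu'}$ is the normalized similarity weight, $r_{u'i}$ is the rating by user $u'$ for item $i$, and the sums run over the other users who have rated item $i$.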

Similar to user-user similarity, item-item similarity is calculated by computing cosine similarity between two items and ratings are predicted in the similar way.
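
The user-user prediction described in this section can be sketched as follows. This is a minimal illustration, assuming that unrated items count as zeros in the cosine similarity and that similarity weights are normalized by the sum of their absolute values; the function names and the dictionary layout of the ratings are hypothetical.

```python
import math


def cosine_sim(ru, rv):
    """Cosine similarity between two rating dicts (item -> rating).

    Items a user has not rated are treated as zeros (an assumption).
    """
    common = set(ru) & set(rv)
    dot = sum(ru[i] * rv[i] for i in common)
    norm_u = math.sqrt(sum(r * r for r in ru.values()))
    norm_v = math.sqrt(sum(r * r for r in rv.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict(ratings, user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    num = 0.0
    den = 0.0
    for other, r_other in ratings.items():
        if other == user or item not in r_other:
            continue
        s = cosine_sim(ratings[user], r_other)
        num += s * r_other[item]
        den += abs(s)
    # With no usable neighbour, return the neutral value 0.
    return num / den if den > 0 else 0.0
```

With like/dislike ratings coded as 1 and -1, a strongly dissimilar user who dislikes a movie contributes evidence in favour of a like, because both the weight and the rating are negative. The item-item variant follows by swapping the roles of users and items.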

4.2. Matrix factorization

Users' likes and dislikes are driven by hidden traits (latent factors) that may be reflected in the pattern of their ratings. This model maps users and movies to a joint latent factor space. Each item $i$ and user $u$ is associated with a vector $q_i$ and $p_u$ respectively, which measures the extent to which the item or user possesses those factors. The dot product $q_i^T p_u$ denotes the liking of a user for a specific item and approximates the rating $r_{ui}$ [10]. Computing the mapping of each user and item to factor vectors is the major challenge. Imputation of the missing ratings can prove expensive, as it noticeably increases the amount of data involved in the calculation. To model the observed ratings directly with regularization, the following equation is used:

$$\min_{q_*, p_*} \sum_{(u,i) \in \kappa} \left( r_{ui} - q_i^T p_u \right)^2 + \lambda \left( \|q_i\|^2 + \|p_u\|^2 \right)$$

Here, $\kappa$ is the set of those user-item pairs $(u, i)$ in the training set for which $r_{ui}$ is known.
The system uses the already observed ratings to fit a model on them and uses that model to predict the new ratings.
The intuition behind using matrix factorization to analyze this dataset is that there should be some latent features that determine how a user rates an item. For example, two users may give high ratings to a certain movie if they both like the actors/actresses of the movie, or if the movie is an action movie, which is a genre preferred by both users. Hence, if we can discover these latent features, we should be able to predict the rating given by a certain user to a certain item, because the features associated with the user should match with the features associated with the item.
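
A minimal sketch of fitting such a latent factor model is given below. It assumes stochastic gradient descent on the regularized squared error over the observed ratings; the optimizer, the hyperparameter values and the names `train_mf` and `predict_mf` are illustrative assumptions and not taken from the experiments reported here.

```python
import random


def predict_mf(P, Q, u, i):
    """Predicted rating: dot product of user and item factor vectors."""
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))


def train_mf(triples, n_users, n_items, k=2, lr=0.05, reg=0.02,
             epochs=300, seed=0):
    """triples: list of (user_index, item_index, rating) for observed ratings."""
    rng = random.Random(seed)
    # Small random initialisation of user (P) and item (Q) factors.
    P = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    order = list(triples)  # copy so the caller's list is not reordered
    for _ in range(epochs):
        rng.shuffle(order)
        for u, i, r in order:
            err = r - predict_mf(P, Q, u, i)
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on (r - q.p)^2 + reg * (|q|^2 + |p|^2).
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q
```

Only the observed user-item pairs are visited, so no imputation of missing ratings is needed.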

Technique                            MovieLens 100K      Our Dataset         MovieLens 1M
                                     MAE      RMSE       MAE      RMSE       MAE      RMSE
User-User similarity                 0.6980   1.026      0.5307   1.03       0.607    0.8810
Item-Item similarity                 0.744    1.061      0.648    1.049      0.671    0.9196
Matrix Factorization                 0.828    1.128      0.471    0.971      0.6863   0.8790
Probabilistic Matrix Factorization   0.7564   0.9639     0.481    0.9372     0.7241   0.9127
Blind Compressed Sensing             0.7356   0.9409     0.463    0.9612     0.6917   0.8789
Matrix Completion                    0.8324   1.102      0.4827   0.9264     0.7196   0.9102
Table 3. Unsupervised Techniques

Technique                            MovieLens 100K      Our Dataset         MovieLens 1M
                                     MAE      RMSE       MAE      RMSE       MAE      RMSE
Supervised Matrix Factorization      0.7199   0.9196     0.4367   0.9283     0.6709   0.8567
Table 4. Supervised Technique

4.3. Probabilistic Matrix Factorization

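Probabilistic Matrix Factorization can handle large datasets because it scales linearly with the number of observations in the dataset. Let $R_{ij}$ represent the rating of user $i$ for movie $j$. Let $U \in \mathbb{R}^{D \times N}$ and $V \in \mathbb{R}^{D \times M}$ be the latent feature matrices for users and movies, respectively. Their column vectors are denoted as $U_i$, representing user-specific latent feature vectors, and $V_j$, representing movie-specific latent feature vectors [4]. The log posterior is maximized over movie and user features with the hyperparameters kept fixed, which is equivalent to minimizing the following objective:

$$E = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{M} I_{ij} \left( R_{ij} - U_i^T V_j \right)^2 + \frac{\lambda_U}{2} \sum_{i=1}^{N} \|U_i\|^2 + \frac{\lambda_V}{2} \sum_{j=1}^{M} \|V_j\|^2$$

where $I_{ij}$ is 1 if user $i$ rated movie $j$ and 0 otherwise, and $\lambda_U$ and $\lambda_V$ are the regularization parameters for users and items respectively. A local minimum of this objective can be computed by gradient descent in $U$ and $V$. The model performance is measured by computing the mean absolute error (MAE) and root mean squared error (RMSE) on the test set.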

This model can be viewed as a probabilistic extension of the SVD model, since if all ratings have been observed, the objective given by the equation reduces to the SVD objective in the limit of prior variances going to infinity. This technique better addresses the sparsity and scalability problems and thus improves prediction performance. It gives an intuitive rationale for recommendation.
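
The gradient descent step mentioned above can be sketched with NumPy. This is a minimal illustration under stated assumptions: the objective is the masked sum of squared errors with quadratic regularization on the latent features, plain full-batch gradient descent is used, and the learning rate, step count and function names are hypothetical choices rather than the settings used in the experiments.

```python
import numpy as np


def pmf_objective(R, I, U, V, lam_u, lam_v):
    """Masked squared error plus quadratic penalties on U (D x N) and V (D x M)."""
    err = I * (R - U.T @ V)
    return (0.5 * np.sum(err ** 2)
            + 0.5 * lam_u * np.sum(U ** 2)
            + 0.5 * lam_v * np.sum(V ** 2))


def pmf_fit(R, I, d=2, lam_u=0.05, lam_v=0.05, lr=0.05, steps=500, seed=0):
    """R: N x M rating matrix, I: N x M indicator (1 where a rating is observed)."""
    rng = np.random.default_rng(seed)
    n, m = R.shape
    U = 0.1 * rng.standard_normal((d, n))
    V = 0.1 * rng.standard_normal((d, m))
    for _ in range(steps):
        err = I * (R - U.T @ V)           # zero wherever a rating is missing
        grad_U = -V @ err.T + lam_u * U   # d x n
        grad_V = -U @ err + lam_v * V     # d x m
        U -= lr * grad_U
        V -= lr * grad_V
    return U, V
```

The predicted rating of user $i$ for movie $j$ is then the entry $(i, j)$ of `U.T @ V`.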

4.4. Blind Compressed Sensing

A dense user latent factor matrix is a reasonable assumption, as each user will like or dislike every trait to a certain extent [24]. However, any item will possess only a few of the attributes and never all. Hence, the item latent factor matrix should ideally have a sparse structure rather than the dense one formulated in earlier works.
The objective of this approach is to find the user and item latent factor matrices. The user latent factor matrix can be dense, but the same does not logically follow for the item latent factor matrix. Enforcing sparsity of the item latent factor matrix increases the recommendation accuracy significantly [5]. The following objective is minimized:

$$\min_{U, V} \; \| Y - M \odot (UV) \|_F^2 + \lambda_u \|U\|_F^2 + \lambda_v \|V\|_1$$

where $\lambda_u$ and $\lambda_v$ are the regularization parameters for users and items respectively, $M$ is the binary mask matrix, $Y$ is the rating matrix, $\|V\|_1$ is the elementwise $\ell_1$ norm, and $U$ and $V$ are the user and item latent factor matrices respectively, both of which were assumed to be dense in earlier models.
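
One possible way to minimize an objective of this kind, a masked squared error with a Frobenius penalty on the user factors and an $\ell_1$ penalty on the item factors, is alternating proximal gradient descent. The sketch below is an illustration under that assumption and is not necessarily the solver used in [5]; the function names, the initialization and the parameter values are hypothetical.

```python
import numpy as np


def soft_threshold(X, t):
    """Proximal operator of t * ||X||_1 (elementwise shrinkage)."""
    return np.sign(X) * np.maximum(np.abs(X) - t, 0.0)


def bcs_objective(Y, M, U, V, lam_u, lam_v):
    fit = np.sum((M * (Y - U @ V)) ** 2)
    return fit + lam_u * np.sum(U ** 2) + lam_v * np.sum(np.abs(V))


def bcs_fit(Y, M, k=2, lam_u=0.1, lam_v=0.1, steps=500, seed=0):
    """Y: users x items ratings (0 where missing), M: binary mask of observed entries."""
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    U = rng.standard_normal((n, k))   # dense user factors
    V = rng.standard_normal((k, m))   # item factors, driven towards sparsity
    for _ in range(steps):
        # Gradient step on U with step size 1 / (Lipschitz constant).
        res = M * (U @ V - Y)
        L_u = 2.0 * (np.linalg.norm(V, 2) ** 2 + lam_u)
        U = U - (2.0 * res @ V.T + 2.0 * lam_u * U) / L_u
        # Proximal gradient (ISTA) step on V: gradient step, then shrinkage.
        res = M * (U @ V - Y)
        L_v = 2.0 * np.linalg.norm(U, 2) ** 2 + 1e-12
        V = soft_threshold(V - (2.0 * U.T @ res) / L_v, lam_v / L_v)
    return U, V
```

The shrinkage step is what sets small entries of the item factor matrix exactly to zero, while the user factors are only shrunk smoothly and therefore stay dense.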

4.5. Matrix Completion

Matrix completion involves filling in the missing entries of a partially observed matrix. It aims to compute the matrix with the lowest rank or, if the rank $r$ of the completed matrix is known, a matrix of rank $r$ that matches the known entries. A popular approach for solving the problem is nuclear-norm-regularized (NN) matrix approximation [7], as shown in the following equation:

$$\min_{R} \; \frac{1}{2} \| M \odot (Y - R) \|_F^2 + \lambda \|R\|_*$$

where $\|R\|_*$ is the nuclear norm of $R$ (the sum of its singular values), $\lambda$ is the regularization parameter and $M$ is the binary mask. $R$ is the imputed rating matrix and $Y$ is the original rating matrix.
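
A simple way to solve a nuclear-norm-regularized completion problem is a Soft-Impute style iteration: fill the missing entries with the current estimate, then shrink the singular values of the filled matrix. The sketch below assumes this iteration and a fixed regularization value; it is an illustration and not necessarily the alternating least squares solver of [7] or the one used in the experiments. The function name and parameters are hypothetical.

```python
import numpy as np


def soft_impute(Y, M, lam=0.5, iters=200):
    """Y: ratings (any value where missing), M: binary mask of observed entries.

    Returns a low-rank estimate R of the full rating matrix.
    """
    Y = np.asarray(Y, dtype=float)
    R = np.zeros_like(Y)
    for _ in range(iters):
        # Keep observed ratings, use the current estimate elsewhere.
        filled = M * Y + (1 - M) * R
        Us, s, Vt = np.linalg.svd(filled, full_matrices=False)
        # Soft-threshold the singular values (proximal step of the nuclear norm).
        s = np.maximum(s - lam, 0.0)
        R = (Us * s) @ Vt
    return R
```

Larger values of `lam` remove more singular values and therefore yield a lower-rank, more strongly shrunk estimate.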

5. Supervised Collaborative Filtering techniques

To analyze the dataset, a supervised technique, Supervised Matrix Factorization, is also used. The main advantage of supervised methods is that they can make predictions for a new user or a new item, which unsupervised techniques fail to do [28, 29, 30, 31]; this is known as the cold start problem. These methods are scalable and make use of the metadata of users and items, which lets them relate users and items more reliably and give more accurate predictions. Bias correction is performed on the dataset by calculating the user bias and item bias, and the above technique is then used to predict the rating of a new user for an item.

5.1. Supervised Matrix Factorization

The task of predicting ratings becomes difficult largely because of the sparsity of the ratings available in the database of a recommender system. Therefore, using knowledge of user demographics and item categories can enhance prediction accuracy [25, 26, 27, 32]. Classes are formed as per users' age group, gender and occupation. A user can belong to multiple classes at a time. Class label information is important for learning the latent factor vectors of users and movies in a supervised setting, in such a way that they are consistent with the available class labels. Class label information imposes additional constraints that reduce the search space, as a result of which the indeterminacy of the problem is reduced.
Mathematically, within the matrix factorization framework, additional information from user metadata and item metadata can be used to constrain the user latent factor matrix $U$ and the item latent factor matrix $V$, and the following objective can be minimized [6]:

$$\min_{U, V, A, C} \; \| Y - M \odot (UV) \|_F^2 + \lambda_u \|U\|_F^2 + \lambda_v \|V\|_F^2 + \mu_u \| W - UA \|_F^2 + \mu_v \| Q - CV \|_F^2$$

where $W_{ij} = 1$ if user $i$ belongs to class $j$, else 0. $A$ is the linear map from the latent factor space to the classification domain. $Q$ is the class information matrix for items, created similarly to $W$, and $C$ is the corresponding linear map. Other variables have their usual meanings. Introducing supervised learning into the latent factor model helps in improving the prediction accuracy by reducing the problem of rating matrix sparsity. The values of the regularization parameters are determined using the L-curve technique [16].

6. Experiments and Results

Three different datasets are used to compare the results of the supervised and unsupervised collaborative filtering techniques used to predict user ratings: MovieLens 100K, MovieLens 1M and our dataset of Indian regional movies. For error calculation, the mean absolute error (MAE) and root mean squared error (RMSE) are calculated between the actual and the predicted ratings. The datasets are divided into 5 folds for evaluation. The ratings are binarized into like/dislike (1/-1) labels for the experiments. Results of the different techniques on these datasets are shown in Table 3 and Table 4.
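
For reference, the two error measures and a 5-fold split can be sketched as below. The way the folds are formed is not specified in the text, so this sketch assumes a random partition of the rating indices; the function names are illustrative.

```python
import math
import random


def mae(actual, predicted):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def k_fold_indices(n, k=5, seed=0):
    """Yield (train_indices, test_indices) for k folds over n ratings."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    for f in range(k):
        test = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, test
```

Each model is trained on the training indices of a fold and scored with `mae` and `rmse` on the held-out indices; the reported value would then be the average over the five folds.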

As the values in Table 3 indicate, the basic cosine similarity measures between users and between movies perform fairly well on all datasets. The minimum MAE values are obtained on our dataset. Since the sparsity of the regional cinema dataset and the MovieLens 1M dataset is very high (as indicated in Table 2), techniques like Probabilistic Matrix Factorization and Blind Compressed Sensing perform better than the basic similarity measures, and among them the least MAE is again obtained on our dataset.
To use metadata, the information is encoded as one-hot vectors of 1's and 0's; in the case of languages, multiple 1's can be present in the vector. Since the supervised technique uses both user and item metadata, it outperforms the unsupervised collaborative filtering techniques (Table 4). Among all three datasets, the minimum MAE is obtained on our dataset. This shows that the Indian Regional Cinema dataset can prove useful for building and benchmarking recommendation systems in the Indian context, which has the most diverse languages and demographics.
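
The metadata encoding can be illustrated with a short sketch. The occupation categories are those listed in Section 3.1; the language list here is a shortened subset used only for illustration, and the function names are hypothetical.

```python
# Subset of the 18 languages, for illustration only.
LANGUAGES = ["Hindi", "Bengali", "Tamil", "Telugu"]
OCCUPATIONS = ["student", "self-employed", "service", "retired", "others"]


def one_hot(value, vocabulary):
    """Vector with a single 1 at the position of `value`."""
    return [1 if v == value else 0 for v in vocabulary]


def multi_hot(values, vocabulary):
    """Vector with a 1 for every entry of `values` (e.g. known languages)."""
    chosen = set(values)
    return [1 if v in chosen else 0 for v in vocabulary]


def encode_user(occupation, languages):
    """Concatenate the occupation one-hot and the language multi-hot vectors."""
    return one_hot(occupation, OCCUPATIONS) + multi_hot(languages, LANGUAGES)
```

For example, `encode_user("student", ["Hindi", "Tamil"])` yields a vector with one 1 in the occupation part and two 1's in the language part.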

7. Conclusion and Future Work

India is a country with not only many languages but also a very diverse population. Consequently, Indian regional cinema is highly diverse in both the number of languages and the demographics of its viewers. Thousands of such movies are produced annually and there is a huge community of people who watch them. Therefore, a recommender system for Indian regional movies is needed to address the preferences of the growing number of their viewers. This dataset has around 10K ratings by Indian users, along with their demographic information. We believe that this dataset could be used to design, improve and benchmark recommendation systems for Indian regional cinema. We plan to release the dataset after its publication. We further want to release another version of this dataset with more ratings and users, which will help to improve the current state of recommender systems for the Indian audience.


  • (1) MovieLens:
  • (2) Netflix:
  • (3) IMDb:
  • (4) A. Mnih and R. Salakhutdinov, Probabilistic matrix factorization, in Proc. Adv. Neural Inf. Process. Syst., 2007, pp. 1257-1264.
  • (5) A. Gogna and A. Majumdar. (2015). Blind compressive sensing framework for collaborative filtering. Available:
  • (6) A. Gogna and A. Majumdar, A Comprehensive Recommender System Model: Improving Accuracy for Both Warm and Cold Start Users, in IEEE Access, vol. 3, no. , pp. 2803-2813, 2015.
  • (7) T. Hastie, R. Mazumder, J. Lee, R. Zadeh, Matrix Completion and Low-Rank SVD via Fast Alternating Least Squares,
  • (8) Netflix Prize:
  • (9)
  • (10) Yehuda Koren, Matrix Factorization Techniques for Recommender Systems, Published by the IEEE Computer Society, August 2009.
  • (11) Xiaoyuan Su and Taghi M. Khoshgoftaar, A survey of collaborative filtering techniques, in Advances in artificial intelligence, vol. 2009, pp. 4, 2009.

  • (12) Thomas Hofmann, Latent semantic models for collaborative filtering, in ACM Transactions on Information Systems (TOIS), vol. 22, no. 1, pp. 89-115, 2004.
  • (13) Gediminas Adomavicius and Alexander Tuzhilin, Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions, in Knowledge and Data Engineering, IEEE Transactions on, vol. 17, no. 6, pp. 734-749, 2005.
  • (14) Sivan Gleichman and Yonina C Eldar, Blind compressed sensing, in Information Theory, IEEE Transactions on, vol. 57, no. 10, pp. 6958-6975, 2011.
  • (15) Flickscore:
  • (16) C. L. Lawson and R. J. Hanson, Solving Least Squares Problems, vol. 161. Englewood Cliffs, NJ, USA: Prentice-Hall, 1974.
  • (17) M. F. Hornick and P. Tamayo, Extending recommender systems for disjoint user/item sets: The conference recommendation problem, IEEE Trans. Knowl. Data Eng., vol. 24, no. 8, pp. 1478-1490, Aug. 2012.
  • (18) Q. Liu, E. Chen, H. Xiong, Y. Ge, Z. Li and X. Wu, A cocktail approach for travel package recommendation, IEEE Trans. Knowl. Data Eng., vol. 26, no. 2, pp. 278-293, Feb. 2014.
  • (19) Y. Koren and R. Bell, Advances in collaborative filtering, in Recommender Systems Handbook. New York, NY, USA: Springer, 2011, pp. 145-186.
  • (20) X. Su and T. M. Khoshgoftaar, A survey of collaborative filtering techniques, Adv. Artif. Intell., vol. 2009, Jan. 2009, Art. ID 4.
  • (21) R. M. Bell and Y. Koren, Improved neighborhood-based collaborative filtering, in Proc. KDD-Cup Workshop, 2007, pp. 7-14.
  • (22) C. Desrosiers and G. Karypis, A comprehensive survey of neighborhood based recommendation methods, in Recommender Systems Handbook. New York, NY, USA: Springer, 2011, pp. 107-144.
  • (23) J. Wang, A. P. de Vries, and M. J. T. Reinders, Unifying user-based and item-based collaborative filtering approaches by similarity fusion, in Proc. 29th Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retr., 2006, pp. 501-508.
  • (24) S. Gleichman and Y. C. Eldar, Blind compressed sensing, IEEE Trans. Inf. Theory, vol. 57, no. 10, pp. 6958-6975, Oct. 2011.
  • (25) H. Ma, D. Zhou, C. Liu, M. R. Lyu, and I. King, Recommender systems with social regularization, in Proc. 4th ACM Int. Conf. Web Search Data Mining, 2011, pp. 287-296.
  • (26) P. Massa and P. Avesani, Trust-aware recommender systems, in Proc. ACM Conf. Recommender Syst., 2007, pp. 17-24.
  • (27) X. Tang, Y. Xu, and S. Geva, Learning higher order interactions for user and item profiling based on tensor factorization, in Proc. 20th Int. Conf. Intell. User Interfaces, 2015, pp. 213-224.
  • (28) Q. Gu, J. Zhou, and C. Ding, Collaborative filtering: Weighted non negative matrix factorization incorporating user and item graphs, in Proc. SDM, 2010, pp. 199-210.
  • (29) S.-T. Park, D. Pennock, O. Madani, N. Good, and D. DeCoste, Naive filterbots for robust cold-start recommendations, in Proc. 12th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2006, pp. 699-705.
  • (30) A. L. V. Pereira and E. R. Hruschka, Simultaneous co-clustering and learning to address the cold start problem in recommender systems, Knowl.-Based Syst., vol. 82, pp. 11-19, Jul. 2015.
  • (31) Z.-K. Zhang, C. Liu, Y.-C. Zhang, and T. Zhou, Solving the cold-start problem in recommender systems with social tags, EPL (Europhys. Lett.), vol. 92, no. 2, p. 28002, Nov. 2010.
  • (32) A.-T. Nguyen, N. Denos, and C. Berrut, Improving new user recommendations with rule-based induction on cold user data, in Proc. ACM Conf. Recommender Syst., 2007, pp. 121-128.
  • (33) G. Adomavicius and A. Tuzhilin, Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions, IEEE Trans. Knowl. Data Eng., vol. 17, no. 6, pp. 734-749, Jun. 2005.
  • (34) Amazon:
  • (35) BarnesAndNoble:
  • (36)
  • (37)
  • (38)
  • (39)