A Surrogate-based Generic Classifier for Chinese TV Series Reviews

With the emergence of various online video platforms like Youtube, Youku and LeTV, reviews of online TV series are becoming more and more important for both viewers and producers. Viewers rely heavily on these reviews when selecting TV series, while producers use them to improve quality. As a result, automatically classifying reviews according to different requirements has evolved into a popular research topic. In this paper, we focus on reviews of hot TV series in China and train generic classifiers based on eight predefined categories. The experimental results show promising performance and demonstrate that the classifiers generalize effectively to different TV series.




1 Introduction

With the development of Web 2.0, more and more commercial websites, such as Amazon, Youtube and Youku, encourage users to post product reviews on their platforms [1, 2]. These reviews are helpful for both readers and product manufacturers. For example, for TV or movie producers, online reviews indicate the aspects that viewers like and/or dislike. This information facilitates the production process: when producing future films and TV series, producers can tailor their shows to better match consumers’ tastes. For manufacturers, reviews may reveal customers’ preferences and feedback on product functions, which helps them improve their products in future development. Consumers, on the other hand, can evaluate the quality of a product or TV series based on online reviews, which helps them decide whether to buy or watch it. However, thousands of reviews emerge every day. Given consumers’ limited time and attention, it is impossible for them to allocate equal attention to all reviews. Moreover, some readers may be interested only in certain aspects of a product or TV series, and reading the irrelevant ones wastes their time. As a result, automatic classification of reviews is essential for review platforms to give users a better perception of review contents.

Most existing review studies focus on product reviews in English. In this paper, by contrast, we focus on reviews of hot Chinese movies and TV series, which have some unique characteristics. First, Table 1 shows the development of Chinese movies in recent years [3]. The growth in box office and viewers has been dramatic, providing a substantial reviewer base for movie/TV series review data. Moreover, the State Administration of Radio, Film and Television has announced that China’s movie market is now second only to the North American market, and [3] predicts that it may become the largest movie market in the world within the next 5-10 years. Therefore, understanding the movie market in China is of great interest to researchers, practitioners and investors.

Besides the flourishing of movies and TV series, there are differences in aspect focus between product and TV series reviews. When writing a movie/TV series review, a reviewer cares not only about on-screen elements like actors/actresses, visual effects, dialogue and music, but also about the related team, consisting of the director, screenwriter, producer, etc. With product reviews, by contrast, few reviewers care about the corresponding backstage teams; what they comment on are product-related issues, such as drawbacks of product functions or which aspects of the merchandise they like or dislike. Moreover, most recent research has focused on English text, whose grammatical structure is simpler and vocabulary smaller compared with Chinese. Therefore, Chinese movie reviews not only provide more content-based information but also raise more technical challenges. With the boom of Chinese movies, automatic classification of Chinese movie reviews is essential and meaningful.

Year Box office (million) # of viewers (million)
Table 1: Chinese Movies Box Office Statistics

In this paper, we propose several strategies to make our classifiers generalize to TV series unseen during training. First, TV series roles’ and actors’/actresses’ names are substituted with generic tags like role_i and actor_j, where i and j reflect their importance in the show. On top of such tokens, features are further reduced by feature selection techniques like DRC or χ², in order to make them more generic. We also experimented with different feature sizes and multiple classifiers to alleviate the overfitting caused by high-dimensional features.

The remainder of this paper is organized as follows. Section 2 describes some related work. Section 3 states our problem and details our proposed procedure of approaching the problem. In Section 4, experimental results are provided and discussed. Finally, the conclusions are presented in Section 5.

2 Related Work

Since ours is a supervised learning task with text input, it relates to work on feature selection and supervised classifiers. Besides, the publicly available movie review datasets are currently in English only, which does not match our language requirement. In the remainder of this section, we first introduce the existing feature selection techniques and supervised classifiers applied in our approach, and then present relevant datasets commonly used in the movie review domain.

2.1 Feature selection

Feature selection, or variable selection, is a common strategy in machine learning that selects a subset of relevant features from the whole set. There are mainly three purposes behind it. A smaller, lower-dimensional feature set helps researchers understand and interpret the model more easily. Fewer features also improve the generalization of the model by preventing overfitting, and reduce the overall training time.

Document Relevance Correlation (DRC), proposed by Fan et al. [4], is a useful feature selection technique. The authors applied it to profile generation in digital library services and news monitoring, comparing DRC with well-known methods like Robertson’s Selection Value [5] and machine-learning-based ones like information gain [6]. Promising experimental results demonstrated the effectiveness of DRC for feature selection on text.

Another popular feature selection method is χ² [7], a variant of the χ² test in statistics, which tests the independence of two events. In the feature selection setting, the two events are the occurrence of a feature term and the occurrence of a particular class, and feature terms can then be ranked by their χ² values. It has proved very useful in the text domain, especially with the bag-of-words feature model, which only considers the presence of each term.

2.2 Supervised Classifier

Our task is to classify each review into several generic categories that might be of interest to readers, so classifier selection is also quite important in our problem. Supervised learning takes labeled training pairs and learns an inferred function that can be used to predict new samples. In this paper, we consider both kinds of learning algorithms, i.e., generative and discriminative, and choose three typical algorithms to compare. Naive Bayes [8], representative of generative learning, outputs the class with the highest posterior probability computed via Bayes’ rule. Discriminative classifiers like logistic regression [9] and the Support Vector Machine [10] base their decisions on the classifier’s output score, which is compared with a threshold to distinguish between classes.

2.3 TV series Review Dataset

The dataset is another important factor influencing the performance of our classifiers. Most publicly available movie review data is in English, like the IMDB dataset collected by Pang and Lee [11]. Although it covers all kinds of movies on the IMDB website, it only carries sentiment-related labels, as its initial goal was sentiment analysis. Another intact movie review dataset is SNAP [12], which consists of Amazon reviews but bears only rating scores. However, what we need are the content or aspect tags discussed in each review, and our review text is in Chinese. Therefore, we must build the review dataset ourselves and label it with generic categories, which is one of the contributions of this paper.

3 Chinese TV series Review Classification

Let R = {r_1, r_2, ..., r_n} be a set of Chinese movie reviews with no categorical information. The ultimate task of movie review classification is to label them with predefined categories C = {c_1, ..., c_8}. Starting from scratch, we need to collect such a review set from an online review website and then manually label the reviews into generic categories. Based on the collected dataset, we can apply natural language processing techniques to obtain raw text features and further learn the classifiers. In the following subsections, we go through and elaborate all the subtasks shown in Figure 1.



[Figure 1 depicts the pipeline: a review crawler collects TV series reviews and a knowledge crawler builds a role/actor knowledge base from Baidu Encyclopedia; reviews are tokenized with stop words removed, names are replaced by surrogate tags, features are reduced via DRC and χ², and generic classifiers are trained with the category labels.]
Figure 1: Procedure of Building Generic Classifiers

3.1 Building Dataset

What we are interested in are reviews of the hottest or currently broadcast TV series, so we selected one of the most influential movie and TV series review websites in China, Douban, which hosts a section for every movie or TV series. For the sake of popularity, we chose “The Journey of Flower”, “Nirvana in Fire” and “Good Time” as parts of our movie review dataset; these were the hottest TV series from summer to fall 2015. Reviews of each episode were collected for the sake of dataset comprehensiveness.

Then we built the crawler in Python with the help of Scrapy, which fetches multiple pages concurrently and saves us considerable time. For each episode, it collected both the short description of the episode and all the reviews under the post. The statistics of our TV series review dataset are shown in Table 2.
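
Since Douban's real markup is site-specific and changes over time, the extraction step of such a crawler can only be sketched against a hypothetical page structure; the `review-text` class name below is invented for illustration, and a real spider would run inside Scrapy rather than the standard-library parser used here:

```python
from html.parser import HTMLParser

class ReviewExtractor(HTMLParser):
    """Collects the text of elements marked with a (hypothetical) review class."""
    def __init__(self):
        super().__init__()
        self.in_review = False
        self.reviews = []

    def handle_starttag(self, tag, attrs):
        if tag == "p" and ("class", "review-text") in attrs:
            self.in_review = True
            self.reviews.append("")

    def handle_data(self, data):
        if self.in_review:
            self.reviews[-1] += data

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_review = False

page = '<div><p class="review-text">好看</p><p class="review-text">剧情拖沓</p></div>'
parser = ReviewExtractor()
parser.feed(page)
print(parser.reviews)  # ['好看', '剧情拖沓']
```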

TV series # of reviews
The Journey of Flower
Nirvana in Fire
Good Time
Table 2: Statistics of TV series Review Dataset

3.2 Basic Text Processing

Based on the collected reviews, we are ready to build a rough classifier. Before feeding the reviews into a classifier, we applied two common procedures, tokenization and stop-word removal, to all the reviews. We also applied a text processing step to make our reviews more generic: the roles’ and actors’/actresses’ names in the reviews are replaced with common tokens like role_i and actor_j, where i and j are determined by their importance in the TV series. That is, we apply the mapping

name_k -> role_{f(k)} or actor_{f(k)},

where f(·) is a function mapping a role’s or actor’s index to its importance rank. However, in practice it is not trivial to infer the importance of all actors and actresses. We rely on data from Baidu Encyclopedia, the Chinese counterpart of Wikipedia. For each movie or TV series, Baidu Encyclopedia has all the required information, including the level of importance of each role and actor in the show: actors/actresses in leading roles are listed first, followed by those in supporting roles and other players. Thus we can build a crawler to collect this information and replace the corresponding words in reviews with generic tags.
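
The substitution step itself is straightforward once the importance-ordered name lists have been crawled; a minimal sketch (the role and actor names below are illustrative, with list position encoding importance as on Baidu Encyclopedia):

```python
def build_surrogates(roles, actors):
    """Map each name to a generic tag; list order encodes importance
    (leading roles/actors come first in the crawled lists)."""
    mapping = {}
    for i, name in enumerate(roles, start=1):
        mapping[name] = f"role_{i}"
    for j, name in enumerate(actors, start=1):
        mapping[name] = f"actor_{j}"
    return mapping

def apply_surrogates(text, mapping):
    # Replace longer names first so a short name never clobbers a longer one.
    for name in sorted(mapping, key=len, reverse=True):
        text = text.replace(name, mapping[name])
    return text

mapping = build_surrogates(roles=["花千骨", "白子画"], actors=["赵丽颖", "霍建华"])
review = "赵丽颖把花千骨演活了"
print(apply_surrogates(review, mapping))  # actor_1把role_1演活了
```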

Afterwards, the word sequence of each review is tokenized and stop words are removed. Each sequence is broken up into a vector of unigram-based tokens using NLPIR [13], a powerful tool supporting sentence segmentation in Chinese. Stop words are words that do not contribute to the meaning of the whole sentence and are usually filtered out before further processing. Since our reviews are collected from online websites, which may include lots of forum slang, we add common forum words to the basic Chinese stop-word list for this particular domain. Shown below are some typical examples, in English, that are widely used in Chinese forums.

BBS, BT, NB, BS, CU, LOL, 4242, SF, YY, …

These two processes remove a significant amount of noise from the data.
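
The filtering logic around the segmenter can be sketched as follows; the whitespace tokenizer is only a stand-in for NLPIR (real Chinese text needs a proper segmenter), and the stop-word list is an illustrative subset:

```python
# Basic Chinese stop words plus domain-specific forum slang (illustrative subset).
STOP_WORDS = {"的", "了", "是", "啊", "BBS", "BT", "NB", "LOL", "YY"}

def tokenize(text):
    # Stand-in for NLPIR segmentation: assumes pre-segmented, space-joined text.
    return text.split()

def clean(review):
    return [tok for tok in tokenize(review) if tok not in STOP_WORDS]

print(clean("LOL 这 部 剧 的 配乐 真 好"))  # ['这', '部', '剧', '配乐', '真', '好']
```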

3.3 Topic Modelling and Labeling

With volumes of TV series review data, it is hard to define generic categories without reading the reviews one by one. Therefore, it is necessary to run an unsupervised model to get an overview of what is being discussed across the whole corpus. Here we applied Latent Dirichlet Allocation (LDA) [14, 15] to discover the main topics related to the movies and actors. In a nutshell, the LDA model assumes a hidden structure of topics underlying the whole text corpus and uses the co-occurrence of observed words to learn this structure; mathematically, it computes the posterior distribution of the unobserved variables. Given a set of training documents, LDA returns two main outputs. The first is the list of topics, each represented as a set of words whose weights indicate their contribution to the topic. The second is, for each document, a vector of weights giving the probability that the document contains each topic.

Based on the results from LDA, we carefully defined eight generic categories of movie reviews which are most representative in the dataset as shown in Table 3.

Categories | Specific Meaning
Plot of the TV series | development of the plot in the TV series
Actor/actress | actors/actresses related to this TV series
Role | liking or disliking specific roles, or other role-related comments
Dialogue | discussion or analysis of the dialogues
Analysis | deep analysis of the plot and the roles’ inner activity
Platform | related to a specific platform, e.g., audience ratings on some channels
Thumb up or down | simply following the post with common forum words
Noise or others | advertisements or nonsensical posts
Table 3: Categories of Movie Reviews

The purpose of this research is to classify each review into one of the above categories. In order to build reasonable classifiers, we first need a labeled dataset. Each TV series review was labeled by at least two individuals, and only reviews assigned the same label by all annotators were kept for our training and testing data; this filters out reviews subject to human bias. As a result, for each TV series we obtain the subset of reviews that match this selection criterion.

3.4 Feature Selection

After the labelled, cleaned data has been generated, we are ready to process the dataset. One problem is that the vocabulary size of our corpus is quite large, which could cause overfitting on the training data: as the feature dimension grows, so does model complexity, and the gap widens between what we expect to learn and what we actually learn from a particular dataset. One common way of dealing with this issue is feature selection. Here we applied the DRC and χ² methods mentioned in the related work. First, let us define a contingency table for each word w, as in Table 4, where w = 1 denotes the presence of word w in a review and w = 0 its absence.

      | Relevant | Irrelevant | Total
w = 1 | A        | B          | A+B
w = 0 | C        | D          | C+D
Table 4: Contingency Table for Word w

Recall that in classical statistics, χ² is a method designed to measure the independence of two variables or events, which in our case are the occurrence of word w and a document’s relevance to class c; a higher value means a higher correlation between them. Therefore, based on the definition of χ² in [7] and Table 4 above, the χ² value can be represented as

χ²(w, c) = N (AD - BC)² / ((A+B)(C+D)(A+C)(B+D)), where N = A + B + C + D.
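
Plugging the contingency counts into the standard 2×2 χ² statistic can be sketched in a few lines:

```python
def chi_square(A, B, C, D):
    """Chi-square statistic from the word/class contingency table:
    A = word present & relevant,  B = word present & irrelevant,
    C = word absent & relevant,   D = word absent & irrelevant."""
    N = A + B + C + D
    denom = (A + B) * (C + D) * (A + C) * (B + D)
    if denom == 0:
        return 0.0
    return N * (A * D - B * C) ** 2 / denom

# A word appearing almost only in relevant reviews scores high ...
print(chi_square(40, 5, 10, 45))
# ... while a word spread evenly across classes scores zero.
print(chi_square(25, 25, 25, 25))  # 0.0
```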
The DRC method, meanwhile, is based on the Relevance Correlation Value (RCV), which measures the similarity between two binary distributions over the training data: the occurrence of word w and the documents’ relevance to class c. For a particular word w, its occurrence distribution along all the data can be represented as (assuming we have n reviews)

X_w = (x_1, x_2, ..., x_n), x_i in {0, 1},

and we also know each review d_i’s relevance with respect to c from the manually tagged labels:

Y = (y_1, y_2, ..., y_n), y_i in {0, 1},

where y_i = 0 means irrelevant and y_i = 1 means relevant. We can therefore calculate the similarity between these two vectors as

RCV(w) = sum_i x_i y_i / (sqrt(sum_i x_i²) · sqrt(sum_i y_i²)),

where RCV(w) is called the Relevance Correlation Value for word w. Because each x_i and y_i is either 0 or 1, with the notation of the contingency table, RCV simplifies to

RCV(w) = A / sqrt((A+B)(A+C)).

On top of RCV, DRC incorporates the probability of the presence of word w given that the document is relevant. The final formula for computing DRC thus becomes

DRC(w) = P(w = 1 | relevant) · RCV(w) = (A / (A+C)) · RCV(w).

We can therefore apply the above two methods to all the word terms in our dataset and keep the words with the highest χ² or DRC values, reducing the dimension of our input features.
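
Under the simplified contingency-table form above, RCV and DRC can be computed straight from the four counts; a minimal sketch:

```python
import math

def rcv(A, B, C):
    """Relevance Correlation Value from contingency counts: the cosine
    similarity of the binary word-occurrence and relevance vectors."""
    denom = math.sqrt((A + B) * (A + C))
    return A / denom if denom else 0.0

def drc(A, B, C):
    """DRC = P(word present | relevant) * RCV, following Fan et al. [4]."""
    relevant = A + C
    if relevant == 0:
        return 0.0
    return (A / relevant) * rcv(A, B, C)

# Word occurring in 40 of the 50 relevant reviews and in 5 irrelevant ones:
print(round(drc(A=40, B=5, C=10), 3))  # 0.675
```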

3.5 Learning Classifiers

Finally, we train classifiers on top of our reduced generic features. As mentioned above, there are two kinds of learning algorithms, i.e., discriminative and generative. Based on Bayes’ rule, the optimal classifier can be represented as

c* = argmax_c P(c | x) = argmax_c P(x | c) P(c),

where P(c) is the prior information we know about class c.

A generative approach like Naive Bayes tries to estimate both P(c) and P(x | c); at testing time, we apply the Bayes rule above to predict c. Why do we call it naive? We assume that each feature is conditionally independent of the others, so we have

P(x | c) = prod_{k=1}^{m} P(x_k | c),

assuming there are m words being used in our input. If the features are binary, for each word we may simply estimate the probability by

P(x_k = 1 | c) = (N(x_k = 1, c) + λ) / (N(c) + 2λ),

where λ is a smoothing parameter in case there is no training sample for some (x_k, c) pair, and N(·) counts the matching training examples. With all these probabilities computed, we can make decisions by checking whether

P(c = 1 | x) > P(c = 0 | x).
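
A from-scratch Bernoulli Naive Bayes with the smoothed estimate above can be sketched as follows (toy binary features stand in for the selected word indicators):

```python
from collections import Counter

def train_nb(X, y, lam=1.0):
    """Bernoulli naive Bayes with smoothing parameter lam.
    X: list of binary feature vectors, y: list of 0/1 labels."""
    n_feat = len(X[0])
    counts = Counter(y)
    prior = {c: counts[c] / len(y) for c in counts}
    cond = {}
    for c in counts:
        rows = [x for x, yi in zip(X, y) if yi == c]
        # P(x_k = 1 | c) with the +lam / +2*lam smoothing from the text.
        cond[c] = [(sum(r[k] for r in rows) + lam) / (len(rows) + 2 * lam)
                   for k in range(n_feat)]
    return prior, cond

def predict_nb(x, prior, cond):
    scores = {}
    for c in prior:
        p = prior[c]
        for k, xk in enumerate(x):
            p *= cond[c][k] if xk else (1 - cond[c][k])
        scores[c] = p
    return max(scores, key=scores.get)

X = [[1, 0], [1, 1], [0, 1], [0, 0]]
y = [1, 1, 0, 0]
prior, cond = train_nb(X, y)
print(predict_nb([1, 0], prior, cond))  # 1
```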
On the other hand, discriminative learning algorithms estimate P(c | x) directly, or learn some “discriminant” function f(x); comparing f(x) with a threshold then yields the final decision. Here we applied two common classifiers, logistic regression and the support vector machine, to classify movie reviews. Logistic regression squeezes the input feature into the interval (0, 1) by the sigmoid function, which can be treated as the probability

P(y = 1 | x; w) = 1 / (1 + exp(-wᵀx)).

The maximum a posteriori objective of logistic regression with a Gaussian prior on the parameter w is

l(w) = sum_i ln P(y_i | x_i; w) - λ ||w||²,

which is a concave function with respect to w, so we can use the gradient ascent update below to optimize the objective and obtain the optimal w:

w <- w + η ∇_w l(w),

where η is a positive hyperparameter called the learning rate. We can then use the sigmoid probability above to distinguish between classes.
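
Gradient ascent on this MAP objective can be sketched in a few lines; the toy 1-D data, and the eta/lam values, are illustrative, with the bias folded in as a constant feature:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, eta=0.1, lam=0.01, iters=2000):
    """MAP logistic regression via gradient ascent; the Gaussian prior
    appears as the -2*lam*w ridge term in the gradient."""
    w = [0.0] * len(X[0])
    for _ in range(iters):
        grad = [-2 * lam * wk for wk in w]
        for xi, yi in zip(X, y):
            err = yi - sigmoid(sum(wk * xk for wk, xk in zip(w, xi)))
            for k, xk in enumerate(xi):
                grad[k] += err * xk
        w = [wk + eta * gk for wk, gk in zip(w, grad)]
    return w

# Bias term folded in as a constant 1.0 feature.
X = [[1.0, 0.0], [1.0, 0.2], [1.0, 0.8], [1.0, 1.0]]
y = [0, 0, 1, 1]
w = train_logreg(X, y)
pred = [1 if sigmoid(sum(wk * xk for wk, xk in zip(w, xi))) > 0.5 else 0
        for xi in X]
print(pred)  # [0, 0, 1, 1]
```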

The Support Vector Machine (SVM), in turn, aims to learn a hyperplane that maximizes the margin between the two classes’ boundary hyperplanes. Suppose the hyperplane we want to learn is

wᵀx + b = 0.

Then the soft-margin version of SVM is

min_{w, b, ξ} (1/2)||w||² + C sum_i ξ_i, subject to y_i (wᵀx_i + b) ≥ 1 - ξ_i and ξ_i ≥ 0,

where ξ_i is the slack variable representing the error w.r.t. datapoint i. If we represent the inequality constraints by the hinge loss function

L(x_i, y_i) = max(0, 1 - y_i (wᵀx_i + b)),

what we want to minimize becomes

(1/2)||w||² + C sum_i L(x_i, y_i),

which can be solved easily with a quadratic programming solver. With the learned w and b, the decision is made by checking the sign of wᵀx + b.
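
The hinge-loss form also lends itself to a simple subgradient descent sketch; the paper's setting would use a QP solver instead, and the toy 1-D data and step sizes here are illustrative:

```python
def train_svm(X, y, C=1.0, eta=0.01, iters=5000):
    """Linear soft-margin SVM by subgradient descent on
    0.5*||w||^2 + C * sum(hinge); labels must be +1/-1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(iters):
        gw = list(w)              # gradient of 0.5*||w||^2 is w
        gb = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wk * xk for wk, xk in zip(w, xi)) + b)
            if margin < 1:        # subgradient of the active hinge term
                gw = [gwk - C * yi * xk for gwk, xk in zip(gw, xi)]
                gb -= C * yi
        w = [wk - eta * gwk for wk, gwk in zip(w, gw)]
        b -= eta * gb
    return w, b

X = [[0.0], [0.2], [0.8], [1.0]]
y = [-1, -1, 1, 1]
w, b = train_svm(X, y)
pred = [1 if sum(wk * xk for wk, xk in zip(w, xi)) + b > 0 else -1 for xi in X]
print(pred)  # [-1, -1, 1, 1]
```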
Based on these classifiers, we may also apply a kernel function to the input features to make originally linearly non-separable data separable in the mapped space, which can further improve classifier performance. In our experiments we tried the polynomial and RBF kernels.

4 Experimental Results and Discussion

As our final goal is to learn a generic classifier that is agnostic to the specific TV series yet predicts a review’s category reasonably well, we ran experiments following our procedure for building the classifier, as discussed in Section 3.

4.1 Category Determining by LDA

Before defining the categories of the movie reviews, we first run a topic modeling method; here we define categories with the help of LDA. With the number of topics set to eight, we applied LDA to “The Journey of Flower”, the hottest TV series of summer 2015. As we rely on LDA only to guide our category definition, we did not run it on the other TV series. The results are shown in Figure 2. Note that the input data here have not had names replaced with generic tags like role_i or actor_j, as we want to know the specifics discussed by reviewers. We present the results as heat maps: the brighter a line’s color, the more the corresponding topic is discussed compared with the others at the same height for each review. As the original texts are in Chinese, the outputs of LDA are in Chinese as well.

Figure 2: LDA results with 8 topics

We can see that most of the reviews focus on discussing the roles and analyzing the plots of the show (the 6th and 7th topics in Figure 2), while quite a few simply follow the posts (the 4th and 5th topics in the figure). Based on these findings, we generated the category definitions shown in Table 3. Then, from each TV series’ reviews, those with no label bias between readers are selected to make up our final dataset.

4.2 Feature Size Comparison

Based on the χ² and DRC scores discussed in Section 3.4, we can sort the word terms by importance. With different feature sizes, we train the eight generic classifiers and measure their performance on both the training and testing sets. Here we use SVM as the classifier when comparing the influence of feature size, since our results suggest it performs best among the three. The results are shown in Figure 3, where the red squares represent training accuracies and the blue triangles testing accuracies.

(a) Plot
(b) Actor/actress
(c) Role
(d) Dialogue
(e) Analysis
(f) Platform
(g) Thumb up or down
(h) Noise
Figure 3: Accuracy vs Feature size on 8 classifiers

As shown in Figure 3, it is easy to determine the feature size for each classifier. It is also evident that the test accuracies of the classifiers for plot, actor/actress, analysis, and thumb up or down did not increase much as more words were added; therefore, the top words with respect to these classes are fixed as the final feature words. The remaining classifiers achieved their top testing performance at a specific feature size, visible in Figure 3. Based on these findings, we use different feature sizes in our final classifiers.

4.3 Generalization of Classifiers

To demonstrate the generalization of our classifiers, we use two of the TV series as training data and the remaining one as the testing set, and compare against classifiers trained without the replacement of generic tags like role_i or actor_j. Three sets of experiments are thus performed, each trained with Naive Bayes, Logistic Regression and SVM; for the sake of space, average accuracies across the three classifiers are reported as the performance measure. The results are shown in Table 5, where “1”, “2” and “3” represent the TV series “The Journey of Flower”, “Nirvana in Fire” and “Good Time” respectively, and “1&2 - 3” means training on series 1 and 2 and testing on series 3. In each cell, the left value is the accuracy of the classifier without replacement of generic tags; winners are bolded.

Classifiers 1&2 - 3 1&3 - 2 2 & 3 - 1
Plot of the TV series 72.33%, 74.17% 70.23%, 72.78% 74.22%, 75.12%
Actor/actress 87.23%, 89.03% 88.38%, 89.44% 90.23%, 91.33%
Role 85.33%, 86.57% 84.29%, 86.37% 85.22%, 87.55%
Dialogue 93.27%, 94.11% 92.56%, 93.01% 94.23%, 94.85%
Analysis 80.22%, 81.38% 79.29%, 80.45% 81.38%, 82.11%
Platform 90.38%, 90.22% 89.47%, 89.45% 91.22%, 90.48%
Thumb up or down 83.77%, 83.77% 84.25%, 84.37% 82.11%, 83.21%
Noise or others 86.11%, 85.28% 85.34%, 84.16% 84.58%, 83.79%
Table 5: Performance of 8 Classifiers

From the table above, we can see that with the substitution of generic tags in movie reviews, the top five classifiers show a performance increase, which indicates the effectiveness of our method. For the remaining three classifiers, however, we see no improvement, and in some cases performance decreases. This is likely because roles’ and actors’ names are mentioned frequently in the first five categories, while the remaining classes depend little on them; indeed, some specific names may even be helpful for classifying these categories, so replacing them degrades performance slightly.

5 Conclusion

In this paper, a surrogate-based approach is proposed to make TV series review classification generalize across reviews from different TV series. Based on topic modeling results, we define eight generic categories and manually label the collected TV series reviews. Then, with the help of Baidu Encyclopedia, TV-series-specific information such as roles’ and actors’ names is substituted with common tags for the TV series domain. Our experimental results show that this strategy, combined with feature selection, improves classification performance. In this way, one may build classifiers on already collected TV series reviews and then successfully classify reviews from new TV series. Our approach also has broad implications for processing movie reviews: since movie reviews and TV series reviews share many common characteristics, it can easily be applied to help movie producers process and classify consumers’ movie reviews with higher accuracy.


  • [1] K. Munson, H. H. Thompson, J. Cabaniss, H. Nance, P. Erlandsen, M. McGrath, and M. McGrath, “The world is your library, or the state of international interlibrary loan in 2015,” Interlending & Document Supply, vol. 44, no. 2, 2016.
  • [2] M. Goldner and K. Birch, “Resource sharing in a cloud computing age,” Interlending & Document Supply, vol. 40, no. 1, pp. 4–11, 2012.
  • [3] C. Film, “Chinese movie big data analysis report.” http://sanwen8.cn/p/197jW1H.html, 2016 (accessed November 5, 2016).
  • [4] W. Fan, M. D. Gordon, and P. Pathak, “Effective profiling of consumer information retrieval needs: a unified framework and empirical comparison,” Decision Support Systems, vol. 40, no. 2, pp. 213–233, 2005.
  • [5] S. E. Robertson, “On relevance weight estimation and query expansion,” Journal of Documentation, vol. 42, no. 3, pp. 182–188, 1986.
  • [6] A. Kraskov, H. Stögbauer, R. G. Andrzejak, and P. Grassberger, “Hierarchical clustering based on mutual information,” arXiv preprint q-bio/0311039, 2003.
  • [7] F. Yates, “Contingency tables involving small numbers and the χ² test,” Supplement to the Journal of the Royal Statistical Society, vol. 1, no. 2, pp. 217–235, 1934.
  • [8] J. D. Rennie, L. Shih, J. Teevan, D. R. Karger, et al., “Tackling the poor assumptions of naive bayes text classifiers,” in ICML, vol. 3, pp. 616–623, Washington DC, 2003.
  • [9] S. H. Walker and D. B. Duncan, “Estimation of the probability of an event as a function of several independent variables,” Biometrika, vol. 54, no. 1-2, pp. 167–179, 1967.
  • [10] C. Cortes and V. Vapnik, “Support-vector networks,” Machine learning, vol. 20, no. 3, pp. 273–297, 1995.
  • [11] B. Pang and L. Lee, “A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts,” in Proceedings of the 42nd annual meeting on Association for Computational Linguistics, p. 271, Association for Computational Linguistics, 2004.
  • [12] J. J. McAuley and J. Leskovec, “From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews,” in Proceedings of the 22nd international conference on World Wide Web, pp. 897–908, ACM, 2013.
  • [13] L. Zhou and D. Zhang, “Nlpir: A theoretical framework for applying natural language processing to information retrieval,” Journal of the American Society for Information Science and Technology, vol. 54, no. 2, pp. 115–123, 2003.
  • [14] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” Journal of machine Learning research, vol. 3, no. Jan, pp. 993–1022, 2003.
  • [15] S. Moghaddam and M. Ester, “On the design of lda models for aspect-based opinion mining,” in Proceedings of the 21st ACM international conference on Information and knowledge management, pp. 803–812, ACM, 2012.