CrowdTSC: Crowd-based Neural Networks for Text Sentiment Classification

04/26/2020 ∙ by Keyu Yang, et al. ∙ Singapore Management University, Zhejiang University, Aalborg University

Sentiment classification is a fundamental task in content analysis. Although deep learning has demonstrated promising performance in text classification compared with shallow models, it is still not able to train a satisfying classifier for text sentiment. Human beings are more sophisticated than machine learning models in terms of understanding and capturing the emotional polarities of texts. In this paper, we leverage the power of human intelligence into text sentiment classification. We propose Crowd-based neural networks for Text Sentiment Classification (CrowdTSC for short). We design and post the questions on a crowdsourcing platform to collect the keywords in texts. Sampling and clustering are utilized to reduce the cost of crowdsourcing. Also, we present an attention-based neural network and a hybrid neural network, which incorporate the collected keywords as human being's guidance into deep neural networks. Extensive experiments on public datasets confirm that CrowdTSC outperforms state-of-the-art models, justifying the effectiveness of crowd-based keyword guidance.




I Introduction

Sentiment classification [16, 39] is a key content analysis task that has received much attention from both academia and industry. The goal is to assign emotional polarities (labels) to a specified text. Sentiment classification is essential for almost every human activity because it has a great impact on our decision making. With the explosive growth of social networks (e.g., Facebook, Twitter, Tumblr, and Sina Weibo), more and more individuals and organizations use social media content to facilitate decision making. For example, merchants detect and analyze consumer insights by listening to consumers on social media; most consumers look at online reviews before they purchase; and governments employ social networks to understand public opinion on policy.

Deep learning models [40, 38, 23, 21] have become the state-of-the-art solution for text classification. They learn to represent a text with an implicit vector and feed the vector into a softmax function to calculate the probability of each class label. Recurrent Neural Network (RNN) and Convolutional Neural Network (CNN) are the two main kinds of neural networks used to represent the text. Deep neural networks learn the representation from all the words in the text without any guidance in advance. However, it is not a secret that certain words in a text are more important than the others in terms of text classification, and signals from some words provide an explicit indication about the class label. For instance, the keyword happy signals a positive emotional polarity.

This suggests that if we can consider keywords when training deep neural networks for text sentiment classification, the accuracy might be improved. Here, we treat keywords as the carriers of human intelligence, and it is well known that human beings are more capable than machine learning (ML) algorithms in terms of capturing the emotional polarity of text. Recently, the development of crowdsourcing platforms, such as Amazon Mechanical Turk (AMT), Figure Eight, and Upwork, offers a paradigm to collect intelligence from the crowd (i.e., thousands of ordinary workers). Nevertheless, how to efficiently collect and incorporate this intelligence into deep neural networks remains a big challenge.

In this paper, we investigate Crowd-based neural networks for Text Sentiment Classification (CrowdTSC for short), which leverage crowd wisdom to improve the performance of deep neural networks. There are two main challenges to be addressed. The first challenge is how to design cost-efficient crowd-based questions to capture human guidance. For sentiment classification, we could exhaustively consult the crowd for every single text's sentiment classification. Nonetheless, this brute-force approach is expensive, and lacks scalability as data are expected to arrive continuously. To this end, we introduce the concept of keyword, which refers to a word that has a greater impact on the sentiment orientation than the other words in the text, and only ask the crowd to identify keywords in sampled texts. Then, we expand the keyword set based on clustering in the word embedding space, fully utilizing the fact that words belonging to similar semantic categories are proximal to each other in the embedding space [22].

The second challenge is how to incorporate the collected keywords, as human guidance, into deep neural networks. Deep neural networks are notorious for their lack of interpretability, and it is intractable to feed external intelligence guidance into them directly. To this end, we design two types of neural networks, namely, KA-RNN and HDNN, to embrace the collected keywords. KA-RNN is an attention-based RNN model whose loss function has been redesigned to emphasize the keyword signals. HDNN is a hybrid deep neural network that combines a standard CNN (or RNN) with a Fully Connected Network (FCN) to integrate the information of the original text with the keyword signals. To sum up, this paper makes the following four key contributions.

  • We present crowd-based neural networks for text sentiment classification. To our knowledge, it is the first attempt to utilize keywords collected from the crowd to improve the performance of deep learning for text classification.

  • We design a crowdsourcing framework to capture high-quality human being’s guidance with a low monetary cost. In the framework, we utilize the proximity of similar semantic words in the embedding space, and then employ sampling and clustering techniques to reduce the cost, and meanwhile retain the performance.

  • We propose two models, i.e., KA-RNN and HDNN, to incorporate the collected keywords into deep neural networks. KA-RNN redesigns loss function of attention-based RNN to emphasize the keyword signals, and HDNN builds a hybrid deep neural network that combines the standard CNN (or RNN) with FCN to enable the fusion of original text and keyword information.

  • We conduct extensive experiments to verify the effectiveness of our proposed CrowdTSC and the power of crowd-based keyword guidance compared with state-of-the-art models.

The rest of this paper is organized as follows. We first review related work in Section II, and introduce the definition of sentiment classification and deep neural networks in Section III. We then elaborate the framework of CrowdTSC in Section IV, present the cluster-based crowdsourcing in Section V, and detail the customized attention-based RNN model and the hybrid deep neural network in Section VI and Section VII, respectively. Experimental evaluation is reported in Section VIII. Finally, we conclude the paper in Section IX.

II Related Work

II-A Crowdsourcing

Nowadays, many important data management and analytics tasks cannot be completely addressed by automated processes [12]. Crowdsourcing is an effective technique to harness the capabilities of people (i.e., the crowd) to apply human computation to such tasks. The development of crowdsourcing platforms makes it an active research area in the data management community. Those crowdsourcing platforms allow computer scientists to integrate the power of human intelligence into their computational workflows. Take query processing as an example. Many crowd-based query processing systems have been implemented, such as CrowdDB [6], Qurk [14], and Deco [18]. They use optimization techniques to reduce the number of questions asked to crowd workers. In the field of image recognition, Welinder and Perona [33] propose a crowd-based algorithm to determine the ground truth for images from noisy annotations. For entity resolution, Vesdapunt et al. [27] study the problem of completely resolving an entity graph using crowdsourcing; Wang et al. [31] present ACD, a crowd-based algorithm for data deduplication, which achieves high accuracy at moderate costs of crowdsourcing. Besides, many studies [13, 15] apply active learning techniques to reduce the crowdsourcing cost of collecting and annotating data. Nevertheless, the above methods are application-dependent, and thus, they cannot be applied directly to tackle the text sentiment classification task.

II-B Text Sentiment Classification

Traditional text classifiers are feature-based models, relying on hand-crafted features to perform the classification. They represent a text as a sparse vector, and feed it into the classifier. Cavnar et al. [1] propose an N-gram-based approach for text classification. Bag-of-words [28] is another efficient way to extract the features. Post and Bergsma [20] exploit more complex features such as POS tagging and dependency parsing to improve the performance of text classification. Naive Bayes, maximum entropy classification, and support vector machines are popular classifiers [17]. Joulin et al. [10] show that simple linear models with a rank constraint and a fast loss approximation can achieve state-of-the-art performance. Nonetheless, feature-based models neglect the context of texts and hence cannot capture deep semantic information.

Fig. 1: Illustration of RNN
Fig. 2: Illustration of CNN

To overcome such an issue, deep learning models have become popular for this task. They map a text to a dense vector. Most of the deep learning models are based on Convolutional Neural Network (CNN) or Recurrent Neural Network (RNN). Zhang et al. [40] present an empirical exploration of CNN for text classification. Tang et al. [26] leverage Long Short-Term Memory (LSTM) [8] to model the relation of sentences. Yang et al. [36] propose a hierarchical attention network to better capture the important information of a document. Conneau et al. [23] use very deep CNN in text classification, which achieves good performance. Yogatama et al. [38] present a discriminative LSTM model to place documents in the semantic space, such that embeddings of documents are close to the embeddings of their respective labels. Qiao et al. [21] propose a method of learning and utilizing task-specific distributed representations of N-grams for text classification. Wang et al. [32] propose a neural network model based on the context-aware attention mechanism for intent identification in e-mail conversations. Islam et al. [9] propose a multi-channel CNN architecture that effectively encodes different types of emotion indicators in social media posts for sentiment identification. Recently, some researchers have attempted to combine CNN and RNN. Wang et al. [29] propose a regional CNN-LSTM model to compute valence-arousal ratings from texts for dimensional sentiment analysis. Shi et al. [25] replace convolution filters with deep LSTM. Xiao and Cho [35] utilize both convolution and recurrent layers to efficiently encode character inputs. Wang and Wan [30] propose a neural network with an abstract-based attention mechanism to address the sentiment analysis task of reviews for scholarly papers.

Although deep learning models achieve state-of-the-art performance for text sentiment classification, they do not make full use of the important signals carried by individual keywords. In this paper, we collect the keywords in the text via the crowdsourcing platform, and leverage the collective intelligence to guide deep neural networks to classify the text sentiment.

III Preliminaries

III-A Sentiment Classification

Sentiment classification (a.k.a. sentiment analysis or opinion mining) is a special task of text classification in Natural Language Processing (NLP) whose objective is to classify a text according to the sentimental polarities of the opinions it contains. As an example, consider the movie review below:

“It’s easily the best film I’ve seen this year, the story of the film is pretty great.”

A machine learning model (classifier) could take this text as input, analyze the content, and assign the sentimental polarity (class), i.e., positive, to this text.

III-B Deep Neural Networks

III-B1 Recurrent Neural Network (RNN)

RNN is a type of neural network that conditions the model on all previous words in the corpus. Figure 1 illustrates the RNN architecture, where each rectangular box represents a hidden layer at a time step $t$. Each such layer holds a number of neurons, and performs a weighted sum operation on its inputs followed by a non-linear activation operation (such as $\sigma$, $\tanh$, and ReLU). At each time step $t$, the output $h_{t-1}$ of the previous step and the next word embedding vector $x_t$ in the text are input to the hidden layer to compute the hidden representation $h_t$ in step $t$ as follows:

$h_t = f(W h_{t-1} + U x_t)$

where $f$ is a non-linear activation function and both $W$ and $U$ are weight matrices.

We can observe that the hidden representation of step $t$ depends upon all the previous input vectors. The $t$-th step state can be expressed by:

$h_t = f(W f(W h_{t-2} + U x_{t-1}) + U x_t) = g(x_1, x_2, \ldots, x_t)$

The output of the hidden state in the last step could represent the text, and be the indicator of text classification.
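As a concrete illustration of this recurrence, the following minimal pure-Python sketch folds a sequence of word vectors into the last hidden state. The function names and toy weights are illustrative (biases are omitted for brevity); they are not from the paper's implementation.

```python
import math

def rnn_step(h_prev, x, W, U):
    """One recurrent step: h_t = tanh(W h_{t-1} + U x_t), biases omitted."""
    return [math.tanh(sum(W[i][j] * h_prev[j] for j in range(len(h_prev))) +
                      sum(U[i][k] * x[k] for k in range(len(x))))
            for i in range(len(h_prev))]

def rnn_encode(text_vectors, W, U, h0):
    """Fold the whole word sequence into the last hidden state."""
    h = h0
    for x in text_vectors:   # each step conditions on all previous words
        h = rnn_step(h, x, W, U)
    return h                 # the last state serves as the text representation
```

The returned vector plays the role of the text representation fed to the classifier.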

Hidden states at each step depend on all the previous inputs, which sometimes neglects the key information and hurts the overall performance of the classifier [37]. Gating mechanisms have been developed to address this limitation of RNN, resulting in two prevailing RNN variants, i.e., Long Short-Term Memory (LSTM) [8] and Gated Recurrent Unit (GRU) [4]. Both LSTM and GRU can perform text classification, but we use GRU as the default RNN unit (detailed in Section VI), because GRU is faster to train and more suitable for processing large-scale data.

III-B2 Convolutional Neural Network (CNN)

Unlike RNN, which models the whole sequence and captures long-term dependencies, CNN is a class of neural networks that extracts local and position-invariant features. Figure 2 depicts the CNN architecture. CNN takes word vectors, i.e., $d$-dimensional dense vectors, as input. It uses the convolution layer to learn representations from sliding $k$-grams. For an input sequence with $s$ word vectors $x_1, x_2, \ldots, x_s$, let vector $c_i \in \mathbb{R}^{kd}$ be the concatenated embeddings of $k$ entries $x_{i-k+1}, \ldots, x_i$, where $k$ is the filter width and $0 < i < s + k$. The convolution layer generates the representation $p_i$ for the $k$-gram $x_{i-k+1}, \ldots, x_i$ using the convolutional weights $W \in \mathbb{R}^{m \times kd}$:

$p_i = f(W c_i + b)$

where $f$ is a non-linear activation function and $b$ is the bias.

After the convolution layer, it uses a max-pooling layer to extract the main information. For all $k$-gram representations $p_i$, a hidden representation $v$ is generated by max-pooling: $v_j = \max_i p_{i,j}$. The hidden features $v$ could represent the text, and be the indicator of text classification.
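The convolution-plus-max-pooling pipeline can be sketched as follows. This is a pure-Python toy with a single filter and ReLU activation; the function names are illustrative, not from the paper's implementation.

```python
def conv_ngrams(word_vecs, w, b, k):
    """Slide a width-k filter over the sequence: p_i = max(0, w . c_i + b),
    where c_i concatenates k consecutive word embeddings (ReLU activation)."""
    feats = []
    for i in range(len(word_vecs) - k + 1):
        c = [v for x in word_vecs[i:i + k] for v in x]  # concatenated k-gram
        feats.append(max(0.0, sum(wi * ci for wi, ci in zip(w, c)) + b))
    return feats

def max_pool(feats):
    """Max-over-time pooling keeps the strongest k-gram signal."""
    return max(feats)
```

A real model uses $m$ filters, so max-pooling yields an $m$-dimensional feature vector rather than a single scalar.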

Fig. 3: Overview of CrowdTSC

IV Overview of CrowdTSC

In this section, we overview our proposed CrowdTSC. As shown in Figure 3, CrowdTSC consists of three stages, namely, crowd-based keyword selection, keyword clustering, and keyword-based deep neural network classification.

In the first stage, we try to find the keywords in texts through human cognitive ability via the crowdsourcing platform. The keywords are the words in a text that have a greater impact on the sentiment orientation than the other words. For the sake of a low monetary cost, we sample the input texts, and collect keywords only for the sampled texts, instead of consulting the crowd workers for every single text's sentiment classification.

In the second stage, we expand the keyword set by using a clustering method. Sampling has a side-effect: it might reduce the positive impact of keywords on accuracy. To compensate for this negative impact, we adopt a clustering approach to expand the keyword set as a remedial action. The inspiration behind this is that words belonging to similar semantic categories are expected to be located in a neighboring area after being mapped into an embedding space [22]. Thus, we use a classic clustering method to capture word clusters in the embedding space, and expand the keyword set. Note that the first and second stages will be detailed in Section V.

In the last stage, the expanded keywords are utilized as collective intelligence from the crowd to guide deep neural networks. We design two different neural networks, i.e., KA-RNN and HDNN, to embrace the collected keywords. KA-RNN redesigns the loss function for the attention-based RNN model to emphasize the keyword signals. HDNN is a hybrid deep neural network model that combines the standard CNN (or RNN) with FCN to fuse the original text information and keyword signals. We will detail these two models in Section VI and Section VII, respectively.

V Cluster-Based Crowdsourcing

V-A Sampling and Crowdsourcing

As stated in Section I, it is impossible and unaffordable to ask the crowd to help classify each single text in the corpus, not to mention that many applications expect new texts to be generated continuously, while the brute-force approach lacks scalability. To tackle this issue, we propose a novel concept, the keyword, which refers to a word in a text that is informative and has a greater impact on the text's sentiment orientation than the other words. We regard keywords as the carriers of human intelligence.

Accordingly, we consult crowd workers for the keywords in the texts on the crowdsourcing platform. To further reduce the monetary cost, we adopt a sampling approach and sample only a small portion of the corpus. Specifically, we post the crowdsourcing tasks on Amazon Mechanical Turk (AMT), and collect at least three keywords per sampled text from the crowd. As will be presented in Section VIII, our proposed method achieves good performance even if only 0.1% of the corpus is sampled.

V-B Clustering

Fig. 4: A Case for the Word Embedding Space

The main reason that the small sample size does not deteriorate the performance of our model is that we expand the small keyword set contributed by the crowd via clustering. The effectiveness of clustering is guaranteed by the fact that words sharing similar semantic meanings are expected to be close to each other in the embedding space [22].

For illustration purposes, we adopt principal components analysis (PCA) to convert the embedding vectors from a high-dimensional space to a 2-dimensional space, and visualize an embedding space example in Figure 4. It is observed that words with similar semantic meanings are close to each other. Take two words with negative sentiment polarity, i.e., unfortunately and disappointed, as an example. They are close to each other, located in the lower left of Figure 4. Inspired by this observation, we adopt classic clustering methods [7] to capture word clusters in the embedding space and to find the hidden keywords based on the clustering result, with the help of the keywords contributed by the crowd. Next, we show how to identify the hidden keywords based on the clusters.

The main idea is to use the keywords collected from the crowd as seeds to identify other keywords based on the clustering. To be more specific, we call the keywords identified by the crowd seed keywords, and perform clustering based on the seed keywords. We choose the clusters that contain at least one seed keyword as keyword clusters, and expand the keyword set based on those keyword clusters. Consider the embedding space depicted in Figure 4 again. We could use a clustering method to group the words into three clusters, depicted as the circles in the upper part, the squares in the lower left, and the triangles in the lower right, respectively. Assume that the crowd collects excellent, cold, and interesting as seed keywords. All three clusters are then labeled as keyword clusters, and we could include all their words in the expanded keyword set. That is to say, we expand a keyword set that contains 3 seed keywords into an expanded keyword set of 18 keywords.

Input: an embedding vector set V for all the words; an embedding vector set S for the seed keywords
Output: the expanded keyword embedding vector set E
1  E ← ∅
2  C ← Cluster(V)
3  for each cluster c in C do
4      if c ∩ S ≠ ∅ then
5          E ← E ∪ c
6  return the expanded keyword embedding vector set E
Algorithm 1 Keyword Expanding Algorithm (KEA)

Based on these, we present the Keyword Expanding Algorithm (KEA), with its pseudo-code listed in Algorithm 1. It takes as inputs an embedding vector set V for all the words and an embedding vector set S for the seed keywords, and outputs the expanded keyword embedding vector set E. First, it initializes E to an empty set (line 1). Then, it clusters the vector set V, with the resulting clusters preserved in C (line 2). Next, for each cluster c in C, KEA includes c in E if it contains at least one seed keyword (lines 3-5). After evaluating all the clusters, KEA returns the expanded keyword set E to complete the process (line 6).
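The expansion step of KEA can be sketched in a few lines of Python. For readability, this hypothetical version operates on word clusters directly and assumes the clustering itself (e.g., k-means over the embedding vectors) has already been performed upstream.

```python
def expand_keywords(clusters, seed_keywords):
    """KEA sketch: keep every cluster that contains at least one seed keyword
    and merge its members into the expanded keyword set."""
    expanded = set()
    for cluster in clusters:                 # clusters from any classic method
        if any(word in seed_keywords for word in cluster):
            expanded |= set(cluster)         # all cluster members become keywords
    return expanded
```

With the three seed keywords of the Figure 4 example, all three clusters would be kept and every member word would enter the expanded set.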

Upon the completion of the first two stages of CrowdTSC, we generate an expanded keyword set, which is ready to be fed into deep neural networks to guide the sentiment classification. Next, we address the second challenge, which is how to incorporate human intelligence into the deep neural networks for text sentiment classification. In Section VI and Section VII, we propose two deep neural network models.

VI Keyword-Based RNN with Attention Mechanism

The first proposed deep learning model, KA-RNN, is a type of RNN. It utilizes the attention mechanism [2, 36], and takes into account the keywords collected by cluster-based crowdsourcing. KA-RNN emphasizes the keyword signals by enabling keywords to have a greater impact on the attention weights. The structure of this model is shown in Figure 5. It contains three parts, namely, a word embedding input layer, a standard RNN layer, and an attention layer. It uses the keywords that represent human intelligence to guide the weight training of the attention layer, and combines the output in a fully connected layer to aggregate the loss. We describe the details in the following.

VI-A Standard RNN

In the standard RNN layer, we use GRU [4] to construct the RNN. GRU employs a gating mechanism to capture potential long-term dependencies. The gating mechanism can control the flow of information, and mitigate the gradient vanishing problem. There are two types of gates in GRU, i.e., the reset gate $r_t$ and the update gate $z_t$. Together they control how a hidden state is updated. At time step $t$, GRU computes $h_t$ as follows:

$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$

The computation is a linear combination of the previous state $h_{t-1}$ and the current new state $\tilde{h}_t$ that is derived from new input information, where $\odot$ is the element-wise multiplication. The gate $z_t$ decides how much past information shall be forgotten, and how much new information shall be considered. $z_t$ is computed as:

$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$

where $x_t$ is the input vector at time $t$, $W_z$ and $U_z$ are the weight parameters, and $b_z$ refers to the bias. The candidate new state $\tilde{h}_t$ is computed in a way similar to a traditional RNN:

$\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)$

where $r_t$ is the reset gate, which controls how much the previous state contributes to the candidate new state, and $U_h$ is the weight matrix for $r_t \odot h_{t-1}$. Similar to the update gate, $r_t$ is computed as:

$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$

where $W_r$ and $U_r$ are the weight parameters, and $b_r$ is the bias.
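For illustration, a single GRU step following these gate equations can be written as a toy sketch with a 1-dimensional state; `p` is a hypothetical parameter dictionary with scalar weights, and biases are omitted.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(h_prev, x, p):
    """One GRU step with a 1-dimensional state; p maps weight names
    (W_z, U_z, W_r, U_r, W_h, U_h) to scalars, biases omitted."""
    z = sigmoid(p['W_z'] * x + p['U_z'] * h_prev)        # update gate
    r = sigmoid(p['W_r'] * x + p['U_r'] * h_prev)        # reset gate
    h_cand = math.tanh(p['W_h'] * x + p['U_h'] * (r * h_prev))
    return (1.0 - z) * h_prev + z * h_cand               # gated interpolation
```

With all weights at zero, both gates sit at 0.5 and the state simply decays toward the (zero) candidate, which makes the interpolation behavior easy to check.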

Fig. 5: The Architecture of KA-RNN

VI-B Attention Mechanism

Each word in the text contributes differently to the representation of text. Standard RNN cannot differentiate the important words from the rest of the input text for text sentiment classification. As a solution, we introduce the attention mechanism to extract important words, and describe how text sentiment classification can take advantage of keywords collected by cluster-based crowdsourcing.

Let $s$ $u$-dimensional vectors $h_1, h_2, \ldots, h_s$ denote the hidden representations produced by the standard RNN from the original input text, where $u$ is the size of the hidden layers and $s$ is the length of the input text. The attention mechanism produces an attention weight vector $\alpha$ and a weighted hidden representation $v$. Specifically,

$u_i = \tanh(W_w h_i + b_w), \quad \alpha_i = \frac{\exp(u_i^\top u_w)}{\sum_j \exp(u_j^\top u_w)}, \quad v = \sum_i \alpha_i h_i$

where $W_w$ and $b_w$ are the weight and bias parameters, respectively.

We first feed the word hidden representation $h_i$ through a non-linear activation function to get $u_i$; then, we measure the importance of the word as the similarity between $u_i$ and a word-level context vector $u_w$, and get a normalized importance weight $\alpha_i$ through a softmax function. The word context vector $u_w$ is randomly initialized and learned during the training process. After the attention weight vector $\alpha$ is produced, the vector $v$ is computed to summarize all the information of the input text. $v$ is considered as the feature representation of the input text. Then, a softmax function transforms $v$ into a conditional probability distribution $p = \mathrm{softmax}(W_c v + b_c)$, in which $W_c$ and $b_c$ are the parameters of the softmax function.

The attention weight vector $\alpha$ can be seen as a high-level representation of the query “which is the informative word?”. When the word in the $i$-th state has a greater impact on the sentiment orientation than the word in the $j$-th state, $\alpha_i$ would be larger than $\alpha_j$.
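The attention computation can be sketched as follows. This is a pure-Python toy: `scores` stands for the per-word similarities $u_i^\top u_w$, and the function names are illustrative rather than taken from the paper's code.

```python
import math

def attention_weights(scores):
    """Softmax-normalize per-word scores u_i^T u_w into attention weights."""
    m = max(scores)                              # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(hidden_states, weights):
    """Weighted sum of hidden states -> text representation v."""
    dim = len(hidden_states[0])
    return [sum(w * h[j] for w, h in zip(weights, hidden_states))
            for j in range(dim)]
```

Equal scores yield uniform weights, so the representation degenerates to a plain average of the hidden states.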

We could redesign the neural network and emphasize the keyword signals by using the attention weight vector $\alpha$. Here, the important design criterion is that the weight of a collected keyword should be larger than that of the other words. In other words, the keyword signals should be effectively amplified. In view of this, we design a new loss function for KA-RNN as follows:

$\mathcal{L} = -\sum_{x} \log p(y_x \mid x) - \lambda \sum_{x} \sum_{t} m_t \alpha_t$

where $x$ refers to the input text, $y_x$ is the label of text $x$, $\lambda$ is a penalty coefficient, and $m$ is a mask vector indicating whether the current state in text $x$ corresponds to a collected keyword or not, i.e., $m_t = 1$ if the word at state $t$ is a collected keyword, and $m_t = 0$ otherwise.

The loss function contains two parts: (i) the cross-entropy error between the predicted distribution $p$ and the class label, and (ii) the regularization term for the attention weights. Since the goal of training is to minimize the loss function, the attention weight of a collected keyword, i.e., the keyword signal, tends to be amplified during the training process.
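A minimal sketch of such a keyword-aware loss is shown below. The exact form of the regularization term is our assumption: it is chosen so that minimizing the loss increases the attention mass placed on collected keywords, consistent with the description above.

```python
import math

def ka_rnn_loss(p_true, attn, keyword_mask, lam):
    """Sketch of a keyword-aware loss: cross-entropy on the true class minus
    a reward (scaled by the penalty coefficient lam) for attention mass
    placed on collected keywords. The regularizer form is an assumption."""
    cross_entropy = -math.log(p_true)
    keyword_attention = sum(a * m for a, m in zip(attn, keyword_mask))
    return cross_entropy - lam * keyword_attention
```

Shifting attention from non-keyword to keyword positions lowers the loss, which is exactly the amplification effect the training objective is meant to produce.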

VII Hybrid Deep Neural Network

In this section, we propose our second deep neural network structure, which integrates the original text and the keyword information. It is a Hybrid Deep Neural Network (HDNN) that merges a standard CNN (or RNN) with an FCN. For the sake of brevity, we focus on the version of HDNN that is based on CNN and FCN with its architecture illustrated in Figure 6. Note that CNN in HDNN could be replaced by RNN.

HDNN takes as inputs both the original text and the keywords collected by cluster-based crowdsourcing, and outputs the predicted class label. HDNN consists of two main components, a CNN and an FCN. It relies on the CNN to capture the hidden representation vector for the original text, and meanwhile, it invokes the FCN to obtain the hidden representation vector for the collected keywords. Then, it concatenates the two representation vectors to seamlessly fuse the original text information and human intelligence to effectively enhance the accuracy of the classification task. Finally, HDNN uses a softmax output layer to generate the probability for each class label. Next, we give the details of HDNN.

Fig. 6: The Architecture of HDNN

VII-A Standard CNN

Standard CNN is a key component of HDNN, as shown in the left of Figure 6. It takes the original text as an input. The text contains $s$ words, each corresponding to a $d$-dimensional word embedding vector. Thus, the input word embedding layer contains a feature map of size $s \times d$. Next is the convolution layer, which is used to extract the hidden representation from sliding $k$-grams. For the input word embedding with $s$ vectors $x_1, x_2, \ldots, x_s$, let vector $c_i \in \mathbb{R}^{kd}$ be the concatenated embeddings of $k$ entries $x_{i-k+1}, \ldots, x_i$, where $k$ is the filter width and $0 < i < s + k$. The convolution layer generates the representation $p_i$ for the $k$-gram $x_{i-k+1}, \ldots, x_i$ using the convolutional weight $W \in \mathbb{R}^{m \times kd}$:

$p_i = \mathrm{ReLU}(W c_i + b)$

where $b$ represents the bias, and ReLU is a type of activation function: $\mathrm{ReLU}(x) = \max(0, x)$.

After the convolution layer, a max-pooling layer is used to extract the main information. For all $k$-gram representations $p_i$, a hidden feature $v$ is generated by max-pooling: $v_j = \max_i p_{i,j}$.

The hidden feature $v$ could be seen as a high-level representation of the original text.

VII-B Fully Connected Network

The other main component of HDNN is an FCN, as depicted in the right of Figure 6. After cluster-based crowdsourcing, we feed the collected keywords into the FCN to capture the representation of the keywords and guide the sentiment classification. First, a fully connected layer transforms the $d$-dimensional keyword embedding vectors into hidden representations. For each keyword embedding $e_i$ ($1 \le i \le l$, where $l$ is the number of collected keywords), the fully connected layer computes the hidden representation $g_i$ as:

$g_i = f(W_f e_i + b_f)$

where $W_f$ is the sharing weight matrix in the fully connected layer, and $b_f$ is the bias. After the fully connected layer, a max-pooling layer is again used to extract the main information. For each hidden representation $g_i$ ($1 \le i \le l$), we capture the maximum value in $g_i$, and construct an $l$-dimensional vector $q$ to represent the keyword information. The vector $q$ is generated by:

$q_i = \max_j g_{i,j}$
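The per-keyword max-pooling step can be sketched as follows (an illustrative helper, not from the paper's code): each keyword's hidden representation is collapsed to its maximum value, giving one scalar per keyword.

```python
def pool_keyword_reps(hidden_reps):
    """Take the maximum value inside each keyword's hidden representation,
    yielding one scalar per keyword (an l-dimensional summary vector)."""
    return [max(g) for g in hidden_reps]
```

The resulting vector has one entry per collected keyword, matching the $l$-dimensional keyword representation described above.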

Dataset Class Number Train Samples Test Samples Average Length Maximum Length Vocabulary Size
AG’s news 4 120,000 7,600 34 191 62,978
Yelp.F 5 650,000 50,000 148 1,169 268,271
Yelp.P 2 560,000 38,000 146 1,169 246,577
Amazon.F 5 3,000,000 650,000 82 588 1,007,324
Amazon.P 2 3,600,000 400,000 80 652 1,058,969
TABLE I: Statistics of the Datasets Used in Our Experiments

VII-C Concatenation in HDNN

We generate two different representations from the two components of HDNN to perform text sentiment classification. One is the output of the CNN, which captures the information of the original text. The other is the output of the FCN, which extracts the key signals from the collected keywords. Next, we introduce how to fuse these two representations for the final classifier. We denote the output vector from the CNN as $v$, which is an $m$-dimensional vector, and the output vector from the FCN as $q$, which is an $l$-dimensional vector. Then, we use another fully connected layer to transform $v$ and $q$ into two $n$-dimensional vectors $t_v$ and $t_q$, respectively:

$t_v = f(W_v v + b_v), \quad t_q = f(W_q q + b_q)$

where $W_v \in \mathbb{R}^{n \times m}$, $W_q \in \mathbb{R}^{n \times l}$, and biases $b_v, b_q \in \mathbb{R}^n$. The vector $t_v$ can be seen as the hidden representation of the information from the original text, and the vector $t_q$ carries the information extracted from the collected keywords.

We concatenate the two $n$-dimensional vectors into a $2n$-dimensional vector $z = [t_v; t_q]$, which could serve as a high-level representation of the text with the guidance of human beings. Next, a softmax layer is applied to generate the probability distribution of predicted class labels:

$p = \mathrm{softmax}(W_s z + b_s)$

where $W_s$ and $b_s$ are the parameters of the softmax function.

The model can be trained by backpropagation, in which the loss function is the cross-entropy loss:

$\mathcal{L} = -\sum_{i} \log p(y_i \mid x_i)$

where $i$ is the index of the input text, and $y_i$ is the label of text $x_i$.
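The fusion-and-classification step can be sketched as follows. This is a pure-Python toy with hypothetical names and weights: it concatenates the two representations and scores each class with a softmax output layer.

```python
import math

def softmax(logits):
    """Numerically stable softmax over class logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def hdnn_predict(text_rep, keyword_rep, W, b):
    """Concatenate the CNN text representation and the FCN keyword
    representation, then score each class with a softmax output layer."""
    z = list(text_rep) + list(keyword_rep)       # 2n-dimensional fusion
    logits = [sum(wi * zi for wi, zi in zip(row, z)) + bi
              for row, bi in zip(W, b)]
    return softmax(logits)
```

The returned distribution sums to one, and the predicted label is simply the class with the highest probability.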

VIII Experimental Evaluation

VIII-A Experimental Settings

We use five large-scale text datasets from [40], with their statistics listed in Table I. The AG’s news dataset consists of news obtained from AG’s corpus. The Yelp datasets contain the reviews obtained from the 2015 Yelp Dataset Challenge. The Amazon datasets are formed by reviews obtained from the Stanford Network Analysis Project (SNAP). Here, P denotes polarity prediction, while F denotes full-score prediction.

In our experiments, we sample 0.1% of the datasets, and consult the crowd workers on Amazon Mechanical Turk (AMT) for the keywords that are important for sentiment classification. Note that AG’s news contains news articles from 4 different topics, and does not have sentimental polarities. We consult the crowd for the keywords that are highly related to the class labels, and treat AG’s news as a test of the generalization of CrowdTSC to normal text classification. We utilize the 300D GloVe 42B vectors [19] as our pre-trained word embeddings. We implement three CrowdTSC models, viz., KA-RNN, HDNN-CNN, and HDNN-RNN. KA-RNN is the model proposed in Section VI, HDNN-CNN is the version of our proposed HDNN model discussed in Section VII, and HDNN-RNN is the other version of the HDNN model, which replaces the CNN architecture discussed in Section VII with a standard RNN. Our CrowdTSC models are implemented in Python 3.6 on TensorFlow 1.13.

VIII-B Performance Study

VIII-B1 The selection of clustering methods

As discussed in Section V, we employ clustering as an approach to expand the keyword set in order to guide the tuning of the deep neural networks. This effectively reduces the number of crowd workers we have to approach, and thus decreases the monetary cost of crowdsourcing. In our experiments, we sample only 0.1% of the original text dataset to consult the crowd workers for the keywords. After that, we need to select a clustering method to expand the keyword set. We evaluate the performance of four popular and classic clustering methods on three datasets (viz., AG’s news, Yelp.F, and Yelp.P). The four evaluated methods fall into two categories: centroid-based clustering, including k-means and Spectral [24] clustering, and density-based clustering, including DBSCAN [5] and mean-shift [3] clustering.

AG’s news    k-means  Spectral  DBSCAN  Mean-shift
KA-RNN       91.13    91.24     91.27   92.04
HDNN-CNN     91.86    92.04     91.70   91.69
HDNN-RNN     92.33    92.54     93.29   93.29

Yelp.F       k-means  Spectral  DBSCAN  Mean-shift
KA-RNN       65.12    65.97     65.03   65.08
HDNN-CNN     64.32    63.26     63.33   63.33
HDNN-RNN     65.00    64.38     66.97   65.01

Yelp.P       k-means  Spectral  DBSCAN  Mean-shift
KA-RNN       96.47    96.18     96.22   96.12
HDNN-CNN     95.83    95.76     95.82   95.77
HDNN-RNN     96.58    96.33     96.44   95.41

TABLE II: The Best Clustering Method
Fig. 7: The Effect of Clustering-based Crowdsourcing ((a) AG’s news, (b) Yelp.F, (c) Yelp.P)

Fig. 8: The Effect of FCN’s Length in HDNN ((a) AG’s news, (b) Yelp.F, (c) Yelp.P)

We optimize the parameters of the clustering methods to provide competitive results. Table II lists the accuracy rates of our proposed models under different clustering methods on the AG’s news, Yelp.F, and Yelp.P datasets. The results of the best clustering method are given in bold. Our experiments demonstrate that no single clustering method wins out on all kinds of datasets, which once again verifies the “no free lunch” (NFL) theorem [34] in the machine learning area. At the same time, we observe no remarkable difference between the results of different clustering methods. In other words, our cluster-based crowdsourcing achieves satisfactory performance even with a simple clustering method, e.g., k-means.
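To make the clustering-based expansion step concrete, here is a minimal hand-rolled sketch. The toy one-dimensional embeddings, the deterministic k-means initialization, and the single seed keyword are all illustrative assumptions; the actual pipeline clusters 300D GloVe vectors of the full vocabulary:

```python
import numpy as np

# Toy stand-in for pre-trained word embeddings (real GloVe vectors are
# 300-dimensional; 1-D values keep the sketch readable).
vocab = ["great", "awesome", "fantastic", "bad", "awful", "terrible"]
X = np.array([[1.0], [1.1], [0.9], [-1.0], [-1.1], [-0.9]])

# Keywords collected from crowd workers on the small (0.1%) sample.
crowd_keywords = {"great"}

# Minimal k-means (k = 2) over the embeddings, deterministically initialized.
k = 2
centers = X[[0, 3]].copy()
for _ in range(10):
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    assign = dists.argmin(axis=1)
    centers = np.array([X[assign == j].mean(axis=0) for j in range(k)])

# Expansion: every word that shares a cluster with a crowd keyword
# is added to the keyword set.
kw_clusters = {assign[vocab.index(w)] for w in crowd_keywords}
expanded = {w for w, a in zip(vocab, assign) if a in kw_clusters}
```

Here the single crowd-sourced keyword "great" is expanded to the whole positive-sentiment cluster, which is exactly the cost-saving effect the sampling-plus-clustering design aims at.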

VIII-B2 The effect of clustering-based crowdsourcing

In CrowdTSC, we employ sampling to reduce the number of crowd workers we have to approach, and adopt a clustering method to expand the keywords. To verify the utility of keywords and the performance of clustering-based keyword expansion, we implement, in addition to the CrowdTSC models proposed in this paper, three variants of CrowdTSC, denoted as CrowdTSC-nokey, CrowdTSC-tfidf, and CrowdTSC-noclu, respectively:

CrowdTSC-nokey refers to CrowdTSC without keywords, i.e., we do not incorporate human intelligence into the task of text sentiment classification.

CrowdTSC-tfidf refers to CrowdTSC with keywords selected by a machine learning algorithm, namely TF-IDF, rather than by the human crowd. TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic intended to reflect how important a word is to a text [11]. We adopt TF-IDF to pick the most important keywords for every single text in the corpus, without sampling.

CrowdTSC-noclu refers to CrowdTSC without clustering. It still uses the crowd to help identify keywords in the sampled texts, but it does not adopt clustering algorithms to expand the keywords.
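For reference, the TF-IDF keyword selection used by the machine baseline can be sketched in a few lines of plain Python. The toy documents and the top-2 cutoff are illustrative assumptions:

```python
import math
from collections import Counter

docs = [
    "the food was great great service",
    "the food was awful slow service",
    "great atmosphere and great food",
]
tokenized = [d.split() for d in docs]
n = len(docs)

# Document frequency: in how many texts each word appears.
df = Counter(w for toks in tokenized for w in set(toks))

def top_keywords(toks, k=2):
    """Return the k words with the highest tf-idf weight in one text."""
    tf = Counter(toks)
    scores = {w: (tf[w] / len(toks)) * math.log(n / df[w]) for w in tf}
    # Ties broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]
```

For the first toy review, this picks "great" and "service" as keywords; such purely frequency-based picks can miss sentiment-bearing words, which is what the comparison below probes.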

Types                 Models             AG’s news  Yelp.F  Yelp.P  Amazon.F  Amazon.P
Feature-based Models  BoW [40]           88.81      57.99   92.24   54.64     90.40
                      BoW-TFIDF [40]     89.64      59.86   93.66   55.26     91.00
                      ngrams [40]        92.04      56.26   95.64   54.27     92.02
                      ngrams-TFIDF [40]  92.36      54.80   95.44   52.44     91.54
Deep Learning Models  char-CNN [40]      90.49      62.05   95.12   59.57     95.07
                      word-CNN [40]      91.45      60.42   95.40   57.61     94.49
                      char-CRNN [35]     91.36      61.82   94.49   59.23     94.13
                      D-LSTM [38]        92.1       59.6    92.6    -         -
                      FastText [10]      92.5       63.9    95.7    60.2      94.6
                      VDCNN [23]         91.33      64.72   95.72   63.00     95.72
                      Region.emb [21]    92.8       64.9    96.4    60.9      95.3
CrowdTSC              KA-RNN             92.04      65.97   96.47   60.32     94.83
                      HDNN-CNN           92.04      64.32   95.83   56.44     93.21
                      HDNN-RNN           93.29      66.97   96.58   58.94     93.92
TABLE III: Accuracy Rates on Five Datasets

Figure 7 compares the four versions of CrowdTSC on the AG’s news, Yelp.F, and Yelp.P datasets. In general, the models with keywords, whether returned by the TF-IDF algorithm (i.e., CrowdTSC-tfidf) or identified by human beings (i.e., CrowdTSC and CrowdTSC-noclu), outperform the version without keywords (i.e., CrowdTSC-nokey). This demonstrates the positive impact of keywords on the accuracy of text sentiment classification. We also notice that CrowdTSC-tfidf performs even worse than CrowdTSC-nokey in two out of nine cases, while CrowdTSC-noclu always outperforms CrowdTSC-nokey. This indicates that wrongly selected keywords can hurt accuracy, and that human beings are more capable than machines of locating the right keywords. Comparing CrowdTSC-noclu against CrowdTSC-tfidf, we observe that CrowdTSC-noclu achieves a higher accuracy rate in five out of nine cases. Note that CrowdTSC-noclu only utilizes the keywords selected by human beings for 0.1% of the texts in the corpus, while CrowdTSC-tfidf utilizes the keywords returned by TF-IDF for all the texts in the corpus. That is to say, the quality of keywords plays a far more important role in affecting accuracy than the number of keywords.

Consistent with our expectation, the full CrowdTSC performs the best in all nine cases. This signifies that, once the keywords accurately reflect the sentiment of the texts, their number becomes important: the larger the number of properly selected keywords, the higher the accuracy of the classification task. CrowdTSC effectively expands the set of proper keywords by using the clustering method. In addition, we observe that the KA-RNN and HDNN-RNN models, which contain RNN units, perform better than the HDNN-CNN model, which does not. This implies that RNNs are better suited than CNNs to take advantage of the collected keywords.

VIII-B3 The effect of FCN’s length in HDNN

Figure 8 illustrates the accuracy rates of the HDNN models w.r.t. the length of the FCN, varied from 5 to 25, on the AG’s news, Yelp.F, and Yelp.P datasets. As discussed in Section VII, HDNN contains two main parts: a CNN (or RNN) and an FCN. The CNN (or RNN) part captures the information from the original text, while the FCN part extracts the collected keyword signals. The length of the FCN equals the number of collected keywords that are input to HDNN. The first observation is that HDNN-RNN exceeds HDNN-CNN, as expected. Besides, the accuracy rates of both HDNN variants first ascend as the length increases from a small value (e.g., 5), and then drop or stay stable as the length grows further. The optimal length of the FCN is around 10. The reason is that the more keywords are fed into the neural network, the more important information it can learn; on the other hand, once the number becomes large, the collected keywords introduce noise that hurts the performance.
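Since the FCN length fixes the number of keyword slots, a preprocessing step must pad or truncate each text's keyword list to that length. A minimal sketch of this step, where the function name, the word-to-index mapping, and the PAD id 0 are illustrative assumptions:

```python
import numpy as np

def keyword_input(keywords, vocab_index, fcn_len=10, pad_id=0):
    """Map a variable-size keyword list to a fixed-length FCN input.

    fcn_len mirrors the FCN length swept in Fig. 8: lists longer than
    fcn_len are truncated, shorter ones are padded with pad_id.
    """
    ids = [vocab_index[w] for w in keywords if w in vocab_index][:fcn_len]
    return np.array(ids + [pad_id] * (fcn_len - len(ids)))

vocab_index = {"great": 5, "awful": 9}   # hypothetical word-to-index mapping
x = keyword_input(["great", "awful", "unknown"], vocab_index, fcn_len=4)
# x holds the two known keyword ids followed by two PAD slots
```

The sweep in Figure 8 then amounts to varying fcn_len and retraining, trading extra keyword signal against the noise that low-quality extra keywords introduce.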

VIII-C Comparison with State-of-the-art Models

Table III lists the performance of our proposed models against the state-of-the-art models on the five text classification datasets. For ease of discussion, we group all the models into three categories: feature-based models, deep learning models, and CrowdTSC, which refers to the models presented in this paper. They correspond to the three blocks of Table III, respectively. The best results in each category are underlined, and the overall best records are given in bold.

The top block lists the performance of four feature-based models. These traditional models achieve strong baseline accuracy rates on the smaller datasets (AG’s news, Yelp.F, and Yelp.P), but perform poorly on the large datasets (Amazon.F and Amazon.P).

The second block reports the performance of seven deep learning models, which achieve state-of-the-art results using deep neural networks. VDCNN performs best on the large datasets, i.e., Amazon.F and Amazon.P, because it is a very deep CNN that uses up to 29 convolutional layers to extract the hidden representation for text classification. However, this depth is also VDCNN’s disadvantage: it is very sensitive to its parameters and difficult to tune.

The last block presents our proposed CrowdTSC models. We observe that our models beat all the other models except VDCNN on all datasets, including AG’s news. This is because we collect the keywords by clustering-based crowdsourcing and embrace them as human guidance in a well-designed neural network architecture. This method is effective in improving the accuracy of text sentiment classification, and can be extended to general text classification. Compared with VDCNN, we win on three datasets (AG’s news, Yelp.F, and Yelp.P) and lose on two (Amazon.F and Amazon.P). Unlike the very deep and complex VDCNN model, our proposed CrowdTSC models are succinct and thus easy to tune.

IX Conclusions

In this paper, we propose Crowd-based neural networks for Text Sentiment Classification, i.e., CrowdTSC. To our knowledge, this is the first attempt to use keywords collected from a crowdsourcing platform to improve the performance of deep learning. To reduce the monetary cost of hiring crowd workers, we design a cluster-based crowdsourcing method to collect keywords in the given text datasets. Moreover, we develop two types of models to incorporate the collected keywords into deep neural networks, i.e., KA-RNN and HDNN. KA-RNN uses the attention mechanism and constructs the loss function to emphasize keyword signals. HDNN combines a standard CNN (or RNN) with an FCN to capture both the original text and the keyword information. Experimental results demonstrate that our proposed CrowdTSC models outperform the state-of-the-art competitors, justifying the power of crowd-based human intelligence guidance.


  • [1] W. B. Cavnar and J. M. Trenkle (1994) N-gram-based text categorization. In SDAIR, pp. 161–175. Cited by: §II-B.
  • [2] J. Chen, Y. Hu, J. Liu, Y. Xiao, and H. Jiang (2019) Deep short text classification with knowledge powered attention. In AAAI, pp. 6252–6259. Cited by: §VI.
  • [3] Y. Cheng (1995) Mean shift, mode seeking, and clustering. IEEE Trans. Pattern Anal. Mach. Intell. 17 (8), pp. 790–799. Cited by: §VIII-B1.
  • [4] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio (2014) On the properties of neural machine translation: encoder-decoder approaches. In SSST@EMNLP, pp. 103–111. Cited by: §III-B1, §VI-A.
  • [5] M. Ester, H. Kriegel, J. Sander, and X. Xu (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, pp. 226–231. Cited by: §VIII-B1.
  • [6] A. Feng, M. J. Franklin, D. Kossmann, T. Kraska, S. Madden, S. Ramesh, A. Wang, and R. Xin (2011) CrowdDB: query processing with the VLDB crowd. PVLDB 4 (12), pp. 1387–1390. Cited by: §II-A.
  • [7] J. Han, M. Kamber, and J. Pei (2011) Data mining: concepts and techniques, 3rd edition. Morgan Kaufmann. Cited by: §V-B.
  • [8] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780. Cited by: §II-B, §III-B1.
  • [9] J. Islam, R. E. Mercer, and L. Xiao (2019) Multi-channel convolutional neural network for Twitter emotion and sentiment recognition. In NAACL-HLT, pp. 1355–1365. Cited by: §II-B.
  • [10] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov (2017) Bag of tricks for efficient text classification. In EACL, pp. 427–431. Cited by: §II-B, TABLE III.
  • [11] J. Leskovec, A. Rajaraman, and J. D. Ullman (2014) Mining of massive datasets. Cambridge University Press. Cited by: §VIII-B2.
  • [12] G. Li, J. Wang, Y. Zheng, and M. J. Franklin (2016) Crowdsourced data management: A survey. IEEE Trans. Knowl. Data Eng. 28 (9), pp. 2296–2319. Cited by: §II-A.
  • [13] C. H. Lin, Mausam, and D. S. Weld (2016) Re-active learning: active learning with relabeling. In AAAI, pp. 1845–1852. Cited by: §II-A.
  • [14] A. Marcus, E. Wu, S. Madden, and R. C. Miller (2011) Crowdsourced databases: query processing with people. In CIDR, pp. 211–214. Cited by: §II-A.
  • [15] B. Mozafari, P. Sarkar, M. J. Franklin, M. I. Jordan, and S. Madden (2014) Scaling up crowd-sourcing to very large datasets: A case for active learning. PVLDB 8 (2), pp. 125–136. Cited by: §II-A.
  • [16] T. Nasukawa and J. Yi (2003) Sentiment analysis: capturing favorability using natural language processing. In K-CAP, pp. 70–77. Cited by: §I.
  • [17] B. Pang, L. Lee, and S. Vaithyanathan (2002) Thumbs up? sentiment classification using machine learning techniques. In EMNLP, Cited by: §II-B.
  • [18] H. Park, R. Pang, A. G. Parameswaran, H. Garcia-Molina, N. Polyzotis, and J. Widom (2012) Deco: A system for declarative crowdsourcing. PVLDB 5 (12), pp. 1990–1993. Cited by: §II-A.
  • [19] J. Pennington, R. Socher, and C. D. Manning (2014) Glove: global vectors for word representation. In EMNLP, pp. 1532–1543. Cited by: §VIII-A.
  • [20] M. Post and S. Bergsma (2013) Explicit and implicit syntactic features for text classification. In ACL, pp. 866–872. Cited by: §II-B.
  • [21] C. Qiao, B. Huang, G. Niu, D. Li, D. Dong, W. He, D. Yu, and H. Wu (2018) A new method of region embedding for text classification. In ICLR, Cited by: §I, §II-B, TABLE III.
  • [22] D. L.T. Rohde, L. M. Gonnerman, and D. C. Plaut (2006) An improved model of semantic similarity based on lexical co-occurrence. Commun. ACM 8 (627-633), pp. 116. Cited by: §I, §IV, §V-B.
  • [23] H. Schwenk, L. Barrault, A. Conneau, and Y. LeCun (2017) Very deep convolutional networks for text classification. In EACL, pp. 1107–1116. Cited by: §I, §II-B, TABLE III.
  • [24] J. Shi and J. Malik (2000) Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 22 (8), pp. 888–905. Cited by: §VIII-B1.
  • [25] Y. Shi, K. Yao, L. Tian, and D. Jiang (2016) Deep LSTM based feature mapping for query classification. In NAACL-HLT, pp. 1501–1511. Cited by: §II-B.
  • [26] D. Tang, B. Qin, and T. Liu (2015) Document modeling with gated recurrent neural network for sentiment classification. In EMNLP, pp. 1422–1432. Cited by: §II-B.
  • [27] N. Vesdapunt, K. Bellare, and N. N. Dalvi (2014) Crowdsourcing algorithms for entity resolution. PVLDB 7 (12), pp. 1071–1082. Cited by: §II-A.
  • [28] H. M. Wallach (2006) Topic modeling: beyond bag-of-words. In ICML, pp. 977–984. Cited by: §II-B.
  • [29] J. Wang, L. Yu, K. R. Lai, and X. Zhang (2016) Dimensional sentiment analysis using a regional CNN-LSTM model. In ACL, Cited by: §II-B.
  • [30] K. Wang and X. Wan (2018) Sentiment analysis of peer review texts for scholarly papers. In SIGIR, pp. 175–184. External Links: Link, Document Cited by: §II-B.
  • [31] S. Wang, X. Xiao, and C. Lee (2015) Crowd-based deduplication: an adaptive approach. In SIGMOD, pp. 1263–1277. Cited by: §II-A.
  • [32] W. Wang, S. Hosseini, A. H. Awadallah, P. N. Bennett, and C. Quirk (2019) Context-aware intent identification in email conversations. In SIGIR, pp. 585–594. External Links: Link, Document Cited by: §II-B.
  • [33] P. Welinder and P. Perona (2010) Online crowdsourcing: rating annotators and obtaining cost-effective labels. In CVPR, pp. 25–32. Cited by: §II-A.
  • [34] D. H. Wolpert and W. G. Macready (1997) No free lunch theorems for optimization. IEEE Trans. Evolutionary Computation 1 (1), pp. 67–82. Cited by: §VIII-B1.
  • [35] Y. Xiao and K. Cho (2016) Efficient character-level document classification by combining convolution and recurrent layers. CoRR abs/1602.00367. Cited by: §II-B, TABLE III.
  • [36] Z. Yang, D. Yang, C. Dyer, X. He, A. J. Smola, and E. H. Hovy (2016) Hierarchical attention networks for document classification. In NAACL, pp. 1480–1489. Cited by: §II-B, §VI.
  • [37] W. Yin, K. Kann, M. Yu, and H. Schütze (2017) Comparative study of CNN and RNN for natural language processing. CoRR abs/1702.01923. Cited by: §III-B1.
  • [38] D. Yogatama, C. Dyer, W. Ling, and P. Blunsom (2017) Generative and discriminative text classification with recurrent neural networks. CoRR abs/1703.01898. Cited by: §I, §II-B, TABLE III.
  • [39] L. Zhang, S. Wang, and B. Liu (2018) Deep learning for sentiment analysis: A survey. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 8 (4). Cited by: §I.
  • [40] X. Zhang, J. J. Zhao, and Y. LeCun (2015) Character-level convolutional networks for text classification. In NIPS, pp. 649–657. Cited by: §I, §II-B, §VIII-A, TABLE III.