Analyzing news stories has been pivotal for extracting quantitative and qualitative attributes from text documents. News analytics is a broad domain that applies text mining methods drawn from Natural Language Processing, Machine Learning, Information Retrieval, and related fields. In our study, the qualitative attributes can be socio-economic tags related to demonetization in India. The sentiment score, which generally reflects the tone (positive/negative) of the text as well as the emotions expressed, can be one of the quantitative attributes. In this paper we deal with two problems in the domain of news analytics: the first is text categorization without any prior domain knowledge, i.e., topic modeling, and the second is emotion analysis. For example, we investigate how people’s emotions relate to demonetization in India.
For text categorization, we cluster the news stories into k topics using unsupervised learning with automatic topic labeling, i.e., topic modeling. Topic modeling captures the thematic structure of a collection of documents by treating the data as observations arising from a generative probabilistic process that includes hidden variables. Inferring these hidden variables through posterior inference yields the topics that describe the corpus. The emotion analysis (also referred to as sentiment extraction) assigns an emotion association score to each story, based on its expressive tone, across eight basic emotion categories and two sentiments (positive/negative) that determine the overall tone of the story. The roadmap of the paper is as follows. Data preparation and exploratory insights are described in Section 2. Section 3 covers the background. Section 4 presents our proposed system architecture. Section 5 deals with the experimental setup. Section 6 gives the results. Section 7 draws conclusions from the discussion and points to future work.
II-A Data Set
The data were collected during a period spanning two months, from November 13 to December 18, 2016, across four metro cities (Delhi, Kolkata, Mumbai, and Chennai), based on sets of keywords corresponding to demonetization in India (e.g., “demonet”, “black money”, “cashless”) using Twitter’s streaming API, and were stored in MongoDB. We collected approximately 73,970 tweets, ordered by retweet count, during the period. The raw data comprise extraction date and time, user ID, user name, tweet message, and geographical area. Due to the large volume of raw data, we retain only the dates, user IDs, and text, and conduct further processing and analysis on these three variables. Most of the tweets are written in English, but the raw data set also includes tweets in vernacular languages such as Hindi or Bengali; we excluded these in the initial data manipulation step. The data from the NoSQL database were imported into the R console, and the tm package from CRAN was used to construct the document-term matrix for developing the topic model.
II-B Exploratory Insight
We explore the tweets as a time series over the collection period. We visualize the number of retweets by hour and by minute, and the average number of words by hour. We also examine which users contributed the most tweets to our corpus, and gauge a user’s influence over others by retweet count. Of the 73,970 tweets, most originate from the Twitter Web Client, followed by Windows Phone and iPhone sources. More than 10 users posted more than 100 tweets each about the event under consideration.
Fig. 1 shows the hourly retweet count, Fig. 2 displays the hourly average word count per tweet, Fig. 3 shows the top 7 source platforms for tweets, Fig. 4 lists the Twitter handles with the highest tweet counts, and Fig. 5 and Fig. 6 show word clouds of the corpus.
II-C Data Preprocessing
Before applying any of the sentiment/emotion extraction methods, we perform data preprocessing. Preprocessing improves the quality of text classification and reduces computational complexity. A typical preprocessing procedure includes the following steps:
Stemming and lemmatization. Stemming is the procedure of replacing words with their stems, or roots. The dimensionality of the Bag-of-Words representation is reduced when root-related words, such as “read”, “reader” and “reading”, are mapped to the single word “read”. Over-stemming lowers precision and under-stemming lowers recall. The overall impact of stemming depends on the dataset and the stemming algorithm. The most popular stemming algorithm is the Porter stemmer.
Stop-word removal. Stop words are words that serve a connecting function in the sentence, such as prepositions and articles. There is no definitive list of stop words, but some search engines use the most common short function words, such as “the”, “is”, “at”, “which” and “on”. These words can be removed from the text before classification, since they occur very frequently but do not affect the final sentiment of the sentence.
TF-IDF model. Term Frequency Inverse Document Frequency (TF-IDF) divides the term frequencies by the document frequencies (the number of documents in which the j-th word appears). This adjustment lowers the weight of words that are common across all documents, so such words automatically receive less importance. The TF-IDF measure indicates how important a term is for a particular document.
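As a rough illustration of stop-word removal and TF-IDF weighting, the sketch below computes weights for a toy corpus. It assumes the common variant weight = tf × log(N / df), which may differ from the exact scheme used in the study; the stop-word list, the example sentences, and the function names `tokenize` and `tf_idf` are made up for illustration.

```python
import math
from collections import Counter

# Illustrative stop-word list (an assumption, not the list used in the study).
STOP_WORDS = {"the", "is", "at", "which", "on", "a", "of", "in"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Uses weight = tf * log(N / df), where tf is the raw count of the term in
    the document, df is the number of documents containing the term, and N is
    the number of documents. This is one common variant (an assumption here).
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights


# Hypothetical toy corpus.
docs = [
    "the queue at the bank is long",
    "the bank atm is empty",
    "cashless payment is easy",
]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

In this toy corpus “bank” appears in two of the three documents and therefore receives a lower weight than “queue”, which appears in only one.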
Preprocessing of tweets includes the following steps:
Remove all URLs (e.g., www.xyz.com), hashtags (e.g., #topic), and targets (@username).
Correct spellings and handle sequences of repeated characters.
Replace all emoticons with their sentiment.
Remove all punctuation, symbols, and numbers.
Remove stop words.
Remove non-English tweets.
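The tweet-cleaning steps listed above can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: the emoticon mapping, the stop-word list, and the ASCII-ratio heuristic for spotting non-English tweets are all assumptions, and spelling correction is reduced to squeezing runs of repeated characters.

```python
import re

# Illustrative resources (assumptions, not those used in the study).
STOP_WORDS = {"the", "is", "at", "which", "on", "with", "a", "an", "and",
              "of", "to", "in"}
EMOTICONS = {":)": "happy", ":D": "happy", ":(": "sad"}


def looks_english(text, threshold=0.9):
    """Crude heuristic: most alphabetic characters are ASCII letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    ascii_letters = sum(1 for ch in letters if ch.isascii())
    return ascii_letters / len(letters) >= threshold


def clean_tweet(text):
    """Apply the cleaning steps and return the remaining tokens."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"[@#]\w+", " ", text)                # targets, hashtags
    for emoticon, word in EMOTICONS.items():            # emoticon -> sentiment
        text = text.replace(emoticon, f" {word} ")
    text = text.lower()
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)          # "sooooo" -> "soo"
    text = re.sub(r"[^a-z\s]", " ", text)               # punctuation, digits
    return [tok for tok in text.split() if tok not in STOP_WORDS]


# Hypothetical example tweets.
tweets = [
    "Sooooo happy :) with #demonetization @user http://t.co/abc 100%!!!",
    "नोटबंदी",
]
cleaned = [clean_tweet(t) for t in tweets if looks_english(t)]
print(cleaned)
```

URLs are removed before emoticons are replaced so that character pairs inside a link are not mistaken for emoticons.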
III-A Introduction to LDA
Many earlier models for text documents treated a document as “a-bag-of-words”. Topic modeling instead treats a document as “a-bag-of-topics”, and its purpose is to assign each term in each document to a relevant topic. A variety of probabilistic topic models have been proposed, and LDA is considered a well-known method. As with other methods, the input to LDA is a term-document matrix, and the output consists of two distributions, namely the document-topic distribution $\theta$ and the topic-word distribution $\phi$. EM and Gibbs Sampling algorithms have been proposed to derive the distributions $\theta$ and $\phi$. In this paper, we use Gibbs Sampling based LDA. In this approach, one of the most significant steps is updating the topic assignment of each term in every document individually, according to the probabilities calculated using Equation 1.

$$P(z_i = k \mid w_i = v, \mathbf{z}_{-i}, \mathbf{w}_{-i}) \propto \frac{C^{WT}_{vk} + \beta}{\sum_{v'=1}^{V} C^{WT}_{v'k} + V\beta} \cdot \frac{C^{DT}_{dk} + \alpha}{\sum_{k'=1}^{K} C^{DT}_{dk'} + K\alpha} \tag{1}$$

where $z_i = k$ represents that the i-th term in a document is assigned to topic k, $w_i = v$ is the mapping of the observed term $w_i$ to the v-th term in the corpus vocabulary, $\mathbf{z}_{-i}$ signifies all topic assignments except that of the i-th term, and $\mathbf{w}_{-i}$ denotes all terms other than the i-th. $C^{WT}_{vk}$ is the number of times term v is assigned to topic k, and $C^{DT}_{dk}$ is the number of times document d contains topic k; both counts exclude the current assignment of the i-th term. Moreover, K is the user input denoting the number of topics, V represents the size of the vocabulary, and the hyper-parameters for the document-topic distribution and topic-word distribution are denoted by $\alpha$ and $\beta$ respectively. By default, $\alpha$ and $\beta$ are set to 50/K and 0.01.
We perform N iterations of Gibbs sampling over every term in the corpus, after which we estimate the document-topic and topic-word distributions using Equations 2 and 3 respectively.

$$\theta_{dk} = \frac{C^{DT}_{dk} + \alpha}{\sum_{k'=1}^{K} C^{DT}_{dk'} + K\alpha} \tag{2}$$

$$\phi_{kv} = \frac{C^{WT}_{vk} + \beta}{\sum_{v'=1}^{V} C^{WT}_{v'k} + V\beta} \tag{3}$$
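The sampling and estimation steps described above can be sketched in plain Python. This is a minimal, unoptimized illustration that follows the standard collapsed Gibbs form for LDA (smoothed term-topic and document-topic counts); it is not the implementation used in the study, and the function name `gibbs_lda`, the toy vocabulary, and the toy documents are hypothetical.

```python
import random


def gibbs_lda(docs, n_topics, vocab_size, alpha, beta, n_iters, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of term ids.

    Returns (theta, phi): document-topic and topic-word distributions.
    """
    rng = random.Random(seed)
    c_vk = [[0] * n_topics for _ in range(vocab_size)]  # term-topic counts
    c_dk = [[0] * n_topics for _ in docs]               # document-topic counts
    c_k = [0] * n_topics                                # terms per topic
    z = []                                              # topic of each term

    # Random initial topic assignments.
    for d, doc in enumerate(docs):
        z_d = []
        for v in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            c_vk[v][k] += 1
            c_dk[d][k] += 1
            c_k[k] += 1
        z.append(z_d)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, v in enumerate(doc):
                # Remove the current assignment so counts exclude term i.
                k = z[d][i]
                c_vk[v][k] -= 1
                c_dk[d][k] -= 1
                c_k[k] -= 1
                # Unnormalized conditional probability of each topic. The
                # document-side denominator is the same for every topic,
                # so it is dropped.
                weights = [
                    (c_vk[v][t] + beta) / (c_k[t] + vocab_size * beta)
                    * (c_dk[d][t] + alpha)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                c_vk[v][k] += 1
                c_dk[d][k] += 1
                c_k[k] += 1

    # Smoothed estimates of the two output distributions.
    theta = [
        [(c_dk[d][k] + alpha) / (len(doc) + n_topics * alpha)
         for k in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(c_vk[v][k] + beta) / (c_k[k] + vocab_size * beta)
         for v in range(vocab_size)]
        for k in range(n_topics)
    ]
    return theta, phi


# Hypothetical toy corpus: term ids index into this vocabulary.
vocab = ["bank", "atm", "queue", "app", "paytm", "online"]
toy_docs = [[0, 1, 2, 0], [3, 4, 5, 3], [0, 2, 1], [4, 5, 3]]
theta, phi = gibbs_lda(toy_docs, n_topics=2, vocab_size=len(vocab),
                       alpha=0.1, beta=0.01, n_iters=200)
for k, row in enumerate(phi):
    top = sorted(range(len(vocab)), key=lambda v: -row[v])[:3]
    print("topic", k, [vocab[v] for v in top])
```

Each row of `theta` and `phi` sums to one by construction. Topic indices are arbitrary, so the same topics may appear in a different order under a different seed.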
III-B Emotion Analysis
Emotion classification is fundamentally a text classification problem. Traditional sentiment classification mainly classifies documents as positive, negative, or neutral. Here, emotion is refined into basic emotions such as anger, fear, anticipation, trust, surprise, sadness, joy, and disgust. In this paper, the NRC Word-Emotion Association Lexicon is selected as the labeled corpus. It comprises a list of English words and their associations with Plutchik’s eight basic emotions and two sentiments (negative and positive). It involves three variables: ‘TargetWord’, ‘AffectCategory’, and ‘AssociationFlag’. TargetWord is a word for which emotion associations are provided. AffectCategory is one of the eight emotions or one of the two polarities (negative or positive). AssociationFlag takes one of two values: 0 indicates that the target word has no association with the affect category, whereas 1 indicates an association. Fig. 7 shows the process to identify a crowd type from social media.
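To illustrate how a lexicon in the (TargetWord, AffectCategory, AssociationFlag) layout can be used to score a tweet, the sketch below counts the affect categories associated with each token. The rows are invented for illustration and are not taken from the NRC lexicon; `build_lexicon` and `score_tokens` are hypothetical helper names.

```python
from collections import Counter

# Toy rows in the (TargetWord, AffectCategory, AssociationFlag) layout.
# These entries are made up and do not come from the NRC lexicon.
TOY_LEXICON_ROWS = [
    ("queue", "anger", 1),
    ("queue", "joy", 0),
    ("corrupt", "disgust", 1),
    ("corrupt", "negative", 1),
    ("support", "trust", 1),
    ("support", "positive", 1),
]


def build_lexicon(rows):
    """Map each word to the set of categories flagged with 1."""
    lexicon = {}
    for word, category, flag in rows:
        if flag == 1:
            lexicon.setdefault(word, set()).add(category)
    return lexicon


def score_tokens(tokens, lexicon):
    """Count, per affect category, the tokens associated with it."""
    counts = Counter()
    for tok in tokens:
        for category in lexicon.get(tok, ()):
            counts[category] += 1
    return counts


lexicon = build_lexicon(TOY_LEXICON_ROWS)
print(score_tokens(["people", "support", "queue", "support"], lexicon))
```

Rows with an AssociationFlag of 0 are skipped when the lexicon is built, so they contribute nothing to the counts.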
In the experiment, we used NMI (Normalized Mutual Information) to evaluate the overall cluster quality of the documents (tweets). The following formula is used to calculate NMI:
where I(X;Y) is the mutual information between X and Y, with X = {X1, X2, …, Xn} and Y = {Y1, Y2, …, Yn}. Xi is the set of text reviews in LDA’s topic i, while Yj is the set of text reviews with the label j. In our experiments, a text review with the label j is one that has the highest probability of belonging to topic j; n is the number of topics. I(X;Y) is

$$I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$$

In the formula, p(x) is the probability of being classified to topic i, p(y) is the probability of being labeled with topic j, and p(x,y) is the probability of being classified to cluster i but actually labeled with cluster j. H(X) is the entropy of X, calculated by the following formula:

$$H(X) = -\sum_{x \in X} p(x) \log p(x)$$
The clustering result shares no information with the labels if the NMI value is 0, and is identical to the labels if the NMI value is 1.
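A small sketch of the NMI computation follows. The normalization 2·I(X;Y) / (H(X) + H(Y)) is assumed here as one common choice, since several normalizations are in use and the exact variant may differ from the one used in the study; the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """H(X) = -sum p(x) log p(x), estimated from label frequencies."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(xs, ys):
    """I(X;Y) = sum p(x,y) log(p(x,y) / (p(x) p(y)))."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def nmi(xs, ys):
    """NMI under the assumed normalization 2 I(X;Y) / (H(X) + H(Y))."""
    denom = entropy(xs) + entropy(ys)
    if denom == 0:
        return 1.0  # both partitions consist of one cluster: identical
    return 2 * mutual_information(xs, ys) / denom


# Cluster ids are arbitrary, so a relabeled but identical partition scores 1.
print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))
# A partition that is independent of the labels scores 0.
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))
```

Here `xs` holds the cluster assigned to each tweet and `ys` holds its label, in the same order.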
IV Proposed System Architecture
We propose a system that consists of three main components: data collection, data analysis, and data visualization. The data collection module crawls tweets from Twitter using data crawlers and stores them in MongoDB, a NoSQL database chosen for scalability and schemaless data storage. After data preprocessing steps such as tokenization, stemming, and stop-word removal, the system performs two types of analysis to answer the following questions:
What topics do people discuss online, to help us understand people’s interests?
What are people’s opinions on the specific topics, to help us understand their satisfaction with those topics?
A term-document matrix is created and fed to the LDA-based model to discover latent topics, and the documents are analyzed by the emotion analyzer. The emotion analyzer then tags each tweet as happy, sad, angry, fear, surprise, or neutral. Fig. 8 presents the architecture of our proposed system.
V Experimental Setup
For the demonetization data, we started with the default parameters $\alpha$ = 0.1 and $\beta$ = 0.01 and topic numbers N = 5, 10, 15, 20, i.e., 5, 10, 15, or 20 desired topics. Comparing the LDA results given in Table II, we chose N = 15 as the base setting for further comparison, since at N = 15 most topics have enough words to reveal information about the topic without so many words that the topics become messy. In the next step of our experiment, we fixed N = 15 and tuned $\alpha$ and $\beta$, setting $\alpha$ = 0.1, 0.05, 0.2 and $\beta$ = 0.01, 0.015, 0.007, to see whether the results show any difference.
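The two-stage search described above can be written down as a list of configurations. The sketch assumes that the first hyper-parameter in each pair is the document-topic prior (alpha) and the second is the topic-word prior (beta), and that every alpha value is crossed with every beta value in the second stage; pairing the values one-to-one is another possible reading.

```python
from itertools import product

# Stage 1: vary the number of topics with the default hyper-parameters.
DEFAULT_ALPHA, DEFAULT_BETA = 0.1, 0.01
stage_one = [
    {"n_topics": n, "alpha": DEFAULT_ALPHA, "beta": DEFAULT_BETA}
    for n in (5, 10, 15, 20)
]

# Stage 2: fix the number of topics and vary alpha and beta. Crossing every
# alpha with every beta is an assumption about how the values were combined.
stage_two = [
    {"n_topics": 15, "alpha": a, "beta": b}
    for a, b in product((0.1, 0.05, 0.2), (0.01, 0.015, 0.007))
]

for config in stage_one + stage_two:
    print(config)
```

Each configuration would then be passed to an LDA fit and scored, for example with NMI.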
We performed the emotion analysis on the dataset using the syuzhet CRAN package, which is based on the NRC Emotion Lexicon. As a result, the 73,970 tweets were labeled with one of eight emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, trust), along with two sentiments (positive and negative), to determine the overall tone of the event.
VI-A Discovering Topics
Topic 1 lists “bank”, “queue”, “atm”, “stand”, reflecting the hectic issues related to bank/ATM transactions. Another topic concerns the impact of the currency ban on the lives of citizens, which has led to deaths. Topic 4 reveals the parliamentary debate on demonetization. Topic 5 reflects protests by farmers and opposition parties. Topic 6 indicates people’s support for demonetization. Topic 7 lists the words “don”, “modi”, “rbi”, “impact” and looks like a mixed topic. Topic 8 lists “modi”, “fights”, “corrupt”, “leader”, “blackmoney”, indicating people’s support for and acknowledgment of PM Modi’s decision. Topic 9 lists “Kashmir”, “protest”. Topic 10 discusses the impact of the note ban on terror funding. Topic 11 portrays the currency ban as vote-bank politics supported by the govt, as it lists the words “bypol”, “farmer”, “congress”, “affect”, “move”, “bjp”. Topic 12 indicates huge economic and job losses. Topic 13 is about harassment of people due to this event, as aggressive words such as “disgust”, “harass” dominate. Topic 14 is about the cash crunch in banks, as it lists “cashless”, “rbi”, “crunch”. Topic 15 is about encouraging online transactions, as it lists “app”, “paytm”, “easy”, “online”. Fig. 10 shows the distribution of the top 10 terms in the collection of 15 topics.
|Topic 1||Topic 2||Topic 3||Topic 4||Topic 5|
VI-B NMI Results
In our experiments, we evaluated NMI of LDA with different topic numbers. Table II reports the results:
|LDA Models||NMI Results|
The results show that with fewer topics, the NMI value tends to be higher. Since NMI measures the similarity between the clustered tweet set and the labelled tweet set, the overall NMI results indicate that with fewer topics, tweets are clustered more correctly. A likely reason is that each document (tweet) is much shorter than a traditional document. Since the length of each tweet is limited (usually no longer than 140 characters), the information contained in a single tweet is also limited. Hence, when the number of topics increases, many topics tend to contain the same words, and it becomes hard to determine which topic a document should be assigned to. In further experiments, we can use different tweet pooling schemes and see whether they affect the NMI results.
VI-C Emotion Count
Fig. 11 shows the distribution of emotions during this event. As can be seen, the dominant emotion is trust, followed by anticipation and anger. The reason might be the mixed reactions of people expressing their thoughts and opinions through tweets. More than 12,500 tweets express trust. Around 8,000 tweets express anticipation, around 7,500 express anger, 7,000 express fear, 6,000 express sadness, and around 3,000 express disgust. Disgust was the least expressed emotion in our study. More than 15,000 tweets express positive sentiment and around 13,000 express negative sentiment.
VII Discussion and Conclusion
A substantial number of people are connected to online social networking services, and a noteworthy amount of information related to experiences and consumption practices is shared in this new media form. Text mining is an emerging technique for mining valuable information from the web, especially from social media. Our objective is to discover semantic patterns and trends in users’ discussions on social media about demonetization in India.
To detect conversations connected to the event under consideration, we applied a Latent Dirichlet Allocation based probabilistic system to discover latent topics. We varied the LDA parameters to find a model whose output is more informative as evaluated by NMI. The performance of the LDA models was not affected by changes in the distribution parameters $\alpha$ and $\beta$. At the same time, the results changed significantly with the number of topics. As we expected, the quality of the LDA results also depends on the number of records in the data. Manual analysis of the results revealed that LDA is able to extract most of the detailed information from the data. It extracts all the major event components, including the people involved, how the event unfolded, etc. However, for some topics we cannot infer a specific label due to their mixed nature. It is also important to note that all the extracted topics are related to the events covered by the collected data. Our method is not confined to the case study presented but is also relevant to the analysis of Twitter data collected in similar settings. From our analysis, we observed that positive responses exceeded negative ones in the demonetization discussion, as shown in the emotion distribution plot in Fig. 11, although this does not rule out that a large section of people raised their voices against the event. Trust, anticipation, and anger are the top 3 emotions by count, which suggests that our study is not biased towards one polarity.
Understanding the influence of social networks can help government agencies better understand how such information can be used not only in the dissemination of a socio-economic event but also to draw responses that could help mitigate an unruly reaction or prevent violence from starting and escalating.
-  Srivastava, A., Sahami, M. (eds.) Text mining: Classification, Clustering and Applications, pp. 155-184. CRC Press, Boca Raton, FL.
-  Hanna M. Wallach. 2006. Topic modeling: beyond bag-of-words. In Proceedings of the 23rd international conference on Machine learning (ICML ’06). ACM, New York, NY, USA, 977-984. DOI: https://doi.org/10.1145/1143844.1143967
-  Bo Pang and Lillian Lee. 2008. Opinion Mining and Sentiment Analysis. Found. Trends Inf. Retr. 2, 1-2 (January 2008), 1-135. DOI: http://dx.doi.org/10.1561/1500000011
-  Twitter Streaming API-Twitter Developers. https://dev.twitter.com/streaming/overview.
-  MongoDB-MongoDB, Inc. https://www.mongodb.com/.
-  Tweets-Twitter Developers. https://dev.twitter.com/overview/api/tweets.
-  Text Mining Package in R. https://cran.r-project.org/web/packages/tm/tm.pdf.
-  Porter, M. F. (1980). An algorithm for suffix stripping. In Program, volume 14, pages 130-137.
-  Salton, G. and McGill, M. J. (1983). In Introduction to Modern Information Retrieval. McGraw Hill Book Co.
-  Baeza-Yates, R., Ribeiro-Neto, B. (1999) Modern Information Retrieval, ACM Press, New York.
-  Zhai Z., Liu B., Xu H., Jia P. (2011) Constrained LDA for Grouping Product Features in Opinion Mining. In: Huang J.Z., Cao L., Srivastava J. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2011. Lecture Notes in Computer Science, vol 6634. Springer, Berlin, Heidelberg
-  Blei, David M. 2012. Probabilistic topic models. Communications of the ACM 55 (4):77-84.
-  Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research 3:993-1022.
-  T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1):5228-5235, 2004.
-  T. Griffiths, “Gibbs sampling in the generative model of latent dirichlet allocation”, 2002.
-  Emotions Evoked by Common Words and Phrases: Using Mechanical Turk to Create an Emotion Lexicon, Saif Mohammad and Peter Turney, In Proceedings of the NAACL-HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, June 2010, LA, California.
-  Plutchik, R. 2001. Integration, Differentiation, and Derivatives of Emotion, Evolution and Cognition (7:2), pp 114-125.
-  Mehrotra, R., S. Sanner, W. Buntine, L. Xie, Improving LDA Topic Models for Microblogs via Tweet Pooling and Automatic Labeling, Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval. 2013.
-  Syuzhet. https://cran.r-project.org/web/packages/syuzhet/index.html.