Sentiment Analysis of Financial News Articles using Performance Indicators

11/25/2018 ∙ by Srikumar Krishnamoorthy, et al. ∙ IIMA

Mining financial text documents and understanding the sentiments of individual investors, institutions and markets is an important and challenging problem in the literature. Current approaches to mine sentiments from financial texts largely rely on domain specific dictionaries. However, dictionary based methods often fail to accurately predict the polarity of financial texts. This paper aims to improve the state-of-the-art and introduces a novel sentiment analysis approach that employs the concept of financial and non-financial performance indicators. It presents an association rule mining based hierarchical sentiment classifier model to predict the polarity of financial texts as positive, neutral or negative. The performance of the proposed model is evaluated on a benchmark financial dataset. The model is also compared against other state-of-the-art dictionary and machine learning based approaches and the results are found to be quite promising. The novel use of performance indicators for financial sentiment analysis offers interesting and useful insights.




1 Introduction

Sentiment analysis and opinion mining Pang2002; Turney2002 has received significant attention in the literature due to its wide applicability in business, management and social science disciplines. It has been effectively applied in domains such as movies Pang2002; Turney2002, product reviews Blitzer2007, travel reviews Dang2010; Turney2002 and finance Antweiler2004; Huang2014; Loughran2015; Loughran2016; Tetlock2007; Tetlock2008.

Financial sentiment analysis is considered to be an important and challenging problem in the literature Loughran2015; Loughran2016. Current approaches to financial sentiment analysis utilize generic dictionaries Tetlock2007; Tetlock2008, domain specific dictionaries Ferguson2014; li2014effect; li2014news; Loughran2011; Malo2014 or statistical/machine learning methods Antweiler2004; Huang2014; Li2010; Malo2014; OHare2009; VanDeKauter2015 to determine polarity in financial texts. Some of the common dictionaries used in the financial sentiment analysis literature include Harvard GI (HGI) Stone1962, MPQA Wiebe2005, Sentiwordnet esuli2006sentiwordnet, SenticNet cambria2014senticnet, SentiStrength2 thelwall2012sentiment, LM Loughran2011, and Financial Polarity Lexicon (FPL) Malo2014. LM and FPL are finance-specific dictionaries used in the recent literature. Other dictionaries (such as HGI Stone1962, MPQA Wiebe2005, SenticNet cambria2014senticnet) are generic in nature and are likely to misclassify common financial words. For example, a detailed study of the HGI Stone1962 dictionary by Loughran2011 shows that more than 75% of the negative words used in HGI are non-negative in a financial context. Furthermore, recent research studies demonstrate superior results when using a domain specific dictionary over a general dictionary Huang2014; li2014news; Loughran2015; Malo2014.

In this paper, we aim to improve the state-of-the-art in domain specific dictionary based financial sentiment analysis. We motivate the need for the current study with the help of a set of illustrative examples. Let us consider the following financial text sentences extracted from financial phrase-bank dataset Malo2014.

  1. Halonen’s office acknowledged receiving the letter but declined comment.

  2. Financial details were not disclosed.

  3. The serial bond is part of the plan to refinance the short-term credit facility.

  4. Aspo’s strong company brands – ESL Shipping, Leipurin, Telko and Kaukomarkkinat – aim to be the market leaders in their sectors.

  5. DnB Nord of Norway is the most likely Nordic buyer for Citadele, while Nordea would be a good strategic fit, according to published documents.

  6. We are also pleased to welcome the new employees.

In the LM dictionary Loughran2011, declined, disclosed and refinance are negative words; similarly, strong, good and pleased are positive words. Therefore, each of the above sentences will be classified as positive or negative sentences by methods that use LM or other related domain dictionaries. However, it is quite evident that all of the above sentences are neutral statements from an investor/analyst perspective.
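The failure mode described above can be reproduced with a minimal sketch of a word-counting dictionary classifier. The two word sets contain only the six LM words named in the text, and the function name `dictionary_polarity` and the majority-count decision rule are illustrative assumptions, not the procedure of any specific published method.

```python
import re

# Toy word lists: only the six LM words named in the text above.
LM_NEGATIVE = {"declined", "disclosed", "refinance"}
LM_POSITIVE = {"strong", "good", "pleased"}


def dictionary_polarity(sentence):
    """Classify a sentence by counting positive and negative dictionary hits."""
    words = re.findall(r"[a-z]+", sentence.lower())
    pos = sum(w in LM_POSITIVE for w in words)
    neg = sum(w in LM_NEGATIVE for w in words)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"


sentences = [
    "Financial details were not disclosed.",
    "We are also pleased to welcome the new employees.",
]
labels = [dictionary_polarity(s) for s in sentences]
print(labels)  # ['negative', 'positive']
```

Both sentences are neutral from an investor/analyst perspective, yet a single dictionary hit is enough to push each of them into a polarized class.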

A few research studies have attempted to go beyond just polarity based dictionaries and utilized financial entities (custom words Malo2014, noun phrases li2014effect or named entities schumaker2009textual) to improve the quality of polarity detection. However, the use of financial entities often generates a lot of false positives and false negatives. Let us consider the following sentences that contain at least one financial entity.

  1. The company’s board of directors has proposed a dividend of EUR 0.12 per share.

  2. Amanda Capital has investments in 22 private equity funds and in over 200 unquoted companies mainly in Europe.

  3. A corresponding increase of 85,432.50 euros in Ahlstrom’s share capital has been entered in the Trade Register today.

All of the above sentences are neutral statements, though they contain financial terms/entities (italicized words in text). Sentence 3 also indicates an increase of a financial entity (share capital). Hence, the sentence is likely to be considered positive by systems that use a combination of LM dictionary and financial entities (or noun phrases/named entities) li2014effect; Malo2014.

The following sentences contain financial or non-financial terms and directionality related words (e.g. raise, fell, lower, increase and so on).

  1. Turnover rose to EUR 21mn from EUR 17mn.

  2. The EPS outlook was increased by 5.6

  3. Unit costs for flight operations fell by 6.4 percent.

  4. The company intends to raise production capacity in 2006.

  5. Rapala said it estimates it will make savings of 1-2 million euros a year by centralising its French operations at one site.

  6. Demand for fireplace products was lower than expected, especially in Germany.

Sentences 4, 5 and 6 do not use financial entities. The systems that use only financial entities are likely to treat them as neutral sentences. An investor/analyst, however, is likely to consider the above sentences as polarized ones (positive or negative). A closer examination of these sentences reveals that they: (1) use non-financial indicators like demand, production capacity, and operations, and (2) refer to improvement or decline of the non-financial indicators.

From the foregoing discussions, it is evident that the use of domain specific dictionary (on a stand-alone basis or in conjunction with financial entities) is inadequate to accurately predict the polarity in financial texts.

It is quite intuitive to understand that an investor or a financial analyst is likely to analyze the performance of a company in terms of both financial and non-financial performance indicators. An investor/analyst often looks for performance indicators or measures while assessing polarity in financial texts and making investment decisions. Therefore, we posit that the use of financial and non-financial performance indicators can improve the quality of financial sentiment analysis.

Our main objective in this research work is to examine the use of performance indicators (lagging/leading financial and non-financial indicators) to predict polarity in financial texts. We conjecture that the use of performance indicators Kaplan1996 is likely to be more meaningful and closely aligned with the way humans interpret financial texts while making investment decisions. We present a new approach to perform financial sentiment analysis based on performance indicators and conduct rigorous experiments to assess its usefulness. Such an approach, to the best of our knowledge, has not been explored in the literature.

This paper primarily makes two key contributions to the literature. First, the paper introduces the use of performance indicators to predict polarity in financial texts. Second, the paper presents a hierarchical sentiment classifier model based on the concept of association rule mining Agrawal1993. The primary benefit of using association rule mining for polarity prediction is that the generated rules can be used to easily explain predictions, unlike more complex discriminative models (like SVM, Neural networks) that are black-boxes. The performance of the proposed sentiment classifier model is assessed on a benchmark financial dataset. The model is also compared against other state-of-the-art methods to demonstrate its usefulness.

The rest of the paper is organized as follows: In the next section, the related work in the literature is described. Subsequently, in section 3, the paper outlines the proposed method in detail. The experimental evaluation of the method and a discussion of the key findings are presented in section 4. The paper concludes with a summary and directions for future research work.

2 Related Literature

Financial sentiment analysis approaches in the literature can be broadly categorized as (a) generic dictionary based methods, (b) domain specific dictionary based methods, and (c) statistical or machine learning based methods. Generic dictionaries such as Harvard GI Stone1962 were used in some of the early works in financial sentiment analysis Tetlock2007; Tetlock2008. The use of generic dictionaries, however, leads to misclassification of common words in financial text and impacts the performance of sentiment prediction Loughran2011 or stock price movement prediction li2014effect; li2014news. Recent works in the literature Ferguson2014; li2016tensor; li2014effect; li2014news; Malo2014 predominantly use domain specific dictionaries such as the LM dictionary Loughran2011 and FPL Malo2014. A detailed review of different dictionaries used in the literature can be found in Loughran2015. Recent surveys on sentiment analysis in finance can be found in Kearney2014; Loughran2016.

Statistical or machine learning based methods Antweiler2004; Huang2014; Li2010; Malo2014; OHare2009; VanDeKauter2015 use bag-of-words or n-grams as features and apply generative or discriminative classifier models for predicting sentiments. Internet message postings were used by Antweiler2004 to classify financial text as buy, hold or sell. The classifier results were then aggregated for a pre-defined time period and bullishness & agreement indices were computed. The authors also conducted a study on the relationship between the computed indices and financial measures such as stock returns and market volatility. The experimental results show that stock messages help predict market volatility.

A naive bayes classifier model is used in Huang2014 to classify sentences as positive, negative or neutral. The predicted sentence level opinions are aggregated to derive report level opinions. The authors show that their method outperforms both generic and domain specific dictionary methods. A naive bayes classifier model based on bag-of-words features is trained in Li2010 to predict the tone (as positive, neutral, negative or uncertain) of forward looking statements in corporate filings. The author presents evidence that dictionary based methods (both generic and domain specific) are unsuitable for analyzing the tone of corporate filings.



Study | Type of text | Dictionary | Features | Method | Objective
Tetlock Tetlock2007 | News | HGI | HGI words | Regression |
Tetlock Tetlock2008 | News | HGI | HGI words | Regression |
Loughran Loughran2011 | 10-K | LM | LM words | Regression |
Antweiler Antweiler2004 | | - | BoW | |
Huang Huang2014 | | | | |
Li Li2010 | | - | BoW | NB | Info. Content
O’Hare OHare2009 | Blogs | - | Topic terms, Words near topics | |
Marjan VanDeKauter2015 | News | Dutch | Polarity expression | |
Li li2016tensor | Social media | | Nouns, Sentiment words | |
Li li2014news | News | HGI, LM | | |
Li li2014effect | Internet messages | | Polarity words | |
Mo mo2016news | News | SWN | | |
RPS schumaker2009textual | News | - | Nouns | SVR |
Malo Malo2014 | News | | Phrase structure | |
Our work | News | | PI tags | ARM |

BoW - Bag of words; FE - Financial entity; DI - Directionality; SWN - Sentiwordnet
PI - Performance indicator words; LIWC - Linguistic Inquiry and Word Count

Table 1: Comparison of financial sentiment analysis literature

Naive bayes and support vector machine methods were used to predict the sentiment of bloggers towards companies and their stocks OHare2009. The sentiment prediction was done using selected topic terms. Their work analyzes sentiment at the topic level and not at the sentence level as in Li2010; Malo2014. Other recent works in the literature that perform sentiment analysis include li2016tensor; li2014effect; li2014news; mo2016news; schumaker2009textual. These works utilize sentiment dictionaries and extract sentiments from text and predict stock price movements or market returns. The proposed paper is distinct from such works as the focus of the current paper is primarily on predicting the polarity of news articles as positive, neutral or negative. A comparative analysis of the financial sentiment analysis works in the literature is presented in Table 1.

The polarity sequence model proposed in Malo2014 is an extension of Moilanen2010. The authors propose an LPS model and utilize both generic dictionary, MPQA Wiebe2005 and domain specific LM dictionary Loughran2011. The authors also enrich the finance lexicon by including (a) financial concept, (b) directional verbs such as increase, decrease, and (c) polarity for the interaction between financial concept and directional verb e.g. cost-increase is pre-labelled in the dictionary as negative; profit-increase is pre-labelled as positive. The authors also contribute to the literature by making an annotated financial sentiment corpus publicly available.

The LPS model proposed in Malo2014 primarily works in three phases. In the first phase, entities are extracted along with their semantic orientations from the given text sentences. Several heuristics are applied to extract different entities (financial and general) and semantic orientations from financial text sentences. Subsequently, in the second phase, the model applies a phrase structure projection to project the extracted entities into a sequence in space. Finally, in the third phase, a multi-label classifier is applied on the generated sequence. For multi-label classification, a support vector machine (SVM) with a one-against-one strategy was used.

The proposed work is distinct from existing works in the literature on the following aspects: (1) It presents a financial text polarity prediction method based on financial and non-financial performance indicators. This is different from the work of Malo2014, which primarily uses financial concepts and is likely to generate a lot of false positives and false negatives. Our experimental results corroborate this claim and show the usefulness of the proposed method. (2) Polarity for the interaction between financial concept and directional verb is not pre-defined as in Malo2014. (3) It presents an association rule Agrawal1993 based hierarchical classifier to predict sentiment. To the best of our knowledge, this is the first work that uses an association rule mining based method for financial sentiment analysis. A few recent works in non-financial domains have explored the use of a non-hierarchical associative classifier for sentiment analysis of web reviews man2014investigating and product reviews yang2010understanding.

We demonstrate the usefulness of the proposed approach over other state-of-the-art methods through rigorous experimental evaluation.

3 Hierarchical Sentiment Classifier (HSC)

This section describes the proposed method for financial sentiment analysis. The method broadly consists of the following four key aspects.

  1. Domain specific lexicon: The proposed method uses a standard domain specific dictionary, LM dictionary Loughran2011. In addition, the lexicon defines words related to performance indicators and directionality.

  2. Text tagging using lexicon: The given financial text is parsed and tagged by looking up words in the lexicon.

  3. Polarity classifier model: The tagged financial text is used to build a hierarchical classifier model. The classifier model utilizes the concept of association rule mining.

  4. Predict sentiment: The polarity classifier model (association rules) is used to make polarity predictions for new financial text sentences.

Each of the above aspects is described in detail in the following subsections.

3.1 Domain specific lexicon

Three categories of words are defined in the domain specific lexicon used in this paper. The overall distribution of words in the lexicon is given in Table 2.


Type of word (tags) | No. of words (% of all entries)
Lagging indicator words (LagInd) | 67 (2.29%)
Leading indicator words (LeadInd) | 70 (2.39%)
Down (DOWN) | 53 (1.81%)
Up (UP) | 51 (1.74%)
Negative sentiment words (NEG) | 2337 (79.73%)
Positive sentiment words (POS) | 353 (12.04%)
Table 2: Distribution of words in the dictionary

The first category of words is related to performance indicators. The performance indicator words are categorized as lagging and leading indicators. Lagging indicators reflect the results of firm’s activity e.g. improvement or decline in sales, market share, operating profit, operating cost, orders, inventory turns etc.

The leading indicators signal future events and are precursors to the firm’s future performance. Some of the common examples of leading indicators are: #new stores, #employee recruitments, #employee reductions/layoffs, #new customers, % increase in productivity, % increase in production capacity, #new contracts won or awarded, and % increase in plant utilization.

The specific words used in the dictionary for leading indicators include only the key terms like store, employee, customer, productivity, efficiency. The words that signify the direction of movement of these terms (i.e. increase, decrease, reduction, layoff etc.) are maintained separately as part of the directionality category that is described next.

The second category of words defined in the lexicon is related to the directionality of leading/lagging indicators. Directionality is the word (or n-gram) describing direction of events Malo2014. Examples include increase, improve, skyrocket, decrease, and plummet. For building the directionality related words, the words defined under specific categories (namely, rise, fall, increase, and decrease) in the Harvard GI lexicon Stone1962 were used as seed words. The words were manually reviewed to remove words that have different meaning in financial context. A few additional words that signify directionality of lagging/leading indicator terms were included e.g. layoff (#employees laid off), terminated (#staff terminated), and awarded (#new contracts awarded). The final number of words under the directionality category was about 100.

The third, and final, category of words used in the lexicon is the domain-specific lexicon, LM dictionary Loughran2011. The polarity bearing words defined in the LM dictionary were also used in this study.

3.2 Text tagging using lexicon

The input to the proposed method is a set of financial text sentences. Each financial text sentence is parsed using the NLTK Parts-Of-Speech (POS) tagger bird2006nltk. The sentences are then tagged or labeled based on the occurrence of different categories of words (i.e. performance indicators, directionality, interaction of performance indicators & directionality, and sentiment words). The occurrence patterns of different categories of words are analyzed using several heuristic rules. Fig. 1 provides a couple of illustrative examples of parsed and tagged text sentences. It gives the raw text sentence, its corresponding POS tags and the lexicon tagged words. The lexicons used for tagging are the ones described in section 3.1. The complete grammar and the heuristic rules used for parsing and labeling financial text sentences are given in the Appendix.

The output of this step is the conversion of text sentences into a tagged sequence of words. The specific tags used are: LagInd, LeadInd, UP, DOWN, POS, NEG, LagInd::UP, LagInd::DOWN, LeadInd::UP, and LeadInd::DOWN. The last four tags are the interactions between performance indicators and directionality. All other words in the original sentence are discarded. The above process is repeated for each and every sentence in the collection. The resulting collection of tagged words is then used for polarity classification.
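The tagging step can be illustrated with a rough sketch. The lexicon below is a hypothetical handful of entries, and the adjacency rule that merges an indicator tag with a neighbouring directionality tag is a simplifying assumption standing in for the POS-based grammar and heuristics of the Appendix, which are not reproduced here.

```python
import re

# Hypothetical miniature lexicon (illustrative entries only).
LEXICON = {
    "turnover": "LagInd", "sales": "LagInd",
    "productivity": "LeadInd", "employee": "LeadInd",
    "rose": "UP", "increase": "UP",
    "fell": "DOWN", "decrease": "DOWN",
    "strong": "POS", "declined": "NEG",
}
INDICATORS = {"LagInd", "LeadInd"}
DIRECTIONS = {"UP", "DOWN"}


def tag_sentence(sentence):
    """Keep only lexicon words and merge adjacent indicator/direction tags."""
    words = re.findall(r"[a-z]+", sentence.lower())
    tags = [LEXICON[w] for w in words if w in LEXICON]
    merged, i = [], 0
    while i < len(tags):
        nxt = tags[i + 1] if i + 1 < len(tags) else None
        if tags[i] in INDICATORS and nxt in DIRECTIONS:
            merged.append(f"{tags[i]}::{nxt}")   # e.g. LagInd::UP
            i += 2
        elif tags[i] in DIRECTIONS and nxt in INDICATORS:
            merged.append(f"{nxt}::{tags[i]}")   # direction precedes indicator
            i += 2
        else:
            merged.append(tags[i])
            i += 1
    return merged


print(tag_sentence("Turnover rose to EUR 21mn from EUR 17mn."))  # ['LagInd::UP']
```

Words outside the lexicon are dropped, so a sentence with no lexicon hits yields an empty tag list.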

Figure 1: Illustration of text sentences tagged with POS tags and lexicon related tags

3.3 Polarity classifier model

The proposed method aims to classify the financial text as positive, neutral or negative. The classifier proposed in this paper uses the concept of association rule mining Agrawal1993. Some of the common associative classifiers used in the literature include: CBA Liu1998, CMAR Li2001, LB meretakis1999extending and ART berzal2004art. The proposed approach utilizes a variation of the CMAR method for sentiment classification. The key steps of the associative classifier proposed in this paper are as follows:

3.3.1 Prepare transaction data

The input data required for rule mining is the set of transactions with items. The tags identified in the earlier step are used as items for rule mining. In addition, the class label of the sentence is appended to the set of items. For example, the transaction data for the illustrative example in Figure 1 is shown in Table 3.

Transaction # | Type of word (tags)
1 | LagInd::UP, positive
2 | UP, POS, neutral
3 | LagInd::DOWN, negative
4 | LagInd, neutral
5 | LeadInd::UP, positive
6 | LagInd, POS, NEG, neutral
Table 3: A sample tagged transaction database

3.3.2 Build a classifier model

The model building involves frequent itemset mining, association rule generation, and rule ordering.

Frequent itemset and rule mining. The frequent itemsets are mined using the Apriori algorithm Agrawal1993. Association rules are mined from the frequent itemsets with the constraint that the consequent of the rule contains one item and is a class (positive, negative, or neutral class). The antecedent of the rule can contain any number of items. That is, rules of the form X → c, where X is a set of tags and c is a class label, are mined from the generated frequent itemsets. A sample set of association rules mined from the transactions in Table 3 is given in Figure 2.

LagInd → neutral (33.33%, 100%)
POS → neutral (33.33%, 100%)
UP, POS → neutral (16.67%, 100%)
LagInd::DOWN → negative (16.67%, 100%)
LagInd::UP → positive (16.67%, 100%)
LeadInd::UP → positive (16.67%, 100%)
UP → neutral (16.67%, 100%)
Figure 2: Sample Association Rules
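The two percentages attached to each rule in Figure 2 are its support and confidence. As a minimal sketch, they can be recomputed from the six transactions of Table 3. The code enumerates every tag subset by brute force rather than running Apriori, which is adequate only for such a tiny database; the function name `mine_class_rules` and the tuple layout are illustrative assumptions.

```python
from itertools import combinations

CLASSES = {"positive", "neutral", "negative"}

# The six transactions of Table 3 (tags plus one class label each).
transactions = [
    {"LagInd::UP", "positive"},
    {"UP", "POS", "neutral"},
    {"LagInd::DOWN", "negative"},
    {"LagInd", "neutral"},
    {"LeadInd::UP", "positive"},
    {"LagInd", "POS", "NEG", "neutral"},
]


def mine_class_rules(transactions, minsup=0.0, minconf=0.0):
    """Return {(antecedent, class): (support, confidence)} by brute force."""
    n = len(transactions)
    rule_counts, antecedent_counts = {}, {}
    for t in transactions:
        label = next(iter(t & CLASSES))       # exactly one class per transaction
        items = sorted(t - CLASSES)
        for k in range(1, len(items) + 1):
            for ante in combinations(items, k):
                antecedent_counts[ante] = antecedent_counts.get(ante, 0) + 1
                key = (ante, label)
                rule_counts[key] = rule_counts.get(key, 0) + 1
    rules = {}
    for (ante, label), count in rule_counts.items():
        support = count / n                            # share of all transactions
        confidence = count / antecedent_counts[ante]   # share of antecedent matches
        if support >= minsup and confidence >= minconf:
            rules[(ante, label)] = (support, confidence)
    return rules


rules = mine_class_rules(transactions)
support, confidence = rules[(("LagInd",), "neutral")]
print(f"{support:.2%}, {confidence:.0%}")  # 33.33%, 100%
```

For example, LagInd occurs in transactions 4 and 6, both neutral, which gives a support of 2/6 and a confidence of 2/2.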

Rule ordering. The generated association rules are ordered according to decreasing precedence based on their confidence, support and antecedent length. This heuristic is used in the literature to give precedence to more confident rules Li2001; Liu1998 while making predictions.
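A small sketch of this ordering is given below. The text does not state the direction of the antecedent-length tie-break; the sketch assumes that shorter antecedents rank first, as in CBA/CMAR-style orderings, and the rule tuples are illustrative values taken from Figure 2.

```python
# Each rule: (antecedent, class, support, confidence); values from Figure 2.
rules = [
    (("UP", "POS"), "neutral", 0.1667, 1.0),
    (("LagInd",), "neutral", 0.3333, 1.0),
    (("LagInd::UP",), "positive", 0.1667, 1.0),
]


def precedence_key(rule):
    """Higher confidence first, then higher support, then shorter antecedent."""
    antecedent, _label, support, confidence = rule
    return (-confidence, -support, len(antecedent))  # tie-break is an assumption


ordered = sorted(rules, key=precedence_key)
for antecedent, label, support, confidence in ordered:
    print(", ".join(antecedent), "->", label)
```

With equal confidence, the LagInd rule ranks first because of its higher support, and the single-tag LagInd::UP rule precedes the two-tag rule.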

3.4 Predict sentiment

The rules mined in the previous step are used as a model for predicting the polarity of new sentences. The complete pseudo code for polarity prediction is given in Algorithm 1.

Given a new financial text sentence, the polarity prediction works as follows: First, the new text sentence is tagged with performance indicators, directionality and finance-specific sentiment words. The tagged words are then looked up in the rule base (i.e. the set of rules generated by the polarity classifier model). It is to be noted that each rule in the rule base is of the form: Premise (one or more tag words) → Conclusion (the class label). The tagged words are matched against each and every rule in the rule base (steps 5-19 of Algorithm 1). If a rule match is found, a matching score is assigned based on the confidence of the rule (the higher the confidence, the more important the rule). The scores are grouped based on class labels, as multiple rules with different class labels are likely to match the tagged words. The process of grouping scores based on class labels is similar to the approach followed by associative classifiers in the literature Li2001. The above process is repeated for each and every rule in the rule base and the scores are accumulated. At the end of the iteration, the algorithm produces a score for each class label. The class with the highest score is treated as the final predicted class (step 20 of Algorithm 1).

Let us illustrate the above prediction procedure with the following example:

Olvi expects market share to increase in the first quarter of 2010

The above sentence after parsing and dictionary lookup will be tagged as LagInd::UP. The class label is then determined by looking up the tag in the set of mined rules. The only matching rule available in Figure 2 for the tagged sentence is LagInd::UP → positive (16.67%, 100%). The scores are, therefore, generated only for the "positive" class label. The most appropriate class label for the sentence is predicted as "positive". If there are rule matches across class labels, then scores for each class label are generated (refer to steps 5 to 19 of Algorithm 1). The final class label is predicted as the one with the highest score. If no rule matches are found during the search process, then the sentence is assigned a default prediction value of "neutral".

Input: a financial text sentence
       the set of mined association rules
Output: the predicted class

1:  result = a dictionary that holds confidence values for each class
2:  tags = parse and tag the text sentence (section 3.2)
3:  // iterate through each rule and match the tags
4:  // each rule, premise → class, contains support and confidence values
5:  for r in rules do
6:     // check if the complete set of tags matches
7:     if r.premise == tags then
8:        // accumulate the score for each class in result
9:        result[r.class] = result[r.class] + r.confidence
10:    else
11:       // match each tag individually
12:       for t in tags do
13:          if r.premise == t then
14:             // one of the tags matches
15:             result[r.class] = result[r.class] + r.confidence
16:          end if
17:       end for
18:    end if
19: end for
20: return class with the highest average confidence
Algorithm 1: Pseudocode for polarity prediction
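A runnable sketch of Algorithm 1 is given below. It follows step 20 in ranking classes by average confidence, falls back to "neutral" when no rule matches (as described in the text), and represents each rule as a hypothetical (premise, class, confidence) tuple. These representation choices are assumptions made for illustration.

```python
from collections import defaultdict

# Rules of Figure 2 as (premise, class, confidence) tuples.
RULES = [
    (frozenset({"LagInd"}), "neutral", 1.0),
    (frozenset({"POS"}), "neutral", 1.0),
    (frozenset({"UP", "POS"}), "neutral", 1.0),
    (frozenset({"LagInd::DOWN"}), "negative", 1.0),
    (frozenset({"LagInd::UP"}), "positive", 1.0),
    (frozenset({"LeadInd::UP"}), "positive", 1.0),
    (frozenset({"UP"}), "neutral", 1.0),
]


def predict_polarity(tags, rules, default="neutral"):
    """Score each class by the confidences of its matching rules."""
    tags = frozenset(tags)
    scores = defaultdict(list)
    for premise, label, confidence in rules:
        if premise == tags:                      # complete tag set matches
            scores[label].append(confidence)
        else:
            for t in tags:                       # match each tag individually
                if premise == frozenset({t}):
                    scores[label].append(confidence)
    if not scores:
        return default                           # no matching rule
    # Step 20: class with the highest average confidence.
    return max(scores, key=lambda c: sum(scores[c]) / len(scores[c]))


print(predict_polarity(["LagInd::UP"], RULES))  # positive
```

Because every matched rule is recorded together with its class, the rules that fired can be listed alongside the prediction, which is the basis of the explainability argument made for the rule based approach.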

The polarity classifier model and prediction method described above are generic and can be used for multi-class problems (positive, neutral, negative sentiments). However, in the financial sentiment analysis context, the total number of training examples in each class is likely to be highly unbalanced. That is, one is likely to encounter more neutral sentences than positive/negative sentences in financial texts. This class imbalance can also be observed in the annotated corpus used by researchers in the past literature. For example, the share of neutral examples was well over 53%, and negative examples were a small fraction of about 12-14%, in the earlier works Huang2014; Malo2014.

In order to effectively address the class imbalance problem, we propose a hierarchical classifier, named Hierarchical Sentiment Classifier (HSC). The proposed method (HSC) first classifies a given sentence as polarized or neutral. Subsequently, the polarized sentences are classified as either positive or negative. The underlying classifier model and prediction methods used in HSC were described earlier (section 3). A comparative evaluation of the hierarchical versus the multi-class and the standard one-against-one classifier is presented in the experimental results section.
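The two-stage decision can be sketched as follows. In HSC each stage is an associative classifier; here the stages are passed in as arbitrary callables, and `toy_stage1` and `toy_stage2` are hypothetical stand-ins used only to make the example executable.

```python
def to_stage1_label(label):
    """Relabel training data for the first stage: polarized vs. neutral."""
    return "neutral" if label == "neutral" else "polarized"


def hsc_predict(tags, stage1, stage2):
    """Stage 1 separates neutral from polarized; stage 2 splits polarized."""
    if stage1(tags) == "neutral":
        return "neutral"
    return stage2(tags)


# Hypothetical stand-ins for the two trained rule-based classifiers.
def toy_stage1(tags):
    return "polarized" if any("::" in t for t in tags) else "neutral"


def toy_stage2(tags):
    return "positive" if any(t.endswith("::UP") for t in tags) else "negative"


examples = [["LagInd::UP"], ["LagInd"], ["LeadInd::DOWN"]]
predictions = [hsc_predict(t, toy_stage1, toy_stage2) for t in examples]
print(predictions)  # ['positive', 'neutral', 'negative']
```

The second stage is trained and applied only on polarized sentences, so the large neutral class does not compete directly with the smaller positive and negative classes.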

4 Experimental Evaluation

The publicly available financial phrase bank dataset Malo2014 is used for the experimental analysis. This dataset is a reasonably large annotated corpus of about 5000 sentences. The sentences are marked as positive, neutral or negative by the annotators. Four different datasets were prepared by the authors based on the level of agreement amongst the annotators. These datasets are labelled as DS100 (100% agreement amongst annotators), DS75 (75% agreement), DS66 (66% agreement) and DS50 (50% agreement). The total number of examples and distribution of class labels on each of the datasets is given in Table 4.

Dataset | Number of sentences | (%negative, %neutral, %positive)
DS100 | 2259 | (13.4, 61.4, 25.2)
DS75 | 3448 | (12.2, 62.1, 25.7)
DS66 | 4211 | (12.2, 60.1, 27.7)
DS50 | 4840 | (12.5, 59.4, 28.2)
Table 4: Dataset characteristics

The proposed classifier model is evaluated on standard classifier evaluation measures fawcett2006classifiermeasures such as Precision (P), Recall (R), F-measure (F) and Accuracy (A). To ensure robustness of the results, all the experiments were conducted using a stratified 10-fold cross-validation.
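For reference, the four measures can be computed per class as sketched below. Treating each class in a one-against-rest fashion is an assumption made here so that accuracy can be reported per class; the function name and the toy labels are illustrative.

```python
def one_vs_rest_metrics(y_true, y_pred, target):
    """Precision, recall, F-measure and accuracy for one class against the rest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == target and p == target for t, p in pairs)
    fp = sum(t != target and p == target for t, p in pairs)
    fn = sum(t == target and p != target for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"P": precision, "R": recall, "F": f_measure, "A": accuracy}


y_true = ["positive", "neutral", "negative", "neutral"]
y_pred = ["positive", "neutral", "neutral", "neutral"]
metrics = one_vs_rest_metrics(y_true, y_pred, "neutral")
print({k: round(v, 2) for k, v in metrics.items()})
# {'P': 0.67, 'R': 1.0, 'F': 0.8, 'A': 0.75}
```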

The proposed method (HSC) is compared against the most commonly used domain specific dictionary (LM Loughran2011) based method (W-Loughran), and the most recent LPS model Malo2014. The dictionary based method is too simple to capture all intricacies of financial text. The LPS model is more complex requiring transformation of text using domain dictionary and inferencing of complex sentence structures using SVM based approach. HSC, on the other hand, is a relatively simple approach that transforms the text into a set of tags and applies rule mining on the identified tags to predict sentiment. One of the major advantages of the proposed method is that the rationale for prediction can be clearly explained as it is a rule based approach.

The rule mining requires two configurable parameters, namely minimum support (minsup) and minimum confidence (minconf). We use default minsup and minconf values of 0.5% and 60% respectively. A sensitivity analysis of the effect of parameter value changes on sentiment prediction is presented in section 4.4.

The results of our initial experiments are presented in Table 5. The best F-measure and accuracy values are marked in bold. The results reveal that the proposed method is significantly better compared to other state-of-the-art methods in all of the datasets studied. The accuracy values for the neutral class alone are the same for both LPS and HSC models. It is to be noted that there are a large number of neutral examples in the dataset (about 60%, refer to Table 4) and both the models are able to learn well from these examples and achieve a very high accuracy value of 93% or more.

Dataset | Measure | W-Loughran (pos, neut, neg) | LPS (pos, neg, neut) | HSC (pos, neg, neut)
DS100 | P | 0.56, 0.64, 0.36 | 0.74, 0.85, 0.84 | 0.83, 0.93, 0.86
DS100 | R | 0.13, 0.91, 0.16 | 0.74, 0.87, 0.79 | 0.82, 0.93, 0.81
DS100 | F | 0.20, 0.75, 0.22 | 0.74, 0.86, 0.81 | 0.83, 0.93, 0.83
DS100 | A | 0.76, 0.63, 0.85 | 0.87, 0.83, 0.95 | 0.91, 0.92, 0.95
DS75 | P | 0.60, 0.65, 0.38 | 0.69, 0.83, 0.76 | 0.77, 0.90, 0.83
DS75 | R | 0.17, 0.90, 0.20 | 0.66, 0.84, 0.80 | 0.75, 0.90, 0.78
DS75 | F | 0.26, 0.76, 0.26 | 0.67, 0.83, 0.78 | 0.76, 0.90, 0.80
DS75 | A | 0.76, 0.64, 0.86 | 0.84, 0.79, 0.95 | 0.88, 0.88, 0.95
DS66 | P | 0.61, 0.63, 0.40 | 0.60, 0.86, 0.73 | 0.75, 0.86, 0.80
DS66 | R | 0.18, 0.89, 0.21 | 0.82, 0.71, 0.77 | 0.70, 0.88, 0.75
DS66 | F | 0.28, 0.74, 0.28 | 0.69, 0.77, 0.75 | 0.72, 0.87, 0.77
DS66 | A | 0.74, 0.62, 0.87 | 0.80, 0.75, 0.94 | 0.85, 0.84, 0.94
DS50 | P | 0.57, 0.62, 0.41 | 0.65, 0.76, 0.72 | 0.72, 0.84, 0.79
DS50 | R | 0.19, 0.88, 0.22 | 0.54, 0.81, 0.77 | 0.66, 0.85, 0.71
DS50 | F | 0.28, 0.73, 0.29 | 0.59, 0.78, 0.74 | 0.69, 0.85, 0.75
DS50 | A | 0.73, 0.61, 0.86 | 0.79, 0.74, 0.93 | 0.82, 0.81, 0.93

The best F & A cases are marked in bold and the tie cases are underlined.

Table 5: Comparative evaluation of HSC against other methods

4.1 Study of the effect of performance indicators

In the next set of experiments, we study the influence of performance indicators and financial sentiment words on sentiment prediction. Three different scenarios were considered for the analysis: (1) using lagging indicators (along with directionality), (2) using both lagging and leading indicators (along with directionality), and (3) using all of the tags including financial sentiment words (baseline case). The analysis results are presented in Table 6. The best F-measure and accuracy values are marked in bold. In addition, the values that are tied are underlined.

The results offer a few interesting insights when analysed in relation to the characteristics of the annotated datasets. First, the results for the DS100 dataset using only the lagging indicators are close to those of the baseline case. The performance difference between the two cases widens as we move down the table (DS75, DS66 and DS50). Second, the performance improves when the leading indicators are included in the model. Further improvements are observed when all of the tags are used. One can observe a correlation between inter-annotator agreement and the use of indicators. A model that uses only the lagging indicators works quite well when the inter-annotator agreement is 100%. The second model, which uses both lagging and leading indicators, performs better even when the inter-annotator agreement declines to 75%. Finally, the third model works best even when the inter-annotator agreement declines to a very low value of 50%. These results imply that human annotators experienced in the finance domain agree less when a financial text uses leading indicators to describe a company’s performance. This is in line with our expectation, as the relationship amongst leading indicators, investor sentiment and the firm’s future performance is often unclear. The finance and accounting literature also provides evidence of the lack of a clear relationship between leading indicators and firm performance ittner1998nonfinancial. Let us consider the following two sentences:

  1. VDW combined with LXE devices enhances productivity, enabling workers to use a single device to perform voice, scanning and keyboard functions.

  2. Last year, UPM cut production, closed mills in Finland and slashed 700 jobs.

The first sentence refers to an improvement in worker productivity (a leading indicator) and the second sentence refers to cuts in production and jobs (leading indicators). Neither sentence has 100% agreement; each is annotated as polarized by 75% of the annotators, i.e. the first and second sentences are marked as positive and negative respectively.

The financial texts also contain general positive opinions. For example, let us consider the following sentence:

“He believes that the soy-oats have a good chance of entering the UK market”

The above sentence contains neither lagging nor leading indicators but expresses a positive outlook. A high level of disagreement amongst annotators was observed, and only 50% of the annotators marked the above sentence as positive. Another sentence rated as positive by 50% of the annotators was: “The company’s strength is its Apetit brand”.

Dataset  Measure  HSC: LagInd Only      HSC: LagInd, LeadInd  HSC: All
                  pos  neut neg         pos  neg  neut        pos  neg  neut
DS100    P        0.89 0.91 0.87        0.89 0.92 0.87        0.83 0.93 0.86
         R        0.77 0.96 0.79        0.80 0.96 0.78        0.82 0.93 0.81
         F        0.82 0.93 0.82        0.84 0.94 0.82        0.83 0.93 0.83
         A        0.91 0.91 0.95        0.92 0.92 0.95        0.91 0.92 0.95
DS75     P        0.82 0.85 0.86        0.82 0.88 0.84        0.77 0.90 0.83
         R        0.65 0.94 0.68        0.70 0.94 0.76        0.75 0.90 0.78
         F        0.73 0.89 0.75        0.76 0.91 0.80        0.76 0.90 0.80
         A        0.87 0.86 0.94        0.89 0.88 0.95        0.88 0.88 0.95
DS66     P        0.79 0.80 0.81        0.79 0.84 0.81        0.75 0.86 0.80
         R        0.58 0.93 0.57        0.63 0.91 0.72        0.70 0.88 0.75
         F        0.67 0.86 0.67        0.70 0.87 0.76        0.72 0.87 0.77
         A        0.84 0.81 0.93        0.85 0.84 0.94        0.85 0.84 0.94
DS50     P        0.75 0.77 0.82        0.75 0.81 0.80        0.72 0.84 0.79
         R        0.54 0.91 0.51        0.59 0.90 0.69        0.66 0.85 0.71
         F        0.62 0.83 0.63        0.66 0.85 0.74        0.69 0.85 0.75
         A        0.82 0.78 0.92        0.83 0.80 0.93        0.82 0.81 0.93
Table 6: Influence of performance indicators on sentiment prediction

From the foregoing discussion and the experimental results, it is clear that financial texts are interpreted differently depending on the nature of the indicators and sentiment words. It is to be noted that sentiment analysis in other domains (such as movies and music) utilizes general sentiment lexicon words to predict sentiments. The financial sentiment analysis literature has borrowed a similar approach, refining the dictionary words for the finance domain. This paper presents a new perspective and suggests the use of multiple levels of analysis (lagging indicators, leading indicators and domain specific lexicon words) to improve the quality of financial sentiment analysis. The experimental results demonstrate the utility of such a multi-level sentiment analysis.

The multi-level approach to financial sentiment analysis can be very useful in building models to suit specific application requirements. For example, an investor might prefer a model built using lagging (or lagging and leading) indicators over a more accurate baseline model. The former model is more likely to help the investor make the best investment decisions, even though it offers relatively lower accuracy.

4.2 Analysis of alternate classifier models

In the financial sentiment analysis context, the number of examples in each of the classes is likely to be highly unbalanced. Therefore, a hierarchical sentiment classifier was used in this paper. Table 7 presents an analysis of hierarchical versus multi-class and one-against-one versions of the associative classifier. A comparison of the multi-class and hierarchical classifiers reveals that the multi-class version performs marginally better for the DS100 dataset. However, its performance degrades considerably for the other datasets. For example, the positive-class F-measure values for the DS50 dataset are 64% and 69% for the multi-class and hierarchical versions respectively.

A comparison of the one-against-one and hierarchical classifiers suggests that the hierarchical classifier performs better. However, the margin of difference is small, except for a few cases (e.g. DS75 dataset, negative class values of 74% versus 80%). Overall, the comparative evaluation of different variants of the classifier models did not indicate statistically significant differences in the observed results. Future work could consider a detailed evaluation of alternate classifier models on more diverse datasets to better understand the trade-offs involved.

Dataset  Measure  One-against-one  Multi-class      HSC
                  pos  neut neg    pos  neg  neut   pos  neg  neut
DS100    P        0.84 0.92 0.87   0.85 0.94 0.84   0.83 0.93 0.86
         R        0.82 0.94 0.76   0.81 0.92 0.89   0.82 0.93 0.81
         F        0.83 0.93 0.81   0.83 0.93 0.86   0.83 0.93 0.83
         A        0.91 0.91 0.94   0.91 0.92 0.95   0.91 0.92 0.95
DS75     P        0.78 0.88 0.85   0.78 0.89 0.83   0.77 0.90 0.83
         R        0.75 0.92 0.66   0.71 0.90 0.81   0.75 0.90 0.78
         F        0.76 0.89 0.74   0.74 0.89 0.82   0.76 0.90 0.80
         A        0.88 0.86 0.94   0.87 0.87 0.95   0.87 0.88 0.95
DS66     P        0.76 0.85 0.81   0.75 0.84 0.80   0.75 0.86 0.80
         R        0.69 0.89 0.70   0.64 0.88 0.76   0.70 0.88 0.75
         F        0.73 0.87 0.75   0.69 0.86 0.78   0.72 0.87 0.77
         A        0.85 0.83 0.94   0.84 0.82 0.94   0.85 0.84 0.94
DS50     P        0.73 0.82 0.80   0.78 0.80 0.79   0.72 0.84 0.79
         R        0.65 0.87 0.67   0.54 0.89 0.73   0.66 0.85 0.71
         F        0.69 0.84 0.73   0.64 0.84 0.76   0.69 0.85 0.75
         A        0.83 0.80 0.93   0.81 0.79 0.94   0.82 0.81 0.93
Table 7: Performance analysis of One-against-One, Multi-class and HSC classifiers
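
As a rough illustration of the hierarchical set-up compared in Table 7, the sketch below chains two binary decisions. It assumes that the first stage separates neutral from polarized sentences and that the second stage separates positive from negative ones. This section does not restate how the hierarchy is arranged, so that arrangement, the function names and the toy stage classifiers are all assumptions rather than the paper's mined rules.

```python
def hierarchical_predict(tags, is_polarized, polarity):
    """Two-stage prediction: neutral vs. polarized, then positive vs. negative."""
    if not is_polarized(tags):
        return "neutral"
    return polarity(tags)


# Toy stand-ins for the two stage-wise classifiers (hypothetical).
def toy_is_polarized(tags):
    return any(tag.endswith(("::UP", "::DOWN")) for tag in tags)


def toy_polarity(tags):
    ups = sum(tag.endswith("::UP") for tag in tags)
    downs = sum(tag.endswith("::DOWN") for tag in tags)
    return "positive" if ups >= downs else "negative"


print(hierarchical_predict(["LagInd::UP"], toy_is_polarized, toy_polarity))
print(hierarchical_predict(["LagInd::DOWN"], toy_is_polarized, toy_polarity))
print(hierarchical_predict([], toy_is_polarized, toy_polarity))
```

A multi-class variant would replace both stages with a single three-way decision, and a one-against-one variant would combine the votes of pairwise classifiers.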

4.3 Comparative evaluation of HSC and machine learning models

In the recent literature on financial sentiment analysis, machine learning approaches like Naive Bayes (NB) and Support Vector Machines (SVM) have been explored. These approaches utilize bag-of-words, n-grams or sentiment words as features for building the model. Such models generate a large number of features and pose scalability issues. Besides, models based on bag-of-words (or n-grams) have been shown to be less predictive in past research studies li2014news; schumaker2009textual.

We analyze the use of machine learning models with the parsimonious set of features (tags) introduced in this paper. A term-document matrix was built using the tags as features and was then weighted using the standard term frequency-inverse document frequency (TF-IDF) measure. The standard Python library scikit-learn (sklearn) was used to build the NB and SVM models. For the SVM, the default value of the parameter C and an RBF kernel were used in the experiments.
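
A minimal sketch of this set-up with scikit-learn is given below. The tagged documents, the labels, the tag names other than LagInd::UP and LagInd::DOWN, and the choice of MultinomialNB as the Naive Bayes variant are assumptions made for illustration; the text only states that sklearn was used, with the default C and an RBF kernel for the SVM.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# Hypothetical tagged sentences: each document is a space-separated tag string.
train_docs = [
    "LagInd::UP",
    "LagInd::UP LeadInd::UP",
    "LagInd::DOWN",
    "LagInd::DOWN LeadInd::DOWN",
    "NoInd",
    "NoInd",
]
train_labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

# analyzer=str.split keeps each tag (including the "::") as a single feature.
vectorizer = TfidfVectorizer(analyzer=str.split)
X_train = vectorizer.fit_transform(train_docs)  # TF-IDF weighted term-document matrix

nb_model = MultinomialNB().fit(X_train, train_labels)
svm_model = SVC(kernel="rbf").fit(X_train, train_labels)  # C left at its default

X_test = vectorizer.transform(["LagInd::UP", "LagInd::DOWN"])
nb_pred = nb_model.predict(X_test)
svm_pred = svm_model.predict(X_test)
print(list(nb_pred), list(svm_pred))
```

Because the features are a handful of tags rather than a full vocabulary, the resulting matrix has only as many columns as there are distinct tags.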

The results of our experiments are shown in Fig. 3. HSC performs better than the other models. The SVM and NB models produced zero F-measure values for the negative class, which can be attributed to the models’ inability to learn from the parsimonious set of features. The Naive Bayes method was found to perform better than SVM for the positive and neutral classes. Overall, the standard machine learning models produced poor results compared to HSC, which demonstrates that an associative classifier (HSC) is quite effective in predicting sentiments even with a parsimonious set of features.

Figure 3: Classifier Performance for DS50 dataset

4.4 Effect of changes in rule mining parameters

Association rule mining requires two parameters to be specified, namely, minimum support and minimum confidence. The minimum confidence parameter influences the quality of the rules generated and plays a critical role in sentiment prediction. We study the influence of the choice of minimum confidence values on the precision and recall values of the classifier. The results of our investigation are presented in Fig. 4.


The results show the trade-off involved between precision and recall. An increase in confidence value is associated with increase in precision for positive and negative classes. But, the precision increase at higher confidence values comes at the cost of lower recall. The observed relationship is quite intuitive and is in line with our expectations. One can choose an appropriate value of confidence based on the level of precision and recall desired. Alternately, one can choose an optimal threshold value that maximizes F-measure. It is to be noted that F-measure is the harmonic mean of precision and recall.
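
The effect of the minimum confidence threshold can be illustrated with a small sketch. It mines only single-tag rules of the form tag -> class, which is a simplification of full association rule mining, and the transactions, threshold values and function name are hypothetical.

```python
from collections import Counter


def mine_rules(transactions, min_support, min_confidence):
    """Return {(tag, label): confidence} for single-tag rules tag -> label."""
    n = len(transactions)
    tag_counts = Counter()
    pair_counts = Counter()
    for tags, label in transactions:
        for tag in set(tags):
            tag_counts[tag] += 1
            pair_counts[(tag, label)] += 1
    rules = {}
    for (tag, label), count in pair_counts.items():
        support = count / n                   # share of all transactions
        confidence = count / tag_counts[tag]  # share of transactions with the tag
        if support >= min_support and confidence >= min_confidence:
            rules[(tag, label)] = confidence
    return rules


# Hypothetical tagged sentences with their class labels.
transactions = (
    [(["LagInd::UP"], "positive")] * 3
    + [(["LagInd::UP"], "negative")] * 1
    + [(["LagInd::DOWN"], "negative")] * 3
    + [(["LagInd::DOWN"], "positive")] * 2
)

loose = mine_rules(transactions, min_support=0.2, min_confidence=0.5)
strict = mine_rules(transactions, min_support=0.2, min_confidence=0.7)
print(loose)
print(strict)
```

Raising the threshold from 0.5 to 0.7 drops the weaker LagInd::DOWN -> negative rule. The remaining rule is more reliable, but sentences tagged LagInd::DOWN are no longer covered by any rule, which mirrors the precision-recall trade-off described above.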

Figure 4: Performance analysis of confidence parameter changes for DS50 dataset

4.5 Study of the use of performance indicator - directionality reversal

In the next set of experiments, we analyze the influence of directionality reversals. Let us consider the following two sentences to understand the need for directionality reversals.

  1. Turnover rose to EUR 21mn from EUR 17 mn.

  2. Unit costs for flight operations fell by 6.4 percent.

Both of the above sentences belong to the positive class. They will be tagged as LagInd::UP and LagInd::DOWN respectively. It is reasonable to assume that one is likely to encounter more LagInd::UP cases in the positive class and more LagInd::DOWN cases in the negative class. Hence, an associative classifier (HSC) is most likely to predict the class of the second example as negative.

In the literature, Malo et al. Malo2014 use a lexicon to pre-define the directional dependence of each financial term. The baseline associative classifier presented in this paper does not use such directional dependencies. In order to understand the influence of directional dependencies on classifier performance, a directionality reversal is applied as a post-processing operation. That is, after the initial tagging, a lookup is made for specific indicators (like operating cost, operating loss, expenses) and the directionality is reversed. For the second illustrative sentence above (unit costs), the tag is reversed from LagInd::DOWN to LagInd::UP, as a lower cost is considered a positive outcome.
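
The post-processing step can be sketched as a simple lookup followed by a flip of the direction part of the tag. This is a minimal illustration and not the implementation used in the experiments: the lookup list contains the indicators named above plus "unit cost", which is added here as an assumption so that the example sentence is matched, and the function name and substring matching are likewise hypothetical.

```python
# Indicators whose direction is reversed. The first three are named in the text;
# "unit cost" is an assumed extra entry so that the example sentence is matched.
REVERSED_INDICATORS = ("operating cost", "operating loss", "expenses", "unit cost")
FLIP = {"UP": "DOWN", "DOWN": "UP"}


def reverse_directionality(sentence, tag, reversed_terms=REVERSED_INDICATORS):
    """Flip UP/DOWN in `tag` when the sentence mentions a reversed indicator."""
    prefix, separator, direction = tag.partition("::")
    text = sentence.lower()
    mentions_term = any(term in text for term in reversed_terms)
    if separator and direction in FLIP and mentions_term:
        return prefix + "::" + FLIP[direction]
    return tag


print(reverse_directionality("Turnover rose to EUR 21mn from EUR 17 mn.", "LagInd::UP"))
print(reverse_directionality("Unit costs for flight operations fell by 6.4 percent.", "LagInd::DOWN"))
```

The first sentence keeps its LagInd::UP tag, while the second is changed from LagInd::DOWN to LagInd::UP, so both positive examples end up with the same tag.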

Dataset  Measure  HSC              PI-Directionality reversal
                  pos  neut neg    pos  neg  neut
DS100    P        0.83 0.93 0.86   0.83 0.93 0.89
         R        0.82 0.93 0.81   0.85 0.93 0.80
         F        0.83 0.93 0.83   0.84 0.93 0.84
         A        0.91 0.92 0.95   0.92 0.92 0.96
DS75     P        0.77 0.90 0.83   0.77 0.90 0.85
         R        0.75 0.90 0.78   0.77 0.90 0.77
         F        0.76 0.90 0.80   0.77 0.90 0.81
         A        0.88 0.88 0.95   0.88 0.88 0.95
DS66     P        0.75 0.86 0.80   0.75 0.86 0.82
         R        0.70 0.88 0.75   0.71 0.88 0.74
         F        0.72 0.87 0.77   0.73 0.87 0.78
         A        0.85 0.84 0.94   0.85 0.84 0.95
DS50     P        0.72 0.84 0.79   0.71 0.84 0.81
         R        0.66 0.85 0.71   0.68 0.85 0.71
         F        0.69 0.85 0.75   0.70 0.85 0.75
         A        0.82 0.81 0.93   0.83 0.81 0.94
Table 8: Experimental analysis of PI and Directionality reversal

The results of the directionality reversal experiments are presented in Table 8. They show a general improvement in classifier performance for both the positive and negative classes. This is in line with our expectation, as the directionality reversal corrects the tags of the affected positive and negative examples. The improvement over the baseline case is marginal, however, and is not statistically significant. These results demonstrate that the proposed model (HSC) remains effective even when the directionality reversal words are not used.

An alternate way to handle the directional dependence is to tag the identified indicators along with the business terms. For example, while tagging the sentence ‘Operational cost rose by 30%’, the term ‘operational cost’ can be tagged along with ‘LagInd::UP’. Association rule mining can then be applied to transactions that contain both the performance indicators and the associated terms. The main challenge with this approach is that the support of individual terms (like cost, expenses) will be very low. Such terms are unlikely to satisfy the minimum support constraint, and lowering the minimum support far enough to include them is likely to produce models with very poor accuracy. In such cases, the use of more complex variable support rule mining approaches can be explored.

4.6 Discussion

This paper proposed a hierarchical sentiment classifier that uses performance indicators for financial sentiment prediction. The proposed approach was found to be quite effective, achieving an accuracy of over 80% and an F-measure of over 68% on all of the datasets studied.

The findings of this paper have implications for both theory and practice. From a theoretical perspective, the paper presents a novel approach to financial sentiment analysis using performance indicators. The presented approach is more closely aligned with the way humans interpret financial texts while making investment decisions. The associative sentiment classifier generates predictions that are easier to explain than those of other methods available in the literature. Our experimental evaluation demonstrates the utility of the approach over the state-of-the-art methods. Furthermore, the results in Table 6 use the performance indicators (lagging and leading indicators) to reveal the reasons for differences amongst annotators.

The paper presents new perspectives on the influence of performance indicators and financial sentiment lexicon words on sentiment prediction. The experimental results reveal that it is useful to consider financial sentiment analysis at multiple levels (using lagging indicator, combination of lagging and leading indicators, and all indicators along with sentiment words). An investor, analyst or financial institution can benefit by choosing an appropriate financial sentiment model to suit their specific application requirements e.g. track overall market outlook, make investment decision and so on.

The transparency and replicability of results is often a challenge in the financial sentiment analysis literature Loughran2016 due to the proprietary nature of datasets, dictionaries and advanced sentence parsing rules. To ensure transparency and replicability of the results presented, the detailed grammar and sentence parsing steps used for tagging are provided in the Appendix.

5 Conclusion

This paper examined the use of performance indicators for predicting sentiments from financial texts. It presented a hierarchical classifier that uses the concept of association rules. The effectiveness of the classifier was demonstrated through rigorous experimental analysis on a benchmark financial dataset.

A study of the influence of performance indicators on sentiment prediction revealed interesting insights. Lagging indicators, leading indicators and sentiment words were observed to have varying levels of influence on sentiment prediction across the different datasets. The results are in alignment with the way humans interpret financial texts and make decisions.

As part of our future work, we plan to explore several interesting extensions. First, a more fine-grained analysis of performance indicators is likely to reveal interesting insights. For instance, a balanced scorecard approach Kaplan1996 analyses a company’s performance along multiple dimensions, namely, financial, customers, internal business processes and learning & growth. Performance indicators can be categorized across these dimensions and the influence of each dimension on financial sentiment can be explored. Second, advanced variable support rule mining methods can be studied to capture the directional dependencies in financial texts without using a lexicon. Third, the proposed method is not tuned to handle sentences that have a mix of positive and negative orientations on performance indicators. One can investigate the use of utility mining liu2012mining approaches to give different weights to different indicators. Such an approach allows one to capture the varying influence of indicators and further improve sentiment prediction.

The findings of this study are likely to be useful for financial institutions in building superior sentiment analysis models. Investors and financial analysts can use such models on qualitative financial texts to make better investment decisions.