Aspect-based sentiment analysis (ABSA) is a way to systematically mine opinions given a body of text. Unlike regular sentiment analysis, ABSA allows for far more granular levels of opinion mining. For example, one common application of ABSA is to dissect product or service reviews and determine sentiment on sometimes unrelated aspects, such as a quality or price. Rarely are product reviews as simple as good or bad. They are nuanced with conflicting positive and negative opinions based on what aspect of the product is being reviewed.
We find the field of finance to be a significantly under-explored domain for ABSA. Similar to product reviews, financial investment opportunities are commonly written as free-form essays; these write-ups are generally nuanced with positive and negative opinions on specific aspects of a certain investment opportunity. Being able to identify these topics and to subsequently determine the associated sentiment could be beneficial in downstream models to auto-summarize predefined aspects, allowing users to obtain structured information from an unstructured set of write-ups. Another use case could be to employ the aspect-based sentiments as features to classify future performance or volatility of investment ideas.
The under-explored nature of financial related ABSA also manifests itself in a lack of large, high-quality datasets on which to train. Current ABSA techniques and model architectures do not accurately scale down to small data sizes, presenting an opportunity for transfer learning that can leverage larger, domain-related datasets. Successful inductive transfer learning allows these larger datasets that have no sentiment or aspect annotations to be used to improve results on the main ABSA task.
2 Related Work
ABSA is not a new idea. There have been extensive bodies of work, starting from the original rule-based methods thet2010aspect to more recent deep learning methods wang2015deep. In general, the state-of-the-art consists of a few subtasks. First, a model identifies the entity and its aspects. Then, sentiment analysis is performed on the body of text before combining the two tasks.
Sentiment analysis for finance – without considering aspects – has also been explored cortis-EtAl:2017:SemEval. Unlike ABSA, more general sentiment analysis cannot determine if a statement is positive or negative by each aspect – it must choose an overall sentiment.
Until recently, it has been difficult to perform the kind of analysis in wang2015deep in the financial domain due to both a lack of well-annotated datasets and the necessary quantity of data. Unlike the larger product review dataset pontiki2016semeval used in Wang’s study, we speculate it would be expensive to annotate as many examples for the financial domain, as it would require extensive domain expertise. However, as part of the companion proceedings for WWW ’18 conference, maia201818 released a very small dataset (FiQA) with a call for papers. FiQA contains the particular labels for which we are interested, but it lacks data quantity. Still, we reference the submissions to this open challenge as a response to FiQA Task 1. Many of the submissions jangid2018aspect; Chen:2018; deFrancaCosta:2018 use a neural architecture similar to wang2015deep, despite very few training examples. We show these results along with our own in Section 7.1.
The problem of few training examples brings us to the topic of inductive transfer learning pan2010survey. In general, transfer learning allows us to perform training on some source task with the ultimate goal of optimizing the loss of some target task. Word embeddings such as word2vec mikolov2013distributed or GloVe pennington2014glove
are an early form of transfer learning in NLP. Just beyond that are more sophisticated vector representations of words, such as ELMopeters2018deep. ELMo embeddings offer a solution to the challenges of complex word use across linguistic context (i.e. polysemy, syntax, etc.) and limited training data. In ELMo, each word is assigned a representation which is a function of the entire corpus sentences to which they belong. At a high-level, the embeddings are computed from internal states of a two-layer biLSTM, bidirectional Language Model (biLM). Despite ELMo’s important improvements on “traditional” word embeddings, any approach utilizing ELMo embeddings still requires a custom architecture further downstream.
In this paper, we explore a recent method called ULMFiT howard2018universal, which allows us to not only transfer word representations with a vector, but use a single, pre-trained model architecture (AWD-LSTM DBLP:journals/corr/abs-1708-02182) for all intended tasks. Similar to ELMo, the key benefit of a higher level representation is that it allows semantically meaningful starting points of the input words for training. Using ELMo, one ultimately concatenates the output of each trained layer and uses it as a fixed embedding for some downstream task, whereas ULMFiT fine-tunes an entire language model to some target domain and then directly connects a downstream target task. This concept itself is not entirely new DBLP:journals/corr/DaiL15a, but howard2018universal contribute novel techniques (Gradual Unfreezing, Discriminative Learning Rates and Slanted Triangular Learning Rates) to make this possible on small datasets without all prior learning being forgotten. Similar to chain-thaw felbo2017using, gradual unfreezing offers another approach to the transfer learning training and fine-tuning process. Chain-thaw first tunes any new layers in a model until convergence; this is followed by a sequential tuning of each layer individually. Finally, chain-thaw fine-tunes all layers together. In contrast, gradual unfreezing fine-tunes all layers in reverse, adding a ’thawed’ layer instead of fine-tuning single layers individually.
The following are the primary contributions of this paper:
We assess the performance of recent NLP inductive transfer learning techniques such as ULMFiT on a new dataset (FiQA) which has the problem of limited training examples.
We use FiQA as our target task to conduct experiments with varying intermediary tasks, at least one of which appears to be novel, to our knowledge.
We extend state-of-the-art on FiQA task 1 (aspect 2 classification and sentiment scoring) by using our novel intermediary task.
The rest of this paper is structured as follows. In Section 4 we introduce the critical datasets used for both the primary and intermediary tasks (outlined in Section 5). We propose our novel variation to model architecture and hyper-parameter tuning in Section 6. In Section 7 we discuss our results and analysis of our experiments. Finally, we conclude with some future direction of research related to this paper in Section 8.
The provided training dataset for WWW ’18 maia201818 contains a total of 1,174 examples from news headlines and tweets. Each example contains the sentence and the sentence snippet associated with the target entity, aspect, and sentiment score. A sample FiQA entry is shown in Table 1.
|Sentence||easyJet expects resilient demand to withstand security fears.|
|Aspect Level 1||Corporate|
|Aspect Level 2||Risks|
A Level 1 Aspect label takes on one of four possible labels (Corporate, Economy, Market or Stock), and our Level 2 Aspect label takes on one of twenty-seven possible labels (Appointment, Risks, Dividend Policy, Financial, Legal, Volatility, Coverage, Price Action, etc.). The original dataset contained a small number of multilabel examples, however, we considered this number too few to train a meaningful multilabel classifier. Thus, we slightly stray from the original WWW ’18 task for the purpose of this research. Finally, sentiment score takes on a continuous value between -1 and 1 – most negative to most positive.
4.2 Value Investors Club (VIC)
VIC is an online investment forum where fund managers and skilled investors submit in-depth investment recommendations on a daily basis. We scraped roughly 6,800 documents with investment theses along with attributes of a particular thesis, such as long or short stock position. Each document has an average of approximately 1,950 words which gives us a corpus size of about 13M. This is a crucial dataset that we use to adapt the domain of our general pre-trained models to our tasks in FiQA subsection 6.2. We find the high quality of VIC to be of particular interest. Although most recommendations and investment ideas shared online lack credibility or accuracy, VIC investment ideas empirically outperform the market on average and over time gray2012fund.
4.3 Pre-Trained Models
Although we do not directly train on “1B Word Benchmark” corpus chelba2013one or wikitext-103 merity2016pointer, we thought it was important to note the underlying source for the pre-trained models released by howard2018universal and peters2018deep.
Using this outlined data, we have two primary subtasks: classification for determining sentence aspects and regression for continuous sentiment. Namely, FiQA Level 2 Aspect and FiQA sentiment score are our ultimate target values of interest. We measure our models by using error-rate and F1-score for the classification task, and with MSE and R-squared for the regression task. These tasks can be seen in Figure 1
We also have a number of intermediary tasks, the purpose of which is to better adapt the pre-trained models to our primary tasks. First, we use VIC to train a language model on the 13M word corpus. We also train VIC on a binary classifier which learns if the thesis of a given body of text is that of a long position or short position. We also use FiQA Level 1 Aspect as an intermediary task.
Our most naive baseline on the performance of the primary tasks are logistic regression for aspect classification and linear regression for sentiment. We use a simple sparse representation of words, rather than introduce any type of continuous embedding.
We also create a simple ELMo embedding baseline from the peters2018deep pre-trained biLM. Our model computes a fixed mean-pooling of all contextualized word representations for each input sentence which is then passed through a single dense layer. Note that we do not claim this methodology to be comparable to our more rigorous implementation of ULMFiT. We use this as a neural baseline, but reserve any in-depth analysis for future work.
We largely use the methodology and architecture used in the ULMFiT howard2018universal paper and experiment with different methods of domain adaptation, model fine-tuning, and hyper-parameter tuning.
Typically, ULMFiT is composed of three phases of training. First, a language model is trained using a large general corpus. In particular, howard2018universal released a pre-trained general language model which we use as our starting point. Second, the weights from the first phase are fine-tuned to a general task using a corpus from the target domain. Third, the weights from the second phase are fine-tuned to the primary task using the same upstream architecture as the first two phases, but by passing the output of the tuned LSTM model into a new fully-connected layer.
We propose a few different variations of the typical framework and apply it to FiQA:
FiQA does not contain enough volume of data to meaningfully apply LM fine-tuning (Phase 2); Thus, we use VIC as our target corpus for the language model fine-tuning. In this phase, we apply no change in model architecture.
We further exploit VIC labels of long/short positions and perform an additional round of fine-tuning in Phase 2. The decoder of the Language Model is no longer used, and, in its place, we add a fully-connected layer with a binary output.
We train the FiQA primary classification and regression tasks using both variations of phase two, attaching classification and regression fully-connected layers, respectively.
We train FiQA Aspect Level 1 using both variations of phase two, then transfer the weights further downstream for FiQA Aspect Level 2.
As suggested in howard2018universal, there are a number of techniques used to ensure that the training process during fine-tuning does not cause the model to “forget“ what was previously learned. We experiment with both gradual unfreezing howard2018universal and chain-thaw felbo2017using and compare results on varying sub-sample sizes. We also experiment with some early-stoppage of chain-thaw, which we think may be beneficial to avoid catastrophic forgetting and to not overly fine-tune. In addition, we utilize slanted triangular learning rates, concat pooling, and bptt for text classification as described in howard2018universal. While we did initially perform some exploratory work on different hyper-parameter values of these latter techniques, we keep the values constant for our experiments and results.
7 Results and Discussions
7.1 Model Results
We were able to ultimately outperform the current FiQA Task 1 state-of-the-art by using the ULMFiT framework in conjunction with our modified intermediary tasks (see Table 2). In some cases we achieve close to state-of-the-art scores, even without much fine-tuning.
In Aspect Level 2 classification, although we can match state-of-the-art results with a typical ULMFiT process, we are unable to exceed state-of-the-art until we use Aspect Level 1 as our intermediary task. We conclude that since Aspect Level 1 and Level 2 are hierarchical, the internal states from fine-tuning on Aspect Level 1 lead to a much better starting point for training on Aspect Level 2. We want to be clear that this method does not train with Aspect Level 1 as a feature, we simply transfer the internal state. Thus, it will work at inference time without Aspect Level 1 labels. We also believe that this is not simply a result of additional training as the prior metrics for Aspect Level 2 had already converged.
While Aspect Level 1 was not our primary task, we note some interesting results here as well. We saw the best performance from using chain-thaw. For this particular task, we see that the other models actually perform worse than using the Phase 1 language model without any domain adaptation. It is possible that these methods of training here are leading to catastrophic forgetting howard2018universal, in which we lose the utility of the pre-trained language model.
For the regression task, we outperform the current FiQA state-of-the-art in terms of MSE, but not in terms of R-squared. It is interesting to note that the top-performer was from Phase 2 training on whether or not VIC advocates for a long or short position. It seems rational that transfer learning from a more closely related task leads to better starting internal states for our primary sentiment task.
We also note that in all cases, even our Phase 1 language model is comparable, if not superior, to our naive baseline models. In this case, no additional training time is required to produce these predictions. This framework is also relatively universal in the sense that not much pre-preprocessing is required other than standard tokenization. Out-of-vocabulary words are also easily handled by using the mean representation of all other vocabulary.
7.2 Hyper-parameter results
The only hyper-parameter we vary for each of the models above is the method in which layers are unfrozen. We report gradual unfreezing for all models, but only report the best version of chain-thaw for each model for brevity. Roughly speaking, our primary tasks perform the best with full chain-thaw, but our Aspect Level 1 classification performs best with partial chain-thaw. More research is needed on this topic, but we speculate that this may be due to over-learning on the target task.
Also, we compare model performance for both primary tasks with random subsamples of the training data (see Figure 2). We see that chain-thaw tends to have a smoother decrease in error. It appears the steepest portion of the curve occurs when there are even fewer training examples than our case. While it’s difficult to further forecast the error rates, this shows some evidence our current methodology does help with the few training examples issue. However, the error rates do not yet appear to have hit a point of saturation.
Aspect-based sentiment analysis is not broadly used in the finance industry. Due to a lack of well annotated data, it is necessary to apply transfer learning techniques to leverage data from a large general corpus for the training of tasks on specific domain data.
Because the dataset for ABSA training (FiQA) is so small, we modify the ULMFiT steps. We fine tuned the language model with another dataset containing target domain specific context (VIC) and perform additional fine-tuning on a classification task (long/short position of VIC documents). Another contribution is the improvement of the FiQA Aspect 2 classification task by leveraging a pre-trained model on Aspect 1 classification.
Despite improved performance using a modified ULMFiT process for ABSA training on a very small sample, more research is needed to generalize this methodology to multi-task and multi-label learning. Moreover, in order to improve the methodology, experiments using other pre-trained models of large corpora as well as different setups of hyper-parameters should be evaluated.
The adapted ULMFiT methodology to ABSA on a very small, domain-specific dataset such as FiQA is an important cornerstone for learning subsequent tasks related to predicting financial performance. The methodology can also be expanded to other domain specific tasks when the size of data available represents a challenge.
- Handbook of technical writing. St. Martin’s, New York.
- The latex companion. Addison-Wesley, Reading, Mass..
- Developments in maximum-entropy data analysis. In Maximum Entropy and Bayesian Methods, J. Skilling (Ed.), pp. 53–71.
- Introduction to Bayesian image analysis. In Medical Imaging: Image Processing, M. H. Loew (Ed.), Proc. SPIE, Vol. 1898, pp. 716–731.
- LaTeX: a document preparation system. Addison-Wesley, Reading, Mass..
- Equations of state calculations by fast computing machine. J. Chem. Phys. 21, pp. 1087–1091.
- The latex companion. Addison-Wesley, Reading, Mass..
- Mayfield handbook of technical and scientific writing. Mountain View, Mayfield. Note: http://mit.imoat.net/handbook/