A framework for anomaly detection using language modeling, and its applications to finance

by Armineh Nourbakhsh et al.
S&P Global

In the finance sector, studies focused on anomaly detection are often associated with time-series and transactional data analytics. In this paper, we lay out the opportunities for applying anomaly and deviation detection methods to text corpora and challenges associated with them. We argue that language models that use distributional semantics can play a significant role in advancing these studies in novel directions, with new applications in risk identification, predictive modeling, and trend analysis.








1. Introduction

The detection of anomalous trends in the financial domain has focused largely on fraud detection (West and Bhattacharya, 2016), risk modeling (Addo et al., 2018), and predictive analysis (Lee and Rui, 2000). Most such studies rely on time-series, transactional, graph, or otherwise quantitative and structured data. This overlooks the critical importance of the semi-structured and unstructured text corpora from which practitioners in the finance domain derive insights: financial reports, press releases, earnings call transcripts, credit agreements, news articles, customer interaction logs, and social data.

Previous research in anomaly detection from text has evolved largely independently of financial applications. Unsupervised clustering methods have been applied to documents to identify outliers and emerging topics (Cheng, 2013). Deviation analysis has been applied to text to identify errors in spelling (Samanta and Chaudhuri, 2013) and in document tagging (Eskin, 2000). The recent popularity of distributional semantics (Turney and Pantel, 2010) has led to further advances in semantic deviation analysis (Vecchi et al., 2011). However, current research remains largely divorced from specific applications within the domain of finance.

In the following sections, we enumerate major applications of anomaly detection from text in the financial domain, and contextualize them within current research topics in Natural Language Processing.

2. Five views on anomaly

Anomaly detection is often employed in contexts where a deviation from a norm must be captured, especially when extreme class imbalance impedes a supervised approach. Such methods can unveil insights that would otherwise remain hidden or obstructed.

In this section, we lay out five perspectives on how textual anomaly detection can be applied in the context of finance, and how each application opens up opportunities for NLP researchers to apply current research to the financial domain.

2.1. Anomaly as error

Previous studies have used anomaly detection to identify and correct errors in text (Samanta and Chaudhuri, 2013; Eskin, 2000). These are often unintentional errors that occur as a result of some form of data transfer, e.g. from audio to text, from image to text, or from one language to another. Such studies have direct applicability to the error-prone process of earnings call or customer call transcription, where audio quality, accents, and domain-specific terms can lead to errors. Consider a scenario where the CEO of a company states in an audio conference, ‘Now investments will be made in Asia,’ but the system instead transcribes, ‘No investments will be made in Asia.’ The two statements differ in a way that could greatly influence the analysis and future direction of the company. Moreover, it is highly unlikely that a CEO would make such a strong negative statement in a public setting; this implausibility is exactly the kind of signal that anomaly detection can exploit for error correction.
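As a minimal sketch of this idea, a language model trained on domain text can score competing transcription candidates and prefer the more plausible one. The toy corpus and add-one-smoothed bigram model below are illustrative stand-ins for a real transcript archive and a full language model:

```python
import math
from collections import Counter

# Toy stand-in for a corpus of historical earnings-call transcripts
# (hypothetical data; a real system would train on a large archive).
corpus = (
    "now investments will be made in asia . "
    "now we expect growth in asia . "
    "investments will be made in new markets . "
    "we will be expanding in asia ."
).split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)  # vocabulary size, used for add-one smoothing

def log_prob(sentence):
    """Add-one-smoothed bigram log-probability of a sentence."""
    tokens = sentence.lower().split()
    score = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        score += math.log((bigrams[(prev, word)] + 1) /
                          (unigrams[prev] + V))
    return score

heard = "No investments will be made in Asia"
alt = "Now investments will be made in Asia"
# The candidate better supported by the domain corpus is preferred.
best = max([heard, alt], key=log_prob)
```

On this toy corpus the model assigns higher probability to the ‘Now investments…’ reading, because the implausible transcription contains a bigram the domain model has never seen.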

Optical-character-recognition from images is another error-prone process with large applicability to finance. Many financial reports and presentations are circulated as image documents that need to undergo OCR in order to be machine-readable. OCR might also be applicable to satellite imagery and other forms of image data that might include important textual content such as a graphical representation of financial data. Errors that result from OCR’d documents can often be fixed using systems that have a robust semantic representation of the target domain. For instance, a model that is trained on financial reports might have encoded awareness that emojis are unlikely to appear in them or that it is unusual for the numeric value of profit to be higher than that of revenue.
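A crude version of such encoded awareness can be expressed as explicit plausibility checks over extracted figures. The field names and rules below are hypothetical; a trained model would learn such constraints rather than hard-code them:

```python
def flag_ocr_anomalies(metrics):
    """Flag extracted figures that violate simple domain constraints.

    `metrics` maps field names to numeric values parsed from an OCR'd
    report (hypothetical schema for illustration).
    """
    flags = []
    revenue = metrics.get("revenue")
    profit = metrics.get("net_income")
    if revenue is not None and profit is not None and profit > revenue:
        flags.append("net_income exceeds revenue: likely digit misread")
    if revenue is not None and revenue < 0:
        flags.append("negative revenue: likely sign or character error")
    return flags

# A '1' misread as '7' inflates net income past revenue.
flags = flag_ocr_anomalies({"revenue": 4200.0, "net_income": 7800.0})
```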

2.2. Anomaly as irregularity

Anomaly in the semantic space might reflect irregularities that are intentional or emergent, signaling risky behavior or phenomena. A sudden change in the tone and vocabulary of a company’s leadership in their earnings calls or financial reports can signal risk. News stories that have abnormal language, or irregular origination or propagation patterns might be unreliable or untrustworthy.

Wendlandt et al. (2018) showed that when trained on similar domains or contexts, distributed representations of words are likely to be stable, where stability is measured as the similarity of their nearest neighbors in the distributed space. Such insight can be used to assess anomalies in this sense.
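A sketch of this stability measure, assuming toy two-dimensional vectors in place of real embeddings: the overlap of a word's nearest neighbors across two training runs approximates the stability notion of Wendlandt et al. (2018):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nearest_neighbors(word, emb, k=2):
    """The k words closest to `word` by cosine similarity."""
    others = [w for w in emb if w != word]
    ranked = sorted(others, key=lambda w: cosine(emb[word], emb[w]),
                    reverse=True)
    return set(ranked[:k])

def stability(word, emb_a, emb_b, k=2):
    """Fraction of nearest neighbors shared across two training runs."""
    na = nearest_neighbors(word, emb_a, k)
    nb = nearest_neighbors(word, emb_b, k)
    return len(na & nb) / k

# Two hypothetical runs; "profit" keeps its neighborhood, so it is stable.
run_a = {"profit": [1.0, 0.1], "revenue": [0.9, 0.2],
         "income": [0.95, 0.15], "asia": [0.0, 1.0]}
run_b = {"profit": [0.8, 0.3], "revenue": [0.85, 0.25],
         "income": [0.9, 0.2], "asia": [0.1, 0.9]}
score = stability("profit", run_a, run_b)
```

Words whose stability drops sharply between runs or time periods would be candidates for anomalous semantic drift.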

As an example, Nourbakhsh et al. (2017) identified cliques of users on Twitter who consistently shared news from similar domains. Characterizing these networks as “echo-chambers,” they represented the content shared by each echo-chamber as distributed representations. When certain topics from one echo-chamber began to deviate from similar topics in other echo-chambers, the content was tagged as unreliable. They showed that this method can improve the performance of standard methods for fake-news detection.

In another study (Zhao, 2017), the researchers hypothesized that transparent language in earnings calls indicates high expectations for performance in the upcoming quarters, whereas semantic ambiguity can signal a lack of confidence and expected poor performance. By quantifying transparency as the frequent use of numbers, shorter words, and unsophisticated vocabulary, they showed that a change in transparency is associated with a change in future performance.
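A rough sketch of such a transparency measure follows; the weighting of the two factors is an assumption for illustration, not the scoring function of the cited study:

```python
import re

def transparency_score(text):
    """Crude transparency proxy in the spirit of Zhao (2017): a higher
    share of numbers and shorter words reads as more 'transparent'.
    The 0.1 weight on average word length is illustrative."""
    tokens = re.findall(r"[A-Za-z']+|\d[\d,.]*", text)
    if not tokens:
        return 0.0
    numbers = sum(1 for t in tokens if t[0].isdigit())
    words = [t for t in tokens if not t[0].isdigit()]
    avg_len = sum(len(w) for w in words) / len(words) if words else 0.0
    return numbers / len(tokens) - 0.1 * avg_len

clear = "Revenue was 4.2 billion, up 7 percent, and margin was 31 percent."
vague = ("Circumstances notwithstanding, performance trajectories "
         "remained broadly satisfactory.")
```

A quarter-over-quarter drop in such a score would be the anomaly signal the study associates with expected poor performance.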

2.3. Anomaly as novelty

Anomaly can indicate a novel event or phenomenon that may or may not be risky. Breaking news stories often emerge as anomalous trends on social media. Li et al. (2017) experimented with this in their effort to detect novel events from Twitter conversations. By representing each event as a real-time cluster of tweets (where each tweet was encoded as a vector), they assessed the novelty of an event by comparing its centroid to the centroids of older events.
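The centroid-comparison step can be sketched as follows, with short illustrative vectors standing in for tweet embeddings and an arbitrary similarity threshold:

```python
import math

def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def is_novel(new_event_tweets, old_event_centroids, threshold=0.8):
    """An event is novel if its centroid is not close to any prior
    event's centroid (vectors and threshold are illustrative)."""
    c = centroid(new_event_tweets)
    return all(cosine(c, old) < threshold for old in old_event_centroids)

old = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0]]          # centroids of past events
breaking = [[0.0, 0.1, 0.9], [0.1, 0.0, 1.0]]     # far from all past events
follow_up = [[0.95, 0.05, 0.0], [1.0, 0.1, 0.05]]  # close to a past event
```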

Novelty detection can also be used to detect emerging trends on social media, e.g. controversies that engulf various brands often start as small local events that are shared on social media and attract attention over a short period of time. How people respond to these events in early stages of development can be a measure of their veracity or controversiality (Nourbakhsh et al., 2015; Liu et al., 2015).

An anomaly in an industry grouping of companies can also be indicative of a company that is disrupting the norm for that industry and the emergence of a new sector or sub-sector. Often known as trail-blazers, these companies innovate faster than their competitors to meet market demands sometimes even before the consumer is aware of their need. As these companies continually evolve their business lines, their core operations are novel outliers from others in the same industry classification that can serve as meaningful signals of transforming industry demands.

2.4. Anomaly as semantic richness

A large portion of text documents that analysts and researchers in the financial sectors consume have a regulatory nature. Annual financial reports, credit agreements, and filings with the U.S. Securities and Exchange Commission (SEC) are some of these types of documents. These documents can be tens or hundreds of pages long, and often include boilerplate language that the readers might need to skip or ignore in order to get to the “meat” of the content. Often, the abnormal clauses found in these documents are buried in standard text so as not to attract attention to the unique phrases.

Shah et al. (2017) used smoothed representations of n-grams in SEC filings to identify boilerplate and abnormal language. They did so by comparing the probability of each n-gram against the company’s previous filings, against other filings in the same sector, and against other filings from companies with similar market cap. The aim was to assist accounting analysts in skipping boilerplate language and focusing their attention on important snippets in these documents.
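A simplified stand-in for this approach flags n-grams of a filing that never occur in a comparison pool of filings; the cited work uses smoothed probabilities rather than this hard zero-count rule:

```python
from collections import Counter

def ngrams(tokens, n=3):
    """All contiguous n-grams of a token sequence."""
    return list(zip(*(tokens[i:] for i in range(n))))

def abnormal_spans(filing, baseline_filings, n=3):
    """n-grams of `filing` unseen in the comparison pool; a crude
    stand-in for the smoothed probabilities of Shah et al. (2017)."""
    baseline = Counter()
    for doc in baseline_filings:
        baseline.update(ngrams(doc.lower().split(), n))
    return [g for g in ngrams(filing.lower().split(), n) if baseline[g] == 0]

# Hypothetical boilerplate pool and a filing with an unusual clause.
pool = [
    "the company is subject to various legal proceedings in the "
    "ordinary course of business",
    "the company is subject to risks arising in the ordinary course "
    "of business",
]
filing = ("the company is subject to an undisclosed settlement in the "
          "ordinary course of business")
flags = abnormal_spans(filing, pool)
```

The flagged trigrams cluster around “an undisclosed settlement,” which is exactly the span an analyst would want surfaced.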

Similar methods can be applied to credit agreements where covenants and clauses that are too common are often ignored by risk analysts and special attention is paid to clauses that “stand out” from similar agreements.

2.5. Anomaly as contextual relevance

Certain types of documents include universal as well as context-specific signals. As an example, consider a given company’s financial reports. The reports may include standard financial metrics such as total revenue, net sales, net income, etc. In addition to these universal metrics, businesses often report their performance in terms of the performance of their operating segments. These segments can be business divisions, products, services, or regional operations. The segments are often specific to the company or its peers. For example, Apple Inc.’s segments might include “iPhone,” “iMac,” “iPad,” and “services.” The same segments will not appear in reports by other businesses.

For many analysts and researchers, operating segments are a crucial part of exploratory or predictive analysis. They use performance metrics associated with these segments to compare the business to its competitors, to estimate its market share, and to project the overall performance of the business in upcoming quarters. Automating the identification and normalization of these metrics can facilitate more insightful analytical research. Since these segments are often specific to each business, supervised models that are trained on a diverse set of companies cannot capture them without overfitting to certain companies. Instead, these segments can be treated as company-specific anomalies.
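A toy sketch of this idea treats terms that recur in one company's reports but never appear in peers' reports as segment candidates. The snippets and threshold are illustrative, and a real system would operate on normalized metrics rather than raw tokens:

```python
from collections import Counter

def segment_candidates(company_docs, peer_docs, min_count=2):
    """Terms frequent in one company's reports but absent from peers',
    treated here as company-specific 'anomalies' (token-level toy)."""
    ours = Counter(w for d in company_docs for w in d.lower().split())
    theirs = Counter(w for d in peer_docs for w in d.lower().split())
    return sorted(w for w, c in ours.items()
                  if c >= min_count and theirs[w] == 0)

# Hypothetical report snippets for a company and its peers.
apple = ["iphone revenue grew while ipad revenue declined",
         "services and iphone drove total revenue"]
peers = ["total revenue grew on strong cloud demand",
         "hardware revenue declined while services grew"]
candidates = segment_candidates(apple, peers)
```

Note the limitation: a segment name that peers also use (“services” above) is filtered out, which is one reason segment identification needs richer signals than raw term contrast.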

3. Anomaly detection via language modeling

Unlike numeric data, text data is not directly machine-readable, and requires some form of transformation as a pre-processing step. In “bag-of-words” methods, this transformation assigns an index number to each word and represents any block of text as an unordered set of these words. A slightly more sophisticated approach chains words into contiguous “n-grams” and represents a block of text as an ordered series of n-grams extracted with a sliding window of size n. Statistical models built on such representations are conventionally known as “language models.”
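The two representations can be sketched in a few lines (tokenization by whitespace is a simplification):

```python
def bag_of_words(text, vocab=None):
    """Index-based bag-of-words: vocab maps each word to an integer id,
    and the text becomes the unordered set of ids it contains."""
    tokens = text.lower().split()
    if vocab is None:
        vocab = {}
    for t in tokens:
        vocab.setdefault(t, len(vocab))
    return vocab, sorted({vocab[t] for t in tokens})

def word_ngrams(text, n=2):
    """Ordered n-grams from a sliding window of size n over the tokens."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

vocab, bow = bag_of_words("net income rose as net sales rose")
bigrams = word_ngrams("net income rose as net sales rose")
```

The bag-of-words form loses both word order and repetition, while the n-gram form preserves local order at the cost of a larger feature space.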

Figure 1. Illustration of a recurrent step in a language model. Excerpted from (Leonard, 2016).

Since the advent of high-powered processors enabled the widespread use of distributed representations, language modeling has rapidly evolved and adapted to these new capabilities. Recurrent neural networks can capture an arbitrarily long sequence of text and perform various tasks such as classification or text generation (Wen et al., 2015). In this new context, language modeling often refers to training a recurrent network that predicts a word in a given sequence of text (Howard and Ruder, 2018). Language models are easy to train because, although they follow a predictive mechanism, they do not need any labeled data and are thus unsupervised.

Figure 1 is a simple illustration of how a neural network composed of recurrent units such as Long Short-Term Memory (LSTM) cells (Hochreiter and Schmidhuber, 1997) can perform language modeling. There are four main components to the network:

  • The input vectors, which represent units (i.e. characters, words, phrases, sentences, paragraphs, etc.) in the input text. Occasionally, these are one-hot vectors that assign a unique index to each particular input. More commonly, these vectors are adapted from representations pre-trained on another corpus, where distributed representations have been inferred either by a simpler auto-encoding process (Mikolov et al., 2013) or by applying the same recurrent model to a baseline corpus such as Wikipedia (Howard and Ruder, 2018).

  • The output vectors, which represent the model’s prediction of the next word in the sequence. Naturally, they have the same dimensionality as the input vectors.

  • The hidden vectors, which are often randomly initialized and learned through backpropagation. Often trained as dense representations, these vectors tend to display characteristics that indicate semantic richness (Peters et al., 2018) and compositionality (Mikolov et al., 2013). While the language model can be used as a text-generation mechanism, the hidden vectors are a valuable side product that is sometimes extracted and reused as augmented features in other machine learning systems (Devlin et al., 2018).

  • The weights of the network and other parameters, which are tuned through backpropagation. These often indicate how each vector in the input or hidden sequence is utilized to generate the output. These parameters play a big role in the way the outputs of neural networks are reverse-engineered or explained to the end user (as an example, see https://tinyurl.com/y56drbnk).
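The interaction of these four components can be sketched with a single recurrent step. For brevity this uses a vanilla recurrent unit rather than the LSTM of Figure 1, with toy dimensions and random weights:

```python
import math
import random

random.seed(0)

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def rnn_step(x, h, Wxh, Whh, Why):
    """One recurrent step: input vector x and previous hidden state h
    produce a new hidden state and a distribution over the next token."""
    h_new = [math.tanh(sum(Wxh[i][j] * x[j] for j in range(len(x))) +
                       sum(Whh[i][j] * h[j] for j in range(len(h))))
             for i in range(len(h))]
    logits = [sum(Why[k][i] * h_new[i] for i in range(len(h_new)))
              for k in range(len(Why))]
    return h_new, softmax(logits)

V, H = 5, 3  # toy vocabulary size and hidden size
rand = lambda r, c: [[random.uniform(-0.5, 0.5) for _ in range(c)]
                     for _ in range(r)]
Wxh, Whh, Why = rand(H, V), rand(H, H), rand(V, H)  # the weight matrices

x = [0.0] * V
x[2] = 1.0        # one-hot input vector for token id 2
h = [0.0] * H     # initial hidden state
h, y = rnn_step(x, h, Wxh, Whh, Why)  # y is a distribution over the vocab
```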

The distributions of any of the above-mentioned components can be studied to mine signals for anomalous behavior in the context of irregularity, error, novelty, semantic richness, or contextual relevance.

3.1. Anomaly in input vectors

As previously mentioned, the input vectors to a text-based neural network are often adapted from publicly-available word vector corpora. In simpler architectures, the network is allowed to back-propagate its errors all the way to the input layer, which might cause the input vectors to be modified. The difference in semantic distributions between the original vectors and the modified vectors can then serve as an anomaly signal.

Analyzing the stability of word vectors when trained on different iterations can also signal anomalous trends (Wendlandt et al., 2018).

3.2. Anomaly in output vectors

As previously mentioned, language models generate a probability distribution over a word (or character) in a sequence. These probabilities can be used to detect transcription or character-recognition errors in a domain-friendly manner. When the language model is trained on financial data, domain-specific trends (such as the use of commas and parentheses in financial metrics) can be captured and accounted for by the network, minimizing the rate of false positives.
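A sketch of this idea with a bigram model: tokens whose surprisal under a domain-trained model exceeds a threshold are flagged as possible transcription errors (the corpus and threshold are illustrative):

```python
import math
from collections import Counter

# Toy domain corpus in which figures follow a "4 . 2 billion" pattern.
corpus = ("revenue was 4 . 2 billion this quarter . "
          "revenue was 3 . 9 billion last quarter .").split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)

def surprisal(prev, word):
    """Negative log of the add-one-smoothed bigram probability."""
    return -math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + V))

def flag_tokens(tokens, threshold=2.5):
    """Tokens whose surprisal under the domain model exceeds a
    threshold (threshold chosen for this toy corpus)."""
    return [w for p, w in zip(tokens, tokens[1:])
            if surprisal(p, w) > threshold]

transcript = "revenue was 4 . 2 billion this quarter".split()
garbled = "revenue wash 4 . 2 billion this quarter".split()
```

Because the model knows the domain's numeric conventions, the comma-and-digit patterns of financial text are not flagged, while the out-of-place token is.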

3.3. Anomaly in hidden vectors

Figure 2. A pre-trained model can be fine-tuned on a new domain, and applied to a classification or prediction task. Excerpted from (Howard and Ruder, 2018).

A recent advancement in text processing is the introduction of fine-tuning methods for neural networks trained on text (Howard and Ruder, 2018). Fine-tuning facilitates the transfer of semantic knowledge from one domain (source) to another domain (target). The source domain is often large and generic, such as web data or the Wikipedia corpus, while the target domain is often specific (e.g. SEC filings). A network is pre-trained on the source corpus such that its hidden representations are enriched. Next, the pre-trained network is re-trained on the target domain, but this time only the final (or top few) layers are tuned, while the parameters in the remaining layers remain “frozen.” The top-most layer of the network can be modified to perform a classification, prediction, or generation task in the target domain (see Figure 2).

Fine-tuning aims to change the distribution of hidden representations in such a way that important information about the source domain is preserved, while idiosyncrasies of the target domain are captured in an effective manner (Ruder and Plank, 2017). A similar process can be used to determine anomalies in documents. As an example, consider a model that is pre-trained on historical documents from a given sector. If fine-tuning the model on recent documents from the same sector dramatically shifts the representations for certain vectors, this can signal an evolving trend.
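A minimal sketch of this signal, assuming toy two-dimensional vectors: terms whose representations move furthest between the pre-trained and fine-tuned models are flagged as potentially evolving trends:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def shifted_terms(before, after, threshold=0.9):
    """Terms whose representation moved most during fine-tuning:
    cosine between pre- and post-fine-tuning vectors below a
    threshold flags a potentially evolving trend (vectors and
    threshold are illustrative)."""
    return sorted(w for w in before if cosine(before[w], after[w]) < threshold)

# Hypothetical representations before and after fine-tuning on
# recent filings from the same sector.
before = {"cloud": [0.9, 0.1], "revenue": [0.5, 0.5], "dividend": [0.2, 0.8]}
after = {"cloud": [0.1, 0.9], "revenue": [0.55, 0.5], "dividend": [0.25, 0.8]}
trending = shifted_terms(before, after)
```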

3.4. Anomaly in weight tensors and other parameters

Models that have interpretable parameters can be used to identify areas of deviation or anomalous content. Attention mechanisms (Vaswani et al., 2017) allow the network to account for certain input signals more than others. The learned attention mechanism can provide insight into potential anomalies in the input. Consider a language model that predicts the social media engagement for a given tweet. Such a model can be used to distinguish between engaging and information-rich content versus clickbait, bot-generated, propagandistic, or promotional content by exposing how, for these categories, engagement is associated with attention to certain distributions of “trigger words.”
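Reading attention weights as an anomaly signal can be sketched as follows; the alignment scores below are hypothetical stand-ins for what a trained attention layer would produce for this input:

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def attention_triggers(tokens, scores, k=2):
    """Top-k tokens by attention weight; `scores` stand in for the
    learned alignment scores a trained model would compute."""
    weights = softmax(scores)
    ranked = sorted(zip(tokens, weights), key=lambda t: t[1], reverse=True)
    return [tok for tok, _ in ranked[:k]]

tweet = "you won't believe this one weird trick".split()
scores = [0.1, 2.0, 3.5, 0.2, 0.3, 3.0, 2.8]  # hypothetical learned scores
triggers = attention_triggers(tweet, scores)
```

Inspecting which tokens receive the most attention for a predicted-engagement class is one way to surface the “trigger words” associated with clickbait-like content.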

  • Input vectors; anomaly as novelty: identifying how perspectives on ESG factors are changing over time in financial reports, by retraining the network on year-over-year data and observing unstable word vectors.

  • Output vectors; anomaly as error: identifying errors in earnings call transcripts, by analyzing the emission probability of observed words.

  • Hidden vectors: determining which hidden vectors diverge from others.

  • Weights and other parameters: observing how a model attends to words in the input.

Table 1. Four scenarios for anomaly detection on text data using signals from various layers and parameters in a language model.

Table 1 lists four scenarios for using the various layers and parameters of a language model in order to perform anomaly detection from text.

4. Challenges and Future Research

Like many other domains, in the financial domain, the application of language models as a measurement for semantic regularity of text bears the challenge of dealing with unseen input. Unseen input can be mistaken for anomaly, especially in systems that are designed for error detection. As an example, a system that is trained to correct errors in an earnings call transcript might treat named entities such as the names of a company’s executives, or a recent acquisition, as anomalies. This problem is particularly prominent in fine-tuned language models, which are pre-trained on generic corpora that might not include domain-specific terms.

When anomalies are of a malicious nature, such as when abnormal clauses are slipped into credit agreements, the anomalous content is deliberately crafted to appear normal. This makes the task of distinguishing it from genuinely normal language more difficult.

Alternatively, in the case of language used by executives in company presentations such as earnings calls, the data may be noisy due to the large degree of variability in the personalities and linguistic patterns of different leaders. This variability can resemble genuine anomalies, making true anomalies harder to isolate.

Factors related to market interactions and competitive behavior can also impact the effectiveness of anomaly-detection models. In detecting the emergence of a new industry sector, it may be challenging for a system to detect novelty when a collection of companies, rather than a single company, behaves in an anomalous way. The collective scenario may be the more common one in the real world, as companies closely monitor and mimic the innovations of their competitors. The exact notion of anomaly can also vary by sector and point in time. For example, in the technology sector, the norm today is one of continuous innovation and technological advancement.

Additionally, certain types of anomaly can interact and become difficult for systems to distinguish. As an example, a system trained to identify the operating segments of a company relies on distinguishing information that is specific to the company from information that is common across different companies. As a result, it might mistakenly identify the names of the company’s board of directors or its office locations as operating segments.

Traditional machine learning models have previously tackled the above challenges, and solutions are likely to emerge in the neural paradigms as well. Any future research in these directions will have to account for the impact of such solutions on the reliability and explainability of the resulting models and their robustness against adversarial data.

5. Conclusion

Anomaly detection from text can have numerous applications in finance, including risk detection, predictive analysis, error correction, and peer detection. We have outlined various perspectives on how anomaly can be interpreted in the context of finance, and corresponding views on how language modeling can be used to detect such aspects of anomalous content. We hope that this paper lays the groundwork for establishing a framework for understanding the opportunities and risks associated with these methods when applied in the financial domain.


  • P. Addo, D. Guegan, and B. Hassani (2018) Credit risk analysis using machine and deep learning models. Risks 6 (2), pp. 38. Cited by: §1.
  • L. Cheng (2013) Unsupervised topic discovery by anomaly detection. Cited by: §1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: 3rd item.
  • E. Eskin (2000) Detecting errors within a corpus using anomaly detection. In 1st Meeting of the North American Chapter of the Association for Computational Linguistics, External Links: Link Cited by: §1, §2.1.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780. Cited by: §3.
  • J. Howard and S. Ruder (2018) Fine-tuned language models for text classification. CoRR abs/1801.06146. Cited by: Figure 2, 1st item, §3.3, §3.
  • C. F. Lee and O. M. Rui (2000) Does trading volume contain information to predict stock returns? evidence from china’s stock markets. Review of Quantitative Finance and Accounting 14 (4), pp. 341–360. Cited by: §1.
  • N. Leonard (2016) Language modeling a billion words. Note: http://torch.ch/blog/2016/07/25/nce.html Cited by: Figure 1.
  • Q. Li, A. Nourbakhsh, S. Shah, and X. Liu (2017) Real-time novel event detection from social media. In 33rd IEEE International Conference on Data Engineering, ICDE 2017, San Diego, CA, USA, April 19-22, 2017, pp. 1129–1139. Cited by: §2.3.
  • X. Liu, A. Nourbakhsh, Q. Li, R. Fang, and S. Shah (2015) Real-time rumor debunking on twitter. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management, CIKM 2015, Melbourne, VIC, Australia, October 19 - 23, 2015, pp. 1867–1870. Cited by: §2.3.
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119. Cited by: 1st item, 3rd item.
  • A. Nourbakhsh, X. Liu, Q. Li, and S. Shah (2017) Mapping the echo-chamber: detecting and characterizing partisan networks on twitter. In Proceedings of the 2017 International Conference on Social Computing, Behavioral-Cultural Modeling, & Prediction and Behavior Representation in Modeling and Simulation. Cited by: §2.2.
  • A. Nourbakhsh, X. Liu, S. Shah, R. Fang, M. M. Ghassemi, and Q. Li (2015) Newsworthy rumor events: A case study of twitter. In IEEE International Conference on Data Mining Workshop, ICDMW 2015, Atlantic City, NJ, USA, November 14-17, 2015, pp. 27–32. Cited by: §2.3.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. arXiv preprint arXiv:1802.05365. Cited by: 3rd item.
  • S. Ruder and B. Plank (2017) Learning to select data for transfer learning with Bayesian optimization. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 372–382. External Links: Link, Document Cited by: §3.3.
  • P. Samanta and B. B. Chaudhuri (2013) A simple real-word error detection and correction using local word bigram and trigram. In Proceedings of the 25th Conference on Computational Linguistics and Speech Processing (ROCLING 2013), Kaohsiung, Taiwan, pp. 211–220. External Links: Link Cited by: §1, §2.1.
  • S. Shah, D. Dorr, K. Al-Kofahi, and J. Sisk (2017) Systems and methods for determining atypical language. Patent grant. Cited by: §2.4.
  • P. D. Turney and P. Pantel (2010) From frequency to meaning: vector space models of semantics. CoRR abs/1003.1141. Cited by: §1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 5998–6008. Cited by: §3.4.
  • E. M. Vecchi, M. Baroni, and R. Zamparelli (2011) (Linear) maps of the impossible: capturing semantic anomalies in distributional space. In Proceedings of the Workshop on Distributional Semantics and Compositionality, pp. 1–9. Cited by: §1.
  • T. H. Wen, M. Gasic, D. Kim, N. Mrksic, P. Su, D. Vandyke, and S. Young (2015) Stochastic language generation in dialogue using recurrent neural networks with convolutional sentence reranking. Cited by: §3.
  • L. Wendlandt, J. K. Kummerfeld, and R. Mihalcea (2018) Factors influencing the surprising instability of word embeddings. arXiv preprint arXiv:1804.09692. Cited by: §2.2, §3.1.
  • J. West and M. Bhattacharya (2016) Intelligent financial fraud detection: a comprehensive review. Computers & Security 57, pp. 47 – 66. External Links: ISSN 0167-4048, Document, Link Cited by: §1.
  • F. Zhao (2017) Hanging on Every Word: Natural Language Processing Unlocks New Frontier in Corporate Earnings Sentiment Analysis. Note: https://www.valuewalk.com/2017/09/natural-language-processing-corporate-earnings-sentiment/ Cited by: §2.2.