Detection of Accounting Anomalies in the Latent Space using Adversarial Autoencoder Neural Networks

08/02/2019 · Marco Schreyer et al. · PwC, Deutsche Bundesbank, University of St. Gallen

The detection of fraud in accounting data is a long-standing challenge in financial statement audits. Nowadays, the majority of applied techniques refer to handcrafted rules derived from known fraud scenarios. While fairly successful, these rules exhibit the drawback that they often fail to generalize beyond known fraud scenarios and that fraudsters gradually find ways to circumvent them. In contrast, more advanced approaches inspired by the recent success of deep learning often lack seamless interpretability of the detected results. To overcome this challenge, we propose the application of adversarial autoencoder networks. We demonstrate that such artificial neural networks are capable of learning a semantically meaningful representation of real-world journal entries. The learned representation provides a holistic view on a given set of journal entries and significantly improves the interpretability of detected accounting anomalies. We show that such a representation, combined with the network's reconstruction error, can be utilized as an unsupervised and highly adaptive anomaly assessment. Experiments on two datasets and initial feedback received from forensic accountants underpin the effectiveness of the approach.


1. Introduction

The Association of Certified Fraud Examiners estimates in its ”Global Study on Occupational Fraud and Abuse 2018” (ACFE, 2018) that organizations lose 5% of their annual revenues to fraud. The term ”fraud” refers to ”the abuse of one’s occupation for personal enrichment through the deliberate misuse of an organization’s resources or assets” (Wells, 2017). A similar study, conducted by the auditors of PwC, revealed that approx. 30% of the respondents experienced losses between $100,000 and $5 million due to fraud (PwC, 2018). The study also showed that financial statement fraud caused by far the highest median loss of the surveyed fraud schemes.¹

¹The ACFE study encompasses an analysis of 2,690 cases of occupational fraud that occurred in 125 countries and were investigated between January 2016 and October 2017. The PwC study encompasses over 7,228 respondents that experienced economic crime in the last 24 months.

Figure 1. Hierarchical view of an Accounting Information System (AIS) that records distinct layers of abstraction, namely (1) the business process, (2) the accounting, and (3) the technical journal entry information in designated tables.

At the same time, organizations accelerate the digitization of business processes (Markovitch and Willmott, 2014), affecting in particular Accounting Information Systems (AIS) or, more generally, Enterprise Resource Planning (ERP) systems. These systems steadily collect vast quantities of business process and accounting data at a granular level. This holds in particular for the journal entries of an organization recorded in its general ledger and sub-ledger accounts. SAP, one of the most prominent enterprise software providers, estimates that approx. 77% of the world’s transaction revenue touches one of their ERP systems (SAP, 2019). Figure 1 depicts a hierarchical view of an AIS recording journal entry information in designated database tables.

Figure 2. The adversarial autoencoder architecture (Makhzani et al., 2015), applied to learn the characteristics of journal entries and to partition the entries into semantically meaningful groups. The architecture imposes an arbitrary prior distribution $p(z)$, e.g., a mixture of Gaussians, on the latent code vector $z$. With progressing training, the encoder learns an aggregated posterior distribution $q(z)$ that matches the imposed prior in order to fool the discriminator network.

In order to conduct fraud, perpetrators need to deviate from the regular system usage or posting pattern. Such deviations are recorded by a very limited number of ”anomalous” journal entries and their respective attribute values. To detect potentially fraudulent activities, international audit standards require the direct assessment of such journal entries (AICPA, 2002), (IFAC, 2009a). Nowadays, auditors and forensic accountants apply a wide range of data analysis techniques to examine journal entries during an audit. These techniques often encompass rule-based analyses referred to as ”red-flag” tests (e.g., postings late at night, multiple vendor bank account changes) as well as statistical analyses (e.g., Benford’s Law, time series evaluation). Nevertheless, the detection of traces of fraud in up to several hundred million journal entries remains a labor-intensive task requiring significant time and resources.
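Such ”red-flag” tests are straightforward to express as code. The following minimal sketch illustrates two of the rules mentioned above; the entry fields, thresholds, and values are hypothetical and chosen for illustration only:

```python
from datetime import time

# Illustrative journal entry records; the field names are hypothetical
# stand-ins, not the SAP schema discussed in this paper.
entries = [
    {"id": 1, "user": "ACC01", "posted_at": time(14, 30), "vendor_bank_changes": 0},
    {"id": 2, "user": "ACC02", "posted_at": time(23, 55), "vendor_bank_changes": 0},
    {"id": 3, "user": "ACC03", "posted_at": time(10, 15), "vendor_bank_changes": 4},
]

def late_night_posting(entry, cutoff=time(22, 0)):
    """Flag postings recorded after a late-night cutoff."""
    return entry["posted_at"] >= cutoff

def repeated_bank_changes(entry, threshold=3):
    """Flag entries associated with multiple vendor bank account changes."""
    return entry["vendor_bank_changes"] >= threshold

flagged = sorted(
    e["id"] for e in entries
    if late_night_posting(e) or repeated_bank_changes(e)
)
```

Note that such rules only capture deviations that were anticipated when the rule was written, which motivates the learned, unsupervised assessment proposed in this work.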

Driven by the recent technological advances in artificial intelligence (LeCun et al., 2015), deep neural network-based techniques (e.g., deep autoencoder neural networks) have emerged in the field of forensic accounting and financial statement audits (Schreyer et al., 2017). Such approaches, however, often lack a seamless interpretability of the detected ”anomalous” journal entries selected for a detailed audit. This is a major drawback since auditors are required to test a representative sample of journal entries in order to reduce the ”sampling risk” (IFAC, 2009b) of an audit. Ultimately, the testing of an individual entry while ignoring another must be defensible in court (Hall et al., 2002).

To overcome this challenge, we propose the application of Adversarial Autoencoder Neural Networks (AAEs) (Makhzani et al., 2015). We demonstrate that such adversarial architectures are capable of learning a semantically meaningful representation of journal entries. The learned representation allows for an improved interpretability of the entries' underlying generative processes as well as of detected ”anomalous” journal entries. In summary, we present the following contributions:

  • We illustrate that AAEs can be used to learn a representation of journal entries observable in real-world ERP systems that partitions the entries into semantically meaningful groups;

  • We demonstrate how such a learned representation can be used by a human auditor or forensic accountant to sample journal entries for an audit in an interpretable manner;

  • We show that the learned representation combined with the magnitude of an entry’s reconstruction error can be interpreted as a highly adaptive anomaly assessment of journal entries.

We envision this deep learning-based methodology as an important supplement to the auditor's and forensic accountant's toolbox (Pedrosa and Costa, 2014). The remainder of this work is structured as follows: In Section 2 we provide an overview of the related work. Section 3 follows with a description of the adversarial autoencoder network architecture and presents the proposed methodology to detect accounting anomalies. The experimental setup and results are outlined in Section 4 and Section 5. In Section 6 the paper concludes with a summary of the current work and future directions of research. An implementation of the proposed methodology is available at https://github.com/GitiHubi/deepAD.

2. Related Work

The literature survey presented hereafter focuses on (1) the detection of fraudulent activities in Enterprise Resource Planning (ERP) data and (2) the detection of financial fraud using deep Autoencoder Neural Networks (AENs) (Hinton and Salakhutdinov, 2006) as well as Generative Adversarial Networks (GANs) (Li et al., 2017).

2.1. Fraud Detection in Accounting Data

The task of detecting fraud and accounting anomalies has been studied both by practitioners (Wells, 2017) and academia (Amani and Fadlalla, 2017). Several references describe different fraud schemes and ways to detect unusual and ”creative” accounting practices (Singleton and Singleton, 2006).

The forensic analysis of journal entries emerged with the advent of Enterprise Resource Planning (ERP) systems and the increased volume of data recorded by such systems. Bay et al. in (Bay et al., 2006) use Naive Bayes methods to identify suspicious general ledger accounts by evaluating attributes derived from journal entries that measure unusual general ledger account activity. Their approach is enhanced by McGlohon et al., who apply link analysis to identify (sub-)groups of high-risk general ledger accounts (McGlohon et al., 2009). Khan et al. in (Khan and Corney, 2009) and (Khan et al., 2010) create transaction profiles of SAP ERP users. The profiles are derived from journal entry based user activity patterns recorded in two SAP R/3 ERP systems in order to detect suspicious user behavior and segregation-of-duties violations. Similarly, Islam et al. in (Islam et al., 2010) use SAP R/3 system audit logs to detect known fraud scenarios and collusion fraud via a ”red-flag” based matching of fraud scenarios. Debreceny and Gray in (Debreceny and Gray, 2010) analyze dollar amounts of journal entries obtained from 29 US organizations. In their work, they search for violations of Benford’s Law (Benford, 1938), anomalous digit combinations, as well as unusual temporal patterns such as end-of-year postings. More recently, Poh-Sun et al. in (Seow, Poh-Sun; Sun, Gary; Themin, 2016) demonstrate the generalization of the approach by applying it to journal entries obtained from 12 non-US organizations. Jans et al. in (Jans et al., 2010) use latent class clustering to conduct a uni- and multivariate clustering of SAP ERP purchase order transactions. Transactions significantly deviating from the cluster centroids are flagged as anomalous and proposed for a detailed review by auditors. The approach is enhanced in (Jans et al., 2011) by means of process mining to detect deviating process flows in an organization's procure-to-pay process. Argyrou et al. in (Argyrou, 2012) evaluate self-organizing maps to identify ”suspicious” journal entries of a shipping company. In their work, they calculate the Euclidean distance between a journal entry and the code vector of a self-organizing map's best matching unit. In subsequent work, they estimate optimal sampling thresholds of journal entry attributes derived from extreme value theory (Argyrou, 2013).

Concluding from the reviewed literature, the majority of references either draw from accounting and forensic knowledge about historical fraud schemes or apply non-deep-learning techniques to detect financial fraud. However, driven by the recent success of deep learning techniques, which are potentially also misused by fraudsters, we see a high demand for auditors to likewise enhance their examination methodologies.

2.2. Anomaly Detection using Deep Learning

Nowadays, deep learning inspired methods are increasingly used for novelty and anomaly detection in financial data (Chalapathy and Chawla, 2019; Pimentel et al., 2014).

Renström and Holmsten in (Renström and Holmsten, 2018) evaluate AENs to detect fraud in credit card transactions. Similarly, Kazemi and Zarrabi (Kazemi and Zarrabi, 2017) and Sweers et al. (Sweers et al., 2018) train and evaluate a variety of variational AEN architectures. Pumsirirat and Yan in (Pumsirirat and Yan, 2018) compare the anomaly detection performance of AENs on three datasets of credit card transactions. Wedge et al. (Wedge et al., 2017) use AENs to learn behavioral features from historical credit card transactions. Paula et al. in (Paula et al., 2017) use AENs in export controls to detect traces of money laundering and fraud by analyzing volumes of exported goods. Similarly, Schreyer et al. in (Schreyer et al., 2017) utilize the reconstruction error of deep AENs to detect anomalous journal entries in two datasets of real-world accounting data.

More recently, GANs have been utilized in the context of fraud detection. Fiore et al. in (Fiore et al., 2017) train such networks to generate mimicked anomalies, which are used to augment the training data of credit card fraud detection classifiers. Choi et al. in (Choi and Jang, 2018) train ensembles of generative models to successfully detect anomalies in credit card transactions. Zheng et al. in (Zheng et al., 2018b) train LSTM-AENs in an adversarial training setup to detect fraudulent credit card transactions. In another study, Zheng et al. in (Zheng et al., 2018a) propose generative denoising GANs to detect telecommunication fraud in the transactions of two financial institutions.

To the best of our knowledge, this work presents the first deep-learning inspired methodology trained in an adversarial training setup to detect anomalous journal entries in real-world accounting data.

3. Methodology

To detect anomalous journal entries, one first has to define ”normality” with respect to accounting data. We assume that the majority of journal entries recorded within an organization's ERP system relate to regular day-to-day business activities. In order to conduct fraud, perpetrators need to deviate from the ”normal”. Such deviating behavior will be recorded by a very limited number of journal entries and their respective attribute values. We refer to journal entries exhibiting such deviating attribute values as accounting anomalies.

3.1. Accounting Anomaly Classes

When conducting a detailed examination of real-world journal entries recorded in large-scale ERP systems, two characteristics can be observed: First, journal entry attributes exhibit a high variety of distinct attribute values, e.g., due to the high number of vendors or distinct posting amounts; and second, journal entries exhibit strong dependencies between certain attribute values, e.g., a document type that is usually posted in combination with a certain general ledger account. Derived from this observation, and similarly to Breunig et al. in (Breunig et al., 2000), we distinguish two classes of anomalous journal entries, namely global and local anomalies:

Global accounting anomalies are journal entries that exhibit unusual or rare individual attribute values. Such anomalies usually relate to skewed attributes, e.g., rarely used ledgers or unusual posting times. Traditionally, ”red-flag” tests performed by auditors during an annual audit are designed to capture this type of anomaly. However, such tests often result in a high volume of false-positive alerts due to rare but regular events, such as reverse postings, provisions, and year-end adjustments, that are usually associated with a low fraud risk (Schreyer et al., 2017). Furthermore, when consulting with auditors and forensic accountants, ”global” anomalies often refer to ”error” rather than ”fraud”.

Local accounting anomalies are journal entries that exhibit an unusual or rare combination of attribute values while their individual attribute values occur quite frequently, e.g., unusual combinations of general ledger accounts or user accounts used by several accounting departments. This type of anomaly is significantly more difficult to detect since perpetrators intend to disguise their activities by imitating a regular activity pattern. As a result, such anomalies usually pose a high fraud risk since they correspond to processes and activities that may not be conducted in compliance with organizational standards.

We aim to learn a model that detects both classes of anomalous journal entries in an unsupervised manner. Thereby, the learned model should partition the population of journal entries into semantically meaningful classes that allow for an increased interpretability of the detection results. To achieve this two-fold objective, we utilize Adversarial Autoencoders (AAEs), a deep neural network architecture introduced by Makhzani et al. (Makhzani et al., 2015). In the following, we provide preliminaries of the Autoencoder Neural Networks (AENs) and Generative Adversarial Networks (GANs) that constitute the AAE. A more detailed presentation can be found in (Goodfellow et al., 2016).

3.2. Autoencoder Neural Networks

Formally, let $X = \{x^1, x^2, \ldots, x^N\}$ denote a set of $N$ journal entries, where each journal entry $x^i$ consists of $M$ attributes $x^i = (x^i_1, \ldots, x^i_M)$. Thereby, $x^i_j$ denotes the $j$-th attribute of the $i$-th journal entry. The individual attributes describe the accounting-specific details of an entry, e.g., its fiscal year, posting type, posting date, amount, and general ledger account. Hinton and Salakhutdinov in (Hinton and Salakhutdinov, 2006) introduced AENs, a special type of feed-forward multi-layer network that can be trained to reconstruct its input. Formally, AENs are comprised of two nonlinear functions, referred to as the encoder and the decoder network (Rumelhart et al., 1985). The encoder function $f_\theta$ maps the input $x^i$ to a code vector $z^i = f_\theta(x^i)$, referred to as its latent space representation, where usually $\dim(z) \ll \dim(x)$. This latent representation is then mapped back by the decoder function $g_\psi$ to a reconstruction $\hat{x}^i = g_\psi(z^i)$ of the original input. In an attempt to achieve $\hat{x}^i \approx x^i$, the AEN is trained to reconstruct a given journal entry as faithfully as possible. Thereby, the training objective is to learn a set of optimal model parameters $\theta^*, \psi^*$ by minimizing the AEN's reconstruction loss, formally denoted as:

$$\theta^*, \psi^* = \arg\min_{\theta, \psi} \; \frac{1}{N} \sum_{i=1}^{N} \big\| x^i - g_\psi(f_\theta(x^i)) \big\|_2^2 \qquad (1)$$
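The per-entry reconstruction error underlying Eq. (1) can be sketched in a few lines of plain Python. The attribute vectors and stand-in ”reconstructions” below are illustrative; in practice the reconstruction is produced by the trained decoder network:

```python
def reconstruction_loss(x, x_hat):
    """Squared L2 reconstruction error ||x - x_hat||^2 for one entry."""
    return sum((xi - xhi) ** 2 for xi, xhi in zip(x, x_hat))

# Stand-in reconstructions: a trained AEN reproduces a regular one-hot
# encoded entry almost exactly, while an unusual entry reconstructs poorly.
x_regular = [1.0, 0.0, 0.0, 1.0]
x_hat_regular = [0.9, 0.1, 0.1, 0.9]
x_rare = [0.0, 1.0, 1.0, 0.0]
x_hat_rare = [0.6, 0.4, 0.3, 0.7]

loss_regular = reconstruction_loss(x_regular, x_hat_regular)
loss_rare = reconstruction_loss(x_rare, x_hat_rare)
```

Summing (or averaging) this per-entry loss over all $N$ entries yields the training objective minimized in Eq. (1).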

3.3. Generative Adversarial Neural Networks

Goodfellow et al. introduced GANs in (Goodfellow et al., 2014), a framework for training deep generative models using a mini-max game. The objective is to learn a generator distribution $p_G(x)$ that matches the real data distribution $p_{data}(x)$ of journal entries. Instead of trying to explicitly assign a probability to every $x$ in the data distribution, the GAN learns the parameters of a generator network $G$ that generates samples from $p_G(x)$ by transforming a noise variable $z \sim p(z)$ into a sample $G(z)$. Thereby, the generator is trained by playing against an adversarial discriminator network $D$ that aims to distinguish between samples from the true data distribution $p_{data}(x)$ and the generator's distribution $p_G(x)$. Both networks establish a min-max adversarial game, a solution of which can be expressed as:

$$\min_{G} \max_{D} \; \mathbb{E}_{x \sim p_{data}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big] \qquad (2)$$
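The value of the min-max objective of Eq. (2) can be estimated from discriminator outputs on real and generated samples. A minimal sketch, using illustrative discriminator probabilities:

```python
import math

def gan_value(d_real, d_fake):
    """Monte-Carlo estimate of the objective of Eq. (2):
    E[log D(x)] + E[log(1 - D(G(z)))], given discriminator outputs."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# At the game's equilibrium the discriminator is maximally confused,
# i.e., D(x) = 0.5 everywhere, yielding a value of 2 * log(0.5).
v = gan_value(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
```

The discriminator maximizes this value while the generator minimizes it; in the AAE, the encoder plays the role of the generator on the latent space.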

3.4. Adversarial Autoencoders

The AAE architecture, as illustrated in Fig. 2, extends the concept of AENs by imposing an arbitrary prior on the AEN's latent space using a GAN training setup (Makhzani et al., 2015). This is achieved by training the AAE jointly in two phases: (1) a reconstruction phase and (2) an adversarial regularization phase.

In the reconstruction phase, the AAE's encoder network is trained to learn an aggregated posterior distribution $q(z)$ of the journal entries over the latent code vector $z$. Thereby, the learned posterior distribution corresponds to a compressed representation of the journal entry characteristics. Similarly to AENs, the decoder network of the AAE utilizes the learned latent code vector representations to reconstruct the journal entries as faithfully as possible, minimizing the AAE's reconstruction error.

In the regularization phase, an adversarial training setup is applied where the encoder network of the AAE functions as the generator network. In addition, a discriminator network is attached on top of the learned latent code vector $z$. Similarly to GANs, the discriminator network of the AAE is trained to distinguish samples of a prior distribution $p(z)$ imposed on $z$ from the learned aggregated posterior distribution $q(z)$. In contrast, the encoder network is trained to learn a posterior distribution that fools the discriminator into thinking that the samples drawn from $q(z)$ originate from the imposed prior distribution $p(z)$.

3.5. Accounting Anomaly Detection

Figure 3. Exemplary distribution of the ’account key’ (technically: ’KTOSL’) attribute values (left), the log-normalized ’local’ and ’foreign currency amount’ (technically: ’DMBTR’ and ’WRBTR’) attribute values (center), as well as the ’posting key’ (technically: ’BSCHL’) and ’general ledger account’ (technically: ’HKONT’) (right) observable in Dataset B.

In order to detect interpretable accounting anomalies in real-world ERP datasets, we propose a novel anomaly score utilizing the introduced AAE architecture. The score builds on the regularization applied throughout the AAE training process, namely the reconstruction error loss, denoted by Eq. (1), and the adversarial loss, denoted by Eq. (2), as described in the following.

The reconstruction loss promotes the AAE to learn a set of non-overlapping latent journal entry representations. However, this may result in a highly ”fractured” latent space in which deviating representations are learned for similar journal entries. The additionally applied adversarial loss prevents this fracturing problem: it forces the learned representations to reside within the high probability density regions of the imposed prior distribution $p(z)$. To partition the latent space into semantic regions, we impose a multi-modal prior, e.g., a mixture of Gaussians. Thereby, the interaction of both regularizing losses forces the AAE to learn groups of semantically similar journal entries located in close spatial proximity to the modes of the imposed prior.

As a result, with progressing training the AAE learns a model that disentangles the underlying generative processes of journal entries in the latent space. Each group of learned representations corresponds to a distinct generative process of journal entries, e.g., depreciation postings or vendor payment postings. To detect potential accounting anomalies, we investigate the individual entries of each group in terms of potential ”violations” of one of the two applied regularizing losses. We hypothesize that anomalous journal entries can be captured by either (1) their latent divergence from the modes of the imposed prior or (2) an increased reconstruction error. Thereby, the type of violation also reveals the anomaly class of the investigated entry, as described in the following:

Mode Divergence (MD): Journal entries that exhibit anomalous attribute values (global anomalies) result in an increased divergence from the imposed multi-modal prior; in this work, the divergence to the modes of an imposed mixture of multivariate isotropic Gaussians $p(z) = \sum_{k=1}^{\tau} \pi_k \, \mathcal{N}(\mu_k, \sigma^2 I)$, where $\mu_k$ denotes the mode of the $k$-th Gaussian. Throughout the AAE training, the entries will be ”pushed” towards the high probability density regions of the prior by the regularization. In order to be able to discriminate between the imposed prior and the learned aggregated posterior, the AAE aims to keep the majority of the entries within the high-density regions (modes) of the prior. In contrast, representations that correspond to rare or anomalous journal entries will tend to differ from the imposed modes and be placed in the prior's low-density regions. We use this characteristic and obtain an entry's mode divergence as the Euclidean distance of the entry's learned representation $z^i = f_\theta(x^i)$ to its closest mode, formally $md^i = \min_k \| f_{\theta^*}(x^i) - \mu_k \|_2$ under optimal model parameters $\theta^*$. Finally, we calculate the normalized mode divergence as expressed by:

$$MD(x^i) = \frac{md^i - md_{min}}{md_{max} - md_{min}} \qquad (3)$$

where $md_{min}$ and $md_{max}$ denote the min- and max-values of the mode divergences obtained over all journal entries and their respective closest modes.
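The mode divergence of Eq. (3) reduces to a nearest-mode distance followed by a min-max normalization. A minimal plain-Python sketch with illustrative two-dimensional latent representations and modes:

```python
import math

def mode_divergence(z, modes):
    """Euclidean distance of a latent representation z to its closest mode."""
    return min(math.dist(z, mu) for mu in modes)

def min_max_normalize(values):
    """Min-max normalization as applied in Eq. (3) and Eq. (4)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Two imposed modes and three latent representations; the third entry
# sits in a low-density region between the modes (values illustrative).
modes = [(0.0, 0.0), (4.0, 0.0)]
latents = [(0.1, 0.0), (3.9, 0.1), (2.0, 2.0)]

md = [mode_divergence(z, modes) for z in latents]
md_norm = min_max_normalize(md)
```

After normalization, the entry located far from every mode receives a mode divergence of 1.0, while the entry closest to a mode receives 0.0.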

Reconstruction Error (RE): Journal entries that exhibit anomalous attribute value co-occurrences (local anomalies) tend to result in an increased reconstruction error (Schreyer et al., 2017). This is caused by the compression capability of the AAE architecture: anomalous, and therefore unique, attribute co-occurrences exhibit an increased probability of getting lost in the encoder's ”lossy” compression. As a result, their low-dimensional representations overlap with those of regular entries in the latent space and are not reconstructed correctly by the decoder. Formally, we obtain the reconstruction error of each entry $x^i$ and its reconstruction $\hat{x}^i$ as the squared difference $re^i = \| x^i - \hat{x}^i \|_2^2$ under optimal model parameters $\theta^*, \psi^*$. Finally, we calculate the normalized reconstruction error as expressed by:

$$RE(x^i) = \frac{re^i - re_{min}}{re_{max} - re_{min}} \qquad (4)$$

where $re_{min}$ and $re_{max}$ denote the min- and max-values of the reconstruction errors obtained over all journal entries.

Anomaly Score (AS): Quantifying both characteristics for a given journal entry, we can reasonably conclude (1) whether the entry is anomalous and (2) whether it was created by a ”regular” business activity. To detect global and local accounting anomalies in real-world audit scenarios, we propose to score each journal entry by the combination of its normalized reconstruction error $RE(x^i)$ and its normalized mode divergence $MD(x^i)$, given by:

$$AS(x^i) = \alpha \cdot RE(x^i) + (1 - \alpha) \cdot MD(x^i) \qquad (5)$$

for each individual journal entry $x^i$, given optimal model parameters and the entry's closest mode. We introduce $\alpha \in [0, 1]$ as a factor to balance both characteristics.
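The anomaly score of Eq. (5) can then be computed per entry. In the sketch below, the score values and the choice $\alpha = 0.5$ are illustrative, not taken from the experiments:

```python
def anomaly_score(re_norm, md_norm, alpha=0.5):
    """Anomaly score of Eq. (5): a convex combination of the normalized
    reconstruction error RE and the normalized mode divergence MD.
    The value alpha = 0.5 is an illustrative choice."""
    return alpha * re_norm + (1.0 - alpha) * md_norm

# A local anomaly: high reconstruction error, low mode divergence.
local = anomaly_score(re_norm=0.9, md_norm=0.1)
# A global anomaly: low reconstruction error, high mode divergence.
global_ = anomaly_score(re_norm=0.1, md_norm=0.9)
# A regular entry scores low on both characteristics.
regular = anomaly_score(re_norm=0.05, md_norm=0.05)
```

Both anomaly classes thus receive a high score through different terms, while regular entries score low on both.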

4. Experimental Setup

In this section, we describe the experimental setup and model training. We evaluate the anomaly detection performance of the proposed scoring based on two datasets of journal entries.

Figure 4. Exemplary AAE latent space distribution of dataset B with progressing network training: imposed prior distribution consisting of a mixture of Gaussians (left), learned aggregated posterior distribution after 100 training epochs (center), and learned aggregated posterior distribution after 2,000 training epochs (right).

4.1. Datasets and Data Preparation

In general, SAP ERP systems record journal entries and their corresponding attributes predominantly in two database tables: (1) the table ”Accounting Document Headers” (technically: ”BKPF”) contains the meta-information of a journal entry, such as document id, type, date, time, or currency, while (2) the table ”Accounting Document Segments” (technically: ”BSEG”) contains the entry details, such as posting key, general ledger account, debit-credit information, or posting amount. In the context of this work, we extract a subset of the most discriminative journal entry attributes of the ”BKPF” and ”BSEG” table.

In our experiments we use two datasets of journal entries: a real-world and a synthetic dataset, referred to as dataset A and dataset B in the following. Dataset A is an extract of an SAP ERP instance and encompasses the entire population of journal entries of a single fiscal year.² Dataset B is an excerpt of the synthetic dataset presented in (Lopez-Rojas et al., 2016).³

²In compliance with strict data privacy regulations, all journal entry attributes of dataset A have been anonymized using an irreversible one-way hash function during the data extraction process. To ensure data completeness, the journal entry based general ledger balances were reconciled against the standard SAP trial balance reports, e.g., the SAP ’RFBILA00’ report.

³The original dataset is publicly available via the Kaggle predictive modeling and analytics competitions platform: https://www.kaggle.com/ntnu-testimon/paysim1

The majority of attributes recorded in ERP systems correspond to categorical (discrete) variables, e.g., posting date, account, posting type, currency. We pre-process the categorical journal entry attributes to obtain a binary (”one-hot” encoded) representation of each journal entry.
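The described ”one-hot” pre-processing can be sketched as follows; the attribute names and vocabularies are illustrative stand-ins for the real ERP attributes:

```python
def one_hot_encode(entry, vocab):
    """Binary ('one-hot') encoding of a journal entry's categorical
    attributes; attribute names and values are illustrative."""
    encoded = []
    for attr, values in vocab.items():
        encoded.extend(1 if entry[attr] == v else 0 for v in values)
    return encoded

# Per-attribute vocabularies derived from the full population of entries.
vocab = {
    "posting_key": ["40", "50", "31"],
    "currency": ["EUR", "USD"],
}
entry = {"posting_key": "50", "currency": "EUR"}
x = one_hot_encode(entry, vocab)
```

The encoded dimensionality of each entry is the sum of the per-attribute vocabulary sizes, which explains the high input dimensions reported for both datasets below.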

To allow for a detailed analysis and quantitative evaluation of the experiments, we inject a small fraction of synthetic global and local anomalies into both datasets. Similar to real audit scenarios, this results in a highly unbalanced class distribution of anomalous vs. regular day-to-day entries. The injected global anomalies consist of attribute values not evident in the original data, while the local anomalies exhibit combinations of attribute value subsets not occurring in the original data. The true labels are available for both datasets: each journal entry is labeled as either (1) a synthetic global anomaly, (2) a synthetic local anomaly, or (3) a non-synthetic regular entry. The following descriptive statistics summarize both datasets:

  • Dataset A: contains a total of journal entry line items comprised of six categorical and two numerical attributes. The encoding resulted in a total of encoded dimensions for each entry. In total, () synthetic anomalous journal entries have been injected into the dataset, encompassing () global anomalies and () local anomalies.

  • Dataset B: contains a total of journal entry line items comprised of six categorical and two numerical attributes. The encoding resulted in a total of encoded dimensions for each entry. In total, () synthetic anomalous journal entries have been injected into the dataset, encompassing () global anomalies and () local anomalies.
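The anomaly injection described above can be sketched as follows. A global anomaly introduces an attribute value unseen in the original data, while a local anomaly combines individually frequent values in an unseen way; the attribute values below are illustrative:

```python
import itertools

# Original (regular) attribute-value combinations observed in the data;
# the values are illustrative stand-ins for real ERP attribute values.
regular = {("40", "EUR"), ("50", "EUR"), ("40", "USD")}
keys = {k for k, _ in regular}
currencies = {c for _, c in regular}

# A global anomaly introduces an attribute value unseen in the data.
global_anomaly = ("99", "EUR")
assert global_anomaly[0] not in keys

# A local anomaly combines individually frequent values in an unseen way.
local_candidates = [
    combo for combo in itertools.product(keys, currencies)
    if combo not in regular
]
local_anomaly = local_candidates[0]
```

Note that each value of the local anomaly occurs frequently in the original data; only their co-occurrence is new, which is what makes this class hard to detect with rule-based tests.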

Figure 3 illustrates an exemplary distribution of the attributes primarily investigated during an audit, namely the ’account key’ (technically: ’KTOSL’) attribute values, the log-normalized ’local’ and ’foreign currency amount’ (technically: ’DMBTR’ and ’WRBTR’) attribute values, as well as the ’posting key’ (technically: ’BSCHL’) and ’general ledger account’ (technically: ’HKONT’) attribute values observable in dataset B.

4.2. Adversarial Autoencoder Training

Our architectural setup follows the AAE architecture (Makhzani et al., 2015) as shown in Fig. 2, comprised of three distinct neural networks that are trained in parallel. The encoder network uses Leaky Rectified Linear Unit (LReLU) activation functions (Xu et al., 2015), except in the last ”bottleneck” layer. Both the decoder network and the discriminator network use LReLUs in all layers except the output layers, where a Sigmoid activation function is used. Table 1 depicts the architectural details of the networks, which are implemented using PyTorch (Paszke et al., 2017).

Training stability is a main challenge in adversarial training (Arjovsky et al., 2017), and we faced a variety of collapsing and non-convergence scenarios. To determine a stable training setup, we swept the learning rates of the encoder and decoder networks as well as the learning rate of the discriminator network over ranges of candidate values. Ultimately, we use the following constant learning rates to learn a stable model of each dataset:

  • Dataset A: for the encoder and the decoder network, for the discriminator network; and,

  • Dataset B: for the encoder and the decoder network, for the discriminator network.

Net           | Dataset | 1   | 2   | 3  | 4  | 5   | 6  | 7   | 8
encoder       | A       | 256 | 128 | 64 | 32 | 16  | 8  | 4   | 2
decoder       | A       | 2   | 4   | 8  | 16 | 32  | 64 | 128 | 256
discriminator | A       | 128 | 64  | 32 | 16 | -   | -  | -   | -
encoder       | B       | 256 | 64  | 16 | 4  | 2   | -  | -   | -
decoder       | B       | 2   | 4   | 16 | 64 | 256 | -  | -   | -
discriminator | B       | 256 | 64  | 16 | 4  | 1   | -  | -   | -
Table 1. Neurons per layer of the distinct networks that comprise the AAE architecture (Makhzani et al., 2015): encoder, decoder, and discriminator neural network.

We train the AAE with mini-batch wise SGD for a maximum of 10,000 training epochs and apply early stopping once the reconstruction loss converges. In accordance with (Xu et al., 2015), we set the scaling factor of the LReLUs and initialize the AAE parameters as described in (Glorot and Bengio, 2010). A mini-batch size of 128 journal entries is used in both the reconstruction and the regularization phase. We use Adam optimization (Kingma and Ba, 2014) and set the momentum parameters $\beta_1$ and $\beta_2$ accordingly in the optimization of the network parameters. In the reconstruction phase, we use a combined loss function $\mathcal{L}_{rec}$ to optimize the encoder and decoder net parameters. For each journal entry we calculate (1) the cross-entropy reconstruction error $\mathcal{L}_{CE}$ of the categorical attribute value encodings, e.g., the encoded general ledger account id, and (2) the mean-squared reconstruction error $\mathcal{L}_{MSE}$ of the numerical attribute value encodings, e.g., the encoded posting amount, formally expressed by:

$$\mathcal{L}_{rec} = \gamma \cdot \mathcal{L}_{CE}\big(x^i_{cat}, \hat{x}^i_{cat}\big) + (1 - \gamma) \cdot \mathcal{L}_{MSE}\big(x^i_{num}, \hat{x}^i_{num}\big) \qquad (6)$$

where the parameter $\gamma \in [0, 1]$ balances both losses. In this initial work, we set $\gamma$ in all our experiments to account for the higher number of categorical attributes in both datasets. In the regularization phase, we calculate the adversarial loss, according to Eq. (2), when optimizing the parameters of the discriminator.
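The combined loss of Eq. (6) can be sketched in plain Python. The binary cross-entropy over the one-hot encodings, the attribute values, and the choice $\gamma = 0.5$ are illustrative:

```python
import math

def combined_loss(x_cat, xh_cat, x_num, xh_num, gamma=0.5):
    """Combined reconstruction loss of Eq. (6): cross-entropy over the
    one-hot encoded categorical attributes plus mean-squared error over
    the numerical attributes; gamma = 0.5 is an illustrative choice."""
    eps = 1e-12  # numerical stability for log(0)
    ce = -sum(
        t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
        for t, p in zip(x_cat, xh_cat)
    ) / len(x_cat)
    mse = sum((t - p) ** 2 for t, p in zip(x_num, xh_num)) / len(x_num)
    return gamma * ce + (1 - gamma) * mse

loss = combined_loss(
    x_cat=[1.0, 0.0], xh_cat=[0.9, 0.1],
    x_num=[0.5], xh_num=[0.4],
)
```

Treating the one-hot categorical encodings with a cross-entropy term and the numerical encodings with a mean-squared term matches each attribute type with an appropriate reconstruction objective.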

To partition the learned journal entry representations, we sample from a prior distribution comprised of a mixture of τ multivariate isotropic Gaussians, where τ denotes the number of mixture components. Thereby, τ is a hyperparameter we evaluate for varying numbers of Gaussians. Figure 4 shows an exemplary prior as well as the learned aggregated posterior distributions after 100 and 2,000 training epochs.
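Sampling from such a mixture prior can be sketched as follows (a NumPy sketch; the circular placement of the mode means, the radius, and the variance are illustrative assumptions, not the paper's exact configuration):

```python
import numpy as np

def sample_mixture_prior(n, tau=10, radius=5.0, sigma=0.5, seed=0):
    """Draw n samples z in R^2 from a mixture of tau isotropic Gaussians.
    The tau mode means are placed evenly on a circle of the given radius
    (a common layout for 2D latent spaces; assumed here)."""
    rng = np.random.default_rng(seed)
    angles = 2.0 * np.pi * np.arange(tau) / tau
    means = radius * np.stack([np.cos(angles), np.sin(angles)], axis=1)
    modes = rng.integers(0, tau, size=n)            # uniform mixture weights
    z = means[modes] + sigma * rng.standard_normal((n, 2))
    return z, modes
```

During the regularization phase, batches drawn this way serve as the "real" samples the discriminator contrasts against the encoder's latent codes.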

5. Experimental Results

In this section, we first assess the semantic partitioning of the journal entries induced by the imposed prior distributions. Afterward, we examine the anomalies detected within each semantic partition.

Semantic partitioning: We qualitatively review the latent space partitioning of the journal entries and assess the accounting-specific semantics learned by each mode. Figure 5 shows the partitioning result of dataset A when imposing a mixture of Gaussians (see the appendix for results of varying τ and of dataset B). It can be observed that the AAE learned a rather clean separation of the regular journal entries. The review of the journal entries' accounting-specific semantics captured by each mode and dataset revealed:

  • Dataset A: The entries of each partition exhibit a high semantic similarity, while each partition corresponds to a general accounting process, such as (1) automated payment run postings, (2) outgoing customer invoices, and (3) material movements.

  • Dataset B: Similarly, the entries of each partition exhibit a high semantic similarity and correspond to the following general accounting processes: (1) foreign and domestic invoice postings, (2) purchases of goods, and (3) manual payments.

The experimental results when imposing varying numbers of Gaussians on the latent space of each dataset are presented in the appendix of this work. The results show that the AAE is capable of learning a semantic partition of a given set of journal entries that disentangles the entries' underlying generative processes. The learned partition provides the auditor with a holistic view of a given set of accounting data subject to audit. Furthermore, it allows to effectively obtain a representative and interpretable sample of the data and thereby reduces the audit's sampling risk.

Figure 5. Learned AAE latent space representations of the journal entries contained in dataset A after training the AAE for 5,000 epochs and imposing a mixture of Gaussians (left), the anomaly scores obtained for each journal entry and corresponding mode (center), and the anomaly score distribution of each journal entry class with progressing network training (right; the bold line defines the median, the upper and lower bounds the upper and lower quantiles of the distribution).
Figure 6. Anomaly scores obtained by the application of distinct balance factors after training the AAE for 5,000 epochs on dataset A (see the appendix for results of varying the balance factor and of dataset B) and imposing a mixture of Gaussians. It can be observed that decreasing the balance factor results in an improved detection of global anomalies (left). In contrast, increasing it results in an improved detection of local anomalies (right).
Class    Data | τ = 5          | τ = 10         | τ = 15
global   A    | 0.295 ± 0.233  | 0.448 ± 0.207  | 0.532 ± 0.244
local    A    | 0.248 ± 0.276  | 0.275 ± 0.143  | 0.446 ± 0.202
regular  A    | 0.045 ± 0.076  | 0.053 ± 0.085  | 0.110 ± 0.034
global   B    | 0.508 ± 0.249  | 0.442 ± 0.245  | 0.437 ± 0.241
local    B    | 0.357 ± 0.260  | 0.164 ± 0.148  | 0.273 ± 0.228
regular  B    | 0.046 ± 0.061  | 0.070 ± 0.041  | 0.028 ± 0.029
Table 2. Mean anomaly score (± standard deviation) obtained per journal entry class when imposing a prior distribution consisting of a mixture of τ = 5 (10, and 15) Gaussians and training the AAE for 5,000 (10,000, and 15,000) epochs (variances originate from the distinct parameter initialization seeds).

Anomaly detection: In addition, we analyze the anomaly detection capability of the proposed anomaly score. Table 2 depicts the mean anomaly score obtained for each journal entry class by imposing a prior distribution consisting of a mixture of Gaussians and training the AAE for 5,000 epochs. The quantitative results show that the distinct journal entry classes (global, local, and regular entries) can be distinguished according to their anomaly scores in both datasets. Figure 5 exemplarily shows the anomaly scores obtained for dataset A (see the appendix for results of dataset B) for each journal entry and corresponding partition, as well as the distribution of the obtained individual anomaly scores. Figure 6 illustrates the change in anomaly scoring when varying the balance parameter of the anomaly score. It can be observed that increasing the parameter (and therefore the weight of the reconstruction error in the score) improves the ability to detect local accounting anomalies in the dataset.
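The interplay of the two score components can be illustrated as follows (a hypothetical NumPy sketch: it assumes the score is a convex combination of an entry's normalized reconstruction error and the normalized distance of its latent representation to the nearest prior mode; the paper's exact score definition may differ, and the balance value is assumed):

```python
import numpy as np

def anomaly_scores(recon_errors, z, mode_means, alpha=0.5):
    """Hypothetical anomaly score per journal entry: alpha weighs the
    normalized reconstruction error against the normalized distance of
    the latent representation z to the closest prior mode mean."""
    # distance of every latent point to its nearest mixture mode
    d = np.linalg.norm(z[:, None, :] - mode_means[None, :, :], axis=2).min(axis=1)

    def minmax(v):
        # scale both signals to [0, 1] so they are comparable
        return (v - v.min()) / (v.max() - v.min() + 1e-12)

    # increasing alpha puts more weight on the reconstruction error,
    # which favors the detection of local anomalies
    return alpha * minmax(np.asarray(recon_errors)) + (1.0 - alpha) * minmax(d)
```

An entry that is both badly reconstructed and far from every mode receives the highest score, matching the intuition behind global anomalies.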

We also qualitatively evaluate the characteristics of the anomalies detected in each partition. To this end, we review journal entries that correspond to a high anomaly score but have not been synthetically injected as anomalies into the evaluated datasets. Thereby, we interpret the detected anomalies of each mode in the context of the mode's regular entries:

  • Global anomalies exhibit a low semantic similarity to the regular entries of a mode. The detected entries correspond to rarely observable attribute values and accounting "exceptions", e.g., unusual purchase order amounts or high depreciation, year-end, as well as impairment postings.

  • Local anomalies exhibit a high semantic similarity to the regular entries of a mode. The detected entries correspond to rarely observable attribute value combinations, e.g., system users that switched departments or postings exhibiting unusual general ledger account combinations.

In summary, these results lead us to conclude that the proposed anomaly score can be utilized as a highly adaptive anomaly assessment of financial accounting data. It furthermore provides the ability to interpret the detected anomalies of a particular mode in the context of the mode's regular journal entry semantics. Initial feedback received from auditors on the detected anomalies underpinned their relevance from an accounting perspective.

6. Summary

In this work, we showed that Adversarial Autoencoder (AAE) neural networks can be trained to learn a semantically meaningful representation of journal entries recorded in real-world ERP systems. We also provided initial evidence that such representations provide a holistic view of the entries and disentangle the underlying generative processes. We believe that the presented approach equips a human auditor or forensic accountant with the ability to sample journal entries for a detailed audit in an interpretable manner and therefore reduces the "sampling risk". In addition, we proposed a novel anomaly score that combines an entry's learned representation and reconstruction error. We demonstrated that the scoring can be interpreted as a highly adaptive and unsupervised anomaly assessment to detect global and local accounting anomalies.

We plan to conduct a more detailed investigation of the journal entries’ latent space disentanglement. Given the tremendous amount of journal entries annually recorded by organizations, an automated semantic disentanglement improves the transparency of entries to be audited and can save auditors considerable time.

Acknowledgements.
We thank the members of the statistics department at Deutsche Bundesbank and PwC Europe’s Forensic Services for their valuable review and remarks. Opinions expressed in this work are solely those of the authors, and do not necessarily reflect the view of the Deutsche Bundesbank or PricewaterhouseCoopers (PwC) International Ltd. and its network firms.

References

  • ACFE (2018) ACFE. 2018. Report to the Nations on Occupational Fraud and Abuse, The 2018 Global Fraud Study. Association of Certified Fraud Examiners (ACFE). https://s3-us-west-2.amazonaws.com/acfepublic/2018-report-to-the-nations.pdf
  • AICPA (2002) AICPA. 2002. Consideration of Fraud in a Financial Statement Audit. American Institute of Certified Public Accountants (AICPA). 1719–1770 pages. https://www.aicpa.org/Research/Standards/AuditAttest/DownloadableDocuments/AU-00316.pdf
  • Amani and Fadlalla (2017) Farzaneh A. Amani and Adam M. Fadlalla. 2017. Data mining applications in accounting: A review of the literature and organizing framework. International Journal of Accounting Information Systems 24 (2017), 32–58.
  • Argyrou (2012) Argyris Argyrou. 2012. Auditing Journal Entries Using Self-Organizing Map. In Proceedings of the Eighteenth Americas Conference on Information Systems (AMCIS). Seattle, Washington, 1–10.
  • Argyrou (2013) Argyris Argyrou. 2013. Auditing Journal Entries Using Extreme Value Theory. Proceedings of the 21st European Conference on Information Systems 1, 2013 (2013).
  • Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. 2017. Wasserstein GAN. arXiv preprint arXiv:1701.07875 (2017).
  • Bay et al. (2006) Stephen Bay, Krishna Kumaraswamy, Markus G Anderle, Rohit Kumar, David M Steier, Almaden Blvd, and San Jose. 2006. Large Scale Detection of Irregularities in Accounting Data. In Data Mining, 2006. ICDM’06. Sixth International Conference on. IEEE, 75–86.
  • Benford (1938) Frank Benford. 1938. The Law of Anomalous Numbers. Proceedings of the American Philosophical Society 78, 4 (1938), 551–572.
  • Breunig et al. (2000) Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. 2000. LOF: Identifying Density-Based Local Outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. 1–12.
  • Chalapathy and Chawla (2019) Raghavendra Chalapathy and Sanjay Chawla. 2019. Deep learning for anomaly detection: A survey. arXiv preprint arXiv:1901.03407 (2019).
  • Choi and Jang (2018) Hyunsun Choi and Eric Jang. 2018. Generative ensembles for robust anomaly detection. arXiv preprint arXiv:1810.01392 (2018).
  • Debreceny and Gray (2010) R. S. Debreceny and G. L. Gray. 2010. Data mining journal entries for fraud detection: An exploratory study. International Journal of Accounting Information Systems 11, 3 (2010), 157–181.
  • Fiore et al. (2017) Ugo Fiore, Alfredo De Santis, Francesca Perla, Paolo Zanetti, and Francesco Palmieri. 2017. Using generative adversarial networks for improving classification effectiveness in credit card fraud detection. Information Sciences (2017).
  • Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics. 249–256.
  • Goodfellow et al. (2016) I. Goodfellow, Y. Bengio, and A. Courville. 2016. Deep Learning. MIT Press.
  • Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in neural information processing systems. 2672–2680.
  • Hall et al. (2002) Thomas W Hall, James E Hunton, and Bethane Jo Pierce. 2002. Sampling practices of auditors in public accounting, industry, and government. Accounting Horizons 16, 2 (2002), 125–136.
  • Hinton and Salakhutdinov (2006) Geoffrey E Hinton and Ruslan R Salakhutdinov. 2006. Reducing the dimensionality of data with neural networks. science 313, 5786 (2006), 504–507.
  • IFAC (2009a) IFAC. 2009a. International Standards on Auditing 240, The Auditor’s Responsibilities Relating to Fraud in an Audit of Financial Statements. International Federation of Accountants (IFAC). 155–197 pages.
  • IFAC (2009b) IFAC. 2009b. International Standards on Auditing 530, Audit Sampling. International Federation of Accountants (IFAC). 441–457 pages.
  • Islam et al. (2010) Asadul Khandoker Islam, Malcom Corney, George Mohay, Andrew Clark, Shane Bracher, Tobias Raub, and Ulrich Flegel. 2010. Fraud detection in ERP systems using Scenario matching. IFIP Advances in Information and Communication Technology 330 (2010), 112–123.
  • Jans et al. (2010) Mieke Jans, Nadine Lybaert, and Koen Vanhoof. 2010. Internal fraud risk reduction: Results of a data mining case study. International Journal of Accounting Information Systems 11, 1 (2010), 17–41.
  • Jans et al. (2011) Mieke Jans, Jan Martijn Van Der Werf, Nadine Lybaert, and Koen Vanhoof. 2011. A business process mining application for internal transaction fraud mitigation. Expert Systems with Applications 38, 10 (2011), 13351–13359.
  • Kazemi and Zarrabi (2017) Zahra Kazemi and Houman Zarrabi. 2017. Using deep networks for fraud detection in the credit card transactions. In 2017 IEEE 4th International Conference on Knowledge-Based Engineering and Innovation (KBEI). IEEE, 0630–0633.
  • Khan and Corney (2009) Rq Khan and Mw Corney. 2009. A role mining inspired approach to representing user behaviour in ERP systems. In Proceedings of The 10th Asia Pacific Industrial Engineering and Management Systems Conference. 2541–2552.
  • Khan et al. (2010) Roheena Khan, Malcolm Corney, Andrew Clark, and George Mohay. 2010. Transaction Mining for Fraud Detection in ERP Systems. Industrial Engineering and Management Systems 9, 2 (2010), 141–156.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
  • LeCun et al. (2015) Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. nature 521, 7553 (2015), 436.
  • Li et al. (2017) Chongxuan Li, Kun Xu, Jun Zhu, and Bo Zhang. 2017. Triple Generative Adversarial Nets. arXiv preprint arXiv:1703.02291 (2017).
  • Lopez-Rojas et al. (2016) E. A. Lopez-Rojas, A. Elmir, and S. Axelsson. 2016. PaySim: A financial mobile money simulator for fraud detection. In The 28th European Modeling and Simulation Symposium-EMSS, Larnaca, Cyprus.
  • Makhzani et al. (2015) Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. 2015. Adversarial Autoencoders. arXiv (2015), 1–10. arXiv:1511.05644 http://arxiv.org/abs/1511.05644
  • Markovitch and Willmott (2014) Shahar Markovitch and Paul Willmott. 2014. Accelerating the digitization of business processes. McKinsey & Company (2014), 1–5.
  • McGlohon et al. (2009) Mary McGlohon, Stephen Bay, Markus G Mg Anderle, David M Steier, and Christos Faloutsos. 2009. SNARE: A Link Analytic System for Graph Labeling and Risk Detection. Kdd-09: 15Th Acm Sigkdd Conference on Knowledge Discovery and Data Mining (2009), 1265–1273.
  • Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. (2017).
  • Paula et al. (2017) Ebberth L. Paula, Marcelo Ladeira, Rommel N. Carvalho, and Thiago Marzagão. 2017. Deep learning anomaly detection as support to fraud investigation in Brazilian exports and anti-money laundering. In Proceedings - 2016 15th IEEE International Conference on Machine Learning and Applications, ICMLA 2016. 954–960.
  • Pedrosa and Costa (2014) Isabel Pedrosa and Carlos J Costa. 2014. New trends on CAATTs: what are the Chartered Accountants’ new challenges? ISDOC ’14 Proceedings of the International Conference on Information Systems and Design of Communication, May 16–17, 2014, Lisbon, Portugal (2014), 138–142.
  • Pimentel et al. (2014) Marco A.F. Pimentel, David A. Clifton, Lei Clifton, and Lionel Tarassenko. 2014. A review of novelty detection. Signal Processing 99 (2014), 215–249.
  • Pumsirirat and Yan (2018) Apapan Pumsirirat and Liu Yan. 2018. Credit card fraud detection using deep learning based on auto-encoder and restricted Boltzmann machine. International Journal of Advanced Computer Science and Applications 9, 1 (2018), 18–25.
  • PwC (2018) PwC. 2018. Pulling Fraud Out of the Shadows, The Global Economic Crime Survey 2018. PricewaterhouseCoopers LLP. https://www.pwc.com/gx/en/forensics/global-economic-crime-and-fraud-survey-2018.pdf
  • Renström and Holmsten (2018) Martin Renström and Timothy Holmsten. 2018. Fraud Detection on Unlabeled Data with Unsupervised Machine Learning.
  • Rumelhart et al. (1985) David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. 1985. Learning internal representations by error propagation. Technical Report. California Univ San Diego La Jolla Inst for Cognitive Science.
  • SAP (2019) SAP. 2019. SAP Global Corporate Affairs, Corporate Factsheet 2019. https://www.sap.com/documents/2017/04/4666ecdd-b67c-0010-82c7-eda71af511fa.html
  • Schreyer et al. (2017) Marco Schreyer, Timur Sattarov, Damian Borth, Andreas Dengel, and Bernd Reimer. 2017. Detection of anomalies in large scale accounting data using deep autoencoder networks. arXiv preprint arXiv:1709.05254 (2017).
  • Seow et al. (2016) Poh-Sun Seow, Gary Sun, and Themin Suwardy. 2016. Data Mining Journal Entries for Fraud Detection: A Replication of Debreceny and Gray's (2010). Journal of Forensic & Investigative Accounting 3, 8 (2016), 501–514.
  • Singleton and Singleton (2006) Tommie. Singleton and Aaron J. Singleton. 2006. Fraud auditing and forensic accounting. John Wiley & Sons.
  • Sweers et al. (2018) Tom Sweers, Tom Heskes, and Jesse Krijthe. 2018. Autoencoding Credit Card Fraud. (2018).
  • Wedge et al. (2017) Roy Wedge, James Max Kanter, Santiago Moral Rubio, Sergio Iglesias Perez, and Kalyan Veeramachaneni. 2017. Solving the "false positives" problem in fraud prediction. arXiv preprint arXiv:1710.07709 (2017).
  • Wells (2017) Joseph T. Wells. 2017. Corporate Fraud Handbook: Prevention and Detection. John Wiley & Sons.
  • Xu et al. (2015) Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. 2015. Empirical Evaluation of Rectified Activations in Convolution Network. ICML Deep Learning Workshop (2015), 1–5. arXiv:1505.00853
  • Zheng et al. (2018a) Panpan Zheng, Shuhan Yuan, Xintao Wu, Jun Li, and Aidong Lu. 2018a. One-class adversarial nets for fraud detection. arXiv preprint arXiv:1803.01798 (2018).
  • Zheng et al. (2018b) Yu-Jun Zheng, Xiao-Han Zhou, Wei-Guo Sheng, Yu Xue, and Sheng-Yong Chen. 2018b. Generative adversarial network based telecom fraud detection at the receiving bank. Neural Networks 102 (2018), 78–86.

Appendix

Experimental Results - Dataset A

Figure 7. Learned AAE latent space representations of the journal entries contained in dataset A after training the AAE for 10,000 epochs and imposing a mixture of Gaussians (left), the anomaly scores obtained for each journal entry and corresponding mode (center), and the anomaly score distribution of each journal entry class with progressing training (right; the bold line defines the median, the upper and lower bounds the upper and lower quantiles of the distribution).
Figure 8. Learned AAE latent space representations of the journal entries contained in dataset A after training the AAE for 15,000 epochs and imposing a mixture of Gaussians (left), the anomaly scores obtained for each journal entry and corresponding mode (center), and the anomaly score distribution of each journal entry class with progressing training (right; the bold line defines the median, the upper and lower bounds the upper and lower quantiles of the distribution). It can be observed, based on the anomaly scores obtained with progressing training, that mode stability was not entirely reached after training the AAE for 15,000 epochs.

Experimental Results - Dataset B

Figure 9. Learned AAE latent space representations of the journal entries contained in dataset B after training the AAE for 5,000 epochs and imposing mixtures of Gaussians with varying numbers of components (top to bottom) (left), the anomaly scores obtained for each journal entry and corresponding mode (center), and the anomaly score distribution of each journal entry class with progressing training (right; the bold line defines the median, the upper and lower bounds the upper and lower quantiles of the distribution).