Analyzing Bias in Sensitive Personal Information Used to Train Financial Models

by   Reginald Bryant, et al.

Bias in data can have unintended consequences that propagate to the design, development, and deployment of machine learning models. In the financial services sector, this can result in discrimination from certain financial instruments and services. At the same time, data privacy is of paramount importance, and recent data breaches have seen reputational damage for large institutions. Presented in this paper is a trusted model-lifecycle management platform that attempts to ensure consumer data protection, anonymization, and fairness. Specifically, we examine how datasets can be reproduced using deep learning techniques to effectively retain important statistical features in datasets whilst simultaneously protecting data privacy and enabling safe and secure sharing of sensitive personal information beyond the current state-of-practice.



I Introduction

In spite of the potential advantages that the new digital transformations (e.g., the introduction of digital currencies, advances in AI) offer to several sectors ranging from financial services to healthcare, their advancement is still very much paralleled with risks. Chief amongst these pressing concerns is the secure and fair use of (user) data to provide diagnostics and access to services (e.g., lending, loans, credit). If mismanaged, this could leave back doors open for both intentional and unintentional biases that can be exploited for unlawful acts. Bias can arise at any level of the data, model, algorithm, and application stack, including within the underlying platform [1].

Unlawful practices that are propagated through financial models, for example models tuned to optimize return on investment, are mitigated by customer-protection rules that respond to the financial service practices those models create. There are also compliance requirements that are used to manage operational practices at the corresponding banks. Hence, financial institutions that wish to avoid brand or reputational damage closely monitor the operational practices and compliance of their correspondent banks as well as newly acquired banks. For example, a high-profile case was recently brought against the First National Bank (FNB) of South Africa [3]. The Usury Act was invoked against FNB after it acquired the smaller Saambou Bank, a bank that had operational difficulties managing its R8 billion ($550 million USD) worth of mortgages.

Thus, it becomes imperative across many domains and sectors that a robust, trusted platform exists to ensure end-to-end consumer data protection, anonymization, and fairness. This is the main objective of this paper. In particular, we describe our proposed distributed trusted model platform for blockchain-based small-to-medium business networks. We then discuss our novel methodology for sharing sensitive personal information (SPI) beyond the traditionally used anonymization and data sharing techniques. Next, we present our preliminary experimental evaluation of the proposed data synthesis techniques, demonstrating the utility (e.g., validation, verification) of our methodology and its usage in protecting consumer data. Finally, we present the analysis and summary of this work.

II Motivation

Several factors have motivated this work. First and foremost, this work was initially driven by the necessity to leverage distinctly small datasets to build machine learning (ML) models. This is the case in certain domains where data is scarce (e.g., low-frequency transactions). These conditions require expanding existing datasets to the level required by some ML algorithms, which we applied to use cases in financial services (e.g., credit scoring and credit-limit management [10]). Data generation based on small datasets can become a powerful tool in the data scientist’s toolbox under these circumstances, and when working on network-based ML approaches like Federated Learning, where bespoke Deep Learning (DL) model structures need to be defined and optimized beforehand and then refined as more real-world data is collected.

Secondly, as we later realized, the same techniques selected for data expansion could be used for data protection. There is a need to develop novel ways to ensure data privacy that address the pitfalls of existing techniques. Most financial institutions use popular state-of-practice anonymization techniques (e.g., removal, redaction, encryption, and data masking) to share data with other privileged institutions, be they partners, vendors, or regulators. Unfortunately, these mechanisms are at high risk from bad actors, as anonymized data remains susceptible to re-identification [8]. However, we find that with most DL synthesis techniques, no one-to-one relationship is formed between the real and synthesized datasets. This makes recovering the original records challenging, to a degree that can be set prior to data generation.

Thirdly, trust involves particular levels of transparency. The effective creation of transparency is a balancing act between data utility and privacy. To provide a concrete example, consider one bank that wants to obtain credit scoring models from a third party, and another bank that wants a third party to assure its customers that decisions are being made in a fair, regulatory-compliant manner. In the first case, data synthesis should maximize utility over privacy: the bank would want to be sure that each vendor has as much information as necessary to train the best model possible, as the bank will ultimately use that model in production. In the second case, privacy would be prioritized over utility: the bank would like to retain a competitive advantage against possible bad actors that could seize upon that information to improve their existing models.

Fourth, many experts agree that human-mediated processes used to collect data can be inherently biased. Having the ability to share data in a secure manner would be good insurance against biased data collection efforts, and maintaining a diverse set of third-party evaluators becomes critical. Thus, we wish to make inroads in addressing the inherent bias and skewness in data. This is also linked with the lack of an approach for other researchers to reproduce studies and cross-examine a particular dataset. Having a controlled mechanism for reproducing data is a fundamental element of a model/data sharing solution that combats the reproducibility issue faced by many studies.

Lastly, we wish to empower users to ensure that their data is forgotten. The implementation of the Right to Be Forgotten legislation ensures that individuals (or groups of individuals) who choose to no longer be a part of a platform can have their data deleted completely. However, user-data removal can impact performance, especially in the case of algorithms like collaborative filtering, which are used in recommendation engines. We look to mitigate this by using the data-synthesis models instead of the actual data (Section IV-A). We also track and manage both the data-synthesis models and actual data models (see Section III).

III Trusted Model Executor

To address (some of) the above challenges, we developed a mechanism and infrastructure to evaluate data quality and trained machine learning models throughout their lifecycle. These capabilities are presented in what we call the Trusted Model Executor (TME), as shown in Fig. 1. The TME has been integrated and tested on a blockchain-based platform for small-to-medium businesses (SMEs). Networked digital trust is established among stakeholders along the SME value chain [4]. Each participant involved in the network can interact with, view, and/or act on data, models, and information pertaining to order contract transactions and decision-making.


Fig. 1: TME Overview.

The TME is designed to execute and evaluate the performance of models (e.g., credit scoring models) without the need to disclose the proprietary structural design of the models. The model lifecycle (Fig. 2) is managed by a blockchain-controlled workflow and recorded on the blockchain. Every action on a model is recorded as a blockchain event or sequence of events for transparency and immutability. Before a model is executed, the model file is verified by executing smart contracts.


Fig. 2: Illustrating the lifecycle of a model.

Models built using frameworks, languages, and formats like Python, PMML, and PKL are accepted and executed by the TME. Pre-processing scripts perform any tasks that need to be done on input data before it is run through the model (e.g., removing unnecessary or blank fields), while post-processing scripts perform any extra tasks on the output of the model (e.g., formatting of the model output). The model execution then provides model explainability, showing how the various features in the model contributed to the model output. The model execution also supports action triggers, initiated when certain model results are achieved given a set of input parameters. For example, a model that scores businesses can have a trigger that creates a notification whenever small businesses (identified by sales volume) get disproportionately lower credit scores compared to their larger counterparts.
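The action trigger in the example above can be sketched as follows. This is only an illustration under stated assumptions: the function name `score_gap_trigger`, the `small_threshold` cut-off on sales volume, and the `max_gap` tolerance are all hypothetical and are not part of the TME interface described in this paper.

```python
from statistics import mean

def score_gap_trigger(scores, sales_volumes, small_threshold, max_gap):
    """Return a notification string when small businesses score lower, on
    average, than large ones by more than max_gap; otherwise return None.

    All parameters are hypothetical: a business counts as "small" when its
    sales volume is below small_threshold.
    """
    small = [s for s, v in zip(scores, sales_volumes) if v < small_threshold]
    large = [s for s, v in zip(scores, sales_volumes) if v >= small_threshold]
    if not small or not large:
        return None  # nothing to compare
    gap = mean(large) - mean(small)
    if gap > max_gap:
        return f"Notification: small businesses score {gap:.1f} points lower on average"
    return None

# Illustrative (made-up) credit scores and sales volumes for four businesses.
message = score_gap_trigger(
    scores=[500, 520, 700, 720],
    sales_volumes=[10, 20, 500, 600],
    small_threshold=100,
    max_gap=50,
)
print(message)
```

A mean-difference rule is only one possible reading of "disproportionately lower"; a ratio-based rule in the style of disparate impact would be an equally reasonable choice.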

The TME supports bias detection and mitigation for both training data and pretrained models by utilizing the underlying capabilities of IBM’s AIF360 library [1]. For data attributes found to be biased based on a set of metrics, several mitigation measures can be performed at the user’s discretion. Additionally, using a series of model approximation techniques, the TME is able to generate non-expert explanations as to possible causes of the bias in both the dataset and model.

Models and data uploaded to the TME are pre-processed and ingested by the data synthesis module. This module enables users to generate a similar dataset while maintaining the privacy of the original data, which in turn enables sharing of data between users on the TME platform. The module can also be used to expand the data in cases where data is limited. This module is the main focus of this work.

IV Data Synthesis and Expansion

In this section, we describe the reference datasets used, the generation of data, and our experimental studies analyzing bias-detection utility in synthesized datasets.

IV-A Reference Datasets

We used two tabular datasets for our experimental studies. The purpose of these studies is to demonstrate that the synthetic data generated is able to retain all of the high-level relational information of the real datasets without it being a one-to-one mapping of records from the synthetic data into the real. With that, we would be able to generate synthetic/synthesized data that has similar properties to the real data without any of the privacy risks.

The two datasets are:

  • US Adult Census Dataset [11]. This dataset consists of several personally-identifiable attributes with labels of yearly income. For the preliminary experiments we used a subset of categorical and ordinal variables, such as work class, education, marital status, occupation, relationship, ethnicity, gender, and the target class (income exceeds a threshold). The training dataset contained 32,561 records, 7 attributes (listed above) and 1 binary target label.

  • Bank of Portugal Dataset [7]. This consists of 41,188 labeled records with 20 labeled attributes which contain data on whether a customer will accept the term deposits of this particular bank. The preliminary experiments using this dataset used a subset of categorical and ordinal variables, such as job, marital, education, default, housing, loan, contact, month, last contact day, and outcome of previous marketing campaign. The training dataset comprised 37,069 records, 10 attributes (listed above) and 1 binary target label.

We split each of these datasets into training, validation, and testing sets (70% training, 10% validation, and 20% testing). The split was drawn using a random permutation cross-validation iterator.
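A minimal sketch of such a split is given below. It assumes a single random permutation of record indices cut at the 70/10/20 boundaries; the function name `shuffle_split`, the seed, and the record count of 1,000 are illustrative assumptions, and the paper does not state which library implementation was used.

```python
import random

def shuffle_split(n_records, fractions=(0.7, 0.1, 0.2), seed=0):
    """Split record indices into (train, validation, test) lists using one
    random permutation. The seed is fixed only to make the sketch repeatable."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    n_train = int(round(fractions[0] * n_records))
    n_val = int(round(fractions[1] * n_records))
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test

train_idx, val_idx, test_idx = shuffle_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```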

IV-B Data Generation

There are several generative models that can be used to synthesize simulated tabular data that preserves statistical similarity to the original dataset yet prevents information leakage. Examples of these models are Variational Auto-Encoders (VAE) [6], Generative Adversarial Networks, dimension-reduction methods, and Kernel Density Estimation. They typically generate new samples that follow the same probabilistic distribution as a given training dataset, using a reduced feature vector. We describe the approach used in this work to synthesize the five variations of the two experimental datasets described in Section IV-A.


We used the VAE to develop our data generation method for the purpose of this work. A VAE provides a probabilistic mechanism to describe an observation in a latent space. Rather than building an encoder which outputs a single value, our encoder describes a probability distribution for each latent attribute. The entire network is trained as a whole, with two hidden layers for the encoder and two hidden layers for the decoder; the size of the bottleneck layer is determined by the number of classes and the number of categorical distributions. The loss function is the sum of the cross-entropy between the output and the input, known as the reconstruction loss, and the Kullback–Leibler divergence. We trained a standard categorical VAE to generate the samples for both datasets and all the variations. In our case, we use Adaptive Moment Estimation (ADAM) as the optimization method, which computes adaptive learning rates for each parameter. The input shape of the vectors varies depending on the dataset, and all variables were encoded using a one-hot encoding procedure.
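The two loss terms described above can be sketched for a single one-hot encoded attribute as follows. This is an illustration, not the authors' implementation: the uniform prior over each categorical latent is an assumption (a common choice for categorical VAEs), and the category names, decoder probabilities, and latent posteriors are made-up values.

```python
import math

def one_hot(value, categories):
    """One-hot encode a categorical value against a fixed category list."""
    return [1.0 if value == c else 0.0 for c in categories]

def reconstruction_loss(x_onehot, x_prob, eps=1e-12):
    """Cross-entropy between a one-hot input and the decoder's probabilities."""
    return -sum(x * math.log(p + eps) for x, p in zip(x_onehot, x_prob))

def kl_to_uniform(q, eps=1e-12):
    """KL(q || uniform) for one categorical latent with len(q) classes.
    The uniform prior is an assumption made for this sketch."""
    k = len(q)
    return sum(qi * (math.log(qi + eps) + math.log(k)) for qi in q)

def vae_loss(x_onehot, x_prob, latent_posteriors):
    """Reconstruction loss plus the KL term summed over all categorical latents."""
    kl = sum(kl_to_uniform(q) for q in latent_posteriors)
    return reconstruction_loss(x_onehot, x_prob) + kl

# Hypothetical example: one attribute with three categories, two binary latents.
x = one_hot("Private", ["Private", "Self-employed", "Government"])
decoder_probs = [0.8, 0.1, 0.1]
posteriors = [[0.5, 0.5], [0.9, 0.1]]
print(vae_loss(x, decoder_probs, posteriors))
```

In a full model these quantities would be computed over batches inside a deep learning framework and minimized with ADAM; the scalar version here only shows how the two terms combine.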

Once the data is generated, it is important to understand the representation of the data. We therefore displayed the feature representation of the real and simulated data distributions using t-distributed Stochastic Neighbor Embedding (t-SNE) [12]. t-SNE is a method for visualizing high-dimensional data by giving each data point a location in a two- or three-dimensional map. This can be seen in Fig. 3.


Fig. 3: The feature representation of the raw data distribution using t-SNE is shown with different perspectives. Colors represent the source of the data (real or generated), where red represents generated data and blue real data.

IV-C Data synthesis and experimental bias evaluation

Our studies seek to experimentally identify and characterize the set of bias metrics that should be tracked for the synthetic data. For this work, we perform a comparative analysis between the real and synthetic datasets using the following metrics:

  • Statistical Parity Difference (Stat. Diff.)

  • Disparate Impact (Disp. Imp.)

  • K-Nearest Neighbors Consistency (Consistency) [2]

  • Number of Positive Examples (Num. Pos.)

  • Number of Negative Examples (Num. Neg.)

  • Base Rate

For this work, the analysis focused on between-group fairness metrics, as determined by statistical parity difference and disparate impact, as well as the individual fairness metric captured by data consistency.

Statistical parity difference is defined as:

$$\mathrm{SPD} = P(Y = 1 \mid S = u) - P(Y = 1 \mid S = p) \tag{1}$$

Disparate impact is defined as:

$$\mathrm{DI} = \frac{P(Y = 1 \mid S = u)}{P(Y = 1 \mid S = p)} \tag{2}$$

Consistency is defined as:

$$\mathrm{Consistency} = 1 - \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \frac{1}{k} \sum_{j \in \mathrm{kNN}(x_i)} y_j \right| \tag{3}$$

For Equations (1), (2) and (3), we assume that the labeled datasets are defined by $(X, Y)$, where $X$ is the set of attributes and $Y$ the labels. Generally, the domain of $X$ can take on a variety of data types. As stated, for our analysis, $X$ is restricted to categorical (nominal) and ordinal values with low cardinality. Moreover, the domain of $Y$ is restricted to binary label classes: $Y \in \{0, 1\}$. For bias, a single attribute in $X$ is designated as the sensitive attribute: $S \in X$. For our analysis, $S$ also takes on binary values, $S \in \{u, p\}$, where $u$ is designated as the unprivileged class and $p$ the privileged. For our experiments, $S$ was set to gender and contact method (contact) in the US Adult Census and Bank of Portugal datasets, respectively, with $u$ and $p$ assigned to the two values of each attribute.

For Equation (3), $\mathrm{kNN}(x_i)$ is the k-Nearest Neighbor function used to identify the $k$ ($k = 5$, in our case) instances around $x_i$ in attribute space, $y_i$ is the label of $x_i$, and $n$ is the number of instances. Ideally, those five neighbors should have the same label as $x_i$. Any discrepancies will reduce a perfect consistency score of one.

The number of positive instances, the number of negative instances, and the base rate represent the unconditioned class probabilities of the labels. Naturally, as each of the datasets is intentionally skewed, those three metrics are expected to change accordingly.
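To make the three fairness metrics concrete, the sketch below computes them on a toy dataset in plain Python. It rests on several assumptions: the helper names are illustrative, Hamming distance is used to find neighbors among categorical attributes, each record is excluded from its own neighbor set, and the toy labels and groups are made up. The experiments in this paper rely on AIF360 [1], whose implementation may differ in these details.

```python
def positive_rate(labels, groups, group):
    """Fraction of positive labels among records belonging to one group."""
    members = [y for y, s in zip(labels, groups) if s == group]
    return sum(members) / len(members)

def statistical_parity_difference(labels, groups, unprivileged, privileged):
    return (positive_rate(labels, groups, unprivileged)
            - positive_rate(labels, groups, privileged))

def disparate_impact(labels, groups, unprivileged, privileged):
    # Undefined (division by zero) if the privileged group has no positives.
    return (positive_rate(labels, groups, unprivileged)
            / positive_rate(labels, groups, privileged))

def hamming(a, b):
    """Number of attribute positions on which two records differ."""
    return sum(u != v for u, v in zip(a, b))

def consistency(records, labels, k=5):
    """One minus the mean absolute gap between each label and the average
    label of its k nearest neighbors. Needs at least two records."""
    n = len(records)
    total = 0.0
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: hamming(records[i], records[j]))
        neighbors = others[:k]
        neighbor_mean = sum(labels[j] for j in neighbors) / len(neighbors)
        total += abs(labels[i] - neighbor_mean)
    return 1.0 - total / n

# Toy data: "f" is treated as unprivileged and "m" as privileged (illustrative).
labels = [1, 0, 0, 0, 1, 1, 0, 1]
groups = ["f", "f", "f", "f", "m", "m", "m", "m"]
spd = statistical_parity_difference(labels, groups, "f", "m")
di = disparate_impact(labels, groups, "f", "m")
print(spd, di)
```

A statistical parity difference of zero and a disparate impact of one both indicate parity between the two groups; values below those indicate that the unprivileged group receives the favorable label less often.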


Fig. 4: Correlation matrix comparing the features and bias (Statistical Parity Difference and Disparate Impact) metric scores of the real and generated dataset variations. There is high correlation in four out of the five bias metrics.

Figure 4 is a correlation map for both the real and synthetic datasets. The top off-axis correlations between the synthetic and real datasets are for the number of negative instances (Num. Neg.), the disparate impact (Disp. Imp.), and the statistical parity difference (Stat. Parity Diff.). Conversely, the consistency metric shows a weak relationship between the real and synthetic datasets.
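The real-versus-synthetic comparison behind such a correlation map can be illustrated with a Pearson correlation over the dataset variations. The scores below are hypothetical placeholders, not results from this paper, and `pearson` is a helper written for this sketch.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Hypothetical disparate-impact scores for five variations of one dataset,
# measured on the real data and on the corresponding synthetic data.
real_di = [0.40, 0.55, 0.70, 0.85, 1.00]
synthetic_di = [0.45, 0.58, 0.69, 0.88, 0.97]
r = pearson(real_di, synthetic_di)
print(round(r, 3))
```

A coefficient near one means the synthetic data tracks the direction of change in the real metric, but it says nothing about whether the two agree in absolute scale.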

IV-D Analysis and Results

Upon characterizing and scoring each of the 20 datasets, several initial trends emerged. Figure 4 highlights these trends in a correlation map among the five selected metrics. Worth noting is the real-to-synthesized dataset correlation for the Disp. Imp. and Stat. Parity Diff. metrics. The high correlation suggests that our method for data reproduction should be able to preserve group bias while effectively breaking the one-to-one connection between the original and synthesized datasets. It should be noted that while both datasets tracked the monotonic trends of increasing and decreasing disparate impact, the synthesized data in the Bank of Portugal experiments tracked the scale changes of the real data much more closely than in the US Adult Census experiments. Finer-grained dataset sampling and more dataset types are required to fully understand the underlying behavior of VAE-generated data with respect to group bias tracking.

V Discussion and Summary

In this work, we presented our implementation of a trusted model-lifecycle management platform, highlighting the Data Synthesis and Expansion module. Specifically, the focus was on how to securely distribute datasets (containing sensitive information) to third-party evaluators by using Variational Auto-Encoder (VAE) technology. The goal was to generate synthetic data from the latent representation of the original data in order to preserve privacy while retaining the utility of that original data. In our case, the utility of bias detection in the synthetic dataset was measured using the bias in the original dataset as the ground truth. Several bias metrics including group and individual bias were examined as two financial datasets were artificially skewed by a subsampling process. Experimentally, our results lead us to believe that using the VAE for data reproduction can effectively retain some of the high-level statistical information from the original dataset. However, individual bias may not be retained during the data reproduction process.

More datasets and experimental evaluations are required in order to uncover the relationship that may exist between real and VAE-generated tabular data.