1 Introduction and Related Work
The large volume of data and the diversity of its sources (financial time series, customer characteristics, financial statements, etc.) in the financial industry create many applications where unsupervised learning techniques can be leveraged successfully. Clustering methods in particular can find meaningful patterns in this unstructured data and support decision making. Applications of clustering techniques to financial problems can be found in risk analysis kou2014evaluation, credit scoring xiao2016ensemble, financial time series analysis gavrilov2000mining, dias2015clustering, portfolio management lemieux2014clustering and anomaly detection in financial statements omanovic2009line. In all these applications, being able to interpret the obtained clusters and explain the rationale behind their construction is necessary to trust them and to provide transparency to regulators.
Some previous works tackle the problem of interpreting clusters by visualizing them across two or three dimensions, typically found by principal component analysis (PCA) rao1964use. This has the disadvantage of restricting the number of dimensions used in the explanation; in addition, the principal components are no longer directly interpretable. Another group of methods uses the centroid or a selected set of points to represent the cluster radev2004centroid. These methods, while successful in some cases, are very sensitive to the geometry of the clusters. Distinct from these previous methods that directly interpret the clusters, two-step methods explain the clusters through an interpretable model that learns how to classify them. The cluster assignment of each point can be used to label the data and train a classifier on them. Classification trees breiman2017classification are often used in practice hancock2003supervised. Because the model has to be interpretable, this prevents the use of a larger class of models, such as deep neural networks, that could potentially provide better classification accuracy. Another work proposes to directly generate interpretable tree-based clustering models bertsimas2018interpretable. It has the drawback of restricting the type of clustering algorithm that can be used.
To overcome these limitations, we propose to interpret a classifier trained on the clusters by using the SFIT method horel2019computationally. This method identifies the statistically significant features of the model as well as feature interactions of any order in a hierarchical manner. It can be applied to any classification model and is computationally efficient. Hence, by combining a two-step method with this general model interpretability method, we do not have to restrict the choice of clustering technique or the choice of classifier used to predict the cluster assignment. This provides a general interpretability framework that can be applied to any clustering algorithm and type of data.
The structure of this paper is as follows. In section 2, we present a real use case from the financial industry that poses the problem of interpreting clusters. After explaining our method, we describe in section 3 its key component: the SFIT method used to interpret the cluster classifier. Because the data of the described use case are highly sensitive, we could not use them directly. Instead, we illustrate our method in section 4 on a dataset of U.S. companies clustered using their financial ratios.
2 Explainable clustering for compliance monitoring
2.1 An overview of the business case
In banks, Wealth Management teams help customers meet their financial goals by managing their financial assets. An account is associated with each client, and an investment strategy is designed by the account manager according to the risk aversion of the client. Using financial performance metrics as features, the efficiency of an account's strategy can be monitored. Comparisons with benchmarks or indices are examples of such performance metrics.
When an account is underperforming based on these features, it is directed to the Compliance team. In practice, the Compliance team has to spend a considerable amount of time manually reviewing accounts across all monitoring activities and across all global regions to catch the problematic ones. They typically use criteria learned from previous cases to identify the accounts to escalate for further investigation. This is often done using very few features and without a principled procedure. As a consequence, a systematic explanation of why an account was flagged as problematic is missing in most cases.
In order to automatise this procedure and cover all the different cases that result in underperforming behavior, a clustering algorithm is implemented. Clustering has the advantage of taking into account a large number of different risk factors. It makes it possible to group accounts that behave similarly and to understand the underlying reasons for poor performance. Clustering accounts can also be used to sample from the cluster distribution and estimate the frequency of one particular type of underperformance. However, the clustering method has to be fully understandable by the Wealth Management Compliance team so that they can explain which significant features characterise each cluster and ensure that it is aligned with their expertise and domain knowledge.
2.2 A proposal to mitigate the compliance monitoring challenge
As mentioned in the previous section, our goal is to design an interpretable clustering capability for Compliance Monitoring teams that can then be used to select underperforming accounts.
The Clustering Step. A clustering algorithm (such as K-means, DBSCAN, or agglomerative clustering) is run over the set of all underperforming accounts.
The Explaining Step. A label is assigned to every cluster, which makes it possible to label the whole set of accounts. We can then train a classifier in a supervised way using this labeled dataset. This classifier learns to predict the cluster of a given account. The SFIT procedure can now be applied to the trained classifier. A single model has been trained to classify all clusters, but we ultimately want to interpret each cluster independently. To interpret a specific cluster, the SFIT method is run using only the accounts belonging to this cluster. This provides, for each cluster, a set of features that significantly characterise it.
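The two steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used for the business case: the tiny K-means routine, the nearest-centroid classifier and the toy accounts are all hypothetical stand-ins, and any clustering algorithm or classifier could be substituted. The interpretation of each cluster (the SFIT run itself) is not shown here; the sketch only prepares the per-cluster subsets on which it would be run.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]


def nearest(x, centers):
    return min(range(len(centers)), key=lambda k: sq_dist(x, centers[k]))


def kmeans(X, k, n_iter=20):
    """Stand-in clustering step: farthest-point init, then Lloyd iterations."""
    centers = [X[0]]
    while len(centers) < k:
        centers.append(max(X, key=lambda x: min(sq_dist(x, c) for c in centers)))
    for _ in range(n_iter):
        labels = [nearest(x, centers) for x in X]
        for j in range(k):
            members = [x for x, y in zip(X, labels) if y == j]
            if members:  # keep the previous center if a cluster becomes empty
                centers[j] = centroid(members)
    return [nearest(x, centers) for x in X]


def fit_nearest_centroid(X, labels):
    """Stand-in classifier: predicts the class whose mean is closest."""
    classes = sorted(set(labels))
    centers = [centroid([x for x, y in zip(X, labels) if y == c]) for c in classes]
    return lambda x: classes[nearest(x, centers)]


def explainable_clustering(X, k):
    labels = kmeans(X, k)                         # clustering step
    classifier = fit_nearest_centroid(X, labels)  # explaining step: supervised fit
    # Accounts grouped by cluster, ready for a per-cluster interpretation run.
    per_cluster = {c: [x for x, y in zip(X, labels) if y == c] for c in set(labels)}
    return labels, classifier, per_cluster


# Hypothetical accounts described by two performance metrics.
accounts = [[-2.0, 0.1], [-1.8, 0.0], [-2.1, -0.1],
            [0.5, 2.0], [0.6, 2.2], [0.4, 1.9],
            [3.0, -1.0], [3.2, -1.1]]
labels, classifier, per_cluster = explainable_clustering(accounts, k=3)
```

Keeping a single classifier for all clusters and slicing the data per cluster afterwards mirrors the description above: the model is trained once, and only the data passed to the interpretation method changes.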
3 Presentation of the SFIT method
3.1 Introduction
SFIT horel2019computationally is a method that assesses the statistical significance and importance of the features of machine learning models. It is based on a novel application of a forward-selection approach. Given a trained model and one of its features, it compares the predictive performance of the model that uses only the intercept with that of the model that uses both the intercept and the feature. The performance difference captures the intrinsic contribution of the feature in isolation, which leads to an informative notion of feature importance. This approach has the advantage of being robust to correlation among features. Other advantages of the method include: (1) it makes no assumptions on the distribution of the data or on the specification of the model; (2) it can be applied to both continuous and categorical features; (3) in addition to assessing the contribution of individual features, it can also identify higher-order interactions among features in a hierarchical manner.
Formally, let $\{(X_i, Y_i)\}_{i=1}^{n}$ be a set of $n$ i.i.d. accounts with $(X_i, Y_i) \in \mathbb{R}^{p+1} \times \{1, \dots, K\}$. $X_i$ is a vector of size $p+1$ that contains the $p$ features measuring the performance of the account plus an intercept at the first coordinate. The features can be a mix of continuous and categorical variables, with the latter assumed to be binary variables through one-hot encoding. $Y_i$ represents the index of the cluster, where $K$ represents the total number of clusters. We randomly split the $n$ accounts into two subsets $I_1$ and $I_2$ and denote the two corresponding splits of the data as $D_1 = \{(X_i, Y_i) : i \in I_1\}$ and $D_2 = \{(X_i, Y_i) : i \in I_2\}$.
We denote by $f$ the cluster classifier trained on $D_1$. To evaluate the contribution of the individual feature $j$, removed from the potential interactions that it could have with the remaining features, we define the transformed input vector $X_i^{(j)}$, which is obtained from $X_i$ by replacing all entries except for the $j$-th coordinate and the intercept with $0$. Similarly, $X_i^{(1)}$ is the transformed input vector where all entries except for the intercept are set to zero. This transformed input prevents us from having to refit a new model for each input feature. Then, given the loss function $L$ used to train the classifier (the cross-entropy loss, for instance), we can define
$$\Delta_i^{(j)} = \beta \, L\big(Y_i, f(X_i^{(1)})\big) - L\big(Y_i, f(X_i^{(j)})\big), \qquad i \in I_2,$$
the difference between the prediction loss of the model using the intercept term only and the loss of the model using the intercept plus feature $j$. Here $\beta$ is a scaling factor: $\beta$ times the baseline loss, and not the baseline loss itself, is considered in order to make this test more robust to non-informative variables and to control its type-I error. More details about this parameter and how to optimally select $\beta$ can be found in horel2019computationally.
Let us now define $m_j$, the metric that is used to assess the significance of feature $j$:
$$m_j = \operatorname{median}_{i \in I_2} \, \Delta_i^{(j)}$$
for $2 \le j \le p+1$. $m_j$ is defined as the median over the inference set $D_2$ of the differences of predictive performance. Intuitively, $m_j$ represents the predictive power of variable $j$ compared to a baseline model that only has an intercept term: the larger it is, the more predictive power the variable contains.
Using a standard sign test, it is possible to obtain a finite-sample confidence interval for $m_j$ and to perform the following one-sided hypothesis test of significance:
$$H_0 : m_j \le 0 \quad \text{versus} \quad H_1 : m_j > 0,$$
using the statistic $n_j = \#\{i \in I_2 : \Delta_i^{(j)} > 0\}$.
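The first-order procedure described above can be illustrated with a short Python sketch. It is a simplified reading of the text, not the reference SFIT implementation: the function names, the toy softmax classifier, the toy data and the value of the scaling factor (`beta = 0.9`) are all assumptions made for illustration. Note that the code is zero-indexed, so the intercept sits at index 0 rather than at the first coordinate. For each inference point, it computes the scaled intercept-only loss minus the loss obtained when only the intercept and feature `j` are kept, then reports the median of these differences and a one-sided sign-test p-value.

```python
import math
from statistics import median


def cross_entropy(y, probs):
    return -math.log(max(probs[y], 1e-12))


def mask(x, keep):
    """Zero every coordinate except the intercept (index 0) and those in keep."""
    return [v if (k == 0 or k in keep) else 0.0 for k, v in enumerate(x)]


def sfit_first_order(predict_proba, X, Y, j, beta):
    """Return (median of the loss differences, sign-test p-value) for feature j."""
    deltas = []
    for x, y in zip(X, Y):
        base = cross_entropy(y, predict_proba(mask(x, ())))
        with_j = cross_entropy(y, predict_proba(mask(x, (j,))))
        deltas.append(beta * base - with_j)
    n = len(deltas)
    n_pos = sum(d > 0 for d in deltas)  # sign-test statistic
    # One-sided p-value: P(Binomial(n, 1/2) >= n_pos).
    p_value = sum(math.comb(n, k) for k in range(n_pos, n + 1)) / 2 ** n
    return median(deltas), p_value


def toy_classifier(x):
    # Hypothetical 2-class softmax model in which only feature 1 matters.
    scores = [0.0, 0.1 * x[0] + 3.0 * x[1]]
    top = max(scores)
    z = [math.exp(s - top) for s in scores]
    return [v / sum(z) for v in z]


# Toy inference set: intercept, an informative feature, a noise feature.
X = [[1.0, -1.0, 0.3], [1.0, -1.2, -0.5], [1.0, 1.1, 0.2],
     [1.0, 0.9, -0.4], [1.0, -0.8, 0.1], [1.0, 1.3, 0.6]]
Y = [0, 0, 1, 1, 0, 1]

m1, p1 = sfit_first_order(toy_classifier, X, Y, j=1, beta=0.9)
m2, p2 = sfit_first_order(toy_classifier, X, Y, j=2, beta=0.9)
```

In this toy setting the informative feature yields a positive median and a small p-value, while the noise feature, which leaves the prediction unchanged, yields a negative median because the baseline loss is scaled by a factor below one. This is one way to see how scaling the baseline guards against non-informative variables.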
This method can be generalized to higher-order interactions between features, as explained in more detail in horel2019computationally.
4 Experiment
4.1 Data
To illustrate our explainable clustering method, we use the Financial Ratios Firm Level dataset from Wharton Research Data Services (WRDS). This dataset provides, for all U.S. companies, 71 commonly used financial ratios grouped into the following seven categories: capitalization, efficiency, financial soundness/solvency, liquidity, profitability and valuation. From this database, we extract a subset of 682 companies, which corresponds to all the unique companies listed over the last 10 years. In addition, we have for each company its NAICS (North American Industry Classification System) code and description. The data are centered and scaled to unit variance as a preprocessing step.
4.2 Results
We cluster our dataset of companies into 5 clusters using an agglomerative hierarchical clustering algorithm. This algorithm works in a bottom-up fashion: each observation starts in its own cluster, and clusters are then successively merged together. We use Ward linkage, which minimizes the sum of squared differences within all clusters, together with the Euclidean distance. We obtain the following clusters:

cluster 1: 201 samples (manufacturing, retail),

cluster 2: 277 samples,

cluster 3: 139 samples (energy, resources),

cluster 4: 60 samples (telecommunication, technology),

cluster 5: 5 samples.
Because cluster 5 is too small to support a meaningful analysis, we choose to discard it. By looking at the industry codes of the companies in each cluster, we notice that the largest cluster contains a mix of various industries, while the 3 remaining clusters are fairly specialized, as listed above.
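The preprocessing and clustering steps described above can be sketched as follows. This is a naive pure-Python illustration of standardization followed by bottom-up merging with the Ward criterion, not the code used for the experiment: the function names and the toy data are hypothetical, and a library implementation would be preferable on a dataset of realistic size. At each step the sketch merges the pair of clusters whose union increases the within-cluster sum of squares the least, which for two clusters of sizes $n_a$ and $n_b$ with centroids $c_a$ and $c_b$ equals $\frac{n_a n_b}{n_a + n_b} \lVert c_a - c_b \rVert^2$.

```python
from itertools import combinations
from statistics import mean, pstdev


def standardize(rows):
    """Center each column and scale it to unit variance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(r, mus, sds)] for r in rows]


def ward_clustering(rows, n_clusters):
    """Naive agglomerative clustering with the Ward merge criterion."""
    # Each cluster is stored as (member indices, centroid).
    clusters = [([i], list(r)) for i, r in enumerate(rows)]
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            (ia, ca), (ib, cb) = clusters[a], clusters[b]
            sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
            # Increase in within-cluster sum of squares if a and b are merged.
            cost = len(ia) * len(ib) / (len(ia) + len(ib)) * sq_dist
            if best is None or cost < best[0]:
                best = (cost, a, b)
        _, a, b = best
        (ia, ca), (ib, cb) = clusters[a], clusters[b]
        na, nb = len(ia), len(ib)
        merged = (ia + ib, [(na * x + nb * y) / (na + nb) for x, y in zip(ca, cb)])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    labels = [0] * len(rows)
    for label, (members, _) in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Hypothetical companies described by two ratios.
ratios = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
          [5.0, 5.1], [5.1, 5.0],
          [9.9, 0.0], [10.0, 0.1]]
company_labels = ward_clustering(standardize(ratios), n_clusters=3)
```

The pairwise search makes each merge quadratic in the number of current clusters, which is acceptable for a toy example but is the main reason to rely on an optimized implementation in practice.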
We then label each company using its cluster assignment. We train a fully connected neural network with three hidden layers to perform classification on the clusters. The dataset is split into three parts: a training set of size 480, a validation set of size 125 and a test set of size 77. We optimize the architecture of the network through random search over the validation set. We end up using ReLU as the activation function, with a first hidden layer of size 100, a second of size 50 and a third of size 25. The network is trained for at most 50 epochs using the Adam optimizer and early stopping. We obtain a classification accuracy of 0.88 on the test set.
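To make the selected architecture concrete, the sketch below builds a network with hidden layers of sizes 100, 50 and 25, ReLU activations and a softmax output, and runs a forward pass. Several choices are assumptions not stated in the text: the input width of 71 (one input per financial ratio), the 4 output classes (the clusters kept after discarding cluster 5) and the He-style random initialisation. Training with Adam, early stopping and the random search are deliberately omitted, so this only illustrates the shape of the model.

```python
import math
import random

# Hidden sizes are from the text; input and output sizes are assumptions.
LAYER_SIZES = [71, 100, 50, 25, 4]


def init_params(sizes, seed=0):
    """Random weights (He-style scaling, an assumption) and zero biases."""
    rng = random.Random(seed)
    params = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        scale = math.sqrt(2.0 / n_in)
        W = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
        b = [0.0] * n_out
        params.append((W, b))
    return params


def forward(params, x):
    """Return class probabilities for a single input vector x."""
    h = x
    for layer, (W, b) in enumerate(params):
        z = [sum(w * v for w, v in zip(row, h)) + bias for row, bias in zip(W, b)]
        is_last = layer == len(params) - 1
        h = z if is_last else [max(0.0, v) for v in z]  # ReLU on hidden layers
    top = max(h)
    e = [math.exp(v - top) for v in h]  # numerically stable softmax
    return [v / sum(e) for v in e]


def n_parameters(params):
    return sum(len(W) * len(W[0]) + len(b) for W, b in params)


params = init_params(LAYER_SIZES)
probs = forward(params, [0.0] * LAYER_SIZES[0])
```

With the sizes assumed above, `n_parameters(params)` gives 13629 trainable parameters, a count that is large relative to a training set of 480 companies and is consistent with the use of early stopping.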
Finally, we run the SFIT method on the trained network once per cluster, i.e. using only the data of that cluster, and return the five most important features. For the first cluster, the five most important features are gross profit margin, asset turnover, long-term debt, current debt and net profit margin; the values of their test statistics, along with their 95% confidence intervals, can be found in Table 1. For the second cluster, they are book/market, free cash flow/operating cash flow, pretax return on total earning assets, sales/invested capital and total debt/ebitda, as displayed in Table 2. For the third cluster, they are total debt/ebitda, operating profit margin before depreciation, pretax return on total earning assets, free cash flow/operating cash flow and return on assets, as shown in Table 3. For the last cluster, they are research and development/sales, cash balance/total liabilities, gross profit margin, cash ratio and operating cf/current liabilities, as displayed in Table 4. From these results, we can see that the most predictive features are not the same for each cluster. They capture what makes a cluster distinct from the others. This is an efficient way to interpret and explain the intrinsic characteristics of a cluster. These results are also consistent with the main types of industries present in each cluster. For instance, the research and development/sales ratio is the most significant feature of the cluster that mainly contains telecommunication and technology companies.

Table 1: Five most important features for cluster 1.
Variable        Median  95% CI lower bound  95% CI upper bound
gpm             0.755   0.657               0.835
at_turn         0.677   0.554               0.850
lt_debt         0.556   0.366               0.615
curr_debt       0.542   0.391               0.728
npm             0.532   0.509               0.568

Table 2: Five most important features for cluster 2.
Variable        Median  95% CI lower bound  95% CI upper bound
bm              0.114   0.102               0.126
fcf_ocf         0.057   0.055               0.059
pretret_earnat  0.043   0.024               0.061
sale_invcap     0.037   0.035               0.040
debt_ebitda     0.031   0.021               0.039

Table 3: Five most important features for cluster 3.
Variable        Median  95% CI lower bound  95% CI upper bound
debt_ebitda     0.856   0.746               0.987
opmbd           0.692   0.360               0.903
pretret_earnat  0.518   0.489               0.545
fcf_ocf         0.419   0.235               0.656
roa             0.382   0.337               0.422

Table 4: Five most important features for cluster 4.
Variable        Median  95% CI lower bound  95% CI upper bound
rd_sale         1.89    1.70                2.02
cash_lt         1.23    1.09                1.54
gpm             0.718   0.564               0.796
cash_ratio      0.449   0.333               0.527
ocf_lct         0.283   0.119               0.598
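Interval bounds of the kind reported in the tables above can be obtained without distributional assumptions from the order statistics of the per-sample loss differences. The sketch below shows one standard sign-test-based construction for a median; the text does not state whether the reported intervals were computed exactly this way, so it should be read as an illustration, and the function name and sample values are hypothetical.

```python
import math


def median_confidence_interval(values, alpha=0.05):
    """Distribution-free interval for the median with coverage >= 1 - alpha."""
    xs = sorted(values)
    n = len(xs)
    # Find the largest k such that P(Binomial(n, 1/2) <= k - 1) <= alpha / 2.
    k, cdf = 0, 0.0
    while k < n:
        next_cdf = cdf + math.comb(n, k) / 2 ** n
        if next_cdf > alpha / 2:
            break
        cdf = next_cdf
        k += 1
    if k == 0:
        raise ValueError("sample too small for the requested confidence level")
    # The interval runs from the k-th smallest to the k-th largest value.
    return xs[k - 1], xs[n - k]


# Ten hypothetical loss differences.
lower, upper = median_confidence_interval([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
```

For ten observations and a 95% level, the interval runs from the second smallest to the second largest value, here 2 and 9. Because the binomial distribution is discrete, the actual coverage is at least the nominal level and usually somewhat higher.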
5 Conclusion
We propose in this paper a novel two-step method that can interpret any clustering algorithm. For each cluster, this method identifies the statistically significant features that characterise it, as well as feature interactions. We justify the necessity of such a method by describing a Wealth Management Compliance use case that requires explaining clusters of underperforming accounts. We demonstrate its effectiveness on a dataset of financial ratios of U.S. companies.
Acknowledgments
The authors would like to thank the Wealth Management Technology team at J.P. Morgan and especially Amish Seth for their collaboration and help to gain insight on the business case.