Company classification using machine learning

by Sven Husmann, et al.
European University Viadrina

The recent advancements in computational power and machine learning algorithms have led to vast improvements in manifold areas of research. Especially in finance, the application of machine learning enables researchers to gain new insights into well-studied areas. In our paper, we demonstrate that unsupervised machine learning algorithms can be used to visualize and classify company data in an economically meaningful and effective way. In particular, we implement the t-distributed stochastic neighbor embedding (t-SNE) algorithm due to its beneficial properties as a data-driven dimension reduction and visualization tool in combination with spectral clustering to perform company classification. The resulting groups can then be implemented by experts in the field for empirical analysis and optimal decision making. By providing an exemplary out-of-sample study within a portfolio optimization framework, we show that meaningful grouping of stock data improves the overall portfolio performance. We, therefore, introduce the t-SNE algorithm to the financial community as a valuable technique both for researchers and practitioners.




1 Introduction

Machine learning has become one of the most influential phrases in the business world as well as within the recent financial literature. The applications are manifold, some of which include financial signal processing (AkansuEtAl_2016), stock selection (RasekhschaffeJones_2019), and portfolio optimization (BanEtAl_2018; JainJain_2019). However, the practical implementation of such techniques is often complicated and time-consuming and therefore does not match the real challenges an investor has to face with respect to efficiency and performance. The appropriate usage of machine learning algorithms thus requires deep knowledge of the underlying mechanics and, more importantly, established, functional frameworks for straightforward application.

In this paper, we combine recent advancements in the field of unsupervised learning with the practical expertise of an investor and introduce a machine learning tool for company classification which is easy to implement and works best when used as a decision support module. In particular, we focus on the general problem that investors are often confronted with complex and potentially high-dimensional datasets such as stock returns. To gain necessary and more precise insights for such data, companies can be grouped in a meaningful way by a machine learning algorithm. Here, we first apply the popular t-SNE, developed by MaatenHinton_2008, to decrease the dimension of financial data which allows a reasonable, easily-understandable graphical visualization. In a second step, these illustrations can be either visually interpreted by an expert or used within a data-driven classification analysis to detect clusters of similar companies. In a final step, the resulting groups can be utilized to improve existing approaches of financial research and practical applications, for example portfolio optimization.

Up to this date, the t-SNE algorithm is regarded as a standard tool for dimension reduction and data illustration (see, e.g., MaatenHinton_2008; PezzottiEtAl_2017; RogovschiEtAl_2017; SchubertGertz_2017). As a result, various applications have been developed, mostly in natural science (LindermanEtAl_2019; TravenEtAl_2017). Although in a less extensive manner, t-SNE has recently been considered by the financial literature as well. KalsyteVerikas_2013 exploit the properties of t-SNE to create an ordered 2D map, which allows them to explore the financial soundness of companies. Sarlin_2015 provides a qualitative overview of data and dimension reduction methods for visual financial performance analysis. WuEtAl_2019 introduce a deep learning framework for predicting stock prices. Throughout the analysis, they use a two-dimensional t-SNE from the final stock representation to assess the interpretation of the stocks’ universe. Still, the current financial research exploits t-SNE for visualization purposes only, whereas our approach employs t-SNE not only as a visualization tool but also as a dimension-reduction tool. In particular, we utilize the produced low-dimensional data as the basis for an efficient, data-driven clustering (or classification) of companies.

To validate the proposed machine learning tool, we first show that the clusters resulting from an application of t-SNE on stock returns provide similar results to a proprietary industry classification scheme. Moreover, as we demonstrate within a portfolio optimization framework, groups based on t-SNE yield better results than other standard classification approaches. The properties of t-SNE, especially its fitness in case of nonlinear data structures, result in an effective dimension-reduction of the financial data and therefore, in an outperforming classification of the respective companies.

The rest of the paper is organized as follows: In Section 2, we introduce the main properties of t-SNE for company classification. Section 3 outlines a decision engine for the optimal selection of algorithm-specific tuning parameters within a general model framework. In Section 4, we provide an exemplary application of the proposed machine learning tool for portfolio optimization and report the corresponding empirical results. Section 5 summarizes and concludes.

2 t-SNE for Company Classification

Due to its ability to reduce the dimension of a dataset and to provide a good interpretation of underlying structures, t-SNE of MaatenHinton_2008 is nowadays mostly used as an unsupervised visualization tool for high-dimensional data. In that sense, t-SNE can be compared to the well-known principal component analysis (PCA). While PCA achieves dimension reduction by a linear projection according to the ordered eigenvalues of the covariance matrix, t-SNE exploits nonlinear relationships present within the dataset. Moreover, instead of using only the global structure of the data, t-SNE balances global and local structure by introducing a so-called perplexity parameter, which represents the trade-off between both. This trade-off is assured by the following two-step process. First, t-SNE creates a normal probability distribution over the full dimension of the dataset to measure the similarity between different data point series $x_i$ and $x_j$ and to compute the corresponding conditional probability for each data pair with

$$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}, \tag{1}$$

where $\sigma_i$ refers to the standard deviation of the Gaussian centered on the data series $x_i$. In a second step, t-SNE maps these probabilities to a lower, usually two- or three-dimensional probability distribution, given as

$$q_{j|i} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq i} \left(1 + \lVert y_i - y_k \rVert^2\right)^{-1}}, \tag{2}$$

where the new lower-dimensional representations of $x_i$ and $x_j$ are $y_i$ and $y_j$, respectively. t-SNE then uses the Kullback-Leibler divergence to minimize the difference between the two probability measures $p_{j|i}$ and $q_{j|i}$. The novelty of t-SNE lies in the second step of this process. In particular, whereas the original SNE algorithm by RoweisEtAl_2002 uses a Gaussian distribution in the low-dimensional space (of the same form as Equation (1)), t-SNE employs a Student-t distribution with one degree of freedom to address the so-called “crowding problem” and the optimization problems of SNE (MaatenHinton_2008). By utilizing a heavy-tailed distribution, t-SNE allows the data to be spread more widely within the low-dimensional space. Hence, the resulting two- or three-dimensional data can be easily visualized to reveal distinct clusters, also referred to as groups, which can then be utilized for specific financial applications.
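The two-step process described above can be illustrated with a minimal sketch in plain Python. It computes Gaussian conditional similarities in the original space, Student-t similarities (one degree of freedom) in a candidate low-dimensional embedding, and the Kullback-Leibler divergence between them. The function names and the toy data are assumptions made for illustration only. The sketch evaluates the cost for a given embedding and does not perform the gradient-based optimization; full t-SNE implementations additionally symmetrize the similarities.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def conditional_p(X, sigmas):
    """Gaussian similarities p_{j|i} in the original space (row i sums to 1)."""
    n = len(X)
    P = []
    for i in range(n):
        w = [0.0 if k == i
             else math.exp(-sq_dist(X[i], X[k]) / (2.0 * sigmas[i] ** 2))
             for k in range(n)]
        total = sum(w)
        P.append([v / total for v in w])
    return P


def conditional_q(Y):
    """Student-t similarities q_{j|i} in the embedding (row i sums to 1)."""
    n = len(Y)
    Q = []
    for i in range(n):
        w = [0.0 if k == i else 1.0 / (1.0 + sq_dist(Y[i], Y[k]))
             for k in range(n)]
        total = sum(w)
        Q.append([v / total for v in w])
    return Q


def kl_divergence(P, Q):
    """Sum over all pairs of p * log(p / q); zero-probability terms are skipped."""
    return sum(p * math.log(p / q)
               for row_p, row_q in zip(P, Q)
               for p, q in zip(row_p, row_q) if p > 0.0)


if __name__ == "__main__":
    # Hypothetical toy data: two tight pairs of points in three dimensions.
    X = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 5.0, 5.0), (5.1, 5.0, 5.0)]
    P = conditional_p(X, sigmas=[1.0] * len(X))
    faithful = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
    scrambled = [(0.0, 0.0), (5.0, 5.0), (0.1, 0.0), (5.1, 5.0)]
    print("KL, faithful embedding: ", kl_divergence(P, conditional_q(faithful)))
    print("KL, scrambled embedding:", kl_divergence(P, conditional_q(scrambled)))
```

An embedding that keeps neighbors together yields a smaller divergence than one that separates them, which is the quantity t-SNE minimizes.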

2.1 Challenges

Despite its broad applicability and promising properties for dimension reduction and visualization of high-dimensional data, the t-SNE does exhibit some drawbacks. One practical problem is that the distances between the resulting data points on the low-dimensional space do not have any meaningful interpretation. In addition, contrary to PCA, t-SNE does not provide a functional representation on how to map the data. It is therefore necessary to always apply the t-SNE algorithm on the full dataset with all its components. Most importantly, as the performance of t-SNE depends on the choice of the perplexity parameter, the resulting groups can sometimes be misleading. For instance, if the perplexity is chosen to be too low, then the t-SNE will incorporate mostly the local data structure, which results in too many groups with lower inter-group dissimilarity and therefore, lower quality of such grouping. In contrast, if the perplexity is chosen too high, relevant groups cannot be detected, as the algorithm puts too much weight on the global data structure, and the resulting groups would display lower intra-group similarity, which again damages the value of the results. As a consequence, in order to exploit the properties of t-SNE, practitioners and researchers should use it as an effective decision support module - either by additionally incorporating expert knowledge of the data, for example, which groups do indeed make sense, or by applying a proper data-driven validation method.
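To make the role of the perplexity parameter more concrete, the following sketch uses the definition from the original t-SNE literature, where the perplexity of a conditional distribution is 2 raised to its Shannon entropy (in bits), and the bandwidth of each Gaussian is found by a search so that this value matches the user's target. The function names, search bounds and toy distances are assumptions for illustration; actual implementations differ in detail.

```python
import math


def conditional_row(sq_dists, sigma):
    """Gaussian similarities of one point to all others for a given bandwidth.

    sq_dists holds the squared distances to the other points. The smallest
    distance is subtracted before exponentiating to avoid numerical underflow;
    this does not change the normalized probabilities.
    """
    d_min = min(sq_dists)
    weights = [math.exp(-(d - d_min) / (2.0 * sigma ** 2)) for d in sq_dists]
    total = sum(weights)
    return [w / total for w in weights]


def perplexity(row):
    """Perplexity = 2 ** (Shannon entropy in bits)."""
    entropy = -sum(p * math.log2(p) for p in row if p > 0.0)
    return 2.0 ** entropy


def sigma_for_perplexity(sq_dists, target, lo=1e-3, hi=1e3, iters=100):
    """Geometric bisection: perplexity grows monotonically with sigma."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if perplexity(conditional_row(sq_dists, mid)) > target:
            hi = mid
        else:
            lo = mid
    return math.sqrt(lo * hi)


if __name__ == "__main__":
    # Hypothetical squared distances from one company to five others.
    sq_dists = [1.0, 4.0, 9.0, 16.0, 25.0]
    for target in (1.5, 3.0, 4.5):
        sigma = sigma_for_perplexity(sq_dists, target)
        print(f"target perplexity {target}: sigma = {sigma:.4f}")
```

A low target perplexity forces a narrow Gaussian, so each point effectively "sees" only its closest neighbors (local structure); a high target widens it toward the whole dataset (global structure).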

Figure 1: Visualization of companies after applying a two-dimensional t-SNE with different perplexities.

To demonstrate the outlined challenges, Figure 1 shows, as an example, different outcomes of applying t-SNE to the daily returns of 318 S&P500 companies. To create the pictures from top left to bottom right, we conducted a t-SNE dimension reduction based on 318 companies from the S&P500 index using the respective stock returns over a four-year period from 01/14/14 to 12/31/18. Each orange point in the pictures corresponds to the two-dimensional transformed values and therefore represents one company. The main idea is that, based on the proximity of different points arranged in clusters, the user can treat companies which are close together as a group. This can be done either manually, by visual interpretation, or automatically, in a data-driven way. For the purpose of illustration at this point, within each picture we performed a manual grouping by circling data points, i.e. companies, which visibly form a distinct cluster. For each picture we use the same dataset but different perplexities within the t-SNE, namely 6, 12, 30, and 60, respectively. The different choices of the perplexity parameter clearly lead to vastly different outcomes. In the case of a high perplexity, e.g. 60, only one distinct group can be found. However, for low perplexities such as 6, one can find eight groups and more. As argued beforehand, it can be easily observed from Figure 1 that the perplexity influences the number of visibly detectable groups.

2.2 t-SNE and visual Classification

In a real-life scenario an investor could now determine which perplexity to use by examining the groups of each outcome. That could be achieved either in accordance with practical experience or industry knowledge, or by applying a data-driven classification algorithm. To provide some insight into the quality of the grouping obtained from combining t-SNE with a classification algorithm, in Figure 2 we show the established groups with an exemplary perplexity parameter of 12. In addition, on the left-hand side, we use a proprietary industry classification system as a benchmark.

Figure 2: Visualization of companies based on a two-dimensional t-SNE with a perplexity of 12. The colors of the points correspond to a grouping, performed with the Thomson Reuters 2-digit industrial code (left) and a classification using spectral clustering with 10 groups (right).

By using the first two digits of the Thomson Reuters Business Classification (TRBC2) for each company, we visualize the companies from the same industry according to TRBC2 with the same color, for example, every teal colored point represents a company which belongs to the same group, independent of its position in the two-dimensional graph. Here we can notice that the grouping, based on visual analysis of the proximity between data points, would indeed lead to groups which seem to cluster companies within the same industry. However, we point out that the Thomson Reuters Codes are not what is referred to as ground truth - the only real and true grouping for the considered companies. In fact, the real grouping is always unknown and constantly changing due to changes in companies’ policies, market size and other factors.

2.3 t-SNE and data-based Classification

In case no knowledge of the underlying groups is present or assumed, clustering algorithms can be applied for classification. Technically, not all clustering algorithms can utilize the result of t-SNE. As mentioned previously, distances between different clusters in t-SNE do not have any meaningful interpretation. In general, the output of t-SNE does not preserve the information on the original data density, which can be challenging especially for the class of density-based clustering methods such as DBSCAN (EsterEtAl_1996). Moreover, due to the various shapes of the clusters resulting from t-SNE, methods such as K-means clustering (MacQueenothers_1967) may not suffice to classify groups in a meaningful way. In detail, K-means aims to identify groups by calculating measures such as the Euclidean distance between different points and a so-called “centroid”. This leads to promising results only when the clusters are mostly spherical. In the case of t-SNE, however, the clusters can have a multitude of shapes, as Figure 1 shows. This property of t-SNE makes it an appropriate and effective choice for dimensionality reduction of heterogeneous high-dimensional stock return data, but limits, on the other hand, its applicability in combination with centroid-based clustering algorithms.

Up to this point, the literature suggests graph-based clustering algorithms, in particular spectral clustering (see, e.g., XieEtAl_2016; RogovschiEtAl_2017; LindermanSteinerberger_2019). The main idea here is that spectral clustering requires a similarity matrix derived from the data, from which a so-called Laplacian matrix is constructed. In the case of high-dimensional data, such a matrix is difficult to obtain and inefficient to use within the algorithm. This is where t-SNE can be applied as a dimension-reduction tool. The argument is additionally supported by LindermanSteinerberger_2019, who perform a rigorous analysis of the mathematical foundations of t-SNE and prove the connection to Laplacian eigenmaps and matrices (BelkinNiyogi_2003). Following this line of research, in this paper we use the spectral clustering algorithm developed by NgEtAl_2002, as implemented by KaratzoglouEtAl_2004. Moreover, within the spectral clustering algorithm we set the number of groups to 10, so that the results can be compared fairly to the ones produced by the industry classification with TRBC2. On the right-hand side of Figure 2 every point is colored according to the result of the spectral clustering algorithm applied to the output of the t-SNE. We can clearly observe that spectral clustering performs the grouping, first, more consistently than the industry classification and, second, more similarly to an investor who visually differentiates between the mapped data points. One advantage of the data-driven approach, however, is that it avoids a subjective assessment within the grouping process.
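As a rough illustration of the clustering step, the sketch below follows the general recipe of normalized spectral clustering (Gaussian affinity matrix, normalized affinity, leading eigenvectors, row normalization, K-means on the rows) using NumPy. It is not the implementation used in this paper: the function name, the fixed kernel width `sigma`, the deterministic K-means initialization and the toy coordinates are all assumptions. The input is meant to stand in for two-dimensional t-SNE output, which could be produced, for example, with `sklearn.manifold.TSNE` (not run here).

```python
import numpy as np


def spectral_clustering(points, n_groups, sigma=1.0, n_iter=50):
    """Simplified normalized spectral clustering of low-dimensional points."""
    # Gaussian affinity matrix with zero diagonal.
    sq = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    affinity = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(affinity, 0.0)

    # Symmetric normalization D^{-1/2} A D^{-1/2} (assumes no isolated points).
    degree = affinity.sum(axis=1)
    normalized = affinity / np.sqrt(np.outer(degree, degree))

    # Eigenvectors of the n_groups largest eigenvalues (eigh sorts ascending).
    _, vectors = np.linalg.eigh(normalized)
    embedding = vectors[:, -n_groups:]
    embedding = embedding / np.linalg.norm(embedding, axis=1, keepdims=True)

    # K-means on the rows, initialized deterministically by farthest points.
    centers = [embedding[0]]
    for _ in range(1, n_groups):
        dist = np.min([((embedding - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(embedding[np.argmax(dist)])
    centers = np.array(centers)

    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        dists = ((embedding[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = np.argmin(dists, axis=1)
        for g in range(n_groups):
            if np.any(labels == g):
                centers[g] = embedding[labels == g].mean(axis=0)
    return labels


if __name__ == "__main__":
    # Hypothetical 2-D coordinates standing in for t-SNE output of six companies.
    coords = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                       [5.0, 5.0], [5.1, 5.2], [4.9, 5.1]])
    print(spectral_clustering(coords, n_groups=2))
```

Because the grouping depends only on pairwise affinities rather than on distances to a centroid, this type of method can also separate non-spherical clusters, which is the motivation given above for preferring it over K-means.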

2.4 Additional Validation

After applying either a visual or data-based clustering to the output, the resulting grouping might be checked for plausibility by an expert. As we do not know the ground truth of the underlying groups, expert knowledge about the business models and/or sectors of companies can be applied to validate or change the labeling of companies. In the case of the dataset at hand, one could, for instance, ask why on the left-hand side of Figure 2 the group colored in green (identified with the code 55, referring to “Financials”) is spread across three distinct clusters, while on the right-hand side of the picture these clusters are assigned to different groups, colored in purple, orange and yellow, respectively. Interestingly, the grouping of the t-SNE indicates that this group should be divided into smaller distinct subgroups. The spectral clustering acknowledges this by, for instance, grouping one subset into the purple group in the upper right corner. After researching the business profile of these companies, we discover that they all belong to the real estate business, while the other companies are mostly banks and investment companies. This shows that the combination of t-SNE and spectral clustering could indeed provide further insight into the grouping than the industry code TRBC2. However, the yellow group produced by the t-SNE with spectral clustering seems to include companies which are not found in the 55 industry code of TRBC2. Analyzing them shows that this divergence is likely due to a wrong classification, as these are all automotive-related companies. An expert would have easily identified these as wrongly grouped and thus discarded them. Without the presence of such an expert, the investor can still use data-driven approaches to adjust the grouping by setting different starting points for the algorithm or changing the number of classes for the spectral clustering.

We perform similar adjustments, namely new starting points and a group size of 16, and display the results in Figure 3. It can now be observed that the grouping for the financial companies seems more accurate and in accordance with the visual representation of the t-SNE.

Figure 3: Visualization of companies based on a two-dimensional t-SNE with perplexity of 12. The colors of the points correspond to a grouping done by the Thomson Reuters 2-digit industrial code (left) and a classification using spectral clustering with 16 groups (right).

Still, even if some tuning parameter setting within t-SNE with spectral clustering might match the results from a classification according to the TRBC codes, such a grouping is not necessarily beneficial for usage within, for instance, a portfolio optimization framework. Moreover, a group formation based merely on visual analysis by an expert and general practical expertise may suffer from confirmation bias and subjective reasoning. In this paper, we argue that the t-SNE becomes especially relevant if the ground truth of the underlying grouping is unknown for some or all of the companies. Examples include groupings by industry, sustainability or similar artificial rankings, when the ranking is not available for a given company but is available for others. In that case, t-SNE can help identify the relevant cluster of that company by simply using similarities in the provided dataset, e.g. return data, balance-sheet data, social responsibility measures and others. Additionally, missing grouping data is often a practical issue, as proprietary data such as the TRBC is often not available for smaller, non-traded companies. In contrast, other information, like the balance sheet, is easily accessible. Furthermore, the application of a fully data-dependent approach such as t-SNE with spectral clustering allows for an even more precise quantitative improvement of the performed grouping.

3 Decision Engine for Optimal Grouping

Besides adjusting the parameters of the proposed algorithms to better match the visible clusters, an improvement can be achieved in a data-driven way with respect to a certain goodness-of-fit criterion. That criterion results from the original purpose of the study and can be, for instance, the mean squared error (MSE), the maximum likelihood, the minimum variance (MV) and others. In general, the perplexity parameter within t-SNE as well as the group size within spectral clustering can be regarded as tuning parameters. Changing these parameters will always yield different results. Moreover, due to the randomization of starting points in both the t-SNE and the spectral clustering, outcomes will differ in general. This provides the opportunity, and to some extent the necessity, to use cross-validation (CV) to detect the optimal parameters and hence the optimal grouping.

The main advantage of such a data-driven approach is its flexibility in terms of the model setup. For example, in the practically relevant case of company valuation with multiples (see, e.g. Schreiner_2009) the investor is obliged to define peer groups based on similar idiosyncratic risks. In this case, the number of groups to detect is based on the homogeneity between groups and thus dependent on the choice of companies. While choosing too many groups will lead to high correlations between groups and thus a lack of dissimilarity between them, choosing too few groups will lead to groups which contain companies with too diverse and incomparable business models. Another example from finance is portfolio optimization. From a theoretical standpoint, a portfolio of stocks is to be optimized with as few groups as possible (preferably only one group), so that the desired diversification effect can be achieved (see, e.g. Markowitz_1952; GreenHollifield_1992; DomianEtAl_2007). In reality, on the other hand, such a grouping results in large concentration ratios with comparably many companies per observation point, which negates the positive diversification effect due to high estimation error in the necessary parameters, namely the expected returns and the covariance matrix of returns. (For more information on estimation risk within portfolio optimization, see, e.g. Michaud_1989; BestGrauer_1991; Jorion_1992; ChopraZiemba_1993; JagannathanMa_2003; SiegelWoodgate_2007.) Considering the trade-off between estimation risk and diversification, it would be preferable to use a large number of groups with a small number of companies per group.

In any case, each investigation usually aims to optimize a parameter of interest, for instance MV for minimum-variance portfolios or MSE for company valuation. The t-SNE approach for grouping can add value to the general optimization framework by applying a new layer of decision and optimizing with respect to grouping. Aiming at an easy-to-implement framework for company classification with machine learning, in Figure 4 we present the decision engine for our grouping approach.

Figure 4: Decision engine for the incorporation of t-SNE and spectral clustering (spec) into another modeling approach. OM stands for original model, MP for model parameters, RS for results.

The decision engine requires the user to split the data into a training dataset, which is also referred to as in-sample data, a validation dataset, which is also referred to as out-of-sample data, and a test dataset, used for the final evaluation (see, e.g. Hjort_1996 for more information on the terminology). As depicted in Figure 4, from left to right, the modeler first uses the in-sample data and passes it to the t-SNE with spectral clustering approach. Based on a predefined parameter set, the modeler receives a variety of different results for the performed grouping. (For illustration purposes, Figure 4 shows three different scenarios overall.) The generated grouped data can then be used within the original model (OM). If the model, as for instance in the case of company valuation, requires already grouped companies, the resulting groups can be directly applied. However, if the original model, as is often the case, does not incorporate grouping, a slight adjustment is necessary. One of the simplest ways of adaptation is to apply the OM to each of the groups for a certain parameter and later aggregate the results with a suitable measure, for example the average. Applying the OM yields the necessary model parameters (MP). However, for a practically relevant performance evaluation, it is necessary to figure out which model performs best on a new dataset. This is shown on the right-hand side of Figure 4, where the MP are now directed to the out-of-sample dataset, keeping the earlier grouping step intact. The respective results (RS) can be drawn from any measure of fit and give information about each model’s performance, conditional on the chosen parameters. The parameter set which yields the best results is then chosen as the best model, leaving the investor overall not only with a potentially better fit to the data, but also with a suitable choice of grouping, given the problem at hand.

Using the proposed decision engine multiple times, by splitting the dataset repeatedly into different training and validation datasets and averaging over all results or decisions, results in a standard CV approach for the choice of parameters. The final chosen model is then applied to the test dataset, which is used neither in the training nor in the validation of the model. This test dataset, if chosen appropriately, can therefore be used to reveal the true predictive power of the chosen model.
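The flow of the decision engine can be summarized in a short sketch. The three callables `group_fn` (grouping on in-sample data), `fit_fn` (original model yielding model parameters) and `score_fn` (results on out-of-sample data, higher is better) are hypothetical placeholders, not part of any library; the toy callables at the bottom exist only to make the example runnable.

```python
from statistics import mean


def select_by_cv(splits, param_grid, group_fn, fit_fn, score_fn):
    """Pick the parameter set with the best average out-of-sample score.

    splits     : iterable of (train, validation) pairs
    param_grid : iterable of hashable parameter sets, e.g. (perplexity, n_groups)
    """
    avg_score = {}
    for params in param_grid:
        scores = []
        for train, validation in splits:
            groups = group_fn(train, params)        # grouping on in-sample data
            model_params = fit_fn(train, groups)    # OM -> MP
            # MP applied out-of-sample, keeping the in-sample grouping -> RS
            scores.append(score_fn(model_params, validation, groups))
        avg_score[params] = mean(scores)
    best = max(avg_score, key=avg_score.get)
    return best, avg_score


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs; real use would plug in t-SNE with
    # spectral clustering, the original model and, e.g., a Sharpe ratio.
    splits = [([1.0, 2.0, 3.0], [2.0, 2.0]), ([2.0, 4.0], [3.0])]
    grid = [(6, 2), (12, 10), (30, 16)]
    best, scores = select_by_cv(
        splits,
        grid,
        group_fn=lambda train, params: params[1],
        fit_fn=lambda train, groups: mean(train),
        score_fn=lambda mp, val, groups: -abs(mp - mean(val)) - abs(groups - 10),
    )
    print(best, scores)
```

The test dataset deliberately does not appear in this function: it is reserved for evaluating the single parameter set that the selection returns.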

The advantage of this approach is that original models, which usually do not have tuning parameters, such as the perplexity, can now be fine-tuned to provide a better fit to the data. Generally speaking, the grouping enriches the model variety and thus provides opportunity for a more widespread model evaluation. To show the wide range of applicability and the performance of our framework for company classification with t-SNE, spectral clustering and CV, we present an exemplary application from portfolio optimization.

4 Application for Portfolio Optimization

One of the most important challenges in finance is the problem of optimizing a portfolio. However, due to uncertainties in the estimation of the input parameters, namely the expected returns and the covariance of asset returns, the efficient estimation of optimal portfolio weights for each asset remains a challenge. Since Markowitz postulated what is nowadays known as Modern Portfolio Theory (Markowitz_1952), financial researchers have applied methods from various scientific fields for implementing portfolio strategies in practical scenarios. Some try to combine optimal portfolios, which perform poorly out-of-sample due to estimation risk, with other less optimal, but stable portfolios (see, e.g. KanZhou_2007; TuZhou_2011; Schanbacher_2014). Others penalize the optimal weights by using practical constraints or some form of regularization (see, e.g. JagannathanMa_2003; BrodieEtAl_2009; DeMiguelEtAl_2009; FanEtAl_2012a; FastrichEtAl_2015; HusmannEtAl_2019). Another line of research focuses on directly improving the estimation of the expected returns (see, e.g. Jorion_1991; BestGrauer_1991) and the covariance matrix of returns (see, e.g. LedoitWolf_2003; LedoitWolf_2004; LedoitWolf_2017b; FanEtAl_2013; GotoXu_2015; HusmannEtAl_2020). Interestingly, DeMiguelEtAl_2009a show that the estimation-free, equally-weighted portfolio can be superior to Markowitz portfolios in terms of relevant performance measures such as the Sharpe ratio, whereas the minimum-variance portfolio still achieves the lower risk level. Moreover, JagannathanMa_2003 conclude that the expected returns are much more prone to estimation error than the covariances of returns.

In this paper, we apply the proposed t-SNE with spectral clustering approach to improve the performance of the Markowitz portfolio from the perspective of efficient company classification. As we do not want to base our grouping on a potentially biased expert opinion, we use the data-driven approach described in Section 3. Our goodness-of-fit criterion is the out-of-sample Sharpe ratio, the standard goodness-of-fit measure for this optimization problem. (HusmannEtAl_2020 show that the implementation of a data-driven method requires the same or a similar optimality criterion as the optimization target of the original model.) As a next step, we proceed with a CV-based approach, which has recently been shown to be especially useful for optimizing minimum-variance portfolios (DeMiguelEtAl_2013; BanEtAl_2018; HusmannEtAl_2020). In the standard Markowitz portfolio optimization framework, by contrast, there is no need for a CV, as it is assumed that either all necessary parameters are known or at least can be sufficiently estimated. The application of our decision engine together with t-SNE and spectral clustering can therefore yield additional improvement of the results by selecting the tuning parameters in an optimal fashion.

4.1 Empirical Setup

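To effectively incorporate our approach, and hence a company classification, into the standard portfolio optimization problem, we need to slightly adjust the general optimization problem, defined as

$$\hat{w} = \arg\min_{w} \; w^\top \hat{\Sigma}\, w \quad \text{with} \quad w^\top \hat{\mu} = \bar{\mu}, \tag{3}$$

$$\text{s.t.} \quad w^\top \mathbf{1}_N = 1, \tag{4}$$

where $\hat{w}$ is the estimated weights vector for the $N$ assets, $\hat{\Sigma}$ is the estimated covariance of returns, $\hat{\mu}$ is the vector of estimated expected returns, $\bar{\mu}$ is a target expected return and (4) is the sum constraint, which ensures that all weights sum up to 1.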

Although this optimization is left untouched, following Section 3, we can easily pre-process the return data with t-SNE to gain relevant, low-dimensional grouped data. Considering that the weight parameter can be applied to any asset, single stocks as well as portfolios, we can adjust the underlying dataset in a meaningful way. The standard case, where every asset is treated individually, can be considered the special case of a grouping where the number of groups equals the number of assets. Thus, if we group all companies into smaller subgroups, we can technically consider these subgroups as portfolios as well. The only remaining question is how to find the weights for these group-based portfolios.

In order to keep the analysis simple and to avoid unnecessary estimation and all its uncertainties, we decide to combine each company with its group members using the simple average, i.e. constructing an equally-weighted (also called naive) portfolio on an intra-group level. On these resulting portfolios, whose number ranges from 1 (no grouping at all) to $N$ (each company is its own group), we then construct the tangency portfolio from Equation 3. Since our aim is to emphasize the effect of grouping in portfolio optimization, we repeat this procedure for each number of groups $K$ in a predefined range.
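The two-stage construction described above (naive portfolios within groups, then one tangency portfolio across groups) can be sketched with NumPy as follows. The function names are hypothetical, the sample mean and sample covariance are used as plain estimators, and the tangency weights are computed in closed form under the assumption of a zero risk-free rate; the estimators and exact formulation used in the study may differ.

```python
import numpy as np


def group_returns(returns, labels):
    """Equally weighted (naive) portfolio return per group; returns is T x N."""
    groups = np.unique(labels)
    grouped = np.column_stack(
        [returns[:, labels == g].mean(axis=1) for g in groups]
    )
    return grouped, groups


def tangency_weights(returns):
    """Closed-form tangency weights, proportional to inv(Sigma) @ mu.

    Assumes a zero risk-free rate and that the raw weights do not sum to zero.
    """
    mu = returns.mean(axis=0)
    sigma = np.atleast_2d(np.cov(returns, rowvar=False))
    raw = np.linalg.solve(sigma, mu)
    return raw / raw.sum()


def company_weights(returns, labels):
    """Final weight of each company: group weight split equally among members."""
    grouped, groups = group_returns(returns, labels)
    w_group = tangency_weights(grouped)
    weights = np.zeros(returns.shape[1])
    for g, wg in zip(groups, w_group):
        members = labels == g
        weights[members] = wg / members.sum()
    return weights


if __name__ == "__main__":
    # Hypothetical data: 250 daily returns of 7 companies in 3 groups.
    rng = np.random.default_rng(0)
    returns = rng.normal(0.001, 0.01, size=(250, 7))
    labels = np.array([0, 0, 0, 1, 1, 2, 2])
    print(company_weights(returns, labels).round(3))
```

With a single group, the sketch reduces to the equally weighted portfolio over all companies; with one company per group, it reduces to the tangency portfolio over individual stocks.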

Figure 5 depicts the described process of first, creating naive portfolios, given the grouping decision of the t-SNE and spectral clustering approach, and subsequently forming a global tangency portfolio out of all the naive portfolios present.

Figure 5: Panels (a) and (b) show the portfolio formation for two different numbers of groups. Using the t-SNE with spectral clustering to create naive portfolios out of grouped companies, which subsequently will be transformed into one global tangency portfolio. The letters A to G represent different stocks, which are first combined into a naive portfolio for each of the corresponding groups. These portfolios are then combined into one tangency portfolio, leading to an overall different distribution of weights for each company, as represented by the different sizes of the boxes for each company in the lowest part of the picture.

The number of naive portfolios always coincides with the number of groups. The two sides of the figure illustrate how the result for seven companies differs between the left-hand side and the right-hand side, which use different numbers of groups. Because the group sizes change constantly, the size and composition of the naive portfolios can differ for each fixed number of groups. We can also observe that the algorithm sometimes tends to assign a large cluster of companies to one group and to leave only a few other companies together in another group. However, as all naive portfolios are passed to the standard optimization described in Equation (3), each asset will have a different individual weight in the end, depending on the number of groups. With respect to the decision engine in Section 3, we create groups for each number of groups based on a predefined parameter set. In our study, we use a grid of parameters, based on the perplexity parameter as well as on the randomness of the spectral clustering. For each perplexity parameter in , we simply apply the spectral clustering 30 times, leading to 330 different grouping outcomes overall per predefined group size. This approach is used to address the issue of randomness in the spectral clustering. All of these 330 models are then passed through our decision engine by combining the different companies into an equally-weighted portfolio for each group and finally constructing the tangency portfolio out of the resulting portfolios. The size of the training dataset is observations and the validation set consists of observations. We use the returns of the validation set to decide on the final model, based on the out-of-sample Sharpe ratio.
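The selection step of the decision engine can be sketched as follows. This is a hedged illustration, not the authors' code: candidate groupings are passed in as plain label arrays, whereas in the study they would come from t-SNE embeddings followed by spectral clustering (for instance via scikit-learn's `TSNE` and `SpectralClustering`, which is an assumption here). The tangency step again assumes the textbook formula with a zero risk-free rate, and all function names and data are hypothetical.

```python
import numpy as np


def weights_from_grouping(train, labels):
    """Naive portfolios within groups, assumed tangency portfolio across groups."""
    labels = np.asarray(labels)
    groups = np.unique(labels)
    group_returns = np.column_stack(
        [train[:, labels == g].mean(axis=1) for g in groups]
    )
    sigma = np.atleast_2d(np.cov(group_returns, rowvar=False))
    raw = np.linalg.solve(sigma, group_returns.mean(axis=0))
    w_group = raw / raw.sum()
    sizes = np.array([(labels == g).sum() for g in groups])
    # Map every asset to an equal share of its group's weight.
    return (w_group / sizes)[np.searchsorted(groups, labels)]


def sharpe_ratio(returns):
    """Sharpe ratio with a zero risk-free rate (not annualized)."""
    return returns.mean() / returns.std(ddof=1)


def select_grouping(train, valid, candidates):
    """Fit weights on the training data, score each candidate on the validation data."""
    scores = [
        sharpe_ratio(valid @ weights_from_grouping(train, labels))
        for labels in candidates
    ]
    return int(np.argmax(scores)), scores


# Hypothetical data: 12 assets, five candidate groupings into three groups.
rng = np.random.default_rng(1)
returns = rng.normal(0.001, 0.01, size=(300, 12))
train, valid = returns[:200], returns[200:]
candidates = [rng.permutation(np.arange(12) % 3) for _ in range(5)]
best, scores = select_grouping(train, valid, candidates)
```

Keeping the candidate generation outside `select_grouping` separates the clustering (with its random restarts) from the portfolio evaluation, so the same scoring code can be reused for any grouping method.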

To create a large test dataset, we embed our decision engine in a standard daily rolling window study. We set our initial test data point to the day that directly follows the validation data and report the return of our chosen portfolio for each number of groups. We then shift the rolling window by one day, adding the former test data point to the full dataset, deleting the first observation, and again calculating the return on the new test data point. Every 252 trading days we reapply our decision engine to adjust our classification and adapt to new market conditions. Overall, we shift our rolling window exactly 1259 times, leading to 1259 daily test data points, from which we can calculate the true daily Sharpe ratio an investor would have obtained by applying our framework.
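The rolling window procedure can be sketched as a short loop. The sketch assumes that the grouping is refreshed only at the refit interval while the portfolio weights are re-estimated on every window; the text does not spell out the second point, so treat it as an assumption. The callbacks `choose_labels` and `fit_weights` are hypothetical placeholders for the decision engine and the weight estimation.

```python
import numpy as np


def rolling_window_test(returns, window, refit_every, choose_labels, fit_weights):
    """Collect one out-of-sample portfolio return per test day.

    returns: (T, N) array; window: number of days used for estimation.
    """
    out = []
    labels = None
    for step, t in enumerate(range(window, returns.shape[0])):
        history = returns[t - window:t]      # fixed-length window ending the day before t
        if step % refit_every == 0:
            labels = choose_labels(history)  # refresh the grouping decision
        w = fit_weights(history, labels)     # assumed: weights re-estimated daily
        out.append(returns[t] @ w)           # realized return on the test day
    return np.array(out)


# Hypothetical toy run: 30 days, 4 assets, 10-day window, regrouping every 7 days.
rng = np.random.default_rng(0)
returns = rng.normal(0.0, 0.01, size=(30, 4))
refits = []


def choose_labels(history):
    refits.append(history.shape[0])
    return np.arange(history.shape[1]) % 2


def fit_weights(history, labels):
    return np.full(history.shape[1], 1.0 / history.shape[1])


test_returns = rolling_window_test(
    returns, window=10, refit_every=7,
    choose_labels=choose_labels, fit_weights=fit_weights,
)
```

In the study the refit interval would be 252 trading days; the toy values above only keep the example fast.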

In total, we use 318 companies listed in the S&P 500 from 12/27/11 to 12/31/18, downloaded from the EIKON database of Thomson Reuters. We use only those companies for which price data is available for the whole period. We use discrete returns and assume that the risk-free rate of return is zero for every period. The number of observations available for estimating the portfolio weights as in Equation 3 is approximately two trading years, i.e. .
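For concreteness, discrete returns and an annualized Sharpe ratio with a zero risk-free rate can be computed as below. The annualization by the square root of 252 trading days is a common convention and an assumption here, since the paper does not state its annualization formula in this section; the function names are hypothetical.

```python
import numpy as np


def discrete_returns(prices):
    """Discrete (simple) returns: r_t = P_t / P_{t-1} - 1."""
    prices = np.asarray(prices, dtype=float)
    return prices[1:] / prices[:-1] - 1.0


def annualized_sharpe(daily_returns, periods_per_year=252):
    """Sharpe ratio with a zero risk-free rate, scaled by sqrt(periods_per_year)."""
    r = np.asarray(daily_returns, dtype=float)
    return np.sqrt(periods_per_year) * r.mean() / r.std(ddof=1)


example_returns = discrete_returns([100.0, 110.0, 99.0])  # +10 %, then -10 %
example_sharpe = annualized_sharpe([0.01, 0.03])
```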

4.2 Empirical results

To examine the performance of our approach thoroughly, from now on referred to as , we introduce some strong benchmarks. First, we use the standard tangency portfolio as the global market portfolio . It is constructed by simply applying Equation 3 to all companies. Furthermore, as argued by DeMiguel et al. (2009a), the naive portfolio of all companies can serve as a powerful benchmark too, as it does not suffer from estimation error. We refer to the naive portfolio on all stocks as . However, these two approaches do not consider the grouping of the data and thus could be considered unfair benchmarks. Therefore, we introduce three more benchmarks, based on different groupings. First, we use the Thomson Reuters industry classification system with the two leading digits of the code and form groups accordingly. Afterwards, we apply the proposed methods of Figures 4 and 5 to each group. This is referred to as . Second, we again use the Thomson Reuters codes, but this time with four digits. We call this benchmark . Finally, we construct a random grouping, based on the group sizes of the t-SNE and spectral clustering approach for all different predefined group sizes, and again apply the methods of Figures 4 and 5. By randomly allocating companies to specific groups, we can determine the effect of the t-SNE in combination with spectral clustering when all other relevant factors are held constant. To stabilize the results of that approach, we repeat it exactly 100 times with different company allocations and calculate the average Sharpe ratio for each number of groups. This method will be referred to as .
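The random benchmark can be sketched by shuffling the group labels across companies, which keeps the group sizes fixed while destroying the data-driven membership. This is an illustrative sketch, not the authors' code; `score` is a hypothetical callback that maps a grouping to a Sharpe ratio (for example, the full naive-plus-tangency pipeline evaluated on the test data).

```python
import numpy as np


def random_grouping_like(labels, rng):
    """Shuffle labels across companies: group sizes stay the same, membership is random."""
    return rng.permutation(np.asarray(labels))


def average_random_score(labels, score, n_repeats=100, seed=0):
    """Average a score over repeated random allocations with fixed group sizes."""
    rng = np.random.default_rng(seed)
    values = [score(random_grouping_like(labels, rng)) for _ in range(n_repeats)]
    return float(np.mean(values))


# Hypothetical grouping of seven companies into groups of sizes 4, 2 and 1.
labels = np.array([0, 0, 0, 0, 1, 1, 2])
shuffled = random_grouping_like(labels, np.random.default_rng(0))
# Toy score: share of the first four companies that land in group 0.
avg = average_random_score(labels, score=lambda lab: float((lab[:4] == 0).mean()))
```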

Figure 6: Annualized Sharpe ratio (left ordinate) of the returns of the test datasets, sorted by the number of groups. Orange dots represent the Sharpe ratio of the t-SNE and spectral clustering approach; gray dots represent the Sharpe ratio of the approach using random grouping. Bars and bar colors indicate whether the proposed method yielded better (blue) or worse (red) results. The absolute difference between both models is depicted on the left ordinate. The dashed lines represent the benchmarks for (red), (salmon), (blue) and (green). The black line is a trend line for the results of the t-SNE with spectral clustering approach.
Groups | Sharpe ratio                              | Standard error
     1 |      -      -  0.452  0.371      -      - |      -      -  0.270  0.441      -      -
     2 |  0.414  0.032      -      -      -      - |  0.403  0.424      -      -      -      -
     3 | -0.359  0.129      -      -      -      - |  0.440  0.439      -      -      -      -
     4 |  0.943  0.001      -      -      -      - |  0.376  0.428      -      -      -      -
     5 |  0.853  0.094      -      -      -      - |  0.336  0.419      -      -      -      -
     6 |  1.491  0.048      -      -      -      - |  0.462  0.405      -      -      -      -
     7 |  0.878  0.068      -      -      -      - |  0.438  0.438      -      -      -      -
     8 |  0.649  0.112      -      -      -      - |  0.326  0.380      -      -      -      -
     9 | -0.242  0.116      -      -      -      - |  0.606  0.368      -      -      -      -
    10 |  0.900  0.113      -      - -0.008      - |  0.406  0.386      -      -  0.484      -
    11 |  0.525  0.097      -      -      -      - |  0.326  0.420      -      -      -      -
    12 |  0.959  0.148      -      -      -      - |  0.460  0.394      -      -      -      -
    13 |  0.450  0.139      -      -      -      - |  0.459  0.423      -      -      -      -
    14 |  0.774  0.118      -      -      -      - |  0.451  0.424      -      -      -      -
    15 |  0.648  0.170      -      -      -      - |  0.360  0.419      -      -      -      -
    16 |  0.019  0.204      -      -      -      - |  0.555  0.450      -      -      -      -
    17 |  0.473  0.236      -      -      -      - |  0.325  0.422      -      -      -      -
    18 |  0.490  0.177      -      -      -      - |  0.443  0.389      -      -      -      -
    19 |  0.431  0.255      -      -      -      - |  0.385  0.376      -      -      -      -
    20 |  0.530  0.190      -      -      -      - |  0.502  0.446      -      -      -      -
    25 |      -      -      -      -      -  0.065 |      -      -      -      -      -  0.478
Table 1: Annualized Sharpe ratio and standard error of the Sharpe ratio for every model and every number of groups. Standard errors are calculated using standard bootstrap samples with 1000 repetitions. “-” indicates that no model was calculated for that number of groups.

Figure 6 and Table 1 show the results of our empirical study. The orange dots in the figure represent the annualized Sharpe ratio of our proposed model for all numbers of groups we have specified; a value of five, for example, means that the dataset was split into exactly five groups. The black curve between the orange dots is a trend curve, fitted with B-splines, which shows the general development of the performance across the numbers of groups. (The trend curve can be fitted in many ways and hence serves only an illustrative purpose.) The gray dots represent the corresponding model with the companies randomly allocated to groups. The bars beneath the dots show the difference between our approach and the random benchmark . The difference can be read from the left y-axis. If the bar is blue, the proposed model performs better; if it is red, the random model exhibits a higher Sharpe ratio. The dashed lines represent the Sharpe ratio of our other four benchmarks: (red), (salmon), (blue) and (green).

Figure 6 shows that the Sharpe ratio of our approach develops in an interesting way, depending on the number of groups. At first, the Sharpe ratio is quite low, but increasing the number of groups leads to a steady increase in performance up to a certain number of groups. Beyond six groups, the performance seems to decline and move towards the performance of the tangency portfolio. The gray points of the random portfolios also seem to move slowly towards the Sharpe ratio of the tangency portfolio as the number of groups increases. This behavior is to be expected, as increasing the number of groups approaches the standard scenario in which each company represents its own group. In this case, the scheme described above leads to the standard tangency approach, as a naive portfolio of one company is the company itself. We can also see that the performance of the random portfolios is worse than the performance of the tangency portfolio, which can be explained by diversification effects and standard portfolio theory: the more assets in the final portfolio, the better. Nonetheless, the portfolios based on t-SNE and spectral clustering have different properties when a low number of groups is considered. They outperform the tangency portfolio in almost all cases, which is not surprising given the expected high estimation error of the tangency portfolio. The grouping into naive portfolios reduces this estimation error, and the reduction more than compensates for the loss of diversification. The global naive portfolio still seems to be a strong benchmark for optimization, while the portfolios based on industry classification seem to have no real impact.

Table 1 provides detailed information on the true annualized Sharpe ratio an investor would have received when using our approach or one of the benchmarks. While the left-hand side shows the Sharpe ratio, the right-hand side displays the standard error associated with that Sharpe ratio, constructed by a standard bootstrap with replacement and 1000 repetitions. The standard errors of the portfolios reveal a general problem of comparing Sharpe ratios. Even when thousands of observations are available, the Sharpe ratio still suffers from high standard errors and hence often lacks statistical significance. However, the difference between the best portfolio and the tangency portfolio is statistically significant at a chosen α-level of . Comparing multiple Sharpe ratios between the and portfolios is not achievable while maintaining acceptable statistical significance because of the high standard errors and the potential accumulation of type I errors.
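A bootstrap standard error of this kind can be sketched as follows. The sketch assumes an i.i.d. bootstrap that resamples individual days with replacement and annualizes with the square root of 252; both are assumptions, since the text only states that a standard bootstrap with 1000 repetitions is used. The function name and the simulated returns are hypothetical.

```python
import numpy as np


def bootstrap_sharpe_se(daily_returns, n_boot=1000, periods_per_year=252, seed=0):
    """Standard error of the annualized Sharpe ratio via an i.i.d. bootstrap."""
    r = np.asarray(daily_returns, dtype=float)
    rng = np.random.default_rng(seed)
    # Each row holds the indices of one resample drawn with replacement.
    idx = rng.integers(0, r.size, size=(n_boot, r.size))
    samples = r[idx]
    sharpes = (
        np.sqrt(periods_per_year)
        * samples.mean(axis=1)
        / samples.std(axis=1, ddof=1)
    )
    # The spread of the resampled Sharpe ratios estimates the standard error.
    return float(sharpes.std(ddof=1))


# Hypothetical daily returns with as many observations as the test set (1259 days).
rng = np.random.default_rng(42)
simulated = rng.normal(0.0005, 0.01, size=1259)
se = bootstrap_sharpe_se(simulated)
```

For 1259 independent daily observations and a small daily Sharpe ratio, the standard error of the annualized Sharpe ratio is roughly the square root of 252/1259, about 0.45, which is the order of magnitude reported in Table 1 and illustrates why differences between Sharpe ratios are hard to establish statistically.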

5 Discussion

In this paper, we introduce a novel approach to classifying financial data. To do so, we apply a standard machine learning tool, combined with a clustering algorithm, to financial data. The analysis emphasizes the importance of this framework for practitioners with expert knowledge as well as for researchers. While practitioners can use the t-SNE part of the model to visualize high-dimensional data in a comprehensible way and enrich the further analysis with their expert knowledge, researchers can apply a sophisticated clustering approach for a data-driven and efficient grouping.

Furthermore, we create a general decision engine which not only shows the implementation steps for the proposed machine learning approach, but also opens up standard approaches such as portfolio optimization to cross-validation and the data-driven detection of optimal parameters. Standard approaches can be fine-tuned by machine learning through the introduction of new parameters for grouping. The whole model setup, from classification to optimization, can thus be constructed individually depending on the problem at hand, which further ensures the flexibility of the approach.

The demonstrated effective company grouping with machine learning removes the need to use potentially proprietary classification systems and can help identify similarities based on various measures. Even though we have only used return data, the method can be used for many other data types as well, such as balance-sheet data. Our empirical example from the field of portfolio optimization shows that our t-SNE with spectral clustering approach significantly outperforms standard benchmarks, such as the tangency or the naive portfolio. Still, our approach is not limited to portfolio theory. On the contrary, we believe that financial research and practice can benefit from further applications of our framework to other finance-related topics, such as company valuation or factor analysis.

6 References