The NEU Meta-Algorithm for Geometric Learning with Applications in Finance

08/31/2018, by Anastasis Kratsios, et al., Concordia University

We introduce a meta-algorithm, called non-Euclidean upgrading (NEU), which learns algorithm-specific geometries to improve the training and validation set performance of a wide class of learning algorithms. Our approach is based on iteratively performing local reconfigurations of the space in which the data lie. These reconfigurations build universal approximation and universal reconfiguration properties into the new algorithm being learned. This allows any set of features to be learned by the new algorithm to arbitrary precision. The training and validation set performance of NEU is investigated through implementations predicting the relationship between select stock prices as well as finding low-dimensional representations of the German Bond yield curve.


1 Introduction

Many statistical and learning algorithms, such as ordinary linear regression and PCA, admit linear-algebraic formulations making them quick to execute. Their inability to capture non-linear features has motivated several non-linear generalizations. Non-linear generalizations of linear models require alternative, computationally costly, estimation procedures. Generalized additive models (GAM) and artificial neural networks (ANN) are examples of non-linear generalizations of linear regression that come with a significant increase in computational cost (see [14, Chapters 9 and 11] for a discussion of these methods).

Patterns in the data are typically interpreted as a function relating explanatory inputs to the observations which they explain. Alternatively, a pattern can be interpreted as the positioning of points in space. Since a function’s graph is a specific set of points in space, interpreting a pattern as a configuration of points in space is more general than interpreting it as a function. The non-Euclidean upgrading (NEU) methodology introduced in this paper can learn any configuration of data. As a consequence, two versions of the universal approximation property of ANNs (see [5] for details) are also recovered.

Non-Euclidean upgrading (NEU) is a meta-algorithm. Meta-algorithms are algorithms whose inputs and outputs are other algorithms. For example, the Boosting meta-algorithm of [26] efficiently combines learning algorithms to build a more accurate new learning algorithm. Bagging, as introduced in [3], is another meta-algorithm, which generates bootstrapped samples from a given dataset, performs the input algorithm on those bootstrapped samples, and aggregates the resulting predictions into a lower-variance estimate. NEU is also a meta-algorithm: it takes as input a learning algorithm and a dataset, and outputs a new algorithm with the universal approximation property built into it. Applying NEU to simple linear algorithms produces algorithms which are interpretable, have a low computational burden, and, once trained, can predict any pattern to arbitrary precision.
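To make the algorithm-in, algorithm-out signature concrete, the following minimal Python sketch expresses bagging as a higher-order function; the interface (a base learner as a callable fit(X, y) returning a predictor) is our own illustrative convention, not the paper's notation.

```python
import numpy as np

def bagging(fit, n_boot=50, seed=0):
    """Meta-algorithm: consumes a base learner `fit` and returns a new learner
    whose predictions average base learners trained on bootstrap resamples."""
    def bagged_fit(X, y):
        rng = np.random.default_rng(seed)
        predictors = []
        for _ in range(n_boot):
            idx = rng.integers(0, len(X), size=len(X))  # bootstrap sample with replacement
            predictors.append(fit(X[idx], y[idx]))
        return lambda X_new: np.mean([p(X_new) for p in predictors], axis=0)
    return bagged_fit
```

NEU has the same shape: it consumes a learning algorithm and a dataset and returns a new algorithm, as made explicit in Meta-Algorithm 2.22 below.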

NEU works by first segmenting the input data into training and validation components, then performing local perturbations of the space in which the data lie, executing the learning algorithm on the perturbed training and validation datasets, and evaluating whether the validation set performance has increased. The procedure continues iteratively, stopping once the validation set performance begins to drop.

(a) Non-Linear Configuration of Euclidean Data.
(b) Linear Configuration of Non-Euclidean Data.
Figure 1: Visualization of Reconfiguration of the Data.

Figure 1 illustrates how perturbing the space reconfigures the given dataset and allows a linear regression to explain a non-linear relationship. After the linear regression is performed, the transformations are inverted and the linear predictor becomes non-linear. This illustration is analogous to the non-Euclidean regression proposed in [10], with the central difference being that our methodology learns the geometry of the problem whereas the algorithm in [10] relies on a prespecified geometry.

Applying NEU to principal component analysis (PCA) generates an analogue of the principal geodesic analysis of [11] in which the geometry is learned from the data. Applying NEU to the unscented Kalman filtering algorithm of [16] or to the geometric GARCH framework of [13] produces analogues of those algorithms without a prespecified geometry. There are many other potential applications of NEU in statistics and machine learning.

We consider two examples from finance. The first example considers the use of principal component analysis (PCA) on German bond data. Using NEU on PCA shows that one NEU-principal component performs better than standard principal components.

The second example from finance considers the relationship between the Apple stock price and the stock prices of companies related to Apple. Using NEU on linear regression provides better out-of-sample predictions than the LASSO, Ridge regression, and non-linear extensions of the Elastic-Net (ENET) procedure. While we consider only two examples from finance to illustrate NEU, its generality and flexibility should allow for similar performance gains in other areas of financial statistics and machine learning.

The remainder of this paper is organized as follows. Section 2 introduces the mathematical framework for non-Euclidean upgrading, and the main results regarding the technique’s flexibility and predictive performance enhancement are proven. Section 3 investigates the empirical performance of non-Euclidean upgrading on the two examples from finance. The relationship between the Apple stock price and the stock prices of related companies is better explained, on both the training and validation sets, using non-Euclidean upgraded regression. Parallels are drawn to the non-Euclidean generalizations of regression and principal geodesic analysis developed in [10] and [11], respectively. We adjoin an appendix with two sections: the first lists the regularity assumptions made and the second contains certain technical proofs.

2 Non-Euclidean Upgrading

This section introduces and develops the NEU meta-algorithm. Reconfigurations are first introduced and a universal approximation property is proven. The NEU meta-algorithm is then introduced and its performance gain property is proven.

2.1 Reconfiguration

For the remainder of this paper, a dataset will be comprised of training and validation sets. The training set will be denoted by and the validation set will be denoted by , where and are non-negative integers and .

Reconfigurations perturbing the dataset are smooth maps from the space back into itself (smooth autodiffeomorphisms) which satisfy certain local properties. These are defined as follows.

Definition 2.1 (Reconfiguration Map)

Let be an open subset of and be a star-shaped domain in of dimension . A reconfiguration on is a map

satisfying the following properties:

  1. Invertibility: For every , the map is a bijection,

  2. Smoothness: For every , the maps , and are continuously differentiable,

  3. Smooth Parametrization: For every in , the map is continuously differentiable,

  4. Local Transience: For every in with , there exists such that

    where is the Euclidean distance on .

  5. Identity: The subset of is non-empty.

The central example of a reconfiguration map is a rapidly decaying rotation concentrated on a disc. These rotations slow exponentially as the boundary of the disc is approached. Beyond the disc’s boundary, the reconfiguration map becomes the identity transformation. Rapidly decaying rotations are illustrated in Figure 2.

(a) Data in Euclidean Space.
(b) A Rapidly Decaying Rotation.
Figure 2: Visualization of Rapidly Decaying Rotations.
Definition 2.2 (Rapidly Decaying Rotations)

Let denote the set of skew-symmetric matrices and set . A rapidly decaying rotation is the map defined by

(2.1)

where is the Gaussian bump-function supported on the unit sphere of radius centered at the point , defined by

(2.2)

and is the matrix exponential map.

Proposition 2.3.

Rapidly decaying rotations are reconfiguration maps on . Moreover, the inverse of is

where is the image of under .

Proof.

The proof is deferred to the appendix. ∎

Remark 2.4 (Geometric Interpretation).

The rapidly decaying rotations are interpolations between a rotation and the identity map in the interior of the disc of radius , centered at . However, the interpolation does not take place in , but instead happens within the Lie algebra lying tangential to the space of all generalized rotation matrices . This ensures that the map is invertible for all possible parameter choices.
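The following Python sketch illustrates one plausible form of such a map, assuming a smooth, compactly supported bump function and the construction described above (a matrix exponential of a scaled skew-symmetric generator). It is illustrative only; the exact formulas in Equations (2.1)–(2.2), and the invertibility guarantee of Proposition 2.3, belong to the paper's precise construction.

```python
import numpy as np
from scipy.linalg import expm  # matrix exponential

def bump(x, p, r):
    """Smooth bump: equals 1 at p and vanishes identically outside the ball B(p, r)."""
    d2 = float(np.sum((x - p) ** 2))
    if d2 >= r ** 2:
        return 0.0
    return float(np.exp(1.0 - r ** 2 / (r ** 2 - d2)))

def rapidly_decaying_rotation(x, A, p, r):
    """Rotate x about the centre p by the skew-symmetric generator A, scaled by the bump;
    outside B(p, r) the bump is zero, expm(0) = I, and the map is the identity."""
    return p + expm(bump(x, p, r) * A) @ (x - p)

# Example: a localized rotation in the plane.
A = np.array([[0.0, -1.0], [1.0, 0.0]])  # skew-symmetric generator of planar rotations
print(rapidly_decaying_rotation(np.array([0.3, 0.1]), A, p=np.zeros(2), r=1.0))
print(rapidly_decaying_rotation(np.array([2.0, 0.0]), A, p=np.zeros(2), r=1.0))  # unchanged
```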

Definition 2.5 (Planar Micro-Bumps)

A planar micro-bump on , is the map defined by

(2.3)

where .

Proposition 2.6.

Planar micro-bumps are reconfiguration maps on .

Proof.

The proof is deferred to the appendix. ∎
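Since Equation (2.3) is not reproduced here, the sketch below shows only one plausible reading of a planar micro-bump, assuming a localized translation that vanishes outside a disc; keeping the shift small relative to the radius is what keeps such a perturbation invertible, a property the paper's Definition 2.5 guarantees by construction.

```python
import numpy as np

def bump(X, p, r):
    """Row-wise smooth bump: 1 at p, identically 0 outside the disc of radius r around p."""
    d2 = np.sum((X - p) ** 2, axis=-1)
    out = np.zeros_like(d2)
    inside = d2 < r ** 2
    out[inside] = np.exp(1.0 - r ** 2 / (r ** 2 - d2[inside]))
    return out

def planar_micro_bump(X, p, r, v):
    """Shift only the points near p by a smoothly decaying multiple of v."""
    return X + bump(X, p, r)[:, None] * v

X = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0]])
print(planar_micro_bump(X, p=np.zeros(2), r=1.0, v=np.array([0.05, 0.0])))  # last point unchanged
```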

Data points are deemed poorly placed if moving them increases the validation set performance of a learning algorithm. Iteratively applying reconfiguration maps allows poorly placed data-points to be moved to locations which increase an algorithm’s validation set performance. The local transience property of reconfiguration maps, Definition 2.1 (iv), makes it possible to only move poorly placed data-points while leaving the others fixed. The procedure is summarized as follows.

Definition 2.7 (Reconfiguration)

Let be a star-shaped domain in of dimension , let be a smooth sub-manifold of which is diffeomorphic to (the Whitney embedding theorem implies that any smooth manifold is a smooth subset of a Euclidean space; in this paper, a map is called smooth if it is once continuously differentiable), let be a diffeomorphism (a bijection which is smooth and has a smooth inverse) from onto , let be a reconfiguration map on , and let be in with , where is as in Definition 2.1(v). A reconfiguration is a map from to defined by

where

Reconfiguring a dataset on maps it into new coordinates for the input -variables. These coordinates may not be directly interpretable; therefore, after performing the learning algorithm and obtaining an estimate in the new coordinate system, the reconfiguration must be inverted. This inverse procedure is called deconfiguration.

Definition 2.8 (Deconfiguration)

Let be a reconfiguration of . The deconfiguration of is the map denoted by defined as

The universal approximation property of neural networks states that certain neural networks can approximate any function to arbitrary precision (see [5]). The first analogous property for reconfiguration states that any dataset can be transformed into any other dataset of equal size.

Theorem 2.9 (Universal Reconfiguration Property).

Assume that , that is an open star-shaped domain in of dimension , that is a reconfiguration map on , and that is a diffeomorphism from onto . Let and be subsets of . There exists a positive integer , and in , for which

for every in .

Proof.

The proof is deferred to the appendix. ∎

The universal reconfiguration property implies the following analogues to the universal approximation property of neural networks of [19]. The first captures general functions on a more restricted domain and the second captures a smaller class of functions on a larger domain.

Corollary 2.10 (Universal Approximation Property).

Let be positive integers, be a subset of , and be Borel functions from to . If is diffeomorphic to , then for every countable subset of , probability measure supported on , and every , there exists such that for every there exists a Borel subset of satisfying

  1. .

Here is the second canonical projection of onto (the second canonical projection of the product space takes a pair to ; see [20] for details). In the limiting case where , the convergence of to on is point-wise.

Proof.

The proof will be deferred to the appendix. ∎

Corollary 2.11 (Universal Smooth Approximation Property).

Let be positive integers, let be a regular, convex, compact subset of of dimension , and let be continuously differentiable functions from to . If is a reconfiguration map satisfying regularity condition A.5, then for every there exists in such that

Moreover, the limiting function exists and is continuously differentiable .

Proof.

The proof will be deferred to the appendix. ∎

Non-Euclidean upgrading uses reconfigurations to improve a class of learning algorithms which we call objective learning algorithms. These are discussed in the next section.

2.2 Objective Learning Algorithms

The learning algorithms we consider in this paper optimize both the training set and validation set loss functions. Regularized regression, PCA, k-means, neural networks, Bayesian classifiers, support vector machines, and stochastic filters are all examples of objective learning algorithms.

Objective learning algorithms associate to every pair of training and validation sets of a given size a pair of training set and validation set loss-functions, as well as a pattern function linking the parameters being optimized to the predictions they can make. This formalization requires the definition of the set of all possible learning algorithms for a fixed set of hyper-parameters and parameter-to-prediction function . Here is the dimension of the space in which the data-points lie, is the dimension of the explanatory parameters, and is the number of -dimensional points output by the algorithm.

For example, for a one-factor PCA and , and for a two-factor PCA and . In the case of linear regression, the regression weights are scalars, therefore . If there is no intercept then , and if there is an intercept, ; in this formulation, is the number of columns of the design matrix.

Let be a positive integer and be a non-negative integer. Define to be the set of all pairs of maps such that

  1. The map ,

  2. The map ,

  3. Regularity condition A.1 holds.

The function represents the estimated pattern, parameterized by . The parameter lies in the space and is to be chosen by optimizing training set and validation set loss functions. is the training set loss function on a dataset of size and is the out-of-sample loss function on a dataset of size . The space of all learning algorithms for a specific pattern function is

Definition 2.12 (Objective Learning Algorithm)

An objective learning algorithm is a map

where the pair of a training set and a validation set is viewed as an element of , and where is the non-negative integer-valued function mapping a point in Euclidean space to .

Remark 2.13.

Given a dataset consisting of data-points, the regression analysis loss function is

(2.4)

where are the data-points and are the responses. Incorporating an additional data-point and an additional response into the regression analysis changes the loss function of Equation (2.4) to

(2.5)

Both Equations (2.4) and (2.5) describe a -dimensional regression problem, but technically they are defined by different loss functions. Definition 2.12 overcomes the oddity of having a learning algorithm differ depending on the size of the dataset by defining an objective learning algorithm as a map associating the size of a dataset to the corresponding loss function, which is what is done inadvertently in practice.
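The displayed equations are not legible in this version; presumably (2.4) and (2.5) are the standard sums of squared errors, so that adding one observation simply appends one term:

```latex
\mathcal{L}_{n}(\beta)=\sum_{i=1}^{n}\bigl(y_{i}-f_{\beta}(x_{i})\bigr)^{2},
\qquad
\mathcal{L}_{n+1}(\beta)=\mathcal{L}_{n}(\beta)+\bigl(y_{n+1}-f_{\beta}(x_{n+1})\bigr)^{2}.
```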

Principal component analysis and regression analysis are objective learning algorithms. This is illustrated by the following two examples.

Example 2.14 (Regression as an Objective Learning Algorithm).

Let be real numbers and be a continuously differentiable linearly independent set of functions in . Non-linear regression is an objective learning algorithm which is represented by

  1. ,

  2. ,

  3. ,

  4. ,

where is the component of the -dimensional vector and where is the observed data-point. Typically, the out-of-sample dataset is taken to be empty unless a regularization or sparsity constraint is imposed.

By adding a penalty term, such as the norm, to the training set and validation set loss functions and expanding the hyperparameter set accordingly, most regularized regression problems, such as the LASSO of [31], are seen to be objective learning algorithms.
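As a concrete (and hypothetical) illustration of Example 2.14, the sketch below fixes a small set of feature functions, takes the training and validation losses to be sums of squared errors, and computes the optimal evaluation by least squares; the specific basis is ours, not the paper's.

```python
import numpy as np

# Hypothetical basis of continuously differentiable, linearly independent functions.
features = [lambda x: np.ones_like(x), lambda x: x, lambda x: np.sin(x)]

def pattern(beta, x):
    """Pattern function: linear combination of the fixed feature functions."""
    return sum(b * f(x) for b, f in zip(beta, features))

def loss(beta, X, y):
    """Training (or validation) loss: sum of squared errors of the pattern."""
    return float(np.sum((y - pattern(beta, X)) ** 2))

def optimal_evaluation(X_train, y_train):
    """Because the pattern is linear in beta, the optimal evaluation is ordinary least squares."""
    Phi = np.column_stack([f(X_train) for f in features])
    beta, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)
    return beta
```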

Example 2.15 (PCA as an Objective Learning Algorithm).

Calculating the first principal component of a dataset’s empirical covariance matrix is an objective learning algorithm. Here, the defining maps are represented by

  1. ,

  2. ,

  3. ,

where and are the training and validation sets and , viewed as matrices with their column-wise means removed. Typically, the out-of-sample dataset is taken to be empty. The higher principal components, as well as sparse principal components, can be represented analogously as objective learning algorithms.
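A matching sketch of Example 2.15: the parameter is a unit vector, the loss is the negative variance it captures on the de-meaned training matrix (with the analogous quantity on the validation matrix), and the optimal evaluation is the leading eigenvector. Names are illustrative.

```python
import numpy as np

def pc1_loss(w, X):
    """Negative captured variance of the unit vector w on the column-de-meaned matrix X."""
    Xc = X - X.mean(axis=0)
    w = w / np.linalg.norm(w)
    return -float(w @ (Xc.T @ Xc) @ w)

def pc1_optimal_evaluation(X_train):
    """The loss is minimized by the top eigenvector of the empirical covariance."""
    Xc = X_train - X_train.mean(axis=0)
    _, eigvecs = np.linalg.eigh(Xc.T @ Xc)  # eigenvalues in ascending order
    return eigvecs[:, -1]
```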

The optimal evaluation of a learning algorithm is a map taking a learning algorithm and a dataset to an optimized pattern. The optimal evaluation is only well-defined on datasets which admit a unique optimizer. This set of regular datasets, called the regular domain of definition of the learning algorithm, is defined as follows.

Definition 2.16 (Regular Domain of Definition)

Let be a learning algorithm. The regular domain of definition of , denoted by , is the set of all pairs of data points in satisfying regularity condition A.2.

The map associating a dataset and an objective learning algorithm to the pattern best describing it is now defined.

Definition 2.17 (Optimal Evaluation)

Given an objective learning algorithm, its optimal evaluation is the output of the function taking as input a pair of training and validation sets in and returning the optimal parameter defined by

Remark 2.18.

The optimal evaluation takes an objective learning algorithm and a dataset and returns the optimizer minimizing the loss function defined by the dataset. For example, in a LASSO regression the optimal evaluation returns the parameters of the line of best fit relating the explanatory variables to the responses, with the tuning parameter optimized according to the validation set.

The requirement that the dataset be in the regular domain of definition of the learning algorithm means that the optimal evaluation is a well-defined function. For example, the points do not have a single line of best fit describing their relationship; therefore, the optimal evaluation of the regression problem is not defined on that dataset.

As in [14], the performance of a learning algorithm is defined as the negative of its loss function evaluated at the optimal value. The training set and validation set performance of an objective learning algorithm are defined in an analogous manner.

Definition 2.19 (Performance)

Let be a learning algorithm. The training set performance of is the function, denoted by , taking a dataset in to the extended real number

The validation set performance of is the function, denoted by , taking a dataset in to the extended real number

Remark 2.20.

The performance is the negative of the loss function evaluated at its optimal evaluation. It provides a measure of how well an objective learning algorithm can explain a given dataset.

A dataset in is said to maximize the in-sample (resp. out-of-sample) performance of if there is no other dataset in having the same number of training and validation data points and a higher in-sample (resp. out-of-sample) performance.

The main result can now be stated. If the data is in the regular domain of definition of a learning algorithm, and is not already in an optimal position, then there is a reconfiguration which increases the performance of that algorithm. An example of optimally positioned data for linear regression is data that is perfectly explained by a line both on the training and validation sets. In this extreme case, it is natural to expect that no improvement can be made to linear regression.

Theorem 2.21 (Performance Gain).

Let and be an objective learning algorithm. For every pair of integers and every in , there exists in such that

(2.6)
(2.7)

where the reconfigured datasets and are defined as

The inequality in equation (2.7) (resp. equation (2.6)) is strict if does not maximize (resp. ).

Proof.

Without loss of generality, assume that does not maximize ; the proof of the statement for is identical. Then there is in which has a higher value of and the same number of training and validation data-points.

Therefore, by the universal reconfiguration property of Theorem 2.9, there exists such that

Theorem 2.21 guarantees that there exists a reconfiguration of the data which improves an algorithm’s training set and validation set performance. The NEU meta-algorithm is a procedure which learns the reconfiguration of the space ensuring that the training and validation sets are positioned in a way which reduces the training set and validation set loss functions. This is formalized by the meta-algorithm illustrated in Figure 3 and made explicit in Meta-Algorithm 2.22.

Figure 3: Work-flow of Reconfiguration Learning Phase of Non-Euclidean Upgrading. (Flowchart: the Euclidean data is randomly reconfigured by a reconfiguration map; when the training-set and validation-set checks indicate an improvement, the data and reconfiguration map are updated and the loop repeats; otherwise the reconfiguration is undone, the prediction is made, and the prediction is returned to the original coordinates.)
Meta-Algorithm 2.22 (Non-Euclidean Upgrading).

The inputs of the non-Euclidean upgrading algorithm are a diffeomorphism , an objective learning algorithm , a pair of training-set and validation-set data-points in satisfying regularity condition A.3, a reconfiguration map , , and a positive integer . Non-Euclidean upgrading takes these inputs and returns its output through the following steps:

  1. Learning Reconfiguration: Define a reconfiguration through the following procedure,

    1. Define the data-points ,

    2. ,

    3. For integers between :

      1. Define the tentative optimal evaluation to be

      2. Define the tentative performance measurement to be

      3. If the tentative performance measurement improves on the current performance, then define ; else define ,
      4. Define the updated data ,

  2. Stop when or when ,

  3. Define ,

  4. Define the reconfiguration ,

  • Perform Algorithm: Perform on the data and obtain the optimal evaluation ,

  • Deconfigure Prediction: Returns the values:

    1. Prediction: ,

    2. Performance Gain: ,

    3. Parameter Estimates: .
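The following Python sketch is a simplified, greedy reading of Meta-Algorithm 2.22 for NEU-OLS: random localized perturbations of the explanatory coordinates are accepted when they lower the training loss, and the loop stops (early stopping) once the validation loss stops improving. It departs from the paper in two acknowledged ways: it reconfigures only the explanatory coordinates rather than the data jointly, so the deconfiguration step is trivial, and it uses a simple localized shift rather than the rapidly decaying rotations; all names and tuning constants are illustrative.

```python
import numpy as np

def bump(X, p, r):
    """Smooth bump in [0, 1] that vanishes outside the ball B(p, r)."""
    d2 = np.sum((X - p) ** 2, axis=-1)
    out = np.zeros_like(d2)
    inside = d2 < r ** 2
    out[inside] = np.exp(1.0 - r ** 2 / (r ** 2 - d2[inside]))
    return out

def reconfigure(X, p, r, v):
    """Localized perturbation: identity outside B(p, r)."""
    return X + bump(X, p, r)[:, None] * v

def ols_fit(X, y):
    Xd = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return beta

def sse(beta, X, y):
    Xd = np.column_stack([np.ones(len(X)), X])
    return float(np.sum((y - Xd @ beta) ** 2))

def neu_ols(X_tr, y_tr, X_va, y_va, n_iter=2000, patience=200, seed=0):
    rng = np.random.default_rng(seed)
    Xt, Xv, maps = X_tr.copy(), X_va.copy(), []
    beta = ols_fit(Xt, y_tr)
    best_tr, best_va, stale = sse(beta, Xt, y_tr), sse(beta, Xv, y_va), 0
    for _ in range(n_iter):
        # Draw a random candidate reconfiguration (centre, radius, small shift).
        p = Xt[rng.integers(len(Xt))]
        r = rng.uniform(0.05, 0.5) * float(Xt.std())
        v = rng.normal(scale=0.1 * r, size=Xt.shape[1])
        Xt_c, Xv_c = reconfigure(Xt, p, r, v), reconfigure(Xv, p, r, v)
        beta_c = ols_fit(Xt_c, y_tr)
        tr_c, va_c = sse(beta_c, Xt_c, y_tr), sse(beta_c, Xv_c, y_va)
        if tr_c < best_tr:                      # training-set improvement: accept the map
            Xt, Xv, beta, best_tr = Xt_c, Xv_c, beta_c, tr_c
            maps.append((p, r, v))
            if va_c < best_va:
                best_va, stale = va_c, 0
            else:
                stale += 1
        if stale >= patience:                   # validation-set early stopping
            break
    return beta, maps

def neu_predict(beta, maps, X_new):
    """Apply the learned reconfigurations to new explanatory points, then the fitted OLS."""
    X = X_new.copy()
    for p, r, v in maps:
        X = reconfigure(X, p, r, v)
    return np.column_stack([np.ones(len(X)), X]) @ beta
```

In the paper's full procedure the reconfiguration acts on the data jointly and the fitted line is mapped back through the inverse (deconfigured) maps, which is what produces the non-linear predictor illustrated in Figure 1.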

3 Numerical Implementation of NEU-OLS and NEU-PCA

We begin by investigating the empirical performance of non-Euclidean upgrading. The first two implementations focus on real datasets and the last uses simulated data. The first two use rapidly decaying rotations to reconfigure the data, whereas the last example uses micro-bumps since the data lies in .

    3.1 Data-driven Studies

The performance of the NEU meta-algorithm will be investigated in both the regression and dimensionality reduction settings on financial datasets, beginning with a regression analysis study.

    Example 3.1 (Regression Analysis: Apple Stock Tracker).

Predicting the relationship between the prices of a set of assets is central to many trading strategies. For example, strategies that rely on illiquid assets may create a portfolio comprised entirely of liquid assets which tracks the illiquid asset’s movements. Since that is a particular application of tracking portfolios, in this example the technique is demonstrated using liquid stocks. The target stock price will be denoted by and the prices of the assets making up the tracking portfolio will be denoted by .

In this example, will be the price of Apple stock, and will be the stock prices for IBM, Google, Cisco Systems Inc., Microsoft Corporation, Acacia Communications Inc., NXP Semiconductors NV, Qualcomm, Analog Devices Inc., Glu Mobile Inc., Jabil Inc., Micron, and STMicroelectronics NV. This portfolio is chosen as being comprised of the stock of major companies in the same industry as well as major companies making up Apple’s supply chain (see [1] for a discussion of Apple’s supply chain and [29] for a discussion of the tech companies with the largest market capitalization).

    A tracking portfolio consisting of these assets is built by minimizing the ordinary least-squares loss function on the training dataset

where is the number of data points and is the number of assets used to track the Apple stock price. For illustrative and comparative purposes, the LASSO of [31], the Ridge regression (or Tikhonov regularization) of [32], the Elastic-Net (ENET) regularization of [34], and NEU-OLS are compared.

The ENET selects the optimal regression weights by minimizing the loss function; ENET Opt. Power denotes the solution to

with selected by sequential validation. The LASSO is the special case where is fixed to , and Ridge regression is the special case where . The penalty

reduces the number of explanatory parameters in a model by forcing the regression weights towards , so that only the most significant parameters are fit. The meta-parameter controls the strength of this sparsity penalty, while controls the aggressiveness of the variable-selection process, with giving a more aggressive choice and towards a non-aggressive penalty. ENET, LASSO, and Ridge regression are interpreted in [33] as robust regression problems in which the regression is optimized against varying types of shocks in the data; alternatively, as in [31, 35], they can be interpreted as modifications of the regression problem that are able to detect and converge to the true set of explanatory variables under linear and Gaussian noise assumptions.
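The exact objective is not legible in this extract; a reading consistent with the description above (LASSO and Ridge as the endpoints of a penalty power, with the power selected on the validation set for "ENET Opt. Power") is sketched below, with all symbols ours.

```python
import numpy as np

def penalized_loss(w, X, y, lam, q):
    """Squared tracking error plus lam * sum_j |w_j|**q.
    q = 1 recovers the LASSO penalty and q = 2 the Ridge penalty; smaller q is a
    more aggressive variable-selection choice, and lam sets the penalty strength."""
    resid = y - X @ w
    return float(resid @ resid + lam * np.sum(np.abs(w) ** q))
```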

In this example, years of adjusted stock prices are used to compute the weights, ending on July . The modeling assumption is made that the data does not follow a constant pattern throughout time, and the data is broken up into rolling windows. Regression weights are dynamically updated on each window, as is standard in practice (for example, see [9, 2, 30]). In order to extract meaningful weights , the time-series must be shown to be co-integrated. The Dickey-Fuller unit-root test is performed on the returns of the adjusted stock price time-series, and the null hypothesis that there exists a unit root is rejected with a p-value of less than and a Dickey-Fuller statistic of ; therefore the weights can meaningfully be computed from the adjusted stock price returns using regression methods (see [22] for more details on co-integrated time-series).
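A minimal sketch of this unit-root check, using the augmented Dickey-Fuller test from statsmodels on a placeholder return series (the actual adjusted-price data is not reproduced here):

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
prices = 100.0 * np.exp(np.cumsum(0.001 * rng.standard_normal(2500)))  # placeholder for adjusted prices
returns = np.diff(np.log(prices))

adf_stat, p_value, *_ = adfuller(returns)
if p_value < 0.01:
    print(f"unit root rejected (ADF = {adf_stat:.2f}, p = {p_value:.4f}): regression on returns is justified")
```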

Mean 95%L 95%U 99%L 99%U
    OLS 4.185 4.038 4.385 4.017 4.448
    Ridge -0.831 -0.916 -0.715 -0.928 -0.678
    LASSO 0.581 0.568 0.599 0.566 0.604
    ENET 0.526 0.519 0.535 0.518 0.538
    NEU-OLS 0.204 0.202 0.208 0.202 0.209
    Table 1: Mean Aggregate Training Errors.

Each window is sequentially divided into a training, a validation, and a test set. Each of the training sets consists of 200 observations, the validation sets consist of 2 weeks, and the test sets consist of the last week of each moving window. The proportions invested in each asset, denoted , are the regression weights on that window, and are recalibrated on each window using each of the stocks’ returns. The mean training, validation, and test errors aggregated across the windows are reported in Tables 1, 3, and 2, respectively. The optimal parameters for the Ridge, LASSO, ENET, and NEU-OLS are re-calibrated on every window using sequential validation. The optimization of the parameters defining the reconfiguration of the data was performed by alternating between stochastic gradient descent and randomized searches of the parameter space.

    Mean 95%L 95%U 99%L 99%U
    OLS 4.217 4.214 4.222 4.214 4.224
    Ridge -0.853 -0.946 -0.726 -0.959 -0.686
    LASSO 0.582 0.573 0.594 0.572 0.598
    ENET 0.525 0.518 0.534 0.517 0.537
    NEU-OLS 0.204 0.203 0.206 0.203 0.206
    Table 2: Mean Aggregate Testing Errors.
    Mean 95%L 95%U 99%L 99%U
    OLS 4.202 4.058 4.397 4.038 4.458
    Ridge -0.845 -0.928 -0.734 -0.939 -0.699
    LASSO 0.581 0.571 0.594 0.569 0.598
    ENET 0.525 0.521 0.530 0.520 0.531
    NEU-OLS 0.204 0.203 0.206 0.202 0.206
    Table 3: Mean Aggregate Validation Errors.

As expected, OLS performs worst and ENET performs best amongst the benchmark regression methods. All the methods except Ridge regression are conservative and under-estimate the price of Apple stock. NEU-OLS has the lowest error on the training, validation, and test sets across every window. Moreover, it has the tightest confidence intervals. Therefore NEU-OLS achieves a lower bias as well as a lower variance.

    Algorithm OLS NEU-OLS Ridge LASSO ENET
    Run Time (sec) 0.01 104.02 0.02 0.02 0.07
    1 12,980.03 2.74 2.57 9.11
    Table 4: Runtime Comparison.

NEU-OLS does have its own drawbacks, namely computational time. Once the reconfiguration of the data is learned, the OLS algorithm can be run directly on the reconfigured dataset, making NEU-OLS just as fast as OLS. However, on the first run, when the reconfiguration is being learned, NEU-OLS is significantly slower than the other methods compared in this paper.

Table 4 reports the run-times of performing the OLS, NEU-OLS, Ridge regression, LASSO, and ENET algorithms on the dataset considered in this example, using a machine with an Intel(R) Core(TM) i5-6200U CPU at 2.30GHz and 7844MB of available RAM, running the 18.04 LTS version of the Ubuntu Linux distribution.

We conclude that NEU-OLS has the lowest prediction error amongst the regression methods considered in this example, and that its execution is just as fast as OLS once the reconfiguration has been learned. However, on the first run, when the reconfiguration is being learned, NEU-OLS is notably slower than the other methods. Therefore, NEU-OLS may be the best of these options when speed is not a large factor, but it may not be ideal for settings where the runtime of an algorithm is a determining factor, such as live high-frequency trading.

    Example 3.2 (Dimensionality Reduction: German-Bond Yield Curve).

    Principal component analysis (PCA) is a non-parametric technique which converts correlated data into a set of uncorrelated vectors , each explaining progressively less of the data’s variance than the last one. The vectors , called principal components, are obtained through the recursion relation:

    (3.1)

    where is the empirical data matrix with column-wise means removed.

PCA is commonly used in finance, where high-dimensional data is typical. A classical use is for pricing zero-coupon bonds. Denote by the price of a zero-coupon bond with maturity at time . The price can be modeled using the yield curve , which is defined as the rate at which the price of the bond equals the discounted cash flows. That is,

    The first three principal components of the yield curve are known to explain its level, slope, and curvature respectively (see [7] for more details). The validation-set loss function which we will use is

    (3.2)

where is the vector of bond yields observed on the day in the validation set (resp. training set) and is the number of principal components used to give a low-dimensional approximation of the yield curve. As discussed in [7], the first three principal components of most yield curves tend to explain about of the data’s variance.
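A hedged numpy sketch of this validation loss: fit the first k principal components on the de-meaned training yields and measure the squared reconstruction error on the held-out yields (a stand-in for Equation (3.2); variable names are ours).

```python
import numpy as np

def pca_validation_loss(Y_train, Y_val, k):
    """Squared reconstruction error of the validation yields using the first k
    principal components fitted on the (column-de-meaned) training yields."""
    mu = Y_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(Y_train - mu, full_matrices=False)
    W = Vt[:k].T                                   # maturities x k loading matrix
    resid = (Y_val - mu) - (Y_val - mu) @ W @ W.T  # project onto the span and take the residual
    return float(np.sum(resid ** 2))
```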

As benchmarks, two common alternatives to PCA, kernel PCA (kPCA) and sparse PCA (sPCA), will also be considered. Kernel PCA first maps the data into another space, called the feature space, wherein the data can be more naturally partitioned by hyperplanes, and then performs PCA in the feature space. The transformation into the feature space is typically made indirectly, by only describing the feature space’s inner product, which is possible due to the reproducing kernel Hilbert space structure of the feature space. A choice of inner product between two vectors in the feature space is

Unlike NEU-PCA, the non-linear transformation used in kPCA is not learned from the data but chosen before the algorithm is executed. Since kPCA does not make computations directly in the feature space but works with it indirectly by exploiting its inner product, kPCA does not allow for reconstruction of the data. This is not the case with NEU-PCA, since it is entirely constructive.

Analogously to the LASSO, Ridge regression, and ENET regularization problems, sPCA penalizes Equation (3.1) in order to obtain sparser principal components. The implementation considered in this paper uses the sPCA formulation of [8]. Sparse PCA has the advantage over PCA of being more interpretable, lower-dimensional, and more robust due to its low dimensionality (see [36, 8] for more details on sPCA).
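For reference, the three benchmark decompositions are available off the shelf in scikit-learn; a minimal sketch on a placeholder yield matrix (the NEU variants are obtained by first reconfiguring the data as in Meta-Algorithm 2.22):

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA, SparsePCA

Y = np.random.default_rng(0).standard_normal((1200, 15))  # placeholder for the days x maturities yield matrix

pca = PCA(n_components=3).fit(Y)
kpca = KernelPCA(n_components=3, kernel="rbf").fit(Y)  # PCA in an implicit feature space
spca = SparsePCA(n_components=3).fit(Y)                # l1-penalized, sparser loadings
print(pca.explained_variance_ratio_.sum())
```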

For this illustration, PCA, kPCA, sPCA, NEU-PCA, NEU-kPCA, and NEU-sPCA will all be performed on bond yield data. The daily bond data considered in this example consists of stripped German government bond prices between January 2010 and December 2014. The considered bond maturities are between 6 months and 30 years. The training set consists of the first 1000 days of data, the validation set of the next 200 days, and the test set of the remainder. The reconfigurations defining the NEU methods will be learned using NEU-PCA. The NEU-kPCA and NEU-sPCA methods will use the reconfigurations learned from NEU-PCA.

The NEU-PCA algorithm is implemented by optimizing the training and validation objective functions, alternating between random searches and bulk iterations of the Nelder-Mead heuristic search method (see [21] for details on Nelder-Mead optimization). This heuristic scheme provided faster convergence than the direct use of stochastic gradient descent as in Example 3.1, due to the data’s high dimensionality. After learning the reconfigurations defining the NEU-PCA algorithm, the same reconfigurations were used to define NEU-kPCA and NEU-sPCA. This can be interpreted as a form of transfer learning between analogous models.

    N.Fact. PCA NEU-PCA kPCA NEU-kPCA sPCA NEU-sPCA
    1 0.7749 0.7868 0.0906 0.0894 0.9756 0.9774
    2 0.8833 0.8936 0.9171 0.9175 0.9942 0.9949
    3 0.9417 0.9506 0.9948 0.9955 0.9992 0.9996
    4 0.9654 0.9688 0.9981 0.9981 0.9999 0.9999
    Table 5: Comparison of Variance Explained in Training Set.

Table 5 shows that NEU-PCA explains more of the training set variance than PCA does. kPCA and sPCA seem to explain more training set variance than NEU-PCA, but not as much as NEU-kPCA or NEU-sPCA. However, examining the test set predictive performance of the algorithms in Table 6, it is observed that the kPCA-based algorithms are not able to accurately forecast the yield curve. Therefore, NEU-PCA is the most parsimonious option for prediction amongst these methods, while NEU-kPCA explains the most training set variance of the data.

The more modest gains of this method are due to the distinct training and validation loss functions. For example, removing the validation loss function, and thereby the early stopping criterion, from the definition of NEU, it can be seen that one NEU-PCA can explain more than of the training set variability of the data. However, this leads to poor out-of-sample predictions of the test set yield curves as well as uninterpretable NEU-PCAs.

    N.Fact. PCA NEU-PCA kPCA NEU-kPCA sPCA NEU-sPCA
    1 2,245.643 2,153.412 829.210 827.651 497.683 471.695
    2 344.961 294.106 829.200 827.644 290.040 265.822
    3 28.633 17.927 829.197 827.640 14.489 12.400
    4 4.424 2.975 829.190 827.634 12.061 12.210
    Table 6: Comparison of test set Predictions according to the loss-function of Equation (3.2).
    Figure 4: First four principal components of the German Bond Yield-curve.

In this implementation, the NEU-PCAs of the yield curve are shown in Figure 4. Figure 4 shows that, upon rescaling, the first and fourth PCAs and NEU-PCAs have identical interpretations, while the second and fourth NEU-PCAs look similar to flipped versions of the second and fourth PCAs. The NEU-PCAs in Figure 4 are in the transformed, non-Euclidean space, whereas the PCAs in Figure 4 are in the Euclidean space itself. It should not be surprising that the and factor sPCA outperforms the -factor NEU-sPCA, since the reconfiguration used for NEU-sPCA was trained using the PCA algorithm.

In this implementation, the NEU-PCAs provided the most robust out-of-sample predictions of the yield curve, explained more of the training set variance than the PCAs did, and retained the interpretability of each of the principal components. Moreover, like PCA, the approach is constructive and can therefore be used for reconstruction purposes, which is not the case for kPCA due to its working only indirectly with the feature space (see [27, Section 4] for a brief discussion of the data-reconstruction shortcomings of kPCA).

Table 7 examines the runtime of each method. All six algorithms were run on a machine with the same specs as the one used in Example 3.1.

    Algorithm Run Time (sec)
    PCA 0.01 1
    NEU-PCA 2.89 474.99
    kPCA 0.08 12.50
    NEU-kPCA 2.96 486.48
    sPCA 0.81 132.40
    NEU-sPCA 3.70 606.39
    Table 7: Runtime Comparison.

The central shortcoming of the NEU meta-algorithm is underlined by Table 7. Its second row shows that the runtimes of the NEU algorithms are about 1000 times slower than PCA and 100 times slower than kPCA. Therefore, if speed is necessary, it may be more desirable to turn to PCA or kPCA than to their NEU counterparts. However, if time can be spared, the first three NEU-PCAs make the -factor NEU-PCA the best overall choice due to its interpretability, its out-of-sample predictive power, and its explaining a competitive level of the training set’s variance.

    The next example investigates the implications of the universal approximation and universal reconfiguration properties of reconfigurations in the controlled environment provided by simulation studies.

    3.2 Simulation Studies - Investigation of Universal Properties

These simulation studies focus on illustrating the universal approximation and universal reconfiguration properties of reconfigurations, and thus of the NEU meta-algorithm, through the lens of regression analysis. In these simulation studies, the data will be generated according to the model

    (3.3)

where , , and is a non-linear function. Three non-linear functions will be investigated; these are

    1. ,

    2. ,

The first function investigates how well NEU-OLS can approximate non-linear functions whose global shape, unlike polynomials or periodic functions, cannot be determined from local data. The second evaluates how NEU-OLS deals with functions oscillating at non-constant speeds. The third looks at how well the NEU-OLS algorithm can approximate functions with discontinuities.

The NEU-OLS algorithm will be benchmarked against two standard non-parametric regression algorithms, penalized smoothing splines regression (p-splines) and Locally Weighted Scatterplot Smoothing (LOESS). Smoothing splines regression is a highly flexible approximation method. A smoothing spline is a twice continuously differentiable function constructed by gluing a finite number , at most equal to , of cubic polynomials together. The optimal p-spline, denoted here by , is chosen by minimizing the objective function

    where are real numbers, is a suitable function, , and the pairs are generated according to the model described in Equation (3.3).

The value of the tuning parameter determines how smooth is and how well it interpolates the data-points . If and , then interpolates the data. Conversely, as approaches infinity and becomes small, approaches the solution to an ordinary linear regression (see [14, Chapter 5] for details on smoothing splines and p-splines). Unlike smoothing splines, p-splines do not require a knot at every point and therefore are less susceptible to over-fitting than smoothing splines. The parameters and will be chosen by -fold cross-validation.

    LOESS is a non-parametric regression method, where a smooth polynomial is fit to the data. The best fitting polynomial, denoted by , is found by minimizing the value of the loss-function

where is the distance of the point to the polynomial. Unlike classical regression problems, the LOESS objective function does not only look at the pairs themselves but also incorporates the importance of nearby points. This is because the closest point on to need not be but may be a neighboring point on . The degree of the polynomial is chosen using cross-validation (see [4] for details).

For each simulation, observations will be generated on the interval ; the data will then be normalized to the unit square for uniformity between the three examples. The models’ tuning parameters will be estimated on a subset of data-points sampled in a stratified manner on evenly spaced subintervals, by cross-validation or, in the case of NEU-OLS, by early stopping. The remaining sample points will serve as the test set. The run-times reported in these simulation studies were obtained on the same PC as used to report the run-times in Table 4.
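A hedged sketch of the benchmark fits on simulated data of the form (3.3), using a smoothing spline from SciPy (as a stand-in for the penalized spline) and LOWESS from statsmodels; the target function and noise level are placeholders, since the actual functions appear in the list above.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 1.0, 400))
f = lambda t: 1.0 / (1.0 + np.exp(-20.0 * (t - 0.5)))  # placeholder non-linear target
y = f(x) + 0.05 * rng.standard_normal(x.size)          # model (3.3): signal plus Gaussian noise

spline = UnivariateSpline(x, y, s=0.5)                  # s plays the role of the smoothness penalty
y_spline = spline(x)
y_loess = lowess(y, x, frac=0.2, return_sorted=False)   # locally weighted regression (LOESS)

print(float(np.mean((y_spline - f(x)) ** 2)), float(np.mean((y_loess - f(x)) ** 2)))
```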

    Example 3.3 (Simulation Study - NEU-OLS - Non-Locality).

In this simulation study, NEU-OLS will be compared against LOESS and -splines regression when the function in Equation (3.3) is assumed to be

    (3.4)

    This simulation study was performed by generating