1 Introduction
Many statistical and learning algorithms, such as ordinary linear regression and PCA, admit linear-algebraic formulations making them quick to execute. Their inability to capture nonlinear features has motivated several nonlinear generalizations, which require alternative, computationally costly estimation procedures. Generalized additive models (GAM) and artificial neural networks (ANN) are examples of nonlinear generalizations of linear regression that come with a significant increase in computational cost (see
[14, Chapters 9 and 11] for a discussion of these methods). Patterns in the data are typically interpreted as a function relating explanatory inputs to the observations which they explain. Alternatively, a pattern can be interpreted as the positioning of points in space. Since a function's graph is a specific set of points in space, interpreting a pattern as a configuration of points in space is more general than interpreting it as a function. The non-Euclidean Upgrading (NEU) methodology introduced in this paper can learn any configuration of data. As a consequence, two versions of the universal approximation property (see [5] for details) of ANNs are also recovered.
Non-Euclidean Upgrading (NEU) is a meta-algorithm. Meta-algorithms are algorithms whose inputs and outputs are other algorithms. For example, the Boosting meta-algorithm of [26] efficiently combines learning algorithms to build a more accurate new learning algorithm. Bagging, as introduced in [3], is another meta-algorithm, which generates bootstrapped samples from a given dataset, performs the input algorithm on those bootstrapped samples, and aggregates the predictions into a lower-variance estimate. NEU is also a meta-algorithm: it inputs a learning algorithm and a dataset, and outputs a new algorithm with the universal approximation property built into it. Applying NEU to simple linear algorithms produces algorithms which are interpretable, have a low computational burden, and can predict any pattern to arbitrary precision once trained.
NEU works by first splitting the input data into training and validation components, then performing local perturbations of the space on which the data is defined, executing the learning algorithm on the perturbed training and validation datasets, and evaluating whether the validation-set performance has increased. The procedure continues iteratively, stopping once the validation-set performance begins to drop.
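The iterative procedure just described can be sketched as follows. Here `fit_and_score` and `perturb` are hypothetical stand-ins for the underlying learning algorithm and the reconfiguration maps developed in Section 2, and the acceptance rule is a simplification of the full meta-algorithm:

```python
import numpy as np

def neu_sketch(fit_and_score, X_train, y_train, X_val, y_val,
               perturb, n_iter=50, seed=0):
    """Toy sketch of the NEU loop: propose local perturbations of the
    input space and keep those that raise validation-set performance."""
    rng = np.random.default_rng(seed)
    best = fit_and_score(X_train, y_train, X_val, y_val)
    accepted = []                      # parameters of accepted perturbations
    for _ in range(n_iter):
        theta = rng.normal(size=X_train.shape[1])       # candidate parameters
        Xt, Xv = perturb(X_train, theta), perturb(X_val, theta)
        score = fit_and_score(Xt, y_train, Xv, y_val)
        if score > best:               # validation performance increased
            best, X_train, X_val = score, Xt, Xv
            accepted.append(theta)
        # (the full procedure stops once validation performance drops)
    return accepted, best
```

Any learner exposing a fit-and-evaluate interface can be plugged in for `fit_and_score`; the learned perturbations are composed and later inverted, as described below.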
Figure 1 illustrates how perturbing reconfigures the given dataset and allows a linear regression to explain a nonlinear relationship. After the linear regression is performed, the transformations are inverted and the linear predictor becomes nonlinear. This illustration is analogous to the non-Euclidean regression proposed in [10], with the central difference being that our methodology learns the geometry of the problem, whereas the algorithm in [10] relies on a pre-specified geometry.
Applying NEU to principal component analysis (PCA) generates an analogue of the principal geodesic analysis of
[11]where the geometry is learned from the data. Applying NEU to the unscented Kalman filtering algorithm of
[16] or to the geometric GARCH framework of [13] produces analogues of those algorithms but without a pre-specified geometry. There are many other potential applications of NEU in statistics and machine learning. We consider two examples from finance. The first example considers the use of principal component analysis (PCA) on German bond data. Using NEU on PCA shows that one NEU-principal component performs better than standard principal components.
The second example from finance considers the relationship between the Apple stock price and the stock prices of companies related to Apple. Using NEU on linear regression provides better out-of-sample predictions than the LASSO, Ridge regression, and nonlinear extensions of the Elastic-Net (ENET) procedures. While we consider only two examples from finance to illustrate NEU, the generality and flexibility of the method should allow for similar performance gains in other areas of financial statistics and machine learning.
The remainder of this paper is organized as follows. In Section 2, the mathematical framework for non-Euclidean upgrading is introduced, and the main results regarding the technique's flexibility and predictive performance enhancement are proven. Section 3 investigates the empirical performance of non-Euclidean upgrading on the two examples from finance; the relationship between the Apple stock price and the stock prices of related companies is better explained, on both training and validation sets, using non-Euclidean upgraded regression. Parallels are drawn to the non-Euclidean generalizations of regression and principal geodesic analysis developed in [10] and [11], respectively. We adjoin an appendix with two sections: the first lists the regularity assumptions made and the second contains certain technical proofs.
2 Non-Euclidean Upgrading
This section introduces and develops the NEU meta-algorithm. Reconfigurations are first introduced and a universal approximation property is proven. The NEU meta-algorithm is then introduced and its performance-gain property is proven.
2.1 Reconfiguration
For the remainder of this paper, a dataset will be comprised of training and validation sets. The training set will be denoted by and the validation set will be denoted by , where and are nonnegative integers and .
Reconfigurations perturbing the dataset are smooth maps from back into itself (smooth auto-diffeomorphisms) which satisfy certain local properties. These are defined as follows.
Definition 2.1 (Reconfiguration Map)
Let be an open subset of and be a starshaped domain in of dimension . A reconfiguration on is a map
satisfying the following properties:

Invertibility: For every , the map is a bijection,

Smoothness: For every , the maps , and are continuously differentiable,

Smooth Parametrization: For every in , the map is continuously differentiable,

Local Transience: For every in with , there exists such that
where is the Euclidean distance on .

Identity: The subset of is nonempty.
The central example of a reconfiguration map is a rapidly decaying rotation concentrated on a disc. These rotations slow exponentially as the boundary of the disc is approached; beyond the disc's boundary, the reconfiguration map becomes the identity transformation. Rapidly decaying rotations are illustrated in Figure 2.
Definition 2.2 (Rapidly Decaying Rotations)
Let denote the set of skew-symmetric matrices and set . A rapidly decaying rotation is the map defined by
(2.1)  
where is the Gaussian bump-function supported on the ball of radius centered at the point , defined by
(2.2) 
and is the matrix exponential map.
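In two dimensions the matrix exponential of a skew-symmetric matrix is an ordinary rotation matrix, so the map can be sketched directly. The bump-function normalisation below is one standard choice and the rotation is taken about the disc's centre; both are our assumptions rather than the paper's exact formulas:

```python
import numpy as np

def bump(x, p, r):
    """Smooth bump: equals 1 at the centre p and decays rapidly to 0
    at the boundary of the disc of radius r (one standard choice)."""
    s = float(np.sum((np.asarray(x, float) - p) ** 2)) / r ** 2
    return np.exp(1.0 - 1.0 / (1.0 - s)) if s < 1.0 else 0.0

def rapidly_decaying_rotation(x, angle, p, r):
    """Rotate x about p by angle * bump(x): a full rotation at the
    centre, shrinking smoothly to the identity outside the disc."""
    t = angle * bump(x, p, r)
    c, s = np.cos(t), np.sin(t)
    R = np.array([[c, -s], [s, c]])    # expm of t * [[0, -1], [1, 0]]
    return p + R @ (np.asarray(x, float) - p)
```

Points outside the disc are returned unchanged, matching the identity and local-transience properties of Definition 2.1.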
Proposition 2.3.
Rapidly decaying rotations are reconfiguration maps on . Moreover, the inverse of is
where is the image of under .
Proof.
The proof is deferred to the appendix. ∎
Remark 2.4 (Geometric Interpretation).
The rapidly decaying rotations interpolate between a rotation and the identity map in the interior of the disc of radius
, centered at . However, the interpolation does not take place in , but instead within the Lie algebra tangent to the space of all generalized rotation matrices . This ensures that the map is invertible for all possible parameter choices.
Definition 2.5 (Planar Micro-Bumps)
A planar micro-bump on is the map defined by
(2.3)  
where .
Proposition 2.6.
Planar micro-bumps are reconfiguration maps on .
Proof.
The proof is deferred to the appendix. ∎
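Since the displayed formula (2.3) is not reproduced above, the following is only a plausible illustration of a micro-bump-type map, implemented as a bump-weighted translation: points near the centre are shifted by a vector `v`, and points outside the disc are left fixed. This exact form is an assumption, not the paper's definition:

```python
import numpy as np

def bump(x, p, r):
    # Same smooth bump as for the rapidly decaying rotations (assumed form).
    s = float(np.sum((np.asarray(x, float) - p) ** 2)) / r ** 2
    return np.exp(1.0 - 1.0 / (1.0 - s)) if s < 1.0 else 0.0

def planar_micro_bump(x, v, p, r):
    """Hypothetical micro-bump: translate x by v, weighted by the bump,
    so the map is the identity outside the disc of radius r around p."""
    return np.asarray(x, float) + bump(x, p, r) * np.asarray(v, float)
```

For such a map to be invertible the displacement must be small relative to the disc, mirroring the local-transience property of Definition 2.1.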
Data points are deemed poorly placed if moving them increases the validation-set performance of a learning algorithm. Iteratively applying reconfiguration maps allows poorly placed datapoints to be moved to locations which increase an algorithm's validation-set performance. The local transience property of reconfiguration maps, Definition 2.1 (iv), makes it possible to move only the poorly placed datapoints while leaving the others fixed. The procedure is summarized as follows.
Definition 2.7 (Reconfiguration)
Let be a star-shaped domain in of dimension , let be a smooth submanifold of which is diffeomorphic to , let be a diffeomorphism from onto , let be a reconfiguration map on , and let be in with . Here is as in Definition 2.1 (v). (By the Whitney embedding theorem, any smooth manifold is a smooth subset of a Euclidean space. In this paper, a map is called smooth if it is once continuously differentiable, and a diffeomorphism is a smooth bijection with a smooth inverse.) A reconfiguration is a map from to defined by
where
Reconfiguring a dataset on maps it into new coordinates for the input variables. These coordinates may not be directly interpretable; therefore, after performing the learning algorithm and obtaining an estimate in the new coordinate system, the reconfiguration must be inverted. This inverse procedure is called deconfiguration.
Definition 2.8 (Deconfiguration)
Let be a reconfiguration of . The deconfiguration of is the map denoted by defined as
The universal approximation property of neural networks states that certain neural networks can approximate any function to arbitrary precision (see [5]). The first analogous property for reconfigurations states that any dataset can be transformed into any other dataset of equal size.
Theorem 2.9 (Universal Reconfiguration Property).
Assume that is an open star-shaped domain in of dimension , that is a reconfiguration map on , and that is a diffeomorphism from onto . Let and be subsets of . Then there exist a positive integer and in for which
for every in .
Proof.
The proof is deferred to the appendix. ∎
The universal reconfiguration property implies the following analogues to the universal approximation property of neural networks of [19]. The first captures general functions on a more restricted domain and the second captures a smaller class of functions on a larger domain.
Corollary 2.10 (Universal Approximation Property).
Let be positive integers, be a subset of , and be Borel functions from to . If is diffeomorphic to , then for every countable subset of
, every probability measure
supported on , and every , there exists such that for every there exists a Borel subset of satisfying

.

Here is the second canonical projection of the product space onto , that is, the map taking a pair to (see [20] for details). In the limiting case where , the convergence of to on is pointwise.
Proof.
The proof will be deferred to the appendix. ∎
Corollary 2.11 (Universal Smooth Approximation Property).
Let be positive integers, let be a regular, convex, compact subset of of dimension , and let be continuously differentiable functions from to . If is a reconfiguration map satisfying regularity condition A.5, then for every there exists in such that
Moreover, the limiting function exists and is continuously differentiable .
Proof.
The proof will be deferred to the appendix. ∎
Non-Euclidean upgrading uses reconfigurations to improve a class of learning algorithms which we call objective learning algorithms. These are discussed in the next section.
2.2 Objective Learning Algorithms
The learning algorithms we consider in this paper optimize both training-set and validation-set loss functions. Regularized regression, PCA, k-means, neural networks, Bayesian classifiers, support vector machines, and stochastic filters are all examples of objective learning algorithms.
Objective learning algorithms associate to every pair of training and validation sets of a given size a pair of training-set and validation-set loss functions, as well as a pattern function linking the parameters being optimized to the predictions they can make. This formalization requires the definition of the set of all possible learning algorithms for a fixed set of hyperparameters and parameter-to-prediction function . Here is the dimension of the space in which the datapoints lie, is the dimension of the explanatory parameters, and is the number of -dimensional points outputted by the algorithm.
For example, for a -factor PCA, and for a two-factor PCA and . In the case of linear regression, the regression weights are scalars, therefore . If there is no intercept then , and if there is an intercept, ; in this formulation, is the number of columns of the design matrix.
Let be a positive integer and be a nonnegative integer. Define to be the set of all pairs of maps such that

The map ,

The map ,

Regularity condition A.1 holds.
The function represents the estimated pattern, parameterized by . The parameter lies in the space and is chosen by optimizing the training-set and validation-set loss functions. is the training-set loss function on a dataset of size and is the out-of-sample loss function on a dataset of size . The space of all learning algorithms for a specific pattern function is
Definition 2.12 (Objective Learning Algorithm)
An objective learning algorithm is a map
where the pair of a training set and a validation set is viewed as an element of , and where is the nonnegative integer-valued function mapping a point in Euclidean space to .
Remark 2.13.
Given a dataset consisting of
datapoints, the regression analysis loss function is
(2.4) 
where are the datapoints and are the responses. Incorporating an additional datapoint and an additional response into the regression analysis changes the loss function of Equation (2.4) to
(2.5) 
Both Equations (2.4) and (2.5) define a -dimensional regression problem, but they are, technically, different loss functions. Definition 2.12 overcomes the oddity of a learning algorithm differing with the size of the dataset by defining an objective learning algorithm as a map associating the size of a dataset to the corresponding loss function, which is what one does inadvertently in practice.
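The point of the remark can be made concrete: the sum-of-squares loss of Equation (2.4) has the same functional form for any sample size, so incorporating an extra datapoint simply evaluates the same size-indexed map on a larger input (a minimal numpy sketch; the variable names are ours):

```python
import numpy as np

def ols_loss(beta, X, y):
    """Sum-of-squares regression loss: identical in form for n and
    n + 1 datapoints, as in Equations (2.4) and (2.5)."""
    X, y, beta = np.asarray(X, float), np.asarray(y, float), np.asarray(beta, float)
    return float(np.sum((X @ beta - y) ** 2))

# The n-point and (n + 1)-point losses are technically different
# functions, but both arise from the same map applied to the data size.
X_n  = [[1.0], [2.0]]
y_n  = [1.0, 2.0]
X_n1 = X_n + [[3.0]]          # one additional datapoint
y_n1 = y_n + [4.0]            # one additional response
```

With slope 1, the two-point loss is exactly 0, while the three-point loss picks up the squared residual of the new observation.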
Principal component analysis and regression analysis are objective learning algorithms. This is illustrated by the following two examples.
Example 2.14 (Regression as an Objective Learning Algorithm).
Let be real numbers and let be a linearly independent set of continuously differentiable functions in . Nonlinear regression is an objective learning algorithm represented by

,

,

,

,
where is the component of the -dimensional vector and where is the observed datapoint. Typically, the out-of-sample dataset is taken to be empty unless a regularization or sparsity constraint is imposed.
By adding a penalty term, such as the
norm, to the training-set and validation-set loss functions and expanding the hyperparameter set
accordingly, most regularized regression problems, such as the LASSO of [31], are seen to be objective learning algorithms.
Example 2.15 (PCA as an Objective Learning Algorithm).
Calculating the first principal component of a dataset’s empirical covariance matrix is an objective learning algorithm. Here are represented by

,


,

,
where and are the training and validation sets and , viewed as matrices with their column-wise means removed. Typically, the out-of-sample dataset is taken to be empty. The higher principal components, as well as sparse principal components, can be represented analogously as objective learning algorithms.
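For instance, the optimal evaluation of the first-principal-component problem can be computed directly: center the data column-wise and take the leading right singular vector (a standard numpy sketch, not the paper's notation):

```python
import numpy as np

def first_principal_component(X):
    """First principal component: the unit vector maximising the
    variance of the column-centred data projected onto it."""
    Xc = np.asarray(X, float)
    Xc = Xc - Xc.mean(axis=0)          # remove column-wise means
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]                       # leading right singular vector
```

Higher components are the subsequent rows of `Vt`, matching the recursive formulation used later in Equation (3.1).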
The optimal evaluation of a learning algorithm is a map taking a learning algorithm and a dataset to an optimized pattern. The optimal evaluation is only well-defined on datasets which admit a unique optimizer. This set of regular datasets, called the regular domain of definition of the learning algorithm, is defined as follows.
Definition 2.16 (Regular Domain of Definition)
Let be a learning algorithm. The regular domain of definition of , denoted by
, is the set of all pairs of data points in
satisfying the regularity condition A.2.
The map associating a dataset and an objective learning algorithm to the pattern best describing it is now defined.
Definition 2.17 (Optimal Evaluation)
Given an objective learning algorithm, its optimal evaluation is the output of the function taking as input a pair of training and validation sets in and returning the optimal parameter defined by
Remark 2.18.
The optimal evaluation takes an objective learning algorithm and a dataset and returns the optimizer minimizing the loss function defined by the dataset. For example, in a LASSO regression the optimal evaluation returns the parameters of the line of best fit relating the explanatory variables to the responses, with the tuning parameter optimized according to the validation set.
The requirement that the dataset be in the regular domain of definition of the learning algorithm ensures that the optimal evaluation is a well-defined function. For example, the points do not have a single line of best fit describing their relationship; therefore the optimal evaluation of the regression problem is not defined on that dataset.
As in [14], the performance of a learning algorithm is defined as the negative of its loss function evaluated at the optimal value. The training-set and validation-set performance of an objective learning algorithm are defined in an analogous manner.
Definition 2.19 (Performance)
Let be a learning algorithm. The training set performance of is the function, denoted by , taking a dataset in to the extended real number
The validation set performance of is the function, denoted by , taking a dataset in to the extended real number
Remark 2.20.
The performance is the negative of the loss function evaluated at its optimal evaluation. It provides a measure of how well an objective learning algorithm can explain a given dataset.
A dataset in is said to maximize the in-sample (resp. out-of-sample) performance of if there is no other dataset in having the same number of training and validation data points and a strictly higher performance.
The main result can now be stated: if the data is in the regular domain of definition of a learning algorithm and is not already in an optimal position, then there is a reconfiguration which increases the performance of that algorithm. An example of optimally positioned data for linear regression is data that is perfectly explained by a line on both the training and validation sets. In this extreme case, it is natural to expect that no improvement can be made to linear regression.
Theorem 2.21 (Performance Gain).
Proof.
Without loss of generality, assume that does not maximize ; the proof of the statement for is identical. Then there is in which has a higher value of and the same number of training and validation datapoints.
Therefore, by the universal reconfiguration property of Theorem 2.9, there exists
such that
∎
Theorem 2.21 guarantees that there exists a reconfiguration of the data which improves an algorithm's training-set and validation-set performance. The NEU meta-algorithm is a procedure which learns the reconfiguration of the space ensuring that the training and validation sets are positioned in a way which reduces the training-set and validation-set loss functions. This is formalized by the meta-algorithm illustrated in Figure 3 and made explicit in Meta-Algorithm 2.22.
3 Numerical Implementation of NEU-OLS and NEU-PCA
We begin by investigating the empirical performance of non-Euclidean upgrading. The first two implementations focus on real datasets and the last uses simulated data. The first two use the rapidly decaying rotations to reconfigure the data, whereas the last example uses micro-bumps since the data lies in .
3.1 Datadriven Studies
The performance of the NEU meta-algorithm will be investigated in both the regression and dimensionality-reduction settings on financial datasets, beginning with a regression analysis study.
Example 3.1 (Regression Analysis: Apple Stock Tracker).
Predicting the relationship between the prices of a set of assets is central to many trading strategies. For example, strategies that rely on illiquid assets may create a portfolio comprised entirely of liquid assets which tracks the illiquid asset's movements. Since this is a particular application of tracking portfolios, the technique is demonstrated in this example using liquid stocks. The target stock price will be denoted by and the prices of the assets making up the tracking portfolio will be denoted by .
In this example, will be the price of Apple stock, and will be the stock prices of IBM, Google, Cisco Systems Inc., Microsoft Corporation, Acacia Communications Inc., NXP Semiconductors NV, Qualcomm, Analog Devices Inc., Glu Mobile Inc., Jabil Inc., Micron, and STMicroelectronics NV. This portfolio is comprised of the stock of major companies in the same industry as well as major companies making up Apple's supply chain (see [1] for a discussion of Apple's supply chain and [29] for a discussion of the tech companies with the largest market capitalization).
A tracking portfolio consisting of these assets is built by minimizing the ordinary leastsquares loss function on the training dataset
where is the number of data points and is the number of assets used to track the Apple stock price. For illustrative and comparative purposes, the LASSO of [31], the Ridge (or Tikhonov regularization) regression of [32], the Elastic-Net regularization (ENET) of [34], and NEU-OLS are compared.
The ENET selects the optimal regression weights as the solution to
with selected by sequential validation. The LASSO is the special case where is fixed to , and Ridge regression is the special case where . The penalty
reduces the number of explanatory parameters in a model by forcing the regression weights towards , so that only the most significant parameters are fit. The metaparameter controls the strength of this sparsity penalty, while controls the aggressiveness of the variable-selection process, with giving a more aggressive choice and a less aggressive penalty. ENET, LASSO, and Ridge regression are interpreted in [33] as robust regression problems in which the regression is optimized against varying types of shocks in the data; alternatively, they can be interpreted, as in [31, 35], as modifications of the regression problem that are able to detect and converge to the true set of explanatory variables under linear and Gaussian noise assumptions.
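Under this reading, the family of penalized losses can be sketched as an OLS loss plus a power penalty; the parameter names `lam` and `q` are ours, with q = 1 recovering the LASSO penalty and q = 2 the Ridge penalty:

```python
import numpy as np

def penalised_loss(beta, X, y, lam, q):
    """OLS loss plus the sparsity penalty lam * sum_j |beta_j|**q.
    q = 1 gives the LASSO penalty, q = 2 the Ridge penalty; values in
    between interpolate the aggressiveness of variable selection."""
    beta = np.asarray(beta, float)
    resid = np.asarray(X, float) @ beta - np.asarray(y, float)
    return float(resid @ resid + lam * np.sum(np.abs(beta) ** q))
```

In practice `lam` and `q` would be chosen by sequential validation, as described above.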
In this example, years of adjusted stock prices are used to compute the weights, ending on July . The modeling assumption is made that the data does not follow a constant pattern throughout time, and the data is broken up into rolling windows. Regression weights are dynamically updated on each window, as is standard in practice (see, for example, [9, 2, 30]). In order to extract meaningful weights
, the time-series must be shown to be cointegrated. The Dickey-Fuller unit-root test is performed on the returns of the adjusted stock-price time-series, and the null hypothesis that there exists a unit root is rejected with a p-value of less than
and a Dickey-Fuller statistic of ; therefore the can meaningfully be computed from the adjusted stock prices' returns using regression methods (see [22] for more details on cointegrated time-series).

Mean  95% L  95% U  99% L  99% U

OLS  4.185  4.038  4.385  4.017  4.448 
Ridge  0.831  0.916  0.715  0.928  0.678 
LASSO  0.581  0.568  0.599  0.566  0.604 
ENET  0.526  0.519  0.535  0.518  0.538 
NEU-OLS  0.204  0.202  0.208  0.202  0.209
Each window is sequentially divided into a training, a validation, and a test set. Each training set consists of 200 observations, each validation set consists of 2 weeks, and each test set consists of the last week of the moving window. The proportions invested in each asset, denoted , are the regression weights on that window, and are recalibrated on each window using each of the stocks' returns. The mean training, validation, and test errors aggregated across the windows are reported in Tables 1, 2, and 3, respectively. The optimal parameters for the Ridge, LASSO, ENET, and NEU-OLS are recalibrated on every window using sequential validation. The optimization of the parameters defining the reconfiguration of the data was performed by alternating between stochastic gradient descent and randomized searches of the parameter space.
Mean  95%L  95%U  99%L  99%U  

OLS  4.217  4.214  4.222  4.214  4.224 
Ridge  0.853  0.946  0.726  0.959  0.686 
LASSO  0.582  0.573  0.594  0.572  0.598 
ENET  0.525  0.518  0.534  0.517  0.537 
NEU-OLS  0.204  0.203  0.206  0.203  0.206
Mean  95%L  95%U  99%L  99%U  

OLS  4.202  4.058  4.397  4.038  4.458 
Ridge  0.845  0.928  0.734  0.939  0.699 
LASSO  0.581  0.571  0.594  0.569  0.598 
ENET  0.525  0.521  0.530  0.520  0.531 
NEU-OLS  0.204  0.203  0.206  0.202  0.206
As expected, the OLS performs worst and the ENET performs best amongst the benchmark regression methods. All the methods, except the Ridge regression, are conservative and underestimate the price of Apple stock. The NEU-OLS has the lowest error in the training, validation, and test sets across every window. Moreover, it has the tightest confidence intervals. Therefore the NEU-OLS achieves both a lower bias and a lower variance.
Algorithm  OLS  NEU-OLS  Ridge  LASSO  ENET

Run Time (sec)  0.01  104.02  0.02  0.02  0.07 
Relative  1  12,980.03  2.74  2.57  9.11
The NEU-OLS does have its own drawback, namely computational time. Once the reconfiguration of the data is learned, the OLS algorithm can be run directly on the reconfigured dataset, making NEU-OLS just as fast as OLS. However, on the first run, when the reconfiguration is being learned, the NEU-OLS is significantly slower than the other methods compared in this paper.
Table 4 reports the runtimes of performing the OLS, NEU-OLS, Ridge regression, LASSO, and ENET algorithms on the dataset considered in this example, using an Intel(R) Core(TM) i5-6200U CPU at 2.30GHz with 7844MB of available RAM, running the 18.04 LTS version of the Ubuntu Linux distribution.
We conclude that NEU-OLS has the lowest prediction error amongst the regression methods considered in this example, and that its execution speed is just as fast as OLS once the reconfiguration has been learned. However, on the first run, when the reconfiguration is being learned, NEU-OLS is notably slower than the other methods. Therefore, NEU-OLS may be the best of these options when speed is not a major factor, but it may not be ideal for settings where the runtime of an algorithm is a determining factor, such as live high-frequency trading.
Example 3.2 (Dimensionality Reduction: German Bond Yield Curve).
Principal component analysis (PCA) is a nonparametric technique which converts correlated data into a set of uncorrelated vectors , each explaining progressively less of the data's variance than the last. The vectors , called principal components, are obtained through the recursion relation:
(3.1)  
where is the empirical data matrix with its column-wise means removed.
PCA is commonly used in finance, where high-dimensional data is typical. A classical use is for pricing zero-coupon bonds. Denote by
the price of a zero-coupon bond with maturity at time . The price can be modeled using the yield curve , which is defined as the rate at which the price of the bond equals the discounted cash flows. The first three principal components of the yield curve are known to explain its level, slope, and curvature, respectively (see [7] for more details). The validation-set loss function which we will use is
(3.2) 
where is the vector of bond yields observed on the day in the validation set (resp. training set) and is the number of principal components used to give a low-dimensional approximation of the yield curve. As discussed in [7], the first three principal components of most yield curves tend to explain about of the data's variance.
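The proportion of training-set variance explained by the first k principal components, as reported for the yield-curve data below, can be computed from the singular values of the centred data matrix (a minimal sketch):

```python
import numpy as np

def variance_explained(X, k):
    """Fraction of total variance captured by the first k principal
    components of the column-centred data matrix X."""
    Xc = np.asarray(X, float)
    Xc = Xc - Xc.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)   # singular values, descending
    return float(np.sum(s[:k] ** 2) / np.sum(s ** 2))
```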
As benchmarks, two common alternatives to PCA, kernel PCA (kPCA) and sparse PCA (sPCA), will also be considered. Kernel PCA first maps the data into another space, called the feature space, wherein the data can be more naturally partitioned by hyperplanes, and then performs PCA in the feature space. The transformation into the feature space is typically made indirect by specifying only the feature space's inner product, which is possible due to the reproducing-kernel Hilbert-space structure of the feature space. Unlike NEU-PCA, the nonlinear transformation used in kPCA is not learned from the data but is chosen before the algorithm is executed. Since kPCA does not make computations directly in the feature space but works with it indirectly by exploiting its inner product, kPCA does not allow for reconstruction of the data. This is not the case with NEU-PCA, which is entirely constructive.
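A minimal Gram-matrix implementation illustrates why kPCA works only indirectly with the feature space: all computations use the kernel matrix, never explicit feature-space coordinates. The RBF kernel and the parameter `gamma` below are our assumptions, not necessarily the paper's choices:

```python
import numpy as np

def kernel_pca_scores(X, k, gamma=1.0):
    """Project data onto the first k kernel principal components using
    only the (double-centred) RBF Gram matrix."""
    X = np.asarray(X, float)
    sq = np.sum(X ** 2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))
    n = len(X)
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                          # centre in feature space
    w, V = np.linalg.eigh(Kc)               # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:k]
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))
```

Because only the Gram matrix is used, the original coordinates cannot be reconstructed from the scores, in contrast with the constructive NEU-PCA.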
Analogously to the LASSO, Ridge, and ENET regularization problems, sPCA adds a penalty to Equation (3.1) in order to obtain sparser principal components. The implementation considered in this paper uses the sPCA formulation of [8]. Sparse PCA has the advantage over PCA of being more interpretable, lower-dimensional, and more robust due to its low dimensionality (see [36, 8] for more details on sPCA).
For this illustration, PCA, kPCA, sPCA, NEU-PCA, NEU-kPCA, and NEU-sPCA will all be performed on bond-yield data. The daily bond data considered in this example consists of stripped German government bond prices between January 2010 and December 2014. The considered bond maturities are between 6 months and 30 years. The training set consists of the first 1000 days of data, the validation set of the next 200 days, and the test set of the remainder. The reconfigurations defining the NEU methods will be learned using NEU-PCA; the NEU-kPCA and NEU-sPCA methods will use the reconfigurations learned from NEU-PCA.
The NEU-PCA algorithm is implemented by optimizing the training and validation objective functions, alternating between random searches and bulk iterations of the Nelder-Mead heuristic search method (see
[21] for details on Nelder-Mead optimization). This heuristic scheme provided faster convergence than direct use of stochastic gradient descent as in Example 3.1, due to the data's high dimensionality. After learning the reconfigurations defining the NEU-PCA algorithm, the same reconfigurations were used to define NEU-kPCA and NEU-sPCA. This can be interpreted as a form of transfer learning between analogous models.
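The Nelder-Mead step can be reproduced with scipy; the quadratic `loss` below is a hypothetical stand-in for the training/validation objective over the reconfiguration parameters:

```python
import numpy as np
from scipy.optimize import minimize

def loss(theta):
    """Stand-in objective over reconfiguration parameters (assumed)."""
    return float(np.sum((theta - np.array([1.0, -2.0])) ** 2))

# Derivative-free Nelder-Mead search, of the kind used for the NEU-PCA fit.
result = minimize(loss, x0=np.zeros(2), method="Nelder-Mead")
```

Being derivative-free, Nelder-Mead pairs naturally with the random restarts mentioned above when the objective is high-dimensional or noisy.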
N. Fact.  PCA  NEU-PCA  kPCA  NEU-kPCA  sPCA  NEU-sPCA

1  0.7749  0.7868  0.0906  0.0894  0.9756  0.9774 
2  0.8833  0.8936  0.9171  0.9175  0.9942  0.9949 
3  0.9417  0.9506  0.9948  0.9955  0.9992  0.9996 
4  0.9654  0.9688  0.9981  0.9981  0.9999  0.9999 
Table 5 shows that NEU-PCA explains more of the training-set variance than PCA does. kPCA and sPCA seem to explain more training-set variance than NEU-PCA, but not as much as NEU-kPCA or NEU-sPCA. However, examining the test-set predictive performance of the algorithms in Table 6, it is observed that the kPCA-based algorithms are not able to accurately forecast the yield curve. Therefore, NEU-PCA is the most parsimonious option for prediction among these methods, while NEU-kPCA explains the most training-set variance of the data.
The more modest gains of this method are due to the distinct training and validation loss functions. For example, removing the validation loss function, and thereby the early-stopping criterion in the definition of NEU, one NEU-PCA can explain more than of the training-set variability of the data. However, this leads to poor out-of-sample predictions of the test-set yield curves, as well as uninterpretable NEU-PCAs.
N. Fact.  PCA  NEU-PCA  kPCA  NEU-kPCA  sPCA  NEU-sPCA

1  2,245.643  2,153.412  829.210  827.651  497.683  471.695 
2  344.961  294.106  829.200  827.644  290.040  265.822 
3  28.633  17.927  829.197  827.640  14.489  12.400 
4  4.424  2.975  829.190  827.634  12.061  12.210 
Figure 4 plots the PCAs and NEU-PCAs of the yield curve. It shows that, upon rescaling, the first and fourth PCA and NEU-PCA factors have identical interpretations, while the second and third NEU-PCA factors look similar to flipped versions of the second and third PCA factors. The NEU-PCAs in Figure 4 lie in the transformed, non-Euclidean space, whereas the PCAs lie in the Euclidean space itself. It should not be surprising that the four-factor sPCA outperforms the four-factor NEU-sPCA, since the reconfiguration used for NEU-sPCA was trained using the PCA algorithm.
In this implementation, the NEU-PCAs provided the most robust out-of-sample predictions of the yield curve, explained more of the training-set variance than the PCAs did, and retained the interpretability of each of the principal components. Moreover, like PCA, the approach is constructive and can therefore be used for reconstruction purposes; this is not the case for kPCA, which works with the feature space only indirectly (see [27, Section 4] for a brief discussion of the data-reconstruction shortcomings of kPCA).
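The constructive nature of PCA reconstruction can be seen in a few lines: the retained factors are mapped straight back to the data space by the loading matrix, with no feature-space inversion required. A sketch on synthetic low-rank data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 6))  # exactly rank-2 data
mu = X.mean(axis=0)
U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)

k = 2                                    # number of retained factors
scores = (X - mu) @ Vt[:k].T             # project onto the leading components
X_hat = scores @ Vt[:k] + mu             # explicit reconstruction in data space

err = float(np.max(np.abs(X - X_hat)))   # exact up to floating-point error
```

Because the data are exactly rank two, retaining two components reconstructs every observation up to floating-point error; kPCA offers no such direct inverse map.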
Table 7 examines the runtime of each method. All six algorithms were run on a machine with the same specifications as in Example 2.14.
Table 7: Runtimes of each algorithm.

Algorithm | Run Time (sec) | Relative to PCA
PCA       | 0.01           | 1
NEU-PCA   | 2.89           | 474.99
kPCA      | 0.08           | 12.50
NEU-kPCA  | 2.96           | 486.48
sPCA      | 0.81           | 132.40
NEU-sPCA  | 3.70           | 606.39
The central shortcoming of the NEU meta-algorithm is underlined by Table 7. Its second row shows that the NEU algorithms run roughly 500 times slower than PCA and roughly 40 times slower than kPCA. Therefore, if speed is essential, it may be preferable to use PCA or kPCA rather than their NEU counterparts. However, if time can be spared, the three-factor NEU-PCA is the best overall choice, due to its interpretability, its out-of-sample predictive power, and the competitive level of the training set's variance that it explains.
The next example investigates the implications of the universal approximation and universal reconfiguration properties of reconfigurations in the controlled environment provided by simulation studies.
3.2 Simulation Studies: Investigation of Universal Properties
These simulation studies focus on illustrating the universal approximation and universal reconfiguration properties of reconfigurations, and thus of the NEU meta-algorithms, through the lens of regression analysis. In these simulation studies, the data will be generated according to the model

(3.3)  $y_n = f(x_n) + \epsilon_n$,

where the $x_n$ are the input points, the $\epsilon_n$ are i.i.d. mean-zero noise terms, and $f$ is a nonlinear function. Three nonlinear functions, described next, will be investigated.
The first function investigates how well NEU-OLS can approximate nonlinear functions whose global shape, unlike that of polynomials or periodic functions, cannot be determined from local data. The second evaluates how well NEU-OLS can deal with functions oscillating at non-constant speeds. The third examines how well the NEU-OLS algorithm can approximate functions with discontinuities.
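Data of the form (3.3) can be simulated as below. The jump function `f` here is a hypothetical placeholder chosen for illustration only; it is not one of the three functions studied in the paper:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical nonlinear function with a jump discontinuity; a placeholder,
# not one of the three functions used in the paper's simulation studies.
def f(x):
    return np.where(x < 0.5, np.sin(2 * np.pi * x), 1.5 + 0.5 * x)

n = 300
x = np.sort(rng.uniform(0.0, 1.0, n))
eps = 0.1 * rng.normal(size=n)           # i.i.d. noise term of model (3.3)
y = f(x) + eps

# Normalize both coordinates to the unit square, as in the simulation setup.
x01 = (x - x.min()) / (x.max() - x.min())
y01 = (y - y.min()) / (y.max() - y.min())
```

The min-max rescaling at the end mirrors the normalization to the unit square used in these studies to keep the three examples comparable.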
The NEU-OLS algorithm will be benchmarked against two standard nonparametric regression algorithms: penalized smoothing-spline regression (p-splines) and Locally Weighted Scatterplot Smoothing (LOESS). Smoothing-spline regression is a highly flexible approximation method. A smoothing spline is a twice continuously differentiable function constructed by gluing together a finite number of cubic polynomials, at most equal to the number of data points. The optimal p-spline, denoted here by $\hat{f}_{\lambda}$, is chosen by minimizing the objective function

$$\sum_{n=1}^{N} \left(y_n - \hat{f}(x_n)\right)^2 + \lambda \int \left(\hat{f}''(t)\right)^2 \, dt,$$

where the $y_n$ are real numbers, $\hat{f}$ is a suitable twice continuously differentiable function, $\lambda \geq 0$, and the pairs $(x_n, y_n)$ are generated according to the model described in Equation (3.3).
The value of the tuning parameter $\lambda$ determines how smooth $\hat{f}_{\lambda}$ is and how well it interpolates the data points $(x_n, y_n)$. If $\lambda = 0$ and there are enough knots, then $\hat{f}_{\lambda}$ interpolates the data. Conversely, as $\lambda$ approaches infinity, the penalty forces the curvature $\hat{f}''$ to become small, and $\hat{f}_{\lambda}$ approaches the solution of an ordinary linear regression (see [14, Chapter 5] for details on smoothing splines and p-splines). Unlike smoothing splines, p-splines do not require a knot at every point and are therefore less susceptible to overfitting. The tuning parameters, including $\lambda$, will be chosen by $k$-fold cross-validation.
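A quick way to see the role of the smoothing penalty is SciPy's `UnivariateSpline`, whose parameter `s` budgets the residual sum of squares and so plays a role analogous to $\lambda$ (s = 0 forces interpolation, larger s yields smoother fits); this is a different parameterization of the same trade-off, not the paper's estimator:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(5)
x = np.linspace(0.0, 1.0, 60)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=60)

# s bounds the residual sum of squares: s=0 forces interpolation,
# larger s permits (and therefore produces) a smoother fit.
interp = UnivariateSpline(x, y, s=0.0)   # passes through every data point
smooth = UnivariateSpline(x, y, s=2.0)   # heavily smoothed fit

resid_interp = float(np.sum((interp(x) - y) ** 2))
resid_smooth = float(np.sum((smooth(x) - y) ** 2))
```

The interpolating spline drives its training residual to zero while the smoothed fit deliberately leaves residual error, mirroring the $\lambda = 0$ versus large-$\lambda$ regimes described above.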
LOESS is a nonparametric regression method in which a smooth polynomial is fit to the data. The best-fitting polynomial, denoted by $\hat{p}$, is found by minimizing the value of the loss function

$$\sum_{n=1}^{N} w(d_n) \left(y_n - \hat{p}(x_n)\right)^2,$$

where $d_n$ is the distance of the point $(x_n, y_n)$ to the polynomial and $w$ is a decreasing weight function. Unlike classical regression problems, the LOESS objective does not only consider the pairs $(x_n, y_n)$ themselves, but incorporates the importance of nearby points: the closest point on $\hat{p}$ to $(x_n, y_n)$ need not be $\hat{p}(x_n)$, but may be a neighboring point on the curve. The degree of the polynomial is chosen using cross-validation (see [4] for details).
For each simulation, observations will be generated on an interval and the data then normalized to the unit square, for uniformity across the three examples. The models' tuning parameters will be estimated by cross-validation, or by early stopping in the case of NEU-OLS, on a subset of data points sampled in a stratified manner from evenly spaced subintervals. The remaining sample points serve as the test set. The runtimes reported in these simulation studies use the same machine specifications as those used for Table 4.
Example 3.3 (Simulation Study: NEU-OLS and Non-Locality).
In this simulation study, NEU-OLS will be compared against LOESS and smoothing-spline regression when the function $f$ in Equation (3.3) is assumed to be
(3.4) 
This simulation study was performed by generating