1 Introduction
Nonlinear principal component analysis [1, 2, 3] is a nonlinear generalization of standard principal component analysis (PCA).
While PCA is restricted to linear components, nonlinear PCA generalizes the principal components from straight lines to curves and hence describes the inherent structure of the data by curved subspaces.
Detecting and describing nonlinear structures is especially important for analysing time series. Nonlinear PCA is therefore frequently used to investigate the dynamics of different natural processes [4, 5, 6]. But validating the model complexity of nonlinear PCA is a difficult task [7].
Overfitting can be caused by the often limited number of available samples; moreover, in nonlinear PCA overfitting can also be caused by the intrinsic geometry of the data, as shown in Fig. 5, a problem that cannot be solved by increasing the number of samples.
A good control of the complexity of the nonlinear PCA model is required.
We have to find the optimal flexibility of the curved components.
A component with too little flexibility, an almost linear component, cannot follow the complex curved trajectory of real data.
By contrast, an overly flexible component fits nonrelevant noise in the data (overfitting) and hence gives a poor approximation of the original process, as illustrated in Fig. 1A.
The objective is to find a model whose complexity is neither too small nor too
large.
Even though the term nonlinear PCA (NLPCA) often refers to
the autoassociative neural network approach, there are many other methods which
visualise data and extract components in a nonlinear manner [8].
Locally linear embedding (LLE) [9, 10]
and Isomap [11]
visualise high-dimensional data by projecting (embedding) them into a two- or three-dimensional space.
Principal curves [12] and self-organising maps (SOM) [13] describe data by nonlinear curves and nonlinear planes up to two dimensions. Kernel PCA [14] as a kernel approach can be used to visualise data and for noise reduction [15]. In [16] linear subspaces of PCA are replaced by manifolds, and in [17] a neural network approach is used for nonlinear mapping. This work is focused on the autoassociative neural network approach to nonlinear PCA and its model validation problem.

For supervised methods, a standard validation technique is cross-validation. But even though the neural network architecture used is supervised, nonlinear PCA itself is an unsupervised method that requires validation techniques different from those used for supervised methods. A common approach for validating unsupervised methods is to validate the robustness of the components under moderate modifications of the original data set, e.g., by using resampling bootstrap [18] or by corrupting the data with a small amount of Gaussian noise [19]. In both techniques, the motivation is that reliable components should be robust and stable against small random modifications of the data. In principle, these techniques could be adapted to nonlinear methods, but there would be the difficulty of measuring the robustness of nonlinear components. Robustness of linear components is measured by comparing their directions under slightly different conditions (resampled data sets or different noise injections). But since comparing the curvature of nonlinear components is no trivial task, nonlinear methods require other techniques for model validation.
In a similar neural network based nonlinear PCA model, termed nonlinear factor analysis (NFA) [20], a Bayesian framework is used in which the weights and inputs are described by posterior probability distributions, which leads to a good regularisation. While in such Bayesian learning the inputs (components) are explicitly modelled by Gaussian distributions, the maximum likelihood approach in this work attempts to find a single set of values for the network weights and inputs. A weight-decay regulariser is used to control the model complexity. There have been several attempts at model selection in autoassociative nonlinear PCA. Some are based on a criterion of how well the local neighbour relations are preserved by the nonlinear PCA transformation [21]. In [22], a nearest neighbour inconsistency term that penalises complex models is added to the error function, but standard test set validation is used for model preselection. In [23] an alternative network architecture is proposed to solve the problems of overfitting and nonuniqueness of nonlinear PCA solutions. Here we consider a natural approach that validates the model by its own ability to estimate missing data. Such missing data validation is used, e.g., for validating linear PCA models [24], and for comparing probabilistic nonlinear PCA models based on Gaussian processes [25]. Here, the missing data validation approach is adapted to validate the autoassociative neural network based nonlinear PCA.

2 The test set validation problem
To validate supervised methods, the standard approach is to use an independent test set for controlling the complexity of the model. This can be done either by using a new data set, or, when the number of samples is limited, by performing cross-validation by repeatedly splitting the original data into a training and a test set. The idea is that only the model which best represents the underlying process can provide optimal results on new data previously unknown to the model. But test set validation only works well when there exists a clear target value (e.g., class labels), as in supervised methods; it fails for unsupervised methods. In the same way that a test data set cannot be used to validate the optimal number of components in standard linear PCA, test data also cannot be used to validate the curvature of components in nonlinear PCA [7]. Even though nonlinear PCA can be performed by using a supervised neural network architecture, it is still an unsupervised method and hence should not be validated by using cross-validation. With increasing complexity, nonlinear PCA is able to provide a curved component with better data space coverage. Thus, even test data can be projected onto the (overfitted) curve with a decreased distance and hence give an incorrectly small error. This effect is illustrated in Fig. 1 using 10 training and 200 test samples generated from a quadratic function plus Gaussian noise. The mean square error (MSE) is given by the mean of the squared distances between the data points and their projections onto the curve. The overfitted and the well-fitted or ideal model are compared by using the same test data set. It turns out that the test error of the true original model (Fig. 1C) is almost three times larger than the test error of the overly complex model (Fig. 1B), which overfits the data. Test set validation clearly favours the overfitted model over the correct model, and hence fails to validate nonlinear PCA.

To understand this contradiction, we have to distinguish between an error in supervised learning and the fulfilment of specific criteria in unsupervised learning. Test set validation works well for supervised methods because we measure the error as the difference from a known target (e.g., class labels). Since in unsupervised methods the target (e.g., the correct component) is unknown, we optimize a specific criterion. In nonlinear PCA the criterion is to project the data by the shortest distance onto a curve. But a more complex, overfitted curve covers more data space and hence can achieve a smaller error even on test data than the true original curve.
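This effect can be reproduced with a small numerical sketch. It is illustrative only, not the paper's implementation: polynomial curves stand in for the nonlinear components, and the test error is the mean squared projection distance onto each curve, approximated by a dense grid over the curve parameter.

```python
import numpy as np

rng = np.random.default_rng(0)

def projection_mse(coeffs, X, t_grid):
    """MSE of projecting 2-D points onto the curve (t, polyval(coeffs, t))."""
    curve = np.column_stack([t_grid, np.polyval(coeffs, t_grid)])
    # squared distance of every point to every curve sample; take the minimum
    d2 = ((X[:, None, :] - curve[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()

# 10 training and 200 test samples from a quadratic function plus noise
def sample(n):
    x = rng.uniform(-1, 1, n)
    return np.column_stack([x, x**2 + 0.1 * rng.normal(size=n)])

train, test = sample(10), sample(200)

true_model = np.array([1.0, 0.0, 0.0])             # the true curve y = x^2
overfit = np.polyfit(train[:, 0], train[:, 1], 9)  # degree-9 interpolation

t = np.linspace(-1.5, 1.5, 2000)
mse_true = projection_mse(true_model, test, t)
mse_over = projection_mse(overfit, test, t)
```

Because the wiggly degree-9 curve covers more of the data space, test points tend to find a nearby position on it, so the unconstrained projection error does not reliably expose the overfitting.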
3 The nonlinear PCA model
Nonlinear PCA (NLPCA) can be performed by using a multi-layer perceptron (MLP) of an autoassociative topology, also known as an autoencoder, replicator network, bottleneck, or sandglass type network, see Fig. 2.

The autoassociative network performs an identity mapping. The output x̂ is forced to approximate the input x by minimising the squared reconstruction error E = ‖x̂ − x‖². The network can be considered as consisting of two parts: the first part represents the extraction function Φ_extr : x → z, whereas the second part represents the inverse function, the generation or reconstruction function Φ_gen : z → x̂. A hidden layer in each part enables the network to perform nonlinear mapping functions. By using additional units in the component layer in the middle, the network can be extended to extract more than one component. Ordered components can be achieved by using a hierarchical nonlinear PCA [26].
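As a rough illustration of such an autoassociative network, the following minimal NumPy sketch trains a 2-4-1-4-2 bottleneck network by plain gradient descent on toy 2-D data. The architecture, learning rate, and data are illustrative assumptions, not the settings used in this work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 2-D samples scattered around a curved 1-D structure.
t = rng.uniform(-1, 1, size=(200, 1))
X = np.hstack([t, t**2]) + 0.05 * rng.normal(size=(200, 2))

# Autoassociative 2-4-1-4-2 network: a bottleneck of one unit in the middle.
sizes = [2, 4, 1, 4, 2]
W = [rng.normal(0, 0.5, (a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(n) for n in sizes[1:]]

def forward(X):
    acts = [X]
    for i in range(len(W)):
        z = acts[-1] @ W[i] + b[i]
        acts.append(z if i == len(W) - 1 else np.tanh(z))  # linear output layer
    return acts

mse0 = np.mean((forward(X)[-1] - X) ** 2)   # error before training

lr = 0.2
for _ in range(5000):
    acts = forward(X)
    delta = (acts[-1] - X) / len(X)          # gradient of the squared error
    for i in reversed(range(len(W))):
        gW, gb = acts[i].T @ delta, delta.sum(axis=0)
        if i > 0:
            delta = (delta @ W[i].T) * (1 - acts[i] ** 2)  # tanh derivative
        W[i] -= lr * gW
        b[i] -= lr * gb

mse = np.mean((forward(X)[-1] - X) ** 2)     # error after training
```

The trained network forces the output to approximate the input through the one-unit bottleneck, so the bottleneck activations play the role of the nonlinear component values.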
For the proposed validation approach, we have to adapt nonlinear PCA to be able to estimate missing data. This can be done by using an inverse nonlinear PCA model [27] which optimises the generation function Φ_gen by using only the second part of the autoassociative neural network. Since the extraction mapping Φ_extr is lost, we have to estimate both the weights w and the inputs z, which represent the values of the nonlinear component. Both w and z can be optimised simultaneously to minimise the reconstruction error, as shown in [27].
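A minimal sketch of this inverse model, under illustrative assumptions (one hidden layer, plain gradient descent, toy parabola data): only the generation part z → x̂ is kept, the latent inputs z are optimised together with the weights, and the error is computed only over the observed (non-missing) entries, which is what makes missing data estimation possible.

```python
import numpy as np

rng = np.random.default_rng(1)

# Data on a parabola in 2-D; mark some entries as missing (NaN).
t_true = rng.uniform(-1, 1, size=(100, 1))
X = np.hstack([t_true, t_true**2]) + 0.05 * rng.normal(size=(100, 2))
X[rng.random(X.shape) < 0.1] = np.nan

M = ~np.isnan(X)                      # observed-value mask
Xf = np.nan_to_num(X)                 # NaNs replaced by 0; masked out below

# Generation network z -> hidden -> x_hat (the second half of the autoencoder).
h = 5
W1, b1 = rng.normal(0, 0.5, (1, h)), np.zeros(h)
W2, b2 = rng.normal(0, 0.5, (h, 2)), np.zeros(2)
z = rng.normal(0, 0.1, (100, 1))      # latent inputs, optimised like weights

def generate(z):
    H = np.tanh(z @ W1 + b1)
    return H, H @ W2 + b2

lr, n_obs = 0.1, M.sum()
losses = []
for _ in range(4000):
    H, Y = generate(z)
    R = (Y - Xf) * M                  # error only on observed entries
    losses.append((R**2).sum() / n_obs)
    dY = 2 * R / n_obs
    gW2, gb2 = H.T @ dY, dY.sum(0)
    dH = (dY @ W2.T) * (1 - H**2)     # backprop through tanh layer
    gW1, gb1 = z.T @ dH, dH.sum(0)
    gz = dH @ W1.T                    # gradient with respect to the inputs z
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1
    z  -= lr * gz

# Missing values are estimated by the model output at the NaN positions.
X_est = np.where(M, X, generate(z)[1])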
The complexity of a model can be controlled by a weight-decay penalty term [28] added to the error function: E_total = E + ν Σ_i w_i², where the w_i are the network weights. By varying the coefficient ν, the impact of the weight-decay term can be changed, and hence we modify the complexity of the model, which defines the flexibility of the component curves in nonlinear PCA.
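With reconstruction error E, weights w_i, and coefficient ν, the penalised error can be written out explicitly; the helper below (names are illustrative, not from this work) makes the bookkeeping concrete.

```python
import numpy as np

def weight_decay_error(E_reconstruction, weights, nu):
    """E_total = E + nu * sum_i w_i^2 -- a larger nu forces smaller weights,
    i.e. flatter, less flexible component curves."""
    return E_reconstruction + nu * sum(np.sum(w**2) for w in weights)

# Example: E = 1.0, a single 2x2 weight matrix of ones, nu = 0.1
# -> penalty 0.1 * 4 = 0.4, total 1.4
E_total = weight_decay_error(1.0, [np.ones((2, 2))], 0.1)
```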
4 The missing data validation approach
Since classical test set validation fails to select the optimal nonlinear PCA model, as illustrated in Fig. 1, I propose to evaluate the complexity of a model by using the error in missing
data estimation as the criterion for model selection.
This requires adapting nonlinear PCA for missing data, as done in the inverse nonlinear PCA model [27].
The following model selection procedure can be used to find the optimal weight-decay complexity parameter of the nonlinear PCA model:
1. Choose a specific value of the weight-decay complexity parameter.
2. Apply inverse nonlinear PCA to a training data set.
3. Validate the nonlinear PCA model by its performance on missing data estimation on
an independent test set in which one or more elements of each sample are randomly removed.
The mean of the squared errors
between the randomly removed values and their estimations by the nonlinear PCA model is used as the validation or generalization error.
Applied over a range of different weight-decay complexity parameters, the optimal model complexity is given by the lowest missing value estimation error.
To get a more robust result, for each complexity setting nonlinear PCA can be applied repeatedly by using different weight initializations of the neural network. The median of the resulting errors can then be used for validation, as shown in the following examples.
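The three steps above can be sketched as follows. The sketch is only an analogy, not the paper's implementation: a ridge-penalised polynomial stands in for the weight-decay-regularised network, the removed second value of each test sample is predicted from the remaining first value, and the repeats re-fit on fresh training sets rather than re-initialising network weights; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

def make_data(n):
    # samples from a quadratic function plus Gaussian noise
    x = rng.uniform(-1, 1, n)
    return np.column_stack([x, x**2 + 0.1 * rng.normal(size=n)])

def fit(train, penalty, degree=9):
    # step 2 stand-in: penalised polynomial fit; the ridge penalty plays
    # the role of the weight-decay coefficient controlling flexibility
    A = np.vander(train[:, 0], degree + 1)
    return np.linalg.solve(A.T @ A + penalty * np.eye(degree + 1),
                           A.T @ train[:, 1])

def missing_error(coeffs, test):
    # step 3: the second value of each test sample is treated as missing
    # and predicted from the first (remaining) value
    pred = np.polyval(coeffs, test[:, 0])
    return np.mean((pred - test[:, 1]) ** 2)

test = make_data(1000)
penalties = [1e-6, 1e-4, 1e-2, 1e0, 1e2]   # step 1: candidate complexities
errors = [np.median([missing_error(fit(make_data(20), p), test)
                     for _ in range(10)])   # median over repeated fits
          for p in penalties]
best = penalties[int(np.argmin(errors))]
```

The selected complexity is simply the candidate with the lowest median missing value estimation error.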
5 Validation examples
The first example of a nonlinear data set shows that model validation based on missing data estimation performance provides a clear optimum of the complexity parameter. The second example demonstrates that the proposed validation ensures that nonlinear PCA does not describe data in a nonlinear way when the inherent data structure is, in fact, linear.
5.1 Helix data
The nonlinear data set consists of data that lie on a one-dimensional manifold, a helical loop, embedded in three dimensions, plus Gaussian noise, as illustrated in Fig. 3. The samples were generated from a uniformly distributed factor t over the range [−0.8, 0.8], where t represents the angle: x₁ = sin(πt), x₂ = cos(πt), x₃ = t. Nonlinear PCA is applied by using a 1-10-3 network architecture optimized in 5,000 iterations by using the conjugate gradient descent algorithm [29].
To evaluate different weight-decay complexity parameters, nonlinear PCA is applied to 20 complete samples generated from the helical loop function and validated by using a missing data set of 1,000 incomplete samples in which one of the three values per sample is randomly removed. The removed value can be estimated easily from the other two dimensions only when the nonlinear component has the correct helical curve.
For comparison with standard test set validation, the same 1,000 (complete) samples are used.
This is repeatedly done 100 times for each model complexity with newly generated data each time. The median of missing data estimation over all 100 runs is finally taken to validate a specific model complexity.
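The data setup can be sketched as follows, assuming the helix parameterisation x = (sin(πt), cos(πt), t) and an illustrative noise level of 0.1 (the exact value is not stated here):

```python
import numpy as np

rng = np.random.default_rng(1)

def make_helix(n, noise=0.1):
    # helical loop in 3-D, driven by a uniformly distributed factor t
    t = rng.uniform(-0.8, 0.8, n)
    X = np.column_stack([np.sin(np.pi * t), np.cos(np.pi * t), t])
    return X + noise * rng.normal(size=X.shape)

X_test = make_helix(1000)

# incomplete validation set: remove one randomly chosen value per sample
X_miss = X_test.copy()
miss_idx = rng.integers(0, 3, size=len(X_miss))
X_miss[np.arange(len(X_miss)), miss_idx] = np.nan
```

Each incomplete sample then keeps two of its three coordinates, from which the removed one must be estimated.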
Fig. 4 shows the results of comparing the proposed model selection approach with standard test set validation.
It turns out that only the missing data approach is able to show a clear minimum in the performance curve. Test set validation, by contrast, shows a small error even for very complex (overfitted) models.
This is contrary to our experience with supervised learning,
where the test error becomes large again when the model overfits.
Thus, test set validation cannot be used to determine the optimal model complexity of unsupervised methods.
In contrast, the missing value validation approach reveals a clear optimal setting of the weight-decay coefficient.
5.2 Linear data
Nonlinear PCA can also be used to answer the question of whether high-dimensional observations are driven by a unimodal or a multimodal process, e.g., in atmospheric science for analysing the El Niño-Southern Oscillation [30].
But applying nonlinear PCA can be misleading if the model complexity is insufficiently controlled: multimodality can be incorrectly detected in data that are inherently unimodal, as pointed out by Christiansen [7].
Fig. 5 C & D illustrates that if the model complexity is too high, even linear data is described by nonlinear components.
Therefore, to obtain the right description of the data, controlling the model complexity is very important.
Fig. 5 shows the validation error curves of the standard test set and the proposed missing data validation for different model complexities. The median of 500 differently initialized 1-4-2 networks is plotted. Again, it is shown that standard test set validation fails in validating nonlinear PCA.
With increasing model complexity, classical test set validation shows a decreasing error, and hence favours overfitted models.
By contrast, the missing value estimation error shows correctly that the optimum would be a strong penalty which gives a linear or even a point solution, thereby confirming the absence of nonlinearity in the data.
This is correct because the data consists, in principle, of Gaussian noise centred at
the point (0,0).
While test set validation favours overfitted models which produce components that incorrectly show multimodal distributions, missing data validation confirms the unimodal characteristics of the data. Nonlinear PCA in combination with missing data validation can therefore be used to find out whether a highdimensional data set is generated by a unimodal or a multimodal process.
6 Test set versus missing data approach
In standard test set validation, the nonlinear PCA model is trained using a training data set. An independent test set is then used to compute a validation error as the squared reconstruction error E = ‖x̂ − x‖², where x̂ is the output of the nonlinear PCA given a test sample x as the input. Test set validation thus reconstructs the test data from the test data itself. The problem with this approach is that increasingly complex functions can give approximately x̂ ≈ x, thus favouring complex models.

While test set validation is a standard approach in supervised applications, in unsupervised techniques it suffers from the lack of a known target (e.g., a class label). Highly complex nonlinear PCA models, which overfit the original training data, are in principle also able to fit test data better than would be possible with the true original model. With higher complexity, a model is able to describe a more complicated structure in the data space. Even for new test samples, a short projection distance (error) is more likely onto a curve which covers the data space almost completely than onto a curve of moderate complexity (Fig. 1). The problem is that the data can be projected onto any position on the curve; there is no further restriction in pure test set validation.

In missing data estimation, by contrast, the required position on the curve is fixed, given by the remaining available values of the same sample. The artificially removed value of a test sample gives an exact target which has to be predicted from the available values of that sample. While test set validation predicts the test data from the test data itself, missing data validation predicts removed values from the remaining values of the same sample. Thus, we transform the unsupervised validation problem into a kind of supervised validation problem.
7 Conclusion
In this paper, the missing data validation approach to model selection is proposed for the autoassociative neural network based nonlinear PCA. The idea behind this approach is that the true generalization error in unsupervised methods is given by a missing value estimation error and not by the classical test set error. The proposed missing value validation approach can therefore be seen as an adaptation of standard test set validation so as to be applicable to unsupervised methods. The absence of a target value in unsupervised methods is compensated for by using artificially removed missing values as target values that have to be predicted from the remaining values of the same sample. It was shown that standard test set validation clearly fails to validate nonlinear PCA, whereas the proposed missing data validation approach was able to validate the model complexity correctly.
Availability of Software
A MATLAB^{®} implementation of nonlinear PCA including the inverse model for estimating missing data is available at:
http://www.NLPCA.org/matlab.html
An example of how to apply the proposed validation approach can be found at:
http://www.NLPCA.org/validation.html
References
 Kramer [1991] M. A. Kramer. Nonlinear principal component analysis using autoassociative neural networks. AIChE Journal, 37(2):233–243, 1991.
 DeMers and Cottrell [1993] D. DeMers and G. W. Cottrell. Nonlinear dimensionality reduction. In D. Hanson, J. Cowan, and L. Giles, editors, Advances in Neural Information Processing Systems 5, pages 580–587, San Mateo, CA, 1993. Morgan Kaufmann.
 Hecht-Nielsen [1995] R. Hecht-Nielsen. Replicator neural networks for universal optimal source coding. Science, 269:1860–1863, 1995.
 Hsieh et al. [2006] W. W. Hsieh, A. Wu, and A. Shabbar. Nonlinear atmospheric teleconnections. Geophysical Research Letters, 33(7):L07714, 2006.
 Herman [2007] A. Herman. Nonlinear principal component analysis of the tidal dynamics in a shallow sea. Geophysical Research Letters, 34:L02608, 2007.
 Scholz and Fraunholz [2008] M. Scholz and M. J. Fraunholz. A computational model of gene expression reveals early transcriptional events at the subtelomeric regions of the malaria parasite, Plasmodium falciparum. Genome Biology, 9(R88), 2008. doi: 10.1186/gb-2008-9-5-r88.
 Christiansen [2005] B. Christiansen. The shortcomings of nonlinear principal component analysis in identifying circulation regimes. J. Climate, 18(22):4814–4823, 2005.

 Gorban et al. [2007] A. N. Gorban, B. Kégl, D. C. Wunsch, and A. Zinovyev. Principal Manifolds for Data Visualization and Dimension Reduction, volume 58 of LNCSE. Springer Berlin Heidelberg, 2007.
 Roweis and Saul [2000] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.

 Saul and Roweis [2004] L. K. Saul and S. T. Roweis. Think globally, fit locally: Unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research, 4(2):119–155, 2004.
 Tenenbaum et al. [2000] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
 Hastie and Stuetzle [1989] T. Hastie and W. Stuetzle. Principal curves. Journal of the American Statistical Association, 84:502–516, 1989.
 Kohonen [2001] T. Kohonen. Self-Organizing Maps. Springer, 3rd edition, 2001.

 Schölkopf et al. [1998] B. Schölkopf, A. J. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998.
 Mika et al. [1999] S. Mika, B. Schölkopf, A. J. Smola, K.-R. Müller, M. Scholz, and G. Rätsch. Kernel PCA and de-noising in feature spaces. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11, pages 536–542. MIT Press, 1999.
 Girard and Iovleff [2005] S. Girard and S. Iovleff. Autoassociative models and generalized principal component analysis. J. Multivar. Anal., 93(1):21–39, 2005. doi: 10.1016/j.jmva.2004.01.006.
 Demartines and Herault [1997] P. Demartines and J. Herault. Curvilinear component analysis: A self-organizing neural network for nonlinear mapping of data sets. IEEE Transactions on Neural Networks, 8(1):148–154, 1997.
 Efron and Tibshirani [1994] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall, New York, 1994.
 Harmeling et al. [2004] S. Harmeling, F. Meinecke, and K.R. Müller. Injecting noise for analysing the stability of ICA components. Signal Processing, 84:255–266, 2004.
 Honkela and Valpola [2005] A. Honkela and H. Valpola. Unsupervised variational Bayesian learning of nonlinear models. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17 (NIPS'04), pages 593–600, 2005.
 Chalmond and Girard [1999] B. Chalmond and S. C. Girard. Nonlinear modeling of scattered multivariate data and its application to shape change. IEEE Trans. Pattern Anal. Mach. Intell., 21(5):422–432, 1999.
 Hsieh [2007] W. W. Hsieh. Nonlinear principal component analysis of noisy data. Neural Networks, 20(4):434–443, 2007. doi: 10.1016/j.neunet.2007.04.018.
 Lu and Pandolfo [2011] B.-W. Lu and L. Pandolfo. Quasi-objective nonlinear principal component analysis. Neural Networks, 24(2):159–170, 2011. doi: 10.1016/j.neunet.2010.10.001.
 Ilin and Raiko [2010] A. Ilin and T. Raiko. Practical approaches to principal component analysis in the presence of missing values. Journal of Machine Learning Research, 11:1957–2000, 2010.
 Lawrence [2005] N. D. Lawrence. Probabilistic nonlinear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research, 6:1783–1816, 2005.
 Scholz and Vigário [2002] M. Scholz and R. Vigário. Nonlinear PCA: a new hierarchical approach. In M. Verleysen, editor, Proceedings ESANN, pages 439–444, 2002.
 Scholz et al. [2005] M. Scholz, F. Kaplan, C. L. Guy, J. Kopka, and J. Selbig. Nonlinear PCA: a missing data approach. Bioinformatics, 21(20):3887–3895, 2005.
 Hinton [1987] G. E. Hinton. Learning translation invariant recognition in massively parallel networks. In Proceedings of the Conference on Parallel Architectures and Languages Europe (PARLE), pages 1–13, 1987.
 Hestenes and Stiefel [1952] M. R. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards, 49(6):409–436, 1952.
 Hsieh [2001] W. W. Hsieh. Nonlinear principal component analysis by neural networks. Tellus A, 53(5):599–615, 2001.