I Introduction
The intrinsic dimensionality (ID) of data is a key piece of prior knowledge in pattern recognition and statistics, used to improve the performance of methods in areas such as time series analysis, classification and neural networks. In time series analysis [1], the domain of attraction of a nonlinear dynamic system has a very complex geometric structure, and the study of the geometry of the attraction domain is closely related to fractal geometry. Fractal dimension is an important tool to characterize certain geometric properties of complex sets. In neural network design [2], the number of hidden units in the encoding middle layer should be chosen according to the ID of the data. In classification tasks [3], in order to balance the generalization ability against the empirical risk, the complexity of the decision function should also be related to the ID of the data.

Recently, manifold learning, an important approach for nonlinear dimensionality reduction, has drawn great interest. Important manifold learning algorithms include isometric feature mapping (Isomap) [4], locally linear embedding (LLE) [5] and Laplacian eigenmaps (LE) [6]. They all assume that data lie on an intrinsically low-dimensional submanifold [7] and reduce the dimensionality of data by investigating its intrinsic structure. However, all manifold learning algorithms require the ID of the data as a key parameter for implementation.
Previous ID estimation methods can be categorized mainly into three groups: the projection approach, the geometric approach and the probabilistic approach. The projection approach [9, 10, 11] finds the ID by examining a low-dimensional embedding of the data. The geometric approach [22] finds the ID by investigating the intrinsic geometric structure of the data. The probabilistic approach [19] builds estimators by making distribution assumptions on the data. These approaches are briefly introduced in Section II.
In this paper, we propose a new PCA-based method for ID estimation, which we call the CPCA method. The proposed method first finds a minimal cover of the data set, and each subset in the cover is considered as a small subregion of the data manifold. Then, on each subset, a revised PCA procedure is applied to examine the local structure. The revised PCA procedure can filter out noise in the data and leads to a stable and convergent ID estimate as the subregion size increases, as shown by the experimental results. This is an advantage over the traditional PCA method, which is very sensitive to noise, outliers and the choice of the subregion size. Further analysis shows that the revised PCA procedure can efficiently reduce the running time and utilize all data samples for ID estimation. We remark that our ID estimation method is also applicable to incremental learning on consecutively arriving data. Our method is compared with the maximum likelihood estimation (MLE) method
[19], the manifold adaptive method (referred to as the k-NN method in this paper) [18] and the k-nearest neighbor graph (kNNG) method [26, 27] through experiments.

The rest of the paper is organized as follows. In Section II, previous ID estimation methods are briefly reviewed. In Section III, the new ID estimation method (CPCA) is introduced. In Section IV, experiments are conducted on synthetic and real world data sets to show the effectiveness of the proposed algorithm. Conclusions are drawn in Section V.
II Previous algorithms on ID estimation
There are mainly three approaches to estimating the ID of data: the projection, geometric and probabilistic approaches.
The projection approach first projects data into a low-dimensional space and then determines the ID by examining the low-dimensional representation of the data. PCA is a classical projection method which finds the ID by counting the number of significant eigenvalues. However, the traditional PCA only works on data lying in a linear subspace and becomes ineffective when the data distribute on a nonlinear manifold. To overcome this limitation, local PCA [9] and OTPMs-based PCA [10] have been proposed; they can discover the ID of data lying on nonlinear manifolds by performing PCA locally. The Isomap algorithm yields the ID of data by inspecting the elbow of the residual variance curve [4]. Cheng et al. gave an efficient procedure to compute the eigenvalues and eigenvectors in PCA [24].

Geometric approaches make use of the geometric structure of data to build ID estimators. Fractal-based methods have been well developed and used in time series analysis. For example, the correlation dimension (a kind of fractal dimension) was used in [13] to estimate the ID, whilst the method of packing numbers was proposed in [14]. Other fractal-based methods include the kernel correlation method [23] and the quantization estimator [21]. A good survey of fractal-based methods can be found in [22]. There are also many methods based on techniques from computational geometry. Lin [15] and Cheng [25] suggested constructing simplices to find the ID, while the nearest neighbor approach uses the distances between data points and their nearest neighbors to build ID estimators, such as the estimator proposed by Pettis et al. [29], the kNNG method [26, 27] and the incising ball method [17]. A comparison of the local PCA method with the method of Pettis et al. was made in [16].
Probabilistic methods are based on probabilistic assumptions on the data and have been tested on various data sets with stable performance. The MLE method [19] is a representative method of this approach. Its local estimator at a point $x_i$ with neighborhood size $k$ is

$$\hat{m}_k(x_i) = \left[\frac{1}{k-1}\sum_{j=1}^{k-1}\log\frac{T_k(x_i)}{T_j(x_i)}\right]^{-1},$$

and the final global estimator is given by averaging the local estimators:

$$\hat{m}_k = \frac{1}{n}\sum_{i=1}^{n}\hat{m}_k(x_i),$$

where $T_j(x_i)$ is the distance between $x_i$ and its $j$th nearest neighbor. MacKay and Ghahramani [20] pointed out that, compared with averaging the local estimators directly, it is more sensible to average their inverses $\hat{m}_k(x_i)^{-1}$ for the maximum likelihood purpose. The recommended final estimator is

$$\hat{m}_k = \left[\frac{1}{n(k-1)}\sum_{i=1}^{n}\sum_{j=1}^{k-1}\log\frac{T_k(x_i)}{T_j(x_i)}\right]^{-1},$$

where $\hat{m}_k$ is the estimated ID corresponding to the neighborhood size $k$.
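As a concrete illustration, the corrected MLE estimator above can be implemented in a few lines. The following is a minimal sketch of the Levina-Bickel estimator with the MacKay-Ghahramani inverse averaging; the function name and the brute-force distance computation are our own choices, not part of the original method.

```python
import numpy as np

def mle_id(X, k):
    """MacKay-Ghahramani corrected Levina-Bickel ID estimate.

    X: (n, D) data matrix; k: neighborhood size (k >= 2).
    """
    # Pairwise Euclidean distances (brute force; fine for small n).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    dist = np.sqrt(d2)
    # Sort each row; column 0 is the point itself (distance 0), so skip it.
    T = np.sort(dist, axis=1)[:, 1:k + 1]  # T[i, j-1] = T_j(x_i)
    # Inverse local estimator: mean of log(T_k / T_j) over j = 1..k-1.
    inv_m = np.log(T[:, -1:] / T[:, :-1]).mean(axis=1)
    # Average the inverses over all points, then invert once.
    return 1.0 / inv_m.mean()
```

On data sampled from a two-dimensional patch, the estimate should be close to 2 for moderate neighborhood sizes.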
III ID estimation using PCA with cover sets: CPCA
Basically, two kinds of definitions of ID are commonly used. One is based on the fractal dimension, such as the Hausdorff dimension and the packing dimension, which are usually positive real numbers. The other is based on the embedding manifold, whose ID is always an integer.
Definition III.1 (Embedding manifold and dimension)
Let $d < D$ and let $\Omega$ be a compact open set in $\mathbb{R}^d$. Assume that $f:\Omega\to\mathbb{R}^D$ is a smooth function. The set $M = f(\Omega)$ is called an embedding manifold, with $d$ its embedding dimension.
More and more real world data have been shown to have nonlinear intrinsic structures and may distribute on nonlinear embedding manifolds [7]. Therefore, estimating the embedding ID of data is an important problem [17]. In this paper, we focus on the estimation of embedding dimensions.
III-A PCA-based methods for ID estimation
The traditional PCA finds a subspace on which the data projections have maximum variance. Given a data set $X = \{x_1,\dots,x_n\}$ with $x_i \in \mathbb{R}^D$, let $\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i$ and $\hat{x}_i = x_i - \bar{x}$. The covariance matrix of $X$ is given by

$$\Sigma = \frac{1}{n}\sum_{i=1}^{n}\hat{x}_i\hat{x}_i^{\top}.$$

Since $\Sigma$ is a positive semi-definite matrix, we can assume that $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_D \geq 0$ are the eigenvalues of $\Sigma$ with $v_1, v_2, \dots, v_D$ the corresponding orthonormal eigenvectors. The eigen-decomposition of $\Sigma$ is written as $\Sigma = V\Lambda V^{\top}$, where $\Lambda$ is a diagonal matrix with $\Lambda_{ii} = \lambda_i$ and $V = [v_1, \dots, v_D]$. The eigenvector $v_i$ is the $i$th principal direction (PD) and, for any data vector $x$, $y_i = v_i^{\top}(x - \bar{x})$ is defined as the $i$th principal component (PC). By the definition, we have the variance $\mathrm{var}(y_i) = \lambda_i$ and the covariance $\mathrm{cov}(y_i, y_j) = 0$ for $i \neq j$.
If the data set distributes on a linear subspace of dimension $d$, then the first $d$ PDs should span the subspace and the corresponding PCs account for most of the variation contained in $X$. On the other hand, the variance of the PCs on the remaining PDs (i.e., the PDs orthogonal to the linear subspace of dimension $d$) will be trivial. The most commonly used criteria for ID estimation with the PCA method are

(1)  $\dfrac{\lambda_{d+1}}{\lambda_1} < \alpha$

and the percentage of the accounted variance

(2)  $\dfrac{\sum_{i=1}^{d}\lambda_i}{\sum_{i=1}^{D}\lambda_i} \geq \beta$,

where $\alpha$ is a small threshold and $\beta$ is close to $1$. In this paper, the ID, $d$, is determined as the smallest integer for which condition (1) or (2) is satisfied.
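The two criteria can be sketched as follows. This is an illustrative implementation, and the threshold values `alpha` and `beta` here are our own illustrative defaults, not the settings used in the paper.

```python
import numpy as np

def pca_id(X, alpha=0.05, beta=0.95):
    """Estimate the ID as the smallest d satisfying criterion (1) or (2)."""
    Xc = X - X.mean(axis=0)                       # center the data
    lam = np.linalg.eigvalsh(Xc.T @ Xc / len(X))  # ascending eigenvalues
    lam = lam[::-1]                               # descending order
    total = lam.sum()
    for d in range(1, len(lam)):
        if lam[d] / lam[0] < alpha:               # criterion (1)
            return d
        if lam[:d].sum() / total >= beta:         # criterion (2)
            return d
    return len(lam)
```

For data confined to a linear subspace, the trailing eigenvalues vanish and the estimate equals the subspace dimension.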
III-B Filtering out the noise of data
There are two challenges for PCA-based ID estimation methods. The first is how to filter out the noise in the data, while the second is how to choose the size of the subregions on the manifold. Previously, ID estimates obtained with PCA-based methods always increased with the size of the subregions, so the methods could not converge to a stable ID estimate. To address these two limitations, we propose the following noise filtering procedure, which can efficiently filter out the noise in data and makes PCA-based methods converge.
Consider the effect of additive white noise $\epsilon$ with zero mean and covariance $\sigma^2 I$, i.e., $x' = x + \epsilon$ with $\epsilon$ independent of $x$. The covariance matrix of the noise-corrupted data is given by

$$\Sigma' = \Sigma + \sigma^2 I,$$

where $\Sigma$ is the covariance matrix of the clean data $X$. It can be seen that the PDs of $\Sigma'$ are identical to those of $\Sigma$, and the eigenvalues of $\Sigma'$ are $\lambda_i + \sigma^2$. If $\sigma^2$ is relatively large, then the ID criteria (1) and (2) will be ineffective.
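This eigenvalue shift is easy to verify numerically. The following small demonstration (our own construction, with an assumed noise level `sigma`) corrupts data lying on a 2-D subspace of a 5-D space with white noise and compares the eigenvalues before and after.

```python
import numpy as np

rng = np.random.default_rng(0)
# Data on a 2-D subspace of R^5, then corrupted by white noise.
X = rng.standard_normal((20000, 2)) @ rng.standard_normal((2, 5))
sigma = 0.3
Xn = X + sigma * rng.standard_normal(X.shape)

def eigs(A):
    """Eigenvalues of the sample covariance, in descending order."""
    return np.sort(np.linalg.eigvalsh(np.cov(A.T)))[::-1]

lam, lam_n = eigs(X), eigs(Xn)
# Every eigenvalue is shifted up by roughly sigma^2 = 0.09; in particular
# the trailing (noise-only) eigenvalues sit near sigma^2 itself.
print(lam_n - lam)
```

In particular, the three trailing eigenvalues of the noisy data are close to $\sigma^2$, which is exactly the quantity the filtering procedure below tries to estimate.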
The variance of the data projections on the PDs that are orthogonal to the intrinsic embedding subspace is very small, and most of that variance is produced by noise. Therefore, it is possible to estimate the variance of the noise by projecting the data on these orthogonal PDs. Given a real number $\rho$ close to $1$, the noise part of the data is taken to be the projections on the trailing PDs $v_{p+1},\dots,v_D$, where $p$ is the smallest integer such that the first $p$ PDs account for a fraction $\rho$ of the total variance. Thus, the variance of the noise contained in the data can be estimated as

(3)  $\hat{\sigma}^2 = \dfrac{1}{D-p}\displaystyle\sum_{i=p+1}^{D}\lambda_i.$

Our new ID estimation criteria apply (1) and (2) to the noise-corrected variances $\lambda_i - \hat{\sigma}^2$ on the PDs.
Remark III.1
Noise is different from outliers. Noise affects every data point independently, while outliers are data points that lie at least a certain distance away from the data points on the manifold. The proposed procedure is robust to both noise and outliers, as shown in the experiments. In contrast, the traditional PCA procedure can handle limited noise but is very sensitive to outliers.
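The noise filtering step can be sketched as follows. This is our own reading of the procedure: the cut-off parameter `rho` and its default value are illustrative assumptions, and the paper's exact settings are not reproduced here.

```python
import numpy as np

def denoise_eigenvalues(lam, rho=0.9):
    """Estimate the white-noise variance from the trailing PCA eigenvalues
    (in the spirit of Eq. (3)) and subtract it from all eigenvalues.

    lam: eigenvalues in descending order; rho: fraction of variance
    attributed to the signal part (illustrative value).
    """
    lam = np.asarray(lam, dtype=float)
    frac = np.cumsum(lam) / lam.sum()
    # p = number of leading PDs accounting for a fraction rho of the variance.
    p = int(np.searchsorted(frac, rho)) + 1
    if p >= len(lam):
        return lam                    # no trailing PDs left for the estimate
    sigma2 = lam[p:].mean()           # estimated noise variance
    return np.maximum(lam - sigma2, 0.0)
```

The corrected eigenvalues can then be fed into criteria (1) and (2) in place of the raw ones.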
III-C The local region selection method
An embedding manifold can be approximated locally by linear subspaces, and the dimensionality of each such subspace should equal the ID of the embedding manifold. Therefore, it is possible to estimate the ID of a nonlinear manifold by examining it locally. A cover is a collection of subsets of the data set whose union contains the whole data set.
Definition III.2 (The set cover problem)
Given a universe $U$ of $n$ elements and a collection $S = \{S_1,\dots,S_m\}$ of subsets of $U$ with $\bigcup_{i} S_i = U$, the set cover problem asks for a minimum subcollection of $S$ that covers all elements of $U$.
Using a minimum cover has two advantages. First, it gives the minimal number of subregions, which saves computational time. Secondly, an ID estimate that utilizes the whole data set is more reliable. However, finding such a minimal cover is an NP-hard problem. In the following, we introduce an algorithm which approximately finds a minimal cover of a data set.
Given a parameter, either an integer $k$ or a real number $\epsilon$, there are two ways to define the neighborhood of a data point $x$:

- The $k$-NN method: any data point that is one of the $k$ nearest data points of $x$ is in the neighborhood of $x$;

- The $\epsilon$-NN method: any data point in the region $\{y : \|y - x\| \leq \epsilon\}$ is in the neighborhood of $x$.
Without loss of generality, we may assume that the index of data points is independent of their locations.
Using the above approximation algorithm, a cover of the data set can be found. Compared with the local region selection algorithm used in [9], our algorithm has a lower time complexity and avoids the supervised process of choosing the neighborhoods. Intuitively, the cardinality $|C|$ of the cover is roughly $n/\bar{k}$, where $\bar{k}$ is the average number of neighbors.
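One plausible way to approximate a minimum cover is the standard greedy heuristic for set cover, applied to the k-NN neighborhoods. The sketch below is our own reconstruction under that assumption; the paper's exact selection rule may differ.

```python
import numpy as np

def greedy_knn_cover(X, k):
    """Approximate a minimum cover of the data set by k-NN neighborhoods,
    using the greedy heuristic for set cover."""
    n = X.shape[0]
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    # Each candidate subset: a point together with its k nearest neighbors.
    knn = np.argsort(dist, axis=1)[:, :k + 1]
    uncovered = set(range(n))
    cover = []
    while uncovered:
        # Greedy step: pick the subset covering the most uncovered points.
        c = max(range(n),
                key=lambda i: len(uncovered.intersection(knn[i].tolist())))
        cover.append((c, knn[c]))
        uncovered -= set(knn[c].tolist())
    return cover
```

Since every point belongs to its own neighborhood, the loop always makes progress and terminates with a full cover.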
III-D The proposed ID estimation algorithm
We now present the proposed ID estimation algorithm using local PCA on the minimal set cover: the CPCA algorithms, summarized below for batch and incremental data, respectively.
In many cases, data are collected incrementally. This requires an incremental learning algorithm that can track changes in the data structure over time. The incremental CPCA algorithm is presented as follows.
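To make the overall pipeline concrete, the batch version can be sketched as below: build a cover from k-NN neighborhoods, run PCA on each subset, and average the local integer estimates. This is a simplified sketch under our own assumptions (a basic cover construction, an illustrative eigenvalue threshold `alpha`, and no noise filtering), not the paper's exact algorithm.

```python
import numpy as np

def local_pca_id(Xs, alpha=0.05):
    """Integer ID of one subregion: the number of eigenvalues that are
    significant relative to the largest one (alpha is illustrative)."""
    Xc = Xs - Xs.mean(axis=0)
    lam = np.sort(np.linalg.eigvalsh(Xc.T @ Xc / len(Xs)))[::-1]
    return int(np.sum(lam / lam[0] >= alpha))

def cpca_id(X, k):
    """Batch CPCA sketch: cover the data with k-NN neighborhoods, apply
    PCA locally, and average the local integer ID estimates."""
    n = X.shape[0]
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    knn = np.argsort(dist, axis=1)[:, :k + 1]
    uncovered, local_ids = set(range(n)), []
    while uncovered:
        c = next(iter(uncovered))     # take any uncovered point as a center
        local_ids.append(local_pca_id(X[knn[c]]))
        uncovered -= set(knn[c].tolist())
    return float(np.mean(local_ids))
```

For data on a two-dimensional plane embedded in a higher-dimensional space, the local estimates agree and the averaged estimate is close to 2.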
Remark III.2
Our method differs from the local PCA method [9] in several aspects. First, the centers and the local regions are determined simultaneously by a single parameter, the neighborhood size, whilst in [9] the centers and neighborhood sizes are determined by two separate parameters. Secondly, our approach finds the subregions by approximating a minimum cover of the data set, while the local PCA in [9] does not guarantee that the selected subregions cover the whole data set.
III-E Computational complexity analysis
The computational complexity of our algorithm is one of the most important issues for its application. The batch mode ID estimation can be divided into two parts. In the first part, computing the distance matrix needs $O(Dn^2)$ time, searching the nearest neighbors of every data point needs $O(n^2\log n)$ time, and finding an approximate minimum cover of the data set needs a further cost quadratic in $n$. In the second part, PCA is performed locally on each subset of the cover. The total running time of the batch mode algorithm is therefore dominated by the first part. If the proposed method is embedded in a manifold learning algorithm, the distance matrix and the neighborhoods are already available, and the additional running time reduces to that of the local PCA step. This is a relatively small increase over the time complexity of a manifold learning algorithm, which is typically as high as $O(n^3)$.
For incremental learning, the neighborhood identification step needs time linear in the number of samples per update, whilst the local PCA consumes time independent of the number of samples. Therefore, the total time complexity for incremental learning is linear in the number of samples.
IV Experiments
The proposed algorithm was implemented with the same threshold parameters in criteria (1) and (2) for all the experiments.
In practice, it is found that the noise contained in data is of low dimension, apart from the additive white noise which is assumed to be present in every component of the data vectors. Thus, in practice, we only use the variances of the first few PCs in the noise part of the data to estimate the variance of the noise (see Eq. (3)).

A comparison is made among the k-NN method [18], the kNNG method [26], the revised MLE (MLE in short) method [20], the CPCA method and the LPCA method, where the LPCA method stands for the CPCA method without the noise filtering procedure proposed in Subsection III-B. It should be noted that the results obtained by the MLE, k-NN and kNNG methods are positive real numbers, while the LPCA and CPCA methods produce only integer ID estimates. In order to compare these results, we average the local ID estimates obtained with the CPCA and LPCA methods to provide a real-valued ID estimate:

$$\hat{d} = \frac{1}{|C|}\sum_{j=1}^{|C|}\hat{d}_j,$$

where $\hat{d}_j$ is the integer ID estimated on the $j$th subset of the cover $C$.
IV-A Möbius data
The first data set is a Möbius ring embedded in $\mathbb{R}^3$. Fig. 1(a) shows a scatter plot of the Möbius ring data set. As can be seen, the Möbius data points lie on a highly nonlinear manifold, with the points uniformly distributed on the surface. Fig. 1(b) shows the results obtained by the five ID estimation algorithms against the neighborhood size. The MLE method is the most stable and accurate algorithm for all neighborhood sizes, and all algorithms converge to the correct estimate. It appears that the LPCA method does not diverge on this data set, possibly because the original dimensionality of the data is low.

IV-B Real world data sets
Our algorithm is compared with the MLE, k-NN and kNNG methods on some benchmark real world data sets: the Isoface data set [4], the LLE-face data set [5] and the MNIST '0' and '1' data sets [28].
The Isoface data set is comprised of 698 images of a head with a resolution of 64×64 pixels. Some samples of the Isoface data set are shown in Fig. 2(a). In the experiments, each image is reshaped to a 4096-dimensional vector. It can be seen that the Isoface data set is generated by a three-dimensional movement: up-down pose, left-right pose and lighting changes. In [4], the Isomap algorithm estimated its ID as 3 using the projection approach. As can be seen from Fig. 2(b), the CPCA and MLE estimators remain stable over the whole range of neighborhood sizes. The estimates given by the kNNG and k-NN methods oscillate badly with the neighborhood size, so both are unstable. Since the LPCA method cannot filter out the noise contained in the data, it tends to overestimate the ID as the neighborhood size increases. This means that our noise filtering process plays a key role in the convergence of the CPCA method.
The second data set is the LLE-face data set, which contains 1965 samples in a 560-dimensional space (see Fig. 3(a) for some samples). From Fig. 3(b), it is seen that both the CPCA and the MLE methods give a convergent ID estimate as the neighborhood size increases, while the LPCA, k-NN and kNNG methods do not appear to converge. The ID estimate given by the CPCA method varies only within a small range around its convergent value, while the estimate obtained by the MLE method also changes gradually towards its convergent value.
We now consider two MNIST data sets: the digit '0' set and the digit '1' set (see Fig. 4(a) and Fig. 5(a) for some samples). It can be seen from Fig. 4(b) and Fig. 5(b) that all methods, except the LPCA and kNNG methods, converge as the neighborhood size increases, with the CPCA method converging to a lower estimate than the MLE and k-NN estimators on both data sets. Note that the result given by our method disagrees considerably with the results given by the other methods on these two data sets. A digit '0' is usually rendered as an ellipse, which can be determined by a small number of parameters (its center and its major and minor axes), so the true ID of the data set '0' is likely to be small. A digit '1' can be considered as a line segment rotating from left to right, so a sensible ID estimate for the data set '1' may also be very low.
IV-C Noisy data sets
The traditional PCA algorithm is very sensitive to outliers, and the performance of PCA-based algorithms deteriorates rapidly if the data points are sparse on the manifold, as in the hand rotation data set (CMU database: http://vasc.vi.cmu.edu/idb/html/motion/hand/index.html). As can be seen from Fig. 6(a), the hand undergoes a one-dimensional movement, so the data points can be considered as lying on a one-dimensional curve. Many outliers can be seen in its low-dimensional embedding produced by the Isomap algorithm (see Fig. 6(b)). The ID estimation results of the different methods are shown in Fig. 6(c).
Both the k-NN and kNNG methods are sensitive to the choice of the neighborhood size and tend to overestimate the ID as the neighborhood size increases. The MLE estimator is more stable (see Fig. 6(c)); however, even its minimum estimate is still higher than the true ID of this data set. The LPCA method has the worst performance due to the outliers contained in the data set. The estimate of the CPCA method is the closest to the correct ID of this data set.
We now transform the original Möbius data into a 4-dimensional space using a Euclidean transformation, and add zero-mean random noise to the transformed data. The ID estimation results of the different algorithms are given in Fig. 7. As can be seen, the ID estimate given by the CPCA method is the closest to the correct ID of this noisy Möbius data set; the other algorithms tend to overestimate it. The estimate obtained by the LPCA method is slightly higher than that given by the CPCA method due to the effect of the noise.
V Conclusion
In this paper, we proposed a new ID estimation method based on PCA. The proposed algorithm is simple to implement, gives a convergent ID estimate over a wide range of neighborhood sizes, and is convenient for incremental learning. Experiments have shown that the new algorithm has robust performance.
Acknowledgment
The work of H. Qiao was supported in part by the National Natural Science Foundation (NNSF) of China under grant no. 60675039 and 60621001 and by the Outstanding Youth Fund of the NNSF of China under grant no. 60725310. The work of B. Zhang was supported in part by the 863 Program of China under grant no. 2007AA04Z228, by the 973 Program of China under grant no. 2007CB311002 and by the NNSF of China under grant no. 90820007.
References
 [1] T.M. Buzug, J.V. Stamm, G. Pfister, Characterising experimental time series using local intrinsic dimension, Physics Letters A 202 (1995), 183-190.
 [2] C.M. Bishop, Neural Networks for Pattern Recognition, Oxford Univ. Press, Oxford, 1995.
 [3] J. Friedman, T. Hastie, R. Tibshirani, The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer, Berlin, 2001.
 [4] J.B. Tenenbaum, V. de Silva, J.C. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (2000), 2319-2323.
 [5] S.T. Roweis, L.K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (2000), 2323-2326.
 [6] M. Belkin, P. Niyogi, Laplacian eigenmaps for dimensionality reduction and representation, Neural Computation 15 (2003), 1373-1396.
 [7] H.S. Seung, D.D. Lee, The manifold ways of perception, Science 290 (2000), 2268-2269.
 [8] I.T. Jolliffe, Principal Component Analysis, Springer, Berlin, 1989.
 [9] K. Fukunaga, D.R. Olsen, An algorithm for finding intrinsic dimensionality of data, IEEE Transactions on Computers 20 (1971), 176-183.
 [10] J. Bruske, G. Sommer, Intrinsic dimension estimation with optimally topology preserving maps, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (1998), 572-575.
 [11] T. Hastie, W. Stuetzle, Principal curves, Journal of the American Statistical Association 84 (1988), 502-516.
 [12] S. Chatterjee, M.R. Yilmaz, Chaos, fractals and statistics, Statistical Science 7 (1992), 49-68.
 [13] P. Grassberger, I. Procaccia, Measuring the strangeness of strange attractors, Physica D 9 (1983), 189-208.
 [14] B. Kegl, Intrinsic dimension estimation using packing numbers, Advances in Neural Information Processing Systems 16 (2002), 681-688.
 [15] T. Lin, H. Zha, Riemannian manifold learning, IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (2008), 796-809.
 [16] P.J. Verveer, R.P.W. Duin, An evaluation of intrinsic dimensionality estimators, IEEE Transactions on Pattern Analysis and Machine Intelligence 17 (1995), 81-86.
 [17] M. Fan, H. Qiao, B. Zhang, Intrinsic dimension estimation of manifolds by incising balls, Pattern Recognition 42 (2009), 780-787.
 [18] A.M. Farahmand, C. Szepesvari, J.-Y. Audibert, Manifold-adaptive dimension estimation, in: Proceedings of the 24th Annual International Conference on Machine Learning, 2007, pp. 265-272.
 [19] E. Levina, P.J. Bickel, Maximum likelihood estimation of intrinsic dimension, Advances in Neural Information Processing Systems 18 (2004), 777-784.
 [20] D.J.C. MacKay, Z. Ghahramani, Comments on 'Maximum likelihood estimation of intrinsic dimension' by E. Levina and P. Bickel, http://www.inference.phy.cam.ac.uk/mackay/dimension/, 2005.
 [21] M. Raginsky, S. Lazebnik, Estimation of intrinsic dimensionality using high-rate vector quantization, Advances in Neural Information Processing Systems 19 (2005), 352-356.
 [22] F. Camastra, Data dimensionality estimation methods: a survey, Pattern Recognition 36 (2003), 2945-2954.
 [23] M. Hein, J.-Y. Audibert, Intrinsic dimensionality estimation of submanifolds in R^d, in: Proceedings of the 22nd International Conference on Machine Learning, 2005, pp. 289-296.
 [24] S.W. Cheng, Y.J. Wang, Z.Z. Wu, Provable dimension detection using principal component analysis, in: Proceedings of the 21st Annual Symposium on Computational Geometry, 2005, pp. 208-217.
 [25] S.W. Cheng, M.K. Chiu, Dimension detection via slivers, in: Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms, 2009, pp. 1001-1010.
 [26] J.A. Costa, A.O. Hero, Geodesic entropic graphs for dimension and entropy estimation in manifold learning, IEEE Transactions on Signal Processing 52 (2004), 2210-2221.
 [27] J.A. Costa, A.O. Hero, Estimating local intrinsic dimension with k-nearest neighbor graphs, IEEE Statistical Signal Processing Workshop (2005), 1432-1436.
 [28] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (1998), 2278-2324.
 [29] K.W. Pettis, T.A. Bailey, A.K. Jain, R.C. Dubes, An intrinsic dimensionality estimator from near-neighbor information, IEEE Transactions on Pattern Analysis and Machine Intelligence 1 (1979), 25-37.