Transport Model for Feature Extraction

10/31/2019
by Wojciech Czaja, et al.

We present a new feature extraction method for complex and large datasets, based on the concept of transport operators on graphs. The proposed approach generalizes and extends many existing data representation methodologies built upon diffusion processes to a new domain where dynamical systems play a key role. The main advantage of this approach comes from the ability to exploit different relationships than those arising in the context of, e.g., graph Laplacians. Fundamental properties of the transport operators are proved. We demonstrate the flexibility of the method by introducing several diverse examples of transformations. We close the paper with a series of computational experiments and applications to the problem of classification of hyperspectral satellite imagery, to illustrate the practical implications of our algorithm and its ability to quantify new aspects of relationships within complicated datasets.


1. Introduction

Feature extraction has been at the core of many data science applications for more than a century. The goal of feature extraction is to derive new measurements (or features) from an initial set of measured data, with the intention of retaining the core information while eliminating redundancies. A well-known feature extraction algorithm is principal components analysis (PCA), which can be traced back to the year 1901 [26]. However, due to the linear nature of PCA, the method falls short in capturing the intrinsic structure of the data when that structure is governed by non-linear relationships. Since then, the growing size and complexity of data, together with their non-linear nature, have led scientists to come up with new techniques. A few well-known techniques are: kernel PCA [29], isomap [34], locally linear embedding (LLE) [27], and Laplacian eigenmaps (LE) [3]. Today, the use of feature extraction techniques varies with the application, from the classification of hyperspectral images [7, 32, 33, 39] to the prediction of stock market prices [40].

The aforementioned non-linear feature extraction methods lead to applications of linear operators, e.g., the Laplacian. In the present study, we have developed a more general approach that constructs non-linear feature extraction algorithms based on non-linear operators, such as appropriately chosen transport-by-advection operators. A recent technique [18] sought to find an optimal transport between two point sets based on an adaptive multiscale decomposition, itself derived from diffusion wavelets and diffusion maps. In our work, we focus on the transport operator directed by velocity fields [8, 23, 36], because of its well-studied properties as well as its partial similarity to the Schroedinger Eigenmaps method [14]. This transport model has not previously been used in the literature as a tool for building a feature extraction algorithm. Nevertheless, some related work can be found in the fields of water resource management and bio-medical research [22], where feature extraction is used to construct simplified transport models for cardiovascular flow.

At its core, our work will focus on exploring and exploiting the differences and similarities of this novel approach to the state-of-the-art feature extraction algorithms found in the literature. After providing some background in Section 2, we introduce the model in Section 3 together with some properties of the model. The algorithm for our approach is given in Section 4, and we provide an application of our algorithm for feature extraction and subsequent classification of hyperspectral image data in Section 5 and Section 6. Some open problems are posed in the last section.

2. Background

In many data science applications, high-dimensional data tend to lie on low-dimensional manifolds within the high-dimensional space. To take advantage of this information, methods such as Laplacian eigenmaps (LE) [3] and Schroedinger eigenmaps (SE) [14] invoke the adjacency graph constructed from a set of initial points in $\mathbb{R}^n$, in order to extract the most important features from the data.

In LE, after building a weighted graph from the points, the problem is reduced to solving the following generalized eigenvector problem,

(2.1)   $L y = \lambda D y$,

where $L = D - W$, viz., the Laplacian matrix, with $W$ representing the (symmetric) weight matrix and $D$ the diagonal matrix with entries $D_{ii} = \sum_j W_{ij}$. Let $y_0, y_1, \dots, y_m$ be the solution set to (2.1), written in ascending order according to their eigenvalues. The $m$-dimensional Euclidean space mapping is given by

$x_i \mapsto \big(y_1(i), \dots, y_m(i)\big).$

In SE, the $m$-dimensional Euclidean space mapping is given in a similar manner. Dubbed a generalization of the LE algorithm, SE uses partial knowledge about the data set and fuses this information into the LE algorithm to obtain a better representation or more desirable results. Additional work related to data fusion can be found in the following papers [6, 12, 16, 20]. The problem in SE is reduced to solving the following generalized eigenvector problem,

(2.2)   $E y = \lambda D y$,

where $E = L + \alpha V$, viz., the Schroedinger matrix, with $V$ as the potential matrix encoding the partial information and $\alpha$ as a real parameter keeping the balance between the matrices $L$ and $V$.

The algorithm we are developing in this article, viz., transport eigenmaps (TE), has some similarities to SE in the sense that both algorithms use extra information about the data set to define a generalization of LE. Unlike supervised learning techniques, which assume prior knowledge of the ground truth, i.e., knowledge of what the output values for our samples should be, SE and TE only assume partial knowledge of said ground truth. This puts SE and TE in a class of machine learning techniques between supervised learning and unsupervised learning (no prior knowledge), called semi-supervised learning (see [5, 15, 17, 37, 38] for more examples). While SE uses potentials to encode the additional information, TE may use advection (the active transportation of a distribution by a flow field) or measure/weight modifiers. In contrast to SE, TE may be derived from a non-linear operator, which we describe in Section 3.
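To make the LE and SE constructions above concrete, the following is a minimal sketch in Python, assuming the standard formulation: a symmetric kNN graph, heat-kernel weights, the generalized eigenproblem $L y = \lambda D y$ (or $(L + \alpha V) y = \lambda D y$ for SE), and an embedding that drops the trivial first eigenvector. All parameter values and function names are illustrative, not those used later in the paper.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import eigh

def laplacian_eigenmaps(X, m=2, k=10, sigma=1.0, potential=None, alpha=0.0):
    """Minimal LE/SE sketch: kNN graph, heat-kernel weights, generalized eigenproblem.
    `potential` is an optional diagonal of V (length N) for the Schroedinger variant."""
    N = X.shape[0]
    D2 = cdist(X, X, "sqeuclidean")
    # symmetric kNN adjacency (column 0 of the argsort is the point itself)
    idx = np.argsort(D2, axis=1)[:, 1:k + 1]
    A = np.zeros((N, N), dtype=bool)
    A[np.repeat(np.arange(N), k), idx.ravel()] = True
    A = A | A.T
    # heat-kernel weights on existing edges
    W = np.where(A, np.exp(-D2 / sigma), 0.0)
    D = np.diag(W.sum(axis=1))
    L = D - W                      # positive semi-definite graph Laplacian
    E = L if potential is None else L + alpha * np.diag(potential)
    _, vecs = eigh(E, D)           # generalized eigenproblem, ascending eigenvalues
    return vecs[:, 1:m + 1]        # drop the trivial constant eigenvector

# usage: embed three Gaussian blobs into the plane
X = np.vstack([np.random.randn(50, 5) + c for c in (0, 4, 8)])
Y = laplacian_eigenmaps(X, m=2, k=8, sigma=2.0)
```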

3. The transport model

Transport operators have been used in modeling and analyzing data in a variety of fields [1, 9, 10, 19, 21, 30, 31]. We aim to bring this idea into the graph setting to help with data representation.

3.1. Notation and introduction

We first briefly present the basic setting for studying transport models on graphs. Fix a weighted simple graph with nodes. Let be a function defined on the edges of . As a matrix, it is assumed to be anti-symmetric since it will be used to model a velocity field. The transport operator acting on a vector y is formally defined as

(3.3)

This agrees with the continuous transport operator.

We use the following rules to translate between the continuous and discrete settings. For any matrix , which is viewed as a function defined on the edges, the divergence of is a function defined on nodes, i.e., is a vector:

(3.4)

When models a velocity field on the graph, is just the net flow into or out of the node .

For any function defined on the nodes, its gradient, the dual operator of the divergence, is defined on the edges

(3.5)

A matrix (e.g., a velocity field) can act on a vector (e.g., a probability distribution) in the following way

This corresponds to the standard centered discretization of the transport operator (after taking the divergence).

The Laplacian of , , is defined on the nodes:

(3.6)

This differs from the graph Laplacian by a sign, as we prefer to have a positive semi-definite graph Laplacian.

Based on the above rules, we have and . Therefore, the definition of (3.3) becomes

(3.7)

It is not clear from the expression (3.7) whether a transport operator would always produce real eigenvalues, as the Laplacian and Schroedinger operators do. We will address this issue in the next subsection.
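For concreteness, one standard set of discrete-calculus conventions that is consistent with the descriptions above reads as follows; the notation ($W$ for the edge weights, $F$ for the anti-symmetric velocity field, $y$ for a function on the nodes) is introduced here for illustration only, and the paper's exact definitions may differ:

$$(\operatorname{div} F)_i = \sum_{j} W_{ij}\, F_{ij}, \qquad (\nabla y)_{ij} = y_j - y_i, \qquad (\Delta y)_i = \sum_{j} W_{ij}\,(y_j - y_i) = -(L y)_i .$$

Under these conventions, the divergence of the velocity field is the net weighted flow at a node, the gradient is an edge difference dual to the divergence, and $\Delta$ differs from the positive semi-definite graph Laplacian $L = D - W$ exactly by a sign, as stated above.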

3.2. Self-adjointness

As the properties of the transport operator ultimately depend on , an anti-symmetric matrix, we aim to find 's so that the corresponding transport operator is self-adjoint (possibly with respect to a non-standard inner product).

For any positive definite matrix , we use to denote the inner product

When it is the identity matrix, this agrees with the standard inner product.

It is natural to assume that if the nodes and are not connected, as in this case as well. Let

and otherwise. Then is also anti-symmetric, , and

Our goal is to find a positive definite matrix such that is self-adjoint with respect to . It turns out is a natural choice (see the Supplementary Materials for a discussion and comparison with another choice ).

Theorem 3.1.

Let be a symmetric matrix. Assume for some positive ’s. Then the operator is self-adjoint with respect to the inner product , with for some positive .

Proof.

For the convenience of future discussion, denote and we try to “solve” for . In general, could be non-diagonal. We need to verify that for any vectors y and z

(3.8)

The left-hand-side (LHS) of (3.8) is

Compare this with the right-hand-side (RHS) of (3.8)

and we see that in order to make (3.8) hold,

(3.9)

must be true for any pair of connected nodes and .

Now make use of the assumption . In this case, the key condition (3.9) becomes

which is

This clearly holds as by the assumption of the theorem. ∎

We can immediately extend this theorem to a more general model by introducing a symmetric matrix . This new collection of parameters will allow us to implement the transport eigenmap method in various settings.

Theorem 3.2.

Let and be symmetric matrices. Define to be the operator such that

(3.10)

Assume for some positive ’s. Then is self-adjoint with respect to the inner product , with for some positive .

Proof.

Simply notice that the symmetric matrix can be incorporated into the symmetric matrix and thus the operator has the same form as in Theorem 3.1. ∎

When , the general transport operator can be rewritten as

(3.11)

This expression also indicates that is non-negative when .

Theorem 3.3.

The operator defined by (3.11) is non-negative in , where for some positive . More precisely,

(3.12)

with and . In particular, iff the quantity is constant on every connected component of the graph.

Proof.

By a straightforward computation,

When , the above expression is and thus must be constant on any connected component. The converse is trivial by (3.11). ∎

The above theorem ensures that is diagonalizable, with real-valued and non-negative eigenvalues. In applications, we will however look for the generalized eigenvectors of : eigenvectors that are normalized by the degree on the graph, i.e., vectors u s.t.

where is the degree matrix as before: . Equivalently, we are looking for the eigenvectors y of with or and the same generalized eigenvalues. From Theorem 3.3, it is now straightforward to deduce the following.

Corollary 3.4.

Let be given by (3.11) and let be the degree matrix. Then the operator is self-adjoint in and non-negative, where for some positive . Furthermore, iff is constant on connected components of the graph.

Proof.

is self-adjoint on , simply because is diagonal and so is the metric provided by . It would be very different if we had to use a non-diagonal metric (and we would have to study directly instead of ).

The operator is still non-negative with

and by Theorem 3.3, equality holds iff is constant on connected components of the graph. ∎

Compared with the Laplacian operator , we see that generalizes in the following ways:

  • modifies the measure/coordinate and thus makes the representation of the -th point closer to the origin if is large, or further away from the origin if is small.

  • can enlarge or reduce the weight between two nodes and , serving as a weight modifier.

We can then use these two sets of parameters to guide data representation given by LE.

3.3. Two examples

We will use TE to denote the general transport operator (3.11). Although the matrix can be used to fuse extra information, the implementation with it could be more time-consuming as the size of is . We will therefore first look at two examples (denoted by TA and TG, respectively) using only. As Section 6 will show, TA and TG are often good enough to handle classification tasks when one class is known. The general TE, however, is needed when more than one class is known.

3.3.1. Transport by advection (TA)

Advection is the active transportation of a distribution by a flow field. Let be a vector that will be used to direct the clustering process. Let be a real parameter. Set , , and . Clearly . By Theorem 3.2, the operator with

(3.13)

is self-adjoint and enjoys other desired properties.

The operator can also be derived directly from the general operator (3.7) by choosing the velocity field , . In this case, and becomes

(3.14)

which is no longer linear. We can then linearize the second term in (3.14) in the direction of and will be exactly (see [25] for details).

This choice of operator is inspired by the porous medium equation, for which we refer for example to [35] for a thorough discussion of this type of non-linear diffusion on . In the present context, the idea behind having is to use the distribution y itself to help with clustering. The velocity field naturally points in the direction of the higher values of y if or towards lower values if . Similarly, solving the advection-diffusion equation

would naturally lead to concentration around higher values of y if (limited by the dispersive effects of the graph Laplacian) or, on the contrary, to faster dispersion if . The ability to control concentrations, and hence clustering, is of obvious interest for our purpose.

3.3.2. Transport by gradient flows (TG)

Set in (3.11). Then the general transport operator becomes

(3.15)

Note that this is in fact the same operator that appeared in Theorem 3.1, where is a scaling-invariant gradient of . Here plays a similar role as in the first example of the transport by advection. One advantage of having the extra term is that even when the weight modifier is constant, the weights can still be changed. In applications, the default value for the measure modifier is and some of them may be greater than if extra information is known. When , which often indicates that the two points and belong to different clusters, the factor weakens the original weight . Therefore, the formulation of the operator achieves measure modification and weight modification simultaneously without using .

4. The transport eigenmap method

We describe the implementation of our new TE (short for transport eigenmap or transport extended) algorithm, including TA and TG as two important special cases.

4.1. The algorithm

The steps are identical to those of LE and SE. We only need to modify the matrix used in the generalized eigenvalue problem. Given a set of points $x_1, \dots, x_N$ in $\mathbb{R}^n$, the goal is to find a map $\Phi$ so that the points $\Phi(x_1), \dots, \Phi(x_N)$ in $\mathbb{R}^m$ represent $x_1, \dots, x_N$.

The goal is typically to have a lower-dimensional representation of the set of points, with $m < n$, while still keeping the main features of the original set. For example, if the points lie on a $d$-dimensional manifold where $d \le m$, the hope would be to take as the map a good approximation of the projection onto the manifold.

  • Step 1: Construct the adjacency graph using the k-nearest neighbor (kNN) algorithm. This is done by putting an edge connecting two nodes whenever one is among the k nearest neighbors of the other according to the Euclidean metric. We choose k large enough so that the resulting graph is connected.

  • Step 2: Define the weight matrix, , on the graph. The weights in are chosen using the heat kernel with some parameter . If nodes and are connected, the weight is given by the heat kernel; otherwise, the weight is zero.

  • Step 3: Choose an appropriate transport operator and construct the corresponding matrix. Recall the general transport operator given in (3.11)

    Here, the vector and the matrix are the parameters to be chosen. Let denote the matrix with entries . Then the matrix form of is

    (4.16)

    To get the matrix form of the special operator TA, we can either set and in (4.16), or use the operator form (3.13) to derive its matrix form directly

    where is the Laplacian matrix and is the identity.

    Similarly, for the operator , we can let in (4.16) or use the expression in Theorem 3.1 to get

    where , and .

  • Step 4: Find the -dimensional transport mapping by solving the generalized eigenvector problem,

    (4.17)

    This can be done because of Corollary 3.4. Let $y_0, y_1, \dots, y_m$ denote the solution set to (4.17), written in ascending order according to their eigenvalues. Since there is no additional information in $y_0$, we define the mapping by $x_i \mapsto \big(y_1(i), \dots, y_m(i)\big)$; see the sketch below.
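The full pipeline can be summarized by a short sketch. The construction of the transport matrix itself (Step 3, i.e., (4.16), (3.13), or the TG form) is not reproduced here and is left as a user-supplied callable, so the snippet below only commits to the graph construction, the heat-kernel weights, and the generalized eigensolve; function and parameter names are illustrative, and passing the plain graph Laplacian recovers LE as a sanity check.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.neighbors import kneighbors_graph

def te_embedding(X, build_operator, m=10, k=10, sigma=1.0):
    """Steps 1-4 sketch. `build_operator(W) -> symmetric matrix` stands in for
    the transport matrix of Step 3 (TA, TG, or general TE), which is not
    reproduced here; the plain Laplacian recovers LE."""
    # Step 1: symmetric kNN adjacency graph (entries hold Euclidean distances)
    A = kneighbors_graph(X, n_neighbors=k, mode="distance").toarray()
    A = np.maximum(A, A.T)
    # Step 2: heat-kernel weights on existing edges
    W = np.where(A > 0, np.exp(-A**2 / sigma), 0.0)
    # Step 3: transport matrix (user-supplied, self-adjoint w.r.t. the degree metric)
    T = build_operator(W)
    # Step 4: generalized eigenproblem T y = lambda D y, ascending eigenvalues
    D = np.diag(W.sum(axis=1))
    _, vecs = eigh(T, D)
    return vecs[:, 1:m + 1]        # drop the trivial first eigenvector

# usage: plain Laplacian eigenmaps as the degenerate case of the skeleton
X = np.random.rand(200, 5)
Y = te_embedding(X, build_operator=lambda W: np.diag(W.sum(1)) - W, m=3)
```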

4.2. A toy example

We illustrate the behavior of LE, SE and TE (including TA and TG) with a toy example. The first picture in Figure 1 is a dataset with points. The ground truth is that there are clusters, each containing points.

LE is an unsupervised method that preserves local distance. We chose for KNN in Step 1 and in Step 2 for simplicity.

SE, which uses the matrix , requires extra parameters: and the diagonal potential matrix . Assume the red points are known. Simply let if the -th point is red and otherwise. Let . This new parameter will allow us to balance the impact of the Laplacian matrix and the potential in the algorithm. We chose . As expected, points with non-zero potential (the red ones in this example) are pushed towards the origin. As the method tries to preserve local distance, other points close to the red ones are dragged towards the origin as well.

For TA, we chose and in the same way as : for red and for other points. The red points go to the origin because of the rescaling of the coordinates, but the surrounding points do not "see" any changes in distance. This explains the weaker dragging effect in TA compared with SE.

In TG, we set by default and for the red points. The red points are even better separated from the others. This is because the factor in (3.15) is less than and thus weakens the original weight if and are not both red.

For the general TE, the matrix needs to be determined. The default is . Then it is natural to set

(4.18)

We set and , which helps to gather the red points further.

If the pre-identified cluster is not near the center of the data points, e.g., the blue points, then we can set to be less than for the blue points to push them away from the origin. The weight modifier in TE always helps gather these points at their natural location. See Figure 1 for the case for the blue and otherwise in TE ( remains as in (4.18)).

The general TE can even handle the case when more than one cluster is known. Let for red and for blue. is still given by (4.18). We can see in Figure 1 that both red and blue are well-separated from the others.

Figure 1. The first plot presents the dataset, points grouped in clusters of points each. The next plots show the results of various mappings.

It is very promising that TE can be used to help with clustering. We shall not pursue this direction in this paper. Instead, we will test our methods on real hyperspectral data.

5. Hyperspectral sample set experiments

5.1. The datasets

We have taken advantage of two hyperspectral data sets: Indian Pines and Salinas. The Indian Pines dataset (cf. an example in Figure 4 in the supplementary document) was gathered by the AVIRIS (Airborne Visible/Infrared Imaging Spectrometer) sensor over the Indian Pines test site in North-western Indiana. The Indian Pines dataset consists of 145 × 145 pixel images that contain 220 spectral bands in the wavelength range 0.4 to 2.5 micrometers. The available ground truth is designated into sixteen classes (see Table 7 in the supplement). The number of bands has been reduced to 200 by removing bands covering the region of water absorption. The Indian Pines dataset is available through Purdue University's MultiSpec site [24, 2].

The Salinas dataset was similarly gathered by the AVIRIS sensor over Salinas Valley, California (see Figure 5 in the supplementary document). With a similar structure, the Salinas images are 512 × 217 pixels with 224 spectral bands and approximately 3.7-meter spatial resolution. The ground truth available is also clustered into sixteen classes (see Table 8 in the supplement). We again reduce the number of bands, to 204, by removing those bands covering the region of water absorption. The Salinas dataset is publicly available [24].

For easier testing purposes, we have also used a small sub-scene of the Salinas dataset, which we denote Salinas-B (shown in Figure 6 in the supplement). Salinas-B consists of a data cube located within the same scene at [samples, lines]=[200:349, 40:139] and includes only eight classes (see Table 9 in the supplement). The Salinas-B dataset was used to allow for a faster and more thorough exploration of the parameters’ space.

After the various mappings, we employ Matlab's 1-nearest neighbor algorithm to classify the data sets. We use a fixed percentage of the data from each class to train the classifier and the remaining data points as the validation set. We took an average of ten runs to produce the confusion matrices, each using a disjoint set of data to train the classifier.
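The classification protocol described above (Matlab's 1-nearest-neighbor classifier, a fixed per-class training fraction, confusion matrices averaged over ten runs) can be sketched as follows in Python; `train_frac` is a placeholder since the exact percentage is not reproduced here, and the splits below are independent random draws rather than the disjoint partitions used in the paper.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

def evaluate_1nn(Y, labels, train_frac=0.1, runs=10, seed=0):
    """Average confusion matrix over `runs` random per-class splits.
    `Y` holds the embedded features; `train_frac` is illustrative only."""
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    C_sum = np.zeros((classes.size, classes.size))
    for _ in range(runs):
        train_idx = []
        for c in classes:  # stratified split: a fixed fraction of each class
            idx = np.flatnonzero(labels == c)
            rng.shuffle(idx)
            train_idx.extend(idx[: max(1, int(train_frac * idx.size))])
        train_idx = np.array(train_idx)
        test_mask = np.ones(labels.size, dtype=bool)
        test_mask[train_idx] = False
        clf = KNeighborsClassifier(n_neighbors=1).fit(Y[train_idx], labels[train_idx])
        pred = clf.predict(Y[test_mask])
        C_sum += confusion_matrix(labels[test_mask], pred, labels=classes)
    return C_sum / runs
```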

5.2. Choice of parameters

Following the description of the mapping algorithms for the various methods under consideration in subsection 4.1, we made the following choices to construct the graph on which all methods rely:

  • The adjacency graph is built using nearest neighbors;

  • The weight matrix was obtained by using ;

  • We calculated generalized eigenvectors for the Indian Pines dataset and for the Salinas-B dataset. The final mappings were obtained from those generalized eigenvectors as described in Step 4 of subsection 4.1.

For SE, TA and TG, we also need to choose the potential , the vector and a. In our testing, for example, we have assumed prior knowledge of either class 2-corn-notill or class 11-soybean-mintill in the Indian Pines dataset. This leads to the typical choice in the case

In TG, the default is and we will set for the known points. It remains to choose the parameters and . For SE, recall that in Section 4.2 we introduced the parameter given by . To obtain the results listed in the next subsection, we used

  • for the Indian Pines data set and for the Salinas-B data set for SE;

  • for both the Indian Pines and the Salinas-B data set for TA and TG.

The particular choices of parameters summarized here were obtained after a more thorough investigation and optimization among possible values. This parameter exploration is shown in Section C in the supplementary material.

5.3. Measuring accuracy

We will compare the performance of several feature extraction methods in the next section. To obtain a more complete perspective, we consider several measurements of accuracy.

We first use the adjusted Rand index (ARI), which is a widely used cluster validation index for measuring agreement between partitions [28]. Given a set of $n$ points and a partition of these points into clusters (e.g., the clustering produced by the classifier), the ARI compares it to the ground truth partition by calculating

(5.19)   $\displaystyle \mathrm{ARI} = \frac{\sum_{i,j} \binom{n_{ij}}{2} - \Big[\sum_i \binom{a_i}{2}\sum_j \binom{b_j}{2}\Big]\Big/\binom{n}{2}}{\tfrac{1}{2}\Big[\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\Big] - \Big[\sum_i \binom{a_i}{2}\sum_j \binom{b_j}{2}\Big]\Big/\binom{n}{2}},$

where $(n_{ij})$ is the confusion (contingency) matrix between the two partitions, $a_i = \sum_j n_{ij}$, and $b_j = \sum_i n_{ij}$.

We next consider the overall accuracy (OA) and the average or weighted accuracy (AA). The overall accuracy is simply the total number of well-classified objects relative to the total number of objects, while the average accuracy is the average of the per-class accuracies,

$\mathrm{OA} = \frac{1}{n}\sum_i n_{ii}, \qquad \mathrm{AA} = \frac{1}{r}\sum_i \frac{n_{ii}}{a_i},$

with the notations above, where $r$ denotes the number of classes.

We finally employ the average F-score (FS) and Cohen's kappa coefficient ($\kappa$), which are given by

$\mathrm{FS} = \frac{1}{r}\sum_i \frac{2\, n_{ii}}{a_i + b_i}, \qquad \kappa = \frac{\mathrm{OA} - p_e}{1 - p_e}, \quad \text{where } p_e = \frac{1}{n^2}\sum_i a_i\, b_i .$

Both the ARI and the $\kappa$ coefficient measure the degree of agreement between clusterings. A strong cluster agreement with the ground truth usually also results in a high overall accuracy. The average accuracy and the F-score are validation metrics that serve as test scores to ensure that our results are not biased towards a few particular classes. Comparable average accuracy and F-score across methods indicate that the algorithms do not favor a few particular classes over the others.
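Given the (averaged) confusion matrix produced by the classifier, the scores above can be computed in a few lines. This is a sketch using the standard definitions recalled in this subsection; the ARI is omitted since it is already spelled out in (5.19).

```python
import numpy as np

def scores_from_confusion(C):
    """OA, AA, average F-score and Cohen's kappa from a confusion matrix C
    (rows: ground truth, columns: predictions)."""
    n = C.sum()
    a = C.sum(axis=1)                            # ground-truth class sizes
    b = C.sum(axis=0)                            # predicted class sizes
    diag = np.diag(C)
    oa = diag.sum() / n                          # overall accuracy
    aa = np.mean(diag / a)                       # average (per-class) accuracy
    fs = np.mean(2 * diag / (a + b))             # average F-score
    pe = np.sum(a * b) / n**2                    # chance agreement
    kappa = (oa - pe) / (1 - pe)                 # Cohen's kappa
    return oa, aa, fs, kappa
```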

6. Results

We summarize the main results of our numerical experiments on the real hyperspectral images introduced in the previous section. More details are available in the supplementary document.

6.1. Overall performance

The following feature extraction algorithms are used in the experiment: principal components analysis [26] (PCA), Laplacian eigenmaps [4] (LE), diffusion maps [13] (DIF), isomap [34] (ISO), Schroedinger eigenmaps [11] (SE), transport eigenmaps (TE, including TA and TG). The classification maps for each of the results can be found in the supplement.

We especially focus on the Adjusted Rand Index, the Overall Accuracy, and Cohen's kappa coefficient (emphasized in bold in the tables) as the main indicators of the performance of the algorithms.

6.1.1. Testing on two examples

We first test TA on the Salinas-B dataset (Table 1), assuming the class "lettuce" is known in SE and TA. Unsurprisingly, the semi-supervised algorithms, SE and TA, outperform the unsupervised algorithms, PCA, LE, DIF and ISO. The performance of SE and TA is roughly similar, with a small but consistent advantage for TA.

SB     PCA      LE       DIF      ISO      SE       TA
ARI    0.9429   0.9346   0.9164   0.9440   0.9439   0.9463
OA     0.9729   0.9685   0.9603   0.9733   0.9762   0.9780
AA     0.9690   0.9643   0.9564   0.9700   0.9777   0.9802
FS     0.9693   0.9638   0.9557   0.9696   0.9766   0.9795
κ      0.9682   0.9630   0.9534   0.9687   0.9720   0.9742
Table 1. Classification results for Salinas-B (SB): assume lettuce (class 14) is known.

Classification algorithms frequently mis-classify samples of similar classes due to the similarities in their spectral information. For this reason, we tested the algorithms by grouping similar classes within the Indian Pines and Salinas-B data sets to make new ground truths, which we denote Indian Pines-G and Salinas-B-G (see Table 10 and Table 11 in the supplement).
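Operationally, building a grouped ground truth amounts to relabeling. Below is a minimal sketch; the grouping map is purely illustrative, and the actual groupings used for Indian Pines-G and Salinas-B-G are listed in the supplement.

```python
import numpy as np

# Hypothetical grouping map: original class id -> grouped class id (illustrative only).
grouping = {2: 1, 3: 1, 10: 2, 11: 2, 12: 2}

def group_labels(labels, grouping):
    """Relabel ground truth by merging similar classes; unlisted classes keep their id."""
    return np.array([grouping.get(int(c), int(c)) for c in labels])
```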

It turns out that SE and TA indeed perform better on grouped Salinas-B (Table 2) than on Salinas-B. TA remains the best method for the grouped dataset.

SBG    PCA      LE       DIF      ISO      SE       TA
ARI    0.9460   0.9421   0.9154   0.9480   0.9711   0.9767
OA     0.9791   0.9767   0.9677   0.9795   0.9858   0.9880
AA     0.9769   0.9750   0.9669   0.9784   0.9819   0.9840
FS     0.9797   0.9763   0.9697   0.9797   0.9829   0.9850
κ      0.9725   0.9694   0.9576   0.9731   0.9814   0.9843
Table 2. Classification results for Salinas-B-G (SBG): assume lettuce (class 11) is known.

We then test TG on the Indian Pines dataset and its grouped version, assuming the class "soybean" is known. In this difficult image, the gain of performance in using TG is significant. See Table 3 and Table 4 below.

IP     PCA      LE       DIF      ISO      SE       TG
ARI    0.4426   0.3745   0.4210   0.3930   0.6955   0.7104
OA     0.6761   0.6133   0.6557   0.6309   0.7354   0.7431
AA     0.6403   0.5782   0.6219   0.5979   0.6249   0.6248
FS     0.6471   0.5784   0.6212   0.5996   0.6255   0.6250
κ      0.6301   0.5592   0.6065   0.5785   0.6982   0.7071
Table 3. Classification results for Indian Pines (IP): assume soybean (class 11) is known.
IPG    PCA      LE       DIF      ISO      SE       TG
ARI    0.5330   0.4785   0.5102   0.4902   0.8929   0.9264
OA     0.7744   0.7307   0.7575   0.7418   0.9088   0.9155
AA     0.6987   0.6462   0.6883   0.6671   0.7111   0.7072
FS     0.7111   0.6479   0.6905   0.6739   0.7157   0.7087
κ      0.6996   0.6423   0.6770   0.6563   0.8788   0.8877
Table 4. Classification results for Indian Pines-G (IPG): assume soybean (class 10) is known.

We remark that ideally the way to implement TE (e.g., TA or TG) should depend on the physical interpretation of the data. The above tables show that TA and TG are good choices for "arbitrary" datasets.

6.1.2. Testing the general TE

Although computationally expensive, the general TE is needed if information about more than one class is known. Table 5 and Table 6 show that SE, TA and TG can often perform worse when two classes are known. However, TE gives significant improvements. Here we use given by (4.18) with and , and set and on the two known classes.

IP     SE       TG       TE          IPG    SE       TG       TE
ARI    0.5272   0.7693   0.8169      ARI    0.4351   0.8547   0.9372
OA     0.6855   0.8091   0.8268      OA     0.6858   0.8967   0.9252
AA     0.6221   0.6759   0.6864      AA     0.6431   0.7055   0.7221
FS     0.6229   0.6766   0.6855      FS     0.6467   0.7083   0.7242
κ      0.6409   0.7818   0.8024      κ      0.5821   0.8620   0.9004
Table 5. Classification results for Indian Pines (IP) and its grouped version (IPG): assume both corn and soybean are known.
SB     SE       TA       TE          SBG    SE       TA       TE
ARI    0.9381   0.9805   0.9812      ARI    0.7916   0.9773   0.9823
OA     0.9702   0.9909   0.9914      OA     0.9211   0.9902   0.9921
AA     0.9671   0.9903   0.9908      AA     0.9877   0.9877   0.9900
FS     0.9666   0.9902   0.9909      FS     0.9365   0.9889   0.9906
κ      0.9651   0.9894   0.9899      κ      0.8966   0.9871   0.9896
Table 6. Classification results for Salinas-B (SB) and its grouped version (SBG): assume both corn and lettuce are known.

In SE, points with positive potential will always be mapped towards the origin. There is no mechanism to handle two different clusters. This explains why SE often performs worse in the above tests. In TA and TG, although the distance from the points to the origin can be modified in different ways by varying , points from different classes could still collide after mapping because of their initial locations. The general TE can minimize the possibility of mixing two known classes, since the matrix provides an internal force to group points in the same class.

6.2. Dependence on the amount of the information

We performed further experiments on Indian Pines-G and Salinas-B-G to see how the amount of information available from one particular class affects the performance measures for SE and the transport methods TA and TG.

SE and the transport methods have very close overall performance on the Indian Pines-G and Salinas-B-G datasets, so the comparison may help to better understand the differences between them. As the amount of information increases, so do the performance measures. Figure 2 shows the change in performance of SE, TA and TG from using to using of the ground truth, with increments of , from a particular class.

Figure 2. Classification performance measures for SE (blue squares), TA (red diamonds) and TG (green x’s) as a function of the amount of information provided, from to with increments of . The Indian Pines-G data set (top row) is used with the advection and potential placed on class 10–soybean. The Salinas-B-G (bottom row) is used with the advection and potential placed on class 10–corn-senesced-green-weeds.

Over most of the figure, SE actually performs slightly better than TA and TG, with TA and TG only surpassing SE when we have close to of the information on the class. However, the difference between the algorithms remains very small on those two simplified datasets; this is especially striking on Indian Pines-G.

In light of these results, it is worth mentioning that in real-life applications, the extra information provided to SE and the transport methods does not come directly from the ground truth. Ideally, better and richer cluster information than the ground truth is produced using laboratory measurements and provided to the algorithms as the extra information. These laboratory measurements include various signals representing different materials under a wide range of conditions, e.g., lighting and weather. The use of the ground truth in our aforementioned results is simply due to the unavailability of such better and richer cluster information. It may very well be that with a more complete set of laboratory measurements, we would observe more significant differences between SE and the transport methods.

6.3. Robustness of Transport eigenmaps

In a last set of experiments, we investigate the robustness of the transport methods TA and TG as well as some of our other feature extraction algorithms, such as PCA, LE, and SE. For this experiment, we added Gaussian noise to the individual data points in the data set before it is processed by the feature extraction algorithms. The added Gaussian noise has a fixed mean, and we selected logarithmically spaced values for the standard deviation, varying over a range that covers the values taken by the individual data points in both data sets. For SE and the transport methods, the ground truths for class 10–soybean (Indian Pines-G) and class 11–lettuce (Salinas-B-G) are provided to the algorithms. The results are shown in Figure 3.

Figure 3. Classification performance measures for TA (red diamonds), TG (cyan diamonds), SE (blue squares), PCA (green x’s), and LE (black circles) as a function of noise. For Indian Pines-G (top row) the potential and advection are placed on class 10–soybean. For Salinas-B-G (bottom row) the potential and advection are placed on class 11–lettuce-romaine.

We first gather from the experiments that SE and the transport methods are more resilient to noise than PCA and LE. While the performance of all algorithms naturally decreases very significantly (and, interestingly, at almost the same mark), SE and the transport methods resist better. On Indian Pines-G, the transport methods also end up being the best-performing methods by a significant margin, doing better than SE for large noise whereas they are comparable for small noise. This again suggests that our new transport algorithm is especially useful in difficult settings where previous methods do not perform well.
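For reference, the noise-injection protocol described above can be sketched as follows; the zero mean, the number of noise levels, and the endpoints of the logarithmic sweep are illustrative assumptions, not the values used for Figure 3.

```python
import numpy as np

def add_gaussian_noise(X, sigma, rng=None):
    """Perturb every data point (pixel spectrum) with i.i.d. Gaussian noise of the
    given standard deviation before feature extraction; zero mean is assumed here."""
    rng = np.random.default_rng() if rng is None else rng
    return X + rng.normal(loc=0.0, scale=sigma, size=X.shape)

# Sweep: logarithmically spaced noise levels spanning (roughly) the data range.
sigmas = np.logspace(-2, 1, num=20)
X = np.random.rand(500, 200)          # stand-in for a flattened hyperspectral cube
noisy_versions = [add_gaussian_noise(X, s, np.random.default_rng(0)) for s in sigmas]
```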

7. Conclusion

In this manuscript, we propose a novel approach to semi-supervised non-linear feature extraction extending the Laplacian eigenmaps. Similar in spirit to previous extensions such as the Schroedinger eigenmaps, our algorithm is derived from a non-linear transport model. We provide a set of experiments based on artificially generated data sets and on publicly available hyperspectral data sets to compare the new method's performance to a variety of algorithms for reducing the dimension of the data provided to a standard classification algorithm.

Those experiments show intriguing possibilities for the new method, which has proved competitive with other algorithms in all settings and significantly outperforms other methods in the more difficult cases (low accuracy because of noise, for example). We believe that this demonstrates a strong potential for new methods using advection/gradient-flow operators, with, in particular, the following open questions:

  • How to further generalize the transport operator? The choice of the velocity field in Theorem 3.1 makes the transport operator self-adjoint with respect to an inner product associated with a diagonal matrix . It is natural to investigate the case with a non-diagonal, positive definite .

  • Can we better relate the choice of an algorithm to the expected structure of the problem? A good example might be time-dependent data, where a clear direction of propagation of the signal would lead one to conjecture an even better performance of advection-based eigenmaps.

  • What is the best way to choose the parameters in the general transport method? The intuition provided in Section 4.2 is good only for low dimensional data. When the dimension is high or there are two or more clusters, the choice of and can be very complicated. We plan to use neural networks to attack this problem.

References

  • [1] J. Bartsch, K. Brander, M. Heath, P. Munk, K. Richardson, and E. Svendsen (1989) Modelling the advection of herring larvae in the north sea. Nature 340 (6235), pp. 632. Cited by: §3.
  • [2] M. F. Baumgardner, L. L. Biehl, and D. A. Landgrebe (2015-09) 220 band AVIRIS hyperspectral image data set: June 12, 1992 Indian Pine test site 3. Cited by: §5.1.
  • [3] M. Belkin and P. Niyogi (2002) Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems, pp. 585–591. Cited by: §1, §2.
  • [4] M. Belkin and P. Niyogi (2003) Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation 15 (6), pp. 1373–1396. Cited by: §6.1.
  • [5] M. Belkin and P. Niyogi (2004) Semi-supervised learning on Riemannian manifolds. Machine Learning 56 (1-3), pp. 209–239. Cited by: §2.
  • [6] J. J. Benedetto, W. Czaja, J. Dobrosotskaya, T. Doster, K. Duke, and D. Gillis (2012) Semi-supervised learning of heterogeneous data in remote sensing imagery. In Independent Component Analyses, Compressive Sampling, Wavelets, Neural Net, Biosystems, and Nanoengineering X, Vol. 8401, pp. 840104. Cited by: §2.
  • [7] J. J. Benedetto, W. Czaja, J. C. Flake, and M. Hirn (2009) Frame based kernel methods for automatic classification in hyperspectral data. In Geoscience and Remote Sensing Symposium, 2009 IEEE International, IGARSS 2009, Vol. 4, pp. IV–697. Cited by: §1.
  • [8] T. Bennett (2012) Transport by advection and diffusion. Wiley Global Education. Cited by: §1.
  • [9] H. Brogniez, R. Roca, and L. Picon (2009) A study of the free tropospheric humidity interannual variability using meteosat data and an advection–condensation transport model. Journal of Climate 22 (24), pp. 6773–6787. Cited by: §3.
  • [10] W. Brutsaert and H. Stricker (1979) An advection-aridity approach to estimate actual regional evapotranspiration. Water Resources Research 15 (2), pp. 443–450. Cited by: §3.
  • [11] N. D. Cahill, W. Czaja, and D. W. Messinger (2014) Schroedinger eigenmaps with nondiagonal potentials for spatial-spectral clustering of hyperspectral imagery. In Algorithms and Technologies for Multispectral, Hyperspectral, and Ultraspectral Imagery XX, Vol. 9088, pp. 908804. Cited by: §6.1.
  • [12] A. Cloninger, W. Czaja, and T. Doster (2017) The pre-image problem for Laplacian eigenmaps utilizing regularization with applications to data fusion. Inverse Problems 33 (7), pp. 074006. Cited by: §2.
  • [13] R. R. Coifman and S. Lafon (2006) Diffusion maps. Applied and Computational Harmonic Analysis 21 (1), pp. 5–30. Cited by: §6.1.
  • [14] W. Czaja and M. Ehler (2013) Schroedinger eigenmaps for the analysis of biomedical data. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (5), pp. 1274–1280. Cited by: §1, §2.
  • [15] I. Davidson (2009) Knowledge driven dimension reduction. In Twenty-First International Joint Conference on Artificial Intelligence, pp. 1034–1039. Cited by: §2.
  • [16] T. J. Doster (2014) Harmonic analysis inspired data fusion for applications in remote sensing. Ph.D. Thesis, University of Maryland, College Park. Cited by: §2.
  • [17] H. Fang, M. Cheng, and C. Hsieh (2017) A hyperplane-based algorithm for semi-supervised dimension reduction. In IEEE International Conference on Data Mining. Cited by: §2.
  • [18] S. Gerber and M. Maggioni (2017-01) Multiscale strategies for computing optimal transport. J. Mach. Learn. Res. 18 (1), pp. 2440–2471. Cited by: §1.
  • [19] H. Guo, J. Zhang, R. Liu, L. Liu, X. Yuan, J. Huang, X. Meng, and J. Pan (2014-12) Advection-based sparse data management for visualizing unsteady flow. IEEE Transactions on Visualization and Computer Graphics 20 (12), pp. 2555–2564. Cited by: §3.
  • [20] A. Halevy (2011) Extensions of Laplacian eigenmaps for manifold learning. Ph.D. Thesis, University of Maryland, College Park. Cited by: §2.
  • [21] H.-W. Shen, C. R. Johnson, and K.-L. Ma (1996) Visualizing vector fields using line integral convolution and dye advection. In Proceedings of 1996 Symposium on Volume Visualization, pp. 63–70. Cited by: §3.
  • [22] K. B. Hansen and S. C. Shadden (2016) A reduced-dimensional model for near-wall transport in cardiovascular flows. Biomechanics and Modeling in Mechanobiology 15 (3), pp. 713–722. Cited by: §1.
  • [23] W. Hundsdorfer and J. G. Verwer (2013) Numerical solution of time-dependent advection-diffusion-reaction equations. Vol. 33, Springer Science & Business Media. Cited by: §1.
  • [24] Hyperspectral remote sensing scenes. Note: http://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes. Accessed: 2018-04-04. Cited by: §5.1, §5.1.
  • [25] F. Njeunje (2018) Computational methods in machine learning: transport model, Haar wavelet, DNA classification, and MRI. Ph.D. Thesis, University of Maryland, College Park. Cited by: §3.3.1.
  • [26] K. Pearson (1901) On lines and planes of closest fit to systems of points in space. Philosophical Magazine 2 (11), pp. 559–572. Cited by: §1, §6.1.
  • [27] S. T. Roweis and L. K. Saul (2000) Nonlinear dimensionality reduction by locally linear embedding. Science 290 (5500), pp. 2323–2326. Cited by: §1.
  • [28] J. M. Santos and M. Embrechts (2009) On the use of the adjusted Rand index as a metric for evaluating supervised classification. In International Conference on Artificial Neural Networks, pp. 175–184. Cited by: §5.3.
  • [29] B. Schölkopf, A. Smola, and K-R. Müller (1997) Kernel principal component analysis. In International Conference on Artificial Neural Networks, pp. 583–588. Cited by: §1.
  • [30] J. R. Sibert, J. Hampton, D. A. Fournier, and P. J. Bills (1999) An advection–diffusion–reaction model for the estimation of fish movement parameters from tagging data, with application to skipjack tuna (katsuwonus pelamis). Canadian Journal of Fisheries and Aquatic Sciences 56 (6), pp. 925–938. Cited by: §3.
  • [31] H. Song, M. R. Bringer, J. D. Tice, C. J. Gerdts, and R. F. Ismagilov (2003) Experimental test of scaling of mixing by chaotic advection in droplets moving through microfluidic channels. Applied Physics Letters 83 (22), pp. 4664–4666. Cited by: §3.
  • [32] W. Sun, A. Halevy, J. J. Benedetto, W. Czaja, W. Li, C. Liu, B. Shi, and R. Wang (2014) Nonlinear dimensionality reduction via the ENH-LTSA method for hyperspectral image classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 7 (2), pp. 375–388. Cited by: §1.
  • [33] W. Sun, A. Halevy, J. J. Benedetto, W. Czaja, C. Liu, H. Wu, B. Shi, and W. Li (2014) UL-Isomap based nonlinear dimensionality reduction for hyperspectral imagery classification. ISPRS Journal of Photogrammetry and Remote Sensing 89, pp. 25–36. Cited by: §1.
  • [34] J. B. Tenenbaum, V. De Silva, and J. C. Langford (2000) A global geometric framework for nonlinear dimensionality reduction. Science 290 (5500), pp. 2319–2323. Cited by: §1, §6.1.
  • [35] J. L. Vázquez (2007) The porous medium equation: mathematical theory. Oxford Mathematical Monographs, The Clarendon Press, Oxford University Press, Oxford. Cited by: §3.3.1.
  • [36] C. Vreugdenhil and B. Koren (1993) Numerical methods for advection- diffusion problems. Notes on Numerical Fluid Mechanics. Cited by: §1.
  • [37] K. Wagstaff, C. Cardie, S. Rogers, and S. Schrödl (2001) Constrained k-means clustering with background knowledge. In ICML, Vol. 1, pp. 577–584. Cited by: §2.
  • [38] D. Zhang, Z. Zhou, and S. Chen (2007) Semi-supervised dimensionality reduction. In Proceedings of the 2007 SIAM International Conference on Data Mining, pp. 629–634. Cited by: §2.
  • [39] W. Zhao and S. Du (2016) Spectral–spatial feature extraction for hyperspectral image classification: a dimension reduction and deep learning approach. IEEE Transactions on Geoscience and Remote Sensing 54 (8), pp. 4544–4554. Cited by: §1.
  • [40] X. Zhong and D. Enke (2017) Forecasting daily stock market return using dimensionality reduction. Expert Systems with Applications 67, pp. 126–139. Cited by: §1.

Appendix A Selecting the field

A.1. Limitations of

Among all anti-symmetric matrices, this choice is probably the simplest. In this case, the condition

(1.20)

becomes

(1.21)

There are unknown variables and at most independent equations. The solution space is therefore always one-dimensional, provided it is non-trivial. For example, suppose every node is connected to the first node; then the value of determines the rest of the variables by (1.21).

If the nodes are connected as a chain, then the value at any one node in the chain determines the values at the rest.

The only situation in which one needs to be careful is when there is a cycle of nodes. In this case, one needs to check the consistency of the system (1.21). We summarize these observations in the following theorem.

Theorem A.1.

The system (1.21) has a one-dimensional solution space if for any and any cycle of length , the following holds: without loss of generality and for notational simplicity, assume that the cycle is formed by the first nodes . Then

(1.22)

The following corollaries provide conditions that are easier to check than (1.22).

Corollary A.2.

If the graph is a tree, i.e., without cycles, then the system (1.21) has solutions.

Corollary A.3.

If the cardinality of the set is , then the system (1.21) has solutions.

Proof.

We will show by induction that (1.22) always holds when ’s attain only two values. When , for any indices ,

(1.23)

holds whenever and attain two values. This proves the base case.

Assume (1.22) is valid for . Let . We need to show that

(1.24)

Write the LHS of the above equality as

where we applied the induction hypothesis for in the last step.

It is clear that (1.24) will follow once we show