Entropy-Isomap: Manifold Learning for High-dimensional Dynamic Processes

02/19/2018
by   Frank Schoeneman, et al.
University at Buffalo
0

Scientific and engineering processes produce massive high-dimensional data sets that are generated as highly non-linear transformations of an initial state and few process parameters. Mapping such data to a low-dimensional manifold can facilitate better understanding of the underlying process, and ultimately their optimization. We show that off-the-shelf non-linear spectral dimensionality methods, such as Isomap, fail for such data, primarily due to the presence of strong temporal correlation among observations belonging to the same process pathways. We propose a novel method, Entropy-Isomap, to address this issue. The proposed method is successfully applied to morphology evolution data of the organic thin film fabrication process. The resulting output is ordered by the process variables. It allows for low-dimensional visualization of the morphological pathways, and provides key insights to guide subsequent design and exploration.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 2

page 5

02/27/2019

High-Dimensional Bayesian Optimization with Manifold Gaussian Processes

Bayesian optimization (BO) is a powerful approach for seeking the global...
07/17/2020

Visualizing the Finer Cluster Structure of Large-Scale and High-Dimensional Data

Dimension reduction and visualization of high-dimensional data have beco...
03/16/2020

G-LBM:Generative Low-dimensional Background Model Estimation from Video Sequences

In this paper, we propose a computationally tractable and theoretically ...
10/26/2021

Estimating Mutual Information via Geodesic kNN

Estimating mutual information (MI) between two continuous random variabl...
08/21/2021

Using growth transform dynamical systems for spatio-temporal data sonification

Sonification, or encoding information in meaningful audio signatures, ha...
01/17/2017

3D Morphology Prediction of Progressive Spinal Deformities from Probabilistic Modeling of Discriminant Manifolds

We introduce a novel approach for predicting the progression of adolesce...
11/26/2021

Conditional Manifold Learning

This paper addresses a problem called "conditional manifold learning", w...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1. Introduction

In many physical processes, the underlying dimensionality of the data is much smaller than the dimensionality of the observation. This insight is at the core of all dimensionality reduction techniques, and it means both that data can be represented more concisely by using a latent state, and more importantly, that physical processes might be better understood by discovering their underlying low dimensionality.

Our focus in this work is on the second point, specifically, we propose a novel method for dimensionality reduction of process data, i.e. data that represents time-series of process states. While such data are ubiquitous, they are challenging for current dimensionality reduction techniques due to a combination of several factors: the datasets are large, since each sample of a process is a time series of points; the underlying processes are highly non-linear, which rules out many methods that could otherwise handle large data; and the individual data points are sampled in a highly correlated way, which can confuse many dimensionality reduction techniques.

In our prior work, we have developed S-Isomap, a spectral dimensionality reduction technique for non-linear streaming data (Schoeneman et al., 2017) that addresses two of the above challenges. The method can efficiently and reliably handle large non-linear datasets, but assumes that the input data is weakly correlated. Consequently, it fails when applied directly to process data. S-Isomap has been derived from the standard Isomap algorithm (Tenenbaum et al., 2000), which is frequently used and favored in the scientific computing data analysis (Lim et al., 2003; Dawson et al., 2005; Zhang et al., 2006; Rohde et al., 2008; Ruan et al., 2014; Strange and Zwiggelaar, 2014; Samudrala et al., 2015). Unfortunately, while there is some prior work on applying Isomap to spatiotemporal data (Jenkins and Matarić, 2004), the focus has been on segmentation of trajectories rather than discovering a continuous latent state. To the best of our knowledge, currently there are no spectral methods that could handle high-dimensional process data.

Time step Time step Time step Time step
Figure 1. Example morphologies corresponding to three selected variants of fabrication process. Morphologies were selected to show the diversity of morphology classes handled in this work. The combination of the two process parameters, and , and time step are shown. Note that the morphologies vary both in terms of size and topologies (bubble-like vs interpenetrated). Morphologies corresponds to polymer blend; dark blue correspond to regions rich in one polymer, while dark red corresponds to domains rich in another polymer. (Please view in color)

The current work is motivated by the need to analyze large and high-dimensional datasets generated from highly non-linear differential equations modeling morphology evolution during fabrication process of organic thin films (see Section 5.1). The fabrication of organic thin films is a key factor controlling properties of organic electronics, including transistors, batteries, and displays, but is computationally expensive and difficult to model precisely. Depending on fabrication parameters, different process trajectories are possible, and scientists and engineers are interested in gaining rapid insights from such data by identifying low-dimensional representations. The ultimate goal is to design smart exploration of the design space and optimize the fabrication to make devices with desired properties.

We first show that linear techniques, such as Principal Component Analysis (PCA) 

(Pearson F.R.S., 1901), overestimate the latent dimension of the process, and that dimensionality reduction techniques that assume uniformity in sampling, including non-linear strategies, fail due to the highly correlated nature of process data. We hypothesize that the poor performance of non-linear techniques is related to a lack of mixing (or cross-talk) between different trajectories, and present remedy based on resampling of the data. The main contribution of this paper is the concept of neighborhood entropy of a point, which indicates mixing between trajectories. We use neighborhood entropy to adaptively size the neighborhoods when computing the geodesic distance approximation in Isomap. The resulting dimensionality reduction method is both easy to implement and effective, and could likely be extended to other spectral dimensionality reduction approaches applied to process data.

2. Preliminaries

2.1. Spectral Dimensionality Reduction

Spectral Dimensionality Reduction (SDR) refers to a class of methods used to identify low-dimensional structure in high-dimensional data. For a given set of points, , in a high-dimensional space 

, SDR methods work by computing either top or bottom eigenvectors (eigendecomposition) of a feature matrix derived from 

. Here, the feature matrix, , captures structure of the data through some selected property (e.g. pairwise distances). The SDR methods rely on the assumption that there exists a function , , that maps low-dimensional representation, , of each data sample to the observed . The goal then becomes to learn the inverse mapping, , that can be used to map high-dimensional to low-dimensional .

The SDR methods are frequently categorized based on the assumption they make about (i.e. linear vs. non-linear). Linear methods assume that the data lie on a low-dimensional subspace of

, and construct a set of basis vectors representing the mapping. However, in many cases the data is complex, and the linearity assumption is too restricting. In such cases, non-linear methods work off the assumption that the data is sampled from some low-dimensional submanifold

, embedded within .

When working with the linearity assumption the most commonly used methods are Principal Component Analysis (PCA) and MDS. PCA learns the subspace which best preserves covariance. The basis vectors learned by PCA, known as principal components, are the directions along which the data has highest variance. The data

X can be transformed by first computing the covariance matrix, . In the case of MDS, the feature matrix contains some pairwise relationships between data points in X. When these relationships are Euclidean distances, the result is equivalent to that of PCA, and this is known as classical MDS. Spectral decomposition of F in both methods yields eigenvectors Q. Taking the top eigenvectors, the data can be mapped to low-dimensions as Y by the transformation Y = X.

0:  X,
0:  Y
1:   PairwiseDistances(X)
2:  
3:  for  do
4:     kNN KNN(, X, )
5:     for  kNN do
6:        
7:   AllPairsShortestPaths(G)
8:  
9:  return  Y
Algorithm 1 Isomap

In cases when data is assumed to be generated by some non-linear process, both PCA and MDS are not robust enough to learn the inverse mapping . Although variants of PCA have been proposed to address such situations (e.g. Kernel PCA (Schölkopf et al., 1998)), the most common approach is to use Isomap (Tenenbaum et al., 2000). Isomap constructs feature matrix by approximating distances between input points along the manifold , and then proceeds as regular MDS. This is accomplished in four steps, as shown in Algorithm 1. First all pairwise distances are computed for points in X. Then geodesic distances along manifold are approximated (lines 3–6) by first constructing a neighborhood graph, G, where each point is adjacent to its -nearest-neighbors, and then by computing shortest paths between all points in G (line 7). The resulting geodesics approximations are contained in the feature matrix F, which is processed by MDS to yield the final low-dimensional transformation.

2.2. Dynamic Process Data

When dataset represents a dynamic process, points in X can be partitioned into trajectories, , ,…, . Each trajectory is given by a -parameterized sequence of data points. In other words, = (, ,…, ), where when . Parameter is often, but not necessarily, used to denote time, and trajectory can be a function of one or more additional parameters.

In this work, we investigate the use of SDR methods in the analysis of dynamic processes. Specifically, we consider high-dimensional data describing morphology evolution during fabrication of organic thin film (Wodo and Ganapathysubramanian, 2012). Each trajectory in is a function of two variables corresponding to two fabrication parameters: , which denotes blend ratio of polymers making organic film, and , which denotes strength of interaction between these polymers. Each data point is an image representing one morphology snapshot generated by complex non-linear differential equation solver modeling morphology evolution. Each image is then represented by a high-dimensional vector in , obtained by simple processing of image pixels. Example morphologies from selected trajectories are shown in Figure 1, and we give detailed description of the data and the data generation process in Section 5.

The main challenge in analyzing the temporal morphology evolution data comes from the inherent bias in the exploration of possible states of the fabrication process. In the essence, sampling in is commonly unbalanced meaning much more dense than in parameters or . This is because the computational cost of executing solver to generate a single trajectory (i.e. sampling in ) is prohibitive to allow for the exhaustive sampling in the space formed by parameters and . Furthermore, data points in the same trajectory have high temporal correlation, which is reflective of how morphologies evolve. These factors strongly influence connectivity of the neighborhood graph, and in turn affect approximation of the manifold distances.

3. Challenges in Using SDR with Dynamic Process Data

To obtain the low-dimensional representation of morphology data, scientists have two broad categories of SDR methods to choose from: linear and non-linear. Linear SDR methods project the observed data into a low-dimensional linear subspace. Non-linear methods on the other hand, focus on finding non-linear transformations of the observed data. Here, we study the effectiveness of two representative methods from each category: PCA (linear) and Isomap (non-linear).

3.1. Assessing Quality of Mapping by State of the Art SDR Methods

A reliable way of determining the quality of the mapping for each method is to compare the original data X in with the mapped data Y in , by computing the residual variance. The process of computing residual variance for PCA differs from Isomap, but the values are directly comparable.

In PCA, each principal component (PC) explains a fraction of the total variance in the dataset. If we consider

as the eigenvalue corresponding to the

PC and as the total energy in the spectrum, i.e., , then the variance explained by the PC can be computed as . The residual variance can be calculated as:

(1)

In the Isomap setting, residual variance is computed by comparing the approximate pairwise geodesic distances, calculated using G represented by matrix , to the pairwise distances of the mapped data Y, represented by matrix :

(2)

Here, is the standard linear correlation coefficient, taken over all entries of and .

In the first step of our analysis, we compared the residual variance obtained using PCA and Isomap on the morphological process data (see Sections 2.2 and 5) consisting of data from six different trajectories, each trajectory corresponding to a unique configuration of pair and . Figure 2

summarizes our findings for PCA and Isomap. From the figure, we can see that PCA in unable to learn an effective low-dimensional mapping. In fact, while Isomap is able to explain about 70% of the variance using 3 dimensions, PCA requires more than 9 dimensions. Here we note that the ability to explain most of the information in the data in two or three dimensions is highly desired as it permits data visualization.

Figure 2. Isomap and PCA run on data for fixed and six pathways with varying . The quality of the Isomap manifold and PCA subspace are assessed using residual variance.
(a)
(b)
(c)
Figure 3. Six pathways with fixed and variable were selected to learn mapping and transform the data to 3-dimensions using (a) PCA (b) Isomap with (c) Isomap with using only the first 30 time steps of each pathway. (Please view in color)

To visualize the data, we used both PCA and Isomap to map the data to dimensions. The results are shown in Figures 2(a) and 2(b). For PCA, one of the dimensions () describes the time aspect of the process evolution. However, the PCA visualization does not offer additional insights into the process, which we attribute primarily to the PCA’s inability to capture non-linearities.

Since Isomap outperforms PCA in terms of residual variance, it is expected that the 3-dimensional data obtained from Isomap would offer more meaningful insights. However, as shown in Figure 2(b), all trajectories diverge from one another in -dimensions and there is no reasonable interpretation of the empty space. This indicates that the standard application of Isomap is inadequate when working with parameterized high-dimensional time series data.

3.2. Understanding Issues with Standard Isomap in Handling Dynamic Process Data

To further study the reason behind Isomap performance, we focus on the initial stage of the trajectories, where the morphologies are expected to evolve in a similar fashion. This is reflected in the Isomap visualization in Figure 2(b), where all trajectories appear to start from a common point in the 3-dimensional space and then diverge. We applied Isomap on only the early stage data represented by the first time steps of each trajectory (threshold selected by the domain expert). The results are shown in Figure 2(c), where we can clearly observe that the early data points for all trajectories cluster together before quickly diverging.

This leads us to the first key observation of this paper: When dealing with dynamic process data, in which the data points exhibit a strong temporal correlation within the trajectory to which they belong, but are different from data points that belong to other trajectories, Isomap cannot capture the relationships across different trajectories. Thus, the resulting mapping is dominated by the time dimension, as can be seen in Figure 2(b). The key issue is with the way the neighbors are selected for each point (see Algorithm 1, line 4). Figure 3(a) shows the matrix, , containing the distance between every pair of points, with rows and columns ordered by trajectory and time. In Figure 3(b), the same row ordering is retained, however, each row contains the sorted distances of the corresponding point to all points in the dataset, and colored by the trajectory to which they belong. For most of the points, the first several nearest neighbors are from the same trajectory.

(a)  Pairwise distance matrix for all data grouped by and ordered by time step along both axes. Distances in subblocks along the main diagonal denote inter-point distances within a fixed -value trajectory. Off-diagonal subblocks highlight distance between points lying on disjoint trajectories.
(b)  Distance matrix with rows grouped by and ordered by time step. Entries in row are sorted by increasing distance from and colored according with their value. Clusters of similar color nearest the left edge reflect -NN having common -value for sufficiently large .
Figure 4. Pairwise distances of all points points with and from six trajectories for visualized in two ways. (Please view in color)

The ability of the Isomap algorithm to learn an accurate description of the underlying manifold, depends on how well the neighborhood matrix captures the relationship across the trajectories. We refer to this relationship as cross-talk, or mixing among the trajectories. For any given point, the desired effect would be that the nearest neighborhood set contains points from multiple trajectories. However, the sorted neighborhood matrix indicates a lack of mixing, which essentially means that the Isomap algorithm does not consider information from other trajectories, when learning the shape of the manifold in the neighborhood of one trajectory.

3.3. Quantifying Trajectory Mixing

We use the information-theoretic notion of entropy to assess the quality of neighborhoods and to understand the mixing of trajectories. For a given point , let be the fraction of closest neighbors of that lie on the trajectory . The entropy of the -neighborhood of point is calculated as:

(3)

Similarly, we can define the -neighborhood entropy for a trajectory , as the average of -neighborhood entropy for all points on .

The neighborhood entropy of a point is high, if its nearest neighbors are uniformly distributed across all trajectories (high level of mixing). On the other hand, the entropy of a point is low, if its nearest neighbors mostly lie on a single trajectory (low level of mixing). Thus, neighborhood entropy quantifies the mixing level across the trajectories, for a given neighborhood size,

.

Intuitively, if is increased, the neighborhood entropy of the points should increase, thereby inducing more trajectory mixing. Figure 5 shows the average neighborhood entropy for each of the six trajectories in the data set, for different values of . We note that the neighborhood entropy increases linearly with , consistently for all six trajectories. Thus, for a small value of , Isomap is unable to obtain a meaningful low-dimensional representation, as evident in Figure 2(b), where was set to 8.

Figure 5. Neighborhood entropy of different trajectories as a function of ( and ).

3.4. Strategies for Inducing Trajectory Mixing

Figure 5 shows that using large could result in the desired level of trajectory mixing. However, as discussed in the original Isomap paper (Tenenbaum et al., 2000), the approximation error between the true Geodesic distance on the manifold between a pair of points and the approximate distance calculated using a shortest paths algorithm (see Algorithm 1) is inversely related to . For large , Isomap is essentially reduced to PCA, and is unable to capture the nonlinearities in the underlying manifold.

One approach to inducing trajectory mixing is subsampling, i.e., selecting a subset of points from a given trajectory. However, this would result in reduction of the data, which yields poor results. Another approach is to use skipping in the neighborhood selection, i.e., for a given point, skip the nearest points before including points in the neighborhood. In experimenting with skipping and subsampling approaches we experienced a loss in local manifold quality or data size, respectively. Based on this and the desirability of achieving the most accurate local and global qualities, we propose an entropy-driven approach in the next section.

4. Entropy-Isomap

Standard Isomap does not work well for dynamic process data since data points are typically closest to other data points from the same trajectory, yet the global structure of the process depends on relations between different trajectories. When the -NN neighborhoods are computed, this can result in poor mixing. Worse, how much trajectories interact can change throughout the process. For example, when trajectories come from simulations with similar initial conditions, the trajectories might interact for a while, but then diverge to explore different parts of the state space. A neighborhood size that produces good results in early stages might produce poor results later on in the process. A value of that is large enough to work for all times might include so many data points that the geodesic and Euclidean distances become essentially the same, which results in a PCA like behavior, defeating the purpose of using Isomap.

To address this situation, we propose to directly measure the amount of mixing and use it to change the neighborhood size for different data points adaptively. This mitigates the shortcomings of the two methods described in the previous section that either discard data (subsampling) or lose local information (skipping).

Figure 5 shows that neighborhood entropy increases when the next nearest neighbors are added. We propose using an entropy threshold to determine a neighborhood size . This modification allows the flexibility of larger neighborhoods in regions where it is necessary to force mixing between trajectories.

To prevent neighborhoods that are so large as to reduce Isomap to PCA, the maximum neighborhood size

is left as a parameter. This check allows processing datasets which contain trajectories in poorly sampled regions of the state space without skewing the rest of the analysis, which would otherwise result in unreasonably large neighborhood sizes.

0.975
0:  X, , ,
0:  Y
1:   PairwiseDistances()
2:  
3:  for all  do
4:     
5:     while  and  do
6:        
7:        kNN KNN(, X, )
8:         where kNN.
9:         NeighborhoodEntropy(, , )
10:   AllPairsShortestPaths(G)
11:  
12:  return  Y
Algorithm 2 Entropy-Isomap

The proposed Entropy-Isomap algorithm is shown in Figure 2. Compared to the standard approach, the algorithm takes additional argument, the target entropy level, . This parameter is used to decide when adaptively computed neighborhoods are producing good mixing. The initial step, computing all pairwise distances for data points in , remains the same as in the standard algorithm. Then, the entropy-based neighborhood selection is performed (lines 3-9). For each point , the algorithm proceeds with neighborhood size , initially equal to some default value . -nearest-neighbors are identified, and their neighborhood entropy is computed (lines 7-9). If the entropy threshold is not satisfied, then is incremented (line 6), and the process repeats. Once the entropy threshold is reached, or a user-defined maximum of iterations have been performed, the process terminates. The entire process is repeated for each , and after all neighborhoods have been identified, the algorithm continues the same way as standard Isomap (lines 10-12). We note that our presentation of the algorithm is simplified for clarity. In the practical implementation, the size of the neighborhood can be found via simple binary search, which further can be coupled with efficient incremental -NN solver, without the need to instantiate a complete distance matrix .

We applied Entropy-Isomap to our data with and the maximum number of steps . We selected this large to compute the fraction of large neighborhoods that would be required to strictly enforce mixing, in this case nearly 5% of our data. We also varied the entropy threshold from to , to explore the effect it has on the neighborhood size distribution. Example low dimensional representation obtained by Entropy-Isomap is presented in Figure 6 (a).

(a)
(b)
Figure 6. Entropy-Isomap with , is run on and variable . (a) The learned mapping is used to transform the data to 3-dimensions. (b) Neighborhood cross-mixing: for each trajectory , neighbors of each point belonging to individual trajectories are aggregated and shown in stacked bar graph form. (Please view in color)

We start our analysis by observing, that in the experiments high-entropy thresholds were often not reachable (see Figure 8). We believe that this is because nearest neighbors for the majority of points are in the same trajectory (see Figure 3(b)), which leads to skewed neighborhoods distribution. As a result, even when a satisfactory number of neighbors come from other trajectories, the entropy for the neighborhood might be low. Figure 6 (b) shows that even when trajectories mix, the majority of neighbors are still from the same trajectory. Therefore, high entropy implies good mixing, but the converse is not necessarily true. Large neighborhoods could produce mixing, while still having low entropy. Since large neighborhoods produce poor results with Isomap, we would like to avoid them in any case. In practice, we achieved good results with entropy thresholds in the range of , which were achievable by a large fraction of neighborhoods. This is further confirmed by experiments in Section 5.

When strictly enforcing entropy, the neighborhood sizes can become too large. Figure 7 shows the neighborhood size distribution for . When such neighborhoods are included for many points in the dataset, the neighborhood graph tends toward a completely connected graph, and the Isomap solution reduces to the PCA result. Recall that PCA is equivalent to classical MDS and that classical MDS is Isomap with .

Figure 7. Entropy-Isomap with default and entropy threshold run for data with and variable . The distribution of selected neighborhood sizes that did not reach the maximum are shown.

Points that produce no mixing also end up with large neighborhoods, as Entropy-Isomap tries to increase

in order to meet the entropy threshold. These points occur when the dataset does not contain enough trajectories that pass near those particular states to produce good geodesic distance estimates. Interestingly, plotting entropy versus time in Figure 

8 reveals that trajectories can pass through poorly sampled parts of the state space and again “meet up” with other trajectories.

The proposed methods can be used to detect trajectories that do not interact and also which regions of the state space are poorly sampled. This can be used to either remove them form the dataset or as a guide to decide where to collect more process data.

Figure 8. Entropy-Isomap with default and is run on and variable . The entropy of the discovered neighborhood is shown for the data point at each time step. (Please view in color)

5. Application

Organic electronics (OE) has the potential to revolutionize the next generation of electronic devices. It provides opportunity for sustainable manufacturing of organic transistors (Crone et al., 2000; Dimitrakopoulos and Mascaro, 2001), organic solar cells (OSC) (Hoppe and Sariciftci, 2004; Brabec et al., 2014), diode lighting (Tyan, 2011; Thejo K. and Dhoble, 2012), flexible displays (Crawford, 2005), integrated smart systems such as RFIDs (Myny et al., 2010; Myny et al., 2013), smart textiles (Stoppa and Chiolerio, 2014), artificial skin (Someya et al., 2004), and implantable medical devices and sensors (Lochner et al., 2014; Zhu et al., 2014). Its promise is further fueled by the ability for inexpensive, rapid and low-temperature roll-to-roll fabrication. However, many promising OE technologies are bottlenecked at the manufacturing stage – more specifically, at efficiently choosing fabrication pathways that result in desired morphologies that in turn deliver tailored device properties.

The key scientific breakthroughs that improved organic solar cell (OSC) performance were closely related to manufacturing pathway. Most advances have been achieved by non-trivial and non-intuitive (and sometimes very minor) changes in the fabrication protocol. Classic examples include changing the solvent (Shaheen et al., 2001), and thermal annealing (Li et al., 2005) that together resulted in two-orders of magnitude increase in the efficiency of OSCs. These advances have reaffirmed the importance of exploring processing conditions to impact device properties, and have resulted in the proliferation of manufacturing variants. However, these variants are invariably chosen using trial-and-error approaches, that has resulted in the exploration of very scattered and narrow zones of the space of potential processing pathways, due to resource and time constraints. These challenges are exacerbated by the lack of computational tools to analyze the data, and to suggest the next steps in exploration or modification to the fabrications.

One can easily identify more than a dozen material and process variables that can be tuned (evaporation rate, blend ratio of polymers, final film thickness, solubility of three components, degree of polymerization, atmosphere, shearing stress, chemical strength and frequency of patterning substrate), leading to the combinatorial explosion of manufacturing variants.

5.1. Data Generation

The morphology data analyzed in this paper, has been generated by a computational model based on the phase-field method to record the morphology evolution during thermal annealing of the organic thin films (Wodo and Ganapathysubramanian, 2012, 2011). We focused on the exploration of two manufacturing parameters, blend ratio and strength interaction. We selected these parameters, since they are known to strongly influence properties of the resulting morphologies. For each fabrication variant we generated a series of morphologies that together formed one trajectory .

We selected the range of two design parameters and to explore several factors. First of all, we are interested in two stages of the process: early phase separation and coarsening. Moreover, we would like to explore various topological classes of morphologies. For OSC, we are interested in identifying fabrication condition leading to interpenetrated structures. Finally, we seek to find the optimal annealing time that results in desired domain sizes. In total, we generated sixteen trajectories, with morpohologies on average per trajectory. Each morphology was represented as an image converted into an -dimensional space defined by pixel composition values. The entire data used in our experiments is open and available from https://gitlab.com/SCoRe-Group/Entropy-Isomap.

5.2. Results

From the manufacturing design perspective, there are two basic aims for dimensionality reduction of morphological pathways. First of all, we seek to discover the common latent variables driving the dynamic process. Secondly, we seek to learn the geometry of manifold to device subsequent round of exploration.

Figures 9 and 10 depict three dimensional manifold discovered using Entropy-Isomap for the complete set of 16 pathways. When mapped to the manifold, the pathways show ordering according to the process variables that were varied to generate the data. In both figures, for easier inspection we marked the pathways according to one varying variable. For example, the top row in Figure 9 depicts the pathways for fixed and varying . Pathways for increasing are ordered from right (dark) to left (light), while pathways for increasing are ordered from front (green) to back (blue).

The observed ordering of pathways strongly indicates that the variables are also latent variables controlling the dynamic process. More importantly, the ordering reveals that denser sampling is required in space. Specifically, the pathways sharing the same but varying are spread further apart than these sharing the same value. This observation has important implications for the design of the next round of exploration in the phase space.

Finally, we notice that Entropy-Isomap mapped the data into two regions. The early stages of the process are mapped to evolve in the radial direction, while late stages are mapped parallel to each other. This is interesting as the underlying process indeed has two inherent time scales. In the early stage, the phase separation between two polymer occur. During this stage, the changes mostly result in increase of the composition amplitude. In the second stage, the coarsening between already formed domains occurs. Here, the amplitude of the composition (signal) does not change significantly. The changes mostly occur in the frequency space with the domain sizes increases over time.

Figure 9. The manifold of early stage of the morphology evolution with first 30 points per each trajectory. To better illustrate discovered ordering by two variables we color coded the same manifold with increasing (top) and (bottom). (Please view in color)
Figure 10. The manifold of the late stage with the first 80 points per each trajectory. The same manifold is color coded by increasing (top) and (bottom). (Please view in color)

6. Conclusions

Dynamic process data, represented by trajectories, poses challenges to commonly used SDR methods due to unbalanced sampling in the parameter space describing the system, compared to the dense sampling in the time space. This unbalanced sampling, in this case the strong temporal correlation of data points, leads to poor quality construction of a neighborhood graph. This in turn directly influences the calculated geodesic distances and learned manifold. In this work, we introduce the notion of Neighborhood Entropy. The neighborhood entropy quantifies the information exchange of data points from less densely sampled parameters in combating the impact of the highly correlated parameter.

Equipped with a measure of neighborhood quality, we have presented an algorithm, Entropy-Isomap, that can be used to learn a mapping with more reliable geodesics. From a practical standpoint, Entropy-Isomap is able to discover latent variables governing dynamic processes and learn the true manifold geometry to drive decisions in the application domain.

We successfully showcased our method with the morphology evolution data set. In particular, our results indicate that the proposed Entropy-Isomap orders the trajectory data according to the two process variables. Moreover, the method exposes the need for denser sampling in one of the explored variables. This observation can be used to design the next round of simulations and should lead to overall higher entropy of the neighborhood between trajectories. When performed in iterative fashion, we anticipate the unbalance sampling (time space vs. design variables space) to be mitigated. Finally, the learned manifold captured the physically meaningful transition point between two states of the process.

Acknowledgements.
Authors wish to acknowledge support provided by the Center for Computational Research at the University at Buffalo.

References

  • (1)
  • Brabec et al. (2014) C. Brabec, U. Scherf, and V. Dyakonov. 2014. Organic photovoltaics: materials, device physics, and manufacturing technologies. John Wiley & Sons.
  • Crawford (2005) G. Crawford. 2005. Flexible flat panel displays. John Wiley & Sons.
  • Crone et al. (2000) B. Crone, A. Dodabalapur, Y.-Y. Lin, R.W. Filas, Z. Bao, A. LaDuca, R. Sarpeshkar, H.E. Katz, and W. Li. 2000. Large-scale complementary integrated circuits based on organic transistors. Nature 403, 6769 (2000), 521–523.
  • Dawson et al. (2005) K. Dawson, R.L. Rodriguez, and W. Malyj. 2005. Sample phenotype clusters in high-density oligonucleotide microarray data sets are revealed using Isomap, a nonlinear algorithm. BMC Bioinformatics 6, 1 (2005), 195.
  • Dimitrakopoulos and Mascaro (2001) C. D. Dimitrakopoulos and D. J. Mascaro. 2001. Organic thin-film transistors: A review of recent advances. IBM Journal of Research and Development 45, 1 (2001), 11–27.
  • Hoppe and Sariciftci (2004) H. Hoppe and N.S. Sariciftci. 2004. Organic solar cells: An overview. Journal of Materials Research 19 (2004), 1924–1945.
  • Jenkins and Matarić (2004) O.C. Jenkins and M.J. Matarić. 2004. A spatio-temporal extension to isomap nonlinear dimension reduction. In

    Proceedings of the twenty-first international conference on Machine learning

    . ACM, 56.
  • Li et al. (2005) G. Li, V. Shrotriya, J. Huang, Y. Yao, T. Moriarty, K. Emery, and Y. Yang. 2005. High-efficiency solution processable polymer photovoltaic cells by self-organization of polymer blends. Nature Materials 4 (2005), 864–868.
  • Lim et al. (2003) I. Lim, P. de Heras Ciechomski, S. Sarni, and D. Thalmann. 2003. Planar arrangement of high-dimensional biomedical data sets by isomap coordinates. In Proceedings of the 6th IEEE Symposium on Computer-Based Medical Systems. 50–55.
  • Lochner et al. (2014) C.M. Lochner, Y. Khan, A. Pierre, and A.C. Arias. 2014. All-organic optoelectronic sensor for pulse oximetry. Nature communications 5 (2014).
  • Myny et al. (2010) K. Myny, S. Steudel, S. Smout, P. Vicca, F. Furthner, B. van der Putten, A. K. Tripathi, G. H. Gelinck, J. Genoe, and W. Dehaene. 2010. Organic RFID transponder chip with data rate compatible with electronic product coding. Organic Electronics 11, 7 (2010), 1176–1179.
  • Myny et al. (2013) K. Myny, S. Steudel, P. Vicca, S. Smout, M. J. Beenhakkers, N. van Aerle, F. Furthner, B. van der Putten, A. K. Tripathi, and G. H. Gelinck. 2013. Organic RFID tags. In Applications of Organic and Printed Electronics. Springer, 133–155.
  • Pearson F.R.S. (1901) K. Pearson F.R.S. 1901. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2, 11 (1901), 559–572.
  • Rohde et al. (2008) G.K. Rohde, W. Wang, T. Peng, and R.F. Murphy. 2008. Deformation-based nonlinear dimension reduction: Applications to nuclear morphometry. In 5th IEEE International Symposium on Biomedical Imaging: From Nano to Macro. 500–503.
  • Ruan et al. (2014) Y. Ruan, G.L. House, S. Ekanayake, U. Schutte, J.D. Bever, H. Tang, and G. Fox. 2014. Integration of clustering and multidimensional scaling to determine phylogenetic trees as spherical phylograms visualized in 3 dimensions. In 2014 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid). 720–729.
  • Samudrala et al. (2015) S.K. Samudrala, J. Zola, S. Aluru, and B. Ganapathysubramanian. 2015. Parallel framework for dimensionality reduction of large-scale datasets. Scientific Programming (2015).
  • Schoeneman et al. (2017) F. Schoeneman, S. Mahapatra, V. Chandola, N. Napp, and J. Zola. 2017. Error metrics for learning reliable manifolds from streaming data. In Proceedings of the 2017 SIAM International Conference on Data Mining. SIAM, 750–758.
  • Schölkopf et al. (1998) B. Schölkopf, A. Smola, and K.R. Müller. 1998. Nonlinear component analysis as a kernel eigenvalue problem. Neural computation 10, 5 (1998), 1299–1319.
  • Shaheen et al. (2001) S. E. Shaheen, C. J. Brabec, and N. S. Sariciftci. 2001. 2.5% efficient organic plastic solar cells. Applied Physics Letters 78 (2001), 841–843.
  • Someya et al. (2004) T. Someya, T. Sekitani, S. Iba, Y. Kato, H. Kawaguchi, and T. Sakurai. 2004. A large-area, flexible pressure sensor matrix with organic field-effect transistors for artificial skin applications. Proceedings of the National Academy of Sciences of the United States of America 101, 27 (2004), 9966–9970.
  • Stoppa and Chiolerio (2014) M. Stoppa and A. Chiolerio. 2014. Wearable electronics and smart textiles: a critical review. Sensors 14, 7 (2014), 11957–11992.
  • Strange and Zwiggelaar (2014) H. Strange and R. Zwiggelaar. 2014. Open Problems in Spectral Dimensionality Reduction. Springer.
  • Tenenbaum et al. (2000) J.B. Tenenbaum, V. de Silva, and J.C. Langford. 2000. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science 290, 5500 (2000), 2319.
  • Thejo K. and Dhoble (2012) N. Thejo K. and S.J. Dhoble. 2012. Organic light emitting diodes: Energy saving lighting technology—A review. Renewable and Sustainable Energy Reviews 16, 5 (2012), 2696–2723.
  • Tyan (2011) Y.-S. Tyan. 2011. Organic light-emitting-diode lighting overview. Journal of Photonics for Energy 1, 1 (2011), 011009–011009.
  • Wodo and Ganapathysubramanian (2011) O. Wodo and B. Ganapathysubramanian. 2011. Computationally efficient solution to the Cahn-Hilliard equation: adaptive implicit time schemes, mesh sensitivity analysis and the 3D isoperimetric problem. J. Comput. Phys. 230 (2011), 6037–6060.
  • Wodo and Ganapathysubramanian (2012) O. Wodo and B. Ganapathysubramanian. 2012. Modeling morphology evolution during solvent-based fabrication of organic solar cells. Computational Materials Science 55 (2012), 113–126.
  • Zhang et al. (2006) Q. Zhang, R. Souvenir, and R. Pless. 2006. On manifold structure of cardiac MRI data: Application to segmentation. In

    IEEE Computer Society Conference on Computer Vision and Pattern Recognition

    , Vol. 1. 1092–1098.
  • Zhu et al. (2014) C. Zhu, C. Ninh, and C.J. Bettinger. 2014. Photoreconfigurable polymers for biomedical applications: chemistry and macromolecular engineering. Biomacromolecules 15, 10 (2014), 3474–3494.