Diffusion-based nonlinear filtering for multimodal data fusion with application to sleep stage assessment

01/13/2017 · Ori Katz et al.

The problem of information fusion from multiple data-sets acquired by multimodal sensors has drawn significant research attention over the years. In this paper, we focus on a particular problem setting consisting of a physical phenomenon or a system of interest observed by multiple sensors. We assume that all sensors measure some aspects of the system of interest with additional sensor-specific and irrelevant components. Our goal is to recover the variables relevant to the observed system and to filter out the nuisance effects of the sensor-specific variables. We propose an approach based on manifold learning, which is particularly suitable for problems with multiple modalities, since it aims to capture the intrinsic structure of the data and relies on minimal prior model knowledge. Specifically, we propose a nonlinear filtering scheme, which extracts the hidden sources of variability captured by two or more sensors, that are independent of the sensor-specific components. In addition to presenting a theoretical analysis, we demonstrate our technique on real measured data for the purpose of sleep stage assessment based on multiple, multimodal sensor measurements. We show that without prior knowledge on the different modalities and on the measured system, our method gives rise to a data-driven representation that is well correlated with the underlying sleep process and is robust to noise and sensor-specific effects.






Code repository: code for the toy problem example described in https://arxiv.org/abs/1701.03619.

1. Introduction

Often, when measuring a phenomenon of interest that arises from a complex dynamical system, a single data acquisition method is not capable of capturing its entire complexity and characteristics, and it is usually prone to noise and interference. Recently, due to technological advances, the use of multiple types of measurement instruments and sensors has become more and more popular; nowadays, such equipment is smaller, less expensive, and can be mounted on everyday products and devices more easily. In contrast to a single sensor, multimodal sensors may capture complementary aspects and features of the measured phenomenon, and may enable us to extract a more reliable and detailed description of it.

The vast progress in the acquisition of multimodal data calls for the development of analysis and processing tools that appropriately combine data from the different sensors and handle well the inherent challenges that arise. One particular challenge is related to the heterogeneity of the data acquired in the different modalities; datasets acquired from different sensors may comprise different sources of variability, of which only a few are relevant to the phenomenon of interest. This particular challenge, as well as many others, has been the subject of many studies. For recent comprehensive reviews, see [23, 24, 15].

In this paper we consider a setting in which a physical phenomenon is measured by multiple sensors. While all sensors measure the same phenomenon, each sensor consists of different sources of variability; some are related to the phenomenon of interest, possibly capturing its various aspects, whereas other sources of variability are sensor-specific and irrelevant. We present an approach based on manifold learning, which is a class of nonlinear data-driven methods, e.g. [43, 33, 12, 2], and specifically, we use the framework of diffusion maps (DM) [7]. On the one hand, manifold learning is particularly suitable for problems with multiple modalities since it aims to capture the intrinsic geometric structure of the underlying data and relies on minimal prior model knowledge. This enables handling multimodal data in a systematic manner, without the need to specially tailor a solution for each modality. On the other hand, applying manifold learning to data acquired by multiple (multimodal) sensors may capture undesired/nuisance geometric structures as well. Recently, several manifold learning techniques for multimodal data have been proposed [9, 22, 45, 34]. In [9], the authors suggest concatenating the samples acquired by different sensors into unified vectors. However, this approach is sensitive to the scaling of each dataset, which might be especially diverse among datasets acquired by different modalities. To alleviate this problem, it is proposed in [22] to use DM to obtain a “standardized” representation of each dataset separately, and then to concatenate these “standardized” representations into unified vectors. Despite handling multimodal data better, this concatenation scheme does not utilize the mutual relations and co-dependencies that might exist between the datasets.

While methods such as [9, 22, 34] take into account all the measured information, the methods in [26, 42, 45, 25] use local kernels to implement nonlinear filtering. Specifically, following a recent line of study in which multiple kernels are constructed and combined [10, 11, 4, 30], it was shown in [26, 42] that a method based on alternating applications of diffusion operators extracts only the common source of variability among the sensors, while filtering out the sensor-specific components. Therefore, we choose to establish our framework based on DM, which relies on these theoretical foundations. Other nonlinear methods, such as [45, 25], do not have this theoretical assurance of convergence to an operator that extracts the common part, but may also be suitable as a framework; these methods can be tested and compared empirically with the proposed DM-based framework in future work. The shortcoming of alternating applications of diffusion operators arises when there is a large number of sensors; often, sensors that measure the same system capture different information and aspects of that system. As a result, the common source of variability among all the sensors captures only a partial or even empty view of the system, and important relevant information may be undesirably filtered out.

Here, we address the tradeoff between these two approaches. That is, we aim to maintain the relevant information captured by multiple sensors, while filtering out the nuisance components. Since the relevance of the various components is unknown, our main assumption is that the sources of variability measured only by a single sensor, i.e., sensor-specific, are nuisance. Conversely, we assume that components measured by two or more sensors are of interest. Importantly, such an approach implicitly implements a smart “sensor selection”; “bad” sensors, for example, malfunctioning sensors that measure only nuisance information, are automatically filtered out. These assumptions stem from the fact that the phenomenon of interest is global and not specific to one sensor. We propose a nonlinear filtering scheme in which only the sensor-specific sources of variability are filtered out, while the sources of variability captured by two or more sensors are preserved.

Based on prior theoretical results [26, 42], we show that our scheme indeed accomplishes this task. We illustrate the main features of our method on a toy problem. In addition, we demonstrate its performance on real measured data in an application for sleep stage assessment based on multiple, multimodal sensor measurements. Sleep is a global phenomenon with systematic physiological dynamics that represents a recurring non-stationary state of mind and body. Sleep evolves in time and embodies interactions between different subsystems, not limited solely to the brain. Thus, in addition to the well-known patterns in electroencephalogram (EEG) signals, its complicated dynamics are manifested in other sensors, such as sensors measuring breathing patterns, muscle tone and muscular activity, eyeball movements, etc. Each of these sensors is characterized by different structures and is affected by numerous nuisance processes as well. In other words, while we could extract the sleep dynamics by analyzing different sensors, each sensor captures only part of the entire sleep process, while introducing modality artifacts, noise, and interference. We show that our scheme allows for accurate systematic sleep stage identification based on multiple EEG recordings as well as multimodal respiration measurements. In addition, we demonstrate its capability to perform sensor selection by artificially adding noise sensors.

The remainder of the paper is organized as follows. In Section 2 we present a formulation for the common source extraction problem and present an illustrative toy problem. In Section 3, a brief review for the method proposed in [26, 42] is outlined, and then, a detailed description and interpretation of the proposed scheme are presented. In Section 4, we first demonstrate the capabilities of the proposed scheme on the toy problem introduced in Section 2. Then, in Section 5, we demonstrate the performance in sleep stage identification based on multimodal measured data recorded in a sleep clinic. Finally, in Section 6, we outline several conclusions.

2. Problem Setting

Consider a system driven by a set of hidden random variables. The system is measured by observable variables, where each sensor has access to only a partial view of the entire system and its driving variables. To formulate this, we define a “sensitivity table” given by a binary matrix indicating the variables sensed by each observable variable; specifically, each element of the table indicates whether a given hidden variable is measured by a given observable variable. It should be noted that this binary notation is a rough simplification, since in practice there are soft degrees of observability. However, at least theoretically, when we have sufficient data the algorithm is guaranteed to work for any bilipschitz observation function. When the amount of data is limited, these degrees of observability are dominated by the Lipschitz constants and the signal-to-noise ratio. Some of these observability issues were treated in [13]. Further quantification of these parameters for the derivation of soft observability scores is beyond the scope of this article and may be addressed in future work. The observable variables are therefore given by


where is a bilipschitz observation function, are hidden random variables captured only by the th observable variable, and is the subset of driving hidden variables of interest sensed by , given by


The random hidden variables are sensor-specific (associated only with the th observer). They are conditionally independent given the hidden variables of interest and will be assumed as noise/nuisance variables. We further assume that each random hidden variable in is measured by at least two observable variables, such that for each . As a result, we refer to the hidden variables in as common variables.

In order to simplify the notation, we denote the subset of all hidden variables (both common and sensor-specific) measured by the th observable by . Furthermore, we assume that the dimensions of the observations and the hidden variables satisfy


i.e., the observations are in higher dimension than the hidden common and nuisance variables.

number of common hidden variables
th common hidden variable
dimension of
set of all common hidden variables
number of observable variables
th observable variable
dimension of
sensitivity table
a bilipschitz th observation function
th sensor-specific hidden (nuisance) variables
dimension of
subset of sensed by
subset of all hidden variables measured by
Table 1. List of important notation.

An observation of the system is associated with a realization of the hidden variables and realizations of the hidden nuisance variables. Given observation samples, our goal is to obtain a parametrization of the underlying realizations of the common hidden random variables while filtering out the nuisance variables. We note that the observation index may represent the time index in the case of a time series.

2.1. Illustrative toy problem

We illustrate the problem setting using the following toy example. Consider six rotating arrows captured in simultaneous snapshots by three different cameras. We assume that each arrow rotates at a different speed, and that each camera can capture only a partial image of the entire system. The partial view of each camera is depicted in Figure 1. Thus, overall, each camera captures a sequence of snapshots (a movie) of three rotating colored arrows. A further illustration of the entire system and of the images captured by each camera can be seen at the following link: https://youtu.be/a-yb7ScdnnA.

Figure 1. Toy problem setup. (a) The coverage area of each camera, the system’s range of interest is marked by the dashed circle. (b) Sample snapshot taken by each camera.

In this problem setting, the hidden variables are the six rotation angles of the arrows: the common variables are the three rotation angles of the central arrows, which are marked by the dashed circle in Figure 1, and the nuisance variables are the three rotation angles of the peripheral arrows, since each is captured only by a single camera. It should be noted that none of the arrows is common to all of the cameras, meaning that the set of components common to the entire set of observables is empty.

In order to identify the hidden variables, we use different colors for the arrows. The arrows rotating according to the common variables are colored in red, green and blue, respectively, and the arrows rotating according to the nuisance variables are colored in orange, purple and gray, respectively. The hidden variables measured by each camera are , and . Our goal is to obtain a parametrization of the rotation angles of the three common arrows given the three movies of the cameras, without any prior knowledge on the system and the problem structure. In the sequel, we will use this toy problem for demonstrating important aspects and how our method accomplishes this task.

3. Nonlinear Filtering Scheme

3.1. Diffusion Maps

DM is a nonlinear data-driven dimensionality reduction method [7]. Assume we have high-dimensional data-points. The DM method begins with the calculation of a pairwise affinity matrix based on a local kernel, often using some metric within a Gaussian kernel, i.e.,


where is a tunable kernel scale and is a metric. The choice of the metric depends on the application; common choices are the Euclidean and the Mahalanobis distances [7, 35, 40, 39, 41]. This construction implicitly defines a weighted graph, where the data samples are the nodes of the graph, and each affinity is the weight of the edge connecting the corresponding pair of nodes. The next step is to normalize the affinity matrix and then to build the diffusion operator, e.g., by:


where is a diagonal matrix used for normalization, such that in this case is row-stochastic. Hence, it can be viewed as the transition matrix of a Markov chain defined on the graph. Accordingly, its th power is the transition probability matrix of consecutive steps, whose entries give the probability to jump from node to node in that number of steps. Let the diffusion distance [7] between the th and the th data samples be the function defined by


where is the stationary distribution of the Markov chain. The diffusion distance has been shown to be a powerful metric for measuring geometrical similarities between data-points [7]. While the Euclidean distance compares two individual data-points and might be affected by distortions and noise, the diffusion distance introduces much more noise-robust affinities since it relies on the connectivity between the two data-points using the entire data-set [7, 21].

However, the direct computation of the diffusion distance is cumbersome. An efficient calculation is attainable via the spectral decomposition of the diffusion operator. Let the sets of eigenvalues and right eigenvectors be given, where the eigenvalues are in descending order. Define a new representation (embedding) of the data-points:


where denotes the th element of . The obtained embedding provides a new representation of the data, referred to as DM, in which the Euclidean distance between two embedded data-points is equal to the diffusion distance [7], i.e.:


In order to achieve a compact representation in reduced dimensionality, DM is often redefined by keeping only the first components (i.e., the eigenvalues and eigenvectors corresponding to the largest eigenvalues):


where is usually determined by the eigenvalues decay. For more details and full analysis of this algorithm see [7, 36]. The entire DM method is outlined in Algorithm 1.
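The pipeline of Algorithm 1 can be sketched in a few lines of NumPy. This is a minimal sketch: the Gaussian kernel scale `eps`, diffusion time `t`, and embedding dimension `d` are illustrative choices, not values prescribed by the paper.

```python
import numpy as np

def diffusion_maps(X, eps, t=1, d=2):
    """Minimal Diffusion Maps sketch: Gaussian kernel, row-stochastic
    normalization, and spectral embedding scaled by eigenvalue powers."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq / eps)                    # pairwise affinity matrix
    P = W / W.sum(axis=1, keepdims=True)     # row-stochastic diffusion operator
    vals, vecs = np.linalg.eig(P)            # right eigenvectors of P
    order = np.argsort(-vals.real)
    vals, vecs = vals.real[order], vecs.real[:, order]
    # drop the trivial constant eigenvector (eigenvalue 1); keep the next d
    return (vals[1:d + 1] ** t) * vecs[:, 1:d + 1]

# example: points sampled on a circle embed into a 2-D representation
theta = np.linspace(0, 2 * np.pi, 60, endpoint=False)
X = np.column_stack([np.cos(theta), np.sin(theta)])
psi = diffusion_maps(X, eps=0.5)
```

In practice the eigendecomposition is usually performed on a symmetrized, conjugate form of the operator for numerical stability; the direct decomposition above suffices for illustration.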

The term “diffusion distance” in (6) suggests that induces a reasonable notion of distance. Recall the definition of a distance.

Definition 1.

Let be a set. A distance (or metric) on is a function such that for all :

  1. if and only if ,

  2. ,

  3. .

The following proposition states that the “diffusion distance” is really a metric defined on the nodes of the graph.

Proposition 2.

If is full rank, then is a distance function.

Since we could not find a proof in the literature, for the sake of self-containment, we provide a proof that summarizes the discussion in [37].


We prove that (1)-(3) hold. Define where is a diagonal matrix such that . Since is full rank and since is non-degenerate by the construction of the weighted graph, is full rank. Denote the th row of as . Accordingly, can be expressed as the Euclidean distance between the th and the th rows of :


The properties of the Euclidean distance in (10) imply that (2) and (3) hold. If , then . If , then , implying that . Since is full rank, there are no identical columns. In other words, no two different samples and for have identical affinities to all other samples, i.e., . Therefore, if , then . ∎
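As a sanity check, the metric properties asserted by Proposition 2 can be verified numerically on a small random graph. The graph below is an arbitrary full-rank example constructed for illustration, not data from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
W = rng.random((n, n))
W = W + W.T + n * np.eye(n)           # symmetric affinities, strongly diagonal (full rank)
P = W / W.sum(axis=1, keepdims=True)  # row-stochastic diffusion operator

# for P = D^{-1} W with symmetric W, the stationary distribution is
# proportional to the row sums of W
pi = W.sum(axis=1) / W.sum()

def diffusion_distance(i, j, t=2):
    Pt = np.linalg.matrix_power(P, t)
    return np.sqrt(np.sum((Pt[i] - Pt[j]) ** 2 / pi))

# spot-check the axioms of Definition 1 over all triples of nodes
for i in range(n):
    for j in range(n):
        d_ij = diffusion_distance(i, j)
        assert abs(d_ij - diffusion_distance(j, i)) < 1e-12   # symmetry
        for k in range(n):
            d_ik, d_kj = diffusion_distance(i, k), diffusion_distance(k, j)
            assert d_ij <= d_ik + d_kj + 1e-12                # triangle inequality
```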

3.2. Alternating Diffusion

Consider a system similar to the one described in Section 2, with only two observable variables and a single common variable. The AD algorithm, outlined in Algorithm 2, builds from the observations an AD operator that is equivalent to the simple diffusion operator (as described in Section 3.1) that would have been computed if we had direct access to samples of the common hidden variable. This operator enables capturing only the structure of the common variable while ignoring the nuisance (sensor-specific) variables. For more details and a full analysis of this algorithm, see [26, 42]; here, we only give a brief review of the method and the construction of the AD operator.

Assume we have aligned samples (realizations) from observable variables: . For each observation we build an affinity matrix and as follows:


for all , where and are the tunable kernel scales and and are the chosen metrics for each set of observations. Based on the affinity matrices, we calculate the diffusion operators and according to:


where and are diagonal matrices used for normalization. Next, we define as the AD operator. Note that is row-stochastic, and hence, can be considered as a transition probability matrix of a new Markov chain that alternates between the two data sets. Namely, each step of this alternating process consists of a propagation step using followed by a propagation step using .

Broadly, in each propagation step, the Markov chain jumps with high probability to neighboring samples that are similar in terms of the kernel. Combining alternating steps results in consecutive jumps according to similarities in the first set and then according to similarities in the second set. Overall, only similarities in terms of the common components among the two views are maintained.
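A minimal sketch of this construction: the Gaussian kernels and their scales are illustrative placeholders; the essential step is the product of the two row-stochastic diffusion operators built from the aligned views.

```python
import numpy as np

def diffusion_operator(X, eps):
    """Row-stochastic diffusion operator from a Gaussian kernel."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq / eps)
    return W / W.sum(axis=1, keepdims=True)

def alternating_diffusion(X1, X2, eps1=1.0, eps2=1.0):
    """AD operator: one propagation step in view 1 followed by one in view 2."""
    return diffusion_operator(X1, eps1) @ diffusion_operator(X2, eps2)

# two aligned views of the same 40 samples (random stand-ins)
rng = np.random.default_rng(1)
X1, X2 = rng.standard_normal((40, 3)), rng.standard_normal((40, 5))
K = alternating_diffusion(X1, X2)
```

The product of two row-stochastic matrices is itself row-stochastic, so the AD operator is a valid Markov transition matrix, as noted above.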

Formally, we define the diffusion distance between the th and the th sample based on the AD operator as the following Euclidean distance


where is the stationary distribution of and is the number of alternating steps. The following corollary is an immediate result of Proposition 2.

Corollary 3.

If is full rank, then is a distance function.

It can be shown that this distance is equivalent to the diffusion distance that would have been computed if we had direct access to observable variables that sense only the common variable [26].

Input: High-dimensional samples from an observable variable.
Output: Low-dimensional representation of the data-set.

  1. Calculate the affinity matrix :

  2. Compute the diffusion operator (transition matrix) :

  3. Calculate the spectral decomposition of and obtain its eigenvalues and eigenvectors .

  4. Define a new embedding for the data-points:


    where is a selected number of steps and denotes the th element of .

Algorithm 1 Diffusion Maps

Input: Aligned samples from observable variables: .
Output: Diffusion distances .

  1. Calculate two pairwise affinity matrices and based on a Gaussian kernel as follows:


    for all , where and are the kernel scales and is the chosen metric.

  2. Create two diffusion operators and :

  3. Build the alternating-diffusion kernel:

  4. Compute the alternating-diffusion distance between each two points


    where is the stationary distribution of and is a tunable parameter.

Algorithm 2 Alternating Diffusion

3.3. Common Graph

AD provides us with access to the common variables between a pair of observable variables. Using AD as a building block, we propose a generalization to a set of multiple observable variables. Consider the system described in Section 2 with aligned samples from the observable variables. The observable variables are driven by a set of hidden random variables and contaminated by a set of nuisance sensor-specific variables. Our goal is to obtain a parametrization of the common hidden random variables from the observations.

Corollary 3 provides the analytic foundation and justification to the method presented in this paper. More specifically, in the context of our problem, consider a pair of observable variables and . Applying AD to and yields the common hidden variables measured by the two. Therefore, its operation can be written as


In other words, AD captures only a subset of the common hidden variables , and in addition, filters out the nuisance variables and , which are specific to each observation.

The main idea in our method is based on the fact that the desired set of variables can be derived from the union of the pairwise intersections between all pairs, meaning that:


A direct implementation of the scheme in (20) is not feasible, since the pairwise intersections of and are not accessible to us. However, note that by substituting (19) in (20), we get


meaning that can be expressed using the accessible observations sets through the union of the intersections of all possible pairs. Thus, this scheme for recovering can be implemented by multiple applications of AD to all possible pairs of observable variables.

The union is implemented through the formulation of a new kernel, in which the affinity between each pair of samples is given by the sum of the diffusion distances over all pairs of observations. Therefore, for each kernel resulting from an application of AD to a single pair of observations, we compute the following diffusion distance, similarly to (12)


where is the stationary distribution of and is a tunable parameter indicating the number of AD steps. We then define the common diffusion distance as the summation over the alternating diffusion distances (22) resulting from applications to all possible pairs of observations, according to


where . We now show that is a metric.

Proposition 4.

Let be a set and consider two distance functions . Define for all . Then is a distance function as well. In particular, if are full rank for all , then is a distance function.


By definition is . We prove that properties (1)–(3) in Definition 1 hold. Using the symmetry property of and we have that . Consider , using property (3) of and , for any . If then . If , using the non-negativity property (1) of and we have that and . From property (1) we obtain that .

Now, by Corollary 3, if is full rank, then is a distance function, and therefore, by a straightforward generalization, it follows that is a distance function. ∎
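Given the AD operators of all sensor pairs, the summed distance of (23) can be sketched as follows. The stationary distribution of each operator is taken as its leading left eigenvector, and the operators below are random row-stochastic stand-ins for actual AD operators.

```python
import numpy as np

def ad_diffusion_distances(K, t=1):
    """Pairwise diffusion distances induced by one AD operator K."""
    vals, vecs = np.linalg.eig(K.T)                  # left eigenvectors of K
    pi = np.abs(vecs[:, np.argmax(vals.real)].real)
    pi /= pi.sum()                                   # stationary distribution
    Kt = np.linalg.matrix_power(K, t)
    diff = Kt[:, None, :] - Kt[None, :, :]
    return np.sqrt(np.sum(diff ** 2 / pi, axis=-1))

def common_diffusion_distance(ad_operators, t=1):
    """Sum the AD diffusion distances over all sensor pairs, as in (23)."""
    return sum(ad_diffusion_distances(K, t) for K in ad_operators)

# three toy "AD operators" (any row-stochastic matrices serve the illustration)
rng = np.random.default_rng(2)
ops = []
for _ in range(3):
    M = rng.random((20, 20))
    ops.append(M / M.sum(axis=1, keepdims=True))
D = common_diffusion_distance(ops)
```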

Based on the common diffusion distance, we then calculate an affinity matrix


where is the chosen kernel scale. Next we normalize the affinity matrix and build the common diffusion operator


In conclusion, the new graph with kernel consists of two main components. First, the intersections between any pair of observations are implemented using AD, which extracts the common hidden variables. Second, the union is implemented via the summation of the diffusion distances resulting from the AD applications. By construction, in this kernel, the connectivity between the th and the th data samples is proportional to the intrinsic distance. This means that the common global diffusion kernel can be used for obtaining a low-dimensional representation. The common graph algorithm described in this section is summarized in Algorithm 4.
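The final construction (steps 3 and 4 of Algorithm 4, the Gaussian affinity on the summed distances and its normalization) can be sketched as follows; the kernel scale `eps` is illustrative.

```python
import numpy as np

def common_graph_operator(D_common, eps):
    """Gaussian affinity on the summed common diffusion distances,
    normalized into a row-stochastic common diffusion operator."""
    W = np.exp(-D_common ** 2 / eps)
    return W / W.sum(axis=1, keepdims=True)

# symmetric toy distance matrix with zero diagonal (a stand-in for (23))
rng = np.random.default_rng(3)
A = rng.random((15, 15))
D_common = (A + A.T) * (1 - np.eye(15))
P_common = common_graph_operator(D_common, eps=1.0)
```

Standard DM (steps 3 and 4 of Algorithm 1) is then applied to the resulting operator to obtain the low-dimensional representation.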

Four final remarks follow. First, it should be noted that there is a theoretical gap between the desired union described in (21) and its implementation via the summation of the common diffusion distances in the proposed metric described in (23). The motivation for this choice of implementation is that the embeddings achieved using the proposed metric correspond to those that would have been achieved using a union scheme. Although this claim is supported by empirical results in Section 4, the derivation of a union scheme still calls for rigorous analysis in future work.

Second, the proposed implementation of the union via diffusion distance summation enhances the common variables that appear multiple times in the various intersections. By doing so, we slightly abuse the definition of the union, where duplicates are all “put together”. In other words, in the strict definition of a union, in contrast to our implementation, common hidden variables related to two or more intersection results should be taken into account only once. Depending on the application at hand, this may be a desired property, and the derivation of a scheme in which each common component has a uniform gain is postponed to future work.

Third, the proposed algorithm can be viewed from a nonlinear filtering standpoint. By applying the proposed algorithm, we maintain or even enhance the common hidden variables, while filtering out the nuisance variables that are sensor/observation-specific.

While the previous remarks addressed the proposed implementations and the potential theoretical gaps that should be addressed in the future, the last remark deals with the practical computation of the proposed implementation. For the application of sleep stage identification described in Section 5, we have empirically found that a modified computation gives rise to improved performance. This modification results in a “smoother” embedding, better representing the sleep stage. In the alternative implementation, rather than calculating as in (29), we calculate it in the following way. First, for each pair of sensors we apply standard DM based on the pairwise kernels computed in (17). For each pair we obtain a low-dimensional representation, whose dimension is a chosen parameter for the pair, estimated using the “spectral gap” in the decay of the eigenvalues. Second, we concatenate the low-dimensional representations obtained from the previous step into a single vector. In other words, we now have concatenated vectors representing the observations taken simultaneously from all sensors. Third, we calculate the pairwise distance between the new concatenated vectors. Broadly, this technique is similar to [22], only here we combine the already “filtered” components (the results of AD rather than DM). Since these vectors consist of components from different sensors, we chose to use a modified version of the Mahalanobis distance. This modified Mahalanobis distance was first introduced in [35], and since then was shown to exhibit a remarkable capability to standardize measurements from different sources, e.g., in [44, 40, 39, 41]. In [40, 39], it was shown to build intrinsic representations by revealing a hidden process driving the measurements. Recently, this technique was applied to multimodal data in [34]. Importantly, compared with AD and our proposed method, these methods [40, 39, 34] combine the information embodied in all the measurements and do not attempt to suppress nuisance variables or to extract only the common components.

The numerical implementation of the Mahalanobis distance deserves a remark. The computation of the Mahalanobis distance requires estimation of the local covariance matrices of the vectors. This might be computationally cumbersome when the dimension is large, as is often the case here. In order to relax the required computational load, prior to the computation of the Mahalanobis distance, one can project the concatenated samples onto a lower-dimensional vector space, for example using random projections (RP) [5], and then compute the Mahalanobis distance for the projected samples with reduced dimensionality. This heuristic method for calculating the common diffusion distance is summarized in Algorithm 3.

Input: alternating-diffusion operators ,
Output: Alternating-diffusion distance

  1. For , calculate the spectral decomposition of each kernel , and obtain its eigenvalues and eigenvectors .

  2. For , build an -dimensional representation using standard DM (9) for each time sample .


    where is a tunable parameter.

  3. For each time sample , concatenate the low-dimensional representations into a single vector

  4. Calculate using the Mahalanobis distance:


    for .

Algorithm 3 Mahalanobis-based Union Scheme

Input: Aligned samples from sets of observations: .

Output: Low-dimensional representation of the common hidden random variables .

  1. For each pair of observation sets , apply alternating diffusion (Algorithm 2), and obtain the diffusion distance .

  2. Compute the distance


    for .

  3. Based on the common diffusion distance calculate an affinity matrix

  4. Construct the diffusion operator :

  5. Apply standard diffusion maps (steps 3 and 4 in Algorithm 1) using , and obtain an -dimensional representation of .

Algorithm 4 Common Graph

4. Simulation Results

Consider the toy problem described in Section 2.1. We simulate six hidden scalar variables: three common variables and three nuisance variables. The variables are statistically independent and uniformly distributed. We then build three sets of RGB images. The sensitivity table of this example is given by


Each image contains three arrows, where each arrow is rotated according to a randomly generated angle, and the RGB images are column-stacked into vectors. The proposed algorithm is data-driven, and therefore it does not assume any prior knowledge on the nature of the observations. In order to highlight this important property, we use RP. First, RP with a sufficiently large dimension maintains the underlying geometry, yet the image appearance is lost, which shows that our algorithm does not apply any image processing. Second, in the original images, the different hidden variables are manifested in separate coordinates/pixels; RP mixes the hidden variables, enabling a more challenging extraction task. We generate a set of orthonormal random vectors and build the data of each sensor (camera) by projecting the column-stacked images onto them. In the case of data acquired by cameras, this projection can be viewed as the coding system in the cameras. An illustration of the images and their RP is depicted in Figure 2. An illustration of the “movies” of the RP captured by each camera can be seen at the following link: https://youtu.be/91N6mhlYQYY.
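The random projection step can be reproduced with a QR factorization. The ambient and projected dimensions below are illustrative placeholders, not the values used in the paper.

```python
import numpy as np

def random_orthonormal_projection(dim_in, dim_out, seed=0):
    """Columns of Q form a random orthonormal set; projecting a
    column-stacked image x onto them gives the observation Q.T @ x."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((dim_in, dim_out)))
    return Q

Q = random_orthonormal_projection(dim_in=3072, dim_out=100)  # illustrative sizes
x = np.random.default_rng(1).standard_normal(3072)           # a column-stacked RGB image
y = Q.T @ x                                                  # projected observation
```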

Figure 2. Random projection diagram of the th image. Each RGB image was column-stacked into a vector of length . It was then projected onto a subspace of using an orthonormal set . The projection is illustrated by a gray-scale image. As can be seen, the visual properties of the image are lost through this projection.

We first apply DM separately to each set of observations. Figure 3 presents -dimensional views of the obtained -dimensional embeddings. Each subfigure presents a scatter plot of embedded data-points. Each data-point is an image (a frame in the movie) captured by a certain camera after a random projection , where is the frame index and is the camera index. The axes of the scatter plot are the first components of the embedding derived from the corresponding camera. The embedded data-points are colored according to the rotation angles and noise variables . It should be noted that this information (the color) was added after the embedding was computed and was not taken into account in its computation. The subfigure in the th column and th row contains the embedded data-points derived from the th camera , and its data-points are colored according to the rotation angle of the th arrow. In the left columns the color coding is according to , , , and in the right columns the color coding is according to , , . In other words, each row shows the same scatter plot with different color codings. The -dimensional scatter plots are rotated so that the obtained color gradient is best visualized from our -dimensional viewpoint. For example, the subfigures in the second row are derived from the observations of the second camera . The data-points in the first column are colored according to the rotation angles , in the second column according to , etc.

As can be seen, in each row, the scatter plots exhibit a smooth color gradient, from the left columns and from the right columns, corresponding to the variables sensed by the respective camera. In the left columns, the color gradients indicate accurate detection of the common variables according to the sensing matrix . In the right columns, only the diagonal subfigures exhibit a smooth color gradient, indicating that each captures only its own nuisance variable, as expected. In conclusion, Figure 3 implies that the embeddings obtained by DM provide accurate parametrizations of the hidden variables measured by each observation (camera), both the common and the nuisance variables.
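The per-sensor embeddings discussed above are produced by standard diffusion maps (Algorithm 1). A minimal numpy sketch follows; the bandwidth heuristic and the function name are our assumptions:

```python
import numpy as np

def diffusion_maps(X, dim=3, eps=None):
    """Standard diffusion maps: Gaussian kernel, row normalization,
    and the leading non-trivial eigenvectors as the embedding."""
    # Pairwise Euclidean distances between the data-points (rows of X).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    if eps is None:
        eps = np.median(D) ** 2           # heuristic kernel bandwidth
    W = np.exp(-D ** 2 / eps)             # affinity matrix
    P = W / W.sum(axis=1, keepdims=True)  # row-stochastic diffusion operator
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-vals.real)
    vals, vecs = vals.real[order], vecs.real[:, order]
    # Skip the trivial constant eigenvector (eigenvalue 1).
    return vecs[:, 1:dim + 1] * vals[1:dim + 1]
```

Applying this function to each camera's projected frames separately yields one embedding per sensor, as in Figure 3.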

Figure 3. 3D embeddings obtained by applying diffusion maps to each single observer. The subfigures are arranged such that the subfigures in each row are obtained from the same observer. The data-points in each column are colored according to the rotation angle of a different arrow.

The proposed algorithm is applied to the three sets of observations. The obtained embedding is depicted in Figure 4. The same -dimensional scatter plot of the obtained embedding is shown with different color codings. The subfigures in the top row are colored (from left to right) according to the common variables , , , while the subfigures in the bottom row are colored (from left to right) according to the nuisance variables , , . As in Figure 3, the -dimensional embedding is rotated such that the corresponding color gradient is emphasized from the depicted -dimensional point of view. The obtained color gradients show that the embedding provides a parametrization of only the common variables, meaning that the proposed algorithm manages to extract all of the common variables (despite none of them being common to all observations), while suppressing all of the nuisance observation-specific variables . Upon publication, the Matlab code and data of this toy problem will be made available online.

Figure 4. 3D embedding obtained by applying the proposed algorithm to the set of observers. The subplots in the first row are colored according to the common variables; the subplots in the second row are colored according to the noise variables. As can be seen, the obtained parametrization corresponds to the common variables.

5. Application to Sleep Stage Assessment

As mentioned above, the problem of extracting the common hidden variables from multiple data sets acquired by different observers can be perceived as a problem of nonlinear filtering. To demonstrate the potential of this nonlinear filtering scheme on real data, we apply the proposed algorithm to sleep data, where the ultimate goal is to devise an automatic system for sleep stage assessment.

Sleep is a global and recurrent physiological process, which is responsible for memory consolidation, learning, tissue regeneration, immune system enhancement, etc. [28]. The sleep dynamics are characterized by particular temporal physiological features, which are intimately related to the quality of sleep. The clinically accepted sleep staging is mainly determined by reading recorded electroencephalogram (EEG) signals based on the Rechtschaffen and Kales (R&K) criteria [31, 17]. In the R&K criteria, the sleep dynamics are divided into two broad stages: rapid eye movement (REM) and non-rapid eye movement (NREM) [28]. The NREM stage is further divided into two shallow stages, denoted N1 and N2, and a deep sleep stage, denoted N3. In addition to the interest stemming from physiological aspects, sleep stage assessment has important clinical applications. For example, REM sleep is associated with perceptual skill improvement [20], slow wave sleep is associated with Alzheimer’s disease [18], and poor sleep quality is associated with weaning failure [32]. Beyond personal health, poor sleep quality has also been implicated in several public catastrophes [8]. These facts indicate the importance of an accurate automatic annotation system for sleep stage assessment and its broad range of applications.

In the past decades, various automatic annotation methods have been proposed. These methods mainly extract various features from the EEG recordings for the purpose of studying sleep dynamics [1], such as time-domain summary statistics, spectral or coherence features, time-frequency features, and information entropy, just to name a few [19, 3, 14]. Recently, a theoretically solid approach suitable for analyzing and estimating the dynamics of the brain activity from recorded EEG signals was proposed in [40, 39]. A particular aspect of sleep dynamics, which has not gained much attention in the line of research mentioned above, is that sleep is not localized solely in the brain and is reflected in other physiological systems as well. For example, the regulation of the mechanoreceptors and chemoreceptors leads to breathing pattern variability in the respiratory signal: breathing is remarkably regular during the N3 stage, and irregular, with rapidly varying instantaneous frequency and amplitude, during the REM stage. These physiological phenomena have motivated various studies to explore the relation between the sleep stage and the patterns in the respiratory signals, e.g., [6, 16, 38]. Physiologically, these variations do not originate from the same controller, and phenomenologically they do not exhibit the same patterns in the recorded time series. Thus, while we can observe the sleep dynamics through the characteristics of different sensors, each sensor reflects only part of the sleep dynamics and is confounded by its own nature.

Based on the above physiological facts, an automatic approach for assessing the sleep stage was presented in [44]. It relies on the assumption that there exist hidden low-dimensional physiological processes driving the sleep dynamics, and hence the accessible measured signals. However, these hidden processes may be deformed by the observation procedures; each observation (e.g., an EEG channel measuring brain activity or a chest belt measuring respiration) can be influenced by nuisance factors, which are sensor- or channel-specific (e.g., the specific type of sensors and their exact positions), yet our interest is in the intrinsic variables related to the sleep stages. In [44], the empirical intrinsic geometry (EIG) method [40, 39], which is based on nonlinear independent component analysis [35] and was proven to be invariant to the measurement modality, was applied to build an intrinsic representation of the measured data. In [27], this method was extended to a pair of sensors. It was shown that by analyzing measurements taken simultaneously from two sensors, a more reliable intrinsic representation of the sleep dynamics can be obtained, compared with an analysis based only on a single signal.

In this section, we extend the algorithm presented in [27] and jointly process multiple channels. We show that extracting the underlying common variables from multiple data sets acquired by different channels systematically recovers a representation that is well correlated with the sleep stage. The analogy to the setting described in this paper is as follows. We assume that the sleep dynamics are intimately related to hidden controllers that affect the respiratory system as well as the brain neural system. These controllers are not accessible to us; yet, they can be recovered by analyzing observations from multiple channels/sensors, each capturing different, partial yet complementary aspects of them. Under this assumption, our interest is in obtaining the intrinsic variables underlying the measurements related to these controllers. On the one hand, by analyzing multiple observation channels we can gather more information on the hidden controllers. On the other hand, the observations from each channel might be deformed by the different acquisition and measurement modalities and may be affected by noise and interferences specific to the particular (type of) sensor. In the context of this work, this tradeoff is addressed by defining the intrinsic variables (related to the hidden controllers of interest) as those which are not sensor-specific; hence, the variables of interest are those that are common to at least two observables.

Twenty subjects without sleep apnea were chosen for this study. The demographic characteristics of these individuals fall within the normal ranges. We used recordings of hours per subject, which were performed in the sleep center at Chang Gung Memorial Hospital (CGMH), Linkou, Taoyuan, Taiwan. The institutional review board of the CGMH approved the study protocol (No. 101-4968A3) and the enrolled subjects provided written informed consent. See [44] for more details regarding the experimental setting and the collected data.

We build the common graph according to Algorithm 4 to extract the common hidden variables, separately for two sets of sensors. The first set includes signals: abdominal and chest motions, which are recorded by piezo-electric bands, and airflow, which is measured using thermistors and nasal pressure, all at a sampling rate of  Hz. The second set comprises recordings from EEG channels: C3A2, C4A1, O1A2 and O2A1, at a sampling rate of  Hz. The recorded respiratory signals are denoted by and the EEG signals are denoted by .

Prior to the application of our method, each of the single-channel recordings was preprocessed by applying the scattering transform as in [44], which was shown to improve the regularity and stability of signals with respect to various deformations [29]. We then apply Algorithm 4 separately twice: once to the respiratory set, and once to the EEG set.

In order to demonstrate the inherent “sensor selection” capability of the proposed method, for each set of measurements we added an artificial “pure noise” sensor to simulate a possible sensor failure. Because our processing pipeline begins with the application of the scattering transform, which automatically suppresses any stationary noise, the noise sensor consists of a non-stationary sequence generated by modulating a sine wave according to


where is the continuous-time signal and can be viewed as the instantaneous frequency of . We sample the obtained modulated sine wave at sampling rates of  Hz and  Hz for the EEG set and the respiratory set, respectively. It should be noted that this particular non-stationary “noise” implementation was chosen solely for the sake of demonstration, and any other non-stationary sequence could be used instead.
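A non-stationary noise sensor of this kind can be generated, for example, by frequency-modulating a sine wave. The modulation law, rates, and parameter values below are our own illustrative choices, not those used in the experiments:

```python
import numpy as np

def modulated_sine(duration_s, fs, f0=1.0, f_dev=0.5, f_mod=0.05):
    """Frequency-modulated sine: the instantaneous frequency slowly
    wanders around f0, so the sequence is non-stationary."""
    n = int(round(duration_s * fs))
    t = np.arange(n) / fs
    inst_freq = f0 + f_dev * np.sin(2 * np.pi * f_mod * t)  # slow FM law
    phase = 2 * np.pi * np.cumsum(inst_freq) / fs           # integrate frequency
    return np.sin(phase)

# One noise channel per modality, sampled at that modality's rate
# (the rates here are placeholders, not the recording rates).
noise_resp = modulated_sine(duration_s=60, fs=100)
noise_eeg = modulated_sine(duration_s=60, fs=200)
```

Because the instantaneous frequency drifts over time, this sequence is not suppressed by time-averaged (stationary) feature extractors, which is the property the artificial sensor is meant to exercise.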

We compare the results of the common graph algorithm, which analyzes multiple sensors, with the results attained by standard DM applied separately to each individual sensor. In addition, we compare the results to two competing schemes that analyze multiple sensors. In the first scheme, we concatenate the scattering transform components from all sensors and then apply standard DM. We note that, conceptually, this scheme takes into account the information captured by all the sensors without any filtering. We refer to this scheme as the concatenation scheme. In the second scheme, we apply AD to the entire set of sensors. Namely, we calculate the diffusion kernel for the th sensor, where , and build an AD kernel based on the product of all the kernels, that is,
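The product-of-kernels construction just described can be sketched as follows; the Gaussian per-sensor kernels, the normalization order, and the function names are our assumptions:

```python
import numpy as np

def sensor_kernel(X, eps=None):
    """Row-stochastic Gaussian diffusion kernel for a single sensor."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    if eps is None:
        eps = np.median(D) ** 2   # heuristic kernel bandwidth
    W = np.exp(-D ** 2 / eps)
    return W / W.sum(axis=1, keepdims=True)

def alternating_diffusion_kernel(sensor_data):
    """Product of the per-sensor kernels. Each factor is row-stochastic,
    so the product is itself a row-stochastic diffusion operator."""
    K = None
    for X in sensor_data:
        P = sensor_kernel(X)
        K = P if K is None else K @ P
    return K
```

The resulting operator can then be fed into the same spectral embedding step as in the other schemes, which is what makes the three pipelines directly comparable.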