## 1 Introduction

Most of the data used in video surveillance, social computing, and environmental sciences are collected from diverse domains or obtained from various feature extractors. These data are heterogeneous, because their variables can be naturally partitioned into groups. Each variable group is referred to as a particular view, and the multiple views for a particular problem can take different forms. For example, a sparse camera network containing multiple cameras is used for person re-identification and understanding global activity through color descriptors, local binary patterns, local shape descriptors, slow features and spatial temporal contexts.

The information obtained from an individual view cannot comprehensively describe all examples. It has therefore become popular to leverage the information derived from the connections and differences between multiple views to better describe the objects, which has resulted in multi-view learning algorithms that integrate multiple features from diverse views (or simply multi-view features).

Recently, a number of multi-view learning algorithms have been designed and successfully applied to various computer vision and intelligent system problems [1, 2]. Co-training [3] is one of the earliest semi-supervised schemes for multi-view learning. It trains alternately to maximize the mutual agreement on two distinct views of the unlabeled data. Many variants [4, 5, 6, 7, 8, 9, 10, 11], such as co-EM [4] and co-regularization [6, 7], have since been developed. Their success relies on the assumption that the two sufficient and redundant views are conditionally independent of each other given the class label. However, this assumption tends to be too strict for many practical applications, and thus some alternative assumptions have been studied. [12] showed that weak dependence can also guarantee successful co-training. [13] proved that a weaker assumption called $\epsilon$-expansion is sufficient for iterative co-training to succeed. After that, Wang and Zhou conducted a series of in-depth analyses and revealed some interesting properties of co-training, including the large diversity of classifiers [14, 15], label propagation over two views [16] and co-training with insufficient views [17].

Multiple kernel learning (MKL) was originally developed to control the search space capacity of possible kernel matrices to achieve good generalization but has been widely applied to problems involving multi-view data [18, 19]. This is because kernels in MKL naturally correspond to different views and combining kernels either linearly or non-linearly improves learning performance, especially when the views are assumed to be independent. [20, 21] formulated MKL as a semi-definite programming problem. [22] treated MKL as a second order cone program problem and developed an SMO algorithm to efficiently obtain the optimal solution. [23, 24] developed an efficient semi-infinite linear program and made MKL applicable to large scale problems. [25, 26] proposed simple MKL by exploring an adaptive 2-norm regularization formulation. [27, 28] constructed the connection between MKL and group-LASSO to model group structure.

A number of works exploit the shared latent subspace across diverse views, such as canonical correlation analysis (CCA) [29, 30, 31], its kernel extension [32], its probabilistic interpretation [33], and its sparse formulation [34]. Recently, other methodologies have been proposed for this task: [35, 36] used Gaussian processes to discover latent variable models shared by multi-view data; [37, 38] found the joint embedding for multi-view data by maximizing mutual information. These techniques are particularly effective for modeling the correlations between different views. To simultaneously account for the dependencies and independencies of different input views, various methods have been introduced that factorize the latent space into a shared part common to all views and a private part for each view [39, 40]. By considering the side information, the recent work on the max-margin Harmonium (MMH) [41] showed that applying the large-margin principle to learn a subspace shared by multi-view data is more suitable for prediction.

However, existing multi-view learning methods have their own limitations. First, since an individual view is insufficient for learning, integration of multi-view information is necessary and valuable; however, besides several works concentrating on co-training style algorithms [42, 17], the issue of single-view insufficiency has not been clearly addressed and comprehensively studied. Second, the term “intact” means *complete* and *not damaged* in Merriam-Webster, which are exactly the two favorable properties we wish the latent intact space to possess. However, most of the existing multi-view learning algorithms fail to discover latent intact spaces, due to the information loss in learning from insufficient views or the influence of noise in insufficient views. Finally, theoretical support is needed to guarantee the performance of multi-view learning.

In this paper, we assume that while each individual view only captures partial information, all the views together possess redundant information of the object. In contrast to most existing multi-view learning models that assume view sufficiency, we propose a Multi-view Intact Space Learning (MISL) algorithm to address insufficiency in each individual view and to integrate the encoded complementary information. The new view functions in the MISL algorithm are rigorously studied. To enhance the robustness of the model, we measure the reconstruction error from different views using the Cauchy loss, which is robust to outliers and has an optimal breakdown point compared with the conventional $\ell_1$ and $\ell_2$ losses [43]. To solve the two sub-problems w.r.t. the view generation functions and the intact space derived from the optimization problem, we develop an Iteratively Reweight Residuals (IRR) optimization technique, which can be implemented efficiently and has guaranteed convergence. Although each view only captures partial information of the latent intact space, MISL theoretically guarantees that, given enough views, the latent intact space can be approximately restored. We introduce a new definition of “multi-view stability” to analyze the robustness of the proposed algorithm. Moreover, we derive the generalization error bound based on the multi-view stability and Rademacher complexity, and show that the complementarity of multiple views can improve the multi-view stability and the generalization. Finally, we conduct experiments to explicitly illustrate the view insufficiency assumption and the robustness of the proposed algorithm and show, using real-world datasets, that our approach can accurately discover an intact space for the subsequent classification tasks.

The rest of the paper is organized as follows. In Section 2, we formulate the multi-view learning problem and propose the MISL algorithm. The optimization method is presented in Section 3 and theoretical analysis is given in Section 4. Section 5 presents the experimental results, and Section 6 concludes the paper. The detailed proofs of the theoretical results are in Section 7.

## 2 Problem Formulation

View sufficiency is usually not guaranteed in practice. By contrast, we assume “view insufficiency”, i.e., that each view only captures partial information but all the views together carry redundant information about the latent intact representation (shown in Figure 1). Many practical problems support this assumption. For example, in a camera network, cameras are placed in public areas to predict potentially dangerous situations in time to take necessary action. However, each camera alone captures insufficient information and thus cannot comprehensively describe the environment, which can only be fully recovered by integrating the data from all the cameras.

In multi-view learning, an example is represented by multi-view features $z = \{z^1, \ldots, z^m\}$, where $m$ is the number of views and $z^v \in \mathbb{R}^{D_v}$. Supposing $x \in \mathbb{R}^d$ is the latent intact representation, each view $z^v$ is a particular reflection of the example, obtained from the view generation function $f_v$ on $x$,

$$z^v = f_v(x) + \varepsilon^v, \tag{1}$$

where $\varepsilon^v$ is the view-dependent noise. According to the view insufficiency assumption, the function $f_v$ is non-invertible, so we cannot recover $x$ from $z^v$ even given the view function $f_v$. For a linear function $f_v(x) = W_v x$, non-invertibility implies that $W_v$ is not column full-rank.

Hence, our objective is to learn a series of view generation functions that generate multi-view data points from a latent intact space. A straightforward approach is to minimize the empirical risk using the $\ell_1$ or $\ell_2$ loss. Given Eq. (1), the noise in different views seriously influences the discovery of the optimal view generation functions and the latent intact space. However, as thoroughly studied in robust statistics [44], neither the $\ell_1$ nor the $\ell_2$ loss is robust to outliers, and thus the performance of multi-view learning will be seriously degraded.

### 2.1 Robust Estimators

M-estimators are popular in robust statistics. Let $r_i$ be the residual of the $i$-th data point, i.e., the difference between the $i$-th observation and its fitted value. The standard least-squares method minimizes $\sum_i r_i^2$, which is unstable in the presence of outliers because they strongly distort the estimated parameters. M-estimators reduce the effect of outliers by replacing the squared residuals with another function of the residuals,

$$\min \sum_i \rho(r_i), \tag{2}$$

where $\rho$ is a symmetric, positive-definite function with a unique minimum at zero, chosen to increase more slowly than the square function. The corresponding influence function is defined as

$$\psi(x) = \frac{\partial \rho(x)}{\partial x}, \tag{3}$$

which measures the influence of a random data point on the value of the parameter estimate.

As shown in Figure 2, for the $\ell_2$ estimator (least-squares) with $\rho(x) = x^2/2$, the influence function is $\psi(x) = x$; that is, the influence of a data point on the estimate increases linearly with the size of its error. This confirms the non-robustness of the least-squares estimate. Although the $\ell_1$ (absolute value) estimator with $\rho(x) = |x|$ reduces the influence of large errors, its influence function has no cut-off. For a robust estimator, the influence of any single observation should be insufficient to yield a significant offset. The Cauchy estimator has this valuable property,

$$\rho(x) = \log\left(1 + (x/c)^2\right), \tag{4}$$

along with the upper-bounded influence function

$$\psi(x) = \frac{2x}{c^2 + x^2}. \tag{5}$$

Furthermore, the Cauchy estimator [45] theoretically has a breakdown point of nearly $50\%$, which means that nearly half of the observations (e.g., arbitrarily large observations) can be incorrect before the estimator gives an incorrect (e.g., arbitrarily large) result. Therefore, we deploy the Cauchy estimator in MISL.
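The contrast between the three estimators can be checked numerically. The sketch below assumes the Cauchy loss takes the form $\rho(r) = \log(1 + (r/c)^2)$, so that its influence function is $2r/(c^2 + r^2)$; the function names are illustrative and not taken from the paper.

```python
import numpy as np

# Hypothetical helper names; rho is the loss, psi its derivative (influence function).
def rho_l2(r):
    return 0.5 * r ** 2

def psi_l2(r):
    return r                      # grows without bound

def rho_l1(r):
    return np.abs(r)

def psi_l1(r):
    return np.sign(r)             # bounded, but never decays

def rho_cauchy(r, c=1.0):
    # Assumed form of the Cauchy loss with scale parameter c.
    return np.log1p((r / c) ** 2)

def psi_cauchy(r, c=1.0):
    # Derivative of rho_cauchy; peaks at |r| = c with value 1/c, then decays to 0.
    return 2.0 * r / (c ** 2 + r ** 2)

residuals = np.array([0.1, 1.0, 10.0, 1000.0])
for name, psi in [("l2", psi_l2), ("l1", psi_l1), ("cauchy", psi_cauchy)]:
    print(name, psi(residuals))
```

Under this form, the influence of a residual is largest near $|r| = c$ and shrinks toward zero for gross outliers, which is the behaviour the text relies on.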

### 2.2 Multi-view Intact Space Learning

We consider a multi-view training sample $\{z_i^1, \ldots, z_i^m\}_{i=1}^n$ whose view number is $m$ and sample size is $n$. The reconstruction error over the latent intact space can be measured using the Cauchy loss

$$\frac{1}{mn} \sum_{v=1}^{m} \sum_{i=1}^{n} \log\left(1 + \frac{\|z_i^v - W_v x_i\|^2}{c^2}\right), \tag{6}$$

where $x_i$ is a data point in the latent intact space, $W_v$ is the $v$-th view generation matrix, and $c$ is a constant scale parameter.

Moreover, we adopt regularization terms to penalize the latent data points $x_i$ and the view generation matrices $W_v$. The resulting objective function can be written as

$$\min_{\{x_i\}, \{W_v\}} \; \frac{1}{mn} \sum_{v=1}^{m} \sum_{i=1}^{n} \log\left(1 + \frac{\|z_i^v - W_v x_i\|^2}{c^2}\right) + C_1 \sum_{v=1}^{m} \|W_v\|_F^2 + C_2 \sum_{i=1}^{n} \|x_i\|_2^2, \tag{7}$$

where $C_1$ and $C_2$ are non-negative constants that can be determined using cross validation. Problem (7) jointly models the relationships between the latent intact space and each view space using a robust approach. It can be expected that by solving this problem with an input of multiple insufficient views, a series of view generation functions and a latent intact space can be found that represents the object in its entirety.

At inference, given a new multi-view example , the corresponding data point in the intact space can be obtained by solving the problem

(8) |

where is the optimal view generation function.
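The inference step can be sketched in a few lines. The example assumes linear view functions $W_v$ and an inference objective of the form $\frac{1}{m}\sum_v \log(1 + \|z^v - W_v x\|^2/c^2) + C_2\|x\|^2$, which is an assumption about the form of Eq. (8) rather than a copy of it; `infer_latent` and its parameters are hypothetical names. Setting the gradient to zero and freezing the weights $q_v = 1/(c^2 + \|r_v\|^2)$ gives a reweighted linear system that is solved repeatedly.

```python
import numpy as np

def infer_latent(views, Ws, c=1.0, C2=1e-3, n_iter=50, tol=1e-8):
    """Estimate the latent point x for one multi-view example.

    views: list of m vectors z^v; Ws: list of m matrices W_v (learned beforehand).
    Assumed objective: (1/m) sum_v log(1 + ||z^v - W_v x||^2 / c^2) + C2 ||x||^2.
    """
    m = len(views)
    d = Ws[0].shape[1]
    x = np.zeros(d)
    for _ in range(n_iter):
        A = m * C2 * np.eye(d)
        b = np.zeros(d)
        for z, W in zip(views, Ws):
            r = z - W @ x                      # residual of this view
            q = 1.0 / (c ** 2 + r @ r)         # small weight for large residuals
            A += q * W.T @ W
            b += q * W.T @ z
        x_new = np.linalg.solve(A, b)
        converged = np.linalg.norm(x_new - x) < tol
        x = x_new
        if converged:
            break
    return x

# Toy check: three 2-D projections of a 3-D point, one view grossly corrupted.
x_true = np.array([1.0, 2.0, 3.0])
Ws = [np.array([[1.0, 0, 0], [0, 1.0, 0]]),    # X-Y plane
      np.array([[1.0, 0, 0], [0, 0, 1.0]]),    # X-Z plane
      np.array([[0, 1.0, 0], [0, 0, 1.0]])]    # Y-Z plane
views = [W @ x_true for W in Ws]
views[2] = views[2] + np.array([50.0, 0.0])    # outlier in the Y-Z view
x_hat = infer_latent(views, Ws)
print(np.round(x_hat, 3))
```

In this toy setting each coordinate is observed by two views, so the clean view can outvote the corrupted one once the corrupted residual has been down-weighted; an ordinary least-squares fit would instead average the two.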

### 2.3 Kernel Extension

When the view space lies in an infinite-dimensional Hilbert space, there exists a nonlinear mapping , such that . Furthermore, we denote the learned projection function in the feature space as . Assuming atoms of lie in the space spanned by the input data, we can write , where is the atom representation matrix and . The kernel extension of Problem (7) can then be obtained through

and

(9) |

where and are the kernel matrix and the atom matrix of view- respectively. The kernelized problem can be optimized using the same technique as the linear MISL defined by Eq. (7).

## 3 Optimization

Problem (7) can be decomposed into two sub-problems over the view generation function and the latent intact space using the alternating optimization method. Inspired by the generalized Weiszfeld’s method [46], we develop an Iteratively Reweight Residuals (IRR) algorithm to efficiently optimize these two subproblems.

Given fixed view generation functions , Eq. (7) can be minimized over each latent point in the latent intact space ,

(10) |

Setting the gradient of with respect to to , we have

(11) |

which can be rewritten as

(12) |

where is referred to as the residual of the example on each view. A weight function is then defined as

(13) |

which can be used to reduce the influence of outliers and adjust the errors introduced by different views. Based on Eqs. (12) and (13), we have

(14) |

Considering depends on , we thus iteratively update using Eq. (14) with an initial estimate until convergence. The iterative procedure is described in Algorithm 1.

By fixing all data points in the intact space , Eq. (7) is reduced to the minimization over each view generation function

(15) |

Given the residual and the weight function on the training data

(16) |

we can update the projection function by

(17) |

Similar to the optimization over the latent intact space, we can also estimate the projection function by Algorithm 1.
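Since Algorithm 1 is not reproduced in this section, the following is one possible reading of the alternating IRR scheme, not the authors' implementation. It assumes linear views and the objective $\frac{1}{mn}\sum_{v,i}\log(1 + \|z_i^v - W_v x_i\|^2/c^2) + C_1\sum_v\|W_v\|_F^2 + C_2\sum_i\|x_i\|^2$; all function names are hypothetical, and each sub-problem takes a single reweighted least-squares step per outer iteration.

```python
import numpy as np

def objective(Z, X, Ws, c, C1, C2):
    # Assumed objective: mean Cauchy reconstruction loss plus two ridge penalties.
    m, n = len(Z), X.shape[0]
    loss = sum(np.log1p(((Zv - X @ W.T) ** 2).sum(axis=1) / c ** 2).sum()
               for Zv, W in zip(Z, Ws))
    reg = C1 * sum((W ** 2).sum() for W in Ws) + C2 * (X ** 2).sum()
    return loss / (m * n) + reg

def weights(Zv, X, W, c):
    # One weight per example: 1 / (c^2 + squared residual norm).
    R = Zv - X @ W.T
    return 1.0 / (c ** 2 + (R ** 2).sum(axis=1))

def update_X(Z, X, Ws, c, C2):
    m, (n, d) = len(Z), X.shape
    Q = [weights(Zv, X, W, c) for Zv, W in zip(Z, Ws)]
    X_new = np.empty_like(X)
    for i in range(n):
        A = m * n * C2 * np.eye(d)
        b = np.zeros(d)
        for Zv, W, q in zip(Z, Ws, Q):
            A += q[i] * W.T @ W
            b += q[i] * W.T @ Zv[i]
        X_new[i] = np.linalg.solve(A, b)
    return X_new

def update_W(Zv, X, W, c, C1, m):
    n, d = X.shape
    q = weights(Zv, X, W, c)
    XQ = X * q[:, None]
    A = X.T @ XQ + m * n * C1 * np.eye(d)     # sum_i q_i x_i x_i^T + m n C1 I
    B = Zv.T @ XQ                             # sum_i q_i z_i x_i^T
    return np.linalg.solve(A, B.T).T          # W = B A^{-1} (A is symmetric)

def fit_misl(Z, d, c=1.0, C1=1e-3, C2=1e-3, n_outer=20, seed=0):
    rng = np.random.default_rng(seed)
    n, m = Z[0].shape[0], len(Z)
    X = rng.standard_normal((n, d))
    Ws = [rng.standard_normal((Zv.shape[1], d)) for Zv in Z]
    history = [objective(Z, X, Ws, c, C1, C2)]
    for _ in range(n_outer):
        X = update_X(Z, X, Ws, c, C2)
        Ws = [update_W(Zv, X, W, c, C1, m) for Zv, W in zip(Z, Ws)]
        history.append(objective(Z, X, Ws, c, C1, C2))
    return X, Ws, history

# Synthetic demo: three 4-D views generated from 3-D latent points.
rng = np.random.default_rng(1)
X_true = rng.standard_normal((30, 3))
W_true = [rng.standard_normal((4, 3)) for _ in range(3)]
Z = [X_true @ W.T + 0.01 * rng.standard_normal((30, 4)) for W in W_true]
X, Ws, history = fit_misl(Z, d=3)
print(history[0], history[-1])
```

Because each step minimizes a quadratic upper bound of the assumed objective, the recorded objective values should not increase from one outer iteration to the next, which gives a cheap sanity check on an implementation.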

## 4 Theoretical Analysis

In this section, we analyze the convergence of the optimization technique, present a new definition of “multi-view stability”, and then derive the generalization error bound of the proposed multi-view learning algorithm. The detailed proofs are given in Section 7.

### 4.1 Convergence Analysis

We employ the majorize-minimize framework from [46] to analyze the convergence of the IRR algorithm. The key idea of this framework is to globally approximate using a sequence of quadratic functions. Taking the subproblem over as an example, after having found , we can construct a quadratic function to upper bound such that the following conditions hold:

(18) |

Then has the form

(19) |

with symmetric matrix

(20) |

Therefore, we have the following theorem to guarantee the convergence of the IRR algorithm.

###### Theorem 1.

The IRR algorithm update in Eq. (14) guarantees that the sequence is monotonic, i.e., for all , and the sequence will converge to the local minimizer of .
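The reason such a quadratic upper bound exists can be sketched as follows, assuming the Cauchy loss takes the form $\log(1 + \|r\|^2/c^2)$ with residual $r = z - Wx$ and current residual $r_k = z - Wx_k$ (this notation is an assumption, and constants may differ from those in Eqs. (18)-(20)). Because the logarithm is concave, its tangent line at the current iterate lies above it:

$$\log\left(1 + \frac{\|r\|^2}{c^2}\right) \;\le\; \log\left(1 + \frac{\|r_k\|^2}{c^2}\right) + \frac{\|r\|^2 - \|r_k\|^2}{c^2 + \|r_k\|^2},$$

with equality when $r = r_k$. The right-hand side is quadratic in $x$, agrees with the loss in both value and gradient at $x_k$, and carries the weight $1/(c^2 + \|r_k\|^2)$. Minimizing this bound together with the quadratic regularizer is a weighted least-squares problem, and its minimizer cannot increase the original objective, which is the monotonicity claimed in Theorem 1.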

### 4.2 View Insufficiency Analysis

Information theory provides a natural channel to explain the view insufficiency assumption. In particular, for discrete random variables $A$, $B$ and $C$, the conditional mutual information $I(A; B \mid C)$ measures how much information is shared between $A$ and $B$ conditioned on already known $C$.

Given an active view set in the multi-view learning setting, each time we randomly generate a view from the latent intact space and add it to the set. For example, beginning with a view set that already contains one view, the insufficiency of a newly generated view can be measured by

(21) |

where is a variable larger than zero. Conventional view sufficiency assumption with some small [47] states that both and are redundant with regards to their information about . By contrast, our view insufficiency assumption (i.e., Eq. (21)) implies that an individual view cannot sufficiently describe , and therefore each view will carry additional information about that other views do not have. For the current view set , the information obtained with respect to is measured using

(22) |

According to the chain rule of mutual information, we have the following proposition.

###### Proposition 1.

Considering there are randomly generated views from the latent intact space , the information obtained to learn can be measured by

Proposition 1 suggests that more views will bring in more information with respect to . Although each individual view is insufficient, we can receive abundant information to learn the latent intact space by exploiting the complementarity between multiple views.
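The chain rule behind this statement can be verified on a small discrete example. The sketch below uses a random joint distribution over a latent variable $X$ and two views $Z_1$, $Z_2$ (these names and alphabet sizes are illustrative assumptions) and checks that $I(X; Z_1, Z_2) = I(X; Z_1) + I(X; Z_2 \mid Z_1)$.

```python
import numpy as np

def entropy(p):
    # Shannon entropy in bits of a probability table of any shape.
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def mutual_info(p_ab):
    # I(A;B) from a 2-D joint table: H(A) + H(B) - H(A,B).
    return entropy(p_ab.sum(axis=1)) + entropy(p_ab.sum(axis=0)) - entropy(p_ab)

rng = np.random.default_rng(0)
p = rng.random((3, 2, 2))          # axes: X, Z1, Z2
p /= p.sum()

i_x_z1 = mutual_info(p.sum(axis=2))            # I(X; Z1)
i_x_z1z2 = mutual_info(p.reshape(3, 4))        # I(X; Z1, Z2)
# I(X; Z2 | Z1) = H(X, Z1) + H(Z1, Z2) - H(Z1) - H(X, Z1, Z2)
i_x_z2_given_z1 = (entropy(p.sum(axis=2)) + entropy(p.sum(axis=0))
                   - entropy(p.sum(axis=(0, 2))) - entropy(p))

print(i_x_z1, i_x_z2_given_z1, i_x_z1z2)
```

Since conditional mutual information is non-negative, adding a view can only keep or increase the information available about the latent variable.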

Assume that the latent intact space can be completely captured by the ideal view set . Let denote the optimal solution with respect to , i.e., , where is the expectation of the loss function over the samples on different views. Since could be very large, we randomly select views from to construct a smaller view set for multi-view learning. Sridharan and Kakade [47] developed a significant information theoretical framework to analyze multi-view learning algorithms, based on which we show the learning error of the newly proposed algorithm is bounded by the following theorem.

###### Theorem 2.

Given the ideal view set , and the bounded loss function , the expected losses of multi-view learning based on and its subset are denoted by and . Their difference is bounded by

which will decrease with increasing .

According to Theorem 2, although we cannot obtain all the necessary views to learn the latent intact space , we can approximately restore it when provided with enough views.

### 4.3 Generalization Error Analysis

In this section, we propose a new definition of *multi-view stability* and use it with the Rademacher complexity to analyze the generalization error of the proposed multi-view learning algorithm.

In learning theory, algorithmic stability [48] is employed to measure the variation of the output for different training sets that are identical up to removal or change of a single example. We apply this idea to multi-view learning, and then propose the definition of multi-view stability.

###### Definition 1.

Given and , where is the view function on -view. The function class is said to have multi-view stability , if for any two multi-view examples and that differ only at a single coordinate on an individual view, .

Stability characterizes a system’s persistence against the perturbation of variables. On the other hand, since the perturbed variables can be regarded as the outliers (or at least noised examples) for the system, the stability has an implicit connection with robustness. Hence, the theoretical analysis based on the stability here is expected to deliver similar conclusions from the perspective of robustness.

Conventional concentration inequalities, e.g., Hoeffding’s inequality and McDiarmid’s inequality, are designed under the independent and identically distributed (i.i.d.) assumption. However, in more complex cases (e.g., multi-view learning), the variables can depend on each other, which calls for concentration inequalities developed for dependent random variables. By leveraging recent results in the concentration of dependent random variables [49], we use the following concentration inequality for our analysis on multi-view learning.

###### Theorem 3.

Let , and be the coefficient matrix measuring variable dependence. For any two multi-view examples and that differ only at a single coordinate on an individual view, . Then for any ,

Before proceeding, we define some notations for convenience. Given a multi-view example and multi-view functions , we define . For any , let

Besides multi-view stability, we employ the Rademacher complexity to measure the hypothesis complexity, and its definition is adapted from [50] by removing the assumption that are i.i.d.

###### Definition 2.

Let be a set of random variables. Let be a set of independent Rademacher random variables, with zero mean and unit standard deviation. The empirical Rademacher complexity of is

(23) |

The Rademacher complexity of is .
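For a finite function class, the empirical Rademacher complexity can be approximated by Monte Carlo sampling of the sign variables. The sketch below uses the convention $\mathbb{E}_\sigma \sup_f \frac{1}{n}\sum_i \sigma_i f(z_i)$; the normalization in Eq. (23) may differ (some texts use a factor of $2/n$), and `empirical_rademacher` is a hypothetical helper.

```python
import numpy as np

def empirical_rademacher(F_values, n_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity.

    F_values: array of shape (num_functions, n) with F_values[j, i] = f_j(z_i).
    Convention assumed here: E_sigma[ max_j (1/n) * sum_i sigma_i * f_j(z_i) ].
    """
    rng = np.random.default_rng(seed)
    n = F_values.shape[1]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))   # Rademacher signs
    sups = (sigma @ F_values.T).max(axis=1) / n          # sup over the class per draw
    return sups.mean()

rng = np.random.default_rng(0)
F_small = rng.standard_normal((2, 50))                       # 2 functions, 50 points
F_big = np.vstack([F_small, rng.standard_normal((20, 50))])  # superset of F_small
r_small = empirical_rademacher(F_small)
r_big = empirical_rademacher(F_big)
print(r_small, r_big)
```

Enlarging the class can only increase the supremum for every draw of signs, so the estimate for the larger class is at least that of the smaller one when the same draws are used.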

With these definitions, we now present our main result.

###### Theorem 4.

We can directly apply Theorem 4 to the setting in which there are i.i.d. multi-view examples.

###### Corollary 1.

Fix , and . Let be a set of multi-view examples. If has multi-view stability , then with probability at least over ,

Through the theoretical analysis, we find that the generalization error of multi-view learning algorithms can be well bounded by the multi-view stability and Rademacher complexity of the hypothesis. Next we proceed to analyze the specific multi-view stability and Rademacher complexity for the proposed MISL algorithm.

#### 4.3.1 Multi-view Stability

For any multi-view example , we consider that there is a perturbation at -th coordinate of the -th view. The new multi-view example is thus written as , where has only one non-zero element at -th coordinate. We suppose .

The following proposition illustrates the multi-view stability of the proposed MISL algorithm.

###### Proposition 2.

Given and , where is the view function on -view. The function class learned through MISL algorithm has multi-view stability

According to the above analysis, we know that multiple views together determine the multi-view stability. If one view is perturbed by outliers or noise, the other clean views will act to alleviate the influence of the perturbation and keep the output results unchanged. This resistance is further strengthened as the number of clean views increases. Thus, the multi-view learning performance is improved through the cooperation between multiple views. However, if we model all views independently, this cooperation will be ignored.

#### 4.3.2 Rademacher Complexity

We proceed to bound the Rademacher complexity of by first bounding its covering number [51, 52]. Given , we assume that lies in a sphere with radius . The covering number of can be bounded by the following lemma.

###### Lemma 1.

For and , the covering number of is upper bounded by

(24) |

where the exponent is the dimension of the constraint set in the sense of its manifold structure.

Obviously, can be upper bounded using . We suggest that if there exist dependencies between multiple views, simultaneously handling multiple views is advantageous over modeling each view independently. In particular, the connections between matrices will decrease the intrinsic dimension of , lead to a lower , and then improve the covering number.

By applying the discretization theorem on the covering number, we can easily get the following proposition to bound the Rademacher complexity of the proposed algorithm.

###### Proposition 3.

The Rademacher complexity of learned through MISL algorithm is bounded by

Finally, by integrating Propositions 2 and 3 with Corollary 1, we can easily get the generalization error bound of the proposed MISL algorithm. Though some views may be noisy, multi-view stability can preserve the performance of multi-view learning by exploiting the information on the other accurate views. The dependencies between views can decrease the complexity of the hypothesis space, and thus improve the generalization error bound.

## 5 Experiments

In this section, we present qualitative as well as quantitative evaluations on two toy examples and three real-world datasets. The proposed MISL algorithm and its kernel extension KMISL were compared with the convex Multi-view Subspace Learning algorithm (MSL) [53], the Factorized Latent Spaces with Structured Sparsity algorithm (FLSSS) [40], and the shared Gaussian Latent Variable Model (sGPLVM) [36].

### 5.1 Toy Examples

First, we evaluated our approach on the problem of 3-D point cloud reconstruction. For the 3-D model provided by [54], we extracted the point cloud data (i.e., point positions in rectangular space coordinates), and then projected it onto three 2-D planes (i.e., the X-Y, X-Z, and Y-Z planes) as the three base views. To further validate the robustness of the proposed algorithm, we generated additional noisy views from these base views. Given a randomly placed window of fixed size on each base view, we added noise to the data points in it, and thus obtained a series of distinct noisy views for multi-view learning.
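The view construction described above can be sketched as follows. The window size, the Gaussian noise model and its scale are assumptions made for illustration (the text only states that noise is added inside a randomly placed window), and `base_views` and `noisy_view` are hypothetical names.

```python
import numpy as np

def base_views(points):
    """Project an (n, 3) point cloud onto the X-Y, X-Z and Y-Z planes."""
    return [points[:, [0, 1]], points[:, [0, 2]], points[:, [1, 2]]]

def noisy_view(view, window=0.3, sigma=0.5, rng=None):
    """Perturb only the points that fall inside a randomly placed square window.

    window: side length of the window (assumed value);
    sigma: standard deviation of the added Gaussian noise (assumed noise model).
    """
    rng = np.random.default_rng() if rng is None else rng
    lo, hi = view.min(axis=0), view.max(axis=0)
    corner = lo + rng.random(2) * np.maximum(hi - lo - window, 0.0)
    inside = np.all((view >= corner) & (view <= corner + window), axis=1)
    out = view.copy()
    out[inside] += sigma * rng.standard_normal((inside.sum(), 2))
    return out, inside

rng = np.random.default_rng(0)
points = rng.random((500, 3))            # stand-in for a real 3-D model
views = base_views(points)
noisy, inside = noisy_view(views[0], rng=rng)
print(inside.sum(), "points perturbed")
```

Calling `noisy_view` several times on each base view yields a set of distinct noisy views, each corrupted in a different region.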

In Figure 3, we show the 3-D latent intact spaces discovered by the MISL algorithm with different numbers of noisy views under distinct signal-to-noise ratios (SNR) as input for the vase model. Specifically, Figure 3 (a) shows the initial 3-D point cloud in red and its corresponding meshed result in green. For each base view, we randomly generated 3, 6 and 9 noisy views, giving three different settings of 9, 18 and 27 noisy views in total for training. Given , we show the intact spaces discovered from these three settings in Figure 3 (b), (c) and (d), respectively. When , the intact spaces are presented in Figure 3 (e), (f) and (g). The reconstruction results under similar settings for the chair and compel models are presented in Figures 4 and 5, respectively. From these reconstruction results, we find that if the signal is cleaner (i.e., associated with a higher SNR), the reconstructed object will be more accurate. More importantly, although each individual view is noisy and insufficient for completely inferring the 3-D objects, the MISL algorithm is robust to outliers and can accurately discover the intact space by utilizing the complementarity between these views.

The second toy example is based on the synthetic “S-curve” data shown in Figure 6 (a). We uniformly sampled 3-dimensional data points (see Figure 6 (b)), and then projected them onto three 2-D planes (i.e., the X-Y, X-Z, and Y-Z planes) as three views (see Figure 6 (c)). Based on these three views, we find that the proposed MISL algorithm can effectively reconstruct the S-curve, as shown in Figure 6 (d). We further added noise to the clean views to evaluate the robustness of MISL. Figures 6 (e) and (g) show the noisy views of and , respectively. In particular, it is already difficult for a human to make out the original curve from the heavily corrupted views in Figure 6 (e). Thanks to the robustness of MISL, the noise can be appropriately handled, and the intact spaces (see Figures 6 (f) and (h)) under and are approximately restored.

### 5.2 Face Recognition

The CMU PIE face database [55] contains 41,368 images of 68 people, each person under 13 different poses, 43 different illumination conditions, and with 4 different expressions. To construct the multi-view setting, we selected two near frontal poses (i.e., C9 and C29) as two views. Therefore a pair of images of one person under these two poses with the same illumination can be seen as a two-view example. Different algorithms were used to project the multi-view faces into some appropriate spaces for face recognition.

images of each person were randomly selected for training, and the rest for testing. The $k$-nearest neighbor method based on the Euclidean distance was applied for face recognition, where $k$ was set to 3. Given the noisy views of and , the face recognition accuracies of different algorithms in spaces of different dimensionality are shown in Figure 7 (a) and (b).

From Figure 7, we find that MISL consistently outperforms the other algorithms at all dimensionalities. The noisy views do not seriously damage the performance of MISL, but are optimally combined to find the latent intact space. This is mainly due to the robustness of MISL and its ability to appropriately handle the complementarity between multiple views.

### 5.3 Human Motion Recognition

The UCF101 dataset [56] is a large dataset of human actions. It consists of 101 action classes, which can be further divided into five types: Human-Object Interaction, Body-Motion Only, Human-Human Interaction, Playing Musical Instrument and Sports. In total, there are over 13k clips and 27 hours of video data. The entire dataset was split between train and test samples three times, each split randomly selecting two-thirds of the data for training and the remaining data for testing. Therefore videos from the same group never appear in both the training and test sets. Recently, deep learning models [57, 58] have been widely used to learn effective features from videos for action recognition. Since feature learning is not the focus of our research, we employ easily obtained conventional handcrafted features as input, and apply multi-view learning to evaluate whether the complementarity between multiple views can actually be exploited to improve the recognition performance. Motion Boundary Histograms (MBH) and Histograms of Oriented Gradients (HOG) are two well-known motion descriptors [59], and thus we chose these two views to represent each clip. In the experiments, we used a linear SVM for classification, and conducted cross validation to find the optimal . In the case of multi-class classification, we used a one-against-rest approach and selected the class with the highest score. The performance measurement was the average classification accuracy over all classes on three splits.

| MBH/HOG | MBH+HOG | MBH | HOG |
|---|---|---|---|
| sGPLVM | | | |
| FLSSS | | | |
| MSL | | | |
| MISL | | | |
| KMISL | | | |

We used different algorithms to project the multi-view video clips into 1500-dimensional latent spaces and reported the recognition accuracy based on these embedded examples. The confusion matrix for KMISL over the 101 action classes is presented in Figure 9 (a). After merging the recognition results of classes belonging to the same action type, the confusion matrices over the five action types for the MISL, MSL, FLSSS, and sGPLVM algorithms are shown in Figure 9 (b)-(e), respectively. MISL provides performance improvements in most of the action types. The recognition accuracy of different algorithms based on various feature combinations is summarized in Table I. Compared with the recognition performance based on a single kind of feature as input, learning with multiple features through different multi-view learning algorithms demonstrates varying performance improvements, as a result of exploiting the complementary information underlying the multi-view features. In particular, MISL obtains the best recognition result, due to its robust approach to obtaining the intact latent space.

### 5.4 RGB-D Object Recognition

The RGB-D dataset is a large-scale multi-view object dataset collected through an RGB-D camera. This dataset contains color and depth images of 300 physically distinct everyday objects from 51 categories. The RGB image and the corresponding depth image can therefore be seen as two views capturing the visual appearance and the shape of an object. Video sequences of each object were recorded at 20 Hz, and then subsampled by taking every fifth video frame, giving 41,877 RGB and depth images for the experiments. We extracted 1000-dimensional gradient kernel descriptors to depict both the RGB image and the depth image. For category recognition, we randomly left one object out from each category for testing and trained SVM classifiers on all the remaining objects. The performance was measured as the recognition accuracy averaged across 10 trials.

We used different algorithms to project the multi-view objects into different dimensional spaces and reported the category recognition accuracy on these spaces in Figure 10. We can see that both MISL and its kernel extension perform much better than other competitors. In contrast with these competitors, whose objective is to discover the subspace bearing dependencies or independencies of different views, the proposed MISL algorithm attempts to find an intact space fusing all the information provided by diverse views. Therefore the dimensionality of the spaces discovered by MISL is not limited to be lower than those of original features. Moreover, the robustness of MISL ensures that the noises introduced by multiple views can be appropriately handled, which leads to the performance improvement of multi-view learning.

## 6 Conclusion

We have presented a novel robust learning algorithm for recovering the latent intact representations for multi-view examples, by assuming view insufficiency, i.e., that each view only captures partial information but all views together carry redundant information about the object. Theoretical analysis on view insufficiency assumption suggests that we can approximately restore the latent intact space by exploiting the complementarity between multiple insufficient views. Based on a new definition of “multi-view stability” and the Rademacher complexity, we derive the generalization error bound for the proposed multi-view learning algorithm, and show that this bound can be improved through the complementarity between multiple views. Finally, we design a new Iteratively Reweight Residuals (IRR) technique that converges fast to solve the optimization problem. Experimental results on synthetic data and real-world datasets demonstrate that the proposed MISL algorithm and its kernel extension are robust, effective and promising for practical applications.

## 7 Proofs

We provide below the detailed proofs of the theoretical results in Section 4.

### 7.1 Proof of Theorem 1

Assume that is locally convex with respect to and has a local minimizer. Setting as this minimizer, we have

(25) |

Substituting for , we obtain the update rule in Eq. (14) (similar for Eq. (17)).

By appropriately choosing near , we have implying that

(26) |

By combining Eqs. (25) and (26), we have

(27) |

This shows that , proving the first part of Theorem 1.

Since the sequence is monotonic and lower bounded, it converges. From Eq. (27), we can then write

(28) |

Considering is convergent, we conclude that has convergent subsequences.

In non-convex optimization [60], a common assumption is that the non-convex function is “locally convex” around its “local” minimum. Suppose the initialization is “good”, in that it is situated sufficiently close to a local minimizer . It is then possible that the entire sequence is restricted to a ball of small enough radius around , and the ball contains no other stationary points of . In this case, we are guaranteed that every convergent subsequence, and hence the whole sequence , converges to .
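The convergence argument above (a monotonically decreasing, lower-bounded objective) can be observed on a toy problem. The sketch below is not the update rule of Eq. (14). It assumes a Cauchy-type robust loss on a scalar location problem, and the names `cauchy_objective` and `irr_location` are hypothetical. Each iteration reweights the residuals and solves a weighted least-squares step.

```python
import math


def cauchy_objective(z, xs, c):
    """Sum of Cauchy-type losses log(1 + ((x - z) / c)^2); bounded below by 0."""
    return sum(math.log(1.0 + ((x - z) / c) ** 2) for x in xs)


def irr_location(xs, c=1.0, iters=50):
    """Iteratively reweighted estimate of a robust location (toy assumption)."""
    z = sum(xs) / len(xs)  # start from the ordinary mean
    history = [cauchy_objective(z, xs, c)]
    for _ in range(iters):
        # Weights shrink for large residuals, so outliers count less.
        weights = [1.0 / (c * c + (x - z) ** 2) for x in xs]
        # Weighted least-squares step: closed-form weighted mean.
        z = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
        history.append(cauchy_objective(z, xs, c))
    return z, history


# Five inliers near 1.0 and one gross outlier.
xs = [0.9, 1.0, 1.1, 1.05, 0.95, 25.0]
z_hat, history = irr_location(xs)
```

In this run the values in `history` never increase, and `z_hat` stays close to the inliers around 1.0 despite the outlier at 25.0.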

### 7.2 Proof of Theorem 2

Considering , and two probability distributions and , we have

(29) |

where and the last step is due to the variational distance being bounded by the square root of the KL divergence. Using this, for a fixed view set we have

(30) |

Taking the expectation with respect to and using Jensen’s inequality, we get

(31) |

Since

(32) |

and , we obtain

(33) |

Considering

(34) |

we have

(35) |

Furthermore,

which shows that the larger , the smaller will be, and the difference between and will decrease.
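Two standard inequalities invoked in this proof can be checked numerically. The sketch below uses Pinsker's inequality in its common form, TV(P, Q) ≤ sqrt(KL(P‖Q) / 2), and Jensen's inequality for the concave square root, E[sqrt(X)] ≤ sqrt(E[X]). The constant convention may differ from the one used in the text. The discrete distributions and all function names are assumptions made for illustration.

```python
import math
import random


def total_variation(p, q):
    """Variational (total variation) distance between discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with strictly positive q."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def random_distribution(k, rng):
    """Random strictly positive probability vector of length k."""
    raw = [rng.random() + 1e-3 for _ in range(k)]
    total = sum(raw)
    return [r / total for r in raw]


rng = random.Random(0)

# Pinsker: TV(p, q) <= sqrt(KL(p || q) / 2) on random pairs.
pinsker_ok = True
for _ in range(1000):
    p, q = random_distribution(5, rng), random_distribution(5, rng)
    bound = math.sqrt(max(0.0, 0.5 * kl_divergence(p, q)))
    pinsker_ok = pinsker_ok and total_variation(p, q) <= bound + 1e-12

# Jensen for the concave square root: mean of sqrt <= sqrt of mean.
values = [rng.random() for _ in range(1000)]
mean_of_sqrt = sum(math.sqrt(v) for v in values) / len(values)
sqrt_of_mean = math.sqrt(sum(values) / len(values))
jensen_ok = mean_of_sqrt <= sqrt_of_mean
```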

### 7.3 Proof of Theorem 4

For any two multi-view examples and that differ only at a single coordinate on an individual view, assume that . We have

The last equality results from the multi-view stability. According to Theorem 3, we therefore have

(36) |

with probability at least .

Using the symmetry argument and the contraction principle, we proceed to upper-bound by the Rademacher complexity . Note that our analysis does not require the random variables to be independent.

Based on Jensen’s inequality, we begin with the definition of and have

Given a set of independent variables , we define

(37) |

and

(38) |

By symmetry,

Finally, we can complete the proof by combining the above results.
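For readers less familiar with the quantity used in this bound, the following sketch estimates an empirical Rademacher complexity by Monte Carlo for a small finite function class and compares it with Massart's finite-class bound. It is a generic illustration, not the multi-view function class analyzed in the proof. The function name `empirical_rademacher` and the random ±1-valued functions are assumptions.

```python
import math
import random


def empirical_rademacher(outputs, draws, rng):
    """Monte Carlo estimate of E_sigma[max_k (1/n) sum_i sigma_i f_k(x_i)].

    outputs[k][i] holds f_k(x_i) for a finite class of functions.
    """
    n = len(outputs[0])
    total = 0.0
    for _ in range(draws):
        # Independent Rademacher signs, each +1 or -1 with probability 1/2.
        sigma = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        total += max(sum(s * v for s, v in zip(sigma, row)) for row in outputs) / n
    return total / draws


rng = random.Random(0)
n = 20

# Hypothetical class: two random +/-1-valued functions and their negations.
f = [rng.choice((-1.0, 1.0)) for _ in range(n)]
g = [rng.choice((-1.0, 1.0)) for _ in range(n)]
function_class = [f, [-v for v in f], g, [-v for v in g]]

estimate = empirical_rademacher(function_class, draws=2000, rng=rng)

# Massart's lemma for +/-1-valued functions: sqrt(2 * ln|F| / n).
massart_bound = math.sqrt(2.0 * math.log(len(function_class)) / n)
```

Because the class is closed under negation, every draw contributes a non-negative maximum, so `estimate` lies between 0 and `massart_bound`.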

### 7.4 Proof of Proposition 2

We obtain the optimal intact representations of multi-view examples and through

(39) |

and

(40) |

and define . We introduce the following notation for convenience

(41) |

Though is not globally convex w.r.t. , it can be assumed to be locally convex within a small region. Given and its perturbed copy , we assume that their intact representations and are not far away from each other, and that they fulfill the following inequalities derived from convexity:

(42) |

and

(43) |

Summing the two inequalities yields

(44) |
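For reference, convexity inequalities of this type and their sum can be written for a generic convex function. The symbols $f$, $a$, $b$ and $t$ below are illustrative and are not the paper's notation; the block shows only the standard pattern of pairing two convexity inequalities and adding them.

```latex
% Assumption: f is convex, a and b lie in its domain, and t is in [0, 1].
\begin{align}
f\bigl(a + t(b - a)\bigr) - f(a) &\le t\,\bigl(f(b) - f(a)\bigr), \\
f\bigl(b + t(a - b)\bigr) - f(b) &\le t\,\bigl(f(a) - f(b)\bigr).
\end{align}
% Adding the two lines, the right-hand sides cancel:
\begin{equation}
f\bigl(a + t(b - a)\bigr) - f(a) + f\bigl(b + t(a - b)\bigr) - f(b) \le 0 .
\end{equation}
```

Each line follows from $f\bigl((1 - t)u + t v\bigr) \le (1 - t) f(u) + t f(v)$ after subtracting $f(u)$ from both sides.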

Considering that and are optimal solutions of Eq. (39) and Eq. (40), respectively, we have

(45) |

and