An increasing number of applications, including face recognition, video surveillance, social computing and 3-D point cloud reconstruction, require the data obtained from various domains or extracted from diverse feature extractors to achieve a high accuracy and satisfactory performance. These kinds of data are known as multi-view data. For instance, videos can be generated from different angles (as shown in Fig.1) or from different sensors in a surveillance scene, and a given image can be represented by different types of features such as SIFT and HoG.
It has been shown that such data describe a particular object or situation more comprehensively than data obtained from only a single view [1], because they carry more abundant information. Multi-view research therefore plays an important role in both academic and practical fields. Various multi-view learning methods have been presented for both supervised and unsupervised learning problems. For the former, multi-view face recognition and video tracking [2], [3] have achieved good performance in practical applications. For the latter, multi-view data restoration and recovery, or multi-view subspace learning (MSL), have been proposed in [4], [5], etc. This paper mainly focuses on the unsupervised MSL issue, which is an important branch of this research line.
The basic assumption of MSL is that each view of the data lies on a low-dimensional subspace and that these subspaces share some common knowledge. By properly encoding and learning this common knowledge, the subspace of each view can be compensated by the other views and extracted more accurately than when only a single view is used. Most current MSL methods encode such common knowledge through the deterministic components of the data, such as a shared subspace [6, 7] or similar coefficient representations across views [8, 9, 1, 4, 10, 11]. This assumption is generally reasonable and yields satisfactory performance when the contexts and scenarios of different views are consistent and not very complicated.
There are, however, still critical limitations of current methods, especially when they are applied to multi-view data with complicated noise. Firstly, current methods generally use a simple $L_2$-norm or $L_1$-norm loss in the model, which implies that the noise embedded in each view is assumed to follow a simple Gaussian or Laplacian distribution. In practice, however, the noise is often much more complicated. For example, a multi-view surveillance video may contain foreground objects together with their shadows and weak camera noise (see Fig. 1). Such complicated noise cannot be well approximated by a simple Gaussian or Laplacian distribution, which degrades the robustness of these methods in practical applications.
Secondly, most current methods use a single loss term to encode all views of the data, which implicitly assumes that the noise is i.i.d. across all data. This assumption, however, often fails in real cases. The noise in different views is usually evidently distinct because of the different collecting angles, domains and sensors. For example, in multi-view videos collected from different surveillance cameras, some views might capture a foreground object with a large occlusion area, so that the noise in those views is better encoded by a long-tailed distribution such as a Laplacian (i.e., an $L_1$-norm loss is preferable), while other views might miss the object entirely, so that they contain only weak noise that is better approximated by a Gaussian (i.e., an $L_2$-norm loss is preferable), as shown in Fig. 1. Neglecting such noise distinctiveness among views tends to harm the performance of current methods.
Last but not least, besides distinctiveness, there is also correlation and similarity among the noise in different views. For example, when one view of a video has large occlusion noise, implying that an evident object has entered the surveillance area, other views are also likely to have large noise and should likewise be encoded with long-tailed distributions. Such noise similarity among all views should be considered to further enhance the noise-fitting capability and the robustness of the multi-view learning strategy.
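The correspondence between loss functions and noise assumptions used in the discussion above can be made explicit. As a brief sketch, with $e$ denoting a residual and $\sigma$, $b$ being scale parameters introduced here only for illustration, the negative log-densities of the two noise models are:

```latex
\begin{align}
e \sim \mathcal{N}(0,\sigma^2): \quad -\log p(e) &= \frac{e^2}{2\sigma^2} + \tfrac{1}{2}\log\left(2\pi\sigma^2\right), \\
e \sim \mathrm{Laplace}(0,b): \quad -\log p(e) &= \frac{|e|}{b} + \log(2b).
\end{align}
```

Summed over i.i.d. entries, maximum likelihood estimation therefore reduces to minimizing a squared ($L_2$) loss under Gaussian noise and an absolute ($L_1$) loss under Laplacian noise.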
To address the aforementioned noise-fitting issues, in this work we propose an MSL method that fully takes the complexity, inconsistency and similarity of noise in multi-view learning into consideration. To the best of our knowledge, this is the first work to consider the stochastic components in multi-view learning in such an elaborate manner, so as to make it robust to complicated practical noise. The main contributions can be summarized as follows:
To model noise that is complicated and inconsistent within views yet correlated across views, we introduce KL divergence regularization into the noise modeling of multi-view subspace learning: the noise of each view is modeled by a separate mixture of Gaussians (MoG), and these MoGs are regularized by a KL divergence term, instead of using a single uniform MoG as is conventional.
Further, we propose an EM algorithm to solve the model with KL divergence regularization, in which each step can be solved efficiently. In particular, all parameters and variables of the noise distribution have closed-form solutions in the M step.
A detailed theoretical explanation of the KL divergence regularization is given, in terms of a conjugate prior for the local distributions and a KL divergence average for the global distribution. Moreover, to apply this regularization to complex noise modeling succinctly, we extend it to a joint form for mixtures of full exponential family distributions (including MoG) by using an alternative regularization term that is an upper bound of the original one.
The paper is organized as follows: Section 2 reviews related work on MSL. Section 3 proposes our model, and Section 4 designs an EM algorithm for solving it. Section 5 presents theoretical explanations of the KL divergence regularization used in the model. Section 6 presents the experiments, and a conclusion is finally drawn.
II Related works
In recent years, a number of multi-view learning approaches have been proposed. Canonical Correlation Analysis (CCA) [12], which learns a shared latent subspace across two views, is a typical method for analyzing linear correlation. To handle nonlinear alignment, kernel CCA [13] was proposed, which projects the data into a high-dimensional feature space. Additionally, sparsity was imposed on CCA as a prior distribution [14]. Several robust CCA-based strategies were proposed by Nicolaou et al. [15] and Bach et al. [16]. The $L_1$-loss was introduced to limit the influence of outliers and noise, while a Student-t density model was presented to handle outliers.
Other works on MSL have also attracted much attention recently. Jia et al. [5] used structured sparsity to deal with the multi-view learning problem by alternately solving two convex optimization problems. Similarly, Guo [8] proposed Convex Subspace Representation Learning (CSRL) for multi-view subspace learning; this technique relaxes the problem and reduces dimensionality while retaining a tractable formulation. Other related methodologies on convex formulations for MSL can be found in [8, 17, 18, 19, 20]. Moreover, Gaussian process regression [21] was developed to learn a common nonlinear mapping between corresponding sets of heterogeneous observations. Multiple kernel learning (MKL) [22, 23] has also been widely used for multi-view data, since combining kernels either linearly or nonlinearly substantially improves learning performance. Other works exploited robust losses, such as the Cauchy loss [24], to strengthen robustness to noise. Considering the correlation and independence of multi-view data, several methods have been introduced that divide the data into components correlated across all views and components specific to each view. Lock et al. [9] presented Joint and Individual Variation Explained (JIVE), which decomposes multi-view data into three parts: a low-rank subspace capturing the common components, a low-rank subspace capturing the individual ones, and residual errors. Inspired by JIVE, Zhou et al. [25] used a common orthogonal basis extraction (COBE) algorithm to identify and separate the shared and individual features.
However, conventional methods focus mainly on the deterministic components of multi-view data and do not elaborately consider the complicated stochastic components (i.e., noise), which tends to degrade their robustness, especially in real cases with complex noise. To the best of our knowledge, our method is the first to investigate this MSL noise issue and to formulate a model capable of both adapting to intra-view noise complexity (by a parametric MoG) and conveying inter-view noise correlation (by KL divergence regularization). Its novelty lies in both the issue investigated and the methodology designed (regularized noise modeling).
III Problem Formulation
The observed multi-view data is denoted as , where means the data collected from the view, mean the dimension and number of data in each view (for notational convenience, we assume that every view has the same data dimensionality and number of samples; the method can also be easily applied when these differ across views). We set as the variables in deterministic parts of our model, where , ( is the subspace rank) denote the subspace parameters. Besides, denote , , as variables in stochastic parts of our model, where , , as well as . Denote as all variables involved in our model. The parameters , , and denote the hyperparameters to infer model variables. and mean the row and column vectors of a matrix, respectively. and are the strength parameters of the regularization term.
III-B Probabilistic Modeling
III-B1 MoG modeling on each view of noises
As in conventional MSL methods, we model the deterministic component of each view as a view-specific subspace with view-specific coefficients, supplemented by coefficients shared among all views. Each element (, ) of the input matrix is then modeled as:
where represents the noise in . From Eq. (1) we can see that a shared variable is learned to reveal the relationship among different views.
Unlike previous works, which use only a simple Gaussian or Laplacian to model the noise distribution, we model in each view as a MoG distribution so that it better adapts to the noise complexity encountered in practice, i.e.,
where , denotes the latent variable , which satisfies and , “Multi” denotes the multinomial distribution. Note that the MoG parameters are different from view to view, implying that each view has its specific noise configuration.
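To make the per-view noise model concrete, the following is a minimal NumPy sketch of sampling from and evaluating a zero-mean MoG. The zero-mean assumption, the function names, and the two hypothetical views with their parameter values are illustrative choices of ours, not settings taken from the paper.

```python
import numpy as np

def sample_mog_noise(rng, shape, pi, sigma2):
    """Sample zero-mean MoG noise: draw a component label per entry, then a Gaussian."""
    pi = np.asarray(pi, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    z = rng.choice(len(pi), size=shape, p=pi)      # latent component labels
    e = rng.normal(0.0, np.sqrt(sigma2[z]))        # per-entry standard deviation
    return e, z

def mog_log_density(e, pi, sigma2):
    """Elementwise log sum_k pi_k N(e | 0, sigma2_k), computed with log-sum-exp."""
    e = np.asarray(e, dtype=float)[..., None]
    pi = np.asarray(pi, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    log_comp = np.log(pi) - 0.5 * np.log(2.0 * np.pi * sigma2) - 0.5 * e**2 / sigma2
    m = log_comp.max(axis=-1, keepdims=True)
    return (m + np.log(np.exp(log_comp - m).sum(axis=-1, keepdims=True)))[..., 0]

rng = np.random.default_rng(0)
# Hypothetical views: one with heavy occlusion-like noise, one with only weak noise.
view_params = [
    {"pi": [0.7, 0.3], "sigma2": [0.01, 4.0]},
    {"pi": [0.95, 0.05], "sigma2": [0.01, 0.5]},
]
noises = [sample_mog_noise(rng, (50, 40), p["pi"], p["sigma2"])[0] for p in view_params]
avg_loglik = [mog_log_density(e, p["pi"], p["sigma2"]).mean()
              for e, p in zip(noises, view_params)]
```

Each view keeps its own mixing proportions and variances, which is what allows one view to be long-tailed while another stays close to Gaussian.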
Then we can deduce that:
The prior distributions on , and constrain their scales to avoid overfitting. The full graphical model of the proposed method is shown in Fig. 2.
III-B2 Shared MoG modeling on all views of noises
In order to encode the correlation of noise among different views, we assume that there is a prior distribution on , which is related to a KL divergence regularization term. All of the model variables and MoG parameters can then be inferred by MAP estimation. After marginalizing the variable, the posterior of can be written as:
Then our method needs to solve the problem of minimizing the following objective (negative log of Eq. (4)):
where is the index set of the non-missing entries in .
For the regularization term , we use the following KL divergence form:
which links the MoG noise parameters of all views to a latent common MoG with parameters and . This KL divergence term corresponds to an improper prior (that is, a prior that does not integrate to 1 [29]) imposed on the local MoG parameters and . The MAP model (4) is thus theoretically sound. More theoretical explanations of the model are presented in Section 5.
Note that the physical meaning of the objective function (7) can be easily interpreted. The first term is the likelihood term, which mainly aims to fit the input data ; the second encodes the prior knowledge on the deterministic parameters ; and the third regularizes each stochastic variable involved in the model by pulling it close to a latent shared distribution.
IV EM algorithm for solving MSL-RMoG
The EM algorithm is readily applied to solve (5). The algorithm contains three steps: computing the posterior expectation of the latent variable ; optimizing the MoG noise parameters; and optimizing the model parameters.
E Step: the posterior responsibility of mixture component can be calculated by
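For reference, a standard form of this responsibility computation for a zero-mean MoG can be sketched as follows. This is a generic illustration for a single view under our own assumptions: `residual` stands for the current fitting residual of that view, and the parameter values in the usage lines are hypothetical.

```python
import numpy as np

def e_step(residual, pi, sigma2):
    """Responsibilities gamma[..., k] proportional to pi_k * N(residual | 0, sigma2_k)."""
    r = np.asarray(residual, dtype=float)[..., None]
    pi = np.asarray(pi, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    log_w = np.log(pi) - 0.5 * np.log(2.0 * np.pi * sigma2) - 0.5 * r**2 / sigma2
    log_w = log_w - log_w.max(axis=-1, keepdims=True)   # stabilize before exponentiating
    w = np.exp(log_w)
    return w / w.sum(axis=-1, keepdims=True)

# A small residual is assigned to the narrow component, a large one to the wide component.
gamma = e_step(np.array([[0.01, 3.0]]), pi=[0.5, 0.5], sigma2=[0.01, 4.0])
```

Working in the log domain avoids underflow when a residual is many standard deviations away from a narrow component.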
We can then perform the M step by optimizing the corresponding bound w.r.t. and :
M Step for updating MoG parameters (the derivation is given in A1 of the supplementary material): One way to address problem (9) is to take the derivatives w.r.t. all the MoG parameters and set them to zero. The updating formulas of these parameters are as follows:
Update and : Referring to [30], the closed-form updating equations of and are:
Update and : The parameters and also have a closed-form solution in M step:
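Both parameters of the common MoG noise can be easily explained:
the common variance of each component is the harmonic mean of the corresponding variances in all views, and
the common mixing proportion of each component is proportional to the geometric mean of the corresponding mixing proportions in all views.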
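The harmonic-mean and geometric-mean interpretation can be illustrated with a short NumPy sketch. It assumes equal weights over views and zero-mean Gaussian components, and it treats the common parameters as minimizers of the summed KL divergence from the common distribution to each view; the function names and the numbers are hypothetical, and the paper's exact update may carry additional weights.

```python
import numpy as np

def common_mog_params(pi_views, sigma2_views):
    """Combine per-view MoG parameters of shape (V, K) into common ones of shape (K,)."""
    pi_views = np.asarray(pi_views, dtype=float)
    sigma2_views = np.asarray(sigma2_views, dtype=float)
    sigma2 = 1.0 / np.mean(1.0 / sigma2_views, axis=0)   # harmonic mean over views
    geo = np.exp(np.mean(np.log(pi_views), axis=0))      # geometric mean over views
    return geo / geo.sum(), sigma2                       # renormalize the proportions

def kl_gauss0(s2_a, s2_b):
    """KL( N(0, s2_a) || N(0, s2_b) )."""
    return 0.5 * (np.log(s2_b / s2_a) + s2_a / s2_b - 1.0)

def kl_categorical(p, q):
    """KL( p || q ) for discrete distributions with positive entries."""
    return np.sum(p * np.log(p / q), axis=-1)

pi_views = np.array([[0.5, 0.5], [0.8, 0.2]])
sigma2_views = np.array([[1.0, 0.01], [4.0, 0.04]])
pi, sigma2 = common_mog_params(pi_views, sigma2_views)

# Total divergence from the common parameters to all views, versus a naive arithmetic mean.
kl_common = kl_gauss0(sigma2, sigma2_views).sum() + kl_categorical(pi, pi_views).sum()
kl_naive = (kl_gauss0(sigma2_views.mean(axis=0), sigma2_views).sum()
            + kl_categorical(pi_views.mean(axis=0), pi_views).sum())
```

In this toy setting the harmonic and geometric means give a smaller total divergence than simple arithmetic averaging of the per-view parameters.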
M Step for model variables : The related terms in Eq. (9) can be equivalently reformulated as follows:
where represents the Hadamard product and the element of the indicator matrix , with same size of , is
There exist many off-the-shelf algorithms [31, 32, 33] to tackle Eq. (13). We simply adopt alternating least squares (ALS) owing to its simplicity and effectiveness. The detailed steps of MSL-RMoG are provided in Algorithm 1.
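As an illustration of how ALS handles a weighted low-rank fitting problem of this kind, the sketch below solves a generic single-view problem, minimizing the squared Frobenius norm of sqrt(W) ⊙ (X − U Vᵀ). It is not the paper's Algorithm 1: the shared coefficients and the priors are omitted, and the function name, the small ridge term and the synthetic data are assumptions made for the example.

```python
import numpy as np

def weighted_als(X, W, rank, n_iter=50, ridge=1e-8, seed=0):
    """Alternating least squares for min_{U,V} sum_ij W_ij * (X_ij - (U V^T)_ij)^2."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    U = rng.standard_normal((d, rank))
    V = rng.standard_normal((n, rank))
    eye = ridge * np.eye(rank)              # tiny ridge keeps the solves well-posed
    for _ in range(n_iter):
        for i in range(d):                  # row i of U: weighted regression on V
            Vw = V * W[i][:, None]
            U[i] = np.linalg.solve(Vw.T @ V + eye, Vw.T @ X[i])
        for j in range(n):                  # row j of V: weighted regression on U
            Uw = U * W[:, j][:, None]
            V[j] = np.linalg.solve(Uw.T @ U + eye, Uw.T @ X[:, j])
    return U, V

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 2)) @ rng.standard_normal((2, 15))   # exact rank-2 data
W = rng.uniform(0.5, 1.5, size=X.shape)                           # hypothetical weights
U, V = weighted_als(X, W, rank=2)
rel_err = np.linalg.norm(np.sqrt(W) * (X - U @ V.T)) / np.linalg.norm(np.sqrt(W) * X)
```

In an EM setting, the weights would be built from the E-step responsibilities and component variances, with zeros at missing entries, so that entries explained by high-variance components influence the subspace less.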
The memory consumption and complexity of MSL-RMoG are and , where is the number of iterations of the inner loop in ALS and is the number of iterations of the outer loop in MSL-RMoG. This complexity is comparable to or even lower than those of current MSL methods [9, 8, 1, 4]. In our experiments, the performance of the algorithm is not sensitive to the setting of , so we simply set it empirically to a small number.
V KL divergence regularization
V-A Theoretical explanation
Relationship to conjugate prior: For the MoG parameters , defined in each view, the KL regularization term in Eq. (7) can be explained from the perspective of a conjugate prior. The work in [34] shows the relationship between KL divergence and the conjugate prior for full exponential family distributions. In this paper, we show that this conclusion also holds for all exponential family distributions. The theorem can be summarized as:
Theorem 1 If a distribution belongs to the exponential family with the form: with natural parameter , and its conjugate prior follows: then we have:
where and is a constant independent of .
Specifically, by imposing a Dirichlet prior and an Inverse-Gamma prior as:
we can deduce the KL divergence regularization terms in Eq. (7) for and under the MAP framework.
Theorem 2 If and are the same type of exponential family distribution , then
where is the Bregman divergence with convex function , which can be defined as .
This conclusion has been proven in [35]. Instead of computing an integral of functions, this theorem gives a fast way to calculate the KL divergence between two exponential family distributions of the same type.
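The identity can be checked numerically on a simple case. The sketch below uses the zero-mean Gaussian family, written with natural parameter eta = -1/(2*sigma2) and log-partition function A(eta) = -0.5*log(-2*eta); this parameterization and the function names are our own illustrative choices. It compares the closed-form KL divergence with the Bregman divergence of A taken with swapped arguments, which is the standard form of this identity.

```python
import numpy as np

def log_partition(eta):
    """A(eta) for N(0, sigma2) with natural parameter eta = -1 / (2 * sigma2)."""
    return -0.5 * np.log(-2.0 * eta)

def grad_log_partition(eta):
    """dA/deta = -1 / (2 * eta), which equals sigma2 = E[x^2]."""
    return -0.5 / eta

def bregman(eta_a, eta_b):
    """Bregman divergence of A: A(a) - A(b) - A'(b) * (a - b)."""
    return (log_partition(eta_a) - log_partition(eta_b)
            - grad_log_partition(eta_b) * (eta_a - eta_b))

def kl_gauss0(s2_1, s2_2):
    """Closed-form KL( N(0, s2_1) || N(0, s2_2) )."""
    return 0.5 * (np.log(s2_2 / s2_1) + s2_1 / s2_2 - 1.0)

s2_1, s2_2 = 0.5, 2.0
eta_1, eta_2 = -1.0 / (2.0 * s2_1), -1.0 / (2.0 * s2_2)
kl_closed_form = kl_gauss0(s2_1, s2_2)
kl_via_bregman = bregman(eta_2, eta_1)   # note the swapped argument order
```

Only evaluations of A and its gradient are needed, so no integral has to be computed.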
KL divergence average: For the parameters defined in the shared latent noise distribution, it corresponds to a KL divergence average problem. Specifically, we can prove the following theorem:
Theorem 3 If distributions and , belong to the same kind of full exponential family distribution, which means they all have the form: then the solution of the problem: is
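As a sketch of why such an averaging problem admits a simple solution, consider the equally weighted case and assume (this is our assumption about the form of the problem) that the objective is the sum over views of $D_{KL}(p(x\mid\theta)\,\|\,p(x\mid\theta_v))$, where $\theta$ and $\theta_v$ are natural parameters, $A$ is the log-partition function and $V$ is the number of views. Using the Bregman form of the KL divergence between members of the same exponential family:

```latex
\begin{align}
\sum_{v=1}^{V} D_{KL}\big(p(x\mid\theta)\,\|\,p(x\mid\theta_v)\big)
  &= \sum_{v=1}^{V}\Big[A(\theta_v) - A(\theta) - \nabla A(\theta)^{\top}(\theta_v-\theta)\Big], \\
\nabla_{\theta}\sum_{v=1}^{V} D_{KL}\big(p(x\mid\theta)\,\|\,p(x\mid\theta_v)\big)
  &= \nabla^{2}A(\theta)\,\Big(V\theta-\sum_{v=1}^{V}\theta_v\Big).
\end{align}
```

When $\nabla^{2}A(\theta)$ is positive definite, the gradient vanishes only at $\theta^{\ast} = \frac{1}{V}\sum_{v}\theta_v$, the average of the natural parameters. For a zero-mean Gaussian (natural parameter $-1/(2\sigma^2)$) this gives the harmonic mean of the variances, and for a multinomial (natural parameters $\log\pi_k$ up to normalization) it gives the normalized geometric mean of the mixing proportions.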
V-B Joint regularization for mixture distributions
Generally speaking, a mixture of full exponential family distributions does not belong to the full exponential family, so for such a mixture distribution we can use an independent KL divergence for each parameter of the distribution, as in Eq. (7). However, this approach requires many hyperparameters to be set. In fact, we can further prove that the joint distribution of the observed variables and the latent variable is exactly a full exponential family distribution:
Theorem 4 If the distributions all belong to the full exponential family with natural parameter and (, and ), then, belongs to the exponential family.
Moreover, we can also prove:
Theorem 5 For any two distributions and with their marginal distributions and , the following inequality holds
From the theorem, it is easy to see that constitutes an upper bound of the KL divergence between original two mixture distributions, and thus can be rationally used as a regularization term for the original mixture distribution. For example, the joint KL divergence regularization for MoG is
where . This regularization term also leads to a simple solution, by setting and in Eq. (11) and let
The solution for other variables and parameters is unchanged. The joint KL divergence regularization has only one trade-off parameter, which is generally easy to set. The proofs of all theorems are given in the supplementary material.
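The inequality in Theorem 5 can be checked numerically on small discrete distributions. The sketch below is a sanity check on random examples rather than a proof: it draws random joint tables over a pair (x, z), marginalizes out z, and compares the two KL divergences. The table sizes and helper names are arbitrary choices for illustration.

```python
import numpy as np

def kl(p, q):
    """KL( p || q ) between two discrete distributions with positive entries."""
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    return float(np.sum(p * np.log(p / q)))

def random_joint(rng, n_x, n_z):
    """A random strictly positive joint probability table over (x, z)."""
    t = rng.random((n_x, n_z)) + 1e-3
    return t / t.sum()

rng = np.random.default_rng(0)
gaps = []
for _ in range(200):
    P = random_joint(rng, 6, 3)
    Q = random_joint(rng, 6, 3)
    kl_joint = kl(P, Q)                                 # divergence between the joints
    kl_marginal = kl(P.sum(axis=1), Q.sum(axis=1))      # divergence after summing out z
    gaps.append(kl_joint - kl_marginal)
min_gap = min(gaps)
```

The gap is non-negative in every trial, consistent with the joint divergence upper-bounding the divergence between the marginals.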
VI Experimental results
To qualitatively and quantitatively evaluate the performance of the proposed multi-view subspace learning method with complex noise modeling, we conduct three types of experiments: face image reconstruction, and background subtraction on multi-view and on RGB data. We compare our method with JIVE [9], CSRL [8], MISL [1] and MSL [4], which represent the state of the art in MSL. Most parameters of the compared methods are set to their default values, and the rank is set to be the same for all methods. For MSL-RMoG, the number of MoG components is set to 3 in all cases, except 2 in the face experiments with Gaussian noise. The model parameters , and and are set as , and , respectively. Besides, we use the joint regularization in Eq. (15) with strength parameter throughout all our experiments.
VI-A Background Subtraction on Multi-view data
In this experiment, our method is applied to the problem of background subtraction. Two multi-camera pedestrian videos [2, 3], each shot by 4 cameras located at different angles, are employed: Passageway and Laboratory. All frames in the videos are of size 288×360. Without loss of generality, we resize the original frames to 144×180. From each of the Passageway and Laboratory sequences, 200 frames are extracted to compose the learning data, starting at the first frame and ending at the 1000th frame (taking the first of every 5 frames). JIVE, CSRL, MISL, MSL and our MSL-RMoG method are applied to these videos, with the rank set to 2 for all methods. Fig. 3 shows the results obtained by all competing methods on some typical frames of the multi-view video data.
From the figure, it can be seen that the background image obtained by the proposed method is clearer in its details. Compared with most other competing methods, MSL-RMoG is able to extract the foreground objects more accurately. As shown in Fig. 3, MSL-RMoG decomposes the foreground into three components of different extents, each with its own physical meaning: (1) moving objects in the foreground; (2) shadows alongside the foreground objects, due to lighting changes and people walking; (3) background variation caused mostly by camera noise. As most existing methods merge the object, its shadow and the background noise, the foreground they extract is relatively coarse. Besides, in order to illustrate how the KL divergence regularization works, we vary the strength parameter in the Laboratory video experiments and plot the curves of the largest variance and its mixing coefficient in Fig. 4. We find that the KL divergence regularization makes the noise distributions in different views share a certain similarity.
VI-B Face Images Recovery Experiments
This experiment tests the effectiveness of the proposed MSL-RMoG method in face image reconstruction. The CMU Multi-PIE face dataset [36], which includes 337 subjects with images of size 128×96 under multiple poses and expressions, is used. 200 subjects are randomly selected, and each subject contains 3 views ( , , ). In this experiment, we add different types of noise or outliers to the original images: (1) Gaussian noise $\mathcal{N}(0, 0.15)$ (shown in the row of Fig. 6 (a1)); (2) sparse noise: random 20% noise (shown in the row of Fig. 6 (a2)); (3) mixture noise: Gaussian noise $\mathcal{N}(0, 0.02)$ + block occlusion (with salt-and-pepper noise inside) + 20% sparse noise (shown in the row of Fig. 6 (a3)). The compared methods include JIVE, CSRL, MISL and MSL, and the rank is set as in this experiment. Different views of the reconstructed images obtained by different methods are shown in Fig. 6. The PSNR values of the images obtained by different methods are listed in Table I.
From the figure and table, it is easy to observe that the proposed MSL-RMoG method is capable of finely recovering the clean faces in various noise cases, especially in the case of the relatively more complicated mixture noise. MSL performs well on sparse noise, but not on Gaussian noise. JIVE and CSRL do not work well on the sparse and mixture noise, since they implicitly assume i.i.d. Gaussian noise. MISL performs slightly better than JIVE and CSRL owing to its use of the Cauchy loss, which is more robust than the $L_2$ loss.
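For readers who wish to reproduce this kind of comparison, PSNR can be computed as in the minimal sketch below. It assumes image intensities scaled to [0, 1], so that the peak value is 1; the peak value and the function name are assumptions of ours, since the scaling used for Table I is not restated here.

```python
import numpy as np

def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB between a clean image and its reconstruction."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")      # identical images
    return float(10.0 * np.log10(peak ** 2 / mse))

clean = np.full((128, 96), 0.5)
recovered = clean + 0.1          # a uniform error of 0.1 gives an MSE of 0.01
value = psnr(clean, recovered)
```

With a peak of 1 and an MSE of 0.01, the sketch returns 20 dB; higher values indicate a reconstruction closer to the clean image.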
VI-C Background Subtraction on RGB data
MSL-RMoG is further applied to background subtraction on RGB data. In this experiment, we regard the three channels (red, green and blue) of a video as three different views. Different channels do have different residual distributions, since the sensitive spectral bands of the R, G and B sensors are distinct. However, all sensors observe the same objects, so the residual distributions also share a certain degree of similarity. The noise in such data thus has a more complicated non-i.i.d. structure than that assumed by current MSL methods.
Fig. 5 shows the results on some typical frames of the Li dataset for all competing methods, including JIVE, CSRL, MISL, MSL and MSL-RMoG. We can easily observe that MSL-RMoG obtains a clearer background, and the foreground objects it extracts are also visually better. Table II lists the F-measure of all methods on the Li dataset, which quantitatively confirms the better performance of the proposed method.
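The F-measure used for such foreground evaluation is the harmonic mean of precision and recall over binary masks. A minimal sketch follows; how the foreground is thresholded into a mask is not specified here, so the masks in the usage lines are purely hypothetical.

```python
import numpy as np

def f_measure(pred_mask, true_mask):
    """F-measure (harmonic mean of precision and recall) of a binary foreground mask."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    tp = np.sum(pred & true)                 # correctly detected foreground pixels
    if tp == 0:
        return 0.0
    precision = tp / pred.sum()
    recall = tp / true.sum()
    return float(2.0 * precision * recall / (precision + recall))

pred = np.array([[1, 1], [0, 0]])
true = np.array([[1, 0], [1, 0]])
score = f_measure(pred, true)    # precision 0.5 and recall 0.5
```

A value of 1 means the extracted foreground matches the ground-truth mask exactly.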
VII Conclusion
Current multi-view subspace learning (MSL) methods mainly emphasize the deterministic shared knowledge of data, but ignore the complexity, inconsistency and similarity of noise. This deviates from most real cases, in which the noise across views is more complicated, and degrades their robustness in practice. This paper has proposed a new MSL method, which is the first to investigate this MSL noise issue and formulates a model capable of both adapting to intra-view noise complexity (by a parametric MoG) and conveying inter-view noise correlation (by KL divergence regularization). Its novelty lies in both the issue investigated and the methodology designed (regularized noise modeling). Further, we give a detailed theoretical explanation of the regularization term. Experiments show that the new method can complement previous MSL research and further enhance performance in complex noise scenarios.
[1] C. Xu, D. Tao, and C. Xu, “Multi-view intact space learning,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 37, no. 12, pp. 2531–2544, 2015.
[2] J. Berclaz, F. Fleuret, E. Turetken, and P. Fua, “Multiple Object Tracking using K-Shortest Paths Optimization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011.
[3] F. Fleuret, J. Berclaz, R. Lengagne, and P. Fua, “Multi-Camera People Tracking with a Probabilistic Occupancy Map,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 2, pp. 267–282, February 2008.
[4] M. White, X. Zhang, D. Schuurmans, and Y.-l. Yu, “Convex multi-view subspace learning,” in Advances in Neural Information Processing Systems, 2012, pp. 1673–1681.
[5] Y. Jia, M. Salzmann, and T. Darrell, “Factorized latent spaces with structured sparsity,” in Advances in Neural Information Processing Systems, 2010, pp. 982–990.
[6] Z. Ding and Y. Fu, “Low-rank common subspace for multi-view learning,” in 2014 IEEE International Conference on Data Mining. IEEE, 2014, pp. 110–119.
[7] Z. Zhu, L. Du, L. Zhang, and Y. Zhao, “Shared subspace learning for latent representation of multi-view data.”
[8] Y. Guo, “Convex subspace representation learning from multi-view data.” in AAAI, vol. 1, 2013, p. 2.
[9] E. F. Lock, K. A. Hoadley, J. S. Marron, and A. B. Nobel, “Joint and individual variation explained (jive) for integrated analysis of multiple data types,” The annals of applied statistics, vol. 7, no. 1, p. 523, 2013.
[10] P. Xu, Q. Yin, Y. Huang, Y.-Z. Song, Z. Ma, L. Wang, T. Xiang, W. B. Kleijn, and J. Guo, “Cross-modal subspace learning for fine-grained sketch-based image retrieval,” Neurocomputing, 2017.
[11] Z. Ma, J.-H. Xue, A. Leijon, Z.-H. Tan, Z. Yang, and J. Guo, “Decorrelation of neutral vector variables: Theory and applications,” IEEE transactions on neural networks and learning systems, 2016.
[12] H. Hotelling, “Relations between two sets of variates,” Biometrika, vol. 28, no. 3/4, pp. 321–377, 1936.
[13] S. Akaho, “A kernel method for canonical correlation analysis,” arXiv preprint cs/0609071, 2006.
[14] C. Archambeau and F. R. Bach, “Sparse probabilistic projections,” in Advances in neural information processing systems, 2009, pp. 73–80.
[15] M. A. Nicolaou, Y. Panagakis, S. Zafeiriou, and M. Pantic, “Robust canonical correlation analysis: Audio-visual fusion for learning continuous interest,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 1522–1526.
[16] F. R. Bach and M. I. Jordan, “A probabilistic interpretation of canonical correlation analysis,” 2005.
[17] A. Goldberg, B. Recht, J. Xu, R. Nowak, and X. Zhu, “Transduction with matrix completion: Three birds with one stone,” in Advances in neural information processing systems, 2010, pp. 757–765.
[18] R. S. Cabral, F. De la Torre, J. P. Costeira, and A. Bernardino, “Matrix completion for multi-label image classification.” in NIPS, vol. 201, no. 1, 2011, p. 2.
[19] C. Christoudias, R. Urtasun, and T. Darrell, “Multi-view learning in the presence of view disagreement,” arXiv preprint arXiv:1206.3242, 2012.
[20] B. Behmardi, C. Archambeau, and G. Bouchard, “Overlapping trace norms in multi-view learning,” arXiv preprint arXiv:1404.6163, 2014.
[21] A. Shon, K. Grochow, A. Hertzmann, and R. P. Rao, “Learning shared latent structure for image synthesis and robotic imitation,” in Advances in Neural Information Processing Systems, 2005, pp. 1233–1240.
[22] M. Gönen and E. Alpaydın, “Multiple kernel learning algorithms,” The Journal of Machine Learning Research, vol. 12, pp. 2211–2268, 2011.
[23] Z. Xu, R. Jin, H. Yang, I. King, and M. R. Lyu, “Simple and efficient multiple kernel learning by group lasso,” in Proceedings of the 27th international conference on machine learning (ICML-10), 2010, pp. 1175–1182.
[24] I. Mizera and C. H. Müller, “Breakdown points of cauchy regression-scale estimators,” Statistics & probability letters, vol. 57, no. 1, pp. 79–89, 2002.
[25] G. Zhou, A. Cichocki, Y. Zhang, and D. P. Mandic, “Group component analysis for multiblock data: Common and individual feature extraction,” 2015.
[26] Y. Panagakis, M. Nicolaou, S. Zafeiriou, and M. Pantic, “Robust correlated and individual component analysis,” 2015.
[27] P. J. Huber, Robust statistics. Springer, 2011.
[28] V. Maz’ya and G. Schmidt, “On approximate approximations using gaussian kernels,” IMA Journal of Numerical Analysis, vol. 16, no. 1, pp. 13–29, 1996.
[29] R. Christensen, W. Johnson, A. Branscum, and T. E. Hanson, Bayesian Ideas and Data Analysis: An Introduction for Scientists and Statisticians. Hoboken: CRC Press, 2010.
[30] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the em algorithm,” Journal of the royal statistical society. Series B (methodological), pp. 1–38, 1977.
[31] N. Srebro, T. Jaakkola et al., “Weighted low-rank approximations,” in ICML, vol. 3, 2003, pp. 720–727.
[32] A. M. Buchanan and A. W. Fitzgibbon, “Damped newton algorithms for matrix factorization with missing data,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 2. IEEE, 2005, pp. 316–322.
[33] F. De La Torre and M. J. Black, “A framework for robust subspace learning,” International Journal of Computer Vision, vol. 54, no. 1-3, pp. 117–142, 2003.
[34] H. Yong, D. Meng, W. Zuo, and L. Zhang, “Robust online matrix factorization for dynamic background subtraction,” IEEE transactions on pattern analysis and machine intelligence, 2017.
[35] F. Nielsen and V. Garcia, “Statistical exponential families: A digest with flash cards,” arXiv preprint arXiv:0911.4863, 2009.
[36] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker, “Multi-pie,” Image and Vision Computing, vol. 28, no. 5, pp. 807–813, 2010.