I Introduction
An increasing number of applications, including face recognition, video surveillance, social computing and 3D point cloud reconstruction, require data obtained from various domains or extracted by diverse feature extractors to achieve high accuracy and satisfactory performance. Such data are known as multiview data. For instance, videos can be captured from different angles (as shown in Fig. 1) or by different sensors in a surveillance scene, and a given image can be represented by different types of features such as SIFT and HoG. It has been shown that, because they carry more abundant information, such data describe a particular object or situation more comprehensively than data obtained from a single view [1]. Multiview research thus plays an important role in both academic and practical fields. Various multiview learning methods have been presented for both supervised and unsupervised learning problems. For the former, multiview based face recognition [1] and video tracking [2], [3] have achieved good performance in practical applications. For the latter, multiview data restoration and recovery, or multiview subspace learning (MSL), have been proposed in [4], [5], etc. This paper mainly focuses on the unsupervised MSL problem, which is an important branch of this research line.

The basic assumption of MSL is that each view of the data lies on a low-dimensional subspace, and that these subspaces share some common knowledge. By properly encoding and learning this common knowledge, the subspace of each view can be recovered more accurately than by using the knowledge of a single view alone. Most current MSL methods encode such common knowledge through the deterministic components of the data, such as a shared subspace [6, 7] or similar coefficient representations across views [8, 9, 1, 4, 10, 11]. This assumption is reasonable in many problems and yields satisfactory performance when the contexts and scenarios of the different views are consistent and not very complicated.
There are, however, still critical limitations of current methods, especially when they are applied to multiview data with complicated noise. Firstly, current methods generally use a simple $L_2$-norm or $L_1$-norm loss in the model, implying that they assume the noise embedded in each view follows a simple Gaussian or Laplacian distribution. In practical cases, however, the noise is often much more complicated, as shown in Fig. 1. For example, a multiview surveillance video may contain foreground objects, along with their shadows and weak camera noise (see Fig. 1). Such complicated noise evidently cannot be well approximated by a simple Gaussian or Laplacian distribution, which tends to degrade the robustness of these methods in practical applications.
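The link between the choice of loss and the implied noise model in this argument is the standard maximum-likelihood correspondence. As a brief reminder, with $e$ denoting a single residual and $\sigma$, $b$ denoting scale parameters (symbols introduced here only for illustration, not taken from the model below):

```latex
\begin{align*}
-\log \mathcal{N}(e \mid 0, \sigma^2) &= \frac{e^2}{2\sigma^2} + \tfrac{1}{2}\log\!\left(2\pi\sigma^2\right), \\
-\log \mathrm{Laplace}(e \mid 0, b)   &= \frac{|e|}{b} + \log(2b).
\end{align*}
```

Summing these terms over independent, identically distributed residuals gives, up to additive constants and a scale factor, a squared $L_2$ loss in the Gaussian case and an $L_1$ loss in the Laplacian case.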
Secondly, most current methods use a single loss term to encode all views of the data, which implicitly assumes that the noise is i.i.d. across all views. This is often incorrect in real cases. The noise in different views is often evidently distinct, owing to the different collecting angles, domains and sensors. For example, in multiview videos collected from different surveillance cameras, some views might capture a foreground object with a large occlusion area, so that the noise in those views is better encoded by a long-tailed distribution such as a Laplacian (i.e., an $L_1$-norm loss is preferable), while other views might miss the object entirely, so that they contain only weak noise and are better approximated by a Gaussian (i.e., an $L_2$-norm loss is preferable), as shown in Fig. 1. Neglecting such noise distinctiveness among views tends to harm the performance of current methods.
Last but not least, besides distinctiveness, there is also correlation and similarity among the noise in different views. For example, when one view of a video has large occlusion noise, implying that an evident object has entered the surveillance area, other views are also likely to have large noise and should likewise be encoded with long-tailed distributions. Such noise similarity among views should be taken into account to further enhance the noise fitting capability and the robustness of the multiview learning strategy.
To address the aforementioned noise fitting issue, in this work we propose an MSL method that fully takes the complexity, non-consistency and similarity of noise in multiview learning into consideration. To the best of our knowledge, this is the first work to consider the stochastic components in multiview learning in such an elaborate manner, so as to make it robust to complicated noise in practice. The main contributions can be summarized as follows:

To model noise that is complicated and inconsistent within views yet correlated between views, we apply KL divergence regularization to the noise modeling of multiview subspace learning: each view is assigned a separate mixture of Gaussians (MoG) for its noise, and these MoGs are regularized by a KL divergence term, instead of using a single uniform MoG as is conventional.

Further, we propose an EM algorithm to solve the model with KL divergence regularization, in which each step can be solved efficiently. More specifically, all parameters and variables of the noise distribution have closed-form solutions in the M step.

A detailed theoretical explanation of the KL divergence regularization is given, in terms of a conjugate prior for the local distributions and a KL divergence average for the global distribution. Moreover, to apply this regularization succinctly to complex noise modeling, we extend it to a joint form for mixtures of full exponential family distributions (including MoG) by using an alternative regularization term that is an upper bound of the original term.
The paper is organized as follows: Section 2 reviews related work on MSL. Section 3 presents our model and Section 4 designs an EM algorithm for solving it. Section 5 presents theoretical explanations of the KL divergence regularization used in the model. Section 6 reports the experiments, and a conclusion is finally drawn.
II Related works
In recent years, a number of multiview learning approaches have been proposed. Canonical Correlation Analysis (CCA) [12], which learns a shared latent subspace across two views, is a typical method for analyzing linear correlation. To handle nonlinear alignment, kernel CCA [13] was proposed, which projects the data into a high-dimensional feature space. Additionally, sparsity was imposed on CCA as a prior distribution [14]. Several robust CCA-based strategies were proposed by Nicolaou et al. [15] and Bach et al. [16]. A robust loss was introduced in [15] to limit the influence of outliers and noise, while a Student density model was presented in [16] to handle outliers.

Other works on MSL have also attracted much attention recently. The method in [5] used structured sparsity to deal with the multiview learning problem by alternately solving two convex optimization problems. Similarly, Guo [8] proposed Convex Subspace Representation Learning (CSRL) for multiview subspace learning; this technique relaxes the problem and reduces dimensionality while retaining a tractable formulation. Other related work on convex formulations for MSL can be found in [8, 17, 18, 19, 20]. Moreover, Gaussian process regression [21] was developed to learn a common nonlinear mapping between corresponding sets of heterogeneous observations. Multiple kernel learning (MKL) [22, 23] has also been widely used for multiview data, since combining kernels either linearly or nonlinearly can substantially improve learning performance. The works in [1] and [4] exploited the Cauchy loss [24] and another robust loss, respectively, to strengthen robustness to noise. Considering both the correlation and the independence of multiview data, several methods have been introduced to divide the data into components shared by all views and components specific to each view. Lock et al. [9] presented Joint and Individual Variation Explained (JIVE), which decomposes multiview data into three parts: a low-rank subspace capturing common components, a low-rank subspace capturing individual ones, and residual errors. Inspired by JIVE, Zhou et al. [25] used a common orthogonal basis extraction (COBE) algorithm to identify and separate the shared and individual features.
However, conventional methods focus mainly on the deterministic components of multiview data and do not elaborately consider the complicated stochastic components (i.e., noise), which tends to degrade their robustness, especially in real cases with complex noise. To the best of our knowledge, our method is the first to investigate this MSL noise issue and to formulate a model capable of both adapting to intra-view noise complexity (through a parametric MoG) and conveying inter-view noise correlation (through KL-divergence regularization). Its novelty lies in both the issue investigated and the methodology designed (regularized noise modeling).
III Problem Formulation
III-A Notation
The observed multiview data is denoted as , where means the data collected from the view, and mean the dimension and number of data in each view. (For notational convenience, we assume each view has the same data dimensionality and number of samples; our method can also be easily used when these differ across views.) We set as the variables in the deterministic parts of our model, where , ( is the subspace rank) denote the subspace parameters. Besides, denote , , as the variables in the stochastic parts of our model, where , , as well as . Denote as all variables involved in our model. The parameters , , and denote the hyperparameters used to infer the model variables. and mean the row and column vectors of matrix , respectively. and are the strength parameters of the regularization terms.
III-B Probabilistic Modeling
III-B1 MoG modeling on each view of noises
As in the conventional MSL method [9], we model the deterministic component of each view as a view-specific subspace with view-specific coefficients, supplemented by coefficients shared among all views. Each element (, ) of the input matrix is then modeled as:
(1) 
where represents the noise in . From Eq. (1) we can see that a shared variable is learned to reveal the relationship among different views.
Unlike previous works [15], [26], [27], which use only a simple Gaussian or Laplacian to model the noise distribution, we model in each view as a MoG distribution so that it better adapts to the noise complexity in practice [28], i.e.,
(2) 
where , denotes the latent variable , which satisfies and , and “Multi” denotes the multinomial distribution. Note that the MoG parameters differ from view to view, implying that each view has its own specific noise configuration.
Then we can deduce that:
(3)  
The prior distributions on , and serve to constrain their scales and thus avoid overfitting. The full graphical model of our proposed method is shown in Fig. 2.
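As a purely illustrative sketch of the generative view described in this subsection, the following Python code draws one synthetic multiview data set in which every view has its own zero-mean MoG noise. The decomposition `U @ (V_shared + V_spec).T`, the function names and all numerical values are assumptions made for this example; they are one possible reading of "view-specific coefficients supplemented by shared coefficients" and are not the exact form of Eq. (1).

```python
import numpy as np

def sample_mog_noise(rng, shape, pi, sigma):
    """Draw zero-mean MoG noise: pick a component per entry, then scale a Gaussian."""
    pi = np.asarray(pi, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    z = rng.choice(len(pi), size=shape, p=pi)        # latent component labels
    return rng.normal(0.0, 1.0, size=shape) * sigma[z]

def sample_multiview(rng, mog_params, d=30, n=40, rank=2):
    """Hypothetical generator: low-rank signal per view plus view-specific MoG noise."""
    V_shared = rng.normal(size=(n, rank))             # coefficients shared by all views
    views = []
    for pi, sigma in mog_params:
        U = rng.normal(size=(d, rank))                # view-specific subspace
        V_spec = 0.1 * rng.normal(size=(n, rank))     # view-specific coefficients
        clean = U @ (V_shared + V_spec).T
        views.append(clean + sample_mog_noise(rng, (d, n), pi, sigma))
    return views

rng = np.random.default_rng(0)
# Each view gets its own (mixing proportions, standard deviations).
mog_params = [([0.9, 0.1], [0.05, 2.0]),   # mostly weak noise, occasional outliers
              ([1.0, 0.0], [0.05, 2.0]),   # purely weak Gaussian noise
              ([0.6, 0.4], [0.05, 2.0])]   # heavily corrupted view
X = sample_multiview(rng, mog_params)
```

The three views share the same component variances but differ in their mixing proportions, which mimics the situation where one camera sees a large occluding object and another does not.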
III-B2 Shared MoG modeling on all views of noises
In order to encode the correlation of noise among different views, we assume that there is a prior distribution on , which is related to a KL divergence regularization term. All of the model variables and MoG parameters can then be inferred by MAP estimation. After marginalizing the variable , the posterior of can be written as:
(4)
Then our method needs to solve the problem of minimizing the following objective (negative log of Eq. (4)):
(5) 
where
(6)  
where is the index set of the nonmissing entries in .
For regularization term , we can easily use the following KL divergence form:
(7)  
which is easily understood as linking the MoG noise parameters of all views to a latent common one with parameters and . This KL-divergence term corresponds to an improper prior (i.e., a prior distribution that does not integrate to 1 [29]) imposed on the local MoG parameters and . The MAP model (4) is thus theoretically sound. More theoretical explanations of the model are presented in Section 5.
Note that the physical meaning of the objective function (5) can be easily interpreted. The first term is the likelihood term, which mainly aims to fit the input data ; the second encodes the prior knowledge on the deterministic parameters ; and the third regularizes each stochastic variable in the model by pulling it close to a latent shared distribution.
IV EM algorithm for solving MSLRMoG
The EM algorithm is readily applied to solve (5). The algorithm contains three steps: calculating the expectation of the posterior of the latent variable ; optimizing the MoG noise parameters; and optimizing the model parameters.
E Step: the posterior responsibility of mixture component can be calculated by
(8) 
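For intuition, the responsibility computation for a zero-mean MoG can be sketched as follows. This is the textbook E step for a Gaussian mixture applied to residuals and is given under the assumption that the components are zero-mean; it is not a transcription of Eq. (8), and the names `e_step`, `residual`, `pi` and `sigma2` are chosen here for illustration.

```python
import numpy as np

def e_step(residual, pi, sigma2):
    """Responsibilities gamma[..., k] of each zero-mean Gaussian component.

    residual : array of reconstruction errors for one view
    pi       : (K,) mixing proportions, summing to 1
    sigma2   : (K,) component variances
    """
    e = np.asarray(residual, dtype=float)[..., None]          # add a component axis
    log_pi = np.log(np.asarray(pi, dtype=float))
    sigma2 = np.asarray(sigma2, dtype=float)
    # log of pi_k * N(e | 0, sigma2_k) for every entry and component
    log_w = log_pi - 0.5 * np.log(2.0 * np.pi * sigma2) - 0.5 * e**2 / sigma2
    log_w -= log_w.max(axis=-1, keepdims=True)                # stabilise the normalisation
    w = np.exp(log_w)
    return w / w.sum(axis=-1, keepdims=True)

gamma = e_step(np.array([[0.01, -0.02], [3.0, 0.0]]),
               pi=[0.8, 0.2], sigma2=[0.01, 4.0])
```

Small residuals are assigned mostly to the narrow component, while the residual of 3.0 is assigned almost entirely to the wide one. Working in the log domain avoids underflow when a residual is far in the tail of a narrow component.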
We can then perform the M step by optimizing the corresponding bound w.r.t. and :
(9) 
where
(10) 
M Step for updating MoG parameters : One way to address problem (9) is to take the derivatives w.r.t. all the MoG parameters and set them to zero (the derivation is given in A1 of the supplementary material). The updating formulas for these parameters are as follows:
Update and : Referring to [30], the closed-form updating equations of and are:
(11)  
Update and : The parameters and also have a closed-form solution in the M step:
(12) 
Both parameters of the common MoG noise can be easily explained: is the harmonic mean of and is proportional to the geometric mean of .
M Step for model variables : The related terms in Eq. (9) can be equivalently reformulated as follows:
(13) 
where represents the Hadamard product and the element of the indicator matrix , which has the same size as , is
(14) 
There exist many off-the-shelf algorithms [31, 32, 33] for tackling Eq. (13). We adopt alternating least squares (ALS) owing to its simplicity and effectiveness. The detailed steps of MSLRMoG are provided in Algorithm 1.
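A minimal sketch of ALS for a generic weighted low-rank approximation problem is given below. It assumes an objective of the form sum_ij W_ij (X_ij - (U V^T)_ij)^2 with given non-negative weights, which is the usual shape of such subproblems; the exact weights and the shared/specific coefficient structure of Eq. (13) are not reproduced. The function name `weighted_als` and the small ridge term are assumptions of this example.

```python
import numpy as np

def weighted_als(X, W, rank, n_iter=50, ridge=1e-6, seed=0):
    """Minimise sum_ij W_ij * (X_ij - (U V^T)_ij)^2 by alternating least squares.

    W holds non-negative per-entry weights (0 marks a missing entry).
    """
    rng = np.random.default_rng(seed)
    d, n = X.shape
    U = rng.normal(size=(d, rank))
    V = rng.normal(size=(n, rank))
    reg = ridge * np.eye(rank)                      # small ridge keeps the solves well posed
    for _ in range(n_iter):
        for i in range(d):                          # update row i of U with V fixed
            Vw = V * W[i][:, None]
            U[i] = np.linalg.solve(V.T @ Vw + reg, Vw.T @ X[i])
        for j in range(n):                          # update row j of V with U fixed
            Uw = U * W[:, j][:, None]
            V[j] = np.linalg.solve(U.T @ Uw + reg, Uw.T @ X[:, j])
    return U, V

rng = np.random.default_rng(1)
X_true = rng.normal(size=(20, 2)) @ rng.normal(size=(2, 25))   # exact rank-2 matrix
mask = (rng.random(X_true.shape) < 0.8).astype(float)          # roughly 80% of entries observed
U, V = weighted_als(X_true, mask, rank=2, n_iter=100)
rel_err = np.linalg.norm(mask * (X_true - U @ V.T)) / np.linalg.norm(mask * X_true)
```

Here the weights are simply a 0/1 observation mask. In an EM setting the weights would instead be recomputed from the E-step quantities at every outer iteration, so each ALS call solves a different weighted problem.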
The memory consumption and complexity of MSLRMoG are and , where is the iteration number of the inner loop in ALS, and is the iteration number of the outer loop in MSLRMoG. This complexity is comparable to or even less than those of the current MSL methods [9, 8, 1, 4]. In our experiments, the algorithm performance is not sensitive to the setting of ; we simply specify it empirically as a small number.
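The interpretation of the shared noise parameters given earlier in this section (a harmonic mean of variances and a normalized geometric mean of mixing coefficients) can be checked numerically. The sketch below assumes that the shared distribution appears as the first argument of each KL term and that all views are weighted equally; both are assumptions about the form of the regularizer, and the local parameter values are made up for the example.

```python
import numpy as np

def kl_gauss0(s2_p, s2_q):
    """KL( N(0, s2_p) || N(0, s2_q) )."""
    return 0.5 * (np.log(s2_q / s2_p) + s2_p / s2_q - 1.0)

def kl_cat(p, q):
    """KL between two categorical distributions with strictly positive entries."""
    return float(np.sum(p * np.log(p / q)))

local_s2 = np.array([0.5, 2.0, 4.0])                 # hypothetical per-view variances
local_pi = np.array([[0.7, 0.2, 0.1],
                     [0.5, 0.3, 0.2],
                     [0.6, 0.1, 0.3]])              # hypothetical per-view mixing weights

# Candidate closed forms for the shared parameters.
s2_harm = 1.0 / np.mean(1.0 / local_s2)              # harmonic mean of the variances
pi_geo = np.exp(np.mean(np.log(local_pi), axis=0))
pi_geo /= pi_geo.sum()                               # normalised geometric mean

def total_kl_s2(s2):
    return sum(kl_gauss0(s2, v) for v in local_s2)

def total_kl_pi(pi):
    return sum(kl_cat(pi, q) for q in local_pi)

# Brute-force comparison: a fine grid for the variance, random simplex points for pi.
grid = np.linspace(0.05, 6.0, 20000)
s2_best = grid[np.argmin(total_kl_s2(grid))]
trials = np.random.default_rng(0).dirichlet(2.0 * np.ones(3), size=5000)
pi_is_best = all(total_kl_pi(t) >= total_kl_pi(pi_geo) - 1e-12 for t in trials)
```

The grid minimizer `s2_best` agrees with `s2_harm` up to the grid spacing, and no random point on the simplex attains a lower total KL divergence than `pi_geo`.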
V KL divergence regularization
V-A Theoretical explanation
Relationship to conjugate prior: For the MoG parameters , defined in each view, the KL regularization term in Eq. (7) can be explained from the perspective of conjugate priors. The work in [34] shows the relationship between KL divergence and the conjugate prior for full exponential family distributions. Here we show that this conclusion also holds for all exponential family distributions. The result can be summarized as:
Theorem 1 If a distribution belongs to the exponential family with the form: with natural parameter , and its conjugate prior follows: then we have:
where and is a constant independent of .
Specifically, by giving a Dirichlet distribution prior and an Inverse-Gamma distribution prior as:
we can deduce the KL-divergence regularization terms in Eq. (7) for and under the MAP framework.
Theorem 2 If and are the same type of exponential family distribution , then
where is the Bregman divergence with convex function , which can be defined as .
This conclusion is proven in [35]. Rather than evaluating an integral, the theorem gives a fast way to calculate the KL divergence between two exponential family distributions of the same type.
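The practical benefit can be illustrated with the standard identity KL(p_{theta1} || p_{theta2}) = B_F(theta2, theta1), where F is the log-normalizer of the family. The sketch below checks it for zero-mean Gaussians written as exp(theta x^2 - F(theta)) with theta = -1/(2 sigma^2). The parameterization, the function names and the numerical values are choices made for this example and are not taken from the theorem statement above.

```python
import numpy as np

def log_normalizer(theta):
    """F(theta) for N(0, sigma^2) with natural parameter theta = -1 / (2 sigma^2)."""
    return 0.5 * np.log(2.0 * np.pi) - 0.5 * np.log(-2.0 * theta)

def grad_log_normalizer(theta):
    """F'(theta) = E[x^2] = sigma^2."""
    return -0.5 / theta

def bregman(theta_a, theta_b):
    """B_F(theta_a, theta_b) = F(a) - F(b) - (a - b) * F'(b)."""
    return (log_normalizer(theta_a) - log_normalizer(theta_b)
            - (theta_a - theta_b) * grad_log_normalizer(theta_b))

def kl_numeric(s2_p, s2_q, half_width=60.0, num=400001):
    """KL(N(0, s2_p) || N(0, s2_q)) by a Riemann sum on a uniform grid."""
    x = np.linspace(-half_width, half_width, num)
    log_p = -0.5 * np.log(2 * np.pi * s2_p) - 0.5 * x**2 / s2_p
    log_q = -0.5 * np.log(2 * np.pi * s2_q) - 0.5 * x**2 / s2_q
    dx = x[1] - x[0]
    return float(np.sum(np.exp(log_p) * (log_p - log_q)) * dx)

s2_p, s2_q = 1.5, 4.0
theta_p, theta_q = -0.5 / s2_p, -0.5 / s2_q
kl_via_bregman = bregman(theta_q, theta_p)    # note the swapped argument order
kl_via_integral = kl_numeric(s2_p, s2_q)
kl_closed_form = 0.5 * (np.log(s2_q / s2_p) + s2_p / s2_q - 1.0)
```

All three quantities agree to numerical precision, while the Bregman form needs only a handful of scalar operations instead of an integral.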
KL divergence average: For the parameters defined in the shared latent noise distribution, it corresponds to a KL divergence average problem. Specifically, we can prove the following theorem:
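Table I: PSNR values of the face images recovered by the competing methods under three noise types (three views each).
Noise type  | Gaussian              | Sparse                | Mixture
View        | 1     | 2     | 3     | 1     | 2     | 3     | 1     | 2     | 3
Noise image | 14.01 | 13.95 | 13.96 | 11.82 | 11.67 | 11.73 | 3.62  | 3.75  | 3.63
JIVE        | 22.69 | 22.72 | 22.75 | 21.19 | 21.37 | 21.49 | 14.15 | 14.15 | 14.24
CSRL        | 23.14 | 23.38 | 23.03 | 21.79 | 22.05 | 21.74 | 14.37 | 14.33 | 14.41
MISL        | 23.17 | 23.24 | 22.98 | 22.03 | 22.26 | 21.88 | 19.74 | 19.56 | 19.15
MSL         | 21.30 | 21.59 | 21.31 | 25.90 | 26.89 | 25.83 | 25.04 | 25.95 | 25.48
MSLRMoG     | 23.20 | 23.65 | 23.37 | 25.77 | 26.69 | 25.98 | 25.40 | 26.56 | 25.83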
Theorem 3 If distributions and , belong to the same kind of full exponential family distribution, which means they all have the form: then the solution of the problem: is
For instance, the natural parameter of the Gaussian distribution is , so their KL divergence average is , which leads to the result in Eq. (12).
V-B Joint regularization for mixture distribution
Generally speaking, a mixture of full exponential family distributions does not belong to the full exponential family, so for such a mixture distribution we can use an independent KL divergence for each parameter of the distribution, as in Eq. (7). However, this approach requires many hyperparameters to be set. In fact, we can further prove that the joint distribution of the observed variables and the latent variable is exactly a full exponential family distribution:
Theorem 4 If the distributions all belong to the full exponential family with natural parameter and (, and ), then, belongs to the exponential family.
Moreover, we can also prove:
Theorem 5 For any two distributions and with their marginal distributions and , the following inequality holds
From the theorem, it is easy to see that constitutes an upper bound of the KL divergence between original two mixture distributions, and thus can be rationally used as a regularization term for the original mixture distribution. For example, the joint KL divergence regularization for MoG is
(15) 
where . This regularization term also leads to a simple solution, by setting and in Eq. (11) and let
The solutions for the other variables and parameters are unchanged. The joint KL divergence regularization has only one trade-off parameter, which is generally easy to set. The proofs of all theorems are given in the supplementary material.
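The upper-bound property used in this subsection (the KL divergence between two joint distributions over observed and latent variables is never smaller than the KL divergence between their marginals) can be checked numerically for two MoGs. The sketch below assumes zero-mean components that are matched by index across the two mixtures; the parameter values and function names are made up for the example.

```python
import numpy as np

def kl_gauss0(s2_p, s2_q):
    """KL( N(0, s2_p) || N(0, s2_q) ), elementwise over components."""
    return 0.5 * (np.log(s2_q / s2_p) + s2_p / s2_q - 1.0)

def joint_kl_mog(pi_p, s2_p, pi_q, s2_q):
    """KL between the joints p(x, z) and q(x, z) of two MoGs with matched components."""
    return float(np.sum(pi_p * (np.log(pi_p / pi_q) + kl_gauss0(s2_p, s2_q))))

def mog_pdf(x, pi, s2):
    """Density of a zero-mean MoG evaluated at the points x."""
    comp = np.exp(-0.5 * x[:, None] ** 2 / s2) / np.sqrt(2.0 * np.pi * s2)
    return comp @ pi

def marginal_kl_mog(pi_p, s2_p, pi_q, s2_q, half_width=40.0, num=200001):
    """KL between the marginals p(x) and q(x), by a Riemann sum on a uniform grid."""
    x = np.linspace(-half_width, half_width, num)
    p = mog_pdf(x, pi_p, s2_p)
    q = mog_pdf(x, pi_q, s2_q)
    dx = x[1] - x[0]
    return float(np.sum(p * (np.log(p) - np.log(q))) * dx)

pi_p, s2_p = np.array([0.8, 0.2]), np.array([0.1, 9.0])
pi_q, s2_q = np.array([0.5, 0.5]), np.array([0.3, 5.0])

kl_joint = joint_kl_mog(pi_p, s2_p, pi_q, s2_q)
kl_marginal = marginal_kl_mog(pi_p, s2_p, pi_q, s2_q)
```

The joint divergence has a closed form, whereas the marginal divergence between two mixtures has none and must be integrated numerically. This is one reason the joint form is convenient as a surrogate regularizer.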
VI Experimental results
To qualitatively and quantitatively evaluate the performance of the proposed multiview subspace learning method with complex noise, we conduct three types of experiments: face image reconstruction, and background subtraction on multiview and on RGB data. We compare our method with JIVE [9], CSRL [8], MISL [1] and MSL [4], which represent the state of the art in MSL. Most parameters of the compared methods are set to their default values, and the rank is set the same for all methods. For MSLRMoG, the number of MoG components is set to 3 in all cases, except for the face experiments with Gaussian noise, where it is set to 2. The model parameters , and and are set as , and , respectively. Besides, we use the joint regularization in Eq. (15) with strength parameter throughout all our experiments.
VI-A Background Subtraction on Multiview data
In this experiment, our method is applied to the problem of background subtraction. Two multicamera pedestrian videos [2, 3], shot by 4 cameras located at different angles, are employed: Passageway and Laboratory. All frames in the videos are of size 288×360. Without loss of generality, we resize the original frames to 144×180. From each of the Passageway and Laboratory sequences, 200 frames are extracted to compose the learning data, starting at the first frame and ending at the 1000th frame (taking the first of every 5 frames). JIVE, CSRL, MISL, MSL and our MSLRMoG method are applied to these videos, and the rank is set to 2 for all methods. Fig. 3 shows the results obtained by all competing methods on some typical frames of the multiview video data.
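The frame selection and vectorization described above can be sketched as follows. The block-averaging downsampler is an assumption of this example (the resizing method actually used is not stated), and the tiny synthetic `video` array merely stands in for one camera view so that the snippet runs quickly. With real 288×360 frames and a factor of 2, each column would hold 144 × 180 = 25920 pixels.

```python
import numpy as np

def build_data_matrix(video, step=5, n_frames=200, factor=2):
    """Subsample frames in time, shrink them spatially, and stack them as columns.

    video : array of shape (T, H, W) holding grayscale frames
    """
    frames = video[0:step * n_frames:step]                  # frames 0, 5, ..., 995
    t, h, w = frames.shape
    # Downsample by block averaging; assumes H and W are divisible by `factor`.
    small = frames.reshape(t, h // factor, factor, w // factor, factor).mean(axis=(2, 4))
    return small.reshape(t, -1).T                           # shape (pixels, frames)

# Synthetic stand-in for one camera view: 1000 small frames of 8 x 10 pixels.
video = np.random.default_rng(0).random((1000, 8, 10))
X_view = build_data_matrix(video)
```

Each column of `X_view` is one vectorized frame, so a low-rank approximation of this matrix captures the slowly varying background while moving foreground objects end up in the residual.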
From the figure, it can be seen that the background image obtained by our proposed method is clearer in its details. Compared with most other competing methods, MSLRMoG extracts the foreground objects more accurately. As shown in Fig. 3, MSLRMoG decomposes the foreground into three components of different extents, each with its own physical meaning: (1) moving objects in the foreground; (2) shadows alongside the foreground objects due to lighting changes and people walking; (3) background variation caused mostly by camera noise. As most existing methods merge the object, its shadow and the background noise, the foreground they extract is relatively coarse. Besides, in order to illustrate how the KL divergence regularization works, we set the strength parameter to different values in the Laboratory video experiments and plot the curves of the largest variance and its mixing coefficient in Fig. 4. We find that the KL divergence regularization makes the noise distributions of different views share a certain similarity.
VI-B Face Images Recovery Experiments
This experiment aims to test the effectiveness of the proposed MSLRMoG method in face image reconstruction. The CMU MultiPIE face dataset [36], which includes 337 subjects with images of size 128×96 captured under multiple poses and expressions, is used. 200 subjects are randomly selected, and each subject contains 3 views ( , , ). In this experiment, we add different types of noise or outliers to the original images: (1) Gaussian noise $\mathcal{N}(0, 0.15)$ (shown in row (a1) of Fig. 6); (2) sparse noise: random 20% noise (shown in row (a2) of Fig. 6); (3) mixture noise: Gaussian noise $\mathcal{N}(0, 0.02)$ + block occlusion (with salt-and-pepper noise inside) + 20% sparse noise (shown in row (a3) of Fig. 6). The comparison methods include JIVE, CSRL, MISL and MSL, and the rank is set as in this experiment. Different views of the reconstructed images obtained by the different methods are shown in Fig. 6. The PSNR values of the images obtained by the different methods are listed in Table I.
From the figure and table, it is easy to observe that the proposed MSLRMoG method is capable of finely recovering the clean faces in the various noise cases, especially in the case of the more complicated mixture noise. MSL performs well on sparse noise, but not on Gaussian noise. JIVE and CSRL do not work well on the sparse and mixture noise, since they implicitly assume the noise to be i.i.d. Gaussian. MISL performs slightly better than JIVE and CSRL as a result of its use of the Cauchy loss, which is more robust than the $L_2$ loss.
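For reference, PSNR can be computed with the usual definition 10 log10(peak^2 / MSE). The sketch below assumes pixel intensities scaled to [0, 1], so `peak=1.0`; the scaling convention used for Table I is not stated, and the function name is chosen here for illustration.

```python
import numpy as np

def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB; `peak` is the maximum possible pixel value."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")                      # identical images
    return float(10.0 * np.log10(peak ** 2 / mse))

clean = np.full((4, 4), 0.5)
recovered = clean + 0.1                          # uniform error of 0.1, so MSE = 0.01
value = psnr(clean, recovered)                   # 10 * log10(1 / 0.01) = 20 dB
```

Because PSNR is logarithmic in the mean squared error, a gain of about 3 dB corresponds to roughly halving the MSE.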
Table II: F-measure of the competing methods on the Li dataset.
Video   | JIVE   | CSRL   | MISL   | MSL    | MSLRMoG
air.    | 0.6450 | 0.6476 | 0.6536 | 0.6722 | 0.6770
boo.    | 0.6553 | 0.6622 | 0.6619 | 0.6902 | 0.6778
sho.    | 0.7145 | 0.7186 | 0.7205 | 0.7274 | 0.7264
lob.    | 0.5426 | 0.5442 | 0.5444 | 0.6771 | 0.7781
esc.    | 0.5911 | 0.5947 | 0.5945 | 0.6142 | 0.5994
cur.    | 0.5083 | 0.5393 | 0.5403 | 0.6198 | 0.7189
cam.    | 0.4186 | 0.4190 | 0.4199 | 0.4609 | 0.4405
wat.    | 0.6028 | 0.6026 | 0.6029 | 0.8647 | 0.8709
fou.    | 0.5857 | 0.5895 | 0.6341 | 0.7148 | 0.7181
Average | 0.5849 | 0.5908 | 0.5969 | 0.6713 | 0.6897
VI-C Background Subtraction on RGB data
MSLRMoG is further applied to background subtraction on RGB data. In this experiment, we regard the three channels (red, green and blue) of a video as three different views. Different channels do have different residual distributions, since the R, G and B sensors are sensitive to distinct spectral bands. However, all sensors observe the same objects, so the residual distributions also have a certain degree of similarity. The noise in this data thus has a more complicated non-i.i.d. structure than that assumed by current MSL methods.
Fig. 5 shows the results on some typical frames of the Li dataset for all competing methods, including JIVE, CSRL, MISL, MSL and MSLRMoG. It can be observed that MSLRMoG achieves a clearer background, and that the foreground objects extracted by our method are also visually better. Table II lists the F-measure of all methods on the Li dataset, which quantitatively shows the better average performance of the proposed method.
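The F-measure combines precision and recall of the detected foreground pixels as F = 2PR / (P + R). A minimal sketch is given below; it assumes that binary foreground masks are already available, since how the residual is thresholded into a mask is not specified here, and the function name is illustrative.

```python
import numpy as np

def f_measure(pred_mask, true_mask):
    """F-measure (harmonic mean of precision and recall) of a binary foreground mask."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    tp = np.sum(pred & true)                     # foreground pixels correctly detected
    fp = np.sum(pred & ~true)                    # background pixels flagged as foreground
    fn = np.sum(~pred & true)                    # foreground pixels that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return float(2 * precision * recall / (precision + recall))

true_mask = np.array([[1, 1, 0], [0, 0, 0]])
pred_mask = np.array([[1, 0, 1], [0, 0, 0]])
score = f_measure(pred_mask, true_mask)          # precision = recall = 0.5, so F = 0.5
```

Unlike plain pixel accuracy, this score is not dominated by the large number of background pixels, which is why it is a common choice for foreground detection.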
VII Conclusion
Current multiview subspace learning (MSL) methods mainly emphasize the deterministic shared knowledge of data, but ignore the complexity, non-consistency and similarity of noise. This deviates from most real cases, in which the noise between views is more complicated, and degrades their robustness in practice. This paper has proposed a new MSL method, which is the first to investigate this MSL noise issue and to formulate a model capable of both adapting to intra-view noise complexity (through a parametric MoG) and conveying inter-view noise correlation (through KL-divergence regularization). Its novelty lies in both the issue investigated and the methodology designed (regularized noise modeling). Further, we give a detailed theoretical explanation of this regularization term. Experiments show that the new method is potentially useful as a complement to previous MSL research for further enhancing performance in complex noise scenarios.
References
 [1] C. Xu, D. Tao, and C. Xu, “Multiview intact space learning,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 37, no. 12, pp. 2531–2544, 2015.
 [2] J. Berclaz, F. Fleuret, E. Turetken, and P. Fua, “Multiple Object Tracking using KShortest Paths Optimization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011.
 [3] F. Fleuret, J. Berclaz, R. Lengagne, and P. Fua, “MultiCamera People Tracking with a Probabilistic Occupancy Map,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 2, pp. 267–282, February 2008.
 [4] M. White, X. Zhang, D. Schuurmans, and Y.l. Yu, “Convex multiview subspace learning,” in Advances in Neural Information Processing Systems, 2012, pp. 1673–1681.
 [5] Y. Jia, M. Salzmann, and T. Darrell, “Factorized latent spaces with structured sparsity,” in Advances in Neural Information Processing Systems, 2010, pp. 982–990.
 [6] Z. Ding and Y. Fu, “Lowrank common subspace for multiview learning,” in 2014 IEEE International Conference on Data Mining. IEEE, 2014, pp. 110–119.
 [7] Z. Zhu, L. Du, L. Zhang, and Y. Zhao, “Shared subspace learning for latent representation of multiview data.”
 [8] Y. Guo, “Convex subspace representation learning from multiview data.” in AAAI, vol. 1, 2013, p. 2.
 [9] E. F. Lock, K. A. Hoadley, J. S. Marron, and A. B. Nobel, “Joint and individual variation explained (jive) for integrated analysis of multiple data types,” The annals of applied statistics, vol. 7, no. 1, p. 523, 2013.

 [10] P. Xu, Q. Yin, Y. Huang, Y.Z. Song, Z. Ma, L. Wang, T. Xiang, W. B. Kleijn, and J. Guo, “Crossmodal subspace learning for finegrained sketchbased image retrieval,” Neurocomputing, 2017.
 [11] Z. Ma, J.H. Xue, A. Leijon, Z.H. Tan, Z. Yang, and J. Guo, “Decorrelation of neutral vector variables: Theory and applications,” IEEE transactions on neural networks and learning systems, 2016.
 [12] H. Hotelling, “Relations between two sets of variates,” Biometrika, vol. 28, no. 3/4, pp. 321–377, 1936.
 [13] S. Akaho, “A kernel method for canonical correlation analysis,” arXiv preprint cs/0609071, 2006.
 [14] C. Archambeau and F. R. Bach, “Sparse probabilistic projections,” in Advances in neural information processing systems, 2009, pp. 73–80.
 [15] M. A. Nicolaou, Y. Panagakis, S. Zafeiriou, and M. Pantic, “Robust canonical correlation analysis: Audiovisual fusion for learning continuous interest,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 1522–1526.
 [16] F. R. Bach and M. I. Jordan, “A probabilistic interpretation of canonical correlation analysis,” 2005.
 [17] A. Goldberg, B. Recht, J. Xu, R. Nowak, and X. Zhu, “Transduction with matrix completion: Three birds with one stone,” in Advances in neural information processing systems, 2010, pp. 757–765.
 [18] R. S. Cabral, F. De la Torre, J. P. Costeira, and A. Bernardino, “Matrix completion for multilabel image classification.” in NIPS, vol. 201, no. 1, 2011, p. 2.
 [19] C. Christoudias, R. Urtasun, and T. Darrell, “Multiview learning in the presence of view disagreement,” arXiv preprint arXiv:1206.3242, 2012.
 [20] B. Behmardi, C. Archambeau, and G. Bouchard, “Overlapping trace norms in multiview learning,” arXiv preprint arXiv:1404.6163, 2014.
 [21] A. Shon, K. Grochow, A. Hertzmann, and R. P. Rao, “Learning shared latent structure for image synthesis and robotic imitation,” in Advances in Neural Information Processing Systems, 2005, pp. 1233–1240.

 [22] M. Gönen and E. Alpaydın, “Multiple kernel learning algorithms,” The Journal of Machine Learning Research, vol. 12, pp. 2211–2268, 2011.
 [23] Z. Xu, R. Jin, H. Yang, I. King, and M. R. Lyu, “Simple and efficient multiple kernel learning by group lasso,” in Proceedings of the 27th international conference on machine learning (ICML10), 2010, pp. 1175–1182.
 [24] I. Mizera and C. H. Müller, “Breakdown points of cauchy regression-scale estimators,” Statistics & probability letters, vol. 57, no. 1, pp. 79–89, 2002.
 [25] G. Zhou, A. Cichocki, Y. Zhang, and D. P. Mandic, “Group component analysis for multiblock data: Common and individual feature extraction,” 2015.
 [26] Y. Panagakis, M. Nicolaou, S. Zafeiriou, and M. Pantic, “Robust correlated and individual component analysis,” 2015.
 [27] P. J. Huber, Robust statistics. Springer, 2011.
 [28] V. Maz’ya and G. Schmidt, “On approximate approximations using gaussian kernels,” IMA Journal of Numerical Analysis, vol. 16, no. 1, pp. 13–29, 1996.
 [29] R. Christensen, W. Johnson, A. Branscum, and T. E. Hanson, Bayesian Ideas and Data Analysis: An Introduction for Scientists and Statisticians. Hoboken: CRC Press, 2010.
 [30] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the em algorithm,” Journal of the royal statistical society. Series B (methodological), pp. 1–38, 1977.
 [31] N. Srebro, T. Jaakkola et al., “Weighted lowrank approximations,” in ICML, vol. 3, 2003, pp. 720–727.
 [32] A. M. Buchanan and A. W. Fitzgibbon, “Damped newton algorithms for matrix factorization with missing data,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 2. IEEE, 2005, pp. 316–322.
 [33] F. De La Torre and M. J. Black, “A framework for robust subspace learning,” International Journal of Computer Vision, vol. 54, no. 13, pp. 117–142, 2003.
 [34] H. Yong, D. Meng, W. Zuo, and L. Zhang, “Robust online matrix factorization for dynamic background subtraction,” IEEE transactions on pattern analysis and machine intelligence, 2017.
 [35] F. Nielsen and V. Garcia, “Statistical exponential families: A digest with flash cards,” arXiv preprint arXiv:0911.4863, 2009.
 [36] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker, “Multipie,” Image and Vision Computing, vol. 28, no. 5, pp. 807–813, 2010.