1 Introduction
Learning representations that generalize well and are invariant to common nuisance transformations of the data is becoming increasingly important. Indeed, being invariant to intra-class transformations while being discriminative to between-class transformations can be said to be one of the fundamental problems in pattern recognition. The nuisance transformations can give rise to many ‘degrees of freedom’ even in a constrained task such as face recognition (e.g. pose, age variation, illumination etc.). Explicitly factoring them out leads to improvements in recognition performance, as found in pal2016discriminative ; leibo2014subtasks ; hinton1987learning . It has also been shown that features that are explicitly invariant to intra-class transformations allow the sample complexity of the recognition problem to be reduced AnselmiLRMTP13 . To this end, the study of invariant representations and machinery built on the concept of explicit invariance is important.
Invariance through Data Augmentation. Many approaches in the past have enforced invariance by generating transformed labelled training samples in some form, such as Poggio92recognitionand ; scholkopf2002learning ; scholkopf1998 ; Niyogi98incorporatingprior ; Reisert_2008 ; Haasdonk07invariantkernel . Perhaps one of the most popular methods for incorporating invariances in SVMs is the virtual support vector (VSV) method in incorporatinginvariances , which used sequential runs of SVMs in order to find and augment the support vectors with transformed versions of themselves.
Indecipherable transformations in data lead to a shortage of transformed labelled samples. The above approaches, however, assume that one has explicit knowledge about the transformation. This is a strong assumption. Indeed, in most general machine learning applications, the transformation present in the data is not clear and cannot be modelled easily, e.g. transformations between different views of a general 3D object and between different sentences articulated by the same person. Methods which work on generating invariance by explicitly transforming or augmenting labelled training data cannot be applied to these scenarios. Further, in cases where we do know the transformations that exist and we actually can model them, it is difficult to generate transformed versions of very large labelled datasets. Hence there arises an important problem: how do we train models to be invariant to transformations in test data when we do not have access to transformed labelled training samples?
Availability of unlabeled transformed data. Although it is difficult to obtain or generate transformed labelled data (due to the reasons mentioned above), unlabeled transformed data is more readily available. For instance, if different views of specific objects of interest are not available, one can simply collect views of general objects. Also, if different sentences spoken by a specific group of people are not available, one can simply collect those spoken by members of the general population. In both these scenarios, no explicit knowledge or model of the transformation is needed, thereby bypassing the problem of indecipherable transformations. This situation is common in vision, e.g. only unlabeled transformed images are observed, but has so far mostly been addressed by the community through intense efforts in large-scale data collection. Note that the transformed data that is collected is not required to be labelled. We are now in a position to state the central problem that this paper addresses.
The Unlabeled Transformation (UT) Problem:
Having access to transformed versions of the unlabeled training data but not of the labelled data, how do we learn a discriminative model of the labelled data while being invariant to the transformations present in the unlabeled data?
Overall approach. The approach presented in this paper (see Fig. 1) can solve this problem: it learns invariance to transformations observed only through unlabeled samples and does not need labelled training data augmentation. We explicitly and simultaneously address both problems of generating invariance to intra-class transformations (through invariant kernels) and being discriminative to between-class transformations (through max-margin classifiers). Given a new test sample, the final extracted feature is invariant to the transformations observed in the unlabeled set, and thereby generalizes using just a single example. This is an example of one-shot learning.
Prior Art: Invariant Kernels. Kernel methods in machine learning have long been studied in considerable depth. Nonetheless, the study of invariant kernels and techniques to extract invariant features has received much less attention. An invariant kernel allows the kernel product to remain invariant under transformations of the inputs. Most instances of incorporating invariances focused on local invariances through regularization and optimization, such as incorporatinginvariances ; scholkopf1998 ; decoste_2002 ; zhang2013learning . Other techniques were jittering kernels scholkopf2002learning ; decoste_2002 and tangent-distance kernels Haasdonk_tangent_distance , both of which sacrificed the positive semi-definite property of their kernels and were computationally expensive. Though these methods have had some success, most of them still lack explicit theoretical guarantees of invariance. The proposed invariant kernel SVM formulation, on the other hand, develops a valid PSD kernel that is guaranteed to be invariant. Haasdonk07invariantkernel used group integration to arrive at invariant kernels but did not address the Unlabeled Transformation problem, which our proposed kernels do address. Further, our proposed kernels allow for the formulation of the invariant SVM and application to large-scale problems. Recently, RajKuhMroFleSch17 presented some work with invariant kernels. However, unlike our non-parametric formulation, they do not learn the group transformations from the data itself and assume known parametric transformations (i.e. they assume that the transformation is computable).
Key ideas. The key ideas in this paper are twofold.

The first is to model transformations using unitary groups (or subgroups), leading to unitary-group invariant kernels. Unitary transforms preserve the dot product, allow for interesting generalization properties leading to low sample complexity, and also allow learning transformation invariance from unlabeled examples (thereby solving the Unlabeled Transformation Problem). Classes of learning problems, such as vision, often have transformations belonging to a unitary group that one would like to be invariant towards (such as translation and rotation). In practice, however, liao2013 found that invariance to much more general transformations not captured by this model can be achieved.

Secondly, we combine max-margin classifiers with invariant kernels, leading to nonlinear max-margin unitary-group invariant classifiers. These theoretically motivated invariant nonlinear SVMs form the foundation upon which Max-Margin Invariant Features (MMIF) are based. MMIF features can effectively solve the important Unlabeled Transformation Problem. To the best of our knowledge, this is the first theoretically proven formulation of this nature.
Contributions.
In contrast to many previous studies on invariant kernels, we study nonlinear positive semi-definite unitary-group invariant kernels guaranteeing invariance that can address the UT Problem. One of our central theoretical results applies group integration in the RKHS. It builds on the observation that, under unitary restrictions on the kernel map, group action in the input space is reciprocated in the RKHS. Using the proposed invariant kernel, we present a theoretically motivated approach towards a nonlinear invariant SVM that can solve the UT Problem with explicit invariance guarantees. As our main theoretical contribution, we showcase a result on the generalization of max-margin classifiers in group-invariant subspaces. We propose Max-Margin Invariant Features (MMIF) to learn highly discriminative nonlinear features that also solve the UT problem. On the practical side, we propose an approach to face recognition that combines MMIFs with a pretrained deep learning feature extractor (in our case VGG-Face parkhi2015deep ). MMIF features can be used with deep learning whenever there is a need to focus on a particular transformation in data (in our application pose in face recognition) and can further improve performance.
2 Unitary-Group Invariant Kernels
Premise: Consider a dataset of normalized samples along with labels with and . We now introduce into the dataset a number of unitary transformations part of a locally compact unitarygroup . We note again that the set of transformations under consideration need not be the entire unitary group. They could very well be a subgroup. Our augmented normalized dataset becomes . For clarity, we denote by the action of group element on , i.e. . We also define an orbit of under as the set . Clearly, . An invariant function is defined as follows.
Definition 2.1 (Invariant Function).
For any group , we define a function to be invariant if .
One method of generating an invariant towards a group is through group integration. Group integration stems from classical invariant theory and can be shown to be a projection onto an invariant subspace for vector spaces. In such a space and thus the representation is invariant under the transformation of any element from the group . This is ideal for recognition problems where one would want to be discriminative to between-class transformations (e.g. between distinct subjects in face recognition) but be invariant to within-class transformations (e.g. different images of the same subject). The set of transformations we model as are the within-class transformations that we would like to be invariant towards. An invariant to any group can be generated through the following basic, previously known property (Lemma 2.1) based on group integration.
Lemma 2.1.
(Invariance Property) Given a vector , and any affine group , for any fixed and a normalized Haar measure , we have
The Haar measure exists for every locally compact group and is unique up to a positive multiplicative constant (hence normalized). A similar property holds for discrete groups. As a result of Lemma 2.1, the quantity enjoys global invariance (encompassing all elements) to the group . This property allows one to generate an invariant subspace in the inherent space through group integration. In practice, the integral corresponds to a summation over transformed samples. The following two lemmas (Lemma 2.2 and 2.3), which are novel results and part of our contribution, showcase elementary properties of the operator for a unitary group (all proofs are presented in the supplementary material). These properties prove useful in the analysis of unitary-group invariant kernels and features.
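To make the summation form of group integration concrete, the following minimal sketch averages a vector over a small finite unitary group and checks the invariance property numerically. The choice of group (cyclic shifts by multiples of two on a six-dimensional vector) and the helper names `act`, `orbit` and `group_average` are assumptions made purely for illustration.

```python
import random

SHIFTS = (0, 2, 4)  # assumed group: cyclic shifts by multiples of 2 on R^6


def act(k, x):
    """Apply the group element 'shift by k' to the vector x."""
    return x[k:] + x[:k]


def orbit(x):
    """All transformed versions g x of x, one per group element."""
    return [act(k, x) for k in SHIFTS]


def group_average(x):
    """Discrete group integration: the mean of g x over the group."""
    pts = orbit(x)
    return [sum(p[i] for p in pts) / len(pts) for i in range(len(x))]


def close(u, v, tol=1e-12):
    """True if two vectors agree up to floating-point tolerance."""
    return all(abs(a - b) <= tol for a, b in zip(u, v))


random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(6)]
avg = group_average(x)

# Invariance property: acting with any fixed g' leaves the average unchanged.
invariant = all(close(act(k, avg), avg) for k in SHIFTS)
# Every point on the orbit of x is mapped to the same averaged point.
orbit_collapses = all(close(group_average(gx), avg) for gx in orbit(x))
print(invariant, orbit_collapses)
```

Both checks rely only on the group being closed under composition, which is why the same averaging also works for a subgroup rather than the full unitary group.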
Lemma 2.2.
If for unitary , then
Lemma 2.3.
(Unitary Projection) If for any affine , then , i.e. it is a projection operator. Further, if is unitary, then
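The projection property can be checked numerically for a finite group. The sketch below builds the averaged operator as an explicit matrix and verifies that it is idempotent. The group (cyclic shifts by multiples of two in six dimensions) and the helper names are assumptions for illustration; the symmetry check reflects the standard fact that averaging over a unitary group yields an orthogonal projection, which is how we assume the unitary part of the lemma is used.

```python
N = 6
SHIFTS = (0, 2, 4)  # assumed finite unitary group: cyclic shifts by multiples of 2


def shift_matrix(k, n=N):
    """Permutation matrix g with (g x)[i] = x[(i + k) % n]; it is orthogonal."""
    return [[1.0 if j == (i + k) % n else 0.0 for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][m] * b[m][j] for m in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Averaged operator: the mean of all group elements (discrete group integration).
mats = [shift_matrix(k) for k in SHIFTS]
A = [[sum(m[i][j] for m in mats) / len(mats) for j in range(N)] for i in range(N)]

idempotent_err = max_abs_diff(matmul(A, A), A)  # A A = A, i.e. a projection
symmetric_err = max_abs_diff(transpose(A), A)   # A^T = A, i.e. an orthogonal projection
rank = round(sum(A[i][i] for i in range(N)))    # trace of a projection equals its rank
print(idempotent_err, symmetric_err, rank)
```

The rank gives the dimension of the invariant subspace for this toy group, which is much smaller than the ambient dimension.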
Sample Complexity and Generalization. On applying the operator to the dataset , all points in the set for any map to the same point in the invariant subspace thereby reducing the number of distinct points by a factor of (the cardinality of , if is finite). Theoretically, this would drastically reduce sample complexity while preserving linear feasibility (separability). It is trivial to observe that a perfect linear separator learned in would also be a perfect separator for , thus in theory achieving perfect generalization. Generalization here refers to the ability to perform correct classification even in the presence of the set of transformations . We prove a similar result for Reproducing Kernel Hilbert Spaces (RKHS) in Section 2.2. This property is theoretically powerful since cardinality of can be large. A classifier can avoid having to observe transformed versions of any and yet generalize perfectly.
The case of Face Recognition. As an illustration, if the group of transformations considered is pose (it is hypothesized that small changes in pose can be modeled as unitary pal2016discriminative ), then represents a pose invariant subspace. In theory, all poses of a subject will converge to the same point in that subspace leading to near perfect pose invariant recognition.
We have not yet leveraged the power of the unitary structure of the groups, which is also critical in generalization to test cases, as we will see later. We now present our central result showing that unitary kernels allow the unitary group action to reciprocate in a Reproducing Kernel Hilbert Space. This is critical to set the foundation for our core method, Max-Margin Invariant Features.
2.1 Group Actions Reciprocate in a Reproducing Kernel Hilbert Space
Group integration provides exact invariance as seen in the previous section. However, it requires the group structure to be preserved, i.e. if the group structure is destroyed, group integration does not provide an invariant function. In the context of kernels, it is imperative that the group relation between the samples in be preserved in the kernel Hilbert space corresponding to some kernel with a mapping . If the kernel is unitary in the following sense, then this is possible.
Definition 2.2 (Unitary Kernel).
A kernel is a unitary kernel if, for a unitary group , the mapping satisfies .
The unitary condition is fairly general; a common example of a unitary kernel is the RBF kernel. We now define a transformation within the RKHS itself as for any where is a unitary group. We then have the following significant result.
Theorem 2.4.
(Covariance in the RKHS) If is a unitary kernel in the sense of Definition 2.2, then is a unitary transformation, and the set is a unitarygroup in .
Theorem 2.4 shows that the unitary-group structure is preserved in the RKHS. This paves the way for new theoretically motivated approaches to achieve invariance to transformations in the RKHS. There have been a few studies on group invariant kernels Haasdonk07invariantkernel ; pal2016discriminative . However, Haasdonk07invariantkernel does not examine whether the unitary group structure is actually preserved in the RKHS, which is critical. Also, DIKF was recently proposed as a method utilizing group structure under the unitary kernel pal2016discriminative . Our result is a generalization of the theorems they present. Theorem 2.4 shows that since the unitary group structure is preserved in the RKHS, any method involving group integration would be invariant in the original space. The preservation of the group structure allows more direct group invariance results to be applied in the RKHS. It also directly allows one to formulate a nonlinear SVM while guaranteeing invariance theoretically, leading to Max-Margin Invariant Features.
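The unitary-kernel condition can be illustrated with the RBF kernel. The sketch below assumes that the condition amounts to the kernel value being unchanged when the same unitary transformation acts on both arguments; the 2-D rotation, the scaling counterexample and the value of `gamma` are illustrative choices.

```python
import math


def rbf(x, y, gamma=0.5):
    """RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def rotate(theta, x):
    """A 2-D rotation: a unitary (orthogonal) transformation of the plane."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * x[0] - s * x[1], s * x[0] + c * x[1]]


def scale(factor, x):
    """Isotropic scaling: not unitary unless factor is 1."""
    return [factor * v for v in x]


x, y = [1.0, 0.5], [-0.3, 2.0]
base = rbf(x, y)

# Same unitary g applied to both arguments: the kernel value is preserved.
unitary_gap = max(abs(rbf(rotate(t, x), rotate(t, y)) - base)
                  for t in (0.3, 1.1, 2.7))
# A non-unitary transformation changes pairwise distances and hence the kernel.
scaling_gap = abs(rbf(scale(2.0, x), scale(2.0, y)) - base)
print(unitary_gap, scaling_gap)
```

The RBF kernel depends on its inputs only through their Euclidean distance, which any unitary transformation preserves; scaling does not, so it breaks the condition.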
2.2 Invariant Nonlinear SVM: An Alternate Approach Through Group Integration
We now apply the group integration approach to the kernel SVM. The decision function of SVMs can be written in the general form as for some bias (we agglomerate all parameters of in ) where is the kernel feature map, i.e. . Reviewing the SVM, a maximum margin separator is found by minimizing loss functions such as the hinge loss along with a regularizer. In order to invoke invariance, we can now utilize group integration in the kernel space using Theorem 2.4. All points in the set get mapped to for a given in the input space . Group integration then results in an invariant subspace within through using Lemma 2.1. Introducing Lagrange multipliers , the dual formulation (utilizing Lemma 2.2 and Lemma 2.3) then becomes
(1)
under the constraints . The SVM separator is then given by thereby existing in the invariant (or equivalently invariant) subspace within (since is a bijection). Effectively, the SVM observes samples from and therefore enjoys exact global invariance to . Further, is a maximum-margin separator of (i.e. the set of all transformed samples). This can be shown by the following result.
Theorem 2.5.
(Generalization) For a unitary group and unitary kernel , if is a perfect separator for , then is also a perfect separator for with the same margin. Further, a maxmargin separator of is also a maxmargin separator of .
The invariant nonlinear SVM in objective (1) observes samples in the form of and obtains a max-margin separator . This allows for the generalization properties of max-margin classifiers to be combined with those of group invariant classifiers. While being invariant to nuisance transformations, max-margin classifiers can lead to highly discriminative features (more robust than DIKF pal2016discriminative , as we find in our experiments) that are invariant to within-class transformations.
Theorem 2.5 shows that the margins of and are deeply related and implies that is a maxmargin separator for both datasets. Theoretically, the invariant nonlinear SVM is able to generalize to on just observing and utilizing prior information in the form of for all unitary kernels . This is true in practice for linear kernels. For nonlinear kernels in practice, the invariant SVM still needs to observe and integrate over transformed training inputs.
Leveraging unitary group properties. To achieve invariance at test time, the SVM would need to observe and integrate over all possible transformations of the test sample. This is a huge computational and design bottleneck. We would ideally want to achieve invariance and generalize by observing just a single test sample, in effect performing one-shot learning. This would not only be computationally much cheaper but also make the classifier powerful, owing to generalization to full transformed orbits of test samples from observing just that single sample. This is where the unitarity of helps, and we leverage it in the form of the following lemma.
Lemma 2.6.
(Invariant Projection) If for any unitary group , then for any fixed (including the identity element) we have
Assuming is the learned SVM classifier, Lemma 2.6 shows that for any test , the invariant dot product which involves observing all transformations of is equivalent to the quantity which involves observing only one transformation of . Hence one can model the entire orbit of under by a single sample where can be any particular transformation including the identity. This drastically reduces sample complexity and vastly increases the generalization capabilities of the classifier, since one only needs to observe one test sample to achieve invariance. Lemma 2.6 also helps us save computation, allowing us to apply the computationally expensive (group integration) operation only once, on the classifier and not on the test sample. Thus, the kernel in the Invariant SVM formulation can be replaced by the form .
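For the linear case, the consequence of Lemma 2.6 described above can be verified directly. In this minimal sketch a random vector `w` stands in for a learned separator, and the group (cyclic shifts by multiples of two in six dimensions) is an assumption for illustration. The point is that integrating the classifier once gives the same score as integrating both the classifier and the test sample.

```python
import random

SHIFTS = (0, 2, 4)  # assumed finite unitary group acting on R^6 by cyclic shifts


def act(k, x):
    """Apply the group element 'shift by k' to the vector x."""
    return x[k:] + x[:k]


def group_average(x):
    """Discrete group integration: the mean of g x over the group."""
    pts = [act(k, x) for k in SHIFTS]
    return [sum(p[i] for p in pts) / len(pts) for i in range(len(x))]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


random.seed(1)
w = [random.gauss(0.0, 1.0) for _ in range(6)]  # stand-in for a learned separator
x = [random.gauss(0.0, 1.0) for _ in range(6)]  # a single test sample

w_inv = group_average(w)                          # integration applied once, to the classifier
full = dot(w_inv, group_average(x))               # needs every transformation of x
single = [dot(w_inv, act(k, x)) for k in SHIFTS]  # needs only one transformation of x

max_gap = max(abs(s - full) for s in single)
print(max_gap)
```

Any single transformed version of the test sample, including the untransformed one, yields the same score, so the test-time cost does not grow with the size of the group.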
For kernels in general, the invariant subspace cannot be explicitly computed since it lies in the RKHS. It is only implicitly projected upon through . It is important to note that during testing however, the SVM formulation will be invariant to transformations of the test sample regardless of a linear or nonlinear kernel.
Positive Semi-Definiteness. The invariant kernel map is now of the form . This preserves the positive semi-definite property of the kernel while guaranteeing global invariance to unitary transformations, unlike jittering kernels scholkopf2002learning ; decoste_2002 and tangent-distance kernels Haasdonk_tangent_distance . If we wish to include invariance to scaling, however (in the sense of scaling an image), then we would lose positive semi-definiteness (scaling is also not a unitary transform). Nonetheless, walder2007learning show that conditionally positive definite kernels still exist for transformations including scaling, although we focus on unitary transformations in this paper.
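A small numerical sketch can illustrate these properties for a group-averaged kernel. It assumes the invariant kernel is obtained by averaging an RBF base kernel over the group acting on one argument, and compares this with averaging over both arguments. The group (all cyclic shifts of a length-4 vector), the value of `gamma` and the random data are illustrative assumptions, and the positive semi-definiteness test is only a spot-check with random quadratic forms, not a proof.

```python
import math
import random

SHIFTS = (0, 1, 2, 3)  # assumed group: all cyclic shifts of a length-4 vector


def act(k, x):
    return x[k:] + x[:k]


def rbf(x, y, gamma=0.5):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def k_one_sided(x, y):
    """Average the base kernel over the group acting on one argument only."""
    return sum(rbf(x, act(k, y)) for k in SHIFTS) / len(SHIFTS)


def k_two_sided(x, y):
    """Average the base kernel over the group acting on both arguments."""
    n = len(SHIFTS)
    return sum(rbf(act(a, x), act(b, y)) for a in SHIFTS for b in SHIFTS) / (n * n)


random.seed(2)
data = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(5)]

# For a unitary kernel the two averages coincide.
side_gap = max(abs(k_one_sided(x, y) - k_two_sided(x, y)) for x in data for y in data)
# The averaged kernel ignores transformations of either input.
inv_gap = max(abs(k_one_sided(act(1, x), act(3, y)) - k_one_sided(x, y))
              for x in data for y in data)

# Spot-check positive semi-definiteness: z^T K z >= 0 for random z.
K = [[k_one_sided(x, y) for y in data] for x in data]
min_quad = min(
    sum(z[i] * K[i][j] * z[j] for i in range(5) for j in range(5))
    for z in ([random.gauss(0.0, 1.0) for _ in range(5)] for _ in range(200))
)
print(side_gap, inv_gap, min_quad)
```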
3 MaxMargin Invariant Features
The previous section utilized a group integration approach to arrive at a theoretically invariant nonlinear SVM. It does not, however, address the Unlabeled Transformation problem, i.e. the kernel still requires observing transformed versions of the labelled input sample, namely (or at least one of the labelled samples if we utilize Lemma 2.6). We now present our core approach, Max-Margin Invariant Features (MMIF), which does not require the observation of any transformed labelled training sample whatsoever.
Assume that we have access to an unlabeled set of templates . We assume that we can observe all transformations under a unitarygroup , i.e. we have access to . Also, assume we have access to a set of labelled data with classes which are not transformed. We can extract an dimensional invariant kernel feature for each as follows. Let the invariant kernel feature be to explicitly show the dependence on . Then the dimension of for any particular is computed as
(2) 
The first equality utilizes Lemma 2.6 and the third equality uses Theorem 2.4. This is equivalent to observing all transformations of since using Lemma 2.3. Thereby we have constructed a feature which is invariant to without ever needing to observe transformed versions of the labelled vector . We now briefly describe the training of the MMIF feature extractor. The matching metric we use in this study is normalized cosine distance.
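The construction of the invariant kernel feature can be sketched as follows. This is a minimal illustration under stated assumptions rather than the exact formulation: each feature dimension pools an RBF kernel between the input and every observed transformation of one unlabeled template. The group (cyclic shifts of a length-6 vector), the `pool` option and the random templates are illustrative.

```python
import math
import random

SHIFTS = (0, 1, 2, 3, 4, 5)  # assumed group: cyclic shifts of a length-6 vector


def act(k, x):
    return x[k:] + x[:k]


def rbf(x, y, gamma=0.2):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def invariant_feature(x, template_orbits, pool="mean"):
    """One dimension per template: pool k(x, g t) over the observed orbit of t."""
    feats = []
    for orbit in template_orbits:
        vals = [rbf(x, gt) for gt in orbit]
        feats.append(max(vals) if pool == "max" else sum(vals) / len(vals))
    return feats


random.seed(3)
# Unlabeled templates, each observed under every transformation in the group.
templates = [[random.gauss(0.0, 1.0) for _ in range(6)] for _ in range(4)]
template_orbits = [[act(k, t) for k in SHIFTS] for t in templates]

# A labelled sample is observed once; its transformed versions are generated
# here only to check invariance, never to build the feature.
x = [random.gauss(0.0, 1.0) for _ in range(6)]
gaps = {}
for pool in ("mean", "max"):
    f = invariant_feature(x, template_orbits, pool)
    gaps[pool] = max(
        abs(a - b)
        for k in SHIFTS
        for a, b in zip(invariant_feature(act(k, x), template_orbits, pool), f)
    )
print(gaps)
```

Both mean and max pooling over the template orbit give a feature that does not change when the input is transformed, since transforming the input only permutes the kernel values being pooled.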
Training MMIF SVMs. To learn a dimensional MMIF feature (potentially independent of ), we learn independent binary-class linear SVMs. Each SVM trains on the labelled dataset with each sample being labelled positive for some subset of the classes (potentially just one class) and the rest being labelled negative. This leads us to a classifier in the form of . Here, is the label of for the SVM. It is important to note that the unlabeled data was only used to extract . Having multiple classes randomly labelled as positive allows the SVM to extract some feature that is common between them. This increases generalization by forcing the extracted feature to be more general (shared between multiple classes) rather than being highly tuned to a single class. Any dimensional MMIF feature can be trained through this technique, leading to a higher dimensional feature vector useful in cases where one has limited labelled samples and classes ( is small). During feature extraction, the inner products (scores) of the test sample with the distinct binary-class SVMs provide the dimensional MMIF feature vector. This feature vector is highly discriminative due to the max-margin nature of SVMs while being invariant to due to the invariant kernels.
MMIF. Given and , the MMIF feature is defined as for any test with each dimension being computed as for . Further, with each dimension being . The process is illustrated in Fig. 2.
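The overall pipeline can be sketched end to end on toy data. The following is a minimal illustration under several assumptions and not the implementation used in the experiments: the group is the set of cyclic shifts of a length-6 vector, the kernel is RBF with mean pooling, each linear SVM is trained by plain stochastic subgradient descent on the regularized hinge loss and includes a bias term, and all sizes and hyperparameters are arbitrary.

```python
import math
import random

SHIFTS = tuple(range(6))  # assumed group: cyclic shifts of a length-6 vector


def act(k, x):
    return x[k:] + x[:k]


def rbf(x, y, gamma=0.2):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def invariant_feature(x, template_orbits):
    """Mean-pool the kernel between x and each observed template orbit."""
    return [sum(rbf(x, gt) for gt in orb) / len(orb) for orb in template_orbits]


def train_linear_svm(feats, labels, rng, lam=0.01, epochs=30, eta0=0.5):
    """Linear SVM via stochastic subgradient descent on the regularized hinge loss."""
    w, b, t = [0.0] * len(feats[0]), 0.0, 0
    order = list(range(len(feats)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = eta0 / math.sqrt(t)
            score = sum(wj * fj for wj, fj in zip(w, feats[i])) + b
            w = [wj - eta * lam * wj for wj in w]  # regularizer step
            if labels[i] * score < 1.0:            # hinge subgradient step
                w = [wj + eta * labels[i] * fj for wj, fj in zip(w, feats[i])]
                b += eta * labels[i]
    return w, b


def mmif(x, template_orbits, svms):
    """MMIF feature: one SVM score per dimension, computed on the invariant feature."""
    f = invariant_feature(x, template_orbits)
    return [sum(wj * fj for wj, fj in zip(w, f)) + b for w, b in svms]


rng = random.Random(4)


def gauss_vec():
    return [rng.gauss(0.0, 1.0) for _ in range(6)]


# Unlabeled templates observed under all transformations.
template_orbits = [[act(k, t) for k in SHIFTS] for t in (gauss_vec() for _ in range(8))]

# Labelled, untransformed data: 5 classes, 6 noisy samples around each class centre.
centres = [gauss_vec() for _ in range(5)]
samples = [([c + 0.1 * rng.gauss(0.0, 1.0) for c in centres[y]], y)
           for y in range(5) for _ in range(6)]
feats = [invariant_feature(x, template_orbits) for x, _ in samples]

# K binary SVMs, each with a random subset of classes labelled positive.
K = 4
svms = []
for _ in range(K):
    positives = set(rng.sample(range(5), 2))
    labels = [1.0 if y in positives else -1.0 for _, y in samples]
    svms.append(train_linear_svm(feats, labels, rng))

x_test = gauss_vec()
feature = mmif(x_test, template_orbits, svms)
max_gap = max(abs(a - b) for k in SHIFTS
              for a, b in zip(mmif(act(k, x_test), template_orbits, svms), feature))
print(len(feature), max_gap)
```

The labelled samples are never transformed during training, yet the resulting feature is unchanged when the test sample is transformed, because the invariance comes entirely from pooling over the unlabeled template orbits.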
Inheriting transformation invariance from transformed unlabeled data: A special case of semi-supervised learning. MMIF features can learn to be invariant to transformations by observing them only through . They can then transfer the invariance knowledge to new unseen samples from thereby becoming invariant to despite never having observed any samples from . This is a special case of semi-supervised learning where we leverage the specific transformations present in the unlabeled data. This is a very useful property of MMIFs, allowing one to learn transformation invariance from one source and sample points from another source while having powerful discrimination and generalization properties. This property can be formally stated as the following theorem.
Theorem 3.1.
(MMIF is invariant to learnt transformations) where is observed only through .
Thus we find that MMIF can solve the Unlabeled Transformation Problem. MMIFs have an invariant and a discriminative component. The invariant component of MMIF allows it to generalize to new transformations of the test sample whereas the discriminative component allows for robust classification due to maxmargin classifiers. These two properties allow MMIFs to be very useful as we find in our experiments on face recognition.
Max and Mean Pooling in MMIF.
Group integration in practice directly results in mean pooling. Recent work, however, showed that group integration can be treated as a subset of I-theory, where one tries to measure moments (or a subset of them) of the distribution since the distribution itself is also an invariant poggio2013 . Group integration can be seen as measuring the mean or the first moment of the distribution. One can also characterize using the infinite moment or the max of the distribution. We find in our experiments that max pooling outperforms mean pooling in general. All results in this paper, however, still hold under the I-theory framework.
MMIF on external feature extractors (deep networks). MMIF does not make any assumptions regarding its input and hence one can apply it to features extracted from any feature extractor in general. The goal of any feature extractor is to (ideally) be invariant to withinclass transformation while maximizing betweenclass discrimination. However, most feature extractors are not trained to explicitly factor out specific transformations. If we have access to even a small dataset with the transformation we would like to be invariant to, we can transfer the invariance using MMIFs (e.g. it is unlikely to observe all poses of a person in datasets, but pose is an important nuisance transformation).
Modelling general nonunitary transformations.
General nonlinear transformations such as out-of-plane rotation or pose variation are challenging to model. Nonetheless, a small variation in these transformations can be approximated by some unitary assuming piecewise linearity through transformation-dependent sub-manifold unfolding park2010extension . Further, it was found that in practice, integrating over general transformations produced approximate invariance liao2013 .
4 Experiments on Face Recognition
As an illustration, we apply MMIFs using two modalities overall: 1) on raw pixels and 2) on deep features from the pretrained VGG-Face network parkhi2015deep . We provide more implementation details and discussion of results in the supplementary.
A. MMIF on a large-scale semi-synthetic mugshot database (raw pixels and deep features). We utilize a large-scale semi-synthetic face dataset to generate the sets and for MMIF. In this dataset, only two major transformations exist, that of pose variation and subject variation. All other transformations such as illumination, translation, rotation etc. are strictly and synthetically controlled. This provides a very good benchmark for face recognition, where we want to be invariant to pose variation and be discriminative for subject variation. The experiment follows the exact protocol and data as described in pal2016discriminative (we provide more details in the supplementary). Also note that we do not need to utilize identity information; all that is required is the fact that a set of pose-varied images belongs to the same subject. Such data can be obtained through temporal sampling. We test on 750 subject identities with 153 pose-varied real-textured grayscale images each (a total of 114,750 images) against each other, resulting in about 13 billion pairwise comparisons (compared to 6,000 for the standard LFW protocol). Results are reported as ROC curves along with VR at FAR. Fig. 3(a) shows the ROC curves for this experiment. We find that MMIF features outperform all baselines including VGG-Face features (pretrained), DIKF and NDP approaches, thereby demonstrating superior discriminability while being able to effectively capture pose invariance from the transformed template set . MMIF is able to solve the Unlabeled Transformation problem by extracting transformation information from unlabeled .
B. MMIF on LFW (deep features): Unseen subject protocol.
In order to be able to effectively train under the scenario of general transformations and to challenge our algorithms, we define a new, much harder protocol on LFW. We choose the top 500 subjects with a total of 6,300 images for training MMIF on VGG-Face features and test on the remaining subjects with 7,000 images. We perform all-versus-all matching, totalling up to 49 million matches (4 orders of magnitude more than the official protocol). The evaluation metric is defined to be the standard ROC curve with verification rate reported at false accept rate. We split the 500 subjects into two sets of 250 and use as and . We do not use any alignment for this experiment, and the faces were cropped according to sanderson2009multi . Fig. 3(b) shows the results of this experiment. We see that MMIF on VGG features significantly outperforms raw VGG on this protocol, boosting the VR at FAR from 0.56 to 0.71. This demonstrates that MMIF is able to generate invariance for highly nonlinear transformations that are not well-defined, rendering it useful in real-world scenarios where transformations are unknown but observable.
5 Main Experiments: Detailed notes supplementing the main paper.
A. MMIF on a largescale semisynthetic mugshot database (Rawpixels and deep features).
MMIF template set and . We utilize a large-scale semi-synthetic face dataset to generate the sets and for MMIF. The face textures are sampled from real faces, whereas the poses are rendered using a 3D model fit to each face independently; hence the dataset is semi-synthetic. This semi-synthetic dataset helps us to evaluate our algorithm in a clean setting, where there exists only one challenging nuisance transformation (pose variation). Therefore models pose variation in faces. We utilize the same pose variation dataset generation procedure as described in pal2016discriminative to allow a fair comparison. The poses were rendered varying from to (yaw) and to (pitch) in steps of using 3DGEM prabhu2011unconstrained . The total number of images we generate is images. We align all faces by the two eye-center locations in a crop.
Protocol. Our first experiment is a direct comparison with approaches similar in spirit to ours, namely DIKF and DIKF pal2016discriminative and NDP and NDP liao2013 ; poggio2013 . We train on 250 subjects (38,250 images) and test each method on the remaining 750 subjects (114,750 images), matching all pose-varied images of a subject to each other. DIKF follows the same protocol as in pal2016discriminative . For MMIF, we utilize the first images (125 subjects with 153 poses each) as and the next images as . A total of 500 SVMs were trained on subsets of (10 randomly chosen subjects per SVM, with all images of 3 of those 10 subjects, again randomly chosen, being labelled positive and the rest negative). Note that although in this case contains pose variation, we do not integrate over it to generate invariance. All explicit invariance properties are generated through integration over . For testing, we compare all 153 images of the remaining unseen 750 subjects against each other (114,750 images). The algorithms are therefore tested on about 13 billion pairwise comparisons. Results are reported as ROC curves along with VR at FAR. For this experiment, we report results working on 1) raw pixels directly and 2) 4096-dimensional features from the pretrained VGG-Face network parkhi2015deep . As a baseline, we also report results using the VGG-Face features directly.
Results. Fig. 3(a) shows the ROC curves for this experiment. We find that MMIF features outperform both DIKF and NDP approaches, thereby demonstrating superior discriminability while being able to effectively capture pose invariance from the transformed template set . We find that VGG-Face features suffer a handicap due to the images being grayscale. Nonetheless, MMIF is able to transfer pose invariance from onto the VGG features. This significantly boosts performance owing to the fact that the main nuisance transformation is pose. MMIF, being explicitly pose invariant along with solving the Unlabeled Transformation Problem, is able to help VGG features while preserving their discriminability. In fact, the max-margin SVMs further add discriminability. This illustrates in a clean setting (the dataset only contains synthetically generated pose variation as a nuisance transformation) that MMIF is able to work well in conjunction with deep learning features, thereby rendering itself immediately usable in more realistic settings. Our next set of experiments focuses on this exact aspect.
B. MMIF on LFW (deep features).
Unseen subject protocol. LFW LFWTech has received a lot of attention in recent years, and algorithms have approached near-human accuracy on the original testing protocol. In order to be able to effectively train under the scenario of general transformations and to challenge our algorithms, we define a new, much harder protocol on LFW. Instead of evaluating on about pairwise matches, we pairwise match all images of subjects not seen in training. We have no way of modelling these subjects whatsoever, making this a difficult task. We utilize 500 subjects and all their images for training and test on the remaining 5249 subjects and all of their images. To use the maximum amount of data for training, we pick the top 500 subjects with the largest number of images available (about 6,300 images). The test data thus contains about 7000 images. The number of test pairwise matches is about 49 million, roughly four orders of magnitude larger than the 6000 matches that the original LFW testing protocol defined. The evaluation metric is defined to be the standard ROC curve with verification rate reported at false accept rate.
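The comparison counts quoted for the two protocols follow from matching every test image against every test image. The short check below is only a sketch of that arithmetic; the helper name `all_pairs_count` is hypothetical, and counting the full all-versus-all matrix (ordered pairs, including self-matches) is an assumption chosen because it reproduces the quoted figures.

```python
import math

def all_pairs_count(num_images):
    """Number of entries in a full all-versus-all comparison matrix."""
    return num_images * num_images

# LFW unseen-subject protocol: about 7000 test images.
lfw_pairs = all_pairs_count(7000)
# Ratio to the 6000 matches of the original LFW protocol, in orders of magnitude.
lfw_orders = math.log10(lfw_pairs / 6000)

# Experiment A: 750 unseen subjects x 153 poses = 114,750 test images.
mugshot_pairs = all_pairs_count(750 * 153)
```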
MMIF template set and . We split the 500 subjects data into two parts of 250 subjects each. We use the 250 subjects with the largest number of images as the transformed template set and use the remaining 250 subjects as . Note that in this experiment, the transformations considered are very generic and highly nonlinear, making it a difficult experiment. We do not use any alignment for this experiment, and the faces were cropped according to sanderson2009multi .
Protocol. For MMIF, we process the kernel features from the transformed template set exactly as in the previous experiment A. Similarly, we learn a total of 500 SVMs on subsets of following the same protocol as the previous experiment.
Results. Fig. 3(b) shows the results of this experiment. We see that MMIF on VGG features significantly outperforms raw VGG on this protocol, boosting the VR at FAR from 0.56 to 0.71. This suggests that MMIF can be used in conjunction with pretrained deep features. In this experiment, MMIF capitalizes on the nonlinear transformations that exist in LFW, whereas in the previous experiment on the semi-synthetic dataset (Experiment A), the transformation was well-defined to be pose variation. This demonstrates that MMIF is able to generate invariance for highly nonlinear transformations that are not well-defined, rendering it useful in real-world scenarios where transformations are unknown but observable.
6 Additional Experiments
6.1 Large-scale Semi-Synthetic Mugshot Data
Motivation: In the main paper, the transformations were observed only through unlabeled while is only meant to provide labeled untransformed data. However, during our experiments in the main paper, even though we do not explicitly pool over the transformations , we utilize all transformations for training the SVMs. In order to be closer to our theoretical setting, we now run MMIF on raw pixels and VGGFace features parkhi2015deep while constraining the number of images the SVMs train on to 30 random images for each subject.
MMIF Template set and : We utilize a large-scale semi-synthetic face dataset to generate the template set for MMIF. The face textures are sampled from real faces and the poses are rendered using a 3D model fit to each face independently, making the dataset semi-synthetic. This semi-synthetic dataset helps us evaluate our algorithm in a clean setting, where there exists only one challenging nuisance transformation (pose variation). Therefore models pose variation in faces. We utilize the same pose variation dataset generation procedure as described in pal2016discriminative in order to allow a fair comparison. The poses were rendered varying from to (yaw) and to (pitch) in steps of using 3D-GEM [15]. The total number of images we generate is 153 x 1000 = 153,000 images. We align all faces by the two eye-center locations in a crop. Unlike our experiment presented in the main paper on this dataset, the template set is constrained to include only 30 randomly selected poses that contained . This is done to better simulate a real-world setting where through we would only observe faces at a few random poses.
Protocol: This experiment is a direct comparison with approaches similar in spirit to ours, namely DIKF and DIKF pal2016discriminative and NDP and NDP liao2013 ; poggio2013 . We refer to this setting for MMIF as MMIF-cons (constrained). We train on 250 subjects (38,250 images) and test each method on the remaining 750 subjects (114,750 images), matching all pose-varied images of a subject to each other. DIKF follows the same protocol as in pal2016discriminative .
For MMIF, we utilize the first 125 x 153 images (125 subjects with 153 poses each) as the template set . Thus, remains exactly the same as the protocol in the main paper. The template set is generated by choosing 30 random poses (for every subject) of the next 125 subjects. A total of 500 SVMs are trained on with a random subset of 5 subjects being labeled +1 and the rest labeled -1. It is important to note that since does not contain transformations that are observed in its entirety, all explicit invariance properties are generated through integration over .
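The random +1/-1 labelling used to train the 500 SVMs can be sketched as below. This is an illustrative reading of the protocol rather than the authors' code: the function name `sample_svm_labels`, its parameters, and the assumption that every SVM sees all 125 subjects (with 5 of them positive) are choices made here for the example.

```python
import numpy as np

def sample_svm_labels(subject_ids, num_svms=500, num_positive=5, seed=0):
    """Return a (num_svms, num_images) array of +1/-1 labels, one row per SVM.

    For each SVM, `num_positive` subjects are drawn at random and all of
    their images are labelled +1; images of all other subjects get -1.
    """
    rng = np.random.default_rng(seed)
    subject_ids = np.asarray(subject_ids)
    subjects = np.unique(subject_ids)
    labels = np.empty((num_svms, subject_ids.size), dtype=int)
    for i in range(num_svms):
        positives = rng.choice(subjects, size=num_positive, replace=False)
        labels[i] = np.where(np.isin(subject_ids, positives), 1, -1)
    return labels

# 125 subjects with 30 randomly chosen poses each, as in the constrained setting.
subject_ids = np.repeat(np.arange(125), 30)
labels = sample_svm_labels(subject_ids)
```

Each row of `labels` would then serve as the target vector for one binary SVM trained on the corresponding features.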
For testing, we follow the same protocol as in the main paper. We compare all 153 images of the remaining unseen 750 subjects against each other (114,750 images). The algorithms are therefore tested on about 13 billion pairwise comparisons. Results are reported as ROC curves along with the VR at 0.1% FAR. For this experiment, we report results working on 1) raw pixels directly and 2) 4096-dimensional features from the pretrained VGGFace network parkhi2015deep . As a baseline, we also report results using the VGGFace features directly.
Results: Fig. 4 shows the ROC curves for this experiment. We find that even though we train SVMs for MMIF-cons-VGG on a constrained version of , it outperforms raw VGG features. Although we do observe that MMIF-cons-raw outperforms NDP methods, thereby demonstrating superior discriminability, it fails to match the performance of the original MMIF-raw method. Interestingly, however, MMIF-cons-VGG matches MMIF-VGG features in performance despite being trained on much less data (30 instead of 153 images per subject). Thus, we find that MMIF, when trained on a good feature extractor, can provide added benefits of discrimination despite having fewer labeled samples to train on.
6.2 IARPA IJB-A Janus
In this experiment, we explore how the number of SVMs influences the recognition performance on a large-scale real-world dataset, namely the IARPA Janus Benchmark A (IJB-A) dataset.
Data: We work on the verification protocol (1:1 matching) of the original IJB-A Janus dataset. This subset consists of 5547 image templates that map to 492 distinct subjects, with each template containing (possibly) multiple images. The images are cropped with respect to bounding boxes that are specified by the dataset for all labeled images. The cropped images are then resized to 224 x 224 pixels in accordance with the requirements of the VGG face model. Explicit pose invariance (MMIF) is then applied to these general face descriptors.
MMIF Template set and : In order to effectively train under the scenario of general transformations, we define a new protocol on the Janus dataset similar to the LFW protocol defined in the main paper. This protocol is suited for MMIF since we explicitly generate invariance to transformations that exist in Janus data. We utilize the first 100 subjects and all the templates that map to these subjects (23723 images) for training MMIF and test on the remaining 392 subjects (27363 images). To make use of the maximum amount of data for training, we pick the top 100 subjects with the largest number of images; the rest are all utilized for testing. Our training dataset is further split into templates and similar to our LFW protocol in the main paper. We use the first 50 subjects (of the top 100 subjects) as and the rest as in order to maximize the transformations that we generate invariance towards. To showcase the ability of MMIF to be used in conjunction with deep learning techniques, similar to our LFW experiment in the main paper, we train and test on VGGFace features parkhi2015deep on the Janus data.
Protocol: As in our LFW experiment, we split the training data into two templates and . As in all MMIF protocols in this paper, we train a total of 100, 250 and 500 SVMs on subsets of following the same protocol. We perform pairwise comparisons for the entirety of the test data ( million image comparisons), which far exceeds the number of comparisons defined in the original testing protocol ( template comparisons), thereby making this protocol much larger and harder. Recall that throughout this supplementary and the main paper we always test on completely unseen subjects. The evaluation metric is defined to be the standard ROC curve using cosine distance.
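The all-versus-all scoring with cosine similarity described in this protocol can be sketched as follows. The helper names `cosine_similarity_matrix` and `genuine_impostor_scores` are hypothetical, and excluding self-comparisons from the genuine scores is an assumption of this sketch rather than something the protocol states. The sketch materializes the full matrix, so it is only practical for small test sets.

```python
import numpy as np

def cosine_similarity_matrix(features, eps=1e-12):
    """All-versus-all cosine similarity between the rows of `features`."""
    features = np.asarray(features, dtype=float)
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, eps)  # guard against zero vectors
    return unit @ unit.T

def genuine_impostor_scores(features, subject_ids):
    """Split pairwise scores into same-subject and different-subject pairs."""
    sims = cosine_similarity_matrix(features)
    subject_ids = np.asarray(subject_ids)
    same = subject_ids[:, None] == subject_ids[None, :]
    off_diagonal = ~np.eye(subject_ids.size, dtype=bool)  # drop self-matches
    return sims[same & off_diagonal], sims[~same]

# Tiny example: two images of subject 0 and one image of subject 1.
genuine, impostor = genuine_impostor_scores(
    [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]], [0, 0, 1]
)
```

The two score arrays are what an ROC curve is computed from; cosine distance is one minus the similarity used here.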
Results: Fig. 5 shows the ROC curves for this experiment with the new, much larger and harder protocol. We find that even with just 100 SVMs, i.e. 100 max-margin feature extractors, the performance is close to that of 500 feature extractors. This suggests that although the SVMs provide enough discrimination, the invariant kernel provides the bulk of the recognition performance by explicitly being invariant to the transformations in the . Hence, our proposed invariant kernel is effective at learning invariance towards transformations present in an unlabeled dataset. We provide these curves as baselines for future work focusing on the problem of learning unlabeled transformations from a given dataset.
7 Proofs of theoretical results
7.1 Proof of Lemma 2.1
Proof.
We have,
Since the normalized Haar measure is invariant, i.e. . Intuitively, simply rearranges the group integral owing to elementary group properties. ∎
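The rearrangement argument can be checked numerically on a finite group, where the normalized Haar measure reduces to a uniform average over group elements. The sketch below is an illustration only: the cyclic-shift group, the RBF kernel and the names `rbf` and `group_average` are assumptions chosen for the example and are not taken from the lemma.

```python
import numpy as np

def rbf(x, y, gamma=0.5):
    """Gaussian RBF kernel between two vectors."""
    return float(np.exp(-gamma * np.sum((x - y) ** 2)))

def group_average(x, template, n_shifts):
    """Average k(g x, template) over all cyclic shifts g, i.e. integrate
    over the group with the normalized counting (Haar) measure."""
    return float(np.mean([rbf(np.roll(x, s), template) for s in range(n_shifts)]))

rng = np.random.default_rng(0)
x, t = rng.normal(size=6), rng.normal(size=6)

base = group_average(x, t, 6)
# Transforming x by any group element only reorders the terms of the average.
shifted = [group_average(np.roll(x, s), t, 6) for s in range(6)]
```

All entries of `shifted` agree with `base` up to floating-point error, because shifting `x` first merely permutes the group elements being averaged over.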
7.2 Proof of Lemma 2.2
Proof.
We have,
Using the fact and . ∎
7.3 Proof of Lemma 2.3
Proof.
We have,
(3)  
(4)  
(5)  
(6) 
Since the Haar measure is normalized (), and invariant. Also for any , we have ∎
7.4 Proof of Theorem 2.4
Proof.
We have , since the kernel is unitary. Here we define as the action of on . Thus, the mapping preserves the dot product in while reciprocating the action of . This is one of the requirements of a unitary operator, however needs to be linear. We note that linearity of can be derived from the linearity of the inner product and its preservation under in . Specifically, for an arbitrary vector and a scalar , we have
(7)  
(8)  
(9)  
(10) 
Similarly for vectors , we have
We now prove that the set is a group. We start with proving the closure property. We have for any fixed
Since therefore by definition. Also, and thus closure is established. Associativity, identity and inverse properties can be proved similarly. The set is therefore a unitary group in . ∎
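A finite-dimensional example can make the statement concrete. The sketch below assumes a homogeneous quadratic kernel k(x, y) = (x . y)^2 with explicit feature map phi(x) = x ⊗ x and coordinate permutations as the group; neither choice comes from the theorem, and the names `phi` and `perm_matrix` are hypothetical. In this case the induced action on feature space is the matrix P ⊗ P, which is orthogonal (the real analogue of unitary) and closed under composition.

```python
import numpy as np

def phi(x):
    """Explicit feature map of the quadratic kernel k(x, y) = (x . y)^2."""
    return np.kron(x, x)

def perm_matrix(perm):
    """Permutation matrix P with (P x)[i] = x[perm[i]]."""
    return np.eye(len(perm))[list(perm)]

rng = np.random.default_rng(1)
x, y = rng.normal(size=4), rng.normal(size=4)

P1 = perm_matrix([1, 2, 3, 0])
P2 = perm_matrix([0, 2, 1, 3])
# Induced actions in feature space for the two group elements.
U1, U2 = np.kron(P1, P1), np.kron(P2, P2)

# The kernel is unchanged when both arguments are transformed by the same element.
k_before = phi(x) @ phi(y)
k_after = phi(P1 @ x) @ phi(P1 @ y)
```

The checks that follow from this setup are that `phi(P1 @ x)` equals `U1 @ phi(x)`, that `U1.T @ U1` is the identity, and that `U1 @ U2` is the induced action of the composed permutation `P1 @ P2`.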
7.5 Proof of Theorem 2.5
Proof.
Since is a perfect separator for , , s.t. .
Using Lemma 2.4 and Theorem 2.5, we have for any fixed ,
Hence,
(11)  
(12) 
Thus, is a perfect separator for with a margin of at least . It also implies that a max-margin separator of is also a max-margin separator of . ∎
7.6 Proof of Lemma 2.6
Proof.
We have
In the second equality, we fix any group element since the inner product is invariant using the argument . This is true using Lemma 2.1 and the fact that is unitary. Further, the final equality utilizes the fact that the Haar measure is normalized. ∎
7.7 Proof of Theorem 3.1
Proof.
Given and , the MMIF feature is defined as for any test with each dimension being computed as for . Further, with each dimension being . Here, where in the RKHS corresponds to the group action of acting in the space of .
We therefore have for the dimension of ,
(13)  
(14)  
(15)  
(16)  
(17)  
(18)  
(19) 
Here, in line 15 we utilize the closure property of a group (since forms a group according to Theorem 2.4). Line 17 utilizes the fact that is unitary, and finally line 18 uses Theorem 2.4. Hence we find that every element of is invariant to observed only through , and thus trivially, for any observed only through . ∎
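The conclusion of the theorem, that invariance can be obtained purely from transformed unlabeled templates, can be illustrated with a small numerical sketch. It is not the exact MMIF construction: the cyclic-shift group, the RBF kernel, the random weight vector `w` standing in for learned SVM weights, and the names `rbf` and `pooled_features` are all assumptions made for this example.

```python
import numpy as np

def rbf(x, y, gamma=0.5):
    """Gaussian RBF kernel; it satisfies k(g x, g y) = k(x, y) for shifts g."""
    return float(np.exp(-gamma * np.sum((x - y) ** 2)))

def pooled_features(x, template_orbits):
    """One feature per template: the kernel response of x averaged over
    the observed transformed copies of that template."""
    return np.array(
        [np.mean([rbf(x, t) for t in orbit]) for orbit in template_orbits]
    )

rng = np.random.default_rng(2)
dim = 5
templates = rng.normal(size=(3, dim))
# Transformed template set: every cyclic shift of every unlabeled template.
orbits = [[np.roll(t, s) for s in range(dim)] for t in templates]

w = rng.normal(size=3)   # stand-in for learned max-margin weights
x = rng.normal(size=dim) # test sample whose own transformations were never seen

# Score of each transformed version of x; pooling is over templates only.
scores = [float(w @ pooled_features(np.roll(x, s), orbits)) for s in range(dim)]
```

All entries of `scores` coincide up to floating-point error, even though no transformed copy of `x` is used to build the features.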
References
 (1) F. Anselmi, J. Z. Leibo, L. Rosasco, J. Mutch, A. Tacchetti, and T. Poggio. Magic materials: a theory of deep hierarchical architectures for learning sensory representations. MIT, CBCL paper, 2013.
 (2) F. Anselmi, J. Z. Leibo, L. Rosasco, J. Mutch, A. Tacchetti, and T. Poggio. Unsupervised learning of invariant representations in hierarchical architectures. CoRR, abs/1311.4158, 2013.

 (3) D. Decoste and B. Schölkopf. Training invariant support vector machines. Mach. Learn., 46(1-3):161–190, Mar. 2002.
 (4) B. Haasdonk and H. Burkhardt. Invariant kernel functions for pattern analysis and machine learning. In Machine Learning, pages 35–61, 2007.
 (5) B. Haasdonk and D. Keysers. Tangent distance kernels for support vector machines. In Pattern Recognition, 2002. Proceedings. 16th International Conference on, volume 2, pages 864–868 vol.2, 2002.
 (6) G. E. Hinton. Learning translation invariant recognition in a massively parallel networks. In PARLE Parallel Architectures and Languages Europe, pages 1–13. Springer, 1987.
 (7) G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.

 (8) J. Z. Leibo, Q. Liao, and T. Poggio. Subtasks of unconstrained face recognition. In International Joint Conference on Computer Vision, Imaging and Computer Graphics, VISIGRAPP, 2014.
 (9) Q. Liao, J. Z. Leibo, and T. Poggio. Learning invariant representations and applications to face verification. Advances in Neural Information Processing Systems (NIPS), 2013.
 (10) P. Niyogi, F. Girosi, and T. Poggio. Incorporating prior information in machine learning by creating virtual examples. In Proceedings of the IEEE, pages 2196–2209, 1998.

 (11) D. K. Pal, F. Juefei-Xu, and M. Savvides. Discriminative invariant kernel features: a bells-and-whistles-free approach to unsupervised face recognition and pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5590–5599, 2016.
 (12) S. W. Park and M. Savvides. An extension of multifactor analysis for face recognition based on submanifold learning. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2645–2652. IEEE, 2010.
 (13) O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. 2015.
 (14) T. Poggio and T. Vetter. Recognition and structure from one 2d model view: Observations on prototypes, object classes and symmetries. Laboratory, Massachusetts Institute of Technology, 1992.
 (15) U. Prabhu, J. Heo, and M. Savvides. Unconstrained pose-invariant face recognition using 3d generic elastic models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(10):1952–1961, 2011.

 (16) A. Raj, A. Kumar, Y. Mroueh, T. Fletcher, and B. Schölkopf. Local group invariant representations via orbit embeddings. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS 2017), volume 54 of Proceedings of Machine Learning Research, pages 1225–1235, 2017.
 (17) M. Reisert. Group integration techniques in pattern analysis – a kernel view. PhD Thesis, 2008.
 (18) C. Sanderson and B. C. Lovell. Multi-region probabilistic histograms for robust and scalable identity inference. In International Conference on Biometrics, pages 199–208. Springer, 2009.
 (19) B. Schölkopf and A. J. Smola. Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT press, 2002.
 (20) B. Schölkopf, C. Burges, and V. Vapnik. Incorporating invariances in support vector learning machines. pages 47–52. Springer, 1996.
 (21) B. Schölkopf, P. Simard, A. Smola, and V. Vapnik. Prior knowledge in support vector kernels. Advances in Neural Information Processing Systems (NIPS), 1998.
 (22) C. Walder and O. Chapelle. Learning with transformation invariant kernels. In Advances in Neural Information Processing Systems, pages 1561–1568, 2007.
 (23) X. Zhang, W. S. Lee, and Y. W. Teh. Learning with invariance via linear functionals on reproducing kernel hilbert space. In Advances in Neural Information Processing Systems, pages 2031–2039, 2013.