Common Variable Learning and Invariant Representation Learning using Siamese Neural Networks

12/29/2015 ∙ by Uri Shaham, et al. ∙ Princeton University, Yale University

We consider the statistical problem of learning a common source of variability in data which are synchronously captured by multiple sensors, and demonstrate that Siamese neural networks can be naturally applied to this problem. This approach is useful in particular in exploratory, data-driven applications, where neither a model nor label information is available. In recent years, many researchers have successfully applied Siamese neural networks to obtain an embedding of data which corresponds to a "semantic similarity". We present an interpretation of this "semantic similarity" as learning of equivalence classes. We discuss properties of the embedding obtained by Siamese networks and provide empirical results that demonstrate the ability of Siamese networks to learn common variability.




1 Introduction

Many machine learning and signal processing methods aim to separate desired variability (“signal”) from undesired variability (“noise” and sensor idiosyncrasies). When the variability of interest is known on some set of data, supervised learning methods (such as regression) may be applied. Alternatively, when a data model is available, classic signal processing techniques (such as filtering) may be used to discard noise. The assumption that such knowledge is available is not always realistic; in many cases, in particular when exploring new data, it is not clear how to identify and represent the “interesting part of the phenomenon”. For example, in analysis of epileptic seizures we are interested in recovering patterns of activity that drive multiple areas of the brain, even though these patterns may be masked by massive nonlinear, non-additive effects of local activity, for which we have no model.

The purpose of this manuscript is to propose an approach for separating desired variability from irrelevant variability, in the absence of a data model or label information. Our proposed approach is purely unsupervised, and is based on coincidence, or co-occurrence, as a source of information from which the desired variability can be recovered. Specifically, we assume the data are measured through multiple sensors, and that the desired variability is recorded by all of the sensors, while the irrelevant effects are sensor-specific idiosyncrasies (i.e., each sensor records local phenomena and noise, independently of the other sensors). We use coincidence (pairs that were measured at the same time by the different sensors) to recognize the common source of variability. The learning is performed using Siamese neural networks.

The modern form of Siamese neural networks was proposed in [1] to obtain an embedding of the input data that corresponds to “semantic similarity”, so that Euclidean proximity between points in the embedding space implies that the points are “semantically similar”. Siamese networks have been used for various tasks, such as dimensionality reduction, learning invariant representations [2] and learning hashing functions [3]. This manuscript provides a formal model and a mathematical interpretation for the embedding that Siamese networks try to obtain, as representing equivalence classes.

Training of Siamese networks requires a collection of pairs of similar and dissimilar input objects; we refer to the choice of these training collections as “pairing”. In many works on Siamese networks, the pairing is based on information such as given classified samples, or on knowledge of a data model. For example, in [1] the pairing of objects is based on class membership, and the resulting representation is shown to be invariant to some input transformations such as pose and illumination of faces in images. In [2] Siamese networks are applied to obtain a low dimensional embedding of the input data; the approach is based on computing a similarity graph of the input data; in the experiments, the graph is computed using Euclidean distance or knowledge of the model generating the data. We assume no such model. Furthermore, as discussed in Section 4.4, Euclidean proximity in the input space might not always capture the desired similarity. In [3] Siamese networks are used to obtain hash maps given a similarity measure; the experiments in that manuscript rely on class membership for pairing. In [4] Siamese networks are applied to obtain representations of images of people, which correspond to pose, and are invariant to undesired variability, such as identity and clothing; in this case the similarity calculation is based on having images where people were imitating the positions shown in a fixed set of seed images, i.e., on a known data model. In [5] and [6], variants of Siamese architectures are applied to textual objects; in both cases the availability of pairs of objects labeled with a degree of similarity is assumed.

Our contributions are as follows: first, we use the formulation of the common variable learning problem [7] to interpret what Siamese networks try to do. Specifically, we demonstrate that Siamese networks are in fact trained to recognize equivalence classes of an equivalence relation defined by the common variability we aim to learn. Put another way, the embedding obtained from a Siamese network corresponds to the quotient space of this relation. Second, we demonstrate how Siamese networks can exploit coincidence as an efficient approach for separating desired variability from undesired variability in the absence of both a data model and label information.

The organization of this manuscript is as follows: In Section 2, we describe the common variable learning problem. In Section 3, we show how coincidence can be used to train a Siamese network, and briefly review a typical training algorithm. Some mathematical properties of the embedding which Siamese networks aim to obtain are discussed in Section 4. In Section 5, we present experiments using synthetic data, demonstrating that the common variable is indeed learned by Siamese networks. Brief conclusions are presented in Section 6.

2 Learning by Coincidence

2.1 Motivation

The purpose of this section is to illustrate the motivation for “learning from coincidence”. We use a simplified toy example, adapted from [7]. While this example appears to be a simple image processing problem, which is easily treated with some domain knowledge, we intentionally refrain from using this domain knowledge in order to demonstrate how the Siamese Networks work without the domain knowledge.

The experimental setup is presented in Figure 1. Three objects, Yoda, a bulldog and a bunny, are placed on spinning tables, and each object spins independently of the other objects. Two cameras are used to take simultaneous snapshots: Camera 1, whose field of view includes Yoda and the bulldog, and Camera 2, whose field of view includes the bulldog and the bunny. In this setting, the rotation angle of the bulldog is a common hidden variable; it is manifested in the snapshots taken by both cameras. The rotation angle of Yoda is a sensor-specific source of variability manifested only in snapshots taken by Camera 1, and the rotation angle of the bunny is a sensor-specific source of variability manifested only in snapshots taken by Camera 2. The three rotation angles are “hidden”, in the sense that they are not measured directly, but only through the snapshots taken by the cameras. Given snapshots from both cameras, our goal is to obtain a parametrization of the “relevant” common hidden variable, i.e., the rotation angle of the bulldog, and ignore the “superfluous” sensor-specific idiosyncratic variables, i.e., the rotation angles of Yoda and the bunny.

This specific task can be performed with specific knowledge of the problem, by masking the irrelevant objects, which reduces the problem to learning a single variable. Indeed, when the common variable is the only variable influencing the measurements, there are various methods to explore its geometry (e.g., k-means, diffusion maps [8]). However, in the more interesting case, it is not known a priori how the abstract value of the common variable influences the measurements, and the measurements need not be images (for example, see Section 5.2). The scenario presented in this section is a simplified case of data-driven modeling and exploration of systems; the common variable represents some underlying phenomenon which we would like to investigate, although we do not have a model for this phenomenon or for the sensor-specific variables that influence our measurements. In particular, we do not have any labeled examples of the common variable. Furthermore, we do not know in advance that it is possible to isolate the common variable by masking certain pixels (in fact, we do not know in advance that the samples represent images). The key to isolating the common variable from the superfluous sensor-specific variables is the fact that we have multiple instances of measurements taken simultaneously by the two cameras; in this case, while we do not know anything about the nature of the images, we know that the common variable (which turns out to be the rotation angle of the bulldog) was in the same state when the two cameras took the snapshots. More generally, such measurements that have the same value of the common variable give us a natural clue that we can use to reveal its structure.

In this manuscript, we argue that Siamese networks, used in this context with the proper adaptations, attempt to recover a representation of the common variable that ignores the sensor-specific variables. We demonstrate that in this way, Siamese networks can be used for unsupervised data-driven exploration of phenomena with very little domain knowledge, based mainly on examples that are obtained simultaneously, but without a prior model and without class labels or target values for regression.

Figure 1: The Toy dataset experiment. Top: the experimental setting, showing Yoda, the bulldog and the bunny on spinning tables, and the two cameras. Bottom: two sample snapshots, taken by Camera 1 (left) and Camera 2 (right) at the same time. In both pictures the bulldog is in the same state (i.e., rotation angle, with respect to the table).

2.2 Common Variable Learning: Formal Definition

We follow a similar setting to [7], which discussed the problem from a manifold learning perspective. Consider three hidden random variables, a common variable and two sensor-specific variables, each taking values in a (possibly high dimensional) space, with a joint distribution under which the two sensor-specific variables are independent given the common variable.

We have access to these hidden variables through two observable random variables: the first is a function of the common variable and the first sensor-specific variable, and the second is a function of the common variable and the second sensor-specific variable. Both functions are bi-Lipschitz (therefore, invertible). The ranges of the two functions may be embedded in a high dimensional space. We refer to the two observable random variables as the measurement in Sensor 1 and the measurement in Sensor 2, respectively. Each realization of the system consists of a hidden triplet and the corresponding pair of measurements; while the three hidden variables are not available to us directly, the two measurements are observable. We note that both measurements in a pair are functions of the same realization of the common variable. Our dataset is composed of pairs of corresponding measurements.

A natural way to obtain such pairs is to measure the same phenomenon with two different sensors, with both sensors influenced by the same phenomenon (the common variable), and each of them also influenced by its own idiosyncratic “irrelevant” state (its sensor-specific variable).

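Ideally, we would like to construct a function that recovers the common variable from a measurement in Sensor 1, for every value of the common variable and every value of the sensor-specific variable. However, since the measurement functions are unknown, we cannot expect to recover the common variable precisely, and we are interested in a function that recovers it up to some scaling and bi-Lipschitz transformation. In particular, we require (1) that the function take the same value on any two measurements that share the same value of the common variable, whatever the values of the sensor-specific variable, and (2) that it take different values on any two measurements with different values of the common variable.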
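The two requirements, referred to later as Equations (1) and (2), can be written out as below. The notation is an assumption made here for illustration and may differ from the authors' symbols: $x, x'$ are values of the common variable, $y, y'$ are values of the variable specific to Sensor 1, $g$ maps the hidden variables to the measurement in Sensor 1, and $f$ is the function we seek.

```latex
\begin{align}
f\bigl(g(x, y)\bigr) &= f\bigl(g(x, y')\bigr)
  && \text{for all } x,\ y,\ y', \tag{1} \\
f\bigl(g(x, y)\bigr) &\neq f\bigl(g(x', y')\bigr)
  && \text{for all } x \neq x' \text{ and all } y,\ y'. \tag{2}
\end{align}
```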


3 Algorithm

In this section we discuss the rationale for Siamese Networks in the context of the problem formulated above, and briefly review the Siamese Networks algorithm in this context. While the algorithm given in Section 3.3 is a typical variant of a Siamese network training algorithm, the key element of our approach is the implementation of learning through coincidence, which is manifested in the construction of the “positive” and “negative” datasets, described in Sections 3.1 and 3.2.

3.1 Rationale

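In order to satisfy Equations (1) and (2), we would like the learned function to depend on the common variable and to be invariant to the value of the sensor-specific variable. The crucial information is provided in the dataset through the fact that both measurements in each pair are functions of the same value of the common variable. The idea is to use this information to learn two maps, one for each sensor, such that (3) the two maps agree on any two measurements, one from each sensor, that share the same value of the common variable.

Consequently, each map is invariant to the sensor-specific variable of its own sensor. To avoid a trivial solution, in which the two maps are simply constant functions, we add the requirement that (4) the two maps disagree on any two measurements, one from each sensor, that have different values of the common variable, so that the maps cannot simply “ignore” the value of the common variable.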
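The agreement and non-triviality requirements, referred to later as Equations (3) and (4), can be sketched as follows. The symbols are assumptions introduced here for illustration: $f_1, f_2$ are the maps for Sensor 1 and Sensor 2, $g, h$ are the corresponding measurement functions, $x, x'$ are values of the common variable, and $y, z$ are values of the two sensor-specific variables.

```latex
\begin{align}
f_1\bigl(g(x, y)\bigr) &= f_2\bigl(h(x, z)\bigr)
  && \text{for all } x,\ y,\ z, \tag{3} \\
f_1\bigl(g(x, y)\bigr) &\neq f_2\bigl(h(x', z)\bigr)
  && \text{for all } x \neq x' \text{ and all } y,\ z. \tag{4}
\end{align}
```

Under this reading, (3) already implies that $f_1$ does not depend on $y$ and $f_2$ does not depend on $z$, since the right-hand side of (3) is the same for every $y$ and the left-hand side is the same for every $z$.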

We implement the map for Sensor 1 and the map for Sensor 2 by two networks, one per sensor. We are given a dataset of “positive” pairs, in which both elements correspond to the same realization of the common variable; in addition, we are given (or construct) a dataset of “negative” pairs, in which the two elements correspond to different realizations of the common variable (see Section 3.2). The idea is that when we introduce a positive pair as input to the two networks, we require their outputs to be identical, whereas when we introduce a negative pair as input, we require their outputs to differ.

At the end of the process, the map implemented by the first network computes our approximate representation of the common variable for samples from Sensor 1; as a useful “side effect”, we also obtain the map implemented by the second network, which computes a similar approximate representation for the samples obtained from Sensor 2.

Figure 2: A diagram of the network structure.

In summary, Siamese Networks try to achieve the following goal, when looking at pairs of measurements from two different sensors in the context of this manuscript:

  • If the two measurements have the same value of the common variable (measurements taken at the same time), give the same output.

  • If the measurements probably do not have the same value of the common variable, make the outputs of the two networks different.

3.2 Implementing Coincidence: Constructing Datasets of Positive Pairs and Negative Pairs

The algorithm is given a dataset of pairs of measurements, one from each sensor, corresponding to realizations of the hidden variables. Hence, each instance is a pair of measurements that were taken at the same time by the two different sensors. We refer to this dataset as the positive dataset. In addition, we construct a second dataset, referred to as the negative dataset, which contains “false pairs”, in which the two measurements come from different realizations. Ideally, the two realizations in a false pair should have different values of the common variable; in practice, it suffices that this holds with sufficiently high probability. When a negative dataset is not explicitly available, an approximation is constructed from the positive dataset by randomly mixing its pairs.

The training data for the Siamese network (i.e., the positive and negative pairs) is obtained without assuming any class membership, data model or label information. The entire construction and pairing (as “positive” and “negative” pairs) is based on coincidence, i.e., on the fact that the data are measured through multiple sensors, with each of the sensors capturing the variability of interest.
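The pairing described in this section can be sketched in a few lines. This is a minimal illustration and not the authors' code; the function name `build_pair_indices` and the shift-by-one mixing scheme are assumptions made here. Index `i` stands for the `i`-th simultaneous recording, so a positive pair uses the same index for both sensors and a negative pair uses two different indices.

```python
import numpy as np


def build_pair_indices(n, rng):
    """Return (positive, negative) arrays of shape (n, 2).

    Row (i, j) means: pair sample i of Sensor 1 with sample j of Sensor 2.
    Assumes n >= 2 simultaneous recordings indexed 0..n-1.
    """
    if n < 2:
        raise ValueError("need at least two recordings to form false pairs")
    idx = np.arange(n)
    # Positive pairs: both measurements were taken at the same time.
    positive = np.stack([idx, idx], axis=1)
    # Negative pairs: match each entry of a random permutation with its
    # predecessor in that permutation, so no sample keeps its true partner.
    perm = rng.permutation(n)
    negative = np.stack([perm, np.roll(perm, 1)], axis=1)
    return positive, negative
```

Pairing different recordings does not strictly guarantee different values of the common variable; as noted above, it only needs to hold with sufficiently high probability.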

3.3 Algorithm

Algorithm 1 is a typical training procedure for Siamese networks, presented here for completeness of the discussion and to provide the context for the discussion of pairing in Section 3.2.

Algorithm 1 Common variable learning using an ANN
  Output: implementations of the two maps, one for each sensor
  Construct the positive and negative datasets (see Section 3.2)
  Optimize the parameters of the joint network (Equation (5))
  Optional: dimensionality reduction of the learned representation

A typical architecture of a Siamese network is presented in Figure 2. The joint network is composed of two sub-networks, one for each sensor, and a single output unit, which is connected to the output layers of both sub-networks. The first sub-network accepts samples from Sensor 1 and the second accepts samples from Sensor 2. The two sub-networks may have different numbers of layers and different configurations; however, they have the same number of output units.

The output node of the joint network compares the outputs of the two sub-networks on a given pair of inputs by applying the logistic sigmoid function to a measure of the discrepancy between these outputs.

The training loss function used in our experiments in Section 5 is given in Equation (5); it involves a vector containing the weight parameters (but not the bias parameters) of the two sub-networks. For the positive pairs, we would like the discrepancy between the two outputs to be close to zero; similarly, for the negative pairs, we would like to maximize this discrepancy.

Once the network is trained, the two sub-networks implement our proposed maps for Sensor 1 and Sensor 2, respectively.
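As a rough illustration of an output node and loss of this kind, the sketch below applies a logistic sigmoid to the squared Euclidean distance between the two sub-network outputs and penalizes it with a cross-entropy term plus an L2 penalty on the weights. These specific choices (squared distance, cross-entropy, the `decay` coefficient, and the function names) are assumptions made here for illustration; they are not taken from the loss used in the experiments.

```python
import numpy as np


def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))


def pair_output(o1, o2):
    """Assumed output node: sigmoid of the squared distance between outputs.

    Identical outputs give 0.5; very different outputs approach 1.
    """
    d2 = np.sum((o1 - o2) ** 2, axis=-1)
    return sigmoid(d2)


def pair_loss(o1, o2, is_negative, weights, decay=1e-4):
    """Assumed loss: cross-entropy (label 1 = negative pair, 0 = positive)
    plus an L2 penalty on the weight vector (biases excluded by the caller).
    """
    p = pair_output(o1, o2)
    eps = 1e-12  # guards the logarithms against log(0)
    ce = -(is_negative * np.log(p + eps)
           + (1.0 - is_negative) * np.log(1.0 - p + eps))
    return ce.mean() + decay * np.sum(weights ** 2)
```

With this choice the output lies in [0.5, 1), so a threshold between these two values would separate "real" from "fake" pairs; minimizing the loss pulls the outputs of positive pairs together and pushes those of negative pairs apart.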

The network bears some superficial resemblance to a classifier that determines whether or not two measurements from two different sensors share the same value of the common variable (i.e., “real” pairs or “fake” pairs). However, classifiers need not construct a representation of the common variable, which is the goal in this work. In addition, since we do not use any class membership information, and the entire training is based on coincidence, our proposed training approach is purely unsupervised. Having said that, in our experiments we find it useful to measure the “classification accuracy” of the network as a proxy for the quality of learning. Specifically, since the output of the network is bounded, we set a fixed value within its range as a classification threshold for estimating whether a given pair is a “real” or a “fake” pair.

4 Discussion

4.1 Siamese Networks Learn Equivalence Classes

Consider the equivalence relation on the space of measurements in Sensor 1 under which two observations are equivalent if and only if they share the same value of the common variable. This equivalence relation generates a quotient set, whose elements are the equivalence classes; the equivalence class of a measurement consists of all the measurements that share its value of the common variable.

We observe that a function that satisfies (1) yields the same value for any member of an equivalence class. Moreover, a function that satisfies (2) also yields different values for members of different equivalence classes.

Thus, with a minor abuse of notation, there is a natural way to define such a function on the quotient set rather than on the space of measurements. Furthermore, such a function on the quotient set is injective. Hence, ideally, the function implemented by (a single sub-net of) a Siamese network is effectively a map of the equivalence class of its argument.
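On a finite toy dataset where the common variable is known, the two properties above (constant on each equivalence class, distinct across classes) can be checked directly. The helper below is a hypothetical illustration; its name and the exact-equality comparison are assumptions, suitable only for discrete toy values rather than for real-valued network outputs.

```python
def respects_classes(f, samples, common_values):
    """Check that f is well defined and injective on the quotient set.

    samples[i] is a measurement and common_values[i] is the (hashable)
    value of the common variable that generated it.
    """
    value_of_class = {}
    for s, x in zip(samples, common_values):
        v = f(s)
        # setdefault stores v for a new class, or returns the stored value.
        if value_of_class.setdefault(x, v) != v:
            return False  # f varies inside one equivalence class
    values = list(value_of_class.values())
    # Distinct classes must receive distinct values.
    return len(set(values)) == len(values)
```

For example, with samples encoded as `(common, specific)` tuples, the projection onto the first coordinate passes the check, while the projection onto the second coordinate and any constant function fail it.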

4.2 Comparing Measurements from the Same Sensor

In practice, because of the continuity of the measurement functions and the continuity of the computation operations in the networks that we use here, samples whose values of the common variable are “close” would have similar representations, so that the representation of the common variable is smooth. Informally, close values of the common variable correspond to close outputs of the learned function; therefore, the function can be used to estimate whether two samples in Sensor 1 have “close” values of the common variable.

From this perspective, the vague “semantic similarity” is interpreted as proximity in the space of the common variable.

4.3 Measurements from Different Sensors Become Comparable

The algorithm treats the measurements in Sensor 1 and the measurements in Sensor 2 symmetrically, in the sense that it aims to construct two maps, one for each sensor, that map into the same codomain and have similar properties. Moreover, the algorithm aims to find maps that agree in the sense defined in Equations (3) and (4).

Following the same argument as in Section 4.2, the two maps can be used to compare a sample from Sensor 1 to a sample from Sensor 2 to estimate whether the two samples are obtained from “close” values of the common variable; informally, close values of the common variable correspond to close outputs of the two maps.

The two sensors might measure different modalities, such as audio signals in one and images in the other, so that the framework proposed here allows one to compare two different modalities in terms of the common variable.

Several works propose ANNs that learn representations of inputs that are measured via two sources, possibly of different modalities, for example audio and video, or images and texts [9, 10]. These works focus on learning a shared representation, containing information from the two modalities, and demonstrate that one modality provides information about the other modality. These works have been particularly interested in recovering the input in one modality from the input in another modality, for example, through an autoencoder. The problem of learning a shared representation of objects which are captured via multiple sensors, possibly of different modalities, is also discussed in [11], where diffusion maps are used to obtain the representation. However, there as well, sensor-specific variability is not removed. In this manuscript, we aim to discard modality-specific attributes, and learn the common hidden variable that underlies both modalities.

4.4 Similarities in the Input Space

One of the interesting properties of the common variable problem, demonstrated in the toy example in Figure 1, and in the examples in the next section, is that similarity in the common variable, or “semantic similarity”, can have very little to do with the similarity in the input space. For example, in the toy problem, we can have two different snapshots taken by Camera 2 which are supposed to be equivalent because the bulldog (the common variable) happens to be in the same state. However, the bunny, which is actually a larger, more dominant object, may appear in any state in the two snapshots, so the snapshots are very different in the input space. In other words, snapshots that are very different in the input space may be equivalent. Similarly, snapshots in which the bunny appears in a similar state but the bulldog does not will be similar in the input space, although they are not equivalent in the sense of the common variable we wish to capture. Therefore, similarity in the input space (measured via, say, Euclidean distance) might have little to do with similarity in the common variable, and consequently is an inappropriate tool for collecting “positive” pairs in this context.

Some Siamese networks use the distance in the input space to define pairs (e.g. nearest neighbors in the input space are paired in [2]), or to regularize the distance in the output space; this use of similarity in the input space has been demonstrated to be useful in dimensionality reduction. However, this type of pairing or regularization cannot discard the superfluous variables because it cannot distinguish between the superfluous variables and the common ones. Therefore, pairing based on the simultaneous measurements, when such measurements are available, is advantageous in recovering the common variable and in distilling hidden underlying phenomena.

4.5 Connection to CCA

A natural approach for discovering common information in a dataset of paired observations is to use Canonical Correlation Analysis (CCA) [12]. The ability of standard CCA to discover such information is limited, since it only considers linear transformations of its inputs. Among nonlinear versions of CCA, the Deep-CCA architecture proposed in [13] bears resemblance to the Siamese network architecture. In this manuscript we follow a different approach in the use of the dataset, and an architecture that resembles Siamese networks more than Deep-CCA. Our experiment in Section 5.4 suggests that the approach proposed in this manuscript better suits the common variable learning problem.

4.6 Learning Invariant Representations using Siamese Networks

In this section we discuss a related problem, learning invariant representations, which can also be viewed as a problem of learning equivalence classes.

Let a group act on a set. We say that two elements of the set are equivalent up to the group if some element of the group maps one to the other; this defines an equivalence relation on the set. We say that a map defined on the set is invariant to the group if (a) it takes the same value on any two equivalent elements, and (b) it takes different values on any two elements that are not equivalent, whichever group elements are applied to them.

In the invariant representation learning problem, we have examples of pairs obtained by applying different randomly selected group actions to a randomly selected element, and in some cases we may have examples of “negative pairs” whose elements are not equivalent; we would like to use such examples to find a function that is invariant to the group. In other words, we would like to learn a map that is defined on the equivalence classes of this relation. Since Siamese networks learn equivalence classes, they can be used to learn invariant representations, as demonstrated in Section 5.3.

Neural networks which are invariant to specific input transformations, such as translation and rotation, have been proposed in [14], [15], and [16], for example. However, these networks are often designed to be invariant to specific, well modeled transformations, rather than to unknown transformations.

5 Experimental Results

In this section we present experimental results of common variable learning. The experiments involve synthetic datasets, generated so that the common variable is defined explicitly, to demonstrate that the embedding obtained by the net indeed corresponds to the quotient space defined by the common variable. We also demonstrate how Siamese networks can be used to learn invariant representations.

In experiments where we have more than a single hidden layer in each stack, we pre-train every hidden layer in the two networks as a Denoising Autoencoder (DAE) [17] with an activation sparsity loss (see [18]). The optimization of the joint network is performed using standard Stochastic Gradient Descent (SGD) with momentum (see, for example, [19]) and dropout (see, for example, [20]), or using L-BFGS (see, for example, [21]); in both optimization algorithms, we compute gradients using standard backpropagation (see, for example, [22]). The classification accuracy we report is measured on a test set consisting of positive and negative examples that were not introduced to the network during training.

5.1 Common Variable Learning: the Toy Dataset (Spinning Figures)

In this experiment we revisit the setup described in Section 2.1. Here, the measurement in Sensor 1 is a snapshot taken by Camera 1 and the measurement in Sensor 2 is a snapshot taken by Camera 2. The positive dataset consists of pairs of snapshots taken simultaneously by Camera 1 and Camera 2. The negative dataset was constructed by pairing snapshots that had been taken at different times. The samples are color images; positive and negative examples are presented in Figure 3 (top). The positive and negative training sets were of equal size. Both networks had three layers: two hidden layers and an output layer. The joint network was trained using L-BFGS. The classification accuracy on the test set (as defined in Section 3.3) was 95.96%.

The learned representations in this experiment are the outputs of the two networks. We used standard dimensionality reduction algorithms to process these outputs for the purpose of visualization and further processing; in Figure 3 (bottom) we present the reduced representation obtained using diffusion maps [8], which we found to be clearer than the representation obtained using PCA. The closed curve and the smooth transitions in color in the embedding demonstrate that the algorithm recovered a good representation of the common variable, and that the position along the learned manifold corresponds to the value of the common variable.
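For readers unfamiliar with the visualization step, a basic diffusion-map embedding can be sketched as follows. This is a textbook-style sketch with a Gaussian kernel, not the implementation used for Figure 3; the function name, the kernel scale `epsilon` and the absence of density normalization are all assumptions made here.

```python
import numpy as np


def diffusion_map(points, epsilon, n_components=2):
    """Basic diffusion-map embedding of the rows of `points`."""
    sq = np.sum(points ** 2, axis=1)
    # Pairwise squared distances, clipped at zero against rounding error.
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * points @ points.T, 0.0)
    kernel = np.exp(-d2 / epsilon)
    degree = kernel.sum(axis=1)
    # Symmetric conjugate of the row-stochastic Markov matrix; it has the
    # same eigenvalues and lets us use a symmetric eigensolver.
    sym = kernel / np.sqrt(np.outer(degree, degree))
    eigvals, eigvecs = np.linalg.eigh(sym)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Convert to right eigenvectors of the Markov matrix; the first one is
    # constant (eigenvalue 1) and carries no information, so it is skipped.
    right = eigvecs / np.sqrt(degree)[:, None]
    return right[:, 1:n_components + 1] * eigvals[1:n_components + 1]
```

When the input points are sampled evenly from a circle, the two leading coordinates again trace a closed curve, which is the kind of structure one hopes to see when the common variable is an angle.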

Figure 3: The Toy dataset experiment. Top: two sample examples. In a positive example, the snapshots were taken simultaneously and contain two different views of the bulldog at the same rotation angle. In a negative example, the snapshots from the two cameras were not taken at the same time, hence they do not correspond to the same rotation angle of the bulldog. Bottom: embedding of the Toy dataset. The color of each point corresponds to the true common hidden variable, i.e., the rotation angle of the bulldog.

5.2 Common Variable Learning: Two Different Modalities

In the previous experiment, we used the same type of input in both sensors; in this experiment we used a different data modality in each sensor: images in one sensor and audio signals in the other.

A measurement from Sensor 1 is a concatenation of two rotations of an image by arbitrary angles: the rotation angle of the left image is the common variable, and the rotation angle of the right image is the sensor-specific variable. A measurement from Sensor 2 is a vector whose entries are samples of a sine; a deterministic function of the common variable determines the frequency of the sine, and the sensor-specific variable determines its phase. In other words, the common variable determines the rotation of the left image in the first sensor and the frequency of the sine in the second sensor; the sensor-specific variables are the rotation angle of the right image in the first sensor, and the phase of the sine in the second sensor. An example from the dataset of this experiment is presented in Figure 4 (top).
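A measurement of the second kind can be simulated along the following lines. The mapping from the common variable to a frequency, `base_freq * (1 + common)`, as well as the function name and the default vector length, are hypothetical choices made for this sketch; the source only states that the frequency is a deterministic function of the common variable.

```python
import numpy as np


def sine_measurement(common, phase, n_samples=100, base_freq=1.0):
    """Sampled sine: frequency set by `common`, phase set by `phase`."""
    t = np.linspace(0.0, 1.0, n_samples, endpoint=False)
    freq = base_freq * (1.0 + common)  # hypothetical frequency map
    return np.sin(2.0 * np.pi * freq * t + phase)
```

Two such vectors generated with the same `common` value but different phases can look quite different sample by sample, yet they share the same dominant frequency, which is exactly the situation in which input-space distance is a poor guide to the common variable.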

Both and had three layers, with units in each. and consisted of examples each. The accuracy on the test set was .

In Figure 4 (bottom) we present the diffusion embedding of the outputs of both networks, colored by the true value of the common variable; the smooth transition of the color along the manifold implies that the learned representation corresponds to the common variable. Furthermore, the points in Figure 4 (bottom) which correspond to outputs of the first network are indistinguishable from the points that correspond to outputs of the second network; in other words, data from the two different modalities have been mapped into the same space, where data points from the same or different modalities can be compared based on their corresponding values of the common variable.

Figure 4: The two modalities dataset. Top: a positive example from the dataset. A measurement from Sensor 1 is a concatenation of two rotations of the same image, one by the common angle and one by a sensor-specific angle. A measurement from Sensor 2 is a sine, with frequency determined by the common variable and phase determined by a sensor-specific variable. Bottom: embedding of the images and audio signals from the test set in the two modalities experiment; the color corresponds to the true value of the common variable. Data from two different modalities is mapped to the same space, and is parametrized by the common variable.

5.3 Learning a Rotation-Invariant Representation

The following experiment demonstrates the application of the algorithm to the problem of learning invariant representations, discussed in Section 4.6.

Our goal here is to learn maps that are rotation-invariant. The group in question is the group of rotations of images, whose elements rotate an image by a given number of degrees. We used images from the Caltech-101 dataset [23] (converted to gray-scale for convenience), and constructed datasets of rotated images. The positive set was composed of samples in which the two elements are two instances of the same base image rotated by two randomly chosen angles. Each sample in the negative dataset was composed of two different randomly chosen base images, each rotated by a different randomly chosen angle. Positive and negative examples from the dataset and the first layer weights of the network are presented in Figure 5 (top).
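The construction of the positive and negative sets can be sketched as follows. To keep the example dependency-free, rotations are restricted to multiples of 90 degrees via `np.rot90`; this restriction, the function name and the input layout are simplifying assumptions made here, whereas the experiment above uses arbitrary angles.

```python
import numpy as np


def rotation_pairs(images, n_pairs, rng):
    """Build positive and negative pairs from square base images.

    images: array of shape (n_images, h, h) with n_images >= 2.
    Returns two lists of (a, b) image pairs.
    """
    positive, negative = [], []
    for _ in range(n_pairs):
        # Two distinct base images.
        i, j = rng.choice(len(images), size=2, replace=False)
        # Three random quarter-turn counts in {0, 1, 2, 3}.
        k1, k2, k3 = rng.integers(0, 4, size=3)
        # Positive: the same base image under two random rotations.
        positive.append((np.rot90(images[i], k1), np.rot90(images[i], k2)))
        # Negative: two different base images, each randomly rotated.
        negative.append((np.rot90(images[i], k1), np.rot90(images[j], k3)))
    return positive, negative
```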

Figure 5: The rotation-invariance experiment. Top left: two sample examples generated from the Caltech-101 dataset for the rotation-invariance experiment. In a positive example, the two elements are the same image, up to rotation. In a negative example, the two elements are different images, rotated by different angles. Top right: first layer features in the rotation-invariance experiment. Bottom: histograms of the two kinds of distances (dark blue and light blue). Distances between representations of arbitrarily rotated copies of the same image are significantly smaller than distances between representations of different images.

The two networks had three layers each; the joint network was trained using L-BFGS. The learned functions achieved a high accuracy score of 99.44%.

To check whether the hidden representation we obtained is indeed invariant to rotation, we performed the following analysis: we randomly selected an image and rotated it by two random angles, obtaining two rotated copies. We then selected a different image and rotated it by a random angle, obtaining a third image. If the map is indeed invariant to rotations, then we expect the distance between the representations of the two rotated copies to be much smaller than the distance between the representation of a rotated copy and that of the third image. Histograms of these two distances for 10,000 repetitions of the above procedure are presented in Figure 5 (bottom); as evident from the histograms, the former is indeed significantly smaller than the latter, as expected.
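The sampling procedure behind such histograms can be sketched as below. The names `invariance_gap` and `sorted_pixels` are hypothetical; rotations are restricted to quarter turns for simplicity, and `sorted_pixels` is only a stand-in for a trained network (it is exactly invariant to quarter turns because it discards pixel positions).

```python
import numpy as np


def sorted_pixels(image):
    """Stand-in embedding: sorted pixel values, invariant to np.rot90."""
    return np.sort(image.ravel())


def invariance_gap(embed, images, n_trials, rng):
    """Sample same-image and different-image embedding distances."""
    same, diff = [], []
    for _ in range(n_trials):
        i, j = rng.choice(len(images), size=2, replace=False)
        a = embed(np.rot90(images[i], rng.integers(0, 4)))
        b = embed(np.rot90(images[i], rng.integers(0, 4)))
        c = embed(np.rot90(images[j], rng.integers(0, 4)))
        same.append(np.linalg.norm(a - b))  # two copies of one image
        diff.append(np.linalg.norm(a - c))  # copies of different images
    return np.array(same), np.array(diff)
```

The two returned arrays can then be passed to any histogram routine; for an invariant map, the first array should concentrate near zero while the second should not.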

5.4 Comparison to Deep CCA

Given realizations of two random variables, the deep CCA algorithm (see [13]) computes one map for each variable so that the cross correlation between the two mapped variables is maximized. We implemented the deep CCA network and applied it to the Toy dataset of Section 5.1, using the same network structure as in that experiment. The diffusion embedding obtained from the last-layer representation of the deep CCA network is presented in Figure 6. We observe that in this experiment the position along the embedded manifold does not correspond to the value of the common variable, i.e., the rotation angle of the bulldog; moreover, additional analysis indicated that the representation obtained by deep CCA in this experiment reflects the sensor-specific superfluous variables (the rotation angles of Yoda and the bunny), which we would like to discard.
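For concreteness, the correlation objective that deep CCA maximizes over the outputs of its two maps can be sketched as below, following the formulation in [13] as we understand it: the sum of the singular values of the whitened cross-covariance matrix. The function name `total_correlation` and the regularization constant `reg` are assumptions made for this example, and the sketch evaluates the objective only; it does not train any network.

```python
import numpy as np

def total_correlation(h1, h2, reg=1e-6):
    """Sum of canonical correlations between two views.

    h1, h2: arrays of shape (n_samples, dim), the outputs of the two maps.
    reg:    small ridge term added to each covariance (assumed value).
    """
    n = h1.shape[0]
    h1c = h1 - h1.mean(axis=0)
    h2c = h2 - h2.mean(axis=0)
    s11 = h1c.T @ h1c / (n - 1) + reg * np.eye(h1.shape[1])
    s22 = h2c.T @ h2c / (n - 1) + reg * np.eye(h2.shape[1])
    s12 = h1c.T @ h2c / (n - 1)

    def inv_sqrt(s):
        # Inverse matrix square root of a symmetric positive definite matrix.
        w, v = np.linalg.eigh(s)
        return v @ np.diag(w ** -0.5) @ v.T

    t = inv_sqrt(s11) @ s12 @ inv_sqrt(s22)
    # The singular values of t are the canonical correlations.
    return np.linalg.svd(t, compute_uv=False).sum()

rng = np.random.default_rng(0)
view1 = rng.standard_normal((500, 3))
mixing = np.array([[2.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 1.0, 3.0]])
print(total_correlation(view1, view1 @ mixing))                   # close to 3
print(total_correlation(view1, rng.standard_normal((500, 3))))    # much smaller
```

Note that this objective is maximized by any pair of maps whose outputs are linearly related, which is a different criterion from the pairwise equivalence learned by the Siamese network.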

Figure 6: Diffusion embedding obtained from the deep CCA network [13] on the Toy dataset. The color of each point corresponds to the value of the common hidden variable; an embedding that captures the common variable would show a smooth transition of color. The embedding here does not correspond to the value of the common hidden variable.

6 Conclusions

In this manuscript we presented Siamese neural networks as a solution to the statistical problem of common variable learning. We demonstrated that Siamese neural networks learn equivalence relations in the input space. We demonstrated how coincidence can be used for the recovery of common variables, in the absence of a model or labeled data, using examples of measurements that are “equivalent” or “related” via an appropriate form of coincidence, and using examples of measurements that are “not equivalent” or “unrelated”. In addition, we demonstrated how a Siamese network can map observations, possibly from different modalities, to a space in which their respective values of the common variable are comparable.

The experiments presented in this manuscript have been carefully designed to illustrate the theoretical arguments regarding the embedding obtained by Siamese networks representing the common variable and regarding limited use of domain knowledge. As demonstrated in other works, when domain knowledge is available it can be used in designing the network architecture: for example, when the samples are images, it is natural to use convolutional networks.


The authors would like to thank Raphy Coifman, Sahand N. Negahban, Andrew R. Barron and Ronen Talmon, for their help.


  • [1] S. Chopra, R. Hadsell, and Y. LeCun, “Learning a similarity metric discriminatively, with application to face verification,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1, pp. 539–546, IEEE, 2005.
  • [2] R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality reduction by learning an invariant mapping,” in Computer vision and pattern recognition, 2006 IEEE computer society conference on, vol. 2, pp. 1735–1742, IEEE, 2006.
  • [3] J. Masci, M. M. Bronstein, A. Bronstein, and J. Schmidhuber, “Multimodal similarity-preserving hashing,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 36, no. 4, pp. 824–830, 2014.
  • [4] G. W. Taylor, I. Spiro, C. Bregler, and R. Fergus, “Learning invariance through imitation,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pp. 2729–2736, IEEE, 2011.
  • [5] W.-t. Yih, K. Toutanova, J. C. Platt, and C. Meek, “Learning discriminative projections for text similarity measures,” in Proceedings of the Fifteenth Conference on Computational Natural Language Learning, pp. 247–256, Association for Computational Linguistics, 2011.
  • [6] K. M. Hermann and P. Blunsom, “Multilingual models for compositional distributed semantics,” arXiv preprint arXiv:1404.4641, 2014.
  • [7] R. R. Lederman and R. Talmon, “Learning the geometry of common latent variables using alternating-diffusion,” Applied and Computational Harmonic Analysis, 2015.
  • [8] R. R. Coifman and S. Lafon, “Diffusion maps,” Applied and computational harmonic analysis, vol. 21, no. 1, pp. 5–30, 2006.
  • [9] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, “Multimodal deep learning,” in Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 689–696, 2011.
  • [10] N. Srivastava and R. R. Salakhutdinov, “Multimodal learning with deep Boltzmann machines,” in Advances in neural information processing systems, pp. 2222–2230, 2012.
  • [11] Y. Keller, R. R. Coifman, S. Lafon, and S. W. Zucker, “Audio-visual group recognition using diffusion maps,” Signal Processing, IEEE Transactions on, vol. 58, no. 1, pp. 403–413, 2010.
  • [12] H. Hotelling, “Relations between two sets of variates,” Biometrika, pp. 321–377, 1936.
  • [13] G. Andrew, R. Arora, J. Bilmes, and K. Livescu, “Deep canonical correlation analysis,” in Proceedings of the 30th International Conference on Machine Learning, pp. 1247–1255, 2013.
  • [14] M. A. Ranzato, F. J. Huang, Y.-L. Boureau, and Y. LeCun, “Unsupervised learning of invariant feature hierarchies with applications to object recognition,” in Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on, pp. 1–8, IEEE, 2007.
  • [15] E. Oyallon and S. Mallat, “Deep roto-translation scattering for object classification,” arXiv preprint arXiv:1412.8659, 2014.
  • [16] K. Sohn and H. Lee, “Learning invariant representations with local transformations,” arXiv preprint arXiv:1206.6418, 2012.
  • [17] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Proceedings of the 25th international conference on Machine learning, pp. 1096–1103, ACM, 2008.
  • [18] A. Ng, “Unsupervised feature learning and deep learning, Stanford class CS294,” 2011. Accessed: 2016-01-18.
  • [19] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance of initialization and momentum in deep learning,” in Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 1139–1147, 2013.
  • [20] G. E. Dahl, T. N. Sainath, and G. E. Hinton, “Improving deep neural networks for LVCSR using rectified linear units and dropout,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 8609–8613, IEEE, 2013.
  • [21] S. J. Wright and J. Nocedal, Numerical optimization, vol. 2. Springer New York, 1999.
  • [22] R. Rojas, Neural networks: a systematic introduction. Springer Science & Business Media, 1996.
  • [23] L. Fei-Fei, R. Fergus, and P. Perona, “Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories,” Computer Vision and Image Understanding, vol. 106, no. 1, pp. 59–70, 2007.