Inverse Learning of Symmetry Transformations

02/07/2020 ∙ by Mario Wieser, et al. ∙ 0

Symmetry transformations induce invariances and are a crucial building block of modern machine learning algorithms. Some transformations can be described analytically, e.g. geometric invariances. However, in many complex domains, such as the chemical space, invariances can be observed yet the corresponding symmetry transformation cannot be formulated analytically. Thus, the goal of our work is to learn the symmetry transformation that induced this invariance. To address this task, we propose learning two latent subspaces, where the first subspace captures the property and the second subspace the remaining invariant information. Our approach is based on the deep information bottleneck principle in combination with a mutual information regulariser. Unlike previous methods however, we focus on estimating mutual information in continuous rather than binary settings. This poses many challenges as mutual information cannot be meaningfully minimised in continuous domains. Therefore, we base the calculation of mutual information on correlation matrices in combination with a bijective variable transformation. Extensive experiments demonstrate that our model outperforms state-of-the-art methods on artificial and molecular datasets.




1 Introduction

In physics, symmetries are used to model quantities which are retained after applying a certain class of transformations. From the mathematical perspective, symmetry can be seen as an invariance property of mappings, where such mappings leave a variable unchanged. Consider the example of rotational invariance from Figure 1(a). We first observe the 3D representation of a specific molecule $m$. The molecule is then rotated by a rotation $g$. For any rotation $g$, we calculate the distance matrix $D$ between the atoms of the rotated molecule $g(m)$ with a predefined function $f$. Note that a rotation is a simple transformation which admits a straightforward analytical form. As $g$ induces an invariance class, we obtain the same distance matrix for every rotation $g$, i.e. $f(m) = f(g(m))$ for any rotation $g$.
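The rotational example can be checked numerically. The following is a minimal sketch, assuming NumPy; the four-atom geometry is arbitrary and hypothetical (not a real molecule), and the helper names `distance_matrix` and `rotation_about_z` are introduced here for illustration only.

```python
import numpy as np

def distance_matrix(coords):
    """Pairwise Euclidean distances between atoms; coords has shape (n_atoms, 3)."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def rotation_about_z(angle):
    """3x3 matrix for a rotation by `angle` radians about the z-axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0],
                     [s, c, 0.0],
                     [0.0, 0.0, 1.0]])

# Hypothetical 4-atom geometry (arbitrary coordinates, not a real molecule).
molecule = np.array([[0.0, 0.0, 0.0],
                     [1.1, 0.0, 0.0],
                     [0.0, 1.3, 0.2],
                     [-0.7, 0.4, 1.0]])

reference = distance_matrix(molecule)
for angle in np.linspace(0.0, 2.0 * np.pi, 7):
    rotated = molecule @ rotation_about_z(angle).T
    # The coordinates change, but the distance matrix does not.
    assert np.allclose(distance_matrix(rotated), reference)
```

Only rotations about one axis are used here for brevity; the same check holds for any orthogonal transformation.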

(a) Rotational symmetry transformation.

(b) Unknown symmetry transformation.
Figure 1: Left: a molecule $m$ is rotated by $g$, which admits an analytical form. The distance matrix $D$ between atoms is calculated by a known function $f$ and remains unchanged for all rotations. Right: $n$ samples $\{m_i, e_i\}$, where $m$ is the molecule and $e$ the bandgap energy. These samples approximate the function $f$, whereas the class of functions $g$ leading to the same bandgap energy is unknown.

Now consider highly complex domains, e.g. the chemical space, where analytical forms of symmetry transformations are difficult or impossible to find (Figure 1(b)). The task of discovering novel molecules for the design of organic solar cells in material science is an example of such a domain. Here, all molecules must possess specific properties, e.g. a bandgap energy of approximately 1.4 eV (Shockley and Queisser, 1961), in order to adequately generate electricity from the solar spectrum. In such scenarios, no predefined symmetry transformation (such as rotation) is known or can be assumed. The only available data defining our invariance class are the numeric point-wise samples $\{m_i, e_i\}$ from the function $f$, where $n$ is the number of samples, $m$ the molecule and $e$ the bandgap energy. Therefore, no analytical form of a symmetry transformation which alters the molecule $m$ and leaves the bandgap energy $e$ unchanged can be assumed.

The goal of our model is thus to learn the class of symmetry transformations which result in a symmetry property of the modelled system. To this end, we learn a continuous data representation and the corresponding symmetry transformation in an inverse fashion from data samples only. To do so, we introduce the Symmetry-Transformation Information Bottleneck (STIB), where we encode the input $X$ (e.g. a molecule) into a latent space $Z$ and subsequently decode it to $X$ and a preselected target property $Y$ (e.g. the bandgap energy). Specifically, we divide the latent space into two subspaces $Z_0$ and $Z_1$ to explore the variations of the data with respect to a specific target. Here, $Z_0$ is the subspace that contains information about input and target, while $Z_1$ is the subspace that is invariant to the target. In doing so, we capture symmetry transformations not affecting the target $Y$ in the isolated latent space $Z_1$.

The central element of STIB is minimising the information about a continuous $Y$ (e.g. bandgap energy) present in $Z_1$ by employing adversarial learning. In contrast, cognate models have, to the best of our knowledge, solely focused on discrete $Y$. A potential reason is that naively employing the least squares loss (MSE), as done for maximising mutual information in other models, leads to critical problems. This stems from the fact that fundamental properties of mutual information, such as invariance to one-to-one transformations, are not captured by this mutual information estimator. Moreover, minimising mutual information implies maximising the MSE loss, which would result in an unbounded loss term.

To overcome the aforementioned issues, we propose a new loss function based on Gaussian mutual information with a bijective variable transformation as an addition to our modelling approach. In contrast to using the MSE, this enables the calculation of the mutual information on the basis of correlations. With this, we ensure a well defined loss function and invariance against linear one-to-one transformations.

In summary, we make the following contributions:

  1. We introduce a deep information bottleneck model that learns a continuous low-dimensional representation of the input data. We augment it with an adversarial training mechanism and a partitioned latent space to learn symmetry transformations based on this representation.

  2. We further propose a continuous mutual information regulation approach based on correlation matrices. This makes it possible to address multiple issues in the continuous domain such as unbounded loss functions and one-to-one transformations.

  3. Experiments on an artificial as well as two molecular datasets demonstrate that the proposed model learns both pre-defined and arbitrary symmetry transformations and outperforms state-of-the-art methods.

2 Related Work

2.1 Information Bottleneck and its connections

The Information Bottleneck (IB) method (Tishby et al., 1999) describes an information theoretic approach to compressing a random variable $X$ with respect to a second random variable $Y$. The standard formulation of the IB uses only discrete random variables. However, various relaxations, such as for Gaussian (Chechik et al., 2005) and meta-Gaussian variables (Rey and Roth, 2012), have also been proposed. In addition, a deep latent variable formulation of the IB method has been introduced (Alemi et al., 2017; Wieczorek et al., 2018). More recently, multiple applications have built on this method. For example, Keller et al. (2019, 2020) uncovered archetypal data points in the latent space, whereas Parbhoo et al. (2018) estimated causal effects.

2.2 Enforcing invariance in latent representations

Bouchacourt et al. (2018) introduced a multi-level VAE. Here, the latent space is decomposed into a local feature space that is only relevant for a subgroup and a global feature space. A more common technique to introduce invariance makes use of adversarial networks (Goodfellow et al., 2014). Specifically, the idea is to combine VAEs and GANs, where the discriminator tries to predict attributes and the encoder network tries to prevent this (Creswell et al., 2017; Lample et al., 2017). Perhaps most closely related to this study is the work of Klys et al. (2018), where the authors propose a mutual information regulariser to learn isolated subspaces for binary targets. However, these approaches are only applicable to discrete attributes, whereas our work tackles the more fundamental and challenging problem of learning symmetry transformations for continuous properties.

2.3 Relations to Fairness

Our work has close connections to fairness. Here, the main idea is to penalise the model for the presence of nuisance factors that have an unintended influence on the prediction, in order to achieve better predictions. Louizos et al. (2015), for example, developed a fairness constraint for the latent space based on maximum mean discrepancy (MMD) to become invariant to nuisance variables. Later, Xie et al. (2017) proposed an adversarial approach to become invariant against nuisance factors. In addition, Moyer et al. (2018) introduced a novel objective to overcome the disadvantages of adversarial training. Subsequently, Jaiswal et al. (2020) built on these methods by introducing a regularisation scheme based on the LSTM (Hochreiter and Schmidhuber, 1997) forget mechanism. In contrast to the described ideas, our work focuses on learning a symmetry transformation for continuous $Y$ instead of removing nuisance factors. Furthermore, we are interested in learning a generative model instead of solely improving downstream predictions.

3 Preliminaries

3.1 Deep Information Bottleneck

The Deep Variational Information Bottleneck (DVIB) (Alemi et al., 2017) is a compression technique based on mutual information. The main goal is to compress a random variable $X$ into a random variable $Z$ while retaining side information about a third random variable $Y$. Note that DVIB builds on the VAE (Kingma and Welling, 2013; Rezende et al., 2014), in that the mutual information terms in the former are equivalent to the VAE encoder and decoder (Alemi et al., 2017). Therefore, the VAE remains a special case of DVIB where the compression parameter $\lambda$ is set to 1 and $Y$ is replaced by the input $X$ in $I(Z;Y)$. Here, $I$ represents the mutual information between two random variables. Achieving the optimal compression requires solving the following parametric problem:

$$\min_{\phi,\theta} \; I_\phi(Z;X) - \lambda\, I_{\phi,\theta}(Z;Y), \qquad (1)$$

where we assume a parametric form of the conditionals $p_\phi(z|x)$ and $p_\theta(y|z)$. $\phi$ and $\theta$ represent the neural network parameters and $\lambda$ controls the degree of compression.

The mutual information terms can be expressed as:

$$I_\phi(Z;X) = \mathbb{E}_{p(x)}\, D_{KL}\big(p_\phi(z|x)\,\|\,p(z)\big), \qquad (2)$$

$$I_{\phi,\theta}(Z;Y) \ge \mathbb{E}_{p(x,y)}\, \mathbb{E}_{p_\phi(z|x)} \log p_\theta(y|z) + h(Y). \qquad (3)$$

$D_{KL}$ denotes the Kullback-Leibler divergence, $\mathbb{E}$ the expectation and $h(Y)$ the entropy of $Y$. For the details on the inequality in Eq. (3), see (Wieczorek and Roth, 2020).
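To make the Kullback-Leibler compression term concrete, the sketch below evaluates it in closed form. It assumes a diagonal Gaussian encoder and a standard normal prior, a common choice in VAE-style models that this section does not state explicitly; the function name and the example encoder outputs are hypothetical.

```python
import numpy as np

def kl_diag_gaussian_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), one value per row (in nats)."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

# Hypothetical encoder outputs for a batch of 3 inputs and a 2-dimensional latent.
mu = np.array([[0.0, 0.0],
               [1.0, -1.0],
               [0.5, 2.0]])
log_var = np.array([[0.0, 0.0],
                    [0.0, 0.0],
                    [-1.0, 0.5]])

kl_per_sample = kl_diag_gaussian_to_standard_normal(mu, log_var)
# Averaging over the batch gives a Monte Carlo estimate of the expectation over p(x).
compression_term = kl_per_sample.mean()
```

The first row matches the prior exactly and therefore contributes zero; any deviation in mean or variance makes the term strictly positive.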

3.2 Adversarial Information Elimination

A common approach to remove information from latent representations in the context of VAEs is using adversarial training (Lample et al., 2017; Creswell et al., 2017; Klys et al., 2018). The main idea is to train an auxiliary network $a_\psi(z)$ which tries to correctly predict an output $b$ from the latent representation $z$ by minimising the classification error. Concurrently, an adversary, in our case the encoder network of the VAE, tries to prevent this. To this end, the encoder attempts to generate adversarial representations $z$ which contain no information about $b$ by maximising the loss $\mathcal{L}$ with respect to parameters $\phi$. The overall problem may then be expressed as an adversarial game where we compute:

$$\max_{\phi} \min_{\psi} \; \mathcal{L}\big(a_\psi(p_\phi(z|x)),\, b\big), \qquad (4)$$

with $\mathcal{L}$ denoting the cross-entropy loss. While this approach is applicable to discrete domains, generalising this loss function to continuous settings can lead to severe problems in practice. We elaborate on this issue in the next section.

Figure 2: Graphical illustration of our two-step adversarial training approach. Red circles denote observed input/output of our model. Gray rectangles represent the latent representation, which is divided into two separate subspaces $Z_0$ and $Z_1$. Blue dotted arrows represent neural networks with fixed parameters. Black arrows describe neural networks with trainable parameters. Greek letters denote neural network parameters. In the first step (Figure 2(a)), we try to learn a latent representation which minimises the mutual information between $Z_1$ and $Y$ by updating the encoder and decoder parameters. In the adversary step (Figure 2(b)), we maximise the mutual information between $Z_1$ and $Y$ by updating the adversary parameters $\delta$.

4 Model

As previously described in Section 1, our goal is to learn symmetry transformations $g$ based on observations $X$ and $Y$ (see Figure 1(b)). Here, $X$ and $Y$ may be complex objects, such as molecules and their corresponding bandgap energies, which are difficult to manipulate consistently. In order to overcome this issue, we aim to learn a continuous low-dimensional representation of our input data $X$ and $Y$ in Euclidean space. To do so, we augment the traditional deep information bottleneck formulation (Eq. (1)) with an additional decoder reconstructing $X$ from $Z$. Our base model is thus defined as an augmented parametric formulation of the information bottleneck (see Eq. (1)):

$$\min_{\phi,\theta,\tau} \; I_\phi(X;Z) - \lambda\big(I_{\phi,\tau}(Z;Y) + I_{\phi,\theta}(Z;X)\big), \qquad (5)$$

where $\phi$, $\theta$ and $\tau$ describe neural network parameters and $\lambda$ the compression factor.

Proposed Model.

Based on the model specified in Eq. (5), we provide a novel approach to learn symmetry transformations on latent representations $Z$. To this end, we propose partitioning the latent space $Z$ into two components, $Z_0$ and $Z_1$. $Z_0$ is intended to capture all information about $Y$, while $Z_1$ should contain all remaining information about $X$. That is, changing $Z_1$ should change $X$ but not affect the value of $Y$. Thus, $Z_1$ expresses a surrogate of the unknown function $g$ which is depicted in Figure 1(b). However, simply partitioning the latent space is not sufficient, since information about $Y$ may still be encoded in $Z_1$.

To address this task, we propose combining the model defined in Eq. (5) with an adversarial approach (Section 3.2). The resulting model thus reduces to playing an adversarial game of minimising and maximising the mutual information $I_\delta(Z_1;Y)$, where $\delta$ denotes the neural network parameters of the adversary. This ensures that $Z_1$ contains no information about $Y$. In more detail, our approach is formulated as follows:

In the first step (see Figure 2(a)), our model learns a low-dimensional representation $Z_0$ and $Z_1$ of $X$ and $Y$ by maximising the mutual information between $Z_0$ and $Y$. At the same time, our algorithm tries to eliminate information about $Y$ from $Z_1$ by minimising the mutual information $I_\delta(Z_1;Y)$ via changing $Z_1$ with fixed parameters $\delta$ (Eq. (6)).


The second step defines the adversary of our model and is illustrated in Figure 2(b). Here, we try to maximise the mutual information $I_\delta(Z_1;Y)$ (Eq. (7)) given the current representation of $Z_1$. To do so, we fix all model parameters except $\delta$ and update $\delta$ accordingly.


The resulting loss functions $\mathcal{L}_1$ and $\mathcal{L}_2$ are alternately optimised until convergence. Yet, minimising mutual information for continuous variables, such as bandgap energy, remains a challenging task.
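One way to write the two alternating objectives described above is sketched below. The symbols follow the roles named in the text: $\phi$ for the encoder, $\theta$ and $\tau$ for the decoders of $X$ and $Y$, $\delta$ for the adversary and $\lambda$ for the compression factor. The weight $\beta$ on the adversarial term and the exact grouping of terms are assumptions made for illustration and may differ from the authors' formulation.

```latex
\begin{align}
\mathcal{L}_1 &= \min_{\phi,\theta,\tau}\;
    I_\phi(X;Z)
    - \lambda\Big(I_{\phi,\tau}(Z_0;Y) + I_{\phi,\theta}(Z;X)\Big)
    + \beta\, I_\delta(Z_1;Y)
    && \text{(first step, $\delta$ fixed)} \\
\mathcal{L}_2 &= \max_{\delta}\; I_\delta(Z_1;Y)
    && \text{(adversary step, $\phi,\theta,\tau$ fixed)}
\end{align}
```

In this reading, the first step rewards a representation in which $Z_0$ predicts $Y$ and $Z = (Z_0, Z_1)$ reconstructs $X$ while penalising any dependence the current adversary can detect between $Z_1$ and $Y$; the second step then re-tightens the adversary's estimate of that dependence.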

Challenges in Continuous Domains.

A naive solution to deal with a continuous variable is to use the MSE as a proxy for mutual information (MI). Here, maximising the mutual information can be simply obtained by minimising the MSE, since the MSE constitutes a good empirical estimator in this case. However, this simple formulation of mutual information cannot be meaningfully minimised, as required by our adversary in Eq. (6). This stems from the fact that we would have to maximise the MSE, which results in two problems: (1) Our objective becomes unbounded because the MSE is a convex function that is unbounded from above. That is, the MSE has no maximum, so the optimisation problem is ill-posed in practice. (2) Furthermore, when maximising the MSE, the network may learn trivial solutions which depend on linear one-to-one transformations. For example, the network might maximise the MSE by adding only a large bias to the output. This can result in a high MSE even though the MI remains unchanged. This, combined with (1), gives rise to solutions which are not desired when minimising MI. Therefore, we require a more sophisticated approach that formulates a well-defined loss function and introduces invariance against linear one-to-one transformations.
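The bias argument can be illustrated with a small numerical sketch, assuming NumPy and synthetic Gaussian data; the helper names are hypothetical. A linear one-to-one transformation of a good predictor inflates the MSE by orders of magnitude, while a correlation-based Gaussian MI value stays the same.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=2000)
prediction = y + 0.1 * rng.normal(size=2000)   # an accurate predictor of y

def mse(a, b):
    """Mean squared error between two 1-D arrays."""
    return float(np.mean((a - b) ** 2))

def gaussian_mi_1d(a, b):
    """MI in nats of two jointly Gaussian scalars, computed from their correlation."""
    rho = np.corrcoef(a, b)[0, 1]
    return float(-0.5 * np.log(1.0 - rho**2))

# Linear one-to-one transformation: rescale and add a large bias.
shifted = 3.0 * prediction + 100.0

print("MSE before / after:", mse(prediction, y), mse(shifted, y))
print("MI  before / after:", gaussian_mi_1d(prediction, y), gaussian_mi_1d(shifted, y))
```

Because the MSE can be driven up this way without removing any information, maximising it is not a usable surrogate for minimising MI.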

Suggested Solution.

We propose a solution that estimates mutual information in a Gaussian setting. This approach is based on correlation matrices and circumvents the problems discussed in the previous paragraph. The reason is that mutual information based on correlation can be meaningfully minimised and is by definition invariant against linear one-to-one transformations. In this setting, mutual information can be decomposed into a sum of multi-informations (Liu, 2012):

$$I(Z_1;Y) = M(Z_1,Y) - M(Z_1) - M(Y), \qquad (8)$$

where the specific multi-information terms are specified as follows:

$$M(Z_1,Y) = \tfrac{1}{2}\log\big(|\mathrm{diag}(\Sigma_{Z_1Y})|\,|\Sigma_{Z_1Y}|^{-1}\big), \quad M(Z_1) = \tfrac{1}{2}\log\big(|\mathrm{diag}(\Sigma_{Z_1})|\,|\Sigma_{Z_1}|^{-1}\big), \quad M(Y) = \tfrac{1}{2}\log\big(|\mathrm{diag}(\Sigma_{Y})|\,|\Sigma_{Y}|^{-1}\big), \qquad (9)$$

where $Z_1$ and $Y$ are $p$- and $q$-dimensional, respectively. $\Sigma_{Z_1Y}$, $\Sigma_{Z_1}$ and $\Sigma_{Y}$ denote the sample covariance matrices of $(Z_1,Y)$, $Z_1$ and $Y$, respectively. In practice, we calculate the correlation matrices based on the sample covariance. Note that in the Gaussian setting in which our model is defined, the mutual information is a deterministic function of the correlation (Cover and Thomas, 2006).
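A minimal NumPy sketch of this correlation-based estimate follows. It uses the fact that, for Gaussian variables, the multi-information equals minus one half of the log-determinant of the correlation matrix. The function names and the synthetic data are hypothetical, and the code is an illustration rather than the authors' implementation.

```python
import numpy as np

def multi_information(samples):
    """Gaussian multi-information in nats: -0.5 * log det(correlation matrix)."""
    samples = np.asarray(samples, dtype=float)
    if samples.ndim == 1 or samples.shape[1] == 1:
        return 0.0  # a single variable carries no multi-information
    corr = np.corrcoef(samples, rowvar=False)
    _, logdet = np.linalg.slogdet(corr)
    return -0.5 * logdet

def gaussian_mutual_information(z, y):
    """I(Z;Y) = M(Z,Y) - M(Z) - M(Y) under a joint Gaussian assumption."""
    z = np.asarray(z, dtype=float).reshape(len(z), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    joint = np.hstack([z, y])
    return multi_information(joint) - multi_information(z) - multi_information(y)

# Synthetic check: y depends on the first latent dimension only.
rng = np.random.default_rng(0)
z = rng.normal(size=(5000, 2))
y_dependent = z[:, 0] + 0.5 * rng.normal(size=5000)
y_independent = rng.normal(size=5000)

mi_dep = gaussian_mutual_information(z, y_dependent)      # analytic value: 0.5 * log(5)
mi_ind = gaussian_mutual_information(z, y_independent)    # close to zero
mi_affine = gaussian_mutual_information(2.0 * z - 7.0, 10.0 * y_dependent + 3.0)
```

The estimate is bounded below by zero and is unchanged by rescaling or shifting either argument, which are the two properties the MSE-based surrogate lacks.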

Figure 3: Model extended with the bijective mapping between $Y$ and $\tilde{Y}$. Solid arrows depict a nonlinear function parametrised by a neural network. Gray rectangles denote latent variables and red circles observed variables.

Relaxing the Gaussian Assumption.

So far, we made the strong parametric assumption that both $Z_1$ and $Y$ are Gaussian distributed. However, the target variable $Y$ does not necessarily follow a Gaussian distribution. For this reason, we equip the model with a proxy bijective mapping (Figure 3) to introduce more flexibility. This mapping is implemented as two additional networks between $Y$ and a new proxy variable $\tilde{Y}$. The parameters of these networks are added to the existing parameters. We subsequently treat $\tilde{Y}$ as the values to be predicted from the latent representation. The bijective mapping makes it possible to compute the mutual information between $Z_1$ and $Y$ (or its proxy, $\tilde{Y}$) analytically with the formula for Gaussian variables. It is the correct approximation since the Gaussian distribution is the maximum entropy probability distribution for the maximised $I_\delta(Z_1;\tilde{Y})$ after the bijective relaxation of $Y$ to $\tilde{Y}$. Thus, the new loss function is augmented as follows:

5 Experiments

A detailed description about the setups as well as additional experiments can be found in the supplementary materials.

5.1 Artificial Experiments


Our dataset is generated as follows: Our input consists of two input vectors


. Here, the input vectors are drawn from a uniform distribution defined on

and further multiplied by 8 and subtracted by 4. All input vectors form our input matrix . Subsequently, we define two latent variables and . Here, and are calculated as where . Last, we calculate two target variables and . In doing so, is calculated by where and is 10. is defined as with . Thus, and form a spiral where the angle and the radius are highly correlated. For visualisation purposes, we bin and colour code the values of in the following experiments. During the training and testing phase, samples are drawn from this data generation process.

Experiment 1. We demonstrate the ability of our model to learn a symmetry transformation which admits an analytical form from observations only. We compare our method with the VAE (Gomez-Bombarelli et al., 2018) and STIB without regulariser for this purpose. Here, we use the same latent space structure that was described in the experimental setup. Subsequently, we plot the first dimension of $Z_1$ (x-axis) against $Z_0$ (y-axis) for all three methods. Because every dimension in the VAE model contains information about the target, we plot the first against the second dimension for simplicity. The horizontal coloured lines indicate that our approach (Fig. 4(c)) is able to learn a well-defined symmetry transformation, because changing the value on the x-axis does not change the target $Y$. In contrast, the VAE (Fig. 4(a)) and STIB without any regulariser (Fig. 4(b)) are not able to preserve this invariance property and simultaneously encode information about $Y$ in $Z_1$. This can be clearly seen from the colour change along horizontal lines. That is, modifying the invariant space would still result in a change of $Y$, which is why our mutual information regulariser is required.

Figure 4: Panel (a) depicts the latent space of the VAE, where the first two dimensions are plotted. In contrast, panel (b) shows the latent space of STIB trained without our regulariser. Here, the invariant dimension $Z_1$ (x-axis) is plotted against the first dimension of $Z_0$ (y-axis). Panel (c) illustrates the first dimension of the invariant latent space $Z_1$ (x-axis) plotted against $Z_0$ (y-axis) after being trained by our method. Horizontal coloured lines in the bottom right panel indicate invariance with respect to the target $Y$. In the remaining panels the stripe structure is broken. Black arrows denote the invariant direction.

Experiment 2. Here, we provide a quantitative comparison of five different models in order to demonstrate the impact of our novel model architecture and mutual information regulariser. In addition to the models considered in Experiment 1, we compare to the conditional VAE (CVAE) (Sohn et al., 2015) and the conditional Information Bottleneck (CVIB) (Moyer et al., 2018) in Table 1. The setup is identical to the one described in the experimental setup (see Supplement). We compare the reconstruction MAE of $X$ and $Y$ as well as the amount of information remaining in the invariant space, measured as the mutual information between $Z_1$ and $Y$. Our study shows that all models obtain competitive reconstruction results for both $X$ and $Y$. However, we encounter a large difference between the models with respect to the remaining target information in the latent space $Z_1$. In order to quantify the differences, we calculated the mutual information using a nonparametric Kraskov estimator (Kraskov et al., 2004) to obtain a consistent estimate for all the models we compared. We specifically avoid using the Gaussian mutual information in our comparisons here, because in the other models the (non-transformed) $Y$ is not necessarily Gaussian. Otherwise, we would end up with inaccurate mutual information estimates that make a fair comparison infeasible. Using the Kraskov estimator, we observe that all competing models contain a large amount of mutual information about $Y$. In the VAE case, we obtain a mutual information of 3.89 nats, and with our method without regularisation a value of 3.85 nats. Moreover, CVAE and CVIB still contain 2.57 nats and 2.44 nats of mutual information, respectively. However, if we employ our adversarial regulariser, we are able to decrease the mutual information to 0.30 nats. That is, we have removed most of the information about $Y$ from $Z_1$. These quantitative results showcase the effectiveness of our method and support the qualitative results illustrated in Figure 4.
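For readers unfamiliar with the Kraskov estimator, the sketch below implements its first variant (often called KSG algorithm 1) with brute-force neighbour search in NumPy. It assumes continuous data without duplicate points and uses quadratic memory, so it is only suitable for small samples. The function names and the synthetic data are hypothetical, and this is not the implementation used for the reported numbers.

```python
import numpy as np

def digamma_int(n):
    """Digamma at positive integers: psi(n) = -gamma + sum_{k=1}^{n-1} 1/k."""
    n = np.asarray(n, dtype=int)
    harmonic = np.concatenate([[0.0], np.cumsum(1.0 / np.arange(1, n.max() + 1))])
    return -np.euler_gamma + harmonic[n - 1]

def ksg_mutual_information(x, y, k=3):
    """KSG (algorithm 1) estimate of I(X;Y) in nats; O(N^2) time and memory."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    n = len(x)
    # Max-norm distances in each marginal space and in the joint space.
    dx = np.abs(x[:, None, :] - x[None, :, :]).max(axis=-1)
    dy = np.abs(y[:, None, :] - y[None, :, :]).max(axis=-1)
    dz = np.maximum(dx, dy)
    np.fill_diagonal(dz, np.inf)             # exclude each point itself
    eps = np.sort(dz, axis=1)[:, k - 1]      # distance to the k-th joint neighbour
    # Marginal neighbours strictly closer than eps (subtract 1 for the point itself).
    nx = (dx < eps[:, None]).sum(axis=1) - 1
    ny = (dy < eps[:, None]).sum(axis=1) - 1
    return float(digamma_int(k) + digamma_int(n)
                 - np.mean(digamma_int(nx + 1) + digamma_int(ny + 1)))

# Synthetic check: a "leaky" target that depends on z1 and a "clean" one that does not.
rng = np.random.default_rng(0)
z1 = rng.normal(size=1000)
y_leaky = z1 + 0.5 * rng.normal(size=1000)
y_clean = rng.normal(size=1000)

mi_leaky = ksg_mutual_information(z1, y_leaky)   # analytic value: 0.5 * log(5) nats
mi_clean = ksg_mutual_information(z1, y_clean)   # close to zero
```

Unlike the Gaussian formula, this estimator makes no distributional assumption, which is why it is suited to comparing models whose latent variables and targets are not necessarily Gaussian.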

Artificial Experiment
Model            MI(Z_1,Y) [nats]
VAE              3.89
STIB w/o adv.    3.85
CVAE             2.57
CVIB             2.44
STIB             0.30

Table 1: Quantitative summary of results from artificial data. Here, we consider the VAE, STIB without regularisation, CVAE, CVIB and STIB. For all models, the MAE reconstruction errors for $X$ and $Y$ are considered, as well as the mutual information (MI) in nats between the invariant space $Z_1$ and $Y$ based on a Kraskov estimator. Lower MAE and MI are better. STIB outperforms each of the baselines considered.

5.2 Real Experiment: Small Organic Molecules

Dataset. We use the 134K organic molecules from the QM9 database (Ramakrishnan et al., 2014), which consist of up to nine main group atoms (C, O, N and F), not counting hydrogens. The chemical space of QM9 is based on GDB-17 (Ruddigkeit et al., 2012), the largest virtual database of compounds to date, enumerating 166.4 billion molecules of up to 17 atoms of C, N, O, S, and halogens. GDB-17 is systematically generated using molecular connectivity graphs and represents an attempt at a complete and unbiased enumeration of the space of small and stable organic molecules. Each molecule in QM9 has corresponding geometric, energetic, electronic and thermodynamic properties that are computed from Density Functional Theory (DFT) calculations. In all our experiments, we focus on a subset of this dataset with a fixed stoichiometry (C$_7$O$_2$H$_{10}$), consisting of 6095 molecules with corresponding bandgap energy and polarisability as invariant properties. By restricting the chemical space to a fixed atomic composition, we can focus on how changes in chemical bonds govern these properties.

Experiment 3. We inspect the ability of the model to reconstruct the input molecule given a varying number of latent dimensions (Fig. 5). To do so, we train our model on 95% of our dataset and subsequently evaluate on the remaining 5%. Model selection is performed by inspecting the reconstruction accuracy to select the optimal number of latent dimensions. In our case, the optimal model converges at 16 latent dimensions. Reconstructing molecules from fewer dimensions is in general more challenging because there is a large number of molecules with similar bandgap energies and polarisabilities. This results in collisions, which make it difficult to resolve the many-to-one mapping in the latent space. In addition, we calculated the mutual information between $Z_1$ and $Y$ using the Kraskov estimator. It is important to note that our model does not come with a trade-off between the reconstruction accuracy and being invariant against $Y$ in $Z_1$. This property is clearly indicated in Figure 5 (blue line). Here, it can be observed that the mutual information stays between 0.03 and 0.1 for all numbers of latent dimensions considered.

Figure 5: Illustration of the model selection process of STIB on the test set defined in Experiment 3. For this purpose, the SMILES reconstruction accuracy (green dot) is considered. The x-axis denotes the number of latent dimensions, whereas the left y-axis depicts the reconstruction accuracy of the molecules. The plot indicates that our reconstruction rate saturates at a level of 99% even when varying the number of latent dimensions. In addition, we plot the mutual information (blue cross) between $Z_1$ and $Y$ for all models, which is depicted on the right y-axis.

Experiment 4. Here, we demonstrate that our model can generalise to more than one target (see Experiment 1), meaning novel materials have to satisfy multiple properties and at the same time have a structurally invariant subspace. We train a model with a subspace ($Z_1$) which contains no information about the material-dependent properties, bandgap energy and polarisability. In order to illustrate this relationship, we plot the first two dimensions of the property space $Z_0$ and colour code the points according to intervals for bandgap energy and polarisability (Figure 6(a) and Figure 6(b), respectively). The colour bins are equally spaced over the property range, where the minimum is 1.02 eV / 6.31 bohr³ and the maximum is 16.93 eV / 143.43 bohr³ for bandgap energy and polarisability, respectively. For simplicity and readability, we divide the invariant latent space $Z_1$ into ten sections and cumulatively group the points. Four sections were chosen for Figure 6. We note that binning is not necessary, but it increases the readability of the figure. In every $Z_1$-section, we observe the stripe structure, which means that $Z_1$ is invariant with respect to the target. In contrast, if $Z_1$ encoded any property information, the stripe structure would be broken, as demonstrated in Figure 4(a) and Figure 4(b). Thus, our experiment clearly indicates that there is no change in the latent space structure with respect to bandgap energy and polarisability. That is, exploring $Z_1$ will not affect the properties of the molecule.

Figure 6: Latent space plots for the first two dimensions of the property-dependent space $Z_0$. Colours illustrate binned target properties: bandgap energies (panel (a)) and polarisabilities (panel (b)). The bins are equally spaced over the property range. The values lie between 1.02 eV / 6.31 bohr³ and 16.93 eV / 143.43 bohr³ for bandgap energy and polarisability, respectively. The four plots for each property denote four binned sections along the property-invariant dimension $Z_1$, out of a total of ten sections. The invariance is illustrated by the lack of overlap of the colour patterns for each section in $Z_1$.

Experiment 5. In this experiment, we perform a quantitative study on real data to demonstrate the effectiveness of our approach. We compare the baseline VAE, CVAE, CVIB, STIB without mutual information regularisation of the latent space and STIB with mutual information regularisation (Table 2). Considering both the accuracy of correctly reconstructed SMILES strings and the MAE of the bandgap energy and polarisability, we obtain competitive reconstruction results. The SMILES reconstruction accuracy is 98% for the VAE, 98% for STIB without the adversarial training scheme, 91% for CVAE, 76% for CVIB and 98% for STIB. In addition, the bandgap and polarisability MAE for the VAE is 0.28 eV and 0.75 bohr³, respectively. In comparison, STIB without adversary obtains a bandgap error of 0.28 eV and a polarisability error of 0.70 bohr³. Moreover, STIB obtains an MAE of 0.27 eV for bandgap energy and 0.77 bohr³ for polarisability. This shows that our approach achieves competitive results in both reconstruction tasks in comparison to the baseline. As in Experiment 3, we additionally calculated the mutual information between the target-invariant space $Z_1$ and the target $Y$ with a Kraskov estimator. In order to get a better estimate, we estimated the mutual information on the whole dataset. For the baseline and our model without regularisation, we obtained a mutual information of 1.54 nats and 0.66 nats, respectively. Here, 1.54 nats represents the amount of mutual information if the entire $Z$ is considered (e.g. VAE). In addition, CVAE contains 0.56 nats of mutual information. That implies that $Z_1$ still contains between a third and a half of the information about $Y$, whereas if we employ our regulariser the mutual information is 0.09 nats. Only CVIB obtained a slightly better MI estimate of 0.03 nats; however, it possesses a vastly weaker reconstruction accuracy. These quantitative results show that only STIB is able to remove nearly all property information from $Z_1$ while retaining reconstruction accuracy in real-world applications, and they support the qualitative results obtained in Experiment 4.

                 QM9                                               Zinc
Model            SMILES   bandgap   polarisability   MI(Z_1,Y)     SMILES   druglikeliness   MI(Z_1,Y)
VAE              0.98     0.28      0.75             1.54          0.98     0.05             0.80
STIB w/o adv.    0.98     0.28      0.70             0.66          0.98     0.05             0.24
CVAE             0.91     -         -                0.56          0.98     -                0.28
CVIB             0.76     -         -                0.03          0.94     -                0.29
STIB             0.98     0.27      0.77             0.09          0.98     0.05             0.07

Table 2: Summary of quantitative results for the QM9 and Zinc experiments. Here, we consider VAE, STIB without regularisation, CVAE, CVIB and STIB. The accuracy for SMILES and the MAE reconstruction errors for bandgap energy (eV), polarisability (bohr³) and druglikeliness (probability) are computed, as well as the mutual information in nats between the invariant space $Z_1$ and $Y$ based on a Kraskov estimator (MI(Z_1,Y)). Higher SMILES accuracy and lower MAE and MI are better. STIB offers the best combination of reconstruction accuracy and low MI among the models considered.

5.3 Zinc Dataset

Dataset. In the third experiment, we use the 250K drug-like molecules from the ZINC database (Gomez-Bombarelli et al., 2018). In contrast to QM9, this dataset consists of molecules with up to 23 heavy atoms (C, O, N and F), not including hydrogens, and offers a larger variety of molecular structures. The dataset is a randomly picked subset of the larger ZINC database (Sterling and Irwin, 2015), which contains over 17 million molecules. Here, every molecule has calculated drug-specific properties such as the synthetic accessibility score (SAS) or the Quantitative Estimate of Drug-likeness (QED).

Experiment 6. In this experiment, we perform an additional quantitative study on the Zinc dataset (Table 2). Here, we also obtain competitive reconstruction results in terms of SMILES accuracy and druglikeliness. The SMILES reconstruction accuracy is 98% for the VAE, 98% for STIB without the adversarial training scheme, 98% for CVAE, 94% for CVIB and 98% for STIB. Furthermore, we investigated the druglikeliness MAE, which is 0.05 for every model that reconstructs this property. This shows that our approach achieves competitive results in both reconstruction tasks in comparison to the baseline. Last, we estimated the mutual information between the target-invariant space $Z_1$ and the target $Y$ with a Kraskov estimator. The VAE baseline contains 0.80 nats of mutual information, whereas STIB without adversary contains 0.24 nats. Moreover, we obtained a mutual information of 0.28 nats and 0.29 nats for CVAE and CVIB, respectively. That implies that all of these models retain information about $Y$ in $Z_1$, whereas if we employ STIB the mutual information is approximately eliminated (0.07 nats). These results confirm the findings of Experiments 2 and 5 that only STIB is able to learn symmetry transformations from data while achieving competitive reconstruction results.

Experiment 7. Lastly, we investigate the generative nature and the property consistency of our model. To do so, we fix three different points in the property latent space. These points represent a druglikeliness of 0.5, 0.7 and 0.9 for rows one to three in Figure 6(a), respectively. Afterwards, we randomly sample points in the invariant latent space, which are subsequently decoded to SMILES strings.

Figure 7: Illustration of the generative process of our model. Figure 6(a) shows samples drawn by our model. The labels represent the predicted druglikeliness properties, which were estimated by our model. Each row in Figure 6(a) denotes molecules generated with a predefined druglikeliness. We further estimate the properties of the generated molecules and show the result in Figure 6(b). The blue shaded background is the error confidence interval of our model and the x-axis denotes the MAE of all samples in the boxplot.

Having generated novel SMILES with potentially identical druglikeliness, we now perform a self-consistency check. That is, we feed the generated SMILES into our model and predict the properties. If our model has learned an invariant representation, the predicted druglikeliness should be identical to the fixed druglikeliness within the model error. We summarise the results of the self-consistency check in Figure 6(b). Here, we plot the predicted druglikeliness averaged over all generated molecules from the three reference points using a boxplot. Every boxplot contains between 108 and 193 sampled molecules. The x-axis denotes the druglikeliness MAE, whereas the blue box denotes the model error. The predicted properties averaged over all generated molecules from the three reference points have an MAE between 0.04 and 0.05, which lies within the model error range calculated in Table 2. This observation is additionally supported by the boxplots, where the predominant proportion of molecules lies within the model error range (blue box). Hence, this experiment demonstrates the generative capabilities of STIB by generating chemically correct novel molecules within the model's error range.
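The self-consistency check described above reduces to comparing re-predicted properties against the fixed target. The following sketch shows one way this comparison could be written. The function name and the predicted values are made-up placeholders, not results from the paper. Only the target druglikeliness of 0.7 and the model error of 0.05 are taken from the text.

```python
def self_consistency_report(target, predictions, model_error):
    """Return (MAE, fraction of predictions within the model error of the target)."""
    abs_errors = [abs(p - target) for p in predictions]
    mae = sum(abs_errors) / len(abs_errors)
    within = sum(1 for e in abs_errors if e <= model_error) / len(abs_errors)
    return mae, within


if __name__ == "__main__":
    # Placeholder predictions for molecules generated at a fixed druglikeliness of 0.7.
    placeholder_predictions = [0.68, 0.73, 0.71, 0.79]
    mae, within = self_consistency_report(0.7, placeholder_predictions, model_error=0.05)
    print(f"MAE = {mae:.4f}, fraction within model error = {within:.2f}")
```

Under this reading, a model passes the check when the MAE stays at or below the model error and most generated molecules fall inside the error band.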

6 Conclusion

Symmetry transformations constitute a central building block for a large number of machine learning algorithms. In simple cases, symmetry transformations may be formulated analytically, e.g. for rotation or translation (Figure 0(a)). However, in many complex domains, such as the chemical space, invariances can only be observed from data, for example when different molecules have the same bandgap energy (Figure 0(b)). In the latter scenario, the corresponding symmetry transformation cannot be formulated analytically. Hence, learning such symmetry transformations from observed data remains a highly relevant yet challenging task, for instance in drug discovery. To address this task, we make three distinct contributions:

  1. We present the STIB method, a novel generative model that is able to learn arbitrary symmetry transformations from observations alone via adversarial training and a partitioned latent space;

  2. In addition to our modelling contribution, we provide a technical solution for continuous mutual information estimation based on correlation matrices. This approach circumvents an unbounded loss function and establishes invariance against linear one-to-one transformations;

  3. Experiments on an artificial as well as two molecular datasets show that the proposed model learns symmetry transformations for both well-defined and arbitrary functions, and outperforms state-of-the-art methods.
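The second contribution builds on the fact that, for jointly Gaussian variables, mutual information can be computed from correlations alone. A minimal sketch of the scalar special case, I(X;Y) = -0.5 * ln(1 - rho^2), is given below, together with a numerical check that the value does not change under invertible affine rescaling of either variable. The helper names and the synthetic data are hypothetical. The estimator in the paper operates on correlation matrices after a bijective variable transformation, which this sketch does not reproduce.

```python
import math
import random


def pearson_correlation(xs, ys):
    """Sample Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def gaussian_mutual_information(xs, ys):
    """Mutual information in nats under a bivariate Gaussian assumption."""
    rho = pearson_correlation(xs, ys)
    if abs(rho) >= 1.0:
        return math.inf  # perfectly correlated samples: the formula diverges
    return -0.5 * math.log(1.0 - rho ** 2)


if __name__ == "__main__":
    rng = random.Random(1)
    x = [rng.gauss(0.0, 1.0) for _ in range(5000)]
    # Synthetic target with population correlation 0.8 to x.
    y = [0.8 * xi + 0.6 * rng.gauss(0.0, 1.0) for xi in x]
    original = gaussian_mutual_information(x, y)
    # Invertible affine maps of each variable leave the correlation magnitude unchanged.
    rescaled = gaussian_mutual_information(
        [3.0 * xi + 2.0 for xi in x], [-5.0 * yi + 1.0 for yi in y]
    )
    print("original:", original)
    print("rescaled:", rescaled)
```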

References

  • A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy (2017) Deep variational information bottleneck. In International Conference on Learning Representations, Cited by: §2.1, §3.1.
  • D. Bouchacourt, R. Tomioka, and S. Nowozin (2018) Multi-level variational autoencoder: learning disentangled representations from grouped observations. In AAAI Conference on Artificial Intelligence, Cited by: §2.2.
  • G. Chechik, A. Globerson, N. Tishby, and Y. Weiss (2005) Information bottleneck for gaussian variables. In Journal of Machine Learning Research, Cited by: §2.1.
  • T. M. Cover and J. A. Thomas (2006) Elements of information theory (wiley series in telecommunications and signal processing). Wiley-Interscience. Cited by: §4.
  • A. Creswell, Y. Mohamied, B. Sengupta, and A. A. Bharath (2017) Adversarial information factorization. In arXiv:1711.05175, Cited by: §2.2, §3.2.
  • R. Gomez-Bombarelli, J. N. Wei, D. Duvenaud, J. M. Hernandez-Lobato, B. Sanchez-Lengeling, D. Sheberla, J. Aguilera-Iparraguirre, T. D. Hirzel, R. P. Adams, and A. Aspuru-Guzik (2018) Automatic chemical design using a data-driven continuous representation of molecules. In ACS Central Science, Cited by: §5.1, §5.3.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, Cited by: §2.2.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation. Cited by: §2.3.
  • A. Jaiswal, D. Moyer, G. V. Steeg, W. AbdAlmageed, and P. Natarajan (2020) Invariant representations through adversarial forgetting. In AAAI Conference on Artificial Intelligence, Cited by: §2.3.
  • S. M. Keller, M. Samarin, F. A. Torres, M. Wieser, and V. Roth (2020) Learning extremal representations with deep archetypal analysis. In arXiv:2002.00815, Cited by: §2.1.
  • S. M. Keller, M. Samarin, M. Wieser, and V. Roth (2019) Deep archetypal analysis. In German Conference on Pattern Recognition, Cited by: §2.1.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. In arXiv:1312.6114, Cited by: §3.1.
  • J. Klys, J. Snell, and R. Zemel (2018) Learning latent subspaces in variational autoencoders. In Advances in Neural Information Processing Systems, Cited by: §2.2, §3.2.
  • A. Kraskov, H. Stögbauer, and P. Grassberger (2004) Estimating mutual information. In Physical Review E, Statistical, Nonlinear, and Soft Matter Physics, Cited by: §5.1.
  • G. Lample, N. Zeghidour, N. Usunier, A. Bordes, L. Denoyer, and M. Ranzato (2017) Fader networks: manipulating images by sliding attributes. In Advances in Neural Information Processing Systems, Cited by: §2.2, §3.2.
  • Y. Liu (2012) Directed information for complex network analysis from multivariate time series. Ph.D. Thesis. Cited by: §4.
  • C. Louizos, K. Swersky, Y. Li, M. Welling, and R. Zemel (2015) The variational fair autoencoder. In International Conference on Learning Representations, Cited by: §2.3.
  • D. Moyer, S. Gao, R. Brekelmans, A. Galstyan, and G. Ver Steeg (2018) Invariant representations without adversarial training. In Advances in Neural Information Processing Systems, Cited by: §2.3, §5.1.
  • S. Parbhoo, M. Wieser, and V. Roth (2018) Causal Deep Information Bottleneck. In arXiv:1807.02326, Cited by: §2.1.
  • R. Ramakrishnan, P. O. Dral, M. Rupp, and O. A. von Lilienfeld (2014) Quantum chemistry structures and properties of 134 kilo molecules. In Scientific Data, Cited by: §5.2.
  • M. Rey and V. Roth (2012) Meta-gaussian information bottleneck. In Advances in Neural Information Processing Systems, Cited by: §2.1.
  • D. J. Rezende, S. Mohamed, and D. Wierstra (2014) Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, Cited by: §3.1.
  • L. Ruddigkeit, R. van Deursen, L. C. Blum, and J. Reymond (2012) Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. Journal of Chemical Information and Modeling. Cited by: §5.2.
  • W. Shockley and H. J. Queisser (1961) Detailed Balance Limit of Efficiency of p-n Junction Solar Cells. In Journal of Applied Physics, Cited by: §1.
  • K. Sohn, H. Lee, and X. Yan (2015) Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, Cited by: §5.1.
  • T. Sterling and J. J. Irwin (2015) ZINC 15 – ligand discovery for everyone. In Journal of Chemical Information and Modeling, Cited by: §5.3.
  • N. Tishby, F. C. Pereira, and W. Bialek (1999) The information bottleneck method. In Allerton Conference on Communication, Control and Computing, Cited by: §2.1.
  • A. Wieczorek and V. Roth (2020) On the difference between the information bottleneck and the deep information bottleneck. In Entropy, Cited by: §3.1.
  • A. Wieczorek, M. Wieser, D. Murezzan, and V. Roth (2018) Learning Sparse Latent Representations with the Deep Copula Information Bottleneck. In International Conference on Learning Representations, Cited by: §2.1.
  • Q. Xie, Z. Dai, Y. Du, E. Hovy, and G. Neubig (2017) Controllable invariance through adversarial feature learning. In Advances in Neural Information Processing Systems, Cited by: §2.3.