
The Surprising Effectiveness of Equivariant Models in Domains with Latent Symmetry

11/16/2022
by   Dian Wang, et al.

Extensive work has demonstrated that equivariant neural networks can significantly improve sample efficiency and generalization by enforcing an inductive bias in the network architecture. These applications typically assume that the domain symmetry is fully described by explicit transformations of the model inputs and outputs. However, many real-life applications contain only latent or partial symmetries which cannot be easily described by simple transformations of the input. In these cases, it is necessary to learn symmetry in the environment instead of imposing it mathematically on the network architecture. We discover, surprisingly, that imposing equivariance constraints that do not exactly match the domain symmetry is very helpful in learning the true symmetry in the environment. We differentiate between extrinsic and incorrect symmetry constraints and show that while imposing incorrect symmetry can impede the model's performance, imposing extrinsic symmetry can actually improve performance. We demonstrate that an equivariant model can significantly outperform non-equivariant methods on domains with latent symmetries both in supervised learning and in reinforcement learning for robotic manipulation and control problems.



1 Introduction

Figure 1: Object vs. image transforms. The object transform rotates the objects themselves (b), while the image transform rotates the whole image (c). We propose to use the image transform to help model the object transform.

Recently, equivariant learning has shown great success in various machine learning domains like trajectory prediction (walters2020trajectory), robotics (neural_descriptor), and reinforcement learning (iclr). Equivariant networks (g_conv; steerable_cnns) can improve generalization and sample efficiency during learning by encoding task symmetries directly into the model structure. However, this requires problem symmetries to be perfectly known and modeled at design time, which is sometimes problematic. It is often the case that the designer knows that a latent symmetry is present in the problem but cannot easily express how that symmetry acts in the input space. For example, Figure 1b is a rotation of Figure 1a. However, it is not a rotation of the image; it is a rotation of the objects present in the image, viewed from an oblique angle. In order to model this rotational symmetry, the designer must know the viewing angle and somehow transform the data or encode projective geometry into the model. This is difficult, and it makes the entire approach less attractive. In this situation, the conventional wisdom would be to discard the model structure altogether, since it is not fully known, and to use an unconstrained model. Instead, we explore whether it is possible to benefit from equivariant models even when the way a symmetry acts on the problem input is not precisely known. We show empirically that this is indeed the case and that an inaccurate equivariant model is often better than a completely unstructured model. For example, suppose we want to model a function with the object-wise rotation symmetry expressed in Figure 1a and b. Whereas it is difficult to encode the object-wise symmetry, it is easy to encode an image-wise symmetry, because the latter involves simple image rotations. Although the image-wise symmetry model is imprecise in this situation, our experiments indicate that this imprecise model is still a much better choice than a completely unstructured model.

This paper makes three contributions. First, we define three different relationships between problem symmetry and model symmetry: correct equivariance, incorrect equivariance, and extrinsic equivariance. Correct equivariance means the model symmetry correctly captures the problem symmetry; incorrect equivariance means the model symmetry interferes with the problem symmetry; and extrinsic equivariance means the model symmetry maps in-distribution input data to out-of-distribution data. We theoretically derive an upper bound on the performance of an incorrectly constrained equivariant model. Second, we empirically compare extrinsic and incorrect equivariance in a supervised learning task and show that a model with extrinsic equivariance can improve performance compared with an unconstrained model. Finally, we explore this idea in a reinforcement learning context and show that an extrinsically constrained model can outperform state-of-the-art conventional CNN baselines.

2 Related Work

Equivariant Neural Networks.

Equivariant networks were first introduced as G-Convolutions (g_conv) and Steerable CNNs (steerable_cnns; e2cnn; escnn). Equivariant learning has been applied to various types of data including images (e2cnn), spherical data (spherical_cnns), point clouds (dym2020universality), sets (maron2020learning), and meshes (de2020gauge), and has shown great success in tasks including molecular dynamics (anderson2019cormorant), particle physics (bogatskiy2020lorentz), fluid dynamics (wang2020incorporating), trajectory prediction (walters2020trajectory), robotics (neural_descriptor; rss22xupeng; rss22haojie), and reinforcement learning (corl; iclr). Compared with prior work, which assumes the domain symmetry is perfectly known, this work studies the effectiveness of equivariant networks in domains with latent symmetries.

Symmetric Representation Learning.

Since latent symmetry is not expressible as a simple transformation of the input, equivariant networks cannot be used in the standard way. Thus, several works have turned to learning equivariant features which can be easily transformed. sen learn an encoder which maps inputs to equivariant features that can be used by downstream equivariant layers. quessard2020learning, klee2022i2i, and marchetti2022equivariant map 2D image inputs to elements of various groups, allowing for disentanglement and equivariance constraints. falorsi2018explorations use a homeomorphic VAE to perform the same task in an unsupervised manner. dangovski2021equivariant consider equivariant representations learned in a self-supervised manner, using losses that encourage sensitivity or insensitivity to various symmetries. Our method may be considered an example of symmetric representation learning which, unlike any of the above methods, uses an equivariant neural network as the encoder. zhou2020meta and dehmamy2021automatic assume no prior knowledge of the structure of symmetry in the domain and learn the symmetry transformations on inputs and latent features end-to-end with the task function. In comparison, our work assumes that the latent symmetry is known, but that how it acts on the input is unknown.

Sample Efficient Reinforcement Learning.

One traditional approach to improving sample efficiency is to create additional samples using data augmentation (alexnet). Recent works have discovered that simple image augmentations like random crop (rad; drqv2) or random shift (drq) can improve the performance of reinforcement learning. Such image augmentation can be combined with contrastive learning (oord2018representation) to achieve better performance (curl; ferm). Recently, many works have shown that equivariant methods can achieve very high sample efficiency in reinforcement learning (van2020mdp; mondal2020group; corl; iclr) and enable on-robot reinforcement learning (rss22xupeng; corl22). However, these equivariant reinforcement learning works are limited to fully equivariant domains. This paper extends the prior works by applying equivariant reinforcement learning to tasks with latent symmetries.

3 Background

Equivariant Neural Networks.

A function is equivariant if it respects the symmetries of its input and output spaces. Specifically, a function $f\colon X \to Y$ is equivariant with respect to a symmetry group $G$ if it commutes with all transformations $g \in G$: $f(\rho_x(g)x) = \rho_y(g)f(x)$, where $\rho_x$ and $\rho_y$ are the representations of the group that define how the group element $g$ acts on $x \in X$ and $y \in Y$, respectively. An equivariant function is a mathematical way of expressing that $f$ is symmetric with respect to $G$: if we evaluate $f$ on differently transformed versions of the same input, we obtain correspondingly transformed versions of the same output.

In order to use an equivariant model, we generally require the symmetry group $G$ and the representations $\rho_x, \rho_y$ to be known at design time. For example, in a convolutional model, this can be accomplished by tying the kernel weights together so as to satisfy $K(gx) = \rho_{\mathrm{out}}(g)\,K(x)\,\rho_{\mathrm{in}}(g)^{-1}$, where $\rho_{\mathrm{in}}$ and $\rho_{\mathrm{out}}$ denote the representations of the group acting at the input and the output of the layer (equi_theory). End-to-end equivariant models can be constructed by combining equivariant convolutional layers and equivariant activation functions. In order to leverage symmetry in this way, it is common to transform the input so that standard group representations act correctly, e.g., to transform an image to a top-down view so that image rotations correspond to object rotations.
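To make this concrete, the following is a minimal sketch of an equivariant model together with a numerical check of the constraint $f(\rho_x(g)x) = \rho_y(g)f(x)$, written against the e2cnn library that is also used in our experiments. The toy two-layer $C_4$-steerable architecture and the random input are our own illustration, not a model from this paper.

```python
import torch
from e2cnn import gspaces
from e2cnn import nn as enn

# C4: the group of four planar rotations, acting on 2D feature maps.
r2_act = gspaces.Rot2dOnR2(N=4)
feat_in = enn.FieldType(r2_act, [r2_act.trivial_repr])       # 1-channel input
feat_hid = enn.FieldType(r2_act, 8 * [r2_act.regular_repr])  # regular features
feat_out = enn.FieldType(r2_act, [r2_act.trivial_repr])      # invariant output

model = enn.SequentialModule(
    enn.R2Conv(feat_in, feat_hid, kernel_size=5, padding=2),
    enn.ReLU(feat_hid),
    enn.R2Conv(feat_hid, feat_out, kernel_size=5, padding=2),
).eval()

x = enn.GeometricTensor(torch.randn(1, 1, 29, 29), feat_in)
y = model(x)

# Verify f(rho_x(g) x) == rho_y(g) f(x) for every rotation g in C4.
for g in r2_act.testing_elements:
    assert torch.allclose(model(x.transform(g)).tensor,
                          y.transform(g).tensor, atol=1e-4)
```

Here the weight tying inside `R2Conv` enforces the kernel constraint described above, so the check holds by construction rather than by training.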

Equivariant SAC.

Equivariant SAC (iclr) is a variation of SAC (sac) that constrains the actor to an equivariant function and the critic to an invariant function with respect to a group $G$. The policy is a network $\pi\colon S \to A \times A_\sigma$, where $A_\sigma$ is the space of action standard deviations (SAC models a stochastic policy). The group action on the output space of the policy network is defined as $g \cdot (a_{\mathrm{eq}}, a_{\mathrm{inv}}, \sigma) = (\rho_{\mathrm{eq}}(g)\,a_{\mathrm{eq}}, a_{\mathrm{inv}}, \sigma)$, where $a_{\mathrm{eq}}$ is the equivariant component of the action, $a_{\mathrm{inv}}$ is the invariant component of the action, $a = (a_{\mathrm{eq}}, a_{\mathrm{inv}}) \in A$, and $\sigma \in A_\sigma$. The actor network is then defined to be a mapping that satisfies the equivariance constraint $\pi(\rho_s(g)s) = g \cdot \pi(s)$. The critic is a $Q$-network $q\colon S \times A \to \mathbb{R}$ that satisfies the invariance constraint $q(\rho_s(g)s, \rho_a(g)a) = q(s, a)$.
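As an illustration, the following sketch (our own, with a hypothetical planar-manipulation action layout rather than the exact layout used later in the paper) spells out this group action for rotations: an $(x, y)$ action component rotates with $g$, while the remaining components and the standard deviations are invariant.

```python
import torch

def rotate2d(g):
    """2x2 rotation matrix for an angle g (a 0-dim tensor)."""
    c, s = torch.cos(g), torch.sin(g)
    return torch.stack([torch.stack([c, -s]), torch.stack([s, c])])

def act_on_policy_output(g, a_eq, a_inv, sigma):
    """g . (a_eq, a_inv, sigma) = (rho_eq(g) a_eq, a_inv, sigma)."""
    return rotate2d(g) @ a_eq, a_inv, sigma

# Constraints the networks must satisfy for every group element g:
#   actor:  pi(rho_s(g) s)            == g . pi(s)
#   critic: q(rho_s(g) s, rho_a(g) a) == q(s, a)
g = torch.tensor(torch.pi / 2)
a_eq = torch.tensor([1.0, 0.0])  # hypothetical (dx, dy) component
print(act_on_policy_output(g, a_eq, torch.zeros(3), torch.ones(5))[0])  # ~(0, 1)
```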

4 Learning Symmetry Using Other Symmetries

4.1 Model Symmetry Versus True Symmetry

Figure 2: An example classification task for correct, incorrect, and extrinsic equivariance. The grey ring shows the input distribution. Circles are training data drawn from the distribution, where the color shows the ground truth label. Crosses show the group-transformed data.

This paper focuses on tasks where the way in which the symmetry group operates on the input space is unknown. In this case the ground truth function $f\colon X \to Y$ is equivariant with respect to a group $G$ which acts on $X$ and $Y$ by $\rho_x$ and $\rho_y$ respectively. However, the action $\rho_x$ on the input space is not known and may not be a simple or explicit map. Since $\rho_x$ is unknown, we cannot pursue the strategy of learning $f$ using an equivariant model class constrained by $\rho_x$. As an alternative, we propose restricting the model $h$ to a class which satisfies equivariance with respect to a different group action $\rho'_x$, i.e., $h(\rho'_x(g)x) = \rho_y(g)h(x)$. This paper tests the hypothesis that if the model is constrained to a symmetry $\rho'_x$ which is related to the true symmetry $\rho_x$, then the constraint may help in learning a model satisfying the true symmetry. For example, if $x$ is an image viewed from an oblique angle and $\rho_x$ is the rotation of the objects in the image, $\rho'_x$ can be the rotation of the whole image (which differs from $\rho_x$ because of the tilted view angle). Section 4.4 will describe this example in detail.

4.2 Correct, Incorrect, and Extrinsic Equivariance

Our findings show that the success of this strategy depends on how $\rho'_x$ relates to the ground truth function $f$ and its symmetry. We classify the model symmetry as having correct equivariance, incorrect equivariance, or extrinsic equivariance with respect to $f$. Correct symmetry means that the model symmetry correctly reflects a symmetry present in the ground truth function $f$. An extrinsic symmetry may still aid learning, whereas an incorrect symmetry is necessarily detrimental to learning. We illustrate the distinction with a classification example shown in Figure 2a. (See Appendix B for a more in-depth description.) Let $D$ be the support of the input distribution for $f$.

Definition 4.1.

The action $\rho'_x$ has correct equivariance with respect to $f$ if $f(\rho'_x(g)x) = \rho_y(g)f(x)$ for all $g \in G$ and $x \in D$.

That is, the model symmetry preserves the support of the input distribution, and $f$ is equivariant with respect to it. For example, consider the group $G = \mathbb{Z}_2$ acting on $X$ by reflection across the horizontal axis and acting on $Y$ via the trivial action fixing labels. Figure 2b shows the untransformed data as circles along the unit circle. The transformed data (shown as crosses) also lie on the unit circle, and hence the support is reflection invariant. Moreover, the ground truth labels (shown as orange or blue) are preserved by this action.

Definition 4.2.

The action $\rho'_x$ has incorrect equivariance with respect to $f$ if there exist $g \in G$ and $x \in D$ such that $\rho'_x(g)x \in D$ but $f(\rho'_x(g)x) \neq \rho_y(g)f(x)$.

In this case, the model symmetry partially preserves the input distribution but does not correctly preserve labels. In Figure 2c, the rotation group $C_2$ maps the unit circle to itself, but the transformed data does not have the correct label. Thus, constraining the model by $\rho'_x$ will force $h$ to mislabel data. In this example, for $g$ the rotation by $\pi$ and a training point $x$, we have $\rho'_x(g)x \in D$ and $\rho_y(g)f(x) = f(x)$; however, $f(\rho'_x(g)x) \neq f(x)$.

Definition 4.3.

The action $\rho'_x$ has extrinsic equivariance with respect to $f$ if for $x \in D$, $\rho'_x(g)x \notin D$.

Extrinsic equivariance is when the constraint in the equivariant network enforces equivariance on out-of-distribution data. Since $\rho'_x(g)x \notin D$, the ground truth label of the transformed data is undefined. An example of extrinsic equivariance is given by the scaling group shown in Figure 2d. For data $x \in D$ on the unit circle, enforcing scaling invariance, $h(\rho'_x(g)x) = h(x)$, will not increase error, because the group-transformed data (crosses) lie outside the distribution of the input data shown in the grey ring. In fact, we hypothesize that such extrinsic equivariance may even be helpful for the network to learn the ground truth function. For example, in Figure 2d, the network can learn to classify all points on the left as blue and all points on the right as orange.
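The trichotomy can be checked numerically on toy data. Below is a small sketch (our construction, mirroring Figure 2 under the assumption that points on the unit circle are labeled by the sign of their first coordinate); it classifies a candidate action as correct, incorrect, or extrinsic for a single group element, assuming a trivial action on the labels.

```python
import numpy as np

def classify_model_symmetry(X, y, act, in_support, f):
    """Label a candidate group action as correct, incorrect, or extrinsic
    equivariance w.r.t. an invariant ground truth f (Definitions 4.1-4.3).
    X: (n, d) points on the support D; y = f(X); act: action on inputs;
    in_support: membership test for D."""
    gX = act(X)
    inside = in_support(gX)
    if not inside.any():
        return "extrinsic"            # all of D is mapped out of distribution
    if np.all(f(gX[inside]) == y[inside]):
        return "correct"
    return "incorrect"

# Toy version of Figure 2: unit circle, labeled by the sign of the x-coordinate.
theta = np.linspace(0.0, 2.0 * np.pi, 100, endpoint=False)
X = np.stack([np.cos(theta), np.sin(theta)], axis=1)
f = lambda pts: (pts[:, 0] > 0).astype(int)
y = f(X)
on_circle = lambda pts: np.abs(np.linalg.norm(pts, axis=1) - 1.0) < 1e-9

print(classify_model_symmetry(X, y, lambda p: p * np.array([1.0, -1.0]), on_circle, f))  # correct
print(classify_model_symmetry(X, y, lambda p: -p, on_circle, f))                         # incorrect
print(classify_model_symmetry(X, y, lambda p: 0.5 * p, on_circle, f))                    # extrinsic
```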

4.3 Theoretical Upper Bound on Accuracy for Incorrect Equivariant Models

Consider a classification problem over the set $X$ with finitely many classes $Y$. Let $G$ be a finite group acting on $X$. Consider a model $h$ with incorrect equivariance constrained to be invariant to $G$. In this case the points in a single orbit $Gx$ must all be assigned the same label by $h$. However, these points may have different ground truth labels. We quantify how bad this situation is by measuring $c(x)$, the proportion of ground truth labels in the orbit of $x$ which are equal to the majority label. Let $p_c$ be the fraction of points $x$ which have consensus proportion $c(x) = c$.

Proposition 4.1.

The accuracy of $h$ has upper bound $\mathrm{acc}(h) \le \sum_{c} c \cdot p_c$.

See the complete version of the proposition and its proof in Appendix A. In the example in Figure 2c, we have $c(x) = 1/2$ for all $x$, so $p_{1/2} = 1$ and $\mathrm{acc}(h) \le 1/2$. In contrast, an unconstrained model with a universal approximation property and proper hyperparameters can achieve arbitrarily good accuracy.
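The bound is easy to compute once the orbits and their ground truth labels are enumerated. The following is a small sketch (our helper, assuming a uniform data distribution over the listed points):

```python
from collections import Counter

def invariant_model_accuracy_bound(orbits):
    """Upper bound of Proposition 4.1: acc(h) <= sum_c c * p_c, where c is the
    majority-label (consensus) fraction of an orbit and p_c the fraction of
    points with consensus c. `orbits` lists the ground truth labels per orbit."""
    total = sum(len(o) for o in orbits)
    return sum(Counter(o).most_common(1)[0][1] for o in orbits) / total

# Figure 2c: every orbit {x, -x} carries two different labels, so c(x) = 1/2
# everywhere and any invariant model is capped at 50% accuracy.
print(invariant_model_accuracy_bound([["orange", "blue"]] * 4))  # 0.5
```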

4.4 Object Transformation and Image Transformation

In tasks with visual inputs, incorrect or extrinsic equivariance arises when the transformation of the image does not match the transformation of the latent state of the task. In such cases, we call $\rho_x$ the object transform and $\rho'_x$ the image transform. For an image input $x$, the image transform $\rho'_x$ is defined as a simple transformation of pixel locations (e.g., Figure 1a and c), while the object transform $\rho_x$ is an implicit map transforming the objects in the image (e.g., Figure 1a and b). The mismatch between the object transform and the image transform is often caused by symmetry-breaking factors such as camera angle, occlusion, backgrounds, and so on (e.g., Figure 1). We refer to such symmetry-breaking factors as symmetry corruptions.
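To make the distinction concrete, here is a small sketch (our own; `scene.rotate_objects` and `render` are hypothetical stand-ins, not an API from this paper). The image transform is an explicit map on pixel locations, while the object transform is only defined through the underlying scene:

```python
import torchvision.transforms.functional as TF

# Image transform: an explicit pixel-level map on the observation.
def image_transform(img, k):
    """Rotate the whole image frame by k * 45 degrees (k indexes C8)."""
    return TF.rotate(img, 45.0 * k)

# Object transform: only defined through the scene. With an oblique camera
# there is no closed-form pixel map; one would need the scene and a renderer.
def object_transform(scene, k, render):
    scene = scene.rotate_objects(45.0 * k)  # hypothetical scene API
    return render(scene)
```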

5 Evaluating Equivariant Network with Symmetry Corruptions

Although it is preferable to use an equivariant model to enforce correct equivariance, real-world problems often contain symmetry corruptions, such as oblique viewing angles, which make the symmetry latent. In this experiment, we evaluate the effect of different corruptions on an equivariant model and show that enforcing extrinsic equivariance can actually improve performance. We experiment with a simple supervised learning task where the scene contains three ducks of different colors. The data samples are pairs of images, where all ducks in the first image are rotated by some $g \in C_8$ to produce the second image in the pair. Given the pair of images, the goal is to train a network to classify the rotation $g$ (Figure 3a). If we had a perfect top-down image observation, the object transform and image transform would be equal, and we could enforce the correct equivariance by modeling the ground truth function with an invariant network $h$ satisfying $h(g \cdot x_1, g \cdot x_2) = h(x_1, x_2)$, where $g \cdot x$ denotes the image rotation (because rotating both images does not change the relative rotation between the objects in the two images). To mimic symmetry corruptions in real-world applications, we apply seven different transformations to both images, shown in Figure 3b (more corruptions are considered in Appendix E.1). In particular, for invert-label, the ground truth label is inverted from $g$ to $g^{-1}$ when the yellow duck is to the left of the orange duck in the world frame in the first input image. Notice that enforcing $C_8$-invariance in $h$ under invert-label is an incorrect equivariance constraint, because rotating the ducks can change their relative position in the world frame and break the invariance of the task: $f(g \cdot x_1, g \cdot x_2) \neq f(x_1, x_2)$. Under all other corruptions, enforcing $C_8$-invariance is an extrinsic equivariance, because the rotated images are out of the input distribution. We evaluate an equivariant network defined over the group $C_8$, implemented using e2cnn (e2cnn). See Appendix D.1 for the training details.

Figure 3: (a) The rotation estimation task requires the network to estimate the relative rotation between the two input states. (b) Different symmetry corruptions in the rotation estimation experiment.

Figure 4: Comparison of an equivariant network (blue), a conventional network (green), and a CNN equipped with image transformation augmentation using $C_8$ rotations (red). The plots show the prediction accuracy on the test set for models trained with different numbers of training samples. In all of our experiments, we take the average over four random seeds; shading denotes standard error.

Comparing Equivariant Networks with CNNs.

We first compare the performance of an equivariant network (Equi) and a conventional CNN model (CNN) with a similar number of trainable parameters. The network architectures are relatively simple (see Appendix C.1), as our goal is to evaluate the performance difference between an equivariant network and an unconstrained CNN model rather than to achieve the best performance on this task. In both models, we apply a random crop after sampling each data batch to improve sample efficiency; see Appendix E.1 for the effects of random crop augmentation on learning. Figure 4 (blue vs. green) shows the test accuracy of both models after convergence when trained with varying dataset sizes. For all corruptions with extrinsic equivariance constraints, the equivariant network performs better than the CNN model, especially in low-data regimes. However, for invert-label, which gives an incorrect equivariance constraint, the CNN outperforms the equivariant model, demonstrating that enforcing incorrect equivariance negatively impacts accuracy. In fact, based on Proposition 4.1, the equivariant network here has a theoretical upper bound accuracy of $5/8$: the bound is $\mathrm{acc}(h) \le \sum_c c \cdot p_c$; the consensus proportion is $c(x) = 1$ when the label $g \in \{0, \pi\}$ (i.e., negating the label does not change it), giving $p_1 = 1/4$; and $c(x) = 1/2$ when $g \notin \{0, \pi\}$, where half of the labels in the orbit of $x$ are the negation of the labels of the other half (because half of the rotations change the relative position between the yellow and orange duck), giving $p_{1/2} = 3/4$. Thus $\mathrm{acc}(h) \le 1 \times \frac{1}{4} + \frac{1}{2} \times \frac{3}{4} = \frac{5}{8}$. This theoretical upper bound matches the result in Figure 4, which suggests that even in the presence of symmetry corruptions, enforcing extrinsic equivariance can improve sample efficiency, while incorrect equivariance is detrimental.
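This bound can be verified numerically with a short sketch (ours), using integer labels $0, \dots, 7$ for the eight rotations (label 4 corresponding to $\pi$) and assuming uniformly distributed labels:

```python
from collections import Counter

# Orbits under C8-invariance with the invert-label corruption: inverting
# g -> -g (mod 8) fixes labels 0 and 4; every other orbit splits evenly
# between g and -g.
orbits = [[g] * 8 for g in (0, 4)]
orbits += [[g] * 4 + [(-g) % 8] * 4 for g in (1, 2, 3, 5, 6, 7)]
total = sum(len(o) for o in orbits)
print(sum(Counter(o).most_common(1)[0][1] for o in orbits) / total)  # 0.625 = 5/8
```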

Extrinsic Image Augmentation Helps in Learning Correct Symmetry.

In these experiments, we further illustrate that enforcing extrinsic equivariance helps the model learn the latent equivariance of the task for in-distribution data. As an alternative to equivariant networks, we consider an older approach to symmetry learning, data augmentation, to see whether extrinsic symmetry augmentations can improve the performance of an unconstrained CNN by helping it learn latent symmetry. Specifically, we augment each training sample with $C_8$ image rotations while keeping the validation and test sets unchanged. As shown in Figure 4, adding such extrinsic data augmentation (CNN + Img Trans, red) significantly improves the performance of the CNN (green) and nearly matches the performance of the equivariant network (blue). Notice that under invert-label, adding this augmentation hurts the performance of the CNN because of incorrect equivariance.
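A sketch of this augmentation (our own, using torchvision's functional rotate; the function name and signature are illustrative):

```python
import numpy as np
import torchvision.transforms.functional as TF

def c8_image_augmentation(img1, img2, label, rng):
    """Rotate BOTH images of a training pair by the same random C8 angle.
    For the extrinsic corruptions the relative-rotation label is unchanged,
    even though the rotated oblique-view images are out of distribution;
    under invert-label this augmentation is incorrect, as noted above."""
    angle = 45.0 * int(rng.integers(8))
    return TF.rotate(img1, angle), TF.rotate(img2, angle), label

# Example: rng = np.random.default_rng(0); c8_image_augmentation(x1, x2, g, rng)
```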

6 Extrinsic Equivariance in Reinforcement Learning

The results in Section 5 suggest that enforcing extrinsic equivariance can help the model better learn the latent symmetry in the task. In this section, we apply this methodology in reinforcement learning and demonstrate that extrinsic equivariance can significantly improve sample efficiency.

6.1 Reinforcement Learning in Robotic Manipulation

Figure 5: The image state in the Block Picking task. Left image shows the RGB channels and right image shows the depth channel.

We first experiment in five robotic manipulation environments shown in Figure 6. The state space is a 4-channel RGBD image captured from a fixed camera pointed at the workspace (Figure 5). The action space consists of the change in gripper pose $(x, y, z, \theta)$, where $\theta$ is the rotation about the $z$-axis, together with the gripper open width $\lambda$. The task has latent $\mathrm{O}(2)$ symmetry: when a rotation or reflection is applied to the poses of the gripper and the objects, the action should rotate and reflect accordingly. However, such symmetry does not exist in image space, because the image perspective is skewed rather than top-down (we also perform experiments with another symmetry corruption caused by sensor occlusion in Appendix E.3). We enforce this symmetry extrinsically, using a discrete group of image rotations and reflections, via Equivariant SAC (iclr; corl22) equipped with random crop augmentation using RAD (rad) (Equi SAC + RAD), and compare it with the following baselines: 1) CNN SAC + RAD: same as our method but with an unconstrained CNN instead of an equivariant model; 2) CNN SAC + DrQ: same as 1), but with DrQ (drq) for the random crop augmentation; 3) FERM (ferm): a combination of 1) and contrastive learning; and 4) SEN + RAD: a Symmetric Embedding Network (sen) that uses a conventional network for the encoder and an equivariant network for the output head. All baselines are implemented such that they have a similar number of parameters to Equivariant SAC. See Appendix C.2 for the network architectures and Appendix F for the architecture hyperparameter search for the baselines. All methods use Prioritized Experience Replay (PER) (per) with pre-loaded expert demonstrations (20 episodes for Block Pulling and Block Pushing, 50 for Block Picking and Drawer Opening, and 100 for Block in Bowl). We also add an L2 loss toward the expert action in the actor to encourage expert-like actions. More details about training are provided in Appendix D.2.

Figure 7 shows that Equivariant SAC (blue) outperforms all baselines. Note that the performance of Equivariant SAC in Figure 7 does not match that reported in iclr because our task setting is harder: we do not have a top-down observation centered at the gripper position as in the prior work. Such top-down observations would not only provide correct equivariance but also help learn a translation-invariant policy. Even in this harder setting without top-down observations, Figure 7 shows that Equivariant SAC still achieves higher performance than the baselines.

(a) Block Pulling
(b) Block Pushing
(c) Block Picking
(d) Drawer Opening
(e) Block in Bowl
Figure 6: The manipulation environments from the BulletArm benchmark (bulletarm) implemented in PyBullet (pybullet). The top-left of each panel shows the goal for the task.
Figure 7: Comparison of Equivariant SAC (blue) with baselines. The plots show the performance of the evaluation policy. The evaluation is performed every 200 training steps.

6.2 Increasing Corruption Levels

Figure 8: Left: view angle at 90 degrees. Right: view angle at 15 degrees.

In this experiment, we vary the camera angle by tilting it, to see how increasing the gap between the image transform and the object transform affects the performance of extrinsically equivariant networks. When the view angle is 90 degrees (i.e., the image is top-down), the object and image transforms exactly match. As the view angle decreases, the gap increases. Figure 8 shows the observation at 90 and 15 degree view angles. We remove the robot arm (except for the gripper) and the blue/white grid on the ground to eliminate the other symmetry-breaking components in the environment, so that the camera angle is the only symmetry corruption. We compare Equi SAC + RAD against CNN SAC + RAD, evaluating the performance of each method at the end of training for different view angles in Figure 9. As expected, the performance of Equivariant SAC decreases as the camera angle decreases, especially from 30 degrees to 15 degrees. The CNN, on the other hand, generally has similar performance across all view angles, with the exception of Block Pulling and Block Pushing, where decreasing the view angle leads to higher performance. This may be because a lower view angle helps the network better perceive the height of the gripper, which is useful for pulling and pushing actions.

Figure 9: Comparison between Equivariant SAC (blue) and CNN SAC (green) as the view angle decreases. The plots show the evaluation performance of Equivariant SAC and CNN SAC at the end of training at different view angles.

6.3 Example of Incorrect Equivariance

Figure 10: The environment applies a random reflection to the state image at every step. The four images show the four possible reflections, each occurring with 25% probability.

We demonstrate an example where incorrect equivariance harms the performance of Equivariant SAC compared to an unconstrained model. We modify the environments so that the image state is reflected across the vertical axis with probability 0.5 and then also reflected across the horizontal axis with probability 0.5, yielding the four equally likely reflections in Figure 10. As these random reflections are elements of the symmetry group enforced by Equivariant SAC, the transformed state is affected by its symmetry constraint. In particular, the actor produces a reflected action for a reflected state, even though the optimal action should be invariant (the reflection is applied only to the observation, not to the underlying scene), so the extrinsic equivariance constraint now becomes an incorrect equivariance for these reflected states. As shown in Figure 11, Equivariant SAC can barely learn under random reflections, while the CNN can still learn a useful policy.
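This corruption can be reproduced with a simple observation wrapper. Below is a hedged sketch (our own, assuming a gym-style environment whose observations are images of shape (C, H, W)):

```python
import numpy as np

class RandomReflectWrapper:
    """Reflect each observation left-right w.p. 0.5 and up-down w.p. 0.5,
    leaving the underlying environment, actions, and rewards untouched."""
    def __init__(self, env, p=0.5, seed=0):
        self.env = env
        self.p = p
        self.rng = np.random.default_rng(seed)

    def _reflect(self, obs):
        if self.rng.random() < self.p:
            obs = obs[:, :, ::-1]  # reflect across the vertical axis
        if self.rng.random() < self.p:
            obs = obs[:, ::-1, :]  # reflect across the horizontal axis
        return obs.copy()

    def reset(self):
        return self._reflect(self.env.reset())

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return self._reflect(obs), reward, done, info
```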

Figure 11: Comparison between Equivariant SAC (blue) and CNN SAC (green) in an environment that will make Equivariant SAC encode incorrect equivariance. The plots show the performance of the evaluation policy. The evaluation is performed every 200 training steps.

6.4 Reinforcement Learning in DeepMind Control Suite

Figure 12: Comparison between Equivariant DrQv2 and Non-equivariant DrQv2 on easy tasks (top) and medium tasks (bottom). The evaluation is performed every 10000 environment steps.

We further apply extrinsically equivariant networks to continuous control tasks in the DeepMind Control Suite (DMC) (tunyasuvunakool2020). We use a subset of the DMC domains that have clear object-level symmetry: a reflection group for the cartpole, cup catch, pendulum, and acrobot domains, and a rotation group for the reacher domains. This leads to a total of seven tasks across the easy and medium difficulty levels defined in (drqv2). Note that none of these domains are fully equivariant, as they include a checkered grid for the floor and random stars in the background.

We use DrQv2 (drqv2), a SOTA model-free RL algorithm for image-based control, as our base RL algorithm. We create an equivariant version of DrQv2 with an equivariant actor and an invariant critic with respect to the environment's symmetry group. We closely follow the architecture and training hyperparameters used in the original paper, except in the image encoder, where two max-pooling layers are added to further reduce the representation dimension for faster training. Furthermore, DrQv2 uses convolution layers in the image encoder and then flattens their output to feed it into linear layers in the actor and the critic. In order to preserve this design choice in the equivariant model, we do not reduce the spatial dimensions to $1 \times 1$ by downsampling/pooling or strides as is commonly done in practice. Rather, we flatten the image feature using a process we term action restriction, since the symmetry group action is restricted from acting on both the spatial domain and the channels to acting on the channel dimension alone. Let $h \in \mathbb{R}^{c \times s \times s}$ denote the image feature, on which the group acts both spatially and on the channels. We first add a new axis by reshaping $h$ to $\mathbb{R}^{c \times s^2}$, and then flatten to $\mathbb{R}^{c s^2}$. The intermediate step is necessary to encode both the spatial and channel actions into a single axis, which ensures the action restriction is equivariant. We then map back down to the original feature dimension with an equivariant $1 \times 1$ convolution. To the best of our knowledge, this is the first equivariant version of DrQv2.
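Under our reading of this construction, the flatten itself can be sketched as follows (shapes and names are ours; the equivariant $1 \times 1$ convolution that follows is omitted):

```python
import torch

def action_restriction_flatten(h):
    """Flatten a feature map h of shape (B, C, S, S), on which the group acts
    both spatially and on channels, so that the group action is restricted to
    a single axis: spatial positions are first merged into a new axis, then
    folded into the channels, where the group acts by jointly permuting
    positions and transforming channels."""
    b, c, s, _ = h.shape
    h = h.reshape(b, c, s * s)            # intermediate axis: the spatial grid
    return h.reshape(b, c * s * s, 1, 1)  # a 1x1 "image" for 1x1 convolutions

print(action_restriction_flatten(torch.randn(2, 64, 7, 7)).shape)  # (2, 3136, 1, 1)
```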

We compare the equivariant and the non-equivariant (original) DrQv2 algorithms to evaluate whether extrinsic equivariance can still improve training in the original domains (with symmetry corruptions). In Figure 12, equivariant DrQv2 consistently learns faster than the non-equivariant version on all tasks, and the performance improvement is largest on the more difficult medium tasks. In pendulum swingup, both methods have one failed run each, leading to a large standard error; see Figure 27 in Appendix E.4 for a plot of all runs. These results highlight that even with some symmetry corruptions, equivariant policies can outperform non-equivariant ones. See Appendix E.4.1 for an additional experiment where we vary the level of symmetry corruption as in Section 6.2.

7 Discussion

This paper defines correct equivariance, incorrect equivariance, and extrinsic equivariance, and shows that enforcing extrinsic equivariance does not necessarily increase error. It further demonstrates experimentally that extrinsic equivariance can provide significant performance improvements in reinforcement learning. A limitation of this work is that we mainly experiment in reinforcement learning and a simple supervised setting, but not in other domains where equivariant learning is widely used. Our experimental results suggest that extrinsic equivariance should also be beneficial in those domains, but we leave this demonstration to future work. Another limitation is that we focus on planar equivariant networks. In future work, we are interested in evaluating extrinsic equivariance in network architectures that process other types of data.


Appendix A Theoretical Upper Bound on Accuracy for Models with Incorrect Symmetry

We consider a classification problem over the set $X$ with finitely many classes $Y$. Let $m = |Y|$ be the number of classes and let $f\colon X \to Y$ give the true labels. Let $G$ be a finite group acting on $X$. We assume the action of $G$ on $X$ is density preserving; that is, if $p$ is the density function corresponding to the input domain, then $p(gx) = p(x)$. Denote the orbit of a point $x$ by $Gx = \{gx : g \in G\}$ and the stabilizer by $G_x = \{g \in G : gx = x\}$. By the orbit-stabilizer theorem, $|Gx| = |G| / |G_x|$.

Now consider a model $h$ with incorrect equivariance constrained to be invariant to $G$. We partition the input set $X$ into subsets $X_i = \{x \in X : |f(Gx)| = i\}$, the points whose orbits contain exactly $i$ distinct ground truth labels. If $h$ has correct equivariance then $X = X_1$. Incorrect equivariance implies that there are orbits which are assigned more than one label. Since $h$ is constrained to be invariant, such orbits will necessarily result in some errors. We give an upper bound on that error. Define $p_i = P(x \in X_i)$. Note that since the $X_i$ give a partition, $\sum_i p_i = 1$. Also, $X_i$ is empty for $i > \min(|G|, m)$, since the number of labels assigned to an orbit is upper bounded by the number of points in the orbit, which is at most $|G|$. Letting $q = \min(|G|, m)$, we have $\sum_{i=1}^{q} p_i = 1$.

Proposition A.1.

The accuracy of $h$ has upper bound
$$\mathrm{acc}(h) \le \sum_{i=1}^{q} p_i\Big(1 - \frac{i-1}{|G|}\Big).$$

In contrast, we can choose an unconstrained model from a model class with a universal approximation property and, given properly chosen hyperparameters, find a model with arbitrarily good accuracy.

Proof.

Let $\mathrm{acc}(h) = \mathbb{E}_x[\mathbb{1}(h(x) = f(x))]$. Since the action of $G$ is density preserving, applying an element of $G$ before sampling does not affect the expectation, and so

$$\mathrm{acc}(h) = \frac{1}{|G|}\sum_{g \in G}\mathbb{E}_x\big[\mathbb{1}(h(gx) = f(gx))\big].$$

If we split the expectation over the partition we get

$$\mathrm{acc}(h) = \frac{1}{|G|}\sum_{g \in G}\sum_{i=1}^{q} p_i\, \mathbb{E}_{x \in X_i}\big[\mathbb{1}(h(gx) = f(gx))\big].$$

Interchanging sums gives

$$\mathrm{acc}(h) = \sum_{i=1}^{q} p_i\, \mathbb{E}_{x \in X_i}\Big[\frac{1}{|G|}\sum_{g \in G}\mathbb{1}(h(gx) = f(gx))\Big].$$

By the orbit-stabilizer theorem,

$$\frac{1}{|G|}\sum_{g \in G}\mathbb{1}(h(gx) = f(gx)) = \frac{1}{|Gx|}\sum_{z \in Gx}\mathbb{1}(h(z) = f(z)),$$

which is the average accuracy over the orbit $Gx$. Since $h$ is constrained to a single value on the orbit, and $i$ different true labels appear, the highest accuracy is attained when $|Gx| = |G|$ and the true labels are maximally unequally distributed, such that one point in the orbit takes each of $i-1$ minority labels and all the other points receive a single label. In this case accuracy is maximized by choosing $h$ to output this majority label, giving an orbit average of at most $\frac{|G| - (i-1)}{|G|} = 1 - \frac{i-1}{|G|}$. Substituting back in,

$$\mathrm{acc}(h) \le \sum_{i=1}^{q} p_i\Big(1 - \frac{i-1}{|G|}\Big),$$

since this bound is constant over each $X_i$ and $P(x \in X_i) = p_i$. ∎

Note that the assumptions that $|Gx| = |G|$ and that the labels on a given orbit are maximally unequally distributed need not hold in general, and thus this bound is not tight. In order to produce a tight upper bound, consider the partition $X_c = \{x \in X : c(x) = c\}$, where $c(x)$ is the proportion of points in the orbit $Gx$ whose ground truth label equals the majority label, and define $p_c = P(x \in X_c)$. The set $X_c$ contains points in orbits where the majority label covers a fraction $c$ of the points. Note that although $c$ is a fraction between 0 and 1, there are only finitely many possible values of $c(x)$, since the numerator and denominator are bounded natural numbers. We may thus sum over the values of $c$.

Proposition A.2.

The accuracy of $h$ has upper bound
$$\mathrm{acc}(h) \le \sum_{c} c \cdot p_c.$$

Proof.

The proof is similar to that of Proposition A.1, replacing $X_i$ and $p_i$ with $X_c$ and $p_c$ respectively. For $x \in X_c$, the orbit-average term $\frac{1}{|Gx|}\sum_{z \in Gx}\mathbb{1}(h(z) = f(z))$ can be upper bounded by choosing the majority label, yielding $c$. The bound then follows as before. ∎

This is a tight upper bound since assigning any but the majority label would result in lower accuracy.

Figure 13 demonstrates the upper bound of an incorrectly constrained equivariant network under the invert-label corruption of Section 5, where the bound evaluates to $5/8$.

Figure 13: Demonstration of the upper bound of an equivariant model under invert label corruption in our supervised learning experiment. The number on each partition shows the ground truth label.

Appendix B Correct, Incorrect, and Extrinsic Equivariance Examples

Figure 14: An example classification task for correct, incorrect, and extrinsic equivariance. The input distribution is shown as a gray ring. The training data samples are shown as circles, where the color is the ground truth label. Crosses represent the group transformed data. The opaque points highlight the example points while other points are semitransparent.

In this section, we describe how the model symmetry transforms data under correct, incorrect, and extrinsic equivariance, and how such transformations relate to the true symmetry present in the task, using the example of Section 4.2. The ground truth function $f$ is a mapping from $X = \mathbb{R}^2$ to the binary label set $Y$. Let $x_1, x_2, x_3, x_4 \in X$ be four points of the data distribution on the unit circle (Figure 14a), with ground truth labels given by their colors in the figure.

B.1 Correct Equivariance

Definition 4.1.

The action $\rho'_x$ has correct equivariance with respect to $f$ if $f(\rho'_x(g)x) = \rho_y(g)f(x)$ for all $g \in G$ and $x \in D$.

Consider the reflection group $G = \{1, r\}$, where $r$ is the reflection across the horizontal axis, acting on $X$ via $\rho'_x$ and on $Y$ via $\rho_y$, the trivial action fixing the labels (Figure 14b). If we define an equivariant model $h$ with respect to $\rho'_x$ and $\rho_y$, then the model's symmetry preserves the problem symmetry. For example, consider the point $x_1$ and let $g = r$, so that $x_1' = \rho'_x(g)x_1$ also lies on the unit circle and $\rho_y(g)f(x_1) = f(x_1)$. Since the model is equivariant, $h(\rho'_x(g)x_1) = \rho_y(g)h(x_1)$; substituting, we obtain $h(x_1') = h(x_1)$, meaning that the outputs of $h$ on $x_1$ and $x_1'$ are constrained to be equal. Thus the invariance of the ground truth function, $f(x_1') = f(x_1)$, is preserved (and this applies to all $x \in D$). We call this correct equivariance.

B.2 Incorrect Equivariance

Definition 4.2.

The action $\rho'_x$ has incorrect equivariance with respect to $f$ if there exist $g \in G$ and $x \in D$ such that $\rho'_x(g)x \in D$ but $f(\rho'_x(g)x) \neq \rho_y(g)f(x)$.

Consider the rotation group $C_2 = \{1, g\}$ (Figure 14c), where $g$ acts on $X$ via the rotation matrix of angle $\pi$ and acts on $Y$ via the trivial action. If we define an equivariant model $h$ with respect to these actions, the network's symmetry conflicts with the problem's symmetry. For example, consider the point $x_1$ and let $x_1' = \rho'_x(g)x_1$ be its rotation by $\pi$, which again lies on the unit circle. As the model is equivariant, $h(x_1') = \rho_y(g)h(x_1) = h(x_1)$. However, this constraint interferes with the ground truth function, for which $f(x_1') \neq f(x_1)$. We call this incorrect equivariance.

B.3 Extrinsic Equivariance

Definition 4.3.

The action $\rho'_x$ has extrinsic equivariance with respect to $f$ if for $x \in D$, $\rho'_x(g)x \notin D$.

Consider the scaling group $G = \{1, g\}$ acting on $X$ by scaling the input vector, $\rho'_x(g)x = \lambda x$ for some fixed $\lambda \neq 1$, and acting on $Y$ via the trivial action (Figure 14d). If we define an equivariant model $h$ with respect to these actions, the group-transformed data fall outside the input distribution. Consider the point $x_1$ and let $x_1' = \rho'_x(g)x_1 = \lambda x_1$ be its scaled version. Since the model is equivariant, $h(x_1') = h(x_1)$, meaning that the outputs of $h$ on $x_1$ and $x_1'$ are constrained to be equal. However, $x_1'$ is outside of the input distribution (the grey ring), and thus its ground truth label is undefined. We call this extrinsic equivariance.

Intuitively, it is easy to see in this example how extrinsic equivariance can help the model learn $f$. If the model is equivariant to the scaling group $G$, then it generalizes to scaled-up or scaled-down versions of the input distribution and thus “covers” more of the input space $X$. As such, the model may learn the decision boundary (the vertical axis) more easily than a non-equivariant model, even though the equivariance is extrinsic.

Appendix C Network Architecture

Figure 15: Network architecture of the equivariant network in the supervised learning experiment.
Figure 16: Network architecture of the CNN network in the supervised learning experiment.
Network Equi CNN
Number of Parameters 1.11 million 1.28 million
Table 1: Number of trainable parameters of the equivariant network (Equi) and conventional CNN network (CNN) in the supervised learning task.

C.1 Supervised Learning

Figure 15 shows the network architecture of the equivariant network, and Figure 16 shows the network architecture of the CNN used in Section 5. Both networks are 8-layer convolutional neural networks. The equivariant network is implemented using the e2cnn (e2cnn) library, where the hidden layers are defined using the regular representation and the output layer is defined using the trivial representation. Table 1 shows the number of trainable parameters in both networks; the counts are similar, with a slight advantage for the CNN.

Figure 17: Network architecture of Equivariant SAC in robotic manipulation tasks.

C.2 Reinforcement Learning in Robotic Manipulation

Figure 17 shows the network architecture of Equivariant SAC used in the manipulation tasks in Section 6.1. All hidden layers are implemented using the regular representation. For the actor (top), the output is a mixed representation containing one standard representation for the $(x, y)$ action components, one signed representation for the $\theta$ action, and seven trivial representations for the remaining action components and the standard deviations of all action components. Figure 18 shows the network architecture of CNN SAC for both RAD and DrQ. Figure 19 shows the network architecture of FERM. Figure 20 shows the network architecture of SEN.

Table 2 shows the number of trainable parameters for each model. All baselines have slightly more parameters compared with Equivariant SAC.

Figure 18: Network architecture of CNN SAC in robotic manipulation tasks.
Figure 19: Network architecture of FERM in robotic manipulation tasks.
Figure 20: Network architecture of SEN in robotic manipulation tasks.
Network Equi SAC CNN SAC FERM SEN
Number of Actor Parameters 1.11 million 1.13 million 1.79 million 1.22 million
Number of Critic Parameters 1.18 million 1.27 million 1.90 million 1.24 million
Number of Total Parameters 2.29 million 2.40 million 2.34 million 2.46 million
Table 2: Number of trainable parameters of Equivariant SAC, CNN SAC, FERM, and SEN in the robotic manipulation reinforcement learning task. Notice that FERM shares an encoder between the actor and the critic, so its total number of parameters is smaller than the sum of the actor and critic parameters.

Appendix D Training Details

D.1 Supervised Learning

We implement the environment in the PyBullet simulator (pybullet). The ducks are located in a square workspace, and the image observations are rendered at a fixed resolution and randomly cropped during training. We implement the training in PyTorch (pytorch) using a cross-entropy loss, where the output of the model is a score for each rotation $g \in C_8$. We use the Adam optimizer (adam) with a fixed learning rate. The batch size is 64. In all training, we perform a three-way data split with a varying number of training samples, 200 holdout validation samples, and 200 holdout test samples. Training is terminated either when the validation prediction accuracy does not improve for 100 epochs or when the maximum number of epochs (1000) is reached.

D.2 Reinforcement Learning in Robotic Manipulation

We use the environments provided by the BulletArm benchmark (bulletarm) implemented in the PyBullet simulator (pybullet). The image observation is rendered at a fixed resolution and randomly cropped during training. The action space consists of the change in gripper position; the change in the top-down rotation of the gripper; and the open width of the gripper in $[0, 1]$, where 0 means fully closed and 1 means fully open. All environments have a sparse reward: +1 for reaching the goal and 0 otherwise. During training, we use 5 parallel environments, where a training step is performed after all 5 parallel environments perform an action step. The evaluation is performed every 200 training steps. We implement the training in PyTorch (pytorch). We use the Adam optimizer (adam) with a fixed learning rate. The batch size is 128. The entropy temperature for SAC is automatically tuned toward a fixed target entropy. The Prioritized Experience Replay (PER) (per) buffer has a capacity of 100,000 transitions, with the prioritized replay exponent and the prioritized importance sampling exponent set as in per. The expert transitions are given a priority bonus.

The contrastive encoder of the FERM baseline has an encoding size of 50, as in ferm, and is pre-trained for 1.6k steps using the expert data, also following ferm. In DrQ, the number of augmentations for calculating the target and the number of augmentations for calculating the loss are both 2, as in drq.

D.3 Reinforcement Learning in DeepMind Control Suite

Sample images of each environment are shown in Figure 21. Environment observations are 3 consecutive frames of RGB images, stacked in order to infer velocity and acceleration. Note that we use odd-sized images instead of the even-sized $84 \times 84$ frames used in drqv2, as the DrQv2 architecture contains a strided convolutional layer, and striding breaks equivariance for even-sized spatial inputs (mohamed2020data). For each environment, an episode lasts 1000 steps, and each step has a reward between 0 and 1.

(a) Cartpole Balance
(b) Cartpole Swingup
(c) Pendulum Swingup
(d) Cup Catch
(e) Acrobot Swingup
(f) Reacher easy
(g) Reacher hard
Figure 21: DeepMind Control Suite: images of easy (top) and medium (bottom) tasks.

We modify the original DrQv2 by making the encoder map down to a smaller spatial output, leading to faster training: the second and third convolutional blocks each have an added max-pooling layer. As the equivariant version of DrQv2 has an additional convolutional layer after the action restriction, the non-equivariant version is also given an additional convolutional layer at the end of the encoder. We also scale the number of channels of the equivariant model in order to preserve roughly the same number of parameters as in the non-equivariant version.

The policy is evaluated by averaging the return over evaluation episodes every 10,000 environment steps. In all DMC experiments, we plot the mean and the standard error over random seeds. All other training details and hyperparameters are kept the same as in drqv2.

Appendix E Additional Experiments

E.1 Supervised Learning with More Symmetry Corruption Types

In this section, we repeat the experiment of Section 5 with more symmetry corruptions. Figure 22 shows the 15 different corruptions. We also show the performance of ‘Equi’, ‘CNN’, and ‘CNN + Img Trans’ without the random crop augmentation used in Section 5 (labeled as ‘no Crop’ variations). The results are shown in Figure 23. First, comparing blue vs. green and purple vs. orange, the equivariant network always outperforms the CNN, with or without random crop augmentation, especially with fewer data. Second, comparing blue vs. purple and green vs. orange, random crop generally helps both the equivariant network and the CNN. Third, comparing red vs. green and cyan vs. orange, adding the image transformation augmentation improves the performance of the CNN. Notice that the condition reverse is an outlier: there the equivariant network has incorrect equivariance, and the CNN methods without image transformation augmentation (green and orange) perform best.

Figure 22: All symmetry corruptions in the rotation estimation experiment.
Figure 23: Comparison of an equivariant network (blue), a conventional network (green), a CNN equipped with image transformation augmentation (red), and their variations without random crop augmentation (purple, orange, cyan). The plots show the prediction accuracy on the test set for models trained with different numbers of training samples. Results are averaged over four runs. Shading denotes standard error.

E.2 RL in Manipulation without Random Crop

In this section, we demonstrate the performance of Equivariant SAC and CNN SAC without random crop augmentation using RAD. As shown in Figure 24, both methods perform poorly without the random crop augmentation.

Figure 24: Comparison between Equivariant SAC and CNN SAC without data augmentation using RAD. The plots show the performance (in terms of discounted reward) of the evaluation policy. The evaluation is performed every 200 training steps. Results are averaged over four runs. Shading denotes standard error.

E.3 RL in Manipulation with Occlusion Corruption

Figure 25: Left: the depth image taken from a depth camera. Right: the orthographic projection centered at the gripper position generated from the left image, where the black areas are missing depth values due to occlusions.

In this section, we perform the same experiment as in Section 6.1 with a different type of symmetry corruption: occlusion due to orthographic projection from a single camera. Instead of using an RGBD image observation as in Section 6.1, we take the depth channel of the RGBD image and perform an orthographic projection centered at the gripper's position (Figure 25). This is the same process used in corl22 to generate a top-down image for equivariant learning; however, since we only have one camera instead of two as in the prior work, the orthographic projection has missing depth values due to occlusion, and thus yields an extrinsic equivariance constraint. Figure 26 shows the results. As in Section 6.1, Equivariant SAC outperforms all baselines by a significant margin.

Figure 26: Comparison of Equivariant SAC (blue) with baselines in environments with occlusion corruption. The plots show the performance (in terms of discounted reward) of the evaluation policy. The evaluation is performed every 200 training steps. Results are averaged over four runs. Shading denotes standard error.

E.4 RL in DeepMind Control Suite

Figure 27 is another visualization of equivariant vs. non-equivariant DrQv2 on the original pendulum swingup environment. As each method has one failed seed, we plot all runs in slightly different color shades. Excluding the failed run from each method, it is easy to see that equivariant DrQv2 learns faster than the non-equivariant version.

Figure 27: All runs of equivariant and non-equivariant DrQv2 on the DMC pendulum swingup task. Each method has one failed seed: the failed equivariant policy run (blue) is consistently near zero reward, while the failed non-equivariant policy run (red) levels off at a low reward. Overall, equivariant DrQv2 learns faster than the non-equivariant version when it succeeds.

E.4.1 Increasing Symmetry Corruptions

In these experiments, we modify some domains to have different levels of symmetry-breaking corruption. For cartpole and cup catch, we either remove the gridded floor and background (None) to make the observation perfectly equivariant, or keep the floor and background and additionally roll the camera to increase the level of corruption. For reacher, we use the same modifications but tilt the camera instead of rolling it. See Table 3 for sample images. In order to see the effects of increasing corruption on learning, we plot the mean discounted reward once both methods have converged (reacher requires more environment frames than cartpole and cup catch). Figure 28 shows that both the equivariant and non-equivariant DrQv2 surprisingly perform quite well across all corruption levels, with the exception of the tilted-camera variants of reacher. The equivariant policy seems to converge to a slightly higher discounted reward than the non-equivariant version, though the difference is not significant. On reacher, changing the camera angle may have hurt both methods by making the task more difficult for equivariant and regular CNN encoders alike.

Table 3: Modifications to DMC domains for varying symmetry corruption levels. The gridded floor and background are removed to be fully equivariant (None) or the camera angle is modified to increase the level of symmetry corruption (roll for cartpole and cup catch, tilt for reacher).
Figure 28: DMC performance comparison on various levels of symmetry corruptions. Both the equivariant and non-equivariant DrQv2 perform quite well even with increasing levels of corruption.

Appendix F Baseline Architecture Search

F.1 CNN SAC Architecture Search

Figure 29: Network architecture of the ‘fc1’ variation for CNN SAC.
Figure 30: Network architecture of the ‘fc2’ variation for CNN SAC.
Figure 31: Architecture search for CNN SAC. The plots show the performance (in terms of discounted reward) of the evaluation policy. The evaluation is performed every 200 training steps. Results are averaged over four runs. Shading denotes standard error.

This section demonstrates the architecture search for CNN SAC. We consider three different architectures (all with a similar number of trainable parameters): 1) conv (Figure 18): a CNN with the same structure as Equivariant SAC, where all layers are implemented as convolutional layers; 2) fc1 (Figure 29): a CNN that replaces some layers in 1) with fully connected layers; 3) fc2 (Figure 30): similar to 2), but with fewer convolutional layers and more weights in the fully connected layers. We evaluate the three network architectures with SAC equipped with random crop augmentation using RAD (rad).

Figure 31 shows the result, where all three variations have similar performance. We use conv in the main paper since it has the same structure as Equivariant SAC.

F.2 FERM Architecture Search

Figure 32: Network architecture of the ‘sim enc’ variation for FERM.
Figure 33: Network architecture of the ‘sim total 1’ variation for FERM.
Figure 34: Network architecture of the ‘ferm ori’ variation for FERM.
Figure 35: Architecture search for FERM. The plots show the performance (in terms of discounted reward) of the evaluation policy. The evaluation is performed every 200 training steps. Results are averaged over four runs. Shading denotes standard error.

This section demonstrates the architecture search for FERM. We consider four different architectures: 1) sim total 1 (Figure 33) and 2) sim total 2 (Figure 19): two architectures with a similar number of total trainable parameters as Equivariant SAC; 3) sim enc (Figure 32): an architecture with a similar number of trainable parameters in the encoder as Equivariant SAC's encoder (notice that since FERM shares an encoder between the actor and the critic while Equivariant SAC has separate encoders, matching the encoder size leads to fewer total parameters in FERM than in Equivariant SAC); and 4) ferm ori (Figure 34): the network architecture used in the original FERM paper (ferm).

Figure 35 shows the comparison across the four architectures. ‘sim total 2’ has a marginal advantage over the other three variations, so we use it in the main paper.

F.3 SEN Architecture Search

This section shows the architecture search for SEN. We consider three variations (all with a similar number of trainable parameters): 1) SEN conv (Figure 36): all layers are implemented as convolutional layers; 2) SEN fc1 (Figure 37) and 3) SEN fc2 (Figure 20): variations that replace some layers in 1) with fully connected layers.

Figure 38 shows the comparison across the three variations. ‘SEN fc2’ shows the best performance.

Figure 36: Network architecture of ‘SEN conv’ variation of SEN.
Figure 37: Network architecture of ‘SEN fc1’ variation of SEN.
Figure 38: Architecture search for SEN. The plots show the performance (in terms of discounted reward) of the evaluation policy. The evaluation is performed every 200 training steps. Results are averaged over four runs. Shading denotes standard error.