1 Introduction

Recently, equivariant learning has shown great success in various machine learning domains like trajectory prediction (walters2020trajectory), robotics (neural_descriptor), and reinforcement learning (iclr). Equivariant networks (g_conv; steerable_cnns) can improve generalization and sample efficiency during learning by encoding task symmetries directly into the model structure. However, this requires problem symmetries to be perfectly known and modeled at design time – something that is sometimes problematic. It is often the case that the designer knows that a latent symmetry is present in the problem but cannot easily express how that symmetry acts in the input space. For example, Figure 1b is a rotation of Figure 1a. However, this is not a rotation of the image – it is a rotation of the objects present in the image when they are viewed from an oblique angle. In order to model this rotational symmetry, the designer must know the viewing angle and somehow transform the data or encode projective geometry into the model. This is difficult and it makes the entire approach less attractive. In this situation, the conventional wisdom would be to discard the model structure altogether since it is not fully known and to use an unconstrained model.

Instead, we explore whether it is possible to benefit from equivariant models even when the way a symmetry acts on the problem input is not precisely known. We show empirically that this is indeed the case and that an inaccurate equivariant model is often better than a completely unstructured model. For example, suppose we want to model a function with the object-wise rotation symmetry expressed in Figure 1a and b. Notice that whereas it is difficult to encode the object-wise symmetry, it is easy to encode an image-wise symmetry because it involves simple image rotations. Although the image-wise symmetry model is imprecise in this situation, our experiments indicate that this imprecise model is still a much better choice than a completely unstructured model.

This paper makes three contributions.
First, we define three different relationships between problem symmetry and model symmetry: correct equivariance, incorrect equivariance, and extrinsic equivariance. Correct equivariance means the model correctly models the problem symmetry; incorrect equivariance is when the model symmetry interferes with the problem symmetry; and extrinsic equivariance is when the model symmetry transforms the input data to out-of-distribution data. We theoretically demonstrate the upper bound performance for an incorrectly constrained equivariant model. Second, we empirically compare extrinsic and incorrect equivariance in a supervised learning task and show that a model with extrinsic equivariance can improve performance compared with an unconstrained model. Finally, we explore this idea in a reinforcement learning context and show that an extrinsically constrained model can outperform state-of-the-art conventional CNN baselines.
2 Related Work
Equivariant Neural Networks.
Equivariant networks were first introduced as G-Convolution (g_conv) and Steerable CNN (steerable_cnns; e2cnn; escnn). Equivariant learning has been applied to various types of data including images (e2cnn), spherical data (spherical_cnns), point clouds (dym2020universality), sets (maron2020learning), and meshes (de2020gauge), and has shown great success in tasks including molecular dynamics (anderson2019cormorant), particle physics (bogatskiy2020lorentz), fluid dynamics (wang2020incorporating), trajectory prediction (walters2020trajectory), robotics (neural_descriptor; rss22xupeng; rss22haojie) and reinforcement learning (corl; iclr). In contrast to prior works, which assume the domain symmetry is perfectly known, this work studies the effectiveness of equivariant networks in domains with latent symmetries.
Symmetric Representation Learning.
Since a latent symmetry is not expressible as a simple transformation of the input, equivariant networks cannot be used in the standard way. Thus, several works have turned to learning equivariant features which can be easily transformed. sen learn an encoder which maps inputs to equivariant features which can be used by downstream equivariant layers. quessard2020learning, klee2022i2i, and marchetti2022equivariant map 2D image inputs to elements of various groups, allowing for disentanglement and equivariance constraints. falorsi2018explorations use a homeomorphic VAE to perform the same task in an unsupervised manner. dangovski2021equivariant consider equivariant representations learned in a self-supervised manner using losses to encourage sensitivity or insensitivity to various symmetries. Our method may be considered as an example of symmetric representation learning which, unlike any of the above methods, uses an equivariant neural network as an encoder. zhou2020meta and dehmamy2021automatic assume no prior knowledge of the structure of symmetry in the domain and learn the symmetry transformations on inputs and latent features end-to-end with the task function. In comparison, our work assumes that the latent symmetry is known but how it acts on the input is unknown.
Sample Efficient Reinforcement Learning.
One traditional solution for improving sample efficiency is to create additional samples using data augmentation (alexnet). Recent works have found that simple image augmentations like random crop (rad; drqv2) or random shift (drq) can improve the performance of reinforcement learning. Such image augmentation can be combined with contrastive learning (oord2018representation) to achieve better performance (curl; ferm). Many prior works have shown that equivariant methods can achieve very high sample efficiency in reinforcement learning (van2020mdp; mondal2020group; corl; iclr), and realize on-robot reinforcement learning (rss22xupeng; corl22). However, recent equivariant reinforcement learning works are limited to fully equivariant domains. This paper extends the prior works by applying equivariant reinforcement learning to tasks with latent symmetries.
3 Background
Equivariant Neural Networks.
A function is equivariant if it respects symmetries of its input and output spaces. Specifically, a function $f: X \to Y$ is equivariant with respect to a symmetry group $G$ if it commutes with all transformations $g \in G$, $f(\rho_x(g)x) = \rho_y(g)f(x)$, where $\rho_x$ and $\rho_y$ are the representations of the group $G$ that define how the group element $g \in G$ acts on $x \in X$ and $y \in Y$, respectively. An equivariant function is a mathematical way of expressing that $f$ is symmetric with respect to $G$: if we evaluate $f$ for differently transformed versions of the same input, we should obtain transformed versions of the same output.
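As a minimal sketch of what this definition means in practice, the snippet below checks the commutation property numerically for the group of 90-degree image rotations. The two functions, `laplacian` and `shift_right`, are hypothetical examples chosen for illustration and do not come from the paper; the group action on both input and output is assumed to be `np.rot90`.

```python
import numpy as np

def laplacian(x):
    """Discrete Laplacian with periodic boundaries (treats all four neighbours alike)."""
    return (np.roll(x, 1, axis=0) + np.roll(x, -1, axis=0)
            + np.roll(x, 1, axis=1) + np.roll(x, -1, axis=1) - 4.0 * x)

def shift_right(x):
    """Adds one horizontal neighbour only, which singles out one image direction."""
    return x + np.roll(x, 1, axis=1)

def is_c4_equivariant(f, x):
    """Check f(rho(g) x) == rho(g) f(x) for all four 90-degree rotations g."""
    return all(np.allclose(f(np.rot90(x, k)), np.rot90(f(x), k)) for k in range(4))

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))  # square input so that rotations preserve the shape

laplacian_ok = is_c4_equivariant(laplacian, image)
shift_ok = is_c4_equivariant(shift_right, image)
print("laplacian equivariant:", laplacian_ok)
print("shift_right equivariant:", shift_ok)
```

The first function commutes with every rotation because it treats all directions identically, while the second does not because it prefers one axis.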
In order to use an equivariant model, we generally require the symmetry group $G$ and representation $\rho_x$ to be known at design time. For example, in a convolutional model, this can be accomplished by tying the kernel weights $K$ together so as to satisfy $K(gy) = \rho_{out}(g)K(y)\rho_{in}(g)^{-1}$, where $\rho_{in}$ and $\rho_{out}$ denote the representation of the group operator at the input and the output of the layer (equi_theory). End-to-end equivariant models can be constructed by combining equivariant convolutional layers and equivariant activation functions. In order to leverage symmetry in this way, it is common to transform the input so that standard group representations work correctly, e.g., to transform an image to a top-down view so that image rotations correspond to object rotations.
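The weight-tying idea can be illustrated in its simplest special case. The sketch below assumes a single input and output channel with trivial representations and the group of 90-degree rotations, so the kernel constraint reduces to requiring that the kernel equal its own rotations. The helper names `correlate_valid` and `symmetrize_c4` are hypothetical; libraries such as e2cnn solve the general constraint rather than averaging as done here.

```python
import numpy as np

def correlate_valid(x, kernel):
    """Plain 2D cross-correlation without padding."""
    kh, kw = kernel.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

def symmetrize_c4(kernel):
    """Tie the weights by averaging a square kernel over its four 90-degree rotations."""
    return sum(np.rot90(kernel, k) for k in range(4)) / 4.0

rng = np.random.default_rng(0)
x = rng.normal(size=(9, 9))
free_kernel = rng.normal(size=(3, 3))      # unconstrained weights
tied_kernel = symmetrize_c4(free_kernel)   # weights satisfying the (trivial-rep) constraint

def commutes_with_rot90(kernel):
    """Does rotating the input and then filtering equal filtering and then rotating?"""
    return np.allclose(correlate_valid(np.rot90(x), kernel),
                       np.rot90(correlate_valid(x, kernel)))

tied_commutes = commutes_with_rot90(tied_kernel)
free_commutes = commutes_with_rot90(free_kernel)
print("tied kernel commutes:", tied_commutes)
print("free kernel commutes:", free_commutes)
```

Averaging over the group is only one way to obtain a kernel in the constrained subspace; it is used here because it keeps the example short.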
Equivariant SAC.
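Equivariant SAC (iclr) is a variation of SAC (sac) that constrains the actor to an equivariant function and the critic to an invariant function with respect to a group $G$. The policy is a network $\pi: S \to A \times A_\sigma$, where $A_\sigma$ is the space of action standard deviations (SAC models a stochastic policy). It defines the group action on the output space of the policy network $\bar{a} \in A \times A_\sigma$ as: $g\bar{a} = g(a_{equi}, a_{inv}, a_\sigma) = (\rho_{equi}(g)a_{equi}, a_{inv}, a_\sigma)$, where $a_{equi} \in A_{equi}$ is the equivariant component in the action space, $a_{inv} \in A_{inv}$ is the invariant component in the action space, $a_\sigma \in A_\sigma$, $g \in G$. The actor network $\pi$ is then defined to be a mapping $s \mapsto \bar{a}$ that satisfies the following equivariance constraint: $\pi(gs) = g(\pi(s)) = g\bar{a}$. The critic is a $Q$-network $q: S \times A \to \mathbb{R}$ that satisfies an invariant constraint: $q(gs, ga) = q(s, a)$.
4 Learning Symmetry Using Other Symmetries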
4.1 Model Symmetry Versus True Symmetry

This paper focuses on tasks where the way in which the symmetry group operates on the input space is unknown. In this case the ground truth function $f: X \to Y$ is equivariant with respect to a group $G$ which acts on $X$ and $Y$ by $\rho_x$ and $\rho_y$ respectively. However, the action $\rho_x$ on the input space is not known and may not be a simple or explicit map. Since $\rho_x$ is unknown, we cannot pursue the strategy of learning $f$ using an equivariant model class $f_\phi$ constrained by $\rho_x$. As an alternative, we propose restricting to a model class $f_\phi$ which satisfies equivariance with respect to a different group action $\hat{\rho}_x$, i.e., $f_\phi(\hat{\rho}_x(g)x) = \rho_y(g)f_\phi(x)$. This paper tests the hypothesis that if the model is constrained to a symmetry class $\hat{\rho}_x$ which is related to the true symmetry $\rho_x$, then it may help learn a model satisfying the true symmetry. For example, if $x$ is an image viewed from an oblique angle and $\rho_x(g)$ is the rotation of the objects in the image, $\hat{\rho}_x(g)$ can be the rotation of the whole image (which is different from $\rho_x(g)$ because of the tilted view angle). Section 4.4 will describe this example in detail.
4.2 Correct, Incorrect, and Extrinsic Equivariance
Our findings show that the success of this strategy depends on how $\hat{\rho}_x$ relates to the ground truth function $f$ and its symmetry. We classify the model symmetry as correct equivariance, incorrect equivariance, or extrinsic equivariance with respect to $f$. Correct symmetry means that the model symmetry correctly reflects a symmetry present in the ground truth function $f$. An extrinsic symmetry may still aid learning whereas an incorrect symmetry is necessarily detrimental to learning. We illustrate the distinction with a classification example shown in Figure 2a. (See Appendix B for a more in-depth description.) Let $D \subseteq X$ be the support of the input distribution for $f$.
Definition 4.1.
The action $\hat{\rho}_x$ has correct equivariance with respect to $f$ if $\hat{\rho}_x(g)x \in D$ for all $x \in D, g \in G$ and $f(\hat{\rho}_x(g)x) = \rho_y(g)f(x)$.
That is, the model symmetry preserves the input space $D$ and $f$ is equivariant with respect to it. For example, consider the action $\hat{\rho}_x$ of a reflection group acting on $\mathbb{R}^2$ by reflection across the horizontal axis and $\rho_y$, the trivial action fixing labels. Figure 2b shows the untransformed data as circles along the unit circle. The transformed data (shown as crosses) also lie on the unit circle, and hence the support $D$ is reflection invariant. Moreover, the ground truth labels (shown as orange or blue) are preserved by this action.
Definition 4.2.
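The action $\hat{\rho}_x$ has incorrect equivariance with respect to $f$ if there exist $x \in D$ and $g \in G$ such that $\hat{\rho}_x(g)x \in D$ but $f(\hat{\rho}_x(g)x) \neq \rho_y(g)f(x)$.
In this case, the model symmetry partially preserves the input distribution, but does not correctly preserve labels. In Figure 2c, the rotation group maps the unit circle to itself, but the transformed data does not have the correct label. Thus, constraining the model $f_\phi$ by this symmetry will force $f_\phi$ to mislabel data. In this example, there are a point $x \in D$ and a rotation $g$ with $\hat{\rho}_x(g)x \in D$; however, $f(\hat{\rho}_x(g)x) \neq \rho_y(g)f(x)$.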
Definition 4.3.
The action $\hat{\rho}_x$ has extrinsic equivariance with respect to $f$ if for $x \in D$, $\hat{\rho}_x(g)x \notin D$.
Extrinsic equivariance is when the equivariant constraint in the equivariant network enforces equivariance to out-of-distribution data. Since $\hat{\rho}_x(g)x \notin D$, the ground truth $f(\hat{\rho}_x(g)x)$ is undefined. An example of extrinsic equivariance is given by the scaling group shown in Figure 2d. For the data $x \in D$, enforcing scaling invariance $f_\phi(\hat{\rho}_x(g)x) = f_\phi(x)$, where $g$ is an element of the scaling group, will not increase error, because the group transformed data (in crosses) are out of the distribution of the input data shown in the grey ring. In fact, we hypothesize that such extrinsic equivariance may even be helpful for the network to learn the ground truth function. For example, in Figure 2d, the network can learn to classify all points on the left as blue and all points on the right as orange.
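The three definitions can be checked mechanically on a finite sample. The following sketch assumes a toy version of the unit-circle example: four points at 45, 135, 225, and 315 degrees, labelled by which side of the vertical axis they fall on, and a trivial action on labels. The function `classify_symmetry`, the chosen points, the "mixed" fallback case, and the use of a single group element per check are all illustrative assumptions rather than part of the paper.

```python
import numpy as np

def classify_symmetry(points, label_fn, transform, label_transform=lambda y: y):
    """Label one model-symmetry transform as correct, incorrect, or extrinsic."""
    support = {tuple(np.round(p, 6)) for p in points}
    stays_inside, label_conflict = [], False
    for p in points:
        q = transform(p)
        inside = tuple(np.round(q, 6)) in support
        stays_inside.append(inside)
        # A conflict needs an in-distribution image whose label disagrees.
        if inside and label_fn(q) != label_transform(label_fn(p)):
            label_conflict = True
    if label_conflict:
        return "incorrect"
    if all(stays_inside):
        return "correct"
    if not any(stays_inside):
        return "extrinsic"
    return "mixed"  # not covered by the three definitions

angles = np.deg2rad([45.0, 135.0, 225.0, 315.0])
points = [np.array([np.cos(a), np.sin(a)]) for a in angles]
label_fn = lambda p: 0 if p[0] < 0 else 1   # 0: left half-plane, 1: right half-plane

reflection = classify_symmetry(points, label_fn, lambda p: np.array([p[0], -p[1]]))
rotation = classify_symmetry(points, label_fn, lambda p: -p)       # rotation by pi
scaling = classify_symmetry(points, label_fn, lambda p: 2.0 * p)   # leaves the circle
print(reflection, rotation, scaling)
```

The reflection keeps every point on the support with the same label, the rotation keeps points on the support but flips their labels, and the scaling moves every point off the support.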
4.3 Theoretical Upper Bound on Accuracy for Incorrect Equivariant Models
Consider a classification problem over the set $X$ with finitely many classes $Y$. Let $G$ be a finite group acting on $X$. Consider a model $f_\phi: X \to Y$ with incorrect equivariance constrained to be invariant to $G$. In this case the points in a single orbit $\{gx : g \in G\}$ must all be assigned the same label. However, these points may have different ground truth labels. We quantify how bad this situation is by measuring $p(x)$, the proportion of ground truth labels in the orbit of $x$ which are equal to the majority label. Let $c_p$ be the fraction of points $x \in X$ which have consensus proportion $p(x) = p$.
Proposition 4.1.
The accuracy of $f_\phi$ has upper bound
$$\mathrm{acc}(f_\phi) \le \sum_p c_p\, p.$$
See the complete version of the proposition and its proof in Appendix A. In the example in Figure 2c, every orbit contains both labels in equal proportion, so we have $p \in \{0.5\}$ and $c_{0.5} = 1$, thus $\mathrm{acc}(f_\phi) \le 0.5$. In contrast, an unconstrained model with a universal approximation property and proper hyperparameters can achieve arbitrarily good accuracy.
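To make the bound concrete, the sketch below computes it for a finite labelled sample whose orbits are given explicitly. It assumes a uniform distribution over the sample, so that weighting each orbit by its share of the points and multiplying by its majority-label share gives the same sum as grouping points by consensus proportion. The function name `accuracy_upper_bound` and the four-point data set are hypothetical.

```python
from collections import Counter

def accuracy_upper_bound(labels, orbits):
    """Sum over orbits of (orbit weight) x (majority-label share), assuming uniform data."""
    total = sum(len(orbit) for orbit in orbits)
    bound = 0.0
    for orbit in orbits:
        counts = Counter(labels[i] for i in orbit)
        majority_share = counts.most_common(1)[0][1] / len(orbit)
        bound += (len(orbit) / total) * majority_share
    return bound

# Four points at 45, 135, 225, 315 degrees; label 1 on the right half-plane, 0 on the left.
labels = [1, 0, 0, 1]
rotation_orbits = [[0, 2], [1, 3]]    # rotation by pi pairs opposite points
reflection_orbits = [[0, 3], [1, 2]]  # reflection across the horizontal axis

incorrect_bound = accuracy_upper_bound(labels, rotation_orbits)
correct_bound = accuracy_upper_bound(labels, reflection_orbits)
print(incorrect_bound, correct_bound)
```

An invariant model under the rotation must give both points of each orbit the same label, so it can be right on at most half of them, whereas the reflection imposes no such cost.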
4.4 Object Transformation and Image Transformation
In tasks with visual inputs, incorrect or extrinsic equivariance will exist when the transformation of the image does not match the transformation of the latent state of the task. In such cases, we call $\rho_x$ the object transform and $\hat{\rho}_x$ the image transform. For an image input $x$, the image transform $\hat{\rho}_x(g)x$ is defined as a simple transformation of pixel locations (e.g., Figure 1a-c), while the object transform $\rho_x(g)x$ is an implicit map transforming the objects in the image (e.g., Figure 1a-b). The distinction between object transform and image transform is often caused by some symmetry-breaking factors such as camera angle, occlusion, backgrounds, and so on (e.g., Figure 1). We refer to such symmetry-breaking factors as symmetry corruptions.
5 Evaluating Equivariant Network with Symmetry Corruptions
Although it is preferable to use an equivariant model to enforce correct equivariance, real-world problems often contain some symmetry corruptions, such as oblique viewing angles, which mean the symmetry is latent. In this experiment, we evaluate the effect of different corruptions on an equivariant model and show that enforcing extrinsic equivariance can actually improve performance. We experiment with a simple supervised learning task where the scene contains three ducks of different colors. The data samples are pairs of images where all ducks in the first image are rotated by some rotation $g$ to produce the second image within each pair. Given the correct $g$ as the label, the goal is to train a network to classify the rotation (Figure 3a). If we have a perfect top-down image observation, then the object transform and image transform are equal, and we can enforce the correct equivariance by modeling the ground truth function as a network that is invariant to rotating both input images together (because the rotation of the two images will not change the relative rotation between the objects in the two images). To mimic symmetry corruptions in real-world applications, we apply seven different transformations to both pairs of images shown in Figure 3b (more corruptions are considered in Appendix E.1). In particular, for invert-label, the ground truth label $g$ is inverted to $-g$ when the yellow duck is on the left of the orange duck in the world frame in the first input image. Notice that enforcing rotation invariance under invert-label is an incorrect equivariant constraint because a rotation on the ducks might change their relative position in the world frame and break the invariance of the task. However, in all other corruptions, enforcing rotation invariance is an extrinsic equivariance because the rotated images will be out of the input distribution. The equivariant network is implemented using e2cnn (e2cnn). See Appendix D.1 for the training details.
Figure 3: (a) The rotation estimation task requires the network to estimate the relative rotation between the two input states. (b) Different symmetry corruptions in the rotation estimation experiment.

rotations (red). The plots show the prediction accuracy in the test set of the model trained with different number of training data. In all of our experiments, we take the average over four random seeds. Shading denotes standard error.
Comparing Equivariant Networks with CNNs.
We first compare the performance of an equivariant network (Equi) and a conventional CNN model (CNN) with a similar number of trainable parameters. The network architectures are relatively simple (see Appendix C.1) as our goal is to evaluate the performance difference between an equivariant network and an unconstrained CNN model rather than achieving the best performance in this task. In both models, we apply a random crop after sampling each data batch to improve the sample efficiency. See Appendix E.1 for the effects of random crop augmentation on learning. Figure 4 (blue vs green) shows the test accuracy of both models after convergence when trained with varying dataset sizes. For all corruptions with extrinsic equivariance constraints, the equivariant network performs better than the CNN model, especially in low data regimes. However, for invert-label, which gives an incorrect equivariance constraint, the CNN outperforms the equivariant model, demonstrating that enforcing incorrect equivariance negatively impacts accuracy. In fact, based on Proposition 4.1, the equivariant network here has a theoretical upper bound performance of $62.5\%$. First, $p \in \{1, 0.5\}$. Then $p(x) = 1$ when the label $g$ of $x$ satisfies $g = -g$ (i.e., negating the label won’t change it), and $c_1 = 0.25$. The consensus proportion $p(x) = 0.5$ for all other labels, where half of the labels in the orbit of $x$ will be the negation of the labels of the other half (because half of the rotations will change the relative position between the yellow and orange duck), thus $c_{0.5} = 0.75$. Hence $\mathrm{acc}(f_\phi) \le 0.25 \times 1 + 0.75 \times 0.5 = 0.625$. This theoretical upper bound matches the result in Figure 4. Figure 4 suggests that even in the presence of symmetry corruptions, enforcing extrinsic equivariance can improve the sample efficiency while incorrect equivariance is detrimental.
Extrinsic Image Augmentation Helps in Learning Correct Symmetry.
In these experiments, we further illustrate that enforcing extrinsic equivariance helps the model learn the latent equivariance of the task for in-distribution data. As an alternative to equivariant networks, we consider an older alternative for symmetry learning, data augmentation, to see whether extrinsic symmetry augmentations can improve the performance of an unconstrained CNN by helping it learn latent symmetry. Specifically, we augment each training sample with image rotations while keeping the validation and test set unchanged. As is shown in Figure 4, adding such extrinsic data augmentation (CNN + Img Trans, red) significantly improves the performance of CNN (green), and nearly matches the performance of the equivariant network (blue). Notice that in invert-label, adding such augmentation hurts the performance of CNN because of incorrect equivariance.
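A minimal sketch of this kind of augmentation is given below. It assumes that each sample is a pair of square images stored as an array of shape `(2, H, W, C)`, that both images in a pair are rotated by the same angle so the relative-rotation label stays valid, and that only multiples of 90 degrees are used to avoid interpolation. The set of rotation angles actually used in the experiments is not stated in this section, and the function name is hypothetical.

```python
import numpy as np

def augment_pairs_with_rotations(pairs, labels, rng):
    """Rotate both images of each pair by the same random multiple of 90 degrees."""
    ks = rng.integers(0, 4, size=len(pairs))
    rotated = np.stack([np.rot90(pair, int(k), axes=(1, 2))  # axes 1, 2 are height, width
                        for pair, k in zip(pairs, ks)])
    return rotated, labels.copy(), ks  # labels are unchanged by a shared rotation

rng = np.random.default_rng(0)
pairs = rng.random((5, 2, 16, 16, 3))   # five training pairs of 16x16 RGB images
labels = np.arange(5)                   # placeholder rotation-class labels
aug_pairs, aug_labels, ks = augment_pairs_with_rotations(pairs, labels, rng)
print(aug_pairs.shape, ks)
```

Only the training samples would be passed through such a function; the validation and test sets stay untouched, as described above.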
6 Extrinsic Equivariance in Reinforcement Learning
The results in Section 5 suggest that enforcing extrinsic equivariance can help the model better learn the latent symmetry in the task. In this section, we apply this methodology in reinforcement learning and demonstrate that extrinsic equivariance can significantly improve sample efficiency.
6.1 Reinforcement Learning in Robotic Manipulation

We first experiment in five robotic manipulation environments shown in Figure 6. The state space is a 4-channel RGBD image captured from a fixed camera pointed at the workspace (Figure 5). The action space is the change in gripper pose $(x, y, z, \theta)$, where $\theta$ is the rotation along the $z$-axis, and the gripper open width. The task has a latent symmetry: when a rotation or reflection is applied to the poses of the gripper and the objects, the action should rotate and reflect accordingly. However, such symmetry does not exist in image space because the image perspective is skewed instead of top-down (we also perform experiments with another symmetry corruption caused by sensor occlusion in Appendix E.3). We enforce such extrinsic symmetry using Equivariant SAC (iclr; corl22) equipped with random crop augmentation using RAD (rad) (Equi SAC + RAD) and compare it with the following baselines: 1) CNN SAC + RAD: same as our method but with an unconstrained CNN instead of an equivariant model; 2) CNN SAC + DrQ: same as 1), but with DrQ (drq) for the random crop augmentation; 3) FERM (ferm): a combination of 1) and contrastive learning; and 4) SEN + RAD: Symmetric Embedding Network (sen) that uses a conventional network for the encoder and an equivariant network for the output head. All baselines are implemented such that they have a similar number of parameters as Equivariant SAC. See Appendix C.2 for the network architectures and Appendix F for the architecture hyperparameter search for the baselines. All methods use Prioritized Experience Replay (PER) (per) with pre-loaded expert demonstrations (20 episodes for Block Pulling and Block Pushing, 50 for Block Picking and Drawer Opening, and 100 for Block in Bowl). We also add an L2 loss towards the expert action in the actor to encourage expert actions. More details about training are provided in Appendix D.2.

Figure 7 shows that Equivariant SAC (blue) outperforms all baselines. Note that the performance of Equivariant SAC in Figure 7 does not match that reported in iclr because we have a harder task setting: we do not have a top-down observation centered at the gripper position as in the prior work. Such top-down observations would not only provide correct equivariance but also help learn a translation-invariant policy. Even in the harder task setting without top-down observations, Figure 7 suggests that Equivariant SAC can still achieve higher performance compared to baselines.

6.2 Increasing Corruption Levels

In this experiment, we vary the camera angle by tilting to see how increasing the gap between the image transform and the object transform affects the performance of extrinsically equivariant networks. When the view angle is at 90 degrees (i.e., the image is top-down), the object and image transformation exactly match. As the view angle is decreased, the gap increases. Figure 8 shows the observation at 90 and 15 degree view angles. We remove the robot arm except for the gripper and the blue/white grid on the ground to remove the other symmetry-breaking components in the environment so that the camera angle is the only symmetry corruption. We compare Equi SAC + RAD against CNN SAC + RAD. We evaluate the performance of each method at the end of training for different view angles in Figure 9. As expected, the performance of Equivariant SAC decreases as the camera angle is decreased, especially from 30 degrees to 15 degrees. On the other hand, CNN generally has similar performance for all view angles, with the exception of Block Pulling and Block Pushing, where decreasing the view angle leads to higher performance. This may be because decreasing the view angle helps the network to better understand the height of the gripper, which is useful for pulling and pushing actions.

6.3 Example of Incorrect Equivariance

Figure 10: The environment applies a random reflection to the state image at every step. The four images show the four possible reflections, each with 25% probability.
We demonstrate an example where incorrect equivariance can harm the performance of Equivariant SAC compared to an unconstrained model. We modify the environments so that the image state will be reflected across the vertical axis with probability 50% and then also reflected across the horizontal axis with probability 50% (see Figure 10). As these random reflections are contained in the model's symmetry group, the transformed state is affected by Equivariant SAC’s symmetry constraint. In particular, as the actor produces a transformed action for a reflected state when the optimal action should actually be invariant, the extrinsic equivariance constraint now becomes an incorrect equivariance for these reflected states. As shown in Figure 11, Equivariant SAC can barely learn under random reflections, while CNN can still learn a useful policy.
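The corruption described here can be sketched as a small observation transform. The snippet assumes two independent flips, each applied with probability 0.5, which yields the four equally likely outcomes; the function name `randomly_reflect` and its return convention are hypothetical and not taken from the paper's code.

```python
import numpy as np

def randomly_reflect(image, rng, p=0.5):
    """Flip the image left-right and then up-down, each independently with probability p."""
    flip_lr = bool(rng.random() < p)
    flip_ud = bool(rng.random() < p)
    if flip_lr:
        image = image[:, ::-1]   # reflection across the vertical axis
    if flip_ud:
        image = image[::-1, :]   # reflection across the horizontal axis
    return image, (flip_lr, flip_ud)

rng = np.random.default_rng(0)
state = rng.random((8, 8, 4))    # stand-in for a 4-channel RGBD observation
outcomes = set()
for _ in range(200):
    reflected, flags = randomly_reflect(state, rng)
    outcomes.add(flags)
print(sorted(outcomes))
```

Because the action is left untouched while the observation is flipped, a policy constrained to transform its action under these flips is forced toward the wrong output on the reflected states.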

6.4 Reinforcement Learning in DeepMind Control Suite

We further apply extrinsically equivariant networks to continuous control tasks in the DeepMind Control Suite (DMC) (tunyasuvunakool2020). We use a subset of the domains in DMC that have clear object-level symmetry and use the group for cartpole, cup catch, pendulum, acrobot domains, and for reacher domains. This leads to a total of tasks, with easy and medium level tasks as defined in (drqv2). Note that all of these domains are not fully equivariant as they include a checkered grid for the floor and random stars as the background.
We use DrQv2 (drqv2), a state-of-the-art model-free RL algorithm for image-based control, as our base RL algorithm. We create an equivariant version of DrQv2, with an equivariant actor and invariant critic with respect to the environment’s symmetry group. We closely follow the architecture and training hyperparameters used in the original paper except in the image encoder, where two max-pooling layers are added to further reduce the representation dimension for faster training. Furthermore, DrQv2 uses convolution layers in the image encoder and then flattens its output to feed it into linear layers in the actor and the critic. In order to preserve this design choice for the equivariant model, we do not reduce the spatial dimensions to $1 \times 1$ by downsampling/pooling or stride as commonly done in practice. Rather, we flatten the image using a process we term action restriction, since the symmetry group is restricted to a smaller group. Let the image feature be a map on which the group acts on both the spatial domain and the channels. We first add a new axis corresponding to the group and then flatten. The intermediate step is necessary to encode both the spatial and channel actions into a single axis, which ensures the action restriction is equivariant. We then map back down to the original dimension with an equivariant convolution. To the best of our knowledge, this is the first equivariant version of DrQv2.

We compare the equivariant vs. the non-equivariant (original) DrQv2 algorithm to evaluate whether extrinsic equivariance can still improve training in the original domains (with symmetry corruptions). In Figure 12, equivariant DrQv2 consistently learns faster than the non-equivariant version on all tasks, where the performance improvement is largest on the more difficult medium tasks. In pendulum swingup, both methods have failed runs, leading to a large standard error; see Figure 27 in Appendix E.4 for a plot of all runs. These results highlight that even with some symmetry corruptions, equivariant policies can outperform non-equivariant ones. See Appendix E.4.1 for an additional experiment where we vary the level of symmetry corruptions as in Section 6.2.
7 Discussion
This paper defines correct equivariance, incorrect equivariance, and extrinsic equivariance, and identifies that enforcing extrinsic equivariance does not necessarily increase error. This paper further demonstrates experimentally that extrinsic equivariance can provide significant performance improvements in reinforcement learning. A limitation of this work is that we mainly experiment in reinforcement learning and a simple supervised setting but not in other domains where equivariant learning is widely used. The experimental results of our work suggest that an extrinsic equivariance should also be beneficial in those domains, but we leave this demonstration to future work. Another limitation is that we focus on planar equivariant networks. In future work, we are interested in evaluating extrinsic equivariance in network architectures that process different types of data.
References
Appendix A Theoretical Upper Bound on Accuracy for Models with Incorrect Symmetry
We consider a classification problem over the set $X$ with finitely many classes $Y$. Let $m = |Y|$ be the number of classes. Let $f: X \to Y$ be the true labels. Let $G$ be a finite group acting on $X$. We assume the action of $G$ on $X$ is density preserving. That is, if $\mu$ is the density function corresponding to the input domain, then $\mu(gx) = \mu(x)$. Denote the orbit of a point $x$ by $Gx = \{gx : g \in G\}$ and the stabilizer by $G_x = \{g \in G : gx = x\}$. By the orbit-stabilizer theorem, $|G| = |Gx| \cdot |G_x|$.
Now consider a model $f_\phi$ with incorrect equivariance constrained to be invariant to $G$. We partition the input set $X$ into subsets $X_k$, where
$$X_k = \{x \in X : |f(Gx)| = k\}.$$
If $f_\phi$ has correct equivariance then $X = X_1$. Incorrect equivariance implies that there are orbits which are assigned more than one label. Since $f_\phi$ is constrained to be equivariant, such orbits will necessarily result in some errors. We give an upper bound on that error. Define $c_k = P(x \in X_k)$. Note that since the $X_k$ give a partition, $\sum_k c_k = 1$. Also, $X_k$ is empty for $k > |G|$ since the number of labels assigned to an orbit is also upper bounded by the number of points in the orbit, which is at most $|G|$. Letting $n = \min(m, |G|)$, we have $\sum_{k=1}^{n} c_k = 1$.
Proposition A.1.
The accuracy of $f_\phi$ has upper bound
$$\mathrm{acc}(f_\phi) \le 1 - \frac{1}{|G|}\sum_{k=1}^{n} c_k (k - 1).$$
In contrast, we can choose an unconstrained model from a model class with a universal approximation property and, given properly chosen hyperparameters, find a model with arbitrarily good accuracy.
Proof.
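Let $\mathbb{1}$ denote the indicator function. Then
$$\mathrm{acc}(f_\phi) = \mathbb{E}_{x}\left[\mathbb{1}(f_\phi(x) = f(x))\right].$$
Since the action of $G$ is density preserving, applying an element of $G$ before sampling does not affect the expectation, and so
$$\mathrm{acc}(f_\phi) = \frac{1}{|G|}\sum_{g \in G}\mathbb{E}_{x}\left[\mathbb{1}(f_\phi(gx) = f(gx))\right].$$
If we split the expectation over the partition $X_k$ we get
$$\mathrm{acc}(f_\phi) = \frac{1}{|G|}\sum_{g \in G}\sum_{k} c_k\, \mathbb{E}_{x \in X_k}\left[\mathbb{1}(f_\phi(gx) = f(gx))\right].$$
Interchanging sums gives
$$\mathrm{acc}(f_\phi) = \sum_{k} c_k\, \mathbb{E}_{x \in X_k}\left[\frac{1}{|G|}\sum_{g \in G}\mathbb{1}(f_\phi(gx) = f(gx))\right].$$
By the orbit-stabilizer theorem,
$$\frac{1}{|G|}\sum_{g \in G}\mathbb{1}(f_\phi(gx) = f(gx)) = \frac{1}{|Gx|}\sum_{y \in Gx}\mathbb{1}(f_\phi(y) = f(y)),$$
which is the average accuracy over the orbit $Gx$. Since $f_\phi$ is constrained to a single value on the orbit, and $k$ different true labels appear, the highest accuracy attainable is when $|Gx| = |G|$ and the true labels are maximally unequally distributed such that 1 point in the orbit takes each of $k - 1$ labels and all the other points receive a single label. In this case accuracy can be maximized by choosing $f_\phi(x)$ to be this majority label, and
$$\frac{1}{|Gx|}\sum_{y \in Gx}\mathbb{1}(f_\phi(y) = f(y)) \le \frac{|G| - k + 1}{|G|}.$$
Substituting back in,
$$\mathrm{acc}(f_\phi) \le \sum_{k} c_k\, \mathbb{E}_{x \in X_k}\left[\frac{|G| - k + 1}{|G|}\right] = \sum_{k} c_k\, \frac{|G| - k + 1}{|G|} = 1 - \frac{1}{|G|}\sum_{k} c_k (k - 1),$$
since $\frac{|G| - k + 1}{|G|}$ is constant over $X_k$ and $\sum_k c_k = 1$. ∎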
Note that the assumption that $|Gx| = |G|$ and that the labels on a given orbit are maximally unequally distributed need not hold in general and thus this bound is not tight. In order to produce a tight upper bound, consider a partition of $X$ into subsets $X_p$, where $X_p = \{x \in X : p(x) = p\}$, and define $c_p = P(x \in X_p)$. The set $X_p$ contains points in orbits where the majority label covers a fraction $p$ of the points. Note that although $p$ is a fraction between 0 and 1, there are only finitely many possible values of $p$ since the numerator and denominator are bounded natural numbers. We may thus sum over the values of $p$.
Proposition A.2.
The accuracy of $f_\phi$ has upper bound
$$\mathrm{acc}(f_\phi) \le \sum_{p} c_p\, p.$$
Proof.
The proof is similar to the proof of Proposition A.1; replace $X_k$ and $c_k$ with $X_p$ and $c_p$ respectively. For $x \in X_p$, the term $\frac{1}{|Gx|}\sum_{y \in Gx}\mathbb{1}(f_\phi(y) = f(y))$ can be upper bounded by choosing the majority label, yielding $p$. The bound then follows as before. ∎
This is a tight upper bound since assigning any but the majority label would result in lower accuracy.
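As a small worked comparison of the two bounds, consider a hypothetical case (not from the paper) in which $|G| = 4$, every orbit has four points, and each orbit carries two labels split evenly, say $\{A, A, B, B\}$. The calculation below assumes that the looser bound of Proposition A.1 takes the per-orbit form $(|G| - k + 1)/|G|$ described in its proof.

```latex
% Hypothetical example: |G| = 4, every orbit has labels {A, A, B, B}.
% Number of distinct labels per orbit: k = 2, so c_2 = 1.
% Majority share per orbit: p = 2/4 = 1/2, so c_{1/2} = 1.
\begin{align*}
\text{Proposition A.1:}\quad
  \mathrm{acc}(f_\phi) &\le 1 - \frac{1}{|G|}\sum_k c_k (k-1)
                        = 1 - \frac{1}{4}\cdot 1 \cdot (2 - 1) = 0.75, \\
\text{Proposition A.2:}\quad
  \mathrm{acc}(f_\phi) &\le \sum_p c_p\, p = 1 \cdot \tfrac{1}{2} = 0.5 .
\end{align*}
```

The first bound is loose here because it assumes the minority label occupies a single point of the orbit, whereas it actually occupies half of it; the second bound uses the true majority share and is attained by predicting the majority label on every orbit.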
Figure 13 demonstrates the upper bound of an incorrectly constrained equivariant network with the invert-label corruption in Section 5.

Appendix B Correct, Incorrect, and Extrinsic Equivariance Examples

In this section, we describe how the model symmetry transforms data under correct, incorrect, and extrinsic equivariance and how such transformations relate to the true symmetry present in the task using the example of Section 4.2. The ground truth function is a mapping from to . Let be the coordinates of four points in the data distribution on the unit circle (Figure 14a). The ground truth labels for these points are: .
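B.1 Correct Equivariance
Definition 4.1.
The action $\hat{\rho}_x$ has correct equivariance with respect to $f$ if $\hat{\rho}_x(g)x \in D$ for all $x \in D, g \in G$ and $f(\hat{\rho}_x(g)x) = \rho_y(g)f(x)$.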
Consider the reflection group (where is the reflection along the horizontal axis) acting on by or and via , the trivial action fixing the labels (Figure 14b). If we define an equivariant model with respect to and , then the model’s symmetry preserves the problem symmetry. For example, consider the point , is the reflection so that and . Since the model is -equivariant, . Substituting and , we obtain , meaning that the output of and are constrained to be equal. Thus the invariance property in the ground truth function where is preserved (notice that this applies to all ). We call this correct equivariance.
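B.2 Incorrect Equivariance
Definition 4.2.
The action $\hat{\rho}_x$ has incorrect equivariance with respect to $f$ if there exist $x \in D$ and $g \in G$ such that $\hat{\rho}_x(g)x \in D$ but $f(\hat{\rho}_x(g)x) \neq \rho_y(g)f(x)$.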
Consider the rotation group (Figure 14c) which acts via on via a rotation matrix of and acts on via . If we define an equivariant model with respect to and , the network’s symmetry will conflict with the problem’s symmetry. For example, consider the point and let be the rotation action so that and . As the model is -equivariant, . Substituting and , we get . However, this constraint interferes with the ground truth function as and . We call this incorrect equivariance.
b.3 Extrinsic Equivariance
Definition 4.3.
The action has extrinsic equivariance with respect to if for , .
Consider the scaling group acting on by scaling the vector and on via (Figure 14d). If we define an equivariant model with respect to and , the group-transformed data will be outside the input distribution. Consider the point and let be the scaling action so that . Since the model is -equivariant, . Substituting and we have meaning that the output of and are constrained to be equal. However, is outside of the input distribution (gray ring) and thus the ground truth is undefined. We call this extrinsic equivariance.
Intuitively, it is easy to see in this example how extrinsic equivariance would help the model learn . If the model is equivariant to the scale group , then it can generalize to “scaled” up or down versions of the input distribution and “covers” more of the input space . As such, the model may learn the decision boundary (the vertical axis) more easily because of its equivariance compared to a non-equivariant model, even if the equivariance is extrinsic.
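The three cases above can be checked mechanically on a toy version of the example. The following is a minimal sketch, not the paper's code: the four sample angles, the label names, the scale factor of 2, and the helper names are all assumptions made for illustration. It assumes the ground truth depends only on which side of the vertical axis a point lies, that the data domain is the unit circle, and that the group acts trivially on the labels. Only the non-identity group element is tested, since the identity satisfies every condition trivially.

```python
import math

# Hypothetical sample points on the unit circle; the angles are an assumption.
POINTS = [(math.cos(math.radians(a)), math.sin(math.radians(a)))
          for a in (45, 135, 225, 315)]


def label(p):
    """Assumed ground truth: the decision boundary is the vertical axis."""
    return "right" if p[0] > 0 else "left"


def in_domain(p, tol=1e-9):
    """Assumed data domain: points lying on the unit circle."""
    return abs(math.hypot(p[0], p[1]) - 1.0) < tol


def reflect_horizontal(p):
    """Reflection across the horizontal axis: (x, y) -> (x, -y)."""
    return (p[0], -p[1])


def rotate_pi(p):
    """Rotation by pi: (x, y) -> (-x, -y)."""
    return (-p[0], -p[1])


def scale_by_two(p):
    """Scaling by an assumed factor of 2, which leaves the unit circle."""
    return (2.0 * p[0], 2.0 * p[1])


def classify(action):
    """Classify an input action, assuming the trivial action on the labels."""
    pairs = [(p, action(p)) for p in POINTS]
    # Incorrect: some transformed point stays in the domain but changes label.
    if any(in_domain(q) and label(q) != label(p) for p, q in pairs):
        return "incorrect"
    # Correct: every transformed point stays in the domain with the same label.
    if all(in_domain(q) for _, q in pairs):
        return "correct"
    # Extrinsic: every transformed point leaves the domain.
    if all(not in_domain(q) for _, q in pairs):
        return "extrinsic"
    return "mixed"
```

Under these assumptions, `classify(reflect_horizontal)`, `classify(rotate_pi)`, and `classify(scale_by_two)` fall into the correct, incorrect, and extrinsic cases respectively.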
Appendix C Network Architecture


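Table 1: Number of trainable parameters in the equivariant network (Equi) and the CNN.

| Network | Equi | CNN |
| --- | --- | --- |
| Number of Parameters | 1.11 million | 1.28 million |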
c.1 Supervised Learning
Figure 15 shows the architecture of the equivariant network and Figure 16 shows the architecture of the CNN used in Section 5. Both networks are 8-layer convolutional neural networks. The equivariant network is implemented using the e2cnn (e2cnn) library, where the hidden layers are defined using the regular representation and the output layer is defined using the trivial representation. Table 1 shows the number of trainable parameters in each network; the two are of similar size, with the CNN slightly larger (1.28 million vs. 1.11 million).
c.2 Reinforcement Learning in Robotic Manipulation
Figure 17 shows the network architecture of Equivariant SAC used in manipulation tasks in Section 6.1. All hidden layers are implemented using the regular representation. For the actor (top), the output is a mixed representation containing one standard representation for the actions, one signed representation for the action, and seven trivial representations for the actions and the standard deviations of all action components. Figure 18 shows the network architecture of CNN SAC for both RAD and DrQ. Figure 19 shows the network architecture of FERM. Figure 20 shows the network architecture of SEN.
Table 2 shows the number of trainable parameters for each model. All baselines have slightly more parameters compared with Equivariant SAC.



Table 2: Number of trainable parameters for each model.

| Network | Equi SAC | CNN SAC | FERM | SEN |
| --- | --- | --- | --- | --- |
| Number of Actor Parameters | 1.11 million | 1.13 million | 1.79 million | 1.22 million |
| Number of Critic Parameters | 1.18 million | 1.27 million | 1.90 million | 1.24 million |
| Number of Total Parameters | 2.29 million | 2.40 million | 2.34 million | 2.46 million |
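The totals in Table 2 are not simply actor plus critic for every model. The short sketch below recomputes the difference from the table values; the variable and function names are illustrative only. For Equivariant SAC, CNN SAC, and SEN the difference is zero, whereas for FERM it is about 1.35 million. One reading, stated here as an assumption rather than a reported figure, is that this gap corresponds to the encoder that FERM shares between the actor and the critic and that is counted in both rows.

```python
# Parameter counts in millions, copied from Table 2.
PARAMS = {
    "Equi SAC": {"actor": 1.11, "critic": 1.18, "total": 2.29},
    "CNN SAC": {"actor": 1.13, "critic": 1.27, "total": 2.40},
    "FERM": {"actor": 1.79, "critic": 1.90, "total": 2.34},
    "SEN": {"actor": 1.22, "critic": 1.24, "total": 2.46},
}


def overlap(counts):
    """Parameters counted in both rows: actor + critic - total (millions)."""
    return round(counts["actor"] + counts["critic"] - counts["total"], 2)


# Assumption: a non-zero overlap reflects weights shared by actor and critic.
OVERLAPS = {name: overlap(counts) for name, counts in PARAMS.items()}
```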
Appendix D Training Details
d.1 Supervised Learning
We implement the environment in the PyBullet simulator (pybullet). The ducks are located in a workspace with a size of . The pixel size of the image is (and will be cropped to during training). We implement the training in PyTorch (pytorch) using a cross-entropy loss. The output of the model is the score for each . We use the Adam optimizer (adam) with a learning rate of . The batch size is 64. In all training, we perform a three-way data split with training data, 200 holdout validation data, and 200 holdout test data. Training terminates either when the validation prediction success rate has not improved for 100 epochs or when the maximum number of epochs (1000) is reached.
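The termination rule described above can be sketched as follows. This is an illustrative reconstruction rather than the authors' training script: the function name and the idea of passing in a precomputed sequence of validation success rates are assumptions, while the patience of 100 epochs and the cap of 1000 epochs come from the text.

```python
def stopping_epoch(val_success_rates, patience=100, max_epochs=1000):
    """Return the 1-based epoch at which training would stop.

    val_success_rates: iterable of per-epoch validation success rates.
    Returns None if the sequence ends before either criterion is met.
    """
    best = float("-inf")
    epochs_since_improvement = 0
    for epoch, rate in enumerate(val_success_rates, start=1):
        if rate > best:
            # Strict improvement resets the patience counter.
            best = rate
            epochs_since_improvement = 0
        else:
            epochs_since_improvement += 1
        if epochs_since_improvement >= patience or epoch >= max_epochs:
            return epoch
    return None
```

In a real training loop the same counter would be updated after each epoch's validation pass, and the model weights from the best epoch would typically be the ones evaluated on the test split (whether the authors do this is not stated).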
d.2 Reinforcement Learning in Robotic Manipulation
We use the environments provided by the BulletArm benchmark (bulletarm) implemented in the PyBullet simulator (pybullet). The workspace’s size is . The pixel size of the image observation is (and will be cropped to during training). The action space is for the change of position of the gripper; for the change of top-down rotation of the gripper; and for the open width of the gripper where 0 means fully closed and 1 means fully open. All environments have a sparse reward: +1 for reaching the goal and 0 otherwise. During training, we use 5 parallel environments where a training step is performed after all 5 parallel environments perform an action step. The evaluation is performed every 200 training steps. We implement the training in PyTorch (pytorch). We use the Adam optimizer (adam) with a learning rate of . The batch size is 128. The entropy temperature for SAC is initialized at . The target entropy is . The discount factor . The Prioritized Experience Replay (PER) (per) has a capacity of 100,000 transitions with prioritized replay exponent of and prioritized importance sampling exponent as in per. The expert transitions are given a priority bonus of .
The contrastive encoder of the FERM baseline has an encoding size of 50 as in ferm. The FERM baseline’s contrastive encoder is pre-trained for 1.6k steps using the expert data as in ferm. In DrQ, the number of augmentations for calculating the target and the number of augmentations for calculating the loss are both 2 as in drq.
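The interleaving of data collection, training, and evaluation described in this subsection can be sketched as a simple schedule. The names below are hypothetical and the SAC update itself is omitted; only the constants (5 parallel environments, evaluation every 200 training steps, a sparse +1/0 reward) are taken from the text.

```python
NUM_ENVS = 5         # parallel environments (from the text)
EVAL_INTERVAL = 200  # training steps between evaluations (from the text)


def sparse_reward(reached_goal):
    """Sparse reward: +1 for reaching the goal and 0 otherwise."""
    return 1.0 if reached_goal else 0.0


def schedule(num_training_steps):
    """Yield (training_step, transitions_collected, evaluate_now) tuples."""
    transitions = 0
    for step in range(1, num_training_steps + 1):
        # Every parallel environment performs one action step first.
        transitions += NUM_ENVS
        # A single training step on a sampled batch would happen here.
        yield step, transitions, step % EVAL_INTERVAL == 0
```

For example, after 400 training steps this schedule has collected 2,000 environment transitions and triggered two evaluations.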
d.3 Reinforcement Learning in DeepMind Control Suite
Sample images of each environment are shown in Figure 21. Environment observations are 3 consecutive frames of RGB images of size , in order to infer velocity and acceleration. Note that we use odd-sized images instead of used in drqv2, as the DrQv2 architecture contains a convolutional layer with stride and this breaks equivariance for even-sized spatial inputs (mohamed2020data). For each environment, an episode lasts steps where each step has a reward between and .
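The effect of input parity on a strided convolution can be illustrated in one dimension. The sketch below is a simplified stand-in rather than the DrQv2 encoder: it assumes a stride of 2, a kernel size of 3 with no padding, and an all-ones kernel, and it uses a flip of the input as the 1D analogue of a 180-degree image rotation. With an odd-length input the sampled windows are placed symmetrically, so flipping commutes with the layer; with an even-length input one trailing element is never covered and the symmetry is lost.

```python
def strided_window_sum(x, kernel=3, stride=2):
    """1D valid convolution with an all-ones (flip-symmetric) kernel."""
    return [sum(x[s:s + kernel])
            for s in range(0, len(x) - kernel + 1, stride)]


def commutes_with_flip(n):
    """Check whether flipping the input equals flipping the output."""
    x = [i * i for i in range(n)]  # arbitrary asymmetric test signal
    return strided_window_sum(x[::-1]) == strided_window_sum(x)[::-1]
```

Here `commutes_with_flip(7)` holds while `commutes_with_flip(8)` does not, which mirrors the reason for preferring odd-sized images.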
We modify the original DrQv2 by making the encoder map down to a smaller spatial output, leading to faster training. The second and third convolutional blocks have an added max-pooling layer, leading to a spatial output size of . As the equivariant version of DrQv2 has an additional convolutional layer after the action restriction, the non-equivariant version also has an additional convolutional layer at the end of the encoder. We also scale the number of channels by in order to preserve roughly the same number of parameters as the non-equivariant version.
The policy is evaluated by averaging the return of episodes every environment steps. In all DMC experiments, we plot the mean and the standard error over seeds. All other training details and hyperparameters are kept the same as in drqv2.
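The mean and standard error reported over seeds can be computed as in the following sketch. The function name is illustrative, and the use of the sample standard deviation (with the n - 1 denominator) is an assumption, since the text does not say which estimator is used.

```python
import math
import statistics


def mean_and_standard_error(returns_per_seed):
    """Return (mean, standard error of the mean) across seeds."""
    n = len(returns_per_seed)
    mean = statistics.fmean(returns_per_seed)
    # Standard error = sample standard deviation / sqrt(number of seeds).
    sem = statistics.stdev(returns_per_seed) / math.sqrt(n)
    return mean, sem
```

Each entry of `returns_per_seed` would be one seed's evaluation return at a given point in training, so the function is applied once per point on the learning curve.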
Appendix E Additional Experiments
e.1 Supervised Learning with More Symmetry Corruption Types
In this section, we repeat the experiment in Section 5 with more types of symmetry corruption. Figure 22 shows the 15 different corruptions. We also show the performance of ‘Equi’, ‘CNN’, and ‘CNN + Img Trans’ without the random crop augmentation used in Section 5 (labeled as ‘no Crop’ variations). The result is shown in Figure 23. First, comparing blue vs green, and purple vs orange, the equivariant network always outperforms the CNN with or without random crop augmentation, especially with fewer data. Second, comparing blue vs purple, and green vs orange, random crop generally helps both the equivariant network and the CNN. Third, comparing red vs green, and cyan vs orange, adding the image transformation augmentation improves the performance of the CNN. Note that the reverse condition is an outlier: there the equivariant network has incorrect equivariance, and the CNN methods without image transformation augmentation (green and orange) perform best.


e.2 RL in Manipulation without Random Crop
In this section, we demonstrate the performance of Equivariant SAC and CNN SAC without random crop augmentation using RAD. As is shown in Figure 24, both methods work poorly without the random crop augmentation.

e.3 RL in Manipulation with Occlusion Corruption
In this section, we perform the same experiment as in Section 6.1 with a different type of symmetry corruption: occlusion due to orthographic projection using a single camera. Instead of using an RGBD image observation as in Section 6.1, we take the depth channel from the RGBD image and perform an orthographic projection at the gripper’s position (Figure 25). This is the same process used in corl22 to generate a top-down image for equivariant learning. However, since we have only one camera instead of two as in the prior work, the orthographic projection has missing depth values due to occlusion, which leads to an extrinsic equivariant constraint. Figure 26 shows the results. As in Section 6.1, Equivariant SAC outperforms all baselines by a significant margin.

e.4 RL in DeepMind Control Suite
Figure 27 is another visualization of equivariant vs non-equivariant DrQv2 on the original pendulum swingup environment. As each method has a failed seed, we plot all runs with slightly different color shades. If we exclude the failed run from each method, it is clear that equivariant DrQv2 learns faster than the non-equivariant version.

e.4.1 Increasing symmetry corruptions
In these experiments, we modify some domains to have different levels of symmetry-breaking corruptions. For cartpole and cup catch, we either remove the gridded floor and background (None) to make the observation perfectly equivariant or keep the floor and background and further change the camera angle by rolling (), increasing the level of corruption. For reacher, we use the same modifications but tilt the camera instead of rolling. See Figure 3 for sample images. In order to see the effects of increasing corruption on learning, we plot the mean discounted reward when both methods have converged (k frames for cartpole and cup catch, M frames for reacher). Figure 28 shows that both the equivariant and non-equivariant DrQv2 surprisingly perform quite well across all corruption levels, with the exception of on reacher. The equivariant policy seems to converge to a slightly higher discounted reward than the non-equivariant version, though the difference is not significant. On reacher, changing the camera angle may have affected both methods by making the task more difficult for both an equivariant and regular CNN encoder.

Appendix F Baseline Architecture Search
f.1 CNN SAC Architecture Search
This section demonstrates the architecture search for CNN SAC. We consider three different architectures (all with a similar number of trainable parameters): 1) conv (Figure 18): a CNN with the same structure as Equivariant SAC, where all layers are implemented using convolutional layers. 2) fc1 (Figure 31): a CNN that replaces some layers in 1) with fully connected layers. 3) fc2 (Figure 31): similar to 2), but with fewer convolutional layers and more weights in the FC layer. We evaluate the three network architectures with SAC equipped with random crop augmentation using RAD (rad).
Figure 31 shows the result, where all three variations have similar performance. We use conv in the main paper since its structure is similar to that of Equivariant SAC.
f.2 FERM Architecture Search




This section demonstrates the architecture search for FERM. We consider four different architectures: 1) sim total 1 (Figure 33) and 2) sim total 2 (Figure 19) are two different architectures with a similar number of total trainable parameters to Equivariant SAC. 3) sim enc (Figure 32) has a similar number of trainable parameters in its encoder to Equivariant SAC’s encoder. Note that since FERM shares an encoder between the actor and the critic while Equivariant SAC has separate encoders, matching the encoder size leads to fewer total parameters in FERM than in Equivariant SAC. 4) ferm ori (Figure 34) is the network architecture used in the FERM paper (ferm).
Figure 35 shows the comparison across the four architectures. ‘sim total 2’ has a marginal advantage over the other three variations, so we use it in the main paper.
f.3 SEN Architecture Search
This section shows the architecture search for SEN. We consider three variations (all with a similar number of trainable parameters): 1) SEN conv (Figure 36): all layers are implemented using convolutional layers. 2) SEN fc1 (Figure 37) and 3) SEN fc2 (Figure 20) replace some layers in 1) with fully connected layers.
Figure 38 shows the comparison across the three variations. ‘SEN fc2’ shows the best performance.


