Analytic Manifold Learning: Unifying and Evaluating Representations for Continuous Control

06/15/2020, by Rika Antonova et al.

We address the problem of learning reusable state representations from streaming high-dimensional observations. This is important for areas like Reinforcement Learning (RL), which yields non-stationary data distributions during training. We make two key contributions. First, we propose an evaluation suite that measures alignment between latent and true low-dimensional states. We benchmark several widely used unsupervised learning approaches. This uncovers the strengths and limitations of existing approaches that impose additional constraints/objectives on the latent space. Our second contribution is a unifying mathematical formulation for learning latent relations. We learn analytic relations on source domains, then use these relations to help structure the latent space when learning on target domains. This formulation enables a more general, flexible and principled way of shaping the latent space. It formalizes the notion of learning independent relations, without imposing restrictive simplifying assumptions or requiring domain-specific information. We present mathematical properties, concrete algorithms for implementation and experimental validation of successful learning and transfer of latent relations.

Code repository: bulb – Benchmarking Unsupervised Learning with pyBullet (https://github.com/contactrika/bulb).
1 Introduction

In this work, we address the problem of learning reusable state representations from streaming high-dimensional observations. Consider the case when a deep reinforcement learning (RL) algorithm is trained on a set of source domains. Low-dimensional state representations could be extracted from intermediate layers of RL networks, but they might not be reusable on a target domain with different rewards or dynamics. To aid transfer and ensure non-degenerate embeddings, it is common to add unsupervised learning objectives. However, the quality of resulting representations is usually not evaluated rigorously. Moreover, constructing and prioritizing such objectives is done manually: auxiliary losses are picked heuristically and hand-tuned for transfer to a new set of domains or tasks.

As the first part of our contribution, we provide a set of tools and environments to improve evaluation of learned representations for use in continuous control. We evaluate commonly used unsupervised approaches and explain new insights that highlight the need for critical analysis of existing approaches. Our evaluation suite provides tools to measure alignment between the latent state from unsupervised learners and the true low-dimensional state from the physics simulator. Furthermore, we introduce new environments for manipulation with multiple objects and the ability to vary their complexity: from geometric shapes to mesh scans and visualizations of real objects. We show that, while alignment with the true state is achieved on the simpler benchmarks, the new environments present a formidable challenge: existing unsupervised objectives do not guarantee robust and transferable state representation learning.

The second part of our contribution is a formalization of learning latent objectives from a set of source domains. We describe the mathematical perspective of this approach as finding a set of functionally independent relations that hold for the data sub-manifold. We explain theoretical properties and guarantees that this perspective offers. Previous work constructed latent relations based on domain knowledge or algorithmic insights, e.g. using continuity wiskott2002slow, mutual information with prior states anand2019unsupervised, or consistency with a forward or inverse model (see lesort2018state for a survey). Our formulation offers a unified view, allowing us to leverage known relations, discover new ones and incorporate relations into joint training for transfer to target domains. We describe algorithms for concrete implementation and visualize the learned relations on analytic and physics-based domains. In our final set of experiments, we show successful transfer of relations learned from source domains with simple geometric shapes to target domains that contain objects with real textures and 3D-scanned meshes. We also show that our approach obtains improved latent space encoder mappings with smaller distortion variability.

2 Evaluation Suite for Unsupervised Learning for Continuous Control

Figure 1: Evaluation suite environments. Left: Standard PyBullet envs for which our suite yields both pixels and low-dimensional state. Right: Proposed new advanced domains with YCB objects.

Reinforcement learning (RL) has shown strong progress recently franccois2018introduction, and RL for continuous control is particularly promising for robotics kober2013reinforcement. However, training for each robotics task from scratch is prohibitively expensive, especially for high-dimensional observations. Unsupervised learning could help obtain low-dimensional latent representations, e.g. with variational autoencoder (VAE) kingma2013auto variants. However, evaluation of these methods has mostly focused on static datasets, with the limiting assumption that the training data distribution is stationary deeprobo2018limits. Moreover, advanced approaches usually report best-case results, achieved only with the exact parameters that the authors found to work for a given static dataset. Obtaining reconstructions that are clear enough to judge whether all important information is encoded in the latent state can still require days or weeks of training IODINE19; heljakka2018pioneer. These limitations severely impair the adoption of unsupervised representation learning in robotics. In stark contrast to the learning community, the vast majority of roboticists still rely on hand-crafted low-dimensional features.

We propose an evaluation suite that helps analyze the alignment between the learned latent state and the true low-dimensional state. Unsupervised approaches receive frames that an RL policy yields during its own training: a non-stationary stream of RGB images. The alignment of the learned latent state and the true state is measured periodically as training proceeds. For this, we do a regression fit using a small fully-connected neural network, which takes latents as inputs and is trained to produce low-dimensional states as outputs (position & orientation of objects; robot joint angles, velocities, contacts). The quality of alignment is characterized by the resulting test error rate. This approach helps quantify latent space quality without the need for detailed reconstructions. To connect our suite to existing benchmarks, we extend the OpenAI gym interface brockman2016openai of widely used robotics domains so that both pixel-based and low-dimensional states are reported. We use an open-source simulator, PyBullet coumans2019. Simulation environments are parallelized, ensuring scalability. We introduce advanced domains utilizing meshes from 3D scans of real objects from the YCB dataset calli2015ycb. This yields realistic object appearances and dynamics. Our RearrangeYCB domain models object rearrangement tasks, with variants for using realistic vs basic robot arms. The RearrangeGeom domain offers an option with simple geometric shapes instead of object scans. The YCB-on-incline domain models objects sliding down an incline, with options to change friction and apply external forces; Geom-on-incline offers a variant with simple single-color geometric shapes. Figure 1 gives an overview.
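To make the probe concrete, below is a minimal sketch of such an alignment regressor; the layer sizes, optimizer and training schedule here are our assumptions for illustration, and the suite's own implementation may differ.

```python
import torch
import torch.nn as nn

def alignment_test_error(latents, true_states, epochs=200, lr=1e-3, holdout=0.2):
    """Fit a small fully-connected regressor from latent codes to true low-dimensional states
    and return the held-out MSE (lower = better alignment)."""
    n_test = max(1, int(len(latents) * holdout))
    z_tr, z_te = latents[:-n_test], latents[-n_test:]
    s_tr, s_te = true_states[:-n_test], true_states[-n_test:]
    probe = nn.Sequential(nn.Linear(z_tr.shape[1], 64), nn.ReLU(),
                          nn.Linear(64, 64), nn.ReLU(),
                          nn.Linear(64, s_tr.shape[1]))
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        nn.functional.mse_loss(probe(z_tr), s_tr).backward()
        opt.step()
    with torch.no_grad():
        return nn.functional.mse_loss(probe(z_te), s_te).item()
```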

2.1 Benchmarking Latent State Alignment of Unsupervised Approaches

To demonstrate usage and benefits of the suite, we evaluated several widely used and recently proposed unsupervised learning approaches. The unsupervised approaches get 64x64 pixel images sampled from replay buffers, filled by PPO RL learners schulman2017proximal. Figures 2 and 3 show results for the following unsupervised approaches (Appendix A gives more detailed descriptions and learning parameters): a VAE kingma2013auto with a 4-layer convolutional encoder and corresponding de-convolutional decoder; a VAE with a replay buffer that retains 50% of frames from the beginning of training (our modification of the VAE for improved performance on a wider range of RL policies, sketched below); β-VAE higgins2017beta, a VAE with a parameter β to encourage disentanglement (we tried several β values and also included the replay enhancement); a sequential VAE that reconstructs a sequence of frames; a predictive variant (PRED) that, given a sequence of frames, constructs a predictive sequence of subsequent frames; the disentangled sequential autoencoder of yingzhen2018disentangled, which uses structured variational inference to encourage separation of static and dynamic aspects of the latent state; and SPAIR SPAIR19, a spatially invariant and faster version of AIR AIR16 that imposes a particular structure on the latent state.
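As a rough illustration of the replay modification mentioned above, here is a minimal sketch of a buffer that permanently retains a share of early frames; the class name, capacity and eviction rule are assumptions, not the exact implementation.

```python
import random

class FrameReplay:
    """Buffer that keeps a fixed share of early frames for the whole run and mixes them into batches."""
    def __init__(self, capacity=5000, retain_frac=0.5):
        self.retained, self.recent = [], []
        self.retain_cap = int(capacity * retain_frac)
        self.recent_cap = capacity - self.retain_cap

    def add(self, frame):
        if len(self.retained) < self.retain_cap:
            self.retained.append(frame)          # early frames are never evicted
        else:
            self.recent.append(frame)
            if len(self.recent) > self.recent_cap:
                self.recent.pop(random.randrange(len(self.recent)))  # evict a random recent frame

    def sample(self, batch_size):
        pool = self.retained + self.recent
        return random.sample(pool, min(batch_size, len(pool)))
```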

Figure 2: Benchmarking alignment with the true low-dimensional state. Plots show mean test error of NN regressors trained with current latent codes as inputs and true states (robot positions, velocities, contacts) as outputs. 90% confidence intervals over 6 training runs for each unsupervised approach are shown. Training uses frames from replay buffers (1024 frames per batch; 10 batches per epoch, 50 for locomotion). Top row: performance on frames from the current RL policy; middle row: frames from a random policy. 1st column shows results for CartPole and InvertedPendulum for position & angle; 2nd column: for velocity. 3rd column shows aggregated results for position, velocity and contacts for HalfCheetah; 4th column shows these for the Ant domain.

Figure 2 shows results on multicolor versions of the CartPole, InvertedPendulum, HalfCheetah and Ant domains (multicolor to avoid learning trivial color-based features). We evaluated using two kinds of policies: the current RL learner policy and a random policy. Success on frames from the random policy is needed for transfer: when learning a new task, initial frames are more similar to those from a random policy than to those from a final source-task policy. The basic VAE performed poorly on frames from the random policy. We discovered that this can be alleviated by replaying frames from the initial random policy; the resulting replay-enhanced VAE offers good alignment for positions. Surprisingly, β-VAE offered no improvement over the basic VAE: the best-performing β value was slightly worse than the corresponding VAE baseline on the pendulum domains (shown in Figure 2), and the rest did significantly worse (omitted from plots). The sequential approaches offered significant gains when measuring alignment for velocity; among them, the approach with the simplest architecture performed best on the pendulum domains. For aggregated performance on position, velocity and contacts (i.e. whether robot joints touch the ground) in the locomotion domains, PRED outperformed the alternatives on frames from one policy type and was second-best on the other. Overall, this set of experiments was illuminating: simpler approaches were often better than more advanced ones.

Figure 3: Evaluation on the RearrangeGeom domain (reconstructing YCB objects was difficult for existing approaches, so RearrangeYCB was too challenging). The VAE-based learner encoded the angle of the main robot joint and the location & partly the orientation (major axis) of the largest objects. SPAIR encoded (rough) locations quickly, but did not improve with longer training (bounding boxes not tight).

For our newly proposed domains with multiple objects, the first surprising result was that all of the approaches we tested failed to achieve clear reconstructions of objects from the YCB dataset. This was despite attempts to use larger architectures, up to 8 layers with skip connections, similar to park2019deepsdf. Figure 3 shows results for the two approaches that handled RearrangeGeom best; the remaining approaches failed to reconstruct even this simpler variant. This indicates that our multi-object benchmark is a much-needed addition to the current continuous control benchmark pool. While single-object benchmarks might still be challenging for control, they could be inherently simpler for latent state learning and reconstruction.

Overall, our analysis shows that structuring the latent space can be beneficial, but has to be done such that it does not impair the learning process and resulting representations. This is not trivial, since seemingly beneficial objectives that worked well in the past could be detrimental on new domains. However, forgoing structure completely can fail on more advanced scenes. Hence, in the following sections we show an alternative direction: a principled way to learn a set of general rules from source domains, then apply them to structure latent space of unsupervised learners on target domains.

3 Analytic Manifold Learning

We now motivate the need to unify learning latent relations, then provide a rigorous and general mathematical formulation for this problem. Let x_t denote a high-dimensional (observable) state at time t and s_t denote the corresponding low-dimensional or latent state. x_t could be an RGB image of a scene with a robot & objects, while s_t could contain robot joint angles, object poses, and velocities. Consider an example of a latent relation: the continuity (slowness) principle wiskott2002slow; kompella2011incremental. It postulates continuity in the latent states, implying that sudden changes are unlikely. It imposes a loss ||s_{t+1} - s_t||^2, with s_t = E(x_t) and encoder E. A related heuristic from anand2019unsupervised maximizes mutual information between parts of consecutive latent states. Such approaches may be viewed as postulating concrete latent relations f(s_t, s_{t+1}) = 0, where f is the squared distance between s_t and s_{t+1} for the slowness principle, and a more complicated relation for anand2019unsupervised. Ultimately, all these are heuristics coming from intuition or prior knowledge. However, only a subset of them might hold for a given class of domains. Moreover, it would be tedious and error-prone to manually compose and incorporate a comprehensive set of such heuristics into the overall optimization process.
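For example, the slowness heuristic can be written as such a relation that should (approximately) vanish on consecutive on-manifold latent states; a minimal sketch, with the function name purely illustrative:

```python
import numpy as np

def slowness_relation(s_t, s_tp1):
    """The slowness/continuity heuristic expressed as a relation that should be ~0 on-manifold:
    consecutive latent states are postulated to stay close."""
    return float(np.sum((np.asarray(s_tp1) - np.asarray(s_t)) ** 2))
```

AML aims to learn such relations from data instead of hand-specifying them.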

We take a broader perspective. Let f(τ) = 0 define a relation that holds on a set of state sequences τ. This set could contain state sequences from a set of source domains. We start by learning a relation f_1; then learn f_2 that differs from f_1; then learn f_3 different from f_1, f_2, and so on. Overall, we aim to learn a set of relations that are (approximately) independent, and we define independence rigorously. To understand why rigor is important here, recall the significance of the definition of independence in linear algebra: it is central to the theory and algorithms in that field. Extending the notion of independence to our more general nonlinear setting is not trivial, since naive definitions can yield unusable results. Our contribution is developing rigorous definitions of independence, and ensuring the result can be analyzed theoretically & used for practical algorithms.

3.1 Mathematical Formulation

Let X be the ambient space of all possible latent state sequences (of some fixed length). Let M ⊂ X be the submanifold of actual state sequences that a dynamical system from one of our domains could generate (under any control policy). A common view of discovering M is to learn a mapping that produces only plausible sequences as output (the ‘mapping’ view). Alternatively, a submanifold can be specified by describing all equations (i.e. relations) that have to hold for points in the submanifold.

We are interested in finding relations that are in some sense independent. In linear algebra, a dependency is a linear combination of vectors with constant coefficients. In our nonlinear setting the analogous notion is that of a syzygy. A collection of functions (a_1, …, a_n) is called a syzygy for relations (f_1, …, f_n) if the sum a_1 f_1 + … + a_n f_n is zero. Observe that this sum is a linear combination of relations with coefficients in the ring of functions. If there is no syzygy in which not all coefficients are zero, then f_1, …, f_n are independent. However, this notion of independence is too strict for our case, since it deems any two relations dependent: the syzygy with coefficients a_1 = f_2, a_2 = -f_1 gives f_2 f_1 - f_1 f_2 = 0 for any f_1, f_2. Hence, we define restricted syzygies.

Definition 3.1 (Restricted Syzygy).

A restricted syzygy for relations f_1, …, f_{n+1} is a syzygy with the last entry equal to 1, i.e. a tuple (a_1, …, a_n, 1) with a_1 f_1 + … + a_n f_n + f_{n+1} = 0.

Definition 3.2 (Restricted Independence).

f_{n+1} is independent from f_1, …, f_n in a restricted sense if the equality a_1 f_1 + … + a_n f_n + f_{n+1} = 0 is impossible, i.e. if there exists no restricted syzygy for f_1, …, f_{n+1}.

For brevity, we write f_{1:n} for the collection f_1, …, f_n. Using the above definitions, we construct a practical algorithm (Section 3.2) for learning independent relations. The overall idea is: while learning the relations f_i, we are also looking for restricted syzygies. Finding one would mean the f_i's are dependent, so we augment the loss for learning the new relation to push it away from being dependent. We proceed sequentially: first learning f_1, then f_2 while ensuring no restricted syzygies appear for f_1, f_2, then learning f_3 and so on. Section 5 explains motivations for learning sequentially. For training the f_i's we use on-manifold data: sequences τ from our dynamical system. Restricted syzygies are trained using off-manifold data τ_off, because we aim for independence of the f_i's on X, not restricted to M (on M the f_i's should be zero). The τ_off do not lie on our data submanifold and can come from a thickening of on-manifold data or can be random (when the dimension of X is large, the probability that a random sequence satisfies equations of motion is insignificant). Independence in the sense of Definition 3.2 is the same as saying that f_{n+1} does not lie in the ideal generated by f_1, …, f_n, with ‘ideal’ defined as in abstract algebra (see Appendix B.1). Hence, the ideal generated by f_1, …, f_{n+1} is strictly larger than that generated by f_1, …, f_n alone, because we have added at least one new element (the new relation f_{n+1}). We prove that in our setting the process of adding new independent relations will terminate (proof in Appendix B.1):

Theorem 3.1.

When using Definition 3.2 for independence and real-analytic functions to approximate the relations, the process of starting with a relation f_1 and iteratively adding new independent relations will terminate.

If M is real-analytic (i.e. M is cut out by a finite set of equations of the type h = 0 for a finite set of real-analytic functions h), then after the process terminates, the set where all learned relations hold will be precisely M. Otherwise, the process will still terminate, having learned all possible analytic relations that hold on M. By a theorem of Akbulut and King akbulut1992approximating, any smooth submanifold of Euclidean space can be approximated arbitrarily well by an analytic set, so in practice the differences would be negligible.

To ensure that each new relation decreases the data manifold dimension, we could additionally prohibit f_{n+1} from having any syzygy in which f_{n+1} itself is not expressible in terms of f_1, …, f_n. With such a definition (below) we can guarantee that a sequence of n independent relations restricts the data to a submanifold of codimension at least n (Theorem 3.2, which we prove in Appendix B.1).

Definition 3.3 (Strong Independence).

f_{n+1} is strongly independent from f_1, …, f_n if the equality a_1 f_1 + … + a_n f_n + a_{n+1} f_{n+1} = 0 implies that a_{n+1} is expressible as a_{n+1} = b_1 f_1 + … + b_n f_n.

Theorem 3.2.

Suppose f_1, …, f_n is a sequence of analytic functions, each strongly independent of the previous ones. Denote by M' the part of the learned data manifold lying in the interior of the compact domain B (defined in Appendix B.1). Then the dimension of M' is at most D - n, where D is the dimension of the ambient space X.

In addition, we construct an alternative approach with similar dimensionality reduction guarantees, which ensures that the learned relations differ to first order. For this we use a notion of independence based on transversality, with the following definition and lemmas (with proofs in Appendix B.1):

Lemma 3.1.

Dependence as in Definition 3.2 implies that the gradients of f_{n+1}, f_1, …, f_n are linearly dependent along M, i.e. the relations are also dependent in the transversality sense (Definition 3.4 below).

Definition 3.4 (Transversality).

If for all points τ in M the gradients of f_1, …, f_{n+1} at τ, i.e. ∇f_1(τ), …, ∇f_{n+1}(τ), are linearly independent, we say that f_{n+1} is transverse to the previous relations.

Using transversality, we deem f_{n+1} to be independent from f_1, …, f_n if the gradient of f_{n+1} does not lie in the span of the gradients of f_1, …, f_n anywhere on M. With this, an f_{n+1} that only differs from the previous relations in higher-order terms would be deemed ‘not new’. This formulation is natural from the perspective of differential geometry. Let H_i be the hypersurface defined by f_i: the set of points where f_i = 0. Each H_i contains M. If the gradient of f_{n+1} is linearly independent from the gradients of f_1, …, f_n, then the corresponding hypersurfaces intersect transversely along M.

Lemma 3.2.

For once-differentiable f_1, …, f_n s.t. the f_i's are transverse along their common intersection H_1 ∩ … ∩ H_n, this intersection is a submanifold of X of dimension D - n.

The notion of independence defined via transversality is infinitesimal and symmetric w.r.t. permuting the f_i's. This is useful in settings where many relations could be discovered, because it is then better to find relations whose first-order behavior differs. In cases where a guaranteed decrease in dimension is not needed, using restricted syzygies could allow a more flexible search for more expressive relations.

3.2 Learning Latent Relations

Algorithm 1: Analytic Manifold Learning (AML)
1: collect rollouts from RL actors; train f_1 with loss from Eq. 1
2: for n = 1, 2, … do
3:   if aiming_for_transversality then
4:     train f_{n+1} with loss from Eq. 2
5:   else  // using syzygies
6:     train f_{n+1} with loss from Eq. 1
7:     for j = 1, 2, … do
8:       generate τ_off; train restricted syzygy g_j
9:       if g_j does not vanish on τ_off then break  // f_{n+1} independent
10:      while g_j ≈ 0 on τ_off do
11:        freeze g_j; train f_{n+1} with loss from Eq. 3
Figure 4: Left: algorithm for learning latent relations. Top right: using transversality. Bottom right: training with a syzygy g to uncover whether f_{n+1} is dependent, then using g to modify f_{n+1}'s loss. Orange & blue denote NNs whose weights are being trained. Gray denotes learned relations whose NNs are frozen.

Here we describe the algorithm with relations and restricted syzygies approximated by neural networks. Each relation f is represented by a neural network (NN) that takes a sequence τ of latent/low-dimensional states as input. The output of f is a scalar. We use f to denote both the relation and the NN used to learn it. If f outputs 0 for on-manifold data, this implies f has learned a function with f(τ) = 0 on M, which captures a relation between states of the underlying dynamical system. f is trained on minibatches of on-manifold data points using loss gradients ∇_θ L, where ∇_θ denotes the gradient w.r.t. the NN weights of f. We need to make f(τ) ≈ 0 for on-manifold data, while avoiding trivial relations (e.g. all NN weights equal to zero). Hence, in the loss we minimize |f(τ)| / ||∇_τ f(τ)||, where ∇_τ f(τ) is the gradient of f with respect to the input point τ. The gradient norm ||∇_τ f(τ)|| is the maximal ‘slope’ of the linearization of f at τ, so |f(τ)| / ||∇_τ f(τ)|| is the distance from τ to the nearest point where this linearization vanishes (height/slope = distance). Hence, this quantity is a proxy for the distance from τ to the vanishing locus of f. This measure of vanishing avoids scaling problems (see Appendix B.2). We also maximize a regularization term to further constrain f. Equation 1 summarizes our loss for f:

(1)
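A minimal PyTorch sketch of the on-manifold part of this loss, i.e. the distance proxy |f(τ)| / ||∇_τ f(τ)||, is shown below; the network architecture and the omitted regularization term are assumptions, since their exact form is not fully specified in the text above.

```python
import torch
import torch.nn as nn

class RelationNet(nn.Module):
    """Scalar relation f(tau): input is a flattened sequence of latent states."""
    def __init__(self, seq_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(seq_dim, hidden), nn.Tanh(),   # analytic activations, cf. Theorem 3.1
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1))

    def forward(self, tau):
        return self.net(tau).squeeze(-1)

def on_manifold_vanishing_loss(f, tau_on, eps=1e-8):
    """Distance proxy |f(tau)| / ||grad_tau f(tau)||, averaged over an on-manifold minibatch."""
    tau_on = tau_on.clone().requires_grad_(True)
    vals = f(tau_on)
    grads = torch.autograd.grad(vals.sum(), tau_on, create_graph=True)[0]
    slope = grads.norm(dim=-1).clamp_min(eps)
    return (vals.abs() / slope).mean()
```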

We proceed sequentially: first learn f_1, then f_2, and so on. Suppose that so far we have learned (approximately) independent relations f_1, …, f_n. We then keep their NN weights fixed and learn an initial version of the next relation f_{n+1}. To obtain an f_{n+1} that is transverse to f_1, …, f_n (Definition 3.4), we augment the loss as follows. We compute the gradient of each relation w.r.t. the input τ, e.g. ∇_τ f_1 for f_1. Making f_{n+1} transverse to f_1, …, f_n means ensuring that ∇_τ f_{n+1} is linearly independent of ∇_τ f_1, …, ∇_τ f_n. We optimize a computationally efficient numerical measure of this: maximize the angles between ∇_τ f_{n+1} and all the previous ∇_τ f_i. Such a measure encourages transversality of subsets of relations and strongly discourages small angles. Our overall measure of transversality is the product of sines of pairwise angles, with a log for stability (Appendix B.3.1 gives further discussion):

(2)
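A sketch of this transversality measure (sum of log-sines of the angles between the new relation's input gradient and the frozen relations' gradients); the exact weighting and sign conventions used in Eq. 2 are assumptions here.

```python
import torch
import torch.nn.functional as F

def transversality_bonus(f_new, frozen_relations, tau, eps=1e-6):
    """Sum of log-sines of angles between grad of the new relation and grads of frozen ones.
    Maximizing this (e.g. subtracting it, scaled, from the Eq. 1 loss) discourages small angles."""
    tau = tau.clone().requires_grad_(True)
    g_new = torch.autograd.grad(f_new(tau).sum(), tau, create_graph=True)[0]
    bonus = torch.zeros(())
    for f_prev in frozen_relations:
        g_prev = torch.autograd.grad(f_prev(tau).sum(), tau)[0]   # frozen: treated as constants
        cos = F.cosine_similarity(g_new, g_prev, dim=-1).clamp(-1 + eps, 1 - eps)
        bonus = bonus + torch.log(torch.sqrt(1.0 - cos ** 2) + eps).mean()
    return bonus
```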

For independence based on Definition 3.2, we instead learn a restricted syzygy g. Training data for g is comprised of: 1) off-manifold points τ_off (defined in Section 3.1) and 2) f_1(τ_off), …, f_{n+1}(τ_off), i.e. outputs from the learned relations with τ_off fed as inputs. These relation outputs are passed directly to the next-to-last layer of g, whose activations play the role of the coefficients a_1, …, a_n. The last layer of g computes a dot product of (a_1, …, a_n, 1) and (f_1(τ_off), …, f_n(τ_off), f_{n+1}(τ_off)). We use a simple L1 loss for training g. If g outputs 0 at convergence, f_{n+1} is not independent. In this case, we freeze the weights of g and continue to train f_{n+1} with an augmented loss. We use gradients passed through g to push f_{n+1} away from the solution that made it possible to learn g:

(3)

This loss encourages adjusting f_{n+1} such that it makes the outputs of the (frozen) syzygy g non-zero. Once the loss in Eq. 3 is minimized, we can attempt to learn another syzygy, and so on, until we cannot uncover any new dependencies. Then f_{n+1} can be declared (approximately) independent of f_1, …, f_n and we can proceed to learn f_{n+2}. All relations, syzygies and their inputs are in latent space, so the networks are small & quick to train.
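A sketch of the restricted-syzygy network described above; the exact wiring of the relation outputs into the coefficient layers is simplified and the names are placeholders.

```python
import torch
import torch.nn as nn

class RestrictedSyzygy(nn.Module):
    """g(tau) = a_1(tau) f_1(tau) + ... + a_n(tau) f_n(tau) + f_new(tau); last coefficient fixed to 1."""
    def __init__(self, seq_dim, n_prev, hidden=128):
        super().__init__()
        self.coef_net = nn.Sequential(
            nn.Linear(seq_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_prev))              # produces the coefficients a_1..a_n

    def forward(self, tau_off, f_prev_vals, f_new_val):
        a = self.coef_net(tau_off)                  # [batch, n_prev]
        return (a * f_prev_vals).sum(dim=-1) + f_new_val

# Train g by minimizing |g| (L1) on off-manifold data. If it converges to ~0, f_new is dependent:
# freeze g and add a term to f_new's loss that pushes |g| back up (Eq. 3).
```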

An additional benefit of our formulation is that prior knowledge can be incorporated without restricting the hypothesis space. Relations can be pre-trained in a supervised way: to output the values that a prior heuristic produces on- and off-manifold. Then, the relations can be further trained using on-manifold data, and if the prior knowledge is wrong, a relation would move away from the wrong heuristic during further training.
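A minimal sketch of this warm-starting option; the heuristic callable and hyperparameters are placeholders.

```python
import torch

def pretrain_relation(f, heuristic, taus, epochs=200, lr=1e-3):
    """Warm-start a relation network by regressing it onto a hand-crafted heuristic's values."""
    opt = torch.optim.Adam(f.parameters(), lr=lr)
    targets = heuristic(taus).detach()              # heuristic: any callable producing target values
    for _ in range(epochs):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(f(taus), targets)
        loss.backward()
        opt.step()
    return f
```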

4 Evaluating Analytic Manifold Learning (AML)

Figure 5: Learning relations on a noisy version of the analytic domain.

We evaluate our AML approach with 3 sets of experiments: 1) learning on an analytic domain and visualizing relations in 3D; 2) handling dynamics with friction and drag on a block-on-incline domain; 3) employing learned relations to get improved representations on the YCB-on-incline target domain.

For our analytic domain, on-manifold data comes from the intersection of a hyperboloid and a plane. The top row of Figure 5 shows results using restricted syzygies. We visualize the intersection of the learned relations (i.e. the intersection of the zero-level sets of these relations); the zero-level sets of individual relations are shown next to it. On the second row, we show training with transversality: it recovers two simple relations – a plane and a hollow cylinder – along with relations resembling smoothed cones. Transversality allows capturing information with a small number of general relations. In contrast, relations found using syzygies have more complicated shapes and can be similar in some regions, as expected. This could be useful when we need to avoid large changes, e.g. for fine-tuning or for flexible partial transfer using subsets of relations.

Figure 6: Phase space plots for on-manifold data and relations learned with AML for block-on-incline.

Next, we evaluate AML on a physics domain: a block sliding down an incline. The block is given a random initial velocity; gravity, friction and drag forces then determine its further motion. On-manifold data consists of noisy position & velocity of the block at the start and end of trajectories. Figure 6 shows AML with transversality (Appendix B.3.3 gives results with syzygies). We visualize phase space plots: arrows show the change in position & velocity after a fixed duration of sliding (scaled to fit). The left plots demonstrate generalization: AML is only given training data from a limited range of start positions & velocities, but is able to generalize beyond that range. The middle plots show a high-friction case; the right plots show a high-drag case. Overall, these results show that AML can generalize beyond training data ranges and capture non-linear dynamics.

Lastly, we show transfer to the YCB-on-incline domain (rightmost in Figure 1) and compare AML to the leading approaches from our earlier experiments. We note that while SPAIR did reasonably well on RearrangeGeom, it had significant problems reconstructing the existing benchmarks. Decoding RearrangeYCB was problematic for all approaches (see Appendix A); even supervised decoder training failed (with true states as training input). Decoder design is outside the scope of this work. Hence, we evaluate AML transfer using YCB-on-incline, which has challenging dynamics & images, but is still tractable for decoding. First, AML learns relations from Geom-on-incline. The incline angle, friction and object pose are initialized randomly. Actions are random forces that push objects along the incline. AML is given the incline angle, the position & velocity at two subsequent steps, and the applied action.

Figure 7: YCB-on-incline: mean of 6 training runs, shaded areas show one STD.

Then, we train an unsupervised learner on the target YCB-on-incline domain. PPO RL drives the distribution of RGB frames: the RL learner gets high rewards for pushing objects to stay in the middle of the incline. We impose AML relations by extending the latent part of an ELBO-based loss, penalizing encoder outputs that violate the learned relations. The resulting AML-augmented learner (in both the transversality and syzygy variants) gets better latent state alignment for object position than the same learner without AML relations imposed (see the top plot in Figure 7).
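One simple way such a penalty could be added to the latent part of the learner's loss is sketched below; whether an absolute-value or squared penalty is used, and how it is weighted, is not specified in the text, so those choices are assumptions.

```python
def aml_latent_penalty(frozen_relations, z_seq, weight=1.0):
    """Penalty on encoder outputs that violate the frozen AML relations; added to the ELBO-based loss.
    z_seq: latent codes for a short window of frames, shaped to match the relations' input."""
    penalty = sum(rel(z_seq).abs().mean() for rel in frozen_relations)
    return weight * penalty
```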

Another important quality measure of a latent space mapping is how much it distorts the true data manifold. We quantify this as follows (on 10K test points): take pairs of low-dimensional states and the corresponding pixel-based observations, then compute the distortion coefficient as the ratio of the Euclidean distance between the encoded latent codes to the Euclidean distance between the true low-dimensional states. An encoder that yields low variance of these coefficients better preserves the geometry of the low-dimensional manifold (up to overall scale). This measure is related to approaches surveyed in distort18; bartal2019dimensionality (see Appendix B.3.2). The bottom plot in Figure 7 confirms that AML helps achieve lower distortion variability.
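A sketch of how this distortion-variability measure could be computed, assuming the coefficient is the ratio of latent-space to true-state pairwise distances, normalized by the mean ratio to remove overall scale:

```python
import numpy as np

def distortion_variability(latents, true_states, n_pairs=10000, seed=0):
    """Variance of pairwise distance ratios d(z_i, z_j) / d(s_i, s_j), scale-normalized."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(latents), n_pairs)
    j = rng.integers(0, len(latents), n_pairs)
    keep = i != j
    d_lat = np.linalg.norm(latents[i[keep]] - latents[j[keep]], axis=1)
    d_true = np.linalg.norm(true_states[i[keep]] - true_states[j[keep]], axis=1)
    ratios = d_lat / np.maximum(d_true, 1e-8)
    return np.var(ratios / ratios.mean())
```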

Results presented in Figure 7 show that imposing AML relations helps improve the latent space mapping of the unsupervised learner when training on RGB frames. The distribution of the frames is non-stationary, since they are sampled using the current (changing) policy of an RL learner. Overall, the above setup aims to demonstrate the potential for sim-to-real transfer. In this case, Geom-on-incline plays the role of a simulator, while frames from YCB-on-incline act as surrogates for ‘real’ observations. Note that YCB objects have realistic visual appearances and their dynamics are dictated by meshes obtained from 3D scans of real objects. Hence, there is a non-trivial mismatch between the dynamics of the simple shapes of the Geom-on-incline domain vs the realistic shapes of the YCB-on-incline domain.

5 Related Work

Scalable simulation suites for continuous control roboschool; tassa2018deepmind; pybulletgym bolstered progress in deep RL. However, advanced benchmarks for unsupervised learning from non-stationary data are lacking, since the community has mainly focused on dataset-oriented evaluation. anand2019unsupervised provides such a framework for ATARI games, but it is not aimed at continuous control. raffin18srl includes a limited set of robotics domains and 3 metrics for measuring representation quality: KNN-based, correlation, and RL reward. We incorporate more standard benchmarks, introduce a variety of objects with realistic appearances (fully integrated into simulation) and measure alignment of the latent state in a complementary way (highly non-linear, but not RL-based). In future work, it would be best to create a combined suite that supports both games- and robotics-oriented domains, and offers a comprehensive set of RL-based and RL-free evaluations.

Our formulation of learning latent relations is in the general setting of representation learning. This is a broad field, so in this work we focus on formalization of learning independent/modular relations that capture the true data manifold. We also provide a way to transfer relations learned on source domains to target domains. Unlike meta-learning, we do not assume access to a task distribution and do not view target task reward as the main focus. Our sequential approach to learning has conceptual parallels with a functional Frank-Wolfe algorithm jaggi2013revisiting, but without convex optimization. Learning sequentially helps avoid instabilities, e.g. from training flexible NN mixtures with EM IODINE19. There is prior work for learning algebraic (meaning polynomial) relations, but its criterion for relation simplicity is based on polynomial degree. Such approaches are based on computational algebra and spectral methods from linear algebra. This line of work was initiated by livni2013vanishing; sauer2007approximate; heldt2009approximate, with extensions fassino2010almost; fassino2013simple; kera2016noise; kera2019spurious; kera2019gradient, applications iraji2017principal; yan2018deep and learning theory analysis hazan2016non; globerson2017effective. Our formulation is more general, since we learn analytic relations and approximate them with neural networks. We summarize the main differences & point out potential connections in Appendix B.2.

Conclusion and Future Work

We proposed a suite for evaluation of latent representations and showed that additional latent space structure can be beneficial, but could stifle learning in existing approaches. We then presented AML: a unified approach to learn latent relations and transfer them to target domains. We offered a rigorous mathematical formalization, algorithmic variants & empirical validation for AML.

We showed applications of AML to physics & robotics domains. However, in general AML does not assume that source or target domains are from a certain field, such as robotics, or have particular properties, such as continuity in adjacent latent states or the existence of an easy-to-learn transition model. As long as some relation exists between the subsequences of latent states, AML will attempt to learn it, and will succeed if the chosen function approximator is capable of representing it. Moreover, AML relations can be learned on the latent space of any unsupervised learner trained on the source domain. In this case, AML would capture abstract relations that encode the regularities embedded in the latent representation learned on the source domain. Imposing these relations during transfer could help preserve (i.e. carry over) these regularities. This alternative could be better than starting from scratch and better than fine-tuning. Starting from scratch is not data-efficient. Fine-tuning is prone to getting stuck in local optima, causing permanent degradation of performance, especially in case of a non-trivial mismatch between the source and target domains.

AML can build a modular representation of relations encoded in the latent/low-dimensional space. Hence, AML can enable dynamic partial transfer and thus help recover from negative transfer in cases of large source-target mismatch. In our follow-up work, we intend to dynamically adjust the strength of imposing each latent relation on the target domain. For this, we would combine the learned relations using prioritization weights. These weights would be optimized by propagating the gradients of the RL loss w.r.t. the latent state representation (which these weights would influence). Further extensions could include, for example, lifelong learning: we could gradually expand the set of learned relations and discard relations whose weights decay to zero as lifelong learning proceeds. Another promising option would be to learn policy representations (rather than state representations). If AML could be used to learn policies that are in some sense independent, then it could provide a way to learn a portfolio of policies that are complementary. We could then construct algorithms for learning diversified portfolios, such that a system capable of executing any policy in the portfolio would be robust to uncertainty and changes in the environment.

This research was supported in part by the Knut and Alice Wallenberg Foundation. This work was also supported by an “Azure for Research” computing grant. We would like to thank Yingzhen Li, Kamil Ciosek, Cheng Zhang and Sebastian Tschiatschek for helpful discussions regarding unsupervised & reinforcement learning and variational inference.

Appendix A Evaluation Suite for Unsupervised Learning for Continuous Control

A.1 Benchmarking Alignment: Algorithm Descriptions and Further Evaluation Details

In this section, we include more detailed descriptions of the existing approaches we evaluated, describe parameters used for evaluation experiments, and give examples of reconstructions. Code and environments for the evaluation suite can be obtained at: https://github.com/contactrika/bulb

VAE kingma2013auto: a VAE with a 4-layer convolutional encoder and corresponding de-convolutional decoder (the same conv-deconv stack is also used for all the other VAE-based methods below).
VAE with replay: a VAE with a replay buffer that retains 50% of initial frames from the beginning of training and replays them throughout training. This is our modification of the basic VAE to ensure consistent performance on frames coming from a wider range of RL policies. We included this replay strategy in the rest of the algorithms below, since it helped improve performance in all cases.
β-VAE higgins2017beta: a VAE with an additional parameter β in the variational objective that encourages disentanglement of the latent state. To give it its best chance we tried a range of values for β.
Sequential VAE: a sequential VAE that is trained to reconstruct a sequence of frames and passes the output of the convolutional stack through an LSTM layer before decoding. Reconstructions for this and the other sequential versions were also conditioned on actions.
PRED: a VAE that, given a sequence of frames, constructs a predictive sequence of subsequent frames. First, the convolutional stack is applied to each frame as before; then, the outputs are aggregated and passed through fully connected layers. Their output constitutes the predictive latent state. To decode, this state is chunked into parts, each fed into the deconv stack for reconstruction.
DSA (disentangled sequential autoencoder) yingzhen2018disentangled: a sequential autoencoder that uses structured variational inference to encourage separation of static vs dynamic aspects of the latent state. It uses LSTMs in the static and dynamic encoders. To give it its best chance we tried uni- and bidirectional LSTMs, as well as replacing LSTMs with GRUs, RNNs, convolutions and fully connected layers.
SPAIR SPAIR19: a spatially invariant and faster version of AIR AIR16 that imposes a particular structure on the latent state. SPAIR overlays a grid over the image (e.g. 4x4=16 or 6x6=36 cells) and learns ‘location’ variables that encode bounding boxes of objects detected in each cell. ‘Presence’ variables indicate object presence in a particular cell. A convolutional backbone first extracts features from the overall image (e.g. 64x64 pixels). These are passed on for further processing to learn ‘location’, ‘presence’ and ‘appearance’ of the objects. The ‘appearance’ is learned by an object encoder-decoder, which only sees a smaller region of the image (e.g. 28x28 pixels) with a single (presumed) object. The object decoder also outputs transparency alphas, which allow rendering occlusions.

Neural network architectures and training parameters:

In our experiments, unsupervised approaches learn from 64x64 pixel images, which are rendered by the simulator. All approaches (except SPAIR) first apply a convolutional stack with 4 hidden layers (with [64, 64, 128, 256] conv filters). The decoder has analogous de-convolutions. Fully-connected and recurrent layers have size 512. Using batch/weight normalization and larger/smaller network depth & layer sizes did not yield qualitatively different results. The latent space size is set to be twice the dimensionality of the true low-dimensional state. For the VAE we also tried setting it to be the same, but this did not impact results. Sequential approaches use sequence length 24 for pendulums & 16 for locomotion (increasing it to 32 yields similar results). SPAIR parameters and network sizes are set to match those in SPAIR19. We experimented with several alternatives, but only the cell size had a noticeable effect on the final outcome. We report results for 4x4 and 6x6 cell grids, which did best.
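For reference, a sketch of a 4-layer convolutional encoder matching the filter counts above; kernel sizes, strides and the Gaussian-VAE output head are assumptions not stated in the text.

```python
import torch.nn as nn

def make_conv_encoder(latent_dim):
    """4-layer conv stack for 64x64 RGB frames with [64, 64, 128, 256] filters."""
    return nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),    # 64x64 -> 32x32
        nn.Conv2d(64, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),   # 32x32 -> 16x16
        nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1), nn.ReLU(),  # 16x16 -> 8x8
        nn.Conv2d(128, 256, kernel_size=4, stride=2, padding=1), nn.ReLU(), # 8x8 -> 4x4
        nn.Flatten(),
        nn.Linear(256 * 4 * 4, 2 * latent_dim))   # mean and log-variance of the latent code
```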

To decouple the number of gradient updates for unsupervised learners from the simulator speed, frames for training are re-sampled from replay buffers. These keep 5K frames and replace a random subset with new observations collected from 64 parallel simulation environments, using the current policy of an RL learner. Training hyperparameters are the same for all settings (e.g. using the Adam optimizer kingma2014adam with a fixed learning rate). Since different approaches need different amounts of time to perform gradient updates, we equalize the resources consumed by each approach by reducing the batch size for the more advanced/expensive learners. Non-sequential approaches get 1024 frames per batch; for sequential approaches we divide that by the sequence length; for SPAIR we use 64 frames per batch (since SPAIR's decoding process is significantly more expensive).

Reconstructions for benchmarks and the new multi-object domains:

Figure 1: Streaming/unseen frames (top) and reconstructions (bottom) after 500 training epochs.

Reconstruction for the benchmark domains (e.g. CartPole, InvertedPendulum, HalfCheetah, Ant) was tractable for the VAE-based approaches. Decoded images were sharp when these algorithms were trained on a static dataset of frames. However, when trained on streaming data with a changing RL policy, decoding was more challenging. It took longer for colors to emerge for some of the approaches, and sometimes robot links were missing, especially for poses that were seen less frequently.

We attempted to run SPAIR on these benchmark domains as well. However, it had difficulties with reconstruction. The thin pole in the CartPole domain was completely lost, and SPAIR mistook the cart base for a part of the background. For HalfCheetah and Ant, a bounding box was detected around the robot, signifying that SPAIR did separate it from the background. However, the cheetah robot was reconstructed only as a faint thin line, and the legs of the Ant were frequently missing. The right set of plots in Figure 1 shows examples of reconstructions; red bounding boxes show detected foreground regions, blue boxes indicate inactive boxes. SPAIR is not specifically designed for domains like this, since its strengths are best seen in identifying/tracking separate objects. Thin object parts and dynamic backgrounds in the benchmark domains are not the best match for SPAIR's strongest sides.

As we noted in the main paper, all existing approaches we tried had difficulties decoding the RearrangeYCB domain. SPAIR did manage to produce reasonable reconstructions, albeit missing/splitting of objects was still common. Figure 2 shows example reconstructions after training for 10K epochs and after 100K epochs. Bounding boxes reported by SPAIR were not tight even after 100K epochs (up to 11 days of training overall on one NVIDIA GeForce GTX1080 GPU). We used the PyTorch implementation from SPAIRpytorch, which was tested in yonk2020msthesis to reproduce the original SPAIR results (and we added the capability to learn non-trivial backgrounds). An optimized TensorFlow implementation could potentially offer a speedup, but PyTorch has the advantage of being more accessible and convenient for research code.

Some of the approaches did not achieve good reconstructions even on the RearrangeGeom domain; Figure 3 shows example reconstructions. Hence, in the main paper, for analyzing alignment on the RearrangeGeom domain we chose two approaches: one offered speed and simplicity, while the other gave better reconstructions.

Figure 2: Left side: SPAIR RearrangeYCB results after 10K epochs. Right side: SPAIR after 100K epochs. True images are in the top row, reconstructions in the bottom. Thin red bounding boxes overlaid over true images (in the top row) show that bounding boxes did not shrink with further training. SPAIR 6x6 tended to split large objects into pieces (visible in the case with blue background). SPAIR 4x4 did not split objects and had better results for low-dimensional alignment.
Figure 3: True images (top) and reconstructed images (bottom) after 10K training epochs.

Appendix B Analytic Manifold Learning

B.1 Proofs and Technical Background for Mathematical Formulation

Here we present an extended version of Section 3.1 from the main paper. This version contains proofs for all lemmas and theorems, and provides relevant technical background from abstract algebra and geometry. We draw analogies with simpler settings from linear algebra to highlight connections with settings that are common in the ML literature.

Let X be the ambient space of all possible latent state sequences (of some fixed length). Let M ⊂ X be the submanifold of actual state sequences that a dynamical system from one of our domains could generate (under any control policy). [1] A common view of discovering M is to learn a mapping that would produce only plausible sequences as output (the ‘mapping’ view). Alternatively, a submanifold can be specified by describing all the equations (i.e. relations) that have to hold for the points in the submanifold. Recall an example from linear algebra, where the submanifold is linear, a.k.a. a vector subspace. This submanifold can be represented as the image of some linear map (the ‘mapping’ view), or as the null space of some collection of linear functions, a.k.a. a system of linear equations. The latter is the ‘relations’ view: specifying which relations have to hold for a point to belong to the submanifold.

[1] In this work, we use the term ‘manifold’ in the sense most commonly used in the machine learning literature, i.e. without assuming strict smoothness conditions.

B.1.1 Definitions of Independence for Learning Independent Relations

We are interested in finding relations that are in some sense independent. One notion of independence is functional independence. Relations f_1, …, f_n are said to be functionally independent if there is no (non-trivial) function F s.t. F(f_1, …, f_n) = 0. However, with such a definition, two relations could be deemed independent [2] even when one does not provide an additional interesting relation beyond the other, e.g. a relation vs the same relation multiplied by a coordinate function of the input. Hence, we need a stricter version of independence. To describe such a version we use the formalism of modules.

[2] F can transform the f_i's in any way, but does not have direct access to the input τ, so a function of τ cannot act as a ‘coefficient’.

A module is a generalization of the concept of a vector space, where the coefficients lie in a ring instead of a field. In our case, both elements of the module and elements of the ring are functions on X. We observe that the set of functions that vanish on M is closed under the module operations of addition and multiplication by ring elements, hence it is a (sub-)module. Recalling the definition of independence for vectors of a vector space, we note that the default notion of independence for elements of a module is analogous. In this setting, a syzygy is a linear combination of relations with coefficients in the ring of functions. If there is no syzygy in which not all coefficients are zero, then f_1, …, f_n are independent.

However, for our case the above notion of independence is now too strict, because it would deem any relations dependent: the syzygy with coefficients a_1 = f_2, a_2 = -f_1 gives f_2 f_1 - f_1 f_2 = 0 for any f_1, f_2. We propose several strategies to avoid this problem. One option is to define restricted syzygies, presented below.

Definition 3.1 (Restricted Syzygy).

A restricted syzygy for relations f_1, …, f_{n+1} is a syzygy with the last entry equal to 1, i.e. a tuple (a_1, …, a_n, 1) with a_1 f_1 + … + a_n f_n + f_{n+1} = 0.

Definition 3.2 (Restricted Independence).

f_{n+1} is independent from f_1, …, f_n in a restricted sense if the equality a_1 f_1 + … + a_n f_n + f_{n+1} = 0 is impossible, i.e. if there exists no restricted syzygy for f_1, …, f_{n+1}.

For brevity, we write f_{1:n} for the collection f_1, …, f_n.

Using the definitions above, we construct a practical algorithm for learning an (approximately) independent set of relations. The overall idea is: while learning the relations f_i, we are also looking for restricted syzygies. Finding one would mean the f_i's are dependent (in the sense of Definition 3.2), so we augment the loss for learning the f_i's to push them away from being dependent. We proceed sequentially: first learning f_1, then learning f_2 while ensuring no restricted syzygies appear for f_1, f_2, then learning f_3 and so on.

For training the f_i's we use on-manifold data: sequences that come from our dynamical system (i.e. satisfying physical equations of motion, etc). Restricted syzygies are trained using off-manifold data: sequences that do not lie on our data submanifold. We denote such sequences as τ_off. Off-manifold data is needed since we aim for independence of the f_i's on X, not restricted to their outputs on data that lies on M (when restricted to M the f_i's are zero, and so are trivially dependent). The τ_off do not lie on our data submanifold and can come from a thickening of on-manifold data or can be random (when the dimension of X is large, the probability that a random sequence satisfies equations of motion is insignificant).

Observe that independence in the sense of Definition 3.2 is the same as saying that f_{n+1} does not lie in the ideal generated by f_1, …, f_n, with ‘ideal’ defined as in abstract algebra. [3] Hence the ideal generated by f_1, …, f_{n+1} is strictly larger than that generated by f_1, …, f_n alone, because we have added at least one new element (the new relation f_{n+1}). Below we prove that in our setting the process of adding new independent relations will terminate.

[3] In the language of abstract algebra: we consider functions on X as a module over itself. When a ring is viewed as a module over itself, a submodule of the ring is called an ideal. Thus the set of relations that hold on M is an ideal, called ‘the ideal of M’. When considering only subsets of relations that hold on M, we will also talk about the ‘ideal generated by f_1, …, f_n’, which is, by definition, the smallest ideal containing f_1, …, f_n. One can show that this ideal consists of all linear combinations of f_1, …, f_n with functions as coefficients.

Theorem 3.1.

When using Definition 3.2 for independence and real-analytic functions to approximate the relations, the process of starting with a relation f_1 and iteratively adding new independent relations will terminate.

Proof.

First, we assume that the values of each dimension of a sequence τ lie between some minimum and maximum constants. This is to model actual data observations that are limited by real-world boundaries. This implies that instead of working with the unrestricted ambient space, we will work with a compact box B, and the corresponding subset of the data manifold M ∩ B. The precise values of the bounds and even the rectangular shape of the box are immaterial; what is needed is that B is compact and is cut out by a collection of analytic inequalities. In technical terms: we require that B is compact and real semi-analytic. To avoid boxes with pathological shapes we require in addition that B is the closure of its interior. Possible choices of B include a closed ball, or an arbitrary convex polytope.

We consider the case of using neural networks for approximating the relations f_i. For networks with real-analytic activation functions (e.g. sigmoid, tanh), the f_i's and relations between them would be real-analytic (recall that a function is analytic if it is locally given by a convergent power series). Each f_{n+1} being independent in the sense of Definition 3.2 implies that f_{n+1} is not in the ideal generated by f_1, …, f_n inside the ring of real-analytic functions. This means that the ideals generated by f_1, then by f_1, f_2, and so on, form a strictly increasing sequence of ideals inside the ring of real-analytic functions on B. A theorem of J. Frisch (frisch1967points, Théorème (I, 9)) says that the ring of analytic functions on a compact real semi-analytic space is Noetherian, meaning that any growing chain of ideals in it will stabilize. This means that after a finite number of iterations we would be unable to learn a new independent relation, meaning we would have found all analytic relations that hold on M ∩ B, thus terminating the process. ∎

If M itself is cut out by a finite set of equations of the type h = 0 for some finite set of real-analytic functions h, then after the process terminates, the subset of B where all learned relations hold will be precisely M ∩ B. This is the same as saying that all the h's defining M will be in the ideal generated by the learned relations. If M is not cut out by global real-analytic relations, the process will still terminate, having learned all possible global analytic relations that hold on M.

We remark that by a theorem of Akbulut and King akbulut1992approximating any smooth submanifold of Euclidean space can be approximated arbitrarily well by a set defined by a finite collection of analytic equations. The same is true even when the defining equations are restricted to be polynomial. This means that if one ignores the issues of complexity of the defining equations, the differences between various categories of manifolds (smooth, analytic, or algebraic) could be ignored. The above may seem to suggest that methods based on polynomials may suffice. In practice, however, the polynomial relations needed may be of very high degree. Hence, using neural networks to learn (approximate) relations is more suitable.

We further note that in practice we of course don't have access to M or even M ∩ B, but only to a finite sample of data points in M. The fact that finding independent relations vanishing at these points will terminate is a (simpler) special case of Theorem 3.1, which guarantees that even the more complicated idealized set of relations defining M can be learned in finite time.

Observe that if f_{n+1} is dependent on f_1, …, f_n then the set of points where f_{n+1} is zero contains the set of points where all the other f_i are zero. The converse is not true: while f_{n+1} may be different from the previous relations in a non-trivial way, it might happen that adding f_{n+1} as a relation does not restrict the learned manifold to a smaller set. This arises because of the non-linearity in our setting. [4]

[4] This is in contrast to linear algebra, where adding an independent linear equation necessarily decreases the dimension of the subspace of solutions.

To ensure that each new relation decreases the data manifold dimension, we could additionally prohibit f_{n+1} from having any syzygy in which f_{n+1} itself is not expressible in terms of f_1, …, f_n. This is encoded in the definition below.

Definition 3.3 (Strong Independence).

f_{n+1} is strongly independent from f_1, …, f_n if the equality a_1 f_1 + … + a_n f_n + a_{n+1} f_{n+1} = 0 implies that a_{n+1} is expressible as a_{n+1} = b_1 f_1 + … + b_n f_n.

In Theorem 3.2 we will show that imposing relations f_1, …, f_n, such that each new relation is strongly independent from the previous ones, restricts the data to a submanifold of codimension at least n. Since we don't assume that M has to be smooth, the notion of dimension needs to be defined precisely. Thus, before embarking on a formal statement and a proof of Theorem 3.2, we give such a definition and discuss related notions needed in the proof.

B.1.2 Definitions of Dimension in Geometry and Algebra

For smooth manifolds, which are locally homeomorphic to some R^d, the dimension is simply defined to be d, and the invariance of dimension theorem of Brouwer (see (hatcher2002algebraic, Theorem 2.26)) ensures that this is unambiguous (which, in light of Cantor's proof that all the R^d's have the same number of points and Peano's construction of space-filling curves, is not as obvious as it may seem a priori).

For arbitrary subsets S of R^D one can then analogously say that S has dimension d if and only if S contains an open set homeomorphic to an open ball in R^d, but not an open set homeomorphic to an open ball in R^k for k > d. We will call this the geometric dimension. This is the definition we will use when referring to the dimension of M.

Now suppose Z is a semi-analytic subset of R^D, meaning a subset locally defined by a system of analytic equations and inequalities. [5] While Z is in general not smooth, it admits a decomposition into smooth parts. Then, the definition of geometric dimension given above coincides with just taking the largest dimension of any part (see, for example, (bierstone1988semianalytic, Proposition 2.10 and Remark 2.12)). This definition is local, meaning that if we define the dimension of Z at a point p, denoted dim_p Z, to be the dimension of the intersection of Z with any sufficiently small open neighborhood of p, then the dimension of Z is the maximum of dim_p Z over all points p. Of course one also has that a subset of Z has dimension at most that of Z. See (lojasiewicz1991introduction, II.1.1) for all this and more.

[5] The manifolds we are learning are actually much nicer: they are globally defined by analytic equations. This means, by definition, that they are C-analytic sets (an abbreviation of Cartan real analytic sets; see (acquistapace2017some, Definition 1.5) and cartan1957varietes, particularly Paragraphe 11).

In order to relate this dimension to properties of the relations that define M, we need to connect it to the dimensions of algebraic objects arising from the f_i's. These will be rings of various kinds. Thus, we need the theory of dimensions of rings.

In commutative algebra the standard way to define the dimension of a ring is due to Krull. It says that the dimension of a ring R, denoted dim R, is the length of the longest chain of prime ideals in R. Note that this has some resemblance to the fact that the dimension of a vector space is equal to the length of the longest chain of proper subspaces. For an ideal I in R the Krull dimension is defined as dim(R/I), where R/I is the quotient ring. See (eisenbud1995commutative, Chapter 8 and onwards).

B.1.3 Statement and Proof of Theorem 3.2
Theorem 3.2.

Suppose f_1, …, f_n is a sequence of analytic functions on B, each strongly independent of the previous ones. Denote by M' the part of the learned data manifold lying in the interior of B. Then the dimension of M' is at most D - n, where D is the dimension of the ambient space.

Proof outline: Strong independence (Definition 3.3) is directly related to the definition of regular sequences. The proof ultimately aims to use Proposition 18.2 in eisenbud1995commutative, which ensures that ideals defined by regular sequences have low dimension. To deduce that M' has low dimension, we need to relate the Krull dimension of the ideal generated by the relations to the geometric dimension of M'. To do this we pass through a number of intermediate stages. First we localize, and complexify. This allows us to equate the dimension of the local complexified ideal generated by the relations to that of the local complexified ideal of their zero set, which we do by using the local analytic Nullstellensatz. We also equate the common dimension of these two ideals to the (local complex) geometric dimension of the zero set. Then, we relate this to the local real dimension of M'. Finally, we get a bound on the (global) dimension of M' itself.

Proof.

Definition 3.3 is equivalent to saying that f_{n+1} is not a zero divisor in the ring of functions modulo the ideal generated by f_1, …, f_n. To see this, we argue as follows. By definition, in any ring, an element x is not a zero divisor if a·x = 0 implies that a = 0. Equality to zero in the quotient ring means that, in the ring of functions, we have a·f_{n+1} = b_1 f_1 + … + b_n f_n. Thus if f_{n+1} is not a zero divisor in the quotient ring, then a_1 f_1 + … + a_n f_n + a_{n+1} f_{n+1} = 0 implies that a_{n+1} is zero in the quotient ring, that is to say a_{n+1} = b_1 f_1 + … + b_n f_n for some functions b_i.

Thus, a sequence f_1, …, f_n where each relation is strongly independent from the previous ones is a regular sequence, see (eisenbud1995commutative, Section 10.3, beginning of Section 17).

Let p be a point in M ∩ B. We will consider the ring of germs [6] of real-analytic functions defined near p, which is isomorphic to the ring of convergent power series centered at p, as well as the complex version of this ring.

[6] A germ of a function at a point p is an equivalence class of functions defined near p, where two functions are considered equivalent if there exists an open neighbourhood of p s.t. the restrictions of both functions to that neighborhood coincide.

The localization of the ring of analytic functions on B at a point p is defined as the set of equivalence classes of pairs (f, g) of analytic functions with g(p) ≠ 0, under the natural equivalence relation identifying pairs that represent the same fraction. This is a formal way of introducing fractions f/g. One also has the localization map from the original ring to the localization ring. It sends f to the equivalence class represented by the pair (f, 1), where 1 is the constant function. In our setting, if one identifies the set of equivalence classes with germs, this map performs a ‘type conversion’ from an analytic function to its germ at p. In fact, the localized ring is a subring of the ring of germs at p. Indeed, a fraction f/g defines an analytic function on some open neighborhood of p and the corresponding germ depends only on the equivalence class, thus giving a map from the localization to the ring of germs. Clearly the germ is zero only when the fraction is zero, so this map is an injection, and the localization is a subring of the ring of germs.

However, the localization is not all of the ring of germs, since not every function analytic at p is a ratio of two functions analytic on all of B. To remedy this, we consider completions of both rings with respect to the maximal ideal of germs vanishing at p. A completion is perhaps most familiar as a procedure that gives real numbers from rational ones, by means of equivalence classes of Cauchy sequences. In the present situation, a sequence of germs is deemed Cauchy if the difference of any two elements with sufficiently high indexes vanishes to arbitrarily high order (this is known as the Krull topology). The completion (of either ring) is then isomorphic to the ring of formal power series centered at p. Indeed, just taking Cauchy sequences of germs of polynomial functions we get that the completion contains all formal power series centered at p; and any Cauchy sequence (in either ring) is equivalent to one made up of polynomials, and converges to a formal power series.

We now argue as follows. Since the localization procedure commutes with taking quotients, and the localization map takes non-zero divisors to non-zero divisors (dummit2004abstract, Section 15.4), we conclude that for each point p the sequence of germs of f_1, …, f_n is a regular sequence in the localized ring. On the other hand, by (stacks-project, Lemma 10.67.5 and Lemma 10.96.2) (as cited in the proof of (stacks-project, Lemma 23.8.1)), a sequence is regular in a local ring if and only if it is regular in the completion, so the sequence is regular in the completion, and so also in the ring of germs.

We claim that the corresponding complexified germs form a regular sequence in the complexified ring of germs as well. Indeed, if on a neighborhood of