## 1 Introduction

Consider a robot given a coverage task on a building floor. For example, it could be tasked with performing a safety inspection or collecting data. With the increasing availability and use of high-resolution sensors such as cameras and LiDAR, such tasks require the robot to process high-dimensional observations for real-time decisions. Current navigation approaches typically involve constructing and utilizing a high-fidelity map of the robot’s environment [Cadena16, Sun18, Doherty19, Vasilopoulos20]. However, is a map necessary for the task? Does the map-based representation satisfy the robot’s onboard memory constraints? Are there representations that are more memory efficient? These fundamental and practical questions motivate the need to have principled methods for finding memory representations that are not only sufficient for the task at hand but also reduce the robot’s memory requirements.

In order to illustrate the potential benefits of memory-efficient policies, consider a robot tasked with covering an $n \times n$ maze (a simplified version of the building floor coverage task). Blum and Kozen [Blum78] show that there is a control policy (a clever, handcrafted, wall-following and zig-zagging routine) that only utilizes $O(\log n)$ bits of memory. This policy thus requires *significantly less* memory than one that relies on building and using a map of the environment (a map-building strategy requires memory on the order of $n^2$). Beyond memory efficiency, such a policy also affords additional advantages including (i) computational efficiency, and (ii) improved generalization/robustness. For example, Blum and Kozen’s policy does not need to perform real-time computations with the entire map as an input. Additionally, a policy that requires only $O(\log n)$ memory is inherently *task-centric*; irrelevant geometric details of the environment (e.g., the exact positions or colors of obstacles in the environment) do not affect the robot’s behavior. The policy can thus be highly robust to uncertainty or noise in these task-irrelevant features.

An important feature of memory-efficient policies is that they can be *qualitatively different* from ones that utilize map-based representations. As a simple example, consider the navigation problem demonstrated in Figure 1. A policy that chooses to follow the wall can be significantly more memory-efficient than one that navigates through the environment diagonally (since the wall-following strategy does not need to maintain information pertaining to obstacle locations). This motivates the need to *jointly synthesize* the memory representation and the control policy; such a joint synthesis can lead to policies that *actively* reduce memory requirements.

Statement of Contributions. The goal of this paper is to automatically and jointly synthesize low-dimensional, task-centric memory representations and control policies. Our primary contribution is a reinforcement learning framework for finding policies that achieve *active memory reduction (AMR)*. In particular, we leverage a group LASSO regularization scheme [Yuan06, Scardapane17] to enforce low-dimensional memory representations while simultaneously finding policies via a policy gradient (PG)-style algorithm that we refer to as AMR-PG. To our knowledge, this is the first work to find AMR policies in continuous state and action spaces. Lastly, we demonstrate the efficacy of our approach on three simulated examples: (i) an illustrative discrete navigation problem, (ii) a continuous navigation problem with synthetic environments, and (iii) vision-based navigation in an apartment using a photo-realistic simulator. These examples demonstrate our method’s ability to find AMR policies that reduce the dimension of the required memory representation and improve generalization as compared to standard PG methods.

## 2 Related Work

Memory-Efficient Representations. There are several approaches that consider memory-efficient representations for robot navigation tasks including gap navigation trees [Murphy08, Tovar05], compact maps [Srivastava16], and graph-like topometric maps [Ort20]. While each of these shares this work’s goal of memory efficiency, these memory representations are hand-crafted for certain applications or domains. In contrast, we aim to provide a general approach for finding memory-efficient representations and policies. Recent work by O’Kane and Shell [OKane17] takes a step in this direction by automatically designing minimal memory representations and controllers via combinatorial filters. However, their formulation defines memory with respect to the number of nodes in a policy graph and is restricted to discretized state and action spaces. Instead, our work defines memory complexity as the dimension of a continuous representation and is applicable to continuous state and action spaces.

Map-Free Representations in RL. End-to-end reinforcement learning (RL) of policies provides one avenue towards generating task-centric representations that avoid explicit geometric representations such as maps (see, e.g., [Levine16, Levine18, Zhu17]). Recurrent neural network architectures allow one to incorporate memory into policies learned via RL [Heess15]. For example, in the context of navigation, Chen et al. [Chen17] use a long short-term memory (LSTM) [Hochreiter97] architecture to navigate mazes with cul-de-sacs. While these approaches are able to find policies that maintain task-relevant representations in memory, they do not try to explicitly minimize the memory. In practice, such approaches often choose the dimension of the memory representation with little to no knowledge of the appropriate size for the task.

Memory-Efficient Representations in RL. Recent work in RL utilizes self-attention [Vaswani17] before recurrent memory layers. For example, Baker et al. [Baker19] use this method in their policy architecture to train agents to play hide-and-seek games over a long time horizon. In [Tang20], Tang et al. highlight the value of self-attention for memory-efficient representations. Specifically, they show how self-attention can be used as a bottleneck to promote the memory representation to only use task-centric features. They also demonstrate that such a bottleneck allows them to only use a small number of memory dimensions, e.g. an LSTM with only 16 memory state dimensions in a third-person perspective navigation task. While this type of approach is capable of finding task-centric representations that are low-dimensional, the memory dimension still needs to be specified a priori. In contrast, we present a regularization scheme that explicitly seeks to minimize the memory dimension.

A different line of work learns task-centric memory representations via information bottlenecks [Achille18, Pacelli20]. These approaches seek policies with “low complexity” as defined in terms of the information contained in the memory representation. For example, in [Pacelli20], the objective is to minimize the information content about the state in the memory representation. Our work, instead, defines memory complexity in terms of the dimension; such a measure of complexity is more physically meaningful and tied to the robotic system’s onboard memory constraints.

Dimensionality Reduction Techniques. One approach for reducing the dimension of the memory representation is to use unsupervised techniques such as principal component analysis, manifold learning, or autoencoders [vanderMaaten09]. However, these approaches do not take into account the control task. Sufficient dimension reduction (SDR) [Adragni09] addresses this by finding a mapping of the input data such that no task-relevant information is lost. However, most existing SDR approaches are restricted to linear mappings [Cook91, Li91]; the few nonlinear extensions [Kim08] rely on domain knowledge for a good kernel choice. This paper presents a method that is task-centric, handles nonlinearities, and does not require knowledge of the reduced dimension a priori.

## 3 Problem Formulation

Our goal is to find a policy that utilizes a low-dimensional, task-centric memory representation. To formalize this, we focus on robotic tasks that can be defined with cost functions of the form $\sum_{t=0}^{T} c(x_t, u_t)$, where $x_t \in \mathcal{X}$, $u_t \in \mathcal{U}$, and $y_t \in \mathcal{Y}$ represent the robot’s state, control action, and sensor observation at time $t$, respectively. The state space $\mathcal{X}$, action space $\mathcal{U}$, and observation space $\mathcal{Y}$ may be continuous or discrete. Additionally, the robot’s dynamics and sensor model are described by unknown conditional distributions $p(x_{t+1} \mid x_t, u_t)$ and $p(y_t \mid x_t)$, respectively.

We seek policies of the form $\pi(u_t \mid m_t)$, where $m_t \in \mathbb{R}^{d}$ is the memory state at time-step $t$. Here, $m_t$ is a function of the current observation and previous memory state, i.e., $m_t = f(y_t, m_{t-1})$. Ideally, the memory state should (i) contain enough information about the sequence of past observations in order to choose good actions, and (ii) have minimal dimension $k$, where $k \leq d$. To precisely formalize the above desiderata, we first introduce the matrix zero norm. (Note that, like the zero norm for vectors in Euclidean spaces, the matrix zero norm is not a proper norm because it is not homogeneous.)

###### Definition 1 (Matrix Zero Norm).

Let $a_i \in \mathbb{R}^{n}$ represent the transposed $i$-th row of matrix $A \in \mathbb{R}^{m \times n}$. Additionally, let

$$\mathbb{1}[a_i \neq 0] := \begin{cases} 1 & \text{if } a_i \neq 0, \\ 0 & \text{otherwise} \end{cases}$$

indicate if there exists a non-zero element in $a_i$. Then, the matrix zero norm is defined as the number of non-zero rows, i.e.,

$$\|A\|_0 := \sum_{i=1}^{m} \mathbb{1}[a_i \neq 0].$$

Thus $\|M\|_0$, where $M := [m_0, m_1, \dots, m_T] \in \mathbb{R}^{d \times (T+1)}$, corresponds to the number of effective dimensions needed by the memory states across the trajectory. If $\|M\|_0 = k$, where $k < d$, then the memory representation is effectively reduced from dimension $d$ to $k$.
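As a small illustration of Definition 1, the sketch below counts the non-zero rows of a matrix stored as a list of rows. The function name `matrix_zero_norm`, the tolerance argument `tol`, and the example matrix are assumptions made for this illustration; they are not part of the paper's code.

```python
from typing import List

Matrix = List[List[float]]


def matrix_zero_norm(A: Matrix, tol: float = 0.0) -> int:
    """Count the rows of A that contain at least one entry with |value| > tol."""
    return sum(1 for row in A if any(abs(v) > tol for v in row))


# Hypothetical memory trajectory: 4 memory dimensions (rows) over 3 time steps (columns).
M = [
    [0.3, -0.1, 0.7],
    [0.0, 0.0, 0.0],
    [0.0, 0.5, 0.0],
    [0.0, 0.0, 0.0],
]

# Only rows 0 and 2 are ever non-zero, so two dimensions are effectively used.
effective_dim = matrix_zero_norm(M)
```

In practice a trained network rarely produces exact zeros, which is why the sketch exposes a tolerance; the default of `0.0` matches the exact definition.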

To find a memory representation that is both low-dimensional and task-centric, we minimize the memory representation dimension subject to an upper bound on the expected cost of the trajectory:

$$\min_{\pi, f} \ \|M\|_0 \quad \text{s.t.} \quad \mathbb{E}\Big[\sum_{t=0}^{T} c(x_t, u_t)\Big] \leq C_{\max}, \tag{1}$$

where $M := [m_0, \dots, m_T]$ collects the memory states along the trajectory and $C_{\max}$ is the maximum allowable expected cost. Since actions are conditioned on the memory state, requiring the matrix zero norm of the memory states to be small means that the policy may need to take actions leading to lower dimensional memory states. In other words, the policy will actively reduce memory requirements. Hence, we call the resulting policy an active memory reduction (AMR) policy.

## 4 Learning AMR Policies

In this section, we present our approach for the AMR policy synthesis problem (1). We pose the problem as a reinforcement learning problem where the policy $\pi$ and the memory update $f$ are parameterized using neural networks. We use $\theta$ to refer to the combined set of weights corresponding to $\pi$ and $f$. The primary challenges with (1) then come from (i) the non-differentiability of the matrix zero norm, and (ii) the hard constraint on the expected cost. To tackle these, we relax the matrix zero norm with a regularizer used in group LASSO problems and soften the hard constraint. We discuss our regularization scheme in Section 4.1 and describe our overall policy gradient (PG)-style algorithm in Section 4.2.

### 4.1 Dimensionality Reduction Based on the $\ell_{2,1}$-norm

A well-known and widely-used convex relaxation of the vector zero norm is the $\ell_1$-norm. However, since (1) aims for sparsity of entire matrix rows, this relaxation cannot be used directly. Instead, we use the $\ell_{2,1}$-norm seen in group LASSO [Yuan06] to capture this desired behavior. The $\ell_{2,1}$-norm for matrix $A \in \mathbb{R}^{m \times n}$ with rows $a_i$ is written as:

$$\|A\|_{2,1} := \sum_{i=1}^{m} \|a_i\|_2 = \sum_{i=1}^{m} \Big(\sum_{j=1}^{n} a_{ij}^2\Big)^{1/2}. \tag{2}$$

Notice that for $n = 1$, (2) is the $\ell_1$-norm. Hence, we can expect that minimizing the $\ell_{2,1}$-norm will promote sparsity of entire matrix rows similar to how minimizing the $\ell_1$-norm promotes sparsity of elements in a vector.
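The row-wise norm described above can be sketched in a few lines. This is a minimal illustration assuming the group LASSO form (sum over rows of each row's Euclidean norm); the helper name `l21_norm` and the example matrices are hypothetical.

```python
import math
from typing import List


def l21_norm(A: List[List[float]]) -> float:
    """Sum, over the rows of A, of the Euclidean norm of each row."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in A)


# Row norms are 5, 0, and 2, so the total is 7.
A = [[3.0, 4.0], [0.0, 0.0], [0.0, -2.0]]
total = l21_norm(A)

# With a single column, each row norm is an absolute value,
# so the result coincides with the l1-norm of the column vector.
column = [[-1.5], [0.0], [2.0]]
l1_of_column = sum(abs(row[0]) for row in column)
```

Because each row contributes its whole Euclidean norm, shrinking the penalty tends to drive entire rows to zero together rather than individual entries.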

The $\ell_{2,1}$-norm is also effective as a regularizer on groups of weights in neural networks [Scardapane17] and for promoting sparsity of hidden states in LSTMs [Wen18]. Our insight is to now apply it in an RL context to learn memory-efficient policies. Specifically, to relax the matrix zero norm in (1), we leverage the $\ell_{2,1}$-norm to target the incoming weights of a neuron at the memory layer as shown in Figure 2. Here, “memory layer” refers to the last layer of a standard, time-invariant recurrent neural network that gives output $m_t$. More formally, let $d$ and $h$ be the number of neurons used at the memory layer and preceding hidden layer respectively and define the incoming memory layer weight matrix to be $W \in \mathbb{R}^{d \times h}$. We then calculate the $\ell_{2,1}$-norm of this matrix, i.e. $\|W\|_{2,1}$. Intuitively, minimizing this will promote entire rows of $W$ to be sparse which, in turn, effectively drops out neurons (i.e., dimensions of the memory representation $m_t$).

After we relax the matrix zero norm, we soften the hard constraint on the expected cost. Our new reinforcement learning objective then becomes:

$$\min_{\theta} \ \mathbb{E}\Big[\sum_{t=0}^{T} c(x_t, u_t)\Big] + \lambda \|W\|_{2,1}, \tag{3}$$

where $\lambda \geq 0$ is a tradeoff parameter between cost and memory efficiency.

The regularizer, which we refer to as the AMR regularizer, can additionally be used in time-varying recurrent neural network structures. We achieve this by stacking the memory layer weight matrices to define $\bar{W} := [W_0, W_1, \dots, W_T]$, where $W_t$ is the incoming memory layer weight matrix at time step $t$, and penalizing $\|\bar{W}\|_{2,1}$. This ensures that the same number of memory dimensions are reduced at each time step.

### 4.2 AMR Policy Gradient Algorithm

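Now we describe the algorithm we use to tackle (3). We first parameterize our policy using a recurrent neural network, $f$, connected to a feedforward network that outputs $\pi(\cdot \mid m_t)$. This output is treated as a distribution, and the control action, $u_t$, is sampled from this. An illustration of this architecture is shown in Figure 2.

Next, we write the gradient of (3) with respect to the network parameters, $\theta$, as

$$\mathbb{E}\Big[\Big(\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(u_t \mid m_t)\Big)\Big(\sum_{t=0}^{T} c(x_t, u_t)\Big)\Big] + \lambda \nabla_\theta \|W\|_{2,1}. \tag{4}$$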

In this form, we are able to extend the canonical policy gradient algorithm, REINFORCE [Williams92], to include the AMR regularizer (one may also modify other on-policy algorithms such as proximal policy optimization (PPO) [Schulman17]). We refer to our method as AMR-PG and outline it in Algorithm 1. For network parameter updates, we use the ADAM optimizer [Kingma14].
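To make the role of the regularizer in the parameter update concrete, here is a minimal sketch of one update of the memory-layer weights. It assumes a plain gradient-descent step rather than the ADAM optimizer used in the paper, takes the policy-gradient estimate of the cost term as a given input, and uses the convention that an all-zero row has zero subgradient. The names `l21_subgradient`, `amr_step`, `pg_grad`, `lam`, and `lr` are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def l21_subgradient(W: Matrix, eps: float = 1e-12) -> Matrix:
    """Row-wise (sub)gradient of sum_i ||w_i||_2: w_i / ||w_i||_2, or 0 for an all-zero row."""
    grad = []
    for row in W:
        norm = math.sqrt(sum(v * v for v in row))
        grad.append([v / norm if norm > eps else 0.0 for v in row])
    return grad


def amr_step(W: Matrix, pg_grad: Matrix, lam: float, lr: float) -> Matrix:
    """One gradient-descent step on (expected cost + lam * row-wise penalty) for W.

    pg_grad is assumed to be a REINFORCE-style estimate of the cost gradient.
    """
    reg = l21_subgradient(W)
    return [
        [w - lr * (g + lam * r) for w, g, r in zip(w_row, g_row, r_row)]
        for w_row, g_row, r_row in zip(W, pg_grad, reg)
    ]


W = [[3.0, 4.0], [0.0, 0.0]]
pg_grad = [[0.0, 0.0], [0.0, 0.0]]  # pretend the cost gradient vanishes at this point
W_next = amr_step(W, pg_grad, lam=0.5, lr=0.1)
```

With a vanishing cost gradient, the first row shrinks along its own direction (from `[3.0, 4.0]` towards the origin) while the already-zero row stays at zero, which is the row-sparsifying behavior the regularizer is meant to encourage.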

Once we train a policy using AMR-PG, it remains to determine the reduced memory representation dimension. In our networks used in Section 5, we apply a nonlinearity to the outputs preceding the memory layer. This allows us to upper bound the value of the memory state at dimension $i$ with the sum of the magnitudes of the incoming weights at dimension $i$, i.e., $|m_t^{(i)}| \leq \sum_{j} |W_{ij}|$, where $W$ is the incoming memory layer weight matrix. We refer to the value of the upper bound as the “memory saliency”. As a general rule, we cut off any dimensions whose memory saliency is at least two orders of magnitude smaller than the highest memory saliency. After determining the dimension reduction, the network can be trained for several epochs with the cut dimensions and regularizer removed. This will provide a hard dimensionality reduction if desired, i.e., explicitly force all incoming weights at a dimension to be zero. In our results discussed in Section 5, we test with the raw trained network to give qualitative insight into how well the memory representation reduced its dependency on task-irrelevant features.

Implementation Details. There are several practical considerations in implementing Algorithm 1. First, any layers preceding the memory layer should be at least the size of the observation plus the maximum memory dimension. This is to avoid losing task-relevant information before the memory layer [Scwartz17]. Theoretically, the maximum memory dimension should be the size of the observation multiplied by the time horizon. Intuitively, choosing this size could allow the network to implement the complete history of observations as the initial memory representation, and then we could minimize the dimension from there. However, this network size would be infeasible for many problems. We found that choosing an initial memory dimension several times (e.g., 2-4x) greater than the observation dimension is a good starting point for our examples described in Section 5. It allows for the potential to store several complete observations in memory if needed for the task.
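The memory-saliency cut-off rule described earlier in this section can be sketched as follows. This is an illustrative reading of the rule, assuming saliency is the sum of absolute incoming weights per memory dimension and that "two orders of magnitude" means a factor of 100; the function names and the weight matrix are hypothetical.

```python
from typing import List


def memory_saliency(W: List[List[float]]) -> List[float]:
    """Sum of absolute incoming weights for each memory dimension (row of W)."""
    return [sum(abs(v) for v in row) for row in W]


def kept_dimensions(W: List[List[float]], ratio: float = 1e-2) -> List[int]:
    """Indices of rows whose saliency exceeds `ratio` times the largest saliency."""
    saliency = memory_saliency(W)
    top = max(saliency)
    return [i for i, s in enumerate(saliency) if s > ratio * top]


# Hypothetical incoming memory-layer weights: 3 memory dimensions, 3 hidden units.
W = [
    [0.8, -1.2, 0.5],    # saliency 2.5 (largest)
    [1e-4, 0.0, -2e-4],  # far below 1% of the largest, so it is cut
    [0.02, 0.01, 0.0],   # roughly 0.03, just above the 0.025 threshold
]

kept = kept_dimensions(W)
reduced_dimension = len(kept)
```

The threshold is relative to the largest saliency rather than absolute, so the same rule applies regardless of the overall scale of the trained weights.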

## 5 Examples

Here we illustrate the efficacy of our AMR-PG algorithm described in Section 4 with three examples: (i) an illustrative discrete navigation problem, (ii) a continuous navigation problem with synthetic environments, and (iii) vision-based navigation in an apartment using iGibson, a photo-realistic simulator [Xia20]. In these examples, we show that AMR-PG is able to reduce the dimension of the memory representation and potentially find qualitatively different policies (i.e., policies that actively reduce memory) as compared to standard policy gradient methods with the same parameterizations. Details regarding the neural networks and training procedures are discussed for each example in Appendix A.

### 5.1 Discrete Navigation

In this first example, we specialize our method to discrete spaces in order to illustrate policies that achieve AMR. Specifically, we consider an illustrative example from [OKane17], where a robot must navigate to a goal location in a grid as shown in Figure 2(c). The robot is equipped with a goal indicator, e.g., means that the robot is at the goal. The robot’s state, , is described by its cell position. Additionally, the robot takes discrete actions corresponding to up, right, down, left, and stop respectively. The state evolves with dynamics . The cost for this scenario is for where the robot is initialized at and must navigate to goal .

The goal here is to synthesize a policy that takes the form of a deterministic, Moore-style finite state machine as described by [OKane17] and shown in Figure 2(b). In this context, a memory-optimal policy is defined as one that requires the fewest number of memory states. For this task, an example of a memory-optimal policy is one that simply alternates between actions up and right until the goal is observed [OKane17]. This policy only requires two memory states (not including the starting and terminal states): one for action up and one for action right; see Figure 2(b). In contrast, a policy that chooses to repeat up, up, right, right is more complex as it requires keeping track of how many times an action has been applied; such a policy needs at least four memory states. We demonstrate that our approach recovers the optimal two-state policy identified in [OKane17]. However, in contrast to [OKane17], our method also handles continuous state, action, and observation spaces (considered in subsequent examples).
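The two-state alternating policy can be written as a small Moore machine, where the action depends only on the current memory state and the transition depends on the goal indicator. The sketch below is an illustration only: the 5-by-5 grid, the start and goal cells, the state names, and the choice to begin directly in the "up" state (omitting a separate start state) are assumptions, not details taken from [OKane17].

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]

# Moore machine output map: each memory state emits exactly one action.
ACTION_OF: Dict[str, str] = {"UP": "up", "RIGHT": "right", "DONE": "stop"}
MOVES: Dict[str, Cell] = {"up": (0, 1), "right": (1, 0), "stop": (0, 0)}


def next_memory(mem: str, at_goal: bool) -> str:
    """Transition on the goal indicator: terminate at the goal, otherwise alternate."""
    if at_goal:
        return "DONE"
    return "RIGHT" if mem == "UP" else "UP"


def run(start: Cell, goal: Cell, max_steps: int = 50) -> Tuple[Cell, List[str]]:
    """Roll out the machine on an obstacle-free grid and record the actions."""
    pos = start
    mem = "DONE" if start == goal else "UP"
    actions: List[str] = []
    for _ in range(max_steps):
        action = ACTION_OF[mem]
        actions.append(action)
        if action == "stop":
            break
        dx, dy = MOVES[action]
        pos = (pos[0] + dx, pos[1] + dy)
        mem = next_memory(mem, pos == goal)
    return pos, actions


final_pos, actions = run(start=(0, 0), goal=(4, 4))
```

Only the "UP" and "RIGHT" states carry information during the run, which is why this policy needs two memory states apart from the terminal one. A policy that repeats up, up, right, right would need extra states to remember whether the current action has already been applied once.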

Training and Results. We model the memory representation with a one-hot encoding vector that indicates which memory state the robot is using (as opposed to the continuous memory states described in Section 4). For the memory representation mapping, $f$, we pass the observations to the memory layer of size 10, where 10 is the maximum number of memory states this task could have (a start state, a state for each time step, and a terminal state). We use the -Softmax activation function on the memory layer with to encourage concentration around one explicit state for $m_t$. Then we pass $m_t$ to a fully connected layer with 5 neurons activated by a Softmax function. The output is treated as a categorical distribution that we sample the actions from.

We summarize our training results for 20 seeds in Figure 3. To count the memory states used and recover the deterministic Moore machine, we took the argmax of the memory and action layer outputs. For each seed, PG found a cost-optimal policy to the goal (see Figure 2(c)) but required between two and five memory states. In contrast, AMR-PG always found the memory-optimal policy.

### 5.2 Maze Navigation with RGB-Depth Array

Our next example focuses on a differential-drive robot navigating through a maze (Figure 1). The maze is 10 m × 10 m with one red and one blue obstacle sampled within the shaded regions indicated in Figure 1(a). The robot is given a fixed linear velocity of 2 m/s and has time steps (s each) to reach the green goal in the upper right corner of the maze. To meet this objective, we model the cost as the Euclidean distance between the robot’s position and the goal location normalized by the initial distance to the goal for all time steps. Additionally, the robot has control of its angular velocity and is equipped with a fov RGB-depth sensor that outputs colors and depths along 17 rays. The simulations are performed using PyBullet [Coumans18].

Qualitatively, there are two policies that are sufficient for successfully navigating to the goal: (i) diagonally navigating through the obstacles to the goal, and (ii) following the maze wall to the goal. Figure 1(b) and the video linked in Appendix B illustrate these policies.

Results. We compare AMR-PG with a standard PG method that uses the same neural network parameterization (hyperparameters are provided in the Appendix). The average cost and final normalized distance to the goal (across five seeds) for training and testing scenarios are summarized in Table 1. For four out of five seeds, PG found the cost-optimal solution of diagonally navigating through the obstacles. (The other seed found the wall-following policy as a result of minimal exploration outside of the far left portion of the maze.) In contrast, AMR-PG consistently found the wall-following policy and significantly reduced the required memory dimension from 300 to at most 4 as shown in Figure 4. Thus, the policy found by AMR-PG only utilizes at most 1.33% of the memory used by the policy found using PG. We further evaluate the benefits in terms of generalization afforded by our approach. In particular, we test the policies on environments with obstacle colors that differ from ones seen during training. The performance of the PG policies degraded significantly. In contrast, the performance of the policies found using AMR-PG remained almost entirely unaffected. This result combined with the compact memory representation suggests that AMR-PG finds policies for this problem that actively reduce memory and only maintain task-centric representations that utilize the distance values to the wall.

| Scenario | Policy Gradient: Cost | Policy Gradient: Dist. | AMR-PG: Cost | AMR-PG: Dist. |
|---|---|---|---|---|
| Training | 35.09 ± 3.22 | 0.11 ± 0.05 | 41.95 ± 2.86 | 0.17 ± 0.11 |
| Testing | 34.99 ± 3.25 | 0.11 ± 0.04 | 42.71 ± 3.36 | 0.19 ± 0.13 |
| Testing (Swapped Colors) | 52.44 ± 5.77 | 0.53 ± 0.20 | 43.93 ± 3.35 | 0.21 ± 0.14 |
| Testing (New Colors) | 51.03 ± 4.24 | 0.53 ± 0.12 | 43.35 ± 3.38 | 0.21 ± 0.14 |

### 5.3 Vision-Based Navigation

The goal of our last example is to demonstrate AMR-PG’s ability to scale to a more realistic scenario: vision-based navigation in a photo-realistic simulation environment. In this example, a TurtleBot is randomly initialized in the hallway of the Placida apartment in iGibson [Xia20] and needs to navigate to the kitchen as shown in Figure 4(a). Specifically, the TurtleBot’s initial position is sampled uniformly between the set while the position and yaw are fixed such that the TurtleBot is centered in the hallway and facing the tables. The cost is described by a weighted sum of a sparse goal reward of 100, a term that rewards progress towards the goal (as measured by geodesic distance), a collision cost, and an angle (yaw) cost. The control actions specify linear and angular velocities. Additionally, the TurtleBot is equipped with a 90° fov RGB-D camera with a resolution of 128 × 128. We preprocess these observations with a convolutional neural network before passing them to our AMR policy network. For more details, see Appendix A.3.

Results. For this example, AMR-PG found a significant memory reduction from 100 dimensions down to 34 consistently across five seeds, a memory savings of 66% (see Figure 4(b)). Importantly, these savings did not impact the performance of the policy. On average, AMR-PG obtained a reward of on 20 initial states sampled from the same range seen in training, while PG obtained a similar reward of on the same initial states. We also initialized the robot from 20 states drawn from an enlarged set of initial conditions: . In this case, the TurtleBot only collided once using the AMR-PG policies from the five seeds. The policies found using PG resulted in five collisions (using the same set of initial states). Thus, the policies found by AMR-PG achieve effective dimensionality reduction and show potential for improved generalization across different initial conditions. We refer the reader to Appendix B for a video of these results.

## 6 Conclusion

We presented a reinforcement learning approach for jointly synthesizing a low-dimensional memory representation and a policy for a given task. This joint synthesis allows one to find policies that actively seek to reduce memory requirements. The key insight of our approach is to leverage the group LASSO regularization to encourage drop-out of neurons at the memory layer while simultaneously finding policies via a policy gradient approach. We refer to this new algorithm as AMR-PG. Additionally, we demonstrate our approach on discrete and continuous navigation problems, including vision-based navigation in a photorealistic simulator. Comparing AMR-PG and standard PG, we demonstrate that our approach can find low-dimensional representations (e.g., from 300 dimensions down to 4) and find qualitatively different policies (e.g., wall-following instead of obstacle avoiding).

Future Work. There are several interesting future directions for this work. One immediate extension is to find AMR policies with actor-critic architectures (e.g., using PPO [Schulman17]), and more complex memory network architectures (e.g., LSTMs [Hochreiter97]). On the practical front, we are excited to work towards the deployment of AMR policies on resource-constrained robotic platforms such as micro aerial vehicles. An important step for this is demonstrating that the AMR policies scale well to long-horizon tasks. Another potential step is to explore the benefits of pairing AMR policies with recent advances in integrated circuits that address memory accessing bottlenecks for neural networks (e.g., [Zhang17]). Lastly, a particularly exciting direction is to explore whether our approach leads to policies that are more interpretable (since they only maintain low-dimensional memory representations) by visualizing features that impact the memory representation (e.g., using saliency maps [Simonyan13]).

This work is partially supported by the National Science Foundation [IIS-1755038], and the School of Engineering and Applied Science at Princeton University through the generosity of William Addy ’82.

## References

## Appendix

## Appendix A Training Summaries

### A.1 Training Parameters

The hyperparameters used in the examples are detailed in Table 2.

| Example | Learning Rate | AMR Rate ($\lambda$) | Max Memory Dimension | Max Epochs | # Rollouts |
|---|---|---|---|---|---|
| Discrete Nav. | | | 10 | 300 | 100 |
| Maze Nav. | | | 300 | 3000 | 250 |
| Vision-Based Nav. | | | 100 | 6000 | 3 |

### A.2 Maze Navigation with RGB-Depth Array Example

Here, we describe our network implementation for the maze navigation example. For the input to the neural network, we feed in a flattened RGB-depth sensor observation (along 17 rays) so that $y_t \in \mathbb{R}^{68}$. We additionally augment $y_t$ with the previous action so that $y_t \in \mathbb{R}^{69}$. The recurrent network, $f$, has two hidden layers with 369 neurons each with exponential linear unit and nonlinearities respectively. The memory output layer has 300 neurons and also has a nonlinearity. This is then fully connected to the output layer with 2 neurons. For turning rate control, we treat the 2-dimensional network output as a Gaussian distribution.

### A.3 Vision-Based Navigation Example

Before $f$, we preprocess the image and depth sensor observations with a convolutional neural network containing three layers using filter sizes 32, 64, and 64, kernel sizes 8, 4, and 3, and strides 4, 2, and 2 respectively. The output is then fully connected to a layer of size 256 activated by a nonlinearity. This is then treated as $y_t$ and passed to the recurrent network, $f$, with a memory of size 100. Similar to the maze navigation example, we pass the memory state, $m_t$, to a fully connected network whose output is treated as a multivariate Gaussian distribution for applying the linear and angular velocities. For implementing AMR-PG, we modify the REINFORCE agent in the TF-Agents TensorFlow library [TFAgents].

In this example, we use a weighted sum of various reward types including a sparse goal reward, a geodesic potential, a collision cost, and a yaw cost. The weights are given in Table 3. We model the geodesic potential as the previous geodesic distance minus the current geodesic distance. The collision cost is a penalty applied to any time step with a collision. The yaw cost is also applied at each time step and describes the angular difference between the TurtleBot’s yaw angle and the angle needed to get to the goal. We use this cost to encourage the TurtleBot to move forward to the goal.

| | Sparse Goal | Geodesic Potential | Collision | Yaw |
|---|---|---|---|---|
| Weight | 100 | 30 | -0.5 | -0.2 |
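As a rough illustration of how the weighted sum might be evaluated at a single time step, the sketch below combines the four terms with the weights from Table 3. How each raw term is scaled is an assumption here: the goal and collision terms are treated as 0/1 indicators and the yaw term as an absolute angular error in radians. The function and argument names are hypothetical and do not correspond to the iGibson or TF-Agents APIs.

```python
from typing import Dict

# Weights as listed in Table 3.
WEIGHTS: Dict[str, float] = {
    "goal": 100.0,
    "geodesic": 30.0,
    "collision": -0.5,
    "yaw": -0.2,
}


def step_reward(prev_geo: float, curr_geo: float, reached_goal: bool,
                collided: bool, yaw_error: float) -> float:
    """Weighted sum of the four reward terms for one time step."""
    terms = {
        "goal": 1.0 if reached_goal else 0.0,   # assumed 0/1 indicator
        "geodesic": prev_geo - curr_geo,        # progress towards the goal
        "collision": 1.0 if collided else 0.0,  # assumed 0/1 indicator
        "yaw": abs(yaw_error),                  # assumed absolute error in radians
    }
    return sum(WEIGHTS[name] * terms[name] for name in WEIGHTS)


# Moving 0.1 closer to the goal with a small heading error and no collision.
r_progress = step_reward(prev_geo=2.0, curr_geo=1.9, reached_goal=False,
                         collided=False, yaw_error=0.1)

# Standing still while colliding, perfectly aligned with the goal direction.
r_collision = step_reward(prev_geo=1.0, curr_geo=1.0, reached_goal=False,
                          collided=True, yaw_error=0.0)
```

Under these assumptions the geodesic term dominates ordinary steps, while the negative collision and yaw weights act as small per-step penalties.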

## Appendix B Supplementary Material

A video of our results is available at: https://youtu.be/x5yYhLoG6jY

Our code is available at: https://github.com/irom-lab/AMR-Policies
