# Locally Persistent Exploration in Continuous Control Tasks with Sparse Rewards

A major challenge in reinforcement learning is the design of exploration strategies, especially for environments with sparse reward structures and continuous state and action spaces. Intuitively, if the reinforcement signal is very scarce, the agent should rely on some form of short-term memory in order to cover its environment efficiently. We propose a new exploration method, based on two intuitions: (1) the choice of the next exploratory action should depend not only on the (Markovian) state of the environment, but also on the agent's trajectory so far, and (2) the agent should utilize a measure of spread in the state space to avoid getting stuck in a small region. Our method leverages concepts often used in statistical physics to provide explanations for the behavior of simplified (polymer) chains, in order to generate persistent (locally self-avoiding) trajectories in state space. We discuss the theoretical properties of locally self-avoiding walks, and their ability to provide a kind of short-term memory, through a decaying temporal correlation within the trajectory. We provide empirical evaluations of our approach in a simulated 2D navigation task, as well as higher-dimensional MuJoCo continuous control locomotion tasks with sparse rewards.

## 1 Introduction

As reinforcement learning agents typically learn tasks through interacting with the environment and receiving reinforcement signals, a fundamental problem arises when these signals are rarely available. The sparsely distributed rewards call for a clever exploration strategy that exposes the agent to the unseen regions of the space by keeping track of the visited state-action pairs [fu2017ex2, nair2018overcoming]. However, such bookkeeping is not feasible in high-dimensional continuous state-and-action spaces, as defining a notion of density for such tasks is intractable and heavily task-dependent [NIPS2017_7090, taiga2019benchmarking].

Here, we introduce an exploration algorithm that works independently of the extrinsic rewards received from the environment and is inherently compatible with continuous state-and-action tasks. Our proposed approach takes into account the agent’s short-term memory regarding the action trajectory, as well as the trajectory of the observed states, in order to sample the next exploratory action. The main intuition is that in a pure exploration mode with minimal extrinsic reinforcement, the agent should plan trajectories that expand in the available space and avoid getting stuck in small regions. In other words, the agent may need to be “persistent” in its choice of actions; for example, in a locomotion task, an agent may want to pick a certain direction and maintain it for some number of steps, in order to ensure that it can move away from its current location, where it might be stuck. The second intuition is that satisfying the first condition requires a measure of spread in the state space to warrant the agent’s exposure to unvisited regions. Moreover, in sparse-reward settings, while the agent’s primary intention must be to avoid being trapped in local regions by maintaining a form of short-term memory, it must still employ a form of memory evaporation mechanism to maintain the possibility of revisiting the informative states. Note that in continuous state-and-action settings, modern exploration methods [ostrovski2017count, houthooft2016vime, ciosek2019better] fail to address the aforementioned details simultaneously.

Our polymer-based exploration technique (PolyRL) is inspired by the theory of freely-rotating chains (FRCs) in polymer physics to implement the aforementioned intuitions. FRCs describe the chains (collections of transitions or moves) whose successive segments are correlated in their orientation. This feature introduces a finite (short-term) stiffness (persistence) in the chain, which induces what we call locally self-avoiding walks (LSAWs). The strategy that emerges from PolyRL provides consistent movement, without the need for exact action repeats (e.g. methods suggested by [dabney2020temporally, lakshminarayanan2017dynamic]), and can adapt the rigidity of the chain as required. Moreover, unlike action-repeat strategies, PolyRL is inherently applicable in continuous action-state spaces without the need to use any discrete representation of action or state space. The local self-avoidance property in a PolyRL trajectory cultivates an orientationally persistent move in the space while maintaining the possibility of revisiting the places visited before. In particular, in constructing LSAWs, PolyRL selects persistent actions in the action space and utilizes a measure of spread in the state space, called the radius of gyration, to maintain the (orientational) persistence in the chain of visited states. The PolyRL agent breaks the chain and performs greedy action selection once the correlation between the visited states breaks. The next few exploratory actions that follow afterward, in fact, act as a perturbation to the last greedy action, which consequently preserves the orientation of the greedy action. This feature becomes explicitly influential after the agent is exposed to some reinforcement, and the policy is updated, as the greedy action guides the agent’s movement through the succeeding exploratory chain of actions.

## 2 Notation and Problem Formulation

We consider the usual MDP setting, in which an agent interacts with a Markov Decision Process $\langle \mathcal{S}, \mathcal{A}, P, r \rangle$, where $\mathcal{S}$ and $\mathcal{A}$ are continuous state and action spaces, respectively; $P$ represents the transition probability kernel, and $r$ is the reward function. Moreover, we make a smoothness assumption on $P$,

###### Assumption 1.

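The transition probability kernel $P$ is Lipschitz w.r.t. its action variable, in the sense that there exists $C > 0$ such that for all $(s, a, a') \in \mathcal{S} \times \mathcal{A} \times \mathcal{A}$ and measurable set $B \subset \mathcal{S}$,

$$|P(B \mid s, a) - P(B \mid s, a')| \le C \, \|a - a'\|. \tag{1}$$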

Assumption 1 has been used in the literature for learning in domains with continuous state-action spaces [antos2008fitted], as the assumption on the smoothness of the MDP becomes crucial in such environments [antos2008fitted, bartlett2007sample, ng2000pegasus]. Note that we only use the Lipschitz smoothness assumption on $P$ for theoretical convenience, and we later provide experimental results in environments that do not satisfy this assumption. Furthermore, we assume that the state and action spaces are inner product spaces.

We denote the trajectory of selected actions in the action space $\mathcal{A}$ by $\tau_{\mathcal{A}}$ and the trajectory of the visited states in the state space $\mathcal{S}$ by $\tau_{\mathcal{S}}$. Moreover, we define

$$\Omega(\tau_{\mathcal{S}}, \tau_{\mathcal{A}}) := \{\, s \in \mathcal{S} \mid \Pr[S_T = s \mid s_{T-1}, a_{T-1}] > 0 \,\} \tag{2}$$

as the set of probable states observed at time $T$ given $s_{T-1}$ from $\tau_{\mathcal{S}}$ and the selected action $a_{T-1}$ from $\tau_{\mathcal{A}}$. Note that $T$ is the number of time steps since the start of the respective piece-wise exploratory trajectory, and is reset at each exploitation step. For simplicity, in the rest of this manuscript, we denote $\Omega(\tau_{\mathcal{S}}, \tau_{\mathcal{A}})$ by $\Omega$. In addition, the concatenation of the trajectory $\tau_{\mathcal{S}}$ and the state $s_T$ visited at time step $T$ is denoted as the trajectory $\tau'_{\mathcal{S}}$. Moreover, for the purposes of theoretical analysis, and in order to show the expansion of the visited-states trajectory in the state space, we choose to transform $\tau_{\mathcal{S}}$ into a sequence of vectors connecting every two consecutive states (bond vectors),

$$\omega_{\tau_{\mathcal{S}}} = \{\omega_i\}_{i=1}^{T-1}, \quad \text{where} \quad \omega_i = s_i - s_{i-1}. \tag{3}$$
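As a small illustration of equation 3, the following sketch converts a toy state trajectory into its bond vectors. The function name `bond_vectors` and the toy 2D trajectory are assumptions made for illustration; they are not part of the original implementation.

```python
import numpy as np


def bond_vectors(states):
    """Return omega_i = s_i - s_{i-1} for states s_0, ..., s_{T-1} (shape (T, d))."""
    states = np.asarray(states, dtype=float)
    # Differencing along the time axis yields the T-1 consecutive displacements.
    return np.diff(states, axis=0)


# Toy 2D trajectory with T = 4 visited states (illustrative values only).
toy_states = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 1.0], [2.0, 3.0]])
toy_bonds = bond_vectors(toy_states)
print(toy_bonds)
```

Summing the bond vectors recovers the displacement between the first and last visited states, which is the end-to-end vector of the chain.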

Finally, we define Locally Self-Avoiding Random Walks (LSA-RWs), inspired by the theory of freely-rotating chains (FRCs), where the correlation between consecutive bond vectors is a decaying exponential with the correlation number $L_p$ (persistence number). $L_p$ represents the number of time steps after which the bond vectors forget their initial orientation.

###### Definition 1 (Locally Self-Avoiding Random Walks (LSA-RWs)).

A sequence of random bond vectors $\omega = \{\omega_i\}_{i=1}^{T}$ is Locally Self-Avoiding with persistence number $L_p$, if for all $i, j$ there exists $b_o > 0$ such that, 1) $\langle \|\omega_i\|^2 \rangle = b_o^2$ and 2) $\langle \omega_i \cdot \omega_j \rangle \sim b_o^2 \, e^{-|j-i|/L_p}$, where $\langle \cdot \rangle$ denotes the ensemble average over all configurations of the chain induced by the dynamic randomness of the environment (equivalent to the thermal fluctuations in the environment in statistical physics).

The first condition states that the expected squared magnitude of each bond vector is $b_o^2$. The second condition shows that the correlation between consecutive bond vectors is a decaying exponential with the correlation length $L_p$ (persistence number), which is the number of time steps after which the bond vectors forget their original orientation. Note that despite the redundancy of the first condition, we choose to include it separately in order to emphasize the finite expected magnitude of the bond vectors.

## 3 Methods

We introduce the method of polymer-based exploration in reinforcement learning (PolyRL), which borrows concepts from Statistical Physics [de1979scaling, doi1988theory] to induce persistent trajectories in continuous state-action spaces (Refer to the Appendix (Section A) for more information regarding polymer models). As discussed below, our proposed technique balances exploration and exploitation using high-probability confidence bounds on a measure of spread in the state space. Algorithm 1 presents the PolyRL pseudo code. The method of action sampling is provided in the Appendix (Section C).

The PolyRL agent chooses the sequence of actions in $\mathcal{A}$ such that every two consecutive action vectors are restricted in their orientation with the mean [correlation] angle $\theta$. In order to induce persistent trajectories in the state space, the agent uses a measure of spread in the visited states to ensure the desired expansion of the trajectory in $\mathcal{S}$. We define the radius of gyration squared,

$$U_g^2(\tau_{\mathcal{S}}) := \frac{1}{T-1} \sum_{s \in \tau_{\mathcal{S}}} d^2(s, \bar{\tau}_{\mathcal{S}}), \tag{4}$$

as a measure of the spread of the visited states in the state space $\mathcal{S}$, where $d$ is a metric defined on the state space $\mathcal{S}$, and serves as a measure of distance between a visited state $s$ and the empirical mean of all visited states $\bar{\tau}_{\mathcal{S}}$. Also known as the center of mass, $\bar{\tau}_{\mathcal{S}}$ is calculated as $\bar{\tau}_{\mathcal{S}} := \frac{1}{T} \sum_{s \in \tau_{\mathcal{S}}} s$.
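The spread measure in equation 4 can be sketched in a few lines. This is a minimal illustration that assumes the Euclidean distance for $d$ and a trajectory of $T$ visited states stored as rows of an array; the function name `radius_of_gyration_sq` is hypothetical.

```python
import numpy as np


def radius_of_gyration_sq(states):
    """Radius of gyration squared of visited states (rows), Euclidean metric assumed."""
    states = np.asarray(states, dtype=float)
    n_states = states.shape[0]
    if n_states < 2:
        raise ValueError("need at least two visited states")
    # Center of mass: empirical mean of all visited states.
    center_of_mass = states.mean(axis=0)
    sq_dists = np.sum((states - center_of_mass) ** 2, axis=1)
    # Normalization by T - 1, following equation 4.
    return float(sq_dists.sum() / (n_states - 1))


# A tightly clustered trajectory versus one that keeps moving in one direction.
compact = radius_of_gyration_sq([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]])
spread = radius_of_gyration_sq([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
print(compact, spread)
```

A persistent trajectory yields a visibly larger value than one that lingers near its starting point, which is the property the agent monitors from step to step.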

At each time step, the agent calculates the radius of gyration squared (equation 4) and compares it with the value obtained at the previous time step. If the induced trajectory in the state space is an LSA-RW, it maintains an expected stiffness described by a bounded change in $U_g^2$. Theorems 3 and 4 show high-probability confidence bounds on the upper local sensitivity $ULS_{U_g^2}$ and the lower local sensitivity $LLS_{U_g^2}$, respectively. The lower bound ensures that the chain remains an LSA-RW and expands in the space, while the upper bound prevents the agent from moving abruptly in the environment (the advantage of using LSA-RWs to explore the environment can be explained in terms of their high expansion rate, which is presented in Proposition 7 in the Appendix). If the computed change in $U_g^2$ lies in the range $[LB, UB]$, the agent continues to perform the PolyRL action sampling method (Algorithm 2 in the Appendix). Otherwise, it samples the next action using the target policy. The factor $\delta$ that arises in equations 7 and 11 controls the tightness of the confidence interval. In order to balance the trade-off between exploration and exploitation, we choose to increase $\delta$ with time, as increasing $\delta$ leads to tighter bounds and thus a higher probability of exploitation. In addition, the exploration factor determines the probability of switching from exploitation back to starting a new exploratory trajectory. Due to the persistence of the PolyRL chain of exploratory actions, the orientation of the last greedy action is preserved for $L_p$ (persistence number) steps. As for the trajectories in the state space, upon observing correlation angles above a threshold, the exploratory trajectory is broken and the next action is chosen with respect to the target policy.
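The switching rule described above can be sketched as a simple interval test on the change in the radius of gyration squared. This is a hedged illustration only: the function name `next_action_source` and the string labels are assumptions, and the bounds are taken as given inputs rather than computed here.

```python
def next_action_source(ug2_prev, ug2_new, lower_bound, upper_bound):
    """Decide where the next action comes from, given the change in U_g^2.

    Returns "polyrl" while the change stays inside [lower_bound, upper_bound],
    and "target_policy" once the exploratory chain should be broken.
    """
    delta_ug2 = ug2_new - ug2_prev
    if lower_bound <= delta_ug2 <= upper_bound:
        return "polyrl"
    return "target_policy"


# Illustrative values only: a moderate expansion keeps exploring, while a
# shrinking or abruptly growing spread hands control to the target policy.
print(next_action_source(1.00, 1.05, lower_bound=0.01, upper_bound=0.20))
print(next_action_source(1.00, 0.98, lower_bound=0.01, upper_bound=0.20))
print(next_action_source(1.00, 1.50, lower_bound=0.01, upper_bound=0.20))
```

Tightening the interval, which is what increasing the confidence factor does, makes the second and third outcomes more frequent and therefore shifts the agent toward exploitation.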

## 4 Related Work

A wide range of exploration techniques with theoretical guarantees (e.g. PAC bounds) have been developed for MDPs with finite or countably infinite state or action spaces [kearns2002near, brafman2002r, strehl2004exploration, lopes2012exploration, azar2017minimax, white2010interval, Wang2020Q-learning]; however, extending these algorithms to real-world settings with continuous state and action spaces without any assumption on the structure of state-action spaces or the reward function is impractical [haarnoja2017reinforcement, VIME].

Perturbation-based exploration strategies are, by nature, agnostic to the structure of the underlying state-action spaces and are thus suitable for continuous domains. Classic perturbation-based exploration strategies typically involve a perturbation mechanism at either the action-space or the policy parameter-space level. These methods subsequently employ stochasticity at the policy level as the main driving force for exploration [deisenroth2013survey]. Methods that apply perturbation at the parameter level often preserve the noise distribution throughout the trajectory [ng2000pegasus, sehnke2010parameter, theodorou2010generalized, fortunato2017noisy, colas2018gep, plappert2018parameter], and do not utilize the trajectory information in this regard. Furthermore, a majority of action-space perturbation approaches employ per-step independent or correlated perturbation [wawrzynski2009real, lillicrap2015continuous, haarnoja2017reinforcement, xu2018learning]. For instance, [lillicrap2015continuous] uses the Ornstein–Uhlenbeck (OU) process to produce auto-correlated additive noise at the action level and thus benefit from the correlated noise between consecutive actions.

While maintaining correlation between consecutive actions is advantageous in many locomotion tasks [morimoto2001acquisition, kober2013reinforcement, gupta2018meta], it brings technical challenges due to the non-Markovian nature of the induced decision process [perez2017non], which leads to substantial dependence on the history of the visited states. This challenge is partially resolved in the methods that benefit from some form of parametric memory (e.g. OU processes) [plappert2018parameter, lillicrap2015continuous]. However, they all suffer from the lack of informed orientational persistence, i.e. the agent’s persistence in selecting actions that preserve the orientation of the state trajectory induced by the target policy. Particularly in sparse-reward structured environments, where the agent rarely receives informative signals, the agent will eventually get stuck in a small region since the analytical change on the gradient of greedy policy is minimal [hare2019dealing, kang2018policy]. Hence, the agent’s persistent behaviour in action sampling with respect to the local history (short-term memory) of the selected state-action pairs plays a prominent role in exploring the environments with sparse or delayed reward structures.

## 5 Theory

In this section, we derive the upper and lower confidence bounds on the local sensitivity for the radius of gyration squared between $\tau_{\mathcal{S}}$ and $\tau'_{\mathcal{S}}$ (all proofs are provided in the Appendix (Section B)). Given the trajectory $\tau_{\mathcal{S}}$ and the corresponding radius of gyration squared $U_g^2(\tau_{\mathcal{S}})$ and persistence number $L_{p_{\tau_{\mathcal{S}}}}$, we seek a description for the permissible range of $U_g^2(\tau'_{\mathcal{S}})$ such that the stiffness of the trajectory is preserved.

High-probability upper bound - We define the upper local sensitivity on $U_g^2$ upon observing new state $s_T$ as,

$$ULS_{U_g^2}(\tau_{\mathcal{S}}) := \sup_{s_T \in \Omega} \; U_g^2(\tau'_{\mathcal{S}}) - U_g^2(\tau_{\mathcal{S}}). \tag{5}$$

Given the observed state trajectory $\tau_{\mathcal{S}}$ with persistence number $L_{p_{\tau_{\mathcal{S}}}}$, the upper local sensitivity $ULS_{U_g^2}$ provides the maximum change observed upon visiting the next accessible state $s_T \in \Omega$. With the goal of constructing the new trajectory $\tau'_{\mathcal{S}}$ such that it preserves the stiffness induced by $\tau_{\mathcal{S}}$, we calculate the high-probability upper confidence bound on $ULS_{U_g^2}$. To do so, we write the term $d^2(s_T, \bar{\tau}_{\mathcal{S}})$ in equation 4 as a function of the bond vectors $\omega_i$, which is presented in lemma 2 given that $d$ is the $L^2$-norm. We further substitute the resulting $U_g^2(\tau'_{\mathcal{S}})$ in equation 5 with the obtained expression from equation 4.

###### Lemma 2.

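Let $\tau_{\mathcal{S}} = (s_0, \dots, s_{T-1})$ be the trajectory of visited states, $s_T$ be a newly visited state and $\omega_i = s_i - s_{i-1}$ be the bond vector that connects two consecutive visited states $s_{i-1}$ and $s_i$. Then we have,

$$\| s_T - \bar{\tau}_{\mathcal{S}} \|^2 = \left\| \omega_T + \frac{1}{T} \left[ \sum_{i=1}^{T-1} i \, \omega_i \right] \right\|^2. \tag{6}$$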
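The identity in equation 6 can be checked numerically. The sketch below assumes a trajectory of $T$ states $s_0, \dots, s_{T-1}$ followed by a new state $s_T$, with the center of mass taken as the plain average of the first $T$ states; the random toy data and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 6                                         # number of states already in the trajectory
all_states = rng.normal(size=(T + 1, 3))      # s_0, ..., s_T in a 3D toy state space
tau, s_new = all_states[:T], all_states[T]

# omega[i - 1] holds the bond vector omega_i = s_i - s_{i-1}, for i = 1, ..., T.
omega = np.diff(all_states, axis=0)

# Left-hand side: squared distance of the new state from the center of mass.
lhs = np.sum((s_new - tau.mean(axis=0)) ** 2)

# Right-hand side: omega_T plus the index-weighted sum of earlier bond vectors over T.
idx = np.arange(1, T)                         # i = 1, ..., T-1
weighted_sum = (idx[:, None] * omega[: T - 1]).sum(axis=0)
rhs = np.sum((omega[T - 1] + weighted_sum / T) ** 2)

print(lhs, rhs)
```

The two values agree up to floating-point error because $s_{T-1} - \bar{\tau}_{\mathcal{S}}$ telescopes into $\frac{1}{T}\sum_{i=1}^{T-1} i\,\omega_i$: each bond vector $\omega_i$ is counted once for every earlier state $s_0, \dots, s_{i-1}$.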

The result of lemma 2 will be used in the proof of theorem 3, as shown in the Appendix (Section B.2). In theorem 3 below, we provide a high-probability upper bound for $ULS_{U_g^2}$.

###### Theorem 3 (Upper-Bound Theorem).

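Let $\delta \in (0, 1)$ and $\tau_{\mathcal{S}}$ be an LSA-RW in $\mathcal{S}$ induced by PolyRL with the persistence number $L_{p_{\tau_{\mathcal{S}}}}$ within an episode, $\omega_{\tau_{\mathcal{S}}} = \{\omega_i\}_{i=1}^{T-1}$ be the sequence of corresponding bond vectors, where $T$ denotes the number of bond vectors within $\tau_{\mathcal{S}}$, and $b_o$ be the average bond length. The upper confidence bound for $ULS_{U_g^2}(\tau_{\mathcal{S}})$ with probability of at least $1 - \delta$ is,

$$UB = \Lambda(T, \tau_{\mathcal{S}}) + \frac{1}{\delta} \left[ \Gamma(T, b_o, \tau_{\mathcal{S}}) + \frac{2 b_o^2}{T^2} \sum_{i=1}^{T-1} i \, e^{-\frac{(T-i)}{L_{p_{\tau_{\mathcal{S}}}}}} \right], \tag{7}$$

where,

$$\Lambda(T, \tau_{\mathcal{S}}) = -\frac{1}{T-1} \, U_g^2(\tau_{\mathcal{S}}) \tag{8}$$

$$\Gamma(T, b_o, \tau_{\mathcal{S}}) = \frac{b_o^2}{T} + \frac{\left\| \sum_{i=1}^{T-1} i \, \omega_i \right\|^2}{T^3} \tag{9}$$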

Equation 7 provides an upper bound on the pace of the trajectory expansion in the state space, and thus prevents the agent from moving abruptly in the environment, which would lead to a break in its temporal correlation with the preceding states. Similarly, we introduce the lower local sensitivity $LLS_{U_g^2}$, which provides the minimum change observed upon visiting the next accessible state $s_T \in \Omega$.
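For concreteness, the upper bound of equation 7 can be evaluated as in the following sketch. It assumes the bond vectors $\omega_1, \dots, \omega_{T-1}$ are stored as rows of an array (so that $T$ is one more than the number of rows) and that the current radius of gyration squared is supplied by the caller; the function name `upper_bound` is hypothetical.

```python
import numpy as np


def upper_bound(ug2, omega, b_o, L_p, delta):
    """Evaluate UB from equation 7 for bond vectors omega_1..omega_{T-1} (rows)."""
    omega = np.asarray(omega, dtype=float)
    T = omega.shape[0] + 1
    idx = np.arange(1, T)                                  # i = 1, ..., T-1
    weighted = (idx[:, None] * omega).sum(axis=0)          # sum_i i * omega_i
    lam = -ug2 / (T - 1)                                   # Lambda, equation 8
    gamma = b_o**2 / T + weighted @ weighted / T**3        # Gamma, equation 9
    corr = 2.0 * b_o**2 / T**2 * np.sum(idx * np.exp(-(T - idx) / L_p))
    return lam + (gamma + corr) / delta


# Single unit bond vector (T = 2), unit bond length and persistence number.
one_bond = np.array([[1.0, 0.0]])
print(upper_bound(0.0, one_bond, b_o=1.0, L_p=1.0, delta=0.5))
```

Because the bracketed term is scaled by $1/\delta$, a larger $\delta$ lowers the bound and narrows the range of accepted expansions.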

High-probability lower bound - We define the lower local sensitivity on $U_g^2$ upon observing new state $s_T$ as,

$$LLS_{U_g^2}(\tau_{\mathcal{S}}) := \inf_{s_T \in \Omega} \; U_g^2(\tau'_{\mathcal{S}}) - U_g^2(\tau_{\mathcal{S}}). \tag{10}$$

We further compute the high-probability lower confidence bound on $LLS_{U_g^2}$ in order to guarantee the expansion of the trajectory upon visiting the next state.

###### Theorem 4 (Lower-Bound Theorem).

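Let $\delta \in (0, 1)$ and $\tau_{\mathcal{S}}$ be an LSA-RW in $\mathcal{S}$ induced by PolyRL with the persistence number $L_{p_{\tau_{\mathcal{S}}}}$ within an episode, $\omega_{\tau_{\mathcal{S}}} = \{\omega_i\}_{i=1}^{T-1}$ be the sequence of corresponding bond vectors, where $T$ denotes the number of bond vectors within $\tau_{\mathcal{S}}$, and $b_o$ be the average bond length. The lower confidence bound for $LLS_{U_g^2}(\tau_{\mathcal{S}})$ with probability of at least $1 - \delta$ is,

$$LB = \Lambda(T, \tau_{\mathcal{S}}) + \left(1 - \sqrt{2 - 2\delta}\right) \times \left[ \Gamma(T, b_o, \tau_{\mathcal{S}}) + \frac{(T-1)(T-2)}{T^2} \, b_o^2 \, e^{-\frac{|T-1|}{L_{p_{\tau_{\mathcal{S}}}}}} \right], \tag{11}$$

where,

$$\Lambda(T, \tau_{\mathcal{S}}) = -\frac{1}{T-1} \, U_g^2(\tau_{\mathcal{S}}) \tag{12}$$

$$\Gamma(T, b_o, \tau_{\mathcal{S}}) = \frac{b_o^2}{T} + \frac{\left\| \sum_{i=1}^{T-1} i \, \omega_i \right\|^2}{T^3} \tag{13}$$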

Equation 11 provides a lower bound on the change in the expansion of the trajectory. Note that for experimental purposes, $\delta$ is set as a function of the exploration factor, which is a hyperparameter, and the number of elapsed episodes.
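The lower bound can be evaluated in the same spirit. The sketch below follows the expression given in Theorem 4, again assuming the bond vectors $\omega_1, \dots, \omega_{T-1}$ are stored as rows of an array and that $\delta \in (0, 1)$; the function name `lower_bound` is hypothetical.

```python
import numpy as np


def lower_bound(ug2, omega, b_o, L_p, delta):
    """Evaluate LB from Theorem 4 for bond vectors omega_1..omega_{T-1} (rows)."""
    omega = np.asarray(omega, dtype=float)
    T = omega.shape[0] + 1
    idx = np.arange(1, T)                                  # i = 1, ..., T-1
    weighted = (idx[:, None] * omega).sum(axis=0)          # sum_i i * omega_i
    lam = -ug2 / (T - 1)                                   # Lambda term
    gamma = b_o**2 / T + weighted @ weighted / T**3        # Gamma term
    corr = (T - 1) * (T - 2) / T**2 * b_o**2 * np.exp(-abs(T - 1) / L_p)
    return lam + (1.0 - np.sqrt(2.0 - 2.0 * delta)) * (gamma + corr)


# Two collinear unit bond vectors (T = 3), illustrative values only.
two_bonds = np.array([[1.0, 0.0], [1.0, 0.0]])
print(lower_bound(0.0, two_bonds, b_o=1.0, L_p=2.0, delta=0.875))
```

Note that the prefactor $1 - \sqrt{2 - 2\delta}$ vanishes at $\delta = 0.5$ and is negative below it, so the bracketed expansion term only raises the bound for $\delta > 0.5$.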

The following corollary is an immediate consequence of theorems 3 and 4 together with Assumption 1.

###### Corollary 5.

Given that assumption 1 in the manuscript is satisfied, with high probability the exploratory trajectories formed in $\mathcal{S}$ by the PolyRL algorithm are LSA-RWs.

The proof is provided and discussed thoroughly in the Appendix (Section B.4).

## 6 Experiments

As an exploration method, PolyRL should be assessed in comparison with other exploration techniques, and must therefore be paired with an off-policy learning method for which it generates data. In this section, we integrate PolyRL with three learning algorithms: (1) the Q-learning method [watkins1992q] with linear function approximation in a 2D sparse continuous state-and-action navigation task, where the performance of PolyRL is compared with that of $\epsilon$-greedy; (2) the deep deterministic policy gradients (DDPG) [lillicrap2015continuous] algorithm, where PolyRL (DDPG-PolyRL) is assessed in comparison with additive uncorrelated Gaussian action space noise (DDPG-UC), correlated Ornstein-Uhlenbeck action space noise (DDPG-OU) [uhlenbeck1930theory, lillicrap2015continuous], as well as adaptive parameter space noise (DDPG-PARAM) [plappert2017parameter]; and (3) the Soft Actor-Critic (SAC) algorithm [haarnoja2018soft], where PolyRL is combined with SAC (SAC-PolyRL) and replaces the random exploration phase during the initial steps of the SAC algorithm. The sets of experiments with the learning methods DDPG and SAC are performed in the MuJoCo high-dimensional continuous control tasks “SparseHopper-V2”, “SparseHalfCheetah-V2”, and “SparseAnt-V2” (refer to the Appendix (Section D) for the benchmarking results of the same algorithms in the standard (dense-reward) MuJoCo tasks).

Algorithm and Environment Settings - The environment in our 2D sparse-reward navigation tasks consists of either a single chamber or a room encapsulated by a chamber, with a reward assigned at the goal. Initially positioned inside the small room, the agent’s goal in the latter case is to find its way towards the bigger chamber, where the goal is located (Figure 1). Moreover, in order to make the former task more challenging, in a few experiments we introduce a puddle in the environment, which assigns its own reward to the agent upon each visit. In order to assess the agent’s performance, we integrate the PolyRL exploration algorithm with the Q-learning method with linear function approximation and compare the obtained results with those of $\epsilon$-greedy exploration with Q-learning. We subsequently plot the quantitative results for the former task and visualize the resulting trajectories for the latter.

In the sparse MuJoCo tasks, the agent receives a reward only when it crosses a target distance, termed the sparsity threshold. Different threshold values can change the level of difficulty of the tasks significantly. Note that due to the higher performance of SAC compared with that of DDPG, we have raised the sparsity threshold for SAC-based experiments, making the tasks more challenging. Moreover, we perform an exhaustive grid search over the corresponding hyperparameters for each task. The sparsity thresholds, the obtained hyperparameter values, as well as the network architecture of the learning algorithms DDPG and SAC are provided in the Appendix (Section E).
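The sparse reward structure described above amounts to a threshold test on the distance covered by the agent. The following is a minimal sketch; the function name `sparse_reward`, the argument names, and the default reward value of 1.0 are placeholders chosen for illustration and are not taken from the experimental setup.

```python
def sparse_reward(distance_covered, sparsity_threshold, reward_value=1.0):
    """Return a reward only once the agent has crossed the sparsity threshold.

    reward_value is a placeholder default; the actual value is task-specific.
    """
    if distance_covered >= sparsity_threshold:
        return reward_value
    return 0.0


# With a threshold of 2.0 (illustrative), only the last two positions are rewarded.
distances = [0.0, 0.7, 1.9, 2.0, 3.4]
rewards = [sparse_reward(d, sparsity_threshold=2.0) for d in distances]
print(rewards)
```

Raising the threshold lengthens the stretch of steps during which the agent receives no signal at all, which is why it acts as a difficulty knob.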

Results and Discussion - We present the qualitative results for the 2D sparse navigation task in figure 1. An example of a PolyRL trajectory in one episode (figure 1 (a)) demonstrates the expansion of the PolyRL agent trajectory in the environment. After training over multiple episodes, the PolyRL agent exhibits full coverage of the environment (figure 1 (b)) and is consequently able to learn the task (figure 1 (c)), while the $\epsilon$-greedy agent is not able to reach the goal state even once, and thus fails to learn the task (figure 1 (d)). This visual observation highlights the importance of space coverage in sparse-reward tasks, where the agent rarely receives informative reinforcement from the environment. An effective trajectory expansion in the environment exposes the agent to the unvisited regions of the space. It thus increases the frequency of receiving informative reinforcement and accelerates the learning process. In figure 2, the quantitative results for learning the task in a similar environment (the chamber) are shown both in the absence (figure 2 (a)) and presence (figure 2 (b)) of a puddle. In both cases, the PolyRL exploration method outperforms $\epsilon$-greedy significantly. In figure 2 (b), we observe that trajectories with lower persistence (larger $\theta$) perform better than stiffer trajectories. Note that due to the existence of walls in these 2D tasks, the transition probability kernel is non-smooth at the walls. Thus, the smoothness assumption on the transition probability kernel made earlier for theoretical convenience does not apply in these specific environments. Yet, we empirically show that PolyRL still achieves a high performance in learning these tasks.

We illustrate the results in sparse MuJoCo tasks for DDPG-based and SAC-based methods in figures 3 and 4, respectively. The obtained results in SparseHopper-V2, SparseHalfCheetah-V2, and SparseAnt-V2 show that integrating the two learning methods (SAC and DDPG) with the exploration algorithm PolyRL leads to improvement in the learning performance. The results achieved by the PolyRL exploration method indicate that the agent has been able to cover the space efficiently and sufficiently, receive informative reinforcement, and learn the tasks. The high performance of SAC-PolyRL (figure 4) is particularly significant, in the sense that PolyRL assists SAC in the data generation process only for the initial steps, after which SAC takes over and continues with the learning process. Yet, this short presence makes a notable contribution to the performance of SAC.

A notable feature of PolyRL is its relatively low sensitivity to the increase in sparsity threshold compared to that of other DDPG-based and SAC algorithms. The performance of PolyRL combined with DDPG (Figure 5) and SAC (Figure 6) is illustrated in tasks for three different sparsity thresholds. The level of complexity of the tasks increases from left to right with the sparsity threshold. As the sparsity level changes from sparse to very sparse, the graphs demonstrate a sharp decrease in the performance of PolyRL counterparts, while the PolyRL performance stays relatively stable (note that, given the reward structure in these sparse tasks and the default cap on the number of steps per episode in the gym environments, the maximum reward that an agent can obtain during each evaluation round is bounded). This behaviour of the PolyRL agent can be explained by its relatively fast expansion in the space (refer to Proposition 7 in the Appendix), which leads to faster access to the sparsely distributed rewards compared with other DDPG-based and SAC methods. On the other hand, PolyRL does not perform as well in standard tasks, where the reinforcement is accessible to the agents at each time step. Since PolyRL does not use the received rewards in its action selection process, its behaviour is not goal-directed. Thus, in environments with dense reward structures, PolyRL might skip the informative signals nearby and move on to farther regions in the space, acquiring less information and achieving lower performance. In other words, the strength of PolyRL is most observable in tasks where access to information is limited or delayed.

## 7 Conclusion

We propose a new exploration method in reinforcement learning (PolyRL), which leverages the notion of locally self-avoiding random walks and is designed for environments with continuous state-action spaces and sparse-reward structures. The most interesting aspect of our proposal is the explicit construction of each exploratory move based on the entire existing trajectory, rather than just the current observed state. While the agent chooses its next move based on its current state, the inherent locally self-avoiding property of the walk acts as an implicit memory, which governs the agent’s exploratory behaviour. Yet this locally controlled behaviour leads to an interesting global property for the trajectory, which is an improvement in the coverage of the environment. This feature, as well as not relying on extrinsic rewards in the decision-making process, makes PolyRL well suited to sparse-reward tasks. We assess the performance of PolyRL in 2D continuous sparse navigation tasks, as well as three sparse high-dimensional simulation tasks, and show that PolyRL performs significantly better than the other exploration methods in combination with the baseline algorithms DDPG and SAC. Finally, a more adaptive version of PolyRL, which can map the changes in the action trajectory stiffness to that of the state trajectory, could be helpful in more efficient learning of the simulation tasks.

## 8 Acknowledgements

The authors would like to thank Prof. Walter Reisner for providing valuable feedback on the initial direction of this work, and Riashat Islam for helping with the experiments in the early stages of the project. Computing resources were provided by Compute Canada and Calcul Québec throughout the project, and by Deeplite Company for preliminary data acquisition, which the authors appreciate.

## Appendix A Polymer Models

In the field of Polymer Physics, the conformations and interactions of polymers that are subject to thermal fluctuations are modeled using principles from statistical physics. In its simplest form, a polymer is modeled as an ideal chain, where interactions between chain segments are ignored. The no-interaction assumption allows the chain segments to cross each other in space and thus these chains are often called phantom chains [doi1988theory]. In this section, we give a brief introduction to two types of ideal chains.

Two main ideal chain models are: 1) freely-jointed chains (FJCs) and 2) freely-rotating chains (FRCs) [doi1988theory]. In these models, chains of size $T$ are represented as a sequence of $T$ random vectors $\{\omega_i\}_{i=1}^{T}$, which are also called bond vectors (see Fig. 7). The FJC is the simplest proposed model and is composed of mutually independent random vectors of the same size (Fig. 7(a)). In other words, an FJC is formed via uniform random sampling of vectors in space, and thus is a random walk (RW). In the FRC model, on the other hand, the notion of correlation angle $\theta$ is introduced, which is the angle between every two consecutive bond vectors. The FRC model fixes the correlation angle (Fig. 7(c)), thus the vectors in the chain are temporally correlated. The vector sampling strategy in FRCs induces persistent chains, in the sense that the orientation of consecutive vectors in the space is preserved for a certain number of time steps (a.k.a. persistence number), after which the correlation is broken and the bond vectors forget their original orientation. This feature introduces a finite stiffness in the chain, which induces what we call local self-avoidance, leading to faster expansion of the chain in the space (compare figures 7 (b) and (d)). Below, we discuss two important properties of the FJCs and the FRCs, and subsequently formally introduce the locally self-avoiding random walks (LSA-RWs) in Definition 1.

FJCs (Property) - In the Freely-Jointed Chains (FJCs) or the flexible chains model, the orientations of the bond vectors in the space are mutually independent. To measure the expected end-to-end length of a chain with $T$ bond vectors of constant length $b_o$, given the end-to-end vector $U = \sum_{i=1}^{T} \omega_i$ (Figure 7 (a)) and considering the mutual independence between bond vectors of an FJC, we can write [doi1988theory],

$$\langle \|U\|^2 \rangle = \sum_{i,j=1}^{T} \langle \omega_i \cdot \omega_j \rangle = \sum_{i=1}^{T} \langle \omega_i^2 \rangle + 2 \sum_{i>j} \langle \omega_i \cdot \omega_j \rangle = T b_o^2, \tag{14}$$

where $\langle \cdot \rangle$ denotes the ensemble average over all possible conformations of the chain as a result of thermal fluctuations. Equation 14 shows that the expected end-to-end length is $\langle \|U\|^2 \rangle^{1/2} = b_o \sqrt{T}$, which reveals random-walk behaviour as expected.
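Equation 14 can be checked with a short Monte Carlo experiment. The sketch below assumes a 2D chain whose bond directions are drawn uniformly and independently; the chain length, number of sampled chains, and seed are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
T, b_o, n_chains = 50, 1.0, 20000

# Freely-jointed chain in 2D: every bond has length b_o and an independent,
# uniformly distributed direction.
angles = rng.uniform(0.0, 2.0 * np.pi, size=(n_chains, T))
bonds = b_o * np.stack([np.cos(angles), np.sin(angles)], axis=-1)

# End-to-end vector of each chain and its ensemble-averaged squared length.
end_to_end = bonds.sum(axis=1)
mean_sq_length = np.mean(np.sum(end_to_end**2, axis=1))

print(mean_sq_length, T * b_o**2)
```

The cross terms average out because the bond directions are independent, so the estimate settles near $T b_o^2$ up to sampling noise.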

FRCs (Property) - In the Freely-Rotating Chains (FRCs) model, we assume that the angle $\theta$ (correlation angle) between every two consecutive bond vectors is invariant (Figure 7 (c)). Therefore, bond vectors are not mutually independent. Unlike the FJC model, in the FRC model the bond vectors are correlated such that [doi1988theory],

$$\langle \omega_i \cdot \omega_j \rangle = b_o^2 (\cos\theta)^{|i-j|} = b_o^2 \, e^{-\frac{|i-j|}{L_p}}, \tag{15}$$

where $L_p = -\frac{1}{\log(\cos\theta)}$ is the correlation length (persistence number). Equation 15 shows that the correlation between bond vectors in an FRC is a decaying exponential with correlation length $L_p$.
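The exponential decay in equation 15 can be reproduced with a simple simulated chain. The sketch below assumes a 2D freely-rotating chain in which each bond turns by exactly $+\theta$ or $-\theta$ (with equal probability) relative to the previous one; the angle, lag, and sample sizes are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, b_o, T, n_chains = 0.3, 1.0, 30, 20000

# Persistence number implied by the fixed correlation angle.
L_p = -1.0 / np.log(np.cos(theta))

# 2D freely-rotating chain: the heading changes by +theta or -theta at each step.
turns = theta * rng.choice([-1.0, 1.0], size=(n_chains, T - 1))
headings = np.concatenate([np.zeros((n_chains, 1)), np.cumsum(turns, axis=1)], axis=1)
bonds = b_o * np.stack([np.cos(headings), np.sin(headings)], axis=-1)

# Ensemble-averaged correlation between the first bond and the bond `lag` steps later.
lag = 5
empirical = np.mean(np.sum(bonds[:, 0] * bonds[:, lag], axis=1))
predicted = b_o**2 * np.exp(-lag / L_p)

print(empirical, predicted)
```

Smaller angles give larger persistence numbers, so the chain keeps its orientation over more steps before the correlation fades.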

###### Lemma 6.

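[doi1988theory] Given an FRC characterized by end-to-end vector $U$, bond size $b_o$ and number of bond vectors $T$, we have $\langle \|U\|^2 \rangle = T b_{\mathrm{eff}}^2$, where $b_{\mathrm{eff}}^2 = b_o^2 \, \frac{1+\cos\theta}{1-\cos\theta}$ and $b_{\mathrm{eff}}$ is called the effective bond length.

Lemma 6 shows that FRCs obey random-walk statistics with step size (bond length) $b_{\mathrm{eff}}$. The ratio $b_{\mathrm{eff}}/b_o$ is a measure of the stiffness of the chain in an FRC.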

FRCs have higher expansion rates than FJCs, as presented in Proposition 7 below.

###### Proposition 7 (Expanding property of LSA-RW).

[doi1988theory] Let $\tau$ be a LSA-RW with the persistence number $L_{p_\tau}$ and the end-to-end vector $U(\tau)$, and let $\tau'$ be a random walk (RW) with the end-to-end vector $U(\tau')$. Then, for the same number of time steps and the same average bond length for $\tau$ and $\tau'$, the following relation holds,

$$\frac{\langle \|U(\tau)\| \rangle}{\langle \|U(\tau')\| \rangle} = \frac{1+e^{-1/L_{p_\tau}}}{1-e^{-1/L_{p_\tau}}} > 1, \tag{16}$$

where the persistence number is $L_{p_\tau} = 1/|\ln(\cos\theta)|$, with $\theta$ being the average correlation angle between every two consecutive bond vectors.

###### Proof.

This proposition is a direct result of combining two equations in [doi1988theory]: the first provides the expected $T$-time-step length of the end-to-end vector with average step size $b_o$ for FJCs, and the second provides a similar result for FRCs. Note that in the FRC model, since bond vectors that are far apart in time on the chain are not correlated, they can cross each other. ∎
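To get a feel for the size of the ratio in equation 16, it can be evaluated directly for a few persistence numbers. This is a small sketch; the function name `expansion_ratio` and the sample values of the persistence number are illustrative assumptions.

```python
import math


def expansion_ratio(L_p):
    """Right-hand side of equation 16 for persistence number L_p > 0."""
    decay = math.exp(-1.0 / L_p)
    return (1.0 + decay) / (1.0 - decay)


# The ratio grows with the persistence number and always exceeds 1.
ratios = [expansion_ratio(L_p) for L_p in (1.0, 5.0, 10.0, 50.0)]
```

Since $\cos\theta = e^{-1/L_p}$, the same quantity can be written as $(1+\cos\theta)/(1-\cos\theta)$, so stiffer chains (smaller $\theta$) give larger ratios.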

Radius of Gyration (Formal Definition) [rubenstein2003polymer] The square radius of gyration $U_g^2(\tau)$ of a chain $\tau$ of size $T$ is defined as the mean square distance between the position vectors $t_i \in \tau$ and the chain center of mass ($\bar{\tau}$), and is written as,

$$U_g^2(\tau) := \frac{1}{T} \sum_{i=1}^{T} \|t_i - \bar{\tau}\|^2, \tag{17}$$

where $\bar{\tau} = \frac{1}{T}\sum_{i=1}^{T} t_i$. As a measure of coverage of the space in which the chain resides, the radius of gyration is a more suitable choice than the end-to-end distance $\|U\|$, as it captures the size of the chain with respect to its center of mass and is proportional to the radius of the sphere (or hypersphere) that the chain occupies. Moreover, for chains that are circular or branched, and thus cannot be assigned an end-to-end length, the radius of gyration remains a suitable measure of chain size [rubenstein2003polymer]. For fluctuating chains, the square radius of gyration is usually ensemble averaged over all possible chain conformations, and is written as [rubenstein2003polymer],

$$\langle U_g^2(\tau) \rangle := \frac{1}{T} \sum_{i=1}^{T} \langle \|t_i - \bar{\tau}\|^2 \rangle. \tag{18}$$
###### Remark 1.

The square radius of gyration is proportional to the square end-to-end distance in ideal chains (e.g. FJCs and FRCs) with a constant factor [rubenstein2003polymer]. Thus, Proposition 7 and equation 16, which compare the end-to-end distances of LSA-RW and RW, similarly hold for the radius of gyration of the respective models, implying faster expansion of the volume occupied by LSA-RW compared with that of RW.
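The definition in equation 17 translates directly into code. Below is a minimal sketch that computes the square radius of gyration of a single chain given as a list of position vectors; the function name and the example points are illustrative assumptions.

```python
def squared_radius_of_gyration(points):
    """Mean square distance of the points to their center of mass (equation 17)."""
    T = len(points)
    dim = len(points[0])
    # Center of mass of the chain.
    center = [sum(p[k] for p in points) / T for k in range(dim)]
    return sum(
        sum((p[k] - center[k]) ** 2 for k in range(dim)) for p in points
    ) / T


# Four corners of a square with side 2: every corner is at squared distance 2
# from the center (1, 1).
square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
u_g2 = squared_radius_of_gyration(square)
```

Unlike the end-to-end distance, this quantity is well defined for closed chains such as the square above, whose first and last points are adjacent.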

## Appendix B The Proofs

In this section, the proofs for the theorems and the lemma in the manuscript are provided.

### B.1 The proof of Lemma 2 in the manuscript

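Lemma 2 statement: Let $\tau_S = (s_0, \dots, s_{T-1})$ be the trajectory of visited states, $s_T$ be a newly visited state and $\omega_i = s_i - s_{i-1}$ be the bond vector that connects two consecutive visited states $s_{i-1}$ and $s_i$. Then we have,

$$\|s_T - \bar{\tau}_S\|^2 = \left\| \omega_T + \frac{1}{T}\left[\sum_{i=1}^{T-1} i\,\omega_i\right] \right\|^2. \tag{19}$$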
###### Proof.

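Using the relation $\bar{\tau}_S = \frac{1}{T}\sum_{s \in \tau_S} s$ as well as the definition of bond vectors (equation (3) in the manuscript), we can write $s_T - \bar{\tau}_S$ on the left-hand side of equation (6) in the manuscript as,

$$\begin{aligned}
s_T - \bar{\tau}_S &= s_T - \frac{1}{T}\sum_{s \in \tau_S} s \\
&= s_T - s_{T-1} + s_{T-1} - \frac{1}{T}\sum_{s \in \tau_S} s \\
&= \omega_T + \frac{1}{T}\Big[(s_{T-1} - s_0) + (s_{T-1} - s_1) + (s_{T-1} - s_2) + \dots + (s_{T-1} - s_{T-2})\Big] \\
&= \omega_T + \frac{1}{T}\Big[(s_{T-1} - s_{T-2} + s_{T-2} - s_{T-3} + \dots + s_2 - s_1 + s_1 - s_0) \\
&\qquad\qquad + (s_{T-1} - s_{T-2} + s_{T-2} - s_{T-3} + \dots + s_3 - s_2 + s_2 - s_1) + \dots + (s_{T-1} - s_{T-2})\Big] \\
&= \omega_T + \frac{1}{T}\Big[(\omega_{T-1} + \dots + \omega_1) + (\omega_{T-1} + \dots + \omega_2) + \dots + \omega_{T-1}\Big] \\
&= \omega_T + \frac{1}{T}\left[\sum_{i=1}^{T-1} i\,\omega_i\right]
\end{aligned} \tag{20}$$

$$\Rightarrow \|s_T - \bar{\tau}_S\|^2 = \left\| \omega_T + \frac{1}{T}\left[\sum_{i=1}^{T-1} i\,\omega_i\right] \right\|^2 \tag{21}$$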
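The identity derived above is purely algebraic, so it can be checked on arbitrary points. The sketch below is a hypothetical helper, not code from the manuscript. It builds both sides of equation (20) from a list of states $s_0, \dots, s_T$, taking the mean over the first $T$ states as in the proof.

```python
import random


def lemma2_sides(states):
    """Return both sides of equation (20) as vectors.

    `states` is [s_0, ..., s_{T-1}, s_T]; the trajectory tau_S holds the
    first T states and s_T is the newly visited state.
    """
    *tau, s_T = states
    T = len(tau)
    dim = len(s_T)
    mean = [sum(s[k] for s in tau) / T for k in range(dim)]
    lhs = [s_T[k] - mean[k] for k in range(dim)]
    # Bond vectors omega_i = s_i - s_{i-1} for i = 1, ..., T.
    omega = {
        i: [states[i][k] - states[i - 1][k] for k in range(dim)]
        for i in range(1, T + 1)
    }
    # Weighted sum of the first T-1 bond vectors: sum_i i * omega_i.
    weighted = [sum(i * omega[i][k] for i in range(1, T)) for k in range(dim)]
    rhs = [omega[T][k] + weighted[k] / T for k in range(dim)]
    return lhs, rhs


rng = random.Random(0)
states = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(8)]
lhs, rhs = lemma2_sides(states)
```

The two vectors agree up to floating-point rounding, which mirrors the telescoping argument in the proof.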

### B.2 The proof of Theorem 3 in the manuscript

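Theorem 3 statement (Upper-Bound Theorem): Let $\delta$ be a confidence parameter and $\tau_S$ be a LSA-RW in $\Omega$ induced by PolyRL with the persistence number $L_{p_{\tau_S}}$ within an episode, let $\{\omega_i\}$ be the sequence of corresponding bond vectors, where $T$ denotes the number of bond vectors within $\tau_S$, and let $b_o$ be the average bond length. The upper confidence bound for $ULS_{U_g^2}(\tau_S)$ with probability of at least $1-\delta$ is,

$$UB = \Lambda(T, \tau_S) + \frac{1}{\delta}\left[\Gamma(T, b_o, \tau_S) + \frac{2 b_o^2}{T^2} \sum_{i=1}^{T-1} i\, e^{-\frac{(T-i)}{L_{p_{\tau_S}}}}\right], \tag{22}$$

where,

$$\Lambda(T, \tau_S) = -\frac{1}{T-1}\, U_g^2(\tau_S) \tag{23}$$

$$\Gamma(T, b_o, \tau_S) = \frac{b_o^2}{T} + \frac{\left\|\sum_{i=1}^{T-1} i\,\omega_i\right\|^2}{T^3} \tag{24}$$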
###### Proof.

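If we replace the square radius of gyration of the extended trajectory in equation (5) in the manuscript with its incremental representation as a function of the new state $s_T$, we get

$$\begin{aligned}
ULS_{U_g^2}(\tau_S) &= \sup_{s_T \in \Omega} \left( \frac{T-2}{T-1}\, U_g^2(\tau_S) + \frac{1}{T}\|s_T - \bar{\tau}_S\|^2 - U_g^2(\tau_S) \right) \\
&= -\frac{1}{T-1}\, U_g^2(\tau_S) + \sup_{s_T \in \Omega} \frac{1}{T}\|s_T - \bar{\tau}_S\|^2.
\end{aligned} \tag{25}$$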

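Therefore, the problem reduces to the calculation of

$$\frac{1}{T} \sup_{s_T \in \Omega} \|s_T - \bar{\tau}_S\|^2. \tag{26}$$

Using Lemma 2 in the manuscript, we can write equation (26) in terms of bond vectors as,

$$\frac{1}{T} \sup_{s_T \in \Omega} \|s_T - \bar{\tau}_S\|^2 = \frac{1}{T} \sup_{s_T \in \Omega} \left\| \omega_T + \frac{1}{T}\left[\sum_{i=1}^{T-1} i\,\omega_i\right] \right\|^2. \tag{27}$$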

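From now on, with a slight abuse of notation, we will treat $\omega_T$ as a random variable due to the fact that $s_T$ is a random variable in our system. Note that $\omega_i$ for $i = 1, \dots, T-1$ is fixed, and thus is not considered a random variable. We use high-probability concentration bound techniques to calculate equation (26). For any $\delta$, there exists $\alpha$ such that

$$\Pr\left[ \left\| \omega_T + \frac{1}{T}\left[\sum_{i=1}^{T-1} i\,\omega_i\right] \right\|^2 < \alpha \;\middle|\; S_T \in \Omega \right] > 1 - \delta. \tag{28}$$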

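We can rearrange equation 28 as,

$$\Pr\left[ \left\| \omega_T + \frac{1}{T}\left[\sum_{i=1}^{T-1} i\,\omega_i\right] \right\|^2 \geq \alpha \;\middle|\; S_T \in \Omega \right] \leq \delta. \tag{29}$$

Multiplying both sides of the inequality inside the probability by $T^2$ and expanding the squared term in equation 29 gives,

$$\Pr\left[ T^2\|\omega_T\|^2 + 2T\left(\omega_T \cdot \sum_{i=1}^{T-1} i\,\omega_i\right) + \left\|\sum_{i=1}^{T-1} i\,\omega_i\right\|^2 \geq T^2\alpha \;\middle|\; S_T \in \Omega \right] \leq \delta. \tag{30}$$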

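By Markov's inequality we have,

$$\begin{aligned}
&\Pr\left[ T^2\|\omega_T\|^2 + 2T\left(\omega_T \cdot \sum_{i=1}^{T-1} i\,\omega_i\right) + \left\|\sum_{i=1}^{T-1} i\,\omega_i\right\|^2 \geq T^2\alpha \right] \\
&\qquad \leq \frac{\mathbb{E}\left[ T^2\|\omega_T\|^2 + 2T\left(\omega_T \cdot \sum_{i=1}^{T-1} i\,\omega_i\right) + \left\|\sum_{i=1}^{T-1} i\,\omega_i\right\|^2 \right]}{T^2\alpha} = \delta \\
&\implies \alpha = \frac{1}{\delta T^2}\left[ T^2\,\mathbb{E}\left[\|\omega_T\|^2\right] + 2T\,\mathbb{E}\left[\omega_T \cdot \sum_{i=1}^{T-1} i\,\omega_i\right] + \left\|\sum_{i=1}^{T-1} i\,\omega_i\right\|^2 \right] \\
&\overset{\text{by Def. 1}}{\implies} \alpha = \frac{1}{\delta T^2}\left[ T^2 b_o^2 + 2T\,\mathbb{E}\left[\omega_T \cdot \sum_{i=1}^{T-1} i\,\omega_i\right] + \left\|\sum_{i=1}^{T-1} i\,\omega_i\right\|^2 \right]
\end{aligned}$$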

Note that all expectations in the equations above are over the transition kernel of the MDP. Using the results from Lemma 3 below, we conclude the proof. ∎
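For concreteness, the bound in equation (22) can be evaluated numerically once its ingredients are known. The following is a minimal sketch under the assumption that the caller supplies the current square radius of gyration and the squared norm of the weighted bond-vector sum $\|\sum_{i=1}^{T-1} i\,\omega_i\|^2$; the function and argument names are hypothetical and do not come from the PolyRL implementation.

```python
import math


def upper_bound(T, b_o, L_p, delta, U_g2, weighted_bond_sum_sq):
    """Evaluate UB from equation (22).

    U_g2                 -- square radius of gyration of the trajectory
    weighted_bond_sum_sq -- squared norm of sum_{i=1}^{T-1} i * omega_i
    """
    lam = -U_g2 / (T - 1)                                   # equation (23)
    gamma = b_o ** 2 / T + weighted_bond_sum_sq / T ** 3    # equation (24)
    # Correlation term: (2 b_o^2 / T^2) * sum_i i * exp(-(T - i) / L_p).
    corr = (2.0 * b_o ** 2 / T ** 2) * sum(
        i * math.exp(-(T - i) / L_p) for i in range(1, T)
    )
    return lam + (gamma + corr) / delta


# Arbitrary illustrative inputs.
ub_loose = upper_bound(T=20, b_o=1.0, L_p=5.0, delta=0.05,
                       U_g2=3.0, weighted_bond_sum_sq=40.0)
ub_tight = upper_bound(T=20, b_o=1.0, L_p=5.0, delta=0.5,
                       U_g2=3.0, weighted_bond_sum_sq=40.0)
```

Because the bracketed term is divided by $\delta$, asking for higher confidence (smaller $\delta$) yields a looser bound, which is the usual behaviour of Markov-type bounds.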

###### Lemma 3.

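Let $\tau_S$ denote the sequence of states observed by PolyRL and $s_T$ be the new state visited by PolyRL. Assuming that the trajectory extended by $s_T$ (equation 2 in the manuscript) follows the LSA-RW formalism with the persistence number $L_{p_{\tau_S}}$, we have

$$\mathbb{E}\left[\omega_T \cdot \sum_{i=1}^{T-1} i\,\omega_i\right] = b_o^2 \sum_{i=1}^{T-1} i\, e^{-\frac{|T-i|}{L_{p_{\tau_S}}}} \tag{31}$$

###### Proof.

$$\mathbb{E}\left[\omega_T \cdot \sum_{i=1}^{T-1} i\,\omega_i\right] = \mathbb{E}\left[\sum_{i=1}^{T-1} i\,\omega_T \cdot \omega_i\right] = \sum_{i=1}^{T-1} i\,\mathbb{E}\left[\omega_T \cdot \omega_i\right]. \tag{32}$$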

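Here, the goal is to calculate the expectation in equation 32 under the assumption that the trajectory extended by $s_T$ is LSA-RW with persistence number $L_{p_{\tau_S}}$. Note that in this case the chain of states visited by PolyRL prior to visiting $s_T$ is also LSA-RW with the same persistence number. Now we focus on the expectation in equation 32. We compute $\mathbb{E}[\omega_T \cdot \omega_i]$ using the LSA-RW formalism (Definition 1 in the manuscript) as follows,

$$\mathbb{E}\left[\omega_T \cdot \omega_i\right] = b_o^2\, e^{-\frac{|T-i|}{L_{p_{\tau_S}}}}$$

Therefore, we have,

$$\sum_{i=1}^{T-1} i\,\mathbb{E}\left[\omega_T \cdot \omega_i\right] = \sum_{i=1}^{T-1} i\, b_o^2\, e^{-\frac{(T-i)}{L_{p_{\tau_S}}}} = b_o^2 \sum_{i=1}^{T-1} i\, e^{-\frac{(T-i)}{L_{p_{\tau_S}}}}$$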

### B.3 The proof of Theorem 4 in the manuscript

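Theorem 4 statement (Lower-Bound Theorem): Let $\delta$ be a confidence parameter and $\tau_S$ be a LSA-RW in $\Omega$ induced by PolyRL with the persistence number $L_{p_{\tau_S}}$ within an episode, let $\{\omega_i\}$ be the sequence of corresponding bond vectors, where $T$ denotes the number of bond vectors within $\tau_S$, and let $b_o$ be the average bond length. The lower confidence bound for $LLS_{U_g^2}(\tau_S)$ with probability of at least $1-\delta$ is,

$$LB = \Lambda(T, \tau_S) + \left(1 - \sqrt{2 - 2\delta}\right)\left[\Gamma(T, b_o, \tau_S) + \frac{(T-1)(T-2)}{T^2}\, b_o^2\, e^{-\frac{|T-1|}{L_{p_{\tau_S}}}}\right], \tag{33}$$

where,

$$\Lambda(T, \tau_S) = -\frac{1}{T-1}\, U_g^2(\tau_S) \tag{34}$$

$$\Gamma(T, b_o, \tau_S) = \frac{b_o^2}{T} + \frac{\left\|\sum_{i=1}^{T-1} i\,\omega_i\right\|^2}{T^3} \tag{35}$$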
###### Proof.

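Using the definition of the radius of gyration and taking the $\ell_2$-norm in equation (4) in the manuscript, we have

$$\begin{aligned}
LLS_{U_g^2}(\tau_S) &= \inf_{s_T \in \Omega} \left( \frac{T-2}{T-1}\, U_g^2(\tau_S) + \frac{1}{T}\|s_T - \bar{\tau}_S\|^2 - U_g^2(\tau_S) \right) \\
&= -\frac{1}{T-1}\, U_g^2(\tau_S) + \inf_{s_T \in \Omega} \frac{1}{T}\|s_T - \bar{\tau}_S\|^2
\end{aligned} \tag{36}$$

To calculate the high-probability lower bound, we first use the result from Lemma 2 in the manuscript. Thus, we have

$$\inf_{s_T \in \Omega} \frac{1}{T}\|s_T - \bar{\tau}_S\|^2 = \frac{1}{T} \inf_{s_T \in \Omega} \left\| \omega_T + \frac{1}{T}\left[\sum_{i=1}^{T-1} i\,\omega_i\right] \right\|^2. \tag{37}$$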

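We subsequently use the second moment method and the Paley–Zygmund inequality to calculate the high-probability lower bound. Let $Y := \left\| \omega_T + \frac{1}{T}\left[\sum_{i=1}^{T-1} i\,\omega_i\right] \right\|^2$. For finite positive constants $c_1$ and $c_2$, we have,

$$\Pr\left[Y > c_2 \beta\right] \geq \frac{(1-\beta)^2}{c_1} \tag{38}$$

where,

$$\begin{aligned}
\mathbb{E}[Y^2] &\leq c_1\, \mathbb{E}[Y]^2 \\
\mathbb{E}[Y] &\geq c_2.
\end{aligned} \tag{39}$$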
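The Paley–Zygmund inequality used above holds for any non-negative random variable and any $\beta \in [0, 1]$, including the empirical distribution of a finite sample. The sketch below illustrates it on synthetic data, taking $c_2 = \mathbb{E}[Y]$ and $c_1 = \mathbb{E}[Y^2]/\mathbb{E}[Y]^2$ as the tightest admissible constants. The sample (squared Gaussians) is an arbitrary stand-in, not the distribution of $Y$ in PolyRL.

```python
import random


def paley_zygmund_check(samples, beta):
    """Return (empirical tail probability, Paley-Zygmund lower bound).

    Uses c_2 = E[Y] and c_1 = E[Y^2] / E[Y]^2 computed from the sample,
    so the bound reads Pr[Y > beta * E[Y]] >= (1 - beta)^2 / c_1.
    """
    n = len(samples)
    mean = sum(samples) / n
    second_moment = sum(y * y for y in samples) / n
    tail = sum(1 for y in samples if y > beta * mean) / n
    bound = (1.0 - beta) ** 2 * mean ** 2 / second_moment
    return tail, bound


rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) ** 2 for _ in range(5000)]  # non-negative Y
results = {beta: paley_zygmund_check(samples, beta) for beta in (0.1, 0.5, 0.9)}
```

For each $\beta$ the empirical tail probability should sit above the bound, typically by a wide margin, which reflects how conservative second-moment bounds tend to be.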

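The goal is to find two constants $c_1$ and $c_2$ such that equation (39) is satisfied, and then to determine $\beta$ in equation (38) in terms of $\delta$. We start by computing $\mathbb{E}[Y]$,

$$\begin{aligned}
\mathbb{E}[Y] &= \mathbb{E}\left[ \left(\omega_T + \frac{1}{T}\left[\sum_{i=1}^{T-1} i\,\omega_i\right]\right) \cdot \left(\omega_T + \frac{1}{T}\left[\sum_{i=1}^{T-1} i\,\omega_i\right]\right) \right] \\
&= \mathbb{E}\left[ \|\omega_T\|^2 + \frac{2}{T}\left(\omega_T \cdot \sum_{i=1}^{T-1} i\,\omega_i\right) + \frac{1}{T^2}\left\|\sum_{i=1}^{T-1} i\,\omega_i\right\|^2 \right] \\
&= \mathbb{E}\left[\|\omega_T\|^2\right] + \frac{2}{T}\sum_{i=1}^{T-1} i\,\mathbb{E}\left[\omega_T \cdot \omega_i\right] + \frac{1}{T^2}\left\|\sum_{i=1}^{T-1} i\,\omega_i\right\|^2 \\
&= b_o^2 + \frac{2}{T}\, b_o^2 \sum_{i=1}^{T-1} i\, e^{-\frac{|T-i|}{L_{p_{\tau_S}}}} + \frac{1}{T^2}\left\|\sum_{i=1}^{T-1} i\,\omega_i\right\|^2,
\end{aligned} \tag{40}$$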

therefore,

 E[Y] =b2o+2T