1 Introduction
Reinforcement learning (RL) is a framework in which an agent interacts with an unknown environment, receives feedback from it, and optimizes its performance accordingly Sutton and Barto (2018); Bertsekas (2005). There have been attempts to learn a control policy directly from real-world samples Levine et al. (2018); Yahya et al. (2017); Pinto and Gupta (2016); Kalashnikov et al. (2018). However, in many cases, learning from the actual environment may be slow, costly, or dangerous, while learning from a simulated system can be fast, cheap, and safe. The advantages of learning from simulation are counterbalanced by the reality gap Jakobi et al. (1995): the loss of fidelity due to modeling limitations, parameter errors, and lack of variety in physical properties. The quality of the simulation may vary: when the simulation mimics reality well, we can train the agent in simulation and then transfer the policy to the real environment in a one-shot manner (e.g., Andrychowicz et al. (2020)). However, in many cases the simulation has low fidelity, which leads to the following question: Can we mitigate the differences between real environments ("real") and simulations ("sim") thereof, so as to train an agent that learns from both and performs well in the real one?
In this work, we propose to learn simultaneously on real and sim, while controlling the rate at which we collect samples from each environment and the rate at which we use these samples in the policy optimization. This synergy offers a speed-fidelity tradeoff and harnesses the advantage of each domain. Moreover, the simulation speed encourages exploration, which helps to accelerate the learning process. The real system in turn can improve exploitation in the sense that it mitigates the challenges of sim-to-real policy transfer and encourages the learner to converge to relevant solutions. A general scheme describing our proposed setup is depicted in Figure 1. In a nutshell, there is a single agent interacting with environments (on the left). Each sample provided by an environment is pushed into a corresponding replay buffer (RB). On the right, the agent pulls samples from the RBs and is trained on them. In the sim-to-real scheme, .
In the specific scheme for mixing real and sim samples in the learning process, separate probability measures are used for collecting samples and for optimizing the policy parameters. The off-policy nature of our scheme enables separation between real and sim samples, which in turn helps control the rate of real samples used in the optimization process. In this work we discuss two RL algorithms that can be used with this scheme: (1) an off-policy linear actor-critic that mixes sim and real samples, and (2) a mixing-scheme variant of Deep Deterministic Policy Gradient (DDPG; Lillicrap et al. (2015)) based on neural networks. We analyze the asymptotic convergence of the linear algorithm and demonstrate the mixing-samples variant of DDPG in a sim-to-real environment.
The naive approach in which one pushes the state-action-reward-next-state tuples into a single shared replay buffer is prone to failures due to the imbalance between simulation and real rollouts. To overcome this, we maintain separate replay buffers for each of the environments (e.g., in the case of a single robot and a simulator we would have two replay buffers). This allows us to extract the maximum valuable information from reality by distinguishing its tuples from those generated by other environments, while continuously improving the agent using data from all input streams. Importantly, although the rate of samples is skewed in favor of the simulation, the learning may be carried out using a different rate. In a sense, the mechanism we suggest is a version of the importance sampling technique Bucklew (2013).
Our main contributions in this work are as follows:

We present a method for incorporating real system samples and simulation samples in a policy optimization process while distinguishing between the rate of collecting samples and the rate of using them.

We analyze the asymptotic convergence of our proposed mixing real and sim scheme.

To the best of our knowledge, we provide for the first time a theoretical analysis of the dynamics and properties of replay buffers, such as their Markovity and the explicit probability measure induced by the replay buffer.

We demonstrate our findings in a simulated sim-to-real setting with two simulations, where one is a distorted version of the other, and analyze it empirically.
2 Related Work
Sim-to-Real: Sim-to-Real is a long-investigated topic in robotics where one aims to reduce the reality gap between the real system and its digital twin implementation. A general framework in which results are transferred from one domain to another is domain adaptation. In vision, this approach has helped to achieve state-of-the-art results Ganin et al. (2017); Shu et al. (2018); Long et al. (2015); Bousmalis et al. (2016); Kim et al. (2017); Shrivastava et al. (2017). In our work, we focus on the physical aspects of the sim-to-real gap. Related to domain adaptation is the approach of domain randomization, where the randomization is done in simulation in order to robustify and enhance detection and object recognition capabilities Tobin et al. (2017); Sadeghi and Levine (2016); James et al. (2017); Vuong et al. (2019). Recently, James et al. (2019) proposed a method where both simulation and reality are adapted to a common domain. Andrychowicz et al. (2020) extensively randomize the task of reaching a cube pose, where one-shot transfer is achieved but with large sample complexity. Randomization may also be applied to dynamics, e.g., Peng et al. (2018), where robustness to inaccuracy in real-world parameters is achieved.
Another approach in Sim-to-Real is to change the simulation in light of real samples. In Chebotar et al. (2019) the agent learns mainly from simulation, but the simulation parameters are updated to match the behavior in reality by reducing the difference between simulation and reality rollouts. Our method is a direct approach that incorporates phenomena that are difficult to simulate accurately. In a Bayesian context, Ramos et al. (2019) provide a principled framework to reason about the uncertainty in simulation parameters. Kang et al. (2019) investigated how real system and simulation data can be combined in training deep RL algorithms. They separate the data types by using real data to learn about the dynamics of the system, and simulated data to learn a generalizing perception system. Our method mixes real and simulation data by controlling the rate of streaming each data type into the learning agent.
Replay Buffer analysis: A large portion of RL algorithms use replay buffers Lin (1993); Mnih et al. (2013), but here we review only works that provide some analysis. Several works study the effect of replay buffer size on the agent performance Zhang and Sutton (2017); Liu and Zou (2018). Our focus is the effect of controlling the rate of collecting samples and the rate of using them in the optimization process. Fedus et al. (2020) investigated the effect of the ratio between these rates on the learning process through simulated experiments, while our focus is on the theoretical aspects. Other works studied the criteria for prioritizing transitions to enhance learning Schaul et al. (2015); Pan et al. (2018); Zha et al. (2019). In the case of multiple agents that share their policy, Horgan et al. (2018) argue in favor of a shared replay buffer for all agents and a prioritizing mechanism. We, on the other hand, emphasize the advantage of separating replay buffers when collecting samples from different environments, to enable mixing management in the learning process.
Stochastic Approximation: Our proposed algorithm is based on the Stochastic Approximation method Kushner and Clark (2012). Konda and Tsitsiklis (2000) proposed the actor-critic algorithm and established the asymptotic convergence of the two-time-scale actor-critic with a TD(λ)-learning-based critic. Bhatnagar et al. (2008) proved the convergence result for the original actor-critic and natural actor-critic methods. Di Castro and Meir (2010) proposed a single-time-scale actor-critic algorithm and proved its convergence. Recently, several finite sample analyses were provided by Wu et al. (2020); Zou et al. (2019); Dalal et al. (2018) and others, but these works have not analyzed the asymptotic behavior of the RB, while we do.
3 Setup
We model the problem using a Markov Decision Process (MDP; Puterman (1994)), where and are the state space and action space, respectively. We let denote the probability of transitioning from state to state when applying action . The MDP measure and the policy measure induce together a Markov Chain (MC) measure ( is matrix form). We consider a probabilistic policy , parameterized by which expresses the probability of the agent to choose an action given that it is in state . We let denote the stationary distribution induced by the policy . The reward function is denoted by . Throughout the paper we assume the following.
Assumption 1.
1. The set is compact. 2. The reward for all .
Assumption 2.
For any policy , the induced Markov chain of the MDP process is irreducible and aperiodic.
The goal of the agent is to find a policy that maximizes the average reward that the agent receives during its interaction with the environment Puterman (1994). Under an ergodicity assumption, the average reward over time eventually converges to the expected reward under the stationary distribution Bertsekas (2005)
(1) 
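The relation between the long-run average reward and the stationary distribution can be checked numerically on a toy chain. The sketch below is illustrative only: the 3-state transition matrix `P`, the per-state reward vector `r`, and the helper `stationary_distribution` are hypothetical and are not taken from the paper.

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution mu of a row-stochastic matrix P (mu @ P == mu)."""
    n = P.shape[0]
    # Stack the balance equations mu (P - I) = 0 with the constraint sum(mu) = 1
    # and solve the (consistent, overdetermined) system by least squares.
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    mu, *_ = np.linalg.lstsq(A, b, rcond=None)
    return mu

# Hypothetical irreducible, aperiodic chain induced by some fixed policy.
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.0, 0.8]])
r = np.array([0.0, 1.0, -1.0])  # expected one-step reward in each state

mu = stationary_distribution(P)
eta = float(mu @ r)  # expected reward under the stationary distribution

# Empirical check: time-averaged reward along one long trajectory.
rng = np.random.default_rng(0)
T = 50_000
state, total = 0, 0.0
for _ in range(T):
    total += r[state]
    state = rng.choice(3, p=P[state])
empirical = total / T

print(f"stationary average reward: {eta:.4f}, empirical: {empirical:.4f}")
```

Under irreducibility and aperiodicity (Assumption 2), the two printed numbers should agree up to sampling noise.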
The statevalue function evaluates the overall expected accumulated rewards given a starting state and a policy
(2) 
where the actions follow the policy and the next state follows the transition probability . Denote to be the vector value function defined in (2). Therefore, the vectorial Bellman Equation (BE) for a fixed policy is , where is a vector of rewards for each state Puterman (1994). We recall that the solution to the BE is unique up to an additive constant. In order to have a unique solution, we choose a state to be of value , i.e., (due to Assumption 2, can be any of ).
In our specific setup, we consider a model where there are MDPs, denoted by , that all share the same state space , action space , and reward function . The environment dynamics, though, are different, and are denoted by a transition function . Together with a shared policy , each is induced by a state transition measure and a stationary distribution . Let and define the average reward over environments,
(3) 
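The role of the reference state in the Bellman equation can be illustrated on a small example. The sketch below assumes the standard average-reward form of the equation, v = r - eta * 1 + P v, which may differ in notation from the paper's own statement; the chain `P`, the rewards `r`, and the function name `differential_values` are hypothetical.

```python
import numpy as np

def differential_values(P, r, ref_state=0):
    """Solve v = r - eta * 1 + P v together with v[ref_state] = 0.

    Unknowns are the n entries of v and the scalar eta. Without the extra
    constraint, v is only determined up to an additive constant.
    """
    n = P.shape[0]
    A = np.zeros((n + 1, n + 1))
    A[:n, :n] = np.eye(n) - P   # (I - P) v
    A[:n, n] = 1.0              # + eta * 1
    A[n, ref_state] = 1.0       # pins the value of the reference state to 0
    b = np.append(r, 0.0)
    solution = np.linalg.solve(A, b)
    return solution[:n], float(solution[n])

# Hypothetical chain and per-state rewards under a fixed policy.
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.0, 0.8]])
r = np.array([0.0, 1.0, -1.0])

v, eta = differential_values(P, r, ref_state=0)
print("average reward:", eta)
print("differential values:", v)
```

Choosing a different `ref_state` shifts every entry of `v` by the same constant and leaves `eta` unchanged, which is why any state can serve as the reference.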
The following assumption resembles Assumption 2 for environments.
Assumption 3.
For any policy , the induced Markov chain of MDP is irreducible and aperiodic for all .
We define to be the throughput of , i.e., the number of samples that MDP provides per unit time. In the sim-to-real context, this setup can practically handle several robots and several simulation instances. We assume for the sim-to-real scenario that .
Since the samples from real arrive at a lower throughput than the sim, if we push the samples into two separate Replay Buffers (RB; Lin (1993); Mnih et al. (2013)) based on their sources, we can leverage the relatively scarce, but valuable samples that originated in the real system. This observation is the main motivation for our "Mixing Sim and Real" scheme, presented in the next section.
4 Mixing Sim and Real Algorithm
In order to reconcile the dynamics disparity, we propose our Mixing Sim and Real Algorithm with Linear Actor Critic, presented in Algorithm 1 and described in Figure 1. We consider environments, modeled as MDPs, , where the agent maintains a replay buffer for each MDP, respectively. For simplicity of analysis, we replace with the following random variable. The agent chooses an environment to communicate with according to
where , , and . The agent collects transitions from the chosen environment and stores them in the corresponding . In order to approximate the rates correctly, we choose for the agent to interact according to the rates.
We train the agent in an off-policy manner. The agent selects for sampling the next batch for training according to where , , and . This distribution remains static, and hence the selections in time are i.i.d. (We note that one could remove this restriction and think of other schemes in which the replay buffer selection distribution changes over time based on some prescribed optimization goal, cost, etc.) In addition, the distribution that selects which samples to train over should be different from the distribution that controls the throughput at which each environment pushes samples to its RB. In that way, scarce samples from the real environment can have a higher influence on the training.
Once an RB is selected, the sampled batch is used for optimizing the actor and the critic parameters. In this work, we propose a two-time-scale linear actor-critic optimization scheme Konda and Tsitsiklis (2000), which is an RB-based version of the algorithm of Bhatnagar et al. (2008). We analyze its convergence properties in Section 5. We note, however, that other optimization schemes can be used, such as DDPG Lillicrap et al. (2015), which we use in our experiments.
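The separation between the collection distribution and the training distribution can be sketched with one FIFO buffer per environment. This is a minimal illustration under stated assumptions, not the paper's implementation: the class `MixedReplay`, the names `collect_probs` and `train_probs`, the buffer capacity, and the dummy transitions are all hypothetical.

```python
import random
from collections import deque

class MixedReplay:
    """One FIFO replay buffer per environment, with separate i.i.d. choices
    for (a) which environment to collect from and (b) which buffer to train on."""

    def __init__(self, collect_probs, train_probs, capacity=1000, seed=0):
        assert len(collect_probs) == len(train_probs)
        self.collect_probs = list(collect_probs)
        self.train_probs = list(train_probs)
        # deque(maxlen=...) keeps only the most recent `capacity` transitions.
        self.buffers = [deque(maxlen=capacity) for _ in collect_probs]
        self.rng = random.Random(seed)

    def choose_env(self):
        """Pick the environment to interact with next (collection distribution)."""
        return self.rng.choices(range(len(self.buffers)), weights=self.collect_probs)[0]

    def push(self, env_idx, transition):
        self.buffers[env_idx].append(transition)

    def sample_batch(self, batch_size):
        """Pick a buffer with the training distribution, then sample uniformly from it."""
        k = self.rng.choices(range(len(self.buffers)), weights=self.train_probs)[0]
        if not self.buffers[k]:
            raise RuntimeError(f"replay buffer {k} is still empty")
        return k, [self.rng.choice(self.buffers[k]) for _ in range(batch_size)]

# Index 0 plays the role of "real", index 1 the role of "sim" (illustrative values).
mix = MixedReplay(collect_probs=[0.1, 0.9], train_probs=[0.5, 0.5])
steps = 2000
for _ in range(steps):
    k = mix.choose_env()
    mix.push(k, ("state", "action", 0.0, "next_state", k))  # dummy transition

collected_real = len(mix.buffers[0]) / steps
picks = [mix.sample_batch(32)[0] for _ in range(2000)]
trained_real = picks.count(0) / len(picks)
print(f"real share collected: {collected_real:.2f}, real share trained on: {trained_real:.2f}")
```

With these illustrative values, roughly one in ten collected transitions is real, yet about half of the training batches are drawn from the real buffer, which is the effect the separate buffers are meant to enable.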
We define a tuple of indices where corresponds to and corresponds to the th sample in this . In addition, it corresponds to time where this is the time when the agent interacted with the th MDP and the sample was added to . Let be a transition sampled at time from . Whenever it is clear from the context, we simply use .
The temporal difference (TD) error is a random quantity based on a single sampled transition from ,
(4) 
where is a linear approximation for , is a feature vector for state and is a parameter vector. In Algorithm 1, the average reward, critic, and actor parameters are updated based on the TD error (see lines 7-9). Note that for the actor updates, we use a projection that projects any to a compact set whenever .
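One update step of an average-reward linear actor-critic can be sketched as follows. The update forms are assumptions borrowed from the standard algorithms of Bhatnagar et al. and are not copied from Algorithm 1; the projection is modeled here as clipping to a box, and every name (`td_error`, `actor_critic_step`, `score`, `theta_bound`) is hypothetical.

```python
import numpy as np

def td_error(reward, eta, phi_s, phi_next, v):
    """TD error with a linear critic: r - eta + phi(s')^T v - phi(s)^T v."""
    return reward - eta + phi_next @ v - phi_s @ v

def actor_critic_step(eta, v, theta, transition, score, steps, theta_bound=10.0):
    """One update of the average reward, the critic, and the actor.

    transition: (reward, phi(s), phi(s')) sampled from the selected replay buffer
    score:      gradient of log pi_theta(a|s) with respect to theta
    steps:      (average-reward step, critic step, actor step)
    """
    reward, phi_s, phi_next = transition
    step_eta, step_v, step_theta = steps
    delta = td_error(reward, eta, phi_s, phi_next, v)
    eta_new = eta + step_eta * (reward - eta)    # running average-reward estimate
    v_new = v + step_v * delta * phi_s           # TD(0) critic update
    # Actor update followed by a projection onto a compact set (here: a box).
    theta_new = np.clip(theta + step_theta * delta * score, -theta_bound, theta_bound)
    return eta_new, v_new, theta_new, delta

eta0 = 0.5
v0 = np.array([2.0, 3.0])
theta0 = np.zeros(2)
transition = (1.0, np.array([1.0, 0.0]), np.array([0.0, 1.0]))
score = np.array([1.0, -1.0])

eta1, v1, theta1, delta = actor_critic_step(
    eta0, v0, theta0, transition, score, steps=(0.1, 0.1, 0.01)
)
print(delta, eta1, v1, theta1)
```

The critic step is larger than the actor step in this example, in line with the two-time-scale structure in which the critic is the faster recursion.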
In order to gain understanding of our proposed setup, in the next section we characterize the behaviour of the iterations in Algorithm 1.
5 Convergence Analysis for Mixing Sim and Real with Linear Approximation
The standard tool in the literature for analyzing iterations of processes such as the two-time-scale actor-critic in the context of RL is Stochastic Approximation (SA; Kushner and Yin (2003); Borkar (2009); Bertsekas and Tsitsiklis (1996)). This analysis technique includes two parts: proving the existence of a fixed point, and bounding the rate of convergence to this fixed point. By far the most popular method for proving convergence is the Ordinary Differential Equation (ODE) method. Usually, the iteration should demonstrate either some monotonicity property or a contraction feature in order to converge.
Although in practice such algorithms (after some tuning) usually converge to an objective value, it is not always guaranteed. To achieve that in a stochastic approximation setup, the main known result shows that the iteration can be decomposed into a deterministic function, which depends only on the problem parameters, and a martingale difference noise, which is bounded in some way.
In this section we show that the iterations of Algorithm 1 converge to a stable point of a corresponding ODE. We begin by showing that the process of sampling transitions from RBs is a Markov process. Afterward, we show that if the original Markov chain is irreducible and aperiodic, then the RBs Markov process is also irreducible and aperiodic. This property is required for proving the convergence of the iterations in Algorithm 1 using SA tools. We conclude this section by showing that if sim is, in some sense, close to real, then the properties of the mixed process are close to the properties of both sim and real.
5.1 Asymptotic Convergence of Algorithm 1
Let be a replay buffer storing the last transitions from MDP . Let be the state of at time , i.e., where is a transition tuple pushed at some time . We denote the collection of all as . We define and to be i.i.d. random processes based on and , respectively. We define to be the process induced by Algorithm 1, i.e.,
(5) 
The next lemma states that is Markovian. The proof is deferred to the supplementary material, Section A.1.
Lemma 1 ( induced by Algorithm 1 is Markovian).
1. The random process is Markovian. 2. Under Assumption 3, there exists some such that is irreducible and aperiodic for .
Next, we present several assumptions that are necessary for proving the convergence of Algorithm 1. The first assumption is a standard requirement for policy gradient methods.
Assumption 4.
For any state–action pair , is continuously differentiable in the parameter .
Proving convergence for a general function approximation is hard. In our case we demonstrate the convergence for a linear function approximation (LFA; Bertsekas and Tsitsiklis (1996)). In matrix form, it can be expressed as where . The following assumption is needed for the uniqueness of the convergence point of the critic.
Assumption 5.
1. The matrix has full rank. 2. The functions are Lipschitz in and bounded. 3. For every , where is a vector of ones.
In order to obtain convergence with probability 1 using SA tools, the following standard assumption is needed. Note that in the actor-critic setup we need two-time-scale convergence; thus, in this assumption the critic is a 'faster' recursion than the actor.
Assumption 6.
The stepsizes , , , satisfy , and .
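Step-size schedules of the kind typically used in two-time-scale stochastic approximation can be sketched as below. The exact conditions of Assumption 6 are not reproduced here; the schedules `critic_step` and `actor_step` and the exponent 0.6 are illustrative assumptions that satisfy the usual requirements (non-summable, square-summable, and an actor step that vanishes relative to the critic step).

```python
def critic_step(t, power=0.6):
    """Faster time scale: 1/(t+1)^0.6 is not summable, but its square is."""
    return 1.0 / (t + 1) ** power

def actor_step(t):
    """Slower time scale: 1/(t+1) is not summable, but its square is."""
    return 1.0 / (t + 1)

# The ratio actor/critic equals (t+1)^(-0.4) and tends to zero, so the critic
# sees an almost stationary policy while the actor sees an almost converged critic.
checkpoints = (1, 10, 100, 1000, 10_000)
ratios = [actor_step(t) / critic_step(t) for t in checkpoints]
for t, ratio in zip(checkpoints, ratios):
    print(f"t={t:>6}: actor/critic step ratio = {ratio:.4f}")
```

Any pair of schedules with a vanishing ratio would serve the same illustrative purpose; the specific powers above are only one convenient choice.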
We define the induced MC for the time with a corresponding parameter . For this parameter, we denote with the transition matrix at that time and the corresponding state distribution vector (both induced by the policy ). Finally, we define the following diagonal matrix and the reward vector with elements . Based on these definitions we define
(6) 
where is the identity matrix and is a vector of ones. The intuition behind and is the following. For online TD(0) learning under a stationary policy we have a fixed point at the solution to the equation (Bertsekas and Tsitsiklis (1996); Lemma 6.5). In our case, since we have RBs, each with samples that entered at different times, we have a superposition of all these samples. When , for all index . We let and define
(7) 
For proving the convergence of the critic, we assume the policy is fixed. Thus, for each RB the induced MC is the same for all the samples in this RB, so the sum over disappears for and . Now we are ready to prove the following theorems regarding Algorithm 1. We note that Theorems 2 and 3 state the critic and actor convergence, respectively.
Theorem 2.
The proof for Theorem 2 follows the proof for Lemma 5 in Bhatnagar et al. (2009), see more details in the supplementary material A.2. For establishing the convergence of the actor updates, we define additional terms. Let denote the set of asymptotically stable equilibria of the ODE and let be the neighborhood of . Let , and define
Theorem 3.
5.2 Sim2Real Asymptotic Convergence Properties
In this section we analyze the convergence properties of the Mixing Sim and Real algorithm we use. The main idea is that if sim and real are close in their dynamics through the MDP transition matrix, then many properties of their MDPs under the same policy are close as well. Moreover, we show that under the assumption that sim is close to real, any process derived from both processes is close to both sim and real.
Assumption 7.
(Closeness of sim and real). For all , , , we have .
The following theorem states that if Assumption 7 holds, then sim, real, and the mixed process (as defined in Algorithm 1) converge to close points.
Theorem 4.
Consider a policy and Assumptions 1, 2, and 7. Then, for each , , and we have:
1. The induced MC of sim and real, and , satisfy .
2. Let where its elements are identical to the first elements of . The corresponding stationary distributions satisfy , where is the largest eigenvalue of the matrix .
3. The convergence points for the average reward and value functions under the policy for sim and real satisfy and .
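The qualitative message of the theorem, that close transition matrices yield close stationary distributions and average rewards, can be observed numerically. The sketch below does not reproduce the bounds of Theorem 4; the "real" chain `P_real`, the reward vector `r`, and the perturbation used to build the "sim" chain are hypothetical.

```python
import numpy as np

def stationary(P):
    """Stationary distribution of a row-stochastic matrix P."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    mu, *_ = np.linalg.lstsq(A, b, rcond=None)
    return mu

def make_sim(P, eps):
    """A 'sim' chain whose entries differ from P by at most eps:
    mix P with the uniform transition matrix."""
    n = P.shape[0]
    return (1.0 - eps) * P + eps * np.full((n, n), 1.0 / n)

P_real = np.array([[0.5, 0.5, 0.0],
                   [0.1, 0.6, 0.3],
                   [0.2, 0.0, 0.8]])
r = np.array([0.0, 1.0, -1.0])  # shared per-state reward

mu_real = stationary(P_real)
eps_values = [0.1, 0.01, 0.001]
dist_gaps, reward_gaps = [], []
for eps in eps_values:
    mu_sim = stationary(make_sim(P_real, eps))
    dist_gaps.append(float(np.abs(mu_sim - mu_real).sum()))
    reward_gaps.append(abs(float(mu_sim @ r) - float(mu_real @ r)))
    print(f"eps={eps}: l1 gap of stationary distributions = {dist_gaps[-1]:.5f}, "
          f"average-reward gap = {reward_gaps[-1]:.5f}")
```

On this toy chain both gaps shrink as the perturbation shrinks, which is the behavior the closeness assumption is meant to capture.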
6 Experimental Evaluation
In this section we evaluate the performance of our proposed algorithm on two Fetch Push environments Plappert et al. (2018), one of which acts as the real environment and the other as the simulation environment (code for the experiments is available at https://github.com/sdicastro/SimAndRealBetterTogether). Although our theoretical results concern the proposed mixing scheme with linear function approximation, in this section we focus on nonlinear methodologies, i.e., using neural networks. We set meaning there is only one real and one simulation environment. We denote by the probability of collecting samples from the real environment and by the probability of choosing samples from the real environment for the optimization process. We are interested in demonstrating the effect of different and values on the learning process. We investigate different mixing strategies for combining real and sim samples.

"Mixed": real and sim episodes are collected according to Algorithm 1.

"Real only": The agent collects and optimizes over only real samples (i.e., and ).

"Sim only": The agent collects and optimizes over only sim samples (i.e., and ).

"Sim first": At the beginning the agent collects and optimizes over only sim samples. When the success rate in the sim reaches 0.7, we switch to sampling and optimizing using only real samples.

"Sim-dependent": At the beginning the agent collects and optimizes over only sim samples. When the success rate in the sim environment reaches 0.7, we switch to the "Mixed" strategy.
In the Fetch Push task, a robot arm needs to push an object on a table to a certain goal point. The state is represented by the gripper, object, and target position and pose, as well as their velocities and angular velocities (the final dimension is after removing non-informative dimensions). The action specifies the desired gripper position at the next timestep. The agent gets a reward of -1 if the desired goal has not yet been achieved and 0 if it has been achieved within some tolerance. To solve the task we used our mixing sim and real algorithm and replaced the linear actor-critic optimization scheme (lines 6-9 in Algorithm 1) with DDPG Lillicrap et al. (2015) together with the Hindsight Experience Replay (HER; Andrychowicz et al. (2017)) optimization scheme. We created the real and sim environments using the MuJoCo simulator Todorov et al. (2012). The difference between the environments is the friction between the object and the table. We preceded the following experiments with an experiment that identifies a region of friction parameters where training the task using only sim samples and using the trained policy in the real environment does not solve the task (see supplementary material, Section C.3).
We emphasize that we evaluate the performance in each experiment according to the success rate in the real environment, as this is the environment of final interest. In addition, we seek mixing strategies that require the fewest real samples, since these are usually costlier and harder to obtain than sim samples.
Different values: We fix the optimization parameter and test different values of the collection parameter . Results are presented in Figure 2. We notice that when the agent is trained using the "Sim only" strategy (), it fails to solve the task in real (Figure 2a). Next, when the agent is trained using the "Real only" strategy (), the task is solved. However, to achieve a 0.9 success rate, "Real only" requires approximately 20K real environment episodes, and to increase it to a success rate of 1, it requires approximately 40K real episodes (Figures 2b and 2c). Observing the values in between, we see that achieves the best performance: it uses fewer (K) real episodes to achieve high success rates compared to the "Real only" strategy. Notice that as increases the performance deteriorates. This phenomenon can be explained by the mixed samples distribution. When is low, most of the data distribution is based on sim, and real samples do not change it much, but only "fine tune" the learning. When increases, the data distribution is composed of two different environments, which may confuse the agent.
Different values: In this experiment, we fix and test for . Results are presented in Figure 3. When is low and equals , the agent fails to solve the task (Figure 3a). But when is higher than , the performance improves where no significant differences are observed for . For , the algorithm achieves the best performance: high success rate of 0.9 while using fewer real episodes and fewer sim episodes compared to other values (Figures 3b and 3c). Interestingly, when is too high (with respect to , i.e., ) the performance deteriorates, suggesting that choosing is preferable.
Different Mixing Strategies: We tested different mixing strategies: "Mixed", "Sim first", and "Sim-dependent", as described above. Results are presented in Figure 4. Using the "Sim-dependent" strategy reduced the real and sim episodes required to achieve a 0.9 success rate compared to the "Mixed" strategy with the same and values (Figure 4c). When using the "Sim first" strategy, we observe that although it uses only sim samples at the beginning of the learning, once it switches to using only real samples, the agent requires many more real episodes to achieve a success rate of (compared to the "Mixed" and "Sim-dependent" strategies; Figures 4b and 4c). Although the most common approach is to train a policy in simulation and then use it as an initial starting point for the real system, we see that applying the mixing strategy after transferring the policy to real can further reduce the required real episodes while maintaining a high success rate.
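The mixing strategies compared in this section differ only in how the probability of collecting real samples and the probability of training on real samples evolve during training. A minimal sketch of such a schedule is given below; the class `MixingSchedule`, the strategy labels, and the default rates 0.1 and 0.5 are hypothetical, and the one-way (latched) switch at a 0.7 sim success rate is our reading of the strategy descriptions.

```python
class MixingSchedule:
    """Returns (collect_real_prob, train_real_prob) for a given mixing strategy."""

    STRATEGIES = ("mixed", "real_only", "sim_only", "sim_first", "sim_dependent")

    def __init__(self, strategy, collect_real=0.1, train_real=0.5, threshold=0.7):
        if strategy not in self.STRATEGIES:
            raise ValueError(f"unknown strategy: {strategy}")
        self.strategy = strategy
        self.collect_real = collect_real  # illustrative default, not from the paper
        self.train_real = train_real      # illustrative default, not from the paper
        self.threshold = threshold
        self.switched = False             # becomes True once sim success is reached

    def rates(self, sim_success_rate):
        # Latch the switch: once the threshold is reached we never switch back.
        if sim_success_rate >= self.threshold:
            self.switched = True
        if self.strategy == "mixed":
            return self.collect_real, self.train_real
        if self.strategy == "real_only":
            return 1.0, 1.0
        if self.strategy == "sim_only":
            return 0.0, 0.0
        if not self.switched:             # both remaining strategies start sim-only
            return 0.0, 0.0
        if self.strategy == "sim_first":
            return 1.0, 1.0               # after the switch: real samples only
        return self.collect_real, self.train_real  # "sim_dependent": switch to mixed

schedule = MixingSchedule("sim_dependent")
for success in (0.2, 0.5, 0.75, 0.6):
    print(success, schedule.rates(success))
```

In a training loop, the returned pair would parameterize the choice of environment to collect from and the choice of replay buffer to train on at each step.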
7 Conclusions and Future Work
In this work we analyzed a mixing strategy between simulation and real system samples. By separating the rate of collecting samples from each environment and the rate of choosing samples for the optimization process, we were able to achieve a significant reduction in the amount of real environment samples, compared to the common strategy of using the same rate for both the collection and optimization phases. This reduction is of special interest since real samples are usually costly and harder to obtain. We believe this work can lead to a new line of research. First, a finite sample analysis of our proposed algorithm can reveal its exact sample complexity. Comparing it to the sample complexity of learning only in the real environment can emphasize the advantage of using the mixing strategy. Second, other replay buffer prioritization schemes can now be theoretically analyzed using the dynamics and properties of replay buffers we have developed. Third, our approach is limited to the online case, where new samples are collected during training. Adapting our approach to the offline case can open new avenues in offline RL research. Fourth, learning the real samples collection rate and adapting it during training can further improve our approach.
References
 Hindsight experience replay. In Advances in neural information processing systems, pp. 5048–5058. Cited by: §C.1, §C.2, Appendix C, §6.
 Learning dexterous inhand manipulation. The International Journal of Robotics Research 39 (1), pp. 3–20. Cited by: §1, §2.
 Dynamic programming and optimal control. Athena scientific Belmont, MA. Cited by: §1, §3.
 Neurodynamic programming. Athena Scientific. Cited by: §A.2.1, §5.1, §5.1, §5.
 Incremental natural actorcritic algorithms. In Advances in neural information processing systems, pp. 105–112. Cited by: §A.2, §A.2, §A.2, §A.3, §A.3, §2, §4.
 A simultaneous perturbation stochastic approximationbased actorcritic algorithm for markov decision processes. IEEE Transactions on Automatic Control 49 (4), pp. 592–598. Cited by: §A.3.
 Natural actor–critic algorithms. Automatica 45 (11), pp. 2471–2482. Cited by: §5.1, §5.1.
 The ode method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization 38 (2), pp. 447–469. Cited by: §A.2, §A.2, §A.2.
 Stochastic approximation: a dynamical systems viewpoint. Vol. 48, Springer. Cited by: §5.
 Domain separation networks. In Advances in neural information processing systems, pp. 343–351. Cited by: §2.
 Introduction to rare event simulation. Springer Science & Business Media. Cited by: §1.
 Closing the simtoreal loop: adapting simulation randomization with real world experience. In 2019 International Conference on Robotics and Automation (ICRA), pp. 8973–8979. Cited by: §2.

Finite sample analyses for td (0) with function approximation.
In
Proceedings of the AAAI Conference on Artificial Intelligence
, Cited by: §2. 
A convergent online single time scale actor critic algorithm.
The Journal of Machine Learning Research
11, pp. 367–410. Cited by: §2.  Revisiting fundamentals of experience replay. In International Conference on Machine Learning, pp. 3061–3071. Cited by: §2.

Domainadversarial training of neural networks.
In
Domain Adaptation in Computer Vision Applications
, pp. 189–209. Cited by: §2.  Distributed prioritized experience replay. arXiv preprint arXiv:1803.00933. Cited by: §2.
 Matrix analysis. Cambridge university press. Cited by: §B.1.
 Noise and the reality gap: the use of simulation in evolutionary robotics. In European Conference on Artificial Life, pp. 704–720. Cited by: §1.
 Transferring endtoend visuomotor control from simulation to real world for a multistage task. arXiv preprint arXiv:1707.02267. Cited by: §2.

Simtoreal via simtosim: dataefficient robotic grasping via randomizedtocanonical adaptation networks.
In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
, pp. 12627–12637. Cited by: §2.  Qtopt: scalable deep reinforcement learning for visionbased robotic manipulation. arXiv preprint arXiv:1806.10293. Cited by: §1.
 Generalization through simulation: integrating simulated and real data into deep reinforcement learning for visionbased autonomous flight. In 2019 International Conference on Robotics and Automation (ICRA), pp. 6008–6014. Cited by: §2.

Learning to discover crossdomain relations with generative adversarial networks
. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pp. 1857–1865. Cited by: §2.  Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §C.1.
 Actorcritic algorithms. In Advances in neural information processing systems, pp. 1008–1014. Cited by: §2, §4.
 Stochastic approximation methods for constrained and unconstrained systems. Vol. 26, Springer Science & Business Media. Cited by: §A.3, §2.
 Stochastic approximation and recursive algorithms and applications. Vol. 35, Springer Science & Business Media. Cited by: §5.

Learning handeye coordination for robotic grasping with deep learning and largescale data collection
. The International Journal of Robotics Research 37 (45), pp. 421–436. Cited by: §1.  Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: Appendix C, §1, §4, §6.
 Reinforcement learning for robots using neural networks. Technical report CarnegieMellon Univ Pittsburgh PA School of Computer Science. Cited by: §2, §3.
 The effects of memory replay in reinforcement learning. In 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 478–485. Cited by: §2.
 Learning transferable features with deep adaptation networks. In Proceedings of the 32nd International Conference on Machine Learning, Volume 37, pp. 97–105. Cited by: §2.
 Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. Cited by: §2, §3.
 Organizing experience: a deeper look at replay mechanisms for sample-based planning in continuous state domains. arXiv preprint arXiv:1806.04624. Cited by: §2.
 Sim-to-real transfer of robotic control with dynamics randomization. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–8. Cited by: §2.

 Supersizing self-supervision: learning to grasp from 50k tries and 700 robot hours. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 3406–3413. Cited by: §1.
 Multi-goal reinforcement learning: challenging robotics environments and request for research. arXiv preprint arXiv:1802.09464. Cited by: §6.
 Markov decision processes. Wiley and Sons. Cited by: §3, §3.
 BayesSim: adaptive domain randomization via probabilistic inference for robotics simulators. arXiv preprint arXiv:1906.01728. Cited by: §2.
 CAD2RL: real single-image flight without a single real image. arXiv preprint arXiv:1611.04201. Cited by: §2.
 Prioritized experience replay. arXiv preprint arXiv:1511.05952. Cited by: §2.
 Learning from simulated and unsupervised images through adversarial training. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2107–2116. Cited by: §2.
 A DIRT-T approach to unsupervised domain adaptation. arXiv preprint arXiv:1802.08735. Cited by: §2.
 Reinforcement learning: an introduction. MIT press. Cited by: §1.
 Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 23–30. Cited by: §2.
 MuJoCo: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: §6.
 How to pick the domain randomization parameters for sim-to-real transfer of reinforcement learning policies?. arXiv preprint arXiv:1903.11774. Cited by: §2.
 A finite time analysis of two time-scale actor critic methods. arXiv preprint arXiv:2005.01350. Cited by: §2.
 Collective robot reinforcement learning with distributed asynchronous guided policy search. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 79–86. Cited by: §1.
 Experience replay optimization. arXiv preprint arXiv:1906.08387. Cited by: §2.
 A deeper look at experience replay. arXiv preprint arXiv:1712.01275. Cited by: §2.
 Finite-sample analysis for SARSA with linear function approximation. arXiv preprint arXiv:1902.02234. Cited by: §2.
Appendix A Proof of Main Lemmas and Theorems of Section 5.1
A.1 Proof of Lemma 1
Proof.
1. Proving Markovity requires that
(8) 
Let us denote , and . Recall that and that the time index at which a transition enters RB(k) is for all and for all . Index gives the position in RB(k) at which the transition is placed at time . In addition, recall that where . Let be RB(k) of MDP at time , denoted as . We denote the collection of all as .
Remark 1.
Note that each time step at which a transition enters some is unique. That is, for a fixed , for all and for . Moreover, for all and all . In addition, note that when a new transition is pushed into the RB, the oldest transition in the RB is discarded and all remaining transitions in the RB move one index forward, that is for and .
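The push rule described in Remark 1 can be illustrated with a minimal sketch. The function name `push_transition`, the use of a Python list, and the convention that the newest transition sits at position 0 are assumptions made for illustration only; they are not the paper's notation or implementation. Transitions are represented here simply by the (unique) time step at which they entered the buffer.

```python
def push_transition(rb, transition, capacity):
    """Return the buffer contents after a FIFO push.

    Assumed convention (hypothetical, for illustration): the newest
    transition is stored at position 0 and the oldest at the last position.
    Every stored transition moves one index forward; if the buffer was
    already full, the oldest transition is discarded.
    """
    shifted = [transition] + list(rb)
    return shifted[:capacity]


# Each transition is identified by its unique entry time step.
rb = []
for t in range(5):
    rb = push_transition(rb, t, capacity=3)

print(rb)
```

Under this convention the next buffer content is a deterministic function of the current content and the incoming transition, which is the structural fact the proof relies on when it invokes the rule for pushing a transition into the RB.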
Computing the l.h.s. of (8) yields
where in equality (1) we use the definition, in equality (2) we write the RB samples explicitly, in equality (3) we rearrange the terms, in equality (4) we express the probability as a conditional product, and in equality (5) we use the fact that and are independent random variables together with the rule for pushing a transition into RB():
Similarly, computing the r.h.s. of (8) yields
Both sides of (8) are equal and therefore is Markovian.
2. According to Assumption 3, for every environment and every policy, the Markov process induced by the MDP together with the policy is irreducible and aperiodic. In addition, we assume , where is the time at which all RBs are full, each one holding transitions. This means that when a new transition arrives at RB(k), the oldest transition in the buffer must be discarded. We saw in part 1 that
(9) 
Let be an index set. We now write explicitly the following term
(10) 
where we expressed the probability as a conditional product, separating RB() at time from all other RBs. Note that in : for all since these RBs do not change in this time step.
We continue with expression (a).