I. Introduction
With the advent of big data in robotics [1, 2, 3, 4], there has been an increasing interest in self-supervised learning for planning and control. The core idea behind these approaches is to collect large-scale datasets where each datapoint contains the current state (e.g. an image of the environment), the action/motor command applied, and the outcome (success/failure/reward) of the action. This large-scale dataset is then used to learn policies, typically parameterized by high-capacity functions such as Convolutional Neural Networks (CNNs), that predict the agent's actions from input images/observations. But what is the right way to collect this dataset for self-supervised learning?
Most self-supervised learning approaches use random exploration: first a set of random objects is placed on the tabletop, followed by a random selection of actions. However, is random sampling the right way to train a self-supervised system? Random exploration with a few thousand data points will only work when the output action space is low-dimensional. In fact, the recent successes in self-supervised learning that report experiments on real robots (not just simulation) use a search space of only 3-6 dimensions ([1, 2, 3] use 3-, 4- and 5-dim output action spaces respectively). Random exploration is also suboptimal because it leads to a very sparse sampling of the action space.
In this paper, we focus on the problem of sampling and self-supervised learning for high-level, high-dimensional control. One possible approach is to collect and sample training data using staged training [1] or on-policy search [5]. In both of these approaches, random sampling is first used to train an initial policy. This policy is then used to sample the next set of training points for learning. However, such approaches are heavily biased by the initial learning from random samples and often sample points from a small search space. Therefore, recent papers have investigated other exploration strategies, such as curiosity-driven exploration [6]. However, data sparsity in high-dimensional action spaces remains a concern.
Let's take a step back and consider how humans deal with high-dimensional control. We note that the action space of human control grows continually with experience: the search does not start in high dimensions but rather in a small slice of the high-dimensional space. For example, in the early stages of human development, when hand-eye coordination is learned, a single mode of grasping (the palmar grasp) is used, and more complex, multi-fingered grasping modalities are acquired gradually [7]. Inspired by this observation, we propose a similar strategy: order the learning in control parameter space by fixing a few dimensions of the control parameters and sampling in the remaining dimensions. We call this strategy curriculum learning in control space, where the curriculum decides which control dimensions to learn first. (Note that our curriculum is defined in control space, as opposed to the standard usage where easy training examples are used first, followed by hard ones. In our case, the objects being explored, though diverse and numerous, remain fixed.) We use a sensitivity-analysis-based approach to define the curriculum over control dimensions. We note that our framework is designed to infer high-level control commands and uses planners/low-level controllers to achieve the desired commands. In future work, curriculum learning of low-level control primitives, such as actuator torques, could be explored.
We demonstrate the effectiveness of our approach on the task of adaptive multi-fingered grasping (see Fig. 1). Our search space is 8-dimensional, and we sample the training points for learning control in 6 dimensions (the remaining two dimensions are handled via region-proposal-style sampling, as explained later). We show how a robust model for grasping can be learned with very few examples. Specifically, we illustrate that defining a curriculum over the control space improves the overall grasping rate over random sampling and the staged-training strategy by a significant margin. To the best of our knowledge, this is the first application of curriculum learning to a physical robotic task.
II. Related Work
Curriculum Learning: For biological agents, concepts are easier to learn when provided in a structured manner instead of an arbitrary order [8]. This idea was formalized for machine learning algorithms by Elman et al. [9] and Bengio et al. [10]. Under the name of Curriculum Learning (CL) [10], the core idea is to learn the easier aspects of a problem first while gradually increasing the difficulty. Most curriculum learning frameworks focus on ordering the training data: first train the model on easy examples and then on more complex data points. A curriculum over data has been shown to improve generalization and speed up convergence [11, 12]. In our work, we propose curriculum learning over the control space for robotic tasks. The key idea in our method is that in higher-dimensional control spaces, some modalities are easier to learn and are uncorrelated with other modalities. Our variance-based sensitivity analysis exposes these easy-to-learn modalities, which are learned earlier, while harder modalities are tackled later.

Intrinsic Motivation:
Given the challenges of reinforcement learning in tasks with sparse extrinsic reward, several works have utilized intrinsic motivation for exploration and learning. Recently, Pathak et al. [13] learned a policy for a challenging visual-navigation task by optimizing intrinsic rewards extracted from self-supervised future image/state prediction error. Sukhbaatar et al. [14] proposed an asymmetric self-play scheme between two agents to improve data efficiency and incrementally explore the environment. In our work, the curriculum is defined over the control space to incrementally explore parts of the high-dimensional action space.

Ranking Functions: An essential challenge in CL is to construct a ranking function, which assigns a priority to each training datapoint. With human experts available, a stationary ranking function can be hand-defined; in Bengio et al. [10], the ranking function is specified by the variability in object shape. Other methods, like Self-Paced Learning [15] and Self-Paced Curriculum Learning [16], dynamically update the curriculum based on how well the agent is performing. In our method, we use a stationary ranking learned by performing sensitivity analysis [17] on data collected by sampling the control values from a quasi-random sequence. This stationary ranking gives a priority ordering over the control parameters. Most formulations of curriculum training use a linear curriculum ordering; a recent work by Svetlik et al. [18] instead generated a directed acyclic graph of curriculum orderings and showed improved data efficiency when training agents to play Atari games with reinforcement learning.
Grasping: We demonstrate the data-efficiency of CASSL on the grasping problem; refer to [19, 20] for surveys of prior work. Classical approaches focus on physics-based analysis of stability [21]. However, these methods usually require explicit 3D models of the objects and do not generalize well to unseen objects. To perform grasping in complex unstructured environments, several data-driven methods have been proposed [22, 1, 2]. For large-scale data collection, both simulation [22] and real-world robots [1, 2] have been used. However, these large-scale methods operate on lower-dimensional control spaces (planar grasps often have a 3-dimensional output space), since high-dimensional grasping requires significantly more data. In our work, we hypothesize and show that CASSL requires less data and can also learn higher-dimensional grasping configurations.
Robot Learning: The proposed method of Curriculum Accelerated Self-Supervised Learning (CASSL) is not specific to grasping and can be applied to a wide variety of robot learning, manipulation and self-supervised learning tasks. Self-supervised learning has been used to push and poke objects [3, 23]. Nevertheless, a common criticism of self-supervised approaches is their dependence on large-scale data. While reducing the amount of training data is an active area of research [24], CASSL may help reduce this data dependency. Deep reinforcement learning methods [25, 26, 27] have empirically shown the ability of neural networks to learn complex policies and general agents; unfortunately, these model-free methods often need on the order of millions of datapoints to learn their perception-based control policies.
III. Curriculum Accelerated Self-Supervised Learning (CASSL)
We now describe our curriculum learning approach for high-level control. First, we discuss how to obtain a priority ordering of the control parameters, followed by how to use the curriculum for learning.
III-A CASSL Framework
Our goal is to learn a policy, modeled as a scoring function f(I, a), which, given the current state represented by an image I and an action a, predicts the likelihood of success for the task. Note that in the case of high-dimensional control, a is a K-dimensional vector, where K is the dimensionality of the action space. For the task of grasping an object, f can be the grasp success probability given the image of the object (I) and the control parameters of the grasp configuration (a). The high-level control dimensions for grasping are the grasp configuration, gripper pose, force, grasping mode, etc., as explained later.

The core idea is that instead of randomly sampling training points in the original K-dim space and learning a policy, we focus learning on specific dimensions first. We sample more uniformly (high exploration) in the dimensions we are currently trying to learn, while for the other dimensions we use the current model's predictions (low exploration). Consequently, the problem reduces to finding the right ordering of the control dimensions. One way of determining this ranking is expert human labeling; however, for the tasks we care about, the output function is often too complex for a human to infer rankings, owing to the complex space of grasping solutions. Instead, we use global sensitivity analysis on a dataset of physical robotic grasping interactions to determine this ranking. The key intuition is to sequentially select the dimensions that are the most independent and interact the least with all other dimensions, and hence are easier to learn.
III-B Sensitivity Analysis
For defining a curriculum over control dimensions, we use variance-based global sensitivity analysis. Mathematically, for a model of the form y = f(x_1, ..., x_K), global sensitivity analysis aims to numerically quantify how uncertainty in the scalar output y (e.g. grasp success probability in this paper) can be apportioned to uncertainty in the input variables x_i (i.e. the control dimensions) [28]. The first-order index, denoted by S_i, is the most basic metric of sensitivity and represents the uncertainty in y that comes from x_i alone. Another metric of interest is the total sensitivity index S_{Ti}, which is the sum of all sensitivity indices (first- and higher-order terms) involving the variable x_i; as a result, it captures the interactions (pairwise, tertiary, etc.) of x_i with the other variables. Detailed descriptions of Monte Carlo estimators for these indices, with proofs, can be found in [28]. Obtaining the sensitivity metrics requires the model f or an approximate version of it. Instead, we use the Sobol sensitivity analysis [29] implementation in SALib and propose a data-driven method for estimating the sensitivity metrics. In Sobol sensitivity analysis, the control input is sampled from a quasi-random sequence, as it provides better coverage/exploration of the control space than a uniform random distribution.

III-C Determining the Curriculum Ranking
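The ranking builds on the indices just defined. As a concrete illustration of the first-order index S_i = Var(E[y | x_i]) / Var(y), here is a minimal binning-based estimator; the paper itself uses the Sobol estimator in SALib, and the toy model `f` below is invented for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def first_order_indices(f, n_dims, n_samples=100_000, n_bins=32):
    """Crude S_i estimate: bin x_i, take the variance of conditional means."""
    X = rng.uniform(size=(n_samples, n_dims))
    Y = f(X)
    var_y = Y.var()
    S = []
    for i in range(n_dims):
        bins = np.floor(X[:, i] * n_bins).astype(int).clip(max=n_bins - 1)
        cond_means = np.array([Y[bins == b].mean() for b in range(n_bins)])
        S.append(cond_means.var() / var_y)  # Var(E[Y | X_i]) / Var(Y)
    return S

# Toy model: output depends strongly on x0, weakly on x1, not at all on x2.
f = lambda X: 5 * X[:, 0] + 0.5 * X[:, 1]
S = first_order_indices(f, n_dims=3)
# S[0] is near 1, S[1] is small, S[2] is near 0.
```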
Given a large control space, an intuitive curriculum would be to learn control dimensions in decreasing order of their sensitivity. However, when designing a curriculum, we also care about the interactions between a control dimension and the others. Hence, we want dimensions that have high sensitivity and low correlation with the other dimensions. One way to achieve this is to minimize the higher-order (>1) terms (i.e. S_{Ti} - S_i) and the pairwise interaction terms S_{ij} between variables. Given the sensitivity values for each control dimension, we choose the subset of dimensions X which minimizes the heuristic in Eqn. 1 below:

\xi(X) = \sum_{i \in X} (S_{Ti} - S_i) + \sum_{i \in X} \sum_{j \in D \setminus X} S_{ij}    (1)
Here D is the set of all control dimensions (i.e. |D| = K), and X is a subset of dimensions. We evaluate all possible subsets and choose the subset with the minimum value as the first set of control dimensions in the curriculum. We then recompute the objective for subsets of the remaining control dimensions and iteratively choose the next subset (as seen in Algorithm 1). The intuition behind Eqn. 1 is that we want to choose the subset of control dimensions on which the output depends the most and which is least correlated with the remaining dimensions.
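The iterative subset selection can be sketched as follows. The sensitivity values here are made up for illustration, and the heuristic follows the reconstruction of Eqn. 1 given in the text: penalize higher-order effects of the chosen subset plus its pairwise interactions with the dimensions left behind.

```python
from itertools import combinations

# Hypothetical sensitivity values for a 3-dimensional control space.
S1 = [0.30, 0.05, 0.20]           # first-order indices S_i
ST = [0.35, 0.60, 0.25]           # total indices S_Ti
S2 = [[0.00, 0.20, 0.02],         # symmetric pairwise interactions S_ij
      [0.20, 0.00, 0.15],
      [0.02, 0.15, 0.00]]

def xi(X, remaining):
    """Heuristic of Eqn. 1 for candidate subset X within `remaining`."""
    higher_order = sum(ST[i] - S1[i] for i in X)
    cross_terms = sum(S2[i][j] for i in X for j in remaining - set(X))
    return higher_order + cross_terms

def curriculum(dims):
    """Repeatedly pick the subset minimizing xi, then recurse on the rest."""
    remaining, order = set(dims), []
    while remaining:
        subsets = [c for r in range(1, len(remaining) + 1)
                   for c in combinations(sorted(remaining), r)]
        best = min(subsets, key=lambda X: xi(X, remaining))
        order.append(best)
        remaining -= set(best)
    return order

order = curriculum([0, 1, 2])
# Dimension 2 (high S_i, low interactions) is learned first, then 0, then 1.
```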
III-D Modeling the Policy
The policy function takes the image as input and outputs the desired action. Inspired by the approach in [1], we use a CNN to model the policy. However, since CNNs have been shown to work better on classification than on regression, we employ classification instead of regressing the control outputs. To this end, each control dimension is discretized into bins as given in Table I.
Our network design is based on AlexNet [30], with the convolutional layers initialized with ImageNet-pretrained [31] weights as done before in [1, 32]. We use ImageNet-pretrained features as they have proven effective for transfer learning in a number of visual recognition tasks [33, 34]. The network architecture is shown in Fig. 3. The weights of the fully-connected layers are initialized from a normal distribution. While we could have used a separate network for each control parameter, this would enormously increase the size of our model and make the predictions completely independent. Instead, we employ a shared architecture commonly used in multi-task learning [32, 23], so that the non-linear relationships between the different parameters can be learned. Each parameter has a separate fc7 layer, which ensures that the network learns a shared representation of the task up to the fc6 layer. The fc8 outputs are finally passed through and normalized by a sigmoid function. Predicting the correct discretized value for each control parameter is formulated as a multi-way classification problem. More specifically,
the output is akin to a Q-value function that returns the probability of success when the action corresponding to a given discrete bin of a control dimension is taken.

III-E Curriculum Training
Algorithm 1 describes the complete training procedure of our method. First, initial data is collected to perform sensitivity analysis; given the resulting priority ordering, we begin the training procedure for our policy models. Apart from diversity in the objects seen, we still need to enforce exploration in the action space through all stages of the curriculum training.
As described in Algorithm 1, the greedy action corresponds to executing whatever control values the network predicts. The hyperparameters ε = 0.15 (for control dimensions already learned in the curriculum) and ε = 0.7 (for dimensions not yet learned) determine the probability of choosing an exploratory action vis-à-vis the greedy one given by the policy. Therefore, for the control dimensions already learned, we are more likely to follow the policy given by the network, so these dimensions receive little exploration. In contrast, the control dimensions not yet learned receive a great deal of exploration, so that the data collected captures the higher-order effects between control parameters. When exploring, the control is chosen via importance sampling, as follows. The grasping policy is parameterized as a multi-class classifier over a discretized action space. As a result, the output value p from the final sigmoid layer for a given discrete bin of a control dimension can be treated as a Bernoulli random variable with probability p. The control value selected is the one the model is most uncertain about, i.e. the one with the highest variance p(1 - p); taking the analytic derivative, the uncertainty is maximized at p = 0.5. This approach is similar to previous works such as [13], where actions were taken based on what the agent is most "curious"/uncertain about, with the curiosity reward defined as the prediction error of the next state given the current state and action. Similarly, in [6], the actions taken are those that maximize information gain about the agent's belief of the environment dynamics.

At each stage of curriculum learning, we also aggregate the training dataset, similar to DAgger [35] and prior work [1]. At stage k of the curriculum, the network is fine-tuned on the aggregated data together with the dataset collected in the current stage of the curriculum. We sample from the current stage's data 2.5 times more often to give more importance to new datapoints.
IV. CASSL for Grasping
We now describe the implementation of CASSL for the task of grasping objects. The grasping experiments and data are collected on a Fetch mobile manipulator [36]. Visual data is collected using a PrimeSense Carmine 1.09 short-range RGBD sensor, and we use a 3-finger adaptive gripper from Robotiq. The Expansive Space Trees (EST) planner from MoveIt! is used to generate collision-free trajectories, and state estimation is hand-designed similar to prior work [1], using background subtraction to detect newly placed objects on the table. We further use depth images to obtain an approximate value for the height of objects.
IV-A Adaptive Grasping
The Robotiq gripper has three fingers that can be independently controlled and has two primary grasp modalities: encompassing and fingertip grips. As shown in Fig. 4, there are three operational modes for the gripper: Pinch, Normal and Wide. Pinch mode is meant for precision grasping of small objects and is limited to fingertip grasps. Normal mode is the most versatile and can grasp a wide range of objects with encompassing and fingertip grasps. Wide mode is adept at grasping circular or large objects. While the fingers can be individually controlled, we only command the entire gripper to open/close, and the proprietary planner handles the lower-level control of the fingers. The fingers are operated at a speed of 110 mm/sec.
The adaptive mechanism of the gripper also allows it to better handle uncertainty in the object's geometry and pose. As a result of the adaptive closing mechanism, some of the grasps end up being similar to push-grasps [37]: the gripper fingers sweep the region containing the object, so that the object ends up pushed inside the fingers regardless of its starting position. Such grasps may not achieve force closure, however, and the object can slip out of the gripper.
IV-B Grasping Problem Definition
We formulate our problem in the context of tabletop grasping, where we infer high-level grasp control parameters from the image of the object. Three parameters determine the location of the grasp (the planar position and the height), three determine the approach direction and orientation of the gripper (the three grasp angles), and two more involve the configuration (Mode and Force). The geometric description of the three angles with respect to the object pose is shown in Fig. 4, and details of each parameter are provided in Table I. The in-plane rotation is very sensitive for asymmetrical, elongated objects, while the angle from the vertical axis allows the gripper to tilt its approach direction and grasp objects from the side. The camera's point cloud gives a noisy estimate of the object height, and the height of the table with respect to the robot base is known. The height parameter is a scaling value (between 0 and 1) that interpolates between these two values to give the final grasp height. The grasp height is crucial in ensuring that the gripper moves low enough to make contact with the object in the first place; note, however, that the error in the height depends on both this parameter and the noisy depth measurement from the camera. As shown in Fig. 4, only three discrete modes of the gripper are provided by the manufacturer.

Although the total grasp control space is 8-dimensional, the two translational controls (x and y) are subsumed in the sampling. Given an input image of the entire scene, 150 patches are sampled, corresponding to different values of x and y. Though this increases inference time (since we have to forward multiple samples), it also massively decreases the search space, as much of the scene (corresponding to the background) is empty. Hence, only 6 dimensions of control are learned for our task of grasping.
Parameter  Min  Max  # of Discrete Bins
(grasp angle)  –  –  20
(grasp angle)  –  –  10
(grasp angle)  –  –  10
(Height)  0  1  5
(Mode)  0  2  3
(Force)  15 N  60 N  20
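Given the ranges and bin counts in Table I, the discretization and its inverse can be sketched as follows. Bin-center interpolation is an assumption here; the paper only states that the predicted bin is "interpolated to obtain the actual continuous value."

```python
def to_bin(value, lo, hi, n_bins):
    """Map a continuous control value to its discrete bin index."""
    frac = (value - lo) / (hi - lo)
    return min(int(frac * n_bins), n_bins - 1)

def to_value(bin_idx, lo, hi, n_bins):
    """Map a predicted bin back to a continuous value (bin center)."""
    return lo + (bin_idx + 0.5) * (hi - lo) / n_bins

# Force: 15 N to 60 N in 20 bins, as in Table I.
b = to_bin(33.0, 15.0, 60.0, 20)   # (33-15)/45 = 0.4 -> bin 8
v = to_value(b, 15.0, 60.0, 20)    # 15 + 8.5 * 2.25 = 34.125 N
```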
IV-C Sensitivity Analysis on Adaptive Grasping
As described in Section IIIB, we collect a dataset of 1960 grasp interactions using the sobol quasirandom sampling scheme with an accuracy of 21% during data collection. The results for the and indices for all control parameters are shown in Table II. While the sensitivity analysis was limited to 10 objects, they were diverse in their properties  shape, deformable vs. rigid, large vs. small. Given sensitivity indices for each control parameter, the objective function in Eqn 1 is optimized to determine the optimal ordering of the control parameters to learn. The ordering that minimizes Eqn 1 is: in decreasing order of priority.
S_i   0.014  0.109  0.040  0.087  0.164  0.124
S_Ti  0.799  0.985  0.892  1.130  0.850  0.788

Pairwise interaction terms S_ij (upper triangle, same column order):
dim 1:  –  0.0125  0.195  0.216  0.153  0.0956
dim 2:  –  –  0.0859  0.163  0.190  0.0385
dim 3:  –  –  –  0.0904  0.194  0.236
dim 4:  –  –  –  –  0.280  0.0519
dim 5:  –  –  –  –  –  0.260
IV-D Training and Model Inference
Eqn. 2 is the joint loss function that is optimized. Here y_i is the success/failure label for datapoint i, N_k is the number of discretized bins for control parameter k (see Table I), K (= 6) is the number of control parameters, B is the batch size, and σ is the sigmoid activation. δ(a_k = j) is an indicator function equal to 1 when the executed value of control parameter k corresponds to bin j, and f_{kj}(x_i) is the corresponding activation passed into the final sigmoid.

L = -\frac{1}{B} \sum_{i=1}^{B} \sum_{k=1}^{K} \sum_{j=1}^{N_k} \delta(a_k = j) \left[ y_i \log \sigma(f_{kj}(x_i)) + (1 - y_i) \log\left(1 - \sigma(f_{kj}(x_i))\right) \right]    (2)
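A NumPy sketch of this (reconstructed) loss: sigmoid cross-entropy on the executed bin of each control dimension, summed over dimensions and averaged over the batch. Shapes and values are illustrative only.

```python
import numpy as np

def cassl_loss(logits, taken_bins, labels):
    """logits: list of K arrays, each (B, N_k) pre-sigmoid activations.
    taken_bins: (B, K) executed bin indices.
    labels: (B,) grasp success in {0, 1}."""
    B = labels.shape[0]
    total = 0.0
    for k, f_k in enumerate(logits):            # loop over control dimensions
        p = 1.0 / (1.0 + np.exp(-f_k))          # sigmoid activations
        j = taken_bins[:, k]
        p_taken = p[np.arange(B), j]            # delta(a_k = j) picks one bin
        total -= np.sum(labels * np.log(p_taken)
                        + (1 - labels) * np.log(1 - p_taken))
    return total / B

# Toy batch: K = 2 control dimensions with 4 and 3 bins, B = 2 samples.
logits = [np.zeros((2, 4)), np.zeros((2, 3))]   # all activations 0 -> p = 0.5
bins_taken = np.array([[1, 0], [3, 2]])
labels = np.array([1.0, 0.0])
loss = cassl_loss(logits, bins_taken, labels)   # = 2 * ln 2 when all p = 0.5
```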
Note that for each image datapoint, the gradients for all six control parameters are backpropagated during training. For each stage of the curriculum, the network is trained for 15-20 epochs with a learning rate of 0.0001 using the ADAM optimizer [38]. For inference, once we have the bounding box of the object of interest, 150 image patches are sampled randomly within this window and resized to 224×224 for the forward pass through the CNN. For each control parameter, the discrete bin with the highest activation is selected and interpolated to obtain the actual continuous value. The networks and optimization are implemented in TensorFlow [39]. As is good practice when training deep models, we use dropout (0.5) to reduce overfitting.

V. Experimental Evaluation
Experimental Settings: To quantitatively evaluate the performance of our framework, we physically tested the learned models on a set of diverse objects and measured grasp accuracy averaged over a large number of trials. We have three test sets (shown in Fig. 5): 1) Set A, containing 10 objects seen by the robot during training; 2) Set B, containing 10 novel objects; and 3) Set C, with 20 novel objects. For Sets A and B, 5 grasps were attempted for each object placed in various random initial configurations; the results are detailed in Table III. CL0 in Table III refers to the model trained on the 1960 grasps collected for sensitivity analysis. Fig. 6 shows some successful grasps executed with the robot using the final model trained with CASSL (i.e. CL6). Given the long physical testing time on the largest test set C, we took the best-performing model and baselines on test Sets A and B and evaluated them on Set C. As summarized in Table III, the values reported for each model on Set C were averaged over a total of 160 physical grasping trials (8 per object). When testing, the object was placed in 8 canonical orientations (N, S, W, E, NE, SE, SW and NW) with respect to the same reference orientation.
Curriculum Progress: Grasp accuracy increases with each stage of curriculum learning on Sets A and B, as shown in Fig. 7. Starting with CL0 at 41.67%, accuracy topped out at 70.0% on Set A (seen objects) and 62% on Set B (novel objects) at the end of the curriculum with the CL6 model. Note that at each stage of the curriculum, the model trained in the previous stage was used to collect around 460-480 grasps, as explained in Algorithm 1. As expected, the performance of the models on Set A (seen objects) was better than on the novel objects of Set B. Yet the strong grasping performance on unseen objects suggests that the CNN learned a generalized visual representation that scales to novel objects. There was a dip in accuracy at CL2, possibly owing to overfitting on one of the control dimensions, but performance recovered in subsequent stages since the models are trained with all the aggregated data.
Method  Training  Set A (Test)  Set B (Test)  Set C (Test)
CL0  20.9  42.0  42.0  –
CASSL (Ours), CL6  51.1  70.0  62.0  66.9
CASSL (Random 1)  42.7  56.0  54.0  55.6
CASSL (Random 2)  37.1  54.0  50.0  –
Staged Learning [1, 2]  26.85  66.0  54.0  56.9
Random Exploration  25.8  48.0  48.0  –
Baseline Comparison: We evaluated against four baselines, all of which were given equal or more data than CASSL. 1) Random Exploration: training the network from scratch with 4756 random grasps. 2) Staged Learning [1, 2]: we first trained the network with the data from sensitivity analysis (i.e. CL0) and used this learned policy to sample the next 2796 grasp datapoints, as done in prior work; the policy was then fine-tuned on the aggregated data (4756 examples), and in a third and final stage, 350 new grasp datapoints were sampled. This staged baseline is the training methodology used in prior work [1, 2]. 3 & 4) CASSL (Random 1 & 2): instead of using sensitivity analysis to define the curriculum, two randomly ranked orderings of the control parameters were trained with CASSL; the performance of the final trained models is reported in Table III. In addition to the baselines above, the CL0 model achieves a grasping rate of around 20.86%, which can be roughly considered the performance of random grasping trained with 1960 datapoints.
All the curriculum models (except CL0 and CL2) outperformed the random exploration baseline's accuracy of 48%. On Set B (novel objects), CL6 showed a marked increase of 14%, 8% and 12% vis-à-vis the random exploration, staged learning and CASSL (Random 2) baselines respectively. On the larger Set C, CL6 still outperformed staged learning by about 10% and CASSL (Random 1) by 11.3%. The curriculum optimized with sensitivity analysis outperformed the random curricula, illustrating the importance of choosing the right curriculum ranking, the lack of which can hamper learning performance.
VI. Conclusion and Future Work
In this work, we introduced Curriculum Accelerated Self-Supervised Learning (CASSL) for high-level, high-dimensional control. In general, random sampling or staged learning is not optimal. Instead, we utilize sensitivity analysis to compute the curriculum ranking in a data-driven fashion and assign a priority for learning each control parameter. We demonstrate the effectiveness of CASSL on adaptive, 3-fingered grasping: on novel test objects, CASSL outperformed baseline random sampling by 14%, on-policy sampling by 8% and a random-curriculum baseline by 12%. In future work, we hope to explore the following: 1) modifying the framework to include a dynamically changing curriculum instead of a precomputed stationary ordering; 2) investigating applications in hierarchical reinforcement learning, where a high-level policy trained with CASSL is used alongside a low-level controller; and 3) scaling CASSL to learning in high-dimensional manipulation tasks such as in-hand manipulation.
Acknowledgements
This work was supported by ONR MURI N00014-16-1-2007, NSF IIS-1320083 and a Google Focused Award. Abhinav Gupta was supported in part by a Sloan Research Fellowship, and Adithya was partly supported by an Uber Fellowship. The authors would also like to thank Alex Spitzer, Wen Sun, Nadine Chang, and Tanmay Shankar for discussions, and Chuck Whittaker, Eric Relson and Christine Downey for administrative and hardware support.
References
[1] L. Pinto and A. Gupta, "Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours," ICRA, 2016.
[2] S. Levine, P. Pastor, A. Krizhevsky, and D. Quillen, "Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection," ISER, 2016.
[3] P. Agrawal, A. Nair, P. Abbeel, J. Malik, and S. Levine, "Learning to poke by poking: Experiential learning of intuitive physics," NIPS, 2016.
 [4] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “Endtoend training of deep visuomotor policies,” JMLR, 2016.
 [5] R. S. Sutton, “Generalization in reinforcement learning: Successful examples using sparse coarse coding,” Advances in neural information processing systems, pp. 1038–1044, 1996.
[6] R. Houthooft, X. Chen, Y. Duan, J. Schulman, F. D. Turck, and P. Abbeel, "Variational information maximizing exploration," arXiv preprint arXiv:1605.09674, 2016.
 [7] Y. Futagi, Y. Toribe, and Y. Suzuki, “The grasp reflex and moro reflex in infants: Hierarchy of primitive reflex responses,” International Journal of Pediatrics, 2012.
[8] B. Skinner, "Reinforcement today," American Psychologist, vol. 13, pp. 94–99, 1958.
[9] J. Elman, "Learning and development in neural networks: The importance of starting small," Cognition, vol. 48, pp. 781–799, 1993.
 [10] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum learning,” in ICML, 2009.
 [11] L.J. Li and L. FeiFei, “Optimol: automatic online picture collection via incremental model learning.” IJCV, 2010.
 [12] X. Chen and A. Gupta, “Webly supervised learning of convolutional networks.” ICCV, 2015.
 [13] D. Pathak, P. Agrawal, A. Efros, and T. Darrell, “Curiositydriven exploration by selfsupervised prediction,” ICML, 2017.
 [14] S. Sukhbaatar, Z. Lin, I. Kostrikov, G. Synnaeve, and A. Szlam, “Intrinsic motivation and automatic curricula via asymmetric selfplay,” arXiv, 2017.
 [15] M. Kumar, B. Packer, and D. Koller, “Selfpaced learning for latent variable models.” NIPS, 2010.
[16] L. Jiang, D. Meng, Q. Zhao, S. Shan, and A. G. Hauptmann, "Self-paced curriculum learning," in AAAI, 2015.
[17] A. Saltelli, P. Annoni, I. Azzini, F. Campolongo, M. Ratto, and S. Tarantola, "Variance based sensitivity analysis of model output. Design and estimator for the total sensitivity index," Computer Physics Communications, vol. 181, no. 2, pp. 259–270, 2010.
[18] M. Svetlik, M. Leonetti, J. Sinapov, R. Shah, N. Walker, and P. Stone, "Automatic curriculum graph generation for reinforcement learning agents," in AAAI, 2017.
 [19] A. Bicchi and V. Kumar, “Robotic grasping and contact: a review,” in ICRA, 2000.
 [20] J. Bohg, A. Morales, T. Asfour, and D. Kragic, “Datadriven grasp synthesis—a survey,” IEEE Transactions on Robotics, 2014.
 [21] V.D. Nguyen, “Constructing forceclosure grasps,” The International Journal of Robotics Research, vol. 7, no. 3, pp. 3–16, 1988.
 [22] J. Mahler, F. T. Pokorny, B. Hou, M. Roderick, M. Laskey, M. Aubry, K. Kohlhoff, T. Kröger, J. Kuffner, and K. Goldberg, “Dexnet 1.0: A cloudbased network of 3d objects for robust grasp planning using a multiarmed bandit model with correlated rewards,” in ICRA, 2016.
 [23] L. Pinto and A. Gupta, “Learning to push by grasping: Using multiple tasks for effective learning,” arXiv preprint arXiv:1609.09025, 2016.
 [24] L. Pinto, J. Davidson, and A. Gupta, “Supervision via competition: Robot adversaries for learning tasks,” arXiv preprint arXiv:1610.01685, 2016.
 [25] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Humanlevel control through deep reinforcement learning,” Nature, 2015.
 [26] J. Schulman, S. Levine, P. Abbeel, M. I. Jordan, and P. Moritz, “Trust region policy optimization.” in ICML, 2015, pp. 1889–1897.
 [27] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, “Benchmarking deep reinforcement learning for continuous control,” in Proceedings of the 33rd International Conference on Machine Learning (ICML), 2016.
[28] A. Saltelli, P. Annoni, I. Azzini, F. Campolongo, M. Ratto, and S. Tarantola, "Variance based sensitivity analysis of model output. Design and estimator for the total sensitivity index," Computer Physics Communications, vol. 181, no. 2, pp. 259–270, 2010.
[29] J. Herman and W. Usher, "SALib: An open-source Python library for sensitivity analysis," Journal of Open Source Software, 2017.
 [30] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012.
[31] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "Imagenet large scale visual recognition challenge," arXiv preprint arXiv:1409.0575, 2014.
 [32] L. Pinto, D. Gandhi, Y. Han, Y.L. Park, and A. Gupta, “The curious robot: Learning visual representations via physical interactions,” ECCV, 2016.
 [33] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation tech report,” CVPR, 2014.
 [34] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “Cnn features offtheshelf: an astounding baseline for recognition,” CVPR, 2014.
[35] S. Ross, G. Gordon, and A. Bagnell, "A reduction of imitation learning and structured prediction to no-regret online learning," arXiv preprint arXiv:1011.0686, 2010.
 [36] M. Wise, M. Ferguson, D. King, E. Diehr, and D. Dymesich, “Fetch & freight: Standard platforms for service robot applications,” Workshop on Autonomous Mobile Service Robots, IJCAI, 2016.
 [37] M. Dogar and S. Srinivasa, “Pushgrasping with dexterous hands: Mechanics and a method,” IROS, 2010.
 [38] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
 [39] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., “Tensorflow: Largescale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.