CAZSL: Zero-Shot Regression for Pushing Models by Generalizing Through Context

03/26/2020 ∙ by Wenyu Zhang, et al. ∙ MERL ∙ Cornell University

Learning accurate models of the physical world is required for many robotic manipulation tasks. However, during manipulation, robots are expected to interact with unknown workpieces, so building predictive models which can generalize over a number of these objects is highly desirable. In this paper, we study the problem of designing learning agents which can generalize their models of the physical world by building context-aware learning models. The purpose of these agents is to quickly adapt and/or generalize their notion of physics of interaction in the real world based on certain features of the interacting objects that provide different contexts to the predictive models. With this motivation, we present context-aware zero-shot learning (CAZSL, pronounced as 'casual') models, an approach utilizing a Siamese network architecture, embedding space masking and regularization based on context variables, which allows us to learn a model that can generalize to different parameters or features of the interacting objects. We test our proposed learning algorithm on the recently released Omnipush dataset, which allows testing of meta-learning capabilities using low-dimensional data.




I Introduction

Fig. 1: The proposed idea of learning context-aware zero-shot regression models in the paper. The context variables are the additional features which affect the interaction dynamics being considered. The goal is for the learning agent to generalize to different context variables using the proposed approach.

Designing learning agents that can reliably perform robotic manipulation tasks is challenging [32]. One of the reasons among many others is that robotic manipulation deals with a lot of challenging phenomena such as unilateral contacts, frictional contacts, impact, and deformation. These phenomena are challenging to understand or model even when considered individually, and manipulation requires considering several of these simultaneously. Consequently, it is difficult to either derive or learn precise models of interaction that can model different robotic manipulation tasks. Furthermore, robots are expected to interact with unknown workpieces so that building predictive models which can generalize over objects is highly desirable and of practical value [32]. For example, Figure 2 shows objects from the Omnipush dataset [4] where the pushing dynamics depend on the shape and mass distribution of the objects being pushed. While humans generalize effortlessly to variation in different physical properties of objects during interaction, it is difficult for robots to understand this generalization during interaction [3, 38, 34, 21].

Fig. 2: Three outlined top-down views of Omnipush objects with different shapes and weights. Red circles inside the object indicate the positions of weights. As explained in [4], the pushing dynamics depend on the mass and shape of the object. It is desirable that a learning agent can quickly adapt its notion of pushing interaction based on these attributes of the objects. These attributes of a novel object can be obtained from an auxiliary system (e.g., a vision system). Images are reproduced with permission from the authors of [4].

Learning accurate models of the physical world is a prerequisite for many model-based robotic manipulation tasks. The motivation of our work is to train general-purpose AI agents that can adapt their model of physical systems (e.g., interaction) using some extra features which can be easily obtained using auxiliary systems. For example, the interaction dynamics between two objects can depend on their mass, shape, size, etc. These features for a new object can, however, be easily estimated using state-of-the-art vision systems or can be encoded into state-representation features. The learning agents can then adapt their notion of the interaction physics based on these additional inputs. This is very similar to how humans adapt their model of objects based on some features that they can sense. Throughout the paper, we call these additional features context. These are analogous to parameters in classical modeling approaches. We propose zero-shot regression models, outlined pictorially in Figure 1, which are trained using neural networks while using these additional context variables. While the concepts of meta-learning and zero-shot learning are very popular in machine learning literature [12], they have not been widely studied in robotics [4].

Supervised deep learning models are increasingly popular to model complex relationships in physical systems [37, 2]. The advantage of deep learning models lies in their superior ability to learn complex, non-linear spatial and temporal behaviors through the choice of large network architectures, which can then be optimized using large amounts of data.

However, in real applications, we are often unable to collect comprehensive datasets that cover all possible contexts, states and actions. For instance, we may be able to conduct physical experiments with a range of initial conditions for data collection, but not be able to observe all possible initial conditions. Inductive biases typically allow deep learning models to generalize well to further samples collected under the same contexts. This renders such models suitable for applications with a finite and fixed set of contexts. However, they may fare poorly with out-of-distribution samples from unseen contexts due to the lack of ability to generalize across contexts [16], and hence need additional procedures such as context identifiers to correct for this [37].

Inspired by these problems, we present a context-aware zero-shot learning method, CAZSL, for learning predictive models that can generalize across object-dependent context variables. We present a novel combination of context-based mask and regularizer that augments model parameters based on contexts and constrains the zero-shot model to predict similar behavior based on similarity in contexts. This allows us to make accurate predictions on novel objects by adapting the model based on the newly available context. The results of the proposed CAZSL method are reported using the Omnipush dataset, which provides a diverse dataset with different objects for pushing dynamics. The proposed idea is presented pictorially in Figure 1, which illustrates CAZSL on the Omnipush dataset generalizing over the shape of objects for pushing. We demonstrate empirically that CAZSL improves performance or performs comparably to meta-learning and baseline methods in numerous scenarios.

Contributions: This paper makes the following contributions.

  1. We present a context-aware zero-shot learning (CAZSL) modeling approach with the motivation of building agents that can quickly adapt their notion of physics based on object-dependent context (analogous to parametric representation). We propose a novel combination of context mask and regularizer to constrain the model using similarities between contexts.

  2. We compare the proposed algorithm against several other methods on the recently released Omnipush dataset [4], providing new benchmark results for generalization.

Note that this paper only shows results for modeling using the proposed zero-shot learning approach. Use of the proposed models for model-based control is deferred to a future publication.

II Related Work

The work presented in this paper is mainly motivated by the goal of creating generalizable models for learning complex interaction dynamics. These kinds of physical interactions are common in many physical systems. Interaction between objects plays an especially big role in robotic manipulation, where a robot interacts with its environment using selective contacts [32]. Learning accurate predictive models of physical systems and interactions is a very active area of research in the robotics and machine learning communities [19].

Model learning has been studied extensively in both the machine learning and robotics communities. The goal of these techniques in robotic manipulation is to learn high-precision models of interaction of the robot with the physical world which can then be used for synthesizing controllers [36, 24, 13, 1, 11]. Among the possible ways to manipulate an object, pushing stands out as one of the most fundamental. As such, it has gained a lot of attention and has been extensively studied [31, 26, 27, 35]. However, creating reliable models for pushing requires good models of friction, contacts, etc., which remain poorly modeled in most state-of-the-art physics engines. As a result, many data-driven approaches have been proposed, based on either learning these interaction models entirely from data or augmenting them with prior physical knowledge [2, 18, 5]. However, the dynamics of pushing interaction are affected by the physical attributes of the object being pushed (e.g., its shape, mass, size, etc.). As a result, the models learned for a particular object may perform poorly on novel objects [4]. Motivated by this problem and to allow study of generalization to different objects, the Omnipush dataset was released recently [4]. We also draw motivation from this problem and present a technique that can generalize to different objects during a manipulation process and thus can be used to predict the interaction with different kinds of objects. With this goal, we propose a zero-shot regression technique that can generalize using contexts available from different objects. This paper focuses on evaluations through the Omnipush dataset, but we believe that the proposed method is general and can be used to study generalization over other interactions.

Zero-shot learning algorithms in machine learning are primarily focused on classification problems where either the target classes are rare or expensive to obtain, or the number of target classes is large [41]. These methods assume there is a finite number of classes and may not be easily transferable for use in regression settings. The common technique for zero-shot learning is to make use of auxiliary information or semantic representations, such as object attributes [22, 23, 42] and images [45], to assist learning a model that can generalize to unseen classes. The auxiliary information is usually embedded into a latent space, and regularization has been used to make the embedded representations for each class more separable [25]. We apply a novel regularization where the embedding function is learned according to the distance between contexts, thus maintaining an ordering where more similar contexts are embedded closer together. This allows for a continuous spectrum of contexts instead of a finite number of class prototypes.

III Background

In this section, we provide some background on the relevant learning approaches that will be referred to in the rest of the paper, for the benefit of readers not familiar with them.

III-A Siamese Networks

A Siamese neural network consists of two copies of a network, each of which takes its own input, and computes a distance metric between the two feature representations generated [7]. The parameters of the two copied networks are shared, so that inputs which are similar, according to an application-specific definition, result in a lower distance.

Siamese networks have been used for object tracking, one-shot image classification and image matching [6, 20, 33] and in robotics applications such as robotic surgery and indoor navigation [43, 44].
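The shared-parameter structure described above can be sketched in a few lines. This is only an illustration: the single tanh layer, the Euclidean metric, the function names `embed` and `siamese_distance`, and the weight values are all assumptions made for the example, not details from the paper.

```python
import math

def embed(x, weights):
    """One tanh layer; both branches call this with the same weights."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]

def siamese_distance(x_a, x_b, weights):
    """Euclidean distance between the embeddings of the two inputs."""
    e_a = embed(x_a, weights)
    e_b = embed(x_b, weights)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(e_a, e_b)))

# Hypothetical shared weights (2 inputs -> 2 features).
shared_weights = [[0.5, -0.2], [0.1, 0.3]]

d_same = siamese_distance([1.0, 2.0], [1.0, 2.0], shared_weights)
d_diff = siamese_distance([1.0, 2.0], [-1.0, 0.5], shared_weights)
```

Because both branches use `shared_weights`, identical inputs always map to distance zero, and training only has to shape one set of parameters.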

III-B Neural Network Masking

Deep neural networks are often highly over-parametrized [10], meaning that a large number of weights or layers are redundant and can be pruned [46]. In many pruning strategies, pruning of weights is performed using a binary mask [14, 46], and other works have shown that binary weights are sufficient for state-of-the-art accuracy [8, 9, 15, 17, 39]. By considering a fixed backbone network trained on one dataset, and training an additional set of mask weights on a second dataset, the resulting new network, whose weights are the elementwise product of the backbone weights and the mask as illustrated in Figure 3, has been found to achieve state-of-the-art performance on a second task [28, 29, 30]. Recent work has also shown that training a mask on an untrained, randomly initialized network achieves accuracies near state-of-the-art on image classification [46]. In all cases, a deep neural network, trained to specialize in a given task, can be augmented to perform additional tasks by applying a mask on its parameters.

Fig. 3: Overview of masking a deep neural network from [28]. The original weights of the network are updated by a learned mask through elementwise product to obtain a new set of weights for the network, allowing the network to specialize for a new set of inputs different from the original.
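To make the elementwise masking in Figure 3 concrete, the following sketch applies a binary mask to a fixed weight matrix. The helper name `apply_mask` and the numeric values are hypothetical; in the cited works the mask is learned, whereas here it is simply given.

```python
def apply_mask(weights, mask):
    """Elementwise product of a weight matrix and a mask of the same shape."""
    return [[w * m for w, m in zip(w_row, m_row)]
            for w_row, m_row in zip(weights, mask)]

# Fixed backbone weights (hypothetical values) and a binary mask.
backbone = [[0.8, -1.2, 0.3],
            [0.5, 0.0, -0.7]]
mask = [[1, 0, 1],
        [0, 1, 1]]

# New weights for the second task; the backbone itself is left untouched.
masked = apply_mask(backbone, mask)
```

Keeping the backbone fixed and storing only the mask per task is what makes this approach cheap when many tasks share one network.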

In this work, we utilize both the architecture of Siamese neural networks and masks to better incorporate object contexts into neural network models, such that existing models can better generalize over object properties.

IV Proposed CAZSL

Fig. 4: Proposed CAZSL model; (a) the CAZSL network is a Siamese network which ensures that the same object-push input pairs attain the same predicted state output, (b) regularization on the input context and context embedding mask enforces similar objects to have similar intermediate representations, (c) intermediate representations are altered by the context mask so that the network can guide its output based on knowledge of object properties, (d) the final predicted state is estimated from the masked intermediate representation which incorporates the context.

In this section, we describe the CAZSL method to effectively incorporate context information into the learning paradigm of neural network models. This allows the learning agents to adapt their model of a real-world physical system based on properties of the interacting objects such as mass and shape, and hence be able to generalize their predictions even towards new unseen objects. The proposed method uses a Siamese network and masking as shown pictorially in Figure 4. Regularization on the context inputs as well as the context mask embedding aims to enforce similar intermediate representations based on similarity in contexts. This idea is explained in more detail in the following text.

A general predictive model takes the form of

$$y_t = f_\theta(x_t)$$

for a deep neural network model $f$ parameterized by $\theta$. The inputs at time $t$ are observations $x_t$. The outputs are denoted $y_t$, which are the prediction targets at time $t$. However, the learned $\theta$ tends to be biased towards the available training samples, and the resulting model does not generalize well to out-of-distribution samples.

We learn the model in an end-to-end fashion to incorporate the ability to generalize to new objects through a novel combination of context mask with a regularization term. We propose learning

$$y_t = f_\theta^{2:L}\big(f_\theta^{1}(x_t) \odot m(c)\big)$$

where $f_\theta$ is the original $L$-layered deep neural network, with first layer $f_\theta^{1}$ and remaining layers $f_\theta^{2:L}$, and $m$ is an additional non-linear context mask which depends on the context $c$. The context mask is jointly learned by a neural network, and is applied as an elementwise product ($\odot$) on the activations from the first layer. The mask augments the embedded input based on context.
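A forward pass with a context mask applied elementwise to the first-layer activations might look like the sketch below. Several things here are assumptions chosen for brevity rather than details from the paper: the ReLU activation, the sigmoid mask network with a single layer, the collapse of the remaining layers into one linear layer, and all names and parameter values.

```python
import math

def linear(x, weights, bias):
    """Affine map: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def context_masked_forward(x, c, params):
    # First-layer activations of the prediction network (ReLU is an assumption).
    h = [max(0.0, a) for a in linear(x, params["W1"], params["b1"])]
    # Context mask from a small mask network (sigmoid is an assumption).
    m = [1.0 / (1.0 + math.exp(-a))
         for a in linear(c, params["Wm"], params["bm"])]
    # Elementwise product of activations and mask.
    h_masked = [hi * mi for hi, mi in zip(h, m)]
    # Remaining layers, collapsed to a single linear layer for brevity.
    return linear(h_masked, params["W2"], params["b2"])

# Hypothetical parameters: 2 inputs, 2 hidden units, 2 context features, 1 output.
params = {
    "W1": [[1.0, 0.0], [0.0, 1.0]], "b1": [0.0, 0.0],
    "Wm": [[1.0, 0.0], [0.0, 1.0]], "bm": [0.0, 0.0],
    "W2": [[1.0, 1.0]], "b2": [0.0],
}

y_a = context_masked_forward([1.0, 2.0], [0.0, 0.0], params)
y_b = context_masked_forward([1.0, 2.0], [10.0, 0.0], params)
```

The same observation produces different predictions under the two contexts, which is the mechanism that lets the model adapt to object properties without retraining.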

Further, we encourage learning a context mask $m$ such that

$$\|m(c_i) - m(c_j)\|_F \le \|m(c_i) - m(c_k)\|_F \quad \text{if} \quad d(c_i, c_j) \le d(c_i, c_k)$$

for contexts $c_i, c_j, c_k$, where $\|\cdot\|_F$ denotes the Frobenius norm, and $d$ is a suitably chosen distance function defining the magnitude of difference between two contexts. The key idea is that physical dynamics are more similar under more similar contexts. For instance, we would expect the objects in Figure 1(a) and 1(b) to behave more similarly to each other when pushed than those in Figure 1(a) and 1(c), since the first pair of objects shares three common sides while the second pair shares two. The constraint would allow the model to generalize to new out-of-distribution contexts not in the training set, by interpolating or extrapolating based on object attributes observed in the training set. We impose this constraint through a regularization component, which we refer to as context regularization, to be added to the prediction loss in the objective function.
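One hypothetical penalty with this behavior is sketched below: it is zero when the distance between two masks equals a scaled distance between their contexts and grows as the two drift apart. The functional form and the names `lam1` and `lam2` are assumptions made for illustration and should not be read as the paper's exact regularizer. Masks are represented as flat lists, for which the Frobenius norm reduces to the Euclidean norm.

```python
import math

def frobenius_distance(a, b):
    """Frobenius norm of (a - b) for masks stored as flat lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def context_regularizer(mask_i, mask_j, context_distance, lam1=1.0, lam2=1.0):
    """Assumed form: lam1 * | ||m_i - m_j||_F - lam2 * d(c_i, c_j) |."""
    gap = frobenius_distance(mask_i, mask_j) - lam2 * context_distance
    return lam1 * abs(gap)

# Masks that are 5.0 apart, paired with contexts that are 5.0 apart: no penalty.
matched = context_regularizer([0.0, 0.0], [3.0, 4.0], context_distance=5.0)
# Same masks, but contexts only 1.0 apart: the masks are "too far", penalty > 0.
mismatched = context_regularizer([0.0, 0.0], [3.0, 4.0], context_distance=1.0, lam1=0.5)
```

A penalty of this kind pushes mask distances to preserve the ordering of context distances, which is the constraint stated in the text.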

The twin network architecture of Siamese networks allows pairwise comparison of inputs. The network is trained through a Siamese network structure to optimize the objective function over pairs of inputs, dropping the time indices in the expression for simplicity. The complete loss function for a pair of inputs combines the context regularization with the prediction loss function.

Throughout our experiments, we model the prediction as a Gaussian density, and the prediction loss is the negative log likelihood of the prediction. Additionally, we consider two distance functions over the vectorized context inputs for our context regularizer:

  1. L2 regularization: Euclidean distance function $d(c_i, c_j) = \|c_i - c_j\|_2$ between contexts $c_i$ and $c_j$

  2. neural regularization: kernel distance function, computed on learned embeddings of the contexts

In the kernel distance function, the embedding is a two-layer fully-connected network, or a convolutional neural network which learns the spatial features of the context when the context is an image. Using L2 regularization is reasonable when the context variables are continuous, and neural regularization may be more advantageous when the relationships between the context variables are highly non-linear, as in many dynamical systems. Another benefit of the neural regularization is that the second regularization hyperparameter can be absorbed and learned; we fix it for all experiments with neural regularization, which is equivalent to not setting the second hyperparameter.
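The two kinds of distance can be contrasted with a small sketch. The Euclidean distance follows directly from its definition; for the neural variant, the code assumes the distance is the Euclidean distance between embeddings produced by a two-layer fully-connected network with a tanh hidden layer. That specific form, the names `embed` and `neural_distance`, and the weights are illustrative assumptions.

```python
import math

def l2_distance(c_i, c_j):
    """Euclidean distance between two vectorized contexts."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c_i, c_j)))

def embed(c, W1, W2):
    """Two-layer fully-connected embedding: tanh hidden layer, linear output."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, c))) for row in W1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W2]

def neural_distance(c_i, c_j, W1, W2):
    """Assumed form: Euclidean distance between embedded contexts."""
    return l2_distance(embed(c_i, W1, W2), embed(c_j, W1, W2))

# Hypothetical embedding weights (these would be learned in practice).
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[1.0, 1.0]]

d_l2 = l2_distance([0.0, 0.0], [3.0, 4.0])
d_neural_same = neural_distance([1.0, 0.0], [1.0, 0.0], W1, W2)
d_neural_diff = neural_distance([0.0, 0.0], [1.0, 0.0], W1, W2)
```

Because the embedding weights are trainable, any constant rescaling of the distance can be learned by the embedding itself, which is consistent with the remark that the second hyperparameter can be absorbed.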

V Experiments and Results

We present results to clarify, motivate and justify the use of the proposed CAZSL method for zero-shot learning. To do so, we perform a series of numerical experiments to answer the following questions.

  1. Is the inclusion of context helpful towards learning?

  2. Does CAZSL improve regression performance on out-of-distribution samples?

  3. How should the distance function in CAZSL be selected?

We evaluate our method on a simple regression task as well as six experiments using two contexts from the Omnipush dataset [4]. In the following subsections, we abbreviate competing methods evaluated as ANP (attentive neural process), FCN (fully-connected network), and FCN + CC (FCN with context concatenated to input). ANP is a meta-learning method that uses an attention mechanism on relevant context points for regression [16], and FCN is a 4-layer fully-connected neural network. These two methods are used in the Omnipush data-release paper [4] and they do not make use of context information. We apply our proposed context mask and regularization directly on FCN for easy comparison of their effects. We abbreviate variations of our proposed CAZSL method for ablation studies as FCN + CM (FCN with context mask), FCN + CM + L2Reg (FCN + CM with L2 context regularization), and FCN + CM + NeuralReg (FCN + CM with neural context regularization).

We point out that the FCN predicts a Gaussian density for each sample as defined by a mean $\mu$ and standard deviation $\sigma$. The mean parameter is evaluated by root-mean-square error (RMSE). We also report the standard deviation (STD) to give a sense of the prediction uncertainty. All values reported correspond to test performance with parameters from the last training epoch.
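For reference, the two quantities involved can be written out for scalar targets as follows. The Gaussian negative log likelihood is the standard closed form; the function names and the scalar, per-sample formulation are simplifications chosen for this sketch.

```python
import math

def gaussian_nll(y, mu, sigma):
    """Negative log likelihood of y under a Gaussian with mean mu and std sigma."""
    return 0.5 * math.log(2.0 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2.0 * sigma ** 2)

def rmse(targets, predicted_means):
    """Root-mean-square error of the predicted means."""
    squared = [(y - m) ** 2 for y, m in zip(targets, predicted_means)]
    return math.sqrt(sum(squared) / len(squared))

nll_at_mean = gaussian_nll(0.0, 0.0, 1.0)
nll_off_mean = gaussian_nll(2.0, 0.0, 1.0)
error = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

The likelihood is used for training because it also fits the predicted standard deviation, while RMSE evaluates only the mean.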

Hyperparameters: For the simple regression task in Section V-A, we train all models for epochs with the Adam optimizer using a learning rate of , and a batch size of . This configuration is sufficient for convergence due to the small size of the simulation dataset.

For experiments on the Omnipush dataset in Section V-B, we use the same configurations as in [4], that is, we train all models for a maximum of epochs with the Adam optimizer using a learning rate , and a batch size of . The ANP model in [4] is trained for epochs with warm-up step of whereas we train for epochs with warm-up step of to match the number of epochs of all other methods. Our replicated results of methods in [4] are comparable with the original results reported.

V-A Regression

We use a simple regression task to illustrate the effects of the proposed context mask and regularization on FCN. We simulate 1D Gaussian processes with the RBF kernel:

where controls the scale and is the bandwidth controlling how far the data can be extrapolated. The parameters are drawn uniformly at random as and . The training set consists of a total of 4000 samples extracted from Gaussian processes generated with 200 parameter sets. 20 samples are extracted per parameter set, and denoting as the observation at time , each sample has predictor which is a subsequence of 3 historical observations and response as the predictive target. The context variable is the kernel parameters . The test set consists of 400 out-of-distribution samples corresponding to 20 new parameter sets.
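A data-generation procedure of this kind could be sketched as follows. The kernel parameterization (squared scale times a Gaussian in the time difference), the time grid, the parameter values and all function names are assumptions for illustration; only the window length of 3 historical observations and the 20 samples per parameter set follow the description.

```python
import math
import random

def rbf_kernel(t1, t2, scale, bandwidth):
    """Assumed parameterization: scale^2 * exp(-(t1 - t2)^2 / (2 * bandwidth^2))."""
    return scale ** 2 * math.exp(-((t1 - t2) ** 2) / (2.0 * bandwidth ** 2))

def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def sample_gp(times, scale, bandwidth, rng, jitter=1e-6):
    """Draw one zero-mean GP path at the given times."""
    n = len(times)
    cov = [[rbf_kernel(times[i], times[j], scale, bandwidth) + (jitter if i == j else 0.0)
            for j in range(n)] for i in range(n)]
    low = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(n)]

def make_samples(series, history=3):
    """Pair each window of `history` observations with the next observation."""
    return [(series[i:i + history], series[i + history])
            for i in range(len(series) - history)]

rng = random.Random(0)
context = (1.0, 0.5)  # hypothetical (scale, bandwidth) for one parameter set
series = sample_gp([0.25 * i for i in range(23)], context[0], context[1], rng)
samples = make_samples(series)  # 23 observations give 20 (predictor, response) pairs
```

Each `(predictor, response)` pair would be stored together with `context`, so that test-time parameter sets outside the training range form out-of-distribution contexts.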

The simulation is repeated 10 times. We set hyperparameters and for the applicable models. We use a small degree of regularization since this regression task is relatively simple. From Table I, all variations of our proposed method outperform the baselines FCN and FCN + CC. We note that context concatenation has decreased accuracy while context masking has improved accuracy, which reflects the effectiveness of masking in the embedding space. The use of regularization further improves performance, and FCN + CM + L2Reg has the largest reduction of RMSE over FCN.

Method RMSE STD
FCN 0.108 0.131
FCN + CC 0.146 0.133
FCN + CM 0.105 0.121
FCN + CM + L2Reg 0.098 0.100
FCN + CM + NeuralReg 0.103 0.113
TABLE I: Average one-step prediction performance on out-of-distribution Gaussian process samples across 10 simulations.

V-B Omnipush Dataset

DataSet Description: The Omnipush dataset [4] collected 250 pushes per object for 250 objects on ABS surface (hard plastic). The data collection setup for pushing is shown in Figure 5. The objects are constructed to explore key factors that affect pushing – the shape of the object and its mass distribution – which have not been broadly explored in previous datasets and allow for study of generalization in model learning. Each side of the object has four possible shapes (concave, triangular, circular, rectangular) with three types of extra weights (0g, 60g, 150g). The triangular shape allows two positions (interior, exterior) for extra weights to be attached. A maximum of two weights are attached per object. We denote the shape and mass distribution of the objects as context, and experiment with two types of context variables:

  1. Indicator context: length–36 binary vector indicating the shape, extra weight and its position for each side

  2. Visual context: numerical array representing top-down view of object displayed in Figure 2.
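One way to arrive at a length-36 indicator vector is to use 9 bits per side (4 for shape, 3 for extra weight, 2 for weight position) over 4 sides. This layout, the bit ordering, and the names below are assumptions for illustration; the dataset's actual encoding may differ.

```python
SHAPES = ["concave", "triangular", "circular", "rectangular"]
WEIGHTS_G = [0, 60, 150]
POSITIONS = ["interior", "exterior"]

def encode_side(shape, weight_g=0, position=None):
    """9-bit indicator for one side: shape (4) + weight (3) + position (2)."""
    vec = [0] * 9
    vec[SHAPES.index(shape)] = 1
    vec[4 + WEIGHTS_G.index(weight_g)] = 1
    if position is not None:  # left as zeros when no position applies
        vec[7 + POSITIONS.index(position)] = 1
    return vec

def encode_object(sides):
    """Concatenate the indicators of the four sides into one context vector."""
    if len(sides) != 4:
        raise ValueError("an object is described by exactly four sides")
    context = []
    for side in sides:
        context.extend(encode_side(**side))
    return context

# Hypothetical object with two extra weights attached.
indicator_context = encode_object([
    {"shape": "triangular", "weight_g": 60, "position": "interior"},
    {"shape": "concave"},
    {"shape": "circular", "weight_g": 150},
    {"shape": "rectangular"},
])
```

Such a vector is sparse and binary, which is relevant later when comparing regularizers on indicator versus visual context.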

This allows us to test the generalization capability of our proposed CAZSL technique. The visual context is resized from an original image; we resize original images in [4] using the resize function with default parameters in scikit-image. The dataset further has 250 pushes per object for 10 objects on plywood surface. More details of the dataset can be found on the website and the corresponding paper [4].

Fig. 5: Omnipush dataset collection setup. The data is collected using an ABB industrial arm. The pusher is a steel rod attached to the end-effector of the arm. This steel arm interacts with the objects which are pushed during the experiments. The picture is reproduced with permission from [4].

The prediction task is to estimate the ending location and orientation of the object after being pushed. In data collection, the pusher is set to move at constant speed. Treating the location and angle of the object as the origin , the model input is containing the location and angle of the pusher with respect to the object. The model output is the 3-dimensional vector . To give a more intuitive representation of model accuracy, we convert RMSE to millimeters by multiplying it by 21.92mm, as done by the authors in [4].
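The unit conversion is a single multiplication, shown here with the factor of 21.92 mm stated above. The function and constant names are hypothetical.

```python
MM_PER_UNIT = 21.92  # conversion factor stated in the text

def rmse_to_mm(rmse):
    """Convert a normalized RMSE into millimeters."""
    return rmse * MM_PER_UNIT

fcn_mm = round(rmse_to_mm(0.330), 2)  # FCN, Different objects
anp_mm = round(rmse_to_mm(0.222), 2)  # ANP, Different objects
```

These reproduce the 7.23 mm and 4.87 mm entries reported for FCN and ANP in the Different objects setup. Some table entries differ in the last digit from what the rounded RMSE gives, presumably because the distances were computed from unrounded RMSE values.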

Experiment Setups: We use three setups to evaluate generalization performance of models across objects,

  1. Different objects: training and test objects have different characteristics, that is, the combination of shapes, weights and weight positions for four sides;

  2. Different surfaces: training objects are pushed on ABS surface and test objects are pushed on plywood;

  3. Different weights: training and test objects have a different number of extra weights attached.

The Different surfaces setup allows evaluation for generalization performance beyond the provided context since surface information is not provided during training. We note that some objects pushed on plywood do not have images provided, and hence we only use indicator context for this setup.

The Different weights setup is further split into three sub-setups. There are three possible numbers of weights per object, and we use the objects with each of the three options in turn as test objects, and the remaining objects as training objects.
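The hold-out scheme for the three sub-setups can be sketched as below. The record layout (`name`, `num_weights`) and the object names are hypothetical placeholders, not the dataset's actual schema.

```python
def split_by_weight_count(objects, test_count):
    """Hold out every object with `test_count` extra weights; train on the rest."""
    train = [o for o in objects if o["num_weights"] != test_count]
    test = [o for o in objects if o["num_weights"] == test_count]
    return train, test

# Hypothetical object records.
objects = [
    {"name": "obj_a", "num_weights": 0},
    {"name": "obj_b", "num_weights": 1},
    {"name": "obj_c", "num_weights": 2},
    {"name": "obj_d", "num_weights": 1},
]

# One (train, test) pair per sub-setup: 0, 1 and 2 weights held out in turn.
splits = {count: split_by_weight_count(objects, count) for count in (0, 1, 2)}
```

In each sub-setup the test objects carry a number of weights never seen during training, so the model has to extrapolate over mass distribution.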

For all experiments with indicator context, we set CAZSL hyperparameters and where applicable. For visual context, we set and where applicable. The smaller is to balance the higher dimensions of the visual context variables. When neural regularization is used, we treat which is equivalent to not having the second hyperparameter.

Results: From Tables II and III, CAZSL models consistently have improved performance over the baselines FCN and FCN + CC, reflecting that the inclusion of contextual information helps learning but that the context should be applied to the embedding space instead of directly concatenated in the observation space. The RMSE of ANP is consistently between 0.22 and 0.28. Since ANP is a meta-learning method which is aimed at object-generalization, we would expect its performance to be fairly consistent across setups. In comparison with ANP, our CAZSL models achieved better performance except in two sub-cases in Table III(a) for learning Different weights with indicator context. However, we note that our regularized approach outperforms the ANP in the 0 Weight test set of the Different weights experiment, indicating the ability to better generalize to unknown mass distributions. Moreover, with the use of visual contexts which contain more detailed contextual information, CAZSL models consistently outperform ANP.

Comparing the L2 and neural regularizations in CAZSL models, we see that the former has lower RMSE when indicator context is used. Since the indicator context is a sparse binary vector, the neural network used for its embedding in neural regularization is possibly over-parameterized, hence resulting in overfitting. When visual context is used, the performance difference between the two choices of regularization is marginal. The convolutional neural network used for context embedding is able to extract spatial features possibly relating to the object geometry and mass distribution, and hence the kernel distance function learned is able to better discern differences between visual contexts. These results suggest that neural regularization is more suitable for complex or high-dimensional context variables.

In summary, in all experiments, we find that our CAZSL models outperform baseline counterparts which do not implement context masking and regularization, and perform comparably or better than the ANP meta-learning baseline. We also observe that using indicator contexts improves performance over using no context most of the time, and using visual contexts improves performance over using indicator contexts or no context in all experiments. This suggests that increasing details in contextual information can be utilized to help learning.

Different objects
Indicator context Visual context
RMSE STD Dist. (mm) RMSE STD Dist. (mm)
ANP 0.222 0.079 4.87 0.222 0.079 4.87
FCN 0.330 0.149 7.23 0.330 0.149 7.23
FCN + CC 0.224 0.040 4.91 0.205 0.063 4.49
FCN + CM 0.210 0.043 4.60 0.193 0.039 4.23
FCN + CM + L2Reg 0.205 0.029 4.49 0.193 0.060 4.23
FCN + CM + NeuralReg 0.220 0.037 4.82 0.193 0.055 4.23
(a) Different objects: Training and test samples are from objects with different characteristics i.e. shape and mass distribution.
Different surfaces: Indicator context
RMSE STD Dist. (mm)
ANP 0.271 0.064 5.94
FCN 0.328 0.154 7.19
FCN + CC 0.264 0.046 5.79
FCN + CM 0.257 0.036 5.63
FCN + CM + L2Reg 0.260 0.035 5.70
FCN + CM + NeuralReg 0.263 0.045 5.76
(b) Different surfaces: Training samples are from objects pushed on ABS surface (hard plastic), and test samples are from objects pushed on plywood surface. Some test objects have different characteristics from training objects.
TABLE II: Method performance under multiple setups where test samples have out-of-distribution properties. Setups same as in [4].
Different weights: Indicator context
0 Weight 1 Weight 2 Weights
RMSE STD Dist. (mm) RMSE STD Dist. (mm) RMSE STD Dist. (mm)
ANP 0.252 0.079 5.52 0.250 0.070 5.48 0.242 0.079 5.30
FCN 0.327 0.144 7.17 0.356 0.127 7.80 0.329 0.163 7.21
FCN + CC 0.258 0.073 5.66 0.359 0.049 7.87 0.331 0.043 7.26
FCN + CM 0.235 0.038 5.15 0.267 0.033 5.85 0.266 0.030 5.83
FCN + CM + L2Reg 0.227 0.040 4.98 0.257 0.034 5.63 0.272 0.033 5.96
FCN + CM + NeuralReg 0.254 0.039 5.57 0.294 0.034 6.44 0.294 0.036 6.44
(a) Different weights: Indicator context.
Different weights: Visual context
0 Weight 1 Weight 2 Weights
RMSE STD Dist. (mm) RMSE STD Dist. (mm) RMSE STD Dist. (mm)
ANP 0.252 0.079 5.52 0.250 0.070 5.48 0.242 0.079 5.30
FCN 0.327 0.144 7.17 0.356 0.127 7.80 0.329 0.163 7.21
FCN + CC 0.239 0.044 5.15 0.282 0.044 6.18 0.268 0.061 5.87
FCN + CM 0.222 0.034 4.89 0.230 0.032 5.04 0.219 0.039 4.80
FCN + CM + L2Reg 0.209 0.079 4.60 0.220 0.051 4.82 0.218 0.079 4.78
FCN + CM + NeuralReg 0.209 0.064 4.56 0.222 0.050 4.87 0.218 0.054 4.78
(b) Different weights: Visual context.
TABLE III: Method performance when training and test samples are from objects with different numbers of weights attached, out of options from 0 to 2. For example, the heading ‘0 Weight’ means that training samples are from objects with 1 or 2 weights attached, and test samples are from objects that have no weight attached. Setup not in [4].

VI Conclusion and Future Work

Robotic manipulation is hard to model because the interaction dynamics are affected by complex phenomena such as dry friction, contacts, and impacts. Furthermore, robots are often expected to work with unknown workpieces. As such, it is challenging to create models that can predict these interactions accurately over a diverse range of objects with different physical attributes. We present a zero-shot learning method, CAZSL, which allows us to explicitly consider the physical attributes of different objects so that the predictive model can then be easily adapted to a novel object. We introduced a novel combination of context mask and regularization that augments model parameters based on contexts and constrains the model to predict similar behavior for objects with similar physical attributes. We tested our CAZSL models on the recently released Omnipush dataset. We demonstrate empirically that CAZSL improves performance or performs comparably to meta-learning and object-independent baselines in numerous scenarios.

In the future, we would like to further develop the algorithm and test it on much larger and more diverse interaction datasets. We would like to further investigate the proposed method for multi-step predictive error so that it could be evaluated for control of modeled interactions. Similarly, it would be interesting to test the proposed method for prediction in other physical domains [40, 47].


  • [1] P. Agrawal, A. V. Nair, P. Abbeel, J. Malik, and S. Levine (2016) Learning to poke by poking: experiential learning of intuitive physics. In Advances in neural information processing systems, pp. 5074–5082. Cited by: §II.
  • [2] A. Ajay, J. Wu, N. Fazeli, M. Bauza, L. P. Kaelbling, J. B. Tenenbaum, and A. Rodriguez (2018) Augmenting physical simulators with stochastic neural networks: case study of planar pushing and bouncing. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3066–3073. Cited by: §I, §II.
  • [3] P. W. Battaglia, J. B. Hamrick, and J. B. Tenenbaum (2013) Simulation as an engine of physical scene understanding. Proceedings of the National Academy of Sciences 110 (45), pp. 18327–18332 (en). External Links: ISSN 0027-8424, 1091-6490. Cited by: §I.
  • [4] M. Bauza, F. Alet, Y. Lin, T. Lozano-Perez, L. Kaelbling, P. Isola, and A. Rodriguez (2019-10) Omnipush: accurate, diverse, real-world dataset of pushing dynamics with rgb-d video. arXiv. Cited by: Fig. 2, item 2, §I, §I, §II, Fig. 5, §V-B, §V-B, §V-B, TABLE II, TABLE III, §V, §V, footnote 1.
  • [5] M. Bauza and A. Rodriguez (2017) A probabilistic data-driven model for planar pushing. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 3008–3015. Cited by: §II.
  • [6] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr (2016) Fully-convolutional siamese networks for object tracking. In European conference on computer vision, pp. 850–865. Cited by: §III-A.
  • [7] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah (1994) Signature verification using a "siamese" time delay neural network. In Advances in neural information processing systems, pp. 737–744. Cited by: §III-A.
  • [8] Z. Cheng, D. Soudry, Z. Mao, and Z. Lan (2015) Training binary multilayer neural networks for image classification using expectation backpropagation. arXiv preprint arXiv:1503.03562. Cited by: §III-B.
  • [9] M. Courbariaux, Y. Bengio, and J. David (2015) Binaryconnect: training deep neural networks with binary weights during propagations. In Advances in neural information processing systems, pp. 3123–3131. Cited by: §III-B.
  • [10] Y. N. Dauphin and Y. Bengio (2013) Big neural networks waste capacity. arXiv preprint arXiv:1301.3583. Cited by: §III-B.
  • [11] N. Fazeli, M. Oller, J. Wu, Z. Wu, J. B. Tenenbaum, and A. Rodriguez (2019) See, feel, act: hierarchical learning for complex manipulation skills with multisensory fusion. Science Robotics 4 (26). Cited by: §II.
  • [12] C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126–1135. Cited by: §I.
  • [13] C. Finn and S. Levine (2017) Deep visual foresight for planning robot motion. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 2786–2793. Cited by: §II.
  • [14] J. Frankle and M. Carbin (2018) The lottery ticket hypothesis: finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635. Cited by: §III-B.
  • [15] K. Hwang and W. Sung (2014) Fixed-point feedforward deep neural network design using weights +1, 0, and -1. In 2014 IEEE Workshop on Signal Processing Systems (SiPS), pp. 1–6. Cited by: §III-B.
  • [16] H. Kim, A. Mnih, J. Schwarz, M. Garnelo, A. Eslami, D. Rosenbaum, O. Vinyals, and Y. W. Teh (2019) Attentive neural processes. arXiv preprint arXiv:1901.05761. Cited by: §I, §V.
  • [17] J. Kim, K. Hwang, and W. Sung (2014) X1000 real-time phoneme recognition vlsi using feed-forward deep neural networks. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7510–7514. Cited by: §III-B.
  • [18] A. Kloss, S. Schaal, and J. Bohg (2017) Combining learned and analytical models for predicting action effects. arXiv preprint arXiv:1710.04102. Cited by: §II.
  • [19] J. Kober, J. A. Bagnell, and J. Peters (2013) Reinforcement learning in robotics: a survey. The International Journal of Robotics Research 32 (11), pp. 1238–1274. Cited by: §II.
  • [20] G. Koch, R. Zemel, and R. Salakhutdinov (2015) Siamese neural networks for one-shot image recognition. In ICML deep learning workshop, Vol. 2. Cited by: §III-A.
  • [21] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum (2015) Human-level concept learning through probabilistic program induction. Science 350 (6266), pp. 1332–1338. Cited by: §I.
  • [22] C. H. Lampert, H. Nickisch, and S. Harmeling (2009-06) Learning to detect unseen object classes by between-class attribute transfer. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 951–958. Cited by: §II.
  • [23] C. H. Lampert, H. Nickisch, and S. Harmeling (2013) Attribute-based classification for zero-shot visual object categorization. IEEE transactions on pattern analysis and machine intelligence 36 (3), pp. 453–465. Cited by: §II.
  • [24] A. D. Libera, D. Romeres, D. K. Jha, B. Yerazunis, and D. Nikovski (2020) Model-based reinforcement learning for physical systems without velocity and acceleration measurements. arXiv preprint arXiv:2002.10621. Cited by: §II.
  • [25] C. Luo, Z. Li, K. Huang, J. Feng, and M. Wang (2018-02) Zero-shot learning via attribute regression and class prototype rectification. IEEE Transactions on Image Processing 27 (2), pp. 637–648. Cited by: §II.
  • [26] K. M. Lynch, H. Maekawa, and K. Tanie (1992) Manipulation and active sensing by pushing using tactile feedback.. In IROS, Vol. 1. Cited by: §II.
  • [27] K. M. Lynch and M. T. Mason (1996) Stable pushing: mechanics, controllability, and planning. The international journal of robotics research 15 (6), pp. 533–556. Cited by: §II.
  • [28] A. Mallya, D. Davis, and S. Lazebnik (2018) Piggyback: adapting a single network to multiple tasks by learning to mask weights. In ECCV, Cited by: Fig. 3, §III-B.
  • [29] M. Mancini, E. Ricci, B. Caputo, and S. Rota Bulò (2018) Adding new tasks to a single network with weight transformations using binary masks. In Proceedings of the European Conference on Computer Vision (ECCV). Cited by: §III-B.
  • [30] M. Masana, T. Tuytelaars, and J. van de Weijer (2020) Ternary feature masks: continual learning without any forgetting. arXiv preprint arXiv:2001.08714. Cited by: §III-B.
  • [31] M. T. Mason (1986) Mechanics and planning of manipulator pushing operations. The International Journal of Robotics Research 5 (3), pp. 53–71. Cited by: §II.
  • [32] M. T. Mason (2018) Toward robotic manipulation. Annual Review of Control, Robotics, and Autonomous Systems 1, pp. 1–28. Cited by: §I, §II.
  • [33] I. Melekhov, J. Kannala, and E. Rahtu (2016) Siamese network features for image matching. In 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 378–383. Cited by: §III-A.
  • [34] F. Osiurak and D. Heinke (2018) Looking for intoolligence: A unified framework for the cognitive study of human tool use and technology. American Psychologist 73 (2), pp. 169–185 (English). External Links: ISSN 0003-066X. Cited by: §I.
  • [35] M. A. Peshkin and A. C. Sanderson (1988-12) The motion of a pushed, sliding workpiece. IEEE Journal on Robotics and Automation 4 (6), pp. 569–598. External Links: ISSN 2374-8710. Cited by: §II.
  • [36] D. Romeres, D. K. Jha, A. DallaLibera, B. Yerazunis, and D. Nikovski (2019-05) Semiparametrical gaussian processes learning of forward dynamical models for navigating in a circular maze. In 2019 International Conference on Robotics and Automation (ICRA), pp. 3195–3202. External Links: ISSN 1050-4729. Cited by: §II.
  • [37] A. Sanchez-Gonzalez, N. M. O. Heess, J. T. Springenberg, J. Merel, M. A. Riedmiller, R. Hadsell, and P. W. Battaglia (2018) Graph networks as learnable physics engines for inference and control. In ICML, Cited by: §I, §I.
  • [38] K. A. Smith, P. W. Battaglia, and E. Vul (2018) Different Physical Intuitions Exist Between Tasks, Not Domains. Computational Brain & Behavior 1 (2), pp. 101–118 (en). External Links: ISSN 2522-0861, 2522-087X. Cited by: §I.
  • [39] D. Soudry, I. Hubara, and R. Meir (2014) Expectation backpropagation: parameter-free training of multilayer neural networks with continuous or discrete weights. In Advances in Neural Information Processing Systems, pp. 963–971. Cited by: §III-B.
  • [40] M. H. Tahersima, K. Kojima, T. Koike-Akino, D. Jha, B. Wang, C. Lin, and K. Parsons (2019) Deep neural network inverse design of integrated photonic power splitters. Scientific reports 9 (1), pp. 1–9. Cited by: §VI.
  • [41] W. Wang, V. W. Zheng, H. Yu, and C. Miao (2019) A survey of zero-shot learning: settings, methods, and applications. ACM TIST 10, pp. 13:1–13:37. Cited by: §II.
  • [42] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona (2010) Caltech-ucsd birds 200. California Institute of Technology. Cited by: §II.
  • [43] M. Ye, E. Johns, A. Handa, L. Zhang, P. Pratt, and G. Yang (2017) Self-supervised siamese learning on stereo image pairs for depth estimation in robotic surgery. arXiv preprint arXiv:1705.08260. Cited by: §III-A.
  • [44] Y. Yeboah, C. Yanguang, W. Wu, and S. He (2018) Autonomous indoor robot navigation via siamese deep convolutional neural network. In Proceedings of the 2018 International Conference on Artificial Intelligence and Pattern Recognition, pp. 113–119. Cited by: §III-A.
  • [45] E. Zablocki, P. Bordes, B. Piwowarski, L. Soulier, and P. Gallinari (2019-06) Context-Aware Zero-Shot Learning for Object Recognition. In Thirty-sixth International Conference on Machine Learning (ICML), Long Beach, CA, United States. Cited by: §II.
  • [46] H. Zhou, J. Lan, R. Liu, and J. Yosinski (2019) Deconstructing lottery tickets: zeros, signs, and the supermask. In Advances in Neural Information Processing Systems, pp. 3592–3602. Cited by: §III-B.
  • [47] Y. Zhu and N. Zabaras (2018) Bayesian deep convolutional encoder–decoder networks for surrogate modeling and uncertainty quantification. Journal of Computational Physics 366, pp. 415–447. Cited by: §VI.