The State of Service Robots: Current Bottlenecks in Object Perception and Manipulation

by S. Hamidreza Kasaei, et al.
University of Groningen

Service robots are appearing more and more in our daily life. The development of service robots combines multiple fields of research, from object perception to object manipulation. The state of the art continues to improve the coupling between object perception and manipulation. This coupling is necessary for service robots not only to perform various tasks in a reasonable amount of time but also to adapt to new environments over time and interact safely with non-expert human users. Nowadays, robots are able to recognize various objects and quickly plan a collision-free trajectory to grasp a target object. Despite these successes, such robots must be painstakingly coded in advance to perform a set of predefined tasks, and in most cases they rely on large amounts of training data. As a result, the knowledge of such robots is fixed after the training phase, and any changes in the environment require complicated, time-consuming, and expensive robot re-programming by human experts. These approaches are therefore still too rigid for real-life applications in unstructured environments, where a significant portion of the environment is unknown and cannot be directly sensed or controlled. In this paper, we review advances in service robots from object perception to complex object manipulation and shed light on the current challenges and bottlenecks.




I Introduction

The development of service robots, used for any domestic or service task, is ongoing and interest is growing. On the one hand, there is an increase in supply, with ever more efficient and widely applicable robots. On the other hand, there is an increase in demand, driven by the interest of service industries in automation and by an ageing population [1], which is a significant challenge for many countries. According to a recent study [1], the share of Europeans aged 80-plus is set to rise from 4.9% in 2016 to 13% in 2070. The old-age dependency ratio of the European population (i.e., people aged 65 or above relative to those aged 15-64) was 29.6% in 2016 and is projected to reach 51.2% by 2070, meaning that for every person of retirement age there will be fewer than two people of working age. This significant demographic change poses several challenges, since the population of caregivers is shrinking while the number of people needing care is growing. This imbalance between demand and supply leads to a big gap in the workforce and, therefore, calls for developing service robots to help us overcome this ever-increasing problem.
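To make the last figure concrete, the number of working-age people per person aged 65 or above is the reciprocal of the old-age dependency ratio. A short worked calculation, using only the two percentages quoted above:

```latex
\[
\frac{1}{0.296} \approx 3.38 \quad \text{(2016)}, \qquad
\frac{1}{0.512} \approx 1.95 \quad \text{(2070, projected)}.
\]
```

That is, roughly 3.4 working-age people per person aged 65 or above in 2016, against fewer than two projected for 2070.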

There is a wide diversity of tasks that a service robot can be used for, such as setting a table for a meal, clearing a table after eating a meal, serving a drink, helping people carry groceries [2], etc. Most of these household tasks can be decomposed into detecting an object, driving the robot’s arm to a desired pose, and manipulating an object. In other words, a robot needs to know which kinds of objects exist in a scene, where they are, and how to grasp and manipulate objects in different situations to operate in human-centric domains. These tasks are of a high complexity and consist of several sub-tasks that need to be performed sequentially or simultaneously to accomplish a certain goal. As shown in Fig. 1, we have categorized these sub-tasks into three core components, including:

Fig. 1: Categorization of sub-tasks of a service robot into three core components, including: (i) object perception and perceptual learning, (ii) object grasping and manipulation, (iii) memory management.
  • Object perception and perceptual learning: A service robot may sense the world through different modalities. The perception system provides important information that the robot has to use for interacting with users and environments. For instance, a robot needs to know which kinds of objects exist in a scene and where they are. Besides, learning mechanisms allow incremental and open-ended learning. A service robot must also update its models over time with limited computational resources.

  • Object grasping and manipulation: A service robot must be able to grasp and manipulate objects in different situations to interact with the environment as well as with human users. It is worth mentioning that object manipulation mainly happens in a completely reactive manner, with the agent selecting one or more primitive actions on each decision cycle, executing them, and repeating the process on the next cycle. This approach is associated with closed-loop strategies for execution, since the agent can also sense the environment on each time step.

  • Memory management: A service robot working in an open-ended domain should include experience management mechanisms, such as salience and forgetting, to prevent the accumulation of examples in memory. Otherwise, the memory consumption and the time required both to update the models and to recognize new objects would grow without bound.
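As an illustration of the memory management component, the following is a minimal sketch of a bounded instance memory with salience and forgetting. It is not taken from any of the reviewed systems: the class names, the `decay` rate and the eviction rule (drop the least salient instance once `capacity` is exceeded) are all assumptions chosen for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    label: str              # category name taught by the user
    features: tuple         # object representation (hypothetical)
    salience: float = 1.0   # how "important" this experience currently is


@dataclass
class BoundedMemory:
    capacity: int                   # maximum number of stored instances
    decay: float = 0.9              # assumed forgetting rate per new experience
    items: list = field(default_factory=list)

    def store(self, instance: Instance) -> None:
        # Forgetting: every stored experience loses some salience over time.
        for item in self.items:
            item.salience *= self.decay
        self.items.append(instance)
        # Keep memory bounded by discarding the least salient experience.
        if len(self.items) > self.capacity:
            weakest = min(self.items, key=lambda it: it.salience)
            self.items.remove(weakest)

    def reinforce(self, instance: Instance, boost: float = 1.0) -> None:
        # Salience: experiences that proved useful are protected from eviction.
        instance.salience += boost
```

With such a rule the memory footprint stays constant no matter how long the robot operates, at the price of occasionally discarding an instance that would have been useful later.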

Fig. 2: Abstract architectures for hybrid reactive-deliberative robots with a single memory system [3].

These core components are tightly coupled through a software architecture. This coupling is necessary for service robots, not only to perform object perception and manipulation tasks in a reasonable amount of time, but also to robustly adapt to new environments by handling new objects. A service robot should process very different types of information on varying time scales. Two different modes of processing, generally labelled System 1 and System 2, are commonly accepted in cognitive psychology [4] and are mainly used as the software architecture of a service robot. The operations of System 1 (i.e., perception and action) are typically fast, automatic, reactive and intuitive. The operations of System 2 (i.e., semantic) are slow, deliberative and analytic. In this review paper, we mainly focus on object perception and manipulation tasks in service robots with the distinctive characteristics of System 1. The abstract system architecture is depicted in Fig. 2. In this architecture, a Perception component processes all momentary information coming from sensors, including sensors that capture the actions and utterances of the user. A Reasoning component updates the world model and determines plans to achieve goals. An Action component reactively dispatches and monitors the execution of actions, taking into account the current plans and goals. Finally, a Learning component, which typically runs in the background, analyzes the trace of foreground activities recorded in a Memory component and extracts and conceptualizes possibly interesting experiences. The resulting conceptualizations are stored back in memory. It is worthwhile to mention that each component in such an abstract architecture is usually decomposed into a set of software modules.
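The data flow between these components can be sketched in a few lines. This is a toy illustration under strong assumptions, not the architecture of [3]: every function name, the dictionary-based world model and the "count how often each object was seen" learning step are hypothetical placeholders for what would be full software modules on a real robot.

```python
def perceive(sensor_frame):
    """Perception: turn a raw sensor frame into symbolic percepts."""
    return [{"object": name, "position": pos} for name, pos in sensor_frame.items()]


def reason(world_model, percepts, goal):
    """Reasoning: update the world model, then plan towards the goal."""
    for percept in percepts:
        world_model[percept["object"]] = percept["position"]
    if goal in world_model:
        return [("move_to", world_model[goal]), ("grasp", goal)]
    return [("explore", None)]


def act(plan):
    """Action: dispatch each step and report its (here always successful) outcome."""
    return [(step, "ok") for step in plan]


def learn(memory):
    """Learning: background pass that conceptualizes the recorded trace."""
    counts = {}
    for record in memory["trace"]:
        for percept in record["percepts"]:
            counts[percept["object"]] = counts.get(percept["object"], 0) + 1
    memory["concepts"] = counts  # conceptualizations are stored back in memory


def run_cycle(sensor_frame, goal, world_model, memory):
    """One foreground cycle: perceive, reason, act, and record the trace."""
    percepts = perceive(sensor_frame)
    plan = reason(world_model, percepts, goal)
    outcome = act(plan)
    memory["trace"].append({"percepts": percepts, "plan": plan, "outcome": outcome})
    return outcome
```

The point of the sketch is the separation of concerns: `run_cycle` is the fast foreground loop, while `learn` can be called at any later time because it only reads the trace kept in memory.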

This review of current state-of-the-art service robots takes a closer look at each piece in the pipeline of a service robot performing higher-order tasks. Each core component is reviewed based on recent works, describing the current state of the art and identifying possible areas where improvements can still be made. The remainder of this paper is organized as follows. In Section II, we review a set of state-of-the-art assistive and service robots. Afterward, in Sections III, IV, V and VI, the state-of-the-art research on each core component is described with both positive developments and unsolved issues. Finally, in Section VII, conclusions are presented and future work is discussed.

II Assistive and Service Robots

While a lot of developments have been made recently in each of the mentioned core components and in the field of service robotics as a whole, robotic servants do not yet live among us, helping us in our daily tasks. We believe that the underlying reason is that robots are usually painstakingly coded and trained extensively in advance to perform object perception and manipulation tasks in the right way. Therefore, the knowledge of such robots is fixed after the training phase, and any changes in the environment require complicated, time-consuming, and expensive robot re-programming by expert users. Although an exhaustive survey of assistive robotics is beyond the scope of this paper, representative works will be reviewed in this section.

Fig. 3: An illustrative example of a “make_a_salad” task with a PR2 robot: In this scenario, the robot faces a new object while making a salad. A user then teaches the “Plate” category to the robot and demonstrates a feasible grasp for the object. This way, the robot conceptualizes new concepts and adapts its perceptual motor skills over time to different tasks.

Over the past decade, several projects have been conducted to develop robots to assist people in daily tasks. Most state-of-the-art service robots use classical object category learning and recognition approaches (i.e. offline training and online testing are two separate phases), where open-ended object category learning is generally ignored. Therefore, they work well for specific tasks, where there are limited and predictable sets of objects, and fail at any other assignment. In other words, the perceptual knowledge of these robots is static and they are unable to adapt to dynamic environments. Examples of such service robots that have demonstrated perception and action coupling include ARMEN [5], El-E [6], Busboy [7], TUM Rosie [8], TORO [9], Walk-Man [10], ARMAR [11], and DLR’s Rollin’Justin [12].

In the ARMEN project, Leroux et al. [5] proposed a mobile assistive robotics approach providing advanced functions to help care for elderly or disabled people at home. This project mainly involves object manipulation, knowledge representation and object recognition. The authors also developed an interface to facilitate the communication between the user and the robot. Jain et al. [6] presented an assistive mobile manipulator, named El-E, that can autonomously pick objects from a flat surface and deliver them to the user. The user indicates the object to be grasped by pointing at it with a laser pointer.

In another work, a busboy assistive robot was developed by Srinivasa et al. [7]. In particular, they proposed a multi-robot assistive system, consisting of a Segway mobile robot with a tray and a stationary Barrett WAM robotic arm. The Segway robot navigates through the environment and collects empty mugs from people. Then, it delivers the mugs to a predefined position near the Barrett arm. Afterwards, the arm detects the mugs on the tray, picks them up and loads them into a dishwasher rack. The vision system of Busboy is designed for detecting a single object type (mugs). Because there is only this single object type, the set of grasp points was computed off-line.

The work on two robots cooperating to solve a pancake making task [8] shows the feasibility of having one or multiple robots perform a task they weren’t specifically programmed for. In this experiment, it is shown that by giving a robot information on objects and low-level tasks, used in conjunction with a task planning module, these robots can make their own order of operations from a set of instructions downloaded from the internet. The instructions are vague and directed towards humans with some prior knowledge of the objects and operations involved. When an instruction mentions preheating the pancake maker, for example, this could mean turning it on, or it could also require plugging in the power cord. A robot making its own order of operations will need to take all these differences into account. This opens up routes for further dynamic task resolution, giving a robot the agency to solve a wide variety of problems, in contrast with most robotic projects, which work to solve a small range of tasks which are specifically defined for that purpose.

Another interesting robot which has been in development for a while is TORO [9], a continuation of the DLR’s Rollin’ Justin [12]. This robot was developed to explore the possibilities of torque-controlled joints in humanoid robots. While several forms of torque control are specified, the electrical drive units with torque measurements are argued to be most effective for service robots in a household setting. The torque feedback from limbs while moving gives the robot an additional layer of safety when interacting with humans as any movement can be adjusted not just by input from cameras, but also from the limbs themselves.

WALK-MAN [10] was developed by combining both actuators and torque to balance passive and active adaptation to the environment. This robot was designed in response to a call for disaster relief robots and, to perform well in these hectic environments, focuses on being able to move effectively in very rough terrain. In this work, it is argued that, while quadrupedal or wheeled robots are often more stable, they are incapable of properly traversing significantly imbalanced terrain. To further diminish the effect of uneven terrain, the combination of active and passive adaptation was developed. This resulted in a very robust bipedal robot.

While several of these robots have been developed for work in a human environment, ARMAR [11] is specifically developed for collaboration with humans. The focus of this robot is on applicability in a human workplace and working side by side with humans. To achieve this, the reach and carrying capacity have been developed significantly. ARMAR has a reach of 1.3 m and can carry 10 kg in one arm, even at maximum reach. Most importantly, ARMAR has been developed with the ability to detect human behaviour in an attempt to recognize when a human is in need of help. This is done through pose-tracking of the human, which can be combined with speech commands, giving the robot information on when and how to help. It can also estimate the task a human is doing and determine whether it is a task normally done by multiple people. If so, the robot can step in as the second person.

In order for robots to be useful day to day, they need to understand and make sense of the spaces where we live and work, and adapt to new environments by gaining more and more experiences. In such dynamic domains, robots are expected to increasingly interact and collaborate closely with humans. This requires new forms of machine intelligence. Towards this goal, open-ended learning must be properly understood and addressed to better integrate robots into our society. In human cognition, learning is closely related to memory. Wood et al. [13] presented a thorough review and discussion on memory systems in animals as well as artificial agents, having in mind further developments in artificial intelligence and cognitive science. Cognitive science also revealed that humans learn to recognize object categories and grasp affordances ceaselessly over time [14, 15]. This ability allows adapting to new environments by enhancing knowledge from the accumulation and conceptualization of new experiences. Inspired by this theory, service robots should approach 3D object recognition and manipulation from a long-term perspective and with emphasis on domain open-endedness. Moreover, we have to incorporate intermittent robot-teacher interaction, which is an outstanding feature in human learning [16]. This way, non-expert users will be able to correct unexpected actions of the robot and quickly guide the robot toward target behaviors. For example, consider the robotic “make_a_salad” task as depicted in Fig. 3. In this example, the robot faces a new object while making a salad. The robot then asks a user to teach the category of the object and demonstrate how to grasp it. Such situations provide opportunities to collect training instances from online experiences, and the robot can incrementally update its knowledge rather than retraining from scratch when a new task is introduced or a new category is added.

To achieve this, several cognitive robotics groups have started to explore how to learn incrementally from past experiences and human interactions to achieve adaptability. Besides, since service robots receive a continuous stream of data, several methods have been introduced that use open-ended learning for object perception. In [17], a system with similar goals is described. Faulhammer et al. [18] presented a perception system that allows a mobile robot to autonomously detect, model, and re-recognize objects in everyday environments. They only considered isolated object scenarios. Actual human living environments, however, can be very cluttered, which is why Srinivasa et al. [19] introduced HERB, an autonomous mobile manipulator that can navigate through a dynamic and highly cluttered environment. Furthermore, it is able to search for and recognize objects in this clutter and to manipulate a large range of objects, including constrained objects like doors. The main drawback of HERB is that its error recovery is completely hand-coded and does not interact sufficiently with other modules (e.g., the planning modules). As a result, HERB can become stuck in cycles in which, after recovering from an error, it keeps performing the behaviour that led to the same error over and over again.

In the RACE project (Robustness by Autonomous Competence Enhancement) [20, 21], a PR2 robot demonstrated effective capabilities in a restaurant scenario, including the ability to serve a coffee, set a table for a meal and clear a table. The aim of RACE was to develop a cognitive system, embodied by a service robot, which enabled the robot to build a high-level understanding of the world by storing and exploiting appropriate memories of its experiences. The X company also recently launched a similar project, called the Everyday Robot project. The main goal of this project is to develop a general-purpose learning robot that can safely operate in human environments, where things change every day, people show up unexpectedly, and obstacles appear out of nowhere.

III Object perception and perceptual learning

Object perception is one of the most researched subjects in the field of robotics and computer vision. It comprises a wide range of different tasks which, over the last decades, have increasingly focused on (deep) neural network approaches. This has largely been facilitated by the vast increase in available computational power that has occurred concurrently. The main goal of object perception is to detect any objects the robot needs to interact with in order to help human users. Furthermore, the output of object perception is required for motion planning, not only to specify the pose of a target object but also to localize all objects that can be considered obstacles for the current task. An obstacle can be (i) a “fixed object” (e.g., a wall, a table, etc.), (ii) an object that is in a fixed position most of the time (e.g., a vase, a table-sign, etc.), or (iii) a dynamic object, which usually corresponds to a human’s or the robot’s body. Additionally, it is important that a robot knows about its work-space and the accessible regions within it. In the following subsections, we discuss the recent advances in object detection, open-ended object category learning and recognition, and the alternative forms of perception for service robots.

III-A Object detection

Object detection and pose estimation are crucial for robotics applications and have recently attracted the attention of the research community [22]. Many researchers have participated in public challenges such as the Amazon Picking Challenge to solve multi-object detection and pose estimation in a realistic scenario. Two different mechanisms are often employed for real-time object detection.

Fig. 4: Examples of object segmentation in isolated objects scenarios: detected object candidates are shown by different bounding boxes and colors.
Fig. 5: Four examples of object segmentation in pile scenarios: detected objects are shown by different colors.

In the first group of approaches, an object detection approach is paired with an approach for object classification. These approaches initially detect objects and place a bounding box around them. Each detected object is then classified [23, 24]. Four examples of isolated table-top objects are shown in Fig. 4. In general, separating object detection and object recognition is a suitable strategy for service robots since it allows robots to detect never-seen-before objects [24, 25, 26, 27, 28, 29]. A drawback, however, is that only detected objects will be classified. In a household environment, a robot may frequently encounter a pile of objects, such as a clutter of toys in the living room, a messy dining table that needs tidying up, or multiple unused objects stacked in a box in the garage. This complicates the object detection process, because some objects can be occluded by, and overlap with, other objects. There are two sets of approaches that can handle pile segmentation. The first set of approaches mainly uses clustering algorithms to segment objects using the curvature of the surface normals and/or colour information [30, 31, 32] (see Fig. 5). The other set is based on active segmentation. In other words, such approaches initially segment the scene to specify a set of object candidates and then manipulate/push the object candidates one by one to isolate them from the pile of objects [33]. For example, Van Hoof et al. [34] presented a part-based probabilistic approach for interactive object segmentation. They tried to minimize human intervention in the sense that the robot learns from the effects of its actions, rather than from human-given labels (see Fig. 6). In another work, Gupta and Sukhatme [35] explored manipulation-aided perception and grasping in the context of sorting small objects on a tabletop. They presented a pipeline that combines perception and manipulation to accurately sort bricks by color and size.
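The clustering idea behind the first set of approaches can be illustrated with a deliberately simplified sketch. The cited methods group points using surface-normal curvature and/or colour; the version below swaps in plain Euclidean distance as the only similarity cue, and the function name and the `radius` parameter are assumptions made for this example.

```python
import math
from collections import deque


def euclidean_clusters(points, radius):
    """Group 3D points into object candidates by region growing.

    Two points end up in the same cluster if they are connected by a chain
    of points whose consecutive distances are all <= radius.
    Returns a list of clusters, each a sorted list of point indices.
    """
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = min(unvisited)          # deterministic choice of the next seed
        unvisited.remove(seed)
        queue = deque([seed])
        cluster = [seed]
        while queue:
            i = queue.popleft()
            # Brute-force neighbour search; a k-d tree would be used in practice.
            neighbours = [j for j in unvisited
                          if math.dist(points[i], points[j]) <= radius]
            for j in neighbours:
                unvisited.remove(j)
                queue.append(j)
                cluster.append(j)
        clusters.append(sorted(cluster))
    return clusters
```

On a table-top scene with well separated objects this is enough to produce one candidate per object. In a pile, touching objects merge into a single cluster, which is exactly why additional cues (normals, colour) or active pushing are needed.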

The second group of approaches for real-time object detection is based on semantic segmentation, which assigns each pixel/point of a scene to a particular class [36] (i.e., end-to-end object detection). This group does not suffer from the drawback mentioned for the first group [37, 38, 39, 36, 40, 41]. Each segmented region corresponds to a salient part of an object, an entire object or a group of objects belonging to the same category. However, this is not entirely satisfactory for service robots, since objects of the same category are not separated from each other. That is why several attempts have been made to segment not only by category, but also by individual instances within a category [42]. The benefit of such segmentation is that all points/pixels are classified, which could provide valuable information for motion/task planning purposes. An example of a semantic segmentation approach on point clouds is given by SpiderCNN [43]. As the name suggests, it uses CNNs, however, with one important adaptation. Normal CNNs cannot straightforwardly be applied to point cloud data, since it is not stored in a regular grid, which is required for such CNNs. SpiderCNN instead uses a set of irregular parametrized convolutions, which can be applied to point clouds directly. A similar approach is used by PointCNN [44], which instead uses x-transforms to extend convolutions to point clouds. Alternative approaches that do not use CNNs have also been explored. PointNet [45], for instance, initially processes all points independently and identically with a set of input and feature transforms. Eventually, it combines the point features by applying one layer of max pooling. While its performance is on par with other state-of-the-art approaches, its drawback is that, due to its layers of independent processing, it does not capture local features of the spatial distribution of points well. To this end, PointNet++ [46] was introduced. It borrows ideas from CNNs and applies the architecture of PointNet recursively on incrementally growing local regions of the input space. Unlike the original PointNet, this allows it to capture local features on a variety of scales.
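The core PointNet idea described above, identical per-point processing followed by max pooling, can be shown in a few lines. This is only a toy sketch of that one mechanism, assuming a single shared linear layer with ReLU and hand-picked weights; the real network [45] uses learned multi-layer perceptrons and additional input and feature transforms, none of which are reproduced here.

```python
def shared_layer(point, weights, biases):
    """Apply the same linear layer + ReLU to a single point.

    weights: one row of coefficients per output feature
    biases:  one bias per output feature
    """
    return [max(0.0, sum(w * x for w, x in zip(row, point)) + b)
            for row, b in zip(weights, biases)]


def global_feature(points, weights, biases):
    """Process every point independently, then max-pool over the points."""
    per_point = [shared_layer(p, weights, biases) for p in points]
    # zip(*per_point) iterates over feature channels; max pools across points.
    return [max(channel) for channel in zip(*per_point)]


# Hypothetical weights for 3D input and 2 output features.
WEIGHTS = [[1.0, 0.0, -1.0],
           [0.5, 0.5, 0.5]]
BIASES = [0.0, -0.25]
```

Because max pooling is a symmetric function, the resulting descriptor does not depend on the order in which the points are listed, which is what lets this design consume unordered point clouds. The same property explains the drawback noted above: each point is processed in isolation, so no information about its local neighbourhood enters the descriptor.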

Fig. 6: Examples of active object segmentation in pile-of-objects scenarios: (left) The robot observes the scene, obtaining a point cloud from one perspective. (right) The robot decides to push the objects from the bottom to segment them. The robot repeats these steps to singulate all objects (adapted from [34]).
Fig. 7: Examples of object (scene) segmentation: (left) a 3D input point cloud; (right) a network prediction. The colors of the points represent the object labels (adapted from [37]).

While end-to-end deep learning based object detection can achieve high accuracy, it generally requires a very large number of training examples, and training with limited data usually leads to poor performance [24, 25]. Catastrophic forgetting is another important limitation of deep learning approaches [47, 48]. Furthermore, these approaches are unable to learn new categories in an online fashion. It is worth mentioning that for path planning purposes, the mentioned limitations are not that important, since the robot mainly needs to avoid objects, furniture or humans, which can be divided into relatively few categories. However, to be of help, a service robot must be able to accurately manipulate all sorts of small household items, which come in a seemingly endless number of categories.

Another problem with many current 3D object detection approaches is the inability to detect transparent and highly reflective objects. Since glass and shiny metal objects are commonplace in every household, this is an issue that needs to be addressed. Sajjan et al. [49] took a step towards addressing such limitations for object grasping and manipulation. In particular, they proposed a deep learning approach, named ClearGrasp, for estimating accurate 3D geometry of transparent objects from a single RGB-D image for robotic manipulation.

III-B Object category learning and recognition

A typical household environment contains objects belonging to a large number of categories. In such domains, it is not feasible to assume one can anticipate and preprogram everything for robots. Therefore, autonomous robots must have the ability to execute learning and recognition concurrently. Several methods have been introduced that allow for open-ended learning of new categories. In such approaches, the introduction of new categories can significantly interfere with the existing category descriptions. To cope with this issue, memory management mechanisms, including salience and forgetting, can be considered [47]. The main approaches can be divided into two different categories, according to the type of object representation they use. On the one hand, there are approaches that still use deep learning based techniques [50, 51, 28]. They use networks that have been pretrained on a large dataset and are, therefore, already able to extract many useful features from images. Additional training and recognition on new objects and categories is approached in a number of different ways: some use few-shot learning [50, 52, 53], others use one-class support vector machines and random forests [55], and simple instance-based learning is also combined with nearest neighbor classification [50, 51, 28]. Other learning approaches have also been considered, such as autoencoder-based representation learning [56, 57], Bag of Words [58, 26, 59], and topic modeling [60, 61, 62].

On the other hand, there are approaches that instead use hand-crafted features for object recognition [27, 63, 64, 65, 66]. They use either global or local features of the objects, which can be extracted in a variety of ways. The representation of an object is usually given by a histogram of features. Concerning category formation, an instance-based learning (IBL) approach is usually adopted, in which a category is represented by a set of views of instances of the category. When the robot is presented with a new object, it compares the representation of the new object to the representations of all instances of all categories by, for example, a simple nearest neighbor classifier. Therefore, each instance-based object learning and recognition approach can be seen as a combination of a particular object representation, similarity measure [67] and classification rule [68]. One advantage that instance-based learning has over other methods of machine learning is its ability to adapt the model to previously unseen data. The disadvantage of this approach is that the computational complexity grows with the number of training instances. The computational complexity of classifying a single new instance is $O(n)$, where $n$ is the number of instances stored in memory. Therefore, these systems must resort to experience management methodologies to discard some instances and prevent the accumulation of an impractically large set of experiences. Salience and forgetting mechanisms can be used to bound the memory usage. These mechanisms are also useful for reducing the risk of overfitting to noise in the training set. Another advantage of the instance-based approach is that it facilitates incremental learning in an open-ended fashion.
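A minimal sketch of instance-based learning with a nearest neighbor rule is given below. The class and method names are hypothetical, and Euclidean distance between feature histograms is assumed as the similarity measure; the reviewed systems use a variety of representations and measures.

```python
import math


class InstanceBasedLearner:
    """Stores every taught view and classifies by nearest neighbor."""

    def __init__(self):
        self.instances = []  # list of (histogram, label) pairs

    def teach(self, histogram, label):
        # Learning is a single append: no retraining, new labels allowed anytime.
        self.instances.append((tuple(histogram), label))

    def classify(self, histogram):
        # Recognition scans all stored instances, so its cost grows linearly
        # with the number of instances kept in memory.
        if not self.instances:
            return None
        _, label = min(self.instances,
                       key=lambda inst: math.dist(inst[0], histogram))
        return label
```

A new category is introduced simply by calling `teach` with an unseen label, which is what makes the approach open-ended. The linear scan in `classify` is the reason such systems need salience and forgetting to keep the instance list bounded.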

Some works instead adopt model-based learning (MBL), which is often contrasted with IBL. In particular, IBL considers category learning as a process of learning about the instances of the category, while MBL is a process of learning a parametric or non-parametric (Bayesian) model from a set of instances of a category. In other words, each category is represented by a single model. In IBL approaches, a new object is compared to all previously seen instances, while in MBL approaches a target object is compared to the models of the categories. Therefore, in terms of recognition response, MBL approaches are faster than IBL approaches. In contrast, IBL approaches can recognize objects using a small number of experiences, while MBL approaches need more experiences to achieve a good classification result. Training is thus very fast in IBL approaches, but they require more time in the recognition phase. Another disadvantage of IBL approaches is that they need a large amount of memory to store the instances. In MBL, new experiences are used to update category models and can then be forgotten immediately. The category model encodes the information collected so far. Therefore, this approach consumes a much smaller amount of memory than any IBL approach.
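The memory argument for MBL can be made concrete with a small sketch in which each category model is just a running mean of the feature vectors seen so far. This is one of the simplest possible parametric models and is an assumption made for illustration; the Bayesian models mentioned above are richer. All names are hypothetical.

```python
import math


class CategoryModel:
    """A single model per category: the running mean of its instances."""

    def __init__(self, dim):
        self.count = 0
        self.mean = [0.0] * dim

    def update(self, features):
        # Incremental mean update; the instance itself is not stored.
        self.count += 1
        self.mean = [m + (x - m) / self.count
                     for m, x in zip(self.mean, features)]


class ModelBasedLearner:
    def __init__(self):
        self.models = {}  # label -> CategoryModel

    def teach(self, features, label):
        if label not in self.models:
            self.models[label] = CategoryModel(len(features))
        self.models[label].update(features)

    def classify(self, features):
        # One comparison per category, independent of how many
        # instances were used to build each model.
        if not self.models:
            return None
        return min(self.models,
                   key=lambda label: math.dist(self.models[label].mean, features))
```

Memory use and recognition time here scale with the number of categories rather than the number of experiences. The flip side, as noted above, is that a mean built from one or two instances is a poor summary of a category, so more experiences are needed before recognition becomes reliable.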

III-C Fine-grained object recognition

In human-centric environments, fine-grained (very similar) object categorization is as important as basic-level categorization. A problem with the above approaches is that categories that are very similar might be hard to distinguish. Such categories could, for example, be different items of cutlery, different dog, cat or other pet breeds, a variety of box-shaped objects (like food container boxes, tissues, a stack of paper, etc.) or different writing utensils. Attempts have been made to tackle this issue by introducing fine-grained object recognition [69, 70, 71]. Fine-grained object recognition takes into consideration small visual details of the categories that are important to distinguish them from similar categories. In the case of food boxes, think for example about the print on the boxes. Fine-grained and basic-level recognition are important in different domains and, when combined, a trade-off is made between the computational benefits of basic-level recognition and the accuracy of fine-grained recognition. Most existing approaches for fine-grained categorization rely heavily on accurate annotations of object parts/features during the training phase. Such requirements prevent the wide usage of these methods. However, some works only use class labels and do not need exhaustive annotations. Geo et al. [70] proposed a Bag of Words (BoW) approach for fine-grained image categorization by encoding objects using generic and specific representations. This approach is impractical for an open-ended domain (i.e., a large number of categories) since the size of the object representation is linearly dependent on the number of known categories. Zhang et al. [71] proposed a novel fine-grained image categorization system. They only used class labels during the training phase. This work completely discarded co-occurrence (structural) information of objects, which may lead to non-discriminative object representations and, as a consequence, to poor object recognition performance.
Several studies have assessed the added value of structural information. Kasaei et al. [60] proposed an open-ended object category learning approach just by learning specific topics per category. In another work [69], an approach is proposed to learn a set of general topics for basic-level categorization, and a category-specific dictionary for fine-grained categorization.

Fig. 8: A set of very similar (fine-grained) cutlery objects: (left) an object view of a spoon, (center) an instance of a fork, and (right) an object view of a knife object.

Object detection and recognition performance can be improved by considering the context in which objects appear [72, 73, 74, 75]. In a house, certain categories of related items are often placed together. This information can be used to improve the processing of related items. For example, chairs and tables often appear together, so the presence of one could be used as a cue to expect the presence of the other as well. Additionally, context information can be used to distinguish objects from similar categories that appear in different environments. A pen and a screwdriver, for example, have a fairly similar shape; however, a pen is likely to appear on or near a desk in a home office or a table in the living room, while a screwdriver is expected to be around other tools such as hammers and wrenches that are more likely to be found in a garage or near a fuse box.
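As a rough illustration of the pen-versus-screwdriver example, the following Python sketch reweights ambiguous appearance scores with a context prior using Bayes' rule. The function name, the room labels, and all probabilities are invented for illustration and do not come from the cited works.

```python
def rerank_with_context(appearance_scores, context_prior):
    """Combine appearance likelihoods with a context prior via Bayes' rule.

    Both arguments map category labels to non-negative numbers; the result is
    a normalized posterior over the labels in appearance_scores.
    """
    unnormalized = {label: appearance_scores[label] * context_prior.get(label, 0.0)
                    for label in appearance_scores}
    total = sum(unnormalized.values())
    if total == 0.0:
        # The context rules out every candidate; fall back to appearance only.
        return dict(appearance_scores)
    return {label: value / total for label, value in unnormalized.items()}

# Shape alone cannot separate the two categories (made-up scores).
appearance = {"pen": 0.5, "screwdriver": 0.5}

# Hypothetical priors for how likely each category is in a given room.
priors = {
    "home_office": {"pen": 0.9, "screwdriver": 0.1},
    "garage": {"pen": 0.2, "screwdriver": 0.8},
}

office = rerank_with_context(appearance, priors["home_office"])
garage = rerank_with_context(appearance, priors["garage"])
print(max(office, key=office.get), max(garage, key=garage.get))
```

With equal appearance scores, the decision is driven entirely by the room in which the object is observed.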

In active perception scenarios, whenever the robot fails to recognize an object from the current viewpoint, the robot will estimate the next view position and capture a new scene from that position to improve its knowledge of the environment. This will reduce the object detection and pose estimation uncertainty. Towards this end, Mauro et al. [76] proposed a unified framework for content-aware next best view selection based on several quality features such as density, uncertainty, and 2D and 3D saliency. Using these features, they computed a view importance factor for a given scene. Kasaei et al. [77] proposed a novel Next-Best-View prediction algorithm to improve object detection and manipulation performance. First, a given scene is segmented into object hypotheses, and then the next best view is predicted based on the properties of those object hypotheses. In another work, Biasotti et al. [78] approached the problem of defining the representative views for a single 3D object based on visual complexity. They proposed a new method for measuring the viewpoint complexity based on entropy. Their approach revealed that it is possible to retrieve and to cluster similar viewpoints. Doumanoglou et al. [79] used class entropy of samples stored in the leaf nodes of a Hough forest to estimate the Next-Best-View. Some researchers have recently adopted deep learning algorithms for next best view prediction in active object perception. For instance, Wu et al. [80] proposed a deep network, namely 3D ShapeNets, to represent a geometric shape as a probabilistic distribution of binary variables on a 3D voxel grid.
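Several of the approaches above use entropy as the selection criterion. The Python sketch below shows the general idea in its simplest form: among candidate viewpoints, choose the one whose predicted class distribution has the lowest Shannon entropy. It is a simplification and not the algorithm of any cited paper; the candidate view names and the predicted distributions are assumed inputs that a real system would have to estimate.

```python
import math

def shannon_entropy(probabilities):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0.0)

def next_best_view(predicted_posteriors):
    """Return the candidate view whose predicted class posterior is least uncertain.

    predicted_posteriors maps a view name to the class distribution the robot
    expects to obtain after observing the object from that view.
    """
    return min(predicted_posteriors,
               key=lambda view: shannon_entropy(predicted_posteriors[view]))

# Hypothetical predictions over three object classes for three candidate views.
candidates = {
    "front": [0.4, 0.3, 0.3],
    "side": [0.8, 0.1, 0.1],
    "top": [0.5, 0.5, 0.0],
}
best = next_best_view(candidates)
print(best, round(shannon_entropy(candidates[best]), 3))
```

The side view is selected here because its predicted distribution is the most peaked, so it is expected to reduce recognition uncertainty the most.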

III-D Alternate forms of perception

Cognitive scientists showed that humans’ vision is not an independent process and it is closely coupled to other forms of perception [81, 82, 83]. It could therefore be desirable to also explore different forms of perception in robotics. Furthermore, it could be interesting to investigate a way to couple these forms of perception, similar to what we see in humans. This could be especially useful in the case where fragmented information in the different modes of perception could be combined to form a clearer picture.

Take, for example, the case of robotic perception via touch. A robot can use not only vision but also touch to identify objects [84]. This could help with identifying transparent or reflective objects, which was previously identified as a problem area and is something many forms of visual object detection currently cannot handle. It could also be very useful for a robot to be able to detect typical household sounds, which has received surprisingly little attention so far [85]. Additionally, auditory and visual perception could be improved by an interchange of information. Research in this respect has largely focused on sound localization [86, 87, 88, 89, 90, 91]. This entails that, given a video input with corresponding audio, a system tries to identify which sounds in the audio channel correspond to which objects in the video scene. Take for example the MONO2BINAURAL [86] DNN. Given a single-channel audio signal together with an accompanying source video, it uses spatial information in the video to convert the audio signal into two channels, each of which represents the signal received in one ear. This two-channel audio then also provides spatial information about the sounds, which is clearly related to sound localization. It is therefore not surprising that the representation learned by this network was successfully extended to the task of sound localization. The CO-SEPARATION [87] architecture is very similar; however, it separates the audio into different sources instead of two spatial channels. By doing this, it also learns to match the sound of the same object type appearing in different videos.

While sound localization can already be useful in itself for robots to, for example, look at the person who is talking to them, to the best of the authors' knowledge no attempts have been made to leverage this separated and localised audio to aid in the recognition of objects or events. For instance, imagine the case of a whistling kettle, or something more serious like the thump of a human falling and hitting the ground, possibly injuring themselves. It is important for the robot to be able to identify where the sound comes from. Furthermore, it is just as important that the robot should be able to recognise the sounds, and the objects that produce them, and take action if necessary. Ideally, these alternate forms of perception, as well as the different forms of vision we discussed, should be coupled in a unified framework, so that each can benefit from the information of the others.

IV Affordance Detection and Object Grasping

Object grasping and manipulation stand at the core of a successful service robot: without these abilities the robot cannot provide many of the necessary services. There are a number of problems inherent to this task. For example, the robot needs to accurately detect the pose of objects, recognise the class of objects, and find the best location to grasp them in real-time. Besides, any planned movements need to avoid self-collision and collision with the environment. Furthermore, the more complex a robotic arm and gripper are, the more complicated these problems become, taking more computation time to find adequate solutions. Accuracy and computation time are the two major factors to balance when writing algorithms for these purposes. A great deal of work has been done to improve the quality of object grasping and reduce the execution time of motions in object manipulation. There are, however, still areas and specific subjects that have not yet received a lot of attention. In the following subsections, we will point out some of those gaps in the state-of-the-art of object grasping and manipulation.

IV-A Grasp point detection

Detecting stable grasp points for previously unseen objects in real-time is one of the main challenges for object grasping. Towards finding proper grasping points of objects, many different techniques for both teaching and learning processes can be used. Trial and error is not often utilized due to the large risk of damaging the robot, but it can be employed to learn multiple good grasps [92, 93]. Some approaches use object segmentation, but they differ significantly from the approaches mentioned in the previous section, which often have the disadvantage of being relatively inflexible and not generalizing well to objects outside of the set they have been trained on. Matching known grasps in the manner of instance-based learning is possible as well [94]. Many recent techniques share the same pipeline to find a set of stable grasp points for the given object. First, the grasping candidates are sampled from the point cloud (or image) of the object; then the candidates are ranked by, for example, a Convolutional Neural Network (CNN) [95]. The best candidate then gets grasped in either an open-loop or a closed-loop fashion. Grasp point detection approaches can be broadly classified into two categories: analytical approaches and empirical approaches. While analytical approaches rely on kinematic and dynamic formulations to choose a proper end-effector configuration (i.e., position and orientation of hand and fingers), empirical approaches use data-driven learning algorithms to transfer grasps from 3D model databases to a target object. Empirical methods can be further classified based on whether the grasp configuration is being computed for known, familiar or unknown objects. The underlying reason for this classification is that prior knowledge about objects determines how grasp candidates are generated and ranked. For more details on data-driven grasp synthesis, we refer the reader to the surveys of Bohg et al. [96] and Sahbani [97].

Fig. 9: Results of grasp point detection on three different objects: (left) Without considering object affordance, the robot detects a graspable point on the blade region of the knife; (center and right) By considering objects’ affordances, the robot detects a suitable grasp point for both objects.

Deep learning methods have recently achieved the largest advancements in grasping for unknown objects. Although an exhaustive survey of deep learning methods for grasp point detection is beyond the scope of this paper, we review some representative works here. Most of these approaches, however, use an adaptation of the CNN architectures designed for object recognition [98]. Additionally, they often sample and rank potential grasps individually [99]. Together, this can cause exceedingly long computation times, which makes them unsuitable for real-time closed-loop grasping. Mahler et al. [100] proposed the first Dexterity Network (Dex-Net). Since then, they have proposed three further versions of the Dex-Net architecture. Together with the Dex-Net architectures, they also proposed the growing Dex-Net dataset. Dex-Net 1.0 [100] uses a CNN with an AlexNet architecture [101] to first predict the labels of the given object and then the probability of force closure under uncertainty in object pose, gripper pose and friction coefficient, which gives the quality of a grasp. Dex-Net 2.0 [95] uses a Grasp Quality Convolutional Neural Network (GQ-CNN) to evaluate grasp candidates for a certain object. In this approach, first, a discrete set of antipodal candidate grasps is sampled from the image space. Then, these grasp candidates are forwarded to the GQ-CNN to find the best grasp point. The architecture of Dex-Net 3.0 [102] does not differ from that of Dex-Net 2.0; the main difference lies in the use of a suction cup in Dex-Net 3.0, whereas Dex-Net 2.0 uses a parallel-jaw gripper. Dex-Net 4.0 [103] was created for a robot with a suction cup and a parallel-jaw gripper. The advantages of suction cups are that they can reach into narrow spaces and pick up an object with a single point of contact. Unlike earlier versions, Dex-Net 4.0 is trained not only on grasping isolated objects but also on grasping objects in piles.

Generative Grasping Convolutional Neural Network (GG-CNN) [104, 105] seeks to improve on the mentioned drawbacks. GG-CNN is an object-independent grasp synthesis method which can be used for closed-loop grasping. GG-CNN predicts the quality and pose of grasps at every pixel through a one-to-one mapping from the depth image. This is in contrast with most deep-learning grasping methods, which use discrete sampling of grasping candidates and have long computation times. Furthermore, such online analysis of objects allows for more precise and potentially faster grasps. These methods vastly reduce the number of candidate grasps by highlighting the locations where the grasps are of the highest quality. Online grasp generation has also been addressed in another work [106]. Both [106] and GG-CNN use neural networks to generate maps of grasp quality. For the former, grasp points for the two end-effectors at various angles are generated. For the latter, maps for grasp quality, angle and width are generated.
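To make the pixel-wise formulation concrete, the following Python sketch shows how a single grasp could be read out of three per-pixel maps (quality, angle, width) by taking the pixel with the highest quality. The maps here are tiny hand-written arrays standing in for network outputs, and the function name is hypothetical; the actual post-processing used in GG-CNN may differ.

```python
def best_grasp(quality_map, angle_map, width_map):
    """Pick the pixel with the highest grasp quality and read its angle and width.

    All three maps are 2D lists of identical shape, aligned with the depth image.
    """
    best = None
    for row, quality_row in enumerate(quality_map):
        for col, quality in enumerate(quality_row):
            if best is None or quality > best["quality"]:
                best = {
                    "row": row,
                    "col": col,
                    "quality": quality,
                    "angle": angle_map[row][col],   # gripper rotation in radians
                    "width": width_map[row][col],   # gripper opening in pixels
                }
    return best

# Made-up 3x3 maps standing in for the output of a grasp network.
quality = [[0.1, 0.2, 0.1],
           [0.3, 0.9, 0.4],
           [0.2, 0.5, 0.3]]
angle = [[0.0, 0.1, 0.2],
         [0.3, 0.785, 0.5],
         [0.6, 0.7, 0.8]]
width = [[10.0, 12.0, 11.0],
         [20.0, 35.0, 22.0],
         [15.0, 18.0, 16.0]]

grasp = best_grasp(quality, angle, width)
print(grasp)
```

Because the readout is a single pass over the maps, it can be repeated for every new depth frame, which is what makes this style of prediction suitable for closed-loop control.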

A robot could benefit from affordance detection of parts of a single object to reduce the complexity in finding the locations of high quality grasps [107, 108, 107]. Certain objects have a clear place to be grabbed, for instance the handle of a knife, the handle of a mug or the handle of a pan. These can be identified in this manner. Conversely, there are also objects that should not be grasped in a certain area. Considering again the case of a knife, its blade could damage the robot’s end-effector when grasped. Softer skin-like end-effectors [109], which are important for delicate manipulation of objects, are especially at risk (see Fig. 9). Care should also be taken with plates, cups, bowls and similar objects, since they may contain food or liquids. Manipulating filled containers is already a difficult task [110], and a bad grasp can make the manipulation more difficult than it has to be. Naturally, touching food or liquids with the end-effector should be avoided as well.

An affordance for a certain grasp is generally synonymous with a high grasping quality, be it for a specific end-effector, angle or other purpose. In [111], objects’ point clouds are semantically segmented by a rule-based system. Based on what the robot is tasked with, a specific segment of the object may be more suitable to be grasped than others. As a bonus, the sizes and shapes of segments are features, allowing the system to classify objects. This combination of grasp affordance and knowledge of semantics creates a system that is robust in handling objects for a variety of tasks. Besides, affordance prediction can provide some additional information about other objects [112, 113]. For example, it can tell the robot on which furniture other objects can be placed, or determine whether an object can be picked up or not. It is worthwhile to mention that affordances do not make any claims about the category of objects in a scene, but rather try to predict the functionality of objects [113]. Instance-based learning with affordances [106] and applying affordances to segments of objects [111] are other possible options.

IV-B Object grasping

Object grasping remains one of the challenging tasks and requires knowledge from different fields. This topic has received a lot of attention lately. Through the use of inverse kinematics (IK), there are many different grasps possible on objects. Not all of them are feasible as not all joint configurations are free of collisions and singularities. A good grasp has a high quality, a measure of how stable a grasp is, and has to be somewhere on the visible part of the object. A large body of recent efforts has focused on solving 4-DoF (x, y, z, yaw) grasping, where the gripper is forced to approach objects from above [106, 105, 95, 114].

Fig. 10: An example of a multi-functional gripper that can be used for grasping objects in different situations: (left) In this situation, the object is graspable vertically using the two-finger parallel-jaw gripper since there is enough space between the walls and the object. This type of grasping is mainly used to pick up objects with small, irregular surfaces such as baskets, table-top objects and tools; (center) Some objects can be robustly grasped using a suction cup gripper as shown in this figure. This type of grasping is robust for objects with large and flat surfaces, e.g., books and boxes. (right) In some situations, it is necessary to grasp an object from a specific direction, mainly due to environmental constraints or application requirements. For instance, this type of grasp is used to grasp objects resting against walls, which may not have suction-able areas from the top [115].

A major drawback of these approaches is the inevitably restricted ways to interact with objects. For instance, they are not able to grasp a horizontally placed plate. Worse still, the knowledge of robots is fixed after the training stage, in the sense that the representation of the known grasp templates does not change anymore. A household situation is prone to change: cabinets may be stocked differently, and implements may be lost and gained. To avoid impossible tasks, the degrees of freedom of an end-effector cannot be so restricted. These drawbacks call for more advanced and seamless approaches for object grasping.

Recently, some research groups have taken a step towards addressing these issues. Qin et al. [116] studied this problem in a challenging setting, assuming that a set of household objects from unknown categories are casually scattered on a table. They proposed a learning framework to directly regress 6-DoF grasps from the entire scene point cloud in one pass. In particular, they perform per-point scoring and pose regression for 6-DoF grasps. In another paper, a single view was used for grasp generation by Kopicki et al. [117]. This works similarly to the direct regression from point clouds. The main obstacle with such a method is that the back side of objects is generally unknown and represents missing information. Challenging situations, those where the back side needs to be grasped, are thus harder to resolve correctly. By combining generative models and using new ways to evaluate contact points, a higher rate of success is achieved in these challenging situations. In another work, Murali et al. [118] proposed an approach to plan 6-DoF grasps for any desired object in a cluttered scene from partial point cloud observations. They mainly used an instance segmentation method to detect the target object. To generate a set of grasp points for the object, they follow a cascaded approach by first reasoning about grasps at an object level and then checking the cluttered environment for collisions.

These methods are partial solutions to the general problem of moving from 4 to 6 degrees of freedom. A larger search space for grasps means that there are more grasps that are in conflict with the environment. A larger search space also means more computation time is required to find good grasps and eliminate bad ones. By focusing on regression of point clouds and single views, Qin et al. and Kopicki et al. avoid a large computational pitfall. Likewise, the method of Murali et al. is able to avoid scene collisions by clever segmentation and analysis.

For the methods discussed, the end-effectors are still rather simple. While fast grasping with a simple gripper is relatively easy, doing so with more complicated and also more dextrous grippers is again harder and more computationally intensive. Another option is a vacuum end-effector which functions like a suction cup. Like grippers, these have similar requirements for a successful grasp. Both require a high quality grasp, but while grippers require a stable pinching area, suction cups require a stable suction area. This difference in suitable grasp location for vacuum end-effectors allows them to manipulate objects in ways that grippers cannot.

As we saw before, Mahler et al. [102] created Dex-Net 3.0, a large dataset for 3D object grasping specifically for vacuum-based end-effectors. This dataset is mainly used to train a grasp quality classifier. With it, simple objects and typical everyday objects are grasped with a high rate of success, while the most complicated objects can be grasped successfully over half of the time. While a vacuum end-effector is highly suitable for work in a factory, a delicate gripper is still preferred for the manipulation of the softest of objects, as well as those objects that lack any proper suction points. The opposite is also true: objects that are more easily picked up with a suction cup, or that cannot be picked up with a gripper, may warrant the inclusion of a vacuum end-effector. To ensure the proper completion of any task, a service robot may have to be equipped with both. Instead of using two robotic arms with two different types of gripper, Zeng et al. [115] developed an interesting multi-functional gripper, consisting of a suction cup and a parallel-jaw gripper, to allow robots to robustly grasp objects from a cluttered scene without relying on their object identities or poses. Such mechanisms enable a robot to quickly switch between suction cup and parallel-jaw gripper to grasp different types of objects in various positions (see Fig. 10).

There are also various anthropomorphic robotic hands. The Shadow Dexterous Hand [119] is an almost fully actuated robotic hand that is modeled very closely after human hands. There even exists an upgraded version which boasts touch and vibration sensors on the fingertips of the device. For a more integrated approach, iCub [120, 121] is a fully anthropomorphic robot that was designed for human-like interactions with its environment. Compared to the Shadow Dexterous Hand, the iCub robot is softer and less actuated. It is covered in a soft skin-like material to aid in the delicate handling of objects. On the extreme end of the spectrum, there is the RBO Hand-II [122]. While still anthropomorphic, its individual fingers are more like tentacles than human fingers. It is an extremely soft and relatively simple hand, lacking any joints. Nevertheless, it is still able to grasp a variety of objects.

The simplicity of a gripper does not necessarily inhibit it from being successful at various tasks. This is also shown in a review of various robotic hands [123], which reports a rise in simpler but effective grippers. While complex end-effectors can be used to do complex tasks, a simpler gripper may be able to perform them just as effectively. This leads back to the vacuum end-effector, which is far more effective in certain circumstances than any anthropomorphic robotic hand.

Whether a soft or under-actuated robotic hand is required for delicate operation generally depends on the task at hand. Industrial robots are often fully actuated and rigid because, while the actions themselves are often not complicated and take place in a highly structured environment, high precision is still required. For service robotics, the tasks are more varied while requiring similar precision. For such an environment, the end-effector may specify suitable tasks instead of the other way around. Having multiple end-effectors at its disposal then increases the number of tasks that the robot is suitable for. To this end, a combination of different grippers, such as in Zeng et al. [115], would be a proper approach to create a capable service robot.

IV-C Open-ended grasp learning

While progress has been made towards open-ended learning, it remains a big problem that needs to be addressed further. Current approaches are only able to successfully learn a limited percentage of all objects that can be found in a household. For service robots which need to be able to help those in need, a significant chance of failing to recognise and grasp important objects, such as medicine containers, could be very problematic. Towards addressing this issue, some researchers used kinesthetic teaching to teach a new grasp configuration, including the position and orientation of the arm relative to the object and the finger positions [124, 125, 126]. As shown in Fig. 11, an instructor teaches an appropriate end-effector position and orientation using the robot’s compliant mode. After performing the kinesthetic teaching, visual features of the taught grasp region (e.g., heightmaps [125]) are extracted and stored as a grasp template in the grasp memory.

Fig. 11: An example of kinesthetic teaching: (left) a user demonstrates a feasible grasp to the PR2 robot; (right) Extracted template heightmap and gripper-pose are used to train the proper grasp position for the given object (adapted from  [125]).

In another work, Kasaei et al. [124] formulated grasp pose detection as a supervised learning problem to grasp familiar objects. Their primary assumption was that new objects that are geometrically similar to known ones could be grasped in a similar way. They used kinesthetic teaching to teach feasible grasp configurations for a set of objects. The target grasp point is described by (i) a spin-image [127], which is a local shape descriptor, and (ii) a global feature, which represents the distance between the grasp point and the centroid of the object. To detect a grasp configuration for a new object, they initially estimate the similarity of the target object with all known objects, and then try to find the best grasp configuration for the target object based on the grasp templates of the most similar object.
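The two-step lookup described above (find the most similar known object, then match its grasp templates against candidate points on the target) can be sketched in Python as follows. This is an illustrative simplification rather than the implementation of [124]: the short vectors stand in for real spin-images, the dictionary layout and function names are hypothetical, and the equal weighting of the local and global terms in the cost is an assumption.

```python
import math

def l2(a, b):
    """Euclidean distance between two equal-length descriptors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def transfer_grasp(known_objects, target_descriptor, target_points):
    """Reuse a taught grasp from the most similar known object.

    Returns (name of the matched object, id of the chosen target point,
    gripper pose stored in the matched template).
    """
    # Step 1: find the known object most similar to the target object.
    nearest = min(known_objects, key=lambda obj: l2(obj["descriptor"], target_descriptor))

    # Step 2: compare every candidate point with every template of that object,
    # using a local shape term plus a centroid-distance term (equal weights assumed).
    best_point, best_template, best_cost = None, None, float("inf")
    for point in target_points:
        for template in nearest["templates"]:
            cost = (l2(point["local_shape"], template["local_shape"])
                    + abs(point["centroid_distance"] - template["centroid_distance"]))
            if cost < best_cost:
                best_point, best_template, best_cost = point, template, cost
    return nearest["name"], best_point["id"], best_template["gripper_pose"]

# Made-up grasp memory with one taught template per object.
known_objects = [
    {"name": "mug", "descriptor": [0.9, 0.1, 0.3],
     "templates": [{"local_shape": [0.2, 0.8], "centroid_distance": 0.06,
                    "gripper_pose": "side_grasp_on_handle"}]},
    {"name": "bottle", "descriptor": [0.1, 0.9, 0.7],
     "templates": [{"local_shape": [0.7, 0.3], "centroid_distance": 0.02,
                    "gripper_pose": "side_grasp_on_body"}]},
]
target_descriptor = [0.85, 0.15, 0.35]
target_points = [
    {"id": "p0", "local_shape": [0.6, 0.4], "centroid_distance": 0.01},
    {"id": "p1", "local_shape": [0.25, 0.75], "centroid_distance": 0.05},
]

result = transfer_grasp(known_objects, target_descriptor, target_points)
print(result)
```

Note that step 1 compares the target with every known object, so this lookup inherits the recognition-time cost of instance-based approaches.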

Most of these examples have dealt only with grasping. To be truly considered dexterous, robotic systems should have the capacity to do complicated tasks. A small step in the right direction is the ability to prevent objects from slipping and falling [128]. The adjustment of already held objects, such as by using the individual fingers [129], is a required stepping stone to advanced object handling. Moving from one robotic arm to two, synchronized or not [130], truly opens up the way to manipulate equipment made for human use. A fair part of the equipment that service robots have to work with is made for humans. This means that advanced object manipulation, as well as the other advances, is required for the development of capable service robots.

V Object Manipulation

It is easier to explain the details of object manipulation using an example; consider the serve_a_coke task shown in Fig. 12. To accomplish this task, the robot needs to detect and recognize all objects first. Afterwards, it has to identify a proper grasp pose for the coke object and plan a trajectory to reach the target pose. The object is then grasped by the robot. The object manipulation module computes an obstacle-free trajectory to navigate the robot’s end-effector to a desired pose, which is on top of the cup object. The object manipulation module should check whether any part of the manipulator is at risk of colliding with itself or with any obstacles. Finally, the manipulation module sends out the action to the execution manager module. We will mainly discuss collision avoidance and dexterity in the following subsections.

Fig. 12: An example of object manipulation in the serve_a_coke scenario; (top-right) Initially, the robot detects the table as shown by the green polygon. Then, all table-top objects are detected and recognized. The pose and category label of each object are highlighted by a bounding box and a label in red on top of each object. (top-left) The robot then locates the CokeCan object, goes to its pre-grasp area and picks it up from the table. (bottom-left) The robot retrieves the position of the Cup first, and then calculates an obstacle-free trajectory and moves the CokeCan on top of the Cup to serve the drink. (bottom-right) Finally, the robot computes an obstacle-free trajectory to navigate the robot’s end-effector to the initial pose.

V-A Self-collision avoidance

The standard way of dealing with self-collision avoidance (SCA) is either to plan feasible paths under constraints or to react when the robot almost collides with itself. A planned route is susceptible to interruptions and thus only works in static environments [131, 132, 133]. The environments that service robots have to perform in may not always be static. Whenever the environment changes, the plan will have to be recalculated, including the SCA. Reactive systems, such as [134, 130], are more adaptable and can deal with changes to the environment. As a trade-off, they tend to get stuck more easily, may have trouble with tight spaces and tend to oscillate around local maxima in their joint space.

Salehian et al. [130] proposed a new way of handling SCA. The collision boundary and its gradient in the combined joint space of the two arms are encoded into a model before the operation of the machine even starts. This stored gradient leads the robotic arms to avoid situations where they might collide, by repelling them from the boundary denoting imminent collision. This distance to collision is encoded in a kernel SVM and can be used as a constraint for inverse kinematics (IK) solving instead of the usual joint-to-joint distances. This model is effectively data-driven, requiring examples of valid and invalid joint configurations for the SVM to learn. In return, solving the IK with SCA is very fast (<10ms). The collision boundary and gradient have to be determined only once for a given robot blueprint and can be copied onto each production model.
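The idea of a learned boundary function whose gradient pushes the arms away from collision can be sketched in Python as below. This is not the model of [130]: the RBF kernel, the two hand-picked support vectors, the sign convention (positive values treated as collision-free), and the margin and step parameters are all assumptions made for illustration. A real system would obtain the support vectors and coefficients by training an SVM on simulated joint configurations.

```python
import math

def rbf_kernel(q, s, gamma):
    """Radial basis function kernel between joint configurations q and s."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(q, s)))

def boundary_value(q, model):
    """Kernel SVM decision function: sum_i c_i * k(q, s_i) + b.

    Assumed convention: positive means collision-free, negative means colliding.
    """
    return sum(c * rbf_kernel(q, s, model["gamma"])
               for c, s in zip(model["coefficients"], model["support_vectors"])) + model["bias"]

def boundary_gradient(q, model):
    """Analytic gradient of the decision function with respect to q."""
    grad = [0.0] * len(q)
    for c, s in zip(model["coefficients"], model["support_vectors"]):
        k = rbf_kernel(q, s, model["gamma"])
        for j in range(len(q)):
            grad[j] += c * k * (-2.0 * model["gamma"]) * (q[j] - s[j])
    return grad

def repel_from_boundary(q, model, margin=0.2, step=0.1):
    """Take one small step up the gradient when q is closer than `margin` to collision."""
    if boundary_value(q, model) >= margin:
        return list(q)  # far enough from the boundary; leave the configuration alone
    grad = boundary_gradient(q, model)
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return list(q)
    return [qj + step * g / norm for qj, g in zip(q, grad)]

# Toy 2-joint model: one safe and one colliding support vector (made up).
model = {"support_vectors": [[0.0, 0.0], [1.0, 1.0]],
         "coefficients": [1.0, -1.0], "bias": 0.0, "gamma": 1.0}
q = [0.6, 0.6]
q_next = repel_from_boundary(q, model)
print(boundary_value(q, model), boundary_value(q_next, model))
```

Because both the value and the gradient are closed-form sums over the support vectors, evaluating them is cheap, which is consistent with the fast solve times reported above. In an IK solver the value would typically enter as an inequality constraint rather than as the explicit repulsion step used here.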

While [130] has a good method to avoid collisions between its two robot arms, it does not take into account the singularities that can occur during motion. The model from [134] manages to balance SCA as well as avoiding singularities, moving smoothly and reaching the desired goal position and orientation. The SCA is done in a similar manner as before, but a neural network is trained instead of an SVM. The imminence of collisions, together with other objective functions, determines the constraints that the IK solver is under.

These methods are able to quickly avoid self-collision, but rely on data to show the collision boundaries. The safety of joint configurations can be easily simulated, and so it is possible to generate this data beforehand. It is somewhat analogous to how humans learn the limits of their bodies, but without the need for some form of stress or touch sensors embedded into the robot arm structure. Nevertheless, looking into more human-like robotic arms, complete with additional sensors, can be useful for more advanced service robotics. However, for simpler service robots, the current implementations of collision proximity are more than enough.

V-B Avoiding collisions with the environment

In contrast to self-collision avoidance, checking for collisions with the environment is a computationally intensive process (see Fig. 13). Reactive systems have to continuously check if they are at risk of colliding, while planners have to check every configuration or position that they may attempt to use. In an online environment, speed is key: an algorithm that takes more than a second may already be too slow. One way to remedy this in a planning-based approach is to sample more aggressively towards the goal position and orientation [132]. This reduces the number of nodes that have to be visited during planning, and with it the number of collision checks that have to be made.
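The idea of sampling more aggressively towards the goal can be illustrated with a toy 2-D RRT. This is a generic sketch under stated assumptions, not the planner of [132]: the circular obstacles, the `goal_bias` parameter, the step size and all function names are hypothetical, and only new nodes (not edges) are collision-checked, which assumes the step is small relative to the obstacles.

```python
import math
import random


def collides(p, obstacles):
    """True if point p lies inside any circular obstacle (cx, cy, radius)."""
    return any(math.hypot(p[0] - cx, p[1] - cy) <= r for cx, cy, r in obstacles)


def sample(goal, bounds, goal_bias, rng):
    """Return the goal with probability goal_bias, otherwise a uniform sample."""
    if rng.random() < goal_bias:
        return goal
    (xmin, xmax), (ymin, ymax) = bounds
    return (rng.uniform(xmin, xmax), rng.uniform(ymin, ymax))


def rrt(start, goal, obstacles, bounds, goal_bias=0.2, step=0.5,
        max_iters=3000, seed=0):
    """Toy goal-biased RRT. Returns (path or None, number of collision checks)."""
    rng = random.Random(seed)
    parent = {start: None}  # tree stored as child -> parent
    checks = 0
    for _ in range(max_iters):
        target = sample(goal, bounds, goal_bias, rng)
        nearest = min(parent, key=lambda n: math.dist(n, target))
        d = math.dist(nearest, target)
        if d == 0.0:
            continue
        t = min(1.0, step / d)  # move at most one step towards the target
        new = (nearest[0] + t * (target[0] - nearest[0]),
               nearest[1] + t * (target[1] - nearest[1]))
        checks += 1  # every candidate node costs one collision check
        if collides(new, obstacles) or new in parent:
            continue
        parent[new] = nearest
        if math.dist(new, goal) < 1e-9:
            path = [new]
            while parent[path[-1]] is not None:
                path.append(parent[path[-1]])
            return path[::-1], checks
    return None, checks
```

Calling `rrt` with different `goal_bias` values and comparing the returned check counts gives a rough feel for the trade-off: a higher bias tends to reach the goal with fewer checks in open space, but it can stall in front of obstacles, which is why some uniform sampling is kept.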

Looking at the consequences of the actions is important, especially when it comes to hard hitting robots and delicate surroundings. One interesting way of preventing damage is by predicting the consequences of disturbing scenes [131]. When an object is picked up by the robot, it may cause other objects to move, tumble or fall. Objects may get damaged in this way, which is of course not desired. To remedy this, robots usually act as little as possible to get the job done. This new method instead focuses on learning the order in which to move objects to cause the least amount of damage. In this case, the paths that disturbed objects take, by rolling or falling, act as a cost or penalty. Based on available knowledge of the scene, the model is able to generalize and choose the best order of operations to cause the least damage.
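The notion of using disturbance paths as a cost can be sketched as follows. In [131] the consequences are predicted by a learned model; here they are replaced by a hand-written, hypothetical table (`DISTURBANCE`) and the best order is found by exhaustive search, which is only feasible for a few objects. All object names and values are made up for illustration.

```python
from itertools import permutations

# Hypothetical predicted path length (in metres) that `other` would travel if
# `moved` is removed while `other` is still in the scene.
# Pairs that are not listed are assumed to cause no disturbance.
DISTURBANCE = {
    ("tray", "plate"): 0.30,
    ("tray", "cup"): 0.45,
    ("plate", "cup"): 0.20,
}


def order_cost(order):
    """Total disturbance caused by removing the objects in the given order."""
    total = 0.0
    for i, moved in enumerate(order):
        for other in order[i + 1:]:  # objects still present when `moved` is taken
            total += DISTURBANCE.get((moved, other), 0.0)
    return total


def best_order(objects):
    """Exhaustively pick the removal order with the lowest total disturbance."""
    return min(permutations(objects), key=order_cost)
```

With a cup on a plate on a tray, `best_order(["tray", "plate", "cup"])` unstacks from the top, since removing the cup first and the tray last disturbs nothing in this toy table.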

Fig. 13: An example of defining a set of environmental constraints to prevent the robot from collisions with the environment in the serve_a_coke scenario.

Avoiding damage and reducing planning load are both important steps towards fast and safe robotics. Nevertheless, not much work seems to be put into reducing the complexity of collision checking itself. For SCA, neural networks were trained to learn the boundaries of self-collision, which resulted in a fast method of avoiding bad joint configurations. Such a system could be set up to do a similar thing for visible objects with, for example, a stream of point clouds as input. Similar to affordances for grasping, humans seem to be able to quickly ascertain affordances for arm and manipulator movement and orientation based on what they see and know. For service robots, this means they should be able to operate on a similar level, or at least fast enough to perform the various tasks we require of them. Towards this goal, Qureshi et al. proposed MPNet, a learning-based neural planner for solving motion planning problems [135, 136]. In particular, they presented a recursive algorithm that receives environment information as point clouds, together with the robot’s initial and desired goal configurations, and generates a connectable path as output. Other studies have also demonstrated that segmenting motion tasks can reduce the complexity that the lower modules have to deal with [137], whether by distinguishing between individual tasks based on relations of touch [138] or by embedding scores or attractors directly into a robotic joint feature space [139]. Simplifying the problem of motion with several degrees of freedom allows for a greater focus on the other parts.

V-C Coordination and dexterity

For humans, doing something with “one hand behind their back” is seen as a challenge. Similarly for robots, doing a task with only one gripper is often possible, but may be more difficult. Furthermore, some tasks may even be impossible to perform. Coordinated and uncoordinated motion of two robotic arms is discussed in [130], but in a factory setting instead of a household one. As household situations usually deal with objects of a smaller size, the self collision distance thresholds are set a bit higher than would be required. This highlights one of the potential problems with multi-arm coordination in household settings. Robotic hands may have to work very closely together, to the point of touching. As of yet, very little research seems to include this as a point of interest. While it is possible to manipulate objects one at a time in a cluttered environment [106] and many actions are possible with one arm, there are cases where extra dexterity is required.

In the same vein, there is also room for improvement when it comes to manipulating objects that are already being grasped. Preventing an object from slipping from a gripper [128] and minor manipulation with fingers while grasping [129] have already been somewhat explored. These all lay the foundation for advanced manipulation, which remains mostly out of focus in favor of the act of grasping itself. Nevertheless, advanced manipulation is a requirement for the most delicate of tasks. As an example, correctly cracking open an egg without any tools is difficult even for humans, but a service robot may at one point be asked to do this or similar difficult tasks. For this reason, advanced manipulation still requires more research.

VI Manipulation in Human-Robot Shared Environments

A lot of research has gone into enabling robots to deal with a wide variety of environments, both static and dynamic. While static environments have been mastered quite well, the navigation of dynamic environments, especially those shared with human agents, is currently a hot topic in the field. Path and trajectory planning are essential components of object manipulation, especially if the robot is required to affect the environment past its immediate reach. This means that it must be able to relocate itself in, at worst, a highly dynamic environment with one or more other agents that may act unpredictably. We can further break down this problem into local and global navigation. Global navigation deals with planning towards a goal or objective that is not currently in the range of the robot’s perception, while local navigation concerns navigation through the immediately perceivable space around the robot [140]. Global navigation and path planning have been largely tackled to a satisfactory degree and should generally be able to converge to an optimal global solution [141].

One of the focuses in current developments is on local path planning and obstacle avoidance in dynamic environments. Within this task, we can further distinguish varying levels of application. Lower-level concepts, such as combining color data with depth sensing to get additional information about the environment [142, 143], can be used for collision avoidance. A slightly higher-level task is creating a collision risk map in order to navigate the environment more effectively. These risk maps provide some insight into possible future states from the current one [140, 144]. Instead of explicitly modelling the collision risk, an artificial neural network can be trained to abstract the risk into higher- or lower-confidence path solutions [145].
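As a rough illustration of what a collision risk map encodes, the sketch below fills a grid with risk values derived from constant-velocity predictions of moving obstacles. It is not the method of [140] or [144]; the grid representation, the `horizon` and `decay` parameters and the function names are assumptions made for this example.

```python
def risk_map(width, height, obstacles, horizon=3, decay=0.5):
    """Grid of collision risk from constant-velocity obstacle predictions.

    obstacles: list of ((x, y), (vx, vy)) in cells and cells per time step.
    A cell occupied now gets risk 1.0; a cell predicted to be occupied k steps
    ahead gets decay**k. Where predictions overlap, the maximum risk is kept.
    """
    grid = [[0.0] * width for _ in range(height)]
    for (x, y), (vx, vy) in obstacles:
        for k in range(horizon + 1):
            cx, cy = x + k * vx, y + k * vy
            if 0 <= cx < width and 0 <= cy < height:
                grid[cy][cx] = max(grid[cy][cx], decay ** k)
    return grid


def path_risk(grid, path):
    """Highest risk encountered along a path given as a list of (x, y) cells."""
    return max(grid[y][x] for x, y in path)
```

A local planner could then compare candidate paths by `path_risk` and prefer the one that stays out of cells an obstacle is likely to enter soon.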

When considering a robot’s movement in human-robot shared environments, many of the lower-level functions and considerations have been worked out already, so the current state of the art instead focuses on higher-level control philosophies. For truly robust trajectory planning in such a dynamic environment, the socially aware control model should be able to account for unexpected events, such as groups or individuals moving away from or avoiding unmodeled obstacles. Expanding on [146] and the concept of prioritizing human agents in an environment, we can see a shift towards increasingly socially based models, incorporating social force models (SFM) [147] to develop a system of socially reactive control (SRC) [148]. Such a system proposes to take into account not only single humans, but also groups of them and their collective motion for mobile manipulators, which can be used in stationary manipulation scenarios as well. This covers group dynamics such as group motion and the centres and sizes of groups. Additionally, it is proposed to gain some basic understanding of what the human agent is doing at a given time and what they are interacting with, and to use that information to further understand what is likely to happen in the environment around the robot.
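The following sketch shows the general shape of a social-force computation in which a group is treated as a single unit. It uses the common form of the social force model (a driving term towards the goal plus exponentially decaying repulsion) and is not the SRC controller of [148]; the parameter values, the disc approximation of a group and all function names are assumptions.

```python
import math


def goal_force(pos, vel, goal, desired_speed=1.0, tau=0.5):
    """Driving term: relax the current velocity towards the desired one."""
    dx, dy = goal[0] - pos[0], goal[1] - pos[1]
    d = math.hypot(dx, dy) or 1.0
    return ((desired_speed * dx / d - vel[0]) / tau,
            (desired_speed * dy / d - vel[1]) / tau)


def repulsion(pos, centre, radius, strength=2.0, falloff=0.5):
    """Repulsive term that decays exponentially with distance to a disc."""
    dx, dy = pos[0] - centre[0], pos[1] - centre[1]
    d = math.hypot(dx, dy) or 1e-9
    mag = strength * math.exp((radius - d) / falloff)
    return (mag * dx / d, mag * dy / d)


def group_disc(members, personal_space=0.5):
    """Approximate a group by a disc around its centroid (an assumption)."""
    cx = sum(m[0] for m in members) / len(members)
    cy = sum(m[1] for m in members) / len(members)
    r = max(math.hypot(m[0] - cx, m[1] - cy) for m in members) + personal_space
    return (cx, cy), r


def social_force(pos, vel, goal, groups):
    """Sum of the goal force and the repulsion from every group disc."""
    fx, fy = goal_force(pos, vel, goal)
    for members in groups:
        centre, radius = group_disc(members)
        rx, ry = repulsion(pos, centre, radius)
        fx, fy = fx + rx, fy + ry
    return fx, fy
```

A single human is simply a group with one member, so the same function covers both cases; the disc grows with the spread of the group, which keeps the robot from planning a path between its members.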

An example given in [148] shows a human interacting with an object of interest and a second human facing the first as if to approach them. If both continue in a straight line, either the robot would cross the path of the human or the human would cross the robot’s path. Additionally, in the path to the goal are two humans who clearly form another distinct group. The socially reactive model of control will attempt to avoid the group as a whole instead of attempting to pass through it, even if there is enough space to do so. While group dynamics can be useful to model in order to avoid collisions or interactions with groups of humans, the model of navigation that considers humans as single independent agents, rather than trying to infer groups, is still very relevant. In this case, the state of the art proposes a change from a flat pre-calculated confidence for the trajectory of a human to a Bayesian one [149], which is constantly updated. This would allow for unobtrusive navigation around agents that act entirely unpredictably, whether they are intrinsically unpredictable or are reacting to features of the environment that are not modelled by the robot, which makes a subsequent action to avoid an obstacle or object unpredictable from the robot’s point of view. Such a motion plan is successful at avoiding human agents because it is far more conservative in the planning stage, considering a much wider area around a human to be inaccessible. In essence, it is similar to the earlier introduced concept of a virtual “force-field” around a human agent [146]; however, it is not modelled explicitly as such. This provides a slightly more adaptable framework in which any other agent in motion, human or not, can be avoided by not making too strong an assumption about its intended direction of motion or goal.
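One way to picture a constantly updated confidence is a Bayesian belief over how well a simple goal-directed model explains the human's observed motion. The sketch below assumes a noisily rational (Boltzmann) action model on a grid with a discrete set of confidence values `beta`; this is a generic construction chosen for illustration, and the details of [149] may differ. All names and values are hypothetical.

```python
import math

ACTIONS = [(1, 0), (0, 1), (-1, 0), (0, -1)]  # unit grid moves (assumed action set)


def action_likelihoods(pos, goal, beta):
    """P(action | beta) proportional to exp(-beta * distance to goal after it)."""
    scores = [-beta * math.dist((pos[0] + ax, pos[1] + ay), goal)
              for ax, ay in ACTIONS]
    m = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [w / z for w in weights]


def update_belief(belief, pos, goal, observed):
    """Bayes rule over the confidence values in `belief` (dict: beta -> prob)."""
    idx = ACTIONS.index(observed)
    post = {b: p * action_likelihoods(pos, goal, b)[idx] for b, p in belief.items()}
    z = sum(post.values())
    return {b: p / z for b, p in post.items()}


def predict(belief, pos, goal):
    """Next-action distribution, marginalised over the confidence belief."""
    out = [0.0] * len(ACTIONS)
    for b, p in belief.items():
        for i, lik in enumerate(action_likelihoods(pos, goal, b)):
            out[i] += p * lik
    return out
```

When the human moves away from the assumed goal, the belief shifts towards the low-confidence value and `predict` spreads its probability more evenly over all moves, which is what makes the resulting plan more conservative around agents the model does not explain well.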

Another interesting thing to consider in human-robot shared environments is that all paths to a certain goal may be blocked, even by very light or easily movable objects. Should a robot, in its navigation, consider whether an object can be pushed aside without damage to itself or the environment? For example, if a sheet of cardboard slips from a shelf and blocks the robot’s path, can the robot simply push through it, given that doing so would damage neither the robot nor the rest of the environment? Navigation has been considered as a tool to aid a robot’s vision and perception, especially in crowded scenarios, and generally serves the purpose of moving the robot to a location where it can manipulate the environment using its gripper or other manipulation tool. However, should navigation itself be considered a valid method for manipulating the environment, using the frame of the robot itself?

Another concept to consider is something more akin to mapping. So far, we have discussed path and trajectory planning (i.e., navigation) as a means to bring a manipulator to a desired location in a dynamic environment, but we can instead consider another intermediary task. The camera of a robot rarely has more than the three rotational degrees of freedom, so any planar motion that is desired needs to be provided by the robot itself through navigation. Yervilla-Herrera et al. [150] proposed a use case for navigation which combines some basic static obstacle avoidance with the principles of object reconstruction using methods like shape from motion. In this case, the goal of the navigation is, in fact, to provide better or more complete sensory information to the robot, rather than navigating to a specific point in space. This may be useful when an object in a pile is more easily detected or manipulated from a different angle of approach [77, 22], or simply to gain a better understanding of the overall shape of an object. As mentioned in Section III, the better the robot’s ability to perceive an object, the easier it is to manipulate.

VII Conclusion and Future Work

In recent years, many great developments have been made in the field of service robotics. It can be seen that a lot of recent developments are due to the use of more complex machine learning techniques, such as (deep) neural networks, and are based on large amounts of data. While this leads to continuous improvements on the tasks themselves, large issues remain with real-world applicability due to time complexity issues. Service robots also struggle severely in unknown environments, lacking open-ended learning about object categories and scenes, a map for global navigation, reliable object recognition for local navigation, and running into collision issues while manipulating objects. After these robotic tasks are solved in experimental setups, the focus will need to shift from solutions, to applicability, by reducing complexity and implementing more open-ended learning techniques.

As reviewed in this paper, several major issues and hurdles are solved almost entirely. It is shown that path planning is close to completion in a reliable, known environment, as is grasp planning. Similarly, object recognition, which feeds into both these types of planning, is also approaching near perfect scores when applied to objects in a predetermined setting.

Current issues in object manipulation can be boiled down to avoiding collision. This means mostly dealing with dynamic and shared human-robot environments by developing better methods for local planning. Some solutions that have been developed recently concern themselves with humans, or even groups of humans. Using group dynamics, a group of humans or other agents can temporarily be seen as a single unit for the purposes of trajectory prediction. Whether tracking a single human or a group, trajectory prediction has also been developed further through Bayesian trajectory calculations. This gives the robot a range of possibilities for where an entity might move next, and the robot can use these probabilities to plan its own path. Besides, self-collision avoidance is also addressed in recent works. New approaches have been suggested which are more adaptable and can adjust a grasp movement during execution. These adaptive approaches have a higher tendency to get stuck, however. The use of a support vector machine (SVM) [130] resulted in adaptable grasps that sometimes ran into singularities; later work using a neural net for motion planning [136] resolved these issues.

VII-A Trade-offs

While some areas see considerable linear improvements, other areas suffer from one of two situations: either an improvement in one criterion, such as accuracy, proves to be a setback in another criterion, such as speed, or two approaches are developed side by side and continuously surpass each other, without a clear best approach to the task.

The first category is mainly concerning planning modules. When a planning module becomes more sophisticated, usually, the complexity of the constraints increases. This causes the calculations to be much more difficult, almost inevitably affecting the time it takes to find a proper solution. As such, in both navigation and grasp planning, continuous issues arise when trying to apply new methods in real environments. While some of these computational burdens eventually even out due to increases in hardware capacity, other times specific research needs to be done to reduce time complexity.

The second category is seen across many different areas of service robotics. In object perception, it applies to object representation, differentiating between object descriptors that are either hand-crafted or trained by a neural network. It can also be seen in how to view the environment, where approaches based on bounding boxes compete with approaches using image segmentation. Finally, in more recent work on object perception, the problem of cluttered areas is addressed, where multiple objects are in a pile. Here the distinction is made between using a grasping module to relocate the items before identifying them, or to use segmentation to label the partially occluded objects [151].

In grasping, certain trade-offs from this second category can also be observed. With the problem of damage reduction due to collision, some approaches try to limit the movements made by a manipulator such that the chance of an accident is limited. Other, more dynamic solutions will try to predict possibly damaging results and use this damage as a parameter or constraint in grasp planning. Finally, when constructing the grasp movements, there is no consensus on whether grasps should be based on point clouds, segmented by a rule-based system, or by using neural nets to generate grasps. From this list of trade-offs, it can be seen that the field of robotics delivers far from a unified solution to most issues. New approaches are continuously developed, old approaches are reinvigorated and improved and even opposing, but similarly effective approaches are found.

VII-B Future Work

While new developments are made frequently, some issues are either not solved or largely lacking research. For object manipulation, a suggested direction is to focus on different control philosophies for local planning. While some, such as Bayesian trajectory prediction, are already being developed, this is the area where navigation has most to gain. Some suggest that navigation can also be used to have a robot capable of learning online, basically using navigation to explore. This would mean a robot needs to determine the most probable locations for a given object in a household environment or part of a map, and use navigation planning to go to those locations to observe and manipulate the object.

A different issue remaining for the grasping task is that of increasing complexity and dependency on data. Many models are currently trained on large sets of data from previous grasps, or on data concerning specific objects. This means that open-ended learning, as with the object perception task, is still lacking. To the best of the authors’ knowledge, object detection for manipulation has largely been used for small tabletop items. However, the robot should also be able to autonomously manipulate larger objects like (wheel) chairs, as well as partly fixed objects like cabinet doors or windows. To this end, it is important to appropriately extend 3D object recognition and affordance prediction to furniture, doors and windows.

For object perception, the list of unresolved issues is longer. It contains, among other things, dealing with objects with reflective surfaces and dealing with large objects. In order to solve the more difficult corner cases, several suggestions have been made. One such suggestion is to incorporate other forms of perception, such as tactile, into object recognition. In order to be less data dependent, the area of object perception will need to put a bigger emphasis on open-ended learning, allowing a robot to learn new objects while performing tasks. The final addition to object perception overlaps with grasping as it has to do with affordance predictions. Affordances have been used in both perception and grasping for the identification of objects. As an additional use, grasping affordances can be extracted from images to quickly find suitable grasping locations and orientations. It may be possible to extract other types of affordances, such as ease of movement for terrain or available space to move in 3D or joint spaces. As affordances are already used for grasping with single grippers, they may be applied to the case of double-armed robots as well. As little work has been done on the use of multiple grippers simultaneously in the household, this seems like a worthwhile direction.


  • [1] S. Spasova, R. Baeten, S. Coster, D. Ghailani, R. Peña-Casas, and B. Vanhercke, “Challenges in long-term care in europe,” A study of national policies, European Social Policy Network (ESPN), Brussels: European Commission, 2018.
  • [2] R. Memmesheimer, V. Seib, and D. Paulus, “homer@ unikoblenz: winning team of the robocup@ home open platform league 2017,” in Robot World Cup.    Springer, 2017, pp. 509–520.
  • [3] M. Oliveira, G. H. Lim, L. Seabra Lopes, S. H. Kasaei, A. M. Tomé, and A. Chauhan, “A perceptual memory system for grounding semantic representations in intelligent service robots,” in 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems.    IEEE, 2014, pp. 2216–2223.
  • [4] J. S. B. Evans, “Dual-processing accounts of reasoning, judgment, and social cognition,” Annu. Rev. Psychol., vol. 59, pp. 255–278, 2008.
  • [5] C. Leroux, O. Lebec, M. B. Ghezala, Y. Mezouar, L. Devillers, C. Chastagnol, J.-C. Martin, V. Leynaert, and C. Fattal, “Armen: Assistive robotics to maintain elderly people in natural environment,” IRBM, vol. 34, no. 2, pp. 101–107, 2013.
  • [6] A. Jain and C. C. Kemp, “EL-E: an assistive mobile manipulator that autonomously fetches objects from flat surfaces,” Autonomous Robots, vol. 28, no. 1, p. 45, 2010.
  • [7] S. Srinivasa, D. Ferguson, J. M. Vandeweghe, R. Diankov, D. Berenson, C. Helfrich, and K. Strasdat, “The robotic busboy: Steps towards developing a mobile robotic home assistant,” in Proceedings of International Conference on Intelligent Autonomous Systems, July 2008.
  • [8] M. Beetz, U. Klank, I. Kresse, A. Maldonado, L. Mösenlechner, D. Pangercic, T. Rühr, and M. Tenorth, “Robotic roommates making pancakes,” in 2011 11th IEEE-RAS International Conference on Humanoid Robots.    IEEE, 2011, pp. 529–536.
  • [9] J. Englsberger, A. Werner, C. Ott, B. Henze, M. A. Roa, G. Garofalo, R. Burger, A. Beyer, O. Eiberger, K. Schmid et al., “Overview of the torque-controlled humanoid robot TORO,” in 2014 IEEE-RAS International Conference on Humanoid Robots.    IEEE, 2014, pp. 916–923.
  • [10] N. G. Tsagarakis, D. G. Caldwell, F. Negrello, W. Choi, L. Baccelliere, V.-G. Loc, J. Noorden, L. Muratore, A. Margan, A. Cardellino et al., “Walk-Man: A high-performance humanoid platform for realistic environments,” Journal of Field Robotics, vol. 34, no. 7, pp. 1225–1259, 2017.
  • [11] T. Asfour, M. Wächter, L. Kaul, S. Rader, P. Weiner, S. Ottenhaus, R. Grimm, Y. Zhou, M. Grotz, and F. Paus, “Armar-6: A high-performance humanoid for human-robot collaboration in real world scenarios,” IEEE Robotics & Automation Magazine, vol. 26, no. 4, pp. 108–121, 2019.
  • [12] M. Fuchs, C. Borst, P. R. Giordano, A. Baumann, E. Kraemer, J. Langwald, R. Gruber, N. Seitz, G. Plank, K. Kunze et al., “Rollin’justin-design considerations and realization of a mobile platform for a humanoid upper body,” in 2009 IEEE International Conference on Robotics and Automation.    IEEE, 2009, pp. 4131–4137.
  • [13] R. Wood, P. Baxter, and T. Belpaeme, “A review of long-term memory in natural and synthetic systems,” Adaptive Behavior, vol. 20, no. 2, pp. 81–103, 2012.
  • [14] S. Harnad, “To cognize is to categorize: Cognition is categorization,” in Handbook of categorization in cognitive science.    Elsevier, 2017, pp. 21–54.
  • [15] M. A. Lebedev and S. P. Wise, “Insights into seeing and grasping: distinguishing the neural correlates of perception and action,” Behavioral and cognitive neuroscience reviews, vol. 1, no. 2, pp. 108–129, 2002.
  • [16] K. Illeris, “A comprehensive understanding of human learning,” in Contemporary theories of learning.    Routledge, 2018, pp. 1–14.
  • [17] D. Skočaj, A. Vrečko, M. Mahnič, M. Janíček, G.-J. M. Kruijff, M. Hanheide, N. Hawes, J. L. Wyatt, T. Keller, K. Zhou et al., “An integrated system for interactive continuous learning of categorical knowledge,” Journal of Experimental & Theoretical Artificial Intelligence, vol. 28, no. 5, pp. 823–848, 2016.
  • [18] T. Fäulhammer, R. Ambruş, C. Burbridge, M. Zillich, J. Folkesson, N. Hawes, P. Jensfelt, and M. Vincze, “Autonomous learning of object models on a mobile robot,” IEEE Robotics and Automation Letters, vol. 2, no. 1, pp. 26–33, 2016.
  • [19] S. S. Srinivasa, D. Ferguson, C. J. Helfrich, D. Berenson, A. Collet, R. Diankov, G. Gallagher, G. Hollinger, J. Kuffner, and M. V. Weghe, “HERB: a home exploring robotic butler,” Autonomous Robots, vol. 28, no. 1, p. 5, 2010.
  • [20] J. Hertzberg, J. Zhang, L. Zhang, S. Rockel, B. Neumann, J. Lehmann, K. S. Dubba, A. G. Cohn, A. Saffiotti, F. Pecora et al., “The race project,” KI-Künstliche Intelligenz, vol. 28, no. 4, pp. 297–304, 2014.
  • [21] M. Oliveira, L. Seabra Lopes, G. H. Lim, S. H. Kasaei, A. M. Tomé, and A. Chauhan, “3D object perception and perceptual learning in the race project,” Robotics and Autonomous Systems, vol. 75, pp. 614–626, 2016.
  • [22] J. Sock, S. Hamidreza Kasaei, L. Seabra Lopes, and T.-K. Kim, “Multi-view 6D object pose estimation and camera motion planning using rgbd images,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2017, pp. 2228–2235.
  • [23] C. Szegedy, A. Toshev, and D. Erhan, “Deep neural networks for object detection,” in Advances in neural information processing systems, 2013, pp. 2553–2561.
  • [24] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
  • [25] Z.-Q. Zhao, P. Zheng, S.-t. Xu, and X. Wu, “Object detection with deep learning: A review,” IEEE transactions on neural networks and learning systems, 2019.
  • [26] S. H. Kasaei, M. Oliveira, G. H. Lim, L. Seabra Lopes, and A. M. Tomé, “Towards lifelong assistive robotics: A tight coupling between object perception and manipulation,” Neurocomputing, vol. 291, pp. 151–166, 2018.
  • [27] S. H. Kasaei, A. M. Tomé, L. Seabra Lopes, and M. Oliveira, “GOOD: A global orthographic object descriptor for 3D object recognition and manipulation,” Pattern Recognition Letters, vol. 83, pp. 312–320, 2016.
  • [28] H. Kasaei, “OrthographicNet: A deep learning approach for 3D object recognition in open-ended domains,” arXiv preprint arXiv:1902.03057, 2019.
  • [29] S. H. Kasaei, M. Oliveira, G. H. Lim, L. Seabra Lopes, and A. M. Tomé, “Interactive open-ended learning for 3D object recognition: An approach and experiments,” Journal of Intelligent & Robotic Systems, vol. 80, no. 3-4, pp. 537–553, 2015.
  • [30] Q. Zhan, Y. Liang, and Y. Xiao, “Color-based segmentation of point clouds,” Laser scanning, vol. 38, no. 3, pp. 155–161, 2009.
  • [31] V. Jumb, M. Sohani, and A. Shrivas, “Color image segmentation using k-means clustering and otsu’s adaptive thresholding,” International Journal of Innovative Technology and Exploring Engineering (IJITEE), vol. 3, no. 9, pp. 72–76, 2014.
  • [32] O. P. Verma, M. Hanmandlu, S. Susan, M. Kulkarni, and P. K. Jain, “A simple single seeded region growing algorithm for color image segmentation using adaptive thresholding,” in 2011 International Conference on Communication Systems and Network Technologies.    IEEE, 2011, pp. 500–503.
  • [33] L. Chang, J. R. Smith, and D. Fox, “Interactive singulation of objects from a pile,” in 2012 IEEE International Conference on Robotics and Automation.    IEEE, 2012, pp. 3875–3882.
  • [34] H. Van Hoof, O. Kroemer, and J. Peters, “Probabilistic segmentation and targeted exploration of objects in cluttered environments,” IEEE Transactions on Robotics, vol. 30, no. 5, pp. 1198–1209, 2014.
  • [35] M. Gupta and G. S. Sukhatme, “Using manipulation primitives for brick sorting in clutter,” in 2012 IEEE International Conference on Robotics and Automation.    IEEE, 2012, pp. 3883–3889.
  • [36] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
  • [37] C. Choy, J. Gwak, and S. Savarese, “4D spatio-temporal convnets: Minkowski convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3075–3084.
  • [38] B. W. Kim, Y. Park, and I. H. Suh, “Integration of top-down and bottom-up visual processing using a recurrent convolutional–deconvolutional neural network for semantic segmentation,” Intelligent Service Robotics, pp. 1–11, 2019.
  • [39] G. L. Oliveira, C. Bollen, W. Burgard, and T. Brox, “Efficient and robust deep networks for semantic segmentation,” The International Journal of Robotics Research, vol. 37, no. 4-5, pp. 472–491, 2018.
  • [40] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, pp. 834–848, 2017.
  • [41] K. Mo, S. Zhu, A. X. Chang, L. Yi, S. Tripathi, L. J. Guibas, and H. Su, “PartNet: A large-scale benchmark for fine-grained and hierarchical part-level 3D object understanding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 909–918.
  • [42] X. Liang, L. Lin, Y. Wei, X. Shen, J. Yang, and S. Yan, “Proposal-free network for instance-level object segmentation,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 12, pp. 2978–2991, 2017.
  • [43] Y. Xu, T. Fan, M. Xu, L. Zeng, and Y. Qiao, “SpiderCNN: Deep learning on point sets with parameterized convolutional filters,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 87–102.
  • [44] Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen, “PointCNN: Convolution on x-transformed points,” in Advances in neural information processing systems, 2018, pp. 820–830.
  • [45] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “PointNet: deep learning on point sets for 3D classification and segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 652–660.
  • [46] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “PointNet++: Deep hierarchical feature learning on point sets in a metric space,” in Advances in neural information processing systems, 2017, pp. 5099–5108.
  • [47] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska et al., “Overcoming catastrophic forgetting in neural networks,” Proceedings of the national academy of sciences, vol. 114, no. 13, pp. 3521–3526, 2017.
  • [48] R. Kemker, M. McClure, A. Abitino, T. L. Hayes, and C. Kanan, “Measuring catastrophic forgetting in neural networks,” in Thirty-second AAAI conference on artificial intelligence, 2018.
  • [49] S. S. Sajjan, M. Moore, M. Pan, G. Nagaraja, J. Lee, A. Zeng, and S. Song, “Cleargrasp:3D shape estimation of transparent objects for manipulation,” in 2020 IEEE International Conference on Robotics and Automation (ICRA), 2020.
  • [50] B. Hariharan and R. Girshick, “Low-shot visual recognition by shrinking and hallucinating features,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 3018–3027.
  • [51] M. Ullrich, H. Ali, M. Durner, Z.-C. Márton, and R. Triebel, “Selecting CNN features for online learning of 3D objects,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).    IEEE, 2017, pp. 5086–5091.
  • [52] S. Gidaris and N. Komodakis, “Dynamic few-shot visual learning without forgetting,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4367–4375.
  • [53] B. Oreshkin, P. R. López, and A. Lacoste, “TADAM: Task dependent adaptive metric for improved few-shot learning,” in Advances in Neural Information Processing Systems, 2018, pp. 721–731.
  • [54] B. Krawczyk and M. Woźniak, “One-class classifiers with incremental learning and forgetting for data streams with concept drift,” Soft Computing, vol. 19, no. 12, pp. 3387–3400, 2015.
  • [55] M. Ristin, M. Guillaumin, J. Gall, and L. Van Gool, “Incremental learning of NCM forests for large-scale image classification,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 3654–3661.
  • [56] M. Tschannen, O. Bachem, and M. Lucic, “Recent advances in autoencoder-based representation learning,” arXiv preprint arXiv:1812.05069, 2018.
  • [57] Y. Zhao, T. Birdal, H. Deng, and F. Tombari, “3D point capsule networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1009–1018.
  • [58] M. Oliveira, L. Seabra Lopes, G. H. Lim, S. H. Kasaei, A. D. Sappa, and A. M. Tomé, “Concurrent learning of visual codebooks and object categories in open-ended domains,” in 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).    IEEE, 2015, pp. 2488–2495.
  • [59] S. H. Kasaei, M. Oliveira, G. H. Lim, L. Seabra Lopes, and A. M. Tomé, “An adaptive object perception system based on environment exploration and Bayesian learning,” in 2015 IEEE International Conference on Autonomous Robot Systems and Competitions.    IEEE, 2015, pp. 221–226.
  • [60] S. H. M. Kasaei, L. Seabra Lopes, and A. M. Tomé, “Local LDA: Open-ended learning of latent topics for 3D object recognition,” IEEE transactions on pattern analysis and machine intelligence (PAMI), 2019.
  • [61] S. H. Kasaei, A. M. Tomé, and L. Seabra Lopes, “Hierarchical object representation for open-ended object category learning and recognition,” in Advances in Neural Information Processing Systems, 2016, pp. 1948–1956.
  • [62] S. H. Kasaei, L. Seabra Lopes, and A. M. Tomé, “Concurrent 3D object category learning and recognition based on topic modelling and human feedback,” in 2016 International Conference on Autonomous Robot Systems and Competitions (ICARSC).    IEEE, 2016, pp. 329–334.
  • [63] R. B. Rusu, G. Bradski, R. Thibaux, and J. Hsu, “Fast 3D recognition and pose using the viewpoint feature histogram,” in 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems.    IEEE, 2010, pp. 2155–2162.
  • [64] R. B. Rusu, Z. C. Marton, N. Blodow, and M. Beetz, “Learning informative point classes for the acquisition of object model maps,” in 2008 10th International Conference on Control, Automation, Robotics and Vision.    IEEE, 2008, pp. 643–650.
  • [65] R. B. Rusu, N. Blodow, and M. Beetz, “Fast point feature histograms (FPFH) for 3D registration,” in 2009 IEEE International Conference on Robotics and Automation.    IEEE, 2009, pp. 3212–3217.
  • [66] W. Wohlkinger and M. Vincze, “Ensemble of shape functions for 3D object classification,” in 2011 IEEE international conference on robotics and biomimetics.    IEEE, 2011, pp. 2987–2992.
  • [67] S.-H. Cha, “Comprehensive survey on distance/similarity measures between probability density functions,” City, vol. 1, no. 2, p. 1, 2007.
  • [68] S. H. Kasaei, M. Ghorbani, J. Schilperoort, and W. van der Rest, “Investigating the importance of shape features color constancy color spaces and similarity measures in open-ended 3D object recognition,” arXiv preprint arXiv:2002.03779, 2020.
  • [69] S. H. Kasaei, “Look further to recognize better: Learning shared topics and category-specific dictionaries for open-ended 3D object recognition,” arXiv preprint arXiv:1907.12924, 2019.
  • [70] S. Gao, I. W.-H. Tsang, and Y. Ma, “Learning category-specific dictionary and shared dictionary for fine-grained image categorization,” IEEE Transactions on Image Processing, vol. 23, no. 2, pp. 623–634, 2013.
  • [71] Y. Zhang, X.-S. Wei, J. Wu, J. Cai, J. Lu, V.-A. Nguyen, and M. N. Do, “Weakly supervised fine-grained categorization with part-based image representation,” IEEE Transactions on Image Processing, vol. 25, no. 4, pp. 1713–1725, 2016.
  • [72] T. Lüddecke, T. Kulvicius, and F. Wörgötter, “Context-based affordance segmentation from 2D images for robot actions,” Robotics and Autonomous Systems, 2019.
  • [73] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in European conference on computer vision.    Springer, 2014, pp. 740–755.
  • [74] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. J. Belongie, “Objects in context.” in ICCV, vol. 1, no. 2.    Citeseer, 2007, p. 5.
  • [75] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille, “The role of context for object detection and semantic segmentation in the wild,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 891–898.
  • [76] M. Mauro, H. Riemenschneider, A. Signoroni, R. Leonardi, and L. Van Gool, “A unified framework for content-aware view selection and planning through view importance,” Proceedings BMVC 2014, pp. 1–11, 2014.
  • [77] S. H. Kasaei, J. Sock, L. Seabra Lopes, A. M. Tomé, and T.-K. Kim, “Perceiving, learning, and recognizing 3D objects: An approach to cognitive service robots,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [78] B. Li, Y. Lu, and H. Johan, “Sketch-based 3D model retrieval by viewpoint entropy-based adaptive view clustering,” in Proceedings of the Sixth Eurographics Workshop on 3D Object Retrieval.    Eurographics Association, 2013, pp. 49–56.
  • [79] A. Doumanoglou, R. Kouskouridas, S. Malassiotis, and T.-K. Kim, “Recovering 6D object pose and predicting next-best-view in the crowd,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3583–3592.
  • [80] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, “3D shapenets: A deep representation for volumetric shapes,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1912–1920.
  • [81] B. E. Stein and M. A. Meredith, The merging of the senses.    The MIT Press, 1993.
  • [82] M. O. Ernst and H. H. Bülthoff, “Merging the senses into a robust percept,” Trends in cognitive sciences, vol. 8, no. 4, pp. 162–169, 2004.
  • [83] M. A. Eckert, N. V. Kamdar, C. E. Chang, C. F. Beckmann, M. D. Greicius, and V. Menon, “A cross-modal system linking primary auditory and visual cortices: Evidence from intrinsic fMRI connectivity analysis,” Human brain mapping, vol. 29, no. 7, pp. 848–857, 2008.
  • [84] S. Luo, J. Bimbo, R. Dahiya, and H. Liu, “Robotic tactile perception of object properties: A review,” Mechatronics, vol. 48, pp. 54–67, 2017.
  • [85] C. Kertész and M. Turunen, “Common sounds in bedrooms (CSIBE) corpora for sound event recognition of domestic robots,” Intelligent Service Robotics, vol. 11, no. 4, pp. 335–346, 2018.
  • [86] R. Gao and K. Grauman, “2.5D visual sound,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 324–333.
  • [87] ——, “Co-separating sounds of visual objects,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 3879–3888.
  • [88] R. Gao, R. Feris, and K. Grauman, “Learning to separate object sounds by watching unlabeled video,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 35–53.
  • [89] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein, “Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation,” arXiv preprint arXiv:1804.03619, 2018.
  • [90] H. Zhao, C. Gan, A. Rouditchenko, C. Vondrick, J. McDermott, and A. Torralba, “The sound of pixels,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 570–586.
  • [91] H. Zhao, C. Gan, W.-C. Ma, and A. Torralba, “The sound of motions,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 1735–1744.
  • [92] F. Ficuciello, “Hand-arm autonomous grasping: Synergistic motions to enhance the learning process,” Intelligent Service Robotics, vol. 12, no. 1, pp. 17–25, 2019.
  • [93] A. T. Miller and P. K. Allen, “GraspIt! a versatile simulator for robotic grasping,” IEEE Robotics Automation Magazine, vol. 11, no. 4, pp. 110–122, 2004.
  • [94] M. Kopicki, R. Detry, M. Adjigble, R. Stolkin, A. Leonardis, and J. L. Wyatt, “One-shot learning and generation of dexterous grasps for novel objects,” The International Journal of Robotics Research, vol. 35, no. 8, pp. 959–976, 2016.
  • [95] J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea, and K. Goldberg, “Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics,” arXiv preprint arXiv:1703.09312, 2017.
  • [96] J. Bohg, A. Morales, T. Asfour, and D. Kragic, “Data-driven grasp synthesis—a survey,” IEEE Transactions on Robotics, vol. 30, no. 2, pp. 289–309, 2013.
  • [97] A. Sahbani, S. El-Khoury, and P. Bidaud, “An overview of 3D object grasp synthesis algorithms,” Robotics and Autonomous Systems, vol. 60, no. 3, pp. 326–336, 2012.
  • [98] E. Johns, S. Leutenegger, and A. J. Davison, “Deep learning a grasp function for grasping under gripper pose uncertainty,” in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).    IEEE, 2016, pp. 4461–4468.
  • [99] I. Lenz, H. Lee, and A. Saxena, “Deep learning for detecting robotic grasps,” The International Journal of Robotics Research, vol. 34, no. 4-5, pp. 705–724, 2015.
  • [100] J. Mahler, F. T. Pokorny, B. Hou, M. Roderick, M. Laskey, M. Aubry, K. Kohlhoff, T. Kröger, J. Kuffner, and K. Goldberg, “Dex-Net 1.0: A cloud-based network of 3D objects for robust grasp planning using a multi-armed bandit model with correlated rewards,” in 2016 IEEE international conference on robotics and automation (ICRA).    IEEE, 2016, pp. 1957–1964.
  • [101] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [102] J. Mahler, M. Matl, X. Liu, A. Li, D. Gealy, and K. Goldberg, “Dex-Net 3.0: Computing robust robot vacuum suction grasp targets in point clouds using a new analytic model and deep learning,” arXiv preprint arXiv:1709.06670, 2017.
  • [103] J. Mahler, M. Matl, V. Satish, M. Danielczuk, B. DeRose, S. McKinley, and K. Goldberg, “Learning ambidextrous robot grasping policies,” Science Robotics, vol. 4, no. 26, p. eaau4984, 2019.
  • [104] D. Morrison, P. Corke, and J. Leitner, “Closing the Loop for Robotic Grasping: A Real-time, Generative Grasp Synthesis Approach,” in Proc. of Robotics: Science and Systems (RSS), 2018.
  • [105] ——, “Learning robust, real-time, reactive robotic grasping,” The International Journal of Robotics Research, p. 0278364919859066, 2019.
  • [106] A. Zeng, S. Song, K.-T. Yu, E. Donlon, F. R. Hogan, M. Bauza, D. Ma, O. Taylor, M. Liu, E. Romo et al., “Robotic pick-and-place of novel objects in clutter with multi-affordance grasping and cross-domain image matching,” in 2018 IEEE International Conference on Robotics and Automation (ICRA).    IEEE, 2018, pp. 1–8.
  • [107] A. Myers, C. L. Teo, C. Fermüller, and Y. Aloimonos, “Affordance detection of tool parts from geometric features,” in 2015 IEEE International Conference on Robotics and Automation (ICRA).    IEEE, 2015, pp. 1374–1381.
  • [108] T.-T. Do, A. Nguyen, and I. Reid, “AffordanceNet: An end-to-end deep learning approach for object affordance detection,” in 2018 IEEE international conference on robotics and automation (ICRA).    IEEE, 2018, pp. 1–5.
  • [109] N. Elango and A. Faudzi, “A review article: investigations on soft materials for soft robot manipulations,” The International Journal of Advanced Manufacturing Technology, vol. 80, no. 5-8, pp. 1027–1037, 2015.
  • [110] L. Moriello, L. Biagiotti, C. Melchiorri, and A. Paoli, “Manipulating liquids with robots: A sloshing-free solution,” Control Engineering Practice, vol. 78, pp. 129–141, 2018.
  • [111] L. Antanas, P. Moreno, M. Neumann, R. P. de Figueiredo, K. Kersting, J. Santos-Victor, and L. De Raedt, “Semantic and geometric reasoning for robotic grasping: a probabilistic logic approach,” Autonomous Robots, vol. 43, no. 6, pp. 1393–1418, 2019.
  • [112] J. Sun, J. L. Moore, A. Bobick, and J. M. Rehg, “Learning visual object categories for robot affordance prediction,” The International Journal of Robotics Research, vol. 29, no. 2-3, pp. 174–197, 2010.
  • [113] T. Hermans, J. M. Rehg, and A. Bobick, “Affordance prediction via learned object attributes,” in ICRA: Workshop on Semantic Perception, Mapping, and Exploration, vol. 1, no. 2.    Citeseer, 2011.
  • [114] D. Morrison, P. Corke, and J. Leitner, “Closing the loop for robotic grasping: A real-time, generative grasp synthesis approach,” in International Conference on Robotics: Science and Systems (RSS), 2018.
  • [115] T. A. Funkhouser, “Robotic pick-and-place of novel objects in clutter with multi-affordance grasping and cross-domain image matching,” International Journal of Robotics Research, 2019.
  • [116] Y. Qin, R. Chen, H. Zhu, M. Song, J. Xu, and H. Su, “S4G: Amodal single-view single-shot SE(3) grasp detection in cluttered scenes,” arXiv preprint arXiv:1910.14218, 2019.
  • [117] M. S. Kopicki, D. Belter, and J. L. Wyatt, “Learning better generative models for dexterous, single-view grasping of novel objects,” The International Journal of Robotics Research, vol. 38, no. 10-11, pp. 1246–1267, 2019.
  • [118] A. Murali, A. Mousavian, C. Eppner, C. Paxton, and D. Fox, “6-DOF grasping for target-driven object manipulation in clutter,” in 2020 IEEE International Conference on Robotics and Automation (ICRA).    IEEE, 2020, pp. 1–8.
  • [119] “Dexterous hand.” [Online]. Available:
  • [120] G. Metta, L. Natale, F. Nori, G. Sandini, D. Vernon, L. Fadiga, C. Von Hofsten, K. Rosander, M. Lopes, J. Santos-Victor et al., “The iCub humanoid robot: An open-systems platform for research in cognitive development,” Neural Networks, vol. 23, no. 8-9, pp. 1125–1134, 2010.
  • [121] “icub tech.” [Online]. Available:
  • [122] R. Deimel and O. Brock, “A novel type of compliant and underactuated robotic hand for dexterous grasping,” The International Journal of Robotics Research, vol. 35, no. 1-3, pp. 161–185, 2016.
  • [123] C. Piazza, G. Grioli, M. Catalano, and A. Bicchi, “A century of robotic hands,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 2, pp. 1–32, 2019.
  • [124] S. H. Kasaei, N. Shafii, L. Seabra Lopes, and A. M. Tomé, “Interactive open-ended object, affordance and grasp learning for robotic manipulation,” in 2019 IEEE International Conference on Robotics and Automation (ICRA), 2019.
  • [125] A. Herzog, P. Pastor, M. Kalakrishnan, L. Righetti, J. Bohg, T. Asfour, and S. Schaal, “Learning of grasp selection based on shape-templates,” Autonomous Robots, vol. 36, no. 1-2, pp. 51–65, 2014.
  • [126] N. Shafii, S. H. Kasaei, and L. Seabra Lopes, “Learning to grasp familiar objects using object view recognition and template matching,” in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).    IEEE, 2016, pp. 2895–2900.
  • [127] A. E. Johnson and M. Hebert, “Using spin images for efficient object recognition in cluttered 3D scenes,” IEEE Transactions on pattern analysis and machine intelligence, vol. 21, no. 5, pp. 433–449, 1999.
  • [128] S.-J. Huang, W.-H. Chang, and J.-Y. Su, “Intelligent robotic gripper with adaptive grasping force,” International Journal of Control, Automation and Systems, vol. 15, no. 5, pp. 2272–2282, 2017.
  • [129] B. Sundaralingam and T. Hermans, “Relaxed-rigidity constraints: kinematic trajectory optimization and collision avoidance for in-grasp manipulation,” Autonomous Robots, vol. 43, no. 2, pp. 469–483, 2019.
  • [130] S. S. Mirrazavi Salehian, N. Figueroa, and A. Billard, “A unified framework for coordinated multi-arm motion planning,” The International Journal of Robotics Research, vol. 37, no. 10, pp. 1205–1232, 2018.
  • [131] T. Fromm, “Self-supervised damage-avoiding manipulation strategy optimization via mental simulation,” arXiv preprint arXiv:1712.07452, 2017.
  • [132] G. Kang, Y. B. Kim, Y. H. Lee, H. S. Oh, W. S. You, and H. R. Choi, “Sampling-based motion planning of manipulator with goal-oriented sampling,” Intelligent Service Robotics, pp. 1–9, 2019.
  • [133] J. Pan, L. Zhang, and D. Manocha, “Collision-free and smooth trajectory computation in cluttered environments,” The International Journal of Robotics Research, vol. 31, no. 10, pp. 1155–1175, 2012.
  • [134] D. Rakita, B. Mutlu, and M. Gleicher, “RelaxedIK: Real-time synthesis of accurate and feasible robot arm motion,” in Robotics: Science and Systems, 2018.
  • [135] A. H. Qureshi, A. Simeonov, M. J. Bency, and M. C. Yip, “Motion planning networks,” in 2019 International Conference on Robotics and Automation (ICRA).    IEEE, 2019, pp. 2118–2124.
  • [136] A. H. Qureshi, Y. Miao, A. Simeonov, and M. C. Yip, “Motion planning networks: Bridging the gap between learning-based and classical motion planners,” arXiv preprint arXiv:1907.06013, 2019.
  • [137] S. Luo, H. Kasaei, and L. Schomaker, “Accelerating reinforcement learning for reaching using continuous curriculum learning,” arXiv preprint arXiv:2002.02697, 2020.
  • [138] M. J. Aein, E. E. Aksoy, and F. Wörgötter, “Library of actions: Implementing a generic robot execution framework by using manipulation action semantics,” The International Journal of Robotics Research, p. 0278364919850295, 2018.
  • [139] N. Jetchev and M. Toussaint, “Discovering relevant task spaces using inverse feedback control,” Autonomous Robots, vol. 37, no. 2, pp. 169–189, Aug 2014.
  • [140] D. Chaves, J. Ruiz-Sarmiento, N. Petkov, and J. Gonzalez-Jimenez, “Integration of CNN into a robotic architecture to build semantic maps of indoor environments,” in International Work-Conference on Artificial Neural Networks.    Springer, 2019, pp. 313–324.
  • [141] S. Karaman and E. Frazzoli, “Sampling-based algorithms for optimal motion planning,” The International Journal of Robotics Research, vol. 30, no. 7, pp. 846–894, 2011.
  • [142] A. Cherubini and F. Chaumette, “Visual navigation of a mobile robot with laser-based collision avoidance,” The International Journal of Robotics Research, vol. 32, no. 2, pp. 189–205, 2013.
  • [143] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh, “OpenPose: realtime multi-person 2D pose estimation using part affinity fields,” arXiv preprint arXiv:1812.08008, 2018.
  • [144] Y.-h. Liang and C. Cai, “Intelligent collision avoidance based on two-dimensional risk model,” Journal of Algorithms & Computational Technology, vol. 10, no. 3, pp. 131–141, 2016.
  • [145] N. H. Singh and K. Thongam, “Neural network-based approaches for mobile robot navigation in static and moving obstacles environments,” Intelligent Service Robotics, vol. 12, no. 1, pp. 55–67, 2019.
  • [146] L. Zeng and G. M. Bone, “Mobile robot collision avoidance in human environments,” International Journal of Advanced Robotic Systems, vol. 10, no. 1, p. 41, 2013.
  • [147] D. Helbing and P. Molnár, “Social force model for pedestrian dynamics,” Phys. Rev. E, vol. 51, pp. 4282–4286, May 1995.
  • [148] X.-T. Truong, V. N. Yoong, and T.-D. Ngo, “Socially aware robot navigation system in human interactive environments,” Intelligent Service Robotics, vol. 10, no. 4, pp. 287–295, 2017.
  • [149] D. Fridovich-Keil, A. Bajcsy, J. F. Fisac, S. L. Herbert, S. Wang, A. D. Dragan, and C. J. Tomlin, “Confidence-aware motion prediction for real-time collision avoidance,” The International Journal of Robotics Research, p. 0278364919859436, 2019.
  • [150] H. Yervilla-Herrera, J. I. Vasquez-Gomez, R. Murrieta-Cid, I. Becerra, and L. E. Sucar, “Optimal motion planning and stopping test for 3-d object reconstruction,” Intelligent Service Robotics, vol. 12, no. 1, pp. 103–123, 2019.
  • [151] A. Eitel, N. Hauff, and W. Burgard, “Learning to singulate objects using a push proposal network,” in Robotics Research.    Springer, 2020, pp. 405–419.