I Introduction
We aim to build robots that can not only work safely among human beings but also perform tasks as well as humans do. In the development of household assistant robots and similar technologies, fluidity of motion is important to users and developers alike. Learning human-like manipulations has been an objective of reinforcement learning and robot learning for motion generation. To represent motion from a robotics point of view, in this paper we introduce a manipulation taxonomy that considers the robotics (i.e. mechanics) of human manipulations in cooking activities, particularly the attributes of contact type and trajectory type. These attributes are directly associated with trajectory generation and control, so the taxonomy is a representation that a robot could “understand” and execute.
Grasp taxonomies have been extremely inspirational and useful in robotic grasp planning and analysis. A number of works have defined different grasp taxonomies or grasp types [4, 27, 2, 9, 21, 5, 1, 20] from either videos or grasping data. Those studies go beyond the dichotomy between power and precision grasps and examine in depth the way fingers secure objects held within the hand. Grasp taxonomies have greatly aided grasp planning for robot manipulation [18, 19, 3]. To some degree, this relates to the theory of affordance [11], in which the functionality of an object can be inferred from properties of the object itself. Using grasps, we can identify the type of activity happening in a scene even if the tool is occluded from view, because the type of grasp can suggest the type of tool being held or manipulated [12].
However, a manipulation motion taxonomy that focuses on the mechanics of motions, namely the trajectory and contact involved in a manipulation, is still lacking. Unlike grasp taxonomies, which focus on finger kinematics, we prioritize contact and motion trajectory. A mechanics-based manipulation motion taxonomy could help roboticists consolidate motion aliases (words or expressions that describe mechanically identical or similar motions) and eventually use this knowledge for motion generation, analysis, and recognition. A good manipulation motion taxonomy would also help in transferring or generalizing skills learned for one manipulation to others through their common attributes.
Using the attributes that define the manipulation taxonomy, we can encode manipulation motions as binary strings, which represent a manipulation in a way that robots can “understand” and use to plan and execute. With such strings, we can also consolidate aliases or terms for different motions (even in other languages), since they are represented in a format that describes the motions on a functional level. A properly represented motion in a machine language is crucial for manipulation knowledge representation [24] such as the functional object-oriented network (FOON) [10]. Using our proposed motion taxonomy, motions with different names such as “insert” and “pierce” are represented with the same manipulation code, as they share the same motion and tactile features in the taxonomy.
In designing the manipulation motion taxonomy, we first identify motion and contact features that can be used for distinguishing motions. These features are selected based on common characteristics used in robot motion generation and control such that the motions with the same taxonomy features would have the same motion generation and control strategy. In identifying motion codes in the taxonomy, we have taken two different approaches: 1) clustering motions based on intuition according to the motion features, and 2) clustering motions based on real data and experimental observations.
II Manipulation Data
In this paper, we focus on manipulation motions as typically observed in cooking activities. The primary sources of video data used in this paper are two sets of instructional videos and their labels. The first set comes from the openly available functional object-oriented network (FOON) [22, 23]. This knowledge representation, inspired by our previous work [25, 19], combines object and motion annotations from 100 instructional videos of cooking activities. We use motion labels from FOON as well as those from our Daily Interactive Manipulation (DIM) data set for our taxonomy.
The object and motion annotations are represented in a directed acyclic graph called a functional object-oriented network (FOON). The graph contains object nodes and motion nodes arranged in structures referred to as functional units, which together describe the series of procedures needed to create different meals. We have used FOON for task planning as well as video understanding [16]. Presently, our FOON combines a total of 100 video demonstrations: 75 videos from YouTube (the 65 videos from [23] plus 10 additional videos), 18 from ActivityNet [8], and the remaining 7 from EPIC-KITCHENS [7]. This network (shown in Figure 1) contains a total of 5332 nodes, made up of 3448 object nodes and 1884 motion node instances.
The frequency with which a motion appears in the network indicates how important it is for human activities. Frequent motions are the manipulations that a robotic system would especially need to master to perform cooking tasks. By identifying such motions, we can give special attention to learning them and fine-tuning their performance. To measure the frequency, we simply count the number of motion node instances (there is one motion node per functional unit) in the entire universal FOON, since each motion node type can have (and does have) multiple instances. Figure 2 shows the frequency (as percentages) of the top 20 motion types in the universal FOON, which together make up 85% of all motion nodes.
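The counting procedure described above can be sketched as follows. This is a minimal illustration rather than the FOON toolkit itself: the `FunctionalUnit` class, its fields, and the toy labels are hypothetical stand-ins chosen only to show one motion node per functional unit being tallied and converted to percentages.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionalUnit:
    """Hypothetical stand-in for a functional unit: one motion node plus object labels."""
    motion: str
    input_objects: tuple
    output_objects: tuple


def motion_frequencies(units, top_k=20):
    """Return (motion, percentage) pairs for the top_k most frequent motion labels."""
    counts = Counter(unit.motion for unit in units)
    total = sum(counts.values())
    return [(motion, 100.0 * n / total) for motion, n in counts.most_common(top_k)]


# Toy network for illustration only; these are not real FOON annotations.
units = [
    FunctionalUnit("pick-and-place", ("knife",), ("knife",)),
    FunctionalUnit("cut", ("knife", "onion"), ("chopped onion",)),
    FunctionalUnit("pick-and-place", ("chopped onion",), ("bowl",)),
    FunctionalUnit("pour", ("cup", "bowl"), ("bowl",)),
]

for motion, pct in motion_frequencies(units, top_k=3):
    print(f"{motion}: {pct:.1f}%")
```

Percentages are taken over all motion node instances, so the values for the top motions sum to less than 100% whenever rarer motions are cut off by `top_k`.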
One major challenge we have encountered while annotating FOONs is inconsistency among the labels used by different annotators and by different data sets. In the case of our universal FOON, which was built by multiple annotators, there is always a concern about the consistency of the labels that volunteers provide to describe activities in demonstration videos. To fix this, we need to review the labels and correct them to match the appropriate motion labels. We also encountered this inconsistency when merging information from other data sets such as the MPII Cooking Activities Dataset [26], which uses different expressions from ours to describe its activities. This difficulty partially motivated us to develop a motion representation that is meaningful to robots.
In many manipulations, contact is a very important component. However, videos provide only visual information about a manipulation, and contact characteristics between objects cannot be analyzed from visual features alone. Therefore, we performed common physically interactive manipulation motions in our lab and recorded the interactive force and torque readings along with the motions. The resulting data set, which covers 32 manipulation types over 3,000 manipulation trials, is openly available [6].
[Table I: Manipulation Attributes | Description of Attributes]
III Motion Taxonomy
To capture the mechanics of the manipulation motion, we look at the motion from the following main aspects: contact type, engagement type, and trajectory type. We then add two additional aspects that could be useful for planning: contact duration and manual operation (whether unimanual or bimanual) for finer manipulation details. We combine them into a manipulation code to represent a motion. In Table I, we describe these attributes in detail.
III-A Motion Attributes
We first distinguish manipulations as contact or non-contact motions. Contact motions are manipulations in which force is applied to an object (or a set of objects) by a tool, utensil or another object, while non-contact motions are those in which there is little to no contact between them. We refer to the tool or utensil as the active participant in the motion, while objects being acted upon are referred to as passive participants. For instance, in hammering, a hammer exerts force on a nail as repeated, powerful single impacts, while a softer force can be observed in motions like mixing liquids in a container or brushing a surface with a brush. In some cases, the robot’s hand acts as the active tool, as in picking-and-placing, squeezing or folding. Non-contact motions involve the manipulation of tools that make little to no contact with the participating passive objects. For instance, when we pour a liquid from a cup into a bowl, the cup does not touch the bowl in a typical pouring action. It is important to note that in pouring, we do not consider the hand gripping the object as a tool.
A manipulation motion can also be identified by how an active object engages with passive objects. We identify motion engagement types as either soft or rigid. Soft engagement motions are those where either the active tool or the passive object undergoes a change in shape from contact with the other. In rigid (or neutral) engagement motions, neither the tool nor the objects change in shape, state or form as a result of direct contact. Such motions may or may not move the object being acted upon. For instance, with a spatula, one can pick up items without changing the physical state of either the manipulating tool or the manipulated object, but the passive item is moved from one location to another.
Soft engagement can be broken down into two subcategories: 1) admitting (or penetrative), where the passive object allows the tool to enter it and the tool penetrates without itself deforming, and 2) deforming, where either the active or the passive object deforms in some way. The latter can be further divided into manipulator-deforming, where the active tool itself changes shape while acting upon an object, and manipulatee-deforming, where the passive object changes in state or shape while the active tool remains rigid. This gives three soft engagement types in total. As an example of an admitting engagement, when scooping flour from a bowl, the spoon or cup penetrates the ingredients. A manipulator-deforming engagement can be observed when using a brush, since the bristles bend away from the default shape of the brush. In a manipulatee-deforming engagement such as cutting, the active knife deforms the passive object by dividing it from its natural state into pieces for cooking.
Once contact is made between the active tool and the passive object, engagement can either be continuous, where there is constant interaction or force over the duration of the action, or discontinuous, where contact is brief or not persistent. Discontinuous motions tend to be identifiable by sharp, short periods of force. For example, in pick-and-place, the only contact between the object and the environment occurs at the beginning and the end of the process, when contact between the picked object and its supporting environment is broken and then re-established. However, since the hand is considered to be the active tool and it grips the object continuously, this action is classified as continuous contact. By contrast, in an action such as dipping, the object makes only temporary contact with the contents of a container.
As for manipulation motion types, the movement can be prismatic, where it undergoes linear translation across a line/plane (e.g. cutting is usually a vertical motion in 1D), or it can be revolute or rotational, where the object or tool undergoes a change in orientation and it moves about axes of rotation (e.g. pouring typically involves the rotation of a cup to allow liquid to flow into a receiving container). Manipulation motions are not confined to a single trajectory type since certain manipulations combine rotation and translation; hence, these two subcategories are not mutually exclusive. An example of this type of motion is folding.
Finally, these motion types can also be described by the number of hands (or end-effectors) regularly used in the action. We classify them as unimanual (involving one hand) or bimanual (involving both hands) in terms of manipulation of the active tool or item. Sprinkling salt from a shaker can be considered a unimanual action since we can hold the shaker and shake it with one hand, while rolling or flattening is usually a bimanual action since a rolling pin requires two hands to operate. This criterion is important for determining which motions a robot can execute, since some robotic systems are not built to match human anatomy (i.e. with two arms, two hands, and similar joints).
Figure 3 illustrates the manipulation taxonomy described in Table I as five hierarchical trees. Each manipulation motion is grouped according to the taxonomy trees and assigned a binary string as its manipulation code. The binary string combines the manipulation attributes in the following order from left to right: contact type, engagement type, trajectory type, contact duration and manual operation.
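As a rough illustration of how such a string could be handled in software, the sketch below splits a code into its five attribute fields in the stated left-to-right order. The field widths (1, 3, 2, 1 and 1 bits) are an assumption made here so that the fields add up to the eight bits of the codes listed in Table II; the text does not state them, and the sketch deliberately does not interpret what individual bit patterns mean. The names `FIELD_WIDTHS`, `split_code` and `join_code` are hypothetical.

```python
from typing import Dict

# ASSUMPTION: these widths are inferred so that the five fields total 8 bits;
# they are not given in the text and may differ from the actual encoding.
FIELD_WIDTHS = (
    ("contact_type", 1),
    ("engagement_type", 3),
    ("trajectory_type", 2),
    ("contact_duration", 1),
    ("manual_operation", 1),
)


def split_code(code: str) -> Dict[str, str]:
    """Split a binary manipulation code into attribute fields, left to right."""
    expected = sum(width for _, width in FIELD_WIDTHS)
    if len(code) != expected or set(code) - {"0", "1"}:
        raise ValueError(f"expected a {expected}-bit binary string, got {code!r}")
    fields, start = {}, 0
    for name, width in FIELD_WIDTHS:
        fields[name] = code[start:start + width]
        start += width
    return fields


def join_code(fields: Dict[str, str]) -> str:
    """Reassemble a manipulation code from its attribute fields."""
    return "".join(fields[name] for name, _ in FIELD_WIDTHS)


print(split_code("11111010"))
```

Keeping the layout in a single table such as `FIELD_WIDTHS` means that a different bit allocation only requires changing that one definition.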
III-B Manipulation Codes
Based on the taxonomy, each motion type can be represented by a manipulation code built from its attributes. In Table II, we assign manipulation codes to common cooking motions seen in both FOON and DIM. Several motions end up naturally clustered because they share a code.
Mixing/stirring is assigned the same code as inserting/piercing since both are admitting actions, have prismatic trajectories, and are classified as continuous contact motions. Cutting/slicing/chopping, along with motions such as mashing, rolling (unimanual), peeling, shaving, and spreading, are clustered together mainly because of their manipulatee-deforming and prismatic properties. This group is separate from the one containing pulling apart and grating because the latter are typically bimanual actions.
| Manipulation Code | Motion Types |
|---|---|
| 10111010 | pick-and-place, push (rigid) |
| 11001010 | insert, pierce, mix, stir |
| 11101010 | brush, wipe, push (deforming) |
| 11110100 | tap, crack (egg) |
| 11110111 | twist (open/close container) |
| 11111010 | cut, slice, chop, mash, roll (unimanual), peel, scrape, shave, spread, squeeze, press, flatten |
| 11111011 | roll (bimanual), pull apart, grate |
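The alias consolidation that these codes enable can be sketched in a few lines. The codes and labels below are copied from Table II; the helper names (`CODE_OF`, `are_aliases`, `hamming`) are hypothetical, and using the Hamming distance to compare two codes is an assumption made here for illustration, not a measure defined in this paper.

```python
# Codes and motion labels copied from Table II.
TABLE_II = {
    "10111010": ["pick-and-place", "push (rigid)"],
    "11001010": ["insert", "pierce", "mix", "stir"],
    "11101010": ["brush", "wipe", "push (deforming)"],
    "11110100": ["tap", "crack (egg)"],
    "11110111": ["twist (open/close container)"],
    "11111010": ["cut", "slice", "chop", "mash", "roll (unimanual)", "peel",
                 "scrape", "shave", "spread", "squeeze", "press", "flatten"],
    "11111011": ["roll (bimanual)", "pull apart", "grate"],
}

# Invert the table: every motion label maps to exactly one manipulation code.
CODE_OF = {label: code for code, labels in TABLE_II.items() for label in labels}


def are_aliases(label_a: str, label_b: str) -> bool:
    """Two labels are treated as aliases when they share a manipulation code."""
    return CODE_OF[label_a] == CODE_OF[label_b]


def hamming(code_a: str, code_b: str) -> int:
    """Number of differing bits (ASSUMPTION: an illustrative distance only)."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same length")
    return sum(a != b for a, b in zip(code_a, code_b))


print(are_aliases("insert", "pierce"))
print(hamming(CODE_OF["cut"], CODE_OF["grate"]))
```

A plain bit count weighs every bit equally, so it ignores which attribute a bit belongs to; a per-attribute comparison may be more meaningful when the field layout is known.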
IV Classifying Motions with Real Data
We have established a motion taxonomy for grouping motions that are similar to one another in force and motion, using attributes such as contact versus non-contact. In this section, we support our taxonomy by comparing force readings for different motion types. As described in Section II, several demonstrations of a variety of human activities were recorded using position/orientation and force sensors and are featured in the DIM data set. The objective here is to match each activity to a motion type and to determine whether the measurements show that certain motion types are similar to others, and thus whether the clusters from Table II align with real data.
IV-A Finding Motion Similarity
The DIM data set is currently the only data set that contains 6-axis contact force data for many manipulation motions [14, 15]. However, due to the limitations of the force sensor in its data collection setup, it does not include manipulations involving high force or torque, such as squeezing, mashing, or pressing. Additionally, we did not analyze non-contact motions (such as pouring or sprinkling/shaking) because there are no interactive forces to measure between active and passive objects. For these reasons, we do not have mappings to all motion clusters. Several motions were collected as multiple variations of demonstrations, so we combine all recordings of a motion in this data set.
Using the force data, we created a representative model for each motion type using Gaussian Mixture Models (GMMs). Each GMM represents a force distribution across space that serves as a description of a motion type, and it is built by combining the data points generated in multiple demonstration trials. To measure the similarity of motions using their individual force distributions, we use the Kullback-Leibler (KL) divergence [17]. KL divergence between two GMMs has no closed form and is typically estimated by random sampling; however, this is computationally intensive, so we used the variational approximation of KL divergence (as proposed in [13]) as the distance measure between a pair of different motions. By itself, KL divergence is not a true metric: it is asymmetric (i.e. the value from A to B is generally not the same as that from B to A). However, we can obtain a symmetric result by averaging the two directed values (i.e. we compute the values from A to B and from B to A and take their mean). Since we have multiple recordings for certain motion types, we also averaged the KL divergence values computed over all of those instances. This makes it easier to interpret the pairwise values we obtain, which we present in matrix form in Figure 4. KL divergence values are non-negative and unbounded: the closer the value is to 0 (shown as deeper shades of blue), the more alike two distributions are considered to be; conversely, the larger the value (shown as lighter shades of yellow), the more dissimilar two manipulation motion types are based on force readings. The matrix is symmetric, so we omitted the values above the diagonal.
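The distance computation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for Figure 4: it restricts each GMM to diagonal covariances so that the component-wise KL divergence has a simple closed form, it uses only the standard library, and the function names and toy parameters are hypothetical. The variational formula compares, for each component of the first mixture, its closeness to the components of its own mixture against its closeness to the components of the second mixture.

```python
import math


def kl_diag_gauss(mu1, var1, mu2, var2):
    """Closed-form KL(N1 || N2) for Gaussians with diagonal covariances."""
    return 0.5 * sum(
        math.log(v2 / v1) + v1 / v2 + (m1 - m2) ** 2 / v2 - 1.0
        for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2)
    )


def variational_kl(f, g):
    """Variational approximation of KL(f || g) between two GMMs.

    Each GMM is a list of (weight, mean, variance) components, where mean and
    variance are equal-length tuples (ASSUMPTION: diagonal covariances).
    Widely separated mixtures can underflow exp() and need log-space sums.
    """
    total = 0.0
    for w_a, mu_a, var_a in f:
        same = sum(w * math.exp(-kl_diag_gauss(mu_a, var_a, mu, var))
                   for w, mu, var in f)
        other = sum(w * math.exp(-kl_diag_gauss(mu_a, var_a, mu, var))
                    for w, mu, var in g)
        total += w_a * math.log(same / other)
    return total


def symmetric_kl(f, g):
    """Average of the two directed values, which removes the asymmetry."""
    return 0.5 * (variational_kl(f, g) + variational_kl(g, f))


# Toy 2-D mixtures with made-up parameters; these are not DIM measurements.
gmm_a = [(0.6, (0.0, 1.0), (1.0, 0.5)), (0.4, (2.0, 0.0), (0.5, 1.0))]
gmm_b = [(0.5, (0.2, 0.9), (1.0, 0.6)), (0.5, (1.8, 0.1), (0.6, 1.0))]
gmm_c = [(1.0, (5.0, -3.0), (0.3, 0.3))]

print(symmetric_kl(gmm_a, gmm_b), symmetric_kl(gmm_a, gmm_c))
```

With several recordings per motion type, the same function would be applied to every pair of recordings and the results averaged, as described in the text.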
The main question we address in this section is: how well do our motion clusters match real supporting data? We answer it by checking whether motions assigned to the same cluster are also close to one another in terms of force/torque readings. In Figure 4, certain activity pairs agree with our taxonomy, such as mashing and slicing, mashing and shaving, spreading and shaving, spreading and mashing, peeling and shaving, and twisting in both directions. Several motions are close to one another in the data but fall into different clusters in Table II because they differ in one or two attributes. Brushing and shaving are considered different in the taxonomy, but only because of the nature of the tools: brushing is manipulator-deforming, while shaving is manipulatee-deforming. Aside from the deformation type, the movement type and force application are expected to be similar, and therefore these motions can be considered similar. Likewise, flipping and scooping are similar to one another because they are both prismatic and revolute; however, flipping is a rigid engagement motion, while scooping is an admitting, soft engagement motion. Inserting/piercing is somewhat distant from all other motions, with twisting perhaps being the closest, which does not match our expectations.
Some pairs that we expected to be similar, such as peeling and scraping, did not have low KL divergence values. Conversely, motion pairs that the data deemed similar but that do not match our taxonomy include flipping and mashing, flipping and shaving, stirring and slicing, stirring and shaving, and stirring and spreading. Twisting open was found to be similar to many other motions, such as slicing and shaving, which are prismatic-only rather than revolute motions. This illustrates that trajectory features should not be neglected when comparing motion data. Since the KL divergence considers only force readings, it neglects other factors that could rule out unlikely matches; these factors could be obtained from an analysis of motion trajectory data or from video analysis. This is why some similarities do not agree with the clustering of motions in the taxonomy.
In conclusion, our aim in this paper was to investigate the robotic attributes of manipulation tasks as seen in cooking and to use them to create an effective representation of manipulation motions. By defining a motion taxonomy, we were able to assign binary-encoded strings, which we call manipulation codes, that describe a particular motion in terms of its trajectory and contact properties. Manipulation codes can be used to determine motion types that are similar to one another. The taxonomy and codes allow researchers to represent and group manipulations from the robotics point of view. In addition, by representing motions as manipulation codes, we can effectively consolidate aliases (different labels or words for the same motion), thus removing ambiguity among motion types. Moreover, comparing the codes of different manipulations provides a path towards transferring learned manipulations to new, unlearned manipulations.
To show that the motion code assignments given to different motion types hold up in measuring similarity (or dissimilarity) between motion types, we performed experiments using collected demonstration data. We showed that the force reading data for certain motion types naturally cluster with other motion types, supporting the taxonomical clusters described in the paper. For a better measure of similarity and support for the taxonomy, we would need to collect force data for other motions that we did not include in the analysis. Furthermore, we may also identify other obtainable attributes to be included within the taxonomy, which can be selected based on the proposed task and available resources.
This material is based upon work supported by the National Science Foundation under Grants No. 1421418 and 1560761.
References
[1] (2016) Grasp taxonomy based on force distribution. In Robot and Human Interactive Communication (RO-MAN), 2016 25th IEEE International Symposium on, pp. 1098–1103.
[2] (2013) A hand-centric classification of human and robot dexterous manipulation. IEEE Transactions on Haptics 6 (2), pp. 129–144.
[3] (2019) On the choice of grasp type and location when handing over an object. Science Robotics 4 (27), pp. eaau9757.
[4] (1989) On grasp choice, grasp models, and the design of hands for manufacturing tasks. IEEE Transactions on Robotics and Automation 5 (3), pp. 269–279.
[5] (2013) Functional analysis of grasping motion. In Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on, pp. 3507–3513.
[6] Daily Interactive Manipulation (DIM) Dataset. Note: http://rpal.cse.usf.edu/datasets_manipulation.html. Accessed: July 31, 2019.
[7] Scaling Egocentric Vision: The EPIC-KITCHENS Dataset. European Conference on Computer Vision (ECCV).
[8] ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 961–970.
[9] (2016) The GRASP taxonomy of human grasp types. IEEE Transactions on Human-Machine Systems 46 (1), pp. 66–77.
[10] FOON Website: Graph Viewer and Videos. Note: http://www.foonets.com. Accessed: July 31, 2019.
[11] (1977) The theory of affordances. In Perceiving, Acting and Knowing, R. Shaw and J. Bransford (Eds.).
[12] (2010) Action observation can prime visual object recognition. Experimental Brain Research 200 (3-4), pp. 251–258.
[13] (2007) Approximating the Kullback Leibler divergence between Gaussian mixture models. In Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, Vol. 4, pp. IV–317.
[14] (2016) Recent data sets on object manipulation: a survey. Big Data 4 (4), pp. 197–216.
[15] (2019) A dataset of daily interactive manipulation. The International Journal of Robotics Research 38 (8), pp. 879–886.
[16] (2018) Long Activity Video Understanding using Functional Object-Oriented Network. IEEE Transactions on Multimedia.
[17] (1951) On information and sufficiency. The Annals of Mathematical Statistics 22 (1), pp. 79–86.
[18] (2014) Grasp planning based on strategy extracted from demonstration. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 4458–4463.
[19] (2015) Robot grasp planning based on demonstrated grasp strategies. The International Journal of Robotics Research 34 (1), pp. 26–42.
[20] (2016) Data-driven human grasp movement analysis. In ISR 2016: 47st International Symposium on Robotics; Proceedings of, pp. 1–8.
[21] (2017) The complexities of grasping in the wild. In 2017 IEEE-RAS 17th International Conference on Humanoid Robotics (Humanoids), pp. 233–240.
[22] (2016) Functional Object-Oriented Network for Manipulation Learning. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2655–2662.
[23] (2018-05) Functional Object-Oriented Network: Construction & Expansion. In ICRA 2018 - IEEE International Conference on Robotics and Automation, Brisbane, Australia.
[24] (2019) A survey of knowledge representation in service robotics. Robotics and Autonomous Systems 118, pp. 13–30.
[25] (2013) Human-object-object-interaction affordance. In Workshop on Robot Vision.
[26] (2012) A database for fine grained activity detection of cooking activities. In CVPR, pp. 1194–1201.
[27] (2013) A simple ontology of manipulation actions based on hand-object relations. IEEE Transactions on Autonomous Mental Development 5 (2), pp. 117–134.