Understanding of Object Manipulation Actions Using Human Multi-Modal Sensory Data

by   Bahareh Abbasi, et al.

Object manipulation actions represent an important share of the Activities of Daily Living (ADLs). In this work, we study how to enable service robots to use human multi-modal data to understand object manipulation actions, and how they can recognize such actions when humans perform them during human-robot collaboration tasks. The multi-modal data in this study consists of videos, hand motion data, applied forces as represented by the pressure patterns on the hand, and measurements of the bending of the fingers, collected as human subjects performed manipulation actions. We investigate two different approaches. In the first one, we show that multi-modal signal (motion, finger bending and hand pressure) generated by the action can be decomposed into a set of primitives that can be seen as its building blocks. These primitives are used to define 24 multi-modal primitive features. The primitive features can in turn be used as an abstract representation of the multi-modal signal and employed for action recognition. In the latter approach, the visual features are extracted from the data using a pre-trained image classification deep convolutional neural network. The visual features are subsequently used to train the classifier. We also investigate whether adding data from other modalities produces a statistically significant improvement in the classifier performance. We show that both approaches produce a comparable performance. This implies that image-based methods can successfully recognize human actions during human-robot collaboration. On the other hand, in order to provide training data for the robot so it can learn how to perform object manipulation actions, multi-modal data provides a better alternative.



There are no comments yet.


page 5


Human and Machine Action Prediction Independent of Object Information

Predicting other people's action is key to successful social interaction...

CZU-MHAD: A multimodal dataset for human action recognition utilizing a depth camera and 10 wearable inertial sensors

Human action recognition has been widely used in many fields of life, an...

Exploiting Multi-Modal Features From Pre-trained Networks for Alzheimer's Dementia Recognition

Collecting and accessing a large amount of medical data is very time-con...

Imitation and Mirror Systems in Robots through Deep Modality Blending Networks

Learning to interact with the environment not only empowers the agent wi...

Semantic constraints to represent common sense required in household actions for multi-modal Learning-from-observation robot

The paradigm of learning-from-observation (LfO) enables a robot to learn...

Cognitive architecture aided by working-memory for self-supervised multi-modal humans recognition

The ability to recognize human partners is an important social skill to ...

ARTiS: Appearance-based Action Recognition in Task Space for Real-Time Human-Robot Collaboration

To have a robot actively supporting a human during a collaborative task,...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I introduction

Understanding human actions is a critical component in an effective human-robot interaction. To ensure successful collaboration, the robot should be able to recognize human actions, and produce appropriate physical behaviors. Service robots thus require both an action recognition module to identify the human actions and a learning module to learn how to replicate them. Development of such modules requires understanding how humans perform an action. To do so, it is necessary to collect data from various sensor modalities and understand their role. Among different actions of interest, object manipulation actions play key role in the ADLs. They are particularly challenging to study because they are difficult to measure in naturalistic settings. In this paper, we focus on modeling the human manipulation actions with the goal of enhancing the understanding between the human and the robot

In robotics, human manipulation actions have been traditionally studied either by using the video recordings [1, 2], or by analyzing of human movement data [3, 4]. The extracted features can be exploited for modeling of human actions that involve interaction with objects, or for action recognition. The idea of breaking down these actions into action units analogous to phonemes in language, and to study the temporal sequence of these action units, has attracted considerable interest [4, 3, 5, 6].

The concept of action primitives can be used at a symbolic level to define the building blocks that can describe the actions. In turn, they can be used to train a low level model, like a support vector machine (SVM), to extract these primitives from the continuous motion trajectory, and a higher level temporal model such as Hidden Markov Model (HMM) to classify a sequence of them into actions 

[4]. Alternatively, these two steps can be combined into a single hierarchical probabilistic temporal model (HHMM) [3]. In this approach, the association between the movement signal and the primitives is probabilistic due to variability seen among the subjects and even within a subject [4], and human annotation of primitives/actions is required. Another way of tackling this problem is to combine the movements at the trajectory level, and define the primitives based on repeated observed patterns [5]. One of the drawback of this approach is that the obtained primitives are dependent on locations and other experimental setup properties. In [6], the problem is resolved by proposing a Parametric Hidden Markov Model (PHMM) for each primitive based on a quantization of object configuration space.

In this work, we model the human manipulation actions using two approaches, referred to as primitive-based and visual-data-flow approach. We introduce the manipulation primitives which are inferred from the human multi-modal signals during the execution of these actions. The distinct sequential nature in which primitives occur in each manipulation action is exploited to recognize the action. The recognition module is thus based on a sequential model which can process a sequence of the primitives as they occur in time. A novel feature of our work is that we use force (more precisely, data from pressure sensors on the palm of the hand) as an additional information channel, i.e. the data includes force in addition to the hand velocity and bending of the finger joints. We define a set of 24 primitive features that can be easily identified in the multi-modal data, no learning or annotation needs to be used. These features are parameter-free and can be thus used as a symbolic representation of the multi-modal signal. Once the signals are converted into such a sequence of primitive features, the sequence can be used as an input for a sequential classifier for action recognition. Because these primitive features are parameter-free, the recognition is independent from the location of the action and generalizes across subjects.

The second approach uses visual features extracted from the raw data. We extract the raw visual features from recorded video using pre-trained deep convolutional neural network (DCNN). Training DCNN models from scratch requires a massive amount of data. Instead, we rely on the existing DCNNs that have been trained using ImageNet 

[7] as they are well suited for extracting the visual features. Subsequently, the vision-based features are utilized for training the classifier. We also train the same classifiers with the combination of the visual features and the multi-modal data to check whether adding the extra information significantly improves the recognition. To the best of our knowledge, we are among the first to use multi-modal data, in particular visual features along with motion and force data, for recognition of object manipulation actions. Finally, we show that vision can be successfully used for recognition and is statistically equivalent to multi-modal classifiers.

The rest of the paper is organized as follows, in section II, we review the literature on modeling human actions. In section. III, we characterize the manipulative activities that are of interest to this work. Section IV is devoted to primitive-based approach. In section V, we describe our method for visual feature extraction and how to use them for recognition along with the non-vision features. The conclusion and future work is provided in section VI.

NatEXPIdentify manipulation primitives

MotionEXPExtract motion trajectories

ValEXPPerformance validation

Fig. 1: The three different human studies and their objectives in this work.

Ii related work

There is a huge body of research on understanding human activities. Many researchers have taken observational approach in modeling human activities and proposed taxonomies for human actions [8]. Pavlovic et al. [9] proposed a detailed taxonomy for hand and arm movements. They refer to all intentional motions as “gestures” and suggest that human gestures are either manipulative or communicative (e.g. mimetic or referential). Bullock et al. [10] proposed a taxonomy with 15 classes for the manipulation tasks. Their taxonomy is based on the type of the contact (prehensile or not) and the states of the hand motion (no-motion, within-hand motion and non-within-hand motion).

In contrast to the observational approach in modeling human actions, many researchers have focused on proposing algorithms to determine human actions from the measured physical signals, such as the motion trajectories and applied forces/torques. Almost all of these works, assume that a hierarchical relationship exists between the physical signals and the high-level actions. One of the most appreciated models for this hierarchical relation is the “movement-activity-action” model, proposed by Bobick [11]. He defines movements as the most atomic primitives. Activity refers to a sequence of such movements or states.

While Bobick’s model [11]

is presented in the context of computer vision and image processing, the idea of using primitive motions for determining human actions is employed by many researchers in different research areas. For instance, Hogan and Sternad

[12] suggested that human upper- and lower-limb behaviors can be described by three classes of “dynamic” primitives: sub-movements (trajectory attractor), oscillations (limit-cycle attractor) and mechanical impedances (generalized stiffness).

Action Object Activities Instances Duration (s)
Putting on the shoe shoe, subject’s foot Reaching to the object, grasping the object, picking up the object, reorienting the object (aligning with the foot), inserting the foot into the shoe      1     32
Tying shoelace shoelaces Reaching the object, grabbing the object, performing the tying maneuvers, releasing the object      1     13
Getting up another person Support other subject while getting up from a chair or a bed      1     6
Ambulating another person Support other subject while walking      1     26
Open/Close door, closet, cabinet, fridge, drawer, … Reaching to the handle, grasping the handle, pushing/pulling the handle in the opening/closing direction, releasing the handle and reaching back     23     57
Pick up plate, mug, spoon, … Reaching to the object, grasping the object and picking up the object     24     56
Place plate, mug, spoon, … Reaching to the goal, releasing the object and reaching back     19     43
Carry/Hold pot, plate, … Moving with/holding the grasped object     21     70
Handover plate, spoon, … Giving/receiving the object to the other person     18     23
Showing bowl, plate, … Re-configuring the object to provide better view point for the other person      6     12
Displacing plate, bowl, mug, silverware … Reaching the object, grasping/touching the object, Sliding the object to target point, releasing the object, reaching back      7     15
TABLE I: The list of physical actions observed during the experiment. The highlighted actions include non-habitual movements.

Non-observational approaches for definition and utilization of the primitives can be grouped into two main categories. The first category consists of methods that introduce explicit primitives, usually in the task space (or phase space). Therefore, these primitives usually have physical representations and meanings. On the other hand, the methods in the second category propose a set of data driven primitives, usually in the feature space. These methods usually perform a dimension reduction technique followed by applying a clustering algorithm. Therefore, the primitives in this category usually do not represent any physical entity.

In the first category, the primitive motions are either explicitly proposed by the researchers or assumed to be available as a library (e.g. pre-trained from human demonstrations). For instance, Based on the idea that human arm tends to move on a fixed plane, Fang and Ding [13] suggested explicit movements primitives for anthropomorphic arms: “moving on a working plane”, “switching different working planes”, “translating the hand”, and “rotating the hand about a fixed axis”. Meier et al. [14] presented a movement segmentation framework, assuming that a library of movement.

In the second category, the primitives are obtained in a machine learning framework, without any prior knowledge or assumption. For instance, to obtain primitive motions in a 3 to 5-second long sequences of arm movements, Fod et al.


used FastTrak motion tracking system. They segmented the velocity signals using zero-crossing and thresholding techniques, performed PCA on the segmented signals and applied K-means clustering on the first 30 principal components. Similarly, Lim et al.

[16] applied PCA on human motion capture data to obtain the joint trajectory basis functions (movement primitives) and suggested that arbitrary movements can be represented as a linear combination of these basis functions. Jenkins et al. [17] used spatio-temporal Isomap to identify clusters (primitive behaviors) in human motion data. Williams et al. [18] employed factorial hidden Markov model to infer primitives in handwriting data. Similarly, Inamura et al. [19]

employed expectation maximization technique to learn the parameters of a Hidden Markov Model for unknown primitive motions. Alvarez et al.

[20] used second order linear differential equations in conjunction with a latent force model to extract a sequence of dynamical motor primitives. Our approach in defining the primitive falls in the first category. That is, by observing human activities of daily living (ADLs) in an experimental setup, we collected a list of building blocks that comprises the large proportion of the human manipulation actions. While most of the mentioned previous works processed the raw measured motion data for action recognition, our developed primitive-based recognition model rely on symbolic parameter free multi-modal primitive features and is generalized.

Iii human manipulation actions types

Depending on how the human nervous system handles the motion, arm/hand movements can be categorized in three types: reflexive, habitual and deliberate movements. Reflexive movements are motions whose control loops are implemented in the spinal cord and therefore, their response to stimuli is very fast. An example of these movements is the reflex of the arm when hand touches a hot object. In contrast, the habitual movements are controlled at basal ganglia and are done as ordinary actions without any deliberation. On the other hand, certain manipulation tasks, like spinning a basketball on a finger, are considered dexterous manipulation which need precise planning in controller level. These manipulation tasks are managed at prefrontal cortex and are referred to as deliberate movements. Any new (unfamiliar) task is first treated as a deliberate movement, but when it is repeated sufficiently often, it becomes habitual.

Different people take different approaches in learning and performing deliberate movements. As a result, the variability of the motion trajectories between subjects for the same task is large. In contrast, the motion profiles of the habitual movements are mainly governed by the environment and human-body kinematics. Thus, the variability of the motion trajectories between subjects is much smaller for habitual movements. In this work, we focus on tasks that mainly consist of habitual movements and exclude reflexive and deliberate movements. ADLs are particularly good examples of habitual movements and we will specifically study them in this work.

Iv primitive-based approach

In this section we describe our primitive-based approach. We will show that the habitual manipulation actions can be decomposed into a sequence of primitives and their corresponding features. Our work relies on three separate user studies (Fig. 1). In the NatEXP study, data was collected in a fully functional apartment while an elder and a human helper performed ADLs. The MotionEXP study was used to analyze motion trajectories of a subset of manipulation primitives to identify their primitive features. Finally, ValEXP study was used to train the sequential models to validate the approach.

Iv-a NatEXP Human Study: Identify the Manipulation Primitives

In our primary human study, NatEXP (see Fig. 1), we investigated the interaction between human and human, and collected data of interactions between an elder and a human helper. The collected data was annotated for visual cues, spoken language, and manipulation actions. Here, we only focus on the manipulation actions and present that part of the corpus. Subsequently, we show that the substantial portion of these actions can be represented as the integration of proposed primitives in this section. The remaining parts of the corpus can be obtained from [21].

The human study was conducted in a fully functional studio apartment at Rush University in Chicago. It consists of 15 sets of interactions in which gerontological nursing students provided care to the elder person. All elderly subjects were highly functioning at a cognitive level and did not have any major physical impairment.

Multi-modal data was collected during all experiments. The room was equipped with seven cameras to ensure multiple points of view. Both participants were wearing a microphone to record their conversations, and a data glove on their dominant hand to collect haptics data. For more details about the experimental setup see [21]. The videos are annotated and aligned with other signals using Anvil [22]. In total, there are 436 minutes of recorded signals, of which 266 minutes were the helper-elder interactions during the ADLs. The rest was either instructing the elder subject or setting up the experiment equipment.

# % D %
Reaching movement 220 32.5% 162.32 62%
Fixed-axis rotation 189 28% 94.44 36%
Grasp/release 130 19% NA NA
Bend/extend 131 19.5 NA NA
other 7 1% 3.8 2%
TABLE II: The proposed manipulation primitive movements and instances and duration (s) of each.

Each experiment involved interaction between an elder person and a helper while performing four tasks: putting on shoes, getting up from a bed or a chair, ambulating and preparing dinner. The “putting on shoes” task consisted of the helper picking up the elder’s shoe, putting it on his/her foot and tying the laces. The “getting up” activity was performed by the helper supporting the elder’s weight while getting up from a chair or a bed. During “ambulating” action, the helper provided support for the elder and helped the elder balance while walking around the room. Finally, “preparing dinner” comprised many sub-tasks, such as finding pots, opening cans and containers, putting pots on a stove, setting up the dinner table, and cleaning the table afterwards.

Manipulation Primitives
Reaching movement : reach
Hand rotation: rotate
Grasp/Release: grasp/release
Bend/Extend: bend/extend
TABLE III: The proposed manipulation primitives for representing the human manipulation actions.

We analyzed in detail one of the experiments, focusing on manipulation actions. Of 19 minutes of recordings in that experiment, manipulation actions take 353 seconds. Table I shows a list of all the manipulation actions that appear in the experiment. While most of the actions comprised habitual movements, deliberate movements appeared in few instances. For example, “getting up” and “ambulating” actions require careful planning, because they involve ensuring the elder’s balance. Note that putting on shoes was also usually performed as a sequence of planned deliberate movements because the subjects’ feet imposed unknown constraints for the motion planning. Similarly, tying the shoelace on someone else’s shoe was not habitual movements. Table I describes different activities (3rd column) that occur during an action (1st column). The last two columns of Table I give the duration and frequency of the actions. In total, there were 122 actions. As illustrated, most of the actions comprised the habitual movements.

Next, we attempted to identify primitives that could describe the manipulation actions. We observed that 81.35% of the manipulation actions can be described by only four activities: reaching movement, hand rotation, bending/extending the fingers and grasping/releasing the object. Table II shows the frequency and duration (in sec) of these primitives in the experiment. Note that the first two primitives, reaching movement and hand rotation, can be constrained by the environment (object). The manipulation actions that can not be described with the primitives above are mostly instances of deliberate movements. Curved hand trajectory due to an obstacle and adjusting the pot’s position on the stove are examples of such movements. Based on these observations, we propose the human manipulation primitives that is listed in Table III.

While the proposed primitives capture the experimental data, on their own they are not sufficient for recognizing actions in Table I; the primitives represent a level of abstraction that is too high and a sequence of primitives is not unique to a single action. This motivated us to use the primitives to identify features in the data that recognize different actions. Not only that, through the primitives, these data features can be defined explicitly as specific shapes in the signals.

Iv-B MotionEXP Human Study: Velocity Profile of Primitives

In the previous section, we identified four manipulation primitives in human manipulation actions, using the video of the experiment. But in order to recognize different actions, we need to associate features to each primitive, and find a way to identify these features in the data. In this section, we describe how a characteristic motion profile can be obtained for reach and rotate primitives.

Human motions have been studied extensively and it has been shown that, for a ballistic motion, the speed profile is bell-shaped. Minimum-jerk model [23], minimum torque change model [24, 25]

, and minimum variance model

[26] are three popular models for this profile. Although the aforementioned models approximate the human reaching motion, we are interested in finding the motion trajectory not only for reach primitive but also for rotate primitive. We chose an empirical method to model these primitives to capture inter and intra subjects variation in performing these primitives with trying different conditions like empty hand, or holding different objects.

(a) reach primitive
(b) rotate primitive
Fig. 2: Time trajectory and 2D histogram for the normalized manipulation primitives

To obtain the motion profile for reach and rotate primitives, we conducted MotionEXP human study (see Fig. 1) with eight healthy adult subjects. Each subject was asked to perform each primitive motions repeatedly wearing the data glove with an IMU mounted on the back of the glove (see Sec. IV-C for more details). For the reach primitive, the subjects were instructed to move their hands in different directions and for various distances (in rang of ), at least ten times. The subjects performed the task either empty-handed or while carrying an object. Various objects were involved in the experiments including a mug, silverware, and a plate. For the rotate primitive, the subjects were instructed to rotate their hands while keeping their arms steady. The subjects were asked to rotate their hand around the three principal axes of the hand (that also corresponded to the axes of the IMU frame). Each subject performed at least ten rotations around each axis (in rang of ).

After discarding the corrupted signals, we obtained 87 signals for the reach primitive and 77 signals for the  rotate primitive. Fig. 2 shows the normalized velocity trajectories in time domain for each primitive. To better illustrate the trajectory profiles, the 2D histograms of these signals are also shown. As can be seen, the histograms roughly have a bell-shaped profile and they therefore in general agree with the predictions of the models in the literature [23, 24, 25, 26]. However, to formally define the primitive features for the model for the reach and rotate

primitives, we decided to directly use the experiment data. Formally, we define the bell-shaped primitive profile as the most probable time trajectory in the collected data based on the 2D histograms in Figs.

(a)a and (b)b. In calculating the velocity profiles, we applied a smoothing filter on the maximum probable trajectory. Fig. 3 shows the calculated primitive profiles.

The bell-shapes defined in this way can be instantiated with several parameters to match the physical trajectory of the hand. These parameters include which is the direction of the motion (or axis of rotation), the magnitude of the motion (distance: or angle: ), start time () and the duration of the motion () (see Tab. IV). Details on how the defined shapes are used for action recognition are provided in the next section.

(a) reach primitive
(b) rotate primitive
Fig. 3: Bell-shaped profiles for the reach and rotate primitives obtained from the human study
Parameters of reach and rotate primitives
: axis of rotation/reaching
: peak or angle of rotation/ distance of reaching
: span of the primitive in time
: start time
TABLE IV: The list of parameters of reach/rotate primitives.
(a) Top View
(b) Side View
Fig. 4: The experimental setup, and the location of the coordinate frame on the table.

Iv-C ValEXP Human Study: Validation of Primitives-based Method for Action Recognition

In this section, we present our method for identifying the manipulation primitives and their associated features directly without the need for annotation and classification.

Based on our analysis in section IV-A, we decided to use the following modalities for action recognition: hand linear velocity, hand rotational velocity, bending of the fingers, and pressure between the hand and the object. Next we describe how these signals are measured.

The experimental site consists of a table with the length of 185cm, width of 90cm, and height of 85cm. Two cameras were used, one placed directly above the middle of the table, and the other to the side of the table. The two cameras are used to track a color marker attached to the hand and are calibrated with resolution. Each camera provides a motion trajectory of the hand with a sampling rate of frames/sec. The resulting trajectories were fused to extract the trajectory of the hand in the reference frame that is set at the middle of the table, as shown in Fig. 4. This trajectory was used to extract the linear velocity of the hand. Fig. 4 shows the top and side views of the experimental site.

In order to measure other multi-modal signals, we developed our own data glove (see [27]). The glove has eighteen flexible pressure sensors (FSR400-402 from Interlink Electronics Inc. [28]) that cover all the active areas identified in [29] for different grasp types. Eight flexible bend sensors (Flexpoint [30]) were placed on the back side of the glove to measure bending of the fingers. An Inertial Measurement Unit (Adafruit 10-DOF IMU) is attached to the back side of the glove to capture the angular velocity data. The data collection was conducted through an Arduino board (atmega 2560 [31]) with sampling rate. Fig. 5 shows the data glove with the location of the FSR and Flex sensors.

(a) Developed data glove
(b) FSR sensor locations
(c) Flex sensor locations
Fig. 5: The developed data glove equipped with pressure sensors, bend sensors, and an IMU.

Iv-C1 Primitive Features

As discussed in Sec. IV-A, primitives are too abstract to be directly use for action recognition. We thus define primitive features that further characterize each primitive; these primitive features can in turn be used for action recognition.

The primitive features are defined based on the collected data. We note two distinctive features of our approach. First, the features are multi-modal. In particular, we use pressure sensor data and flex sensor data in addition to motion data. Second, the features are defined in such a way that they can be identified in the data directly using signal processing, no learning is employed.

The modalities used for action recognition include four different types of signals: three linear velocities, three angular velocities, bending of the fingers, and the pressure between the hand and the object. The detailed description of all the components is given in Table V.

Multi-Modal Signal Components
Hand linear velocity , ,
Hand angular velocity , ,
Active areas force ,…,
Fingers joints bend ,…,
TABLE V: All the collected signals during the experiments.

We next describe features of each modality in turn. We start with the grasp/release primitive features. First, the norm of the dimensional vector formed by all the pressure senor values is computed. We will refer to this norm as the force signal. The average of the maximum of force signals in the train set is calculated. The features are then associated with the transition through a specific force level, and the direction of the transition. We quantized the force into three high, medium and low levels which are 75%, 45%, and 15% of the calculated average. If the force signal crosses through one of the levels with positive slope, the primitive feature is recorded as grasp primitive, otherwise if the force signal passes one of the levels while decreasing, it is release primitive. The grasp primitive features are shown by , , and and release primitive features are , , which indicate the crossing level of low, medium and high respectively.

Primitive Primitive Features
reach , , , , ,
rotate , , , , ,
grasp , ,
release , ,
bend , ,
extend , ,
TABLE VI: The primitives features.

Similar approach is used to associate features with the finger bending signals. That is, first the norm of the vector of the sensor measurements is computed to obtain the composite bend signal. Next, the primitive features are defined by the composite bend signal and the direction of transition through the levels. As before, the high, medium and low levels are used. Transition through one of these levels in the positive (rising) direction corresponds to the bend primitive features (, , ), while transition through one of these levels in the negative (falling) direction corresponds to the extend primitive features (, , ) respectively.

To assign primitive features to the linear velocity signal, we use the bell-shaped motion profile for reach primitive shown in Fig. (a)a obtained from the MotionExp human study. To extract the bell-shapes from the signal, first all the signal peaks with corresponding magnitude and location within the original signal are detected. Then, we assign the proposed profile for reach to each peak at it’s location with it’s magnitude. In this way, the original signal is reconstructed by the resulted bell-shapes. By employing the optimization toolbox of MATLAB, the best approximation of the signals is obtained by finding the optimal duration of each bell-shape. The cost function of the optimization function is the mean square error(), defined for original signal and it’s approximated ones with the bell-shape profiles. In this way, the reach primitive features as well as their associated parameters are identified within the signal in each of the three directions (, and ). We follow the same approach for the angular velocity signal and the proposed profile for rotate primitive shown in Fig. (b)b. We discard the bell-shapes that represent the movement with distance/angle of less than /.

To remove the dependency of the features on the magnitude, we only consider the direction (positive or negative) of the velocity. That is, each identified bell-shape is abstracted with one of two different symbols. For the linear velocity signal in direction (), we present the two features as “:move up” and “: move down”. Similarly, we defined ‘: move forward” and “: move backward”. For side motion in axis, we have two possible primitive features and . For angular velocity signal, we consider two primitives features for each direction, one for negative and one for positive: , , , , , . The final list of manipulation primitives with their corresponding primitives features is provided at Table VI.

Further, we combine the bell-shapes in different directions which occur at the same time or very close to each other. To do so, given a bell-shape in a particular direction, if a bell shape in another direction occurs within the first half of its duration, we combine the features []. Since each combination represents a motion, the order of the primitive features does not matter in it. Therefore we reorder them according to their lexicographical order. This is done separately for the reach and rotate primitive features.

Manipulation Primitive Feature Raw Feature
Close Drawer (CD) 1.00 1.00 1.00 0.67 0.67 0.53
Open Drawer (OD) 0.93 0.89 1.00 0.74 0.62 0.29
Close Cabinet (CC) 0.84 0.90 1.00 0.89 0.50 0.50
Open Cabinet (OC) 0.89 0.82 0.95 0.90 0.64 0.56
Pick/Place (P-P) 0.78 0.71 0.70 0.62 0.67 0.32
Spray (Sy) 1.00 1.00 1.00 0.84 0.95 0.43
Stir (Sr) 0.93 0.86 0.77 0.88 0.55 0.22
Pour (Pr) 0.71 0.88 0.86 0.71 0.35 0.71
Overall 0.88 0.88 0.91 0.79 0.62 0.44
TABLE VII: Comparison of recognition result for three different sequential model with primitives features and raw data feature in terms of F1-score for each action.

Using the proposed library of manipulation primitives and their corresponding features, human manipulation action can be recognized by performing the following two steps: first one needs to mark the multi-modal human data including motion and force with the defined primitive features (a total of 24 symbols). Next, a stochastic temporal model needs to be trained to recognize the sequence of these abstract primitive features to recognize the manipulation actions. It should be added that these features remove the dependency of the recognition system to the absolute value of the continuous signal.

Iv-C2 Recognition Model

To validate the proposed methodology, we collected human multi-modal data for a subset of manipulation actions observed in ADLs. We considered a set of eight common object manipulation actions that had been observed in the NatExp human study: open/close a drawer, open/close a cabinet, pick and place a ladle, stir with a ladle, spray a sprayer bottle, pour a mug. The first column in Table VII lists the selected manipulation actions. To simulate the environment in an apartment, we placed a drawer and a cabinet on the experiment site along with a mug, a sprayer bottle, and a ladle.

Five participants were recruited for the experiment. They were asked to perform each of the manipulation task for four to seven times in an arbitrary order. They performed the actions with an arbitrary starting and ending position, but the initial orientation was always fixed in order to be able to interpret the angular velocity provided by the IMU. For example, in the “pick and place” action, the subject chose the initial position, reached for the object, moved it to an arbitrary new position and retrieved his/her hand. All the actions started with reaching for the object and ended with retrieving the hand. We applied the primitives features extraction method, and found the sequence of them inside each trial. After removing the corrupted trials, overall trials were recorded during the experiment. The collected data for three subjects are used as training set, and the two remaining subjects as the testing set. We train three different sequential models HMM, FC-LSTM, and Conv-LSTM. The abstracted signals via the primitive-features are used as the input of the sequential models. The details of the training of ach model is provide in the following.

Hidden Markov Model (HMM) are among the most common classifiers for small feature-space sequential data. In this approach, each class, i.e. each manipulation action is modeled via a single HMM. Two different topologies (left-right (Bakis) and fully connected (ergodic)) with different number of states are experimented. The one with maximum likelihood is selected for each action. The predicted labels for testing set determined by the maximum likelihood among the HMM models.

The FC-LSTM and Conv-LSTM are both simple neural network based model including the Long short-term memory (LSTM) layer. In the former one, we employ Embedding (dimension:

) and Mask layer to pre-process the data, and one fully-connected (FC) layer (

nodes) with ReLu activation followed by three LSTM layers (dimensions:

, , ) with tanh and hard-sigmoid as activation and recurrent activation respectively and the final FC layer with nodes (each represents a manipulation action) and sigmoid activation for classification. In the latter network, we employ the Embedding layer along with convolutional layer (100 filters, kernel size:

), a max pooling layer, a LSTM layer (dimension:

), and the same output layer. For both of the networks, Dropout layer with rate of

is used before LSTM layer and the loss function is categorical cross entropy and optimizer is Adam with learning rate of


The performance of the aforementioned trained models is presented in Tab. VII

. Additionally, to prove the effectiveness of using primitive-based model compared to using the raw measurements signals, we trained the same three models with raw measurements signals this time. Since the raw data signals include continuous values, we employed the HMMs with Gaussian Mixture Models (GMMs). Moreover, in LSTM-based models the Embedding and Mask layers (if existed) are replaced with a FC layer (

nodes) to process the continuous data. In Conv-LSTM network, in addition to convolutional layer (dimension: 300), three LSTM layer (dimensions: 325, 220, and 220) are used. We used Dropout layer with ratio of 0.5 before each LSTM layer. The hyper-parameters of the models are optimized accordingly.

Fig. 6:

The confusion matrix of FC-LSTM trained with primitive-based model

The results clearly reflect the success of the proposed primitive-based method compared to using raw data. It also implies that using raw data without any treatment like extracting hand-crafted features is not sufficient for training the recognition models.

The confusion matrix of FC-LSTM model via primitive-based method is provided in Fig. 6. The results from this model and others that trained using the same method imply that the primitive-based method abstract the raw continuous measurements data space into different meaningful and parameter-free symbols, and show better performance in recognition of human manipulation actions. Based on the confusion matrix, the “pick and place a ladle” actions are confused with “pouring a mug” and “string a ladle” actions because these actions are very close to each other in terms of generated signals.

V visual-data-flow approach

In this section, we propose the purely data-driven method to recognize the human manipulation actions. We combine the extracted visual features from the videos with other modalities signals including motion of the hand (angular velocity and linear velocity), bend (fingers joints angles), and force (resulted pressure from active areas of the hand) to train the classifiers to perform the action recognition.

Classifiers Resnet-v2-152 Inception-Resnet-v2
20 frames 83.36 81.78 81.66 89.24 82.51 82.54 89.69 84.53
70 frames 85.97 84.26 82.23 91.06 87.42 87.06 89.13 89.39
Combined 86.29 88.71 88.50 90.65 87.42 93.92 91.17 92.57
TABLE VIII: Recognition result for different non-sequential and sequential classifiers trained with extracted visual features for 20 frames, 70 frames and combined features (with 70 frames) in terms of overall f1-score.

V-a Visual Features from Deep Convolutional Neural Networks

During the recent years, Deep Learning is used as a powerful technique for image classification tasks in computer vision; and various Deep Convolutional Neural Networks (DCNN) architectures have been proposed for it. In this paper, we take advantage of two recent proposed state of the art architectures for ImageNet Large Scale Visual Recognition Competition (ILSVRC)

[7], and extract visual features from the pre-trained checkpoint. We employ Inception-ResNet-v2 which is the state of the art among the Inception

series networks. The difference between this network and previous Inception architectures is addition of residual connection to inception blocks which improves the learning 

[32]. We also exploit a recent network from ResNet networks, ResNet-v2-152 which has layers. The primary version of these deep residual networks are introduced by [33]

and proven to improve learning on image classification. In the second version series network, batch normalization has been added to weight layers 

[34]. All the networks are implemented by python under TensrorFlow platform.

To prepare the data for feature extraction from the pre-trained networks, we first transform all the videos to the frame images. We select the videos from side view camera. Afterward, the images for each trial are fed into each network separately. Since we are only interested in features extraction, the output of the last fully connected layer which does the classification is not useful, therefore, we use the fully connected layer before the last layer. The numbers of the nodes in this layer for Inception-ResNet-v2 is which results in the same amount of features per image. Due to the reason that the layer before last layer in ResNet-v2-152 is not fully connected layer, we extract the features from the last pooling layer which result in features per image.

After the feature extraction from each frame of the video of the trials, we utilize them to train the classifiers for human manipulation actions recognition. Two methods are employed for training the classifiers: non-sequential and sequential.

In order to train the non-sequential classifiers based on the extracted visual features, we need to select the particular number of frames from each recorded video in each trial, i.e. the number of features from all trials should be the same. To do that, we exploit an automatic clustering algorithm, K-means [35], to group the images into clusters, and the frames closest to the centroids are selected as the representatives of the video. Two different numbers ( and ) are tried as the number of the clusters (K). In other words, we select and

frames per trial. In non-sequential method, the extracted visual features for selected frames are concatenated for each video based on their sequence. Two different classifiers including Support Vector Machines (SVM) and a neural network model (Multi-layer Perceptron (MLP)) are trained for action recognition using the concatenated visual features.

In addition to non-sequential method, we train two sequential models, FC-LSTM2 and Conv-LSTM2 networks. The former network contains two LSTM layers with tanh and hard-sigmoid as activation and recurrent activation (dimensions: , ) and a FC layer with ReLu activation ( nodes) in between. In addition, two Dropout layers (ratio: ) are added after FC layer and the last LSTM layer. The same as the previous networks, the last FC layer which is responsible for classification has

nodes and sigmoid activation function with the same training hyper-parameters.

The latter network contains one convolutional layer with 300 filters and kernel size of 3 followed by max pooling layer and two LSTM layers (dimension: , ). The last layer is FC layer with nodes for classification. Two Dropout layers are added before each LSTM layer. The training parameters are the same as the training parameters mentioned in section IV-C2.

The result of the recognition for both feature sets extracted from the two networks with and frames is provided in the first two rows of Table VIII. As the results imply the vision-based features are sufficient for action recognition purpose to some extend. This feature extraction method does not require training the deep models, therefore it does not need a huge dataset for training and is particularly good for small datasets. The performance of the models are close to those of the primitive-based models. In the sequential models for both feature sets from two different networks, the performance of the models with frames and frames are close to each other (except the Conv-LSTM2 with Inception-Resnet-v2 features), however in the non-sequential models the performance of the models with 20 frames is less than those using 70 frames.

t-value p-value
FC-LSTM2 -1.1148 0.3156
Conv-LSTM2 -1.1284 0.3022
MLP -2.3475 0.0657
SVM -4.0304 0.0100
TABLE IX: Significant test result on different classifiers using two different feature sets: vision-based features and combined features in terms of f1-score repeated for 6 times.

V-B Multi-Modal data-flow Approach

In previous section, we used only visual features for training the classifiers, in this section, we train the same models with the vision-based features along with the non-vision features including the hand linear and angular velocities, the norm of the force and bend signals (combined data). Here, we use the visual features extracted from videos for frames since they show similar or better performance than frames. The same as the previous section, for non-sequential method, we require to have equal number of features from each trials. Therefore, we concatenate the features from

selected frames with the padded non-vision features to the maximum length of the dataset and train MLP and SVM classifiers with them. For sequential models, we used the same

selected frames from videos, and added their corresponding non-vision features associated with that frame to them. We train the same LSTM-based models that explained in the previous section.

The train and the test set is similar to previous sections. The performance evaluation is presented in the last row of Table VIII

. The primary result suggests that adding the non-vision features may be beneficial for training some of the classifiers, however to generalize this finding, we need to show it statistically. The characteristics of different classifiers are distinct and they may process the features in completely different manner. In order to conclude the fact that adding non-vision features to the visual data improves the recognition performance significantly, we conducted significance test (t-test).

We used the visual features from ResNet-v2-152 with frames and trained all the aforementioned classifiers with three subjects and tested it with two remaining subjects. For each classifier, we replicated the procedure by shuffling the subjects in testing and training (cross-validation among the subjects trials). We trained each classifier in two modes: using combined features and using only visual features. The f1-score is utilized as the metric to measure the performance of the classifier in each mode. The performance results are provided in Tab. IX.

The null hypothesis indicates that the performance of the recognition model using two different feature sets (visual features/combined features) does not significantly differ from each other. To implement the one sample t-test, we calculate the difference between the f1-scores resulted from each classifier in each mode. The hypothesized population mean is equal to zero. The resulted p-value and t-value are presented in Table

IX. Based on the reported numbers for p-values, considering 0.05 significance level, the recognition rate improvement by adding the non-visual features is not statistically significant. The t-test results imply that the vision-based features are sufficient for three classifiers out of four. In other words, in the action recognition step, using the video recorded by cameras results in satisfactory performance.

Vi Conclusion and Future Work

In this paper, we studied how multi-modal data recorded during human object manipulation action can inform robot recognition of such actions, and how it can be used by the robot to learn how to reproduce them. We proposed two different approaches for action recognition, a primitive-based approach that is based on explicit physical models, and a visual-data-flow approach that only relies on the raw data.

In the primitive-based method, twenty four multi-modal primitives were defined using both physical models and the insights gleaned from the collected data. The multi-modal data used for this part of our study consists of hand motion data, applied forces as represented by the pressure patterns on the hand, and measurements of the bending of the fingers. The distinguishing characteristics of the proposed primitives is that they can be directly mapped onto the collected data, no classification needs to take place. In particular, the annotation and classifier training are not needed. Once the data had been segmented into the primitives, the sequence of the primitives was used to recognize the manipulation actions. Since the proposed primitives are explicitly modeled, they eliminate the over-fitting of the model and provide an abstract representation that does not depend on the absolute value of the signals. The comparison between the models trained with and without the primitives suggests that adding the information on the physics of the task to the recognition model is beneficial.

In the visual-data-flow method, we used vision in addition to the multi-modal data used previously. We extracted the raw visual features from recorded video using pre-trained deep convolutional neural network (DCNN), eliminating the need for developing a separate feature extraction methodology. The visual features were used to train several different classifiers both with and without using the multi-modal data. The results indicate that as concerns the recognition only, adding multi-modal data does not result in a statistically significant change in the classifier performance.

We show that both approaches produce a comparable classifier performance. This implies that image-based methods can successfully recognize human actions during human-robot collaboration. On the other hand, in order to provide training data for the robot so it can learn how to perform object manipulation actions, multi-modal data provides a better alternative.

Our results suggest several possible avenues for future research. While the visual-data-flow method can successfully recognize manipulation actions, in order to better control the interaction it may be necessary to also recognize the action primitives. We thus plan to study how the primitive-based method can provide the annotation labels to the visual-data-flow method and in turn generate a classifier without any human intervention. Another direction would be to move from object manipulation actions to higher-level tasks. This involves both choosing a task representation formalism as well as finding an appropriate planning and execution framework.


  • [1] H. Kjellström, J. Romero, D. Martínez, and D. Kragić, “Simultaneous visual recognition of manipulation actions and manipulated objects,” Computer Vision–ECCV 2008, pp. 336–349, 2008.
  • [2] H. Kjellström, J. Romero, and D. Kragić, “Visual object-action recognition: Inferring object affordances from human demonstration,” Computer Vision and Image Understanding, vol. 115, no. 1, pp. 81–90, 2011.
  • [3] M. Patel, C. H. Ek, N. Kyriazis, A. Argyros, J. V. Miro, and D. Kragic, “Language for learning complex human-object interactions,” in Robotics and Automation (ICRA), 2013 IEEE International Conference on, pp. 4997–5002, IEEE, 2013.
  • [4] I. S. Vicente, V. Kyrki, D. Kragic, and M. Larsson, “Action recognition and understanding through motor primitives,” Advanced Robotics, vol. 21, no. 15, pp. 1687–1707, 2007.
  • [5] Sanmohan and V. Krüger, Primitive Based Action Representation and Recognition, pp. 31–40. Berlin, Heidelberg: Springer Berlin Heidelberg, 2009.
  • [6] V. Kruger, D. L. Herzog, S. Baby, A. Ude, and D. Kragic, “Learning actions from observations,” IEEE Robotics & Automation Magazine, vol. 17, no. 2, pp. 30–43, 2010.
  • [7] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
  • [8] E. A. Fleishman, M. K. Quaintance, and L. A. Broedling, Taxonomies of human performance: The description of human tasks. Academic Press, 1984.
  • [9] V. I. Pavlovic, R. Sharma, and T. S. Huang, “Visual interpretation of hand gestures for human-computer interaction: A review,” IEEE Transactions on pattern analysis and machine intelligence, vol. 19, no. 7, pp. 677–695, 1997.
  • [10] I. M. Bullock, R. R. Ma, and A. M. Dollar, “A hand-centric classification of human and robot dexterous manipulation,” IEEE transactions on Haptics, vol. 6, no. 2, pp. 129–144, 2013.
  • [11] A. F. Bobick, “Movement, activity and action: the role of knowledge in the perception of motion,” Philosophical Transactions of the Royal Society of London B: Biological Sciences, vol. 352, no. 1358, pp. 1257–1265, 1997.
  • [12] N. Hogan and D. Sternad, “Dynamic primitives in the control of locomotion,” Frontiers in computational neuroscience, vol. 7, 2013.
  • [13] C. Fang and X. Ding, “A set of basic movement primitives for anthropomorphic arms,” in Mechatronics and Automation (ICMA), 2013 IEEE International Conference on, pp. 639–644, IEEE, 2013.
  • [14] F. Meier, E. Theodorou, F. Stulp, and S. Schaal, “Movement segmentation using a primitive library,” in Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on, pp. 3407–3412, IEEE, 2011.
  • [15] A. Fod, M. J. Matarić, and O. C. Jenkins, “Automated derivation of primitives for movement classification,” Autonomous robots, vol. 12, no. 1, pp. 39–54, 2002.
  • [16]

    B. Lim, S. Ra, and F. C. Park, “Movement primitives, principal component analysis, and the efficient generation of natural motions,” in

    Robotics and Automation, 2005. ICRA 2005. Proceedings of the 2005 IEEE International Conference on, pp. 4630–4635, IEEE, 2005.
  • [17] O. C. Jenkins and M. J. Mataric, “Deriving action and behavior primitives from human motion data,” in Intelligent Robots and Systems, 2002. IEEE/RSJ International Conference on, vol. 3, pp. 2551–2556, IEEE, 2002.
  • [18] B. H. Williams, M. Toussaint, and A. J. Storkey, “Extracting motion primitives from natural handwriting data,” in International Conference on Artificial Neural Networks, pp. 634–643, Springer, 2006.
  • [19] T. Inamura, I. Toshima, H. Tanie, and Y. Nakamura, “Embodied symbol emergence based on mimesis theory,” The International Journal of Robotics Research, vol. 23, no. 4-5, pp. 363–377, 2004.
  • [20] M. Alvarez, J. R. Peters, N. D. Lawrence, and B. Schölkopf, “Switched latent force models for movement segmentation,” in Advances in neural information processing systems, pp. 55–63, 2010.
  • [21] L. Chen, M. Javaid, B. Di Eugenio, and M. Žefran, “The roles and recognition of haptic-ostensive actions in collaborative multimodal human–human dialogues,” Computer Speech & Language, vol. 34, no. 1, pp. 201–231, 2015.
  • [22] M. Kipp, “Anvil-a generic annotation tool for multimodal dialogue,” in Seventh European Conference on Speech Communication and Technology, 2001.
  • [23] T. Flash and N. Hogan, “The coordination of arm movements: an experimentally confirmed mathematical model,” Journal of neuroscience, vol. 5, no. 7, pp. 1688–1703, 1985.
  • [24] Y. Uno, M. Kawato, and R. Suzuki, “Formation and control of optimal trajectory in human multijoint arm movement,” Biological cybernetics, vol. 61, no. 2, pp. 89–101, 1989.
  • [25] E. Nakano, H. Imamizu, R. Osu, Y. Uno, H. Gomi, T. Yoshioka, and M. Kawato, “Quantitative examinations of internal representations for arm trajectory planning: minimum commanded torque change model,” Journal of Neurophysiology, vol. 81, no. 5, pp. 2140–2155, 1999.
  • [26] C. M. Harris and D. M. Wolpert, “Signal-dependent noise determines motor planning,” Nature, vol. 394, no. 6695, p. 780, 1998.
  • [27] B. Abbasi, M. Sharifzadeh, E. Noohi, S. Parastegari, and M. Žefran, “Grasp taxonomy for robot assistants inferred from finger pressure and flexion,” in 2019 International Symposium on Medical Robotics (ISMR), pp. 1–7, IEEE, 2019.
  • [28] “Fsr.” https://www.interlinkelectronics.com/fsr-400.
  • [29] B. Abbasi, E. Noohi, S. Parastegari, and M. Žefran, “Grasp taxonomy based on force distribution,” in Robot and Human Interactive Communication (RO-MAN), 2016 25th IEEE International Symposium on, pp. 1098–1103, IEEE, 2016.
  • [30] “Flexpoint.” http://www.flexpoint.com/technicalDataSheets/FlexpointBrochure1.pdf.
  • [31] “Arduino mega 2560.” https://www.arduino.cc/en/Guide/ArduinoMega2560.
  • [32] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4, inception-resnet and the impact of residual connections on learning.,” in AAAI, vol. 4, p. 12, 2017.
  • [33] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in

    Proceedings of the IEEE conference on computer vision and pattern recognition

    , pp. 770–778, 2016.
  • [34] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in European Conference on Computer Vision, pp. 630–645, Springer, 2016.
  • [35] J. A. Hartigan and M. A. Wong, “Algorithm as 136: A k-means clustering algorithm,” Journal of the Royal Statistical Society. Series C (Applied Statistics), vol. 28, no. 1, pp. 100–108, 1979.