Predicting Ergonomic Risks During Indoor Object Manipulation Using Spatiotemporal Convolutional Networks

Automated real-time prediction of the ergonomic risks of manipulating objects is a key unsolved challenge in developing effective human-robot collaboration systems for logistics and manufacturing applications. We present a foundational paradigm to address this challenge by formulating the problem as one of action segmentation from RGB-D camera videos. Spatial features are first learned using a deep convolutional model from the video frames, which are then fed sequentially to temporal convolutional networks to semantically segment the frames into a hierarchy of actions, which are either ergonomically safe, require monitoring, or need immediate attention. For performance evaluation, in addition to an open-source kitchen dataset, we collected a new dataset comprising twenty individuals picking up and placing objects of varying weights to and from cabinet and table locations at various heights. Results show very high (87-94)% F1 overlap scores between the ground truth and predicted frame labels for videos lasting over two minutes and comprising a large number of actions.


I Introduction

One of the key considerations for viable human-robot collaboration (HRC) in industrial settings is safety. This consideration is particularly important when a robot operates in close proximity to humans and assists them with certain tasks in increasingly automated factories and warehouses. Therefore, it is not surprising that a lot of effort has gone into identifying and implementing suitable HRC safety measures [1]. Typically, the efforts focus on designing collaborative workspaces to minimize interferences between human and robot activities [2], installing redundant safety protocols and emergency robot activity termination mechanisms through multiple sensors [2], and developing both predictive and reactive collision avoidance strategies [3]. These efforts have resulted in the expanded acceptance and use of industrial robots, both mobile and stationary, leading to increased operational flexibility, productivity, and quality [4].

However, ergonomic safety of the collaborating humans is an extremely important related topic that has not received much attention until recently. Unlike other commonly considered safety measures, a lack of ergonomic safety does not lead to immediate injury concerns or fatality risks. It does, however, cause or increase the likelihood of long-term injuries in the form of musculoskeletal disorders (MSDs) [5]. According to a recent report by the U.S. Bureau of Labor Statistics, there were 349,050 reported cases of MSDs in 2016 just in the U.S. [6], leading to tens of billions of dollars in healthcare costs.

Therefore, we need to develop collaborative workspaces that minimize the ergonomic risks due to repetitive and/or physically strenuous tasks involving awkward human postures such as bending, stretching, and twisting. Keeping this need in mind, researchers have started working on defining suitable ergonomic safety measures, formulating cost functions for the collaborating robots (co-bots) to optimize over the joint human-robot action spaces, using the right fidelity human motion model, and learning to deal with modeling errors and operational uncertainties (see the various topics presented at a recently organized workshop [7]).

Here, we focus on developing an automated real-time ergonomic risk assessment framework for indoor object manipulation. The goal of this framework is to enable the co-bots to take over a majority of the risky manipulation actions, allowing the humans to engage in more cognitively challenging tasks or supervisory control activities. However, the conventional ergonomic risk assessment methods are based on observations and self-reports, making them error-prone, time-consuming, and labor-intensive [8]. Alternatively, RGB-D sensors are placed everywhere in the workspaces and motion sensors are attached to different body parts of the co-workers [9], leading to expensive and unwieldy solutions. More recently, computer vision techniques have been applied to both images and videos [10], and interactive 3D simulation followed by automated post-processing has been employed [11] to realize efficient and convenient solutions. We adopt this last approach in this work without relying on any kind of post-processing.

Unsurprisingly, quite a few researchers have used deep learning over the past couple of years to assess the risks of performing occupational tasks. In particular, this area of research has become quite popular in the construction industry. Recent works include that by Fang et al. [12], who presented a Faster Region-Convolutional Neural Network (Faster R-CNN) method for automated detection of non-hardhat-use from far-field surveillance videos. Ding et al. [13] used a hybrid deep learning method to determine unsafe human behaviors at construction sites. Fang et al. [14] developed another Faster R-CNN method to identify workers who were not wearing harnesses from a large database of construction site images. Outside of the construction sector, Abobakr et al. [15] employed deep residual CNNs to predict the joint angles of manufacturing workers from individual camera depth images. Mehrizi et al. [16] proposed a multi-view based deep perceptron approach for markerless 3D pose estimation in the context of object lifting tasks. While all these works present useful advances and report promising performances, they do not provide a general-purpose framework to assess or predict the ergonomic risks for any representative set of object manipulation tasks commonly performed in the industry.

Here, we present a first-of-its-kind end-to-end deep learning system and a new evaluation dataset to address this shortcoming for indoor manipulation. Our learning methods are adapted from the video action detection and segmentation literature. In action detection, a sparse set of action segments is computed, where the segments are defined by start times, end times, and class labels. For example, Singh et al. [17] proposed the use of a Bi-directional Long Short Term Memory (Bi-LSTM) layer after feature extraction from a multi-stream CNN to reduce detection errors. In action segmentation, an action is predicted for every video frame. Representative works include that by Fathi et al. [18], who showed that state changes at the start and end frames of actions provided good segmentation performance. Kuehne et al. [19] used reduced Fisher vectors for visual representation of every frame, which were then fitted to Gaussian mixture models. Huang et al. [20] presented a temporal classification framework in the case of weakly supervised action labeling. While there are overlaps between these two forms of video processing, our problem closely resembles the latter. We, therefore, employ state-of-the-art segmentation methods that have been shown to work well on challenging datasets.

II Ergonomic Risk Assessment Model

We use a well-established ergonomic model, known as the rapid entire body assessment (REBA) model [21], which is popularly used in the industry. The REBA model assigns scores to the human poses, within a range of 1-15, on a frame-by-frame basis by accounting for the joint motions and angles, load conditions, and activity repetitions. An action with an overall score of less than 3 is labeled as ergonomically safe, a score of 3-7 is deemed to be medium risk that requires monitoring, and every other action is considered high risk that needs attention.
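The three risk levels described above can be written as a small lookup. The following is a minimal sketch, assuming that the 3-7 band is inclusive at both ends; the function name and the returned strings are illustrative and not part of the REBA standard.

```python
def reba_risk_category(score: float) -> str:
    """Map an overall REBA score (range 1-15) to one of three risk levels.

    Assumption: the medium band includes both 3 and 7.
    """
    if not 1 <= score <= 15:
        raise ValueError("REBA scores are expected to lie in the range 1-15")
    if score < 3:
        return "safe"      # ergonomically safe
    if score <= 7:
        return "medium"    # requires monitoring
    return "high"          # needs attention
```

Because the function accepts non-integer values, it can also be applied to scores that have been aggregated over frames.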

Skeletal information for the TUM Kitchen dataset is available in the bvh (Biovision Hierarchy) file format. We use the bvh parser from the MoCap Toolbox [22] in MATLAB to read this information as the XYZ coordinates of thirty three markers (joints and end sites) corresponding to every frame. For the UW IOM dataset, the positions of twenty five different joints are recorded directly in the global coordinate system for each frame using the Kinect sensor with the help of a toolbox [23] that links Kinect and MATLAB. For every frame, the vectors corresponding to different body segments such as fore-arm, upper-arm, leg, thigh, lower half spine, upper half spine, and so on, are computed. The extension, flexion, and abduction (as applicable) of the various body segments are computed by taking the projection of the two body segment vectors that constitute the angle on the plane of motion. These angles of extension, flexion, and abduction are used to assign the trunk, neck, leg, upper arm, lower arm, and wrist scores [21].
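The projection step can be illustrated as follows. This is a minimal sketch rather than the MATLAB implementation used here: it assumes that the plane of motion is specified by its normal vector and that body segment vectors are given as XYZ tuples. The helper names and the axis convention in the usage note are hypothetical.

```python
import math

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project_onto_plane(v, normal):
    """Remove from v its component along the plane normal."""
    k = _dot(v, normal) / _dot(normal, normal)
    return tuple(x - k * n for x, n in zip(v, normal))

def planar_angle_deg(seg_a, seg_b, plane_normal):
    """Angle (degrees) between two segment vectors after projecting both
    onto the plane of motion defined by plane_normal."""
    pa = project_onto_plane(seg_a, plane_normal)
    pb = project_onto_plane(seg_b, plane_normal)
    na, nb = math.sqrt(_dot(pa, pa)), math.sqrt(_dot(pb, pb))
    if na == 0 or nb == 0:
        raise ValueError("a segment is perpendicular to the plane of motion")
    # Clamp to guard against rounding just outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, _dot(pa, pb) / (na * nb)))
    return math.degrees(math.acos(cos_angle))
```

For example, with x pointing sideways, y pointing up, and z pointing forward (an assumed convention), trunk flexion could be estimated as `planar_angle_deg(trunk_vector, (0, 1, 0), (1, 0, 0))`, which ignores any sideways lean of the trunk.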

We define three different thresholds as a part of our implementation, namely, zero threshold, binary threshold, and abduction threshold. Zero threshold is used for trunk bending, such that any trunk bending angle less than this value is regarded as no bending. Binary threshold is defined to answer whether the trunk is twisted and/or side flexed. Trunk twisting and trunk side flexion less than this value are ignored. Abduction threshold, though similar to the binary threshold, is separately defined for shoulder abduction given the considerably larger allowable range of abduction (about 150°) as against a smaller allowable range of trunk twisting. Due to the non-availability of rotation information of the neck, we assume that the neck is twisted when the trunk is twisted. The performed actions do not involve arm rotations, which are therefore ignored while computing the upper arm score.
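A minimal sketch of how the three thresholds could gate the posture flags is given below. The default values are the ones reported in the implementation details (5°, 10°, and 30°); the class and function names are hypothetical, and treating an angle exactly equal to a threshold as exceeding it is an assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Thresholds:
    zero_deg: float = 5.0        # trunk bending below this counts as no bending
    binary_deg: float = 10.0     # trunk twist / side flexion below this is ignored
    abduction_deg: float = 30.0  # shoulder abduction below this is ignored

def trunk_is_bent(flexion_deg, t=Thresholds()):
    return abs(flexion_deg) >= t.zero_deg

def trunk_is_twisted_or_side_flexed(twist_deg, side_flexion_deg, t=Thresholds()):
    return abs(twist_deg) >= t.binary_deg or abs(side_flexion_deg) >= t.binary_deg

def shoulder_is_abducted(abduction_deg, t=Thresholds()):
    return abs(abduction_deg) >= t.abduction_deg

def neck_is_twisted(trunk_twist_deg, t=Thresholds()):
    # No neck rotation data is available, so the neck is assumed to be
    # twisted whenever the trunk is twisted.
    return abs(trunk_twist_deg) >= t.binary_deg
```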

The computed trunk, neck, leg, upper arm, lower arm, and wrist scores are used to assign the REBA score on a frame-by-frame basis using lookup tables [21]. The REBA scores assigned for each frame are then aggregated over actions and participants. This aggregated value is considered as the final REBA score for that particular action.

III Deep Learning Models

III-A Spatial Features Extraction

We adapt two variants of the VGG16 convolutional neural network model [24] for spatial feature extraction. The first model is based on the VGG16 model that is pre-trained on the ImageNet database. The second model involves fine-tuning the last two convolutional layers of the VGG16 base. In both models, the flattened output of the last convolutional layer is connected to a fully connected layer with a drop-out of 0.5 and then fed into a classifier. We always use ReLU and softmax as the activation functions, and Adam [26] as the optimizer.

We also use a simplified form of the pose-based CNN (P-CNN) features [27] that only consider the full images and not the different image patches. Optical flow [28] is first computed for each consecutive pair of RGB frames, and the flow map is stored as an image [29]. A motion network, introduced in [29], containing five convolutional and three fully connected layers, is then used to extract frame descriptors for all the optical flow images. Subsequently, the VGG-f network [30], pre-trained on ImageNet, is used to extract another set of frame descriptors for all the RGB images. The VGG-f network also contains five convolutional and three fully connected layers. The two sets of frame descriptors are put together as arrays in the same sequence as that of the video frames to construct motion-based and appearance-based video descriptors, respectively. The appearance and motion-based video descriptors are then normalized and concatenated to form the final video descriptor (spatial features).
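The final normalization and concatenation step can be sketched as below. The type of normalization is not specified here, so per-frame L2 normalization is an assumption, as is trimming both streams to a common length (optical flow is computed on consecutive frame pairs and may yield one descriptor fewer). The function names are hypothetical and the descriptors are plain Python lists for illustration.

```python
import math

def l2_normalize(vec):
    """Scale a descriptor to unit Euclidean norm (assumed normalization)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def build_video_descriptor(appearance, motion):
    """Concatenate per-frame appearance and motion descriptors in frame order.

    appearance: list of per-frame descriptors from the RGB images
    motion:     list of per-frame descriptors from the optical flow images
    Returns one descriptor per frame, so no temporal aggregation takes place.
    """
    n = min(len(appearance), len(motion))  # assumed alignment of the two streams
    return [l2_normalize(appearance[i]) + l2_normalize(motion[i]) for i in range(n)]
```

Keeping one row per frame preserves the order of actions and their transitions, which is what the temporal networks consume.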

III-B Video Segmentation Methods

We use two kinds of temporal convolutional networks (TCNs), both of which use encoder-decoder architectures to capture long-range temporal patterns in videos [31]. In the first network, referred to as the encoder-decoder TCN, or ED-TCN, a hierarchy of temporal convolutions, pooling, and upsampling layers is used. The network does not have a large number of layers, but each layer includes a set of long convolutional filters. We use the ReLU activation function and a categorical cross-entropy loss function with RMSProp [32] as the optimizer. In the second network, termed the dilated TCN, or D-TCN, dilated upsampling and skip connections are added between the layers. We use the gated activation function inspired by WaveNet [33] and the Adam optimizer. We also use two other segmentation methods for comparison purposes. The first method is Bi-LSTM [34], a recurrent neural network commonly used for analyzing sequential data streams. The second method is the support vector machine, or SVM, which is extremely popular for any kind of classification problem.

III-C Video Segmentation Performance Metrics

In addition to frame-based accuracy, which is the percentage of frames labeled correctly for the related sequence as compared to the ground truth (manually annotated), we report edit score and F1 overlap score to evaluate the performance of the various methods. The edit score [35] measures the correctness of the predicted action ordering through the normalized Levenshtein distance between the predicted and ground truth segment sequences. The F1 overlap score [35] combines classification precision and recall to reduce the sensitivity of the predictions to minor temporal shifts between the predicted and ground truth values, as such shifts might be caused by subjective variabilities among the annotators.
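The three measures can be sketched in plain Python as follows. This is an illustrative implementation of the commonly used segment-level definitions, not the evaluation code used for the reported results. The overlap criterion (intersection over union of a predicted and a ground truth segment with the same label) and its threshold value are assumptions, since no threshold is stated here.

```python
from itertools import groupby

def to_segments(labels):
    """Collapse framewise labels into (label, start, end) runs, end exclusive."""
    segments, start = [], 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, start, start + length))
        start += length
    return segments

def frame_accuracy(pred, true):
    """Percentage of frames whose predicted label matches the ground truth."""
    return 100.0 * sum(p == t for p, t in zip(pred, true)) / len(true)

def edit_score(pred, true):
    """100 * (1 - normalized Levenshtein distance) between segment label sequences."""
    p = [s[0] for s in to_segments(pred)]
    t = [s[0] for s in to_segments(true)]
    prev = list(range(len(t) + 1))
    for i, p_label in enumerate(p, 1):
        cur = [i]
        for j, t_label in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                         # deletion
                           cur[j - 1] + 1,                      # insertion
                           prev[j - 1] + (p_label != t_label))) # substitution
        prev = cur
    return 100.0 * (1 - prev[-1] / max(len(p), len(t), 1))

def f1_overlap(pred, true, iou_threshold=0.1):
    """Segment-level F1: a predicted segment is a true positive if it overlaps
    an unmatched ground truth segment of the same label with IoU >= threshold."""
    p_segs, t_segs = to_segments(pred), to_segments(true)
    matched = [False] * len(t_segs)
    tp = 0
    for label, ps, pe in p_segs:
        best_iou, best_j = 0.0, -1
        for j, (t_label, ts, te) in enumerate(t_segs):
            if t_label != label or matched[j]:
                continue
            inter = max(0, min(pe, te) - max(ps, ts))
            union = max(pe, te) - min(ps, ts)
            iou = inter / union
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j >= 0 and best_iou >= iou_threshold:
            tp += 1
            matched[best_j] = True
    precision = tp / len(p_segs) if p_segs else 0.0
    recall = tp / len(t_segs) if t_segs else 0.0
    if precision + recall == 0:
        return 0.0
    return 100.0 * 2 * precision * recall / (precision + recall)
```

With a ground truth of five frames of one action followed by five of another, a prediction that merely shifts the boundary by one frame and a prediction that briefly flickers to the wrong action both reach 90% frame accuracy, but only the flickering one is penalized by the edit and F1 overlap scores.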

IV System Architecture and Datasets

IV-A System Architecture

We develop an end-to-end automated ergonomic risk prediction system as shown in Fig. 1. The figure shows that the prediction works in two stages. In the first stage (top half of the figure), which only needs to be done once for a given dataset, ergonomic risk labels are computed for each object manipulation action class based on the skeletal models extracted from the RGB-D camera videos. Simultaneously, the videos are annotated carefully to assign an action label to each and every frame. These two types of labels are then used to learn a modified VGG16 model for the entire set of available videos. In the second stage (bottom half of the figure), during training, the exact sequence of video frames is fed to the learned VGG16 model to extract useful spatial features. The array of extracted features is then fed in the same order to train a TCN on how to segment the videos by identifying the similarities and changes in the features corresponding to action executions and transitions, respectively. For testing, a similar procedure is followed except that the trained TCN is now employed to segment unlabeled videos into semantically meaningful actions with known ergonomic risk categories. It is possible to replace the VGG16 model by another deep neural network model that also accounts for human motions (e.g., P-CNN) to achieve slightly better segmentation performance at the expense of longer training and prediction times.
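The last step of the second stage, turning framewise action predictions into a timeline of risk categories, can be sketched as follows. The per-action risk lookup stands in for the labels computed in the first stage; the function name, action names, and risk strings in the usage note are hypothetical.

```python
from itertools import groupby

def risk_timeline(frame_actions, action_risk, fps):
    """Group consecutive frames with the same predicted action and attach the
    pre-computed risk category of that action.

    Returns a list of (start_seconds, end_seconds, action, risk) tuples.
    """
    timeline, start = [], 0
    for action, run in groupby(frame_actions):
        n = len(list(run))
        timeline.append((start / fps, (start + n) / fps, action, action_risk[action]))
        start += n
    return timeline
```

For instance, `risk_timeline(["walk"] * 12 + ["pick_box_high"] * 24, {"walk": "safe", "pick_box_high": "high"}, fps=12)` yields one safe interval of one second followed by a two-second high risk interval that a co-bot could be asked to take over.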

Fig. 1: End-to-end ergonomic risk prediction system

IV-B Datasets

IV-B1 TUM Kitchen Dataset

The TUM Kitchen dataset [36] consists of nineteen videos, at twenty-five frames per second, taken by four different monocular cameras, numbered from 0 to 3. Each video captures regular actions performed by an individual in a kitchen involving walking, picking up, and placing utensils to and from cabinets, drawers, and tables. The average duration of the videos is about two minutes. The dataset also includes skeletal models of the individual through 3D reconstruction of the camera images. These models are constructed using a markerless full body motion tracking system through hierarchical sampling for Bayesian estimation and layered observation modeling to handle environmental occlusions [37]. We categorize the actions into twenty-one classes or labels, where each label follows a two-tier hierarchy with the first tier indicating a motion verb (close, open, pick-up, place, reach, stand, twist, and walk) and the second tier denoting the location (cabinet, drawer) or mode of object manipulation (do not hold, hold with one hand, and hold with both hands).

IV-B2 University of Washington Indoor Object Manipulation Dataset

Considering the dearth of suitable videos capturing object manipulation actions involving awkward poses and repetitions, we collected a new University of Washington Indoor Object Manipulation (UW IOM) dataset using an IRB-approved study. The dataset comprises videos of twenty participants within the age group of 18-25 years, of which fifteen are males and the remaining five are females. The videos are recorded using a Kinect Sensor for Xbox One at an average rate of twelve frames per second. Each participant carries out the same set of tasks in terms of picking up six objects (three empty boxes and three identical rods) from three different vertical racks, placing them on a table, putting them back on the racks from where they are picked up, and then walking out of the scene carrying the box from the middle rack. The boxes are manipulated with both the hands while the rods are manipulated using only one hand. The above tasks are repeated in the same sequence three times such that the duration of every video is approximately three minutes. We categorize the actions into seventeen labels, where each label follows a four-tier hierarchy. The first tier indicates whether the box or the rod is manipulated, the second tier denotes human motion (walk, stand, and bend), the third tier captures the type of object manipulation if applicable (reach, pick-up, place, and hold), and the fourth tier represents the relative height of the surface where manipulation is taking place (low, medium, and high). Representative snapshots from one of the videos are shown in Fig. 2. The UW IOM dataset is available for free download and use at:

Fig. 2: Representative video frames depicting actions with different ergonomic risk levels in our own UW IOM dataset

V Experimental Details and Results

V-A Implementation Details

For each participant (video), we first compute the REBA score for all the frames. The zero threshold is set to 5°, the binary threshold to 10°, and the abduction threshold to 30° for these computations. For the UW IOM dataset, we then compute the median of the REBA scores assigned to all the frames belonging to a particular action. We then take the median over all the participants to determine the final REBA score for that action.
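The two-level median aggregation for the UW IOM dataset can be sketched as below, assuming that each participant's video is available as a list of (action label, framewise REBA score) pairs. The function name and the input layout are hypothetical.

```python
from collections import defaultdict
from statistics import median

def aggregate_reba_median(videos):
    """videos: one list per participant of (action_label, frame_reba_score) pairs.

    Step 1: median over the frames of each action within a participant.
    Step 2: median of those per-participant values over all participants.
    """
    per_action = defaultdict(list)
    for frames in videos:
        by_action = defaultdict(list)
        for action, score in frames:
            by_action[action].append(score)
        for action, scores in by_action.items():
            per_action[action].append(median(scores))
    return {action: median(values) for action, values in per_action.items()}
```

Using medians at both levels keeps a few frames with tracking glitches, or one atypical participant, from dominating the final score of an action.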

The framewise skeletal information available for the TUM Kitchen dataset has a variable lag with respect to the video frames, i.e., the skeleton does not lie exactly on the human pose in the RGB image. Therefore, aggregating over actions and participants according to the RGB image annotations does not result in meaningful REBA scores. We, therefore, reduce the length of both the video annotations of the RGB frames and the framewise REBA scores to 100 using a constant step size of (number of frames)/100 for every video. We then compute the average REBA score for every action in a particular video using the reduced video annotations and framewise scores. The maximum score assigned to a particular action among all the videos is considered as the final REBA score for that action.
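A minimal sketch of this procedure follows. It assumes that sampling with a constant step means taking the element at index floor(i × n/100) for i = 0, ..., 99, and that the annotation and score sequences of a video may have different lengths. The function names and the input layout are hypothetical.

```python
from collections import defaultdict

def downsample(seq, target=100):
    """Pick `target` elements from seq using a constant step of len(seq)/target."""
    step = len(seq) / target
    return [seq[int(i * step)] for i in range(target)]

def tum_action_scores(videos, target=100):
    """videos: list of (frame_annotations, frame_reba_scores) pairs, one per video.

    Per video: reduce both sequences to `target` samples and average the
    scores per action. Across videos: keep the maximum average per action.
    """
    final = {}
    for annotations, scores in videos:
        labels = downsample(annotations, target)
        values = downsample(scores, target)
        totals = defaultdict(lambda: [0.0, 0])
        for label, value in zip(labels, values):
            totals[label][0] += value
            totals[label][1] += 1
        for label, (total, count) in totals.items():
            final[label] = max(final.get(label, float("-inf")), total / count)
    return final
```

Taking the maximum over videos is the conservative choice: an action is assigned the risk of its worst observed execution.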

The pre-trained VGG16 model for spatial features extraction is trained for 200 epochs with 300 steps per epoch on the TUM Kitchen dataset, and 300 epochs with 300 steps per epoch on the UW IOM dataset with a step-size of 1e-5. The fine-tuned model is trained with the same number of epochs for the TUM Kitchen dataset but with 500 steps per epoch on the UW IOM dataset with 300 steps per epoch and a step-size of 1e-7. The total numbers of training and validation samples for the TUM Kitchen dataset are 24,052 and 5,290, respectively. For the UW IOM dataset, we train over 27,539 samples and validate over 6,052 samples. The models are learned using the Python-based Keras [39] neural network library with the open-source TensorFlow machine learning software library as the backend. To implement the simplified P-CNN model, we modify the MATLAB package provided with [27].

We evaluate the performance of the four segmentation methods by splitting our datasets into five splits, in each of which the videos are assigned randomly to mutually exclusive training and test sets of fixed sizes. For both the TCN methods, training is terminated after 500 epochs in each of the splits as the validation accuracy stops improving afterward. We use a learning rate of 0.001 for both the methods. D-TCN includes five stacks, each with three layers, and a set of {32, 64, 96} filters in each of the three layers. Filter duration, defined as the mean segment duration for the shortest class from the training set, is chosen to be 10 seconds. Similarly, training for Bi-LSTM is terminated after 200 epochs for each split as the validation accuracy does not change any further. Bi-LSTM uses the Adam optimizer with a learning rate of 0.001, softmax activation function, and categorical cross-entropy loss function. We choose a linear kernel to train the SVM and use squared hinge loss as the loss function. All the training and testing are done on a workstation running the Windows 10 operating system, equipped with a 3.7 GHz 8-core Intel Xeon W-2145 CPU, a ZOTAC GeForce GTX 1080 Ti GPU, and 64 GB RAM.
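The split protocol can be sketched as follows. The test set size is not stated here, so `n_test` is a free parameter, and the function name and seed are hypothetical; the sketch only illustrates the random assignment of whole videos to mutually exclusive sets of fixed sizes.

```python
import random

def make_splits(video_ids, n_test, n_splits=5, seed=0):
    """Return n_splits (train_ids, test_ids) pairs.

    Within each split, videos are shuffled and the first n_test go to the
    test set, so the two sets never share a video and keep fixed sizes.
    """
    rng = random.Random(seed)  # fixed seed for reproducible splits
    splits = []
    for _ in range(n_splits):
        shuffled = list(video_ids)
        rng.shuffle(shuffled)
        splits.append((shuffled[n_test:], shuffled[:n_test]))
    return splits
```

Splitting by video rather than by frame keeps frames of the same recording from appearing in both the training and the test set.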

V-B Results

V-B1 Ergonomic Risk Assessment Labels

For the TUM Kitchen dataset, fifteen actions are labeled to be medium risk, while the remaining six are deemed as high risk. The high risk actions are associated with closing, opening, and reaching motions, although there is no perfect correspondence due to a lack of fidelity of the skeletal models on which the risk scores are based.

In case of the UW IOM dataset, three actions are labeled as low risk, eleven actions are considered medium risk, and the remaining three are identified as high risk. The high risk actions include picking up a box from the top rack and placing objects (box and rod) on the top rack. Walking without holding any object, walking while holding a box, and picking up a rod from the mid-level rack while standing are regarded as low risk, i.e., safe actions. Fig. 2 shows the corresponding ergonomic risk labels for these different actions depicted in the video snapshots.

V-B2 Video Segmentation Outcomes

Table I provides a quantitative performance assessment of the two variants of our segmentation method on the TUM Kitchen dataset for camera # 2 videos. Both the variants perform satisfactorily with respect to all three performance measures. In fact, the ED-TCN method achieves an F1 overlap score of almost 88%, which has not been previously reported for any action segmentation problem with more than twenty labels to the best of our knowledge. Our TCN methods also outperform Bi-LSTM and SVM substantially. Just for comparison purposes, it is interesting to note that the validation accuracy of image classification is 82.80% and 73.46% using the pre-trained and fine-tuned VGG16 models, respectively.

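| Spatial features | Method | Accuracy (%) | Edit score (%) | F1 overlap (%) |
| --- | --- | --- | --- | --- |
| Pre-trained VGG16 | D-TCN | 73.74 ± 4.57 | 78.7 ± 6.50 | 83.88 ± 4.52 |
| Pre-trained VGG16 | ED-TCN | 74.75 ± 4.08 | 86.34 ± 3.15 | 87.92 ± 2.16 |
| Pre-trained VGG16 | Bi-LSTM | 62.55 ± 6.56 | 44.49 ± 8.67 | 55.11 ± 9.42 |
| Pre-trained VGG16 | SVM | 59.55 ± 4.98 | 35.39 ± 3.00 | 47.75 ± 3.82 |
| Fine-tuned VGG16 | D-TCN | 74.14 ± 4.97 | 80.33 ± 5.41 | 84.44 ± 4.05 |
| Fine-tuned VGG16 | ED-TCN | 74.32 ± 4.06 | 84.96 ± 4.37 | 87.29 ± 2.78 |
| Fine-tuned VGG16 | Bi-LSTM | 62.89 ± 6.17 | 47.15 ± 8.67 | 57.75 ± 9.02 |
| Fine-tuned VGG16 | SVM | 59.81 ± 5.10 | 35.8 ± 3.54 | 47.67 ± 4.35 |

TABLE I: Comparative performance measures of different action segmentation methods on the TUM Kitchen dataset for camera # 2.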

Fig. 3 demonstrates that regardless of whether the spatial features are extracted using a pre-trained or fine-tuned VGG16 architecture, both the TCN methods are able to segment the frames into the correct (or more precisely, same as the manually annotated) actions substantially better than Bi-LSTM and SVM. In fact, the global frame-by-frame classification accuracy value is very high, between 86% and 91%, using the TCN methods. Furthermore, both the TCN methods almost always predict the correct sequence of actions, unlike the other two widely used classification methods.

Fig. 3: Performance comparison of various methods in action segmentation of a representative TUM Kitchen dataset video using (A) pre-trained VGG16 model and (B) fine-tuned VGG16 model. For each method, the upper row shows the ground truth (manually annotated) action labels, whereas the lower row depicts the corresponding predicted label.

The difference in performance between the TCN and the other two segmentation methods is even more pronounced in case of the UW IOM dataset, which includes a larger variety of object manipulation actions. As shown in Table II, SVM performs rather poorly, particularly with respect to edit score and F1 overlap values, owing to over-segmentation and sequence prediction errors. Bi-LSTM performs somewhat better, with the best results obtained using the spatial features generated from a simplified form of P-CNN. Interestingly enough, ED-TCN performs substantially better than D-TCN regardless of the spatial feature extraction method being used. This finding is also consistent with the results for different grocery shopping, gaze tracking, and salad preparation datasets presented in [31]. It happens most likely due to the ability of ED-TCN to identify fine-grained actions without causing over-segmentation by modeling long-term temporal dependencies through max pooling over large time windows. In fact, the edit scores for ED-TCN are close to 90% and the F1 overlap values are more than 93% when we use the fine-tuned VGG16 and P-CNN models. The performance measures are almost identical between the two models, with P-CNN yielding marginally better results. The validation accuracy is 75.97% and 73.86% for the pre-trained VGG16 and fine-tuned VGG16 models, respectively, which are similar to the values for the TUM Kitchen dataset. Fig. 4 reinforces these observations on a representative UW IOM dataset video.

| Spatial features | Method | Accuracy (%) | Edit score (%) | F1 overlap (%) |
| --- | --- | --- | --- | --- |
| Pre-trained VGG16 | D-TCN | 62.11 ± 4.13 | 46.62 ± 4.17 | 57.48 ± 5.26 |
| Pre-trained VGG16 | ED-TCN | 78.76 ± 3.65 | 82.96 ± 3.33 | 87.77 ± 2.51 |
| Pre-trained VGG16 | Bi-LSTM | 42.14 ± 5.45 | 23.76 ± 1.50 | 29.71 ± 3.72 |
| Pre-trained VGG16 | SVM | 27.10 ± 3.40 | 18.05 ± 0.92 | 20.25 ± 1.35 |
| Fine-tuned VGG16 | D-TCN | 61.39 ± 6.22 | 72.29 ± 6.16 | 72.29 ± 6.16 |
| Fine-tuned VGG16 | ED-TCN | 86.46 ± 0.50 | 88.52 ± 1.17 | 93.24 ± 0.58 |
| Fine-tuned VGG16 | Bi-LSTM | 59.23 ± 4.40 | 33.19 ± 3.13 | 43.88 ± 4.23 |
| Fine-tuned VGG16 | SVM | 42.10 ± 3.33 | 20.61 ± 0.89 | 27.56 ± 1.92 |
| Simplified P-CNN | D-TCN | 81.72 ± 2.82 | 74.01 ± 5.13 | 82.23 ± 4.80 |
| Simplified P-CNN | ED-TCN | 87.63 ± 0.77 | 89.90 ± 1.16 | 93.99 ± 0.77 |
| Simplified P-CNN | Bi-LSTM | 71.38 ± 4.97 | 75.33 ± 7.41 | 80.45 ± 7.55 |
| Simplified P-CNN | SVM | 59.62 ± 2.74 | 20.09 ± 0.95 | 31.33 ± 1.75 |

TABLE II: Comparative performance measures of different action segmentation methods on the complete UW IOM dataset

Fig. 4: Performance comparison of various methods in semantic segmentation of a representative UW IOM dataset video using (A) pre-trained VGG16 model and (B) fine-tuned VGG16 model. For each method, the upper row shows the ground truth (manually annotated) action labels, whereas the lower row depicts the corresponding predicted label.

If we only use the spatial features, image classification validation accuracy is either comparable to (for the TUM Kitchen dataset), or lower than the video segmentation test accuracy (for the UW IOM dataset). Noting that validation accuracy is typically greater than test accuracy for any supervised learning problem, we would expect segmentation accuracy to be much lower than the reported values in the absence of the temporal neural networks. On the other hand, segmentation performance depends quite a bit on the choice of the spatial feature extraction model, particularly in the case of the more challenging UW IOM dataset. This reinforces the intuition that both spatial and temporal characteristics are important in analyzing long-duration human action videos.

It is not surprising to observe that the TCN methods perform better using edit score and F1 overlap score as the measure instead of global accuracy. As also reported in [31], accuracy is susceptible to erroneous and subjective manual annotation of the video frames, particularly during the transitions from one action to the next, where identifying the exact frame when one action ends and the next one begins is often open to individual interpretation. Both edit score and F1 score are more robust to such annotation issues as compared to accuracy, and, therefore, serve as better indicators of true system performance.

To further evaluate the general applicability of our action segmentation methods, we consider two additional test scenarios: TUM Kitchen videos taken from camera # 1, and a truncated UW IOM dataset comprising only one sequence of object manipulation actions per participant. Table III reports the action segmentation outcomes using just the fine-tuned VGG16 model since it yields better results than the pre-trained VGG16 model on our regular test datasets. The trends are more or less the same as in our regular datasets. The actual measures are almost identical for the complete and truncated UW IOM datasets. Thus, our methods seem to be robust to sample size, provided all the actions are covered adequately with a sufficient number of instances in the training set, and the actions occur in the same sequence in all the videos. The actual measures for our TCN methods are only slightly lower for the camera # 1 TUM Kitchen dataset. Thus, performances appear to be independent of the manner (camera orientation) in which the videos are recorded. The validation accuracy is equal to 76.81% and 75.28% for the camera # 1 TUM Kitchen and the truncated UW IOM datasets, respectively, which are nearly identical to the corresponding values for the regular TUM Kitchen and complete UW IOM datasets.

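| Dataset | Method | Accuracy (%) | Edit score (%) | F1 overlap (%) |
| --- | --- | --- | --- | --- |
| TUM Kitchen with camera # 1 | D-TCN | 64.00 ± 8.74 | 69.26 ± 6.08 | 72.24 ± 7.49 |
| TUM Kitchen with camera # 1 | ED-TCN | 69.83 ± 4.66 | 84.69 ± 3.49 | 83.43 ± 3.09 |
| TUM Kitchen with camera # 1 | Bi-LSTM | 54.10 ± 9.37 | 40.64 ± 8.06 | 48.33 ± 9.44 |
| TUM Kitchen with camera # 1 | SVM | 48.43 ± 8.87 | 30.34 ± 6.25 | 38.46 ± 7.88 |
| UW IOM with non-repeated action sequence | D-TCN | 74.04 ± 2.53 | 62.91 ± 3.32 | 72.91 ± 3.01 |
| UW IOM with non-repeated action sequence | ED-TCN | 83.99 ± 1.10 | 88.16 ± 2.24 | 92.66 ± 1.72 |
| UW IOM with non-repeated action sequence | Bi-LSTM | 58.93 ± 2.22 | 30.57 ± 2.98 | 41.23 ± 2.71 |
| UW IOM with non-repeated action sequence | SVM | 40.62 ± 1.72 | 21.05 ± 1.94 | 26.98 ± 1.92 |

TABLE III: Comparative performance measures of different action segmentation methods on two additional video datasets.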

V-B3 System Computation Times

In addition to characterizing the goodness of action segmentation, we are interested in knowing how long it takes to learn the spatial feature extraction models, to train the segmentation methods, and to compute the framewise action labels during testing.

The learning times for the pre-trained and fine-tuned VGG16 models are 20,844.11 seconds and 30,564.39 seconds, respectively, in case of the complete UW IOM dataset. As expected, the learning time for the fine-tuned VGG16 model is somewhat lower and equal to 25,414.24 seconds in case of the truncated UW IOM dataset. For the TUM Kitchen dataset, the corresponding value is 31,753.18 seconds.

Using the fine-tuned VGG16 model, in case of the complete UW IOM dataset, the overall training times are 252.73 ± 0.85, 237.76 ± 0.72, 2,172.23 ± 11.22, and 60.54 ± 1.54 seconds across the five data splits for the D-TCN, ED-TCN, Bi-LSTM, and SVM methods, respectively. The corresponding testing times are 0.10, 0.10, 1.09, and 0.09 seconds (the standard errors are negligible), respectively, for an average number of 8,261 frames, which implies that real-time action prediction is highly feasible. These values are almost identical using the pre-trained VGG16 model.

In case of the truncated UW IOM dataset, using the fine-tuned VGG16 model, the overall training times are 113.3 ± 1.52, 90.37 ± 0.72, 754.79 ± 14.16, and 13.76 ± 0.59 seconds across the five data splits for the D-TCN, ED-TCN, Bi-LSTM, and SVM methods, respectively. The corresponding testing times are 0.04, 0.03, 0.40, and 0.04 seconds (the standard errors are again negligible), respectively, for an average number of 2,916 frames.

For the TUM Kitchen dataset, the overall training times are 91.19 ± 1.04, 74.53 ± 0.65, 619.21 ± 1.82, and 15.68 ± 0.67 seconds across the five data splits for the D-TCN, ED-TCN, Bi-LSTM, and SVM methods, respectively. The corresponding testing times are 0.03, 0.02, 0.33, and 0.03 seconds (negligible standard errors), respectively, for an average number of 6,311 frames.
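
The real-time claim can be checked with simple arithmetic on the reported numbers. The sketch below converts the testing times and average frame counts quoted above into labeling throughput and compares it with a camera frame rate. The 30 frames/s camera rate is an assumption made only for illustration, and the reported times are taken at face value as the time to compute the framewise labels; whether they also cover spatial feature extraction is not stated here.

```python
# Average frame counts and testing times (seconds), as quoted in the text above.
TEST_RUNS = {
    "UW IOM (complete)": (8261, {"D-TCN": 0.10, "ED-TCN": 0.10, "Bi-LSTM": 1.09, "SVM": 0.09}),
    "UW IOM (truncated)": (2916, {"D-TCN": 0.04, "ED-TCN": 0.03, "Bi-LSTM": 0.40, "SVM": 0.04}),
    "TUM Kitchen": (6311, {"D-TCN": 0.03, "ED-TCN": 0.02, "Bi-LSTM": 0.33, "SVM": 0.03}),
}

ASSUMED_CAMERA_FPS = 30.0  # assumption: the camera frame rate is not given in this section


def frames_per_second(num_frames, seconds):
    """Labeling throughput implied by a reported testing time."""
    return num_frames / seconds


def real_time_margin(num_frames, seconds, camera_fps=ASSUMED_CAMERA_FPS):
    """How many times faster than the (assumed) camera rate the labeling runs."""
    return frames_per_second(num_frames, seconds) / camera_fps


for dataset, (num_frames, times) in TEST_RUNS.items():
    for method, seconds in times.items():
        rate = frames_per_second(num_frames, seconds)
        margin = real_time_margin(num_frames, seconds)
        print(f"{dataset:20s} {method:8s} {rate:12,.0f} frames/s  ({margin:,.0f}x camera rate)")
```

Under these assumptions, even the slowest method (Bi-LSTM on the complete UW IOM dataset) labels several thousand frames per second, well above the assumed camera rate.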

We further note that the TCN methods also have acceptable training times of the order of a few minutes for reasonably large datasets. This characteristic enables our system to adapt quickly to changing object manipulation tasks. On the other hand, the training times are considerably longer for Bi-LSTM, similar to the results reported in [31].

VI Discussion

In case of the more challenging UW IOM dataset, we observe that our TCN methods demonstrate better segmentation performance when spatial features are extracted using the fine-tuned VGG16 model instead of the pre-trained VGG16 model. Consequently, we decided to use P-CNN features to examine whether additional spatial features would further facilitate learning the temporal aspects of the videos for the action segmentation methods. As introduced in [27], P-CNN features are descriptors for video clips that are restricted to only one action per clip. All the frame features of a video clip are aggregated over time using different schemes that result in a single descriptor comprising information about the action in that clip. However, our goal is to process full-length videos with multiple actions. A single time-aggregated descriptor for an entire sequence of multiple actions is of less value to us, as time aggregation results in the loss of important information about the sequence of actions as well as the transitions between the different actions. Hence, we skip the time aggregation step to obtain a video descriptor of the same length as the number of frames in the full-length video.
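
The effect of skipping the time aggregation step can be illustrated with a small Python sketch. It contrasts a clip-level descriptor, here built with per-dimension minimum and maximum over time as one illustrative aggregation scheme, with the framewise descriptor that keeps one feature vector per frame. The function names and the tiny two-dimensional feature vectors are hypothetical and stand in for the much larger CNN features.

```python
def aggregate_over_time(frame_features):
    """Clip-level descriptor: per-dimension min and max over all frames.

    The output length depends only on the feature dimension, so the order
    of the frames (and hence of the actions) is lost.
    """
    columns = list(zip(*frame_features))
    return [min(c) for c in columns] + [max(c) for c in columns]


def framewise_descriptor(frame_features):
    """No aggregation: keep one feature vector per frame, in temporal order."""
    return [list(f) for f in frame_features]


# Hypothetical 2-D features for a 3-frame video.
features = [[0.1, 0.9], [0.4, 0.2], [0.8, 0.5]]

clip_level = aggregate_over_time(features)
per_frame = framewise_descriptor(features)

print(len(clip_level))  # 2 * feature dimension, independent of video length
print(len(per_frame))   # equal to the number of frames

# Reversing the video leaves the aggregated descriptor unchanged.
print(aggregate_over_time(features[::-1]) == clip_level)
```

The aggregated descriptor is identical for the original and the time-reversed video, which is why it cannot support segmentation of a full-length video into a sequence of actions, whereas the framewise descriptor preserves the ordering that the temporal networks rely on.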

Also, P-CNN features are originally generated by stacking normalized time-aggregated descriptors for ten different patches, i.e., five patches of the RGB image (namely, full body, upper body, left hand, right hand, and full image) and corresponding five patches of the optical flow image. These patches are cropped from the RGB and optical flow frames, respectively, using the relevant body joint positions. The missing parts in the patches are filled with gray pixels, before resizing them as necessary for the CNN input layer. This filling step is done using a scale factor available along with the joint positions for the dataset used in [27]. Such a scale factor is, however, not available for our TUM Kitchen and UW IOM datasets. On experimenting with various common values for this scale factor, we observe that it needs to be different for every video as each participant has a somewhat different body structure. Therefore, we only use the full image patches in our simplified form of P-CNN to avoid this issue.
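
The cropping and gray-filling step can be sketched as follows. This is a simplified illustration, not the procedure of [27]: it assumes a single-channel image stored as nested lists, a square patch centered on one joint position, a mid-gray fill value of 128, and a patch size obtained by multiplying a base size by the per-video scale factor. All names and values are hypothetical.

```python
GRAY = 128  # assumed mid-gray fill value for pixels outside the image


def patch_half_size(base_half_size, scale):
    """Assumed role of the scale factor: it stretches the patch around the joint."""
    return max(1, round(base_half_size * scale))


def crop_patch(image, center_row, center_col, half_size):
    """Crop a square patch around a joint; out-of-image pixels are filled with gray."""
    height, width = len(image), len(image[0])
    patch = []
    for r in range(center_row - half_size, center_row + half_size + 1):
        row = []
        for c in range(center_col - half_size, center_col + half_size + 1):
            inside = 0 <= r < height and 0 <= c < width
            row.append(image[r][c] if inside else GRAY)
        patch.append(row)
    return patch


# Hypothetical 3x3 image with a joint located at the top-left corner.
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]

for row in crop_patch(image, center_row=0, center_col=0, half_size=1):
    print(row)
```

Because the patch size depends on the scale factor in this sketch, a value that suits one participant would crop too much or too little of the body for another, which is consistent with our observation that a single common value does not work across videos.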

VII Conclusions

In this letter, we present an end-to-end deep learning system to accurately predict the ergonomic risks during indoor object manipulation using camera videos. Our system comprises effective spatial feature extraction and sequential feeding of the extracted features to temporal neural networks for real-time segmentation into meaningful actions. The segmentation methods work well with just standard (RGB) camera videos, irrespective of how the spatial features are extracted, provided depth cameras are used to generate reliable ergonomic risk scores for all the possible actions corresponding to a known object manipulation environment. Consequently, our system is suitable for widespread deployment in factories and warehouses, as it eliminates the need for expensive RGB-D cameras that have to be mounted on mobile co-bots or installed at a large number of fixed locations.

In the future, we intend to further enhance our system to segment the videos satisfactorily when either the actions are not always performed in the same sequence or the same set of actions is not carried out by all the humans. For this purpose, we plan to use the spatiotemporal correlations among the manipulated objects and their affordances, potentially within a generative deep learning model. Subsequently, we aim to use our system as the initial step for automated action recognition, where the goal would be to infer the future actions of a human given a set of executed actions.


  • [1] A. M. Zanchettin, N. M. Ceriani, P. Rocco, H. Ding, and B. Matthias, “Safety in human-robot collaborative manufacturing environments: Metrics and control,” IEEE Trans. Autom. Sci. Eng., vol. 13, no. 2, pp. 26 754–26 772, Apr. 2016.
  • [2] G. Michalos, S. Makris, P. Tsarouchi, T. Guasch, D. Kontovrakis, and G. Chryssolouris, “Design considerations for safe human-robot collaborative workplaces,” Procedia CIRP, vol. 37, pp. 248–253, Dec. 2015.
  • [3] S. Robla-Gómez, V. M. Becerra, J. R. Llata, E. Gonzalez-Sarabia, C. Torre-Ferrero, and J. Perez-Oria, “Working together: A review on safe human-robot collaboration in industrial environments,” IEEE Access, vol. 5, pp. 26 754–26 773, Nov. 2017.
  • [4] I. Maurtua, A. Ibarguren, J. Kildal, L. Susperregi, and B. Sierra, “Human-robot collaboration in industrial applications: Safety, interaction and trust,” Int. J. Adv. Robot. Syst., vol. 14, no. 4, pp. 1–10, Jul. 2017.
  • [5] M. Helander, A guide to human factors and ergonomics.   CRC Press, 2005.
  • [6] “Back injuries prominent in work-related musculoskeletal disorder cases in 2016,” Aug 2018. [Online]. Available:
  • [7] “ICRA 2018 workshop.” [Online]. Available:
  • [8] P. Spielholz, B. Silverstein, M. Morgan, H. Checkoway, and J. Kaufman, “Comparison of self-report, video observation and direct measurement methods for upper extremity musculoskeletal disorder physical risk factors,” Ergonomics, vol. 44, no. 6, pp. 588–613, Jun. 2001.
  • [9] C. Li and S. Lee, “Computer vision techniques for worker motion analysis to reduce musculoskeletal disorders in construction,” in Comput. Civil Eng., 2011, pp. 380–387.
  • [10] A. Golabchi, S. Han, J. Seo, S. Han, S. Lee, and M. Al-Hussein, “An automated biomechanical simulation approach to ergonomic job analysis for workplace design,” J. Construction Eng. Manag., vol. 141, no. 8, p. 04015020, Jan. 2015.
  • [11] X. Li, S. Han, M. Gül, and M. Al-Hussein, “Automated post-3D visualization ergonomic analysis system for rapid workplace design in modular construction,” Autom. Construction, vol. 98, pp. 160–174, Jan. 2019.
  • [12] Q. Fang, H. Li, X. Luo, L. Ding, H. Luo, T. Rose, and W. An, “Detecting non-hardhat-use by a deep learning method from far-field surveillance videos,” Autom. Construction, vol. 85, pp. 1–9, Jan. 2018.
  • [13] L. Ding, W. Fang, H. Luo, P. E. Love, B. Zhong, and X. Ouyang, “A deep hybrid learning model to detect unsafe behavior: integrating convolution neural networks and long short-term memory,” Autom. Construction, vol. 86, pp. 118–124, Feb. 2018.
  • [14] W. Fang, L. Ding, H. Luo, and P. E. Love, “Falls from heights: A computer vision-based approach for safety harness detection,” Autom. Construction, vol. 91, pp. 53–61, Feb. 2018.
  • [15] A. Abobakr, D. Nahavandi, J. Iskander, M. Hossny, S. Nahavandi, and M. Smets, “A kinect-based workplace postural analysis system using deep residual networks,” in IEEE Int. Syst. Eng. Symp., 2017, pp. 1–6.
  • [16] R. Mehrizi, X. Peng, Z. Tang, X. Xu, D. Metaxas, and K. Li, “Toward marker-free 3D pose estimation in lifting: A deep multi-view solution,” in IEEE Int. Conf. Autom. Face Gesture Recognit., 2018, pp. 485–491.
  • [17] B. Singh, T. K. Marks, M. Jones, O. Tuzel, and M. Shao, “A multi-stream bi-directional recurrent neural network for fine-grained action detection,” in IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 1961–1970.
  • [18] A. Fathi and J. M. Rehg, “Modeling actions through state changes,” in IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 2579–2586.
  • [19] H. Kuehne, J. Gall, and T. Serre, “An end-to-end generative framework for video segmentation and recognition,” in IEEE Winter Conf. Appl. Comput. Vis., 2016, pp. 1–8.
  • [20] D.-A. Huang, L. Fei-Fei, and J. C. Niebles, “Connectionist temporal modeling for weakly supervised action labeling,” in Eur. Conf. Comput. Vis., 2016, pp. 137–153.
  • [21] S. Hignett and L. McAtamney, “Rapid entire body assessment,” in Handbook of Human Factors and Ergonomics Methods.   CRC Press, 2004, pp. 97–108.
  • [22] B. Burger and P. Toiviainen, “MoCap toolbox - A Matlab toolbox for computational analysis of movement data,” in Sound Music Comput. Conf., 2013, pp. 172–178.
  • [23] J. R. Terven and D. M. Córdova-Esparza, “Kin2. a kinect 2 toolbox for matlab,” Sci. Comput. Prog., vol. 130, pp. 97–106, 2016.
  • [24] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Int. Conf. Learning Representations, 2014.
  • [25] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in IEEE Conf. Comput. Vis. Pattern Recognit., 2009, pp. 248–255.
  • [26] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Int. Conf. Learning Representations, 2014.
  • [27] G. Chéron, I. Laptev, and C. Schmid, “P-CNN: Pose-based CNN features for action recognition,” in IEEE Int. Conf. Comput. Vis., 2015, pp. 3218–3226.
  • [28] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert, “High accuracy optical flow estimation based on a theory for warping,” in Eur. Conf. Comput. Vis., 2004, pp. 25–36.
  • [29] G. Gkioxari and J. Malik, “Finding action tubes,” in IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 759–768.
  • [30] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return of the devil in the details: Delving deep into convolutional nets,” in British Mach. Vis. Conf., 2014.
  • [31] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager, “Temporal convolutional networks for action segmentation and detection,” in IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 156–165.
  • [32] “CS University of Toronto CSC321 Winter 2014 - lecture six slides.” [Online]. Available:
  • [33] A. Van Den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio,” in Speech Synthesis Workshop, 2016.
  • [34] A. Graves, S. Fernández, and J. Schmidhuber, “Bidirectional LSTM networks for improved phoneme classification and recognition,” in Int. Conf. Artif. Neural Networks, 2005, pp. 799–804.
  • [35] C. Lea, R. Vidal, and G. D. Hager, “Learning convolutional action primitives for fine-grained action recognition,” in IEEE Int. Conf. Robot. Autom., 2016, pp. 1642–1649.
  • [36] M. Tenorth, J. Bandouch, and M. Beetz, “The TUM kitchen data set of everyday manipulation activities for motion tracking and action recognition,” in IEEE Int. Conf. Comput. Vis. Workshops, 2009, pp. 1089–1096.
  • [37] J. Bandouch and M. Beetz, “Tracking humans interacting with the environment using efficient hierarchical sampling and layered observation models,” in IEEE Int. Conf. Comput. Vis. Workshops, 2009, pp. 2040–2047.
  • [38] Tensorflow, “tensorflow/tensorflow.” [Online]. Available:
  • [39] Keras-Team, “keras-team/keras,” Feb. 2019. [Online]. Available: