Multi-Modal Recognition of Worker Activity for Human-Centered Intelligent Manufacturing

In a human-centered intelligent manufacturing system, sensing and understanding of the worker's activity are the primary tasks. In this paper, we propose a novel multi-modal approach for worker activity recognition by leveraging information from different sensors and in different modalities. Specifically, a smart armband and a visual camera are applied to capture Inertial Measurement Unit (IMU) signals and videos, respectively. For the IMU signals, we design two novel feature transform mechanisms, in both frequency and spatial domains, to assemble the captured IMU signals as images, which allow using convolutional neural networks to learn the most discriminative features. Along with the above two modalities, we propose two other modalities for the video data, at the video frame and video clip levels, respectively. Each of the four modalities returns a probability distribution on activity prediction. Then, these probability distributions are fused to output the worker activity classification result. A worker activity dataset is established, which at present contains 6 common activities in assembly tasks, i.e., grab a tool/part, hammer a nail, use a power-screwdriver, rest arms, turn a screwdriver, and use a wrench. The developed multi-modal approach is evaluated on this dataset and achieves recognition accuracies as high as 97%.




1 Introduction

Industrial big data has become increasingly accessible and affordable, benefiting from the availability of low-cost sensors and the development of Internet-of-Things (IoT) technologies [1, 2], which builds up the data foundation for advanced manufacturing. A variety of methods and algorithms have been developed to learn valuable information from the data and to make manufacturing more intelligent [3]. With the recent rapid growth of Artificial Intelligence (AI) technologies, especially deep learning and reinforcement learning [5] methods, AI-boosted manufacturing has become increasingly attractive in both scientific research and industrial applications.

In an intelligent manufacturing system involving workers, recognition of the worker’s activity is one of the primary tasks. It can be used for quantification and evaluation of the worker’s performance, as well as to provide onsite instructions with augmented reality. Also, worker activity recognition is crucial for human-robot interaction and collaboration. It is essential for developing human-centered intelligent manufacturing systems.

1.1 Related Work

In the computer vision area, image/video-based human activity recognition using deep learning methods has been intensively studied in recent years and unprecedented progress has been made [6, 7]. However, visual-based recognition suffers from the occlusion issue, which affects the recognition accuracy. Wearable devices, such as an armband embedded with an Inertial Measurement Unit (IMU), directly sense the movement of the human body, which can provide information on the body status. In addition, there are many inexpensive wearable devices on the market, such as smart armbands and smartphones, which are widely used in activity recognition tasks. Wearable devices are directly attached to the human body and thus do not have the occlusion issue. Nevertheless, a wearable device can only sense the human body activity locally, so it is challenging to precisely recognize an activity involving multiple body parts. Although multiple devices can be applied to simultaneously sense the activity globally, this makes the system cumbersome and brings discomfort to the user.

Worker activity recognition in the manufacturing area is still an emerging topic and few studies have been made. Stiefmeier et al. [8] utilized ultrasonic and IMU sensors for worker activity recognition in a bicycle maintenance scenario using a Hidden Markov Model classifier. Later they proposed a string-matching based segmentation and classification method using multiple IMU sensors for recognizing worker activity in car manufacturing tasks [9, 10]. Koskimaki et al. [11] used a wrist-worn IMU sensor to capture the arm movement and a K-Nearest Neighbor model to classify five activities for industrial assembly lines. Maekawa et al. [12] proposed an unsupervised measurement method for lead time estimation of factory work using signals from a smartwatch with an IMU sensor. Recently, deep learning methods have been introduced to recognize worker activity in human-robot collaboration studies [13, 14].

In general, the activity recognition task can be broken down into two subtasks: feature extraction and subsequent multiclass classification. To extract more discriminative features, various methods have been applied to the raw signals in the time or frequency domain, e.g., mean, correlation, and Principal Component Analysis [15, 16, 17, 18]. Different classifiers have been explored on the features for activity recognition, such as the Support Vector Machine [15, 17], Random Forest, K-Nearest Neighbors, Linear Discriminant Analysis [16], and Hidden Markov Model [18]. To effectively learn the most discriminative features, Jiang et al. [19] proposed a method based on Convolutional Neural Networks (CNN). They assembled the raw IMU signals into an activity image, which enabled the CNN model to automatically learn the discriminative features from the activity image for classification.

1.2 Proposed Method

Figure 1: Overview of our multi-modal approach for worker activity recognition.

Few attempts have been made at worker activity recognition in the manufacturing field, and most of them use only a single sensing modality, which cannot guarantee robust recognition under various circumstances. In the present research, to comprehensively perceive the worker, we choose a smart armband to acquire the Inertial Measurement Unit (IMU) signals and a visual camera to capture the image sequence of the worker’s activity. An overview of our method is illustrated in Figure 1. For the IMU signals, we design two novel mechanisms, in both the frequency and spatial domains, to assemble the captured IMU signals as images. The assembled signal representation allows us to use Convolutional Neural Networks to explore the correlation among time-series signals and learn the most discriminative features for worker activity recognition. As for the video data, we propose two modalities, at the frame and video-clip levels, respectively. Overall, we have four modalities in parallel, and each of them returns a probability distribution over the activity classes. These probabilities are then fused to output the worker activity classification result. To evaluate the method, a worker activity dataset containing 6 common activities in assembly tasks is established.

The main contributions of our work are as follows:

  • We propose a multi-modal approach for the worker activity recognition in manufacturing, using both wearable devices and visual cameras.

  • To take advantage of the powerful learning ability of CNN on images, we design two novel mechanisms to produce 2D signal representations of the IMU signals from wearable devices, in both the frequency and spatial domains.

  • To synthesize more physical-realistic variations in the training dataset, we propose a kinematics-based data augmentation method for the wearable sensor data. It generates more data by spatial rotation and mirroring, in order to augment variations that cannot be achieved using traditional image augmentation methods.

The remainder of this paper is organized as follows. Section 2 discusses how we build up the worker activity dataset. Section 3 focuses on the novel feature representation and data augmentation. Section 4 describes the details of neural network architectures, training and testing of the multi-modal activity recognition. The experimental setups and results are described in Sections 5 and 6, respectively. Finally, Section 7 provides the conclusions of this research.

2 Multi-modal Sensing and Data Acquisition

To establish our dataset of worker activity, six activities commonly performed in assembly tasks are chosen: grab a tool/part (GT), hammer a nail (HN), use a power-screwdriver (UP), rest arms (RA), turn a screwdriver (TS), and use a wrench (UW). Eight subjects are recruited to conduct a set of tasks (listed in Table 1) containing the 6 activities.

| No. | Tasks | Activities |
|-----|-------|------------|
| 1 | Grab 30 tools/parts from the 3 containers | GT |
| 2 | Hammer 15 nails into the wooden dummy | HN |
| 3 | Tighten 20 screws using a power-screwdriver | UP |
| 4 | Rest arms for about 60 seconds | RA |
| 5 | Tighten 10 nuts using a screwdriver | TS |
| 6 | Tighten 10 nuts using a wrench | UW |

Table 1: Tasks for collecting worker activity.

As demonstrated in Figure 2(a), the subject is asked to stand in front of the workbench, wear a smart armband [20] on his/her right forearm with a fixed orientation (Figure 2(b)), and perform the tasks on assembly dummies in a natural way. The armband from Thalmic Labs is equipped with IMU sensors for wearable sensor data acquisition. The IMU returns three types of signals (3-channel acceleration, 3-channel angular velocity, and 4-channel orientation) at the sample rate of 50Hz. These 10-channel signals captured on a worker are transmitted via Bluetooth to the workstation in real time.

Figure 2: (a) Data collection setup; (b) Wearing orientation of a right-hand.

While collecting wearable sensor data from the armband, an overhung camera is used to record the assembly tasks simultaneously for monitoring the process. Examples of the 6 activities are shown in Figure 3, which are taken from the overhung camera.

Figure 3: Examples of the 6 activities captured from the overhung camera.

3 Data Preprocessing, Signal Representation and Data Augmentation

Convolution-based deep learning methods need the input data to be formatted as tensors, for example, with a fixed size of $H \times W \times C$ for images or with a fixed size of $L \times H \times W \times C$ for image sequences (video clips), where $H$, $W$ and $C$ are the height, width and the number of channels of the image, respectively, and $L$ is the image sequence length. Therefore, some preprocessing steps are necessary before the data can be fed into a convolutional neural network. In this section we give a detailed description of the pipeline for data preprocessing and the new methods for signal representation. Furthermore, to generate more realistic data, we propose a kinematics-based augmentation method which is also presented in this section.

3.1 Data Sampling

Although the data (i.e., IMU sensor signals and videos) are collected simultaneously for all tasks and each task consists of only one activity, there still might be some unrelated activities inside the data, such as preparing activities before hammering nails. To address it, the recorded videos are manually annotated to locate the time durations (i.e., the starting and ending timestamps), each of which contains only one of the six activities. These durations are used to segment the raw data (IMU sensor signals and videos).

Usually, the duration of an activity instance ranges from a few seconds to more than one minute. Thus, sampling is needed to prepare the data samples for recognition. As depicted in Figure 4, the 10-channel IMU signals and the video recording are synchronized with the timestamps. Then the 50Hz IMU signals are sampled using a temporal sliding window with a width of 64 timestamps and 75% overlap between two consecutive windows. Thus, each IMU sample lasts for about 1.3 seconds, which covers at least one activity pattern. After sampling the IMU signals, the video recordings are sampled according to the time durations of the IMU samples, so each video clip has an approximate length of 38 frames.
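The sliding-window sampling can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `sliding_windows` is hypothetical, and the window width of 64 timestamps is an assumption chosen because 64 samples at 50Hz last about 1.3 seconds. A 75% overlap then corresponds to a stride of 16 timestamps.

```python
import numpy as np

def sliding_windows(signal, width=64, overlap=0.75):
    """Cut a (T, C) signal into overlapping windows of `width` timestamps.

    `width=64` is an assumed value (about 1.3 s at 50 Hz); `overlap` is the
    fraction shared by two consecutive windows.
    """
    signal = np.asarray(signal)
    stride = max(1, int(round(width * (1.0 - overlap))))
    starts = list(range(0, signal.shape[0] - width + 1, stride))
    if not starts:
        # The recording is shorter than one window: no sample can be cut.
        return np.empty((0, width) + signal.shape[1:], dtype=signal.dtype)
    return np.stack([signal[s:s + width] for s in starts])

# Example: 10 seconds of 10-channel IMU data at 50 Hz.
imu = np.random.default_rng(0).normal(size=(500, 10))
windows = sliding_windows(imu)
print(windows.shape)  # (number of windows, 64, 10)
```

The video clips would then be cut with the start and end timestamps of each IMU window, which keeps the two sensing streams aligned sample by sample.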

Figure 4: Scheme of the signal sampling method.

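After sampling, we denote our dataset as

$$\mathcal{D} = \{(S_i, V_i, y_i)\}_{i=1}^{N} \tag{1}$$

where $S_i$ is a sample set of time-series IMU signals, $V_i$ is the corresponding video clip sample, and $y_i$ is the manually labeled ground truth of the activity class. More specifically, $S_i$ is a sequence of discrete-time data over $T$ timestamps, $S_i = \{s_1, s_2, \dots, s_T\}$, and each element is elaborated as

$$s_t = [a_t, \omega_t, q_t] \tag{2}$$

where $a_t \in \mathbb{R}^3$, $\omega_t \in \mathbb{R}^3$, and $q_t \in \mathbb{R}^4$ are acceleration, angular velocity, and orientation in quaternion, respectively.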

After sampling, the quantitative information of the dataset is listed in Table 2. There are 11,211 data samples in total. The eight subjects take different amounts of time to finish each task, so they have different numbers of data samples for each activity.

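| Subject No. | GT | HN | UP | RA | TS | UW |
|-------------|----|----|----|----|----|----|
| 1 | 193 | 140 | 364 | 266 | 222 | 442 |
| 2 | 302 | 408 | 195 | 56 | 274 | 751 |
| 3 | 198 | 183 | 171 | 251 | 214 | 567 |
| 4 | 204 | 172 | 188 | 29 | 82 | 344 |
| 5 | 187 | 204 | 142 | 43 | 213 | 372 |
| 6 | 216 | 77 | 179 | 47 | 129 | 301 |
| 7 | 213 | 196 | 203 | 254 | 231 | 576 |
| 8 | 200 | 184 | 262 | 145 | 148 | 273 |
| Total | 1713 | 1564 | 1704 | 1091 | 1513 | 3626 |

Table 2: Number of data samples for each activity of different subjects.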

3.2 Wearable Sensor Signal Representation

To take advantage of the powerful learning ability of CNNs on images, we propose to transform the time-series IMU sensor signals into image representations. As shown in Figure 5, the frequency feature transform assembles the sensor signals in a special pattern such that the hidden correlations among different channels of sensor signals are revealed, and the spatial feature transform uncovers the changing history of orientation signals in the spatial domain. Both feature transform mechanisms enable a CNN model to learn the most discriminative features from images, which is not possible with the original time-series sensor signals.

Figure 5: Illustration of the feature transforms for wearable sensor signals.

Frequency feature transform: Frequency domain analysis is a commonly used technique for signal pattern recognition. Rather than directly applying the frequency transform to time-series signals, we propose a new way to unveil the hidden correlations among sensor signals: 1) The 10-channel signals $S$ in an IMU sample are stacked row by row as an image $I_s$ with the size of $10 \times T$ (Fig. 5(a)), where $T$ is the number of timestamps in the sample; 2) We expand the 10-row image with a shuffling algorithm [19] to form $I_e$ (Fig. 5(b)) with the size of $H_e \times T$, where $H_e > 10$ is the number of rows after shuffling. The idea here is to give every pair of the 10 channels the chance to be row-neighbors in the image, so that the correlations among different channels are exposed and can be detected by a CNN model; 3) The two-dimensional (2D) Discrete Fourier Transform (DFT) is applied to $I_e$ to get the representation in the frequency domain and analyze the frequency characteristics. Only its logarithmic magnitude is taken to form the image $I_f$ (Fig. 5(c)); 4) Due to the conjugate symmetry of the Fourier transform of a real-valued image,

$$F(u, v) = F^{*}(-u, -v) \tag{3}$$

where $F$ is the 2D DFT, $F^{*}$ is its complex conjugate, and $u$ and $v$ represent the two directions of an image, we can use only one half of the DFT image to remove the redundancy. This reduces the architectural complexity and the number of training parameters for the CNN model. Here we keep using the notation $I_f$ to represent the one-half (the first and fourth quadrants) of the DFT image for simplicity.
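The frequency feature transform can be sketched in NumPy as below. This is an illustration under stated assumptions, not the authors' implementation: the shuffling algorithm of [19] is not reproduced, so the optional `row_order` argument is a hypothetical stand-in that simply re-stacks (and may repeat) channel rows; `log1p` is used instead of a plain logarithm to avoid `log(0)`; and the retained half is taken along the horizontal frequency axis after centering the spectrum.

```python
import numpy as np

def frequency_image(sample, row_order=None):
    """Turn a (T, C) IMU sample into a half log-magnitude 2D DFT image.

    row_order: optional list of channel indices (hypothetical stand-in for
    the shuffling step) used to re-stack rows so that different channel
    pairs become neighbours.
    """
    image = np.asarray(sample, dtype=float).T      # (C, T): one channel per row
    if row_order is not None:
        image = image[list(row_order)]             # expanded, re-ordered rows
    spectrum = np.fft.fftshift(np.fft.fft2(image)) # zero frequency at the centre
    log_mag = np.log1p(np.abs(spectrum))           # logarithmic magnitude
    # For a real-valued input the magnitude is point-symmetric about the
    # centre, so one half carries all the information.
    return log_mag[:, log_mag.shape[1] // 2:]

sample = np.random.default_rng(0).normal(size=(64, 10))
print(frequency_image(sample).shape)
print(frequency_image(sample, row_order=[0, 2, 4, 1, 3, 0]).shape)
```

Dropping the redundant half halves the width of the CNN input, which is the reduction in model size that the text refers to.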

Spatial feature transform: Implementing the feature transform in the frequency domain unavoidably discards the spatial information in the signals, which motivates us to introduce a second mechanism to exploit the spatial information included in the raw signals. Since recovering the spatial trajectory from IMU data is not an easy task, here we develop an orientation changing history (och) image to represent the pose-changing information of the subject in the spatial domain

$$I_{och} = f_{och}(q_1, q_2, \dots, q_T) \tag{4}$$

where $f_{och}$ is the spatial feature transform and $I_{och}$ is the resulting image. In the spatial feature transform described below, only the orientation information is considered.

First, a unit vector $e$ is rotated by $q_t$ to generate a direction vector $d_t$ by

$$d_t = R(q_t, e) \tag{5}$$

where $R$ denotes the rotation operation defined as

$$R(q, e) = q \otimes e \otimes q^{*} \tag{6}$$

in which $e$ is treated as a quaternion with zero scalar part, and $\otimes$ is the quaternion multiplication, defined as

$$p \otimes q = [\,p_w q_w - p_v \cdot q_v,\ \ p_w q_v + q_w p_v + p_v \times q_v\,] \tag{7}$$

where $p = [p_w, p_v]$, $q = [q_w, q_v]$ (scalar part and vector part), and $q^{*}$ is the conjugate of $q$:

$$q^{*} = [\,q_w, -q_v\,] \tag{8}$$

Then, the orientation changing history can be represented by a series of orientation vectors at different time steps,

$$\{d_1, d_2, \dots, d_T\} \tag{9}$$

which is essentially a set of points on the unit sphere surface.

Secondly, these points are projected onto three orthogonal planes (Fig. 5(d)). On each plane, the points are connected with line segments sequentially to form orientation changing curves in an image.

Finally, these three projected images are stacked as a 3-channel image which is represented in red, green and blue color, respectively (Fig. 5(e)). Figure 6 shows some examples of image representations in the frequency and spatial domain, from one subject on six activities, from which we can observe unique patterns of each activity.
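The three steps of the spatial feature transform (rotate a reference vector, project onto three orthogonal planes, connect the projected points) can be sketched as follows. Several details are assumptions made for illustration only: the reference unit vector `(1, 0, 0)`, the quaternion layout `[w, x, y, z]`, the 32-pixel image size, the choice of the xy, yz and xz planes, and the simple point-sampling line rasterizer.

```python
import numpy as np

def rotate(quats, vec):
    """Rotate `vec` by unit quaternion(s) [w, x, y, z]; equals q * [0, vec] * q^*."""
    quats = np.asarray(quats, dtype=float)
    vec = np.asarray(vec, dtype=float)
    quats = quats / np.linalg.norm(quats, axis=-1, keepdims=True)
    w, r = quats[..., :1], quats[..., 1:]
    c = np.cross(r, vec)
    return vec + 2.0 * w * c + 2.0 * np.cross(r, c)

def och_image(quats, size=32, unit=(1.0, 0.0, 0.0)):
    """Build a (size, size, 3) orientation-changing-history image."""
    d = rotate(quats, unit)                      # (T, 3) points on the unit sphere
    img = np.zeros((size, size, 3))
    planes = [(0, 1), (1, 2), (0, 2)]            # assumed: xy, yz, xz projections
    for ch, (i, j) in enumerate(planes):
        pts = (d[:, [i, j]] + 1.0) / 2.0 * (size - 1)   # [-1, 1] -> pixel coords
        for p0, p1 in zip(pts[:-1], pts[1:]):
            steps = max(int(np.ceil(np.abs(p1 - p0).max())), 1)
            seg = np.linspace(p0, p1, steps + 1)        # points along the segment
            idx = np.clip(np.rint(seg).astype(int), 0, size - 1)
            img[idx[:, 1], idx[:, 0], ch] = 1.0
    return img

# Example: a slow rotation about the z axis.
angles = np.linspace(0.0, np.pi / 2, 64)
quats = np.stack([np.cos(angles / 2), 0 * angles, 0 * angles, np.sin(angles / 2)], axis=1)
print(och_image(quats).shape)
```

Each color channel of the result holds the curve seen from one projection plane, so the image keeps the order-free shape of the orientation path in a form a 2D CNN can consume.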

Figure 6: Examples of IMU image representations by the frequency and spatial feature transforms.

3.3 Visual Signal Representation

Besides the two mechanisms of feature transforms on the IMU sensor signals, since the recorded video contains rich visual contexts of the worker’s activity and visual-based activity recognition also has shown promising results [6, 7], we introduce two other mechanisms to represent the video at two levels.

Frame-level visual representation: At the frame level, the middle frame of a video clip is selected as an image representation of an activity, which focuses on the worker’s static posture and surrounding environment. The operation is denoted as

$$I_{frm} = f_{mid}(V) \tag{10}$$

where $f_{mid}$ is the operation to extract the middle frame from a video clip $V$.

Video-level visual representation: At the video-clip level, the video clip samples are sampled again so that each video clip has the defined length of $L$ frames required by a CNN model (to be described in Section 4). The operation is

$$V_c = f_{rs}(V) \tag{11}$$

where $V_c$ is the resulting fixed-length video clip from the resampling operation $f_{rs}$.
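Both visual representations reduce to simple indexing over the frame axis, as sketched below. The function names are hypothetical, the clip is assumed to be an array of shape (frames, height, width, channels), uniform index resampling is an assumed strategy, and the default length of 16 frames is an assumption based on the 16-frame input of the C3D network used later.

```python
import numpy as np

def middle_frame(clip):
    """Frame-level representation: the frame in the middle of the clip."""
    return clip[len(clip) // 2]

def resample_clip(clip, length=16):
    """Video-level representation: pick `length` frames at uniform spacing.

    `length=16` is an assumed value; uniform spacing is an assumed strategy.
    """
    idx = np.linspace(0, len(clip) - 1, num=length).round().astype(int)
    return clip[idx]

# Example: a 38-frame clip of tiny 2x2 RGB frames whose pixel value is the frame index.
clip = np.arange(38).reshape(38, 1, 1, 1) * np.ones((1, 2, 2, 3))
print(middle_frame(clip).shape, resample_clip(clip).shape)
```

Uniform spacing keeps the first and last frames, so the resampled clip still spans the whole 1.3-second activity window.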

3.4 Kinematics-based Data Augmentation

For deep learning approaches, a large amount of labeled data is needed to train a valid model that generalizes well. Nevertheless, it is always time-consuming and costly to collect such big data with annotated labels. Data augmentation, which synthesizes additional data derived from the original ones, is a commonly used technique to resolve the data shortage problem. Traditionally, image data augmentation applies a series of image transformation operations, such as rotating, scaling, shifting, flipping, and shearing, to the original images to generate more image data. The image transformation can introduce more variations while keeping the recognizable contents, and thus it is applied to generate more data for the images and video clips from the visual signal representation. However, the variations introduced by basic image transformations are not physically realistic in our sensor signal context.

To include more reasonable variations in the training dataset, we propose a kinematics-based augmentation method to generate more wearable sensor signal samples, rather than implementing image data augmentation on the images resulting from the feature transforms. More specifically, the kinematics-based augmentation refers to creating variations by spatial rotation and mirroring on the four channels of orientation signals.

Suppose we have a four-channel orientation signal represented as a quaternion $q$. A new orientation $q'$ can be generated by rotating $q$ with

$$q' = q_r \otimes q \tag{12}$$

where $q_r$ represents a rotation quaternion of an angle $\theta$ about a unit axis $k$. It can be calculated by

$$q_r = \left[\cos\frac{\theta}{2},\ \sin\frac{\theta}{2}\,k\right] \tag{13}$$

Mirroring is applied to the original data to add variations for situations such as the armband being worn on different dominant arms by different subjects. First, the vector $d'$ mirrored from the current direction vector $d$ (Eq. 5) against a certain plane can be calculated with

$$d' = d - 2\,(d \cdot n)\,n \tag{14}$$

where $n$ is the unit normal vector of the given plane.

Then the mirrored quaternion $q_m$, representing the transition between the two vectors $e$ and $d'$, can be obtained by

$$q_m = \frac{[\,1 + e \cdot d',\ \ e \times d'\,]}{\left\lVert [\,1 + e \cdot d',\ \ e \times d'\,] \right\rVert} \tag{15}$$

where $e$ is the reference unit vector of Eq. 5, and $\times$ and $\cdot$ are the cross and dot products, respectively.

For the other six channels of linear acceleration and angular velocity, since their measurements are relative to the sensor’s coordinate system, the rotation and mirroring operations do not affect the values. Some random noise (uniformly distributed within a bounded range of the original signals) is added to simulate possible fluctuations.
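A minimal sketch of the kinematics-based augmentation is given below. It is written under explicit assumptions: quaternions are stored as `[w, x, y, z]`, the augmenting rotation is applied by left-multiplication, the mirrored quaternion is taken as the shortest-arc rotation between two unit vectors (built from their cross and dot products), and the noise ratio passed to `jitter` is a hypothetical parameter because the range used in the paper is not stated here.

```python
import numpy as np

def quat_mul(p, q):
    """Hamilton product of two quaternions [w, x, y, z]."""
    pw, pv = p[0], np.asarray(p[1:], dtype=float)
    qw, qv = q[0], np.asarray(q[1:], dtype=float)
    w = pw * qw - pv @ qv
    v = pw * qv + qw * pv + np.cross(pv, qv)
    return np.concatenate(([w], v))

def axis_angle_quat(axis, theta):
    """Rotation quaternion for an angle `theta` (radians) about `axis`."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    return np.concatenate(([np.cos(theta / 2)], np.sin(theta / 2) * axis))

def rotate_orientation(q, axis, theta):
    """Rotation augmentation (left-multiplication is an assumption)."""
    return quat_mul(axis_angle_quat(axis, theta), q)

def mirror_vector(d, normal):
    """Reflect direction vector `d` against the plane with the given normal."""
    d = np.asarray(d, dtype=float)
    n = np.asarray(normal, dtype=float)
    n = n / np.linalg.norm(n)
    return d - 2.0 * (d @ n) * n

def quat_between(a, b):
    """Shortest-arc unit quaternion rotating unit vector `a` onto `b`."""
    a = np.asarray(a, dtype=float) / np.linalg.norm(a)
    b = np.asarray(b, dtype=float) / np.linalg.norm(b)
    q = np.concatenate(([1.0 + a @ b], np.cross(a, b)))
    norm = np.linalg.norm(q)
    if norm < 1e-12:
        raise ValueError("vectors are opposite; the rotation axis is not unique")
    return q / norm

def jitter(signal, ratio, rng):
    """Add uniform noise within +/- `ratio` of each value (ratio is hypothetical)."""
    signal = np.asarray(signal, dtype=float)
    return signal * (1.0 + rng.uniform(-ratio, ratio, size=signal.shape))

q_new = rotate_orientation([1.0, 0.0, 0.0, 0.0], axis=[0, 0, 1], theta=np.pi / 2)
d_mirrored = mirror_vector([1.0, 2.0, 3.0], normal=[0, 0, 1])
print(q_new, d_mirrored, quat_between([1, 0, 0], [0, 1, 0]))
```

Because the augmentation acts on the raw orientation signal, every generated sample is a pose sequence that a real arm could have produced, which is the property that image-level jittering lacks.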

4 Multi-modal Recognition

In this section the developed multi-modal approach for worker activity recognition is detailed: four deep learning architectures created for different input modalities are presented; the cost function for training each modality is introduced; and the inference fusion strategies to output the recognition result are described.

4.1 Deep Learning Architectures of Four Input Modalities

After the preprocessing, signal representation generation and data augmentation described in Section 3, there are $N$ data samples $\{x_i\}_{i=1}^{N}$ (we reuse the notation $N$ for simplicity, although it is larger than the one in Eq. 1 due to the data augmentation), each of which contains four different inputs:

$$x_i = \left(I_f,\ I_{och},\ I_{frm},\ V_c\right)_i \tag{16}$$

where $I_f$, $I_{och}$, $I_{frm}$ and $V_c$ are the four inputs of frequency feature transform, spatial orientation changing history (och) feature transform, frame-level visual representation and video-level visual representation, respectively.

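For the three image inputs, $I_f$, $I_{och}$ and $I_{frm}$, the 2D convolutional operation [6] is applied to extract features layer by layer. The value at position $(x, y)$ in the $j$th feature map of the $i$th layer is computed by

$$v_{ij}^{xy} = \sigma\!\left(b_{ij} + \sum_{m}\sum_{p=0}^{P_i-1}\sum_{q=0}^{Q_i-1} w_{ijm}^{pq}\, v_{(i-1)m}^{(x+p)(y+q)}\right) \tag{17}$$

where $\sigma$ denotes a non-linear activation function, $b_{ij}$ is the bias for this feature map, $m$ is the index of the feature maps in layer $i-1$, $w_{ijm}^{pq}$ is the value at the position $(p, q)$ of the kernel connected to the $m$th feature map, and $P_i$ and $Q_i$ are the height and width of the two-dimensional kernel, respectively.

For the video-clip input $V_c$, the 3D convolutional operation [6] is applied to deal with the additional temporal dimension. The value at position $(x, y, z)$ in the $j$th feature map of the $i$th layer is given by

$$v_{ij}^{xyz} = \sigma\!\left(b_{ij} + \sum_{m}\sum_{p=0}^{P_i-1}\sum_{q=0}^{Q_i-1}\sum_{r=0}^{R_i-1} w_{ijm}^{pqr}\, v_{(i-1)m}^{(x+p)(y+q)(z+r)}\right) \tag{18}$$

where $R_i$ is the size of the 3D kernel along the temporal dimension, and $w_{ijm}^{pqr}$ is the $(p, q, r)$th value of the kernel connected to the $m$th feature map in the previous layer.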
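To make the 2D convolutional operation concrete, the sketch below computes one layer directly from its definition: each output value is the bias plus the sum, over all previous-layer feature maps and all kernel offsets, of kernel weight times input value, passed through an activation. The function name, the array layout, and the default `tanh` activation are assumptions for illustration; a real model would use a library layer instead of explicit loops.

```python
import numpy as np

def conv2d_layer(prev_maps, kernels, bias, activation=np.tanh):
    """Naive 'valid' 2D convolution layer.

    prev_maps: (M, H, W) feature maps of the previous layer
    kernels:   (J, M, P, Q) one P x Q kernel per (output map, input map) pair
    bias:      (J,) one bias per output feature map
    """
    M, H, W = prev_maps.shape
    J, _, P, Q = kernels.shape
    out = np.empty((J, H - P + 1, W - Q + 1))
    for j in range(J):
        for x in range(H - P + 1):
            for y in range(W - Q + 1):
                patch = prev_maps[:, x:x + P, y:y + Q]   # all M maps at once
                out[j, x, y] = bias[j] + np.sum(kernels[j] * patch)
    return activation(out)

rng = np.random.default_rng(0)
maps = rng.normal(size=(3, 8, 8))
out = conv2d_layer(maps, rng.normal(size=(4, 3, 3, 3)), np.zeros(4))
print(out.shape)
```

The 3D case used for video clips follows the same pattern with one more loop (and one more kernel axis) over the temporal dimension.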

The feature maps obtained from a series of convolutional operations are flattened as a feature vector. To solve the classification problem, the vector is further input to a multi-layer neural network. The value of the $j$th neuron in the $i$th fully connected layer, denoted as $v_{ij}$, is given by

$$v_{ij} = \sigma\!\left(b_{ij} + \sum_{m} w_{ijm}\, v_{(i-1)m}\right) \tag{19}$$

where $b_{ij}$ is the bias term, $m$ indexes the set of neurons in the $(i-1)$th layer connected to the current feature vector, and $w_{ijm}$ is the weight value in the $i$th layer connecting the $j$th neuron to the $m$th neuron in the previous layer.

In detail, the proposed CNN models for the four input modalities are described as follows:

$I_f$: The architecture of our CNN model for $I_f$ is illustrated in Figure 7. It accepts the frequency image $I_f$ as the input, and outputs a probability distribution over the 6 activities. $I_f$ has the size of $H \times W \times C$ (height, width, depth, respectively) and is normalized to a fixed interval before being fed into two convolutional layers for feature extraction. The output of each convolutional layer is down-sampled by half with a max pooling layer. The classification module accepts the feature map from the last pooling layer and flattens it as a feature vector. Then, two fully connected layers are used to reduce the feature vector to the dimensions of 128 and $K$ sequentially, where $K$ is the number of worker activity classes. Finally, this $K$-dimensional score vector $z_i$ is transformed to output the predicted probabilities with a softmax function as follows:

$$p_{i,k} = \frac{\exp(z_{i,k})}{\sum_{j=1}^{K}\exp(z_{i,j})} \tag{20}$$

where $p_{i,k}$ is the predicted probability of being class $k$ for sample $i$.
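The classification module (two fully connected layers followed by a softmax) can be sketched as below. The ReLU activation of the hidden layer, the function names and the random weights are assumptions for illustration; only the 128-unit hidden layer and the softmax output over the activity classes come from the text.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - scores.max(axis=-1, keepdims=True)  # avoids overflow in exp
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def classifier_head(features, w1, b1, w2, b2):
    """Flattened features -> 128 hidden units -> K class probabilities."""
    hidden = np.maximum(0.0, features @ w1 + b1)   # ReLU is an assumed activation
    return softmax(hidden @ w2 + b2)

rng = np.random.default_rng(0)
features = rng.normal(size=(5, 256))               # 5 samples, 256-dim feature vectors
probs = classifier_head(features,
                        rng.normal(size=(256, 128)) * 0.05, np.zeros(128),
                        rng.normal(size=(128, 6)) * 0.05, np.zeros(6))
print(probs.shape, probs.sum(axis=1))
```

Subtracting the row maximum before exponentiating does not change the result of the softmax but keeps it finite for large scores.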

Figure 7: The architecture of our CNN model for $I_f$. ‘Conv.’ and ‘Pool.’ denote the operations of convolution and pooling, respectively.


$I_{och}$ and $I_{frm}$: For these two input modalities, we use transfer learning to solve the image classification problem instead of building and training CNN models from scratch. To extract image features, we use a VGG network pretrained on the ImageNet dataset [22]. For each image input, the feature vector obtained from the fully connected layer FC7 in the VGG model is used to represent the image, and a new classifier is designed on top of it to output the prediction of the activity class.

$V_c$: The video-clip input contains spatial-temporal information. We use the C3D model pretrained on the Sports-1M dataset [23, 24, 25]. The C3D network reads sequential frames and outputs a fixed-length feature vector every 16 frames. We extract activation vectors from the fully connected layer FC6-1, which are then connected to a new classifier to predict the worker activity class.

4.2 Training

Training a deep learning model refers to a process of optimizing the network’s weights $W$ to minimize a chosen loss function using training data $\{(x_i, y_i)\}_{i=1}^{N}$. The commonly used regularized cross entropy [26] is chosen as the loss function:

$$\mathcal{L}(W) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} \mathbb{1}(y_i = k)\,\log p_{i,k} + \lambda \lVert W \rVert_2^2 \tag{21}$$

where $p_{i,k}$ is the predicted probability of class $k$ for sample $i$, and $\mathbb{1}(y_i = k)$ equals 1 when the ground truth label of input sample $i$ is the $k$th label, and equals 0 otherwise. To penalize large weights during training, an $\ell_2$ regularization term is applied to the loss function, where $\lambda$ is its coefficient. To mitigate overfitting, dropout regularization [27] is used during training, which randomly drops neuron units from the neural network.
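A small numerical sketch of the regularized cross-entropy loss follows. The exact scaling of the terms (averaging over samples, sum of squared weights without a factor of 1/2) is an assumption; the default coefficient `lam=1e-5` is the regularizer coefficient reported in Section 5.1. The small constant `eps` is added only to avoid `log(0)`.

```python
import numpy as np

def regularized_cross_entropy(probs, labels, weights, lam=1e-5, eps=1e-12):
    """Mean cross entropy plus an L2 penalty on the weights.

    probs:   (N, K) predicted class probabilities
    labels:  (N,) integer ground-truth class indices
    weights: iterable of weight arrays of the network
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    n = probs.shape[0]
    # The indicator keeps only the probability assigned to the true class.
    picked = probs[np.arange(n), labels]
    data_term = -np.mean(np.log(picked + eps))
    reg_term = lam * sum(float(np.sum(np.square(w))) for w in weights)
    return data_term + reg_term

probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(regularized_cross_entropy(probs, [0, 1], [np.ones((3, 3))]))
```

A confident correct prediction contributes almost nothing to the data term, while a confident wrong prediction is penalized heavily, which is why the indicator must select the true class.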

4.3 Inference Fusion

Just as humans use their five senses to perceive the world, a multi-modal approach has the opportunity to integrate all the information and form a comprehensive understanding of the learning problem. Mathematically, each individual model returns a probability distribution over the worker activity classes, and we can design different strategies to fuse the inferences from different models:

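Maximum fusion. This method reports the maximum output within a list of predictions.

$$P_k^{max} = \max_{m \in \{1, \dots, M\}} p_k^{(m)} \tag{22}$$

where $p_k^{(m)}$ is the probability of class $k$ predicted by model $m$, $m$ is the index of different models and $M$ is the total number of models.

Average fusion. In this method, we adopt the average to fuse the outputs of different modalities, i.e.,

$$P_k^{avg} = \frac{1}{M}\sum_{m=1}^{M} p_k^{(m)} \tag{23}$$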

Weighted fusion. We introduce the informativity value $\gamma_m$ to evaluate the prediction confidence of each modality $m$. $\gamma_m$ is calculated with Eq. 24, which is modified from the Shannon entropy of a discrete probability distribution to vary in the interval of $[0, 1]$.

$$\gamma_m = 1 + \frac{1}{\log n}\sum_{j=1}^{n} p_j^{(m)} \log p_j^{(m)} \tag{24}$$

where $m$ is the index of modalities and $j$ is the index of the top-$n$ candidates. $p_j^{(m)}$ represents the probability of the $j$th class candidate at the $m$th model. $\gamma_m$ will be close to 0 if all the top-$n$ candidates have similar probabilities (i.e., $p_j^{(m)} \approx 1/n$), and 1 if the probability of the top-1 class candidate is about reaching 1 (i.e., $p_1^{(m)} \approx 1$).

Then every predicted probability of the $m$th model is weighted by $\gamma_m$ of this model, and the weighted maximum fusion and the weighted average fusion scores are

$$P_k^{wmax} = \max_{m \in \{1, \dots, M\}} \gamma_m\, p_k^{(m)} \tag{25}$$

$$P_k^{wavg} = \frac{1}{M}\sum_{m=1}^{M} \gamma_m\, p_k^{(m)} \tag{26}$$

For the above four fusion strategies, the final predicted label is chosen as the one that maximizes the fusion score (e.g., for weighted average fusion, $\hat{y} = \arg\max_k P_k^{wavg}$).
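The four fusion strategies can be sketched in a few lines. The informativity used here is an assumed form, one minus the Shannon entropy of the top-n probabilities divided by log n, chosen because it is 0 for a uniform distribution and 1 for a one-hot distribution as described above. Renormalizing the top-n probabilities before computing the entropy, the default `top_n=3`, and the method names are further assumptions.

```python
import numpy as np

def informativity(p, top_n=3):
    """Confidence in [0, 1]: 0 for uniform top-n probabilities, 1 for one-hot."""
    top = np.sort(np.asarray(p, dtype=float))[::-1][:top_n]
    top = top / top.sum()                      # assumed renormalization
    nonzero = top[top > 0]                     # 0 * log(0) is treated as 0
    entropy = -np.sum(nonzero * np.log(nonzero))
    return 1.0 - entropy / np.log(top_n)

def fuse(probs, method="average", top_n=3):
    """Fuse per-model class probabilities of shape (M, K); returns (label, scores)."""
    methods = ("maximum", "average", "weighted_maximum", "weighted_average")
    if method not in methods:
        raise ValueError(f"method must be one of {methods}")
    probs = np.asarray(probs, dtype=float)
    if method.startswith("weighted"):
        gamma = np.array([informativity(p, top_n) for p in probs])
        probs = probs * gamma[:, None]         # weight each model by its confidence
    scores = probs.max(axis=0) if method.endswith("maximum") else probs.mean(axis=0)
    return int(np.argmax(scores)), scores

# Two models, three classes: the first model is confident, the second is not.
probs = [[0.7, 0.2, 0.1], [0.3, 0.4, 0.3]]
for name in ("maximum", "average", "weighted_maximum", "weighted_average"):
    print(name, fuse(probs, name))
```

In the weighted variants a nearly uniform prediction receives a weight close to 0, so an unsure modality contributes little to the fused score.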

5 Experiments and Evaluation Metrics

5.1 Implementation Details

The CNN architectures of the four input modalities described in the previous sections are constructed using the TensorFlow and Keras libraries. They are trained individually so that each of them can make its own inference for further decision fusion. The SGD optimizer is used in training, with a momentum of 0.9, a learning rate of 0.001 and a regularizer coefficient of 1e-5. The batch sizes of the four models are 512, 64, 64 and 512, respectively, which are limited by the available memory. The number of training epochs is 1000 for the first modality $I_f$ and 100 for the other modalities. We use a workstation with one 12-core Intel Xeon processor, 64GB of RAM and two Nvidia GeForce 1080 Ti graphics cards for the training jobs. It takes approximately 30 minutes to train each model for a leave-one-out experiment.

5.2 Evaluation Metric

Two evaluation policies are used, i.e., the half-half and leave-one-out policies. In the half-half evaluation, after random shuffling, one half of the dataset is used for training and the other half is kept for testing. In the leave-one-out evaluation, the samples from 7 out of 8 subjects are used for training, and the samples of the remaining subject are reserved for testing. We employ several commonly used metrics [26] to evaluate the classification performance, which are listed as follows:

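  • Accuracy

$$\text{Accuracy} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}(\hat{y}_i = y_i)$$

  • Precision and Recall

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$

  • $F_1$ score

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

where $\mathbb{1}(\cdot)$ is an indicator function, $\hat{y}_i$ is the predicted label and $y_i$ is the ground truth label of sample $i$. For a certain class $k$, True Positive (TP) is defined as a sample of class $k$ that is correctly classified as $k$; False Positive (FP) means a sample from a class other than $k$ is misclassified as $k$; False Negative (FN) means a sample from the class $k$ is misclassified as another ‘not $k$’ class.

The $F_1$ score is the harmonic mean of Precision and Recall, and ranges in the interval [0,1].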
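These metrics can be computed per class from the predicted and true labels as sketched below. The function name is hypothetical, and returning per-class values (rather than a particular average over classes) is an assumption, since the averaging scheme is not specified here. A class with no predictions or no samples is given a value of 0 to avoid division by zero.

```python
import numpy as np

def classification_metrics(y_true, y_pred, num_classes):
    """Return overall accuracy and per-class (precision, recall, F1)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    accuracy = float(np.mean(y_true == y_pred))
    per_class = {}
    for k in range(num_classes):
        tp = int(np.sum((y_pred == k) & (y_true == k)))
        fp = int(np.sum((y_pred == k) & (y_true != k)))
        fn = int(np.sum((y_pred != k) & (y_true == k)))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_class[k] = (precision, recall, f1)
    return accuracy, per_class

acc, per_class = classification_metrics([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], 3)
print(acc, per_class)
```

In the leave-one-out policy, such a function would be called once per held-out subject and the results aggregated over the eight runs.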

6 Results and Discussion

In this section, we first evaluate the data augmentation methods. Then, we compare the performance of different fusion methods. After that, we explore various modalities and their combinations in an ablation study. The performance of our approach on a public dataset is also reported. Then, we provide visualizations for a better understanding of the CNN model. Finally, future research needs are discussed.

6.1 Evaluation of the Data Augmentation Methods

To evaluate the effectiveness of our proposed kinematics-based augmentation (KA) method, we compare it to the jittering augmentation (JA) method [29], which has been proved to be an effective method and is commonly used in CNN-based image classification tasks.

For the KA method, four rotation angles are selected for rotation augmentation, i.e., four new samples are generated by rotating each original signal sample, and two mirroring planes are chosen for mirroring augmentation, overall yielding 6 augmented samples for each actual sample. The augmented training data is therefore 6 times as large as the original one. Note that the augmentation is applied directly on each original signal sample before the feature transforms.

As for the JA method, to have a fair comparison with the KA method, 6 augmented samples are generated by randomly translating the image within a bounded fraction of its width/height, scaling it within a bounded ratio range, and rotating it within a bounded angle range. In the JA method, the augmentation is applied to each image $I_f$ and $I_{och}$ after the feature transforms.

We also evaluate the performance of the JA+KA method, in which the augmented data from the JA and KA methods are integrated. The leave-one-out evaluations of the two modalities $I_f$ and $I_{och}$ on our activity dataset with the different augmentation methods are shown in Table 3 (the half-half accuracies are not considered in this comparison because they are already close to 100%). All three augmentation methods improve the accuracy compared with the models trained without data augmentation. For $I_f$, the JA method improves the accuracy from 88.0% to 88.7%, and the KA method outperforms the JA method with an accuracy of 90.2%. By combining the JA and KA methods, the accuracy is slightly further improved to 90.5%. For $I_{och}$, the accuracy is improved from 63.6% to 65.0% with the JA method, and is further improved to 77.3% with the KA method, 12 percentage points higher than the JA method. However, the JA+KA method does not further improve the accuracy: its accuracy of 75.3% is lower than that of the KA method.

| Modalities | None | JA | KA | JA+KA |
|---|---|---|---|---|
| | 88.01 | 88.71 | 90.18 | 90.51 |
| | 63.59 | 65.01 | 77.27 | 75.33 |

Table 3: Comparison (%) of accuracy with respect to data augmentation method: None (without data augmentation), JA (jittering augmentation), KA (kinematics augmentation) and JA+KA, for the leave-one-out experiments.

Overall, both data augmentation techniques, JA and KA, are effective in improving the model performance, because augmentation introduces more variation into the training dataset to simulate the potential variation in unseen samples, which pushes the deep learning model to learn the most discriminative features and makes the training more robust. Meanwhile, the KA method outperforms the JA method. This is because, rather than introducing variations to the image as the JA method does, the KA method directly generates physically realistic variations of the original signal sample, which augments the dataset more comprehensively. Although the JA+KA method slightly improves on the KA method for the first modality in Table 3, it does not for the second. Because the JA+KA method has a larger amount (i.e., 2 times) of training data and has a more complex architecture than , which makes the training less efficient, we choose the KA method for both modalities in the following study as a compromise between performance and training efficiency.

6.2 Evaluation of Different Fusion Methods

Each of the four input modalities generates a vector output before the fusion step. These vector outputs are then fused into a single score vector as the final output. To study the effect of four different fusion methods, namely 1) maximum fusion, 2) average fusion, 3) weighted maximum fusion and 4) weighted average fusion, a set of experiments is conducted on our activity dataset.
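The four fusion rules can be sketched as follows. This is a minimal illustration under stated assumptions: the weighted variants are assumed to scale each modality's probability vector by a per-modality weight before taking the element-wise maximum or the weighted mean, and the weights and probability values shown are made up for the example rather than taken from our experiments.

```python
def max_fusion(probs):
    """Element-wise maximum over the modality probability vectors."""
    return [max(column) for column in zip(*probs)]

def average_fusion(probs):
    """Element-wise mean over the modality probability vectors."""
    return [sum(column) / len(probs) for column in zip(*probs)]

def _scale(probs, weights):
    """Multiply each modality's vector by its weight (assumed weighting scheme)."""
    return [[w * p for p in vec] for vec, w in zip(probs, weights)]

def weighted_max_fusion(probs, weights):
    return max_fusion(_scale(probs, weights))

def weighted_average_fusion(probs, weights):
    total = sum(weights)
    return [sum(column) / total for column in zip(*_scale(probs, weights))]

def predict(scores):
    """Index of the highest fused score."""
    return max(range(len(scores)), key=scores.__getitem__)

# Four modalities, three classes; the numbers are illustrative only.
probs = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.6, 0.3, 0.1],
    [0.3, 0.6, 0.1],
]
weights = [0.9, 0.8, 0.9, 0.8]  # hypothetical per-modality weights

for name, fused in [
    ("maximum", max_fusion(probs)),
    ("average", average_fusion(probs)),
    ("weighted max", weighted_max_fusion(probs, weights)),
    ("weighted avg", weighted_average_fusion(probs, weights)),
]:
    print(name, predict(fused))
```

Note that only the average-based rules keep the fused vector normalized; the maximum-based rules produce scores that are compared but no longer sum to one.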

The comparisons of the fusion performance, in terms of accuracy, precision, recall and F score, are listed in Table 4. The average fusion method performs better than the maximum fusion method on all metrics. The weighted maximum method has the same accuracy as the maximum method but slightly lower precision, recall and F score. The weighted average method performs worse than the average method. Thus, the two weighted methods do not contribute the additional improvement that they did in [30]. Therefore, the average method is chosen as the fusion strategy for the following experiments.

| Fusion Methods | Accuracy | Precision | Recall | F Score |
|---|---|---|---|---|
| Maximum | 93.68 | 92.48 | 92.50 | 91.09 |
| Average | 97.17 | 97.04 | 96.82 | 96.81 |
| Weighted Max. | 93.68 | 92.45 | 92.49 | 91.07 |
| Weighted Avg. | 96.79 | 96.38 | 96.28 | 96.04 |

Table 4: Comparison (%) of different fusion methods for the leave-one-out experiments.
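For reference, the four metrics can be computed from predicted and true labels as in the sketch below. It assumes macro-averaging over classes and the balanced F score (harmonic mean of precision and recall); these are assumptions about the averaging scheme, and the labels in the example are toy values.

```python
def macro_metrics(y_true, y_pred, num_classes):
    """Return (accuracy, macro precision, macro recall, macro F score)."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls, f_scores = [], [], []
    for c in range(num_classes):
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_score = (2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f_scores.append(f_score)
    n = num_classes
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f_scores) / n

# Toy two-class example.
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
acc, prec, rec, f = macro_metrics(y_true, y_pred, num_classes=2)
print(round(acc, 4), round(prec, 4), round(rec, 4), round(f, 4))
```

Under macro-averaging the F score is averaged per class, so it can be lower than both the averaged precision and the averaged recall.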

6.3 Evaluation of Different Input Modalities

A central idea of our approach is that reasoning based on multiple modalities can significantly improve on inference based on a single modality. To validate this idea, we perform a comprehensive ablation study in which we progressively increase the number of modalities and try different modality combinations. The performance of these cases in terms of accuracy, precision, recall and F score under the two evaluation policies (half-half and leave-one-out) is summarized in Table 5.

To simplify the abbreviation, we use , , and to represent the four input modalities, , , and , respectively. For the single-modal cases, although and are both based on the IMU signals, shows lower performance because it only uses the 4 orientation channels out of the 10 channels. Also, it demonstrates that the frequency feature transform provides more discriminative features for activity recognition. performs better than , which shows that the current pretrained VGG model can extract more discriminative features than the C3D model. Overall, achieves the highest performance in the single-modal cases, whose metric items are accuracy (90.2%), precision (90.7%), recall (89.5%) and F score (87.6%), respectively.

For the dual-modal cases, all the 6 combinations are evaluated. All the cases have better results compared with their related single-modal cases, e.g., performs better than both and . has the highest accuracy as their individual modalities are also the highest two for the single-modal cases.

For the triple-modal cases, 4 combinations are tested. Fusing more modalities further improves the performance over the dual-modal cases. has the highest accuracy as their individual modalities are also the highest three for the single-modal cases.

Finally, a quad-modal case including all four modalities is evaluated, and it achieves the highest performance. Therefore, we choose the quad-modal architecture for our model.
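The ablation cases above cover every non-empty subset of the four modalities. A short sketch of the enumeration is given below; the modality labels are descriptive placeholders introduced only for this example, not the notation used in this paper.

```python
from itertools import combinations

# Placeholder labels (assumptions) for the two IMU-based and two video-based modalities.
modalities = ["imu_a", "imu_b", "video_frame", "video_clip"]

cases = {size: list(combinations(modalities, size))
         for size in range(1, len(modalities) + 1)}

for size, combos in cases.items():
    print(size, len(combos))

total = sum(len(combos) for combos in cases.values())
print(total)  # 4 single + 6 dual + 4 triple + 1 quad = 15 cases
```

Each enumerated case would then be evaluated by fusing the outputs of the selected modalities.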

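| Methods | Accuracy (hh) | Accuracy (loo) | Precision (hh) | Precision (loo) | Recall (hh) | Recall (loo) | F Score (hh) | F Score (loo) |
|---|---|---|---|---|---|---|---|---|
| Previous [31] | 97.6 | 87.4 | 97.8 | 89.0 | 97.5 | 89.5 | 97.7 | 87.6 |
| | 99.5 | 90.2 | 99.5 | 90.7 | 99.6 | 90.9 | 99.5 | 90.3 |
| | 93.0 | 77.3 | 92.3 | 77.5 | 92.5 | 78.3 | 92.4 | 75.0 |
| | 100 | 86.8 | 100 | 83.0 | 100 | 83.2 | 100 | 81.3 |
| | 100 | 80.8 | 100 | 79.1 | 100 | 77.7 | 100 | 74.3 |
| | 99.6 | 91.1 | 99.6 | 91.5 | 99.6 | 92.1 | 99.6 | 91.4 |
| | 100 | 94.8 | 100 | 94.9 | 100 | 94.6 | 100 | 94.3 |
| | 100 | 92.2 | 100 | 93.1 | 100 | 91.3 | 100 | 90.1 |
| | 100 | 90.3 | 100 | 90.9 | 100 | 87.8 | 100 | 87.0 |
| | 100 | 85.0 | 100 | 84.0 | 100 | 82.8 | 100 | 80.2 |
| | 100 | 89.5 | 100 | 86.3 | 100 | 86.0 | 100 | 84.2 |
| | 100 | 95.3 | 100 | 95.4 | 100 | 95.4 | 100 | 95.2 |
| | 100 | 93.9 | 100 | 93.5 | 100 | 94.1 | 100 | 93.0 |
| | 100 | 95.9 | 100 | 95.2 | 100 | 94.8 | 100 | 94.2 |
| | 100 | 92.6 | 100 | 90.4 | 100 | 90.7 | 100 | 89.5 |
| | 100 | 97.2 | 100 | 97.0 | 100 | 96.8 | 100 | 96.8 |
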
  • , , and represent the four input modalities, , , and , respectively. denotes a multi-modality model, e.g., represents the fusion of the and modalities. We use average fusion method to fuse multiple modalities together.

Table 5: Overall performance (%) of the half-half (hh) and leave-one-out (loo) experiments.

For the half-half experiments, almost all of the testing samples are correctly recognized, and the accuracy is higher than in the leave-one-out experiments. This is because all the testing subjects are also seen during training in the half-half experiments, while the testing subject in each leave-one-out experiment is unseen.
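The difference between the two evaluation policies can be made concrete with the sketch below. The leave-one-out split holds out all samples of one subject per fold; the half-half split is assumed here to divide each subject's samples evenly between training and testing, which is one plausible reading rather than a statement of the exact protocol. The sample records are toy data.

```python
def leave_one_subject_out(samples):
    """Yield (held_out_subject, train, test), one fold per subject."""
    subjects = sorted({s["subject"] for s in samples})
    for held_out in subjects:
        train = [s for s in samples if s["subject"] != held_out]
        test = [s for s in samples if s["subject"] == held_out]
        yield held_out, train, test

def half_half(samples):
    """Split each subject's samples in half (assumed protocol, no shuffling)."""
    by_subject = {}
    for s in samples:
        by_subject.setdefault(s["subject"], []).append(s)
    train, test = [], []
    for subject_samples in by_subject.values():
        half = len(subject_samples) // 2
        train.extend(subject_samples[:half])
        test.extend(subject_samples[half:])
    return train, test

# Toy data: 8 subjects with 4 samples each.
samples = [{"subject": s, "index": i} for s in range(8) for i in range(4)]

folds = list(leave_one_subject_out(samples))
hh_train, hh_test = half_half(samples)
print(len(folds), len(hh_train), len(hh_test))
```

In the leave-one-out folds the test subject never appears in the training set, whereas in the half-half split every test subject also contributes training samples.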

6.4 Performance Comparison on the Public Dataset

To validate the generalization of our method, a commonly-used public dataset for human activity recognition, PAMAP2 dataset [32], is also chosen for comparison. This dataset has 12 human activities (lying, sitting, standing, walking, running, cycling, Nordic walking, ascending stairs, descending stairs, vacuum cleaning, ironing and rope jumping) captured by three IMU sensors (worn on the wrist, chest and ankle, respectively), and the activities are performed by 9 different subjects. Since the PAMAP2 dataset does not include video recordings, we evaluate the performance of our CNN models of the first two modalities on it. The performance comparison of several existing deep learning models on the PAMAP2 dataset is listed in Table 6. Using the same evaluation protocol, our model achieves the best recognition accuracy, 94.2%, compared with other methods in the literature.

| Method | Accuracy |
|---|---|
| Hammerla et al. (2016) [33] | 93.70 |
| Murahari et al. (2018) [34] | 87.50 |
| Zeng et al. (2018) [35] | 89.96 |
| Xi et al. (2018) [36] | 93.50 |
| Xu et al. (2019) [37] | 93.50 |
| Our model | 94.16 |

Table 6: Performance (%) comparison of existing deep models on the PAMAP2 activity dataset.

6.5 Visualizing the Class Activation Map of

It is known that a deep learning model driven by a large amount of data can achieve superior performance on various tasks, such as the image classification task of our third modality. However, such a model is usually treated as a black box and criticized for weak interpretability, due to its high complexity and very large number of hidden weights. To give an intuitive interpretation of the connection between image contents and the predicted activity class, we visualize the class activation map (CAM), a score heatmap associated with a specific activity class that is computed for every location of an input image and represents how strongly each location is connected to that class. A set of CAM visualizations is shown in Figure 8, where the generated heatmaps are overlaid on the input images. We can see that the model is able to focus on the hand and tool regions, exactly where the interaction happens.

Figure 8: Examples of Class Activation Map (CAM) Visualization.
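One common way to obtain such a heatmap is the original CAM formulation, in which the map for a class is the weighted sum of the last convolutional feature maps, using that class's weights from the layer following global average pooling. The sketch below illustrates this computation on tiny hand-made feature maps; it assumes that formulation and is not necessarily the exact procedure used to produce Figure 8.

```python
def class_activation_map(feature_maps, class_weights):
    """Weighted sum of K feature maps (each H x W) for one class.

    feature_maps: list of K nested lists of shape H x W.
    class_weights: list of K weights linking each map to the class.
    """
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    return [[sum(w * fm[i][j] for w, fm in zip(class_weights, feature_maps))
             for j in range(width)]
            for i in range(height)]

def normalize(cam):
    """Rescale the map to [0, 1] so it can be rendered as a heatmap."""
    values = [v for row in cam for v in row]
    lo, hi = min(values), max(values)
    if hi == lo:
        return [[0.0 for _ in row] for row in cam]
    return [[(v - lo) / (hi - lo) for v in row] for row in cam]

# Two 2x2 feature maps and illustrative class weights.
feature_maps = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 4.0], [0.0, 0.0]],
]
class_weights = [1.0, 0.5]

cam = class_activation_map(feature_maps, class_weights)
heatmap = normalize(cam)
print(heatmap)
```

To overlay the result on an input image, the low-resolution map would additionally be upsampled to the image size, a step omitted here.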

6.6 Future Research Needs

At present, we conduct multi-modal recognition of 6 basic activities. To push the current approach further toward practical application, several directions for future work are considered, such as recruiting more subjects to learn more working styles, optimizing data augmentation techniques to add more variations to the collected data, and exploring different methods of signal preprocessing and feature extraction to fully exploit the recorded signals. In addition, more fusion methods can be explored, and each modality can be further improved to reach its optimal performance.

7 Conclusion

Worker behavior awareness is crucial towards human-centered intelligent manufacturing. In this paper, we proposed a multi-modal approach for worker activity recognition. Two sensors (wearable device and camera) were adopted to perceive the worker, and four modalities were built to recognize the activity independently. Then, inference fusion was implemented to achieve an optimal understanding of the worker’s behavior.

We designed two novel mechanisms to produce image representations of the IMU sensor signals in both the frequency and spatial domains. A kinematics-based data augmentation method was developed to generate more physically-realistic variations in the training dataset. This performs better than the traditional data augmentation method. A worker activity dataset has been established, which currently involves 8 subjects and contains 6 common activities in assembly tasks (i.e., grab a tool/part, hammer a nail, use a power-screwdriver, rest arms, turn a screwdriver and use a wrench). The multi-modal approach is evaluated on the dataset and achieves 100% and 97% recognition accuracy in the half-half and leave-one-out experiments, respectively. Our approach can be further generalized to other sensors, modalities, and working contexts.


This research work is supported by the National Science Foundation grants CMMI-1646162 and NRI-1830479, and also by the Intelligent Systems Center at Missouri University of Science and Technology. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.


  • [1] S. Jeschke, C. Brecher, T. Meisen, D. Özdemir, T. Eschert, Industrial internet of things and cyber manufacturing systems, in: Industrial Internet of Things, Springer, 2017, pp. 3–19.
  • [2] J. Lee, H. D. Ardakani, S. Yang, B. Bagheri, Industrial big data analytics and cyber-physical systems for future maintenance & service innovation, Procedia CIRP 38 (2015) 3–7.
  • [3] K. Nagorny, P. Lima-Monteiro, J. Barata, A. W. Colombo, Big data analysis in smart manufacturing: A review, International Journal of Communications, Network and System Sciences 10 (03) (2017) 31.
  • [4] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436–444.
  • [5] J. Kober, J. A. Bagnell, J. Peters, Reinforcement learning in robotics: A survey, The International Journal of Robotics Research 32 (11) (2013) 1238–1274.
  • [6] S. Ji, W. Xu, M. Yang, K. Yu, 3d convolutional neural networks for human action recognition, IEEE transactions on pattern analysis and machine intelligence 35 (1) (2013) 221–231.
  • [7] J. Carreira, A. Zisserman, Quo vadis, action recognition? a new model and the kinetics dataset, in: Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, IEEE, 2017, pp. 4724–4733.
  • [8] T. Stiefmeier, G. Ogris, H. Junker, P. Lukowicz, G. Troster, Combining motion sensors and ultrasonic hands tracking for continuous activity recognition in a maintenance scenario, in: Wearable Computers, 2006 10th IEEE International Symposium on, IEEE, 2006, pp. 97–104.
  • [9] T. Stiefmeier, D. Roggen, G. Troster, Fusion of string-matched templates for continuous activity recognition, in: Wearable Computers, 2007 11th IEEE International Symposium on, IEEE, 2007, pp. 41–44.
  • [10] T. Stiefmeier, D. Roggen, G. Ogris, P. Lukowicz, G. Tröster, Wearable activity tracking in car manufacturing, IEEE Pervasive Computing 7 (2).
  • [11] H. Koskimaki, V. Huikari, P. Siirtola, P. Laurinen, J. Roning, Activity recognition using a wrist-worn inertial measurement unit: A case study for industrial assembly lines, in: Control and Automation, 2009. MED’09. 17th Mediterranean Conference on, IEEE, 2009, pp. 401–405.
  • [12] T. Maekawa, D. Nakai, K. Ohara, Y. Namioka, Toward practical factory activity recognition: unsupervised understanding of repetitive assembly work in a factory, in: Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing, ACM, 2016, pp. 1088–1099.
  • [13] P. Wang, H. Liu, L. Wang, R. X. Gao, Deep learning-based human motion recognition for predictive context-aware human-robot collaboration, CIRP Annals.
  • [14] H. Petruck, A. Mertens, Using convolutional neural networks for assembly activity recognition in robot assisted manual production, in: International Conference on Human-Computer Interaction, Springer, 2018, pp. 381–397.
  • [15] D. Anguita, A. Ghio, L. Oneto, X. Parra, J. L. Reyes-Ortiz, A public domain dataset for human activity recognition using smartphones., in: ESANN, 2013.
  • [16] T. Peterek, M. Penhaker, P. Gajdoš, P. Dohnálek, Comparison of classification algorithms for physical activity recognition, in: Innovations in Bio-inspired Computing and Applications, Springer, 2014, pp. 123–131.
  • [17] W. Chang, L. Dai, S. Sheng, J. T. C. Tan, C. Zhu, F. Duan, A hierarchical hand motions recognition method based on imu and semg sensors, in: Robotics and Biomimetics (ROBIO), 2015 IEEE International Conference on, IEEE, 2015, pp. 1024–1029.
  • [18] C. A. Ronao, S.-B. Cho, Human activity recognition using smartphone sensors with two-stage continuous hidden markov models, in: Natural Computation (ICNC), 2014 10th International Conference on, IEEE, 2014, pp. 681–686.
  • [19] W. Jiang, Z. Yin, Human activity recognition using wearable sensors by deep convolutional neural networks, in: Proceedings of the 23rd ACM international conference on Multimedia, ACM, 2015, pp. 1307–1310.
  • [20] Thalmic Labs Inc., Myo armband, [Online; accessed 17-August-2019] (2019).
  • [21] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556.
  • [22] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A Large-Scale Hierarchical Image Database, in: CVPR09, 2009.
  • [23] D. Tran, L. Bourdev, R. Fergus, L. Torresani, M. Paluri, Learning spatiotemporal features with 3d convolutional networks, in: Proceedings of the IEEE international conference on computer vision, 2015, pp. 4489–4497.
  • [24] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, Caffe: Convolutional architecture for fast feature embedding, in: Proceedings of the 22nd ACM international conference on Multimedia, ACM, 2014, pp. 675–678.
  • [25] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, L. Fei-Fei, Large-scale video classification with convolutional neural networks, in: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2014, pp. 1725–1732.
  • [26] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016.
  • [27] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting, Journal of machine learning research 15 (1) (2014) 1929–1958.
  • [28] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, X. Zheng, TensorFlow: Large-scale machine learning on heterogeneous systems, software available from (2015).
  • [29] P. Sermanet, Y. LeCun, Traffic sign recognition with multi-scale convolutional networks, in: Neural Networks (IJCNN), The 2011 International Joint Conference on, IEEE, 2011, pp. 2809–2813.
  • [30] W. Tao, M. C. Leu, Z. Yin, American sign language alphabet recognition using convolutional neural networks with multiview augmentation and inference fusion, Engineering Applications of Artificial Intelligence 76 (2018) 202–213.
  • [31] W. Tao, Z.-H. Lai, M. C. Leu, Z. Yin, Worker activity recognition in smart manufacturing using imu and semg signals with convolutional neural networks, Procedia Manufacturing 26 (2018) 1159–1166.
  • [32] A. Reiss, D. Stricker, Introducing a new benchmarked dataset for activity monitoring, in: 2012 16th International Symposium on Wearable Computers, IEEE, 2012, pp. 108–109.
  • [33] N. Y. Hammerla, S. Halloran, T. Plötz, Deep, convolutional, and recurrent models for human activity recognition using wearables, arXiv preprint arXiv:1604.08880.
  • [34] V. S. Murahari, T. Plötz, On attention models for human activity recognition, in: Proceedings of the 2018 ACM International Symposium on Wearable Computers, ACM, 2018, pp. 100–103.
  • [35] M. Zeng, H. Gao, T. Yu, O. J. Mengshoel, H. Langseth, I. Lane, X. Liu, Understanding and improving recurrent networks for human activity recognition by continuous attention, in: Proceedings of the 2018 ACM International Symposium on Wearable Computers, ACM, 2018, pp. 56–63.
  • [36] R. Xi, M. Li, M. Hou, M. Fu, H. Qu, D. Liu, C. R. Haruna, Deep dilation on multimodality time series for human activity recognition, IEEE Access 6 (2018) 53381–53396.
  • [37] C. Xu, D. Chai, J. He, X. Zhang, S. Duan, Innohar: A deep neural network for complex human activity recognition, IEEE Access 7 (2019) 9893–9902.