A Novel Multi-Stage Training Approach for Human Activity Recognition from Multimodal Wearable Sensor Data Using Deep Neural Network

01/03/2021 ∙ by Tanvir Mahmud, et al. ∙ Bangladesh University of Engineering and Technology ∙ Princeton University

Deep neural networks are an effective choice for automatically recognizing human actions from the data of various wearable sensors. These networks automate feature extraction, relying completely on data. However, various noises in time series data, together with complex inter-modal relationships among sensors, make this process more complicated. In this paper, we propose a novel multi-stage training approach that increases diversity in the feature extraction process so that actions are recognized accurately by combining features extracted from diverse perspectives. Initially, instead of using a single type of transformation, numerous transformations are employed on time series data to obtain variegated representations of the features encoded in raw data. An efficient deep CNN architecture is proposed that can be individually trained to extract features from different transformed spaces. Later, these CNN feature extractors are merged into an optimal architecture finely tuned for optimizing the diversified extracted features through a combined training stage or multiple sequential training stages. This approach offers the opportunity to explore the encoded features in raw sensor data utilizing multifarious observation windows, with immense scope for efficient selection of features for final convergence. Extensive experiments have been carried out on three publicly available datasets, providing consistently outstanding performance with average five-fold cross-validation accuracies of 99.29% (HAR database) and 99.02%, outperforming other state-of-the-art approaches.

I Introduction

Activity recognition using wearable sensors has been a trending research topic owing to its wide applicability in diverse domains ranging from health care services to military applications [a1]. With the ubiquitous availability of modern mobile devices such as smartphones, tablets, and smartwatches, various types of sensor data are available that can be utilized effectively in numerous applications such as activity recognition. Various types of sensor data, along with image and video data, have been employed for recognizing human activity [i1]. In this work, we focus mainly on time series data from wearable sensors, e.g., accelerometer, gyroscope, and magnetometer, as these are easy to obtain even with everyday smart devices. Because such sensor data are very small in volume and easy to share over the internet, they can be used to recognize human activity remotely and in real time.

Figure 1: Multiple training stages are utilized to incorporate features from numerous transformed representations of input sensor data. In training stage-1, different feature extractors are individually trained to extract features from different transformed spaces. Features extracted from diverse representation spaces can be weighted and these weight vectors can be jointly optimized either (a) in a combined additional training stage, or (b) multiple sequential training stages can be utilized to learn the weight vectors in a sequential manner.

A large variety of approaches have been applied to achieve correct recognition, ranging from traditional feature-based approaches to end-to-end deep neural networks in recent times. Numerous hand-crafted feature extraction processes with shallow classifiers have been explored in the literature for utilizing multimodal sensor data in activity recognition [r6, n2, m29, m23, n3, m31, m6, m30]. Though these types of handcrafted features perform well in limited training data scenarios, the extraction of effective features gets very complicated as the number of sensors grows. Additionally, the process heavily demands domain expertise for proper selection of features, which becomes harder in the presence of the random noise that occurs very often in practical conditions.

To automate the complicated feature extraction process, various types of deep neural networks have been studied in the literature to recognize human activity from wearable sensor data [m28, rr1, r8, m11, n6, n5, n1, r2, r9, r5, r7]. Most of these approaches directly employ the collected raw sensor data for automated feature extraction using a deep neural network, such as a convolutional neural network (CNN) [m28, rr1, r8], recurrent neural network (RNN) [n1, r2], long short-term memory (LSTM) network [r9], hybrid CNN-LSTM network [r5], or a more complicated LSTM-CNN-LSTM based hierarchical network [r7]. Most of these networks are very deep in structure and therefore require a large amount of data to be trained properly. Moreover, random noise and perturbations in multi-modal sensor data from different sources make it more intricate to operate on the raw data directly. Hence, with an increasing number of sensors and only a small amount of data for some of the activity classes, automated extraction of features from raw sensor data using a deep network becomes a critical problem that severely affects performance.

In [g1, r1, w2, s], various approaches have been introduced to represent time series data in a modified space that makes the feature extraction process easier by reducing the effects of noise or random variations. These transformations of the time series sensor data provide more opportunities to explore the variations of features from different spaces. Though a transformation provides an efficient representation of some features in a different space, other features may become more complicated to extract from that particular space. Different transformations, however, provide diverse viewpoints from which to explore the feature space of raw time series data. Hence, depending solely on a single transformed space for feature exploration, as these studies do, limits the scope of feature extraction and may result in lower performance in many circumstances. If features extracted from different transformed spaces can be incorporated in the final decision-making process, it will provide a more robust opportunity to analyze the information in the raw data. However, the challenging task of integrating effective features from diverse transformed spaces through joint optimization to reach the optimum performance in activity recognition is yet to be attempted.

In this work, we propose a novel multi-stage training methodology to overcome these limitations by efficiently employing a multitude of time-series transformations that facilitate the exploration of diversified feature spaces. Instead of relying on a single transformed space, features from numerous transformed spaces are integrated. An efficient deep convolutional neural network architecture is proposed that can be separately tuned and optimized as an efficient feature extractor for each transformed space. These CNN-based automated feature extractors reduce the complexity of manual feature selection from numerous transformed spaces. Afterwards, these separately optimized networks operating on different transformed spaces are combined into a unified architecture utilizing either the proposed additional combined training stage or the multi-stage sequential training stages, where features extracted from different transformed spaces are sequentially weighted, optimized, and integrated for the final activity recognition. Hence, different portions of this unified architecture are trained and optimized in several training stages, which makes it possible to optimize with a smaller amount of available training data. Moreover, different types of realistic data augmentation techniques have been introduced to increase the variations of the available data. The proposed approach opens scope for optimizing diversified features extracted from different transformed spaces and makes the process more resilient to noise and other random perturbations by exploiting the advantages provided by numerous representations of the raw data. Results of extensive experiments on three publicly available datasets are presented, showing very satisfactory performance compared to other state-of-the-art approaches.

II Methodology

The proposed multi-stage training approach is represented in Fig. 1. In the first stage of training, individual feature extractors operating on different transformed spaces are trained in parallel with separate classifiers. In the literature, a variety of feature extractors for time-series data have been explored, ranging from PCA, ICA, and wavelet-based methods to modern CNN, DNN, LSTM, and numerous other deep learning methods [m28, n1, n5, n6]. To overcome the additional complexities arising mainly from the difficulty of feature selection and optimization across diversified transformed representations of time series data, we propose deep CNN architectures as feature extractors for the different transformed domains. As it is completely data-driven, a deep CNN architecture can be trained as an efficient feature extractor from any representation of data. For joint optimization of multiple transformed feature spaces, the learning of this first training stage is transferred into a unified structure utilizing either another combined training stage (Fig. 1(a)) or a number of sequential training stages (Fig. 1(b)).

After completing the first stage of training, all the separate classifiers of the individual networks are removed. As a result, when input data is fed to the network, the trained feature extractors provide the representational features from the different transformed domains that were fed into the separate classifier units in the first training stage. However, the feature quality can vary with the transformation of the raw data, which becomes visible when the performance of the separate feature extractors is evaluated in the first stage. Hence, in the second and final stage of training (Fig. 1(a)), these feature vectors are multiplied by separate weighting vectors to increase the selectivity of the system. Later, all these weighted feature vectors are concatenated and a common dense classifier unit is trained to provide the prediction from these concatenated features. Therefore, these weighting vectors, along with the combined dense classifier unit, are learned in this stage of training by utilizing the data again.
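The weighting-and-concatenation step of the combined stage can be illustrated with a minimal sketch in plain Python. The function name `fuse_features` and all numeric values are hypothetical; in the paper the weights are learned during training, and the dense classifier that consumes the fused vector is omitted here.

```python
def fuse_features(feature_vectors, weight_vectors):
    """Weight each feature vector elementwise, then concatenate the results."""
    if len(feature_vectors) != len(weight_vectors):
        raise ValueError("need one weight vector per feature vector")
    fused = []
    for features, weights in zip(feature_vectors, weight_vectors):
        if len(features) != len(weights):
            raise ValueError("weights must match the feature length")
        fused.extend(f * w for f, w in zip(features, weights))
    return fused


raw_features = [1.0, 2.0]        # e.g. from the identity-transform extractor
gaf_features = [3.0, 4.0, 5.0]   # e.g. from the GAF extractor
fused = fuse_features(
    [raw_features, gaf_features],
    [[0.5, 0.5], [1.0, 0.0, 2.0]],  # illustrative values; learned in practice
)
```

A zero weight, as on the second GAF feature here, suppresses that feature entirely, which is one way the weighting can increase selectivity.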

Figure 2: Proposed (a) 1D Convolutional Neural Network and (b) 2D Convolutional Neural Network. Tensor dimensions shown after each operation are optimized for UCI HAR Database [m19].

Figure 3: Proposed (a) 1D unit residual block and (b) 2D unit residual block. In (a), ‘l’ denotes length and ‘c’ denotes number of channels of the tensor. While in (b), ‘h’ and ‘w’ denote height and width of the 2D tensor, respectively. Additionally, ‘k’ stands for kernel size, ‘f’ for filter number, ‘s’ for strides, and ‘BN’ for batch normalization in the convolution.

In Fig. 1(b), the proposed multi-stage sequential training is shown. In the two-stage training described above, the final second-stage training learns the weighting vector for each feature map at the same time as the combined classifier. In the multi-stage sequential training, however, weighting vectors for only two feature vectors, extracted by the feature extractors trained in the previous stage, are learned along with the combined classifier at a time. In the following stage, the classifier is removed and the merged weighted feature vectors of these two transformations undergo a similar training stage together with one of the remaining feature vectors representing a different transformation. Thus, in each stage of sequential training, the weighted feature vector from an additional transformed space is accumulated with the combined feature extractors trained in the previous stage. This method of sequential training offers an additional opportunity to converge individual feature representations corresponding to variegated transformed spaces into the final decision-making process by optimizing two feature vectors at a time. Moreover, in deep learning-based approaches, these weighting vectors can be easily integrated by introducing a separate densely connected layer operating on each feature vector.

II-A Transformations on Time Series Data

Different types of transformations on time series data have been utilized in the proposed approach. These are briefly described below.

II-A1 Gramian Angular Field Transformation (GAF)

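Gramian angular field transformation maps the elements of a time series into a matrix representation. This encoding scheme preserves the temporal dependency of the original time series along the diagonal of the encoded matrix, while the off-diagonal entries essentially represent the correlation between samples [g1]. In this transformation, $X \rightarrow G$, the input time series, $X = \{x_1, x_2, \ldots, x_n\}$, is transformed into polar coordinates after normalization to $\tilde{x}_i \in [-1, 1]$.

$$\phi_i = \arccos(\tilde{x}_i), \quad -1 \le \tilde{x}_i \le 1 \tag{1}$$

$$r_i = \frac{t_i}{N} \tag{2}$$

Here, $t_i$ is the time stamp and $N$ is a constant factor to regularize the span of the polar coordinate system. These polar angles are utilized to get the final transformed matrix $G$, which is,

$$G = \begin{bmatrix} \cos(\phi_1 + \phi_1) & \cdots & \cos(\phi_1 + \phi_n) \\ \vdots & \ddots & \vdots \\ \cos(\phi_n + \phi_1) & \cdots & \cos(\phi_n + \phi_n) \end{bmatrix} \tag{3}$$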
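As a rough illustration of this encoding, the sketch below computes a Gramian angular field for a single channel in plain Python. It assumes the summation variant of the field, in which entry (i, j) is the cosine of the summed polar angles of samples i and j, and min-max rescaling into [-1, 1]; the function name is illustrative and not taken from the paper.

```python
import math


def gramian_angular_field(series):
    """Return the Gramian angular summation field of a 1D sequence."""
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("series must not be constant")
    # Min-max rescale into [-1, 1] so that arccos is defined.
    scaled = [(2.0 * x - hi - lo) / (hi - lo) for x in series]
    # Clamp to guard against floating-point overshoot before arccos.
    phi = [math.acos(max(-1.0, min(1.0, x))) for x in scaled]
    # Entry (i, j) encodes the summed polar angles of samples i and j.
    return [[math.cos(a + b) for b in phi] for a in phi]


gaf = gramian_angular_field([0.0, 1.0, 2.0])
```

For a window of n samples the result is an n x n symmetric matrix, so each sensor channel becomes an image-like input suitable for a 2D CNN.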
Figure 4: Schematic representation of the proposed multi-stage sequential training scheme. Here, (a) represents the individual training stage, (b) represents the combined training stage where all the pre-trained networks are merged using one additional training stage, and (c), (d), (e) represent the sequential training stages where pre-trained networks are converged sequentially towards the unified architecture. The tensor dimensions are optimized for UCI HAR dataset [m19].

II-A2 Recurrence Plotting

The recurrence plot portrays the inherent recurrent behavior of a time series, e.g., irregular cyclicity and periodicity, as a matrix [r1]. This method explores the m-dimensional phase space trajectory of the time series and generates a representation by searching for points of the trajectory that have returned to a previous state. It is represented by,

$$R_{i,j} = \theta\left(\epsilon - \lVert \vec{x}_i - \vec{x}_j \rVert\right), \quad i, j = 1, \ldots, K \tag{4}$$

where $K$ is the number of considered states $\vec{x}_i$, $\epsilon$ is a threshold distance, $\lVert \cdot \rVert$ is a norm, and $\theta(\cdot)$ is the Heaviside function.
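A minimal plain-Python sketch of a thresholded recurrence plot follows. The time-delay embedding, the Euclidean norm, the convention that a distance exactly equal to the threshold counts as a recurrence, and the parameter names `dim`, `delay`, and `eps` are all assumptions made for illustration.

```python
import math


def recurrence_plot(series, dim=2, delay=1, eps=0.5):
    """Binary recurrence matrix of a 1D sequence via time-delay embedding."""
    n_states = len(series) - (dim - 1) * delay
    if n_states <= 0:
        raise ValueError("series too short for the chosen embedding")
    # Each state is a point on the dim-dimensional phase-space trajectory.
    states = [[series[i + k * delay] for k in range(dim)] for i in range(n_states)]
    # Heaviside step of (eps - distance): 1 when two states are eps-close.
    return [[1 if math.dist(a, b) <= eps else 0 for b in states] for a in states]


rp = recurrence_plot([0.0, 1.0, 0.0, 1.0, 0.0], dim=2, delay=1, eps=0.1)
```

The alternating input revisits the same two states, which shows up as a checkerboard pattern in the resulting matrix.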

II-A3 Scattering Wavelet Transformation

Scattering wavelet transform offers representational features of time-series data that are rotation/translation-invariant while remaining stable to deformations. This technique provides the opportunity to extract features from a very small amount of data [w2]. A Morlet wavelet function, defined as the mother wavelet, is convolved with the raw time series data while being scaled and rotated, and thus creates different levels of representational features.

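Let the scattering coefficients of order $i$ ($i = 0, 1, 2, \ldots$) be obtained by cascading wavelet convolutions, each followed by the complex modulus, and a final averaging operation. These coefficients can be described as

$$S_0 x = x \star \varphi, \qquad S_i x = \Big| \cdots \big|\, |x \star \psi_1| \star \psi_2 \big| \cdots \star \psi_i \Big| \star \varphi \tag{5}$$

where $\varphi$ represents the Gaussian low pass filter and $\psi_i$ represents the Morlet wavelet function of order $i$. Therefore, a scattering representation, $Sx$, of time series data, $x$, is obtained by concatenating the scattering coefficients of different orders,

$$Sx = \{S_0 x, S_1 x, \ldots, S_m x\} \tag{6}$$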

As multi-channel sensor data collected from numerous sensors have been used in this work, each channel of such time-series data is transformed individually using any of these transformations, and all such transformed data are stacked together, maintaining the same time information in all the channels. Later, they undergo the feature extraction process utilizing deep neural networks.
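The per-channel handling described here can be sketched as follows. This is only an illustration: `transform_channels` is a hypothetical helper, and `outer_product` is a stand-in for any 2D encoding such as a Gramian angular field or a recurrence plot.

```python
def transform_channels(window, transform):
    """Apply `transform` to each channel of a (channels x time) window."""
    # Each channel is encoded on its own, so the time axis stays aligned
    # across the stacked outputs.
    return [transform(channel) for channel in window]


def identity(channel):
    # The identity transform keeps the raw samples unchanged.
    return list(channel)


def outer_product(channel):
    # Hypothetical stand-in for a 2D encoding such as GAF or a recurrence plot.
    return [[a * b for b in channel] for a in channel]


window = [[0.1, 0.2, 0.3], [1.0, 0.0, -1.0]]  # two sensor channels
stacked_1d = transform_channels(window, identity)
stacked_2d = transform_channels(window, outer_product)
```

With a 1D transform the stacked result keeps the shape channels x time, whereas a 2D encoding yields channels x time x time.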

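Data: training samples X; training labels
Result: weight matrices
/* Individual training begins */
for each transformation do
    Calculate the transformed representation of X;
    Randomize the weights of the feature extractor and its classifier;
    while the training error threshold is unsatisfied do
        Calculate the feature vector with the feature extractor;
        Find the prediction with the classifier;
        Find the loss against the training labels;
        Update the weights of the feature extractor and its classifier;
    end while
    Calculate the feature vector with the trained feature extractor;
end for
/* Combined training stage begins */
Randomize the weighting (dense) layers and the combined classifier;
while the training error threshold is unsatisfied do
    for each transformation do
        Set the weighted feature vector from the corresponding feature vector;
    end for
    Set feature mapping group, the concatenation of all weighted feature vectors;
    Find the prediction with the combined classifier;
    Find the loss against the training labels;
    Update the weighting layers and the combined classifier;
end while
Algorithm 1 Proposed Two-Stage Training Method
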
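Data: training samples X; training labels
Result: weight matrices
/* Individual training begins */
for each transformation do
    Calculate the transformed representation of X;
    Randomize the weights of the feature extractor and its classifier;
    while the training error threshold is unsatisfied do
        Calculate the feature vector with the feature extractor;
        Find the prediction with the classifier;
        Find the loss against the training labels;
        Update the weights of the feature extractor and its classifier;
    end while
end for
/* Sequential training begins */
Initialize the merged feature extractor with the first trained feature extractor;
for each remaining transformation do
    Set the next trained feature extractor as the one to be merged;
    Randomize the two weighting (dense) layers and the stage classifier;
    while the training error threshold is unsatisfied do
        Set the weighted feature vector of the merged feature extractor;
        Set the weighted feature vector of the newly added feature extractor;
        Set feature mapping group, the concatenation of the two weighted feature vectors;
        Find the prediction with the stage classifier;
        Find the loss against the training labels;
        Update weights of the two weighting layers and the stage classifier;
    end while
    Calculate the weighted feature vector of the merged feature extractor;
    Calculate the weighted feature vector of the newly added feature extractor;
    Set the merged feature extractor to the concatenation of both, discarding the stage classifier;
end for
Algorithm 2 Proposed Sequential Training Method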

II-B Proposed Deep Neural Network Architectures

For feature extraction and classification, two deep CNN architectures are proposed, as shown in Fig. 2(a) and Fig. 2(b), optimized to operate in the 1D and 2D domains, respectively. Both are very similar to each other, as their objective is to extract features for activity recognition, with some modifications for handling the different dimensions of data. In general, the proposed CNN architecture mainly consists of a CNN base part followed by a top classifier layer. The CNN base part involves a number of convolution and pooling operations, while the top classifier layer consists of a series of densely connected layers followed by the final activation layer to generate the activity prediction. The operations performed here are discussed below.

  1. The input 1D time-series data undergo an initial transformation operation as discussed above before starting the convolutional filtering in the deep network.

  2. Next, the tensor enters the convolutional base part where it passes through a series of unit residual block operations to extract deep features from a broad spectrum. Different representations of these unit residual blocks are shown in Fig. 3, with some variations in operations for handling 1D (Fig. 3(a)) and 2D (Fig. 3(b)) data. In these blocks, the input tensor passes through two different operations in parallel and the transformed tensors are added later to produce the final output tensor. Subsequently, a global average pooling operation is performed to extract the global features from each channel of the transformed tensor. This CNN base part extracts the effective temporal/spatial features required for the final decision through convolutional filtering and pooling operations.

  3. After that, the tensor propagates through the top classifier block, where a series of densely connected layers explores the features extracted by the CNN base part to obtain a higher level of representation, with the softmax activation layer at the end merging these representations into a specified class of action.
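The two-path structure of a unit residual block followed by global average pooling can be sketched for a single-channel 1D signal as below. This is a deliberately simplified illustration in plain Python: it assumes odd-length kernels, stride 1, and fixed illustrative kernel values, and it omits the separable convolutions, batch normalization, and multiple channels of the proposed blocks.

```python
def conv1d_same(x, kernel):
    """Single-channel 1D cross-correlation with zero padding and stride 1."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(w * padded[i + j] for j, w in enumerate(kernel))
        for i in range(len(x))
    ]


def relu(x):
    return [max(0.0, v) for v in x]


def residual_unit(x, main_kernel, shortcut_kernel):
    # Two parallel paths over the same input, summed elementwise.
    main = relu(conv1d_same(x, main_kernel))
    shortcut = conv1d_same(x, shortcut_kernel)
    return [m + s for m, s in zip(main, shortcut)]


def global_average_pool(x):
    # Collapse the time axis into one value per channel.
    return sum(x) / len(x)


signal = [1.0, 2.0, 3.0, 4.0]
out = residual_unit(signal, main_kernel=[0.0, 1.0, 0.0], shortcut_kernel=[1.0])
pooled = global_average_pool(out)
```

With these pass-through kernels both paths return the input unchanged, so the block output is simply twice the signal, which makes the additive merge easy to verify by hand.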

The values of different convolutional kernel sizes, number of convolutional layers in each unit block, and number of unit residual blocks are established through experimentation to reach the maximum performance. Shallower networks are prone to underfit with the training data while deeper networks are prone to overfit. However, the proposed network effectively utilizes efficient separable convolutions along with residual operations to reduce vanishing gradient and overfitting issues for achieving optimum performance.

Figure 5: Effect of various types of augmentation on the sample data. (a) Raw sample data collected from a 3-axis accelerometer, with (b) scaling, (c) jittering, (d) permutation, (e) magnitude warping, and (f) time warping applied on raw data.

II-C Proposed Multi-Stage Training Scheme

In the proposed training method, a number of training stages are utilized to combine features from different transformed spaces. This scheme is represented schematically in Fig. 4. The individually trained feature extractors can be optimized jointly either in two stages or in a number of sequential stages. Algorithms 1 and 2 implement the two-stage training scheme and the multi-stage sequential training scheme, respectively. The operations performed in the different stages are described below.

  1. Individual training stage: This stage is common to both the two-stage and multi-stage training schemes. In this stage, separate CNN base parts with associated dense classifiers are trained individually to prepare each CNN base part as an efficient feature extractor for its respective transformed domain, as shown in Fig. 4(a). Here, the identity transform is also used to incorporate features from unaltered raw data along with the other transformations. However, some of these transformations contain more distinctive features related to the final activity recognition than others, which leads to variations in performance after training.

  2. Combined training stage: After the first training stage, each CNN base part provides an effective feature vector from its respective transformed domain. For the proposed two-stage training scheme, an additional combined training stage is employed to combine all these individually trained feature extractors, as shown in Fig. 4(b). Though these architectures are similar in structure, their extracted features contain diverse characteristics because they are trained with different representations of the transformed time series data. In this stage, all individual top dense classifier blocks trained in the first stage are removed, while all CNN base parts are used unaltered as they are already finely tuned as efficient feature extractors. Next, a separate densely connected layer is introduced on top of each CNN base part to reduce the extracted spatial/temporal features into a more general representation. These separate densely connected layers act as the weighting vectors for feature selection from different transformed domains, as introduced in Fig. 1. Here, the number of nodes in these densely connected layers is varied to incorporate more features from the feature extractors that contain more information for the final classification. The information content of the features extracted by the individual CNN base parts can be analyzed by observing their performance in the first individual training stage. Following that, the output feature vectors from these densely connected layers are concatenated and passed through a combined dense classifier block. This block, consisting of a series of densely connected layers, explores all the extracted features from the different transformed domains as a whole and merges them into the final prediction with the softmax classifier at the end. In this stage, all the newly introduced densely connected layers are optimized through further training with data while the CNN base parts are kept unaltered as efficient feature extractors, as shown in Fig. 4(b).

  3. Sequential training stages: In the proposed multi-stage sequential training scheme, individually trained feature extractors are optimized and converged into a unified architecture through a series of sequential training stages, as shown in Fig. 4(c), 4(d), and 4(e). In this approach, two of the CNN base units operating on different transformed spaces are optimized together at a time by training an individual densely connected layer for each of the base units, followed by feature concatenation and a combined dense classifier unit, as shown in Fig. 4(c). Later, these two combined feature extractors are considered as a single unit and further merged with the next CNN base part. Similarly, in each following stage, another pair of separate densely connected layers with a combined dense classifier unit is trained, as shown in Fig. 4(d) and 4(e). Therefore, through each training stage, a new CNN base part corresponding to another transformation is combined with the merged feature extractor. Moreover, each such stage merges these base feature extractor units by introducing newly trained densely connected layers that provide the most optimized features as a whole, utilizing all the existing features. As this approach optimizes two architectures at a time and passes the merged architecture to the next training stage with the separate densely connected layers while discarding the classifier unit, it provides more opportunity to empirically select the number of nodes of the separate dense layers used for feature selection and concatenation. Moreover, the number of parameters to be trained in a single stage is lower than in the combined training stage described above, which provides more opportunity to extract general features combining all these feature extractors at the expense of an increased number of training stages. Additionally, this sequential training approach is highly scalable and can incorporate a large number of feature spaces. Hence, features from an additional space can be easily integrated into the feature extraction process by utilizing additional training stages with separate feature extractors.
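The order in which the sequential stages fold extractors together can be sketched as a simple plan. The function name and the particular ordering of transformations are assumptions for illustration; only the pairing logic (merged unit plus one newcomer per stage) follows the description above.

```python
def sequential_stage_plan(extractors):
    """List (merged_so_far, newcomer) pairs, one per sequential training stage."""
    if len(extractors) < 2:
        raise ValueError("need at least two feature extractors to merge")
    merged = [extractors[0]]
    plan = []
    for newcomer in extractors[1:]:
        # Each stage trains fresh dense layers over the merged unit and one
        # additional extractor; the stage's classifier is discarded afterwards.
        plan.append((tuple(merged), newcomer))
        merged.append(newcomer)
    return plan


plan = sequential_stage_plan(["identity", "GAF", "recurrence", "scattering"])
```

With four individually trained extractors the plan contains three stages, and adding a further feature space only appends one more stage without retraining the earlier ones.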

II-D Data Augmentation

As imbalance in the dataset complicates learning the distribution of the minority classes, data augmentation is a viable approach to mitigate such problems. In this work, we utilize a combination of five techniques that incorporate realistic variations in the data and make the process more robust [aug]. All such augmentations are applied to the training data only, leaving the testing data unaltered for proper evaluation of the proposed methods. Jittering simulates the randomness of additive thermal noise and environmental perturbations in the acquired sensor data, while scaling simulates the effect of changing the sensor’s dynamic range. In the permutation operation, the input time window is divided into several segments and these segments are randomly permuted to create a modified window, making the trained model robust against changes in the sequence of steps within a particular activity. In magnitude warping, a smoothly varying random noise is multiplied with the original time series signal to simulate random multiplicative noise that can be present in real scenarios, while in time warping, the sampling interval is smoothly varied to introduce variations in the time window. In Fig. 5, the individual effects of these augmentations are shown on raw sample data. To increase the diversity of the augmentation process, all five augmentation techniques are applied sequentially to generate each augmented sample. Hence, each sample carries the effects of all five techniques, which provides more realistic random variations in the augmented samples. As there is an imbalance in the number of samples per class in all three databases used in this study, the proposed augmentation process is applied at a higher rate to the minority classes, so that a larger number of synthetic samples is generated for the classes with fewer samples and the training samples per activity class are balanced.
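A reduced sketch of the chained augmentation and the class-balancing quota is given below in plain Python. It covers only scaling, jittering, and permutation (magnitude warping and time warping are omitted for brevity); the noise levels, the order of operations, the segment count, and the class counts are all illustrative assumptions rather than values from the paper.

```python
import random


def jitter(x, rng, sigma=0.05):
    # Additive Gaussian noise on every sample.
    return [v + rng.gauss(0.0, sigma) for v in x]


def scale(x, rng, sigma=0.1):
    # One random gain applied to the whole window.
    gain = rng.gauss(1.0, sigma)
    return [v * gain for v in x]


def permute(x, rng, n_segments=4):
    # Split the window into segments and shuffle their order.
    size = max(1, len(x) // n_segments)
    segments = [x[i:i + size] for i in range(0, len(x), size)]
    rng.shuffle(segments)
    return [v for seg in segments for v in seg]


def augment(x, rng):
    # Chain the operations so every synthetic window carries all effects.
    for op in (scale, jitter, permute):
        x = op(x, rng)
    return x


def synthetic_quota(class_counts):
    # Extra windows each class needs to match the largest class.
    target = max(class_counts.values())
    return {label: target - n for label, n in class_counts.items()}


rng = random.Random(0)
window = [float(i) for i in range(16)]
augmented = augment(window, rng)
quota = synthetic_quota({"walk": 100, "sit": 60, "lay": 85})
```

The quota makes the balancing explicit: the smallest class receives the most synthetic windows, and only training windows would be passed through `augment`.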

III Results and Discussions

Three publicly available datasets used for this study are described below. Detailed comparative analysis of the results obtained is discussed later.

III-A Description of the Database

UCI HAR database [m19] contains 6 activities collected from 30 subjects at a 50 Hz sampling rate using the 3-axis accelerometer, gyroscope, and magnetometer embedded in a smartphone placed on the waist. USC HAR database [m20] contains 12 activities collected from 14 subjects at a 100 Hz sampling rate using a 3-axis accelerometer and gyroscope. SKODA database [m23] contains 10 activities collected from a single subject in a car maintenance scenario using only a 3-axis accelerometer sampled at 98 Hz.

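| Actual \ Predicted | Walk | Upstairs | Downstairs | Sit | Stand | Lay |
| --- | --- | --- | --- | --- | --- | --- |
| Walk | 466 | 20 | 5 | 0 | 0 | 0 |
| Upstairs | 7 | 525 | 0 | 0 | 0 | 0 |
| Downstairs | 0 | 0 | 537 | 0 | 0 | 0 |
| Sit | 0 | 0 | 0 | 469 | 2 | 0 |
| Stand | 0 | 0 | 0 | 6 | 414 | 0 |
| Lay | 0 | 0 | 0 | 0 | 0 | 495 |

Table I: Confusion Matrix on UCI HAR Dataset [m19] for Proposed Two-stage Training on a Test Fold
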
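| Actual \ Predicted | Walk | Upstairs | Downstairs | Sit | Stand | Lay |
| --- | --- | --- | --- | --- | --- | --- |
| Walk | 478 | 9 | 4 | 0 | 0 | 0 |
| Upstairs | 4 | 528 | 0 | 0 | 0 | 0 |
| Downstairs | 0 | 0 | 537 | 0 | 0 | 0 |
| Sit | 0 | 0 | 0 | 470 | 1 | 0 |
| Stand | 0 | 0 | 0 | 3 | 417 | 0 |
| Lay | 0 | 0 | 0 | 0 | 0 | 495 |

Table II: Confusion Matrix on UCI HAR Dataset [m19] for Proposed Multi-stage Training on a Test Fold
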
| Metric | Proposed method | Walk | Upstairs | Downstairs | Sit | Stand | Lay |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Precision (%) | Two-stage | 98.53 | 96.34 | 99.14 | 98.75 | 99.58 | 100 |
| Precision (%) | Multi-stage | 99.27 | 98.36 | 99.32 | 99.46 | 99.81 | 100 |
| Recall (%) | Two-stage | 94.93 | 98.72 | 100 | 99.61 | 98.64 | 100 |
| Recall (%) | Multi-stage | 97.44 | 99.26 | 100 | 99.83 | 99.35 | 100 |
| IoU score (%) | Two-stage | 96.31 | 97.41 | 99.49 | 99.07 | 99.02 | 100 |
| IoU score (%) | Multi-stage | 98.24 | 98.68 | 99.52 | 99.59 | 99.47 | 100 |

Table III: Average Cross-Validation Performance Analysis on Various Activities of UCI HAR Dataset [m19] for Proposed Two-Stage and Multi-Stage Training

Figure 6: Comparison of Average Cross-Validation IoU scores on various activities of UCI HAR Database [m19] obtained using different transformation schemes along with the proposed combined schemes of two-stage and multi-stage training.

III-B Experimentation

A five-fold cross-validation scheme is carried out for evaluation of the proposed scheme on each database separately. The values of the evaluation metrics obtained from each test fold are averaged to get the final values. All the augmentation techniques were applied to training data only. The Adam optimizer was employed for optimization with categorical cross-entropy as the loss function. The Keras deep learning library was used with the Python programming language for the implementation of the proposed neural networks. The Wilcoxon rank-sum test is used for statistical analysis of the average accuracy improvement obtained from the proposed scheme against a preset statistical significance level. The null hypothesis is that no significant improvement of average accuracy is achieved using the proposed scheme over the other existing best performing approaches.
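The five-fold protocol can be illustrated with a small index-splitting sketch in plain Python. Whether the folds in the paper were shuffled, stratified, or grouped by subject is not stated, so the shuffled round-robin split below, together with the function name and seed, is an assumption.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Distribute samples round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(23, k=5))
```

Each sample appears in exactly one test fold, so averaging the per-fold metrics covers the whole dataset once; augmentation would be applied to `train_idx` windows only.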

III-C Performance Evaluation

The performance of the optimized networks is evaluated using the test data of the various datasets. Traditional evaluation metrics for the multi-class classification task, i.e., accuracy, precision, recall, and intersection-over-union (IoU) score, are employed for analyzing the performance. In Tab. I and II, confusion matrices are provided for the proposed two-stage and multi-stage training approaches on the UCI HAR database [m19] for a specific test fold. Moreover, in Tab. III, the averaged cross-validation scores of the evaluation metrics are provided for both training approaches. It is clear that both approaches perform strongly in most of these classes, which are separated almost perfectly. However, the two-stage method struggles slightly to separate the walking and ascending-stairs activities, as these activities are closely related in the feature space. In the case of multi-stage training, this problem is reduced, which signifies the robust optimization capability of this method, as it can separate features that lie in close proximity.
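The per-class metrics can be derived from a confusion matrix as in the following sketch, which uses the single-fold two-stage matrix from Table I and assumes rows are actual classes and columns are predicted classes. Because Table III reports averages over five folds, the values from one fold are close to, but not identical with, the tabulated ones.

```python
def per_class_metrics(confusion):
    """Precision, recall, and IoU per class; rows are actual, columns predicted."""
    n = len(confusion)
    metrics = []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                       # actual k, predicted other
        fp = sum(confusion[r][k] for r in range(n)) - tp  # predicted k, actual other
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
        metrics.append((precision, recall, iou))
    return metrics


# Two-stage confusion matrix from Table I (single test fold).
table_1 = [
    [466, 20, 5, 0, 0, 0],
    [7, 525, 0, 0, 0, 0],
    [0, 0, 537, 0, 0, 0],
    [0, 0, 0, 469, 2, 0],
    [0, 0, 0, 6, 414, 0],
    [0, 0, 0, 0, 0, 495],
]
metrics = per_class_metrics(table_1)
accuracy = sum(table_1[k][k] for k in range(6)) / sum(map(sum, table_1))
```

For the walking class this fold gives a recall of 466/491, noticeably below its precision of 466/473, which reflects the confusion with ascending stairs discussed above.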

In Fig. 6, the average cross-validation IoU scores of the optimized networks on different transformed spaces, along with those of the final converged networks using both two-stage and multi-stage training, are compared for all the activities. It is visible that the identity transform, representing the unaltered raw data, provides better performance with more than improvement in most classes compared to other transformed spaces in case of individual training. However, irrespective of their performance, all the networks operating on separate transformed spaces extract significantly different features, as they work with diversified representations of the data. Through optimization of these features, as visible in Fig. 6, the proposed two-stage and multi-stage training approaches provide a sharp increase in IoU scores in all the activity classes compared to the individual training stage. Lower-performing transformed spaces are de-emphasized through a smaller number of densely connected nodes and through smaller weights generated in the later training stages while merging, as shown in Fig. 4. For example, in the two-stage training configuration (Fig. 4(b)) for the UCI database, before concatenation of the features extracted from multiple transformed spaces, 96 densely connected nodes follow the identity transformed feature space while 32 nodes follow the GAF transformed space, as the identity transformed features provided a higher average IoU score than the GAF transformed space in the individual training stage. Hence, a larger number of nodes can be assigned to emphasize the individually better-performing feature space. Despite that, all of the transformed spaces contribute some new and valuable information that may be indistinguishable even in another space that provides significantly better performance. Moreover, the later training stages are mainly dedicated to extracting the most distinguishable features while de-emphasizing the redundant ones, which yields this higher IoU score.

Moreover, in multi-stage sequential training, two feature spaces are optimized at a time by integrating an additional feature space into the resultant feature space (Fig. 4(c)-4(e)). It should be noticed that a larger number of nodes is provided in the densely connected layer following the feature space of the respective transformation to emphasize the features from those spaces that provided higher performance during the individual training stage. For example, in Fig. 4(c), 144 nodes are provided for the identity transformed feature space and 112 nodes for the scattering wavelet transformed space, as features from the identity transformed space performed better in the individual training stage. This adjustment of the number of nodes is done iteratively to reach maximum performance. If the sequence of optimization is altered, the number of nodes applied to the extracted features of the different transformed spaces has to be adjusted. It is expected that, in any combination, the achieved performance will be similar if the number of nodes for the different transformed spaces is properly adjusted, which ensures proper exploration of the generated feature spaces. However, in the sequential integration process, we have integrated individually better-performing feature spaces in the early stages, with a larger number of nodes before feature concatenation.
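The ordering rule described above (better-performing feature spaces are integrated first, one additional space per stage) can be sketched as a small planning helper. This is an illustrative sketch only: the stage-1 IoU values are hypothetical, the name `space_4` is a placeholder for the fourth transformed space, and the helper does not choose node counts, which the text says are tuned iteratively.

```python
def sequential_merge_plan(individual_iou):
    """Rank feature spaces by stage-1 IoU and list what each later stage merges.

    Returns one entry per merging stage: the already merged (frozen) spaces
    and the single space that is newly integrated in that stage.
    """
    ranked = sorted(individual_iou, key=individual_iou.get, reverse=True)
    merged = [ranked[0]]
    plan = []
    for stage, space in enumerate(ranked[1:], start=2):
        plan.append({"stage": stage, "already_merged": list(merged), "added": space})
        merged.append(space)
    return plan


# Hypothetical stage-1 IoU scores per transformed space (assumed values).
stage1_iou = {
    "identity": 0.95,
    "scattering_wavelet": 0.93,
    "gaf": 0.90,
    "space_4": 0.88,  # placeholder name for the fourth transformed space
}

plan = sequential_merge_plan(stage1_iou)
for step in plan:
    print(step)
```

With four feature spaces this yields three merging stages (stage-2 to stage-4), which matches the number of stages listed for multi-stage training in Tab. IV.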

| Training Stage | Two-Stage Training: Trainable | Two-Stage Training: Non-trainable | Multi-Stage Training: Trainable | Multi-Stage Training: Non-trainable |
|---|---|---|---|---|
| Stage-1 | 1.2M (1D), 2.8M (2D) | 0 | 1.2M (1D), 2.8M (2D) | 0 |
| Stage-2 | 82K | 8.2M | 82K | 2.5M |
| Stage-3 | - | - | 82K | 5.4M |
| Stage-4 | - | - | 82K | 8.3M |

Table IV: Number of Trainable and Non-trainable Parameters on Various Training Stages for Proposed Training Schemes in UCI HAR Database [m19]
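| Actual \ Predicted | Walking Forward | Walk. Left | Walk. Right | Walk. Upst. | Walk. Downst. | Running | Jumping | Sitting | Standing | Sleeping | In Elevator |
|---|---|---|---|---|---|---|---|---|---|---|---|
| W. Forward | 1584 | 4 | 8 | 7 | 1 | 1 | 0 | 0 | 2 | 0 | 0 |
| W. Left | 8 | 1066 | 5 | 2 | 3 | 1 | 0 | 0 | 0 | 0 | 0 |
| W. Right | 5 | 3 | 1101 | 1 | 2 | 0 | 1 | 0 | 1 | 0 | 0 |
| W. Upstairs | 1 | 2 | 5 | 893 | 2 | 1 | 2 | 0 | 1 | 0 | 0 |
| W. Down. | 2 | 3 | 4 | 1 | 846 | 2 | 4 | 0 | 0 | 0 | 0 |
| Running | 2 | 3 | 3 | 0 | 3 | 711 | 2 | 0 | 2 | 0 | 0 |
| Jumping | 2 | 1 | 0 | 2 | 3 | 0 | 416 | 1 | 2 | 0 | 1 |
| Sitting | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1015 | 6 | 0 | 2 |
| Standing | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 8 | 966 | 0 | 12 |
| Sleeping | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 5 | 0 | 1580 | 0 |
| In Elevator | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 14 | 1 | 655 |

Table V: Confusion Matrix on USC HAR Dataset [m20] for Proposed Two-stage Training on a Test Fold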
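| Actual \ Predicted | Walking Forward | Walk. Left | Walk. Right | Walk. Upst. | Walk. Downst. | Running | Jumping | Sitting | Standing | Sleeping | In Elevator |
|---|---|---|---|---|---|---|---|---|---|---|---|
| W. Forward | 1590 | 3 | 7 | 5 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| W. Left | 4 | 1075 | 2 | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 0 |
| W. Right | 4 | 2 | 1106 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| W. Upstairs | 1 | 2 | 3 | 897 | 1 | 2 | 1 | 0 | 0 | 0 | 0 |
| W. Down. | 3 | 2 | 3 | 1 | 847 | 3 | 2 | 0 | 0 | 0 | 1 |
| Running | 3 | 2 | 1 | 1 | 2 | 713 | 3 | 0 | 1 | 0 | 0 |
| Jumping | 1 | 2 | 0 | 2 | 2 | 0 | 420 | 0 | 0 | 0 | 1 |
| Sitting | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1019 | 4 | 0 | 0 |
| Standing | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 5 | 972 | 0 | 8 |
| Sleeping | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 4 | 0 | 1582 | 0 |
| In Elevator | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 10 | 1 | 657 |

Table VI: Confusion Matrix on USC HAR Dataset [m20] for Proposed Multi-stage Training on a Test Fold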
| Class | Two-Stage Prec. (%) | Two-Stage Rec. (%) | Two-Stage IoU (%) | Multi-Stage Prec. (%) | Multi-Stage Rec. (%) | Multi-Stage IoU (%) |
|---|---|---|---|---|---|---|
| Walking Forward | 99.2 | 98.3 | 98.5 | 99.7 | 99.4 | 99.4 |
| Walking Left | 99.1 | 98.5 | 98.7 | 99.6 | 99.3 | 99.3 |
| Walking Right | 99.2 | 99.4 | 99.1 | 99.5 | 99.6 | 99.5 |
| Walking Upstairs | 99.3 | 98.6 | 98.8 | 99.5 | 99.1 | 99.2 |
| Walking Down | 98.2 | 98.4 | 98.1 | 99.1 | 98.8 | 98.7 |
| Running | 99.0 | 98.2 | 98.4 | 99.3 | 98.7 | 98.9 |
| Jumping | 97.2 | 97.4 | 97.1 | 97.9 | 98.6 | 98.1 |
| Sitting | 99.1 | 99.2 | 99.0 | 99.4 | 99.5 | 99.2 |
| Standing | 97.5 | 98.1 | 97.8 | 98.5 | 98.8 | 98.4 |
| Sleeping | 100 | 99.5 | 99.6 | 100 | 99.7 | 99.7 |
| In Elevator | 98.1 | 98.3 | 98.1 | 98.4 | 98.6 | 98.4 |

Table VII: Average Cross-Validation Performance Analysis on Various Activities of USC HAR Dataset [m20] for Two-Stage and Multi-Stage Training
| Class | Two-Stage Prec. (%) | Two-Stage Rec. (%) | Two-Stage IoU (%) | Multi-Stage Prec. (%) | Multi-Stage Rec. (%) | Multi-Stage IoU (%) |
|---|---|---|---|---|---|---|
| Null | 95.3 | 96.4 | 95.8 | 96.2 | 96.5 | 96.1 |
| Write on notepad | 98.5 | 97.1 | 97.6 | 99.1 | 98.6 | 98.6 |
| Open hood | 95.2 | 94.5 | 94.7 | 97.3 | 95.4 | 96.3 |
| Close hood | 95.7 | 96.1 | 95.7 | 96.5 | 96.9 | 96.5 |
| Check gaps on front door | 96.3 | 97.5 | 96.5 | 97.8 | 99.1 | 98.3 |
| Open left front door | 96.6 | 95.8 | 96.1 | 97.5 | 95.7 | 96.4 |
| Close left front door | 96.7 | 95.6 | 95.9 | 97.2 | 95.9 | 96.3 |
| Check trunk gaps | 98.1 | 98.4 | 98.1 | 99.2 | 98.7 | 98.8 |
| Open and close trunks | 97.2 | 98.1 | 97.4 | 97.7 | 99.2 | 98.3 |
| Check steering wheel | 97.4 | 98.5 | 97.8 | 97.9 | 99.4 | 98.4 |

Table VIII: Average Cross-Validation Performance Analysis on Various Activities of SKODA Dataset [m23] for Two-Stage and Multi-Stage Training
Figure 7: Average Cross-Validation IoU score in various training stages of multi-stage sequential training on different databases.

In Tab. IV, the numbers of trainable and non-trainable parameters in different training stages are presented for the UCI database. As stage-1 is the same for both methods and trains the individual networks, all of its parameters are set as trainable to train the different feature extractors. The 2D representation of the data requires a larger network than the 1D representation to handle the increased size of the transformed data. In later training stages, most of the parameters are set as non-trainable to reuse the already trained deep feature extractors, while a small number of trainable parameters are introduced to merge the different networks. Though the total number of parameters increases in higher training stages, the number of trainable parameters in a single training stage is much smaller, which makes the network resilient against overfitting to the training data. In multi-stage training, different feature extractors are merged in multiple stages and thus the non-trainable parameters increase in steps, while in two-stage training, all four feature extractors are combined in one stage, resulting in a larger number of non-trainable parameters in stage-2.
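The stepwise growth of non-trainable parameters can be reproduced approximately with a short accounting sketch. It rests on assumptions that are not stated explicitly in this section: two 1D extractors of about 1.2M parameters and two 2D extractors of about 2.8M parameters, an 82K merge head per stage, and merge heads from earlier stages being frozen in later ones. The extractor names are hypothetical. Because the inputs are the rounded figures from Tab. IV, the totals land close to, but not exactly on, the tabulated values.

```python
# Rounded extractor sizes from Tab. IV; the names are hypothetical labels.
EXTRACTOR_PARAMS = {
    "identity_1d": 1_200_000,
    "scattering_1d": 1_200_000,
    "image_a_2d": 2_800_000,
    "image_b_2d": 2_800_000,
}
MERGE_HEAD_PARAMS = 82_000  # trainable parameters added by each merging stage


def multi_stage_counts(order):
    """Return (stage, trainable, non_trainable) for each sequential merging stage."""
    frozen = EXTRACTOR_PARAMS[order[0]]
    rows = []
    for stage, name in enumerate(order[1:], start=2):
        frozen += EXTRACTOR_PARAMS[name]   # newly attached extractor stays frozen
        rows.append((stage, MERGE_HEAD_PARAMS, frozen))
        frozen += MERGE_HEAD_PARAMS        # assumed: this head is frozen afterwards
    return rows


def two_stage_counts():
    """All extractors are merged at once in stage-2."""
    return [(2, MERGE_HEAD_PARAMS, sum(EXTRACTOR_PARAMS.values()))]


order = ["identity_1d", "scattering_1d", "image_a_2d", "image_b_2d"]
multi_rows = multi_stage_counts(order)
two_rows = two_stage_counts()
for stage, trainable, frozen in multi_rows:
    print(f"multi-stage, stage-{stage}: trainable={trainable:,} non-trainable={frozen:,}")
print(f"two-stage, stage-2: trainable={two_rows[0][1]:,} non-trainable={two_rows[0][2]:,}")
```

In every merging stage only the 82K head is trainable, which is the property the text credits for the reduced overfitting risk.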

| Database | Two-Stage Accuracy (%) | Two-Stage Average Precision (%) | Two-Stage Average Recall (%) | Two-Stage Average IoU Score (%) | Multi-Stage Accuracy (%) | Multi-Stage Average Precision (%) | Multi-Stage Average Recall (%) | Multi-Stage Average IoU Score (%) |
|---|---|---|---|---|---|---|---|---|
| UCI [m19] | 98.63 ± 0.29 | 98.75 ± 0.27 | 98.65 ± 0.22 | 98.55 ± 0.23 | 99.29 ± 0.13 | 99.37 ± 0.08 | 99.31 ± 0.12 | 99.25 ± 0.11 |
| USC [m20] | 98.57 ± 0.32 | 98.71 ± 0.21 | 98.53 ± 0.28 | 98.47 ± 0.27 | 99.02 ± 0.17 | 99.17 ± 0.13 | 99.14 ± 0.11 | 98.98 ± 0.14 |
| SKODA [m23] | 96.51 ± 0.38 | 96.7 ± 0.26 | 96.56 ± 0.21 | 96.72 ± 0.29 | 97.21 ± 0.21 | 97.64 ± 0.18 | 97.54 ± 0.23 | 97.41 ± 0.14 |
| Average | 97.91 ± 0.33 | 98.05 ± 0.25 | 98.24 ± 0.27 | 97.86 ± 0.24 | 98.51 ± 0.17 | 98.73 ± 0.13 | 98.66 ± 0.15 | 98.55 ± 0.15 |

Table IX: Central Tendency (Mean) and Dispersion (Standard Deviation) Measures of the Average Evaluation Metrics Across Various Cross-Validation Folds on Different Datasets
| UCI HAR [m19]: Work | Method | Acc. (%) | P-value | USC HAR [m20]: Work | Method | Acc. (%) | P-value | SKODA [m23]: Work | Method | Acc. (%) | P-value |
|---|---|---|---|---|---|---|---|---|---|---|---|
| [r6] | MLP | 86.1 | NA | [n4] | MLP, J48 | 89.2 | NA | [m23] | HMM | 86 | NA |
| [rr1] | CNN | 94.2 | NA | [m31] | Random Forest | 90.7 | NA | [m6] | DBN | 89.4 | NA |
| [n3] | DTW | 95.3 | NA | [n5] | CNN | 93.2 | NA | [n6] | Deep Conv LSTM | 91.2 | NA |
| [m19] | SVM | 96 | NA | [m30] | LS-SVM | 95.6 | NA | [m11] | CNN | 91.7 | NA |
| [n1] | Deep RNN | 96.7 | NA | [m28] | CNN | 97 | NA | [n7] | Ensemble LSTM | 92.4 | NA |
| [n2] | SVM | 97.1 | NA | [n1] | Deep RNN | 97.8 | NA | [n1] | Deep RNN | 92.6 | NA |
| Prop. | 2-Stage CNN | 98.63 | 3.4e-5 | Prop. | 2-Stage CNN | 98.57 | 2.5e-6 | Prop. | 2-Stage CNN | 96.51 | 4.2e-4 |
| Prop. | M-Stage CNN | 99.29 | 5.1e-5 | Prop. | M-Stage CNN | 99.02 | 1.3e-6 | Prop. | M-Stage CNN | 97.21 | 2.8e-4 |

Table X: Comparison of the Proposed Schemes with Other Existing Approaches on Different Datasets

In Tab. V and VI, confusion matrices for the USC HAR dataset are provided for the two-stage and multi-stage training methods, respectively. Moreover, in Tab. VII, the average cross-validation performance of the proposed schemes on different activity classes of this dataset is provided. It is clear that both these approaches provide consistently high performance, with every per-class metric in Tab. VII at or above 97.1%. Moreover, multi-stage training provides a slight reduction in incorrect predictions for some closely related activities, such as among the various walking actions and between the standing and sitting actions (for example, sitting samples predicted as standing drop from 6 to 4, and standing samples predicted as sitting drop from 8 to 5). In Tab. VIII, the average cross-validation performance of both training approaches is presented on the SKODA dataset. Though most of the activities in this dataset are closely related, our proposed training methods provide consistent performance, with every per-class metric in Tab. VIII at or above 94.5%. However, some activities, like opening and closing the hood or opening and closing doors, are difficult to separate, as expected. Despite that, comparable performances have been achieved in these classes utilizing the proposed scheme.

In Fig. 7, the average IoU scores in different stages are shown for multi-stage training. It is clear that each stage provides some improvement in performance by incorporating new features. In the first two stages, the trained network achieves significant performance improvement, with more than improvement in the average IoU score, mostly achieved utilizing the features from the identity transformation and the scattering wavelet transformation with the 1D deep CNN feature extractor. Nevertheless, features from other transformations exploited at the later stages still provide a considerable contribution, with around improvement in total, making the final network better optimized to separate challenging classes and thus to attain a higher average IoU score. Hence, by integrating features from four transformed spaces in the proposed sequential training approach, improvement of average IoU score is achieved in total compared to operating with raw sensor data alone. In Tab. IX, the central tendency (mean) and dispersion (standard deviation) measures of the evaluation metrics in different databases are provided. It should be noticed that the standard deviations of performance over the various cross-validation folds are small in most cases, which signifies the generalizability of the proposed scheme. Moreover, the standard deviation is reduced in the multi-stage training approach relative to its two-stage counterpart (for example, from 0.33 to 0.17 for the average accuracy in Tab. IX). Additionally, the average performances over all three databases are also reported; they range from 97.86% to 98.73%, which signifies the robustness and consistency of the proposed scheme over numerous databases.
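The mean and standard deviation entries of Tab. IX summarize a metric over the five test folds. A minimal sketch of that summary is given below; the fold values are hypothetical, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption, since the section does not say which estimator was used.

```python
import statistics


def summarize_folds(values):
    """Return (mean, sample standard deviation) of a metric over the test folds."""
    return statistics.mean(values), statistics.stdev(values)


# Hypothetical accuracies (%) from five test folds, not values from the paper.
fold_accuracy = [99.1, 99.2, 99.3, 99.4, 99.5]

mean_acc, std_acc = summarize_folds(fold_accuracy)
print(f"{mean_acc:.2f} ± {std_acc:.2f}")
```

Replacing `statistics.stdev` with `statistics.pstdev` gives the population form, which yields slightly smaller values for five folds.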

Various existing approaches are compared with the proposed ones in Tab. X on different datasets. The average accuracies obtained from the proposed two-stage and multi-stage training methods are compared with the reported accuracies of a variety of state-of-the-art approaches. It can be noted that the proposed multi-stage scheme has improved average accuracy from 86.1% to 99.29% (13.19% improvement) in the UCI database, from 89.2% to 99.02% (9.82% improvement) in the USC database, and from 86% to 97.21% (11.21% improvement) in the SKODA database. The multi-stage approach improves accuracy over the two-stage approach by 0.45 to 0.70 percentage points per database (Tab. IX), owing to its increased opportunity for optimization through multiple stages. However, the training complexity also increases, as a larger number of training stages needs to be adjusted. As the P-values obtained from the statistical significance test on the different databases are considerably smaller than the predefined threshold of 0.01, we reject the null hypothesis, which suggests that a considerable improvement in average accuracy is achieved using the proposed schemes over other existing approaches. Moreover, the following observations can be drawn from the analysis:

  • In the UCI database, the shallow machine learning approaches ([n2, m19]) provide better performance than other traditional deep learning-based approaches ([m28, rr1]) due to the smaller amount of available training data. It should be noticed that the proposed scheme exploits the available data by incorporating a diverse representation of the feature space through multiple training stages that extract effective features without the overfitting issues that are predominant in other traditional deep learning approaches. Hence, despite the smaller amount of available data, the proposed multi-stage deep CNN-based approach outperforms traditional shallow classifiers in the UCI database.

  • Deep learning-based approaches mostly dominate in the USC database due to the higher number of training samples. Nevertheless, due to the larger number of activity classes, there is additional complexity in the feature extraction process. Traditional deep learning-based methods mostly struggle in challenging cases because they rely on complicated deep networks that operate solely on raw sensor data. In contrast, the proposed scheme employs deep feature extractors efficiently on numerous transformed spaces and splits the training process into several stages, which provides a better selection of features along with higher resilience to random noise and perturbations.

  • The SKODA database is more complicated due to the close inter-relation of many activities along with a large number of activity classes, which results in lower performance in most other approaches. However, the proposed scheme explores a number of representational feature spaces instead of a single space, which not only increases the diversity of features but also assists better separation of inter-related activities. Hence, the proposed scheme provides a sharp improvement of average accuracy in this database (97.21% in multi-stage training against 92.6% for the best existing approach in Tab. X).
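The accuracy gains quoted in this comparison can be checked directly from Tab. X. The sketch below computes, for each database, the gain of the multi-stage scheme over both the weakest and the strongest existing approach listed in the table; the figures quoted in the text (13.19, 9.82, and 11.21) correspond to the weakest listed baseline. Values are copied from Tab. X and differences are reported in percentage points.

```python
# Reported accuracies (%) of existing approaches, copied from Tab. X.
EXISTING = {
    "UCI": [86.1, 94.2, 95.3, 96.0, 96.7, 97.1],
    "USC": [89.2, 90.7, 93.2, 95.6, 97.0, 97.8],
    "SKODA": [86.0, 89.4, 91.2, 91.7, 92.4, 92.6],
}
# Average accuracies (%) of the proposed multi-stage scheme, from Tab. X.
MULTI_STAGE = {"UCI": 99.29, "USC": 99.02, "SKODA": 97.21}


def gains(dataset):
    """Return (gain over weakest baseline, gain over strongest baseline)."""
    accuracy = MULTI_STAGE[dataset]
    over_weakest = round(accuracy - min(EXISTING[dataset]), 2)
    over_strongest = round(accuracy - max(EXISTING[dataset]), 2)
    return over_weakest, over_strongest


summary = {name: gains(name) for name in EXISTING}
for name, (weakest, strongest) in summary.items():
    print(f"{name}: +{weakest} over weakest, +{strongest} over strongest baseline")
```

Reporting both figures makes clear how much of the headline gain comes from comparison against early baselines and how much remains against the strongest competing method.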

Though we have incorporated features from four transformed spaces (including the identity transform) in this work, it is to be noted that the proposed sequential training scheme is adaptive, and features from newly transformed spaces can be easily integrated with the resultant feature space by including additional training stages. However, to incorporate effective features from a new representational space, separate CNN-based feature extractors need to be added, which will increase the total size of the network accordingly. In the traditional training approach, if the whole network is trained in a single training stage, it will be very difficult to achieve convergence, and the network will be highly prone to overfitting to the training data, which limits the integration of numerous transformations. In contrast, the proposed training scheme separately optimizes individual deep feature extractors and integrates the extracted feature spaces in a sequential manner, which makes it possible to exploit a large number of feature spaces and provides a significant advantage over the traditional approaches. However, if a large number of transformed spaces are integrated into the feature extraction process, the increased size of the network may limit its application in mobile devices. Nevertheless, it is shown that very satisfactory performance is achieved by incorporating only a small number of transformed spaces. Hence, to reduce the complexity of the network for mobile applications, a smaller number of transformed spaces can be integrated into the feature extraction process, while for more robust performance, a larger number of transformed feature spaces can be utilized.

IV Conclusion

In this paper, various types of human activities are recognized utilizing the proposed multi-stage training method. Firstly, the raw data undergo numerous transformations to interpret the information encoded in the raw data in different spaces and thus to obtain a diversified representation of the features. Afterwards, separate deep CNN architectures are trained on each space, each becoming an optimized feature extractor for that particular space for the final prediction of activity. Later, these tuned feature extractors are merged into a final deep network through a combined training stage or through sequential stages of training, exploring the extracted feature spaces exhaustively to attain the most robust and accurate feature representation. It is found that if multiple trained CNNs dealing with numerous transformed spaces are utilized together, instead of a single trained CNN serving as a feature extractor from a single space, a much better representation of features can be obtained. Such an idea of multiple training stages, utilizing the initially trained CNN models from the preceding stages operating on different transformed spaces, can offer a significant increase in performance with improvement in average IoU scores. This method outperforms other state-of-the-art approaches in different datasets by a considerable margin, with an average accuracy of 98.51% (11.41% average improvement) over three databases. Therefore, the proposed scheme opens up a new approach of employing multiple training stages for deep CNNs deploying various transformed representations of data, which can also be utilized in highly diversified applications by increasing the diversity of the extracted features.

References