Space-Time Domain Tensor Neural Networks: An Application on Human Pose Recognition

04/17/2020 · by Konstantinos Makantasis, et al.

Recent advances in sensing technologies require the design and development of pattern recognition models capable of processing spatiotemporal data efficiently. In this work, we propose a spatially and temporally aware tensor-based neural network for human pose recognition using three-dimensional skeleton data. Our model employs three novel components. First, an input layer capable of constructing highly discriminative spatiotemporal features. Second, a tensor fusion operation that produces compact yet rich representations of the data. Third, a tensor-based neural network that processes data representations in their original tensor form. Our model is end-to-end trainable and characterized by a small number of trainable parameters, making it suitable for problems where the annotated data is limited. Experimental validation of the proposed model indicates that it can achieve state-of-the-art performance. Although in this study we consider the problem of human pose recognition, our methodology is general enough to be applied to any pattern recognition problem involving spatiotemporal data from sensor networks.


I Introduction

Advances in sensing technologies have enabled the development of time-evolving sensor networks where a single node can monitor a plethora of user-related (e.g., via body sensor networks) and environmental information [1]. Sensed information corresponds to multimodal data, in space and time, which is used to continuously observe the progress of a phenomenon [2]. Processing and correlating multiple, potentially heterogeneous, information streams to detect and recognize spatiotemporal patterns is becoming a fundamental yet non-trivial task. A typical and emerging example of spatiotemporal sensing is Kinect-II 3D skeleton information, which extracts, in the case of humans, 3D joint positions and their motion in space and time.

The presented work focuses on processing and fusing data coming from multiple information streams, as well as on discovering informative patterns for a given learning task at hand. Specifically, we introduce a novel tensor-based deep neural network learning machine able to automatically process and correlate spatiotemporal information from different sources and discover appropriate patterns for assigning inputs to desired outputs. This is a generic space-time learning machine, which can be useful for a variety of time series analysis applications, such as human behavior recognition, moving object analysis, radar signals, audio processing, etc. In this paper, we evaluate the proposed learning scheme on human pose recognition using 3D skeleton information coming from the Kinect-II sensor [3]. Initially, the input layer of the network processes 3D skeleton information to extract spatiotemporal patterns that collectively describe specific human postures. Then, the derived patterns are fused into a rich yet compact tensor object, which, in turn, is processed by the tensor-based layers of the network. Although the steps mentioned above seem independent, they actually take place concurrently within an end-to-end trainable tensor-based learning machine.

The input layer of the proposed learning machine is specifically designed to process spatiotemporal data for constructing compact and discriminative features for the learning task at hand. The design of the input layer is inspired by the Common Spatial Patterns (CSP) algorithm [4], and thus we refer to the constructed features as CSP-like features. The constructed features for each information stream are fused into a compact yet rich tensor representation, which in turn is processed by sequential tensor contraction layers. The number of tensor contraction layers determines the depth of the learning machine, thereby enabling the design of deep tensor-based learning architectures.

I-A Related Work

This paper deals with spatiotemporal information streams and their processing. Therefore, the related work section is divided into three subsections: works on correlating multiple information streams, works on pattern analysis for spatiotemporal data, and works on human pose recognition.

I-A1 Correlating Multiple Information Streams

Fusion techniques merge and correlate information from different data streams. These techniques can be classified into feature-level and score-level approaches. Score-level fusion methods select a hypothesis based on a set of hypotheses generated by processing each data stream separately [5, 6]. The final hypothesis is selected either by averaging the generated hypotheses or by stacking another learning machine on top. In the latter case, the input of the learning machine is the set of hypotheses generated from each stream and its output is the final hypothesis. Score-level fusion approaches do not correlate the information from different data streams; instead, they try to make their decision robust by operating similarly to ensemble methods.

Feature-level fusion approaches aggregate features or raw data from different data streams by element-wise averaging or addition (assuming that the dimensions of the features allow it) or by concatenation [7, 8]. However, simple averaging, products, or concatenation of features cannot capture complex interactions between different data streams. Therefore, capturing and modeling such interactions is left to the learning machine that follows the fusion operation.

Although learning machines are capable of disentangling complex relations in data [9], fusion techniques capable of highlighting such relations [10] are crucial for the successful outcome of the training process, especially in small sample setting problems that offer a limited number of training examples. The work presented in [11] tries to overcome the above limitation by proposing a rich tensor-based data fusion framework. Kronecker products are used to fuse various data streams into a unified tensor object, whose dimensionality is then reduced via Tucker decomposition.

In this study, we fuse 3D skeleton information into unified multilinear (tensor) objects following the approach in [11]. We do not, however, decompose the fused information to create conventional inputs (e.g., matrices or vectors) for the employed learning machine. Instead, we use a tensor-based learning machine capable of processing the fused information in its original multilinear form.

I-A2 Pattern Recognition for Spatiotemporal Data

Efficient pattern recognition algorithms for processing spatiotemporal information aim to discover and correlate patterns across both the spatial and the temporal domain of the data. The discovery of spatiotemporal patterns is related to the feature construction process, while the correlation of those patterns is related to the employed machine learning model. Those two processes can be conducted separately or fused into a unified machine learning framework. In the former case, features, which are compact representations of the spatiotemporal information of the data, are constructed first and then used as input to machine learning models. In the latter case, the feature construction process takes place during the training of the machine learning models, by using, for example, deep learning architectures.

The most common approach for compactly representing spatiotemporal data is by using statistical features such as mean, variance, energy and entropy [12, 13, 14, 15]. By treating spatiotemporal data as time series, frequency-domain features, such as Fast Fourier Transform [16] and Wavelet Transform [17] coefficients, can also be used to represent the data. More sophisticated approaches employ autoregressive models [18, 19] to construct features representing spatiotemporal data via a learning (model-fitting) process. The approaches mentioned above focus solely on feature construction. Therefore, there is no information flow between the feature construction and the pattern recognition tasks, even though these are sequential. That poses several problems, such as computational complexity, difficulty in transfer of learning and adaptation, and, in many cases, a high risk of over-fitting the pattern recognition model [20].

Deep learning models unify the feature construction and pattern recognition tasks. Those models, during the training process, learn high-level representations of raw inputs, automating this way the feature construction. Convolutional Neural Networks (CNNs) are state-of-the-art learning machines for processing spatial data. Besides spatial data, CNNs can also process spatiotemporal data. That can be done either directly, by using spatiotemporal convolutions [21, 22], or indirectly, by applying spatial convolutions on spatiotemporal data [23, 24], for example on videos where frames are concatenated along the temporal dimension. When the data are spatially coherent, i.e., neighbouring bits of information are highly correlated (e.g., pixels in images), CNNs can produce highly descriptive features. When, however, such coherency is absent (e.g., EEG data, where the responses of adjacent channels/electrodes are not necessarily related), CNNs are not able to produce high-quality features. Besides the requirement for spatially coherent data, another drawback of deep learning models is the number of their trainable parameters. Usually, those models employ a vast number of parameters (much larger than the number of available data), whose values are difficult to estimate, especially when small sample setting problems need to be addressed [25].

In this work, we propose a machine learning model that unifies the feature construction and pattern recognition tasks and, at the same time, overcomes the problems of CNNs. First, by exploiting tensor algebra tools, we significantly reduce the number of the model's trainable parameters, making it suitable for problems where the number of available data is limited. Second, the proposed model can capture spatial correlations even for data that are not spatially coherent, by employing a novel neural network layer capable of constructing CSP-like features. The design of this layer is inspired by the CSP algorithm, which does not require spatial coherency within the data.

I-A3 Human Pose Recognition

Human pose recognition is usually formulated as a computer vision problem, where human poses are described via the detection of body parts through pictorial structures [26, 27, 28]. In this study, however, instead of using visual information, we focus on human pose recognition using solely 3D skeleton measurements. 3D skeleton data are used in [29] for the development of a gesture classification system. The authors of [30] propose the Moving Pose system, which is based on a 3D kinematics descriptor. In [31], skeleton data are split into different body parts, which are then transformed to allow view-invariant pose recognition. 3D skeleton data from MS Kinect are used in [32] for recognizing individual persons based on their walking gait, while Rallis et al. [33] propose a key posture identification method based on Kinect-II measurements.

I-B Our Contribution

Based on the discussion so far, the main contributions of this study can be summarized into the following four points. First, we propose an end-to-end trainable architecture that unifies the feature construction and pattern recognition tasks. Second, in contrast to CNNs, the proposed machine learning model can construct highly descriptive features from data that are not spatially coherent. Third, we exploit tensor algebra tools to significantly reduce the number of the proposed model's trainable parameters, making it very robust for small sample setting problems. Last but not least, although this study focuses on the problem of human pose recognition, the proposed approach is a general one that can be applied to any problem that involves spatiotemporal data coming from sensor networks.

II Approach Overview

In this section, we formulate the problem that we are trying to address and present the main components of the proposed methodology. For the rest of the paper, we represent scalars, vectors, matrices and tensor objects of order larger than two with lowercase, bold lowercase, uppercase and bold uppercase letters respectively.

II-A Problem Formulation

This study focuses on the problem of human pose recognition using 3D skeleton data from Kinect-II. As we will see later, that problem is a specific instance of the more general problem of pattern recognition using information coming from sensor networks. Therefore, in this section, we describe the form of the latter, more general problem.

Consider a sensor network that contains $N$ sensors. Each one of the sensors, say the $i$-th sensor, retrieves $M$ measurements (information modalities) at each time instance $t$, which can be represented by the vector

$\boldsymbol{s}_i(t) = [\, s_{i,1}(t), \dots, s_{i,M}(t) \,]^\top \in \mathbb{R}^{M}$   (1)

for $i = 1, \dots, N$. Since each sensor occupies a specific spatial position, the spatial information for the $m$-th information modality captured by the sensor network can be represented by the following vector:

$\boldsymbol{x}_m(t) = [\, s_{1,m}(t), \dots, s_{N,m}(t) \,]^\top \in \mathbb{R}^{N}$   (2)

for $m = 1, \dots, M$, while the spatiotemporal information corresponding to a time window from $t$ to $t+T-1$ can be represented by the matrix

$X_m(t) = [\, \boldsymbol{x}_m(t), \dots, \boldsymbol{x}_m(t+T-1) \,] \in \mathbb{R}^{N \times T}.$   (3)

The information from all $X_m(t)$, $m = 1, \dots, M$, can be aggregated into a tensor object

$\boldsymbol{X}(t) = [\, X_1(t), \dots, X_M(t) \,]$   (4)

in $\mathbb{R}^{N \times T \times M}$. For the sake of clarity, in the following we omit the time index; thus, when we write $\boldsymbol{X}$ we refer to a tensor object of the form of (4) for some time instance $t$. Obviously, for a specific time window, the tensor object in (4) encodes the spatiotemporal information for all information modalities and all sensors in a sensor network.
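As a concrete illustration of (1)-(4), the following minimal NumPy sketch aggregates per-sensor measurement streams into the tensor object of (4); the array names and the dimension values are our own illustrative choices, not prescribed by the formulation.

```python
import numpy as np

N, M, T = 25, 3, 9          # sensors (joints), modalities, window length (illustrative)
n_steps = 300               # length of the recorded stream (illustrative)

# s[i, m, t]: measurement of modality m at sensor i and time t, as in (1)-(2)
s = np.random.randn(N, M, n_steps)

def spatiotemporal_tensor(s, t, T):
    """Tensor X(t) of (4): stacks the N x T matrices X_m(t) of (3) over modalities."""
    window = s[:, :, t:t + T]               # shape (N, M, T)
    return np.transpose(window, (0, 2, 1))  # shape (N, T, M), i.e. R^{N x T x M}

X = spatiotemporal_tensor(s, t=42, T=T)
print(X.shape)  # (25, 9, 3)
```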

Each tensor $\boldsymbol{X}$ describes a pattern that belongs to a specific class. Let us denote as $y$ the class of that pattern, and assume that we have at our disposal a set of $n$ pairs of the form:

$D = \{ (\boldsymbol{X}_j, y_j) \}_{j=1}^{n}.$   (5)

The objective of this study is to derive a function for mapping $\boldsymbol{X}$ to $y$ given the set $D$ in (5). This can be seen as a machine learning problem. Let us denote as $\mathcal{F}$ the class of functions that can be computed by a learning machine. We want to select the function

$f^{*} = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{j=1}^{n} \ell\big(f(\boldsymbol{X}_j), y_j\big)$   (6)

such that the average loss over $D$ is minimized. In (6), $\ell$ is a loss function. For classification problems, $\ell$ usually is the cross entropy loss.

Remark 1: In order to facilitate the solution of problem (6), the learning machine must contain a number of trainable parameters that is comparable to the cardinality of the set $D$, and at the same time it should be capable of fully exploiting the spatiotemporal nature of the data.

Remark 2: The problem of human pose recognition using 3D skeleton data from Kinect-II is a special instance of the problem described above. Each skeleton joint can be seen as a sensor which, at every time instance, measures its location. So, in this case $N$ equals the number of skeleton joints and $M$ in (1) equals 3 (the $x$, $y$ and $z$ positions).

II-B Proposed Methodology

In this study, we use 3D skeleton data captured using Kinect-II, along with their annotations, which correspond to the depicted human pose at every time instance. Initially, we process the skeleton data to create tensor objects as in (4) and then use their annotations to create a training set as in (5).

After creating the training set, we design an end-to-end trainable neural network which is able to fully exploit the spatiotemporal nature of the data and, at the same time, employs a small number of trainable parameters (compared to the size of the training set). The first layer of the proposed model learns CSP-like features [20] from each information modality using inputs in the form of (3). Then, the constructed features from all modalities are fused into a tensor object to compactly represent the spatiotemporal information captured by the sensor network. Finally, the tensor objects are processed by a tensor-based neural network for producing a mapping from 3D skeleton data to human poses. In the following, we describe each one of the steps presented above in detail.

Fig. 1: Kinect II skeletal capturing system (vvvv.org/documentation/kinect).

III Data Preprocessing

In this section, we present the 3D skeleton data as well as their preprocessing for human pose recognition. The Kinect-II sensor creates a depth map over which twenty-five skeletal joints are identified and monitored at the constant rate of 30 measurements per second, see Fig. 1. For each joint, its position in 3D space with respect to the Kinect-II device is provided. A human pose, however, is characterized by the relative positions of the human body parts. For this reason, we represent the position of each joint with respect to the position of the Spine Base joint. In other words, we use the Spine Base joint as the origin of a local coordinate system. This way, the recognition of human poses does not depend on the position of the human with respect to the Kinect-II device.

Specifically, if we denote as $\boldsymbol{p}_0(t)$ the coordinates of the Spine Base joint and as $\boldsymbol{p}_i(t)$, $i = 1, \dots, 24$, the coordinates of all other joints, then the coordinates of the joints with respect to the Spine Base joint are given by

$\tilde{\boldsymbol{p}}_i(t) = \boldsymbol{p}_i(t) - \boldsymbol{p}_0(t).$   (7)

Using the transformed coordinates in (7), we create matrices as in (3), one for each of the $x$, $y$ and $z$ coordinates (i.e., $M = 3$). Those matrices encode the spatiotemporal information for recognizing human poses, and, thus, we want to map those matrices to a specific human pose.

At this point, we have to mention that the parameter $T$ in (3) is application dependent and affects the recognition results. For this reason, it must be set appropriately. For $T = 1$, the pose recognition model will not be able to exploit temporal information and thus will be more prone to measurement errors, while large values of $T$ may result in a dataset where each datum depicts more than one pose, increasing, this way, the uncertainty in recognition. The effect of the parameter $T$ on the recognition results is further discussed in Section V-B2.
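The preprocessing pipeline of this section can be sketched as follows; the code assumes the Spine Base joint is stored at index 0 (an assumption about the data layout, not stated in the text) and that `joints` and `labels` hold the raw Kinect-II stream and its per-frame annotations.

```python
import numpy as np

def center_on_spine_base(joints, spine_base_idx=0):
    """Eq. (7): express every joint relative to the Spine Base joint.
    joints: array of shape (n_frames, 25, 3) with (x, y, z) per joint.
    spine_base_idx=0 is an assumption about the joint ordering."""
    return joints - joints[:, spine_base_idx:spine_base_idx + 1, :]

def make_windows(joints, labels, T):
    """Slice the stream into samples of temporal length T (eq. (3)) and
    label each sample with the annotation of its central frame."""
    samples, sample_labels = [], []
    for t in range(joints.shape[0] - T + 1):
        window = joints[t:t + T]                         # (T, 25, 3)
        samples.append(np.transpose(window, (1, 0, 2)))  # (25, T, 3), as in (4)
        sample_labels.append(labels[t + T // 2])         # central-frame annotation
    return np.stack(samples), np.array(sample_labels)
```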

IV Space-Time Domain Tensor Neural Network

The proposed novel tensor-based neural network consists of three main components: the input neural network layer capable of computing CSP-like features, the tensor fusion operation, and the tensor contraction and tensor regression layers that process high-order data in their original multilinear form. In the following, we describe each one of those components in detail.

IV-A CSP Neural Network Layer

The CSP neural network layer aims to produce highly discriminative spatiotemporal features for human pose recognition. The design of that layer is motivated by the CSP algorithm [4], which is widely used for classifying EEG signals. For the sake of clarity and completeness, we briefly describe the CSP algorithm.

The CSP algorithm was originally developed for binary classification problems. It receives as input zero-average signals in the form of (3) along with their labels. Then, its objective is to produce features that increase the separability between the two pattern classes. Specifically, consider that we have at our disposal $n$ samples $(X_j, y_j)$, where $y_j \in \{-1, +1\}$ denotes the class of each sample. The CSP algorithm computes the covariance matrix

$C_j = X_j X_j^\top$   (8)

for each sample, and the average covariance matrix

$\bar{C}_y = \frac{1}{n_y} \sum_{j : y_j = y} C_j$   (9)

for each class, where $n_y$ is the number of samples belonging to class $y$. Then, the CSP filter, a matrix $W \in \mathbb{R}^{k \times N}$, is constructed by using $k$, $k \leq N$, eigenvectors: those corresponding to the $k/2$ largest and the $k/2$ smallest eigenvalues of $\bar{C}_{-1}^{-1} \bar{C}_{+1}$. Finally, using $W$, each sample is represented by a feature vector of the following form:

$\boldsymbol{\phi}(X_j) = [\, \operatorname{var}(\boldsymbol{w}_1 X_j), \dots, \operatorname{var}(\boldsymbol{w}_k X_j) \,]^\top \in \mathbb{R}^{k},$   (10)

where $\boldsymbol{w}_r$ stands for the $r$-th row of $W$. The features $\boldsymbol{\phi}(X_j)$ typically are used as inputs to learning models since they encode the spatiotemporal information of the signals $X_j$.
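For reference, a minimal NumPy/SciPy sketch of the binary CSP algorithm of (8)-(10) is given below; it computes the class-average covariances and solves the standard symmetric generalized eigenproblem, one common way to obtain the CSP filters (the function names are illustrative).

```python
import numpy as np
from scipy.linalg import eigh

def csp_filters(X_pos, X_neg, k):
    """Binary CSP as in (8)-(10). X_pos/X_neg: lists of zero-mean (N x T) trials.
    Returns W (k x N): k/2 filters per extreme of the generalized eigenspectrum."""
    C_pos = np.mean([X @ X.T for X in X_pos], axis=0)   # eq. (9), class +1
    C_neg = np.mean([X @ X.T for X in X_neg], axis=0)   # eq. (9), class -1
    # Generalized eigenproblem C_pos w = lambda (C_pos + C_neg) w;
    # eigh returns eigenvalues in ascending order.
    vals, vecs = eigh(C_pos, C_pos + C_neg)
    idx = np.concatenate([np.arange(k // 2),                           # smallest
                          np.arange(len(vals) - k // 2, len(vals))])   # largest
    return vecs[:, idx].T                                # (k, N)

def csp_features(W, X):
    """Eq. (10): variance of each spatially filtered signal."""
    return np.var(W @ X, axis=1)                         # (k,)
```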

Although theoretically sound, the CSP algorithm presents several drawbacks when applied to real-world problems, mainly due to the non-stationarity of the captured signals [20]. Moreover, it is a feature construction technique that is performed in isolation, and thus does not permit information flow between the feature construction and pattern recognition tasks (see Section I-A2).

To overcome those drawbacks, the proposed CSP neural network layer learns the matrix $W$ during the training of the pattern recognition model. The trainable matrix $W \in \mathbb{R}^{k \times N}$ projects the measurements into $\mathbb{R}^{k \times T}$, and then features as in (10) are computed from the projected measurements. Additionally, since Kinect-II measurements provide 3D coordinates, we use three parallel CSP layers, one for each coordinate in the 3D space. Therefore, the output of the CSP layer consists of three vectors in $\mathbb{R}^{k}$. Finally, as shown in [20], the parameters $W$ can be efficiently learned within an end-to-end trainable neural network.
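A minimal PyTorch sketch of such a trainable CSP layer is given below; it is our own illustration of the idea (a trainable projection $W$ followed by the variance operator of (10)), not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class TrainableCSPLayer(nn.Module):
    """Sketch of the proposed CSP layer: a trainable projection W followed by
    the variance operator of (10). Class and attribute names are illustrative."""
    def __init__(self, n_joints=25, k=8):
        super().__init__()
        self.W = nn.Parameter(0.1 * torch.randn(k, n_joints))  # W in R^{k x N}

    def forward(self, X):
        # X: (batch, N, T), measurements of one coordinate, as in (3)
        Z = self.W @ X          # (batch, k, T), spatially filtered signals
        return Z.var(dim=-1)    # (batch, k), CSP-like features as in (10)

# Three parallel layers, one per coordinate (x, y, z), each yielding a vector in R^k
csp_x, csp_y, csp_z = (TrainableCSPLayer() for _ in range(3))
```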

Fig. 2: The proposed CSP layer and the tensor fusion operation. The parameter $N$ stands for the number of skeleton joints.

IV-B Tensor Fusion Operation

The fusion module receives as input the feature vectors constructed by the CSP layer and its objective is to produce a rich and compact representation of the data. Since we do not know in advance the kind of interactions between the elements of the constructed feature vectors, we cannot fuse them using feature averaging or addition (see Section I-A1).

The employed fusion technique is motivated by the work in [11]. Specifically, the output of the fusion module corresponds to the Kronecker product of the three feature vectors produced by the CSP layer. Therefore, after the fusion module, each input sample in the form of (4) is represented by a tensor object in $\mathbb{R}^{k \times k \times k}$. Contrary to [11], we do not reduce the dimensionality of the fused tensor object by using tensor decomposition techniques. Instead, we use a tensor-based learning machine capable of processing the fused information in its original multilinear form. The proposed CSP layer and the tensor fusion operation are visually presented in Fig. 2.
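The fusion operation itself is a batched outer (Kronecker) product; a short sketch, assuming three CSP feature vectors of dimension $k$ per sample:

```python
import torch

def tensor_fusion(fx, fy, fz):
    """Fuses the three CSP feature vectors (each in R^k) into a k x k x k tensor
    via their outer (Kronecker) product, batch-wise."""
    # fx, fy, fz: (batch, k)
    return torch.einsum('bi,bj,bl->bijl', fx, fy, fz)   # (batch, k, k, k)

fx = fy = fz = torch.randn(4, 8)
print(tensor_fusion(fx, fy, fz).shape)  # torch.Size([4, 8, 8, 8])
```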

IV-C Tensor-based Neural Network

The employed tensor-based neural network is a fully connected feed-forward neural network whose parameter space, however, is compressed [34]. At each layer, the weights of the tensor-based neural network should satisfy the Tucker decomposition [35]. In particular, the weights at the $l$-th hidden layer are expressed as

$\boldsymbol{W}^{(l)} = \boldsymbol{1} \times_1 U_1^{(l)} \times_2 U_2^{(l)} \times_3 U_3^{(l)},$   (11)

where $\boldsymbol{1}$ is a tensor all elements of which equal one, and the operation “$\times_d$” stands for the mode-$d$ product.

The information is propagated through the layers of the tensor-based neural network in a sequence of projections – at each layer the tensor input is projected to another tensor space – and nonlinear transformations. Formally, consider a network with $h$ hidden layers. An input (tensor) sample $\boldsymbol{X}^{(l-1)}$ is propagated from the $(l-1)$-th layer of the network to the next one via the projection

$\boldsymbol{Z}^{(l)} = \boldsymbol{X}^{(l-1)} \times_1 U_1^{(l)} \times_2 U_2^{(l)} \times_3 U_3^{(l)}$   (12)

and the nonlinear transformation

$\boldsymbol{X}^{(l)} = g\big(\boldsymbol{Z}^{(l)}\big),$   (13)

where $g(\cdot)$ is a nonlinear function (e.g. the sigmoid) that is applied element-wise on a tensor object. For the first layer, $\boldsymbol{X}^{(0)}$ is the fused input tensor. The layers that propagate tensor objects in the way described above are referred to as Tensor Contraction Layers (TCLs) [36]. The term “contraction”, however, is misleading, since it implies that the projection operation should reduce the dimension of the input, which obviously is not necessary.
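A tensor contraction layer of (12)-(13) can be sketched with mode-wise projections implemented as Einstein summations; the class below is an illustrative PyTorch rendering, with freely chosen input and output dimensions:

```python
import torch
import torch.nn as nn

class TensorContractionLayer(nn.Module):
    """Sketch of a TCL implementing (12)-(13): three mode-wise projections
    followed by an element-wise nonlinearity. All names here are illustrative."""
    def __init__(self, in_dims, out_dims):
        super().__init__()
        self.U = nn.ParameterList(
            [nn.Parameter(0.1 * torch.randn(o, i)) for o, i in zip(out_dims, in_dims)])

    def forward(self, X):
        # X: (batch, d1, d2, d3); mode-n products via einsum
        Z = torch.einsum('bijk,ai->bajk', X, self.U[0])  # mode-1 product
        Z = torch.einsum('bajk,cj->back', Z, self.U[1])  # mode-2 product
        Z = torch.einsum('back,dk->bacd', Z, self.U[2])  # mode-3 product
        return torch.sigmoid(Z)                          # eq. (13)

tcl = TensorContractionLayer(in_dims=(8, 8, 8), out_dims=(4, 4, 4))
print(tcl(torch.randn(2, 8, 8, 8)).shape)  # torch.Size([2, 4, 4, 4])
```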

Fig. 3: Propagation of information through the layers of the tensor-based neural network.

Finally, the output of the $h$-th hidden layer is fed to a Tucker regression model [34], which outputs

$\hat{y}_c = \big\langle \boldsymbol{X}^{(h)},\, \boldsymbol{G}_c \times_1 U_{c,1} \times_2 U_{c,2} \times_3 U_{c,3} \big\rangle + b_c$   (14)

for the $c$-th class. In (14), the core tensor $\boldsymbol{G}_c$ lies in $\mathbb{R}^{r_1 \times r_2 \times r_3}$, where $r_d$ is the rank of the Tucker decomposition along mode $d$ used in the output layer. The scalar $b_c$ is the bias associated with the $c$-th class, while the subscript $c$ indicates that separate sets of parameters are used to model the response for each class. The tensor-based neural network is presented in Fig. 3.
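A sketch of such a Tucker regression output layer, under our own naming and initialization choices, is the following; for each class it reconstructs the Tucker-structured weight tensor of (14) and contracts it with the last hidden tensor:

```python
import torch
import torch.nn as nn

class TuckerRegressionLayer(nn.Module):
    """Sketch of the output layer of (14): for each class c, a rank-(r1, r2, r3)
    Tucker-structured weight tensor is contracted with the last hidden tensor,
    plus a per-class bias. Names and initialization are illustrative."""
    def __init__(self, in_dims, ranks, n_classes):
        super().__init__()
        self.G = nn.Parameter(0.1 * torch.randn(n_classes, *ranks))   # cores G_c
        self.U = nn.ParameterList([
            nn.Parameter(0.1 * torch.randn(n_classes, d, r))
            for d, r in zip(in_dims, ranks)])                         # factors U_{c,d}
        self.b = nn.Parameter(torch.zeros(n_classes))                 # biases b_c

    def forward(self, X):
        # X: (batch, d1, d2, d3). Reconstruct W_c = G_c x1 U_{c,1} x2 U_{c,2} x3 U_{c,3}
        W = torch.einsum('cpqr,cip,cjq,ckr->cijk', self.G, *self.U)
        # <X, W_c> + b_c for every class c, as in eq. (14)
        return torch.einsum('bijk,cijk->bc', X, W) + self.b

trl = TuckerRegressionLayer(in_dims=(4, 4, 4), ranks=(2, 2, 2), n_classes=7)
print(trl(torch.randn(2, 4, 4, 4)).shape)  # torch.Size([2, 7])
```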

At this point it should be highlighted that the sequential projections and nonlinear transformations can be seen as a hierarchical feature construction process, which aims to capture statistical relations between the elements of the input in order to emphasize discriminative features for the pattern recognition task.

Finally, since the weights of the employed tensor-based neural network need to satisfy the decomposition in (11), the total number of trainable parameters is reduced substantially [34]. This reduction acts as a very strong regularizer that shields the network against overfitting (see [37], Section 2.2).

Fig. 4: Examples of the seven different postures.

V Experimental Results

In this section we describe the dataset employed in this study, the effect of different parameters on the performance of the proposed scheme, as well as a performance evaluation of the proposed methodology against state-of-the-art methods for choreography modeling.

V-A Dataset Description

In this study we employ the dataset captured within the framework of the EU project TERPSICHORE [38]. The dataset consists of five Greek folklore dances, each performed by three professionals. Each dance performance is described by a sequence of consecutive frames, and each frame is represented by the spatial coordinates of the twenty-five tracked skeleton joints (see Fig. 1). The frames of the captured choreographies were manually annotated by dance experts according to the posture they depict. In total, seven different postures are depicted, see Fig. 4. Therefore, the objective is to train the proposed model to correctly classify postures into seven different categories.

Fig. 5: Average classification accuracy and F1 score of a tensor-based neural network with one tensor contraction layer, for a fixed value of $T$, and for different values of the parameter $k$.

Three steps are followed to transform the captured data into a dataset suitable for training and testing our proposed methodology. First, we follow the procedure described in Section III to transform the coordinates of the skeleton joints to a coordinate system whose origin is the Spine Base joint. Second, we use different values for the parameter $T$ to create datasets as in (4). Third, we assign to each sample the annotation of its central frame, i.e., the sample $\boldsymbol{X}(t)$ is assigned the annotation of the $(t + \lfloor T/2 \rfloor)$-th frame. By following those steps we obtain a dataset of annotated samples.

For evaluating the performance of our methodology, we randomly shuffle the constructed dataset and follow a fold-based cross validation scheme. Under that scheme, the performance is evaluated in terms of average classification accuracy and F1 score across the folds.
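The evaluation protocol can be sketched as follows; the number of folds and the `build_and_train` helper are illustrative placeholders, since the fold count used in the experiments is not restated here.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, f1_score

def cross_validate(samples, labels, build_and_train, n_splits=5, seed=0):
    """Shuffle, split into folds, and report average accuracy and F1 score.
    build_and_train is a placeholder: it fits a model on the training fold
    and returns a predict callable."""
    accs, f1s = [], []
    for train_idx, test_idx in KFold(n_splits, shuffle=True,
                                     random_state=seed).split(samples):
        predict = build_and_train(samples[train_idx], labels[train_idx])
        preds = predict(samples[test_idx])
        accs.append(accuracy_score(labels[test_idx], preds))
        f1s.append(f1_score(labels[test_idx], preds, average='macro'))
    return np.mean(accs), np.mean(f1s)
```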

V-B Investigating the Effect of the Parameters

There are three different parameters that affect the performance of the proposed methodology: the parameter $k$, i.e., the dimension of the feature vectors constructed by the CSP layer; the parameter $T$, i.e., the temporal dimension of the samples; and the parameter $c$, i.e., the number of tensor contraction layers employed in the tensor-based neural network architecture.

V-B1 The effect of the parameter $k$

As mentioned above, the parameter $k$ corresponds to the dimension of the features constructed by the CSP layer of the proposed neural network architecture. For investigating the effect of that parameter on the performance of the model, we keep the parameter $T$ fixed. Then, we train and test the proposed model with one tensor contraction layer for four different values of $k$, while the tensor contraction layer receives the fused input in $\mathbb{R}^{k \times k \times k}$ and projects it to a tensor space of fixed dimensions, and the ranks of the Tensor Regression Layer are kept fixed.

The effect of the parameter $k$ is depicted in Fig. 5. By increasing the value of the parameter $k$, the performance of the proposed model monotonically increases. The dimension of the features constructed by the CSP layer is directly related to their representation power. Thus, features of higher dimension can better capture the spatial and temporal patterns of the skeleton data, resulting in more accurate human pose recognition. Moreover, increasing the value of $k$ increases the total number of trainable parameters of the model, and thus the learning capacity of the model. This further justifies the higher performance for larger values of $k$.

Fig. 6: Average classification accuracy and F1 score of a tensor-based neural network with one tensor contraction layer, for a fixed value of $k$, and for different values of the parameter $T$.

V-B2 The effect of the parameter $T$

In contrast to the parameter $k$, the parameter $T$ affects neither the number of trainable parameters of the model nor the dimension of the features constructed by the CSP layer, due to the variance operator employed in (10). The parameter $T$ indirectly determines the amount of temporal information that is taken into consideration during the construction of the features. Therefore, small values of $T$ result in features that encode a small amount of temporal information and may not be able to sufficiently represent the temporal relations present in the data.

The effect of the parameter $T$ on the performance of the model is presented in Fig. 6. To obtain those results, we train a tensor-based neural network with one tensor contraction layer and keep the value of the parameter $k$ fixed. Again, the tensor contraction layer receives the fused input in $\mathbb{R}^{k \times k \times k}$ and projects it to a tensor space of fixed dimensions, and the ranks of the Tensor Regression Layer are kept fixed. Producing features that encode larger amounts of temporal information results in higher human pose recognition accuracy: increasing the value of $T$ beyond its smallest tested value yields a considerable performance improvement, while further increases of $T$ result in progressively smaller gains. This implies that, to capture the most important temporal information for the problem at hand, a sufficient number of consecutive frames needs to be used.

V-B3 The effect of the parameter $c$

The parameter $c$ corresponds to the number of tensor contraction layers present in the network. Fig. 7 presents the effect of the number of tensor contraction layers on the performance of the model. To obtain those results, we keep the parameters $k$ and $T$ fixed and train four different tensor-based neural networks with 1, 2, 3, and 4 tensor contraction layers. The projections of the employed contraction layers are presented in Table I. Increasing the number of tensor contraction layers increases the total number of trainable parameters of the model, and thus its learning capacity. That increase of the learning capacity, however, does not seem to affect the performance of the model, since the performance improvement from 1 to 4 tensor contraction layers is only marginal.

Fig. 7: Average classification accuracy and F1 score of a tensor-based neural network with different numbers of tensor contraction layers (parameter $c$), for fixed values of $k$ and $T$.
TABLE I: Projections of tensor objects as they propagate through the Tensor Contraction Layers (TCLs), and the ranks of the Tensor Regression Layer (TRL), for the architectures with 1, 2, 3 and 4 TCLs. Each column lists the input dimensions, the projection performed by each contraction layer (a dash marks layers absent in shallower architectures), and the TRL ranks.

The investigation above suggests that the most important parameter for achieving highly accurate human pose recognition results is the parameter $k$. Indeed, increasing the dimension of the features constructed by the CSP layer from the smallest to the largest tested value yields a substantial performance improvement. On the contrary, designing deeper architectures does not seem to significantly affect the performance of the model. This might be due to the Tucker decomposition imposed on the weights of the tensor contraction layers (see (11)), which acts as a very strong regularizer for the model.

V-C Performance Evaluation Against State-of-the-Art Methods

In this section we compare the performance of the proposed tensor-based neural network against state-of-the-art methods for choreographic modeling. Specifically, we compare our model against an LSTM and the recently proposed Bayesian Optimized Bidirectional LSTM (BOBi LSTM) [39]. The experimental results in [39] indicate that the LSTM and BOBi LSTM outperform other machine learning techniques, such as support vector machines and feedforward fully connected neural networks, on this specific task. For this reason, in the present study, we compare our model only against the LSTM and BOBi LSTM models.

For the performance comparison, we utilize a tensor-based neural network with two tensor contraction layers ($c = 2$) and values of the parameters $k$ and $T$ chosen according to the investigation in Section V-B. Regarding the LSTM and BOBi LSTM models, their architectures are the ones presented in [39], and they use a fixed memory of past frames for recognizing human poses. At this point we should emphasize that those models receive as input the kinematic properties of the skeleton joints, i.e., the spatial position as well as the velocity and the acceleration of each joint. In contrast, our method receives as input solely the spatial positions of the joints. Moreover, the proposed model is far more compact: the BOBi LSTM network in [39] is composed of 2 LSTM layers of 128 cells each and two additional dense layers at the output, for a total of 205,674 trainable parameters, namely 87 times more than the number of trainable parameters in our approach. This significant reduction favors efficient parameter estimation, especially when small sample setting problems need to be addressed.

              Accuracy   F1 Score
LSTM            84.2%      82.0%
BOBi LSTM       85.4%      80.7%
Our Approach    91.6%      90.9%
TABLE II: Performance comparison in terms of average classification accuracy and F1 score against the LSTM and BOBi LSTM models.

Table II presents the results of that comparison. The proposed tensor-based neural network performs more than 6% better than the BOBi LSTM in terms of accuracy, despite the fact that it uses a simpler input representation (our method is completely blind to the kinematics information of the skeleton joints).

The comparison above implies the following. First, the proposed CSP layers can produce highly discriminative features that encode the spatial and the temporal information in the data. Second, employing the tensor fusion operation produces compact yet highly descriptive representations of the input. Finally, tensor contraction and tensor regression layers can efficiently process data in tensor form and produce learning models with high generalization capacity.

VI Conclusion

In this work we proposed a spatially and temporally aware tensor-based neural network that can efficiently process spatiotemporal data. We evaluated the performance of the proposed model on the problem of human pose recognition using 3D data captured with the Kinect-II sensor. The evaluation results indicate that the proposed model can construct highly discriminative spatiotemporal features and achieve state-of-the-art performance. As mentioned in Section II-A, the problem of recognizing human poses using 3D skeleton data is a specific instance of the more general problem of pattern recognition using information coming from sensor networks. Therefore, despite the fact that in this work we consider that specific problem, our model is a general one that can be applied to any pattern recognition problem that employs spatiotemporal data from sensor networks.

References

  • [1] R. Gravina, P. Alinia, H. Ghasemzadeh, and G. Fortino, “Multi-sensor fusion in body sensor networks: State-of-the-art and research challenges,” Information Fusion, vol. 35, pp. 68–80, 2017.
  • [2] M. C. Vuran, Ö. B. Akan, and I. F. Akyildiz, “Spatio-temporal correlation: theory and applications for wireless sensor networks,” Computer Networks, vol. 45, no. 3, pp. 245–259, 2004.
  • [3] Z. Zhang, “Microsoft kinect sensor and its effect,” IEEE multimedia, vol. 19, no. 2, pp. 4–10, 2012.
  • [4] M. Grosse-Wentrup and M. Buss, “Multiclass common spatial patterns and information theoretic feature extraction,” IEEE Transactions on Biomedical Engineering, vol. 55, no. 8, pp. 1991–2000, 2008.
  • [5] D. H. Wolpert, “Stacked generalization,” Neural networks, vol. 5, no. 2, pp. 241–259, 1992.
  • [6] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Advances in neural information processing systems, 2014, pp. 568–576.
  • [7] E. Park, X. Han, T. L. Berg, and A. C. Berg, “Combining multiple sources of knowledge in deep cnns for action recognition,” in 2016 IEEE Winter Conference on Applications of Computer Vision (WACV).   IEEE, 2016, pp. 1–8.
  • [8] A. Liapis, D. Karavolos, K. Makantasis, K. Sfikas, and G. N. Yannakakis, “Fusing level and ruleset features for multimodal learning of gameplay outcomes,” in Proceedings of the IEEE Conference on Games, 2019.
  • [9] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.
  • [10] K. Makantasis, A. Doulamis, N. Doulamis, and A. Voulodimos, “Common mode patterns for supervised tensor subspace learning,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2019, pp. 2927–2931.
  • [11] G. Hu, Y. Hua, Y. Yuan, Z. Zhang, Z. Lu, S. S. Mukherjee, T. M. Hospedales, N. M. Robertson, and Y. Yang, “Attribute-enhanced face recognition with neural tensor fusion networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 3744–3753.
  • [12] L. Bao and S. S. Intille, “Activity recognition from user-annotated acceleration data,” in International conference on pervasive computing.   Springer, 2004, pp. 1–17.
  • [13] S. Wang, J. Yang, N. Chen, X. Chen, and Q. Zhang, “Human activity recognition with user-free accelerometers in the sensor networks,” in 2005 International Conference on Neural Networks and Brain, vol. 2.   IEEE, 2005, pp. 1212–1217.
  • [14] N. Ravi, N. Dandekar, P. Mysore, and M. L. Littman, “Activity recognition from accelerometer data,” in Aaai, vol. 5, no. 2005, 2005, pp. 1541–1546.
  • [15] K. Makantasis, A. Nikitakis, A. D. Doulamis, N. D. Doulamis, and I. Papaefstathiou, “Data-driven background subtraction algorithm for in-camera acceleration in thermal imagery,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 9, pp. 2090–2104, 2017.
  • [16] T. Huynh and B. Schiele, “Analyzing features for activity recognition,” in Proceedings of the 2005 joint conference on Smart objects and ambient intelligence: innovative context-aware services: usages and technologies, 2005, pp. 159–163.
  • [17] M. G. Abdu-Aguye and W. Gomaa, “Novel approaches to activity recognition based on vector autoregression and wavelet transforms,” in 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA).   IEEE, 2018, pp. 951–954.
  • [18] Z.-Y. He and L.-W. Jin, “Activity recognition from acceleration data using ar model representation and svm,” in 2008 international conference on machine learning and cybernetics, vol. 4.   IEEE, 2008, pp. 2245–2250.
  • [19] A. M. Khan, Y.-K. Lee, and T.-S. Kim, “Accelerometer signal-based human activity recognition using augmented autoregressive model coefficients and artificial neural nets,” in 2008 30th Annual International Conference of the IEEE Engineering in Medicine and Biology Society.   IEEE, 2008, pp. 5172–5175.
  • [20] A. Nikitakis, K. Makantasis, N. Tampouratzis, and I. Papaefstathiou, “A unified novel neural network approach and a prototype hardware implementation for ultra-low power eeg classification,” IEEE transactions on biomedical circuits and systems, vol. 13, no. 4, pp. 670–681, 2019.
  • [21] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 4489–4497.
  • [22] S. Ji, W. Xu, M. Yang, and K. Yu, “3d convolutional neural networks for human action recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 1, pp. 221–231, 2012.
  • [23] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale video classification with convolutional neural networks,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2014, pp. 1725–1732.
  • [24] K. Makantasis, A. Doulamis, N. Doulamis, and K. Psychas, “Deep learning based human behavior recognition in industrial workflows,” in 2016 IEEE International Conference on Image Processing (ICIP).   IEEE, 2016, pp. 1609–1613.
  • [25] K. Makantasis, A. D. Doulamis, N. D. Doulamis, and A. Nikitakis, “Tensor-based classification models for hyperspectral data analysis,” IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 12, pp. 6884–6898, 2018.
  • [26] A. Toshev and C. Szegedy, “DeepPose: Human pose estimation via deep neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 2014, pp. 1653–1660.
  • [27] X. Chen and A. L. Yuille, “Articulated pose estimation by a graphical model with image dependent pairwise relations,” in Advances in neural information processing systems, 2014, pp. 1736–1744.
  • [28] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler, “Joint training of a convolutional network and a graphical model for human pose estimation,” in Advances in neural information processing systems, 2014, pp. 1799–1807.
  • [29] M. Raptis, D. Kirovski, and H. Hoppe, “Real-time classification of dance gestures from skeleton animation,” in Proceedings of the 2011 ACM SIGGRAPH/Eurographics symposium on computer animation, 2011, pp. 147–156.
  • [30] M. Zanfir, M. Leordeanu, and C. Sminchisescu, “The moving pose: An efficient 3d kinematics descriptor for low-latency action recognition and detection,” in Proceedings of the IEEE international conference on computer vision, 2013, pp. 2752–2759.
  • [31] A. Kitsikidis, K. Dimitropoulos, S. Douka, and N. Grammalidis, “Dance analysis using multiple kinect sensors,” in 2014 International Conference on Computer Vision Theory and Applications (VISAPP), vol. 2.   IEEE, 2014, pp. 789–795.
  • [32] A. Ball, D. Rye, F. Ramos, and M. Velonaki, “Unsupervised clustering of people from ‘skeleton’ data,” in Proceedings of the seventh annual ACM/IEEE international conference on Human-Robot Interaction, 2012, pp. 225–226.
  • [33] I. Rallis, I. Georgoulas, N. Doulamis, A. Voulodimos, and P. Terzopoulos, “Extraction of key postures from 3d human motion data for choreography summarization,” in 2017 9th International Conference on Virtual Worlds and Games for Serious Applications (VS-Games).   IEEE, 2017, pp. 94–101.
  • [34] X. Li, D. Xu, H. Zhou, and L. Li, “Tucker tensor regression and neuroimaging analysis,” Statistics in Biosciences, vol. 10, no. 3, pp. 520–545, 2018.
  • [35] T. G. Kolda and B. W. Bader, “Tensor decompositions and applications,” SIAM review, vol. 51, no. 3, pp. 455–500, 2009.
  • [36] J. Kossaifi, Z. C. Lipton, A. Khanna, T. Furlanello, and A. Anandkumar, “Tensor regression networks,” arXiv preprint arXiv:1707.08308, 2017.
  • [37] A. Cichocki, A.-H. Phan, Q. Zhao, N. Lee, I. Oseledets, M. Sugiyama, D. P. Mandic et al., “Tensor networks for dimensionality reduction and large-scale optimization: Part 2 applications and future perspectives,” Foundations and Trends® in Machine Learning, vol. 9, no. 6, pp. 431–673, 2017.
  • [38] N. Doulamis, A. Doulamis, C. Ioannidis, M. Klein, and M. Ioannides, “Modelling of static and moving objects: digitizing tangible and intangible cultural heritage,” in Mixed Reality and Gamification for Cultural Heritage.   Springer, 2017, pp. 567–589.
  • [39] I. Rallis, N. Bakalos, N. Doulamis, A. Voulodimos, A. Doulamis, and E. Protopapadakis, “Learning choreographic primitives through a bayesian optimized bi-directional lstm model,” in 2019 IEEE International Conference on Image Processing (ICIP).   IEEE, 2019, pp. 1940–1944.