DFTerNet: Towards 2-bit Dynamic Fusion Networks for Accurate Human Activity Recognition

07/31/2018 ∙ by Zhan Yang, et al. ∙ Central South University

Deep Convolutional Neural Networks (DCNNs) are currently popular in human activity recognition applications. However, in the face of modern artificial intelligence sensor-based games, many research achievements cannot be practically applied on portable devices. DCNNs are typically resource-intensive and too large to be deployed on portable devices, which limits the practical application of complex activity detection. In addition, since portable devices do not possess high-performance Graphic Processing Units (GPUs), there is hardly any improvement in the Action Game (ACT) experience. Moreover, to deal with multi-sensor collaboration, all previous human activity recognition models typically treated the representations from different sensor signal sources equally. However, distinct types of activities should adopt different fusion strategies. In this paper, a novel scheme is proposed to train 2-bit Convolutional Neural Networks with weights and activations constrained to {-0.5, 0, 0.5}. It takes into account the correlation between different sensor signal sources and the activity types. This model, which we refer to as DFTerNet, aims at producing more reliable inferences and better trade-offs for practical applications. Our basic idea is to exploit quantization of weights and activations directly in pre-trained filter banks and adopt dynamic fusion strategies for different activity types. Experiments demonstrate that the dynamic fusion strategy can exceed the baseline model performance by up to 5% on the OPPORTUNITY and PAMAP2 datasets. Using the proposed quantization method, we were able to achieve performance close to that of the full-precision counterpart. These results were also verified using the UniMiB-SHAR dataset. In addition, the proposed method can achieve 9x acceleration on CPUs and 11x memory saving.







I Introduction

Artificial Intelligence (AI), as an auxiliary technology in modern games, has played an indispensable role in improving the gaming experience over the last decade. The film “Ready Player One” vividly shows the world the charm of future virtual games. It demonstrates that one of the core technologies of virtual-realistic interaction is recognizing all kinds of complex activities.

Convolutional neural networks are very powerful and have been successfully used in many neural network models. They have been widely applied in many practical virtual-realistic interactive applications, e.g., object recognition [1, 2, 3], Internet of Things [4, 5] and human activity recognition [6, 7]. Their success has been driven by the recent data explosion as well as the increase in model size.

Figure 1: The rough numbers for the computations’ energy consumption (45nm technology)  [8] and some effective methods to deploy deep neural networks on portable devices.

In spite of these successes, the large computational power and high memory requirements of these models restrict them from being deployed on portable devices that lack high-performance Graphical Processing Units (GPUs), as shown in Figure 1. Sensor-based portable gaming devices become far more interesting to operate when they are able to recognize human activities. As a result, it is desirable to deploy advanced CNNs, e.g., Inception-Nets [9], ResNets [10] and VGG-Nets [11], on smart portable devices. However, as the winner of the ILSVRC-2015 competition, ResNet-152 [10] is a classifier trained with nearly 19.4 million real-valued parameters, making it resource-intensive in different aspects. It is unable to run on portable devices for real-time applications, due to its high CPU/GPU workload and memory usage requirements. A similar phenomenon occurs in other state-of-the-art models, such as VGG-Net [11] and AlexNet [12].

Recently, approaches to resolving these storage and computational problems [13, 14] have been categorized into three methods, i.e., network pruning, low-rank decomposition and network quantization. Among them, network quantization has received more and more research focus, and DCNNs with binary weights and activations have been designed [15, 16, 17]. Binary Convolutional Neural Networks (BCNNs), with weights and activations constrained to only two values (e.g., -1, +1), can bring great benefits to specialized machine learning hardware for the following major reasons: (1) the quantized weights and activations reduce memory usage and model size by 32x compared to the full-precision version; (2) if networks are binary, then most multiply-accumulate operations (which require hundreds of logic gates at least) can be replaced by popcount-XNOR operations (which require only a single logic gate), which are especially well suited for FPGAs and ASICs.


However, the following problems limit the practical applications of the above-mentioned idea: 1) quantization usually causes severe prediction accuracy degradation. The reported accuracy of the obtained models is unsatisfactory on complex tasks (e.g., the ImageNet dataset). More concretely, Rastegari et al. [17] show that binary weights cause the accuracy of ResNet-18 to drop by about 9% (GoogLenet drops by about 6%) on the ImageNet dataset. There is thus a considerable performance gap between the accuracy of a quantized model and that of the full-precision model. 2) in practical applications, typical human activity recognition datasets contain multiple sensor streams collected from different positions on the body. This data can be fused at different stages within a convolutional neural network architecture [19]. However, depending on the type of activity being performed and on its location, a sensor may contribute less or more to the overall result than other sensors, and its fusion weight should be adjusted accordingly.

In light of these considerations, this paper proposes a novel quantization method and dynamic fusion strategy, to achieve deployments on an advanced high-precision and low-cost computation neural network model on portable devices. The main contributions of this paper are summarized as follows:

  1. We propose a quantization function with an elastic scale parameter to quantize the entire full-precision convolutional neural network. The quantization of weights, activations and fusion weights is derived from the same quantization function with different scale parameters. We quantize the weights and activations to 2-bit values {-0.5, 0, 0.5}, and use a masked Hamming distance instead of floating-point matrix multiplication. This setup is able to achieve an accuracy close to that of the full-precision counterpart, using about 11x less memory and achieving a 9x speedup.

  2. We introduce a dynamic fusion strategy for multi-sensor environmental activity recognition, which can decrease the computational complexity and improve the performance of the model by reducing the representations from sensors that have “less contribution” during a particular activity. For sensors whose “contribution” (sub-network) is less than the others, we randomly reduce their representations through fusion weights, which are sampled from a Bernoulli distribution given by the scale parameter from the quantization method. Experimental results show that by adopting a dynamic fusion strategy, we were able to achieve a higher accuracy level and lower memory usage than the baseline model.

Ideally, using more quantized weights, activation modules and fusion strategies will result in better accuracy, eventually exceeding the full-precision baselines. The training strategy reduces the amount of computing power required and the energy consumption, thereby realizing the main objective of designing a system that can be deployed on portable devices. More importantly, adopting dynamic fusion strategies for different types of activities is more in line with the actual situation. This was verified by using both the quantization method and the fusion strategy on the OPPORTUNITY and PAMAP2 datasets; only the quantization method was applied on the UniMiB-SHAR dataset. To the best of our knowledge, this is the first time that both quantization and a dynamic fusion strategy were adopted in convolutional networks to achieve a high prediction accuracy on complex human activity recognition tasks.

The remainder of this paper is structured as follows. In Section II, we briefly introduce the related works on human activity recognition, quantization models and methods for convolutional neural networks. In Section III, we highlight the motivation of our method and provide some theoretical analysis for its implementation. In Section IV, we introduce our experiment. Section V experimentally demonstrates the efficacy of our method and finally Section VI states the conclusion and future work.

II Related Work

i Convolutional Neural Networks for Human Activity Recognition

Several advanced algorithms have been evaluated on human activity recognition in the last few years. Hand-crafted feature methods [20] use simple statistical values (e.g., std, avg, mean, max, min, median, etc.) or frequency-domain correlation features based on the signal Fourier transform to analyze the time series of human activity recognition data. Due to their simple setup and low computational cost, they are still being used in some areas, but their accuracy cannot satisfy the requirements of modern AI games. Decision Trees, Support Vector Machines (SVM), Random Forests [19], Dynamic Time Warping [23] and Hidden Markov Models (HMMs) [24, 25] for predicting action classes work well when data are scarce and highly unbalanced. However, when faced with the activity recognition of complex high-level behaviors, identifying the relevant features through these traditional approaches is time-consuming [22].

Recently, many researchers have adopted Convolutional Neural Networks (CNNs) to deploy human activity recognition systems, e.g., [6, 22, 26, 27, 28]. Convolutional Neural Networks were inspired by the discovery of visual cortical cells and retain the spatial information of the data through receptive fields. It is known that the power of CNNs stems in large part from their ability to exploit symmetries through a combination of weight sharing and translation equivariance. With their ability to act as feature extractors, a plurality of convolution operators are stacked to create a hierarchy of progressively more abstract features. Apart from image recognition [12, 29, 30], NLP [31, 32] and video recognition [33], more and more research in recent years has used CNNs to learn sensor-based data representations for human activity recognition and has achieved remarkable performance [6, 34]. The model of [28] consists of two or three temporal-convolution layers with the ReLU activation function, followed by a max-pooling layer and a soft-max classifier, which can be applied over all sensors simultaneously. Yang et al. [6] introduce four temporal-convolutional layers on a single sensor, followed by a fully-connected layer and soft-max classifier, and show that deeper networks can find correlations between different sensors.

Just like the works discussed above, we adopt convolutional neural networks to learn representations from wearable multi-sensor signal sources. However, these advanced high-precision models are difficult to deploy on portable devices, due to their computational complexity and energy consumption. As a result, quantization of convolutional neural networks has become a hot research topic. The aim is to reduce memory usage and computational complexity while maintaining an acceptable accuracy.

ii Quantization Model of Convolutional Neural Networks

The convolutional binary neural network is not a new topic. Inspired by neuroscience, the unit step function has been used as an activation function in artificial neural networks [35]. The binary activation mode can use spiking responses for computation and communication, which is energy-efficient because energy is consumed only when necessary [13].

Recently, Binarized Neural Networks (BNNs) [16] have successfully quantized the weights and activations of each layer to binary values. Two binarization functions were proposed: the first is deterministic as shown in (1) and the second is stochastic as shown in (2), where x^b is the binarized variable, x is the full-precision variable, and σ(x) is the “hard sigmoid” function:

x^b = sign(x) = { +1 if x ≥ 0; -1 otherwise }    (1)

x^b = { +1 with probability p = σ(x); -1 with probability 1 - p },  σ(x) = clip((x + 1) / 2, 0, 1)    (2)
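As a concrete illustration, the two binarization rules of BNNs [16] can be sketched in a few lines of numpy; the function names are ours, and the stochastic rule uses the “hard sigmoid” as the Bernoulli parameter:

```python
import numpy as np

def binarize_det(x):
    """Deterministic binarization: x^b = sign(x), with sign(0) taken as +1."""
    return np.where(x >= 0, 1.0, -1.0)

def hard_sigmoid(x):
    """The "hard sigmoid" sigma(x) = clip((x + 1) / 2, 0, 1)."""
    return np.clip((x + 1.0) / 2.0, 0.0, 1.0)

def binarize_stoch(x, rng=None):
    """Stochastic binarization: +1 with probability sigma(x), else -1."""
    rng = rng or np.random.default_rng(0)
    return np.where(rng.random(np.shape(x)) < hard_sigmoid(x), 1.0, -1.0)
```

The deterministic rule is what is typically used at inference time; the stochastic rule is mainly useful as a regularizer during training.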
Ternary Weight Networks (TWN) [36] constrain the weights to ternary values {-α, 0, +α} by referencing symmetric thresholds. In each layer, the quantization of TWN is shown in (3), where Δ is a positive threshold parameter:

W^t = { +α if W > Δ; 0 if |W| ≤ Δ; -α if W < -Δ }    (3)

They claim a trade-off between model complexity and generalization.
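A minimal numpy sketch of TWN-style ternarization. The choices Δ = t · mean(|W|) for the symmetric threshold and the mean magnitude of the surviving entries for α are common heuristics from the TWN line of work, stated here as assumptions:

```python
import numpy as np

def twn_quantize(w, t=0.75):
    """Ternarize w to {-alpha, 0, +alpha} with symmetric threshold delta."""
    delta = t * np.mean(np.abs(w))          # symmetric threshold
    mask = np.abs(w) > delta                # entries that stay non-zero
    alpha = np.mean(np.abs(w[mask])) if mask.any() else 0.0
    return alpha * np.sign(w) * mask
```

Small weights are zeroed out, and the surviving weights share a single per-layer magnitude α.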

DoReFa-Net [37] is derived from AlexNet; it has 1-bit weights, 2-bit activations and 6-bit gradients, and achieves 46.1% top-1 accuracy on the ImageNet validation set. DoReFa-Net quantizes its weights as shown in (4), where W and W^b are the full-precision (original) and quantized weights, respectively, and E(|W|) is the mean absolute value of the weights:

W^b = sign(W) × E(|W|)    (4)

iii Quantization Method for Convolutional Neural Networks

The idea of quantizing weights and activations was first proposed by [16]. That research made the following two contributions: 1) the costly arithmetic operations between weights and activations in full-precision networks can be replaced with cheap bitcount and XNOR operations, which can result in significant speed improvements; compared with the full-precision counterpart, 1-bit quantization reduces the memory usage by a factor of 32; and 2) in some visual classification tasks, 1-bit quantization achieves fairly good performance.

Some researchers [17, 38] have introduced easy, high-performance and accurate approximations to convolutional neural networks by quantizing the weights with a uniform quantization method, which first scales each value into the range [0, 1]. It then adopts the k-bit quantization shown in (5), where round(·) approximates continuous values to their nearest discrete states:

q_k(x) = round((2^k - 1) x) / (2^k - 1)    (5)

The benefit of this quantization method is that when calculating the inner product of two quantized vectors, costly arithmetic calculations can be replaced by cheap operations (e.g., bit shifts and bit counts). In addition, this quantization method is rule-based and thus easy to implement.
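A sketch of this uniform rule, under the assumption that inputs have already been scaled to [0, 1]:

```python
import numpy as np

def uniform_quantize(x, k):
    """Uniform k-bit quantization of values in [0, 1]:
    q = round((2^k - 1) * x) / (2^k - 1)."""
    levels = 2**k - 1
    return np.round(np.clip(x, 0.0, 1.0) * levels) / levels
```

Every output is an integer multiple of 1/(2^k - 1), which is what lets subsequent inner products be computed with integer arithmetic.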

Zhou et al. [39] propose a network compression method called Incremental Network Quantization (INQ). After obtaining a network through training, the (full-precision) parameters of each layer are first divided into two groups. The parameters in the first group are directly quantized and fixed, while the other group is retrained to compensate for the accuracy loss caused by quantization. The above process iterates until all parameters are quantized. With incremental quantization, weights with small bit-widths (e.g., 3-bit, 4-bit and 5-bit) result in almost no accuracy loss compared with the full-precision counterpart. The quantization method is shown in (6), where W and Ŵ are the full-precision (original) and quantized weights, respectively, and n₁ and n₂ are the lower and upper bounds of the quantized set, respectively.


Wen et al. [40] propose a method as shown in (7), where s is a scalar parameter, ∘ is the Hadamard product, and sign(·) returns the sign of each element. The method quantizes gradients to ternary values, which can effectively improve client-to-server communication in distributed learning.


Guo et al. [41] propose a greedy approximation, which instead tries to learn the quantization as shown in (8), where B is a binary filter, the α's are optimization parameters and c (input channels) × w (width) × h (height) is the size of the filter. The greedy approximation extends to k-bit (k > 1) quantization by minimizing the residue in order. Although not able to achieve a high-precision solution, the formulation of minimizing the quantization error is very promising, and quantized neural networks designed in this manner can be effectively deployed on modern portable devices.

III Method

In this section, we introduce our quantization method and dynamic fusion strategy, termed DFTerNet (Dynamic-Fusion-Ternary(2-bit)-Convolutional-Network) for convenience. We aim to recognize human activities from IMU sensor data. For this purpose, a fully-convolutional-based architecture is chosen and we focus on the recognition accuracy of the final model. During train-time (Training), we still use the full-precision network (the real-valued weights are retained and updated at each epoch). During run-time (Inference), we use ternary weights in convolution.

i Linear Mapping

In this paper, we propose a quantization function that converts a floating-point value v to a k-bitwidth signed integer. Formally, it can be defined as:

Q(v) = clip( round(v / Δ) · Δ, -α, α )    (9)

where Δ is the uniform distance, whose role is to perform a discretization of continuous and unbounded values into a k-bit linear mapping, α is a scale parameter, round(·) is the approximation function that approximates continuous values to their nearest discrete states, and clip(·) clips unbounded values to [-α, α].

Each quantization function can therefore use its scale parameter α to adjust the quantization threshold and clip range: two different scale parameters map the same input value to two different quantized outputs.
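The mapping above can be sketched as follows. The concrete uniform distance Δ = 2α / (2^k − 2), which reproduces the ternary levels {−α, 0, +α} at k = 2, is our assumption, since the printed formula did not survive extraction:

```python
import numpy as np

def quantize(v, k=2, alpha=0.5):
    """Clip-then-round linear quantizer (sketch of Eq. (9)), k >= 2.
    Values are clipped to [-alpha, alpha] and snapped to a uniform grid."""
    delta = 2.0 * alpha / (2**k - 2)   # uniform distance between levels
    v = np.clip(v, -alpha, alpha)
    return np.round(v / delta) * delta
```

With k = 2 and α = 0.5 this produces exactly the ternary set {-0.5, 0, 0.5} used throughout the paper, and changing α rescales both the clip range and the quantization threshold.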

ii Approximate weights

Consider a k-layer CNN model. Suppose the learnable weights of each convolutional layer are represented as a tensor of size c × n × w × h, where c, n, w and h indicate the input channels, output channels, filter width and filter height, respectively. When using 32-bit (full-precision) floating-point arithmetic, storing all these weights requires 32 · c · n · w · h bits of memory.

As stated above, at each layer our goal is to estimate the real-weight filter W using a 2-bit filter Ŵ. Generally, we define a reconstruction error as shown in (10):

ε = ||W - αŴ||²    (10)

where α describes a nonnegative scaling parameter. To retain the accuracy of the quantized network, the reconstruction error should be minimized. However, directly minimizing the reconstruction error is an NP-hard problem, so solving it by force would be very time-consuming [42]. In order to solve the above problem in a reasonable time, we need an optimal estimation algorithm in which α and Ŵ are sequentially learnt. That is to say, the goal is to solve the following optimization problem:

α*, Ŵ* = argmin_{α,Ŵ} ||W - αŴ||²    (11)

in which Ŵ ∈ {-0.5, 0, +0.5}^{c×w×h}, and the norm is induced by the inner product ⟨A, B⟩, defined as the sum of element-wise products for any three-dimensional tensors.


One way to solve the optimization problem shown in (11) is to expand the cost function and take the derivative w.r.t. α and Ŵ, respectively. However, in this case one must obtain the mutually dependent values of α and Ŵ. To overcome this problem, we use the quantization function of (9) to quantize W, as shown in (12). In this work, we aim to quantize the real-weight filter to ternary values {-0.5, 0, 0.5}, so the parameter k = 2, and the threshold of the weights is controlled by the scale parameter as shown in (13), where t is a shift threshold parameter which can be used to constrain the thresholds.

With Ŵ fixed through (12), Equation (11) becomes a linear regression problem:

α* = argmin_α ||W - αŴ||²    (14)

in which the quantized weights Ŵ serve as the bases in the matrix. Therefore, we can use the “straight-through (ST) estimator” [43] to back-propagate through the quantization. This is shown in detail in Algorithm 1. Note that at run-time, only the quantized weights (with their scaling α) are required.
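Once Ŵ is fixed, (14) has the closed-form least-squares solution α* = ⟨W, Ŵ⟩ / ⟨Ŵ, Ŵ⟩. A numpy sketch of the two steps; the threshold rule Δ = t · mean(|W|) is an assumption standing in for the garbled (13):

```python
import numpy as np

def ternarize(w, t=0.7):
    """Quantize weights to {-0.5, 0, +0.5} with a shift-threshold t."""
    delta = t * np.mean(np.abs(w))
    return 0.5 * np.sign(w) * (np.abs(w) > delta)

def solve_alpha(w, w_q):
    """Closed-form solution of the linear regression (14):
    alpha* = <w, w_q> / <w_q, w_q>."""
    denom = np.sum(w_q * w_q)
    return np.sum(w * w_q) / denom if denom else 0.0
```

The reconstruction used at run-time is then α* · Ŵ, which only ever takes the three values {-0.5α*, 0, +0.5α*}.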

Algorithm 1 Training with the “straight-through (ST) estimator” [43]: the forward and backward approach of an approximated convolution.
Require: quantization function (9), shift parameter t. Assume L is the loss function, and I and O are the input and output tensors of a convolutional layer, respectively.
A. Forward propagation:
    1. Quantize the weights: Ŵ via (12) and (13),    #Quantization
    2. Solve Eq. (14) for α,
    3. O = Conv(I, αŴ).
B. Back propagation:
    By the chain rule of gradients and ST we have:
    1. ∂L/∂W = ∂L/∂Ŵ.
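The essence of Algorithm 1, a forward pass through quantized weights with the gradient copied straight through to the full-precision master copy, can be sketched for a toy linear layer (the quantizer and learning-rate choices here are ours):

```python
import numpy as np

def ternarize(w, alpha=0.5):
    """Stand-in quantizer to the grid {-alpha, 0, +alpha}."""
    return np.clip(np.round(w / alpha), -1, 1) * alpha

def ste_step(w, x, y_true, lr=0.1):
    """One step for y = x @ w with L = 0.5 * ||y - y_true||^2:
    the forward pass uses the ternary weights, but the update is
    applied to the full-precision copy w (STE: dL/dw := dL/dw_q)."""
    w_q = ternarize(w)
    y = x @ w_q                      # forward uses quantized weights
    grad_wq = x.T @ (y - y_true)     # gradient w.r.t. the quantized weights
    return w - lr * grad_wq          # straight-through update of w
```

Keeping the full-precision copy lets many small gradient steps accumulate before the quantized value flips between levels.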

iii Activation quantization

In order to avoid the substantial memory consumption and computational requirements caused by cumbersome floating-point calculations, we should use bitwise operations. Therefore, the activations as well as the weights must be quantized.

If activations are 1-bit values, we can quantize them after they pass through a clip function, similar to the activation quantization procedure in [37]. If activations are represented with k bits, the quantization of real-valued activations can be defined analogously through the quantization function (9) with its own scale parameter.
In this paper, we constrain the weights to ternary values {-0.5, 0, 0.5}. In order to transform the real-valued activations into ternary activations, we set the parameter k = 2. The scale parameter controls the clip threshold and can be varied throughout the process of learning. Note that quantization operations in networks cause the variance of the weights to be scaled compared to the original limit, which can cause the network's outputs to explode. XNOR-Net [17] proposes a filter-wise scaling factor, calculated continuously with full precision, to alleviate this amplification effect. In our implementation, we control the activation threshold to attenuate the amplification effect by setting the scale parameter from a pre-defined constant for each layer, which is updated in each epoch from the trained weights of that layer. The forward and backward propagation of the activation are shown in detail in Algorithm 2.

Algorithm 2 Training with the “straight-through (ST) estimator” [43]: the forward and backward approach of the activation.
Require: quantization function (9), shift parameter t; the gradient is propagated straight through the quantization; ∘ indicates the Hadamard product. Assume L is the loss function.
A. Forward propagation:
    1. Quantize the activations through the clip function,    #Quantization
B. Back propagation:
    1. Pass the gradient through unchanged inside the clip range, zero it outside,    #using STE
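Algorithm 2 thus reduces to a clipped ternarization in the forward pass and a masked pass-through gradient in the backward pass. A sketch with β as the clip/scale parameter (the per-epoch update rule for β is not reproduced here):

```python
import numpy as np

def quantize_act(a, beta=0.5):
    """Forward: clip activations to [-beta, beta], snap to {-beta, 0, +beta}."""
    a = np.clip(a, -beta, beta)
    return np.round(a / beta) * beta

def quantize_act_grad(grad_out, a, beta=0.5):
    """Backward (STE): pass the gradient through unchanged inside the
    clip range and zero it outside."""
    return grad_out * (np.abs(a) <= beta)
```

Zeroing the gradient outside the clip range is what keeps saturated activations from drifting further during training.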

Figure 2: An overview of three fusion strategy methods and the architecture of the hierarchical DFTerNet for activity recognition. (a) Early fusion, (b) Late fusion, (c) Dynamic fusion, summarized in Section iv. From the left of each sub-figure, the multi-sensor signal sources from different positions are processed by a common convolutional network in (a) and by three sub-convolutional networks in (b)&(c). Input sensor signals are of size (length of feature maps × number of sensor channels). The C1, C2, C3 ((kernel size), (sliding stride), number of kernels) are ((11,1),(1,1),50), ((10,1),(1,1),40), ((6,1),(1,1),30), respectively. The Mp1, Mp2, Mp3 sizes are (2,1), (3,1), (1,1), respectively. The fully connected layer has 1000 neurons. The tensors shown between sub-networks are the fusion weights.

iv Scalability to Multiple Sensors (IMUs)

Each activity in the OPPORTUNITY and PAMAP2 datasets is collected by multiple sensors placed on different body parts, and each sensor is independent. For different types of activities, different sensors may not have the same “contribution”. In order to improve the accuracy of our model, we conducted a comprehensive evaluation using the different feature fusion strategies shown in Figure 2. Note that the UniMiB-SHAR dataset only has 3-channel data (3D accelerometer), so we apply early fusion to it.

Early fusion. All joints from multi-sensors in different parts are stacked as input of the network  [22, 44].

Late fusion. Independent sensors at different signal sources pass through their own sub-networks, and their conv3 feature maps are concatenated by fusion weights as in [19, 45]; the fused feature maps are the fusion-weighted concatenation of the per-sensor conv3 outputs.

Dynamic fusion. Different parts of the body (different sensor locations) have different levels of participation in different types of activities. For example, for ankle-hand-based activities (e.g., running and jumping), the “contribution” of the back-based sensor is lower than that of the sensors on the hands and ankles. In the case of hand-based activities (e.g., opening a drawer, closing a drawer), the “contribution” of the sensors on the ankles and back is lower than that of the hands, etc. Therefore, unlike in the late fusion method, the fusion weight settings of dynamic fusion differ per sub-network. For notational simplicity, we refer to the last convolutional layer of the CNNs as conv3, with full-precision weights and feature maps corresponding to the fusion weights. This is mainly because CNNs extract low-level features at the bottom layers and learn more abstract concepts, such as parts or more complicated texture patterns, at the mid-level (i.e., conv3). These mid-level representations are more informative compared to the higher-level representations [46]. Hence, we propose a novel dynamic fusion method, which aims to randomly reduce the representations of low-“contribution” signal sources. The dynamic fusion method can be considered a “dynamic dropout” method, i.e., the clip parameter is determined dynamically by the weights (a non-fixed parameter). Given a quantized weight, each element of the fusion weight independently follows the Bernoulli distribution shown in (24), with parameters given by the quantization scale parameters of the respective sub-networks.

Train-time. The full-precision weights are first quantized by (9), as shown in (18). According to (17), the fusion weights shown in (19) are then generated by Bernoulli sampling.

Assumption. Suppose certain sub-networks are the less-“contribution” ones. Their fusion weights are sampled as above, while the fusion weights of the remaining sub-networks are kept at one. The feature maps after the dynamic fusion strategy can then be expressed as the Hadamard product (denoted ∘) of the fusion weights and the conv3 feature maps, as in (20). An example of this process is shown in Figure 3.

Run-time. The full-precision weights have already been quantized by run-time. Therefore, (18) can be skipped and only (17), (19) and (20) are used.

The use of a stochastic rounding method, instead of a deterministic one, is also chosen by TernGrad [40] and QSGD [47]. Researchers (e.g., [16, 48]) have proven that stochastic rounding has an unbiased expectation, and it has been successfully applied to low-precision training.
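A sketch of the train-time fusion step: each sub-network's conv3 feature map is thinned by a Bernoulli mask before concatenation. Deriving the Bernoulli parameter directly from the magnitude of the quantized fusion weight is our assumption, since the sampled equation did not survive extraction:

```python
import numpy as np

def dynamic_fusion(feats, w_q, rng=None):
    """feats: list of per-sub-network feature maps (same shape);
    w_q: one quantized fusion weight per sub-network. Low-weight
    sub-networks get sparser masks, mimicking a dynamic dropout."""
    rng = rng or np.random.default_rng(0)
    p = np.clip(np.abs(np.asarray(w_q, dtype=float)), 0.0, 1.0)
    masked = [f * (rng.random(f.shape) < pi) for f, pi in zip(feats, p)]
    return np.concatenate(masked, axis=-1)
```

At run-time the same masking can be replaced by a deterministic scaling, exactly as in standard dropout.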

Figure 3: An example of dynamic fusion processing at train-time. (a) the sub-network feature maps; (b) the full-precision fusion weights; (c) the quantized weights; (d) the sampled fusion weights; (e) the feature maps after fusion. From (b) to (c), the quantization function quantizes the full-precision weights to 2-bit weights using (18). From (c) to (d), the Bernoulli distribution (17) stochastically samples each element to either 0 or 1; the final fusion in (e) is a Hadamard product.

v Error and Complexity Analysis

Reconstruction Error. In (10) and (11), we defined the reconstruction error ε. In this section, we analyze the bound satisfied by ε.

Theorem 1. (Reconstruction Error Bound). The reconstruction error ε is bounded as


where n denotes the number of elements in W.

Proof. We define the approximation residue after combining all the previously generated tensors as


Through derivative calculations, (10) is equivalent to


From this, we can obtain


in which each term is an entry of the residue. According to (22) and (24), we have


in which the index varies from 0 to n.                                             

We can see from Theorem 1 that the reconstruction error ε follows an “exponential decay”. This means that, given a small filter size, i.e., a small n, the reconstruction error of the algorithm can be quite small.

Efficient Operations. Both modern CPUs and SoCs contain instructions to efficiently compute on 64-bit strings in short time cycles [49]. However, floating-point calculations require very complex logic. Calculation efficiency can be improved by several tens of times by adopting bit-count operators instead of 64-bit floating-point addition and multiplication.

In classic deep learning architectures, floating-point multiplication is the most time-consuming part. However, when the weights and activations are ternary values, floating-point calculations can be avoided. In order to efficiently reduce the computational complexity and time consumption, we design a new operation which replaces the full-precision multiply-accumulate of an input tensor and a filter. Previous works [16, 17] on 1-bit networks have been successfully implemented using Hamming-space calculation (bit-counting; the Hamming space can be used to calculate matrix multiplication and its inner products) as a replacement for matrix multiplication. For example, for x, w ∈ {-1, +1}^N, the matrix multiplication can be replaced by (26):

x · w = N - 2 × bitcount(xor(x, w))    (26)

where bitcount defines a bit-count over the bits in the rows of x and w, and xor an exclusive OR operator.

In this paper, we extend this concept to k-bit networks. The quantized input tensor and filter take ternary values, each of which can be decomposed into two binary bit-planes: one storing the positive entries and one storing the negative entries. Given a fixed scale parameter, these bit-planes are fixed as well. Therefore, we define two tensors to store the positive and the negative bit-planes, respectively. Our goal is to replace matrix multiplication with bit-counting in k-bit convolutional networks: the inner-product calculation can be carried out with two bit-counts in Hamming space, where ⊙ defines the negated XOR (XNOR) and ∧ an AND operator. Note that if k > 2, the behavior of the element-wise operator must be customized.
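For {-1, +1} vectors packed as bit-strings, the XOR identity of (26) applies directly; for the ternary case, one way to realize the “two bit-counts” described above is to keep separate bit-planes for the positive and negative entries (our encoding assumption) and combine them with AND + popcount:

```python
def popcount(x):
    """Population count of a Python integer used as a bit-string."""
    return bin(x).count("1")

def binary_dot(x_bits, w_bits, n):
    """Inner product of two {-1,+1}^n vectors (bit 1 encodes +1):
    x . w = n - 2 * popcount(x XOR w), as in (26)."""
    return n - 2 * popcount(x_bits ^ w_bits)

def ternary_dot(xp, xn, wp, wn):
    """Inner product of two {-1,0,+1}^n vectors from bit-planes:
    xp/wp mark the +1 entries, xn/wn mark the -1 entries."""
    pos = popcount(xp & wp) + popcount(xn & wn)   # element products = +1
    neg = popcount(xp & wn) + popcount(xn & wp)   # element products = -1
    return pos - neg
```

On real hardware `popcount` maps to a single instruction, which is what makes these replacements so much cheaper than floating-point multiply-accumulate.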

Batch Normalization. In previous works, weights are quantized to binary values by using a sign function [16], and to ternary values by using a positive threshold parameter [36] during train-time. However, neural networks with quantized weights fail to converge without batch normalization, because the quantized values are a rather coarse discretization of the full-precision values. Batch normalization efficiently avoids the exploding and vanishing gradient problems. In this part, we briefly discuss whether the batch normalization operation incurs extra computational cost. Simply, batch normalization is an affine function:

BN(x) = γ · (x - μ) / σ + β

where μ and σ are the mean and standard deviation, respectively, and γ and β are scale and shift parameters, respectively. More specifically, a batch normalization can be quantized to 2-bit values by applying the quantization method to its output, as in (29). Since the affine function is monotone, Equation (29) can be converted to a direct comparison of the raw input against shifted thresholds. Therefore, batch normalization can be accomplished at no extra cost.
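The “no extra cost” claim can be checked directly: since the BN affine map is monotone for γ > 0, ternarizing BN(x) against a symmetric threshold δ is the same as comparing x against two pre-computed (folded) thresholds. A numpy sketch under that γ > 0 assumption:

```python
import numpy as np

def bn(x, gamma, beta, mu, sigma):
    """Batch normalization as an affine function."""
    return gamma * (x - mu) / sigma + beta

def ternarize(y, delta=0.25):
    """Snap to {-0.5, 0, +0.5} with symmetric threshold delta."""
    return 0.5 * np.sign(y) * (np.abs(y) > delta)

def folded_thresholds(gamma, beta, mu, sigma, delta=0.25):
    """Fold BN into two comparison thresholds (valid for gamma > 0)."""
    hi = sigma * (delta - beta) / gamma + mu
    lo = sigma * (-delta - beta) / gamma + mu
    return lo, hi

def ternarize_folded(x, lo, hi):
    """Equivalent quantization applied to the raw input: BN is free."""
    return 0.5 * ((x > hi).astype(float) - (x < lo).astype(float))
```

The thresholds lo and hi are computed once per layer after training, so no per-inference arithmetic remains for BN.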

IV Experiments

To demonstrate the usefulness of quantization methods and fusion strategies in convolutional neural networks for high-precision human activity recognition on portable devices, we explore activity recognition on three well-known datasets. The extension of our quantization methods and fusion strategies to activity recognition is straightforward, providing a better game experience for virtual-realistic interactive games on VR/AR and other portable devices. Therefore, the memory requirements and the quantized weights of each layer are also analyzed in detail. Many natural activities are complex, involve several parts of the body, and are often very faint or subtle, making recognition difficult. Therefore, networks with good generalization ability are needed to robustly fuse the data features from sensors on different body parts; at the same time, an automatic method should depict the sketch of the activity feature and accurately recognize the activity.

The primary parameter of any experimental setup is the choice of datasets. To choose the optimal datasets for this study, we considered their complexity and richness. Based on the background of our research, we selected the OPPORTUNITY [51], PAMAP2 [52] and UniMiB-SHAR [53] benchmark datasets for our experiments.

i Data Description and Performance Measure

i.1 Opportunity

The OPPORTUNITY public dataset has been used in many open activity recognition challenges. It contains four subjects performing 17 different (morning) Activities of Daily Living (ADLs) in a sensor-rich environment, as listed in Table 1. The data were acquired at a sampling frequency of 30 Hz from 7 wireless body-worn inertial measurement units (IMUs), each consisting of a 3D accelerometer, 3D gyroscope and 3D magnetic sensor, as well as 12 additional 3D accelerometers placed on the back, arms, ankles and hips, accounting for a total of 145 different sensor channels. During the data collection process, each subject performed 5 ADL sessions and 1 drill session. During each ADL session, subjects were asked to perform the activities naturally (named “ADL1” to “ADL5”). During the drill session, subjects performed 20 repetitions of each of the 17 ADLs of the dataset. The dataset contains about 6 hours of recordings in total, and the data are labeled at the timestamp level. In our experiments, the training and testing sets have 63 dimensions (36-D on the hands, 9-D on the back and 18-D on the ankles, respectively).

In this paper, the models were trained on the data of ADL1, ADL2, ADL3 and the drill session, and tested on the data of ADL4 and ADL5.

i.2 Pamap2

The PAMAP2 dataset contains recordings from 9 subjects who carried out 12 activities, including household activities and a variety of exercise activities, as shown in Table 2. The IMUs and HR-monitor are attached to the hand, chest and ankle and are sampled at a constant rate of 100 Hz (note that, following  [26], the PAMAP2 dataset is downsampled in order to have a temporal resolution comparable to the OPPORTUNITY dataset). The accelerometers, gyroscopes, magnetometers, temperature and heart-rate sensors comprise 40 channels, recorded from the IMUs over 10 hours in total. In our experiments, the training and testing sets have 36 dimensions (12-D on hand, 12-D on chest and 12-D on ankle, respectively).

In this paper, data from subjects 5 and 6 are used as the testing set and the remaining data are used for training.

i.3 UniMiB-SHAR

The UniMiB-SHAR dataset contains data from 30 healthy subjects (6 male and 24 female) acquired using the 3D accelerometer of a Samsung Galaxy Nexus I9250 running Android OS version 5.1.1. The data are sampled at a constant rate of 50 Hz and split into 17 different activity classes: 9 safe activities and 8 dangerous activities (e.g., falling actions), as shown in Table 3. Unlike the OPPORTUNITY dataset, this dataset has no NULL class and remains relatively balanced. In our experiments, the training and testing sets have 3 dimensions.

i.4 Performance Measure

ADL datasets like the OPPORTUNITY dataset are often highly unbalanced. For such datasets, the overall classification accuracy is not an appropriate measure of performance, because the recognition rate of the majority classes can skew the statistics to the detriment of the least represented classes. As a result, much previous research uses an evaluation metric that is independent of the class repartition: the F1-score. The F1-score combines two measures, precision P and recall R: P is the number of correct positive examples divided by the number of all examples the classifier returned as positive, and R is the number of correct positive results divided by the number of all positive samples. The F1-score is the harmonic mean of P and R, where the best value is 1 and the worst is 0:

F1 = 2 * (P * R) / (P + R).

In this paper, we use an additional evaluation metric to ease comparison with prior work: the weighted F1-score, i.e., the sum of the class-wise F1-scores, weighted by the class proportions:

F_w = sum_c 2 * (n_c / N) * (P_c * R_c) / (P_c + R_c),

where n_c is the number of samples in class c and N is the total number of samples.
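A minimal pure-Python sketch of this weighted F1-score may clarify the metric (function and variable names are ours; the paper defines the formula but not an implementation):

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Weighted F1-score: sum of per-class F1 values, each weighted by
    that class's proportion n_c / N of the ground-truth samples."""
    support = Counter(y_true)          # n_c for each class c
    total = len(y_true)                # N
    score = 0.0
    for c in support:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            # class contribution: 2 * (n_c / N) * P_c * R_c / (P_c + R_c)
            score += (support[c] / total) * 2 * precision * recall / (precision + recall)
    return score
```

A perfect prediction yields 1.0, and rare classes contribute in proportion to their support, which is exactly what protects against the NULL-class skew described above.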

Class Proportion Class Proportion
Open Door 1/2 1.87%/1.26% Open Fridge 1.60%
Close Door 1/2 6.15%/1.54% Close Fridge 0.79%
Open Dishwasher 1.85% Close Dishwasher 1.32%
Open Drawer 1/2/3 1.09%/1.64%/0.94% Clean Table 1.23%
Close Drawer 1/2/3 0.87%/1.69%/2.04% Drink from Cup 1.07%
Toggle Switch 0.78% NULL 72.28%
Table 1: Classes and proportions of the OPPORTUNITY dataset
Class Proportion Class Proportion
Lying 6.00% Sitting 5.78%
Standing 5.92% Walking 7.45%
Running 3.06% Cycling 5.13%
Nordic walking 5.87% Ascending stairs 3.66%
Descending stairs 3.27% Vacuum cleaning 5.47%
Ironing 7.44% House cleaning 5.84%
Null 35.12%
Table 2: Classes and proportions of the PAMAP2 dataset
Class Proportion Class Proportion
StandingUpfromSitting 1.30% Walking 14.77%
StandingUpfromLaying 1.83% Running 16.86%
LyingDownfromStanding 2.51% Going Up 7.82%
Jumping 6.34% Going Down 11.25%
F(alling) Forward 4.49% F and Hitting Obstacle 5.62%
F Backward 4.47% Syncope 4.36%
F Right 4.34% F with ProStrategies 4.11%
F Backward SittingChair 3.69% F Left 4.54%
Sitting Down 1.70%
Table 3: Classes and proportions of the UniMiB-SHAR dataset

ii Experimental Setup

Sliding Window Our selected data are recorded continuously. The continuous HAR data stream can be thought of as a video: we use a sliding time-window of fixed length to segment the stream, and each segment can be viewed as one frame of the video (a picture). We define T, N_c and s as the length of the time-window, the number of sensor channels and the sliding stride, respectively. Through this segmentation, each "picture" is a T x N_c matrix. We set the segmentation parameters as in  [44]: a time-window of 2 s on the OPPORTUNITY and PAMAP2 datasets, resulting in T = 64, and a time-window of 2 s on the UniMiB-SHAR dataset, resulting in T = 96. Because of the timestamp-level labeling, a segment can contain multiple labels; the majority label, i.e., the one that appears most frequently among the segment's timestamps, is chosen.
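The segmentation step above can be sketched as follows (a simplified illustration using plain Python lists; `T` and `stride` stand for the window length and sliding stride defined in the text):

```python
def sliding_windows(data, labels, T, stride):
    """Segment a continuous recording (one row per timestamp, one column
    per sensor channel) into fixed-length windows. Each window keeps the
    majority label among its timestamps, mirroring the paper's scheme."""
    windows, window_labels = [], []
    for start in range(0, len(data) - T + 1, stride):
        seg = data[start:start + T]
        seg_labels = labels[start:start + T]
        # majority vote over the window's timestamp-level labels
        majority = max(set(seg_labels), key=seg_labels.count)
        windows.append(seg)
        window_labels.append(majority)
    return windows, window_labels
```

For example, a 10-timestamp stream with T = 4 and stride = 2 yields four overlapping windows, each tagged with its dominant label.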

Dynamic Fusion Weights Our selected datasets (OPPORTUNITY and PAMAP2) cover two families of human activities: periodic activities (the locomotion category of the OPPORTUNITY dataset and the whole of the PAMAP2 dataset) and sporadic activities (the gestures category of the OPPORTUNITY dataset). To design dynamic fusion strategies for these two families, we design one group of fused feature maps for each. For periodic activities, we take into account that back-worn sensors "contribute" less, and set the fusion weights accordingly. For sporadic activities, both back-worn and ankle-worn sensors "contribute" less, and the fusion weights are set accordingly. Formally, at train-time and run-time, following (17), (18) and (19), the feature maps after the dynamic fusion strategies can be expressed as (32); the effectiveness of the strategies designed for the two families of activities is verified in Section i.
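As a hypothetical illustration of the dynamic-fusion idea (source names and keep-probabilities below are ours; the paper derives its Bernoulli parameters from the quantization scale parameter, which we do not reproduce here):

```python
import random

def dynamic_fuse(features, keep_probs, seed=None):
    """Fuse per-source feature vectors with element-wise Bernoulli masks.
    `features` maps a sensor source to its feature list; `keep_probs`
    gives each source's Bernoulli parameter: lower for sources that
    should "contribute" less (e.g., back-worn sensors for periodic
    activities). Illustrative sketch only, not the paper's exact rule."""
    rng = random.Random(seed)
    fused = []
    for source, feats in features.items():
        p = keep_probs[source]
        # each element survives with probability p, else is zeroed out
        fused.extend(f * (1 if rng.random() < p else 0) for f in feats)
    return fused
```

With keep probability 1.0 a source passes through untouched, while probability 0.0 suppresses it entirely; intermediate values realize the graded "contribution" described above.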


Pooling Layer A common pooling layer computes the maximum (max-pooling) or the average (avg-pooling) of each filter's output. Our experiments do not use avg-pooling, because averaging generates values outside the ternary set {-0.5, 0, 0.5}. We also observe that applying max-pooling directly to the quantized activations skews their value distribution, resulting in a noticeable drop in recognition accuracy. Therefore, we place the max-pooling layer before the batch normalization (BN) and activation (A) layers.
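The motivation for avoiding avg-pooling can be seen with a toy example on ternary values (a sketch, not the paper's implementation):

```python
def max_pool(window):
    """Max-pooling: the result always stays inside the ternary set
    {-0.5, 0, 0.5}, since it merely selects one of the inputs."""
    return max(window)

def avg_pool(window):
    """Avg-pooling: the result can fall outside the ternary set, which
    is why the text avoids it for quantized activations."""
    return sum(window) / len(window)
```

For the window [-0.5, 0.0, 0.5, 0.5], max-pooling returns 0.5 (still ternary), whereas avg-pooling returns 0.125, a value no 2-bit representation in this scheme can hold.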

iii Baseline Model

The aim of this paper is not necessarily to exceed current state-of-the-art accuracies, but rather to demonstrate and analyze the impact of network quantization and fusion strategies. The benchmark model should therefore not be overly complex, because increasing the network topology and computational complexity to improve performance runs counter to the goal of deploying advanced networks on portable devices. Instead, we improve the model's performance through a training strategy that better matches practical applications. We therefore chose the CNN architecture of  [44] as the baseline model. It contains three convolutional blocks, a dense layer and a soft-max layer. Each convolutional kernel performs a 1D convolution on each sensor channel independently over the time dimension. To fairly evaluate the computation and memory costs of binarization (1-bit) and ternarization (2-bit) on CNNs, we employ the same number of channels and convolution filters in all compared models. For the early-fusion Binary Convolutional Networks [16] (termed BNN) and the late-fusion Binary Convolutional Networks [16] (termed FBNN), see (26). (Note that when the quantized weights are binary, each element of the fusion weights is 1; a dynamic-fusion Binary Convolutional Network can therefore be considered an FBNN.)

Layer-wise details are shown in Table 4, in which "Conv2" is the most computationally expensive and "Fc" consumes the most memory. For example, at floating-point precision on the OPPORTUNITY dataset, the entire model requires approximately 82 MFLOPs (FLOPs here consist of equal numbers of FMULs and FADDs) and approximately 2 million weights, thus 0.38 MBytes of storage for the model weights. During train-time the model requires more than 12 GBytes of memory (batch size of 1024); for inference during run-time this can be reduced to approximately 1.8 GBytes (2-bit).

             OPPORTUNITY            PAMAP2                 UniMiB-SHAR
Layer Name   Params (b)  FLOPs     Params (b)  FLOPs     Params (b)  FLOPs
Conv1        0.6k        4.84M     0.6k        2.76M     0.6k        0.23M
Conv2        20k         68.18M    20k         38.96M    20k         3.25M
Conv3        7.2k        5.47M     7.2k        3.12M     7.2k        0.26M
Fc           1.89M       3.78M     1.89M       2.16M     1.89M       0.18M
Table 4: Details of the learnable layers in our experimental model (column groups, left to right: the OPPORTUNITY, PAMAP2 and UniMiB-SHAR datasets).

iv Implementation Details

In this section, we provide the implementation details of the convolutional network architecture. Our method is implemented in PyTorch. The model is trained with a mini-batch size of 1024 for 50 epochs; the activation scale parameter is initialized to 1, and we use AdaDelta with its default initial learning rate  [54]. Cross-entropy is used as the loss function, and a soft-max function normalizes the output of the model. The probability that a sequence belongs to the i-th class is given by (33):

P(i | o) = exp(o_i) / sum_j exp(o_j),

where o is the output vector of the model and the sum runs over the C activity classes.
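The soft-max normalization can be sketched as follows (standard numerically-stable form; the paper's exact notation is not reproduced):

```python
import math

def softmax_probs(outputs):
    """Normalize raw model outputs into class probabilities.
    Subtracting the max first is the usual numerical-stability trick;
    it does not change the result mathematically."""
    m = max(outputs)
    exps = [math.exp(o - m) for o in outputs]
    s = sum(exps)
    return [e / s for e in exps]
```

The returned probabilities always sum to 1, and larger raw outputs map to larger probabilities, so the arg-max class is preserved.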

Experiments were carried out on a platform with two Intel E5-2600 CPUs, 128 GB of RAM and an NVIDIA TITAN Xp 12 GB GPU. The hyper-parameters of the model are given in Figure 2 (the early-fusion model is a common convolutional network architecture and can be regarded as a sub-network; its hyper-parameters therefore equal those of any sub-network of the late-fusion or dynamic-fusion models). The training procedure, i.e., DFTerNet, is summarized in Algorithm 3.

Algorithm 3 Training an L-layer DFTerNet. C is the loss function for the minibatch, the gradient can be seen as propagating straight through the quantization, and lambda is the learning-rate decay factor. The Hadamard product is used element-wise. BatchNorm() specifies how to batch-normalize the output of a convolution. BackBatchNorm() specifies how to backpropagate through the normalization  [50]. Update() specifies how to update the parameters when their gradients are known, such as AdaDelta  [54].
Require A minibatch of inputs and targets, the previous weights W, the weight and activation bit-widths, the shift threshold parameter and the learning rate.
Ensure Updated weights W'.
  1. Computing the parameter gradients:
  1.1. Forward propagation:
    for k = 1 to L do
      Quantize the weights of layer k with (12)
      Compute the layer output with (14)
      Apply max-pooling
      if k < L then
        Quantize the activations with (16)
  1.2. Backward propagation:
  {note that the gradients are full-precision.}
  Compute the output gradient knowing the network output and the targets
  for k = L to 1 do
    if k < L then
      Backpropagate through the activation quantization by Algorithm 2
    end if
  end for
  2. Accumulating the parameter gradients:
  for k = 1 to L do
    With the gradient of layer k known, compute the weight update by Algorithm 1
  end for

V Results and Discussion

In this section, the proposed quantization method and fusion strategies are evaluated on three well-known benchmark datasets. We consider: 1) how the proposed dynamic-fusion models compare with the baseline models, 2) the effect of the weight shift threshold parameter, and 3) the trade-off between quantization and model accuracy. In the first method (called the Baseline, or TerNet), the required sensor signal sources are stacked together. In the second method (referred to as FTerNet), each sensor signal source is processed by its own sub-network and the learned representations are fused before the dense layer, i.e., each element of the fusion weights equals 1. The model proposed in this paper (DFTerNet) differs from the second method in how it handles the fusion part: each element of the fusion weights is sampled from a Bernoulli distribution given by the scale parameter of the quantization method that we propose.
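For intuition, a threshold-based ternarization in the spirit of the paper's weight quantization might look as follows (a hedged sketch: the actual operator derives its threshold and scale from the weight statistics and the shift parameter, which we simplify here to a fixed, assumed `delta`):

```python
def ternarize(weights, delta):
    """Map full-precision weights to the ternary set {-0.5, 0, 0.5}:
    weights whose magnitude exceeds `delta` keep their sign and become
    +/-0.5, the rest are zeroed. Illustrative only; the paper computes
    its threshold from the weights and the shift parameter."""
    return [0.5 if w > delta else -0.5 if w < -delta else 0.0
            for w in weights]
```

Small weights collapse to zero, which is also the source of the growing sparsity reported later in Section iii.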

i Multi-sensor Signals Fusion

In order to evaluate the different fusion strategies described in Section iv, an ablation study was performed on the OPPORTUNITY and PAMAP2 datasets. The first set of experiments compared the three fusion strategies on each dataset. As the bold scores in Table 6 show, the fusion strategies rank as follows: matched dynamic fusion first, late fusion second and early fusion last. The reason is that each sensor signal source is better served by its own sub-network, whereas a single network is ill-suited to unify all signal sources. Moreover, different signal sources correlate differently with the activity types, so the recognition result should be more reliable when the signal sources are highly correlated with the activity type. On both counts, the recognition result should be weighted by the learned representations of the multiple signal sources, and the weight of each source's representation should reflect the similarity between that source and the activity type.

Shift threshold   2.6     2.7     2.8     2.9     3.0
DFTerNet-A        0.884   0.897   0.910   0.909   0.893
DFTerNet-B        0.879   0.894   0.905   0.905   0.891
Table 5: Effect of the shift threshold parameter's value on activity recognition performance (weighted F1-score) on the OPPORTUNITY dataset; A and B denote the two matched dynamic-fusion variants.

ii Analysis of Weight Shift Threshold Parameter

In our DFTerNet, the result of the weight shift operation directly affects the subsequent fusion weights. The second set of experiments therefore considers the effect of this parameter's value. As mentioned in the previous section, the fusion weights are sampled from a Bernoulli distribution tied to this parameter. We use the two matched dynamic-fusion variants on the OPPORTUNITY dataset as test cases to compare different values. In this experiment, the parameter settings are the same as described in Section ii and Section iv. Table 5 summarizes the results for matched dynamic fusion. The proposed quantization method achieves its best performance at a threshold of 2.8 or 2.9. A similar phenomenon can also be found in [13].

                          O                              P               U
Method                    locomotion      gestures       Activities      ADLs and falls
Early fusion [44]         0.876 ± 0.09    0.881 ± 0.11   0.867 ± 0.09    0.7981 ± 0.12
BNN [16]                  0.752 ± 0.17    0.751 ± 0.20   0.733 ± 0.14    0.6481 ± 0.21
TerNet (Early fusion)     0.865 ± 0.14    0.876 ± 0.19   0.850 ± 0.11    0.7727 ± 0.20
Late fusion [45]          0.897 ± 0.06    0.917 ± 0.05   0.908 ± 0.05    -
FBNN [16]                 0.765 ± 0.16    0.773 ± 0.19   0.764 ± 0.15    -
FTerNet (Late fusion)     0.883 ± 0.10    0.908 ± 0.14   0.893 ± 0.13    -
Dynamic fusion-A          0.915 ± 0.04    -              0.914 ± 0.06    -
DFTerNet-A                0.909 ± 0.06    -              0.901 ± 0.11    -
Dynamic fusion-B          -               0.920 ± 0.07   -               -
DFTerNet-B                -               0.910 ± 0.10   -               -
  • Dynamic fusion-A matches the locomotion category of OPPORTUNITY and the PAMAP2 dataset.

  • Dynamic fusion-B matches the gestures category.

Model       O        P        U
(F)BNN      20k      10k      0.9k
TerNet      39k      20k      1.8k
FTerNet     40k      24k      1.9k
DFTerNet    34k      17k      1.8k
FP          0.38M    0.22M    17k
Table 6: (a) Weighted F1 performances of different fusion strategies, (F)BNNs (1-bit) and our proposed models (2-bit) for activity recognition on the OPPORTUNITY (O), PAMAP2 (P) and UniMiB-SHAR (U) datasets. (b) Model sizes: the proposed quantization method generates 2-bit weight and activation convolutional networks that make faithful inferences with roughly 11x fewer parameters than the full-precision (FP) counterpart.

iii Visualization of the Quantized Weights

In addition to analyzing the quantized weights, we looked inside the learned layers and checked their values. We plot the heatmap of the fraction of zero values produced by the locomotion-matched DFTerNet on the locomotion category of the OPPORTUNITY dataset across epochs. As Figure 5 shows, the fraction of zero values increases in later epochs; similar phenomena also appear for DFTerNet on the PAMAP2 dataset and for the gestures-matched DFTerNet on the gestures category of the OPPORTUNITY dataset. Section v proves the reconstruction-error boundary, which the model can drive to a very small value. Table 4 shows that these layers contain most of the free parameters; the increased sparsity at the end of training indicates that our quantization method can avoid overfitting, with the sparsity acting as a regularizer.

Figure 4: Validation weighted F1-score curves on the three datasets.

iv The Trade-off Between Quantization and Model Accuracy

The third set of experiments explores the model accuracy of the quantization method. As in the first and second sets of experiments, a four-layer convolutional network is used, and the sliding-window parameters and batch size are kept identical. The weight shift threshold parameter is set to 2.8. Finally, TerNet, FTerNet and DFTerNet are compared against their full-precision counterparts. Table 6 shows the weighted F1-score of the different full-precision models and their quantized counterparts, described in Figure 2, along with the memory usage of all models mentioned above. Using the proposed quantization method, the performance difference between the 2-bit TerNet and FTerNet networks and their full-precision counterparts is very small, while they beat the 1-bit BNN and FBNN by a large margin. On the UniMiB-SHAR dataset, which has few channels (i.e., 3 channels), both BNN and TerNet perform worse than the full-precision counterpart; however, the accuracy gap between TerNet and the full-precision counterpart is smaller than the gap for BNN, so TerNet beats BNN again.

Figure 4 shows the validation weighted F1-score curves on the three datasets. As the figure shows, our quantized models (TerNet, FTerNet and DFTerNet) converge almost as fast and as stably as their full-precision counterparts, demonstrating the robustness of the proposed quantization technique.

The efficiency of the Hamming-distance computation in (27) is another contribution of this work. Compared with (16/32)-bit floating point, 2-bit (or even 8-bit) operations not only reduce the energy cost of Hardware/IC design (see Figure 1) but also halve the memory access cost and the run-time memory footprint, which greatly facilitates the deployment of binary/ternary convolutional neural networks on portable devices [55]. For example, training a dynamic-fusion model on the OPPORTUNITY dataset took about 12 minutes on an NVIDIA TITAN Xp 12 GB GPU. Inference with the full-precision network on a CPU takes about 15 seconds; we estimate the DFTerNet inference time at 1.8 seconds on a mobile CPU. This shows that the proposed quantization technique can achieve a roughly 9x speedup.
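The bit-count trick behind this speedup can be illustrated for the binary case (a sketch; a 2-bit ternary scheme additionally carries a zero-mask bit-plane, which we only note in a comment):

```python
def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n {-1, +1} vectors packed as integers,
    where bit i set means +1. Matching bits contribute +1 and differing
    bits contribute -1, so dot = n - 2 * popcount(a XOR b). A ternary
    scheme would AND in a second bit-plane masking out the zeros before
    the same XOR/popcount core."""
    diff = bin((a_bits ^ b_bits) & ((1 << n) - 1)).count("1")
    return n - 2 * diff
```

For example, [+1, +1, -1, +1] packed as 0b1011 against the all-ones vector 0b1111 gives 1 + 1 - 1 + 1 = 2, computed with one XOR and one population count instead of n multiply-adds; this is the hardware-friendly replacement for floating-point matrix multiplication discussed above.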

Figure 5: Visualization of the fraction of zero values at each epoch for DFTerNet on the locomotion category of the OPPORTUNITY dataset.

Vi Conclusion and Future Work

In this paper, we presented DFTerNet, a new network quantization method together with a novel dynamic fusion strategy, to address the problem of recognizing activities from multi-sensor signal sources and deploying the models on portable devices with low computational capability. First, the proposed quantization method consists of two operations controlled by the scale parameter: weight quantization and activation quantization. Second, the bit-count scheme that replaces matrix multiplication is hardware-friendly, realizing a 9x speedup while requiring 11x less memory. Third, a novel dynamic fusion strategy was proposed: unlike existing methods, which treat the representations from different sensor signal sources equally, it learns each sensor signal source separately and shrinks the representations of less "contributing" sources via fusion weights sampled from a Bernoulli distribution given by the scale parameter. The experiments performed demonstrated the effectiveness of both the proposed quantization method and the dynamic fusion strategy. In future work, we plan to extend the quantization method to the gradients and errors, so that training as well as inference can run directly on portable devices; since improving model performance requires continuous online learning, separating training from inference would limit that.


The authors would also like to thank the associate editor and the anonymous reviewers for their comments, which helped improve the paper.