Fully Convolutional Network Bootstrapped by Word Encoding and Embedding for Activity Recognition in Smart Homes

12/01/2020 ∙ by Damien Bouchabou, et al.

Activity recognition in smart homes is essential when we wish to propose automatic services for the inhabitants. However, it poses challenges in terms of variability of the environment, of the sensorimotor system, and of user habits. As a result, end-to-end systems fail to automatically extract key features without extensive pre-processing. We propose to tackle feature extraction for activity recognition in smart homes by merging methods from the Natural Language Processing (NLP) and the Time Series Classification (TSC) domains. We evaluate the performance of our method on two datasets from the Center for Advanced Studies in Adaptive Systems (CASAS). Moreover, we analyze the contribution of the NLP bag-of-words encoding with embedding, as well as the ability of the Fully Convolutional Network (FCN) algorithm to automatically extract features and classify. The method we propose shows good performance in offline activity classification. Our analysis also shows that FCN is a suitable algorithm for smart home activity recognition and highlights the advantages of automatic feature extraction.




1 Introduction

Human Activity Recognition (HAR) has been the focus of research efforts due to its key role in different ambient assisted living (AAL) domains, as well as the increasing demand for home automation and convenience services in daily activities. The main task of HAR is to recognize human activities from the data collected through environmental sensors and Internet of Things (IoT) devices. HAR systems use different sensor technologies, such as cameras, wearables or low-level smart sensors, to track human activities, as described in [hussain2019different].

Recent advances in IoT technologies and the reduction of the cost of sensors are leading to the proliferation of these ambient devices and the development of smart homes. This is why in this work we will focus more on IoT-based HAR, as opposed to video or wearable-based HAR.

Along with the development of the hardware, HAR algorithms also need to solve the challenges of HAR in smart homes. Indeed, the number, the type and the placement of sensors can significantly influence the performance of HAR systems. A system suitable for a given home may be completely inadequate in another, due to a different house configuration or different user habits. The algorithms thus need to be robust to the variability of environments. Besides, while video-based HAR can leverage rich and redundant information from images and video streams, IoT-based HAR faces the challenges of sparse and incomplete information and of redundancy. In contrast to videos, where objects and people appear on several pixels and over several video frames, IoT sensors only detect changes in the environment that are within their range of detection and in their field of view, and are oblivious to most changes in the environment, which occur outside these ranges. When a change is captured, this detection often translates into a signal with a single value from one sensor. This sparsity entails the redundancy challenge: a set of signals from the same set of sensors can be caused by different activities. Thus, algorithms for HAR in smart homes need to address the challenges of variability, sparsity and redundancy.

To adapt to variations of environments and uses, algorithms for HAR have turned to machine learning methods, and more specifically Deep Learning (DL) algorithms. To deal with sparsity and redundancy, first, algorithms that can learn long-term dependencies have been developed so as to understand the context of sensor signals. Second, studies have also tried to introduce domain knowledge and contextualization of sensor signals through a good feature representation of sensor events. But handcrafted features need a lot of pre-processing and reduce the adaptability of the algorithms to various environments. Therefore, HAR algorithms need to automatically extract domain-relevant representations.

In recent years, there have been significant improvements in DL techniques. They have been successfully applied to Natural Language Processing (NLP), for the automatic extraction of good feature representations through word embedding techniques, and to Time Series Classification (TSC), as classifiers.

Our contributions are the following: 1) We apply for the first time the Fully Convolutional Network (FCN) classifier from TSC to activity recognition in smart homes. 2) We propose to use frequency-based encoding with word embedding from NLP to improve automatic feature extraction. 3) We design an end-to-end framework to automatically extract key features and classify daily activities in smart homes by merging a TSC classifier and NLP word encoding. 4) Finally, we show that domain knowledge gained by event encoding and embedding significantly improves the performance of classifiers.

In the following section, we review the state-of-the-art HAR classifiers for smart homes, a TSC classifier and the existing feature representation methods, in particular those used in NLP applications. In Section 3, we propose a framework combining a TSC algorithm and an NLP sequence feature extraction method. In Section 5, we report on the performance of our proposed framework before concluding.

2 Related Works

In this section, we describe the algorithms developed for HAR, and more generally for Time Series Classification. We then examine how TSC can be bootstrapped by incorporating domain knowledge in feature encoding, as in Natural Language Processing.

2.1 Traditional HAR Approaches

To recognize human activities based on sensor traces, researchers have used various machine learning algorithms, as reviewed in [sedky2018evaluating]. These can be divided into two streams: the algorithms exploiting a spatiotemporal representation, with Naive Bayes, Dynamic Bayesian Networks and Hidden Markov Models; and the algorithms based on feature classification, with Decision Trees, Support Vector Machines or Conditional Random Fields.

Most of these traditional HAR approaches commonly use handcrafted feature extraction methods. Automatic feature extraction is one of the challenges addressed by DL.

2.2 Deep Learning Approaches

Recently, a variety of DL algorithms have been applied to HAR to overcome those limitations and improve performance. DL methods learn the features directly and hierarchically from the raw data, to uncover high-level features. Long Short-Term Memory (LSTM) networks can be seen as a very successful extension of Recurrent Neural Networks (RNN), explicitly designed to deal with long-term dependencies. LSTMs allow automatic learning of temporal information from the sensor data without the need for handcrafted features or kernel fusion approaches, and have led to good performance in HAR in smart homes, as reported in [singh2017human, liciotti_lstm]. [liciotti_lstm] evaluated different LSTM structures for HAR in smart homes. They show that the LSTM approach outperforms traditional HAR approaches in terms of classification score without using handcrafted features. LSTMs are a viable solution to significantly improve the HAR task in smart homes, but they suffer from long training times.

Another DL approach, focusing more on pattern detection, is the Convolutional Neural Network (CNN). CNNs have three advantages for HAR. They can capture local dependencies, that is, the importance of neighboring observations correlated with the current event. They are scale invariant in terms of step difference or event frequencies. In addition, they are able to learn a hierarchical representation of data. Researchers used 2D [gochoo2018unobtrusive, mohmed2020employing] and 1D [singh2017convolutional] CNNs for HAR in smart homes. The 2D CNN obtained good classification results, but this approach is not robust enough to deal with unbalanced datasets and unlabeled events, and is not suitable for online recognition. 1D CNNs are competitive with LSTMs on sequence problems [singh2017convolutional]. In general, LSTMs obtain better performance due to their capacity to use long-term dependencies, but CNNs are faster to train and reach accuracy levels close to those of LSTMs.

The FCN is a particular CNN, with only convolutional layers and no dense layers. FCN has shown compelling quality and efficiency for semantic segmentation of images [long2015fully]. Due to its performance on feature extraction, researchers transferred the FCN to TSC problems [wang2017time]. [fawaz2019deep] compared the FCN against other TSC algorithms and obtained high classification performance: FCNs ranked first on 18 datasets out of 97 and in the top five on the others. However, no application of FCN to HAR in smart homes has been reported. For this reason, we propose to apply the FCN to HAR in smart homes as a high-level feature extractor and classifier.

2.3 NLP and TSC coupling

Works such as [tahir2020key, yan2019using] have shown the importance of a good feature representation, but designing features for HAR applications is a tedious task.

DL algorithms can automatically extract features, and in NLP they have been widely shown to improve feature representation when combined with word pre-processing for text classification. Researchers have devised many language models and different word encodings, such as n-gram, term frequency, term frequency-inverse document frequency and bag-of-words. More recently, they have used DL algorithms such as word2vec, GloVe, ELMo and Transformers, coupled with the aforementioned encodings, to obtain meaningful word encodings [kowsari2019text, li2020survey]. Whereas DL algorithms infer features from the current input and, to a lesser extent, from past inputs, these encodings incorporate more general domain knowledge from the whole corpus. Their strong capacity to generate features from raw data and to model word sequences increases the performance of DL classifiers. We propose to transpose the previously cited NLP techniques to smart home HAR problems in order to automatically generate key features.

Thus, we introduce in this article a DL methodology for HAR in smart homes inspired by NLP and TSC. We propose to combine term frequency encoding and embedding with an FCN: the encoding and embedding incorporate domain knowledge of events in a first level of feature extraction, while the FCN performs a higher level of feature extraction and the activity classification. The choice of the FCN algorithm from TSC is motivated by the output of the NLP embedding, which transforms the sequence classification problem into a multivariate TSC problem. To our knowledge, this is the first time that a study has used FCN in smart home activity recognition and has combined it with embedding techniques to build an end-to-end system that automatically extracts key features and classifies activities in smart homes.

3 Methodology

We merge NLP encoding and the FCN classifier from TSC to deal with HAR in smart homes. This coupling makes it possible to automatically generate key features and to classify activities.

The framework architecture of the proposed method is shown in Figure 1. First raw data from sensors are encoded into a sequence of indexes (section 3.2), then are split using a sliding window (section 3.4). The sliding windows are then processed through an embedding to extract a first level of features, and finally classified by the FCN (section 3.3).

Figure 1: Framework architecture of the proposed method

3.1 Problem Definition

The activity recognition problem is a classification problem. The goal is to assign an activity label to sequences of sensor events. We model our problem as follows. A set of sensors produces events $e_i$. An event is the value or the state returned by a sensor when it emits a signal: $e_i = (s_i, v_i, t_i)$, where $s_i$ is the sensor ID, $v_i$ the value returned by the sensor and $t_i$ the time when the sensor changes its state or value. A sequence $S$ is a trace of activity: it is a list of events $S = [e_1, \dots, e_n]$. Each sequence $S$ can be associated with an activity label $a$ by semantic segmentation.

In this paper, we do not take into consideration the timestamp at which an event occurs. We simply ignore the timestamp parameter in our experiments, because we want to be able to recognize an activity regardless of the time of day. For example, the activity “Sleeping” generally occurs during the night, but it can occur at any time of day: some people work during the day and sleep at night, while others work at night and sleep during the day.

3.2 NLP Encoding

Our hypothesis is to process sensor events like words and activity sequences as text sentences; these sentences describing the activities carried out by the inhabitants.

First, each activity sequence is extracted from the dataset, as a sentence would be in NLP. Thanks to the labels provided by the dataset, it is possible to know the beginning and the end of each activity. As previously described, an event is composed of the sensor ID $s$, the value $v$ and the timestamp $t$. By concatenating the sensor ID $s$ with its value $v$ and ignoring the timestamp $t$, for the reasons explained previously, a sensor word is created; for example, a sensor ID such as M001 with the value ON becomes the word M001ON. All these different text words define the smart home vocabulary used to describe activities.

Then, as in NLP, each word in the sequences is transformed into an index to be usable by a neural network. In NLP the index starts at 1, the value 0 being reserved for sequence padding. Indexes are assigned based on word frequency: if a given sensor word has the highest occurrence in the dataset, it is assigned the lowest index, i.e., 1.
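
The encoding steps described above can be illustrated with a minimal sketch. The sensor IDs, values and helper names (`event_to_word`, `build_vocabulary`, `encode`) are hypothetical and chosen for illustration only; this is not the authors' implementation.

```python
from collections import Counter

def event_to_word(sensor_id, value):
    """Concatenate the sensor ID and its value; the timestamp is ignored."""
    return f"{sensor_id}{value}"

def build_vocabulary(sentences):
    """Assign indexes by decreasing frequency, starting at 1 (0 is kept for padding)."""
    counts = Counter(word for sentence in sentences for word in sentence)
    # most_common() sorts by decreasing count; ties keep first-seen order
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(sentence, vocab):
    """Replace each sensor word by its index."""
    return [vocab[word] for word in sentence]

# Hypothetical activity sequences of (sensor ID, value, timestamp) events
raw = [
    [("M001", "ON", "t0"), ("M001", "OFF", "t1"), ("M002", "ON", "t2")],
    [("M001", "ON", "t3"), ("D001", "OPEN", "t4"), ("M001", "ON", "t5")],
]
sentences = [[event_to_word(s, v) for s, v, _ in seq] for seq in raw]
vocab = build_vocabulary(sentences)
encoded = [encode(sentence, vocab) for sentence in sentences]
```

In this toy example the word `M001ON` occurs three times, so it receives index 1, and no word is ever mapped to 0.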

Sequences are then passed through an embedding layer, which transforms index tokens (words) into automatically learned feature vectors. This creates a simple word embedding that helps the network build an internal representation of each word, in our case of each sensor event.
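
As a rough illustration of what an embedding layer does at inference time, the sketch below treats it as a lookup table with one row per index. It assumes randomly initialised vectors and hypothetical helper names (`make_embedding`, `embed`); in the actual framework the vectors are learned during training rather than left at their initial values.

```python
import random

def make_embedding(vocab_size, dim, seed=0):
    """One vector per index; row 0 is the row looked up by the padding index."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(vocab_size + 1)]

def embed(indices, table):
    """Turn a window of indexes into a multivariate series of shape [len][dim]."""
    return [table[i] for i in indices]

table = make_embedding(vocab_size=4, dim=64)
window = [0, 0, 1, 2, 3]          # hypothetical zero-padded window of sensor words
series = embed(window, table)     # 5 time steps, 64 features each
```

Each window of indexes thus becomes a multivariate time series, which is the form of input a TSC classifier expects.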

3.3 FCN Structure

The FCN is a particular CNN. Its structure only contains convolutional layers, i.e., it has no fully connected layers for the classification part. The same structure as in [wang2017time, fawaz2019deep] is used in this paper.

This structure (Figure 2) is composed of three blocks, each described by Eq. 1:

$$y = W \otimes x + b, \qquad s = \mathrm{BN}(y), \qquad h = \mathrm{ReLU}(s) \tag{1}$$

where $x$ is the input, $W$ the weight matrix, $b$ the bias, $\otimes$ the convolution operator and $h$ the hidden representation. Each block consists of a 1D convolutional layer followed by Batch Normalization (BN) and a rectified linear unit (ReLU) activation, which speed up convergence and help improve generalization.
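
To make the block concrete, here is a minimal pure-Python sketch of one convolution block (convolution, then batch normalization, then ReLU) applied to a single series. Several simplifications are assumptions made for illustration: the normalization uses the statistics of this one sample with a scale of 1 and a shift of 0, whereas a real BN layer uses batch statistics and learned parameters, and the weights are arbitrary.

```python
import math

def conv1d_same(x, W, b):
    """x: [T][C_in], W: [K][C_in][C_out], b: [C_out] -> [T][C_out].

    Stride 1 with zero padding, so the output keeps the input length.
    As in DL frameworks, the kernel is not flipped (cross-correlation).
    """
    T, K, c_out = len(x), len(W), len(b)
    left = (K - 1) // 2  # zeros implicitly added before the series
    y = []
    for t in range(T):
        row = []
        for f in range(c_out):
            acc = b[f]
            for k in range(K):
                src = t + k - left
                if 0 <= src < T:  # positions outside the series count as zeros
                    for c in range(len(x[src])):
                        acc += W[k][c][f] * x[src][c]
            row.append(acc)
        y.append(row)
    return y

def batch_norm(y, eps=1e-5):
    """Normalise each channel over time (simplified: scale 1, shift 0)."""
    T, C = len(y), len(y[0])
    out = [[0.0] * C for _ in range(T)]
    for c in range(C):
        mean = sum(y[t][c] for t in range(T)) / T
        var = sum((y[t][c] - mean) ** 2 for t in range(T)) / T
        for t in range(T):
            out[t][c] = (y[t][c] - mean) / math.sqrt(var + eps)
    return out

def relu(s):
    return [[max(0.0, v) for v in row] for row in s]

def conv_block(x, W, b):
    return relu(batch_norm(conv1d_same(x, W, b)))

x = [[1.0], [2.0], [3.0], [4.0]]                    # T=4, one input channel
W = [[[1.0, -1.0]], [[1.0, -1.0]], [[1.0, -1.0]]]   # K=3, C_in=1, C_out=2
b = [0.0, 0.0]
h = conv_block(x, W, b)                             # shape [4][2]
```

The output has the same number of time steps as the input and one channel per filter, which is what allows the three blocks to be stacked.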


After the three convolution blocks, features are fed into a Global Average Pooling (GAP) layer [lin2013network]. GAP is a pooling operation designed to replace fully connected layers in classical CNNs. The idea is to generate one feature map for each corresponding category of the classification task. The resulting vector is fed directly into the softmax layer to perform the final classification.

One advantage of GAP over fully connected layers is that it is more native to the convolution structure, as it enforces correspondences between feature maps and categories. Thus, the feature maps can be easily interpreted as category confidence maps. Another advantage is that GAP has no parameter to optimize, so overfitting is avoided at this layer. Furthermore, GAP sums out the spatial information, which makes it more robust to spatial translations of the input.
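
The following minimal sketch shows the pooling operation itself for a 1D series: each feature map is averaged over the time axis. The function name and the toy values are illustrative assumptions.

```python
def global_average_pooling(feature_maps):
    """feature_maps: [T][C] -> [C], the mean of each channel over time."""
    T = len(feature_maps)
    return [sum(channel) / T for channel in zip(*feature_maps)]

short = [[1.0, 4.0], [3.0, 0.0]]                           # T=2, 2 channels
longer = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0], [2.0, 2.0]]  # T=4, 2 channels

pooled_short = global_average_pooling(short)
pooled_longer = global_average_pooling(longer)
pooled_reversed = global_average_pooling(short[::-1])
```

The output has one value per channel whatever the series length, and reordering the time steps does not change it, which illustrates both the absence of parameters and the robustness to translations mentioned above.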

One of the advantages of FCNs is the invariance in the number of parameters across time series of different lengths. This invariance, due to the GAP layer, enables a transfer learning approach in which one can train a model on a source dataset and fine-tune it on the target dataset.


Figure 2: Fully Convolutional Network (FCN) model core

3.4 Sliding Window

Contrary to LSTMs, CNNs must have a fixed input size, whereas activity sequences can have different lengths, from 1 to more than 5000 events. To tackle this issue, a sliding window is applied over sequences. Using a sliding window also prepares the ground for online HAR. Windows containing fewer events than the window size are filled with zero padding. Since zero padding can impact the final result, the window size must be tuned to avoid too many zeros in the sliding windows.

For the experiments, the Sensor Event Window (SEW) [quigley2018comparative] approach was used. The SEW approach divides the data into intervals containing an equal number of sensor events. The size of a SEW is defined by a number of events; therefore, the duration of the windows may vary. The authors of [quigley2018comparative] compared different window types and concluded that Time Windows (TW) provide the best accuracy and F-measure score. They consider SEW the second-best window method, because SEWs are able to classify more activities than TWs. We assume this is because SEWs keep a fixed context size, while it is variable for TWs. A stable context size allows the neural network to keep the same amount of information regardless of the window.

In this work, SEWs were used for two reasons. First, we want to evaluate the method for its ability to automatically learn features from the window context, the intuition being to train a network on bounded activity sequences to extract features and then to use them on streaming sensor data for online recognition. Second, controlling the number of events avoids too many zeros inside windows.
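
A minimal sketch of sensor event windowing over one encoded activity sequence is given below. The function name, the choice to put the padding zeros before the events, and the handling of short sequences as a single padded window are assumptions made for illustration; the source does not specify these details.

```python
def sensor_event_windows(sequence, size, stride=1, pad_value=0):
    """Split an encoded activity sequence into windows of `size` events."""
    if len(sequence) <= size:
        # Assumption: a short sequence yields one window, zero-padded at the front
        return [[pad_value] * (size - len(sequence)) + list(sequence)]
    return [list(sequence[i:i + size])
            for i in range(0, len(sequence) - size + 1, stride)]

short_windows = sensor_event_windows([1, 2, 3], size=5)
long_windows = sensor_event_windows([1, 2, 3, 4, 5, 6], size=4)
```

With a stride of one, a sequence of six events and a window of four events yields three windows, each shifted by one event.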

4 Experimental Setup

LSTMs provide very good results on sequence problems and outperform traditional HAR methods in smart homes [liciotti_lstm]. In order to evaluate the method, LSTMs and FCNs were compared on two datasets, ARUBA and MILAN, from the widely used CASAS [cook2012casas] benchmark.

4.1 Datasets Description

Two datasets, ARUBA and MILAN (Table 1), from CASAS were selected for the experiments. The CASAS datasets were introduced by Washington State University. The daily activity data come from real apartments and houses with real inhabitants, who live in their own homes. The houses are equipped with temperature sensors and binary sensors, such as motion or door sensors.

A single person carried out the activities in both datasets. The MILAN dataset was selected for the noise produced by the pet, which increases the difficulty of classification. Both datasets contain several months of labeled activities and are unbalanced, i.e., some activities are less represented than others. In addition, these two datasets contain both common and different activities, with approximately the same number of sensors.

An unbalanced dataset increases the classification challenge. Indeed, if some classes are less represented, the system gets fewer examples from which to find the discriminating features. Moreover, some events are unlabeled or unidentified and are tagged under the class name “Other”. This class accounts for between 45% and 50% of these datasets. In the literature, most researchers remove the class “Other” and balance the dataset by reducing the number of examples for each class.

This method has a drawback: by ignoring unlabeled events, it turns the task into a closed-set classification problem. The system cannot distinguish between a known and an unknown class, and therefore cannot discover new sequences of activities.

Here the original distribution was kept. The objective was to evaluate the robustness of the method and the models.

| | Aruba | Milan |
|---|---|---|
| Inhabitants | 1 | 1 + pet |
| Number of sensors | 39 | 33 |
| Number of activities | 12 | 16 |
| Number of days | 219 | 82 |
| Average sequence length | 133 | 87.3 |

Table 1: Details of datasets.

4.2 SEW Parameters

As previously described, sequences of activities were segmented into SEWs. Different SEW sizes (100, 75, 50 and 25), with a stride of one, were studied. This stride allows the HAR process to run each time a new event is triggered. The goal is to find the best SEW size, i.e., the minimal SEW size that retains enough information to discriminate activity sequences with a high F1-score and a high balanced accuracy. The smaller the SEW size, the faster an activity can be recognized in the case of online HAR.

4.3 Networks Parameters

The FCN parameters are the same as in [fawaz2019deep]. All convolutions have a stride equal to one, with zero padding to preserve the exact length of the time series after the convolution. The first convolution contains 128 filters with a length equal to 8, followed by a second convolution of 256 filters with a length equal to 5, which in turn is fed to a third and final convolutional layer composed of 128 filters with a length equal to 3.
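
As a worked check of these settings, the sketch below counts the trainable parameters of such an FCN. It assumes 64 input channels (the embedding size used in this paper), a dense softmax layer after the GAP, two trainable BN parameters per channel, and a hypothetical number of classes; the function names are illustrative.

```python
def conv1d_params(kernel, c_in, c_out):
    """Weights plus one bias per filter."""
    return kernel * c_in * c_out + c_out

def fcn_trainable_params(input_channels, n_classes,
                         blocks=((128, 8), (256, 5), (128, 3))):
    """Count trainable parameters; `blocks` lists (filters, kernel length)."""
    total, c_in = 0, input_channels
    for filters, kernel in blocks:
        total += conv1d_params(kernel, c_in, filters)
        total += 2 * filters          # BN scale and shift (assumed trainable)
        c_in = filters
    total += c_in * n_classes + n_classes  # assumed dense softmax after GAP
    return total

# 12 classes is an illustrative value, not a statement about the experiments
n_params = fcn_trainable_params(input_channels=64, n_classes=12)
```

The window length appears nowhere in the computation, which is the parameter-count invariance that makes the same network usable with SEWs of 100, 75, 50 or 25 events.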

The LSTM parameters are the same as in [liciotti_lstm]. The LSTM cell is composed of 64 neurons and is followed by a softmax layer for the final classification.

As is usually done in NLP, an embedding layer was added between the raw data and the neural network. The number of neurons of this layer, i.e., the size of the embedding vectors, was fixed to 64, as defined in [liciotti_lstm].

4.4 Hardware and Software Setup

Experiments were run on a server with an Intel(R) Xeon(R) CPU E5-2640 v3 at 2.60 GHz (32 CPUs), 128 GB of RAM and an NVIDIA Tesla K80 graphics card. The Keras and TensorFlow frameworks were used to implement the algorithms. The source code can be found at


4.5 Evaluation Method

To evaluate the proposed method, datasets were split into two parts: 70% for training and 30% for testing. These two parts contain a shuffled, stratified (over class) number of SEWs of each activity. For example, if the dataset contains 100 windows labeled “Sleeping”, then after shuffling these 100 windows are split into 70 windows for the training set and 30 windows for the test set. The random shuffle helps the algorithm to generalize better. The stratification forces both subsets to contain examples of each class.
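
The split described above can be sketched as follows, using only the standard library. The function name, the seed and the second class in the example are hypothetical; a library routine could be used instead in practice.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_ratio=0.7, seed=0):
    """Shuffle each class separately, then cut it at `train_ratio`."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_ratio)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return train, test

# 100 "Sleeping" windows as in the text, plus a hypothetical rare class
labels = ["Sleeping"] * 100 + ["Cooking"] * 10
train_idx, test_idx = stratified_split(labels)
```

Because each class is cut separately, the rare class keeps the same 70/30 proportion as the frequent one.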

A stratified (over class) threefold cross-validation procedure is performed on the training set. The resulting three training subsets and three validation subsets are then used to train and validate the algorithms.
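
A minimal sketch of stratified threefold cross-validation is shown below. It deals the shuffled indexes of each class to the folds in turn, which is one simple way to keep class proportions similar across folds; the function name and the toy labels are assumptions.

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k=3, seed=0):
    """Yield (train indexes, validation indexes) for each of the k folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)   # deal each class round-robin
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]

labels = ["A"] * 9 + ["B"] * 6           # hypothetical training-set labels
splits = list(stratified_kfold(labels))
```

Each of the three validation subsets here contains three windows of class A and two of class B, and the remaining windows form the corresponding training subset.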

During the training phase on each training subset, the early stopping and best model selection methods provided by the TensorFlow framework were used. These methods stop the training before overfitting and save the best model of each training run. The early stopping condition is based on the validation loss: if the loss has not decreased for a fixed number of epochs (the patience) since the last best model was selected, the training is interrupted.
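
The stopping rule can be sketched independently of any framework. The function below is an illustration with a hypothetical patience value and made-up validation losses; in Keras, this behaviour is typically obtained with the `EarlyStopping` and `ModelCheckpoint` callbacks.

```python
def early_stopping(val_losses, patience):
    """Return (best_epoch, last_epoch_run) for a stream of validation losses."""
    best_loss, best_epoch = float("inf"), -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best model: this is where a checkpoint would be saved
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: interrupt training
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9]   # made-up values
best_epoch, stopped_at = early_stopping(losses, patience=3)
```

Here the best model is the one from epoch 2, and training stops at epoch 5 after three epochs without improvement, so the last value is never reached.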

The three best trained models (one for each training subset) were evaluated on the test set to calculate the average balanced accuracy and the average weighted F1-score, because datasets are unbalanced.
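
To show why these two metrics are preferred on unbalanced data, the sketch below computes them from scratch on a toy prediction. The function names and labels are illustrative; balanced accuracy is taken as the mean of per-class recalls, and the weighted F1-score as the per-class F1 weighted by class support.

```python
def per_class_counts(y_true, y_pred):
    """Return {class: (true positives, false positives, false negatives)}."""
    counts = {}
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        counts[c] = (tp, fp, fn)
    return counts

def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls."""
    counts = per_class_counts(y_true, y_pred)
    recalls = [tp / (tp + fn) for tp, _, fn in counts.values()]
    return sum(recalls) / len(recalls)

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    score = 0.0
    for tp, fp, fn in per_class_counts(y_true, y_pred).values():
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (tp + fn) / len(y_true) * f1
    return score

# A classifier that always answers the majority class
y_true = ["Sleep"] * 8 + ["Cook"] * 2
y_pred = ["Sleep"] * 10
bal_acc = balanced_accuracy(y_true, y_pred)
w_f1 = weighted_f1(y_true, y_pred)
```

Plain accuracy would be 80% for this majority-class classifier, whereas the balanced accuracy is only 50%, which exposes the fact that the minority class is never recognized.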

To reduce the training time per epoch, and because the number of SEWs is large, a batch size of 1024 was used for all experiments. During our tests, no difference was noticed between batch sizes: the results were similar, except for the training time.

5 Experimental Results

5.1 FCNs and LSTMs Performances

Table 2 and Table 3 show the performance of two FCNs and two LSTMs on raw sensor data for the two datasets. The vanilla LSTM and FCN, and the LSTM and FCN with an embedding layer, were evaluated on different window sizes. The average balanced accuracy and the average weighted F1-score were computed. The FCN obtains the best weighted F1-score in almost all configurations, with and without the embedding, on both datasets; the only exception is the FCN without embedding on MILAN with a window size of 100. The LSTM is close to or equal to the FCN on large SEWs (size greater than 50). Compared to the FCN, the LSTM seems to need more events to perform the classification.

From the balanced accuracy point of view, FCNs obtain the best values in most cases; the main exception is the MILAN dataset when the window size is greater than 50. This decrease in performance is due to the zero padding. Indeed, the average sequence length on the MILAN dataset is around 88 events. When the window size is close to or above this average, the performance of the FCN decreases. Some short sequences, like “Bed to Toilet” or “Eve Meds”, are not classified, which results in a drop in the balanced accuracy score.

As online HAR is planned in our future work, it is interesting to observe the performance of the method on small SEW sizes. The goal is to achieve HAR in as little time as possible, with as few events as possible, to get the most responsive system possible. In this case, the FCN obtained the best values with SEWs of sizes 50 and 25. Performance decreases as the SEW size decreases, but the FCN maintains a high balanced accuracy and F1-score. Performance drops less with FCNs than with LSTMs. It seems that FCNs can automatically generate more relevant features than LSTMs on small sequences, and therefore with less information.

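| Metric | Model | 100 | 75 | 50 | 25 |
|---|---|---|---|---|---|
| Weighted avg F1 score (%) | LSTM | 96.67 | 94.67 | 90.67 | 85.00 |
| Weighted avg F1 score (%) | FCN | 99.00 | 98.00 | 97.67 | 92.33 |
| Weighted avg F1 score (%) | LSTM + Embedding | 100.00 | 99.67 | 98.00 | 90.00 |
| Weighted avg F1 score (%) | FCN + Embedding | 100.00 | 100.00 | 100.00 | 99.00 |
| Balanced accuracy (%) | LSTM | 81.45 | 76.09 | 71.05 | 83.30 |
| Balanced accuracy (%) | FCN | 88.85 | 87.41 | 87.08 | 80.32 |
| Balanced accuracy (%) | LSTM + Embedding | 94.55 | 93.61 | 90.20 | 74.81 |
| Balanced accuracy (%) | FCN + Embedding | 95.37 | 95.07 | 94.89 | 92.44 |

Table 2: Weighted F1 score and balanced accuracy on the Aruba dataset, by SEW size.

| Metric | Model | 100 | 75 | 50 | 25 |
|---|---|---|---|---|---|
| Weighted avg F1 score (%) | LSTM | 84.00 | 85.67 | 75.33 | 64.00 |
| Weighted avg F1 score (%) | FCN | 77.33 | 93.67 | 88.33 | 83.67 |
| Weighted avg F1 score (%) | LSTM + Embedding | 98.00 | 97.00 | 93.00 | 73.67 |
| Weighted avg F1 score (%) | FCN + Embedding | 99.00 | 98.00 | 97.00 | 94.33 |
| Balanced accuracy (%) | LSTM | 62.15 | 64.95 | 55.70 | 43.29 |
| Balanced accuracy (%) | FCN | 42.24 | 76.41 | 71.82 | 71.34 |
| Balanced accuracy (%) | LSTM + Embedding | 88.52 | 86.77 | 82.05 | 59.35 |
| Balanced accuracy (%) | FCN + Embedding | 84.23 | 86.64 | 87.83 | 90.86 |

Table 3: Weighted F1 score and balanced accuracy on the Milan dataset, by SEW size.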

5.2 Training Time

Table 4 and Table 5 show the average training time and the average number of training epochs per SEW size. On both datasets, FCNs achieved the shortest training time for every SEW size. The embedding layer reduces the number of epochs and the total training time in the majority of cases. Compared to the LSTM, the training time is divided by 2 to 6 with the FCN, depending on the window size. This time saving is explained by the ease with which the computations of convolutional networks can be parallelized.

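| Metric | Model | 100 | 75 | 50 | 25 |
|---|---|---|---|---|---|
| Average epoch number | LSTM | 242 | 278 | 335 | 256 |
| Average epoch number | FCN | 77 | 71 | 111 | 108 |
| Average epoch number | LSTM + Embedding | 161 | 191 | 210 | 161 |
| Average epoch number | FCN + Embedding | 67 | 62 | 71 | 98 |
| Average training time (HH:MM:SS) | LSTM | 06:28:42 | 06:43:08 | 06:29:58 | 03:00:26 |
| Average training time (HH:MM:SS) | FCN | 00:58:00 | 00:52:15 | 01:20:35 | 00:51:27 |
| Average training time (HH:MM:SS) | LSTM + Embedding | 04:45:56 | 04:45:38 | 04:14:35 | 02:02:53 |
| Average training time (HH:MM:SS) | FCN + Embedding | 01:12:37 | 00:59:42 | 00:57:27 | 00:52:15 |

Table 4: Training time and number of training epochs on the Aruba dataset, by SEW size.

| Metric | Model | 100 | 75 | 50 | 25 |
|---|---|---|---|---|---|
| Average epoch number | LSTM | 274 | 385 | 365 | 324 |
| Average epoch number | FCN | 45 | 101 | 87 | 145 |
| Average epoch number | LSTM + Embedding | 255 | 290 | 320 | 183 |
| Average epoch number | FCN + Embedding | 65 | 51 | 52 | 55 |
| Average training time (HH:MM:SS) | LSTM | 02:03:43 | 02:11:07 | 01:44:06 | 01:00:10 |
| Average training time (HH:MM:SS) | FCN | 00:09:39 | 00:20:17 | 00:15:08 | 00:16:50 |
| Average training time (HH:MM:SS) | LSTM + Embedding | 01:57:52 | 01:49:35 | 01:36:56 | 00:35:26 |
| Average training time (HH:MM:SS) | FCN + Embedding | 00:16:24 | 00:11:42 | 00:10:00 | 00:07:67 |

Table 5: Training time and number of training epochs on the Milan dataset, by SEW size.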

5.3 Encoding Impact

Tables 2 to 5 show that the embedding layer improves network performance. Indeed, with the embedding layer, networks gain significantly in performance: 10 percentage points of balanced accuracy on average. Sensor events are transformed into vectors of 64 automatically learned features, which allow networks to maintain a high score on small SEWs.

During our experiments, we noticed that the frequency encoding strategy improved performance, unlike random or arbitrary index allocation. We think this ordering helps networks generate discriminators on important events or rare events.

6 Conclusion

We have proposed a new method that couples, for the first time, FCNs and an embedding based on frequency encoding for HAR in smart homes. Our assessment on two datasets shows that:

  • The embedding based on frequency encoding significantly improves the performance of LSTM and FCN in all cases. This means that the domain knowledge incorporated in the embedding can improve the understanding of events by LSTM and FCN.

  • With the same encoding, FCNs obtain the same or better performance than LSTMs, with the exception of only two configurations, and are quicker to train.

  • Moreover, FCNs outperform LSTMs when the window size decreases. This means that FCNs have a shorter delay in recognizing activities, and are more suitable for real-time activity recognition.

The proposed framework is pure end-to-end without any heavy pre-processing on the raw data or feature crafting, thanks to frequency-based encoding and the embedding. This method appears to be relevant for HAR problems in smart homes with low-level sensors.

7 Discussion and Future Directions

The results presented in this paper show that the applied DL approach based on NLP encoding and FCN is a relevant solution to significantly improve the HAR task in smart homes.

We used a naive embedding based on frequency encoding, which improved classification results. We plan to explore more word embedding techniques [li2020survey], such as Word2Vec or ELMo, to improve the latent knowledge space and thereby enhance classification performance. Indeed, these techniques take into account the context of words.

In addition, we have so far only experimented with offline HAR. Since the use of SEWs in our assessment showed relevant results, we want to apply this method to online HAR applications.

Moreover, we plan to evaluate other windowing methods, such as TW or Fuzzy Windows [hamad2019efficient], with this method, in order to study which windowing methods produce the fastest and most accurate online HAR in smart homes.