Recognizing indoor human movements, interactions, and activities from multiple sensors provides essential information for building intelligent real-world applications. Detecting a walking person has attracted considerable prior effort using various sensing methods, including acoustic [Algermissen, GeigerJ, Geiger], vision [He_2021_CVPR, Liang, Zheng], radio frequency [Korany, Xu, Zeng], wearables [Gafurov, Mantyjarvi, Nguyen, Rong, Teixeira], and structural vibrations [Ekimov, Pan]. Each approach has limitations and faces challenges related to data scarcity and achieving high sensing accuracy. Acoustic-based methods may pick up irrelevant sounds and are sensitive to ambient audible noise [Algermissen, GeigerJ, Geiger] and sound-emitting objects. Vision-based methods often require a clear visual path, which constrains the sensor installation locations [BenAbdelkader, Zheng]. RF-based methods require dense deployment of instruments to achieve high accuracy [Korany, Xu, Zeng], whereas mobile-based methods [Gafurov, Mantyjarvi, Nguyen, Rong, Teixeira] require the target to carry a specific device. Vibration-based methods, which rely on geophones and accelerometers, are easy to retrofit and can provide high sensing ability [Ekimov, Pan]. However, when applied on their own to the human movement identification problem, they are often sensitive to uncommon indoor interactions with objects (e.g., an object falling, a chair being moved). We focus on two activity-based sensors, microphones and geophones, capturing the audio and vibration signals associated with footstep activity.
With the rise of deep neural networks, a common data-driven approach in activity recognition is to use a unimodal system [Salamon, Hershey, Koutini]. In audio [Droghini, Yichi] and vision [Koch, Vinyals], many studies have demonstrated the use of single-stream unimodal networks, which led to their widespread adoption for other sensory inputs and broadened the domains [Hannun2019, Martinez, Radu, Supratak] that can reap the rewards of efficient deep learning algorithms. However, this success rests on large amounts of well-curated data, which single-stream deep supervised unimodal systems require to achieve the desired task. Moreover, since multimodal data has more dimensions in the input feature space, the space to be explored is much larger (exponentially so) than in the unimodal case, which in turn demands even more data. Compared to audio and visual systems, large amounts of labeled sensory data (such as from geophones or other realms) are much more difficult to acquire owing to privacy issues, complex indoor set-ups for sensor deployment, and the professional knowledge required for data labeling. At the same time, there is strong evidence in the literature [Aytar, Palazzo] that multimodal data improves performance in physical systems by leveraging the extraction of features from mixed signals.
Given the limited availability of well-curated data, one approach is metric-based [Vinyals, Snell, Sung, Oreshkin, Chen2019ACL, Chen_2021_CVPR] contrastive learning, which holds enormous potential to mitigate the labeled-data scarcity of omnipresent sensing systems. This contrastive learning-based approach builds upon a Siamese neural network comprising multi-stream sister networks that share the same weights. Since most existing meta-learners [Finn, Chelsea] rely on a single data modality and a single initialization, other sensor modalities may require significantly different parameters for feature extraction, making it difficult to find a common initialization that solves a complex task distribution. Moreover, if the task distribution involves multiple disjoint tasks, one can anticipate that a set of individual meta-learners will each solve one task more efficiently than a single learner covering the entire distribution. However, assigning each task to one of the meta-learners requires more labeled data, and it is often infeasible to extract the task information when the modes are not clearly disjoint. Consequently, a different approach is needed that leverages the strengths of two sensor modalities while building on existing meta-learning techniques, as proposed in the Siamese architecture.
We propose a multimodal framework using the metric-based contrastive learning approach for human movement identification. To model the audio and geophone features, the network takes a combination of sound and vibration signal pairs from negative and positive samples. The audio and geophone signals are processed by two Convolutional Neural Network (CNN) streams comprising twin sister networks that use a single initialization step and share the same weights. The resulting signal embeddings are fused to identify similar patterns and structural features in the contextual information. Finally, the feature embeddings of the input pairs are tuned using a contrastive loss function to facilitate the human movement detection task. Our results suggest that metric-based contrastive learning can mitigate the impact of data scarcity for omnipresent sensors and increase robustness over unimodal systems. Specifically, training a binary classifier on top of the multimodal meta-learner increases the accuracy and performance of the multimodal system. The results show that the proposed method learns useful audio and geophone representations even with limited data samples. To the best of our knowledge, this is the first work on contrastive representation learning that fuses two different sensor modalities, namely audio and geophone, for human movement detection via footstep recognition.
In summary, our main contributions are as follows:
We propose a new generalized framework for learning two different sensory traits in a shared space via multimodal contrastive learning. We leverage CNN- and LSTM-based models to extract audio and geophone features in a multi-stream network. We then extract apparent similarities using a representation learning-based approach to perform a classification task.
Generally, end-to-end supervised models require a massive amount of well-curated data to generalize to a task of interest. However, our results demonstrate that the metric-based multimodal approach uses a smaller network and significantly improves performance by learning spatio-temporal embeddings of the sound and vibration data. It performs well in the low-data regime, increasing its importance in real-world use cases.
We extensively evaluate our proposed framework on a self-collected dataset of 700 pairs. The dataset contains audio and geophone signals of footstep movement, created by capturing the sound and vibration signals in an indoor lab environment. We carefully labeled the dataset and included complex examples for training the proposed architecture. The model achieves encouraging results in extensive experiments on this dataset.
We briefly explain how to utilize the framework, its limitations, and the impact of training examples on evaluation results.
In the following sections, we present the methods and experiments in detail.
Siamese networks are generic models with a distinctive neural network architecture: identical twin networks that share the same weights and parameters to compare the similarity between input pairs. A loss function, in principle, links the networks by computing a similarity measure between the pairs and evaluating the learned feature embeddings of the twin networks [Palazzo]. This work proposes a novel multimodal contrastive learning framework for omnipresent sensing systems in general by fusing two distinct sensor modalities. It enables the use of the Siamese architecture in multimodal form and provides the ability to discriminate features with fewer samples, making it more useful for present and future applications. The core idea of the proposed approach is to attract positive pairs and repulse negative pairs while simultaneously calculating the similarity score between the output pairs, as shown in Figure 3. Positive pairs consist of sound and vibration signals containing footstep movement from random users walking in an indoor environment. In contrast, negative pairs contain no human-induced footstep activity in either the sound or the vibration signal. For uniformity, the two streams of the first subnetwork receive the positive audio and geophone pair, whereas the second subnetwork receives the negative pair. Both positive and negative pairs pass through a pre-processing step in which time-frequency audio and geophone features are computed to serve as input to the network. Finally, the audio and geophone signals are processed using Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to balance the spatial and temporal characteristics of the data sequences.
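The pair-assembly step described above can be sketched as follows. This is our reading of the setup, not the authors' code: each sample is an (audio, geophone) tuple, and we label a pairing 0 when both members contain footsteps and 1 when the second member is background, matching the indicator convention used later by the contrastive loss.

```python
import random

def make_pairs(footstep, background, seed=0):
    """Assemble (sample_1, sample_2, label) triples for the twin networks.

    `footstep` holds (audio, geophone) windows containing footsteps;
    `background` holds windows with no human-induced activity.
    Label 0 marks a similar pairing, label 1 a distinct one.
    """
    rng = random.Random(seed)
    pairs = []
    for sample in footstep:
        pairs.append((sample, rng.choice(footstep), 0))    # similar pair
        pairs.append((sample, rng.choice(background), 1))  # distinct pair
    rng.shuffle(pairs)
    return pairs
```

Drawing the partners at random keeps the similar/distinct classes balanced without enumerating every possible combination up front.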
2.1 Multimodal meta learning with contrastive loss
We use multiple input modalities to achieve a metric learning objective, since we are interested in learning from data fusion. The framework takes a pair of inputs comprising audio and geophone signals. As the structures of these sensory inputs are dissimilar, finding a precise feature mapping between them is non-trivial. To maximize the similarity between an input pair, the embeddings of each signal representation are therefore transformed into a shared space. For this purpose, a contrastive learning-based Siamese architecture narrows the gap between the modalities and learns joint embeddings of the sound and vibration signals while maximizing the similarity score between the two input pairs.
In formulation, a sample of input pairs for a multimodal dataset is given by (x_a, x_g), where x_a ∈ A and x_g ∈ G, respectively. Moreover, the audio and geophone signals of the i-th and k-th samples are represented as (x_a^i, x_g^i) and (x_a^k, x_g^k), respectively. We assume that A and G represent the individual spaces for the sound and vibration signals, respectively; our goal is to construct two encoder networks that transform both sensor modalities into a shared space S. In formulation, the audio encoder f_a and the geophone encoder f_g can be represented as f_a : A → S and f_g : G → S, respectively. For continuity, f_a and f_g map log-compressed Mel-filterbanks into the latent shared space S. We express the log-compressed Mel-filterbanks as X_a ∈ R^(F×T) and X_g ∈ R^(F×T), where F and T are the number of frequency bins and time frames, respectively, of the sound and vibration signal. A CNN with several convolutional layers of varying size is used to create the shared space. To extract features in this space, a dense layer produces a single-dimensional feature vector. The outputs of the two encoders (f_a and f_g) are concatenated using a compatibility function. Contrastive learning extracts a shared latent space using this compatibility function to measure the similarity of positive and negative pairs separately and to maximize the agreement between the latent embeddings of the input features. Finally, the output of the compatibility function from the two twin networks is sent to a contrastive loss function to generate the model's output, as shown in Fig. 3.
The proposed framework takes two input pairs x_1 and x_2, where x_1 = (x_a^1, x_g^1) and x_2 = (x_a^2, x_g^2), and compares their distance in the shared space S to compute the similarity according to a contrastive loss function [Chopra]. This approach allows us to compute the loss between two kinds of training pairs, identical (positive-positive) and distinct (negative-negative). The aim is to learn embeddings with a small distance D for identical pairings and a margin value m that separates the representations of distinct pairs. In this scenario, an indicator value of zero implies that the pairs are comparable (similar), whereas a value of one shows that they are distinct. In practice, a traditional cross-entropy loss function tries to learn and forecast class probabilities separately for each sample to solve the classification problem. In metric learning, by contrast, a contrastive loss function learns to operate on the network's representations and their distance relative to each other. Unlike the cross-entropy loss, the contrastive loss predicts the relative distances between the input pairs, so that a threshold value can be leveraged as a tradeoff to distinguish the presence of human movement in the input pairs. We define the contrastive loss function in Equation 2,

L(Y, x_1, x_2) = (1 − Y) D² + Y max(m − D, 0)²,    (2)

where m is the margin that separates the dissimilar pairs. We set the margin m to 1 during training and evaluation. We use the binary indicator Y to determine whether the input pairs belong to the same set or not.
where D is the Euclidean distance between the two learned representations from the two subnetworks (Eq. 3), D = ||f(x_1) − f(x_2)||_2.
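The loss and distance above can be written as a minimal numpy sketch (Y = 0 for similar pairs, Y = 1 for distinct pairs, margin m = 1 by default):

```python
import numpy as np

def euclidean_distance(e1, e2):
    # D in Eq. 3: L2 distance between the two fused embeddings.
    return np.linalg.norm(e1 - e2)

def contrastive_loss(e1, e2, y, margin=1.0):
    """Contrastive loss of Eq. 2.

    Similar pairs (y = 0) are pulled together by the D^2 term; distinct
    pairs (y = 1) are pushed apart until their distance exceeds the margin.
    """
    d = euclidean_distance(e1, e2)
    return (1 - y) * d**2 + y * max(margin - d, 0.0)**2
```

Note that a distinct pair already separated by more than the margin contributes zero loss, which is what lets the margin act as the decision tradeoff mentioned above.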
Further, in the following sections, we discuss the individual network architectures, as shown in Figure 4, for human movement identification using two subnetworks to process audio and geophone signals.
2.2 Convolutional neural network-based audio and geophone encoders
Here, we describe the audio and geophone encoders for converting log-compressed Mel-filterbanks into a common representation. To capture the spatial and temporal information of the signal, we propose a convolutional architecture consisting of four blocks. Each block consists of a 2D CNN followed by a rectified linear unit (ReLU) activation, and further applies batch normalization, a max-pooling operation, and dropout to avoid overfitting. The first convolutional layer takes log-Mel spectrograms of size 999 × 50 as input and applies 8 filters of size 5 × 5. The first layer's outputs feed a second convolutional layer with 16 filters of size 3 × 3, which connects to a third layer with 32 filters of size 3 × 3, followed by a fourth convolutional layer with 32 filters of size 3 × 3. Finally, the extracted features from the convolutional layers are flattened to generate a 1D feature vector.
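The feature-map dimensions through the four blocks can be traced with a short sketch. The 'same' convolutions and a 2 × 2 max-pool with stride 2 after each block are our assumptions, since the text does not state the pooling configuration:

```python
def conv_stack_shapes(h=999, w=50, filters=(8, 16, 32, 32)):
    """Trace (height, width, channels) through the four conv blocks.

    Assumes 'same' convolutions (spatial size unchanged) and a 2x2
    max-pool with stride 2 after each block, so each pool halves both
    spatial dimensions (integer floor division).
    """
    shapes = [(h, w, 1)]  # input log-Mel spectrogram, one channel
    for f in filters:
        h, w = h // 2, w // 2
        shapes.append((h, w, f))
    flat = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
    return shapes, flat
```

Under these assumptions the 999 × 50 spectrogram shrinks to 62 × 3 with 32 channels, giving a flattened vector of 5952 values; different pooling choices would change these numbers.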
Next, to generate the output of each twin network, we use a concatenation layer (Eq. 1) to fuse the extracted one-dimensional feature vectors of the audio and geophone encoders. For uniformity, the configuration and weights of the audio and geophone encoders are kept the same for both twin networks in the proposed contrastive learning framework. Finally, benefiting from the extracted features of the two twin networks, we apply the contrastive loss function to compute the distance-based similarity score.
2.3 Recurrent neural network-based audio and geophone encoders
To further strengthen this study, we also built a recurrent neural network-based Siamese architecture to examine the effect of the temporal information in the audio and geophone signals. This section describes the audio and geophone encoders based on long short-term memory (LSTM), which capture longer-range dynamics in the temporal information. We use three LSTM blocks to build the network architecture for the audio and geophone encoders: the first LSTM block consists of 400 neurons, the second of 200 neurons, and the third of 100 neurons. To increase network performance and avoid overfitting, each LSTM block uses dropout and batch normalization. A first linear layer of 50 neurons is followed by a second linear layer of 40 neurons to map the output to the hidden units. As in the convolutional architecture, we pass the output of each twin network to a concatenation layer (Eq. 1) to extract a one-dimensional feature vector from the audio and geophone encoders. Finally, we compute the loss using the distance-based similarity measure on the features extracted from the two subnetworks.
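For intuition about the size of this stack, the following sketch counts parameters for the stated LSTM and linear layers. The 50-dimensional per-frame input (one Mel vector per time step) is our assumption, and we count one combined bias vector per gate set (Keras convention); frameworks with separate input/recurrent biases report slightly larger totals.

```python
def lstm_params(input_dim, hidden):
    # 4 gates; each gate has input->hidden and hidden->hidden weights
    # plus one bias vector of size `hidden`.
    return 4 * (hidden * (input_dim + hidden) + hidden)

def dense_params(in_dim, out_dim):
    # weight matrix plus bias vector.
    return in_dim * out_dim + out_dim

def encoder_params():
    # 50-dim Mel frames -> LSTM(400) -> LSTM(200) -> LSTM(100)
    # -> Linear(50) -> Linear(40), per the architecture described above.
    return (lstm_params(50, 400) + lstm_params(400, 200)
            + lstm_params(200, 100)
            + dense_params(100, 50) + dense_params(50, 40))
```

Under these assumptions each encoder holds roughly 1.33 M parameters, the bulk of them in the first 400-unit LSTM block.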
For consistency, the geophone encoder is kept identical to the audio encoder in all its essence, i.e., it has the same number of convolutional or LSTM layers and parameters as the audio encoder.
3.1.1 Geophone module
This section introduces the in-house sensing hardware that we use to capture the vibration signals produced by footstep-induced structural vibrations, as presented in [Shijia], with a few changes to the original system design. As shown in Figure 5, the sensing unit consists of four main components: the geophone, the amplifier module, the analog-to-digital converter, and the communication module (Raspberry Pi). We place the geophone system on the floor, enclosed in a noise-suppressing aluminum box that helps shield the sensor from high-frequency interference and DC noise.
The geophone transforms the velocity of the structural vibration of the observed surface into voltage. We use an SM-24 geophone for its sensitivity in our frequency range of interest (0-200 Hz). When people walk, the vertical floor vibration induces only a minimal voltage (see the SM-24 datasheet: https://cdn.sparkfun.com/datasheets/Sensors/Accelerometers/SM-24). Therefore, to capture human movement, the system needs to amplify the signal; we amplify the sensing node's output by a factor of approximately 2200 using a custom-developed amplification board. We select this setting to prevent signal clipping while obtaining high signal resolution for footsteps of varying strength. We convert the amplified analog signal into a digital signal with a 24-bit ADC module sampled at 4000 Hz. We collect these amplified and digitized signals of human-induced structural vibrations for model training and evaluation.
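The amplify-and-digitize chain can be sketched as follows. The ±2.5 V full-scale range is an illustrative assumption on our part; the text only specifies the roughly 2200× gain and a 24-bit ADC sampled at 4000 Hz.

```python
import numpy as np

def digitize(signal_v, gain=2200.0, full_scale_v=2.5, bits=24):
    """Amplify a raw geophone voltage trace and quantize it like the ADC.

    gain and full_scale_v are illustrative; signals beyond the ADC's
    full-scale range are clipped before quantization.
    """
    amplified = np.clip(signal_v * gain, -full_scale_v, full_scale_v)
    lsb = 2 * full_scale_v / (2**bits)  # smallest resolvable voltage step
    codes = np.round(amplified / lsb).astype(np.int64)
    return codes, lsb
```

With 24 bits over a 5 V span the quantization step is about 0.3 µV, which is why the large analog gain is applied before, not after, digitization.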
3.1.2 Microphone module
Here we introduce the experimental setup, consisting of a communication module (Raspberry Pi) and an eight-channel microphone array (TAMAGO), used to record audio signals in an indoor environment (see Figure 5). The audio signal is sampled at 16,000 Hz with 24-bit resolution. Positive samples are carefully selected so that they contain the subtle sounds of a person walking in the environment. In contrast, negative samples include human speech, keyboard typing, and a few silent recordings. We collect the audio signals in real time but process them offline. Although TAMAGO can record eight channels, we use only one channel for model training and evaluation for simplicity.
We collected audio and geophone data samples. The dataset consists of sound and vibration signals associated with a person moving in the lab environment. We continuously recorded the data using the sensing devices described earlier and carefully annotated the recordings to create a good-quality dataset for our training algorithms. We annotated 700 positive pairs in which the sound and vibration signal of a moving person are present, and separated 700 negative pairs that contain no human presence. In this study, we use the complete set of 700 + 700 = 1400 pairs. We did not record subject age or gender information during data collection.
3.3 Feature extraction
We propose to train the models using time-frequency features as input. All audio and geophone signals are 10 seconds long. The sampling frequency of the sound signal is 16 kHz, whereas that of the vibration signal is 4 kHz. For uniformity, each signal is zero-padded or clipped on the right side. We computed the short-time Fourier transform (STFT) with a window length of 20 ms, projected the STFT onto 256 bins, and used a Hamming window with 50% signal overlap. The complex spectrum of the signal can be expressed as

X(f, t) = |X(f, t)| e^(jφ(f, t)),    (4)

where |X(f, t)| is the magnitude and φ(f, t) is the phase spectrum for frequency bin f in frame t.
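As a sanity check on these settings, a small sketch (ours, not the authors' code) computes the number of STFT frames implied by a 20 ms window with 50% overlap and splits one frame's spectrum into magnitude and phase:

```python
import numpy as np

def stft_frames(n_samples, sr, win_ms=20, overlap=0.5):
    """Number of STFT frames for a signal with the paper's settings."""
    win = int(sr * win_ms / 1000)   # window length in samples
    hop = int(win * (1 - overlap))  # hop size for 50% overlap
    return 1 + (n_samples - win) // hop

def mag_phase(frame):
    # Decompose the complex spectrum: X(f, t) = |X(f, t)| * exp(j*phi(f, t))
    spectrum = np.fft.rfft(frame)
    return np.abs(spectrum), np.angle(spectrum)
```

With these settings, both the 10 s audio signal at 16 kHz and the 10 s geophone signal at 4 kHz yield 999 frames, consistent with the 999 × 50 spectrogram size reported for both modalities.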
We construct triangular filters linearly spaced on the Mel scale to extract log-Mel spectrograms with 50 Mel bands (Eq. 5, 6). We then convert and normalize the values into log magnitudes (Eq. 7), resulting in spectrograms of size 999 × 50.
For consistency, we follow the same procedure to extract features from the vibration signals: we compute log-Mel spectrograms with a 20 ms window length, an STFT with 256 bins, and a Hamming window with 50% overlap, again resulting in spectrograms of size 999 × 50. Using identical spatial and temporal features for both sensor modalities means the network architecture does not need to change, keeping the model consistent across modalities.
3.4 Experimental protocol
In this section, we perform the experiments using the proposed multimodal framework for the collected dataset. Further, we evaluate the results of the convolutional and recurrent neural network-based multimodal architectures by separately training the individual networks. Finally, we compare the results of multimodal systems for human movement detection.
Table 1: Top-1 accuracy of the proposed multimodal systems.

Modality             | Features         | Top-1 Accuracy (%)
Multimodal (2D CNN)  | Audio + Geophone | 99.98 / 99.98 / 99.89
Multimodal (LSTM)    | Audio + Geophone | 99.99 / 99.99 / 99.86
3.4.1 Train/Val details
We train all models using the in-house annotated dataset, which we separate into training, validation, and test sets. From the set of 700 positive and 700 negative pairs (1400 pairs in total), we train on 500 similar and 500 dissimilar pairs, i.e., 1000 pairs (1,000,000 combinations). Similarly, we use 100 positive and 100 negative pairs, 200 pairs in total (40,000 combinations), for each of the validation and test sets. The pairs are unique across the training, validation, and test sets, i.e., no pair sample appears in more than one set. We optimized the hyper-parameters empirically and selected the best values for all experiments. For model training, we used a batch size of 250 and stopped training at 30 epochs via an early-stopping criterion. We use the Adam optimizer with an initial learning rate of 1e-03. We train the model on an Nvidia A100 GPU; training for 30 epochs took 592 minutes. We report the training and validation accuracies for both proposed multimodal systems in Figure 6, and the training and validation losses in Figure 7.
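One consistent reading of the combination counts above is that each of the n pairs in a split can be matched against every pair in that split, giving n² pair-of-pairs combinations; a one-liner (our interpretation, not stated explicitly in the text) reproduces the quoted numbers:

```python
def pair_combinations(n_pairs):
    # Each of the n pairs in a split can be matched against every pair
    # (including itself): n * n candidate combinations for the twin network.
    return n_pairs ** 2
```

This reproduces the 1,000,000 training combinations for 1000 pairs and the 40,000 combinations for each 200-pair validation/test split.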
3.4.2 Evaluation results
Here, we present the findings on human movement detection using both unimodal methods and the proposed multimodal systems, evaluating the performance of the multimodal systems independently. We combine the audio and geophone pairs for multimodal training and use both CNN- and LSTM-based networks. We calculate the distance between each similar and distinct pair using the Euclidean distance of the contrastive loss function. We optimized the decision threshold on this distance empirically and selected the best value in all experiments for both systems; a threshold of 0.5 maximized the validation accuracy of both proposed multimodal systems. We report the Top-1 accuracies for each model in Table 1. State-of-the-art results are obtained with both the CNN- and LSTM-based networks, and our proposed framework outperforms by a good margin. We report the ablation comparison (in absolute terms) using the validation and test splits with unique samples. The significant improvement of our multimodal architecture shows that there is complementary information in the shared weights that benefits human movement recognition using a combination of audio and geophone data.
3.4.3 Impact of Batch size
We also perform experiments to evaluate the impact of the pre-training batch size (see Figures 6 and 7), as larger batch sizes produce more negative examples per batch and facilitate convergence [chen20j]. We find that, on average, a batch size of 250 provides more stability and faster convergence during training. However, increasing the batch size further did not noticeably affect validation-set accuracy.
3.4.4 Impact of training examples
We next investigate the performance of our proposed framework in a low-data setting. For this purpose, we trained the algorithm using only 200 pairs and performed the end-task classification on the evaluation set. The classifier's accuracy dropped to 80%, and the trained model showed significant signs of overfitting: in a low-data regime of 200 pairs, the model cannot learn the feature embeddings. In contrast, increasing the samples from 200 to 500 pairs raises accuracy significantly and generalizes better (see Figure 8) on the validation set, owing to the larger number of training examples per modality. We conclude that a minimum number of good-quality training examples is as essential as an exemplary architecture for improving the model's generalization. Finally, our proposed architecture consistently outperforms, underlining the importance of multimodal data fusion with a Siamese-based neural architecture.
We propose a two-stream subnetwork by leveraging the Siamese neural network architecture, using two different modalities, audio, and geophone, for human movement detection in an indoor environment. We showcase the importance of our multimodal framework using ablations on both CNNs and LSTMs encoders. We demonstrate the importance of our feature fusion in the embedding space by calculating similarities between positive and negative pairs. On human movement detection, our methodology significantly improves the performance in low training samples. It is observed that two sensor modalities unravel each other in shared space, helping to reduce data scarcity in real-world systems significantly. We hope that this work will pave the path for the efficient deployment of deep neural networks in physical systems.