Self-Supervised Feature Learning of 1D Convolutional Neural Networks with Contrastive Loss for Eating Detection Using an In-Ear Microphone

The importance of automated and objective monitoring of dietary behavior is becoming increasingly accepted. Advances in sensor technology, along with recent achievements in machine-learning-based signal-processing algorithms, have enabled the development of dietary monitoring solutions that yield highly accurate results. A common bottleneck in developing and training machine-learning algorithms is obtaining labeled data for training supervised algorithms, and in particular ground-truth annotations. Manual ground-truth annotation is laborious and cumbersome, can sometimes introduce errors, and is sometimes impossible in free-living data collection. As a result, there is a need to decrease the amount of labeled data required for training. Additionally, unlabeled data gathered in-the-wild from existing wearables (such as Bluetooth earbuds) can be used to train and fine-tune eating-detection models. In this work, we focus on training a feature extractor for audio signals captured by an in-ear microphone for the task of eating detection in a self-supervised way. We base our approach on the SimCLR method for image classification, proposed by Chen et al. in the domain of computer vision. Results are promising: our self-supervised method achieves results similar to its supervised-training alternatives, and its overall effectiveness is comparable to current state-of-the-art methods. Code is available at .






I Introduction

While obesity and eating-related diseases affect ever-growing portions of the population, awareness of our eating habits and behavior can play a very important role in both prevention and treatment. The under-reporting of eating in questionnaire-based studies is a well-known fact [1]; as a result, technology-assisted monitoring using wearable sensors is gaining more and more attention.

Meaningful information is usually extracted from the signals captured by wearable sensors by means of signal-processing algorithms; these algorithms are often based on supervised machine learning and thus require labeled training data to achieve satisfactory effectiveness. This is even more important in deep-learning-based approaches, where larger volumes of (annotated) data are required. Creating such large datasets, however, is challenging. Generating ground-truth annotations requires a lot of manual work, where experts or specially trained personnel process each part of the dataset in detail in order to derive the annotations. Besides being laborious and time-consuming, this process is prone to errors (a risk that is sometimes reduced by using multiple annotators for the same data) and can often introduce limitations in data collection. For example, in the case of video-based annotation, subjects are limited to the room/area covered by the cameras, and data collection cannot take place in free-living conditions.

One way to overcome this is to use semi-supervised or unsupervised training methods. Such methods have already been used with great success in fields such as speech processing [2], as well as in applications with wearable sensors [3, 4, 5]. Self-supervised methods for image classification have received a lot of research attention recently [6]. These methods use multiple augmentations of unlabeled images in order to learn effective image representations.

In this work, we adapt ideas from self-supervised image classification to 1D convolutional neural networks (CNNs), with the goal of training a chewing-detection model on audio from an in-ear microphone while learning the feature extractor from unlabeled data only. The model is a deep neural network (DNN) that includes convolutional and max-pooling layers for feature extraction and fully-connected (FC) layers for classification. We follow the approach of [7, 8] and train the convolutional and max-pooling layers in a self-supervised way, and then use them as a fixed feature-extraction mechanism to train the FC classification layers. We evaluate on a large and challenging dataset, compare with our previous work that uses only supervised training as well as with other algorithms from the literature, and obtain highly encouraging results.

II Network training

To study whether we can successfully train a self-supervised feature extractor, we use the architecture of our previous work on chewing detection [9]. In particular, we focus on the architecture with the -s input window due to its effectiveness in the supervised-learning setting. The network is split in two: the feature-extraction sub-network and the classification sub-network. The feature-extraction sub-network maps an audio window to a feature vector and consists of five pairs of convolutional layers followed by max-pooling layers. The convolutional layers have progressively more filters and a constant filter length (the same for all layers except the last one); the activation function is ReLU. The max-pooling ratio is the same for all pooling layers. The classification sub-network maps a feature vector to a binary chewing vs. non-chewing decision and consists of two FC layers with ReLU activation, followed by a layer of a single neuron with sigmoid activation.

In [9], the entire network (the composition of the two sub-networks) is trained together on labeled data. In this work, our goal is to train the feature-extraction sub-network in a self-supervised way, and to use labeled data only for training the classification sub-network.
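As an illustration, the five conv/max-pool pairs of the feature extractor can be sketched in plain NumPy (rather than a deep-learning framework, for self-containment). The filter counts, kernel length, and pooling ratio below are hypothetical placeholders and do not reproduce the exact hyper-parameters of [9].

```python
import numpy as np

def conv1d_relu(x, w, b):
    """Valid 1D convolution followed by ReLU.
    x: (in_ch, T), w: (out_ch, in_ch, k), b: (out_ch,)."""
    out_ch, in_ch, k = w.shape
    t_out = x.shape[1] - k + 1
    y = np.zeros((out_ch, t_out))
    for o in range(out_ch):
        for c in range(in_ch):
            for t in range(t_out):
                y[o, t] += w[o, c] @ x[c, t:t + k]
        y[o] += b[o]
    return np.maximum(y, 0.0)

def maxpool1d(x, r):
    """Non-overlapping max-pooling with ratio r along the time axis."""
    t = (x.shape[1] // r) * r
    return x[:, :t].reshape(x.shape[0], -1, r).max(axis=2)

def feature_extractor(x, params):
    """Five conv/max-pool pairs mapping an audio window to a feature vector."""
    for w, b in params:
        x = maxpool1d(conv1d_relu(x, w, b), 2)
    return x.reshape(-1)  # flatten to a feature vector

rng = np.random.default_rng(0)
channels = [1, 8, 16, 32, 64, 64]   # hypothetical filter counts
k = 9                               # hypothetical kernel length
params = [(0.1 * rng.standard_normal((channels[i + 1], channels[i], k)),
           np.zeros(channels[i + 1])) for i in range(5)]

window = rng.standard_normal((1, 512))      # one mono audio window
features = feature_extractor(window, params)  # shape (512,) with these placeholders
```

The classification sub-network would then be two ReLU FC layers plus a sigmoid neuron applied to `features`.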

II-A Self-Supervised feature learning

To train the feature extractor we follow the paradigm of [7], where each training sample $x_i$ is transformed by two different augmentations, $t$ and $t'$, in parallel, yielding samples $\tilde{x}_i = t(x_i)$ and $\tilde{x}'_i = t'(x_i)$. Given an initial training batch of $N$ samples, we create a new batch of double the original size ($2N$ samples) using the two augmentations. Given a similarity metric, the network is then trained to maximize the similarity between all pairs derived from the same original sample, i.e. $(\tilde{x}_i, \tilde{x}'_i)$, and to minimize the similarity between all other pairs.

Following [7], we use the cosine similarity with temperature [10] as our similarity metric:

$s(u, v) = \frac{u^\top v}{\tau \, \lVert u \rVert \, \lVert v \rVert}$

where $u^\top v$ is the inner product, $\lVert \cdot \rVert$ is the Euclidean norm, and $\tau$ is the temperature parameter. Lower values of $\tau$ lead to a "sharper" softmax at the network's output, while higher values lead to a "smoother" one; the choice of temperature can significantly affect the effectiveness of the learned representations.
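The similarity metric itself is a one-liner; a minimal NumPy sketch (the default temperature value here is an arbitrary example):

```python
import numpy as np

def sim(u, v, tau=0.1):
    """Cosine similarity between vectors u and v, scaled by temperature tau."""
    return float((u @ v) / (tau * np.linalg.norm(u) * np.linalg.norm(v)))
```

Note that for any vector, `sim(u, u, tau)` equals exactly `1/tau`, which is why self-pairs must be excluded from the contrastive-loss denominator below.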

For the contrastive loss function we use the normalized temperature-scaled cross-entropy (NT-Xent) loss [11], since it has been used in similar applications, mainly in the domain of 2D signals (such as images). Given a training batch of $2N$ samples, i.e. $\tilde{x}_i = t(x_i)$ and $\tilde{x}'_i = t'(x_i)$ for $i = 1, \ldots, N$, the loss for a positive pair of projected representations $(z_i, z_j)$ is defined as:

$\ell_{i,j} = -\log \frac{\exp(s(z_i, z_j))}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(s(z_i, z_k))}$

where $\mathbb{1}_{[k \neq i]}$ is a boolean indicator that is equal to $1$ if and only if $k \neq i$. The indicator is necessary because the similarity between a vector and itself is always the same, i.e. $s(z_i, z_i) = 1/\tau$. This, in turn, renders $\ell_{i,j}$ asymmetric, and the final loss for the pair is simply the average:

$L_{i,j} = \frac{\ell_{i,j} + \ell_{j,i}}{2}$
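A self-contained NumPy sketch of this loss over a batch of $2N$ projected vectors; the convention that rows $i$ and $i+N$ form a positive pair is our own choice for illustration:

```python
import numpy as np

def nt_xent(z, tau=0.1):
    """NT-Xent loss over z of shape (2N, d), where rows i and i+N
    are the two augmented views of the same original sample."""
    zn = z / np.linalg.norm(z, axis=1, keepdims=True)
    s = (zn @ zn.T) / tau          # pairwise temperature-scaled cosine similarities
    np.fill_diagonal(s, -np.inf)   # the indicator: exclude self-pairs from the sum
    log_prob = s - np.log(np.exp(s).sum(axis=1, keepdims=True))
    n = z.shape[0] // 2
    idx = np.arange(n)
    # average of l(i, j) and l(j, i) for each positive pair
    loss = -(log_prob[idx, idx + n] + log_prob[idx + n, idx]) / 2.0
    return float(loss.mean())
```

When the two views of each sample coincide, the loss approaches its minimum, which is a quick sanity check for an implementation.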

While it is possible to compute the similarity metric directly on the output of the feature extractor, it is usually better to use a projection head applied after it [7]. We experiment with two projection heads: (a) a linear one, consisting of a single FC layer of 128 neurons with linear activation, and (b) a non-linear one, consisting of two FC layers with ReLU activations followed by an FC layer with linear activation. Thus, during self-supervised training, the resulting network is the feature extractor followed by the projection head.

To train this network we use the LARS optimizer proposed in [12]. We set the batch size to (thus obtaining samples after the dual-augmentation process) and train for epochs, based on some initial experimentation with the dataset. We apply a warm-up schedule to the learning rate for of the total epochs (i.e. for epochs), reaching a maximum learning rate of , after which we apply cosine decay to the learning rate [13].
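The warm-up plus cosine-decay schedule can be sketched as follows; the epoch counts and maximum learning rate used in the example are placeholders, since the paper's exact values are not shown here.

```python
import math

def learning_rate(epoch, total_epochs, warmup_epochs, max_lr):
    """Linear warm-up to max_lr, then cosine decay to zero, as in [13]."""
    if epoch < warmup_epochs:
        return max_lr * (epoch + 1) / warmup_epochs   # linear warm-up
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))

# example: 100 epochs, 10% warm-up, placeholder max_lr of 0.3
schedule = [learning_rate(e, 100, 10, 0.3) for e in range(100)]
```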

II-B Augmentations

An important part of this method is selecting augmentations that help the training process learn features that can be used effectively in the final classification task. Time-stretching (speeding up or slowing down) of audio has been used both in the speech-recognition domain [14] and in more general applications (e.g. environmental-sound classification [15]), with relatively small changes, i.e. to . Other augmentations include pitch shifting, dynamic range compression, and adding background noise [15].

Based on the above, as well as the nature of chewing audio signals, we focus on two augmentations: global amplification level and background noise. In particular, global amplification is $\tilde{x} = a\,x$, where $a$ is the global amplification level and is drawn from a uniform distribution (a different $a$ is drawn for each sample). The choice of this augmentation is based on our experience with in-ear microphone signals [16], where placement, ear and sensor shape compatibility, and even movement can change the overall amplification level of the captured audio.

The second augmentation is the addition of noise and is implemented as $\tilde{x} = x + n$, where $n$ is a vector of i.i.d. "noise" samples drawn from a uniform distribution. The range of the distribution has been chosen as it is roughly equal to of the average standard deviation of the audio signal across our entire dataset, yielding a dB SNR. Adding noise to the audio signal simulates noisy environments (such as noisy city streets, restaurants, etc.) and helps our feature extractor learn to "ignore" its influence.
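The two augmentations can be sketched as follows; the gain range and noise amplitude are hypothetical placeholders, since the exact values are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_gain(x, low=0.3, high=3.0):
    """Global amplification: scale the whole window by one random gain.
    The [low, high] range is a hypothetical placeholder."""
    a = rng.uniform(low, high)
    return a * x

def augment_noise(x, eps=0.01):
    """Additive i.i.d. uniform noise in [-eps, eps]; eps is a placeholder
    that would be chosen relative to the dataset's standard deviation."""
    return x + rng.uniform(-eps, eps, size=x.shape)

def make_pair(x):
    """Produce the two augmented views of one training window."""
    return augment_gain(x), augment_noise(x)
```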

II-C Supervised classifier training

Given the trained feature extractor, we can now train a chewing-detection classifier based on labeled data. In this stage, the projection head can be discarded, and the final network consists of the (frozen) feature extractor followed by the classification sub-network; only the weights of the classification sub-network are trained. It is possible, however, to retain part of the projection head in the final model [8]. We do this for the case of the non-linear projection head: we split it into its first layer and the remaining (second and third) layers, and retain only the first layer between the feature extractor and the classification sub-network (again, only the weights of the classification sub-network are trained).

We use the Adam optimizer [17] with a learning rate of and minimize the binary cross-entropy based on the ground-truth values:

$L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$

where $y_i$ is the ground-truth value for the $i$-th sample and $\hat{y}_i$ is the output of the network.
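The loss minimized in this stage is the standard binary cross-entropy; a minimal NumPy version, with clipping for numerical stability:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; predictions are clipped away from 0 and 1."""
    p = np.clip(y_pred, eps, 1.0 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))))
```

An uninformative classifier that always outputs 0.5 incurs a loss of exactly log 2, a useful baseline when monitoring training.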

II-D Post-processing of predicted labels

The predicted labels indicate chewing vs. non-chewing; thus, chewing “pulses” correspond to individual chews. Similarly to our previous works [18, 9], we aggregate chews to chewing bouts and then chewing bouts to meals. In short, (a) chewing bouts are obtained by merging chews that are no more than s apart, (b) chewing bouts of less than s are discarded, (c) meals are obtained by merging chewing bouts that are no more than s apart, (d) meals for which the ratio of “duration of bouts” over “duration of meal” is less than are discarded.
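This chew-to-bout-to-meal aggregation can be sketched as interval merging; all thresholds below are hypothetical placeholders, since the exact values are not shown here.

```python
def merge_intervals(intervals, max_gap):
    """Merge (start, end) intervals whose gap is at most max_gap seconds."""
    merged = []
    for s, e in sorted(intervals):
        if merged and s - merged[-1][1] <= max_gap:
            merged[-1][1] = max(merged[-1][1], e)
        else:
            merged.append([s, e])
    return merged

def detect_meals(chews, bout_gap, min_bout, meal_gap, min_ratio):
    """Aggregate chew intervals to bouts, then bouts to meals (all in seconds)."""
    # (a) merge chews into bouts, (b) discard short bouts
    bouts = [b for b in merge_intervals(chews, bout_gap) if b[1] - b[0] >= min_bout]
    # (c) merge bouts into meals, tracking total bout duration per meal
    meals = []
    for s, e in bouts:
        if meals and s - meals[-1][1] <= meal_gap:
            meals[-1][1] = e
            meals[-1][2] += e - s
        else:
            meals.append([s, e, e - s])   # [start, end, total bout duration]
    # (d) discard meals whose bout-duration / meal-duration ratio is too low
    return [(s, e) for s, e, d in meals if d / (e - s) >= min_ratio]
```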

III Dataset

The dataset we use was collected at Wageningen University in 2015, during a pilot study of the EU SPLENDID project [19]. Recordings from individuals (approximately h in total) are available. Each subject had two meals on the university premises and was free to leave the university, engage in physical activities, and have as many other meals and snacks as they wished for the rest of the recording time. This dataset has also been used in [18] and [9].

The sensor is a prototype in-ear microphone consisting of a Knowles FG-23329-D65 microphone housed in a commercial ear bud. Audio was originally captured at kHz, but we have down-sampled it to kHz (as in [18]) to reduce the computational burden; we have also applied a high-pass Butterworth filter with a cut-off frequency of Hz to remove very low spectral content and the effect of the DC drift that was present in the chewing-sensor prototype.
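This preprocessing can be sketched with SciPy; the sampling rates, cut-off frequency, and filter order below are hypothetical placeholders, since the paper's exact values are not shown here.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, decimate

def preprocess(audio, fs_in=8000, fs_out=2000, cutoff_hz=20.0, order=4):
    """Down-sample an audio recording and high-pass filter it to remove
    DC drift and very low spectral content. All rates are placeholders."""
    x = decimate(audio, fs_in // fs_out)                 # anti-aliased down-sampling
    sos = butter(order, cutoff_hz, btype='highpass', fs=fs_out, output='sos')
    return sosfiltfilt(sos, x)                           # zero-phase filtering
```

Zero-phase filtering (`sosfiltfilt`) is used here so the filter does not shift chewing events in time; an online system would use a causal filter instead.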

IV Evaluation

We split our dataset of subjects into two parts: a "development" set with subjects (selected randomly) and a "final evaluation" set with the remaining subjects; the two sets are disjoint. In the first part of the evaluation, we explore training hyper-parameters and architecture choices on the development set. In the second part, we apply what we have learned and, after training our models on the development set, we evaluate them on the final-evaluation set.

In the first part of the evaluation, our goal is to understand the effect of the temperature and of the projection head on classification accuracy. In particular, we first train the feature extractor on the entire development set (all of its subjects) using self-supervised training (as described in Section II-A). We then train the classifier on the same subjects, but in a supervised way (as described in Section II-C), in typical leave-one-subject-out (LOSO) fashion. During each LOSO iteration, the data of all but the held-out subject are available for training; we select a small number of them (specifically subjects) as a validation set and train on the remaining ones. We train for epochs with a batch size of , compute the loss on the validation subjects' data after each epoch, and select the model that minimizes the validation loss. Note that in these experiments self-supervised feature learning takes place on all subjects of the development set (for computational reasons); these results are used only to assess the effect of the temperature, and not to evaluate our algorithm. Results of the evaluation on the held-out final-evaluation set are presented in Table IV.

We examine different values of the temperature $\tau$; results are presented in Tables I - III. Table I shows the results for the network whose feature extractor is trained with the linear projection head. Based on F1-score, the best results are obtained for , while most other values of $\tau$ yield an F1-score higher than .

Table II shows similar results; in this case the feature extractor is trained with the non-linear projection head. The highest F1-score is obtained for ; in general, training seems to benefit more from smaller temperature values. Extremely large temperature values (e.g. ) do not seem to be beneficial (this is also observed in [8]).

Finally, Table III shows similar results for the same network as before, but with the first layer of the projection head retained. Here, smaller temperatures seem to benefit the overall effectiveness more, with yielding the highest F1-score, while high temperatures ( ) degrade the effectiveness completely.

prec. rec. F1-score acc. w. acc.
TABLE I: Results for the network on .
prec. rec. F1-score acc. w. acc.
TABLE II: Results for the network on .
prec. rec. F1-score acc. w. acc.
TABLE III: Results for the network on .

In the second part of the evaluation, we evaluate on the subjects of the final-evaluation set with models trained on the development set. In particular, we select the best network of each of the three different approaches (Tables I - III) based on F1-score. We train the feature extractor and projection head on the development set in a self-supervised way, and then train the classifier, again on the development set, in a supervised way. The three trained models are then evaluated on the final-evaluation set, and the results are shown in Table IV. To compare, we also train a fourth model by training the entire network on the development set in a supervised way (similar to how models are trained in [9]). Results are shown in the fourth line of Table IV.

Results are particularly encouraging, as the three networks with self-supervised-trained feature-extraction layers (lines 1-3) not only achieve effectiveness similar to that of the completely supervised-trained network (line 4) but even improve over it. The non-linear projection head (without retaining its first layer) achieves the highest F1-score and weighted accuracy among the three self-supervised models and the supervised model.

model prec. rec. F1-score acc. w. acc.
TABLE IV: Final evaluation of effectiveness on , when training on between self-supervised feature learning and supervised classifier training (lines 1-3), and supervised training of the entire network (line 4).

As a final comparison, we repeat the results of three different algorithms of [20] and of the -sec architecture of [9], as presented in [9]. It is important to note that these results are averages across all subjects of the dataset, so they are not directly comparable with the previous results. However, they provide an estimate of the overall effectiveness of our approach. The results indicate that self-supervised training exceeds the effectiveness of several of the methods proposed in the literature. Only the -sec architecture achieves better results (F1-score of versus ours of ).

approach prec. rec. F1-score acc. w. acc.
MSEA [20]
MESA [20]
LPFSA [20]
-sec arch. [9]
TABLE V: Results for comparison with the baseline as presented in [9]: three algorithms of [20] and the -sec arch. CNN chewing detector of [9] (supervised LOSO on all subjects, batch size of , Adam optimizer with learning rate of , and epochs).

V Conclusions

In this work, we have presented an approach for training the feature-extraction layers of an audio-based chewing-detection neural network in a self-supervised way. Self-supervised training seems to lead to highly effective models, while at the same time reducing the labor needed for manual annotation and taking advantage of large amounts of unlabeled data for representation learning. Our experiments show very promising results, as self-supervised training achieves similar, and sometimes better, effectiveness compared to a similar fully-supervised approach. Additionally, the best results (F1-score of ) are comparable to those of fully-supervised methods on the same dataset (F1-score of in [9]). Future work includes further studying the effect of additional augmentations in the self-supervised training part, and examining how much effectiveness is affected when fewer data are available for the supervised training part. Additionally, the effectiveness of the self-supervised-trained network can be evaluated on other problems, such as individual chew detection or food-type recognition.


The work leading to these results has received funding from the EU Commission under Grant Agreement No. 965231, the REBECCA project (H2020).