In the context of dietary monitoring, various wearable sensors have been proposed to measure different parameters of eating behavior. One of the first sensors to be used was the in-ear microphone: the in-ear placement captures chewing sounds clearly, as they are transmitted through the skull during mastication.
Alternative sensors have also been studied in the literature. A piezoelectric sensor attached to the skin close to the jaw has been used to capture muscle movement during mastication. The periodic nature of chewing is also present in the piezoelectric sensor’s signal and is used to detect chewing. Alternative placements of the piezoelectric sensor have also been examined, such as on smart glasses or neck collars. Surface electromyography (EMG) has also been used for chewing detection; however, it is currently one of the least discreet solutions.
Sensors for estimating the weight of a meal (or bite) have also been proposed, and achieve relatively high effectiveness and low errors. However, these sensors either require manual placement and activation (e.g. a plate weight scale) or are part of the table, and thus cannot be used in free-living conditions such as eating outside or on the go.
More recently, interest has been shifting to off-the-shelf solutions, which eliminate the need for specialized hardware and decrease the sensors’ intrusiveness. In particular, the 3D accelerometers and gyroscopes that are commonly embedded in commercial smart-watches can be used to unobtrusively detect eating gestures (i.e. the repeated movements of bringing food to the mouth from a plate, tray, etc.) and achieve very promising results in challenging, free-living conditions. Alternatively, an accelerometer mounted on the temporalis [12, 13] has also been used to detect muscle contraction during mastication with promising effectiveness.
Additionally, analysis of photos taken with smartphones can provide detailed information about eating habits, including the types of consumed food, ingredients, etc. For example, the goFOOD™ system can estimate the calorie and macro-nutrient content of a meal based on either two photos of the meal or a short video. A single photo has also been used to perform segmentation, recognition, and volume estimation of different foods, with results showing similar effectiveness to methods that require multiple photos of the meal.
In this work, we propose a method for estimating bite weight using the audio signal of commercially available ear buds. Our approach extracts features that are used to train bite weight estimators, based on annotations of the start and stop time-stamps of chews and of the food type. We evaluate different feature sets and different types of estimators on an in-house dataset we have collected, using leave-one-subject-out (LOSO) training and testing. We examine two cases: one where food type information is available (corresponding to a use-case where food information is obtained by asking the user directly or by some food-type recognition system) and one where it is not (corresponding to a use-case where only chewing activity is detected using some audio-based automated method).
II Related work
An algorithm for bite-weight estimation, also from sound captured by an in-ear microphone, has been proposed by Amft et al. Audio was recorded at kHz; a total of hours were recorded from eight individuals.
The algorithm proposed by Amft et al. used features (Table I of their paper) that can be extracted from a sequence of chews. Of these features, can be computed solely from the start and stop time-stamps of the chews, and only requires audio signal (i.e. mean signal energy). For each chewing bite, these features are extracted times: from the entire chewing bout, from the 1st, 2nd, and 3rd third of the chewing bout, and from the chewing bouts that consist of the first and chews only, respectively. This yields features; two more features are computed and a final vector of features is produced.
A linear regression (with bias) model is used to estimate bite weight. A different model is trained for each food type (a total of three food types are used: potato chip, lettuce, and apple). Two methods of feature selection are examined: manual selection based on Spearman’s correlation coefficient (between each feature and the bite weight) and step-wise regression fit. The authors conclude that both methods yield similar results.
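As a rough illustration of correlation-based feature selection, the sketch below ranks candidate features by the absolute value of Spearman's correlation with the bite weight and keeps those above a cut-off. The feature names, the toy values, and the fixed `threshold` are hypothetical; the original selection is described as manual, so the cut-off here is only a stand-in. The sketch also assumes that no feature column is constant.

```python
import math
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation; assumes neither sequence is constant."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman's coefficient is the Pearson correlation of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))


def select_features(columns, bite_weights, threshold):
    """Keep features whose |Spearman correlation| reaches the threshold."""
    return [name for name, col in columns.items()
            if abs(spearman(col, bite_weights)) >= threshold]


# Hypothetical toy data: one informative and one uninformative feature.
bite_weights = [2.0, 4.0, 6.0, 8.0, 10.0]
columns = {"n_chews": [5, 9, 14, 22, 30], "unrelated": [3, 1, 3, 1, 3]}
selected = select_features(columns, bite_weights, threshold=0.5)
```

The selected columns would then be passed to a linear regression with a bias term, fitted separately per food type.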
In this work, we differ significantly from Amft et al. by (a) using commercially available ear buds, (b) focusing on audio-based features and exploring different aggregation methods, and (c) comparing different regression models for bite weight estimation.
III Bite-weight estimation algorithm
Our proposed approach aims to estimate the weight of a single bite. In short, we first extract a set of features (we examine both non-audio and audio features) which we then use to train an estimation model. Models are trained in the typical LOSO scheme, where a different model is trained for each subject of the dataset using the data from the other subjects each time.
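The LOSO scheme can be sketched as follows: for each subject, the indices of that subject's chewing bouts form the test set and all remaining bouts form the training set. The function name `loso_splits` and the subject labels are hypothetical and only serve the illustration.

```python
def loso_splits(subject_ids):
    """Yield (held_out, train_idx, test_idx) once per distinct subject."""
    for held_out in sorted(set(subject_ids)):
        train_idx = [i for i, s in enumerate(subject_ids) if s != held_out]
        test_idx = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train_idx, test_idx


# Hypothetical subject label of each chewing bout in a toy dataset.
subjects = ["s1", "s1", "s2", "s3", "s3", "s3"]
splits = list(loso_splits(subjects))

for held_out, train_idx, test_idx in splits:
    # A model would be trained on train_idx and evaluated on test_idx here.
    print(held_out, len(train_idx), len(test_idx))
```

The reported error is then the mean of the per-subject test errors.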
III-A Feature extraction
We use two distinct sets of features. The first set (non-audio features) does not depend on the audio signal, but only on the start and stop time-stamps of the chews (Figure 1). Note that in this work the start and stop time-stamps have been determined manually. Specifically, let $t_s[i]$ and $t_e[i]$ for $i = 1, \ldots, n$ denote the start and stop time-stamps of a bout of $n$ chews. We compute the following six features: number of chews (i.e. $n$), mean and standard deviation of chew duration (the duration of the $i$-th chew is $t_e[i] - t_s[i]$), mean and standard deviation of chewing rate (the instantaneous chewing rate is estimated from the chew time-stamps), and food type (which is a categorical variable).
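A minimal sketch of the five numeric non-audio features is given below. Two choices are assumptions made for the illustration and are not taken from the text: the instantaneous chewing rate is taken as the reciprocal of the interval between consecutive chew onsets, and the population standard deviation is used. The function name and the toy time-stamps are hypothetical. Food type, the sixth feature, is categorical and would be supplied separately.

```python
from statistics import mean, pstdev


def non_audio_features(starts, stops):
    """Time-stamp based features of one chewing bout (times in seconds)."""
    if len(starts) != len(stops) or len(starts) < 2:
        raise ValueError("need matching start/stop lists with at least 2 chews")
    durations = [e - s for s, e in zip(starts, stops)]
    # Assumption: rate = 1 / time between consecutive chew onsets.
    rates = [1.0 / (b - a) for a, b in zip(starts, starts[1:])]
    return {
        "n_chews": len(starts),
        "duration_mean": mean(durations),
        "duration_std": pstdev(durations),
        "rate_mean": mean(rates),
        "rate_std": pstdev(rates),
    }


# Hypothetical bout of four chews, one every 0.5 s, each lasting 0.25 s.
starts = [0.0, 0.5, 1.0, 1.5]
stops = [0.25, 0.75, 1.25, 1.75]
features = non_audio_features(starts, stops)
```

For this toy bout the sketch gives a mean chew duration of 0.25 s and a mean chewing rate of 2 chews per second, both with zero spread.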
The second set of features is based on the audio signal. A challenge lies in the fact that (a) each chewing bout has a different duration and a different number of chews, and (b) each individual chew has a different duration. To overcome this, we follow a two-step process: first, one feature vector is extracted from each individual chew of a single chewing bout, and then all the feature vectors of the chews that belong to a single chewing bout are aggregated to produce a final, single feature vector for the chewing bout (this final vector is then used for training the weight predictors).
In the first step, the features extracted from each individual chew include signal energy in log-scale energy bands, higher order statistics (including skewness and kurtosis), and fractal dimension. Estimating each of those features is independent of the length of the available audio signal (i.e. of the chew duration); this allows us to obtain comparable values for each feature among chews of varying duration. After extracting the features for the entire dataset, we standardize them by subtracting the mean (of each feature) and dividing by its standard deviation.
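The sketch below illustrates length-independent per-chew features followed by per-feature standardization. It is a simplified stand-in: it uses the log of the mean signal energy, skewness, and kurtosis only, and omits the band energies and fractal dimension. The sample values are hypothetical, and the code assumes that no chew signal and no feature column is constant.

```python
import math
from statistics import mean, pstdev


def chew_features(samples):
    """[log mean energy, skewness, kurtosis] of one chew's audio samples."""
    n = len(samples)
    mu = mean(samples)
    sigma = pstdev(samples)  # assumed non-zero
    log_energy = math.log(sum(x * x for x in samples) / n)
    skewness = sum(((x - mu) / sigma) ** 3 for x in samples) / n
    kurtosis = sum(((x - mu) / sigma) ** 4 for x in samples) / n
    return [log_energy, skewness, kurtosis]


def standardize(rows):
    """Subtract each column's mean and divide by its standard deviation."""
    columns = list(zip(*rows))
    mus = [mean(c) for c in columns]
    sds = [pstdev(c) for c in columns]  # assumed non-zero
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]


# Hypothetical chews of different lengths (4, 6, and 5 samples).
chews = [
    [0.1, -0.2, 0.3, -0.1],
    [0.5, -0.4, 0.2, 0.1, -0.3, 0.6],
    [0.05, 0.02, -0.07, 0.01, -0.03],
]
raw = [chew_features(c) for c in chews]
scaled = standardize(raw)
```

Every chew yields a vector of the same length regardless of its duration, which is what makes the later aggregation step possible.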
In the second step, we aggregate the feature vectors of the chews of each chewing bout. We examine two similar approaches to this: bag-of-words (BoW) and vectors of locally aggregated descriptors (VLAD). Centroids are obtained over the available training portion of the dataset (the Akaike information criterion, AIC, is used for selecting the number of centroids) and are then used on both the training and the test portions of the dataset.
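The difference between the two aggregation schemes can be sketched as follows, assuming the centroids are already available. BoW counts how many chews fall closest to each centroid, while VLAD accumulates the residuals between each chew vector and its closest centroid. The normalization of the BoW histogram by the number of chews, the omission of any VLAD normalization, and all toy values are assumptions of this illustration.

```python
def nearest(vec, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(vec, centroids[k])),
    )


def bow(chew_vectors, centroids):
    """Histogram of nearest-centroid assignments, normalized to sum to 1."""
    counts = [0] * len(centroids)
    for v in chew_vectors:
        counts[nearest(v, centroids)] += 1
    return [c / len(chew_vectors) for c in counts]


def vlad(chew_vectors, centroids):
    """Concatenated per-centroid sums of residuals (vector minus centroid)."""
    dim = len(centroids[0])
    acc = [[0.0] * dim for _ in centroids]
    for v in chew_vectors:
        k = nearest(v, centroids)
        for d in range(dim):
            acc[k][d] += v[d] - centroids[k][d]
    return [x for row in acc for x in row]


# Hypothetical centroids and a bout of three chews with 2-D feature vectors.
centroids = [[0.0, 0.0], [4.0, 4.0]]
bout = [[1.0, 0.0], [0.0, 1.0], [5.0, 4.0]]
bow_vec = bow(bout, centroids)    # length = number of centroids
vlad_vec = vlad(bout, centroids)  # length = number of centroids * dimension
```

Both outputs have a fixed length that does not depend on the number of chews in the bout; BoW keeps only the hard assignment, whereas VLAD also retains where each chew lies relative to its centroid.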
III-B Bite-weight estimators
To estimate the bite weight from the available features, we examine four different algorithms. The first estimator we evaluate is linear regression (LR), similar to Amft et al. We have also experimented with models that include cross-product terms, but have found that the overall effectiveness is not affected significantly.
The second algorithm is support vector regression (SVR). We use a radial-basis function (RBF) kernel, and use a grid search for hyper-parameters and by randomly splitting the data from the subjects to for training and for validation. We search for in and for , where is the length of the feature vector.
We also examine classic feed-forward neural networks (FFNN). We consider architectures with either or hidden layers, and , , , and neurons per layer (thus, a total of distinct architectures, as we do not consider architectures with a different number of neurons per layer). The choice of architecture is treated as the hyper-parameter of this model and is selected based on a training and validation split of the subjects, and is thus different for each subject. Training minimizes the mean absolute error (MAE); learning rate is set to and the maximum number of epochs is. We use the BFGS Quasi-Newton back-propagation algorithm.
Finally, we also examine generalized regression neural networks (GRNN), introduced by Specht, with a Gaussian kernel. As with the previous models, we select the hyper-parameter of the kernel using a train-validation split on the subjects.
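For reference, a GRNN prediction with a Gaussian kernel amounts to a kernel-weighted average of the training targets, with the kernel width as the single hyper-parameter. The sketch below shows this computation; the function name, the toy data, and the chosen widths are hypothetical, and the code assumes the query is close enough to the training data that the weights do not all underflow to zero.

```python
import math


def grnn_predict(x, train_x, train_y, sigma):
    """Gaussian-kernel weighted average of the training targets."""
    weights = [
        math.exp(-sum((a - b) ** 2 for a, b in zip(x, xi)) / (2.0 * sigma ** 2))
        for xi in train_x
    ]
    total = sum(weights)  # assumed non-zero
    return sum(w * y for w, y in zip(weights, train_y)) / total


# Hypothetical 1-D features with bite weights in grams.
train_x = [[0.0], [1.0], [2.0]]
train_y = [4.0, 6.0, 8.0]

smooth = grnn_predict([1.0], train_x, train_y, sigma=0.5)  # symmetric neighbours
sharp = grnn_predict([0.0], train_x, train_y, sigma=0.1)   # nearly nearest-neighbour
```

A small kernel width makes the estimate follow the nearest training sample, while a large width pulls it toward the overall mean of the targets, which is why the width is tuned on a validation split.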
To evaluate our approach we have collected an in-house dataset. A total of 8 participants were enrolled for the data collection trials ( males and females, age years, body-mass index ). Four different food types were consumed: apple, banana, rice, and potato chips. These four types were selected as they have a unique combination of crispiness (apple and potato chips) and wetness (apple and banana). The dataset includes hours of eating and contains a total of chewing bouts and chews.
Audio signals were collected by using commercially available Samsung Galaxy Buds. We have created a custom Android application that captures synchronized audio (at kHz, bit) from the ear buds and plate weight (at Hz) from a Bluetooth-enabled plate scale. The plate weight scale has been used to derive ground truth values for bite weight. A video recording of each session has also been captured to further assist us in the annotation process.
To evaluate our approach we examine various combinations of feature sets and estimation models. Given the non-audio based feature set and the two methods of aggregating the audio based features, we examine the following five combinations: non-audio features (), audio with BoW (), audio with VLAD (), combination of non-audio and audio with BoW features (), and combination of non-audio and audio with VLAD features (). For each of these feature sets we examine four estimators (Section III-B): LR, SVR, FFNN, and GRNN. Finally, we train five different models per combination: the first four are food specific, while the fifth is trained on the data from all food types. Table I shows the mean absolute error (MAE) per experiment, and Table II shows the mean absolute percentage error (MAPE, %) in the same structure. All experiments are performed in LOSO fashion; thus, each result is the mean across the participants of our dataset.
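[Table I: mean absolute error per feature set, estimator, and food type, including the algorithm of Amft et al.]
[Table II: mean absolute percentage error (%) in the same structure as Table I, including the algorithm of Amft et al.]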
Based on the results, the combination of non-audio and audio based features improves the estimation accuracy (yielding lower error) in most cases. This is more evident when training on all food types together (Figure 2). This conclusion is also in line with the results of the algorithm of Amft et al., which uses a combination of non-audio and audio features: it performs better than our non-audio-only and audio-only approaches, while performing slightly worse than our combinations of non-audio and audio features.
Based on the results, FFNN and GRNN achieve the best results (lowest errors) compared to LR and SVR. When training on a single food type, GRNN-based models with achieve the lowest MAE (close to or less than g) and similarly low standard deviation of absolute errors. The only exception is potato chips, where FFNN achieves the lowest errors. However, GRNN achieves the second lowest errors and the difference from FFNN is very small: g for GRNN versus for FFNN. When training on all food types together, FFNN with seems to achieve the lowest errors.
Comparing the two methods of aggregating audio features (from chews to chewing bouts), there is no clear conclusion about whether BoW is better than VLAD. This holds both when using only audio features (i.e. vs. ) and when combining them with the non-audio features (i.e. vs. ). The only exception is GRNN, which seems to benefit from the use of BoW; this can be attributed to the stricter quantization of the feature space that BoW applies.
Finally, MAE is considerably lower for potato chips compared to the other food types. However, this is not a result of “better” trained estimation models, but of the fact that potato-chip bites are generally lighter (than apple bites, for example). This can be confirmed by comparing the MAPE values shown in Table II.
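The effect of bite weight on the two error measures can be illustrated with a small worked example. The sketch uses the standard definitions of MAE and MAPE; the bite weights and predictions are hypothetical values chosen so that both food types have the same relative error.

```python
def mae(true_w, pred_w):
    """Mean absolute error, in the same unit as the weights (grams)."""
    return sum(abs(t - p) for t, p in zip(true_w, pred_w)) / len(true_w)


def mape(true_w, pred_w):
    """Mean absolute percentage error; assumes non-zero true weights."""
    return 100.0 * sum(
        abs(t - p) / abs(t) for t, p in zip(true_w, pred_w)
    ) / len(true_w)


# Hypothetical bites: every prediction is off by 10% of the true weight.
apple_true, apple_pred = [20.0, 10.0], [18.0, 11.0]
chips_true, chips_pred = [2.0, 1.0], [1.8, 1.1]

print(mae(apple_true, apple_pred), mape(apple_true, apple_pred))
print(mae(chips_true, chips_pred), mape(chips_true, chips_pred))
```

In this toy case the MAE of the lighter bites is ten times lower even though the estimator is equally accurate in relative terms, which is why MAPE is the fairer measure across food types.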
Evaluation results for the algorithm of Amft et al. are comparable with our approach. They are clearly surpassed, though, by our FFNN and GRNN based approaches. Overall, MAE is higher on our dataset compared to the values reported by the authors in their original work. This can be attributed to the more challenging nature of our dataset as well as to differences between the sound captured by our commercially available ear buds and that captured by their custom-made sensor.
In this work we have presented an approach for estimating bite weight from the audio signal captured by commercially available ear buds. Using commercially available hardware is essential to enable higher adoption rates for such dietary monitoring approaches, since it reduces invasiveness and discomfort for the end user.
Our approach uses a combination of non-audio and audio features which are used to train estimation models. We evaluate on an in-house dataset of approximately hours. Our best results are obtained by training food-specific GRNN models and non-food-specific FFNN models. GRNN models yield MAE of approximately g or less, and FFNN models yield a total MAE of g. We also compare with an existing algorithm from the literature and achieve lower errors in all cases.
An important limitation of our approach is that it requires annotations for the start and stop time-stamps of individual chews, as well as food type annotations (for food-type–specific models). Future work includes evaluating on bigger and more diverse datasets with more food types and different data-capturing conditions (closer to free-living) as well as evaluating in combination with audio based chewing detectors and automatically detected food types.
The work leading to these results has received funding from the EU Commission under Grant Agreement No. 965231, the REBECCA project (H2020).
-  O. Amft, H. Junker, and G. Troster, “Detection of eating and drinking arm gestures using inertial body-worn sensors,” in ISWC’05, 2005, pp. 160–163.
-  E. Sazonov and J. M. Fontana, “A sensor system for automatic detection of food intake through non-invasive monitoring of chewing,” IEEE Sensors Journal, vol. 12, no. 5, pp. 1340–1348, 2012.
-  H. Kalantarian, N. Alshurafa, and M. Sarrafzadeh, “A wearable nutrition monitoring system,” in 2014 11th BSN, 2014, pp. 75–80.
-  M. Farooq and E. Sazonov, “Segmentation and characterization of chewing bouts by monitoring temporalis muscle using smart glasses with piezoelectric sensor,” IEEE JBHI, vol. 21, no. 6, pp. 1495–1503, 2017.
-  O. Amft and G. Troster, “On-body sensing solutions for automatic dietary monitoring,” IEEE Pervasive Computing, vol. 8, no. 2, pp. 62–70, 2009.
-  C. Sonoda et al., “Associations among obesity, eating speed, and oral health,” Obesity Facts, vol. 11, no. 2, pp. 165–175, 2018.
-  V. Papapanagiotou et al., “A novel chewing detection system based on PPG, audio and accelerometry,” IEEE JBHI, vol. 21, no. 3, pp. 607–618, 5 2017.
-  R. S. Mattfeld, E. R. Muth, and A. Hoover, “Measuring the consumption of individual solid and liquid bites using a table-embedded scale during unrestricted eating,” IEEE JBHI, vol. 21, no. 6, pp. 1711–1718, 2017.
-  K. Kyritsis, C. Diou, and A. Delopoulos, “Modeling wrist micromovements to measure in-meal eating behavior from inertial sensor data,” IEEE JBHI, vol. 23, no. 6, pp. 2325–2334, 2019.
-  ——, “A data driven end-to-end approach for in-the-wild monitoring of eating behavior using smartwatches,” IEEE JBHI, vol. 25, no. 1, pp. 22–34, 2021.
-  K. Kyritsis et al., “Assessment of real life eating difficulties in parkinson’s disease patients by measuring plate to mouth movement elongation with inertial sensors,” Scientific Reports, vol. 11, no. 1, p. 1632, 1 2021.
-  S. Wang, G. Zhou, L. Hu, Z. Chen, and Y. Chen, “Care: Chewing activity recognition using noninvasive single axis accelerometer,” in UbiComp, ser. UbiComp/ISWC’15 Adjunct. New York, NY, USA: Association for Computing Machinery, 2015, p. 109–112.
-  S. Wang et al., “Eating detection and chews counting through sensing mastication muscle contraction,” Smart Health, vol. 9-10, pp. 179–191, 2018, CHASE 2018 Special Issue.
-  Y. Lu et al., “goFOOD™: An artificial intelligence system for dietary assessment,” Sensors, vol. 20, no. 15, 2020.
-  ——, “A multi-task learning approach for meal assessment,” in Proceedings of the Joint Workshop on Multimedia for Cooking and Eating Activities and Multimedia Assisted Dietary Management, ser. CEA/MADiMa ’18. New York, NY, USA: Association for Computing Machinery, 2018, p. 46–52.
-  V. Papapanagiotou, C. Diou, and A. Delopoulos, “Chewing detection from an in-ear microphone using convolutional neural networks,” in EMBC 2017, 2017, pp. 1258–1261.
-  O. Amft, M. Kusserow, and G. Troster, “Bite weight prediction from acoustic recognition of chewing,” IEEE TBE, vol. 56, no. 6, pp. 1663–1672, 2009.
-  V. Papapanagiotou, C. Diou, J. van den Boer, M. Mars, and A. Delopoulos, “Recognition of food-texture attributes using an in-ear microphone,” in ICPR 2020, Part V, LNCS 12665 proceedings. Cham: Springer International Publishing, 2020 [in press].
-  D. Specht, “A general regression neural network,” IEEE Transactions on Neural Networks, vol. 2, no. 6, pp. 568–576, 1991.