Infant distress vocalizations are critical evolutionary signals that allow infants to communicate hunger, discomfort and pain to their caregivers. On average, healthy infants fuss and cry (hereafter, cry) for approximately 1.75 hours per day by the second week of life [cryingInInfancy_1962]. Crying duration peaks at approximately 2.75 hours per day at 6 weeks and then begins to decrease, such that by 12 weeks, infants cry only 0.75 hours per day [cryingInInfancy_1962].
Approximately of infants exhibit excessive crying, or colic, in the first 6 weeks of life [colic_2017]. Colic is often defined via Wessel’s “rule of three” as crying for more than three hours a day, for more than three days a week, for more than three weeks [WESSEL421]. Excessive infant crying is among the most common reasons that families seek medical attention [earlyInfantCrying_2004], [CryingInfant_2007] and has been linked to later regulatory problems in children [CryingHyperactivity_2002]. Whether excessive or not, crying is a known parental stressor associated with decreased parenting quality [crying_parental] and increased risk for caregiver mental health issues [murray2001impact]. Crying can also increase the risk of child maltreatment and abuse, most significantly abusive head traumas caused by violent shaking, which also peak at six weeks [aht_crying].
Most studies reporting on infant crying rely on parent-report diaries [crying_holding_diary] or trained observers to record or live-annotate infant crying for short periods of time [infantCryingStability]. These methods have provided critical descriptive data on crying; however, they are subject to a number of limitations. First, parents report up to 4-5 times as much crying as is found in observer-annotated tape recordings [overreportCrying_Barr], [overreportCrying_James-Roberts]. Second, observer annotations of infant crying are often conducted under laboratory conditions, which may not reflect the true home environment [10.1145/2638728.2641306], [garcia2017speaker]. Third, human annotations are time-consuming, making objective large-scale longitudinal studies of infant crying infeasible [de2019automated]. In order to understand the dynamic between infant crying and caregiver mental health, and before we can begin to develop applications that could provide “just-in-time” support to caregivers in the home, we need automated cry-detection algorithms that perform robustly in naturalistic settings [de2019automated].
Classification of infant vocalizations in laboratory settings has proven successful [interspeech2018]. However, audio detection in real-world settings remains largely elusive. Generally, auditory behavior detection models are developed with small convenience samples in clean laboratory settings [10.1145/2638728.2641306], [garcia2017speaker] where extraneous sounds are minimized. By contrast, real-world environments, such as family households, typically include a variety of complex sounds that must be distinguished from behaviors of interest, including sounds from other humans, pets, music, appliances, outdoor sounds, etc. These additional sounds greatly increase the difficulty of detection and classification problems; additionally, if such sounds are not present in the training data, performance will deteriorate in real-world conditions [alameda2019multimodal].
Some models to detect infant crying in real-world settings exist. Most notably, the LENA (Language ENvironment Analysis) system is a commercially available hardware and software platform used widely in developmental psychology research to collect, segment and identify sounds in children’s real-world daily environments, including infant crying [lenaReliability]. However, infant crying detected from LENA has been shown to have relatively poor accuracy compared to human annotations [MegICIS], in particular when considering accuracy at short timeframes. This is problematic because high temporal precision is required to conduct moment-by-moment, time-locked analyses of infant crying and caregiver behavior and affect [deBtimeDynamics].
In this work, we develop an automated cry-detection algorithm that performs robustly in naturalistic home environments. To develop our model, we collected and annotated a large dataset of over 780 hours of real-world audio data. We then train a Support Vector Machine (SVM) classifier using a combination of deep spectrum features and acoustic features. Deep representation features, generated from the higher layers of convolutional neural networks (CNNs), have been shown to have sufficient representational power to solve complex image recognition tasks [decaf_2013]. They have also been successfully applied to problems in the audio domain using spectral (i.e. visual) representations of sound. For example, deep spectrum features have been successfully used to classify emotion from speech [spectrumEmotion_2017], [spectrumEmotion_2019] and snoring [spectrumSnore_2017].
Our model dramatically outperforms state-of-practice (i.e. those implemented in commercially available systems) and state-of-the-art (i.e. recently published) infant distress detection models trained and tested on equivalent real-world raw audio data of infants in their natural home environments. In particular, our model reaches an average F1 score of 0.597 (std: 0.185), with precision 0.612 (std: 0.209) and recall 0.630 (std: 0.175). By contrast, the LENA cry detection model has an F1 score of 0.166 (std: 0.090), with precision 0.809 (std: 0.299) and recall 0.095 (std: 0.055), and a recently published Interspeech Challenge baseline model [Interspeech2019] has an F1 score of 0.26 (precision = 0.159; recall = 0.706). We detail these comparisons and their implications in the results and discussion.
2.1 Collection of Real-world Raw Audio data
We collected two datasets of real-world natural household audio data to train and test our model. Both were collected using an infant-worn commercial audio sensor within a larger study developing a sensors-to-analytics platform to capture high-density markers of parent-child activity in real-world settings [Holding_2019].
The LENA audio recorder is a lightweight wearable device designed to record and analyze children’s early auditory environments. Parents place the LENA in a vest worn by the infant. The LENA can record up to 24 hours of audio before recharging, and audio data is stored in PCM format with one 16-bit channel at a 44.1 kHz sampling rate [lenaBattery].
Parents were instructed to record up to 72 hours total of household audio data, including two weeknights and a weekend. We requested that individual recordings last up to 24 hours, with parents told that they could pause and resume the recordings as needed. All data were thus collected in truly “natural” home settings which included highly variable daily household sounds, including the presence of multiple family members, noise from the child’s clothes rubbing the sensor, as well as large amounts of silence.
2.2 Training and Testing Data Details
We created our training dataset from 742 hours (mean: 30.917, range: 0.607 - 70.498) of raw audio data compiled from recordings of 24 infants. To increase the concentration of infant crying vocalizations in the training data, we filtered the raw audio samples collected in the study to identify high likelihood crying moments, which were then annotated by trained research assistants (see Section 2.4). This is important given that a healthy infant cries only 0.75 - 2.75 hours per day [cryingInInfancy_1962], resulting in a highly imbalanced dataset. Once filtered, the training dataset included 66.17 hours of raw audio data with 7.9 hours of annotated crying.
Our testing dataset consisted of 20 complete 24-hour continuous raw audio recordings from 20 infants. On average, infants cried 54.278 minutes (std: 27.641, range: 5.467 - 119.417) during each 24-hour recording session.
Data from 44 infants (20 female) were included in the study. Infants had an average age of 4.84 months old (std: 2.68, range: 0.87 - 10.8). of the infants were reported as White, Hispanic, African-American and Multiracial. 21 out of 44 participants reported sibling information. 14 of 21 reporting families had at least two children in the home (range: 0 - 5) with an average sibling age of 4.8 years (std: 2.814).
A team of trained research assistants annotated raw audio data according to best practices in behavioral sciences. Four types of infant vocalizations were annotated: cry, fuss, scream, and laugh (training dataset inter-rater reliability kappa score: 0.8469, testing dataset kappa: 0.8023). We detail the cry and fuss annotation instructions below, as only these were used in the current paper. Crying is typically very loud, rhythmic, harsh and sudden and may feature wails or grunts. Fussing, on the other hand, is a continuation of negative vocalizations that is less intense than crying. It features a larger gap between vocalizations as well as quick breathing and closed-mouth noises. Annotators were trained to include only those instances of crying that lasted at least 3 seconds and those instances of fussing that lasted at least 5 seconds. Additionally, they combined all neighboring crying and fussing sounds occurring within 5 seconds of one another into a crying annotation. Fussing and crying annotations were collapsed into a single category labelled “crying”. All other infant sounds, as well as all unlabelled household audio in the training and testing data, were collapsed into a second category labelled “not crying”.
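To make the merging rule concrete, the following is a minimal sketch of how cry and fuss annotations could be collapsed into “crying” episodes. It is an illustration only, not the annotation tooling used in the study: the function name `merge_distress`, the tuple layout, the treatment of the 3-second and 5-second criteria as minimum durations, and the choice to apply the duration filter before merging are all assumptions.

```python
from typing import List, Tuple

# Hypothetical layout: (start_seconds, end_seconds, label), label is "cry" or "fuss".
Annotation = Tuple[float, float, str]

MIN_DURATION = {"cry": 3.0, "fuss": 5.0}  # assumed minimum durations in seconds
MERGE_GAP = 5.0                           # neighbors within 5 s are combined


def merge_distress(annotations: List[Annotation]) -> List[Tuple[float, float]]:
    """Collapse cry/fuss annotations into merged "crying" episodes."""
    # Keep only annotations that meet the minimum duration for their label.
    kept = sorted(
        (start, end)
        for start, end, label in annotations
        if end - start >= MIN_DURATION[label]
    )
    merged: List[List[float]] = []
    for start, end in kept:
        if merged and start - merged[-1][1] <= MERGE_GAP:
            # Close enough to the previous episode: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(start, end) for start, end in merged]
```

For example, a 4-second cry at 0 s followed by a 6-second fuss at 7 s would be merged into one episode spanning 0 to 13 s, while an isolated 1-second cry would be dropped.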
2.5 Extraction of LENA Outputs for Model Comparison
In order to provide a reference for our model’s accuracy, we extracted predicted outputs from the LENA system. The LENA system software can automatically label different sound sources, including female and child speech as well as infant distress (a combination of both fussing and crying) [lena_cryfuss]. In our results, we compare the automated distress annotations provided by LENA with our model outputs. Specifically, we ran the LENA software on our complete testing dataset, extracting all infant distress output labels. To facilitate comparison, we then applied the post-processing procedure we developed for our own model outputs (see Section 3.6) to the LENA outputs.
3 Proposed System
An overview of our system is depicted in Figure 1. It consists of 4 main components:
Pre-processing including filtering, windowing, data augmentation, and feature extraction.
Extraction of deep spectrum features using AlexNet.
Training of SVM classifier using deep spectrum features and acoustic features.
Post-processing of the model predictions by smoothing.
The pre-processing step includes filtering, windowing, data augmentation, and feature extraction.
The mean fundamental frequency (F0) of infant crying ranges from 441.8 to 502.9 Hz [cryFrequency_2003]. Thus, in both testing and training data, we removed all segments of the signal that were silent at frequencies above a 350 Hz threshold. To reduce fragmentation, we smoothed this output by grouping all remaining neighboring segments within 5 seconds of one another and removing any isolated segments shorter than 5 seconds.
Smoothed signals were cut into 5-second windows (with 4-second overlap); all windows containing more than one label were removed.
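The windowing step can be sketched as follows. This is a simplified illustration that assumes labels are available at one-second resolution; the function name `make_windows` and the label strings are hypothetical. A 5-second window with 4-second overlap corresponds to a hop of 1 second.

```python
from typing import List, Tuple


def make_windows(
    labels: List[str], window_s: int = 5, hop_s: int = 1
) -> List[Tuple[int, str]]:
    """Return (start_second, label) for every window that carries a single label.

    `labels` holds one label per second of audio (an assumption of this sketch).
    Windows whose seconds carry more than one label are discarded.
    """
    windows = []
    for start in range(0, len(labels) - window_s + 1, hop_s):
        segment = labels[start:start + window_s]
        if len(set(segment)) == 1:
            windows.append((start, segment[0]))
    return windows
```

With six seconds of “not crying” followed by six seconds of “crying”, only the windows starting at seconds 0, 1, 6 and 7 survive; the windows that straddle the transition are removed.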
3.2 Training Data Augmentation
Next, to reduce training data imbalance, we used two data augmentation methods to increase the representation of “crying” in our training dataset. We flipped the “crying” windows horizontally and applied time masking deformation, which randomly masks blocks of time steps in each window, as described in [specAugment_2019]. Additionally, “not crying” windows were randomly undersampled to match the number of augmented “crying” windows.
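The two augmentations and the undersampling step can be sketched as below. This is a minimal illustration under several assumptions: each window is represented as a 2-D spectrogram stored as a list of frequency rows over time frames, a horizontal flip is taken to mean reversing the time axis, masked frames are filled with zeros, and the mask width is drawn uniformly up to a hypothetical `max_width`. None of these parameter choices are stated in the text.

```python
import random
from typing import List, Sequence

Spectrogram = List[List[float]]  # rows = frequency bins, columns = time frames


def flip_time(spec: Spectrogram) -> Spectrogram:
    """Horizontal flip: reverse the time axis of every frequency row."""
    return [row[::-1] for row in spec]


def time_mask(
    spec: Spectrogram, max_width: int, rng: random.Random, fill: float = 0.0
) -> Spectrogram:
    """Mask one randomly placed block of consecutive time frames."""
    n_frames = len(spec[0])
    width = rng.randint(0, min(max_width, n_frames))  # inclusive bounds
    start = rng.randint(0, n_frames - width)
    return [row[:start] + [fill] * width + row[start + width:] for row in spec]


def undersample(items: Sequence, n: int, rng: random.Random) -> list:
    """Randomly keep at most n items (used for the majority class)."""
    return rng.sample(list(items), n) if len(items) > n else list(items)
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible across runs.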
3.3 Preliminary Feature Extraction
For testing and training data, we next extracted mel-scaled spectrograms and acoustic features. We applied short-time Fourier transform (samples per segment = 980, overlap = 490, Hann window) to each window to acquire mel-scaled spectrogram representations of size
. We also extracted 34 acoustic features for each second within the window. We then calculated the mean, median and standard deviation of each feature across the window, and the resulting 102 features were used later in training. The 34 features were Zero Crossing Rate, Energy, Entropy of Energy, Spectral Centroid, Spectral Spread, Spectral Entropy, Spectral Flux, Spectral Rolloff, 13-element MFCCs, 12-element Chroma Vector, and Chroma Deviation.
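The aggregation from 34 per-second features to 102 window-level features (34 means, 34 medians and 34 standard deviations) can be sketched as follows. The per-second feature extraction itself is not shown; the function name `aggregate_features`, the ordering of the output, and the use of the population standard deviation are assumptions of this sketch.

```python
from statistics import mean, median, pstdev
from typing import List


def aggregate_features(per_second: List[List[float]]) -> List[float]:
    """Summarize per-second feature vectors into one window-level vector.

    `per_second` holds one 34-element feature vector per second of the
    window. The result concatenates the per-feature means, medians and
    standard deviations, giving 3 * 34 = 102 values.
    """
    columns = list(zip(*per_second))  # one tuple of values per feature
    return (
        [mean(col) for col in columns]
        + [median(col) for col in columns]
        + [pstdev(col) for col in columns]  # population std (assumed)
    )
```

The count of 34 follows from the feature list: 8 scalar features, 13 MFCCs, 12 chroma values and 1 chroma deviation.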
3.4 Extracting Deep Spectrum Features
We used AlexNet to extract deep spectrum features from the mel-scaled spectrograms. The dimension of input was changed to
to match the shape of our mel-scaled spectrograms, and the output layer had a dimension of 2 for our binary classification problem. Additionally, we added batch normalization layers after every convolutional and fully-connected layer except the output layer. The last hidden layer with size
was used as deep spectrum features. AlexNet was trained using the Adam optimizer (learning rate = 0.001, beta_1 = 0.9, beta_2 = 0.999) for 50 epochs with a batch size of 128. Leave-one-participant-out cross-validation was used.
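Leave-one-participant-out cross-validation holds out all windows from one infant per fold, so that no infant contributes to both training and evaluation within a fold. A minimal sketch of the fold construction is given below; the function name and the use of string participant identifiers are hypothetical, and the network training itself is omitted.

```python
from typing import Iterator, List, Tuple


def leave_one_participant_out(
    participant_ids: List[str],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_participant, train_indices, test_indices) per fold.

    `participant_ids[i]` names the infant that window i was recorded from.
    """
    for held_out in sorted(set(participant_ids)):
        train_idx = [i for i, p in enumerate(participant_ids) if p != held_out]
        test_idx = [i for i, p in enumerate(participant_ids) if p == held_out]
        yield held_out, train_idx, test_idx
```

With 24 participants in the training dataset, this would produce 24 folds, and per-fold scores can then be averaged.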
3.5 SVM Classifier Training
We concatenated the deep spectrum features with the 102 acoustic features obtained from pre-processing and fed them into an SVM classifier with an RBF kernel for final training. The SVM generated a prediction for each 5-second window, and each second was labelled “crying” if any window containing that second had a prediction of “crying”.
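The mapping from overlapping window predictions to per-second labels amounts to a logical OR over all windows that cover a given second. A minimal sketch follows; the function name and input layout are hypothetical, and seconds not covered by any window (for example, those removed during filtering) are assumed to default to “not crying”.

```python
from typing import List, Tuple


def windows_to_seconds(
    window_preds: List[Tuple[int, bool]], n_seconds: int, window_s: int = 5
) -> List[bool]:
    """Convert window-level predictions into one label per second.

    `window_preds` holds (start_second, is_crying) pairs. A second is
    labelled crying (True) if any window containing it was predicted crying.
    """
    seconds = [False] * n_seconds
    for start, is_crying in window_preds:
        if is_crying:
            for t in range(start, min(start + window_s, n_seconds)):
                seconds[t] = True
    return seconds
```

Because of the OR rule, a single positive window marks all five of its seconds as crying, which favors recall and is one reason a subsequent smoothing step is useful.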
A final post-processing step involved smoothing the stream of predictions. “Not crying” episodes lasting 5 seconds or less and “crying” episodes shorter than 5 seconds were eliminated by reassigning them to the other class.
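One way to implement this smoothing on a per-second prediction stream is sketched below. The order of the two passes is not specified in the text; this sketch assumes that short “not crying” gaps between crying episodes are filled first and that short “crying” episodes are removed afterwards. It also assumes that only gaps with crying on both sides are filled. The function names are hypothetical.

```python
from itertools import groupby
from typing import List, Tuple


def runs(labels: List[bool]) -> List[Tuple[bool, int, int]]:
    """Return (value, start_index, length) for each run of equal labels."""
    out, start = [], 0
    for value, group in groupby(labels):
        length = len(list(group))
        out.append((value, start, length))
        start += length
    return out


def smooth(labels: List[bool], min_len: int = 5) -> List[bool]:
    """Smooth per-second predictions (True = crying)."""
    labels = list(labels)
    # Pass 1 (assumed first): fill interior not-crying gaps of <= min_len seconds.
    for value, start, length in runs(labels):
        interior = start > 0 and start + length < len(labels)
        if not value and interior and length <= min_len:
            labels[start:start + length] = [True] * length
    # Pass 2: remove crying episodes shorter than min_len seconds.
    for value, start, length in runs(labels):
        if value and length < min_len:
            labels[start:start + length] = [False] * length
    return labels
```

Running the passes in the opposite order would instead delete short crying bursts before they could be joined to their neighbors, so the ordering choice affects the result.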
We provide results for both our filtered training dataset, consisting of raw audio segments highly likely to contain crying, and our testing dataset, consisting of twenty 24-hour continuous raw audio samples. While a filtered dataset was required to achieve a relatively balanced training dataset, our goal is robust performance on raw real-world household audio, so we believe the latter dataset is the more appropriate and “true” standard for assessment. Using leave-one-participant-out cross-validation, our model achieved an F1 score of 0.615 (std: 0.170) for “crying” annotations in the training dataset (N = 24 participants), with an average precision of 0.521 (std: 0.191) and recall of 0.820 (std: 0.147).
We then applied the model trained on all participants included in the training dataset to the testing dataset. We report average F1 score, precision, and recall for “crying” over all testing data sessions at second-by-second precision in Table 1. Additionally, we calculate and report performance for state-of-practice LENA cry detection on our complete testing dataset (see Section 2.5). Overall, our model achieved better performance than the LENA model. Our model achieved an F1 score of 0.597 with balanced precision and recall. Though LENA had a higher average precision, its recall was low (0.095) and its F1 score was only 0.166.
|Model|F1 score (std)|Precision (std)|Recall (std)|
|---|---|---|---|
|Our model|0.597 (0.185)|0.612 (0.209)|0.630 (0.175)|
|LENA|0.166 (0.090)|0.809 (0.299)|0.095 (0.055)|
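For reference, second-by-second precision, recall and F1 for a single recording session can be computed as in the sketch below. The function name is hypothetical, and returning 0.0 when a denominator is zero is an assumption of this sketch rather than a convention stated in the paper.

```python
from typing import List, Tuple


def precision_recall_f1(
    truth: List[bool], pred: List[bool]
) -> Tuple[float, float, float]:
    """Compute precision, recall and F1 for the "crying" class (True)."""
    tp = sum(1 for t, p in zip(truth, pred) if t and p)
    fp = sum(1 for t, p in zip(truth, pred) if not t and p)
    fn = sum(1 for t, p in zip(truth, pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1
```

Note that when scores are computed per session and then averaged, the mean F1 generally differs from the F1 obtained by plugging the mean precision and mean recall into the formula, which would explain why the tabulated F1 values do not follow exactly from the tabulated precision and recall.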
In this paper, we developed a model to detect and classify infant vocalizations from real-world audio recordings collected via a wearable audio recorder worn by infants in their home settings. Our model achieved an F1 score over 0.40 higher than a commercially-available state-of-practice model (i.e. the LENA), and, as we will detail below, over 0.30 higher than a recently published (state-of-the-art) real-world cry classification model. Additionally, we will discuss the technical aspects of our model that contribute to our model’s major gains in accuracy in this challenging real-world data context.
5.1 Comparison to State-of-the-Art Models
As an additional point of reference, we consider our model’s performance relative to two recently published infant vocalization baseline models from the 2018 and 2019 Interspeech Computational Paralinguistics Challenges (ComParE) [interspeech2018], [Interspeech2019]. While the three models differ in dataset, annotation scheme and evaluation method, we can estimate a rough metric of comparison for these state-of-the-art audio classification models. Of note, the 2018 ComParE data were collected in a silent lab environment using a camcorder hanging above the infant, and the 2019 ComParE data were collected in natural home environments using the same infant-worn audio recorder used in this paper. Both ComParE models were multi-class models classifying infant vocalizations into multiple sounds, e.g. neutral/positive, fussing, crying, babbling, laughing, etc. Thus, to more directly compare our results, for both ComParE models we combined all vocalizations into two categories: “crying” (including fussing and crying, as in our dataset) and “not crying” (including all other vocalizations). We used the baseline models as comparison given that the published submissions did not provide enough detail to calculate reference measures; however, published submissions [interspeech2019_baby_fisher], [interspeech2019_baby_attention] only improved baseline unweighted average recall (UAR) by and respectively.
Using our calculations, the baseline F1 score was 0.69 for the 2018 ComParE challenge and 0.26 for the 2019 ComParE challenge [interspeech2018], [Interspeech2019], with the large difference in F1 scores presumably reflecting the additional challenges of distress classification in real-world home environments. Thus, while our F1 score (0.597) was appreciably lower than that of the 2018 ComParE challenge baseline, our model shows a major improvement in accuracy over state-of-the-art distress classification in real-world audio settings. Additionally, we note that in both ComParE challenges audio clips were segmented to the boundaries of individual vocalizations, whereas our model both detected and classified crying vocalizations from continuous raw audio, which is a more demanding task.
The LENA cry detection model was developed in 2008 [lena_algorithm]. The original model was a GMM-HMM using various acoustic features, including mel-frequency cepstrum features, perceptual minimum variance features, and distortionless response spectral subband centroids. The use of deep neural nets for audio classification is known to greatly improve model accuracy [CNN_audioset], and thus it is not surprising that our model achieved better performance.
The baseline model published in Interspeech 2019 [Interspeech2019] also leveraged the computational power of deep neural nets to extract features from mel-scaled spectrograms. However, our model differs in important ways. In particular, the Interspeech model used an autoencoder to derive features from mel-scaled spectrograms and an SVM for classification. The authors also trained 2 additional feature sets, the ComParE Acoustic Feature Set and Bag-of-Audio-Words, totaling over 6000 features, separately with SVMs and fused the results using majority voting.
By contrast, we performed supervised training on our dataset using AlexNet to extract deep spectrum features, which gave us a more customized and targeted representation for detecting infant distress sounds. Using the AlexNet predictions alone, our testing data F1 score was 0.561 (std: 0.189), with precision 0.532 (std: 0.209) and recall 0.652 (std: 0.128). Additionally, we paired these deep spectrum features with additional acoustic features and combined them in a secondary round of supervised learning with an SVM model, further increasing our F1 score to 0.597. However, given that the AlexNet output alone yielded an F1 score of 0.561, it appears that the supervised CNN training contributed most substantially to the increased performance of our model.
Indeed, given that the challenge data reports that their complete dataset (including development and test set) comprises 11k vocalizations with a modal duration of 400 ms, we estimate that they may have as little as 1.256 hours of total infant vocalizations. By contrast, our training dataset alone had 7.9 hours of annotated cry data before data augmentation, and our testing set had 17.935 hours. This suggests that model performance in real-world audio classification tasks can be greatly improved by collecting and annotating larger datasets which can be leveraged for supervised models [moreData_2009].
Auditory behavior detection and classification models are typically developed with small convenience samples in clean laboratory settings [10.1145/2638728.2641306], [garcia2017speaker], meaning their performance deteriorates in real-world conditions [alameda2019multimodal]. In this paper we present a model combining deep spectrum features and acoustic features that detects and classifies infant distress in messy real-world data collected via a wearable audio recorder in continuous home recording conditions. Our accuracy improves upon performance of both state-of-the-art and state-of-practice distress classification models, with impressive improvements in F1 scores relative to these models.
This work was supported by NIMH K01 Award (1K01MH111957‐01A1) as well as a generous start-up package from The University of Texas at Austin to Kaya de Barbaro. We thank all of the families for their participation in our study as well as all of our student research assistants for the work to complete all of the annotations. In particular we thank Nina Nariman and Brooke Benson for their work managing the coding team to annotate our training and testing datasets.