Automated understanding of child speech is an inherently more difficult problem than that of adult speech. A variety of contributing factors have been identified and addressed in recent years, both from a signal processing viewpoint (such as large within- and across-age and gender variability due to a developing vocal tract, and errors in the semantic structure of spoken language) and from limited data availability, which has necessitated data augmentation and transfer learning techniques.
An additional layer of complexity arises when addressing clinical and mental health applications involving children, where the condition may give rise to speech and language abnormalities. One such domain is autism spectrum disorder (ASD). ASD refers to a complex group of neuro-developmental disorders that are commonly characterized by social and communication idiosyncrasies, and whose reported prevalence in US children has been steadily rising (1 in 59 children as of 2014). Child-adult interactions have been used in the ASD domain primarily for diagnosis (ADOS) and for measuring intervention response (BOSCC). Automated computational processing of the participants' audio and language streams has provided objective descriptions that characterize session progress and help in understanding the relation with symptom severity.
However, behavioral feature extraction in the above studies has necessitated manual annotation of speaker labels, which can be expensive and time-consuming to obtain, especially for large corpora. Automatic speaker label extraction involves a combination of speech activity detection (speech/non-speech classification) and speaker classification (categorization of speech regions into child and adult). In this work, we assume that oracle speech activity detection is available and focus on building a robust child-adult classification model.
Interest in child-adult speaker classification in spontaneous conversations has increased recently. Some of the early works used traditional feature representations such as MFCC, PLP and i-vectors, and clustering techniques such as Bayesian information criterion, information bottleneck and agglomerative hierarchical clustering [22, 5, 25, 20, 14]. In one study, speech from children and adults was found to be sufficiently different in the embedding space to justify speaker classification. A recent work using DNN embeddings (x-vectors) explored augmentation of child speech for PLDA training. The authors also observed that splitting the adult speech into gender-specific portions while training the PLDA improved diarization performance. However, similar to most of the above works, the authors do not make any distinctions within the child speech.
Training a conventional child-adult classifier from speech has at least two major issues: 1) large within-class variability, especially for child, arising from age, gender and clinical symptom severity; and 2) a lack of sufficient amounts of balanced training data needed to tackle the first issue. We propose to address these issues using meta-learning, also known as learning-to-learn. Meta-learning consists of two optimizations: the conventional learner, which learns within a task, and a meta-learner, which learns across tasks. This is in contrast to conventional supervised learning, which operates within a single task for training and testing, and learns across samples. Meta-learning is inspired by the human learning process for rapid generalization to new tasks; for instance, children who have never seen a new animal before can learn to identify it using only a handful of images. As a consequence, meta-learning has demonstrated success in low-resource applications [18, 6] in computer vision in recent years.
In this work, we model each session in the training set as a separate task. Hence, each task consists of two classes: child and adult from the particular session. During training, classes are not shared across tasks, i.e., child in one session is a separate class from child in another session. By optimizing the network to discriminate between child-adult speaker pairs across all training tasks (sessions), we mitigate the influence of within-class variabilities. Further, we remove the need for large amounts of training data by randomly sampling training and testing subsets (referred to as supports and queries respectively in meta-learning) within each batch. We evaluate our proposed method under two settings: 1) Weak supervision - a handful of labeled examples are available from the test session, and 2) Unsupervised - automated clustering. The latter is similar to conventional speaker clustering in diarization systems. We show that the learnt representations outperform baselines in both settings.
2.1 Meta learning using prototypical networks
Meta-learning methods were introduced to address the problem of few-shot learning, where only a handful of labeled examples are available in new tasks not seen by the trained model. Deep metric-learning methods were developed within the meta-learning paradigm to specifically address generalization in the few-shot scenario. We choose prototypical networks (protonets), which presume a simpler learning bias than other metric-learning methods and have demonstrated state-of-the-art performance in image classification and in natural language processing tasks such as sequence classification. Protonets learn a non-linear transformation into an embedding space where each class is represented by a single point, namely the centroid (prototype) of the examples from that class. During inference, a test sample is assigned to the class of the nearest centroid.
Our application of protonets to speaker classification is motivated by the fact that participants in a test session represent unseen classes, i.e., speakers in an audio recording to be diarized are typically assumed to be unknown. However, the target roles, namely child and adult, are shared across train and test sessions. Hence, by treating child-adult speaker classification in each train session as an independent task, we hypothesize that protonets learn the common discriminating characteristics between child and adult classes irrespective of local variabilities which might influence the task performance.
Metric learning has previously been explored for learning speaker embeddings. Other than a recently proposed work which used the protonet loss function for speaker ID and verification, to the best of our knowledge this work is one of the early applications of protonets for speaker clustering. In the following, we illustrate the protonet training process using a single batch, then extend it to multiple training sessions.
2.1.1 Batch training
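Consider a set of $N$ labeled training examples from $K$ classes, $X = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, where each sample $x_i$ is a vector in $D$-dimensional space and $y_i \in \{1, \ldots, K\}$. Protonets learn a non-linear mapping $f_\theta : \mathbb{R}^D \rightarrow \mathbb{R}^M$ where the prototype of each class is computed as follows:

$$c_k = \frac{1}{|S_k|} \sum_{(x_i, y_i) \in S_k} f_\theta(x_i) \tag{1}$$

Here $S_k$ represents the set of train samples belonging to class $k$. For every test sample $x$, the posterior probability of class $k$ is as follows:

$$p_\theta(y = k \mid x) = \frac{\exp\left(-d(f_\theta(x), c_k)\right)}{\sum_{k'} \exp\left(-d(f_\theta(x), c_{k'})\right)} \tag{2}$$

where $d$ denotes the distance metric. While the choice of $d$ can be arbitrary, it was shown in [18] that using Euclidean distance is equivalent to modeling the supports using Gaussian mixture density functions, and that it empirically performed better than other functions. Thus, we use Euclidean distance in this work. Learning proceeds by minimizing the negative log probability for the true class $k$ using gradient descent:

$$J(\theta) = -\log p_\theta(y = k \mid x) \tag{3}$$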
Pseudo-code for training a batch is provided in Algorithm 1.
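Independently of Algorithm 1, the prototype, posterior and loss computations described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the learned mapping is replaced by the identity (the toy vectors are treated as if they were already embeddings), the function names `prototype`, `posteriors` and `batch_loss` are hypothetical, and the 2-D data is made up.

```python
import math


def prototype(embeddings):
    """Mean of the support embeddings of one class (the class prototype)."""
    dim = len(embeddings[0])
    return [sum(e[j] for e in embeddings) / len(embeddings) for j in range(dim)]


def posteriors(query, prototypes):
    """Softmax over negative Euclidean distances to each class prototype."""
    neg_d = {k: -math.dist(query, c) for k, c in prototypes.items()}
    shift = max(neg_d.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - shift) for k, v in neg_d.items()}
    total = sum(exps.values())
    return {k: v / total for k, v in exps.items()}


def batch_loss(supports, queries):
    """Mean negative log posterior of the true class over all queries."""
    protos = {k: prototype(v) for k, v in supports.items()}
    losses = [-math.log(posteriors(x, protos)[y]) for x, y in queries]
    return sum(losses) / len(losses)


# Toy 2-D "embeddings" (made up for illustration only).
supports = {
    "child": [[0.0, 0.0], [0.0, 2.0]],
    "adult": [[4.0, 0.0], [4.0, 2.0]],
}
queries = [([1.0, 1.0], "child"), ([3.5, 1.0], "adult")]
loss = batch_loss(supports, queries)
```

In an actual training loop the embeddings would come from the network, and the gradient of `batch_loss` with respect to the network weights would drive the update; the sketch only evaluates the quantities involved.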
2.1.2 Extension to multiple sessions
Consider $T$ sessions in the training corpus, with $n_k^t$ samples belonging to class $k$ in session $t$. We iterate through each session $t \in \{1, \ldots, T\}$, and randomly sample $N_S$ examples each from child and adult without replacement. These samples (supports) are used to construct the prototypes using Equation (1). From the remaining samples, a number of samples determined by the training batch size $B$ is chosen without replacement from each class. These samples (queries) are used to update the weights in a single back-propagation step according to Equation (3). Although a significant fraction of samples are not seen during a single epoch (1 epoch = $T$ batches), random sampling of supports and queries over multiple epochs improves the generalizability of protonets.
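The per-session sampling of supports and queries can be sketched as follows. The per-class query count used here, half the batch size minus the number of supports, is an assumption of this sketch rather than a figure taken from the text, and the names `sample_episode` and `epoch` as well as the toy corpus are hypothetical.

```python
import random


def sample_episode(session, n_support, batch_size, rng):
    """Draw supports and queries for one session (one task).

    `session` maps each class name to its list of samples. The per-class
    query count `batch_size // 2 - n_support` is an assumption of this
    sketch, capped by what remains once the supports are drawn.
    """
    supports, queries = {}, {}
    for label, samples in session.items():
        shuffled = rng.sample(samples, len(samples))  # permutation, no replacement
        n_query = max(0, min(batch_size // 2 - n_support,
                             len(samples) - n_support))
        supports[label] = shuffled[:n_support]
        queries[label] = shuffled[n_support:n_support + n_query]
    return supports, queries


def epoch(sessions, n_support, batch_size, seed=0):
    """One epoch: visit every session once, yielding one batch per session."""
    rng = random.Random(seed)
    for session in sessions:
        yield sample_episode(session, n_support, batch_size, rng)


# Toy corpus: 3 sessions, integer ids standing in for x-vectors.
sessions = [
    {"child": list(range(100)), "adult": list(range(100, 250))}
    for _ in range(3)
]
batches = list(epoch(sessions, n_support=20, batch_size=128))
```

Because a fresh random draw is made every time a session is visited, samples left out in one epoch can still appear as supports or queries in later epochs.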
2.2 Siamese networks
For unsupervised evaluation (clustering), we compare protonets with siamese networks, which learn a metric space to maximize pairwise similarity between same-class pairs and minimize similarity between different-class pairs. Specifically, we implement the variant used in speaker diarization, where the training label for each input pair represents the probability of belonging to the same speaker. The network jointly learns both the embedding space and the distance metric for computing similarity. In our work, we randomly select same-speaker (child-child, adult-adult) and different-speaker (child-adult) x-vector pairs to provide input to the model.
Fig. 1 illustrates the differences between siamese networks and protonets during training.
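The random selection of same-speaker and different-speaker pairs for the siamese network can be illustrated with a short sketch. The equal mix of the two pair types, the restriction to pairs within a single session, and the name `sample_pairs` are assumptions made for illustration; the text does not specify them.

```python
import random


def sample_pairs(session, n_pairs, rng):
    """Return (x1, x2, target) triples for one session.

    target is 1.0 for a same-speaker pair and 0.0 for a different-speaker
    pair. The 50/50 mix of pair types is an assumption of this sketch.
    """
    labels = list(session)
    pairs = []
    for _ in range(n_pairs):
        if rng.random() < 0.5:
            # Same-speaker pair: child-child or adult-adult.
            speaker = rng.choice(labels)
            x1, x2 = rng.sample(session[speaker], 2)
            pairs.append((x1, x2, 1.0))
        else:
            # Different-speaker pair: child-adult.
            a, b = rng.sample(labels, 2)
            pairs.append((rng.choice(session[a]), rng.choice(session[b]), 0.0))
    return pairs


# Integer ids stand in for x-vectors: ids below 100 are child, the rest adult.
session = {"child": list(range(50)), "adult": list(range(100, 150))}
pairs = sample_pairs(session, n_pairs=200, rng=random.Random(0))
```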
| Corpus | Duration (min), mean ± std. | Child age (yrs), mean ± std. | # Utts (Child) | # Utts (Adult) |
|---|---|---|---|---|
| ASD | 17.76 ± 11.99 | 9.02 ± 3.10 | 11045 | 20313 |
| ASD-Infants | 10.35 ± 0.51 | 1.87 ± 0.78 | 1371 | 4120 |
We select two types of child-adult interactions from the ASD domain: the gold-standard Autism Diagnostic Observation Schedule (ADOS), which is used for diagnostic purposes, and a recently proposed treatment outcome measure, the Brief Observation of Social Communication Change (BOSCC), for verbal children who fluently use complex sentences. The ADOS Module 3 typically lasts between 45 and 60 minutes and includes over 10 semi-structured tasks. The ADOS produces a diagnostic algorithm score which can be used to classify children into ASD and non-ASD groups. BOSCC, on the other hand, is a treatment outcome measure used to track changes in social-communication skills over the course of treatment in individuals with ASD, and is applicable in different collection settings (clinics, homes, research labs). A BOSCC session typically lasts 12 minutes and consists of 4 segments (two 4-minute play segments with toys and two 2-minute conversation segments). We used a combination of ADOS (n=3) and BOSCC (n=24) sessions, which were administered by clinicians and manually labeled by trained annotators for speaking times and transcripts. We refer to this corpus as ASD. The sessions in ASD cover sources of variability in child age, collection centers (4) and amount of available speech per child (Table 1).
To check generalization performance, we train our models on ASD and evaluate on a different child-adult corpus within the autism diagnosis and intervention domain. The ASD-Infants corpus (Table 1) consists of BOSCC (n=12) sessions with minimally verbal toddlers and preschoolers with limited language (nonverbal, single words or phrase speech). As opposed to ASD, these sessions are administered by a caregiver, and represent a more naturalistic data collection setup aimed at early behavioral assessments with a familiar adult. The age differences between children in the two corpora provide a significant domain mismatch.
3.2 Features and Model Architecture
We use x-vectors from the CALLHOME recipe (https://kaldi-asr.org/models/m6) as pre-trained audio embeddings in this work; x-vectors have demonstrated state-of-the-art performance in speaker diarization and recognition systems. X-vectors are fixed-length embeddings extracted from variable-length utterances using a time-delay neural network followed by a statistics pooling layer. In all our experiments, 128-dimensional x-vectors are input to a feed-forward neural network with 3 hidden layers (128, 64 and 32 units per layer). Embeddings from the third hidden layer (32-dimensional) are treated as speaker representations. Rectified linear unit (ReLU) non-linearity is used between the layers. Batch normalization and dropout (rate = 0.2) are used for regularization. The Adam optimizer ($\beta_1$ = 0.9, $\beta_2$ = 0.999) is used for weight updates. A batch size of 128 samples is employed. Since the ASD corpus contains only 27 sessions, we use nine-fold cross validation to estimate test performance. In each fold, 18 sessions are used for model training. The best model is chosen using the validation loss computed on 6 sessions. The remaining 3 sessions are treated as evaluation data. No two folds share data from the same speaker.
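The 18/6/3 session split per fold can be sketched as a simple rotation over nine groups of three sessions. How validation sessions were actually chosen is not stated, so the rotation scheme below (validate on the two groups following the evaluation group) is an assumption, as are the function name `nine_fold_splits` and the placeholder session ids.

```python
def nine_fold_splits(session_ids, n_folds=9, n_val_groups=2):
    """Partition sessions into train / validation / evaluation sets per fold.

    Sessions are cut into `n_folds` equal groups. Fold i evaluates on
    group i and validates on the next `n_val_groups` groups (an assumed
    rotation scheme); all remaining groups are used for training.
    """
    size = len(session_ids) // n_folds
    groups = [session_ids[i * size:(i + 1) * size] for i in range(n_folds)]
    folds = []
    for i in range(n_folds):
        val_idx = {(i + j) % n_folds for j in range(1, n_val_groups + 1)}
        fold = {"train": [], "val": [], "eval": list(groups[i])}
        for g, group in enumerate(groups):
            if g == i:
                continue  # evaluation group, already assigned
            fold["val" if g in val_idx else "train"].extend(group)
        folds.append(fold)
    return folds


# Placeholder ids for the 27 sessions.
folds = nine_fold_splits([f"session_{n:02d}" for n in range(27)])
```

Splitting at the session level keeps each evaluation session out of training within its fold; if a speaker appeared in more than one session, the groups would additionally need to be formed per speaker.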
3.3.1 Weak Supervision
We evaluate our models in a few-shot setting similar to the original formulation of protonets, which is equivalent to having sparsely labeled segments from the test session. In practice, such labels can be made available from the session through random selection or active learning. We train a baseline model using the architecture from Section 3.2 and a softmax layer to minimize the cross-entropy loss between child and adult classes. This model is directly used to estimate class posteriors on the testing data. We refer to this model as Base. We use a second baseline where the labeled samples from test sessions in each fold are made available during the training process, i.e., updating protonet weights using back-propagation (Base-backprop).
For protonets, we train two variants, P20 and P30, with 20 and 30 supports per class during training, respectively. A larger number of supports translates into more samples for reliable prototype computation; however, it results in fewer queries for back-propagation. During evaluation, 5 samples from each class in the test session are randomly chosen as training data. These samples are used to compute prototypes for child and adult, followed by minimum-distance based assignment for the remaining samples in that session. In order to estimate a robust performance measure for Base-backprop, P20 and P30, we repeat each evaluation 200 times by selecting a different set of 5 samples and compute the mean macro (unweighted) F1-score over the corpus.
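The evaluation protocol for a single test session (label 5 samples per class, build prototypes, assign the rest to the nearest prototype, repeat 200 times, average the macro F1-score) can be sketched as below. The synthetic 2-D data stands in for 32-dimensional protonet embeddings, and the helper names are hypothetical; averaging over all sessions of a corpus would wrap `evaluate` in one more loop.

```python
import math
import random


def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(dim)]


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1-scores."""
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def few_shot_trial(session, n_shots, rng):
    """Label n_shots samples per class, classify the rest by nearest prototype."""
    protos, rest = {}, []
    for label, embs in session.items():
        picked = set(rng.sample(range(len(embs)), n_shots))
        protos[label] = centroid([embs[i] for i in picked])
        rest += [(e, label) for i, e in enumerate(embs) if i not in picked]
    y_true = [label for _, label in rest]
    y_pred = [min(protos, key=lambda k: math.dist(e, protos[k])) for e, _ in rest]
    return macro_f1(y_true, y_pred, list(session))


def evaluate(session, n_shots=5, n_trials=200, seed=0):
    """Mean macro F1 over repeated random choices of the labeled samples."""
    rng = random.Random(seed)
    return sum(few_shot_trial(session, n_shots, rng) for _ in range(n_trials)) / n_trials


# Synthetic, well-separated 2-D points standing in for one test session.
data_rng = random.Random(1)
session = {
    "child": [[data_rng.gauss(0, 0.3), data_rng.gauss(0, 0.3)] for _ in range(40)],
    "adult": [[data_rng.gauss(5, 0.3), data_rng.gauss(5, 0.3)] for _ in range(60)],
}
score = evaluate(session)
```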
3.3.2 Unsupervised: Clustering
Clustering x-vectors using AHC and PLDA scores (trained with supervision) is an integral part of recent diarization systems. This method forms our first baseline. We note that the training data for the PLDA transformation represents a significant domain mismatch with our corpora. We use k-means and spectral clustering (using a cosine-distance based affinity matrix) as unsupervised clustering methods for comparing x-vectors, siamese embeddings and protonet embeddings. In the siamese network, the distance measure between a segment pair is learnt between outputs from the third hidden layer (32-dimensional). For protonets, we use the models trained for weak supervision and extract embeddings in the prototype space (32-dimensional) for clustering. We use purity as the clustering metric, which describes to what extent samples from a cluster belong to the same speaker.
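As an illustration of the metric, the sketch below computes purity in its common form: each cluster is credited with its most frequent speaker label, and the credited counts are divided by the total number of samples. Whether the paper uses exactly this variant is an assumption; the function name `purity` and the toy labels are hypothetical.

```python
from collections import Counter


def purity(cluster_ids, speaker_labels):
    """Fraction of samples carrying the majority speaker label of their cluster."""
    by_cluster = {}
    for cid, speaker in zip(cluster_ids, speaker_labels):
        by_cluster.setdefault(cid, Counter())[speaker] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(speaker_labels)


# Toy example: cluster 0 holds 3 child + 1 adult, cluster 1 holds 4 adult + 2 child.
clusters = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
speakers = ["child", "child", "child", "adult",
            "adult", "adult", "adult", "adult", "child", "child"]
value = purity(clusters, speakers)  # (3 + 4) / 10
```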
Weakly-supervised classification results are presented in Table 2. In general, both variants of protonet outperform the baselines significantly in their respective corpora (ASD: p < 0.05, ASD-Infants: p < 0.01). However, all models degrade in performance on the ASD-Infants corpus as compared to ASD. As mentioned before, the data from younger children presents a large domain mismatch between training and evaluation data, and we suspect this is the primary reason for the lower performance. Surprisingly, in ASD, updating network weights using samples from the test session (Base-backprop) reduces classification performance. We suspect that the network overfits on the labeled samples. However, in the case of ASD-Infants, the labeled samples from the test session provide useful information about the speakers, resulting in a modest improvement over a weaker Base. While protonets provide the best F1-scores in both corpora, the performance in ASD-Infants leaves room for improvement. We do not observe any significant difference between P20 and P30, suggesting that the performance is robust to the number of supports and queries during training.
Clustering x-vectors using AHC and PLDA scores results in a purity of 63.45% in ASD, which is significantly lower than both k-means and spectral clustering (SC) for all the models in Table 3. This suggests that the supervised PLDA models may be susceptible to unknown speaker types. Unsupervised PLDA adaptation using the x-vectors' mean and variance from ASD only marginally improves the performance to 64.32%; hence, we do not include this method in the rest of our comparisons. As opposed to classification, clustering performance does not degrade in ASD-Infants, suggesting that discriminative information between child and adult speakers within a session is preserved in all the embeddings compared in Table 3. Siamese networks present a modest improvement over x-vectors, up to 5.26% relative improvement for spectral clustering in ASD. However, protonets provide the best performance in both corpora. In particular, P20 results in slightly higher purity scores than P30 across clustering methods and corpora. Hence, a larger number of queries within a batch appears beneficial for speaker clustering in this work. We also note that the best clustering performance (P20) is better in the out-of-domain corpus. We believe that the younger ages of children in ASD-Infants compared to ASD might benefit the clustering process.
4.3 TSNE Analysis
We provide a qualitative analysis using TSNE in Figure 2. We collect embeddings from both child and adult from a single fold (3 sessions) in ASD and provide the TSNE visualizations for protonet embeddings and x-vectors. Embeddings from the child and adult classes are represented using 3 shades of red and blue respectively, one shade for each session. Although x-vectors cluster compactly within each speaker in a session, embeddings across sessions from the same class are spread apart. Protonets are able to cluster within classes compactly, while preserving the discriminative information between classes. In particular, embeddings belonging to child (which are expected to cover more sources of variability) are as compact as embeddings from adult. This suggests that protonets are able to learn across within-class variabilities for child-adult classification from speech.
In this work, we used meta-learning to perform child-adult speaker classification in spontaneous conversations. By modeling speaker classification from different sessions as separate tasks, we train protonets to learn speaker representations invariant to local variabilities. Using weakly-supervised and unsupervised settings, we show that protonets outperform x-vectors. Further, protonets outperform siamese networks for clustering when trained on the same input representations (x-vectors). In the future, we would like to train a generic speaker diarization system using protonets. Protonets are a suitable choice for this problem, since an arbitrary number of speakers can be accommodated in every training session, and speaker identities need not be shared across sessions.
[1] (2018) Prevalence of autism spectrum disorder among children aged 8 years: autism and developmental disabilities monitoring network, 11 sites, United States, 2014. MMWR Surveillance Summaries 67 (6), pp. 1.
[2] Use of machine learning to improve autism screening and diagnostic instruments: effectiveness, efficiency, and multi-instrument fusion. Journal of Child Psychology and Psychiatry 57 (8), pp. 927–937.
[3] (2017) TristouNet: triplet loss for speaker turn embedding. In ICASSP, pp. 5430–5434.
[4] (2011) Learning speaker-specific characteristics with a deep neural architecture. IEEE Transactions on Neural Networks 22 (11), pp. 1744–1756.
[5] (2018) Talker diarization in the wild: the case of child-centered daylong audio-recordings. In Interspeech, pp. 2583–2587.
[6] (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, Volume 70, pp. 1126–1135.
[7] (2017) Speaker diarization using deep neural network embeddings. In ICASSP, pp. 4930–4934.
[8] (2016) Measuring changes in social communication behaviors: preliminary development of the brief observation of social communication change (BOSCC). Journal of Autism and Developmental Disorders 46 (7), pp. 2464–2479.
[9] (2000) The development of phonemic categorization in children aged 6–12. Journal of Phonetics 28 (4), pp. 377–396.
[10] Siamese neural networks for one-shot image recognition. ICML Deep Learning Workshop, Vol. 2.
[11] (2016) Objective language feature analysis in children with neurodevelopmental disorders during autism assessment. In Interspeech, pp. 2721–2725.
[12] (1999) Acoustics of children's speech: developmental changes of temporal and spectral parameters. The Journal of the Acoustical Society of America 105 (3), pp. 1455–1468.
[13] (2000) The autism diagnostic observation schedule-generic: a standard measure of social and communication deficits associated with the spectrum of autism. Journal of Autism and Developmental Disorders 30 (3), pp. 205–223.
[14] (2016) Speaker independent diarization for child language environment analysis using deep neural networks. In 2016 IEEE Spoken Language Technology Workshop (SLT), pp. 114–120.
[15] (2018) Diarization is hard: some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge. In Interspeech, pp. 2808–2812.
[16] (2009) Active learning literature survey. Technical report, University of Wisconsin-Madison, Department of Computer Sciences.
[17] (2018) Transfer learning from adult to children for speech recognition: evaluation, analysis and recommendations. arXiv preprint arXiv:1805.03322.
[18] (2017) Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4077–4087.
[19] (2018) X-vectors: robust DNN embeddings for speaker recognition. In ICASSP, pp. 5329–5333.
[20] (2018) A novel LSTM-based speech preprocessor for speaker diarization in realistic mismatch conditions. In ICASSP, pp. 5234–5238.
[21] (2019) Centroid-based deep metric learning for speaker recognition. In ICASSP, pp. 3652–3656.
[22] (2018) A progressive deep learning approach to child speech separation. In ISCSLP, pp. 76–80.
[23] (2019) Multi-PLDA diarization on children's speech. In Interspeech, pp. 376–380.
[24] (2018) Diverse few-shot text classification with multiple metrics. In NAACL: Human Language Technologies, Volume 1 (Long Papers), pp. 1206–1215.
[25] (2016) Speaker diarization system for autism children's real-life audio data. In ISCSLP, pp. 1–5.