Speaker counting is the task of estimating the number of people that are simultaneously speaking in successive segments of an audio recording. It can be seen as a subtask of speaker diarization, which consists in estimating who speaks and when, and which has long been limited to the case where one person speaks at a time, since it becomes much more complicated when several speech signals overlap [22, 1].
Although speaker counting has received relatively little attention in the speech processing literature as a problem in its own right, it can be an essential front-end module for more complex machine audition tasks, in particular source separation, localisation and tracking. Yet the vast majority of speech/audio source separation and localisation methods either assume that the number of sources to process is known a priori or has been previously estimated [24, 15, 23, 18], estimate it from some clustering of the separation/localisation features, or assume a maximum number of speakers [6, 14]. Speaker counting is also particularly useful for tracking, as it can help solve the difficult problem of detecting the appearance and disappearance of a speaker track over time as he/she starts/stops speaking [25].
In the literature, a few single-channel parametric methods for speaker counting correlate the number of speakers to some ad-hoc features extracted from the mixture signal. These methods fail to exploit the spatial discrimination between the different sound sources. Classical multichannel approaches, based on the eigenvalue analysis of the estimated spatial covariance matrix (SCM), take this information into account [27, 16, 28]. However, they cannot be used in an underdetermined setup. Methods based on clustering in the time-frequency (TF) domain remove this restriction [4, 3, 10, 12, 29], but are generally limited to the anechoic setting, and often require additional a priori information, e.g. the maximum number of concurrent sources in the processed sequence.
More recently, some attempts have been made to apply deep learning to audio source counting. In [20], Stöter et al. compared several representations in supervised learning for single-channel speaker counting. The best results were obtained with a bi-directional long short-term memory (bi-LSTM) neural network, short-time Fourier transform (STFT) features, and a classification configuration. This work was extended in [21] with a convolutional recurrent neural network (CRNN) named CountNet, which has shown superior performance against traditional methods for estimating the maximum number of simultaneous speakers within an audio excerpt of 5 seconds. The importance of learning from reverberant audio examples was discussed, as well as the extension to a multichannel setup, although the latter has not been investigated. Therefore, being essentially a single-channel method, CountNet is blind to the spatial aspect of the source counting problem.
The contributions of the present paper are twofold:
First, we evaluate the benefit of using a multichannel input in a neural network for the speaker counting problem. In the present study, we use the Ambisonics multichannel audio format, owing to the growing interest in it for interactive spatial audio applications such as Facebook 360 (https://facebookincubator.github.io/facebook-360-spatial-workstation/Documentation/SpatialWorkstation/SpatialWorkstation.html, visited on 02/03/2020) or YouTube (https://support.google.com/youtube/answer/6395969, visited on 02/03/2020).
Second, we tackle the challenging problem of estimating the number of speakers at a short-term frame resolution. Compared to the usual estimation on longer segments, this is a crucial novelty and advantage for exploiting speaker counting in further processes such as speaker separation and localization, since the latter are generally carried out on a short-term frame-by-frame basis. It would also enable a low-latency (possibly real-time) overall process.
We show that the proposed CRNN with Ambisonics multichannel inputs improves framewise counting performance over a single-channel CRNN and a state-of-the-art multichannel method.
II Proposed method
We now successively describe the input features, the output configuration, the mapping strategy and the network architecture we used for speaker counting.
II-A Input features
In the present study, we use the Ambisonics signal representation as a multichannel input of our neural network. The Ambisonics format is particularly well-suited to represent the spatial properties of a soundfield, and is, to some extent, agnostic to the microphone array configuration [5]. That said, we do not claim that this representation is better than other (more conventional) multichannel formats for the speaker counting task; its use here is a choice of convenience within a general and increasingly popular application framework.
The Ambisonics format is produced by projecting the recorded multichannel audio onto a basis of spherical harmonic functions. The number of retained coefficients defines the Ambisonics order: in practice, to obtain an Ambisonics representation of order $N$, one needs a spherical microphone array containing at least $(N+1)^2$ capsules. Since, in theory, the capsules would need to be perfectly coincident, a set of phase calibration filters is applied beforehand. The use of first-order Ambisonics (FOA) ($N=1$) has been shown to provide a neural network with sufficient spatial information for single- and multi-speaker localization [17, 18], thus motivating our choice as the input features. FOA provides a decomposition of the signal into the first four Ambisonics channels denoted $W$, $X$, $Y$, $Z$. Channel $W$ (order-0 spherical harmonic) represents the soundfield as if it was recorded by an omnidirectional microphone at the observation point. Channels $X$, $Y$ and $Z$ (order-1 spherical harmonics) correspond to the recordings of three polarized orthogonal bidirectional microphones. For a plane wave coming from azimuth $\theta$ and elevation $\phi$, and bearing a sound pressure $p$, the FOA components are given in the STFT domain by (we adopt the N3D Ambisonics normalization standard [5]):
$$
\begin{bmatrix} W(t,f) \\ X(t,f) \\ Y(t,f) \\ Z(t,f) \end{bmatrix}
=
\begin{bmatrix} 1 \\ \sqrt{3}\cos\theta\cos\phi \\ \sqrt{3}\sin\theta\cos\phi \\ \sqrt{3}\sin\phi \end{bmatrix} p(t,f),
$$
where $t$ and $f$ denote the STFT time and frequency bins, respectively.
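As a small illustration of the plane-wave encoding described above, the following sketch computes the four FOA gains under the N3D normalization. The function name `foa_plane_wave_gains` and its argument names are assumptions made for this example; they do not come from the paper.

```python
import math

def foa_plane_wave_gains(azimuth, elevation):
    """Return the N3D gains [W, X, Y, Z] applied to the sound pressure
    of a plane wave (angles in radians)."""
    sqrt3 = math.sqrt(3.0)
    w = 1.0                                              # omnidirectional
    x = sqrt3 * math.cos(azimuth) * math.cos(elevation)  # front-back
    y = sqrt3 * math.sin(azimuth) * math.cos(elevation)  # left-right
    z = sqrt3 * math.sin(elevation)                      # up-down
    return [w, x, y, z]

# Two opposite directions in the horizontal plane (illustrative values).
front = foa_plane_wave_gains(0.0, 0.0)
back = foa_plane_wave_gains(math.pi, 0.0)
```

In this example `front` and `back` differ only by the sign of the X gain, so their magnitudes coincide; this is the kind of specific spatial configuration for which magnitude-only features are ambiguous.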
Because the phase of the sound pressure is common to all channels, it does not provide much information for the spatial discrimination of different speakers. Taking the magnitude of the FOA vector entries only discards the sign of the trigonometric terms, which leads to ambiguities only for specific spatial configurations. In short, the spatial information of the FOA channels is mostly encoded in their magnitude. We thus select this input representation: the magnitude of the STFT of the four FOA channels is computed and stacked to give a three-dimensional tensor indexed by time frame, frequency bin and channel, which is the input feature for the neural network. The role of the number of frames, which is one of the parameters tested in our experiments, will be detailed in Section II-C. In our experiments, we use signals sampled at kHz, a ,-point STFT (hence ) with a sinusoidal analysis window and overlap. One frame thus represents ms of additional signal information.
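The feature extraction just described can be sketched as follows, assuming NumPy. The STFT size and hop used in the usage lines (512 and 256 samples) are hypothetical placeholders, not the values used in the paper, and `foa_magnitude_features` is a name chosen for this example.

```python
import numpy as np

def foa_magnitude_features(foa, n_fft, hop):
    """Stack the STFT magnitudes of the FOA channels.

    foa: array of shape (num_samples, 4) holding the W, X, Y, Z signals.
    Returns an array of shape (num_frames, n_fft // 2 + 1, 4).
    """
    # Sine analysis window.
    window = np.sin(np.pi * (np.arange(n_fft) + 0.5) / n_fft)
    num_frames = 1 + (foa.shape[0] - n_fft) // hop
    feats = np.empty((num_frames, n_fft // 2 + 1, foa.shape[1]))
    for t in range(num_frames):
        frame = foa[t * hop : t * hop + n_fft] * window[:, None]
        # Keep the magnitude only; the phase is discarded.
        feats[t] = np.abs(np.fft.rfft(frame, axis=0))
    return feats

# Hypothetical usage on random data standing in for an FOA recording.
rng = np.random.default_rng(0)
features = foa_magnitude_features(rng.standard_normal((2048, 4)), n_fft=512, hop=256)
```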
We consider speaker counting as a classification problem where each class corresponds to a different number of active speakers from 0 (i.e. only background noise) to a maximum of active speakers. During training, the output target encoding the class probabilities is a one-hot vector of size . The softmax function is used at the output layer of the neural network, to represent the probability distribution over the classes. The predicted number of speakers is the class with the highest output probability. For training we use the categorical cross-entropy loss.
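A minimal NumPy sketch of this output configuration is given below: one-hot targets, a softmax output, the categorical cross-entropy loss and the arg-max decision. The number of classes (4) and the logit values are arbitrary illustrative choices, not the paper's settings.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def one_hot(labels, num_classes):
    """Encode integer speaker counts as one-hot target vectors."""
    return np.eye(num_classes)[labels]

def categorical_cross_entropy(probs, targets, eps=1e-12):
    """Mean cross-entropy between predicted probabilities and one-hot targets."""
    return float(-np.mean(np.sum(targets * np.log(probs + eps), axis=-1)))

num_classes = 4  # hypothetical: classes 0 to 3 speakers
logits = np.array([[2.0, 0.5, 0.1, -1.0],
                   [0.0, 0.2, 3.0, 0.1]])
labels = np.array([0, 2])

probs = softmax(logits)
predicted_counts = np.argmax(probs, axis=-1)  # class with highest probability
loss = categorical_cross_entropy(probs, one_hot(labels, num_classes))
```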
II-C Sequence-to-sequence mapping
In [21], the maximum number of active speakers within a 5-s segment, i.e. a large sequence of short-term frames, was estimated. The whole segment was labeled with this maximum number of speakers, and a unique sequence-to-one decoding scheme was used to estimate this number for each segment. In contrast, in the present work, we target a much finer temporal resolution: we aim at predicting the total number of speakers for each short-term frame. (Of course, speaker counting with a lower time resolution can then be obtained by filtering the frame-wise results, e.g. with majority voting.) To this aim, for training, each short-term frame is labeled with the total number of active speakers within the frame. Although we consider frame resolution at decoding, we still want to exploit a larger local context, but at a smaller scale than the 5 s of [21]: for each frame to classify, we use a short signal sequence of frames as the corresponding input to the network. The number of frames in this sequence is within 10 to 30, i.e. a -ms to -s local context in our experiments. We treat the problem as a sequence-to-sequence scheme, combined with a one-frame shift. Each input sequence of frames produces a synchronized sequence of decoded class probability vectors. We actually classify only the last frame within a sequence before proceeding to a one-frame shift and repeating the process. Our experiments demonstrated that the best decoding performance for that last frame was obtained by selecting the output vector at position in the decoded sequence, which we adopt hereafter. A detailed analysis of this process will be reported in [8].
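The one-frame-shift scheme can be sketched as below, assuming NumPy. The names `sliding_sequences` and `framewise_decisions` are hypothetical, and the index of the selected output vector is left as the parameter `position` because its value is not fixed here.

```python
import numpy as np

def sliding_sequences(features, seq_len):
    """Cut a feature tensor (num_frames, F, C) into overlapping sequences
    of seq_len frames, shifted by one frame each."""
    num_seq = features.shape[0] - seq_len + 1
    return np.stack([features[i : i + seq_len] for i in range(num_seq)])

def framewise_decisions(seq_probs, position):
    """Select one probability vector per decoded sequence and take the arg-max.

    seq_probs: (num_sequences, seq_len, num_classes) network outputs.
    position: index of the output vector kept in each decoded sequence.
    """
    return np.argmax(seq_probs[:, position, :], axis=-1)

# Hypothetical usage with toy sizes.
toy_features = np.arange(8 * 2 * 1, dtype=float).reshape(8, 2, 1)
sequences = sliding_sequences(toy_features, seq_len=3)

rng = np.random.default_rng(0)
toy_probs = rng.random((sequences.shape[0], 3, 4))  # stand-in for network outputs
decisions = framewise_decisions(toy_probs, position=2)
```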
II-D Network architecture
To design our network, we took inspiration from the CountNet CRNN architecture [21] (itself inspired by [13]), which was shown to be effective in exploiting the spectral information within a single-channel mixture. We speculate that this type of network can benefit from the spatial information provided by the multichannel input.
The resulting architecture is illustrated in Fig. 1. The first part of the network is composed of two convolutional layers with and , filters per channel, respectively (applied in the time-frequency dimensions), ending with an max-pooling layer. This is followed by another two convolutional layers with and filters per channel, respectively, and again one max-pooling layer. Padding is applied to keep the same dimensions after the convolutions. The extracted feature maps are then reshaped into matrices, then fed into an LSTM layer with an output space of dimension . Finally, the output layer is a -unit softmax layer that maps each -D vector coming from the LSTM layer into the probability distribution of the number of speakers. After each convolutional layer, rectified linear unit (ReLU) activations are used, whereas in the LSTM cells, we use tanh activation except for the recurrent step, which uses hard-sigmoid.
Note that the temporal dimension is preserved throughout the network to give an output for each frame. In particular for the LSTM layer, as already stated in Section II-C, we used sequence-to-sequence decoding so that one output vector corresponds to one input frame of the input tensor. This contrasts with [21], where sequence-to-one decoding was used in the LSTM to find the maximum number of simultaneous speakers within a 5-second audio mixture. Our sequence-to-sequence set-up is driven by our goal to estimate the number of speakers at framewise resolution.
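To make the preservation of the temporal dimension concrete, the following sketch traces tensor shapes through a network of this general type. All sizes (filter counts, pooling factors, LSTM units, number of classes) are hypothetical placeholders, and the sketch assumes that pooling acts along the frequency axis only, which is one way to keep the number of frames unchanged.

```python
def trace_shapes(num_frames, num_bins, num_channels,
                 conv_filters, pool_sizes, lstm_units, num_classes):
    """Return (layer name, output shape) pairs for a CRNN that keeps
    the time axis intact from input to output."""
    shapes = [("input", (num_frames, num_bins, num_channels))]
    bins, depth = num_bins, num_channels
    for block, (filters, pool) in enumerate(zip(conv_filters, pool_sizes), start=1):
        for n in filters:
            depth = n  # padded convolution: time and frequency sizes unchanged
            shapes.append((f"conv{block}", (num_frames, bins, depth)))
        bins //= pool  # assumed frequency-only pooling: time size unchanged
        shapes.append((f"pool{block}", (num_frames, bins, depth)))
    # One feature vector per frame for the recurrent layer.
    shapes.append(("reshape", (num_frames, bins * depth)))
    shapes.append(("lstm", (num_frames, lstm_units)))       # sequence-to-sequence
    shapes.append(("softmax", (num_frames, num_classes)))   # one output per frame
    return shapes

# Hypothetical sizes, for illustration only.
shapes = trace_shapes(num_frames=20, num_bins=256, num_channels=4,
                      conv_filters=[(16, 16), (32, 32)], pool_sizes=[4, 4],
                      lstm_units=40, num_classes=6)
```

Every shape in `shapes` starts with the same number of frames, so each input frame has a matching output probability vector.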
To train and test our CRNN, we generated a dataset of synthesized speech mixtures with 0 (only spatially diffuse noise) up to speakers, who speak at random times throughout the signal. Inspired by the general methodology of [17], we used the spatial room impulse response (SRIR) generator [9] to simulate “shoebox” room configurations ( for training, for validation and for test). Room length, width and height were randomly chosen within m, m and m, respectively. The reverberation time was randomly set between ms and ms. To simulate up to speakers in the same room, for each room we generated SRIRs from different positions with respect to a spherical microphone array randomly located in the room at least m from the walls. This yielded a total number of SRIRs for training, for validation, and for test.
We used -kHz speech signals from the TIMIT dataset [7]. This dataset contains English sentences uttered by different speakers with different accents. For each room configuration, we created a set of speech mixtures of s length with a maximum number of speakers being between to . The general principle is to first create one-speaker signals by concatenating short sentences of one speaker with transitional silences, then mix those signals together. More specifically, a mixture signal with at most active speakers is generated as follows: i) randomly select a speaker; ii) initialize the signal with a silence of random length within s; iii) randomly pick a sentence from that speaker, concatenate it with the signal, and add a silence of random length within s; iv) repeat step iii) until a s signal is obtained. If too long, the final sentence is cropped and faded out in the last ms. Finally, v) convolve this dry single-speaker signal with one of the generated random SRIRs. The same procedure is repeated for the remaining speakers and other SRIRs from the same room to obtain single-speaker reverberant signals. Finally, the signals are added, plus a diffuse noise, to produce a fairly realistic reverberant “conversational” signal, with a number of speakers varying between to .
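Steps ii) to v) above can be sketched as follows, assuming NumPy. All lengths are in samples and are hypothetical. For brevity the sketch convolves with a single-channel impulse response rather than a multichannel SRIR, and it applies the fade-out unconditionally. The function names are chosen for this example only.

```python
import numpy as np

def build_speaker_track(sentences, total_len, max_silence, fade_len, rng):
    """Concatenate random sentences and random silences into one dry track."""
    pieces = [np.zeros(rng.integers(0, max_silence + 1))]      # step ii
    length = pieces[0].size
    while length < total_len:                                   # steps iii-iv
        sentence = sentences[rng.integers(len(sentences))]
        silence = np.zeros(rng.integers(0, max_silence + 1))
        pieces += [sentence, silence]
        length += sentence.size + silence.size
    track = np.concatenate(pieces)[:total_len].copy()           # crop the excess
    track[-fade_len:] *= np.linspace(1.0, 0.0, fade_len)        # fade out the end
    return track

def mix_speakers(dry_tracks, impulse_responses):
    """Convolve each dry track with its impulse response and sum (step v)."""
    wet = [np.convolve(d, h)[: d.size] for d, h in zip(dry_tracks, impulse_responses)]
    return np.sum(wet, axis=0)

# Hypothetical usage with toy "sentences" and impulse responses.
rng = np.random.default_rng(0)
toy_sentences = [np.ones(50), np.ones(80)]
tracks = [build_speaker_track(toy_sentences, 400, 30, 10, rng) for _ in range(2)]
mixture = mix_speakers(tracks, [np.array([1.0, 0.5]), np.array([0.5])])
```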
As a result of the above mixture generation (and due to the intermittent nature of speech), frames with a large number of speakers are less likely to occur than the ones comprising none or few speakers. Therefore, to generate a more class-balanced dataset, the probabilities to generate a mixture (for a given room) where , , , , are respectively set to , , , and .
The signal-to-interference ratio (SIR) between the first source and the other sources is set randomly within – dB. The diffuse noise sequences used in the mixture are of various kinds (crowd, traffic, engine, nature sounds) from the freesound dataset (https://freesound.org/). We use a random signal-to-noise ratio (SNR) between 10 and 20 dB with respect to the first source. The diffuse field was simulated by averaging the diffuse parts of two random SRIRs measured in a real reverberant room.
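Setting an SIR or SNR with respect to the first source amounts to rescaling the other signal to reach a target power ratio. A minimal sketch, assuming NumPy and assuming that power is measured as the mean square over the whole signal (the paper does not specify how power is measured):

```python
import numpy as np

def scale_to_ratio(reference, other, ratio_db):
    """Scale `other` so that power(reference) / power(scaled other)
    equals ratio_db decibels."""
    ref_power = np.mean(reference ** 2)
    other_power = np.mean(other ** 2)
    gain = np.sqrt(ref_power / (other_power * 10.0 ** (ratio_db / 10.0)))
    return gain * other

# Hypothetical usage: draw an SNR in the 10-20 dB range and apply it.
rng = np.random.default_rng(0)
first_source = rng.standard_normal(1000)
noise = 3.0 * rng.standard_normal(1000)
snr_db = rng.uniform(10.0, 20.0)
scaled_noise = scale_to_ratio(first_source, noise, snr_db)
measured_db = 10.0 * np.log10(np.mean(first_source ** 2) / np.mean(scaled_noise ** 2))
```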
In addition to the control of the inserted silences, we used the TIMIT word timestamps to detect silences within each sentence at sample resolution, and we used them to create frame-wise resolution labels for our dataset. A frame-level label is defined as the maximum number of speakers among all the samples of the frame.
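The frame-level labeling rule can be sketched as below, assuming NumPy and a boolean per-sample activity matrix with one row per speaker. The frame length and hop in the usage lines are toy values, and `frame_labels` is a name chosen for this example.

```python
import numpy as np

def frame_labels(activity, frame_len, hop):
    """Label each frame with the maximum number of simultaneously active
    speakers among its samples.

    activity: boolean array of shape (num_speakers, num_samples).
    """
    counts = activity.sum(axis=0)  # active speakers at each sample
    num_frames = 1 + (counts.size - frame_len) // hop
    return np.array([counts[t * hop : t * hop + frame_len].max()
                     for t in range(num_frames)])

# Toy example: two speakers overlapping on a single sample.
toy_activity = np.zeros((2, 12), dtype=bool)
toy_activity[0, 2:6] = True
toy_activity[1, 5:9] = True
labels_per_frame = frame_labels(toy_activity, frame_len=4, hop=2)
```

Because the label is a maximum over samples, a frame containing even one sample of overlap is counted as a two-speaker frame in this toy example.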
We took care to use different speakers and noise sequences in the training, validation and test sets so that the network is evaluated on speakers and noise sequences unseen during training. The overall duration of the generated mixture dataset is hours for training, and hours for validation and test.
In order to assess the influence of the temporal context on the speaker counting performance, we conducted experiments with different values for the number of frames in the input feature . We tested ( ms), ( ms) and ( s).
III-C Training procedure
III-D Metrics and baseline
As we are considering source counting as a classification problem, we use the per-class classification accuracy as a first metric. It measures the percentage of frames in the test set that are predicted in the same class as the ground truth. In addition, we also use the mean absolute error (MAE) per class $k$:
$$
\mathrm{MAE}(k) = \frac{1}{N_k} \sum_{i=1}^{N_k} \left| \hat{y}_i^{(k)} - k \right|,
$$
where $\hat{y}_i^{(k)}$ is the predicted class for frame $i$ of ground-truth class $k$, and $N_k$ is the total number of frames of class $k$ (here, the different frames of class $k$ are arbitrarily “re-indexed” from $1$ to $N_k$ for simplicity of presentation, although they are not all consecutive frames in the mixture signals).
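Both metrics can be computed per class as in the following sketch, assuming NumPy. The function name and the toy labels are illustrative, and accuracy is returned as a fraction rather than a percentage.

```python
import numpy as np

def per_class_metrics(y_true, y_pred, num_classes):
    """Return per-class accuracy and mean absolute error as two dicts."""
    accuracy, mae = {}, {}
    for k in range(num_classes):
        mask = y_true == k              # frames whose ground-truth class is k
        if not mask.any():
            continue                    # class absent from the test data
        accuracy[k] = float(np.mean(y_pred[mask] == k))
        mae[k] = float(np.mean(np.abs(y_pred[mask] - k)))
    return accuracy, mae

# Toy ground truth and predictions.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 0, 2])
accuracy, mae = per_class_metrics(y_true, y_pred, num_classes=3)
```

Unlike accuracy, the MAE penalizes a prediction in proportion to how far it lies from the true count, so it separates near misses from gross errors.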
To assess the advantage of using multichannel features, we trained and tested the same CRNN with single-channel input features, using only the omnidirectional (order-0) channel $W$.
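[Table I: per-class classification accuracy and MAE versus number of sources, for the single-channel and multichannel CRNNs.]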
Table I reports the classification accuracy and the MAE obtained for the single-channel and multichannel CRNNs, for . We trained each neural network 10 times, evaluated each of them on the test set, and averaged the results.
Our CRNN seems rather effective for the - and -source mixtures; its performance then gradually decreases as the number of concurrent speakers increases. The accuracy still remains at a satisfactory level for the multichannel configuration when , which obtains a score above % across all classes.
We see that the use of the multichannel FOA input yields better performance than the single-channel representation. For example, for , the accuracy is always better with the multichannel input than with the single-channel one, and for we can see that the multichannel input leads to either a performance close to the single-channel one (for and speaker) or a significant improvement (for , , , speakers). The improvement is even larger for a large number of sources, so spatial information seems to help the CRNN to better distinguish speakers.
Temporal context seems to have a slight impact on the performance. We see that the results for are slightly better than the results for , except for and speaker for which the performance is almost the same.
The proposed multichannel CRNN yields competitive speaker counting performance at a framewise resolution, which can be very useful for online speech analysis tasks such as speaker localization or diarization. It performs with high accuracy for -to--source mixtures, and decent accuracy in the difficult configuration of more than sources. The presented results indicate that supplying the network with multichannel features leads to a noticeable improvement in counting performance over the corresponding single-channel model. Yet, we conjecture that there is a considerable margin for improvement, e.g. by introducing higher-order Ambisonics (HOA) features, or by using a better-suited network architecture.
[1] (2012) Speaker diarization: a review of recent research. IEEE Transactions on Audio, Speech, and Language Processing 20 (2), pp. 356–370.
[2] (2003) Estimating the number of speakers by the modulation characteristics of speech. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).
[3] (2010) A robust method to count and locate audio sources in a multichannel underdetermined mixture. IEEE Transactions on Signal Processing 58 (1), pp. 121–133.
[4] Estimator for number of sources using minimum description length criterion for blind sparse source mixtures. International Conference on Independent Component Analysis and Signal Separation, pp. 333–340.
[5] (2001) Représentation de champs acoustiques, application à la transmission et à la reproduction de scènes sonores complexes dans un contexte multimédia. Ph.D. Thesis, Univ. Paris VI (in French).
[6] (2012) Acoustic source localization and tracking of a time-varying number of speakers. IEEE Transactions on Audio, Speech, and Language Processing 20 (4), pp. 1409–1415.
[7] (1993) DARPA TIMIT acoustic phonetic continuous speech corpus CD-ROM. NIST.
[8] (2020) Analysis of a multichannel convolutional recurrent neural network applied to speaker counting. In Forum Acusticum. To appear.
[9] (2006) Room impulse response generator. Technical report, Technische Universiteit Eindhoven.
[10] (2015) Unified approach for audio source separation with multichannel factorial HMM and DOA mixture model. In European Signal Processing Conference (EUSIPCO), Nice, France.
[11] (2014) Adam: A Method for Stochastic Optimization. arXiv:1412.6980.
[12] (2017) An EM algorithm for joint source separation and diarisation of multichannel convolutive speech mixtures. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, USA.
[13] (2015) Singing voice detection with deep recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia.
[14] (2019) Online localization and tracking of multiple moving speakers in reverberant environments. IEEE Journal of Selected Topics in Signal Processing 13 (1), pp. 88–103.
[15] S. Makino, T. Lee, and H. Sawada (Eds.) (2007) Blind Speech Separation. Springer.
[16] Sample eigenvalue based detection of high-dimensional signals in white noise using relatively few samples. IEEE Transactions on Signal Processing 56 (7), pp. 2625–2638.
[17] (2018) CRNN-based joint azimuth and elevation localization with the Ambisonics intensity vector. In International Workshop on Acoustic Signal Enhancement (IWAENC), Tokyo, Japan.
[18] (2019) CRNN-based multiple DoA estimation using acoustic intensity features for Ambisonics recordings. IEEE Journal of Selected Topics in Signal Processing 13 (1), pp. 22–33.
[19] (2010) Proposal of a new confidence parameter estimating the number of speakers – An experimental investigation. Journal of Information Hiding and Multimedia Signal Processing 1 (2), pp. 101–109.
[20] (2018) Classification vs. regression in supervised learning for single-channel speaker count estimation. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Canada.
[21] (2019) CountNet: estimating the number of concurrent speakers using supervised learning. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (2), pp. 268–282.
[22] (2006) An overview of automatic speaker diarization systems. IEEE Transactions on Audio, Speech, and Language Processing 14 (5), pp. 1557–1565.
[23] (2007) Robust localization and tracking of simultaneous moving sound sources using beamforming and particle filtering. Robotics and Autonomous Systems 55 (3), pp. 216–228.
[24] (2011) Probabilistic modeling paradigms for audio source separation. In Machine Audition: Principles, Algorithms and Systems, pp. 161–185.
[25] (2015) Multitarget tracking. In Wiley Encyclopedia of Electrical and Electronics Engineering, pp. 1–15.
[26] (2018) Determining the number of speakers from single microphone speech signals by multi-label convolutional neural network. In IEEE Conference of the Industrial Electronics Society (IECON).
[27] (1999) Detection: determining the number of sources. In Digital Signal Processing Handbook, pp. 1–10.
[28] Estimation of the number of sound sources using support vector machines and its application to sound source separation. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).
[29] (2019) Multiple sound source counting and localization based on TF-wise spatial spectrum clustering. In IEEE/ACM Transactions on Audio, Speech, and Language Processing.