Progress in speech processing, such as speech recognition and text-to-speech, enables users to interact with smart devices through a voice user interface (VUI) rather than by controlling them directly. However, these techniques mainly focus on spontaneity during the interaction. Before that, it is practically important that the interaction starts well. Among the various criteria by which a device recognizes the start of an interaction, two typical approaches are keyword spotting and speaker verification.
Keyword spotting (KWS) is the task of detecting a prescribed spoken term in the input utterance. Many studies and applications have focused on pre-defined keywords, such as a product name used for wake-up or command words for specific actions [chen2014small, alvarez2019end, tang2018deep]. Meanwhile, driven by user convenience and the need for customization, open-vocabulary KWS has attracted interest because it lets users define arbitrary keywords. A typical way to handle arbitrary keywords is to express any word as an acoustic word embedding, a fixed-dimensional vector representation of an arbitrary-length word. These embeddings are trained on the acoustic similarity between the pronunciations of word pairs so that they encode acoustic information. For training, some approaches use cross-entropy loss [levin2013fixed, chen2015query], but triplet loss is mainly used because it maps the similarity directly to a relative distance in the embedding space [kamper2016deep, settle2017query, jung2019additional]. Recently, an approach that additionally considers phonetic information based on connectionist temporal classification (CTC) [graves2006connectionist] showed good results [lim2019interlayer]. Still, open-vocabulary KWS has a lot of room for improvement due to its challenging nature.
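The relative-distance property of triplet loss mentioned above can be illustrated with a minimal sketch. It uses the standard hinge form of the loss with Euclidean distance; the margin value, the toy two-dimensional embeddings, and the function name are assumptions made for illustration and are not taken from the cited works.

```python
import math


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss on fixed-dimensional embeddings.

    The loss is zero once the negative is farther from the anchor than
    the positive by at least `margin` (an assumed value).
    """
    d_pos = math.dist(anchor, positive)  # same word, different utterance
    d_neg = math.dist(anchor, negative)  # different word
    return max(0.0, d_pos - d_neg + margin)


# Toy embeddings (hypothetical): the first negative is already far away,
# the second one violates the margin.
anchor = [0.0, 0.0]
positive = [0.0, 1.0]
easy_loss = triplet_loss(anchor, positive, [3.0, 4.0])
hard_loss = triplet_loss(anchor, positive, [0.0, 1.5])
```

Only triplets that violate the margin contribute a non-zero loss, which is why the loss acts directly on relative distances rather than on class posteriors.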
Speaker verification (SV) is the task of verifying that the current speaker is a valid user. Here, we only deal with text-independent SV, which places no restrictions on the speech content. SV requires enrollment, the process of registering the user’s speaker identity. Speaker information is then extracted from each input utterance and compared with the enrolled data. For successful SV, this speaker information must be expressed as a speaker-discriminative representation. The most powerful recent approaches, based on deep neural networks, encode speaker information as a fixed-dimensional vector representation, the so-called speaker embedding. To learn discriminative embeddings, the networks are trained to classify speakers using cross-entropy loss [snyder2017deep, snyder2018x] or to group speakers in the embedding space using triplet loss [li2017deep, wan2018generalized]. A frequently criticized problem of these systems is that long utterances are needed for both the input and the enrollment to extract speaker information reliably. This is because, under the assumption of one speaker per utterance, the amount of accumulated information increases as the speech lengthens. To address this problem, several pooling methods [okabe2018attentive, jung2019spatial, jung2019self] have been proposed to weight the relevant speech frames. However, if the input is not long enough, their performance still degrades. Accordingly, many short-duration SV studies aim for high performance even with a short utterance [kanagasundaram2011vector, bhattacharya2017deep, jung2019short].
In KWS and SV, acoustic and speaker information each treat the other as a marginal feature that should be suppressed for robust discriminative learning, yet the two tasks have been handled independently. The ideal situation we envision is that the device detects the keyword and verifies the user at the same time, using a short word-level utterance defined by the user. In other words, open-vocabulary KWS and short-duration SV will eventually operate on the same input under the same conditions.
In this paper, we therefore propose a multi-task network that performs KWS and SV simultaneously by fully utilizing acoustic, speaker, and phonetic information. The multi-task network consists of an enhancement network, an acoustic feature extraction network, a speaker feature extraction network, and a pooling network. The sub-networks are trained jointly, either being shared or contributing to one another. In this process, we also introduce two novel techniques: CTC-based soft voice activity detection (VAD) and global query attention. We evaluate the proposed approach on discrimination tasks for KWS and SV, respectively. Experimental results demonstrate that the acoustic, speaker, and phonetic domains are interrelated and that integrating them is effective for learning discriminative embeddings, even under noisy, open-vocabulary, and short-duration conditions. We also present a visualization example to give an intuitive understanding of the proposed methods, and the results of ablation experiments to show the effectiveness of the multi-task network.
2 Multi-task Network
2.1 Enhancement Network
On the premise that both KWS and SV performance improve when the noise component is removed from the input speech, we share the enhancement network with the next two sub-networks. The enhancement network comprises two dilated convolutional neural networks (CNNs) cascaded with two residual paths and subtractions (Fig. 2(a)). Each dilated CNN consists of 5 convolution layers, whose parameters are given in Tab. 2.3. We extract a 256-dimensional log-magnitude spectrogram X with a frame length of 25 ms and a shift of 10 ms, yielding a T × 256 input, where T is the number of frames. Each dilated CNN estimates a spectral distortion, which is then subtracted from its input. After the two consecutive subtractions, we use the output feature vectors as the enhanced spectrogram.
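As a rough illustration of the framing arithmetic and the two consecutive subtractions described above, the following sketch counts the frames for a 25 ms window and 10 ms shift, then applies two subtraction stages. The 16 kHz sampling rate, the no-padding framing convention, and the `constant_distortion` placeholder (standing in for a trained dilated CNN) are all assumptions made for illustration; the sketch is one plausible reading of the cascade, not the authors' implementation.

```python
def num_frames(num_samples, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Number of full analysis frames without padding (16 kHz is assumed)."""
    frame_len = sample_rate * frame_ms // 1000
    shift = sample_rate * shift_ms // 1000
    if num_samples < frame_len:
        return 0
    return 1 + (num_samples - frame_len) // shift


def subtract(spec, distortion):
    """Element-wise spec - distortion for [frames][bins] nested lists."""
    return [[x - d for x, d in zip(x_row, d_row)]
            for x_row, d_row in zip(spec, distortion)]


def enhance(spec, estimate_first, estimate_second):
    """Two cascaded stages: each estimates a distortion and subtracts it."""
    stage1 = subtract(spec, estimate_first(spec))
    return subtract(stage1, estimate_second(stage1))


def constant_distortion(spec, level=0.5):
    """Hypothetical placeholder for a trained dilated CNN estimator."""
    return [[level for _ in row] for row in spec]


frames = num_frames(16000)          # one second of audio at the assumed rate
noisy = [[3.0, 2.0], [1.0, 4.0]]    # toy 2-frame, 2-bin "spectrogram"
enhanced = enhance(noisy, constant_distortion, constant_distortion)
```

Under these assumptions one second of audio gives 98 frames, and the second stage operates on the output of the first, so each stage only has to model the distortion left over by the previous one.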
2.2 Acoustic Feature Extraction Network
In the acoustic feature extraction network, we use two 2-layer bi-directional long short-term memory (LSTM) [hochreiter1997long] modules hierarchically (Fig. 2(b)) to represent the enhanced spectrogram as two sets of frame-level acoustic feature vectors. We also feed the acoustic features into two linear layers, the first with ReLU activation and the second with log-softmax, to capture frame-level phonetic information. The output of the second layer gives the log-probabilities of observing the CTC labels, where each label is an element of the set consisting of the characters and the blank. After training with CTC loss, these log-probabilities become a precise indicator of phonetically important frames. To exploit this property, we tried to transform them into soft VAD [mclaren2015softsad] posteriors. However, two problems arose: the probabilities are so spiky that most frames are ignored, and the blank indicates not only non-speech frames but also the repetition of the previous character. We therefore add a 1-layer bi-directional LSTM module to make the distribution smoother. The LSTM states are followed by a 1-dimensional projection and a sigmoid activation. We call the output vector the CTC-based soft VAD posteriors.
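The spikiness problem can be made concrete with a small sketch. It assumes a naive conversion in which the speech posterior of a frame is one minus the blank probability; this conversion, the toy logits, and the function names are assumptions for illustration, since the text does not specify how the transformation was attempted.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one frame of label logits."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_sum for v in logits]


def naive_soft_vad(log_probs, blank=0):
    """Assumed naive mapping: speech posterior = 1 - P(blank) per frame."""
    return [1.0 - math.exp(frame[blank]) for frame in log_probs]


# Toy logits over the labels [blank, 'a', 'b'] for four frames that all
# contain speech; a CTC-trained model typically emits the character on a
# single frame and the blank on the others.
logits = [
    [8.0, 0.0, 0.0],
    [0.0, 8.0, 0.0],
    [8.0, 0.0, 0.0],
    [8.0, 0.0, 0.0],
]
posteriors = naive_soft_vad([log_softmax(frame) for frame in logits])
```

In this toy case only the second frame receives a posterior close to one, while the remaining speech frames are close to zero. This is the behavior that motivates smoothing the posteriors with an additional recurrent layer instead of using the blank probability directly.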
2.3 Speaker Feature Extraction Network
We use the bottleneck states Z of Eq. 1 as phonetic conditional vectors. To reduce the mismatch between the phonetic and speaker domains, Z passes through one convolution layer (Conv in Fig. 2(c)), which expands it to 3 channels. We concatenate the result with the enhanced spectrogram to obtain phonetically conditioned feature vectors. The advantage of phonetic conditioning is that the network has more knowledge of the input speech [zhou2019cnn], even in the short-duration condition, which makes it easier to suppress unnecessary phonetic variations. The rest of the network consists of 6 modified ResNet [he2016deep] modules. Each module has one convolution layer and two residual blocks, as described in Tab. 2.3. Since we want frame-level speaker information, we do not change the temporal shape. The speaker feature vectors are extracted at the final average and transpose layer.
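The shape bookkeeping of the conditioning and of the final layer can be sketched as follows. The layout `[channels][frames][bins]`, the single-channel spectrogram, the toy sizes, and the choice to average over the frequency axis are assumptions made for illustration; the text only states that the conditioning map has 3 channels and that the temporal shape is preserved.

```python
def shape(fmap):
    """Shape of a [channels][frames][bins] nested-list feature map."""
    return (len(fmap), len(fmap[0]), len(fmap[0][0]))


def concat_channels(a, b):
    """Concatenate two feature maps along the channel axis."""
    if shape(a)[1:] != shape(b)[1:]:
        raise ValueError("frame and bin dimensions must match")
    return a + b


def average_and_transpose(fmap):
    """Average over bins (assumed axis), then return [frames][channels]."""
    pooled = [[sum(row) / len(row) for row in channel] for channel in fmap]
    return [list(column) for column in zip(*pooled)]


num_frames_toy, num_bins_toy = 5, 4  # toy sizes, not the paper's dimensions
spectrogram = [[[1.0] * num_bins_toy for _ in range(num_frames_toy)]]
conditioning = [[[0.0] * num_bins_toy for _ in range(num_frames_toy)]
                for _ in range(3)]

conditioned = concat_channels(conditioning, spectrogram)
frame_level = average_and_transpose(conditioned)
```

Here `conditioned` has 4 channels and `frame_level` keeps one vector per frame, which mirrors the requirement that the number of frames is unchanged from input to output.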