ClovaCall dataset and Pytorch LAS baseline code
Automatic speech recognition (ASR) via call is essential for various applications, including AI for contact center (AICC) services. Despite the advancement of ASR, however, most publicly available speech corpora such as Switchboard are dated. Also, most existing call corpora are in English and mainly focus on open-domain dialog or general scenarios such as audiobooks. Here we introduce ClovaCall, a new large-scale Korean call-based speech corpus collected under a goal-oriented dialog scenario from more than 11,000 people. ClovaCall includes approximately 60,000 pairs of a short sentence and its corresponding spoken utterance in a restaurant reservation domain. We validate the effectiveness of our dataset with intensive experiments using two standard ASR models. Furthermore, we release the ClovaCall dataset and baseline source code at https://github.com/ClovaAI/ClovaCall.
Call-based customer services are still prevalent in most online and offline industries. In particular, the call center has played a crucial role in most business domains for a few decades and has evolved into the contact center by providing additional functions such as email, VoIP, and text chatting (footnote 1: https://aircall.io/blog/call-center/contact-center-vs-call-center/). However, increasing operating costs and the harsh working environments of contact centers have made it necessary to apply artificial intelligence (AI) to contact center operation. AI for contact center (AICC) is an AI agent that communicates with human customers via call, and its adoption is rapidly increasing in B2B markets. Since AICC is based on a telephone environment, automatic speech recognition (ASR) via the telephone channel is essential for successful AICC operation.
ASR is one of the tasks that deep learning has remarkably improved since the early 2010s [22, 15]. It is well known that this improvement results from large-scale speech corpora including the Wall Street Journal, TIMIT, Switchboard, CallHome, and Librispeech datasets. However, most publicly available call speech corpora, such as Switchboard, Wall Street Journal, and CallHome, are dated because they were released over 20 years ago. Also, most call corpora are in English only, so call corpora for minor languages are very scarce. Another issue is that the utterances of most corpora come from open domains such as everyday conversation and audiobook content. Even though a small number of Korean speech corpora are publicly available (footnote 2: http://www.aihub.or.kr/aidata/105; footnote 3: https://github.com/goodatlas/zeroth), they contain general open-domain dialog utterances. Therefore, ASR models trained on these speech data generally show poor recognition performance when applied to domain-specific tasks due to the differences in data distribution and vocabulary. In particular, AICC requires an accurate ASR model to ensure precise intent classification and slot extraction from users' natural language utterances.
Here we propose ClovaCall, a new large-scale Korean speech corpus containing goal-oriented dialog utterances under a restaurant reservation scenario. ClovaCall includes approximately 61,000 pairs of short sentences and their utterances recorded via call by more than 11,000 people. Specifically, the number of unique sentences is 8,990, and all of them are natural language questions and answers that frequently appear when making reservations. Each utterance was recorded over the phone as a subject read a given sentence aloud. Because most sentences are written for reservations and are short, lasting at most 10 seconds, our dataset does not suffer from end point detection and alignment problems, unlike Librispeech, which is extracted from audiobooks. ClovaCall can be useful for diverse AICC-based reservation services because most words and expressions prevalent in reservations, including time, people, date, and location, are commonly used regardless of the application domain.
We demonstrate the effectiveness of the proposed ClovaCall with extensive experiments. We employ two standard ASR models, Deep Speech 2 (DS2) and Listen, Attend and Spell (LAS), under three training schemes: pretraining-finetuning, from-scratch training, and from-scratch training with data augmentation. In addition, we use two additional datasets for verification. One is an in-house Korean call-based goal-oriented dialog speech corpus of questions and answers about daily company life (QA Dataset), used to verify the necessity of task-specific data. The other is a large-scale Korean open-domain dialog speech corpus from AIHub, an online Korean data lake site, used for pretraining ASR models. Experimental results show that ASR models trained only on large-scale open-domain data provide very poor recognition performance. Thus, task-specific speech datasets are essential for speech recognition of goal-oriented dialogs such as AICC. Interestingly, pretraining with open-domain data remarkably improves ASR accuracy compared to from-scratch training with task-specific data only, even though the sampling rates of the two kinds of data differ.
Our main contributions are summarized as follows:
For the first time, to the best of our knowledge, we release a large-scale Korean goal-oriented dialog speech corpus.
We analyze the differences between general open domain and task-specific goal-oriented dialog corpora.
We demonstrate the efficacy of ClovaCall for AI for contact center with two standard ASR models under three training schemes.
Publicly available large-scale speech corpora allow ASR models to be applied to many valuable real-world applications. Early public speech corpora were released in the 1990s, including Wall Street Journal, TIMIT, Switchboard, and CallHome. These datasets are still prevalent as benchmarks for evaluating ASR models [1, 20, 19]. More recently, Librispeech has become the most popular benchmark speech corpus on which the latest state-of-the-art ASR models are evaluated [10, 13, 17]. Despite their usefulness, existing speech corpora mainly deal with general open-domain dialogs. Even though large-scale corpora are helpful for pretraining ASR models, models that are not finetuned with task-specific data are likely to provide poor recognition accuracy on user utterances in goal-oriented scenarios such as call centers and reservation services (see Sec. 4.2). This poor performance results from the distribution difference between open-domain and task-specific goal-oriented dialogs. However, compared to open-domain dialog speech corpora, goal-oriented speech corpora are rarely released publicly.
ClovaCall dataset construction is one of the main subtasks in the NAVER Clova AI for Contact Center (AICC) project (footnote 4: https://clova.ai/aicontactcenter). The goal of AICC is to develop an AI agent that can help human contact center employees communicate with customers via phone. From a technology perspective, the main functions of AICC include ASR, natural language understanding such as intent classification and slot filling, goal-oriented dialog management, response generation, and voice synthesis. All of these functions are crucial for high-quality AICC services. Here, we focus on the ASR component and construct a large-scale speech corpus concentrating on a restaurant reservation scenario.
ClovaCall contains 60,746 utterance and short sentence pairs for the restaurant reservation scenario, recorded via call. Data construction proceeded in the following order: 1) making a sentence pool, 2) recording utterances of the sentences via call, and 3) refining the recorded speech data.
Sentence pool. We utilized Crowdworks (footnote 5: https://www.crowdworks.kr/), a Korean crowdsourcing platform, to make a pool of candidate sentences. First, we defined 10 categories, 86 intents, and 7 multi-turn situations for restaurant call scenarios. The 10 categories, which are high-level topics, include reservation, delivery, and 8 FAQ categories such as working time, menu, and discount. Each of the 86 intents belongs to one of the 10 categories; examples include whether the restaurant is open now, closing time, and recommended menu. The 7 multi-turn situations, which could appear in a call to a restaurant, include reservation change and delivery call. The crowd-workers were asked to imagine and generate multiple interrogative or answer sentences for the given intents and situations. After a quality assurance process performed by human experts and the elimination of duplicated sentences, 8,990 sentences, mainly answer sentences, were selected to form the candidate pool.
| ||Word||Character||Phoneme||Utterance length|
|Max / Min||17 / 1||48 / 3||116 / 5||30s / 0.3s|
|+ 0 / 0.7|
Call-based recording utterances. Utterance recording was crowdsourced and operated by ourselves. Each crowd-worker is given 10 unique sentences and reads each of them aloud once or twice via call, producing at most 20 utterances, which are transmitted to our server. From 11,000 people, we gathered more than 120k pairs of short sentences and utterances. Unlike Librispeech, our data do not have end point detection and alignment problems because the utterances, which come from call-based reservation scenarios, are short enough. Besides the sentence, each utterance is labeled with an anonymous speaker index, which makes our data useful for speaker identification tasks as well.
Refining data. Data gathered via crowdsourcing are likely to be noisy, so it is essential to refine them. First, human experts engaged by CrowdWorks carried out a qualitative evaluation of the gathered data, from which we selected a total of 82,306 utterance-sentence pairs. This is the raw version of ClovaCall-Full. Next, we removed the leading and trailing silence regions below a specific energy level in the raw waveform of each utterance. We used librosa with 25 dB as the threshold for silence elimination. The silence-free data is called the clean version. Finally, we selected the top-30 intents with the most utterance-sentence pairs to form a dataset containing 60,746 pairs, which we call ClovaCall-Base. We release the clean version of ClovaCall-Base via https://github.com/ClovaAI/ClovaCall. From now on, ClovaCall denotes the clean version of ClovaCall-Base for convenience.
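The silence-removal step can be illustrated with a minimal sketch. The text names librosa and a 25 dB threshold but not the exact call; `librosa.effects.trim(y, top_db=25)` would be a natural candidate, which is an assumption. The pure-Python stand-in below follows the same idea (drop leading and trailing frames whose RMS energy is more than `top_db` below the loudest frame). The function name `trim_silence`, the frame sizes, and the lack of frame centering are illustrative choices, not the authors' implementation.

```python
import math


def trim_silence(samples, top_db=25.0, frame_length=2048, hop_length=512):
    """Drop leading/trailing frames whose RMS is more than top_db below the peak frame.

    Hypothetical helper: a simplified stand-in for an energy-based trim,
    not the exact routine used to build ClovaCall.
    """
    if not samples:
        return []
    # RMS energy of each (non-centered) analysis frame.
    rms = []
    last_start = max(len(samples) - frame_length, 0)
    for start in range(0, last_start + 1, hop_length):
        frame = samples[start:start + frame_length]
        rms.append(math.sqrt(sum(s * s for s in frame) / len(frame)))
    peak = max(rms)
    if peak == 0.0:
        return []  # the whole signal is digital silence
    # Convert the dB margin to a linear amplitude threshold relative to the peak.
    threshold = peak * 10.0 ** (-top_db / 20.0)
    loud = [i for i, value in enumerate(rms) if value >= threshold]
    first, last = loud[0], loud[-1]
    end = min(len(samples), last * hop_length + frame_length)
    return samples[first * hop_length:end]
```

Only the outer silence is removed; pauses inside the utterance are kept, which matches the description of trimming the starting and ending regions.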
We performed a statistical analysis on ClovaCall with one acoustic and three lexical attributes. To show the difference from an open-domain speech corpus, we compared our dataset with the AIHub dataset, the largest Korean open-domain dialog corpus. Fig. 1 illustrates the frequency histograms of words, characters, phonemes, and utterance length. Overall, most sentences in ClovaCall include more than 4 and fewer than 8 words, and more than 11 and fewer than 20 characters. Thus, the length of most utterances is more than 4 and less than 10 seconds. These distributions reveal the characteristics of the restaurant reservation scenario. Compared to those of AIHub, the frequencies of each attribute are more concentrated in a specific region. We conjecture that this pattern arises because most utterances in ClovaCall are likely to contain information for reservation, while open-domain dialog covers much more diverse topics and situations, including both very short responses such as “Yes” and “Sure” and long utterances. Table 1 shows the number of unique elements, mean, standard deviation, and maximum and minimum values for word, character, phoneme, and utterance length. Interestingly, the mean values of utterance length and silence time are very similar, which is caused by the call-based recording setup. This difference in properties between the two datasets underscores the need for goal-oriented dialog corpora.
Datasets. We use two additional datasets besides ClovaCall to verify the efficacy of our dataset. One is our in-house speech corpus of internal questions and answers about company life collected via phone calls, called the QA Call dataset. The other is a large-scale Korean open-domain dialog corpus from NIA AIHub (footnote 6: http://www.aihub.or.kr/aidata/105), an open data hub site of the Korean government. The AIHub speech is used for pretraining the ASR models. We also verify the results on ClovaCall-Full in addition to ClovaCall. While QA Call and ClovaCall are sampled at 8 kHz, AIHub contains speech recorded at a 16 kHz sampling rate. As shown in Table 2, we split ClovaCall-Base into 59,662 and 1,084 sentence-utterance pairs as training and test sets, respectively. The training set of ClovaCall-Full contains approximately 22,000 more pairs whose intents are excluded from ClovaCall-Base. No speaker or sentence appears in both sets. For QA Call, we extract the same number of sentence-utterance pairs as ClovaCall for the training set, and use more data for the QA Call test set for robust evaluation. For fair comparison, the amount of augmented data is similar to the pretraining data of AIHub, as explained in the next section. In addition, the finetuning data size of AIHub is equal to the training data size of the two goal-oriented datasets.
Training schemes. To verify the effectiveness of ClovaCall and the necessity of task-specific speech corpora, we employ three training scenarios: 1) pretraining and finetuning, 2) training from scratch, and 3) training from scratch with data augmentation. The AIHub dataset is used for pretraining. Also, an amount of AIHub data almost equal to the training portion of ClovaCall and QA Call was used for finetuning to investigate whether call-based goal-oriented utterance data are essential for task-specific services. We verify the effectiveness of pretraining with open-domain speech corpora by comparing the results to those of models trained on data enhanced by two data augmentation methods: noise augmentation and SpecAugment. We augmented data with noise using our in-house room simulator, adding different types of noise and reverberation that we obtained from daily environmental recordings.
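To make the second augmentation method concrete, here is a minimal sketch of SpecAugment-style masking on a spectrogram stored as a list of frequency rows. It applies one frequency mask and one time mask and omits time warping. The function name `spec_augment`, the mask widths, and the zero fill value are assumptions for illustration; the text does not give the augmentation settings that were used.

```python
import random


def spec_augment(spec, max_freq_mask=8, max_time_mask=20, rng=None):
    """Return a copy of spec with one frequency band and one time band zeroed.

    spec is a list of frequency rows, each a list of values over time.
    Hypothetical helper: mask widths and fill value are illustrative only.
    """
    rng = rng or random.Random()
    n_freq, n_time = len(spec), len(spec[0])
    out = [row[:] for row in spec]  # leave the input untouched

    # Frequency mask: width f drawn uniformly, then a start position f0.
    f = rng.randint(0, min(max_freq_mask, n_freq))
    f0 = rng.randint(0, n_freq - f)
    for i in range(f0, f0 + f):
        out[i] = [0.0] * n_time

    # Time mask: width t drawn uniformly, then a start position t0.
    t = rng.randint(0, min(max_time_mask, n_time))
    t0 = rng.randint(0, n_time - t)
    for row in out:
        row[t0:t0 + t] = [0.0] * t
    return out
```

Passing a seeded `random.Random` makes the masks reproducible, which helps when comparing training runs.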
First, we upsample the 8 kHz waveform datasets to 16 kHz so that all datasets have the same frequency resolution. All models use log-spectrograms as input, which are calculated with a 20 ms window and a 10 ms stride using librosa. In addition, all spectrograms are normalized by instance-wise standardization.
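As a small worked example of these settings, the sketch below converts the window and stride from milliseconds to samples and standardizes one spectrogram. The helper names `frame_params` and `standardize` are hypothetical, and reading "instance-wise" as one mean and one standard deviation per utterance (rather than per frequency bin) is an assumption.

```python
import math


def frame_params(sample_rate=16000, window_ms=20, stride_ms=10):
    """Convert window and stride from milliseconds to a number of samples."""
    window = sample_rate * window_ms // 1000
    stride = sample_rate * stride_ms // 1000
    return window, stride


def standardize(spec):
    """Shift and scale one spectrogram to zero mean and unit variance.

    Assumption: statistics are computed over the whole utterance.
    """
    values = [v for row in spec for v in row]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance) or 1.0  # avoid dividing by zero on constant input
    return [[(v - mean) / std for v in row] for row in spec]
```

At 16 kHz, a 20 ms window is 320 samples and a 10 ms stride is 160 samples, so consecutive frames overlap by half.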
ASR Models. We use two standard ASR models, DS2 and LAS, to verify the effectiveness of the proposed ClovaCall. DS2 consists of a CNN and an RNN. In our setting, the CNN module has two 2D convolutional layers with 32 channels, which reduce both the frequency and the time resolution of the input spectrogram with stride 4 and 2 for each layer. The RNN module consists of five bidirectional LSTM layers, each with 800 hidden units per direction, for a total of 1,600 units per layer. Next, one fully connected layer outputs the softmax distribution over characters. Finally, DS2 is trained with the CTC loss. More details of DS2 are described in the original DS2 paper. LAS is a sequence-to-sequence model consisting of an encoder, a decoder, and an attention module. The encoder consists of a CNN module followed by an RNN module. The CNN module is identical to that of DS2. The RNN module of the LAS encoder consists of three stacked bidirectional LSTMs with 512 units per direction. The decoder has two unidirectional LSTMs with 512 units and one fully connected layer to predict the character probability distribution. The attention learns the alignment between the encoder outputs and the decoder hidden states. Location-aware attention is employed to incorporate the attention context of the previous state. All experiments are performed on the NAVER Smart Machine Learning (NSML) platform [11, 21].
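Because DS2 emits a per-frame softmax over characters and is trained with CTC, its output has to be collapsed into a character string. The text does not say which decoding strategy was used, so the following is only a sketch of the simplest option, greedy (best-path) decoding: take the most likely symbol per frame, merge consecutive repeats, and drop the blank. The function name and the convention that index 0 is the blank are assumptions.

```python
def ctc_greedy_decode(frame_probs, alphabet, blank=0):
    """Greedy CTC decoding: argmax per frame, merge repeats, remove blanks.

    frame_probs: list of per-frame probability lists over the alphabet.
    alphabet: list of symbols; alphabet[blank] is the CTC blank (assumed index 0).
    """
    best_path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    chars = []
    previous = None
    for index in best_path:
        # A symbol is emitted only when it differs from the previous frame
        # and is not the blank; a blank between repeats keeps both copies.
        if index != previous and index != blank:
            chars.append(alphabet[index])
        previous = index
    return "".join(chars)
```

For instance, a best path of `a a _ a b b _` collapses to `aab`: the blank between the first two runs of `a` is what allows a doubled character to survive.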
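Metrics. We use character error rate (CER) as a metric:
$$\mathrm{CER}(\%) = \frac{\mathrm{dist}(X, Y)}{\mathrm{length}(Y)} \times 100$$
where $X$ and $Y$ are the predicted and the ground truth scripts, respectively. The distance $\mathrm{dist}(X, Y)$ is the Levenshtein distance between $X$ and $Y$, and $\mathrm{length}(Y)$ is the length of the ground truth script $Y$.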
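The metric can be sketched in a few lines: compute the Levenshtein distance between the predicted and ground truth scripts with dynamic programming, then divide by the ground truth length. The function names are illustrative, and scaling to a percentage is an assumption chosen to match how the results are reported.

```python
def levenshtein(a, b):
    """Edit distance between sequences a and b (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, 1):
        current = [i]
        for j, item_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                        # deletion
                current[j - 1] + 1,                     # insertion
                previous[j - 1] + (item_a != item_b),   # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(predicted, truth):
    """Character error rate in percent, normalized by the ground truth length."""
    if not truth:
        raise ValueError("ground truth script must not be empty")
    return 100.0 * levenshtein(predicted, truth) / len(truth)
```

Python strings iterate over Unicode code points, so for Korean text each precomposed Hangul syllable counts as one character. Because the denominator is the ground truth length, a prediction with many insertions can exceed 100%.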
|Models (parameters)||DS2 ||LAS |
|Pretraining and finetuning|
|From-scratch training with data augmentation|
|-Full /w NA||64.4||10.7||81.4||18.9|
|-Full /w SA||63.4||10.1||88.3||31.1|
Our experiments focus on verifying the effectiveness of task-specific speech corpora for specific AICC services. Table 3 shows the results of two popular ASR models under the three training scenarios described in Sec. 4.1. In the pretraining and finetuning scheme, although the general-domain AIHub dataset is the largest, the performance of ASR models trained only on AIHub is very poor. We conjecture that this poor performance results from the differences between open-domain and goal-oriented dialog datasets shown in Fig. 1. On the other hand, when pretrained with AIHub and finetuned with QA Call or ClovaCall, both models show remarkable improvement. This supports the necessity of using task-specific data for ASR models in real-world goal-oriented services.
In from-scratch training, both DS2 and LAS perform much better in the same domain than in a different domain, because when the domain shifts, the data distribution and vocabulary also change. Moreover, as in the pretraining-finetuning scheme, QA Call provides more stable and better ASR performance than ClovaCall. We conjecture that these results stem from the larger size of the QA Call test set. Also, QA Call contains a slightly larger vocabulary and more topics, even though both speech corpora belong to the goal-oriented dialog category.
In the data augmentation experiments, no meaningful gain was found. We conjecture that the two call datasets were already distorted by noise at the recording stage. In particular, the poor result of LAS on ClovaCall with SA is likely because overly strong noise-based regularization harms the capability of LAS, which has fewer parameters, even though the overall performance patterns of both models are similar.
These results confirm that task-specific speech corpora play a crucial role in improving ASR models for real-world goal-oriented dialog services such as AICC. Therefore, we expect that ClovaCall can contribute considerably to call-based reservation services. In addition, we find that learning effective representations by pretraining with general-domain data is also required to improve task-specific ASR models.
We did not perform experiments with a language model to enhance ASR accuracy because we mainly focus on the efficacy of our dataset in goal-oriented scenarios. Using language models can be expected to improve ASR accuracy.
We release ClovaCall, a large-scale Korean goal-oriented dialog speech corpus that is useful for AI for Contact Center (AICC) services. ClovaCall contains 111,853 short sentence and utterance pairs under a restaurant reservation scenario. To the best of our knowledge, our dataset is the first large-scale Korean goal-oriented dialog speech corpus. We verify the effectiveness of ClovaCall with two additional speech corpora under three training schemes: pretraining-finetuning, from-scratch training, and from-scratch training with data augmentation. Experimental results show that the ClovaCall dataset remarkably improves the performance of ASR models and is thus crucial for call-based restaurant reservation services. Furthermore, ClovaCall can contribute to ASR models for diverse call-based reservation services, considering that many reservation services share common expressions about working time, location, availability, and so on. For future work, we will extend our data to more call-based applications such as banking or delivery services.
The authors thank all members of Clova AI for supporting this work. In particular, we thank the DUET, AICC, and Speech teams, including Nako Sung and Icksang Han, for data preparation and insightful discussion.
Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4960–4964. Cited by: §1, §4.1, Table 3.
Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning, pp. 369–376. Cited by: §4.1.
Acoustic modeling using deep belief networks. IEEE transactions on audio, speech, and language processing 20 (1), pp. 14–22. Cited by: §1.
Light gated recurrent units for speech recognition. IEEE Transactions on Emerging Topics in Computational Intelligence 2 (2), pp. 92–102. Cited by: §2.