How does a spontaneously speaking conversational agent affect user behavior?

05/02/2022
by   Takahisa Iizuka, et al.

This study investigated the effect on human interactants of a conversational agent’s synthetic voice trained on spontaneous speech. Specifically, we hypothesized that humans exhibit more social responses when interacting with a conversational agent whose synthetic voice is built on spontaneous speech. Typically, speech synthesizers are built on a speech corpus in which voice professionals read a set of written sentences. The synthesized speech is clear, as if a newscaster were reading the news or a voice actor were playing an anime character. However, this is quite different from the spontaneous speech we produce in everyday conversation. Recent advances in speech synthesis have enabled us to build a speech synthesizer on a spontaneous speech corpus and to obtain near-conversational synthesized speech of reasonable quality. Making use of this technology, we examined whether humans produce more social responses to a spontaneously speaking conversational agent. We conducted a large-scale conversation experiment with a conversational agent whose utterances were synthesized with a model trained either on spontaneous speech or on read speech. The results showed that subjects who interacted with the agent whose utterances were synthesized from spontaneous speech tended to show shorter response times and a larger number of backchannels. The results of a questionnaire showed that these subjects also tended to rate their conversation with the agent as closer to a human conversation. These results suggest that speech synthesis built on spontaneous speech is essential to realizing a conversational agent as a social actor.

1 Introduction

Nass and Moon[1] stated that humans interacting with computers perceive them as social actors and unconsciously behave socially towards them. In several socio-psychological experiments, they observed that humans react to computers in the same social manner as they do to humans. These findings, however, do not mean that humans always respond to computers as they do to humans. Rather, people rarely exhibit such anthropocentric reactions to the computers that surround us today. They treat voice assistants[2, 3] as mere machines that can be controlled by voice commands. People rarely respond to voice agents with backchannels such as “Uh-huh,” interjections such as “Wow!,” or emotional expressions such as laughter, as they do to humans. To make human-computer interaction closer to human-human interaction, we think it necessary to change the current human attitude toward computers to a more social one. In the “Computers are social actors” theory[4], cues that are closely associated with the human prototype, such as spoken words and interactivity, trigger behaviors prescribed for human-human interactions, which in turn cause mindless social responses toward machines. This suggests the possibility that more human-like cues from machines encourage more social responses from humans. We approach the realization of such a conversational agent as a social actor from the viewpoint of speech synthesis. We assume that the degree to which a conversational machine has social patiency[5] (= the capacity to have its face[6] threatened or affirmed by social action) reflects its paralanguage (= its way of speaking) as an observable that constitutes the human interactant’s levels of abstraction (LoA)[7]. In other words, we assume that humans do not take social actions toward a conversational machine as they do toward a human partly because current speech synthesis lacks some paralinguistic property that constitutes the LoA at which the machine is perceived as having social patiency.

Nonverbal social behaviors such as backchannels, laughs, expressive interjections, and repetitions, which Den and his colleagues[8] called response tokens, indicate that the listener is attending to the conversation and greatly facilitate human-human conversation. Backchannels are short utterances spoken by a listener, such as “Uh-huh” or “Yeah”. Maynard defined a backchannel as a brief expression sent by the listener while the speaker is speaking[9]. The role of a listener’s backchannel is to display that she/he understands what the speaker is saying or is paying attention to the speaker[10]. Expressive interjections are non-lexical speech sounds that indicate changes in the speaker’s cognitive or affective state. They are distinguished from another type of interjection, filled pauses, by their unique morphological and pragmatic properties[11]. Repetitions have social functions such as signaling involvement in the conversation, showing interest, concern, and surprise, and eliciting backchannel-like responses[12]. In addition to these response tokens, filled pauses play an important role in conversation. Studies on filled pauses suggest that they show hesitation or an intention to take a turn[13, 14].

In a human-human conversation, the speaker takes advantage of these social responses as indicators of the listener’s state of understanding and dynamically redesigns her or his speech accordingly. Once humans begin to show social responses to computers, such responses could be useful for smoother human-computer interaction as well, for example, for regulating the computer’s speech rate and response timing.

Studies on spoken dialogue systems involve dialogue management[15, 16], speech recognition[17, 18, 19], and speech synthesis[20, 21, 22, 23, 24, 25, 26]. Recent studies on speech synthesis[24, 25, 26] have dramatically improved the quality of synthesized speech, to the point where it is difficult to distinguish synthesized speech from a real human voice. However, this does not mean that current speech technology can fully reproduce all speech sounds that humans may utter in everyday conversation. So far, the construction of speech synthesizers has assumed written sentences read by voice professionals such as newscasters or voice actors. For this reason, current speech synthesizers speak as if a newscaster were reading the news or a voice actor were playing an anime character. In daily conversation, however, we speak spontaneously without a script. Spontaneous speech has a quite different nature from read speech, especially in its prosody [27, 28]. Speech in conversation may exhibit characteristic tone patterns, including the final rise-fall[29, 30, 31]. These prosodic cues are considered crucial not only for sounding like spontaneous speech, but also for meta-communicative roles such as turn coordination, which are intrinsic to real-time human interaction. Using a read speech corpus of scripted text to build a speech synthesizer means ignoring this important aspect of speech in conversation.

A number of studies aim to apply the nonverbal aspects of human-human interactions, such as turn-taking[32], anthropomorphization[33], backchannels[34], and gaze[35], to enhance human-computer interactions. Regarding nonverbal aspects of synthesized speech for spoken dialogue systems, Chiba et al.[36] showed the effects of emotional speech synthesis in non-task-oriented dialogue systems: the use of appropriate emotional expressions could improve subjective impressions such as dialogue richness and the likability of the agent. James et al.[37] showed that adding appropriate emotion to a robot’s speech could express empathy for users and was preferred to the robot’s typical voice. Misu et al.[38] investigated whether dialogue-style or monologue-style speech, used as training data for speech synthesizers, induced more backchannels from humans. They showed that synthesized speech in dialogue style induced more natural backchannels and nods than that in monologue style.

In almost all previous studies on spoken dialogue systems, including the above, the speech synthesizer used was trained with read speech. Despite the discrepancy between read and spontaneous speech pointed out above, there have been very few attempts to synthesize conversational speech using a spontaneous speech corpus. After an attempt at HMM-based speech synthesis using conversational speech[39], no work on spontaneous conversational speech synthesis appeared in Interspeech conferences or SSW (Speech Synthesis Workshop) presentations until Ben-David and Shechtman[40], except for the authors’ own work [41, 42]. One reason for this might be that it has been unclear what a spontaneously speaking machine could be useful for. The current study is the first to demonstrate that speech synthesis based on spontaneous speech is actually useful for making human-machine interaction more like human-human interaction.

Our study investigates the effect on human interactants of a conversational agent’s synthetic voice trained on spontaneous speech. We hypothesize that the synthetic voices of current conversational agents, built on read speech datasets, cause humans to treat the agents as mere machines rather than social actors. We also hypothesize that humans will exhibit more social responses when interacting with a conversational agent whose synthetic voice is built on a spontaneous speech dataset.

To test these hypotheses, we focused on the nonverbal behavior of humans as listeners, including backchanneling, interjections, laughing, and nodding, because humans rarely exhibit such behavior while interacting with existing spoken dialogue systems. We also observed the response time of human interactants. Timing is important because, in human-human communication, people attend closely to the turn-taking timing of their own and their partner’s turns, trying to restore the right timing when it goes awry[43]. On the other hand, if people do not regard a conversational agent as a social actor, they would not care about when to speak and would not mind keeping the agent waiting. So far, the nonverbal behavior of spoken dialogue systems has been studied extensively [32, 33, 34, 35], but studies on the nonverbal behavior of users have been very few[38, 44, 45, 46]. The conversation experiment designed in this paper focuses on the nonverbal behavior of human interactants. This allows us to examine whether humans produce more nonverbal responses, with shorter delays, to a spontaneously speaking conversational agent, i.e., their tendency to behave as social agents[5].

For the experiment, we set up two conversational agents: one whose utterances were synthesized using spontaneous speech, and one whose utterances were synthesized using read speech. Two groups of subjects, assigned to either the “spontaneous” or the “read” condition, participated in a chat with the conversational agent following an identical scenario. The frequency of the subjects’ response tokens and the distribution of their response times were then compared between the two conditions. A subjective evaluation using a questionnaire on impressions of the conversational agent was also conducted.

2 Speech synthesis

2.1 Corpus

2.1.1 Spontaneous speech corpus

For the spontaneous speech corpus, we used the Utsunomiya University Spoken Dialogue Database (UUDB) [47], in which participants (12 females and 2 males) were engaged in a “four-frame cartoon sorting task” to estimate the original order of shuffled cartoon frames. This corpus consists of 27 sessions and lasts about 130 min. In this study, the utterances of a female speaker, FTS, were used because her recorded speech was the longest in total duration in this corpus (about 18 min.).

2.1.2 Read speech corpus

For the read speech corpus, we used the Japanese speech corpus of Saruwatari-lab, the University of Tokyo (JSUT) [48]. JSUT contains speech data of a female who is not a professional speaker but has experience working with voices. We used all the subcorpora of JSUT, which are approximately 10 hours in length in total.

2.2 Method

As the speech synthesizer, we used Tacotron 2 [25]. Tacotron 2 is a combination of the Tacotron-style model[24] that generates mel-spectrograms from a sequence of characters and the modified WaveNet [49, 50] that conditions on mel-spectrograms to generate the waveform. Since this approach can directly learn the correspondence between characters and waveforms, it can generate speech of such high quality that it is difficult to distinguish from a real human voice.

In this study, we replaced the neural vocoder with MelGAN [51]. To train MelGAN, we used spontaneous monologue speech of 361 speakers in the Corpus of Spontaneous Japanese [52].
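To make the division of labor in this two-stage pipeline concrete, the following is a minimal sketch in which the acoustic model and the vocoder are generic stand-ins: the type aliases and the synthesize helper are illustrative assumptions, not part of the Tacotron 2 or MelGAN code bases.

```python
from typing import Callable
import numpy as np

# Stand-ins for the two trained networks:
#   an acoustic model (Tacotron 2 [25]) maps a character sequence to a mel-spectrogram,
#   a neural vocoder (MelGAN [51]) maps the mel-spectrogram to a waveform.
AcousticModel = Callable[[str], np.ndarray]   # text -> mel (n_mels, n_frames)
Vocoder = Callable[[np.ndarray], np.ndarray]  # mel  -> waveform (n_samples,)

def synthesize(text: str, acoustic_model: AcousticModel, vocoder: Vocoder) -> np.ndarray:
    """Two-stage synthesis: characters -> mel-spectrogram -> waveform."""
    mel = acoustic_model(text)
    return vocoder(mel)
```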

Two Tacotron 2 models were trained for synthesizing the agent’s utterances used in the conversation experiment. The “read” model was built on the read speech corpus, JSUT. Likewise, the “spontaneous” model was built on the spontaneous speech corpus, UUDB. Unlike the “read” model, however, training could not be done straightforwardly, because UUDB is a natural dialogue corpus. The treatment of nonverbal sounds such as laughter was one issue. For the present study, we simply ignored them; utterances containing nonlinguistic sounds were split so as to include spoken content only. Other phenomena unique to spontaneous speech are filled pauses and expressive interjections. Because these interjections have different acoustic properties from ordinary lexical sounds [53, 11], a dedicated vowel set was defined to transcribe these sounds for UUDB.

Another issue in using a natural dialogue corpus to build speech synthesizers is its insufficient size. To overcome this, a pretraining and fine-tuning approach was adopted. Starting from the “read” model trained on JSUT with a sufficient amount of data, the “spontaneous” model was trained by fine-tuning the initial model on UUDB. This allowed us to obtain near-conversational synthesized speech of reasonable quality, even with a small amount of data.
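As an illustration of this pretraining and fine-tuning scheme, the sketch below warm-starts a model from a “read” checkpoint and continues training on spontaneous-speech batches. This is a generic PyTorch sketch under assumed names (the checkpoint key "state_dict", the learning rate, and a model whose forward pass returns its training loss), not the actual training code.

```python
import torch

def fine_tune(model: torch.nn.Module, pretrained_ckpt: str, train_loader,
              lr: float = 1e-4, epochs: int = 10) -> torch.nn.Module:
    """Warm-start from the "read" checkpoint, then continue training on UUDB batches."""
    state = torch.load(pretrained_ckpt, map_location="cpu")
    model.load_state_dict(state["state_dict"])  # assumed checkpoint layout
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # typically smaller than the pretraining lr
    model.train()
    for _ in range(epochs):
        for batch in train_loader:
            optimizer.zero_grad()
            loss = model(batch)  # assumes the forward pass returns the training loss
            loss.backward()
            optimizer.step()
    return model
```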

In this study, the mel spectrograms were calculated using a short-time Fourier transform with a 50 ms frame size, a 12.5 ms frame hop, and a Hann window function, as in the original Tacotron 2 paper [25]. The model hyperparameters were set to the default values of NVIDIA’s implementation [54], except that the threshold for generating the stop token was set to 0.1, which was necessary to obtain stable outputs.
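For reference, a mel-spectrogram analysis with these frame settings can be written as follows. The sampling rate and the number of mel bands (80, as in Tacotron 2) are assumptions, and librosa is used here purely for illustration.

```python
import librosa
import numpy as np

def mel_spectrogram(wav_path: str, sr: int = 22050, n_mels: int = 80) -> np.ndarray:
    """Compute a mel-spectrogram with a 50 ms frame size and a 12.5 ms frame hop (Hann window)."""
    y, sr = librosa.load(wav_path, sr=sr)
    win_length = int(0.050 * sr)   # 50 ms frame size
    hop_length = int(0.0125 * sr)  # 12.5 ms frame hop
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=win_length, win_length=win_length, hop_length=hop_length,
        window="hann", n_mels=n_mels)
    return np.log(np.clip(mel, a_min=1e-5, a_max=None))  # log compression, a common choice
```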

2.3 Analysis of prosody

The “spontaneous” model produces speech that gives a quite different impression from conventional speech synthesizers built on read speech corpora, primarily due to its prosody. Specifically, the “spontaneous” model reproduces tone patterns characteristic of conversational speech, particularly phrase-final tones. To quantitatively compare the prosody synthesized by the “spontaneous” and “read” models, prosodic labelling based on J_ToBI [55] (Japanese Tones and Break Indices) was performed on the synthesized speech. J_ToBI describes Japanese prosody from two aspects: prosodic pitch (tones) and prosodic boundary strength (break indices). It defines a set of phrase-final boundary tone labels, L%, L%H%, and L%HL%, which roughly correspond to fall, rise, and rise-fall; the latter two combined tones constitute the boundary pitch movements (BPMs). Because most sentences in read speech corpora are declarative, and voice professionals generally avoid BPMs except at the end of a sentence when reading texts aloud, conventional speech synthesizers can essentially produce only speech without BPMs, unless interrogative sentences are specially handled. In contrast, spontaneous speech contains many BPMs. Therefore, we assumed that the “spontaneous” model would produce speech with a larger number of BPMs than the “read” model for a fixed set of texts.

Tone      “read” model   “spontaneous” model
L%        133            105
L%H%      9              20
L%HL%     0              17
Table 1: Frequencies of the phrase-final boundary tones appearing in the conversational agent’s synthesized utterances.

The distribution of phrase-final boundary tones in the synthesized utterances using the two models is shown in Table 1. Note that since the set of input texts was exactly the same, the total number of phrase-final boundary tones (corresponding to Break Index 2 [55]) was also the same for the two models. This result shows that there was no rise-fall pattern in the utterances synthesized with the “read” model, in contrast to the “spontaneous” model. This means that utterances synthesized with the “spontaneous” model reflected the prosodic properties of natural conversational speech used to train the model.
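A minimal sketch of how the tallies in Table 1 could be produced from the J_ToBI annotations is shown below; the input format (one boundary-tone label per phrase-final boundary) is an assumption for illustration.

```python
from collections import Counter

def count_boundary_tones(labels):
    """Tally phrase-final boundary tone labels (L%, L%H%, L%HL%) for one model's output."""
    allowed = {"L%", "L%H%", "L%HL%"}
    return Counter(lab for lab in labels if lab in allowed)

# Example usage with made-up label sequences:
read_counts = count_boundary_tones(["L%", "L%", "L%H%"])
spont_counts = count_boundary_tones(["L%", "L%HL%", "L%H%"])
print(read_counts, spont_counts)
```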

2.4 Subjective evaluation of synthesized speech

To compare the overall impressions of synthesized speech built from “read” and “spontaneous” models, a subjective evaluation test was performed. In addition to speech clarity as a common criterion in evaluating speech synthesizers, we also evaluated speech spontaneity, the degree to which the synthesized speech sounds like it was uttered on the spot without a script. The Likert scales for assessing clarity and spontaneity in the questionnaire were:

Clarity

Excellent

Good

Fair

Poor

Bad

Spontaneity

I am convinced that she was speaking what came to her mind on the spot.

I feel that she was speaking what came to her mind on the spot.

I am not sure whether she was speaking what came to her mind on the spot or a script.

I feel that she was speaking from a script.

I am convinced that she was speaking from a script.

The subjective evaluation test was conducted as a follow-up to the conversation experiment described in Sect. 3. The subjects were 26 undergraduate and graduate students who had also participated in the conversation experiment and agreed to participate in the additional experiment. The stimulus set consisted of 50 synthesized utterances per model, identical to those used in the conversation experiment described in the next section. The evaluation was conducted in a within-subjects design, i.e., each subject evaluated a total of 100 utterances.
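As a small illustration of how such ratings are aggregated, the snippet below treats each response as a score on the 5–1 scale and computes the per-model mean over all subjects and utterances; the variable names and example scores are hypothetical.

```python
from statistics import mean

def mean_rating(ratings_per_subject):
    """Average a model's ratings over all subjects and all of their rated utterances."""
    return mean(score for subject in ratings_per_subject for score in subject)

# Hypothetical ratings: each inner list holds one subject's 5-1 scores for one model's utterances.
read_clarity = [[5, 4, 4], [4, 5, 4]]
spontaneous_clarity = [[3, 3, 2], [3, 4, 3]]
print(mean_rating(read_clarity), mean_rating(spontaneous_clarity))
```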

The results of the subjective evaluation test are shown in Fig. 1. In the box-and-whiskers plots, the lower and upper hinges correspond to the first and third quartiles, the lower and upper whiskers extend from the hinges to the smallest and largest non-outlying values, and individual points correspond to outlying values. The mean clarity was 4.34 and 2.93 for the “read” and “spontaneous” models, respectively (Fig. 1(a)). From this result, it can be said that synthesized speech built on the read speech corpus was perceived as clearer overall than that built on the spontaneous speech corpus.

Figure 1: Comparison of synthesized speech with the “read” and “spontaneous” models from the viewpoints of clarity and spontaneity.

Note, however, that this does not necessarily mean that speech synthesis based on spontaneous speech is inferior. Rather, the question should be whether the synthesized speech is intelligible enough for smooth communication. It is natural to assume that our everyday speech is inherently less clear than read-aloud speech in, for example, news or dramas, since we try to save as much speech effort as possible to the extent that the speech act can still achieve its goal. Conversely, one would find it unnatural if someone spoke to her/him as clearly as a newscaster reading the news. From this perspective, we consider the clarity of the “spontaneous” model acceptable for the conversational agent’s voice.

The mean spontaneity was 1.80 and 3.78 for the “read” and “spontaneous” models, respectively (Fig. 1(b)). This result indicates that the “spontaneous” model tends to produce speech that sounds as if it were uttered on the spot, more so than the “read” model.

From these results, we found that speech synthesized with the “spontaneous” model indeed had the character of spontaneous speech. By using a spontaneous speech corpus for training, it is possible to synthesize speech that is close to our everyday speech, at least in some respects. Therefore, one might expect a human-machine interaction that is closer to human-human interaction when “spontaneous” synthesized speech is used as the machine’s voice.

3 Conversation experiment with a spontaneously speaking agent

3.1 Overview of the conversational agent

The conversational agent used for this experiment was implemented with MMDAgent [56]. MMDAgent is a platform for building spoken dialogue systems, with modules for speech recognition, speech synthesis, dialogue management, and 3D model motion management. Fig. 2 shows an overview of the conversational agent used in this experiment.

Figure 2: Overview of the conversational agent (adapted from [56].)

In this study, we did not use the default speech recognizer but instead applied the Wizard of Oz (WoZ) technique [57], in which an experimenter operates the agent without the subject’s knowledge. The reason for employing WoZ was to prevent negative impressions of the agent caused by speech recognition errors or unnatural response timing. The agent was designed to speak by playing back pre-synthesized utterances according to the wizard’s operations.

The dialogue scenario was designed so that the agent speaks almost unilaterally. In the scenario, right after the initial greeting, the agent asks the human interactant whether she/he is interested in traveling abroad. Whether she/he is interested or not, the agent talks about countries she would love to visit. After this, the agent continues to talk about various trivia concerning countries around the world. Sometimes she quizzes the human interactant, for example, “Do you know which country is most famous for pyramids?” Answering these quizzes is basically the only opportunity for the human interactant to take a turn, forcing her/him to be a listener the rest of the time.

Instead of writing the scenario by hand, we first recorded a dialogue between one of the authors and his close relative, in which the author improvised the role of the agent with no script at all. The recorded dialogue was then transcribed and transformed into an FST (finite-state transducer) for MMDAgent by an in-house tool. We consider it crucial to avoid handwritten scripts for the utterances of conversational agents, because the actual words of spontaneous utterances have different linguistic characteristics from “imaginary” words, and human interlocutors tend to behave differently in response [58].
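The in-house tool itself is not described in the paper, so the following is only a rough sketch of the idea: turning a transcribed sequence of agent utterances into a linear chain of state transitions that a dialogue manager such as MMDAgent can step through. The textual transition format shown is a simplification for illustration, not MMDAgent’s actual FST syntax.

```python
def transcript_to_fst(utterances):
    """Turn an ordered list of transcribed agent utterances into linear-chain transitions.

    Each transition (state, next_state, trigger, action) fires on the wizard's "next"
    command and plays back the pre-synthesized utterance associated with that state.
    """
    transitions = []
    for i, text in enumerate(utterances):
        transitions.append((i, i + 1, "WIZARD_NEXT", f"PLAY_UTTERANCE|{i:03d}|{text}"))
    return transitions

# Hypothetical example: the first lines of the transcribed improvised dialogue.
for state, nxt, trigger, action in transcript_to_fst(
        ["Hello!", "Are you interested in traveling abroad?"]):
    print(state, nxt, trigger, action)
```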

The wizard controlled the agent’s behaviors, which included triggering the next utterance in the scenario, judging whether the subject’s answer to a quiz was correct, triggering a backchannel, triggering an utterance encouraging the subject to speak casually, triggering a confirmation that the subject was attending, and triggering an expression to get the conversation back on track (such as “Anyway,”) when it was about to break down. Judging the correctness of quiz answers was necessary to determine the agent’s next action. Sending backchannels was intended to make the agent behave in a more human-like way while the subject was speaking. A previous study revealed that randomly generating acoustically different backchannels improves the naturalness of dialogue compared to repeatedly generating an identical backchannel [59]. Therefore, three similar but different backchannels were prepared and randomly selected for playback. The purpose of encouraging the subject to speak casually was to induce a relaxed and natural mood, as if the subject were talking with a friend. The purpose of asking whether the subject was listening was to prevent the subject from becoming a mere passive listener and to encourage reactions. This operation, however, was limited to at most twice per conversation.
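The random selection among the three prepared backchannels can be realized with a one-liner like the following; the audio file names are placeholders.

```python
import random

# Three similar but acoustically different backchannel recordings (placeholder file names).
BACKCHANNELS = ["backchannel_a.wav", "backchannel_b.wav", "backchannel_c.wav"]

def pick_backchannel() -> str:
    """Randomly select one of the prepared backchannels for playback [59]."""
    return random.choice(BACKCHANNELS)
```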

The appearance of the agent was replaced with a silhouette in order to prevent any inconsistency between the appearance of the agent and the individuality of the synthesized voice.

The system displayed subtitles simultaneously with the agent’s utterance. This prevented subjects from missing utterances even when the quality of synthesized speech was not sufficient.

3.2 Method

The subjects were 50 undergraduate and graduate students who were not engaged in speech research. They received both verbal and written explanations of the experiment, and provided written informed consent before the experiment. The experiment was approved by the Ethics Committee on Research Involving Humans, Utsunomiya University.

They were assigned to either the “spontaneous” or the “read” condition; that is, the experiment was conducted in a between-subjects design. In the “spontaneous” condition, the subject had a conversation with the agent whose utterances were synthesized by the “spontaneous” model described in Sect. 2.2. The “read” condition was identical except that the agent’s utterances were synthesized by the “read” model. A video excerpt of a conversation in the “spontaneous” condition is included as supplemental material.

It is not considered fair to have a subject participate in both conditions. If a subject interacted with both “spontaneous” and “read” agents, she/he would easily notice that the objective of the experiment was to test the effect of the agent’s voice. Eventually, the subject might also notice that one agent was speaking spontaneously unlike existing dialogue systems and, in an effort to be a “good subject,” might try to behave more favorably in her/his interaction with the “proposed” agent. Therefore, the conversation experiment should not be conducted in a within-subjects design, but in a between-subjects design.

As evaluation indices of how close the human-agent interaction is to human-human interaction, we examined the subjects’ response time to the agent, the number of response tokens (backchannels, expressive interjections, laughs, and filled pauses), and the number of nods. By investigating these nonverbal behaviors, it is possible to determine whether humans talking with a conversational agent behave as if it were a human-like social actor rather than just a machine.

After the session was over, each subject was asked to rate her/his impressions of the agent and the quality of the conversation using a 6-item questionnaire. As a debriefing after the evaluation, the subjects were told that the conversational agent was not automated but human-operated.

4 Result

4.1 Nonverbal behavior

Fig. 3 shows the distributions of the nonverbal behavior indices during the interactions between the conversational agent and the subjects. In the following paragraphs, summary statistics are shown in the form of means and 95% confidence intervals.

Figure 3: Nonverbal behavior indices.

The mean response time was sec. for the “read” condition and sec. for the “spontaneous” condition (Fig. 3(a)), and the difference was significant (Welch’s t-test, , ). This result indicates that subjects who interacted with the agent whose utterances were synthesized from spontaneous speech data tended to respond faster than those who interacted with the agent whose speech was synthesized from read speech data.

The average number of backchannels was for the “read” condition and for the “spontaneous” condition (Fig. 3(b)), and the difference was significant (, ). This result indicates that subjects who interacted with the agent whose utterances were synthesized from spontaneous speech data tended to show a larger number of backchannels than those who interacted with the agent whose speech was synthesized from read speech data.

The average number of expressive interjections was for the “read” condition and for the “spontaneous” condition (Fig. 3(c)), and the difference was not significant (, ).

The average number of filled pauses was for the “read” condition and for the “spontaneous” condition (Fig. 3(d)), and the difference was not significant (, ).

The average number of laughs was for the “read” condition and for the “spontaneous” condition (Fig. 3(e)), and the difference was not significant (, ).

The average number of nods was for the “read” condition and for the “spontaneous” condition (Fig. 3(f)), and the difference was not significant (, ).

In summary, subjects interacting with the agent whose utterances were synthesized from spontaneous speech data tended to exhibit shorter response times and more response tokens, which can be interpreted as their behaving more as if they were interacting with a human.
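For reference, the per-condition comparisons reported above correspond to Welch’s t-test on two independent samples, which can be run as follows; the arrays are placeholders for the per-subject measurements, not the actual data.

```python
from scipy import stats

# Placeholder per-subject values, one entry per subject in each condition.
read_response_times = [2.1, 1.8, 2.5, 2.0]
spontaneous_response_times = [1.4, 1.6, 1.2, 1.5]

# Welch's t-test (unequal variances); the same call applies to backchannel counts, etc.
t_stat, p_value = stats.ttest_ind(read_response_times, spontaneous_response_times,
                                  equal_var=False)
print(t_stat, p_value)
```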

4.2 Questionnaire

Table 2 summarizes the subjects’ impressions of the agent and the quality of their conversations. Each section of the table shows the question, the meaning of the scale (5–1), and a contingency table (columns correspond to the response options and rows correspond to the conditions to which the subjects were assigned). The most notable result is for the question “How close was your conversation with Mei-chan to a conversation with a human?” The mean was 3.60 for the “read” condition and 4.08 for the “spontaneous” condition. The Brunner-Munzel test showed that the difference between the response distributions for the two conditions was significant. This result indicates that subjects who interacted with the agent whose utterances were synthesized from spontaneous speech data tended to evaluate their conversation as closer to a human conversation.

Table 2: Questions and responses asking subjects’ impression of the agent and the conversation quality.

The differences for other items were not significant.
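The Brunner-Munzel test on ordinal questionnaire responses is available in SciPy; a minimal sketch with placeholder ratings is shown below.

```python
from scipy import stats

# Placeholder 5-1 ratings for "How close was your conversation ... to a conversation with a human?"
read_ratings = [4, 3, 4, 3, 4]
spontaneous_ratings = [4, 5, 4, 4, 5]

# Brunner-Munzel test for two independent ordinal samples.
statistic, p_value = stats.brunnermunzel(read_ratings, spontaneous_ratings)
print(statistic, p_value)
```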

5 Discussion

Subjects in the “spontaneous” condition tended to respond faster than those in the “read” condition. This result suggests that subjects who interacted with the spontaneously speaking conversational agent were more likely to feel pressure to respond at the right timing. Since one is less likely to feel such time pressure when interacting with a mere machine, this implies that the subjects tended to view the agent as a social actor rather than a mere machine.

Subjects in the “spontaneous” condition tended to show a larger number of backchannels than those in the “read” condition. Since one is less likely to give backchannels to a mere machine, this may also be evidence that the subjects tended to view the agent as a social actor rather than a mere machine.

Subjects in the “spontaneous” condition tended to evaluate their conversation with the agent as closer to a human conversation than those in the “read” condition. This result supports the above interpretation of the nonverbal behavior results, i.e., a spontaneously speaking agent tends to be viewed more as a social actor.

These results suggest that speech synthesis built on spontaneous speech is essential to realize a conversational agent as a social actor.

At this time, however, the specific features of spontaneous speech that explain these results remain unknown. One candidate is the phrase-final boundary tones. The analysis described in Sect. 2.3 showed a clear difference between utterances synthesized with the “spontaneous” model and those synthesized with the “read” model in terms of phrase-final boundary tones. Considering that the final rise-fall tones are characteristic of conversational speech [29, 30, 31], and that they constitute important cues that determine the occurrence of backchannels in a computational model of dialogue prosody [60], it is natural to attribute the occurrence of backchannels in the “spontaneous” condition to the L%HL% (rise-fall) pattern. To test this, we looked for a relationship between the rise-fall patterns of the agent’s utterances and the subjects’ subsequent responses, but could not find any direct relationship. Additional experiments will be needed, such as controlling the phrase-final boundary tones of the agent’s utterances and observing the effect on human behavior.
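One way to probe such a relationship in a future analysis would be to cross-tabulate whether each agent utterance ended in a rise-fall tone against whether the subject produced a backchannel right after it, and test the association. The sketch below uses Fisher’s exact test on a hypothetical contingency table; the counts are made up for illustration.

```python
from scipy import stats

# Hypothetical counts: rows = agent utterance ended in L%HL% (yes/no),
# columns = subject produced a backchannel immediately afterwards (yes/no).
contingency = [[12, 5],   # rise-fall ending: 12 followed by a backchannel, 5 not
               [20, 30]]  # other endings:    20 followed by a backchannel, 30 not

# Fisher's exact test of association between the final tone and backchannel occurrence.
print(stats.fisher_exact(contingency))
```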

Another issue is speaker individuality. If we had a pair of read and spontaneous speech corpora from the same speaker, we could eliminate this extraneous variable, but building such a dataset would be very costly. We think the effect of speaker identity on the current experiment was minimal, because the JSUT speaker and the UUDB speaker were both female and of the same generation.

This research is an antithesis to conventional speech synthesis, which has placed supreme importance on naturalness in the sense of professional speech. The current study suggests that speech synthesis for conversational agents should also aim to produce speech that sounds as if it were uttered on the spot. In the future, conversational agents will be used on a daily basis and will increasingly be treated as partners or friends rather than just tools. Speech synthesizers built on spontaneous speech will help to realize such conversational agents. The current study clarifies the significance of using spontaneous speech for speech synthesis in the field of human-machine interaction research.

6 Conclusions

In this paper, we investigated the effect on human interactants of a conversational agent’s synthetic voice trained on spontaneous speech. To quantitatively compare the prosody synthesized by the model trained with spontaneous speech and the model trained with read speech, prosodic labelling was performed on the synthesized speech; it revealed that utterances synthesized with the model trained with spontaneous speech reflected the prosodic properties of natural conversational speech. A subjective evaluation test was also performed to assess the clarity and spontaneity of the synthesized speech, and the results showed that the model trained with spontaneous speech tended to produce speech that sounds more like spontaneous speech uttered on the spot. A large-scale conversation experiment was conducted with a conversational agent whose utterances were synthesized with either the model trained on spontaneous speech or the model trained on read speech. It revealed that subjects who interacted with the agent whose utterances were synthesized from spontaneous speech tended to show shorter response times and a larger number of backchannels. Furthermore, these subjects tended to rate their conversation with the agent as closer to a human conversation.

In summary, it can be concluded that humans exhibit more social responses when interacting with a conversational agent that has a synthetic voice built on spontaneous speech, and such an agent is more likely to be viewed as a social actor.

References

  • [1] C. Nass and Y. Moon, “Machines and mindlessness: Social responses to computers,” Journal of Social Issues, vol. 56, no. 1, pp. 81–103, 2000.
  • [2] F. Bentley, C. Luvogt, M. Silverman, R. Wirasinghe, B. White, and D. Lottridge, “Understanding the long-term use of smart speaker assistants,” Proc. ACM Interact. Mob. Wearable Ubiquitous Technol., vol. 2, no. 3, 2018.
  • [3] M. B. Hoy, “Alexa, Siri, Cortana, and more: An introduction to voice assistants,” Medical reference services quarterly, vol. 37, no. 1, pp. 81–88, 2018.
  • [4] C. Nass, J. Steuer, and E. R. Tauber, “Computers are social actors,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 72–78, Association for Computing Machinery, 1994.
  • [5] R. B. Jackson and T. Williams, “A theory of social agency for human-robot interaction,” Frontiers in Robotics and AI, vol. 8, pp. 1–15, 2021.
  • [6] P. Brown and S. C. Levinson, Politeness : Some universals in language usage. No. 4 in Studies in Interactional Sociolinguistics, Cambridge University Press, 1987.
  • [7] L. Floridi, “The method of levels of abstraction,” Minds and Machines, vol. 18, pp. 303–329, 2008.
  • [8] Y. Den, H. Koiso, K. Takanashi, and N. Yoshida, “Annotation of response tokens and their triggering expressions in Japanese multi-party conversations,” in Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), pp. 1332–1337, 2012.
  • [9] S. K. Maynard, Kaiwabunseki. Tokyo: Kurosio Publishers, 1993.
  • [10] J. Horiguchi, “Komyunikeshon ni okeru kikite no gengokodo,” Journal of Japanese Language Teaching, vol. 64, pp. 13–26, 1988.
  • [11] H. Mori, “Morphology of vocal affect bursts: Exploring expressive interjections in Japanese conversation,” in Proc. Interspeech 2015, pp. 1309–1313, 2015.
  • [12] R. J. Beun, “The function of repetitions in information dialogues,” IPO Annual Progress Report, vol. 20, pp. 91–98, 1995.
  • [13] C. Yamane, Nihongo no danwa ni okeru fira. Tokyo: Kurosio Publishers, 2002.
  • [14] E. Mizukami and K. Yamashita, “An examination of a function of fillers as maintaining the speaker’s right to speak,” Cognitive Studies, vol. 14, no. 4, pp. 588–603, 2007.
  • [15] B. Thomson and S. Young, “Bayesian update of dialogue state: A POMDP framework for spoken dialogue systems,” Computer Speech and Language, vol. 24, no. 4, pp. 562–588, 2010.
  • [16] Y. Xu, P. Huang, J. Tang, Q. Huang, Z. Deng, W. Peng, and J. Lu, “Policy optimization of dialogue management in spoken dialogue system for out-of-domain utterances,” in 2016 International Conference on Asian Language Processing (IALP), pp. 10–13, 2016.
  • [17] A. Graves and N. Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in Proc. ICML ’14, pp. II-1764–II-1772, 2014.
  • [18] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in Advances in Neural Information Processing Systems (C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, eds.), vol. 28, pp. 577–585, Curran Associates, Inc., 2015.
  • [19] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in Proc. ICASSP 2016, pp. 4960–4964, 2016.
  • [20] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura, “Speech parameter generation algorithms for HMM-based speech synthesis,” in Proc. ICASSP 2000, vol. 3, pp. 1315–1318, 2000.
  • [21] H. Zen, K. Tokuda, and A. W. Black, “Statistical parametric speech synthesis,” Speech Communication, vol. 51, no. 11, pp. 1039–1064, 2009.
  • [22] H. Zen, A. Senior, and M. Schuster, “Statistical parametric speech synthesis using deep neural networks,” in Proc. ICASSP 2013, pp. 7962–7966, 2013.
  • [23] K. Tokuda, Y. Nankaku, T. Toda, H. Zen, J. Yamagishi, and K. Oura, “Speech synthesis based on hidden Markov models,” Proceedings of the IEEE, vol. 101, no. 5, pp. 1234–1252, 2013.
  • [24] Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, “Tacotron: Towards end-to-end speech synthesis,” in Proc. Interspeech 2017, pp. 4006–4010, 2017.
  • [25] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerrv-Ryan, R. A. Saurous, Y. Agiomvrgiannakis, and Y. Wu, “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” in Proc. ICASSP 2018, pp. 4779–4783, 2018.
  • [26] Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “FastSpeech 2: Fast and high-quality end-to-end text to speech,” 2020, arXiv:2006.04558.
  • [27] T. Koriyama, T. Nose, and T. Kobayashi, “On the use of extended context for HMM-based spontaneous conversational speech synthesis,” in Proc. Interspeech 2011, pp. 2657–2660, 2011.
  • [28] T. Koriyama, T. Nose, and T. Kobayashi, “An F0 modeling technique based on prosodic events for spontaneous speech synthesis,” in Proc. ICASSP 2012, pp. 4589–4592, 2012.
  • [29] J. Pierrehumbert and M. Beckman, Japanese tone structure. MIT Press, 1988.
  • [30] J. J. Venditti, K. Maeda, and J. P. H. van Santen, “Modeling Japanese boundary pitch movements for speech synthesis,” in Proc. 3rd ESCA Speech Synthesis Workshop, pp. 317–322, 1998.
  • [31] C. T. Ishi, “The functions of phrase final tones in Japanese: Focus on turn-taking,” Journal of the Phonetic Society of Japan, vol. 10, no. 3, pp. 18–28, 2006.
  • [32] G. Skantze, “Towards a general, continuous model of turn-taking in spoken dialogue using LSTM recurrent neural networks,” in Proc. SIGdial 2017, (Saarbrücken, Germany), pp. 220–230, Aug. 2017.
  • [33] D. Kontogiorgos, A. Pereira, O. Andersson, M. Koivisto, E. Gonzalez Rabal, V. Vartiainen, and J. Gustafson, “The effects of anthropomorphism and non-verbal social behaviour in virtual assistants,” in Proc. IVA ’19, p. 133–140, 2019.
  • [34] A. Krogsager, N. Segato, and M. Rehm, “Backchannel Head Nods in Danish First Meeting Encounters with a Humanoid Robot: The Role of Physical Embodiment,” in Proc. HCI AIMT 2014, vol. 8511 of Lecture Notes in Computer Science, pp. 651–662, Springer International Publishing, 2014.
  • [35] G. Skantze, A. Hjalmarsson, and C. Oertel, “Exploring the effects of gaze and pauses in situated human-robot interaction,” in Proc. SIGdIAL 2013, pp. 163–172, 2013.
  • [36] Y. Chiba, T. Nose, T. Kase, M. Yamanaka, and A. Ito, “An analysis of the effect of emotional speech synthesis on non-task-oriented dialogue system,” in Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue, pp. 371–375, 2018.
  • [37] J. James, C. I. Watson, and B. MacDonald, “Artificial empathy in social robots: An analysis of emotions in speech,” in 2018 27th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), pp. 632–637, 2018.
  • [38] T. Misu, E. Mizukami, Y. Shiga, S. Kawamoto, H. Kawai, and S. Nakamura, “Toward construction of spoken dialogue system that evokes users’ spontaneous backchannels,” IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences (Japanese Edition), vol. J95-A, no. 1, pp. 16–26, 2012.
  • [39] S. Andersson, J. Yamagishi, and R. Clark, “Utilising spontaneous conversational speech in HMM-based speech synthesis,” in Proc. 7th ISCA Speech Synthesis Workshop, 2010.
  • [40] A. Ben-David and S. Shechtman, “Acquiring conversational speaking style from multi-speaker spontaneous dialog corpus for prosody-controllable sequence-to-sequence speech synthesis,” in Proc. 11th ISCA Speech Synthesis Workshop (SSW 11), pp. 66–71, 2021.
  • [41] T. Nagata, H. Mori, and T. Nose, “Robust estimation of multiple-regression HMM parameters for dimension-based expressive dialogue speech synthesis,” in Proc. Interspeech 2013, pp. 1549–1553, 2013.
  • [42] M. Yokoyama, T. Nagata, and H. Mori, “Effects of dimensional input on paralinguistic information perceived from synthesized dialogue speech with neural network,” in Proc. Interspeech 2018, pp. 3053–3056, 2018.
  • [43] H. H. Clark, “Speaking in time,” Speech Commun., vol. 36, no. 1–2, p. 5–13, 2002.
  • [44] A. Bliek, S. Bensch, and T. Hellström, “How can a robot trigger human backchanneling?,” in 2020 29th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), pp. 96–103, 2020.
  • [45] A. Hjalmarsson and C. Oertel, “Gaze direction as a backchannel inviting cue in dialogue,” in Proceedings of the IVA 2012 workshop on Realtime Conversational Virtual Agents (RCVA 2012), (Santa Cruz, CA, USA), 2012.
  • [46] Y. Arimoto, N. Nomura, and N. Kamo, “Quantitative evaluation of agency identification against conversational character agent,” in Proceedings of the 2019 Spring Meeting of the Acoustical Society of Japan, pp. 813–816, 3 2019.
  • [47] H. Mori, T. Satake, M. Nakamura, and H. Kasuya, “Constructing a spoken dialogue corpus for studying paralinguistic information in expressive conversation and analyzing its statistical/acoustic characteristics,” Speech Communication, vol. 53, no. 1, pp. 36–50, 2011.
  • [48] R. Sonobe, S. Takamichi, and H. Saruwatari, “JSUT corpus: free large-scale Japanese speech corpus for end-to-end speech synthesis,” 2017, arXiv:1711.00354.
  • [49] A. Gibiansky, S. Arik, G. Diamos, J. Miller, K. Peng, W. Ping, J. Raiman, and Y. Zhou, “Deep Voice 2: Multi-speaker neural text-to-speech,” in Advances in Neural Information Processing Systems (I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds.), vol. 30, pp. 2963–2971, Curran Associates, Inc., 2017.
  • [50] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda, “Speaker-dependent WaveNet vocoder,” in Proc. Interspeech 2017, pp. 1118–1122, 2017.
  • [51] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. de Brébisson, Y. Bengio, and A. C. Courville, “MelGAN: Generative adversarial networks for conditional waveform synthesis,” in Advances in Neural Information Processing Systems (H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, eds.), vol. 32, Curran Associates, Inc., 2019.
  • [52] K. Maekawa, “Corpus of Spontaneous Japanese: Its design and evaluation,” Proceedings of The ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition (SSPR 2003), pp. 7–12, 2003.
  • [53] K. Maekawa and H. Mori, “Comparison of voice quality between the vowels in filled pauses and ordinary lexical items,” Journal of the Phonetic Society of Japan, vol. 21, pp. 53–62, 2017.
  • [54] NVIDIA Corporation, “Tacotron 2.” https://github.com/NVIDIA/tacotron2.
  • [55] J. J. Venditti, “The J_ToBI model of Japanese intonation,” in Prosodic Typology: The Phonology of Intonation and Phrasing (S.-A. Jun, ed.), pp. 172–200, Oxford University Press, 2005.
  • [56] A. Lee, K. Oura, and K. Tokuda, “MMDAgent: A fully open-source toolkit for voice interaction systems,” in Proc. ICASSP 2013, pp. 8382–8385, 2013.
  • [57] N. M. Fraser and G. Gilbert, “Simulating speech systems,” Computer Speech and Language, vol. 5, no. 1, pp. 81–99, 1991.
  • [58] Y. Takamatsuya and H. Mori, “Effects of agent’s pre-recorded vs live speech on the appearance of listener’s response tokens,” in Proc. Human Communication Symposium 2020, IEICE, 2020.
  • [59] H. Mori, “Dynamic aspects of aizuchi and its influence on the naturalness of dialogues,” Acoustical Science and Technology, vol. 34, no. 2, pp. 147–149, 2013.
  • [60] H. Koiso, Y. Horiuchi, S. Tutiya, A. Ichikawa, and Y. Den, “An analysis of turn-taking and backchannels based on prosodic and syntactic features in Japanese map task dialogs,” Language and Speech, vol. 41, no. 3-4, pp. 295–321, 1998.