
TIMIT-TTS: a Text-to-Speech Dataset for Multimodal Synthetic Media Detection

by Davide Salvi, et al.

With the rapid development of deep learning techniques, the generation and counterfeiting of multimedia material are becoming increasingly straightforward to perform. At the same time, sharing fake content on the web has become so simple that malicious users can create unpleasant situations with minimal effort. Also, forged media are getting more and more complex, with manipulated videos taking over the scene from still images. The multimedia forensic community has addressed the possible threats that this situation could imply by developing detectors that verify the authenticity of multimedia objects. However, the vast majority of these tools only analyze one modality at a time. This was not a problem as long as still images were considered the most widely edited media, but now, since manipulated videos are becoming customary, performing monomodal analyses could be reductive. Nonetheless, the literature lacks multimodal detectors, mainly due to the scarcity of datasets containing forged multimodal data on which to train and test the designed algorithms. In this paper we focus on the generation of an audio-visual deepfake dataset. First, we present a general pipeline for synthesizing speech deepfake content from a given real or fake video, facilitating the creation of counterfeit multimodal material. The proposed method uses Text-to-Speech (TTS) and Dynamic Time Warping techniques to achieve realistic speech tracks. Then, we use the pipeline to generate and release TIMIT-TTS, a synthetic speech dataset containing the most cutting-edge methods in the TTS field. This can be used as a standalone audio dataset, or combined with other state-of-the-art sets to perform multimodal research. Finally, we present numerous experiments to benchmark the proposed dataset in both mono and multimodal conditions, showing the need for multimodal forensic detectors and more suitable data.



I Introduction

In recent years, deep learning technologies have grown fast and relentlessly. New scenarios that were only imaginable a few years ago are now possible, and others will arise soon. For example, virtual assistants, natural language processing and visual recognition algorithms have become commonplace and help simplify several daily tasks. New categories of media have also been born, like deepfakes: videos obtained through AI-driven technologies capable of synthesizing a target person’s identity or biometric aspects. Deepfake generation technologies can create exciting unexplored scenarios, but can also lead to dangers and threats when misused. For example, deepfake techniques make it possible to generate video content representing a victim in deceiving situations and/or behaviors, which can lead to frauds, scam cases and fake news spreading [55, 63, 7]. This menace cannot be ignored, as we have reached a point where we are no longer able to always distinguish real media from artificially generated ones [41, 42].

The scientific community has embraced this problem and has started working in several directions to moderate the online spread of deepfakes, considering both audio and video threats [60]. For instance, international challenges have been organized to make people aware of the importance of fighting deepfake misuse. To this end, the DFDC challenge [10] focused on video deepfake detection, while the ASVspoof [57, 65] and ADD [66] challenges have been proposed in the audio field. Furthermore, part of the research community has focused on releasing deepfake datasets to help develop forensic detectors. This is the case of FaceForensics++ [45] and DeepfakeTIMIT [29] for videos, as well as WaveFake [13] for audio. Most importantly, multimedia forensics researchers have developed several detectors to discriminate deepfakes from pristine media. These leverage different characteristics, ranging from low-level artifacts left by the generators [11, 15] to more semantic aspects [33, 1].

Despite the considerable effort put into fighting deepfakes, a common trait of the developed detectors is that they primarily focus on monomodal analysis: they consider either the audio or the video deepfake detection problem separately. Since videos usually come with audio tracks, and both the visual and audio content are subject to editing, performing a joint audio-visual multimodal analysis should be the preferred option. However, only a few approaches have been proposed that perform multimodal detection, leveraging inconsistencies between modalities or traces orthogonal to each of them to identify counterfeit material. For example, [17] exploits the inconsistencies between the emotions conveyed by the audio and visual modalities to perform joint audio-visual deepfake detection. The authors of [35] incorporate temporal information from series of images, audio and video data to provide a multimodal deepfake detection approach. Alternatively, the authors of [24] show that combining audio and video baselines in an ensemble-based method provides better detection performance than a monomodal system.

The main reason for the lack of multimodal forensic systems for deepfake detection is the scarcity of data to train and test them. Most of these systems are data-driven and require a large amount of data to be trained. Still, most of the deepfake datasets proposed in the literature are monomodal. There is a dearth of challenging fake video datasets that also contain fake audio, making it tough to develop multimodal systems.

In this paper we address the lack of multimodal deepfake data by focusing on three main contributions:

  • We propose a general pipeline to turn a monomodal video deepfake dataset from the literature into a multimodal audio-visual deepfake dataset.

  • We apply the proposed pipeline to the VidTIMIT [49] and DeepfakeTIMIT [29] datasets in order to build and release the novel multimodal TIMIT-TTS deepfake dataset containing almost tracks.

  • We benchmark the generated dataset by running a series of deepfake detection baselines that highlight the main challenges for future research.

The rationale behind the proposed pipeline is that realistic deepfake video datasets have been proposed in the literature, but these do not contain accompanying deepfake audio. Therefore, we present a technique to generate a synthetic speech track for a given input video. This approach allows us to generate fake audio content starting from any video containing speech, considering the most advanced state-of-the-art TTS systems. Once generated, the synthetic track can be paired with the input video and, depending on the authenticity of the latter, an audio-only or an audio-visual deepfake is obtained. Our pipeline thus provides a viable solution for creating counterfeit multimodal material, which is in general complex to perform.

To showcase the actual feasibility of the proposed deepfake generation approach, we apply it to the VidTIMIT dataset [49] and DeepfakeTIMIT dataset [29]. The former contains audio-video recordings of people speaking. The latter is a video deepfake version of the former. By generating synthetic speech for both video datasets, we end up with the proposed TIMIT-TTS, a synthetic speech dataset built using state-of-the-art TTS techniques. On the one hand, TIMIT-TTS can be used as a standalone audio dataset to test speech deepfake detectors, as it contains the most cutting-edge methods in the speech synthesis field. On the other hand, TIMIT-TTS can also be combined with VidTIMIT and DeepfakeTIMIT to provide multimodal audio-video deepfake data, which is an overlooked aspect in the current literature.

Finally, we run a series of tests to provide some information on the challenges proposed by this new multimodal dataset. We adopt the video deepfake detector proposed in [4] and the audio deepfake detector proposed in [51] to analyze videos and audio tracks in both monomodal and multimodal fashion. Results confirm that multimodal deepfake analysis should be preferred and show that audio deepfake attribution is an interesting topic for further research.

The rest of the paper is structured as follows. Section II recaps the motivations behind our work and provides the reader with some useful background on generation and detection methods for speech deepfakes. Section III describes the proposed generation pipeline for the deepfake audio tracks and provides an overview of the considered TTS synthesis algorithms. Section IV explains the structure of the released TIMIT-TTS dataset. Section V presents the results of the analysis conducted on the released data. Finally, Section VI concludes the paper along with a brief discussion of possible future work.

II Overview and Background

This section provides the reader with some helpful background information needed to understand the primary rationale behind our proposal. First, we highlight the need for a multimodal deepfake dataset with a particular focus on synthetic speech generation, as the one proposed in this paper. Then, we provide a quick overview of synthetic speech generation and detection techniques, which are at the base of our proposed dataset and benchmarking work.

II-A Motivations

Numerous deepfake datasets have been proposed in recent years, both in the audio and video case, significantly pushing research towards developing new methods for recognizing counterfeit material. The publication of these sets leads to the design of more innovative and effective detectors, since they provide new data on which to train and test them. However, most of the presented datasets focus on only one modality at a time, resulting in valuable data for producing monomodal detectors but not relevant for multimodal methods. Indeed, to train and test multimodal detectors, there is a need for data that are altered in all the considered aspects (e.g., both video and audio). The lack of such data is one of the main reasons behind the lack of investigations into multimodal detectors, and it is the primary motivation behind this work.

Recently, two multimodal deepfake datasets have been proposed, both containing counterfeit audio and video: DFDC [10] and FakeAVCeleb [25]. Although these propose a solution to the aforementioned problem, we cannot define either of them as complete, especially from an audio point of view. On one side, DFDC does not provide labels as to which of the audio or visual components are fake; the content is labeled as fake when at least one of the two modalities is counterfeit. Therefore, we do not have sufficient information to perform tests on different scenarios (e.g., fake audio and real video or vice versa) and to investigate which aspects a detector leverages to discriminate between real and altered data. On the other hand, the multimodal deepfakes contained in FakeAVCeleb are generated overlooking the audio modality. All the fake audio tracks are synthesized using the same TTS algorithm, and none of them is synchronized with the corresponding video. This results in a lack of both variety and realism in the released data.

In this paper we address both these problems by proposing a pipeline to generate multimodal deepfake datasets starting from video deepfake ones. Indeed, we can make use of the good-quality video datasets proposed in the literature and automate the generation of realistic speech tracks to be synchronized and matched with these videos. We use this pipeline to release TIMIT-TTS, a synthetic speech dataset that overcomes the aforementioned limitations. The released data include synthetic speech generated from different TTS systems, providing an overview of the most advanced techniques in the state of the art. The tracks have also been synchronized with the related videos using a DTW technique, producing highly realistic content, as shown in Section V. These two aspects, namely the variety of speech generation algorithms and time warping, fill a gap in state-of-the-art multimodal deepfake datasets. Given the variety of generation methods it includes, TIMIT-TTS can be used both as a standalone synthetic audio dataset and, in conjunction with other well-established deepfake video datasets, to perform multimodal deepfake studies.

II-B Speech Deepfake Generation Methods

Deepfake content generation techniques are becoming increasingly simple to use, and the data they produce are getting more and more realistic. In some cases, the generated synthetic material is so lifelike that it is difficult to discern from authentic material [41]. Although this is true for both audio and video data, here we focus on the generation methods of speech deepfakes, which are the main subject of study in this paper.

As far as synthetic speech data generation is concerned, techniques can be broadly split into two main families: TTS methods and Voice Conversion (VC) methods. The difference between these two kinds of techniques lies mainly in the input of the generation system. TTS algorithms produce speech signals starting from a given text. Conversely, VC methods take a speech signal as input and alter it by changing its style, intonation or prosody, trying to mimic a target voice.

Regarding TTS methods, a long line of classical techniques based on vocoders and waveform concatenation has been proposed in the literature [28]. However, the first modern breakthrough that significantly outperformed all the classical methods was WaveNet [59], a neural network for generating raw audio waveforms capable of emulating the characteristics of many different speakers. This network has been overtaken over the years by other systems [30, 61], which have put the synthesis of highly realistic artificial voices within everyone’s reach.

Most TTS systems follow a two-step approach. First, a model generates a spectrogram starting from a given text. Then, a vocoder synthesizes the final audio from the spectrogram. This approach allows combining different vocoders with the same spectrogram generator and vice versa. Alternatively, some end-to-end models have been proposed, which generate speech directly from the input text [16].
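To make the interchangeability concrete, the two-step interface can be sketched in a few lines of Python. The generator and vocoder below are hypothetical stubs standing in for real models; only the composition pattern reflects the text.

```python
# Minimal sketch of the two-stage TTS interface: any spectrogram
# generator can be paired with any vocoder. The models here are stubs.

def tacotron2_like_generator(text: str) -> list:
    # Stub: a real model would return a mel-spectrogram (frames x mel bins).
    return [[float(ord(c) % 8) for _ in range(4)] for c in text]

def melgan_like_vocoder(spectrogram: list) -> list:
    # Stub: a real vocoder would synthesize a waveform from the spectrogram.
    return [sum(frame) / len(frame) for frame in spectrogram]

def make_tts(generator, vocoder):
    """Compose a spectrogram generator with a vocoder into one TTS method."""
    def synthesize(text: str) -> list:
        return vocoder(generator(text))
    return synthesize

tts = make_tts(tacotron2_like_generator, melgan_like_vocoder)
waveform = tts("hello")
```

Swapping either argument of `make_tts` yields a different generator/vocoder pair, which is exactly the interchangeability the two-stage design enables.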

Considering VC algorithms, the earliest models were based on spectrum mapping using parallel training data [56, 9]. However, most of the current approaches are GAN-based [22, 23], learning a mapping from source to target speaker without relying on parallel data.

In this work we only consider TTS methods, as they are more investigated in the literature and allow us to build a more varied dataset. Indeed, as we aim for a fully automated and high-quality pipeline, TTS methods are less prone to errors than VC techniques. Moreover, in principle TTS methods allow for easy editing of even a single word within a speech. Furthermore, realistic TTS voice styles are typically easier to tune than VC techniques, whose fine-tuning is often challenging and time-consuming. Nevertheless, VC methods are also worth further study and will be the subject of future versions of this dataset.

II-C Speech Deepfake Detection Methods

The speech deepfake detection task consists in determining whether a given speech track is authentic, i.e., uttered by a real speaker, or has been synthetically generated. Recently, this has become a hot topic in the forensic research community, which is trying to keep up with the rapid evolution of counterfeiting techniques [36].

In general, speech deepfake detection methods can be divided into two main groups based on the aspect they leverage to perform the detection task. The first focuses on low-level aspects, looking for artifacts introduced by the generators at the signal level. In contrast, the second focuses on higher-level features representing more complex aspects, such as semantic ones.

As an example of artifact-based approaches, [62] aims to secure Automatic Speaker Verification (ASV) systems against physical attacks through channel pattern noise analysis. In [37], the authors assume that a real recording has more significant non-linearity than a counterfeit one, and they use specific features, such as bicoherence, to discriminate between them. Bicoherence is also employed in [5], along with several features based on modeling speech as an auto-regressive process; the authors investigate whether these features complement and benefit each other. Alternatively, the authors of [51] propose an end-to-end network to spot synthetic speech.

On the other hand, detection approaches that rely on semantic features are based on the hypothesis that deepfake generators can synthesize the low-level aspects of the signals but fail to reproduce more complex high-level features. For example, [46] addresses the deepfake detection task by relying on classic audio features inherited from the music information retrieval community. The authors of [8] exploit the lack of emotional content in synthetic voices generated via TTS techniques to recognize them. Finally, in [2] ASV and prosody features are combined to perform synthetic speech detection.

III Dataset Creation Methodology

This section presents the methodology we propose to generate a deepfake speech track for a given input video, whether the video is real or fake. In doing so, we also detail all the implemented TTS systems used to synthesize the signals and the techniques applied to post-process them. This is the pipeline we follow to generate the proposed dataset.

III-A Generation Pipeline

The proposed pipeline to generate a synthetic speech track for a given video comprises several steps, as shown in Figure 1. The input to the whole process consists of a video V that represents a speaking person. Here we consider a video as a multimedia object composed of both an audio speech content a and a visual component v depicting a person’s face, as in

V = M(a, v),

where M is the mixing operation between the audio and visual signals. Our final goal is to produce a forged video V' containing the same visual subject as V, but where the speech track is a synthetically generated deepfake. To summarize, we can write

V' = P(V),

where P indicates the complete pipeline we propose.

To achieve our goal, the first operation we perform is to split V into its components a and v. The speech track a becomes the input of the audio generation pipeline, which outputs its synthetic counterpart a'. This segment is composed of three main blocks. The first is a speech-to-text algorithm, which transcribes the speech content of a into the text t. The second block is a TTS algorithm that produces a synthetic audio track s from a given string t, as in

s = G(t),

where G is one of the TTS systems presented later in this section. Finally, the third block consists of a post-processing step, which takes the generated track s as input and outputs its processed version a', which is more realistic and challenging to discriminate for deepfake detectors. The track a' is the deepfake version of the input speech track a.

Two different post-processing techniques are implemented in our pipeline, which can be applied individually or together. In case neither is applied, we output the clean TTS track. The first technique is speech-to-speech synchronization based on DTW. Since the goal of the proposed system is to generate a fake speech track for a given video, we need the synthesized audio to be synchronized with the video itself. Without performing the alignment, the synthetic track will have a different temporal trend from the input audio and the corresponding video. This results in a deepfake that is very easy to detect for all the systems trained to analyze the discrepancies in time between the audio and video modalities. This pipeline step takes as input the original speech track and the synthetic one and performs time warping on the latter by mapping it to the former. We do so through the alignment algorithm presented later in this section. The output track, being synchronized with the original audio, is also synchronized with the input video.

The second block of the post-processing step consists of data augmentation. Here we apply several algorithms, including noise injection, pitch shifting and lossy compression, to make the generated data more challenging to discriminate for those deepfake detectors that are not robust to such operations. In fact, these processing operations hide the traces that TTS algorithms may leave, making the generated data tougher to identify. Finally, once the synthetic audio track has been obtained, we mix it with the visual component of the input video, generating a new multimodal deepfake content. Depending on the authenticity of the input video, the result will be a mono or multimodal deepfake.

Fig. 1: Pipeline of the proposed generation method.
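The data flow described above can be sketched as a chain of function calls. Every stage below is a hypothetical stub (names and return values are illustrative only); only the order of operations mirrors the actual pipeline.

```python
# Illustrative sketch of the generation pipeline in Figure 1; every stage
# is a stub, and only the data flow reflects the described method.

def speech_to_text(audio):
    # Transcription stage (stub): a real system would run an STT model.
    return "transcript of input speech"

def tts_synthesize(text):
    # TTS stage (stub): a real system would run one of the listed models.
    return [0.1] * len(text)

def dtw_align(synthetic, reference):
    # Sync stage (stub): naively stretch the track to the reference length.
    n = len(reference)
    return [synthetic[int(i * len(synthetic) / n)] for i in range(n)]

def augment(track):
    # Augmentation stage (stub): a real system would add noise, pitch
    # shifting, compression, etc.
    return [s + 0.01 for s in track]

def generate_deepfake_audio(reference_audio, apply_dtw=True, apply_aug=True):
    text = speech_to_text(reference_audio)
    track = tts_synthesize(text)
    if apply_dtw:
        track = dtw_align(track, reference_audio)
    if apply_aug:
        track = augment(track)
    return track

fake = generate_deepfake_audio([0.0] * 16000)
```

The two boolean flags correspond to the optional post-processing blocks; disabling both yields the clean TTS output.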

III-B Speech Synthesis

In the proposed pipeline, the TTS block can support multiple speech generation algorithms. We did so to add the possibility of generating data with different characteristics, not tied to a single algorithm and more representative of the state of the art. Most of the considered TTS algorithms follow a two-stage pipeline, while only a few methods have an end-to-end approach, generating speech signals directly from an input text. In the two-stage case, the first block takes a text as input and generates a spectrogram, while the second is a vocoder that sonifies the output of the first step. The two blocks are independent from each other, so we can potentially use different vocoders for the same spectrogram generator. Here we consider a TTS method as a fixed pair of generator and vocoder. Even though this interchangeability allows us to potentially have a large number of methods, in this study we want to limit the number of vocoders considered. We do so since we want to keep the differences between the generated speech tracks primarily attributable to the spectrogram generators. Nevertheless, we believe that the artifacts introduced by the vocoders are a noteworthy aspect, and these will be the subject of subsequent versions of this dataset.

Here is a list of the considered spectrogram generators.


  • Tacotron [61] is a seq2seq model, which includes an encoder, an attention-based decoder, and a post-processing net. Both the encoder and decoder are based on Bidirectional GRU-RNN. We consider the version implemented in [19].

  • Tacotron2 [50] has the same architecture as Tacotron but improves its performance by adding a Location Sensitive Attention module to connect the encoder to the decoder.

  • GlowTTS [26] is a flow-based generative model. It searches for the most probable monotonic alignment between the text and the latent representation of speech on its own, enabling robust and fast TTS synthesis.

  • FastSpeech2 [44] is composed of a Transformer-based encoder and decoder, together with a variance adaptor that predicts variance information of the output spectrogram, including the duration of each token in the final spectrogram and the pitch and energy per frame.

  • FastPitch [32] is based on FastSpeech, conditioned on fundamental frequency contours. It predicts pitch contours during inference to make the generated speech more expressive.

  • TalkNet [3] consists of two feed-forward convolutional networks. The first predicts grapheme durations, which are used to expand the input text; the second generates a mel-spectrogram from the expanded text.

  • MixerTTS [53] is based on the MLP-Mixer architecture adapted for speech synthesis. The model contains pitch and duration predictors, with the latter being trained with an unsupervised TTS alignment framework.

  • MixerTTS-X [53] has the same architecture as MixerTTS but additionally uses token embeddings from a pre-trained language model.

  • VITS [27] is a parallel end-to-end TTS method that adopts variational inference augmented with normalizing flows and an adversarial training process to improve the expressive power of the generated speech.

  • SpeedySpeech [58] is a student-teacher network capable of fast synthesis with low computational requirements. It includes convolutional blocks with residual connections in both the student and teacher networks and uses a single attention layer in the teacher model.

  • gTTS [12] (Google Text-to-Speech) is a Python library and CLI tool to interface with Google Translate’s text-to-speech API. It generates audio starting from an input text through an end-to-end process.

  • Silero [54] is a pre-trained enterprise-grade TTS model that runs faster than real time, following an end-to-end pipeline.

Here is a list of the considered vocoders.


  • MelGAN [31] is a GAN-based model that generates audio from mel-spectrograms. It uses transposed convolutions to upsample the mel-spectrogram to audio. We considered this vocoder to generate speech from Tacotron2, GlowTTS, FastSpeech2, FastPitch, TalkNet, MixerTTS, MixerTTS-X, and SpeedySpeech.

  • WaveRNN [21] is a single-layer recurrent neural network with a dual softmax layer, able to generate audio 4x faster than real time. We considered this vocoder to generate audio from Tacotron.

Most of the models mentioned above follow a deep-learning approach, and the data they generate are highly dependent on those seen during the training phase. This also determines the number and identity of the speakers a model supports. In fact, if a system has been trained on numerous speakers, it will also be able to reproduce them at inference time, resulting in a multi-speaker generator. Conversely, if we train a system on one speaker only, it will be able to generate audio only with that tone of voice.

Here is a list of the datasets used to train the considered TTS methods in order to obtain different voice styles.


  • LJSpeech [18] is a dataset containing short audio tracks of speech recorded from a single speaker reciting pieces from non-fiction books.

  • LibriSpeech [43] is a dataset that contains about hours of authentic speech from more than different speakers.

  • CSTR VCTK Corpus [64] (Centre for Speech Technology Voice Cloning Toolkit) is a dataset that includes speech data uttered by native speakers of English with various accents. Each speaker reads about sentences from a newspaper and a passage intended to identify the speaker’s accent.

Table I presents a summary of the datasets used to train each algorithm, together with the number of speakers implemented in TIMIT-TTS. The models trained on LibriSpeech and VCTK support multi-speaker synthesis, while those trained on LJSpeech only support a single speaker, an English female voice with an American accent. For gTTS, no dataset is indicated, as it directly interfaces with Google Translate’s TTS API and synthesizes speech using its pre-trained models. This model supports English in 4 different accents (United States, Canada, Australia and India). It is worth noting that several methods have been trained on LJSpeech, resulting in diverse systems able to generate speech with the same voice. This allows generating speech data that are not biased by the speaker’s identity and that are more difficult to discriminate for deepfake detectors, as shown in Section V.

Generator Dataset Num. Speakers
gTTS // 4
Tacotron LibriSpeech 8
GlowTTS LJSpeech, VCTK 9
FastPitch LJSpeech, VCTK 9
FastSpeech2 LJSpeech 1
MixerTTS LJSpeech 1
MixerTTS-X LJSpeech 1
SpeedySpeech LJSpeech 1
Tacotron2 LJSpeech 1
TalkNet LJSpeech 1
Silero LJSpeech 1
TABLE I: Datasets used to train each TTS method and number of speakers considered in TIMIT-TTS.

III-C Audio-Video Synchronization

To generate a realistic audio-video deepfake, we need its audio and visual components to be synchronized with each other. This is crucial, as several semantic deepfake detectors leverage the inconsistencies between the two modalities to discriminate between authentic and counterfeit media content [6], and having the two components out of sync would result in a deepfake that is easy to spot. To avoid this, we synchronize the generated TTS track with the original audio of the input video. Since the original audio is aligned with the original video, the aligned TTS signal turns out to be synchronized with the video itself.

We address this point using the DTW implementation provided by the Synctoolbox library [38]. This toolbox integrates and combines several techniques for the given task, such as multiscale DTW, memory-restricted DTW, and high-resolution music synchronization. The method was initially proposed for synchronizing music, but we also tested its effectiveness in the case of speech. The DTW process computes the chroma features of the analyzed tracks and warps them by bringing them into temporal correspondence. This pipeline block inputs the original speech track together with the TTS track and outputs the warped signal. In particular, the original track is the target signal and the TTS track is the one to be warped. In our pipeline, both tracks contain the exact same text, and the output has the same length as the target.
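For intuition, a textbook single-resolution DTW can be sketched as follows. This is not the multiscale Synctoolbox implementation, and it operates on plain 1-D feature sequences rather than chroma features; it only illustrates the warping-path idea.

```python
import numpy as np

def dtw_path(x, y):
    """Classic dynamic time warping between two 1-D feature sequences.
    Returns the optimal warping path as (i, j) index pairs."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    # Fill the accumulated cost matrix with the standard recursion.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],
                                 cost[i, j - 1],
                                 cost[i - 1, j - 1])
    # Backtrack from the end of both sequences to recover the path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

path = dtw_path([1, 2, 3], [1, 1, 2, 3])
```

The returned path tells, for every frame of the first sequence, which frame of the second it corresponds to; resampling the track along this path yields the warped signal.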

To improve the performance of this pipeline block we adopt a combined VAD + DTW method. In fact, in real cases, audio tracks often contain silences at their beginning or end, differently from TTS signals, where silences are limited. Since they are not symmetric between the two tracks, these silences can degrade the synchronization performance. To bypass this problem, we apply a VAD on both tracks before the alignment, removing the head and tail silences. Then, we perform the DTW only on the voiced segments. Finally, we add the silences removed from the target track back to the warped one, obtaining a signal of the desired length. This approach allows us to achieve more effective alignments and more realistic results. Figure 2 shows the complete pipeline of the alignment block.

Fig. 2: Pipeline of the speech-to-speech alignment block.
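The VAD + DTW combination can be sketched as follows. The energy threshold and the naive resampling warp are illustrative placeholders, not the actual components of the pipeline (which uses a proper VAD and chroma-based DTW).

```python
import numpy as np

def trim_silence(x, threshold=0.01):
    """Crude energy-based VAD: strip leading/trailing samples whose
    magnitude is below the threshold. Returns the voiced part plus the
    lengths of the removed head and tail."""
    voiced = np.flatnonzero(np.abs(x) > threshold)
    if voiced.size == 0:
        return x[:0], 0, len(x)
    start, end = voiced[0], voiced[-1] + 1
    return x[start:end], start, len(x) - end

def align_with_vad(target, synthetic, warp):
    """Trim both tracks, warp the voiced part of the synthetic one, then
    restore the target's head/tail silences so the output matches the
    target's length."""
    tgt_voiced, head, tail = trim_silence(target)
    syn_voiced, _, _ = trim_silence(synthetic)
    warped = warp(syn_voiced, len(tgt_voiced))
    return np.concatenate([np.zeros(head), warped, np.zeros(tail)])

# Placeholder warp: naive resampling to the target length (a real
# pipeline would apply DTW here).
naive_warp = lambda x, n: x[np.linspace(0, len(x) - 1, n).astype(int)]

target = np.concatenate([np.zeros(100), np.ones(300), np.zeros(50)])
synthetic = np.concatenate([np.ones(200), np.zeros(10)])
out = align_with_vad(target, synthetic, naive_warp)
```

Restoring the removed silences at the end is what guarantees the aligned track has exactly the same duration as the target, and hence as the video.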

III-D Post-processing

Deepfake audio detectors generally perform very well when dealing with clean data, but their performance drops when the data are post-processed. In in-the-wild conditions, post-processing techniques are introduced to hide some artifacts present in the generated deepfake audio tracks. For example, applying MP3 compression reduces the audio quality and hides some defects, while adding reverberation simulates the environment in which the audio was captured. In our pipeline, we introduce a data augmentation block that allows us to generate more challenging data. Table II shows the techniques we implemented and the parameters we considered for each transform, as will be explained in the next section. We performed all the operations using the Python library audiomentations [20].
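As a rough illustration of one such transform, the sketch below injects white Gaussian noise at a chosen signal-to-noise ratio. It is a minimal NumPy stand-in for the audiomentations transforms actually used in the pipeline; the SNR value and signal are illustrative only.

```python
import numpy as np

def add_noise_at_snr(clean, snr_db, rng=None):
    """Inject white Gaussian noise at a target SNR (in dB).
    A minimal stand-in for a noise-injection augmentation transform."""
    if rng is None:
        rng = np.random.default_rng(0)
    signal_power = np.mean(clean ** 2)
    # SNR (dB) = 10 * log10(signal_power / noise_power).
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=clean.shape)
    return clean + noise

# Illustrative signal: one second of a 440 Hz tone at 16 kHz.
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noisy = add_noise_at_snr(clean, snr_db=20)
```

Lower SNR values produce more aggressive degradation; chaining several such transforms (compression, pitch shifting, reverberation) mimics the augmentation block of the pipeline.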

IV TIMIT-TTS Dataset Generation

This section provides all the details about the TIMIT-TTS dataset we release in this paper. After explaining its generation process, we illustrate its structure and possible applications.

IV-A Reference Dataset

To generate a counterfeit speech dataset through the pipeline proposed in Section III, we need to define an audio-video set to use as a reference. Our goal is to produce a new version of the dataset where its audio component is replaced with a synthetic one. Here we consider the VidTIMIT dataset [49, 48]. This includes video and audio recordings of people reciting short sentences from the TIMIT Corpus [14]. We chose this dataset for several reasons. First, it is well established and highly regarded within the scientific community. Then, since the recorded sentences are extracted from the TIMIT Corpus, we are provided with all the transcripts of the video dialogues. Therefore, we can avoid the text transcription step of the pipeline (see Figure 1), which could introduce errors within the generated tracks, undermining the reliability of the released dataset. This is crucial since the VidTIMIT recordings were done in an office using a broadcast quality digital video camera, resulting in noisy audio tracks that are difficult to transcribe. Moreover, the use of the official transcripts makes the generated speech perfectly synchronizable with the video, thus putting us in the most challenging forensic scenario, where audio and video inconsistencies are minimal. Finally, a counterfeited version of this dataset has been released: DeepfakeTIMIT [29], which includes videos extracted from the VidTIMIT corpus, modified using open-source GAN-based software to create video deepfakes. Being the released TIMIT-TTS an audio deepfake version of VidTIMIT, when used together with DeepfakeTIMIT it provides audio-video content that is counterfeited in both modalities. This is extremely useful for the development of new multimodal deepfake detectors.

Iv-B Generated dataset

To develop the TIMIT-TTS dataset, we consider the whole VidTIMIT corpus. We generate a set of synthetic speech tracks for each of the implemented generators, containing the same sentences as the reference videos. For the systems that support multi-speaker synthesis, we synthesize a set of tracks for each speaker. We create several versions of the dataset, corresponding to the different post-processing operations applied to the generated speech tracks. In particular, we consider two different processes: audio-video synchronization (dtw) and data augmentation. This results in the following four versions of the dataset:

  • clean_data: all the synthetic audio tracks are clean and no post-processing is performed after the tts generation process.

  • dtw_data: dtw is applied to the generated data. Each speech is synchronized with the corresponding video track from VidTIMIT.

  • aug_data: data augmentation is applied to each speech track.

  • dtw_aug_data: both dtw and data augmentation are applied to the generated data. First, we warp the tracks in time and then we augment them. We do so to prevent degradation from affecting the alignment process.
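The four partitions differ only in which post-processing steps are composed, and in which order. The relationship can be sketched with hypothetical stub functions standing in for the pipeline's actual tts generation, dtw synchronization and augmentation blocks:

```python
# Sketch of how the four TIMIT-TTS partitions relate to each other.
# synthesize, warp and augment are hypothetical stand-ins for the real blocks.

def synthesize(text):
    return f"tts({text})"    # clean synthetic track

def warp(track):
    return f"dtw({track})"   # synchronized with the reference video

def augment(track):
    return f"aug({track})"   # post-processed with random transforms

def build_partitions(text):
    clean = synthesize(text)
    return {
        "clean_data":   clean,
        "dtw_data":     warp(clean),
        "aug_data":     augment(clean),
        # warp first, then augment, so degradation does not hurt alignment
        "dtw_aug_data": augment(warp(clean)),
    }
```

Note the ordering in the last partition: warping precedes augmentation precisely so that the added degradation cannot affect the alignment.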

Considering the number of tts methods and the number of speakers implemented, as shown in Table I, each dataset partition is composed of several thousand tracks, resulting in a large total number of speech signals over the entire dataset. All the tracks are released in wav format at a fixed sampling rate. The complete dataset is publicly available for download.

Each partition of the dataset contains two splits, named single_speaker and multi_speaker. The first includes all the tracks generated using tts algorithms that support the LJSpeech speaker, while the second includes the signals produced by generators that implement speakers from datasets other than LJSpeech. Each of the two splits contains a subfolder for each generator, where the audio tracks are stored. The name of each track is dir_track.wav, where dir and track are, respectively, the name of the directory in which VidTIMIT is structured and the name of the track it contains. We adopted this naming scheme to make it easy to link each deepfake audio track to its corresponding video.
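Under this naming convention, recovering the matching VidTIMIT item for a given deepfake track is a matter of splitting the file name. A minimal sketch (the example names and directory layout are illustrative, not prescribed by the dataset):

```python
from pathlib import Path

def video_for(audio_path, vidtimit_root):
    """Map a TIMIT-TTS track named <dir>_<track>.wav back to its
    VidTIMIT counterpart <root>/<dir>/<track>.

    Assumes the VidTIMIT directory name contains no underscore, so the
    first "_" in the stem separates dir from track."""
    stem = Path(audio_path).stem             # e.g. "fadg0_sa1"
    speaker_dir, track = stem.split("_", 1)
    return Path(vidtimit_root) / speaker_dir / track
```

For example, a track stored as glowtts/fadg0_sa1.wav would map back to VidTIMIT/fadg0/sa1.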

Regarding data augmentation, we applied all the implemented techniques to each speech track, each with a given application probability and with a parameter value drawn at random from a method-specific range. Following this approach, some generated tracks are edited with more than one method at a time, while others remain clean. At the same time, a different augmentation level is applied to each track. This results in a dataset that is highly diverse and challenging to classify. Table II shows all the augmentation techniques implemented, together with the ranges considered, while a list of the augmentation techniques applied to each signal can be found in a csv file included in the partition folder.

Augm. technique | Parameter | Application range
Gaussian Noise | Amplitude |
Time Stretching | Rate |
Pitch Shifting | Semitones |
High-pass Filtering | Cutoff freq. [Hz] |
MP3 compression | Bitrate |
TABLE II: List of the implemented data augmentation techniques.
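The application policy described above, where each transform fires independently with some probability and its parameter is drawn uniformly from a per-method range, can be sketched in pure NumPy. The probability and range values below are placeholders, not the ones used for the dataset; in practice this logic is handled by audiomentations:

```python
import numpy as np

rng = np.random.default_rng(0)

def maybe_apply(x, transform, p, param_range):
    """Apply `transform` with probability p, drawing its parameter
    uniformly from param_range (placeholder values, not the dataset's)."""
    if rng.random() < p:
        x = transform(x, rng.uniform(*param_range))
    return x

def add_gaussian_noise(x, amplitude):
    return x + amplitude * rng.standard_normal(len(x))

def augment(x):
    # Each technique fires independently, so a track may receive
    # several transforms at once, or none at all and stay clean.
    x = maybe_apply(x, add_gaussian_noise, p=0.5, param_range=(0.001, 0.015))
    # ...pitch shifting, time stretching, MP3 compression, etc. follow
    return x
```

Because every technique is drawn independently, the number and strength of the transforms vary from track to track, which is exactly what makes the augmented partitions diverse.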

The possible applications of TIMIT-TTS are numerous. As regards synthetic speech detection, it can be performed in both closed-set and open-set scenarios: the high number of tts generators implemented within the dataset allows us to include some of them in the train set while introducing others only in the test partition, making the classification task more challenging. Furthermore, apart from binary classification, synthetic speech attribution can be performed. This is a multi-class classification problem where, for each of the proposed tracks, we must identify the tts generation algorithm used to synthesize it. Performing this study on TIMIT-TTS is particularly interesting since several of the proposed spectrogram generators only support the LJSpeech speaker. Indeed, synthetic speech attribution can be relatively easy when each generator supports different speakers, but it becomes challenging when all the systems reproduce the same speaker. This type of analysis is presented in Section V.

V Results and benchmarking

In this section, we benchmark the released dataset using objective metrics and show some of its possible applications, presenting the results obtained by testing it with state-of-the-art deepfake detectors. We perform deepfake detection in both monomodal and multimodal scenarios, showing the effectiveness of considering multiple modalities at the same time.

V-a TIMIT-TTS statistics

When generating synthesized audio data, many aspects need to be addressed to ensure the forged material is reliable and realistic. These include track length, silence duration, speech naturalness and the number of supported speakers. If these aspects are overlooked, we risk generating biased or easy-to-discriminate data.

The first aspect we analyze is the duration of the generated audio tracks. As the dataset will be mainly used to develop deepfake detectors, we need the length of the audio tracks to be compatible with the window sizes used by most of the systems. Furthermore, we want to avoid differences between the duration of the signals generated with distinct tts algorithms to prevent tracks generated by different methods from being easily discriminated. The length of a signal generated through a tts technique depends on the source text used as input. In our case, all the considered sentences are fixed and extracted from the TIMIT Corpus.

Table III shows the duration values for each tts generation system. The average length over the entire dataset is equal to 3.10 s, while the per-generator averages range from 2.69 s to 3.82 s. The standard deviation between the durations of the different methods is small. This means the length of the tracks does not constitute a discriminating element between the different generation algorithms, resulting in a reliable dataset. When we apply dtw, the average length of the tracks rises to 4.25 s. In this case, the average duration is the same for all generation algorithms, since the generated tracks match the duration of the target ones extracted from VidTIMIT, so their length is fixed.
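This check can be reproduced from the per-generator averages reported in Table III. Note that the unweighted mean over generators differs slightly from the track-weighted overall average (3.10 s) in the table's last row:

```python
import numpy as np

# Average clean-track duration per generator, in seconds (from Table III)
durations = np.array([3.82, 2.69, 3.57, 2.74, 2.85, 3.03,
                      3.35, 3.34, 3.48, 3.21, 3.02, 3.04])

mean = durations.mean()  # unweighted per-generator mean, about 3.18 s
std = durations.std()    # spread between generators, well under half a second
```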

Secondly, we examined the length of the silences contained in each track. Although silence is a fundamental component of speech, it is often overlooked in data generation, leading to biased tracks that are easy to discriminate [40]. This is a common problem, especially when dealing with tts algorithms, where the prosodic component is less present [2] and the duration of the silences is shorter. Table III shows the silence durations of our tracks for both the original and the dtw cases. Here we observe a larger difference between the algorithms, with average silence durations ranging between 0.05 s and 0.78 s. However, when we apply dtw, the silence durations increase and the differences between the generation methods are reduced, homogenizing the synthesized data.

Next, as we are dealing with speech data, we assessed the naturalness of the generated tracks, to avoid releasing audio signals that sound unrealistic. We adopt the mos as a metric and compute it on the synthesized data through MOSNet [34]. mos is a numerical measure of the human-judged overall quality of an event or experience, ranging from 1 (bad) to 5 (excellent). In our case, we use it to evaluate the naturalness of the generated speech tracks. The results for each generation algorithm are shown in Table III. We score an average mos value greater than 3, the threshold we use to determine whether a signal is acceptable. This means that, even though we are dealing with synthetic data, we are not neglecting the realism of the speech. The application of dtw has adverse effects on the mos of the generated data, lowering the average computed on all the tracks from 3.44 to 3.29.

Finally, a crucial aspect to address in generated speech data is the number of supported speakers. As the primary goal of the TIMIT-TTS dataset is to perform binary detection of deepfakes, it is essential to provide several speakers. Training a deepfake detector on a few speakers may make the model learn how to discriminate tracks based on the tone of voice they contain instead of the traces left by the tts generators, as we will highlight in the following experiments. Therefore, providing numerous speakers within the dataset helps avoid this bias and produce more effective models.

TIMIT-TTS implements multiple speakers, in numbers that vary depending on the model used. In addition to the LJSpeech voice, supported by numerous tts generators, each multi-speaker system implements several different voices, male and female, from the VCTK or LibriSpeech datasets. The only exception is gTTS, which only supports English voices; in this case, we have included all of them within the dataset. The number of speakers implemented for each generation method is shown in Table I.

Generator | Track dur. [s] (Clean / DTW) | Silence dur. [s] (Clean / DTW) | MOS (Clean / DTW)
gTTS | 3.82 / 4.25 | 0.55 / 1.29 | 3.59 / 3.39
Tacotron | 2.69 / 4.25 | 0.12 / 1.48 | 3.01 / 3.02
GlowTTS | 3.57 / 4.25 | 0.78 / 1.39 | 3.54 / 3.51
FastPitch | 2.74 / 4.25 | 0.41 / 1.47 | 3.48 / 3.35
VITS | 2.85 / 4.25 | 0.59 / 1.51 | 3.69 / 3.43
FastSpeech2 | 3.03 / 4.25 | 0.05 / 1.32 | 3.03 / 3.00
MixerTTS | 3.35 / 4.25 | 0.07 / 1.32 | 3.04 / 3.02
MixerTTS-X | 3.34 / 4.25 | 0.11 / 1.35 | 3.02 / 3.01
SpeedySpeech | 3.48 / 4.25 | 0.61 / 1.34 | 2.84 / 2.87
Tacotron2 | 3.21 / 4.25 | 0.09 / 1.34 | 3.09 / 3.04
TalkNet | 3.02 / 4.25 | 0.05 / 1.33 | 3.00 / 2.99
Silero | 3.04 / 4.25 | 0.09 / 1.39 | 2.97 / 2.97
Average | 3.10 / 4.25 | 0.44 / 1.43 | 3.44 / 3.29
TABLE III: Speech metrics for each tts generator.

V-B Audio classification results

To benchmark the generated data on the deepfake classification task, we consider an audio baseline that performs deepfake detection. We adopt RawNet2 [51], a state-of-the-art end-to-end neural network that operates on raw waveforms. It was introduced to perform binary classification between real and fake data during the ASVspoof 2019 challenge [57] and was included as a baseline in the ASVspoof 2021 challenge [65].

Here we use the baseline for two different classification tasks. The first is the one it was initially proposed for, namely real vs. fake binary classification. The second is multiclass synthetic speech attribution: given an audio signal generated with some tts technique, we train the network to discriminate which algorithm was used to synthesize it. For this second task, we modified the output layer of the network so that it contains as many neurons as the number of classes we address. Although this is not the task for which the network was proposed, the problem is very close to deepfake classification, and the considered model can address it without issues [47]. Also, the synthetic speech attribution problem has not yet been explored extensively, so there are not many networks proposed explicitly for this task. We illustrate all the experiments performed with RawNet2 in the following sections.

V-B1 Audio binary classification: synthetic speech detection

In this experiment, we want to test how challenging the released dataset is in the deepfake detection task. We perform binary classification considering the audio tracks of the VidTIMIT dataset as real and those of TIMIT-TTS as fake. We use this dataset only in the test phase, following the approach presented in [39], which is helpful for testing the generalization capabilities of a detector. For this experiment we train RawNet2 on ASVspoof 2019, considering balanced classes and data augmentation on the training data. We test the detector on the individual partitions of TIMIT-TTS. When we consider augmented partitions, we also process real data from VidTIMIT following the same approach presented in the previous sections to make real and fake data as consistent as possible with each other.

Figure 3 shows the results of the analysis by means of roc curves and auc values, while Figure 4 shows the distributions of the scores for all the considered classes. In this case, higher scores mean higher confidence in identifying a track as fake. We observe that the detection performance deteriorates as we add post-processing operations to the speech tracks. In particular, the operation that degrades the accuracy the most is the speech-to-speech alignment, with an auc value that drops considerably between the clean and the dtw cases. This means that, although these tracks present a lower mos value in Table III, deepfake detectors must be explicitly trained on this type of data to discriminate them correctly. It also shows that the detection of dtw-aligned tracks is not a solved problem, and our dataset could help in building new detectors that are more robust to in-the-wild conditions. Finally, the augmented tracks are more challenging to detect than the clean ones, with an auc value that drops further between the two cases. As mentioned above, such post-processing techniques hide some of the traces left by the tts generators, making it more challenging to identify the artifacts present in the synthesized tracks.
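For reference, the auc values reported here can be computed directly from detector scores via the rank-based (Mann-Whitney) formulation; a minimal NumPy sketch, ignoring tied scores:

```python
import numpy as np

def auc(scores_fake, scores_real):
    """AUC for a detector whose scores are higher for fake tracks:
    the probability that a random fake outranks a random real.
    Ties between scores are not handled in this sketch."""
    scores = np.concatenate([scores_real, scores_fake])
    ranks = scores.argsort().argsort() + 1   # 1-based ranks of all scores
    fake_ranks = ranks[len(scores_real):]
    n_f, n_r = len(scores_fake), len(scores_real)
    return (fake_ranks.sum() - n_f * (n_f + 1) / 2) / (n_f * n_r)
```

A perfectly separating detector yields 1.0, chance-level behavior 0.5, and a systematically inverted one 0.0.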

Fig. 3: Audio binary classification - ROC Curves.
Fig. 4: Audio binary classification - Scores distribution.

V-B2 Audio multi-class classification: synthetic speech attribution

In this experiment we test the TIMIT-TTS dataset on the synthetic speech attribution task. This consists in identifying, given an input tts track, which algorithm has been used to synthesize it. Formally, given a track we have to determine its generating class c ∈ {1, ..., C}, where C is the number of implemented tts generation methods. We consider all the generation methods available in the TIMIT-TTS dataset, including all implemented speakers. We split the corpus into train and test sets following a fixed percentage policy, ensuring a consistent number of tracks for each generation algorithm in both partitions. We train the RawNet2 model for a fixed number of epochs, using Cross Entropy as loss function and a fixed learning rate.

Figure 5 shows the results of the analysis through a confusion matrix. We observe different performances for the considered algorithms. In particular, the systems trained to produce speech from multiple speakers are relatively easy to identify, while those considering only one speaker are more challenging to distinguish. This is because the detection algorithm seems to leverage the different speakers to perform classification, rather than focusing on the traces left by each tts algorithm itself. On the other hand, the methods that implement the same speaker force the model to learn how to discriminate tracks adequately, and the deterioration in performance is due to the difficulty of the required task. To verify this hypothesis, we repeat the same experiment by independently considering the speech tracks generated by models trained on LJSpeech and those trained on other speakers. The results of this analysis are shown in Figures 6 and 7 and confirm the same trend as before: compared to the initial experiment, the balanced accuracy drops when we consider only the LJSpeech models and rises when we consider the other models. We believe this aspect is paramount when dealing with both deepfake detection and attribution tasks, as we do not want the results obtained by the algorithm to be biased by the considered speakers. Indeed, having multiple tts methods trained to reproduce the same voice constitutes a more challenging scenario, as it forces the detector to learn the traces left by the generators. By providing numerous generation methods trained on LJSpeech, TIMIT-TTS can help develop new attribution algorithms.
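Balanced accuracy, the metric used above, is the unweighted mean of the per-class recalls, which prevents classes with more tracks from dominating the score; a minimal sketch:

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recalls over the classes in y_true."""
    classes = np.unique(y_true)
    recalls = [(y_pred[y_true == c] == c).mean() for c in classes]
    return float(np.mean(recalls))
```

For example, a classifier that always predicts class 0 on a set with two classes scores 0.5, however imbalanced the set is.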

Fig. 5: Confusion matrix showing the baseline performance on the synthetic speech attribution task, considering all the implemented tts methods.
Fig. 6: Confusion matrix showing the baseline performance on the synthetic speech attribution task, considering only the tts methods trained on a single speaker.
Fig. 7: Confusion matrix showing the baseline performance on the synthetic speech attribution task, considering only the tts methods that produce speech with multiple speakers.

V-C Video classification results

As the final goal of our work is to use the proposed dataset to perform multimodal deepfake detection, we need to compare the final detection performance with that of the single modalities. For this reason, after analyzing the audio component, we perform detection on the video one. In this case we consider as baseline an EfficientNetB4 [52] network modified by adding attention layers to improve its performance, following the implementation proposed in [4]. As in the audio case, we consider a model trained on an external dataset to test its generalization capabilities. We use the model provided by the authors, pre-trained on FaceForensics++ [45], and test it on the VidTIMIT and DeepfakeTIMIT datasets, considering them as real and fake data, respectively.

We build two different versions of the test set, corresponding to two different compression levels of the videos. In particular, we generate a high- and a low-quality version of the data by considering two different values of the quantization parameter (QP=23 and QP=40), where a higher QP means lower quality. We do this for two main reasons. First, this is the same compression approach considered in the FaceForensics++ dataset, so we use it to make our data comparable to those the model has been trained on. Second, we want to study the robustness of the model to compression and analyze how much this influences the detection performance. Robustness is a crucial aspect when dealing with deepfake detectors, since most of the multimedia material we deal with comes from social media, where it undergoes several post-processing and compression steps. Developing a robust algorithm means being able to correctly analyze the multimedia material despite these operations.

Figure 8 shows the results of the detection task in terms of roc curves and auc, while Figure 9 shows the score distributions in the considered cases. As in the previous experiment, higher score values mean a higher likelihood that the video is fake. The detection task is accomplished very well when considering “high quality” videos, with a high auc value. This is a significant result, but we will unlikely find data of such high quality in in-the-wild conditions. On the other hand, the performance significantly deteriorates when considering the “low quality” data, with a markedly lower auc value. This leaves room for improvement in the case of multimodal analysis.

Fig. 8: Video binary classification - ROC Curves.
Fig. 9: Video binary classification - Scores distribution.

V-D Multimodal classification results

In this experiment we test the deepfake detection performance of the implemented baselines when considering a multimodal approach. We want to assess whether simultaneously examining multiple aspects of a piece of multimedia material can improve the detection capabilities. To do so, we combine the VidTIMIT, DeepfakeTIMIT and TIMIT-TTS datasets and associate each audio track with its corresponding video. In this way, we obtain a set of data that is falsified in both the audio and video modalities. During this study we analyze the two following scenarios:

  • Scenario 1 - We only consider videos where both modalities belong to the same class, i.e., audio and video are both real or both fake.

  • Scenario 2 - We consider videos where all the combinations between classes are possible, including data that are counterfeited in only one modality. In this case, we label a video as fake when at least one of its audio and video components is falsified.

We do so to consider two different application cases for a multimodal approach. In the first scenario, since the classes of the two components are the same, it would also be possible to use a monomodal approach. Nonetheless, we show that analyzing different aspects of the given material can help improve the detection performance.

The second scenario, on the other hand, is more similar to real-world cases. Here, using a multimodal approach is fundamental, since by analyzing only one aspect at a time we would lose information and obtain partial results. For example, we would be unable to detect videos that are counterfeited in just one modality if that modality is different from the one we are analyzing. For both scenarios, we consider the baselines introduced above for the single modalities, and we fuse their scores in two different ways. In the first case, we compute the average of the two scores, while in the second we take the higher of the two, i.e., the one that identifies the analyzed element as more likely to be fake.
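The two fusion rules and the Scenario 2 labeling policy are simple enough to state directly; a sketch (the function names are ours):

```python
def fuse_avg(audio_score, video_score):
    """Average fusion of the two monomodal deepfake scores."""
    return (audio_score + video_score) / 2

def fuse_max(audio_score, video_score):
    """Keep the score of the modality that looks most fake."""
    return max(audio_score, video_score)

def scenario2_label(audio_is_fake, video_is_fake):
    """Scenario 2 ground truth: fake if at least one modality is falsified."""
    return audio_is_fake or video_is_fake
```

Max fusion favors recall on single-modality fakes, since one confident detector suffices to flag the video, while average fusion tempers an overconfident single modality.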

The results obtained in the first scenario are shown in Figures 10 and 11, divided according to the compression applied to the video modality. The detection performance improves significantly, especially when dealing with post-processed data. In particular, the auc values improve in all cases compared to the corresponding monomodal experiments. Likewise, in the second scenario the multimodal approach performs considerably better than the monomodal ones.

Figure 12 shows the results obtained when considering clean audio data, where the multimodal auc improves over both single modalities. This is very interesting, since it allows us to detect fake videos that we could not find otherwise. We highlight that these positive results have been achieved by fusing the scores of the monomodal detectors in a very straightforward way. We are confident that combining them more smartly could improve the performance even more, demonstrating the effectiveness of multimodal deepfake detectors.

Fig. 10: Multimodal binary classification - Scenario 1 (RR vs. FF) - QP=23.
Fig. 11: Multimodal binary classification - Scenario 1 (RR vs. FF) - QP=40.
Fig. 12: Multimodal binary classification - Scenario 2.

Vi Conclusion

In this work we presented a pipeline to forge synthetic audio content starting from an input video, in order to generate a multimodal deepfake dataset. We used this pipeline to generate and release TIMIT-TTS, a synthetic speech dataset that includes audio tracks generated using 12 different tts systems, among the most advanced in the literature. The released dataset has several applications in the forensics field, such as synthetic speech detection and attribution. Moreover, it can be used in conjunction with other well-established deepfake video datasets to perform multimodal studies, bridging an overlooked aspect of the current state-of-the-art. From the presented results, it emerges that multimodal analyses improve the performance of the detectors, producing more capable and robust systems. At the same time, however, the performances are not entirely satisfactory, so we need more multimodal deepfake datasets, like the one we release, to train and test the developed networks.

This is the dataset’s first version, and future developments will be released. There are several aspects worth investigating and synthesis algorithms that have not been included in this set. Regarding tts systems, we want to examine the effects of using different vocoders on the performance of deepfake detectors and implement a higher number of speakers for all the systems. Moreover, we also want to include vc algorithms in the study since they have not been involved in this work. Nonetheless, we hope this work will help the development of new multimodal deepfake detectors and provide new data to train and test existing systems to make them able to address in-the-wild conditions.


Research was sponsored by the Army Research Office and was accomplished under Cooperative Agreement Number W911NF-20-2-0111, the Defense Advanced Research Projects Agency (DARPA) and the Air Force Research Laboratory (AFRL) under agreement numbers FA8750-20-2-1004 and HR001120C0126, and by the National Science Foundation under Grant No. 1553610. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA and AFRL, the Army Research Office, the National Science Foundation, or the U.S. Government. This work was partly supported by the PREMIER project, funded by the Italian Ministry of Education, University, and Research within the PRIN 2017 program.


  • [1] S. Agarwal, H. Farid, T. El-Gaaly, and S. Lim (2020) Detecting deep-fake videos from appearance and behavior. In IEEE International Workshop on Information Forensics and Security (WIFS), Cited by: §I.
  • [2] L. Attorresi, D. Salvi, C. Borrelli, P. Bestagini, and S. Tubaro (2022) Combining Automatic Speaker Verification and Prosody Analysis for Synthetic Speech Detection. In

    International Conference on Pattern Recognition

    Cited by: §II-C, §V-A.
  • [3] S. Beliaev, Y. Rebryk, and B. Ginsburg (2020) TalkNet: fully-convolutional non-autoregressive speech synthesis model. arXiv preprint arXiv:2005.05514. Cited by: 6th item.
  • [4] N. Bonettini, E. D. Cannas, S. Mandelli, L. Bondi, P. Bestagini, and S. Tubaro (2021) Video face manipulation detection through ensemble of cnns. In International Conference on Pattern Recognition (ICPR), Cited by: §I, §V-C.
  • [5] C. Borrelli, P. Bestagini, F. Antonacci, A. Sarti, and S. Tubaro (2021) Synthetic speech detection through short-term and long-term prediction traces. EURASIP Journal on Information Security 2021 (1), pp. 1–14. Cited by: §II-C.
  • [6] K. Chugh, P. Gupta, A. Dhall, and R. Subramanian (2020) Not made for each other-audio-visual dissonance-based deepfake detection and localization. In International Conference on Multimedia (ACM), Cited by: §III-C.
  • [7] CNN Business Deepfakes are now trying to change the course of war. Note: Cited by: §I.
  • [8] E. Conti, D. Salvi, C. Borrelli, B. Hosler, P. Bestagini, F. Antonacci, A. Sarti, M. C. Stamm, and S. Tubaro (2022) Deepfake Speech Detection Through Emotion Recognition: a Semantic Approach. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Cited by: §II-C.
  • [9] S. Desai, A. W. Black, B. Yegnanarayana, and K. Prahallad (2010) Spectral mapping using artificial neural networks for voice conversion. IEEE Transactions on Audio, Speech, and Language Processing 18 (5), pp. 954–964. Cited by: §II-B.
  • [10] B. Dolhansky, J. Bitton, B. Pflaum, J. Lu, R. Howes, M. Wang, and C. C. Ferrer (2020) The deepfake detection challenge (DFDC) dataset. arXiv preprint arXiv:2006.07397. Cited by: §I, §II-A.
  • [11] R. Durall, M. Keuper, F. Pfreundt, and J. Keuper (2019) Unmasking deepfakes with simple features. arXiv preprint arXiv:1911.00686. Cited by: §I.
  • [12] P. N. Durette (2022) GTTS. GitHub. Note: Cited by: 11st item.
  • [13] J. Frank and L. Schönherr (2021) WaveFake: A Data Set to Facilitate Audio Deepfake Detection. In Conference on Neural Information Processing Systems (NeurIPS), Cited by: §I.
  • [14] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett (1993) DARPA timit acoustic-phonetic continous speech corpus cd-rom. nist speech disc 1-1.1. NASA STI/Recon technical report n 93, pp. 27403. Cited by: §IV-A.
  • [15] L. Guarnera, O. Giudice, and S. Battiato (2020) Deepfake detection by analyzing convolutional traces. In

    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Cited by: §I.
  • [16] T. Hayashi, R. Yamamoto, K. Inoue, T. Yoshimura, S. Watanabe, T. Toda, K. Takeda, Y. Zhang, and X. Tan (2020) Espnet-tts: unified, reproducible, and integratable open source end-to-end text-to-speech toolkit. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), External Links: Document Cited by: §II-B.
  • [17] B. Hosler, D. Salvi, A. Murray, F. Antonacci, P. Bestagini, S. Tubaro, and M. C. Stamm (2021) Do deepfakes feel emotions? a semantic approach to detecting deepfakes via emotional inconsistencies. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, Cited by: §I.
  • [18] K. Ito and L. Johnson (2017) The LJ Speech Dataset. Note: Cited by: 1st item.
  • [19] C. Jemine (2022) Real-time-voice-cloning. GitHub. Note: Cited by: 1st item.
  • [20] I. Jordal (2022) Audiomentations. GitHub. Note: Cited by: §III-D.
  • [21] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. Oord, S. Dieleman, and K. Kavukcuoglu (2018) Efficient neural audio synthesis. In

    International Conference on Machine Learning

    Cited by: 2nd item.
  • [22] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo (2018)

    Stargan-vc: non-parallel many-to-many voice conversion using star generative adversarial networks

    In IEEE Spoken Language Technology Workshop (SLT), Cited by: §II-B.
  • [23] T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo (2019) Cyclegan-vc2: improved cyclegan-based non-parallel voice conversion. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Cited by: §II-B.
  • [24] H. Khalid, M. Kim, S. Tariq, and S. S. Woo (2021) Evaluation of an audio-video multimodal deepfake dataset using unimodal and multimodal detectors. In 1st workshop on synthetic multimedia-audiovisual deepfake generation and detection, Cited by: §I.
  • [25] H. Khalid, S. Tariq, M. Kim, and S. S. Woo (2021) FakeAVCeleb: A Novel Audio-Video Multimodal Deepfake Dataset. In Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), Cited by: §II-A.
  • [26] J. Kim, S. Kim, J. Kong, and S. Yoon (2020) Glow-tts: a generative flow for text-to-speech via monotonic alignment search. Advances in Neural Information Processing Systems 33, pp. 8067–8077. Cited by: 3rd item.
  • [27] J. Kim, J. Kong, and J. Son (2021)

    Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech

    In International Conference on Machine Learning, Cited by: 9th item.
  • [28] D. H. Klatt (1987) Review of text-to-speech conversion for english. The Journal of the Acoustical Society of America 82 (3), pp. 737–793. Cited by: §II-B.
  • [29] P. Korshunov and S. Marcel (2018)

    Deepfakes: a new threat to face recognition? assessment and detection

    arXiv preprint arXiv:1812.08685. Cited by: 2nd item, §I, §I, §IV-A.
  • [30] O. Kuchaiev, J. Li, H. Nguyen, O. Hrinchuk, R. Leary, B. Ginsburg, S. Kriman, S. Beliaev, V. Lavrukhin, J. Cook, et al. (2019) Nemo: a toolkit for building ai applications using neural modules. arXiv preprint arXiv:1909.09577. Cited by: §II-B.
  • [31] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. de Brébisson, Y. Bengio, and A. C. Courville (2019) Melgan: generative adversarial networks for conditional waveform synthesis. Advances in neural information processing systems 32. Cited by: 1st item.
  • [32] A. Łańcucki (2021) Fastpitch: parallel text-to-speech with pitch prediction. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Cited by: 5th item.
  • [33] Y. Li, M. Chang, and S. Lyu (2018) In ictu oculi: Exposing ai created fake videos by detecting eye blinking. In IEEE International Workshop on Information Forensics and Security (WIFS), Cited by: §I.
  • [34] C. Lo, S. Fu, W. Huang, X. Wang, J. Yamagishi, Y. Tsao, and H. Wang (2019) MOSNet: deep learning based objective assessment for voice conversion. In Conference of the International Speech Communication Association (INTERSPEECH), Cited by: §V-A.
  • [35] M. Lomnitz, Z. Hampel-Arias, V. Sandesara, and S. Hu (2020) Multimodal approach for deepfake detection. In IEEE Applied Imagery Pattern Recognition Workshop (AIPR), Cited by: §I.
  • [36] S. Lyu (2020) Deepfake detection: current challenges and next steps. In IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Cited by: §II-C.
  • [37] H. Malik (2019) Securing voice-driven interfaces against fake (cloned) audio attacks. In IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), Cited by: §II-C.
  • [38] M. Müller, Y. Özer, M. Krause, T. Prätzlich, and J. Driedger (2021) Sync Toolbox: a Python package for efficient, robust, and accurate music synchronization. Journal of Open Source Software 6 (64), pp. 3434. Cited by: §III-C.
  • [39] N. M. Müller, P. Czempin, F. Dieckmann, A. Froghyar, and K. Böttinger (2022) Does audio deepfake detection generalize?. arXiv preprint arXiv:2203.16263. Cited by: §V-B1.
  • [40] N. M. Müller, F. Dieckmann, P. Czempin, R. Canals, K. Böttinger, and J. Williams (2021) Speech is silver, silence is golden: what do ASVspoof-trained models really learn?. In Conference of the International Speech Communication Association (INTERSPEECH), Cited by: §V-A.
  • [41] S. J. Nightingale and H. Farid (2022) AI-synthesized faces are indistinguishable from real faces and more trustworthy. Proceedings of the National Academy of Sciences 119 (8), pp. e2120481119. Cited by: §I, §II-B.
  • [42] NPR. That smiling LinkedIn profile face might be a computer-generated fake. Cited by: §I.
  • [43] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur (2015) Librispeech: an ASR corpus based on public domain audio books. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Cited by: 2nd item.
  • [44] Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T. Liu (2020) FastSpeech 2: fast and high-quality end-to-end text-to-speech. arXiv preprint arXiv:2006.04558. Cited by: 4th item.
  • [45] A. Rossler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner (2019) FaceForensics++: learning to detect manipulated facial images. In IEEE/CVF International Conference on Computer Vision, Cited by: §I, §V-C.
  • [46] M. Sahidullah, T. Kinnunen, and C. Hanilçi (2015) A comparison of features for synthetic speech detection. In Conference of the International Speech Communication Association (INTERSPEECH), Cited by: §II-C.
  • [47] D. Salvi, P. Bestagini, and S. Tubaro (2022) Exploring the synthetic speech attribution problem through data-driven detectors. In IEEE International Workshop on Information Forensics and Security (WIFS), Cited by: §V-B.
  • [48] C. Sanderson and B. C. Lovell (2009) Multi-region probabilistic histograms for robust and scalable identity inference. In International Conference on Biometrics, Cited by: §IV-A.
  • [49] C. Sanderson (2002) The VidTIMIT database. Technical report IDIAP. Cited by: 2nd item, §I, §IV-A.
  • [50] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, et al. (2018) Natural TTS synthesis by conditioning WaveNet on Mel spectrogram predictions. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Cited by: 2nd item.
  • [51] H. Tak, J. Patino, M. Todisco, A. Nautsch, N. Evans, and A. Larcher (2021) End-to-end anti-spoofing with RawNet2. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), Cited by: §I, §II-C, §V-B.
  • [52] M. Tan and Q. Le (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, Cited by: §V-C.
  • [53] O. Tatanov, S. Beliaev, and B. Ginsburg (2022) Mixer-TTS: non-autoregressive, fast and compact text-to-speech model conditioned on language model embeddings. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Cited by: 7th item, 8th item.
  • [54] Silero Team (2021) Silero models: pre-trained enterprise-grade STT/TTS models and benchmarks. GitHub. Cited by: 12th item.
  • [55] The New York Times. Pennsylvania Woman Accused of Using Deepfake Technology to Harass Cheerleaders. Cited by: §I.
  • [56] T. Toda, A. W. Black, and K. Tokuda (2007) Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory. IEEE Transactions on Audio, Speech, and Language Processing 15 (8), pp. 2222–2235. Cited by: §II-B.
  • [57] M. Todisco, X. Wang, V. Vestman, M. Sahidullah, H. Delgado, A. Nautsch, J. Yamagishi, N. Evans, T. Kinnunen, and K. A. Lee (2019) ASVspoof 2019: Future horizons in spoofed and fake audio detection. In Conference of the International Speech Communication Association (INTERSPEECH), Cited by: §I, §V-B.
  • [58] J. Vainer and O. Dušek (2020) Speedyspeech: Efficient neural speech synthesis. In Conference of the International Speech Communication Association (INTERSPEECH), Cited by: 10th item.
  • [59] A. Van Den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu (2016) WaveNet: a generative model for raw audio.. SSW 125, pp. 2. Cited by: §II-B.
  • [60] L. Verdoliva (2020) Media forensics and deepfakes: an overview. IEEE Journal of Selected Topics in Signal Processing 14 (5), pp. 910–932. Cited by: §I.
  • [61] Y. Wang, R.J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous (2017) Tacotron: Towards End-to-End Speech Synthesis. In Conference of the International Speech Communication Association (INTERSPEECH), Cited by: §II-B, 1st item.
  • [62] Z. Wang, G. Wei, and Q. He (2011) Channel pattern noise based playback attack detection algorithm for speaker recognition. In IEEE International Conference on Machine Learning and Cybernetics (ICMLC), Cited by: §II-C.
  • [63] Wired A Zelensky Deepfake Was Quickly Defeated. The Next One Might Not Be. Note: Cited by: §I.
  • [64] J. Yamagishi, C. Veaux, K. MacDonald, et al. (2019) CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit. Cited by: 3rd item.
  • [65] J. Yamagishi, X. Wang, M. Todisco, M. Sahidullah, J. Patino, A. Nautsch, X. Liu, K. A. Lee, T. Kinnunen, N. Evans, et al. (2021) ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection. In Automatic Speaker Verification and Spoofing Countermeasures Challenge, Cited by: §I, §V-B.
  • [66] J. Yi, R. Fu, J. Tao, S. Nie, H. Ma, C. Wang, T. Wang, Z. Tian, Y. Bai, C. Fan, et al. (2022) ADD 2022: the first audio deep synthesis detection challenge. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Cited by: §I.