
Hide and Speak: Deep Neural Networks for Speech Steganography

02/07/2019
by Felix Kreuk, et al.

Steganography is the science of hiding a secret message within an ordinary public message, which is referred to as the carrier. Traditionally, digital signal processing techniques, such as least significant bit encoding, were used for hiding messages. In this paper, we explore the use of deep neural networks as steganographic functions for speech data. To this end, we propose to jointly optimize two neural networks: the first network encodes the message inside a carrier, while the second network decodes the message from the modified carrier. We demonstrate the effectiveness of our method on several speech datasets and analyze the results quantitatively and qualitatively. Moreover, we show that our approach can be applied to conceal multiple messages in a single carrier using multiple decoders or a single conditional decoder. Qualitative experiments suggest that modifications to the carrier are unnoticeable by human listeners and that the decoded messages are highly intelligible.


1 Introduction

Steganography (“steganos” – concealed or covered plus “graphein” – writing) is the science of concealing messages inside other messages. It is generally used to convey concealed “secret” messages to recipients who are aware of their presence, while keeping their very existence hidden from other unaware parties who only see the “public” or “carrier” message.

In this paper, we address the topic of speech steganography – hiding secret spoken messages within public audio files. Possibly the most common approach to speech steganography is to encode the secret message in the least significant bits of individual sound samples (Jayaram et al., 2011). The effect of modifying these bits is imperceptible, particularly when the number of bits used to represent the secret message is significantly fewer than the number of bits required to represent the carrier message itself. Other methods include concealing the secret message in the phase of the frequency components of the carrier (Dong et al., 2004) or in the form of the parameters of a minuscule echo that is introduced into the carrier signal (Bender et al., 1999). Yet other approaches embed the signal into redundant frequency bands with minimal carrier energy (Li & Yu, 2000), or low-energy or silence regions (Shirali-Shahreza & Shirali-Shahreza, 2008). Other methods attempt to hide the information in various transform domains (Djebbar et al., 2012).
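To make the classical baseline concrete, the following is a minimal sketch of LSB hiding for 16-bit PCM audio (our own illustration, not the implementation of any of the cited works): each secret bit simply overwrites the least significant bit of one carrier sample, changing its amplitude by at most one quantization step.

```python
import numpy as np

def lsb_hide(carrier: np.ndarray, message_bits: np.ndarray) -> np.ndarray:
    """Hide a bit stream in the least significant bits of 16-bit PCM samples."""
    assert message_bits.size <= carrier.size, "message too long for this carrier"
    stego = carrier.copy()
    # Clear the LSB of the first len(message_bits) samples, then write the bits.
    stego[:message_bits.size] &= ~1
    stego[:message_bits.size] |= message_bits.astype(np.int16)
    return stego

def lsb_recover(stego: np.ndarray, n_bits: int) -> np.ndarray:
    """Read the hidden bits back from the LSBs."""
    return (stego[:n_bits] & 1).astype(np.uint8)

# Example: hide 8 random bits in a random 16-bit carrier and recover them.
rng = np.random.default_rng(0)
carrier = rng.integers(-2**15, 2**15, size=1000, dtype=np.int16)
bits = rng.integers(0, 2, size=8, dtype=np.int16)
assert np.array_equal(lsb_recover(lsb_hide(carrier, bits), 8), bits)
```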

In each case the hiding of information exploits actual or perceptual redundancies in the carrier signal, such as the least significant bits in samples, perceptual robustness to phase modifications, and redundant frequency bands. The redundancies to be exploited by any particular steganographic method must be explicitly selected, and are generally a feature of the algorithm. The algorithm itself must take care to limit the effect of the hiding to imperceptible aspects of the signal, to prevent detection or distortion of the carrier. For instance, a phase-modification based algorithm that distorts the carrier phase too heavily will immediately result in audible artifacts that can be identified. Similarly, a redundant-frequency-band modulation scheme must ensure that post-modulation frequencies do not shift into audible ranges. These restrictions limit the capacity of the carrier to carry hidden information. As a consequence, most schemes only attempt to fit a single hidden message into the carrier at a time. The quality of the hidden audio, too, is often degraded to fit within the available bit rate.

We contend that a priori choice of any specific form of redundancy to exploit is suboptimal. Moreover, the actual redundancy in audio signals exceeds the mere robustness to phase, least significant bits, or inaudible frequency bands, as evidenced by the fact that audio signals are routinely compressed by large factors, or even orders of magnitude, without appreciable loss of perceptual fidelity (Mitchell, 2004). However, identifying and isolating the specific redundancies in the signal is difficult, leading perhaps to the naïve solutions proposed so far that focus on a priori known and identifiable aspects of the signal.

In this paper we propose the use of deep neural networks as learnable steganographic functions that learn to optimally exploit redundancies in audio data to conceal messages. The proposed model comprises three key parts. The first learns to extract a map of potential redundancies from the carrier signal. The second utilizes the map to best “stuff” a secret message into the carrier such that the carrier is minimally affected. The third learns to extract the hidden message from the steganographically-modified carrier.

The proposed model works in the frequency domain; however, in order to transmit audio as a time-domain signal, we apply the Short-Time Fourier Transform (STFT) and Inverse Short-Time Fourier Transform (ISTFT) during the training process as differentiable layers, thus imposing an additional constraint on the network outputs.

We demonstrate qualitatively and quantitatively that the proposed method is able to both effectively hide secret messages into a carrier and recover them from the carrier. More importantly, the scheme now permits us to hide multiple secret messages into a single carrier, each potentially with a different intended recipient who is the only person who can recover it. In experiments we have seen that we can hide up to five independent voice messages into a speech recording in this manner. Qualitative experiments suggest that modifications to the carrier are unnoticeable by human listeners and that the decoded messages are highly intelligible as well.

Our Contribution:

  • We explore the use of neural networks as steganographic functions for speech data. As such, we create a steganographic function that is not restricted to specific forms of signal redundancy.

  • We embed multiple voice messages in a single carrier using either several decoder networks or a single conditional decoder network.

  • We use differentiable STFT/ISTFT layers during training to account for the noise introduced when converting signals from the frequency domain to the time domain and back.

  • We provide an empirical and subjective analysis of the reconstructed files and show that the produced carriers are indistinguishable from the original carriers, while the decoded messages remain highly intelligible.

The paper is organized as follows: Section 2 summarizes related work; Section 3 formulates the problem and sets the notation; Section 4 describes the proposed model; Sections 5 and 6 present the results and an empirical analysis; and Section 7 concludes the paper with a discussion and future work.

2 Related Work

A large variety of steganography methods have been proposed over the years, most of them applied to images (Morkel et al., 2005; Kessler, 2004). The most common approach manipulates the least significant bits (LSB) of the input data to place the secret information. This can be done uniformly or adaptively, through simple replacement or through more complicated techniques (Fridrich et al., 2001; Tamimi et al., 2013). Although often not observable by humans, statistical analysis of the perturbed data can reveal whether a given file contains a hidden message or not (Fridrich et al., 2001). Advanced methods, such as HUGO (Pevnỳ et al., 2010), attempt to preserve the input statistics in order to generate better steganography functions. Another example is WOW (Wavelet Obtained Weights) (Holub & Fridrich, 2012), which penalizes distortion of predictable regions of the input data using a bank of directional filters.

A closely related task is watermarking. Both approaches aim to encode a secret message into a data file. However, in steganography the goal is secret communication, while in watermarking the goal is verification and ownership protection. Several watermarking techniques use LSB encoding (Van Schyndel et al., 1994; Wolfgang & Delp, 1996). Recently, Uchida et al. (2017) and Adi et al. (2018a) suggested embedding watermarks into neural network parameters.

Adversarial examples are synthetic patterns crafted by adding a carefully designed noise to legitimate examples. They are indistinguishable from the legitimate examples by a human, yet they have demonstrated a strong ability to cause catastrophic failure of state-of-the-art neural systems (Szegedy et al., 2014; Cisse et al., 2017; Kreuk et al., 2018; Papernot et al., 2017; Sharif et al., 2017). While the existence of adversarial examples is usually seen as a disadvantage of neural networks, it can be a desirable property for hiding secret information. Instead of injecting perturbations that lead to wrong classification, one can consider encoding useful information by adding specific perturbations.

Recently, neural networks have been used for both steganography and steganalysis (Baluja, 2017; Zhu et al., 2018; Hayes & Danezis, 2017; Qian et al., 2015; Pibre et al., 2016). Unlike traditional methods, neural networks model the entire data hiding pipeline and are trained end-to-end. Baluja (2017) first suggested training neural networks to hide an entire image within another image. Hayes & Danezis (2017) suggested using adversarial learning to generate steganographic images, while Zhu et al. (2018) combined both methods and additionally explored encoding robustness. However, all of these works target images and hide only a single message.

3 Problem Setting

In this section, we formulate the task of speech steganography rigorously and set the notation for the rest of the paper. We denote the domain of acoustic feature vectors by $\mathcal{X} \subset \mathbb{R}^D$. Therefore, the representation of a speech signal is a sequence of vectors $(x_1, \ldots, x_T)$, where $x_t \in \mathcal{X}$ for all $1 \le t \le T$. The length of the input signal varies from one signal to another; thus $T$ is not fixed. We denote by $\mathcal{X}^*$ the set of all finite-length sequences over $\mathcal{X}$.

Recall that in speech steganography the goal is to conceal a hidden message inside a speech segment. Specifically, the steganography system is a function that gets as input a carrier utterance, denoted $c \in \mathcal{X}^*$, and a hidden message, denoted $m \in \mathcal{X}^*$. The outputs of the system are $\hat{c}$ and $\hat{m}$, such that the following constraints are satisfied: (i) both $\hat{c}$ and $\hat{m}$ should be perceptually similar to $c$ and $m$, respectively, in the ears of a human listener; (ii) $\hat{m}$ should be recoverable from $\hat{c}$; (iii) lastly, an uninformed listener should not be able to detect the presence of a hidden message embedded in $\hat{c}$. We would like to stress that the system is not trained for or limited to a specific carrier or message.

4 Model

Our model is composed of three main components: (i) an Encoder Network; (ii) a Carrier Decoder Network; and (iii) a Message Decoder Network. The model is depicted schematically in Figure 1.

The Encoder Network $E$ gets as input a carrier $c$ and a message $m$, and outputs a joint latent representation of both carrier and message, $h$. There are numerous ways to generate such a representation. In our study, we follow the work by Zhu et al. (2018), where $E(c, m) = [\varphi(c); c; m] = h$, where $\varphi$ is a Carrier Encoder and $[\cdot\,;\cdot]$ is the concatenation operator. In other words, the output of the encoder network is the concatenation of the carrier, the message, and the encoded carrier along the convolutional channel axis.

The Carrier Decoder Network $D_c$ gets as input the aforementioned representation $h$ and outputs $\hat{c}$, the carrier embedded with the hidden message.

Lastly, the Message Decoder Network $D_m$ gets as input $\hat{c}$ and outputs $\hat{m}$, the reconstructed hidden message. Figure 1 provides a visual description of our model.

Each of the above components is a neural network, and the parameters are found by minimizing the mean squared error between the carrier and the embedded carrier, and between the original message and the reconstructed message.

Formally, we optimize the following loss function,

$$\mathcal{L}(c, m) = \lambda_c \, \|c - \hat{c}\|_2^2 + \lambda_m \, \|m - \hat{m}\|_2^2, \quad \hat{c} = D_c(E(c, m)), \; \hat{m} = D_m(\hat{c}) \tag{1}$$

where $\lambda_c$ and $\lambda_m$ are set to balance between the reconstruction of the carrier and the reconstruction of the message.
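As an illustration of Equation (1), the sketch below implements the encoder concatenation and the two reconstruction terms in PyTorch. The plain `Conv2d` stand-ins, layer widths, and variable names are our assumptions for brevity; the actual model uses gated ConvNet blocks (see Section 5.1).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins for the gated ConvNet blocks of the paper. Inputs are
# (batch, 1, freq, time) spectrograms for carrier c and message m.
phi = nn.Conv2d(1, 32, 3, padding=1)            # carrier encoder
Dc  = nn.Conv2d(32 + 1 + 1, 1, 3, padding=1)    # carrier decoder
Dm  = nn.Conv2d(1, 1, 3, padding=1)             # message decoder

def loss_fn(c, m, lambda_c=1.0, lambda_m=1.0):
    # E(c, m) = [phi(c); c; m]: concatenate along the channel axis.
    h = torch.cat([phi(c), c, m], dim=1)
    c_hat = Dc(h)                  # carrier embedded with the message
    m_hat = Dm(c_hat)              # reconstructed message
    loss = lambda_c * F.mse_loss(c_hat, c) + lambda_m * F.mse_loss(m_hat, m)
    return loss, c_hat, m_hat

c = torch.randn(4, 1, 257, 100)   # batch of carrier spectrograms
m = torch.randn(4, 1, 257, 100)   # batch of message spectrograms
loss, c_hat, m_hat = loss_fn(c, m)
loss.backward()                   # trains all three components jointly
```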

Figure 1: Model overview: the encoder $E$ gets as input the carrier $c$ and the message $m$; it encodes $c$ using the carrier encoder $\varphi$ and concatenates $\varphi(c)$ to $c$ and $m$ to generate $h$. Then, the carrier decoder $D_c$ generates the new encoded carrier $\hat{c}$, from which the message decoder $D_m$ decodes the message $\hat{m}$. During training, the reconstruction loss is applied between $\hat{c}$ and $\hat{m}$ and $c$ and $m$, respectively.

4.1 Concealing Multiple Messages

So far we presented a model that can conceal a speech message within a speech carrier. This model can be extended to conceal multiple messages within the same carrier. We explored two possible approaches to implement such an extension.

Multiple Decoders.

In this setting, the model is provided with a single carrier $c$ and a set of messages $\{m_k\}_{k=1}^{K}$, where $m_k \in \mathcal{X}^*$. In this case, $h$ is the concatenation of the encoded carrier, the original carrier, and all $K$ messages. Since our goal is to decode all messages separately, we use $K$ message decoders, denoted by $D_m^k$ where $1 \le k \le K$, one for each message. The modified loss function is therefore,

$$\mathcal{L}(c, m_1, \ldots, m_K) = \lambda_c \, \|c - \hat{c}\|_2^2 + \lambda_m \sum_{k=1}^{K} \|m_k - D_m^k(\hat{c})\|_2^2 \tag{2}$$

In words, each message decoder $D_m^k$ is trained to decode the $k$-th message $m_k$. Notice that this setup can be viewed as a generalization of single-message decoding where $K$ is set to one.

Conditional Decoder.

The above setup leads to linear growth in memory cost with $K$. In other words, concealing $K$ messages requires $K$ separate decoders, which can be memory intensive for large values of $K$. To mitigate that, we further explore the use of a single conditional decoder instead of multiple decoders. In this setup, the encoding process is identical to the case of multiple decoders. During decoding, we condition the decoder on a set of codes $\{d_k\}_{k=1}^{K}$. Therefore, $D_m$ gets as input not only $\hat{c}$ but also a code $d_k$ indicating the message index. The corresponding loss function is as follows,

$$\mathcal{L}(c, m_1, \ldots, m_K) = \lambda_c \, \|c - \hat{c}\|_2^2 + \lambda_m \sum_{k=1}^{K} \|m_k - D_m(\hat{c}, d_k)\|_2^2 \tag{3}$$

Each code $d_k$ is represented as a one-hot vector of size $K$ with 1 at the $k$-th index and zeros elsewhere. We follow the work by Choi et al. (2018), where the conditioning label is spatially replicated and concatenated with the original input.
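A minimal sketch of this conditioning mechanism, under the assumption that spectrograms are (batch, channel, frequency, time) tensors: the one-hot code $d_k$ is replicated over the frequency and time axes and concatenated to $\hat{c}$ as $K$ extra input channels. `Dm_cond` and `decode_message` are hypothetical names.

```python
import torch
import torch.nn as nn

K = 3  # number of hidden messages
# The conditional message decoder takes the embedded carrier plus K code channels.
Dm_cond = nn.Conv2d(1 + K, 1, 3, padding=1)

def decode_message(c_hat, k):
    """Decode the k-th message: spatially replicate the one-hot code d_k
    and concatenate it to c_hat along the channel axis."""
    B, _, F, T = c_hat.shape
    d_k = torch.zeros(B, K, F, T, device=c_hat.device)
    d_k[:, k] = 1.0                          # one-hot across the K channels
    return Dm_cond(torch.cat([c_hat, d_k], dim=1))

c_hat = torch.randn(4, 1, 257, 100)
m_hat_2 = decode_message(c_hat, k=2)         # recover the third hidden message
```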

4.2 Representation and Inference

The carrier $c$, the embedded carrier $\hat{c}$, and the message $m$ are all represented by the spectrogram, namely, the magnitude of the STFT. However, our goal is to transmit the reconstructed carrier as a time-domain audio signal, and the messages must ideally be decoded from the recomposed audio signal. In order to reconstruct the carrier audio, we must invert the decoded carrier spectrogram to an audio signal. This requires the phase of the spectrum in each column of the spectrogram of $\hat{c}$.

One can explore several ways of approximating the phase of $\hat{c}$: (i) use the phase of the original carrier $c$. This results in a natural-sounding reconstructed carrier audio. However, since the phase that is borrowed from the original carrier does not match the spectral magnitudes of $\hat{c}$, the magnitude of the STFT computed from the reconstructed carrier audio will not be identical to that of $\hat{c}$ itself. As a result, the decoded messages will be incomprehensible. (ii) Use a phase approximation algorithm, such as the Griffin-Lim algorithm (Griffin & Lim, 1984). While this approach is straightforward and results in better message decoding, the reconstructed carrier now has degraded quality. (iii) Use a neural vocoder, such as WaveNet (Van Den Oord et al., 2016). Although WaveNet was shown to produce high-quality audio, it comes at the cost of high computational resources. (iv) Statistically model the discrepancy caused by the STFT→ISTFT transformations using the original carrier phase. In this work, we follow the latter and propose two possible settings. The first is to train the model without converting to the time domain, and then fine-tune $D_m$ on the STFT→ISTFT of $\hat{c}$ with the original phase. We denote this setup as FTD. The second is to apply STFT→ISTFT to $\hat{c}$ during the training process. This can be done by implementing STFT and ISTFT as differentiable complex 1D-convolution layers. We denote this setup as SFS. Using STFT as a neural layer was successfully explored in (Défossez et al., 2018). Figure 2 illustrates the proposed architecture.
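The sketch below illustrates the SFS round trip using PyTorch's built-in `torch.stft`/`torch.istft`, which are differentiable, as a stand-in for the paper's complex 1D-convolution layers. We assume the 10ms sliding window of Section 5.1 corresponds to a 160-sample hop at 16kHz.

```python
import torch

N_FFT, HOP = 512, 160   # 512 FFT bins, 10 ms hop at 16 kHz
window = torch.hann_window(N_FFT)

def stft_istft_roundtrip(mag_hat, phase):
    """SFS-style constraint: recompose audio from the predicted magnitude and
    the original carrier phase, then re-analyze it. Everything here is
    differentiable, so gradients flow through the transform during training."""
    spec = torch.polar(mag_hat, phase)                    # complex spectrogram
    audio = torch.istft(spec, N_FFT, HOP, window=window)  # to the time domain
    spec2 = torch.stft(audio, N_FFT, HOP, window=window,
                       return_complex=True)               # and back
    return spec2.abs()                                    # magnitude seen by D_m

mag = torch.rand(2, 257, 101, requires_grad=True)   # |STFT| of c_hat
phase = torch.rand(2, 257, 101) * 2 * torch.pi      # borrowed carrier phase
mag2 = stft_istft_roundtrip(mag, phase)
mag2.sum().backward()   # gradients reach the carrier decoder through mag
```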

For completeness, we additionally explore training the model using FTD but fine-tuning all network components, which we denote by FTA; and training the model as in SFS for 5k iterations and then fine-tuning only $D_m$ for an additional 5k iterations. Overall this model was trained for 10k iterations, and we denote it by SFS+FTD.

Figure 2: The new STFT→ISTFT component is inside the gray box. The carrier decoder produces the new carrier power spectrogram, which is then transformed to the time domain and back using the STFT→ISTFT layers.

5 Experimental Results

In this section, we present our experimental results. First, we describe the experimental setup. Then, we evaluate our model on two speech datasets, and lastly, we provide a visual analysis of the spectrograms. We implemented the code using PyTorch (Paszke et al., 2017). Code and samples will be available at: https://bit.ly/2Bj163d.

5.1 Experimental Setup

We evaluated our approach on the TIMIT (Garofolo et al., 1993) and YOHO (Campbell, 1995) datasets using the standard train/validation/test splits. Each utterance was sampled at 16kHz and represented by its power spectrum, obtained by applying the STFT with 512 FFT frequency bins and a 10ms sliding window. Training examples were generated by randomly selecting one utterance as the carrier and $K$ other utterances as messages; thus, the selection of carrier and messages is completely arbitrary. All models were trained using Adam for 10k iterations. The coefficients $\lambda_c$ and $\lambda_m$ were set separately for TIMIT and YOHO to balance the carrier and message reconstruction losses.

Each component in our model is implemented as a Gated Convolutional Neural Network (Gated ConvNet) as proposed by Dauphin et al. (2017). Specifically, the carrier encoder $\varphi$ is composed of three blocks of gated convolutions, the carrier decoder $D_c$ is composed of four blocks, and the message decoder $D_m$ is composed of six blocks. Each block contains 64 kernels of size 3×3.

Both TIMIT and YOHO were recorded in sterile conditions, i.e., in a noise-free environment. This is usually not the case in real-world settings. To simulate real-world conditions, we inject background noise into all carrier segments. Specifically, we use random crops of restaurant background noise and inject them into carriers as follows:

$$\tilde{c} = c + \frac{\bar{E}_c}{\bar{E}_n} \cdot n \tag{4}$$

where $\bar{E}_c$ is the average carrier energy, $n$ is the cropped noise, and $\bar{E}_n$ is the average noise energy.
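A sketch of this noise injection, implementing the reconstructed Equation (4) directly (any fixed SNR constant that may have appeared in the original, garbled scaling is omitted):

```python
import torch

def inject_noise(c, noise):
    """Add a random crop of background noise to a time-domain carrier,
    scaled by the carrier-to-noise average energy ratio of Eq. (4)."""
    start = torch.randint(0, noise.numel() - c.numel() + 1, (1,)).item()
    n = noise[start:start + c.numel()]   # random crop of the noise recording
    e_c = c.pow(2).mean()                # average carrier energy
    e_n = n.pow(2).mean()                # average noise energy
    return c + (e_c / e_n) * n

carrier = torch.randn(16000)    # 1 s of audio at 16 kHz
noise = torch.randn(160000)     # 10 s of background noise
noisy_carrier = inject_noise(carrier, noise)
```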

5.2 Single Message

Table 1 reports the Mean Squared Error (MSE) for both carrier and message using single-message embedding for all four settings (FTD, SFS, FTA, and SFS+FTD) on the TIMIT and YOHO datasets. Notice that FTD yields the lowest carrier MSE on both datasets, but this translates to a poor message reconstruction loss. This can be explained by the fact that the network is not constrained by the STFT→ISTFT layer, hence it is not forced to output a carrier spectrogram that retains the hidden message after the STFT→ISTFT transformation. To validate that, we ran the FTA setup, in which we fine-tune all model components rather than $D_m$ only. Results reveal that although the carrier loss increased, the message loss greatly improved. However, using the STFT→ISTFT layer from the beginning of training proves to be beneficial in both the SFS and SFS+FTD settings, with SFS achieving the best message reconstruction loss and the best overall loss according to Equation (1).

Figure 3: A comparison between carrier and message training losses for $K \in \{1, 3, 5\}$. The upper pair of figures corresponds to using multiple decoders; the bottom pair corresponds to using a single conditional decoder.

Dataset  Setup    Carrier loss  Msg. loss
TIMIT    FTD      0.005         0.126
         SFS      0.017         0.038
         FTA      0.024         0.049
         SFS+FTD  0.033         0.042
         Adv.     0.016         0.054
YOHO     FTD      0.002         0.051
         SFS      0.006         0.017
         FTA      0.005         0.019
         SFS+FTD  0.012         0.017
         Adv.     0.006         0.025
Table 1: MSE for both carrier and message using single-message embedding. Results are reported for both the TIMIT and YOHO datasets.

5.3 Multiple Messages

In order to evaluate our model for concealing more than one message, we explored two settings: using multiple decoders or a single conditional decoder. Table 2 summarizes the results. Both settings achieve comparable reconstruction losses when embedding three messages in a single carrier. Notice that an increase in the number of messages translates to higher loss values for both carrier and messages. These results are expected, as the model is forced to work at higher compression rates due to concealing and recovering more messages while keeping the carrier dimensions the same. To further explore the effect of $K$ on the loss values, we trained additional models using five messages. Figure 3 compares the training loss values while embedding 1, 3, and 5 messages using multiple decoders and a single conditional decoder. For multiple decoders, increasing the number of messages comes at the cost of reconstruction quality in terms of the average message reconstruction loss; however, the carrier loss converges to roughly the same value. The behaviour of the conditional decoder is similar to that of the multiple decoders.


Dataset  Model        Carrier loss  Msg. loss
TIMIT    multi (K=3)  0.051         0.063
         cond (K=3)   0.071         0.066
         multi (K=5)  0.079         0.081
         cond (K=5)   0.079         0.089
YOHO     multi (K=3)  0.017         0.031
         cond (K=3)   0.015         0.026
         multi (K=5)  0.021         0.035
         cond (K=5)   0.029         0.038
Table 2: Carrier loss and average message loss for concealing multiple messages using either multiple decoders (multi) or a single conditional decoder (cond). Results are reported for both the TIMIT and YOHO datasets, all under the SFS setting. The number of messages is denoted by $K$.
Figure 4: Steganography results. First three images of each set: original carrier $c$, embedded carrier $\hat{c}$, and their residual error. Last three images of each set: original message $m$, decoded message $\hat{m}$, and their residual error. The first row corresponds to single-message decoding, the second row to decoding a single preprocessed message, and the last row to decoding three messages using three decoders.

5.4 Adversarial Training

Zhu et al. (2018) suggested adding an adversarial loss to the network in order to make the embedded carrier even harder to distinguish from the original carrier. Although adversarial learning has had great success in the domains of text and vision (Ganin et al., 2016; Zhang et al., 2017; Choi et al., 2018; Goodfellow et al., 2014), its effect in speech processing is unclear (Adi et al., 2018b). To further investigate this, we added adversarial loss terms to our optimization, resulting in the following objective,

$$\mathcal{L}_{\mathrm{total}}(c, m) = \mathcal{L}(c, m) + \lambda_a \log\big(1 - A(\hat{c})\big) \tag{5}$$
$$\mathcal{L}_A(c, \hat{c}) = -\log A(c) - \log\big(1 - A(\hat{c})\big) \tag{6}$$

where $\lambda_a$ is the adversarial loss coefficient and $\mathcal{L}_A$ is the discriminator loss. $A$ is a discrimination network that receives spectrograms as input and outputs a number in the interval $[0, 1]$; it aims to discriminate between original carrier files and carriers generated by our steganography model. More formally, $A$ is a function $A: \mathcal{X}^* \to [0, 1]$, implemented as a fully convolutional neural network with six convolutional layers of 64 kernels of size 3×3 each. Results are summarized in Table 1: the adversarial loss terms had a perceptually unnoticeable effect on $\hat{c}$ and a minor impact on the carrier loss value.
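The sketch below shows one way to realize the two loss terms with a toy two-layer discriminator (the paper's $A$ has six convolutional layers); the alternating real-vs-embedded objective is the standard GAN recipe assumed in our reconstruction of Equations (5) and (6).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy discriminator A: spectrogram -> probability of being an original carrier.
A = nn.Sequential(
    nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 1, 3, padding=1),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Sigmoid(),
)

def adversarial_losses(c, c_hat, lambda_a=0.1):
    eps = 1e-7
    # Generator term of Eq. (5): drive A(c_hat) toward 1, i.e. fool A.
    gen_term = lambda_a * torch.log(1.0 - A(c_hat) + eps).mean()
    # Discriminator loss of Eq. (6): real carriers -> 1, embedded -> 0
    # (c_hat is detached so this update only touches A's parameters).
    p_real, p_fake = A(c), A(c_hat.detach())
    disc_loss = F.binary_cross_entropy(p_real, torch.ones_like(p_real)) \
              + F.binary_cross_entropy(p_fake, torch.zeros_like(p_fake))
    return gen_term, disc_loss

c = torch.randn(4, 1, 257, 100)
c_hat = torch.randn(4, 1, 257, 100, requires_grad=True)
gen_term, disc_loss = adversarial_losses(c, c_hat)
```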

5.5 Subjective Experiments

Carrier ABX Testing.

To validate that the difference between the original carrier and the carrier embedded with a hidden message is not detectable by humans, we performed ABX testing. An ABX test is a standard way to assess detectable differences between two choices of sensory stimuli. We present each participant with two audio samples, A and B. Each of these two samples is either the original carrier or the carrier embedded with a hidden message. These two samples are followed by a third sound, X, randomly selected to be either A or B. The participant must then decide whether X is the same as A or B. We generated 50 audio samples, and for each audio sample we recorded 20 answers from Amazon Mechanical Turk (1000 answers overall). Only 48.2% of the carriers embedded with hidden messages could be distinguished from the original ones (the chance level is 50%). The modifications made by the steganographic function are therefore not reliably distinguishable by the human ear.
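For reference, a two-sided binomial test confirms that 482 correct answers out of 1000 is statistically consistent with guessing; this is our own significance check, not part of the original evaluation.

```python
from scipy.stats import binomtest

# 482 correct identifications out of 1000 ABX answers, tested against the
# 50% chance level. A large p-value means no detectable difference.
result = binomtest(482, n=1000, p=0.5)
print(result.pvalue)  # ~0.27, well above the usual 0.05 threshold
```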

Message Intelligibility.

A major metric for evaluating a speech steganography system is the intelligibility of the reconstructed messages. To quantify this, we conducted an additional subjective experiment on Amazon Mechanical Turk. We generated 50 samples: 25 original messages and 25 messages reconstructed by our model. We recorded 20 answers for each sample (1000 answers overall). The participants were instructed to transcribe the presented samples, and the Word Error Rate (WER) and Character Error Rate (CER) were measured. While WER is a coarse measure, CER provides a finer evaluation of transcription error. A small difference between the original and reconstructed messages in terms of WER and CER would suggest that the steganographic system does not degrade the intelligibility of the hidden messages. The CER/WER measured on the original and reconstructed messages was 15.38%/32.35% and 15.30%/33.31%, respectively. We therefore deduce that our system does not degrade the intelligibility of the speech signal. Although the differences in error rates between $m$ and $\hat{m}$ are low, the absolute values are relatively high. In order to better understand this phenomenon, we performed a manual analysis of the errors made by the transcribers by comparing the transcribed text against the target label. Table 3 presents sampled pairs of mismatched transcriptions and target labels (see the sketch following Table 3 for how these rates are computed). Notice that although the error rates (CER/WER) are high, the semantic meaning of the transcribed text remains the same.


Target label Human Transcription
westchester is a county westchesters a county
westchester is a county west chesters a county
previous speakers previous speaker
no price is too high no price is to high
the water contain water contains
it must be and it must be
Table 3: Several samples from the transcribed messages. Although WER and CER are relatively high, a closer look reveals semantically equivalent transcriptions.
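For clarity, WER and CER are both edit-distance rates. A minimal sketch (our own helper, using plain Levenshtein distance) applied to one of the Table 3 pairs shows how a single-word mismatch already yields 50% WER on a two-word reference:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (single-row DP)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def wer(ref, hyp):
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)

def cer(ref, hyp):
    return edit_distance(list(ref), list(hyp)) / len(ref)

# Example from Table 3: semantically equivalent yet penalized by WER.
print(wer("previous speakers", "previous speaker"))  # 0.5
```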

6 Analysis

In standard steganographic applications, it is assumed that the original, clean carrier is kept private and cannot be accessed by an adversary. However, in order to further analyze the ability of the generated carrier to encode the hidden message, we experimented with subtracting the original carrier from the embedded carrier and visualizing the resulting spectrograms.

Carrier visualizations consist of the original unmodified carrier, the carrier embedded with a hidden message, and their residual difference. Message visualizations consist of the original message, the message decoded by the steganography model, and their residual difference. We notice that the residual difference, both in audio form and in visual representation, is not sufficient evidence to decipher the embedded message.

In some cases, the residual $\hat{c} - c$ vaguely resembles $m$ in the ears of a human listener when presented together with the original message (Spieth et al., 1954). Such a setup, where the adversary has access to the original carrier and some reference of the original message, is extreme and highly unlikely.

Nevertheless, additional precautions can be taken to further obfuscate the hidden message and challenge such an adversary. One can apply various pre-processing stages to the message before passing it to the steganography function, such as frequency permutation, or dividing the spectrogram into tiles and re-arranging them. In this study we explore a more basic operation: time and frequency flipping of the message. We flip the message spectrogram along the time and frequency axes and feed it to the steganography model. We leave additional and more intricate transformations for future work.

The above pre-processing stage results in an incomprehensible residual, while the same message reconstruction ability is retained after repeating the flipping operation on the decoded output.
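A sketch of this flip pre-processing, assuming (batch, channel, frequency, time) message spectrograms; since the flip is an involution, applying it again to the decoded message restores the original orientation.

```python
import torch

def flip_message(spec):
    """Flip a (B, 1, F, T) message spectrogram along the frequency and time
    axes before encoding; applying the same flip to the decoded output
    restores the original orientation."""
    return torch.flip(spec, dims=[2, 3])

m = torch.randn(4, 1, 257, 100)
m_flipped = flip_message(m)                      # fed to the steganography model
assert torch.equal(flip_message(m_flipped), m)   # the flip is its own inverse
```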

Figure 4 presents visualizations of carrier and message for single-message decoding, preprocessed single-message decoding, and three-message decoding. Notice that in single-message decoding, the residual difference resembles the embedded message in areas of high energy, both visually and audibly. When the message is preprocessed, the residual difference still resembles the message visually, but it is incomprehensible to a human listener. This visual similarity can be mitigated using the more complex transformations described above. Lastly, in multiple-message decoding, the residual difference appears very noisy and bears no resemblance to any particular embedded message.

7 Discussion and Future Work

In this study, we explored for the first time the use of neural networks for speech steganography. We demonstrated the ability of our model to hide several messages in a single carrier using both multiple decoders and a single conditional decoder. The proposed method leverages the redundancy of the speech data representation; hence it might not perform as well when applied to domains other than speech, such as music or generic audio sequences. We would like to investigate these domains in future work.

Another research direction could be utilizing our model as a measure of information redundancy in arbitrary media. One could potentially use the reconstruction loss as a proxy for measuring redundancy in the carrier: a carrier with high redundancy will be capable of containing more hidden messages than an information-rich, low-redundancy one. We leave further exploration of this for future work.

Lastly, we would like to consider other types of neural architectures such as recurrent neural networks, attention mechanisms, and models working on the time-domain.

References

  • Adi et al. (2018a) Adi, Y., Baum, C., Cisse, M., Pinkas, B., and Keshet, J. Turning your weakness into a strength: Watermarking deep neural networks by backdooring. USENIX, 2018a.
  • Adi et al. (2018b) Adi, Y., Zeghidour, N., Collobert, R., Usunier, N., Liptchinsky, V., and Synnaeve, G. To reverse the gradient or not: An empirical comparison of adversarial and multi-task learning in speech recognition. arXiv preprint arXiv:1812.03483, 2018b.
  • Baluja (2017) Baluja, S. Hiding images in plain sight: Deep steganography. In Advances in Neural Information Processing Systems, pp. 2069–2079, 2017.
  • Bender et al. (1999) Bender, W., Gruhl, D., and Morimoto, N. Method and apparatus for echo data hiding in audio signals, April 6 1999. US Patent 5,893,067.
  • Campbell (1995) Campbell, J. P. Testing with the yoho cd-rom voice verification corpus. In ICASSP, 1995.
  • Choi et al. (2018) Choi, Y., Choi, M., Kim, M., Ha, J.-W., Kim, S., and Choo, J. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8789–8797, 2018.
  • Cisse et al. (2017) Cisse, M. M., Adi, Y., Neverova, N., and Keshet, J. Houdini: Fooling deep structured visual and speech recognition models with adversarial examples. In Advances in Neural Information Processing Systems, pp. 6977–6987, 2017.
  • Dauphin et al. (2017) Dauphin, Y. N., Fan, A., Auli, M., and Grangier, D. Language modeling with gated convolutional networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 933–941. JMLR. org, 2017.
  • Défossez et al. (2018) Défossez, A., Zeghidour, N., Usunier, N., Bottou, L., and Bach, F. Sing: Symbol-to-instrument neural generator. In Advances in Neural Information Processing Systems, pp. 9055–9065, 2018.
  • Djebbar et al. (2012) Djebbar, F., Ayad, B., Meraim, K. A., and Hamam, H. Comparative study of digital audio steganography techniques. EURASIP Journal on Audio, Speech, and Music Processing, 2012(1):25, 2012.
  • Dong et al. (2004) Dong, X., Bocko, M. F., and Ignjatovic, Z. Data hiding via phase manipulation of audio signals. In Acoustics, Speech, and Signal Processing, 2004. Proceedings.(ICASSP’04). IEEE International Conference on, volume 5, pp. V–377. IEEE, 2004.
  • Fridrich et al. (2001) Fridrich, J., Goljan, M., and Du, R. Detecting lsb steganography in color, and gray-scale images. IEEE multimedia, 8(4):22–28, 2001.
  • Ganin et al. (2016) Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., and Lempitsky, V. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
  • Garofolo et al. (1993) Garofolo, J. S., Lamel, L. F., Fisher, W. M., Fiscus, J. G., and Pallett, D. S. Darpa timit acoustic-phonetic continous speech corpus cd-rom. nist speech disc 1-1.1. NASA STI/Recon technical report n, 93, 1993.
  • Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
  • Griffin & Lim (1984) Griffin, D. and Lim, J.

    Signal estimation from modified short-time fourier transform.

    IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243, 1984.
  • Hayes & Danezis (2017) Hayes, J. and Danezis, G. Generating steganographic images via adversarial training. In Advances in Neural Information Processing Systems, pp. 1954–1963, 2017.
  • Holub & Fridrich (2012) Holub, V. and Fridrich, J. Designing steganographic distortion using directional filters. In 2012 IEEE International workshop on information forensics and security (WIFS), pp. 234–239. IEEE, 2012.
  • Jayaram et al. (2011) Jayaram, P., Ranganatha, H., and Anupama, H. Information hiding using audio steganography – a survey. The International Journal of Multimedia & Its Applications (IJMA), 3:86–96, 2011.
  • Kessler (2004) Kessler, G. An overview of steganography for the computer forensics examiner. retrieved february 26, 2006, 2004.
  • Kreuk et al. (2018) Kreuk, F., Adi, Y., Cisse, M., and Keshet, J. Fooling end-to-end speaker verification by adversarial examples. Proc. ICASSP, 2018.
  • Li & Yu (2000) Li, X. and Yu, H. H. Transparent and robust audio data hiding in subband domain. In itcc, pp.  74. IEEE, 2000.
  • Mitchell (2004) Mitchell, J. L. Introduction to digital audio coding and standards. Journal of Electronic Imaging, 13(2):399, 2004.
  • Morkel et al. (2005) Morkel, T., Eloff, J. H., and Olivier, M. S. An overview of image steganography. In ISSA, pp. 1–11, 2005.
  • Papernot et al. (2017) Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z. B., and Swami, A. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pp. 506–519. ACM, 2017.
  • Paszke et al. (2017) Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in pytorch. 2017.
  • Pevnỳ et al. (2010) Pevnỳ, T., Filler, T., and Bas, P. Using high-dimensional image models to perform highly undetectable steganography. In International Workshop on Information Hiding, pp. 161–177. Springer, 2010.
  • Pibre et al. (2016) Pibre, L., Pasquet, J., Ienco, D., and Chaumont, M. Deep learning is a good steganalysis tool when embedding key is reused for different images, even if there is a cover source mismatch. Electronic Imaging, 2016(8):1–11, 2016.
  • Qian et al. (2015) Qian, Y., Dong, J., Wang, W., and Tan, T. Deep learning for steganalysis via convolutional neural networks. In Media Watermarking, Security, and Forensics 2015, volume 9409, pp. 94090J. International Society for Optics and Photonics, 2015.
  • Sharif et al. (2017) Sharif, M., Bhagavatula, S., Bauer, L., and Reiter, M. K. Adversarial generative nets: Neural network attacks on state-of-the-art face recognition. arXiv preprint arXiv:1801.00349, 2017.
  • Shirali-Shahreza & Shirali-Shahreza (2008) Shirali-Shahreza, S. and Shirali-Shahreza, M. Steganography in silence intervals of speech. In International conference on intelligent information hiding and multimedia signal processing, pp. 605–607. IEEE, 2008.
  • Spieth et al. (1954) Spieth, W., Curtis, J. F., and Webster, J. C. Responding to one of two simultaneous messages. The Journal of the acoustical society of America, 26(3):391–396, 1954.
  • Szegedy et al. (2014) Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. In Proc. ICLR, 2014.
  • Tamimi et al. (2013) Tamimi, A. A., Abdalla, A. M., and Al-Allaf, O. Hiding an image inside another image using variable-rate steganography. International Journal of Advanced Computer Science and Applications (IJACSA), 4(10), 2013.
  • Uchida et al. (2017) Uchida, Y., Nagai, Y., Sakazawa, S., and Satoh, S. Embedding watermarks into deep neural networks. In Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval, pp. 269–277. ACM, 2017.
  • Van Den Oord et al. (2016) Van Den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. Wavenet: A generative model for raw audio. CoRR abs/1609.03499, 2016.
  • Van Schyndel et al. (1994) Van Schyndel, R. G., Tirkel, A. Z., and Osborne, C. F. A digital watermark. In Image Processing, 1994. Proceedings. ICIP-94., IEEE International Conference, volume 2, pp. 86–90. IEEE, 1994.
  • Wolfgang & Delp (1996) Wolfgang, R. B. and Delp, E. J. A watermark for digital images. In ICIP (3), pp. 219–222, 1996.
  • Zhang et al. (2017) Zhang, Y., Barzilay, R., and Jaakkola, T. Aspect-augmented adversarial networks for domain adaptation. arXiv preprint arXiv:1701.00188, 2017.
  • Zhu et al. (2018) Zhu, J., Kaplan, R., Johnson, J., and Fei-Fei, L. Hidden: Hiding data with deep networks. In European Conference on Computer Vision, pp. 682–697. Springer, 2018.