In recent years, an increasing number of security-sensitive voice-controlled systems such as virtual assistants have been introduced. While these systems are typically equipped with a speaker verification model, their vulnerability to multiple types of replay attacks has become a new security concern, e.g., an attacker can play a pre-recorded or synthesized speech sample to spoof the speaker verification system [7, 22, 5, 9, 8]. Therefore, developing an effective countermeasure to distinguish between genuine and replayed samples has become a recent research focus [19, 33, 16, 10]. While there have been many prior efforts in this area [14, 36, 32, 1, 21, 23, 4, 35, 6, 15, 17, 38, 11, 27], they focus only on detecting replay attacks based on single-channel input and therefore leverage only the temporal and spectral features. However, we identify three reasons why a countermeasure designed using multi-channel audio input could provide improved performance. First, multi-channel audio captured by a microphone array contains spatial information, which can provide useful cues to help distinguish genuine and replayed samples [12, 39]. Second, it is relatively easy for an attacker to manipulate the temporal and spectral features to fool an anti-spoofing module by simply modifying the replayed signal, while spatial features (e.g., time difference of arrival (TDoA)) are harder to manipulate and hence are more reliable. Third, multi-channel speech recognition techniques have been extensively studied and adopted [2, 30], and most modern far-field speech recognition systems are already equipped with microphone arrays, which makes it easy to obtain multi-channel audio.
Only a few prior studies have focused on multi-channel voice anti-spoofing. One of them proposes VoiceLive, which captures the TDoA changes of a sequence of phoneme sounds at two microphones and uses these unique TDoA dynamics to distinguish between replayed and genuine samples. The limitation is that this method requires the microphones to be placed very close (1-6 cm) to the mouth. Another study uses the "pop noise" caused by breathing to identify a live speaker based on two-channel input, where one channel is used to filter the pop noise and the other is used as a reference. The limitation is that the pop noise effect disappears over larger distances, and thus the method is only applicable to close-field situations. In [28, 37], the authors use the generalized cross-correlation (GCC) of the non-speech sections of a stereo signal to detect replay attacks. The idea is that loudspeakers tend to generate electromagnetic noise during the non-speech sections, and therefore the cross-correlation between the two channels will be higher for replayed samples than for genuine samples in those sections. The limitation is that, in order to make the electromagnetic noise detectable, a suitable background noise level and a high-fidelity microphone are required, conditions that are not always met in realistic settings.
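To make the notion of a cross-correlation-based spatial cue concrete, the following is a minimal NumPy sketch of estimating the delay between two channels with GCC. The PHAT weighting, the function name `gcc_phat_delay`, and the synthetic test signal are illustrative assumptions and are not taken from the cited works.

```python
import numpy as np

def gcc_phat_delay(sig, ref, max_lag):
    """Estimate the delay (in samples) of `sig` relative to `ref`.

    Assumption: PHAT weighting (cross-spectrum normalised to unit magnitude).
    A positive result means `sig` lags behind `ref`.
    """
    n = len(sig) + len(ref)                    # zero-pad to avoid circular wrap-around
    S = np.fft.rfft(sig, n=n)
    R = np.fft.rfft(ref, n=n)
    cross = S * np.conj(R)
    cross /= np.abs(cross) + 1e-12             # PHAT: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n=n)
    # Re-order so that index 0 corresponds to lag -max_lag.
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    return int(np.argmax(cc) - max_lag)

# Synthetic example: channel 2 is channel 1 delayed by 5 samples.
rng = np.random.default_rng(0)
ref = rng.standard_normal(1024)
sig = np.concatenate((np.zeros(5), ref[:-5]))
estimated_delay = gcc_phat_delay(sig, ref, max_lag=20)
```

Dividing the estimated lag by the sample rate gives the TDoA in seconds; hand-crafted detectors build their features on top of quantities of this kind.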
In summary, previous multi-channel replay attack detection methods: 1) have only been designed for two-channel audio input while modern microphone arrays usually have more microphones and contain richer spatial information; 2) rely on hand-crafted features and calibrating the features for a different microphone array system or a different environment can be difficult; and 3) have relatively few applicable scenarios due to the close-field or low SNR requirement.
To overcome the above-mentioned limitations, in this paper we propose a novel neural network-based replay attack detection model that has the following advantages. First, the proposed model is completely data-driven, i.e., no manual spectral or spatial feature engineering is needed, and the model can be used for inputs with any number of channels without knowing microphone array specifics (such as the array geometry) whenever training data is available. For the same reason, the proposed model can adapt to different environments using the training data and, therefore, there are no explicit constraints on usage scenarios. Second, all components (i.e., beamformer, feature extraction, and classification) are part of a neural network framework, which makes it easy to train them using existing large-scale optimization methods and to combine them with other neural-based countermeasure models. To our knowledge, this work is the first neural-based multi-channel replay attack detector. We perform experiments using the recently collected ReMASC corpus, which contains genuine and replayed samples recorded in a variety of environments using different microphone array systems. We find that by leveraging the multi-channel audio input, a significant performance improvement (up to 46.6%) can be achieved compared to a single-channel input model.
II The Multi-channel End-to-end Replay Attack Detection Network
One classic microphone array signal processing technique is the filter-and-sum beamformer. For multi-channel audio with channels $c \in \{1, \dots, C\}$, the filter-and-sum beamformer filters each audio channel $x_c$ using an $N$-point FIR filter $h_c$ with a delay (or advance) of a steering time difference of arrival (TDoA) $\tau_c$ to conduct the time alignment, and then sums the outputs of all channels to obtain the output $y$:

$$y[t] = \sum_{c=1}^{C} \sum_{n=0}^{N-1} h_c[n]\, x_c[t - n - \tau_c] \tag{1}$$

The filter-and-sum beamformer can be decomposed into two sub-processes: 1) estimating the steering delay $\tau_c$ and 2) finding the optimal filter $h_c$. The first sub-process can be done by using a separate time-delay estimation module. However, in order to implement the entire model in a neural network framework, we follow a previously described method that implicitly absorbs the steering delay into the filter parameters and uses a bank of $P$ filters $h_c^p$, $p \in \{1, \dots, P\}$, for each channel to capture different steering delays:

$$y^p[t] = \sum_{c=1}^{C} \sum_{n=0}^{N-1} h_c^p[n]\, x_c[t - n] \tag{2}$$

$$\phantom{y^p[t]} = \sum_{c=1}^{C} \left( x_c * h_c^p \right)[t] \tag{3}$$

where $*$ denotes the convolution operation. In the remainder of this paper, we refer to $h_c^p$ as a front-end filter to distinguish it from other filters in the neural network.
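The filter bank described above (each channel convolved with its own front-end filter, then summed over channels, once per filter in the bank) can be sketched in NumPy as follows. This is only an illustrative sketch: the function name `filter_and_sum`, the array layouts, and the random inputs are assumptions, and in the actual model the filters are learned rather than sampled.

```python
import numpy as np

def filter_and_sum(x, h):
    """Apply a bank of per-channel FIR filters and sum over channels.

    x: array of shape (C, T), one row per microphone channel.
    h: array of shape (P, C, N), P filters of length N for each channel.
    Returns y of shape (P, T - N + 1): convolution without padding.
    """
    n_channels, n_samples = x.shape
    n_filters, n_channels_h, filter_len = h.shape
    assert n_channels == n_channels_h, "filter bank and input disagree on C"
    y = np.zeros((n_filters, n_samples - filter_len + 1))
    for p in range(n_filters):
        for c in range(n_channels):
            # Each channel has its own filter; no weights are shared across channels.
            y[p] += np.convolve(x[c], h[p, c], mode="valid")
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 882))       # hypothetical 4-channel, 20 ms frame at 44.1 kHz
h = rng.standard_normal((8, 4, 630))    # hypothetical bank of 8 filters of length 630
y = filter_and_sum(x, h)                # shape (8, 253)
```

Because the steering delay is absorbed into the filter taps, no explicit time alignment appears in the code; different filters in the bank are free to encode different delays.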
Usually, an optimal filter is designed using a separate optimization objective (e.g., minimum variance distortionless response (MVDR) or multichannel Wiener filtering (MWF)), which is usually different from the objective of the actual learning task (e.g., word error rate for the speech recognition task). While this is acceptable for tasks such as speech recognition since low speech distortion and noise level are likely to improve the recognition accuracy, it might lead to the opposite effect for the replay attack detection task, because the filters might remove the useful cues in noisy or high-frequency components that the replay attack detector relies on. Therefore, a better strategy is to jointly optimize the beamformer with the replay attack detector. Note that our network design will also consider that replay attack detection is a sample-level classification task, i.e., there is only one label for each audio sample (either genuine or replayed) and thus using part of the input can be sufficient and will significantly lower the computational overhead.
We design the network based on the architecture presented in [30, 29], which was previously used for the speech recognition task. As shown in Figure 1, first, in order to lower the computational overhead, we use only the first 500 ms of each audio sample, split it into non-overlapping 20 ms frames (a frame length of 882 samples at a sample rate of 44.1 kHz, i.e., 25 frames), and feed each frame to the next layer. Then, we conduct the filter-and-sum beamforming as described in Equations 2 and 3 to obtain a 2-D time-frequency representation for each frame. We use a filter length of 630 and test different numbers of filters per channel in our experiments. Note that the filter-and-sum beamformer differs from the simpler delay-and-sum beamformer in that an independent weight is applied to each of the channels before summing them. Therefore, the filters for different channels do not share weights. Similar convolution layer designs have been widely adopted in image processing tasks to process the three-channel RGB input, but not for the purpose of beamforming. To lower the computational overhead, we do not apply padding for this convolution layer. After that, we conduct a global max-pooling in time, apply a ReLU nonlinear activation function to the beamformer output, and feed the result to a frequency convolution layer that consists of 256 filters of size 1×8 with a max-pooling of size 3. The pooled output is then fed to a 256-dimensional fully-connected layer, whose output is the representation of this frame. We apply the same operations to all frames and feed the resulting sequence of frame representations to three stacked LSTM layers, each with 832 hidden units, and feed the output of the last frame to a single fully-connected layer to obtain the prediction.
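The front-end steps described above (non-overlapping framing of the first 500 ms, convolution without padding, global max-pooling in time, and ReLU) can be traced with the following NumPy sketch. It is a shape-level illustration under stated assumptions: the helper names `frame_signal` and `front_end`, the random filters, and the small bank of 4 filters are hypothetical, and the later layers (frequency convolution, fully-connected layer, LSTMs) are omitted.

```python
import numpy as np

FRAME_LEN = 882     # 20 ms at 44.1 kHz
FILTER_LEN = 630    # front-end filter length
N_FRAMES = 25       # 500 ms / 20 ms

def frame_signal(x, frame_len=FRAME_LEN, n_frames=N_FRAMES):
    """x: (C, T) waveform -> (n_frames, C, frame_len) non-overlapping frames
    taken from the beginning of the recording."""
    n_channels, n_samples = x.shape
    needed = frame_len * n_frames
    assert n_samples >= needed, "recording shorter than 500 ms"
    return x[:, :needed].reshape(n_channels, n_frames, frame_len).transpose(1, 0, 2)

def front_end(frame, h):
    """frame: (C, frame_len); h: (P, C, filter_len).
    Returns a length-P vector: filter-and-sum without padding,
    then global max-pooling in time, then ReLU."""
    n_filters = h.shape[0]
    out = np.empty(n_filters)
    for p in range(n_filters):
        y = sum(np.convolve(frame[c], h[p, c], mode="valid")
                for c in range(frame.shape[0]))      # 882 - 630 + 1 = 253 samples
        out[p] = max(y.max(), 0.0)                    # max-pool in time, then ReLU
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 44100))                   # hypothetical 1 s, 2-channel clip
h = rng.standard_normal((4, 2, FILTER_LEN))           # hypothetical bank of 4 filters
frames = frame_signal(x)                              # (25, 2, 882)
features = np.stack([front_end(f, h) for f in frames])  # (25, 4)
```

Each frame is thus reduced to one value per front-end filter, so the per-frame cost is dominated by a single unpadded convolution of 253 output samples per filter and channel.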
The entire model is end-to-end from audio waveform to prediction and is trained jointly using a unified objective of minimizing the weighted cross-entropy loss. We re-weight the cross-entropy loss for each class using the normalized reciprocal of the number of samples of that class in the training set to mitigate the class-imbalance problem. To summarize, the proposed model has the following advantages: 1) It is completely data-driven, hence no manual feature engineering is needed, and the model can be used for inputs with any number of channels without knowing microphone array information such as the array geometry, whenever training data is available. 2) All components (beamformer, feature extraction, and classification) are in the neural network framework, which makes it easy to train using existing large-scale optimization methods. The intermediate tensor produced by the front-end is a standard time-frequency representation, so it is easy to further improve the model by combining it with other advanced neural network-based replay attack detectors. 3) By taking advantage of the fact that replay attack detection does not necessarily need to use the entire sequence, a few strategies such as non-overlapping framing and convolution without padding are used to speed up computation.
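The class re-weighting described above can be sketched as follows. This is a minimal NumPy illustration, not the training code: the class counts are hypothetical, and dividing the weighted sum by the total weight in the batch is an assumption about how the per-sample losses are averaged.

```python
import numpy as np

def class_weights(counts):
    """Normalized reciprocal of the per-class sample counts (weights sum to 1)."""
    inv = 1.0 / np.asarray(counts, dtype=float)
    return inv / inv.sum()

def weighted_cross_entropy(logits, labels, weights):
    """logits: (B, K); labels: (B,) integer class ids; weights: (K,).
    Assumption: the batch loss is the weight-normalized mean of per-sample losses."""
    shifted = logits - logits.max(axis=1, keepdims=True)           # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    per_sample = -log_probs[np.arange(len(labels)), labels]
    w = weights[labels]
    return float((w * per_sample).sum() / w.sum())

# Hypothetical training set: 1000 genuine (class 0) and 3000 replayed (class 1) samples.
weights = class_weights([1000, 3000])          # the rarer class gets the larger weight
logits = np.zeros((4, 2))                      # an uninformative classifier
labels = np.array([0, 1, 1, 1])
loss = weighted_cross_entropy(logits, labels, weights)
```

With these weights, each class contributes equally to the total loss in expectation, regardless of how many samples it has.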
|Device|Model|Sample Rate (Hz)|Bit Depth (bits)|# Channels|
|D2|Respeaker 4 Linear|44100|16|4|
We previously collected the ReMASC (Realistic Replay Attack Microphone Array Speech) corpus to facilitate experiments on microphone arrays. The ReMASC corpus differs from other publicly available voice anti-spoofing datasets (e.g., the RedDots replayed dataset) as follows. First, the ReMASC corpus contains recordings by a variety of microphone array systems instead of a single microphone. The microphone array geometry and the corresponding recording settings are shown in Figure 2 and Table I, respectively. Therefore, the ReMASC dataset is particularly well-suited for multi-channel voice anti-spoofing research.
Second, instead of using audio simulation tools, we recorded the ReMASC corpus in a variety of realistic usage scenarios and settings. Specifically, the dataset contains recordings of 132 voice commands from 50 subjects of both genders and with different ages and accents. The recordings were obtained in four different environments (two indoor, one outdoor, and one moving vehicle scenario) with varying types and levels of noise. The distance between speaker and device varies from 0.5 m to 6 m, and the recordings in the indoor environments use different placements of the loudspeaker and the microphone array. The replayed recordings are played using a variety of loudspeakers. Therefore, the data has a large degree of variety. More details about the ReMASC corpus can be found in the original ReMASC paper.
III-B Data Splitting and Training Scheme
The ReMASC corpus is split into the core set (6331 genuine and 17175 replayed samples) and the evaluation set (2118 genuine and 14162 replayed samples). The two sets are speaker-independent, i.e., no speaker appears in both. In addition, the evaluation set contains synthetic speech data. This configuration allows us to test the generalization capability of an anti-spoofing model.
In this study, we use the core set for training and development, and the evaluation set for testing. Specifically, we use 90% of the core set to train the model and the remaining 10% as the development set to choose the batch size and initial learning rate as well as to implement the early stopping strategy (i.e., stop training when the evaluation metric on the development set stops improving). We use a learning rate decay with a warm-up period, i.e., the learning rate starts at the initial learning rate and linearly increases to 10 times that value over the first 20 epochs (the warm-up period), and then drops by half every 20 epochs until the equal error rate (EER) stops improving on the development set or the maximum of 100 epochs is reached. We select a batch size of 64 and an initial learning rate of 1e-5 through a grid search. We use the Adam optimizer with an l2-norm regularizer [18, 25] with a weight decay coefficient of 1e-3. All experiments are repeated three times with different random seeds and the mean value is reported. We use EER as our evaluation metric. Since the four microphone arrays were mounted on a stand and recorded simultaneously during the data collection, the data volume recorded by each of the four microphone arrays is roughly 1/4 of the total data volume (D1 has less data due to hardware crashes during the data collection). Since the microphone arrays use completely different hardware and geometry, for our experiments, we train and evaluate the machine learning model separately for each microphone array by only using data collected by that array.
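The warm-up and decay schedule described above can be written as a small function of the epoch index. This is one possible reading, sketched under assumptions: epochs are counted from 0, the peak is reached exactly at epoch 20, and the first halving happens 20 epochs after the peak. The function name `learning_rate` and its parameter names are hypothetical.

```python
def learning_rate(epoch, base_lr=1e-5, warmup_epochs=20, peak_factor=10.0, decay_every=20):
    """Learning rate for a given 0-based epoch.

    Warm-up: linear increase from base_lr to peak_factor * base_lr.
    Afterwards: halve the rate every `decay_every` epochs (assumed step-wise).
    """
    if epoch < warmup_epochs:
        return base_lr * (1.0 + (peak_factor - 1.0) * epoch / warmup_epochs)
    n_halvings = (epoch - warmup_epochs) // decay_every
    return base_lr * peak_factor * 0.5 ** n_halvings

# Schedule over the maximum of 100 epochs (early stopping may end training sooner).
schedule = [learning_rate(e) for e in range(100)]
```

Under this reading the rate peaks at 1e-4 after the warm-up and falls to 1.25e-5 in the last 20 epochs, i.e., close to the initial value again.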
We compare the following models in our experiment:
III-C1 NN-Single
This model is exactly the same as the proposed model described in Section II except that only the first channel is fed to the neural network. By comparing this model with the proposed multi-channel model, we want to observe whether multi-channel audio processing can indeed lower the error rate.
III-C2 NN-Dummy Multichannel
This model has exactly the same architecture as the proposed model in Section II and also takes multi-channel input, except that we feed replications of the first channel as "dummy" multi-channel input to the neural network. We include this model in our experiment because comparing only NN-Single with the proposed multi-channel model is not entirely fair: when the number of input channels increases, the neural network architecture also changes. Specifically, the total number of front-end filters (i.e., the number of channels times the number of filters per channel) increases linearly with the number of input channels. Therefore, it is possible that the performance difference between NN-Single and the proposed multi-channel model is actually due to the increasing number of front-end filters instead of the spatial information in the multi-channel input. In contrast, the NN-Dummy Multichannel model has exactly the same neural network architecture and number of parameters as the proposed multi-channel model. The performance difference between the two models can therefore only be attributed to the difference in input, which lets us observe whether the spatial information in the multi-channel input lowers the error rate.
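A minimal NumPy sketch of how such a "dummy" multi-channel input could be built, together with the front-end weight count that motivates the control experiment. The helper names are hypothetical, and the example values (6 channels, 64 filters per channel, filter length 630) are used only for illustration.

```python
import numpy as np

def make_dummy_multichannel(x, n_channels):
    """x: (C, T) recording -> (n_channels, T) array in which every row
    is a copy of the first channel, so no spatial information remains."""
    return np.repeat(x[:1], n_channels, axis=0)

def n_front_end_weights(n_channels, filters_per_channel, filter_len):
    """Number of front-end filter weights; grows linearly with the channel count."""
    return n_channels * filters_per_channel * filter_len

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 1000))              # hypothetical 6-channel recording
dummy = make_dummy_multichannel(x, 6)           # same shape as x, identical rows

single_weights = n_front_end_weights(1, 64, 630)   # 40320
multi_weights = n_front_end_weights(6, 64, 630)    # 241920
```

The dummy input has the same shape as the real multi-channel input, so the network and its parameter count are unchanged while the inter-channel differences are removed.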
III-C3 NN-Multichannel
This is the proposed model described in Section II. For each microphone array, all channels are fed into the neural network.
III-D1 Model Comparison
The results of the model comparison are shown in Table II. In this section and in Section III-D2, we use a fixed number of 64 front-end filters per channel. Our key findings are as follows. First, the NN-Multichannel model clearly outperforms the NN-Single model, with an average relative EER improvement of 26.2%. For the microphone arrays recorded at 44100 Hz (D1, D2, D3), we observe that the more channels the microphone array has, the larger the relative EER improvement from leveraging the multi-channel input (e.g., the 2-mic D1 has a 6.6% lower EER, while the 6-mic D3 has a 40.2% lower EER). In contrast, NN-Dummy Multichannel performs about the same as NN-Single. Since NN-Multichannel and NN-Dummy Multichannel have exactly the same neural network architecture and the only difference is the input, this demonstrates that the performance improvement of NN-Multichannel is not due to its larger number of front-end filters, but due to effectively leveraging the spatial information in the multi-channel input.
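For readers unfamiliar with the metric, the following NumPy sketch shows one common way to compute an EER from detector scores and how a relative improvement between two EERs is obtained. The threshold sweep, the score convention (higher means more likely genuine), and the example numbers are assumptions for illustration and are not taken from the experiments.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """scores: higher = more likely genuine; labels: 1 genuine, 0 replayed.
    Returns the error rate at the threshold where the false acceptance and
    false rejection rates are closest (their average at that point)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    genuine = scores[labels == 1]
    replayed = scores[labels == 0]
    best_gap, eer = np.inf, 1.0
    for th in np.append(np.unique(scores), np.inf):
        far = np.mean(replayed >= th)    # replayed samples accepted as genuine
        frr = np.mean(genuine < th)      # genuine samples rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return float(eer)

def relative_improvement(eer_baseline, eer_new):
    """Relative EER reduction in percent."""
    return 100.0 * (eer_baseline - eer_new) / eer_baseline

eer_separable = equal_error_rate([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
eer_overlap = equal_error_rate([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
gain = relative_improvement(20.0, 15.0)   # hypothetical EERs of 20% and 15%
```

A relative improvement is therefore measured against the baseline EER, so the same absolute reduction counts for more when the baseline is already low.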
|Device||# Used Channels|
III-D2 Impact of the Number of Input Channels
In the previous section, we found that using all channels can significantly outperform using a single channel of the same audio. In this section, we further investigate the impact of the number of input channels. We gradually add input channels, from one channel up to all available channels of each microphone array, and measure the EER. The order in which we add the microphones is (microphones indexed as shown in Figure 2): D1: 1-2; D2: 1-4-2-3; D3: 1-4-2-5-3-6; D4: 1-4-2-5-3-6-7. The rule is that we add the microphone furthest from the previously added microphone. For D2, we further test a different order of 1-2-3-4. As shown in Table III, we have the following findings. First, while the results show that the EER drops with an increasing number of microphones, indicating that more microphones help improve the defense performance, the best performance of D3 (46.6% multi-channel improvement) and D4 (30.2% multi-channel improvement) is not achieved when all microphones are used, but when only 5 and 6 microphones are used, respectively. We believe a possible reason is that too many channels may increase the risk of model over-fitting. Second, we observe that the performance improves most significantly when the second microphone (the microphone that has the longest distance from the first microphone) is added for D1, D3, and D4, and that the performance gradually saturates with more microphones. In contrast, D2 gets the most significant performance improvement when the third microphone is added, and if we switch the order to 1-2-3-4, we find that the most significant performance improvement is achieved when the second microphone is added. This indicates that it is not always optimal to use a microphone array with a larger dimension. We intend to explore these phenomena further in our future research.
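The ordering rule (always add the remaining microphone furthest from the one added last) can be sketched as a simple greedy loop. The microphone coordinates below are hypothetical, chosen to represent an evenly spaced 4-microphone linear array; the real geometries are those in Figure 2, and the tie-breaking rule (lowest index wins) is an assumption.

```python
import numpy as np

def farthest_first_order(positions, start=0):
    """Greedy ordering: repeatedly pick the unused microphone that is furthest
    from the most recently added one. Returns 1-based microphone indices.
    Ties are broken in favour of the lowest index (an assumption)."""
    positions = np.asarray(positions, dtype=float)
    order = [start]
    remaining = [i for i in range(len(positions)) if i != start]
    while remaining:
        last = positions[order[-1]]
        dists = [np.linalg.norm(positions[i] - last) for i in remaining]
        nxt = remaining[int(np.argmax(dists))]
        order.append(nxt)
        remaining.remove(nxt)
    return [i + 1 for i in order]

# Hypothetical evenly spaced 4-microphone linear array (arbitrary units).
linear4 = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
order = farthest_first_order(linear4)
```

For this hypothetical linear layout the rule yields the order 1-4-2-3, which matches the order listed for the 4-channel linear array D2.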
|# Front-end Filters Per Channel|
|Mean EER (%)||23.4||20.6||21.0||19.7||19.6||19.5|
III-D3 Impact of the Number of Front-end Filters
In this experiment, we explore the impact of the number of front-end filters per channel. As shown in Table IV, we find that the model performance does improve with an increasing number of front-end filters, but eventually saturates, with the mean EER levelling off at about 19.5-19.7% in the last three columns of the table. The result is as expected, since more front-end filters can help extract spatial and spectral features from the raw waveform.
In this paper, we introduce a novel neural network-based replay attack detection model that leverages both the spectral and spatial information in multi-channel audio and significantly improves replay attack detection performance. Compared to previous efforts, the proposed model supports an arbitrary number of input channels and is completely data-driven within a neural network framework, which makes it easy to combine the proposed method with other neural-based anti-spoofing countermeasures.
- [1] (2018) Replay spoofing attack detection using deep neural networks. In 2018 26th Signal Processing and Communications Applications Conference (SIU), pp. 1–4.
- [2] (2008) Microphone array signal processing. Vol. 1, Springer Science & Business Media.
- [3] (2013) Microphone arrays: signal processing techniques and applications. Springer Science & Business Media.
- [4] (2017) Countermeasures for automatic speaker verification replay spoofing attack: on data augmentation, feature representation, classification and fusion. In INTERSPEECH, pp. 17–21.
- [5] (2016) Hidden voice commands. In USENIX Security Symposium, pp. 513–530.
- [6] (2017) ResNet and model fusion for automatic spoofing detection. In INTERSPEECH, pp. 102–106.
- [7] (2014) Your voice assistant is mine: how to abuse speakers to steal information and control your phone. In Proc. of the 4th ACM Workshop on Security and Privacy in Smartphones & Mobile Devices, pp. 63–74.
- [8] Real-time adversarial attacks. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pp. 4672–4680.
- [9] (2017) Crafting adversarial examples for speech paralinguistics applications. arXiv preprint arXiv:1711.03280.
- [10] (2018) An overview of vulnerabilities of voice controlled systems. arXiv preprint arXiv:1803.09156.
- [11] (2018) Protecting voice controlled systems using sound source identification based on acoustic cues. In 2018 27th International Conference on Computer Communication and Networks (ICCCN), pp. 1–9.
- [12] (2019) ReMASC: realistic replay attack corpus for voice controlled systems. In Proc. Interspeech 2019, pp. 2355–2359.
- [13] (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
- [14] (2017) Spoof detection using source, instantaneous frequency and cepstral features. In INTERSPEECH, pp. 22–26.
- [15] (2018) Exploration of compressed ILPR features for replay attack detection. In Proc. Interspeech 2018, pp. 631–635.
- [16] (2020) Advances in anti-spoofing: from the perspective of ASVspoof challenges. APSIPA Transactions on Signal and Information Processing 9.
- [17] (2018) Effectiveness of speech demodulation-based features for replay detection. In Proc. Interspeech 2018, pp. 641–645.
- [18] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- [19] (2017) ASVspoof 2017: automatic speaker verification spoofing and countermeasures challenge evaluation plan. Training 10 (1508), pp. 1508.
- [20] (2017) RedDots replayed: a new replay spoofing attack corpus for text-dependent speaker verification research. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5395–5399.
- [21] Audio replay attack detection with deep learning frameworks. In Interspeech, pp. 82–86.
- [22] (2017) The insecurity of home digital voice assistants: Amazon Alexa as a case study. arXiv preprint arXiv:1712.03327.
- [23] (2017) A study on replay attack and anti-spoofing for automatic speaker verification. arXiv preprint arXiv:1706.02101.
- [24] (2019) Adversarial attacks on spoofing countermeasures of automatic speaker verification. arXiv preprint arXiv:1910.08716.
- [25] (2017) Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
- [26] (2010) Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814.
- [27] (2019) Energy separation-based instantaneous frequency estimation for cochlear cepstral feature for replay spoof detection. In Proc. Interspeech 2019, pp. 2898–2902.
- [28] Improving replay attack detection by combination of spatial and spectral features.
- [29] (2015) Convolutional, long short-term memory, fully connected deep neural networks. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4580–4584.
- [30] Multichannel signal processing with deep neural networks for automatic speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 25 (5), pp. 965–979.
- [31] (2015) Voice liveness detection algorithms based on pop noise caused by human breath for automatic speaker verification. In Sixteenth Annual Conference of the International Speech Communication Association.
- [32] (2017) Constant Q cepstral coefficients: a spoofing countermeasure for automatic speaker verification. Computer Speech & Language 45, pp. 516–535.
- [33] (2019) ASVspoof 2019: future horizons in spoofed and fake audio detection. arXiv preprint arXiv:1904.05441.
- [34] (1988) Beamforming: a versatile approach to spatial filtering. IEEE ASSP Magazine 5 (2), pp. 4–24.
- [35] (2017) Feature selection based on CQCCs for automatic speaker verification spoofing. In INTERSPEECH, pp. 32–36.
- [36] (2017) Audio replay attack detection using high-frequency features. In INTERSPEECH, pp. 27–31.
- [37] (2019) Replay attack detection using generalized cross-correlation of stereo signal. In 2019 27th European Signal Processing Conference (EUSIPCO), pp. 1–5.
- [38] (2018) Feature with complementarity of statistics and principal information for spoofing detection. In Proc. Interspeech 2018, pp. 651–655.
- [39] (2016) VoiceLive: a phoneme localization based liveness detection for voice authentication on smartphones. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 1080–1091.