Acoustic echo arises when the sound played by a device's loudspeaker is picked up by the microphone of the same device. This phenomenon is commonplace in communications, entertainment, man-machine interaction, and elsewhere. It may be useful in some scenarios, such as entertainment. In most cases, however, especially in voice interaction and communication, it is an interference and should be cancelled from the speech of interest.
Since a reference signal representing the source of the echo is available, adaptive filters are commonly employed for acoustic echo cancellation (AEC). Many adaptive algorithms are available, such as least mean square (LMS), normalized LMS (NLMS), and block LMS (BLMS). Each has its own merits and applications. To obtain good performance, filter lengths of several hundred and sometimes thousands of taps are required. Because implementing the BLMS algorithm with the fast Fourier transform (FFT) significantly reduces the computational load, the frequency domain block adaptive filter (FDBAF) based on the LMS algorithm is considered the most suitable. Moreover, to deal with the long block delay and the large quantization error in the FFT, a more flexible frequency domain adaptive filter structure, called the multidelay block frequency domain (MDF) adaptive filter, was proposed. Further, for robust echo cancellation, methods that adjust the learning rate according to conditions such as double-talk and echo path change were also proposed. In brief, many adaptive-filter algorithms exist for AEC, and they give good performance.
Unfortunately, some residual echo remains after adaptive filtering. Although in most cases its amplitude is much smaller than that of the speech, it can still be perceived by the human ear and makes communication annoying. This residual echo includes a linear residue, introduced by the mismatch between the estimated and the actual echo path, and a non-linear residue, mostly caused by non-linear components in the audio devices. The linear residue can be reduced with elaborate structures and methods such as [9, 10, 11, 12], but the non-linear residue remains difficult to suppress. Although some non-linear processing (NLP) methods have been proposed, they are complicated and can be inefficient for suppression [13, 14]. Moreover, these NLP methods can damage the speech. In addition, other methods such as non-linear filtering and model-based estimation are also used for non-linear echo cancellation.
Comparing the spectrum of the residual echo with that of the speech shows that the residue can be treated as a type of noise. In addition, the far-end reference signal is related to the residue and can help to suppress it. Inspired by this, a combination scheme that concatenates an adaptive filter and a neural network is proposed in this paper. The echo-corrupted speech is first processed by an MDF filter with an adaptive learning rate to cancel the primary echo. Thereafter, a neural network with a simple and clear structure is designed and trained for residual echo suppression. The method is compared with other prevailing methods in terms of echo return loss enhancement (ERLE), logarithmic spectral distance (LSD), response time (RT), and model size.
2 Algorithm Structure
2.1 Combination Scheme
The scheme combining the adaptive filter and the neural network is depicted in Fig. 1. The adaptive filter cancels the linear echo introduced by the multi-path propagation, i.e., the room impulse response (RIR). It has been shown to perform well at low complexity. The weighting coefficients of the finite impulse response (FIR) filter are adjusted over time to estimate the RIR, which yields an estimated replica of the echo signal. However, non-linear components in the device, such as a loudspeaker with poor linearity, introduce non-linear echo. This cannot be cancelled by adaptive filtering with an FIR structure, which results in residual echo. As depicted in Fig. 2, the residual echo after adaptive filtering is small in amplitude compared with the speech. It can be considered a special type of noise. Meanwhile, this noise is related to the far-end reference signal. Based on these observations, a neural network is designed and specifically trained to suppress this residual echo, as illustrated in Fig. 1.
2.2 Adaptive Filter
Because the multidelay block frequency domain adaptive filter has numerous merits, such as low memory usage, a small FFT size, and the freedom to choose different configurations depending on the hardware used, it is employed in the combination scheme for linear echo cancellation. Moreover, as in the open-source Speex implementation [19, 20], the learning rate of the adaptive filter is varied according to conditions such as double-talk and echo path change. In this way, the linear echo can be cancelled effectively and adaptively.
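The complex NLMS filter of length $N$ is defined as
$$ e(n) = d(n) - \hat{y}(n) = d(n) - \sum_{k=0}^{N-1} \hat{w}_k(n)\, x(n-k) $$
with the adaptation step
$$ \hat{w}_k(n+1) = \hat{w}_k(n) + \mu\, \frac{e(n)\, x^{*}(n-k)}{\sum_{i=0}^{N-1} |x(n-i)|^2} $$
where $x(n)$ is the far-end signal, $d(n)$ is the received microphone signal, $\hat{y}(n)$ is the echo estimated by the adaptive filter, $e(n)$ is the corresponding estimation error, $w_k(n)$ are the filter weights at time $n$ and $\hat{w}_k(n)$ are their estimates, and $\mu$ is the learning rate.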
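As an illustration of the NLMS recursion, the following minimal Python sketch runs a real-valued, time-domain NLMS filter on a synthetic echo. It is not the MDF implementation used in this paper: the function name `nlms`, the tap count, the step size, and the three-tap echo path are assumptions chosen for the example. For real-valued signals the complex conjugate in the update reduces to the sample itself.

```python
import random


def nlms(far_end, mic, num_taps=8, mu=0.5, eps=1e-8):
    """Real-valued time-domain NLMS; returns (error signal, final weights)."""
    w = [0.0] * num_taps        # estimated filter weights
    x_buf = [0.0] * num_taps    # x(n), x(n-1), ..., x(n-N+1)
    errors = []
    for x_n, d_n in zip(far_end, mic):
        x_buf = [x_n] + x_buf[:-1]
        y_hat = sum(wk * xk for wk, xk in zip(w, x_buf))   # estimated echo
        e_n = d_n - y_hat                                   # estimation error
        norm = sum(xk * xk for xk in x_buf) + eps           # input power
        w = [wk + mu * e_n * xk / norm for wk, xk in zip(w, x_buf)]
        errors.append(e_n)
    return errors, w


# Synthetic test: the microphone hears only the echo of a white far-end signal.
random.seed(0)
true_path = [0.6, -0.3, 0.1]    # hypothetical echo path, not from the paper
far_end = [random.uniform(-1.0, 1.0) for _ in range(2000)]
mic = [sum(h * far_end[n - k] for k, h in enumerate(true_path) if n - k >= 0)
       for n in range(len(far_end))]
errors, w = nlms(far_end, mic)
```

In this noise-free, purely linear setting the weights converge to the echo path and the error decays towards zero; with a non-linear loudspeaker or with double-talk this would no longer hold, which is the motivation for the later stages of the scheme.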
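To respond quickly to double-talk and to prevent the filter from diverging when double-talk starts, the learning rate is updated by
$$ \mu(k,\ell) = \hat{\eta}(\ell)\, \frac{|\hat{Y}(k,\ell)|^2}{|E(k,\ell)|^2} $$
where $\hat{Y}(k,\ell)$ and $E(k,\ell)$ are the frequency domain counterparts of the estimated echo $\hat{y}(n)$ and the error $e(n)$, $k$ is the frequency index, and $\ell$ is the frame index. $\hat{\eta}(\ell)$ is the estimated leakage coefficient, which represents the misadjustment of the filter. It is equal to the linear regression coefficient between the estimated echo power $P_Y(k,\ell)$ and the output power $P_E(k,\ell)$:
$$ \hat{\eta}(\ell) = \frac{\sum_k R_{EY}(k,\ell)}{\sum_k R_{YY}(k,\ell)} $$
where the correlations $R_{EY}(k,\ell)$ and $R_{YY}(k,\ell)$ are averaged recursively as:
$$ R_{EY}(k,\ell) = (1-\beta(\ell))\, R_{EY}(k,\ell-1) + \beta(\ell)\, P_Y(k,\ell)\, P_E(k,\ell) $$
$$ R_{YY}(k,\ell) = (1-\beta(\ell))\, R_{YY}(k,\ell-1) + \beta(\ell)\, P_Y(k,\ell)^2 $$
$$ \beta(\ell) = \beta_0 \min\left( \frac{\sigma_{\hat{Y}}^2(\ell)}{\sigma_{e}^2(\ell)},\ 1 \right) $$
where $\beta_0$ is the base learning rate for the leakage estimate and $\sigma_{\hat{Y}}^2(\ell)$ and $\sigma_{e}^2(\ell)$ are the total powers of the estimated echo and the output signal. The variable averaging parameter $\beta(\ell)$ prevents the estimate from being adapted when no echo is present.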
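The per-frame update of the leakage estimate and the learning rate can be sketched as follows. This is a minimal illustration rather than the Speex code: the function name `update_learning_rate`, the `state` dictionary, and the values of `beta0` and `mu_max` are assumptions, and clipping the learning rate at `mu_max` is added here only to keep the step size bounded.

```python
def update_learning_rate(Y_hat, E, state, beta0=0.005, mu_max=0.5, eps=1e-12):
    """One frame of the leakage-based learning-rate update.

    Y_hat, E : lists of complex spectra (estimated echo, error) for one frame.
    state    : dict with per-bin lists 'R_EY' and 'R_YY' carried across frames.
    Returns (learning rate per bin, leakage estimate).
    """
    P_Y = [abs(y) ** 2 for y in Y_hat]      # estimated echo power per bin
    P_E = [abs(e) ** 2 for e in E]          # output (error) power per bin
    # Averaging parameter: shrinks to zero when no echo is present.
    beta = beta0 * min(sum(P_Y) / (sum(P_E) + eps), 1.0)
    state["R_EY"] = [(1 - beta) * r + beta * py * pe
                     for r, py, pe in zip(state["R_EY"], P_Y, P_E)]
    state["R_YY"] = [(1 - beta) * r + beta * py * py
                     for r, py in zip(state["R_YY"], P_Y)]
    eta = sum(state["R_EY"]) / (sum(state["R_YY"]) + eps)   # leakage estimate
    mu = [min(eta * py / (pe + eps), mu_max) for py, pe in zip(P_Y, P_E)]
    return mu, eta


# Hypothetical four-bin frame with echo present.
state = {"R_EY": [0.0] * 4, "R_YY": [0.0] * 4}
mu, eta = update_learning_rate([1 + 0j] * 4, [0.5 + 0j] * 4, state)
```

When the estimated echo is zero, `beta` becomes zero, so the correlations and therefore the leakage estimate are left unchanged, which mirrors the behaviour described for the variable averaging parameter.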
However, because of the non-linear components in the device, non-linear residual echo arises at the output of the adaptive filter. Moreover, linear residual echo is introduced if the estimated RIR and the actual one are mismatched. Together these result in considerable residual echo, as depicted in Fig. 2. The residual echo becomes more severe as the non-linearity of the device and the estimation error of the RIR increase.
2.3 Neural Network
2.3.1 Network Structure
Each recurrent neural network (RNN) module in the network is realized by a gated recurrent unit (GRU) for data memory and network calculation. The structure follows the functional architecture of conventional echo cancellation and includes three functional modules: double-talk detection, echo estimation, and echo cancellation. Double-talk detection monitors the far-end and near-end signals in real time, and echo suppression is carried out only when a far-end signal is detected. In that case, the residual echo is estimated from the signal after adaptive filtering. Echo cancellation estimates the subband gains and rapidly changes the level of each frequency band in order to attenuate the echo while letting the speech pass through. Subband gains are used because they keep the model very simple, requiring computation for only a few bands. In addition, they do not produce so-called musical noise artifacts.
To reduce the number of neurons and thus the model size, neither the samples nor the spectrum are used directly. Instead, frequency bands on the Bark scale, which matches human perception, are employed. A total of 22 subbands are used, giving 22 Bark-frequency cepstral coefficients (BFCC). In addition, the first-order and second-order differences of the first six BFCC features, the discrete cosine transform (DCT) of the first six pitch correlation coefficients, and two dynamic features, i.e., the pitch period and a spectral non-stationarity metric, are extracted. This results in 42 features in total (22 + 12 + 6 + 2), which form the input of the residual echo suppression neural network.
Double-talk detection (DTD). After adaptive filtering, only the speech signal together with the residual echo remains. Since the amplitude of the residual echo after adaptive filtering is small, the voice activity of the speech can be easily detected. Meanwhile, the voice activity of the far-end reference signal can also be easily detected because the signal is clean. Two voice activity detectors (VADs), one per channel, can therefore be implemented independently, which reduces the difficulty of DTD.
Residual echo estimation. A gated recurrent unit (GRU) module, a type of recurrent neural network (RNN), estimates the residual echo from the features of the reference signal, the output signal of the adaptive filter, and the DTD results. Owing to the memory of the RNN model, the residual echo can be estimated better than with other models.
Residual echo suppression. A GRU module followed by a dense layer is used for echo suppression by calculating subband gains. A gain approaches zero if the VAD of the near-end, i.e., the output of adaptive filtering, is zero, and approaches one if the VAD of the far-end reference signal is zero. Otherwise, a fractional value is estimated that represents the ratio between the speech and the speech superimposed by residual echo.
Since the network calculates only band gains, they cannot be applied directly to each frequency. Therefore, linear interpolation between bands is required to obtain the per-frequency gains, as illustrated in Fig. 4. It can be formulated as
$$ r_k(m) = \left(1 - \frac{m}{M_k}\right) g_k + \frac{m}{M_k}\, g_{k+1} $$
where $r_k(m)$ is the gain of the $m$-th frequency in the $k$-th band, $g_k$ and $g_{k+1}$ are the band gains of the $k$-th and the $(k+1)$-th bands, and $M_k$ denotes the band length of the $k$-th band.
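The interpolation from band gains to per-frequency gains can be sketched in a few lines of Python. The function name `interpolate_band_gains` and the `band_edges` layout (each band gain anchored at the first bin of its band, with the first edge at bin 0) are assumptions made for this example; the actual band layout used in the paper is not specified here.

```python
def interpolate_band_gains(band_gains, band_edges):
    """Linearly interpolate per-band gains to per-bin gains.

    band_edges[k] is the first frequency bin of band k (band_edges[0] == 0),
    so band k has length band_edges[k + 1] - band_edges[k].
    band_gains must have the same length as band_edges.
    """
    bin_gains = [0.0] * (band_edges[-1] + 1)
    for k in range(len(band_edges) - 1):
        start = band_edges[k]
        length = band_edges[k + 1] - band_edges[k]
        for m in range(length):
            frac = m / length   # position of bin m inside band k
            bin_gains[start + m] = (1 - frac) * band_gains[k] + frac * band_gains[k + 1]
    bin_gains[band_edges[-1]] = band_gains[-1]   # last anchor bin
    return bin_gains


# Three hypothetical band gains anchored at bins 0, 4 and 6.
gains = interpolate_band_gains([0.0, 1.0, 0.5], [0, 4, 6])
# gains == [0.0, 0.25, 0.5, 0.75, 1.0, 0.75, 0.5]
```

The resulting per-bin gains would then multiply the spectrum of the adaptive filter output before the inverse transform.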
2.3.2 Training of Double-Talk Detection
The training data can be either manually annotated or simulated. Manually labelled data is obtained by listening for whether and where audio is present at the far end or the near end. The audio file is recorded by a device that captures the audio played by the source superimposed on the audio played by the device itself. However, this method is time consuming. Therefore, simulated data is used for training. The procedure is illustrated in Fig. 5 and summarized as follows:
Far-end data preparation. The far-end data is the reference signal for echo cancellation, i.e., the audio transmitted in the reference channel before it is played out by the loudspeaker of the device. This reference signal is framed and windowed, and the energy of each frame is calculated. The energy is compared with two thresholds: the frame is labelled “1” if the energy is above the higher threshold, “0” if it is below the lower threshold, and “0.5” otherwise. These labels are calculated frame by frame, represent the probability that audio is present, and are stored together with the feature vectors.
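The two-threshold labelling can be sketched as follows. The function name `frame_energy_labels`, the frame length, the hop size, the Hann window, and both threshold values are assumptions chosen for illustration; the paper does not state the values it uses.

```python
import math


def frame_energy_labels(signal, frame_len=160, hop=80, low=1e-4, high=1e-2):
    """Return one soft activity label (0.0, 0.5 or 1.0) per windowed frame."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]          # Hann window
    labels = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energy = sum((s * w) ** 2 for s, w in zip(frame, window)) / frame_len
        if energy > high:
            labels.append(1.0)      # audio clearly present
        elif energy < low:
            labels.append(0.0)      # audio clearly absent
        else:
            labels.append(0.5)      # uncertain frame
    return labels


# Hypothetical reference signal: 20 ms of silence, then a 440 Hz tone at 16 kHz.
silence = [0.0] * 320
tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(320)]
labels = frame_energy_labels(silence + tone)
```

Frames that lie entirely in the silent part receive the label 0.0 and frames fully covered by the tone receive 1.0; in practice the thresholds would need to be tuned to the level of the recordings.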
Near-end data preparation. Here, the near-end data is the signal after adaptive filtering, in which most of the echo, especially the linear echo, has been cancelled. The echo signal is obtained by convolving the reference signal with the RIRs. This echo signal is mixed with a clean speech file to simulate the signal received by the microphone. The microphone signal is then processed by adaptive filtering, which yields clean speech mixed with residual echo, i.e., the near-end data for training. The labels indicating whether clean speech is present can be obtained by directly calculating the energy of the clean speech and comparing it with the thresholds. Notably, since the amplitude of the residual echo is relatively small compared with that of the clean speech, the labels can also be obtained by directly calculating the signal energy after adaptive filtering. Similarly, the feature vector corresponding to each frame is calculated.
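The first half of this simulation, from the reference signal to the microphone signal, can be sketched as below. The helper names `convolve` and `simulate_microphone`, the `echo_gain` parameter, and the toy signals are assumptions for illustration; the sketch stops before the adaptive filtering step that produces the actual near-end training data.

```python
def convolve(signal, rir):
    """Linear convolution of signal with rir, truncated to len(signal)."""
    out = [0.0] * len(signal)
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(rir):
            if n - k < 0:
                break
            acc += h * signal[n - k]
        out[n] = acc
    return out


def simulate_microphone(reference, rir, clean_speech, echo_gain=1.0):
    """Mix clean near-end speech with the echo of the far-end reference."""
    echo = convolve(reference, rir)
    return [s + echo_gain * e for s, e in zip(clean_speech, echo)]


# Toy example: an impulse as reference, a two-tap RIR, constant "speech".
reference = [1.0, 0.0, 0.0, 0.0]
rir = [0.5, 0.25]
clean = [0.1, 0.1, 0.1, 0.1]
mic = simulate_microphone(reference, rir, clean)
# mic is approximately [0.6, 0.35, 0.1, 0.1]
```

Varying the RIR and `echo_gain` for the same speech and reference files is one way to generate many training combinations from a limited amount of source audio.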
Training process. Since the labels for the two channels can be obtained directly by comparing the frame energy with the thresholds, voice detection can be implemented separately for each channel with its own VAD. With the feature vectors and their labels, the VAD module for each channel can be trained without much difficulty.
2.3.3 Training of Residual Echo Suppression
The aim of the residual echo suppression network is to calculate the band gains; its training process is depicted in Fig. 6. The far-end and near-end data are prepared in the same way as described above, except for the band-gain labels. These are obtained by calculating the band energy of the clean speech, denoted by $E_s(b)$, and that of the residual signal after adaptive filtering, denoted by $E_y(b)$, and then dividing them band by band, i.e., $g_b = E_s(b) / E_y(b)$. The feature vectors of the two channels are the same as described above.
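The band-by-band division that yields the gain labels can be sketched as below. The function name `band_gain_labels` is hypothetical, and two details are assumptions not stated in the text: a small constant guards against division by zero, and the ratio is clipped to the range from 0 to 1 so that it can serve as a bounded training target.

```python
def band_gain_labels(clean_band_energy, residual_band_energy, eps=1e-12):
    """Per-band ratio of clean-speech energy to filtered-signal energy."""
    labels = []
    for e_clean, e_res in zip(clean_band_energy, residual_band_energy):
        ratio = e_clean / (e_res + eps)           # band-by-band division
        labels.append(min(max(ratio, 0.0), 1.0))  # clip to [0, 1] (assumption)
    return labels


# Hypothetical energies for three bands of one frame.
labels = band_gain_labels([1.0, 0.0, 2.0], [2.0, 0.5, 1.0])
# labels is approximately [0.5, 0.0, 1.0]
```

A band that contains only residual echo gets a label near zero, and a band dominated by clean speech gets a label near one, which matches the behaviour expected from the suppression gains.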
3 Performance Evaluation
The model structure is shown in Fig. 3. A total of 10 hours of speech and 5 hours of echo data are constructed, resulting in 20 hours of training data through various combinations of gains and filters. During training, three objectives are learned: the VAD of the speech signal, the VAD of the reference signal, and the band gains for suppression. As shown in Fig. 7, both the training loss and the validation loss decrease gradually and approach zero, indicating that the model has been trained well.
a) Band Gain. An audio clip consisting of a sequence of wake-up words, interfered with by text-to-speech (TTS) audio played by the device itself, is used for the measurements. The resulting VADs and band gains are depicted in Fig. 8. The band gains approach zero if the reference signal is detected at the moment the wake-up word appears. Since the energy of the residual echo is concentrated in the low bands, the band gains for suppression are lower in the low bands than in the higher bands.
b) Wave Observation. To evaluate the performance, the methods of prevailing open-source implementations are used for comparison. Fig. 9 shows that the residual echo after the proposed RNN algorithm is suppressed much more than with Speex and WebRTC. This is most obvious in the speech gaps, where only residual echo exists. It can also be seen that the high-band spectrum is cut off after WebRTC AEC, which may be caused by the non-linear processing (NLP) in that algorithm.
c) Performance Comparisons. The ERLE, representing the echo suppression performance, and the LSD, representing the spectral loss of the voice caused by AEC, are evaluated and listed in Table 1. Since the AEC module is mostly implemented on devices, the response time (RT), measured on the same platform and representing the processing speed, and the model size, representing the algorithm complexity, should also be considered. The proposed scheme obtains a higher ERLE with comparable spectral loss, at the cost of a longer processing time. Although the model size of the proposed scheme is larger, the VAD model for the reference channel can be trimmed because the reference signal is clean speech. Meanwhile, the intermediate VAD results used for echo estimation in the model structure can likely be pruned. Both measures could reduce the model size.
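Table 1. Performance comparison of Speex, WebRTC, and the proposed scheme.
|Method|ERLE|LSD|RT|Model size|
|---|---|---|---|---|
|Speex|25 dB|1.01 dB|0.42 ms/frame|106 kb|
|WebRTC|40 dB|1.66 dB|0.45 ms/frame|140 kb|
|Proposed|68 dB|1.18 dB|1.63 ms/frame|450 kb|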
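For reference, the two quality metrics in the table can be computed along the following lines. The paper does not spell out its exact formulas, so this sketch uses the commonly used definitions as an assumption: ERLE as the power ratio between the microphone signal and the residual over echo-only segments, and LSD as the root-mean-square difference of log power spectra averaged over frames. The function names and the toy inputs are hypothetical.

```python
import math


def erle_db(mic, residual, eps=1e-12):
    """ERLE in dB: microphone power over residual power (echo-only segment)."""
    p_mic = sum(d * d for d in mic) / len(mic)
    p_res = sum(e * e for e in residual) / len(residual)
    return 10.0 * math.log10((p_mic + eps) / (p_res + eps))


def lsd_db(ref_power_frames, est_power_frames, eps=1e-12):
    """Log-spectral distance in dB, averaged over frames of power spectra."""
    total = 0.0
    for ref, est in zip(ref_power_frames, est_power_frames):
        sq = [(10.0 * math.log10((r + eps) / (e + eps))) ** 2
              for r, e in zip(ref, est)]
        total += math.sqrt(sum(sq) / len(sq))
    return total / len(ref_power_frames)


# Toy example: the residual has one tenth of the microphone amplitude.
mic_signal = [1.0, -1.0] * 50
residual = [0.1, -0.1] * 50
erle = erle_db(mic_signal, residual)          # about 20 dB
lsd = lsd_db([[1.0, 1.0]], [[10.0, 10.0]])    # about 10 dB
```

A higher ERLE indicates stronger echo suppression, while a lower LSD indicates less distortion of the near-end speech spectrum.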
4 Conclusion
A combination scheme that concatenates an adaptive filter and a neural network is proposed for acoustic echo cancellation. Adaptive filtering cancels most of the echo, especially the linear echo, leaving only a small residual echo. The spectrum of this residue differs considerably from that of the speech, so it can be considered a special type of noise. The residue is therefore suppressed with a neural network. Experiments show that the proposed scheme achieves better echo suppression with comparable spectral loss and a response time of 1.63 ms per frame.
References
- [1] E. Hänsler, G. Schmidt, “Topics in acoustic echo and noise control: selected methods for the cancellation of acoustical echoes, the reduction of background noise, and speech processing,” Springer Berlin Heidelberg, 2006.
- [2] J. Benesty, T. Gänsler, “Advances in network and acoustic echo cancellation,” Springer, 2001.
- [3] E. Ferrara, “Fast implementations of LMS adaptive filters,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 474–475, 1980.
- [4] R. Tyagi, R. Singh and R. Tiwari, “The performance study of NLMS algorithm for acoustic echo cancellation,” in International Conference on Information, Communication, Instrumentation and Control, ICICIC, 2017, pp. 1–5, Indore.
- [5] G. A. Clark, S. K. Mitra, and S. R. Parker, “Block implementation of adaptive digital filters,” IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP–29, pp. 744–752, June 1981.
- [6] J. M. Páez Borrallo, M. G. Otero, “On the implementation of a partitioned block frequency domain adaptive filter (PBFDAF) for long acoustic echo cancellation,” Signal Processing, vol. 27, no. 3, pp. 301–315, 1992.
- [7] J. S. Soo, K. K. Pang, “Multidelay block frequency domain adaptive filter,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 38, no. 2, pp. 373–376, 1990.
- [8] J. Valin, “On adjusting the learning rate in frequency domain echo cancellation with double-talk,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 3, pp. 1030–1034, 2007.
- [9] Z. Yuan and X. Songtao, “Application of new LMS adaptive filtering algorithm with variable step size in adaptive echo cancellation,” in IEEE International Conference on Communication Technology, ICCT, 2017, pp. 1715–1719.
- [10] J. Benesty, H. Rey, L. R. Vega and S. Tressens, “A nonparametric VSS NLMS algorithm,” IEEE Signal Processing Letters, vol. 13, no. 10, pp. 581–584, 2006.
- [11] C. Paleologu, S. Ciochina and J. Benesty, “Double-talk robust VSS-NLMS algorithm for under-modeling acoustic echo cancellation,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2008, pp. 245–248.
- [12] M. A. Iqbal and S. L. Grant, “Novel variable step size NLMS algorithms for echo cancellation,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2008, pp. 241–244.
- [13] O. Tanrikulu and K. Dogancay, “A new non-linear processor (NLP) for background continuity in echo control,” in IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, 2003, pp. V–588.
- [14] M. Doroslovacki, “Optimal non-linear processor control for residual-echo suppression,” in IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, 2003, pp. V–608.
- [15] B. Panda, A. Kar and M. Chandra, “Non-linear adaptive echo supression algorithms: A technical survey,” in International Conference on Communication and Signal Processing, ICCSP, 2014, pp. 076–080.
- [16] M. Z. Ikram, “Non-linear acoustic echo cancellation using cascaded Kalman filtering,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2014, pp. 1320–1324.
- [17] M. I. Mossi, C. Yemdji, N. Evans, et al., “Robust and low-cost cascaded non-linear acoustic echo cancellation,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2011, pp. 89–92.
- [18] J. Mourjopoulos, “On the variation and invertibility of room impulse response functions,” Journal of Sound and Vibration, vol. 102, no. 2, pp. 217–228, 1985.
- [19] J. M. Valin, “Speex: A free codec for free speech,” 2016.
- [20] P. Srivastava, K. Babu and T. Osv, “Performance evaluation of Speex audio codec for wireless communication networks,” in International Conference on Wireless and Optical Communications Networks, WOCN, 2011, pp. 1–5.
- [21] J.-M. Valin, “A hybrid DSP/deep learning approach to real-time full-band speech enhancement,” in IEEE International Workshop on Multimedia Signal Processing, MMSP, 2018, pp. 1–5.