AECMOS: A speech quality assessment metric for echo impairment

10/06/2021 · Marju Purin, et al.

Traditionally, the quality of acoustic echo cancellers is evaluated using intrusive speech quality assessment measures such as ERLE <cit.> and PESQ <cit.>, or by carrying out subjective laboratory tests. Unfortunately, the former are not well correlated with human subjective measures, while the latter are time and resource consuming to carry out. We provide a new tool for speech quality assessment for echo impairment which can be used to evaluate the performance of acoustic echo cancellers. More precisely, we develop a neural network model to evaluate call quality degradations in two separate categories: echo and degradations from other sources. We show that our model is accurate as measured by correlation with human subjective quality ratings. Our tool can be used effectively to stack rank echo cancellation models. AECMOS is being made publicly available as an Azure service.


1 Introduction

Acoustic echo arises when a near end microphone picks up the near end loudspeaker signal and a far end user hears their own voice. The presence of acoustic echo is a top complaint in user ratings of audio call quality.

Acoustic echo cancellers (AECs) significantly improve audio call quality by canceling echo. This is achieved by comparing the received far end signal to the microphone signal, accounting for delay and signal distortions, and subtracting the resulting echo estimate from the microphone signal before transmission. The goal is to transmit only a clean near end signal.
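
To make this signal flow concrete, here is a minimal sketch of a classical echo canceller based on a normalized least mean squares (NLMS) adaptive filter; the function name and parameters are illustrative only and are not tied to any specific AEC discussed in this paper.

```python
import numpy as np

def nlms_echo_canceller(mic, far_end, filter_len=512, mu=0.5, eps=1e-8):
    """Minimal NLMS sketch: estimate the echo path from the far end signal
    and subtract the echo estimate from the microphone signal."""
    w = np.zeros(filter_len)                  # adaptive FIR estimate of the echo path
    out = np.zeros_like(mic, dtype=np.float64)
    for n in range(filter_len, len(mic)):
        x = far_end[n - filter_len:n][::-1]   # most recent far end samples
        echo_est = np.dot(w, x)               # predicted echo at the microphone
        e = mic[n] - echo_est                 # residual = near end speech + residual echo
        w += mu * e * x / (np.dot(x, x) + eps)  # normalized LMS weight update
        out[n] = e
    return out                                # signal transmitted to the far end
```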

Recently, there has been substantial innovation in AECs. The rise of deep learning has led to models that outperform their classical counterparts [29, 26, 9, 33, 12, 23]. Hybrid models that combine classical and deep learning methods, such as adaptive filters paired with recurrent neural networks (RNNs) [25, 23], have also shown strong results [31, 35, 8]. With the rapid advancements in the field of echo cancellation, it becomes ever more critical to objectively measure and compare the performance of AECs.

We propose a speech quality assessment metric for evaluating echo impairment that overcomes the drawbacks of conventional methods. Our model, called AECMOS, directly predicts human subjective ratings of call echo impairment. It can be used to evaluate the end-to-end performance of AECs and to rank different AEC methods based on (degradation) mean opinion score (MOS) estimates with high accuracy.

Our model architecture is a deep neural network comprising convolutional layers, GRU (gated recurrent unit) layers, and dense layers. AECMOS is trained on ground truth human ratings obtained following guidance from ITU-T Rec. P.831 [16], ITU-T Rec. P.832 [20], and ITU-T Rec. P.808 [15], as described in [6].

AECMOS is an effective aid in the research and development of AECs. We provide AECMOS as an Azure service for other researchers to use as well. The details of the API are at https://github.com/microsoft/AEC-Challenge/tree/main/AECMOS. We have already received access requests ranging from university researchers to large companies. For examples of AECMOS (also known as DECMOS) being used in AEC model development, we refer the reader to [25, 29], which include winners of the INTERSPEECH 2021 Acoustic Echo Cancellation Challenge [8].

AECMOS does not require a clean speech reference for the near or far end, nor a quiet environment. A clean reference is typically not available in non-artificial scenarios, so the test set we use for evaluating AECMOS can be far more realistic and representative than artificial test conditions. Our test set was selected from over 5000 different scenarios, each with different room acoustics, devices, and human speakers. AECMOS can also be used to monitor the quality of real customer calls; it is not restricted to lab or development use, but has operational utility.

2 Related work

Common methods of evaluating AEC models [34, 10, 11, 28] include intrusive objective measures such as echo return loss enhancement (ERLE) [19] and perceptual evaluation of speech quality (PESQ) [21]. ERLE can only be measured with access to both the echo signal and the processed echo signal, in a quiet environment without near end speech. ERLE can be approximately measured by:

$$\mathrm{ERLE} = 10 \log_{10} \frac{\mathbb{E}\left[y^{2}(n)\right]}{\mathbb{E}\left[e^{2}(n)\right]} \qquad (1)$$

where $y(n)$ is the microphone recording of the far end signal (with no echo suppression) and $e(n)$ is the residual echo after cancellation. PESQ requires a clean speech reference in addition to the degraded speech.
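
As a concrete illustration, the sketch below computes ERLE in dB from the two signals defined above; the function name is ours and not part of any standard toolkit.

```python
import numpy as np

def erle_db(mic_far_end, residual_echo, eps=1e-12):
    """Echo return loss enhancement in dB (Eq. 1): ratio of the energy of the
    unsuppressed far end echo at the microphone to the residual echo energy."""
    num = np.mean(np.asarray(mic_far_end, dtype=np.float64) ** 2)
    den = np.mean(np.asarray(residual_echo, dtype=np.float64) ** 2)
    return 10.0 * np.log10((num + eps) / (den + eps))
```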

Unfortunately, metrics such as ERLE and PESQ are often not well correlated with human subjective ratings of call echo degradation quality [6]. This is especially true in the presence of background noise or double talk [4, 6]. Carrying out laboratory tests following standards such as ITU-T Rec. P.831 [16], while more accurate, is expensive, time consuming, and not a scalable solution.

There are objective standards that help characterize AEC performance. IEEE 1329 [14] defines metrics such as terminal coupling loss for single talk (TCLwst) and double talk (TCLwdt), which are measured in anechoic chambers. TIA 920 [32] uses many of these metrics but also defines required criteria. ITU-T Rec. G.122 [17] defines AEC stability metrics, and ITU-T Rec. G.131 [18] provides a useful relationship between acceptable Talker Echo Loudness Rating and one-way delay time. ITU-T Rec. G.168 [19] provides a comprehensive set of AEC metrics and criteria. However, it is not clear how to combine these dozens of metrics into a single metric, or how well they correlate with subjective quality.

Commercially available objective metrics include EQUEST [2], which measures single talk echo performance. EQUEST achieves a Pearson correlation coefficient (PCC) of 0.8 with subjective quality when evaluated on 150 wideband test conditions. An objective metric for double talk echo performance is ACOPT 32 [3], which implements the 3GPP standard TS 26.132 [1].

3 Data

We use supervised learning to train AECMOS. Each example in the dataset consists of three audio signals: the near end microphone signal, the far end signal, and the output of an echo canceller, which we call the enhanced signal. The label for a given example is its (degradation) Mean Opinion Score (MOS) obtained from crowdsourcing as described in [6].

In generating the data set, we distinguished between single talk and double talk scenarios. For the near end single talk case, we asked for the overall quality [15]. For far end single talk, we asked for echo ratings [16]. For double talk, we asked for ratings of both echo annoyance and other degradations in two separate questions (Question 1: "How would you judge the degradation from the echo?"; Question 2: "How would you judge other degradations (noise, missing audio, distortions, cut-outs)?"). All impairments were rated on the degradation category scale (from 1: Very annoying to 5: Imperceptible). The ratings were then used to obtain a MOS label for each example. For near end single talk the echo label, and for far end single talk the degradation label, were set to fixed values. We found that including ratings for double talk non-echo degradations improves the correlation between experts and naive raters [6].
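
As a concrete illustration of the labeling, the sketch below averages per-clip crowd votes into the two labels; the function and vote values are hypothetical, and the fixed labels used for the single talk scenarios are not shown.

```python
import numpy as np

def clip_labels(scenario, echo_votes=None, other_votes=None):
    """Sketch: turn 1-5 degradation category votes for one clip into the two labels.
    Scenarios where a question was not asked use a fixed label (value not shown here)."""
    echo_dmos = float(np.mean(echo_votes)) if echo_votes is not None else None
    other_mos = float(np.mean(other_votes)) if other_votes is not None else None
    return echo_dmos, other_mos

# Example: a double talk clip rated by five crowd workers on both questions.
print(clip_labels("doubletalk", echo_votes=[4, 5, 4, 3, 4], other_votes=[3, 3, 4, 3, 2]))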

The model was trained on samples of varying duration, for a total of 145.8 hours of data. The training data included 17 submitted models with a total of 14K audio clips from the ICASSP 2021 AEC Challenge [31]. The training data covers three scenarios: near end single talk, far end single talk, and double talk.

The test set consisted of submissions to the INTERSPEECH 2021 AEC Challenge [7, 8], with each contestant submitting their output for double talk, far end single talk, and near end single talk examples. In addition, we included the enhanced signals of several of our own deep models along with digital signal processing based models. In this manner, we obtained enhanced speech signals from a total of 22 models. The ground truth labels were gathered via crowdsourcing using the tool described in [6], with multiple votes collected per clip.

4 Model

Advancements in deep learning have shown great potential in various speech enhancement tasks [33, 13, 28], including for speech quality assessment [27, 30]. Here we propose a deep neural network to replace humans as call quality raters for speech impairment due to echo and other degradations.

4.1 Features

The AECMOS model takes three signals as input: the near end microphone signal, the far end signal, and the output of an echo canceller, also referred to as the enhanced signal. The task is to evaluate the quality of the enhanced signal with regard to echo impairment on a human subjective rating scale. Observe that the very nature of acoustic echo necessitates a comparison of signals: knowledge of both the near end microphone signal and the far end signal is needed to determine whether the enhanced signal contains echo as opposed to, for example, some background noise. This is in contrast to evaluating noise suppression quality, which can be done successfully without reference signals [27].

The input signals are processed by computing the short-time Fourier transform (STFT) with a DFT window size of 512 and a hop fraction of 0.5 (see Section 4.2), and then taking the log power of the result.

In addition to the three input signals, AECMOS also takes an optional scenario marker as part of the input. This marker encodes which of the three scenarios we are in: near end single talk, far end single talk, or double talk. For online deployment, the marker is not used. For offline AEC model evaluations, when scenario information is readily available, activating the scenario marker improves model performance, as shown in Table 4. We found it effective to prepend a one-hot vector of a fixed length to the three model input signals, indicating which signal(s) should be considered active for a particular sample.
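
A hedged sketch of this feature pipeline is given below; the window function, the marker length, and the exact one-hot encoding of active signals are assumptions, and the three signals are assumed to be time-aligned and of equal length.

```python
import numpy as np

# Assumed encoding of which of (near end speech, far end speech) is active per scenario.
ACTIVE = {"nearend_st": (1, 0), "farend_st": (0, 1), "doubletalk": (1, 1)}

def log_power_spec(x, n_fft=512, hop=256, eps=1e-9):
    """Log power spectrogram with shape (frames, n_fft // 2 + 1)."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    spec = np.abs(np.fft.rfft(np.stack(frames), axis=-1)) ** 2
    return np.log10(spec + eps)

def aecmos_features(nearend_mic, farend, enhanced, scenario=None, marker_len=8):
    """Stack the three log power spectrograms and optionally prepend a one-hot
    scenario marker along the time axis (marker length and encoding are assumptions)."""
    feats = np.stack([log_power_spec(s) for s in (nearend_mic, farend, enhanced)])
    if scenario is not None:
        marker = np.zeros((3, marker_len, feats.shape[-1]))
        near_active, far_active = ACTIVE[scenario]
        marker[0, :, :] = near_active      # near end microphone channel
        marker[1, :, :] = far_active       # far end channel
        feats = np.concatenate([marker, feats], axis=1)
    return feats                           # shape: (3, frames [+ marker_len], 257)
```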

We used micro augmentations during training to increase model stability. By this, we mean imperceptible data augmentations such as removing the initial 10 ms of the near end microphone signal or changing the energy level by 0.5 dB.
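
A minimal sketch of these micro augmentations, assuming 16 kHz audio:

```python
import numpy as np

def micro_augment(x, sample_rate=16000, rng=None):
    """Apply one imperceptible augmentation: drop the first 10 ms or shift the level by 0.5 dB."""
    rng = rng or np.random.default_rng()
    if rng.random() < 0.5:
        return x[int(0.01 * sample_rate):]   # remove the initial 10 ms
    gain_db = rng.choice([-0.5, 0.5])        # small level change
    return x * 10.0 ** (gain_db / 20.0)
```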

Details about the data set and the crowdsourcing approach for obtaining the ground truth labels appear in Section 3 and [6, 7, 27].

4.2 Architecture

For the model architecture, we explored different configurations of convolutional models. A significant boost in MOS prediction correlation with the ground truth labels came from incorporating GRU layers into the model. Table 1 shows the architecture of the best performing model. During development, we observed that the double talk scenario posed the hardest challenge for all models. As shown in Table 4, incorporating GRU layers improved model performance most significantly for double talk.

The input to the model is a stack of three log power spectrograms obtained from the near end, far end, and enhanced signals. The spectrograms were computed with a DFT size of 512 and a hop size of 256 over clips sampled at 16 kHz. For an 8 second clip, this results in an input of three spectrograms of roughly 500 time frames by 257 frequency bins each. AECMOS handles variable length inputs natively.
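
The input size follows directly from these settings; the short calculation below, under one common non-padded framing convention, illustrates the resulting shape.

```python
sample_rate, n_fft, hop = 16000, 512, 256
clip_seconds = 8

num_samples = clip_seconds * sample_rate        # 128000 samples
num_frames = 1 + (num_samples - n_fft) // hop   # 499 frames (exact count depends on padding)
num_bins = n_fft // 2 + 1                       # 257 frequency bins
print((3, num_frames, num_bins))                # three stacked log power spectrograms
```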

Layer Output Dimensions
Input:
Conv: 32,   , LeakyReLU
MaxPool: (2 x 2), Dropout(0.4)
Conv: 64,   , LeakyReLU
MaxPool: (2 x 2), Dropout(0.4)
Conv: 64,   , LeakyReLU
MaxPool: (2 x 2), Dropout(0.4)
Conv: 128,   , LeakyReLU
MaxPool: (2 x 2), Dropout(0.4)
Global MaxPool (1, 128)
Bidirectional GRU: 128, NumLayers 2, HiddenUnits 64, Dropout(0.2)
Dense: 64,   LeakyReLU   Dropout(0.4)
Dense: 64,   LeakyReLU   Dropout(0.4)
Dense: 2,   1 + 4* sigmoid
Table 1: AECMOS architecture
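
A hedged PyTorch sketch of the Table 1 architecture is shown below; the kernel sizes, the frequency-axis global pooling, and the time pooling after the GRU are assumptions, since the table does not fully specify them.

```python
import torch
import torch.nn as nn

class AECMOSNet(nn.Module):
    """Sketch of the Table 1 architecture (kernel sizes and pooling details assumed)."""
    def __init__(self):
        super().__init__()
        def block(cin, cout):
            return nn.Sequential(
                nn.Conv2d(cin, cout, kernel_size=3, padding=1), nn.LeakyReLU(),
                nn.MaxPool2d((2, 2)), nn.Dropout(0.4))
        self.conv = nn.Sequential(block(3, 32), block(32, 64), block(64, 64), block(64, 128))
        self.gru = nn.GRU(input_size=128, hidden_size=64, num_layers=2,
                          batch_first=True, bidirectional=True, dropout=0.2)
        self.head = nn.Sequential(
            nn.Linear(128, 64), nn.LeakyReLU(), nn.Dropout(0.4),
            nn.Linear(64, 64), nn.LeakyReLU(), nn.Dropout(0.4),
            nn.Linear(64, 2))

    def forward(self, x):                 # x: (batch, 3, frames, freq_bins)
        h = self.conv(x)                  # (batch, 128, frames/16, bins/16)
        h = h.amax(dim=3)                 # global max pool over frequency (assumption)
        h = h.transpose(1, 2)             # (batch, time, 128) sequence for the GRU
        out, _ = self.gru(h)              # (batch, time, 128) bidirectional features
        out = out.amax(dim=1)             # pool over time to handle variable length (assumption)
        return 1.0 + 4.0 * torch.sigmoid(self.head(out))   # two MOS outputs on [1, 5]
```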

The model was trained on GPUs with a batch size of 10 per GPU. We used the Adam optimizer [22] and an MSE loss function, training until the loss saturated.
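
A minimal training loop consistent with this setup might look as follows; the learning rate, epoch count, and data loader are assumptions.

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=50, lr=1e-4, device="cuda"):
    """Minimal training loop: Adam + MSE against the (echo, other) MOS labels."""
    model = model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=lr)   # learning rate is an assumption
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for feats, mos in loader:                        # mos: (batch, 2) echo/other labels
            pred = model(feats.to(device))
            loss = loss_fn(pred, mos.to(device))
            opt.zero_grad()
            loss.backward()
            opt.step()
```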

The model outputs two MOS predictions on a scale of 1 to 5: one for echo and one for other degradations. These outputs correspond to the ground truth labels described in Section 3.

5 Experiments

We evaluate the accuracy of our model by measuring the correlation between the predictions of our AECMOS and the ground truth human ratings.

More precisely, we used the enhanced signals from a total of 22 models. For each model, the outputs were produced for 300 double talk, 300 far end single talk, and 200 near end single talk examples. See Section 3 for more details. For each audio clip, we obtained both the objective MOS predicted by our model and the crowdsourced MOS score.

5.1 Results

In order to evaluate our model, we calculate the PCC for the AECMOS predictions versus the ground truth MOS labels.

For each far end single talk example, we also calculate the ERLE and PESQ score.
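
A minimal sketch of the per-clip correlation computation, assuming arrays of per-clip AECMOS predictions and crowdsourced MOS labels:

```python
import numpy as np
from scipy.stats import pearsonr

def per_clip_pcc(predicted_mos, human_mos):
    """Pearson correlation between per-clip AECMOS predictions and human MOS labels."""
    r, _ = pearsonr(np.asarray(predicted_mos), np.asarray(human_mos))
    return r
```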

Figure 1: Far end single talk, per clip: AECMOS

As seen in Table 2, in the far end single talk scenario, AECMOS outperforms both ERLE and PESQ.

For the near end single talk scenario, we can compare AECMOS with the DNSMOS model, which was developed for evaluating noise suppression models [27]. AECMOS has a more difficult task than DNSMOS: it must evaluate both echo and other degradations, and do so independently of each other. Nonetheless, we believe that AECMOS has very good potential to improve in the near end single talk category. DNSMOS was trained on considerably more audio clips [27]: AECMOS saw only about half as many clips in total for training, and only about a quarter as many were near end single talk clips. With this in mind, AECMOS performance is very promising.

                    AECMOS   DNSMOS   ERLE    PESQ
ST far end DMOS      0.847      –     0.541   0.710
ST near end MOS      0.611    0.640     –       –
DT Echo DMOS         0.582      –       –       –
DT Other DMOS        0.751      –       –       –
Table 2: Per clip PCC for AECMOS and other commonly used metrics: DNSMOS, ERLE, PESQ. Dashes mark metrics that do not apply to a scenario.

As expected, the most challenging scenario to evaluate is double talk. Here the model needs to evaluate two separate qualities, echo and other degradations, simultaneously yet independently of each other. Measuring and improving double talk performance is important, since the inability to interrupt other speakers has been shown to impair meeting inclusiveness and participation rates [5].

For evaluating the stack ranking of different echo cancellers, we compute the average of the human ratings across the entire test set for each model, and the same average for the AECMOS ratings. We then calculate Spearman's rank correlation coefficient (SRCC) between the two. The results are given in Table 3. We report an SRCC of 0.969 and a PCC of 0.996 in the far end single talk scenario, which is the most common scenario for echo cancellation. We note that the best performing submitted models were very close to each other in the contest; so close, in fact, that the human MOS rankings were within each other's error bars [8].
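
A sketch of this stack-ranking computation: average both the human MOS and the AECMOS predictions over each model's test clips, then correlate the per-model averages. The data layout is an assumption.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def per_model_rank_correlation(clips):
    """clips: list of (model_id, human_mos, aecmos) tuples over the whole test set."""
    models = sorted({m for m, _, _ in clips})
    human_means = [np.mean([h for m, h, _ in clips if m == mid]) for mid in models]
    aecmos_means = [np.mean([a for m, _, a in clips if m == mid]) for mid in models]
    srcc, _ = spearmanr(human_means, aecmos_means)
    pcc, _ = pearsonr(human_means, aecmos_means)
    return pcc, srcc
```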

PCC SRCC
All Scenarios Echo DMOS 0.981 0.970
All Scenarios (Other) MOS 0.902 0.954
ST far end Echo DMOS 0.996 0.969
ST near end MOS 0.923 0.831
DT Echo DMOS 0.898 0.863
DT Other DMOS 0.927 0.955
Table 3: AECMOS per contestant PCC and SRCC. All Scenarios refers to far end single talk and double talk for Echo DMOS, and to near end single talk and double talk for (Other) MOS.
Figure 2: Per contestant echo degradation (far end single talk and double talk): MOS vs. AECMOS
Figure 3: Per contestant other MOS (near end single talk and double talk): MOS vs. AECMOS

6 Ablation Study

During the development of the model, we experimented with various architectures, features, and training methods. In this section, we give an overview of some of the key findings.

We started with a baseline model consisting of convolutional layers followed by dense layers. The input to all models was a stack of three features: the log power STFT of the near end, far end, and enhanced signals. All models were trained using micro augmentations as described in Section 4.

The first improvement in our model's performance came from including scenario labels as part of the model input. While exploring avenues for model development, we conducted a small experiment in which we incorporated label information in the loss function and asked the model to predict scenario labels in addition to the MOS labels. Curiously, the model predicted the far end single talk scenario far more often than it actually occurred in the labels, with the remaining labels being double talk. Introducing an optional scenario marker for offline use improved model performance, as described in Table 4. More discussion of input features can be found in Section 4.1.

The second improvement came about when we were investigating how to improve double talk performance. Experiments with the convolution kernel size led us to believe that our model was having difficulty incorporating information along the temporal axis. Introducing GRU layers into the model remedied this issue. In adding the new layer, we needed to remove a convolutional layer so that we would not be downsampling too aggressively before reaching the GRU layer.

The effects of the aforementioned key improvements are summarized in Table 4.

Baseline + scenario + GRU
All Scenarios Echo DMOS 0.732 0.746 0.797
All Scenarios (Other) MOS 0.735 0.775 0.802
ST far end Echo DMOS 0.780 0.825 0.847
ST near end MOS 0.434 0.534 0.611
DT Echo DMOS 0.458 0.422 0.582
DT Other DMOS 0.577 0.657 0.751
Table 4: Per Clip Pearson Correlation Coefficients: Baseline Convolutional Model; add scenario markers to model input; remove a convolution layer and add a GRU layer to obtain AECMOS.

We also experimented with using the log power of the Mel spectrogram as model input features. The Mel spectrogram corresponds well to human auditory perception and has been used successfully in evaluating noise suppression [27]. Our best model that takes Mel spectrogram features as input uses 160 Mel bins. Across different settings, we found consistent behavior: the Mel models achieve lower correlation scores for echo and higher scores for other degradations. This matches our intuition, as detecting the presence of echo is less dependent on human subjective experience than classifying a sound as noise.
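
A hedged sketch of the Mel feature variant using librosa, with 160 Mel bins as stated in the text; the FFT size and hop length are assumed to match the linear-spectrogram configuration.

```python
import librosa

def log_mel_features(x, sr=16000, n_mels=160, n_fft=512, hop=256):
    """Log power Mel spectrogram with 160 Mel bins (FFT and hop settings assumed)."""
    mel = librosa.feature.melspectrogram(y=x, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=n_mels, power=2.0)
    return librosa.power_to_db(mel)   # log power, shape: (160, frames)
```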

AECMOS AECMOS Mel
All Scenarios Echo DMOS 0.797 0.742
All Scenarios (Other) MOS 0.802 0.819
ST far end Echo DMOS 0.847 0.739
ST near end MOS 0.611 0.604
DT Echo DMOS 0.582 0.553
DT Other DMOS 0.751 0.772
Table 5: Per Clip Pearson Correlation Coefficients: AECMOS; AECMOS trained with Mel spectrogram features.

Finally, we experimented with the self-teaching paradigm [27] and training with bias-aware [24] loss. Interestingly, neither provided significant improvements in PCC.

7 Improvements due to growing training data

We conjecture that our model's performance improves with more training data. In particular, we observed that increasing our training data from 64K audio clips to 108K clips, testing on 5K clips, gives a model that is just as accurate at inference whether or not it uses the scenario label information. The difference in correlations was negligible across all scenarios. However, since this additional data was used for training, we could not validate this claim as thoroughly as the results in the previous sections of this paper.

8 Conclusions

Our AECMOS model provides a speech quality assessment metric that is accurate, expedient, and scalable. It can be used to stack rank echo cancellers with very good accuracy and thereby accelerate research in echo cancellation. In the future, we would like to further improve the model by increasing the number of ratings per clip, exploring additional data augmentations, and learning custom filter banks.

References

  • [1] (2020) 3GPP TS 26.132: Speech and video telephony terminal acoustic test specification. Cited by: §2.
  • [2] H. Acoustics (2012) EQUEST: Echo Quality Evaluation of Speech in Telecommunication. Cited by: §2.
  • [3] H. Acoustics (2015) ACOPT 32 (Code 6859) Speech Based Double Talk. Cited by: §2.
  • [4] A. R. Avila, H. Gamper, C. Reddy, R. Cutler, I. Tashev, and J. Gehrke (2019) Non-intrusive speech quality assessment using neural networks. In ICASSP, Vol. , pp. 631–635. Cited by: §2.
  • [5] R. Cutler, Y. Hosseinkashi, J. Pool, S. Filipi, R. Aichner, Y. Tu, and J. Gehrke (2021) Meeting effectiveness and inclusiveness in remote collaboration. Proceedings of the ACM on Human-Computer Interaction 5 (CSCW1), pp. 1–29. Cited by: §5.1.
  • [6] R. Cutler, B. Naderi, M. Loide, S. Sootla, and A. Saabas (2020) Crowdsourcing approach for subjective evaluation of echo impairment. In ICASSP, Cited by: AECMOS: A speech quality assessment metric for echo impairment, §1, §2, §3, §3, §3, §4.1.
  • [7] R. Cutler, A. Saabas, T. Parnamaa, M. Loide, R. Aichner, H. Braun, S. Srinivasan, and K. Sorensen (2021) INTERSPEECH 2021 Acoustic Echo Cancellation Challenge. In INTERSPEECH, Cited by: §3, §4.1.
  • [8] R. Cutler, A. Saabas, T. Parnamaa, M. Loide, S. Sootla, M. Purin, H. Gamper, S. Braun, K. Sorensen, R. Aichner, et al. (2021) INTERSPEECH 2021 acoustic echo cancellation challenge. In INTERSPEECH, Cited by: §1, §1, §3, §5.1.
  • [9] A. Fazel, M. El-Khamy, and J. Lee (2020) CAD-AEC: Context-aware deep acoustic echo cancellation. In ICASSP, Vol. , pp. 6919–6923. Cited by: §1.
  • [10] A. Fazel, M. El-Khamy, and J. Lee (2019) Deep multitask acoustic echo cancellation. In INTERSPEECH, Cited by: §2.
  • [11] A. Guerin, G. Faucon, and R. Le Bouquin-Jeannès (2003) Nonlinear acoustic echo cancellation based on Volterra filters. IEEE Transactions on Speech and Audio Processing 11 (6), pp. 672–683. Cited by: §2.
  • [12] M. M. Halimeh and W. Kellermann (2020) Efficient multichannel nonlinear acoustic echo cancellation based on a cooperative strategy. In ICASSP, Vol. , pp. 461–465. Cited by: §1.
  • [13] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al. (2012) Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Processing Magazine 29 (6), pp. 82–97. Cited by: §4.
  • [14] (2010) IEEE 1329-2010 Standard method for measuring transmission performance of handsfree telephone sets. Cited by: §2.
  • [15] (1998) ITU-T P.808 Supplement 23: ITU-T coded-speech database. Supplement 23 to ITU-T P-series Recommendations (previously CCITT Recommendations). Cited by: §1, §3.
  • [16] (1998) ITU-T P.831 Subjective performance evaluation of network echo cancellers ITU-T P-series recommendations. Cited by: AECMOS: A speech quality assessment metric for echo impairment, §1, §2, §3.
  • [17] (2012-02) ITU-T recommendation G.122: influence of national systems on stability and talker echo in international connections. Cited by: §2.
  • [18] ITU-T Recommendation G.131 (2003) Talker echo and its control. International Telecommunication Union, Geneva. Cited by: §2.
  • [19] (2012-02) ITU-T recommendation G.168: digital network echo cancellers. Cited by: AECMOS: A speech quality assessment metric for echo impairment, §2, §2.
  • [20] ITU-T Recommendation P.832 (2000) Subjective performance evaluation of hands-free terminals. International Telecommunication Union, Geneva. Cited by: AECMOS: A speech quality assessment metric for echo impairment, §1.
  • [21] (2001-02) ITU-T recommendation P.862: perceptual evaluation of speech quality (PESQ): an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs. Cited by: AECMOS: A speech quality assessment metric for echo impairment, §2.
  • [22] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. Proceedings of the 3rd International Conference on Learning Representations (ICLR). Cited by: §4.2.
  • [23] L. Ma, H. Huang, P. Zhao, and T. Su (2020) Acoustic echo cancellation by combining adaptive digital filter and recurrent neural network. arXiv preprint arXiv:2005.09237. Cited by: §1.
  • [24] G. Mittag, S. Zadtootaghaj, T. Michael, B. Naderi, and S. Möller (2021) Bias-aware loss for training image and speech quality prediction models from multiple datasets. In 2021 13th International Conference on Quality of Multimedia Experience (QoMEX), Vol. , pp. 97–102. External Links: Document Cited by: §6.
  • [25] R. Peng, L. Cheng, C. Zheng, and X. Li (2021) Acoustic Echo Cancellation Using Deep Complex Neural Network with Nonlinear Magnitude Compression and Phase Information. In INTERSPEECH, pp. 4768–4772. External Links: Document Cited by: §1, §1.
  • [26] L. Pfeifenberger, M. Zoehrer, and F. Pernkopf (2021) Acoustic Echo Cancellation with Cross-Domain Learning. In INTERSPEECH, pp. 4753–4757. Cited by: §1.
  • [27] C. K. A. Reddy, V. Gopal, and R. Cutler (2021) DNSMOS: A Non-Intrusive Perceptual Objective Speech Quality metric to evaluate Noise Suppressors. In ICASSP, Cited by: §4.1, §4.1, §4, §5.1, §6, §6.
  • [28] J. F. Santos and T. H. Falk (2018) Speech dereverberation with context-aware recurrent neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing 26 (7), pp. 1236–1246. Cited by: §2, §4.
  • [29] E. Seidel, J. Franzen, M. Strake, and T. Fingscheidt (2021) Y-Net FCRN for acoustic echo and noise suppression. In INTERSPEECH, Cited by: §1, §1.
  • [30] J. Serrà, J. Pons, and S. Pascual (2021) SESQA: semi-supervised learning for speech quality assessment. In INTERSPEECH, Cited by: §4.
  • [31] K. Sridhar, R. Cutler, A. Saabas, T. Parnamaa, M. Loide, H. Gamper, S. Braun, R. Aichner, and S. Srinivasan (2021) ICASSP 2021 acoustic echo cancellation challenge: datasets, testing framework, and results. In ICASSP, pp. 151–155. Cited by: §1, §3.
  • [32] (2002-12) TIA-920: transmission requirements for wideband digital wireline telephones. Cited by: §2.
  • [33] J. Valin, S. Tenneti, K. Helwani, U. Isik, and A. Krishnaswamy (2020) Low-complexity, real-time joint neural echo control and speech enhancement based on PercepNet. In ICASSP, Cited by: §1, §4.
  • [34] H. Zhang and D. Wang (2018) Deep learning for acoustic echo cancellation in noisy and double-talk scenarios. In INTERSPEECH, Cited by: §2.
  • [35] S. Zhang, Y. Kong, S. Lv, Y. Hu, and L. Xie (2021) F-T-LSTM based complex network for joint acoustic echo cancellation and speech enhancement. arXiv preprint arXiv:2106.07577. Cited by: §1.