MetricGAN: Generative Adversarial Networks based Black-box Metric Scores Optimization for Speech Enhancement

05/13/2019, by Szu-Wei Fu, et al.

Adversarial loss in a conditional generative adversarial network (GAN) is not designed to directly optimize evaluation metrics of a target task, and thus, may not always guide the generator in a GAN to generate data with improved metric scores. To overcome this issue, we propose a novel MetricGAN approach with an aim to optimize the generator with respect to one or multiple evaluation metrics. Moreover, based on MetricGAN, the metric scores of the generated data can also be arbitrarily specified by users. We tested the proposed MetricGAN on a speech enhancement task, which is particularly suitable to verify the proposed approach because there are multiple metrics measuring different aspects of speech signals. Moreover, these metrics are generally complex and cannot be fully optimized by Lp or conventional adversarial losses.

1 Introduction

Generative adversarial networks (GANs) (Goodfellow et al., 2014) have shown powerful generative ability in many different applications. In particular, for conditional GANs (CGANs) (Mirza & Osindero, 2014), in addition to the adversarial loss, there is an Lp loss to guide the learning of the generator. Ideally, the adversarial loss should make the generated data indistinguishable from the real (target) data. However, some applications in image (Ledig et al., 2017; Wang et al., 2018) and speech processing (Pandey & Wang, 2018; Wang & Chen, 2018; Donahue et al., 2018; Michelsanti & Tan, 2017) show that this loss term provides only marginal improvement (and sometimes even degrades the performance) in terms of objective evaluation scores (in the case of image processing, the subjective score can be improved). For instance, Donahue et al. (2018) applied a CGAN to speech enhancement (SE) for automatic speech recognition (ASR); however, they reached the following conclusion: "Our experiments indicate that, for ASR, simpler regression approaches may be preferable to GAN-based enhancement." This may be because the criterion the discriminator uses to judge whether each sample is real or fake is not fully related to the metrics that we consider. In other words, similar to the Lp loss, the way the adversarial loss guides the generator is still not matched to the evaluation metrics. We call this problem the discriminator-evaluation mismatch (DEM). In this study, we propose a novel MetricGAN to solve this problem. We test the proposed approach on the SE task because the metrics for SE are generally complex and difficult to optimize or adjust directly.

For human perception, the primary goal of SE is to improve the intelligibility and quality of noisy speech (Benesty et al., 2005). To evaluate an SE model from different aspects, several objective metrics have been proposed. Among the human perception-related objective metrics, the perceptual evaluation of speech quality (PESQ) (Rix et al., 2001) and short-time objective intelligibility (STOI) (Taal et al., 2011) are two popular functions to evaluate speech quality and intelligibility, respectively. The design of these two metrics considers human auditory perception, and they have shown higher correlation to subjective listening tests than a simple L1 or L2 distance between clean and degraded speech.

In recent years, various deep learning-based models have been developed for SE (Lu et al., 2013; Xu et al., 2014; Wang et al., 2014; Xu et al., 2015; Ochiai et al., 2017; Luo & Mesgarani, 2018; Grais et al., 2018; Germain et al., 2018; Chai et al., 2018; Choi et al., 2019). Most of these models were trained in a supervised fashion by preparing pairs of noisy and clean speech. The deep models were then optimized by minimizing the distance between the generated speech and the clean speech. However, this distance (objective function) is usually based on a simple Lp loss (where p = 1 or 2), which does not reflect human auditory perception or ASR accuracy (Bagchi et al., 2018) well. In fact, several studies have indicated that enhanced speech with a smaller Lp distance does not guarantee a higher quality or intelligibility score (Fu et al., 2018b; Koizumi et al., 2018).

Therefore, optimizing the evaluation metrics (i.e., STOI, PESQ, etc.) may be a reasonable direction to connect model training with the goal of SE. Some recent studies (Fu et al., 2018b; Koizumi et al., 2018; Zhang et al., 2018; Zhao et al., 2018a; Naithani et al., 2018; Kolbæk et al., 2018; Venkataramani et al., 2018; Venkataramani & Smaragdis, 2018; Zhao et al., 2018b) have focused on STOI score optimization to improve speech intelligibility. A waveform-based, utterance-level enhancement approach was proposed to optimize the STOI score (Fu et al., 2018b); the results of a listening test showed that by combining STOI with MSE as the objective function, speech intelligibility can be further increased. On the other hand, because the PESQ function is not fully differentiable and is significantly more complex than STOI, only a few studies (Koizumi et al., 2018; Zhang et al., 2018; Koizumi et al., 2017; Martín-Doñas et al., 2018) have considered it as an objective function. Reinforcement learning (RL) techniques, such as the deep Q-network (DQN) and policy gradient, were employed to handle the non-differentiability, as in (Koizumi et al., 2017) and (Koizumi et al., 2018), respectively.

In summary, the abovementioned techniques can be categorized into two types depending on whether the details of the evaluation metrics have to be known: (1) white-box: these methods approximate the complex evaluation metric with a hand-crafted, simpler one; the approximation is differentiable and easy to apply as a loss function, but the details of the metric have to be known; (2) black-box: these methods treat the metric as a reward and apply RL-based techniques to increase the score. However, because of their lower training efficiency, most of them have to be pre-trained by conventional supervised learning.

In this study, to address the drawbacks of the abovementioned methods and the DEM problem, the discriminator in the GAN is associated with the evaluation metrics of interest (although these evaluation functions are complex, Fu et al. (2018a) showed that they can be approximated by neural networks). In particular, when training the discriminator, instead of always giving a false label (e.g., "0") to the generated speech, the labels in MetricGAN are given according to the evaluation metrics. Therefore, the target space of the discriminator is transformed from discrete (1 (true) or 0 (false)) to continuous (evaluation scores). Through this modification, the discriminator can be treated as a learned surrogate of the evaluation metrics. In other words, the discriminator iteratively estimates a surrogate loss that approximates the sophisticated metric surface, and the generator uses this surrogate to decide a gradient direction for optimization. Compared with the existing methods, the main advantages of MetricGAN are as follows:

(1) The surrogate function (discriminator) of the complex evaluation metrics is learned from data. In other words, it operates in a black-box setting: no computational details of the metric function have to be known.

(2) Experimental results show that the training efficiency of MetricGAN in increasing the metric score is even higher than that of conventional supervised learning with an Lp loss.

(3) Because the label space of the discriminator is now continuous, any desired metric score can be assigned to the generator. Therefore, MetricGAN has the flexibility to generate speech with specific evaluation scores.

(4) Under some non-extreme conditions, MetricGAN can even achieve multi-metric score assignment by employing multiple discriminators.

2 CGAN for SE

GAN has recently attracted a significant amount of attention in the community. By employing an alternating mini-max training scheme between a generator network (G) and a discriminator network (D), adversarial training can model the distribution of real data. One of its applications is to serve as a trainable objective function for regression tasks. Instead of explicitly minimizing Lp losses, which may cause over-smoothing problems, D provides a high-level, abstract measurement of "realness" (Liao et al., 2018).

In the applications of GAN to SE, a CGAN is usually employed to generate enhanced speech. To achieve this, G is trained to map noisy speech x to its corresponding clean speech y by minimizing the following loss function (as in (Pascual et al., 2017); here the least-squares GAN (LSGAN) approach (Mao et al., 2017) is used with binary coding, 1 for real and 0 for fake):

L_G(CGAN) = λ E_x[(D(G(x)) - 1)^2] + E_{x,y}[||G(x) - y||_1]    (1)

Because D usually simply learns to ignore the noise prior in the CGAN (Isola et al., 2017), we directly neglect it here. The first term in Eq. (1) is called the adversarial loss for cheating D, with a weighting factor λ; the second term is the L1 distance to the clean speech. The goal of D is to distinguish between real data and generated data by minimizing the following loss function:

L_D(CGAN) = E_y[(D(y) - 1)^2] + E_x[(D(G(x)) - 0)^2]    (2)

We argue that, to optimize the metric scores, the training of D should be associated with the metric.

3 MetricGAN

3.1 Associating the Discriminator with the Metrics

The main difference between the proposed MetricGAN and the conventional CGAN is how the discriminator is trained. Here, we first introduce a function Q(I) to represent the evaluation metric to be optimized, where I is the input of the metric. For example, for PESQ and STOI, I is the pair consisting of the speech that we want to evaluate and the corresponding clean speech y. Therefore, to ensure that D behaves similarly to Q, we simply modify the objective function of D:

L_D(MetricGAN) = E_{x,y}[(D(y, y) - Q(y, y))^2 + (D(G(x), y) - Q(G(x), y))^2]    (3)

Because we can always map Q to a normalized Q', which lies between 0 and 1 (here, 1 represents the best evaluation score), Eq. (3) can be reformulated as

L_D(MetricGAN) = E_{x,y}[(D(y, y) - 1)^2 + (D(G(x), y) - Q'(G(x), y))^2]    (4)

where 0 ≤ Q' ≤ 1. There are two main differences between Eq. (4) and Eq. (2):

1.) In CGAN, as long as the data is generated, its label for D is always a constant 0. However, the target label of the generated data in our MetricGAN is based on its metric score. Therefore, D can evaluate the degree of realness (closeness to clean speech), instead of just distinguishing real from fake. (Hence, D should perhaps be called an evaluator; however, here we simply follow the GAN convention.)

2.) The condition used in the D of CGAN is the noisy speech x, whereas the condition used in the proposed MetricGAN is the clean speech y. This is because we want D and Q to have similar behavior; therefore, the input arguments of D are chosen to be the same as those of Q.
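To make the discriminator objective in Eq. (4) concrete, a minimal PyTorch-style sketch is given below. The names discriminator and metric_q, as well as the batch handling, are illustrative assumptions rather than details of the original implementation; metric_q is assumed to return the normalized score Q' in [0, 1].

    import torch

    def discriminator_loss(discriminator, metric_q, enhanced, clean):
        """Eq. (4): make D mimic the normalized metric Q' on clean and enhanced speech."""
        # A clean utterance paired with itself has the perfect normalized score 1.
        loss_clean = (discriminator(clean, clean) - 1.0).pow(2).mean()

        # The enhanced utterance is labeled with its actual metric score, not a constant 0.
        with torch.no_grad():
            q_score = metric_q(enhanced, clean)        # normalized score in [0, 1]
        loss_enhanced = (discriminator(enhanced, clean) - q_score).pow(2).mean()

        return loss_clean + loss_enhanced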

3.2 Continuous Space of the Discriminator Label

The training of G is similar to Eq. (1). However, we found that the gradient provided by D in our MetricGAN is more efficient than the L1 loss. Therefore, the training of G can rely completely on the adversarial loss:

L_G(MetricGAN) = E_x[(D(G(x), y) - s)^2]    (5)

where s is the desired assigned score. For example, to generate clean speech, we can simply assign s to be 1. Conversely, we can also generate noisier speech by assigning a smaller s. This flexibility comes from the fact that the label of the generated speech in D is now continuous and related to the metric. Unlike surrogate loss learning in multi-class classification (Hsieh et al., 2018), because the output space of our D is continuous, local neighbors need not be explicitly selected to learn the behavior of the metric surface.
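As a companion to the previous sketch, Eq. (5) could be rendered as the following generator update; target_score corresponds to s and would be set to 1 when the goal is clean speech (again, the function names are hypothetical).

    def generator_loss(discriminator, generator, noisy, clean, target_score=1.0):
        """Eq. (5): push D's estimate of the enhanced speech toward the assigned score s."""
        enhanced = generator(noisy)
        return (discriminator(enhanced, clean) - target_score).pow(2).mean()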

Figure 1: Learning process of MetricGAN for optimizing the evaluation metric (the horizontal axis represents the weights of G). In each iteration, there are three steps. First, some data are generated by G with its current weights. Second, D is trained to behave similarly to the metric Q at those points. Third, G is trained according to the gradient provided by D (to cheat D).

3.3 Explanation of MetricGAN

In MetricGAN, the target of G is to cheat D to reach the specified score, and D tries not to be cheated by learning the true score. Here, we also explain the learning process of MetricGAN from a different perspective. As shown in Figure 1, the training of D can be treated as learning a local surrogate of Q, and the training of G adjusts its weights toward the optimum of this surrogate. Because D may only approximate Q well in the observed region (Fu et al., 2019), this learning framework should be trained alternately until convergence.
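Putting the two losses together, the alternating procedure described above could be sketched as follows; the data loader, the optimizers, and the one-to-one update ratio between D and G are assumptions for illustration, and the sketch builds on the discriminator_loss and generator_loss functions given earlier.

    for noisy, clean in loader:
        # Step 1: generate data with the current weights of G.
        enhanced = generator(noisy).detach()

        # Step 2: fit D to the metric Q at the generated (and clean) points.
        d_opt.zero_grad()
        discriminator_loss(discriminator, metric_q, enhanced, clean).backward()
        d_opt.step()

        # Step 3: move G along the gradient provided by D (cheat D).
        g_opt.zero_grad()
        generator_loss(discriminator, generator, noisy, clean, target_score=1.0).backward()
        g_opt.step()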

4 Experiments

4.1 Network Architecture

The input features for G are the normalized magnitude spectrograms of the noisy utterances. The generator used in this experiment is a BLSTM (Weninger et al., 2015) with two bidirectional LSTM layers, each with 200 nodes, followed by two fully connected layers with 300 LeakyReLU nodes and 257 sigmoid nodes for mask estimation, respectively. When this mask (between 0 and 1) is multiplied with the noisy magnitude spectrogram, the noise components should be removed. In addition, as reported in (Koizumi et al., 2018), to prevent musical noise, flooring was applied to the estimated mask before T-F masking; here, we set the lower threshold of the T-F mask to 0.05.
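A rough PyTorch rendering of this generator is shown below, assuming 257-bin magnitude spectrograms shaped (batch, time, 257); the layer sizes follow the description above, but other hyperparameters of the original implementation may differ.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MaskGenerator(nn.Module):
        def __init__(self, n_bins=257):
            super().__init__()
            self.blstm = nn.LSTM(n_bins, 200, num_layers=2,
                                 batch_first=True, bidirectional=True)
            self.fc1 = nn.Linear(400, 300)            # 400 = 2 x 200 (bidirectional)
            self.fc2 = nn.Linear(300, n_bins)

        def forward(self, noisy_mag):                 # (batch, time, 257)
            h, _ = self.blstm(noisy_mag)
            h = F.leaky_relu(self.fc1(h))
            mask = torch.sigmoid(self.fc2(h))
            mask = torch.clamp(mask, min=0.05)        # flooring against musical noise
            return mask * noisy_mag                   # masked (enhanced) magnitude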

SNR (dB)   Noisy          IRM (L1)       IRM (CGAN)     PE policy grad* (P)  MetricGAN (P)  MetricGAN (S)
           PESQ   STOI    PESQ   STOI    PESQ   STOI    PESQ   STOI          PESQ   STOI    PESQ   STOI
 12        2.375  0.919   2.913  0.935   2.879  0.936   2.995  0.927         2.967  0.936   2.864  0.939
  6        1.963  0.831   2.52   0.878   2.479  0.876   2.595  0.869         2.616  0.881   2.486  0.885
  0        1.589  0.709   2.086  0.787   2.053  0.786   2.144  0.776         2.200  0.796   2.086  0.802
 -6        1.242  0.576   1.583  0.655   1.551  0.653   1.634  0.644         1.711  0.668   1.599  0.679
-12        0.971  0.473   1.061  0.508   1.046  0.507   1.124  0.500         1.169  0.521   1.090  0.533
Avg.       1.628  0.702   2.033  0.753   2.002  0.751   2.098  0.743         2.133  0.760   2.025  0.768

Table 1: Performance comparisons of different loss functions in terms of PESQ and STOI (* represents pre-trained from another model).

The discriminator herein is a CNN with four two-dimensional (2-D) convolutional layers, with the number of filters and kernel sizes as follows: [15, (5, 5)], [25, (7, 7)], [40, (9, 9)], and [50, (11, 11)]. To handle variable-length input (different speech utterances have different lengths), a 2-D global average pooling layer was added so that the features are fixed at 50 dimensions (50 is the number of feature maps in the previous layer). Three fully connected layers were added subsequently, with 50 and 10 LeakyReLU nodes and 1 linear node, respectively. In addition, to make D a smooth function (we do not want a small change in the input spectrogram to result in a large difference in the estimated score), it is constrained to be 1-Lipschitz continuous by spectral normalization (Miyato et al., 2018). Our preliminary experiments found that adding this constraint can stabilize the training of D. All models are trained using Adam (Kingma & Ba, 2014) with β1 = 0.9 and β2 = 0.999.
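One possible PyTorch reading of this discriminator is sketched below. Stacking the evaluated and reference spectrograms as two input channels is our assumption (the paper only states that D takes the pair as input), and spectral normalization is applied to every layer as the 1-Lipschitz constraint described above.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torch.nn.utils import spectral_norm

    class MetricDiscriminator(nn.Module):
        def __init__(self):
            super().__init__()
            # [filters, kernel]: [15, (5,5)], [25, (7,7)], [40, (9,9)], [50, (11,11)]
            cfg = [(2, 15, 5), (15, 25, 7), (25, 40, 9), (40, 50, 11)]
            self.convs = nn.ModuleList(
                [spectral_norm(nn.Conv2d(i, o, k)) for i, o, k in cfg])
            self.fc1 = spectral_norm(nn.Linear(50, 50))
            self.fc2 = spectral_norm(nn.Linear(50, 10))
            self.out = spectral_norm(nn.Linear(10, 1))

        def forward(self, est_mag, ref_mag):             # each (batch, time, 257)
            x = torch.stack([est_mag, ref_mag], dim=1)   # (batch, 2, time, 257)
            for conv in self.convs:
                x = F.leaky_relu(conv(x))
            x = x.mean(dim=(2, 3))                       # 2-D global average pooling -> (batch, 50)
            x = F.leaky_relu(self.fc1(x))
            x = F.leaky_relu(self.fc2(x))
            return self.out(x)                           # single linear node: estimated score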

4.2 Experiment on the TIMIT Dataset

In this section, we present experiments on PESQ and STOI score optimization. PESQ was designed to evaluate the quality of processed speech, and its score ranges from -0.5 to 4.5. STOI was designed to compute speech intelligibility, and its score ranges from 0 to 1. For both metrics, higher is better.
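Since the labels in Eq. (4) must lie in [0, 1], both metrics need a simple normalization before being used as Q'. The linear mapping for PESQ below is one reasonable choice given the stated ranges; it is an assumption for illustration, not necessarily the exact mapping used in the paper.

    def normalize_pesq(pesq_score):
        # PESQ ranges from -0.5 to 4.5; map it linearly onto [0, 1].
        return (pesq_score + 0.5) / 5.0

    def normalize_stoi(stoi_score):
        # STOI is already in [0, 1] and can be used directly.
        return stoi_score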

Figure 2: Learning curves of different objective functions evaluated on the validation set (the structure of G is fixed), in terms of (a) PESQ score and (b) STOI score.

Figure 3: Results of the AB preference test (with 95% confidence intervals) on speech quality, comparing the proposed MetricGAN (P) and the two baseline models.

Figure 4: Spectrograms of a TIMIT utterance in the test set: (a) clean target, (b) noisy speech (engine noise at 0 dB), (c) to (f): enhanced speech with different loss functions.

4.2.1 Dataset

In this experiment, the TIMIT corpus (Garofolo et al., 1988) was used to prepare the training, validation, and test sets. 300 utterances were randomly selected from the training set of the TIMIT database for training. These utterances were further corrupted with 10 noise types (crowd, two machine noises, alarm and siren, traffic and car, animal sounds, water sounds, wind, bell, and laugh noise) from (Hu), at five SNR levels (from -8 dB to 8 dB in steps of 4 dB), to form 15,000 training utterances. To monitor the training process and choose the hyperparameters, we randomly selected another 100 clean utterances from the TIMIT training set to form our validation set. Each utterance was further corrupted with one of the noise types (different from those used in the training set) from (Hu) at five different SNR levels (from -10 dB to 10 dB in steps of 5 dB). To evaluate the performance of different training methods, 100 utterances from the TIMIT test set were randomly selected as our test set. These utterances were mixed with four unseen noise types (engine, white, street, and baby cry) at five SNR levels (-12 dB, -6 dB, 0 dB, 6 dB, and 12 dB). In summary, there are 2,000 utterances in the test set.
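For reference, mixing a clean utterance with noise at a prescribed SNR can be done by scaling the noise relative to the clean-speech power, as in the minimal NumPy sketch below; this is standard practice and an assumption on our part, since the paper does not describe its exact mixing procedure.

    import numpy as np

    def mix_at_snr(clean, noise, snr_db):
        """Scale `noise` so that the clean-to-noise power ratio equals `snr_db`."""
        noise = noise[:len(clean)]                     # assume the noise clip is long enough
        clean_power = np.mean(clean ** 2)
        noise_power = np.mean(noise ** 2) + 1e-12
        scale = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
        return clean + scale * noise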

4.2.2 Objective Evaluation with Different Loss Functions

In this experiment, to evaluate the performance of different objective functions, the structure of G is fixed and trained with different losses. As one of our baseline models, we adopt ideal ratio mask (IRM) (Narayanan & Wang, 2013) based mask estimation with L1 loss (denoted as IRM (L1)). The other baseline (denoted as IRM (CGAN)) is the CGAN with the loss function of G shown in Eq. (1). Compared to IRM (L1), IRM (CGAN) has an additional adversarial loss term with λ = 0.01, as in (Bagchi et al., 2018; Pascual et al., 2017). A parameter-exploring policy gradients (Sehnke et al., 2010) based black-box optimization method, similar to the one used in (Zhang et al., 2018), is also compared. However, we found that this method is very sensitive to the hyperparameters (e.g., weight initialization, step size of jitter, etc.), and we could only obtain improved results for PESQ optimization (denoted as PE policy grad (P)). In addition, because of its lower training efficiency, its generator was first pre-trained from IRM (L1). The proposed MetricGAN with the PESQ or STOI metric as Q is denoted as MetricGAN (P) or MetricGAN (S), respectively.

Table 1 presents the average PESQ and STOI scores on the test set for the baselines and proposed methods. From this table, we first observe that the performance of IRM (CGAN) is similar to or slightly worse than that of the simple IRM (L1), which is in agreement with the results presented in previous papers (Pandey & Wang, 2018; Donahue et al., 2018). This implies that the adversarial loss term used to cheat D is not helpful in this application. One possible reason is that the decision boundary of D is very different from the metrics we consider. We also attempted to train IRM (CGAN) with a larger λ; however, the evaluation scores were worse than the reported ones. Although PE policy grad (P) obtains some PESQ improvement, its STOI scores decrease compared to its initialization, IRM (L1). On the contrary, when we employ PESQ as Q in our MetricGAN, it achieves the highest PESQ scores among all the models, with the second highest STOI scores. Note that unlike the L1 loss, the loss function of G in MetricGAN is Eq. (5), and there is no specific target for each T-F bin. In terms of the STOI score, MetricGAN (S) outperforms the other models, and the improvement is most evident in the low-SNR conditions (where speech intelligibility improvement is most critical).

In addition to the final results on the test set, the learning processes of different loss functions evaluated on the validation set are presented in Figure 2. For both scores, we observe that the learning efficiency (in terms of the number of iterations) of MetricGAN is higher than that of the others. This implies that the gradient provided by D (the surrogate of Q) points most accurately toward the maximum value of Q. However, if the Q used to train MetricGAN does not match the evaluation metric, the performance is sub-optimal. Therefore, the information from Q is important; our preliminary experiment also shows that without Q, the learning cannot converge. The conventional adversarial loss term in IRM (CGAN) is not helpful for improving the scores or the training efficiency.

Finally, examples of the spectrograms enhanced by different training objective functions are shown in Figure 4. The spectrogram generated by IRM (CGAN) is similar to that of IRM (L1). If we simply increase the weight of the adversarial loss term in Eq. (1), some unpleasant artifacts begin to appear (not shown here owing to limited space). Interestingly, compared with the others, the spectrogram (f) generated by MetricGAN (S) best recovers the speech components with clear structures (as shown by the black dashed rectangles) and hence obtains the highest STOI score.

4.2.3 Subjective Evaluation

To evaluate the perceptual quality of the enhanced speech, we conducted AB preference tests to compare the proposed method with the baseline models. Three pairs of listening tests were conducted: IRM (CGAN) versus IRM (L1), MetricGAN (P) versus IRM (CGAN), and MetricGAN (P) versus IRM (L1). Each pair of samples was presented in a randomized order. For each listening test, 20 sample pairs were randomly selected from the test set, and 15 listeners participated. Listeners were instructed to select the sample with the better quality. The stimuli were played to the subjects in a quiet environment through a set of Sennheiser HD headphones at a comfortable listening level. In Figure 3 (a), we can observe that the preference scores of IRM (L1) and IRM (CGAN) overlap in their confidence intervals, which is in agreement with the objective evaluation. Further, as shown in Figure 3 (b) and Figure 3 (c), MetricGAN (P) significantly outperforms both baseline systems, without an overlap in the confidence intervals.

4.2.4 Assigning Any Desired Score to the Generator

Because the label of D in the conventional GAN is in a discrete space, there is no guarantee that the generated data will gradually improve toward real data when the label assigned to G (i.e., the constant in the first term of Eq. (1)) increases from 0 (fake) toward 1 (real). For example, the data generated for label 0.9 is not necessarily better (more like real data) than that for label 0.8. However, as pointed out in Section 3.2, because the output of D in MetricGAN is continuous according to Q, we can assign any desired score s during the training of G, as in Eq. (5). Therefore, different values of s in Eq. (5) correspond to generated speech with different qualities. Interestingly, setting s to a small value can convert the generator from a speech enhancement model into a speech degradation model. This provides us with another way to understand the factors that affect the metric. To achieve this, a uniform mask constraint (penalizing estimated mask values away from 0.5) was also applied to G, so that G has to choose the most efficient way to attain the assigned score without significantly changing the initialized mask. (Owing to the sigmoid activation used in the output layer of G, all the initially estimated mask values are close to 0.5.) Figure 5 shows an example of assigning different s to G, and the learning process evaluated on the validation set is illustrated in Figure 5 (c) and (g). Compared to the generation of clean speech (the entire learning process for generating clean speech is presented in Figure 2), MetricGAN can attain the desired score more easily when s is small. This is because the number of solutions gradually decreases as s increases (it is easier to obtain noisy speech than clean speech); therefore, the solution for a large s is considerably more difficult to obtain. Figures 5 (d) to (f) and (h) to (j) present the speech generated by assigning different s with STOI and PESQ as Q, respectively. Intriguingly, the speech components gradually disappear when we attempt to generate speech with a low STOI score (the speech components are almost entirely removed, as shown by the black rectangle in Figure 5 (f)). Because STOI measures the intelligibility of speech, it is reasonable that the speech components are the most crucial factor for this metric. On the contrary, because PESQ measures the quality of speech, the speech generated with a lower s becomes noisier (for extremely low s values (Figure 5 (j)), although not as severe as in the STOI case, some speech components are also removed). These results verify that MetricGAN can generate data according to the designated metric score, making the label space of D continuous.
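The uniform mask constraint mentioned above could be implemented as a simple penalty added to the generator loss of Eq. (5); the quadratic form and the weight are illustrative assumptions, since the paper only states that mask values away from 0.5 are penalized.

    def uniform_mask_penalty(mask, weight=1.0):
        # Penalize deviations of the estimated mask from 0.5 so that G departs from
        # the near-uniform initial mask only where it helps reach the assigned score s.
        return weight * ((mask - 0.5) ** 2).mean()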

Figure 5: Results of assigning different s to Eq. (5) for generator training. Note that the learning curves for generating clean speech in (c) and (g) have not yet converged; for more complete learning processes, please refer to Figure 2.

Figure 6: Learning curves when assigning different pairs of (STOI, PESQ) scores (shown in the title of each sub-figure). Given a specified STOI score, the upper row and lower row show the maximum and minimum PESQ scores that MetricGAN can reach, respectively. Note that the PESQ score is normalized between 0 and 1, with the original score shown in parentheses.
  Input: desired scores s_1 for metric Q_1 through s_K for metric Q_K (assume there are K different metrics).
  repeat
     1) Find the metric index k with the largest distance between the achieved score D_k(G(x), y) and the assigned score s_k.
     2) Train G to minimize the loss from D_k (i.e., Eq. (5) with target s_k).
     3) Train all D_i to minimize their distance from the corresponding Q_i (i.e., Eq. (4)).
  until convergence
Algorithm 1 Multi-Metric Scores Assignment

4.2.5 Multi-Metric Scores Assignment

In this section, we further explore the assignment of scores for multiple metrics simultaneously. Compared with single metric assignment, this is a more difficult task because the requirement to achieve other metrics can be treated as adding constraints.

Algorithm 1 shows the proposed training method for multi-metric scores assignment. Assuming that there are K different metrics, we have to employ K discriminators. In each iteration, only the discriminator D_k with the largest distance between the achieved score and the assigned score s_k guides the learning of G (steps 1 and 2). However, in the training of D, all the discriminators are updated, irrespective of whether they are used to provide a loss to G (step 3).
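A hedged Python rendering of Algorithm 1 is given below; discriminators, metrics, and targets are parallel lists indexed by metric, the optimizers are assumptions, and discriminator_loss refers to the sketch given in Section 3.1.

    import torch

    def multi_metric_step(generator, discriminators, metrics, targets,
                          noisy, clean, g_opt, d_opts):
        """One iteration of Algorithm 1 (illustrative names; K = len(discriminators))."""
        enhanced = generator(noisy)

        # Step 1: find the metric whose assigned score is currently farthest from being met.
        with torch.no_grad():
            gaps = [abs(float(d(enhanced, clean).mean()) - s)
                    for d, s in zip(discriminators, targets)]
        k = max(range(len(gaps)), key=gaps.__getitem__)

        # Step 2: train G against D_k only (Eq. (5) with target s_k).
        g_opt.zero_grad()
        ((discriminators[k](enhanced, clean) - targets[k]) ** 2).mean().backward()
        g_opt.step()

        # Step 3: update every D_i toward its own metric Q_i (Eq. (4)).
        enhanced = enhanced.detach()
        for d, q, opt in zip(discriminators, metrics, d_opts):
            opt.zero_grad()
            discriminator_loss(d, q, enhanced, clean).backward()
            opt.step()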

Figure 6 shows the learning curves for the case of K = 2. To explore more possible combinations, these results are based on a subset (top 10% metric scores) of the original validation set. To clearly illustrate the results of multi-metric learning, in each column of this figure, the assigned STOI score is fixed while different PESQ scores are assigned. Because different metrics may be positively correlated with each other, MetricGAN has difficulty converging when the score assignments are too extreme (in that case, a solution may not even exist). However, we still have some flexibility to generate speech with the desired multiple scores. This experiment verifies that MetricGAN can approximate and distinguish different metrics well.


PESQ CSIG CBAK COVL
Noisy 1.97 3.35 2.44 2.63
SEGAN 2.16 3.48 2.94 2.80
MMSE-GAN 2.53 3.80 3.12 3.14
WGAN-GP 2.54 - - -
Deep Feature Loss - 3.86 3.33 3.22
SERGAN 2.62 - - -
MetricGAN (P) 2.86 3.99 3.18 3.42
Table 2: Comparison of MetricGAN with other state-of-the-art methods. The highest score per metric is highlighted with bold text.

4.3 Comparison with Other State-of-the-Art SE Models

To further compare the proposed MetricGAN with other state-of-the-art methods, we use the publicly available dataset released by Valentini-Botinhao et al. (2016). This dataset contains a large amount of pre-mixed noisy-clean paired data and has already been used by several SE models. By using exactly the same training and test split, we can establish a fair comparison with these models.

Experimental Setup and Results:

Details about the data can be found in the original paper. Except for the input features and activation functions, the network architecture and training strategy are the same as described in the previous section. In addition to the PESQ score, we also report three other metrics on the test set to compare with previous works: CSIG predicts the mean opinion score (MOS) of the signal distortion, CBAK predicts the MOS of the background noise interference, and COVL predicts the MOS of the overall speech quality. These three metrics range from 1 to 5.

Five baseline models that rely on another network to provide loss information are compared with the proposed MetricGAN (P). We briefly describe these models as follows: SEGAN (Pascual et al., 2017) operates directly on the raw waveform, and the model is trained to minimize a combination of adversarial and L1 losses. MMSE-GAN (Soni et al., 2018) is a time-frequency masking-based method that uses a GAN objective along with an L2 loss. Similar in structure to SEGAN, WGAN-GP and SERGAN (Baby & Verhulst, 2019) introduce the Wasserstein loss and the relativistic least-squares loss for GAN training, respectively. Finally, Deep Feature Loss (Germain et al., 2018) also operates on the raw waveform and is trained with a deep feature loss from another network that classifies acoustic environments. Table 2 shows that our proposed method outperforms all previous works on three of the four metrics. This implies that although MetricGAN is only trained to optimize a certain score (PESQ), it generalizes well to other metrics.

5 Conclusion

In this paper, we proposed a novel MetricGAN approach to directly optimize generators based on one or multiple evaluation metric scores. By associating the discriminator with the metrics of interest, MetricGAN can be treated as an iterative process that alternates between surrogate loss learning and generator learning. This surrogate can successfully capture the behavior of the metrics and provide accurate gradients to guide the generator updates. In addition to outperforming other loss functions and state-of-the-art models in SE, MetricGAN can also be trained to generate data according to designated metric scores. To the best of our knowledge, this is the first work that employs a GAN to directly train the generator with respect to multiple evaluation metrics.

References