1 Introduction
Automatic music transcription (AMT) concerns automated methods for converting acoustic music signals into some form of musical notation [4]. AMT is a multifaceted problem comprising a number of subtasks, including multi-pitch estimation (MPE), note tracking, instrument recognition, rhythm analysis, and score typesetting. MPE predicts the set of concurrent pitches present at each instant, and it is closely related to note tracking, which predicts the onset and offset timings of every note in the audio. In this paper, we address an issue in recent approaches to MPE and note tracking: the probabilistic dependencies between the labels are often overlooked.
A common approach for MPE and note tracking is to predict a two-dimensional representation that is defined along the time and frequency axes and contains the pitch tracks of notes over time. Piano rolls are the most common example of such representations, and deep salience [5] is another example that can contain more granular information on pitch contours. Once such a representation is obtained, pitches and notes can be decoded by thresholding [23] or other heuristic methods [25, 17].

To train a model that predicts a two-dimensional target representation $\hat{Y} = f(X) \in \mathbb{R}^{K \times T}$ from an input audio representation $X$, where $K$ is the number of pitch labels and $T$ is the number of time frames, a common approach is to minimize the element-wise sum of a loss function $\mathcal{L}$:
$$\mathcal{L}(\hat{Y}, Y) = \sum_{k=1}^{K} \sum_{t=1}^{T} \mathcal{L}(\hat{y}_{kt}, y_{kt}), \tag{1}$$
where $Y$ is the ground truth. From a probabilistic perspective, we can interpret Equation 1 as the negative log-likelihood of the parameters $\theta$ of a discriminative model $p_\theta(Y \mid X)$:
$$\mathcal{L}(\hat{Y}, Y) = -\log p_\theta(Y \mid X) = -\sum_{k,t} \log p_\theta(y_{kt} \mid X), \tag{2}$$
which indicates that the elements of the label $Y$ are conditionally independent of each other given the input $X$. This encourages the model to predict the average of the posterior, making blurry predictions when the posterior distribution is multimodal, as is well known for natural images [10].
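To make the element-wise interpretation concrete, here is a minimal numpy sketch (array sizes and function names are ours, purely illustrative) showing that summing per-cell binary cross entropies over a piano roll is exactly the negative log-likelihood of treating every (pitch, time) cell as an independent Bernoulli given the input, as in Equations 1 and 2:

```python
import numpy as np

def elementwise_bce(y_hat, y, eps=1e-7):
    """Element-wise binary cross entropy summed over all piano-roll cells.

    This is the negative log-likelihood of a model in which each
    (pitch, time) cell is an independent Bernoulli given the input.
    """
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Toy 88-pitch x 4-frame piano roll.
rng = np.random.default_rng(0)
y = (rng.random((88, 4)) > 0.9).astype(float)   # sparse binary ground truth
y_hat = np.full((88, 4), 0.5)                   # maximally indecisive prediction
loss = elementwise_bce(y_hat, y)
```

Because every cell of the prediction is 0.5, each cell contributes exactly $\log 2$ to the loss regardless of its label, illustrating that indecisive predictions incur a constant, label-independent penalty under this objective.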
Music data is highly contextual and multimodal, and the conditional independence assumption does not hold in general. This is why many computational music analysis models employ a separate post-processing stage after sequence prediction. One approach is to factorize the joint probability using the chain rule and assume the Markov property:
$$p(Y \mid X) = \prod_{t=1}^{T} p(y_t \mid y_1, \ldots, y_{t-1}, X) \approx \prod_{t=1}^{T} p(y_t \mid y_{t-1}, X), \tag{3}$$
where $y_t$ denotes the $t$-th column of $Y$.
This corresponds to appending hidden Markov models (HMMs) [36] or recurrent neural networks (RNNs) [39, 17] to the transcription model. The Markov assumption is effective for one-dimensional sequence prediction tasks, such as chord estimation [35] and monophonic pitch tracking [32], but when predicting a two-dimensional representation, it still does not address the inter-label dependencies along the frequency axis.

There exist a number of models in the computer vision literature that can express inter-label dependencies in two-dimensional predictions, such as the neural autoregressive distribution estimator (NADE) [27], PixelRNN [43], and PixelCNN [42]. However, apart from a notable exception using a hybrid RNN-NADE approach [7], the effect of learning the joint posterior distribution for polyphonic music transcription has not been well studied.

To this end, we propose a new approach for effectively leveraging inter-label dependencies in polyphonic music transcription. We pose the problem as an image translation task and apply an adversarial loss incurred by a discriminator network attached to the baseline model. We show that our approach can consistently and significantly reduce the transcription errors of Onsets and Frames [17], a state-of-the-art music transcription model.
2 Background
2.1 Automatic Transcription of Polyphonic Music
Automatic transcription models for polyphonic music can be classified into frame-level or note-level approaches. Frame-level transcription is synonymous with multi-pitch estimation (MPE) and operates on tiny temporal slices of audio, or frames, to predict all pitch values present in each frame. Note-level transcription, or note tracking, operates at a higher level, predicting a sequence of note events that contains the pitch, the onset time, and optionally the offset time of each note. Note tracking is typically implemented as a post-processing stage on the output of MPE [3], by connecting and grouping the pitch estimates over time to produce discrete note events. In this sense, we can say that MPE is at the core of polyphonic music transcription.

Two categories of approaches for MPE have been most successful in recent years: matrix factorization and deep learning. Factorization-based models for music transcription [40] use non-negative matrix factorization (NMF) [29] to factorize a time-frequency representation $S \in \mathbb{R}^{F \times T}$ as a product of a dictionary matrix $W \in \mathbb{R}^{F \times K}$ and an activation matrix $H \in \mathbb{R}^{K \times T}$, where $K$ is the number of pitch labels to be transcribed, e.g. 88 keys for piano transcription. This allows for an intuitive interpretation of each matrix: each column of $W$ contains a spectral template for a pitch label, and each row of $H$ contains the activation of the corresponding pitch over time. Various extensions of factorization-based methods have been proposed to leverage sparsity [1], adaptive estimation of harmonic spectra [44, 14], and modeling of attack and decay sounds [2, 12]. In all of these approaches, an iterative gradient-descent algorithm minimizes an element-wise divergence between the product $WH$ and the target matrix $S$ [13].
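As an illustration of the factorization approach, the following numpy sketch runs Lee and Seung's multiplicative updates for the Euclidean distance, one instance of the element-wise divergences mentioned above; the dimensions and iteration count are arbitrary choices for demonstration, not taken from any cited system:

```python
import numpy as np

# Factorize a (frequency x time) magnitude matrix V into a dictionary
# W (frequency x K pitch templates) and activations H (K x time).
rng = np.random.default_rng(0)
F, T, K = 64, 100, 8
V = rng.random((F, T))
W = rng.random((F, K))
H = rng.random((K, T))
eps = 1e-9

err0 = np.linalg.norm(V - W @ H)   # reconstruction error at initialization

# Multiplicative updates for the Euclidean distance: these keep W and H
# non-negative and never increase the reconstruction error.
for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

err = np.linalg.norm(V - W @ H)
```

Because the updates are multiplicative, non-negativity of $W$ and $H$ is preserved automatically, which is what gives each column of $W$ its interpretation as a spectral template and each row of $H$ its interpretation as a pitch activation.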
Deep learning [28] methods for music transcription are increasingly popular [3] as larger labeled datasets and more powerful hardware become accessible. These approaches use neural networks (NNs) to produce music transcriptions from the input audio. An early work [34] used deep belief networks [20] to extract audio features which are subsequently fed to pitch-wise SVM-HMM pairs to predict the target piano rolls. More recent approaches are based on convolutional [5, 23] and/or recurrent neural networks [6, 39, 17], which are likewise optimized with gradient descent to minimize an element-wise loss against the target time-frequency representations.

Onsets and Frames [17] is a state-of-the-art piano transcription model that we use as our baseline. It uses multiple columns of convolutional and recurrent layers to predict onsets, offsets, velocities, and frame labels from the Mel spectrogram input, as shown in Figure 1. The predicted onset and frame posteriors are then used for decoding the note sequences: a threshold value creates binary onset and frame activations, and frame activations without corresponding onsets are disregarded.
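The decoding rule described above can be sketched as follows. This is a deliberately simplified illustration with all names ours; the actual Onsets and Frames decoder also handles offsets and velocities:

```python
import numpy as np

def decode_notes(onset_post, frame_post, threshold=0.5):
    """Binarize onset/frame posteriors, then keep a frame run only if it
    begins with an onset; runs without an onset are disregarded."""
    onsets = onset_post >= threshold   # (pitches, frames) boolean
    frames = frame_post >= threshold
    notes = []                         # (pitch, start_frame, end_frame)
    n_pitches, n_frames = frames.shape
    for pitch in range(n_pitches):
        t = 0
        while t < n_frames:
            if onsets[pitch, t]:       # a note may only start at an onset
                start = t
                while t < n_frames and frames[pitch, t]:
                    t += 1
                notes.append((pitch, start, t))
            else:
                t += 1
    return notes

# One pitch: frames 1-3 start with an onset (a valid note); frames 5-6
# are active without any onset and are therefore discarded.
onset_post = np.array([[0.0, 0.9, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]])
frame_post = np.array([[0.0, 0.9, 0.9, 0.9, 0.0, 0.9, 0.9, 0.0]])
notes = decode_notes(onset_post, frame_post)
```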
As discussed above, most NMF- and NN-based methods, including Onsets and Frames, use an element-wise optimization objective that does not consider inter-label dependencies. This motivates the adversarial training scheme outlined in the following subsection.
2.2 Generative Adversarial Networks and pix2pix
Generative adversarial networks (GANs) [16] refer to a family of deep generative models which consist of two components, namely the generator $G$ and the discriminator $D$. Given a data distribution $x \sim p_{\mathrm{data}}(x)$ and latent codes $z \sim p_z(z)$, GANs play the following minimax game:
$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]. \tag{4}$$
$G$ and $D$ are implemented as neural networks trained in an adversarial manner: the discriminator learns to distinguish the generated samples from the real data, while the generator learns to produce realistic samples that fool the discriminator. GANs are most renowned for their ability to produce photorealistic images [22] and have shown promising results on music generation as well [11, 9, 45]. We refer the reader to [15, 8] for a comprehensive review of the techniques, variants, and applications of GANs.
The second term in Equation 4 has near-zero gradients when $D(G(z)) \approx 0$, which is usually the case early in training. To avoid this, a non-saturating variant of GAN is suggested in [16], where the generator is trained with the following optimization objective instead:
$$\max_G \; \mathbb{E}_{z \sim p_z(z)}\left[\log D(G(z))\right]. \tag{5}$$
The non-saturating GAN loss is used more often than the minimax loss in Equation 4 and is implemented by flipping the labels of fake data while using the same loss function. Least-squares GAN [31] is an alternative method to address the vanishing gradient problem, which replaces the cross-entropy losses in Equations 4-5 with squared errors:
$$\min_D \; \tfrac{1}{2}\,\mathbb{E}_{x}\left[(D(x) - 1)^2\right] + \tfrac{1}{2}\,\mathbb{E}_{z}\left[D(G(z))^2\right], \qquad \min_G \; \tfrac{1}{2}\,\mathbb{E}_{z}\left[(D(G(z)) - 1)^2\right]. \tag{6}$$
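To see why these variants help, the following numpy sketch compares the generator losses of Equations 4-6 as functions of the discriminator score $d = D(G(z))$ on a fake sample; near $d = 0$, where a fresh generator typically starts, the minimax loss is nearly flat while the non-saturating loss still provides a strong gradient:

```python
import numpy as np

def minimax_g_loss(d):        # Eq. 4: the generator minimizes log(1 - d)
    return np.log(1 - d)

def nonsaturating_g_loss(d):  # Eq. 5: minimize -log(d); same fixed point
    return -np.log(d)

def lsgan_g_loss(d):          # Eq. 6: squared error against the real label
    return (d - 1) ** 2

# Finite-difference gradients near d = 0 (fakes easily rejected).
d, h = 1e-4, 1e-6
grad_minimax = (minimax_g_loss(d + h) - minimax_g_loss(d)) / h
grad_nonsat = (nonsaturating_g_loss(d + h) - nonsaturating_g_loss(d)) / h
```

The magnitude of the minimax gradient stays around 1 while the non-saturating gradient explodes as $d \to 0$, which is exactly the vanishing-gradient issue that the non-saturating and least-squares formulations address.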
While the default formulation of GAN concerns the unconditional generation of samples from $p_{\mathrm{data}}(x)$, conditional GANs (cGAN) [33] produce samples from a conditional distribution $p(y \mid x)$. To do this, the generator and the discriminator are defined in terms of the condition variable $x$ as well:
$$\mathcal{L}_{\mathrm{cGAN}}(G, D) = \mathbb{E}_{x, y}\left[\log D(x, y)\right] + \mathbb{E}_{x, z}\left[\log\left(1 - D(x, G(x, z))\right)\right]. \tag{7}$$
pix2pix [21] is an image translation model that learns a mapping between two distinct domains of images, such as aerial photos and maps. A pix2pix model takes paired images as its training data and minimizes the conditional GAN loss along with an additional L1 loss:
$$G^{*} = \arg\min_G \max_D \; \mathcal{L}_{\mathrm{cGAN}}(G, D) + \lambda\, \mathbb{E}_{x, y, z}\left[\left\lVert y - G(x, z) \right\rVert_1\right], \tag{8}$$
which encourages the conditional generator to learn the forward mapping from $x$ to $y$. The GAN loss in Equation 7 can be thought of as fine-tuning the mapping learned by the L1 loss in Equation 8, resulting in a predictive mapping that better respects the probabilistic dependencies within the labels $y$.
In this paper, we adapt this approach to music transcription tasks and show that we can indeed improve the performance by introducing an adversarial loss to an existing music transcription model.
3 Method
We describe a general method for improving an NN-based transcription model that predicts a two-dimensional target $\hat{Y} = f(X)$ from an input audio representation $X$. Say the original model is trained by minimizing a loss $\mathcal{L}(\hat{Y}, Y)$ between the predicted target $\hat{Y}$ and the ground truth $Y$. The main idea of our method is to adapt pix2pix [21] to this setup by introducing an adversarial discriminator during the training process. The adversarial training objective includes the conditional GAN loss (Equation 7):
$$\min_G \max_D \; \mathcal{L}_{\mathrm{cGAN}}(G, D) + \lambda\, \mathcal{L}(\hat{Y}, Y), \tag{9}$$
where $\lambda$ is a hyperparameter that controls how much the conditional GAN loss contributes to the gradient steps relative to the discriminative loss $\mathcal{L}$. Figure 2 illustrates how the two components are connected in the computation graph and how the loss terms are calculated.

Adversarial training with $\mathcal{L}_{\mathrm{cGAN}}$ allows the model to learn the inter-label dependencies as desired, even when $\mathcal{L}$ is defined only in terms of element-wise operations between $\hat{Y}$ and $Y$, as in Equation 1. In the next subsection, we describe a neural network architecture for the cGAN discriminator that leverages prior knowledge about music.
3.1 Musically Inspired Adversarial Discriminator
Following pix2pix, we use a fully convolutional architecture [30] for the discriminator. By being fully convolutional, the discriminator has translation invariance not only along the time axis (as in HMMs and RNNs) but also along the frequency axis. Since the discriminator determines how realistic a polyphonic note sequence is, the translation invariance enforces that the decision does not depend on the musical key, but only on the relative pitch and time intervals between the notes. This effectively implements a music language model (MLM) [7, 39] and biases the transcription toward more realistic note sequences.
Unlike the image-to-image translation problem, the input representations (e.g. Mel spectrograms) and the output representations (e.g. piano rolls) of a music transcription model can have different dimensions. This makes combining $X$ and $Y$ in a fully convolutional manner difficult. For this reason, we make the discriminator a function of $Y$ only, simplifying the objective in Equation 7 to:
$$\mathcal{L}_{\mathrm{cGAN}}(G, D) = \mathbb{E}_{Y}\left[\log D(Y)\right] + \mathbb{E}_{X}\left[\log\left(1 - D(G(X))\right)\right]. \tag{10}$$
Note that $z$ is also omitted in Equation 10, as we follow [21] and implement the stochasticity of $G$ only in terms of dropout layers [41], without explicitly feeding random noise into the generator. This causes a mode collapse problem where the learned conditional distribution is not diverse enough, but it does not harm our purpose of producing more realistic target representations.
3.2 TTUR and mixup to Stabilize GAN Training
Although an ideal GAN generator can fully reconstruct the data distribution at the global optimum [16]
, training of GANs in practice is notoriously difficult, especially for highdimensional data
[15]. This led to the inventions of a plethora of techniques for stabilizing GAN training, among which we employ the twotimescale update rule (TTUR) [19] and mixup [47]. TTUR means simply setting the generator’s learning rate a few times larger than that of the discriminator, which has been empirically shown to stabilize GAN training significantly.The other technique, mixup
, is an extension to empirical risk minimization where training data samples are drawn from convex interpolations between pairs of empirical data samples. For a pair of featuretarget tuples
and sampled randomly from the empirical distribution, their convex interpolation is given by:(11) 
where , and is the mixup hyperparameter which controls the strength of interpolation. When
, the Beta distribution becomes
which recovers the usual GAN training without mixup.algocf[b!]
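A minimal numpy sketch of the interpolation in Equation 11, applied to a real/fake pair of piano rolls; the array sizes and the explicit fallback for $\alpha = 0$ are our illustrative choices:

```python
import numpy as np

def mixup_pair(y_real, y_fake, alpha, rng):
    """Draw lambda ~ Beta(alpha, alpha) and return the convex combination.

    As alpha -> 0 the Beta draws concentrate on {0, 1}, so we fall back
    to a fair coin flip, recovering training without interpolation.
    """
    lam = rng.beta(alpha, alpha) if alpha > 0 else float(rng.integers(0, 2))
    return lam * y_real + (1 - lam) * y_fake, lam

rng = np.random.default_rng(0)
y_real = np.ones((88, 16))    # binary ground-truth piano roll
y_fake = np.zeros((88, 16))   # a (trivially wrong) prediction
y_mix, lam = mixup_pair(y_real, y_fake, alpha=0.3, rng=rng)
```

With an all-ones real roll and an all-zeros fake roll, every cell of the mixed roll equals $\lambda$, which makes the interpolation easy to inspect.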
mixup is readily applicable to the binary classification task of GAN discriminators. In our conditional GAN setup, we have the additional advantage of paired samples of a real label $Y$ and a fake label $\hat{Y}$, which allows us to replace Equation 10 with:
$$\mathcal{L}_{\mathrm{mixup}}(G, D) = \mathbb{E}_{X, Y, \lambda}\left[\,\ell\!\left(D\!\left(\lambda Y + (1 - \lambda)\hat{Y}\right),\; \lambda\right)\right], \tag{12}$$
where $\ell$ is the binary cross-entropy (BCE) function. With this mixup setup, the discriminator now has to operate on the convex interpolation between the predicted representation and the corresponding ground truth. This makes the discriminator's task even more difficult as the prediction gets close to the ground truth, which is desirable because the discriminator should be inconclusive (i.e. $D = \tfrac{1}{2}$ everywhere) at the global optimum [16].
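The mixup discriminator loss of Equation 12 can be sketched as follows; `bce` and `mixup_d_loss` are our names, and the discriminator here is a stand-in function rather than the convolutional network described later:

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    """Binary cross entropy; note the soft target need not be 0 or 1."""
    pred = np.clip(pred, eps, 1 - eps)
    return -(target * np.log(pred) + (1 - target) * np.log(1 - pred))

def mixup_d_loss(D, y_real, y_fake, lam):
    """Eq. 12: score the interpolation and train toward the soft target lam."""
    y_mix = lam * y_real + (1 - lam) * y_fake
    return bce(D(y_mix), lam)

# Stand-in discriminator: just the mean activation of the piano roll.
loss = mixup_d_loss(lambda y: float(np.mean(y)), np.ones(4), np.zeros(4), 0.3)
```

Because the target is the soft label $\lambda$ rather than a hard 0/1 label, the loss is minimized when the discriminator's output matches the mixing coefficient; once the generator's output matches the ground truth, any interpolation equals the ground truth and the discriminator can do no better than being inconclusive.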
Algorithm 1 details the procedure for training the conditional GAN using mixup, based on Equations 10 and 12. Note that for training the generator network, we perform label flipping in $\ell$ similarly to Equation 5. Also, to train a least-squares GAN (Equation 6) instead, we can simply replace $\ell$ with a mean squared error (MSE) loss.
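The label flipping mentioned above has a simple consequence worth verifying: evaluating the BCE of a fake sample against the real label 1 reproduces the non-saturating generator loss $-\log D(G(z))$ of Equation 5, so a single loss function can serve both networks. A small numpy check (function names are ours):

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    pred = np.clip(pred, eps, 1 - eps)
    return -(target * np.log(pred) + (1 - target) * np.log(1 - pred))

# BCE against the flipped (real) label 1 on a fake sample's score
# collapses to -log(d), i.e. the non-saturating generator loss.
d_fake = 0.2                  # discriminator score on a generated sample
flipped = bce(d_fake, 1.0)    # equals -log(0.2)
```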
Hyperparameter                 Values
Generator learning rate        0.0006
Discriminator learning rate    0.0001
Discriminator loss function    {BCE, MSE}
Batch size                     8
pix2pix weight λ               100
mixup strength α               {0, 0.2, 0.3, 0.4}
Activation threshold           0.5
Training sequence length       327,680
Model (mixup strength α)     | Frame Metrics: F1 / P / R / Etot / Esub / Emiss / Efa | Note Metrics: F1 / P / R | Note Metrics with Offsets: F1 / P / R | Note Metrics with Offsets & Velocity: F1 / P / R
Baseline                     | .899 / .946 / .857 / .179 / .013 / .130 / .036 | .942 / .990 / .899 | .802 / .842 / .765 | .790 / .830 / .755
Non-Saturating GAN (0.3)     | .914 / .931 / .898 / .156 / .012 / .089 / .054 | .956 / .981 / .932 | .813 / .835 / .793 | .802 / .823 / .782
Least-Squares GAN (0.3)      | .906 / .942 / .875 / .167 / .013 / .113 / .042 | .950 / .988 / .916 | .810 / .841 / .781 | .799 / .830 / .771
4 Experimental Setup
To verify the effectiveness of our approach, we compare Onsets and Frames [17], a state-of-the-art piano transcription model, with variants of the same model trained with the adversarial loss. We also evaluate the choice of GAN loss and the mixup strength α.
4.1 Model Architecture
We use the extended Onsets and Frames model [18], which increases the CNN channels to 48/48/96, the LSTM units to 256, and the FC units to 768; the extended model has 26.5 million parameters in total. We do not use the frame loss weights described in [17], in favor of the offset stack introduced in the extended version (see Figure 1). During inference, we first calculate the posteriors corresponding to overlapping chunks of audio with the same length as the training sequences, and perform overlap-add using Hamming windows to obtain the full-length posterior. We do this because the effects of adversarial training do not persist beyond the training sequence length when the recurrent networks are simply run over longer sequences.
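The overlap-add inference described above can be sketched in numpy as follows; the chunk and hop sizes are illustrative rather than the values used in the experiments, and for brevity we show a single posterior row instead of the full two-dimensional posteriorgram:

```python
import numpy as np

def overlap_add(chunks, chunk_len, hop, total_len):
    """Window each chunk with a Hamming window, sum the windowed chunks
    at their positions, and normalize by the summed window weights."""
    window = np.hamming(chunk_len)
    acc = np.zeros(total_len)
    norm = np.zeros(total_len)
    for i, chunk in enumerate(chunks):
        start = i * hop
        acc[start:start + chunk_len] += window * chunk
        norm[start:start + chunk_len] += window
    return acc / np.maximum(norm, 1e-9)

# A constant posterior must be reconstructed as a constant: three
# half-overlapping all-ones chunks cover frames 0..15 exactly.
chunk_len, hop, total = 8, 4, 16
chunks = [np.ones(chunk_len) for _ in range(3)]
out = overlap_add(chunks, chunk_len, hop, total)
```

Normalizing by the accumulated window weights (rather than assuming the windows sum to one) keeps the edges of the sequence, where fewer chunks overlap, correctly scaled.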
The input to the discriminator has two channels, for the onset and frame predictions. The discriminator has five convolutional layers: c32k3s2p1, c64k3s2p1, c128k3s2p1, c256k3s2p1, and c1k5s1p2, where the numbers indicate the number of output channels, the kernel size, the stride, and the padding size, respectively. At each non-final layer, dropout with probability 0.5 and a leaky ReLU activation with negative slope 0.2 are used. The mean of the final layer's output along the time and frequency axes is taken as the discriminator output.
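As a sanity check on this architecture, the standard convolution shape formula lets us trace the pitch axis through the five layers; taking an 88-pitch piano-roll input is our assumption here, based on the 88 piano keys mentioned earlier:

```python
# Standard convolution output-size formula:
#   out = floor((in + 2*pad - kernel) / stride) + 1
def conv_out(size, kernel, stride, pad):
    return (size + 2 * pad - kernel) // stride + 1

# (kernel, stride, pad) per layer: c*k3s2p1 x4, then c1k5s1p2.
layers = [(3, 2, 1)] * 4 + [(5, 1, 2)]

freq = 88                      # pitch bins of the piano-roll input
sizes = [freq]
for k, s, p in layers:
    freq = conv_out(freq, k, s, p)
    sizes.append(freq)
```

The pitch axis shrinks as 88 → 44 → 22 → 11 → 6 through the four strided layers, and the final k5s1p2 layer preserves the size, so the discriminator emits one score map per roughly 16-pitch, 16-frame receptive-field region before the final averaging.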
4.2 Hyperparameters
Table 1 summarizes the hyperparameters used during the experiments, which are mostly taken directly from [17] and [21]. Also following [17], we use Adam [26] and apply a learning rate decay of factor 0.98 every 10,000 iterations, for both the generator and the discriminator. We examine two types of GAN losses, the non-saturating GAN (using the BCE loss) and the least-squares GAN (using the MSE loss). For each GAN loss, multiple values of the mixup strength α are compared with α = 0, i.e. no mixup. Training runs for one million iterations, and the iteration that performs best on the validation set is used for evaluation on the test set.
4.3 Dataset
We use the MAESTRO dataset [18], which contains Disklavier recordings of 1,184 classical piano performances. The dataset consists of 172.3 hours of audio, which are provided with 140.1, 15.3, and 16.9 hours of train/validation/test splits such that recordings of one composition only appear in the same split. We resample the audio to 16 kHz and downmix into a single channel. Following [17], an STFT window of 2,048 samples is used for producing 229bin Mel spectrograms, and a hop length of 32 ms is used. Training sequences sliced at random positions are used, unlike the official implementation which slices training sequences at silence or zero crossings.
4.4 Evaluation Metrics
The Onsets and Frames model performs both frame-level and note-level predictions, and their performance can be evaluated with the standard precision, recall, and F1 metrics. For multi-pitch estimation, we also report the error rate metrics defined in [36], which include the total error, substitution error, miss error, and false alarm error. We use the mir_eval [37] library for all metric calculations. For the note-level metrics, we use the default settings of the library: 50 ms for the onset tolerance, 50 ms or 20% of the note length (whichever is longer) for the offset tolerance, and 0.1 for the velocity tolerance.
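For intuition about the note-level scores, here is a deliberately simplified sketch of onset-based note matching; mir_eval uses an optimal (not greedy) matching and additional offset and velocity tolerances, so this toy version is for illustration only, with all names ours:

```python
def note_prf(ref, est, onset_tol=0.05):
    """Greedy onset matching: an estimated note matches a reference note
    of the same pitch if their onsets differ by at most onset_tol
    seconds; each reference note can be matched at most once."""
    matched = set()
    tp = 0
    for pitch_e, onset_e in est:
        for i, (pitch_r, onset_r) in enumerate(ref):
            if i not in matched and pitch_r == pitch_e \
                    and abs(onset_r - onset_e) <= onset_tol:
                matched.add(i)
                tp += 1
                break
    p = tp / len(est) if est else 0.0
    r = tp / len(ref) if ref else 0.0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1

# (pitch, onset_seconds) pairs: the second estimate is 80 ms late
# (outside the 50 ms tolerance) and the third has the wrong pitch.
ref = [(60, 0.00), (64, 0.50), (67, 1.00)]
est = [(60, 0.02), (64, 0.58), (72, 1.00)]
p, r, f1 = note_prf(ref, est)
```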
5 Results
5.1 Comparison with the Baseline Metrics
Tables 2 and 3 summarize the transcription performance, clearly showing a consistent improvement of the conditional GAN models over the Onsets and Frames baseline. Table 2 shows that both the non-saturating GAN and the least-squares GAN achieve their highest frame and note F1 scores at mixup strength α = 0.3, and both outperform the baseline. Binary piano rolls are easy to distinguish from non-binary predictions, which may cause imbalanced adversarial training; mixup feeds non-binary piano rolls to the discriminator, making its task more challenging and leading to higher performance.
Table 3 shows an important trend in the cGAN results compared to the baseline: the cGAN trades off a bit of precision for a significant improvement in recall. This is a side effect of the cGAN producing more confident predictions, as discussed in the following subsections.
While the percentage differences are moderate, our method achieves statistically significant improvements in the F1 metrics on the MAESTRO test dataset (for all four metrics, under a two-tailed paired t-test). The distribution of per-track improvement in each F1 metric is shown in Figure 6, which indicates that the improvements are evenly distributed across the majority of the tracks. These improvements are especially promising considering that Onsets and Frames is already a very strong baseline.
5.2 Visualization of Frame Activations
To better understand the inner workings of the conditional GAN framework, we visualize the frame posteriorgrams created by the baseline and the best-performing conditional GAN model in Figure 3. In contrast to the baseline posteriorgrams, which have many blurry segments, the posteriorgrams generated by our method mostly contain segments with solid colors, meaning that the model is more confident in its predictions. Figure 6 shows that the proportion of frame activation values in the intermediate range is noticeably higher in the baseline, which makes our model's output less sensitive to the choice of threshold. This is because indecisive predictions are penalized by the discriminator, since they are easy to distinguish from the ground truth, which contains only binary labels. The generator is therefore encouraged to output the most probable note sequences even when it is unsure, rather than producing blurry posteriorgrams that might hamper the decoding process. This allows for an interpretation in which the GAN loss provides a prior for valid onset and frame activations, and the model learns to perform MAP estimation based on this prior.
mixup strength α                      0      0.2    0.3    0.4
Frame F1  (Baseline: 0.899)
  Non-Saturating GAN                  0.664  0.912  0.914  0.907
  Least-Squares GAN                   0.904  0.903  0.906  0.898
Note F1   (Baseline: 0.942)
  Non-Saturating GAN                  0.717  0.953  0.956  0.951
  Least-Squares GAN                   0.944  0.947  0.950  0.943
5.3 Training Dynamics and The Generalization Gap
Figure 6 shows the learning curves for the frame F1 and note F1 scores, with the scores on the training dataset plotted as dotted lines. The validation F1 scores of the baseline stagnate after 300k iterations, while the F1 scores of our model grow steadily until the end of one million iterations. As a result, the generalization gap, i.e. the difference between the training and validation F1 scores, is significantly smaller for the conditional GAN model. This means that the GAN loss works as an effective regularizer that encourages the trained model to generalize better to unseen data, rather than memorizing the note sequences in the training dataset, as LSTMs are known to be capable of doing [46].
6 Conclusions
We have presented an adversarial training method that consistently outperforms the baseline Onsets and Frames model on the standard frame-level and note-level transcription metrics, with visualizations showing that the improved model produces more confident predictions. To achieve this, a discriminator network is trained competitively with the transcription model, i.e. a conditional generator, so that the discriminator serves as a learned regularizer providing a prior for realistic note sequences.
Our results show that modeling the inter-label dependencies in the target distribution is important and brings measurable performance improvements. Our method is generic, and any model that predicts a two-dimensional representation should be able to benefit from including an adversarial loss. Such models are common not only in transcription but also in speech and music synthesis, where spectrograms are predicted as an intermediate representation [38, 24].
Our results do not include the effects of using data augmentation [18], which is orthogonal to our approach and should bring additional performance improvements when applied. As discussed, the discriminator imposes the prior on the target domain whereas data augmentation enriches the input audio distribution. This implies that our method would be less effective when the majority of errors are due to the discrepancy in the audio distribution between the training and test datasets. How to apply adversarial learning for better generalization on the input distribution is a potential future research direction.
References
 [1] Samer A Abdallah and Mark D Plumbley. Unsupervised analysis of polyphonic music by sparse coding. IEEE Transactions on Neural Networks, 17(1):179–196, 2006.
 [2] Emmanouil Benetos and Simon Dixon. Multiple-instrument polyphonic music transcription using a temporally constrained shift-invariant model. The Journal of the Acoustical Society of America, 133(3):1727–1741, 2013.
 [3] Emmanouil Benetos, Simon Dixon, Zhiyao Duan, and Sebastian Ewert. Automatic music transcription: An overview. IEEE Signal Processing Magazine, 36(1):20–30, 2019.
 [4] Emmanouil Benetos, Simon Dixon, Dimitrios Giannoulis, Holger Kirchhoff, and Anssi Klapuri. Automatic music transcription: challenges and future directions. Journal of Intelligent Information Systems, 41(3):407–434, 2013.
 [5] Rachel M Bittner, Brian McFee, Justin Salamon, Peter Li, and Juan Pablo Bello. Deep salience representations for f0 estimation in polyphonic music. In Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference, pages 63–70, 2017.
 [6] Sebastian Böck and Markus Schedl. Polyphonic piano note transcription with recurrent neural networks. In Proceedings of the IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 121–124, 2012.
 [7] Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In Proceedings of the International Conference on Machine Learning (ICML), 2012.
 [8] Antonia Creswell, Tom White, Vincent Dumoulin, Kai Arulkumaran, Biswa Sengupta, and Anil A Bharath. Generative adversarial networks: An overview. IEEE Signal Processing Magazine, 35(1):53–65, 2018.
 [9] Hao-Wen Dong, Wen-Yi Hsiao, Li-Chia Yang, and Yi-Hsuan Yang. MuseGAN: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
 [10] Alexey Dosovitskiy and Thomas Brox. Generating images with perceptual similarity metrics based on deep networks. In Advances in Neural Information Processing Systems, pages 658–666, 2016.
 [11] Jesse Engel, Kumar Krishna Agrawal, Shuo Chen, Ishaan Gulrajani, Chris Donahue, and Adam Roberts. GANSynth: Adversarial neural audio synthesis. arXiv preprint arXiv:1902.08710, 2019.
 [12] Sebastian Ewert and Mark Sandler. Piano transcription in the studio using an extensible alternating directions framework. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(11):1983–1997, 2016.
 [13] Cédric Févotte and Jérôme Idier. Algorithms for nonnegative matrix factorization with the β-divergence. Neural Computation, 23(9):2421–2456, 2011.
 [14] Benoit Fuentes, Roland Badeau, and Gaël Richard. Harmonic adaptive latent component analysis of audio and application to music transcription. IEEE Transactions on Audio, Speech, and Language Processing, 21(9):1854–1866, 2013.
 [15] Ian Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.
 [16] Ian Goodfellow, Jean PougetAbadie, Mehdi Mirza, Bing Xu, David WardeFarley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
 [17] Curtis Hawthorne, Erich Elsen, Jialin Song, Adam Roberts, Ian Simon, Colin Raffel, Jesse Engel, Sageev Oore, and Douglas Eck. Onsets and frames: Dualobjective piano transcription. In Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference, pages 50–57, 2018.
 [18] Curtis Hawthorne, Andrew Stasyuk, Adam Roberts, Ian Simon, ChengZhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse Engel, and Douglas Eck. Enabling factorized piano music modeling and generation with the MAESTRO dataset. In Proceedings of the International Conference on Learning Representations (ICLR), 2019.
 [19] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two timescale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.
 [20] Geoffrey E Hinton, Simon Osindero, and YeeWhye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
 [21] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1125–1134, 2017.
 [22] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, 2018.
 [23] Rainer Kelz, Matthias Dorfer, Filip Korzeniowski, Sebastian Böck, Andreas Arzt, and Gerhard Widmer. On the potential of simple framewise approaches to piano transcription. In Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference, pages 475–481, 2016.
 [24] Jong Wook Kim, Rachel Bittner, Aparna Kumar, and Juan Pablo Bello. Neural music synthesis for flexible timbre control. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.
 [25] Jong Wook Kim, Justin Salamon, Peter Li, and Juan Pablo Bello. CREPE: A convolutional representation for pitch estimation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 161–165, 2018.
 [26] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, (ICLR), 2015.
 [27] Hugo Larochelle and Iain Murray. The neural autoregressive distribution estimator. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 29–37, 2011.
 [28] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.
 [29] Daniel D Lee and H Sebastian Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, pages 556–562, 2001.
 [30] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3431–3440, 2015.
 [31] Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2794–2802, 2017.
 [32] Matthias Mauch and Simon Dixon. pYIN: A fundamental frequency estimator using probabilistic threshold distributions. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 659–663. IEEE, 2014.
 [33] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
 [34] Juhan Nam, Jiquan Ngiam, Honglak Lee, and Malcolm Slaney. A classificationbased polyphonic piano transcription approach using learned feature representations. In Proceedings of the 12th International Society for Music Information Retrieval (ISMIR) Conference, pages 175–180, 2011.
 [35] Yizhao Ni, Matt McVicar, Raul SantosRodriguez, and Tijl De Bie. An endtoend machine learning system for harmonic analysis of music. IEEE Transactions on Audio, Speech, and Language Processing, 20(6):1771–1783, 2012.
 [36] Graham E Poliner and Daniel PW Ellis. A discriminative model for polyphonic piano transcription. EURASIP Journal on Advances in Signal Processing, 2007(1):048317, 2006.
 [37] Colin Raffel, Brian McFee, Eric J Humphrey, Justin Salamon, Oriol Nieto, Dawen Liang, and Daniel PW Ellis. mir_eval: A transparent implementation of common MIR metrics. In Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference, 2014.
 [38] Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, Rj SkerrvRyan, et al. Natural TTS synthesis by conditioning WaveNet on Mel spectrogram predictions. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4779–4783. IEEE, 2018.
 [39] Siddharth Sigtia, Emmanouil Benetos, and Simon Dixon. An endtoend neural network for polyphonic piano music transcription. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(5):927–939, 2016.
 [40] Paris Smaragdis and Judith C Brown. Nonnegative matrix factorization for polyphonic music transcription. In 2003 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 177–180, 2003.
 [41] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
 [42] Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pages 4790–4798, 2016.
 [43] Aäron Van Den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In Proceedings of the International Conference on Machine Learning (ICML), pages 1747–1756, 2016.
 [44] Emmanuel Vincent, Nancy Bertin, and Roland Badeau. Adaptive harmonic spectral decomposition for multiple pitch estimation. IEEE Transactions on Audio, Speech, and Language Processing, 18(3):528–537, 2010.
 [45] LiChia Yang, SzuYu Chou, and YiHsuan Yang. Midinet: A convolutional generative adversarial network for symbolicdomain music generation. In Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference, pages 324–331, 2017.
 [46] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.
 [47] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David LopezPaz. mixup: Beyond empirical risk minimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.