SPICE: Self-supervised Pitch Estimation

by Beat Gfeller, et al.

We propose a model to estimate the fundamental frequency in monophonic audio, often referred to as pitch estimation. We acknowledge the fact that obtaining ground truth annotations at the required temporal and frequency resolution is a particularly daunting task. Therefore, we propose to adopt a self-supervised learning technique, which is able to estimate (relative) pitch without any form of supervision. The key observation is that pitch shift maps to a simple translation when the audio signal is analysed through the lens of the constant-Q transform (CQT). We design a self-supervised task by feeding two shifted slices of the CQT to the same convolutional encoder, and require that the difference in the outputs is proportional to the corresponding difference in pitch. In addition, we introduce a small model head on top of the encoder, which is able to determine the confidence of the pitch estimate, so as to distinguish between voiced and unvoiced audio. Our results show that the proposed method is able to estimate pitch at a level of accuracy comparable to fully supervised models, both on clean and noisy audio samples, yet it does not require access to large labeled datasets.




I Introduction

Pitch represents the perceptual property of sound that allows ordering based on frequency, i.e., distinguishing between high and low sounds. For example, our auditory system is able to recognize a melody by tracking the relative pitch differences along time. Pitch is often confused with the fundamental frequency ($f_0$), i.e., the frequency of the lowest harmonic. However, the former is a perceptual property, while the latter is a physical property of the underlying audio signal. Despite this important difference, outside the field of psychoacoustics pitch and fundamental frequency are often used interchangeably, and we will not make an explicit distinction within the scope of this paper. A comprehensive treatment of the psychoacoustic aspects of pitch perception is given in [20].

Pitch estimation in monophonic audio has received a great deal of attention over the past decades, due to its central importance in several domains, ranging from music information retrieval to speech analysis. Traditionally, simple signal processing pipelines were proposed, working either in the time domain [10, 4, 31, 8], in the frequency domain [6] or both [25, 17], often followed by post-processing algorithms to smooth the pitch trajectories [22, 28].

Until recently, machine learning methods had not been able to outperform hand-crafted signal processing pipelines targeting pitch estimation. This was due to the lack of annotated data, which is particularly tedious and difficult to obtain at the temporal and frequency resolution required to train fully supervised models. To overcome these limitations, a synthetically generated dataset was proposed in [26], obtained by re-synthesizing monophonic music tracks while setting the fundamental frequency to the target ground truth. Using this training data, the CREPE algorithm [18] was able to achieve state-of-the-art results when evaluated on the same dataset, outperforming signal processing baselines, especially under noisy conditions.

In this paper we address the problem of lack of annotated data from a different angle. Specifically, we rely on self-supervision, i.e., we define an auxiliary task (also known as a pretext task) which can be learned in a completely unsupervised way. To devise this task, we started from the observation that for humans, including professional musicians, it is typically much easier to estimate relative pitch, related to the frequency interval between two notes, than absolute pitch, related to the actual fundamental frequency [36]. Therefore, we design SPICE (Self-supervised PItCh Estimation) to solve a similar task. More precisely, our network architecture consists of a convolutional encoder which produces a single scalar embedding. We aim at learning a model that linearly maps this scalar value to pitch, when the latter is expressed in a logarithmic scale, i.e., in units of semitones of an equally tempered chromatic scale. To do this, we feed two versions of the same signal to the encoder, one being a pitch-shifted version of the other by a random but known amount. Then, we devise a loss function that forces the difference between the scalar embeddings to be proportional to the known difference in pitch. For convenience, we perform pitch shifting in the domain defined by the constant-Q transform, because this corresponds to a simple translation along the log-spaced frequency axis. Upon convergence, the model is able to estimate relative pitch. To translate this output to an absolute pitch scale we apply a simple calibration step against ground truth data. Since we only need to estimate a single scalar offset, a very small annotated dataset can be used for this purpose.

Another important aspect of pitch estimation is determining whether the underlying signal is voiced or unvoiced. Instead of relying on handcrafted thresholding mechanisms, we augment the model in such a way that it can learn the level of confidence of the pitch estimation. Namely, we add a simple fully connected layer that receives as input the penultimate layer of the encoder and produces a second scalar value which is trained to match the pitch estimation error.

As an illustration, Figure 1 shows the CQT frames of one of the evaluation datasets (MIR-1k [14]), which are considered to be voiced and sorted by the pitch estimated by SPICE.

In summary, this paper makes the following key contributions:

  • We propose a self-supervised (relative) pitch estimation model, which can be trained without having access to any labelled dataset.

  • We incorporate a self-supervised mechanism to estimate the confidence of the pitch estimation, which can be directly used for voicing detection.

  • We evaluate our model against two publicly available monophonic datasets and show that in both cases we outperform handcrafted baselines, while matching the level of accuracy attained by CREPE, despite having no access to ground truth labels.

  • We also train and evaluate our model in noisy conditions, where background music is present in addition to monophonic singing, and show that in this case too we match the level of accuracy obtained by CREPE.

The rest of this paper is organized as follows. Section II contrasts the proposed method against the existing literature. Section III illustrates the proposed method, which is evaluated in Section IV. Conclusions and future remarks are discussed in Section V.

II Related work

Pitch estimation: Traditional pitch estimation algorithms are based on hand-crafted signal processing pipelines, working in the time and/or frequency domain. The most common time-domain methods are based on the analysis of local maxima of the auto-correlation function (ACF) [10]. These approaches are known to be prone to octave errors, because the peaks of the ACF repeat at different lags. Therefore, several methods were introduced to be more robust to such errors, including, e.g., the PRAAT [4] and RAPT [31] algorithms. An alternative approach is pursued by the YIN algorithm [8], which looks for the local minima of the Normalized Mean Difference Function (NMDF), to avoid octave errors caused by signal amplitude changes. Different frequency-domain methods were also proposed, based, e.g., on spectral peak picking [21] or template matching with the spectrum of a sawtooth waveform [6]. Other approaches combine both time-domain and frequency-domain processing, like the Aurora algorithm [25] and the nearly defect-free F0 estimation algorithm [17]. Comparative analyses including most of the aforementioned approaches have been conducted on speech [16, 29], singing voices [1] and musical instruments [33]. Machine learning models for pitch estimation in speech were proposed in [13, 19]. The method in [13] first extracts hand-crafted spectral domain features, and then adopts a neural network (either a multi-layer perceptron or a recurrent neural network) to compute the estimated pitch. In [19], a consensus of other pitch trackers is used to get ground truth, and a multi-layer perceptron classifier is trained on the principal components of the autocorrelations of subbands from an auditory filterbank. More recently the CREPE [18] model was proposed, an end-to-end convolutional neural network which consumes audio directly in the time domain. The network is trained in a fully supervised fashion, minimizing the cross-entropy loss between the ground truth pitch annotations and the output of the model. In our experiments, we compare our results with CREPE, which is the current state-of-the-art.

Pitch confidence estimation: Most of the aforementioned methods also provide a voiced/unvoiced decision, often based on heuristic thresholds applied to hand-crafted features. However, the confidence of the estimated pitch in the voiced case is seldom provided. A few exceptions are CREPE [18], which produces a confidence score computed from the activations of the last layer of the model, and [9], which directly addresses this problem, by training a neural network based on hand-crafted features to estimate the confidence of the estimated pitch. In contrast, in our work we explicitly augment the proposed model with a head aimed at estimating confidence in a fully unsupervised way.

Pitch tracking and polyphonic audio: Often, post-processing is applied to raw pitch estimates to smoothly track pitch contours over time. For example, [5] applies Kalman filtering to smooth the output of a hybrid spectro-temporal autocorrelation method, while the pYIN algorithm [22] builds on top of YIN, by applying Viterbi decoding to a sequence of soft pitch candidates. A similar smoothing algorithm is also used in the publicly released version of CREPE [18]. Pitch extraction in the case of polyphonic audio remains an open research problem [27]. In this case, pitch tracking is even more important to be able to distinguish the different melody lines [28]. A machine learning model targeting the estimation of multiple fundamental frequencies, melody, vocal and bass line was recently proposed in [2].

Self-supervised learning: The widespread success of fully supervised models was stimulated by the availability of annotated datasets. In those cases in which labels are scarce or simply not available, self-supervised learning has emerged as a promising approach for pre-training deep convolutional networks both for vision [24, 34, 32] and audio-related tasks [15, 30, 23]. Somewhat related to our paper are those methods that try to use self-supervision to obtain point disparities between pairs of images [7], where shifts in the spatial domain play the role of shifts in the log-frequency domain.

Fig. 2: SPICE model architecture.

III Methods

Audio frontend

The proposed pitch estimation model receives as input an audio track of arbitrary length and produces as output a timeseries of estimated pitch frequencies, together with an indication of the confidence of the estimates. The latter is used to discriminate between unvoiced frames, in which pitch is not well defined, and voiced frames.

To better illustrate our method, let us first introduce a continuous-time model of an ideal harmonic signal, that is:

$$x(t) = \sum_{k=1}^{K} a_k \sin(2\pi k f_0 t + \phi_k), \tag{1}$$

where $f_0$ denotes the fundamental frequency and $f_k = k f_0$, $k = 2, \ldots, K$, its higher order harmonics. The modulus of the Fourier transform is given by

$$|X(f)| = \sum_{k=1}^{K} \frac{a_k}{2} \left[ \delta(f - k f_0) + \delta(f + k f_0) \right], \tag{2}$$

where $\delta$ is the Dirac delta function. Therefore, the modulus consists of spectral peaks at integer multiples of the fundamental frequency $f_0$. When the signal is pitch-shifted by a factor of $\alpha$, these spectral peaks move to $\alpha k f_0$. If we apply a logarithmic transformation to the frequency axis, $\log(\alpha k f_0) = \log \alpha + \log(k f_0)$, i.e., pitch-shifting results in a simple translation in the log-frequency domain.

This very simple and well known result is at the core of the proposed model. Namely, we preprocess the input audio track with a frontend that computes the constant-Q transform (CQT). In the CQT domain, frequency bins are logarithmically spaced, as the center frequencies obey the following relationship:

$$f_k = f_{\text{base}} \, 2^{(k-1)/Q}, \quad k = 1, \ldots, F_{\max}, \tag{3}$$

where $f_{\text{base}}$ is the frequency of the lowest frequency bin, $Q$ is the number of bins per octave, and $F_{\max}$ is the number of frequency bins. Given an input audio track, the CQT produces a matrix $X$ of size $T \times F_{\max}$, where $T$ depends on the selected hop length. Note that the frequency bins are logarithmically spaced. Therefore, if the input audio track is pitch-shifted by a factor $\alpha$, this results in a translation of $\Delta k = Q \log_2 \alpha$ bins in the CQT domain.
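The translation property can be checked numerically. The following is a minimal sketch, not the authors' frontend: it does not compute a CQT, it only maps the harmonics of an ideal signal to their nearest log-spaced bin before and after a pitch shift. The values of `F_BASE`, `f0` and the helper names are illustrative assumptions, and bins are indexed from 0 for convenience.

```python
import math

def cqt_center_frequencies(f_base, bins_per_octave, n_bins):
    """Log-spaced center frequencies, 0-based: f_base * 2**(k / Q)."""
    return [f_base * 2.0 ** (k / bins_per_octave) for k in range(n_bins)]

def shift_in_bins(alpha, bins_per_octave):
    """Translation (in bins) caused by pitch-shifting by a factor alpha."""
    return bins_per_octave * math.log2(alpha)

def nearest_bin(freq, f_base, bins_per_octave):
    """Index of the bin whose center is closest to freq on a log scale."""
    return round(bins_per_octave * math.log2(freq / f_base))

Q = 24                  # bins per octave, i.e. half-semitone resolution
F_BASE = 32.70          # Hz, illustrative lowest-bin frequency (assumption)
f0 = 220.0              # Hz, illustrative fundamental (assumption)
alpha = 2.0 ** (3 / 12) # pitch shift of three semitones upwards

harmonics = [k * f0 for k in range(1, 5)]
before = [nearest_bin(f, F_BASE, Q) for f in harmonics]
after = [nearest_bin(alpha * f, F_BASE, Q) for f in harmonics]

# Every harmonic moves by the same number of bins: a pure translation.
offsets = [a - b for a, b in zip(after, before)]
print(offsets, shift_in_bins(alpha, Q))
```

All harmonics move by the same number of bins, which is what allows a shift of the slice offset to stand in for a pitch shift of the audio.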

Pitch estimation

The proposed model architecture is illustrated in Figure 2. Starting from the observation above, the model computes the modulus of the CQT $|X|$, and from each temporal frame $t = 1, \ldots, T$ (where $T$ is equal to the batch size during training) it extracts two random slices $x_{t,1}, x_{t,2} \in \mathbb{R}^{F}$, spanning the range of CQT bins $[k_{t,i}, k_{t,i} + F]$, $i = 1, 2$, where $F$ is the number of CQT bins in the slice and the offsets $k_{t,i}$ are sampled from a uniform distribution, i.e., $k_{t,i} \sim \mathcal{U}(k_{\min}, k_{\max})$. Then, each vector is fed to the same encoder to produce a single scalar $y_{t,i} = \mathrm{Enc}(x_{t,i})$. The encoder is a neural network with convolutional layers followed by two fully-connected layers. Further details about the model architecture are provided in Section IV.
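The slice extraction described above can be sketched in a few lines. This is an illustrative sketch only: the frame is a toy list with a single spectral peak, offsets are assumed to be integers drawn uniformly, and the sizes (`n_bins`, `slice_len`, `k_min`, `k_max`) are hypothetical values chosen for the example.

```python
import random

def extract_slice_pair(frame, slice_len, k_min, k_max, rng):
    """Cut two slices of `slice_len` bins from one CQT frame.

    Returns ((x1, k1), (x2, k2)), where k1 and k2 are the integer
    offsets drawn uniformly from [k_min, k_max] (both inclusive).
    """
    k1 = rng.randint(k_min, k_max)
    k2 = rng.randint(k_min, k_max)
    x1 = frame[k1:k1 + slice_len]
    x2 = frame[k2:k2 + slice_len]
    return (x1, k1), (x2, k2)

# Hypothetical sizes, chosen so that k_max + slice_len <= n_bins.
n_bins, slice_len, k_min, k_max = 160, 128, 0, 16

# Toy CQT frame: silence everywhere except one peak at bin 70.
frame = [0.0] * n_bins
frame[70] = 1.0

rng = random.Random(0)
(x1, k1), (x2, k2) = extract_slice_pair(frame, slice_len, k_min, k_max, rng)

# Inside each slice the peak sits at (70 - offset), so the two views
# differ by a known translation of (k2 - k1) bins.
peak_shift = x1.index(1.0) - x2.index(1.0)
print(k1, k2, peak_shift)
```

The two slices therefore look like the same sound at two pitches whose difference is known, without resynthesizing any audio.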

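We design our main loss in such a way that the encoder output $y_{t,i}$ is encouraged to encode pitch. First, we define the relative pitch error as

$$e_t = \left| (y_{t,1} - y_{t,2}) - \sigma (k_{t,1} - k_{t,2}) \right|. \tag{4}$$

Then, the loss is defined as the Huber norm of the pitch error, that is:

$$\mathcal{L}_{\text{pitch}} = \frac{1}{T} \sum_{t} h(e_t), \tag{5}$$

where

$$h(x) = \begin{cases} \dfrac{x^2}{2}, & |x| \le \tau, \\[4pt] \dfrac{\tau^2}{2} + \tau (|x| - \tau), & \text{otherwise.} \end{cases} \tag{6}$$

The pitch difference scaling factor $\sigma$ is adjusted in such a way that $y_t \in [0, 1]$ when pitch is in the range $[f_{\min}, f_{\max}]$, namely:

$$\sigma = \frac{1}{Q \log_2 (f_{\max} / f_{\min})}. \tag{7}$$

The values of $f_{\min}$ and $f_{\max}$ are determined based on the range of pitch frequencies spanned by the training set. In our experiments we found that the Huber loss makes the model less sensitive to the presence of unvoiced frames in the training dataset, for which the relative pitch error can be large, as pitch is not well defined in this case.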
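The pitch loss can be written out as a small sketch. It assumes the relative pitch error is the absolute difference between the encoder output difference and the scaled offset difference, and that the scaling factor maps a pitch range of `f_min` to `f_max` onto the unit interval; the sign convention, the threshold `tau` and all numeric values below are illustrative assumptions rather than the authors' settings.

```python
import math

def huber(x, tau):
    """Huber function: quadratic for |x| <= tau, linear beyond."""
    ax = abs(x)
    if ax <= tau:
        return 0.5 * x * x
    return 0.5 * tau * tau + tau * (ax - tau)

def sigma_from_range(bins_per_octave, f_min, f_max):
    """Scaling factor so that [f_min, f_max] spans one unit of output."""
    return 1.0 / (bins_per_octave * math.log2(f_max / f_min))

def pitch_loss(y1, y2, k1, k2, sigma, tau):
    """Mean Huber loss of the relative pitch errors; also returns the errors."""
    errors = [
        abs((a - b) - sigma * (ka - kb))
        for a, b, ka, kb in zip(y1, y2, k1, k2)
    ]
    loss = sum(huber(e, tau) for e in errors) / len(errors)
    return loss, errors

# Illustrative range of four octaves at 24 bins per octave.
sigma = sigma_from_range(24, 55.0, 880.0)
tau = 0.25 * sigma  # hypothetical threshold

k1, k2 = [0, 4, 8], [8, 4, 2]
# An ideal encoder: outputs differ exactly by sigma times the offset difference.
y2 = [0.50, 0.30, 0.70]
y1 = [b + sigma * (ka - kb) for b, ka, kb in zip(y2, k1, k2)]

ideal_loss, ideal_errors = pitch_loss(y1, y2, k1, k2, sigma, tau)
print(sigma, ideal_loss)
```

Because the Huber function grows only linearly beyond `tau`, frames with very large errors (typically unvoiced ones) contribute bounded gradients.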

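In addition to $\mathcal{L}_{\text{pitch}}$, we also use the following reconstruction loss

$$\mathcal{L}_{\text{recon}} = \frac{1}{T} \sum_{t} \left( \| x_{t,1} - \hat{x}_{t,1} \|_2^2 + \| x_{t,2} - \hat{x}_{t,2} \|_2^2 \right), \tag{8}$$

where $\hat{x}_{t,i}$, $i = 1, 2$, is a reconstruction of the input frame $x_{t,i}$ obtained by feeding the encoder output $y_{t,i}$ into a decoder, $\hat{x}_{t,i} = \mathrm{Dec}(y_{t,i})$. Therefore, the overall loss is defined as:

$$\mathcal{L} = w_{\text{pitch}} \mathcal{L}_{\text{pitch}} + w_{\text{recon}} \mathcal{L}_{\text{recon}}, \tag{9}$$

where $w_{\text{pitch}}$ and $w_{\text{recon}}$ are scalar weights that determine the relative importance assigned to the two loss components.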

Given the way it is designed, the proposed model can only estimate relative pitch differences. The absolute pitch of an input frame is obtained by applying an affine mapping to the encoder output $y_t$:

$$\hat{p}_{0,t} = b + s \cdot y_t \quad \text{[semitones]}, \tag{10}$$

which depends on two parameters. We consider two cases: i) estimating only the intercept $b$, while keeping the slope $s$ fixed; ii) estimating both the intercept $b$ and the slope $s$. This is the only place where our method requires access to ground truth labels. However, we can observe that: i) only very few labelled samples are needed, as only one or two parameters need to be estimated; ii) synthetically generated labelled samples could be used for this purpose; iii) some applications (e.g., matching melodies played in different keys) might require only relative pitch. Section IV provides further details on the robustness to the calibration process.
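The calibration step amounts to fitting one or two parameters of a straight line. The sketch below uses ordinary least squares for both cases; the choice of estimator, the function names and the toy numbers are assumptions made for illustration, since the text does not specify how the parameters are fitted.

```python
def fit_intercept(y, pitch, slope):
    """Least-squares intercept b for pitch ~ b + slope * y, slope fixed."""
    return sum(p - slope * v for v, p in zip(y, pitch)) / len(y)

def fit_affine(y, pitch):
    """Ordinary least squares for pitch ~ b + s * y; returns (b, s)."""
    n = len(y)
    mean_y = sum(y) / n
    mean_p = sum(pitch) / n
    sxx = sum((v - mean_y) ** 2 for v in y)
    sxy = sum((v - mean_y) * (p - mean_p) for v, p in zip(y, pitch))
    s = sxy / sxx
    b = mean_p - s * mean_y
    return b, s

# Toy labelled frames: encoder outputs and ground-truth pitch in semitones,
# generated from a hypothetical mapping with b = 36 and s = 60.
y = [0.1, 0.2, 0.4, 0.5]
pitch = [36.0 + 60.0 * v for v in y]

b_only = fit_intercept(y, pitch, slope=60.0)  # case i: slope kept fixed
b, s = fit_affine(y, pitch)                   # case ii: both estimated
print(b_only, b, s)
```

With only one or two free parameters, a handful of labelled frames is enough to pin down the mapping, which is why the calibration set can be very small.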

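Note that pitch in (10) is expressed in semitones and it can be converted to frequency (in Hz) by:

$$\hat{f}_{0,t} = f_{\text{base}} \, 2^{\hat{p}_{0,t} / 12}, \tag{11}$$

where $\hat{p}_{0,t}$ is the estimated pitch in semitones and $f_{\text{base}}$ is the frequency of the lowest CQT bin.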
Confidence estimation

In addition to the estimated pitch $\hat{p}_{0,t}$, we design our model such that it also produces a confidence level $c_t \in [0, 1]$. Indeed, when the input audio is voiced we expect to produce high confidence estimates, while when it is unvoiced pitch is not well defined and the output confidence should be low.

To achieve this, we design the encoder architecture to have two heads on top of the convolutional layers, as illustrated in Figure 2. The first head consists of two fully-connected layers and produces the pitch estimate $y_t$. The second head consists of a single fully-connected layer and produces the confidence level $c_t$. To train the latter, we add the following loss:

$$\mathcal{L}_{\text{conf}} = \frac{1}{T} \sum_{t} \left( \left| (1 - c_{t,1}) - e_t / \sigma \right|^2 + \left| (1 - c_{t,2}) - e_t / \sigma \right|^2 \right), \tag{12}$$

where $e_t$ is the relative pitch error and $\sigma$ is the pitch difference scaling factor. This way the model will produce high confidence $c_t \sim 1$ when the model is able to correctly estimate the pitch difference between the two input slices. At the same time, given that our primary goal is to accurately estimate pitch, during the backpropagation step we stop the gradients so that $\mathcal{L}_{\text{conf}}$ only influences the training of the confidence head and does not affect the other layers of the encoder architecture.
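A possible form of the confidence objective is sketched below. It assumes that the confidence head is trained so that one minus the confidence matches the relative pitch error divided by the scaling factor; this target, the function name and the toy values are assumptions for illustration. Plain Python has no gradients, so the stop-gradient is only indicated in a comment: in an automatic differentiation framework the error term would be wrapped in a stop-gradient operation so that this loss updates the confidence head only.

```python
def confidence_loss(c1, c2, errors, sigma):
    """Mean squared mismatch between (1 - confidence) and the scaled error.

    `errors` holds the relative pitch error of each slice pair. It is used
    here as a constant target (this is where gradients would be stopped).
    """
    total = 0.0
    for a, b, e in zip(c1, c2, errors):
        target = e / sigma  # treated as a constant during backpropagation
        total += ((1.0 - a) - target) ** 2 + ((1.0 - b) - target) ** 2
    return total / len(errors)

sigma = 1.0 / 96.0                  # illustrative scaling factor
errors = [0.0, 0.5 * sigma]         # a perfect pair and a mediocre pair

# Confidences that exactly mirror the errors give zero loss...
matched = confidence_loss([1.0, 0.5], [1.0, 0.5], errors, sigma)
# ...while overconfident outputs on the second pair are penalised.
overconfident = confidence_loss([1.0, 1.0], [1.0, 1.0], errors, sigma)
print(matched, overconfident)
```

Under this target, frames on which the pitch head is consistently wrong (for example unvoiced frames) are pushed towards low confidence.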

                                 Length                  # of frames
Dataset          # of tracks   min    max    total    voiced   total
MIR-1k           1000          3s     12s    133m     175k     215k
MDB-stem-synth   230           2s     565s   418m     784k     1.75M
SingingVoices    88            25s    298s   185m     194k     348k
TABLE I: Dataset specifications.

Fig. 3: Range of pitch values covered by the different datasets: (a) MIR-1k, (b) MDB-stem-synth, (c) SingingVoices.

Handling background music

The accuracy of pitch estimation can be severely affected when dealing with noisy conditions. These emerge, for example, when the singing voice is superimposed over background music. In this case, we are faced with polyphonic audio and we want the model to focus only on the singing voice source. To deal with these conditions, we introduce a data augmentation step in our training setup. More specifically, we mix the clean singing voice signal with the corresponding instrumental backing track at different signal-to-noise ratios (SNRs). Interestingly, we found that simply augmenting the training data was not sufficient to achieve a good level of robustness. Instead, we also modified the definition of the loss functions as follows. Let $x^{c}$ and $x^{n}$ denote, respectively, the CQT of the clean and noisy input samples. Similarly, $y^{c}$ and $y^{n}$ denote the corresponding outputs of the encoder. The pitch error loss is modified by averaging four different variants of the error, that is:

$$e_t^{(a,b)} = \left| (y^{a}_{t,1} - y^{b}_{t,2}) - \sigma (k_{t,1} - k_{t,2}) \right|, \quad a, b \in \{c, n\}, \qquad \mathcal{L}_{\text{pitch}} = \frac{1}{4T} \sum_{t} \sum_{a, b \in \{c, n\}} h\left(e_t^{(a,b)}\right). \tag{13}$$
The reconstruction loss is also modified, so that the decoder is asked to reconstruct the clean samples only. That is:


The rationale behind this approach is that the encoder is induced to represent in its output only the information relative to the clean input audio samples, thus learning to denoise the input by separating the singing voice from noise.

IV Experiments

Model parameters

First we provide the details of the default parameters used in our model. The input audio track is sampled at 16 kHz. The CQT frontend is parametrized to use $Q = 24$ bins per octave, so as to achieve a resolution equal to one half-semitone per bin. We set the lowest CQT frequency $f_{\text{base}}$ equal to the frequency of the note , i.e., Hz and we compute up to CQT bins, i.e., to cover the range of frequency up to Nyquist. The hop length is set equal to 512 samples, i.e., one CQT frame every 32 ms. During training, we extract slices of $F = 128$ CQT bins, setting and . The Huber threshold is set to and the loss weights equal to, respectively, and . We increased the weight of the pitch-shift loss to when training with background music.

The encoder receives as input a 128-dimensional vector corresponding to a sliced CQT frame and produces as output two scalars representing, respectively, pitch and confidence. The model architecture consists of convolutional layers. We use filters of size and stride equal to . The number of channels is equal to , where for the encoder and for the decoder. Each convolution is followed by batch normalization and a ReLU non-linearity. Max-pooling of size and stride is applied at the output of each layer. Hence, after flattening the output of the last convolutional layer we obtain an embedding of size elements. This is fed into two different heads. The pitch estimation head consists of two fully-connected layers with, respectively, 48 and 1 units. The confidence head consists of a single fully-connected layer with 1 output unit. The total number of parameters of the encoder is equal to 2.38M. Note that we do not apply any form of temporal smoothing to the output of the model.

The model is trained using Adam with default hyperparameters and learning rate equal to . The batch size is set to . During training, the CQT frames of the input audio tracks are shuffled, so that the frames in a batch are likely to come from different tracks.

Fig. 4: Raw Pitch Accuracy: (a) MIR-1k, (b) MDB-stem-synth.

                                         MIR-1k                 MDB-stem-synth
Model        # params  Trained on        RPA (CI 95%)   VRR     RPA (CI 95%)
SWIPE        -         -                 86.6%          -       90.7%
CREPE tiny   487k      many              90.7%          88.9%   93.1%
CREPE full   22.2M     many              90.1%          84.6%   92.7%
SPICE        2.38M     SingingVoices                    86.8%
SPICE        180k      SingingVoices                    90.5%
TABLE II: Evaluation results.

Fig. 5: Pitch error on the MIR-1k dataset, conditional on ground truth pitch and model confidence (panel (b): CREPE full).


Datasets

We use three datasets in our experiments, whose details are summarized in Table I. The MIR-1k [14] dataset contains 1000 audio tracks with people singing Chinese pop songs. The dataset is annotated with pitch at a granularity of 10 ms and it also contains voiced/unvoiced frame annotations. It comes with two stereo channels representing, respectively, the singing voice and the accompaniment music. The MDB-stem-synth dataset [26] includes re-synthesized monophonic music played with a variety of musical instruments. This dataset was used to train the CREPE model in [18]. In this case, pitch annotations are available at a granularity of 29 ms. Given the mismatch of the sampling period of the pitch annotations across datasets, we resample the pitch time-series with a period equal to the hop length of the CQT, i.e., 32 ms. In addition to these publicly available datasets, we also collected in-house the SingingVoices dataset, which contains 88 audio tracks of people singing a variety of pop songs, for a total of 185 minutes.

Figure 3 illustrates the empirical distribution of pitch values. For SingingVoices, there are no ground-truth pitch labels, so we used the output of CREPE (configured with full model capacity and enabling Viterbi smoothing) as a surrogate. We observe that MDB-stem-synth spans a significantly larger range of frequencies (approx. 5 octaves) than MIR-1k and SingingVoices (approx. 3 octaves).

We trained SPICE using either SingingVoices or MIR-1k and used both MIR-1k (singing voice channel only) and MDB-stem-synth to evaluate models in clean conditions. To handle background music, we repeated training on MIR-1k, but this time applying data augmentation by mixing in backing tracks with a SNR uniformly sampled from [-5dB, 25dB]. For the evaluation, we used the MIR-1k dataset, mixing the available backing tracks at different levels of SNR, namely 20dB, 10dB and 0dB. In all cases, we apply data augmentation during training, by pitch-shifting the input audio tracks by an amount in semitones uniformly sampled in the set .

Fig. 6: Pitch error on the MDB-stem-synth dataset, conditional on ground truth pitch and model confidence (panel (b): CREPE full).

Fig. 7: Voicing Detection - ROC (MIR-1k).

Model        # params  Trained on        clean   20dB    10dB    0dB
SWIPE        -         -                 86.6%   84.3%   69.5%   27.2%
CREPE tiny   487k      many              90.7%   90.6%   88.8%   76.1%
CREPE full   22.2M     many              90.1%   90.4%   89.7%   80.8%
SPICE        2.38M     MIR-1k + augm.
TABLE III: Evaluation results on noisy datasets.


Baselines

We compare our results against two baselines, namely SWIPE [6] and CREPE [18]. SWIPE estimates the pitch as the fundamental frequency of the sawtooth waveform whose spectrum best matches the spectrum of the input signal. CREPE is a data-driven method which was trained in a fully-supervised fashion on a mix of different datasets, including MDB-stem-synth [26], MIR-1k [14], Bach10 [35], RWC-Synth [22], MedleyDB [3] and NSynth [11]. We consider two variants of the CREPE model, by using model capacity tiny or full, and we disabled Viterbi smoothing, so as to evaluate the accuracy achieved on individual frames. These models have, respectively, 487k and 22.2M parameters. CREPE also produces a confidence score for each input frame.

Evaluation measures

We use the evaluation measures defined in [27] to evaluate and compare our model against the baselines. The raw pitch accuracy (RPA) is defined as the percentage of voiced frames for which the pitch error is less than 0.5 semitones. To assess the robustness of the model accuracy to the initialization, we also report a 95% confidence interval, based on the sample standard deviation of the RPA values computed using the last 10 checkpoints of 3 separate replicas. For CREPE we do not report such an interval, because we simply run the model provided by the CREPE authors on each of the evaluation datasets. The voicing recall rate (VRR) is the proportion of voiced frames in the ground truth that are recognized as voiced by the algorithm. We report the VRR at a target voicing false alarm rate equal to 10%. Note that this measure is provided only for MIR-1k, since MDB-stem-synth is a synthetic dataset and voicing can be determined based on a simple silence thresholding.
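The RPA definition above translates directly into code. The following is a minimal sketch with hypothetical function and variable names; it is not the reference implementation of [27], and the toy frequencies are made up for illustration.

```python
import math

def semitone_error(est_hz, ref_hz):
    """Absolute pitch error in semitones between two frequencies in Hz."""
    return abs(12.0 * math.log2(est_hz / ref_hz))

def raw_pitch_accuracy(est_hz, ref_hz, voiced, tol=0.5):
    """Fraction of voiced frames whose pitch error is below `tol` semitones.

    Only frames flagged as voiced in the ground truth are counted.
    """
    n_voiced = 0
    n_hits = 0
    for est, ref, is_voiced in zip(est_hz, ref_hz, voiced):
        if not is_voiced:
            continue
        n_voiced += 1
        if est > 0 and semitone_error(est, ref) < tol:
            n_hits += 1
    return n_hits / n_voiced if n_voiced else float("nan")

# Toy example: three voiced frames and one unvoiced frame.
ref = [220.0, 220.0, 440.0, 0.0]
est = [221.0, 233.08, 445.0, 100.0]  # second estimate is one semitone sharp
voiced = [True, True, True, False]

rpa = raw_pitch_accuracy(est, ref, voiced)
print(f"RPA = {100 * rpa:.1f}%")
```

Two of the three voiced frames fall within half a semitone of the reference, and the unvoiced frame is ignored regardless of the estimate.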

Main results

The main results of the paper are summarized in Table II and Figure 4. On the MIR-1k dataset, SPICE outperforms SWIPE, while achieving the same accuracy as CREPE in terms of RPA (90.7%), despite the fact that it was trained in an unsupervised fashion and CREPE used MIR-1k as one of the training datasets. Figure 5 illustrates a finer grained comparison between SPICE and CREPE (full model), measuring the average absolute pitch error for different values of the ground truth pitch frequency, conditioned on the level of confidence (expressed in deciles) produced by the respective algorithm. When excluding the decile with low confidence, we observe that above 110 Hz, SPICE achieves an average error around 0.2-0.3 semitones, while CREPE achieves around 0.1-0.5 semitones.

Fig. 8: Calibration of the pitch head output: (a) MIR-1k, (b) MDB-stem-synth.

Fig. 9: Robustness of the RPA on MIR-1k when varying the number of frames used for calibration.

We repeated our analysis on the MDB-stem-synth dataset. In this case the dataset has remarkably different characteristics from the SingingVoices dataset used for the unsupervised training of SPICE, in terms of both frequency extension (Figure 3) and timbre (singing vs. musical instruments). This explains why in this case the gap between SPICE and CREPE is wider (88.9% vs. 93.1%). Figure 6 repeats the fine-grained analysis for the MDB-stem-synth dataset, illustrating larger errors at both ends of the frequency range. We also performed a thorough error analysis, trying to understand in which cases CREPE and SWIPE outperform SPICE. We discovered that most of these errors occur in the presence of a harmonic signal, in which most of the energy is concentrated above the fifth-order harmonics, i.e., in the case of musical instruments characterized by a spectral timbre considerably different from the one of singing voice.

We also evaluated the quality of the confidence estimation by comparing the voicing recall rate (VRR) of SPICE and CREPE. Results in Table II show that SPICE achieves results comparable with CREPE (86.8%, i.e., between CREPE tiny and CREPE full), while being more accurate in the more interesting low false-positive rate regime (see Figure 7).

In order to obtain a smaller, thus faster, variant of the SPICE model, we used the MorphNet [12] algorithm. Specifically, we added to the training loss (9) a regularizer which constrains the number of floating point operations (FLOPs), using as regularization hyper-parameter. MorphNet produces as output a slimmed network architecture, which has 180k parameters, thus more than 10 times smaller than the original model. After training this model from scratch, we were still able to achieve a level of performance on MIR-1k comparable to the larger SPICE model, as reported in Table II.

Table III shows the results obtained when evaluating the models in the presence of background music. We observe that SPICE is able to achieve a level of accuracy very similar to CREPE across different values of SNR.


Calibration

The key tenet of SPICE is that it is an unsupervised method. However, as discussed in Section III, the raw output of the pitch head can only represent relative pitch. To obtain absolute pitch, the intercept $b$ (and, optionally, the slope $s$) in (10) needs to be estimated with the use of ground truth labels. Figure 8 shows the fitted model for both MIR-1k and MDB-stem-synth as a dashed red line. We qualitatively observe that the intercept is stable across datasets. In order to quantitatively estimate how many labels are needed to robustly estimate $b$, we repeated 100 bootstrap iterations. At each iteration we resample at random just a few frames from a dataset, fit $b$ (and $s$) using these samples, and compute the RPA. Figure 9 reports the results of this experiment on MIR-1k (error bars represent 2.5% and 97.5% quantiles). We observe that using as few as 200 frames is generally enough to obtain stable results. For MIR-1k this represents about 0.09% of the dataset. Note that these samples can also be obtained by generating synthetic harmonic signals, thus eliminating the need for manual annotations.

V Conclusion

In this paper we propose SPICE, a self-supervised pitch estimation algorithm for monophonic audio. The SPICE model is trained to recognize relative pitch without access to labelled data and it can also be used to estimate absolute pitch by calibrating the model using just a few labelled examples. Our experimental results show that SPICE is competitive with CREPE, a fully-supervised model that was recently proposed in the literature, despite having no access to ground truth labels.


Acknowledgment

We would like to thank Alexandra Gherghina, Dan Ellis, and Dick Lyon for their help with and feedback on this work.


References

  • [1] O. Babacan, T. Drugman, N. Henrich, and T. Dutoit (2013) A comparative study of pitch extraction algorithms on a large variety of singing sounds. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pp. 1–5.
  • [2] R. M. Bittner, B. McFee, and J. P. Bello (2018) Multitask Learning for Fundamental Frequency Estimation in Music. Technical report, arXiv:1809.00381v1.
  • [3] R. M. Bittner, J. Salamon, M. Tierney, M. Mauch, C. Cannam, and J. P. Bello (2014) MedleyDB: A Multitrack Dataset for Annotation-Intensive MIR Research. In Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference.
  • [4] P. Boersma (1993) Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound. IFA Proceedings 17, pp. 97–110.
  • [5] B. T. Bönninghoff, R. M. Nickel, S. Zeiler, and D. Kolossa (2016) Unsupervised Classification of Voiced Speech and Pitch Tracking Using Forward-Backward Kalman Filtering. In Speech Communication; 12. ITG Symposium, pp. 46–50. ISBN 9783800742752.
  • [6] A. Camacho and J. G. Harris (2008-09) A sawtooth waveform inspired pitch estimator for speech and music. The Journal of the Acoustical Society of America 124 (3), pp. 1638–1652. ISSN 0001-4966.
  • [7] P. H. Christiansen, M. F. Kragh, Y. Brodskiy, and H. Karstoft (2019-07) UnsuperPoint: End-to-end Unsupervised Interest Point Detector and Descriptor. Technical report, arXiv:1907.04011.
  • [8] A. De Cheveigné and H. Kawahara (2002) YIN, a fundamental frequency estimator for speech and music. Journal of the Acoustical Society of America 111 (4), pp. 1917–1930.
  • [9] B. Deng, D. Jouvet, Y. Laprie, I. Steiner, and A. Sini (2017) Towards Confidence Measures on Fundamental Frequency Estimations. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings.
  • [10] J. Dubnowski, R. Schafer, and L. Rabiner (1976-02) Real-time digital hardware pitch detector. IEEE Transactions on Acoustics, Speech, and Signal Processing 24 (1), pp. 2–8. ISSN 0096-3518.
  • [11] J. Engel, C. Resnick, A. Roberts, S. Dieleman, D. Eck, K. Simonyan, and M. Norouzi (2017-04) Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders. arXiv:1704.01279.
  • [12] A. Gordon, E. Eban, O. Nachum, B. Chen, H. Wu, T. Yang, and E. Choi (2018-06) MorphNet: fast & simple resource-constrained structure learning of deep networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [13] K. Han and D. Wang (2014) Neural Network Based Pitch Tracking in Very Noisy Speech. IEEE/ACM Transactions on Audio Speech and Language Processing 22 (12).
  • [14] C. H. Jang and J. Roger (2009) On the Improvement of Singing Voice Separation for Monaural Recordings Using the MIR-1K Dataset. IEEE Transactions on Audio, Speech, and Language Processing.
  • [15] A. Jansen, M. Plakal, R. Pandya, D. P. W. Ellis, S. Hershey, J. Liu, R. C. Moore, and R. A. Saurous (2018-11) Unsupervised Learning of Semantic Audio Representations. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pp. 126–130. arXiv:1711.02209.
  • [16] D. Jouvet and Y. Laprie (2017) Performance Analysis of Several Pitch Detection Algorithms on Simulated and Real Noisy Speech Data. In EUSIPCO, European Signal Processing Conference, External Links: Link Cited by: §II.
  • [17] H. Kawahara, A. de Cheveigné, H. Banno, T. Takahashi, and T. Irino (2005) Nearly defect-free F0 trajectory extraction for expressive speech modifications based on STRAIGHT. In Interspeech, pp. 537–540. External Links: Link Cited by: §I, §II.
  • [18] J. W. Kim, J. Salamon, P. Li, and J. P. Bello (2018-02) CREPE: A Convolutional Representation for Pitch Estimation. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, External Links: 1802.06182, Link Cited by: §I, §II, §II, §II, §IV, §IV.
  • [19] B. S. Lee and D. P. W. Ellis (2012) Noise robust pitch tracking by subband autocorrelation classification. 13th Annual Conference of the International Speech Communication Association 2012, INTERSPEECH 2012 1, pp. 706–709. External Links: ISBN 9781622767595 Cited by: §II.
  • [20] R. F. Lyon (2017-05) Human and Machine Hearing. Cambridge University Press. External Links: Document, ISBN 9781107007536, Link Cited by: §I.
  • [21] P. Martin (1982) Comparison of pitch detection by cepstrum and spectral comb analysis. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pp. 180–183. Cited by: §II.
  • [22] M. Mauch and S. Dixon (2014-05) pYIN: A fundamental frequency estimator using probabilistic threshold distributions. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pp. 659–663. External Links: Document, ISBN 978-1-4799-2893-4, Link Cited by: §I, §II, §IV.
  • [23] M. Meyer, J. Beutel, and L. Thiele (2017) Unsupervised Feature Learning for Audio Analysis. In Workshop track - ICLR, External Links: 1712.03835v1, Link Cited by: §II.
  • [24] M. Noroozi and P. Favaro (2016-03) Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles. In European Conference on Computer Vision (ECCV), pp. 69–84. External Links: 1603.09246, Link Cited by: §II.
  • [25] T. Ramabadran, A. Sorin, M. McLaughlin, D. Chazan, D. Pearce, and R. Hoory (2004) The ETSI extended distributed speech recognition (DSR) standards: server-side speech reconstruction. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, Vol. 1, pp. I–53–6. External Links: Document, ISBN 0-7803-8484-9, Link Cited by: §I, §II.
  • [26] J. Salamon, R. Bittner, J. Bonada, J. J. Bosch, E. Gómez, and J. P. Bello (2017) An Analysis/Synthesis Framework for Automatic F0 Annotation of Multitrack Datasets. In 18th International Society for Music Information Retrieval Conference, External Links: Link Cited by: §I, §IV, §IV.
  • [27] J. Salamon, E. Gomez, P.W. D. Ellis, and G. Richard (2014) Melody extraction from Polyphonic Music Signals: Approaches, Applications and Challenges. IEEE SIgnal Processing Magazine. External Links: Document, Link Cited by: §II, §IV.
  • [28] J. Salamon and E. Gómez (2012) Melody Extraction from Polyphonic Music Signals using Pitch Contour Characteristics. IEEE Transactions on Audio, Speech, and Language Processing 20 (6), pp. 1759 – 1770. External Links: Link Cited by: §I, §II.
  • [29] S. Strömbergsson (2016) Today’s most frequently used F 0 estimation methods, and their accuracy in estimating male and female pitch in clean speech. In Interspeech, External Links: Document, Link Cited by: §II.
  • [30] M. Tagliasacchi, B. Gfeller, F. d. C. Quitry, and D. Roblek (2019-05) Self-supervised audio representation learning for mobile devices. Technical report External Links: 1905.11796, Link Cited by: §II.
  • [31] D. Talkin (1995) A Robust Algorithm for Pitch Tracking (RAPT). In Speech Coding and Synthesis, pp. 495–518. External Links: Link Cited by: §I, §II.
  • [32] O. V. van den Oord, Yazhe Li (2019) Representation Learning with Contrastive Predictive Coding. Technical report External Links: 1807.03748v2, Link Cited by: §II.
  • [33] A. Von Dem Knesebeck and U. Zölzer (2010) Comparison of pitch trackers for real-time guitar effects. In Digital Audio Effects (DAFX), External Links: Link Cited by: §II.
  • [34] D. Wei, J. Lim, A. Zisserman, and W. T. Freeman (2018) Learning and Using the Arrow of Time. In Computer Vision and Pattern Recognition Conference (CVPR), pp. 8052–8060. External Links: Link Cited by: §II.
  • [35] Zhiyao Duan, B. Pardo, and Changshui Zhang (2010-11) Multiple Fundamental Frequency Estimation by Modeling Spectral Peaks and Non-Peak Regions. IEEE Transactions on Audio, Speech, and Language Processing 18 (8), pp. 2121–2133. External Links: Document, ISSN 1558-7916, Link Cited by: §IV.
  • [36] N. Ziv and S. Radin (2014) Absolute and relative pitch: Global versus local processing of chords.. Advances in cognitive psychology 10 (1), pp. 15–25. External Links: Document, ISSN 1895-1171, Link Cited by: §I.