Pitch represents the perceptual property of sound that allows ordering based on frequency, i.e., distinguishing between high and low sounds. For example, our auditory system is able to recognize a melody by tracking the relative pitch differences over time. Pitch is often confused with the fundamental frequency ($f_0$), i.e., the frequency of the lowest harmonic. However, the former is a perceptual property, while the latter is a physical property of the underlying audio signal. Despite this important difference, outside the field of psychoacoustics pitch and fundamental frequency are often used interchangeably, and we will not make an explicit distinction within the scope of this paper. A comprehensive treatment of the psychoacoustic aspects of pitch perception can be found in the literature.
Pitch estimation in monophonic audio has received a great deal of attention over the past decades, due to its central importance in several domains, ranging from music information retrieval to speech analysis. Traditionally, simple signal processing pipelines were proposed, working either in the time domain [10, 4, 31, 8], in the frequency domain, or both [25, 17], often followed by post-processing algorithms to smooth the pitch trajectories [22, 28].
Until recently, machine learning methods had not been able to outperform hand-crafted signal processing pipelines targeting pitch estimation. This was due to the lack of annotated data, which is particularly tedious and difficult to obtain at the temporal and frequency resolution required to train fully supervised models. To overcome these limitations, a synthetically generated dataset was proposed, obtained by re-synthesizing monophonic music tracks while setting the fundamental frequency to the target ground truth. Using this training data, the CREPE algorithm was able to achieve state-of-the-art results when evaluated on the same dataset, outperforming signal processing baselines, especially under noisy conditions.
In this paper we address the problem of lack of annotated data from a different angle. Specifically, we rely on self-supervision, i.e., we define an auxiliary task (also known as a pretext task) which can be learned in a completely unsupervised way. To devise this task, we started from the observation that for humans, including professional musicians, it is typically much easier to estimate relative pitch, related to the frequency interval between two notes, than absolute pitch, related to the actual fundamental frequency. Therefore, we design SPICE (Self-supervised PItCh Estimation) to solve a similar task. More precisely, our network architecture consists of a convolutional encoder which produces a single scalar embedding. We aim at learning a model that linearly maps this scalar value to pitch, when the latter is expressed on a logarithmic scale, i.e., in units of semitones of an equally tempered chromatic scale. To do this, we feed two versions of the same signal to the encoder, one being a pitch-shifted version of the other by a random but known amount. Then, we devise a loss function that forces the difference between the scalar embeddings to be proportional to the known difference in pitch. For convenience, we perform pitch shifting in the domain defined by the constant-Q transform, because this corresponds to a simple translation along the log-spaced frequency axis. Upon convergence, the model is able to estimate relative pitch. To translate this output to an absolute pitch scale we apply a simple calibration step against ground truth data. Since we only need to estimate a single scalar offset, a very small annotated dataset suffices for this purpose.
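The pair construction described above can be sketched in a few lines of numpy; the function and parameter names (`random_slice_pair`, the slice width `f`, the offset bounds `k_min`/`k_max`) are illustrative, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_slice_pair(cqt_frame, f, k_min, k_max):
    """Extract two slices of the same CQT frame at independent random offsets.

    The offset difference k2 - k1 is the known relative pitch shift, in CQT
    bins, that the loss later compares against the encoder outputs.
    """
    k1, k2 = rng.integers(k_min, k_max + 1, size=2)
    return cqt_frame[k1:k1 + f], cqt_frame[k2:k2 + f], int(k2 - k1)

# Toy frame in which bin i simply stores its own index, so the shift is visible.
frame = np.arange(200)
s1, s2, shift = random_slice_pair(frame, f=128, k_min=0, k_max=40)
assert len(s1) == len(s2) == 128
assert s2[0] - s1[0] == shift
```

Because both slices come from the same underlying frame, the difference between their starting offsets is known exactly and serves as a free training label.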
Another important aspect of pitch estimation is determining whether the underlying signal is voiced or unvoiced. Instead of relying on handcrafted thresholding mechanisms, we augment the model in such a way that it can learn the level of confidence of the pitch estimation. Namely, we add a simple fully connected layer that receives as input the penultimate layer of the encoder and produces a second scalar value which is trained to match the pitch estimation error.
In summary, this paper makes the following key contributions:
We propose a self-supervised (relative) pitch estimation model, which can be trained without having access to any labelled dataset.
We incorporate a self-supervised mechanism to estimate the confidence of the pitch estimation, which can be directly used for voicing detection.
We evaluate our model against two publicly available monophonic datasets and show that in both cases we outperform handcrafted baselines, while matching the level of accuracy attained by CREPE, despite having no access to ground truth labels.
We train and evaluate our model also in noisy conditions, where background music is present in addition to monophonic singing, and show that also in this case we match the level of accuracy obtained by CREPE.
The rest of this paper is organized as follows. Section II contrasts the proposed method against the existing literature. Section III illustrates the proposed method, which is evaluated in Section IV. Conclusions and future remarks are discussed in Section V.
II Related work
Pitch estimation: Traditional pitch estimation algorithms are based on hand-crafted signal processing pipelines, working in the time and/or frequency domain. The most common time-domain methods are based on the analysis of local maxima of the auto-correlation function (ACF). These approaches are known to be prone to octave errors, because the peaks of the ACF repeat at different lags. Therefore, several methods were introduced to be more robust to such errors, including, e.g., the PRAAT and RAPT algorithms. An alternative approach is pursued by the YIN algorithm, which looks for the local minima of the Normalized Mean Difference Function (NMDF), to avoid octave errors caused by signal amplitude changes. Different frequency-domain methods were also proposed, based, e.g., on spectral peak picking or template matching with the spectrum of a sawtooth waveform. Other approaches combine both time-domain and frequency-domain processing, like the Aurora algorithm and the nearly defect-free F0 estimation algorithm. Comparative analyses including most of the aforementioned approaches have been conducted on speech [16, 29], singing voices and musical instruments. Machine learning models for pitch estimation in speech were proposed in [13, 19]. In the method of [19], the consensus of other pitch trackers is used to obtain ground truth, and a multi-layer perceptron classifier is trained on the principal components of the autocorrelations of subbands from an auditory filterbank. More recently the CREPE model was proposed, an end-to-end convolutional neural network which consumes audio directly in the time domain. The network is trained in a fully supervised fashion, minimizing the cross-entropy loss between the ground truth pitch annotations and the output of the model. In our experiments, we compare our results with CREPE, which represents the current state-of-the-art.
Pitch confidence estimation: Most of the aforementioned methods also provide a voiced/unvoiced decision, often based on heuristic thresholds applied to hand-crafted features. However, the confidence of the estimated pitch in the voiced case is seldom provided. A few exceptions are CREPE, which produces a confidence score computed from the activations of the last layer of the model, and a method which directly addresses this problem by training a neural network based on hand-crafted features to estimate the confidence of the estimated pitch. In contrast, in our work we explicitly augment the proposed model with a head aimed at estimating confidence in a fully unsupervised way.
Pitch tracking and polyphonic audio: Often, post-processing is applied to raw pitch estimates to smoothly track pitch contours over time. For example, Kalman filtering has been applied to smooth the output of a hybrid spectro-temporal autocorrelation method, while the pYIN algorithm builds on top of YIN by applying Viterbi decoding of a sequence of soft pitch candidates. A similar smoothing algorithm is also used in the publicly released version of CREPE. Pitch extraction in the case of polyphonic audio remains an open research problem. In this case, pitch tracking is even more important to be able to distinguish the different melody lines. A machine learning model targeting the estimation of multiple fundamental frequencies, melody, vocal and bass line was recently proposed.
Self-supervised learning: The widespread success of fully supervised models was stimulated by the availability of annotated datasets. In those cases in which labels are scarce or simply not available, self-supervised learning has emerged as a promising approach for pre-training deep convolutional networks, both for vision [24, 34, 32] and audio-related tasks [15, 30, 23]. Somewhat related to our paper are those methods that use self-supervision to estimate point disparities between pairs of images, where shifts in the spatial domain play the role of shifts in the log-frequency domain.
The proposed pitch estimation model receives as input an audio track of arbitrary length and produces as output a time series of estimated pitch frequencies, together with an indication of the confidence of the estimates. The latter is used to discriminate between unvoiced frames, in which pitch is not well defined, and voiced frames.
To better illustrate our method, let us first introduce a continuous-time model of an ideal harmonic signal, that is:

$$x(t) = \sum_{k=1}^{K} a_k \sin(2\pi k f_0 t + \varphi_k),$$

where $f_0$ denotes the fundamental frequency and $k f_0$, $k = 2, \ldots, K$, its higher order harmonics. The modulus of the Fourier transform (for $f \geq 0$) is given by

$$|X(f)| = \sum_{k=1}^{K} \frac{a_k}{2}\, \delta(f - k f_0),$$

where $\delta(\cdot)$ is the Dirac delta function. Therefore, the modulus consists of spectral peaks at integer multiples of the fundamental frequency $f_0$. When the signal is pitch-shifted by a factor $\alpha$, these spectral peaks move to $\alpha k f_0$. If we apply a logarithmic transformation to the frequency axis, $\log(\alpha k f_0) = \log \alpha + \log(k f_0)$, i.e., pitch-shifting results in a simple translation in the log-frequency domain.
This very simple and well known result is at the core of the proposed model. Namely, we preprocess the input audio track with a frontend that computes the constant-Q transform (CQT). In the CQT domain, frequency bins are logarithmically spaced, as the center frequencies obey the following relationship:

$$f_k = f_{\min} \cdot 2^{k/B}, \quad k = 0, \ldots, K - 1,$$

where $f_{\min}$ is the frequency of the lowest frequency bin, $B$ is the number of bins per octave, and $K$ is the number of frequency bins. Given an input audio track, the CQT produces a matrix of size $K \times T$, where the number of temporal frames $T$ depends on the selected hop length. Since the frequency bins are logarithmically spaced, if the input audio track is pitch-shifted by a factor $2^{\Delta/B}$, this results in a translation of $\Delta$ bins in the CQT domain.
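The log-spacing and the shift-equals-translation property can be verified numerically; the values below ($f_{\min} = 32.7$ Hz, $B = 24$, 190 bins) are illustrative, not necessarily the paper's configuration:

```python
import numpy as np

def cqt_center_frequencies(f_min, bins_per_octave, n_bins):
    # f_k = f_min * 2**(k / B): logarithmically spaced center frequencies.
    k = np.arange(n_bins)
    return f_min * 2.0 ** (k / bins_per_octave)

# Illustrative configuration.
B = 24
freqs = cqt_center_frequencies(f_min=32.7, bins_per_octave=B, n_bins=190)

# Pitch-shifting by a factor 2**(d / B) moves every spectral peak up by
# exactly d bins on this log-spaced axis.
d = 7
shifted = freqs * 2.0 ** (d / B)
assert np.allclose(shifted[: len(freqs) - d], freqs[d:])
```

Multiplying every frequency by $2^{d/B}$ reproduces the same grid shifted by $d$ bins, which is exactly why pitch shifting can be implemented as a CQT translation.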
The proposed model architecture is illustrated in Figure 2. Starting from the observation above, the model computes the modulus of the CQT, and from each temporal frame $x_t$, $t = 1, \ldots, T$ (where $T$ is equal to the batch size during training), it extracts two random slices $x_{t,1}$ and $x_{t,2}$, spanning the range of CQT bins $[k_{t,i}, k_{t,i} + F - 1]$, $i = 1, 2$, where $F$ is the number of CQT bins in the slice and the offsets $k_{t,i}$ are sampled from a uniform distribution, i.e., $k_{t,i} \sim \mathcal{U}(k_{\min}, k_{\max})$. Then, each vector is fed to the same encoder to produce a single scalar $y_{t,i}$. The encoder is a neural network with convolutional layers followed by two fully-connected layers. Further details about the model architecture are provided in Section IV.
We design our main loss in such a way that $y_{t,i}$ is encouraged to encode pitch. First, we define the relative pitch error as

$$e_t = (y_{t,1} - y_{t,2}) - \sigma (k_{t,2} - k_{t,1}).$$

Then, the loss is defined as the Huber norm of the pitch error, that is:

$$\mathcal{L}_{\mathrm{pitch}} = \frac{1}{T} \sum_{t=1}^{T} h(e_t),$$

where $h(\cdot)$ denotes the Huber norm with threshold $\tau$. The pitch difference scaling factor $\sigma$ is adjusted in such a way that $y_{t,i} \in [0, 1]$ when pitch is in the range $[f^{\mathrm{pitch}}_{\min}, f^{\mathrm{pitch}}_{\max}]$, namely:

$$\sigma = \frac{1}{B \log_2\!\left( f^{\mathrm{pitch}}_{\max} / f^{\mathrm{pitch}}_{\min} \right)},$$

since a shift of one CQT bin corresponds to $1/B$ of an octave, where $B$ denotes the number of CQT bins per octave. The values of $f^{\mathrm{pitch}}_{\min}$ and $f^{\mathrm{pitch}}_{\max}$ are determined based on the range of pitch frequencies spanned by the training set. In our experiments we found that the Huber loss makes the model less sensitive to the presence of unvoiced frames in the training dataset, for which the relative pitch error can be large, as pitch is not well defined in this case.
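A minimal numpy sketch of this loss follows; the sign convention relating the encoder outputs to the slice offsets, and the threshold value `tau=0.25`, are assumptions for illustration:

```python
import numpy as np

def huber(e, tau):
    # Quadratic near zero, linear in the tails; large errors (e.g. from
    # unvoiced frames) therefore contribute less than under a squared loss.
    a = np.abs(e)
    return np.where(a <= tau, 0.5 * e ** 2, tau * (a - 0.5 * tau))

def pitch_loss(y1, y2, k1, k2, sigma, tau=0.25):
    # The encoder outputs for the two slices should differ by sigma times
    # the known offset difference; e is the relative pitch error.
    e = (y1 - y2) - sigma * (k2 - k1)
    return huber(e, tau).mean()

# Sanity check with an idealized encoder y = sigma * (p - k): the loss vanishes.
sigma = 0.01
p = np.array([60.0, 72.0, 55.0])   # true pitch of three frames, in CQT bins
k1 = np.array([10, 4, 0])
k2 = np.array([25, 12, 30])
y1, y2 = sigma * (p - k1), sigma * (p - k2)
assert pitch_loss(y1, y2, k1, k2, sigma) < 1e-12
```

An encoder that is perfectly linear in pitch drives the loss to zero regardless of the (unknown) absolute pitch, which is exactly the self-supervised signal the model exploits.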
In addition to $\mathcal{L}_{\mathrm{pitch}}$, we also use the following reconstruction loss

$$\mathcal{L}_{\mathrm{recon}} = \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{2} \left\| x_{t,i} - \hat{x}_{t,i} \right\|^2,$$

where $\hat{x}_{t,i} = \mathrm{Dec}(y_{t,i})$, $i = 1, 2$, is a reconstruction of the input frame obtained by feeding $y_{t,i}$ into a decoder $\mathrm{Dec}(\cdot)$. Therefore, the overall loss is defined as:

$$\mathcal{L} = w_{\mathrm{pitch}} \mathcal{L}_{\mathrm{pitch}} + w_{\mathrm{recon}} \mathcal{L}_{\mathrm{recon}},$$

where $w_{\mathrm{pitch}}$ and $w_{\mathrm{recon}}$ are scalar weights that determine the relative importance assigned to the two loss components.
Given the way it is designed, the proposed model can only estimate relative pitch differences. The absolute pitch of an input frame is obtained by applying an affine mapping:

$$\hat{p}_t = a \cdot y_t + b,$$

which depends on two parameters. We consider two cases: i) estimating only the intercept $b$, while fixing the slope $a$ to the value implied by the scaling factor $\sigma$; ii) estimating both the intercept $b$ and the slope $a$. This is the only place where our method requires access to ground truth labels. However, we can observe that: i) only very few labelled samples are needed, as only one or two parameters need to be estimated; ii) synthetically generated labelled samples could be used for this purpose; iii) some applications (e.g., matching melodies played in different keys) might require only relative pitch. Section IV provides further details on the robustness of the calibration process.
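As a toy illustration of how little data the calibration needs, the intercept can be fitted by least squares from a handful of labelled frames; the names and values below are hypothetical:

```python
import numpy as np

def calibrate_intercept(y, pitch_semitones, slope):
    # Least-squares intercept for pitch ≈ slope * y + intercept.
    # With the slope fixed, the mean residual is the optimal intercept.
    return float(np.mean(pitch_semitones - slope * y))

# Hypothetical encoder outputs and matching ground-truth pitches,
# generated with slope 36 and intercept 45 for the sake of the example.
y = np.linspace(0.0, 1.0, 8)
pitch = 36.0 * y + 45.0
b = calibrate_intercept(y, pitch, slope=36.0)
assert abs(b - 45.0) < 1e-9
```

Because a single scalar is estimated, even a few noisy labelled frames average out to a stable intercept.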
Note that pitch in (10) is expressed in semitones and it can be converted to frequency (in Hz) by:

$$f_t = f_{\mathrm{base}} \cdot 2^{\hat{p}_t / 12},$$

where $f_{\mathrm{base}}$ is the frequency corresponding to $\hat{p}_t = 0$.
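For concreteness, a common convention anchors semitone 69 at A4 = 440 Hz (MIDI numbering); the paper's actual reference point is an assumption here:

```python
def semitone_to_hz(p):
    # MIDI convention: semitone 69 corresponds to A4 = 440 Hz.
    # Each semitone multiplies the frequency by 2**(1/12).
    return 440.0 * 2.0 ** ((p - 69.0) / 12.0)

assert abs(semitone_to_hz(69) - 440.0) < 1e-9   # A4
assert abs(semitone_to_hz(81) - 880.0) < 1e-9   # one octave up doubles the frequency
```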
In addition to the estimated pitch, we design our model such that it also produces a confidence level $c_t$. Indeed, when the input audio is voiced we expect the model to produce high confidence estimates, while when it is unvoiced pitch is not well defined and the output confidence should be low.
To achieve this, we design the encoder architecture to have two heads on top of the convolutional layers, as illustrated in Figure 2. The first head consists of two fully-connected layers and produces the pitch estimate. The second head consists of a single fully-connected layer and produces the confidence level $c_t$. To train the latter, we add the following loss:

$$\mathcal{L}_{\mathrm{conf}} = \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{2} \left( (1 - c_{t,i}) - |e_t| \right)^2,$$

so that $1 - c_{t,i}$ is trained to match the pitch estimation error.
This way the model will produce high confidence when it is able to correctly estimate the pitch difference between the two input slices. At the same time, given that our primary goal is to accurately estimate pitch, during the backpropagation step we stop the gradients so that the confidence loss only influences the training of the confidence head and does not affect the other layers of the encoder architecture.
[Table I: datasets used in the experiments — for each dataset, the number of tracks, the length (min, max, total) and the number of frames (voiced, total).]
Handling background music
The accuracy of pitch estimation can be severely affected when dealing with noisy conditions. These emerge, for example, when the singing voice is superimposed over background music. In this case, we are faced with polyphonic audio and we want the model to focus only on the singing voice source. To deal with these conditions, we introduce a data augmentation step in our training setup. More specifically, we mix the clean singing voice signal with the corresponding instrumental backing track at different signal-to-noise ratios (SNRs). Interestingly, we found that simply augmenting the training data was not sufficient to achieve a good level of robustness. Instead, we also modified the definition of the loss functions as follows. Let $x^{\mathrm{clean}}_{t,i}$ and $x^{\mathrm{noisy}}_{t,i}$ denote, respectively, the CQT slices of the clean and noisy input samples. Similarly, $y^{\mathrm{clean}}_{t,i}$ and $y^{\mathrm{noisy}}_{t,i}$ denote the corresponding outputs of the encoder. The pitch error loss is modified by averaging four different variants of the error, one for each clean/noisy combination of the two slices, that is:

$$\mathcal{L}_{\mathrm{pitch}} = \frac{1}{4} \left( \mathcal{L}^{cc}_{\mathrm{pitch}} + \mathcal{L}^{cn}_{\mathrm{pitch}} + \mathcal{L}^{nc}_{\mathrm{pitch}} + \mathcal{L}^{nn}_{\mathrm{pitch}} \right),$$

where the superscripts indicate whether the first and second slice are taken from the clean or the noisy version of the input.
The reconstruction loss is also modified, so that the decoder is asked to reconstruct the clean samples only. That is:

$$\mathcal{L}_{\mathrm{recon}} = \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{2} \left( \left\| x^{\mathrm{clean}}_{t,i} - \mathrm{Dec}(y^{\mathrm{clean}}_{t,i}) \right\|^2 + \left\| x^{\mathrm{clean}}_{t,i} - \mathrm{Dec}(y^{\mathrm{noisy}}_{t,i}) \right\|^2 \right).$$
The rationale behind this approach is that the encoder is induced to represent in its output only the information relative to the clean input audio samples, thus learning to denoise the input by separating the singing voice from noise.
First we provide the details of the default parameters used in our model. The input audio track is sampled at 16 kHz. The CQT frontend is parametrized to use 24 bins per octave, so as to achieve a resolution equal to one half-semitone per bin, and we compute enough CQT bins to cover the range of frequencies up to Nyquist, starting from the frequency of a fixed low reference note. The hop length is set equal to 512 samples, i.e., one CQT frame every 32 ms. During training, we extract slices of 128 CQT bins. The Huber threshold and the loss weights are treated as fixed hyperparameters; we increased the weight of the pitch loss when training with background music.
The encoder receives as input a 128-dimensional vector corresponding to a sliced CQT frame and produces as output two scalars representing, respectively, pitch and confidence. The model architecture consists of a stack of convolutional layers, with the number of channels increasing with depth and striding applied at the output of each layer. After flattening the output of the last convolutional layer we obtain a flat embedding, which is fed into two different heads. The pitch estimation head consists of two fully-connected layers with, respectively, 48 and 1 units. The confidence head consists of a single fully-connected layer with 1 output unit. The total number of parameters of the encoder is 2.38M. Note that we do not apply any form of temporal smoothing to the output of the model.
The model is trained using the Adam optimizer with default hyperparameters. During training, the CQT frames of the input audio tracks are shuffled, so that the frames in a batch are likely to come from different tracks.
[Table II: model comparison — model, number of parameters, training data, RPA (with 95% confidence interval) and VRR on MIR-1k, and RPA (with 95% confidence interval) on MDB-stem-synth.]
We use three datasets in our experiments, whose details are summarized in Table I. The MIR-1k dataset contains 1000 audio tracks with people singing Chinese pop songs. The dataset is annotated with pitch at a granularity of 10 ms and it also contains voiced/unvoiced frame annotations. It comes with two stereo channels representing, respectively, the singing voice and the accompaniment music. The MDB-stem-synth dataset includes re-synthesized monophonic music played with a variety of musical instruments, and was used to train the CREPE model. In this case, pitch annotations are available at a granularity of 29 ms. Given the mismatch of the sampling period of the pitch annotations across datasets, we resample the pitch time series with a period equal to the hop length of the CQT, i.e., 32 ms. In addition to these publicly available datasets, we also collected in-house the SingingVoices dataset, which contains 88 audio tracks of people singing a variety of pop songs, for a total of 185 minutes.
Figure 3 illustrates the empirical distribution of pitch values. For SingingVoices, there are no ground-truth pitch labels, so we used the output of CREPE (configured with full model capacity and with Viterbi smoothing enabled) as a surrogate. We observe that MDB-stem-synth spans a significantly larger range of frequencies (approx. 5 octaves) than MIR-1k and SingingVoices (approx. 3 octaves).
We trained SPICE using either SingingVoices or MIR-1k, and used both MIR-1k (singing voice channel only) and MDB-stem-synth to evaluate models in clean conditions. To handle background music, we repeated training on MIR-1k, but this time applying data augmentation by mixing in backing tracks with an SNR uniformly sampled from [-5dB, 25dB]. For the evaluation, we used the MIR-1k dataset, mixing the available backing tracks at different levels of SNR, namely 20dB, 10dB and 0dB. In all cases, we apply data augmentation during training, by pitch-shifting the input audio tracks by an amount of semitones uniformly sampled from a discrete set.
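The SNR-controlled mixing used for this augmentation can be sketched as follows; `mix_at_snr` is an illustrative name, not the paper's implementation:

```python
import numpy as np

def mix_at_snr(voice, backing, snr_db):
    # Scale the backing track so that the voice-to-backing power ratio
    # equals the requested SNR (in dB), then mix the two signals.
    p_voice = np.mean(voice ** 2)
    p_back = np.mean(backing ** 2)
    gain = np.sqrt(p_voice / (p_back * 10.0 ** (snr_db / 10.0)))
    return voice + gain * backing, gain

rng = np.random.default_rng(0)
voice = np.sin(2 * np.pi * 220.0 * np.arange(16000) / 16000.0)  # 1 s of a 220 Hz tone
backing = rng.standard_normal(16000)                             # stand-in backing track
mix, gain = mix_at_snr(voice, backing, snr_db=10.0)

# Verify the achieved SNR of the mixture components.
snr = 10.0 * np.log10(np.mean(voice ** 2) / np.mean((gain * backing) ** 2))
assert abs(snr - 10.0) < 1e-9
```

Sampling `snr_db` uniformly per training example then yields mixtures across the whole [-5dB, 25dB] range.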
[Table III: RPA of SPICE (2.38M parameters, trained on MIR-1k with augmentation) and the baselines on MIR-1k, in clean conditions and with background music at 20dB, 10dB and 0dB SNR.]
We compare our results against two baselines, namely SWIPE and CREPE. SWIPE estimates the pitch as the fundamental frequency of the sawtooth waveform whose spectrum best matches the spectrum of the input signal. CREPE is a data-driven method which was trained in a fully-supervised fashion on a mix of different datasets, including MDB-stem-synth, MIR-1k, Bach10, RWC-Synth, MedleyDB and NSynth. We consider two variants of the CREPE model, by using model capacity tiny or full, and we disabled Viterbi smoothing, so as to evaluate the accuracy achieved on individual frames. These models have, respectively, 487k and 22.2M parameters. CREPE also produces a confidence score for each input frame.
We use standard evaluation measures to evaluate and compare our model against the baselines. The raw pitch accuracy (RPA) is defined as the percentage of voiced frames for which the pitch error is less than 0.5 semitones. To assess the robustness of the model accuracy to the initialization, we also report a 95% confidence interval, computed from the sample standard deviation of the RPA values obtained using the last 10 checkpoints of 3 separate replicas. For CREPE we do not report such an interval, because we simply run the model provided by the CREPE authors on each of the evaluation datasets. The voicing recall rate (VRR) is the proportion of voiced frames in the ground truth that are recognized as voiced by the algorithm. We report the VRR at a target voicing false alarm rate equal to 10%. Note that this measure is provided only for MIR-1k, since MDB-stem-synth is a synthetic dataset and voicing can be determined based on simple silence thresholding.
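A minimal sketch of the RPA computation (reference implementations exist in standard evaluation toolkits; the helper name here is illustrative):

```python
import numpy as np

def raw_pitch_accuracy(ref_semitones, est_semitones, voiced, tol=0.5):
    # Fraction of ground-truth voiced frames whose pitch error is below
    # `tol` semitones (0.5 semitones in the evaluation above).
    err = np.abs(ref_semitones - est_semitones)[voiced]
    return float(np.mean(err < tol))

ref = np.array([60.0, 62.0, 64.0, 0.0])
est = np.array([60.2, 63.0, 64.1, 5.0])
voiced = np.array([True, True, True, False])   # last frame is unvoiced, so ignored
rpa = raw_pitch_accuracy(ref, est, voiced)
assert abs(rpa - 2.0 / 3.0) < 1e-9             # errors 0.2 and 0.1 pass, 1.0 fails
```

Working in semitones rather than Hz makes the 0.5-semitone tolerance uniform across the whole frequency range.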
The main results of the paper are summarized in Table II and Figure 4. On the MIR-1k dataset, SPICE outperforms SWIPE, while achieving the same accuracy as CREPE in terms of RPA (90.7%), despite the fact that it was trained in an unsupervised fashion and CREPE used MIR-1k as one of its training datasets. Figure 5 illustrates a finer-grained comparison between SPICE and CREPE (full model), measuring the average absolute pitch error for different values of the ground truth pitch frequency, conditioned on the level of confidence (expressed in deciles) produced by the respective algorithm. When excluding the lowest-confidence decile, we observe that above 110Hz SPICE achieves an average error of around 0.2-0.3 semitones, while CREPE achieves around 0.1-0.5 semitones.
We repeated our analysis on the MDB-stem-synth dataset. In this case the dataset has remarkably different characteristics from the SingingVoices dataset used for the unsupervised training of SPICE, in terms of both frequency extension (Figure 3) and timbre (singing vs. musical instruments). This explains why in this case the gap between SPICE and CREPE is wider (88.9% vs. 93.1%). Figure 6 repeats the fine-grained analysis for the MDB-stem-synth dataset, illustrating larger errors at both ends of the frequency range. We also performed a thorough error analysis, trying to understand in which cases CREPE and SWIPE outperform SPICE. We discovered that most of these errors occur in the presence of a harmonic signal, in which most of the energy is concentrated above the fifth-order harmonics, i.e., in the case of musical instruments characterized by a spectral timbre considerably different from the one of singing voice.
We also evaluated the quality of the confidence estimation by comparing the voicing recall rate (VRR) of SPICE and CREPE. Results in Table II show that SPICE achieves results comparable with CREPE (86.8%, i.e., between CREPE tiny and CREPE full), while being more accurate in the more interesting low false-positive rate regime (see Figure 7).
In order to obtain a smaller, and thus faster, variant of the SPICE model, we used the MorphNet algorithm. Specifically, we added to the training loss (9) a regularizer which constrains the number of floating point operations (FLOPs). MorphNet produces as output a slimmed network architecture with 180k parameters, i.e., more than 10 times smaller than the original model. After training this model from scratch, we were still able to achieve a level of performance on MIR-1k comparable to the larger SPICE model, as reported in Table II.
Table III shows the results obtained when evaluating the models in the presence of background music. We observe that SPICE is able to achieve a level of accuracy very similar to CREPE across different values of SNR.
The key tenet of SPICE is that it is an unsupervised method. However, as discussed in Section III, the raw output of the pitch head can only represent relative pitch. To obtain absolute pitch, the intercept (and, optionally, the slope) in (10) needs to be estimated with the use of ground truth labels. Figure 8 shows the fitted model for both MIR-1k and MDB-stem-synth as a dashed red line. We qualitatively observe that the intercept is stable across datasets. In order to quantitatively estimate how many labels are needed to robustly estimate it, we repeated 100 bootstrap iterations. At each iteration we resample at random just a few frames from a dataset, fit the intercept (and, optionally, the slope) using these samples, and compute the RPA. Figure 9 reports the results of this experiment on MIR-1k (error bars represent the 2.5% and 97.5% quantiles). We observe that using as few as 200 frames is generally enough to obtain stable results. For MIR-1k this represents about 0.09% of the dataset. Note that these samples can also be obtained by generating synthetic harmonic signals, thus eliminating the need for manual annotations.
In this paper we propose SPICE, a self-supervised pitch estimation algorithm for monophonic audio. The SPICE model is trained to recognize relative pitch without access to labelled data and it can also be used to estimate absolute pitch by calibrating the model using just a few labelled examples. Our experimental results show that SPICE is competitive with CREPE, a fully-supervised model that was recently proposed in the literature, despite having no access to ground truth labels.
We would like to thank Alexandra Gherghina, Dan Ellis, and Dick Lyon for their help with and feedback on this work.
- (2013) A comparative study of pitch extraction algorithms on a large variety of singing sounds. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 1–5.
- (2018) Multitask Learning for Fundamental Frequency Estimation in Music. Technical report.
- (2014) MedleyDB: A Multitrack Dataset for Annotation-Intensive MIR Research. In Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference.
- (1993) Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound. IFA Proceedings 17, pp. 97–110.
- (2016) Unsupervised Classification of Voiced Speech and Pitch Tracking Using Forward-Backward Kalman Filtering. In Speech Communication; 12. ITG Symposium, pp. 46–50.
- (2008) A sawtooth waveform inspired pitch estimator for speech and music. The Journal of the Acoustical Society of America 124 (3), pp. 1638–1652.
- (2019) UnsuperPoint: End-to-end Unsupervised Interest Point Detector and Descriptor. Technical report.
- (2002) YIN, a fundamental frequency estimator for speech and music. Journal of the Acoustical Society of America 111 (4), pp. 1917–1930.
- (2017) Towards Confidence Measures on Fundamental Frequency Estimations. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing.
- (1976) Real-time digital hardware pitch detector. IEEE Transactions on Acoustics, Speech, and Signal Processing 24 (1), pp. 2–8.
- Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders.
- (2018) MorphNet: Fast & Simple Resource-Constrained Structure Learning of Deep Networks.
- (2014) Neural Network Based Pitch Tracking in Very Noisy Speech. IEEE/ACM Transactions on Audio, Speech and Language Processing 22 (12).
- (2009) On the Improvement of Singing Voice Separation for Monaural Recordings Using the MIR-1K Dataset. IEEE Transactions on Audio, Speech, and Language Processing.
- (2018) Unsupervised Learning of Semantic Audio Representations. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 126–130.
- (2017) Performance Analysis of Several Pitch Detection Algorithms on Simulated and Real Noisy Speech Data. In EUSIPCO, European Signal Processing Conference.
- (2005) Nearly defect-free F0 trajectory extraction for expressive speech modifications based on STRAIGHT. In Interspeech, pp. 537–540.
- (2018) CREPE: A Convolutional Representation for Pitch Estimation. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing.
- (2012) Noise robust pitch tracking by subband autocorrelation classification. In INTERSPEECH, pp. 706–709.
- (2017) Human and Machine Hearing. Cambridge University Press.
- (1982) Comparison of pitch detection by cepstrum and spectral comb analysis. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 180–183.
- (2014) pYIN: A fundamental frequency estimator using probabilistic threshold distributions. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 659–663.
- (2017) Unsupervised Feature Learning for Audio Analysis. In Workshop track, ICLR.
- (2016) Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles. In European Conference on Computer Vision (ECCV), pp. 69–84.
- (2004) The ETSI extended distributed speech recognition (DSR) standards: server-side speech reconstruction. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 1, pp. I-53–I-56.
- (2017) An Analysis/Synthesis Framework for Automatic F0 Annotation of Multitrack Datasets. In 18th International Society for Music Information Retrieval Conference.
- (2014) Melody Extraction from Polyphonic Music Signals: Approaches, Applications and Challenges. IEEE Signal Processing Magazine.
- (2012) Melody Extraction from Polyphonic Music Signals using Pitch Contour Characteristics. IEEE Transactions on Audio, Speech, and Language Processing 20 (6), pp. 1759–1770.
- (2016) Today's most frequently used F0 estimation methods, and their accuracy in estimating male and female pitch in clean speech. In Interspeech.
- (2019) Self-supervised audio representation learning for mobile devices. Technical report.
- (1995) A Robust Algorithm for Pitch Tracking (RAPT). In Speech Coding and Synthesis, pp. 495–518.
- (2019) Representation Learning with Contrastive Predictive Coding. Technical report.
- (2010) Comparison of pitch trackers for real-time guitar effects. In Digital Audio Effects (DAFx).
- (2018) Learning and Using the Arrow of Time. In Computer Vision and Pattern Recognition Conference (CVPR), pp. 8052–8060.
- (2010) Multiple Fundamental Frequency Estimation by Modeling Spectral Peaks and Non-Peak Regions. IEEE Transactions on Audio, Speech, and Language Processing 18 (8), pp. 2121–2133.
- (2014) Absolute and relative pitch: Global versus local processing of chords. Advances in Cognitive Psychology 10 (1), pp. 15–25.