Similarity graphs for the concealment of long duration data loss in music

by Nathanaël Perraudin, et al.

We present a novel method for the compensation of long duration data gaps in audio signals, in particular music. The concealment of such signal defects is based on a graph that encodes signal structure in terms of time-persistent spectral similarity. A suitable candidate segment for the substitution of the lost content is proposed by an intuitive optimization scheme and smoothly inserted into the gap. Extensive listening tests show that the proposed algorithm provides highly promising results when applied to a variety of real-world music signals.




I Introduction

The loss or corruption of entire segments of audio data is a highly important problem in music enhancement and restoration. Such corruptions can range from short bursts in the range of a few milliseconds to extended distortions that persist over several hundred or even thousands of milliseconds. Short distortions such as clicks or clipping have seen extensive coverage in the literature [siedenburg2013audio, adler2011constrained, godsill2002digital], while the concealment of moderate length distortions, roughly in the range of to at most  ms, is treated in packet loss compensation [perkins1998survey, bahat2015self] and previous work on audio inpainting [adler2012audio, siedenburg2013audio, lagrange2005long]. For such corruptions, it is often reasonable to assume that the lost signal is almost stationary for the duration of the corruption and/or can be inferred from the reliable information surrounding the unreliable segment. For longer duration loss, such an assumption is increasingly unrealistic and a restoration technique cannot rely only on local information. Here, we propose a method to compensate for such extended data loss by considering information from the entirety of the uncorrupted audio available.

Data loss or corruption in the range of seconds can have various causes, e.g. partially damaged physical media, such as phonograph cylinders, shellac or vinyl records or even magnetic tapes. In live music recordings, imperfections due to unwanted noise sources originating from the audience, the artists themselves or the environment are quite common. Even in audio transmission, a short, but total, loss of the connection between transmitter and receiver may lead to data loss beyond just a few hundred milliseconds. In each of these scenarios, the data loss has highly unpleasant consequences for a listener, and it is usually not feasible to reconstruct the lost content from local information only.

Previous work on concealment of data loss in audio, though mostly considering shorter corruption durations, has been performed under various names, depending on the target application and the employed methodology: audio inpainting [adler2012audio], audio interpolation [Etter1996:Interpolation_AR], waveform substitution [goodman1986waveform], or imputation [smaragdis2009missing], to name but a few. We will use the terminology of audio inpainting in the remainder of this contribution. When missing parts have a length no longer than ms, sparsity-based techniques can be successful [adler2012audio, siedenburg2013audio, adler2011constrained]. Otherwise, techniques relying on auto-regressive modeling [Etter1996:Interpolation_AR], sinusoidal modeling [lagrange2005long, lukin2008:parametric.interp.gaps] or based on self-content [bahat2015self] have been proposed. The latter provided promising results for speech signals with distortions up to seconds, while the former rely on a simple signal model that does not comply with complex music signals.

In this contribution, we propose a new algorithm, specifically targeted at the concealment of long duration distortions in the range of several seconds given a single piece of music. The task of determining distortion locations is highly application-dependent and may be anything from trivial to very difficult. For the sake of focus, we assume the location of the distortion to be known. Our method arises from the assumption that, across many musical genres, the repetition, or variation, of distinct and recurring patterns (themes, melodies, rhythms, etc) is a central stylistic element and thus heavily featured. When listening to music, we detect and memorize such internal redundancies, thereby learning the mid- and large-scale structures of a music piece [mcadams1987]. The exploitation of such redundancies in the computational analysis and processing of music seems only natural and, indeed, has been proposed before, see e.g. [foote1999visualizing, jehan2005creating, jehan2004event] or [rafii2013repeating]. The latter also provides a more extensive discussion of repetition as an essential element of many musical genres. Although music information retrieval (MIR) provides many sophisticated methods for the analysis of micro- and macroscopic structures in music, properly handled, a simple time-frequency analysis can provide all the necessary information to uncover significant similarities in music signals. The contributions of this work are the design of appropriate time-frequency features and their use for generating a map of similarities in music signals, as well as the use of the generated similarity map to drive the automatic concealment of long duration data loss.

I-A Related Work

Self-similarity in music has previously been employed in several areas of music analysis and processing, e.g. beat estimation and segmentation, and is often based on similarity matrices, as proposed by Foote [foote1999visualizing]. The similarity matrix can be constructed from various features, see e.g. [foote2001beat, bartsch2001catch, cooper2002automatic, foote2000automatic]. Self-similarity has also been successfully used for music/voice separation and speech enhancement [rafii2013repeating, rafii2013online]. Finally, the automatic analysis of musical structure based on similarities is already found in [silva2016simple], where it was used across songs for cover song detection. An alternate approach can be found in [jehan2005creating, jehan2004event] (these studies led to the founding of "The Echo Nest", a company specialized in audio feature design). The idea of a similarity graph already appears in the Infinite Jukebox: there, the division of music into short, rhythm-dependent pieces is proposed, each of which is supposed to correspond to a single beat. Local features are obtained for each piece by combining previously established rhythm, timbre and pitch features, but the implementation details of that method are not disclosed. In this contribution, we propose a simple time-frequency feature built from the short-time Fourier magnitude and phase that implicitly encodes rhythmic, timbral and pitch characteristics of the analyzed signal all at once. We build a sparse similarity graph based on this feature that highlights only the strongest connections in a music piece. This similarity graph can be seen as a post-processed variant of Foote's similarity matrix and is used to perform data loss concealment by detecting suitable transitions between similar segments in a piece of music.

The audio inpainting problem has mainly been addressed from a sparsity point of view. The hypothesis is that audio is often approximately sparse in a time-frequency representation, i.e. it can be estimated using only a few time-frequency atoms. Using classical or optimization techniques, algorithms have been designed to inpaint short audio gaps [adler2012audio, siedenburg2013audio]. Such methods strive for approximate recovery of the lost data by sparse approximation in a time-frequency representation such as the short-time Fourier transform (STFT). Both their numerical and perceptual restoration quality quickly degrade when the duration of lost data intervals exceeds  ms. When applied to significantly longer gaps, these methods will simply fade out/in at the gap border and introduce silence in the inner gap region. Audio inpainting is known as "waveform substitution" [goodman1986waveform] by the community addressing packet loss recovery techniques [perkins1998survey]. Most packet loss methods, however, are naturally designed for low delay processing and prioritize computation speed over quality, see also [bahat2015self] for a short overview. In that contribution, Bahat et al. propose an algorithm searching for similar parts of the signal using time-evolving features, conceptually resembling our own contribution. The method in [bahat2015self] is designed for packet loss concealment in speech transmission, however, and was tested only on gaps up to seconds. The reliance on Mel frequency cepstral coefficients (MFCC) is a good match for speech, but not optimally suited for music. In another approach, Martin et al. [martin2011exemplar] proposed an inpainting algorithm taking advantage of the redundancy in tonal feature sequences of a music piece. Their method is able to conceal defects with a length of several seconds, but the performance of this algorithm depends on the amount of repetitive tonal sequences in a music piece [martin2011exemplar] and it was only applied when a recurrence of the lost tonal sequence was present in the reliable signal. It should be noted that parallel work on audio inpainting using self-similarity by Manilow and Pardo [manilow2017leveraging] was presented while the present manuscript was under review.

I-B Structure of the paper

After the introduction, we present the idea of the similarity graph in Section II. The general method and the construction of the graph are presented in Section III. Technical details about the graph construction, such as the exact choice of features and parameters, are deferred to Section IV. In Section V, we detail how the similarity graph can be used for audio inpainting. Finally, in Section VI, the performance of the algorithm is discussed, based on both a basic verification experiment and extensive listening tests.

II A transition graph encoding music structures

The problem we consider, i.e. how to restore a piece of music when an extended, connected piece has been lost or corrupted, often requires us to abandon the idea of exact recovery. In the case where only a short segment (up to about ms) has been lost [adler2012audio], or the signal can be described by a very simple structure [lagrange2005long], it may be possible to infer the missing information from the regions directly adjacent to the distortion with sufficient quality. However, for complex music signals and corruptions of longer duration, such inference remains out of reach. Instead, we employ an analysis of the overarching medium- and large-scale structure of a music piece, determining redundancies in the signal to be exploited in the search for a replacement for the distorted signal segment.

Conceptually, such analysis can be seen as a music segmentation into chorus and verse, motifs and their variation, sections of equal or different meters, etc. [macpherson2008form]. The main difference to our approach is that, instead of working with high-level cognitive concepts such as meter and motifs, we consider a basic time-frequency representation of the signal. In that representation, all the structures contained in a music recording are still preserved, although they are not always easily accessible to the human observer.

It is clear that repetition and less obvious redundancies do not occur to an equivalent degree in every music piece. While they are an essential stylistic element of pop and rock music, certain movements, e.g. in contemporary music, attempt the active avoidance of the familiar. But even if a pattern is not repeated in exactly the same fashion, the conscious variation of previous structures, rhythmic, harmonic or otherwise, is an integral part of most music. Note that the degree of self-similarity inside a single recording may vary greatly.

Going back to the original problem of music restoration, it seems natural to exploit this type of redundancy in the musical piece to be restored. The temporal evolution of spectral content provides a surprisingly suitable first approximation of musical features. Inspired by this observation, we construct an audio similarity graph. The vertices of the graph represent small parts of musical content, while the edges indicate the similarity between the segments in terms of local spectral content. The crucial step towards good performance is the enforcement of temporal coherence. This is achieved by selecting transitions that persist over time, i.e. similarity is not instantaneous, but present for some period of time.

III Method

The ultimate goal of this contribution is to provide a means for autonomous concealment of signal defects with a duration of a few hundred to several thousand milliseconds. Here, we assume that the position of the defects is already known. The restoration should sound natural and respect the overall structure of the signal under scrutiny. For short distortions, this implies, to some degree, the recovery of the lost information in the defective region. For long gaps and dynamic signals, we argue that it is of much greater importance that the transitions between the reliable signal segments and the proposed restoration sound natural. The further away from the transition points we are into the restored region, the less important exact recovery becomes versus the restoration making sense in the signal context. Therefore, we suggest an analysis of the signal structure with the proposed similarity graph, to determine the most natural fit for the distorted region from unaffected portions of the signal. The resulting method is an abstracted and autonomous version of manual restoration by searching the reliable signal for a replacement for the defective region. Since the proposed method forgoes the synthesis of new audio content, and provided that enough reliable signal content is available, it can handle signal defects of arbitrary length without affecting audio quality.

We obtain from a short-time Fourier transform [allen1977unified, Grochening:2001a, ga46] simple similarity features carrying important temporal and spectral information. On the basis of these features, a similarity graph is constructed, representing the temporal evolution and structure of the signal. If some signal segment is known to be defective, it is now sufficient to determine another segment of similar length, such that the beginning and end of the substitute resemble the signal before and after the defect. By placing the candidate segment at the previously corrupted position, the defect can be concealed.

The proposed algorithm, illustrated in Figure 1, searches for a replacement segment that optimally satisfies the three following criteria:

  1. The transitions and (light green dashed lines) resulting from the pasting operation should be perceptually transparent, i.e., the listener should not be able to notice the transition, even if the replacement segment does not correspond exactly to the missing data.

  2. Some leeway is required for placing the transitions around the gap, represented by , . However, the transition areas should not be unnecessarily long.

  3. The length of the piece should remain approximately the same, i.e., the replacement duration should be close to the gap plus its surroundings, .

Some margin for compromise is, however, essential to the construction of a good solution. Since the question of how strictly the reliable content is to be preserved, i.e. how long and may be, is highly application-dependent, a parameter in the optimization scheme enables the tuning of this property.

Figure 1: Illustration of the proposed inpainting method. The determined candidate segment of duration is to be substituted for the gap. The optimal transition points and are determined together with the candidate segment by jointly optimizing (i) the similarity feature at , , (ii) the difference and the length of the necessary transition areas and .

In practice, at least for the inpainting problem, it is unnecessary to construct the full similarity graph. Consequently, we construct a sparsified graph which considers unique and strong matches only. Weak matches are discarded. Only the strongest from a cluster of (temporally close) matches are considered. Finally, only edges connected to at least one node in the vicinity of the gap are relevant, since and are supposed to be small, see Figure 1.

III-A Creation of the similarity graph

The generation of the graph can be structured coarsely into distinct stages. In this section we disregard some technical details, instead concentrating on the general idea. The technical details of the individual steps of our method can be found in Sections IV and V.

1. Compute basic similarity features

To determine temporal similarities in a signal, we have to settle on a feature that encodes the local signal behavior and a distance measure that allows the comparison of feature vectors. For simplicity, and because the results were comparable to more sophisticated features, we settle here on a weighted combination of two features obtained directly from a short-time Fourier transform (STFT) analysis of the signal. Let be the matrix of short-time Fourier coefficients, with denoting the coefficient obtained at the -th time position in the -th channel. , see Section IV, can be decomposed uniquely into its magnitude and phase as

Since the human auditory system perceives loudness approximately as a logarithmic function of sound pressure, the first part of our proposed feature is essentially a time slice of the dB-spectrogram, i.e.

Note that direct spectrogram features have already proven to be useful in other applications, e.g. repetition-based source separation, see [rafii2013repeating].

Additionally, the time-direction partial derivative of the phase provides an estimate of the local instantaneous frequency [augfla95:reassign, Holighaus:2016:RSG:2910117.2910282]. Let denote the -matrix containing the values of the time direction partial derivative of the phase, i.e. . The second part of our proposed feature is essentially

and . While puts a strong emphasis on signal components of high amplitude, attains large values for sinusoidal, or slowly frequency-varying, components independent of their magnitude, see also Figure 2. This second part of the feature serves to emphasize low amplitude harmonic components, which may be highly important for perceived similarity. The actual feature , defined in Section IV, is conceptually equivalent to , but implements some additional scaling. Locality of the features is implied by obtaining the features from a STFT. The distance between two features at is simply the squared Euclidean distance of and .
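As an illustration of the first sub-feature and the distance measure, the following minimal sketch converts two hypothetical STFT time frames to dB magnitudes and compares them with the squared Euclidean distance. The function names, the numerical floor `floor` and the toy coefficient values are assumptions made for this example; the phase-derived sub-feature and the additional scaling of Section IV are deliberately left out.

```python
import math

def db_feature(frame, floor=1e-10):
    """dB magnitude of each complex STFT coefficient in one time frame.

    `floor` is a hypothetical guard against log(0), not a value from the text.
    """
    return [20.0 * math.log10(max(abs(s), floor)) for s in frame]

def squared_distance(f_a, f_b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(f_a, f_b))

# Two toy time frames with three frequency channels each (assumed values).
frame_a = [1 + 0j, 0.1j, 0.01 + 0j]
frame_b = [1 + 0j, 0.1 + 0j, 0.1 + 0j]

fa = db_feature(frame_a)   # roughly [0, -20, -40] dB
fb = db_feature(frame_b)   # roughly [0, -20, -20] dB
dist = squared_distance(fa, fb)
```

Note that the phase of a coefficient does not influence this part of the feature: the second channel differs only in phase between the two frames and contributes nothing to the distance.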

Figure 2: Local audio features for an exemplary audio signal. The log-spectrogram (top) encodes the time-dependent intensity of frequency components. The smoothed partial phase derivative (bottom) has large values in the area of stable, harmonic components, independent of the component magnitude.

2. Create a preliminary similarity graph

The full (unprocessed) similarity graph determined from the given feature vectors would simply have all the time positions as vertices and edges connecting each vertex to every other vertex, with the associated weights derived from the distance between the associated features.

The creation of such a graph is not only very expensive, but we are further only interested in a small number of strongest connections for every vertex. Therefore, we only determine the nearest neighbors, in terms of feature distance. Since this operation is expensive, we use the FLANN library (Fast Library for Approximate Nearest Neighbors) [muja2014scalable] to efficiently provide an approximate solution. For the determined neighbors, the edge weights are recorded in the adjacency matrix as


for some , following a traditional graph construction scheme, see also Figure 3 (left).
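The construction of the preliminary graph can be sketched as follows. This example swaps the approximate FLANN search for an exact brute-force search, which is only practical for tiny inputs, and it assumes an exponential weight of the form `exp(-d / sigma)` applied to the squared feature distance `d`; this form is an assumption chosen to agree with the statement that each weight is at most 1. All names and the toy features are hypothetical.

```python
import math

def knn_weight_matrix(features, k, sigma):
    """Weight matrix with nonzero entries only for the k nearest neighbors.

    Exact brute-force search stands in for the approximate FLANN search.
    The weight exp(-d / sigma) is an assumed form, not taken from the text.
    """
    n = len(features)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        dists = []
        for j in range(n):
            if j == i:
                continue  # no self-loops
            d = sum((a - b) ** 2 for a, b in zip(features[i], features[j]))
            dists.append((d, j))
        dists.sort()
        for d, j in dists[:k]:
            weights[i][j] = math.exp(-d / sigma)
    return weights

# Four toy feature vectors forming two tight pairs (assumed values).
features = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.1]]
W = knn_weight_matrix(features, k=1, sigma=1.0)
```

With `k=1`, each vertex keeps a single outgoing edge to its closest neighbor, so the resulting matrix is sparse and need not be symmetric.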

3. Enhance time-persistent similarities

The individual features obtained from the STFT usually characterize the signal's properties on a local time interval and do not capture the signal's long-term spectral characteristics. In order to capture longer temporal structures of a signal, we refine the graph by emphasizing its edges whenever a sequence of features at consecutive time positions is similar to another. In practice, this is achieved by convolving the weight matrix with a diagonal kernel , for some , with

The resulting adjacency matrix is given as


see also Figure 3 (middle). Note that, in order to obtain an -matrix and for the above equation to be valid is implicitly extended to an -matrix with zeros on any side.
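The effect of the diagonal convolution can be illustrated with a small sketch. It sums the weights along the diagonal direction, so that an entry is reinforced when its diagonal neighbors are also nonzero, and it treats entries outside the matrix as zeros. The function name, the centering of the kernel and the flat kernel used in the example are assumptions; the actual kernel shape is the one shown in Figure 6.

```python
def diagonal_convolve(weights, kernel):
    """Accumulate kernel-weighted entries along the main-diagonal direction.

    out[i][j] = sum_t kernel[t] * weights[i + t - half][j + t - half],
    with zeros assumed outside the matrix. The kernel is assumed symmetric
    and centered, so convolution and correlation coincide.
    """
    n = len(weights)
    half = len(kernel) // 2
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = 0.0
            for t, coeff in enumerate(kernel):
                ii, jj = i + t - half, j + t - half
                if 0 <= ii < n and 0 <= jj < n:
                    total += coeff * weights[ii][jj]
            out[i][j] = total
    return out

# Toy 5x5 weight matrix: a diagonal run of three matches and one solitary match.
W = [[0.0] * 5 for _ in range(5)]
W[0][2] = W[1][3] = W[2][4] = 1.0   # time-persistent similarity
W[4][0] = 1.0                        # isolated similarity

W_conv = diagonal_convolve(W, kernel=[1.0, 1.0, 1.0])
```

In this toy case the center of the diagonal run is raised to 3, while the solitary entry stays at 1, which is what makes a subsequent threshold able to separate the two.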

Figure 3: Weight matrix based on feature vectors calculated for an exemplary audio signal without a gap. Left panel: Preliminary weight matrix, , of the initial graph. Center panel: Convolved weight matrix, . Right panel: Excerpt of the weight matrix, , of the sparsified graph.

4. Delete insignificant similarities/Merge clustered similarities

After the convolution with the diagonal kernel, the weight matrix of our graph has been populated with a large number of nonzero entries, clustered around the entries of . The maxima of such clusters represent the strongest similarities between two regions of the signal. Moreover, only strong connections indicate significant similarities. Therefore, we delete all edges with weights below a certain threshold and select from every cluster of connections only the strongest, i.e. the one with the locally largest weight. This last step leads to the weight matrix which is associated with the graph we use for our inpainting algorithm. For an example of the final, sparsified adjacency matrix, see Figure 3 (right). Figure 4 shows the difference between the original graph after Step 2 and part of the refined graph after Step 4.

Figure 4: Graphs based on feature vectors calculated for an exemplary audio signal without a gap. Left panel: Initial non-sparse graph, , corresponding to the weight matrix , shown in Figure 3 (left). Right panel: Sparse graph, (only local maximum weights above the threshold considered), corresponding to the weight matrix , shown in Figure 3 (right).

III-B Application: Audio inpainting and the reduced similarity graph

The usage of the similarity graph for solving an inpainting problem is rather straightforward. According to the paradigm described in Figure 1, we want to find two edges and , such that

  • is close to the beginning of the distorted region and is close to its end,

  • is approximately equal to and

  • and are large.

An appropriate choice of and is determined by optimizing these criteria over all possible choices, for and in some limited range around the signal defect. The signal segment corresponding to the local features is then substituted for the original signal in the range corresponding to .

For the purpose of inpainting, we are only interested in edges that connect to at least one vertex either shortly before, or shortly after, the signal defect. Hence, only a small horizontal (or vertical) slice of the sparse matrix has to be computed, greatly reducing the complexity of the graph creation. Figure 5 shows an example of such a reduced graph (not to be confused with the sparse graph) and the determined transitions indexed by and indexed by for an exemplary signal and defect. In practice, we use the reduced graph for all experiments in this paper.

Figure 5: Final reduced graph based on exemplary audio features calculated from an audio signal with a gap. The regions considered for the transitions are in gray with the gap in between them in white. All available transitions for the reconstruction are in light gray with the optimally selected and in blue. The node indices and correspond to the beginnings and the ends of the transitions and

IV The similarity graph in detail

IV-A Local audio features

Building a similarity graph for full music pieces from STFT features is in practice challenging simply due to the number and size of the obtained features. To be efficient, the number of features has to remain small in contrast to the complexity of audio signals. Our solution leverages two techniques to obtain a good trade-off: 1) an adequate sub-sampling, and 2) a tight low-redundancy STFT.

While audio signals are often sampled at a very high rate, to compute reliable audio features, a much lower rate is usually sufficient. We choose a maximum sampling rate of  Hz (default  kHz, see Table II for all default parameters). If a given signal is sampled at a higher rate  Hz, is decimated with a decimation factor , after the application of an anti-aliasing filter. We denote the decimated signal by .

The short-time Fourier transform (STFT) of with respect to a (real-valued) window function , hop size and channels is defined as

for and . Recall the decomposition of into magnitude and phase: , , . By default, we choose to equal a -point Itersine window [wesfreid1993adapted], and . This particular construction leads to an redundant tight frame, hence preserving equally each signal frequency component.
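The tightness property can be checked numerically for the window alone. The sketch below assumes the common definition of the Itersine window, sin(π/2 · cos²(πx)) sampled on [-1/2, 1/2), and a hop size of half the window length; both the window length of 64 samples and this hop size are assumptions for the example, not the defaults of this section. When the number of channels is at least the window length, a constant sum of squared, hop-shifted windows is what makes the STFT a tight frame.

```python
import math

def itersine(length):
    """Itersine window sampled on [-1/2, 1/2) (assumed common definition)."""
    return [
        math.sin(0.5 * math.pi * math.cos(math.pi * (n / length - 0.5)) ** 2)
        for n in range(length)
    ]

def overlap_energy(window, hop):
    """Sum of squared window samples over all hop-shifted copies.

    Returns one value per position within a hop; assumes hop divides the
    window length.
    """
    return [
        sum(window[n + m] ** 2 for m in range(0, len(window), hop))
        for n in range(hop)
    ]

window = itersine(64)                     # hypothetical window length
energy = overlap_energy(window, hop=32)   # hypothetical hop: half the window
```

Every entry of `energy` equals 1 up to rounding, since the two overlapping window copies reduce to a sine and a cosine of the same argument.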

The separate parts and of the feature vector are obtained as follows.


Let , and . For more convenient handling, is limited to a fixed range and peak-normalized, resulting in

where , if , and otherwise. By default,  dB. Figure 2 (top) shows for an exemplary audio signal.
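One plausible reading of the range limitation and peak normalization is sketched below: values more than a given dynamic range below the peak are clipped, and the result is mapped linearly so that the peak becomes 1 and the clipping floor becomes 0. The mapping to the interval [0, 1], the function name and the 60 dB range used in the example are assumptions for illustration only.

```python
def limit_and_normalize(db_values, dynamic_range_db):
    """Clip a dB spectrogram to a range below its peak and rescale to [0, 1].

    `db_values` is a list of rows of dB magnitudes. The linear mapping to
    [0, 1] is an assumed normalization, not taken from the text.
    """
    peak = max(max(row) for row in db_values)
    floor = peak - dynamic_range_db
    return [
        [(max(v, floor) - floor) / dynamic_range_db for v in row]
        for row in db_values
    ]

# Toy dB spectrogram (assumed values) and a hypothetical 60 dB range.
spectrogram_db = [[0.0, -30.0], [-60.0, -90.0]]
normalized = limit_and_normalize(spectrogram_db, dynamic_range_db=60.0)
```

Clipping keeps very quiet coefficients from dominating the squared Euclidean distance through large negative dB values.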

Relative instantaneous frequency

In [augfla95:reassign], the authors show that an instantaneous frequency estimate can be associated to by


where and is a discrete derivative of . The second term in the equation above is in fact an equivalent expression for the partial derivative of , with respect to . might fluctuate quickly and its range depends on . Both these properties are undesired for our purpose. Therefore, we consider only its relative part, i.e. the second term in Eq. 3, and perform a channel-wise smoothing of each , , by convolution with a localized kernel (default: -point Hann window). Additionally, the expression for is unstable in regions of small magnitude [xxlbayjailsoend11]. With , we define

The combined feature vector is obtained as

for . We choose a default value of , since this choice resulted in similar importance placed on both sub-features.

IV-B Creation of the similarity graph

When it comes to the graph creation, we desire an automatic parameter selection adapting to the audio features. For the creation of the initial graph, we only need to determine the value of in the expression (1) for the preliminary weight matrix. Denoting as the set of approximate nearest neighbors of the vertex , our solution is to set to the average squared nearest neighbor distance

Thus if and are close, and decreasing towards , the more and differ. Our experiments showed that of is a good default value, which should be increased if the music is expected to be very redundant.
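The automatic choice of the scale parameter can be sketched as a plain average over all recorded neighbor pairs. The function name and the representation of the neighbor sets as lists of indices are assumptions for this example, and the feature distance is taken to be the squared Euclidean distance used throughout.

```python
def average_neighbor_distance(features, neighbors):
    """Average squared Euclidean distance between vertices and their neighbors.

    `neighbors[i]` lists the indices of the (approximate) nearest neighbors
    of vertex i; this data layout is a hypothetical choice.
    """
    total, count = 0.0, 0
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            total += sum((a - b) ** 2 for a, b in zip(features[i], features[j]))
            count += 1
    return total / count

# Three one-dimensional toy features with one neighbor each (assumed values).
features = [[0.0], [1.0], [3.0]]
neighbors = [[1], [0], [1]]
sigma = average_neighbor_distance(features, neighbors)
```

Tying the scale to the data in this way keeps the weights in a comparable range across recordings with very different feature spreads.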

To obtain from in (2), the length of the convolution kernel must be fixed. After the convolution, the edges in the graph describe the similarity of signal segments of  seconds duration. The choice of determines the importance of long duration similarities over such with short duration. We used as a default value in order to consider roughly half-second segments for signals sampled at  kHz, see Figure 6.

Figure 6: Convolution kernel used to enhance the diagonal shape of the weight matrix. Here .

To transition from to , we first perform a thresholding by . In , each entry can be 1 at maximum, see (1). In , solitary entries will be smaller than 1 and entries surrounded by other high-valued entries will be larger than 1. In order to suppress solitary entries, we used as a default value. The final step consists of selecting the local maxima by choosing points that are equal to or larger than their four direct neighbors. In detail, is defined as

The notation should be interpreted as the collection of all possible choices, i.e. the maximum over all direct neighbors in
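The thresholding and local maximum selection described above can be sketched as follows. Entries outside the matrix are treated as zeros, and the threshold value used in the example is hypothetical; the function and variable names are likewise assumptions.

```python
def sparsify(weights, threshold):
    """Keep entries that reach the threshold and are not smaller than any of
    their four direct neighbors (up, down, left, right)."""
    n_rows, n_cols = len(weights), len(weights[0])

    def value(i, j):
        # Entries outside the matrix count as zero.
        if 0 <= i < n_rows and 0 <= j < n_cols:
            return weights[i][j]
        return 0.0

    out = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            w = weights[i][j]
            if w < threshold:
                continue  # insignificant similarity
            direct = (value(i - 1, j), value(i + 1, j),
                      value(i, j - 1), value(i, j + 1))
            if w >= max(direct):
                out[i][j] = w  # strongest entry of its cluster
    return out

# Toy cluster of convolved weights around one strong match (assumed values).
W_conv = [
    [0.5, 1.2, 0.0],
    [1.2, 2.5, 1.1],
    [0.0, 1.3, 0.4],
]
W_sparse = sparsify(W_conv, threshold=1.0)
```

Of the five entries that pass the threshold in this toy cluster, only the central one survives, since each of the others has a larger direct neighbor.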

When applying the calculation of the transition graph to a signal where the distorted area is known, the computational cost can be further reduced. In particular, only a partial transition graph needs to be computed because we are only interested in outgoing connections within the short region before, and incoming connections within a short region immediately after the distortion. Conceptually, we consider only small and , cp. Figure 1 and Section V-A. Therefore, the nearest neighbors search and all the following operations are not performed on all nodes, but only for a small subset of features in the direct vicinity of the signal defect. This allows us to not only greatly reduce the computation cost, but also reduce the size of the optimization problem described in Section V-A. An example of such a resulting graph is given in Figure 5.

V The inpainting step in detail

V-A Selection of optimal transitions

To select the optimal transition, we need to transform the three conditions of Section III-B into a mathematical objective function. Let denote the indices of the nodes corresponding to the start and end of the distorted region. In the notation of the previous section, only edges with and are considered acceptable. In our experiments, we observed that setting to a length corresponding to approximately 5 seconds yielded good results. The region considered for possible transitions can be seen as the red interval in Figure 5.

Among all acceptable edges, we search for the solution that minimizes the objective function


Compare the definition of with Figure 1 to see that: The first term controls the difference , the second term the distances from the defect and the third term controls the quality of the transitions. By tuning and , we can vary the importance of the individual terms. In our experiments, and have provided good results.

Since the number of acceptable transitions is small, the computational benefit from using a sophisticated optimization algorithm is negligible. Hence, we solve the optimization problem by simply computing exhaustively the values of the objective function for each set with and .
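The exhaustive search can be sketched as below. The cost function is an assumed stand-in with the three ingredients named above (length mismatch, distance from the defect, transition quality), combined linearly with two hypothetical weights `alpha` and `beta`; it should not be read as the exact objective of this section. Edges are represented as hypothetical `(source, target, weight)` triples in node indices.

```python
from itertools import product

def select_transitions(edges_before, edges_after, gap_start, gap_end,
                       alpha=1.0, beta=1.0):
    """Exhaustively pick one edge leaving the signal before the gap and one
    edge re-entering it after the gap.

    edges_before: (a, b, w) with a before the gap, b the candidate start.
    edges_after:  (c, d, w) with c the candidate end, d after the gap.
    The linear cost below is an assumption made for illustration.
    """
    best = None
    for (a, b, w1), (c, d, w2) in product(edges_before, edges_after):
        if c <= b:
            continue  # candidate segment must run forward in time
        if b <= gap_end and c >= gap_start:
            continue  # candidate segment must not contain the defect
        length_mismatch = abs((c - b) - (d - a))
        margin = (gap_start - a) + (d - gap_end)
        cost = length_mismatch + alpha * margin - beta * (w1 + w2)
        if best is None or cost < best[0]:
            best = (cost, (a, b), (c, d))
    return best

# Toy example: gap between nodes 100 and 120 (all values assumed).
edges_before = [(98, 40, 2.0), (95, 300, 3.0)]
edges_after = [(64, 122, 2.5), (330, 125, 1.0)]
best = select_transitions(edges_before, edges_after, gap_start=100, gap_end=120)
```

Here the pair jumping from node 98 to node 40 and back from node 64 to node 122 is selected: it replaces 24 nodes by 24 nodes, stays close to the gap, and uses two strong edges.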

V-B Signal reconstruction

When two audio signals are concatenated naively, discontinuities and phase jumps might result in clicking artifacts. To reduce these effects, a smoothed transition is clearly preferred. We propose the following: Since the features are obtained from a STFT with time step , with respect to a possibly decimated signal, the time resolution of the similarity graph analysis equals samples. In other words, the preliminary solution obtained in the previous step suggests the insertion of the signal samples in place of . To further improve the transition, we allow the transition positions and to be adjusted by up to half the similarity graph's time resolution, i.e. samples. The optimal adjustment is determined by maximizing a correlation, as proposed in [bahat2015self] and described below. Denote by the length of the analysis window and . The final transitions are given by , where

Here, is the vector

The obtained indices maximize the correlations between the original signal and the inpainting candidate.
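The idea of refining a transition position by maximizing a correlation can be illustrated with a short sketch. The exact expression used by the authors is not reproduced here; the code below assumes a plain (unnormalized) inner product between a window around the position in the original signal and a shifted window around the candidate position. The function name and all parameters are hypothetical.

```python
def refine_position(signal, ref_center, cand_center, half_window, max_shift):
    """Shift cand_center by at most max_shift samples so that the window
    around it correlates best with the window around ref_center."""
    ref = signal[ref_center - half_window:ref_center + half_window]
    best_shift, best_corr = 0, float("-inf")
    for shift in range(-max_shift, max_shift + 1):
        start = cand_center + shift - half_window
        stop = start + 2 * half_window
        if start < 0 or stop > len(signal):
            continue  # skip windows that would leave the signal
        cand = signal[start:stop]
        corr = sum(r * c for r, c in zip(ref, cand))
        if corr > best_corr:
            best_shift, best_corr = shift, corr
    return cand_center + best_shift


# Toy signal: the same pattern occurs around samples 24 and 52.
pattern = [0.0, 1.0, 3.0, -2.0, 5.0, -1.0, 2.0, 0.0]
signal = [0.0] * 20 + pattern + [0.0] * 20 + pattern + [0.0] * 20
# The graph-level estimate (50) is two samples off; the search corrects it.
refined = refine_position(signal, ref_center=24, cand_center=50,
                          half_window=4, max_shift=3)
print(refined)
```

In the setting of the text, `max_shift` would correspond to half the time resolution of the similarity graph and the window to the analysis window; a normalized correlation could be substituted if the two windows differ strongly in energy.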

In order to obtain smooth transitions in the restored signal, we perform a time-frequency domain cross-fading. Conceptually, this requires us to consider different arrays of short-time Fourier coefficients with time step offsets , and , respectively:

This ensures that in , the -th time frame is centered at the signal position and in , the -th time frame is centered at position . The analysis window is chosen, such that on the undecimated signal , it mimics acting on . Hence, its length and the number of channels are set to . Thus, by default, we choose to also be of Itersine shape.

The restored signal can now be obtained by applying the inverse STFT to the combined matrix,

In practice, complexity is further reduced without altering the result, by computing , , only for the time-positions relevant to the cross-fading, thus obtaining two small submatrices of . Note that any coefficient vector , , , only affects the reconstruction on an interval equal to the window length . Hence, both transitions have a duration of and the first can be recovered from

and similarly for the second transition. Here, is a generous estimate of the ratio between the window length and the hop size. The inverse STFT is then applied to these submatrices and the cross-fade regions, which are obtained as the central part of those inverse STFTs, are placed at the desired position in the signal. All other operations are performed in the time domain. To ensure equivalence with a complete STFT computation, the segments have to start/end samples before/after the cross-fading.

VI Numerical evaluations

In this section we provide a numerical evaluation of the proposed algorithm.

First, we verify the algorithm in a setting where the gap content is provided with the remaining signal. A correct implementation should be able to perfectly replace the gap by exactly the lost content. Second, we investigate the algorithm’s computational performance in terms of average runtime.

For the evaluations, the algorithm was implemented in MATLAB. The implementation is based on LTFAT [ltfatnote030] for feature extraction, and on the GSPBox [perraudin2014gspbox] for graph creation. For non-commercial use, the algorithm is available online, alongside a browser-based demonstration. Table II provides a summary of the algorithm parameters used for the evaluations.

VI-A Verification

Here, we address the question of whether the algorithm perfectly recovers the gap when an exact copy of the missing segment is present within the reliable signal. For this purpose, we used a set of uncorrupted audio signals with various content and at the sampling rate of Hz. First, redundant signals were created by repeating the signal, i.e., placing a copy of the signal at its end. Then, each redundant signal was corrupted by creating a gap of seconds. For each signal, the experiment was repeated five times with a randomly chosen gap position, yielding corrupted signals. The algorithm was then applied to each of the corrupted signals. In all reconstructions, the -norm difference between the original and reconstructed signals was in the range of numerical precision, implying that each corrupted signal was perfectly restored. Hence, we consider the implementation of the presented algorithm as verified.
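The verification protocol itself (repeat the signal, cut a gap, restore, compare norms) can be sketched in a few lines. The sketch below does not run the proposed algorithm: `restore_from_repeat` is a hypothetical placeholder that simply copies each lost sample from its duplicate one period away, which is the result a correct implementation is expected to reach in this setting. Signal length, gap length and the random seed are arbitrary choices.

```python
import math
import random


def corrupt(signal, gap_start, gap_length):
    """Return a copy of the signal with the gap samples set to zero."""
    out = list(signal)
    out[gap_start:gap_start + gap_length] = [0.0] * gap_length
    return out


def restore_from_repeat(corrupted, gap_start, gap_length, period):
    """Placeholder restorer: fill each lost sample from its copy one period away."""
    out = list(corrupted)
    for n in range(gap_start, gap_start + gap_length):
        out[n] = corrupted[n + period] if n < period else corrupted[n - period]
    return out


def l2_difference(a, b):
    """Euclidean norm of the sample-wise difference of two signals."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


rng = random.Random(0)
base = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
redundant = base + base            # the signal followed by a copy of itself
errors = []
for _ in range(5):                 # five randomly placed gaps, as in the text
    gap_length = 100
    gap_start = rng.randrange(0, len(redundant) - gap_length)
    corrupted = corrupt(redundant, gap_start, gap_length)
    restored = restore_from_repeat(corrupted, gap_start, gap_length, len(base))
    errors.append(l2_difference(redundant, restored))
print(max(errors))
```

Replacing `restore_from_repeat` with a call to an actual inpainting routine turns this into the check described above, with the expectation that the largest error stays within numerical precision.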

VI-B Computational complexity

The algorithm can be separated into different steps that all have different computational requirements. Here, we investigated the individual costs of each step and their relative importance in the overall performance of the algorithm. The evaluation was performed on a modern notebook ( GHz Intel i7, 2 cores, GB RAM) for the same set of corrupted signals as in Sec. VII. Table I shows mean and standard deviation of the computation time per minute of audio signal. On average, each minute of audio signal required -s computation time for the reconstruction.

The feature computation, graph creation and the selection of the optimal transition, performed on the reduced sparse graph (see Figure 5), scale linearly with the length of the provided reliable data, in terms of both storage and time complexity. Consequently, we report the timing per minute of analyzed music. In all our experiments, the reliable data was given by a full song, without the corrupted segment. Note that linear complexity can only be achieved by considering the reduced graph. For the full sparse graph , complexity of the graph creation is and the transition selection would even scale roughly quadratically, i.e. be . Even if the selection is restricted to the range considered in the reduced graph, linear complexity would be out of reach. In that case, the computation time per minute is no longer a reliable indicator. We only remark that, on the dataset used, the graph construction was on average times slower, while the average duration for transition selection increased by a factor of , when performed on the full sparse graph. Although we did not systematically evaluate the memory usage of the method, it should be noted that restricting to the reduced graph is considerably more efficient in that regard as well.

If multiple corruptions are to be removed using the same set of reliable data, the algorithm benefits from the fact that features only need to be computed once. Since the feature computation is the bottleneck of the method (this can be seen in Table I), this may lead to significant boosts of computational performance in the case of multiple gaps.

Processing step Reduced graph (Mean) Reduced graph (STD)
Feature extraction
Graph construction
Transition selection
Signal reconstruction
Table I: Average execution time of the proposed methods per minute of provided audio (Based on a database of songs) for the reduced graphs, see Fig. 5.

VII Perceptual evaluation

In order to estimate the potential of the proposed algorithm for music, we conducted a psychoacoustic test, in which we evaluated the impact of the artifacts occurring from inpainting various songs from a music database. In particular, we were interested in addressing the following questions:

  1. How often are subjects able to detect an alteration (detectability)? The answer gives us access to how often our algorithm is able to fool the listener.

  2. How precisely can subjects pinpoint the alteration? The answer gives us an indication of the inpainting quality and of the confidence of the test subject.

  3. How disturbing are the detected artifacts (severity)? The answer provides some good insights into the reconstruction quality even when the listener is not fooled.

  4. Is the familiarity of the song correlated with the detectability or the severity? The answer gives some intuition about the quality of the reconstruction and ensures that we are not only fooling the non-familiar test subjects.

In order to ensure that our experiment provides meaningful results that truly describe the potential of the proposed algorithm, our subjects were familiar with the tested music genres, and we collected ratings of familiarity with and liking of the songs.

VII-A Testing methodology


The sound material consisted of songs from the following genres: pop, rock, jazz, classical. These genres were selected to cover the most common listening habits and, with respect to musical structure, also represent many other, similar genres such as blues, country, folk, oldies, hip-hop, etc. Six songs per genre were selected from hundreds of songs with the aim of representing the genre well.


In order to test subjects familiar with our material, in a self-assessment questionnaire, a candidate had to provide the average weekly listening duration (in hours) to the genres pop, rock, jazz, classical, and others. For the evaluation, only candidates listening at least 4 hours per week to music from all four main genres in total were considered. In total, 15 subjects were selected for the test. They were paid on an hourly basis.

Figure 7: The interface used in the experiment. See text for more details.


In each trial, the subject listened to a sound stimulus and was asked to pay attention to a potential artifact (see Fig. 7). A slider scrolled horizontally while the sample was played, indicating the current position within the stimulus. The subject was asked to tag the artifact’s position by aligning a second slider with the beginning of the perceived artifact. Then, while listening again to the same stimulus, the subject was asked to confirm (and re-align if required) the slider position and answer three questions:

  1. Severity (S): How poor was it ("Wie schlimm ist es")? The possible answers were: (0) no issue ("Kein Fehler"), (1) not disturbing ("Nicht störend"), (2) mildly disturbing ("Leicht störend"), and (3) not acceptable ("Nicht akzeptabel").

  2. Familiarity (F): How familiar are you with this song ("Wie gut kennen Sie dieses Stück"): (0) never heard before ("Noch nie gehört"), (1) I have heard it before ("Schon mal gehört"), (2) I often listen to ("Höre ich öfters"), (3) I know it well ("Kenne ich gut"), and (4) I can play/sing it ("Kann ich spielen/singen").

  3. Liking (L): How do you like this song ("Wie gefällt Ihnen dieses Stück"): (0) not at all ("Gar nicht"), (1) I cannot tell ("Kann nicht sagen"), (2) nice ("Nett"), (3) very nice ("Sehr nett"), and (4) amazing ("Bin begeistert").

The questions were answered by tapping on the corresponding category. Then, the subject continued with the next trial by tapping the "next" button.

Before the experiment, the subject was informed about the purpose and procedure of the experiment and an exemplary reconstruction was presented. Any questions with respect to the procedure were clarified.


Three conditions were tested. For the inpainting condition, the song was corrupted at a random place with a gap of 1 s duration and then reconstructed with the default parameters from Tab. II. The reconstructed song was cropped 2 to 4 seconds (randomly varying) before and after the gap, resulting in samples of 5 to 9 s duration. The gap was not allowed to be within the first and last 30 s of the song, but the inpainting was allowed to use the full song for processing. For the reference condition, the song was cropped at a random place with a duration varying from 5 to 10 seconds. The reference condition did not contain any artifact and was used to estimate the sensitivity of a subject. For the click condition, a click was superimposed on the song at a random position and the result was cropped 2.5 to 4.5 s before and after the click’s position, resulting in samples of 5 to 9 s duration. The artifact in this condition was used as a reference artifact and was clearly audible. (For other music genres such as electronic music, the click might not always be audible, and another type of reference artifact would have been required.)

In total, three inpainted stimuli, one reference stimulus, and one click stimulus were created per song.

The combination of genres, songs per genre, and conditions per song (4 genres × 6 songs × 5 stimuli) resulted in a block of 120 stimuli. All stimuli were normalized in level (the click condition was normalized before superimposing the click). Within the block, the order of the stimuli and conditions was random.

Each subject was tested with two blocks, resulting in 240 trials per subject in total. Subjects were allowed to take a break at any time, with one planned break per block. For each subject, the test lasted approximately 2.5 hours.

VII-B Results

Detection rate of the artifacts

The detection results are shown in the left panel of Fig. 8. The average detection rates for the click, inpainting, and reference conditions were , , and , respectively. The high detection rate and small variance in the click condition demonstrate that our subjects were attentive and that even a single click was clearly audible to them. The clearly non-zero rate in the reference condition shows that our subjects were highly motivated to find artifacts. The detection rate in the inpainted condition was between those from the reference and click conditions. Note that the reference condition did not contain any artifacts; thus, the artifact detection rate in that condition is here referred to as the false-alarm rate.

The large variance of the false-alarm rate shows that it is listener-specific. Thus, for further analysis, the detection rates from the inpainted condition were related to the listener-specific false-alarm rate, i.e., the sensitivity index was used [macmillan2004detection]. The false-alarm rate can be considered a reference for guessing; thus, a sensitivity index of 1 indicates that the artifacts were detected at chance level. The right panel of Fig. 8 shows the statistics of the sensitivity index for the inpainting and the click conditions. For the click condition, the average across all subjects was , again demonstrating a good detectability of the clicks. For the inpainting condition, the average was , i.e., slightly above guessing. A t-test performed on the listeners’ sensitivity indices showed a significant difference from guessing, indicating that our listeners, as a group, were often able to detect the artifacts better than guessing. A listener-specific analysis, however, showed that only seven out of our 15 subjects were able to detect the inpainting better than chance, as revealed by a 2-by-2 contingency table analysis with the false-alarm and inpainting-detection rates evaluated at a significance level of 0.05.
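The listener-specific analysis can be illustrated with a short sketch. Two assumptions are made and should be read as such: first, the sensitivity index is taken to be the ratio of the inpainting-detection rate to the false-alarm rate, so that 1 corresponds to chance; second, the 2-by-2 contingency table is evaluated with a one-sided Fisher exact test, since the text does not name the test that was used. The trial counts of 144 inpainted and 48 reference stimuli per subject follow from the design (24 songs, three inpainted and one reference stimulus per song, two blocks), whereas the detection counts are invented.

```python
from math import comb


def sensitivity_ratio(hits, n_signal, false_alarms, n_reference):
    """Detection rate divided by the false-alarm rate (1.0 = chance level)."""
    return (hits / n_signal) / (false_alarms / n_reference)


def fisher_one_sided(hits, n_signal, false_alarms, n_reference):
    """One-sided Fisher exact test on the 2-by-2 table
    [[hits, misses], [false alarms, correct rejections]]:
    probability of observing at least `hits` detections among the signal
    trials if detections were spread over all trials by chance."""
    total = n_signal + n_reference
    detected = hits + false_alarms
    upper = min(n_signal, detected)
    tail = sum(comb(n_signal, k) * comb(n_reference, detected - k)
               for k in range(hits, upper + 1))
    return tail / comb(total, detected)


# Hypothetical listener: 72 of 144 inpainted and 12 of 48 reference
# stimuli were reported as containing an artifact.
ratio = sensitivity_ratio(72, 144, 12, 48)
p_value = fisher_one_sided(72, 144, 12, 48)
print(ratio, p_value < 0.05)
```

A listener with a false-alarm rate of zero would make the ratio undefined; such a case would need separate handling, for example by reporting the raw rates instead.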

Figure 8: Detectability of inpainting artifacts is much lower than that of clicks but slightly higher than guessing. Left: Statistics of the rate of perceived artifacts across all subjects. Right: Statistics of the sensitivity index, i.e., the inpainting-detection rate relative to the false-alarm rate, across all subjects. A sensitivity index of 1 corresponds to the chance rate. Condition: Reference (R), inpainted (I), and click (C). Statistics: Median (circle), 25% and 75% quartiles (thick lines), coverage of 99.3% (thin lines, assuming normal distribution), outliers (crosses, horizontally jittered for better visibility).

Influence of familiarity on the detectability

A natural question that arises for this method is to what extent familiarity with a song influences the detectability of the artifacts. While a comprehensive answer to this question is beyond the scope of this paper and would require a whole new study, here we aim at a brief impression for our subject pool.

Fig. 9 shows the detection rate (left panel) and the sensitivity index (right panel) as functions of the familiarity ratings. While there seems to be a correlation between detectability and familiarity, the link is surprisingly weak. There seems to be nearly no difference in the detection rates between songs with familiarity ratings between 2 ("I often listen to") and 4 ("I can play/sing it"), while there seems to be some difference to the ratings of lower familiarity. Interestingly, even for the very familiar songs, the detection rate is much lower than for clicks and the sensitivity index is only twice as large as that for the chance rate.

Figure 9: Detectability is only weakly related to familiarity. Left: Statistics of the rate of perceived artifacts across all subjects as a function of the familiarity rating. Right: Statistics of the sensitivity index as a function of the familiarity rating. All other conventions as in Fig. 8.

Detection of the artifact position

Subjects who successfully detected an artifact should be able to provide information about its position within the stimulus, i.e., the perceived position of the artifact should correlate with its actual position. The left panel of Fig. 10 shows the perceived positions plotted versus the actual positions of the artifacts, for an exemplary average listener. The reported position might refer to the beginning or the end of the gap, with the choice even varying from stimulus to stimulus. Thus, we correlated the reported position with the beginning, the end, and the nearer of the two positions (referred to as "best choice"). The "best choice" positions are highlighted by triangles.

The statistics of the correlation coefficients across all subjects are shown in the center panel of Fig. 10. The moderate correlations indicate that as soon as our subjects detected an artifact, they had some estimate of its position within the stimulus. In contrast, for the clicks, the high correlation indicates that our subjects were able to exactly determine and report the position of the click artifact.

In order to determine the precision in reporting the artifact’s position, we also calculated the difference between the perceived and the actual position of the artifact. The standard deviation of these differences calculated for a subject is referred to as the precision error. Its statistics across subjects are shown in the right panel of Fig. 10. For the click condition, the average precision error across all subjects was  ms. It describes the procedural precision of subjects within our task. For the inpainting condition, the average precision error considering the artifact’s beginning, end, and "best choice" as the actual position was  ms,  ms, and  ms, respectively. The "best choice" shows the lowest precision error, which is still more than six times larger than the procedural precision error. This indicates that even when an artifact was detected, our subjects had great difficulty determining its position, and that this difficulty did not originate from the task.
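The precision error and the "best choice" reference position can be computed as in the following sketch. It assumes the sample standard deviation (the text does not state whether the sample or population form was used); the function names and the example positions are hypothetical.

```python
from statistics import stdev


def best_choice(perceived, begins, ends):
    """For each trial, pick the gap border (begin or end) nearer to the report."""
    return [b if abs(p - b) <= abs(p - e) else e
            for p, b, e in zip(perceived, begins, ends)]


def precision_error(perceived, actual):
    """Standard deviation of the perceived-minus-actual position differences."""
    return stdev([p - a for p, a in zip(perceived, actual)])


# Made-up positions (in seconds) for three trials of one subject.
perceived = [3.1, 4.9, 2.0]
begins = [3.0, 4.0, 2.5]
ends = [4.0, 5.0, 3.5]
nearest = best_choice(perceived, begins, ends)
print(nearest)
print(precision_error(perceived, begins),
      precision_error(perceived, ends),
      precision_error(perceived, nearest))
```

Because the standard deviation ignores a constant offset, a subject who consistently reports slightly late is not penalized; only the scatter of the reports counts.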

Figure 10: The position of perceived artifacts is weakly correlated with their actual position. Left: Perceived versus actual begin and end positions of the artifact (blue squares and green circles, respectively) for an exemplary subject. Triangles show the "best choice", i.e., the actual position (begin or end) nearer to the perceived position. Center: Statistics of the correlation coefficients for all subjects. Right: Statistics of the precision error for all subjects. B, E: perceived position versus begin and end of the artifact, respectively, in the inpainting condition. X: perceived position versus "best choice" in the inpainting condition. C: perceived position of the click in the click condition. , , : cross-correlation coefficient for the condition B, E, and X, respectively, of the exemplary listener. All other conventions as in Fig. 8.

Disturbance rate of detected artifacts

Finally, we have analyzed the ratings we have collected. The left panel of Fig. 11 shows the statistics of the severity ratings reported in the inpainted and click conditions. For the click condition, most of the ratings were between 1 ("not disturbing") and 3 ("not acceptable") with an average across all subjects of . This indicates that on average, our subjects rated the clicks as disturbing. In contrast, for the inpainted condition, most of the ratings were between 0 ("no issue") and 1 ("not disturbing") with an average of . This indicates that on average, our subjects rated the inpainting results halfway between "no issue" and "not disturbing".

So far, this analysis considered all inpainted stimuli, ignoring the fact that our subjects detected the artifact in some of them and not in others. The statistics of the severity ratings considering detected artifacts only (i.e., S > 0) are shown in the center part of the left panel in Fig. 11. The average across all subjects was . This is higher than the average considering all severity ratings, but still significantly lower than the severity of the clicks, as revealed by a paired t-test calculated between the ratings for clicks and for inpainted but detected artifacts. This indicates that even when the inpainting artifacts were perceived, their severity was rated significantly lower than that of the clicks.

Figure 11: Statistics of ratings across all subjects. Left: severity ratings (S). Center: Familiarity (F) and liking (L) ratings. Condition: Inpainted (I), click (C), ratings considering perceived artifacts only (S>0). Right: Statistics of Pearson’s correlation coefficients between S and F (SF), S and L (SL), as well as F and L (FL). All other conventions as in Fig. 8.

Influence of the familiarity on the severity

The familiarity and liking of a stimulus might also have influenced our experimental outcome. The average ratings for familiarity and liking are shown in the center panel of Fig. 11. Most of the familiarity ratings were between category 1 ("I have heard it before") and 2 ("I often listen to"), with an across-subject average of . Considering the perceived artifacts only (i.e., S > 0), the average increased to . This increase was significant (paired t-test between all and the perceived-only ratings), indicating that our subjects were slightly more familiar with stimuli containing detectable artifacts. The liking ratings were mostly between 1 ("I cannot tell") and 3 ("very nice"), with an average of . Considering perceived artifacts only, the average increased to . This increase was not significant (paired t-test between all and the perceived-only ratings). Thus, the artifact’s detectability does not appear to be related to the liking of the song.

The link between the severity of an artifact and the familiarity and/or liking ratings was further investigated by calculating Pearson’s correlation coefficients between the severity and the other ratings. The right panel of Fig. 11 shows the group statistics of the correlation coefficients, which, on average, were and for the correlation of severity with familiarity and liking, respectively. Such low correlations indicate that neither familiarity nor liking was clearly linked with the perceived severity of the artifact. For completeness, the correlation between familiarity and liking was also calculated, resulting in an across-subject average of . This correlation indicates a good link between the familiarity and liking of our stimuli, but also provides evidence that familiarity and liking are not fully equivalent.

VIII Conclusions

We have introduced a method for the restoration of audio signals in the presence of corruption/loss of data over an extended, connected period of time. Since, for complex audio signals, the length of the lost segment usually prohibits the inference of the correct data purely from the adjacent reliable data, our solution is based on the larger scale structure of the underlying audio signal. The reliable data is analyzed, detecting spectro-temporal similarities, resulting in a graph representation of the signal’s temporal evolution that indicates strong similarities. Inpainting of the lost data is then achieved by determining two suitable transitions between the border regions around the corrupted signal segment and a region that is considered to be similar. In other words, the algorithm jumps from shortly before the gap to a similar section of the audio signal and, after some time, back to a position shortly after the gap, effectively exchanging the corrupted piece with a suitable substitute. Consequently, the algorithm is capable of efficiently exploiting naturally occurring redundancies in the reliable data.

In order to test the efficiency of our algorithm, we have conducted a psychoacoustic evaluation. The results show that our listeners were able to detect of the artifacts, implying that our method completely fooled our listeners more than of the time. Our listeners showed a false-alarm rate of , indicating that sensitivity of correctly detecting a gap was with rather low (as compared with for well-detectable clicks and with for the chance rate). In fact, listener-specific analysis showed that only seven out of 15 tested listeners were able to detect the inpainting at a statistically significant level. Our study showed two additional quality signs of our method. First, the detected artifacts were rated on average between “not disturbing” and “mildly disturbing”. Second, even when an artifact was detected, our subjects only vaguely determined its position, with a precision error about six times larger than that for the reference click artifact. While our test was limited to four music genres, they covered many music structures usually found in other genres. However, inpainting performed on a very different genre such as contemporary electronic music might have led to different results, both numeric and perceptual.

Besides having built and tested a novel audio inpainting algorithm, it is worth noting that the graph constructed with our method gives an intuitive analysis of the signal at hand, exposing self-similarities and global structure and can be used for a number of different purposes. For example, a song can be re-composed by following the edges of the graph while respecting the global music structure. Multiple matches in a highly repetitive song can be used as a tool for further song modifications, offering a creative tool for algorithmic composing, e.g., in the field of contemporary electronic music.

Similarity graphs can be used in many applications; thus, it is important to further improve this kind of signal representation. Hence, future work includes closing the gap between the internal similarity measures and human hearing by incorporating perceptually motivated similarity measures derived, possibly, from a perceptually motivated representation [DBLP:journals/corr/NecciariHBP16] or a computational model of the auditory system [Irino:2006b]. Such a modification is expected to improve the reliability of the algorithm and its results. It seems worth noting, however, that even after considering an auditory model, reliable retrieval of strongly context-sensitive data such as speech and singing voice will require additional contextual information and might be better achieved by a generative approach [saino2006hmm], applied after separating voice and music in the signal [li2007separation].


We thank the reviewers and the editor for their review of this publication and their helpful suggestions. We thank Pierre Vandergheynst for his support during this project. His ideas and suggestions have contributed significantly to this work.

This work has been supported by the Swiss Data Science Center and by the Austrian Science Fund (FWF) projects FLAME (Frames and Linear Operators for Acoustical Modeling and Parameter Estimation; Y 551-N13) and MERLIN (Modern methods for the restoration of lost information in digital signals; I 3067-N30).

Quantity Variable used Default value Unit
Audio features
Maximum sampling frequency
Size of the patch samples
Number of frequencies -
Length of the window samples
Type of window - ’Itersine’ -
Dynamic range dB
Trade-off between the amplitude and phase -
Initial number of neighbors -
Kernel length -
Hard threshold for the weight matrix -
Regularization parameter 1 -
Regularization parameter 2 -
Table II: Default parameters of the algorithm