I. Introduction
The loss or corruption of entire segments of audio data is an important problem in music enhancement and restoration. Such corruptions range from short bursts of a few milliseconds to extended distortions that persist over several hundred or even thousands of milliseconds. Short distortions such as clicks or clipping have seen extensive coverage in the literature [siedenburg2013audio, adler2011constrained, godsill2002digital], while the concealment of moderate-length distortions is treated in packet loss concealment [perkins1998survey, bahat2015self] and previous work on audio inpainting [adler2012audio, siedenburg2013audio, lagrange2005long]. For such corruptions, it is often reasonable to assume that the lost signal is almost stationary for the duration of the corruption and/or can be inferred from the reliable information surrounding the unreliable segment. For losses of longer duration, such an assumption is increasingly unrealistic and a restoration technique cannot rely on local information only. Here, we propose a method to compensate for such extended data loss by considering information from the entirety of the uncorrupted audio available.
Data loss or corruption in the range of seconds can have various causes, e.g. partially damaged physical media such as phonograph cylinders, shellac or vinyl records, or even magnetic tapes. In live music recordings, imperfections due to unwanted noise sources originating from the audience, the artists themselves or the environment are quite common. Even in audio transmission, a short but total loss of the connection between transmitter and receiver may lead to data loss beyond just a few hundred milliseconds. In each of these scenarios, the data loss has highly unpleasant consequences for the listener, and it is usually not feasible to reconstruct the lost content from local information only.
Previous work on concealment of data loss in audio, though mostly considering shorter corruption durations, has been performed under various names, depending on the target application and the employed methodology: audio inpainting [adler2012audio], audio interpolation [Etter1996:Interpolation_AR], waveform substitution [goodman1986waveform], or imputation [smaragdis2009missing], to name but a few. We will use the terminology of audio inpainting in the remainder of this contribution. When the missing parts are sufficiently short, sparsity-based techniques can be successful [adler2012audio, siedenburg2013audio, adler2011constrained]. Otherwise, techniques relying on autoregressive modeling [Etter1996:Interpolation_AR], sinusoidal modeling [lagrange2005long, lukin2008:parametric.interp.gaps] or on self-content [bahat2015self] have been proposed. The latter provided promising results for speech signals with distortions on the order of seconds, while the former rely on simple signal models that do not fit complex music signals.

In this contribution, we propose a new algorithm, specifically targeted at the concealment of long-duration distortions in the range of several seconds within a single piece of music. The task of determining distortion locations is highly application-dependent and may be anything from trivial to very difficult. For the sake of focus, we assume the location of the distortion to be known. Our method arises from the assumption that, across many musical genres, the repetition, or variation, of distinct and recurring patterns (themes, melodies, rhythms, etc.) is a central stylistic element and thus heavily featured. When listening to music, we detect and memorize such internal redundancies, thereby learning the mid- and large-scale structures of a music piece [mcadams1987]. The exploitation of such redundancies in the computational analysis and processing of music seems only natural and, indeed, has been proposed before, see e.g. [foote1999visualizing, jehan2005creating, jehan2004event] or [rafii2013repeating]. The latter also provides a more extensive discussion of repetition as an essential element of many musical genres.
Although music information retrieval (MIR) provides many sophisticated methods for the analysis of micro- and macroscopic structures in music, a simple time-frequency analysis, properly handled, can provide all the necessary information to uncover significant similarities in music signals. The contributions of this work are the design of appropriate time-frequency features and their use for generating a map of similarities in music signals, as well as the use of the generated similarity map to drive the automatic concealment of long-duration data loss.
I-A. Related Work
Self-similarity in music has previously been employed in several areas of music analysis and processing, e.g. beat estimation and segmentation, and is often based on similarity matrices, as proposed by Foote [foote1999visualizing]. The similarity matrix can be constructed from various features, see e.g. [foote2001beat, bartsch2001catch, cooper2002automatic, foote2000automatic]. Self-similarity has also been successfully used for music/voice separation and speech enhancement [rafii2013repeating, rafii2013online]. Finally, the automatic analysis of musical structure based on similarities is already found in [silva2016simple], where it was used across songs for cover song detection. An alternative approach can be found in [jehan2005creating, jehan2004event]^{1}. There, the division of music into short, rhythm-dependent pieces is proposed, each of which is supposed to correspond to a single beat. Local features are obtained for each piece by combining previously established rhythm, timbre and pitch features, but the implementation details of their method are not disclosed. In this contribution, we propose a simple time-frequency feature built from the short-time Fourier magnitude and phase that implicitly encodes rhythmic, timbral and pitch characteristics of the analyzed signal all at once. We build a sparse similarity graph based on this feature that highlights only the strongest connections in a music piece. This similarity graph can be seen as a post-processed variant of Foote's similarity matrix and is used to perform data loss concealment by detecting suitable transitions between similar segments in a piece of music.

^{1}These studies led to the founding of "The Echo Nest", see http://the.echonest.com/, a company specialized in audio feature design. The idea of a similarity graph already appears in the Infinite Jukebox: http://labs.echonest.com/Uploader/index.html.

The audio inpainting problem has mainly been addressed from a sparsity point of view. The hypothesis is that audio is often approximately sparse in a time-frequency representation, i.e. it can be estimated using only a few time-frequency atoms.
Using classical sparse approximation and optimization techniques, algorithms have been designed to inpaint short audio gaps [adler2012audio, siedenburg2013audio]. Such methods strive for approximate recovery of the lost data by sparse approximation in a time-frequency representation such as the short-time Fourier transform (STFT). Both their numerical and perceptual restoration quality degrade quickly as the duration of the lost data interval grows. When applied to significantly longer gaps, these methods will simply fade out/in at the gap borders and introduce silence in the inner gap region. Audio inpainting is known as "waveform substitution" [goodman1986waveform] in the community addressing packet loss recovery techniques [perkins1998survey]. Most packet loss methods, however, are naturally designed for low-delay processing and trade restoration quality for computation speed, see also [bahat2015self] for a short overview. In that contribution, Bahat et al. propose an algorithm searching for similar parts of the signal using time-evolving features, conceptually resembling our own contribution. The method in [bahat2015self] is designed for packet loss concealment in speech transmission, however, and was tested only on comparatively short gaps. Its reliance on Mel frequency cepstral coefficients (MFCC) is a good match for speech, but not optimally suited for music. In another approach, Martin et al. [martin2011exemplar] proposed an inpainting algorithm taking advantage of the redundancy in tonal feature sequences of a music piece. Their method is able to conceal defects with a length of several seconds, but its performance depends on the amount of repetitive tonal sequences in a music piece [martin2011exemplar], and it was only applied when a recurrence of the lost tonal sequence was present in the reliable signal. It should be noted that parallel work on audio inpainting using self-similarity by Manilow and Pardo [manilow2017leveraging] was presented while the present manuscript was under review.

I-B. Structure of the paper
The remainder of this paper is organized as follows: the idea of the similarity graph is introduced in Section II. The general method and the construction of the graph are presented in Section III. Technical details of the graph construction, such as the exact choice of features and parameters, are deferred to Section IV. In Section V, we detail how the similarity graph can be used for audio inpainting. Finally, the performance of the algorithm is discussed in Section VI, based on both a basic verification experiment and extensive listening tests.
II. A transition graph encoding music structures
The problem we consider, i.e. how to restore a piece of music when an extended, connected segment has been lost or corrupted, often requires us to abandon the idea of exact recovery. In the case where only a short segment has been lost [adler2012audio], or the signal can be described by a very simple structure [lagrange2005long], it may be possible to infer the missing information from the regions directly adjacent to the distortion with sufficient quality. However, for complex music signals and corruptions of longer duration, such inference remains out of reach. Instead, we employ an analysis of the overarching medium- and large-scale structure of a music piece, determining redundancies in the signal to be exploited in the search for a replacement for the distorted signal segment.
Conceptually, such an analysis can be seen as a segmentation of the music into chorus and verse, motifs and their variations, sections of equal or different meter, etc. [macpherson2008form]. The main difference from our approach is that, instead of working with high-level cognitive concepts such as meter and motifs, we consider a basic time-frequency representation of the signal. In that representation, all the structures contained in a music recording are still preserved, although they are not always easily accessible to the human observer.
It is clear that repetition and less obvious redundancies do not occur to an equal degree in every music piece. While they are an essential stylistic element of pop and rock music, certain movements, e.g. in contemporary music, actively attempt to avoid the familiar. But even if a pattern is not repeated in exactly the same fashion, the conscious variation of previous structures, rhythmic, harmonic or otherwise, is an integral part of most music. Note that the degree of self-similarity inside a single recording may vary greatly.
Going back to the original problem of music restoration, it seems natural to exploit this type of redundancy in the musical piece to be restored. The temporal evolution of spectral content provides a surprisingly suitable first approximation of musical features. Inspired by this observation, we construct an audio similarity graph. The vertices of the graph represent small segments of musical content, while the edges indicate the similarity between segments in terms of local spectral content. The crucial step towards good performance is the enforcement of temporal coherence. This is achieved by selecting transitions that persist over time, i.e. similarity is not instantaneous, but present for some period of time.
III. Method
The ultimate goal of this contribution is to provide a means for the autonomous concealment of signal defects with a duration of a few hundred to several thousand milliseconds. Here, we assume that the position of the defects is already known. The restoration should sound natural and respect the overall structure of the signal under scrutiny. For short distortions, this implies, to some degree, the recovery of the lost information in the defective region. For long gaps and dynamic signals, we argue that it is of much greater importance that the transitions between the reliable signal segments and the proposed restoration sound natural. The further we move from the transition points into the restored region, the less exact recovery matters compared to the restoration making sense in the signal context. Therefore, we suggest an analysis of the signal structure with the proposed similarity graph, to determine the most natural fit for the distorted region from unaffected portions of the signal. The resulting method is an abstracted and autonomous version of manual restoration by searching the reliable signal for a replacement for the defective region. Since it forgoes the synthesis of new audio content, the proposed method can handle signal defects of arbitrary length without affecting audio quality, provided that enough reliable signal content is available.
From a short-time Fourier transform [allen1977unified, Grochening:2001a, ga46], we obtain simple similarity features carrying important temporal and spectral information. On the basis of these features, a similarity graph is constructed, representing the temporal evolution and structure of the signal. If some signal segment is known to be defective, it is then sufficient to determine another segment of similar length, such that the beginning and end of the substitute resemble the signal before and after the defect. By placing the candidate segment at the previously corrupted position, the defect can be concealed.
The proposed algorithm, illustrated in Figure 1, searches for a replacement segment that optimally satisfies the following three criteria:

The transitions resulting from the pasting operation (light green dashed lines in Figure 1) should be perceptually transparent, i.e., the listener should not be able to notice the transitions, even if the replacement segment does not correspond exactly to the missing data.

Some leeway is required for placing the transitions around the gap. However, the transition areas should not be unnecessarily long.

The length of the piece should remain approximately the same, i.e., the replacement duration should be close to the duration of the gap plus its surroundings.

Some margin for compromise is, however, essential to the construction of a good solution. Since the question of how strictly the reliable content is to be preserved, i.e. how long the transition areas may be, is highly application-dependent, a parameter in the optimization scheme enables the tuning of this property.
In practice, at least for the inpainting problem, it is unnecessary to construct the full similarity graph. Consequently, we construct a sparsified graph which considers unique and strong matches only. Weak matches are discarded, and only the strongest matches from each cluster of (temporally close) matches are considered. Finally, only edges connected to at least one node in the vicinity of the gap are relevant, since the transition areas are supposed to be small, see Figure 1.
III-A. Creation of the similarity graph
The generation of the graph can be structured coarsely into distinct stages. In this section we disregard some technical details, instead concentrating on the general idea. The technical details of the individual steps of our method can be found in Sections IV and V.
1. Compute basic similarity features
To determine temporal similarities in a signal, we have to settle on a feature that encodes the local signal behavior and on a distance measure that allows the comparison of feature vectors. For simplicity, and because the results were comparable to those of more sophisticated features, we settle here on a weighted combination of two features obtained directly from a short-time Fourier (STFT) analysis of the signal. Let $\mathbf{S}$ be the matrix of short-time Fourier coefficients, with $S(m,n)$ denoting the coefficient obtained at the $n$-th time position in the $m$-th channel. $\mathbf{S}$, see Section IV, can be decomposed uniquely into its magnitude $\mathbf{A}$ and phase $\boldsymbol{\phi}$ as $S(m,n) = A(m,n)e^{i\phi(m,n)}$. Since the human auditory system perceives loudness approximately as a logarithmic function of sound pressure, the first part of our proposed feature is essentially a time slice of the dB-spectrogram, i.e. $20\log_{10}(A(\cdot,n))$.
Note that direct spectrogram features have already proven to be useful in other applications, e.g. repetition-based source separation, see [rafii2013repeating].
Additionally, the time-direction partial derivative of the phase provides an estimate of the local instantaneous frequency [augfla95:reassign, Holighaus:2016:RSG:2910117.2910282]. Let $\dot{\boldsymbol{\phi}}$ denote the matrix containing the values of the time-direction partial derivative of the phase, i.e. $\dot{\phi}(m,n) = \frac{\partial}{\partial t}\phi(m,n)$. The second part of our proposed feature is essentially a time slice $\dot{\phi}(\cdot,n)$ of this matrix. While the dB-spectrogram puts a strong emphasis on signal components of high amplitude, the phase derivative attains large values for sinusoidal, or slowly frequency-varying, components independent of their magnitude, see also Figure 2. This second part of the feature serves to emphasize low-amplitude harmonic components, which may be highly important for perceived similarity. The actual feature, defined in Section IV, is conceptually equivalent, but implements some additional scaling. Locality of the features is implied by obtaining them from an STFT. The distance between two features is simply the squared Euclidean distance of the corresponding feature vectors.
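The two feature parts described above might be sketched as follows. This is a minimal Python illustration, not the paper's MATLAB/LTFAT implementation; the FFT length, hop size, dynamic range and weighting used here are assumptions of the sketch, not the paper's defaults.

```python
import numpy as np
from scipy.signal import stft

def similarity_features(x, fs, n_fft=1024, hop=256, dyn_range_db=60.0, lam=1.0):
    """Sketch of the two-part feature: a clipped, peak-normalized
    dB-spectrogram and the time-direction derivative of the STFT phase
    (relative instantaneous frequency). Parameter values are illustrative."""
    _, _, S = stft(x, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    mag = np.abs(S)
    # Part 1: dB-spectrogram, peak-normalized and clipped to a fixed range.
    a_db = 20.0 * np.log10(mag + 1e-12)
    f1 = np.maximum(a_db - a_db.max(), -dyn_range_db)
    # Part 2: time-direction phase derivative (relative instantaneous freq.).
    phase = np.angle(S)
    dphi = np.diff(np.unwrap(phase, axis=1), axis=1)
    dphi = np.concatenate([dphi, dphi[:, -1:]], axis=1)  # keep frame count
    # Suppress the unstable phase feature where the magnitude is negligible.
    f2 = np.where(f1 > -dyn_range_db, dphi, 0.0)
    # One feature vector per time frame: stacked columns of f1 and lam * f2.
    return np.vstack([f1, lam * f2]).T  # shape: (frames, 2 * bins)
```

Each row of the returned array is then compared with others via the squared Euclidean distance, as described above.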
2. Create a preliminary similarity graph
The full (unprocessed) similarity graph determined from the given feature vectors would simply have all time positions as vertices and edges connecting each vertex to every other vertex, with weights derived from the distance between the corresponding features.
The creation of such a graph is not only very expensive, but we are furthermore only interested in a small number of the strongest connections for every vertex. Therefore, we only determine the nearest neighbors of each vertex, in terms of feature distance. Since this operation is expensive, we use the FLANN library (Fast Library for Approximate Nearest Neighbors) [muja2014scalable] to efficiently provide an approximate solution. For the determined neighbors, the edge weights are recorded in the adjacency matrix $\mathbf{W}$ as

$$W(n_1, n_2) = e^{-d(n_1, n_2)/\sigma} \qquad (1)$$

for some $\sigma > 0$, where $d(n_1, n_2)$ denotes the squared feature distance, following a traditional graph construction scheme, see also Figure 3 (left).
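A minimal sketch of this preliminary graph construction is given below. It uses an exact brute-force neighbor search in place of FLANN's approximate one, and sets the scale to the average squared neighbor distance (cf. Section IV); the neighbor count and these choices are assumptions of the sketch.

```python
import numpy as np

def preliminary_graph(F, k=4):
    """Build the preliminary sparse similarity graph: each node keeps edges
    to its k nearest neighbors (exact brute force here; the paper uses FLANN
    for an approximate search). Weights follow the Gaussian scheme of Eq. (1),
    with sigma set to the average squared neighbor distance."""
    n = F.shape[0]
    # Pairwise squared Euclidean distances between feature vectors.
    d2 = ((F[:, None, :] - F[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)            # exclude self-edges
    nn = np.argsort(d2, axis=1)[:, :k]      # k nearest neighbors per node
    nn_d2 = np.take_along_axis(d2, nn, axis=1)
    sigma = nn_d2.mean()                    # data-adaptive scale
    W = np.zeros((n, n))
    rows = np.repeat(np.arange(n), k)
    W[rows, nn.ravel()] = np.exp(-nn_d2.ravel() / sigma)
    return np.maximum(W, W.T)               # symmetrize
```

The brute-force distance matrix is quadratic in the number of frames, which is exactly why the paper resorts to an approximate search for full songs.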
3. Enhance time-persistent similarities
The individual features obtained from the STFT characterize the signal's properties on a short time interval and do not capture its long-term spectral characteristics. In order to capture longer temporal structures of the signal, we refine the graph by emphasizing its edges whenever a sequence of features at consecutive time positions is similar to another such sequence. In practice, this is achieved by convolving the weight matrix with an $l \times l$ diagonal kernel $\mathbf{K}$, for some $l \in \mathbb{N}$, with ones on the diagonal and zeros elsewhere. The resulting adjacency matrix is given as

$$\tilde{\mathbf{W}} = \mathbf{W} \ast \mathbf{K}, \qquad (2)$$

see also Figure 3 (middle). Note that, in order to obtain an $N \times N$ matrix and for the above equation to be valid, $\mathbf{W}$ is implicitly extended with zeros on all sides.
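The persistence-enhancing convolution might be sketched as follows; the unnormalized identity kernel is an assumption of this sketch, consistent with the idea that diagonal runs of similar consecutive frames should reinforce each other.

```python
import numpy as np
from scipy.signal import convolve2d

def emphasize_persistent(W, l=5):
    """Step 3: convolve the weight matrix with an l-by-l diagonal kernel so
    that sequences of similar consecutive frames (diagonal runs in W)
    accumulate weight. W is implicitly zero-padded ('same' output size)."""
    K = np.eye(l)  # diagonal kernel; normalization is a design choice
    return convolve2d(W, K, mode="same")
```

After this step, an isolated match keeps its original weight, while the center of a diagonal run of matches sums the weights of its neighbors along the diagonal.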
4. Delete insignificant similarities/Merge clustered similarities
After the convolution with the diagonal kernel, the weight matrix of our graph has been populated with a large number of nonzero entries, clustered around the entries of $\mathbf{W}$. The maxima of such clusters represent the strongest similarities between two regions of the signal. Moreover, only strong connections indicate significant similarities. Therefore, we delete all edges with weights below a certain threshold and select from every cluster of connections only the strongest one, i.e. the one with the locally largest weight. This last step yields the weight matrix associated to the graph we use for our inpainting algorithm. For an example of the final, sparsified adjacency matrix, see Figure 3 (right). Figure 4 shows the difference between the original graph after Step 2 and part of the refined graph after Step 4.
III-B. Application: audio inpainting and the reduced similarity graph
The usage of the similarity graph for solving an inpainting problem is rather straightforward. According to the paradigm described in Figure 1, we want to find two transition edges, one entering the replacement segment and one leaving it, such that

the first transition is close to the beginning of the distorted region and the second is close to its end,

the duration of the replacement segment is approximately equal to the duration it replaces, and

the weights of both edges are large.

An appropriate choice of transitions is determined by optimizing these criteria over all possible choices of edges in some limited range around the signal defect. The signal segment corresponding to the selected features is then substituted for the original signal in the corresponding range.
For the purpose of inpainting, we are only interested in edges that connect to at least one vertex either shortly before or shortly after the signal defect. Hence, only a small horizontal (or vertical) slice of the sparse matrix has to be computed, greatly reducing the complexity of the graph creation. Figure 5 shows an example of such a reduced graph (not to be confused with the sparse graph) and the determined transitions for an exemplary signal and defect. In practice, we use the reduced graph for all experiments in this paper.
IV. The similarity graph in detail
IV-A. Local audio features
Building a similarity graph for full music pieces from STFT features is challenging in practice, simply due to the number and size of the obtained features. To be efficient, the number of features has to remain small despite the complexity of audio signals. Our solution leverages two techniques to obtain a good trade-off: 1) an adequate subsampling, and 2) a tight, low-redundancy STFT.
While audio signals are often sampled at a very high rate, a much lower rate is usually sufficient to compute reliable audio features. We therefore fix a maximum feature sampling rate (see Table II for all default parameters). If a given signal is sampled at a higher rate, it is decimated by an appropriate decimation factor after the application of an anti-aliasing filter. We denote the decimated signal by $x_d$.
The short-time Fourier transform (STFT) of $x_d$ with respect to a (real-valued) window function $g$, hop size $a$ and $M$ channels is defined as

$$S(m,n) = \sum_{k} x_d(k)\, g(k - na)\, e^{-2\pi i m k / M},$$

for $m = 0, \ldots, M-1$ and $n = 0, \ldots, N-1$. Recall the decomposition of $\mathbf{S}$ into magnitude and phase: $S(m,n) = A(m,n)e^{i\phi(m,n)}$. By default, we choose $g$ to be an Itersine window [wesfreid1993adapted], with the window length, hop size and number of channels given in Table II. This particular construction leads to a low-redundancy tight frame, hence preserving each signal frequency component equally.
The two separate parts of the feature vector are obtained as follows.
dB-Spectrogram
Let $\mathbf{A}_{\mathrm{dB}} = 20\log_{10}(\mathbf{A})$. For more convenient handling, $\mathbf{A}_{\mathrm{dB}}$ is limited to a fixed dynamic range and peak-normalized, resulting in the first feature part $F_1$: the peak is shifted to $0$ dB and all values more than a fixed range $r$ below the peak are clipped to $-r$ dB. The default value of $r$ is given in Table II. Figure 2 (top) shows $F_1$ for an exemplary audio signal.
Relative instantaneous frequency
In [augfla95:reassign], the authors show that an instantaneous frequency estimate can be associated to $S(m,n)$ by

$$f_{\mathrm{inst}}(m,n) = \frac{m}{M}\xi_s + \frac{1}{2\pi}\Delta_t\phi(m,n), \qquad (3)$$

where $\xi_s$ is the sampling rate of $x_d$ and $\Delta_t\phi$ is a discrete derivative of $\phi$. The second term in the equation above is in fact an equivalent expression for the partial derivative of $\phi$ with respect to time. $f_{\mathrm{inst}}$ might fluctuate quickly and its range depends on the channel index $m$. Both these properties are undesired for our purpose. Therefore, we consider only its relative part, i.e. the second term in Eq. (3), and perform a channel-wise smoothing of each channel sequence $\Delta_t\phi(m,\cdot)$ by convolution with a localized kernel (by default, a short Hann window, see Table II). Additionally, the expression for $\Delta_t\phi$ is unstable in regions of small magnitude [xxlbayjailsoend11]. Hence, we define the second feature part $F_2$ as the smoothed relative instantaneous frequency, set to zero wherever the clipped dB-spectrogram $F_1$ attains its minimal value.
The combined feature vector is obtained as the weighted concatenation

$$F(n) = \big(F_1(\cdot,n),\ \lambda F_2(\cdot,n)\big),$$

for $n = 0, \ldots, N-1$. We choose the default value of the weight $\lambda$ (see Table II) such that similar importance is placed on both sub-features.
IV-B. Creation of the similarity graph
When it comes to the graph creation, we desire an automatic parameter selection adapting to the audio features. For the creation of the initial graph, we only need to determine the value of $\sigma$ in expression (1) for the preliminary weight matrix. Denoting by $\mathcal{N}_n$ the set of approximate nearest neighbors of the vertex $n$, our solution is to set $\sigma$ to the average squared nearest-neighbor distance

$$\sigma = \frac{1}{N} \sum_{n} \frac{1}{|\mathcal{N}_n|} \sum_{n' \in \mathcal{N}_n} d(n, n').$$

Thus $W(n_1, n_2) \approx 1$ if $F(n_1)$ and $F(n_2)$ are close, decreasing towards $0$ the more $F(n_1)$ and $F(n_2)$ differ. Our experiments showed this to be a good default, which should be increased if the music is expected to be very redundant.
To obtain $\tilde{\mathbf{W}}$ from $\mathbf{W}$ in (2), the length $l$ of the convolution kernel must be fixed. After the convolution, the edges in the graph describe the similarity of signal segments whose duration is determined by $l$. The choice of $l$ thus determines the importance of long-duration similarities over short-duration ones. Our default value (see Table II) considers roughly half-second segments, see Figure 6.
To transition from $\tilde{\mathbf{W}}$ to the final weight matrix, we first perform a thresholding by $\mu$. In $\mathbf{W}$, each entry can be $1$ at maximum, see (1). In $\tilde{\mathbf{W}}$, solitary entries will be smaller than $1$, while entries surrounded by other high-valued entries can be larger than $1$. In order to suppress solitary entries, the default value of $\mu$ is chosen accordingly (see Table II). The final step consists of selecting the local maxima, i.e. choosing the entries that are equal to or larger than their four direct neighbors. In detail, the final weight matrix $\mathbf{W}_f$ is defined as

$$W_f(n_1, n_2) = \begin{cases} \tilde{W}(n_1, n_2) & \text{if } \tilde{W}(n_1, n_2) \geq \mu \text{ and } \tilde{W}(n_1, n_2) \geq \max\big(\tilde{W}(n_1 \pm 1, n_2),\ \tilde{W}(n_1, n_2 \pm 1)\big), \\ 0 & \text{otherwise.} \end{cases}$$

The notation $\pm 1$ should be interpreted as the collection of all possible choices, i.e. the maximum is taken over all direct neighbors in $\tilde{\mathbf{W}}$.
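The thresholding and local-maxima selection might be sketched as follows; the threshold value used here is an illustrative assumption.

```python
import numpy as np

def sparsify(Wt, mu=1.0):
    """Step 4: keep only entries that exceed the threshold mu AND are local
    maxima with respect to their four direct neighbors (up, down, left,
    right). mu = 1.0 is an illustrative default, not the paper's value."""
    P = np.pad(Wt, 1, mode="constant")       # zero-pad the borders
    up, down = P[:-2, 1:-1], P[2:, 1:-1]
    left, right = P[1:-1, :-2], P[1:-1, 2:]
    neighbor_max = np.maximum(np.maximum(up, down), np.maximum(left, right))
    keep = (Wt >= mu) & (Wt >= neighbor_max)
    return np.where(keep, Wt, 0.0)
```

From each cluster of nearby matches, only the locally strongest entry survives, yielding the sparse adjacency matrix used for inpainting.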
When applying the computation of the transition graph to a signal where the distorted area is known, the computational cost can be further reduced. In particular, only a partial transition graph needs to be computed, because we are only interested in outgoing connections within a short region before, and incoming connections within a short region immediately after, the distortion. Conceptually, we consider only small transition areas, cf. Figure 1 and Section V-A. Therefore, the nearest-neighbor search and all the following operations are not performed on all nodes, but only for a small subset of features in the direct vicinity of the signal defect. This allows us not only to greatly reduce the computation cost, but also to reduce the size of the optimization problem described in Section V-A. An example of a resulting graph is given in Figure 5.
V. The inpainting step in detail
V-A. Selection of optimal transitions
To select the optimal transitions, we need to transform the three conditions of Section III-B into a mathematical objective function. Let $n_s$ and $n_e$ denote the indices of the nodes corresponding to the start and end of the distorted region. Only edges connecting to a node shortly before $n_s$ or shortly after $n_e$ are considered acceptable. In our experiments, we observed that setting this acceptance range to a length corresponding to approximately 5 seconds yielded good results. The region considered for possible transitions is shown as the red interval in Figure 5.
Among all acceptable pairs of edges, we search for the solution that minimizes an objective function (4) composed of three terms. Comparing with Figure 1: the first term controls the difference between the replaced and replacement durations, the second term the distances of the transitions from the defect, and the third term the quality of the transitions, i.e. the edge weights. By tuning two weighting parameters, we can vary the importance of the individual terms; the values that provided good results in our experiments are listed in Table II.
Since the number of acceptable transitions is small, the computational benefit of using a sophisticated optimization algorithm is negligible. Hence, we solve the optimization problem by exhaustively computing the values of the objective function for each acceptable pair of edges.
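The exhaustive search might look as follows. The objective below is an illustrative stand-in combining the three stated criteria (length preservation, short transition areas, strong edge weights); the exact form and weighting of the paper's Eq. (4) may differ, and all names here are assumptions.

```python
def select_transitions(edges, gap_start, gap_end, alpha=1.0, beta=1.0):
    """Score all acceptable pairs of transition edges exhaustively.
    `edges` is a list of (orig, other, weight) tuples, linking a frame on
    the original timeline to a similar frame elsewhere in the signal. The
    replacement plays from m1 to m2; we enter it at n1 (before the gap)
    and leave it at n2 (after the gap)."""
    best, best_cost = None, float("inf")
    for n1, m1, w1 in edges:
        if n1 > gap_start:                    # entry must precede the gap
            continue
        for n2, m2, w2 in edges:
            if n2 < gap_end or m2 <= m1:      # exit must follow the gap
                continue
            cost = (abs((m2 - m1) - (n2 - n1))            # length preservation
                    + alpha * ((gap_start - n1) + (n2 - gap_end))  # short areas
                    - beta * (w1 + w2))                   # strong similarities
            if cost < best_cost:
                best_cost, best = cost, (n1, m1, n2, m2)
    return best, best_cost
```

Because only edges near the gap are acceptable, the double loop remains cheap in practice.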
V-B. Signal reconstruction
When two audio signals are concatenated naively, discontinuities and phase jumps may result in clicking artifacts. To reduce these effects, a smoothed transition is clearly preferable. We propose the following: Since the features are obtained from an STFT with hop size $a$, with respect to a possibly decimated signal, the time resolution of the similarity graph analysis equals $a$ samples of the decimated signal. In other words, the preliminary solution obtained in the previous step suggests transition positions that are only accurate up to this resolution. To further improve the transitions, we allow the transition positions to be adjusted by up to half the similarity graph's time resolution. The optimal adjustment is determined by maximizing the correlation between the original signal and the inpainting candidate around each transition, as proposed in [bahat2015self].
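The correlation-based refinement might look like the sketch below; the window length and search range are illustrative assumptions.

```python
import numpy as np

def refine_transition(x, y, t_x, t_y, max_shift=32, win=128):
    """Refine a preliminary transition: search, within +/- max_shift samples,
    for the offset of the replacement position t_y in y that maximizes the
    normalized correlation with the original signal x around position t_x
    (cf. the approach of Bahat et al.). All names here are illustrative."""
    a = x[t_x : t_x + win]                  # reference window in the original
    best_shift, best_corr = 0, -np.inf
    for s in range(-max_shift, max_shift + 1):
        b = y[t_y + s : t_y + s + win]
        corr = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
        if corr > best_corr:
            best_shift, best_corr = s, corr
    return t_y + best_shift
```

This maximizes waveform alignment at the splice point before the crossfade is applied.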
In order to obtain smooth transitions in the restored signal, we perform a time-frequency domain crossfading. Conceptually, this requires us to consider shifted arrays of short-time Fourier coefficients, with the time step offsets chosen such that, at each transition, one array has a time frame centered at the transition position in the original signal and the other has a time frame centered at the corresponding position of the replacement segment. The analysis window is chosen such that, on the undecimated signal, it mimics the window $g$ acting on the decimated signal $x_d$; hence, its length and number of channels are scaled by the decimation factor. By default, we thus also choose it to be of Itersine shape.
The restored signal can now be obtained by applying the inverse STFT to the combined coefficient matrix, in which the coefficients of the original signal and of the replacement segment are crossfaded at the transitions. In practice, complexity is further reduced, without altering the result, by computing the shifted STFTs only for the time positions relevant to the crossfading, thus obtaining two small submatrices. Note that any single coefficient vector only affects the reconstruction on an interval equal to the window length. Hence, both transitions have a duration on the order of the window length, and each can be recovered from a small number of time frames around the transition, determined by a generous estimate of the ratio between the window length and the hop size. The inverse STFT is then applied to these submatrices, and the crossfade regions, obtained as the central parts of those inverse STFTs, are placed at the desired positions in the signal. All other operations are performed in the time domain. To ensure equivalence with a complete STFT computation, the segments have to start and end a sufficient number of samples before and after the crossfading region.
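As a simplified time-domain analogue of this splicing step (the paper performs the crossfade on STFT coefficients instead), the operation might be sketched as:

```python
import numpy as np

def crossfade_splice(x, repl, t1, t2, fade_len=256):
    """Insert the replacement segment `repl` between positions t1 and t2 of
    x, with complementary raised-cosine crossfades at both transitions. A
    simplified time-domain stand-in for the paper's time-frequency
    crossfading; fade_len is an illustrative parameter."""
    fade_out = 0.5 * (1 + np.cos(np.linspace(0, np.pi, fade_len)))
    fade_in = 1.0 - fade_out                # the two fades sum to one
    head, tail = x[:t1], x[t2:]
    # Overlap the fade regions: end of head with start of repl, and
    # end of repl with start of tail.
    a = head[-fade_len:] * fade_out + repl[:fade_len] * fade_in
    b = repl[-fade_len:] * fade_out + tail[:fade_len] * fade_in
    return np.concatenate([head[:-fade_len], a,
                           repl[fade_len:-fade_len], b,
                           tail[fade_len:]])
```

Since the fades are complementary, a constant signal passes through the splice unchanged, which is the property the time-frequency crossfade provides for arbitrary signals.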
VI. Numerical evaluations
In this section we provide a numerical evaluation of the proposed algorithm.
First, we verify the algorithm in a setting where the gap content is provided within the remaining signal. A correct implementation should be able to replace the gap by exactly the lost content. Second, we investigate the algorithm's computational performance in terms of average runtime.
For the evaluations, the algorithm was implemented in MATLAB. The implementation is based on LTFAT [ltfatnote030] for feature extraction, and on the GSPBox [perraudin2014gspbox] for graph creation. For non-commercial use, the algorithm is available online^{2}, alongside a browser-based demonstration^{3}. Table II provides a summary of the algorithm parameters used for the evaluations.

^{2}https://epfllts2.github.io/rrphtml/audio_inpainting/
^{3}https://lts2.epfl.ch/webaudioinpainting/

VI-A. Verification
Here, we address the question whether the algorithm perfectly recovers the gap when an exact copy of the missing segment is present within the reliable signal. For this purpose, we used a set of uncorrupted audio signals with various content, all at the same sampling rate. First, redundant signals were created by repeating each signal, i.e. placing a copy of the signal at its end. Then, each redundant signal was corrupted by creating a gap of fixed duration. For each signal, the experiment was repeated five times with a randomly chosen gap position, yielding a set of corrupted signals. The algorithm was then applied to each of the corrupted signals. In all reconstructions, the norm difference between the original and reconstructed signals was within numerical precision, implying that each corrupted signal was perfectly restored. Hence, we consider the implementation of the presented algorithm as verified.
VI-B Computational complexity
The algorithm can be separated into steps with different computational requirements. Here, we investigated the individual cost of each step and its relative importance in the overall performance of the algorithm. The evaluation was performed on a modern notebook ( GHz Intel i7, 2 cores, GB RAM) for the same set of corrupted signals as in Sec. VII. Table I
shows the mean and standard deviation of the computation time per minute of audio signal. On average, each minute of audio signal required s of computation time for the reconstruction. The feature computation, graph creation, and the selection of the optimal transition, performed on the reduced sparse graph (see Figure 5), scale linearly with the length of the provided reliable data, in terms of both storage and time complexity. Consequently, we report the timing per minute of analyzed music. In all our experiments, the reliable data was given by a full song, without the corrupted segment. Note that linear complexity can only be achieved by considering the reduced graph. For the full sparse graph, the complexity of graph creation is and the transition selection would even scale roughly quadratically, i.e., be . Even if the selection is restricted to the range considered in the reduced graph, linear complexity would be out of reach. Therefore, the computation time per minute is no longer a reliable indicator. We merely remark that, on the dataset used, the graph construction was on average times slower, while the average duration of transition selection increased by a factor of when performed on the full sparse graph. Although we did not systematically evaluate the memory usage of the method, it should be noted that restricting to the reduced graph is considerably more efficient in that regard as well.
If multiple corruptions are to be removed using the same set of reliable data, the algorithm benefits from the fact that the features only need to be computed once. Since the feature computation is the bottleneck of the method (as can be seen in Table I), this may lead to significant gains in computational performance in the case of multiple gaps.
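A minimal sketch of this reuse pattern (the names are ours, not the MATLAB implementation’s): the features of the reliable data are computed lazily and cached, so every gap after the first skips the bottleneck step.

```python
class InpaintingSession:
    """Caches the expensive feature computation over the reliable data so
    that multiple gaps in the same recording reuse it."""

    def __init__(self, reliable_signal, feature_fn):
        self.signal = reliable_signal
        self.feature_fn = feature_fn
        self._features = None

    @property
    def features(self):
        if self._features is None:        # computed once, on first access
            self._features = self.feature_fn(self.signal)
        return self._features

    def inpaint(self, gap_start, gap_end):
        feats = self.features             # cache hit for every gap after the first
        # ... graph creation, transition selection, reconstruction would go here ...
        return feats[gap_start:gap_end]   # placeholder result
```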
Processing step        | Reduced graph (Mean) | Reduced graph (STD)
-----------------------|----------------------|--------------------
Feature extraction     |                      |
Graph construction     |                      |
Transition selection   |                      |
Signal reconstruction  |                      |
Total                  |                      |
VII Perceptual evaluation
In order to estimate the potential of the proposed algorithm for music, we conducted a psychoacoustic test in which we evaluated the impact of the artifacts arising from inpainting various songs from a music database. In particular, we were interested in the following questions:

How often are subjects able to detect an alteration (detectability)? The answer tells us how often our algorithm is able to fool the listener.

How precisely can subjects pinpoint the alteration? The answer gives us an indication of the inpainting quality and of the confidence of the test subject.

How disturbing are the detected artifacts (severity)? The answer provides insight into the reconstruction quality even when the listener is not fooled.

Is the familiarity of the song correlated with the detectability or the severity? The answer gives some intuition about the quality of the reconstruction and ensures that we are not only fooling the unfamiliar test subjects.
In order to ensure that our experiment provides meaningful results that truly describe the potential of the proposed algorithm, our subjects were familiar with the tested music genres, and we collected ratings of familiarity and liking for the songs.
VII-A Testing methodology
Material
The sound material consisted of songs from the following genres: pop, rock, jazz, and classical. These genres were selected to cover the most common listening habits and, with respect to music structure, also to include many other, similar genres like blues, country, folk, oldies, hip-hop, etc. Six songs per genre were selected from hundreds of songs with the aim of representing the genre well.
Subjects
In order to test subjects familiar with our material, candidates had to provide, in a self-assessment questionnaire, their average weekly listening duration (in hours) for the genres pop, rock, jazz, classical, and others. For the evaluation, only candidates listening at least 4 hours per week in total to music from all four main genres were considered. In total, 15 subjects were selected for the test. They were paid on an hourly basis.
Task
In each trial, the subject listened to a sound stimulus and was asked to pay attention to a potential artifact (see Fig. 7). A slider scrolled horizontally while the sample was played, indicating the current position within the stimulus. The subject was asked to tag the artifact’s position by aligning a second slider with the beginning of the perceived artifact. Then, while listening to the same stimulus again, the subject was asked to confirm (and realign if required) the slider position and to answer three questions:

Severity (S): How bad was it ("Wie schlimm ist es")? The possible answers were: (0) no issue ("Kein Fehler"), (1) not disturbing ("Nicht störend"), (2) mildly disturbing ("Leicht störend"), and (3) not acceptable ("Nicht akzeptabel").

Familiarity (F): How familiar are you with this song ("Wie gut kennen Sie dieses Stück")? (0) never heard it before ("Noch nie gehört"), (1) I have heard it before ("Schon mal gehört"), (2) I often listen to it ("Höre ich öfters"), (3) I know it well ("Kenne ich gut"), and (4) I can play/sing it ("Kann ich spielen/singen").

Liking (L): How do you like this song ("Wie gefällt Ihnen dieses Stück")? (0) not at all ("Gar nicht"), (1) I cannot tell ("Kann nicht sagen"), (2) nice ("Nett"), (3) very nice ("Sehr nett"), and (4) amazing ("Bin begeistert").
The questions were answered by tapping on the corresponding category. Then, the subject continued with the next trial by tapping the "next" button.
Before the experiment, the subject was informed about the purpose and procedure of the experiment and an exemplary reconstruction was presented. Any questions with respect to the procedure were clarified.
Conditions
Three conditions were tested. For the inpainting condition, the song was corrupted at a random position with a gap of 1 s duration and then reconstructed with the default parameters from Tab. II. The reconstructed song was cropped 2 to 4 seconds (randomly varying) before and after the gap, resulting in samples of 5 to 9 s duration. The gap was not allowed to be within the first or last 30 s of the song, but the inpainting was allowed to use the full song for processing. For the reference condition, the song was cropped at a random position with a duration varying from 5 to 10 seconds. The reference condition did not contain any artifact and was used to estimate the sensitivity of a subject. For the click condition, a click was superimposed on the song at a random position and the result was cropped 2.5 to 4.5 s before and after the click’s position, resulting in samples of 5 to 9 s duration. The artifact in this condition served as a reference artifact and was clearly audible. (For other music genres like electronic music, the click might not always be audible and another type of reference artifact would have been required.)
In total, three inpainted stimuli, one reference stimulus, and one click stimulus were created per song.
The combination of genres, songs per genre, and conditions per song resulted in a block of 120 stimuli. All stimuli were normalized in level (the click condition was normalized before superimposing the click). Within the block, the order of the stimuli and conditions was random.
Each subject was tested with two blocks, resulting in 240 trials per subject in total. Subjects were allowed to take a break at any time, with one planned break per block. For each subject, the test lasted approximately 2.5 hours.
VII-B Results
Detection rate of the artifacts
The detection results are shown in the left panel of Fig. 8. The average detection rates for the click, inpainting, and reference conditions were , , and , respectively. The high detection rate and small variance in the click condition demonstrate the good attention of our subjects, for whom even a single click was clearly audible. The clearly nonzero rate in the reference condition shows that our subjects were highly motivated to find artifacts. The detection rate in the inpainted condition lay between those of the reference and click conditions. Note that the reference condition did not contain any artifacts; thus, the artifact detection rate in that condition is referred to here as the false-alarm rate.
The large variance of the false-alarm rate shows that it is listener-specific. Thus, for further analysis, the detection rates from the inpainted condition were related to the listener-specific false-alarm rate, i.e., the sensitivity index d' was used [macmillan2004detection]. The false-alarm rate can be considered a reference for guessing; thus, d' = 0 indicates that the artifacts were detected at chance level. The right panel of Fig. 8 shows the statistics of d' for the inpainting and the click conditions. For the click condition, the average d' across all subjects was , again demonstrating the good detectability of the clicks. For the inpainting condition, the average d' was , i.e., slightly above guessing ( ). A t-test performed on the listeners' d's showed a significant ( ) difference from guessing, indicating that our listeners, as a group, were often able to detect the artifacts better than guessing. A listener-specific analysis, however, showed that only seven out of our 15 subjects were able to detect the inpainting better than chance, as revealed by a 2-by-2 contingency table analysis with the false-alarm and inpainting-detection rates, evaluated at a significance level of 0.05.
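For reference, the sensitivity index is d' = z(H) − z(F), where H is the hit rate, F the false-alarm rate, and z the inverse of the standard normal CDF [macmillan2004detection]. A minimal Python sketch follows; the 1/(2N) clipping of extreme rates is one common convention, not necessarily the one used in our analysis:

```python
from statistics import NormalDist

def d_prime(hit_rate, false_alarm_rate, n_trials):
    """Sensitivity index d' = z(H) - z(F); d' = 0 corresponds to chance.
    Rates of exactly 0 or 1 are clipped by 1/(2N) to keep z finite."""
    lo, hi = 1.0 / (2 * n_trials), 1.0 - 1.0 / (2 * n_trials)
    clip = lambda p: min(max(p, lo), hi)
    z = NormalDist().inv_cdf  # inverse standard normal CDF
    return z(clip(hit_rate)) - z(clip(false_alarm_rate))
```

A hit rate equal to the false-alarm rate yields d' = 0 (pure guessing), while well-separated rates yield large positive values.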
Influence of familiarity on the detectability
A natural question that arises for this method is to what extent familiarity with a song influences the detectability of the artifacts. While a comprehensive answer to this question is beyond the scope of this paper and would require a whole new study, here we aim to give a brief impression for our subject pool.
Fig. 9 shows the detection rate (left panel) and the d' (right panel) as functions of the familiarity ratings. While there seems to be a correlation between detectability and familiarity, the link is surprisingly weak. Arguably, there is nearly no difference in the detection rates between songs with familiarity ratings between 2 ("I often listen to it") and 4 ("I can play/sing it"), while there seems to be some difference to the lower familiarity ratings. Interestingly, even for the very familiar songs, the detection rate is much lower than for clicks and the d' is only twice as large as that for the chance rate.
Detection of the artifact position
Subjects who successfully detected an artifact should be able to provide information about its position within the stimulus, i.e., the perceived position of the artifact should correlate with its actual position. The left panel of Fig. 10 shows the perceived positions plotted versus the actual positions of the artifacts for an exemplary average listener. The reported perceived position of an artifact might refer to the gap’s beginning or end, with the choice possibly varying from stimulus to stimulus. Thus, we correlated the reported position with the beginning, the end, and the nearer of the two positions (referred to as "best choice"). The "best choice" positions are highlighted by triangles.
The statistics of the correlation coefficients across all subjects are shown in the center panel of Fig. 10. The moderate correlations indicate that, as soon as our subjects detected an artifact, they had some estimate of its position within the stimulus. In contrast, for the clicks, the high correlation indicates that our subjects were able to exactly determine and report the position of the click artifact.
In order to determine the precision in reporting the artifact’s position, we also calculated the difference between the perceived and the actual artifact position. The standard deviation of these differences, calculated per subject, is referred to as the precision error. The statistics across subjects are shown in the right panel of Fig. 10. For the click condition, the average precision error across all subjects was ms. It describes the procedural precision of the subjects within our task. For the inpainting condition, the average precision error considering the artifact’s beginning, end, and "best choice" as the actual position was ms, ms, and ms, respectively. The "best choice" yields the lowest precision errors, which are still more than six times larger than the procedural precision error. This indicates that, even when the artifact was detected, our subjects had great difficulty determining its position, and these difficulties did not originate from the task.
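The precision-error computation can be summarized in a short sketch (names are illustrative; positions are all in the same unit, e.g., milliseconds):

```python
from statistics import stdev

def precision_error(reported, gap_begins, gap_ends):
    """Standard deviation, over a subject's trials, of the difference between
    the reported and the actual artifact position. The actual position is
    taken as the gap boundary nearest to the report ('best choice')."""
    best = [b if abs(r - b) <= abs(r - e) else e
            for r, b, e in zip(reported, gap_begins, gap_ends)]
    return stdev(r - a for r, a in zip(reported, best))
```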
Disturbance rate of detected artifacts
Finally, we analyzed the collected ratings. The left panel of Fig. 11 shows the statistics of the severity ratings reported in the inpainted and click conditions. For the click condition, most of the ratings were between 1 ("not disturbing") and 3 ("not acceptable"), with an average across all subjects of . This indicates that, on average, our subjects rated the clicks as disturbing. In contrast, for the inpainted condition, most of the ratings were between 0 ("no issue") and 1 ("not disturbing"), with an average of . This indicates that, on average, our subjects rated the inpainting results halfway between "no issue" and "not disturbing".
So far, this analysis considered all inpainted stimuli, ignoring the fact that our subjects detected the artifact for some of them and not for others. The statistics of the severity ratings considering detected artifacts only (i.e., ) are shown in the center part of the left panel in Fig. 11. The average across all subjects was . This is higher than the average over all severity ratings, but still significantly ( ) lower than the severity of the clicks, as revealed by a paired t-test calculated between the ratings for clicks and for inpainted but detected artifacts. This indicates that, even when the inpainting artifacts were perceived, their severity was rated significantly lower than that of the clicks.
Influence of the familiarity on the severity
The stimulus’ familiarity and liking might also have influenced our experimental outcome. The average ratings for familiarity and liking are shown in the center panel of Fig. 11. Most of the familiarity ratings were between category 1 ("I have heard it before") and 2 ("I often listen to it"), with an across-subject average of . Considering the perceived artifacts only (i.e., ), the average increased to . This increase was significant ( , paired t-test on all versus the perceived-only ratings), indicating that our subjects were slightly more familiar with stimuli containing detectable artifacts. The liking ratings were mostly between 1 ("I cannot tell") and 3 ("very nice"), with an average of . Considering perceived artifacts only, the average increased to . This increase was not significant ( , paired t-test between all and the perceived-only ratings). It seems that the artifact’s detectability was not related to the liking of the song.
The link between the severity of an artifact and the familiarity and/or liking ratings was further investigated by calculating Pearson’s correlation coefficients between the severity and the other ratings. The right panel of Fig. 11 shows the group statistics of the correlation coefficients, which, on average, were and for the correlation of severity with familiarity and liking, respectively. Such low correlations indicate that neither familiarity nor liking was clearly linked with the perceived severity of an artifact. Out of curiosity, the correlation between familiarity and liking was also calculated, resulting in an across-subject average of . This correlation indicates a good link between the familiarity and liking of our stimuli, but also provides evidence that familiarity and liking are not fully equivalent.
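The correlation analysis above is a plain Pearson correlation between two per-trial rating vectors; for completeness, a self-contained sketch:

```python
def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences:
    covariance divided by the product of the standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)
```

Values near zero, as reported above for severity versus familiarity and liking, indicate the absence of a linear relationship.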
VIII Conclusions
We have introduced a method for the restoration of audio signals in the presence of corruption/loss of data over an extended, connected period of time. Since, for complex audio signals, the length of the lost segment usually prohibits the inference of the correct data purely from the adjacent reliable data, our solution is based on the larger scale structure of the underlying audio signal. The reliable data is analyzed, detecting spectrotemporal similarities, resulting in a graph representation of the signal’s temporal evolution that indicates strong similarities. Inpainting of the lost data is then achieved by determining two suitable transitions between the border regions around the corrupted signal segment and a region that is considered to be similar. In other words, the algorithm jumps from shortly before the gap to a similar section of the audio signal and, after some time, back to a position shortly after the gap, effectively exchanging the corrupted piece with a suitable substitute. Consequently, the algorithm is capable of efficiently exploiting naturally occurring redundancies in the reliable data.
In order to test the efficiency of our algorithm, we conducted a psychoacoustic evaluation. The results show that our listeners were able to detect of the artifacts, implying that our method completely fooled our listeners more than of the time. Our listeners showed a false-alarm rate of , indicating that the sensitivity of correctly detecting a gap was, with , rather low (as compared with for well-detectable clicks and for the chance rate). In fact, the listener-specific analysis showed that only seven out of the 15 tested listeners were able to detect the inpainting at a statistically significant level. Our study revealed two additional signs of the quality of our method. First, the detected artifacts were rated on average between “not disturbing” and “mildly disturbing”. Second, even when an artifact was detected, our subjects could only vaguely determine its position, with a precision error six times larger than that in the reference condition. While our test was limited to four music genres, these cover many music structures usually found in other genres. However, inpainting performed on a very different genre, like contemporary electronic music, might have led to different results, both numerical and perceptual.
Besides having built and tested a novel audio inpainting algorithm, it is worth noting that the graph constructed with our method provides an intuitive analysis of the signal at hand, exposing self-similarities and global structure, and can be used for a number of different purposes. For example, a song can be recomposed by following the edges of the graph while respecting the global music structure. Multiple matches in a highly repetitive song can be used for further song modifications, offering a creative tool for algorithmic composition, e.g., in the field of contemporary electronic music.
Similarity graphs can be used in many applications; thus, it is important to further improve this kind of signal representation. Hence, future work includes closing the gap between the internal similarity measures and human hearing by incorporating perceptually motivated similarity measures derived, possibly, from a perceptually motivated representation [DBLP:journals/corr/NecciariHBP16] or a computational model of the auditory system [Irino:2006b]. Such a modification would greatly improve the reliability of the algorithm and its results. It is worth noting, however, that even with an auditory model, the reliable retrieval of strongly context-sensitive data such as speech and singing voice will require additional contextual information and might be better achieved by a generative approach [saino2006hmm], applied after separating the voice from the music in the signal [li2007separation].
Acknowledgment
We thank the reviewers and the editor for their review of this publication and their helpful suggestions. We thank Pierre Vandergheynst for his support during this project; his ideas and suggestions contributed significantly to this work.
This work has been supported by the Swiss Data Science Center and by the Austrian Science Fund (FWF) projects FLAME (Frames and Linear Operators for Acoustical Modeling and Parameter Estimation; Y 551-N13) and MERLIN (Modern methods for the restoration of lost information in digital signals; I 3067-N30).
Quantity                                  | Variable used | Default value | Unit
------------------------------------------|---------------|---------------|--------
Audio features:                           |               |               |
Maximum sampling frequency                |               |               |
Size of the patch                         |               |               | samples
Number of frequencies                     |               |               |
Length of the window                      |               |               | samples
Type of window                            |               | 'Itersine'    |
Dynamic range                             |               |               | dB
Trade-off between the amplitude and phase |               |               |
Graph:                                    |               |               |
Initial number of neighbors               |               |               |
Kernel length                             |               |               |
Hard threshold for the weight matrix      |               |               |
Optimization:                             |               |               |
Regularization parameter 1                |               |               |
Regularization parameter 2                |               |               |