Sound recordings are used in various ecological studies, including wildlife monitoring by acoustic surveys. Such surveys require automatic detection of target sound events in the large amount of data produced. However, current detectors, especially those relying on band-limited energy as the main feature, are severely impacted by wind, which causes transient energy increases. The rapid dynamics of this noise invalidate standard noise estimators, and no satisfactory method for dealing with it exists in bioacoustics, where simple training and generalization between conditions are important.
We propose to estimate the transient noise level by fitting short-term spectrum models to a wavelet packet representation. This estimator is then combined with log-spectral subtraction to stabilize the background level. The resulting adjusted wavelet series can be analysed by standard energy detectors. We use real data from long-term monitoring to tune this workflow, and test it on two acoustic surveys of birds. Additionally, we show how the estimator can be incorporated in a denoising method to restore sound.
The proposed noise-robust workflow greatly reduced the number of false alarms in the surveys, compared to unadjusted energy detection. As a result, the acoustic survey efficiency (precision of the estimated call density) improved for both species. Denoising was also more effective when using the short-term estimate, whereas standard wavelet shrinkage with a constant noise estimate struggled to remove the effects of wind.
In contrast to existing methods, the proposed estimator can adjust for transient broadband noises without requiring additional hardware or extensive tuning to each species. It improved the detection workflow based on very little training data, making it particularly attractive for detection of rare species.
In recent years, acoustic surveys based on long-term recordings have emerged as a powerful tool in ecology. Such surveys can cover large scales in both time and space, making them invaluable for monitoring animal species in conservation and behaviour research (see reviews by Shonfield and Bayne (2017); Sugai et al. (2018)). Many further applications for such monitoring at the human-wildlife interface have been proposed, such as poaching detectors (Astaras et al., 2017), warning systems for elephant approach (Zeppelzauer et al., 2015), or farm animal welfare monitoring (Mcloughlin et al., 2019).
A key step in most of these tasks is the detection of target sounds in the recordings. The resulting annotations can then be used in various inference models, population size estimation (Dawson and Efford, 2009), source localization (Rhinehart et al., 2020), or for other purposes. Since the amounts of data recorded often total in the thousands of hours, and calls are intermittent within them, automatic detection is necessary, and choosing the right methods can have a large impact on survey efficiency (Juodakis et al., 2021a). Thus, developing detectors that can be applied to natural soundscapes is an important and active area of research.
A major obstacle for current bioacoustic sound detectors is environmental noise, in particular wind (Priyadarshani et al., 2018). Wind interaction with microphones creates noise in the form of transient peaks, with higher power in lower frequencies (Walker and Hedlin, 2009; Nelke, 2016). Detection in bioacoustics, at least in initial stages, typically identifies sound events as increases in energy, possibly band-filtered (e.g., Prince et al. (2019)), transformed (Priyadarshani et al., 2020) or in the spectrogram representation (Lasseck, 2013). Wind peaks can appear as such increases, and therefore create false positives, thus greatly reducing the detection performance. More complex recognisers are also impaired by wind: Digby et al. (2013) used a decision tree based on handcrafted species-specific features that performed considerably worse in windy conditions, while Znidersic et al. (2021) observed similar issues when estimating call counts based on acoustic indices of 1-minute frames. While the exact mechanism of this effect is not clear, rapid changes in background energy and degradation of target sound features are likely causes, and methods robust to these factors are needed to allow detection in natural conditions.
Various approaches to wind noise suppression have been developed for different tasks. Classic denoising methods such as the Wiener or MMSE filters are not applicable to wind because of its rapid dynamics. Neural networks have been successfully used for speech denoising, e.g., Keshavarzi et al. (2018), and in public competitions (Kahl et al., 2019). However, their adoption in bioacoustic practice has been limited, primarily because they require large quantities of training data, which is rarely available for wildlife. In addition, the black box nature of deep learning makes it unclear if such models would generalize to different surveys, as similar geophonic noise sources in different areas can have different noise profiles (Metcalf et al., 2020). For example, in a recent study Vickers et al. (2021) observed that denoising by neural networks does not help subsequent call detection with unseen types of noise. Subsequent ecological inference often also makes some assumptions about the detection probability (e.g., smooth decrease with distance, Dawson and Efford (2009)) that are difficult to verify with such methods, so more transparent wind-robust detectors are needed.
Some simpler methods for wind denoising have been developed in other fields, but are not applicable to bioacoustics. For example, the signal centroids method (Nelke et al., 2014) relies on the target having high dominant frequency, which is simply not true for many vocalizing species. Other methods require pitch estimation (Nelke and Vary, 2015), which is itself a complex task for distant and noisy sounds in natural environments. Another distinct research area is noise mitigation by shielding, mechanical integration, or multi-microphone coherence (Walker and Hedlin, 2009). We will not consider these options in this study, as they require physical modifications to hardware, complicate recorder deployment, and do not help analyse historical or ongoing survey data.
Therefore, we propose a new procedure for single-microphone estimation of transient broadband noise. We use it to improve the noise-robustness of an acoustic event detection method. We will first describe the theoretical basis of this method, and then demonstrate its usage on two surveys of birds. We also show how this estimate can be incorporated in a denoising method to restore clean sound for listening or visualization. The proposed noise estimator is found to considerably improve the efficiency of acoustic surveys, and is easily adaptable to different species and noise profiles.
2 Materials and methods
2.1 Overview of the proposed detector
The main method proposed in this paper is a wind-robust energy detector. It detects signals in a target frequency band using these steps:
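Sound is converted to a wavelet packet tree (WPT) representation, and node(s) corresponding to the target frequencies are chosen;
At each time point, the noise level in the target band is estimated by fitting a spectrum model to the energies of the WPT nodes;
The estimated noise level is used in log-scale spectral subtraction to adjust the target band energy;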
Adjusted energies are analysed by a changepoint detection algorithm, presented previously (Juodakis et al., 2021b), to detect increases, which are assumed to be calls.
We will now present each of these components in more detail, starting with the final detection stage which is used to guide the other parts of the method.
2.2 Energy-based signal detection
In the energy detection framework, the sequence of observations $x_t$ is modelled as the sum of a stationary noise process $n_t$ and signal $s_t$:
$$x_t = n_t + s_t. \qquad (1)$$
The signal is transient, and its presence is detected based on the observed energy $x_t^2$. This is motivated by assuming that both are independent white Gaussian processes:
$$n_t \sim N(0, \sigma_n^2), \qquad s_t \sim N(0, \sigma_s^2). \qquad (2)$$
Then, testing $x_t^2$ against a fixed threshold is a generalized likelihood ratio test for the hypotheses:
$$H_0: \sigma_s^2 = 0 \quad \text{vs.} \quad H_1: \sigma_s^2 > 0,$$
and its properties, such as error rates, can be determined theoretically (Chen, 2010). For example, false alarm rates can be controlled at rate $\alpha$ by setting the threshold to $\sigma_n^2 F^{-1}(1-\alpha)$, where $F$ is the CDF of the $\chi^2_1$ distribution, and $\sigma_n^2$ is estimated utilising the stationarity assumption, e.g., from quiet frames. Larger intervals can be tested using windowed statistics of the energy. The energy can also be compared at many positions, to locate the start and end of signal activity (changepoint detection; Page (1954)). We will use a variation of this changepoint procedure, presented in Juodakis et al. (2021b), as the main detector in this study, but our results apply to any method that uses the energy $x_t^2$ for detection.
In all the above methods, the stationary noise model (2) is key to detection. Wind, and transient broadband noises in general, violate this assumption and harm the performance. For example, if the noise is Gaussian with transient increases in power:
$$n_t \sim N(0, \sigma_{n,t}^2), \qquad \sigma_{n,t}^2 \geq \sigma_n^2,$$
then the test of $x_t^2$ against the threshold above, established on quiet periods as before, will have a false alarm rate greater than $\alpha$. In fact, without assuming any further features distinguishing the wind and signal processes, an increase in noise power ($\sigma_{n,t}^2 > \sigma_n^2$) and the presence of signal ($\sigma_s^2 > 0$) are not identifiable: this can be seen in an example of a (band-limited) energy series where both a bird call and a wind gust correspond to a transient increase (Figure 1). Therefore, no conclusions about the performance of energy detection under these conditions can be made, in contrast to the stationary background model (1)–(2).
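The threshold test of this section can be illustrated with a minimal Python sketch. It sets the threshold from a noise variance estimated on quiet frames, then compares the empirical false alarm rate under stationary noise and under noise with transient power increases. The function names, the simulated noise levels and the gust pattern are assumptions made for this illustration only; the $\chi^2_1$ quantile is computed as the square of a standard normal quantile.

```python
import random
from statistics import NormalDist


def energy_threshold(noise_var, alpha):
    """Threshold on x_t**2 giving false alarm rate alpha under white Gaussian noise."""
    # For one degree of freedom, the chi-squared quantile at 1 - alpha
    # equals the squared standard normal quantile at 1 - alpha / 2.
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return noise_var * z * z


def false_alarm_rate(samples, threshold):
    """Fraction of signal-free samples whose energy exceeds the threshold."""
    return sum(x * x > threshold for x in samples) / len(samples)


random.seed(0)
alpha = 0.05

# Noise variance estimated from quiet frames (simulated here, SD = 1).
quiet = [random.gauss(0.0, 1.0) for _ in range(20000)]
noise_var = sum(x * x for x in quiet) / len(quiet)
threshold = energy_threshold(noise_var, alpha)

# Stationary noise keeps the nominal rate; "gusts" (SD = 3 in one block
# out of four, an arbitrary choice) inflate it although no signal is present.
stationary = [random.gauss(0.0, 1.0) for _ in range(20000)]
gusty = [random.gauss(0.0, 3.0 if (i // 500) % 4 == 0 else 1.0)
         for i in range(20000)]

rate_stationary = false_alarm_rate(stationary, threshold)
rate_gusty = false_alarm_rate(gusty, threshold)
print(f"stationary: {rate_stationary:.3f}, gusty: {rate_gusty:.3f}")
```

In this simulation the stationary rate stays close to the nominal level, while the gusty series exceeds it several times over, which is the failure mode the rest of the method addresses.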
2.3 Wavelet packet representation
Energy detection is typically applied to a band-limited representation of the recorded sound, in order to reduce the impact of non-target signals. In this paper, we will use sound frequency subbands obtained from a wavelet packet tree (WPT). Bioacoustic detectors using WPT have been used previously (Zhang and Li, 2015; Priyadarshani et al., 2020; Juodakis et al., 2021b). It is a multiscale decomposition, defined at each scale $m$ by a set of $2^m$ orthogonal filters $h_{m,j}$, each of which approximately bandpasses the signal to a frequency range of width $f_N / 2^m$. Here $f_N$ is the Nyquist frequency, equal to half of the sampling rate. Denoting by $y_t$ the original sound waveform, each node $j$ of the WPT thus contains the series of coefficients:
$$w_{j,t} = (h_{m,j} * y)_t,$$
obtained by applying the filter $h_{m,j}$.
Before application, the user selects a node $j^*$ likely to contain the target signal, i.e., the signal is detected in $w_{j^*,t}$. The choice can be made a priori based on the expected frequency range of the target, or by a training process, for example as described in Priyadarshani et al. (2020). (We will assume that a single target node is chosen for simplicity of notation, but the method proposed here directly applies with several nodes as well.)
In addition to removing out-of-band interference, the wavelet transform decorrelates various types of noise (Wornell, 1993), thus allowing methods derived under white Gaussian noise assumptions to be applied to more diverse problems. Increases in energy, caused by either in-band signal or transient broadband noises, are preserved by this transform (Figure 1).
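To make the a priori node choice concrete, the following small Python sketch maps WPT nodes to their nominal frequency ranges and lists the nodes that overlap a target band. It assumes that nodes at a given level are indexed in increasing frequency order (the natural ordering produced by many WPT implementations differs and would need to be permuted first); the function names, the decomposition level and the example band are hypothetical.

```python
def node_band(level, index, sample_rate):
    """Nominal (low, high) frequency range in Hz of a frequency-ordered WPT node."""
    nyquist = sample_rate / 2.0
    width = nyquist / 2 ** level          # all nodes at one level share this width
    return index * width, (index + 1) * width


def nodes_covering(f_low, f_high, level, sample_rate):
    """Indices of the nodes at `level` whose band overlaps (f_low, f_high)."""
    selected = []
    for index in range(2 ** level):
        low, high = node_band(level, index, sample_rate)
        if low < f_high and high > f_low:
            selected.append(index)
    return selected


# Example: a 2-3 kHz target band at 16 kHz sampling, level 5 (250 Hz per node).
target_nodes = nodes_covering(2000.0, 3000.0, level=5, sample_rate=16000)
print(target_nodes)
```

Because the wavelet filters only approximate ideal bandpassing, these ranges are nominal; a training process may select somewhat different nodes.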
2.4 Interpolating noise level
To allow detection in windy conditions, we propose a new method for estimating the level of transient broadband noise. The main idea is, at each time point, to fit a regression line to subband energies, and use that to interpolate the energy in the target band.
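We will assume for now the following model, consisting of only wind and target signal, and later discuss relaxing it to allow other signals and noises.
Assumption 1. For any WPT node $j$:
$$w_{j,t} = n_{j,t} + s_{j,t},$$
where $n_{j,t}$ and $s_{j,t}$ are the wavelet-transformed wind noise and signal components, respectively, and the signal is present only in the target node $j^*$.
We wish to estimate $n_{j^*,t}$ (or the noise power $\sigma^2_{j^*,t}$), while for all other bands the $n_{j,t}$ are directly observed.
We model the wind noise by a power spectral density (PSD) of the form:
$$S_t(f) = c_t f^{-\gamma}, \qquad (4)$$
where $f$ is the frequency, $\gamma$ a constant, and $c_t$ a factor that depends on the wind strength at time $t$ and microphone properties. This is an example of the class of $1/f$ processes, for which decorrelation by wavelet transform has been extensively studied: the resulting wavelet coefficients are almost uncorrelated within and between nodes (Wornell, 1993). In addition, their distribution is Gaussian if the original process is Gaussian, also known as fractional Brownian noise (proven trivially, by linearity of the transform), or for other processes converges to Gaussian with sufficiently large decomposition level (Atto and Pastor, 2010), or by averaging multiple coefficients (Serroukh et al., 2000).
Thus, we will continue assuming that at each time point, the wind WPT coefficients vector $\mathbf{n}_t = (n_{1,t}, \dots, n_{J,t})$ is multivariate Gaussian:
$$\mathbf{n}_t \sim N\left(\mathbf{0}, \mathrm{diag}(\sigma^2_{1,t}, \dots, \sigma^2_{J,t})\right).$$
The variance of the coefficients is close to the PSD at each node’s centre frequency $f_j$ (Atto et al., 2010; Moulines et al., 2007):
$$\sigma^2_{j,t} \approx S_t(f_j) = c_t f_j^{-\gamma}.$$
Their energy is the square of a Gaussian and thus $\chi^2_1$-distributed:
$$n_{j,t}^2 \sim \sigma^2_{j,t}\,\chi^2_1, \qquad \text{so} \qquad \log n_{j,t}^2 = \log c_t - \gamma \log f_j + \log \chi^2_1.$$
In other words, there is a linear relationship between $\log f_j$ and $\log n_{j,t}^2$. This suggests that the target band noise level could be interpolated using an OLS regression of $\log n_{j,t}^2$ on $\log f_j$:
$$(\hat\beta_{0,t}, \hat\beta_{1,t}) = \arg\min_{\beta_0, \beta_1} \sum_{j \neq j^*} \left(\log n_{j,t}^2 - \beta_0 - \beta_1 \log f_j\right)^2, \qquad (5)$$
with the interpolated log noise level in the target band given by $\hat\beta_{0,t} + \hat\beta_{1,t} \log f_{j^*}$.
Furthermore, we can use the properties $E[\log \chi^2_1] = \psi(1/2) + \log 2$, where $\psi$ is the digamma function, and $\mathrm{Var}[\log \chi^2_1] = \zeta(2, 1/2)$, with $\zeta$ being the generalized Riemann (Hurwitz) zeta function (Veitch and Abry, 1999), to obtain:
$$\log n_{j,t}^2 = \log c_t + \psi(1/2) + \log 2 - \gamma \log f_j + \varepsilon_{j,t}, \qquad E[\varepsilon_{j,t}] = 0, \quad \mathrm{Var}[\varepsilon_{j,t}] = \zeta(2, 1/2).$$
In other words, the error term is homoscedastic, so standard OLS results can be applied to show the estimation properties. In particular, we have that, under Assumption 1 and wind model (4), the resulting estimator is consistent.
Note also that instead of individual coefficients, short-term sums of $n_{j,t}^2$ over a window of $k$ coefficients could be used as the response in the regression. If the wind strength factor $c_t$ remains the same over this window, repeating the analysis above shows that the only change is smaller variance of the error term, as $\zeta(2, k/2)$ decreases for longer windows.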
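The interpolation idea can be sketched in a few lines of Python with NumPy. The sketch fits a line (or a polynomial) to log-energy against log-frequency over the non-target nodes and predicts the noise level in the target node. The node layout, the simulated spectrum parameters and all function and variable names are assumptions made for this illustration; windowed mean energies are used, so the small log-scale bias of the averaged energies is ignored.

```python
import numpy as np


def interpolate_log_noise(center_freqs, energies, target, degree=1):
    """Predict the log noise energy in node `target` from all other nodes.

    Fits log-energy against log-frequency by OLS (polynomial of the given
    degree), leaving the target node out of the fit.
    """
    log_f = np.log(np.asarray(center_freqs, dtype=float))
    log_e = np.log(np.asarray(energies, dtype=float))
    mask = np.ones(log_f.size, dtype=bool)
    mask[target] = False                      # target node may contain a call
    coefs = np.polyfit(log_f[mask], log_e[mask], degree)
    return float(np.polyval(coefs, log_f[target]))


rng = np.random.default_rng(1)
# Assumed layout: 32 frequency-ordered nodes of 250 Hz width (16 kHz sampling).
center_freqs = 125.0 + 250.0 * np.arange(32)
gamma, c_t, window = 2.0, 1e6, 50             # simulated 1/f^gamma wind spectrum
true_power = c_t * center_freqs ** (-gamma)

# Mean energy of `window` Gaussian coefficients is a scaled chi-squared variable.
energies = true_power * rng.chisquare(window, size=32) / window
target = 9
energies[target] *= 30.0                      # a loud call in the target node

log_noise_hat = interpolate_log_noise(center_freqs, energies, target)
true_log_noise = float(np.log(true_power[target]))
print(round(log_noise_hat, 2), round(true_log_noise, 2))
```

Because the target node is excluded from the fit, the call does not affect the estimate, and the observed log-energy in the target node stands well above the interpolated noise level.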
Relaxing Assumption 1. While the $1/f$ model covers a variety of noise processes, noise spectra obtained in field conditions may deviate from this model, due to frequency response of the microphone, shielding, and the recording device. Other noises may also be present. For example, white noise can be captured by the model as the special case with $\gamma = 0$, but if both white and wind noise are present at comparable power, the resulting spectrum will no longer be $1/f$. To allow adaptation to these issues, we propose including higher polynomial degrees of $\log f$ in the regression (5). We investigate possible choices of the polynomial degree on a pilot dataset in a later section.
Alternatively, Achard and Coeurjolly (2010) proposed some estimators designed to specifically reduce bias for estimation of contaminated processes. However, these estimators only outperform the basic OLS estimator if the contamination model is specified correctly, and even then require sufficient sample size, so we do not explore them further here.
Further, other narrow-band signals may be present, causing local deviations from the model. We then model the noise power observed in node $j$, denoted $e_{j,t}$, as a contaminated distribution:
$$e_{j,t} \sim (1 - p)\,\sigma^2_{j,t}\,\chi^2_1 + p\,C_{j,t},$$
where $\sigma^2_{j,t}$ is the wind noise power in the node, $C_{j,t}$ is a random variable representing the power of contaminating signal and noise, and $p$ is the rate of contamination. If the contamination is sufficiently loud, so most of the mass of $C_{j,t}$ is above $\sigma^2_{j,t}$, the distribution of $e_{j,t}$ can be highly asymmetric, and thus estimation by OLS is significantly biased, especially if higher order terms are included.
An intuitive solution is to replace the square loss (5) used in regression with an asymmetric loss function, such as:
$$\rho_\tau(u) = u\,(\tau - \mathbb{1}[u < 0]), \qquad 0 < \tau < 1,$$
and then estimate the noise level by:
$$\hat{q}_{\tau,t}(f) = \hat\beta_0 + \hat\beta_1 \log f, \qquad (\hat\beta_0, \hat\beta_1) = \arg\min_{\beta_0, \beta_1} \sum_{j} \rho_\tau\left(\log e_{j,t} - \beta_0 - \beta_1 \log f_j\right),$$
where $f_j$ is the centre frequency of node $j$.
This procedure is in fact identical to quantile regression (QR), so $\hat{q}_{\tau,t}$ estimates the $\tau$-quantile of the (contaminated) noise level given $f$ (Koenker and Bassett, 1978). This allows us to use the known properties of quantile regression to analyse this estimator.
It can be shown that if a sufficiently low quantile $\tau$ is chosen, then $\hat{q}_{\tau,t}$ estimates the $\tau/(1-p)$-quantile of the uncontaminated noise distribution (Supplementary Material A). Thus the interpolation of the noise level is biased, but a bias adjustment factor can be obtained if $p$ is available. Alternatively, note that by averaging the energy statistic over many windows, arbitrary variance reduction of the noise can be achieved, with all quantiles approaching its expectation, and thus $\hat{q}_{\tau,t}$ can be made consistent.
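The effect of the asymmetric loss can be seen in a simplified, intercept-only Python sketch, in which a single constant is fitted to a set of log-energies instead of a regression on frequency. It uses the standard check (pinball) loss of quantile regression; the function names, the simulated values and the 20% contamination rate are assumptions for this illustration.

```python
import random


def pinball_loss(residual, tau):
    """Check loss: weight tau on positive residuals, 1 - tau on negative ones."""
    return tau * residual if residual >= 0 else (tau - 1.0) * residual


def best_constant(values, tau):
    """Constant minimising the total pinball loss.

    A minimiser is always attained at one of the data points, so a search
    over the observed values is sufficient for this small example.
    """
    return min(values,
               key=lambda c: sum(pinball_loss(v - c, tau) for v in values))


random.seed(2)
clean = [random.gauss(0.0, 1.0) for _ in range(240)]   # log noise levels
loud = [random.gauss(6.0, 1.0) for _ in range(60)]     # loud contamination (20%)
contaminated = clean + loud

mean_level = sum(contaminated) / len(contaminated)     # square-loss estimate
low_quantile = best_constant(contaminated, 0.2)        # pinball-loss estimate
print(round(mean_level, 2), round(low_quantile, 2))
```

The square-loss estimate (the mean) is pulled upwards by the loud values, whereas the low-quantile estimate stays within the range of the clean values, close to their 0.25-quantile as expected from $\tau/(1-p) = 0.2/0.8$.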
2.5 Proposed adjustment method: log spectral subtraction
Once an estimate of the noise level is obtained, it can be used to estimate the clean signal energy $s_t^2$ from the observed energy $w_t^2$ in the target node. A common method for this is spectral subtraction (Vaseghi, 2009), stated in its power form as:
$$\hat{s}_t^2 = \max\left(w_t^2 - \hat\sigma_t^2,\ 0\right), \qquad (9)$$
where $\hat\sigma_t^2$ is an estimate of the noise power $\sigma_t^2$. This results in strong suppression in low SNR conditions, which is desirable for many applications.
However, this adjustment does not work well for signal detection. So far, we have assumed that the noise power is a $\chi^2$-distributed random variable, and interpolation can at best provide some estimate of its expectation $\sigma_t^2$. The distribution of $\hat{s}_t^2$, produced by spectral subtraction, will then be a left-censored and shifted $\chi^2$ distribution (Figure 2A-B). Furthermore, we show that this adjusted distribution will still depend on $\sigma_t^2$ (Supplementary Material B). Both of these issues violate the stationary Gaussian model (2), and so energy-based detectors applied after spectral subtraction even with perfect estimates will still not have the expected performance.
We propose that spectral subtraction for detection should be carried out on log scale, or equivalently:
$$\tilde{w}_t^2 = \exp\left(\log w_t^2 - \log \hat\sigma_t^2\right) = \frac{w_t^2}{\hat\sigma_t^2}.$$
In contrast to (9), this log spectral subtraction will produce $\tilde{w}_t^2$ distributed as $\chi^2_1$ (Figure 2C). Given an accurate estimate $\hat\sigma_t^2$, the adjusted estimate will match model (2) with $\sigma_n^2 = 1$, and optimal performance of the detection methods derived under this model can be expected.
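The difference between the two adjustments can be checked numerically. The Python sketch below simulates signal-free noise energies at two noise powers, applies both forms of subtraction with a perfect noise estimate, and compares an upper quantile of the adjusted energies. The function names, the noise powers (1 and 25) and the flooring at zero are assumptions made for this illustration.

```python
import numpy as np


def power_subtract(energy, noise_power):
    """Power spectral subtraction, floored at zero."""
    return np.maximum(energy - noise_power, 0.0)


def log_subtract(energy, noise_power):
    """Log spectral subtraction; equivalent to dividing by the noise power."""
    return np.exp(np.log(energy) - np.log(noise_power))


rng = np.random.default_rng(3)
unit = rng.chisquare(1, size=(2, 20000))
cases = [(1.0 * unit[0], 1.0),      # quiet period, noise power 1
         (25.0 * unit[1], 25.0)]    # wind gust, noise power 25

# 95% quantile of the adjusted noise energy in each condition.
q_power = [float(np.quantile(power_subtract(e, p), 0.95)) for e, p in cases]
q_log = [float(np.quantile(log_subtract(e, p), 0.95)) for e, p in cases]
print([round(q, 2) for q in q_power], [round(q, 2) for q in q_log])
```

After power subtraction the upper quantile still scales with the noise power, so a fixed detection threshold would behave differently in quiet and windy periods; after log subtraction both conditions share the same distribution.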
2.6 Validating the noise spectrum fit on pilot data
To investigate whether field recording data matches the $1/f$ spectrum profile, we conducted a pilot experiment on a set of short clips randomly selected from a larger monitoring project.
Over 2018–2019, nightly acoustic monitoring was conducted with passive recorders in Zealandia sanctuary, Wellington, New Zealand. We selected five nights of recordings from this data, obtained over various months using two devices (SM2, Wildlife Acoustics). Recorders were attached to trees at about 1–1.5 m above ground, with one located in a relatively exposed position on a hilltop, and the other in a sheltered valley. We extracted 3 audio clips of 0.1 seconds from each night and device, starting within one minute of 23:00, and manually verified that no distinct animal calls are heard in these clips. For comparison with a different hardware, we extracted 3 similar clips from one night from a different monitoring project, conducted in 2021 in Ponui island, New Zealand, using an AR4 recorder (Department of Conservation). All clips were resampled to 16000 Hz.
The clips were subject to WPT using two different wavelets: discrete Meyer, which approximates ideal bandpassing, and order 8 Symlet, based on its estimation performance in Atto et al. (2010). Energy within each node was averaged over the 0.1 s clip and shown as the spectrum estimate at the node centre frequency. For comparison, we also plotted the spectra obtained from the clips by periodogram, Daniell-smoothed with a 7-bin kernel and downsampled.
OLS regression models were then fitted to the log frequencies vs. log energies, as in (5). We used either all nodes between 150–7500 Hz (“full spectrum”, excludes only edge bands that have filtering effects) or nodes between 150–6000 Hz, to focus on the more likely wind range. A series of models were fitted, from linear to 6th degree polynomial, and evaluated by the small-sample corrected Akaike criterion: $\mathrm{AICc} = \mathrm{AIC} + \frac{2k(k+1)}{n-k-1}$, where $k$ is the number of model parameters and $n$ the number of nodes used in the fit.
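A model comparison of this kind could be sketched in Python as follows. The sketch assumes the usual Gaussian-likelihood form of AIC for least-squares fits, $n \log(\mathrm{RSS}/n) + 2k$, with $k$ counting the polynomial coefficients plus the error variance; the source does not state which parameter count was used. The simulated curved spectrum, the node layout and the function name are likewise hypothetical.

```python
import numpy as np


def aicc_polyfit(x, y, degree):
    """Small-sample corrected AIC of a least-squares polynomial fit."""
    n = len(x)
    coefs = np.polyfit(x, y, degree)
    rss = float(np.sum((y - np.polyval(coefs, x)) ** 2))
    k = degree + 2                        # coefficients plus error variance
    aic = n * np.log(rss / n) + 2 * k     # Gaussian likelihood, up to a constant
    return aic + 2 * k * (k + 1) / (n - k - 1)


rng = np.random.default_rng(4)
# Assumed node centres between 150 and 6000 Hz (250 Hz nodes).
freqs = 125.0 + 250.0 * np.arange(1, 24)
x = np.log(freqs) - np.log(freqs).mean()  # centred to keep the fit well conditioned
# Simulated log-energies with curvature, i.e. not a pure log-log line.
y = -2.0 * x + 0.5 * x ** 2 + rng.normal(0.0, 0.2, x.size)

scores = {d: aicc_polyfit(x, y, d) for d in range(1, 7)}
best_degree = min(scores, key=scores.get)
print(best_degree, {d: round(s, 1) for d, s in scores.items()})
```

Lower AICc indicates a better trade-off between fit and complexity; with a curved spectrum such as this one, the linear model is penalised by its large residuals.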
2.7 Case study: applying the proposed noise-robust detection
We demonstrate the proposed wind-robust detection method on two bird surveys. The first is a survey of Australasian bittern (Botaurus poiciloptilus), conducted near Lake Ellesmere, New Zealand. The male bitterns emit ‘boom’ calls at low frequency (around 150 Hz), meaning that their detection is particularly affected by wind. The survey was conducted for 2 hours using 7 recorders at 8 kHz sampling rate. Playback was used to solicit calls, and for the purpose of method evaluation we count both playback and responses as true calls. The second survey is of little spotted kiwi (Apteryx owenii) in Zealandia wildlife sanctuary, New Zealand. Male kiwi calls are a sequence of around 20 repeated syllables in the 2-3 kHz band. Eight recorders with 6 hours of sound from each were used. These surveys were previously used to evaluate sound detection methods in Juodakis et al. (2021b), and further details about this data are provided therein.
The recordings were analysed using the changepoint detector from Juodakis et al. (2021b). Briefly, a training process uses a small number of annotated files to characterize the wavelet nodes and duration of each species' calls. The survey files are then analysed to detect periods of increased energy in these nodes. The detector can adapt to long-term changes in background level, but transient events such as wind are not removed and cause false positives (Juodakis et al., 2021b). The wind-adjusted analyses use the same detectors with the same parameter settings, but reduce noise level by log-spectral subtraction as described here. The analysis was repeated using either the OLS or QR noise estimate. The same window length is used for both detection and fitting of the noise spectrum models. For the quantile estimate , we set , and in the case of bittern the estimate was adjusted upwards by 0.4 (this factor, based on (6), is typically negligible and only used here because of the low sampling rate of the bittern recordings).
Evaluation is based on the precision of a spatial capture-recapture model (SCR), as proposed previously in Juodakis et al. (2021a). SCR is a general framework for inferring population density from imperfectly detected cues (Dawson and Efford, 2009). Its key component is a detection function $g(d)$, modelling the probability of detecting calls emitted at distance $d$ from the recorder. In the grid-based SCR, as used here, this probability is estimated from calls simultaneously detected by more than one recorder. Another option is to calibrate $g(d)$ from external data, in which case the SCR reduces to the distance sampling model (Borchers et al., 2015). The density of animal calls, assumed proportional to the density of animals, is estimated using this detection function. As this density is the main target of ecological interest, we use its standard error (SE) to evaluate the detection methods.
After applying the detection algorithms, an equal number of reported segments from each method were reviewed manually. The verified detections were used to fit an SCR model, and the density SE estimated by bootstrap (Stevenson et al., 2015) is reported, as well as the coefficient of variation to allow for differences in the density estimate. We refer the reader to Borchers et al. (2015); Stevenson et al. (2015) for a full introduction to acoustic SCR, and to Juodakis et al. (2021a) or Juodakis et al. (2021b) for details on formatting data for this type of model.
2.8 Using the noise estimate to restore clean sound
The proposed estimator of broadband noise level (fitted by OLS or QR) can also be combined with other sound analysis methods, not only detectors. We demonstrate how it can be used for restoring clean sound by wavelet shrinkage.
Wavelet shrinkage by soft-thresholding is a popular denoising method (Donoho, 1995). The soft-thresholding modifies the WPT coefficients $w_{j,t}$ by translating them towards 0:
$$\tilde{w}_{j,t} = \mathrm{sign}(w_{j,t}) \max\left(|w_{j,t}| - \lambda \hat\sigma_j,\ 0\right), \qquad (10)$$
where $\hat\sigma_j$ is some estimate of the noise SD in the node $j$, and $\lambda$ tunes the strength of the thresholding. This is based on the assumption that target signal energy will be concentrated in only a few coefficients after the wavelet transform, and so shrinking all coefficients will mostly reduce noise. The adjusted WPT is then inverted to reconstruct a denoised sound waveform (see e.g., Wornell (1993)), which is simpler compared to inverting a spectrogram.
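Soft-thresholding itself is a one-line operation. The following Python sketch applies it to an array of coefficients, accepting either a single noise SD or one SD per coefficient, so that a time-varying noise estimate can be plugged in. The function and argument names are assumptions for this illustration, and the wavelet transform and its inversion are not shown.

```python
import numpy as np


def soft_threshold(coefs, noise_sd, strength):
    """Shrink coefficients towards zero by strength * noise_sd.

    `noise_sd` may be a scalar (constant threshold) or an array with the
    same shape as `coefs` (adaptive, e.g. time-varying, threshold).
    """
    coefs = np.asarray(coefs, dtype=float)
    threshold = strength * np.asarray(noise_sd, dtype=float)
    return np.sign(coefs) * np.maximum(np.abs(coefs) - threshold, 0.0)


# Constant threshold: small coefficients are zeroed, large ones shrunk.
constant = soft_threshold([-3.0, -0.5, 0.2, 2.0], noise_sd=1.0, strength=1.0)

# Adaptive threshold: the same coefficient value is shrunk more where the
# estimated noise SD is higher (for example during a wind gust).
adaptive = soft_threshold([4.0, 4.0], noise_sd=[1.0, 3.0], strength=1.0)
print(constant, adaptive)
```

With an adaptive noise SD, the same coefficient magnitude is treated as signal in quiet periods and as mostly noise during gusts.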
As a test, we create noisy files by mixing 2 min clips of windy background with bird sound examples. Background clips (5 files) were selected from Zealandia monitoring data. Bird sounds were 6 clips from the xeno-canto database and 6 clips of rich soundscapes from Zealandia monitoring. The xeno-canto examples were chosen to have a clear foreground and low background noise, because evaluating the denoising requires clean reference sounds. The Zealandia examples were taken from dawn or dusk choruses, to capture rich soundscapes that are difficult to denoise, although they have non-negligible background noise, which may impact the subsequent denoising metrics. The clips were mixed at +12 dB, 0 dB, or -12 dB SNR (the latter was only used with xeno-canto examples, as the soundscapes are too quiet to produce audible residual signal then), producing 300 min of noisy sound in total.
Each file was then analysed by constructing the WPT, and for each time window noise level estimates were obtained by fitting a cubic polynomial to the WPT as described above. Since these values estimate the log-energies of noise, we can obtain adaptive estimates of the noise SD from the OLS or QR fit by halving the fitted log-energy and exponentiating. We use these estimates in (10) to obtain OLS-denoised or QR-denoised coefficients. For comparison, we test a constant threshold with a fixed $\lambda$ and a fixed noise SD estimate; this is a robust estimate of the noise SD, commonly recommended for wavelet shrinkage, and leading to various optimal theoretical properties (Donoho, 1995). The resulting adjusted WPT was inverted to reconstruct the denoised sound file following standard wavelet methods (Donoho, 1995).
The success of denoising was evaluated by estimating the SNR improvement in dB:
$$\Delta\mathrm{SNR} = 10 \log_{10} \frac{\lVert s - x \rVert^2}{\lVert s - \hat{s} \rVert^2},$$
where $s$, $x$ and $\hat{s}$ are the clean, noisy and denoised waveforms of a file. We also calculate the SI-SDR, which is a robust modification of SNR that is invariant to scale changes introduced during denoising (Le Roux et al., 2019).
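Both metrics are straightforward to compute from the three waveforms. The Python sketch below assumes the common definitions: SNR as the ratio of reference energy to residual energy, SNR improvement as the difference between the SNR after and before denoising, and SI-SDR computed after projecting the estimate onto the reference. The function names and the synthetic test signals are hypothetical.

```python
import numpy as np


def snr_db(reference, estimate):
    """SNR of `estimate` relative to the clean `reference`, in dB."""
    residual = estimate - reference
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(residual ** 2))


def snr_improvement_db(clean, noisy, denoised):
    """Gain in SNR achieved by denoising, in dB."""
    return snr_db(clean, denoised) - snr_db(clean, noisy)


def si_sdr_db(reference, estimate):
    """Scale-invariant SDR: the reference is rescaled to best match the estimate."""
    scale = np.dot(estimate, reference) / np.dot(reference, reference)
    target = scale * reference
    return 10.0 * np.log10(np.sum(target ** 2) / np.sum((estimate - target) ** 2))


rng = np.random.default_rng(5)
t = np.arange(16000) / 16000.0
clean = np.sin(2 * np.pi * 440.0 * t)          # synthetic "call"
noise = rng.normal(0.0, 0.5, t.size)           # synthetic background
noisy = clean + noise
denoised = clean + 0.1 * noise                 # idealised denoiser output

improvement = snr_improvement_db(clean, noisy, denoised)
print(round(improvement, 2), round(si_sdr_db(clean, denoised), 2))
```

Reducing the noise amplitude tenfold corresponds to a 20 dB improvement, and rescaling the denoised output changes the plain SNR but leaves the SI-SDR unchanged.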
3.1 Field recordings indicate non-linear background spectra
The pilot dataset revealed the presence of a variety of background noise spectra (Figure 3A). Overall, the noise power was higher in lower frequencies, higher in the January and March nights when more wind gusts were audible, and more variable for the recorder in an exposed location, which is in line with the model (note that some files have an additional peak corresponding to strong cicada noise, at around 3000 Hz). Spectra taken close in time to each other show little variation, suggesting that the estimation is precise in stable conditions.
However, over longer periods, spectral shapes varied considerably, deviating from the predicted log-log line. Even within the same device and same minute, wind gusts caused some considerable changes in spectrum shape (see top lines in Figure 3A, “windy” recorder). Similar spectra were obtained with a different wavelet, or by a smoothed periodogram (Supplementary Material, Figure S1), indicating that the shape variation is not caused by the chosen estimation method.
Linear models were also not supported by the fit statistics: when fitting the full-spectrum, average AICc for the linear model was 35.0, while the higher order polynomials had AICc between 17.0–23.6. Even if the frequency range is limited to <6000 Hz, the linear model is still insufficient (AICc 28.5, but 16.5–21.3 for higher order models). The optimal model degree by this criterion was 5 (for full spectrum) or 3 (for <6000 Hz). Some examples of 3rd degree and linear spectrum fits are shown in Figure 3B-C. Based on these results, we chose to use a 3rd degree model fitted to <6000 Hz spectrum in the detector, as it seems to provide sufficient flexibility in the range where wind noise is the most prominent, without great sensitivity to interferences or large computational cost.
3.2 Evaluating robust detection on surveys
In all tested settings, we observed that wind adjustment greatly reduced the number of false positives. In the bittern survey, 859 detections were obtained using the OLS-adjusted method. Without the adjustment, 1505 segments were reported, of which 57% were reviewed to equalize the effort across both methods. Most of the additional false positives in this set were indeed wind or other broadband noises such as plane overflights, which were also removed by the proposed adjustment. Fitting SCR models to the two sets of detections confirms that the adjustment greatly improves survey efficiency, with about two-fold lower coefficient of variation for the estimated density (Table 1).
An even greater contrast is seen in the kiwi survey. With the same thresholds, the adjusted detection resulted in so few false positives that it required extreme downsampling of the unadjusted data (SCR models could not be reliably fitted). Therefore the results shown here use a two times smaller threshold for the adjusted detector. This produced 323 detections, mostly true positives, with the rest caused by sounds of other species in the target bands, such as the kaka parrot (Nestor meridionalis). In comparison, the unadjusted detector produced 1315 detections (25% reviewed), and 4 times less precise density estimates (CoV 51.1% vs 12.2%, Table 1).
| | Bittern, no adj. | Bittern, OLS adj. | Kiwi, no adj. | Kiwi, OLS adj. |
|---|---|---|---|---|
| Density CoV (%) | 99.5 | 52.5 | 51.1 | 12.2 |
Detection results from two bird surveys, obtained using a wavelet changepoint detector with or without a wind noise adjustment. The adjustment uses the OLS spectrum fit presented in this paper. The main evaluation metrics are highlighted: standard error (SE) or the coefficient of variation (CoV) of the survey density estimate. Also shown are the estimates of the density itself and of the detection radius parameter.
When estimating wind noise by quantile regression instead of OLS, slightly more false positives were produced, with 1025 total detections for bittern and 360 for kiwi. The detected segments mostly matched those reported by OLS, so we do not analyse these further here.
3.3 Incorporating the wind estimator into denoising
The proposed wind noise level estimators are useful for denoising as well. Wavelet shrinkage with wind-adaptive thresholds, either estimated by OLS or by QR, considerably improved the SNR (Figure 4). In contrast, the same denoising method with a constant threshold led to very little improvement: while it decreased the overall background noise, most of the noise energy in these examples came from wind gusts, which this method could not remove. Note that in some cases SNR even apparently decreased: because some white noise was present in the “clean” recordings as well, removing that decreased the measured match between the reference and denoised files, and was thus counted as a loss of signal by this metric. (This effect also contributed to the lower denoising performance seen when using the soundscape references, which had more residual noise than the xeno-canto clips.)
Similar results are seen when using the SI-SDR metric (Supplementary Material, Figure S2). As this metric is invariant to the initial mixing SNR, it produces more uniform measures over the tested files, removing the spurious improvement peak seen at 0 dB with the xeno-canto examples. The difference between OLS and QR estimation methods was very small in the metrics used here, although in favour of QR in every case.
To gain some insight into the working of each method, we show spectrograms of the denoised outputs from one example clip in Figure 5. Denoising with a constant threshold successfully removed the more stable parts of the background noise, as indicated by uniform grey areas in Figure 5A; however, it had very little effect on the wind gust seen around 10 s from the start of the clip. Both of the time-varying estimators successfully modelled this gust, which led to its removal. The main difference between the estimation by OLS and QR is seen during the time periods when loud calls are present: as predicted, these calls affect the OLS fit more, and cause over-adjustment (grey gaps) or under-adjustment (green residual noise) in the 0-2 kHz frequency range (Fig. 5B). The QR estimate was robust to these effects (Fig. 5C).
4.1 Summary and alternative design considerations
In this study, we proposed a new noise estimator based on fitting a polynomial model to wavelet packet node energies. This estimator was combined with log spectral subtraction to stabilize the noise level. In our case study this adjustment greatly reduced the number of false positive detections and led to more efficient acoustic surveys. Additionally, we showed that the estimator can be incorporated in a wavelet denoising method to restore sound polluted by broadband noise.
Although our initial motivation was wind noise, which in theory is associated with a specific (log-)linear spectrum shape, the pilot experiment indicated that a more flexible model was needed. The resulting polynomial estimator now also captures more general broadband noises besides wind. This is useful for our surveys, but in other cases the target signal may be broadband, such as insect stridulations (Field and Rind, 1992). Our method is thus limited to signals with characteristic frequency bands. However, choosing other filterbanks instead of the wavelet packet, such as the Mel, gammatone or Greenwood (Zeppelzauer et al., 2015), may concentrate different sounds better, thus allowing analysis of a wide variety of tasks.
We proposed to fit the spectrum using quantile regression to account for asymmetric contamination when other signals are present. Nonetheless, the standard OLS fitting appeared surprisingly robust, although choosing QR is safer when the soundscape is particularly rich, or precise noise estimation around calls is important. This may be a useful precursor step for automatic analysis of dawn choruses, in which the high density of calls presents a challenge for current detection software (Brooker et al., 2020).
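The effect of asymmetric contamination on the two fits can be sketched on synthetic data. Everything below is assumed for illustration and does not come from the study: the 32-node layout, the quadratic spectrum coefficients, the size of the call, and the quantile 0.2. Quantile regression (Koenker and Bassett, 1978) is solved here as a linear program with `scipy.optimize.linprog`, which is one standard formulation and not necessarily the solver used by the authors.

```python
import numpy as np
from scipy.optimize import linprog

def fit_ols(X, y):
    """Ordinary least squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def fit_quantile(X, y, q):
    """Quantile regression coefficients, found by linear programming."""
    n, p = X.shape
    # Variables: p coefficients, then positive and negative residual parts.
    cost = np.concatenate([np.zeros(p), q * np.ones(n), (1.0 - q) * np.ones(n)])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])
    bounds = [(None, None)] * p + [(0.0, None)] * (2 * n)
    res = linprog(cost, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:p]

rng = np.random.default_rng(2)
# Hypothetical layout: 32 wavelet packet nodes of 250 Hz each (16 kHz sampling).
centres_hz = (np.arange(32) + 0.5) * 250.0
log_f = np.log(centres_hz / 1000.0)
X = np.vander(log_f, 3, increasing=True)     # columns: 1, log f, (log f)^2

true_coefs = np.array([2.0, -1.5, -0.2])     # made-up wind-like spectrum
noise_log_energy = X @ true_coefs
y = noise_log_energy + 0.1 * rng.standard_normal(32)
call_nodes = np.arange(4, 10)                # a loud call at about 1-2.5 kHz
y[call_nodes] += 4.0                         # calls only ever add energy

clean = np.setdiff1d(np.arange(32), call_nodes)
fit_o = X @ fit_ols(X, y)
fit_q = X @ fit_quantile(X, y, q=0.2)
# Worst-case error of each noise estimate in the nodes without calls.
err_ols = np.max(np.abs(fit_o - noise_log_energy)[clean])
err_qr = np.max(np.abs(fit_q - noise_log_energy)[clean])

# Log-spectral subtraction: what remains after removing the fitted noise level.
adjusted = y - fit_q
```

Since the call adds energy in one direction only, the least-squares fit is pulled upwards across the whole spectrum, while the low-quantile fit stays close to the background and leaves the call standing out in `adjusted`. The quantile fit targets a lower quantile of the noise rather than its mean, so a small constant offset from the true level is expected.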
4.2 Differences from other noise estimators
The surveys analysed here highlight some of the issues with applying other noise estimation methods in bioacoustics. Since our spectrum model uses short time windows (on the order of 0.1 s) and is not smoothed over time, it can adapt to fast transients, while methods such as PCEN (Lostanlen et al., 2019) or MMSE-STSA (Brown et al., 2018) critically rely on the noise changing more slowly than the signal, so would be unusable with the c. 30-second-long kiwi call. Methods designed to remove low-frequency noises, such as presented in Nelke et al. (2014), cannot be applied to the 150 Hz bittern sounds, and in general require knowledge of the other signals expected in the environment. In contrast, the proposed noise estimator needs very little tuning to be applied to different species: the main parameter is the frequency range of the target, which is retrieved from the detection stage. Furthermore, it can often be used even without specifying the signal bands at all, as shown in the denoising examples.
Thus, the proposed framework is designed for low training data situations that are common in wildlife research, where recording collection and expert annotation are expensive. In our survey analysis workflow, only the wavelet energy detector needs training; it has a simple structure and so can be trained with less than an hour of data (Priyadarshani et al., 2020; Juodakis et al., 2021b). Neural networks could be used to create noise-robust detectors that outperform our results if given sufficient data, but this likely means at least thousands of clips, as in e.g., Vickers et al. (2021). Options to reduce this requirement, such as transfer learning or weak labels (Serizel et al., 2018), are actively researched. Even so, our method will remain useful, as it provides a fast and robust initial screening step at very little cost in terms of missed calls, and its output can be verified by a more sophisticated procedure if available.
Additionally, we have used a wavelet transform throughout all stages of the sound analysis. This transform can be easily inverted, as we have done here to recreate the denoised audio files, whereas most other methods produce spectrograms, which are more complicated to invert (Zhu et al., 2007).
4.3 Evaluating the improvements in practice
The metrics chosen to evaluate the proposed methods may not represent all practical needs. For the detection stage, we conducted a grid-based survey in the SCR framework and measured its efficiency as proposed in Juodakis et al. (2021a). Alternative measures, such as the F-score, are common in the acoustic detection community (see e.g., Priyadarshani et al. (2020)). In contrast to these, our SCR metric directly measures the precision of the estimate of interest, and thus the power to conduct ecologically relevant comparisons. To the best of our knowledge, this study is the first to explicitly demonstrate that survey efficiency is gained by using noise-robust sound analysis methods. The metric is also quite general, as various bioacoustic survey designs can be expressed as special cases of SCR (Borchers et al., 2015). Ultimately, in the present case, the robust detector showed much lower false alarm rate with almost no loss in true detections, so it should be identified as an improvement by most metrics.
It is yet more complicated to evaluate the benefits of denoising in bioacoustics. If the cleaned sound is used for human listening, presence of perceptual artefacts such as musical noise may be more important than SNR (Vaseghi, 2009). Metrics such as PESQ have been designed to capture the subjective quality of sound (Rix et al., 2001), but they rely on speech-specific properties and do not directly apply to other species. In the context of ecological monitoring, the primary application of denoising currently is to improve the classification of calls by neural networks, as in e.g., Vickers et al. (2021). Deep learning is also suggested for more holistic ecoacoustic assessments, outside traditional surveys, and removing noise is also of interest there (Fairbrass et al., 2018). Because of the black-box nature of these methods and variety in the network and training setups, it is not clear whether SNR, PESQ or other metrics would actually be predictive of their performance. Standardizing the protocols of training and applying neural networks in bioacoustics would allow one to investigate this relationship, and to further develop denoising methods that are beneficial in ecology practice.
This research is supported by the New Zealand Marsden Fund, which is administered by the Royal Society of New Zealand Te Apārangi under grants 17-MAU-154 and 17-UOA-295. We also thank Danielle Shanahan for the opportunity to work in Zealandia, and Alberto De Rosa for the Ponui field data.
- Achard and Coeurjolly (2010) Achard, S. and Coeurjolly, J.-F. (2010) Discrete variations of the fractional Brownian motion in the presence of outliers and an additive noise. Statistics Surveys, 4, 117–147.
- Astaras et al. (2017) Astaras, C., Linder, J. M., Wrege, P., Orume, R. D. and Macdonald, D. W. (2017) Passive acoustic monitoring as a law enforcement tool for Afrotropical rainforests. Frontiers in Ecology and the Environment, 15, 233–234.
- Atto and Pastor (2010) Atto, A. and Pastor, D. (2010) Central limit theorems for wavelet packet decompositions of stationary random processes. IEEE Transactions on Signal Processing, 58, 896–901.
- Atto et al. (2010) Atto, A. M., Pastor, D. and Mercier, G. (2010) Wavelet packets of fractional Brownian motion: Asymptotic analysis and spectrum estimation. IEEE Transactions on Information Theory, 56, 4741–4753.
- Borchers et al. (2015) Borchers, D. L., Stevenson, B. C., Kidney, D., Thomas, L. and Marques, T. A. (2015) A unifying model for capture–recapture and distance sampling surveys of wildlife populations. Journal of the American Statistical Association, 110, 195–204.
- Brooker et al. (2020) Brooker, S. A., Stephens, P. A., Whittingham, M. J. and Willis, S. G. (2020) Automated detection and classification of birdsong: An ensemble approach. Ecological Indicators, 117, 106609.
- Brown et al. (2018) Brown, A., Garg, S. and Montgomery, J. (2018) Automatic and efficient denoising of bioacoustics recordings using MMSE STSA. IEEE Access, 6, 5010–5022.
- Chen (2010) Chen, Y. (2010) Improved energy detector for random signals in Gaussian noise. IEEE Transactions on Wireless Communications, 9, 558–563.
- Dawson and Efford (2009) Dawson, D. K. and Efford, M. G. (2009) Bird population density estimated from acoustic signals. Journal of Applied Ecology, 46, 1201–1209.
- Digby et al. (2013) Digby, A., Towsey, M., Bell, B. D. and Teal, P. D. (2013) A practical comparison of manual and autonomous methods for acoustic monitoring. Methods in Ecology and Evolution, 4, 675–683.
- Donoho (1995) Donoho, D. (1995) De-noising by soft-thresholding. IEEE Transactions on Information Theory, 41, 613–627.
- Fairbrass et al. (2018) Fairbrass, A. J., Firman, M., Williams, C., Brostow, G. J., Titheridge, H. and Jones, K. E. (2018) CityNet—deep learning tools for urban ecoacoustic assessment. Methods in Ecology and Evolution, 10, 186–197.
- Field and Rind (1992) Field, L. and Rind, F. (1992) Stridulatory behaviour in a New Zealand weta, Hemideina crassidens. Journal of Zoology, 228, 371–394.
- Juodakis et al. (2021a) Juodakis, J., Castro, I. and Marsland, S. (2021a) Precision as a metric for acoustic survey design using occupancy or spatial capture-recapture. Environmental and Ecological Statistics, 28, 587–608.
- Juodakis et al. (2021b) Juodakis, J., Marsland, S. and Priyadarshani, N. (2021b) A changepoint prefilter for sound event detection in long-term bioacoustic recordings. The Journal of the Acoustical Society of America, 150, 2469–2478.
- Kahl et al. (2019) Kahl, S., Stöter, F.-R., Goëau, H., Glotin, H., Planque, R., Vellinga, W.-P. and Joly, A. (2019) Overview of BirdCLEF 2019: Large-Scale Bird Recognition in Soundscapes. In CLEF 2019 Working Notes (eds. Cappellato, L., Ferro, N., Losada, D. E. and Müller, H.), vol. 2380 of CEUR Workshop Proceedings, 1–9. Lugano, Switzerland: CEUR.
- Keshavarzi et al. (2018) Keshavarzi, M., Goehring, T., Zakis, J., Turner, R. E. and Moore, B. C. J. (2018) Use of a deep recurrent neural network to reduce wind noise: Effects on judged speech intelligibility and sound quality. Trends in Hearing, 22.
- Koenker and Bassett (1978) Koenker, R. and Bassett, G. (1978) Regression quantiles. Econometrica, 46, 33.
- Lasseck (2013) Lasseck, M. (2013) Bird song classification in field recordings: Winning solution for NIPS4B 2013 competition. Proceedings of ‘Neural Information Processing Scaled for Bioacoustics: From Neurons to Big Data - NIPS4B’, 176–181.
- Lostanlen et al. (2019) Lostanlen, V., Salamon, J., Cartwright, M., McFee, B., Farnsworth, A., Kelling, S. and Bello, J. P. (2019) Per-channel energy normalization: Why and how. IEEE Signal Processing Letters, 26, 39–43.
- Mcloughlin et al. (2019) Mcloughlin, M. P., Stewart, R. and McElligott, A. G. (2019) Automated bioacoustics: methods in ecology and conservation and their potential for animal welfare monitoring. Journal of The Royal Society Interface, 16, 20190225.
- Metcalf et al. (2020) Metcalf, O. C., Lees, A. C., Barlow, J., Marsden, S. J. and Devenish, C. (2020) hardRain: An R package for quick, automated rainfall detection in ecoacoustic datasets using a threshold-based approach. Ecological Indicators, 109, 105793.
- Moulines et al. (2007) Moulines, E., Roueff, F. and Taqqu, M. S. (2007) On the spectral density of the wavelet coefficients of long-memory time series with application to the log-regression estimation of the memory parameter. Journal of Time Series Analysis, 28, 155–187.
- Nelke (2016) Nelke, C. (2016) Wind noise reduction: signal processing concepts. Ph.D. thesis, RWTH Aachen University, Germany.
- Nelke et al. (2014) Nelke, C. M., Chatlani, N., Beaugeant, C. and Vary, P. (2014) Single microphone wind noise PSD estimation using signal centroids. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE.
- Nelke and Vary (2015) Nelke, C. M. and Vary, P. (2015) Wind noise short term power spectrum estimation using pitch adaptive inverse binary masks. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE.
- Page (1954) Page, E. S. (1954) Continuous inspection schemes. Biometrika, 41, 100.
- Prince et al. (2019) Prince, P., Hill, A., Covarrubias, E. P., Doncaster, P., Snaddon, J. and Rogers, A. (2019) Deploying acoustic detection algorithms on low-cost, open-source acoustic sensors for environmental monitoring. Sensors, 19, 553.
- Priyadarshani et al. (2018) Priyadarshani, N., Marsland, S. and Castro, I. (2018) Automated birdsong recognition in complex acoustic environments: a review. Journal of Avian Biology, 49, jav–01447.
- Priyadarshani et al. (2020) Priyadarshani, N., Marsland, S., Juodakis, J., Castro, I. and Listanti, V. (2020) Wavelet filters for automated recognition of birdsong in long-time field recordings. Methods in Ecology and Evolution, 11, 403–417.
- Rhinehart et al. (2020) Rhinehart, T. A., Chronister, L. M., Devlin, T. and Kitzes, J. (2020) Acoustic localization of terrestrial wildlife: Current practices and future opportunities. Ecology and Evolution, 10, 6794–6818.
- Rix et al. (2001) Rix, A., Beerends, J., Hollier, M. and Hekstra, A. (2001) Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs. In 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221). IEEE.
- Roux et al. (2019) Roux, J. L., Wisdom, S., Erdogan, H. and Hershey, J. R. (2019) SDR – half-baked or well done? In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE.
- Serizel et al. (2018) Serizel, R., Turpault, N., Eghbal-Zadeh, H. and Shah, A. P. (2018) Large-scale weakly labeled semi-supervised sound event detection in domestic environments. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), 19–23.
- Serroukh et al. (2000) Serroukh, A., Walden, A. T. and Percival, D. B. (2000) Statistical properties and uses of the wavelet variance estimator for the scale analysis of time series. Journal of the American Statistical Association, 95, 184–196.
- Shonfield and Bayne (2017) Shonfield, J. and Bayne, E. (2017) Autonomous recording units in avian ecological research: current use and future applications. Avian Conservation and Ecology, 12.
- Stevenson et al. (2015) Stevenson, B. C., Borchers, D. L., Altwegg, R., Swift, R. J., Gillespie, D. M. and Measey, G. J. (2015) A general framework for animal density estimation from acoustic detections across a fixed microphone array. Methods in Ecology and Evolution, 6, 38–48.
- Sugai et al. (2018) Sugai, L. S. M., Silva, T. S. F., Ribeiro, J. W. and Llusia, D. (2018) Terrestrial passive acoustic monitoring: Review and perspectives. BioScience, 69, 15–25.
- Vaseghi (2009) Vaseghi, S. V. (2009) Advanced Digital Signal Processing and Noise Reduction. Wiley.
- Veitch and Abry (1999) Veitch, D. and Abry, P. (1999) A wavelet-based joint estimator of the parameters of long-range dependence. IEEE Transactions on Information Theory, 45, 878–897.
- Vickers et al. (2021) Vickers, W., Milner, B., Risch, D. and Lee, R. (2021) Robust North Atlantic right whale detection using deep learning models for denoising. The Journal of the Acoustical Society of America, 149, 3797–3812.
- Walker and Hedlin (2009) Walker, K. T. and Hedlin, M. A. (2009) A review of wind-noise reduction methodologies. In Infrasound Monitoring for Atmospheric Studies, 141–182. Springer Netherlands.
- Wornell (1993) Wornell, G. W. (1993) Wavelet-based representations for the 1/f family of fractal processes. In Proceedings of the IEEE, vol. 81, 1428–1450. IEEE.
- Zeppelzauer et al. (2015) Zeppelzauer, M., Hensman, S. and Stoeger, A. S. (2015) Towards an automated acoustic detection system for free-ranging elephants. Bioacoustics, 24, 13–29.
- Zhang and Li (2015) Zhang, X. and Li, Y. (2015) Adaptive energy detection for bird sound detection in complex environments. Neurocomputing, 155, 108–116.
- Zhu et al. (2007) Zhu, X., Beauregard, G. T. and Wyse, L. L. (2007) Real-time signal estimation from modified short-time Fourier transform magnitude spectra. IEEE Transactions on Audio, Speech and Language Processing, 15, 1645–1653.
- Znidersic et al. (2021) Znidersic, E., Towsey, M. W., Hand, C. and Watson, D. M. (2021) Eastern black rail detection using semi-automated analysis of long-duration acoustic recordings. Avian Conservation and Ecology, 16.
A Estimation of quantiles with contamination
Let be a random variable following a contaminated mixture distribution, i.e., its CDF is , with a contaminating variable that is concentrated at larger values than . Specifically, denoting the median of as , define this requirement as:
Then its -quantile for is:
The , and variables correspond to , and from (7). Thus, the quantile regression estimate , which converges to under standard regression conditions, also converges to the quantile of the “clean” distribution .
Naturally, neither nor the exact range of over which the requirement (11) holds are known in advance. However, choosing a small quantile such as should work for most situations in practice: even under 50 % contamination, this corresponds to , and we can reasonably expect that most signals will significantly exceed the median of the background.
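The quantile relationship in this appendix can be checked numerically. The sketch below uses notation and values that are assumptions for illustration only: a contamination fraction `eps` of 0.5, a quantile `q` of 0.2, a standard normal background and an exponential contaminating signal placed entirely above it. Under these conditions the mixture CDF below the signal range is (1 − eps) times the background CDF, so the q-quantile of the mixture should coincide with the q / (1 − eps) quantile of the background.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
eps = 0.5          # assumed contamination fraction
q = 0.2            # assumed (small) quantile used for the fit

background = rng.standard_normal(n)                  # "clean" noise values
# Contaminating values, placed entirely above the background for simplicity.
signal = background.max() + 1.0 + rng.exponential(1.0, size=n)
mixture = np.concatenate([background, signal])       # equal parts: eps = 0.5

q_mixture = np.quantile(mixture, q)
q_clean = np.quantile(background, q / (1.0 - eps))   # the 0.4 quantile
```

With these values the 0.2 quantile of the contaminated data matches the 0.4 quantile of the background, which still lies below the background median.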
B Tail probabilities of spectral subtraction estimates
Let be a random process with , where the noise strength varies with time. Assume the expected value of the process at each time , i.e., is known exactly. We wish to find the tail probabilities of , obtained by standard (power) spectral subtraction (9):
This is by definition a left-censored variable, with pdf
The tail probability for any is
Note that in the denoising context, this “denoised” distribution still depends on the noise strength .
Under log spectral subtraction, we immediately have that , and thus the tail probabilities for any are . The rectification can be omitted if the distribution properties for are also relevant.
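A small simulation illustrates why the two subtraction rules behave differently. The distribution and numbers below are assumptions for illustration only: the noise energy is drawn from an exponential distribution scaled by a known noise power, and the two noise powers (1 and 25) and thresholds are arbitrary. The log variant here subtracts the log of the expected energy; the name `tail_probs` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
unit = rng.exponential(1.0, size=n)       # assumed unit-scale noise energy

def tail_probs(noise_power, c_power, c_log):
    """Empirical tail probabilities after power and log spectral subtraction."""
    x = noise_power * unit                           # energy with E[x] = noise_power
    power_sub = np.maximum(x - noise_power, 0.0)     # rectified power subtraction
    log_sub = np.log(x) - np.log(noise_power)        # log spectral subtraction
    return np.mean(power_sub > c_power), np.mean(log_sub > c_log)

calm = tail_probs(1.0, c_power=2.0, c_log=1.0)
gust = tail_probs(25.0, c_power=2.0, c_log=1.0)
```

In this setting, the probability of exceeding a fixed threshold after power subtraction grows from about 0.05 in calm conditions to about 0.34 during the gust, while the log-subtracted values have the same tail probability at both noise powers, because the scale factor cancels in the difference of logarithms.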