Long-distance Detection of Bioacoustic Events with Per-channel Energy Normalization

11/01/2019 ∙ by Vincent Lostanlen, et al.

This paper proposes to perform unsupervised detection of bioacoustic events by pooling the magnitudes of spectrogram frames after per-channel energy normalization (PCEN). Although PCEN was originally developed for speech recognition, it also has beneficial effects in enhancing animal vocalizations, despite the presence of atmospheric absorption and intermittent noise. We prove that PCEN generalizes logarithm-based spectral flux, yet with a tunable time scale for background noise estimation. In comparison with pointwise logarithm, PCEN reduces false alarm rate by 50x in the near field and 5x in the far field, both on avian and marine bioacoustic datasets. Such improvements come at moderate computational cost and require no human intervention, thus heralding a promising future for PCEN in bioacoustics.




1 Introduction

The deployment of autonomous recording units offers a minimally invasive sampling of acoustic habitats [shonfield2017ace], with numerous applications in ecology and conservation biology [efford2009ecology]. In this context, there is an extensive literature on tailoring spectrogram parameters to a specific task of detection or classification: the effects of window size, frequency scale, and discretization are now well understood [ulloa2016screening, knight2019bioacoustics]. However, the important topic of loudness mapping, i.e. representing contrast in the time–frequency domain, has received less attention.

This article investigates the impact of distance between sensor and source on the time–frequency representation of acoustic events. In particular, we point out that measuring local contrast by a difference in pointwise logarithms, as is routinely done in machine learning for bioacoustics, suffers from numerical instabilities in the presence of atmospheric attenuation and intermittent noise. To address this problem, we propose to employ an adaptive gain control technique known as per-channel energy normalization (PCEN).


We deliberately err on the side of design simplicity: rather than training a sophisticated classifier, we apply a constant threshold on the time series of max-pooled PCEN magnitudes. In doing so, our goal is not to achieve the lowest possible false alarm rate, but to argue in favor of replacing the logarithmic mapping of loudness by PCEN in all systems for long-distance sound event detection, including more powerful yet opaque ones such as deep neural networks [zinemanas2019fruct, cartwright2019waspaa].

Section 2 discusses the theoretical benefits of such a replacement: it proves that PCEN extends temporal context beyond a single temporal frame, thus improving effective detection radius. Sections 3 and 4 present applications to avian and underwater bioacoustics respectively, thereby revealing complementary issues: while bird call detection focuses on mitigating atmospheric absorption at high audible frequencies (1–), whale call detection focuses on mitigating the interference of amplitude-modulated noise from near-field passing ships at low audible frequencies (50–).

This work is partially supported by NSF awards 1633206 and 1633259, the Leon Levy Foundation, and the Pinkerton Foundation. The source code to reproduce experiments and figures is available at:

Figure 1: Effect of pointwise logarithm (left) and per-channel energy normalization (PCEN, right) on the same Common Nighthawk vocalization, as recorded from various distances. White dots depict the time–frequency locations of maximal spectral flux (left) or maximal PCEN magnitude (right). The spectrogram covers a duration of and a frequency range between 2 and .

2 Spectrotemporal measures of novelty

2.1 Averaged spectral flux

Let be the magnitude spectrogram of some discrete-time waveform. In full generality, the ordinal variable may represent frequency on a linear scale, a mel scale, or a logarithmic scale. Given , an implementation of spectral flux composes three operators: loudness mapping, contrast estimation, and feature aggregation. In its most widespread variant, named averaged spectral flux, these three operators respectively correspond to pointwise logarithm, rectified differentiation, and frequential averaging:


where is the number of frequency bands in . The motivation underlying this design choice finds its roots in psychoacoustics, and notably the Weber-Fechner law, which states that the relationship between stimulus and sensation is logarithmic [klapuri1999icassp]. We may also remark that Equation 1 is invariant to gain. Indeed, multiplying the waveform by some constant incurs a multiplication by in each frequency band of , and thus an additive bias of in , which eventually cancels after first-order differentiation. In the case of a single point source at some distance , the relative change in acoustic pressure caused by a spherical wave propagation is proportional to . Therefore, in a lossless medium without reflections, logarithm-based spectral flux is invariant to geometric spreading insofar as acoustic sources do not overlap in the time–frequency domain.
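The three operators named above (pointwise logarithm, rectified differentiation, frequential averaging) can be illustrated with a minimal sketch. The function name, the list-of-frames layout, and the toy values are assumptions made for illustration; they are not taken from the original equation, whose symbols did not survive in this copy. The second half checks the gain invariance discussed above on a toy example.

```python
import math

def averaged_spectral_flux(spectrogram):
    """Averaged spectral flux of a magnitude spectrogram (illustrative sketch).

    `spectrogram` is a list of frames; each frame is a list of strictly
    positive magnitudes, one per frequency band (assumed layout).
    Returns one value per frame transition, i.e. len(spectrogram) - 1 values.
    """
    flux = []
    for previous, current in zip(spectrogram, spectrogram[1:]):
        # Pointwise logarithm followed by a first-order difference along time.
        differences = [math.log(c) - math.log(p) for p, c in zip(previous, current)]
        # Half-wave rectification keeps increases in loudness only.
        rectified = [max(d, 0.0) for d in differences]
        # Frequential averaging.
        flux.append(sum(rectified) / len(rectified))
    return flux

# Toy spectrogram: 3 frames, 2 bands, with an onset in band 0 at frame 1.
toy = [[1.0, 1.0], [4.0, 1.0], [4.0, 1.0]]
flux = averaged_spectral_flux(toy)

# Gain invariance: scaling every magnitude by a constant leaves the flux
# unchanged, up to floating-point rounding.
scaled = [[10.0 * value for value in frame] for frame in toy]
scaled_flux = averaged_spectral_flux(scaled)
```

In this toy case the onset is visible in one band out of two, so averaging halves its contribution to the detection function.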

2.2 Max-pooled spectral flux

The situation is different in an absorbing medium. Indeed, heat conduction and shear viscosity, in conjunction with molecular relaxation processes, attenuate sine waves in quadratic proportion to their frequency [sutherland1998handbook]. Under standard atmospheric conditions, this attenuation is below at , yet of the order of at . As a result, bird calls spanning multiple octaves lose in bandwidth as they travel through air. A simple workaround is to replace the frequential averaging in Equation 1 by a max-pooling operator. This replacement yields the max-pooled spectral flux


which performs differentiation on a single frequency band, and is thus invariant to the low-pass filtering effect induced by absorption. However, as illustrated in Figure 1, the definition above suffers from numerical instabilities. Indeed, discards all but two scalar values, corresponding to neighboring time–frequency bins in the spectrogram .
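As a hedged illustration of the max-pooling variant, the sketch below replaces the average across bands by a maximum. The function name and the toy spectrograms are hypothetical; the "far" example simply removes the onset from the upper band to mimic the low-pass effect of absorption.

```python
import math

def max_pooled_spectral_flux(spectrogram):
    """Max-pooled spectral flux (illustrative sketch).

    `spectrogram` is a list of frames, each a list of strictly positive
    magnitudes, one per frequency band (assumed layout).
    """
    flux = []
    for previous, current in zip(spectrogram, spectrogram[1:]):
        # Rectified difference of pointwise logarithms, per band.
        rectified = [max(math.log(c) - math.log(p), 0.0)
                     for p, c in zip(previous, current)]
        # Max-pooling across frequency instead of averaging.
        flux.append(max(rectified))
    return flux

# Same onset, observed in both bands (near) or in the lower band only (far).
near = [[1.0, 1.0], [4.0, 4.0]]
far = [[1.0, 1.0], [4.0, 1.0]]  # onset no longer visible in the upper band
near_flux = max_pooled_spectral_flux(near)
far_flux = max_pooled_spectral_flux(far)
```

Both toy spectrograms yield the same detection value, since a single band carrying the onset suffices. The flip side is that the output depends on one pair of neighboring bins only.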

2.3 Max-pooled per-channel energy normalization

In order to associate invariance and stability, this article proposes to increase the time scale of contrast estimation beyond a single spectrogram frame. To this end, we replace both the logarithmic mapping of loudness and the first-order differentiation by a procedure of per-channel energy normalization (PCEN). PCEN was recently introduced as a trainable acoustic frontend for far-field automatic speech recognition [wang2017icassp]. In full generality, PCEN results from an equation of the form


where the gain control matrix proceeds from by first-order IIR filtering:


Note that the definition in Equation 3 differs from the original definition [wang2017icassp] by a factor of . This is in order to allow the limit case to remain nonzero. Investigating the role of all parameters in PCEN is beyond the scope of this paper; we refer to the asymptotic analysis of [lostanlen2018spl] in this regard. Rather, we focus on the smoothing parameter as striking a tradeoff between numerical stability () and rapid adaptation to nonstationarity in background noise (). The following proposition, proven in Section 6, asserts that PCEN is essentially a generalization of spectral flux.

Proposition 2.1.

At the limit in Equations 3 and 4, and for any finite value of , tends towards


which is a smooth approximation of the summand in Equation 1.
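The sense in which a smooth function can approximate the rectified summand may be made explicit with the softplus, which the closing remark of Section 6 invokes. The notation below is an assumption introduced for this illustration only, since the symbols of the original equations are not legible in this copy: E[t, f] stands for a spectrogram magnitude and u for the difference of logarithms between consecutive frames.

```latex
% Assumed notation: u = \log E[t,f] - \log E[t-1,f]
\begin{align*}
\mathrm{ReLU}(u) &= \max(u, 0), \\
\mathrm{softplus}(u) &= \log\left(1 + e^{u}\right)
  = \log\left(1 + \frac{E[t,f]}{E[t-1,f]}\right), \\
0 \leq \mathrm{softplus}(u) - \mathrm{ReLU}(u)
  &= \log\left(1 + e^{-|u|}\right) \leq \log 2 .
\end{align*}
```

Under this reading, the smooth variant departs from the rectified one by at most log 2, and the gap shrinks as the contrast |u| grows.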

For the sake of simplicity, we adopt the PCEN parametrization that is prescribed by Proposition 2.1: we set , , , and . Derecursifying the autoregressive dependency in Equation 4 and summarizing across frequencies yields the max-pooled PCEN detection function
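A minimal sketch of a max-pooled PCEN detection function is given below. It follows the form of PCEN commonly attributed to [wang2017icassp], without the rescaling factor mentioned above, and it does not reproduce the parametrization of Proposition 2.1. All parameter names and default values (`s`, `alpha`, `delta`, `r`, `eps`) are assumptions chosen for illustration.

```python
def pcen(spectrogram, s=0.1, alpha=1.0, delta=1.0, r=0.5, eps=1e-6):
    """Per-channel energy normalization (illustrative sketch).

    `spectrogram` is a list of frames, each a list of non-negative
    magnitudes, one per band. Defaults are hypothetical, not the paper's.
    """
    smoother = list(spectrogram[0])  # initialize the smoother with frame 0
    output = []
    for frame in spectrogram:
        # First-order IIR low-pass filter along time, independently per band.
        smoother = [(1.0 - s) * m + s * e for m, e in zip(smoother, frame)]
        # Adaptive gain control followed by root compression.
        output.append([
            (e / (eps + m) ** alpha + delta) ** r - delta ** r
            for e, m in zip(frame, smoother)
        ])
    return output

def max_pooled_pcen(spectrogram, **params):
    """Detection function: maximum PCEN magnitude across bands, per frame."""
    return [max(frame) for frame in pcen(spectrogram, **params)]

# Toy example: stationary background with a transient in band 1 at frame 20.
toy = [[1.0, 1.0] for _ in range(20)] + [[1.0, 8.0]] + [[1.0, 1.0] for _ in range(5)]
detection = max_pooled_pcen(toy)
peak_frame = detection.index(max(detection))
```

On this toy input the detection function is flat on the stationary background and peaks at the frame of the transient. The smoothing coefficient `s` controls how many past frames contribute to the background estimate.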


3 Application to avian bioacoustics

3.1 CONI-Knight dataset of Common Nighthawk calls

We consider the problem of detecting isolated calls from breeding birds in a moderately cluttered habitat. To this end, we use the CONI-Knight dataset [knight2018bioacoustics], which contains vocalizations from five different adult male Common Nighthawks (Chordeiles minor), as recorded by autonomous recording units in a regenerating pine forest north of Fort McMurray, AB, CA. The acoustic sensor network forms a linear transect, in which the distance between each microphone and the vocalizing individual varies from to . The dataset contains positive audio clips in total, each lasting . These clips were annotated by an expert, as part of a larger collection of continuous recordings which lasts seven hours in total. We represent each of these clips by their mel-frequency magnitude spectrograms, consisting of bands between and , and computed with a Hann window of duration ( samples) and hop ( samples). These parameters are identical to those of the state-of-the-art deep learning model for bird species recognition from flight calls [salamon2017fusing]. We use the librosa implementation of PCEN [mcfee2019librosa] with , i.e. an averaging time scale of about .

Figure 1 displays the mel-frequency spectrogram of one call at various distances, after processing them with either pointwise logarithm (left) or PCEN (right). Atmospheric absorption is particularly noticeable above , especially in the highest frequencies. Furthermore, we observe that max-pooled spectral flux is numerically unstable, because it triggers at different time–frequency bins from one sensor to the next. In comparison, PCEN is more consistent in reaching maximal magnitude at the onset of the call, and at the same frequency band.

Figure 2: Detection of Common Nighthawk calls: evolution of mean time between false alarms at half recall (MTBFA@50) as a function of distance between sensor and source. Shaded areas denote interquartile variations across individual birds. See Section 3 for details.

3.2 Evaluation: mean time between false alarms at half recall

Our evaluation procedure consists of two stages: distance-specific threshold calibration and estimation of false alarm rate. In the first stage, we split the dataset of positive clips (i.e. containing one vocalization) into disjoint subsets of increasing average distance; for each subset, we sort the values of the detection function in decreasing order and set the detection threshold at the median value, thus yielding a detection recall of . In the second stage, we run the detector on an external dataset of negative recordings, i.e. containing no vocalizations from the species of interest; apply the detection thresholds that were prescribed by the first stage; and count the number of false alarms, i.e. values of the detection function that are above threshold. Dividing the total duration of the dataset of negative recordings by this number of false alarms yields the mean time between false alarms at half recall (MTBFA@50) of the detector, which grows in inverse proportion to false alarm rate. We repeat this operation over all available subsets to obtain a curve that decreases with distance, and which reflects the ability of the detection function to generalize from near-field to far-field events.
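The two stages can be sketched as follows. This is a simplified reading rather than the authors' evaluation code: it assumes one peak detection value per positive clip, and the function names and numerical values are hypothetical.

```python
import statistics

def threshold_at_half_recall(positive_peaks):
    """Stage 1: median of the peak detection values of positive clips.

    Half of the positive clips reach or exceed this threshold.
    Assumes one peak value per positive clip.
    """
    return statistics.median(positive_peaks)

def mean_time_between_false_alarms(negative_values, threshold, total_duration):
    """Stage 2: total duration of negative recordings per false alarm."""
    false_alarms = sum(1 for value in negative_values if value > threshold)
    if false_alarms == 0:
        return float("inf")  # no false alarm observed on this dataset
    return total_duration / false_alarms

# Hypothetical detection values for one distance subset.
positive_peaks = [0.9, 1.4, 2.0, 2.6, 3.1]
threshold = threshold_at_half_recall(positive_peaks)

# Hypothetical frame-wise detection values on 60 seconds of negative audio.
negative_values = [0.2, 2.5, 0.4, 2.1, 0.3, 0.1]
mtbfa = mean_time_between_false_alarms(negative_values, threshold, total_duration=60.0)
```

Repeating both stages for each distance subset gives one MTBFA@50 value per subset, which is what the curves in Figures 2 and 3 report.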

Figure 3: Detection of North Atlantic Right Whale calls: evolution of mean time between false alarms at half recall (MTBFA@50) as a function of distance between sensor and source. Shaded areas denote interquartile variations across days. See Section 4 for details.

3.3 Results and discussion

In the case of the Common Nighthawk, we choose the BirdVox-DCASE-20k dataset [lostanlen2018bvdcase20k] as a source of negative recordings. A derivative of BirdVox-full-night [lostanlen2018icassp], this dataset has been divided into k ten-second soundscapes from six autonomous recording units in Ithaca, NY, USA, and annotated by an expert for presence of bird calls. Among these k soundscapes, are guaranteed to contain no bird call, and a fortiori no Common Nighthawk call. These recordings amount to 27 hours of audio, i.e. over 30M spectrogram frames. For each detection function, we subtract the minimum value over each -second scene from the frame-wise value, in order to account for the nonstationarity in background noise at the scale of multiple hours.
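The per-scene minimum subtraction described above can be sketched in a few lines. The function name, the scene length in frames, and the toy values are hypothetical.

```python
def subtract_scene_minimum(detection_values, frames_per_scene):
    """Subtract, within each scene, the scene minimum from every frame value."""
    corrected = []
    for start in range(0, len(detection_values), frames_per_scene):
        scene = detection_values[start:start + frames_per_scene]
        floor = min(scene)  # background level of this scene
        corrected.extend(value - floor for value in scene)
    return corrected

# Two hypothetical scenes of three frames each, with different noise floors.
values = [1.0, 1.5, 1.2, 3.0, 3.1, 4.0]
corrected = subtract_scene_minimum(values, frames_per_scene=3)
```

After this step, every scene has a minimum of zero, so a single threshold can be applied across scenes whose background level differs.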

Figure 2 summarizes our results. We find that max-pooled PCEN enjoys a five-fold reduction in false alarm rate with respect to averaged spectral flux. In addition, the false alarm rate at of max-pooled PCEN is comparable with the false alarm rate of averaged spectral flux at . As a post hoc qualitative analysis, we compute novelty curves for recordings of outdoor noise from the ESC-50 dataset [piczak2015mlsp]: geophony (rain, wind), biophony (crickets), and anthropophony (helicopter, chainsaw). For max-pooled spectral flux, we find that the main causes of false alarms are pouring water (% of total amount), crackling fire (%), and water drops (%).

4 Application to marine bioacoustics

4.1 CCB18 dataset of North Atlantic Right Whale calls

We consider the problem of detecting isolated calls from whales in a noisy environment. To this end, we use the CCB18 dataset, which contains vocalizations from about 80 North Atlantic Right Whales (Eubalaena glacialis), as recorded by nine underwater sensors during five days in Cape Cod Bay, MA, USA. The distance between sensor and source is estimated by acoustic beamforming, similarly to [clark2019jcrm]. The dataset contains 40k clips in total, each lasting two seconds. These clips were annotated by an expert, as part of a larger collection of continuous recordings which lasts 1k hours in total. We represent each of these clips by their short-term Fourier transform (STFT) magnitude spectrograms, consisting of bands between and , and computed with a Hann window of duration and hop of . We set , i.e. an averaging time scale of about . We choose the ShipsEar dataset as a source of negative recordings [santos2016appliedacoustics]. This dataset contains 90 ship underwater noise recordings from vessels of various sizes, most of them acquired at a distance of or less. These 90 recordings amount to 189 minutes of audio, i.e. 177k spectrogram frames.

4.2 Results and discussion

Figure 3 summarizes our results. First, we find that averaged spectral flux leads to poor false alarm rates, even in the near field. We postulate that this is because, in the CCB18 dataset, ship passage events occasionally introduce high received levels of noise. In other words, distance sets an upper bound, but no lower bound, on signal-to-noise ratio. Therefore, achieving recall with averaged spectral flux requires a low detection threshold, which in turn triggers numerous false alarms.

Secondly, we find that, across the board, replacing averaged spectral flux by max-pooled spectral flux allows a two-fold reduction in false alarm rate. We postulate that this improvement is due to the fact that whale calls are locally sinusoidal whereas near-field ship noise is broadband. Indeed, the max-pooled spectral flux of a chirp is above its averaged spectral flux, with a ratio of the order of ; whereas the averaged and max-pooled spectral fluxes of a Dirac impulse are the same. Therefore, maximum pooling is particularly well suited to the extraction of chirps in noise [bock2013dafx].
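The contrast between narrowband and broadband onsets can be checked on an idealized toy example. The number of bands, the magnitudes, and the helper names below are hypothetical; a real chirp would span more than one band.

```python
import math

def rectified_log_differences(previous, current):
    """Rectified difference of pointwise logarithms, per band."""
    return [max(math.log(c) - math.log(p), 0.0) for p, c in zip(previous, current)]

def averaged(previous, current):
    differences = rectified_log_differences(previous, current)
    return sum(differences) / len(differences)

def max_pooled(previous, current):
    return max(rectified_log_differences(previous, current))

n_bands = 8
background = [1.0] * n_bands

# Narrowband onset (locally sinusoidal call): energy rises in one band only.
narrowband = [1.0] * n_bands
narrowband[3] = 10.0

# Broadband onset (impulse-like noise): energy rises equally in every band.
broadband = [10.0] * n_bands

narrowband_ratio = max_pooled(background, narrowband) / averaged(background, narrowband)
broadband_ratio = max_pooled(background, broadband) / averaged(background, broadband)
```

In this toy case the ratio of max-pooled to averaged flux equals the number of bands for the narrowband onset and one for the broadband onset, so max-pooling raises narrowband events relative to broadband ones.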

Thirdly, we find that, in the near field, replacing spectral flux by PCEN leads to a -fold reduction in false alarm rate. We postulate that this is because ship noise has rapid amplitude modulations, at typical periods of 50 to (i.e. engine speeds of 120 to 1200 rotations per minute). If this period approaches twice the hop duration (i.e. in our case), short-term magnitudes and may correspond precisely to intake and expansion in the two-stroke cycle of the ship, thus eliciting large values of spectral flux. Nevertheless, in the case of PCEN, the periodic activation of one every other frame causes to be of the order of , assuming that the parameter is large enough to encompass multiple periods. Therefore, peaks at in the absence of any transient signal. This peak value is relatively low in comparison with the max-pooled PCEN of a near- or mid-field whale call.
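The effect of amplitude modulation at twice the hop duration can be simulated with a first-order smoother. This sketch assumes a recursion of the form M[t] = (1 - s) M[t-1] + s E[t]; the envelope values and the coefficient `s` are hypothetical and do not correspond to the paper's settings.

```python
def smooth(envelope, s):
    """First-order IIR smoothing of a single-band envelope (assumed form)."""
    state = envelope[0]
    smoothed = []
    for value in envelope:
        state = (1.0 - s) * state + s * value
        smoothed.append(state)
    return smoothed

# One band alternating between a loud and a quiet frame, as amplitude-modulated
# noise would when its period is twice the hop duration.
loud, quiet = 1.0, 0.01
envelope = [loud if t % 2 == 0 else quiet for t in range(400)]
smoothed = smooth(envelope, s=0.02)

# Ratio between magnitude and smoothed magnitude on a late loud frame.
steady_ratio = envelope[398] / smoothed[398]

# Frame-to-frame magnitude ratio seen by spectral flux on the same transition.
flux_ratio = envelope[398] / envelope[397]
```

In this toy setting, the smoothed magnitude settles near half the loud level, so the gain-controlled ratio stays close to 2 while the frame-to-frame ratio is 100. This is consistent with the argument that a slow smoother bounds the response to periodic noise.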

Fourthly, we find that the false alarm rate of max-pooled PCEN increases exponentially with distance, until reaching values comparable to those of max-pooled spectral flux at a distance of . This degradation is due, in part, to geometric spreading, but also to more complex acoustic phenomena, such as reflections and scattering at the surface as well as the ocean floor [hodges2010book]. At these large distances, a successful detector should not only denoise, but also dereverberate whale calls. Max-pooled PCEN does not have any mechanism for dereverberation, and thus falls short of that objective. Thus, an ad hoc detection function is no longer sufficient, and resorting to advanced machine learning techniques appears necessary. We must note, however, that deep convolutional networks in the time–frequency domain rely on the same functional blocks as max-pooled PCEN (i.e. rectified extraction of local contrast and max-pooling), albeit in a more sophisticated, data-driven fashion. Consequently, we believe that PCEN, whether parametrized by feature learning or by domain-specific knowledge, has a promising future in deep learning for environmental bioacoustics.

5 Conclusion

An adequate representation of loudness in the time–frequency domain is paramount to efficient sound event detection. This is particularly true in bioacoustic monitoring applications, where the source of interest may vocalize at a large distance from the microphone. Our experiments on the Common Nighthawk and the North Atlantic Right Whale demonstrate that, given a simple maximum pooling procedure across frequencies, per-channel energy normalization (PCEN) outperforms conventional (logarithm-based) spectral flux. Beyond the direct comparison between ad hoc detection functions at various distances, this study illustrates the appeal of replacing pointwise logarithm by PCEN in time–frequency representations of mid- and far-field audio signals. In the future, PCEN could be used, for example, as a similarity measure for spectrotemporal template matching; as an input to deep convolutional networks in the time–frequency domain [lostanlen2019plosone]; or as a frequency-dependent acoustic complexity index for visualizing nonstationary effects in “false color spectrograms” [towsey2014procedia] of open soundscapes.

6 Appendix: proof of Proposition 2.1


Applying Taylor’s theorem to the exponential function yields


with an error term proportional to , which vanishes at the limit as long as remains nonzero. On the left-hand side, we recognize with and . On the right-hand side, the finite factor tends towards for . The limit allows us to replace by . We conclude with


Interestingly, the distinction between Equation 1 and Equation 5 mirrors the distinction between the rectified linear unit (ReLU) and the softplus in deep learning. ∎
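The first-order Taylor argument can be checked numerically. The sketch below assumes that the rescaled compression takes the form ((x + δ)^r − δ^r) / r, which is our reading of the rescaling discussed in Section 2.3 rather than a formula recovered from this copy. The symbols `x`, `r`, and `delta` are therefore assumptions.

```python
import math

def rescaled_power(x, r, delta=1.0):
    """Root compression divided by its exponent (assumed form)."""
    return ((x + delta) ** r - delta ** r) / r

x = 3.0
# Candidate limit as r tends to zero: log(1 + x / delta), here with delta = 1.
limit = math.log(1.0 + x)

# Absolute error for decreasing values of the exponent r.
errors = [abs(rescaled_power(x, r) - limit) for r in (1e-1, 1e-2, 1e-3)]
```

In this example the error shrinks roughly in proportion to the exponent, as expected from a first-order expansion of the exponential.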

7 Acknowledgment

We wish to thank D. Santos-Domínguez for sharing his dataset.