Presence-absence estimation in audio recordings of tropical frog communities

by   Andrés Estrella Terneux, et al.

One non-invasive way to study frog communities is to analyze long-term samples of acoustic material containing calls. This immense task has been eased by the development of Machine Learning tools that extract ecological information. We explored a likelihood-ratio audio detector based on Gaussian mixture model (GMM) classification of 10 frog species, and applied it to estimate presence-absence in audio recordings from an actual amphibian monitoring performed at Yasuní National Park in the Ecuadorian Amazonia. A modified filter-bank was used to extract 20 cepstral features that model the spectral content of frog calls. Experiments were carried out to investigate the hyperparameters and the minimum frog-call time needed to train an accurate GMM classifier. With 64 Gaussians and 12 seconds of training time, the classifier achieved an average weighted error rate of 0.9 for nine species classification, as compared to 3 features. For testing, 10 GMMs were trained using all the available training-validation dataset to study 23.5 hours of unidentified real-world audio, in 141 10-minute-long samples, recorded at two frog communities in 2001 with analog equipment. To evaluate automatic presence-absence estimation, we characterized the audio samples with 10 binary variables, each corresponding to a frog species, and manually labeled a subset of 18 samples using headphones. A recall of 87.5 suggests good generalization ability of the algorithm, and provides evidence of the validity of this approach to study real-world audio recorded in a tropical acoustic environment. Finally, we applied the algorithm to the available corpus, and show its potential to give insights into the temporal reproductive behavior of frogs.



1 Introduction

In long-term ecological studies, it is important to quantify changes in biodiversity and in the ecosystem as a whole. The scientific community demands large-scale temporal and spatial studies to understand natural and anthropogenically induced population dynamics. In addition, recent anuran population declines around the world have motivated studies to gain an understanding of the phenomenon diechman. One common way to obtain information is to assess frog communities, since they are considered accurate indicators of environmental stress due to their aquatic and terrestrial habitat Simon2011 ; Relyea2005. Researchers have been recording anuran audio signals in a labor-intensive task that generates an ever-increasing amount of audio data, using hand-held microphones and networks of automated programmable recording equipment that stay in the field for months at a time Farina2017. Manual analysis of hundreds of hours of audio is impractical and involves a long and tedious process acevedo_using_2006, which has been aided in recent years by modern software that relies on Machine Learning (ML) algorithms diechman. Many efforts have been made lately to provide a one-size-fits-all solution; however, evidence suggests that site- and taxa-specific algorithms are required to obtain the high levels of accuracy and reliability that automatic animal recognition systems need in order to extract ecological information XIE2016627 ; Towsey2012 ; Potamitis2014c ; Ulloa2016 ; Bardeli2010. Many frog call recognition approaches have been proposed in the literature, yet their suitability remains unclear for the analysis of long audio samples recorded in the wild, in places with high biological diversity, using legacy recording equipment without a systematic approach. Years of frog-call data, collected as audio samples of variable quality, remain unidentified and archived in audio repositories. Hence the need for ML-aided, site-specific estimation of frog species presence-absence based on frog call detection in long audio recordings.

Male frogs use acoustic signaling for advertisement purposes: to attract potential mates, defend their territory and show distress duellman_biology_1994. Anuran vocalizations are commonly composed of a call that is formed by one or many sequenced notes, also known as syllables. A syllable is an acoustic signal produced by air blown through the vocal cords and resonated by a vocal sac (duellman_biology_1994, Chapter 4). A single call is chosen as the basic element for frog species detection since it differs from one species to another.

Several studies in the literature focus on automatic recognition of frog species. For instance, Brandes brandes_feature_2008 introduced feature vectors extracted from spectrograms, and modeled calls of frogs recorded in the Amazon basin with hidden Markov models (HMM). Huang et al. huang_frog_2009 developed a frog sound identification system that extracts features representing previously segmented frog call syllables, and reported recognition results using support vector machine (SVM) classification. Two of the species in their dataset were clearly misclassified, requiring further analysis. Lee et al. lee_automatic_2006 proposed a method using averaged Mel-frequency cepstral coefficients (MFCC) and linear discriminant analysis (LDA) to automatically identify types of frogs. Chen et al. chen_automatic_2012 suggested a method based on pre-classification of syllable lengths, and a multi-stage averaged spectrum (MSAS) with template matching. This approach reported the best recognition rate on a dataset of 18 frog calls when compared to other methods based on dynamic time warping (DTW), k-nearest-neighbor (kNN) and SVM. Bedoya et al. bedoya_automatic_2014 suggested an unsupervised methodology for automatic identification based on a fuzzy classifier and MFCCs. The method was tested successfully with anuran species found in Colombia. Aboudan et al. aboudan_acoustic_2013 tested the ability of MFCC and linear predictive cepstral coefficients (LPCC) in the frog recognition process using Gaussian mixture models (GMM), but no real-world recordings containing frog calls were studied. Recently, an end-to-end deep neural network approach that uses convolutional neural networks (CNN) to classify spectrograms has been tried; it exhibited 77% classification accuracy, which shows a limitation of that approach when only a small amount of training data is available, as is normally the case with new species in the field. Xie et al. XIE2016627 ; XIE201713 proposed an intelligent system for estimating frog species richness and abundance that presented important results in long recordings made in Australia using a combination of acoustic features and random forests. These studies are a very important contribution to the state of the art. However, none reported the application of their algorithms to the analysis of real-world, noise-contaminated audio recordings from an environment such as the Amazonian rainforest of Yasuní National Park (YNP) in eastern Ecuador.

MFCCs have been applied in an out-of-the-box fashion, which exhibited limitations in their ability to model animal sound as reported in Towsey2012 ; cheng_call-independent_2010 ; fox_call-independent_2008 . This behavior is expected since the Mel-frequency filter bank used to generate MFCCs was designed based on the auditory properties of human hearing slaney_auditory_1998 and aims to model human voice.

In this paper, we propose a modification to the Mel-scale filter-bank based on the spectral content of frog calls to obtain a modified cepstral feature set (m-FCC), and compare it experimentally to the performance of the standard MFCC and PLP feature sets used in speaker recognition. We performed experiments to find the minimum time of frog calls required to train accurate GMMs and investigated the hyperparameters of the models that minimize the error rate in the training-development set. In addition, a one-vs-all Receiver Operating Characteristics (ROC) analysis per class was performed to identify a threshold vector that allows a likelihood-ratio detector to reject sound segments that do not belong to the model set. The threshold is applied to control the sensitivity and specificity of the detection desired per class. For testing, we trained 10 GMMs of frog species using the labeled training-development dataset (an on-line demonstration is available), and applied those models to estimate frog species presence-absence in 141 10-minute-long audio samples (23.5 hours) from a different distribution, with reduced quality, that was not used for training and validation of the algorithm. Performance evaluation in the practical presence-absence task validates the proposed approach in real-world conditions and proves the utility of the algorithm when unidentified acoustic data require analysis.

This paper is organized as follows. Section 2 describes the study site, the recording protocol used to register the frog calls in the wild, and the acoustic characteristics of the species available. Section 3 details the procedure followed for selection and annotation of the ground-truth dataset used in the experiments; it also describes the front-end segmentation and presents the modified cepstral feature filter-bank. Section 4 is divided into two parts. The first explains the experimental design and results of the parameter investigation. The second describes the testing phase on real audio samples recorded by researchers in the wild. Finally, a discussion is presented in section 5, and conclusions are summarized in section 6.

2 Materials

2.1 Study Site

Figure 1: Location of the study area. a. Ecuador in South America. b. YNP in Ecuador. c. PUCE’s Yasuní Research Station in YNP.

Frog calls were recorded within Yasuní National Park, which is located in the central eastern sector of the Ecuadorian Amazon region, in the provinces of Orellana and Pastaza (Figure 1). The park is primarily rainforest that lies within the Napo moist forests ecoregion, and is considered one of the most biodiverse places on earth bass_global_2010. It was designated a UNESCO Biosphere Reserve in 1989. Its climate is characterized by warm temperatures in all months and high rainfall throughout the year. Relative humidity in YNP is between 80% and 94%. The average elevation of the park is low, and the territory is frequently crossed by hills. The soil is mostly geologically young, a product of fluvial sediments from the erosion of the Andes mountains ministerio_del_ambiente_plan_2011.

2.2 Acoustic Environment

The acoustic environment of the Amazon basin is known to present a challenge for signal processing algorithms in automatic analysis of recordings sueur2012global. This region has a tropical rainforest soundscape with high sound diversity riede_monitoring_1993. In this paper, we focus on frog calls, and any other sound source is considered noise. Previous studies sueur2012global have identified three main types of noise when recording soundscapes; namely, biotic noise, anthropogenic noise and environmental noise. A combination of these types of noise is present in the dataset used for this study. Some recordings contain anthropogenic noise, such as human voice and the 60 Hertz "humming" of a nearby electric generator, while others contain biotic noise from insects, mammals and nearby species. In addition, we identified noise of a broad-band transient nature that resulted from friction of the microphone boom with the surrounding vegetation and from water drops falling on the microphone while recording. Figure 2 shows the typical pond soundscape found within YNP, in which a chorus of Rhinella margaritifera amidst anthropogenic and biotic noise can be observed.

Figure 2: Spectrogram of a typical YNP soundscape showing a) frog call in a Rhinella margaritifera chorus amidst b) anthropogenic and c) biotic noises.

2.3 Acoustic Recording Protocol

The audio database containing frog calls used in this study was provided by Museo de Zoología (QCAZ) of the Pontificia Universidad Católica del Ecuador (PUCE) ron_amphibiawebecuador._2016. The material was unlabeled, and a few files contained only voice annotations made in the field. For training and validation experiments, we used recordings made with a Sennheiser K6-ME67™ unidirectional microphone attached to Olympus LS-10™ or Marantz PMD660™ digital recorders at 16-bit resolution. Sound was archived in lossless WAV files in order to preserve the integrity of the audio. Recordings were made from 7:00 p.m. to 2:00 a.m. at natural ponds and trails located within the YNP, by different researchers during distinct sessions ranging from 2003 to 2013. Since locating the exact position of frogs calling in the wild at night is a difficult task, frog calls were registered by aiming the microphone at the zone where the frogs were heard calling. Distance to the frog is therefore not available, and varies within the dataset from a few meters to tens of meters for loud species. Considering that the distance to the frog is uncertain, we focused on the SNR when evaluating the detectability of a frog call in the sound file estrella_selection_2017.

The audio used for testing was recorded at two ponds located close to PUCE's Yasuní research station by placing an omni-directional microphone, attached to a cassette recorder, 1.5 meters above the surface, on a daily schedule during February (8 days), April (17 days), July (12 days), August (16 days), and September (13 days) of 2001. Pond 1 was recorded from 8:50 p.m. to 9:00 p.m., and Pond 2 from 1:50 a.m. to 2:00 a.m. The recordings were performed prior to a Visual Encounter Survey Padilla:Thesis:2005. The analog audio was transferred to digital audio in WAV format at a 48 kHz sampling rate using a USB digital audio converter in 2012.

2.4 Study Species

For our experiments we selected the frog species listed in Table 1, which were chosen based upon availability at the time of labeling. Although more than 130 frog species have been identified so far in the study zone, typically only a few species are active at the same time and place. This is an important constraint for an automated analysis method, since only a small subset of acoustic models is required for classification given a geographic location and timespan. To account for calls or sounds that are not modeled by the system, the option of unknown sound is included, and its output can be studied by a specialist if necessary. Acoustic power in the frog calls of Table 1 is mostly distributed in the range 430 to 7500 Hz, depending on the species. Spectrograms of calls for each species, computed with a 1024-sample Blackman-Harris window and 50% overlap, are shown in Figure 3.

| Species | # of Calls | Seconds | Freq. range (Hz) |
|---|---|---|---|
| Boana alfaroi | 98 | 44.7 | 1660 - 3100 |
| Dendropsophus bifurcus | 103 | 53.1 | 2300 - 3390 |
| Boana cinerascens | 169 | 105.5 | 1300 - 1530 |
| Pristimantis conspicillatus | 186 | 78 | 1630 - 3900 |
| Leptodactylus discodactylus | 330 | 146.3 | 1680 - 3260 |
| Osteocephalus fuscifacies | 50 | 23.52 | 1000 - 2500 |
| Boana lanciformis | 97 | 57.7 | 500 - 2720 |
| Rhinella margaritifera | 108 | 124.2 | 800 - 1700 |
| Dendropsophus parviceps | 38 | 40.1 | 5660 - 7500 |
| Engystomops petersi | 316 | 137.2 | 430 - 3140 |
| Total | 1495 | 810.32 | |

Table 1: Study Species

Figure 3: Spectrograms of 3 calls per species in the training-development set. a. Boana alfaroi b. Dendropsophus bifurcus c. Boana cinerascens d. Pristimantis conspicillatus e. Leptodactylus discodactylus f. Boana lanciformis g. Rhinella margaritifera h. Dendropsophus parviceps i. Engystomops petersi

3 Methods

3.1 Frog-call Dataset

We generated a ground-truth corpus of frog calls for training and testing ML algorithms estrella_selection_2016 ; dataset. From the unlabeled audio provided by the QCAZ museum of Zoology, we manually selected audio files containing frog calls. Nine species had enough acoustic material, from 40 to 146 seconds of calls, to allow the creation of the training-validation subsets used in the first set of experiments. Since classes were unbalanced, the split was made by selecting 6, 12 and 18 seconds of calls for training, with the rest used for validation. Osteocephalus fuscifacies was also included to generate a model for long recording analysis during testing. Most frog calls were chosen with SNR higher than 3 dB, but we also included calls with background noise and some interference to study the performance of the algorithm in the noisy conditions that occur in the study zone. Field recordings containing human voice, mechanical artifacts or inter-specific overlapping calls were used neither to create the training-validation dataset nor to train the final GMMs. Table 1 presents the number of calls and seconds of audio available for each species in the labeled dataset.

Labeling of frog calls was aided by a short time energy (STE) based automatic segmentation algorithm described in section 3.2. Automatic segmentation of frog calls into syllables has been previously attempted by Jaafar et al. jaafar_automatic_2013 with interesting results. The front-end STE segmentation algorithm outputs the start and end points of a segment containing frog calls within the selected portion of audio. Each segment was manually labeled according to the species it belonged to by placing cue points signaling the start and end of the section containing calls. Automatic segmentation was preferred since an early attempt to perform manual endpoint selection resulted in a lack of consistency among different annotators.

For testing the algorithm in real-world conditions, a subset of 18 of the 141 unidentified audio samples was manually labeled by AET using headphones and spectrogram visualization. A vector of 10 binary variables, one per species and following the species order of Table 1, was assigned to each sample, in which one is presence and zero absence. For instance, [0 1 0 0 0 0 0 0 0 0] represents presence of D. bifurcus and absence of all the others.
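The binary labeling can be illustrated with a short sketch. It assumes that the vector positions follow the species order of Table 1, which is consistent with the D. bifurcus example; the function names `encode_presence` and `decode_presence` are hypothetical and not part of the original work.

```python
# Species order assumed to follow Table 1 (an assumption of this sketch).
SPECIES = [
    "Boana alfaroi", "Dendropsophus bifurcus", "Boana cinerascens",
    "Pristimantis conspicillatus", "Leptodactylus discodactylus",
    "Osteocephalus fuscifacies", "Boana lanciformis",
    "Rhinella margaritifera", "Dendropsophus parviceps",
    "Engystomops petersi",
]


def encode_presence(present):
    """Return a 10-element 0/1 list: 1 = species present, 0 = absent."""
    unknown = set(present) - set(SPECIES)
    if unknown:
        raise ValueError(f"species not modeled: {sorted(unknown)}")
    return [1 if name in present else 0 for name in SPECIES]


def decode_presence(vector):
    """Return the names of the species flagged as present."""
    return [name for name, flag in zip(SPECIES, vector) if flag]


label = encode_presence({"Dendropsophus bifurcus"})
print(label)  # presence of D. bifurcus only
```

Keeping the species order in a single list means the manual labels and the detector output can be compared position by position.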

3.2 Frog Call Segmentation

Since a frog-call was chosen as the basic element of species identification, a segmentation technique that detects calls while avoiding portions of silence and noise was required for front-end processing. We adapted a classic voice analysis silence-removal method rabiner_algorithm_1975 based on bandpass-filtering, STE estimation and thresholding. Figure  4 shows the algorithm pipeline.

Figure 4: Frog call segmentation diagram.

First, the whole audio sample was divided into consecutive 30-second frames. A band-pass finite impulse response (FIR) filter was applied to the original audio signal. The cut-off frequencies are user defined and were chosen based on the frequency range spanning most of the call energy of the target species. Table 1 shows the frequency ranges of the filters used to generate the training set. The boundaries were calculated as the points where the spectral power is -20 dB relative to the point of maximum power of the frog call. It should be noted that the filter is applied only prior to audio segmentation; the original unfiltered audio is used for feature extraction and classification. Second, an STE sequence is generated from consecutive non-overlapping frames of the filtered signal according to equation 1:

$$E_m = \sum_{n=(m-1)N+1}^{mN} x^2[n] \qquad (1)$$

where $E_m$ is the energy of frame $m$, $x[n]$ is the filtered discrete-time signal and $N$ is the number of samples of each frame.

A moving-average (MA) filter was applied to the STE sequence to smooth transients and to delimit the STE of the whole frog call (or of consecutive calls) instead of each separate note. The MA filter length of 12 was chosen empirically, since it is related to the minimum frog-call duration that will be segmented.
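The energy and smoothing steps can be sketched as follows. This is a minimal illustration that assumes the input has already been band-pass filtered, that the frame energy is a plain sum of squares, and that the MA filter is a trailing window of 12 values; the exact frame length and filter alignment used in the original work are not stated here, and the function names are hypothetical.

```python
def short_time_energy(signal, frame_len):
    """Energy (sum of squares) of consecutive non-overlapping frames."""
    n_frames = len(signal) // frame_len
    return [
        sum(x * x for x in signal[m * frame_len:(m + 1) * frame_len])
        for m in range(n_frames)
    ]


def moving_average(values, width=12):
    """Trailing moving average; the window is shorter for the first values."""
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - width + 1):i + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Toy signal: six samples, frames of two samples each.
ste = short_time_energy([1, 1, 2, 2, 3, 3], frame_len=2)
print(ste, moving_average(ste, width=2))
```

A longer window merges the notes of one call into a single energy bump, which is why its length sets the shortest call that can be segmented.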

3.3 Endpoint Detection

The smoothed STE sequence was then transformed to a decibel scale, and the following routine was applied to estimate the start and end points of the frog calls in the frame.

  1. Define a threshold value that depends on a constant C, which is determined empirically.

  2. If a run of consecutive values of the smoothed STE is over the threshold, set a start-point. Subsequently, if a run of consecutive values is below the threshold, set the end-point.

The threshold allows for fine tuning the sensitivity of the endpoint detection. Its value is related to the SNR of the segmented audio that undergoes classification. Figure 5 shows calls of Dendropsophus bifurcus detected by applying the segmentation algorithm to a field recording.
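The endpoint routine can be sketched as a small state machine. The threshold formula and the required number of consecutive values are not recoverable from the text, so this example takes the threshold as an input and uses a hypothetical run length `k`; the decibel conversion is likewise an assumption of the sketch.

```python
import math


def to_db(energies, floor=1e-12):
    """Convert energies to decibels, guarding against log of zero."""
    return [10.0 * math.log10(max(e, floor)) for e in energies]


def detect_endpoints(levels_db, threshold, k=3):
    """Return (start, end) index pairs, end exclusive.

    A start is set after k consecutive values above the threshold and an
    end after k consecutive values at or below it (k is hypothetical).
    """
    segments = []
    start = None
    run = 0
    for i, level in enumerate(levels_db):
        if start is None:
            run = run + 1 if level > threshold else 0
            if run == k:
                start = i - k + 1
                run = 0
        else:
            run = run + 1 if level <= threshold else 0
            if run == k:
                segments.append((start, i - k + 1))
                start = None
                run = 0
    if start is not None:  # call still active at the end of the frame
        segments.append((start, len(levels_db)))
    return segments


levels = [0, 0, 10, 10, 10, 10, 0, 0, 0, 0]
print(detect_endpoints(levels, threshold=5, k=2))
```

Requiring a run of values rather than a single crossing keeps short transients, such as a water drop on the microphone, from opening or closing a segment.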

Figure 5: Segmentation of a Dendropsophus bifurcus field recording. (a) 20 seconds of segmented input audio signal with start and end points. (b) STE sequence of 10 frames of input signal. (c) Smoothed STE in dB, with threshold in dashed line.

3.4 Acoustic features extraction

Mel-frequency cepstral coefficients (MFCC) mermelstein1976distance and perceptual linear predictive analysis (PLP) Hermansky1992 have been the dominant feature sets used in automatic speech recognition systems hermansky2013, as well as in automatic recognition of animal sounds, with interesting results in birds fox_call-independent_2008 ; cheng_call-independent_2010, odontocetes roch_gaussian_2007 and anurans bedoya_automatic_2014. Those feature sets are optimized for human voice processing and have been applied mostly without modification to the problem of animal sound recognition, obtaining important results. However, close observation of the spectral energy of frog calls reveals a distribution different from that of human voice. Therefore, it is not optimal to apply standard MFCC or PLP features without some modification to capture the spectral characteristics that differentiate frog calls. We propose using hand-crafted cepstral coefficients with a modified filter-bank distribution, following the layout shown in Figure 6, for frog-call recognition, and compare its performance with standard MFCC and PLP-RASTA features. The procedure to extract the cepstral feature set is summarized as follows:

Fragments of sound containing frog calls resulting from the segmentation step described in Section 3.2 were divided into 20 ms frames with 75% overlap. Each frame was then pre-emphasized and Hamming-windowed to minimize discontinuities at the edges. The discrete Fourier transform (DFT) was taken and the triangular-shaped 40-element filterbank of Figure 6(b) was applied.

Figure 6: (a) layout of the mel-scale filter-bank (b)modified filter-bank proposed for frog-call identification

The log of the energy of each filter was obtained, and the discrete cosine transform (DCT) of the resulting vector of log-energies was calculated to decorrelate the energies. Finally, the first 20 elements of the resulting vector were kept for each frame, and the vectors of all frames were concatenated into the matrix used as the feature set for the classification step. A detailed description of Mel-cepstrum computation can be found in cheng_call-independent_2010 ; bedoya_automatic_2014 ; mermelstein1976distance.
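The cepstral pipeline described in this section can be sketched in dependency-free Python. Several details are assumptions of the example and not taken from the original work: the pre-emphasis coefficient `alpha`, the evenly spaced filter edges (the actual layout of the modified filter-bank is given only in Figure 6), and the naive DFT, which is slow and meant only to show the order of the steps.

```python
import cmath
import math


def frame_signal(signal, frame_len, hop):
    """Split a signal into overlapping frames (hop = frame_len // 4 for 75%)."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]


def power_spectrum(frame):
    """Naive DFT power for bins 0..N/2 (slow, but needs no libraries)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))) ** 2
            for k in range(n // 2 + 1)]


def triangular_filterbank(edges, n_bins):
    """One triangle per consecutive triple of strictly increasing bin edges."""
    bank = []
    for lo, mid, hi in zip(edges, edges[1:], edges[2:]):
        filt = [0.0] * n_bins
        for b in range(lo, hi + 1):
            if b <= mid:
                filt[b] = (b - lo) / (mid - lo)
            else:
                filt[b] = (hi - b) / (hi - mid)
        bank.append(filt)
    return bank


def dct2(values, n_keep):
    """First n_keep coefficients of the (unnormalized) DCT-II."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(n_keep)]


def cepstral_features(frame, bank, n_keep=20, alpha=0.97):
    """Pre-emphasis, Hamming window, DFT, filter-bank, log, DCT."""
    # alpha is an assumed pre-emphasis coefficient, not a value from the paper.
    emph = [frame[0]] + [frame[i] - alpha * frame[i - 1]
                         for i in range(1, len(frame))]
    n = len(emph)
    window = [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]
    spec = power_spectrum([x * w for x, w in zip(emph, window)])
    energies = [sum(f * s for f, s in zip(filt, spec)) for filt in bank]
    log_energies = [math.log(max(e, 1e-12)) for e in energies]
    return dct2(log_energies, n_keep)
```

With a 40-filter bank (42 edges) and `n_keep=20`, each frame would yield one 20-dimensional vector, matching the dimensions given in the text; swapping the `edges` list is the only change needed to compare filter-bank layouts.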

3.5 Frog species model representation

Gaussian mixture models have been shown to accurately model speaker identities when short utterances of unconstrained speech are available for classification reynolds_robust_1995. In the case of animal sound recognition, GMMs have been applied to identify vocalizations of individual birds cheng_call-independent_2010 and marine mammals roch_gaussian_2007. Since our aim is to identify the species of frogs calling in audio samples, we modeled each species by a multivariate GMM. Formally,

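A Gaussian mixture density is defined according to:

$$p(\mathbf{x} \mid \lambda) = \sum_{i=1}^{M} w_i\, b_i(\mathbf{x})$$

where $\mathbf{x}$ is a $D$-dimensional feature vector, $b_i(\mathbf{x})$, $i = 1, \dots, M$, are the component densities, and $w_i$ are the mixture weights that satisfy the constraint $\sum_{i=1}^{M} w_i = 1$ reynolds_robust_1995. Each component density is a Gaussian function of $D$ variables:

$$b_i(\mathbf{x}) = \frac{1}{(2\pi)^{D/2}\, |\Sigma_i|^{1/2}} \exp\left\{ -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^{\top} \Sigma_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) \right\}$$

with mean vector $\boldsymbol{\mu}_i$ and covariance matrix $\Sigma_i$. Each GMM is denoted by its mean vectors, covariance matrices and mixture weights according to:

$$\lambda = \{ w_i, \boldsymbol{\mu}_i, \Sigma_i \}, \quad i = 1, \dots, M.$$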
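As an illustration of how a mixture density of this kind is evaluated, the sketch below computes the log-density of one feature vector under a GMM with diagonal covariance matrices. Working in the log domain with the log-sum-exp trick is a numerical choice of this example, and the function names are hypothetical.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log of a D-variate Gaussian density with diagonal covariance."""
    d = len(x)
    log_det = sum(math.log(v) for v in var)
    maha = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return -0.5 * (d * math.log(2 * math.pi) + log_det + maha)


def gmm_log_density(x, weights, means, variances):
    """Log of the weighted sum of component densities, via log-sum-exp."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


# Two-component toy model in two dimensions.
value = gmm_log_density(
    [0.0, 0.0],
    weights=[0.5, 0.5],
    means=[[0.0, 0.0], [3.0, 3.0]],
    variances=[[1.0, 1.0], [1.0, 1.0]],
)
print(value)
```

Subtracting the largest term before exponentiating avoids underflow when a frame lies far from every component, which is common with 20-dimensional features.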


We need a model for each species available in the labeled dataset of Table 1. Thus, we generated a set of training vectors for every utterance in the dataset by extracting cepstral features at fixed time steps. The resultant matrix was then used to train models $\lambda_1, \lambda_2, \dots, \lambda_S$, one for each species, by maximum likelihood estimation (MLE) using the expectation-maximization (EM) algorithm described in dempster1977maximum ; reynolds_robust_1995 ; cheng_call-independent_2010. The model was initialized by setting the mean values using the k-means++ algorithm arthur2007k; the initial covariance matrix was set as diagonal, with its $j$-th diagonal element equal to the variance of the $j$-th feature, and the initial mixing proportions were set as uniform.

3.6 Frog species identification

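The output of front-end segmentation yields a sequence of feature vectors $X = \{\mathbf{x}_1, \dots, \mathbf{x}_T\}$ that contain characteristics of the sound source that generated it. We need to find the species model with the maximum a posteriori probability for $X$. Formally,

$$\hat{s} = \arg\max_{1 \le s \le S} \Pr(\lambda_s \mid X) = \arg\max_{1 \le s \le S} \frac{p(X \mid \lambda_s)\, \Pr(\lambda_s)}{p(X)}$$

where $\hat{s}$ is the hypothesized frog species and $S$ is the number of models. Even though prior probabilities could be defined based on the geographic location of the study, we assume identical prior probabilities of frog species, $\Pr(\lambda_s) = 1/S$, and remove $p(X)$ since it is the same for all models. The decision rule becomes:

$$\hat{s} = \arg\max_{1 \le s \le S} p(X \mid \lambda_s)$$

Since front-end processing produces variable size audio segments, we need to normalize for size $T$. Applying logarithms and assuming independence between observations, the species identification stage calculates:

$$\hat{s} = \arg\max_{1 \le s \le S} \frac{1}{T} \sum_{t=1}^{T} \log p(\mathbf{x}_t \mid \lambda_s)$$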
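The identification rule, a length-normalized log-likelihood maximized over species models, can be sketched as below. The example assumes each species model is available as a function that returns the log-density of one feature vector; `identify_species` and the toy one-dimensional models are hypothetical.

```python
def average_log_likelihood(frames, log_density):
    """Mean per-frame log-likelihood of a segment under one species model."""
    return sum(log_density(x) for x in frames) / len(frames)


def identify_species(frames, models):
    """Return (best_name, scores); models maps a name to a log-density function."""
    scores = {name: average_log_likelihood(frames, log_density)
              for name, log_density in models.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy one-dimensional models: unnormalized Gaussian log-densities.
toy_models = {
    "species_a": lambda x: -0.5 * (x - 0.0) ** 2,
    "species_b": lambda x: -0.5 * (x - 5.0) ** 2,
}
print(identify_species([4.8, 5.1, 5.3], toy_models)[0])
```

Dividing by the number of frames makes scores from short and long segments comparable, which matters because the segmentation stage produces segments of variable size.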


3.7 Species detection

With the hypothesized frog species $\hat{s}$, we need to determine if the sound segment $X$ was actually produced by $\hat{s}$. To accomplish this goal, we applied the log-likelihood ratio statistic defined by:

$$\Lambda(X) = \log p(X \mid \lambda_{\hat{s}}) - \log p(X \mid \lambda_{\bar{s}})$$

where $\Lambda(X)$ is a score that informs how likely it is for a given call segment to belong or not to the hypothesized species model $\lambda_{\hat{s}}$ that represents $\hat{s}$, and $p(X \mid \lambda_{\bar{s}})$ is the probability density function of the set of alternative species in the model set. This task is known as verification in the speaker detection literature Reynolds2000. Although more than one species usually call at the same time, for the application of presence-absence estimation we focused on single-species detection per segment. Close observation of the calling patterns of frogs during reproduction reveals that frogs call repeatedly in choruses, sharing the time and frequency resources available. For a 10-minute sound sample, we assumed that at least one segment of sound is composed of a single species. To represent the pdf of the alternative species model we applied the median() function to the set of non-hypothesized models in the set $\{\lambda_1, \dots, \lambda_S\}$:

$$p(X \mid \lambda_{\bar{s}}) = \operatorname{median}\{\, p(X \mid \lambda_s) : s \ne \hat{s} \,\}$$

To find a threshold of the score that permits detection of frog calls with high likelihood while rejecting non relevant sound segments (overlapped calls, noise, non modeled species, etc), we defined threshold values per class applying one-vs-all receiver operating characteristic (ROC) analysis.
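A sketch of the detection step follows. It assumes that a log-likelihood score per species model has already been computed for the segment, and it takes the median directly over log-likelihoods; because the logarithm is monotonic, this equals the log of the median likelihood when the number of alternative models is odd, as with nine alternatives out of ten models. Function names are hypothetical.

```python
from statistics import median


def log_likelihood_ratio(scores, hypothesis):
    """Hypothesis log-likelihood minus the median of the alternatives."""
    alternatives = [v for name, v in scores.items() if name != hypothesis]
    return scores[hypothesis] - median(alternatives)


def detect(scores, thresholds):
    """Return (species or None, ratio) for one segment.

    scores and thresholds map species names to log-likelihoods and to
    per-class threshold values, respectively.
    """
    hypothesis = max(scores, key=scores.get)
    ratio = log_likelihood_ratio(scores, hypothesis)
    accepted = hypothesis if ratio > thresholds[hypothesis] else None
    return accepted, ratio


segment_scores = {"a": -10.0, "b": -20.0, "c": -30.0, "d": -25.0}
print(detect(segment_scores, {"a": 5.0, "b": 5.0, "c": 5.0, "d": 5.0}))
```

A segment that fits all models about equally well, such as broadband noise, gets a ratio near zero and is rejected, which is the behavior the detector needs for sounds outside the model set.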

3.8 Threshold vector selection

Figure 7: Per class Receiver operating characteristics curves and the chosen operating points.

ROC curves hanley1982meaning ; Fawcett2005 for the multi-class detection problem are shown in Figure 7. A one-vs-all approach was used to calculate the true positive rate (TPR) versus the false positive rate (FPR) for each class in relation to a varying log-likelihood ratio score. To consider an acoustic event as a class detection, its score value must be above a defined threshold per class. Higher threshold values enable a selective detection while lower values increase the sensitivity. A trade-off between false detections and false rejections exists and the operating point depends on the application. In the case of frog call presence-absence estimation in long recordings it is desirable to minimize false alarms while maintaining a TPR as high as possible. Since the audio recorded in YNP was registered with a unique microphone in a fixed site, it is important to detect all the frog calls possible in the range of the microphone. Therefore, the operating point requires high TPR while maintaining FPR reasonably low. Some intuition of the behavior and frequency of calling per species is desirable when setting the threshold. We began with a fixed threshold vector of operating points corresponding to FPR of 5% for all the species and fine tuned the thresholds comparing the analysis output to the manually annotated corpus. The final threshold vector used in the analysis of section 4.3 was [5 3 6 9 6.75 5.25 5.5 11 6 6] and the corresponding operating points are shown in Figure 7.
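The starting point of the threshold search, an operating point at a false positive rate of 5% per class, can be sketched as follows. The example assumes that one-vs-all scores are available for target (positive) and non-target (negative) segments of a class, and that a detection requires a score strictly above the threshold; the function names are hypothetical.

```python
def threshold_at_fpr(negative_scores, target_fpr=0.05):
    """Smallest observed score that keeps the false positive rate <= target."""
    ranked = sorted(negative_scores, reverse=True)
    allowed = int(target_fpr * len(ranked))  # negatives allowed above threshold
    return ranked[min(allowed, len(ranked) - 1)]


def rates(positive_scores, negative_scores, threshold):
    """True and false positive rates for detections strictly above threshold."""
    tpr = sum(s > threshold for s in positive_scores) / len(positive_scores)
    fpr = sum(s > threshold for s in negative_scores) / len(negative_scores)
    return tpr, fpr


negatives = list(range(1, 21))   # toy non-target scores
positives = [18, 25, 30, 40]     # toy target scores
thr = threshold_at_fpr(negatives, 0.05)
print(thr, rates(positives, negatives, thr))
```

Repeating this per class yields an initial threshold vector, which can then be adjusted by hand against annotated samples as described above.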

4 Experimental Results

We divided this section in two parts. First, a set of experiments was designed to identify the hyperparameter values that perform best in recognizing the species of frogs from a frog call in the development dataset as well as the minimum time required to train accurate GMMs. Second, the parameters obtained in the first part were used to train production GMMs using all the material available in the dataset. Automatic analysis of real field recordings coming from a different distribution was performed, and its results compared to human level performance in the presence-absence task.

4.1 Parameter investigation

The number of components in a mixture needed to model frog species adequately, and the minimum training time required, were determined by the following experiment. Nine frog models with 2, 4, 8, 16, 32 and 64 component Gaussian densities and diagonal covariance matrices were trained using 6, 12 and 18 seconds of frog calls, corresponding to 600, 1200 and 1800 12-dimensional mel-cepstral feature vectors. The dataset was divided into a training set of 6, 12 or 18 seconds, and the remaining calls were used for testing. We applied 10-fold cross-validation with random segment selection over the whole dataset for each species to model the distribution of the weighted error rate (WER). The WER was calculated as the average of the individual per-species Bayesian error rates of the nine species, to account for the unbalanced classes. Figure 8 shows the distribution of WER for different training times and numbers of mixtures M for the 10-fold cross-validation procedure.
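The class-balanced error measure can be illustrated with a small sketch. It assumes that the WER is the plain mean of the per-species error rates, as the description suggests; the function name and the toy counts are hypothetical.

```python
def weighted_error_rate(errors, totals):
    """Mean of per-species error rates, so every species counts equally."""
    per_species = [e / n for e, n in zip(errors, totals)]
    return sum(per_species) / len(per_species)


# Toy case: a rare species with 1 error in 10 calls and a common
# species with 10 errors in 1000 calls.
wer = weighted_error_rate([1, 10], [10, 1000])
pooled = (1 + 10) / (10 + 1000)
print(wer, pooled)
```

In the toy case the pooled error rate is dominated by the common species, while the averaged rate exposes the weaker performance on the rare one.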

Figure 8: Weighted error rate for different training times.

The following observations can be made from the results. For the 3 training times tested, there was an increase in identification performance from 2 to 16 mixture components leveling off for 32 and 64 components. However, when training with 6 seconds the identification performance degrades with increasing model order suggesting that at least 12 seconds of training set and more than 32 mixture components are required to model the frog species adequately. This result is important since it provides a guideline for the lower limit of training set required when training frog species GMMs.

Frog Model Order
Species M = 2 M = 4 M = 8 M = 16 M = 32 M = 64
H. alfaroi  0
D. bifurcus  0.8
H. cinerascens  0.2
P. conspicillatus  0.3
L. discodactylus  0.2
B. lanciformis  1.6
R. margaritifer  3.2
D. parviceps  3
E. petersi  0
Table 2: Frog call recognition error rate per species with respect to the number of component densities for 20 m-FCC.

Finally, Table 2 presents the error rate per species according to the number of Gaussian mixture components used to model each species with 12 seconds of training time and 20 m-FCC. The variable error rates across distinct M suggest that the optimal model order is not the highest for each species in the dataset. A different model order could be chosen for each species to avoid over-fitting, as suggested by Cheng in cheng_call-independent_2010 . However, we selected model order M = 64, which gives the minimum error rate on average, since it is not clear how to choose the optimal model order for species that are not included in the training-validation set but could be included in the future.
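The two selection strategies discussed above (one model order per species versus a single shared order with the lowest mean error) can be contrasted with a short sketch. The species names and error rates below are hypothetical placeholders and are not values from Table 2.

```python
def select_model_order(error_by_species):
    """Return the best M per species and the single M with the lowest mean error."""
    orders = sorted(next(iter(error_by_species.values())))
    best_per_species = {
        species: min(errors, key=errors.get)
        for species, errors in error_by_species.items()
    }
    mean_error = {
        m: sum(errors[m] for errors in error_by_species.values()) / len(error_by_species)
        for m in orders
    }
    shared_order = min(mean_error, key=mean_error.get)
    return best_per_species, shared_order, mean_error

# Hypothetical error rates for two placeholder species, keyed by model order M.
toy_errors = {
    "species_a": {16: 1.0, 32: 0.5, 64: 0.6},
    "species_b": {16: 4.0, 32: 3.0, 64: 2.0},
}
best_per_species, shared_order, mean_error = select_model_order(toy_errors)
```

With these made-up numbers the per-species optimum differs between the two species, while the shared order is the one that minimizes the mean error, which mirrors the choice of a single M described in the text.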

4.2 Features comparison

In order to investigate the WER with respect to the feature set used to model the frog calls, GMMs were trained using 12 MFCC, 20 MFCC, 20 PLP-RASTA and 20 modified-filterbank cepstral features (m-FCC). The results are shown in Figure 9, which presents the WER with respect to the number of mixture components for each feature set. Standard MFCC used in automatic speaker recognition exhibited the lowest performance, with a slight improvement when the number of features increased from 12 to 20. Additionally, 19th PLP-RASTA outperformed standard MFCC by approximately 1% when more than 16 mixture components were used, suggesting that the Bark filter-bank used to calculate the PLP feature set captures the spectral characteristics of the frog calls better than MFCC. Finally, the classification performance of the 20 cepstral features calculated using the modified filter-bank described in Section 3.4 surpassed the others. The results suggest that modifying the filter-bank to model the spectral shapes of frog calls rather than those of the human voice is appropriate.

Figure 9: Weighted error rate for GMMs trained with different cepstral features.

4.3 Field recordings analysis

Ten GMMs were trained using all the labeled dataset and applied to analyze 23.5 hours of audio in 47 WAV files, with a focus on presence-absence estimation. Each file contained three 10-minute samples delimited using cue points to inform the algorithm of their positions. The audio contains unidentified calls amidst different types of noise and distortion resulting from volume variation during recording, clipping, a malfunctioning cable, microphone friction, rain and digitization noise. Scanning each file took approximately 50 seconds on a laptop with a 2.6 GHz processor and 16 GB of RAM. Figure 13 shows a short snippet of the segmentation stage, and classified segments with their respective likelihood-ratio scores are presented in Figure 10.

Figure 10: Snippet of frog call detections. The numbers in the first line are the likelihood-ratio scores for each segment. Below them are the species codes: D. bifurcus (2), B. cinerascens (3), B. lanciformis (7). Codes (12) and (13) mark classifications that did not pass the set threshold.
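How a likelihood-ratio score of the kind shown in Figure 10 can be computed for one segment is sketched below. The sketch assumes a diagonal-covariance GMM for the claimed species and a second GMM acting as a background model, and it averages the log-likelihood ratio over the frames of the segment; the background model, the averaging and all parameter values are assumptions for illustration, not the exact procedure used in this study.

```python
import math

def gmm_log_likelihood(frame, weights, means, variances):
    """Log-density of one feature vector under a diagonal-covariance GMM."""
    component_logs = []
    for w, mu, var in zip(weights, means, variances):
        log_gauss = sum(
            -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
            for x, m, v in zip(frame, mu, var)
        )
        component_logs.append(math.log(w) + log_gauss)
    # Log-sum-exp over the components for numerical stability.
    peak = max(component_logs)
    return peak + math.log(sum(math.exp(c - peak) for c in component_logs))

def likelihood_ratio_score(frames, species_gmm, background_gmm):
    """Average per-frame log-likelihood ratio of a segment (assumed scoring rule)."""
    total = 0.0
    for frame in frames:
        total += gmm_log_likelihood(frame, *species_gmm)
        total -= gmm_log_likelihood(frame, *background_gmm)
    return total / len(frames)

# Toy one-dimensional models as (weights, means, variances); values are made up.
species_gmm = ([1.0], [[0.0]], [[1.0]])
background_gmm = ([1.0], [[3.0]], [[1.0]])
segment = [[0.0], [0.5]]
score = likelihood_ratio_score(segment, species_gmm, background_gmm)
accepted = score > 1.0  # hypothetical verification threshold
```

A segment whose score falls below the threshold would be rejected in the verification stage, as with the classifications marked (12) and (13) in the caption.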

Presence-absence estimation results for the whole corpus are summarized in Figure 12 for both ponds studied. One binary vector per sample was obtained and plotted per day and month according to the sampling schedule available in the 5-month period. To evaluate performance, the 18 audio samples of February were manually annotated by AET using headphones and spectrogram visualization. The resulting 10-variable binary vectors were compared to the output of the algorithm variable by variable, and binary classification performance metrics were applied. The resulting scores are presented in Table 3.

Metric Score
Recall 0.875
Precision 1
F1 0.933
MCC 0.914
Specificity 1
Accuracy 0.966
Table 3: Performance measures for the presence-absence task
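The metrics in Table 3 follow from the standard confusion-matrix definitions, sketched below. The confusion counts used here are not reported in the text; they are an assumption, namely the counts that reproduce Table 3 if all 18 annotated samples × 10 species = 180 binary decisions were scored (42 true positives, 6 false negatives, 0 false positives, 132 true negatives).

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
        "mcc": mcc,
    }

# Assumed counts (not stated in the paper): 180 decisions in total.
metrics = binary_metrics(tp=42, fp=0, fn=6, tn=132)
```

Under this assumption the computed values agree with Table 3 up to rounding in the last digit of the accuracy and MCC entries.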

Calls of B. alfaroi, P. conspicillatus, R. margaritifer and D. parviceps were not present in the recordings and were correctly estimated as absent by the machine, with the exception of one sample in July in which a false detection of P. conspicillatus occurred. In contrast, D. bifurcus was detected in all the annotated samples by both human and machine. The algorithm was able to estimate absence correctly for the species that did not call during the sample, while the presence of O. fuscifacies and B. lanciformis presented a challenge. Those species called only once or twice per sample, making them difficult to detect during manual screening as well as during automatic analysis. Nevertheless, automatic analysis of the February samples detected the presence of O. fuscifacies in a sample where manual labeling had initially missed it. Close inspection of the sample at the position where the algorithm detected the call enabled the researcher to identify the species and label that sample correctly, even though only one call existed.

The call detector performed well in terms of accuracy and precision while maintaining high specificity. In other words, the detector produced no false positives (false species presence) in the labeled set and did not confuse the detected species with one another. On the other hand, a recall of 0.875 means that some species present in the recordings were not detected. These misses were mostly due to species calling only once during the sample time, a limitation that could be addressed by increasing the sampling time or by using a species-specific approach that sets a lower threshold for the desired species. These results provide evidence that the proposed learner is able to generalize to real-world audio that was not used for training or validation and was recorded with different equipment.
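The species-specific threshold suggested above can be sketched as follows. The sketch assumes that each sample yields a list of (species, likelihood-ratio score) detections and that a species is marked present as soon as one detection reaches its threshold; the function name, scores and threshold values are hypothetical.

```python
def presence_vector(detections, species, default_threshold, species_thresholds=None):
    """Binary presence vector for one sample.

    detections: list of (species_name, likelihood_ratio_score) pairs.
    A species is present if at least one of its scores reaches its threshold.
    """
    thresholds = species_thresholds or {}
    present = {name: 0 for name in species}
    for name, score in detections:
        if name in present and score >= thresholds.get(name, default_threshold):
            present[name] = 1
    return [present[name] for name in species]

species = ["D. bifurcus", "B. lanciformis", "O. fuscifacies"]
# Made-up scores: a frequent caller and a species that called only once, faintly.
detections = [("D. bifurcus", 2.4), ("D. bifurcus", 1.9), ("B. lanciformis", 0.7)]

global_only = presence_vector(detections, species, default_threshold=1.0)
lowered = presence_vector(
    detections, species, default_threshold=1.0,
    species_thresholds={"B. lanciformis": 0.5},
)
```

Lowering the threshold for a single species raises its recall at the cost of a higher false-alarm risk for that species only, leaving the operating point of the other species unchanged.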

Overall, more detections occurred in Pond 1 than in Pond 2, suggesting higher acoustic activity in Pond 1 for the duration of the study. In Pond 1, the highest number of detections belonged to B. cinerascens, accounting for 58% of detections, followed by D. bifurcus with 23% and L. discodactylus with 17%. The remaining 2% belonged to O. fuscifacies and B. lanciformis. In contrast, Pond 2 showed mostly detections of D. bifurcus and E. petersi.

Figure 11: Number of frog call detections on Pond 1 at YNP-PUCE station during the daily 10-minute recordings performed during a) February

Figure 11 presents the number of detections of the three species that called the most during the February sampling in Pond 1. A researcher could gain insights into the reproductive activity of those species with longer, planned acoustic samples. For instance, the circadian reproductive activity, and probably abundance, might be extracted. However, it is still not clear how to extrapolate from the number of calling males to the actual population, including females and juveniles, using the proposed approach in YNP.

Figure 12: Results of automatic 10 species presence-absence estimation in the available corpus. Species detected manually are shown by a circle.

Finally, the results in Figure 12 hint at the possibility of studying seasonality and species richness through long-term acoustic sampling in tropical frog communities. This is an interesting topic currently explored by ecologists diechman .

5 Discussion

Monitoring animal sound in the tropical rainforest using automatic approaches is known to be a challenging problem because of the high amount of noise present and the variable conditions riede_monitoring_1993 ; XIE201713 . Many algorithms have been proposed that allow researchers to search audio for calls of birds Ulloa2016 , insects Potamitis2014c , odontocetes roch_gaussian_2007 , and frogs bedoya_automatic_2014 ; aboudan_acoustic_2013 ; XIE201713 . However, more research is needed to assess their suitability for studying audio recordings made in frog communities with high biological diversity, such as those of Yasuní National Park in Ecuador. An automatic approach to estimate presence-absence in long-term audio recorded on site with local taxa is necessary to help researchers gain understanding of the dynamics and ecology of local frogs.

In this study, we applied frog call recognition with a verification stage to study real-world audio recordings. Frog call classification has been attempted on selected calls with high SNR in previous studies bedoya_automatic_2014 ; brandes_feature_2008 , which showed that calls with low SNR were misclassified due to interference and noise. We found that it is not necessary to detect all the frog calls in the audio sample to achieve the goal of 10-species presence-absence estimation. Detecting one call is enough to establish species presence, and such a detection is highly probable in ponds where frogs call repeatedly to attract mates. Threshold setting then becomes an important step to tune the detector and set a desired operating point that minimizes false alarms. Nonetheless, a limitation of this approach was observed for frogs that called only once during the sample, which was also the case during human annotation by AET using headphones. For instance, calls of O. fuscifacies and B. lanciformis that occurred once or twice during the sample were not heard and, in consequence, were not labeled the first time. Observing the results of the algorithm for those samples, which showed detections of those species, enabled the researcher to go back to the recording and verify that they were really present. In that context, the algorithm was already helpful in complementing human performance and saving time when labeling acoustic samples, which is an important task in preparing ground truth for developing Machine Learning algorithms.

Previous studies suggested that MFCC coefficients are not suited for animal call recognition Towsey2012 ; roch_gaussian_2007 , but they used MFCCs designed for voice recognition without modification. We found that a filter-bank modification based on the spectral content of frog calls enabled the resulting cepstral features to improve classification performance using GMMs. This is an important result since it suggests that hand-crafted cepstral coefficients perform better when focused on the spectral characteristics of the taxa of interest. Despite efforts to develop a one-size-fits-all system, algorithms that perform well in some situations generally tend to do poorly on other datasets, as stated by the no free lunch theorem. We applied our approach to two distinct sets of audio samples recorded at different ponds within YNP with consistent results, suggesting a good generalization capability. In addition, the audio used for training the GMMs was captured with digital equipment, whereas the audio used for testing was recorded using an analog cassette recorder with different microphones. This result is important since it suggests the possibility of studying audio coming from different sources and equipment, which is normally the case in the field, where multiple people record calls at different times.

Front-end processing is very important to obtain the results shown in this study. In XIE2016627 , Xie et al. proposed a frog acoustic activity detector in order to focus their classifier only on frog calls. We aimed to keep front-end segmentation as simple as possible and to rely on verification after classification to remove non-frog-call audio. Verification was able to reject noise coming from a malfunctioning cable, human voice, unknown species, overlapping calls, cellphone noise, etc. The STE segmentation approach applied proved good for situations where frogs call intermittently with at least 10 ms of inactivity between calls. A limitation was identified for species like D. bifurcus, which call in choruses without pause. Since this is a variable-size segment approach focused on a band of interest, it could be improved if multi-frequency segmentation is applied as suggested in inproceedings , with classification and verification applied to each resulting sequence and the detected species added together. Finally, recent advances in end-to-end convolutional neural networks (CNNs) and recurrent neural networks (RNNs), which study all the audio without prior segmentation, could provide a way to remove front-end processing altogether. Nonetheless, the processing power needed for that approach might be prohibitive on desktop computers and require cloud processing, which is expensive. Research in Deep Neural Network applications is advancing fast, and we expect that important results will be obtained with that approach in the future. Therefore, we open the dataset to the research community dataset and provide a baseline for comparison.
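The short-time energy (STE) segmentation discussed above can be illustrated with a generic sketch. It assumes a frame-wise energy computed over a band-limited signal, a fixed energy threshold, and a minimum number of low-energy frames required to separate two calls; the frame length, hop, threshold and gap values are hypothetical and do not reproduce the exact front end of this study.

```python
def short_time_energy(signal, frame_len, hop):
    """Sum of squared samples in each analysis frame."""
    return [
        sum(x * x for x in signal[start:start + frame_len])
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]

def segment_by_energy(energy, threshold, min_gap_frames):
    """Return (start, end) frame indices (end exclusive) of active segments.

    Active runs separated by fewer than min_gap_frames low-energy frames
    are merged into a single variable-size segment.
    """
    segments = []
    start = None
    for i, e in enumerate(energy):
        if e > threshold and start is None:
            start = i
        elif e <= threshold and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(energy)))

    merged = []
    for seg in segments:
        if merged and seg[0] - merged[-1][1] < min_gap_frames:
            merged[-1] = (merged[-1][0], seg[1])
        else:
            merged.append(seg)
    return merged

# Toy frame energies: two bursts one frame apart, then an isolated burst.
toy_energy = [0, 5, 6, 0, 7, 0, 0, 0, 4, 0]
segments = segment_by_energy(toy_energy, threshold=1, min_gap_frames=2)
```

In a continuous chorus the energy never falls below the threshold, so the whole chorus collapses into one long segment, which is consistent with the limitation noted for D. bifurcus.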

6 Conclusion

The proposed approach proved to be a helpful tool in estimating presence-absence of frog species in pond recordings made in the wilderness of YNP in Ecuador. Several hundred hours of unidentified acoustic material still exist in the QCAZ-PUCE archive, and the application of automatic analysis could save researchers’ time by generating metadata that can be verified in a fraction of the time it takes to listen to each recording one by one. Fast audio appraisal and inventory generation without the need for specialists can be performed with an acceptable level of performance by using metadata from presence-absence estimation. In systematic ecoacoustic recordings used in wildlife monitoring, summing the results of multiple learners trained with specific taxa cohorts might provide a way to estimate biodiversity and study the composition of the soundscape. However, it is difficult to assess those applications in depth at this stage without more acoustic data available.

Machine Learning support for the automatic evaluation of frog communities in wildlife recordings introduces a potent technology that is complementary to the survey techniques currently used by researchers in the wild. Our team is exploring the estimation of diversity indexes by applying the proposed approach to 24-hour-long recordings made in the Mindo region, where critically endangered frog species call in a different setting with more silence between calls. This makes it simpler for front-end segmentation to extract calls from the background, thus simplifying the digital signal processing pipeline.

Finally, automatic analysis of audio recordings of frog communities might be useful for researchers studying environmental changes, since frog presence is related to the health of the ecosystem, and their disappearance provides clues about contamination or climate change effects that could be helpful in developing sustainable solutions. Important applications in wildlife surveillance are envisioned that could be enhanced by wireless acoustic sensor networks in the wilderness.

Figure 13: 30 seconds sample of Pond 1 recording. a. Segmented spectrogram, b. STE, c. Threshold


The study was supported by Pontificia Universidad Católica del Ecuador research projects 2015, Project L13304: “Diseño de un algoritmo de procesamiento de audio para un sistema que permita la automatización del inventario, caracterización y el monitoreo de poblaciones de las ranas del Ecuador, caso de estudio Parque Nacional Yasuní”. We would like to thank Santiago Ron for providing access to the database of frog calls recorded in Yasuní National Park and for his valuable advice; the staff at Estación Científica Yasuní for their cooperation during the execution of field recordings; Jean Camino, Franco Cisneros and Eduardo Silva for their important contribution to the manual labeling of the frog calls used in training and development; Daniela Pareja and Daniel Rivadeneira for their help during field work recording frog calls on YNP trails; Samael Padilla for kindly providing access to the data gathered during his thesis work; and Paloma Lima for her help in the early stages of the manuscript.